GSE122425 Processing Pipeline
RNA-Seq
code_examples
5 steps
Publication
Multiplexed transcriptome discovery of RNA-binding protein binding sites by antibody-barcode eCLIP.Nature methods (2023) — PMID 36550273
Dataset
GSE122425Effects of NSUN2 deficiency on the mRNA 5-methylcytosine modification and gene expression profile in HEK293 cells (RNA-Seq)
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Bcl2fastq (v2.17.1.14) used for basecalling.
$ Bash example
# Install bcl2fastq (example using conda) # conda install -c bioconda bcl2fastq2 # Define input and output directories BCL_INPUT_DIR="/path/to/your/bcl_run_folder" # Replace with the actual path to your Illumina run folder OUTPUT_DIR="/path/to/your/output_fastq_folder" # Replace with your desired output directory # Execute bcl2fastq for basecalling and demultiplexing # A sample sheet (SampleSheet.csv) is often required in the run folder or specified with --sample-sheet bcl2fastq --runfolder-dir "$BCL_INPUT_DIR" \ --output-dir "$OUTPUT_DIR" -
2
3â adaptor-trimming and low quality reads removing by Cutadaptï¼version 1.9.1ï¼
$ Bash example
# Install Cutadapt (version 1.9.1) # conda install -c bioconda cutadapt=1.9.1 # Define input and output file paths INPUT_FASTQ="input_reads.fastq.gz" OUTPUT_FASTQ="trimmed_reads.fastq.gz" # Define the 3' adaptor sequence (replace with the actual adaptor used in the experiment) # Common Illumina 3' adaptor for eCLIP might be similar to: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG ADAPTOR_SEQUENCE="ADAPTOR_SEQUENCE_PLACEHOLDER" # Define quality threshold and minimum read length QUALITY_THRESHOLD=20 # Phred quality score threshold MIN_READ_LENGTH=15 # Minimum length for reads after trimming # Run Cutadapt for 3' adaptor trimming and low quality read removal cutadapt -a "${ADAPTOR_SEQUENCE}" \ -q ${QUALITY_THRESHOLD} \ -m ${MIN_READ_LENGTH} \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
3
The high quality trimmed reads (clean reads) were aligned to the human reference genome (UCSC HG19) using Hisat2(v2.0.1).
$ Bash example
# Install HISAT2 (if not already installed) # conda install -c bioconda hisat2=2.0.1 # Define input and output files READ1="clean_reads_R1.fastq.gz" # Placeholder for high quality trimmed read file 1 READ2="clean_reads_R2.fastq.gz" # Placeholder for high quality trimmed read file 2 OUTPUT_SAM="aligned_reads.sam" # Output SAM file # Define the path to the HISAT2 index for UCSC HG19 # This index needs to be pre-built or downloaded. # To build the index, first download the hg19 reference genome FASTA file (e.g., from UCSC): # wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz # gunzip hg19.fa.gz # Then build the HISAT2 index: # hisat2-build hg19.fa path/to/hisat2_indices/hg19 HISAT2_INDEX="path/to/hisat2_indices/hg19" # Placeholder for the base name of the hg19 HISAT2 index # Run HISAT2 alignment hisat2 -x "${HISAT2_INDEX}" -1 "${READ1}" -2 "${READ2}" -S "${OUTPUT_SAM}" -
4
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using Htseq (V0.6.1).
$ Bash example
# Install HTSeq (if not already installed) # conda install -c bioconda htseq # Define input and reference files INPUT_BAM="input_aligned_reads.bam" # Replace with your actual aligned BAM file GTF_FILE="Homo_sapiens.GRCh38.111.gtf" # Placeholder: Replace with your specific GTF annotation file (e.g., from Ensembl or GENCODE) OUTPUT_COUNTS_FILE="htseq_exon_counts.txt" # Run htseq-count to get raw exon counts # -f bam: Input file format is BAM # -r pos: Assume input BAM is sorted by position (common for aligners like STAR, HISAT2) # -s no: Strandedness (can be 'yes', 'reverse', 'no'). 'no' is a common default if not specified. # -a 10: Minimum alignment quality score (default is 10) # -t exon: Feature type to count (as specified in the description "Reads Per Kilobase of exon") # -i gene_id: Attribute in the GTF file to use as the feature identifier htseq-count -f bam -r pos -s no -a 10 -t exon -i gene_id "${INPUT_BAM}" "${GTF_FILE}" > "${OUTPUT_COUNTS_FILE}" # RPKM calculation is a post-processing step after obtaining raw counts. # HTSeq itself calculates raw counts, not RPKM directly. # RPKM (Reads Per Kilobase of exon per Megabase of library size) is calculated as: # RPKM = (raw_exon_counts * 10^9) / (exon_length_in_bases * total_mapped_reads_in_library) # # To perform this calculation, you would typically: # 1. Extract exon lengths from the GTF file (e.g., using a custom script or tool like gffread/gffutils). # 2. Get the total number of mapped reads from the BAM file (e.g., using samtools flagstat) or from the htseq-count summary. # 3. Use a custom script (e.g., Python, R) to combine counts, lengths, and total reads to compute RPKM. -
5
Differential expression analysis used the DESeq2 (V1.6.3).
$ Bash example
# Install Bioconductor and DESeq2 if not already installed # Note: Version 1.6.3 is quite old. Current BiocManager might install a newer version. # To install a specific older version, you might need to use an older Bioconductor release. # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # R -e 'BiocManager::install("DESeq2", version = "3.0")' # Example for Bioconductor 3.0, which corresponds to DESeq2 v1.6.3 # Create an R script for DESeq2 analysis cat << 'EOF' > run_deseq2.R # Load DESeq2 library library(DESeq2) # --- Configuration --- # Input count matrix file (e.g., from featureCounts, HTSeq-count, Salmon, Kallisto) # Rows are genes/features, columns are samples. First column should be gene IDs. count_file <- "counts.tsv" # Placeholder: Replace with your actual count matrix file # Sample metadata file # Rows are samples, columns are metadata (e.g., condition, treatment, batch). # Must contain a column whose values match the column names in the count_file # and a column for the primary comparison (e.g., 'condition'). metadata_file <- "sample_metadata.tsv" # Placeholder: Replace with your actual metadata file # Output file for differential expression results output_file <- "deseq2_results.tsv" # Design formula (e.g., ~ condition, ~ batch + condition) # This specifies the experimental design for the differential expression analysis. design_formula_str <- "~ condition" # Placeholder: Adjust based on your experimental design # --- End Configuration --- # Load count data # Assuming the first column is gene IDs and subsequent columns are sample counts. count_data <- read.delim(count_file, row.names = 1, check.names = FALSE) # Ensure counts are integers, as DESeq2 expects integer counts. count_data <- round(count_data) # Load sample metadata # Assuming the first column of metadata_file contains sample IDs that match count_data column names. col_data <- read.delim(metadata_file, row.names = 1) # Ensure sample names match between count data and metadata, and order them consistently. common_samples <- intersect(colnames(count_data), rownames(col_data)) if (length(common_samples) == 0) { stop("No common samples found between count data and metadata. Please check file formats and sample IDs.") } count_data <- count_data[, common_samples] col_data <- col_data[common_samples, ] # Create DESeqDataSet object # This object stores the count data, sample information, and the experimental design. dds <- DESeqDataSetFromMatrix(countData = count_data, colData = col_data, design = as.formula(design_formula_str)) # Run DESeq2 analysis # This performs normalization, dispersion estimation, and differential expression testing. dds <- DESeq(dds) # Get results # By default, results() extracts the comparison for the last variable in the design formula. # For specific comparisons (e.g., 'treated' vs 'untreated' for 'condition'), use: # res <- results(dds, contrast=c("condition","treated","untreated")) res <- results(dds) # Order results by adjusted p-value (padj) for easier interpretation of most significant genes. res <- res[order(res$padj),] # Save results to a tab-separated file. write.table(as.data.frame(res), file = output_file, sep = "\t", quote = FALSE, row.names = TRUE) message(paste("DESeq2 analysis completed. Results saved to", output_file)) EOF # Execute the R script Rscript run_deseq2.R # Note: The count data (counts.tsv) would typically be generated upstream # using tools like featureCounts, HTSeq-count, Salmon, or Kallisto. # These upstream quantification steps usually rely on a reference genome # (e.g., GRCh38/hg38 for human, GRCm39/mm39 for mouse) and gene annotations. # For this specific DESeq2 differential expression step, no direct reference genome is used.
Tools Used
Raw Source Text
Bcl2fastq (v2.17.1.14) used for basecalling. 3â adaptor-trimming and low quality reads removing by Cutadaptï¼version 1.9.1ï¼ The high quality trimmed reads (clean reads) were aligned to the human reference genome (UCSC HG19) using Hisat2(v2.0.1). Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using Htseq (V0.6.1). Differential expression analysis used the DESeq2 (V1.6.3). Genome_build: HG19 Supplementary_files_format_and_content: Excel files include RPKM values for each Sample