GSE122425 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Multiplexed transcriptome discovery of RNA-binding protein binding sites by antibody-barcode eCLIP.

Nature methods (2023) — PMID 36550273

Dataset

Effects of NSUN2 deficiency on the mRNA 5-methylcytosine modification and gene expression profile in HEK293 cells (RNA-Seq)

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Bcl2fastq (v2.17.1.14) used for basecalling.

Bcl2fastq v2.17.1.14 GitHub

$ Bash example

# Install bcl2fastq (example using conda)
# conda install -c bioconda bcl2fastq2

# Define input and output directories
BCL_INPUT_DIR="/path/to/your/bcl_run_folder" # Replace with the actual path to your Illumina run folder
OUTPUT_DIR="/path/to/your/output_fastq_folder" # Replace with your desired output directory

# Execute bcl2fastq for basecalling and demultiplexing
# A sample sheet (SampleSheet.csv) is often required in the run folder or specified with --sample-sheet
bcl2fastq --runfolder-dir "$BCL_INPUT_DIR" \
          --output-dir "$OUTPUT_DIR"

View on GitHub

3â adaptor-trimming and low quality reads removing by Cutadaptï¼version 1.9.1ï¼

Cutadapt v1.9.1 GitHub

$ Bash example

# Install Cutadapt (version 1.9.1)
# conda install -c bioconda cutadapt=1.9.1

# Define input and output file paths
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_FASTQ="trimmed_reads.fastq.gz"

# Define the 3' adaptor sequence (replace with the actual adaptor used in the experiment)
# Common Illumina 3' adaptor for eCLIP might be similar to: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNNNATCTCGTATGCCGTCTTCTGCTTG
ADAPTOR_SEQUENCE="ADAPTOR_SEQUENCE_PLACEHOLDER"

# Define quality threshold and minimum read length
QUALITY_THRESHOLD=20 # Phred quality score threshold
MIN_READ_LENGTH=15   # Minimum length for reads after trimming

# Run Cutadapt for 3' adaptor trimming and low quality read removal
cutadapt -a "${ADAPTOR_SEQUENCE}" \
         -q ${QUALITY_THRESHOLD} \
         -m ${MIN_READ_LENGTH} \
         -o "${OUTPUT_FASTQ}" \
         "${INPUT_FASTQ}"

View on GitHub

The high quality trimmed reads (clean reads) were aligned to the human reference genome (UCSC HG19) using Hisat2(v2.0.1).

HISAT2 v2.0.1 GitHub

$ Bash example

# Install HISAT2 (if not already installed)
# conda install -c bioconda hisat2=2.0.1

# Define input and output files
READ1="clean_reads_R1.fastq.gz" # Placeholder for high quality trimmed read file 1
READ2="clean_reads_R2.fastq.gz" # Placeholder for high quality trimmed read file 2
OUTPUT_SAM="aligned_reads.sam" # Output SAM file

# Define the path to the HISAT2 index for UCSC HG19
# This index needs to be pre-built or downloaded. 
# To build the index, first download the hg19 reference genome FASTA file (e.g., from UCSC):
# wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip hg19.fa.gz
# Then build the HISAT2 index:
# hisat2-build hg19.fa path/to/hisat2_indices/hg19
HISAT2_INDEX="path/to/hisat2_indices/hg19" # Placeholder for the base name of the hg19 HISAT2 index

# Run HISAT2 alignment
hisat2 -x "${HISAT2_INDEX}" -1 "${READ1}" -2 "${READ2}" -S "${OUTPUT_SAM}"

View on GitHub

Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using Htseq (V0.6.1).

HTSeq v0.6.1 GitHub

$ Bash example

# Install HTSeq (if not already installed)
# conda install -c bioconda htseq

# Define input and reference files
INPUT_BAM="input_aligned_reads.bam" # Replace with your actual aligned BAM file
GTF_FILE="Homo_sapiens.GRCh38.111.gtf" # Placeholder: Replace with your specific GTF annotation file (e.g., from Ensembl or GENCODE)
OUTPUT_COUNTS_FILE="htseq_exon_counts.txt"

# Run htseq-count to get raw exon counts
# -f bam: Input file format is BAM
# -r pos: Assume input BAM is sorted by position (common for aligners like STAR, HISAT2)
# -s no: Strandedness (can be 'yes', 'reverse', 'no'). 'no' is a common default if not specified.
# -a 10: Minimum alignment quality score (default is 10)
# -t exon: Feature type to count (as specified in the description "Reads Per Kilobase of exon")
# -i gene_id: Attribute in the GTF file to use as the feature identifier
htseq-count -f bam -r pos -s no -a 10 -t exon -i gene_id "${INPUT_BAM}" "${GTF_FILE}" > "${OUTPUT_COUNTS_FILE}"

# RPKM calculation is a post-processing step after obtaining raw counts.
# HTSeq itself calculates raw counts, not RPKM directly.
# RPKM (Reads Per Kilobase of exon per Megabase of library size) is calculated as:
# RPKM = (raw_exon_counts * 10^9) / (exon_length_in_bases * total_mapped_reads_in_library)
#
# To perform this calculation, you would typically:
# 1. Extract exon lengths from the GTF file (e.g., using a custom script or tool like gffread/gffutils).
# 2. Get the total number of mapped reads from the BAM file (e.g., using samtools flagstat) or from the htseq-count summary.
# 3. Use a custom script (e.g., Python, R) to combine counts, lengths, and total reads to compute RPKM.

View on GitHub

Differential expression analysis used the DESeq2 (V1.6.3).

DESeq2 v1.6.3 GitHub

$ Bash example

# Install Bioconductor and DESeq2 if not already installed
# Note: Version 1.6.3 is quite old. Current BiocManager might install a newer version.
# To install a specific older version, you might need to use an older Bioconductor release.
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2", version = "3.0")' # Example for Bioconductor 3.0, which corresponds to DESeq2 v1.6.3

# Create an R script for DESeq2 analysis
cat << 'EOF' > run_deseq2.R
# Load DESeq2 library
library(DESeq2)

# --- Configuration ---
# Input count matrix file (e.g., from featureCounts, HTSeq-count, Salmon, Kallisto)
# Rows are genes/features, columns are samples. First column should be gene IDs.
count_file <- "counts.tsv" # Placeholder: Replace with your actual count matrix file

# Sample metadata file
# Rows are samples, columns are metadata (e.g., condition, treatment, batch).
# Must contain a column whose values match the column names in the count_file
# and a column for the primary comparison (e.g., 'condition').
metadata_file <- "sample_metadata.tsv" # Placeholder: Replace with your actual metadata file

# Output file for differential expression results
output_file <- "deseq2_results.tsv"

# Design formula (e.g., ~ condition, ~ batch + condition)
# This specifies the experimental design for the differential expression analysis.
design_formula_str <- "~ condition" # Placeholder: Adjust based on your experimental design
# --- End Configuration ---

# Load count data
# Assuming the first column is gene IDs and subsequent columns are sample counts.
count_data <- read.delim(count_file, row.names = 1, check.names = FALSE)
# Ensure counts are integers, as DESeq2 expects integer counts.
count_data <- round(count_data)

# Load sample metadata
# Assuming the first column of metadata_file contains sample IDs that match count_data column names.
col_data <- read.delim(metadata_file, row.names = 1)

# Ensure sample names match between count data and metadata, and order them consistently.
common_samples <- intersect(colnames(count_data), rownames(col_data))
if (length(common_samples) == 0) {
  stop("No common samples found between count data and metadata. Please check file formats and sample IDs.")
}
count_data <- count_data[, common_samples]
col_data <- col_data[common_samples, ]

# Create DESeqDataSet object
# This object stores the count data, sample information, and the experimental design.
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = col_data,
                              design = as.formula(design_formula_str))

# Run DESeq2 analysis
# This performs normalization, dispersion estimation, and differential expression testing.
dds <- DESeq(dds)

# Get results
# By default, results() extracts the comparison for the last variable in the design formula.
# For specific comparisons (e.g., 'treated' vs 'untreated' for 'condition'), use:
# res <- results(dds, contrast=c("condition","treated","untreated"))
res <- results(dds)

# Order results by adjusted p-value (padj) for easier interpretation of most significant genes.
res <- res[order(res$padj),]

# Save results to a tab-separated file.
write.table(as.data.frame(res), file = output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("DESeq2 analysis completed. Results saved to", output_file))
EOF

# Execute the R script
Rscript run_deseq2.R

# Note: The count data (counts.tsv) would typically be generated upstream
# using tools like featureCounts, HTSeq-count, Salmon, or Kallisto.
# These upstream quantification steps usually rely on a reference genome
# (e.g., GRCh38/hg38 for human, GRCm39/mm39 for mouse) and gene annotations.
# For this specific DESeq2 differential expression step, no direct reference genome is used.

View on GitHub

Tools Used

DESeq2

Raw Source Text

Bcl2fastq (v2.17.1.14) used for basecalling.
3â adaptor-trimming and low quality reads removing by Cutadaptï¼version 1.9.1ï¼
The high quality trimmed reads (clean reads) were aligned to the human reference genome (UCSC HG19) using Hisat2(v2.0.1).
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using Htseq (V0.6.1).
Differential expression analysis used the DESeq2 (V1.6.3).
Genome_build: HG19
Supplementary_files_format_and_content: Excel files include RPKM values for each Sample

← Back to Analysis