GSE72502 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression.

Science (New York, N.Y.) (2015) — PMID 26382853

Dataset

Healthy donor PBMC RNA-seq with or without interferon-alpha stimulation

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Illumina software used for basecalling.

bcl2fastq (Inferred with models/gemini-2.5-flash) vv2.20 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bcl2fastq (example using conda)
# conda install -c bioconda bcl2fastq2

# Example usage of bcl2fastq for basecalling and demultiplexing
# Replace /path/to/illumina/run_folder with the actual path to your Illumina run folder containing BCL files
# Replace /path/to/output_fastq with the desired output directory for FASTQ files
bcl2fastq --runfolder-dir /path/to/illumina/run_folder \
              --output-dir /path/to/output_fastq \
              --no-lane-splitting \
              --minimum-trimmed-read-length 35 \
              --mask-short-adapter-reads 35 \
              --barcode-mismatches 1

View on GitHub

Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.

GSNAP v2021-12-25 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install GSNAP (if not already installed)
# conda install -c bioconda gsnap

# Reference genome preparation (if not already done)
# Download hg19 reference genome from UCSC
# wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip hg19.fa.gz
# Create GSNAP index for hg19. -k 15 is a common k-mer size for RNA-seq.
# gsnap_build -d hg19_index -g hg19.fa -k 15

# Define variables
GENOME_INDEX_DIR="/path/to/gsnap_indexes" # Placeholder: Directory containing the hg19_index
GENOME_INDEX_NAME="hg19_index"
INPUT_FASTQ="input_reads_preprocessed.fastq" # Placeholder: Reads after trimming, rRNA filtering, and quality filtering
OUTPUT_SAM="aligned_reads.sam"
THREADS=8 # Number of CPU threads to use

# GSNAP alignment command
# Description: Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination,
# filtered for low-complexity or low-quality sequence (these steps are assumed to be completed
# and the output is INPUT_FASTQ). Then mapped to hg19 whole genome using GSNAP.
# Parameters chosen are common for RNA-seq alignment with GSNAP, given the rRNA filtering step.
# --orientation=fr is typically for paired-end reads, but can be used for single-end if library prep has specific orientation.
# --novel-splice-sites=1 allows for detection of one novel splice site per read.
# --max-mismatches=0.05 allows up to 5% mismatches relative to read length.
# --quality-protocol=sanger specifies the quality score encoding.
gsnap -d "${GENOME_INDEX_NAME}" \
      -D "${GENOME_INDEX_DIR}" \
      -t "${THREADS}" \
      --orientation=fr \
      --novel-splice-sites=1 \
      --max-mismatches=0.05 \
      --quality-protocol=sanger \
      -A sam \
      "${INPUT_FASTQ}" > "${OUTPUT_SAM}"

# Optional: Convert SAM to BAM and sort (common post-alignment steps)
# samtools view -bS "${OUTPUT_SAM}" | samtools sort -o "${OUTPUT_SAM%.sam}.bam"
# samtools index "${OUTPUT_SAM%.sam}.bam"

Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.

Custom RPKM Script (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install subread (which includes featureCounts) and samtools if not already available
# conda install -c bioconda subread samtools

# Define input and reference files
# Replace 'aligned_reads.bam' with your actual aligned BAM file
INPUT_BAM="aligned_reads.bam"
# Placeholder for a human GRCh38 GTF file. Obtain from Ensembl, GENCODE, or UCSC.
# Example: Homo_sapiens.GRCh38.109.gtf from Ensembl
GENOME_GTF="Homo_sapiens.GRCh38.109.gtf" 
OUTPUT_COUNTS="gene_counts.txt"
OUTPUT_RPKM="gene_rpkm.txt"

# 1. Count reads per gene (aggregating exon counts to gene level)
# -p: Assume paired-end reads (remove if single-end)
# -t exon: Count features of type 'exon'
# -g gene_id: Aggregate counts by 'gene_id'
# -a: Annotation file in GTF/GFF format
featureCounts -p -t exon -g gene_id -a "${GENOME_GTF}" -o "${OUTPUT_COUNTS}" "${INPUT_BAM}"

# 2. Calculate RPKM using the formula from Chepelev et al., 2009
# RPKM = (number of reads mapped to a gene * 10^9) / (total number of reads in the library * gene length in kilobases)

# Get total number of mapped reads in the library from the BAM file
# -c: count reads
# -F 260: Exclude unmapped (0x4), secondary alignment (0x100), and supplementary alignment (0x800) reads
TOTAL_READS=$(samtools view -c -F 260 "${INPUT_BAM}")

# Process the featureCounts output to calculate RPKM
# The featureCounts output file has a header, then columns like:
# Geneid    Chr    Start    End    Strand    Length    /path/to/aligned_reads.bam
awk -v total_reads="${TOTAL_READS}" '
BEGIN {
    OFS="\t";
    print "Geneid", "Length_Bases", "Raw_Counts", "RPKM";
}
NR > 2 { # Skip the first two header lines from featureCounts output
    gene_id = $1;
    gene_length_bases = $6; # Gene length in bases (sum of exon lengths)
    raw_counts = $7;        # Raw read counts for the gene

    if (gene_length_bases > 0 && total_reads > 0) {
        # Calculate RPKM
        rpkm = (raw_counts * 1000000000) / (total_reads * (gene_length_bases / 1000));
        print gene_id, gene_length_bases, raw_counts, rpkm;
    } else {
        # Handle cases with zero length or zero total reads to avoid division by zero
        print gene_id, gene_length_bases, raw_counts, 0;
    }
}' "${OUTPUT_COUNTS}" > "${OUTPUT_RPKM}"

echo "RPKM calculation complete. Results saved to ${OUTPUT_RPKM}"

In short, exons from all isoforms of a gene were merged to create one meta-transcript.

agat (Inferred with models/gemini-2.5-flash) v1.0.0 GitHub

$ Bash example

# Install AGAT if not already installed
# conda install -c bioconda agat

# Define input GTF file (e.g., Gencode annotation)
GTF_FILE="gencode.v44.annotation.gtf" # Placeholder for a reference GTF file (e.g., human GRCh38)

# Define output GTF file for meta-transcripts
OUTPUT_GTF="gene_meta_transcripts.gtf"

# Step: Merge exons from all isoforms of a gene to create one meta-transcript
# agat_sp_merge_annotations.pl: Script to merge annotations
# --gtf: Input GTF file
# --type exon: Specify that we want to merge 'exon' features
# --group_by gene: Group features by their parent gene_id before merging
# --output: Output GTF file
agat_sp_merge_annotations.pl --gtf "${GTF_FILE}" --type exon --group_by gene --output "${OUTPUT_GTF}"

echo "Meta-transcripts (merged exons per gene) saved to ${OUTPUT_GTF}"

View on GitHub

The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.6 GitHub

$ Bash example

# Install RSEM (example using conda)
# conda install -c bioconda rsem=1.3.6

# Define variables
# Placeholder for the genome FASTA file used for alignment and RSEM reference preparation.
GENOME_FASTA="path/to/reference_genome.fa"
# Placeholder for the GTF annotation file. This file should define the "meta-transcripts"
# and their exon structures. For a general case, this would be a standard annotation like Gencode.
GTF_ANNOTATION="path/to/custom_or_gencode.gtf"
# Name for the RSEM reference index. This will create a directory with this name.
RSEM_REFERENCE_NAME="rsem_index_meta_transcripts"
# Path to the input BAM file containing aligned reads.
INPUT_BAM="path/to/aligned_reads.bam"
# Prefix for the output quantification files (e.g., .genes.results, .isoforms.results).
OUTPUT_PREFIX="quantification_results"
# Number of threads to use for parallel processing.
NUM_THREADS=8
# Strandedness of the library: 0 for unstranded, 0.5 for partially stranded, 1 for fully stranded.
# Adjust based on your library preparation protocol.
STRANDEDNESS_PROB=0.5 # Example: partially stranded

# Step 1: Prepare RSEM reference (run once for a given genome and annotation)
# This step builds the necessary index files for RSEM from the genome FASTA and GTF.
# The GTF_ANNOTATION must contain the definitions of the "meta-transcripts" for them to be quantified.
# rsem-prepare-reference --gtf "${GTF_ANNOTATION}" "${GENOME_FASTA}" "${RSEM_REFERENCE_NAME}"

# Step 2: Calculate expression using RSEM
# This command quantifies transcript and gene abundances from aligned reads.
# It counts reads falling into exons of defined transcripts (including meta-transcripts from the GTF)
# and normalizes by transcript length and library size (e.g., to FPKM/TPM).
rsem-calculate-expression \
    --bam \
    --paired-end \
    --forward-prob "${STRANDEDNESS_PROB}" \
    --num-threads "${NUM_THREADS}" \
    "${INPUT_BAM}" \
    "${RSEM_REFERENCE_NAME}" \
    "${OUTPUT_PREFIX}"

# Output files will include:
# ${OUTPUT_PREFIX}.genes.results: Contains gene-level quantification (e.g., FPKM, TPM, expected counts).
# ${OUTPUT_PREFIX}.isoforms.results: Contains isoform/transcript-level quantification.
# ${OUTPUT_PREFIX}.stat: Contains run statistics.

View on GitHub

Raw Source Text

Illumina software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: Excel file includes RPKM values for each Sample.

← Back to Analysis