GSE72502 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression.

Science (New York, N.Y.) (2015) — PMID 26382853

Dataset

GSE72502

Healthy donor PBMC RNA-seq with or without interferon-alpha stimulation

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Illumina software used for basecalling.

    bcl2fastq (Inferred with models/gemini-2.5-flash) vv2.20 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install bcl2fastq (example using conda)
    # conda install -c bioconda bcl2fastq2
    
    # Example usage of bcl2fastq for basecalling and demultiplexing
    # Replace /path/to/illumina/run_folder with the actual path to your Illumina run folder containing BCL files
    # Replace /path/to/output_fastq with the desired output directory for FASTQ files
    bcl2fastq --runfolder-dir /path/to/illumina/run_folder \
                  --output-dir /path/to/output_fastq \
                  --no-lane-splitting \
                  --minimum-trimmed-read-length 35 \
                  --mask-short-adapter-reads 35 \
                  --barcode-mismatches 1
  2. 2

    Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.

    GSNAP v2021-12-25 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install GSNAP (if not already installed)
    # conda install -c bioconda gsnap
    
    # Reference genome preparation (if not already done)
    # Download hg19 reference genome from UCSC
    # wget -c http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
    # gunzip hg19.fa.gz
    # Create GSNAP index for hg19. -k 15 is a common k-mer size for RNA-seq.
    # gsnap_build -d hg19_index -g hg19.fa -k 15
    
    # Define variables
    GENOME_INDEX_DIR="/path/to/gsnap_indexes" # Placeholder: Directory containing the hg19_index
    GENOME_INDEX_NAME="hg19_index"
    INPUT_FASTQ="input_reads_preprocessed.fastq" # Placeholder: Reads after trimming, rRNA filtering, and quality filtering
    OUTPUT_SAM="aligned_reads.sam"
    THREADS=8 # Number of CPU threads to use
    
    # GSNAP alignment command
    # Description: Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination,
    # filtered for low-complexity or low-quality sequence (these steps are assumed to be completed
    # and the output is INPUT_FASTQ). Then mapped to hg19 whole genome using GSNAP.
    # Parameters chosen are common for RNA-seq alignment with GSNAP, given the rRNA filtering step.
    # --orientation=fr is typically for paired-end reads, but can be used for single-end if library prep has specific orientation.
    # --novel-splice-sites=1 allows for detection of one novel splice site per read.
    # --max-mismatches=0.05 allows up to 5% mismatches relative to read length.
    # --quality-protocol=sanger specifies the quality score encoding.
    gsnap -d "${GENOME_INDEX_NAME}" \
          -D "${GENOME_INDEX_DIR}" \
          -t "${THREADS}" \
          --orientation=fr \
          --novel-splice-sites=1 \
          --max-mismatches=0.05 \
          --quality-protocol=sanger \
          -A sam \
          "${INPUT_FASTQ}" > "${OUTPUT_SAM}"
    
    # Optional: Convert SAM to BAM and sort (common post-alignment steps)
    # samtools view -bS "${OUTPUT_SAM}" | samtools sort -o "${OUTPUT_SAM%.sam}.bam"
    # samtools index "${OUTPUT_SAM%.sam}.bam"
  3. 3

    Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.

    Custom RPKM Script (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install subread (which includes featureCounts) and samtools if not already available
    # conda install -c bioconda subread samtools
    
    # Define input and reference files
    # Replace 'aligned_reads.bam' with your actual aligned BAM file
    INPUT_BAM="aligned_reads.bam"
    # Placeholder for a human GRCh38 GTF file. Obtain from Ensembl, GENCODE, or UCSC.
    # Example: Homo_sapiens.GRCh38.109.gtf from Ensembl
    GENOME_GTF="Homo_sapiens.GRCh38.109.gtf" 
    OUTPUT_COUNTS="gene_counts.txt"
    OUTPUT_RPKM="gene_rpkm.txt"
    
    # 1. Count reads per gene (aggregating exon counts to gene level)
    # -p: Assume paired-end reads (remove if single-end)
    # -t exon: Count features of type 'exon'
    # -g gene_id: Aggregate counts by 'gene_id'
    # -a: Annotation file in GTF/GFF format
    featureCounts -p -t exon -g gene_id -a "${GENOME_GTF}" -o "${OUTPUT_COUNTS}" "${INPUT_BAM}"
    
    # 2. Calculate RPKM using the formula from Chepelev et al., 2009
    # RPKM = (number of reads mapped to a gene * 10^9) / (total number of reads in the library * gene length in kilobases)
    
    # Get total number of mapped reads in the library from the BAM file
    # -c: count reads
    # -F 260: Exclude unmapped (0x4), secondary alignment (0x100), and supplementary alignment (0x800) reads
    TOTAL_READS=$(samtools view -c -F 260 "${INPUT_BAM}")
    
    # Process the featureCounts output to calculate RPKM
    # The featureCounts output file has a header, then columns like:
    # Geneid    Chr    Start    End    Strand    Length    /path/to/aligned_reads.bam
    awk -v total_reads="${TOTAL_READS}" '
    BEGIN {
        OFS="\t";
        print "Geneid", "Length_Bases", "Raw_Counts", "RPKM";
    }
    NR > 2 { # Skip the first two header lines from featureCounts output
        gene_id = $1;
        gene_length_bases = $6; # Gene length in bases (sum of exon lengths)
        raw_counts = $7;        # Raw read counts for the gene
    
        if (gene_length_bases > 0 && total_reads > 0) {
            # Calculate RPKM
            rpkm = (raw_counts * 1000000000) / (total_reads * (gene_length_bases / 1000));
            print gene_id, gene_length_bases, raw_counts, rpkm;
        } else {
            # Handle cases with zero length or zero total reads to avoid division by zero
            print gene_id, gene_length_bases, raw_counts, 0;
        }
    }' "${OUTPUT_COUNTS}" > "${OUTPUT_RPKM}"
    
    echo "RPKM calculation complete. Results saved to ${OUTPUT_RPKM}"
  4. 4

    In short, exons from all isoforms of a gene were merged to create one meta-transcript.

    agat (Inferred with models/gemini-2.5-flash) v1.0.0 GitHub
    $ Bash example
    # Install AGAT if not already installed
    # conda install -c bioconda agat
    
    # Define input GTF file (e.g., Gencode annotation)
    GTF_FILE="gencode.v44.annotation.gtf" # Placeholder for a reference GTF file (e.g., human GRCh38)
    
    # Define output GTF file for meta-transcripts
    OUTPUT_GTF="gene_meta_transcripts.gtf"
    
    # Step: Merge exons from all isoforms of a gene to create one meta-transcript
    # agat_sp_merge_annotations.pl: Script to merge annotations
    # --gtf: Input GTF file
    # --type exon: Specify that we want to merge 'exon' features
    # --group_by gene: Group features by their parent gene_id before merging
    # --output: Output GTF file
    agat_sp_merge_annotations.pl --gtf "${GTF_FILE}" --type exon --group_by gene --output "${OUTPUT_GTF}"
    
    echo "Meta-transcripts (merged exons per gene) saved to ${OUTPUT_GTF}"
  5. 5

    The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.

    RSEM (Inferred with models/gemini-2.5-flash) v1.3.6 GitHub
    $ Bash example
    # Install RSEM (example using conda)
    # conda install -c bioconda rsem=1.3.6
    
    # Define variables
    # Placeholder for the genome FASTA file used for alignment and RSEM reference preparation.
    GENOME_FASTA="path/to/reference_genome.fa"
    # Placeholder for the GTF annotation file. This file should define the "meta-transcripts"
    # and their exon structures. For a general case, this would be a standard annotation like Gencode.
    GTF_ANNOTATION="path/to/custom_or_gencode.gtf"
    # Name for the RSEM reference index. This will create a directory with this name.
    RSEM_REFERENCE_NAME="rsem_index_meta_transcripts"
    # Path to the input BAM file containing aligned reads.
    INPUT_BAM="path/to/aligned_reads.bam"
    # Prefix for the output quantification files (e.g., .genes.results, .isoforms.results).
    OUTPUT_PREFIX="quantification_results"
    # Number of threads to use for parallel processing.
    NUM_THREADS=8
    # Strandedness of the library: 0 for unstranded, 0.5 for partially stranded, 1 for fully stranded.
    # Adjust based on your library preparation protocol.
    STRANDEDNESS_PROB=0.5 # Example: partially stranded
    
    # Step 1: Prepare RSEM reference (run once for a given genome and annotation)
    # This step builds the necessary index files for RSEM from the genome FASTA and GTF.
    # The GTF_ANNOTATION must contain the definitions of the "meta-transcripts" for them to be quantified.
    # rsem-prepare-reference --gtf "${GTF_ANNOTATION}" "${GENOME_FASTA}" "${RSEM_REFERENCE_NAME}"
    
    # Step 2: Calculate expression using RSEM
    # This command quantifies transcript and gene abundances from aligned reads.
    # It counts reads falling into exons of defined transcripts (including meta-transcripts from the GTF)
    # and normalizes by transcript length and library size (e.g., to FPKM/TPM).
    rsem-calculate-expression \
        --bam \
        --paired-end \
        --forward-prob "${STRANDEDNESS_PROB}" \
        --num-threads "${NUM_THREADS}" \
        "${INPUT_BAM}" \
        "${RSEM_REFERENCE_NAME}" \
        "${OUTPUT_PREFIX}"
    
    # Output files will include:
    # ${OUTPUT_PREFIX}.genes.results: Contains gene-level quantification (e.g., FPKM, TPM, expected counts).
    # ${OUTPUT_PREFIX}.isoforms.results: Contains isoform/transcript-level quantification.
    # ${OUTPUT_PREFIX}.stat: Contains run statistics.
Raw Source Text
Illumina software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: Excel file includes RPKM values for each Sample.
← Back to Analysis