GSE72509 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression.

Science (New York, N.Y.) (2015) — PMID 26382853

Dataset

GSE72509

SLE lupus RNA-seq

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Illumina software CASAVA used for basecalling.

    CASAVA vNot specified
    $ Bash example
    # Illumina CASAVA software was used for basecalling and demultiplexing.
    # This process typically involves two main steps:
    # 1. Configuration using configureBclToFastq.pl to generate a Makefile.
    # 2. Execution of the Makefile using 'make' to perform the actual basecalling and demultiplexing.
    
    # Define placeholder paths for an Illumina run.
    # Replace these with your actual paths.
    INPUT_BCL_DIR="/path/to/illumina/run/folder/Data/Intensities/BaseCalls" # Directory containing BCL files
    SAMPLE_SHEET_PATH="/path/to/SampleSheet.csv" # Path to your SampleSheet.csv
    OUTPUT_FASTQ_DIR="/path/to/output/fastq_files" # Directory where FASTQ files will be generated
    
    # Create the output directory if it doesn't exist
    mkdir -p "${OUTPUT_FASTQ_DIR}"
    
    # Step 1: Configure basecalling and demultiplexing
    # This command generates a Makefile in the current working directory (or specified output dir).
    # It reads BCL files from the input directory and uses the sample sheet for demultiplexing.
    # The --mismatches parameter specifies the number of allowed mismatches for barcode demultiplexing.
    configureBclToFastq.pl --input-dir "${INPUT_BCL_DIR}" \
                           --output-dir "${OUTPUT_FASTQ_DIR}" \
                           --sample-sheet "${SAMPLE_SHEET_PATH}" \
                           --mismatches 1
    
    # Step 2: Execute the generated Makefile
    # This command performs the actual basecalling and demultiplexing, producing FASTQ files.
    # Adjust the '-j N' parameter to specify the number of CPU cores to use for parallel processing.
    make -j 8
  2. 2

    Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.

    GSNAP v2022-04-12 GitHub
    $ Bash example
    # --- Installation (commented out) ---
    # conda install -c bioconda fastp
    # conda install -c bioconda bbmap # Provides bbduk.sh
    # conda install -c bioconda gmap # Provides gsnap
    # conda install -c bioconda samtools
    
    # --- Reference Data Preparation (commented out) ---
    # Download hg19 reference genome
    # mkdir -p references/hg19
    # wget -P references/hg19 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
    # gunzip references/hg19/hg19.fa.gz
    
    # Build GSNAP index for hg19
    # mkdir -p gsnap_indexes
    # gmap_build -d hg19 -D gsnap_indexes references/hg19/hg19.fa
    
    # --- Input Files ---
    # Assuming input reads are paired-end and gzipped
    READ1="input_R1.fastq.gz"
    READ2="input_R2.fastq.gz"
    OUTPUT_PREFIX="mapped_reads"
    
    # --- Pre-processing: Trimming and Quality Filtering with fastp ---
    # Trim adaptors, filter low quality reads, and remove polyG tails (common for Illumina)
    # Output will be gzipped
    fastp \
        -i "${READ1}" \
        -o "${OUTPUT_PREFIX}.trimmed_R1.fastq.gz" \
        -I "${READ2}" \
        -O "${OUTPUT_PREFIX}.trimmed_R2.fastq.gz" \
        --detect_adapter_for_pe \
        --qualified_quality_phred 15 \
        --length_required 30 \
        --low_complexity_filter \
        --complexity_threshold 30 \
        --thread 8 \
        --json "${OUTPUT_PREFIX}.fastp.json" \
        --html "${OUTPUT_PREFIX}.fastp.html"
    
    TRIMMED_R1="${OUTPUT_PREFIX}.trimmed_R1.fastq.gz"
    TRIMMED_R2="${OUTPUT_PREFIX}.trimmed_R2.fastq.gz"
    
    # --- Pre-processing: rRNA Contamination Filtering with bbduk.sh ---
    # Filter reads against an rRNA database. Reads matching rRNA are discarded.
    # Using BBMap's internal rRNA reference ('ref=ribo').
    bbduk.sh \
        in1="${TRIMMED_R1}" \
        in2="${TRIMMED_R2}" \
        out1="${OUTPUT_PREFIX}.rRNA_filtered_R1.fastq.gz" \
        out2="${OUTPUT_PREFIX}.rRNA_filtered_R2.fastq.gz" \
        ref=ribo \
        k=31 \
        hdist=1 \
        stats="${OUTPUT_PREFIX}.bbduk_rRNA_stats.txt" \
        threads=8
    
    RRNA_FILTERED_R1="${OUTPUT_PREFIX}.rRNA_filtered_R1.fastq.gz"
    RRNA_FILTERED_R2="${OUTPUT_PREFIX}.rRNA_filtered_R2.fastq.gz"
    
    # --- Mapping to hg19 whole genome using GSNAP ---
    # -d hg19: Use the hg19 database
    # -D gsnap_indexes: Directory where the hg19 database was built
    # -A sam: Output in SAM format
    # -o: Output file name
    # --gunzip: Decompress gzipped input files on the fly
    # -t 8: Use 8 threads
    # --pairmax-dna=1000: Max insert size for DNA reads (adjust based on expected fragment size)
    # GSNAP is splice-aware, but for "whole genome" mapping, it's often used for general alignment.
    # If this were explicitly RNA-seq, additional splicing parameters might be considered.
    gsnap \
        -d hg19 \
        -D gsnap_indexes \
        -A sam \
        -o "${OUTPUT_PREFIX}.sam" \
        --gunzip \
        -t 8 \
        --pairmax-dna=1000 \
        "${RRNA_FILTERED_R1}" "${RRNA_FILTERED_R2}"
    
    # --- Post-processing: Convert SAM to BAM, sort, and index ---
    samtools view -bS "${OUTPUT_PREFIX}.sam" -o "${OUTPUT_PREFIX}.bam"
    samtools sort "${OUTPUT_PREFIX}.bam" -o "${OUTPUT_PREFIX}.sorted.bam"
    samtools index "${OUTPUT_PREFIX}.sorted.bam"
    
    # Clean up intermediate files (optional)
    # rm "${OUTPUT_PREFIX}.sam"
    # rm "${TRIMMED_R1}" "${TRIMMED_R2}"
    # rm "${RRNA_FILTERED_R1}" "${RRNA_FILTERED_R2}"
  3. 3

    Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.

    Custom script (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # --- Installation (commented out) ---
    # # No specific tool installation for the RPKM calculation itself,
    # # as it's a formula often implemented via custom scripts (e.g., Python, R, Awk).
    # # Tools for upstream steps (e.g., samtools for total reads, featureCounts for counts)
    # # would be installed separately.
    
    # --- Define input files and parameters ---
    # Placeholder for a file containing gene/exon counts and their lengths.
    # This file is assumed to be pre-processed from alignment and annotation.
    # Format: gene_id <tab> read_count <tab> length_in_bases
    GENE_COUNTS_LENGTHS_FILE="path/to/your/gene_counts_and_lengths.tsv" # e.g., derived from featureCounts output
    # Placeholder for the total number of mapped reads in the library.
    # This value is typically obtained from the alignment summary or by counting reads in the BAM file.
    TOTAL_MAPPED_READS=50000000 # Example: Replace with actual total mapped reads (e.g., from samtools view -c -F 4 aligned.bam)
    # Placeholder for output file name
    OUTPUT_RPKM_FILE="sample_rpkm.tsv"
    
    # --- Reference Datasets (Implicitly used in upstream steps to generate inputs) ---
    # Reference Genome Assembly: hg38 (Human Genome build 38) - for alignment
    # Annotation Source: GENCODE (e.g., gencode.v44) - for gene/exon lengths and counting
    
    # --- Calculate RPKM using the formula from Chepelev et al., 2009 ---
    # RPKM = (10^9 * C) / (N * L)
    # Where:
    # C = Number of reads mapped to the gene (from GENE_COUNTS_LENGTHS_FILE, column 2)
    # N = Total number of mapped reads in the library (TOTAL_MAPPED_READS)
    # L = Length of the gene in bases (from GENE_COUNTS_LENGTHS_FILE, column 3)
    
    echo -e "gene_id\tRPKM" > "${OUTPUT_RPKM_FILE}"
    awk -v total_reads="${TOTAL_MAPPED_READS}" 'BEGIN { OFS="\t" } NR > 1 {
        gene_id = $1;
        count = $2;
        length_bases = $3;
    
        # Handle cases where length_bases might be zero to avoid division by zero
        if (length_bases == 0) {
            rpkm = 0;
        } else {
            # RPKM = (count * 10^9) / (total_reads * length_bases)
            rpkm = (count * 1000000000) / (total_reads * length_bases);
        }
        print gene_id, rpkm;
    }' "${GENE_COUNTS_LENGTHS_FILE}" >> "${OUTPUT_RPKM_FILE}"
    
    echo "RPKM calculation complete. Results saved to ${OUTPUT_RPKM_FILE}"
  4. 4

    In short, exons from all isoforms of a gene were merged to create one meta-transcript.

    RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub
    $ Bash example
    # Install RSEM if not already installed
    # conda install -c bioconda rsem
    
    # Define input files and output prefix
    # Replace with actual GTF and FASTA files for your organism and assembly
    INPUT_GTF="Homo_sapiens.GRCh38.109.gtf" # Example: Ensembl GTF for human GRCh38
    REFERENCE_FASTA="Homo_sapiens.GRCh38.dna.primary_assembly.fa" # Example: Primary assembly FASTA for human GRCh38
    OUTPUT_PREFIX="GRCh38_rsem_reference" # Prefix for the generated RSEM index files
    
    # Prepare RSEM reference. This command internally processes the GTF to create
    # a 'union exon' model (conceptually a meta-transcript) for each gene,
    # which RSEM uses for accurate transcript quantification.
    rsem-prepare-reference \
        --gtf "${INPUT_GTF}" \
        "${REFERENCE_FASTA}" \
        "${OUTPUT_PREFIX}"
  5. 5

    The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.

    RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub
    $ Bash example
    # Install RSEM (if not already installed)
    # conda install -c bioconda rsem
    
    # Define variables
    # RSEM requires a reference index, which is built once from a genome FASTA and GTF file.
    # Example command to build reference:
    # rsem-prepare-reference --gtf Homo_sapiens.GRCh38.109.gtf Homo_sapiens.GRCh38.dna.primary_assembly.fa human_hg38_rsem_ref
    # Reference genome FASTA: ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    # GTF annotation: ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
    RSEM_REFERENCE_PATH="/path/to/rsem_references/human_hg38_rsem_ref" # Placeholder for the indexed RSEM reference
    INPUT_BAM="aligned_reads.bam" # Input BAM file from a splice-aware aligner (e.g., STAR)
    SAMPLE_PREFIX="sample_id" # Prefix for output files
    OUTPUT_DIR="rsem_quantification"
    NUM_THREADS=8 # Number of threads for parallel processing
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run RSEM to calculate expression
    # This command processes an input BAM file (assumed to be genome-aligned)
    # and quantifies expression at both gene and isoform levels.
    # The description "normalized by the size of the meta-transcript and by the size of the library"
    # is directly addressed by RSEM's TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase of transcript per Million mapped reads) 
    # outputs in the .genes.results and .isoforms.results files.
    # --bam: Specifies that the input is a BAM file.
    # --no-qualities: Speeds up processing by not reading quality scores (often not needed for quantification).
    # --num-threads: Utilizes multiple CPU threads for faster execution.
    # --paired-end: Add this flag if your data is paired-end.
    # --strand-specific: Add --forward-prob 0 or --reverse-prob 0 if your library is strand-specific.
    rsem-calculate-expression \
        --bam \
        --no-qualities \
        --num-threads "${NUM_THREADS}" \
        "${INPUT_BAM}" \
        "${RSEM_REFERENCE_PATH}" \
        "${OUTPUT_DIR}/${SAMPLE_PREFIX}"
    
    # The primary output files for this step will be:
    # - ${OUTPUT_DIR}/${SAMPLE_PREFIX}.genes.results: Contains gene-level expected counts, TPM, FPKM.
    # - ${OUTPUT_DIR}/${SAMPLE_PREFIX}.isoforms.results: Contains isoform-level expected counts, TPM, FPKM.
Raw Source Text
Illumina software CASAVA used for basecalling.
Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: SLE_RPKMs.txt: Tab-delimited text file includes RPKM values for each patient and control.
← Back to Analysis