GSE72509 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression.

Science (New York, N.Y.) (2015) — PMID 26382853

Dataset

SLE lupus RNA-seq

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Illumina software CASAVA used for basecalling.

CASAVA vNot specified

$ Bash example

# Illumina CASAVA software was used for basecalling and demultiplexing.
# This process typically involves two main steps:
# 1. Configuration using configureBclToFastq.pl to generate a Makefile.
# 2. Execution of the Makefile using 'make' to perform the actual basecalling and demultiplexing.

# Define placeholder paths for an Illumina run.
# Replace these with your actual paths.
INPUT_BCL_DIR="/path/to/illumina/run/folder/Data/Intensities/BaseCalls" # Directory containing BCL files
SAMPLE_SHEET_PATH="/path/to/SampleSheet.csv" # Path to your SampleSheet.csv
OUTPUT_FASTQ_DIR="/path/to/output/fastq_files" # Directory where FASTQ files will be generated

# Create the output directory if it doesn't exist
mkdir -p "${OUTPUT_FASTQ_DIR}"

# Step 1: Configure basecalling and demultiplexing
# This command generates a Makefile in the current working directory (or specified output dir).
# It reads BCL files from the input directory and uses the sample sheet for demultiplexing.
# The --mismatches parameter specifies the number of allowed mismatches for barcode demultiplexing.
configureBclToFastq.pl --input-dir "${INPUT_BCL_DIR}" \
--output-dir "${OUTPUT_FASTQ_DIR}" \
--sample-sheet "${SAMPLE_SHEET_PATH}" \
--mismatches 1

# Step 2: Execute the generated Makefile
# This command performs the actual basecalling and demultiplexing, producing FASTQ files.
# Adjust the '-j N' parameter to specify the number of CPU cores to use for parallel processing.
make -j 8

Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.

GSNAP v2022-04-12 GitHub

$ Bash example

# --- Installation (commented out) ---
# conda install -c bioconda fastp
# conda install -c bioconda bbmap # Provides bbduk.sh
# conda install -c bioconda gmap # Provides gsnap
# conda install -c bioconda samtools

# --- Reference Data Preparation (commented out) ---
# Download hg19 reference genome
# mkdir -p references/hg19
# wget -P references/hg19 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip references/hg19/hg19.fa.gz

# Build GSNAP index for hg19
# mkdir -p gsnap_indexes
# gmap_build -d hg19 -D gsnap_indexes references/hg19/hg19.fa

# --- Input Files ---
# Assuming input reads are paired-end and gzipped
READ1="input_R1.fastq.gz"
READ2="input_R2.fastq.gz"
OUTPUT_PREFIX="mapped_reads"

# --- Pre-processing: Trimming and Quality Filtering with fastp ---
# Trim adaptors, filter low quality reads, and remove polyG tails (common for Illumina)
# Output will be gzipped
fastp \
    -i "${READ1}" \
    -o "${OUTPUT_PREFIX}.trimmed_R1.fastq.gz" \
    -I "${READ2}" \
    -O "${OUTPUT_PREFIX}.trimmed_R2.fastq.gz" \
    --detect_adapter_for_pe \
    --qualified_quality_phred 15 \
    --length_required 30 \
    --low_complexity_filter \
    --complexity_threshold 30 \
    --thread 8 \
    --json "${OUTPUT_PREFIX}.fastp.json" \
    --html "${OUTPUT_PREFIX}.fastp.html"

TRIMMED_R1="${OUTPUT_PREFIX}.trimmed_R1.fastq.gz"
TRIMMED_R2="${OUTPUT_PREFIX}.trimmed_R2.fastq.gz"

# --- Pre-processing: rRNA Contamination Filtering with bbduk.sh ---
# Filter reads against an rRNA database. Reads matching rRNA are discarded.
# Using BBMap's internal rRNA reference ('ref=ribo').
bbduk.sh \
    in1="${TRIMMED_R1}" \
    in2="${TRIMMED_R2}" \
    out1="${OUTPUT_PREFIX}.rRNA_filtered_R1.fastq.gz" \
    out2="${OUTPUT_PREFIX}.rRNA_filtered_R2.fastq.gz" \
    ref=ribo \
    k=31 \
    hdist=1 \
    stats="${OUTPUT_PREFIX}.bbduk_rRNA_stats.txt" \
    threads=8

RRNA_FILTERED_R1="${OUTPUT_PREFIX}.rRNA_filtered_R1.fastq.gz"
RRNA_FILTERED_R2="${OUTPUT_PREFIX}.rRNA_filtered_R2.fastq.gz"

# --- Mapping to hg19 whole genome using GSNAP ---
# -d hg19: Use the hg19 database
# -D gsnap_indexes: Directory where the hg19 database was built
# -A sam: Output in SAM format
# -o: Output file name
# --gunzip: Decompress gzipped input files on the fly
# -t 8: Use 8 threads
# --pairmax-dna=1000: Max insert size for DNA reads (adjust based on expected fragment size)
# GSNAP is splice-aware, but for "whole genome" mapping, it's often used for general alignment.
# If this were explicitly RNA-seq, additional splicing parameters might be considered.
gsnap \
    -d hg19 \
    -D gsnap_indexes \
    -A sam \
    -o "${OUTPUT_PREFIX}.sam" \
    --gunzip \
    -t 8 \
    --pairmax-dna=1000 \
    "${RRNA_FILTERED_R1}" "${RRNA_FILTERED_R2}"

# --- Post-processing: Convert SAM to BAM, sort, and index ---
samtools view -bS "${OUTPUT_PREFIX}.sam" -o "${OUTPUT_PREFIX}.bam"
samtools sort "${OUTPUT_PREFIX}.bam" -o "${OUTPUT_PREFIX}.sorted.bam"
samtools index "${OUTPUT_PREFIX}.sorted.bam"

# Clean up intermediate files (optional)
# rm "${OUTPUT_PREFIX}.sam"
# rm "${TRIMMED_R1}" "${TRIMMED_R2}"
# rm "${RRNA_FILTERED_R1}" "${RRNA_FILTERED_R2}"

View on GitHub

Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.

Custom script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# --- Installation (commented out) ---
# # No specific tool installation for the RPKM calculation itself,
# # as it's a formula often implemented via custom scripts (e.g., Python, R, Awk).
# # Tools for upstream steps (e.g., samtools for total reads, featureCounts for counts)
# # would be installed separately.

# --- Define input files and parameters ---
# Placeholder for a file containing gene/exon counts and their lengths.
# This file is assumed to be pre-processed from alignment and annotation.
# Format: gene_id <tab> read_count <tab> length_in_bases
GENE_COUNTS_LENGTHS_FILE="path/to/your/gene_counts_and_lengths.tsv" # e.g., derived from featureCounts output
# Placeholder for the total number of mapped reads in the library.
# This value is typically obtained from the alignment summary or by counting reads in the BAM file.
TOTAL_MAPPED_READS=50000000 # Example: Replace with actual total mapped reads (e.g., from samtools view -c -F 4 aligned.bam)
# Placeholder for output file name
OUTPUT_RPKM_FILE="sample_rpkm.tsv"

# --- Reference Datasets (Implicitly used in upstream steps to generate inputs) ---
# Reference Genome Assembly: hg38 (Human Genome build 38) - for alignment
# Annotation Source: GENCODE (e.g., gencode.v44) - for gene/exon lengths and counting

# --- Calculate RPKM using the formula from Chepelev et al., 2009 ---
# RPKM = (10^9 * C) / (N * L)
# Where:
# C = Number of reads mapped to the gene (from GENE_COUNTS_LENGTHS_FILE, column 2)
# N = Total number of mapped reads in the library (TOTAL_MAPPED_READS)
# L = Length of the gene in bases (from GENE_COUNTS_LENGTHS_FILE, column 3)

echo -e "gene_id\tRPKM" > "${OUTPUT_RPKM_FILE}"
awk -v total_reads="${TOTAL_MAPPED_READS}" 'BEGIN { OFS="\t" } NR > 1 {
    gene_id = $1;
    count = $2;
    length_bases = $3;

    # Handle cases where length_bases might be zero to avoid division by zero
    if (length_bases == 0) {
        rpkm = 0;
    } else {
        # RPKM = (count * 10^9) / (total_reads * length_bases)
        rpkm = (count * 1000000000) / (total_reads * length_bases);
    }
    print gene_id, rpkm;
}' "${GENE_COUNTS_LENGTHS_FILE}" >> "${OUTPUT_RPKM_FILE}"

echo "RPKM calculation complete. Results saved to ${OUTPUT_RPKM_FILE}"

In short, exons from all isoforms of a gene were merged to create one meta-transcript.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# Install RSEM if not already installed
# conda install -c bioconda rsem

# Define input files and output prefix
# Replace with actual GTF and FASTA files for your organism and assembly
INPUT_GTF="Homo_sapiens.GRCh38.109.gtf" # Example: Ensembl GTF for human GRCh38
REFERENCE_FASTA="Homo_sapiens.GRCh38.dna.primary_assembly.fa" # Example: Primary assembly FASTA for human GRCh38
OUTPUT_PREFIX="GRCh38_rsem_reference" # Prefix for the generated RSEM index files

# Prepare RSEM reference. This command internally processes the GTF to create
# a 'union exon' model (conceptually a meta-transcript) for each gene,
# which RSEM uses for accurate transcript quantification.
rsem-prepare-reference \
    --gtf "${INPUT_GTF}" \
    "${REFERENCE_FASTA}" \
    "${OUTPUT_PREFIX}"

View on GitHub

The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# Define variables
# RSEM requires a reference index, which is built once from a genome FASTA and GTF file.
# Example command to build reference:
# rsem-prepare-reference --gtf Homo_sapiens.GRCh38.109.gtf Homo_sapiens.GRCh38.dna.primary_assembly.fa human_hg38_rsem_ref
# Reference genome FASTA: ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# GTF annotation: ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
RSEM_REFERENCE_PATH="/path/to/rsem_references/human_hg38_rsem_ref" # Placeholder for the indexed RSEM reference
INPUT_BAM="aligned_reads.bam" # Input BAM file from a splice-aware aligner (e.g., STAR)
SAMPLE_PREFIX="sample_id" # Prefix for output files
OUTPUT_DIR="rsem_quantification"
NUM_THREADS=8 # Number of threads for parallel processing

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run RSEM to calculate expression
# This command processes an input BAM file (assumed to be genome-aligned)
# and quantifies expression at both gene and isoform levels.
# The description "normalized by the size of the meta-transcript and by the size of the library"
# is directly addressed by RSEM's TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase of transcript per Million mapped reads) 
# outputs in the .genes.results and .isoforms.results files.
# --bam: Specifies that the input is a BAM file.
# --no-qualities: Speeds up processing by not reading quality scores (often not needed for quantification).
# --num-threads: Utilizes multiple CPU threads for faster execution.
# --paired-end: Add this flag if your data is paired-end.
# --strand-specific: Add --forward-prob 0 or --reverse-prob 0 if your library is strand-specific.
rsem-calculate-expression \
    --bam \
    --no-qualities \
    --num-threads "${NUM_THREADS}" \
    "${INPUT_BAM}" \
    "${RSEM_REFERENCE_PATH}" \
    "${OUTPUT_DIR}/${SAMPLE_PREFIX}"

# The primary output files for this step will be:
# - ${OUTPUT_DIR}/${SAMPLE_PREFIX}.genes.results: Contains gene-level expected counts, TPM, FPKM.
# - ${OUTPUT_DIR}/${SAMPLE_PREFIX}.isoforms.results: Contains isoform-level expected counts, TPM, FPKM.

View on GitHub

Raw Source Text

Illumina software CASAVA used for basecalling.
Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: SLE_RPKMs.txt: Tab-delimited text file includes RPKM values for each patient and control.

← Back to Analysis