GSE72501 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

The Ro60 autoantigen binds endogenous retroelements and regulates inflammatory gene expression.

Science (New York, N.Y.) (2015) — PMID 26382853

Dataset

Ro60-knockout cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Illumina software used for basecalling.

bcl2fastq (Inferred with models/gemini-2.5-flash) v2.20.0.422 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bcl2fastq (example using conda)
# conda install -c bioconda bcl2fastq2

# Example bcl2fastq command for basecalling and demultiplexing
# Replace /path/to/run_folder with the actual path to your Illumina run folder containing BCL files
# Replace /path/to/output_directory with your desired output location for FASTQ files
# Replace /path/to/sample_sheet.csv with your actual sample sheet for demultiplexing
bcl2fastq --runfolder-dir /path/to/run_folder \
                  --output-dir /path/to/output_directory \
                  --sample-sheet /path/to/sample_sheet.csv \
                  --no-lane-splitting \
                  --barcode-mismatches 1 \
                  --minimum-trimmed-read-length 35 \
                  --mask-short-adapter-reads 35

View on GitHub

Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.

GSNAP v2021-08-25

$ Bash example

# Define variables
RAW_READS="raw_reads.fastq.gz"
TRIMMED_READS="trimmed_reads.fastq.gz"
RRNA_FILTERED_READS="rRNA_filtered_reads.fastq.gz"
MAPPED_SAM="mapped_reads.sam"

# Paths for reference data
# Replace with actual paths to your genome index directory, hg19 FASTA, and rRNA reference FASTA
GENOME_DIR="/path/to/gsnap_indexes"
HG19_FASTA="/path/to/hg19.fa"
RRNA_REF="/path/to/rRNA_references.fasta" # e.g., from NCBI or BBMap resources

# --- Pre-processing Steps ---

# 1. Trim adaptor sequence and filter low-quality/low-complexity sequence using fastp
# conda install -c bioconda fastp
fastp -i "${RAW_READS}" -o "${TRIMMED_READS}" \
      --detect_adapter_for_pe \
      --qualified_quality_phred 15 \
      --length_required 30 \
      --thread 8

# 2. Filter for rRNA contamination using bbduk.sh
# conda install -c bioconda bbmap
bbduk.sh in="${TRIMMED_READS}" out="${RRNA_FILTERED_READS}" \
         ref="${RRNA_REF}" \
         k=31 hdist=1 stats=bbduk_stats.txt \
         threads=8

# --- Mapping Step ---

# 3. Map to hg19 whole genome using GSNAP
# conda install -c bioconda gmap

# Build GSNAP index for hg19 (run once if index does not exist)
# mkdir -p "${GENOME_DIR}"
# gmap_build -D "${GENOME_DIR}" -d hg19 "${HG19_FASTA}"

gsnap -D "${GENOME_DIR}" -d hg19 \
      -A sam \
      -o "${MAPPED_SAM}" \
      -N 1 \
      -t 8 \
      "${RRNA_FILTERED_READS}"

Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009.

python (Inferred with models/gemini-2.5-flash) v3.x (Inferred with models/gemini-2.5-flash)

$ Bash example

# Define input and output files
INPUT_BAM="aligned_reads.bam"
GENOME_ANNOTATION_GTF="gencode.v38.annotation.gtf" # Placeholder for human hg38 annotation
OUTPUT_RPKM_FILE="gene_rpkm.tsv"
GENE_COUNTS_FILE="gene_counts.tsv"

# --- Installation instructions (commented out) ---
# The following tools are required:
# - subread (for featureCounts)
# - samtools
# - python (with pandas library)
# Example installation using conda:
# conda install -c bioconda subread samtools
# conda install -c conda-forge pandas python

# Step 1: Count reads per gene using featureCounts
# This step generates raw read counts per gene, which are necessary for RPKM calculation.
# -a: annotation file (GTF/GFF format)
# -o: output file for counts
# -F GTF: specify GTF format for annotation
# -t exon: count features of type 'exon' (as RPKM is based on exonic regions)
# -g gene_id: group features by 'gene_id' attribute to get gene-level counts
featureCounts -a "${GENOME_ANNOTATION_GTF}" -o "${GENE_COUNTS_FILE}" -F GTF -t exon -g gene_id "${INPUT_BAM}"

# Step 2: Get total mapped reads from the BAM file
# The total number of mapped reads is required for library size normalization in RPKM.
# samtools flagstat provides mapping statistics.
# grep 'mapped (' extracts the line containing the total mapped reads.
# awk '{print $1}' gets the first field, which is the number of mapped reads.
total_mapped_reads=$(samtools flagstat "${INPUT_BAM}" | grep 'mapped (' | awk '{print $1}')

# Step 3: Calculate RPKM using a Python script
# RPKM (Reads Per Kilobase of exon per Million mapped reads) is calculated as:
# RPKM = (read_count * 10^9) / (gene_length_in_bp * total_mapped_reads_in_millions)

# Create a temporary Python script for RPKM calculation
cat << 'EOF' > calculate_rpkm.py
import pandas as pd
import sys
import re

def calculate_rpkm(gene_counts_file, total_mapped_reads, gtf_file, output_file):
    # Load gene counts from featureCounts output
    # Skip header lines starting with '#' and use the first column as index (Geneid)
    counts_df = pd.read_csv(gene_counts_file, sep='\t', comment='#', index_col=0)
    # The last column contains the counts for the input BAM file provided to featureCounts
    read_counts = counts_df.iloc[:, -1] 

    # Parse GTF to get gene lengths (sum of exon lengths per gene)
    # This is crucial for the 'per Kilobase of exon' part of RPKM.
    gene_lengths = {}
    with open(gtf_file, 'r') as f:
        for line in f:
            if line.startswith('#'):
                continue
            parts = line.strip().split('\t')
            if len(parts) < 9:
                continue
            feature_type = parts[2]
            # Only consider exons for gene length calculation as per RPKM definition
            if feature_type == 'exon':
                start = int(parts[3])
                end = int(parts[4])
                length = end - start + 1
                
                attributes = parts[8]
                gene_id_match = re.search(r'gene_id "([^"]+)"', attributes)
                if gene_id_match:
                    gene_id = gene_id_match.group(1)
                    gene_lengths[gene_id] = gene_lengths.get(gene_id, 0) + length

    # Convert total mapped reads to millions for the RPKM formula
    total_mapped_reads_in_millions = total_mapped_reads / 1_000_000.0

    rpkm_data = {}
    # Iterate through genes present in the counts file
    for gene_id in read_counts.index:
        count = read_counts.get(gene_id, 0)
        length = gene_lengths.get(gene_id, 0)

        if length > 0 and total_mapped_reads_in_millions > 0:
            # Apply the RPKM formula
            # C = number of reads mapped to the gene
            # N = total number of mapped reads in millions
            # L = gene length in base pairs (sum of exon lengths)
            rpkm = (count * 1_000_000_000) / (length * total_mapped_reads_in_millions)
        else:
            rpkm = 0.0 # Assign 0 if gene length or total mapped reads are zero
        rpkm_data[gene_id] = rpkm

    # Create a DataFrame and save RPKM values to a TSV file
    rpkm_df = pd.DataFrame.from_dict(rpkm_data, orient='index', columns=['RPKM'])
    rpkm_df.index.name = 'Geneid'
    rpkm_df.to_csv(output_file, sep='\t')

if __name__ == '__main__':
    if len(sys.argv) != 5:
        print("Usage: python calculate_rpkm.py <gene_counts_file> <total_mapped_reads> <gtf_file> <output_file>")
        sys.exit(1)
    
    gene_counts_file = sys.argv[1]
    total_mapped_reads = int(sys.argv[2])
    gtf_file = sys.argv[3]
    output_file = sys.argv[4]

    calculate_rpkm(gene_counts_file, total_mapped_reads, gtf_file, output_file)
EOF

# Execute the Python script to calculate RPKM
python calculate_rpkm.py "${GENE_COUNTS_FILE}" "${total_mapped_reads}" "${GENOME_ANNOTATION_GTF}" "${OUTPUT_RPKM_FILE}"

# Clean up the temporary Python script
rm calculate_rpkm.py

In short, exons from all isoforms of a gene were merged to create one meta-transcript.

RSEM gtf_to_genes_and_transcripts.py (Inferred with models/gemini-2.5-flash) vv1.3.3 GitHub

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# Define input GTF and output GTF
# Placeholder: Replace 'Homo_sapiens.GRCh38.109.gtf' with the actual path to your gene annotation GTF file.
INPUT_GTF="Homo_sapiens.GRCh38.109.gtf"
OUTPUT_META_GTF="Homo_sapiens.GRCh38.109.meta_transcripts.gtf"

# Execute the RSEM utility to create meta-transcripts.
# This script takes a GTF and outputs a new GTF where each gene has a single "transcript"
# representing the union of all its exons across all isoforms, effectively creating a meta-transcript.
gtf_to_genes_and_transcripts.py \
    --gtf "${INPUT_GTF}" \
    --output_gtf "${OUTPUT_META_GTF}" \
    --gene_level_transcripts

View on GitHub

The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.

featureCounts (Inferred with models/gemini-2.5-flash) v2.0.6 GitHub

$ Bash example

# Install Subread (which includes featureCounts)
# conda install -c bioconda subread

# Define input and output files
BAM_FILE="aligned_reads.bam"
# The GTF file should contain the definitions of exons for the meta-transcripts.
# This file is a critical reference dataset for this step.
GTF_FILE="path/to/meta_transcript_exons.gtf"
OUTPUT_RAW_COUNTS_FILE="meta_transcript_exon_raw_counts.txt"
OUTPUT_NORMALIZED_COUNTS_FILE="meta_transcript_exon_normalized_counts.txt"

# 1. Count reads falling into exons of meta-transcripts
# -a: Annotation file (GTF/GFF)
# -o: Output file for raw counts
# -F GTF: Specify GTF format for the annotation file
# -t exon: Count features of type 'exon'
# -g gene_id: Attribute used to group features (e.g., 'gene_id' or 'transcript_id' if meta-transcript IDs are in this attribute)
# -p: Specify if reads are paired-end (remove if single-end)
# -s 0: Specify strandedness (0=unstranded, 1=stranded, 2=reversely stranded). Adjust based on library prep.
featureCounts -a "${GTF_FILE}" -o "${OUTPUT_RAW_COUNTS_FILE}" -F GTF -t exon -g gene_id -p -s 0 "${BAM_FILE}"

# 2. Normalize counts by meta-transcript size and library size (e.g., RPKM/TPM)
# featureCounts outputs raw counts and feature lengths. 
# Normalization is typically a post-processing step using a custom script or statistical package (e.g., DESeq2, edgeR, or a simple RPKM calculation).
#
# Conceptual steps for normalization:
# a. Extract raw counts and feature lengths from "${OUTPUT_RAW_COUNTS_FILE}".
# b. Determine the total number of mapped reads (library size) from the BAM file or alignment logs.
#    Example: LIBRARY_SIZE=$(samtools view -c -F 4 "${BAM_FILE}")
# c. Calculate RPKM (Reads Per Kilobase of transcript per Million mapped reads) or TPM (Transcripts Per Million).
#    RPKM = (raw_count * 10^9) / (feature_length_in_bases * library_size_in_reads)
#
# The actual implementation would depend on the exact format of the meta_transcript_exons.gtf and desired output.
# For simplicity and focusing on the primary counting tool, the normalization step is described conceptually.

View on GitHub

Raw Source Text

Illumina software used for basecalling.
Sequenced reads were trimmed for adaptor sequence, filtered for rRNA contamination, filtered for low-complexity or low-quality sequence, then mapped to hg19 whole genome using GSNAP.
Reads Per Kilobase of exon per Megabase of library size (RPKM) were calculated using a protocol from Chepelev et al., Nucleic Acids Research, 2009. In short, exons from all isoforms of a gene were merged to create one meta-transcript. The number of reads falling in the exons of this meta-transcript were counted and normalized by the size of the meta-transcript and by the size of the library.
Genome_build: GRCh37 (hg19)
Supplementary_files_format_and_content: Excel file includes RPKM values for each Sample.

← Back to Analysis