GSE114922 Processing Pipeline

RNA-Seq code_examples 7 steps

Publication

Integrative RNA-omics Discovers GNAS Alternative Splicing as a Phenotypic Driver of Splicing Factor-Mutant Neoplasms.

Cancer discovery (2022) — PMID 34620690

Dataset

RNA sequencing of bone marrow CD34+ hematopoietic stem and progenitor cells from patients with myelodysplastic syndrome and healthy controls

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Real time Analysis (RTA): HiSeq 4000 - RTA v2.73, HCS v3.3.52

RTA vv2.73

$ Bash example

bash
# Real Time Analysis (RTA) is proprietary software embedded in Illumina sequencing instruments.
# It performs base calling and quality scoring during the sequencing run.
# This step is typically controlled via the instrument's graphical user interface (GUI)
# and does not involve direct command-line execution by a user in a typical bioinformatics pipeline.

# The versions specified are:
# RTA: v2.73
# HCS (HiSeq Control Software): v3.3.52

# No direct bash command for execution is available as it's an instrument-level process.
# The output of RTA (e.g., BCL files) is then used as input for downstream bioinformatics tools.

Base-calling: bcl2fastq v2.17.1.14

bcl2fastq v2.17.1.14 GitHub

$ Bash example

# Install bcl2fastq (example using conda)
# conda create -n bcl2fastq_env bcl2fastq=2.17.1.14 -c bioconda -c conda-forge
# conda activate bcl2fastq_env

# Define input and output directories
INPUT_RUN_FOLDER="/path/to/illumina/run_folder"
OUTPUT_FASTQ_DIR="/path/to/output/fastq_files"
SAMPLE_SHEET="/path/to/sample_sheet.csv" # Required for demultiplexing

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_FASTQ_DIR}"

# Execute bcl2fastq
# This command converts BCL files from the run folder into FASTQ files,
# performing demultiplexing based on the provided sample sheet.
# Common parameters include:
# --runfolder-dir: Path to the Illumina run folder containing BCL files.
# --output-dir: Path where the generated FASTQ files will be stored.
# --sample-sheet: Path to the sample sheet CSV file for demultiplexing.
# --no-lane-splitting: Prevents splitting output by lane (often preferred).
# -r, -w, -p: Number of threads for reading, writing, and processing (example values).
bcl2fastq --runfolder-dir "${INPUT_RUN_FOLDER}" \
          --output-dir "${OUTPUT_FASTQ_DIR}" \
          --sample-sheet "${SAMPLE_SHEET}" \
          --no-lane-splitting \
          --minimum-trimmed-read-length 0 \
          --mask-short-adapter-reads 0 \
          -r 8 -w 8 -p 8

View on GitHub

Adapter trimming: RNA-seq reads were trimmed for SMARTer and Illumina adapter sequences using skewer-v0.1.125

RNA-seq v0.1.125 GitHub

$ Bash example

# Install skewer (example using conda)
# conda install -c bioconda skewer=0.1.125

# Define input and output files (assuming paired-end RNA-seq)
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_PREFIX="trimmed_reads"
ADAPTER_FILE="smarter_illumina_adapters.fasta" # Placeholder for a FASTA file containing SMARTer and Illumina adapter sequences

# Example content for smarter_illumina_adapters.fasta:
# >SMARTer_Adapter_Sequence_1
# AAGCAGTGGTATCAACGCAGAGTACATGGGG
# >Illumina_Universal_Adapter_Sequence_1
# AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
# (Add all relevant SMARTer and Illumina adapter sequences here)

# Run skewer for adapter trimming
# -x: Path to a FASTA file containing adapter sequences to be trimmed
# -l: Minimum read length after trimming (e.g., 20 bp, common for RNA-seq)
# -q: Minimum quality score for quality trimming (e.g., 20, common)
# -m pe: Paired-end mode (use '-m se' for single-end if applicable)
# -o: Output file prefix. Skewer will append '-trimmed-pair1.fastq.gz' and '-trimmed-pair2.fastq.gz' for paired-end.
skewer -x "${ADAPTER_FILE}" -l 20 -q 20 -m pe -o "${OUTPUT_PREFIX}" "${INPUT_R1}" "${INPUT_R2}"

View on GitHub

Alignment: Reads in gzipped fastq format were aligned using Hisat2 version-2.0.0-beta with default parameters to a modified reference genome comprising the human genome, Homo sapiens GRCh37 - human_g1k_v37, and fasta sequences for ERCC spike-ins

HISAT2 v2.0.0 GitHub

$ Bash example

# Install HISAT2
# conda install -c bioconda hisat2=2.0.0 samtools

# --- Reference Preparation ---
# Placeholder for human reference genome GRCh37 (human_g1k_v37) FASTA file.
# This file can be obtained from sources like the GATK resource bundle or UCSC.
# Example: wget -O human_g1k_v37.fasta.gz "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz"
# gunzip human_g1k_v37.fasta.gz

# Placeholder for ERCC spike-in sequences FASTA file.
# This file can be obtained from Thermo Fisher Scientific.
# Example: wget -O ERCC_Controls_Analysis.zip "https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC_Controls_Analysis.zip"
# unzip ERCC_Controls_Analysis.zip
# mv ERCC92.fasta ercc_spikeins.fasta

# Combine human and ERCC reference sequences into a single FASTA file
cat human_g1k_v37.fasta ercc_spikeins.fasta > combined_reference.fasta

# Build HISAT2 index for the combined reference genome
# -p: number of threads to use for index building (e.g., 8)
hisat2-build -p 8 combined_reference.fasta combined_reference_index

# --- Alignment ---
# Define input FASTQ files (assuming paired-end reads in gzipped format)
# Replace 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz' with actual input file names
READS_R1="sample_R1.fastq.gz"
READS_R2="sample_R2.fastq.gz"
OUTPUT_SAM="sample.sam"

# Align reads using HISAT2 with default parameters to the modified reference genome
# -x: basename of the HISAT2 index
# -1: comma-separated list of files containing upstream mates (paired-end)
# -2: comma-separated list of files containing downstream mates (paired-end)
# -S: file to write SAM alignments to
# --threads: number of threads to use for alignment (e.g., 8, common optimization for performance)
hisat2 -x combined_reference_index -1 "${READS_R1}" -2 "${READS_R2}" -S "${OUTPUT_SAM}" --threads 8

# --- Post-alignment processing (optional, often separate pipeline steps) ---
# Convert SAM to BAM format
# samtools view -bS "${OUTPUT_SAM}" > "${OUTPUT_SAM%.sam}.bam"

# Sort the BAM file by coordinate
# samtools sort "${OUTPUT_SAM%.sam}.bam" -o "${OUTPUT_SAM%.sam}.sorted.bam"

# Index the sorted BAM file
# samtools index "${OUTPUT_SAM%.sam}.sorted.bam"

# Remove intermediate SAM file
# rm "${OUTPUT_SAM}"

View on GitHub

Mark Duplicates: Duplicate reads were marked using MarkDuplicates.jar implemented in Picard tools v1.92

Picard v1.92 GitHub

$ Bash example

# Install Picard (if not already installed)
# conda install -c bioconda picard

# Example usage of MarkDuplicates.jar from Picard tools v1.92
# This command marks duplicate reads in an input BAM file.
# Replace 'input.bam', 'output.bam', and 'markduplicates_metrics.txt' with your actual file names.
# Replace '/path/to/picard.jar' with the actual path to your Picard JAR file.

java -jar /path/to/picard.jar MarkDuplicates \
    I=input.bam \
    O=output.bam \
    M=markduplicates_metrics.txt \
    VALIDATION_STRINGENCY=SILENT

View on GitHub

Read counts (abundance measurement): Reads mapping uniquely to genes annotated in ENSEMBL release 76 were counted using featureCounts implemented in subread-v1.5.0

featureCounts v1.5.0

$ Bash example

# Install subread (which includes featureCounts)
# conda install -c bioconda subread=1.5.0

# Placeholder for ENSEMBL release 76 GTF file (e.g., for Homo sapiens GRCh38)
# Ensembl release 76 for human corresponds to GRCh38.
# Download from Ensembl archives: ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz
ENSEMBL_GTF="Homo_sapiens.GRCh38.76.gtf" # Placeholder path to unzipped GTF

# Input BAM file (aligned reads)
INPUT_BAM="aligned_reads.bam" # Placeholder path to the alignment file

# Output counts file
OUTPUT_COUNTS="gene_counts.txt"

# Count reads mapping uniquely to genes using featureCounts
# -a: Annotation file (GTF/GFF) containing gene features (or exons with gene_id attribute)
# -o: Output file for counts
# By default, featureCounts counts uniquely mapped reads and aggregates to gene_id from exon features.
featureCounts -a "${ENSEMBL_GTF}" \
              -o "${OUTPUT_COUNTS}" \
              "${INPUT_BAM}"

Normalized Counts: Read counts were normalized to counts per million (CPM) and Transcripts per Million (TPM) by dividing each gene-count by total reads mapped (uniquely mapped to exonic regions) per million and using ENSEMBL annotated read-lengths

Ensembl vN/A

$ Bash example

# --- Installation (commented out) ---
# It is recommended to use a virtual environment or conda for package management.
# conda create -n normalization_env python=3.9 pandas gtfparse
# conda activate normalization_env
# pip install gtfparse pandas

# --- Reference Data Setup ---
# Download Ensembl GTF annotation file (e.g., Homo sapiens GRCh38, release 109)
# wget http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
# gunzip Homo_sapiens.GRCh38.109.gtf.gz
GTF_FILE="Homo_sapiens.GRCh38.109.gtf"
GENE_LENGTHS_FILE="ensembl_gene_lengths_GRCh38.txt"

# Python script to parse GTF and calculate gene lengths (sum of exon lengths per gene)
# This script should be saved as parse_gtf_for_gene_lengths.py
cat << 'EOF' > parse_gtf_for_gene_lengths.py
import gtfparse
import pandas as pd
import sys

GTF_FILE = sys.argv[1]
OUTPUT_GENE_LENGTHS_FILE = sys.argv[2]

# Load GTF
gtf_df = gtfparse.read_gtf(GTF_FILE)

# Filter for exons and calculate length
exons_df = gtf_df[gtf_df['feature'] == 'exon']
exons_df['length'] = exons_df['end'] - exons_df['start'] + 1

# Sum lengths of all exons associated with each gene_id.
# Note: This is a common approximation for 'gene length' in TPM calculations.
# More sophisticated methods might consider the union of exonic regions or effective lengths.
gene_lengths = exons_df.groupby('gene_id')['length'].sum().reset_index()
gene_lengths.columns = ['gene_id', 'gene_length']

gene_lengths.to_csv(OUTPUT_GENE_LENGTHS_FILE, sep='\t', index=False, header=False)
print(f"Gene lengths saved to {OUTPUT_GENE_LENGTHS_FILE}")
EOF

python parse_gtf_for_gene_lengths.py "$GTF_FILE" "$GENE_LENGTHS_FILE"

# --- Simulate Input Data (replace with actual pipeline outputs) ---
# Assuming 'gene_counts.txt' is the output from a tool like featureCounts or HTSeq-count.
# It should contain gene IDs and raw read counts.
# Example content:
# ENSG00000000003\t1000
# ENSG00000000005\t500
# ENSG00000000419\t2000
# ...

# Create a dummy gene_counts.txt for demonstration
echo -e "ENSG00000000003\t1000" > gene_counts.txt
echo -e "ENSG00000000005\t500" >> gene_counts.txt
echo -e "ENSG00000000419\t2000" >> gene_counts.txt

# Assuming 'gene_counts.txt.summary' is the summary file from featureCounts
# It contains total assigned reads, which are 'uniquely mapped to exonic regions'.
# Create a dummy summary file for demonstration
echo -e "Assigned\t10000000" > gene_counts.txt.summary
echo -e "Unassigned_NoFeatures\t500000" >> gene_counts.txt.summary
echo -e "Unassigned_Ambiguity\t100000" >> gene_counts.txt.summary

TOTAL_MAPPED_READS_FILE="total_exonic_mapped_reads.txt"
# Extract total uniquely mapped reads to exonic regions from the summary file
TOTAL_MAPPED_READS=$(awk '$1 == "Assigned" {print $2}' gene_counts.txt.summary)
echo "$TOTAL_MAPPED_READS" > "$TOTAL_MAPPED_READS_FILE"

# --- Main Normalization Script ---
# This Python script performs CPM and TPM calculations as described.
# It should be saved as normalize_counts.py
cat << 'EOF' > normalize_counts.py
import pandas as pd
import sys

GENE_COUNTS_FILE = sys.argv[1]
GENE_LENGTHS_FILE = sys.argv[2]
TOTAL_MAPPED_READS_FILE = sys.argv[3]
OUTPUT_FILE = sys.argv[4]

# Load gene counts
counts_df = pd.read_csv(GENE_COUNTS_FILE, sep='\t', header=None, names=['gene_id', 'raw_count'])

# Load gene lengths from Ensembl annotation
lengths_df = pd.read_csv(GENE_LENGTHS_FILE, sep='\t', header=None, names=['gene_id', 'gene_length'])

# Load total uniquely mapped reads to exonic regions
with open(TOTAL_MAPPED_READS_FILE, 'r') as f:
    total_mapped_reads = int(f.read().strip())

# Merge counts and lengths
merged_df = pd.merge(counts_df, lengths_df, on='gene_id', how='left')

# Calculate CPM: (gene_count / total_mapped_reads) * 1,000,000
merged_df['CPM'] = (merged_df['raw_count'] / total_mapped_reads) * 1_000_000

# Prepare for TPM calculation: only consider genes with length information
merged_df_tpm = merged_df.dropna(subset=['gene_length']).copy()
merged_df_tpm['gene_length'] = merged_df_tpm['gene_length'].astype(int)

# Calculate RPK (Reads Per Kilobase): raw_count / (gene_length / 1000)
merged_df_tpm['RPK'] = merged_df_tpm['raw_count'] / (merged_df_tpm['gene_length'] / 1000)

# Calculate scaling factor: sum of all RPKs / 1,000,000
scaling_factor = merged_df_tpm['RPK'].sum() / 1_000_000

# Calculate TPM: RPK / scaling_factor
merged_df_tpm['TPM'] = merged_df_tpm['RPK'] / scaling_factor

# Merge TPM back into the main dataframe, filling NaN for genes without length (if any)
final_df = pd.merge(merged_df, merged_df_tpm[['gene_id', 'TPM']], on='gene_id', how='left')

# Output results
final_df[['gene_id', 'raw_count', 'CPM', 'TPM']].to_csv(OUTPUT_FILE, sep='\t', index=False)

print(f"Normalized counts saved to {OUTPUT_FILE}")
EOF

python normalize_counts.py gene_counts.txt "$GENE_LENGTHS_FILE" "$TOTAL_MAPPED_READS_FILE" normalized_counts.tsv

Tools Used

RNA-seq

Raw Source Text

Real time Analysis (RTA):  HiSeq 4000 - RTA v2.73, HCS v3.3.52
Base-calling: bcl2fastq v2.17.1.14
Adapter trimming: RNA-seq reads were trimmed for SMARTer and Illumina  adapter sequences using skewer-v0.1.125
Alignment: Reads in gzipped fastq format were aligned using Hisat2 version-2.0.0-beta with default parameters to a modified reference genome comprising the human genome, Homo sapiens GRCh37 - human_g1k_v37, and fasta sequences for ERCC spike-ins
Mark Duplicates: Duplicate reads were marked using MarkDuplicates.jar implemented in Picard tools v1.92
Read counts (abundance measurement): Reads mapping uniquely to genes annotated in ENSEMBL release 76 were counted using featureCounts implemented in subread-v1.5.0
Normalized Counts: Read counts were normalized to counts per million (CPM) and Transcripts per Million (TPM) by dividing each gene-count by total reads mapped (uniquely mapped to exonic regions) per million and using ENSEMBL annotated read-lengths
Genome_build: GRCh37
Supplementary_files_format_and_content: Count_table.txt: raw read counts for all 63677 ENSEMBL annotated genes
Supplementary_files_format_and_content: CPM_table.txt:  Counts per million normalized counts all 63677 ENSEMBL annotated genes
Supplementary_files_format_and_content: TPM_table.txt:  Transcripts per million (gene-length and sequencing depth normalized counts) all 63677 ENSEMBL annotated genes
Supplementary_files_format_and_content: Text

← Back to Analysis