GSE63816 Processing Pipeline

RNA-Seq code_examples 6 steps

Publication

Integrative RNA-omics Discovers GNAS Alternative Splicing as a Phenotypic Driver of Splicing Factor-Mutant Neoplasms.

Cancer discovery (2022) — PMID 34620690

Dataset

Aberrant splicing of U12-type introns is the hallmark of ZRSR2 mutant myelodysplastic syndrome

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Mapped the sequenced reads to the reference transcript in the database obtained from Refseq, Ensemble and UCSC known genes using bowtie.

Bowtie v1.3.1 GitHub

$ Bash example

# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie

# Placeholder for combined transcript reference (e.g., from Refseq, Ensembl, UCSC known genes for Homo sapiens GRCh38)
# This file would contain all transcript sequences concatenated.
# For example, you might download cDNA sequences from Ensembl and RefSeq and combine them.
# wget ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
# gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
# mv Homo_sapiens.GRCh38.cdna.all.fa combined_transcripts.fa

# Build the Bowtie index for the combined transcript reference
# bowtie-build combined_transcripts.fa transcript_index

# Align sequenced reads to the transcript index using Bowtie
# -q: Input reads are in FASTQ format
# -p 8: Use 8 threads for alignment
# --best --strata: Report only alignments that fall into the best stratum (fewest mismatches) and among those, report the ones that are 'best' (lowest sum of quality values at mismatched positions)
# -v 2: Allow up to 2 mismatches in the alignment
# transcript_index: The basename of the Bowtie index files
# reads.fastq: Placeholder for the input sequenced reads file
# aligned_reads.sam: Placeholder for the output SAM file containing alignments
bowtie -q -p 8 --best --strata -v 2 transcript_index reads.fastq aligned_reads.sam

View on GitHub

Unmapped or poorly mapped reads were realigned to human reference genome (hg19) with the Blat software.

Blat vv36 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Blat (if not already installed)
# conda install -c bioconda blat

# Define reference genome path
HG19_REF="/path/to/genomes/hg19/hg19.fa" # Placeholder for hg19 reference genome

# Define input reads file (unmapped or poorly mapped reads)
INPUT_READS="unmapped_reads.fasta" # Placeholder for input reads (e.g., converted from FASTQ if necessary, as Blat primarily uses FASTA)

# Define output PSL file
OUTPUT_PSL="realigned_reads.psl"

# Realign reads to hg19 using Blat
# Blat aligns DNA sequences (reads) to a DNA database (reference genome).
# The output is in PSL (BLAT alignment format).
blat "${HG19_REF}" "${INPUT_READS}" "${OUTPUT_PSL}"

View on GitHub

The mapped reads were organized into five different libraries 1) exon read library, 2) intron read library, 3) exon-intron junction read library, 4) exon-exon junction read library, and 5) intergenic read library.

Custom Read Categorization Script (using samtools and bedtools) (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Define input files and reference annotations
ALIGNED_BAM="aligned_reads.bam" # Input BAM file from alignment
OUTPUT_DIR="read_libraries"
mkdir -p "${OUTPUT_DIR}"

# --- Reference Datasets (placeholders for latest GRCh38/GENCODE) ---
# GENCODE GTF for gene annotations (exons, introns, genes)
GENOME_GTF="references/gencode.v44.annotation.gtf"
# Reference genome FASTA for intergenic region definition (optional, but good for completeness)
GENOME_FASTA="references/GRCh38.primary_assembly.genome.fa"
# Chromosome sizes file for defining intergenic regions (genome complement)
GENOME_CHROM_SIZES="references/GRCh38.chrom.sizes"

# # Installation of core tools (samtools, bedtools, python with pysam)
# # conda create -n read_categorization python=3.9
# # conda activate read_categorization
# # conda install -c bioconda samtools bedtools pysam

# # Download reference datasets (example commands)
# # mkdir -p references
# # wget -P references/ ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz
# # gunzip "${GENOME_GTF}.gz"
# # wget -P references/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# # gunzip "${GENOME_FASTA}.gz"
# # wget -P references/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes

# --- Conceptual Custom Python Script for Read Categorization ---
# This script would parse the GTF to define genomic regions (exons, introns, intergenic).
# It would then iterate through the input BAM file, analyze each read's alignment
# and CIGAR string, and assign it to one of the five categories.
# Finally, it would write the reads for each category into separate BAM files.
# The logic for identifying junction reads (exon-intron, exon-exon) is complex
# and typically involves parsing CIGAR strings and comparing with known splice sites
# or gene models.

# Execute the custom categorization script
python3 categorize_mapped_reads.py \
    --input_bam "${ALIGNED_BAM}" \
    --gtf "${GENOME_GTF}" \
    --output_dir "${OUTPUT_DIR}" \
    --genome_fasta "${GENOME_FASTA}" \
    --chrom_sizes "${GENOME_CHROM_SIZES}"

# Expected output files in ${OUTPUT_DIR}:
# exon_read_library.bam
# intron_read_library.bam
# exon_intron_junction_read_library.bam
# exon_exon_junction_read_library.bam
# intergenic_read_library.bam

View on GitHub

The gene classification was done using all known Refseq transcripts.

RefSeq vNot specified GitHub

$ Bash example

# Install ncbi-datasets if not already installed
# conda install -c bioconda ncbi-datasets-cli

# Define output directory for RefSeq data
OUTPUT_DIR="refseq_transcripts"
mkdir -p "${OUTPUT_DIR}"

# Download human RefSeq mRNA transcripts (GRCh38.p14, RefSeq accession GCF_000001405.40)
# This command uses the NCBI Datasets command-line tool to retrieve RefSeq data.
# While RefSeq itself is a database, this tool provides programmatic access to its content,
# which is a prerequisite for "gene classification using RefSeq transcripts."
# The actual classification step would typically involve aligning query sequences
# against these downloaded transcripts using a separate bioinformatics tool (e.g., STAR, Salmon, BLAST).
ncbi-datasets download genome accession GCF_000001405.40 \
    --assembly-source RefSeq \
    --include rna \
    --filename "${OUTPUT_DIR}/human_refseq_rna.zip"

# Unzip the downloaded data to make transcripts accessible
unzip "${OUTPUT_DIR}/human_refseq_rna.zip" -d "${OUTPUT_DIR}"

View on GitHub

Differential splicing analysis.

rMATS (Inferred with models/gemini-2.5-flash) v4.1.2 GitHub

$ Bash example

# Install rMATS (example using conda)
# conda create -n rmats_env python=3.8
# conda activate rmats_env
# conda install -c bioconda rmats-turbo

# Define input files and parameters
# Replace with actual paths to your BAM files and GTF annotation
CONTROL_BAM_FILES="control_replicate1.bam,control_replicate2.bam" # Comma-separated list of BAM files for condition 1
TREATED_BAM_FILES="treated_replicate1.bam,treated_replicate2.bam" # Comma-separated list of BAM files for condition 2
GTF_FILE="gencode.v38.annotation.gtf" # Example: GENCODE human hg38 annotation file
OUTPUT_DIR="rmats_output"
READ_TYPE="paired" # or "single" depending on your sequencing data
LIB_TYPE="fr-firststrand" # or "fr-secondstrand", "fr-unstranded" - depends on your library preparation

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run rMATS for differential splicing analysis
rmats.py \
    --b1 "${CONTROL_BAM_FILES}" \
    --b2 "${TREATED_BAM_FILES}" \
    --gtf "${GTF_FILE}" \
    --od "${OUTPUT_DIR}" \
    --tmp "${OUTPUT_DIR}/tmp" \
    --variable-read-length \
    --${READ_TYPE} \
    --libType "${LIB_TYPE}" \
    --nthread 8 # Number of threads to use

View on GitHub

Classification of introns.

rMATS-turbo (Inferred with models/gemini-2.5-flash) v4.1.2 GitHub

$ Bash example

# conda install -c bioconda rmats-turbo

# Example: Classify introns by detecting differential alternative splicing events
# This assumes you have two groups of samples (e.g., control and treated)
# and aligned RNA-seq BAM files. rMATS classifies introns into event types like
# Skipped Exon (SE), Retained Intron (RI), Alternative 5' Splice Site (A5SS),
# Alternative 3' Splice Site (A3SS), and Mutually Exclusive Exons (MXE).

# Define input BAM files (replace with actual comma-separated paths for replicates)
# Example: control_rep1.bam,control_rep2.bam
BAM_FILES_GROUP1="path/to/control_sample1.bam,path/to/control_sample2.bam"
BAM_FILES_GROUP2="path/to/treated_sample1.bam,path/to/treated_sample2.bam"

# Define GTF annotation file (replace with actual path or download if needed)
# This GTF should correspond to the genome assembly used for alignment.
# Example: Ensembl Homo_sapiens.GRCh38.109.gtf
GTF_FILE="Homo_sapiens.GRCh38.109.gtf" # Placeholder: Download from Ensembl or similar source

# Define output directory for rMATS results
OUTPUT_DIR="rmats_intron_classification_results"

# Define read length (adjust based on your sequencing data)
READ_LENGTH=100 

# Define number of threads for parallel processing
NUM_THREADS=8

# Define read type: 'paired' for paired-end reads, 'single' for single-end reads
READ_TYPE="paired"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run rMATS-turbo to classify and quantify alternative splicing events involving introns
rmats.py \
    --b1 "${BAM_FILES_GROUP1}" \
    --b2 "${BAM_FILES_GROUP2}" \
    --gtf "${GTF_FILE}" \
    -t "${READ_TYPE}" \
    --readLength "${READ_LENGTH}" \
    --nthread "${NUM_THREADS}" \
    --tmp "${OUTPUT_DIR}/tmp" \
    -o "${OUTPUT_DIR}"

echo "rMATS analysis complete. Results are in ${OUTPUT_DIR}"

View on GitHub

Raw Source Text

Mapped the sequenced reads to the reference transcript in the database obtained from Refseq, Ensemble and UCSC known genes using bowtie.
Unmapped or poorly mapped reads were realigned to human reference genome (hg19) with the Blat software.
The mapped reads were organized into five different libraries 1) exon read library, 2) intron read library, 3) exon-intron junction read library, 4) exon-exon junction read library, and 5) intergenic read library. The gene classification was done using all known Refseq transcripts.
Differential splicing analysis.
Classification of introns.
Genome_build: hg19
Supplementary_files_format_and_content: Normalized bigwig for each sample

← Back to Analysis