GSE63816 Processing Pipeline
RNA-Seq
code_examples
6 steps
Publication
Integrative RNA-omics Discovers GNAS Alternative Splicing as a Phenotypic Driver of Splicing Factor-Mutant Neoplasms.Cancer discovery (2022) — PMID 34620690
Dataset
GSE63816Aberrant splicing of U12-type introns is the hallmark of ZRSR2 mutant myelodysplastic syndrome
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
Mapped the sequenced reads to the reference transcript in the database obtained from Refseq, Ensemble and UCSC known genes using bowtie.
$ Bash example
# Install Bowtie (if not already installed) # conda install -c bioconda bowtie # Placeholder for combined transcript reference (e.g., from Refseq, Ensembl, UCSC known genes for Homo sapiens GRCh38) # This file would contain all transcript sequences concatenated. # For example, you might download cDNA sequences from Ensembl and RefSeq and combine them. # wget ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz # gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz # mv Homo_sapiens.GRCh38.cdna.all.fa combined_transcripts.fa # Build the Bowtie index for the combined transcript reference # bowtie-build combined_transcripts.fa transcript_index # Align sequenced reads to the transcript index using Bowtie # -q: Input reads are in FASTQ format # -p 8: Use 8 threads for alignment # --best --strata: Report only alignments that fall into the best stratum (fewest mismatches) and among those, report the ones that are 'best' (lowest sum of quality values at mismatched positions) # -v 2: Allow up to 2 mismatches in the alignment # transcript_index: The basename of the Bowtie index files # reads.fastq: Placeholder for the input sequenced reads file # aligned_reads.sam: Placeholder for the output SAM file containing alignments bowtie -q -p 8 --best --strata -v 2 transcript_index reads.fastq aligned_reads.sam
-
2
Unmapped or poorly mapped reads were realigned to human reference genome (hg19) with the Blat software.
$ Bash example
# Install Blat (if not already installed) # conda install -c bioconda blat # Define reference genome path HG19_REF="/path/to/genomes/hg19/hg19.fa" # Placeholder for hg19 reference genome # Define input reads file (unmapped or poorly mapped reads) INPUT_READS="unmapped_reads.fasta" # Placeholder for input reads (e.g., converted from FASTQ if necessary, as Blat primarily uses FASTA) # Define output PSL file OUTPUT_PSL="realigned_reads.psl" # Realign reads to hg19 using Blat # Blat aligns DNA sequences (reads) to a DNA database (reference genome). # The output is in PSL (BLAT alignment format). blat "${HG19_REF}" "${INPUT_READS}" "${OUTPUT_PSL}" -
3
The mapped reads were organized into five different libraries 1) exon read library, 2) intron read library, 3) exon-intron junction read library, 4) exon-exon junction read library, and 5) intergenic read library.
Custom Read Categorization Script (using samtools and bedtools) (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Define input files and reference annotations ALIGNED_BAM="aligned_reads.bam" # Input BAM file from alignment OUTPUT_DIR="read_libraries" mkdir -p "${OUTPUT_DIR}" # --- Reference Datasets (placeholders for latest GRCh38/GENCODE) --- # GENCODE GTF for gene annotations (exons, introns, genes) GENOME_GTF="references/gencode.v44.annotation.gtf" # Reference genome FASTA for intergenic region definition (optional, but good for completeness) GENOME_FASTA="references/GRCh38.primary_assembly.genome.fa" # Chromosome sizes file for defining intergenic regions (genome complement) GENOME_CHROM_SIZES="references/GRCh38.chrom.sizes" # # Installation of core tools (samtools, bedtools, python with pysam) # # conda create -n read_categorization python=3.9 # # conda activate read_categorization # # conda install -c bioconda samtools bedtools pysam # # Download reference datasets (example commands) # # mkdir -p references # # wget -P references/ ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz # # gunzip "${GENOME_GTF}.gz" # # wget -P references/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz # # gunzip "${GENOME_FASTA}.gz" # # wget -P references/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes # --- Conceptual Custom Python Script for Read Categorization --- # This script would parse the GTF to define genomic regions (exons, introns, intergenic). # It would then iterate through the input BAM file, analyze each read's alignment # and CIGAR string, and assign it to one of the five categories. # Finally, it would write the reads for each category into separate BAM files. # The logic for identifying junction reads (exon-intron, exon-exon) is complex # and typically involves parsing CIGAR strings and comparing with known splice sites # or gene models. # Execute the custom categorization script python3 categorize_mapped_reads.py \ --input_bam "${ALIGNED_BAM}" \ --gtf "${GENOME_GTF}" \ --output_dir "${OUTPUT_DIR}" \ --genome_fasta "${GENOME_FASTA}" \ --chrom_sizes "${GENOME_CHROM_SIZES}" # Expected output files in ${OUTPUT_DIR}: # exon_read_library.bam # intron_read_library.bam # exon_intron_junction_read_library.bam # exon_exon_junction_read_library.bam # intergenic_read_library.bam -
4
The gene classification was done using all known Refseq transcripts.
$ Bash example
# Install ncbi-datasets if not already installed # conda install -c bioconda ncbi-datasets-cli # Define output directory for RefSeq data OUTPUT_DIR="refseq_transcripts" mkdir -p "${OUTPUT_DIR}" # Download human RefSeq mRNA transcripts (GRCh38.p14, RefSeq accession GCF_000001405.40) # This command uses the NCBI Datasets command-line tool to retrieve RefSeq data. # While RefSeq itself is a database, this tool provides programmatic access to its content, # which is a prerequisite for "gene classification using RefSeq transcripts." # The actual classification step would typically involve aligning query sequences # against these downloaded transcripts using a separate bioinformatics tool (e.g., STAR, Salmon, BLAST). ncbi-datasets download genome accession GCF_000001405.40 \ --assembly-source RefSeq \ --include rna \ --filename "${OUTPUT_DIR}/human_refseq_rna.zip" # Unzip the downloaded data to make transcripts accessible unzip "${OUTPUT_DIR}/human_refseq_rna.zip" -d "${OUTPUT_DIR}" -
5
Differential splicing analysis.
$ Bash example
# Install rMATS (example using conda) # conda create -n rmats_env python=3.8 # conda activate rmats_env # conda install -c bioconda rmats-turbo # Define input files and parameters # Replace with actual paths to your BAM files and GTF annotation CONTROL_BAM_FILES="control_replicate1.bam,control_replicate2.bam" # Comma-separated list of BAM files for condition 1 TREATED_BAM_FILES="treated_replicate1.bam,treated_replicate2.bam" # Comma-separated list of BAM files for condition 2 GTF_FILE="gencode.v38.annotation.gtf" # Example: GENCODE human hg38 annotation file OUTPUT_DIR="rmats_output" READ_TYPE="paired" # or "single" depending on your sequencing data LIB_TYPE="fr-firststrand" # or "fr-secondstrand", "fr-unstranded" - depends on your library preparation # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run rMATS for differential splicing analysis rmats.py \ --b1 "${CONTROL_BAM_FILES}" \ --b2 "${TREATED_BAM_FILES}" \ --gtf "${GTF_FILE}" \ --od "${OUTPUT_DIR}" \ --tmp "${OUTPUT_DIR}/tmp" \ --variable-read-length \ --${READ_TYPE} \ --libType "${LIB_TYPE}" \ --nthread 8 # Number of threads to use -
6
Classification of introns.
$ Bash example
# conda install -c bioconda rmats-turbo # Example: Classify introns by detecting differential alternative splicing events # This assumes you have two groups of samples (e.g., control and treated) # and aligned RNA-seq BAM files. rMATS classifies introns into event types like # Skipped Exon (SE), Retained Intron (RI), Alternative 5' Splice Site (A5SS), # Alternative 3' Splice Site (A3SS), and Mutually Exclusive Exons (MXE). # Define input BAM files (replace with actual comma-separated paths for replicates) # Example: control_rep1.bam,control_rep2.bam BAM_FILES_GROUP1="path/to/control_sample1.bam,path/to/control_sample2.bam" BAM_FILES_GROUP2="path/to/treated_sample1.bam,path/to/treated_sample2.bam" # Define GTF annotation file (replace with actual path or download if needed) # This GTF should correspond to the genome assembly used for alignment. # Example: Ensembl Homo_sapiens.GRCh38.109.gtf GTF_FILE="Homo_sapiens.GRCh38.109.gtf" # Placeholder: Download from Ensembl or similar source # Define output directory for rMATS results OUTPUT_DIR="rmats_intron_classification_results" # Define read length (adjust based on your sequencing data) READ_LENGTH=100 # Define number of threads for parallel processing NUM_THREADS=8 # Define read type: 'paired' for paired-end reads, 'single' for single-end reads READ_TYPE="paired" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Run rMATS-turbo to classify and quantify alternative splicing events involving introns rmats.py \ --b1 "${BAM_FILES_GROUP1}" \ --b2 "${BAM_FILES_GROUP2}" \ --gtf "${GTF_FILE}" \ -t "${READ_TYPE}" \ --readLength "${READ_LENGTH}" \ --nthread "${NUM_THREADS}" \ --tmp "${OUTPUT_DIR}/tmp" \ -o "${OUTPUT_DIR}" echo "rMATS analysis complete. Results are in ${OUTPUT_DIR}"
Raw Source Text
Mapped the sequenced reads to the reference transcript in the database obtained from Refseq, Ensemble and UCSC known genes using bowtie. Unmapped or poorly mapped reads were realigned to human reference genome (hg19) with the Blat software. The mapped reads were organized into five different libraries 1) exon read library, 2) intron read library, 3) exon-intron junction read library, 4) exon-exon junction read library, and 5) intergenic read library. The gene classification was done using all known Refseq transcripts. Differential splicing analysis. Classification of introns. Genome_build: hg19 Supplementary_files_format_and_content: Normalized bigwig for each sample