GSE114922 Processing Pipeline
Publication
Integrative RNA-omics Discovers GNAS Alternative Splicing as a Phenotypic Driver of Splicing Factor-Mutant Neoplasms.Cancer discovery (2022) — PMID 34620690
Dataset
GSE114922RNA sequencing of bone marrow CD34+ hematopoietic stem and progenitor cells from patients with myelodysplastic syndrome and healthy controls
Processing Steps
Generate Jupyter Notebook-
1
Real time Analysis (RTA): HiSeq 4000 - RTA v2.73, HCS v3.3.52
RTA vv2.73$ Bash example
bash # Real Time Analysis (RTA) is proprietary software embedded in Illumina sequencing instruments. # It performs base calling and quality scoring during the sequencing run. # This step is typically controlled via the instrument's graphical user interface (GUI) # and does not involve direct command-line execution by a user in a typical bioinformatics pipeline. # The versions specified are: # RTA: v2.73 # HCS (HiSeq Control Software): v3.3.52 # No direct bash command for execution is available as it's an instrument-level process. # The output of RTA (e.g., BCL files) is then used as input for downstream bioinformatics tools.
-
2
Base-calling: bcl2fastq v2.17.1.14
$ Bash example
# Install bcl2fastq (example using conda) # conda create -n bcl2fastq_env bcl2fastq=2.17.1.14 -c bioconda -c conda-forge # conda activate bcl2fastq_env # Define input and output directories INPUT_RUN_FOLDER="/path/to/illumina/run_folder" OUTPUT_FASTQ_DIR="/path/to/output/fastq_files" SAMPLE_SHEET="/path/to/sample_sheet.csv" # Required for demultiplexing # Create output directory if it doesn't exist mkdir -p "${OUTPUT_FASTQ_DIR}" # Execute bcl2fastq # This command converts BCL files from the run folder into FASTQ files, # performing demultiplexing based on the provided sample sheet. # Common parameters include: # --runfolder-dir: Path to the Illumina run folder containing BCL files. # --output-dir: Path where the generated FASTQ files will be stored. # --sample-sheet: Path to the sample sheet CSV file for demultiplexing. # --no-lane-splitting: Prevents splitting output by lane (often preferred). # -r, -w, -p: Number of threads for reading, writing, and processing (example values). bcl2fastq --runfolder-dir "${INPUT_RUN_FOLDER}" \ --output-dir "${OUTPUT_FASTQ_DIR}" \ --sample-sheet "${SAMPLE_SHEET}" \ --no-lane-splitting \ --minimum-trimmed-read-length 0 \ --mask-short-adapter-reads 0 \ -r 8 -w 8 -p 8 -
3
Adapter trimming: RNA-seq reads were trimmed for SMARTer and Illumina adapter sequences using skewer-v0.1.125
$ Bash example
# Install skewer (example using conda) # conda install -c bioconda skewer=0.1.125 # Define input and output files (assuming paired-end RNA-seq) INPUT_R1="input_R1.fastq.gz" INPUT_R2="input_R2.fastq.gz" OUTPUT_PREFIX="trimmed_reads" ADAPTER_FILE="smarter_illumina_adapters.fasta" # Placeholder for a FASTA file containing SMARTer and Illumina adapter sequences # Example content for smarter_illumina_adapters.fasta: # >SMARTer_Adapter_Sequence_1 # AAGCAGTGGTATCAACGCAGAGTACATGGGG # >Illumina_Universal_Adapter_Sequence_1 # AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT # (Add all relevant SMARTer and Illumina adapter sequences here) # Run skewer for adapter trimming # -x: Path to a FASTA file containing adapter sequences to be trimmed # -l: Minimum read length after trimming (e.g., 20 bp, common for RNA-seq) # -q: Minimum quality score for quality trimming (e.g., 20, common) # -m pe: Paired-end mode (use '-m se' for single-end if applicable) # -o: Output file prefix. Skewer will append '-trimmed-pair1.fastq.gz' and '-trimmed-pair2.fastq.gz' for paired-end. skewer -x "${ADAPTER_FILE}" -l 20 -q 20 -m pe -o "${OUTPUT_PREFIX}" "${INPUT_R1}" "${INPUT_R2}" -
4
Alignment: Reads in gzipped fastq format were aligned using Hisat2 version-2.0.0-beta with default parameters to a modified reference genome comprising the human genome, Homo sapiens GRCh37 - human_g1k_v37, and fasta sequences for ERCC spike-ins
$ Bash example
# Install HISAT2 # conda install -c bioconda hisat2=2.0.0 samtools # --- Reference Preparation --- # Placeholder for human reference genome GRCh37 (human_g1k_v37) FASTA file. # This file can be obtained from sources like the GATK resource bundle or UCSC. # Example: wget -O human_g1k_v37.fasta.gz "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz" # gunzip human_g1k_v37.fasta.gz # Placeholder for ERCC spike-in sequences FASTA file. # This file can be obtained from Thermo Fisher Scientific. # Example: wget -O ERCC_Controls_Analysis.zip "https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC_Controls_Analysis.zip" # unzip ERCC_Controls_Analysis.zip # mv ERCC92.fasta ercc_spikeins.fasta # Combine human and ERCC reference sequences into a single FASTA file cat human_g1k_v37.fasta ercc_spikeins.fasta > combined_reference.fasta # Build HISAT2 index for the combined reference genome # -p: number of threads to use for index building (e.g., 8) hisat2-build -p 8 combined_reference.fasta combined_reference_index # --- Alignment --- # Define input FASTQ files (assuming paired-end reads in gzipped format) # Replace 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz' with actual input file names READS_R1="sample_R1.fastq.gz" READS_R2="sample_R2.fastq.gz" OUTPUT_SAM="sample.sam" # Align reads using HISAT2 with default parameters to the modified reference genome # -x: basename of the HISAT2 index # -1: comma-separated list of files containing upstream mates (paired-end) # -2: comma-separated list of files containing downstream mates (paired-end) # -S: file to write SAM alignments to # --threads: number of threads to use for alignment (e.g., 8, common optimization for performance) hisat2 -x combined_reference_index -1 "${READS_R1}" -2 "${READS_R2}" -S "${OUTPUT_SAM}" --threads 8 # --- Post-alignment processing (optional, often separate pipeline steps) --- # Convert SAM to BAM format # samtools view -bS "${OUTPUT_SAM}" > "${OUTPUT_SAM%.sam}.bam" # Sort the BAM file by coordinate # samtools sort "${OUTPUT_SAM%.sam}.bam" -o "${OUTPUT_SAM%.sam}.sorted.bam" # Index the sorted BAM file # samtools index "${OUTPUT_SAM%.sam}.sorted.bam" # Remove intermediate SAM file # rm "${OUTPUT_SAM}" -
5
Mark Duplicates: Duplicate reads were marked using MarkDuplicates.jar implemented in Picard tools v1.92
$ Bash example
# Install Picard (if not already installed) # conda install -c bioconda picard # Example usage of MarkDuplicates.jar from Picard tools v1.92 # This command marks duplicate reads in an input BAM file. # Replace 'input.bam', 'output.bam', and 'markduplicates_metrics.txt' with your actual file names. # Replace '/path/to/picard.jar' with the actual path to your Picard JAR file. java -jar /path/to/picard.jar MarkDuplicates \ I=input.bam \ O=output.bam \ M=markduplicates_metrics.txt \ VALIDATION_STRINGENCY=SILENT -
6
Read counts (abundance measurement): Reads mapping uniquely to genes annotated in ENSEMBL release 76 were counted using featureCounts implemented in subread-v1.5.0
featureCounts v1.5.0$ Bash example
# Install subread (which includes featureCounts) # conda install -c bioconda subread=1.5.0 # Placeholder for ENSEMBL release 76 GTF file (e.g., for Homo sapiens GRCh38) # Ensembl release 76 for human corresponds to GRCh38. # Download from Ensembl archives: ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz ENSEMBL_GTF="Homo_sapiens.GRCh38.76.gtf" # Placeholder path to unzipped GTF # Input BAM file (aligned reads) INPUT_BAM="aligned_reads.bam" # Placeholder path to the alignment file # Output counts file OUTPUT_COUNTS="gene_counts.txt" # Count reads mapping uniquely to genes using featureCounts # -a: Annotation file (GTF/GFF) containing gene features (or exons with gene_id attribute) # -o: Output file for counts # By default, featureCounts counts uniquely mapped reads and aggregates to gene_id from exon features. featureCounts -a "${ENSEMBL_GTF}" \ -o "${OUTPUT_COUNTS}" \ "${INPUT_BAM}" -
7
Normalized Counts: Read counts were normalized to counts per million (CPM) and Transcripts per Million (TPM) by dividing each gene-count by total reads mapped (uniquely mapped to exonic regions) per million and using ENSEMBL annotated read-lengths
Ensembl vN/A$ Bash example
# --- Installation (commented out) --- # It is recommended to use a virtual environment or conda for package management. # conda create -n normalization_env python=3.9 pandas gtfparse # conda activate normalization_env # pip install gtfparse pandas # --- Reference Data Setup --- # Download Ensembl GTF annotation file (e.g., Homo sapiens GRCh38, release 109) # wget http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz # gunzip Homo_sapiens.GRCh38.109.gtf.gz GTF_FILE="Homo_sapiens.GRCh38.109.gtf" GENE_LENGTHS_FILE="ensembl_gene_lengths_GRCh38.txt" # Python script to parse GTF and calculate gene lengths (sum of exon lengths per gene) # This script should be saved as parse_gtf_for_gene_lengths.py cat << 'EOF' > parse_gtf_for_gene_lengths.py import gtfparse import pandas as pd import sys GTF_FILE = sys.argv[1] OUTPUT_GENE_LENGTHS_FILE = sys.argv[2] # Load GTF gtf_df = gtfparse.read_gtf(GTF_FILE) # Filter for exons and calculate length exons_df = gtf_df[gtf_df['feature'] == 'exon'] exons_df['length'] = exons_df['end'] - exons_df['start'] + 1 # Sum lengths of all exons associated with each gene_id. # Note: This is a common approximation for 'gene length' in TPM calculations. # More sophisticated methods might consider the union of exonic regions or effective lengths. gene_lengths = exons_df.groupby('gene_id')['length'].sum().reset_index() gene_lengths.columns = ['gene_id', 'gene_length'] gene_lengths.to_csv(OUTPUT_GENE_LENGTHS_FILE, sep='\t', index=False, header=False) print(f"Gene lengths saved to {OUTPUT_GENE_LENGTHS_FILE}") EOF python parse_gtf_for_gene_lengths.py "$GTF_FILE" "$GENE_LENGTHS_FILE" # --- Simulate Input Data (replace with actual pipeline outputs) --- # Assuming 'gene_counts.txt' is the output from a tool like featureCounts or HTSeq-count. # It should contain gene IDs and raw read counts. # Example content: # ENSG00000000003\t1000 # ENSG00000000005\t500 # ENSG00000000419\t2000 # ... # Create a dummy gene_counts.txt for demonstration echo -e "ENSG00000000003\t1000" > gene_counts.txt echo -e "ENSG00000000005\t500" >> gene_counts.txt echo -e "ENSG00000000419\t2000" >> gene_counts.txt # Assuming 'gene_counts.txt.summary' is the summary file from featureCounts # It contains total assigned reads, which are 'uniquely mapped to exonic regions'. # Create a dummy summary file for demonstration echo -e "Assigned\t10000000" > gene_counts.txt.summary echo -e "Unassigned_NoFeatures\t500000" >> gene_counts.txt.summary echo -e "Unassigned_Ambiguity\t100000" >> gene_counts.txt.summary TOTAL_MAPPED_READS_FILE="total_exonic_mapped_reads.txt" # Extract total uniquely mapped reads to exonic regions from the summary file TOTAL_MAPPED_READS=$(awk '$1 == "Assigned" {print $2}' gene_counts.txt.summary) echo "$TOTAL_MAPPED_READS" > "$TOTAL_MAPPED_READS_FILE" # --- Main Normalization Script --- # This Python script performs CPM and TPM calculations as described. # It should be saved as normalize_counts.py cat << 'EOF' > normalize_counts.py import pandas as pd import sys GENE_COUNTS_FILE = sys.argv[1] GENE_LENGTHS_FILE = sys.argv[2] TOTAL_MAPPED_READS_FILE = sys.argv[3] OUTPUT_FILE = sys.argv[4] # Load gene counts counts_df = pd.read_csv(GENE_COUNTS_FILE, sep='\t', header=None, names=['gene_id', 'raw_count']) # Load gene lengths from Ensembl annotation lengths_df = pd.read_csv(GENE_LENGTHS_FILE, sep='\t', header=None, names=['gene_id', 'gene_length']) # Load total uniquely mapped reads to exonic regions with open(TOTAL_MAPPED_READS_FILE, 'r') as f: total_mapped_reads = int(f.read().strip()) # Merge counts and lengths merged_df = pd.merge(counts_df, lengths_df, on='gene_id', how='left') # Calculate CPM: (gene_count / total_mapped_reads) * 1,000,000 merged_df['CPM'] = (merged_df['raw_count'] / total_mapped_reads) * 1_000_000 # Prepare for TPM calculation: only consider genes with length information merged_df_tpm = merged_df.dropna(subset=['gene_length']).copy() merged_df_tpm['gene_length'] = merged_df_tpm['gene_length'].astype(int) # Calculate RPK (Reads Per Kilobase): raw_count / (gene_length / 1000) merged_df_tpm['RPK'] = merged_df_tpm['raw_count'] / (merged_df_tpm['gene_length'] / 1000) # Calculate scaling factor: sum of all RPKs / 1,000,000 scaling_factor = merged_df_tpm['RPK'].sum() / 1_000_000 # Calculate TPM: RPK / scaling_factor merged_df_tpm['TPM'] = merged_df_tpm['RPK'] / scaling_factor # Merge TPM back into the main dataframe, filling NaN for genes without length (if any) final_df = pd.merge(merged_df, merged_df_tpm[['gene_id', 'TPM']], on='gene_id', how='left') # Output results final_df[['gene_id', 'raw_count', 'CPM', 'TPM']].to_csv(OUTPUT_FILE, sep='\t', index=False) print(f"Normalized counts saved to {OUTPUT_FILE}") EOF python normalize_counts.py gene_counts.txt "$GENE_LENGTHS_FILE" "$TOTAL_MAPPED_READS_FILE" normalized_counts.tsv
Tools Used
Raw Source Text
Real time Analysis (RTA): HiSeq 4000 - RTA v2.73, HCS v3.3.52 Base-calling: bcl2fastq v2.17.1.14 Adapter trimming: RNA-seq reads were trimmed for SMARTer and Illumina adapter sequences using skewer-v0.1.125 Alignment: Reads in gzipped fastq format were aligned using Hisat2 version-2.0.0-beta with default parameters to a modified reference genome comprising the human genome, Homo sapiens GRCh37 - human_g1k_v37, and fasta sequences for ERCC spike-ins Mark Duplicates: Duplicate reads were marked using MarkDuplicates.jar implemented in Picard tools v1.92 Read counts (abundance measurement): Reads mapping uniquely to genes annotated in ENSEMBL release 76 were counted using featureCounts implemented in subread-v1.5.0 Normalized Counts: Read counts were normalized to counts per million (CPM) and Transcripts per Million (TPM) by dividing each gene-count by total reads mapped (uniquely mapped to exonic regions) per million and using ENSEMBL annotated read-lengths Genome_build: GRCh37 Supplementary_files_format_and_content: Count_table.txt: raw read counts for all 63677 ENSEMBL annotated genes Supplementary_files_format_and_content: CPM_table.txt: Counts per million normalized counts all 63677 ENSEMBL annotated genes Supplementary_files_format_and_content: TPM_table.txt: Transcripts per million (gene-length and sequencing depth normalized counts) all 63677 ENSEMBL annotated genes Supplementary_files_format_and_content: Text