GSE114922 Processing Pipeline

RNA-Seq code_examples 7 steps

Publication

Integrative RNA-omics Discovers GNAS Alternative Splicing as a Phenotypic Driver of Splicing Factor-Mutant Neoplasms.

Cancer discovery (2022) — PMID 34620690

Dataset

GSE114922

RNA sequencing of bone marrow CD34+ hematopoietic stem and progenitor cells from patients with myelodysplastic syndrome and healthy controls

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Real time Analysis (RTA): HiSeq 4000 - RTA v2.73, HCS v3.3.52

    RTA vv2.73
    $ Bash example
    bash
    # Real Time Analysis (RTA) is proprietary software embedded in Illumina sequencing instruments.
    # It performs base calling and quality scoring during the sequencing run.
    # This step is typically controlled via the instrument's graphical user interface (GUI)
    # and does not involve direct command-line execution by a user in a typical bioinformatics pipeline.
    
    # The versions specified are:
    # RTA: v2.73
    # HCS (HiSeq Control Software): v3.3.52
    
    # No direct bash command for execution is available as it's an instrument-level process.
    # The output of RTA (e.g., BCL files) is then used as input for downstream bioinformatics tools.
    
  2. 2

    Base-calling: bcl2fastq v2.17.1.14

    bcl2fastq v2.17.1.14 GitHub
    $ Bash example
    # Install bcl2fastq (example using conda)
    # conda create -n bcl2fastq_env bcl2fastq=2.17.1.14 -c bioconda -c conda-forge
    # conda activate bcl2fastq_env
    
    # Define input and output directories
    INPUT_RUN_FOLDER="/path/to/illumina/run_folder"
    OUTPUT_FASTQ_DIR="/path/to/output/fastq_files"
    SAMPLE_SHEET="/path/to/sample_sheet.csv" # Required for demultiplexing
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_FASTQ_DIR}"
    
    # Execute bcl2fastq
    # This command converts BCL files from the run folder into FASTQ files,
    # performing demultiplexing based on the provided sample sheet.
    # Common parameters include:
    # --runfolder-dir: Path to the Illumina run folder containing BCL files.
    # --output-dir: Path where the generated FASTQ files will be stored.
    # --sample-sheet: Path to the sample sheet CSV file for demultiplexing.
    # --no-lane-splitting: Prevents splitting output by lane (often preferred).
    # -r, -w, -p: Number of threads for reading, writing, and processing (example values).
    bcl2fastq --runfolder-dir "${INPUT_RUN_FOLDER}" \
              --output-dir "${OUTPUT_FASTQ_DIR}" \
              --sample-sheet "${SAMPLE_SHEET}" \
              --no-lane-splitting \
              --minimum-trimmed-read-length 0 \
              --mask-short-adapter-reads 0 \
              -r 8 -w 8 -p 8
  3. 3

    Adapter trimming: RNA-seq reads were trimmed for SMARTer and Illumina adapter sequences using skewer-v0.1.125

    $ Bash example
    # Install skewer (example using conda)
    # conda install -c bioconda skewer=0.1.125
    
    # Define input and output files (assuming paired-end RNA-seq)
    INPUT_R1="input_R1.fastq.gz"
    INPUT_R2="input_R2.fastq.gz"
    OUTPUT_PREFIX="trimmed_reads"
    ADAPTER_FILE="smarter_illumina_adapters.fasta" # Placeholder for a FASTA file containing SMARTer and Illumina adapter sequences
    
    # Example content for smarter_illumina_adapters.fasta:
    # >SMARTer_Adapter_Sequence_1
    # AAGCAGTGGTATCAACGCAGAGTACATGGGG
    # >Illumina_Universal_Adapter_Sequence_1
    # AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT
    # (Add all relevant SMARTer and Illumina adapter sequences here)
    
    # Run skewer for adapter trimming
    # -x: Path to a FASTA file containing adapter sequences to be trimmed
    # -l: Minimum read length after trimming (e.g., 20 bp, common for RNA-seq)
    # -q: Minimum quality score for quality trimming (e.g., 20, common)
    # -m pe: Paired-end mode (use '-m se' for single-end if applicable)
    # -o: Output file prefix. Skewer will append '-trimmed-pair1.fastq.gz' and '-trimmed-pair2.fastq.gz' for paired-end.
    skewer -x "${ADAPTER_FILE}" -l 20 -q 20 -m pe -o "${OUTPUT_PREFIX}" "${INPUT_R1}" "${INPUT_R2}"
  4. 4

    Alignment: Reads in gzipped fastq format were aligned using Hisat2 version-2.0.0-beta with default parameters to a modified reference genome comprising the human genome, Homo sapiens GRCh37 - human_g1k_v37, and fasta sequences for ERCC spike-ins

    HISAT2 v2.0.0 GitHub
    $ Bash example
    # Install HISAT2
    # conda install -c bioconda hisat2=2.0.0 samtools
    
    # --- Reference Preparation ---
    # Placeholder for human reference genome GRCh37 (human_g1k_v37) FASTA file.
    # This file can be obtained from sources like the GATK resource bundle or UCSC.
    # Example: wget -O human_g1k_v37.fasta.gz "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz"
    # gunzip human_g1k_v37.fasta.gz
    
    # Placeholder for ERCC spike-in sequences FASTA file.
    # This file can be obtained from Thermo Fisher Scientific.
    # Example: wget -O ERCC_Controls_Analysis.zip "https://assets.thermofisher.com/TFS-Assets/LSG/manuals/ERCC_Controls_Analysis.zip"
    # unzip ERCC_Controls_Analysis.zip
    # mv ERCC92.fasta ercc_spikeins.fasta
    
    # Combine human and ERCC reference sequences into a single FASTA file
    cat human_g1k_v37.fasta ercc_spikeins.fasta > combined_reference.fasta
    
    # Build HISAT2 index for the combined reference genome
    # -p: number of threads to use for index building (e.g., 8)
    hisat2-build -p 8 combined_reference.fasta combined_reference_index
    
    # --- Alignment ---
    # Define input FASTQ files (assuming paired-end reads in gzipped format)
    # Replace 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz' with actual input file names
    READS_R1="sample_R1.fastq.gz"
    READS_R2="sample_R2.fastq.gz"
    OUTPUT_SAM="sample.sam"
    
    # Align reads using HISAT2 with default parameters to the modified reference genome
    # -x: basename of the HISAT2 index
    # -1: comma-separated list of files containing upstream mates (paired-end)
    # -2: comma-separated list of files containing downstream mates (paired-end)
    # -S: file to write SAM alignments to
    # --threads: number of threads to use for alignment (e.g., 8, common optimization for performance)
    hisat2 -x combined_reference_index -1 "${READS_R1}" -2 "${READS_R2}" -S "${OUTPUT_SAM}" --threads 8
    
    # --- Post-alignment processing (optional, often separate pipeline steps) ---
    # Convert SAM to BAM format
    # samtools view -bS "${OUTPUT_SAM}" > "${OUTPUT_SAM%.sam}.bam"
    
    # Sort the BAM file by coordinate
    # samtools sort "${OUTPUT_SAM%.sam}.bam" -o "${OUTPUT_SAM%.sam}.sorted.bam"
    
    # Index the sorted BAM file
    # samtools index "${OUTPUT_SAM%.sam}.sorted.bam"
    
    # Remove intermediate SAM file
    # rm "${OUTPUT_SAM}"
  5. 5

    Mark Duplicates: Duplicate reads were marked using MarkDuplicates.jar implemented in Picard tools v1.92

    Picard v1.92 GitHub
    $ Bash example
    # Install Picard (if not already installed)
    # conda install -c bioconda picard
    
    # Example usage of MarkDuplicates.jar from Picard tools v1.92
    # This command marks duplicate reads in an input BAM file.
    # Replace 'input.bam', 'output.bam', and 'markduplicates_metrics.txt' with your actual file names.
    # Replace '/path/to/picard.jar' with the actual path to your Picard JAR file.
    
    java -jar /path/to/picard.jar MarkDuplicates \
        I=input.bam \
        O=output.bam \
        M=markduplicates_metrics.txt \
        VALIDATION_STRINGENCY=SILENT
  6. 6

    Read counts (abundance measurement): Reads mapping uniquely to genes annotated in ENSEMBL release 76 were counted using featureCounts implemented in subread-v1.5.0

    featureCounts v1.5.0
    $ Bash example
    # Install subread (which includes featureCounts)
    # conda install -c bioconda subread=1.5.0
    
    # Placeholder for ENSEMBL release 76 GTF file (e.g., for Homo sapiens GRCh38)
    # Ensembl release 76 for human corresponds to GRCh38.
    # Download from Ensembl archives: ftp://ftp.ensembl.org/pub/release-76/gtf/homo_sapiens/Homo_sapiens.GRCh38.76.gtf.gz
    ENSEMBL_GTF="Homo_sapiens.GRCh38.76.gtf" # Placeholder path to unzipped GTF
    
    # Input BAM file (aligned reads)
    INPUT_BAM="aligned_reads.bam" # Placeholder path to the alignment file
    
    # Output counts file
    OUTPUT_COUNTS="gene_counts.txt"
    
    # Count reads mapping uniquely to genes using featureCounts
    # -a: Annotation file (GTF/GFF) containing gene features (or exons with gene_id attribute)
    # -o: Output file for counts
    # By default, featureCounts counts uniquely mapped reads and aggregates to gene_id from exon features.
    featureCounts -a "${ENSEMBL_GTF}" \
                  -o "${OUTPUT_COUNTS}" \
                  "${INPUT_BAM}"
  7. 7

    Normalized Counts: Read counts were normalized to counts per million (CPM) and Transcripts per Million (TPM) by dividing each gene-count by total reads mapped (uniquely mapped to exonic regions) per million and using ENSEMBL annotated read-lengths

    Ensembl vN/A
    $ Bash example
    # --- Installation (commented out) ---
    # It is recommended to use a virtual environment or conda for package management.
    # conda create -n normalization_env python=3.9 pandas gtfparse
    # conda activate normalization_env
    # pip install gtfparse pandas
    
    # --- Reference Data Setup ---
    # Download Ensembl GTF annotation file (e.g., Homo sapiens GRCh38, release 109)
    # wget http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
    # gunzip Homo_sapiens.GRCh38.109.gtf.gz
    GTF_FILE="Homo_sapiens.GRCh38.109.gtf"
    GENE_LENGTHS_FILE="ensembl_gene_lengths_GRCh38.txt"
    
    # Python script to parse GTF and calculate gene lengths (sum of exon lengths per gene)
    # This script should be saved as parse_gtf_for_gene_lengths.py
    cat << 'EOF' > parse_gtf_for_gene_lengths.py
    import gtfparse
    import pandas as pd
    import sys
    
    GTF_FILE = sys.argv[1]
    OUTPUT_GENE_LENGTHS_FILE = sys.argv[2]
    
    # Load GTF
    gtf_df = gtfparse.read_gtf(GTF_FILE)
    
    # Filter for exons and calculate length
    exons_df = gtf_df[gtf_df['feature'] == 'exon']
    exons_df['length'] = exons_df['end'] - exons_df['start'] + 1
    
    # Sum lengths of all exons associated with each gene_id.
    # Note: This is a common approximation for 'gene length' in TPM calculations.
    # More sophisticated methods might consider the union of exonic regions or effective lengths.
    gene_lengths = exons_df.groupby('gene_id')['length'].sum().reset_index()
    gene_lengths.columns = ['gene_id', 'gene_length']
    
    gene_lengths.to_csv(OUTPUT_GENE_LENGTHS_FILE, sep='\t', index=False, header=False)
    print(f"Gene lengths saved to {OUTPUT_GENE_LENGTHS_FILE}")
    EOF
    
    python parse_gtf_for_gene_lengths.py "$GTF_FILE" "$GENE_LENGTHS_FILE"
    
    # --- Simulate Input Data (replace with actual pipeline outputs) ---
    # Assuming 'gene_counts.txt' is the output from a tool like featureCounts or HTSeq-count.
    # It should contain gene IDs and raw read counts.
    # Example content:
    # ENSG00000000003\t1000
    # ENSG00000000005\t500
    # ENSG00000000419\t2000
    # ...
    
    # Create a dummy gene_counts.txt for demonstration
    echo -e "ENSG00000000003\t1000" > gene_counts.txt
    echo -e "ENSG00000000005\t500" >> gene_counts.txt
    echo -e "ENSG00000000419\t2000" >> gene_counts.txt
    
    # Assuming 'gene_counts.txt.summary' is the summary file from featureCounts
    # It contains total assigned reads, which are 'uniquely mapped to exonic regions'.
    # Create a dummy summary file for demonstration
    echo -e "Assigned\t10000000" > gene_counts.txt.summary
    echo -e "Unassigned_NoFeatures\t500000" >> gene_counts.txt.summary
    echo -e "Unassigned_Ambiguity\t100000" >> gene_counts.txt.summary
    
    TOTAL_MAPPED_READS_FILE="total_exonic_mapped_reads.txt"
    # Extract total uniquely mapped reads to exonic regions from the summary file
    TOTAL_MAPPED_READS=$(awk '$1 == "Assigned" {print $2}' gene_counts.txt.summary)
    echo "$TOTAL_MAPPED_READS" > "$TOTAL_MAPPED_READS_FILE"
    
    # --- Main Normalization Script ---
    # This Python script performs CPM and TPM calculations as described.
    # It should be saved as normalize_counts.py
    cat << 'EOF' > normalize_counts.py
    import pandas as pd
    import sys
    
    GENE_COUNTS_FILE = sys.argv[1]
    GENE_LENGTHS_FILE = sys.argv[2]
    TOTAL_MAPPED_READS_FILE = sys.argv[3]
    OUTPUT_FILE = sys.argv[4]
    
    # Load gene counts
    counts_df = pd.read_csv(GENE_COUNTS_FILE, sep='\t', header=None, names=['gene_id', 'raw_count'])
    
    # Load gene lengths from Ensembl annotation
    lengths_df = pd.read_csv(GENE_LENGTHS_FILE, sep='\t', header=None, names=['gene_id', 'gene_length'])
    
    # Load total uniquely mapped reads to exonic regions
    with open(TOTAL_MAPPED_READS_FILE, 'r') as f:
        total_mapped_reads = int(f.read().strip())
    
    # Merge counts and lengths
    merged_df = pd.merge(counts_df, lengths_df, on='gene_id', how='left')
    
    # Calculate CPM: (gene_count / total_mapped_reads) * 1,000,000
    merged_df['CPM'] = (merged_df['raw_count'] / total_mapped_reads) * 1_000_000
    
    # Prepare for TPM calculation: only consider genes with length information
    merged_df_tpm = merged_df.dropna(subset=['gene_length']).copy()
    merged_df_tpm['gene_length'] = merged_df_tpm['gene_length'].astype(int)
    
    # Calculate RPK (Reads Per Kilobase): raw_count / (gene_length / 1000)
    merged_df_tpm['RPK'] = merged_df_tpm['raw_count'] / (merged_df_tpm['gene_length'] / 1000)
    
    # Calculate scaling factor: sum of all RPKs / 1,000,000
    scaling_factor = merged_df_tpm['RPK'].sum() / 1_000_000
    
    # Calculate TPM: RPK / scaling_factor
    merged_df_tpm['TPM'] = merged_df_tpm['RPK'] / scaling_factor
    
    # Merge TPM back into the main dataframe, filling NaN for genes without length (if any)
    final_df = pd.merge(merged_df, merged_df_tpm[['gene_id', 'TPM']], on='gene_id', how='left')
    
    # Output results
    final_df[['gene_id', 'raw_count', 'CPM', 'TPM']].to_csv(OUTPUT_FILE, sep='\t', index=False)
    
    print(f"Normalized counts saved to {OUTPUT_FILE}")
    EOF
    
    python normalize_counts.py gene_counts.txt "$GENE_LENGTHS_FILE" "$TOTAL_MAPPED_READS_FILE" normalized_counts.tsv

Tools Used

Raw Source Text
Real time Analysis (RTA):  HiSeq 4000 - RTA v2.73, HCS v3.3.52
Base-calling: bcl2fastq v2.17.1.14
Adapter trimming: RNA-seq reads were trimmed for SMARTer and Illumina  adapter sequences using skewer-v0.1.125
Alignment: Reads in gzipped fastq format were aligned using Hisat2 version-2.0.0-beta with default parameters to a modified reference genome comprising the human genome, Homo sapiens GRCh37 - human_g1k_v37, and fasta sequences for ERCC spike-ins
Mark Duplicates: Duplicate reads were marked using MarkDuplicates.jar implemented in Picard tools v1.92
Read counts (abundance measurement): Reads mapping uniquely to genes annotated in ENSEMBL release 76 were counted using featureCounts implemented in subread-v1.5.0
Normalized Counts: Read counts were normalized to counts per million (CPM) and Transcripts per Million (TPM) by dividing each gene-count by total reads mapped (uniquely mapped to exonic regions) per million and using ENSEMBL annotated read-lengths
Genome_build: GRCh37
Supplementary_files_format_and_content: Count_table.txt: raw read counts for all 63677 ENSEMBL annotated genes
Supplementary_files_format_and_content: CPM_table.txt:  Counts per million normalized counts all 63677 ENSEMBL annotated genes
Supplementary_files_format_and_content: TPM_table.txt:  Transcripts per million (gene-length and sequencing depth normalized counts) all 63677 ENSEMBL annotated genes
Supplementary_files_format_and_content: Text
← Back to Analysis