GSE122069 Processing Pipeline

RNA-Seq, 2 steps

Publication

Aberrant NOVA1 function disrupts alternative splicing in early stages of amyotrophic lateral sclerosis.

Acta neuropathologica (2022) — PMID 35778567

Dataset

GSE122069

Premature polyadenylation-mediated loss of stathmin-2 is a hallmark of TDP-43-dependent neurodegeneration

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    50bp single-end FASTQ files were obtained using the Illumina demultiplexing pipeline.

    bcl2fastq v2.20.0 (tool and version inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install bcl2fastq first. It is distributed by Illumina, so obtain it from
    # Illumina's download page or a site-provided module; conda availability is not guaranteed.
    
    # Placeholder for Illumina run directory containing BCL files
    ILLUMINA_RUN_DIR="/path/to/illumina_run_directory"
    
    # Placeholder for SampleSheet.csv, which defines samples, indexes, and read lengths
    SAMPLE_SHEET="/path/to/SampleSheet.csv"
    
    # Placeholder for output directory where FASTQ files will be generated
    OUTPUT_FASTQ_DIR="/path/to/output_fastq_files"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_FASTQ_DIR}"
    
    # Run bcl2fastq to demultiplex BCL files into FASTQ files.
    # The 50bp single-end nature of the reads is determined by the sequencing run configuration and the SampleSheet.csv.
    bcl2fastq --runfolder-dir "${ILLUMINA_RUN_DIR}" \
              --output-dir "${OUTPUT_FASTQ_DIR}" \
              --sample-sheet "${SAMPLE_SHEET}"
  2.

    STAR (Dobin et al., 2013) was used to align the reads to the human reference sequence HG38, and RSEM (Li and Dewey, 2011) was used to calculate the raw counts and transcripts per million (TPM) values for genes.

    STAR 2.7.1a, RSEM 1.3.1
    $ Bash example
    # Define variables
    # The source describes 50bp single-end reads, so one FASTQ file is used per sample.
    READS="input.fastq.gz"
    OUTPUT_PREFIX="sample_output"
    THREADS=8
    
    # Reference genome and annotation (placeholders; the source states only "HG38",
    # not the annotation release, so Gencode v38 here is an assumption)
    GENOME_FASTA="path/to/hg38.fa"
    GTF_ANNOTATION="path/to/gencode.v38.annotation.gtf"
    
    # Directories for indices
    STAR_GENOME_DIR="path/to/star_hg38_index"
    RSEM_REF_DIR="path/to/rsem_hg38_index"
    RSEM_REF_NAME="hg38_rsem_ref" # Base name for RSEM index files
    
    # --- Installation (commented out) ---
    # # Install STAR and RSEM using conda
    # conda create -n rna_seq_env -c conda-forge -c bioconda star=2.7.1a rsem=1.3.1 -y
    # conda activate rna_seq_env
    
    # --- Index Generation (commented out) ---
    # # 1. Generate STAR genome index
    # # Set --sjdbOverhang to read_length - 1, i.e. 49 for the 50bp reads described in the source.
    # mkdir -p "${STAR_GENOME_DIR}"
    # STAR --runThreadN "${THREADS}" \
    #      --runMode genomeGenerate \
    #      --genomeDir "${STAR_GENOME_DIR}" \
    #      --genomeFastaFiles "${GENOME_FASTA}" \
    #      --sjdbGTFfile "${GTF_ANNOTATION}" \
    #      --sjdbOverhang 49
    
    # # 2. Generate RSEM reference index
    # mkdir -p "${RSEM_REF_DIR}"
    # rsem-prepare-reference --gtf "${GTF_ANNOTATION}" \
    #                        "${GENOME_FASTA}" \
    #                        "${RSEM_REF_DIR}/${RSEM_REF_NAME}"
    
    # --- Execution Commands ---
    
    # 1. Align single-end reads to the human reference sequence HG38 with STAR
    # Output: Aligned reads in BAM format, sorted by coordinate, and transcriptome-mapped reads for RSEM
    # The filtering and intron-size values below are illustrative; the source does not report them.
    STAR --runThreadN "${THREADS}" \
         --genomeDir "${STAR_GENOME_DIR}" \
         --readFilesIn "${READS}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_PREFIX}." \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNoverLmax 0.05 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --outFilterType BySJout \
         --outFilterScoreMinOverLread 0.3 \
         --outFilterMatchNoverLread 0.3 \
         --limitBAMsortRAM 30000000000 \
         --quantMode TranscriptomeSAM
    
    # 2. Calculate raw counts and TPM values for genes with RSEM
    # Input: Transcriptome-mapped BAM from STAR (single-end, so --paired-end is not passed)
    # Output: ${OUTPUT_PREFIX}.genes.results, ${OUTPUT_PREFIX}.isoforms.results
    rsem-calculate-expression --num-threads "${THREADS}" \
                              --alignments \
                              --no-bam-output \
                              "${OUTPUT_PREFIX}.Aligned.toTranscriptome.out.bam" \
                              "${RSEM_REF_DIR}/${RSEM_REF_NAME}" \
                              "${OUTPUT_PREFIX}"
    
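Both steps rest on the read layout stated in the source text: 50bp single-end reads. A quick way to confirm this on the demultiplexed files before choosing alignment settings is to tabulate read lengths. The sketch below assumes gzipped FASTQ input with four lines per record; the function name `fastq_read_lengths` and the example path are hypothetical and not part of the original pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tabulate read lengths in a gzipped FASTQ file.
# Prints one "<length><TAB><count>" line per distinct read length, sorted by length.
fastq_read_lengths() {
    local fastq_gz="$1"
    gzip -dc "${fastq_gz}" \
        | awk 'NR % 4 == 2 { counts[length($0)]++ }
               END { for (len in counts) printf "%d\t%d\n", len, counts[len] }' \
        | sort -n
}

# Example usage (placeholder path):
# fastq_read_lengths "/path/to/output_fastq_files/sample.fastq.gz"
```

If the reads match the source description, the output should contain a single line for length 50. Other lengths would suggest trimming or a different run configuration, in which case `--sjdbOverhang` may need adjusting.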

Raw Source Text
50bp single-end FASTQ files were obtained using the Illumina demultiplexing pipeline. STAR (Dobin et al., 2013) and RSEM (Li and Dewey, 2011) were used to align the reads to the human reference sequence HG38 and to calculate the raw counts and transcripts per million (TPM) values for genes, respectively
Genome_build: HG38
Supplementary_files_format_and_content: TPM, csv format
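
The supplementary files are described as TPM values in csv format, but the source does not say how they were assembled. One possible approach, sketched here as an assumption rather than the authors' method, is to pull the TPM column from each sample's RSEM `*.genes.results` file into one table. The function name `rsem_tpm_to_csv` and the file names are hypothetical; the column names `gene_id` and `TPM` are those of RSEM's gene-level output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: merge the TPM column of several RSEM *.genes.results
# files into one CSV (one row per gene, one column per sample).
# Sample names are taken from the file names; genes missing from a file get "NA".
rsem_tpm_to_csv() {
    awk -F'\t' '
        FNR == 1 {
            # New file: derive the sample name and locate columns by header name.
            nfiles++
            sample = FILENAME
            sub(/.*\//, "", sample)
            sub(/\.genes\.results$/, "", sample)
            samples[nfiles] = sample
            for (i = 1; i <= NF; i++) {
                if ($i == "gene_id") gcol = i
                if ($i == "TPM") tcol = i
            }
            next
        }
        {
            gene = $gcol
            # Remember genes in order of first appearance.
            if (!(gene in seen)) { seen[gene] = 1; order[++ngenes] = gene }
            tpm[gene, nfiles] = $tcol
        }
        END {
            header = "gene_id"
            for (f = 1; f <= nfiles; f++) header = header "," samples[f]
            print header
            for (g = 1; g <= ngenes; g++) {
                line = order[g]
                for (f = 1; f <= nfiles; f++) {
                    value = ((order[g], f) in tpm) ? tpm[order[g], f] : "NA"
                    line = line "," value
                }
                print line
            }
        }
    ' "$@"
}

# Example usage (placeholder file names):
# rsem_tpm_to_csv sampleA.genes.results sampleB.genes.results > tpm_matrix.csv
```

Locating columns by header name rather than by position keeps the sketch independent of the exact column order in the RSEM output.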