GSE122069 Processing Pipeline
RNA-Seq
2 steps
Publication
Aberrant NOVA1 function disrupts alternative splicing in early stages of amyotrophic lateral sclerosis. Acta Neuropathologica (2022), PMID 35778567
Dataset
GSE122069: Premature polyadenylation-mediated loss of stathmin-2 is a hallmark of TDP-43-dependent neurodegeneration
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
50bp single-end FASTQ files were obtained using the Illumina demultiplexing pipeline.
Tool: bcl2fastq v2.20.0 (tool name and version inferred with models/gemini-2.5-flash; the source text names neither)
$ Bash example
# Install bcl2fastq (example using conda; package name and channel are unverified,
# and Illumina also distributes bcl2fastq directly)
# conda install -c bioconda bcl2fastq2

# Placeholder for Illumina run directory containing BCL files
ILLUMINA_RUN_DIR="/path/to/illumina_run_directory"

# Placeholder for SampleSheet.csv, which defines samples, indexes, and read lengths
SAMPLE_SHEET="/path/to/SampleSheet.csv"

# Placeholder for output directory where FASTQ files will be generated
OUTPUT_FASTQ_DIR="/path/to/output_fastq_files"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_FASTQ_DIR}"

# Run bcl2fastq to demultiplex BCL files into FASTQ files.
# The 50bp single-end nature of the reads is determined by the sequencing run
# configuration and the SampleSheet.csv.
bcl2fastq --runfolder-dir "${ILLUMINA_RUN_DIR}" \
          --output-dir "${OUTPUT_FASTQ_DIR}" \
          --sample-sheet "${SAMPLE_SHEET}"
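Because step 1 states that the reads are 50bp single-end, it can be worth confirming the read length of a FASTQ file before alignment. The following is a minimal sketch, not part of the original pipeline: the function name `read_length_histogram` and the two-read demo file are hypothetical, and it assumes the standard four-line FASTQ record layout.

```shell
set -eu

# Print "count length" pairs for the reads in a FASTQ stream on stdin.
# Assumes 4 lines per record (sequence on the 2nd line, never wrapped).
read_length_histogram() {
    awk 'NR % 4 == 2 { n[length($0)]++ }
         END { for (len in n) print n[len], len }' | sort -k2,2n
}

# Hypothetical demo input: two made-up 50-base reads.
seq50="$(awk 'BEGIN { for (i = 0; i < 10; i++) printf "ACGTA"; print "" }')"
qual50="$(awk 'BEGIN { for (i = 0; i < 10; i++) printf "IIIII"; print "" }')"
demo_fastq="$(mktemp)"
printf '@read1\n%s\n+\n%s\n@read2\n%s\n+\n%s\n' \
    "$seq50" "$qual50" "$seq50" "$qual50" > "$demo_fastq"

histogram="$(read_length_histogram < "$demo_fastq")"
echo "$histogram"
rm -f "$demo_fastq"
```

For a real gzipped file the same function can be fed with `zcat sample.fastq.gz | read_length_histogram`. A single output line ending in `50` would be consistent with the description above.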
2
STAR (Dobin et al., 2013) and RSEM (Li and Dewey, 2011) were used to align the reads to the human reference sequence HG38 and to calculate the raw counts and transcripts per million (TPM) values for genes, respectively.
$ Bash example
# Define variables
# The source describes 50bp single-end reads, so there is one FASTQ file per sample.
READS="input.fastq.gz"
OUTPUT_PREFIX="sample_output"
THREADS=8

# Reference genome and annotation (UCSC hg38 and Gencode v38 are common choices;
# the source only states HG38)
GENOME_FASTA="path/to/hg38.fa"
GTF_ANNOTATION="path/to/gencode.v38.annotation.gtf"

# Directories for indices
STAR_GENOME_DIR="path/to/star_hg38_index"
RSEM_REF_DIR="path/to/rsem_hg38_index"
RSEM_REF_NAME="hg38_rsem_ref" # Base name for RSEM index files

# --- Installation (commented out) ---
# # Install STAR and RSEM using conda (versions are examples, not stated in the source)
# conda create -n rna_seq_env -c bioconda -c conda-forge star=2.7.1a rsem=1.3.1 -y
# conda activate rna_seq_env

# --- Index Generation (commented out) ---
# # 1. Generate STAR genome index
# # --sjdbOverhang is ideally read_length - 1, i.e. 49 for 50bp reads.
# mkdir -p "${STAR_GENOME_DIR}"
# STAR --runThreadN "${THREADS}" \
#      --runMode genomeGenerate \
#      --genomeDir "${STAR_GENOME_DIR}" \
#      --genomeFastaFiles "${GENOME_FASTA}" \
#      --sjdbGTFfile "${GTF_ANNOTATION}" \
#      --sjdbOverhang 49
# # 2. Generate RSEM reference index
# mkdir -p "${RSEM_REF_DIR}"
# rsem-prepare-reference --gtf "${GTF_ANNOTATION}" \
#      "${GENOME_FASTA}" \
#      "${RSEM_REF_DIR}/${RSEM_REF_NAME}"

# --- Execution Commands ---
# 1. Align reads to the human reference sequence HG38 with STAR
# Output: Aligned reads in BAM format, sorted by coordinate, and transcriptome-mapped reads for RSEM
STAR --runThreadN "${THREADS}" \
     --genomeDir "${STAR_GENOME_DIR}" \
     --readFilesIn "${READS}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}." \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNoverLmax 0.05 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --outFilterType BySJout \
     --outFilterScoreMinOverLread 0.3 \
     --outFilterMatchNoverLread 0.3 \
     --limitBAMsortRAM 30000000000 \
     --quantMode TranscriptomeSAM

# 2. Calculate raw counts and TPM values for genes with RSEM
# Input: Transcriptome-mapped BAM from STAR
# Output: ${OUTPUT_PREFIX}.genes.results, ${OUTPUT_PREFIX}.isoforms.results
# No --paired-end flag, because the reads are single-end.
# --alignments is the current name of the flag that older RSEM releases call --bam.
rsem-calculate-expression --num-threads "${THREADS}" \
     --alignments \
     --no-bam-output \
     "${OUTPUT_PREFIX}.Aligned.toTranscriptome.out.bam" \
     "${RSEM_REF_DIR}/${RSEM_REF_NAME}" \
     "${OUTPUT_PREFIX}"
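The source states that the supplementary files contain TPM values in CSV format, while RSEM writes one tab-separated `*.genes.results` file per sample. The sketch below shows one possible way to combine the TPM column (column 6 of `genes.results`) from several samples into a single CSV. It is an illustration only, not the authors' method: `merge_tpm_csv`, `write_demo`, the sample names, and all demo values are hypothetical, and it assumes every sample was quantified against the same RSEM reference, so each file lists the same genes.

```shell
set -eu

# Merge the TPM column of several RSEM *.genes.results files into one CSV.
# Gene order follows the first file; column names come from the file names.
merge_tpm_csv() {
    awk -F'\t' '
        FNR == 1 {                              # header line of each input file
            nfiles++
            n = split(FILENAME, parts, "/")     # strip the directory part
            name = parts[n]
            sub(/\.genes\.results$/, "", name)  # strip the RSEM suffix
            header = header "," name
            next
        }
        nfiles == 1 { order[++ngenes] = $1 }    # remember gene order once
        { tpm[$1] = tpm[$1] "," $6 }            # column 6 is TPM
        END {
            print "gene_id" header
            for (i = 1; i <= ngenes; i++) print order[i] tpm[order[i]]
        }
    ' "$@"
}

# Hypothetical demo input with made-up values: $1 = path, $2 and $3 = TPM values.
write_demo() {
    {
        printf 'gene_id\ttranscript_id(s)\tlength\teffective_length\texpected_count\tTPM\tFPKM\n'
        printf 'geneA\ttxA\t1500.00\t1451.00\t0.00\t%s\t0.00\n' "$2"
        printf 'geneB\ttxB\t900.00\t851.00\t0.00\t%s\t0.00\n' "$3"
    } > "$1"
}

tmp_dir="$(mktemp -d)"
write_demo "$tmp_dir/sampleA.genes.results" 35.10 12.50
write_demo "$tmp_dir/sampleB.genes.results" 0.00 14.20

merged="$(merge_tpm_csv "$tmp_dir/sampleA.genes.results" "$tmp_dir/sampleB.genes.results")"
echo "$merged"
rm -rf "$tmp_dir"
```

On real output the function would be called as `merge_tpm_csv *.genes.results > tpm_matrix.csv`. If the files could differ in gene content, a join keyed on `gene_id` with explicit handling of missing genes would be safer than this simple concatenation.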
Raw Source Text
50bp single-end FASTQ files were obtained using the Illumina demultiplexing pipeline.
STAR (Dobin et al., 2013) and RSEM (Li and Dewey, 2011) were used to align the reads to the human reference sequence HG38 and to calculate the raw counts and transcripts per million (TPM) values for genes, respectively
Genome_build: HG38
Supplementary_files_format_and_content: TPM, csv format