GSE281960 Processing Pipeline
Publication
Investigational eIF2B activator DNL343 modulates the integrated stress response in preclinical models of TDP-43 pathology and individuals with ALS in a randomized clinical trial.Nature communications (2025) — PMID 40825784
Dataset
GSE281960Investigational eIF2B activator DNL343 modulates the Integrated Stress Response in Preclinical Models of TDP-43 Pathology.
Processing Steps
Generate Jupyter Notebook-
1
Sequencing adapters were trimmed from the raw reads with skewer.
$ Bash example
# Install skewer (if not already installed) # conda install -c bioconda skewer # Define input and output file paths # Replace with actual input raw read files (e.g., from a sequencing run) RAW_READS_R1="raw_reads_R1.fastq.gz" RAW_READS_R2="raw_reads_R2.fastq.gz" OUTPUT_PREFIX="trimmed_reads" # Trim sequencing adapters using skewer # skewer automatically detects adapters if -x is not specified. # -o: Output file prefix. skewer will append '-trimmed-pair1.fastq.gz' and '-trimmed-pair2.fastq.gz' for paired-end reads. skewer -o "${OUTPUT_PREFIX}" "${RAW_READS_R1}" "${RAW_READS_R2}" -
2
Reads were aligned to the human genome version GRCh38_p126.
STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # GENOME_DIR should point to the directory containing the pre-built STAR genome index for GRCh38_p126. # This index needs to be generated once using STAR --runMode genomeGenerate with the GRCh38_p126 FASTA and GTF files. # Reference genome source: NCBI (e.g., GCF_000001405.40_GRCh38.p14, where p126 would be a specific patch release). GENOME_DIR="/path/to/STAR_GRCh38_p126_index" READ1="sample_R1.fastq.gz" # Placeholder for input read 1 file READ2="sample_R2.fastq.gz" # Placeholder for input read 2 file (remove if single-end data) OUTPUT_PREFIX="aligned_GRCh38_p126_" # Prefix for output files THREADS=8 # Number of threads to use # Align reads to the human genome GRCh38_p126 STAR --genomeDir ${GENOME_DIR} \ --readFilesIn ${READ1} ${READ2} \ --runThreadN ${THREADS} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard \ --quantMode GeneCounts \ --outFilterMultimapNmax 20 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 -
3
A STAR index (version 2.7.1a) was built with the âsjdbOverhang=50 argument.
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star # Define input and output paths # Replace with actual paths to your genome FASTA and GTF files GENOME_DIR="path/to/genome_references" OUTPUT_INDEX_DIR="path/to/star_index" # Placeholder for genome and GTF files (e.g., GRCh38, Gencode v44) # Download from sources like GENCODE (https://www.gencodegenes.org/human/) or Ensembl GENOME_FA="${GENOME_DIR}/GRCh38.primary_assembly.genome.fa" GTF_FILE="${GENOME_DIR}/gencode.v44.annotation.gtf" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_INDEX_DIR}" # Build STAR index STAR --runMode genomeGenerate \ --genomeDir "${OUTPUT_INDEX_DIR}" \ --genomeFastaFiles "${GENOME_FA}" \ --sjdbGTFfile "${GTF_FILE}" \ --sjdbOverhang 50 \ --runThreadN 8 # Adjust number of threads as needed -
4
Splice junctions from Gencode gene models (release M28) were provided via the âsjdbGTFfile argument.
$ Bash example
# Install STAR if not already available # conda install -c bioconda star # Define reference paths (placeholders) # Gencode M28 is based on the GRCh38 human reference genome. GENOME_FASTA="/path/to/GRCh38.primary_assembly.genome.fa" GENCODE_GTF="/path/to/gencode.vM28.annotation.gtf" # Gencode gene models (release M28) STAR_INDEX_DIR="/path/to/STAR_genome_index_gencodeM28" # Create STAR genome index with Gencode M28 splice junctions # The --sjdbGTFfile argument is primarily used during genome index generation # to incorporate known splice junctions into the index for improved alignment. mkdir -p "${STAR_INDEX_DIR}" STAR \ --runMode genomeGenerate \ --genomeDir "${STAR_INDEX_DIR}" \ --genomeFastaFiles "${GENOME_FASTA}" \ --sjdbGTFfile "${GENCODE_GTF}" \ --sjdbOverhang 100 \ --runThreadN 8 # Adjust number of threads as needed -
5
STAR alignments were generated with the following parameters: âoutFilterType BySJout, âquantMode TranscriptomeSAM, âoutFilterIntronMotifs RemoveNoncanonicalUnannotated, âoutSAMstrandField intronMotif, âoutSAMattributes NH HI AS nM MD XS and âoutSAMunmapped Within.
$ Bash example
# Install STAR if not already installed # conda install -c bioconda star # Placeholder for STAR genome index directory (e.g., hg38) GENOME_DIR="/path/to/star_genome_index_hg38" # Placeholder for input FASTQ files (adjust for single-end or paired-end) READ1="sample_R1.fastq.gz" READ2="sample_R2.fastq.gz" # Remove if single-end # Placeholder for output file prefix OUTPUT_PREFIX="sample_aligned_" STAR \ --runThreadN 8 \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outFilterType BySJout \ --quantMode TranscriptomeSAM \ --outFilterIntronMotifs RemoveNoncanonicalUnannotated \ --outSAMstrandField intronMotif \ --outSAMattributes NH HI AS nM MD XS \ --outSAMunmapped Within -
6
Alignments were obtained with the following parameters: âreadFilesCommand zcat âoutFilterType BySJout âoutFilterMultimapNmax 20 âalignSJoverhangMin 8 âalignSJDBoverhangMin 1 âoutFilterMismatchNmax 999 âoutFilterMismatchNoverLmax 0.6 âalignIntronMin 20 âalignIntronMax 1000000 âalignMatesGapMax 1000000 âquantMode GeneCounts âoutSAMunmapped Within âoutSAMattributes NH HI AS nM MD XS âoutSAMstrandField intronMotif âoutSAMtype BAM SortedByCoordinate âoutBAMcompression 6.
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables # Placeholder for a human GRCh38 STAR genome index. This directory should contain files like SA, SAindex, genome, etc. GENOME_DIR="/path/to/STAR_genome_index_GRCh38" # Placeholder for input FASTQ files. Adjust for single-end reads if necessary. READ1="sample_R1.fastq.gz" READ2="sample_R2.fastq.gz" # Prefix for all output files generated by STAR OUTPUT_PREFIX="sample_aligned" # STAR alignment command STAR \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1}" "${READ2}" \ --readFilesCommand zcat \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.6 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --quantMode GeneCounts \ --outSAMunmapped Within \ --outSAMattributes NH HI AS nM MD XS \ --outSAMstrandField intronMotif \ --outSAMtype BAM SortedByCoordinate \ --outBAMcompression 6 \ --outFileNamePrefix "${OUTPUT_PREFIX}" -
7
Alignments sharing the same UMI and genomic coordinate were deduplicated using the collapse_UMI_bam tool (Lexogen).
collapse_UMI_bam vNot specified (Inferred with models/gemini-2.5-flash)$ Bash example
# Install Python if not already available # conda install python # Download the collapse_UMI_bam.py script from Lexogen's resources. # The exact URL might vary or be provided with the specific kit. # wget -O collapse_UMI_bam.py "https://www.lexogen.com/fileadmin/user_upload/support/QuantSeq/collapse_UMI_bam.py" # chmod +x collapse_UMI_bam.py # Ensure the input BAM file is sorted by coordinate before deduplication. # samtools sort -o input.sorted.bam input.bam # samtools index input.sorted.bam # Execute the deduplication using collapse_UMI_bam.py # Replace 'input.bam' with your actual input BAM file and 'deduplicated.bam' with your desired output file name. python collapse_UMI_bam.py -i input.bam -o deduplicated.bam
-
8
The raw gene expression matrix was constructed from the 'forward' column of STAR's ReadsPerGene.out.tab output files using R.
$ Bash example
# --- Installation (commented out) --- # To install STAR and R (if not already available): # conda create -n star_r_env star r-base -y # conda activate star_r_env # --- Placeholder for STAR alignment (upstream step) --- # This step generates the 'ReadsPerGene.out.tab' files for each sample. # For demonstration, we assume these files already exist in sample-specific directories. # Example STAR alignment command (replace with actual genome/annotation paths and sample data): # STAR --runThreadN 8 \ # --genomeDir /path/to/STAR_genome_index/hg38 \ # --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \ # --readFilesCommand zcat \ # --outFileNamePrefix sample1/ \ # --outSAMtype BAM SortedByCoordinate \ # --outReadsPerGene FeatureCounts \ # --quantMode GeneCounts # --- Create dummy sample directories and files for demonstration --- # In a real pipeline, these files would be generated by STAR. mkdir -p sample1 sample2 sample3 echo -e "N_unmapped\t0\t0\t0\nN_multimapping\t0\t0\t0\nN_noFeature\t0\t0\t0\nN_ambiguous\t0\t0\t0\nGENE1\t100\t50\t50\nGENE2\t200\t120\t80\nGENE3\t150\t70\t80" > sample1/ReadsPerGene.out.tab echo -e "N_unmapped\t0\t0\t0\nN_multimapping\t0\t0\t0\nN_noFeature\t0\t0\t0\nN_ambiguous\t0\t0\t0\nGENE1\t110\t60\t50\nGENE2\t210\t130\t80\nGENE3\t160\t75\t85" > sample2/ReadsPerGene.out.tab echo -e "N_unmapped\t0\t0\t0\nN_multimapping\t0\t0\t0\nN_noFeature\t0\t0\t0\nN_ambiguous\t0\t0\t0\nGENE1\t90\t45\t45\nGENE2\t190\t110\t80\nGENE3\t140\t65\t75" > sample3/ReadsPerGene.out.tab # --- R script to construct the gene expression matrix --- # This script reads the 'forward' column (column 3) from STAR's ReadsPerGene.out.tab files # for multiple samples and combines them into a single gene expression matrix. cat << 'EOF' > construct_expression_matrix.R # R script to construct gene expression matrix from STAR's ReadsPerGene.out.tab # Assumes 'ReadsPerGene.out.tab' files are located in subdirectories named after samples. # List of sample directories (replace with actual sample directories or pass as argument) sample_dirs <- c("sample1", "sample2", "sample3") # Initialize an empty list to store counts for each sample all_counts <- list() gene_ids <- NULL for (s_dir in sample_dirs) { file_path <- file.path(s_dir, "ReadsPerGene.out.tab") if (!file.exists(file_path)) { warning(paste("File not found:", file_path)) next } # Read the file, skipping the first 4 lines (STAR header) # and assuming tab-separated values. # The 'ReadsPerGene.out.tab' file typically has 4 columns: # 1. Gene ID # 2. Unstranded counts # 3. Forward strand counts # 4. Reverse strand counts data <- read.table(file_path, sep="\t", header=FALSE, skip=4, stringsAsFactors=FALSE) # The first column is Gene ID, the third column is forward strand counts. if (is.null(gene_ids)) { gene_ids <- data[, 1] } else { # Ensure gene IDs are consistent across samples if (!all(gene_ids == data[, 1])) { stop("Gene IDs are not consistent across samples.") } } # Store forward strand counts (column 3). all_counts[[s_dir]] <- data[, 3] } # Combine all counts into a matrix. if (length(all_counts) > 0) { expression_matrix <- do.call(cbind, all_counts) colnames(expression_matrix) <- sample_dirs rownames(expression_matrix) <- gene_ids # Save the matrix as a tab-separated file. write.table(expression_matrix, "gene_expression_matrix_forward_strand.tsv", sep="\t", row.names=TRUE, quote=FALSE) message("Gene expression matrix constructed and saved to gene_expression_matrix_forward_strand.tsv") } else { message("No valid sample data found to construct matrix.") } EOF # Execute the R script to construct the matrix Rscript construct_expression_matrix.R # --- Clean up dummy files (optional) --- # rm -rf sample1 sample2 sample3 construct_expression_matrix.R -
9
Gene symbols and biotype information were extracted from the Gencode GTF file.
$ Bash example
# Input Gencode GTF file (replace with your specific version and species, e.g., gencode.v45.annotation.gtf for human or gencode.vM25.annotation.gtf for mouse) GENCODE_GTF="gencode.annotation.gtf" # Output file for gene symbols and biotypes OUTPUT_FILE="gene_info.tsv" # Extract gene_id, gene_name, and gene_type for 'gene' features awk ' $3 == "gene" { gene_id = ""; gene_name = ""; gene_type = ""; # Split the 9th column (attributes) by semicolon and space split($9, attrs, "; "); for (i=1; i<=length(attrs); i++) { # Remove leading/trailing spaces and quotes gsub(/^ +| +$/, "", attrs[i]); gsub(/"/, "", attrs[i]); if (attrs[i] ~ /^gene_id/) { gene_id = substr(attrs[i], index(attrs[i], " ") + 1); } else if (attrs[i] ~ /^gene_name/) { gene_name = substr(attrs[i], index(attrs[i], " ") + 1); } else if (attrs[i] ~ /^gene_type/) { gene_type = substr(attrs[i], index(attrs[i], " ") + 1); } } # Print gene_id, gene_name, and gene_type separated by tabs print gene_id "\t" gene_name "\t" gene_type }' "${GENCODE_GTF}" > "${OUTPUT_FILE}" -
10
To calculate normalized counts per million (CPM) TMM normalization was applied across all genes annotated as protein_coding with the edgeR R package.
edgeR GitHub$ Bash example
# Install R and Bioconductor if not already present # sudo apt-get update && sudo apt-get install -y r-base # R -e 'install.packages("BiocManager", repos="https://cloud.r-project.org")' # R -e 'BiocManager::install("edgeR")' # Create a dummy raw_counts.tsv for demonstration # In a real pipeline, this file would be generated by an upstream step (e.g., featureCounts, htseq-count) cat <<EOF > raw_counts.tsv GeneID Sample1 Sample2 Sample3 GeneA 100 150 120 GeneB 200 220 180 GeneC 50 60 70 GeneD 10 15 12 EOF # Create the R script for edgeR processing cat <<'EOF' > normalize_cpm_tmm.R # Load edgeR package library(edgeR) # --- Configuration --- input_counts_file <- "raw_counts.tsv" # Assumed input file containing raw counts output_cpm_file <- "normalized_cpm_tmm.tsv" # Output file for normalized CPM values # --- Main processing --- # 1. Read raw counts # Assuming the first column is gene IDs and subsequent columns are sample counts. # The description implies that only protein_coding genes are included in the input. counts_data <- read.delim(input_counts_file, row.names = 1, check.names = FALSE) # Ensure counts are numeric and non-negative integers counts_matrix <- as.matrix(counts_data) counts_matrix <- round(counts_matrix) # Ensure integers if not already counts_matrix[counts_matrix < 0] <- 0 # Ensure non-negative # 2. Create a DGEList object # A DGEList object is required by edgeR for normalization. dge <- DGEList(counts = counts_matrix) # 3. Apply TMM normalization # TMM (Trimmed Mean of M-values) normalization is applied to calculate normalization factors. dge <- calcNormFactors(dge, method = "TMM") # 4. Calculate normalized CPM values # CPM (Counts Per Million) values are calculated using the TMM normalization factors. # normalized.lib.sizes = TRUE ensures library sizes are adjusted by normalization factors. # log = FALSE ensures raw CPM values (not log-transformed) are returned. # prior.count = 0 ensures no pseudo-counts are added before CPM calculation. cpm_values <- cpm(dge, normalized.lib.sizes = TRUE, log = FALSE, prior.count = 0) # 5. Save the normalized CPM matrix # The resulting matrix is saved to a tab-separated file. write.table(cpm_values, file = output_cpm_file, sep = "\t", quote = FALSE, col.names = NA) message(paste("Normalized CPM (TMM) values saved to:", output_cpm_file)) EOF # Execute the R script to perform TMM normalization and CPM calculation Rscript normalize_cpm_tmm.R
Tools Used
Raw Source Text
Sequencing adapters were trimmed from the raw reads with skewer. Reads were aligned to the human genome version GRCh38_p126. A STAR index (version 2.7.1a) was built with the âsjdbOverhang=50 argument. Splice junctions from Gencode gene models (release M28) were provided via the âsjdbGTFfile argument. STAR alignments were generated with the following parameters: âoutFilterType BySJout, âquantMode TranscriptomeSAM, âoutFilterIntronMotifs RemoveNoncanonicalUnannotated, âoutSAMstrandField intronMotif, âoutSAMattributes NH HI AS nM MD XS and âoutSAMunmapped Within. Alignments were obtained with the following parameters: âreadFilesCommand zcat âoutFilterType BySJout âoutFilterMultimapNmax 20 âalignSJoverhangMin 8 âalignSJDBoverhangMin 1 âoutFilterMismatchNmax 999 âoutFilterMismatchNoverLmax 0.6 âalignIntronMin 20 âalignIntronMax 1000000 âalignMatesGapMax 1000000 âquantMode GeneCounts âoutSAMunmapped Within âoutSAMattributes NH HI AS nM MD XS âoutSAMstrandField intronMotif âoutSAMtype BAM SortedByCoordinate âoutBAMcompression 6. Alignments sharing the same UMI and genomic coordinate were deduplicated using the collapse_UMI_bam tool (Lexogen). The raw gene expression matrix was constructed from the 'forward' column of STAR's ReadsPerGene.out.tab output files using R. Gene symbols and biotype information were extracted from the Gencode GTF file. To calculate normalized counts per million (CPM) TMM normalization was applied across all genes annotated as protein_coding with the edgeR R package. Assembly: GRCh38 Supplementary files format and content: File raw_and_normalized_counts.xlsx is a microsoft excel file with three worksheets: 1. sample information 2. raw count table 3. normalized count table (log2 counts per million)