GSE281960 Processing Pipeline

GSE code_examples 10 steps

Publication

Investigational eIF2B activator DNL343 modulates the integrated stress response in preclinical models of TDP-43 pathology and individuals with ALS in a randomized clinical trial.

Nature communications (2025) — PMID 40825784

Dataset

GSE281960

Investigational eIF2B activator DNL343 modulates the Integrated Stress Response in Preclinical Models of TDP-43 Pathology.

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequencing adapters were trimmed from the raw reads with skewer.

skewer v0.2.2 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install skewer (if not already installed)
# conda install -c bioconda skewer

# Define input and output file paths
# Replace with actual input raw read files (e.g., from a sequencing run)
RAW_READS_R1="raw_reads_R1.fastq.gz"
RAW_READS_R2="raw_reads_R2.fastq.gz"
OUTPUT_PREFIX="trimmed_reads"

# Trim sequencing adapters using skewer
# skewer automatically detects adapters if -x is not specified.
# -o: Output file prefix. skewer will append '-trimmed-pair1.fastq.gz' and '-trimmed-pair2.fastq.gz' for paired-end reads.
skewer -o "${OUTPUT_PREFIX}" "${RAW_READS_R1}" "${RAW_READS_R2}"

View on GitHub

Reads were aligned to the human genome version GRCh38_p126.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# GENOME_DIR should point to the directory containing the pre-built STAR genome index for GRCh38_p126.
# This index needs to be generated once using STAR --runMode genomeGenerate with the GRCh38_p126 FASTA and GTF files.
# Reference genome source: NCBI (e.g., GCF_000001405.40_GRCh38.p14, where p126 would be a specific patch release).
GENOME_DIR="/path/to/STAR_GRCh38_p126_index"
READ1="sample_R1.fastq.gz" # Placeholder for input read 1 file
READ2="sample_R2.fastq.gz" # Placeholder for input read 2 file (remove if single-end data)
OUTPUT_PREFIX="aligned_GRCh38_p126_" # Prefix for output files
THREADS=8 # Number of threads to use

# Align reads to the human genome GRCh38_p126
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --runThreadN ${THREADS} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --quantMode GeneCounts \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000

View on GitHub

A STAR index (version 2.7.1a) was built with the âsjdbOverhang=50 argument.

STAR v2.7.1a GitHub

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star

# Define input and output paths
# Replace with actual paths to your genome FASTA and GTF files
GENOME_DIR="path/to/genome_references" 
OUTPUT_INDEX_DIR="path/to/star_index"

# Placeholder for genome and GTF files (e.g., GRCh38, Gencode v44)
# Download from sources like GENCODE (https://www.gencodegenes.org/human/) or Ensembl
GENOME_FA="${GENOME_DIR}/GRCh38.primary_assembly.genome.fa"
GTF_FILE="${GENOME_DIR}/gencode.v44.annotation.gtf"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_INDEX_DIR}"

# Build STAR index
STAR --runMode genomeGenerate \
     --genomeDir "${OUTPUT_INDEX_DIR}" \
     --genomeFastaFiles "${GENOME_FA}" \
     --sjdbGTFfile "${GTF_FILE}" \
     --sjdbOverhang 50 \
     --runThreadN 8 # Adjust number of threads as needed

View on GitHub

Splice junctions from Gencode gene models (release M28) were provided via the âsjdbGTFfile argument.

GENCODE v2.7.9a GitHub

$ Bash example

# Install STAR if not already available
# conda install -c bioconda star

# Define reference paths (placeholders)
# Gencode M28 is based on the GRCh38 human reference genome.
GENOME_FASTA="/path/to/GRCh38.primary_assembly.genome.fa" 
GENCODE_GTF="/path/to/gencode.vM28.annotation.gtf" # Gencode gene models (release M28)
STAR_INDEX_DIR="/path/to/STAR_genome_index_gencodeM28"

# Create STAR genome index with Gencode M28 splice junctions
# The --sjdbGTFfile argument is primarily used during genome index generation
# to incorporate known splice junctions into the index for improved alignment.
mkdir -p "${STAR_INDEX_DIR}"

STAR \
  --runMode genomeGenerate \
  --genomeDir "${STAR_INDEX_DIR}" \
  --genomeFastaFiles "${GENOME_FASTA}" \
  --sjdbGTFfile "${GENCODE_GTF}" \
  --sjdbOverhang 100 \
  --runThreadN 8 # Adjust number of threads as needed

View on GitHub

STAR alignments were generated with the following parameters: âoutFilterType BySJout, âquantMode TranscriptomeSAM, âoutFilterIntronMotifs RemoveNoncanonicalUnannotated, âoutSAMstrandField intronMotif, âoutSAMattributes NH HI AS nM MD XS and âoutSAMunmapped Within.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star

# Placeholder for STAR genome index directory (e.g., hg38)
GENOME_DIR="/path/to/star_genome_index_hg38"

# Placeholder for input FASTQ files (adjust for single-end or paired-end)
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz" # Remove if single-end

# Placeholder for output file prefix
OUTPUT_PREFIX="sample_aligned_"

STAR \
  --runThreadN 8 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterType BySJout \
  --quantMode TranscriptomeSAM \
  --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
  --outSAMstrandField intronMotif \
  --outSAMattributes NH HI AS nM MD XS \
  --outSAMunmapped Within

View on GitHub

Alignments were obtained with the following parameters: âreadFilesCommand zcat âoutFilterType BySJout âoutFilterMultimapNmax 20 âalignSJoverhangMin 8 âalignSJDBoverhangMin 1 âoutFilterMismatchNmax 999 âoutFilterMismatchNoverLmax 0.6 âalignIntronMin 20 âalignIntronMax 1000000 âalignMatesGapMax 1000000 âquantMode GeneCounts âoutSAMunmapped Within âoutSAMattributes NH HI AS nM MD XS âoutSAMstrandField intronMotif âoutSAMtype BAM SortedByCoordinate âoutBAMcompression 6.

STAR (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Placeholder for a human GRCh38 STAR genome index. This directory should contain files like SA, SAindex, genome, etc.
GENOME_DIR="/path/to/STAR_genome_index_GRCh38"
# Placeholder for input FASTQ files. Adjust for single-end reads if necessary.
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz"
# Prefix for all output files generated by STAR
OUTPUT_PREFIX="sample_aligned"

# STAR alignment command
STAR \
    --genomeDir "${GENOME_DIR}" \
    --readFilesIn "${READ1}" "${READ2}" \
    --readFilesCommand zcat \
    --outFilterType BySJout \
    --outFilterMultimapNmax 20 \
    --alignSJoverhangMin 8 \
    --alignSJDBoverhangMin 1 \
    --outFilterMismatchNmax 999 \
    --outFilterMismatchNoverLmax 0.6 \
    --alignIntronMin 20 \
    --alignIntronMax 1000000 \
    --alignMatesGapMax 1000000 \
    --quantMode GeneCounts \
    --outSAMunmapped Within \
    --outSAMattributes NH HI AS nM MD XS \
    --outSAMstrandField intronMotif \
    --outSAMtype BAM SortedByCoordinate \
    --outBAMcompression 6 \
    --outFileNamePrefix "${OUTPUT_PREFIX}"

View on GitHub

Alignments sharing the same UMI and genomic coordinate were deduplicated using the collapse_UMI_bam tool (Lexogen).

collapse_UMI_bam vNot specified (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install Python if not already available
# conda install python

# Download the collapse_UMI_bam.py script from Lexogen's resources.
# The exact URL might vary or be provided with the specific kit.
# wget -O collapse_UMI_bam.py "https://www.lexogen.com/fileadmin/user_upload/support/QuantSeq/collapse_UMI_bam.py"
# chmod +x collapse_UMI_bam.py

# Ensure the input BAM file is sorted by coordinate before deduplication.
# samtools sort -o input.sorted.bam input.bam
# samtools index input.sorted.bam

# Execute the deduplication using collapse_UMI_bam.py
# Replace 'input.bam' with your actual input BAM file and 'deduplicated.bam' with your desired output file name.
python collapse_UMI_bam.py -i input.bam -o deduplicated.bam

The raw gene expression matrix was constructed from the 'forward' column of STAR's ReadsPerGene.out.tab output files using R.

STAR v2.7.10a GitHub

$ Bash example

# --- Installation (commented out) ---
# To install STAR and R (if not already available):
# conda create -n star_r_env star r-base -y
# conda activate star_r_env

# --- Placeholder for STAR alignment (upstream step) ---
# This step generates the 'ReadsPerGene.out.tab' files for each sample.
# For demonstration, we assume these files already exist in sample-specific directories.
# Example STAR alignment command (replace with actual genome/annotation paths and sample data):
# STAR --runThreadN 8 \
#      --genomeDir /path/to/STAR_genome_index/hg38 \
#      --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
#      --readFilesCommand zcat \
#      --outFileNamePrefix sample1/ \
#      --outSAMtype BAM SortedByCoordinate \
#      --outReadsPerGene FeatureCounts \
#      --quantMode GeneCounts

# --- Create dummy sample directories and files for demonstration ---
# In a real pipeline, these files would be generated by STAR.
mkdir -p sample1 sample2 sample3
echo -e "N_unmapped\t0\t0\t0\nN_multimapping\t0\t0\t0\nN_noFeature\t0\t0\t0\nN_ambiguous\t0\t0\t0\nGENE1\t100\t50\t50\nGENE2\t200\t120\t80\nGENE3\t150\t70\t80" > sample1/ReadsPerGene.out.tab
echo -e "N_unmapped\t0\t0\t0\nN_multimapping\t0\t0\t0\nN_noFeature\t0\t0\t0\nN_ambiguous\t0\t0\t0\nGENE1\t110\t60\t50\nGENE2\t210\t130\t80\nGENE3\t160\t75\t85" > sample2/ReadsPerGene.out.tab
echo -e "N_unmapped\t0\t0\t0\nN_multimapping\t0\t0\t0\nN_noFeature\t0\t0\t0\nN_ambiguous\t0\t0\t0\nGENE1\t90\t45\t45\nGENE2\t190\t110\t80\nGENE3\t140\t65\t75" > sample3/ReadsPerGene.out.tab

# --- R script to construct the gene expression matrix ---
# This script reads the 'forward' column (column 3) from STAR's ReadsPerGene.out.tab files
# for multiple samples and combines them into a single gene expression matrix.
cat << 'EOF' > construct_expression_matrix.R
# R script to construct gene expression matrix from STAR's ReadsPerGene.out.tab
# Assumes 'ReadsPerGene.out.tab' files are located in subdirectories named after samples.

# List of sample directories (replace with actual sample directories or pass as argument)
sample_dirs <- c("sample1", "sample2", "sample3")

# Initialize an empty list to store counts for each sample
all_counts <- list()
gene_ids <- NULL

for (s_dir in sample_dirs) {
  file_path <- file.path(s_dir, "ReadsPerGene.out.tab")
  if (!file.exists(file_path)) {
    warning(paste("File not found:", file_path))
    next
  }
  
  # Read the file, skipping the first 4 lines (STAR header)
  # and assuming tab-separated values.
  # The 'ReadsPerGene.out.tab' file typically has 4 columns:
  # 1. Gene ID
  # 2. Unstranded counts
  # 3. Forward strand counts
  # 4. Reverse strand counts
  data <- read.table(file_path, sep="\t", header=FALSE, skip=4, stringsAsFactors=FALSE)
  
  # The first column is Gene ID, the third column is forward strand counts.
  if (is.null(gene_ids)) {
    gene_ids <- data[, 1]
  } else {
    # Ensure gene IDs are consistent across samples
    if (!all(gene_ids == data[, 1])) {
      stop("Gene IDs are not consistent across samples.")
    }
  }
  
  # Store forward strand counts (column 3).
  all_counts[[s_dir]] <- data[, 3]
}

# Combine all counts into a matrix.
if (length(all_counts) > 0) {
  expression_matrix <- do.call(cbind, all_counts)
  colnames(expression_matrix) <- sample_dirs
  rownames(expression_matrix) <- gene_ids
  
  # Save the matrix as a tab-separated file.
  write.table(expression_matrix, "gene_expression_matrix_forward_strand.tsv", sep="\t", row.names=TRUE, quote=FALSE)
  
  message("Gene expression matrix constructed and saved to gene_expression_matrix_forward_strand.tsv")
} else {
  message("No valid sample data found to construct matrix.")
}
EOF

# Execute the R script to construct the matrix
Rscript construct_expression_matrix.R

# --- Clean up dummy files (optional) ---
# rm -rf sample1 sample2 sample3 construct_expression_matrix.R

View on GitHub

Gene symbols and biotype information were extracted from the Gencode GTF file.

GENCODE vlatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Input Gencode GTF file (replace with your specific version and species, e.g., gencode.v45.annotation.gtf for human or gencode.vM25.annotation.gtf for mouse)
GENCODE_GTF="gencode.annotation.gtf"

# Output file for gene symbols and biotypes
OUTPUT_FILE="gene_info.tsv"

# Extract gene_id, gene_name, and gene_type for 'gene' features
awk '
$3 == "gene" {
    gene_id = ""; gene_name = ""; gene_type = "";
    # Split the 9th column (attributes) by semicolon and space
    split($9, attrs, "; ");
    for (i=1; i<=length(attrs); i++) {
        # Remove leading/trailing spaces and quotes
        gsub(/^ +| +$/, "", attrs[i]);
        gsub(/"/, "", attrs[i]);
        
        if (attrs[i] ~ /^gene_id/) {
            gene_id = substr(attrs[i], index(attrs[i], " ") + 1);
        } else if (attrs[i] ~ /^gene_name/) {
            gene_name = substr(attrs[i], index(attrs[i], " ") + 1);
        } else if (attrs[i] ~ /^gene_type/) {
            gene_type = substr(attrs[i], index(attrs[i], " ") + 1);
        }
    }
    # Print gene_id, gene_name, and gene_type separated by tabs
    print gene_id "\t" gene_name "\t" gene_type
}' "${GENCODE_GTF}" > "${OUTPUT_FILE}"

View on GitHub

To calculate normalized counts per million (CPM) TMM normalization was applied across all genes annotated as protein_coding with the edgeR R package.

edgeR GitHub

$ Bash example

# Install R and Bioconductor if not already present
# sudo apt-get update && sudo apt-get install -y r-base
# R -e 'install.packages("BiocManager", repos="https://cloud.r-project.org")'
# R -e 'BiocManager::install("edgeR")'

# Create a dummy raw_counts.tsv for demonstration
# In a real pipeline, this file would be generated by an upstream step (e.g., featureCounts, htseq-count)
cat <<EOF > raw_counts.tsv
GeneID	Sample1	Sample2	Sample3
GeneA	100	150	120
GeneB	200	220	180
GeneC	50	60	70
GeneD	10	15	12
EOF

# Create the R script for edgeR processing
cat <<'EOF' > normalize_cpm_tmm.R
# Load edgeR package
library(edgeR)

# --- Configuration ---
input_counts_file <- "raw_counts.tsv" # Assumed input file containing raw counts
output_cpm_file <- "normalized_cpm_tmm.tsv" # Output file for normalized CPM values

# --- Main processing ---

# 1. Read raw counts
# Assuming the first column is gene IDs and subsequent columns are sample counts.
# The description implies that only protein_coding genes are included in the input.
counts_data <- read.delim(input_counts_file, row.names = 1, check.names = FALSE)

# Ensure counts are numeric and non-negative integers
counts_matrix <- as.matrix(counts_data)
counts_matrix <- round(counts_matrix) # Ensure integers if not already
counts_matrix[counts_matrix < 0] <- 0 # Ensure non-negative

# 2. Create a DGEList object
# A DGEList object is required by edgeR for normalization.
dge <- DGEList(counts = counts_matrix)

# 3. Apply TMM normalization
# TMM (Trimmed Mean of M-values) normalization is applied to calculate normalization factors.
dge <- calcNormFactors(dge, method = "TMM")

# 4. Calculate normalized CPM values
# CPM (Counts Per Million) values are calculated using the TMM normalization factors.
# normalized.lib.sizes = TRUE ensures library sizes are adjusted by normalization factors.
# log = FALSE ensures raw CPM values (not log-transformed) are returned.
# prior.count = 0 ensures no pseudo-counts are added before CPM calculation.
cpm_values <- cpm(dge, normalized.lib.sizes = TRUE, log = FALSE, prior.count = 0)

# 5. Save the normalized CPM matrix
# The resulting matrix is saved to a tab-separated file.
write.table(cpm_values, file = output_cpm_file, sep = "\t", quote = FALSE, col.names = NA)

message(paste("Normalized CPM (TMM) values saved to:", output_cpm_file))
EOF

# Execute the R script to perform TMM normalization and CPM calculation
Rscript normalize_cpm_tmm.R

View on GitHub

Tools Used

STAR

Raw Source Text

Sequencing adapters were trimmed from the raw reads with skewer. Reads were aligned to the human genome version GRCh38_p126. A STAR index (version 2.7.1a) was built with the âsjdbOverhang=50 argument. Splice junctions from Gencode gene models (release M28) were provided via the âsjdbGTFfile argument. STAR alignments were generated with the following parameters: âoutFilterType BySJout, âquantMode TranscriptomeSAM, âoutFilterIntronMotifs RemoveNoncanonicalUnannotated, âoutSAMstrandField intronMotif, âoutSAMattributes NH HI AS nM MD XS and âoutSAMunmapped Within. Alignments were obtained with the following parameters: âreadFilesCommand zcat âoutFilterType BySJout âoutFilterMultimapNmax 20 âalignSJoverhangMin 8 âalignSJDBoverhangMin 1 âoutFilterMismatchNmax 999 âoutFilterMismatchNoverLmax 0.6 âalignIntronMin 20 âalignIntronMax 1000000 âalignMatesGapMax 1000000 âquantMode GeneCounts âoutSAMunmapped Within âoutSAMattributes NH HI AS nM MD XS âoutSAMstrandField intronMotif âoutSAMtype BAM SortedByCoordinate âoutBAMcompression 6. Alignments sharing the same UMI and genomic coordinate were deduplicated using the collapse_UMI_bam tool (Lexogen). The raw gene expression matrix was constructed from the 'forward' column of STAR's ReadsPerGene.out.tab output files using R. Gene symbols and biotype information were extracted from the Gencode GTF file. To calculate normalized counts per million (CPM) TMM normalization was applied across all genes annotated as protein_coding with the edgeR R package.
Assembly: GRCh38
Supplementary files format and content: File raw_and_normalized_counts.xlsx is a microsoft excel file with three worksheets: 1. sample information 2. raw count table 3. normalized count table (log2 counts per million)

← Back to Analysis