GSE281960 Processing Pipeline

GSE code_examples 10 steps

Publication

Investigational eIF2B activator DNL343 modulates the integrated stress response in preclinical models of TDP-43 pathology and individuals with ALS in a randomized clinical trial.

Nature communications (2025) — PMID 40825784

Dataset

GSE281960

Investigational eIF2B activator DNL343 modulates the Integrated Stress Response in Preclinical Models of TDP-43 Pathology.

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequencing adapters were trimmed from the raw reads with skewer.

    skewer v0.2.2 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install skewer (if not already installed)
    # conda install -c bioconda skewer
    
    # Define input and output file paths
    # Replace with actual input raw read files (e.g., from a sequencing run)
    RAW_READS_R1="raw_reads_R1.fastq.gz"
    RAW_READS_R2="raw_reads_R2.fastq.gz"
    OUTPUT_PREFIX="trimmed_reads"
    
    # Trim sequencing adapters using skewer
    # skewer automatically detects adapters if -x is not specified.
    # -o: Output file prefix. skewer will append '-trimmed-pair1.fastq.gz' and '-trimmed-pair2.fastq.gz' for paired-end reads.
    skewer -o "${OUTPUT_PREFIX}" "${RAW_READS_R1}" "${RAW_READS_R2}"
  2. 2

    Reads were aligned to the human genome version GRCh38_p126.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # GENOME_DIR should point to the directory containing the pre-built STAR genome index for GRCh38_p126.
    # This index needs to be generated once using STAR --runMode genomeGenerate with the GRCh38_p126 FASTA and GTF files.
    # Reference genome source: NCBI (e.g., GCF_000001405.40_GRCh38.p14, where p126 would be a specific patch release).
    GENOME_DIR="/path/to/STAR_GRCh38_p126_index"
    READ1="sample_R1.fastq.gz" # Placeholder for input read 1 file
    READ2="sample_R2.fastq.gz" # Placeholder for input read 2 file (remove if single-end data)
    OUTPUT_PREFIX="aligned_GRCh38_p126_" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Align reads to the human genome GRCh38_p126
    STAR --genomeDir ${GENOME_DIR} \
         --readFilesIn ${READ1} ${READ2} \
         --runThreadN ${THREADS} \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outSAMattributes Standard \
         --quantMode GeneCounts \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000
  3. 3

    A STAR index (version 2.7.1a) was built with the –sjdbOverhang=50 argument.

    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star
    
    # Define input and output paths
    # Replace with actual paths to your genome FASTA and GTF files
    GENOME_DIR="path/to/genome_references" 
    OUTPUT_INDEX_DIR="path/to/star_index"
    
    # Placeholder for genome and GTF files (e.g., GRCh38, Gencode v44)
    # Download from sources like GENCODE (https://www.gencodegenes.org/human/) or Ensembl
    GENOME_FA="${GENOME_DIR}/GRCh38.primary_assembly.genome.fa"
    GTF_FILE="${GENOME_DIR}/gencode.v44.annotation.gtf"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_INDEX_DIR}"
    
    # Build STAR index
    STAR --runMode genomeGenerate \
         --genomeDir "${OUTPUT_INDEX_DIR}" \
         --genomeFastaFiles "${GENOME_FA}" \
         --sjdbGTFfile "${GTF_FILE}" \
         --sjdbOverhang 50 \
         --runThreadN 8 # Adjust number of threads as needed
    
  4. 4

    Splice junctions from Gencode gene models (release M28) were provided via the –sjdbGTFfile argument.

    GENCODE v2.7.9a GitHub
    $ Bash example
    # Install STAR if not already available
    # conda install -c bioconda star
    
    # Define reference paths (placeholders)
    # Gencode M28 is based on the GRCh38 human reference genome.
    GENOME_FASTA="/path/to/GRCh38.primary_assembly.genome.fa" 
    GENCODE_GTF="/path/to/gencode.vM28.annotation.gtf" # Gencode gene models (release M28)
    STAR_INDEX_DIR="/path/to/STAR_genome_index_gencodeM28"
    
    # Create STAR genome index with Gencode M28 splice junctions
    # The --sjdbGTFfile argument is primarily used during genome index generation
    # to incorporate known splice junctions into the index for improved alignment.
    mkdir -p "${STAR_INDEX_DIR}"
    
    STAR \
      --runMode genomeGenerate \
      --genomeDir "${STAR_INDEX_DIR}" \
      --genomeFastaFiles "${GENOME_FASTA}" \
      --sjdbGTFfile "${GENCODE_GTF}" \
      --sjdbOverhang 100 \
      --runThreadN 8 # Adjust number of threads as needed
  5. 5

    STAR alignments were generated with the following parameters: –outFilterType BySJout, –quantMode TranscriptomeSAM, –outFilterIntronMotifs RemoveNoncanonicalUnannotated, –outSAMstrandField intronMotif, –outSAMattributes NH HI AS nM MD XS and –outSAMunmapped Within.

    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star
    
    # Placeholder for STAR genome index directory (e.g., hg38)
    GENOME_DIR="/path/to/star_genome_index_hg38"
    
    # Placeholder for input FASTQ files (adjust for single-end or paired-end)
    READ1="sample_R1.fastq.gz"
    READ2="sample_R2.fastq.gz" # Remove if single-end
    
    # Placeholder for output file prefix
    OUTPUT_PREFIX="sample_aligned_"
    
    STAR \
      --runThreadN 8 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READ1}" "${READ2}" \
      --readFilesCommand zcat \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outFilterType BySJout \
      --quantMode TranscriptomeSAM \
      --outFilterIntronMotifs RemoveNoncanonicalUnannotated \
      --outSAMstrandField intronMotif \
      --outSAMattributes NH HI AS nM MD XS \
      --outSAMunmapped Within
  6. 6

    Alignments were obtained with the following parameters: –readFilesCommand zcat –outFilterType BySJout –outFilterMultimapNmax 20 –alignSJoverhangMin 8 –alignSJDBoverhangMin 1 –outFilterMismatchNmax 999 –outFilterMismatchNoverLmax 0.6 –alignIntronMin 20 –alignIntronMax 1000000 –alignMatesGapMax 1000000 –quantMode GeneCounts –outSAMunmapped Within –outSAMattributes NH HI AS nM MD XS –outSAMstrandField intronMotif –outSAMtype BAM SortedByCoordinate –outBAMcompression 6.

    STAR (Inferred with models/gemini-2.5-flash) vNot specified GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    # Placeholder for a human GRCh38 STAR genome index. This directory should contain files like SA, SAindex, genome, etc.
    GENOME_DIR="/path/to/STAR_genome_index_GRCh38"
    # Placeholder for input FASTQ files. Adjust for single-end reads if necessary.
    READ1="sample_R1.fastq.gz"
    READ2="sample_R2.fastq.gz"
    # Prefix for all output files generated by STAR
    OUTPUT_PREFIX="sample_aligned"
    
    # STAR alignment command
    STAR \
        --genomeDir "${GENOME_DIR}" \
        --readFilesIn "${READ1}" "${READ2}" \
        --readFilesCommand zcat \
        --outFilterType BySJout \
        --outFilterMultimapNmax 20 \
        --alignSJoverhangMin 8 \
        --alignSJDBoverhangMin 1 \
        --outFilterMismatchNmax 999 \
        --outFilterMismatchNoverLmax 0.6 \
        --alignIntronMin 20 \
        --alignIntronMax 1000000 \
        --alignMatesGapMax 1000000 \
        --quantMode GeneCounts \
        --outSAMunmapped Within \
        --outSAMattributes NH HI AS nM MD XS \
        --outSAMstrandField intronMotif \
        --outSAMtype BAM SortedByCoordinate \
        --outBAMcompression 6 \
        --outFileNamePrefix "${OUTPUT_PREFIX}"
  7. 7

    Alignments sharing the same UMI and genomic coordinate were deduplicated using the collapse_UMI_bam tool (Lexogen).

    collapse_UMI_bam vNot specified (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install Python if not already available
    # conda install python
    
    # Download the collapse_UMI_bam.py script from Lexogen's resources.
    # The exact URL might vary or be provided with the specific kit.
    # wget -O collapse_UMI_bam.py "https://www.lexogen.com/fileadmin/user_upload/support/QuantSeq/collapse_UMI_bam.py"
    # chmod +x collapse_UMI_bam.py
    
    # Ensure the input BAM file is sorted by coordinate before deduplication.
    # samtools sort -o input.sorted.bam input.bam
    # samtools index input.sorted.bam
    
    # Execute the deduplication using collapse_UMI_bam.py
    # Replace 'input.bam' with your actual input BAM file and 'deduplicated.bam' with your desired output file name.
    python collapse_UMI_bam.py -i input.bam -o deduplicated.bam
  8. 8

    The raw gene expression matrix was constructed from the 'forward' column of STAR's ReadsPerGene.out.tab output files using R.

    $ Bash example
    # --- Installation (commented out) ---
    # To install STAR and R (if not already available):
    # conda create -n star_r_env star r-base -y
    # conda activate star_r_env
    
    # --- Placeholder for STAR alignment (upstream step) ---
    # This step generates the 'ReadsPerGene.out.tab' files for each sample.
    # For demonstration, we assume these files already exist in sample-specific directories.
    # Example STAR alignment command (replace with actual genome/annotation paths and sample data):
    # STAR --runThreadN 8 \
    #      --genomeDir /path/to/STAR_genome_index/hg38 \
    #      --readFilesIn sample1_R1.fastq.gz sample1_R2.fastq.gz \
    #      --readFilesCommand zcat \
    #      --outFileNamePrefix sample1/ \
    #      --outSAMtype BAM SortedByCoordinate \
    #      --outReadsPerGene FeatureCounts \
    #      --quantMode GeneCounts
    
    # --- Create dummy sample directories and files for demonstration ---
    # In a real pipeline, these files would be generated by STAR.
    mkdir -p sample1 sample2 sample3
    echo -e "N_unmapped\t0\t0\t0\nN_multimapping\t0\t0\t0\nN_noFeature\t0\t0\t0\nN_ambiguous\t0\t0\t0\nGENE1\t100\t50\t50\nGENE2\t200\t120\t80\nGENE3\t150\t70\t80" > sample1/ReadsPerGene.out.tab
    echo -e "N_unmapped\t0\t0\t0\nN_multimapping\t0\t0\t0\nN_noFeature\t0\t0\t0\nN_ambiguous\t0\t0\t0\nGENE1\t110\t60\t50\nGENE2\t210\t130\t80\nGENE3\t160\t75\t85" > sample2/ReadsPerGene.out.tab
    echo -e "N_unmapped\t0\t0\t0\nN_multimapping\t0\t0\t0\nN_noFeature\t0\t0\t0\nN_ambiguous\t0\t0\t0\nGENE1\t90\t45\t45\nGENE2\t190\t110\t80\nGENE3\t140\t65\t75" > sample3/ReadsPerGene.out.tab
    
    # --- R script to construct the gene expression matrix ---
    # This script reads the 'forward' column (column 3) from STAR's ReadsPerGene.out.tab files
    # for multiple samples and combines them into a single gene expression matrix.
    cat << 'EOF' > construct_expression_matrix.R
    # R script to construct gene expression matrix from STAR's ReadsPerGene.out.tab
    # Assumes 'ReadsPerGene.out.tab' files are located in subdirectories named after samples.
    
    # List of sample directories (replace with actual sample directories or pass as argument)
    sample_dirs <- c("sample1", "sample2", "sample3")
    
    # Initialize an empty list to store counts for each sample
    all_counts <- list()
    gene_ids <- NULL
    
    for (s_dir in sample_dirs) {
      file_path <- file.path(s_dir, "ReadsPerGene.out.tab")
      if (!file.exists(file_path)) {
        warning(paste("File not found:", file_path))
        next
      }
      
      # Read the file, skipping the first 4 lines (STAR header)
      # and assuming tab-separated values.
      # The 'ReadsPerGene.out.tab' file typically has 4 columns:
      # 1. Gene ID
      # 2. Unstranded counts
      # 3. Forward strand counts
      # 4. Reverse strand counts
      data <- read.table(file_path, sep="\t", header=FALSE, skip=4, stringsAsFactors=FALSE)
      
      # The first column is Gene ID, the third column is forward strand counts.
      if (is.null(gene_ids)) {
        gene_ids <- data[, 1]
      } else {
        # Ensure gene IDs are consistent across samples
        if (!all(gene_ids == data[, 1])) {
          stop("Gene IDs are not consistent across samples.")
        }
      }
      
      # Store forward strand counts (column 3).
      all_counts[[s_dir]] <- data[, 3]
    }
    
    # Combine all counts into a matrix.
    if (length(all_counts) > 0) {
      expression_matrix <- do.call(cbind, all_counts)
      colnames(expression_matrix) <- sample_dirs
      rownames(expression_matrix) <- gene_ids
      
      # Save the matrix as a tab-separated file.
      write.table(expression_matrix, "gene_expression_matrix_forward_strand.tsv", sep="\t", row.names=TRUE, quote=FALSE)
      
      message("Gene expression matrix constructed and saved to gene_expression_matrix_forward_strand.tsv")
    } else {
      message("No valid sample data found to construct matrix.")
    }
    EOF
    
    # Execute the R script to construct the matrix
    Rscript construct_expression_matrix.R
    
    # --- Clean up dummy files (optional) ---
    # rm -rf sample1 sample2 sample3 construct_expression_matrix.R
  9. 9

    Gene symbols and biotype information were extracted from the Gencode GTF file.

    GENCODE vlatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Input Gencode GTF file (replace with your specific version and species, e.g., gencode.v45.annotation.gtf for human or gencode.vM25.annotation.gtf for mouse)
    GENCODE_GTF="gencode.annotation.gtf"
    
    # Output file for gene symbols and biotypes
    OUTPUT_FILE="gene_info.tsv"
    
    # Extract gene_id, gene_name, and gene_type for 'gene' features
    awk '
    $3 == "gene" {
        gene_id = ""; gene_name = ""; gene_type = "";
        # Split the 9th column (attributes) by semicolon and space
        split($9, attrs, "; ");
        for (i=1; i<=length(attrs); i++) {
            # Remove leading/trailing spaces and quotes
            gsub(/^ +| +$/, "", attrs[i]);
            gsub(/"/, "", attrs[i]);
            
            if (attrs[i] ~ /^gene_id/) {
                gene_id = substr(attrs[i], index(attrs[i], " ") + 1);
            } else if (attrs[i] ~ /^gene_name/) {
                gene_name = substr(attrs[i], index(attrs[i], " ") + 1);
            } else if (attrs[i] ~ /^gene_type/) {
                gene_type = substr(attrs[i], index(attrs[i], " ") + 1);
            }
        }
        # Print gene_id, gene_name, and gene_type separated by tabs
        print gene_id "\t" gene_name "\t" gene_type
    }' "${GENCODE_GTF}" > "${OUTPUT_FILE}"
  10. 10

    To calculate normalized counts per million (CPM) TMM normalization was applied across all genes annotated as protein_coding with the edgeR R package.

    edgeR GitHub
    $ Bash example
    # Install R and Bioconductor if not already present
    # sudo apt-get update && sudo apt-get install -y r-base
    # R -e 'install.packages("BiocManager", repos="https://cloud.r-project.org")'
    # R -e 'BiocManager::install("edgeR")'
    
    # Create a dummy raw_counts.tsv for demonstration
    # In a real pipeline, this file would be generated by an upstream step (e.g., featureCounts, htseq-count)
    cat <<EOF > raw_counts.tsv
    GeneID	Sample1	Sample2	Sample3
    GeneA	100	150	120
    GeneB	200	220	180
    GeneC	50	60	70
    GeneD	10	15	12
    EOF
    
    # Create the R script for edgeR processing
    cat <<'EOF' > normalize_cpm_tmm.R
    # Load edgeR package
    library(edgeR)
    
    # --- Configuration ---
    input_counts_file <- "raw_counts.tsv" # Assumed input file containing raw counts
    output_cpm_file <- "normalized_cpm_tmm.tsv" # Output file for normalized CPM values
    
    # --- Main processing ---
    
    # 1. Read raw counts
    # Assuming the first column is gene IDs and subsequent columns are sample counts.
    # The description implies that only protein_coding genes are included in the input.
    counts_data <- read.delim(input_counts_file, row.names = 1, check.names = FALSE)
    
    # Ensure counts are numeric and non-negative integers
    counts_matrix <- as.matrix(counts_data)
    counts_matrix <- round(counts_matrix) # Ensure integers if not already
    counts_matrix[counts_matrix < 0] <- 0 # Ensure non-negative
    
    # 2. Create a DGEList object
    # A DGEList object is required by edgeR for normalization.
    dge <- DGEList(counts = counts_matrix)
    
    # 3. Apply TMM normalization
    # TMM (Trimmed Mean of M-values) normalization is applied to calculate normalization factors.
    dge <- calcNormFactors(dge, method = "TMM")
    
    # 4. Calculate normalized CPM values
    # CPM (Counts Per Million) values are calculated using the TMM normalization factors.
    # normalized.lib.sizes = TRUE ensures library sizes are adjusted by normalization factors.
    # log = FALSE ensures raw CPM values (not log-transformed) are returned.
    # prior.count = 0 ensures no pseudo-counts are added before CPM calculation.
    cpm_values <- cpm(dge, normalized.lib.sizes = TRUE, log = FALSE, prior.count = 0)
    
    # 5. Save the normalized CPM matrix
    # The resulting matrix is saved to a tab-separated file.
    write.table(cpm_values, file = output_cpm_file, sep = "\t", quote = FALSE, col.names = NA)
    
    message(paste("Normalized CPM (TMM) values saved to:", output_cpm_file))
    EOF
    
    # Execute the R script to perform TMM normalization and CPM calculation
    Rscript normalize_cpm_tmm.R

Tools Used

Raw Source Text
Sequencing adapters were trimmed from the raw reads with skewer. Reads were aligned to the human genome version GRCh38_p126. A STAR index (version 2.7.1a) was built with the –sjdbOverhang=50 argument. Splice junctions from Gencode gene models (release M28) were provided via the –sjdbGTFfile argument. STAR alignments were generated with the following parameters: –outFilterType BySJout, –quantMode TranscriptomeSAM, –outFilterIntronMotifs RemoveNoncanonicalUnannotated, –outSAMstrandField intronMotif, –outSAMattributes NH HI AS nM MD XS and –outSAMunmapped Within. Alignments were obtained with the following parameters: –readFilesCommand zcat –outFilterType BySJout –outFilterMultimapNmax 20 –alignSJoverhangMin 8 –alignSJDBoverhangMin 1 –outFilterMismatchNmax 999 –outFilterMismatchNoverLmax 0.6 –alignIntronMin 20 –alignIntronMax 1000000 –alignMatesGapMax 1000000 –quantMode GeneCounts –outSAMunmapped Within –outSAMattributes NH HI AS nM MD XS –outSAMstrandField intronMotif –outSAMtype BAM SortedByCoordinate –outBAMcompression 6. Alignments sharing the same UMI and genomic coordinate were deduplicated using the collapse_UMI_bam tool (Lexogen). The raw gene expression matrix was constructed from the 'forward' column of STAR's ReadsPerGene.out.tab output files using R. Gene symbols and biotype information were extracted from the Gencode GTF file. To calculate normalized counts per million (CPM) TMM normalization was applied across all genes annotated as protein_coding with the edgeR R package.
Assembly: GRCh38
Supplementary files format and content: File raw_and_normalized_counts.xlsx is a microsoft excel file with three worksheets: 1. sample information 2. raw count table 3. normalized count table (log2 counts per million)
← Back to Analysis