GSE121569 Processing Pipeline

RNA-Seq code_examples 12 steps

Publication

Aberrant NOVA1 function disrupts alternative splicing in early stages of amyotrophic lateral sclerosis.

Acta neuropathologica (2022) — PMID 35778567

Dataset

GSE121569

ALS implicated protein TDP-43 sustains levels of STMN2 a mediator of motor neuron growth and repair

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    All FASTQ files were analyzed using the bcbioRNASeq workflow and toolchain91.

    bcbio-nextgen (Inferred with models/gemini-2.5-flash) vtoolchain91 GitHub
    $ Bash example
    # --- Installation (commented out) ---
    # It is recommended to install bcbio-nextgen via its installer script:
    # wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
    # python bcbio_nextgen_install.py /path/to/bcbio --genomes GRCh38 --aligners star --tools
    # export PATH="/path/to/bcbio/bin:$PATH"
    
    # --- Prepare input configuration (example) ---
    # bcbio-nextgen typically uses a sample CSV or a YAML configuration file.
    # This example assumes a 'project.yaml' and 'samples.csv' are prepared.
    # For RNA-seq, a common reference genome is GRCh38 (human) or mm10 (mouse).
    # We will use GRCh38 as a placeholder.
    
    # Create a dummy samples.csv (replace with actual FASTQ paths and sample names)
    cat <<EOF > samples.csv
    samplename,description,fastq_1,fastq_2
    sample1,sample1,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz
    sample2,sample2,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz
    EOF
    
    # Create a dummy bcbio_config.yaml (replace with actual project settings)
    # This configuration specifies RNA-seq analysis with STAR aligner and GRCh38 genome.
    cat <<EOF > bcbio_config.yaml
    details:
      - analysis: RNA-seq
        genome_build: GRCh38
        aligner: star
        quality_format: standard
        # Add other relevant bcbio-nextgen RNA-seq parameters as needed
        # The 'toolchain91' mentioned in the description likely refers to a specific
        # bcbio-nextgen installation or configuration used at the time.
    upload:
      dir: final_analysis_dir
    EOF
    
    # --- Run bcbio-nextgen RNA-seq analysis ---
    # The command below executes the bcbio-nextgen workflow.
    bcbio_nextgen.py bcbio_config.yaml samples.csv -w run --cores 8 --memory 32G
  2. 2

    The FASTQ files were aligned to the GRCh37/hg19 reference genome.

    bowtie2 (Inferred with models/gemini-2.5-flash) v2.4.5 GitHub
    $ Bash example
    # Install bowtie2 (if not already installed)
    # conda install -c bioconda bowtie2
    
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Define variables
    FASTQ_FILE="sample.fastq.gz" # Placeholder for input FASTQ file
    REFERENCE_GENOME_INDEX_PREFIX="/path/to/GRCh37_hg19_index/GRCh37_hg19" # Path to the bowtie2 index prefix for GRCh37/hg19
    OUTPUT_BAM_UNSORTED="aligned_reads.bam"
    OUTPUT_BAM_SORTED="aligned_reads.sorted.bam"
    
    # Example: How to build the index for GRCh37/hg19 (UCSC hg19)
    # mkdir -p /path/to/GRCh37_hg19_index
    # cd /path/to/GRCh37_hg19_index
    # wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
    # gunzip hg19.fa.gz
    # bowtie2-build hg19.fa GRCh37_hg19
    # cd - # Go back to original directory
    
    # Perform alignment using bowtie2
    # Assuming single-end reads. For paired-end, use -1 <read1.fastq> -2 <read2.fastq>
    # Output is piped directly to samtools for conversion to BAM
    bowtie2 -x "${REFERENCE_GENOME_INDEX_PREFIX}" \
            -U "${FASTQ_FILE}" \
            --threads 8 \
            2> bowtie2_alignment.log | \
            samtools view -bS - > "${OUTPUT_BAM_UNSORTED}"
    
    # Sort the BAM file
    samtools sort "${OUTPUT_BAM_UNSORTED}" -o "${OUTPUT_BAM_SORTED}"
    
    # Index the sorted BAM file (optional, but highly recommended for downstream tools)
    samtools index "${OUTPUT_BAM_SORTED}"
  3. 3

    Differential expression testing was performed using DESeq2 suite of bioinformatics tools38.

    DESeq2 vNot specified. GitHub
    $ Bash example
    # Install R (if not already installed)
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # Open R and install BiocManager, then DESeq2
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'
    
    # Create dummy input files for demonstration purposes.
    # In a real bioinformatics pipeline, 'counts.csv' would be generated by an upstream quantification tool
    # (e.g., featureCounts, Salmon, Kallisto) and 'coldata.csv' would be a sample metadata file.
    
    # Create a dummy count matrix (genes as rows, samples as columns)
    cat << 'EOF' > counts.csv
    gene1,100,120,90,500,550,480
    gene2,50,60,55,200,210,190
    gene3,200,210,190,80,90,75
    gene4,150,140,160,155,145,165
    gene5,30,35,28,100,110,95
    EOF
    
    # Create a dummy sample metadata file (sample IDs and conditions)
    cat << 'EOF' > coldata.csv
    sample1,control
    sample2,control
    sample3,control
    sample4,treated
    sample5,treated
    sample6,treated
    EOF
    
    # Create an R script to perform DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    library(DESeq2)
    
    # Load count matrix
    # Assuming counts.csv has gene IDs as the first column and sample counts as subsequent columns
    counts_data_raw <- read.csv("counts.csv", row.names = 1, header = TRUE)
    
    # Load sample metadata
    # Assuming coldata.csv has sample IDs as the first column and condition as the second
    col_data <- read.csv("coldata.csv", row.names = 1, header = FALSE, col.names = c("sample", "condition"))
    
    # Ensure sample names in col_data match and are in the same order as in counts_data_raw
    col_data <- col_data[colnames(counts_data_raw), , drop = FALSE]
    stopifnot(all(rownames(col_data) == colnames(counts_data_raw)))
    
    # Ensure count data is an integer matrix
    counts_data <- as.matrix(round(counts_data_raw))
    
    # Create DESeqDataSet object
    dds <- DESeqDataSetFromMatrix(countData = counts_data,
                                      colData = col_data,
                                      design = ~ condition)
    
    # Pre-filtering: remove genes with very low counts across all samples
    # This helps to reduce the memory size and increase the speed of the DESeq2 analysis.
    # A common filtering threshold is to keep genes with at least 10 counts summed across all samples.
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Get results for 'treated' vs 'control' comparison
    # The contrast argument specifies the comparison to make.
    res <- results(dds, contrast=c("condition", "treated", "control"))
    
    # Order results by adjusted p-value (padj)
    res_ordered <- res[order(res$padj),]
    
    # Save the differential expression results to a CSV file
    write.csv(as.data.frame(res_ordered), "deseq2_results.csv")
    
    message("DESeq2 analysis complete. Results saved to deseq2_results.csv")
    EOF
    
    # Execute the R script to perform differential expression analysis
    Rscript run_deseq2.R
  4. 4

    The DEXSeq module of Bioconductor was used to identify differential splicing92.

    R vR 4.3.x / Bioconductor 3.18 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install BiocManager if not already installed
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
    
    # Install DEXSeq package if not already installed
    # R -e "BiocManager::install('DEXSeq')"
    
    # Create a directory for results
    mkdir -p dexseq_results
    
    # Placeholder for input files:
    # 1. Exon count files for each sample (e.g., generated by htseq-count -m union -t exon -i gene_id)
    #    Example: counts/sample1_exon_counts.txt, counts/sample2_exon_counts.txt, ...
    # 2. Sample information file (CSV or TSV) with columns like 'sample_id', 'condition', 'replicate'
    #    Example: sample_info.csv
    # 3. GTF/GFF annotation file used for counting (e.g., from Ensembl or GENCODE)
    #    Example: annotation.gtf
    
    # Placeholder for R script to perform DEXSeq analysis
    # This script assumes count files are in a 'counts' subdirectory
    # and a 'sample_info.csv' file is in the current directory.
    # It also assumes an annotation file (e.g., gtf) is available.
    
    cat << 'EOF' > run_dexseq.R
    library(DEXSeq)
    library(GenomicFeatures) # For making exon bins from GTF, if needed
    
    # --- Configuration ---
    # Path to the directory containing exon count files
    count_dir <- "counts"
    # Path to the sample information file
    sample_info_file <- "sample_info.csv"
    # Path to the GTF annotation file (used to define exon bins if not pre-computed)
    gtf_file <- "annotation.gtf"
    # Output directory for results
    output_dir <- "dexseq_results"
    
    # --- Load Sample Information ---
    sample_data <- read.csv(sample_info_file, row.names = 1)
    # Ensure 'condition' is a factor for the design formula
    sample_data$condition <- factor(sample_data$condition)
    
    # --- Prepare Count Files ---
    # List all count files matching the pattern
    count_files <- list.files(count_dir, pattern = "_exon_counts.txt$", full.names = TRUE)
    names(count_files) <- gsub("_exon_counts.txt$", "", basename(count_files))
    
    # Ensure sample order in sample_data matches count file order
    sample_data <- sample_data[names(count_files), ]
    
    # Read counts into a list and then combine into a matrix
    count_data_list <- lapply(count_files, function(f) {
      read.delim(f, header = FALSE, stringsAsFactors = FALSE)
    })
    
    # Extract exon IDs and gene IDs from the first count file (assuming consistent format: "gene_id:exon_id")
    exon_ids <- count_data_list[[1]][, 1]
    gene_ids <- sapply(strsplit(exon_ids, ":"), `[`, 1)
    
    # Combine counts into a matrix
    counts_matrix <- do.call(cbind, lapply(count_data_list, function(df) df[, 2]))
    colnames(counts_matrix) <- names(count_files)
    rownames(counts_matrix) <- exon_ids
    
    # --- Create DEXSeqDataSet object ---
    # The design formula specifies the model for differential exon usage.
    # ~ sample + exon + condition:exon tests for differential exon usage between conditions,
    # accounting for overall gene expression differences (sample) and baseline exon usage (exon).
    # 'sample' is implicitly handled by the DEXSeq functions.
    dxd <- DEXSeqDataSet(
      countData = counts_matrix,
      sampleData = sample_data,
      design = ~ sample + exon + condition:exon, # Example design
      featureID = exon_ids,
      groupID = gene_ids
    )
    
    # --- Differential Splicing Analysis ---
    print("Estimating size factors...")
    dxd <- estimateSizeFactors(dxd)
    
    print("Estimating dispersions...")
    dxd <- estimateDispersions(dxd)
    
    print("Testing for differential exon usage...")
    dxd <- testForDEU(dxd)
    
    print("Estimating fold changes...")
    dxd <- estimateExonFoldChanges(dxd, fitExpToVar = "condition")
    
    # --- Get Results ---
    dxr <- DEXSeqResults(dxd)
    
    # Order results by adjusted p-value
    dxr <- dxr[order(dxr$padj),]
    
    # Save results
    write.csv(as.data.frame(dxr), file.path(output_dir, "dexseq_results.csv"), row.names = TRUE)
    
    # Save normalized counts (optional)
    # normalized_counts <- counts(dxd, normalized = TRUE)
    # write.csv(normalized_counts, file.path(output_dir, "dexseq_normalized_counts.csv"), row.names = TRUE)
    
    print(paste("DEXSeq analysis complete. Results saved to", file.path(output_dir, "dexseq_results.csv")))
    EOF
    
    # Create dummy input files for demonstration purposes
    mkdir -p counts
    echo -e "geneA:exon1\t100\ngeneA:exon2\t50\ngeneB:exon1\t200" > counts/sample1_exon_counts.txt
    echo -e "geneA:exon1\t120\ngeneA:exon2\t150\ngeneB:exon1\t210" > counts/sample2_exon_counts.txt
    echo -e "geneA:exon1\t110\ngeneA:exon2\t60\ngeneB:exon1\t190" > counts/sample3_exon_counts.txt
    echo -e "geneA:exon1\t130\ngeneA:exon2\t160\ngeneB:exon1\t220" > counts/sample4_exon_counts.txt
    
    echo "sample_id,condition,replicate" > sample_info.csv
    echo "sample1,control,1" >> sample_info.csv
    echo "sample2,treatment,1" >> sample_info.csv
    echo "sample3,control,2" >> sample_info.csv
    echo "sample4,treatment,2" >> sample_info.csv
    
    # A dummy GTF file (DEXSeq doesn't strictly need it if counts are pre-processed with exon_id:gene_id, but good practice to have a reference)
    echo -e "chr1\tHAVANA\tgene\t1000\t2000\t.\t+\t.\tgene_id \"geneA\"; gene_name \"geneA\";" > annotation.gtf
    echo -e "chr1\tHAVANA\texon\t1000\t1200\t.\t+\t.\tgene_id \"geneA\"; transcript_id \"geneA-001\"; exon_id \"geneA:exon1\";" >> annotation.gtf
    echo -e "chr1\tHAVANA\texon\t1500\t1700\t.\t+\t.\tgene_id \"geneA\"; transcript_id \"geneA-001\"; exon_id \"geneA:exon2\";" >> annotation.gtf
    echo -e "chr1\tHAVANA\tgene\t3000\t4000\t.\t+\t.\tgene_id \"geneB\"; gene_name \"geneB\";" >> annotation.gtf
    echo -e "chr1\tHAVANA\texon\t3000\t3500\t.\t+\t.\tgene_id \"geneB\"; transcript_id \"geneB-001\"; exon_id \"geneB:exon1\";" >> annotation.gtf
    
    # Execute the R script
    Rscript run_dexseq.R
    
    # Clean up dummy files (optional)
    # rm -rf counts sample_info.csv annotation.gtf run_dexseq.R
    
  5. 5

    We used salmon to generate the counts and tximport to load them at gene level93,94.

    Salmon vSalmon 1.10.2 (Inferred with models/gemini-2.5-flash), tximport 1.28.0 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # --- Salmon Installation (example) ---
    # conda create -n salmon_env salmon=1.10.2 -y
    # conda activate salmon_env
    
    # --- tximport Installation (example) ---
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('tximport'); BiocManager::install('GenomicFeatures'); BiocManager::install('rtracklayer')"
    
    # --- Reference Data (Placeholder: Human GRCh38 transcriptome from Ensembl) ---
    # Download human GRCh38 transcriptome FASTA (e.g., from Ensembl release 109)
    # wget -O Homo_sapiens.GRCh38.cdna.all.fa.gz ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
    # gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
    TRANSCRIPTOME_FASTA="Homo_sapiens.GRCh38.cdna.all.fa"
    SALMON_INDEX_DIR="salmon_index_grch38"
    
    # Download human GRCh38 GTF (e.g., from Ensembl release 109) for tx2gene mapping
    # wget -O Homo_sapiens.GRCh38.109.gtf.gz ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
    # gunzip Homo_sapiens.GRCh38.109.gtf.gz
    GTF_FILE="Homo_sapiens.GRCh38.109.gtf"
    
    # --- Salmon: Indexing (if not already done) ---
    # salmon index -t "${TRANSCRIPTOME_FASTA}" -i "${SALMON_INDEX_DIR}"
    
    # --- Salmon: Generate counts ---
    # Assuming paired-end reads: sample_1_R1.fastq.gz, sample_1_R2.fastq.gz
    # Replace with actual input files and desired output directory
    READS_R1="sample_1_R1.fastq.gz"
    READS_R2="sample_1_R2.fastq.gz"
    SALMON_OUTPUT_DIR="salmon_quant_sample_1"
    
    salmon quant -i "${SALMON_INDEX_DIR}" -l A -1 "${READS_R1}" -2 "${READS_R2}" -p 8 -o "${SALMON_OUTPUT_DIR}" --validateMappings
    
    # --- tximport: Load counts at gene level ---
    # Create an R script to perform tximport
    cat << 'EOF' > tximport_script.R
    library(tximport)
    library(readr)
    # For tximport documentation and more details, see: https://bioconductor.org/packages/release/bioc/html/tximport.html
    
    # Define paths to salmon quantification files
    # Replace with actual paths for all samples. Example for one sample:
    quant_files <- c(
      "salmon_quant_sample_1/quant.sf"
      # Add more sample paths here if processing multiple samples:
      # "salmon_quant_sample_2/quant.sf",
      # "salmon_quant_sample_3/quant.sf"
    )
    names(quant_files) <- basename(dirname(quant_files)) # Use directory names as sample names
    
    # --- Generate tx2gene mapping ---
    # This step requires a GTF file and typically uses Bioconductor packages for robustness.
    # Placeholder: Homo_sapiens.GRCh38.109.gtf
    gtf_file <- "Homo_sapiens.GRCh38.109.gtf" # Ensure this path is correct and file exists
    
    # Recommended robust way to create tx2gene mapping using GenomicFeatures:
    # library(GenomicFeatures)
    # library(rtracklayer)
    # txdb <- makeTxDbFromGFF(gtf_file, format="gtf")
    # k <- keys(txdb, keytype = "TXNAME")
    # tx2gene <- select(txdb, k, "GENEID", "TXNAME")
    # colnames(tx2gene) <- c("TXNAME", "GENEID") # Ensure column names match tximport expectation
    
    # Simplified tx2gene creation for demonstration (less robust than GenomicFeatures):
    # This parses transcript_id and gene_id from 'transcript' lines in the GTF.
    # It might not cover all GTF variations or edge cases, but works for many standard GTFs.
    message("Parsing GTF for tx2gene mapping...")
    tx2gene_df <- data.frame(TXNAME=character(), GENEID=character(), stringsAsFactors=FALSE)
    gtf_lines <- readLines(gtf_file)
    for (line in gtf_lines) {
      if (startsWith(line, "#")) next
      if (grepl("\ttranscript\t", line)) {
        transcript_id_match <- regmatches(line, regexec("transcript_id \"([^\"]+)\"", line))
        gene_id_match <- regmatches(line, regexec("gene_id \"([^\"]+)\"", line))
        if (length(transcript_id_match[[1]]) > 1 && length(gene_id_match[[1]]) > 1) {
          tx_id <- transcript_id_match[[1]][2]
          g_id <- gene_id_match[[1]][2]
          tx2gene_df <- rbind(tx2gene_df, data.frame(TXNAME=tx_id, GENEID=g_id, stringsAsFactors=FALSE))
        }
      }
    }
    tx2gene <- unique(tx2gene_df) # Remove potential duplicates
    message(paste0("Found ", nrow(tx2gene), " transcript-to-gene mappings."))
    
    # Import counts using tximport
    message("Importing counts with tximport...")
    txi <- tximport(quant_files, type="salmon", tx2gene=tx2gene, ignoreTxVersion=TRUE)
    
    # Access gene-level counts
    gene_counts <- txi$counts
    gene_abundance <- txi$abundance
    gene_length <- txi$length
    
    # Save gene-level counts to a CSV file
    write.csv(gene_counts, "gene_level_counts.csv", row.names = TRUE)
    write.csv(gene_abundance, "gene_level_abundance.csv", row.names = TRUE)
    
    message("Gene-level counts and abundance saved to gene_level_counts.csv and gene_level_abundance.csv")
    EOF
    
    Rscript tximport_script.R
  6. 6

    All p-values are then corrected for multiple comparisons using the method of Benjamini and Hochberg95.

    R (Inferred with models/gemini-2.5-flash) v4.2.0
    $ Bash example
    # Example: Create a dummy p_values.tsv for demonstration
    echo -e "gene\tp_value\tother_info" > p_values.tsv
    echo -e "geneA\t0.001\tdata1" >> p_values.tsv
    echo -e "geneB\t0.05\tdata2" >> p_values.tsv
    echo -e "geneC\t0.1\tdata3" >> p_values.tsv
    echo -e "geneD\t0.0005\tdata4" >> p_values.tsv
    echo -e "geneE\t0.02\tdata5" >> p_values.tsv
    
    # Install R if not already present (commented out)
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # Create an R script for p-value correction
    cat << 'EOF' > correct_pvalues.R
    # correct_pvalues.R
    # Read p-values from a TSV file
    # Assuming the p-values are in a column specified by p_value_column_name
    args <- commandArgs(trailingOnly = TRUE)
    input_file <- args[1]
    output_file <- args[2]
    p_value_column_name <- args[3]
    
    if (length(args) < 3) {
      stop("Usage: Rscript correct_pvalues.R <input_file.tsv> <output_file.tsv> <p_value_column_name>")
    }
    
    data <- read.delim(input_file, header = TRUE, sep = "\t")
    
    if (!(p_value_column_name %in% colnames(data))) {
      stop(paste("P-value column '", p_value_column_name, "' not found in the input file.", sep=""))
    }
    
    p_values <- data[[p_value_column_name]]
    
    if (!is.numeric(p_values)) {
      stop(paste("Column '", p_value_column_name, "' is not numeric.", sep=""))
    }
    
    # Correct p-values using Benjamini-Hochberg method
    adjusted_p_values <- p.adjust(p_values, method = "BH")
    
    # Add adjusted p-values to the data frame
    data$adjusted_p_value_BH <- adjusted_p_values
    
    # Write the results to a new TSV file
    write.table(data, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)
    
    message(paste("P-values corrected and saved to:", output_file))
    EOF
    
    # Execute the R script
    Rscript correct_pvalues.R p_values.tsv adjusted_p_values.tsv p_value
    
    # Clean up the R script (optional)
    # rm correct_pvalues.R
  7. 7

    We used an adjusted p-value cutoff of 0.05 with no log fold-change ratio cutoff for differential gene expression analysis and a 0.1 cutoff with no log fold-change ratio cutoff for differential exon usage.

    DESeq2 (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # Install R and Bioconductor packages if not already installed
    # conda install -c bioconda bioconductor-deseq2
    
    # Create an R script for differential gene expression analysis using DESeq2
    cat << 'EOF' > run_deseq2.R
    # Load necessary libraries
    library(DESeq2)
    
    # --- Placeholder for loading count data and sample information ---
    # In a real scenario, replace these with actual paths to your data
    # Example:
    # count_matrix <- read.csv("gene_counts.csv", row.names = 1)
    # sample_info <- read.csv("sample_metadata.csv", row.names = 1)
    # dds <- DESeqDataSetFromMatrix(countData = count_matrix,
    #                               colData = sample_info,
    #                               design = ~ condition)
    
    # For demonstration, let's create dummy data
    set.seed(123)
    counts <- matrix(sample(1:1000, 100*6, replace=TRUE), ncol=6)
    rownames(counts) <- paste0("gene", 1:100)
    colnames(counts) <- paste0("sample", 1:6)
    condition <- factor(c("control", "control", "control", "treated", "treated", "treated"))
    colData <- data.frame(condition = condition)
    dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition)
    
    # Pre-filtering: remove rows with very low counts
    keep <- rowSums(counts(dds)) >= 10
    dds <- dds[keep,]
    
    # Run DESeq2 analysis
    dds <- DESeq(dds)
    
    # Extract results for differential gene expression
    # Adjusted p-value cutoff of 0.05, no log fold-change ratio cutoff
    res_gene_expression <- results(dds, alpha = 0.05, lfcThreshold = 0) # lfcThreshold = 0 means no LFC cutoff
    
    # Order results by adjusted p-value
    res_gene_expression <- res_gene_expression[order(res_gene_expression$padj),]
    
    # Save results
    write.csv(as.data.frame(res_gene_expression), "differential_gene_expression_results.csv")
    
    message("Differential gene expression analysis complete. Results saved to differential_gene_expression_results.csv")
    
    # --- Placeholder for differential exon usage analysis ---
    # The description mentions a 0.1 cutoff for differential exon usage.
    # This would typically be performed by a separate tool like DEXSeq or DRIMSeq.
    # Example conceptual code if DEXSeq were used (not executed here):
    # # library(DEXSeq)
    # # # Load exon counts and design matrix
    # # # dxd <- DEXSeqDataSetFromExonCounts(...)
    # # # dxd <- estimateSizeFactors(dxd)
    # # # dxd <- estimateDispersions(dxd)
    # # # dxd <- testForDEU(dxd)
    # # # dxr <- DEXSeqResults(dxd)
    # # # dxr_filtered <- dxr[dxr$padj < 0.1,]
    # # # write.csv(as.data.frame(dxr_filtered), "differential_exon_usage_results.csv")
    
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
  8. 8

    [tpm.csv.gz] gene x sample matrix of TPM for each sample

    RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub
    $ Bash example
    # Install RSEM (example using conda)
    # conda install -c bioconda rsem=1.3.3
    
    # --- RSEM Quantification for a single sample ---
    # This command quantifies gene and isoform expression from RNA-Seq data.
    # Input: Aligned reads (BAM file) for a single sample.
    # Reference: Pre-built RSEM index for the target genome (e.g., hg38).
    # Output: Several files, including .genes.results which contains TPM values.
    
    # Placeholder for RSEM index path and name (e.g., for human GRCh38)
    # This index would have been prepared using 'rsem-prepare-reference' with a GTF and FASTA file.
    RSEM_INDEX_PATH="/path/to/rsem_index/hg38_rsem_index"
    # Placeholder for input BAM file for a single sample
    INPUT_BAM="sample_1.bam"
    # Placeholder for output prefix for the sample
    OUTPUT_PREFIX="sample_1_rsem_output"
    
    # Execute RSEM quantification
    rsem-calculate-expression \
        --bam \
        --no-qualities \
        -p 8 \
        "${RSEM_INDEX_PATH}" \
        "${INPUT_BAM}" \
        "${OUTPUT_PREFIX}"
    
    # --- Post-processing to create gene x sample TPM matrix (tpm.csv.gz) ---
    # After running 'rsem-calculate-expression' for all samples,
    # TPM values (typically column 6) from each sample's '.genes.results' file are extracted
    # and combined into a single gene x sample matrix. This matrix is then converted to CSV format and gzipped.
    # This step usually involves a custom script (e.g., in Python or R) to parse and merge results from multiple samples.
    # Example: A script would iterate through all "${SAMPLE_PREFIX}.genes.results" files,
    # extract the 'gene_id' and 'TPM' columns, and then merge them into a single table.
    # Finally, the merged table would be converted to CSV and compressed:
    # custom_tpm_matrix_script.py --input_dir . --output_file tpm.csv.gz
  9. 9

    [raw_counts.csv.gz] gene x sample matrix of raw counts per sample

    RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub
    $ Bash example
    # Install RSEM (example using conda)
    # conda install -c bioconda rsem
    
    # Define reference paths (using GRCh38 as a placeholder)
    # Replace with actual paths to your reference genome and GTF file.
    GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
    GTF_FILE="path/to/gencode.v38.annotation.gtf" # e.g., from GENCODE
    RSEM_INDEX_DIR="rsem_index_GRCh38"
    
    # 1. Prepare RSEM reference (run once for a given reference)
    # This step builds the necessary index files for RSEM.
    # rsem-prepare-reference --gtf "${GTF_FILE}" "${GENOME_FASTA}" "${RSEM_INDEX_DIR}"
    
    # --- Example for a single sample quantification (repeated for all samples) ---
    # This block demonstrates how to run RSEM for one sample.
    # You would repeat this for every sample in your experiment.
    SAMPLE_ID="sample_name" # Replace with your actual sample identifier
    READ1_FASTQ="path/to/${SAMPLE_ID}_R1.fastq.gz" # Replace with actual path to R1 FASTQ
    READ2_FASTQ="path/to/${SAMPLE_ID}_R2.fastq.gz" # Replace with actual path to R2 FASTQ (remove if single-end)
    OUTPUT_PREFIX="${SAMPLE_ID}" # Prefix for RSEM output files
    NUM_THREADS=8 # Number of threads to use
    
    # 2. Run RSEM to calculate gene and isoform expression for a single sample
    # This command uses RSEM's built-in aligner (Bowtie2 by default) for paired-end reads.
    # Adjust parameters as needed (e.g., --single-end, --star for STAR alignment).
    rsem-calculate-expression \
        --paired-end \
        --num-threads "${NUM_THREADS}" \
        --gzip-output \
        "${READ1_FASTQ}" \
        "${READ2_FASTQ}" \
        "${RSEM_INDEX_DIR}" \
        "${OUTPUT_PREFIX}"
    
    # RSEM will generate files like "${SAMPLE_ID}.genes.results" and "${SAMPLE_ID}.isoforms.results".
    # The 'expected_count' column from the ".genes.results" file is typically used for raw counts.
    
    # --- Combine results from multiple samples into a gene x sample matrix ---
    # Assume you have run 'rsem-calculate-expression' for multiple samples,
    # resulting in files like 'sample1.genes.results', 'sample2.genes.results', etc.
    # These files should be in a directory accessible by the next step.
    
    # Define a list of all sample .genes.results files
    # Replace with actual paths to your .genes.results files.
    # Example: SAMPLE_GENES_RESULTS=("results/sample1.genes.results" "results/sample2.genes.results")
    SAMPLE_GENES_RESULTS=(
        "path/to/sample1.genes.results"
        "path/to/sample2.genes.results"
        "path/to/sample3.genes.results"
        # Add all other sample .genes.results files here
    )
    
    # 3. Use rsem-generate-data-matrix to create a combined matrix of expected counts
    # This command will output a tab-separated file with gene_id as the first column,
    # followed by expected counts for each sample.
    rsem-generate-data-matrix "${SAMPLE_GENES_RESULTS[@]}" > combined_raw_counts.tsv
    
    # 4. Convert the combined TSV to CSV and gzip it to match the desired output format
    # The 'rsem-generate-data-matrix' output has a header like 'gene_id\tsample1_expected_count\tsample2_expected_count...'
    # This step replaces tabs with commas and then gzipps the file.
    sed 's/\t/,/g' combined_raw_counts.tsv | gzip > raw_counts.csv.gz
    
    # The final 'raw_counts.csv.gz' will contain the gene x sample matrix of raw counts.
  10. 10

    [category_als_vs_control_all.csv.gz] differential expression calls for all genes from DESeq2. (ensgene: Ensembl gene ID, baseMean: mean expression across all samples, log2FoldChange: log2Fold change (knockdown/control), lfcSE: standard error of the log2FoldChange, stat: value of Wald test statistic, pvalue: p-value of differential expression, symbol: gene symbol, description: gene description, biotype: biotype of gene, broadClass: broad biotype class gene belongs to, chr: chromosome gene is on, start: start of gene on chromosome, end: end of gene on chromosome, strand: strand gene is on)

    $ Bash example
    # Install R and Bioconductor packages if not already installed
    # conda create -n deseq2_env r-base r-essentials bioconductor-deseq2 bioconductor-annotationdbi bioconductor-org.hs.eg.db bioconductor-rtracklayer bioconductor-genomicfeatures -y
    # conda activate deseq2_env
    
    # Create dummy count data (replace with your actual count matrix)
    cat << 'EOF' > counts.tsv
    gene_id	sample_ALS_1	sample_ALS_2	sample_Control_1	sample_Control_2
    ENSG00000000003	100	120	50	60
    ENSG00000000005	200	210	250	260
    ENSG00000000419	50	55	100	110
    ENSG00000000457	10	12	8	9
    ENSG00000000460	300	310	150	160
    EOF
    
    # Create dummy sample information (replace with your actual sample metadata)
    cat << 'EOF' > sample_info.tsv
    sample_id	condition
    sample_ALS_1	ALS
    sample_ALS_2	ALS
    sample_Control_1	control
    sample_Control_2	control
    EOF
    
    # Download a reference GTF file (e.g., Ensembl GRCh38 release 109)
    # This is a large file, so it's commented out. Replace with your actual GTF path.
    # wget -nc https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
    
    # Create an R script file for DESeq2 differential expression analysis
    cat << 'EOF' > run_deseq2.R
    # Load necessary libraries
    library(DESeq2)
    library(AnnotationDbi)
    library(org.Hs.eg.db) # For human gene annotations, adjust for other species
    library(rtracklayer)
    library(GenomicFeatures)
    
    # --- Configuration ---
    # Input count matrix file (e.g., from featureCounts)
    count_file <- "counts.tsv" 
    # Sample information file (e.g., tab-separated, with 'sample_id' and 'condition' columns)
    sample_info_file <- "sample_info.tsv"
    # Output file name
    output_csv <- "category_als_vs_control_all.csv"
    # Design formula (e.g., ~ condition)
    design_formula <- ~ condition 
    # Contrast for results (e.g., c("condition", "ALS", "control"))
    contrast_levels <- c("condition", "ALS", "control")
    # Reference GTF file for gene annotations (replace with your actual path)
    gtf_file <- "Homo_sapiens.GRCh38.109.gtf.gz" 
    
    # --- Load Data ---
    # Load count data
    count_data <- read.table(count_file, header=TRUE, row.names=1, sep="\t")
    # Ensure counts are integers
    count_data <- round(count_data)
    
    # Load sample information
    sample_info <- read.table(sample_info_file, header=TRUE, row.names=1, sep="\t")
    # Ensure sample order matches count data columns
    sample_info <- sample_info[colnames(count_data), , drop=FALSE]
    
    # --- Create DESeqDataSet object ---
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_info,
                                  design = design_formula)
    
    # --- Run DESeq2 analysis ---
    dds <- DESeq(dds)
    
    # --- Extract results ---
    # Specify the contrast for differential expression (ALS vs control)
    res <- results(dds, contrast=contrast_levels)
    
    # --- Add gene annotations ---
    res_df <- as.data.frame(res)
    res_df$ensgene <- rownames(res_df)
    
    # Add gene symbols and descriptions from org.Hs.eg.db
    res_df$symbol <- mapIds(org.Hs.eg.db,
                            keys=res_df$ensgene,
                            column="SYMBOL",
                            keytype="ENSEMBL",
                            multiVals="first")
    res_df$description <- mapIds(org.Hs.eg.db,
                                 keys=res_df$ensgene,
                                 column="GENENAME",
                                 keytype="ENSEMBL",
                                 multiVals="first")
    
    # Initialize annotation columns with NA
    res_df$biotype <- NA
    res_df$broadClass <- NA
    res_df$chr <- NA
    res_df$start <- NA
    res_df$end <- NA
    res_df$strand <- NA
    
    # Load and process GTF for genomic annotations if available
    if (file.exists(gtf_file)) {
      message("Loading GTF for gene annotations...")
      gtf <- import(gtf_file)
      gene_features <- as.data.frame(gtf[gtf$type == "gene"]) # Filter for gene features
      gene_features <- gene_features[!duplicated(gene_features$gene_id), ] # Ensure unique gene entries
      
      # Prepare annotation data frame from GTF
      gtf_annotations <- data.frame(
        ensgene = gene_features$gene_id,
        biotype = gene_features$gene_biotype,
        chr = as.character(gene_features$seqnames),
        start = gene_features$start,
        end = gene_features$end,
        strand = as.character(gene_features$strand),
        stringsAsFactors = FALSE
      )
      
      # Add broadClass based on biotype (example logic)
      gtf_annotations$broadClass <- ifelse(grepl("protein_coding", gtf_annotations$biotype), "protein_coding", "non_coding")
      
      # Merge with results
      # Use suffixes to handle potential column name conflicts if res_df already had these (e.g., from a previous merge)
      res_df <- merge(res_df, gtf_annotations, by="ensgene", all.x=TRUE, suffixes=c("", ".gtf"))
      
      # Replace NA annotation columns with GTF-derived ones where available
      for (col_name in c("biotype", "broadClass", "chr", "start", "end", "strand")) {
        gtf_col_name <- paste0(col_name, ".gtf")
        if (gtf_col_name %in% colnames(res_df)) {
          # Fill NA values in original column with values from GTF-derived column
          res_df[[col_name]] <- ifelse(is.na(res_df[[col_name]]), res_df[[gtf_col_name]], res_df[[col_name]])
          # Remove the GTF-derived column after merging
          res_df <- res_df[, !(colnames(res_df) == gtf_col_name)]
        }
      }
    } else {
      warning(paste("GTF file not found:", gtf_file, "Genomic annotations (biotype, chr, start, end, strand) will be NA."))
    }
    
    # Reorder and rename columns to match the description
    final_results <- res_df[, c("ensgene", "baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", 
                                "symbol", "description", "biotype", "broadClass", "chr", "start", "end", "strand")]
    
    # Write results to CSV
    write.csv(final_results, file=output_csv, row.names=FALSE)
    
    # Compress the output file
    system(paste0("gzip ", output_csv))
    
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
  11. 11

    [dexseq-all.csv.gz] gene fragment differential expression calls from DEXSeq. (fragment: gene fragment ID, groupID: group ID of fragment, featureID: gene ID of fragment, exonBaseMean: expressoin of exon across all samples, dispersion: dispersion of fragment across all samples, stat: value from Wald test, pvalue: p-value from Wald test, padj: BH adjusted p-value, scrambled: expression in scrambled siRNA, als: expression in TDP43 knockdown samples, rest of columns are locations of fragments in the genome, counts of the fragment for each individual sample and gene annotation similar to the above)

    DEXSeq v(Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R and Bioconductor packages if not already present
    # sudo apt-get update && sudo apt-get install -y r-base
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager', repos='https://cloud.r-project.org'); BiocManager::install(c('DEXSeq', 'GenomicFeatures'), update=FALSE, ask=FALSE)
    
    # Define input/output paths and reference files
    # Placeholder: Replace with actual paths
    REFERENCE_GENOME="hg38"
    GENOME_GTF="/path/to/reference/Homo_sapiens.GRCh38.109.gtf" # Original GTF for annotation preparation
    FLATTENED_GTF="/path/to/reference/Homo_sapiens.GRCh38.109.flattened.gtf" # Output of dexseq_prepare_annotation.py
    COUNT_DIR="/path/to/dexseq_counts" # Directory containing count files from dexseq_count.py
    SAMPLE_INFO_FILE="/path/to/sample_info.tsv" # Tab-separated file: sample_id\tcondition\t...
    OUTPUT_PREFIX="dexseq-all"
    
    # Ensure output directory exists
    mkdir -p "$(dirname "${OUTPUT_PREFIX}.csv.gz")"
    
    # --- Step 1: Prepare annotation (if not already done) ---
    # This step uses a Python script from the DEXSeq package.
    # # conda install -c bioconda dexseq
    # if [ ! -f "${FLATTENED_GTF}" ]; then
    #   echo "Preparing flattened GTF annotation..."
    #   dexseq_prepare_annotation.py "${GENOME_GTF}" "${FLATTENED_GTF}"
    # fi
    
    # --- Step 2: Count reads per exon (if not already done) ---
    # This step uses a Python script from the DEXSeq package.
    # # Assuming BAM files are in /path/to/aligned_bams/
    # # And sample_info.tsv contains sample IDs.
    # if [ ! -d "${COUNT_DIR}" ]; then
    #   mkdir -p "${COUNT_DIR}"
    #   echo "Counting reads for DEXSeq..."
    #   # Example loop, adjust based on actual BAM file naming and sample list
    #   # for sample_id in $(cut -f1 "${SAMPLE_INFO_FILE}" | tail -n +2); do
    #   #   BAM_FILE="/path/to/aligned_bams/${sample_id}.bam"
    #   #   OUTPUT_COUNT_FILE="${COUNT_DIR}/${sample_id}.txt"
    #   #   if [ -f "${BAM_FILE}" ]; then
    #   #     echo "Processing ${sample_id}..."
    #   #     dexseq_count.py -p yes -f bam "${FLATTENED_GTF}" "${BAM_FILE}" > "${OUTPUT_COUNT_FILE}"
    #   #   else
    #   #     echo "Warning: BAM file not found for ${sample_id}: ${BAM_FILE}"
    #   #   fi
    #   # done
    #   echo "Please ensure '${COUNT_DIR}' contains count files generated by dexseq_count.py."
    # fi
    
    # --- Step 3: Run DEXSeq differential expression analysis in R ---
    echo "Running DEXSeq differential expression analysis..."
    
    # Create a temporary R script
    R_SCRIPT=$(mktemp /tmp/dexseq_analysis_XXXXXX.R)
    
    cat <<EOF > "${R_SCRIPT}"
    library(DEXSeq)
    library(GenomicFeatures) # For makeTxDbFromGFF, etc.
    
    # --- Configuration ---
    FLATTENED_GTF <- Sys.getenv('FLATTENED_GTF')
    COUNT_DIR <- Sys.getenv('COUNT_DIR')
    SAMPLE_INFO_FILE <- Sys.getenv('SAMPLE_INFO_FILE')
    OUTPUT_PREFIX <- Sys.getenv('OUTPUT_PREFIX', 'dexseq-all')
    REFERENCE_GENOME <- Sys.getenv('REFERENCE_GENOME', 'hg38') # Placeholder
    
    # --- Load Sample Information ---
    sample_data <- read.delim(SAMPLE_INFO_FILE, sep='\t', header=TRUE, row.names=1)
    if (!'condition' %in% colnames(sample_data)) {
      stop("Sample info file must contain a 'condition' column.")
    }
    sample_data\$condition <- factor(sample_data\$condition)
    
    # --- Create DEXSeqDataSet ---
    # DEXSeqDataSetFromHTSeq expects count files to be named after sample IDs
    # and a flattened GTF file.
    message("Creating DEXSeqDataSet from HTSeq counts...")
    dxd <- DEXSeqDataSetFromHTSeq(
      countDir = COUNT_DIR,
      sampleData = sample_data,
      design = ~ sample + exon + condition:exon, # Standard DEXSeq design
      flattenedfile = FLATTENED_GTF
    )
    
    # --- Differential Exon Usage Analysis ---
    message("Estimating size factors...")
    dxd <- estimateSizeFactors(dxd)
    message("Estimating dispersions...")
    dxd <- estimateDispersions(dxd)
    message("Testing for differential exon usage...")
    dxd <- testForDEU(dxd)
    message("Estimating exon fold changes...")
    dxd <- estimateExonFoldChanges(dxd, fitExpToVar="condition")
    
    # --- Get Results ---
    message("Extracting results...")
    dxr <- DEXSeqResults(dxd)
    
    # Extract normalized counts for condition-specific means
    normalized_counts <- counts(dxd, normalized=TRUE)
    
    # Calculate mean expression for each condition
    # Assuming 'scrambled' and 'als' are the conditions based on description
    # This part needs to be dynamic based on actual conditions in sample_data
    conditions <- levels(sample_data\$condition)
    condition_means <- list()
    for (cond in conditions) {
      samples_in_cond <- rownames(sample_data)[sample_data\$condition == cond]
      if (length(samples_in_cond) > 0) {
        condition_means[[cond]] <- rowMeans(normalized_counts[, samples_in_cond, drop=FALSE])
      } else {
        condition_means[[cond]] <- rep(NA, nrow(normalized_counts))
      }
    }
    
    # Combine results
    results_df <- as.data.frame(dxr)
    
    # Add condition-specific means
    for (cond in conditions) {
      results_df[[cond]] <- condition_means[[cond]][rownames(results_df)]
    }
    
    # Select and reorder columns to match description
    # fragment: gene fragment ID (featureID from DEXSeqResults)
    # groupID: group ID of fragment (groupID from DEXSeqResults)
    # featureID: gene ID of fragment (groupID from DEXSeqResults, as per common usage)
    # exonBaseMean: baseMean
    # dispersion: dispersion (or dispFit, dispOutlier - using 'dispersion' as it's a combined estimate)
    # stat: stat
    # pvalue: pvalue
    # padj: padj
    # scrambled: expression in scrambled siRNA (from condition_means)
    # als: expression in TDP43 knockdown samples (from condition_means)
    # rest of columns are locations of fragments in the genome (seqnames, start, end, strand, width)
    # counts of the fragment for each individual sample (from normalized_counts)
    # gene annotation similar to the above (groupID, featureID)
    
    # Prepare final output dataframe
    final_results <- data.frame(
      fragment = results_df\$featureID,
      groupID = results_df\$groupID,
      featureID = results_df\$groupID, # Assuming this refers to gene ID
      exonBaseMean = results_df\$baseMean,
      dispersion = results_df\$dispersion,
      stat = results_df\$stat,
      pvalue = results_df\$pvalue,
      padj = results_df\$padj,
      scrambled = results_df\$scrambled, # Assuming 'scrambled' is a condition
      als = results_df\$als,             # Assuming 'als' is a condition
      seqnames = results_df\$seqnames,
      start = results_df\$start,
      end = results_df\$end,
      width = results_df\$width,
      strand = results_df\$strand
    )
    
    # Add individual sample normalized counts
    # Ensure column names are valid for R data.frame (e.g., no spaces or special chars)
    normalized_counts_for_output <- t(normalized_counts[rownames(results_df), ])
    colnames(normalized_counts_for_output) <- paste0("norm_count_", colnames(normalized_counts_for_output))
    final_results <- cbind(final_results, normalized_counts_for_output)
    
    # Write results to gzipped CSV
    message(paste0("Writing results to ", OUTPUT_PREFIX, ".csv.gz"))
    write.csv(final_results, gzfile(paste0(OUTPUT_PREFIX, '.csv.gz')), row.names=FALSE)
    
    message("DEXSeq analysis complete.")
    EOF
    
    # Set environment variables for the R script
    export FLATTENED_GTF="${FLATTENED_GTF}"
    export COUNT_DIR="${COUNT_DIR}"
    export SAMPLE_INFO_FILE="${SAMPLE_INFO_FILE}"
    export OUTPUT_PREFIX="${OUTPUT_PREFIX}"
    export REFERENCE_GENOME="${REFERENCE_GENOME}"
    
    # Execute the R script
    Rscript "${R_SCRIPT}"
    
    # Clean up temporary R script
    rm -f "${R_SCRIPT}"
    
  12. 12

    [combined.dexseq.gz] exon part count table

    DEXSeq vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install DEXSeq and its dependencies (R and Bioconductor)
    # R
    # install.packages("BiocManager")
    # BiocManager::install("DEXSeq")
    
    # Assuming DEXSeq is installed and its python scripts are accessible.
    # The 'dexseq_count.py' script is typically found within the DEXSeq R package installation directory.
    # For example: /path/to/R/library/DEXSeq/python_scripts/dexseq_count.py
    
    # Define reference genome and annotation versions for placeholder files
    GENOME_ASSEMBLY="hg38"
    GENCODE_VERSION="v38"
    
    # Define paths for annotation files
    # DEXSeq prepared GFF file (this is a prerequisite, generated from a standard GTF/GFF)
    # Example: gencode.v38.dexseq.gff would be generated from gencode.v38.annotation.gtf
    DEXSEQ_GFF="${GENOME_ASSEMBLY}.${GENCODE_VERSION}.dexseq.gff"
    
    # Placeholder for input BAM file (aligned reads for a single sample)
    INPUT_BAM="sample_aligned.bam"
    
    # Output file name as specified in the description
    OUTPUT_COUNT_TABLE="combined.dexseq.gz"
    
    # --- Prerequisite: Prepare the annotation file for DEXSeq (commented out as it's a setup step) ---
    # This step converts a standard GTF/GFF to a DEXSeq-compatible GFF.
    # If you already have the DEXSeq GFF, you can skip this.
    # Original GTF from GENCODE (example)
    # GENCODE_GTF="${GENOME_ASSEMBLY}.${GENCODE_VERSION}.annotation.gtf.gz"
    # DEXSEQ_PREPARE_ANNOTATION_SCRIPT="/path/to/DEXSeq/inst/python_scripts/dexseq_prepare_annotation.py"
    # gunzip -c ${GENCODE_GTF} > ${GENCODE_GTF%.gz}
    # python ${DEXSEQ_PREPARE_ANNOTATION_SCRIPT} \
    #     ${GENCODE_GTF%.gz} \
    #     ${DEXSEQ_GFF}
    
    # --- Main step: Count reads per exon part using dexseq_count.py ---
    # Path to dexseq_count.py script
    DEXSEQ_COUNT_SCRIPT="/path/to/DEXSeq/inst/python_scripts/dexseq_count.py"
    
    # Execute the counting command
    # -p yes indicates paired-end reads. Use -p no for single-end reads.
    # The output file 'combined.dexseq.gz' will contain the exon part counts for the given sample.
    python ${DEXSEQ_COUNT_SCRIPT} \
        -p yes \
        ${DEXSEQ_GFF} \
        ${INPUT_BAM} \
        ${OUTPUT_COUNT_TABLE}

Tools Used

Raw Source Text
All FASTQ files were analyzed using the bcbioRNASeq workflow and toolchain91. The FASTQ files were aligned to the GRCh37/hg19 reference genome.
Differential expression testing was performed using DESeq2 suite of bioinformatics tools38. The DEXSeq module of Bioconductor was used to identify differential splicing92. We used salmon to generate the counts and tximport to load them at gene level93,94. All p-values are then corrected for multiple comparisons using the method of Benjamini and Hochberg95.  We used an adjusted p-value cutoff of 0.05 with no log fold-change ratio cutoff for differential gene expression analysis and a 0.1 cutoff with no log fold-change ratio cutoff for differential exon usage.
Genome_build: hg19
Supplementary_files_format_and_content:
[tpm.csv.gz]  gene x sample matrix of TPM for each sample
[raw_counts.csv.gz]  gene x sample matrix of raw counts per sample
[category_als_vs_control_all.csv.gz]  differential expression calls for all genes from DESeq2. (ensgene: Ensembl gene ID, baseMean: mean expression across all samples, log2FoldChange: log2Fold change (knockdown/control), lfcSE: standard error of the log2FoldChange, stat: value of Wald test statistic, pvalue: p-value of differential expression, symbol: gene symbol, description: gene description, biotype: biotype of gene, broadClass: broad biotype class gene belongs to, chr: chromosome gene is on, start: start of gene on chromosome, end: end of gene on chromosome, strand: strand gene is on)
[dexseq-all.csv.gz]  gene fragment differential expression calls from DEXSeq. (fragment: gene fragment ID, groupID: group ID of fragment, featureID: gene ID of fragment, exonBaseMean: expressoin of exon across all samples, dispersion: dispersion of fragment across all samples, stat: value from Wald test, pvalue: p-value from Wald test, padj: BH adjusted p-value, scrambled: expression in scrambled siRNA, als: expression in TDP43 knockdown samples, rest of columns are locations of fragments in the genome, counts of the fragment for each individual sample and gene annotation similar to the above)
[combined.dexseq.gz]  exon part count table
← Back to Analysis