GSE121569 Processing Pipeline

RNA-Seq code_examples 12 steps

Publication

Aberrant NOVA1 function disrupts alternative splicing in early stages of amyotrophic lateral sclerosis.

Acta neuropathologica (2022) — PMID 35778567

Dataset

ALS implicated protein TDP-43 sustains levels of STMN2 a mediator of motor neuron growth and repair

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

All FASTQ files were analyzed using the bcbioRNASeq workflow and toolchain91.

bcbio-nextgen (Inferred with models/gemini-2.5-flash) vtoolchain91 GitHub

$ Bash example

# --- Installation (commented out) ---
# It is recommended to install bcbio-nextgen via its installer script:
# wget https://raw.githubusercontent.com/bcbio/bcbio-nextgen/master/scripts/bcbio_nextgen_install.py
# python bcbio_nextgen_install.py /path/to/bcbio --genomes GRCh38 --aligners star --tools
# export PATH="/path/to/bcbio/bin:$PATH"

# --- Prepare input configuration (example) ---
# bcbio-nextgen typically uses a sample CSV or a YAML configuration file.
# This example assumes a 'project.yaml' and 'samples.csv' are prepared.
# For RNA-seq, a common reference genome is GRCh38 (human) or mm10 (mouse).
# We will use GRCh38 as a placeholder.

# Create a dummy samples.csv (replace with actual FASTQ paths and sample names)
cat <<EOF > samples.csv
samplename,description,fastq_1,fastq_2
sample1,sample1,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz
sample2,sample2,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz
EOF

# Create a dummy bcbio_config.yaml (replace with actual project settings)
# This configuration specifies RNA-seq analysis with STAR aligner and GRCh38 genome.
cat <<EOF > bcbio_config.yaml
details:
  - analysis: RNA-seq
    genome_build: GRCh38
    aligner: star
    quality_format: standard
    # Add other relevant bcbio-nextgen RNA-seq parameters as needed
    # The 'toolchain91' mentioned in the description likely refers to a specific
    # bcbio-nextgen installation or configuration used at the time.
upload:
  dir: final_analysis_dir
EOF

# --- Run bcbio-nextgen RNA-seq analysis ---
# The command below executes the bcbio-nextgen workflow.
bcbio_nextgen.py bcbio_config.yaml samples.csv -w run --cores 8 --memory 32G

View on GitHub

The FASTQ files were aligned to the GRCh37/hg19 reference genome.

bowtie2 (Inferred with models/gemini-2.5-flash) v2.4.5 GitHub

$ Bash example

# Install bowtie2 (if not already installed)
# conda install -c bioconda bowtie2

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define variables
FASTQ_FILE="sample.fastq.gz" # Placeholder for input FASTQ file
REFERENCE_GENOME_INDEX_PREFIX="/path/to/GRCh37_hg19_index/GRCh37_hg19" # Path to the bowtie2 index prefix for GRCh37/hg19
OUTPUT_BAM_UNSORTED="aligned_reads.bam"
OUTPUT_BAM_SORTED="aligned_reads.sorted.bam"

# Example: How to build the index for GRCh37/hg19 (UCSC hg19)
# mkdir -p /path/to/GRCh37_hg19_index
# cd /path/to/GRCh37_hg19_index
# wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# gunzip hg19.fa.gz
# bowtie2-build hg19.fa GRCh37_hg19
# cd - # Go back to original directory

# Perform alignment using bowtie2
# Assuming single-end reads. For paired-end, use -1 <read1.fastq> -2 <read2.fastq>
# Output is piped directly to samtools for conversion to BAM
bowtie2 -x "${REFERENCE_GENOME_INDEX_PREFIX}" \
        -U "${FASTQ_FILE}" \
        --threads 8 \
        2> bowtie2_alignment.log | \
        samtools view -bS - > "${OUTPUT_BAM_UNSORTED}"

# Sort the BAM file
samtools sort "${OUTPUT_BAM_UNSORTED}" -o "${OUTPUT_BAM_SORTED}"

# Index the sorted BAM file (optional, but highly recommended for downstream tools)
samtools index "${OUTPUT_BAM_SORTED}"

View on GitHub

Differential expression testing was performed using DESeq2 suite of bioinformatics tools38.

DESeq2 vNot specified. GitHub

$ Bash example

# Install R (if not already installed)
# sudo apt-get update
# sudo apt-get install r-base

# Open R and install BiocManager, then DESeq2
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'

# Create dummy input files for demonstration purposes.
# In a real bioinformatics pipeline, 'counts.csv' would be generated by an upstream quantification tool
# (e.g., featureCounts, Salmon, Kallisto) and 'coldata.csv' would be a sample metadata file.

# Create a dummy count matrix (genes as rows, samples as columns)
cat << 'EOF' > counts.csv
gene1,100,120,90,500,550,480
gene2,50,60,55,200,210,190
gene3,200,210,190,80,90,75
gene4,150,140,160,155,145,165
gene5,30,35,28,100,110,95
EOF

# Create a dummy sample metadata file (sample IDs and conditions)
cat << 'EOF' > coldata.csv
sample1,control
sample2,control
sample3,control
sample4,treated
sample5,treated
sample6,treated
EOF

# Create an R script to perform DESeq2 analysis
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# Load count matrix
# Assuming counts.csv has gene IDs as the first column and sample counts as subsequent columns
counts_data_raw <- read.csv("counts.csv", row.names = 1, header = TRUE)

# Load sample metadata
# Assuming coldata.csv has sample IDs as the first column and condition as the second
col_data <- read.csv("coldata.csv", row.names = 1, header = FALSE, col.names = c("sample", "condition"))

# Ensure sample names in col_data match and are in the same order as in counts_data_raw
col_data <- col_data[colnames(counts_data_raw), , drop = FALSE]
stopifnot(all(rownames(col_data) == colnames(counts_data_raw)))

# Ensure count data is an integer matrix
counts_data <- as.matrix(round(counts_data_raw))

# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = counts_data,
                                  colData = col_data,
                                  design = ~ condition)

# Pre-filtering: remove genes with very low counts across all samples
# This helps to reduce the memory size and increase the speed of the DESeq2 analysis.
# A common filtering threshold is to keep genes with at least 10 counts summed across all samples.
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results for 'treated' vs 'control' comparison
# The contrast argument specifies the comparison to make.
res <- results(dds, contrast=c("condition", "treated", "control"))

# Order results by adjusted p-value (padj)
res_ordered <- res[order(res$padj),]

# Save the differential expression results to a CSV file
write.csv(as.data.frame(res_ordered), "deseq2_results.csv")

message("DESeq2 analysis complete. Results saved to deseq2_results.csv")
EOF

# Execute the R script to perform differential expression analysis
Rscript run_deseq2.R

View on GitHub

The DEXSeq module of Bioconductor was used to identify differential splicing92.

R vR 4.3.x / Bioconductor 3.18 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install BiocManager if not already installed
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"

# Install DEXSeq package if not already installed
# R -e "BiocManager::install('DEXSeq')"

# Create a directory for results
mkdir -p dexseq_results

# Placeholder for input files:
# 1. Exon count files for each sample (e.g., generated by htseq-count -m union -t exon -i gene_id)
#    Example: counts/sample1_exon_counts.txt, counts/sample2_exon_counts.txt, ...
# 2. Sample information file (CSV or TSV) with columns like 'sample_id', 'condition', 'replicate'
#    Example: sample_info.csv
# 3. GTF/GFF annotation file used for counting (e.g., from Ensembl or GENCODE)
#    Example: annotation.gtf

# Placeholder for R script to perform DEXSeq analysis
# This script assumes count files are in a 'counts' subdirectory
# and a 'sample_info.csv' file is in the current directory.
# It also assumes an annotation file (e.g., gtf) is available.

cat << 'EOF' > run_dexseq.R
library(DEXSeq)
library(GenomicFeatures) # For making exon bins from GTF, if needed

# --- Configuration ---
# Path to the directory containing exon count files
count_dir <- "counts"
# Path to the sample information file
sample_info_file <- "sample_info.csv"
# Path to the GTF annotation file (used to define exon bins if not pre-computed)
gtf_file <- "annotation.gtf"
# Output directory for results
output_dir <- "dexseq_results"

# --- Load Sample Information ---
sample_data <- read.csv(sample_info_file, row.names = 1)
# Ensure 'condition' is a factor for the design formula
sample_data$condition <- factor(sample_data$condition)

# --- Prepare Count Files ---
# List all count files matching the pattern
count_files <- list.files(count_dir, pattern = "_exon_counts.txt$", full.names = TRUE)
names(count_files) <- gsub("_exon_counts.txt$", "", basename(count_files))

# Ensure sample order in sample_data matches count file order
sample_data <- sample_data[names(count_files), ]

# Read counts into a list and then combine into a matrix
count_data_list <- lapply(count_files, function(f) {
  read.delim(f, header = FALSE, stringsAsFactors = FALSE)
})

# Extract exon IDs and gene IDs from the first count file (assuming consistent format: "gene_id:exon_id")
exon_ids <- count_data_list[[1]][, 1]
gene_ids <- sapply(strsplit(exon_ids, ":"), `[`, 1)

# Combine counts into a matrix
counts_matrix <- do.call(cbind, lapply(count_data_list, function(df) df[, 2]))
colnames(counts_matrix) <- names(count_files)
rownames(counts_matrix) <- exon_ids

# --- Create DEXSeqDataSet object ---
# The design formula specifies the model for differential exon usage.
# ~ sample + exon + condition:exon tests for differential exon usage between conditions,
# accounting for overall gene expression differences (sample) and baseline exon usage (exon).
# 'sample' is implicitly handled by the DEXSeq functions.
dxd <- DEXSeqDataSet(
  countData = counts_matrix,
  sampleData = sample_data,
  design = ~ sample + exon + condition:exon, # Example design
  featureID = exon_ids,
  groupID = gene_ids
)

# --- Differential Splicing Analysis ---
print("Estimating size factors...")
dxd <- estimateSizeFactors(dxd)

print("Estimating dispersions...")
dxd <- estimateDispersions(dxd)

print("Testing for differential exon usage...")
dxd <- testForDEU(dxd)

print("Estimating fold changes...")
dxd <- estimateExonFoldChanges(dxd, fitExpToVar = "condition")

# --- Get Results ---
dxr <- DEXSeqResults(dxd)

# Order results by adjusted p-value
dxr <- dxr[order(dxr$padj),]

# Save results
write.csv(as.data.frame(dxr), file.path(output_dir, "dexseq_results.csv"), row.names = TRUE)

# Save normalized counts (optional)
# normalized_counts <- counts(dxd, normalized = TRUE)
# write.csv(normalized_counts, file.path(output_dir, "dexseq_normalized_counts.csv"), row.names = TRUE)

print(paste("DEXSeq analysis complete. Results saved to", file.path(output_dir, "dexseq_results.csv")))
EOF

# Create dummy input files for demonstration purposes
mkdir -p counts
echo -e "geneA:exon1\t100\ngeneA:exon2\t50\ngeneB:exon1\t200" > counts/sample1_exon_counts.txt
echo -e "geneA:exon1\t120\ngeneA:exon2\t150\ngeneB:exon1\t210" > counts/sample2_exon_counts.txt
echo -e "geneA:exon1\t110\ngeneA:exon2\t60\ngeneB:exon1\t190" > counts/sample3_exon_counts.txt
echo -e "geneA:exon1\t130\ngeneA:exon2\t160\ngeneB:exon1\t220" > counts/sample4_exon_counts.txt

echo "sample_id,condition,replicate" > sample_info.csv
echo "sample1,control,1" >> sample_info.csv
echo "sample2,treatment,1" >> sample_info.csv
echo "sample3,control,2" >> sample_info.csv
echo "sample4,treatment,2" >> sample_info.csv

# A dummy GTF file (DEXSeq doesn't strictly need it if counts are pre-processed with exon_id:gene_id, but good practice to have a reference)
echo -e "chr1\tHAVANA\tgene\t1000\t2000\t.\t+\t.\tgene_id \"geneA\"; gene_name \"geneA\";" > annotation.gtf
echo -e "chr1\tHAVANA\texon\t1000\t1200\t.\t+\t.\tgene_id \"geneA\"; transcript_id \"geneA-001\"; exon_id \"geneA:exon1\";" >> annotation.gtf
echo -e "chr1\tHAVANA\texon\t1500\t1700\t.\t+\t.\tgene_id \"geneA\"; transcript_id \"geneA-001\"; exon_id \"geneA:exon2\";" >> annotation.gtf
echo -e "chr1\tHAVANA\tgene\t3000\t4000\t.\t+\t.\tgene_id \"geneB\"; gene_name \"geneB\";" >> annotation.gtf
echo -e "chr1\tHAVANA\texon\t3000\t3500\t.\t+\t.\tgene_id \"geneB\"; transcript_id \"geneB-001\"; exon_id \"geneB:exon1\";" >> annotation.gtf

# Execute the R script
Rscript run_dexseq.R

# Clean up dummy files (optional)
# rm -rf counts sample_info.csv annotation.gtf run_dexseq.R

We used salmon to generate the counts and tximport to load them at gene level93,94.

Salmon vSalmon 1.10.2 (Inferred with models/gemini-2.5-flash), tximport 1.28.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# --- Salmon Installation (example) ---
# conda create -n salmon_env salmon=1.10.2 -y
# conda activate salmon_env

# --- tximport Installation (example) ---
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('tximport'); BiocManager::install('GenomicFeatures'); BiocManager::install('rtracklayer')"

# --- Reference Data (Placeholder: Human GRCh38 transcriptome from Ensembl) ---
# Download human GRCh38 transcriptome FASTA (e.g., from Ensembl release 109)
# wget -O Homo_sapiens.GRCh38.cdna.all.fa.gz ftp://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz
# gunzip Homo_sapiens.GRCh38.cdna.all.fa.gz
TRANSCRIPTOME_FASTA="Homo_sapiens.GRCh38.cdna.all.fa"
SALMON_INDEX_DIR="salmon_index_grch38"

# Download human GRCh38 GTF (e.g., from Ensembl release 109) for tx2gene mapping
# wget -O Homo_sapiens.GRCh38.109.gtf.gz ftp://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz
# gunzip Homo_sapiens.GRCh38.109.gtf.gz
GTF_FILE="Homo_sapiens.GRCh38.109.gtf"

# --- Salmon: Indexing (if not already done) ---
# salmon index -t "${TRANSCRIPTOME_FASTA}" -i "${SALMON_INDEX_DIR}"

# --- Salmon: Generate counts ---
# Assuming paired-end reads: sample_1_R1.fastq.gz, sample_1_R2.fastq.gz
# Replace with actual input files and desired output directory
READS_R1="sample_1_R1.fastq.gz"
READS_R2="sample_1_R2.fastq.gz"
SALMON_OUTPUT_DIR="salmon_quant_sample_1"

salmon quant -i "${SALMON_INDEX_DIR}" -l A -1 "${READS_R1}" -2 "${READS_R2}" -p 8 -o "${SALMON_OUTPUT_DIR}" --validateMappings

# --- tximport: Load counts at gene level ---
# Create an R script to perform tximport
cat << 'EOF' > tximport_script.R
library(tximport)
library(readr)
# For tximport documentation and more details, see: https://bioconductor.org/packages/release/bioc/html/tximport.html

# Define paths to salmon quantification files
# Replace with actual paths for all samples. Example for one sample:
quant_files <- c(
  "salmon_quant_sample_1/quant.sf"
  # Add more sample paths here if processing multiple samples:
  # "salmon_quant_sample_2/quant.sf",
  # "salmon_quant_sample_3/quant.sf"
)
names(quant_files) <- basename(dirname(quant_files)) # Use directory names as sample names

# --- Generate tx2gene mapping ---
# This step requires a GTF file and typically uses Bioconductor packages for robustness.
# Placeholder: Homo_sapiens.GRCh38.109.gtf
gtf_file <- "Homo_sapiens.GRCh38.109.gtf" # Ensure this path is correct and file exists

# Recommended robust way to create tx2gene mapping using GenomicFeatures:
# library(GenomicFeatures)
# library(rtracklayer)
# txdb <- makeTxDbFromGFF(gtf_file, format="gtf")
# k <- keys(txdb, keytype = "TXNAME")
# tx2gene <- select(txdb, k, "GENEID", "TXNAME")
# colnames(tx2gene) <- c("TXNAME", "GENEID") # Ensure column names match tximport expectation

# Simplified tx2gene creation for demonstration (less robust than GenomicFeatures):
# This parses transcript_id and gene_id from 'transcript' lines in the GTF.
# It might not cover all GTF variations or edge cases, but works for many standard GTFs.
message("Parsing GTF for tx2gene mapping...")
tx2gene_df <- data.frame(TXNAME=character(), GENEID=character(), stringsAsFactors=FALSE)
gtf_lines <- readLines(gtf_file)
for (line in gtf_lines) {
  if (startsWith(line, "#")) next
  if (grepl("\ttranscript\t", line)) {
    transcript_id_match <- regmatches(line, regexec("transcript_id \"([^\"]+)\"", line))
    gene_id_match <- regmatches(line, regexec("gene_id \"([^\"]+)\"", line))
    if (length(transcript_id_match[[1]]) > 1 && length(gene_id_match[[1]]) > 1) {
      tx_id <- transcript_id_match[[1]][2]
      g_id <- gene_id_match[[1]][2]
      tx2gene_df <- rbind(tx2gene_df, data.frame(TXNAME=tx_id, GENEID=g_id, stringsAsFactors=FALSE))
    }
  }
}
tx2gene <- unique(tx2gene_df) # Remove potential duplicates
message(paste0("Found ", nrow(tx2gene), " transcript-to-gene mappings."))

# Import counts using tximport
message("Importing counts with tximport...")
txi <- tximport(quant_files, type="salmon", tx2gene=tx2gene, ignoreTxVersion=TRUE)

# Access gene-level counts
gene_counts <- txi$counts
gene_abundance <- txi$abundance
gene_length <- txi$length

# Save gene-level counts to a CSV file
write.csv(gene_counts, "gene_level_counts.csv", row.names = TRUE)
write.csv(gene_abundance, "gene_level_abundance.csv", row.names = TRUE)

message("Gene-level counts and abundance saved to gene_level_counts.csv and gene_level_abundance.csv")
EOF

Rscript tximport_script.R

View on GitHub

All p-values are then corrected for multiple comparisons using the method of Benjamini and Hochberg95.

R (Inferred with models/gemini-2.5-flash) v4.2.0

$ Bash example

# Example: Create a dummy p_values.tsv for demonstration
echo -e "gene\tp_value\tother_info" > p_values.tsv
echo -e "geneA\t0.001\tdata1" >> p_values.tsv
echo -e "geneB\t0.05\tdata2" >> p_values.tsv
echo -e "geneC\t0.1\tdata3" >> p_values.tsv
echo -e "geneD\t0.0005\tdata4" >> p_values.tsv
echo -e "geneE\t0.02\tdata5" >> p_values.tsv

# Install R if not already present (commented out)
# sudo apt-get update
# sudo apt-get install r-base

# Create an R script for p-value correction
cat << 'EOF' > correct_pvalues.R
# correct_pvalues.R
# Read p-values from a TSV file
# Assuming the p-values are in a column specified by p_value_column_name
args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]
output_file <- args[2]
p_value_column_name <- args[3]

if (length(args) < 3) {
  stop("Usage: Rscript correct_pvalues.R <input_file.tsv> <output_file.tsv> <p_value_column_name>")
}

data <- read.delim(input_file, header = TRUE, sep = "\t")

if (!(p_value_column_name %in% colnames(data))) {
  stop(paste("P-value column '", p_value_column_name, "' not found in the input file.", sep=""))
}

p_values <- data[[p_value_column_name]]

if (!is.numeric(p_values)) {
  stop(paste("Column '", p_value_column_name, "' is not numeric.", sep=""))
}

# Correct p-values using Benjamini-Hochberg method
adjusted_p_values <- p.adjust(p_values, method = "BH")

# Add adjusted p-values to the data frame
data$adjusted_p_value_BH <- adjusted_p_values

# Write the results to a new TSV file
write.table(data, file = output_file, sep = "\t", quote = FALSE, row.names = FALSE)

message(paste("P-values corrected and saved to:", output_file))
EOF

# Execute the R script
Rscript correct_pvalues.R p_values.tsv adjusted_p_values.tsv p_value

# Clean up the R script (optional)
# rm correct_pvalues.R

We used an adjusted p-value cutoff of 0.05 with no log fold-change ratio cutoff for differential gene expression analysis and a 0.1 cutoff with no log fold-change ratio cutoff for differential exon usage.

DESeq2 (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Install R and Bioconductor packages if not already installed
# conda install -c bioconda bioconductor-deseq2

# Create an R script for differential gene expression analysis using DESeq2
cat << 'EOF' > run_deseq2.R
# Load necessary libraries
library(DESeq2)

# --- Placeholder for loading count data and sample information ---
# In a real scenario, replace these with actual paths to your data
# Example:
# count_matrix <- read.csv("gene_counts.csv", row.names = 1)
# sample_info <- read.csv("sample_metadata.csv", row.names = 1)
# dds <- DESeqDataSetFromMatrix(countData = count_matrix,
#                               colData = sample_info,
#                               design = ~ condition)

# For demonstration, let's create dummy data
set.seed(123)
counts <- matrix(sample(1:1000, 100*6, replace=TRUE), ncol=6)
rownames(counts) <- paste0("gene", 1:100)
colnames(counts) <- paste0("sample", 1:6)
condition <- factor(c("control", "control", "control", "treated", "treated", "treated"))
colData <- data.frame(condition = condition)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = colData, design = ~ condition)

# Pre-filtering: remove rows with very low counts
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

# Run DESeq2 analysis
dds <- DESeq(dds)

# Extract results for differential gene expression
# Adjusted p-value cutoff of 0.05, no log fold-change ratio cutoff
res_gene_expression <- results(dds, alpha = 0.05, lfcThreshold = 0) # lfcThreshold = 0 means no LFC cutoff

# Order results by adjusted p-value
res_gene_expression <- res_gene_expression[order(res_gene_expression$padj),]

# Save results
write.csv(as.data.frame(res_gene_expression), "differential_gene_expression_results.csv")

message("Differential gene expression analysis complete. Results saved to differential_gene_expression_results.csv")

# --- Placeholder for differential exon usage analysis ---
# The description mentions a 0.1 cutoff for differential exon usage.
# This would typically be performed by a separate tool like DEXSeq or DRIMSeq.
# Example conceptual code if DEXSeq were used (not executed here):
# # library(DEXSeq)
# # # Load exon counts and design matrix
# # # dxd <- DEXSeqDataSetFromExonCounts(...)
# # # dxd <- estimateSizeFactors(dxd)
# # # dxd <- estimateDispersions(dxd)
# # # dxd <- testForDEU(dxd)
# # # dxr <- DEXSeqResults(dxd)
# # # dxr_filtered <- dxr[dxr$padj < 0.1,]
# # # write.csv(as.data.frame(dxr_filtered), "differential_exon_usage_results.csv")

EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

[tpm.csv.gz] gene x sample matrix of TPM for each sample

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# Install RSEM (example using conda)
# conda install -c bioconda rsem=1.3.3

# --- RSEM Quantification for a single sample ---
# This command quantifies gene and isoform expression from RNA-Seq data.
# Input: Aligned reads (BAM file) for a single sample.
# Reference: Pre-built RSEM index for the target genome (e.g., hg38).
# Output: Several files, including .genes.results which contains TPM values.

# Placeholder for RSEM index path and name (e.g., for human GRCh38)
# This index would have been prepared using 'rsem-prepare-reference' with a GTF and FASTA file.
RSEM_INDEX_PATH="/path/to/rsem_index/hg38_rsem_index"
# Placeholder for input BAM file for a single sample
INPUT_BAM="sample_1.bam"
# Placeholder for output prefix for the sample
OUTPUT_PREFIX="sample_1_rsem_output"

# Execute RSEM quantification
rsem-calculate-expression \
    --bam \
    --no-qualities \
    -p 8 \
    "${RSEM_INDEX_PATH}" \
    "${INPUT_BAM}" \
    "${OUTPUT_PREFIX}"

# --- Post-processing to create gene x sample TPM matrix (tpm.csv.gz) ---
# After running 'rsem-calculate-expression' for all samples,
# TPM values (typically column 6) from each sample's '.genes.results' file are extracted
# and combined into a single gene x sample matrix. This matrix is then converted to CSV format and gzipped.
# This step usually involves a custom script (e.g., in Python or R) to parse and merge results from multiple samples.
# Example: A script would iterate through all "${SAMPLE_PREFIX}.genes.results" files,
# extract the 'gene_id' and 'TPM' columns, and then merge them into a single table.
# Finally, the merged table would be converted to CSV and compressed:
# custom_tpm_matrix_script.py --input_dir . --output_file tpm.csv.gz

View on GitHub

[raw_counts.csv.gz] gene x sample matrix of raw counts per sample

RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub

$ Bash example

# Install RSEM (example using conda)
# conda install -c bioconda rsem

# Define reference paths (using GRCh38 as a placeholder)
# Replace with actual paths to your reference genome and GTF file.
GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
GTF_FILE="path/to/gencode.v38.annotation.gtf" # e.g., from GENCODE
RSEM_INDEX_DIR="rsem_index_GRCh38"

# 1. Prepare RSEM reference (run once for a given reference)
# This step builds the necessary index files for RSEM.
# rsem-prepare-reference --gtf "${GTF_FILE}" "${GENOME_FASTA}" "${RSEM_INDEX_DIR}"

# --- Example for a single sample quantification (repeated for all samples) ---
# This block demonstrates how to run RSEM for one sample.
# You would repeat this for every sample in your experiment.
SAMPLE_ID="sample_name" # Replace with your actual sample identifier
READ1_FASTQ="path/to/${SAMPLE_ID}_R1.fastq.gz" # Replace with actual path to R1 FASTQ
READ2_FASTQ="path/to/${SAMPLE_ID}_R2.fastq.gz" # Replace with actual path to R2 FASTQ (remove if single-end)
OUTPUT_PREFIX="${SAMPLE_ID}" # Prefix for RSEM output files
NUM_THREADS=8 # Number of threads to use

# 2. Run RSEM to calculate gene and isoform expression for a single sample
# This command uses RSEM's built-in aligner (Bowtie2 by default) for paired-end reads.
# Adjust parameters as needed (e.g., --single-end, --star for STAR alignment).
rsem-calculate-expression \
    --paired-end \
    --num-threads "${NUM_THREADS}" \
    --gzip-output \
    "${READ1_FASTQ}" \
    "${READ2_FASTQ}" \
    "${RSEM_INDEX_DIR}" \
    "${OUTPUT_PREFIX}"

# RSEM will generate files like "${SAMPLE_ID}.genes.results" and "${SAMPLE_ID}.isoforms.results".
# The 'expected_count' column from the ".genes.results" file is typically used for raw counts.

# --- Combine results from multiple samples into a gene x sample matrix ---
# Assume you have run 'rsem-calculate-expression' for multiple samples,
# resulting in files like 'sample1.genes.results', 'sample2.genes.results', etc.
# These files should be in a directory accessible by the next step.

# Define a list of all sample .genes.results files
# Replace with actual paths to your .genes.results files.
# Example: SAMPLE_GENES_RESULTS=("results/sample1.genes.results" "results/sample2.genes.results")
SAMPLE_GENES_RESULTS=(
    "path/to/sample1.genes.results"
    "path/to/sample2.genes.results"
    "path/to/sample3.genes.results"
    # Add all other sample .genes.results files here
)

# 3. Use rsem-generate-data-matrix to create a combined matrix of expected counts
# This command will output a tab-separated file with gene_id as the first column,
# followed by expected counts for each sample.
rsem-generate-data-matrix "${SAMPLE_GENES_RESULTS[@]}" > combined_raw_counts.tsv

# 4. Convert the combined TSV to CSV and gzip it to match the desired output format
# The 'rsem-generate-data-matrix' output has a header like 'gene_id\tsample1_expected_count\tsample2_expected_count...'
# This step replaces tabs with commas and then gzipps the file.
sed 's/\t/,/g' combined_raw_counts.tsv | gzip > raw_counts.csv.gz

# The final 'raw_counts.csv.gz' will contain the gene x sample matrix of raw counts.

View on GitHub

[category_als_vs_control_all.csv.gz] differential expression calls for all genes from DESeq2. (ensgene: Ensembl gene ID, baseMean: mean expression across all samples, log2FoldChange: log2Fold change (knockdown/control), lfcSE: standard error of the log2FoldChange, stat: value of Wald test statistic, pvalue: p-value of differential expression, symbol: gene symbol, description: gene description, biotype: biotype of gene, broadClass: broad biotype class gene belongs to, chr: chromosome gene is on, start: start of gene on chromosome, end: end of gene on chromosome, strand: strand gene is on)

DESeq2 v1.38.3 GitHub

$ Bash example

# Install R and Bioconductor packages if not already installed
# conda create -n deseq2_env r-base r-essentials bioconductor-deseq2 bioconductor-annotationdbi bioconductor-org.hs.eg.db bioconductor-rtracklayer bioconductor-genomicfeatures -y
# conda activate deseq2_env

# Create dummy count data (replace with your actual count matrix)
cat << 'EOF' > counts.tsv
gene_id	sample_ALS_1	sample_ALS_2	sample_Control_1	sample_Control_2
ENSG00000000003	100	120	50	60
ENSG00000000005	200	210	250	260
ENSG00000000419	50	55	100	110
ENSG00000000457	10	12	8	9
ENSG00000000460	300	310	150	160
EOF

# Create dummy sample information (replace with your actual sample metadata)
cat << 'EOF' > sample_info.tsv
sample_id	condition
sample_ALS_1	ALS
sample_ALS_2	ALS
sample_Control_1	control
sample_Control_2	control
EOF

# Download a reference GTF file (e.g., Ensembl GRCh38 release 109)
# This is a large file, so it's commented out. Replace with your actual GTF path.
# wget -nc https://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz

# Create an R script file for DESeq2 differential expression analysis
cat << 'EOF' > run_deseq2.R
# Load necessary libraries
library(DESeq2)
library(AnnotationDbi)
library(org.Hs.eg.db) # For human gene annotations, adjust for other species
library(rtracklayer)
library(GenomicFeatures)

# --- Configuration ---
# Input count matrix file (e.g., from featureCounts)
count_file <- "counts.tsv" 
# Sample information file (e.g., tab-separated, with 'sample_id' and 'condition' columns)
sample_info_file <- "sample_info.tsv"
# Output file name
output_csv <- "category_als_vs_control_all.csv"
# Design formula (e.g., ~ condition)
design_formula <- ~ condition 
# Contrast for results (e.g., c("condition", "ALS", "control"))
contrast_levels <- c("condition", "ALS", "control")
# Reference GTF file for gene annotations (replace with your actual path)
gtf_file <- "Homo_sapiens.GRCh38.109.gtf.gz" 

# --- Load Data ---
# Load count data
count_data <- read.table(count_file, header=TRUE, row.names=1, sep="\t")
# Ensure counts are integers
count_data <- round(count_data)

# Load sample information
sample_info <- read.table(sample_info_file, header=TRUE, row.names=1, sep="\t")
# Ensure sample order matches count data columns
sample_info <- sample_info[colnames(count_data), , drop=FALSE]

# --- Create DESeqDataSet object ---
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_info,
                              design = design_formula)

# --- Run DESeq2 analysis ---
dds <- DESeq(dds)

# --- Extract results ---
# Specify the contrast for differential expression (ALS vs control)
res <- results(dds, contrast=contrast_levels)

# --- Add gene annotations ---
res_df <- as.data.frame(res)
res_df$ensgene <- rownames(res_df)

# Add gene symbols and descriptions from org.Hs.eg.db
res_df$symbol <- mapIds(org.Hs.eg.db,
                        keys=res_df$ensgene,
                        column="SYMBOL",
                        keytype="ENSEMBL",
                        multiVals="first")
res_df$description <- mapIds(org.Hs.eg.db,
                             keys=res_df$ensgene,
                             column="GENENAME",
                             keytype="ENSEMBL",
                             multiVals="first")

# Initialize annotation columns with NA
res_df$biotype <- NA
res_df$broadClass <- NA
res_df$chr <- NA
res_df$start <- NA
res_df$end <- NA
res_df$strand <- NA

# Load and process GTF for genomic annotations if available
if (file.exists(gtf_file)) {
  message("Loading GTF for gene annotations...")
  gtf <- import(gtf_file)
  gene_features <- as.data.frame(gtf[gtf$type == "gene"]) # Filter for gene features
  gene_features <- gene_features[!duplicated(gene_features$gene_id), ] # Ensure unique gene entries
  
  # Prepare annotation data frame from GTF
  gtf_annotations <- data.frame(
    ensgene = gene_features$gene_id,
    biotype = gene_features$gene_biotype,
    chr = as.character(gene_features$seqnames),
    start = gene_features$start,
    end = gene_features$end,
    strand = as.character(gene_features$strand),
    stringsAsFactors = FALSE
  )
  
  # Add broadClass based on biotype (example logic)
  gtf_annotations$broadClass <- ifelse(grepl("protein_coding", gtf_annotations$biotype), "protein_coding", "non_coding")
  
  # Merge with results
  # Use suffixes to handle potential column name conflicts if res_df already had these (e.g., from a previous merge)
  res_df <- merge(res_df, gtf_annotations, by="ensgene", all.x=TRUE, suffixes=c("", ".gtf"))
  
  # Replace NA annotation columns with GTF-derived ones where available
  for (col_name in c("biotype", "broadClass", "chr", "start", "end", "strand")) {
    gtf_col_name <- paste0(col_name, ".gtf")
    if (gtf_col_name %in% colnames(res_df)) {
      # Fill NA values in original column with values from GTF-derived column
      res_df[[col_name]] <- ifelse(is.na(res_df[[col_name]]), res_df[[gtf_col_name]], res_df[[col_name]])
      # Remove the GTF-derived column after merging
      res_df <- res_df[, !(colnames(res_df) == gtf_col_name)]
    }
  }
} else {
  warning(paste("GTF file not found:", gtf_file, "Genomic annotations (biotype, chr, start, end, strand) will be NA."))
}

# Reorder and rename columns to match the description
final_results <- res_df[, c("ensgene", "baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", 
                            "symbol", "description", "biotype", "broadClass", "chr", "start", "end", "strand")]

# Write results to CSV
write.csv(final_results, file=output_csv, row.names=FALSE)

# Compress the output file
system(paste0("gzip ", output_csv))

EOF

# Execute the R script
Rscript run_deseq2.R

View on GitHub

[dexseq-all.csv.gz] gene fragment differential expression calls from DEXSeq. (fragment: gene fragment ID, groupID: group ID of fragment, featureID: gene ID of fragment, exonBaseMean: expressoin of exon across all samples, dispersion: dispersion of fragment across all samples, stat: value from Wald test, pvalue: p-value from Wald test, padj: BH adjusted p-value, scrambled: expression in scrambled siRNA, als: expression in TDP43 knockdown samples, rest of columns are locations of fragments in the genome, counts of the fragment for each individual sample and gene annotation similar to the above)

DEXSeq v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and Bioconductor packages if not already present
# sudo apt-get update && sudo apt-get install -y r-base
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager', repos='https://cloud.r-project.org'); BiocManager::install(c('DEXSeq', 'GenomicFeatures'), update=FALSE, ask=FALSE)

# Define input/output paths and reference files
# Placeholder: Replace with actual paths
REFERENCE_GENOME="hg38"
GENOME_GTF="/path/to/reference/Homo_sapiens.GRCh38.109.gtf" # Original GTF for annotation preparation
FLATTENED_GTF="/path/to/reference/Homo_sapiens.GRCh38.109.flattened.gtf" # Output of dexseq_prepare_annotation.py
COUNT_DIR="/path/to/dexseq_counts" # Directory containing count files from dexseq_count.py
SAMPLE_INFO_FILE="/path/to/sample_info.tsv" # Tab-separated file: sample_id\tcondition\t...
OUTPUT_PREFIX="dexseq-all"

# Ensure output directory exists
mkdir -p "$(dirname "${OUTPUT_PREFIX}.csv.gz")"

# --- Step 1: Prepare annotation (if not already done) ---
# This step uses a Python script from the DEXSeq package.
# # conda install -c bioconda dexseq
# if [ ! -f "${FLATTENED_GTF}" ]; then
#   echo "Preparing flattened GTF annotation..."
#   dexseq_prepare_annotation.py "${GENOME_GTF}" "${FLATTENED_GTF}"
# fi

# --- Step 2: Count reads per exon (if not already done) ---
# This step uses a Python script from the DEXSeq package.
# # Assuming BAM files are in /path/to/aligned_bams/
# # And sample_info.tsv contains sample IDs.
# if [ ! -d "${COUNT_DIR}" ]; then
#   mkdir -p "${COUNT_DIR}"
#   echo "Counting reads for DEXSeq..."
#   # Example loop, adjust based on actual BAM file naming and sample list
#   # for sample_id in $(cut -f1 "${SAMPLE_INFO_FILE}" | tail -n +2); do
#   #   BAM_FILE="/path/to/aligned_bams/${sample_id}.bam"
#   #   OUTPUT_COUNT_FILE="${COUNT_DIR}/${sample_id}.txt"
#   #   if [ -f "${BAM_FILE}" ]; then
#   #     echo "Processing ${sample_id}..."
#   #     dexseq_count.py -p yes -f bam "${FLATTENED_GTF}" "${BAM_FILE}" > "${OUTPUT_COUNT_FILE}"
#   #   else
#   #     echo "Warning: BAM file not found for ${sample_id}: ${BAM_FILE}"
#   #   fi
#   # done
#   echo "Please ensure '${COUNT_DIR}' contains count files generated by dexseq_count.py."
# fi

# --- Step 3: Run DEXSeq differential expression analysis in R ---
echo "Running DEXSeq differential expression analysis..."

# Create a temporary R script
R_SCRIPT=$(mktemp /tmp/dexseq_analysis_XXXXXX.R)

cat <<EOF > "${R_SCRIPT}"
library(DEXSeq)
library(GenomicFeatures) # For makeTxDbFromGFF, etc.

# --- Configuration ---
FLATTENED_GTF <- Sys.getenv('FLATTENED_GTF')
COUNT_DIR <- Sys.getenv('COUNT_DIR')
SAMPLE_INFO_FILE <- Sys.getenv('SAMPLE_INFO_FILE')
OUTPUT_PREFIX <- Sys.getenv('OUTPUT_PREFIX', 'dexseq-all')
REFERENCE_GENOME <- Sys.getenv('REFERENCE_GENOME', 'hg38') # Placeholder

# --- Load Sample Information ---
sample_data <- read.delim(SAMPLE_INFO_FILE, sep='\t', header=TRUE, row.names=1)
if (!'condition' %in% colnames(sample_data)) {
  stop("Sample info file must contain a 'condition' column.")
}
sample_data\$condition <- factor(sample_data\$condition)

# --- Create DEXSeqDataSet ---
# DEXSeqDataSetFromHTSeq expects count files to be named after sample IDs
# and a flattened GTF file.
message("Creating DEXSeqDataSet from HTSeq counts...")
dxd <- DEXSeqDataSetFromHTSeq(
  countDir = COUNT_DIR,
  sampleData = sample_data,
  design = ~ sample + exon + condition:exon, # Standard DEXSeq design
  flattenedfile = FLATTENED_GTF
)

# --- Differential Exon Usage Analysis ---
message("Estimating size factors...")
dxd <- estimateSizeFactors(dxd)
message("Estimating dispersions...")
dxd <- estimateDispersions(dxd)
message("Testing for differential exon usage...")
dxd <- testForDEU(dxd)
message("Estimating exon fold changes...")
dxd <- estimateExonFoldChanges(dxd, fitExpToVar="condition")

# --- Get Results ---
message("Extracting results...")
dxr <- DEXSeqResults(dxd)

# Extract normalized counts for condition-specific means
normalized_counts <- counts(dxd, normalized=TRUE)

# Calculate mean expression for each condition
# Assuming 'scrambled' and 'als' are the conditions based on description
# This part needs to be dynamic based on actual conditions in sample_data
conditions <- levels(sample_data\$condition)
condition_means <- list()
for (cond in conditions) {
  samples_in_cond <- rownames(sample_data)[sample_data\$condition == cond]
  if (length(samples_in_cond) > 0) {
    condition_means[[cond]] <- rowMeans(normalized_counts[, samples_in_cond, drop=FALSE])
  } else {
    condition_means[[cond]] <- rep(NA, nrow(normalized_counts))
  }
}

# Combine results
results_df <- as.data.frame(dxr)

# Add condition-specific means
for (cond in conditions) {
  results_df[[cond]] <- condition_means[[cond]][rownames(results_df)]
}

# Select and reorder columns to match description
# fragment: gene fragment ID (featureID from DEXSeqResults)
# groupID: group ID of fragment (groupID from DEXSeqResults)
# featureID: gene ID of fragment (groupID from DEXSeqResults, as per common usage)
# exonBaseMean: baseMean
# dispersion: dispersion (or dispFit, dispOutlier - using 'dispersion' as it's a combined estimate)
# stat: stat
# pvalue: pvalue
# padj: padj
# scrambled: expression in scrambled siRNA (from condition_means)
# als: expression in TDP43 knockdown samples (from condition_means)
# rest of columns are locations of fragments in the genome (seqnames, start, end, strand, width)
# counts of the fragment for each individual sample (from normalized_counts)
# gene annotation similar to the above (groupID, featureID)

# Prepare final output dataframe
final_results <- data.frame(
  fragment = results_df\$featureID,
  groupID = results_df\$groupID,
  featureID = results_df\$groupID, # Assuming this refers to gene ID
  exonBaseMean = results_df\$baseMean,
  dispersion = results_df\$dispersion,
  stat = results_df\$stat,
  pvalue = results_df\$pvalue,
  padj = results_df\$padj,
  scrambled = results_df\$scrambled, # Assuming 'scrambled' is a condition
  als = results_df\$als,             # Assuming 'als' is a condition
  seqnames = results_df\$seqnames,
  start = results_df\$start,
  end = results_df\$end,
  width = results_df\$width,
  strand = results_df\$strand
)

# Add individual sample normalized counts
# Ensure column names are valid for R data.frame (e.g., no spaces or special chars)
normalized_counts_for_output <- t(normalized_counts[rownames(results_df), ])
colnames(normalized_counts_for_output) <- paste0("norm_count_", colnames(normalized_counts_for_output))
final_results <- cbind(final_results, normalized_counts_for_output)

# Write results to gzipped CSV
message(paste0("Writing results to ", OUTPUT_PREFIX, ".csv.gz"))
write.csv(final_results, gzfile(paste0(OUTPUT_PREFIX, '.csv.gz')), row.names=FALSE)

message("DEXSeq analysis complete.")
EOF

# Set environment variables for the R script
export FLATTENED_GTF="${FLATTENED_GTF}"
export COUNT_DIR="${COUNT_DIR}"
export SAMPLE_INFO_FILE="${SAMPLE_INFO_FILE}"
export OUTPUT_PREFIX="${OUTPUT_PREFIX}"
export REFERENCE_GENOME="${REFERENCE_GENOME}"

# Execute the R script
Rscript "${R_SCRIPT}"

# Clean up temporary R script
rm -f "${R_SCRIPT}"

View on GitHub

[combined.dexseq.gz] exon part count table

DEXSeq vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install DEXSeq and its dependencies (R and Bioconductor)
# R
# install.packages("BiocManager")
# BiocManager::install("DEXSeq")

# Assuming DEXSeq is installed and its python scripts are accessible.
# The 'dexseq_count.py' script is typically found within the DEXSeq R package installation directory.
# For example: /path/to/R/library/DEXSeq/python_scripts/dexseq_count.py

# Define reference genome and annotation versions for placeholder files
GENOME_ASSEMBLY="hg38"
GENCODE_VERSION="v38"

# Define paths for annotation files
# DEXSeq prepared GFF file (this is a prerequisite, generated from a standard GTF/GFF)
# Example: gencode.v38.dexseq.gff would be generated from gencode.v38.annotation.gtf
DEXSEQ_GFF="${GENOME_ASSEMBLY}.${GENCODE_VERSION}.dexseq.gff"

# Placeholder for input BAM file (aligned reads for a single sample)
INPUT_BAM="sample_aligned.bam"

# Output file name as specified in the description
OUTPUT_COUNT_TABLE="combined.dexseq.gz"

# --- Prerequisite: Prepare the annotation file for DEXSeq (commented out as it's a setup step) ---
# This step converts a standard GTF/GFF to a DEXSeq-compatible GFF.
# If you already have the DEXSeq GFF, you can skip this.
# Original GTF from GENCODE (example)
# GENCODE_GTF="${GENOME_ASSEMBLY}.${GENCODE_VERSION}.annotation.gtf.gz"
# DEXSEQ_PREPARE_ANNOTATION_SCRIPT="/path/to/DEXSeq/inst/python_scripts/dexseq_prepare_annotation.py"
# gunzip -c ${GENCODE_GTF} > ${GENCODE_GTF%.gz}
# python ${DEXSEQ_PREPARE_ANNOTATION_SCRIPT} \
#     ${GENCODE_GTF%.gz} \
#     ${DEXSEQ_GFF}

# --- Main step: Count reads per exon part using dexseq_count.py ---
# Path to dexseq_count.py script
DEXSEQ_COUNT_SCRIPT="/path/to/DEXSeq/inst/python_scripts/dexseq_count.py"

# Execute the counting command
# -p yes indicates paired-end reads. Use -p no for single-end reads.
# The output file 'combined.dexseq.gz' will contain the exon part counts for the given sample.
python ${DEXSEQ_COUNT_SCRIPT} \
    -p yes \
    ${DEXSEQ_GFF} \
    ${INPUT_BAM} \
    ${OUTPUT_COUNT_TABLE}

View on GitHub

Tools Used

DESeq2 R

Raw Source Text

All FASTQ files were analyzed using the bcbioRNASeq workflow and toolchain91. The FASTQ files were aligned to the GRCh37/hg19 reference genome.
Differential expression testing was performed using DESeq2 suite of bioinformatics tools38. The DEXSeq module of Bioconductor was used to identify differential splicing92. We used salmon to generate the counts and tximport to load them at gene level93,94. All p-values are then corrected for multiple comparisons using the method of Benjamini and Hochberg95.  We used an adjusted p-value cutoff of 0.05 with no log fold-change ratio cutoff for differential gene expression analysis and a 0.1 cutoff with no log fold-change ratio cutoff for differential exon usage.
Genome_build: hg19
Supplementary_files_format_and_content:
[tpm.csv.gz]  gene x sample matrix of TPM for each sample
[raw_counts.csv.gz]  gene x sample matrix of raw counts per sample
[category_als_vs_control_all.csv.gz]  differential expression calls for all genes from DESeq2. (ensgene: Ensembl gene ID, baseMean: mean expression across all samples, log2FoldChange: log2Fold change (knockdown/control), lfcSE: standard error of the log2FoldChange, stat: value of Wald test statistic, pvalue: p-value of differential expression, symbol: gene symbol, description: gene description, biotype: biotype of gene, broadClass: broad biotype class gene belongs to, chr: chromosome gene is on, start: start of gene on chromosome, end: end of gene on chromosome, strand: strand gene is on)
[dexseq-all.csv.gz]  gene fragment differential expression calls from DEXSeq. (fragment: gene fragment ID, groupID: group ID of fragment, featureID: gene ID of fragment, exonBaseMean: expressoin of exon across all samples, dispersion: dispersion of fragment across all samples, stat: value from Wald test, pvalue: p-value from Wald test, padj: BH adjusted p-value, scrambled: expression in scrambled siRNA, als: expression in TDP43 knockdown samples, rest of columns are locations of fragments in the genome, counts of the fragment for each individual sample and gene annotation similar to the above)
[combined.dexseq.gz]  exon part count table

← Back to Analysis