GSE65973 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Aberrant NOVA1 function disrupts alternative splicing in early stages of amyotrophic lateral sclerosis.

Acta neuropathologica (2022) — PMID 35778567

Dataset

RNA sequencing of mice expressing NLS-hTDP-43

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

3' Illumina adaptors trimmed with cutadapt

cutadapt v4.0 GitHub

$ Bash example

# Install cutadapt via conda
# conda install -c bioconda cutadapt=4.0

# Define input and output filenames (placeholders)
INPUT_FASTQ="input.fastq.gz"
TRIMMED_FASTQ="trimmed.fastq.gz"

# Define common Illumina 3' adapter sequence (for single-end or Read 1 of paired-end)
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# Define minimum read length after trimming and quality threshold
MIN_LENGTH="18"
QUALITY_THRESHOLD="20"

# Define number of CPU cores to use for parallel processing
NUM_CORES="4"

# Execute cutadapt to trim 3' Illumina adaptors and perform quality trimming
# -a: Specifies the 3' adapter sequence to be removed.
# -o: Specifies the output file for trimmed reads.
# --minimum-length: Discards reads shorter than this length after trimming.
# --cores: Uses multiple CPU cores for faster processing.
# -q: Performs quality trimming from both 5' and 3' ends with the specified quality threshold.
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -o "${TRIMMED_FASTQ}" \
  "${INPUT_FASTQ}" \
  --minimum-length "${MIN_LENGTH}" \
  --cores "${NUM_CORES}" \
  -q "${QUALITY_THRESHOLD}","${QUALITY_THRESHOLD}"

View on GitHub

Mapped to mm9 genome build with Bowtie, version 0.12.7 with options --time -k 100 --best --strata -v 2 -p 3, and filtered to a mismatch rate of 6%

Bowtie v0.12.7 GitHub

$ Bash example

# Install Bowtie (if not already installed)
# conda install -c bioconda bowtie=0.12.7

# Define variables
# Replace /path/to/mm9_bowtie_index/mm9 with the actual path to your Bowtie index for mm9.
# The index files (e.g., mm9.1.ebwt, mm9.2.ebwt, etc.) should be in this directory.
BOWTIE_INDEX_PREFIX="/path/to/mm9_bowtie_index/mm9"
INPUT_FASTQ="input_reads.fastq" # Replace with your input FASTQ file (e.g., single-end reads)
OUTPUT_SAM="aligned_reads.sam" # Replace with your desired output SAM file name

# Execute Bowtie alignment
# --time: Report wall-clock time taken by Bowtie
# -k 100: Report up to 100 valid alignments per read
# --best: Report alignments that are "best" in terms of stratum and number of mismatches
# --strata: Don't report alignments for a read unless they fall into the best stratum
# -v 2: Align reads with at most 2 mismatches (this implicitly handles the 6% mismatch rate for typical read lengths)
# -p 3: Use 3 parallel threads
bowtie --time -k 100 --best --strata -v 2 -p 3 "$BOWTIE_INDEX_PREFIX" "$INPUT_FASTQ" > "$OUTPUT_SAM"

View on GitHub

Mapped reads were counted for each exon using bedtools multicov and merged into raw gene counts with bedtoools groupBy

bedtools v2.31.0 GitHub

$ Bash example

# Define reference genome and annotation
GENOME="hg38"
GTF_URL="https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_45/gencode.v45.annotation.gtf.gz"
GTF_FILE="gencode.v45.annotation.gtf.gz"
EXONS_BED="gencode.v45.exons.bed"

# Input BAM file (placeholder - replace with your actual mapped reads BAM file)
MAPPED_READS_BAM="mapped_reads.bam"
# Ensure BAM is sorted and indexed for optimal performance with bedtools
# Example:
# samtools sort -o mapped_reads.sorted.bam mapped_reads.bam
# samtools index mapped_reads.sorted.bam
# MAPPED_READS_BAM="mapped_reads.sorted.bam"

# Output files
EXON_COUNTS_TSV="exon_raw_counts.tsv"
GENE_COUNTS_TSV="gene_raw_counts.tsv"

# --- Installation (commented out) ---
# Install bedtools and optionally samtools (for BAM sorting/indexing) if not already available
# conda install -c bioconda bedtools
# conda install -c bioconda samtools

# --- Prepare exon annotations (if not already available) ---
# Download GTF if not present
if [ ! -f "$GTF_FILE" ]; then
    echo "Downloading GTF file from $GTF_URL..."
    wget "$GTF_URL"
fi

# Convert GTF to BED format for exons, ensuring gene_id is in the name field (column 4).
# This step is crucial for bedtools groupBy later to correctly group by gene.
# The awk script extracts chromosome, 0-based start, 1-based end, gene_id, score (set to 0), and strand.
if [ ! -f "$EXONS_BED" ]; then
    echo "Creating exon BED file from GTF: $EXONS_BED..."
    zcat "$GTF_FILE" | awk '
        BEGIN { OFS="\t" }
        $3 == "exon" {
            chr = $1;
            start = $4 - 1; # Convert to 0-based start
            end = $5;
            strand = $7;
            
            # Extract gene_id from the 9th column (attributes)
            gene_id = "";
            split($9, attrs, "; ");
            for (i=1; i<=length(attrs); i++) {
                if (attrs[i] ~ /^gene_id /) {
                    gene_id = substr(attrs[i], index(attrs[i], "\"")+1, length(attrs[i]) - index(attrs[i], "\"") - 1);
                    break;
                }
            }
            
            if (gene_id != "") {
                print chr, start, end, gene_id, "0", strand;
            }
        }' > "$EXONS_BED"
    echo "Exon BED file created: $EXONS_BED"
else
    echo "Exon BED file already exists: $EXONS_BED"
fi

# --- Main execution ---

# 1. Count reads for each exon using bedtools multicov
#    -b: BED file of features (exons)
#    -f: BAM file of mapped reads
#    Output format: chr\tstart\tend\tgene_id\tscore\tstrand\tcount
echo "Counting reads per exon using bedtools multicov..."
bedtools multicov -b "$EXONS_BED" -f "$MAPPED_READS_BAM" > "$EXON_COUNTS_TSV"

# 2. Merge into raw gene counts with bedtools groupBy
#    -i: Input file (exon_raw_counts.tsv from the previous step)
#    -g 4: Group by the 4th column (which contains the gene_id)
#    -c 7: Operate on the 7th column (which contains the exon read count)
#    -o sum: Sum the values in the 7th column for each group (gene)
echo "Merging exon counts into gene counts using bedtools groupBy..."
bedtools groupBy -i "$EXON_COUNTS_TSV" -g 4 -c 7 -o sum > "$GENE_COUNTS_TSV"

echo "Raw exon counts saved to: $EXON_COUNTS_TSV"
echo "Raw gene counts saved to: $GENE_COUNTS_TSV"

View on GitHub

DESeq2 was used to normalize read counts.

DESeq2 v1.42.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and Bioconductor (if not already installed)
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("DESeq2")'

# This script assumes you have a count matrix file (e.g., 'counts.tsv')
# and an optional sample metadata file (e.g., 'coldata.tsv').
# The count matrix should have gene IDs as row names and sample IDs as column names.
# The metadata file should have sample IDs as row names and relevant experimental factors as columns.

# Example of how to create dummy input files for demonstration:
# echo -e "gene_id\tsample1\tsample2\tsample3\tsample4\ngeneA\t100\t120\t90\t110\ngeneB\t50\t60\t45\t55\ngeneC\t200\t210\t190\t220" > counts.tsv
# echo -e "sample_id\tcondition\nsample1\tcontrol\nsample2\tcontrol\nsample3\ttreated\nsample4\ttreated" > coldata.tsv

Rscript -e '
  library(DESeq2)
  
  # --- Configuration ---
  count_matrix_file <- "counts.tsv" # Path to your raw count matrix file
  sample_metadata_file <- "coldata.tsv" # Path to your sample metadata file
  output_normalized_counts_file <- "normalized_counts.tsv" # Output file for normalized counts
  
  # --- Load Data ---
  # Load count data: first column as row names (gene IDs), tab-separated
  count_data <- read.table(count_matrix_file, header = TRUE, row.names = 1, sep = "\t")
  
  # Load sample metadata: first column as row names (sample IDs), tab-separated
  coldata <- read.table(sample_metadata_file, header = TRUE, row.names = 1, sep = "\t")
  
  # Ensure sample order in coldata matches column order in count_data
  # This is crucial for DESeqDataSetFromMatrix
  coldata <- coldata[colnames(count_data), , drop = FALSE]
  
  # --- Create DESeqDataSet object ---
  # For simple normalization, a design of ~1 is sufficient as we are not performing
  # differential expression analysis here, only estimating size factors.
  dds <- DESeqDataSetFromMatrix(countData = count_data,
                                colData = coldata,
                                design = ~ 1)
  
  # --- Perform Normalization ---
  # Estimate size factors to account for differences in library size and sequencing depth.
  # This is the core normalization step in DESeq2.
  dds <- estimateSizeFactors(dds)
  
  # Extract normalized counts
  normalized_counts <- counts(dds, normalized = TRUE)
  
  # --- Save Results ---
  # Write normalized counts to a tab-separated file
  # col.names = NA ensures that the row names (gene IDs) are written as the first column
  write.table(normalized_counts, output_normalized_counts_file, sep = "\t", quote = FALSE, col.names = NA)
  
  message(paste0("DESeq2 normalization complete. Normalized counts saved to ", output_normalized_counts_file))
'

View on GitHub

Genes were filtered to have a minimum of 10 reads per million in at least one sample, and those counts were used to test for differential expression

DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 GitHub

$ Bash example

# Install R and DESeq2 (if not already installed)
# sudo apt-get update
# sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

# Create dummy input files for demonstration (replace with your actual files)
# gene_counts.csv: A CSV file with gene IDs as row names and sample IDs as column names, containing raw read counts.
# sample_metadata.csv: A CSV file with sample IDs as row names and experimental conditions (e.g., 'condition', 'treatment') as columns.

cat <<EOF > gene_counts.csv
GeneID,sample1,sample2,sample3,sample4,sample5,sample6
geneA,100,120,500,550,110,520
geneB,5,8,10,12,6,11
geneC,200,210,250,260,205,255
geneD,15,18,20,22,16,21
geneE,0,1,0,2,0,1
geneF,1000,1100,1200,1300,1050,1250
EOF

cat <<EOF > sample_metadata.csv
SampleID,condition
sample1,control
sample2,control
sample3,treatment
sample4,treatment
sample5,control
sample6,treatment
EOF

# R script to perform filtering and differential expression using DESeq2
cat <<'EOF_R_SCRIPT' > run_deseq2.R
library(DESeq2)

# Load count matrix
counts_df <- read.csv("gene_counts.csv", row.names = 1)

# Load sample metadata
coldata <- read.csv("sample_metadata.csv", row.names = 1)

# Ensure order of samples in counts_df and coldata matches
stopifnot(all(colnames(counts_df) == rownames(coldata)))

# --- Filtering step: minimum of 10 reads per million in at least one sample ---
# Calculate total reads per sample
total_reads_per_sample <- colSums(counts_df)

# Calculate RPM for each gene in each sample
# Add a small pseudocount to total_reads_per_sample to avoid division by zero if a sample has 0 total reads
rpm_matrix <- t(t(counts_df) / (total_reads_per_sample + 1e-6)) * 1e6

# Identify genes that have at least 10 RPM in at least one sample
keep_genes <- rowSums(rpm_matrix >= 10) >= 1

# Filter the count matrix
filtered_counts_df <- counts_df[keep_genes, ]

# --- Differential Expression using DESeq2 ---
# Create DESeqDataSet object
dds <- DESeqDataSetFromMatrix(countData = round(filtered_counts_df), # DESeq2 expects integer counts
                              colData = coldata,
                              design = ~ condition)

# Run DESeq2
dds <- DESeq(dds)

# Get results for 'treatment' vs 'control'
res <- results(dds, contrast = c("condition", "treatment", "control"))

# Order by adjusted p-value
res_ordered <- res[order(res$padj), ]

# Save results
write.csv(as.data.frame(res_ordered), "deseq2_differential_expression_results.csv")

# Optional: Save normalized counts
normalized_counts <- counts(dds, normalized=TRUE)
write.csv(as.data.frame(normalized_counts), "deseq2_normalized_counts.csv")

message("DESeq2 analysis complete. Results saved to deseq2_differential_expression_results.csv")
message("Normalized counts saved to deseq2_normalized_counts.csv")
EOF_R_SCRIPT

# Execute the R script
Rscript run_deseq2.R

View on GitHub

Tools Used

DESeq2

Raw Source Text

3' Illumina adaptors trimmed with cutadapt
Mapped to mm9 genome build with Bowtie, version 0.12.7 with options --time -k 100 --best --strata -v 2 -p 3, and filtered to a mismatch rate of 6%
Mapped reads were counted for each exon using bedtools multicov and merged into raw gene counts with bedtoools groupBy
DESeq2 was used to normalize read counts. Genes were filtered to have a minimum of 10 reads per million in at least one sample, and those counts were used to test for differential expression
Genome_build: mm9
Supplementary_files_format_and_content: NLS_deseq2_all_alluniq_vs_mmu9htdp43nls.txt is the differential expression output of DESeq2. NLS_norm_counts_all.txt is the file of normalized read counts per gene as computed by DESeq2. The counts are in matrix format, one file for the whole experiment.

← Back to Analysis