GSE76220 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Transcriptome-pathology correlation identifies interplay between TDP-43 and the expression of its kinase CK1E in sporadic ALS.

Acta neuropathologica (2018) — PMID 29881994

Dataset

GSE76220

Gene Expression Signatures of Motor Neuron Populations Isolated from Sporadic ALS

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw reads were adaptor-trimmed and aligned to the human genome (UCSC version hg18) using bowtie (parameters: -l 20 -m 5 -k 5 --best -q hg18).

Bowtie v1.1.2 GitHub

$ Bash example

# Install Bowtie
# conda install -c bioconda bowtie=1.1.2

# Install Cutadapt (for adaptor trimming, inferred with models/gemini-2.5-flash)
# conda install -c bioconda cutadapt

# Download human genome (hg18) and build Bowtie index
# mkdir -p reference
# wget -O reference/hg18.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg18/bigZips/hg18.fa.gz
# gunzip reference/hg18.fa.gz
# bowtie-build reference/hg18.fa reference/hg18

# Define input file placeholders (replace with actual file names)
RAW_READS="raw_reads.fastq.gz" # Assuming gzipped fastq input
TRIMMED_READS="trimmed_reads.fastq"
ALIGNED_SAM="aligned_reads.sam"

# Adaptor trimming using cutadapt (example with common Illumina universal adaptor)
# Adjust adaptor sequence (-a) and minimum length (-m) as needed based on experimental design.
cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -m 20 -o "${TRIMMED_READS}" "${RAW_READS}"

# Align trimmed reads to hg18 using Bowtie
# 'reference/hg18' refers to the basename of the Bowtie index files (e.g., hg18.1.ebwt, hg18.rev.1.ebwt) located in the 'reference' directory.
bowtie -l 20 -m 5 -k 5 --best -q reference/hg18 "${TRIMMED_READS}" > "${ALIGNED_SAM}"

View on GitHub

Aligned reads were assigned to a custom annotation of human genes [36], [48], and gene expression values were calculated as RPKMs [28].

featureCounts (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Install Subread (which includes featureCounts)
# conda install -c bioconda subread

# Placeholder for input BAM file (aligned reads)
INPUT_BAM="aligned_reads.bam"

# Placeholder for custom gene annotation (GTF/GFF format)
# Assuming GRCh38 assembly for human genes
# Replace with actual path to your custom annotation
GENE_ANNOTATION="custom_human_genes_GRCh38.gtf"

# Placeholder for output files
OUTPUT_COUNTS_BASE="gene_counts" # featureCounts will create gene_counts.txt and gene_counts.txt.summary
OUTPUT_RPKMS="gene_expression_rpkm.txt"

# 1. Assign aligned reads to genes using featureCounts
# -a: annotation file
# -o: output file prefix for counts (will create .txt and .txt.summary)
# -F GTF: specify annotation file format
# -t exon: feature type to count (e.g., exon)
# -g gene_id: attribute to group features by (e.g., gene_id)
# -s 0: strand specificity (0=unstranded, 1=stranded, 2=reverse stranded) - common for RNA-seq, adjust if known
# --primary: count only primary alignments (useful if BAM contains secondary/supplementary)
# Add -p if reads are paired-end (e.g., featureCounts ... -p ...)
featureCounts -a "${GENE_ANNOTATION}" -o "${OUTPUT_COUNTS_BASE}.txt" -F GTF -t exon -g gene_id -s 0 --primary "${INPUT_BAM}"

# 2. Calculate RPKMs from raw counts and gene lengths
# Get total assigned reads from the featureCounts summary file
TOTAL_ASSIGNED_READS=$(awk '$1 == "Assigned" {print $2}' "${OUTPUT_COUNTS_BASE}.txt.summary")

# Check if total assigned reads is valid
if [ -z "${TOTAL_ASSIGNED_READS}" ] || [ "${TOTAL_ASSIGNED_READS}" -eq 0 ]; then
    echo "Error: Could not get total assigned reads or total assigned reads is zero."
    exit 1
fi

# Create a header for the RPKM file
echo -e "GeneID\tRPKM" > "${OUTPUT_RPKMS}"

# Process the featureCounts output file to calculate RPKM
# Skip the header line (line 1)
# Column 1: GeneID, Column 6: Length, Column 7: Count (for the first BAM)
tail -n +2 "${OUTPUT_COUNTS_BASE}.txt" | awk -v total_reads="${TOTAL_ASSIGNED_READS}" '
BEGIN { OFS="\t" }
{
    gene_id = $1;
    gene_length = $6;
    read_count = $7;

    if (gene_length > 0 && total_reads > 0) {
        rpkm = (read_count * 1000000000) / (gene_length * total_reads);
        print gene_id, rpkm;
    } else {
        print gene_id, 0; # Handle cases with zero length or zero reads
    }
}' >> "${OUTPUT_RPKMS}"

Control versus sALS patient datasets were compared in a pairwise fashion, calculating ratio fold-changes for all possible 117 disease versus control pairs.

R (Inferred with models/gemini-2.5-flash) vlatest stable GitHub

$ Bash example

# Install R (if not already installed)
# sudo apt-get update
# sudo apt-get install r-base

# Create dummy input data files for demonstration
# data_matrix.tsv: Features (rows) x Samples (columns)
# This dummy data is structured to yield 9 sALS samples and 13 Control samples, resulting in 9 * 13 = 117 pairs.
cat <<EOF > data_matrix.tsv
Feature\tControl_1\tControl_2\tControl_3\tControl_4\tControl_5\tControl_6\tControl_7\tControl_8\tControl_9\tControl_10\tControl_11\tControl_12\tControl_13\tsALS_1\tsALS_2\tsALS_3\tsALS_4\tsALS_5\tsALS_6\tsALS_7\tsALS_8\tsALS_9
GeneA\t100\t110\t95\t105\t102\t98\t103\t107\t99\t101\t104\t106\t97\t200\t210\t190\t220\t230\t240\t250\t260\t270
GeneB\t50\t55\t48\t52\t51\t49\t53\t56\t47\t50\t54\t57\t46\t25\t28\t24\t26\t27\t29\t30\t31\t32
GeneC\t200\t205\t198\t202\t201\t199\t203\t206\t197\t200\t204\t207\t196\t100\t105\t98\t102\t103\t104\t105\t106\t107
EOF

# sample_info.tsv: SampleID, Group
cat <<EOF > sample_info.tsv
SampleID\tGroup
Control_1\tControl
Control_2\tControl
Control_3\tControl
Control_4\tControl
Control_5\tControl
Control_6\tControl
Control_7\tControl
Control_8\tControl
Control_9\tControl
Control_10\tControl
Control_11\tControl
Control_12\tControl
Control_13\tControl
sALS_1\tsALS
sALS_2\tsALS
sALS_3\tsALS
sALS_4\tsALS
sALS_5\tsALS
sALS_6\tsALS
sALS_7\tsALS
sALS_8\tsALS
sALS_9\tsALS
EOF

# Create the R script for calculating pairwise fold changes
cat <<'EOF' > calculate_fold_changes.R
# Load data
data <- read.delim("data_matrix.tsv", row.names = 1, check.names = FALSE)
sample_info <- read.delim("sample_info.tsv")

# Identify control and sALS samples
control_samples <- sample_info$SampleID[sample_info$Group == "Control"]
sals_samples <- sample_info$SampleID[sample_info$Group == "sALS"]

# Ensure samples are present in the data matrix
control_samples <- intersect(control_samples, colnames(data))
sals_samples <- intersect(sals_samples, colnames(data))

# Initialize a data frame to store results
results_df <- data.frame()

# Iterate through all possible disease versus control pairs
# The description states "all possible 117 disease versus control pairs".
# This script will calculate all possible pairs based on the input data.
message(paste("Found", length(sals_samples), "sALS samples and", length(control_samples), "Control samples."))
message(paste("Total possible pairs to calculate:", length(sals_samples) * length(control_samples)))

for (sals_id in sals_samples) {
  for (control_id in control_samples) {
    # Calculate ratio fold-change for each feature
    # Add a small pseudocount to avoid division by zero or log(0)
    pseudocount <- 1e-6
    
    # Ensure samples exist in data
    if (sals_id %in% colnames(data) && control_id %in% colnames(data)) {
      fold_changes <- (data[[sals_id]] + pseudocount) / (data[[control_id]] + pseudocount)
      
      # Store results
      pair_name <- paste0(sals_id, "_vs_", control_id)
      temp_df <- data.frame(
        Feature = rownames(data),
        Pair = pair_name,
        FoldChange = fold_changes,
        stringsAsFactors = FALSE
      )
      results_df <- rbind(results_df, temp_df)
    } else {
      warning(paste("Skipping pair:", sals_id, "vs", control_id, "- one or both samples not found in data."))
    }
  }
}

# Output results
write.csv(results_df, "pairwise_fold_changes.csv", row.names = FALSE)
message("Pairwise fold changes calculated and saved to 'pairwise_fold_changes.csv'")
EOF

# Execute the R script
Rscript calculate_fold_changes.R

View on GitHub

For Median Percentile Rank (MPR) analysis, each set of fold-changes were sorted and given a percentile rank, and the median percentile rank for each gene was determined and plotted as a histogram.

Python (Inferred with models/gemini-2.5-flash) v3.9 GitHub

$ Bash example

# Install dependencies (if not already installed)
# conda install pandas numpy matplotlib seaborn

# Create a dummy input file for demonstration
# In a real scenario, 'fold_changes.tsv' would be generated by a previous step.
echo -e "gene\tcond1_fc\tcond2_fc\tcond3_fc\tcond4_fc" > fold_changes.tsv
echo -e "geneA\t1.2\t0.8\t1.5\t1.1" >> fold_changes.tsv
echo -e "geneB\t-0.5\t0.2\t-0.1\t0.3" >> fold_changes.tsv
echo -e "geneC\t2.0\t1.8\t2.2\t1.9" >> fold_changes.tsv
echo -e "geneD\t0.1\t-0.3\t0.0\t-0.2" >> fold_changes.tsv
echo -e "geneE\t-1.0\t-1.2\t-0.8\t-0.9" >> fold_changes.tsv
echo -e "geneF\t0.5\t0.6\t0.4\t0.7" >> fold_changes.tsv
echo -e "geneG\t-1.5\t-1.8\t-1.3\t-1.6" >> fold_changes.tsv
echo -e "geneH\t2.5\t2.3\t2.7\t2.4" >> fold_changes.tsv
echo -e "geneI\t-0.2\t0.1\t-0.3\t0.0" >> fold_changes.tsv
echo -e "geneJ\t1.0\t1.1\t0.9\t1.2" >> fold_changes.tsv

# Python script for Median Percentile Rank (MPR) analysis
python -c "
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load fold-change data
# Assuming the input file 'fold_changes.tsv' has gene names in the first column
# and fold-changes in subsequent columns (representing different experiments/conditions).
try:
    df = pd.read_csv('fold_changes.tsv', sep='\t', index_col='gene')
except FileNotFoundError:
    print('Error: fold_changes.tsv not found. Please ensure the input file exists.')
    exit(1)

# Calculate percentile ranks for each column (each experiment/condition)
# 'each set of fold-changes' refers to the fold-changes for all genes in one experiment.
# We rank these values from 0 to 100.
# axis=0 means apply rank column-wise
percentile_ranks_df = df.rank(pct=True, axis=0) * 100

# Determine the median percentile rank for each gene across all experiments
# 'the median percentile rank for each gene was determined'
# axis=1 means calculate median row-wise
median_percentile_ranks = percentile_ranks_df.median(axis=1)

# Plot as a histogram
plt.figure(figsize=(10, 6))
sns.histplot(median_percentile_ranks.dropna(), bins=30, kde=True)
plt.title('Histogram of Median Percentile Ranks (MPR)')
plt.xlabel('Median Percentile Rank')
plt.ylabel('Number of Genes')
plt.grid(axis='y', alpha=0.75)
plt.tight_layout()
plt.savefig('mpr_histogram.png')
print('MPR histogram saved to mpr_histogram.png')

# Optionally save the median percentile ranks
median_percentile_ranks.name = 'Median_Percentile_Rank'
median_percentile_ranks.to_csv('median_percentile_ranks.tsv', sep='\t', header=True)
print('Median percentile ranks saved to median_percentile_ranks.tsv')
"

View on GitHub

The local minima near the down-regulated edge, 0.15 was taken as the down-regulated cutoff, and a mirrored cutoff of 0.85 was taken for the up-regulated cutoff, and these gene sets were used for subsequent GO analysis.

Gene Ontology vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Assume an input file 'gene_scores.tsv' with gene IDs and their associated scores.
# For demonstration, let's create a dummy file:
echo -e "gene_id\tscore" > gene_scores.tsv
echo -e "GENE1\t0.05" >> gene_scores.tsv
echo -e "GENE2\t0.10" >> gene_scores.tsv
echo -e "GENE3\t0.20" >> gene_scores.tsv
echo -e "GENE4\t0.50" >> gene_scores.tsv
echo -e "GENE5\t0.80" >> gene_scores.tsv
echo -e "GENE6\t0.90" >> gene_scores.tsv
echo -e "GENE7\t0.95" >> gene_scores.tsv
echo -e "GENE8\t0.14" >> gene_scores.tsv
echo -e "GENE9\t0.86" >> gene_scores.tsv

# Define cutoffs
DOWN_REGULATED_CUTOFF=0.15
UP_REGULATED_CUTOFF=0.85

# Extract down-regulated genes
awk -v cutoff="$DOWN_REGULATED_CUTOFF" 'NR > 1 && $2 < cutoff {print $1}' gene_scores.tsv > down_regulated_genes.txt

# Extract up-regulated genes
awk -v cutoff="$UP_REGULATED_CUTOFF" 'NR > 1 && $2 > cutoff {print $1}' gene_scores.tsv > up_regulated_genes.txt

# Create an R script for GO analysis using clusterProfiler
cat << 'EOF' > run_go_analysis.R
# Install necessary packages if not already installed
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("clusterProfiler")
# BiocManager::install("org.Hs.eg.db") # For Homo sapiens
# BiocManager::install("AnnotationDbi")
# install.packages("dplyr")

library(clusterProfiler)
library(org.Hs.eg.db)
library(dplyr)
library(AnnotationDbi)

# Load gene lists
down_genes <- read.table("down_regulated_genes.txt", header = FALSE, stringsAsFactors = FALSE)$V1
up_genes <- read.table("up_regulated_genes.txt", header = FALSE, stringsAsFactors = FALSE)$V1

# For demonstration, let's assume a background gene list (all genes in the input file)
# In a real scenario, this would be all genes tested in the experiment.
background_genes <- read.table("gene_scores.tsv", header = TRUE, stringsAsFactors = FALSE)$gene_id

# Function to map gene symbols to Entrez IDs
map_symbols_to_entrez <- function(gene_symbols, OrgDb_obj) {
  mapped_genes <- AnnotationDbi::select(OrgDb_obj,
                                        keys = gene_symbols,
                                        columns = c("ENTREZID", "SYMBOL"),
                                        keytype = "SYMBOL")
  # Remove NAs and duplicates
  mapped_genes <- mapped_genes %>%
    filter(!is.na(ENTREZID)) %>%
    distinct(SYMBOL, .keep_all = TRUE)
  return(mapped_genes$ENTREZID)
}

# Convert gene symbols to Entrez IDs
down_genes_entrez <- map_symbols_to_entrez(down_genes, org.Hs.eg.db)
up_genes_entrez <- map_symbols_to_entrez(up_genes, org.Hs.eg.db)
background_genes_entrez <- map_symbols_to_entrez(background_genes, org.Hs.eg.db)


# Perform GO enrichment for down-regulated genes
if (length(down_genes_entrez) > 0) {
  ego_down <- enrichGO(gene          = down_genes_entrez,
                       universe      = background_genes_entrez,
                       OrgDb         = org.Hs.eg.db,
                       ont           = "BP", # Biological Process
                       pAdjustMethod = "BH",
                       pvalueCutoff  = 0.05,
                       qvalueCutoff  = 0.05,
                       readable      = TRUE) # Convert Entrez ID to gene Symbol
  if (!is.null(ego_down) && nrow(ego_down@result) > 0) {
    write.csv(as.data.frame(ego_down), "go_enrichment_down_regulated.csv", row.names = FALSE)
    print("GO enrichment for down-regulated genes completed. Results saved to go_enrichment_down_regulated.csv")
  } else {
    print("No significant GO terms found for down-regulated genes.")
  }
} else {
  print("No down-regulated genes found for GO analysis.")
}

# Perform GO enrichment for up-regulated genes
if (length(up_genes_entrez) > 0) {
  ego_up <- enrichGO(gene          = up_genes_entrez,
                     universe      = background_genes_entrez,
                     OrgDb         = org.Hs.eg.db,
                     ont           = "BP", # Biological Process
                     pAdjustMethod = "BH",
                     pvalueCutoff  = 0.05,
                     qvalueCutoff  = 0.05,
                     readable      = TRUE) # Convert Entrez ID to gene Symbol
  if (!is.null(ego_up) && nrow(ego_up@result) > 0) {
    write.csv(as.data.frame(ego_up), "go_enrichment_up_regulated.csv", row.names = FALSE)
    print("GO enrichment for up-regulated genes completed. Results saved to go_enrichment_up_regulated.csv")
  } else {
    print("No significant GO terms found for up-regulated genes.")
  }
} else {
  print("No up-regulated genes found for GO analysis.")
}
EOF

# Execute the R script
Rscript run_go_analysis.R

View on GitHub

Tools Used

Gene Ontology

Raw Source Text

Raw reads were adaptor-trimmed and aligned to the human genome (UCSC version hg18) using bowtie (parameters: -l 20 -m 5 -k 5 --best -q hg18). Aligned reads were assigned to a custom annotation of human genes [36], [48], and gene expression values were calculated as RPKMs [28].
Control versus sALS patient datasets were compared in a pairwise fashion, calculating ratio fold-changes for all possible 117 disease versus control pairs. For Median Percentile Rank (MPR) analysis, each set of fold-changes were sorted and given a percentile rank, and the median percentile rank for each gene was determined and plotted as a histogram. The local minima near the down-regulated edge, 0.15 was taken as the down-regulated cutoff, and a mirrored cutoff of 0.85 was taken for the up-regulated cutoff, and these gene sets were used for subsequent GO analysis.
Genome_build: hg18
Supplementary_files_format_and_content: Fastq files were aligned to the UCSC genome build (mm9, mm10, or hg19) using BWA, novoaligner (HITS-CLIP) or Olego (polyA-seq). The resulting .sam files were converted to bed files using samtools (HITS-CLIP) or Olego (PolyA-seq). The resulting .bed files were filtered to remove PCR duplicates (HITS-CLIP) and internal A-stretches (polyA-seq). BedGraph files (wiggles) were generated from bedfiles using bedtools (HITS-CLIP) or tag2profile tool from Olego (polyA-seq). Olego is an open source software by Chaolin Zhang lab from Columbia University.

← Back to Analysis