GSE76220 Processing Pipeline

RNA-Seq code_examples 5 steps

Publication

Transcriptome-pathology correlation identifies interplay between TDP-43 and the expression of its kinase CK1E in sporadic ALS.

Acta neuropathologica (2018) — PMID 29881994

Dataset

GSE76220

Gene Expression Signatures of Motor Neuron Populations Isolated from Sporadic ALS

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Raw reads were adaptor-trimmed and aligned to the human genome (UCSC version hg18) using bowtie (parameters: -l 20 -m 5 -k 5 --best -q hg18).

    Bowtie v1.1.2 GitHub
    $ Bash example
    # Install Bowtie
    # conda install -c bioconda bowtie=1.1.2
    
    # Install Cutadapt (for adaptor trimming, inferred with models/gemini-2.5-flash)
    # conda install -c bioconda cutadapt
    
    # Download human genome (hg18) and build Bowtie index
    # mkdir -p reference
    # wget -O reference/hg18.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg18/bigZips/hg18.fa.gz
    # gunzip reference/hg18.fa.gz
    # bowtie-build reference/hg18.fa reference/hg18
    
    # Define input file placeholders (replace with actual file names)
    RAW_READS="raw_reads.fastq.gz" # Assuming gzipped fastq input
    TRIMMED_READS="trimmed_reads.fastq"
    ALIGNED_SAM="aligned_reads.sam"
    
    # Adaptor trimming using cutadapt (example with common Illumina universal adaptor)
    # Adjust adaptor sequence (-a) and minimum length (-m) as needed based on experimental design.
    cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -m 20 -o "${TRIMMED_READS}" "${RAW_READS}"
    
    # Align trimmed reads to hg18 using Bowtie
    # 'reference/hg18' refers to the basename of the Bowtie index files (e.g., hg18.1.ebwt, hg18.rev.1.ebwt) located in the 'reference' directory.
    bowtie -l 20 -m 5 -k 5 --best -q reference/hg18 "${TRIMMED_READS}" > "${ALIGNED_SAM}"
  2. 2

    Aligned reads were assigned to a custom annotation of human genes [36], [48], and gene expression values were calculated as RPKMs [28].

    featureCounts (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # Install Subread (which includes featureCounts)
    # conda install -c bioconda subread
    
    # Placeholder for input BAM file (aligned reads)
    INPUT_BAM="aligned_reads.bam"
    
    # Placeholder for custom gene annotation (GTF/GFF format)
    # Assuming GRCh38 assembly for human genes
    # Replace with actual path to your custom annotation
    GENE_ANNOTATION="custom_human_genes_GRCh38.gtf"
    
    # Placeholder for output files
    OUTPUT_COUNTS_BASE="gene_counts" # featureCounts will create gene_counts.txt and gene_counts.txt.summary
    OUTPUT_RPKMS="gene_expression_rpkm.txt"
    
    # 1. Assign aligned reads to genes using featureCounts
    # -a: annotation file
    # -o: output file prefix for counts (will create .txt and .txt.summary)
    # -F GTF: specify annotation file format
    # -t exon: feature type to count (e.g., exon)
    # -g gene_id: attribute to group features by (e.g., gene_id)
    # -s 0: strand specificity (0=unstranded, 1=stranded, 2=reverse stranded) - common for RNA-seq, adjust if known
    # --primary: count only primary alignments (useful if BAM contains secondary/supplementary)
    # Add -p if reads are paired-end (e.g., featureCounts ... -p ...)
    featureCounts -a "${GENE_ANNOTATION}" -o "${OUTPUT_COUNTS_BASE}.txt" -F GTF -t exon -g gene_id -s 0 --primary "${INPUT_BAM}"
    
    # 2. Calculate RPKMs from raw counts and gene lengths
    # Get total assigned reads from the featureCounts summary file
    TOTAL_ASSIGNED_READS=$(awk '$1 == "Assigned" {print $2}' "${OUTPUT_COUNTS_BASE}.txt.summary")
    
    # Check if total assigned reads is valid
    if [ -z "${TOTAL_ASSIGNED_READS}" ] || [ "${TOTAL_ASSIGNED_READS}" -eq 0 ]; then
        echo "Error: Could not get total assigned reads or total assigned reads is zero."
        exit 1
    fi
    
    # Create a header for the RPKM file
    echo -e "GeneID\tRPKM" > "${OUTPUT_RPKMS}"
    
    # Process the featureCounts output file to calculate RPKM
    # Skip the header line (line 1)
    # Column 1: GeneID, Column 6: Length, Column 7: Count (for the first BAM)
    tail -n +2 "${OUTPUT_COUNTS_BASE}.txt" | awk -v total_reads="${TOTAL_ASSIGNED_READS}" '
    BEGIN { OFS="\t" }
    {
        gene_id = $1;
        gene_length = $6;
        read_count = $7;
    
        if (gene_length > 0 && total_reads > 0) {
            rpkm = (read_count * 1000000000) / (gene_length * total_reads);
            print gene_id, rpkm;
        } else {
            print gene_id, 0; # Handle cases with zero length or zero reads
        }
    }' >> "${OUTPUT_RPKMS}"
  3. 3

    Control versus sALS patient datasets were compared in a pairwise fashion, calculating ratio fold-changes for all possible 117 disease versus control pairs.

    R (Inferred with models/gemini-2.5-flash) vlatest stable GitHub
    $ Bash example
    # Install R (if not already installed)
    # sudo apt-get update
    # sudo apt-get install r-base
    
    # Create dummy input data files for demonstration
    # data_matrix.tsv: Features (rows) x Samples (columns)
    # This dummy data is structured to yield 9 sALS samples and 13 Control samples, resulting in 9 * 13 = 117 pairs.
    cat <<EOF > data_matrix.tsv
    Feature\tControl_1\tControl_2\tControl_3\tControl_4\tControl_5\tControl_6\tControl_7\tControl_8\tControl_9\tControl_10\tControl_11\tControl_12\tControl_13\tsALS_1\tsALS_2\tsALS_3\tsALS_4\tsALS_5\tsALS_6\tsALS_7\tsALS_8\tsALS_9
    GeneA\t100\t110\t95\t105\t102\t98\t103\t107\t99\t101\t104\t106\t97\t200\t210\t190\t220\t230\t240\t250\t260\t270
    GeneB\t50\t55\t48\t52\t51\t49\t53\t56\t47\t50\t54\t57\t46\t25\t28\t24\t26\t27\t29\t30\t31\t32
    GeneC\t200\t205\t198\t202\t201\t199\t203\t206\t197\t200\t204\t207\t196\t100\t105\t98\t102\t103\t104\t105\t106\t107
    EOF
    
    # sample_info.tsv: SampleID, Group
    cat <<EOF > sample_info.tsv
    SampleID\tGroup
    Control_1\tControl
    Control_2\tControl
    Control_3\tControl
    Control_4\tControl
    Control_5\tControl
    Control_6\tControl
    Control_7\tControl
    Control_8\tControl
    Control_9\tControl
    Control_10\tControl
    Control_11\tControl
    Control_12\tControl
    Control_13\tControl
    sALS_1\tsALS
    sALS_2\tsALS
    sALS_3\tsALS
    sALS_4\tsALS
    sALS_5\tsALS
    sALS_6\tsALS
    sALS_7\tsALS
    sALS_8\tsALS
    sALS_9\tsALS
    EOF
    
    # Create the R script for calculating pairwise fold changes
    cat <<'EOF' > calculate_fold_changes.R
    # Load data
    data <- read.delim("data_matrix.tsv", row.names = 1, check.names = FALSE)
    sample_info <- read.delim("sample_info.tsv")
    
    # Identify control and sALS samples
    control_samples <- sample_info$SampleID[sample_info$Group == "Control"]
    sals_samples <- sample_info$SampleID[sample_info$Group == "sALS"]
    
    # Ensure samples are present in the data matrix
    control_samples <- intersect(control_samples, colnames(data))
    sals_samples <- intersect(sals_samples, colnames(data))
    
    # Initialize a data frame to store results
    results_df <- data.frame()
    
    # Iterate through all possible disease versus control pairs
    # The description states "all possible 117 disease versus control pairs".
    # This script will calculate all possible pairs based on the input data.
    message(paste("Found", length(sals_samples), "sALS samples and", length(control_samples), "Control samples."))
    message(paste("Total possible pairs to calculate:", length(sals_samples) * length(control_samples)))
    
    for (sals_id in sals_samples) {
      for (control_id in control_samples) {
        # Calculate ratio fold-change for each feature
        # Add a small pseudocount to avoid division by zero or log(0)
        pseudocount <- 1e-6
        
        # Ensure samples exist in data
        if (sals_id %in% colnames(data) && control_id %in% colnames(data)) {
          fold_changes <- (data[[sals_id]] + pseudocount) / (data[[control_id]] + pseudocount)
          
          # Store results
          pair_name <- paste0(sals_id, "_vs_", control_id)
          temp_df <- data.frame(
            Feature = rownames(data),
            Pair = pair_name,
            FoldChange = fold_changes,
            stringsAsFactors = FALSE
          )
          results_df <- rbind(results_df, temp_df)
        } else {
          warning(paste("Skipping pair:", sals_id, "vs", control_id, "- one or both samples not found in data."))
        }
      }
    }
    
    # Output results
    write.csv(results_df, "pairwise_fold_changes.csv", row.names = FALSE)
    message("Pairwise fold changes calculated and saved to 'pairwise_fold_changes.csv'")
    EOF
    
    # Execute the R script
    Rscript calculate_fold_changes.R
  4. 4

    For Median Percentile Rank (MPR) analysis, each set of fold-changes were sorted and given a percentile rank, and the median percentile rank for each gene was determined and plotted as a histogram.

    Python (Inferred with models/gemini-2.5-flash) v3.9 GitHub
    $ Bash example
    # Install dependencies (if not already installed)
    # conda install pandas numpy matplotlib seaborn
    
    # Create a dummy input file for demonstration
    # In a real scenario, 'fold_changes.tsv' would be generated by a previous step.
    echo -e "gene\tcond1_fc\tcond2_fc\tcond3_fc\tcond4_fc" > fold_changes.tsv
    echo -e "geneA\t1.2\t0.8\t1.5\t1.1" >> fold_changes.tsv
    echo -e "geneB\t-0.5\t0.2\t-0.1\t0.3" >> fold_changes.tsv
    echo -e "geneC\t2.0\t1.8\t2.2\t1.9" >> fold_changes.tsv
    echo -e "geneD\t0.1\t-0.3\t0.0\t-0.2" >> fold_changes.tsv
    echo -e "geneE\t-1.0\t-1.2\t-0.8\t-0.9" >> fold_changes.tsv
    echo -e "geneF\t0.5\t0.6\t0.4\t0.7" >> fold_changes.tsv
    echo -e "geneG\t-1.5\t-1.8\t-1.3\t-1.6" >> fold_changes.tsv
    echo -e "geneH\t2.5\t2.3\t2.7\t2.4" >> fold_changes.tsv
    echo -e "geneI\t-0.2\t0.1\t-0.3\t0.0" >> fold_changes.tsv
    echo -e "geneJ\t1.0\t1.1\t0.9\t1.2" >> fold_changes.tsv
    
    # Python script for Median Percentile Rank (MPR) analysis
    python -c "
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
    # Load fold-change data
    # Assuming the input file 'fold_changes.tsv' has gene names in the first column
    # and fold-changes in subsequent columns (representing different experiments/conditions).
    try:
        df = pd.read_csv('fold_changes.tsv', sep='\t', index_col='gene')
    except FileNotFoundError:
        print('Error: fold_changes.tsv not found. Please ensure the input file exists.')
        exit(1)
    
    # Calculate percentile ranks for each column (each experiment/condition)
    # 'each set of fold-changes' refers to the fold-changes for all genes in one experiment.
    # We rank these values from 0 to 100.
    # axis=0 means apply rank column-wise
    percentile_ranks_df = df.rank(pct=True, axis=0) * 100
    
    # Determine the median percentile rank for each gene across all experiments
    # 'the median percentile rank for each gene was determined'
    # axis=1 means calculate median row-wise
    median_percentile_ranks = percentile_ranks_df.median(axis=1)
    
    # Plot as a histogram
    plt.figure(figsize=(10, 6))
    sns.histplot(median_percentile_ranks.dropna(), bins=30, kde=True)
    plt.title('Histogram of Median Percentile Ranks (MPR)')
    plt.xlabel('Median Percentile Rank')
    plt.ylabel('Number of Genes')
    plt.grid(axis='y', alpha=0.75)
    plt.tight_layout()
    plt.savefig('mpr_histogram.png')
    print('MPR histogram saved to mpr_histogram.png')
    
    # Optionally save the median percentile ranks
    median_percentile_ranks.name = 'Median_Percentile_Rank'
    median_percentile_ranks.to_csv('median_percentile_ranks.tsv', sep='\t', header=True)
    print('Median percentile ranks saved to median_percentile_ranks.tsv')
    "
  5. 5

    The local minima near the down-regulated edge, 0.15 was taken as the down-regulated cutoff, and a mirrored cutoff of 0.85 was taken for the up-regulated cutoff, and these gene sets were used for subsequent GO analysis.

    Gene Ontology vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Assume an input file 'gene_scores.tsv' with gene IDs and their associated scores.
    # For demonstration, let's create a dummy file:
    echo -e "gene_id\tscore" > gene_scores.tsv
    echo -e "GENE1\t0.05" >> gene_scores.tsv
    echo -e "GENE2\t0.10" >> gene_scores.tsv
    echo -e "GENE3\t0.20" >> gene_scores.tsv
    echo -e "GENE4\t0.50" >> gene_scores.tsv
    echo -e "GENE5\t0.80" >> gene_scores.tsv
    echo -e "GENE6\t0.90" >> gene_scores.tsv
    echo -e "GENE7\t0.95" >> gene_scores.tsv
    echo -e "GENE8\t0.14" >> gene_scores.tsv
    echo -e "GENE9\t0.86" >> gene_scores.tsv
    
    # Define cutoffs
    DOWN_REGULATED_CUTOFF=0.15
    UP_REGULATED_CUTOFF=0.85
    
    # Extract down-regulated genes
    awk -v cutoff="$DOWN_REGULATED_CUTOFF" 'NR > 1 && $2 < cutoff {print $1}' gene_scores.tsv > down_regulated_genes.txt
    
    # Extract up-regulated genes
    awk -v cutoff="$UP_REGULATED_CUTOFF" 'NR > 1 && $2 > cutoff {print $1}' gene_scores.tsv > up_regulated_genes.txt
    
    # Create an R script for GO analysis using clusterProfiler
    cat << 'EOF' > run_go_analysis.R
    # Install necessary packages if not already installed
    # if (!requireNamespace("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    # BiocManager::install("clusterProfiler")
    # BiocManager::install("org.Hs.eg.db") # For Homo sapiens
    # BiocManager::install("AnnotationDbi")
    # install.packages("dplyr")
    
    library(clusterProfiler)
    library(org.Hs.eg.db)
    library(dplyr)
    library(AnnotationDbi)
    
    # Load gene lists
    down_genes <- read.table("down_regulated_genes.txt", header = FALSE, stringsAsFactors = FALSE)$V1
    up_genes <- read.table("up_regulated_genes.txt", header = FALSE, stringsAsFactors = FALSE)$V1
    
    # For demonstration, let's assume a background gene list (all genes in the input file)
    # In a real scenario, this would be all genes tested in the experiment.
    background_genes <- read.table("gene_scores.tsv", header = TRUE, stringsAsFactors = FALSE)$gene_id
    
    # Function to map gene symbols to Entrez IDs
    map_symbols_to_entrez <- function(gene_symbols, OrgDb_obj) {
      mapped_genes <- AnnotationDbi::select(OrgDb_obj,
                                            keys = gene_symbols,
                                            columns = c("ENTREZID", "SYMBOL"),
                                            keytype = "SYMBOL")
      # Remove NAs and duplicates
      mapped_genes <- mapped_genes %>%
        filter(!is.na(ENTREZID)) %>%
        distinct(SYMBOL, .keep_all = TRUE)
      return(mapped_genes$ENTREZID)
    }
    
    # Convert gene symbols to Entrez IDs
    down_genes_entrez <- map_symbols_to_entrez(down_genes, org.Hs.eg.db)
    up_genes_entrez <- map_symbols_to_entrez(up_genes, org.Hs.eg.db)
    background_genes_entrez <- map_symbols_to_entrez(background_genes, org.Hs.eg.db)
    
    
    # Perform GO enrichment for down-regulated genes
    if (length(down_genes_entrez) > 0) {
      ego_down <- enrichGO(gene          = down_genes_entrez,
                           universe      = background_genes_entrez,
                           OrgDb         = org.Hs.eg.db,
                           ont           = "BP", # Biological Process
                           pAdjustMethod = "BH",
                           pvalueCutoff  = 0.05,
                           qvalueCutoff  = 0.05,
                           readable      = TRUE) # Convert Entrez ID to gene Symbol
      if (!is.null(ego_down) && nrow(ego_down@result) > 0) {
        write.csv(as.data.frame(ego_down), "go_enrichment_down_regulated.csv", row.names = FALSE)
        print("GO enrichment for down-regulated genes completed. Results saved to go_enrichment_down_regulated.csv")
      } else {
        print("No significant GO terms found for down-regulated genes.")
      }
    } else {
      print("No down-regulated genes found for GO analysis.")
    }
    
    # Perform GO enrichment for up-regulated genes
    if (length(up_genes_entrez) > 0) {
      ego_up <- enrichGO(gene          = up_genes_entrez,
                         universe      = background_genes_entrez,
                         OrgDb         = org.Hs.eg.db,
                         ont           = "BP", # Biological Process
                         pAdjustMethod = "BH",
                         pvalueCutoff  = 0.05,
                         qvalueCutoff  = 0.05,
                         readable      = TRUE) # Convert Entrez ID to gene Symbol
      if (!is.null(ego_up) && nrow(ego_up@result) > 0) {
        write.csv(as.data.frame(ego_up), "go_enrichment_up_regulated.csv", row.names = FALSE)
        print("GO enrichment for up-regulated genes completed. Results saved to go_enrichment_up_regulated.csv")
      } else {
        print("No significant GO terms found for up-regulated genes.")
      }
    } else {
      print("No up-regulated genes found for GO analysis.")
    }
    EOF
    
    # Execute the R script
    Rscript run_go_analysis.R

Tools Used

Raw Source Text
Raw reads were adaptor-trimmed and aligned to the human genome (UCSC version hg18) using bowtie (parameters: -l 20 -m 5 -k 5 --best -q hg18). Aligned reads were assigned to a custom annotation of human genes [36], [48], and gene expression values were calculated as RPKMs [28].
Control versus sALS patient datasets were compared in a pairwise fashion, calculating ratio fold-changes for all possible 117 disease versus control pairs. For Median Percentile Rank (MPR) analysis, each set of fold-changes were sorted and given a percentile rank, and the median percentile rank for each gene was determined and plotted as a histogram. The local minima near the down-regulated edge, 0.15 was taken as the down-regulated cutoff, and a mirrored cutoff of 0.85 was taken for the up-regulated cutoff, and these gene sets were used for subsequent GO analysis.
Genome_build: hg18
Supplementary_files_format_and_content: Fastq files were aligned to the UCSC genome build (mm9, mm10, or hg19) using BWA, novoaligner (HITS-CLIP) or Olego (polyA-seq). The resulting .sam files were converted to bed files using samtools (HITS-CLIP) or Olego (PolyA-seq). The resulting .bed files were filtered to remove PCR duplicates (HITS-CLIP) and internal A-stretches (polyA-seq). BedGraph files (wiggles) were generated from bedfiles using bedtools (HITS-CLIP) or tag2profile tool from Olego (polyA-seq). Olego is an open source software by Chaolin Zhang lab from Columbia University.
← Back to Analysis