GSE76220 Processing Pipeline
Publication
Transcriptome-pathology correlation identifies interplay between TDP-43 and the expression of its kinase CK1E in sporadic ALS.Acta neuropathologica (2018) — PMID 29881994
Processing Steps
Generate Jupyter Notebook-
1
Raw reads were adaptor-trimmed and aligned to the human genome (UCSC version hg18) using bowtie (parameters: -l 20 -m 5 -k 5 --best -q hg18).
$ Bash example
# Install Bowtie # conda install -c bioconda bowtie=1.1.2 # Install Cutadapt (for adaptor trimming, inferred with models/gemini-2.5-flash) # conda install -c bioconda cutadapt # Download human genome (hg18) and build Bowtie index # mkdir -p reference # wget -O reference/hg18.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg18/bigZips/hg18.fa.gz # gunzip reference/hg18.fa.gz # bowtie-build reference/hg18.fa reference/hg18 # Define input file placeholders (replace with actual file names) RAW_READS="raw_reads.fastq.gz" # Assuming gzipped fastq input TRIMMED_READS="trimmed_reads.fastq" ALIGNED_SAM="aligned_reads.sam" # Adaptor trimming using cutadapt (example with common Illumina universal adaptor) # Adjust adaptor sequence (-a) and minimum length (-m) as needed based on experimental design. cutadapt -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCA -m 20 -o "${TRIMMED_READS}" "${RAW_READS}" # Align trimmed reads to hg18 using Bowtie # 'reference/hg18' refers to the basename of the Bowtie index files (e.g., hg18.1.ebwt, hg18.rev.1.ebwt) located in the 'reference' directory. bowtie -l 20 -m 5 -k 5 --best -q reference/hg18 "${TRIMMED_READS}" > "${ALIGNED_SAM}" -
2
Aligned reads were assigned to a custom annotation of human genes [36], [48], and gene expression values were calculated as RPKMs [28].
featureCounts (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
# Install Subread (which includes featureCounts) # conda install -c bioconda subread # Placeholder for input BAM file (aligned reads) INPUT_BAM="aligned_reads.bam" # Placeholder for custom gene annotation (GTF/GFF format) # Assuming GRCh38 assembly for human genes # Replace with actual path to your custom annotation GENE_ANNOTATION="custom_human_genes_GRCh38.gtf" # Placeholder for output files OUTPUT_COUNTS_BASE="gene_counts" # featureCounts will create gene_counts.txt and gene_counts.txt.summary OUTPUT_RPKMS="gene_expression_rpkm.txt" # 1. Assign aligned reads to genes using featureCounts # -a: annotation file # -o: output file prefix for counts (will create .txt and .txt.summary) # -F GTF: specify annotation file format # -t exon: feature type to count (e.g., exon) # -g gene_id: attribute to group features by (e.g., gene_id) # -s 0: strand specificity (0=unstranded, 1=stranded, 2=reverse stranded) - common for RNA-seq, adjust if known # --primary: count only primary alignments (useful if BAM contains secondary/supplementary) # Add -p if reads are paired-end (e.g., featureCounts ... -p ...) featureCounts -a "${GENE_ANNOTATION}" -o "${OUTPUT_COUNTS_BASE}.txt" -F GTF -t exon -g gene_id -s 0 --primary "${INPUT_BAM}" # 2. Calculate RPKMs from raw counts and gene lengths # Get total assigned reads from the featureCounts summary file TOTAL_ASSIGNED_READS=$(awk '$1 == "Assigned" {print $2}' "${OUTPUT_COUNTS_BASE}.txt.summary") # Check if total assigned reads is valid if [ -z "${TOTAL_ASSIGNED_READS}" ] || [ "${TOTAL_ASSIGNED_READS}" -eq 0 ]; then echo "Error: Could not get total assigned reads or total assigned reads is zero." exit 1 fi # Create a header for the RPKM file echo -e "GeneID\tRPKM" > "${OUTPUT_RPKMS}" # Process the featureCounts output file to calculate RPKM # Skip the header line (line 1) # Column 1: GeneID, Column 6: Length, Column 7: Count (for the first BAM) tail -n +2 "${OUTPUT_COUNTS_BASE}.txt" | awk -v total_reads="${TOTAL_ASSIGNED_READS}" ' BEGIN { OFS="\t" } { gene_id = $1; gene_length = $6; read_count = $7; if (gene_length > 0 && total_reads > 0) { rpkm = (read_count * 1000000000) / (gene_length * total_reads); print gene_id, rpkm; } else { print gene_id, 0; # Handle cases with zero length or zero reads } }' >> "${OUTPUT_RPKMS}" -
3
Control versus sALS patient datasets were compared in a pairwise fashion, calculating ratio fold-changes for all possible 117 disease versus control pairs.
$ Bash example
# Install R (if not already installed) # sudo apt-get update # sudo apt-get install r-base # Create dummy input data files for demonstration # data_matrix.tsv: Features (rows) x Samples (columns) # This dummy data is structured to yield 9 sALS samples and 13 Control samples, resulting in 9 * 13 = 117 pairs. cat <<EOF > data_matrix.tsv Feature\tControl_1\tControl_2\tControl_3\tControl_4\tControl_5\tControl_6\tControl_7\tControl_8\tControl_9\tControl_10\tControl_11\tControl_12\tControl_13\tsALS_1\tsALS_2\tsALS_3\tsALS_4\tsALS_5\tsALS_6\tsALS_7\tsALS_8\tsALS_9 GeneA\t100\t110\t95\t105\t102\t98\t103\t107\t99\t101\t104\t106\t97\t200\t210\t190\t220\t230\t240\t250\t260\t270 GeneB\t50\t55\t48\t52\t51\t49\t53\t56\t47\t50\t54\t57\t46\t25\t28\t24\t26\t27\t29\t30\t31\t32 GeneC\t200\t205\t198\t202\t201\t199\t203\t206\t197\t200\t204\t207\t196\t100\t105\t98\t102\t103\t104\t105\t106\t107 EOF # sample_info.tsv: SampleID, Group cat <<EOF > sample_info.tsv SampleID\tGroup Control_1\tControl Control_2\tControl Control_3\tControl Control_4\tControl Control_5\tControl Control_6\tControl Control_7\tControl Control_8\tControl Control_9\tControl Control_10\tControl Control_11\tControl Control_12\tControl Control_13\tControl sALS_1\tsALS sALS_2\tsALS sALS_3\tsALS sALS_4\tsALS sALS_5\tsALS sALS_6\tsALS sALS_7\tsALS sALS_8\tsALS sALS_9\tsALS EOF # Create the R script for calculating pairwise fold changes cat <<'EOF' > calculate_fold_changes.R # Load data data <- read.delim("data_matrix.tsv", row.names = 1, check.names = FALSE) sample_info <- read.delim("sample_info.tsv") # Identify control and sALS samples control_samples <- sample_info$SampleID[sample_info$Group == "Control"] sals_samples <- sample_info$SampleID[sample_info$Group == "sALS"] # Ensure samples are present in the data matrix control_samples <- intersect(control_samples, colnames(data)) sals_samples <- intersect(sals_samples, colnames(data)) # Initialize a data frame to store results results_df <- data.frame() # Iterate through all possible disease versus control pairs # The description states "all possible 117 disease versus control pairs". # This script will calculate all possible pairs based on the input data. message(paste("Found", length(sals_samples), "sALS samples and", length(control_samples), "Control samples.")) message(paste("Total possible pairs to calculate:", length(sals_samples) * length(control_samples))) for (sals_id in sals_samples) { for (control_id in control_samples) { # Calculate ratio fold-change for each feature # Add a small pseudocount to avoid division by zero or log(0) pseudocount <- 1e-6 # Ensure samples exist in data if (sals_id %in% colnames(data) && control_id %in% colnames(data)) { fold_changes <- (data[[sals_id]] + pseudocount) / (data[[control_id]] + pseudocount) # Store results pair_name <- paste0(sals_id, "_vs_", control_id) temp_df <- data.frame( Feature = rownames(data), Pair = pair_name, FoldChange = fold_changes, stringsAsFactors = FALSE ) results_df <- rbind(results_df, temp_df) } else { warning(paste("Skipping pair:", sals_id, "vs", control_id, "- one or both samples not found in data.")) } } } # Output results write.csv(results_df, "pairwise_fold_changes.csv", row.names = FALSE) message("Pairwise fold changes calculated and saved to 'pairwise_fold_changes.csv'") EOF # Execute the R script Rscript calculate_fold_changes.R -
4
For Median Percentile Rank (MPR) analysis, each set of fold-changes were sorted and given a percentile rank, and the median percentile rank for each gene was determined and plotted as a histogram.
$ Bash example
# Install dependencies (if not already installed) # conda install pandas numpy matplotlib seaborn # Create a dummy input file for demonstration # In a real scenario, 'fold_changes.tsv' would be generated by a previous step. echo -e "gene\tcond1_fc\tcond2_fc\tcond3_fc\tcond4_fc" > fold_changes.tsv echo -e "geneA\t1.2\t0.8\t1.5\t1.1" >> fold_changes.tsv echo -e "geneB\t-0.5\t0.2\t-0.1\t0.3" >> fold_changes.tsv echo -e "geneC\t2.0\t1.8\t2.2\t1.9" >> fold_changes.tsv echo -e "geneD\t0.1\t-0.3\t0.0\t-0.2" >> fold_changes.tsv echo -e "geneE\t-1.0\t-1.2\t-0.8\t-0.9" >> fold_changes.tsv echo -e "geneF\t0.5\t0.6\t0.4\t0.7" >> fold_changes.tsv echo -e "geneG\t-1.5\t-1.8\t-1.3\t-1.6" >> fold_changes.tsv echo -e "geneH\t2.5\t2.3\t2.7\t2.4" >> fold_changes.tsv echo -e "geneI\t-0.2\t0.1\t-0.3\t0.0" >> fold_changes.tsv echo -e "geneJ\t1.0\t1.1\t0.9\t1.2" >> fold_changes.tsv # Python script for Median Percentile Rank (MPR) analysis python -c " import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns # Load fold-change data # Assuming the input file 'fold_changes.tsv' has gene names in the first column # and fold-changes in subsequent columns (representing different experiments/conditions). try: df = pd.read_csv('fold_changes.tsv', sep='\t', index_col='gene') except FileNotFoundError: print('Error: fold_changes.tsv not found. Please ensure the input file exists.') exit(1) # Calculate percentile ranks for each column (each experiment/condition) # 'each set of fold-changes' refers to the fold-changes for all genes in one experiment. # We rank these values from 0 to 100. # axis=0 means apply rank column-wise percentile_ranks_df = df.rank(pct=True, axis=0) * 100 # Determine the median percentile rank for each gene across all experiments # 'the median percentile rank for each gene was determined' # axis=1 means calculate median row-wise median_percentile_ranks = percentile_ranks_df.median(axis=1) # Plot as a histogram plt.figure(figsize=(10, 6)) sns.histplot(median_percentile_ranks.dropna(), bins=30, kde=True) plt.title('Histogram of Median Percentile Ranks (MPR)') plt.xlabel('Median Percentile Rank') plt.ylabel('Number of Genes') plt.grid(axis='y', alpha=0.75) plt.tight_layout() plt.savefig('mpr_histogram.png') print('MPR histogram saved to mpr_histogram.png') # Optionally save the median percentile ranks median_percentile_ranks.name = 'Median_Percentile_Rank' median_percentile_ranks.to_csv('median_percentile_ranks.tsv', sep='\t', header=True) print('Median percentile ranks saved to median_percentile_ranks.tsv') " -
5
The local minima near the down-regulated edge, 0.15 was taken as the down-regulated cutoff, and a mirrored cutoff of 0.85 was taken for the up-regulated cutoff, and these gene sets were used for subsequent GO analysis.
$ Bash example
# Assume an input file 'gene_scores.tsv' with gene IDs and their associated scores. # For demonstration, let's create a dummy file: echo -e "gene_id\tscore" > gene_scores.tsv echo -e "GENE1\t0.05" >> gene_scores.tsv echo -e "GENE2\t0.10" >> gene_scores.tsv echo -e "GENE3\t0.20" >> gene_scores.tsv echo -e "GENE4\t0.50" >> gene_scores.tsv echo -e "GENE5\t0.80" >> gene_scores.tsv echo -e "GENE6\t0.90" >> gene_scores.tsv echo -e "GENE7\t0.95" >> gene_scores.tsv echo -e "GENE8\t0.14" >> gene_scores.tsv echo -e "GENE9\t0.86" >> gene_scores.tsv # Define cutoffs DOWN_REGULATED_CUTOFF=0.15 UP_REGULATED_CUTOFF=0.85 # Extract down-regulated genes awk -v cutoff="$DOWN_REGULATED_CUTOFF" 'NR > 1 && $2 < cutoff {print $1}' gene_scores.tsv > down_regulated_genes.txt # Extract up-regulated genes awk -v cutoff="$UP_REGULATED_CUTOFF" 'NR > 1 && $2 > cutoff {print $1}' gene_scores.tsv > up_regulated_genes.txt # Create an R script for GO analysis using clusterProfiler cat << 'EOF' > run_go_analysis.R # Install necessary packages if not already installed # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("clusterProfiler") # BiocManager::install("org.Hs.eg.db") # For Homo sapiens # BiocManager::install("AnnotationDbi") # install.packages("dplyr") library(clusterProfiler) library(org.Hs.eg.db) library(dplyr) library(AnnotationDbi) # Load gene lists down_genes <- read.table("down_regulated_genes.txt", header = FALSE, stringsAsFactors = FALSE)$V1 up_genes <- read.table("up_regulated_genes.txt", header = FALSE, stringsAsFactors = FALSE)$V1 # For demonstration, let's assume a background gene list (all genes in the input file) # In a real scenario, this would be all genes tested in the experiment. background_genes <- read.table("gene_scores.tsv", header = TRUE, stringsAsFactors = FALSE)$gene_id # Function to map gene symbols to Entrez IDs map_symbols_to_entrez <- function(gene_symbols, OrgDb_obj) { mapped_genes <- AnnotationDbi::select(OrgDb_obj, keys = gene_symbols, columns = c("ENTREZID", "SYMBOL"), keytype = "SYMBOL") # Remove NAs and duplicates mapped_genes <- mapped_genes %>% filter(!is.na(ENTREZID)) %>% distinct(SYMBOL, .keep_all = TRUE) return(mapped_genes$ENTREZID) } # Convert gene symbols to Entrez IDs down_genes_entrez <- map_symbols_to_entrez(down_genes, org.Hs.eg.db) up_genes_entrez <- map_symbols_to_entrez(up_genes, org.Hs.eg.db) background_genes_entrez <- map_symbols_to_entrez(background_genes, org.Hs.eg.db) # Perform GO enrichment for down-regulated genes if (length(down_genes_entrez) > 0) { ego_down <- enrichGO(gene = down_genes_entrez, universe = background_genes_entrez, OrgDb = org.Hs.eg.db, ont = "BP", # Biological Process pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, readable = TRUE) # Convert Entrez ID to gene Symbol if (!is.null(ego_down) && nrow(ego_down@result) > 0) { write.csv(as.data.frame(ego_down), "go_enrichment_down_regulated.csv", row.names = FALSE) print("GO enrichment for down-regulated genes completed. Results saved to go_enrichment_down_regulated.csv") } else { print("No significant GO terms found for down-regulated genes.") } } else { print("No down-regulated genes found for GO analysis.") } # Perform GO enrichment for up-regulated genes if (length(up_genes_entrez) > 0) { ego_up <- enrichGO(gene = up_genes_entrez, universe = background_genes_entrez, OrgDb = org.Hs.eg.db, ont = "BP", # Biological Process pAdjustMethod = "BH", pvalueCutoff = 0.05, qvalueCutoff = 0.05, readable = TRUE) # Convert Entrez ID to gene Symbol if (!is.null(ego_up) && nrow(ego_up@result) > 0) { write.csv(as.data.frame(ego_up), "go_enrichment_up_regulated.csv", row.names = FALSE) print("GO enrichment for up-regulated genes completed. Results saved to go_enrichment_up_regulated.csv") } else { print("No significant GO terms found for up-regulated genes.") } } else { print("No up-regulated genes found for GO analysis.") } EOF # Execute the R script Rscript run_go_analysis.R
Tools Used
Raw Source Text
Raw reads were adaptor-trimmed and aligned to the human genome (UCSC version hg18) using bowtie (parameters: -l 20 -m 5 -k 5 --best -q hg18). Aligned reads were assigned to a custom annotation of human genes [36], [48], and gene expression values were calculated as RPKMs [28]. Control versus sALS patient datasets were compared in a pairwise fashion, calculating ratio fold-changes for all possible 117 disease versus control pairs. For Median Percentile Rank (MPR) analysis, each set of fold-changes were sorted and given a percentile rank, and the median percentile rank for each gene was determined and plotted as a histogram. The local minima near the down-regulated edge, 0.15 was taken as the down-regulated cutoff, and a mirrored cutoff of 0.85 was taken for the up-regulated cutoff, and these gene sets were used for subsequent GO analysis. Genome_build: hg18 Supplementary_files_format_and_content: Fastq files were aligned to the UCSC genome build (mm9, mm10, or hg19) using BWA, novoaligner (HITS-CLIP) or Olego (polyA-seq). The resulting .sam files were converted to bed files using samtools (HITS-CLIP) or Olego (PolyA-seq). The resulting .bed files were filtered to remove PCR duplicates (HITS-CLIP) and internal A-stretches (polyA-seq). BedGraph files (wiggles) were generated from bedfiles using bedtools (HITS-CLIP) or tag2profile tool from Olego (polyA-seq). Olego is an open source software by Chaolin Zhang lab from Columbia University.