GSE16687 Processing Pipeline
Publication
A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells.Stem cells and development (2010) — PMID 19807270
Dataset
GSE16687MicroRNA expression data from differentiation of human Cyt49 ESCs into definitive endoderm on MEF feeder layers
Processing Steps
Generate Jupyter Notebook-
1
The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization.
limma (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install R and necessary packages if not already installed # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager', repos='http://cran.us.r-project.org'); BiocManager::install('limma')" # Placeholder for raw data file (e.g., tab-separated file from Ambion scanner) RAW_DATA_FILE="miRCHIP_raw_data.txt" OUTPUT_NORMALIZED_FILE="miRCHIP_normalized_data.txt" # This R script performs background correction, variance stabilization, and normalization # using the limma package, suitable for single-channel microarray data like Ambion miRCHIP. # The exact parsing of RAW_DATA_FILE into E (foreground) and Eb (background) matrices # would depend on the specific format of the Ambion miRCHIP output. # For demonstration, we assume these matrices can be extracted. Rscript -e ' library(limma) # --- Configuration --- raw_data_file <- Sys.getenv("RAW_DATA_FILE") output_normalized_file <- Sys.getenv("OUTPUT_NORMALIZED_FILE") # --- Dummy Data Creation (Replace with actual data loading) --- # In a real scenario, you would parse your Ambion miRCHIP raw data file(s) # to extract foreground (E) and background (Eb) intensity matrices. # Example: If your raw data file has columns like "ProbeID", "Sample1_Signal", "Sample1_Background", ... # you would read it and then separate into E and Eb matrices. # For demonstration, creating dummy matrices: set.seed(42) num_probes <- 100 num_samples <- 5 probe_ids <- paste0("miR_probe_", 1:num_probes) sample_names <- paste0("Sample_", 1:num_samples) E_raw <- matrix(rnorm(num_probes * num_samples, mean=1000, sd=200), nrow=num_probes, ncol=num_samples, dimnames=list(probe_ids, sample_names)) Eb_raw <- matrix(rnorm(num_probes * num_samples, mean=100, sd=20), nrow=num_probes, ncol=num_samples, dimnames=list(probe_ids, sample_names)) # --- 1. Create EList object --- # This object holds raw foreground (E) and background (Eb) intensities. MA <- new("EList") MA$E <- E_raw MA$Eb <- Eb_raw # --- 2. Background Estimate and Correction --- # Using "normexp" method, common for single-channel arrays. MA_bg_corrected <- backgroundCorrect(MA, method="normexp") # --- 3. Constant Variance Stabilization and Global Normalization --- # Using "vsn" (Variance Stabilizing Normalization) method. # This method performs both variance stabilization and global normalization. MA_normalized <- normalizeBetweenArrays(MA_bg_corrected, method="vsn") # --- 4. Extract and Save Normalized Data --- normalized_expression_matrix <- MA_normalized$E write.table(normalized_expression_matrix, file=output_normalized_file, sep="\t", quote=FALSE, row.names=TRUE) cat(paste0("Normalized data written to: ", output_normalized_file, "\n")) ' -
2
For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls.
$ Bash example
# This step describes a custom background subtraction method for microarray data. # It involves calculating the median signal from G-C matched anti-genomic control probes # and subtracting this value from each experimental probe's signal. # The exact implementation would depend on the microarray platform and data format. # Input files (placeholders): PROBE_SIGNAL_FILE="raw_probe_signals.txt" # File containing raw signal intensities for all probes CONTROL_PROBE_INFO="gc_matched_control_definitions.txt" # File defining G-C matched anti-genomic controls for each probe OUTPUT_FILE="background_subtracted_probe_signals.txt" # A real implementation would involve parsing these files, identifying control probes, # calculating their median signal, and applying the subtraction. This is typically done # in a scripting language like R or Python, or using vendor-specific software. # Example conceptual execution command (assuming a custom script 'subtract_background.R' exists): # Rscript subtract_background.R \ # --probe_signals ${PROBE_SIGNAL_FILE} \ # --control_info ${CONTROL_PROBE_INFO} \ # --output ${OUTPUT_FILE} # For demonstration, a generic echo command: echo "Performing G-C matched anti-genomic background subtraction on ${PROBE_SIGNAL_FILE} to create ${OUTPUT_FILE}" -
3
Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002).
vsn (Inferred with models/gemini-2.5-flash) vNot specified$ Bash example
# Install BiocManager if not already installed # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")' # Install vsn package if not already installed # R -e 'BiocManager::install("vsn")' Rscript -e ' library(vsn) # Load raw data (replace with your actual file path and format) # Assuming "raw_array_intensities.csv" contains the raw array intensity matrix. # The first column is assumed to be probe IDs, and subsequent columns are array samples. # Adjust read.csv parameters (e.g., sep, header) based on your file format. raw_data <- read.csv("raw_array_intensities.csv", row.names = 1) # Convert to a matrix for vsn processing data_matrix <- as.matrix(raw_data) # Perform VSN normalization using the vsn2 function # vsn2 returns an object of class "vsn" containing the normalized data vsn_fit <- vsn2(data_matrix) # Extract the normalized data matrix from the vsn object normalized_data <- exprs(vsn_fit) # Save the normalized data to a new CSV file write.csv(normalized_data, "normalized_array_intensities_vsn.csv") message("VSN normalization complete. Normalized data saved to normalized_array_intensities_vsn.csv") ' -
4
Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.
$ Bash example
# Install R if not already installed # sudo apt-get update && sudo apt-get install -y r-base # Create a dummy R script for demonstration cat << 'EOF' > detect_miRNA_calls.R #!/usr/bin/env Rscript # This script performs detection calls based on a Wilcoxon rank-sum test. # It compares miRNA probe signals to GC-content matched anti-genomic probe signals. # # Input files are assumed to be tab-separated with at least 'probe_id', 'signal', and 'gc_content' columns. # For the Wilcoxon test to be applicable, 'current_miRNA_signals' (signals for a specific miRNA probe) # should ideally be a vector of observations (e.g., from replicates). # If 'miRNA probe signal' refers to a single value per probe, a standard Wilcoxon rank-sum test # comparing two groups is not directly applicable. In such cases, a custom approach # (e.g., comparing the single signal's rank within the combined distribution, or using percentile cutoffs) # would be needed. For demonstration, if a single miRNA signal is encountered, a percentile cutoff is used. args = commandArgs(trailingOnly=TRUE) if (length(args) < 3) { stop("Usage: Rscript detect_miRNA_calls.R <miRNA_signals_file> <anti_genomic_signals_file> <output_calls_file>", call.=FALSE) } miRNA_signals_file <- args[1] anti_genomic_signals_file <- args[2] output_calls_file <- args[3] # Read input data miRNA_data <- read.table(miRNA_signals_file, header=TRUE, sep="\t", stringsAsFactors=FALSE) anti_genomic_data <- read.table(anti_genomic_signals_file, header=TRUE, sep="\t", stringsAsFactors=FALSE) # Initialize results data frame results <- data.frame( probe_id = character(), p_value = numeric(), detected = logical(), stringsAsFactors = FALSE ) # Iterate through each unique miRNA probe unique_miRNA_probes <- unique(miRNA_data$probe_id) # Define GC content tolerance for matching gc_tolerance <- 0.05 # e.g., +/- 5% GC content for (probe_id in unique_miRNA_probes) { current_miRNA_signals <- miRNA_data$signal[miRNA_data$probe_id == probe_id] current_miRNA_gc <- miRNA_data$gc_content[miRNA_data$probe_id == probe_id][1] # Assuming GC content is consistent for a probe # Find GC-content matched anti-genomic probes matched_anti_genomic_signals <- anti_genomic_data$signal[ anti_genomic_data$gc_content >= (current_miRNA_gc - gc_tolerance) & anti_genomic_data$gc_content <= (current_miRNA_gc + gc_tolerance) ] p_value <- NA detected <- FALSE # Perform Wilcoxon rank-sum test if enough data (at least 2 observations per group) if (length(current_miRNA_signals) >= 2 && length(matched_anti_genomic_signals) >= 2) { # One-sided test: alternative="greater" to check if miRNA signals are significantly higher test_result <- wilcox.test(current_miRNA_signals, matched_anti_genomic_signals, alternative = "greater") p_value <- test_result$p.value detection_threshold <- 0.05 # Example p-value threshold detected <- p_value < detection_threshold } else if (length(current_miRNA_signals) == 1 && length(matched_anti_genomic_signals) >= 2) { # Special handling for a single miRNA signal against a distribution. # A direct wilcox.test is not applicable. Using a percentile cutoff as a proxy for detection. warning(paste("Probe", probe_id, ": Single miRNA signal detected. Using 95th percentile cutoff instead of Wilcoxon for detection call.")) percentile_threshold <- quantile(matched_anti_genomic_signals, 0.95) detected <- current_miRNA_signals[1] > percentile_threshold # p_value remains NA as no Wilcoxon test was performed } else { warning(paste("Probe", probe_id, ": Not enough data for Wilcoxon test or percentile comparison. Skipping detection.")) } results <- rbind(results, data.frame(probe_id = probe_id, p_value = p_value, detected = detected)) } # Write results write.table(results, file = output_calls_file, sep = "\t", row.names = FALSE, quote = FALSE) EOF # Make the script executable chmod +x detect_miRNA_calls.R # Create dummy input files for demonstration # These files simulate miRNA probe signals and GC-content matched anti-genomic probe signals. # 'probe_id' identifies the probe, 'signal' is its intensity, and 'gc_content' is its GC percentage. echo -e "probe_id\tsignal\tgc_content" > miRNA_probe_signals.txt echo -e "miRNA_1\t150\t0.45" >> miRNA_probe_signals.txt echo -e "miRNA_1\t160\t0.45" >> miRNA_probe_signals.txt # Replicate for miRNA_1 echo -e "miRNA_2\t80\t0.50" >> miRNA_probe_signals.txt echo -e "miRNA_3\t250\t0.60" >> miRNA_probe_signals.txt # Single signal for miRNA_3 echo -e "probe_id\tsignal\tgc_content" > anti_genomic_probe_signals.txt echo -e "anti_1\t50\t0.44" >> anti_genomic_probe_signals.txt echo -e "anti_2\t60\t0.46" >> anti_genomic_probe_signals.txt echo -e "anti_3\t70\t0.45" >> anti_genomic_probe_signals.txt echo -e "anti_4\t80\t0.47" >> anti_genomic_probe_signals.txt echo -e "anti_5\t90\t0.43" >> anti_genomic_probe_signals.txt echo -e "anti_6\t100\t0.51" >> anti_genomic_probe_signals.txt echo -e "anti_7\t110\t0.52" >> anti_genomic_probe_signals.txt echo -e "anti_8\t120\t0.50" >> anti_genomic_probe_signals.txt echo -e "anti_9\t130\t0.61" >> anti_genomic_probe_signals.txt echo -e "anti_10\t140\t0.59" >> anti_genomic_probe_signals.txt # Execute the R script to perform detection calls ./detect_miRNA_calls.R miRNA_probe_signals.txt anti_genomic_probe_signals.txt miRNA_detection_calls.txt -
5
For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.
$ Bash example
# Install R (if not already installed) # conda install -c r r-base # Create dummy input data files for demonstration echo -e "10.1\n10.5\n9.8\n10.3\n10.0" > sample1.txt echo -e "11.2\n10.9\n11.5\n11.0\n11.3" > sample2.txt # Create an R script to perform the two-sample t-test with equal variance cat << 'EOF' > run_t_test.R # Read data from files data1 <- as.numeric(readLines("sample1.txt")) data2 <- as.numeric(readLines("sample2.txt")) # Perform two-sample t-test assuming equal variances # var.equal = TRUE specifies the assumption of equal variance t_test_result <- t.test(data1, data2, var.equal = TRUE) # Print the results print(t_test_result) # Optionally, save results to a file # sink("t_test_results.txt") # print(t_test_result) # sink() EOF # Execute the R script Rscript run_t_test.R -
6
One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.
$ Bash example
#!/bin/bash # This script performs a one-way ANOVA using R. # The example data simulates an experimental design with more than two groups. # Ensure R is installed. # conda install -c r r-base # Create an R script file cat << 'EOF' > run_anova.R # Sample data for demonstration # Let's assume 'response' is the dependent variable and 'group' is the independent factor set.seed(123) group_a <- rnorm(30, mean = 10, sd = 2) group_b <- rnorm(30, mean = 12, sd = 2) group_c <- rnorm(30, mean = 11, sd = 2) response <- c(group_a, group_b, group_c) group <- factor(rep(c("A", "B", "C"), each = 30)) # Create a data frame data_anova <- data.frame(response = response, group = group) # Perform one-way ANOVA # The formula 'response ~ group' means 'response' is explained by 'group' anova_result <- aov(response ~ group, data = data_anova) # Print the summary of the ANOVA summary(anova_result) # Optional: Post-hoc test if ANOVA is significant (e.g., Tukey HSD) # If the p-value from summary(anova_result) is less than a chosen significance level (e.g., 0.05) # TukeyHSD(anova_result) EOF # Execute the R script Rscript run_anova.R -
7
These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.
$ Bash example
# Install R and Bioconductor if not already installed # For example, on Ubuntu: # sudo apt-get update # sudo apt-get install r-base # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')" # Create a dummy DESeq2 results file for demonstration purposes. # In a real bioinformatics pipeline, this file would be generated by a previous differential expression analysis step. echo -e "gene_id\tbaseMean\tlog2FoldChange\tlfcSE\tstat\tpvalue\tpadj" > deseq2_results.tsv echo -e "geneA\t100.5\t2.5\t0.2\t12.5\t0.00001\t0.0001" >> deseq2_results.tsv echo -e "geneB\t50.2\t-1.8\t0.3\t-6.0\t0.0001\t0.001" >> deseq2_results.tsv echo -e "geneC\t200.1\t0.5\t0.1\t5.0\t0.1\t0.5" >> deseq2_results.tsv echo -e "geneD\t10.0\t3.0\t0.5\t6.0\t0.002\t0.01" >> deseq2_results.tsv echo -e "geneE\t150.0\t-2.1\t0.25\t-8.4\t0.00005\t0.0005" >> deseq2_results.tsv echo -e "geneF\t75.0\t1.2\t0.15\t8.0\t0.0005\t0.005" >> deseq2_results.tsv echo -e "geneG\t30.0\t0.8\t0.1\t8.0\t0.00001\t0.0001" >> deseq2_results.tsv # High significance, low LFC echo -e "geneH\t250.0\t-1.5\t0.2\t-7.5\t0.005\t0.01" >> deseq2_results.tsv # High LFC, low significance # R script to filter DESeq2 results based on specified p-value and log2 difference thresholds Rscript -e ' # Read the DESeq2 results file results_df <- read.delim("deseq2_results.tsv", sep="\t", header=TRUE) # Define significance thresholds as per the description p_value_threshold <- 0.001 log2_diff_threshold <- 1 # Filter for significantly differentially expressed probes # "log2 difference > 1" is interpreted as absolute log2FoldChange > 1 for differential expression significant_probes <- subset(results_df, abs(log2FoldChange) > log2_diff_threshold & pvalue < p_value_threshold) # Save the significant probes to a new file write.table(significant_probes, "significant_probes.tsv", sep="\t", quote=FALSE, row.names=FALSE) # Optional: Print a summary of the filtering results cat(paste("Total probes analyzed:", nrow(results_df), "\n")) cat(paste("Probes identified as significantly differentially expressed:", nrow(significant_probes), "\n")) '
Raw Source Text
The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization. For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls. Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002). Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes. For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied. One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor. These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.