GSE16687 Processing Pipeline

GSE code_examples 7 steps

Publication

A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells.

Stem cells and development (2010) — PMID 19807270

Dataset

MicroRNA expression data from differentiation of human Cyt49 ESCs into definitive endoderm on MEF feeder layers

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization.

limma (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and necessary packages if not already installed
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager', repos='http://cran.us.r-project.org'); BiocManager::install('limma')"

# Placeholder for raw data file (e.g., tab-separated file from Ambion scanner)
RAW_DATA_FILE="miRCHIP_raw_data.txt"
OUTPUT_NORMALIZED_FILE="miRCHIP_normalized_data.txt"

# This R script performs background correction, variance stabilization, and normalization
# using the limma package, suitable for single-channel microarray data like Ambion miRCHIP.
# The exact parsing of RAW_DATA_FILE into E (foreground) and Eb (background) matrices
# would depend on the specific format of the Ambion miRCHIP output.
# For demonstration, we assume these matrices can be extracted.

Rscript -e '
  library(limma)
  
  # --- Configuration ---
  raw_data_file <- Sys.getenv("RAW_DATA_FILE")
  output_normalized_file <- Sys.getenv("OUTPUT_NORMALIZED_FILE")
  
  # --- Dummy Data Creation (Replace with actual data loading) ---
  # In a real scenario, you would parse your Ambion miRCHIP raw data file(s)
  # to extract foreground (E) and background (Eb) intensity matrices.
  # Example: If your raw data file has columns like "ProbeID", "Sample1_Signal", "Sample1_Background", ...
  # you would read it and then separate into E and Eb matrices.
  
  # For demonstration, creating dummy matrices:
  set.seed(42)
  num_probes <- 100
  num_samples <- 5
  probe_ids <- paste0("miR_probe_", 1:num_probes)
  sample_names <- paste0("Sample_", 1:num_samples)
  
  E_raw <- matrix(rnorm(num_probes * num_samples, mean=1000, sd=200), nrow=num_probes, ncol=num_samples, dimnames=list(probe_ids, sample_names))
  Eb_raw <- matrix(rnorm(num_probes * num_samples, mean=100, sd=20), nrow=num_probes, ncol=num_samples, dimnames=list(probe_ids, sample_names))
  
  # --- 1. Create EList object ---
  # This object holds raw foreground (E) and background (Eb) intensities.
  MA <- new("EList")
  MA$E <- E_raw
  MA$Eb <- Eb_raw
  
  # --- 2. Background Estimate and Correction ---
  # Using "normexp" method, common for single-channel arrays.
  MA_bg_corrected <- backgroundCorrect(MA, method="normexp")
  
  # --- 3. Constant Variance Stabilization and Global Normalization ---
  # Using "vsn" (Variance Stabilizing Normalization) method.
  # This method performs both variance stabilization and global normalization.
  MA_normalized <- normalizeBetweenArrays(MA_bg_corrected, method="vsn")
  
  # --- 4. Extract and Save Normalized Data ---
  normalized_expression_matrix <- MA_normalized$E
  write.table(normalized_expression_matrix, file=output_normalized_file, sep="\t", quote=FALSE, row.names=TRUE)
  
  cat(paste0("Normalized data written to: ", output_normalized_file, "\n"))
'

View on GitHub

For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls.

Custom Script (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# This step describes a custom background subtraction method for microarray data.
# It involves calculating the median signal from G-C matched anti-genomic control probes
# and subtracting this value from each experimental probe's signal.
# The exact implementation would depend on the microarray platform and data format.

# Input files (placeholders):
PROBE_SIGNAL_FILE="raw_probe_signals.txt" # File containing raw signal intensities for all probes
CONTROL_PROBE_INFO="gc_matched_control_definitions.txt" # File defining G-C matched anti-genomic controls for each probe
OUTPUT_FILE="background_subtracted_probe_signals.txt"

# A real implementation would involve parsing these files, identifying control probes,
# calculating their median signal, and applying the subtraction. This is typically done
# in a scripting language like R or Python, or using vendor-specific software.

# Example conceptual execution command (assuming a custom script 'subtract_background.R' exists):
# Rscript subtract_background.R \
#   --probe_signals ${PROBE_SIGNAL_FILE} \
#   --control_info ${CONTROL_PROBE_INFO} \
#   --output ${OUTPUT_FILE}

# For demonstration, a generic echo command:
echo "Performing G-C matched anti-genomic background subtraction on ${PROBE_SIGNAL_FILE} to create ${OUTPUT_FILE}"

View on GitHub

Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002).

vsn (Inferred with models/gemini-2.5-flash) vNot specified

$ Bash example

# Install BiocManager if not already installed
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'

# Install vsn package if not already installed
# R -e 'BiocManager::install("vsn")'

Rscript -e '
  library(vsn)
  
  # Load raw data (replace with your actual file path and format)
  # Assuming "raw_array_intensities.csv" contains the raw array intensity matrix.
  # The first column is assumed to be probe IDs, and subsequent columns are array samples.
  # Adjust read.csv parameters (e.g., sep, header) based on your file format.
  raw_data <- read.csv("raw_array_intensities.csv", row.names = 1)
  
  # Convert to a matrix for vsn processing
  data_matrix <- as.matrix(raw_data)
  
  # Perform VSN normalization using the vsn2 function
  # vsn2 returns an object of class "vsn" containing the normalized data
  vsn_fit <- vsn2(data_matrix)
  
  # Extract the normalized data matrix from the vsn object
  normalized_data <- exprs(vsn_fit)
  
  # Save the normalized data to a new CSV file
  write.csv(normalized_data, "normalized_array_intensities_vsn.csv")
  
  message("VSN normalization complete. Normalized data saved to normalized_array_intensities_vsn.csv")
'

Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.

R (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Install R if not already installed
# sudo apt-get update && sudo apt-get install -y r-base

# Create a dummy R script for demonstration
cat << 'EOF' > detect_miRNA_calls.R
#!/usr/bin/env Rscript

# This script performs detection calls based on a Wilcoxon rank-sum test.
# It compares miRNA probe signals to GC-content matched anti-genomic probe signals.
#
# Input files are assumed to be tab-separated with at least 'probe_id', 'signal', and 'gc_content' columns.
# For the Wilcoxon test to be applicable, 'current_miRNA_signals' (signals for a specific miRNA probe)
# should ideally be a vector of observations (e.g., from replicates).
# If 'miRNA probe signal' refers to a single value per probe, a standard Wilcoxon rank-sum test
# comparing two groups is not directly applicable. In such cases, a custom approach
# (e.g., comparing the single signal's rank within the combined distribution, or using percentile cutoffs)
# would be needed. For demonstration, if a single miRNA signal is encountered, a percentile cutoff is used.

args = commandArgs(trailingOnly=TRUE)

if (length(args) < 3) {
  stop("Usage: Rscript detect_miRNA_calls.R <miRNA_signals_file> <anti_genomic_signals_file> <output_calls_file>", call.=FALSE)
}

miRNA_signals_file <- args[1]
anti_genomic_signals_file <- args[2]
output_calls_file <- args[3]

# Read input data
miRNA_data <- read.table(miRNA_signals_file, header=TRUE, sep="\t", stringsAsFactors=FALSE)
anti_genomic_data <- read.table(anti_genomic_signals_file, header=TRUE, sep="\t", stringsAsFactors=FALSE)

# Initialize results data frame
results <- data.frame(
  probe_id = character(),
  p_value = numeric(),
  detected = logical(),
  stringsAsFactors = FALSE
)

# Iterate through each unique miRNA probe
unique_miRNA_probes <- unique(miRNA_data$probe_id)

# Define GC content tolerance for matching
gc_tolerance <- 0.05 # e.g., +/- 5% GC content

for (probe_id in unique_miRNA_probes) {
  current_miRNA_signals <- miRNA_data$signal[miRNA_data$probe_id == probe_id]
  current_miRNA_gc <- miRNA_data$gc_content[miRNA_data$probe_id == probe_id][1] # Assuming GC content is consistent for a probe

  # Find GC-content matched anti-genomic probes
  matched_anti_genomic_signals <- anti_genomic_data$signal[
    anti_genomic_data$gc_content >= (current_miRNA_gc - gc_tolerance) &
    anti_genomic_data$gc_content <= (current_miRNA_gc + gc_tolerance)
  ]

  p_value <- NA
  detected <- FALSE

  # Perform Wilcoxon rank-sum test if enough data (at least 2 observations per group)
  if (length(current_miRNA_signals) >= 2 && length(matched_anti_genomic_signals) >= 2) {
    # One-sided test: alternative="greater" to check if miRNA signals are significantly higher
    test_result <- wilcox.test(current_miRNA_signals, matched_anti_genomic_signals, alternative = "greater")
    p_value <- test_result$p.value
    detection_threshold <- 0.05 # Example p-value threshold
    detected <- p_value < detection_threshold
  } else if (length(current_miRNA_signals) == 1 && length(matched_anti_genomic_signals) >= 2) {
    # Special handling for a single miRNA signal against a distribution.
    # A direct wilcox.test is not applicable. Using a percentile cutoff as a proxy for detection.
    warning(paste("Probe", probe_id, ": Single miRNA signal detected. Using 95th percentile cutoff instead of Wilcoxon for detection call."))
    percentile_threshold <- quantile(matched_anti_genomic_signals, 0.95)
    detected <- current_miRNA_signals[1] > percentile_threshold
    # p_value remains NA as no Wilcoxon test was performed
  } else {
    warning(paste("Probe", probe_id, ": Not enough data for Wilcoxon test or percentile comparison. Skipping detection."))
  }

  results <- rbind(results, data.frame(probe_id = probe_id, p_value = p_value, detected = detected))
}

# Write results
write.table(results, file = output_calls_file, sep = "\t", row.names = FALSE, quote = FALSE)
EOF

# Make the script executable
chmod +x detect_miRNA_calls.R

# Create dummy input files for demonstration
# These files simulate miRNA probe signals and GC-content matched anti-genomic probe signals.
# 'probe_id' identifies the probe, 'signal' is its intensity, and 'gc_content' is its GC percentage.
echo -e "probe_id\tsignal\tgc_content" > miRNA_probe_signals.txt
echo -e "miRNA_1\t150\t0.45" >> miRNA_probe_signals.txt
echo -e "miRNA_1\t160\t0.45" >> miRNA_probe_signals.txt # Replicate for miRNA_1
echo -e "miRNA_2\t80\t0.50" >> miRNA_probe_signals.txt
echo -e "miRNA_3\t250\t0.60" >> miRNA_probe_signals.txt # Single signal for miRNA_3

echo -e "probe_id\tsignal\tgc_content" > anti_genomic_probe_signals.txt
echo -e "anti_1\t50\t0.44" >> anti_genomic_probe_signals.txt
echo -e "anti_2\t60\t0.46" >> anti_genomic_probe_signals.txt
echo -e "anti_3\t70\t0.45" >> anti_genomic_probe_signals.txt
echo -e "anti_4\t80\t0.47" >> anti_genomic_probe_signals.txt
echo -e "anti_5\t90\t0.43" >> anti_genomic_probe_signals.txt
echo -e "anti_6\t100\t0.51" >> anti_genomic_probe_signals.txt
echo -e "anti_7\t110\t0.52" >> anti_genomic_probe_signals.txt
echo -e "anti_8\t120\t0.50" >> anti_genomic_probe_signals.txt
echo -e "anti_9\t130\t0.61" >> anti_genomic_probe_signals.txt
echo -e "anti_10\t140\t0.59" >> anti_genomic_probe_signals.txt

# Execute the R script to perform detection calls
./detect_miRNA_calls.R miRNA_probe_signals.txt anti_genomic_probe_signals.txt miRNA_detection_calls.txt

View on GitHub

For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.

R (Inferred with models/gemini-2.5-flash) v4.x.x GitHub

$ Bash example

# Install R (if not already installed)
# conda install -c r r-base

# Create dummy input data files for demonstration
echo -e "10.1\n10.5\n9.8\n10.3\n10.0" > sample1.txt
echo -e "11.2\n10.9\n11.5\n11.0\n11.3" > sample2.txt

# Create an R script to perform the two-sample t-test with equal variance
cat << 'EOF' > run_t_test.R
# Read data from files
data1 <- as.numeric(readLines("sample1.txt"))
data2 <- as.numeric(readLines("sample2.txt"))

# Perform two-sample t-test assuming equal variances
# var.equal = TRUE specifies the assumption of equal variance
t_test_result <- t.test(data1, data2, var.equal = TRUE)

# Print the results
print(t_test_result)

# Optionally, save results to a file
# sink("t_test_results.txt")
# print(t_test_result)
# sink()
EOF

# Execute the R script
Rscript run_t_test.R

View on GitHub

One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.

R (Inferred with models/gemini-2.5-flash) v4.x (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

#!/bin/bash

# This script performs a one-way ANOVA using R.
# The example data simulates an experimental design with more than two groups.

# Ensure R is installed.
# conda install -c r r-base

# Create an R script file
cat << 'EOF' > run_anova.R
# Sample data for demonstration
# Let's assume 'response' is the dependent variable and 'group' is the independent factor
set.seed(123)
group_a <- rnorm(30, mean = 10, sd = 2)
group_b <- rnorm(30, mean = 12, sd = 2)
group_c <- rnorm(30, mean = 11, sd = 2)

response <- c(group_a, group_b, group_c)
group <- factor(rep(c("A", "B", "C"), each = 30))

# Create a data frame
data_anova <- data.frame(response = response, group = group)

# Perform one-way ANOVA
# The formula 'response ~ group' means 'response' is explained by 'group'
anova_result <- aov(response ~ group, data = data_anova)

# Print the summary of the ANOVA
summary(anova_result)

# Optional: Post-hoc test if ANOVA is significant (e.g., Tukey HSD)
# If the p-value from summary(anova_result) is less than a chosen significance level (e.g., 0.05)
# TukeyHSD(anova_result)
EOF

# Execute the R script
Rscript run_anova.R

View on GitHub

These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.

DESeq2 (Inferred with models/gemini-2.5-flash) v1.38.3 GitHub

$ Bash example

# Install R and Bioconductor if not already installed
# For example, on Ubuntu:
# sudo apt-get update
# sudo apt-get install r-base
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install('DESeq2')"

# Create a dummy DESeq2 results file for demonstration purposes.
# In a real bioinformatics pipeline, this file would be generated by a previous differential expression analysis step.
echo -e "gene_id\tbaseMean\tlog2FoldChange\tlfcSE\tstat\tpvalue\tpadj" > deseq2_results.tsv
echo -e "geneA\t100.5\t2.5\t0.2\t12.5\t0.00001\t0.0001" >> deseq2_results.tsv
echo -e "geneB\t50.2\t-1.8\t0.3\t-6.0\t0.0001\t0.001" >> deseq2_results.tsv
echo -e "geneC\t200.1\t0.5\t0.1\t5.0\t0.1\t0.5" >> deseq2_results.tsv
echo -e "geneD\t10.0\t3.0\t0.5\t6.0\t0.002\t0.01" >> deseq2_results.tsv
echo -e "geneE\t150.0\t-2.1\t0.25\t-8.4\t0.00005\t0.0005" >> deseq2_results.tsv
echo -e "geneF\t75.0\t1.2\t0.15\t8.0\t0.0005\t0.005" >> deseq2_results.tsv
echo -e "geneG\t30.0\t0.8\t0.1\t8.0\t0.00001\t0.0001" >> deseq2_results.tsv # High significance, low LFC
echo -e "geneH\t250.0\t-1.5\t0.2\t-7.5\t0.005\t0.01" >> deseq2_results.tsv # High LFC, low significance

# R script to filter DESeq2 results based on specified p-value and log2 difference thresholds
Rscript -e '
# Read the DESeq2 results file
results_df <- read.delim("deseq2_results.tsv", sep="\t", header=TRUE)

# Define significance thresholds as per the description
p_value_threshold <- 0.001
log2_diff_threshold <- 1

# Filter for significantly differentially expressed probes
# "log2 difference > 1" is interpreted as absolute log2FoldChange > 1 for differential expression
significant_probes <- subset(results_df,
                             abs(log2FoldChange) > log2_diff_threshold &
                             pvalue < p_value_threshold)

# Save the significant probes to a new file
write.table(significant_probes, "significant_probes.tsv", sep="\t", quote=FALSE, row.names=FALSE)

# Optional: Print a summary of the filtering results
cat(paste("Total probes analyzed:", nrow(results_df), "\n"))
cat(paste("Probes identified as significantly differentially expressed:", nrow(significant_probes), "\n"))
'

View on GitHub

Raw Source Text

The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization. For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls. Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002). Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.  For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.  One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.  These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.

← Back to Analysis