GSE16690 Processing Pipeline

GSE code_examples 7 steps

Publication

A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells.

Stem cells and development (2010) — PMID 19807270

Dataset

A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization.

limma (Inferred with models/gemini-2.5-flash) vBioconductor GitHub

$ Bash example

# Install limma if not already installed
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("limma")

# R script for Ambion miRCHIP signal processing using limma
# This script demonstrates the general steps described: probe specific signal detection calls (conceptual),
# background estimate and correction, constant variance stabilization, and array scaling/global normalization.
# Actual implementation details (e.g., data loading format, specific normalization methods) may vary
# based on the exact Ambion miRCHIP output files and experimental design.

library(limma)

# Placeholder for input data files (e.g., raw intensity files from Ambion miRCHIP)
# For two-color arrays (common for miRCHIP), you would typically have red and green channel intensities.
# A 'targets' file is often used to link sample names to their raw data files.
# For demonstration, we'll simulate an RGList object.

# --- Conceptual Step: Probe specific signal detection calls ---
# This step typically involves filtering probes based on detection p-values or flags provided by the array platform.
# limma itself doesn't generate these calls, but can use them for filtering prior to analysis.
# For example, if you have a detection p-value matrix 'detP', you might filter probes like:
# detected_probes <- rowSums(detP < 0.01) >= N_samples_threshold
# RG <- RG[detected_probes, ]

# --- Simulate raw data (replace with actual data loading) ---
# For a real scenario, you would load data from files, e.g., using read.maimages()
# or specific functions if Ambion provides a dedicated R package.
set.seed(42)
num_probes <- 5000
num_samples <- 6

# Simulate Red and Green channel intensities
R_intensities <- matrix(rnorm(num_probes * num_samples, mean=1000, sd=200), nrow=num_probes)
G_intensities <- matrix(rnorm(num_probes * num_samples, mean=900, sd=180), nrow=num_probes)

# Add some background noise and ensure positive values
R_intensities[R_intensities < 1] <- 1
G_intensities[G_intensities < 1] <- 1

rownames(R_intensities) <- paste0("Probe_", 1:num_probes)
colnames(R_intensities) <- paste0("Sample_", 1:num_samples)
rownames(G_intensities) <- paste0("Probe_", 1:num_probes)
colnames(G_intensities) <- paste0("Sample_", 1:num_samples)

# Create an RGList object (Red and Green channel intensities)
RG <- new("RGList", list(R=R_intensities, G=G_intensities,
                         targets=data.frame(FileName=paste0("Sample_", 1:num_samples, ".txt"))))
RG$genes <- data.frame(ProbeID=rownames(RG$R))

message("--- Raw data loaded (simulated) ---")

# --- Background estimate and correction ---
# Using 'normexp' method, which is robust for two-color arrays.
MA <- backgroundCorrect(RG, method="normexp")
message("--- Background corrected ---")

# --- Constant variance stabilization and normalization ---
# Variance stabilization is often achieved through log transformation (implicit in M and A values)
# and robust normalization methods.

# 1. Within-array normalization (e.g., 'loess' for two-color arrays to correct for dye effects)
MA_norm_within <- normalizeWithinArrays(MA, method="loess")
message("--- Within-array normalized (loess) ---")

# 2. Between-array normalization (global normalization/array scaling)
# 'quantile' is a common and robust global normalization method.
# 'scale' method can also be used for global scaling.
MA_normalized <- normalizeBetweenArrays(MA_norm_within, method="quantile")
message("--- Between-array normalized (quantile) ---")

# --- Output processed data ---
# The 'A' component of the MA_normalized object contains the average log2 intensity for each probe/sample
# after background correction and normalization, which can be considered the processed signal.
processed_intensities <- MA_normalized$A

output_file <- "processed_miRCHIP_data.txt"
write.table(processed_intensities, file=output_file, sep="\t", quote=FALSE, row.names=TRUE)
message(paste("--- Processed data saved to", output_file, "---"))

# For further analysis (e.g., differential expression), an MArrayLM object would typically be created:
# design <- model.matrix(~0+factor(c(1,1,2,2,3,3)))
# colnames(design) <- c("Group1", "Group2", "Group3")
# fit <- lmFit(MA_normalized, design)
# fit <- eBayes(fit)
# topTable(fit, coef="Group2-Group1")

View on GitHub

For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls.

gcrma (Inferred with models/gemini-2.5-flash) vBioconductor GitHub

$ Bash example

# Install R and Bioconductor if not already present
# sudo apt-get update && sudo apt-get install -y r-base
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install(c('affy', 'gcrma'))"

#!/bin/bash

# Create a dummy directory for CEL files (replace with actual path to your CEL files)
mkdir -p ./cel_files
# In a real scenario, copy your .CEL files here, e.g.:
# cp /path/to/your/raw_data/*.CEL ./cel_files/

# Create an R script to perform GCRMA background correction
cat << 'EOF' > run_gcrma.R
# Load necessary libraries
library(affy)
library(gcrma)

# Define the directory containing CEL files
cel_dir <- "./cel_files/"

# Read CEL files from the specified directory
# This assumes all .CEL files in the directory belong to the experiment.
# For more complex setups, a phenoData file or explicit list of files can be used.
raw_data <- ReadAffy(celfile.path = cel_dir)

# Perform GCRMA background correction, normalization, and summarization
# GCRMA uses probe affinity and G-C content to estimate and subtract background noise.
eset <- gcrma(raw_data)

# Extract the expression matrix (normalized and background-corrected values)
expression_matrix <- exprs(eset)

# Save the processed data to a tab-separated file
# The output file 'processed_data.txt' will contain the expression values
# with probe IDs as row names and sample names as column names.
write.table(expression_matrix, file = "processed_data.txt", sep = "\t", quote = FALSE, row.names = TRUE)
EOF

# Execute the R script
Rscript run_gcrma.R

View on GitHub

Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002).

vsn (Inferred with models/gemini-2.5-flash) vNot specified, typically part of Bioconductor

$ Bash example

Rscript -e '
    # Load the vsn package
    library(vsn)

    # --- User-defined parameters ---
    # Replace "input_raw_microarray_data.tsv" with the actual path to your raw microarray data file.
    # This file is expected to be a matrix where rows are probes/features and columns are samples/arrays.
    # Adjust `read.delim` parameters (e.g., `sep`, `header`, `row.names`) based on your specific input file format.
    input_file="input_raw_microarray_data.tsv"
    output_file="normalized_microarray_data_vsn.tsv"

    # --- Load raw data ---
    # Example: Reads a tab-separated file with probe IDs in the first column and raw intensity values in subsequent columns.
    raw_data <- read.delim(input_file, row.names = 1, header = TRUE, sep="\t")

    # Convert to matrix for vsnMatrix function
    raw_data_matrix <- as.matrix(raw_data)

    # --- Perform VSN normalization ---
    # The vsnMatrix function applies variance stabilization normalization to a matrix of intensity values.
    # For Affymetrix data, you might use `vsn` directly on an `AffyBatch` object from the `affy` package
    # (e.g., `library(affy); raw_affy_data <- ReadAffy(); normalized_affy_data <- vsn(raw_affy_data)`).
    normalized_data_matrix <- vsnMatrix(raw_data_matrix)

    # --- Save normalized data ---
    # Write the normalized matrix to a tab-separated file.
    # `col.names = NA` ensures that the row names (probe IDs) are written as the first column,
    # and the column names (sample IDs) are written starting from the second column.
    write.table(normalized_data_matrix, output_file, sep = "\t", quote = FALSE, col.names = NA)
'

Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.

Custom Script using Wilcoxon rank-sum test (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Example Python script to perform Wilcoxon rank-sum test
# This script assumes two input files, each with a single column of numerical signals.

# Create dummy input files for demonstration (replace with actual data in a real pipeline)
# echo "10\n12\n11\n15\n9" > miRNA_probe_signals.txt
# echo "8\n7\n9\n10\n6" > anti_genomic_probe_signals.txt

# Python script content
cat << 'EOF' > run_wilcoxon_test.py
import pandas as pd
from scipy.stats import ranksums
import sys

def run_wilcoxon(miRNA_file, anti_genomic_file, output_file):
    try:
        miRNA_signals = pd.read_csv(miRNA_file, header=None).iloc[:, 0].dropna().tolist()
        anti_genomic_signals = pd.read_csv(anti_genomic_file, header=None).iloc[:, 0].dropna().tolist()
    except Exception as e:
        print(f"Error reading input files: {e}", file=sys.stderr)
        sys.exit(1)

    if not miRNA_signals or not anti_genomic_signals:
        print("Error: One or both input signal lists are empty after reading.", file=sys.stderr)
        sys.exit(1)

    # Perform Wilcoxon rank-sum test
    statistic, p_value = ranksums(miRNA_signals, anti_genomic_signals)

    with open(output_file, 'w') as f:
        f.write(f"Wilcoxon Rank-Sum Test Results:\n")
        f.write(f"Statistic: {statistic}\n")
        f.write(f"P-value: {p_value}\n")
    print(f"Results saved to {output_file}")

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print("Usage: python run_wilcoxon_test.py <miRNA_probe_signals_file> <anti_genomic_probe_signals_file> <output_results_file>", file=sys.stderr)
        sys.exit(1)

    miRNA_file = sys.argv[1]
    anti_genomic_file = sys.argv[2]
    output_file = sys.argv[3]
    run_wilcoxon(miRNA_file, anti_genomic_file, output_file)
EOF

# Execute the Python script
# Ensure Python, pandas, and scipy are installed (e.g., via conda or pip):
# conda install -c anaconda python pandas scipy
python run_wilcoxon_test.py miRNA_probe_signals.txt anti_genomic_probe_signals.txt wilcoxon_results.txt

View on GitHub

For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.

scipy.stats.ttest_ind (Inferred with models/gemini-2.5-flash) v1.11.x GitHub

$ Bash example

# Install Python and SciPy if not already available
# conda install -c anaconda python scipy numpy

# Example Python script to perform a two-sample t-test with equal variance.
# This script assumes 'sample1.txt' and 'sample2.txt' contain numerical data,
# with one value per line. Replace these with actual input file paths.

# Create dummy input files for demonstration purposes (uncomment to use)
# echo -e "10.1\n12.3\n11.5\n13.0\n10.8" > sample1.txt
# echo -e "9.5\n11.2\n10.7\n12.1\n9.9" > sample2.txt

python -c "
import numpy as np
from scipy import stats
import sys

def run_ttest(file1, file2, output_file='ttest_results.txt'):
    try:
        sample1 = np.loadtxt(file1)
        sample2 = np.loadtxt(file2)
    except Exception as e:
        print(f'Error loading samples: {e}', file=sys.stderr)
        sys.exit(1)

    if len(sample1) < 2 or len(sample2) < 2:
        print('Each sample must have at least two data points for t-test.', file=sys.stderr)
        sys.exit(1)

    # Perform two-sample t-test assuming equal variance
    # equal_var=True corresponds to the assumption of equal variance (Student's t-test)
    t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)

    with open(output_file, 'w') as f:
        f.write(f'T-statistic: {t_statistic}\n')
        f.write(f'P-value: {p_value}\n')
    print(f'T-test results written to {output_file}')

if __name__ == '__main__':
    # Define input files (replace with actual paths in a real pipeline)
    input_file1 = 'sample1.txt'
    input_file2 = 'sample2.txt'
    output_file = 'ttest_results.txt'
    run_ttest(input_file1, input_file2, output_file)
"

View on GitHub

One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.

R (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Example data file (replace with actual experimental data)
# This file should contain a column for the dependent variable (e.g., 'measurement')
# and a column for the grouping factor (e.g., 'group').
echo "measurement,group" > data.csv
echo "10,A" >> data.csv
echo "12,A" >> data.csv
echo "11,A" >> data.csv
echo "15,B" >> data.csv
echo "14,B" >> data.csv
echo "16,B" >> data.csv
echo "18,C" >> data.csv
echo "20,C" >> data.csv
echo "19,C" >> data.csv

# Create an R script for one-way ANOVA
cat << 'EOF' > run_anova.R
# Load data
data <- read.csv("data.csv")

# Ensure the grouping variable is a factor
data$group <- as.factor(data$group)

# Perform one-way ANOVA
# The formula 'dependent_variable ~ grouping_factor'
anova_result <- aov(measurement ~ group, data = data)

# Print summary of ANOVA results
summary(anova_result)

# Optionally, perform post-hoc tests if ANOVA is significant (e.g., Tukey HSD)
# If you have more than two groups and the ANOVA p-value is significant,
# you might want to know which specific groups differ.
# if (summary(anova_result)[[1]]$`Pr(>F)`[1] < 0.05) {
#   print("Performing Tukey HSD post-hoc test:")
#   print(TukeyHSD(anova_result))
# }

# Save results to a file (optional)
# sink("anova_results.txt")
# summary(anova_result)
# sink()
EOF

# Execute the R script
Rscript run_anova.R

View on GitHub

These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.

DESeq2 (Inferred with models/gemini-2.5-flash) v1.42.0 (Bioconductor 3.18)

$ Bash example

# This script demonstrates how to filter differential expression results based on specified p-value and log2 fold change thresholds, typically performed on output from tools like DESeq2.

# Install R and DESeq2 if not already installed
# conda install -c conda-forge r-base
# conda install -c bioconda bioconductor-deseq2
# conda install -c conda-forge r-data.table # For efficient data handling

# Create a dummy DESeq2 results file for demonstration purposes.
# In a real pipeline, this file would be the output from a DESeq2 analysis.
cat <<EOF > deseq2_results.csv
gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj
geneA,100,0.5,0.1,5,0.01,0.05
geneB,200,1.2,0.1,12,0.0001,0.0005
geneC,50,2.5,0.2,10,0.00001,0.00005
geneD,300,-1.5,0.15,-10,0.00005,0.0001
geneE,150,0.8,0.1,8,0.005,0.01
geneF,80,-0.7,0.05,-14,0.000001,0.000005
geneG,250,1.8,0.1,15,0.002,0.005
EOF

# R script to load DESeq2-like results and apply significance filtering
Rscript -e '
  library(data.table) # Used for efficient reading/writing of data frames

  # Load the differential expression results file
  results_df <- fread("deseq2_results.csv")

  # Define the significance thresholds as per the description
  p_value_threshold <- 0.001
  log2_diff_threshold <- 1

  # Filter for significantly differentially expressed probes/genes
  # The description implies a p-value, but for differential expression, adjusted p-value (padj) is standard.
  # We use padj < p_value_threshold and absolute log2FoldChange > log2_diff_threshold.
  significant_probes <- results_df[
    results_df$padj < p_value_threshold & 
    abs(results_df$log2FoldChange) > log2_diff_threshold,
  ]

  # Output the list of significant probes to a new CSV file
  fwrite(significant_probes, "significant_probes.csv")

  # Print a summary of the filtering process
  cat(paste0("Total probes analyzed: ", nrow(results_df), "\n"))
  cat(paste0("Probes considered significant (padj < ", p_value_threshold, " and |log2FoldChange| > ", log2_diff_threshold, "): ", nrow(significant_probes), "\n"))
'

# Clean up the dummy results file
rm deseq2_results.csv

Raw Source Text

The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization. For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls. Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002). Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.  For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.  One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.  These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.

← Back to Analysis