GSE16689 Processing Pipeline

GSE code_examples 7 steps

Publication

A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells.

Stem cells and development (2010) — PMID 19807270

Dataset

MicroRNA expression data from differentiation of human H9 ESCs into definitive endoderm on MEF feeder layers

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization.

Microarray data processing software (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# This step describes conceptual signal processing for Ambion miRCHIP data.
# Microarray data processing is typically performed using R packages (e.g., limma, oligo, affy) or specialized commercial software.
# The following is a conceptual R script outline for such processing, assuming raw data files (e.g., .gpr or similar) and an annotation file.

# Install necessary R packages if not already installed (example for limma and oligo)
# install.packages("BiocManager")
# BiocManager::install("limma")
# BiocManager::install("oligo") # Or 'affy' depending on the specific chip type and raw data format

# R script (conceptual outline for processing Ambion miRCHIP data)
# Rscript process_mirchip_data.R

# --- Content of process_mirchip_data.R (conceptual) ---
# library(oligo) # Or limma, affy, depending on the specific data format and processing needs
# library(limma)

# # Placeholder for raw data files (e.g., scanner output files for Ambion miRCHIP)
# # raw_data_directory <- "./raw_mirchip_data"
# # raw_files <- list.files(raw_data_directory, pattern = "\\.gpr$", full.names = TRUE) # Example for GenePix GPR files

# # 1. Load raw data and perform probe specific signal detection calls
# # This step is highly dependent on the raw data format (e.g., .gpr, .txt, .cel)
# # For Ambion miRCHIP, this might involve reading intensity values and flags.
# # raw_expression_data <- read.maimages(raw_files, source = "genepix", green.only = TRUE) # Example using limma, assuming single-channel data

# # 2. Background estimate and correction
# # bg_corrected_data <- backgroundCorrect(raw_expression_data, method = "normexp") # Example using limma's normexp method

# # 3. Constant variance stabilization
# # This often involves a transformation like log2 or a more sophisticated method like VSN.
# # vsn_data <- normalizeVSN(bg_corrected_data) # Example using limma's normalizeVSN (requires the vsn package)

# # 4. Array scaling or global normalization
# # VSN already calibrates the arrays against each other; apply a further method only if one is still needed.
# # normalized_expression <- normalizeBetweenArrays(vsn_data, method = "quantile") # Example using limma with quantile normalization

# # 5. Summarize probe signals to miRNA level (if multiple probes per miRNA)
# # This requires an annotation file mapping probes to miRNAs.
# # annotation_file <- "./ambion_mirchip_annotation.txt" # Placeholder for Ambion miRCHIP annotation
# # mirchip_annotation <- read.delim(annotation_file)
# # 
# # # Example: Aggregate probe intensities to miRNA level (e.g., by mean or median)
# # # processed_mirna_expression <- aggregate(normalized_expression$E, by=list(mirchip_annotation$miRNA_ID), FUN=mean)

# # 6. Save processed data
# # write.csv(processed_mirna_expression, "processed_ambion_mirchip_expression.csv", row.names = FALSE)

For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls.

Unknown (Inferred with models/gemini-2.5-flash) vUnknown

$ Bash example

# This step describes a background subtraction method commonly used in probe-based assays, such as microarrays.
# It involves identifying G-C matched anti-genomic control probes, calculating their median signal,
# and subtracting this median value from the signal of each experimental probe.
# The specific tool or software used for this operation is not explicitly stated in the description.

# Below is a conceptual representation of how such a process might be executed using a custom script.
# The actual implementation would depend on the specific data format and programming language (e.g., R, Python).

# Example of a conceptual command using a placeholder script:
python custom_background_correction.py \
    --probe_signal_file "raw_probe_signals.tsv" \
    --control_probe_file "gc_matched_anti_genomic_controls.tsv" \
    --output_file "background_corrected_signals.tsv" \
    --gc_content_column "GC_Content" \
    --signal_column "Intensity" \
    --method "median_subtraction"
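
The placeholder command above refers to a script that is not part of the source. As a minimal sketch of the idea, assuming that "G-C matched" means controls with the same G+C count as the probe, the subtraction could look like the following. The tuple layouts, the function name, and the fallback for probes without matched controls are all hypothetical.

```python
from statistics import median


def background_correct(probes, controls):
    """Subtract a GC-matched background estimate from each probe signal.

    probes:   list of (probe_id, gc_count, signal) tuples (hypothetical layout)
    controls: list of (gc_count, signal) tuples for anti-genomic control probes
    """
    # Group control signals by GC count so each probe can look up its matched set
    controls_by_gc = {}
    for gc_count, signal in controls:
        controls_by_gc.setdefault(gc_count, []).append(signal)

    corrected = {}
    for probe_id, gc_count, signal in probes:
        matched = controls_by_gc.get(gc_count)
        if not matched:
            # Assumption: with no matched controls, leave the signal uncorrected
            corrected[probe_id] = signal
            continue
        # Background estimate = median signal of the matched controls
        corrected[probe_id] = signal - median(matched)
    return corrected


# Toy values for illustration only (not taken from the dataset)
controls = [(10, 80.0), (10, 90.0), (10, 100.0), (12, 60.0), (12, 70.0)]
probes = [("miR-a", 10, 250.0), ("miR-b", 12, 65.0)]
result = background_correct(probes, controls)
print(result)
```

A real implementation would also need a rule for negative corrected values, which the source does not describe.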

Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002).

vsn (Inferred with models/gemini-2.5-flash) vNot specified

$ Bash example

# Install R and Bioconductor if not already present
# conda install -c conda-forge r-base
# conda install -c bioconda bioconductor-vsn

# Create an R script to perform VSN normalization
cat << 'EOF' > normalize_vsn.R
# Load the vsn package and Biobase for ExpressionSet manipulation
library(vsn)
library(Biobase) # Required for exprs() function if vsn2 returns an ExpressionSet

# --- Placeholder for loading your raw microarray data ---
# Replace 'raw_data_matrix.tsv' with your actual input file path.
# This example assumes a tab-separated file where rows are probes/genes and columns are samples.
# Adjust the loading method (e.g., read.csv, read.delim) based on your file format.
# Example: raw_data_matrix <- as.matrix(read.delim("raw_data_matrix.tsv", row.names = 1))

# For demonstration, let's create a dummy matrix representing raw intensity data
set.seed(123)
raw_data_matrix <- matrix(rnorm(1000 * 5, mean = 1000, sd = 200), ncol = 5)
colnames(raw_data_matrix) <- paste0("Sample", 1:5)
rownames(raw_data_matrix) <- paste0("Probe", 1:1000)

# Perform Variance Stabilization Normalization (VSN)
# For matrix input, vsn2() returns a fitted 'vsn' object that holds the calibration parameters.
vsn_fit <- vsn2(raw_data_matrix)

# Apply the fitted transformation to obtain the normalized matrix (generalized log, base-2 scale)
normalized_matrix <- predict(vsn_fit, newdata = raw_data_matrix)

# --- Placeholder for saving the normalized data ---
# Replace 'normalized_data_vsn.tsv' with your desired output file name.
write.table(normalized_matrix, "normalized_data_vsn.tsv", sep = "\t", quote = FALSE, row.names = TRUE)

message("VSN normalization complete. Normalized data saved to normalized_data_vsn.tsv")
EOF

# Execute the R script
Rscript normalize_vsn.R
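
The variance stabilization of Huber et al. rests on an arsinh-type ("generalized log") transformation applied after an affine calibration of each array. The sketch below shows only the shape of that transformation. The `offset` and `scale` parameters are hypothetical stand-ins for the per-array values that vsn estimates from the data, and the base-2 scaling is an illustrative choice rather than a statement about the package's internal convention.

```python
import math


def glog2(x):
    """Generalized log: log2((x + sqrt(x^2 + 1)) / 2), i.e. asinh(x) / ln(2) - 1."""
    return math.asinh(x) / math.log(2) - 1.0


def stabilize(intensity, offset, scale):
    """Affine calibration followed by the generalized log.

    offset and scale are hypothetical per-array parameters; estimating them
    is the core of the real method and is not attempted here.
    """
    return glog2(offset + scale * intensity)


# The transform is defined at zero and for negative values, and it
# approaches log2 for large intensities.
for y in (-5.0, 0.0, 10.0, 1000.0, 100000.0):
    print(y, round(stabilize(y, offset=0.0, scale=1.0), 4))
```

Unlike a plain log2, this transform remains finite for background-subtracted values at or below zero, which is why it suits data that have already been background corrected.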

Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.

Custom script (R/Python) (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# Example R script to perform Wilcoxon rank-sum test
# This script assumes two input files:
# 1. miRNA_probe_signals.tsv: Tab-separated file with probe_id and signal for miRNA probes
# 2. gc_matched_anti_genomic_signals.tsv: Tab-separated file with probe_id and signal for GC-matched anti-genomic probes

# Create dummy input files for demonstration
echo -e "probe1\t100\nprobe2\t120\nprobe3\t90\nprobe4\t110" > miRNA_probe_signals.tsv
echo -e "probeA\t80\nprobeB\t95\nprobeC\t70\nprobeD\t85\nprobeE\t100" > gc_matched_anti_genomic_signals.tsv

# R script content
# Write the R script to a file. A quoted heredoc keeps the shell from
# stripping the inner double quotes or expanding $signal.
cat << 'EOF' > wilcoxon_detection.R
# Load data
miRNA_signals <- read.delim("miRNA_probe_signals.tsv", header=FALSE, col.names=c("probe_id", "signal"))
anti_genomic_signals <- read.delim("gc_matched_anti_genomic_signals.tsv", header=FALSE, col.names=c("probe_id", "signal"))

# Perform Wilcoxon rank-sum test
# A detection call asks whether the miRNA signal exceeds background, so a one-sided
# test (alternative = "greater") is assumed here; the source does not state the direction.
# In a real analysis the test would be run per probe on that probe's replicate measurements.
wilcox_test_result <- wilcox.test(miRNA_signals$signal, anti_genomic_signals$signal, alternative = "greater")

# Print results
cat("Wilcoxon Rank-Sum Test Results:\n")
print(wilcox_test_result)

# Optionally, save results to a file
sink("wilcoxon_test_results.txt")
print(wilcox_test_result)
sink()
EOF

# Execute the R script
Rscript wilcoxon_detection.R

# Clean up dummy files
rm miRNA_probe_signals.tsv gc_matched_anti_genomic_signals.tsv
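
To make the per-probe nature of a detection call concrete, here is a minimal sketch that compares the replicate signals of one probe against its GC-matched control signals with an exact one-sided rank-sum test. It enumerates all relabelings of the pooled values, so it only suits small samples. The one-sided direction, the `alpha` cutoff, and all function names are assumptions, since the source does not specify them.

```python
from itertools import combinations
from math import comb


def u_statistic(x, y):
    """Mann-Whitney U for x versus y: each pair with x > y counts 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)


def rank_sum_p_greater(x, y):
    """Exact one-sided p-value for the hypothesis that x tends to exceed y."""
    pooled = list(x) + list(y)
    observed = u_statistic(x, y)
    n = len(x)
    hits = 0
    # Enumerate every way of assigning n pooled values to the "probe" group
    for idx in combinations(range(len(pooled)), n):
        chosen = set(idx)
        group_x = [pooled[i] for i in idx]
        group_y = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if u_statistic(group_x, group_y) >= observed:
            hits += 1
    return hits / comb(len(pooled), n)


def detection_call(probe_replicates, control_signals, alpha=0.05):
    """Return (p_value, detected); alpha is a hypothetical cutoff."""
    p = rank_sum_p_greater(probe_replicates, control_signals)
    return p, p < alpha


# Toy values for illustration only
controls = [80.0, 95.0, 70.0, 85.0, 100.0, 90.0]
p_high, call_high = detection_call([120.0, 130.0, 125.0, 140.0], controls)
p_low, call_low = detection_call([75.0, 92.0, 83.0, 88.0], controls)
print(p_high, call_high)
print(p_low, call_low)
```

With four replicates against six controls the smallest attainable p-value is 1/210, so any detection threshold has to be chosen with the sample sizes in mind.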

For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.

scipy.stats (Inferred with models/gemini-2.5-flash) v1.11.x GitHub

$ Bash example

# Install Python and SciPy if not already available
# conda create -n ttest_env python=3.9 scipy numpy
# conda activate ttest_env

# Example data files (replace with actual data paths in a real pipeline)
# For demonstration, let's create dummy data files
# In a real pipeline, these would be generated by upstream steps.
# Each file contains numerical values, one per line, representing a sample.

echo "10.1" > sample1_data.txt
echo "10.5" >> sample1_data.txt
echo "9.8" >> sample1_data.txt
echo "11.2" >> sample1_data.txt
echo "10.3" >> sample1_data.txt

echo "12.0" > sample2_data.txt
echo "11.5" >> sample2_data.txt
echo "12.8" >> sample2_data.txt
echo "11.9" >> sample2_data.txt
echo "12.2" >> sample2_data.txt

# Python script to perform the two-sample t-test with equal variance
python -c "
import numpy as np
from scipy import stats

# Load data from files
sample1 = np.loadtxt('sample1_data.txt')
sample2 = np.loadtxt('sample2_data.txt')

# Perform two-sample t-test assuming equal variance
# equal_var=True corresponds to the assumption of equal variance (Student's t-test)
t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)

print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')
"

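The equal-variance assumption means the two group variances are pooled into a single estimate before the standard error is formed. As a sketch of what `equal_var=True` computes, using the same dummy values as the example above, the statistic can be written out by hand. The function name is hypothetical, and the p-value is omitted because it requires the t distribution with `n1 + n2 - 2` degrees of freedom.

```python
from math import sqrt


def pooled_t_statistic(sample1, sample2):
    """Student's two-sample t statistic with a pooled variance estimate."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Sums of squared deviations within each group
    ss1 = sum((v - mean1) ** 2 for v in sample1)
    ss2 = sum((v - mean2) ** 2 for v in sample2)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    standard_error = sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean1 - mean2) / standard_error, df


# Same dummy values as the shell example above
t_stat, df = pooled_t_statistic(
    [10.1, 10.5, 9.8, 11.2, 10.3],
    [12.0, 11.5, 12.8, 11.9, 12.2],
)
print(f"t = {t_stat:.4f}, df = {df}")
```
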
One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.

R (Inferred with models/gemini-2.5-flash) v4.3.2 GitHub

$ Bash example

# Install R if not already installed (example for Ubuntu/Debian):
# sudo apt update
# sudo apt install r-base

# Create a dummy R script for one-way ANOVA
cat << 'EOF' > run_anova.R
# Simulate data for demonstration purposes
# In a real scenario, 'my_data.csv' would be loaded here.
set.seed(123)

# Group 1: Control
group1 <- rnorm(30, mean = 10, sd = 2)
# Group 2: Treatment A
group2 <- rnorm(30, mean = 12, sd = 2)
# Group 3: Treatment B
group3 <- rnorm(30, mean = 10.5, sd = 2)

# Combine into a data frame
my_data <- data.frame(
  value = c(group1, group2, group3),
  group = factor(c(rep("Control", 30), rep("TreatmentA", 30), rep("TreatmentB", 30)))
)

# Perform one-way ANOVA
anova_result <- aov(value ~ group, data = my_data)

# Print the summary of the ANOVA results
print("One-way ANOVA Results:")
print(summary(anova_result))

# Optionally, perform post-hoc tests if ANOVA is significant (e.g., Tukey HSD)
# if (summary(anova_result)[[1]][["Pr(>F)"]][1] < 0.05) {
#   cat("\nPost-hoc Tukey HSD Test:\n")
#   print(TukeyHSD(anova_result))
# }

# Save results to a file (optional)
# sink("anova_results.txt")
# print(summary(anova_result))
# sink()
EOF

# Execute the R script
Rscript run_anova.R
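
For a single probe measured in more than two groups, the one-way ANOVA F statistic is the ratio of between-group to within-group mean squares. The following is a minimal sketch of that calculation on three hypothetical groups. It returns the statistic and its degrees of freedom only, since the p-value requires the F distribution.

```python
def one_way_f(groups):
    """Return (F, df_between, df_within) for a list of groups of values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group sum of squares: group size times squared distance to the grand mean
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Within-group sum of squares: squared distance of each value to its group mean
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Toy values for illustration only
f_stat, df_b, df_w = one_way_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]])
print(f"F = {f_stat:.2f} on ({df_b}, {df_w}) degrees of freedom")
```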

These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.

Custom Script for Differential Expression Filtering (Inferred with models/gemini-2.5-flash) v1.0 GitHub

$ Bash example

# Define thresholds for significance
P_VALUE_THRESHOLD=0.001
LOG2_FC_THRESHOLD=1

# Input file containing per-probe test results (e.g., from the t-test or ANOVA steps above)
# Replace 'differential_expression_results.tsv' with the actual path to your input file.
INPUT_FILE="differential_expression_results.tsv"

# Output file for significantly differentially expressed probes
OUTPUT_FILE="significant_probes.tsv"

# Filter significant probes based on p-value and absolute log2 fold change thresholds.
# Assumes the input file is tab-separated, with log2FoldChange in the 2nd column ($2)
# and p-value in the 3rd column ($3). The first line is treated as a header.
awk -v p_thresh="$P_VALUE_THRESHOLD" -v fc_thresh="$LOG2_FC_THRESHOLD" '
BEGIN { FS="\t"; OFS="\t" }
NR==1 { print; next } # Print header line
{
  log2fc = $2;
  pvalue = $3;
  # Check if p-value is below threshold AND absolute log2 fold change is above threshold
  if (pvalue < p_thresh && (log2fc > fc_thresh || log2fc < -fc_thresh)) {
    print # Print the line if criteria are met
  }
}' "$INPUT_FILE" > "$OUTPUT_FILE"
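
The same rule can be expressed as a small predicate. This sketch assumes, like the awk filter above, that "log2 difference > 1" is applied to the absolute difference so that both up- and down-regulated probes pass; the source does not say so explicitly. The record layout and names are hypothetical.

```python
def is_significant(p_value, log2_diff, p_threshold=0.001, log2_threshold=1.0):
    """Apply the default cutoffs: p-value below 0.001 and |log2 difference| above 1."""
    return p_value < p_threshold and abs(log2_diff) > log2_threshold


# Toy records for illustration only: (probe_id, log2_difference, p_value)
results = [
    ("probe1", 2.3, 0.0002),    # passes both cutoffs
    ("probe2", -1.8, 0.0005),   # passes (down-regulated)
    ("probe3", 0.4, 0.00001),   # fails the log2 difference cutoff
    ("probe4", 3.1, 0.02),      # fails the p-value cutoff
]
significant = [pid for pid, diff, p in results if is_significant(p, diff)]
print(significant)
```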

Raw Source Text

The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization. For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls. Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002). Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.  For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.  One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.  These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.
