GSE16678 Processing Pipeline

Publication

A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells.

Stem cells and development (2010) — PMID 19807270

Dataset

MicroRNA expression data from differentiation of human Cyt49 ESCs into definitive endoderm in feeder-free conditions

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization.

affy (Inferred with models/gemini-2.5-flash), version: Bioconductor (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("affy")'

# Assuming raw data files (e.g., .CEL files for Affymetrix-like arrays) are in a directory named 'raw_data'.
# The description does not state the raw file format for the Ambion miRCHIP; .CEL is an assumption here.
# Note that affy also needs a CDF annotation matching the array design before rma() or mas5() can run.
# Place your raw data files in the 'raw_data' directory or adjust the 'data_dir' variable.

Rscript -e '
  library(affy)

  # Define the directory containing raw data files
  data_dir <- "raw_data"
  # List all raw data files (e.g., .CEL files). Adjust pattern if your files have a different extension.
  raw_files <- list.files(path = data_dir, pattern = "\\.CEL$", full.names = TRUE)

  if (length(raw_files) == 0) {
    stop("No raw data files found in the specified directory. Please ensure files are present and named correctly (e.g., .CEL).")
  }

  # Step 1: Read raw data into an AffyBatch object
  affy_batch <- ReadAffy(filenames = raw_files)

  # Option A: MAS5 processing for "probe specific signal detection calls" and "array scaling"
  # MAS5 performs background correction, global scaling normalization, and summarization.
  # It also provides Present/Marginal/Absent (P/M/A) calls.
  message("Performing MAS5 processing for signal detection calls and global scaling normalization...")
  eset_mas5 <- mas5(affy_batch)
  mas5_expression_matrix <- exprs(eset_mas5)
  write.csv(mas5_expression_matrix, "mas5_expression_matrix.csv")

  # Get MAS5 detection calls (Present/Marginal/Absent)
  mas5_detection_calls <- mas5calls(affy_batch)
  write.csv(exprs(mas5_detection_calls), "mas5_detection_calls.csv")

  # Option B: RMA processing as a stand-in for "background estimate and correction" and
  # "global normalization" (here, quantile normalization).
  # RMA performs background correction, quantile normalization, and median-polish summarization.
  # Its log2 transform only partly stabilizes variance; the variance stabilization named in the
  # description (Huber et al., 2002) corresponds to the vsn package, not to rma().
  message("Performing RMA processing (background correction, quantile normalization, summarization)...")
  eset_rma <- rma(affy_batch)
  rma_expression_matrix <- exprs(eset_rma)
  write.csv(rma_expression_matrix, "rma_expression_matrix.csv")

  message("Microarray signal processing complete. Output files: mas5_expression_matrix.csv, mas5_detection_calls.csv, rma_expression_matrix.csv")
'

For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls.

Custom script or R package function for microarray background subtraction (Inferred with models/gemini-2.5-flash), version: N/A

$ Bash example

# This step describes a conceptual background subtraction process common in microarray analysis.
# The actual implementation would typically be in a scripting language like R or Python,
# often integrated into a larger microarray processing pipeline (e.g., using Bioconductor packages).

# Parameters inferred from description:
# - Background derived from: median signal
# - Controls used: G-C matched anti-genomic controls

# Example conceptual command (actual tool/script would vary widely):
# A custom script would read probe data and control data,
# calculate the median signal for G-C matched anti-genomic controls,
# and subtract this background from the corresponding probe signals.

# Example placeholder for a custom script call:
# python custom_background_subtraction.py \
#     --probe_input "probe_signals.tsv" \
#     --control_input "gc_matched_controls.tsv" \
#     --output "background_subtracted_signals.tsv" \
#     --method "median_gc_matched"
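
The background step described above can be sketched in Python as follows. This is a minimal illustration, not the Ambion implementation: the data layout (each probe annotated with a GC count, controls grouped by GC count) and all names, such as `subtract_background`, are assumptions made for the example, and the signal values are invented.

```python
from statistics import median


def subtract_background(probe_signals, probe_gc, control_signals_by_gc):
    """Subtract a GC-matched background estimate from each probe signal.

    probe_signals: dict mapping probe id -> raw signal
    probe_gc: dict mapping probe id -> GC count (hypothetical annotation)
    control_signals_by_gc: dict mapping GC count -> anti-genomic control signals
    """
    corrected = {}
    for probe_id, signal in probe_signals.items():
        # Background is the median signal of the controls with matching GC count.
        background = median(control_signals_by_gc[probe_gc[probe_id]])
        corrected[probe_id] = signal - background
    return corrected


# Toy values, invented for illustration only.
probe_signals = {"miR-A": 520.0, "miR-B": 95.0}
probe_gc = {"miR-A": 10, "miR-B": 12}
control_signals_by_gc = {
    10: [80.0, 100.0, 90.0],
    12: [110.0, 100.0, 120.0, 130.0],
}

corrected = subtract_background(probe_signals, probe_gc, control_signals_by_gc)
print(corrected)
```

Corrected values can be negative when a probe falls below its matched background; the description does not say how such values were handled, so the sketch leaves them unchanged.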

Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002).

vsn (Inferred with models/gemini-2.5-flash), version: Not specified (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install BiocManager if not already installed
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'

# Install the vsn package if not already installed
# R -e 'BiocManager::install("vsn")'

# Example R script to perform vsn normalization
# This script assumes 'input_raw_data.csv' contains a matrix of raw microarray data
# and outputs 'normalized_expression_data.csv'.
# Users should adapt the data loading and saving parts to their specific file formats (e.g., CEL files, ExpressionSet objects).
Rscript -e '
    library(vsn)

    # --- Placeholder for raw data loading ---
    # Replace this section with your actual data loading logic.
    # Example 1: Loading a CSV file into a matrix
    # raw_data_matrix <- read.csv("input_raw_data.csv", row.names = 1)

    # Example 2: If you have Affymetrix CEL files and want to create an AffyBatch object
    # library(affy)
    # cel_files <- list.files(path = "path/to/your/cel_files", pattern = "\\.CEL$", full.names = TRUE)
    # raw_data_affy <- ReadAffy(filenames = cel_files)
    # For vsn, you can pass the AffyBatch object directly or its expression matrix.
    # raw_data_matrix <- exprs(raw_data_affy)

    # For demonstration, create a dummy matrix
    set.seed(123)
    raw_data_matrix <- matrix(rnorm(1000, mean = 10, sd = 2), ncol = 10)
    colnames(raw_data_matrix) <- paste0("Sample", 1:10)
    rownames(raw_data_matrix) <- paste0("Probe", 1:100)
    # -----------------------------------------

    # Perform variance stabilization normalization.
    # vsn() is the legacy entry point and has been superseded by vsn2() and justvsn();
    # justvsn() fits the model and returns the normalized values for a matrix input.
    normalized_data <- justvsn(raw_data_matrix)

    # If you loaded an AffyBatch, vsnrma() calibrates with vsn and summarizes probe sets with RMA,
    # returning an ExpressionSet:
    # normalized_data <- vsnrma(raw_data_affy)

    # --- Placeholder for saving normalized data ---
    # Replace this section with your actual data saving logic.
    # If normalized_data is a matrix:
    write.csv(normalized_data, "normalized_expression_data.csv")

    # If normalized_data is an ExpressionSet object (e.g., returned by vsnrma):
    # save(normalized_data, file = "normalized_expression_set.RData")
    # write.csv(exprs(normalized_data), "normalized_expression_data.csv") # To save the expression matrix
    # -----------------------------------------
'
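
For intuition about what variance stabilization does, the vsn approach applies an arsinh (generalized log) to an affinely calibrated intensity. The sketch below applies that transformation with fixed, made-up calibration parameters. The vsn package estimates the per-array offset and scale from the data, which this example does not attempt; the function name `glog_transform` and the parameter values are assumptions.

```python
import math


def glog_transform(intensity, offset, scale):
    """Return arsinh(offset + scale * intensity)."""
    return math.asinh(offset + scale * intensity)


# Hypothetical calibration parameters for one array (vsn would estimate these).
offset, scale = 0.0, 0.01

for raw in (0.0, 50.0, 500.0, 50000.0):
    h = glog_transform(raw, offset, scale)
    print(f"raw={raw:>8.1f}  transformed={h:.4f}")

# For large arguments arsinh(x) approaches log(2x), so the transform behaves like a
# logarithm at high intensity and roughly linearly near zero.
high = glog_transform(50000.0, offset, scale)
close_to_log = abs(high - math.log(2 * scale * 50000.0)) < 1e-3
print(close_to_log)
```

Unlike a plain logarithm, the transform is defined at zero and for negative background-corrected values, which is why it suits low-intensity probes.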

Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.

Custom script utilizing Wilcoxon rank-sum test (Inferred with models/gemini-2.5-flash), version: N/A

$ Bash example

# The description refers to a statistical test (Wilcoxon rank-sum test) applied to miRNA probe signals against GC-content matched anti-genomic probes. This is typically implemented within a custom script using a statistical programming language like Python (with SciPy) or R.

# Example Python script for performing a Wilcoxon rank-sum test for detection calls.
# This script would read two sets of signal values (miRNA probes and control probes),
# perform the test, and output detection results (e.g., p-values, significant calls).

# Installation (example for Python with SciPy):
# conda install -c anaconda scipy numpy pandas

# Assuming a custom Python script named 'run_wilcoxon_detection.py' that takes
# input signal files and an output file for detection calls.
# The script would internally use a function like scipy.stats.ranksums.

python run_wilcoxon_detection.py \
    --mirna_signals miRNA_probe_signals.txt \
    --anti_genomic_signals gc_matched_anti_genomic_signals.txt \
    --output detection_calls.txt \
    --alpha 0.05 # Example parameter: significance level for detection
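
A detection call of this kind can also be sketched without a separate script file. The example below is an illustration under stated assumptions, not the vendor's implementation. It treats each miRNA as having several replicate probe signals, compares them with GC-matched anti-genomic control signals using a one-sided rank-sum test with a normal approximation (no tie or continuity correction), and calls the miRNA detected when the p-value falls below 0.05. The function names, the toy signals, and the 0.05 cutoff are hypothetical; the description does not state a detection threshold.

```python
import math


def rank_sum_pvalue(sample, controls):
    """One-sided Wilcoxon rank-sum p-value via the normal approximation.

    Small p-values indicate that `sample` values tend to exceed `controls`.
    """
    n1, n2 = len(sample), len(controls)
    pooled = sorted([(v, 0) for v in sample] + [(v, 1) for v in controls])
    # Assign mid-ranks so tied values share the average of their positions.
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[k] = mid_rank
        i = j + 1
    rank_sum = sum(r for r, (_, label) in zip(ranks, pooled) if label == 0)
    mean = n1 * (n1 + n2 + 1) / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (rank_sum - mean) / sd
    return 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability


ALPHA = 0.05  # hypothetical detection threshold, not taken from the source

# Toy signals, invented for illustration only.
controls = [80, 95, 110, 100, 90, 105, 85, 120]
mirna_signals = {
    "miR-A": [400, 420, 390, 450],  # clearly above the controls
    "miR-B": [95, 100, 88, 102],    # indistinguishable from the controls
}

p_values = {name: rank_sum_pvalue(s, controls) for name, s in mirna_signals.items()}
calls = {name: p < ALPHA for name, p in p_values.items()}
print(calls)
```

With only a handful of replicates the normal approximation is rough; `scipy.stats.mannwhitneyu(sample, controls, alternative="greater")` is a library alternative to the hand-written test.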

For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.

scipy.stats.ttest_ind (Inferred with models/gemini-2.5-flash), version: 1.11.0

$ Bash example

# Install Miniconda if not already installed
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# bash miniconda.sh -b -p $HOME/miniconda
# eval "$($HOME/miniconda/bin/conda shell.bash hook)"
# conda init
# conda activate base

# Create a new conda environment for scientific Python libraries
# conda create -n stats_env python=3.9 scipy numpy -y
# conda activate stats_env

# Python script to perform a two-sample t-test with equal variance
python -c "
import numpy as np
from scipy import stats

# Placeholder for sample data. In a real pipeline, these would be loaded from files
# (e.g., CSV, TSV) or passed as arguments. For demonstration, we use dummy data.
np.random.seed(42) # for reproducibility
sample1 = np.random.normal(loc=10, scale=2, size=50)
sample2 = np.random.normal(loc=10.5, scale=2, size=60)

# Perform two-sample t-test assuming equal variance
# equal_var=True is the default for ttest_ind, but explicitly stated for clarity
t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)

print(f'Two-sample t-Test (equal variance assumed):')
print(f'  Sample 1 mean: {np.mean(sample1):.3f}')
print(f'  Sample 2 mean: {np.mean(sample2):.3f}')
print(f'  T-statistic: {t_statistic:.4f}')
print(f'  P-value: {p_value:.4e}')
"
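
For reference, the statistic computed by an equal-variance two-sample t-test is the standard pooled-variance form below. The symbols are generic notation rather than anything defined in the source: group means \bar{x}_1 and \bar{x}_2, sample variances s_1^2 and s_2^2, and group sizes n_1 and n_2.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)\, s_1^2 + (n_2 - 1)\, s_2^2}{n_1 + n_2 - 2},
\qquad
\text{df} = n_1 + n_2 - 2
```

The p-value is then taken from the t distribution with n_1 + n_2 - 2 degrees of freedom, which is what `equal_var=True` selects in `scipy.stats.ttest_ind`.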

One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.

R (Inferred with models/gemini-2.5-flash), version: R 4.x

$ Bash example

# Install R if not already installed (example using conda)
# conda install -c conda-forge r-base r-essentials -y

# Create a dummy data file for demonstration purposes.
# In a real bioinformatics pipeline, this would be an actual output from a previous step,
# containing quantitative measurements (e.g., gene expression, peak intensity) and grouping information.
echo "expression_level,group" > experimental_data.csv
echo "10.2,Control" >> experimental_data.csv
echo "11.5,Control" >> experimental_data.csv
echo "9.8,Control" >> experimental_data.csv
echo "15.1,TreatmentA" >> experimental_data.csv
echo "14.5,TreatmentA" >> experimental_data.csv
echo "16.0,TreatmentA" >> experimental_data.csv
echo "12.3,TreatmentB" >> experimental_data.csv
echo "13.0,TreatmentB" >> experimental_data.csv
echo "11.9,TreatmentB" >> experimental_data.csv
echo "8.5,Control" >> experimental_data.csv
echo "17.2,TreatmentA" >> experimental_data.csv
echo "10.5,TreatmentB" >> experimental_data.csv

# R script to perform One-way ANOVA
R_CODE=$(cat << 'EOF'
# Read the experimental data from a CSV file
data <- read.csv('experimental_data.csv')

# Ensure the grouping variable is treated as a factor
data$group <- as.factor(data$group)

# Perform One-way ANOVA
# The formula 'dependent_variable ~ independent_variable' specifies the model.
# Here, 'expression_level' is the dependent variable and 'group' is the independent grouping factor.
anova_model <- aov(expression_level ~ group, data = data)

# Print the ANOVA summary table to standard output
cat("--- One-way ANOVA Results ---\n")
print(summary(anova_model))

# Optionally, run a post-hoc test (e.g., Tukey HSD) when the ANOVA is significant, to identify
# which specific group means differ. Check the ANOVA p-value first
# (e.g., summary(anova_model)[[1]]$`Pr(>F)`[1] < 0.05), then uncomment the lines below.
# if (nlevels(data$group) > 2) { # With only two groups the ANOVA already identifies the pair
#   cat("\n--- Tukey HSD Post-hoc Test Results ---\n")
#   print(TukeyHSD(anova_model))
# }
EOF
)

# Execute the R script using Rscript
Rscript -e "$R_CODE"

These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.

limma (Inferred with models/gemini-2.5-flash), version: latest

$ Bash example

# This script filters differential expression results based on p-value and log2 fold change.
# It assumes an input file (e.g., limma_results.tsv) with columns for log2FoldChange and pvalue.

# Define parameters
P_VALUE_THRESHOLD=0.001
LOG2FC_THRESHOLD=1

# Example input file (replace with actual file name)
INPUT_FILE="limma_results.tsv" # Assuming output from limma, a common tool for microarray data
OUTPUT_FILE="significant_probes.tsv"

# Column layout assumed here: 1 = probe ID, 2 = logFC, 3 = AveExpr, 4 = t, 5 = P.Value,
# 6 = adj.P.Val, 7 = B (a limma topTable written with probe IDs in the first column).
# Verify the column indices against your actual file and adjust $2 and $5 below if they differ.

# Get header from the input file
head -n 1 "${INPUT_FILE}" > "${OUTPUT_FILE}"

# Filter data rows based on p-value and absolute log2 fold change
awk -v p_thresh="${P_VALUE_THRESHOLD}" -v log2fc_thresh="${LOG2FC_THRESHOLD}" '
BEGIN { FS="\t"; OFS="\t" }
NR > 1 { # Skip header row
    log2fc = $2; # Assuming logFC is the 2nd column
    pvalue = $5; # Assuming P.Value is the 5th column

    # Check if p-value is less than the threshold AND absolute log2 fold change is greater than the threshold
    if (pvalue < p_thresh && (log2fc > log2fc_thresh || log2fc < -log2fc_thresh)) {
        print $0
    }
}' "${INPUT_FILE}" >> "${OUTPUT_FILE}"

echo "Filtered significant probes saved to ${OUTPUT_FILE}"

Raw Source Text

The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization. For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls. Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002). Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.  For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.  One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.  These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.
