GSE16690 Processing Pipeline
Publication
A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells.Stem cells and development (2010) — PMID 19807270
Dataset
GSE16690A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells
Processing Steps
Generate Jupyter Notebook-
1
The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization.
$ Bash example
# Install limma if not already installed # if (!requireNamespace("BiocManager", quietly = TRUE)) # install.packages("BiocManager") # BiocManager::install("limma") # R script for Ambion miRCHIP signal processing using limma # This script demonstrates the general steps described: probe specific signal detection calls (conceptual), # background estimate and correction, constant variance stabilization, and array scaling/global normalization. # Actual implementation details (e.g., data loading format, specific normalization methods) may vary # based on the exact Ambion miRCHIP output files and experimental design. library(limma) # Placeholder for input data files (e.g., raw intensity files from Ambion miRCHIP) # For two-color arrays (common for miRCHIP), you would typically have red and green channel intensities. # A 'targets' file is often used to link sample names to their raw data files. # For demonstration, we'll simulate an RGList object. # --- Conceptual Step: Probe specific signal detection calls --- # This step typically involves filtering probes based on detection p-values or flags provided by the array platform. # limma itself doesn't generate these calls, but can use them for filtering prior to analysis. # For example, if you have a detection p-value matrix 'detP', you might filter probes like: # detected_probes <- rowSums(detP < 0.01) >= N_samples_threshold # RG <- RG[detected_probes, ] # --- Simulate raw data (replace with actual data loading) --- # For a real scenario, you would load data from files, e.g., using read.maimages() # or specific functions if Ambion provides a dedicated R package. set.seed(42) num_probes <- 5000 num_samples <- 6 # Simulate Red and Green channel intensities R_intensities <- matrix(rnorm(num_probes * num_samples, mean=1000, sd=200), nrow=num_probes) G_intensities <- matrix(rnorm(num_probes * num_samples, mean=900, sd=180), nrow=num_probes) # Add some background noise and ensure positive values R_intensities[R_intensities < 1] <- 1 G_intensities[G_intensities < 1] <- 1 rownames(R_intensities) <- paste0("Probe_", 1:num_probes) colnames(R_intensities) <- paste0("Sample_", 1:num_samples) rownames(G_intensities) <- paste0("Probe_", 1:num_probes) colnames(G_intensities) <- paste0("Sample_", 1:num_samples) # Create an RGList object (Red and Green channel intensities) RG <- new("RGList", list(R=R_intensities, G=G_intensities, targets=data.frame(FileName=paste0("Sample_", 1:num_samples, ".txt")))) RG$genes <- data.frame(ProbeID=rownames(RG$R)) message("--- Raw data loaded (simulated) ---") # --- Background estimate and correction --- # Using 'normexp' method, which is robust for two-color arrays. MA <- backgroundCorrect(RG, method="normexp") message("--- Background corrected ---") # --- Constant variance stabilization and normalization --- # Variance stabilization is often achieved through log transformation (implicit in M and A values) # and robust normalization methods. # 1. Within-array normalization (e.g., 'loess' for two-color arrays to correct for dye effects) MA_norm_within <- normalizeWithinArrays(MA, method="loess") message("--- Within-array normalized (loess) ---") # 2. Between-array normalization (global normalization/array scaling) # 'quantile' is a common and robust global normalization method. # 'scale' method can also be used for global scaling. MA_normalized <- normalizeBetweenArrays(MA_norm_within, method="quantile") message("--- Between-array normalized (quantile) ---") # --- Output processed data --- # The 'A' component of the MA_normalized object contains the average log2 intensity for each probe/sample # after background correction and normalization, which can be considered the processed signal. processed_intensities <- MA_normalized$A output_file <- "processed_miRCHIP_data.txt" write.table(processed_intensities, file=output_file, sep="\t", quote=FALSE, row.names=TRUE) message(paste("--- Processed data saved to", output_file, "---")) # For further analysis (e.g., differential expression), an MArrayLM object would typically be created: # design <- model.matrix(~0+factor(c(1,1,2,2,3,3))) # colnames(design) <- c("Group1", "Group2", "Group3") # fit <- lmFit(MA_normalized, design) # fit <- eBayes(fit) # topTable(fit, coef="Group2-Group1") -
2
For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls.
$ Bash example
# Install R and Bioconductor if not already present # sudo apt-get update && sudo apt-get install -y r-base # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager'); BiocManager::install(c('affy', 'gcrma'))" #!/bin/bash # Create a dummy directory for CEL files (replace with actual path to your CEL files) mkdir -p ./cel_files # In a real scenario, copy your .CEL files here, e.g.: # cp /path/to/your/raw_data/*.CEL ./cel_files/ # Create an R script to perform GCRMA background correction cat << 'EOF' > run_gcrma.R # Load necessary libraries library(affy) library(gcrma) # Define the directory containing CEL files cel_dir <- "./cel_files/" # Read CEL files from the specified directory # This assumes all .CEL files in the directory belong to the experiment. # For more complex setups, a phenoData file or explicit list of files can be used. raw_data <- ReadAffy(celfile.path = cel_dir) # Perform GCRMA background correction, normalization, and summarization # GCRMA uses probe affinity and G-C content to estimate and subtract background noise. eset <- gcrma(raw_data) # Extract the expression matrix (normalized and background-corrected values) expression_matrix <- exprs(eset) # Save the processed data to a tab-separated file # The output file 'processed_data.txt' will contain the expression values # with probe IDs as row names and sample names as column names. write.table(expression_matrix, file = "processed_data.txt", sep = "\t", quote = FALSE, row.names = TRUE) EOF # Execute the R script Rscript run_gcrma.R -
3
Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002).
vsn (Inferred with models/gemini-2.5-flash) vNot specified, typically part of Bioconductor$ Bash example
Rscript -e ' # Load the vsn package library(vsn) # --- User-defined parameters --- # Replace "input_raw_microarray_data.tsv" with the actual path to your raw microarray data file. # This file is expected to be a matrix where rows are probes/features and columns are samples/arrays. # Adjust `read.delim` parameters (e.g., `sep`, `header`, `row.names`) based on your specific input file format. input_file="input_raw_microarray_data.tsv" output_file="normalized_microarray_data_vsn.tsv" # --- Load raw data --- # Example: Reads a tab-separated file with probe IDs in the first column and raw intensity values in subsequent columns. raw_data <- read.delim(input_file, row.names = 1, header = TRUE, sep="\t") # Convert to matrix for vsnMatrix function raw_data_matrix <- as.matrix(raw_data) # --- Perform VSN normalization --- # The vsnMatrix function applies variance stabilization normalization to a matrix of intensity values. # For Affymetrix data, you might use `vsn` directly on an `AffyBatch` object from the `affy` package # (e.g., `library(affy); raw_affy_data <- ReadAffy(); normalized_affy_data <- vsn(raw_affy_data)`). normalized_data_matrix <- vsnMatrix(raw_data_matrix) # --- Save normalized data --- # Write the normalized matrix to a tab-separated file. # `col.names = NA` ensures that the row names (probe IDs) are written as the first column, # and the column names (sample IDs) are written starting from the second column. write.table(normalized_data_matrix, output_file, sep = "\t", quote = FALSE, col.names = NA) ' -
4
Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.
$ Bash example
# Example Python script to perform Wilcoxon rank-sum test # This script assumes two input files, each with a single column of numerical signals. # Create dummy input files for demonstration (replace with actual data in a real pipeline) # echo "10\n12\n11\n15\n9" > miRNA_probe_signals.txt # echo "8\n7\n9\n10\n6" > anti_genomic_probe_signals.txt # Python script content cat << 'EOF' > run_wilcoxon_test.py import pandas as pd from scipy.stats import ranksums import sys def run_wilcoxon(miRNA_file, anti_genomic_file, output_file): try: miRNA_signals = pd.read_csv(miRNA_file, header=None).iloc[:, 0].dropna().tolist() anti_genomic_signals = pd.read_csv(anti_genomic_file, header=None).iloc[:, 0].dropna().tolist() except Exception as e: print(f"Error reading input files: {e}", file=sys.stderr) sys.exit(1) if not miRNA_signals or not anti_genomic_signals: print("Error: One or both input signal lists are empty after reading.", file=sys.stderr) sys.exit(1) # Perform Wilcoxon rank-sum test statistic, p_value = ranksums(miRNA_signals, anti_genomic_signals) with open(output_file, 'w') as f: f.write(f"Wilcoxon Rank-Sum Test Results:\n") f.write(f"Statistic: {statistic}\n") f.write(f"P-value: {p_value}\n") print(f"Results saved to {output_file}") if __name__ == "__main__": if len(sys.argv) != 4: print("Usage: python run_wilcoxon_test.py <miRNA_probe_signals_file> <anti_genomic_probe_signals_file> <output_results_file>", file=sys.stderr) sys.exit(1) miRNA_file = sys.argv[1] anti_genomic_file = sys.argv[2] output_file = sys.argv[3] run_wilcoxon(miRNA_file, anti_genomic_file, output_file) EOF # Execute the Python script # Ensure Python, pandas, and scipy are installed (e.g., via conda or pip): # conda install -c anaconda python pandas scipy python run_wilcoxon_test.py miRNA_probe_signals.txt anti_genomic_probe_signals.txt wilcoxon_results.txt -
5
For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.
$ Bash example
# Install Python and SciPy if not already available # conda install -c anaconda python scipy numpy # Example Python script to perform a two-sample t-test with equal variance. # This script assumes 'sample1.txt' and 'sample2.txt' contain numerical data, # with one value per line. Replace these with actual input file paths. # Create dummy input files for demonstration purposes (uncomment to use) # echo -e "10.1\n12.3\n11.5\n13.0\n10.8" > sample1.txt # echo -e "9.5\n11.2\n10.7\n12.1\n9.9" > sample2.txt python -c " import numpy as np from scipy import stats import sys def run_ttest(file1, file2, output_file='ttest_results.txt'): try: sample1 = np.loadtxt(file1) sample2 = np.loadtxt(file2) except Exception as e: print(f'Error loading samples: {e}', file=sys.stderr) sys.exit(1) if len(sample1) < 2 or len(sample2) < 2: print('Each sample must have at least two data points for t-test.', file=sys.stderr) sys.exit(1) # Perform two-sample t-test assuming equal variance # equal_var=True corresponds to the assumption of equal variance (Student's t-test) t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=True) with open(output_file, 'w') as f: f.write(f'T-statistic: {t_statistic}\n') f.write(f'P-value: {p_value}\n') print(f'T-test results written to {output_file}') if __name__ == '__main__': # Define input files (replace with actual paths in a real pipeline) input_file1 = 'sample1.txt' input_file2 = 'sample2.txt' output_file = 'ttest_results.txt' run_ttest(input_file1, input_file2, output_file) " -
6
One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.
$ Bash example
# Example data file (replace with actual experimental data) # This file should contain a column for the dependent variable (e.g., 'measurement') # and a column for the grouping factor (e.g., 'group'). echo "measurement,group" > data.csv echo "10,A" >> data.csv echo "12,A" >> data.csv echo "11,A" >> data.csv echo "15,B" >> data.csv echo "14,B" >> data.csv echo "16,B" >> data.csv echo "18,C" >> data.csv echo "20,C" >> data.csv echo "19,C" >> data.csv # Create an R script for one-way ANOVA cat << 'EOF' > run_anova.R # Load data data <- read.csv("data.csv") # Ensure the grouping variable is a factor data$group <- as.factor(data$group) # Perform one-way ANOVA # The formula 'dependent_variable ~ grouping_factor' anova_result <- aov(measurement ~ group, data = data) # Print summary of ANOVA results summary(anova_result) # Optionally, perform post-hoc tests if ANOVA is significant (e.g., Tukey HSD) # If you have more than two groups and the ANOVA p-value is significant, # you might want to know which specific groups differ. # if (summary(anova_result)[[1]]$`Pr(>F)`[1] < 0.05) { # print("Performing Tukey HSD post-hoc test:") # print(TukeyHSD(anova_result)) # } # Save results to a file (optional) # sink("anova_results.txt") # summary(anova_result) # sink() EOF # Execute the R script Rscript run_anova.R -
7
These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.
DESeq2 (Inferred with models/gemini-2.5-flash) v1.42.0 (Bioconductor 3.18)$ Bash example
# This script demonstrates how to filter differential expression results based on specified p-value and log2 fold change thresholds, typically performed on output from tools like DESeq2. # Install R and DESeq2 if not already installed # conda install -c conda-forge r-base # conda install -c bioconda bioconductor-deseq2 # conda install -c conda-forge r-data.table # For efficient data handling # Create a dummy DESeq2 results file for demonstration purposes. # In a real pipeline, this file would be the output from a DESeq2 analysis. cat <<EOF > deseq2_results.csv gene,baseMean,log2FoldChange,lfcSE,stat,pvalue,padj geneA,100,0.5,0.1,5,0.01,0.05 geneB,200,1.2,0.1,12,0.0001,0.0005 geneC,50,2.5,0.2,10,0.00001,0.00005 geneD,300,-1.5,0.15,-10,0.00005,0.0001 geneE,150,0.8,0.1,8,0.005,0.01 geneF,80,-0.7,0.05,-14,0.000001,0.000005 geneG,250,1.8,0.1,15,0.002,0.005 EOF # R script to load DESeq2-like results and apply significance filtering Rscript -e ' library(data.table) # Used for efficient reading/writing of data frames # Load the differential expression results file results_df <- fread("deseq2_results.csv") # Define the significance thresholds as per the description p_value_threshold <- 0.001 log2_diff_threshold <- 1 # Filter for significantly differentially expressed probes/genes # The description implies a p-value, but for differential expression, adjusted p-value (padj) is standard. # We use padj < p_value_threshold and absolute log2FoldChange > log2_diff_threshold. significant_probes <- results_df[ results_df$padj < p_value_threshold & abs(results_df$log2FoldChange) > log2_diff_threshold, ] # Output the list of significant probes to a new CSV file fwrite(significant_probes, "significant_probes.csv") # Print a summary of the filtering process cat(paste0("Total probes analyzed: ", nrow(results_df), "\n")) cat(paste0("Probes considered significant (padj < ", p_value_threshold, " and |log2FoldChange| > ", log2_diff_threshold, "): ", nrow(significant_probes), "\n")) ' # Clean up the dummy results file rm deseq2_results.csv
Raw Source Text
The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization. For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls. Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002). Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes. For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied. One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor. These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.