GSE16689 Processing Pipeline
Publication
A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells. Stem Cells and Development (2010), PMID 19807270
Dataset
GSE16689: MicroRNA expression data from differentiation of human H9 ESCs into definitive endoderm on MEF feeder layers
Processing Steps
1
The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization.
$ Bash example
# This step describes conceptual signal processing for Ambion miRCHIP data.
# Microarray data processing is typically performed using R packages (e.g., limma, oligo, affy)
# or specialized commercial software.
# The following is a conceptual R script outline for such processing, assuming raw data files
# (e.g., .gpr or similar) and an annotation file.

# Install necessary R packages if not already installed (example for limma and oligo)
# install.packages("BiocManager")
# BiocManager::install("limma")
# BiocManager::install("oligo")  # Or 'affy' depending on the specific chip type and raw data format

# R script (conceptual outline for processing Ambion miRCHIP data)
# Rscript process_mirchip_data.R

# --- Content of process_mirchip_data.R (conceptual) ---
# library(oligo)  # Or limma, affy, depending on the specific data format and processing needs
# library(limma)

# # Placeholder for raw data files (e.g., scanner output files for Ambion miRCHIP)
# # raw_data_directory <- "./raw_mirchip_data"
# # raw_files <- list.files(raw_data_directory, pattern = "\\.gpr$", full.names = TRUE)  # Example for GenePix GPR files

# # 1. Load raw data and perform probe specific signal detection calls
# # This step is highly dependent on the raw data format (e.g., .gpr, .txt, .cel)
# # For Ambion miRCHIP, this might involve reading intensity values and flags.
# # raw_expression_data <- read.maimages(raw_files, source = "genepix")  # Example using limma for two-color arrays

# # 2. Background estimate and correction
# # bg_corrected_data <- backgroundCorrect(raw_expression_data, method = "normexp")  # Example using limma's normexp method

# # 3. Constant variance stabilization
# # This often involves a transformation like log2 or a more sophisticated method like VSN.
# # vsn_data <- normalizeVSN(bg_corrected_data)  # Example using limma's wrapper around the vsn package

# # 4. Array scaling or global normalization
# # If not already handled by VSN, other normalization methods can be applied.
# # normalized_expression <- normalizeBetweenArrays(vsn_data, method = "quantile")  # Example using limma with quantile normalization

# # 5. Summarize probe signals to miRNA level (if multiple probes per miRNA)
# # This requires an annotation file mapping probes to miRNAs.
# # annotation_file <- "./ambion_mirchip_annotation.txt"  # Placeholder for Ambion miRCHIP annotation
# # mirchip_annotation <- read.delim(annotation_file)
# #
# # # Example: Aggregate probe intensities to miRNA level (e.g., by mean or median)
# # # processed_mirna_expression <- aggregate(normalized_expression$E, by=list(mirchip_annotation$miRNA_ID), FUN=mean)

# # 6. Save processed data
# # write.csv(processed_mirna_expression, "processed_ambion_mirchip_expression.csv", row.names = FALSE)
2
For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls.
Unknown (Inferred with models/gemini-2.5-flash) vUnknown
$ Bash example
# This step describes a background subtraction method commonly used in probe-based assays, such as microarrays.
# It involves identifying G-C matched anti-genomic control probes, calculating their median signal,
# and subtracting this median value from the signal of each experimental probe.
# The specific tool or software used for this operation is not explicitly stated in the description.
# Below is a conceptual representation of how such a process might be executed using a custom script.
# The actual implementation would depend on the specific data format and programming language (e.g., R, Python).

# Example of a conceptual command using a placeholder script:
python custom_background_correction.py \
  --probe_signal_file "raw_probe_signals.tsv" \
  --control_probe_file "gc_matched_anti_genomic_controls.tsv" \
  --output_file "background_corrected_signals.tsv" \
  --gc_content_column "GC_Content" \
  --signal_column "Intensity" \
  --method "median_subtraction"
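The placeholder script above is not part of the dataset record. As a rough illustration of what GC-matched median subtraction could look like, here is a minimal Python sketch. The tuple layout, the probe names, and the choice to skip probes that have no matching GC bin are all assumptions made for the example, not details from the source.

```python
from statistics import median


def background_correct(probes, controls):
    """Subtract the median signal of GC-matched controls from each probe.

    probes:   list of (probe_id, gc_count, signal) tuples (hypothetical layout)
    controls: list of (gc_count, signal) tuples for anti-genomic control probes
    """
    # Group control signals by GC content so each probe is matched to its own bin.
    by_gc = {}
    for gc, signal in controls:
        by_gc.setdefault(gc, []).append(signal)
    background = {gc: median(values) for gc, values in by_gc.items()}

    corrected = {}
    for probe_id, gc, signal in probes:
        # Assumption: probes without a matching GC bin are skipped, not guessed at.
        if gc in background:
            corrected[probe_id] = signal - background[gc]
    return corrected


# Made-up signals purely for illustration.
probes = [("miR-a", 10, 250.0), ("miR-b", 12, 90.0)]
controls = [(10, 40.0), (10, 60.0), (10, 50.0), (12, 70.0), (12, 80.0)]
corrected = background_correct(probes, controls)
print(corrected)
```

Each probe is corrected against the controls that share its GC content rather than against one array-wide background value.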
3
Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002).
vsn (Inferred with models/gemini-2.5-flash) vNot specified
$ Bash example
# Install R and Bioconductor if not already present
# conda install -c conda-forge r-base
# conda install -c bioconda bioconductor-vsn

# Create an R script to perform VSN normalization
cat << 'EOF' > normalize_vsn.R
# Load the vsn package and Biobase (Biobase provides the exprs() generic)
library(vsn)
library(Biobase)

# --- Placeholder for loading your raw microarray data ---
# Replace 'raw_data_matrix.tsv' with your actual input file path.
# This example assumes a tab-separated file where rows are probes/genes and columns are samples.
# Adjust the loading method (e.g., read.csv, read.delim) based on your file format.
# Example: raw_data_matrix <- as.matrix(read.delim("raw_data_matrix.tsv", row.names = 1))

# For demonstration, create a dummy matrix representing raw intensity data
set.seed(123)
raw_data_matrix <- matrix(rnorm(1000 * 5, mean = 1000, sd = 200), ncol = 5)
colnames(raw_data_matrix) <- paste0("Sample", 1:5)
rownames(raw_data_matrix) <- paste0("Probe", 1:1000)

# Perform Variance Stabilization Normalization (VSN)
# vsn2 accepts a matrix or an ExpressionSet; for matrix input it returns a fitted "vsn" object.
vsn_fit <- vsn2(raw_data_matrix)

# Extract the normalized data matrix from the fitted object
normalized_matrix <- exprs(vsn_fit)

# --- Placeholder for saving the normalized data ---
# Replace 'normalized_data_vsn.tsv' with your desired output file name.
write.table(normalized_matrix, "normalized_data_vsn.tsv", sep = "\t", quote = FALSE, row.names = TRUE)

message("VSN normalization complete. Normalized data saved to normalized_data_vsn.tsv")
EOF

# Execute the R script
Rscript normalize_vsn.R
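The variance stabilization of Huber et al. combines a per-array affine calibration with an arsinh (generalized log) transform. The Python sketch below shows only the shape of that transform. The `offset` and `scale` values are hypothetical hand-picked numbers, whereas the vsn package estimates the calibration parameters from the data, so this is an illustration of the idea and not a substitute for `vsn2`.

```python
import math


def glog(y, offset=0.0, scale=1.0):
    """Affine calibration followed by arsinh; offset and scale are hypothetical here."""
    return math.asinh(offset + scale * y)


# For large intensities arsinh(x) approaches log(2x), so the transform acts like a log.
high = glog(10000.0)
# Near zero it is roughly linear, and it stays defined for negative
# background-corrected values, where a plain log would fail.
small = glog(0.001)
negative = glog(-5.0)
print(high, math.log(2 * 10000.0), small, negative)
```

This is why background-subtracted signals that fall at or below zero can still be carried through normalization.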
4
Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.
$ Bash example
# Example R script to perform a Wilcoxon rank-sum test
# This script assumes two input files:
# 1. miRNA_probe_signals.tsv: Tab-separated file with probe_id and signal for miRNA probes
# 2. gc_matched_anti_genomic_signals.tsv: Tab-separated file with probe_id and signal for GC-matched anti-genomic probes

# Create dummy input files for demonstration
printf 'probe1\t100\nprobe2\t120\nprobe3\t90\nprobe4\t110\n' > miRNA_probe_signals.tsv
printf 'probeA\t80\nprobeB\t95\nprobeC\t70\nprobeD\t85\nprobeE\t100\n' > gc_matched_anti_genomic_signals.tsv

# Write the R script to a file. The quoted 'EOF' stops the shell from expanding $signal.
cat << 'EOF' > wilcoxon_detection.R
# Load data
miRNA_signals <- read.delim("miRNA_probe_signals.tsv", header=FALSE, col.names=c("probe_id", "signal"))
anti_genomic_signals <- read.delim("gc_matched_anti_genomic_signals.tsv", header=FALSE, col.names=c("probe_id", "signal"))

# Perform Wilcoxon rank-sum test
# Note: this pools all miRNA probes into one test; per-probe detection calls would
# need replicate measurements for each probe.
# The alternative hypothesis is that the true location shift is not equal to 0 (two-sided test)
# If a specific direction is expected (e.g., miRNA signals are higher), 'greater' or 'less' can be used.
wilcox_test_result <- wilcox.test(miRNA_signals$signal, anti_genomic_signals$signal, alternative = "two.sided")

# Print results
cat("Wilcoxon Rank-Sum Test Results:\n")
print(wilcox_test_result)

# Optionally, save results to a file
sink("wilcoxon_test_results.txt")
print(wilcox_test_result)
sink()
EOF

# Execute the R script
Rscript wilcoxon_detection.R

# Clean up dummy files
rm miRNA_probe_signals.tsv gc_matched_anti_genomic_signals.tsv
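A per-probe detection call could be sketched in Python as below, comparing the replicate signals of one probe against the GC-matched control distribution. Several things here are assumptions, because the source does not specify them: the one-sided alternative, the normal approximation without a tie correction, the replicate layout, and the 0.05 cutoff used in the example.

```python
from statistics import NormalDist


def rank_sum_p_greater(sample, reference):
    """One-sided Wilcoxon rank-sum p-value (normal approximation, no tie correction
    of the variance) for 'sample tends to be larger than reference'."""
    n1, n2 = len(sample), len(reference)
    values = list(sample) + list(reference)
    order = sorted(range(n1 + n2), key=lambda idx: values[idx])

    # Assign midranks so that tied values share the average of their ranks.
    ranks = [0.0] * (n1 + n2)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1

    rank_sum = sum(ranks[:n1])               # rank sum of the probe replicates
    u = rank_sum - n1 * (n1 + 1) / 2         # Mann-Whitney U statistic
    mean_u = n1 * n2 / 2
    sd_u = (n1 * n2 * (n1 + n2 + 1) / 12) ** 0.5
    z = (u - mean_u) / sd_u
    return 1.0 - NormalDist().cdf(z)


# Made-up signals: replicate spots for two probes and a set of GC-matched controls.
controls = [80.0, 95.0, 70.0, 85.0, 100.0, 90.0, 75.0, 110.0]
p_expressed = rank_sum_p_greater([310.0, 295.0, 330.0, 305.0], controls)
p_background = rank_sum_p_greater([77.0, 82.0, 91.0, 98.0], controls)

DETECTION_CUTOFF = 0.05  # hypothetical threshold, not stated in the source
print(p_expressed < DETECTION_CUTOFF, p_background < DETECTION_CUTOFF)
```

With samples this small an exact test would be preferable. The normal approximation is used only to keep the sketch free of dependencies.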
5
For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.
$ Bash example
# Install Python and SciPy if not already available
# conda create -n ttest_env python=3.9 scipy numpy
# conda activate ttest_env

# Example data files (replace with actual data paths in a real pipeline)
# For demonstration, let's create dummy data files
# In a real pipeline, these would be generated by upstream steps.
# Each file contains numerical values, one per line, representing a sample.
echo "10.1" > sample1_data.txt
echo "10.5" >> sample1_data.txt
echo "9.8" >> sample1_data.txt
echo "11.2" >> sample1_data.txt
echo "10.3" >> sample1_data.txt

echo "12.0" > sample2_data.txt
echo "11.5" >> sample2_data.txt
echo "12.8" >> sample2_data.txt
echo "11.9" >> sample2_data.txt
echo "12.2" >> sample2_data.txt

# Python script to perform the two-sample t-test with equal variance
python -c "
import numpy as np
from scipy import stats

# Load data from files
sample1 = np.loadtxt('sample1_data.txt')
sample2 = np.loadtxt('sample2_data.txt')

# Perform two-sample t-test assuming equal variance
# equal_var=True corresponds to the assumption of equal variance (Student's t-test)
t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)

print(f'T-statistic: {t_statistic}')
print(f'P-value: {p_value}')
"
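To show what the equal-variance assumption changes, the pooled-variance t statistic can be written out by hand. The Python sketch below uses only the standard library and the same dummy values as the example above. It stops at the statistic and its degrees of freedom, because turning these into a p-value needs a t-distribution CDF, such as the one SciPy provides.

```python
from statistics import mean, variance


def pooled_t(a, b):
    """Student's two-sample t statistic with a pooled (equal) variance estimate."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    # Pool the two sample variances, weighting each by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    standard_error = (pooled_var * (1 / n1 + 1 / n2)) ** 0.5
    return (mean(a) - mean(b)) / standard_error, df


sample1 = [10.1, 10.5, 9.8, 11.2, 10.3]
sample2 = [12.0, 11.5, 12.8, 11.9, 12.2]
t_statistic, degrees_of_freedom = pooled_t(sample1, sample2)
print(t_statistic, degrees_of_freedom)
```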
6
One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.
$ Bash example
# Install R if not already installed (example for Ubuntu/Debian):
# sudo apt update
# sudo apt install r-base

# Create a dummy R script for one-way ANOVA
cat << 'EOF' > run_anova.R
# Simulate data for demonstration purposes
# In a real scenario, 'my_data.csv' would be loaded here.
set.seed(123)
# Group 1: Control
group1 <- rnorm(30, mean = 10, sd = 2)
# Group 2: Treatment A
group2 <- rnorm(30, mean = 12, sd = 2)
# Group 3: Treatment B
group3 <- rnorm(30, mean = 10.5, sd = 2)

# Combine into a data frame
my_data <- data.frame(
  value = c(group1, group2, group3),
  group = factor(c(rep("Control", 30), rep("TreatmentA", 30), rep("TreatmentB", 30)))
)

# Perform one-way ANOVA
anova_result <- aov(value ~ group, data = my_data)

# Print the summary of the ANOVA results
print("One-way ANOVA Results:")
print(summary(anova_result))

# Optionally, perform post-hoc tests if ANOVA is significant (e.g., Tukey HSD)
# if (summary(anova_result)[[1]][["Pr(>F)"]][1] < 0.05) {
#   print("\nPost-hoc Tukey HSD Test:")
#   print(TukeyHSD(anova_result))
# }

# Save results to a file (optional)
# sink("anova_results.txt")
# print(summary(anova_result))
# sink()
EOF

# Execute the R script
Rscript run_anova.R
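The F statistic that `aov` reports can be traced by hand for a small case. The Python sketch below computes the between-group and within-group mean squares for three made-up groups. It illustrates the arithmetic only and is not the pipeline's implementation, and it omits the p-value because that needs an F-distribution CDF.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way layout."""
    all_values = [value for group in groups for value in group]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean.
    ss_within = sum((value - mean(g)) ** 2 for g in groups for value in g)

    df_between, df_within = k - 1, n - k
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Made-up values for three experimental groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f_statistic, df_between, df_within = one_way_anova_f(groups)
print(f_statistic, df_between, df_within)
```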
7
These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.
Custom Script for Differential Expression Filtering (Inferred with models/gemini-2.5-flash) v1.0
$ Bash example
# Define thresholds for significance
P_VALUE_THRESHOLD=0.001
LOG2_FC_THRESHOLD=1

# Input file containing differential expression results (e.g., from the t-test or ANOVA steps above)
# Replace 'differential_expression_results.tsv' with the actual path to your input file.
INPUT_FILE="differential_expression_results.tsv"

# Output file for significantly differentially expressed probes
OUTPUT_FILE="significant_probes.tsv"

# Filter significant probes based on p-value and absolute log2 difference thresholds.
# Assumes the input file is tab-separated, with the log2 difference in the 2nd column ($2)
# and p-value in the 3rd column ($3). The first line is treated as a header.
awk -v p_thresh="$P_VALUE_THRESHOLD" -v fc_thresh="$LOG2_FC_THRESHOLD" '
BEGIN { FS="\t"; OFS="\t" }
NR==1 { print; next } # Print header line
{
  log2fc = $2;
  pvalue = $3;
  # Check if p-value is below threshold AND absolute log2 difference is above threshold
  if (pvalue < p_thresh && (log2fc > fc_thresh || log2fc < -fc_thresh)) {
    print # Print the line if criteria are met
  }
}' "$INPUT_FILE" > "$OUTPUT_FILE"
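The same decision rule can be stated as a small Python predicate, which makes the two cutoffs explicit. The source says "log2 difference > 1" without stating a direction, so treating it as an absolute difference is an assumption, as are the probe names and values used here.

```python
def is_significant(p_value, log2_diff, p_cutoff=0.001, diff_cutoff=1.0):
    """Apply the default cutoffs: p < 0.001 and |log2 difference| > 1 (absolute value assumed)."""
    return p_value < p_cutoff and abs(log2_diff) > diff_cutoff


# Made-up (probe_id, log2 difference, p-value) rows for illustration.
results = [
    ("probe1", 2.3, 0.0002),   # passes both cutoffs
    ("probe2", 0.4, 0.0001),   # small p-value, but the difference is too small
    ("probe3", -1.8, 0.0005),  # passes both cutoffs in the negative direction
    ("probe4", 3.0, 0.02),     # large difference, but p-value is too high
]
significant = [probe_id for probe_id, diff, p in results if is_significant(p, diff)]
print(significant)
```

Both conditions must hold, so a probe with a large difference but a weak p-value, or the reverse, is not called significant.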
Raw Source Text
The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization. For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls. Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002). Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes. For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied. One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor. These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.