GSE16689 Processing Pipeline


Publication

A distinct microRNA signature for definitive endoderm derived from human embryonic stem cells.

Stem cells and development (2010) — PMID 19807270

Dataset

GSE16689

MicroRNA expression data from differentiation of human H9 ESCs into definitive endoderm on MEF feeder layers

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization.

    Microarray data processing software (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # This step describes conceptual signal processing for Ambion miRCHIP data.
    # Microarray data processing is typically performed using R packages (e.g., limma, oligo, affy) or specialized commercial software.
    # The following is a conceptual R script outline for such processing, assuming raw data files (e.g., .gpr or similar) and an annotation file.
    
    # Install necessary R packages if not already installed (example for limma and oligo)
    # install.packages("BiocManager")
    # BiocManager::install("limma")
    # BiocManager::install("oligo") # Or 'affy' depending on the specific chip type and raw data format
    
    # R script (conceptual outline for processing Ambion miRCHIP data)
    # Rscript process_mirchip_data.R
    
    # --- Content of process_mirchip_data.R (conceptual) ---
    # library(oligo) # Or limma, affy, depending on the specific data format and processing needs
    # library(limma)
    
    # # Placeholder for raw data files (e.g., scanner output files for Ambion miRCHIP)
    # # raw_data_directory <- "./raw_mirchip_data"
    # # raw_files <- list.files(raw_data_directory, pattern = "\\.gpr$", full.names = TRUE) # Example for GenePix GPR files; adjust to the actual raw format
    
    # # 1. Load raw data and perform probe specific signal detection calls
    # # This step is highly dependent on the raw data format (e.g., .gpr, .txt, .cel)
    # # For Ambion miRCHIP, this might involve reading intensity values and flags.
    # # raw_expression_data <- read.maimages(raw_files, source = "genepix") # Example using limma for two-color arrays
    
    # # 2. Background estimate and correction
    # # bg_corrected_data <- backgroundCorrect(raw_expression_data, method = "normexp") # Example using limma's normexp method
    
    # # 3. Constant variance stabilization
    # # This often involves a transformation like log2 or a more sophisticated method like VSN.
    # # vsn_data <- normalizeVSN(bg_corrected_data) # Example using limma's normalizeVSN(), which wraps the vsn package
    
    # # 4. Array scaling or global normalization
    # # If not already handled by VSN, other normalization methods can be applied.
    # # normalized_expression <- normalizeBetweenArrays(vsn_data, method = "quantile") # Example using limma with quantile normalization
    
    # # 5. Summarize probe signals to miRNA level (if multiple probes per miRNA)
    # # This requires an annotation file mapping probes to miRNAs.
    # # annotation_file <- "./ambion_mirchip_annotation.txt" # Placeholder for Ambion miRCHIP annotation
    # # mirchip_annotation <- read.delim(annotation_file)
    # # 
    # # # Example: Aggregate probe intensities to miRNA level (e.g., by mean or median)
    # # # processed_mirna_expression <- aggregate(normalized_expression$E, by=list(mirchip_annotation$miRNA_ID), FUN=mean)
    
    # # 6. Save processed data
    # # write.csv(processed_mirna_expression, "processed_ambion_mirchip_expression.csv", row.names = FALSE)
    
  2. 2

    For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls.

    Unknown (Inferred with models/gemini-2.5-flash) vUnknown
    $ Bash example
    # This step describes a background subtraction method commonly used in probe-based assays, such as microarrays.
    # It involves identifying G-C matched anti-genomic control probes, calculating their median signal,
    # and subtracting this median value from the signal of each experimental probe.
    # The specific tool or software used for this operation is not explicitly stated in the description.
    
    # Below is a conceptual representation of how such a process might be executed using a custom script.
    # The actual implementation would depend on the specific data format and programming language (e.g., R, Python).
    
    # Example of a conceptual command using a placeholder script:
    python custom_background_correction.py \
        --probe_signal_file "raw_probe_signals.tsv" \
        --control_probe_file "gc_matched_anti_genomic_controls.tsv" \
        --output_file "background_corrected_signals.tsv" \
        --gc_content_column "GC_Content" \
        --signal_column "Intensity" \
        --method "median_subtraction"
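The median subtraction described in this step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual implementation: the tuple layouts, the function name `background_correct`, and the rule that "G-C matched" means an identical GC count are all assumptions made for the example.

```python
from statistics import median

def background_correct(probes, controls):
    """Subtract from each probe signal the median signal of its GC-matched controls.

    probes:   list of (probe_id, gc_count, signal) tuples (hypothetical layout)
    controls: list of (gc_count, signal) tuples for anti-genomic control probes
    """
    # Group control signals by GC content
    by_gc = {}
    for gc, signal in controls:
        by_gc.setdefault(gc, []).append(signal)
    # The median control signal per GC class serves as the background estimate
    background = {gc: median(signals) for gc, signals in by_gc.items()}

    corrected = {}
    for probe_id, gc, signal in probes:
        if gc not in background:
            raise ValueError(f"no GC-matched controls for {probe_id} (GC={gc})")
        corrected[probe_id] = signal - background[gc]
    return corrected

# Made-up demonstration values
probes = [("miR-A", 10, 150.0), ("miR-B", 12, 95.0)]
controls = [(10, 80.0), (10, 90.0), (10, 100.0), (12, 60.0), (12, 70.0)]
corrected = background_correct(probes, controls)
print(corrected)
```

Raising an error when a probe has no matched controls is a design choice for the sketch; a real pipeline might instead fall back to the nearest GC class.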
    
  3. 3

    Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002).

    vsn (Inferred with models/gemini-2.5-flash) vNot specified
    $ Bash example
    # Install R and Bioconductor if not already present
    # conda install -c conda-forge r-base
    # conda install -c bioconda bioconductor-vsn
    
    # Create an R script to perform VSN normalization
    cat << 'EOF' > normalize_vsn.R
    # Load the vsn package; Biobase supplies the exprs() generic used below
    library(vsn)
    library(Biobase)
    
    # --- Placeholder for loading your raw microarray data ---
    # Replace 'raw_data_matrix.tsv' with your actual input file path.
    # This example assumes a tab-separated file where rows are probes/genes and columns are samples.
    # Adjust the loading method (e.g., read.csv, read.delim) based on your file format.
    # Example: raw_data_matrix <- as.matrix(read.delim("raw_data_matrix.tsv", row.names = 1))
    
    # For demonstration, let's create a dummy matrix representing raw intensity data
    set.seed(123)
    raw_data_matrix <- matrix(rnorm(1000 * 5, mean = 1000, sd = 200), ncol = 5)
    colnames(raw_data_matrix) <- paste0("Sample", 1:5)
    rownames(raw_data_matrix) <- paste0("Probe", 1:1000)
    
    # Perform variance-stabilizing normalization (VSN)
    # vsn2() accepts a numeric matrix (or an ExpressionSet) and returns a fitted 'vsn' object.
    vsn_fit <- vsn2(raw_data_matrix)
    
    # Extract the normalized (glog-transformed) data matrix from the fitted 'vsn' object
    normalized_matrix <- exprs(vsn_fit)
    
    # --- Placeholder for saving the normalized data ---
    # Replace 'normalized_data_vsn.tsv' with your desired output file name.
    write.table(normalized_matrix, "normalized_data_vsn.tsv", sep = "\t", quote = FALSE, row.names = TRUE)
    
    message("VSN normalization complete. Normalized data saved to normalized_data_vsn.tsv")
    EOF
    
    # Execute the R script
    Rscript normalize_vsn.R
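The transformation at the core of the variance stabilization method of Huber et al. (2002) is an arsinh of an affinely rescaled intensity. The Python sketch below shows only that transformation family. The parameter values `a` and `b` are illustrative assumptions; the vsn package estimates them per array from the data, and that estimation is not reproduced here.

```python
import math

def glog(y, a=0.0, b=1.0):
    """arsinh(a + b*y): the transformation family used by VSN."""
    return math.asinh(a + b * y)

# Illustrative parameters only; VSN estimates a and b from the data
a, b = 0.0, 0.01
for y in (-50.0, 0.0, 50.0, 1000.0, 100000.0):
    h = glog(y, a, b)
    # For large intensities arsinh(x) approaches log(2x), so h tracks the log scale
    log_approx = math.log(2 * b * y) if y > 0 else float("nan")
    print(f"y={y:>9.1f}  glog={h:8.4f}  log(2*b*y)={log_approx:8.4f}")
```

Unlike a plain logarithm, the transform stays defined for zero and negative background-corrected intensities. This is one reason it suits data that has already had a background estimate subtracted.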
  4. 4

    Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.

    Custom script (R/Python) (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # Example R script to perform a Wilcoxon rank-sum test
    # This script assumes two input files:
    # 1. miRNA_probe_signals.tsv: Tab-separated file with probe_id and signal for miRNA probes
    # 2. gc_matched_anti_genomic_signals.tsv: Tab-separated file with probe_id and signal for GC-matched anti-genomic probes
    
    # Create dummy input files for demonstration (printf is more portable than echo -e)
    printf 'probe1\t100\nprobe2\t120\nprobe3\t90\nprobe4\t110\n' > miRNA_probe_signals.tsv
    printf 'probeA\t80\nprobeB\t95\nprobeC\t70\nprobeD\t85\nprobeE\t100\n' > gc_matched_anti_genomic_signals.tsv
    
    # Write the R script with a quoted heredoc so the shell leaves the double quotes and $ signs untouched
    cat << 'EOF' > wilcoxon_detection.R
    # Load data
    miRNA_signals <- read.delim("miRNA_probe_signals.tsv", header=FALSE, col.names=c("probe_id", "signal"))
    anti_genomic_signals <- read.delim("gc_matched_anti_genomic_signals.tsv", header=FALSE, col.names=c("probe_id", "signal"))
    
    # Perform Wilcoxon rank-sum test
    # The alternative hypothesis is that the true location shift is not equal to 0 (two-sided test)
    # If a specific direction is expected (e.g., miRNA signals are higher), 'greater' or 'less' can be used.
    wilcox_test_result <- wilcox.test(miRNA_signals$signal, anti_genomic_signals$signal, alternative = "two.sided")
    
    # Print results
    cat("Wilcoxon Rank-Sum Test Results:\n")
    print(wilcox_test_result)
    
    # Optionally, save results to a file
    sink("wilcoxon_test_results.txt")
    print(wilcox_test_result)
    sink()
    EOF
    
    # Execute the R script
    Rscript wilcoxon_detection.R
    
    # Clean up dummy files and the generated script
    rm miRNA_probe_signals.tsv gc_matched_anti_genomic_signals.tsv wilcoxon_detection.R
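A per-probe detection call of the kind described in this step could look like the following Python sketch. It rests on several assumptions that the source does not confirm: that each miRNA probe has several replicate signals, that the test is one-sided (signal above the control distribution), and that the cutoff `ALPHA` is 0.05. The p-value uses the normal approximation without a tie correction, so it is approximate.

```python
import math

def rank_sum_p_greater(probe_signals, control_signals):
    """One-sided Wilcoxon rank-sum (Mann-Whitney U) p-value, normal approximation.

    Tests whether probe_signals tend to exceed control_signals. No tie
    correction is applied to the variance, so p-values are approximate.
    """
    m, n = len(probe_signals), len(control_signals)
    # U counts (probe, control) pairs where the probe signal is larger; ties count 0.5
    u = sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in probe_signals
        for y in control_signals
    )
    mean_u = m * n / 2.0
    sd_u = math.sqrt(m * n * (m + n + 1) / 12.0)
    z = (u - mean_u - 0.5) / sd_u  # 0.5 is the continuity correction
    p = 0.5 * math.erfc(z / math.sqrt(2.0))  # upper-tail normal probability
    return u, p

# Hypothetical replicate signals for one miRNA probe and its GC-matched controls
probe = [100.0, 120.0, 90.0, 110.0]
controls = [80.0, 95.0, 70.0, 85.0, 100.0]
u_stat, p_value = rank_sum_p_greater(probe, controls)
ALPHA = 0.05  # assumed cutoff; the source does not state one
print(f"U={u_stat}, p={p_value:.4f}, detected={p_value < ALPHA}")
```

In practice this function would be called once per miRNA probe, each time against the controls matched to that probe's GC content. For small samples an exact test from a statistics library is preferable to the normal approximation.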
  5. 5

    For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.

    scipy.stats (Inferred with models/gemini-2.5-flash) v1.11.x GitHub
    $ Bash example
    # Install Python and SciPy if not already available
    # conda create -n ttest_env python=3.9 scipy numpy
    # conda activate ttest_env
    
    # Example data files (replace with actual data paths in a real pipeline)
    # For demonstration, let's create dummy data files
    # In a real pipeline, these would be generated by upstream steps.
    # Each file contains numerical values, one per line, representing a sample.
    
    echo "10.1" > sample1_data.txt
    echo "10.5" >> sample1_data.txt
    echo "9.8" >> sample1_data.txt
    echo "11.2" >> sample1_data.txt
    echo "10.3" >> sample1_data.txt
    
    echo "12.0" > sample2_data.txt
    echo "11.5" >> sample2_data.txt
    echo "12.8" >> sample2_data.txt
    echo "11.9" >> sample2_data.txt
    echo "12.2" >> sample2_data.txt
    
    # Python script to perform the two-sample t-test with equal variance
    python -c "
    import numpy as np
    from scipy import stats
    
    # Load data from files
    sample1 = np.loadtxt('sample1_data.txt')
    sample2 = np.loadtxt('sample2_data.txt')
    
    # Perform two-sample t-test assuming equal variance
    # equal_var=True corresponds to the assumption of equal variance (Student's t-test)
    t_statistic, p_value = stats.ttest_ind(sample1, sample2, equal_var=True)
    
    print(f'T-statistic: {t_statistic}')
    print(f'P-value: {p_value}')
    "
  6. 6

    One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.

    R (Inferred with models/gemini-2.5-flash) v4.3.2 GitHub
    $ Bash example
    # Install R if not already installed (example for Ubuntu/Debian):
    # sudo apt update
    # sudo apt install r-base
    
    # Create a dummy R script for one-way ANOVA
    cat << 'EOF' > run_anova.R
    # Simulate data for demonstration purposes
    # In a real scenario, 'my_data.csv' would be loaded here.
    set.seed(123)
    
    # Group 1: Control
    group1 <- rnorm(30, mean = 10, sd = 2)
    # Group 2: Treatment A
    group2 <- rnorm(30, mean = 12, sd = 2)
    # Group 3: Treatment B
    group3 <- rnorm(30, mean = 10.5, sd = 2)
    
    # Combine into a data frame
    my_data <- data.frame(
      value = c(group1, group2, group3),
      group = factor(c(rep("Control", 30), rep("TreatmentA", 30), rep("TreatmentB", 30)))
    )
    
    # Perform one-way ANOVA
    anova_result <- aov(value ~ group, data = my_data)
    
    # Print the summary of the ANOVA results
    print("One-way ANOVA Results:")
    print(summary(anova_result))
    
    # Optionally, perform post-hoc tests if ANOVA is significant (e.g., Tukey HSD)
    # if (summary(anova_result)[[1]][["Pr(>F)"]][1] < 0.05) {
    #   cat("\nPost-hoc Tukey HSD Test:\n")
    #   print(TukeyHSD(anova_result))
    # }
    
    # Save results to a file (optional)
    # sink("anova_results.txt")
    # print(summary(anova_result))
    # sink()
    EOF
    
    # Execute the R script
    Rscript run_anova.R
  7. 7

    These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.

    Custom Script for Differential Expression Filtering (Inferred with models/gemini-2.5-flash) v1.0 GitHub
    $ Bash example
    # Define thresholds for significance
    P_VALUE_THRESHOLD=0.001
    LOG2_FC_THRESHOLD=1
    
    # Input file containing per-probe test results (log2 difference and p-value, e.g., from the t-test or ANOVA steps above)
    # Replace 'differential_expression_results.tsv' with the actual path to your input file.
    INPUT_FILE="differential_expression_results.tsv"
    
    # Output file for significantly differentially expressed probes
    OUTPUT_FILE="significant_probes.tsv"
    
    # Filter significant probes based on p-value and absolute log2 fold change thresholds.
    # Assumes the input file is tab-separated, with log2FoldChange in the 2nd column ($2)
    # and p-value in the 3rd column ($3). The first line is treated as a header.
    awk -v p_thresh="$P_VALUE_THRESHOLD" -v fc_thresh="$LOG2_FC_THRESHOLD" '
    BEGIN { FS="\t"; OFS="\t" }
    NR==1 { print; next } # Print header line
    {
      log2fc = $2;
      pvalue = $3;
      # Check if p-value is below threshold AND absolute log2 fold change is above threshold
      if (pvalue < p_thresh && (log2fc > fc_thresh || log2fc < -fc_thresh)) {
        print # Print the line if criteria are met
      }
    }' "$INPUT_FILE" > "$OUTPUT_FILE"
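The two quantities behind the significance rule, the log2 difference and the equal-variance t statistic, can be sketched for a single probe as follows. This is an illustration under stated assumptions: the function names are hypothetical, the input values are made up and taken to be normalized signals on a log2-like scale, and the threshold is read as applying to the absolute log2 difference. The p-value itself would come from a t distribution in a statistics library and is passed in here rather than computed.

```python
import math
from statistics import mean, variance

def compare_groups(group1, group2):
    """Return (log2 difference, pooled-variance t statistic, degrees of freedom).

    Assumes the inputs are normalized signals already on a log2-like scale.
    """
    n1, n2 = len(group1), len(group2)
    df = n1 + n2 - 2
    diff = mean(group2) - mean(group1)
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / df
    std_err = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return diff, diff / std_err, df

def is_significant(p_value, log2_diff, p_cutoff=0.001, min_abs_diff=1.0):
    """Apply both thresholds; the absolute value treats up- and down-regulation alike."""
    return p_value < p_cutoff and abs(log2_diff) > min_abs_diff

# Made-up signals for one probe in two experimental groups
group1 = [10.1, 10.5, 9.8, 11.2, 10.3]
group2 = [12.0, 11.5, 12.8, 11.9, 12.2]
log2_diff, t_stat, df = compare_groups(group1, group2)
print(f"log2 difference = {log2_diff:.2f}, t = {t_stat:.4f}, df = {df}")
```

Given a p-value for the probe, `is_significant(p_value, log2_diff)` then applies the same rule as the awk filter: both conditions must hold for the probe to be reported.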
Raw Source Text
The signal processing implemented for the Ambion miRCHIP is a multi-step process involving probe specific signal detection calls, background estimate and correction, constant variance stabilization and either array scaling or global normalization. For each probe, an estimated background value is subtracted that is derived from the median signal of a set of G-C matched anti-genomic controls. Arrays within a specific analysis experiment were normalized together according to the variance stabilization methods described by Huber et al. (Huber et al., 2002). Detection calls were based on a Wilcoxon rank-sum test of the miRNA probe signal compared to the distribution of signals from GC-content matched anti-genomic probes.  For statistical hypothesis testing, a two-sample t-Test, with assumption of equal variance, was applied.  One-way ANOVA was used for experimental designs with more than two experimental groupings or levels of the same factor.  These tests define which probes are considered to be significantly differentially expressed, or significant, based on a default p-value of 0.001 and log2 difference > 1.