GSE13204 Processing Pipeline — Yeo Lab Publications

Publication

The splicing factor RBM17 drives leukemic stem cell maintenance by evading nonsense-mediated decay of pro-leukemic factors.

Nature Communications (2022), PMID 35781533

Dataset

GSE13204

Microarray Innovations in LEukemia (MILE) study

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

1

Data pre-processing included a summarization and quantile normalization step to generate probe set level signal intensities for each microarray experiment and was performed as previously published by Liu WM et al.

Microarray, version not specified (inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor if not already installed
# R -e 'install.packages("BiocManager")'
# R -e 'BiocManager::install("affy")'

# This script assumes you have raw Affymetrix .CEL files in a specified directory.
# Replace 'path/to/cel_files' with the actual path to your .CEL files.
# The specific CDF (Chip Description File) package for your array type (e.g., hgu133plus2cdf)
# might need to be installed via BiocManager::install("hgu133plus2cdf") if not automatically detected.

Rscript -e '
  library(affy)
  
  # Define the directory containing your .CEL files
  cel_files_dir <- "path/to/cel_files" # <<< REPLACE WITH ACTUAL PATH TO CEL FILES
  
  # Check if the directory exists
  if (!dir.exists(cel_files_dir)) {
    stop(paste("CEL files directory not found:", cel_files_dir))
  }
  
  # List all .CEL files in the specified directory
  cel_files <- list.files(path = cel_files_dir, pattern = "\\.CEL$", full.names = TRUE, ignore.case = TRUE)
  
  if (length(cel_files) == 0) {
    stop(paste("No .CEL files found in:", cel_files_dir))
  }
  
  message(paste("Found", length(cel_files), ".CEL files."))
  
  # Read the .CEL files into an AffyBatch object
  # This step automatically tries to infer the CDF environment package.
  # If it fails, you might need to explicitly install the correct CDF package
  # (e.g., BiocManager::install("hgu133plus2cdf")) and then load it.
  affybatch <- ReadAffy(filenames = cel_files)
  
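  message("Performing RMA pre-processing (background correction, quantile normalization, summarization)...")
  
  # Perform RMA (Robust Multi-array Average) pre-processing:
  # background correction, quantile normalization, and median polish summarization.
  # NOTE: RMA is used here as a widely available stand-in. The study states that
  # pre-processing followed Liu WM et al., so these values (returned on the log2 scale)
  # may differ from the deposited GSE13204 signal intensities.
  eset <- rma(affybatch)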
  
  # Extract probe set level signal intensities
  signal_intensities <- exprs(eset)
  
  # Define output file path
  output_file <- "probe_set_signal_intensities.csv"
  
  # Save the results to a CSV file
  write.csv(signal_intensities, output_file, row.names = TRUE)
  
  message(paste("Probe set level signal intensities saved to:", output_file))
  
  # Optionally, save the ExpressionSet object for further analysis
  # save(eset, file = "rma_eset.RData")
  # message("ExpressionSet object saved to rma_eset.RData")
'
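The quantile normalization named in step 1 can be illustrated with a minimal sketch in plain Python. This is the generic textbook procedure, not the PQN or DQN algorithm of Liu WM et al.; the function name `quantile_normalize` and the toy intensities are assumptions made for illustration, and ties are broken by position rather than averaged.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize equal-length sample columns (ties broken by position)."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must have the same number of probes")
    # For each sample, list probe indices from lowest to highest intensity.
    orders = [sorted(range(n), key=col.__getitem__) for col in columns]
    # Reference distribution: mean across samples of the values at each rank.
    reference = [
        mean(col[order[rank]] for col, order in zip(columns, orders))
        for rank in range(n)
    ]
    # Give each probe the reference value for its rank within its own sample.
    normalized = []
    for order in orders:
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy intensities (hypothetical): one list per sample, one entry per probe.
samples = [
    [100.0, 200.0, 50.0, 150.0],
    [120.0, 210.0, 60.0, 140.0],
    [110.0, 190.0, 55.0, 160.0],
]
normalized = quantile_normalize(samples)
for column in normalized:
    print(column)
```

After normalization every sample holds the same set of values (55, 110, 150, 200 for these toy inputs), arranged according to each probe's original rank within its sample.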

2

PQN and DQN: algorithms for expression microarrays.

R, latest stable version (inferred with models/gemini-2.5-flash)

$ Bash example

# Install R (if not already installed)
# sudo apt-get update
# sudo apt-get install r-base

# Install Bioconductor (if not already installed)
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install()'

# Install necessary R packages (e.g., for data handling and a quantile normalization proxy)
# R -e 'BiocManager::install(c("preprocessCore"))'

# Create a dummy raw microarray data file for demonstration.
# In a real scenario, this would be your actual microarray data (e.g., CEL files or a pre-processed expression matrix).
echo "ProbeID,Sample1,Sample2,Sample3" > raw_microarray_data.csv
echo "GeneA,100,120,110" >> raw_microarray_data.csv
echo "GeneB,200,210,190" >> raw_microarray_data.csv
echo "GeneC,50,60,55" >> raw_microarray_data.csv
echo "GeneD,150,140,160" >> raw_microarray_data.csv

# Create an R script that stands in for PQN and DQN.
# Note: PQN and DQN are the algorithms named in the title of the Liu WM et al. reference cited in step 1.
# This record does not spell out the acronyms or describe the algorithms, and no packaged
# implementation is assumed here. The script below applies standard quantile normalization as a proxy.
cat << 'EOF' > normalize_microarrays.R
# Load necessary libraries
# A faithful PQN/DQN implementation would require custom code written from the original publication.
# We use 'preprocessCore' here as a proxy for general quantile normalization for demonstration purposes.
# preprocessCore is a Bioconductor package, so install it with BiocManager rather than from CRAN.
if (!requireNamespace("preprocessCore", quietly = TRUE)) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager", repos = "https://cloud.r-project.org")
  }
  BiocManager::install("preprocessCore", update = FALSE, ask = FALSE)
}
library(preprocessCore)

# Placeholder for PQN.
# A real implementation would follow the PQN algorithm of Liu WM et al. (J. Theor. Biol. 2006;243:273-278).
# Standard quantile normalization is used here purely as a stand-in, so the output is NOT PQN-normalized data.
pqn_normalize <- function(data_matrix) {
  message("Running standard quantile normalization as a placeholder for PQN...")
  # normalize.quantiles drops dimnames, so restore them afterwards
  normalized_data <- preprocessCore::normalize.quantiles(as.matrix(data_matrix))
  colnames(normalized_data) <- colnames(data_matrix)
  rownames(normalized_data) <- rownames(data_matrix)
  return(normalized_data)
}

# Placeholder for DQN.
# A real implementation would follow the DQN algorithm of Liu WM et al. (J. Theor. Biol. 2006;243:273-278).
# This function is identical to pqn_normalize by construction; it exists only to mark where DQN would go.
dqn_normalize <- function(data_matrix) {
  message("Running standard quantile normalization as a placeholder for DQN...")
  normalized_data <- preprocessCore::normalize.quantiles(as.matrix(data_matrix))
  colnames(normalized_data) <- colnames(data_matrix)
  rownames(normalized_data) <- rownames(data_matrix)
  return(normalized_data)
}

# Load raw microarray expression data
# Assuming data is in a CSV format with ProbeID as the first column
raw_data <- read.csv("raw_microarray_data.csv", row.names = 1)

# Apply the PQN placeholder
normalized_data_pqn <- pqn_normalize(raw_data)
write.csv(normalized_data_pqn, "normalized_microarray_pqn.csv")
message("Quantile-normalized data (PQN placeholder) saved to normalized_microarray_pqn.csv")

# Apply the DQN placeholder
# Both placeholders call the same quantile normalization, so the two output files are identical.
normalized_data_dqn <- dqn_normalize(raw_data)
write.csv(normalized_data_dqn, "normalized_microarray_dqn.csv")
message("Quantile-normalized data (DQN placeholder) saved to normalized_microarray_dqn.csv")

EOF

# Execute the R script
Rscript normalize_microarrays.R

# Clean up dummy data and script (uncomment to enable)
# rm raw_microarray_data.csv normalize_microarrays.R normalized_microarray_pqn.csv normalized_microarray_dqn.csv
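Since the warning at the top asks for validation before use, one simple check on a quantile-normalized matrix is that every sample column contains the same sorted values. The sketch below is a minimal version in plain Python; the function name `columns_share_distribution`, the tolerance, and the inline CSV text (laid out the way R's `write.csv` writes a matrix with row names) are assumptions for illustration. Implementations that average tied values can produce small deviations, so a failed check on real data is a prompt to look closer rather than proof of an error.

```python
import csv
import io
import math


def columns_share_distribution(csv_text, tol=1e-9):
    """Return True if all sample columns hold the same sorted values."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Column 0 holds probe IDs; the remaining columns are samples.
    columns = [
        sorted(float(row[j]) for row in body)
        for j in range(1, len(header))
    ]
    first = columns[0]
    return all(
        math.isclose(a, b, rel_tol=0.0, abs_tol=tol)
        for col in columns[1:]
        for a, b in zip(first, col)
    )


# Hypothetical file contents: the raw toy matrix and a quantile-normalized version.
raw_csv = (
    '"","Sample1","Sample2","Sample3"\n'
    '"GeneA",100,120,110\n'
    '"GeneB",200,210,190\n'
    '"GeneC",50,60,55\n'
    '"GeneD",150,140,160\n'
)
normalized_csv = (
    '"","Sample1","Sample2","Sample3"\n'
    '"GeneA",110,110,110\n'
    '"GeneB",200,200,200\n'
    '"GeneC",55,55,55\n'
    '"GeneD",150,150,150\n'
)

print("raw:", columns_share_distribution(raw_csv))
print("normalized:", columns_share_distribution(normalized_csv))
```

To check a real output file, read its contents into a string and pass that to the function in place of the inline text.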

3

J.Theor.Biol.

(Inferred with models/gemini-2.5-flash)

$ Bash example

# No specific bioinformatics tool or command can be inferred from the description "J.Theor.Biol.".
# The text is a journal abbreviation (Journal of Theoretical Biology) and appears to be a fragment
# of the Liu WM et al. citation from step 1, not an assay, method, or tool.
# No command is needed for this entry.

4

2006;243:273-278.

TileMap v2006

$ Bash example

# Installation of R (if not present)
# sudo apt-get update
# sudo apt-get install r-base

# NOTE: "2006;243:273-278" follows the journal name in step 3 and appears to be the year, volume,
# and pages of the Liu WM et al. reference cited in step 1, not a description of a processing step.
# The TileMap label above was inferred and is probably a mismatch: TileMap (Ji and Wong,
# Bioinformatics, 2005) is a method for tiling-array data such as ChIP-chip, whereas GSE13204
# is a gene expression microarray dataset.
# The command below is therefore a conceptual placeholder only.

# Placeholder for reference genome assembly (e.g., hg38 for human, or specific for other organisms)
REFERENCE_GENOME="hg38"

# Placeholder for input ChIP-chip log ratio data file (e.g., tab-separated values)
# This file would contain probe IDs, genomic coordinates, and log2(ChIP/Input) ratios.
INPUT_LOG_RATIOS="chip_chip_experiment_log_ratios.txt"

# Placeholder for probe annotation file (e.g., BED or custom format)
# This file would contain detailed genomic locations for each probe on the array.
PROBE_ANNOTATION="genome_probes_${REFERENCE_GENOME}.bed"

# Placeholder for output file where identified enriched regions (peaks) will be stored.
OUTPUT_ENRICHED_REGIONS="tilemap_enriched_regions.bed"

# Conceptual execution of a hypothetical R script implementing a TileMap-style analysis.
# The p-value threshold and minimum probes per region below are illustrative placeholders,
# not values taken from any publication.
# The script 'run_tilemap.R' does not exist and would need to be written before this command can run.
Rscript run_tilemap.R \
  --input_log_ratios "${INPUT_LOG_RATIOS}" \
  --probe_annotation "${PROBE_ANNOTATION}" \
  --genome_assembly "${REFERENCE_GENOME}" \
  --output_regions "${OUTPUT_ENRICHED_REGIONS}" \
  --p_value_threshold 0.001 \
  --min_probes_per_region 3

Tools Used

Microarray