GSE16969 Processing Pipeline — Yeo Lab Publications

Publication

Genomic analysis of the molecular neuropathology of tuberous sclerosis using a human stem cell model.

Genome Medicine (2016), PMID 27655340

Dataset

Gene expression analysis of TSC-tubers reveals increased expression of adhesion and inflammatory factors

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

1

The data were analyzed with MicroArray Suite version 5.0 (MAS 5.0) using Affymetrix default analysis settings, with Robust Multi-Array Average (RMA) analysis as the normalization method.

MicroArray Suite (MAS) v5.0

$ Bash example

# MAS 5.0 (MicroArray Suite) is a proprietary, GUI-based software by Affymetrix.
# The following R script demonstrates how to perform Robust Multi-Array Average (RMA)
# normalization, which is a standard method for Affymetrix data analysis, often
# performed using Bioconductor packages on raw .CEL files.

# Install R and Bioconductor if not already present
# sudo apt-get update && sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("affy")'

# Create an R script to perform RMA
cat << 'EOF' > run_rma.R
library(affy)

# Directory containing the .CEL files, taken from the CEL_FILES_DIR environment
# variable; falls back to the current directory if the variable is unset
cel_files_dir <- Sys.getenv("CEL_FILES_DIR", ".")
cel_files <- list.files(path = cel_files_dir, pattern = "\\.CEL$", full.names = TRUE, ignore.case = TRUE)

if (length(cel_files) == 0) {
  stop("No .CEL files found in the specified directory: ", cel_files_dir)
}

# Read the AffyBatch object from CEL files
raw_data <- ReadAffy(filenames = cel_files)

# Perform RMA: background correction, quantile normalization, and
# median-polish summarization; the resulting values are on the log2 scale
rma_data <- rma(raw_data)

# Extract expression matrix
expression_matrix <- exprs(rma_data)

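# Write normalized expression matrix to a file
# col.names = NA adds an empty header cell above the probe ID column so that
# the header and the data rows have the same number of fields
output_file <- Sys.getenv("OUTPUT_FILE", "rma_normalized_expression.tsv")
write.table(expression_matrix, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)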

message(paste("RMA normalized expression matrix written to:", output_file))
EOF

# Execute the R script
# Set the environment variable for the directory containing your .CEL files
# Example: export CEL_FILES_DIR="./data/affymetrix_cels"
export CEL_FILES_DIR="/path/to/your/cel_files"
export OUTPUT_FILE="rma_normalized_expression.tsv"
Rscript run_rma.R
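
MAS 5.0 itself cannot be scripted, but the Bioconductor affy package provides mas5(), an open reimplementation of the MAS 5.0 signal algorithm. The sketch below shows how that function might be used alongside the RMA script above. It is an approximation under stated assumptions, not the authors' procedure: the file names and environment variables are illustrative, and sc = 100 is chosen to mirror the target intensity of 100 reported for this dataset.

```shell
#!/usr/bin/env bash
# Sketch only: mas5() from the Bioconductor affy package reimplements the
# MAS 5.0 algorithm; it is not the Affymetrix software itself.
cat << 'EOF' > run_mas5.R
library(affy)

# CEL_FILES_DIR and OUTPUT_FILE are illustrative environment variables
cel_dir <- Sys.getenv("CEL_FILES_DIR", ".")
cel_files <- list.files(cel_dir, pattern = "\\.CEL$", full.names = TRUE, ignore.case = TRUE)
if (length(cel_files) == 0) stop("No .CEL files found in: ", cel_dir)

raw_data <- ReadAffy(filenames = cel_files)

# sc = 100 sets the per-array scaling target (assumed to correspond to the
# target intensity of 100); mas5() returns values on the linear scale
mas5_data <- mas5(raw_data, sc = 100)

write.table(exprs(mas5_data),
            file = Sys.getenv("OUTPUT_FILE", "mas5_tgt100_signal.tsv"),
            sep = "\t", quote = FALSE, col.names = NA)
EOF

# Run only when R and a CEL directory are actually available
if command -v Rscript > /dev/null 2>&1 && [ -d "${CEL_FILES_DIR:-}" ]; then
  Rscript run_mas5.R
else
  echo "Wrote run_mas5.R; set CEL_FILES_DIR and install R with affy to run it."
fi
```

Unlike rma(), the output here is not log-transformed, so it should be log2-transformed separately before being compared with RMA values.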

2

The trimmed mean target intensity of each array was set arbitrarily to 100.

R v4.x (inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and necessary packages if not already installed.
# For example, using conda with the conda-forge channel:
# conda install -c conda-forge r-base

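# Create an R script to perform the trimmed mean scaling normalization.
# This script assumes input data is a tab-separated file with identifiers
# (e.g., Probe IDs) in the first column and array intensities in subsequent columns.
# Intensities are assumed to be on the linear scale (e.g., MAS 5.0 signal values);
# log2-scale values such as RMA output are not suited to this multiplicative scaling.
# Replace 'input_intensities.tsv' and 'output_normalized_intensities.tsv' with actual file names.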

cat << 'EOF' > normalize_trimmed_mean.R
# R script for trimmed mean target intensity normalization
# This script scales the intensity of each array such that its trimmed mean
# equals a specified target intensity.

# Function to perform trimmed mean scaling normalization
# Args:
#   input_file: Path to the input tab-separated file containing intensities.
#               Assumes first column is identifiers (e.g., ProbeID) and
#               subsequent columns are array intensities.
#   output_file: Path to the output tab-separated file for normalized intensities.
#   target_intensity: The desired trimmed mean intensity for each array (default: 100).
#   trim_fraction: The fraction (0 to 0.5) of observations to be trimmed from each end
#                  when calculating the mean (default: 0.02, i.e., 2% from each end).
normalize_trimmed_mean_scaling <- function(input_file, output_file, target_intensity = 100, trim_fraction = 0.02) {
  # Check if input file exists
  if (!file.exists(input_file)) {
    stop(paste("Error: Input file not found at", input_file))
  }

  # Read the input intensity data
  # Assuming the first column is probe IDs/identifiers and subsequent columns are array intensities
  # check.names=FALSE to prevent R from modifying column names (e.g., adding X to numeric names)
  data_raw <- read.delim(input_file, header = TRUE, row.names = 1, sep = "\t", check.names = FALSE)

  # Ensure data is numeric for calculations
  data_numeric <- as.matrix(data_raw)
  if (!is.numeric(data_numeric)) {
    stop("Error: Intensity columns must contain numeric values.")
  }

  # Calculate trimmed mean for each array (column)
  # na.rm = TRUE to handle potential missing values
  trimmed_means <- apply(data_numeric, 2, function(x) mean(x, trim = trim_fraction, na.rm = TRUE))

  # Check for any NaN or Inf in trimmed means, which could indicate issues (e.g., all NAs in a column)
  if (any(is.nan(trimmed_means)) || any(is.infinite(trimmed_means))) {
    stop("Error: Trimmed mean calculation resulted in NaN or Inf for some arrays. Check input data.")
  }

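  # Calculate scaling factors
  # A trimmed mean of zero cannot be scaled to the target, so fail loudly
  # instead of silently setting the whole array to zero
  if (any(trimmed_means == 0)) {
    stop("Error: Trimmed mean is zero for some arrays; cannot scale to the target intensity.")
  }
  scaling_factors <- target_intensity / trimmed_means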

  # Apply scaling to each array
  # sweep applies a function (here, multiplication) to the rows or columns of a matrix
  # using a vector of values (scaling_factors)
  normalized_data <- sweep(data_numeric, 2, scaling_factors, "*")

  # Combine with original row names (probe IDs/identifiers)
  normalized_df <- as.data.frame(normalized_data)
  normalized_df <- cbind(Identifier = rownames(data_raw), normalized_df)

  # Write the normalized data to an output file
  write.table(normalized_df, output_file, sep = "\t", row.names = FALSE, quote = FALSE)

  message(paste("Normalization complete. Output written to:", output_file))
  message("\nTrimmed means before scaling:")
  print(trimmed_means)
  message("\nScaling factors applied:")
  print(scaling_factors)
}

# --- Script execution ---
# Define input and output files (PLACEHOLDERS - REPLACE WITH ACTUAL PATHS)
# Example: input_intensities.tsv should contain columns like:
# Identifier    Array1_Intensity    Array2_Intensity
# ProbeA        1200                1500
# ProbeB        800                 950
input_file_path <- "input_intensities.tsv"
output_file_path <- "output_normalized_intensities.tsv"

# Define parameters based on the description
target_intensity_val <- 100
trim_fraction_val <- 0.02 # This is an inferred parameter (2% from each end, total 4% trimmed)

# Run the normalization function
normalize_trimmed_mean_scaling(
  input_file = input_file_path,
  output_file = output_file_path,
  target_intensity = target_intensity_val,
  trim_fraction = trim_fraction_val
)
EOF

# Execute the R script
# Ensure 'input_intensities.tsv' exists in the current directory or provide full path
# and that R is installed and in your PATH.
Rscript normalize_trimmed_mean.R
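
The arithmetic behind this step can be checked on a toy column without R. The following is a minimal sketch, assuming linear-scale values supplied one per line. scaling_factor is a hypothetical helper, and the 10% trim in the example is chosen only to keep the numbers easy to follow (the script above infers 2%).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the factor that scales a column of linear-scale
# signals (one value per line on stdin) so that its trimmed mean equals TARGET.
# TRIM is the fraction dropped from each end, mirroring R's mean(x, trim = ...).
scaling_factor() {
  local target="${1:-100}" trim="${2:-0.02}"
  LC_ALL=C sort -g | awk -v target="$target" -v trim="$trim" '
    { v[NR] = $1 }
    END {
      k = int(NR * trim)                      # values dropped from each end
      if (NR == 0 || 2 * k >= NR) {
        print "not enough values for this trim fraction" > "/dev/stderr"
        exit 1
      }
      for (i = k + 1; i <= NR - k; i++) { sum += v[i]; n++ }
      tm = sum / n                            # trimmed mean of the column
      if (tm == 0) { print "trimmed mean is zero" > "/dev/stderr"; exit 1 }
      printf "%.6g\n", target / tm
    }'
}

# Ten toy signals; with trim = 0.1 the lowest (10) and highest (1000) are
# dropped, the remaining eight average 50, and the factor is 100 / 50.
printf '%s\n' 10 40 45 50 50 50 55 55 55 1000 | scaling_factor 100 0.1   # prints 2
```

Multiplying every value in the column by the printed factor brings the array's trimmed mean to the target, which is what the sweep() call in the R script does for all arrays at once.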

Tools Used

MicroArray Suite (MAS) v5.0