GSE14555 Processing Pipeline


Publication

Zmat3 Is a Key Splicing Regulator in the p53 Tumor Suppression Program.

Molecular Cell (2020), PMID 33157015

Dataset

Divergent Transcriptomic Responses to Aryl Hydrocarbon Receptor Agonists Between Rat and Human Primary Hepatocytes

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


All .CEL files (within each species) were pre-processed using the default settings of the justGCRMA function of gcrma package version 2.8.0 (Wu et al., 2004) as implemented in R.

R, gcrma v2.8.0

$ Bash example

# Install BiocManager and gcrma if not already installed (run these lines inside R):
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("gcrma")  # installs the gcrma release matching your Bioconductor version
#
# Note: gcrma 2.8.0 is a very old version (released with Bioconductor 2.0 / R 2.5).
# Reproducing it exactly would likely require a matching legacy R and Bioconductor
# installation. BiocManager itself needs a much newer R, so a call such as
# BiocManager::install("gcrma", version = "2.0") is unlikely to work.
# The example below assumes gcrma and its dependencies (such as affy) are available;
# results from a current gcrma release may differ from those of version 2.8.0.

# Create an R script to preprocess CEL files
cat << 'EOF' > preprocess_cel_files.R
# Load necessary libraries
# gcrma depends on affy and Biobase
library(affy)
library(gcrma)

# Define the directory containing .CEL files
# The description mentions "within each species", implying this script
# would be run separately for each species' set of CEL files.
input_celfiles_dir <- "." # Current directory, adjust as needed

# List all .CEL files in the specified directory
cel_files <- list.files(path = input_celfiles_dir, pattern = "\\.CEL$", full.names = TRUE, ignore.case = TRUE)

if (length(cel_files) == 0) {
    stop(paste("No .CEL files found in the directory:", input_celfiles_dir))
}

message(paste("Found", length(cel_files), ".CEL files for processing."))

# Pre-process using justGCRMA with default settings.
# justGCRMA reads the CEL files itself (it takes file locations, not an AffyBatch),
# so no separate ReadAffy() call is needed. With no file names given, it processes
# every CEL file found in celfile.path.
# The output is an ExpressionSet object containing the normalized log2 expression matrix.
message("Starting justGCRMA pre-processing with default settings...")
eset <- justGCRMA(celfile.path = input_celfiles_dir)
message("justGCRMA pre-processing complete.")

# Extract the normalized expression matrix from the ExpressionSet object
normalized_expression_matrix <- exprs(eset)

# Define the output file name
output_file <- "normalized_expression_matrix.tsv"

# Save the normalized expression matrix to a tab-separated file
# Row names (probe IDs) and column names (sample IDs) are included.
write.table(normalized_expression_matrix, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("Normalized expression matrix saved to:", output_file))
EOF

# Execute the R script using Rscript
Rscript preprocess_cel_files.R
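
The description says the CEL files were processed "within each species", so the script above would be run once per species. A minimal sketch of how the files might be grouped first, assuming a hypothetical directory layout of `cel_root/<species>/<sample>.CEL` (the helper `group_cel_by_species` and all paths are illustrative, not from the original study):

```r
# Group CEL file paths by species, assuming a hypothetical layout of
# cel_root/<species>/<sample>.CEL (directory names are illustrative).
group_cel_by_species <- function(cel_paths) {
  # The parent directory name of each file is taken as its species label
  species <- basename(dirname(cel_paths))
  split(cel_paths, species)
}

# Example paths (made up for illustration; no files are read here)
cel_paths <- c(
  "cel_root/human/h1.CEL",
  "cel_root/human/h2.CEL",
  "cel_root/rat/r1.CEL"
)
cel_groups <- group_cel_by_species(cel_paths)
print(lengths(cel_groups))

# With gcrma installed and real files in place, each species could then be
# pre-processed on its own, for example:
# esets <- lapply(names(cel_groups), function(sp) {
#   gcrma::justGCRMA(celfile.path = file.path("cel_root", sp))
# })
```

Keeping species separate matters because quantile normalization forces all arrays in one run onto a common intensity distribution, and arrays for different species generally use different chip designs.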


This function background corrects perfect-match probe intensities using probe sequence information, log2-transforms the data, quantile normalizes across the arrays, and summarizes probe intensities via the robust multiarray average (RMA) method (Irizarry et al., 2003) to give an intensity value (log2 scale) for each probe set.

R (affy/oligo package; inferred with models/gemini-2.5-flash), version not specified

$ Bash example

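# This script runs RMA normalization and summarization with the 'affy' R package.
# Note: affy::rma() performs the quantile normalization and RMA summarization described
# above, but its background correction does not use probe sequence information. For the
# sequence-based correction, use the gcrma package (justGCRMA or gcrma) instead.
# Set CEL_FILES_PATH to the directory containing your .CEL files and OUTPUT_FILE to
# the desired output file name (see the export lines below).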

# Install Bioconductor and 'affy' package if not already installed
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("affy")'

# Create an R script to perform RMA
cat << 'EOF' > run_rma.R
library(affy)

# Define the path to your CEL files
cel_files_path <- Sys.getenv("CEL_FILES_PATH", ".") # Default to current directory

# Read CEL files
# This assumes all .CEL files in the specified directory are part of the experiment.
# You might need to filter them if there are other files.
raw_data <- ReadAffy(celfile.path = cel_files_path)

# Perform RMA background correction, normalization, and summarization.
# rma() applies a model-based background correction, quantile normalization, and
# median-polish summarization (Irizarry et al., 2003), returning log2-scale values.
eset <- rma(raw_data)

# Extract expression matrix
expr_matrix <- exprs(eset)

# Define the output file name
output_file <- Sys.getenv("OUTPUT_FILE", "rma_expression.tsv")

# Write results to a tab-separated file
write.table(expr_matrix, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("RMA processed data saved to:", output_file))
EOF

# Set environment variables for the R script
export CEL_FILES_PATH="/path/to/your/cel_files" # IMPORTANT: Replace with your actual CEL file directory
export OUTPUT_FILE="rma_expression.tsv"

# Execute the R script
Rscript run_rma.R

# Clean up the R script
rm run_rma.R
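
To make the normalization and summarization steps in this description more concrete, here is a small base-R sketch on a toy probe set. It is an illustration under simplifying assumptions, not the gcrma or affy implementation: the intensities are simulated, background correction is skipped, ties are broken by order of appearance, and the steps follow the order given in the description (log2, then quantile normalization, then median polish). The function name `quantile_normalize` is hypothetical.

```r
# Toy quantile normalization: give every array (column) the same distribution
# by replacing each value with the mean of the values sharing its rank.
quantile_normalize <- function(x) {
  ranks  <- apply(x, 2, rank, ties.method = "first")  # rank within each array
  sorted <- apply(unname(x), 2, sort)                 # sort each array
  target <- rowMeans(sorted)                          # reference distribution
  normalized <- apply(ranks, 2, function(r) target[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Simulated perfect-match intensities for one probe set: 5 probes x 3 arrays
set.seed(1)
pm <- matrix(
  2^rnorm(15, mean = 8),
  nrow = 5,
  dimnames = list(paste0("probe", 1:5), paste0("array", 1:3))
)

log_pm  <- log2(pm)                     # log2 transform
norm_pm <- quantile_normalize(log_pm)   # quantile normalize across arrays

# Median polish fits probe (row) and array (column) effects; the probe-set
# value for each array is the overall effect plus that array's column effect.
fit <- medpolish(norm_pm, trace.iter = FALSE)
probe_set_summary <- fit$overall + fit$col
print(probe_set_summary)
```

In a real analysis this is done for every probe set on the array, which is why the packaged functions are used instead of code like this.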


