GSE37892 Processing Pipeline — Yeo Lab Publications

Publication

DDX5 promotes oncogene C3 and FABP1 expressions and drives intestinal inflammation and tumorigenesis.

Life science alliance (2020) — PMID 32817263

Dataset

A seven-gene signature aggregates a subgroup of stage II colon cancers with stage III.

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

1

CEL files were processed in the R (v. 2.10.0)/Bioconductor (v. 2.5) environment.

R v2.10.0

$ Bash example

# Install R (if not already installed)
# conda install -c conda-forge r-base

# Install Bioconductor packages for CEL file processing (e.g., 'affy' for RMA normalization)
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("affy")'

# Create a placeholder R script for processing CEL files
cat << 'EOF' > process_cel_files.R
# Load necessary libraries
library(affy)

# Define the directory containing CEL files
# Replace "." with the actual path to your CEL files if they are not in the current directory
cel_dir <- "."

# List all CEL files in the specified directory
cel_files <- list.celfiles(path = cel_dir, full.names = TRUE)

# Check if any CEL files were found
if (length(cel_files) == 0) {
  stop("No CEL files found in the specified directory: ", cel_dir)
}

message(paste("Found", length(cel_files), "CEL files."))

# Read CEL files into an AffyBatch object
# This step can be memory intensive depending on the number and size of CEL files
raw_data <- ReadAffy(filenames = cel_files)

# Perform Robust Multi-array Average (RMA) normalization as a generic placeholder:
# RMA does background correction, normalization, and summarization in one call.
# Note: the described pipeline used GCRMA for these steps (see step 3).
normalized_data <- rma(raw_data)

# Extract the expression matrix (log2 transformed and normalized intensities)
expression_matrix <- exprs(normalized_data)

# Save the processed expression matrix to a CSV file
output_csv_file <- "processed_cel_expression.csv"
write.csv(expression_matrix, file = output_csv_file, row.names = TRUE)
message(paste("Processed expression matrix saved to:", output_csv_file))

# Optionally, save the entire ExpressionSet object for further analysis in R
output_rdata_file <- "processed_cel_eset.RData"
save(normalized_data, file = output_rdata_file)
message(paste("Normalized ExpressionSet object saved to:", output_rdata_file))
EOF

# Execute the R script to process CEL files
# Ensure that your CEL files are in the directory specified by 'cel_dir' in the R script
Rscript process_cel_files.R
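
The script above expects the CEL files to already sit in `cel_dir`. GEO typically distributes raw arrays as a single tar archive, so a staging step along the following lines may be useful. This is a minimal sketch, not part of the described pipeline: the function name `stage_cel_files`, the archive name `GSE37892_RAW.tar`, and the target directory `cel_files` are assumptions for illustration. Whether gzipped CEL files must be decompressed first depends on the reader, so that is left out here.

```shell
# Hypothetical helper: unpack a raw-data archive and print the number of CEL files found.
# Usage: stage_cel_files <archive.tar> <target_dir>
stage_cel_files() {
  archive="$1"
  target="$2"
  if [ ! -f "$archive" ]; then
    echo "Archive not found: $archive" >&2
    return 1
  fi
  mkdir -p "$target"
  tar -xf "$archive" -C "$target"
  # Count CEL files case-insensitively, whether gzipped or not
  find "$target" -type f \( -iname '*.cel' -o -iname '*.cel.gz' \) | wc -l | tr -d ' '
}

# Example call (assumed file names; adjust to your download):
# stage_cel_files GSE37892_RAW.tar cel_files
```

If the printed count differs from the number of samples listed for the series, the archive is probably incomplete or contains other file types.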

2

The analysis environment was R (v. 2.10.0) with Bioconductor (v. 2.5).

R v2.10.0, Bioconductor v2.5

$ Bash example

# This step describes the R/Bioconductor environment, not a specific command.
# The analysis was performed in R (v 2.10.0) with Bioconductor (v 2.5).
# No R script or command is given in the description.

# Reproducing it exactly requires R 2.10.0 with Bioconductor 2.5 installed.
# Releases this old can be difficult to install on current systems and may need
# specific system configurations or isolated environments.
# Environment managers such as `conda` or `renv` are the usual choice on modern systems,
# but they may not offer R 2.10.0 or Bioconductor 2.5 because of their age.

# To check which versions are active (assuming R is on the PATH):
# R --version                               # R version
# Rscript -e "packageVersion('Biobase')"    # version of a core Bioconductor package

# If an analysis script were provided, it would typically be run as:
# Rscript your_analysis_script.R arg1 arg2
# Or interactively:
# R
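
Because the description pins specific versions (R 2.10.0, Bioconductor 2.5), it can help to record which R is actually active before running anything. The sketch below parses the first line of `R --version`. The helper names `parse_r_version` and `check_r_version` are hypothetical, and the assumed banner format (`R version X.Y.Z ...`) should be verified on your system.

```shell
EXPECTED_R_VERSION="2.10.0"   # R version stated in the pipeline description

# Extract "X.Y.Z" from a banner line that starts with "R version X.Y.Z"
parse_r_version() {
  printf '%s\n' "$1" | sed -n 's/^R version \([0-9][0-9.]*\).*/\1/p'
}

# Compare a version string against the expected one and report the result
check_r_version() {
  if [ "$1" = "$EXPECTED_R_VERSION" ]; then
    echo "match"
  else
    echo "mismatch: found '${1}', expected '${EXPECTED_R_VERSION}'"
  fi
}

# Only query R if it is on the PATH
if command -v R >/dev/null 2>&1; then
  check_r_version "$(parse_r_version "$(R --version | head -n 1)")"
fi
```

A mismatch does not by itself invalidate a re-analysis, but it is worth noting alongside the results, since normalized values can differ between package versions.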

3

Pre-processing steps (background adjustment, normalization and summarization) were performed with the GCRMA package (v.2.18.1)

GCRMA v2.18.1 GitHub

$ Bash example

# Install R and Bioconductor packages if not already installed (uncomment and run if needed)
# R -e "install.packages('BiocManager', repos = 'https://cloud.r-project.org')"
# R -e "BiocManager::install(c('affy', 'gcrma'))"
# gcrma also needs the CDF and probe-sequence packages for the array type, for example
# hgu133plus2cdf and hgu133plus2probe for HG-U133 Plus 2.0 arrays; replace them with the packages for your chip.
# R -e "BiocManager::install(c('hgu133plus2cdf', 'hgu133plus2probe'))"

# Create a dummy R script to perform GCRMA pre-processing
cat << 'EOF' > run_gcrma_preprocessing.R
# Load necessary libraries
library(affy)
library(gcrma)

# --- Configuration --- #
# Define the directory containing your raw Affymetrix .CEL files
cel_files_directory <- "./path/to/your/cel_files"

# Define the output file name for the normalized expression matrix
output_expression_file <- "gcrma_normalized_expression.txt"
# --- End Configuration --- #

# Check if the CEL files directory exists
if (!dir.exists(cel_files_directory)) {
    stop("CEL files directory not found at ", cel_files_directory)
}

# List all .CEL files in the specified directory
cel_files <- list.files(path = cel_files_directory, pattern = "\\.CEL$", full.names = TRUE, ignore.case = TRUE)

if (length(cel_files) == 0) {
    stop("No .CEL files found in ", cel_files_directory, ". Please ensure files are present and have a .CEL extension.")
}

message(paste("Found", length(cel_files), ".CEL files."))

# Read CEL files into an AffyBatch object
# All CEL files must come from the same chip type; the matching CDF package
# (and, for gcrma, the probe-sequence package) must be available.
raw_data <- ReadAffy(filenames = cel_files)

message("Performing GCRMA pre-processing (background adjustment, normalization, summarization)...")

# Perform GCRMA pre-processing
# The gcrma function performs background adjustment, normalization, and summarization
# by default, as described in the pipeline step.
eset <- gcrma(raw_data)

# Extract the normalized expression matrix
expression_matrix <- exprs(eset)

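# Write the normalized expression matrix to a tab-separated file
# col.names = NA adds an empty header cell above the row names so columns stay aligned
write.table(expression_matrix, file = output_expression_file, sep = "\t", quote = FALSE, row.names = TRUE, col.names = NA)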

message(paste("GCRMA pre-processing complete. Normalized expression matrix saved to:", output_expression_file))
EOF

# Execute the R script
Rscript run_gcrma_preprocessing.R
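
After the script finishes, a quick look at the output dimensions can catch truncated runs or missing arrays. The sketch below counts probe sets and samples in the tab-separated matrix. `summarize_matrix` is a hypothetical helper; it takes the sample count from the first data row, since the header may hold one field fewer than the data rows when `write.table` writes row names.

```shell
# Hypothetical helper: print "<probe sets> <samples>" for a tab-separated matrix
# whose first column holds row names and whose first line is a header.
summarize_matrix() {
  awk -F '\t' '
    NR == 2 { samples = NF - 1 }           # first data row: row name plus one value per sample
    NR > 1 && NF - 1 != samples { bad++ }  # flag rows with a different width
    END {
      if (NR < 2) { print "no data rows" > "/dev/stderr"; exit 1 }
      if (bad > 0) { print bad " malformed rows" > "/dev/stderr"; exit 1 }
      print NR - 1, samples
    }
  ' "$1"
}

# Example call on the file written by the R script above:
# summarize_matrix gcrma_normalized_expression.txt
```

The second number should equal the number of CEL files that were read; the first depends on the array design.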

Tools Used

R