GSE14333 Processing Pipeline — Yeo Lab Publications

Publication

DDX5 promotes oncogene C3 and FABP1 expressions and drives intestinal inflammation and tumorigenesis.

Life science alliance (2020) — PMID 32817263

Dataset

Expression data from 290 primary colorectal cancers

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

1

Use the simpleaffy package in R/Bioconductor to calculate MAS5.0 calls.

R v4.3 (Bioconductor 3.18)

$ Bash example

# Install R and Bioconductor simpleaffy package if not already installed
# For Conda (recommended for environment management):
# conda create -n bioconductor_env r-base bioconductor-simpleaffy -y
# conda activate bioconductor_env

# For R directly:
# Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# Rscript -e 'BiocManager::install("simpleaffy")'

# Create an R script to perform MAS5.0 normalization
cat << 'EOF' > mas5_normalization.R
library(simpleaffy)

# Define input directory containing .CEL files and output file
# Assuming .CEL files are in the current directory. Adjust 'cel_dir' if needed.
cel_dir <- "."
output_file <- "mas5_normalized_expression.tsv"

# List all .CEL files in the specified directory
cel_files <- list.celfiles(cel_dir, full.names=TRUE)

# Check if any .CEL files were found
if (length(cel_files) == 0) {
  stop("No .CEL files found in '", cel_dir, "'. Please ensure .CEL files are present or adjust 'cel_dir'.")
}

# Read Affymetrix .CEL files into an AffyBatch object
# For more complex experiments (e.g., with sample metadata), consider creating a phenoData file
# and using read.affybatch(filenames=cel_files, phenoData=pheno_data_object)
raw_data <- read.affybatch(filenames=cel_files)

# Perform MAS5.0 normalization
# Note: mas5() comes from the affy package (loaded with simpleaffy) and returns
# expression values on the original (non-log) scale; apply log2() afterwards if needed.
mas5_normalized_data <- mas5(raw_data)

# Extract expression values
expression_matrix <- exprs(mas5_normalized_data)

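# Write normalized expression matrix to a TSV file
# col.names=NA leaves the top-left header cell empty so the header aligns with the row names
write.table(expression_matrix, file=output_file, sep="\t", quote=FALSE, row.names=TRUE, col.names=NA)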

message(paste("MAS5.0 normalized expression values written to:", output_file))
EOF

# Execute the R script
Rscript mas5_normalization.R

2

These values were subsequently normalized using quantile normalization.

R v4.3.0 with preprocessCore 1.62.0 (inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and preprocessCore if not already available
# conda create -n r_env r-base bioconductor-preprocesscore -y
# conda activate r_env

# Example: Create a dummy input file (replace with your actual input_matrix.tsv)
# echo -e "gene\tsample1\tsample2\tsample3" > input_matrix.tsv
# echo -e "geneA\t100\t200\t50" >> input_matrix.tsv
# echo -e "geneB\t50\t100\t25" >> input_matrix.tsv
# echo -e "geneC\t200\t50\t100" >> input_matrix.tsv

# R script for quantile normalization
Rscript -e '
  library(preprocessCore)
  
  # Define input and output file names
  # input_file is a placeholder; point it at the MAS5.0 expression matrix from the previous step
  input_file <- "input_matrix.tsv"
  output_file <- "normalized_matrix.tsv"
  
  # Read the input matrix
  # Assuming the first column contains row identifiers (e.g., gene names)
  # and subsequent columns contain numeric data for samples.
  # header=TRUE assumes the first row contains column names (sample IDs).
  # sep="\t" assumes tab-separated values.
  # Adjust these parameters based on the actual input file format.
  input_data <- read.table(input_file, sep="\t", header=TRUE, row.names=1, check.names=FALSE)
  
  # Convert to matrix for preprocessCore
  data_matrix <- as.matrix(input_data)
  
  # Perform quantile normalization
  normalized_matrix <- normalize.quantiles(data_matrix)
  
  # Restore column names (sample IDs) and row names (gene IDs)
  colnames(normalized_matrix) <- colnames(data_matrix)
  rownames(normalized_matrix) <- rownames(data_matrix)
  
  # Write the normalized matrix to an output file
  # quote=FALSE prevents R from adding quotes around string values.
  # col.names=NA is used when row.names=TRUE to leave the top-left cell empty,
  # which is standard for matrices with row names written to file.
  write.table(normalized_matrix, output_file, sep="\t", quote=FALSE, row.names=TRUE, col.names=NA)
'
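
To make the quantile normalization step concrete, the base R sketch below works through the idea on the toy matrix from the dummy input above. It is an illustration under stated assumptions, not the preprocessCore implementation: the function name quantile_normalize is hypothetical, the input is assumed to contain no missing values, and ties are broken by order of appearance, so results may differ from normalize.quantiles when ties are present.

```r
# Toy matrix: 3 genes x 3 samples (same values as the dummy input above).
# matrix() fills column by column, so each c(...) row below is one sample.
x <- matrix(c(100,  50, 200,   # sample1
              200, 100,  50,   # sample2
               50,  25, 100),  # sample3
            nrow = 3,
            dimnames = list(c("geneA", "geneB", "geneC"),
                            c("sample1", "sample2", "sample3")))

# Hypothetical helper: a minimal quantile normalization sketch.
# Assumes a numeric matrix with no NA values.
quantile_normalize <- function(x) {
  # 1. Sort each column (sample) independently
  sorted <- apply(x, 2, sort)
  # 2. Average across samples at each rank to get the reference distribution
  ref <- unname(rowMeans(sorted))
  # 3. Rank the values within each sample (ties broken by order of appearance)
  ranks <- apply(x, 2, rank, ties.method = "first")
  # 4. Replace every value with the reference value at its rank
  out <- apply(ranks, 2, function(r) ref[r])
  dimnames(out) <- dimnames(x)
  out
}

normalized <- quantile_normalize(x)
print(round(normalized, 2))
```

After normalization every sample holds the same set of values, so the per-sample distributions are identical while each gene keeps its within-sample rank. For real data, normalize.quantiles from preprocessCore remains the more appropriate choice.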

Tools Used

R