GSE13067 Processing Pipeline — Yeo Lab Publications

Publication

DDX5 promotes oncogene C3 and FABP1 expressions and drives intestinal inflammation and tumorigenesis.

Life science alliance (2020) — PMID 32817263

Dataset

Expression data from primary colorectal cancers

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

1

The simpleaffy package in R/Bioconductor was used to calculate MAS5.0 calls.

R (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Install R and Bioconductor (if not already installed)
# For example, using conda:
# conda create -n r_simpleaffy -c conda-forge -c bioconda r-base bioconductor-simpleaffy bioconductor-affy -y
# conda activate r_simpleaffy

# Or, within R:
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install(c("simpleaffy", "affy"))

# Create an R script to calculate MAS5.0 calls
cat << 'EOF' > calculate_mas5_calls.R
# Load necessary libraries
library(simpleaffy)
library(affy) # Provides ReadAffy() and mas5calls()

# Define the path to your CEL files
# Replace with the actual path to your Affymetrix .CEL files
cel_file_directory <- "." # Example: current directory, or specify e.g., "/data/affymetrix_cels"

# List all .CEL files (optionally gzipped, any letter case) in the specified directory
# The pattern uses double backslashes for escaping the dot in R regex
cel_files <- list.files(path = cel_file_directory, pattern = "\\.CEL(\\.gz)?$", ignore.case = TRUE, full.names = TRUE)

if (length(cel_files) == 0) {
    stop("No .CEL files found in the specified directory: ", cel_file_directory, ". Please provide Affymetrix .CEL files.")
}

# Read the .CEL files into an AffyBatch object
# This step requires the 'affy' package
raw_data <- ReadAffy(filenames = cel_files)

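# Calculate MAS5.0 detection calls
# mas5calls() (from the affy package) returns an ExpressionSet, not a data frame:
# exprs() holds the P/M/A calls and assayData()[["se.exprs"]] the detection p-values
mas5_calls_results <- mas5calls(raw_data)
calls <- exprs(mas5_calls_results)
p_values <- assayData(mas5_calls_results)[["se.exprs"]]

# Define the output file paths
output_file <- "mas5_calls_output.csv"
pvalue_file <- "mas5_detection_pvalues.csv"

# Save the results to CSV files, keeping the probe set IDs as row names
write.csv(calls, output_file)
write.csv(p_values, pvalue_file)

message("MAS5.0 calls saved to: ", output_file, "; detection p-values saved to: ", pvalue_file)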
EOF

Rscript calculate_mas5_calls.R
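
To make the meaning of the P/M/A calls concrete, here is a minimal sketch in plain Python of how a detection p-value is turned into a call. The thresholds 0.04 and 0.06 are the commonly used MAS5 defaults and are an assumption here, since the source does not state which thresholds were applied. The p-values are made up for illustration, and the helper names are hypothetical.

```python
def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Translate a detection p-value into a Present/Marginal/Absent call.

    alpha1 and alpha2 are assumed thresholds (common MAS5 defaults).
    """
    if p_value < alpha1:
        return "P"  # Present
    if p_value < alpha2:
        return "M"  # Marginal
    return "A"  # Absent


def present_fraction(calls):
    """Share of probe sets called Present in one sample."""
    return sum(1 for call in calls if call == "P") / len(calls)


# Hypothetical detection p-values for five probe sets in one sample
example_p_values = [0.001, 0.03, 0.05, 0.2, 0.9]
example_calls = [detection_call(p) for p in example_p_values]

print(example_calls)
print(present_fraction(example_calls))
```

The fraction of Present calls per array is often used as a quick quality check before any downstream filtering.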

2

These values are not normalized.

Raw Data (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This step indicates that the data values are in their raw, unnormalized state.
# No specific command is executed for this descriptive step itself.
# Normalization would typically be performed by a subsequent tool.

3

Normalization was performed by quantile normalization with respect to a defined reference set.

R (preprocessCore) (Inferred with models/gemini-2.5-flash) vR 4.3.2, preprocessCore 1.64.0

$ Bash example

# Install R and the preprocessCore package if not already installed
# conda install -c conda-forge r-base=4.3.2
# conda install -c conda-forge -c bioconda bioconductor-preprocesscore=1.64.0

# Create dummy input files for demonstration (replace with actual paths and data)
# input_data.tsv: Tab-separated matrix with gene/feature names as row names and samples as columns
# reference_distribution.tsv: Single-column file containing the sorted values of the target distribution

# Example dummy input_data.tsv (replace with your actual data file)
# echo -e "gene\tsample1\tsample2\tsample3" > input_data.tsv
# echo -e "geneA\t100\t120\t90" >> input_data.tsv
# echo -e "geneB\t50\t60\t45" >> input_data.tsv
# echo -e "geneC\t200\t210\t180" >> input_data.tsv
# echo -e "geneD\t10\t15\t8" >> input_data.tsv

# Example dummy reference_distribution.tsv (replace with your actual reference distribution file)
# This file should contain a single column of sorted values representing the target distribution.
# echo -e "10\n50\n100\n200" > reference_distribution.tsv

# R script for quantile normalization with a defined reference set
cat << 'EOF' > normalize_script.R
library(preprocessCore)

# Load input data matrix
# Assumes the first column contains row names (e.g., gene IDs) and subsequent columns are sample data.
# Adjust 'sep', 'header', and 'row.names' based on your actual input file format.
input_data <- read.table("input_data.tsv", sep="\t", header=TRUE, row.names=1)
data_matrix <- as.matrix(input_data)

# Load the defined reference distribution
# Assumes it's a single-column file without a header.
reference_dist <- as.vector(read.table("reference_distribution.tsv", header=FALSE)[,1])

# Perform quantile normalization with respect to the defined reference distribution
# The 'target' argument supplies the reference distribution that every column is mapped onto.
storage.mode(data_matrix) <- "double"  # make sure the matrix holds doubles, not integers
normalized_matrix <- normalize.quantiles.use.target(data_matrix, target = as.numeric(reference_dist))

# Restore column and row names to the normalized matrix
colnames(normalized_matrix) <- colnames(data_matrix)
rownames(normalized_matrix) <- rownames(data_matrix)

# Save the normalized data to an output file
# 'col.names=NA' writes an empty header cell above the row names so the header lines up with the data columns.
write.table(normalized_matrix, "normalized_expression.tsv", sep="\t", quote=FALSE, col.names=NA)
EOF

# Execute the R script
Rscript normalize_script.R
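
The normalization step can also be sketched without R to show what "quantile normalization with respect to a defined reference set" amounts to. This is a minimal illustration under stated assumptions, not the authors' implementation. The source does not say how the reference set was defined, so the sketch assumes it is a subset of samples whose rank-wise mean forms the target distribution. It also assumes the target has one value per row and that ties are ranked in their original order rather than averaged. Function names are hypothetical; the toy values are taken from the dummy input file above.

```python
def reference_target(reference_columns):
    """Build a target distribution by averaging the sorted reference columns rank by rank."""
    sorted_columns = [sorted(column) for column in reference_columns]
    n_columns = len(sorted_columns)
    return [sum(values) / n_columns for values in zip(*sorted_columns)]


def normalize_to_target(column, target):
    """Replace each value with the target value of the same rank.

    Assumes len(target) == len(column); ties keep their original order.
    """
    if len(column) != len(target):
        raise ValueError("target must have one value per row")
    sorted_target = sorted(target)
    # Row indices of the column ordered from smallest to largest value
    order = sorted(range(len(column)), key=lambda i: column[i])
    result = [0.0] * len(column)
    for rank, row_index in enumerate(order):
        result[row_index] = sorted_target[rank]
    return result


# Rows are geneA..geneD; sample1 and sample2 act as the assumed reference set
sample1 = [100, 50, 200, 10]
sample2 = [120, 60, 210, 15]
sample3 = [90, 45, 180, 8]

target = reference_target([sample1, sample2])
normalized_sample3 = normalize_to_target(sample3, target)

print(target)
print(normalized_sample3)
```

After normalization every sample has exactly the same set of values as the target; only the assignment of those values to genes differs, following each sample's own ranking.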

Tools Used

R