GSE13067 Processing Pipeline
Publication
DDX5 promotes oncogene C3 and FABP1 expressions and drives intestinal inflammation and tumorigenesis. Life Science Alliance (2020), PMID 32817263
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1. The simpleaffy package in R/Bioconductor was used to calculate MAS5.0 calls.
Tool: R, version not specified (inferred with models/gemini-2.5-flash). Bash example:
# Install R and the Bioconductor packages (if not already installed).
# For example, using conda:
#   conda create -n r_simpleaffy r-base bioconductor-simpleaffy bioconductor-affy -y
#   conda activate r_simpleaffy
# Or, within R:
#   if (!requireNamespace("BiocManager", quietly = TRUE))
#       install.packages("BiocManager")
#   BiocManager::install(c("simpleaffy", "affy"))
# Note: simpleaffy may not be available in recent Bioconductor releases,
# so an older release may be needed.

# Create an R script to calculate MAS5.0 detection calls
cat << 'EOF' > calculate_mas5_calls.R
# Load necessary libraries
library(simpleaffy)  # package named in the source text
library(affy)        # provides ReadAffy() and mas5calls()
library(Biobase)     # provides exprs() and assayData()

# Directory holding the Affymetrix .CEL files (replace with the actual path)
cel_file_directory <- "."  # e.g. "/data/affymetrix_cels"

# List all .CEL files, matching the extension case-insensitively
cel_files <- list.files(path = cel_file_directory, pattern = "\\.CEL$",
                        ignore.case = TRUE, full.names = TRUE)
if (length(cel_files) == 0) {
  stop("No .CEL files found in the specified directory: ", cel_file_directory,
       ". Please provide Affymetrix .CEL files.")
}

# Read the .CEL files into an AffyBatch object
raw_data <- ReadAffy(filenames = cel_files)

# Calculate MAS5.0 detection calls.
# mas5calls() returns an ExpressionSet, not a data frame: exprs() holds the
# calls (P/M/A) and the "se.exprs" assay element holds the detection p-values.
mas5_calls_results <- mas5calls(raw_data)
calls <- exprs(mas5_calls_results)
p_values <- assayData(mas5_calls_results)[["se.exprs"]]

# Save both matrices, keeping probe set IDs as row names
write.csv(calls, "mas5_calls_output.csv")
write.csv(p_values, "mas5_detection_pvalues.csv")
message("MAS5.0 calls saved to mas5_calls_output.csv")
EOF

Rscript calculate_mas5_calls.R
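To make the P/M/A labels in this step concrete, here is a minimal sketch of how a detection p-value is turned into a call. It assumes the thresholds 0.04 and 0.06, which are the defaults of `mas5calls` in the affy package; the source text does not say which thresholds were used for this dataset. The probe labels and p-values are hypothetical, and the function name `detection_call` is invented for illustration.

```python
# Thresholds assumed from the affy mas5calls defaults (alpha1 = 0.04, alpha2 = 0.06).
# The dataset's actual settings are not stated in the source text.
ALPHA1 = 0.04
ALPHA2 = 0.06


def detection_call(p_value, alpha1=ALPHA1, alpha2=ALPHA2):
    """Map a detection p-value to Present (P), Marginal (M) or Absent (A)."""
    if p_value < alpha1:
        return "P"
    if p_value < alpha2:
        return "M"
    return "A"


# Hypothetical detection p-values keyed by made-up probe labels.
example_p_values = {"probe_a": 0.002, "probe_b": 0.05, "probe_c": 0.31}
calls = {probe: detection_call(p) for probe, p in example_p_values.items()}
print(calls)
```

With these made-up inputs the three probes are called P, M and A respectively. The sketch covers only the final thresholding; the p-values themselves come from a signed rank test on the probe pairs of each probe set, which is not reproduced here.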
2. These values are not normalized.
Tool: Raw Data, version N/A (inferred with models/gemini-2.5-flash). Bash example:
# This step indicates that the data values are in their raw, unnormalized state.
# No specific command is executed for this descriptive step itself.
# Normalization would typically be performed by a subsequent tool.
3. Normalization was performed by quantile normalization with respect to a defined reference set.
Tool: R (preprocessCore), R 4.3.2, preprocessCore 1.64.0 (inferred with models/gemini-2.5-flash). Bash example:
# Install R and the preprocessCore package if not already installed
# (preprocessCore is a Bioconductor package, distributed through bioconda):
#   conda install -c conda-forge -c bioconda r-base=4.3.2 bioconductor-preprocesscore=1.64.0

# Create dummy input files for demonstration (replace with actual paths and data).
# input_data.tsv: tab-separated matrix with gene/feature names as row names and samples as columns
# reference_distribution.tsv: single-column file containing the sorted values of the target distribution

# Example dummy input_data.tsv (replace with your actual data file)
#   echo -e "gene\tsample1\tsample2\tsample3" > input_data.tsv
#   echo -e "geneA\t100\t120\t90" >> input_data.tsv
#   echo -e "geneB\t50\t60\t45" >> input_data.tsv
#   echo -e "geneC\t200\t210\t180" >> input_data.tsv
#   echo -e "geneD\t10\t15\t8" >> input_data.tsv

# Example dummy reference_distribution.tsv (replace with your actual reference distribution file)
#   echo -e "10\n50\n100\n200" > reference_distribution.tsv

# R script for quantile normalization with a defined reference set
cat << 'EOF' > normalize_script.R
library(preprocessCore)

# Load the input data matrix.
# Assumes the first column contains row names (e.g. gene IDs) and the remaining columns are samples.
# Adjust 'sep', 'header', and 'row.names' to match your actual input file format.
input_data <- read.table("input_data.tsv", sep = "\t", header = TRUE, row.names = 1)
data_matrix <- as.matrix(input_data)

# Load the defined reference distribution (single column, no header)
reference_dist <- as.vector(read.table("reference_distribution.tsv", header = FALSE)[, 1])

# Perform quantile normalization with respect to the reference distribution.
# The argument is named 'target' (not 'target.distribution').
normalized_matrix <- normalize.quantiles.use.target(data_matrix, target = reference_dist)

# Restore column and row names, which the normalization step drops
colnames(normalized_matrix) <- colnames(data_matrix)
rownames(normalized_matrix) <- rownames(data_matrix)

# Save the normalized data.
# col.names = NA writes an empty header cell above the row names so the header aligns with the columns.
write.table(normalized_matrix, "normalized_values.tsv", sep = "\t", quote = FALSE, col.names = NA)
EOF

# Execute the R script
Rscript normalize_script.R
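The idea behind quantile normalization against a reference can be shown on a few numbers. The sketch below is a simplified illustration, not the preprocessCore implementation: it assumes the reference has exactly as many values as each sample and that there are no tied values, so each value is simply replaced by the reference value of the same rank. The function name `quantile_normalize_to_target` and all numbers are hypothetical.

```python
def quantile_normalize_to_target(column, target):
    """Replace each value in `column` with the target value of the same rank.

    Simplifying assumptions: `column` and `target` have equal length and
    `column` has no ties (tied values would be ranked in input order).
    """
    if len(column) != len(target):
        raise ValueError("this sketch assumes equal lengths")
    sorted_target = sorted(target)
    # Indices of the column ordered from the smallest to the largest value.
    order = sorted(range(len(column)), key=lambda i: column[i])
    normalized = [0.0] * len(column)
    for rank, index in enumerate(order):
        normalized[index] = sorted_target[rank]
    return normalized


# Hypothetical intensities for four features in three samples.
samples = {
    "sample1": [100, 50, 200, 10],
    "sample2": [120, 60, 210, 15],
    "sample3": [45, 90, 8, 180],
}
# Hypothetical reference distribution.
reference = [10, 50, 100, 200]

normalized = {
    name: quantile_normalize_to_target(values, reference)
    for name, values in samples.items()
}
for name, values in normalized.items():
    print(name, values)
```

After this step every sample contains exactly the reference values, arranged according to the original within-sample ranking, which is what makes the samples comparable. A full implementation also has to handle ties and a reference whose length differs from the number of features.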
Raw Source Text
The simpleaffy package in R/Bioconductor was used to calculate MAS5.0 calls. These values are not normalized. Normalization was performed by quantile normalization with respect to a defined reference set.