GSE13067 Processing Pipeline

GSE code_examples 3 steps

Publication

DDX5 promotes oncogene C3 and FABP1 expressions and drives intestinal inflammation and tumorigenesis.

Life science alliance (2020) — PMID 32817263

Dataset

GSE13067

Expression data from primary colorectal cancers

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    The simpleaffy package in R/Bioconductor was used to calculate MAS5.0 calls.

    R (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install R and Bioconductor (if not already installed)
    # For example, using conda:
    # conda create -n r_simpleaffy r-base bioconductor-simpleaffy bioconductor-affy -y
    # conda activate r_simpleaffy
    
    # Or, within R:
    # if (!requireNamespace("BiocManager", quietly = TRUE))
    #     install.packages("BiocManager")
    # BiocManager::install(c("simpleaffy", "affy"))
    
    # Create an R script to calculate MAS5.0 calls
    cat << 'EOF' > calculate_mas5_calls.R
    # Load necessary libraries
    library(simpleaffy) # Package named in the source description
    library(affy)       # Provides ReadAffy() and mas5calls(), both used below
    
    # Define the path to your CEL files
    # Replace with the actual path to your Affymetrix .CEL files
    cel_file_directory <- "." # Example: current directory, or specify e.g., "/data/affymetrix_cels"
    
    # List all .CEL files in the specified directory
    # The pattern uses double backslashes to escape the dot in the R regex;
    # ignore.case = TRUE also matches lower-case ".cel" extensions
    cel_files <- list.files(path = cel_file_directory, pattern = "\\.CEL$", full.names = TRUE, ignore.case = TRUE)
    
    if (length(cel_files) == 0) {
        stop("No .CEL files found in the specified directory: ", cel_file_directory, ". Please provide Affymetrix .CEL files.")
    }
    
    # Read the .CEL files into an AffyBatch object
    # This step requires the 'affy' package
    raw_data <- ReadAffy(filenames = cel_files)
    
    # Calculate MAS5.0 detection calls
    # mas5calls() returns an ExpressionSet, not a data frame: exprs() holds the
    # calls (P/M/A) and the "se.exprs" assay element holds the detection p-values
    mas5_calls_results <- mas5calls(raw_data)
    calls_matrix <- exprs(mas5_calls_results)
    
    # Define the output file path
    output_file <- "mas5_calls_output.csv"
    
    # Save the calls to a CSV file, keeping the probe set IDs as row names
    write.csv(calls_matrix, output_file, row.names = TRUE)
    
    message(paste("MAS5.0 calls successfully calculated and saved to:", output_file))
    EOF
    
    Rscript calculate_mas5_calls.R
  2.

    These values are not normalized.

    Raw Data (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    # This step indicates that the data values are in their raw, unnormalized state.
    # No specific command is executed for this descriptive step itself.
    # Normalization would typically be performed by a subsequent tool.
  3.

    Normalization was performed by quantile normalization with respect to a defined reference set.

    R (preprocessCore) (Inferred with models/gemini-2.5-flash) vR 4.3.2, preprocessCore 1.64.0
    $ Bash example
    # Install R and the preprocessCore package if not already installed
    # conda install -c conda-forge r-base=4.3.2
    # conda install -c conda-forge -c bioconda bioconductor-preprocesscore=1.64.0
    
    # Create dummy input files for demonstration (replace with actual paths and data)
    # input_data.tsv: Tab-separated matrix with gene/feature names as row names and samples as columns
    # reference_distribution.tsv: Single-column file containing the sorted values of the target distribution
    
    # Example dummy input_data.tsv (replace with your actual data file)
    # echo -e "gene\tsample1\tsample2\tsample3" > input_data.tsv
    # echo -e "geneA\t100\t120\t90" >> input_data.tsv
    # echo -e "geneB\t50\t60\t45" >> input_data.tsv
    # echo -e "geneC\t200\t210\t180" >> input_data.tsv
    # echo -e "geneD\t10\t15\t8" >> input_data.tsv
    
    # Example dummy reference_distribution.tsv (replace with your actual reference distribution file)
    # This file should contain a single column of sorted values representing the target distribution.
    # echo -e "10\n50\n100\n200" > reference_distribution.tsv
    
    # R script for quantile normalization with a defined reference set
    cat << 'EOF' > normalize_script.R
    library(preprocessCore)
    
    # Load input data matrix
    # Assumes the first column contains row names (e.g., gene IDs) and subsequent columns are sample data.
    # Adjust 'sep', 'header', and 'row.names' based on your actual input file format.
    input_data <- read.table("input_data.tsv", sep="\t", header=TRUE, row.names=1)
    data_matrix <- as.matrix(input_data)
    
    # Load the defined reference distribution
    # Assumes it's a single-column file without a header.
    reference_dist <- as.vector(read.table("reference_distribution.tsv", header=FALSE)[,1])
    
    # Perform quantile normalization with respect to the defined reference distribution
    # The 'target' argument supplies the distribution that every sample is mapped onto.
    normalized_matrix <- normalize.quantiles.use.target(data_matrix, target = reference_dist)
    
    # Restore column and row names to the normalized matrix
    colnames(normalized_matrix) <- colnames(data_matrix)
    rownames(normalized_matrix) <- rownames(data_matrix)
    
    # Save the normalized data to an output file
    # 'col.names=NA' writes an empty header cell above the row names so the header aligns with the data columns.
    write.table(normalized_matrix, "normalized_expression.tsv", sep="\t", quote=FALSE, col.names=NA)
    EOF
    
    # Execute the R script
    Rscript normalize_script.R
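
For step 1, the mapping from detection p-values to Present/Marginal/Absent calls can be illustrated without any CEL files. The block below is a minimal sketch in base R, not the authors' code. The function `detection_calls`, the probe and sample names, and the p-values are all hypothetical, and the thresholds 0.04 and 0.06 are assumed to be the usual MAS5.0 defaults rather than settings confirmed by the source.

```r
# Map detection p-values to MAS5.0-style calls.
# ASSUMPTION: alpha1 = 0.04 and alpha2 = 0.06 are the commonly cited defaults;
# the source does not state which thresholds were used.
detection_calls <- function(p_values, alpha1 = 0.04, alpha2 = 0.06) {
  stopifnot(alpha1 < alpha2)
  # ifelse() keeps the dimensions and dimnames of its test argument,
  # so a matrix of p-values yields a matrix of calls.
  ifelse(p_values < alpha1, "P",
         ifelse(p_values < alpha2, "M", "A"))
}

# Hypothetical p-values for three probe sets in two samples
# (matrix() fills column by column).
p_values <- matrix(
  c(0.001, 0.050, 0.300,
    0.020, 0.045, 0.060),
  nrow = 3,
  dimnames = list(c("probe_1", "probe_2", "probe_3"),
                  c("sample_A", "sample_B"))
)

calls <- detection_calls(p_values)
print(calls)

# Fraction of samples in which each probe set is called Present.
present_fraction <- rowMeans(calls == "P")
print(present_fraction)
```

A per-probe summary such as `present_fraction` is one way these calls are often used, for example to filter probe sets that are rarely detected. Whether the authors applied any such filter is not stated in the source.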

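For step 3, it may help to see what quantile normalization against a reference set does to the numbers. The block below is a minimal sketch in base R, assuming the reference set is a subset of samples whose sorted values are averaged rank by rank to form the target distribution. The source does not say how the reference set was defined, so the choice of `sample1` and `sample2`, the helper names `reference_target` and `normalize_to_target`, and the toy values are all hypothetical.

```r
# Build a target distribution from a reference set of samples (columns):
# sort each reference column, then average across columns at each rank.
# ASSUMPTION: no missing values, and every sample has the same number of rows.
reference_target <- function(reference_matrix) {
  sorted <- matrix(apply(reference_matrix, 2, sort),
                   nrow = nrow(reference_matrix))
  rowMeans(sorted)
}

# Replace each value with the target value of the same rank within its sample.
# Ties are broken by order of appearance, which is simpler than the tie
# handling in preprocessCore.
normalize_to_target <- function(data_matrix, target) {
  stopifnot(nrow(data_matrix) == length(target))
  target <- sort(target)
  normalized <- matrix(
    apply(data_matrix, 2, function(column) {
      target[rank(column, ties.method = "first")]
    }),
    nrow = nrow(data_matrix)
  )
  dimnames(normalized) <- dimnames(data_matrix)
  normalized
}

# Toy expression matrix (matrix() fills column by column).
expression <- matrix(
  c(100, 50, 200, 10,
    120, 60, 210, 15,
     90, 45, 180,  8),
  nrow = 4,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"),
                  c("sample1", "sample2", "sample3"))
)

# Hypothetical reference set: the first two samples.
target <- reference_target(expression[, c("sample1", "sample2")])
normalized <- normalize_to_target(expression, target)
print(normalized)
```

After normalization every sample has exactly the target distribution, and only the ranking of genes within each sample is retained. In preprocessCore, `normalize.quantiles.determine.target` and `normalize.quantiles.use.target` offer a comparable two-stage workflow, with more careful handling of ties and missing values.
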
Raw Source Text
The simpleaffy package in R/Bioconductor was used to calculate MAS5.0 calls. These values are not normalized. Normalization was performed by quantile normalization with respect to a defined reference set.