GSE14333 Processing Pipeline

Publication

DDX5 promotes oncogene C3 and FABP1 expressions and drives intestinal inflammation and tumorigenesis.

Life Science Alliance (2020), PMID 32817263

Dataset

GSE14333

Expression data from 290 primary colorectal cancers

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. Use the simpleaffy package in R/Bioconductor to calculate MAS5.0 calls.

    R v4.3 (Bioconductor 3.18)
    $ Bash example
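    # Install R and the Bioconductor simpleaffy package if not already installed.
    # Note: simpleaffy may be unavailable in recent Bioconductor releases; if installation
    # fails, check which Bioconductor release still provides it.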
    # For Conda (recommended for environment management):
    # conda create -n bioconductor_env r-base bioconductor-simpleaffy -y
    # conda activate bioconductor_env
    
    # For R directly:
    # Rscript -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # Rscript -e 'BiocManager::install("simpleaffy")'
    
    # Create an R script to perform MAS5.0 normalization
    cat << 'EOF' > mas5_normalization.R
    library(simpleaffy)
    
    # Define input directory containing .CEL files and output file
    # Assuming .CEL files are in the current directory. Adjust 'cel_dir' if needed.
    cel_dir <- "."
    output_file <- "mas5_normalized_expression.tsv"
    
    # List all .CEL files in the specified directory
    cel_files <- list.celfiles(cel_dir, full.names=TRUE)
    
    # Check if any .CEL files were found
    if (length(cel_files) == 0) {
      stop("No .CEL files found in '", cel_dir, "'. Ensure .CEL files are present or adjust 'cel_dir'.")
    }
    
    # Read Affymetrix .CEL files into an AffyBatch object
    # For more complex experiments (e.g., with sample metadata), consider creating a phenoData file
    # and using read.affybatch(filenames=cel_files, phenoData=pheno_data_object)
    raw_data <- read.affybatch(filenames=cel_files)
    
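    # Compute MAS5.0 expression values.
    # mas5() comes from the affy package (which simpleaffy depends on) and returns values
    # on the linear scale, not log2; apply log2() downstream if log-scale values are needed.
    # If MAS5.0 detection calls (P/M/A) are what is wanted, see affy::mas5calls() instead.
    mas5_normalized_data <- mas5(raw_data)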
    
    # Extract expression values
    expression_matrix <- exprs(mas5_normalized_data)
    
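    # Write the expression matrix to a TSV file.
    # col.names=NA leaves the top-left header cell empty so the header aligns with the data columns.
    write.table(expression_matrix, file=output_file, sep="\t", quote=FALSE, row.names=TRUE, col.names=NA)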
    
    message(paste("MAS5.0 normalized expression values written to:", output_file))
    EOF
    
    # Execute the R script
    Rscript mas5_normalization.R
  2. These values were subsequently normalized using quantile normalization.

    R v4.3.0 with preprocessCore 1.62.0 (inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install R and preprocessCore if not already available
    # conda create -n r_env r-base bioconductor-preprocesscore -y
    # conda activate r_env
    
    # Example: Create a dummy input file (replace with your actual input_matrix.tsv)
    # echo -e "gene\tsample1\tsample2\tsample3" > input_matrix.tsv
    # echo -e "geneA\t100\t200\t50" >> input_matrix.tsv
    # echo -e "geneB\t50\t100\t25" >> input_matrix.tsv
    # echo -e "geneC\t200\t50\t100" >> input_matrix.tsv
    
    # R script for quantile normalization
    Rscript -e '
      library(preprocessCore)
      
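      # Define input and output file names.
      # input_matrix.tsv is a placeholder; to continue from step 1, point it at
      # mas5_normalized_expression.tsv instead.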
      input_file <- "input_matrix.tsv"
      output_file <- "normalized_matrix.tsv"
      
      # Read the input matrix
      # Assuming the first column contains row identifiers (e.g., gene names)
      # and subsequent columns contain numeric data for samples.
      # header=TRUE assumes the first row contains column names (sample IDs).
      # sep="\t" assumes tab-separated values.
      # Adjust these parameters based on the actual input file format.
      input_data <- read.table(input_file, sep="\t", header=TRUE, row.names=1, check.names=FALSE)
      
      # Convert to matrix for preprocessCore
      data_matrix <- as.matrix(input_data)
      
      # Perform quantile normalization
      normalized_matrix <- normalize.quantiles(data_matrix)
      
      # Restore column names (sample IDs) and row names (gene IDs)
      colnames(normalized_matrix) <- colnames(data_matrix)
      rownames(normalized_matrix) <- rownames(data_matrix)
      
      # Write the normalized matrix to an output file
      # quote=FALSE prevents R from adding quotes around string values.
      # col.names=NA is used when row.names=TRUE to leave the top-left cell empty,
      # which is standard for matrices with row names written to file.
      write.table(normalized_matrix, output_file, sep="\t", quote=FALSE, row.names=TRUE, col.names=NA)
    '
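
To make clear what the quantile normalization in step 2 does, here is a minimal base-R sketch on a made-up 4 x 3 matrix. The function name quantile_normalize and the toy values are assumptions for illustration only. This is not the preprocessCore implementation, and its handling of ties and missing values may differ from normalize.quantiles().

```r
# Hypothetical helper: quantile-normalize the columns of a numeric matrix.
quantile_normalize <- function(x) {
  stopifnot(is.matrix(x), is.numeric(x), nrow(x) > 1, !anyNA(x))
  # Rank values within each column (ties broken by order of appearance).
  ranks <- apply(x, 2, rank, ties.method = "first")
  # Reference distribution: mean of the k-th smallest value across columns.
  reference <- rowMeans(apply(x, 2, sort))
  # Replace each value with the reference value at its rank.
  normalized <- apply(ranks, 2, function(r) reference[r])
  dimnames(normalized) <- dimnames(x)
  normalized
}

# Toy data (made up): 4 probes (rows) x 3 samples (columns), filled column by column.
x <- matrix(c(5, 2, 3, 4,
              4, 1, 6, 2,
              3, 4, 7, 8),
            nrow = 4,
            dimnames = list(paste0("probe", 1:4), paste0("sample", 1:3)))

qn <- quantile_normalize(x)
print(round(qn, 2))
```

After normalization every column holds the same set of values (here 2, 3, 5 and 19/3), assigned according to each probe's rank within its sample. A quick sanity check on real output is therefore that the sorted columns of the normalized matrix are identical.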

Raw Source Text
Use the simpleaffy package in R/Bioconductor to calculate MAS5.0 calls. These values were subsequently normalized using quantile normalization.