GSE39873 Processing Pipeline — Yeo Lab Publications

Publication

LIN28 binds messenger RNAs at GGAGA motifs and regulates splicing factor abundance.

Molecular Cell (2012), PMID 22959275

Dataset

LIN28 binds messenger RNAs at GGAGA motifs and regulates splicing factor abundance

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

1

Data were processed with apt-probeset-summarize from Affymetrix Power Tools (APT).

Microarray (version inferred with models/gemini-2.5-flash)

$ Bash example

# Install Affymetrix Power Tools (APT)
# APT is distributed by Thermo Fisher Scientific; download the archive for your platform from their website.
# Conceptual example only: the URL and file name below are illustrative, and the actual ones vary by OS and APT version.
# wget https://assets.thermofisher.com/TFS-Assets/LSG/software/APT_2.10.2_Linux.zip
# unzip APT_2.10.2_Linux.zip
# export PATH=$PATH:/path/to/apt/bin

# Define input CEL files (replace with actual file paths for your experiment)
# These are the raw data files generated by Affymetrix arrays.
CEL_FILES="sample1.CEL sample2.CEL sample3.CEL"

# Define output directory for summarization results
OUTPUT_DIR="apt_summarize_output"
mkdir -p "${OUTPUT_DIR}"

# Define the library file for the array type (replace with the actual path).
# A CDF (Chip Description File) maps probes to probesets and is supplied in the array's library file package.
# Example for a 3' expression array such as Human Genome U133 Plus 2.0:
# CDF_FILE="/path/to/HG-U133_Plus_2.cdf"
# Arrays described by PGF/CLF library files (step 3 names HJAY_r2.pgf) are passed with -p and -c instead of a CDF.
CDF_FILE="path/to/your/array_type.cdf"

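# Run apt-probeset-summarize with the RMA (Robust Multi-array Average) analysis.
# The dataset description names IterPLIER (step 2) as the quantification method; RMA is shown here only as a familiar default.
# -a rma: analysis to run (background correction, quantile normalization, median-polish summarization).
# -d ${CDF_FILE}: CDF library file defining the probesets (-c is reserved for a CLF file).
# -o ${OUTPUT_DIR}: directory where the summarized data are written.
# CEL files are positional arguments; ${CEL_FILES} is left unquoted so that it splits into one argument per file.
# (--cel-files instead takes a text file listing the CEL paths under a "cel_files" header line.)
apt-probeset-summarize -a rma -d "${CDF_FILE}" -o "${OUTPUT_DIR}" ${CEL_FILES}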

echo "Probeset summarization complete. Results are in ${OUTPUT_DIR}"
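When an experiment has many arrays, apt-probeset-summarize can read the CEL paths from a text file passed with --cel-files rather than from the command line. The following is a minimal sketch of building such a list. The function name `write_cel_list`, the temporary directory, and the stand-in files are assumptions made for illustration and are not part of the original pipeline.

```shell
#!/usr/bin/env bash
set -euo pipefail

# write_cel_list DIR OUT: write a CEL list file for APT to OUT.
# The first line is the literal header "cel_files"; each later line is one CEL path.
write_cel_list() {
  local cel_dir="$1" list_file="$2"
  printf 'cel_files\n' > "${list_file}"
  # Match both .CEL and .cel; sort for a reproducible order.
  find "${cel_dir}" -maxdepth 1 -type f -iname '*.cel' | sort >> "${list_file}"
}

# Demonstration with empty stand-in files (these are not valid CEL data).
demo_dir="$(mktemp -d)"
touch "${demo_dir}/sample1.CEL" "${demo_dir}/sample2.CEL" "${demo_dir}/notes.txt"
write_cel_list "${demo_dir}" "${demo_dir}/cel_list.txt"
cat "${demo_dir}/cel_list.txt"
```

The resulting file would then replace the positional CEL arguments, for example `apt-probeset-summarize ... --cel-files cel_list.txt`, with the remaining options chosen as for your array.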

2

Probesets were quantified with the IterPLIER algorithm.

iterPlier v1.78.0

$ Bash example

# Install R and Bioconductor if not already present
# sudo apt-get update
# sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager", repos="https://cloud.r-project.org")'
# R -e 'BiocManager::install(c("affy", "iterPlier"))'
# R -e 'BiocManager::install("hgu133plus2cdf")' # Placeholder: install the CDF package that matches your array (this one is for the Affymetrix Human Genome U133 Plus 2.0 Array); the .db packages hold annotation, not probe layout

# Create an R script for iter-plier quantification
cat << 'EOF' > iter_plier_quantification.R
#!/usr/bin/env Rscript

# Parse command line arguments
args <- commandArgs(trailingOnly = TRUE)
if (length(args) < 2) {
  stop("Usage: Rscript iter_plier_quantification.R <cel_files_dir> <output_file>\nExample: Rscript iter_plier_quantification.R ./raw_cel_files expression_matrix.tsv", call.=FALSE)
}

cel_files_dir <- args[1]
output_file <- args[2]

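# Load necessary libraries
# 'affy' is a Bioconductor package. 'iterPlier' is assumed by this inferred example
# and may not be an installable package; confirm it before relying on this script.
# APT documents IterPLIER as its iter-plier quantification method.
library(affy)
library(iterPlier)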

# List CEL files in the specified directory
cel_files <- list.celfiles(cel_files_dir, full.names = TRUE)

if (length(cel_files) == 0) {
  stop(paste("No CEL files found in:", cel_files_dir), call.=FALSE)
}

message(paste("Found", length(cel_files), "CEL files. Reading data..."))

# Read CEL files into an AffyBatch object
# This step requires the CDF package for the array (e.g., hgu133plus2cdf)
raw_data <- ReadAffy(filenames = cel_files)

message("Quantifying probesets using iterPlier...")

# Perform quantification with the assumed iterPlier() function.
# It is assumed to return an ExpressionSet after any background correction,
# normalization, and summarization; check the package documentation.
expression_set <- iterPlier(raw_data)

# Extract the expression matrix (whether values are log2 or linear depends on the method's output)
expression_matrix <- exprs(expression_set)

# Write results to a tab-separated file
write.table(expression_matrix, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("Quantification complete. Results written to:", output_file))
EOF

# Make the R script executable
chmod +x iter_plier_quantification.R

# Example usage:
# Place the real CEL files in one directory (replace with the actual path).
# Empty placeholder files created with touch cannot be parsed by ReadAffy.
# mkdir -p /path/to/your/cel_files_directory

# Run the R script
# Replace /path/to/your/cel_files_directory with the actual directory containing CEL files
# Replace output_expression.tsv with your desired output file name
./iter_plier_quantification.R /path/to/your/cel_files_directory output_expression.tsv
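Before using the exported matrix, it can help to confirm that it has the shape the R script is expected to write. With `row.names = TRUE`, `write.table` writes a header line that holds only the sample names, so the header has one field fewer than each data row. The sketch below checks that layout with awk. The function name `check_matrix` and the demonstration values are made up for illustration and are not taken from the dataset.

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_matrix FILE: print "<probesets> <samples>" for a tab-separated matrix,
# or exit non-zero if any data row has an unexpected number of fields.
check_matrix() {
  awk -F '\t' '
    NR == 1 { samples = NF; next }   # header line: sample names only
    NF != samples + 1 {              # data row: row name plus one value per sample
      printf "line %d: expected %d fields, found %d\n", NR, samples + 1, NF > "/dev/stderr"
      bad = 1
    }
    END { if (bad) exit 1; print NR - 1, samples }
  ' "$1"
}

# Demonstration on a small made-up matrix (values are illustrative only).
demo_file="$(mktemp)"
printf 'sample1.CEL\tsample2.CEL\n' > "${demo_file}"
printf 'probeset_A\t7.1\t6.9\n' >> "${demo_file}"
printf 'probeset_B\t3.4\t3.8\n' >> "${demo_file}"
dims="$(check_matrix "${demo_file}")"
echo "probesets and samples: ${dims}"
```

For real output, call `check_matrix output_expression.tsv` and compare the sample count with the number of CEL files that were read.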

3

HJAY_r2.pgf

Custom Process (inferred with models/gemini-2.5-flash), version r2

$ Bash example

# The step description gives only a file name, HJAY_r2.pgf, with no tool or parameters.
# .pgf is the Affymetrix probe group file format, a library file that apt-probeset-summarize
# reads with -p (alongside a matching .clf passed with -c). This entry therefore most likely
# names the array library file used for summarization, not a separate process.
# Hypothetical sketch; the analysis string, the .clf file name, and the CEL paths are assumptions:
# apt-probeset-summarize -a iter-plier -p HJAY_r2.pgf -c HJAY_r2.clf -o apt_summarize_output sample1.CEL sample2.CEL
echo "HJAY_r2.pgf is treated as a library file; no separate command is run."

Tools Used

Microarray