GSE77634 Processing Pipeline

GSE code_examples 7 steps

Publication

Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP).

Nature methods (2016) — PMID 27018577

Dataset

Enhanced CLIP (eCLIP) enables robust and scalable transcriptome-wide discovery and characterization of RNA binding protein binding sites

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection.

Microarray v1.4.1.46

$ Bash example

# Expression Console is a GUI-based software. The following bash script uses Affymetrix Power Tools (APT), a command-line suite that provides similar functionality for processing CEL files with RMA normalization and DABG detection.

# Install Affymetrix Power Tools (APT) if not already installed.
# APT is proprietary software from Thermo Fisher Scientific (formerly Affymetrix).
# Installation typically involves downloading from their website and setting up the PATH.
# Example (conceptual, actual installation may vary based on OS and APT version):
# wget https://assets.thermofisher.com/TFS-Assets/LSG/software/APT_2.11.2_Linux_x86_64.zip
# unzip APT_2.11.2_Linux_x86_64.zip -d /opt/apt
# export PATH=$PATH:/opt/apt/bin

# Define input CEL files and output directory
# Replace 'sample*.CEL' with the actual paths to your CEL files.
# Replace '/path/to/your_array.cdf' with the path to the appropriate CDF file for your microarray type.
CEL_FILES="sample1.CEL sample2.CEL sample3.CEL" # Example: space-separated list of CEL files
OUTPUT_DIR="expression_console_analysis_output"
CDF_FILE="/path/to/your_array.cdf" # e.g., /opt/apt/library-files/HuGene-1_0-st-v1.cdf

mkdir -p "${OUTPUT_DIR}"

# Execute RMA normalization and DABG probe-level detection using apt-probeset-summarize.
# This command conceptually replicates the core analytical steps described for Expression Console.
# --dabg-pvalue-threshold 0.05 is a common default, adjust if a specific threshold was used.
apt-probeset-summarize \
-a rma \
-p "${CDF_FILE}" \
-o "${OUTPUT_DIR}" \
-c "${CEL_FILES}" \
--log-file "${OUTPUT_DIR}/apt_log.txt" \
--dabg \
--dabg-pvalue-threshold 0.05

Only probesets with detection p-value â¤ 0.05 in more than half of the microarray samples were considered for downstream analysis.

Microarray vNot specified GitHub

$ Bash example

#!/bin/bash

# Define input and output files
INPUT_PVALUES="detection_pvalues.tsv" # This file should contain detection p-values from upstream processing
OUTPUT_FILTERED_PROBESETS="filtered_probesets.tsv"

# Define parameters
P_VALUE_THRESHOLD=0.05

# --- Installation (commented out) ---
# # Install R if not available
# # sudo apt-get update && sudo apt-get install -y r-base

# --- Create a dummy input file for demonstration if it doesn't exist ---
# In a real pipeline, this file would be generated by an upstream step (e.g., MAS5 processing of Affymetrix CEL files)
if [ ! -f "$INPUT_PVALUES" ]; then
    echo "Creating dummy input file: $INPUT_PVALUES"
    echo -e "ProbesetID\tSample1_Pval\tSample2_Pval\tSample3_Pval\tSample4_Pval\tSample5_Pval\tSample6_Pval\tSample7_Pval\tSample8_Pval\tSample9_Pval\tSample10_Pval" > "$INPUT_PVALUES"
    for i in $(seq 1 1000); do
        PROBESET_ID="Probeset_$i"
        PVALS=""
        for j in $(seq 1 10); do
            # Generate random p-values, some below 0.05, some above
            RANDOM_PVAL=$(awk -v seed=$RANDOM 'BEGIN{srand(seed); print rand()}' | cut -c 3-7)
            PVALS="$PVALS\t0.$RANDOM_PVAL"
        done
        echo -e "$PROBESET_ID$PVALS" >> "$INPUT_PVALUES"
    done
fi

# --- R script for filtering ---
R_SCRIPT=$(cat <<EOF
# Load necessary libraries (base R functions are sufficient for this logic)

# --- Configuration ---
P_VALUE_THRESHOLD <- ${P_VALUE_THRESHOLD}
input_file <- "${INPUT_PVALUES}"
output_file <- "${OUTPUT_FILTERED_PROBESETS}"

# --- Load data ---
if (!file.exists(input_file)) {
    stop(paste("Input file not found:", input_file))
}
detection_pvalues_df <- read.delim(input_file, header = TRUE, row.names = 1)
detection_pvalues_matrix <- as.matrix(detection_pvalues_df)

num_samples <- ncol(detection_pvalues_matrix)
# "more than half" means floor(N/2) + 1
min_samples_detected <- floor(num_samples / 2) + 1 

# Apply filtering
# Count for each probeset how many samples have p-value <= P_VALUE_THRESHOLD
probeset_detection_counts <- rowSums(detection_pvalues_matrix <= P_VALUE_THRESHOLD)

# Identify probesets that meet the criteria
filtered_probeset_indices <- which(probeset_detection_counts >= min_samples_detected)
filtered_probeset_ids <- rownames(detection_pvalues_matrix)[filtered_probeset_indices]

# Output the list of filtered probeset IDs
write.table(data.frame(ProbesetID = filtered_probeset_ids), 
            output_file, 
            sep = "\t", 
            row.names = FALSE, 
            quote = FALSE)

message(paste("Filtered probesets written to:", output_file))
message(paste("Number of probesets before filtering:", nrow(detection_pvalues_matrix)))
message(paste("Number of probesets after filtering:", length(filtered_probeset_ids)))
EOF
)

# Execute the R script
Rscript -e "$R_SCRIPT"

echo "Filtering complete. Filtered probeset IDs are in $OUTPUT_FILTERED_PROBESETS"

View on GitHub

All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes.

Microarray vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install R and Bioconductor packages if not already present
# conda install -c conda-forge r-base
# conda install -c bioconda bioconductor-limma # limma is a common package for microarray analysis, though the described normalization is custom.

# Create a dummy raw intensity file (replace with actual data)
# This file represents the probe intensities from the microarray.
echo "ProbeID\tSample_1\tSample_2\tSample_3" > raw_intensities.tsv
echo "Probe_1\t10.2\t11.5\t9.8" >> raw_intensities.tsv
echo "Probe_2\t8.5\t9.1\t8.9" >> raw_intensities.tsv
echo "Probe_3\t12.1\t13.0\t11.9" >> raw_intensities.tsv
echo "Probe_4\t7.8\t8.2\t7.5" >> raw_intensities.tsv
echo "Probe_5\t15.0\t14.5\t15.2" >> raw_intensities.tsv
echo "Probe_6\t9.5\t10.0\t9.7" >> raw_intensities.tsv
echo "Probe_7\t11.0\t11.2\t10.8" >> raw_intensities.tsv
echo "Probe_8\t6.5\t7.0\t6.8" >> raw_intensities.tsv
echo "Probe_9\t13.5\t14.0\t13.2" >> raw_intensities.tsv
echo "Probe_10\t10.0\t10.5\t9.9" >> raw_intensities.tsv

# Create a dummy probe annotation file (replace with actual data)
# This file maps probes to genes and their specific type related to cassette exons.
echo "ProbeID\tGeneID\tExonType" > probe_annotation.tsv
echo "Probe_1\tGeneA\tinclusion_exonic" >> probe_annotation.tsv
echo "Probe_2\tGeneA\texclusion_junction" >> probe_annotation.tsv
echo "Probe_3\tGeneB\tupstream_inclusion_junction" >> probe_annotation.tsv
echo "Probe_4\tGeneB\tdownstream_inclusion_junction" >> probe_annotation.tsv
echo "Probe_5\tGeneC\tinclusion_exonic" >> probe_annotation.tsv
echo "Probe_6\tGeneC\texclusion_junction" >> probe_annotation.tsv
echo "Probe_7\tGeneD\tupstream_inclusion_junction" >> probe_annotation.tsv
echo "Probe_8\tGeneD\tdownstream_inclusion_junction" >> probe_annotation.tsv
echo "Probe_9\tGeneE\tinclusion_exonic" >> probe_annotation.tsv
echo "Probe_10\tGeneE\texclusion_junction" >> probe_annotation.tsv

# Create the R script for custom normalization
cat << 'EOF' > normalize_microarray.R
# --- Input parameters ---
input_intensities_file <- "raw_intensities.tsv"
input_annotation_file <- "probe_annotation.tsv"
output_normalized_file <- "normalized_cassette_exon_probes.tsv"

# --- Reference datasets ---
# Reference genome: hg38 (inferred as latest human assembly for gene mapping)
# Probe annotation source: Microarray platform provider (e.g., Affymetrix, Agilent) specific to the array design.

# --- Read input data ---
probe_intensities <- read.delim(input_intensities_file, row.names = 1, sep = "\t", check.names = FALSE)
probe_annotation <- read.delim(input_annotation_file, sep = "\t", check.names = FALSE)

# --- Step 1: Identify probes corresponding to cassette exons ---
# The description specifies: "exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes"
cassette_exon_types <- c("exclusion_junction", "upstream_inclusion_junction",
                         "downstream_inclusion_junction", "inclusion_exonic")

cassette_exon_probes_df <- subset(probe_annotation, ExonType %in% cassette_exon_types)

# Filter intensity data to include only these probes
# Ensure probe IDs match between annotation and intensity data
common_probes <- intersect(rownames(probe_intensities), cassette_exon_probes_df$ProbeID)
if (length(common_probes) == 0) {
  stop("No common probes found between intensity data and cassette exon annotation.")
}

filtered_intensities <- probe_intensities[common_probes, , drop = FALSE]
filtered_annotation <- cassette_exon_probes_df[cassette_exon_probes_df$ProbeID %in% common_probes, ]
# Ensure annotation is ordered consistently with intensities
filtered_annotation <- filtered_annotation[match(rownames(filtered_intensities), filtered_annotation$ProbeID), ]


# --- Step 2: Normalize against the average signal on a per-gene basis ---
# This removes gene expression changes by normalizing each probe's signal
# by the average signal of its associated gene *for each sample*.

normalized_intensities <- filtered_intensities # Initialize with filtered data

for (gene in unique(filtered_annotation$GeneID)) {
  gene_probes <- subset(filtered_annotation, GeneID == gene)$ProbeID
  if (length(gene_probes) > 0) {
    # Get intensities for probes belonging to the current gene
    gene_probe_signals <- filtered_intensities[gene_probes, , drop = FALSE]

    # Calculate average signal for this gene *per sample*
    # Use colMeans for per-sample average. Handle cases with a single probe for a gene.
    if (length(gene_probes) == 1) {
      gene_averages_per_sample <- as.numeric(gene_probe_signals)
    } else {
      gene_averages_per_sample <- colMeans(gene_probe_signals, na.rm = TRUE)
    }

    # Normalize each probe's signal by the gene's average in that sample
    for (sample_idx in 1:ncol(normalized_intensities)) {
      if (gene_averages_per_sample[sample_idx] > 0) { # Avoid division by zero
        normalized_intensities[gene_probes, sample_idx] <-
          gene_probe_signals[, sample_idx] / gene_averages_per_sample[sample_idx]
      } else {
        # If gene average is zero, set normalized values to NA
        normalized_intensities[gene_probes, sample_idx] <- NA
      }
    }
  }
}

# Handle potential Inf/NaN values resulting from division
normalized_intensities[is.infinite(normalized_intensities)] <- NA
normalized_intensities[is.nan(normalized_intensities)] <- NA

# --- Output normalized data ---
write.table(normalized_intensities, output_normalized_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("Normalized data saved to:", output_normalized_file))
EOF

# Execute the R script
Rscript normalize_microarray.R

View on GitHub

Studentâs t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore ( defined as the normalized change in exclusion minus the normalized change in inclusion).

Custom Python/R script (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# This is a placeholder command for a custom script that would perform the described analysis.
# The actual script (e.g., analyze_splicing.py or analyze_splicing.R) would contain the logic for:
# 1. Reading inclusion and exclusion probe data (likely residuals from a prior statistical model).
# 2. Performing Student's t-test on these datasets separately.
# 3. Calculating SepScore: (normalized change in exclusion - normalized change in inclusion).
# 4. Identifying robust splicing changes based on t-test results and SepScore.

# Assuming input files for residuals and normalized changes are available from upstream steps.
# For example, 'inclusion_probe_residuals.tsv', 'exclusion_probe_residuals.tsv',
# and 'normalized_splicing_changes.tsv' containing the necessary data.

python analyze_splicing_changes.py \
    --inclusion_residuals inclusion_probe_residuals.tsv \
    --exclusion_residuals exclusion_probe_residuals.tsv \
    --normalized_changes normalized_splicing_changes.tsv \
    --output_splicing_results splicing_analysis_results.tsv \
    --significance_threshold 0.05 # Example parameter for significance level

View on GitHub

HTA-2_0.r1.pgf

Affymetrix Power Tools (APT) (Inferred with models/gemini-2.5-flash) vNot specified

$ Bash example

# Install Affymetrix Power Tools (APT) if not already installed
# For example, using conda:
# conda create -n apt_env -c bioconda affymetrix-power-tools
# conda activate apt_env

# Define variables for clarity
PGF_FILE="HTA-2_0.r1.pgf" # Human Transcriptome Array 2.0 Probe Group File
CLF_FILE="HTA-2_0.r1.clf" # Human Transcriptome Array 2.0 Chip Layout File (inferred as companion to PGF)
OUTPUT_DIR="hta_2_0_summarized_data"
CEL_FILES="sample1.CEL sample2.CEL sample3.CEL" # Placeholder for actual input CEL files, space-separated

# Reference datasets (PGF and CLF files) are specific to the Affymetrix Human Transcriptome Array 2.0.
# These files are typically downloaded from Thermo Fisher Scientific (formerly Affymetrix) support pages for the specific array.
# Example placeholder for downloading (actual URLs may vary and require login/registration):
# wget -O "$PGF_FILE" "https://assets.thermofisher.com/TFS-Assets/LSG/support-documents/HTA-2_0.r1.pgf"
# wget -O "$CLF_FILE" "https://assets.thermofisher.com/TFS-Assets/LSG/support-documents/HTA-2_0.r1.clf"

# Create output directory if it doesn't exist
mkdir -p "$OUTPUT_DIR"

# Run probeset summarization using the RMA (Robust Multi-array Average) algorithm.
# The HTA-2_0.r1.pgf and HTA-2_0.r1.clf files provide the necessary probe definitions and chip layout.
apt-probeset-summarize \
    -a rma \
    -p "$PGF_FILE" \
    -c "$CLF_FILE" \
    -o "$OUTPUT_DIR" \
    --cel-files "$CEL_FILES"

HTA-2_0.r1.Psrs.mps

oligo (Inferred with models/gemini-2.5-flash) v2_0.r1 (Refers to the HTA array design, not the oligo package version) GitHub

$ Bash example

# This script processes Affymetrix Human Transcriptome Array 2.0 (HTA 2.0) .CEL files
# using the 'oligo' R package for background correction, normalization, and summarization (RMA).
# The "HTA-2_0.r1.Psrs.mps" description refers to the specific array design and its associated probe set definition file.
# While the .mps file defines the array, 'oligo' typically handles this information internally
# or via Bioconductor annotation packages (e.g., hta20transcriptcluster.db) when reading .CEL files.

# Define input CEL files directory and output directory
CEL_DIR="/path/to/raw_cel_files" # Placeholder: Replace with actual path to your .CEL files
OUTPUT_DIR="/path/to/output_directory" # Placeholder: Replace with desired output path
mkdir -p "${OUTPUT_DIR}"

# Install R and Bioconductor packages if not already present
# Note: These commands are commented out as they are installation steps.
# sudo apt-get update && sudo apt-get install -y r-base
# R -e "install.packages('BiocManager', repos='https://cloud.r-project.org')"
# R -e "BiocManager::install('oligo', update=FALSE, ask=FALSE)"
# R -e "BiocManager::install('hta20transcriptcluster.db', update=FALSE, ask=FALSE)" # Annotation package for HTA 2.0

# Create an R script for processing HTA 2.0 data
cat << 'EOF' > "${OUTPUT_DIR}/process_hta20_rma.R"
library(oligo)
# library(hta20transcriptcluster.db) # Optional: For adding annotations later.
                                  # Ensure this package is installed if needed.

# Set the directory containing CEL files
cel_files_dir <- "${CEL_DIR}"

# List all .CEL files in the specified directory
cel_files <- list.celfiles(path=cel_files_dir, full.names=TRUE)

if (length(cel_files) == 0) {
  stop("No .CEL files found in the specified directory: ", cel_files_dir)
}

message(paste("Found", length(cel_files), ".CEL files."))

# Read CEL files into an oligo FeatureSet object (e.g., GeneFeatureSet for HTA 2.0)
# oligo automatically detects the array type (HTA 2.0) from the CEL file headers.
raw_data <- read.celfiles(cel_files)

# Perform RMA (Robust Multi-array Average) background correction, normalization, and summarization
# This will produce a GeneFeatureSet or ExonFeatureSet with summarized expression values.
message("Performing RMA summarization...")
eset <- rma(raw_data)

# Extract the expression matrix
expression_matrix <- exprs(eset)

# Get feature names (probe set IDs or transcript cluster IDs)
feature_ids <- featureNames(eset)
rownames(expression_matrix) <- feature_ids

# Optionally, add sample names (from CEL file names)
sample_names <- sampleNames(eset)
colnames(expression_matrix) <- sample_names

# Write the expression matrix to a CSV file
output_file <- file.path("${OUTPUT_DIR}", "hta20_rma_expression.csv")
write.csv(expression_matrix, file=output_file, row.names=TRUE)

message(paste("RMA processed expression matrix saved to:", output_file))

# Example of how to add annotation (requires hta20transcriptcluster.db package)
# if (requireNamespace("hta20transcriptcluster.db", quietly = TRUE)) {
#   message("Adding annotations...")
#   annotation_data <- AnnotationDbi::select(
#     hta20transcriptcluster.db,
#     keys = feature_ids,
#     columns = c("SYMBOL", "GENENAME", "ENTREZID"),
#     keytype = "PROBEID" # Or "TRANSCRIPTCLUSTERID" depending on the array and package
#   )
#   # Merge annotation with expression data (example)
#   # annotated_expression <- merge(as.data.frame(expression_matrix), annotation_data, by.x="row.names", by.y="PROBEID", all.x=TRUE)
#   # write.csv(annotated_expression, file=file.path("${OUTPUT_DIR}", "hta20_rma_annotated_expression.csv"), row.names=FALSE)
#   message("Annotation complete (if uncommented and merged).")
# } else {
#   message("Annotation package 'hta20transcriptcluster.db' not found. Skipping annotation step.")
#   message("Install with BiocManager::install('hta20transcriptcluster.db') if needed.")
# }

EOF

# Execute the R script
Rscript "${OUTPUT_DIR}/process_hta20_rma.R"

View on GitHub

RMA probeset-level signal estimates, DABG detection flag (Absent or Present), and DABG detection p-value obtained from 'Alt Splice Analysis' performed in Affymetrix Expression Console build 1.4.1.46.

Microarray v1.4.1.46

$ Bash example

# Define input CEL files (replace with actual paths)
CEL_FILES="sample1.cel sample2.cel sample3.cel" # Example: List of input Affymetrix .CEL files

# Define output directory
OUTPUT_DIR="affymetrix_expression_console_results"
mkdir -p "${OUTPUT_DIR}"

# Reference: Affymetrix array design file (CDF) for the specific array type used.
# The specific CDF file depends on the array type (e.g., Human Exon 1.0 ST Array).
# Example: CDF_FILE="HuEx-1_0-st-v2.cdf"
# ANNOTATION_FILE="HuEx-1_0-st-v2.na36.hg19.probeset.csv" # Example annotation file for probeset mapping
# GENOME_ASSEMBLY="hg19" # Placeholder if genome-level annotation was used for downstream analysis

# This step was performed using the Affymetrix Expression Console (v1.4.1.46), a GUI-based software.
# The following describes the analysis performed within the GUI, as there is no direct command-line execution for the full workflow.
echo "Affymetrix Expression Console (v1.4.1.46) was used for 'Alt Splice Analysis'."
echo "Input CEL files: ${CEL_FILES}"
echo "Analysis performed: RMA probeset-level signal estimates, DABG detection flag, and DABG detection p-value."
echo "Results were exported to: ${OUTPUT_DIR}/"

Tools Used

Microarray

Raw Source Text

Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection. Only probesets with detection p-value â¤ 0.05 in more than half of the microarray samples were considered for downstream analysis. All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes. Studentâs t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore ( defined as the normalized change in exclusion minus the normalized change in inclusion).
HTA-2_0.r1.pgf
HTA-2_0.r1.Psrs.mps
RMA probeset-level signal estimates, DABG detection flag (Absent or Present), and DABG detection p-value obtained from 'Alt Splice Analysis' performed in Affymetrix Expression Console build 1.4.1.46.

← Back to Analysis