GSE77339 Processing Pipeline

Publication

Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP).

Nature methods (2016) — PMID 27018577

Dataset

Enhanced CLIP (eCLIP) enables robust and scalable transcriptome-wide discovery and characterization of RNA binding protein binding sites [array]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection.

Microarray v1.4.1.46

$ Bash example

# Expression Console is a proprietary GUI software from Affymetrix/Thermo Fisher Scientific.
# Installation typically involves downloading and running an installer from the vendor's website.
# A direct command-line execution with parameters is not standard for this GUI tool.
# The following command is a conceptual representation of the analysis performed.

# Input CEL files (e.g., in a directory named 'cel_files')
INPUT_CEL_FILES="cel_files/*.CEL"

# Output directory for analysis results
OUTPUT_DIR="expression_console_analysis_results"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Conceptual command representing the analysis performed by Expression Console
# with RMA normalization and DABG probe-level detection.
# Note: This is not an actual executable command but a representation of the process.
ExpressionConsole_analyze \
    --input "${INPUT_CEL_FILES}" \
    --normalize RMA \
    --probe_detection DABG \
    --output_dir "${OUTPUT_DIR}"

# Library files for the array are normally selected inside Expression Console
# according to the chip type. The source lists HTA-2_0.r1.pgf and
# HTA-2_0.r1.Psrs.mps, which are library files for the Human Transcriptome Array 2.0.

Only probesets with detection p-value ≤ 0.05 in more than half of the microarray samples were considered for downstream analysis.

Microarray vNot specified

$ Bash example

#!/bin/bash

# This script demonstrates how to filter microarray probesets based on detection p-values
# using R, as described: "Only probesets with detection p-value ≤ 0.05 in more than half
# of the microarray samples were considered for downstream analysis."

# Ensure R is installed and Bioconductor packages (e.g., oligo or affy) are available
# if not already installed:
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
# R -e "BiocManager::install('oligo')" # For Affymetrix Gene/Exon ST arrays
# R -e "BiocManager::install('affy')"  # For Affymetrix 3' arrays

# Define input and output file paths (replace with actual paths)
# input_p_values_file="detection_p_values.csv" # CSV or tab-separated file of p-values
# input_expression_file="raw_expression_data.csv" # Corresponding expression data
# output_filtered_expression_file="filtered_expression_data.csv"

# Create dummy input files for demonstration if they don't exist
# In a real scenario, these would be generated by upstream microarray processing steps
if [ ! -f detection_p_values.csv ]; then
    echo "Creating dummy detection_p_values.csv"
    R -q -e '
        num_probesets <- 1000
        num_samples <- 10
        set.seed(123)
        p_values_matrix <- matrix(runif(num_probesets * num_samples, 0, 1), nrow = num_probesets, ncol = num_samples)
        p_values_matrix[1:50, 1:6] <- runif(50*6, 0, 0.01) # Simulate some probesets detected in > half samples
        rownames(p_values_matrix) <- paste0("Probeset_", 1:num_probesets)
        colnames(p_values_matrix) <- paste0("Sample_", 1:num_samples)
        write.csv(p_values_matrix, "detection_p_values.csv", row.names = TRUE)
    '
fi

if [ ! -f raw_expression_data.csv ]; then
    echo "Creating dummy raw_expression_data.csv"
    R -q -e '
        num_probesets <- 1000
        num_samples <- 10
        set.seed(123)
        expression_matrix <- matrix(rnorm(num_probesets * num_samples), nrow = num_probesets, ncol = num_samples)
        rownames(expression_matrix) <- paste0("Probeset_", 1:num_probesets)
        colnames(expression_matrix) <- paste0("Sample_", 1:num_samples)
        write.csv(expression_matrix, "raw_expression_data.csv", row.names = TRUE)
    '
fi

# Execute the R script for filtering
Rscript -e '
    # Define parameters
    p_value_threshold <- 0.05
    
    # Load detection p-values and expression data
    # Adjust read.csv parameters if your files use different delimiters or headers
    p_values_matrix <- read.csv("detection_p_values.csv", row.names = 1, check.names = FALSE)
    expression_matrix <- read.csv("raw_expression_data.csv", row.names = 1, check.names = FALSE)
    
    # Ensure matrices have the same probesets and samples
    if (!all(rownames(p_values_matrix) == rownames(expression_matrix))) {
        stop("Probeset names in p-value matrix and expression matrix do not match.")
    }
    if (!all(colnames(p_values_matrix) == colnames(expression_matrix))) {
        stop("Sample names in p-value matrix and expression matrix do not match.")
    }

    num_samples <- ncol(p_values_matrix)
    min_samples_detected <- floor(num_samples / 2) + 1 # strictly "more than half"

    # Identify probesets meeting the criteria
    detected_counts <- rowSums(p_values_matrix <= p_value_threshold)
    probesets_to_keep <- names(detected_counts[detected_counts >= min_samples_detected])

    # Apply filtering to the expression matrix
    filtered_expression_matrix <- expression_matrix[probesets_to_keep, ]

    # Save filtered data
    write.csv(filtered_expression_matrix, "filtered_expression_data.csv", row.names = TRUE)
    
    message(paste("Original probesets:", nrow(expression_matrix)))
    message(paste("Filtered probesets:", nrow(filtered_expression_matrix)))
    message(paste("Filtered data saved to filtered_expression_data.csv"))
'

All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes.

Microarray vN/A

$ Bash example

# This step describes a custom data processing procedure for microarray data,
# specifically for probes targeting cassette exons. It involves identifying
# specific probe types and normalizing their signals against the average signal
# of their respective genes to account for overall gene expression level changes.

# No specific tool is explicitly mentioned in the description. This type of
# analysis is typically performed using custom scripts in R or Python,
# leveraging bioinformatics packages (e.g., limma, oligo in R).

# Reference data:
# - Microarray probe annotation file: This file would map probe IDs to their
#   genomic locations, gene IDs, and their type (e.g., exclusion junction,
#   inclusion junction, exonic for cassette exons). This is specific to the
#   microarray platform used.
# - Gene annotation file: A GTF/GFF file (e.g., for human genome GRCh38)
#   would be used to define gene boundaries and potentially identify
#   cassette exon regions if not directly provided by the probe annotation.
#   Example for human GRCh38:
#   GENE_ANNOTATION_GTF="ftp://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz"

# Input data:
# - RAW_MICROARRAY_SIGNAL_MATRIX="path/to/raw_probe_signal_matrix.tsv"
#   (e.g., a tab-separated file with ProbeID, GeneID, and SignalValue columns)
# - PROBE_ANNOTATION_FILE="path/to/microarray_probe_annotations.tsv"
#   (e.g., a tab-separated file with ProbeID and ProbeType columns, where ProbeType
#   indicates 'exclusion_junction', 'upstream_inclusion_junction', etc.)

# Output data:
# - NORMALIZED_CASS_EXON_SIGNALS="normalized_cassette_exon_signals.tsv"

# Conceptual execution command (representing a custom R or Python script):
# This command is illustrative and assumes a script 'normalize_cassette_exons.R'
# or 'normalize_cassette_exons.py' that performs the described operations.
#
# Rscript normalize_cassette_exons.R \
#   --input_signals "${RAW_MICROARRAY_SIGNAL_MATRIX}" \
#   --probe_annotations "${PROBE_ANNOTATION_FILE}" \
#   --output_file "${NORMALIZED_CASS_EXON_SIGNALS}" \
#   --gene_annotation_gtf "${GENE_ANNOTATION_GTF}" # Optional, if probe annotations are not comprehensive
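
The per-gene normalization described in this step can be sketched with a short awk function. This is a minimal illustration and not the authors' script. It assumes log2 probe signals in a tab-separated table with hypothetical columns probe_id, gene_id, probe_type, and one column per sample, and it assumes that "normalized against the average signal" means subtracting the per-gene, per-sample mean. The function name normalize_per_gene and the demo values are made up.

```shell
# Sketch only: column layout, function name, and demo values are assumptions.

# Subtract the per-gene, per-sample mean signal from every probe of that gene.
# Expected input: tab-separated, header row, columns
#   probe_id  gene_id  probe_type  <one log2 signal column per sample>
normalize_per_gene() {
    awk -F'\t' -v OFS='\t' '
        NR == FNR {                       # pass 1: per-gene sums and probe counts
            if (FNR > 1) {
                n[$2]++
                for (i = 4; i <= NF; i++) sum[$2, i] += $i
            }
            next
        }
        FNR == 1 { print; next }          # pass 2: copy the header unchanged
        {
            for (i = 4; i <= NF; i++) $i = sprintf("%.3f", $i - sum[$2, i] / n[$2])
            print
        }
    ' "$1" "$1"
}

# Made-up demo table with two genes and two samples.
norm_demo=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    probe_id gene_id probe_type ctrl_1 kd_1 \
    p1 GENE_A exclusion_junction 8.0 9.0 \
    p2 GENE_A upstream_inclusion_junction 6.0 5.0 \
    p3 GENE_A inclusion_exonic 7.0 7.0 \
    p4 GENE_B exclusion_junction 10.0 10.5 \
    p5 GENE_B inclusion_exonic 9.0 9.5 > "$norm_demo"

normalized=$(normalize_per_gene "$norm_demo")
rm -f "$norm_demo"
printf '%s\n' "$normalized"
```

Within each gene and sample the residuals sum to zero, so a uniform shift in gene expression cancels out while relative changes between exclusion and inclusion probes remain.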

Student's t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore (defined as the normalized change in exclusion minus the normalized change in inclusion).

SepScore (Inferred with models/gemini-2.5-flash) vv1.0

$ Bash example

# SepScore is described in the source text as a metric, not as a named software package.
# The repository URL and the sep_score.py script below are inferred placeholders and have not been verified.
# git clone https://github.com/yeolab/SepScore.git
# cd SepScore
# pip install .

# Illustrative usage of a hypothetical sep_score.py that calculates SepScore and runs the t-tests.
# Input files (placeholders):
# inclusion_counts.tsv: Tab-separated file with event IDs and gene-normalized inclusion probe signals across samples.
# exclusion_counts.tsv: Tab-separated file with event IDs and gene-normalized exclusion probe signals across samples.
# metadata.tsv: Tab-separated file with sample IDs and experimental conditions (e.g., 'sample_id\tcondition\nS1\tcontrol\nS2\ttreatment').
# These input files would come from the upstream per-gene normalization of the microarray probe signals.
# No reference genome is needed for the SepScore calculation itself.

python sep_score.py \
    -i inclusion_counts.tsv \
    -e exclusion_counts.tsv \
    -m metadata.tsv \
    -o splicing_analysis_results
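
Because the source defines SepScore only in words, the calculation can be sketched directly. The awk function below is a hedged illustration under several assumptions: the input is a long-format table with hypothetical columns event_id, probe_class, condition, and residual; the condition labels are control and knockdown; and "normalized change" is read as the difference in mean gene-normalized residual between conditions. It reports a pooled-variance Student t statistic for each probe class but does not convert it to a p-value, which would need a t-distribution function such as R's pt().

```shell
# Sketch only: column names, condition labels, and demo values are assumptions.

# Input: tab-separated, header row, columns
#   event_id  probe_class (exclusion|inclusion)  condition (control|knockdown)  residual
# Output: per-event change in mean residual, t statistic per probe class, SepScore.
sepscore() {
    printf '%s\t%s\t%s\t%s\t%s\t%s\n' \
        event_id delta_exclusion t_exclusion delta_inclusion t_inclusion sepscore
    awk -F'\t' '
        function mean(e, c, g) { return s[e, c, g] / n[e, c, g] }
        function variance(e, c, g) {
            return (ss[e, c, g] - s[e, c, g] ^ 2 / n[e, c, g]) / (n[e, c, g] - 1)
        }
        # Two-sample Student t statistic with pooled variance (needs >= 2 values per group).
        function tstat(e, c,    na, nb, sp) {
            na = n[e, c, "control"]; nb = n[e, c, "knockdown"]
            sp = ((na - 1) * variance(e, c, "control") + (nb - 1) * variance(e, c, "knockdown")) / (na + nb - 2)
            return (mean(e, c, "knockdown") - mean(e, c, "control")) / sqrt(sp * (1 / na + 1 / nb))
        }
        NR > 1 { n[$1, $2, $3]++; s[$1, $2, $3] += $4; ss[$1, $2, $3] += $4 * $4; events[$1] = 1 }
        END {
            for (e in events) {
                de = mean(e, "exclusion", "knockdown") - mean(e, "exclusion", "control")
                di = mean(e, "inclusion", "knockdown") - mean(e, "inclusion", "control")
                # SepScore: change in exclusion minus change in inclusion.
                printf "%s\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\n", e, de, tstat(e, "exclusion"), di, tstat(e, "inclusion"), de - di
            }
        }
    ' "$1" | sort
}

# Made-up residuals for one cassette-exon event, three samples per condition.
sep_demo=$(mktemp)
printf '%s\t%s\t%s\t%s\n' \
    event_id probe_class condition residual \
    E1 exclusion control 1.0   E1 exclusion control 1.2   E1 exclusion control 0.8 \
    E1 exclusion knockdown 2.0 E1 exclusion knockdown 2.2 E1 exclusion knockdown 1.8 \
    E1 inclusion control -0.5  E1 inclusion control -0.3  E1 inclusion control -0.7 \
    E1 inclusion knockdown -1.0 E1 inclusion knockdown -0.8 E1 inclusion knockdown -1.2 > "$sep_demo"

result=$(sepscore "$sep_demo")
rm -f "$sep_demo"
printf '%s\n' "$result"
```

In the demo, exclusion residuals rise by 1.0 and inclusion residuals fall by 0.5, giving a SepScore of 1.5. Under this reading, a positive score indicates a shift toward exon skipping.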

HTA-2_0.r1.pgf

oligo (R package) (Inferred with models/gemini-2.5-flash) vNot specified (The '2_0.r1' refers to the Human Transcriptome Array 2.0 revision 1, not the oligo package version.)

$ Bash example

# Install R if not already installed (example for Ubuntu/Debian)
# sudo apt update
# sudo apt install r-base

# Install BiocManager and required R packages
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install(c("oligo", "pd.hta.2.0", "hta20transcriptcluster.db"))'

# Define input and output paths
CEL_FILES_DIR="." # Adjust this to the directory containing your CEL files
OUTPUT_DIR="hta_expression_results_oligo"
mkdir -p "${OUTPUT_DIR}"

# R script to process CEL files using oligo
# This script assumes CEL files are in the current directory or specified by CEL_FILES_DIR
R_SCRIPT_CONTENT='
library(oligo)
library(pd.hta.2.0) # Platform Design package for HTA 2.0
library(hta20transcriptcluster.db) # Annotation package for HTA 2.0

# List CEL files from the specified directory
cel_files <- list.celfiles(path="'""${CEL_FILES_DIR}""'", full.names=TRUE)
if (length(cel_files) == 0) {
    stop("No CEL files found in the specified directory: '""${CEL_FILES_DIR}""'")
}

# Read CEL files
rawData <- read.celfiles(cel_files)

# Background correction, normalization, and summarization (RMA)
# This process implicitly uses the PGF/CLF information via the pd.hta.2.0 package
eset <- rma(rawData)

# Get expression matrix
expr_matrix <- exprs(eset)

# Save results
write.csv(expr_matrix, file=file.path("'""${OUTPUT_DIR}""'", "hta_rma_expression.csv"), row.names=TRUE)
message("RMA expression matrix saved to: ", file.path("'""${OUTPUT_DIR}""'", "hta_rma_expression.csv"))
'

# Execute the R script
Rscript -e "${R_SCRIPT_CONTENT}"

HTA-2_0.r1.Psrs.mps

oligo (R package) (Inferred with models/gemini-2.5-flash) v1.60.0

$ Bash example

# This step description "HTA-2_0.r1.Psrs.mps" is inferred to be related to processing Affymetrix Human Transcriptome Array (HTA 2.0) data.
# The 'oligo' R package is commonly used for background correction, normalization, and summarization of such arrays.
# 'HTA-2_0.r1' refers to the Human Transcriptome Array 2.0 library file release. A '.mps' file is an Affymetrix meta-probeset file; 'Psrs' most likely denotes probe selection regions (PSRs), though the source does not confirm this.

# Installation of 'oligo' R package and relevant annotation database:
# R
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install("oligo")
# BiocManager::install("hta20transcriptcluster.db") # Annotation package for HTA 2.0

# Placeholder for input CEL files and output expression matrix.
# Replace 'path/to/cel_files/' with the actual directory containing .CEL files.
# Replace 'output_expression_matrix.tsv' with the desired output file name.
# The 'Psrs.mps' file does not correspond to a function or argument in 'oligo', so a generic 'oligo' RMA workflow is shown.

# Example R script to process HTA 2.0 data using 'oligo':
Rscript -e "
  library(oligo)
  library(hta20transcriptcluster.db)

  # Define path to CEL files (replace with actual path)
  cel_path <- 'path/to/cel_files/'
  cel_files <- list.files(cel_path, pattern = '[.]CEL$', full.names = TRUE, ignore.case = TRUE)

  # Read CEL files
  rawData <- read.celfiles(cel_files)

  # Perform RMA background correction, normalization, and summarization
  # For HTA 2.0, 'rma' is a common summarization method.
  eset <- rma(rawData)

  # Extract expression matrix
  expr_matrix <- exprs(eset)

  # Add annotation (optional, but common for interpretation)
  # probe_ids <- rownames(expr_matrix)
  # annotation <- select(hta20transcriptcluster.db, keys=probe_ids, columns=c('SYMBOL', 'GENENAME'), keytype='PROBEID')
  # # Merge annotation with expression matrix, e.g., by probe_id

  # Write expression matrix to file
  write.table(expr_matrix, file='output_expression_matrix.tsv', sep='\t', quote=FALSE, row.names=TRUE)
"

RMA probeset-level signal estimates, DABG detection flag (Absent or Present), and DABG detection p-value obtained from 'Alt Splice Analysis' performed in Affymetrix Expression Console build 1.4.1.46.

Microarray v1.4.1.46

$ Bash example

# Affymetrix Expression Console is primarily a GUI application. The following command is a conceptual representation of the analysis steps performed within the GUI, as a direct command-line interface for 'Alt Splice Analysis' in Expression Console is not standard.

# Input CEL files (raw microarray data)
# Replace with actual CEL file paths, e.g., "/path/to/your/cel_files/*.CEL"
CEL_FILES_PATH="/path/to/your/cel_files/*.CEL"

# Output directory for results
OUTPUT_DIR="affymetrix_expression_console_results"
mkdir -p "${OUTPUT_DIR}"

# Library files specific to the array type define the probesets.
# For this dataset the source lists HTA-2_0.r1.pgf and HTA-2_0.r1.Psrs.mps
# (Human Transcriptome Array 2.0).
# Replace with the library files for your own array type if it differs.
REFERENCE_ANNOTATION_FILE="HTA-2_0.r1.pgf"

# Conceptual command representing the execution of Alt Splice Analysis within Expression Console.
# This command is illustrative and does not represent an actual command-line executable.
# In practice, these steps are performed interactively within the Expression Console GUI.
# The GUI workflow typically involves:
# 1. Loading CEL files.
# 2. Selecting the appropriate array type and library files (here HTA-2_0.r1.pgf and HTA-2_0.r1.Psrs.mps).
# 3. Choosing "Alt Splice Analysis".
# 4. Specifying normalization (RMA) and DABG detection options.
# 5. Running the analysis and saving results.
affymetrix_expression_console_run_analysis \
--input-cel-files "${CEL_FILES_PATH}" \
--output-directory "${OUTPUT_DIR}" \
--analysis-type "Alt Splice Analysis" \
--normalization-method "RMA" \
--dabg-detection "true" \
--annotation-file "${REFERENCE_ANNOTATION_FILE}" \
--tool-version "1.4.1.46"
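
The relationship between the DABG p-values and the Absent/Present flags reported in this step can be illustrated with a small awk sketch. It is not Expression Console's implementation: the 0.05 cutoff is borrowed from the filtering rule quoted earlier in the source, since the cutoff used for the exported flag is not stated, and the table layout (probeset_id followed by one p-value column per sample) and the function name dabg_flags are hypothetical.

```shell
# Sketch only: table layout, flag cutoff, and demo values are assumptions.

# Convert detection p-values to P (present) or A (absent) flags and mark
# probesets that are present in strictly more than half of the samples.
# Usage: dabg_flags <table.tsv> [cutoff, default 0.05]
dabg_flags() {
    awk -F'\t' -v OFS='\t' -v alpha="${2:-0.05}" '
        NR == 1 { print $0, "n_present", "keep"; next }
        {
            n_samples = NF - 1
            present = 0
            for (i = 2; i <= NF; i++) {
                if ($i + 0 <= alpha + 0) { $i = "P"; present++ } else { $i = "A" }
            }
            keep = (present > n_samples / 2) ? "yes" : "no"
            print $0, present, keep
        }
    ' "$1"
}

# Made-up p-values for three probesets across four samples.
dabg_demo=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\n' \
    probeset_id s1 s2 s3 s4 \
    ps1 0.001 0.020 0.050 0.300 \
    ps2 0.010 0.040 0.200 0.600 \
    ps3 0.500 0.700 0.030 0.900 > "$dabg_demo"

flags=$(dabg_flags "$dabg_demo")
rm -f "$dabg_demo"
printf '%s\n' "$flags"
```

With four samples, ps2 is present in exactly two and is therefore not kept, which shows why "more than half" has to be a strict comparison.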

Tools Used

Microarray

Raw Source Text

Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection. Only probesets with detection p-value ≤ 0.05 in more than half of the microarray samples were considered for downstream analysis. All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes. Student's t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore (defined as the normalized change in exclusion minus the normalized change in inclusion).
HTA-2_0.r1.pgf
HTA-2_0.r1.Psrs.mps
RMA probeset-level signal estimates, DABG detection flag (Absent or Present), and DABG detection p-value obtained from 'Alt Splice Analysis' performed in Affymetrix Expression Console build 1.4.1.46.
