GSE77339 Processing Pipeline
Publication
Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nature Methods (2016), PMID 27018577
Dataset
GSE77339: Enhanced CLIP (eCLIP) enables robust and scalable transcriptome-wide discovery and characterization of RNA binding protein binding sites [array]
Processing Steps
1
Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection.
Microarray v1.4.1.46
$ Bash example
# Expression Console is proprietary GUI software from Affymetrix/Thermo Fisher Scientific.
# Installation typically involves downloading and running an installer from the vendor's website.
# The GUI has no standard command-line interface, so the command at the end of this block
# is a conceptual representation of the analysis and is left commented out.

# Input CEL files (e.g., in a directory named 'cel_files')
INPUT_CEL_FILES="cel_files/*.CEL"

# Output directory for analysis results
OUTPUT_DIR="expression_console_analysis_results"

# Create the output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Conceptual command representing the analysis performed by Expression Console
# with RMA normalization and DABG probe-level detection.
# 'ExpressionConsole_analyze' and its flags are hypothetical, not an actual executable.
# ExpressionConsole_analyze \
#   --input "${INPUT_CEL_FILES}" \
#   --normalize RMA \
#   --probe_detection DABG \
#   --output_dir "${OUTPUT_DIR}"

# Reference files (e.g., library and annotation files) are typically handled
# internally by Expression Console based on the microarray chip type.
# No specific external reference datasets were mentioned in the description.
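RMA combines background correction, quantile normalization, and median-polish summarization. As a rough illustration of the normalization step only, the sketch below quantile-normalizes a toy matrix in plain Python. It is not Expression Console's implementation: the function name `quantile_normalize` and the intensities are assumptions made for this example, and ties are broken naively by position.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize a list of equal-length sample columns.

    Each value is replaced by the mean of the values that share its rank
    across all samples. Ties are broken by position (a simplification).
    """
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all samples must have the same number of probes")

    # Mean of the k-th smallest value across samples, for every rank k
    sorted_cols = [sorted(col) for col in columns]
    rank_means = [mean(col[k] for col in sorted_cols) for k in range(n)]

    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Toy data: 3 samples x 4 probes (hypothetical intensities)
samples = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
normalized = quantile_normalize(samples)
```

After normalization every sample holds the same set of values, so differences in overall intensity distribution between arrays no longer drive downstream comparisons.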
2
Only probesets with detection p-value ≤ 0.05 in more than half of the microarray samples were considered for downstream analysis.
$ Bash example
#!/bin/bash
# This script demonstrates how to filter microarray probesets based on detection p-values
# using R, as described: "Only probesets with detection p-value ≤ 0.05 in more than half
# of the microarray samples were considered for downstream analysis."

# Ensure R is installed and Bioconductor packages (e.g., oligo or affy) are available.
# If not already installed:
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
# R -e "BiocManager::install('oligo')"  # For Affymetrix Gene/Exon ST arrays
# R -e "BiocManager::install('affy')"   # For Affymetrix 3' arrays

# Input and output file paths (replace with actual paths)
# input_p_values_file="detection_p_values.csv"      # CSV file of p-values
# input_expression_file="raw_expression_data.csv"   # Corresponding expression data
# output_filtered_expression_file="filtered_expression_data.csv"

# Create dummy input files for demonstration if they don't exist.
# In a real scenario, these would be generated by upstream microarray processing steps.
if [ ! -f detection_p_values.csv ]; then
  echo "Creating dummy detection_p_values.csv"
  R -q -e '
    num_probesets <- 1000
    num_samples <- 10
    set.seed(123)
    p_values_matrix <- matrix(runif(num_probesets * num_samples, 0, 1),
                              nrow = num_probesets, ncol = num_samples)
    # Simulate some probesets detected in more than half of the samples
    p_values_matrix[1:50, 1:6] <- runif(50 * 6, 0, 0.01)
    rownames(p_values_matrix) <- paste0("Probeset_", 1:num_probesets)
    colnames(p_values_matrix) <- paste0("Sample_", 1:num_samples)
    write.csv(p_values_matrix, "detection_p_values.csv", row.names = TRUE)
  '
fi

if [ ! -f raw_expression_data.csv ]; then
  echo "Creating dummy raw_expression_data.csv"
  R -q -e '
    num_probesets <- 1000
    num_samples <- 10
    set.seed(123)
    expression_matrix <- matrix(rnorm(num_probesets * num_samples),
                                nrow = num_probesets, ncol = num_samples)
    rownames(expression_matrix) <- paste0("Probeset_", 1:num_probesets)
    colnames(expression_matrix) <- paste0("Sample_", 1:num_samples)
    write.csv(expression_matrix, "raw_expression_data.csv", row.names = TRUE)
  '
fi

# Execute the R code for filtering
Rscript -e '
  # Define parameters
  p_value_threshold <- 0.05

  # Load detection p-values and expression data
  # Adjust read.csv parameters if your files use different delimiters or headers
  p_values_matrix <- read.csv("detection_p_values.csv", row.names = 1, check.names = FALSE)
  expression_matrix <- read.csv("raw_expression_data.csv", row.names = 1, check.names = FALSE)

  # Ensure matrices have the same probesets and samples
  if (!all(rownames(p_values_matrix) == rownames(expression_matrix))) {
    stop("Probeset names in p-value matrix and expression matrix do not match.")
  }
  if (!all(colnames(p_values_matrix) == colnames(expression_matrix))) {
    stop("Sample names in p-value matrix and expression matrix do not match.")
  }

  num_samples <- ncol(p_values_matrix)

  # Count the samples in which each probeset is detected
  detected_counts <- rowSums(p_values_matrix <= p_value_threshold)

  # "More than half" is a strict inequality: with 10 samples, at least 6 must pass
  probesets_to_keep <- names(detected_counts)[detected_counts > num_samples / 2]

  # Apply filtering to the expression matrix
  filtered_expression_matrix <- expression_matrix[probesets_to_keep, , drop = FALSE]

  # Save filtered data
  write.csv(filtered_expression_matrix, "filtered_expression_data.csv", row.names = TRUE)
  message(paste("Original probesets:", nrow(expression_matrix)))
  message(paste("Filtered probesets:", nrow(filtered_expression_matrix)))
  message("Filtered data saved to filtered_expression_data.csv")
'
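The detection rule in this step is small enough to state directly. The sketch below is one way to express it in plain Python; the function name `filter_detected`, the probeset IDs, and the p-values are made up for illustration. Note that "more than half" is a strict inequality, so a probeset detected in exactly half of the samples is dropped.

```python
def filter_detected(pvalues, threshold=0.05):
    """Return probeset IDs detected (p <= threshold) in more than half of samples.

    `pvalues` maps probeset ID -> list of per-sample detection p-values.
    """
    kept = []
    for probeset, values in pvalues.items():
        detected = sum(1 for p in values if p <= threshold)
        # Strict inequality: exactly half of the samples is not enough
        if detected > len(values) / 2:
            kept.append(probeset)
    return kept


# Hypothetical detection p-values for four samples
pvalues = {
    "PS_A": [0.01, 0.02, 0.03, 0.20],  # 3 of 4 detected, kept
    "PS_B": [0.01, 0.02, 0.30, 0.20],  # 2 of 4 (exactly half), dropped
    "PS_C": [0.50, 0.60, 0.04, 0.70],  # 1 of 4, dropped
    "PS_D": [0.05, 0.05, 0.05, 0.05],  # boundary value counts as detected
}
kept = filter_detected(pvalues)
```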
3
All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes.
Microarray (version N/A)
$ Bash example
# This step describes a custom data processing procedure for microarray data,
# specifically for probes targeting cassette exons. It involves identifying
# specific probe types and normalizing their signals against the average signal
# of their respective genes to account for overall gene expression level changes.

# No specific tool is explicitly mentioned in the description. This type of
# analysis is typically performed using custom scripts in R or Python,
# leveraging bioinformatics packages (e.g., limma, oligo in R).

# Reference data:
# - Microarray probe annotation file: This file would map probe IDs to their
#   genomic locations, gene IDs, and their type (e.g., exclusion junction,
#   inclusion junction, exonic for cassette exons). This is specific to the
#   microarray platform used.
# - Gene annotation file: A GTF/GFF file (e.g., for human genome GRCh38)
#   would be used to define gene boundaries and potentially identify
#   cassette exon regions if not directly provided by the probe annotation.
#   Example for human GRCh38:
#   GENE_ANNOTATION_GTF="ftp://ftp.ensembl.org/pub/release-110/gtf/homo_sapiens/Homo_sapiens.GRCh38.110.gtf.gz"

# Input data:
# - RAW_MICROARRAY_SIGNAL_MATRIX="path/to/raw_probe_signal_matrix.tsv"
#   (e.g., a tab-separated file with ProbeID, GeneID, and SignalValue columns)
# - PROBE_ANNOTATION_FILE="path/to/microarray_probe_annotations.tsv"
#   (e.g., a tab-separated file with ProbeID and ProbeType columns, where ProbeType
#   indicates 'exclusion_junction', 'upstream_inclusion_junction', etc.)

# Output data:
# - NORMALIZED_CASS_EXON_SIGNALS="normalized_cassette_exon_signals.tsv"

# Conceptual execution command (representing a custom R or Python script):
# This command is illustrative and assumes a script 'normalize_cassette_exons.R'
# or 'normalize_cassette_exons.py' that performs the described operations.
#
# Rscript normalize_cassette_exons.R \
#   --input_signals "${RAW_MICROARRAY_SIGNAL_MATRIX}" \
#   --probe_annotations "${PROBE_ANNOTATION_FILE}" \
#   --output_file "${NORMALIZED_CASS_EXON_SIGNALS}" \
#   --gene_annotation_gtf "${GENE_ANNOTATION_GTF}"  # Optional, if probe annotations are not comprehensive
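To make the per-gene normalization concrete, here is a minimal sketch under two assumptions that the source does not state: signals are on a log2 scale, and "normalized against the average signal" means subtracting the gene's mean signal in each sample. The function name `per_gene_residuals`, the probe names, and the values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def per_gene_residuals(signals, probe_to_gene):
    """Subtract each gene's mean signal, per sample, from its probes.

    signals: probe ID -> list of log2 signals (one per sample)
    probe_to_gene: probe ID -> gene ID
    Returns probe ID -> list of residuals.
    """
    by_gene = defaultdict(list)
    for probe, gene in probe_to_gene.items():
        by_gene[gene].append(probe)

    residuals = {}
    for gene, probes in by_gene.items():
        n_samples = len(signals[probes[0]])
        # Average signal of the gene in each sample
        gene_mean = [mean(signals[p][s] for p in probes) for s in range(n_samples)]
        for p in probes:
            residuals[p] = [signals[p][s] - gene_mean[s] for s in range(n_samples)]
    return residuals


# One gene, three probes, two samples; sample 2 has uniformly higher expression
signals = {
    "incl_exon": [8.0, 9.0],
    "incl_junc": [7.0, 8.0],
    "excl_junc": [6.0, 7.0],
}
probe_to_gene = {probe: "GENE1" for probe in signals}
residuals = per_gene_residuals(signals, probe_to_gene)
```

In this toy case every probe rises by one log2 unit in the second sample, so the residuals are identical across samples: the expression change is removed and no splicing change remains.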
4
Student's t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore (defined as the normalized change in exclusion minus the normalized change in inclusion).
SepScore (inferred with models/gemini-2.5-flash) v1.0
$ Bash example
# SepScore is defined in the text as a metric (normalized change in exclusion minus
# normalized change in inclusion). The tool name, version, repository URL, script name,
# and flags below were inferred and are unverified; treat them as placeholders.

# Install SepScore (assuming a Python environment with pandas, numpy, and scipy)
# git clone https://github.com/yeolab/SepScore.git
# cd SepScore
# pip install .

# Example usage of sep_score.py for calculating SepScore and performing statistical tests.
# Input files (placeholders):
#   inclusion_counts.tsv: tab-separated file with gene/event IDs and inclusion probe values across samples.
#   exclusion_counts.tsv: tab-separated file with gene/event IDs and exclusion probe values across samples.
#   metadata.tsv: tab-separated file with sample IDs and experimental conditions
#     (e.g., 'sample_id\tcondition\nS1\tcontrol\nS2\ttreatment').
# For this dataset, the inputs would come from the normalized microarray probe signals
# of the previous step. No reference genome is needed for the SepScore calculation itself.
python sep_score.py \
  -i inclusion_counts.tsv \
  -e exclusion_counts.tsv \
  -m metadata.tsv \
  -o splicing_analysis_results
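Because SepScore is defined by a formula, it can be sketched without any dedicated tool. The example below computes a two-sample Student t statistic (pooled variance) for inclusion and exclusion residuals separately, then the SepScore as the change in exclusion minus the change in inclusion. How the original analysis computed the "normalized change" is not specified, so the difference of group means used here is an assumption, as are the function names, the `ctrl`/`test` group labels, and the values. A p-value would additionally need the t distribution (for instance `scipy.stats.ttest_ind`), which is left out to keep the sketch dependency-free.

```python
from math import sqrt
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student t statistic with pooled variance; returns (t, df)."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / sqrt(pooled * (1 / na + 1 / nb))
    return t, df


def sep_score(incl_ctrl, incl_test, excl_ctrl, excl_test):
    """Change in exclusion minus change in inclusion (test vs control)."""
    d_incl = mean(incl_test) - mean(incl_ctrl)
    d_excl = mean(excl_test) - mean(excl_ctrl)
    return d_excl - d_incl


# Hypothetical per-gene-normalized residuals for one cassette exon
incl_ctrl = [1.0, 1.2, 0.8]
incl_test = [0.0, 0.2, -0.2]
excl_ctrl = [-1.0, -0.8, -1.2]
excl_test = [0.0, 0.2, -0.2]

t_incl, df = student_t(incl_test, incl_ctrl)  # inclusion probes tested on their own
t_excl, _ = student_t(excl_test, excl_ctrl)   # exclusion probes tested on their own
score = sep_score(incl_ctrl, incl_test, excl_ctrl, excl_test)
```

With this sign convention a positive score means exclusion signal rose while inclusion signal fell, that is, more exon skipping in the test condition.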
5
HTA-2_0.r1.pgf
oligo (R package) (inferred with models/gemini-2.5-flash), version not specified (the '2_0.r1' refers to the Human Transcriptome Array 2.0 revision 1, not the oligo package version)
$ Bash example
# HTA-2_0.r1.pgf is the Probe Group File (a library file) for the HTA 2.0 array.
# Expression Console reads it directly; the oligo workflow below is an inferred
# alternative that takes the probe layout from the pd.hta.2.0 package instead.

# Install R if not already installed (example for Ubuntu/Debian)
# sudo apt update
# sudo apt install r-base

# Install BiocManager and required R packages
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install(c("oligo", "pd.hta.2.0", "hta20transcriptcluster.db"))'

# Define input and output paths; exported so the R code can read them with Sys.getenv()
export CEL_FILES_DIR="."  # Adjust this to the directory containing your CEL files
export OUTPUT_DIR="hta_expression_results_oligo"
mkdir -p "${OUTPUT_DIR}"

# R script to process CEL files using oligo
R_SCRIPT_CONTENT='
library(oligo)
library(pd.hta.2.0)                 # Platform design package for HTA 2.0
library(hta20transcriptcluster.db)  # Annotation package for HTA 2.0

cel_dir <- Sys.getenv("CEL_FILES_DIR")
out_dir <- Sys.getenv("OUTPUT_DIR")

# List CEL files from the specified directory
cel_files <- list.celfiles(path = cel_dir, full.names = TRUE)
if (length(cel_files) == 0) {
  stop("No CEL files found in the specified directory: ", cel_dir)
}

# Read CEL files
rawData <- read.celfiles(cel_files)

# Background correction, normalization, and summarization (RMA)
eset <- rma(rawData)

# Get expression matrix
expr_matrix <- exprs(eset)

# Save results
out_file <- file.path(out_dir, "hta_rma_expression.csv")
write.csv(expr_matrix, file = out_file, row.names = TRUE)
message("RMA expression matrix saved to: ", out_file)
'

# Execute the R script
Rscript -e "${R_SCRIPT_CONTENT}"
6
HTA-2_0.r1.Psrs.mps
$ Bash example
# "HTA-2_0.r1.Psrs.mps" appears to be an Affymetrix library file for the Human Transcriptome
# Array (HTA 2.0). 'HTA-2_0.r1' is the array design revision, and the .mps extension denotes
# a meta-probeset file, here most likely for probe selection regions (PSRs).
# Expression Console reads this file directly. The 'oligo' R package does not use it,
# so a generic 'oligo' RMA workflow is shown as an approximation.

# Installation of the 'oligo' R package and relevant annotation database:
# R
# if (!requireNamespace("BiocManager", quietly = TRUE))
#   install.packages("BiocManager")
# BiocManager::install("oligo")
# BiocManager::install("hta20transcriptcluster.db")  # Annotation package for HTA 2.0

# Placeholders for input CEL files and the output expression matrix:
# replace "path/to/cel_files/" with the actual directory containing .CEL files, and
# "output_expression_matrix.tsv" with the desired output file name.

# Example R code to process HTA 2.0 data using 'oligo':
Rscript -e '
library(oligo)
library(hta20transcriptcluster.db)

# Define path to CEL files (replace with actual path)
cel_path <- "path/to/cel_files/"
cel_files <- list.files(cel_path, pattern = "\\.CEL$", full.names = TRUE)

# Read CEL files
rawData <- read.celfiles(cel_files)

# Perform RMA background correction, normalization, and summarization
eset <- rma(rawData)

# Extract expression matrix
expr_matrix <- exprs(eset)

# Add annotation (optional, but common for interpretation)
# probe_ids <- rownames(expr_matrix)
# annotation <- select(hta20transcriptcluster.db, keys = probe_ids,
#                      columns = c("SYMBOL", "GENENAME"), keytype = "PROBEID")
# Merge annotation with the expression matrix, e.g., by probe ID

# Write expression matrix to file
write.table(expr_matrix, file = "output_expression_matrix.tsv", sep = "\t",
            quote = FALSE, row.names = TRUE)
'
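For readers who want to inspect a meta-probeset file rather than pass it to a tool, the sketch below parses one into a dictionary. It assumes the commonly described layout: metadata lines starting with `#`, then a tab-separated table whose `probeset_id` column names the meta-probeset and whose `probeset_list` column holds space-separated member IDs. That layout, the function name `read_mps`, and the IDs in the toy input are assumptions and should be checked against the real HTA-2_0.r1.Psrs.mps file.

```python
import csv
import io


def read_mps(handle):
    """Map each meta-probeset ID to its list of member probeset IDs."""
    # Skip comment/metadata lines that start with '#'
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    return {row["probeset_id"]: row["probeset_list"].split() for row in reader}


# Toy stand-in for an MPS file (hypothetical IDs and header line)
toy_mps = io.StringIO(
    "#%chip_type=HTA-2_0\n"
    "probeset_id\ttranscript_cluster_id\tprobeset_list\tprobe_count\n"
    "PSR01000001\tTC01000001\tPSR01000001\t4\n"
    "TC01000002\tTC01000002\tPSR01000005 PSR01000006\t8\n"
)
mapping = read_mps(toy_mps)
```

With a real file, `open(path, newline="")` would replace the `io.StringIO` object.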
7
RMA probeset-level signal estimates, DABG detection flag (Absent or Present), and DABG detection p-value obtained from 'Alt Splice Analysis' performed in Affymetrix Expression Console build 1.4.1.46.
Microarray v1.4.1.46
$ Bash example
# Affymetrix Expression Console is primarily a GUI application. It has no standard
# command-line interface for 'Alt Splice Analysis', so the command at the end of this
# block is a conceptual representation of the GUI steps and is left commented out.

# Input CEL files (raw microarray data)
# Replace with actual CEL file paths, e.g., "/path/to/your/cel_files/*.CEL"
CEL_FILES_PATH="/path/to/your/cel_files/*.CEL"

# Output directory for results
OUTPUT_DIR="affymetrix_expression_console_results"
mkdir -p "${OUTPUT_DIR}"

# Library file specific to the array type, needed for probeset definition and mapping.
# For this dataset, the library files named in steps 5 and 6 are HTA-2_0.r1.pgf and
# HTA-2_0.r1.Psrs.mps. Replace with the correct file for your array type.
REFERENCE_LIBRARY_FILE="HTA-2_0.r1.pgf"

# In practice, these steps are performed interactively within the Expression Console GUI.
# The GUI workflow typically involves:
# 1. Loading CEL files.
# 2. Selecting the appropriate array type and library/annotation files.
# 3. Choosing "Alt Splice Analysis".
# 4. Specifying normalization (RMA) and DABG detection options.
# 5. Running the analysis and saving results.

# Conceptual command representing Alt Splice Analysis within Expression Console.
# 'affymetrix_expression_console_run_analysis' and its flags are hypothetical,
# not an actual command-line executable.
# affymetrix_expression_console_run_analysis \
#   --input-cel-files "${CEL_FILES_PATH}" \
#   --output-directory "${OUTPUT_DIR}" \
#   --analysis-type "Alt Splice Analysis" \
#   --normalization-method "RMA" \
#   --dabg-detection "true" \
#   --annotation-file "${REFERENCE_LIBRARY_FILE}" \
#   --tool-version "1.4.1.46"
Tools Used
Raw Source Text
Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection. Only probesets with detection p-value ≤ 0.05 in more than half of the microarray samples were considered for downstream analysis. All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes. Student's t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore (defined as the normalized change in exclusion minus the normalized change in inclusion). HTA-2_0.r1.pgf HTA-2_0.r1.Psrs.mps RMA probeset-level signal estimates, DABG detection flag (Absent or Present), and DABG detection p-value obtained from 'Alt Splice Analysis' performed in Affymetrix Expression Console build 1.4.1.46.