GSE77634 Processing Pipeline
Publication
Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP).Nature methods (2016) — PMID 27018577
Dataset
GSE77634Enhanced CLIP (eCLIP) enables robust and scalable transcriptome-wide discovery and characterization of RNA binding protein binding sites
Processing Steps
Generate Jupyter Notebook-
1
Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection.
Microarray v1.4.1.46$ Bash example
# Expression Console is a GUI-based software. The following bash script uses Affymetrix Power Tools (APT), a command-line suite that provides similar functionality for processing CEL files with RMA normalization and DABG detection. # Install Affymetrix Power Tools (APT) if not already installed. # APT is proprietary software from Thermo Fisher Scientific (formerly Affymetrix). # Installation typically involves downloading from their website and setting up the PATH. # Example (conceptual, actual installation may vary based on OS and APT version): # wget https://assets.thermofisher.com/TFS-Assets/LSG/software/APT_2.11.2_Linux_x86_64.zip # unzip APT_2.11.2_Linux_x86_64.zip -d /opt/apt # export PATH=$PATH:/opt/apt/bin # Define input CEL files and output directory # Replace 'sample*.CEL' with the actual paths to your CEL files. # Replace '/path/to/your_array.cdf' with the path to the appropriate CDF file for your microarray type. CEL_FILES="sample1.CEL sample2.CEL sample3.CEL" # Example: space-separated list of CEL files OUTPUT_DIR="expression_console_analysis_output" CDF_FILE="/path/to/your_array.cdf" # e.g., /opt/apt/library-files/HuGene-1_0-st-v1.cdf mkdir -p "${OUTPUT_DIR}" # Execute RMA normalization and DABG probe-level detection using apt-probeset-summarize. # This command conceptually replicates the core analytical steps described for Expression Console. # --dabg-pvalue-threshold 0.05 is a common default, adjust if a specific threshold was used. apt-probeset-summarize \ -a rma \ -p "${CDF_FILE}" \ -o "${OUTPUT_DIR}" \ -c "${CEL_FILES}" \ --log-file "${OUTPUT_DIR}/apt_log.txt" \ --dabg \ --dabg-pvalue-threshold 0.05 -
2
Only probesets with detection p-value ⤠0.05 in more than half of the microarray samples were considered for downstream analysis.
$ Bash example
#!/bin/bash # Define input and output files INPUT_PVALUES="detection_pvalues.tsv" # This file should contain detection p-values from upstream processing OUTPUT_FILTERED_PROBESETS="filtered_probesets.tsv" # Define parameters P_VALUE_THRESHOLD=0.05 # --- Installation (commented out) --- # # Install R if not available # # sudo apt-get update && sudo apt-get install -y r-base # --- Create a dummy input file for demonstration if it doesn't exist --- # In a real pipeline, this file would be generated by an upstream step (e.g., MAS5 processing of Affymetrix CEL files) if [ ! -f "$INPUT_PVALUES" ]; then echo "Creating dummy input file: $INPUT_PVALUES" echo -e "ProbesetID\tSample1_Pval\tSample2_Pval\tSample3_Pval\tSample4_Pval\tSample5_Pval\tSample6_Pval\tSample7_Pval\tSample8_Pval\tSample9_Pval\tSample10_Pval" > "$INPUT_PVALUES" for i in $(seq 1 1000); do PROBESET_ID="Probeset_$i" PVALS="" for j in $(seq 1 10); do # Generate random p-values, some below 0.05, some above RANDOM_PVAL=$(awk -v seed=$RANDOM 'BEGIN{srand(seed); print rand()}' | cut -c 3-7) PVALS="$PVALS\t0.$RANDOM_PVAL" done echo -e "$PROBESET_ID$PVALS" >> "$INPUT_PVALUES" done fi # --- R script for filtering --- R_SCRIPT=$(cat <<EOF # Load necessary libraries (base R functions are sufficient for this logic) # --- Configuration --- P_VALUE_THRESHOLD <- ${P_VALUE_THRESHOLD} input_file <- "${INPUT_PVALUES}" output_file <- "${OUTPUT_FILTERED_PROBESETS}" # --- Load data --- if (!file.exists(input_file)) { stop(paste("Input file not found:", input_file)) } detection_pvalues_df <- read.delim(input_file, header = TRUE, row.names = 1) detection_pvalues_matrix <- as.matrix(detection_pvalues_df) num_samples <- ncol(detection_pvalues_matrix) # "more than half" means floor(N/2) + 1 min_samples_detected <- floor(num_samples / 2) + 1 # Apply filtering # Count for each probeset how many samples have p-value <= P_VALUE_THRESHOLD probeset_detection_counts <- rowSums(detection_pvalues_matrix <= P_VALUE_THRESHOLD) # Identify probesets that meet the criteria filtered_probeset_indices <- which(probeset_detection_counts >= min_samples_detected) filtered_probeset_ids <- rownames(detection_pvalues_matrix)[filtered_probeset_indices] # Output the list of filtered probeset IDs write.table(data.frame(ProbesetID = filtered_probeset_ids), output_file, sep = "\t", row.names = FALSE, quote = FALSE) message(paste("Filtered probesets written to:", output_file)) message(paste("Number of probesets before filtering:", nrow(detection_pvalues_matrix))) message(paste("Number of probesets after filtering:", length(filtered_probeset_ids))) EOF ) # Execute the R script Rscript -e "$R_SCRIPT" echo "Filtering complete. Filtered probeset IDs are in $OUTPUT_FILTERED_PROBESETS" -
3
All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes.
$ Bash example
# Install R and Bioconductor packages if not already present # conda install -c conda-forge r-base # conda install -c bioconda bioconductor-limma # limma is a common package for microarray analysis, though the described normalization is custom. # Create a dummy raw intensity file (replace with actual data) # This file represents the probe intensities from the microarray. echo "ProbeID\tSample_1\tSample_2\tSample_3" > raw_intensities.tsv echo "Probe_1\t10.2\t11.5\t9.8" >> raw_intensities.tsv echo "Probe_2\t8.5\t9.1\t8.9" >> raw_intensities.tsv echo "Probe_3\t12.1\t13.0\t11.9" >> raw_intensities.tsv echo "Probe_4\t7.8\t8.2\t7.5" >> raw_intensities.tsv echo "Probe_5\t15.0\t14.5\t15.2" >> raw_intensities.tsv echo "Probe_6\t9.5\t10.0\t9.7" >> raw_intensities.tsv echo "Probe_7\t11.0\t11.2\t10.8" >> raw_intensities.tsv echo "Probe_8\t6.5\t7.0\t6.8" >> raw_intensities.tsv echo "Probe_9\t13.5\t14.0\t13.2" >> raw_intensities.tsv echo "Probe_10\t10.0\t10.5\t9.9" >> raw_intensities.tsv # Create a dummy probe annotation file (replace with actual data) # This file maps probes to genes and their specific type related to cassette exons. echo "ProbeID\tGeneID\tExonType" > probe_annotation.tsv echo "Probe_1\tGeneA\tinclusion_exonic" >> probe_annotation.tsv echo "Probe_2\tGeneA\texclusion_junction" >> probe_annotation.tsv echo "Probe_3\tGeneB\tupstream_inclusion_junction" >> probe_annotation.tsv echo "Probe_4\tGeneB\tdownstream_inclusion_junction" >> probe_annotation.tsv echo "Probe_5\tGeneC\tinclusion_exonic" >> probe_annotation.tsv echo "Probe_6\tGeneC\texclusion_junction" >> probe_annotation.tsv echo "Probe_7\tGeneD\tupstream_inclusion_junction" >> probe_annotation.tsv echo "Probe_8\tGeneD\tdownstream_inclusion_junction" >> probe_annotation.tsv echo "Probe_9\tGeneE\tinclusion_exonic" >> probe_annotation.tsv echo "Probe_10\tGeneE\texclusion_junction" >> probe_annotation.tsv # Create the R script for custom normalization cat << 'EOF' > normalize_microarray.R # --- Input parameters --- input_intensities_file <- "raw_intensities.tsv" input_annotation_file <- "probe_annotation.tsv" output_normalized_file <- "normalized_cassette_exon_probes.tsv" # --- Reference datasets --- # Reference genome: hg38 (inferred as latest human assembly for gene mapping) # Probe annotation source: Microarray platform provider (e.g., Affymetrix, Agilent) specific to the array design. # --- Read input data --- probe_intensities <- read.delim(input_intensities_file, row.names = 1, sep = "\t", check.names = FALSE) probe_annotation <- read.delim(input_annotation_file, sep = "\t", check.names = FALSE) # --- Step 1: Identify probes corresponding to cassette exons --- # The description specifies: "exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes" cassette_exon_types <- c("exclusion_junction", "upstream_inclusion_junction", "downstream_inclusion_junction", "inclusion_exonic") cassette_exon_probes_df <- subset(probe_annotation, ExonType %in% cassette_exon_types) # Filter intensity data to include only these probes # Ensure probe IDs match between annotation and intensity data common_probes <- intersect(rownames(probe_intensities), cassette_exon_probes_df$ProbeID) if (length(common_probes) == 0) { stop("No common probes found between intensity data and cassette exon annotation.") } filtered_intensities <- probe_intensities[common_probes, , drop = FALSE] filtered_annotation <- cassette_exon_probes_df[cassette_exon_probes_df$ProbeID %in% common_probes, ] # Ensure annotation is ordered consistently with intensities filtered_annotation <- filtered_annotation[match(rownames(filtered_intensities), filtered_annotation$ProbeID), ] # --- Step 2: Normalize against the average signal on a per-gene basis --- # This removes gene expression changes by normalizing each probe's signal # by the average signal of its associated gene *for each sample*. normalized_intensities <- filtered_intensities # Initialize with filtered data for (gene in unique(filtered_annotation$GeneID)) { gene_probes <- subset(filtered_annotation, GeneID == gene)$ProbeID if (length(gene_probes) > 0) { # Get intensities for probes belonging to the current gene gene_probe_signals <- filtered_intensities[gene_probes, , drop = FALSE] # Calculate average signal for this gene *per sample* # Use colMeans for per-sample average. Handle cases with a single probe for a gene. if (length(gene_probes) == 1) { gene_averages_per_sample <- as.numeric(gene_probe_signals) } else { gene_averages_per_sample <- colMeans(gene_probe_signals, na.rm = TRUE) } # Normalize each probe's signal by the gene's average in that sample for (sample_idx in 1:ncol(normalized_intensities)) { if (gene_averages_per_sample[sample_idx] > 0) { # Avoid division by zero normalized_intensities[gene_probes, sample_idx] <- gene_probe_signals[, sample_idx] / gene_averages_per_sample[sample_idx] } else { # If gene average is zero, set normalized values to NA normalized_intensities[gene_probes, sample_idx] <- NA } } } } # Handle potential Inf/NaN values resulting from division normalized_intensities[is.infinite(normalized_intensities)] <- NA normalized_intensities[is.nan(normalized_intensities)] <- NA # --- Output normalized data --- write.table(normalized_intensities, output_normalized_file, sep = "\t", quote = FALSE, row.names = TRUE) message(paste("Normalized data saved to:", output_normalized_file)) EOF # Execute the R script Rscript normalize_microarray.R -
4
Studentâs t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore ( defined as the normalized change in exclusion minus the normalized change in inclusion).
$ Bash example
# This is a placeholder command for a custom script that would perform the described analysis. # The actual script (e.g., analyze_splicing.py or analyze_splicing.R) would contain the logic for: # 1. Reading inclusion and exclusion probe data (likely residuals from a prior statistical model). # 2. Performing Student's t-test on these datasets separately. # 3. Calculating SepScore: (normalized change in exclusion - normalized change in inclusion). # 4. Identifying robust splicing changes based on t-test results and SepScore. # Assuming input files for residuals and normalized changes are available from upstream steps. # For example, 'inclusion_probe_residuals.tsv', 'exclusion_probe_residuals.tsv', # and 'normalized_splicing_changes.tsv' containing the necessary data. python analyze_splicing_changes.py \ --inclusion_residuals inclusion_probe_residuals.tsv \ --exclusion_residuals exclusion_probe_residuals.tsv \ --normalized_changes normalized_splicing_changes.tsv \ --output_splicing_results splicing_analysis_results.tsv \ --significance_threshold 0.05 # Example parameter for significance level -
5
HTA-2_0.r1.pgf
Affymetrix Power Tools (APT) (Inferred with models/gemini-2.5-flash) vNot specified$ Bash example
# Install Affymetrix Power Tools (APT) if not already installed # For example, using conda: # conda create -n apt_env -c bioconda affymetrix-power-tools # conda activate apt_env # Define variables for clarity PGF_FILE="HTA-2_0.r1.pgf" # Human Transcriptome Array 2.0 Probe Group File CLF_FILE="HTA-2_0.r1.clf" # Human Transcriptome Array 2.0 Chip Layout File (inferred as companion to PGF) OUTPUT_DIR="hta_2_0_summarized_data" CEL_FILES="sample1.CEL sample2.CEL sample3.CEL" # Placeholder for actual input CEL files, space-separated # Reference datasets (PGF and CLF files) are specific to the Affymetrix Human Transcriptome Array 2.0. # These files are typically downloaded from Thermo Fisher Scientific (formerly Affymetrix) support pages for the specific array. # Example placeholder for downloading (actual URLs may vary and require login/registration): # wget -O "$PGF_FILE" "https://assets.thermofisher.com/TFS-Assets/LSG/support-documents/HTA-2_0.r1.pgf" # wget -O "$CLF_FILE" "https://assets.thermofisher.com/TFS-Assets/LSG/support-documents/HTA-2_0.r1.clf" # Create output directory if it doesn't exist mkdir -p "$OUTPUT_DIR" # Run probeset summarization using the RMA (Robust Multi-array Average) algorithm. # The HTA-2_0.r1.pgf and HTA-2_0.r1.clf files provide the necessary probe definitions and chip layout. apt-probeset-summarize \ -a rma \ -p "$PGF_FILE" \ -c "$CLF_FILE" \ -o "$OUTPUT_DIR" \ --cel-files "$CEL_FILES" -
6
HTA-2_0.r1.Psrs.mps
oligo (Inferred with models/gemini-2.5-flash) v2_0.r1 (Refers to the HTA array design, not the oligo package version) GitHub$ Bash example
# This script processes Affymetrix Human Transcriptome Array 2.0 (HTA 2.0) .CEL files # using the 'oligo' R package for background correction, normalization, and summarization (RMA). # The "HTA-2_0.r1.Psrs.mps" description refers to the specific array design and its associated probe set definition file. # While the .mps file defines the array, 'oligo' typically handles this information internally # or via Bioconductor annotation packages (e.g., hta20transcriptcluster.db) when reading .CEL files. # Define input CEL files directory and output directory CEL_DIR="/path/to/raw_cel_files" # Placeholder: Replace with actual path to your .CEL files OUTPUT_DIR="/path/to/output_directory" # Placeholder: Replace with desired output path mkdir -p "${OUTPUT_DIR}" # Install R and Bioconductor packages if not already present # Note: These commands are commented out as they are installation steps. # sudo apt-get update && sudo apt-get install -y r-base # R -e "install.packages('BiocManager', repos='https://cloud.r-project.org')" # R -e "BiocManager::install('oligo', update=FALSE, ask=FALSE)" # R -e "BiocManager::install('hta20transcriptcluster.db', update=FALSE, ask=FALSE)" # Annotation package for HTA 2.0 # Create an R script for processing HTA 2.0 data cat << 'EOF' > "${OUTPUT_DIR}/process_hta20_rma.R" library(oligo) # library(hta20transcriptcluster.db) # Optional: For adding annotations later. # Ensure this package is installed if needed. # Set the directory containing CEL files cel_files_dir <- "${CEL_DIR}" # List all .CEL files in the specified directory cel_files <- list.celfiles(path=cel_files_dir, full.names=TRUE) if (length(cel_files) == 0) { stop("No .CEL files found in the specified directory: ", cel_files_dir) } message(paste("Found", length(cel_files), ".CEL files.")) # Read CEL files into an oligo FeatureSet object (e.g., GeneFeatureSet for HTA 2.0) # oligo automatically detects the array type (HTA 2.0) from the CEL file headers. raw_data <- read.celfiles(cel_files) # Perform RMA (Robust Multi-array Average) background correction, normalization, and summarization # This will produce a GeneFeatureSet or ExonFeatureSet with summarized expression values. message("Performing RMA summarization...") eset <- rma(raw_data) # Extract the expression matrix expression_matrix <- exprs(eset) # Get feature names (probe set IDs or transcript cluster IDs) feature_ids <- featureNames(eset) rownames(expression_matrix) <- feature_ids # Optionally, add sample names (from CEL file names) sample_names <- sampleNames(eset) colnames(expression_matrix) <- sample_names # Write the expression matrix to a CSV file output_file <- file.path("${OUTPUT_DIR}", "hta20_rma_expression.csv") write.csv(expression_matrix, file=output_file, row.names=TRUE) message(paste("RMA processed expression matrix saved to:", output_file)) # Example of how to add annotation (requires hta20transcriptcluster.db package) # if (requireNamespace("hta20transcriptcluster.db", quietly = TRUE)) { # message("Adding annotations...") # annotation_data <- AnnotationDbi::select( # hta20transcriptcluster.db, # keys = feature_ids, # columns = c("SYMBOL", "GENENAME", "ENTREZID"), # keytype = "PROBEID" # Or "TRANSCRIPTCLUSTERID" depending on the array and package # ) # # Merge annotation with expression data (example) # # annotated_expression <- merge(as.data.frame(expression_matrix), annotation_data, by.x="row.names", by.y="PROBEID", all.x=TRUE) # # write.csv(annotated_expression, file=file.path("${OUTPUT_DIR}", "hta20_rma_annotated_expression.csv"), row.names=FALSE) # message("Annotation complete (if uncommented and merged).") # } else { # message("Annotation package 'hta20transcriptcluster.db' not found. Skipping annotation step.") # message("Install with BiocManager::install('hta20transcriptcluster.db') if needed.") # } EOF # Execute the R script Rscript "${OUTPUT_DIR}/process_hta20_rma.R" -
7
RMA probeset-level signal estimates, DABG detection flag (Absent or Present), and DABG detection p-value obtained from 'Alt Splice Analysis' performed in Affymetrix Expression Console build 1.4.1.46.
Microarray v1.4.1.46$ Bash example
# Define input CEL files (replace with actual paths) CEL_FILES="sample1.cel sample2.cel sample3.cel" # Example: List of input Affymetrix .CEL files # Define output directory OUTPUT_DIR="affymetrix_expression_console_results" mkdir -p "${OUTPUT_DIR}" # Reference: Affymetrix array design file (CDF) for the specific array type used. # The specific CDF file depends on the array type (e.g., Human Exon 1.0 ST Array). # Example: CDF_FILE="HuEx-1_0-st-v2.cdf" # ANNOTATION_FILE="HuEx-1_0-st-v2.na36.hg19.probeset.csv" # Example annotation file for probeset mapping # GENOME_ASSEMBLY="hg19" # Placeholder if genome-level annotation was used for downstream analysis # This step was performed using the Affymetrix Expression Console (v1.4.1.46), a GUI-based software. # The following describes the analysis performed within the GUI, as there is no direct command-line execution for the full workflow. echo "Affymetrix Expression Console (v1.4.1.46) was used for 'Alt Splice Analysis'." echo "Input CEL files: ${CEL_FILES}" echo "Analysis performed: RMA probeset-level signal estimates, DABG detection flag, and DABG detection p-value." echo "Results were exported to: ${OUTPUT_DIR}/"
Tools Used
Raw Source Text
Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection. Only probesets with detection p-value ⤠0.05 in more than half of the microarray samples were considered for downstream analysis. All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes. Studentâs t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore ( defined as the normalized change in exclusion minus the normalized change in inclusion). HTA-2_0.r1.pgf HTA-2_0.r1.Psrs.mps RMA probeset-level signal estimates, DABG detection flag (Absent or Present), and DABG detection p-value obtained from 'Alt Splice Analysis' performed in Affymetrix Expression Console build 1.4.1.46.