GSE86223 Processing Pipeline
Publication
Protein-RNA Networks Regulated by Normal and ALS-Associated Mutant HNRNPA2B1 in the Nervous System.Neuron (2016) — PMID 27773581
Dataset
GSE86223HNRNPA2B1 regulates alternative RNA processing in the nervous system and accumulates in granules in ALS IPSC-derived motor neurons [hnRNPA2B1_Arrays_…
Processing Steps
Generate Jupyter Notebook-
1
Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection.
Microarray v1.4.1.46$ Bash example
# This analysis is typically performed using the Expression Console GUI (Graphical User Interface). # A direct command-line interface for the full analysis workflow is not standard for this software. # The following command represents a conceptual execution if a CLI or a custom wrapper script were available. # Define input CEL files (replace with actual file paths) input_cel_files=( "sample1.CEL" "sample2.CEL" # ... add all relevant CEL files ) # Define output directory output_dir="expression_console_results" mkdir -p "${output_dir}" # Conceptual command for Expression Console analysis # In a real scenario, this would involve manual steps within the GUI: # 1. Open Expression Console. # 2. Import CEL files. # 3. Select appropriate array type (e.g., from a CDF file). # 4. Choose 'RMA' for normalization. # 5. Choose 'DABG' for probe-level detection. # 6. Export processed data. # This command is a placeholder to represent the action. expression_console_cli_placeholder \ --input-cels "${input_cel_files[@]}" \ --normalization-method RMA \ --detection-method DABG \ --output-dir "${output_dir}" \ --version 1.4.1.46 -
2
Only probesets with detection p-value ⤠0.05 in more than half of the microarray samples were considered for downstream analysis.
$ Bash example
# Install R and Bioconductor if not already installed # sudo apt-get update # sudo apt-get install r-base # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install(c("oligo", "limma"))' Rscript -e ' # Load necessary libraries. oligo is common for Affymetrix array processing, # and limma is widely used for downstream analysis, often after such filtering. library(oligo) library(limma) # --- Configuration & Data Loading --- # In a real pipeline, you would load your actual microarray data here. # This typically involves an ExpressionSet object or a matrix of expression values # and a corresponding matrix of detection p-values. For Affymetrix arrays, # detection p-values are often generated by MAS 5.0 algorithm or can be # extracted/calculated using packages like oligo or affy. # For demonstration, let''s simulate an expression matrix and detection p-value matrix. set.seed(123) num_probesets <- 10000 num_samples <- 10 # Simulate expression data (e.g., log2 intensity) expr_matrix <- matrix(rnorm(num_probesets * num_samples, mean=7, sd=1), nrow=num_probesets, ncol=num_samples, dimnames=list(paste0("Probeset_", 1:num_probesets), paste0("Sample_", 1:num_samples))) # Simulate detection p-values # Assume some probesets are consistently detected, some are not detection_p_values <- matrix(runif(num_probesets * num_samples, min=0, max=1), nrow=num_probesets, ncol=num_samples, dimnames=list(paste0("Probeset_", 1:num_probesets), paste0("Sample_", 1:num_samples))) # Make some probesets consistently detected (low p-value) detection_p_values[1:1000, ] <- runif(1000 * num_samples, min=0, max=0.01) # Make some probesets consistently undetected (high p-value) detection_p_values[9000:10000, ] <- runif(1001 * num_samples, min=0.5, max=1) # --- Filtering Parameters --- p_value_threshold <- 0.05 min_samples_detected_fraction <- 0.5 # "more than half" # --- Filtering Logic --- message(paste("Initial number of probesets:", nrow(expr_matrix))) message(paste("Number of samples:", ncol(expr_matrix))) # Count samples where detection p-value is <= threshold for each probeset detected_counts <- rowSums(detection_p_values <= p_value_threshold) # Determine the minimum number of samples required # "more than half" means > N/2. For N samples, this is floor(N/2) + 1. min_samples_required <- floor(num_samples * min_samples_detected_fraction) + 1 message(paste("Minimum samples required for detection:", min_samples_required)) # Identify probesets that meet the criteria probesets_to_keep_idx <- which(detected_counts >= min_samples_required) # Filter the expression matrix and detection p-value matrix filtered_expr_matrix <- expr_matrix[probesets_to_keep_idx, ] filtered_detection_p_values <- detection_p_values[probesets_to_keep_idx, ] message(paste("Number of probesets after filtering:", nrow(filtered_expr_matrix))) # --- Save filtered data (optional) --- # write.csv(filtered_expr_matrix, "filtered_expression_data.csv", row.names=TRUE) # write.csv(filtered_detection_p_values, "filtered_detection_p_values.csv", row.names=TRUE) # The filtered_expr_matrix would typically be passed to the next step of your pipeline, # e.g., for differential expression analysis using limma. ' -
3
All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes.
$ Bash example
# Install R and necessary Bioconductor packages if not already present # conda install -c bioconda bioconductor-oligo bioconductor-limma # R -e 'install.packages("BiocManager")' # R -e 'BiocManager::install(c("oligo", "limma"))' # Create dummy files for demonstration purposes. # In a real scenario, 'cel_files' would contain actual Affymetrix CEL files, # and 'probe_annotation.tsv' would be a comprehensive annotation file # mapping probe IDs to gene IDs and specific exon features (e.g., from Bioconductor annotation packages). mkdir -p cel_files # Create a dummy probe annotation file with relevant features echo -e "probe_id\tgene_id\texon_feature" > probe_annotation.tsv echo -e "probeA_1\tgeneA\tinclusion_exonic" >> probe_annotation.tsv echo -e "probeA_2\tgeneA\texclusion_junction" >> probe_annotation.tsv echo -e "probeA_3\tgeneA\tupstream_inclusion_junction" >> probe_annotation.tsv echo -e "probeA_4\tgeneA\tdownstream_inclusion_junction" >> probe_annotation.tsv echo -e "probeA_5\tgeneA\tconstitutive_exon" >> probe_annotation.tsv # Example of a non-cassette probe echo -e "probeB_1\tgeneB\tinclusion_exonic" >> probe_annotation.tsv echo -e "probeB_2\tgeneB\texclusion_junction" >> probe_annotation.tsv echo -e "probeC_1\tgeneC\tinclusion_exonic" >> probe_annotation.tsv echo -e "probeC_2\tgeneC\tconstitutive_exon" >> probe_annotation.tsv Rscript -e ' library(oligo) # Used for reading Affymetrix CEL files (conceptual here) library(limma) # Used for general microarray analysis and statistical modeling # --- Configuration --- cel_files_dir <- "cel_files" probe_annotation_file <- "probe_annotation.tsv" output_normalized_file <- "normalized_probe_signals.tsv" # --- 1. Load Raw Microarray Data (Conceptual) --- # In a real scenario, this would read CEL files and perform initial processing: # cel_files <- list.files(cel_files_dir, pattern = "\\.CEL$", full.names = TRUE) # raw_data <- read.celfiles(cel_files) # eset <- rma(raw_data) # Background correction, normalization (RMA), summarization # exprs_data <- exprs(eset) # Extract expression matrix (log2 scale) # For demonstration, create a dummy expression matrix (log2 scale) message("Simulating raw microarray data loading...") set.seed(123) num_probes_simulated <- 100 # Simulate more probes than in dummy annotation num_samples <- 3 probe_ids_simulated <- paste0("probe", LETTERS[1:3], "_", 1:300)[1:num_probes_simulated] sample_names <- paste0("sample", 1:num_samples) # Simulate raw intensities (log2 scale) raw_exprs <- matrix(rnorm(num_probes_simulated * num_samples, mean = 8, sd = 1), nrow = num_probes_simulated, ncol = num_samples, dimnames = list(probe_ids_simulated, sample_names)) exprs_data <- raw_exprs # --- 2. Load Probe Annotation --- message(paste("Loading probe annotation from:", probe_annotation_file)) if (!file.exists(probe_annotation_file)) { stop("Probe annotation file not found. Please provide a valid annotation.") } probe_annot <- read.delim(probe_annotation_file, stringsAsFactors = FALSE) # Ensure probe_ids in annotation match those in expression data common_probes <- intersect(rownames(exprs_data), probe_annot$probe_id) if (length(common_probes) == 0) { stop("No common probes between expression data and annotation. Check probe IDs.") } exprs_data <- exprs_data[common_probes, , drop = FALSE] probe_annot <- probe_annot[match(common_probes, probe_annot$probe_id), , drop = FALSE] # --- 3. Identify Probes for Cassette Exons and Junctions --- # Filter annotation to only include probes relevant to cassette exons as described relevant_features <- c("inclusion_exonic", "exclusion_junction", "upstream_inclusion_junction", "downstream_inclusion_junction") cassette_exon_probes_annot <- subset(probe_annot, exon_feature %in% relevant_features) # Filter expression data to only include these specific probes cassette_exprs_data <- exprs_data[cassette_exon_probes_annot$probe_id, , drop = FALSE] # Add gene_id to the expression data for easier grouping cassette_exprs_data_df <- as.data.frame(cassette_exprs_data) cassette_exprs_data_df$probe_id <- rownames(cassette_exprs_data_df) cassette_exprs_data_df <- merge(cassette_exprs_data_df, cassette_exon_probes_annot[, c("probe_id", "gene_id")], by = "probe_id", all.x = TRUE) # --- 4. Normalize against the average signal on a per-gene basis --- # This step removes overall gene expression changes, highlighting relative probe intensity differences # that reflect splicing. Assuming log2-transformed data, this is done by subtracting the gene mean. message("Performing per-gene normalization...") normalized_exprs_list <- list() for (sample_col in sample_names) { # Calculate gene-level average signal for the current sample gene_avg_signal <- aggregate(cassette_exprs_data_df[[sample_col]], by = list(gene_id = cassette_exprs_data_df$gene_id), FUN = mean) colnames(gene_avg_signal)[2] <- "gene_mean" # Merge gene average back to probe-level data sample_data <- merge(cassette_exprs_data_df[, c("probe_id", "gene_id", sample_col)], gene_avg_signal, by = "gene_id", all.x = TRUE) # Normalize: subtract gene mean from probe signal (assuming log2 data) # This yields a log-ratio of probe intensity to gene intensity, isolating splicing effects. sample_data$normalized_signal <- sample_data[[sample_col]] - sample_data$gene_mean normalized_exprs_list[[sample_col]] <- sample_data[, c("probe_id", "gene_id", "normalized_signal")] colnames(normalized_exprs_list[[sample_col]])[3] <- paste0("normalized_", sample_col) } # Combine normalized results from all samples into a single data frame normalized_data_final <- normalized_exprs_list[[1]] if (length(normalized_exprs_list) > 1) { for (i in 2:length(normalized_exprs_list)) { normalized_data_final <- merge(normalized_data_final, normalized_exprs_list[[i]], by = c("probe_id", "gene_id"), all.x = TRUE) } } # --- 5. Save Normalized Data --- message(paste("Saving normalized probe signals to:", output_normalized_file)) write.table(normalized_data_final, file = output_normalized_file, sep = "\t", quote = FALSE, row.names = FALSE) message("Normalization complete.") ' -
4
Studentâs t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore ( defined as the normalized change in exclusion minus the normalized change in inclusion).
sepscore (Inferred with models/gemini-2.5-flash) vNot explicitly versioned$ Bash example
# Clone the sepscore repository # git clone https://github.com/yeolab/sepscore.git # cd sepscore # Install the sepscore package # pip install . # Example usage of sepscore_cli.py # The input file should be a tab-separated file where the first column is the event ID (e.g., splicing event or probe ID), # followed by columns for each sample containing their respective PSI (Percent Spliced In) values or similar quantification. # The description mentions 'inclusion probes and exclusion probes', implying the input data is derived from such probes. # SepScore then operates on these quantified values (e.g., PSI) to identify robust splicing changes. python sepscore_cli.py \ -i input_psi_values.tsv \ -o sepscore_results.tsv \ -c control_sample_1,control_sample_2,control_sample_3 \ -t treatment_sample_1,treatment_sample_2,treatment_sample_3
-
5
HTA-2_0.r1.pgf
$ Bash example
# Create a directory for Affymetrix reference files if it doesn't exist mkdir -p affy_references # Download the HTA-2_0.r1.pgf and its companion HTA-2_0.r1.clf files. # These files are typically provided by Thermo Fisher Scientific (Affymetrix) # and may require registration or specific software to access. # The exact direct download links are not publicly available without navigating their support site. # As a placeholder, we assume a manual download or a pre-configured path. # For a real pipeline, replace with actual download commands or ensure files are present. # Example placeholder for obtaining the PGF file: # wget "https://assets.thermofisher.com/TFS-Assets/LSG/support-documents/HTA-2_0.r1.pgf" -O affy_references/HTA-2_0.r1.pgf # wget "https://assets.thermofisher.com/TFS-Assets/LSG/support-documents/HTA-2_0.r1.clf" -O affy_references/HTA-2_0.r1.clf # For demonstration, we'll just echo the path where it's expected to be. PGF_FILE="affy_references/HTA-2_0.r1.pgf" CLF_FILE="affy_references/HTA-2_0.r1.clf" echo "Ensure HTA-2_0.r1.pgf and HTA-2_0.r1.clf are available at: ${PGF_FILE} and ${CLF_FILE}" echo "These files are used by tools like the R 'oligo' package for processing Affymetrix Human Transcriptome Array 2.0 data." -
6
HTA-2_0.r1.Psrs.mps
HTSeq-count (Inferred with models/gemini-2.5-flash) v2.0 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install HTSeq (if not already installed) # conda install -c bioconda htseq # Define input and output files # INPUT_BAM: Placeholder for aligned RNA-seq reads (e.g., from STAR or HISAT2) # ANNOTATION_GTF: Placeholder for genome annotation GTF file # OUTPUT_COUNTS: Placeholder for the output gene counts file INPUT_BAM="aligned_reads_r1.bam" # Assuming 'r1' refers to a replicate or run ANNOTATION_GTF="Homo_sapiens.GRCh38.109.gtf" # Using latest GRCh38 as a placeholder OUTPUT_COUNTS="HTA-2_0.r1.Psrs.mps.counts.txt" # Incorporating description into output name # Execute HTSeq-count for gene quantification # Common parameters used: # -f bam: Input file format is BAM # -r pos: Assume reads are sorted by position (default for many aligners) # -s no: Assume unstranded data (change to 'yes' or 'reverse' if data is stranded) # -a 10: Minimum alignment quality score to consider a read (adjust as needed) # -t exon: Feature type in the GTF file to count (e.g., 'exon' for gene counting) # -i gene_id: Attribute in the GTF file to use as the feature identifier htseq-count -f bam -r pos -s no -a 10 -t exon -i gene_id "${INPUT_BAM}" "${ANNOTATION_GTF}" > "${OUTPUT_COUNTS}"
Tools Used
Raw Source Text
Cel files were analyzed using Expression Console (build 1.4.1.46, Affymetrix), with RMA normalization and DABG probe-level detection. Only probesets with detection p-value ⤠0.05 in more than half of the microarray samples were considered for downstream analysis. All probes corresponding to cassette exons profiled on the microarray (comprising exclusion junction, upstream and downstream inclusion junction, and inclusion exonic probes) were identified and normalized against the average signal on a per-gene basis to remove gene expression changes. Studentâs t-test was performed on residuals for inclusion probes and exclusion probes separately to identify robust splicing changes, which were quantified by SepScore ( defined as the normalized change in exclusion minus the normalized change in inclusion). HTA-2_0.r1.pgf HTA-2_0.r1.Psrs.mps