GSE18920 Processing Pipeline


Publication

Aberrant NOVA1 function disrupts alternative splicing in early stages of amyotrophic lateral sclerosis.

Acta neuropathologica (2022) — PMID 35778567

Dataset

Sporadic ALS has compartment-specific aberrant exon splicing and altered cell-matrix adhesion biology

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Data were processed using Partek Genomics Suite version 6.4.

Partek Genomics Suite v6.4

$ Bash example

# Partek Genomics Suite is primarily a GUI-based application.
# The source text states only that the data were processed with version 6.4; it gives no
# import settings or parameters for this step, so no command-line equivalent is provided.
# The RMA, probeset selection, and filtering steps below approximate the workflow with open-source tools.

RMA background correction and quantile normalization were performed, and probeset summarization used the median polish method (RMA default setting in Partek).

Partek Genomics Suite, version not specified for this step (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor (if not already installed)
# conda install -c conda-forge r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("affy")'

# Placeholder for input CEL files and output directory
# Replace './path/to/your/cel_files' with the actual directory containing your .CEL files
# The variables are exported so that the Rscript child process can read them via Sys.getenv()
export INPUT_CEL_FILES="./path/to/your/cel_files"
export OUTPUT_DIR="./path/to/output"

mkdir -p "$OUTPUT_DIR"

# R script to perform RMA using the affy package, which implements the described steps
# This script mimics the default RMA processing (background correction, quantile normalization, median polish summarization)
# as typically performed by software like Partek.
# Assumption: the CEL files come from an array type that affy supports. For exon or gene ST arrays
# (the later mention of core probesets suggests one), the oligo package (read.celfiles() followed by
# rma(..., target = "core")) is the usual Bioconductor route.
Rscript -e '
  library(affy)
  
  # Directory containing the CEL files, passed in from the shell
  cel_path <- Sys.getenv("INPUT_CEL_FILES")
  if (cel_path == "") {
    stop("INPUT_CEL_FILES environment variable not set. Please set it to the directory containing your CEL files.")
  }
  
  # Read all .CEL files found in that directory without changing the working directory,
  # so that a relative OUTPUT_DIR still resolves against the directory the script was started from
  # For specific files, use ReadAffy(filenames=c("file1.CEL", "file2.CEL"))
  raw_data <- ReadAffy(celfile.path = cel_path)
  
  # Perform RMA: Robust Multi-array Average
  # This function performs background correction, quantile normalization, and probeset summarization
  # using the median polish method by default, matching the description.
  eset_rma <- rma(raw_data)
  
  # Extract the normalized expression matrix
  expr_matrix <- exprs(eset_rma)
  
  # Define output file path
  output_file <- file.path(Sys.getenv("OUTPUT_DIR"), "rma_normalized_expression.tsv")
  
  # Write the normalized expression matrix to a TSV file
  write.table(expr_matrix, file=output_file, sep="\t", quote=FALSE, col.names=NA)
  
  message(paste("RMA normalized expression matrix saved to:", output_file))
'
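
A quick sanity check on the normalized matrix can catch samples that were read or normalized incorrectly. The sketch below assumes a tab-separated matrix shaped like the rma_normalized_expression.tsv written above (empty first header cell, one column per sample); the toy values and temporary file are hypothetical stand-ins, not data from GSE18920. It prints the mean log2 signal per sample, which should be broadly similar across arrays after quantile normalization.

```shell
# Hypothetical toy matrix standing in for rma_normalized_expression.tsv
workdir="$(mktemp -d)"
expr_file="$workdir/rma_normalized_expression.tsv"
printf '\tS1\tS2\tS3\n'        >  "$expr_file"
printf 'ps_1\t5.0\t5.2\t4.8\n' >> "$expr_file"
printf 'ps_2\t7.0\t6.8\t7.2\n' >> "$expr_file"
printf 'ps_3\t9.0\t9.0\t9.0\n' >> "$expr_file"

# Compute the mean log2 signal of every sample column
sample_means="$(awk -F'\t' '
  NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; n = NF; next }  # header: remember sample names
  { for (i = 2; i <= n; i++) sum[i] += $i; rows++ }                 # data rows: accumulate per column
  END { for (i = 2; i <= n; i++) printf "%s\t%.2f\n", name[i], sum[i] / rows }
' "$expr_file")"

echo "$sample_means"
```

To run it on real output, point expr_file at the TSV produced by the RMA step and drop the printf lines. A sample whose mean is far from the others is worth inspecting before any downstream analysis.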

Core probesets were used for the analysis.

R (for microarray analysis) (Inferred with models/gemini-2.5-flash)

$ Bash example

# This script represents a conceptual step where "core probesets" are utilized
# within a broader microarray data analysis pipeline. The specific definition
# and selection of "core probesets" would be implemented within the R script
# based on the experimental design and array type.

# Example: Define input data (e.g., normalized expression matrix or ExpressionSet object)
# INPUT_EXPRESSION_DATA="normalized_expression.rds" # Or a directory of CEL files

# Example: Define output directory for analysis results
# OUTPUT_DIR="analysis_with_core_probesets"
# mkdir -p ${OUTPUT_DIR}

# The R script below is a placeholder demonstrating how "core probesets" might be
# selected or used in an analysis. The actual implementation depends on the
# specific criteria for defining "core probesets" (e.g., manufacturer's definition,
# probesets mapping to well-annotated genes, probesets passing quality filters).
R_SCRIPT_NAME="analyze_with_core_probesets.R"

cat << 'EOF' > ${R_SCRIPT_NAME}
# Load necessary R packages for microarray analysis (e.g., Biobase, oligo, limma, annotation packages)
# if (!requireNamespace("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# BiocManager::install(c("Biobase", "oligo", "limma", "hgu133plus2.db")) # Example packages

library(Biobase) # Required: provides the ExpressionSet and AnnotatedDataFrame classes used below
# library(oligo)
# library(limma)
# library(hgu133plus2.db) # Example for Affymetrix Human Genome U133 Plus 2.0 Array

# --- Placeholder for loading pre-processed microarray data ---
# In a real pipeline, this would be an ExpressionSet object or a matrix of normalized expression values.
# For demonstration, let's create a dummy ExpressionSet.
set.seed(42)
num_total_probesets <- 10000
num_samples <- 5
dummy_exprs <- matrix(rnorm(num_total_probesets * num_samples, mean = 7, sd = 1),
                          ncol = num_samples,
                          dimnames = list(paste0("PROBEID_", 1:num_total_probesets),
                                          paste0("Sample", 1:num_samples)))
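# Create dummy phenoData; row names must match the sample (column) names of the expression matrix,
# otherwise ExpressionSet validation fails
dummy_pheno <- data.frame(sample_id = paste0("Sample", 1:num_samples),
                              group = c(rep("Control", 3), rep("Treated", 2)),
                              row.names = paste0("Sample", 1:num_samples))
pheno_data <- new("AnnotatedDataFrame", data = dummy_pheno)
eset <- new("ExpressionSet", exprs = dummy_exprs, phenoData = pheno_data)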

message(paste("Initial ExpressionSet contains", nrow(eset), "probesets."))

# --- Definition and selection of "core probesets" ---
# This is the critical step implied by the description.
# The method for identifying "core probesets" is highly context-dependent.
# Possible approaches:
# 1. Using a predefined list of probeset IDs (e.g., from manufacturer or previous study).
#    Example: core_probeset_ids <- read.table("path/to/core_probesets_list.txt", header=FALSE)$V1
# 2. Filtering based on annotation (e.g., probesets mapping to well-annotated genes, removing controls).
#    Example:
#    annotations <- AnnotationDbi::select(hgu133plus2.db, keys=featureNames(eset),
#                                         columns=c("SYMBOL", "ENTREZID"), keytype="PROBEID")
#    core_probeset_ids <- annotations$PROBEID[!is.na(annotations$SYMBOL)]
# 3. Filtering based on quality metrics (e.g., detection p-value, signal intensity).

# For this generic example, let's simulate selecting a subset of probesets
# as "core" based on a simple criterion (e.g., top N most variable probesets,
# or a random subset if no specific criteria are provided).
# In a real scenario, this would be a well-defined selection.
num_selected_core_probesets <- 5000 # Example: Select 5000 probesets
if (nrow(eset) > num_selected_core_probesets) {
    # Example: Select a random subset as "core" for demonstration
    set.seed(101) # For reproducibility of the random selection
    core_probeset_indices <- sample(1:nrow(eset), num_selected_core_probesets)
    eset_core <- eset[core_probeset_indices, ]
} else {
    eset_core <- eset # If fewer probesets than target, use all
}

message(paste("Analysis proceeding with", nrow(eset_core), "core probesets."))

# --- Downstream analysis using the 'eset_core' object ---
# This is where the actual analysis (e.g., differential expression, clustering)
# would be performed using only the selected core probesets.
# Example: Perform a simple differential expression analysis using limma (conceptual)
# design_matrix <- model.matrix(~0 + group, data = pData(eset_core))
# colnames(design_matrix) <- levels(factor(pData(eset_core)$group)) # group is a character column, so convert to factor first
# fit <- lmFit(eset_core, design_matrix)
# contrast_matrix <- makeContrasts(Treated - Control, levels = design_matrix)
# fit2 <- contrasts.fit(fit, contrast_matrix)
# fit2 <- eBayes(fit2)
# top_table_results <- topTable(fit2, number = Inf, adjust.method = "BH")
# write.csv(top_table_results, file = file.path("analysis_results_core_probesets.csv"))

message("Analysis using core probesets completed successfully.")
EOF

Rscript ${R_SCRIPT_NAME}
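
When a list of core probeset IDs is available (for example, exported from the array's annotation file), the selection itself reduces to a row filter on the normalized matrix. The sketch below is a minimal illustration under that assumption: the file names, the ps_000N identifiers, and the expression values are all hypothetical, and the source text does not say how the core list was obtained in Partek.

```shell
# Hypothetical inputs: a normalized matrix and a one-column list of core probeset IDs
workdir="$(mktemp -d)"
expr_file="$workdir/rma_normalized_expression.tsv"
core_ids="$workdir/core_probeset_ids.txt"
core_expr="$workdir/rma_core_probesets.tsv"

printf '\tS1\tS2\nps_0001\t6.1\t6.3\nps_0002\t2.4\t2.2\nps_0003\t8.0\t7.9\n' > "$expr_file"
printf 'ps_0001\nps_0003\n' > "$core_ids"

# Keep the header row plus every row whose probeset ID appears in the core list
awk -F'\t' '
  NR == FNR { core[$1] = 1; next }   # first file: remember the core IDs
  FNR == 1 || ($1 in core)           # second file: print header and matching rows
' "$core_ids" "$expr_file" > "$core_expr"

# Count retained probesets (non-empty lines minus the header)
kept="$(( $(grep -c . "$core_expr") - 1 ))"
echo "Kept $kept core probesets"
```

Filtering by an explicit ID list keeps the selection reproducible, unlike the random subset used for demonstration in the R placeholder above.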

For alternative splicing analysis, probe sets with maximum signal <3 and differential expression p-values > 0.5 were excluded.

R (tool and version inferred with models/gemini-2.5-flash)

$ Bash example

# This script filters probe sets based on maximum signal and differential expression p-values.
# It assumes an input file (e.g., CSV) with columns for 'max_signal' and 'differential_expression_p_value'.

# Install R if not already installed
# sudo apt-get update && sudo apt-get install -y r-base
# Or use conda:
# conda install -c conda-forge r-base

# Create an R script for filtering
cat << 'EOF' > filter_probe_sets.R
# Load your data (replace 'your_probe_set_data.csv' with your actual file path)
# Example: data <- read.csv("your_probe_set_data.csv")

# For demonstration purposes, creating dummy data:
data <- data.frame(
  probe_id = paste0("probe_", 1:10),
  max_signal = c(1.5, 4.2, 2.8, 5.1, 0.9, 3.5, 6.0, 2.1, 4.8, 1.2),
  differential_expression_p_value = c(0.01, 0.005, 0.6, 0.02, 0.8, 0.1, 0.001, 0.7, 0.03, 0.9)
)

# Define exclusion thresholds
max_signal_threshold_for_exclusion <- 3
p_value_threshold_for_exclusion <- 0.5

# Filter out probe sets based on the description:
# "probe sets with maximum signal <3 and differential expression p-values > 0.5 were excluded."
# The sentence is ambiguous. This script assumes a probe set is excluded only when BOTH conditions hold;
# the other reading (excluded when EITHER condition holds) would replace | with & in the keep condition below.
# Exclusion condition (assumed): (max_signal < 3) AND (p_value > 0.5)
# Keep condition: NOT ((max_signal < 3) AND (p_value > 0.5))
# Which is equivalent to: (max_signal >= 3) OR (p_value <= 0.5)

filtered_data <- data[data$max_signal >= max_signal_threshold_for_exclusion |
                      data$differential_expression_p_value <= p_value_threshold_for_exclusion, ]

# Print the original and filtered data for verification
# cat() is used for the headings so that "\n" is rendered as a line break
cat("Original Data:\n")
print(data)
cat("\nFiltered Data (excluded if max_signal < 3 AND p_value > 0.5):\n")
print(filtered_data)

# Optionally, save the filtered data to a new CSV file
# write.csv(filtered_data, "filtered_probe_set_data.csv", row.names = FALSE)
EOF

# Execute the R script
Rscript filter_probe_sets.R
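
The exclusion rule in this step can be read two ways: a probe set is dropped only when both conditions hold, or when either one holds. The sketch below applies both readings to the same table so the difference is visible. It assumes a CSV with columns probe_id, max_signal, and p_value; the column layout, file name, and values are hypothetical, and the source text does not say which reading the original analysis used.

```shell
# Hypothetical probe set table: probe_id,max_signal,p_value
workdir="$(mktemp -d)"
table="$workdir/probe_sets.csv"
cat > "$table" << 'EOF'
probe_id,max_signal,p_value
probe_1,1.5,0.01
probe_2,4.2,0.005
probe_3,2.8,0.6
probe_4,5.1,0.7
EOF

# Reading 1: exclude only when signal < 3 AND p > 0.5
keep_and="$(awk -F',' -v s=3 -v p=0.5 'NR > 1 && !($2 < s && $3 > p) { print $1 }' "$table")"

# Reading 2: exclude when signal < 3 OR p > 0.5
keep_or="$(awk -F',' -v s=3 -v p=0.5 'NR > 1 && !($2 < s || $3 > p) { print $1 }' "$table")"

echo "Kept under AND reading:"
echo "$keep_and"
echo "Kept under OR reading:"
echo "$keep_or"
```

The second reading is much stricter, so it is worth confirming the intended rule against the publication's methods before filtering real data.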


Raw Source Text

Data were processed using Partek Genomics Suite version 6.4.  RMA background correction and quantile normalization were performed, and probeset summarization used the median polish method (RMA default setting in Partek).  Core probesets were used for the analysis.  For alternative splicing analysis, probe sets with maximum signal <3 and differential expression p-values > 0.5 were excluded.
