GSE21037 Processing Pipeline

Publication

A model for neural development and treatment of Rett syndrome using human induced pluripotent stem cells.

Cell (2010) — PMID 21074045

Dataset

L1 retrotransposition in neurons is mediated by MeCP2

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Gene-level signal estimates were derived from the CEL files by RMA-sketch normalization as a method in the apt-probeset-summarize program (see Yeo et al. 2007; PMID 7967047).

apt-probeset-summarize (version not specified)

$ Bash example

# Install Affymetrix Power Tools (APT); the package name below is unverified, so check your channel first.
# conda install -c bioconda affymetrix-power-tools

# List the input CEL files (replace the directory with your actual path).
# --cel-files expects a text file with a "cel_files" header line followed by one path per line.
CEL_DIR="/path/to/your/cel_files_directory"
CEL_LIST="cel_files.txt"
{ echo "cel_files"; find "${CEL_DIR}" -name "*.CEL" | sort; } > "${CEL_LIST}"

# Define output directory
OUTPUT_DIR="gene_level_estimates"
mkdir -p "${OUTPUT_DIR}"

# Define the array library files (placeholders; use the files that match your array type).
# Exon/gene ST arrays use a PGF and CLF file plus a meta-probeset (MPS) file that groups
# probesets into genes; older 3' arrays use a single CDF file passed with -d instead.
PGF_FILE="/path/to/your_array.pgf"
CLF_FILE="/path/to/your_array.clf"
MPS_FILE="/path/to/your_array.mps"

# Run apt-probeset-summarize with RMA-sketch normalization.
# Flags follow the usual APT command line; confirm with `apt-probeset-summarize --help`.
apt-probeset-summarize \
    -a rma-sketch \
    -p "${PGF_FILE}" \
    -c "${CLF_FILE}" \
    -m "${MPS_FILE}" \
    -o "${OUTPUT_DIR}" \
    --cel-files "${CEL_LIST}"
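
The clustering step further down reads a plain numeric CSV, whereas APT summary tables are typically tab-delimited with `#%` metadata lines, a header row, and a probeset ID column. The sketch below shows one way to bridge the two under those assumptions. The function name `summary_to_csv` and the file names in the usage note are hypothetical, and the layout should be checked against the actual summary file before use.

```shell
#!/usr/bin/env bash
# Convert an APT-style summary table to a headerless numeric CSV.
# Assumptions (verify against your file): tab-delimited input, metadata
# lines start with "#", the first remaining line is a column header, and
# the first column holds probeset IDs.
summary_to_csv() {
    local summary_file="$1"   # input summary table
    local csv_file="$2"       # output CSV (signal values only)
    grep -v '^#' "${summary_file}" \
        | tail -n +2 \
        | cut -f2- \
        | tr '\t' ',' > "${csv_file}"
}
```

A call might look like `summary_to_csv gene_level_estimates/rma-sketch.summary.txt probeset_data.csv`, where the input file name is an assumption about what APT writes for the rma-sketch analysis. The probeset IDs and sample names are dropped, so keep the original table if row or column labels are needed later.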

BLAST (Inferred with models/gemini-2.5-flash), legacy BLAST (circa 2007)

$ Bash example

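# Install BLAST+ (e.g., using conda). Note: makeblastdb and blastn below are BLAST+ commands,
# not the legacy formatdb/blastall tools that were current in 2007.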
# conda install -c bioconda blast

# --- Placeholder for reference database and query sequences ---
# Replace 'reference.fasta' with your actual reference sequence file (e.g., genome, transcriptome).
# Replace 'my_blast_db' with your desired database name.
# This command creates a BLAST database from your reference FASTA file.
makeblastdb -in reference.fasta -dbtype nucl -out my_blast_db

# Replace 'query.fasta' with your actual query sequence file.

# Run blastn (Nucleotide BLAST)
# -query: Input query file
# -db: BLAST database name
# -out: Output file in tabular format
# -outfmt 6: Tabular output format (standard for parsing)
# -num_threads: Number of CPU threads to use
blastn -query query.fasta -db my_blast_db -out blastn_results.tsv -outfmt 6 -num_threads 8

Hierarchical clustering of the full dataset by probeset values was performed by complete linkage using Euclidean distance as a similarity metric in Matlab.

MATLAB (version not specified; pdist, linkage, and dendrogram require the Statistics and Machine Learning Toolbox)

$ Bash example

#!/bin/bash

# Input data file (replace with your actual path and format)
# Example: probeset_data.csv (assuming comma-separated values)
INPUT_DATA="probeset_data.csv"
OUTPUT_CLUSTERING_MATRIX="hierarchical_clustering_results.csv"
OUTPUT_DENDROGRAM_PNG="dendrogram.png"

# Create a temporary MATLAB script for clustering
cat << EOF > run_clustering.m
% MATLAB script for hierarchical clustering
% Load data. readmatrix returns NaN for non-numeric columns (e.g., probeset IDs), so remove
% such columns first, or use readtable and then table2array on the numeric columns.
try
    data = readmatrix('$INPUT_DATA');
catch ME
    disp(['Error loading data: ', ME.message]);
    exit(1);
end

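% Calculate pairwise Euclidean distances.
% pdist compares rows: with a probesets-by-samples matrix this clusters probesets.
% To cluster samples instead, use pdist(data', 'euclidean').
Y = pdist(data, 'euclidean');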

% Perform hierarchical clustering with complete linkage
Z = linkage(Y, 'complete');

% Save the clustering result (linkage matrix)
writematrix(Z, '$OUTPUT_CLUSTERING_MATRIX');

% Optional: Generate and save a dendrogram image.
% The figure is created invisibly so the script can run without a display;
% some headless setups may still need a virtual display server such as Xvfb.
try
    figure('visible','off'); % Create a figure without displaying it
    dendrogram(Z);
    title('Hierarchical Clustering Dendrogram');
    saveas(gcf, '$OUTPUT_DENDROGRAM_PNG');
    close(gcf); % Close the figure
catch ME
    disp(['Warning: Could not generate dendrogram image. ', ME.message]);
end

exit(0); % Exit MATLAB successfully
EOF

# Execute the MATLAB script in batch mode
# Assumes MATLAB is installed and its executable is in the system PATH.
matlab -batch "run_clustering" -logfile matlab_clustering.log

# Clean up the temporary MATLAB script
rm run_clustering.m
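
As a small cross-check on the distance step, the sketch below computes the Euclidean distance between every pair of sample columns in a headerless numeric CSV using awk. The function name `pairwise_euclidean` is hypothetical, and the rows-as-probesets, columns-as-samples layout is an assumption. It covers only the distance calculation, not the complete-linkage merging that MATLAB's `linkage` performs.

```shell
#!/usr/bin/env bash
# Print "i,j,distance" for every pair of columns i < j in a headerless
# numeric CSV (assumed layout: rows = probesets, columns = samples).
pairwise_euclidean() {
    awk -F',' '
        {
            ncol = NF
            # Accumulate squared differences for each column pair, row by row.
            for (i = 1; i <= NF; i++)
                for (j = i + 1; j <= NF; j++) {
                    d = $i - $j
                    sumsq[i, j] += d * d
                }
        }
        END {
            # Euclidean distance is the square root of the summed squares.
            for (i = 1; i <= ncol; i++)
                for (j = i + 1; j <= ncol; j++)
                    printf "%d,%d,%.4f\n", i, j, sqrt(sumsq[i, j])
        }
    ' "$1"
}
```

The pairs are printed in the order (1,2), (1,3), ..., (2,3), ..., which is the order `pdist` uses for its output vector, so the values can be compared directly with `pdist(data', 'euclidean')` on a small test matrix.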

Raw Source Text

Gene-level signal estimates were derived from the CEL files by RMA-sketch normalization as a method in the apt-probeset-summarize program (see Yeo et al. 2007; PMID 7967047). Hierarchical clustering of the full dataset by probeset values was performed by complete linkage using Euclidean distance as a similarity metric in Matlab.