GSE21037 Processing Pipeline


Publication

A model for neural development and treatment of Rett syndrome using human induced pluripotent stem cells.

Cell (2010) — PMID 21074045

Dataset

GSE21037

L1 retrotransposition in neurons is mediated by MeCP2

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Gene-level signal estimates were derived from the CEL files by RMA-sketch normalization, a method in the apt-probeset-summarize program (see Yeo et al. 2007; PMID 7967047).

    apt-probeset-summarize (version not specified)
    $ Bash example
    # Install Affymetrix Power Tools (APT) from the vendor or a package channel (verify the package name before use).
    
    # Directory holding the input CEL files (replace with the actual path).
    CEL_DIR="/path/to/your/cel_files_directory"
    
    # Define output directory
    OUTPUT_DIR="gene_level_estimates"
    mkdir -p "${OUTPUT_DIR}"
    
    # --cel-files expects a text file with a 'cel_files' header line followed by
    # one CEL path per line, not a space-separated list of paths.
    CEL_LIST="${OUTPUT_DIR}/cel_files.txt"
    { echo "cel_files"; find "${CEL_DIR}" -name "*.CEL" | sort; } > "${CEL_LIST}"
    
    # Array library files (placeholders; the source text does not name the array type).
    # Assumption: an ST-type array, which uses PGF and CLF files plus a meta-probeset
    # (.mps) file to roll probesets up to gene level. Older 3' arrays use a CDF file
    # passed with -d instead of -p/-c/-m.
    PGF_FILE="path/to/your_array.pgf"
    CLF_FILE="path/to/your_array.clf"
    MPS_FILE="path/to/your_array.mps"
    
    # Run apt-probeset-summarize with the RMA-sketch analysis
    apt-probeset-summarize \
        -a rma-sketch \
        -p "${PGF_FILE}" \
        -c "${CLF_FILE}" \
        -m "${MPS_FILE}" \
        -o "${OUTPUT_DIR}" \
        --cel-files "${CEL_LIST}"
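    The summarization step typically writes a tab-delimited summary table into the output directory, preceded by metadata lines that begin with `#`. The following is a minimal sketch of how that table could be reduced to a plain numeric CSV for downstream clustering. The function name `summary_to_csv`, the file name `rma-sketch.summary.txt`, and the layout assumptions stated in the comments are illustrative and should be checked against the actual APT output.

```shell
#!/usr/bin/env bash
# Sketch: convert an APT summary table to a headerless numeric CSV.
# Assumptions (verify against your own output file):
#   - metadata lines start with '#'
#   - the first remaining line is a column header
#   - the first column holds probeset or gene identifiers

summary_to_csv() {
    local summary_file="$1"
    # Drop metadata lines, the header row, and the identifier column,
    # then replace tabs with commas.
    grep -v '^#' "${summary_file}" \
        | tail -n +2 \
        | cut -f 2- \
        | tr '\t' ','
}

# Example usage (hypothetical paths):
# summary_to_csv gene_level_estimates/rma-sketch.summary.txt > probeset_data.csv
```

    Dropping the identifier column keeps the file purely numeric; keep a copy of the original table if the identifiers are needed to label results later.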
  2.

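    This entry holds only the tail of the citation from step 1 (Yeo et al. 2007; PMID 7967047); the source text describes no separate second step. The BLAST tool listed below was inferred and is not mentioned in the source text.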

    BLAST (inferred with models/gemini-2.5-flash), legacy BLAST (circa 2007)
    $ Bash example
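    # Install BLAST+ (e.g., using conda). Note: makeblastdb and blastn are BLAST+ commands;
    # legacy BLAST used formatdb and blastall instead.
    # conda install -c bioconda blast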
    
    # --- Placeholder for reference database and query sequences ---
    # Replace 'reference.fasta' with your actual reference sequence file (e.g., genome, transcriptome).
    # Replace 'my_blast_db' with your desired database name.
    # This command creates a BLAST database from your reference FASTA file.
    makeblastdb -in reference.fasta -dbtype nucl -out my_blast_db
    
    # Replace 'query.fasta' with your actual query sequence file.
    
    # Run blastn (Nucleotide BLAST)
    # -query: Input query file
    # -db: BLAST database name
    # -out: Output file in tabular format
    # -outfmt 6: Tabular output format (standard for parsing)
    # -num_threads: Number of CPU threads to use
    blastn -query query.fasta -db my_blast_db -out blastn_results.tsv -outfmt 6 -num_threads 8
  3.

    Hierarchical clustering of the full dataset by probeset values was performed by complete linkage using Euclidean distance as a similarity metric in Matlab.

    MATLAB (version not specified; pdist, linkage, and dendrogram require the Statistics and Machine Learning Toolbox)
    $ Bash example
    #!/bin/bash
    
    # Input data file (replace with your actual path and format)
    # Example: probeset_data.csv (assuming comma-separated values)
    INPUT_DATA="probeset_data.csv"
    OUTPUT_CLUSTERING_MATRIX="hierarchical_clustering_results.csv"
    OUTPUT_DENDROGRAM_PNG="dendrogram.png"
    
    # Create a temporary MATLAB script for clustering
    cat << EOF > run_clustering.m
    % MATLAB script for hierarchical clustering
    % Load data (adjust 'readmatrix' based on your actual data format, e.g., readtable, then table2array)
    try
        data = readmatrix('$INPUT_DATA');
    catch ME
        disp(['Error loading data: ', ME.message]);
        exit(1);
    end
    
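    % Calculate pairwise Euclidean distances.
    % pdist treats each row as one observation; use data' instead if samples
    % are stored in columns and the samples are what should be clustered.
    Y = pdist(data, 'euclidean');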
    
    % Perform hierarchical clustering with complete linkage
    Z = linkage(Y, 'complete');
    
    % Save the clustering result (linkage matrix)
    writematrix(Z, '$OUTPUT_CLUSTERING_MATRIX');
    
    % Optional: Generate and save a dendrogram image
    % This might require a display server if running headless, or MATLAB's 'batch' mode
    % might handle it if 'figure('visible','off')' is used. For robust headless operation,
    % ensure Xvfb is running or consider alternative plotting libraries.
    try
        figure('visible','off'); % Create a figure without displaying it
        dendrogram(Z);
        title('Hierarchical Clustering Dendrogram');
        saveas(gcf, '$OUTPUT_DENDROGRAM_PNG');
        close(gcf); % Close the figure
    catch ME
        disp(['Warning: Could not generate dendrogram image. ', ME.message]);
    end
    
    exit(0); % Exit MATLAB successfully
    EOF
    
    # Execute the MATLAB script in batch mode
    # Assumes MATLAB is installed and its executable is in the system PATH.
    matlab -batch "run_clustering" -logfile matlab_clustering.log
    
    # Clean up the temporary MATLAB script
    rm run_clustering.m
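    As a cross-check on the distance step, the pairwise Euclidean distances can be recomputed outside MATLAB on a small subset of the data. The sketch below assumes a headerless numeric CSV with one probeset per row and one sample per column, and assumes the samples are the items being compared; the function name `sample_distances` is hypothetical. Complete linkage then defines the distance between two clusters as the largest of these pairwise distances between their members.

```shell
#!/usr/bin/env bash
# Sketch: pairwise Euclidean distances between the columns (samples)
# of a headerless numeric CSV, one probeset per row.
# Output: one line per column pair, "i j distance" (1-based column indices).

sample_distances() {
    awk -F',' '
        {
            # Accumulate squared differences for every pair of columns.
            for (i = 1; i <= NF; i++) {
                for (j = i + 1; j <= NF; j++) {
                    d = $i - $j
                    sq[i, j] += d * d
                }
            }
            ncol = NF
        }
        END {
            for (i = 1; i <= ncol; i++)
                for (j = i + 1; j <= ncol; j++)
                    printf "%d %d %.4f\n", i, j, sqrt(sq[i, j])
        }
    ' "$1"
}

# Example usage (hypothetical file name):
# sample_distances probeset_subset.csv
```

    The values should match the corresponding entries of `pdist(data', 'euclidean')` in MATLAB up to rounding, provided the same rows and columns are used.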
Raw Source Text
Gene-level signal estimates were derived from the CEL files by RMA-sketch normalization as a method in the apt-probeset-summarize program (see Yeo et al. 2007; PMID 7967047). Hierarchical clustering of the full dataset by probeset values was performed by complete linkage using Euclidean distance as a similarity metric in Matlab.