GSE21037 Processing Pipeline
Publication
A model for neural development and treatment of Rett syndrome using human induced pluripotent stem cells. Cell (2010). PMID 21074045
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Gene-level signal estimates were derived from the CEL files by RMA-sketch normalization as a method in the apt-probeset-summarize program (see Yeo et al. 2007; PMID 7967047).
apt-probeset-summarize (version not specified)
$ Bash example
# Requires Affymetrix Power Tools (APT) with apt-probeset-summarize on PATH.

# Directory containing the CEL files (placeholder path).
CEL_DIR="/path/to/your/cel_files_directory"

# Output directory
OUTPUT_DIR="gene_level_estimates"
mkdir -p "${OUTPUT_DIR}"

# --cel-files expects a text file with the header line "cel_files"
# followed by one CEL path per line.
CEL_LIST="${OUTPUT_DIR}/cel_files.txt"
{ echo "cel_files"; find "${CEL_DIR}" -name "*.CEL" | sort; } > "${CEL_LIST}"

# Library files for the array (placeholders). The array type is not stated in
# the source text. Exon and gene ST arrays use PGF/CLF files plus a
# meta-probeset (MPS) file that defines the gene-level groupings; 3' IVT arrays
# such as HG-U133_Plus_2 use a single CDF file passed with -d instead of -p/-c/-m.
PGF_FILE="/path/to/your_array.pgf"
CLF_FILE="/path/to/your_array.clf"
MPS_FILE="/path/to/your_array.mps"

# Run RMA-sketch summarization; results are written to
# ${OUTPUT_DIR}/rma-sketch.summary.txt
apt-probeset-summarize \
    -a rma-sketch \
    -p "${PGF_FILE}" \
    -c "${CLF_FILE}" \
    -m "${MPS_FILE}" \
    -o "${OUTPUT_DIR}" \
    --cel-files "${CEL_LIST}"
2
Hierarchical clustering of the full dataset by probeset values was performed by complete linkage using Euclidean distance as a similarity metric in Matlab.
MATLAB (version not specified; pdist, linkage, and dendrogram require the Statistics and Machine Learning Toolbox)
$ Bash example
#!/bin/bash
# Input: a CSV holding the numeric signal matrix only (no ID column, no header row).
# File names below are placeholders.
INPUT_DATA="probeset_data.csv"
OUTPUT_CLUSTERING_MATRIX="hierarchical_clustering_results.csv"
OUTPUT_DENDROGRAM_PNG="dendrogram.png"

# Create a temporary MATLAB script (the shell variables above are expanded into it)
cat << EOF > run_clustering.m
% Hierarchical clustering: Euclidean distance, complete linkage
try
    data = readmatrix('$INPUT_DATA');
catch ME
    disp(['Error loading data: ', ME.message]);
    exit(1);
end

% pdist clusters the ROWS of data. The source text does not say whether samples
% or probesets were clustered; pass data' instead to cluster the columns.
Y = pdist(data, 'euclidean');
Z = linkage(Y, 'complete');

% Save the linkage matrix
writematrix(Z, '$OUTPUT_CLUSTERING_MATRIX');

% Optional: save a dendrogram image without opening a window
try
    figure('visible', 'off');
    dendrogram(Z, 0);  % 0 shows every leaf; the default collapses to 30 nodes
    title('Hierarchical Clustering Dendrogram');
    saveas(gcf, '$OUTPUT_DENDROGRAM_PNG');
    close(gcf);
catch ME
    disp(['Warning: Could not generate dendrogram image. ', ME.message]);
end
exit(0);
EOF

# Run in batch mode (-batch, readmatrix, and writematrix need MATLAB R2019a or later).
# Assumes the matlab executable is on PATH.
matlab -batch "run_clustering" -logfile matlab_clustering.log

# Clean up the temporary MATLAB script
rm run_clustering.m
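The clustering example reads a purely numeric CSV, whereas APT summary files are tab-delimited, begin with `#`-prefixed metadata lines, and carry a `probeset_id` column plus a header row of CEL file names. A minimal sketch of a conversion between the two is shown here; the function name `summary_to_matrix_csv` and the file names in the usage comment are hypothetical, and the layout assumption should be checked against the actual summary file before use.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reduce an APT summary table to a numeric-only CSV.
# Assumed input layout: '#'-prefixed metadata lines, one tab-delimited header
# row (probeset_id, then sample names), then one row per probeset.
summary_to_matrix_csv() {
    awk -F'\t' -v OFS=',' '
        /^#/ { next }                          # skip metadata lines
        !seen_header { seen_header = 1; next } # skip the column header row
        {
            line = $2                          # drop column 1 (probeset_id)
            for (i = 3; i <= NF; i++) line = line OFS $i
            print line
        }
    ' "$1"
}

# Example usage (placeholder file names):
# summary_to_matrix_csv gene_level_estimates/rma-sketch.summary.txt > probeset_data.csv
```

The output keeps probesets as rows and samples as columns, so row and column order must be tracked separately if labels are needed on the dendrogram.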
Raw Source Text
Gene-level signal estimates were derived from the CEL files by RMA-sketch normalization as a method in the apt-probeset-summarize program (see Yeo et al. 2007; PMID 7967047). Hierarchical clustering of the full dataset by probeset values was performed by complete linkage using Euclidean distance as a similarity metric in Matlab.