GSE24759 Processing Pipeline

Publication

Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells.

Nature (2016) — PMID 27121842

Dataset

GSE24759

Densely interconnected transcriptional circuits control cell states in human hematopoiesis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. Transcript levels were processed from the data image files using the RMA method, as implemented in a Bioconductor R package.

    Tool: R, latest stable Bioconductor version compatible with R (inferred with models/gemini-2.5-flash)
    Bash example:
    # Install R and Bioconductor affy package (example using conda)
    # conda create -n rma_env r-base bioconductor-affy -c conda-forge -c bioconda -y
    # conda activate rma_env
    
    # Create an R script for RMA processing
    cat << 'EOF' > process_rma.R
    library(affy)
    
    # Define input directory containing .CEL files
    # Replace 'input_cel_files_dir' with the actual path to your CEL files
    cel_files_dir <- "input_cel_files_dir" 
    
    # Read Affymetrix CEL files
    # This will automatically detect and read all .CEL files in the specified directory
    # For specific files, you can use: ReadAffy(filenames=c("file1.CEL", "file2.CEL"))
    data <- ReadAffy(celfile.path=cel_files_dir)
    
    # Perform RMA normalization
    rma_data <- rma(data)
    
    # Extract normalized expression matrix
    expression_matrix <- exprs(rma_data)
    
    # Define output file path
    # Replace 'output_expression_file.tsv' with your desired output file name
    output_file <- "output_expression_file.tsv"
    
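    # Write the normalized expression matrix to a tab-separated file.
    # col.names=NA writes an empty header cell above the row names,
    # so the header line has the same number of fields as the data rows.
    write.table(expression_matrix, file=output_file, sep="\t", quote=FALSE, row.names=TRUE, col.names=NA)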
    
    cat(paste0("RMA normalization complete. Output written to: ", output_file, "\n"))
    EOF
    
    # Execute the R script
    Rscript process_rma.R
    
    # Deactivate conda environment if it was activated
    # conda deactivate
  2. To reduce batch effects, transcript levels were further corrected with the ComBat method (Johnson, 2007), which applies an empirical Bayes framework to adjust data for batch effects.

    Tool: ComBat (sva R package), latest stable version (inferred with models/gemini-2.5-flash)
    Bash example:
    # Install R and Bioconductor if not already installed
    # sudo apt-get update
    # sudo apt-get install r-base
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("sva")'
    
    # Create dummy input files for demonstration purposes.
    # In a real scenario, 'expression_matrix.tsv' would be your transcript level data
    # and 'sample_info.tsv' would contain batch information for each sample.
    # Example: expression_matrix.tsv (genes as rows, samples as columns)
    echo -e "gene\tsample1\tsample2\tsample3\tsample4" > expression_matrix.tsv
    echo -e "geneA\t100\t120\t150\t180" >> expression_matrix.tsv
    echo -e "geneB\t50\t60\t70\t80" >> expression_matrix.tsv
    echo -e "geneC\t200\t210\t220\t230" >> expression_matrix.tsv
    
    # Example: sample_info.tsv (sample IDs and corresponding batch information)
    echo -e "sample\tbatch\tcondition" > sample_info.tsv
    echo -e "sample1\tbatch1\tcontrol" >> sample_info.tsv
    echo -e "sample2\tbatch1\ttreated" >> sample_info.tsv
    echo -e "sample3\tbatch2\tcontrol" >> sample_info.tsv
    echo -e "sample4\tbatch2\ttreated" >> sample_info.tsv
    
    # R script to perform ComBat correction using the sva package
    cat << 'EOF' > run_combat.R
    library(sva)
    library(data.table) # For fread/fwrite
    
    # Load expression data
    # Assuming genes in rows, samples in columns, first column is gene ID
    expr_data <- fread("expression_matrix.tsv", data.table = FALSE)
    gene_ids <- expr_data[, 1]
    expr_matrix <- as.matrix(expr_data[, -1])
    rownames(expr_matrix) <- gene_ids
    
    # Load sample information
    sample_info <- fread("sample_info.tsv", data.table = FALSE)
    # Ensure sample order in sample_info matches expression matrix columns
    sample_info <- sample_info[match(colnames(expr_matrix), sample_info$sample), ]
    
    # Extract batch variable
    batch <- sample_info$batch
    
    # Perform ComBat correction
    # If there are other covariates to preserve (e.g., 'condition'),
    # they can be added to the 'mod' parameter:
    # mod <- model.matrix(~condition, data=sample_info)
    # combat_corrected_data <- ComBat(dat=expr_matrix, batch=batch, mod=mod)
    combat_corrected_data <- ComBat(dat=expr_matrix, batch=batch)
    
    # Prepare output dataframe
    corrected_df <- as.data.frame(combat_corrected_data)
    corrected_df <- cbind(gene = rownames(corrected_df), corrected_df)
    
    # Save batch-corrected data to a new TSV file
    fwrite(corrected_df, "corrected_expression_matrix.tsv", sep = "\t", row.names = FALSE)
    EOF
    
    # Execute the R script to perform ComBat correction
    Rscript run_combat.R
  3. The final dataset used in the paper, which contains 8968 genes, is included in the supplementary file GSE24759_data.sort.txt.

    (Inferred with models/gemini-2.5-flash)
    Bash example:
    # This step describes the final dataset file GSE24759_data.sort.txt
    # which contains 8968 genes and was used in the paper.
    # Assuming the file is available in the current directory, you can inspect its content:
    head GSE24759_data.sort.txt
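
Beyond inspecting the first lines, it can help to check the size of the final matrix against the 8968 genes stated above. The following is a minimal sketch, not part of the original pipeline. It assumes the file is tab-delimited, is available locally as plain text, and starts with exactly one header line; none of this is confirmed by the source. The function name `summarize_matrix` is hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: report the number of data rows and header fields
# in a tab-delimited matrix. Assumes the first line is a header.
summarize_matrix() {
    local file="$1"
    awk -F'\t' '
        NR == 1 { ncol = NF; next }   # header line: remember its field count
        NF > 0  { nrow++ }            # count every non-empty data line
        END     { printf "rows=%d cols=%d\n", nrow, ncol }
    ' "$file"
}

# Run the check only if the supplementary file is present in the current directory.
if [ -f GSE24759_data.sort.txt ]; then
    summarize_matrix GSE24759_data.sort.txt
fi
```

If the one-header-line assumption holds, the reported `rows` value should equal the gene count given in the step description. `cols` counts the fields of the header line, so it may or may not include a gene identifier column depending on how the file was written.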

Raw Source Text
Transcript levels were processed from data image files using RMA method, implemented by Bioconductor R package. To reduce batch effects, transcript levels were further corrected with the ComBat method (Johnson, 2007), which applies empirical Bayes framework for adjusting data for batch effects. Final dataset which was used in the paper and contains 8968 genes is included in the supplementary file GSE24759_data.sort.txt.