GSE24759 Processing Pipeline
3 steps
Publication
Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells. Nature (2016), PMID 27121842
Dataset
GSE24759: Densely interconnected transcriptional circuits control cell states in human hematopoiesis
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Transcript levels were processed from the data image files using the RMA method, as implemented in a Bioconductor R package.
Tool: R / Bioconductor, latest stable Bioconductor release compatible with the installed R version (inferred with models/gemini-2.5-flash)
Bash example:
# Install R and Bioconductor affy package (example using conda)
# conda create -n rma_env r-base bioconductor-affy -c conda-forge -c bioconda -y
# conda activate rma_env

# Create an R script for RMA processing
cat << 'EOF' > process_rma.R
library(affy)

# Define input directory containing .CEL files
# Replace 'input_cel_files_dir' with the actual path to your CEL files
cel_files_dir <- "input_cel_files_dir"

# Read Affymetrix CEL files
# This will automatically detect and read all .CEL files in the specified directory
# For specific files, you can use: ReadAffy(filenames=c("file1.CEL", "file2.CEL"))
data <- ReadAffy(celfile.path=cel_files_dir)

# Perform RMA normalization
rma_data <- rma(data)

# Extract normalized expression matrix
expression_matrix <- exprs(rma_data)

# Define output file path
# Replace 'output_expression_file.tsv' with your desired output file name
output_file <- "output_expression_file.tsv"

# Write the normalized expression matrix to a tab-separated file
# col.names=NA adds an empty header field above the row names so the header stays aligned with the data columns
write.table(expression_matrix, file=output_file, sep="\t", quote=FALSE, row.names=TRUE, col.names=NA)

cat(paste0("RMA normalization complete. Output written to: ", output_file, "\n"))
EOF

# Execute the R script
Rscript process_rma.R

# Deactivate conda environment if it was activated
# conda deactivate
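Because the RMA script reads every CEL file it finds in the input directory, it can be worth confirming the directory contents before running it. The following is only a sketch of such a pre-flight check: the temporary directory and file names are hypothetical placeholders, and the `.CEL` / `.CEL.gz` naming pattern is an assumption about how the raw files are stored.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical demo directory with empty placeholder files.
# For real data, set cel_dir to the folder holding the downloaded CEL files.
cel_dir="$(mktemp -d)"
touch "$cel_dir/sampleA.CEL" "$cel_dir/sampleB.CEL.gz" "$cel_dir/sampleC.cel" "$cel_dir/notes.txt"

# Count top-level files ending in .CEL or .CEL.gz (case-insensitive).
n_cel="$(find "$cel_dir" -maxdepth 1 -type f \( -iname '*.cel' -o -iname '*.cel.gz' \) | wc -l | tr -d ' ')"

echo "Found $n_cel CEL file(s) in $cel_dir"
if [ "$n_cel" -eq 0 ]; then
  echo "No CEL files found; check the input path before running RMA." >&2
  exit 1
fi
```

Comparing the reported count with the number of samples expected for the series is a quick way to catch an incomplete download.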
2
To reduce batch effects, transcript levels were further corrected with the ComBat method (Johnson, 2007), which applies an empirical Bayes framework to adjust data for batch effects.
Tool: ComBat (sva R package), latest stable version (inferred with models/gemini-2.5-flash)
Bash example:
# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("sva")'
# The R script below also uses data.table:
# R -e 'install.packages("data.table")'

# Create dummy input files for demonstration purposes.
# In a real scenario, 'expression_matrix.tsv' would be your transcript level data
# and 'sample_info.tsv' would contain batch information for each sample.

# Example: expression_matrix.tsv (genes as rows, samples as columns)
echo -e "gene\tsample1\tsample2\tsample3\tsample4" > expression_matrix.tsv
echo -e "geneA\t100\t120\t150\t180" >> expression_matrix.tsv
echo -e "geneB\t50\t60\t70\t80" >> expression_matrix.tsv
echo -e "geneC\t200\t210\t220\t230" >> expression_matrix.tsv

# Example: sample_info.tsv (sample IDs and corresponding batch information)
echo -e "sample\tbatch\tcondition" > sample_info.tsv
echo -e "sample1\tbatch1\tcontrol" >> sample_info.tsv
echo -e "sample2\tbatch1\ttreated" >> sample_info.tsv
echo -e "sample3\tbatch2\tcontrol" >> sample_info.tsv
echo -e "sample4\tbatch2\ttreated" >> sample_info.tsv

# R script to perform ComBat correction using the sva package
cat << 'EOF' > run_combat.R
library(sva)
library(data.table) # For fread/fwrite

# Load expression data
# Assuming genes in rows, samples in columns, first column is gene ID
expr_data <- fread("expression_matrix.tsv", data.table = FALSE)
gene_ids <- expr_data[, 1]
expr_matrix <- as.matrix(expr_data[, -1])
rownames(expr_matrix) <- gene_ids

# Load sample information
sample_info <- fread("sample_info.tsv", data.table = FALSE)

# Ensure sample order in sample_info matches expression matrix columns
sample_info <- sample_info[match(colnames(expr_matrix), sample_info$sample), ]

# Extract batch variable
batch <- sample_info$batch

# Perform ComBat correction
# If there are other covariates to preserve (e.g., 'condition'),
# they can be added to the 'mod' parameter:
# mod <- model.matrix(~condition, data=sample_info)
# combat_corrected_data <- ComBat(dat=expr_matrix, batch=batch, mod=mod)
combat_corrected_data <- ComBat(dat=expr_matrix, batch=batch)

# Prepare output dataframe
corrected_df <- as.data.frame(combat_corrected_data)
corrected_df <- cbind(gene = rownames(corrected_df), corrected_df)

# Save batch-corrected data to a new TSV file
fwrite(corrected_df, "corrected_expression_matrix.tsv", sep = "\t", row.names = FALSE)
EOF

# Execute the R script to perform ComBat correction
Rscript run_combat.R
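The ComBat example aligns the sample sheet to the matrix columns with `match()`, which yields missing batch values if any sample ID is absent from the sample sheet. A possible pre-flight check, sketched below, compares the two ID lists before R is invoked. The file names and layout mirror the dummy files in the example and are assumptions, not the actual GSE24759 metadata.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical demo inputs in a temporary directory (same layout as the dummy files above).
workdir="$(mktemp -d)"
expr="$workdir/expression_matrix.tsv"
info="$workdir/sample_info.tsv"

printf 'gene\tsample1\tsample2\tsample3\tsample4\n' > "$expr"
printf 'geneA\t100\t120\t150\t180\n' >> "$expr"

printf 'sample\tbatch\tcondition\n' > "$info"
printf 'sample1\tbatch1\tcontrol\n' >> "$info"
printf 'sample2\tbatch1\ttreated\n' >> "$info"
printf 'sample3\tbatch2\tcontrol\n' >> "$info"
printf 'sample4\tbatch2\ttreated\n' >> "$info"

# Sample IDs from the matrix header (every column except the first), sorted.
head -n 1 "$expr" | tr '\t' '\n' | tail -n +2 | LC_ALL=C sort > "$workdir/expr_samples.txt"

# Sample IDs from the sample sheet (first column, header skipped), sorted.
tail -n +2 "$info" | cut -f1 | LC_ALL=C sort > "$workdir/info_samples.txt"

# comm -23 prints IDs that appear in the matrix but not in the sample sheet.
missing="$(LC_ALL=C comm -23 "$workdir/expr_samples.txt" "$workdir/info_samples.txt")"

if [ -n "$missing" ]; then
  echo "Samples without batch annotation:" >&2
  echo "$missing" >&2
else
  echo "All matrix samples have a batch annotation."
fi
```

An empty `missing` list means every matrix column has a row in the sample sheet, so the batch vector passed to ComBat should contain no missing values.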
3
The final dataset used in the paper, which contains 8968 genes, is included in the supplementary file GSE24759_data.sort.txt.
Bash example (inferred with models/gemini-2.5-flash):
# This step describes the final dataset file GSE24759_data.sort.txt
# which contains 8968 genes and was used in the paper.
# Assuming the file is available in the current directory, you can inspect its content:
head GSE24759_data.sort.txt
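Beyond viewing the first lines, the stated gene count can be checked directly. The sketch below runs on a small hypothetical stand-in file, because the layout of GSE24759_data.sort.txt is not described here. It assumes a tab-separated table with exactly one header line and one gene per row; that assumption should be verified against the real file.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical stand-in for GSE24759_data.sort.txt (made-up values).
data_file="$(mktemp)"
printf 'gene\tsample1\tsample2\n' > "$data_file"
printf 'geneA\t7.1\t7.3\n' >> "$data_file"
printf 'geneB\t5.4\t5.2\n' >> "$data_file"
printf 'geneC\t9.8\t9.9\n' >> "$data_file"

# Assumption: one header line, then one gene per row.
n_genes="$(tail -n +2 "$data_file" | wc -l | tr -d ' ')"

# Number of tab-separated fields in the header line.
n_cols="$(head -n 1 "$data_file" | awk -F'\t' '{print NF}')"

# 3 for this demo file; for the real file the description states 8968 genes.
expected_genes=3

echo "Rows (genes): $n_genes, columns: $n_cols"
if [ "$n_genes" -ne "$expected_genes" ]; then
  echo "Gene count differs from the expected $expected_genes." >&2
fi
```

For the real file, point `data_file` at GSE24759_data.sort.txt and set `expected_genes` to 8968. A mismatch may only mean the header assumption is wrong.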
Tools Used
Raw Source Text
Transcript levels were processed from data image files using RMA method, implemented by Bioconductor R package. To reduce batch effects, transcript levels were further corrected with the ComBat method (Johnson, 2007), which applies empirical Bayes framework for adjusting data for batch effects. Final dataset which was used in the paper and contains 8968 genes is included in the supplementary file GSE24759_data.sort.txt.