GSE24759 Processing Pipeline

Publication

Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells.

Nature (2016) — PMID 27121842

Dataset

Densely interconnected transcriptional circuits control cell states in human hematopoiesis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Transcript levels were processed from the data image files using the RMA method, as implemented in a Bioconductor R package.

R / Bioconductor, latest stable version compatible with R (inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor affy package (example using conda)
# conda create -n rma_env r-base bioconductor-affy -c conda-forge -c bioconda -y
# conda activate rma_env

# Create an R script for RMA processing
cat << 'EOF' > process_rma.R
library(affy)

# Define input directory containing .CEL files
# Replace 'input_cel_files_dir' with the actual path to your CEL files
cel_files_dir <- "input_cel_files_dir" 

# Read Affymetrix CEL files
# This will automatically detect and read all .CEL files in the specified directory
# For specific files, you can use: ReadAffy(filenames=c("file1.CEL", "file2.CEL"))
data <- ReadAffy(celfile.path=cel_files_dir)

# Perform RMA normalization
rma_data <- rma(data)

# Extract normalized expression matrix
expression_matrix <- exprs(rma_data)

# Define output file path
# Replace 'output_expression_file.tsv' with your desired output file name
output_file <- "output_expression_file.tsv"

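# Write the normalized expression matrix to a tab-separated file
# col.names=NA adds an empty header cell above the row names so the header aligns with the data columns
write.table(expression_matrix, file=output_file, sep="\t", quote=FALSE, row.names=TRUE, col.names=NA)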

cat(paste0("RMA normalization complete. Output written to: ", output_file, "\n"))
EOF

# Execute the R script
Rscript process_rma.R

# Deactivate conda environment if it was activated
# conda deactivate
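
A quick pre-flight check can catch an empty or mistyped input directory before R is started. The sketch below counts CEL files by extension. The helper name `count_cel_files` and the assumption that the files end in `.CEL` or `.CEL.gz` (in any letter case) are illustrative and not part of the original pipeline description.

```shell
# Hypothetical helper: count CEL files (plain or gzipped) directly inside a directory.
# Assumes files are identified by a .cel or .cel.gz extension, matched case-insensitively.
count_cel_files() {
  find "$1" -maxdepth 1 -type f \( -iname '*.cel' -o -iname '*.cel.gz' \) | wc -l | tr -d ' '
}

# Example usage: stop early if the input directory holds no CEL files
# if [ "$(count_cel_files input_cel_files_dir)" -eq 0 ]; then
#   echo "No CEL files found in input_cel_files_dir" >&2
#   exit 1
# fi
```

The usage lines are left commented out because `input_cel_files_dir` is a placeholder path.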

To reduce batch effects, transcript levels were further corrected with the ComBat method (Johnson, 2007), which applies an empirical Bayes framework to adjust data for batch effects.

ComBat (sva R package), latest stable version (inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager"); BiocManager::install("sva")'

# Create dummy input files for demonstration purposes.
# The toy values below only illustrate the file layout. In a real scenario,
# 'expression_matrix.tsv' would hold the log2-scale RMA output from the previous step
# and 'sample_info.tsv' would contain batch information for each sample.
# Example: expression_matrix.tsv (genes as rows, samples as columns)
echo -e "gene\tsample1\tsample2\tsample3\tsample4" > expression_matrix.tsv
echo -e "geneA\t100\t120\t150\t180" >> expression_matrix.tsv
echo -e "geneB\t50\t60\t70\t80" >> expression_matrix.tsv
echo -e "geneC\t200\t210\t220\t230" >> expression_matrix.tsv

# Example: sample_info.tsv (sample IDs and corresponding batch information)
echo -e "sample\tbatch\tcondition" > sample_info.tsv
echo -e "sample1\tbatch1\tcontrol" >> sample_info.tsv
echo -e "sample2\tbatch1\ttreated" >> sample_info.tsv
echo -e "sample3\tbatch2\tcontrol" >> sample_info.tsv
echo -e "sample4\tbatch2\ttreated" >> sample_info.tsv

# R script to perform ComBat correction using the sva package
cat << 'EOF' > run_combat.R
library(sva)
library(data.table) # For fread/fwrite

# Load expression data
# Assuming genes in rows, samples in columns, first column is gene ID
expr_data <- fread("expression_matrix.tsv", data.table = FALSE)
gene_ids <- expr_data[, 1]
expr_matrix <- as.matrix(expr_data[, -1])
rownames(expr_matrix) <- gene_ids

# Load sample information
sample_info <- fread("sample_info.tsv", data.table = FALSE)
# Ensure sample order in sample_info matches expression matrix columns
idx <- match(colnames(expr_matrix), sample_info$sample)
stopifnot(!anyNA(idx)) # every expression column needs a row in sample_info
sample_info <- sample_info[idx, ]

# Extract batch variable
batch <- sample_info$batch

# Perform ComBat correction
# If there are other covariates to preserve (e.g., 'condition'),
# they can be added to the 'mod' parameter:
# mod <- model.matrix(~condition, data=sample_info)
# combat_corrected_data <- ComBat(dat=expr_matrix, batch=batch, mod=mod)
combat_corrected_data <- ComBat(dat=expr_matrix, batch=batch)

# Prepare output dataframe
corrected_df <- as.data.frame(combat_corrected_data)
corrected_df <- cbind(gene = rownames(corrected_df), corrected_df)

# Save batch-corrected data to a new TSV file
fwrite(corrected_df, "corrected_expression_matrix.tsv", sep = "\t", row.names = FALSE)
EOF

# Execute the R script to perform ComBat correction
Rscript run_combat.R
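
Because ComBat estimates batch-specific location and scale parameters, it is worth tabulating how many samples fall in each batch before relying on the correction; a batch with a single sample gives no within-batch variance to estimate from. The sketch below is one way to do that from the shell. The helper names are hypothetical, and the code assumes a tab-separated sample sheet with a header row and the batch label in the second column, as in the `sample_info.tsv` example above.

```shell
# Hypothetical helper: print "batch<TAB>count" for each batch in a sample sheet.
# Assumes a tab-separated file with one header row and the batch label in column 2.
samples_per_batch() {
  awk -F '\t' 'NR > 1 && $2 != "" { n[$2]++ } END { for (b in n) print b "\t" n[b] }' "$1" | sort
}

# Hypothetical helper: list batches that contain only one sample.
singleton_batches() {
  samples_per_batch "$1" | awk -F '\t' '$2 == 1 { print $1 }'
}

# Example usage:
# samples_per_batch sample_info.tsv
# singleton_batches sample_info.tsv
```

If `singleton_batches` prints anything, the batch layout may need a second look before the corrected values are interpreted.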

The final dataset, which was used in the paper and contains 8968 genes, is included in the supplementary file GSE24759_data.sort.txt.

(Inferred with models/gemini-2.5-flash)

$ Bash example

# This step describes the final dataset file GSE24759_data.sort.txt
# which contains 8968 genes and was used in the paper.
# Assuming the file is available in the current directory, you can inspect its content:
head GSE24759_data.sort.txt
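
Beyond inspecting the first lines, the row count can be compared with the 8968 genes reported for the final dataset. The sketch below assumes the file has exactly one header line and one gene per subsequent row; that layout is an assumption and should be confirmed with `head` first. The helper name `count_data_rows` is hypothetical.

```shell
# Hypothetical helper: count non-empty data rows, skipping the first line as a header.
# Assumes one header line followed by one gene per row.
count_data_rows() {
  awk 'NR > 1 && NF > 0 { n++ } END { print n + 0 }' "$1"
}

# Example usage (uncomment once the file is in the current directory):
# count_data_rows GSE24759_data.sort.txt
```

A count that differs from the reported gene number would suggest the layout assumption does not hold, for example extra header lines or several rows per gene.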

Raw Source Text

Transcript levels were processed from data image files using RMA method, implemented by Bioconductor R package. To reduce batch effects, transcript levels were further corrected with the ComBat method (Johnson, 2007), which applies empirical Bayes framework for adjusting data for batch effects. Final dataset which was used in the paper and contains 8968 genes is included in the supplementary file GSE24759_data.sort.txt.