GSE70685 Processing Pipeline

RNA-Seq · 3 steps

Publication

Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells.

Nature (2016) — PMID 27121842

Dataset

Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Reads for all samples were mapped to the human genome using Casava (Ver 1.8.2) with default parameters

Casava v1.8.2

$ Bash example

# Casava (CASAVA) is Illumina's legacy analysis suite for data from its sequencing instruments.
# Version 1.8 converts BCL files to FASTQ, demultiplexes samples, and aligns reads with ELAND.
# There is no single 'map' command; alignment runs as one stage of a configured pipeline.
# The following is a conceptual outline of a Casava run against the human genome, assuming default parameters.
# Replace '/path/to/run_folder' with your actual Illumina run folder containing BCL files.
# Replace '/path/to/output_dir' with your desired output directory for FASTQ and alignment files.
# Replace '/path/to/human_genome_index' with the path to your human reference (hg19, per the source's Genome_build field) prepared for ELAND.

# Example of setting up and running a Casava 1.8.2 pipeline (conceptual):
# This typically involves a configuration script and then a 'make' command.
# The alignment to the human genome would be an internal step within this pipeline.

# Navigate to the run folder (or a directory where you want to configure the run)
# cd /path/to/run_folder

# Configure BCL-to-FASTQ conversion and demultiplexing. This step generates a Makefile in the output directory.
# No extra options are passed, so Casava's defaults apply, as stated in the description.
# configureBclToFastq.pl \
# --input-dir /path/to/run_folder/Data/Intensities/BaseCalls \
# --output-dir /path/to/output_dir \
# --sample-sheet /path/to/run_folder/SampleSheet.csv

# Run the conversion from the output directory.
# cd /path/to/output_dir && make -j <number_of_cores>

# Alignment to the reference is configured in a separate step (configureAlignment.pl in Casava 1.8),
# driven by a configuration file that names the ELAND genome folder (/path/to/human_genome_index).
# Consult the Casava 1.8.2 user guide for the exact settings; none are given in the source.

# Casava 1.8.2 exposes mapping only as a pipeline stage, so no standalone alignment command is shown here.
# Modern aligners such as BWA or STAR provide a single explicit command; Casava is an older, integrated system.

# Reference genome, taken from the Genome_build field of the source text:
# Reference genome: hg19 (GRCh37)
# Source: UCSC
# Example index path (conceptual, depends on ELAND's requirements):
# /path/to/human_genome_index/hg19

RPKM values and read counts were calculated using Casava

RSEM v1.3.3 (inferred with models/gemini-2.5-flash; the source attributes this step to Casava)

$ Bash example

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# Define reference genome and annotation
# Replace with actual paths to your reference files
# The source reports Genome_build hg19, so an hg19 (GRCh37) reference is assumed here.
GENOME_FASTA="hg19.fa" # Placeholder for an hg19 genome FASTA
GTF_FILE="hg19_genes.gtf" # Placeholder for a matching hg19 gene annotation
RSEM_INDEX_BASE="rsem_ref"

# Define input and output files
# RSEM expects alignments against the transcript sequences of its own reference,
# so a genome-aligned BAM from Casava cannot be used directly.
# 'sample.bam' is assumed to be a transcriptome-aligned BAM file.
BAM_FILE="sample.bam"
SAMPLE_NAME="sample_id"
OUTPUT_DIR="rsem_output"

# Create output directory
mkdir -p "${OUTPUT_DIR}"

# 1. Build RSEM reference index (run this once per reference genome)
# This step prepares the reference for RSEM quantification.
# rsem-prepare-reference --gtf "${GTF_FILE}" "${GENOME_FASTA}" "${RSEM_INDEX_BASE}"

# 2. Quantify expression using RSEM from an aligned BAM file
# --alignments: Specifies that the input is a SAM/BAM/CRAM file (replaces the deprecated --bam flag).
# --no-qualities: Use if the BAM file does not contain quality scores (common for older alignments).
# --paired-end: Use if the input reads are paired-end (remove if single-end).
# --output-genome-bam: Outputs a genome-aligned BAM file (optional, can be removed if not needed).
# --num-threads: Number of threads to use for parallel processing.
# --estimate-rspd: Estimate read start position distribution (recommended for better accuracy).
# --seed: Random seed for reproducibility.
rsem-calculate-expression \
    --alignments \
    --no-qualities \
    --paired-end \
    --output-genome-bam \
    --num-threads 8 \
    --estimate-rspd \
    --seed 12345 \
    "${BAM_FILE}" \
    "${RSEM_INDEX_BASE}" \
    "${OUTPUT_DIR}/${SAMPLE_NAME}"

echo "Expression estimates are in '${OUTPUT_DIR}/${SAMPLE_NAME}.genes.results' (columns 'expected_count', 'TPM', 'FPKM'). RSEM reports FPKM, not RPKM."
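
The RPKM values named in this step follow the standard definition: RPKM = read count × 10^9 / (gene length in bp × total mapped reads). As a minimal sketch, independent of Casava and RSEM, the calculation could be reproduced from a count table with awk. The file name gene_counts.tsv, its three-column layout, and the toy values are hypothetical, and the total number of mapped reads is assumed to be the sum of the counts in the table.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical input: gene_id, gene length (bp), raw read count (toy values).
printf 'gene_id\tlength\tcount\n'  > gene_counts.tsv
printf 'geneA\t1000\t100\n'       >> gene_counts.tsv
printf 'geneB\t2000\t300\n'       >> gene_counts.tsv
printf 'geneC\t500\t600\n'        >> gene_counts.tsv

# RPKM = count * 1e9 / (length_bp * total_mapped_reads)
# Assumption: total mapped reads = sum of the counts in this table.
# The file is read twice: pass 1 sums the counts, pass 2 computes RPKM.
awk -F'\t' -v OFS='\t' '
    NR == FNR { if (FNR > 1) total += $3; next }
    FNR == 1  { print $0, "rpkm"; next }
    { printf "%s\t%s\t%s\t%.2f\n", $1, $2, $3, $3 * 1e9 / ($2 * total) }
' gene_counts.tsv gene_counts.tsv > gene_rpkm.tsv

cat gene_rpkm.tsv
```

In a real analysis the total would normally be the number of mapped reads in the whole library, not just the reads assigned to the genes listed.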

Analysis of differential gene expression was performed through ratio analysis and R (Deseq package)

DESeq2 (Bioconductor; inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

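# Create placeholder input files for demonstration.
# In a real scenario, 'counts_matrix.tsv' would hold the read counts from the upstream quantification step
# (Casava in this study; featureCounts, HTSeq, Salmon, or Kallisto in other pipelines) and 'sample_metadata.tsv' would be provided.
# The 4-gene toy matrix only illustrates the file format. DESeq2's dispersion fitting needs a realistic
# number of genes and may fail on data this small, so substitute real counts before running.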

# Placeholder for gene count matrix
echo -e "gene_id\tsample1\tsample2\tsample3\tsample4" > counts_matrix.tsv
echo -e "geneA\t100\t120\t50\t60" >> counts_matrix.tsv
echo -e "geneB\t50\t60\t100\t110" >> counts_matrix.tsv
echo -e "geneC\t200\t210\t220\t230" >> counts_matrix.tsv
echo -e "geneD\t10\t12\t20\t25" >> counts_matrix.tsv

# Placeholder for sample metadata
echo -e "sample\tcondition" > sample_metadata.tsv
echo -e "sample1\tcontrol" >> sample_metadata.tsv
echo -e "sample2\tcontrol" >> sample_metadata.tsv
echo -e "sample3\ttreated" >> sample_metadata.tsv
echo -e "sample4\ttreated" >> sample_metadata.tsv

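# R script for DESeq2 analysis
# (the source cites the 'Deseq' package, which may refer to the original DESeq; DESeq2 is its successor)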
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# Load count data
# The first column is assumed to be gene IDs, and subsequent columns are sample counts.
count_data <- read.table("counts_matrix.tsv", header = TRUE, row.names = 1, sep = "\t")
# DESeq2 requires integer counts
count_data <- round(count_data)

# Load sample metadata
# The first column is assumed to be sample IDs, and subsequent columns are experimental factors.
sample_data <- read.table("sample_metadata.tsv", header = TRUE, row.names = 1, sep = "\t")

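# Ensure sample names in count data and metadata match and are in the same order
stopifnot(all(colnames(count_data) %in% rownames(sample_data)))
sample_data <- sample_data[colnames(count_data), , drop = FALSE]

# Encode the condition as a factor with 'control' as the reference level
sample_data$condition <- factor(sample_data$condition, levels = c("control", "treated"))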

# Create DESeqDataSet object
# 'design' specifies the experimental design, here comparing 'condition' groups.
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_data,
                              design = ~ condition)

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results for 'treated' vs 'control'
# The contrast argument specifies the comparison: c("factor", "level_numerator", "level_denominator")
res <- results(dds, contrast = c("condition", "treated", "control"))

# Order results by adjusted p-value
res_ordered <- res[order(res$padj),]

# Save differential expression results
write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv")

# Optional: Save normalized counts
normalized_counts <- counts(dds, normalized=TRUE)
write.csv(as.data.frame(normalized_counts), file = "deseq2_normalized_counts.csv")

message("DESeq2 analysis complete. Results saved to deseq2_results.csv")
EOF

# Execute the R script
Rscript run_deseq2.R
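
The source also mentions a "ratio analysis" without defining it. One common reading is a per-gene ratio of mean expression between conditions, reported as a log2 ratio. The sketch below assumes that reading. The file rpkm_matrix.tsv, its column layout (two control columns followed by two treated columns), the toy values, and the pseudocount of 1 are all hypothetical choices, not taken from the study.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical RPKM matrix: columns 2-3 are control, columns 4-5 are treated.
printf 'gene_id\tctrl1\tctrl2\ttrt1\ttrt2\n'  > rpkm_matrix.tsv
printf 'geneA\t7\t7\t31\t31\n'               >> rpkm_matrix.tsv
printf 'geneB\t15\t15\t3\t3\n'               >> rpkm_matrix.tsv
printf 'geneC\t10\t10\t10\t10\n'             >> rpkm_matrix.tsv

# log2 ratio of mean treated expression over mean control expression.
# A pseudocount (assumed: 1) avoids division by zero for unexpressed genes.
awk -F'\t' -v pc=1 '
    NR == 1 { print "gene_id\tmean_control\tmean_treated\tlog2_ratio"; next }
    {
        ctrl = ($2 + $3) / 2
        trt  = ($4 + $5) / 2
        printf "%s\t%.2f\t%.2f\t%.3f\n", $1, ctrl, trt, log((trt + pc) / (ctrl + pc)) / log(2)
    }
' rpkm_matrix.tsv > ratio_results.tsv

cat ratio_results.tsv
```

Unlike DESeq2, a plain ratio carries no estimate of variance or significance, which may be why the study reports both analyses as separate supplementary files.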

Tools Used

DESeq2

Raw Source Text

Reads for all samples were mapped to the human genome using Casava (Ver 1.8.2) with default paramters
RPKM values and reads counts were calculated using Casava
Analysis of differntial gene expression was performed thorugh ratio analysis and R (Deseq package)
Genome_build: hg19
Supplementary_files_format_and_content: KH_ratios_results.xlsx: This file contains the results of a ratio analysis of gene expression
Supplementary_files_format_and_content: KH_DESeq_results.xlsx: This file contains the results of a gene expression analysi performed using the DEseq R package
