GSE70685 Processing Pipeline

RNA-Seq, 3 steps

Publication

Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells.

Nature (2016) — PMID 27121842

Dataset

GSE70685

Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Reads for all samples were mapped to the human genome using Casava (Ver 1.8.2) with default parameters

    Casava v1.8.2
    $ Bash example
    # Casava is an Illumina software suite, typically installed alongside sequencing instruments.
    # Casava 1.8 covers BCL-to-FASTQ conversion, demultiplexing, and alignment (with ELAND); base calling itself happens upstream on the instrument.
    # It does not expose a single 'mapping' command, because alignment is one stage of a larger pipeline run.
    # The following is a conceptual representation of a Casava run for a human genome, assuming default parameters.
    # Replace '/path/to/run_folder' with your actual Illumina run folder containing BCL files.
    # Replace '/path/to/output_dir' with your desired output directory for FASTQ and alignment files.
    # Replace '/path/to/human_genome_index' with the path to your human genome (hg19, per this dataset's Genome_build field) prepared for ELAND.
    
    # Example of setting up and running a Casava 1.8.2 pipeline (conceptual):
    # This typically involves a configuration script and then a 'make' command.
    # The alignment to the human genome would be an internal step within this pipeline.
    
    # Navigate to the run folder (or a directory where you want to configure the run)
    # cd /path/to/run_folder
    
    # Configure BCL-to-FASTQ conversion and demultiplexing. This step generates a Makefile in the output directory.
    # Alignment to the human genome is configured as a separate Casava stage (configureAlignment.pl with a
    # config file naming the ELAND analysis and genome folder); those settings are not given in the description.
    # No extra options are passed here, which corresponds to the 'default parameters' in the description.
    # configureBclToFastq.pl \
    #   --input-dir /path/to/run_folder/Data/Intensities/BaseCalls \
    #   --output-dir /path/to/output_dir \
    #   --sample-sheet /path/to/run_folder/SampleSheet.csv
    
    # After configuration, run make from the output directory to execute the configured stage.
    # cd /path/to/output_dir && make -j <number_of_cores>
    
    # Modern aligners such as BWA or STAR expose a single alignment command, but the description states
    # that mapping was done by Casava, an older integrated system, so no standalone command is shown here.
    
    # Reference genome: hg19 (GRCh37), as reported in the dataset's Genome_build field.
    # Example genome folder path (conceptual; the required layout depends on Casava's aligner):
    # /path/to/human_genome_index/hg19
  2.

    RPKM values and read counts were calculated using Casava

    RSEM (Inferred with models/gemini-2.5-flash) v1.3.3 GitHub
    $ Bash example
    # Install RSEM (if not already installed)
    # conda install -c bioconda rsem
    
    # Define reference genome and annotation
    # Replace with actual paths to your reference files
    GENOME_FASTA="GRCh37.p13.genome.fa" # Placeholder; the dataset reports Genome_build hg19 (GRCh37)
    GTF_FILE="gencode.v19.annotation.gtf" # Placeholder for a GENCODE annotation on GRCh37/hg19
    RSEM_INDEX_BASE="rsem_ref"
    
    # Define input and output files
    # RSEM expects alignments in transcript coordinates against its own reference, so a
    # genome-aligned BAM from Casava would need re-alignment first. 'sample.bam' is a placeholder.
    BAM_FILE="sample.bam"
    SAMPLE_NAME="sample_id"
    OUTPUT_DIR="rsem_output"
    
    # Create output directory
    mkdir -p "${OUTPUT_DIR}"
    
    # 1. Build RSEM reference index (run this once per reference genome)
    # This step prepares the reference for RSEM quantification.
    # rsem-prepare-reference --gtf "${GTF_FILE}" "${GENOME_FASTA}" "${RSEM_INDEX_BASE}"
    
    # 2. Quantify expression using RSEM from aligned BAM files
    # --alignments: Specifies that the input is a SAM/BAM/CRAM alignment file (RSEM 1.3.x; older releases used --bam).
    # --no-qualities: Use if the BAM file does not contain quality scores (common for older alignments).
    # --paired-end: Use if the input reads are paired-end (remove if single-end).
    # --output-genome-bam: Outputs a genome-aligned BAM file (optional, can be removed if not needed).
    # --num-threads: Number of threads to use for parallel processing.
    # --estimate-rspd: Estimate read start position distribution (recommended for better accuracy).
    # --seed: Random seed for reproducibility.
    rsem-calculate-expression \
        --alignments \
        --no-qualities \
        --paired-end \
        --output-genome-bam \
        --num-threads 8 \
        --estimate-rspd \
        --seed 12345 \
        "${BAM_FILE}" \
        "${RSEM_INDEX_BASE}" \
        "${OUTPUT_DIR}/${SAMPLE_NAME}"
    
    echo "Expression values are in '${OUTPUT_DIR}/${SAMPLE_NAME}.genes.results'. RSEM reports 'TPM' and 'FPKM' columns rather than 'RPKM'; for single-end reads FPKM is equivalent to RPKM."
  3.

    Analysis of differential gene expression was performed through ratio analysis and R (DESeq package)

    DESeq2 Bioconductor (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install R and Bioconductor if not already installed
    # sudo apt-get update
    # sudo apt-get install -y r-base
    # R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    # R -e 'BiocManager::install("DESeq2")'
    
    # Create placeholder input files for demonstration.
    # In a real scenario, 'counts_matrix.tsv' would be generated by upstream quantification tools
    # (e.g., featureCounts, HTSeq, Salmon, Kallisto) and 'sample_metadata.tsv' would be provided.
    
    # Placeholder for gene count matrix
    echo -e "gene_id\tsample1\tsample2\tsample3\tsample4" > counts_matrix.tsv
    echo -e "geneA\t100\t120\t50\t60" >> counts_matrix.tsv
    echo -e "geneB\t50\t60\t100\t110" >> counts_matrix.tsv
    echo -e "geneC\t200\t210\t220\t230" >> counts_matrix.tsv
    echo -e "geneD\t10\t12\t20\t25" >> counts_matrix.tsv
    
    # Placeholder for sample metadata
    echo -e "sample\tcondition" > sample_metadata.tsv
    echo -e "sample1\tcontrol" >> sample_metadata.tsv
    echo -e "sample2\tcontrol" >> sample_metadata.tsv
    echo -e "sample3\ttreated" >> sample_metadata.tsv
    echo -e "sample4\ttreated" >> sample_metadata.tsv
    
    # R script for DESeq2 analysis
    cat << 'EOF' > run_deseq2.R
    library(DESeq2)
    
    # Load count data
    # The first column is assumed to be gene IDs, and subsequent columns are sample counts.
    count_data <- read.table("counts_matrix.tsv", header = TRUE, row.names = 1, sep = "\t")
    # DESeq2 requires integer counts
    count_data <- round(count_data)
    
    # Load sample metadata
    # The first column is assumed to be sample IDs, and subsequent columns are experimental factors.
    sample_data <- read.table("sample_metadata.tsv", header = TRUE, row.names = 1, sep = "\t")
    
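    # Ensure every sample in the count data has metadata, then put the metadata in the same order
    stopifnot(all(colnames(count_data) %in% rownames(sample_data)))
    sample_data <- sample_data[colnames(count_data), , drop = FALSE]
    # Set 'control' as the reference level so fold changes read as treated vs control
    sample_data$condition <- factor(sample_data$condition, levels = c("control", "treated"))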
    
    # Create DESeqDataSet object
    # 'design' specifies the experimental design, here comparing 'condition' groups.
    dds <- DESeqDataSetFromMatrix(countData = count_data,
                                  colData = sample_data,
                                  design = ~ condition)
    
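    # Run DESeq2 analysis
    # Note: dispersion fitting may fail on the four-gene placeholder matrix above; use a full count matrix in practice.
    dds <- DESeq(dds)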
    
    # Get results for 'treated' vs 'control'
    # The contrast argument specifies the comparison: c("factor", "level_numerator", "level_denominator")
    res <- results(dds, contrast = c("condition", "treated", "control"))
    
    # Order results by adjusted p-value
    res_ordered <- res[order(res$padj),]
    
    # Save differential expression results
    write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv")
    
    # Optional: Save normalized counts
    normalized_counts <- counts(dds, normalized=TRUE)
    write.csv(as.data.frame(normalized_counts), file = "deseq2_normalized_counts.csv")
    
    message("DESeq2 analysis complete. Results saved to deseq2_results.csv")
    EOF
    
    # Execute the R script
    Rscript run_deseq2.R
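
Step 2 reports RPKM values without showing the arithmetic. As a minimal sketch, assuming the usual definition (reads per kilobase of transcript per million mapped reads), the calculation can be reproduced with awk. The file names, gene lengths, and counts below are hypothetical toy values, and the sum of the listed counts stands in for the total number of mapped reads. None of this is taken from Casava's own implementation.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy input: gene_id, gene length in bp, raw read count.
# These numbers are illustrative only and do not come from GSE70685.
printf '%s\t%s\t%s\n' \
    geneA 2000 100 \
    geneB 1000 50 \
    geneC 4000 850 > toy_counts.tsv

# RPKM = count * 1e9 / (gene_length_bp * total_mapped_reads)
# Assumption: the total is approximated by the sum of the counts in the table.
# The file is read twice: pass 1 sums the counts, pass 2 prints gene_id, count, RPKM.
awk -F'\t' '
    NR == FNR { total += $3; next }
    { printf "%s\t%d\t%.1f\n", $1, $3, ($3 * 1e9) / ($2 * total) }
' toy_counts.tsv toy_counts.tsv > toy_rpkm.tsv

cat toy_rpkm.tsv
```

Because RPKM divides by gene length, geneA and geneB end up with the same value here even though geneA has twice as many reads. In a real run the denominator would normally be the total number of mapped reads for the sample rather than the sum over a gene table.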


Raw Source Text
Reads for all samples were mapped to the human genome using Casava (Ver 1.8.2) with default parameters
RPKM values and read counts were calculated using Casava
Analysis of differential gene expression was performed through ratio analysis and R (DESeq package)
Genome_build: hg19
Supplementary_files_format_and_content: KH_ratios_results.xlsx: This file contains the results of a ratio analysis of gene expression
Supplementary_files_format_and_content: KH_DESeq_results.xlsx: This file contains the results of a gene expression analysis performed using the DESeq R package
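
The description does not say how the ratio analysis behind KH_ratios_results.xlsx was computed. One common reading, sketched here under that assumption, is a per-gene ratio of mean expression between two groups, reported on a log2 scale. The input table, the two-versus-two group layout, and the pseudocount of 1 are all hypothetical choices for illustration and are not taken from the dataset.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy RPKM table: gene_id, control_1, control_2, treated_1, treated_2.
printf '%s\t%s\t%s\t%s\t%s\n' \
    geneA 1 1 3 3 \
    geneB 6 8 1 1 \
    geneC 5 5 4 6 > toy_rpkm_by_sample.tsv

# Assumed pseudocount, added to both group means so that a zero mean
# does not produce a division by zero or log(0).
PSEUDOCOUNT=1

# For each gene: average each group, take treated/control, then convert to log2.
# awk's log() is the natural logarithm, so divide by log(2).
awk -F'\t' -v pc="$PSEUDOCOUNT" '
    {
        control = ($2 + $3) / 2
        treated = ($4 + $5) / 2
        ratio = (treated + pc) / (control + pc)
        printf "%s\t%.2f\t%.2f\n", $1, ratio, log(ratio) / log(2)
    }
' toy_rpkm_by_sample.tsv > toy_ratios.tsv

cat toy_ratios.tsv
```

The output columns are gene_id, ratio, and log2 ratio. A plain ratio like this carries no measure of statistical support, which is presumably why the description pairs it with a DESeq analysis.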