GSE70685 Processing Pipeline
RNA-Seq
Publication
Musashi-2 attenuates AHR signalling to expand human haematopoietic stem cells. Nature (2016), PMID 27121842
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
Reads for all samples were mapped to the human genome using Casava (ver. 1.8.2) with default parameters.
Casava v1.8.2
$ Bash example
# CASAVA is Illumina's legacy analysis suite. It has no single "map reads" command:
# a run is configured, a Makefile is generated, and `make` executes the pipeline.
# Alignment is performed by the bundled ELAND aligner.
# Everything below is a conceptual sketch and is left commented out. Check every
# option against the CASAVA 1.8.2 user guide shipped with your installation.
#
# Placeholders to replace:
#   /path/to/run_folder   Illumina run folder containing the BCL files
#   /path/to/output_dir   output directory for the FASTQ files
#   /path/to/hg19_eland   hg19 reference prepared for ELAND (the source states Genome_build: hg19)

# Step A: convert BCL files to FASTQ and demultiplex.
# "Default parameters" simply means no extra tuning options are passed.
# configureBclToFastq.pl \
#   --input-dir /path/to/run_folder/Data/Intensities/BaseCalls \
#   --output-dir /path/to/output_dir \
#   --sample-sheet /path/to/run_folder/SampleSheet.csv
# cd /path/to/output_dir && make -j <number_of_cores>

# Step B: align the FASTQ files to hg19.
# In CASAVA 1.8 alignment is configured separately with configureAlignment.pl, which
# reads a plain-text config file. The reference and analysis mode are set in that file
# (for example ELAND_GENOME /path/to/hg19_eland and ANALYSIS eland_rna), not through
# a flag on configureBclToFastq.pl. These key names are recalled from the user guide
# and should be verified there.
# configureAlignment.pl config.txt --make
# cd <OUT_DIR set in config.txt> && make -j <number_of_cores>

# Casava is an older, integrated system. A modern re-analysis would more likely use
# an aligner with a single explicit command, such as STAR or BWA.
2
RPKM values and read counts were calculated using Casava.
$ Bash example
# Note: the source states that Casava itself produced the RPKM values and read counts.
# This example substitutes RSEM as a modern stand-in; it is not the authors' method.

# Install RSEM (if not already installed)
# conda install -c bioconda rsem

# Define reference genome and annotation (placeholders; the source states Genome_build: hg19)
GENOME_FASTA="hg19.fa"                    # placeholder path to an hg19/GRCh37 FASTA
GTF_FILE="gencode.v19.annotation.gtf"     # placeholder; GENCODE v19 is annotated on GRCh37
RSEM_INDEX_BASE="rsem_ref"

# Define input and output files
# RSEM needs alignments in TRANSCRIPT coordinates against the RSEM reference.
# A genome-coordinate BAM from Casava/ELAND cannot be used directly; reads would
# have to be re-aligned to the RSEM transcript reference first.
BAM_FILE="sample.transcriptome.bam"       # placeholder name
SAMPLE_NAME="sample_id"
OUTPUT_DIR="rsem_output"

# Create output directory
mkdir -p "${OUTPUT_DIR}"

# 1. Build the RSEM reference (run once per reference genome)
# rsem-prepare-reference --gtf "${GTF_FILE}" "${GENOME_FASTA}" "${RSEM_INDEX_BASE}"

# 2. Quantify expression from the aligned BAM file
# --alignments: input is a SAM/BAM/CRAM file (older RSEM releases used --bam)
# --paired-end: input reads are paired-end (remove for single-end; the source does not state the layout)
# --output-genome-bam: also write a genome-coordinate BAM (optional)
# --num-threads: number of threads
# --estimate-rspd: estimate the read start position distribution
# --seed: random seed for reproducibility
# Add --no-qualities only if the reads carry no quality scores.
rsem-calculate-expression \
  --alignments \
  --paired-end \
  --output-genome-bam \
  --num-threads 8 \
  --estimate-rspd \
  --seed 12345 \
  "${BAM_FILE}" \
  "${RSEM_INDEX_BASE}" \
  "${OUTPUT_DIR}/${SAMPLE_NAME}"

# RSEM reports expected_count, TPM and FPKM; there is no column named RPKM.
# FPKM is equivalent to RPKM for single-end libraries.
echo "Counts and FPKM values are in '${OUTPUT_DIR}/${SAMPLE_NAME}.genes.results' (columns 'expected_count' and 'FPKM')."
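For reference, RPKM (reads per kilobase of transcript per million mapped reads) is conventionally computed as reads × 10^9 / (gene length in bp × total mapped reads). The sketch below shows only that arithmetic. It is not Casava's implementation, and the function name `rpkm` and the example numbers are hypothetical.

```shell
#!/usr/bin/env bash
# Conventional RPKM formula (illustrative only; not taken from Casava).
# Arguments: reads mapped to the gene, gene length in bp, total mapped reads.
rpkm() {
  LC_ALL=C awk -v r="$1" -v len="$2" -v total="$3" \
    'BEGIN { printf "%.4f\n", r * 1e9 / (len * total) }'
}

# Hypothetical example: 500 reads on a 2,000 bp gene in a library
# of 10 million mapped reads.
rpkm 500 2000 10000000   # prints 25.0000
```

Because the value is normalised by both gene length and library size, it can be compared across genes and samples, which raw read counts cannot.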
3
Analysis of differential gene expression was performed through ratio analysis and R (DESeq package).
$ Bash example
# Note: the source names the DESeq package. This example substitutes its successor,
# DESeq2, and does not cover the separate ratio analysis.

# Install R and Bioconductor if not already installed
# sudo apt-get update
# sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("DESeq2")'

# Create placeholder input files for demonstration.
# In a real analysis, 'counts_matrix.tsv' would come from upstream quantification
# and 'sample_metadata.tsv' would describe the experiment.
# The four-gene matrix below only illustrates the file format. DESeq2's dispersion
# fit is not expected to work on so few genes, so substitute real tables before running.

# Placeholder gene count matrix
echo -e "gene_id\tsample1\tsample2\tsample3\tsample4" > counts_matrix.tsv
echo -e "geneA\t100\t120\t50\t60" >> counts_matrix.tsv
echo -e "geneB\t50\t60\t100\t110" >> counts_matrix.tsv
echo -e "geneC\t200\t210\t220\t230" >> counts_matrix.tsv
echo -e "geneD\t10\t12\t20\t25" >> counts_matrix.tsv

# Placeholder sample metadata
echo -e "sample\tcondition" > sample_metadata.tsv
echo -e "sample1\tcontrol" >> sample_metadata.tsv
echo -e "sample2\tcontrol" >> sample_metadata.tsv
echo -e "sample3\ttreated" >> sample_metadata.tsv
echo -e "sample4\ttreated" >> sample_metadata.tsv

# R script for the DESeq2 analysis
cat << 'EOF' > run_deseq2.R
library(DESeq2)

# Load count data: first column holds gene IDs, remaining columns are per-sample counts.
count_data <- read.table("counts_matrix.tsv", header = TRUE, row.names = 1, sep = "\t")

# DESeq2 requires integer counts
count_data <- round(count_data)

# Load sample metadata: first column holds sample IDs, remaining columns are experimental factors.
sample_data <- read.table("sample_metadata.tsv", header = TRUE, row.names = 1, sep = "\t")

# Put the metadata rows in the same order as the count matrix columns
sample_data <- sample_data[colnames(count_data), , drop = FALSE]

# Make the condition an explicit factor with 'control' as the reference level
sample_data$condition <- factor(sample_data$condition, levels = c("control", "treated"))

# Create the DESeqDataSet; the design compares the 'condition' groups
dds <- DESeqDataSetFromMatrix(countData = count_data,
                              colData = sample_data,
                              design = ~ condition)

# Run the DESeq2 analysis
dds <- DESeq(dds)

# Results for 'treated' vs 'control'
# contrast = c("factor", "numerator level", "denominator level")
res <- results(dds, contrast = c("condition", "treated", "control"))

# Order results by adjusted p-value
res_ordered <- res[order(res$padj), ]

# Save differential expression results
write.csv(as.data.frame(res_ordered), file = "deseq2_results.csv")

# Optional: save normalized counts
normalized_counts <- counts(dds, normalized = TRUE)
write.csv(as.data.frame(normalized_counts), file = "deseq2_normalized_counts.csv")

message("DESeq2 analysis complete. Results saved to deseq2_results.csv")
EOF

# Execute the R script
Rscript run_deseq2.R
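The source does not define the "ratio analysis" mentioned in step 3. One common reading is a per-gene log2 ratio of expression values (for example RPKM) between two conditions. The sketch below shows that calculation under this assumption; the function name `log2_ratio`, the pseudocount default of 1, and the example values are all hypothetical and not taken from the publication.

```shell
#!/usr/bin/env bash
# Per-gene log2 expression ratio between two conditions (assumed interpretation
# of "ratio analysis"; not confirmed by the source).
# Arguments: value in condition A, value in condition B, optional pseudocount (default 1).
# The pseudocount avoids division by zero for genes with no expression.
log2_ratio() {
  local pc="${3:-1}"
  LC_ALL=C awk -v a="$1" -v b="$2" -v pc="$pc" \
    'BEGIN { printf "%.4f\n", log((a + pc) / (b + pc)) / log(2) }'
}

# Hypothetical example: (7 + 1) / (1 + 1) = 4, so the log2 ratio is 2.
log2_ratio 7 1   # prints 2.0000
```

Unlike DESeq, a plain ratio carries no estimate of variance, so it ranks genes by effect size without indicating statistical significance.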
Raw Source Text
Reads for all samples were mapped to the human genome using Casava (Ver 1.8.2) with default paramters
RPKM values and reads counts were calculated using Casava
Analysis of differntial gene expression was performed thorugh ratio analysis and R (Deseq package)
Genome_build: hg19
Supplementary_files_format_and_content: KH_ratios_results.xlsx: This file contains the results of a ratio analysis of gene expression
Supplementary_files_format_and_content: KH_DESeq_results.xlsx: This file contains the results of a gene expression analysi performed using the DEseq R package