GSE112479 Processing Pipeline
3 steps
Publication
The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis. Cell Stem Cell (2019), PMID 31588046
Dataset
GSE112479: The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
STAR was used to map sequencing reads to human reference genome (Ensembl annotation of hg19 assembly).
$ Bash example
# Install STAR
# conda install -c bioconda star

# Define variables
GENOME_DIR="path/to/hg19_star_index"   # Path to the pre-built STAR genome index
FASTA_FILE="path/to/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa"   # Placeholder for hg19 FASTA
GTF_FILE="path/to/Homo_sapiens.GRCh37.75.gtf"   # Placeholder for Ensembl hg19 GTF
READS_R1="input_R1.fastq.gz"           # Input FASTQ file for Read 1 (or single-end reads)
READS_R2="input_R2.fastq.gz"           # Input FASTQ file for Read 2 (omit for single-end reads)
OUTPUT_DIR="star_output"               # Directory for alignment output
OUTPUT_PREFIX="aligned_reads"          # Prefix for output files
THREADS=8                              # Number of CPU threads to use, adjust based on available resources
RAM_LIMIT_BAM_SORT=30000000000         # RAM limit for BAM sorting in bytes (e.g., 30GB), adjust based on available RAM

# --- Reference Data Preparation (Run once if index not available) ---
# Download reference files (example for Ensembl hg19/GRCh37 release 75)
# mkdir -p references/hg19_ensembl_75
# cd references/hg19_ensembl_75
# wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
# wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip *.gz
# cd -

# Create STAR genome index (run once with appropriate FASTA and GTF files)
# mkdir -p "${GENOME_DIR}"
# STAR --runMode genomeGenerate \
#   --genomeDir "${GENOME_DIR}" \
#   --genomeFastaFiles "${FASTA_FILE}" \
#   --sjdbGTFfile "${GTF_FILE}" \
#   --sjdbOverhang 100 \
#   --runThreadN "${THREADS}"

# --- Alignment Step ---
# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run STAR alignment
# (the filtering and alignment parameters below are assumed; the source does not list STAR parameters)
STAR --runThreadN "${THREADS}" \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS_R1}" "${READS_R2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}_" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outSAMunmapped Within KeepPairs \
  --outFilterType BySJout \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.04 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM "${RAM_LIMIT_BAM_SORT}"
2
Read counts over transcripts were calculated using HTSeq v.0.6.0 based on Ensembl annotation of hg19 genome.
HTSeq v0.6.0
$ Bash example
# Install HTSeq (if not already installed)
# conda install -c bioconda htseq

# Define input and output files
INPUT_BAM="aligned_reads.bam"   # Replace with your actual alignment file
OUTPUT_COUNTS="transcript_counts.txt"

# Download Ensembl hg19 GTF annotation (example, replace with actual path if already downloaded)
# For hg19 (GRCh37), Ensembl release 75 is a common choice.
# wget -O Homo_sapiens.GRCh37.75.gtf.gz ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip Homo_sapiens.GRCh37.75.gtf.gz

# Placeholder for annotation GTF path
ANNOTATION_GTF="/path/to/Homo_sapiens.GRCh37.75.gtf"   # Replace with your actual GTF path

# Run htseq-count
# Parameters:
#   -f bam: Input file format is BAM
#   -r pos: Assume input BAM is sorted by position
#   -s no: Assume unstranded library (adjust to 'yes' or 'reverse' if known)
#   -a 10: Minimum alignment quality score (adjust as needed)
#   -t exon: Feature type to count (default for gene/transcript counting)
#   -i gene_id: Attribute in GTF to use as feature ID (default)
htseq-count -f bam -r pos -s no -a 10 "${INPUT_BAM}" "${ANNOTATION_GTF}" > "${OUTPUT_COUNTS}"
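htseq-count writes one two-column file per sample and appends special counters whose names start with `__` (for example `__no_feature`), whereas an edgeR analysis typically reads a single gene-by-sample count matrix. The source does not say how the per-sample files were combined, so the following is only a minimal sketch of one way to do it with awk. The directory `htseq_demo`, the file names, and the toy counts are hypothetical stand-ins for real htseq-count output.

```shell
#!/bin/sh
set -eu

# Hypothetical demo inputs standing in for two per-sample htseq-count outputs.
mkdir -p htseq_demo
printf 'geneA\t100\ngeneB\t50\n__no_feature\t7\n__ambiguous\t2\n' > htseq_demo/sample1.counts.txt
printf 'geneA\t120\ngeneB\t60\n__no_feature\t9\n__ambiguous\t1\n' > htseq_demo/sample2.counts.txt

# Header row: "gene" plus one column per sample, in the order the files are passed to awk.
printf 'gene\tsample1\tsample2\n' > htseq_demo/counts.tsv

# Merge the per-sample files into one matrix, skipping the "__" special counters.
awk -F'\t' -v OFS='\t' '
  FNR == 1 { nfiles++ }                 # first line of each input file starts a new column
  /^__/    { next }                     # drop htseq-count special counters
  {
    if (!($1 in seen)) { seen[$1] = 1; order[++ngenes] = $1 }
    count[$1, nfiles] = $2
  }
  END {
    for (g = 1; g <= ngenes; g++) {
      line = order[g]
      for (f = 1; f <= nfiles; f++) {
        # Genes absent from a file get a count of 0.
        line = line OFS (((order[g], f) in count) ? count[order[g], f] : 0)
      }
      print line
    }
  }
' htseq_demo/sample1.counts.txt htseq_demo/sample2.counts.txt >> htseq_demo/counts.tsv

cat htseq_demo/counts.tsv
```

Genes are emitted in the order they first appear, and the same pattern extends to more samples by listing more files and header columns.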
3
Differential expression analysis was performed using EdgeR.
edgeR v3.42.0 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install R and Bioconductor if not already present
# sudo apt-get update && sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("edgeR")'

# Create dummy input files for demonstration if they don't exist
# In a real scenario, counts.tsv would be generated by a quantification tool (e.g., featureCounts, htseq-count)
# and design.tsv would be a metadata file describing your samples.
if [ ! -f "counts.tsv" ]; then
  {
    echo -e "gene\tsample1_control\tsample2_control\tsample3_treatment\tsample4_treatment"
    echo -e "geneA\t100\t120\t500\t550"
    echo -e "geneB\t50\t60\t30\t35"
    echo -e "geneC\t200\t210\t220\t230"
    echo -e "geneD\t10\t12\t100\t110"
    echo -e "geneE\t5\t7\t8\t9"
    echo -e "geneF\t1000\t1100\t1050\t1150"
    echo -e "geneG\t20\t25\t10\t12"
    echo -e "geneH\t150\t160\t70\t80"
    echo -e "geneI\t30\t35\t150\t160"
    echo -e "geneJ\t80\t90\t40\t45"
    echo -e "geneK\t120\t130\t60\t65"
    echo -e "geneL\t25\t28\t120\t130"
    echo -e "geneM\t5\t6\t20\t22"
    echo -e "geneN\t10\t11\t5\t6"
    echo -e "geneO\t15\t17\t30\t32"
    echo -e "geneP\t20\t22\t10\t11"
    echo -e "geneQ\t25\t27\t50\t55"
    echo -e "geneR\t30\t33\t15\t17"
    echo -e "geneS\t35\t38\t70\t75"
    echo -e "geneT\t40\t42\t20\t22"
    echo -e "geneU\t45\t48\t90\t95"
    echo -e "geneV\t50\t52\t25\t27"
    echo -e "geneW\t55\t58\t110\t115"
    echo -e "geneX\t60\t62\t30\t32"
    echo -e "geneY\t65\t68\t130\t135"
    echo -e "geneZ\t70\t72\t35\t37"
  } > counts.tsv   # Redirect the rows into the file instead of printing them to the terminal
  echo "Created dummy counts.tsv"
fi

if [ ! -f "design.tsv" ]; then
  {
    echo -e "sample\tcondition"
    echo -e "sample1_control\tcontrol"
    echo -e "sample2_control\tcontrol"
    echo -e "sample3_treatment\ttreatment"
    echo -e "sample4_treatment\ttreatment"
  } > design.tsv
  echo "Created dummy design.tsv"
fi

# Create an R script for edgeR analysis
cat << 'EOF' > run_edger.R
library(edgeR)

# --- Configuration ---
counts_file <- "counts.tsv"
design_file <- "design.tsv"
output_file <- "de_results.tsv"

# --- Load Data ---
counts_data <- read.delim(counts_file, row.names = 1, stringsAsFactors = FALSE)
sample_info <- read.delim(design_file, row.names = 1, stringsAsFactors = TRUE)

# Ensure sample order matches between counts and design
sample_info <- sample_info[colnames(counts_data), , drop = FALSE]

# --- Create DGEList object ---
dge <- DGEList(counts = counts_data, group = sample_info$condition)

# --- Filtering low-expressed genes ---
# Keep genes with sufficient counts in enough samples
# (filterByExpr defaults, which depend on the smallest group size)
keep <- filterByExpr(dge, group = sample_info$condition)
dge <- dge[keep, , keep.lib.sizes = FALSE]

# --- Normalization ---
dge <- calcNormFactors(dge)

# --- Design Matrix ---
design <- model.matrix(~0 + group, data = dge$samples)
colnames(design) <- levels(dge$samples$group)

# --- Estimate Dispersion ---
dge <- estimateDisp(dge, design)

# --- Fit GLM and Test for Differential Expression ---
fit <- glmFit(dge, design)

# Define contrasts for comparison (e.g., Treatment vs Control)
contrast_matrix <- makeContrasts(Treatment_vs_Control = treatment - control, levels = design)
lrt <- glmLRT(fit, contrast = contrast_matrix[, "Treatment_vs_Control"])

# --- Extract Results ---
results <- topTags(lrt, n = Inf, adjust.method = "BH", sort.by = "PValue")

# --- Write Results ---
write.table(results$table, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE)
message(paste("Differential expression results saved to:", output_file))
EOF

# Execute the R script
Rscript run_edger.R
Raw Source Text
STAR was used to map sequencing reads to human reference genome (Ensembl annotation of hg19 assembly).
Read counts over transcripts were calculated using HTSeq v.0.6.0 based on Ensembl annotation of hg19 genome.
Differential expression analysis was performed using EdgeR.
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: Tab-delimited tables of RPKM values
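The raw text lists the supplementary files as tables of RPKM values, but it does not describe how they were computed. As an illustration only, the standard definition RPKM = count × 10^9 / (gene length in bp × total counted reads) can be sketched with awk as below. The directory `rpkm_demo`, the file names, the gene lengths, and the counts are hypothetical, and the authors may have used a different length definition or library-size estimate.

```shell
#!/bin/sh
set -eu

# Hypothetical demo inputs: gene lengths (bp) and raw counts for one sample.
mkdir -p rpkm_demo
printf 'geneA\t2000\ngeneB\t1000\n' > rpkm_demo/gene_lengths.tsv
printf 'geneA\t100\ngeneB\t300\n'   > rpkm_demo/sample1.counts.tsv

# RPKM = count * 1e9 / (gene_length_bp * total_counted_reads)
awk -F'\t' '
  FNR == NR { len[$1] = $2; next }                    # first file: gene lengths
  { count[$1] = $2; total += $2; order[++n] = $1 }    # second file: raw counts
  END {
    for (i = 1; i <= n; i++) {
      g = order[i]
      # Skip genes without a usable length rather than dividing by zero.
      if ((g in len) && len[g] > 0 && total > 0)
        printf "%s\t%.2f\n", g, count[g] * 1e9 / (len[g] * total)
    }
  }
' rpkm_demo/gene_lengths.tsv rpkm_demo/sample1.counts.tsv > rpkm_demo/sample1.rpkm.tsv

cat rpkm_demo/sample1.rpkm.tsv
```

Here the library size is simply the sum of the counts in the file. Within R, edgeR's `rpkm()` function is an alternative that takes gene lengths and can use normalized library sizes.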