GSE112479 Processing Pipeline — Yeo Lab Publications

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell Stem Cell (2019), PMID 31588046

Dataset

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


1

STAR was used to map sequencing reads to the human reference genome (Ensembl annotation of the hg19 assembly).

STAR v2.7.11a

$ Bash example

# Install STAR
# conda install -c bioconda star

# Define variables
GENOME_DIR="path/to/hg19_star_index" # Path to the pre-built STAR genome index
FASTA_FILE="path/to/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa" # Placeholder for hg19 FASTA
GTF_FILE="path/to/Homo_sapiens.GRCh37.75.gtf" # Placeholder for Ensembl hg19 GTF
READS_R1="input_R1.fastq.gz" # Input FASTQ file for Read 1 (or single-end reads)
READS_R2="input_R2.fastq.gz" # Input FASTQ file for Read 2 (omit for single-end reads)
OUTPUT_DIR="star_output" # Directory for alignment output
OUTPUT_PREFIX="aligned_reads" # Prefix for output files
THREADS=8 # Number of CPU threads to use, adjust based on available resources
RAM_LIMIT_BAM_SORT=30000000000 # RAM limit for BAM sorting in bytes (e.g., 30GB), adjust based on available RAM

# --- Reference Data Preparation (Run once if index not available) ---
# Download reference files (example for Ensembl hg19/GRCh37 release 75)
# mkdir -p references/hg19_ensembl_75
# cd references/hg19_ensembl_75
# wget ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
# wget ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip *.gz
# cd -

# Create STAR genome index (run once with appropriate FASTA and GTF files)
# mkdir -p "${GENOME_DIR}"
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "${FASTA_FILE}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --sjdbOverhang 100 \
#      --runThreadN "${THREADS}"

# --- Alignment Step ---
# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run STAR alignment
STAR --runThreadN "${THREADS}" \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outSAMunmapped Within KeepPairs \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM "${RAM_LIMIT_BAM_SORT}"
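
After alignment, it is usually worth checking the mapping rate before counting reads. The sketch below is not part of the published pipeline. It assumes STAR wrote its summary to a `Log.final.out` file containing a line of the form `Uniquely mapped reads % | 90.00%`, and the helper name `star_unique_pct` is hypothetical.

```shell
# Print the uniquely mapped percentage from a STAR Log.final.out file.
# Assumption: the summary line looks like "Uniquely mapped reads % |<TAB>90.00%".
star_unique_pct() {
  awk -F'|' '/Uniquely mapped reads %/ {
    gsub(/[ \t%]/, "", $2)   # strip whitespace and the percent sign
    print $2
  }' "$1"
}

# Example call, using the variables defined in the alignment step:
# star_unique_pct "${OUTPUT_DIR}/${OUTPUT_PREFIX}_Log.final.out"
```

A low value here often points to a mismatched reference or to adapter contamination, so it can save time to look at it before running the counting step.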


2

Read counts over transcripts were calculated using HTSeq v.0.6.0 based on Ensembl annotation of hg19 genome.

HTSeq v0.6.0

$ Bash example

# Install HTSeq (if not already installed)
# conda install -c bioconda htseq

# Define input and output files
INPUT_BAM="star_output/aligned_reads_Aligned.sortedByCoord.out.bam" # Coordinate-sorted BAM written by the STAR step; replace with your actual alignment file
OUTPUT_COUNTS="transcript_counts.txt"

# Download Ensembl hg19 GTF annotation (example, replace with actual path if already downloaded)
# For hg19 (GRCh37), Ensembl release 75 is a common choice.
# wget -O Homo_sapiens.GRCh37.75.gtf.gz ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip Homo_sapiens.GRCh37.75.gtf.gz

# Placeholder for annotation GTF path
ANNOTATION_GTF="/path/to/Homo_sapiens.GRCh37.75.gtf" # Replace with your actual GTF path

# Run htseq-count
# Parameters:
# -f bam: Input file format is BAM
# -r pos: Assume input BAM is sorted by position
# -s no: Assume unstranded library (adjust to 'yes' or 'reverse' if known)
# -a 10: Minimum alignment quality score (adjust as needed)
# -t exon: Feature type to count (the htseq-count default)
# -i gene_id: GTF attribute used as the feature ID (the htseq-count default)
htseq-count -f bam -r pos -s no -a 10 -t exon -i gene_id "${INPUT_BAM}" "${ANNOTATION_GTF}" > "${OUTPUT_COUNTS}"
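
htseq-count writes one two-column file per sample, while the edgeR step expects a single gene-by-sample matrix. The sketch below shows one possible way to bridge the two; it is not taken from the publication. It assumes each per-sample file is tab-separated, that the summary rows start with `__` (for example `__no_feature`; older HTSeq releases may name them differently), and that sample names can be derived from file names. The function name `merge_htseq_counts` is hypothetical.

```shell
# Merge per-sample htseq-count outputs into one tab-separated count matrix.
# Usage: merge_htseq_counts sampleA.txt sampleB.txt ... > counts.tsv
# Assumption: column names are taken from the input file names.
merge_htseq_counts() {
  awk -F'\t' '
    FNR == 1 {                      # first line of each input file
      n++
      name = FILENAME
      sub(/.*\//, "", name)         # drop leading directories
      sub(/\.[^.]*$/, "", name)     # drop the file extension
      samples[n] = name
    }
    /^__/ { next }                  # skip htseq-count summary rows
    {
      if (!($1 in seen)) { seen[$1] = 1; order[++g] = $1 }
      counts[$1, n] = $2
    }
    END {
      printf "gene"
      for (j = 1; j <= n; j++) printf "\t%s", samples[j]
      printf "\n"
      for (i = 1; i <= g; i++) {
        printf "%s", order[i]
        for (j = 1; j <= n; j++)
          printf "\t%s", ((order[i], j) in counts) ? counts[order[i], j] : 0
        printf "\n"
      }
    }
  ' "$@"
}
```

The header starts with a `gene` column so that the result has the same layout as the `counts.tsv` file read by the edgeR script in the next step.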

3

Differential expression analysis was performed using edgeR.

edgeR v3.42.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install R and Bioconductor if not already present
# sudo apt-get update && sudo apt-get install -y r-base
# R -e 'if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
# R -e 'BiocManager::install("edgeR")'

# Create dummy input files for demonstration if they don't exist
# In a real scenario, counts.tsv would be generated by a quantification tool (e.g., featureCounts, htseq-count)
# and design.tsv would be a metadata file describing your samples.
if [ ! -f "counts.tsv" ]; then
  # Group the echo commands so their combined output is written to counts.tsv
  {
    echo -e "gene\tsample1_control\tsample2_control\tsample3_treatment\tsample4_treatment"
    echo -e "geneA\t100\t120\t500\t550"
    echo -e "geneB\t50\t60\t30\t35"
    echo -e "geneC\t200\t210\t220\t230"
    echo -e "geneD\t10\t12\t100\t110"
    echo -e "geneE\t5\t7\t8\t9"
    echo -e "geneF\t1000\t1100\t1050\t1150"
    echo -e "geneG\t20\t25\t10\t12"
    echo -e "geneH\t150\t160\t70\t80"
    echo -e "geneI\t30\t35\t150\t160"
    echo -e "geneJ\t80\t90\t40\t45"
    echo -e "geneK\t120\t130\t60\t65"
    echo -e "geneL\t25\t28\t120\t130"
    echo -e "geneM\t5\t6\t20\t22"
    echo -e "geneN\t10\t11\t5\t6"
    echo -e "geneO\t15\t17\t30\t32"
    echo -e "geneP\t20\t22\t10\t11"
    echo -e "geneQ\t25\t27\t50\t55"
    echo -e "geneR\t30\t33\t15\t17"
    echo -e "geneS\t35\t38\t70\t75"
    echo -e "geneT\t40\t42\t20\t22"
    echo -e "geneU\t45\t48\t90\t95"
    echo -e "geneV\t50\t52\t25\t27"
    echo -e "geneW\t55\t58\t110\t115"
    echo -e "geneX\t60\t62\t30\t32"
    echo -e "geneY\t65\t68\t130\t135"
    echo -e "geneZ\t70\t72\t35\t37"
  } > counts.tsv
  echo "Created dummy counts.tsv"
fi

if [ ! -f "design.tsv" ]; then
  # Group the echo commands so their combined output is written to design.tsv
  {
    echo -e "sample\tcondition"
    echo -e "sample1_control\tcontrol"
    echo -e "sample2_control\tcontrol"
    echo -e "sample3_treatment\ttreatment"
    echo -e "sample4_treatment\ttreatment"
  } > design.tsv
  echo "Created dummy design.tsv"
fi

# Create an R script for edgeR analysis
cat << 'EOF' > run_edger.R
library(edgeR)

# --- Configuration ---
counts_file <- "counts.tsv"
design_file <- "design.tsv"
output_file <- "de_results.tsv"

# --- Load Data ---
counts_data <- read.delim(counts_file, row.names = 1, stringsAsFactors = FALSE)
sample_info <- read.delim(design_file, row.names = 1, stringsAsFactors = TRUE)

# Ensure sample order matches between counts and design
sample_info <- sample_info[colnames(counts_data), , drop = FALSE]

# --- Create DGEList object ---
dge <- DGEList(counts = counts_data, group = sample_info$condition)

# --- Filtering low-expressed genes ---
# filterByExpr keeps genes with enough reads (defaults: min.count = 10, min.total.count = 15)
# in at least as many samples as the smallest group
keep <- filterByExpr(dge, group = sample_info$condition)
dge <- dge[keep, , keep.lib.sizes = FALSE]

# --- Normalization ---
dge <- calcNormFactors(dge)

# --- Design Matrix ---
design <- model.matrix(~0 + group, data = dge$samples)
colnames(design) <- levels(dge$samples$group)

# --- Estimate Dispersion ---
dge <- estimateDisp(dge, design)

# --- Fit GLM and Test for Differential Expression ---
fit <- glmFit(dge, design)

# Define contrasts for comparison (e.g., Treatment vs Control)
contrast_matrix <- makeContrasts(Treatment_vs_Control = treatment - control, levels = design)
lrt <- glmLRT(fit, contrast = contrast_matrix[, "Treatment_vs_Control"])

# --- Extract Results ---
results <- topTags(lrt, n = Inf, adjust.method = "BH", sort.by = "PValue")

# --- Write Results ---
write.table(results$table, file = output_file, sep = "\t", quote = FALSE, row.names = TRUE)

message(paste("Differential expression results saved to:", output_file))
EOF

# Execute the R script
Rscript run_edger.R
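
Once the R script has run, a quick summary of the results table can serve as a sanity check. The sketch below counts genes under an FDR cutoff. It assumes the table was written by `write.table(..., row.names = TRUE)` as in the script above, which leaves the header one field shorter than the data rows, and that it contains an `FDR` column as produced by `topTags`. The helper name `count_significant` and the 0.05 default cutoff are illustrative choices, not values from the publication.

```shell
# Count genes below an FDR cutoff in a topTags table written by write.table().
# Usage: count_significant de_results.tsv [cutoff]
count_significant() {
  awk -F'\t' -v cutoff="${2:-0.05}" '
    NR == 1 {
      # Locate the FDR column by name and remember the header width
      for (i = 1; i <= NF; i++) if ($i == "FDR") col = i
      hdr = NF
      next
    }
    col {
      # Data rows carry the row name as an extra leading field,
      # so shift the column index by the width difference.
      c = col + (NF - hdr)
      if ($c + 0 < cutoff + 0) n++
    }
    END { print n + 0 }
  ' "$1"
}

# Example call:
# count_significant de_results.tsv 0.05
```

Looking the column up by name rather than by position keeps the helper working if the test statistic columns change, for instance when switching from `glmLRT` to a quasi-likelihood test.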

Tools Used

STAR