GSE112383 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell stem cell (2019) — PMID 31588046

Dataset

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [RNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

STAR was used to map sequencing reads to human reference genome (Ensembl annotation of hg19 assembly).

STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for clarity
GENOME_DIR="/path/to/STAR_index/hg19_ensembl" # STAR index built with hg19 and Ensembl annotation
READS_R1="sample_R1.fastq.gz" # Placeholder for input forward reads
READS_R2="sample_R2.fastq.gz" # Placeholder for input reverse reads (remove if single-end)
OUTPUT_PREFIX="sample_name_" # Prefix for output files
THREADS=8 # Number of threads to use for alignment
RAM_LIMIT_BAM_SORT=30000000000 # 30GB for BAM sorting (adjust based on available RAM)

# Run STAR alignment
STAR --runMode alignReads \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS_R1}" "${READS_R2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --runThreadN "${THREADS}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within KeepPairs \
  --outFilterType BySJout \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.1 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM "${RAM_LIMIT_BAM_SORT}"

View on GitHub

Read counts over transcripts were calculated using HTSeq v.0.6.0 based on Ensembl annotation of hg19 genome.

HTSeq v0.6.0 GitHub

$ Bash example

# Install HTSeq (if not already installed)
# conda install -c bioconda htseq=0.6.0

# Define input and output files
INPUT_BAM="aligned_reads.bam" # Placeholder for your alignment file
ANNOTATION_GTF="Homo_sapiens.GRCh37.75.gtf" # Ensembl hg19 annotation (GRCh37 is hg19)
OUTPUT_COUNTS="transcript_counts.txt"

# Download annotation if not present (example for Ensembl release 75, which corresponds to GRCh37/hg19)
# wget -O ${ANNOTATION_GTF}.gz ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
# gunzip ${ANNOTATION_GTF}.gz

# Run htseq-count to calculate read counts over exons (representing transcripts/genes)
# -f bam: Input file format is BAM
# -s no: Assume unstranded data (adjust to yes/reverse if known for your assay)
# -a 10: Minimum alignment quality score of 10
# -m union: Union mode for handling reads overlapping multiple features
# --type exon: Count reads mapping to 'exon' features in the GTF
# --idattr gene_id: Use 'gene_id' attribute from GTF as feature identifier
htseq-count \
    -f bam \
    -s no \
    -a 10 \
    -m union \
    --type exon \
    --idattr gene_id \
    "${INPUT_BAM}" \
    "${ANNOTATION_GTF}" \
    > "${OUTPUT_COUNTS}"

View on GitHub

Differential expression analysis was performed using EdgeR.

edgeR v3.46.0 GitHub

$ Bash example

# Install R and Bioconductor if not already present
# R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
# R -e "BiocManager::install('edgeR')"

# This script assumes you have two input files:
# 1. counts.tsv: A tab-separated file with gene IDs as row names and sample counts as columns.
#    Example:
#    gene\tsample1\tsample2\tsample3\tsample4
#    geneA\t100\t120\t50\t60
#    geneB\t20\t25\t80\t90
# 2. samples.tsv: A tab-separated file with sample names and their corresponding group/condition.
#    Example:
#    sample\tgroup
#    sample1\tcontrol
#    sample2\tcontrol
#    sample3\ttreated
#    sample4\ttreated

# Run edgeR analysis
Rscript -e '
  library(edgeR)

  # Load count data
  counts_df <- read.delim("counts.tsv", row.names = 1, check.names = FALSE)
  counts_matrix <- as.matrix(counts_df)

  # Load sample information
  samples_df <- read.delim("samples.tsv", check.names = FALSE)
  # Ensure sample order in samples_df matches column order in counts_matrix
  samples_df <- samples_df[match(colnames(counts_matrix), samples_df$sample), ]
  group <- factor(samples_df$group)

  # Create DGEList object
  dge <- DGEList(counts=counts_matrix, group=group)

  # Filter out lowly expressed genes (optional, but good practice)
  # Genes are kept if they have at least 10 counts per million (CPM) in at least 2 samples
  keep <- filterByExpr(dge, min.count = 10, min.total.count = 15, large.n = 2)
  dge <- dge[keep,,keep.lib.sizes=FALSE]

  # Normalize library sizes using TMM method
  dge <- calcNormFactors(dge)

  # Create design matrix
  design <- model.matrix(~group)

  # Estimate dispersion
  # Common dispersion
  dge <- estimateDisp(dge, design)

  # Fit GLM
  fit <- glmQLFit(dge, design)

  # Perform differential expression test (e.g., comparing the second group level vs the first)
  # The coefficient for the comparison depends on the levels of your "group" factor.
  # For example, if group levels are "control" and "treated", and "treated" is the second level,
  # then coef=2 compares "treated" vs "control".
  qlf <- glmQLFTest(fit, coef=ncol(design)) # Compares the last group level against the first

  # Get results
  results <- topTags(qlf, n=Inf, sort.by="PValue")$table

  # Save results
  write.table(results, "edgeR_differential_expression_results.tsv", sep="\t", quote=FALSE, row.names=TRUE)
'

View on GitHub

Tools Used

STAR

Raw Source Text

STAR was used to map sequencing reads to human reference genome (Ensembl annotation of hg19 assembly).
Read counts over transcripts were calculated using HTSeq v.0.6.0 based on Ensembl annotation of hg19 genome.
Differential expression analysis was performed using EdgeR.
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: Tab-delimited tables of RPKM values

← Back to Analysis