GSE112383 Processing Pipeline

RNA-Seq code_examples 3 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell stem cell (2019) — PMID 31588046

Dataset

GSE112383

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [RNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    STAR was used to map sequencing reads to human reference genome (Ensembl annotation of hg19 assembly).

    STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables for clarity
    GENOME_DIR="/path/to/STAR_index/hg19_ensembl" # STAR index built with hg19 and Ensembl annotation
    READS_R1="sample_R1.fastq.gz" # Placeholder for input forward reads
    READS_R2="sample_R2.fastq.gz" # Placeholder for input reverse reads (remove if single-end)
    OUTPUT_PREFIX="sample_name_" # Prefix for output files
    THREADS=8 # Number of threads to use for alignment
    RAM_LIMIT_BAM_SORT=30000000000 # 30GB for BAM sorting (adjust based on available RAM)
    
    # Run STAR alignment
    STAR --runMode alignReads \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READS_R1}" "${READS_R2}" \
      --readFilesCommand zcat \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --runThreadN "${THREADS}" \
      --outSAMtype BAM SortedByCoordinate \
      --outSAMunmapped Within KeepPairs \
      --outFilterType BySJout \
      --outFilterMultimapNmax 20 \
      --outFilterMismatchNmax 999 \
      --outFilterMismatchNoverLmax 0.1 \
      --alignIntronMin 20 \
      --alignIntronMax 1000000 \
      --alignMatesGapMax 1000000 \
      --limitBAMsortRAM "${RAM_LIMIT_BAM_SORT}"
  2. 2

    Read counts over transcripts were calculated using HTSeq v.0.6.0 based on Ensembl annotation of hg19 genome.

    HTSeq v0.6.0 GitHub
    $ Bash example
    # Install HTSeq (if not already installed)
    # conda install -c bioconda htseq=0.6.0
    
    # Define input and output files
    INPUT_BAM="aligned_reads.bam" # Placeholder for your alignment file
    ANNOTATION_GTF="Homo_sapiens.GRCh37.75.gtf" # Ensembl hg19 annotation (GRCh37 is hg19)
    OUTPUT_COUNTS="transcript_counts.txt"
    
    # Download annotation if not present (example for Ensembl release 75, which corresponds to GRCh37/hg19)
    # wget -O ${ANNOTATION_GTF}.gz ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz
    # gunzip ${ANNOTATION_GTF}.gz
    
    # Run htseq-count to calculate read counts over exons (representing transcripts/genes)
    # -f bam: Input file format is BAM
    # -s no: Assume unstranded data (adjust to yes/reverse if known for your assay)
    # -a 10: Minimum alignment quality score of 10
    # -m union: Union mode for handling reads overlapping multiple features
    # --type exon: Count reads mapping to 'exon' features in the GTF
    # --idattr gene_id: Use 'gene_id' attribute from GTF as feature identifier
    htseq-count \
        -f bam \
        -s no \
        -a 10 \
        -m union \
        --type exon \
        --idattr gene_id \
        "${INPUT_BAM}" \
        "${ANNOTATION_GTF}" \
        > "${OUTPUT_COUNTS}"
  3. 3

    Differential expression analysis was performed using EdgeR.

    edgeR v3.46.0 GitHub
    $ Bash example
    # Install R and Bioconductor if not already present
    # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')"
    # R -e "BiocManager::install('edgeR')"
    
    # This script assumes you have two input files:
    # 1. counts.tsv: A tab-separated file with gene IDs as row names and sample counts as columns.
    #    Example:
    #    gene\tsample1\tsample2\tsample3\tsample4
    #    geneA\t100\t120\t50\t60
    #    geneB\t20\t25\t80\t90
    # 2. samples.tsv: A tab-separated file with sample names and their corresponding group/condition.
    #    Example:
    #    sample\tgroup
    #    sample1\tcontrol
    #    sample2\tcontrol
    #    sample3\ttreated
    #    sample4\ttreated
    
    # Run edgeR analysis
    Rscript -e '
      library(edgeR)
    
      # Load count data
      counts_df <- read.delim("counts.tsv", row.names = 1, check.names = FALSE)
      counts_matrix <- as.matrix(counts_df)
    
      # Load sample information
      samples_df <- read.delim("samples.tsv", check.names = FALSE)
      # Ensure sample order in samples_df matches column order in counts_matrix
      samples_df <- samples_df[match(colnames(counts_matrix), samples_df$sample), ]
      group <- factor(samples_df$group)
    
      # Create DGEList object
      dge <- DGEList(counts=counts_matrix, group=group)
    
      # Filter out lowly expressed genes (optional, but good practice)
      # Genes are kept if they have at least 10 counts per million (CPM) in at least 2 samples
      keep <- filterByExpr(dge, min.count = 10, min.total.count = 15, large.n = 2)
      dge <- dge[keep,,keep.lib.sizes=FALSE]
    
      # Normalize library sizes using TMM method
      dge <- calcNormFactors(dge)
    
      # Create design matrix
      design <- model.matrix(~group)
    
      # Estimate dispersion
      # Common dispersion
      dge <- estimateDisp(dge, design)
    
      # Fit GLM
      fit <- glmQLFit(dge, design)
    
      # Perform differential expression test (e.g., comparing the second group level vs the first)
      # The coefficient for the comparison depends on the levels of your "group" factor.
      # For example, if group levels are "control" and "treated", and "treated" is the second level,
      # then coef=2 compares "treated" vs "control".
      qlf <- glmQLFTest(fit, coef=ncol(design)) # Compares the last group level against the first
    
      # Get results
      results <- topTags(qlf, n=Inf, sort.by="PValue")$table
    
      # Save results
      write.table(results, "edgeR_differential_expression_results.tsv", sep="\t", quote=FALSE, row.names=TRUE)
    '

Tools Used

Raw Source Text
STAR was used to map sequencing reads to human reference genome (Ensembl annotation of hg19 assembly).
Read counts over transcripts were calculated using HTSeq v.0.6.0 based on Ensembl annotation of hg19 genome.
Differential expression analysis was performed using EdgeR.
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: Tab-delimited tables of RPKM values
← Back to Analysis