GSE112383 Processing Pipeline
RNA-Seq
code_examples
3 steps
Publication
The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.Cell stem cell (2019) — PMID 31588046
Dataset
GSE112383The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [RNA-seq]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
Generate Jupyter Notebook-
1
STAR was used to map sequencing reads to human reference genome (Ensembl annotation of hg19 assembly).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables for clarity GENOME_DIR="/path/to/STAR_index/hg19_ensembl" # STAR index built with hg19 and Ensembl annotation READS_R1="sample_R1.fastq.gz" # Placeholder for input forward reads READS_R2="sample_R2.fastq.gz" # Placeholder for input reverse reads (remove if single-end) OUTPUT_PREFIX="sample_name_" # Prefix for output files THREADS=8 # Number of threads to use for alignment RAM_LIMIT_BAM_SORT=30000000000 # 30GB for BAM sorting (adjust based on available RAM) # Run STAR alignment STAR --runMode alignReads \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READS_R1}" "${READS_R2}" \ --readFilesCommand zcat \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --runThreadN "${THREADS}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within KeepPairs \ --outFilterType BySJout \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.1 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM "${RAM_LIMIT_BAM_SORT}" -
2
Read counts over transcripts were calculated using HTSeq v.0.6.0 based on Ensembl annotation of hg19 genome.
$ Bash example
# Install HTSeq (if not already installed) # conda install -c bioconda htseq=0.6.0 # Define input and output files INPUT_BAM="aligned_reads.bam" # Placeholder for your alignment file ANNOTATION_GTF="Homo_sapiens.GRCh37.75.gtf" # Ensembl hg19 annotation (GRCh37 is hg19) OUTPUT_COUNTS="transcript_counts.txt" # Download annotation if not present (example for Ensembl release 75, which corresponds to GRCh37/hg19) # wget -O ${ANNOTATION_GTF}.gz ftp://ftp.ensembl.org/pub/release-75/gtf/homo_sapiens/Homo_sapiens.GRCh37.75.gtf.gz # gunzip ${ANNOTATION_GTF}.gz # Run htseq-count to calculate read counts over exons (representing transcripts/genes) # -f bam: Input file format is BAM # -s no: Assume unstranded data (adjust to yes/reverse if known for your assay) # -a 10: Minimum alignment quality score of 10 # -m union: Union mode for handling reads overlapping multiple features # --type exon: Count reads mapping to 'exon' features in the GTF # --idattr gene_id: Use 'gene_id' attribute from GTF as feature identifier htseq-count \ -f bam \ -s no \ -a 10 \ -m union \ --type exon \ --idattr gene_id \ "${INPUT_BAM}" \ "${ANNOTATION_GTF}" \ > "${OUTPUT_COUNTS}" -
3
Differential expression analysis was performed using EdgeR.
$ Bash example
# Install R and Bioconductor if not already present # R -e "if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')" # R -e "BiocManager::install('edgeR')" # This script assumes you have two input files: # 1. counts.tsv: A tab-separated file with gene IDs as row names and sample counts as columns. # Example: # gene\tsample1\tsample2\tsample3\tsample4 # geneA\t100\t120\t50\t60 # geneB\t20\t25\t80\t90 # 2. samples.tsv: A tab-separated file with sample names and their corresponding group/condition. # Example: # sample\tgroup # sample1\tcontrol # sample2\tcontrol # sample3\ttreated # sample4\ttreated # Run edgeR analysis Rscript -e ' library(edgeR) # Load count data counts_df <- read.delim("counts.tsv", row.names = 1, check.names = FALSE) counts_matrix <- as.matrix(counts_df) # Load sample information samples_df <- read.delim("samples.tsv", check.names = FALSE) # Ensure sample order in samples_df matches column order in counts_matrix samples_df <- samples_df[match(colnames(counts_matrix), samples_df$sample), ] group <- factor(samples_df$group) # Create DGEList object dge <- DGEList(counts=counts_matrix, group=group) # Filter out lowly expressed genes (optional, but good practice) # Genes are kept if they have at least 10 counts per million (CPM) in at least 2 samples keep <- filterByExpr(dge, min.count = 10, min.total.count = 15, large.n = 2) dge <- dge[keep,,keep.lib.sizes=FALSE] # Normalize library sizes using TMM method dge <- calcNormFactors(dge) # Create design matrix design <- model.matrix(~group) # Estimate dispersion # Common dispersion dge <- estimateDisp(dge, design) # Fit GLM fit <- glmQLFit(dge, design) # Perform differential expression test (e.g., comparing the second group level vs the first) # The coefficient for the comparison depends on the levels of your "group" factor. # For example, if group levels are "control" and "treated", and "treated" is the second level, # then coef=2 compares "treated" vs "control". qlf <- glmQLFTest(fit, coef=ncol(design)) # Compares the last group level against the first # Get results results <- topTags(qlf, n=Inf, sort.by="PValue")$table # Save results write.table(results, "edgeR_differential_expression_results.tsv", sep="\t", quote=FALSE, row.names=TRUE) '
Tools Used
Raw Source Text
STAR was used to map sequencing reads to human reference genome (Ensembl annotation of hg19 assembly). Read counts over transcripts were calculated using HTSeq v.0.6.0 based on Ensembl annotation of hg19 genome. Differential expression analysis was performed using EdgeR. Genome_build: Homo sapiens UCSC hg19 Supplementary_files_format_and_content: Tab-delimited tables of RPKM values