GSE125808 Processing Pipeline
RNA-Seq
4 steps
Publication
The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis. Cell Stem Cell (2019). PMID 31588046
Dataset
GSE125808: The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [polysome RNA-seq]
Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
Processing Steps
1
RNA-sequencing reads were trimmed of adaptor sequences using cutadapt (v1.4.0) and mapped to repetitive elements (RepBase v18.04) using STAR (v2.4.0i).
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for input trimmed reads
# Replace 'trimmed_reads.fastq.gz' with your actual trimmed RNA-seq reads file
INPUT_READS="trimmed_reads.fastq.gz"

# Placeholder for the RepBase v18.04 STAR index directory
# You would need to build this index yourself using STAR's --runMode genomeGenerate
# with the RepBase v18.04 sequences. No --sjdbGTFfile is passed here, on the assumption
# that only the RepBase sequence FASTA is available. For a reference this small, the STAR
# manual advises lowering --genomeSAindexNbases.
# Example: STAR --runMode genomeGenerate --genomeDir path/to/RepBase_v18.04_STAR_index --genomeFastaFiles RepBase_v18.04.fasta --runThreadN <threads>
REPBASE_STAR_INDEX="path/to/RepBase_v18.04_STAR_index"

# Output prefix for STAR alignment files
OUTPUT_PREFIX="repbase_alignment_"

# Number of threads to use
NUM_THREADS=8  # Adjust as needed

# Map RNA-sequencing reads to repetitive elements (RepBase v18.04) using STAR
# --readFilesCommand zcat: the input FASTQ is gzip-compressed
# --outReadsUnmapped Fastx: write unmapped reads to ${OUTPUT_PREFIX}Unmapped.out.mate1
#   (uncompressed) so they can be passed on to the hg19 alignment in step 2
# --outFilterMultimapNmax 100: allow reads to map to up to 100 locations,
#   common for repetitive elements (assumed value, not stated in the source)
STAR --runThreadN ${NUM_THREADS} \
     --genomeDir "${REPBASE_STAR_INDEX}" \
     --readFilesIn "${INPUT_READS}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6 \
     --outReadsUnmapped Fastx \
     --outFilterMultimapNmax 100
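Step 1 also names adaptor trimming with cutadapt, which the STAR example does not cover. The following is a minimal sketch of that trimming call, not the authors' command: the adaptor sequence, the minimum length, the file names, and the `run` helper are all assumptions introduced for illustration, since the source gives none of them. It defaults to a dry run that only prints the command.

```shell
# Sketch only. ASSUMPTIONS (not stated in the source): single-end reads, the
# adaptor below (a common Illumina TruSeq 3' adaptor prefix), and MIN_LEN.
ADAPTER="AGATCGGAAGAGC"
MIN_LEN=18
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the command only; 0 = execute it

# Hypothetical helper: print a command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# -a: 3' adaptor to remove; -m: discard reads shorter than MIN_LEN after trimming
# File names are placeholders.
run cutadapt -a "$ADAPTER" -m "$MIN_LEN" \
    -o trimmed_reads.fastq.gz raw_reads.fastq.gz
```

Run it with `DRY_RUN=0` once the adaptor and file names match your library; the trimmed output is what the STAR example expects as `INPUT_READS`.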
2
Reads that did not map to repetitive elements were then mapped to the human genome (hg19).
$ Bash example
# Define variables
INPUT_FASTQ="reads_unmapped_to_repeats.fastq.gz"  # Placeholder for input reads after repetitive element filtering
GENOME_DIR="STAR_index_hg19"                      # Path to pre-built STAR genome index for hg19
OUTPUT_PREFIX="aligned_to_hg19"
NUM_THREADS=8                                     # Adjust as needed

# Reference genome: hg19 (UCSC source)
# Example for building the STAR index (commented out; run once):
# mkdir -p ${GENOME_DIR}
# wget -O hg19.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# wget -O hg19.ncbiRefSeq.gtf.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz
# gunzip hg19.fa.gz hg19.ncbiRefSeq.gtf.gz
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles hg19.fa \
#      --sjdbGTFfile hg19.ncbiRefSeq.gtf \
#      --sjdbOverhang 100 \
#      --runThreadN ${NUM_THREADS}

# Run STAR alignment
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${INPUT_FASTQ} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --runThreadN ${NUM_THREADS}

# Index the resulting BAM file
samtools index ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam
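The hg19 alignment takes as input the reads left unmapped by the RepBase pass. One way to produce that file is sketched here, assuming the RepBase alignment was run single-end with STAR's `--outReadsUnmapped Fastx` option, which writes unmapped reads to `<prefix>Unmapped.out.mate1`. The helper name `collect_unmapped` and the file names are hypothetical.

```shell
# Hypothetical helper: gzip STAR's unmapped-read file so it can feed the hg19 alignment.
# ASSUMPTION: the RepBase run used `--outReadsUnmapped Fastx` with single-end reads,
# so STAR wrote <prefix>Unmapped.out.mate1 in FASTQ format.
collect_unmapped() {
    prefix="$1"       # STAR --outFileNamePrefix used for the RepBase run
    out_fq_gz="$2"    # gzip-compressed FASTQ to create
    src="${prefix}Unmapped.out.mate1"
    if [ ! -s "$src" ]; then
        echo "missing or empty: $src" >&2
        return 1
    fi
    gzip -c "$src" > "$out_fq_gz"
}

# Example call (placeholder names):
# collect_unmapped repbase_alignment_ reads_unmapped_to_repeats.fastq.gz
```

The function refuses to write an output file when STAR's unmapped-read file is absent or empty, which guards against silently aligning nothing in the second pass.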
3
Read count matrices were created using GENCODE (v19) gene annotations and featureCounts (v1.5.0).
$ Bash example
# Install featureCounts (part of the Subread package)
# For Linux/macOS, download the specific version from SourceForge or a mirror:
# wget https://sourceforge.net/projects/subread/files/subread-1.5.0-Linux-x86_64.tar.gz  # (adjust for your OS/architecture)
# tar -xzf subread-1.5.0-Linux-x86_64.tar.gz
# export PATH=$PATH:$(pwd)/subread-1.5.0-Linux-x86_64/bin  # Adjust path as needed
# Or via Bioconda (ensure correct channel and version):
# conda install -c bioconda subread=1.5.0

# Reference data: GENCODE (v19) gene annotations
# Download the GENCODE v19 GTF file
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz
GENCODE_V19_GTF="gencode.v19.annotation.gtf"

# Example input BAM files (replace with actual sample files)
# Assuming aligned reads are in BAM format, e.g., from STAR
INPUT_BAMS="sample_1.bam sample_2.bam sample_3.bam"  # List all BAM files to be included in the count matrix

# Output file for the read count matrix
OUTPUT_COUNTS="gene_read_counts.txt"

# Create read count matrices using featureCounts
# Parameters:
#   -a: Annotation file (GTF/GFF), required
#   -o: Output file, required
#   -F GTF: Annotation file format (GENCODE annotations are distributed as GTF)
#   -t exon: Count reads mapping to 'exon' features (common for gene-level counts)
#   -g gene_id: Group features by the 'gene_id' attribute to get gene-level counts
#   -T 8: Use 8 threads (adjust to the available cores)
#   -p: Add this flag if the BAM files contain paired-end reads
#   -s: Strand specificity (0: unstranded, 1: stranded, 2: reverse stranded);
#       0 is the default, adjust to the library prep
featureCounts -a ${GENCODE_V19_GTF} -o ${OUTPUT_COUNTS} -F GTF -t exon -g gene_id -T 8 ${INPUT_BAMS}
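featureCounts writes a table whose first line is a `# Program:featureCounts ...` comment, followed by a header with `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length` and one column per BAM file. A plain gene-by-sample count matrix can be cut from it as sketched below. The helper name `counts_matrix` and the file names are hypothetical, and the source does not say how the authors formatted their matrices.

```shell
# Hypothetical helper: reduce featureCounts output to Geneid plus per-sample counts.
# Assumes the standard featureCounts layout: 6 annotation columns, then one
# count column per input BAM, all tab-separated.
counts_matrix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                       # skip the "# Program:..." comment line
         {
             line = $1                        # Geneid (or the header label)
             for (i = 7; i <= NF; i++) line = line OFS $i
             print line
         }' "$1"
}

# Example call (placeholder names):
# counts_matrix gene_read_counts.txt > count_matrix.tsv
```

The `Length` column is dropped here, so keep the original featureCounts table if gene lengths are needed later for RPKM.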
4
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
$ Bash example
# NOTE: the source derives RPKMs from the featureCounts read count matrices.
# RSEM, shown here, is an alternative quantification route, not the published method.
# RSEM reports TPM and FPKM (FPKM equals RPKM for single-end reads).

# Install RSEM
# conda create -n rsem_env -c bioconda rsem -y
# conda activate rsem_env

# Define reference paths and output directory
# Placeholders use hg19 and GENCODE v19 to match the genome build and annotation named in the source.
GENOME_FASTA="path/to/hg19.fa"                 # Placeholder: path to the human genome FASTA
GTF_FILE="path/to/gencode.v19.annotation.gtf"  # Placeholder: path to the GTF annotation
RSEM_REF_DIR="rsem_ref_hg19"
OUTPUT_DIR="rsem_quantification"
# RSEM needs alignments in transcript coordinates (e.g. from STAR --quantMode TranscriptomeSAM,
# which writes Aligned.toTranscriptome.out.bam), not genome-sorted BAM files.
INPUT_BAM="input_fraction.toTranscriptome.bam"        # Placeholder: input fraction
POLYSOME_BAM="polysome_fraction.toTranscriptome.bam"  # Placeholder: polysome fraction
NUM_THREADS=8

mkdir -p "${RSEM_REF_DIR}"
mkdir -p "${OUTPUT_DIR}"

# Step 1: Prepare the RSEM reference (only once per genome and GTF)
# --star additionally builds a STAR index for the same reference (requires STAR on PATH).
if [ ! -f "${RSEM_REF_DIR}/hg19.idx.ok" ]; then
    echo "Preparing RSEM reference..."
    rsem-prepare-reference \
        --gtf "${GTF_FILE}" \
        --star \
        --num-threads "${NUM_THREADS}" \
        "${GENOME_FASTA}" \
        "${RSEM_REF_DIR}/hg19"
    touch "${RSEM_REF_DIR}/hg19.idx.ok"  # Marker file for a completed reference build
else
    echo "RSEM reference already prepared."
fi

# Step 2: Quantify the input fraction
echo "Quantifying input fraction..."
rsem-calculate-expression \
    --alignments \
    --num-threads "${NUM_THREADS}" \
    "${INPUT_BAM}" \
    "${RSEM_REF_DIR}/hg19" \
    "${OUTPUT_DIR}/input_fraction"

# Step 3: Quantify the polysome fraction
echo "Quantifying polysome fraction..."
rsem-calculate-expression \
    --alignments \
    --num-threads "${NUM_THREADS}" \
    "${POLYSOME_BAM}" \
    "${RSEM_REF_DIR}/hg19" \
    "${OUTPUT_DIR}/polysome_fraction"

echo "Quantification complete. Results are in ${OUTPUT_DIR}/"
# TPM and FPKM values are in the .isoforms.results and .genes.results files,
# e.g., ${OUTPUT_DIR}/input_fraction.isoforms.results
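The source states that RPKMs were calculated from the read count matrices. That arithmetic can be sketched directly on the featureCounts table, using RPKM = count × 10^9 / (feature length in bp × total counted reads in the sample). This is an illustration under assumptions, not the authors' script: the helper name `counts_to_rpkm` is hypothetical, library size is taken as the sum of assigned counts per sample (the source does not say which total was used), and with `-g gene_id` the values are gene-level rather than per transcript.

```shell
# Hypothetical helper: convert a featureCounts table to RPKM.
# Input columns: Geneid Chr Start End Strand Length <sample1> <sample2> ...
# ASSUMPTION: library size = sum of counts assigned to features in that sample.
counts_to_rpkm() {
    # The file is read twice: pass 1 sums each sample column, pass 2 prints RPKMs.
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ { next }                          # skip the "# Program:..." comment line
         NR == FNR {                            # pass 1: per-sample library sizes
             if ($1 != "Geneid")
                 for (i = 7; i <= NF; i++) total[i] += $i
             next
         }
         $1 == "Geneid" {                       # pass 2: header row
             line = $1
             for (i = 7; i <= NF; i++) line = line OFS $i
             print line
             next
         }
         {                                      # pass 2: one row per gene
             line = $1
             for (i = 7; i <= NF; i++) {
                 rpkm = (total[i] > 0 && $6 > 0) ? $i * 1e9 / ($6 * total[i]) : 0
                 line = line OFS sprintf("%.3f", rpkm)
             }
             print line
         }' "$1" "$1"
}

# Example call (placeholder names):
# counts_to_rpkm gene_read_counts.txt > gene_rpkm.tsv
```

For example, a 1,000 bp gene with 100 reads in a sample of 1,000 counted reads gets 100 × 10^9 / (1,000 × 1,000) = 100,000 RPKM. Input and polysome fractions are handled the same way, one column each.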
Tools Used
cutadapt (v1.4.0), STAR (v2.4.0i), featureCounts (v1.5.0)
Raw Source Text
RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices. The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: RPKM