GSE125808 Processing Pipeline

RNA-Seq, 4 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell Stem Cell (2019), PMID 31588046

Dataset

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [polysome RNA-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

RNA-sequencing reads were trimmed of adaptor sequences using cutadapt (v1.4.0) and mapped to repetitive elements (RepBase v18.04) using STAR (v2.4.0i).
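The trimming step is not shown in the examples on this page. A minimal sketch follows, assuming single-end reads. The adapter sequence, the minimum length, the function name `trim_reads`, and the file names are all assumptions; the source states only that cutadapt v1.4.0 was used.

```shell
#!/usr/bin/env bash
# Hypothetical adapter: the source does not state it. The Illumina TruSeq
# adapter prefix below is only a common default; replace it with the
# sequence from your library prep kit.
ADAPTER="${ADAPTER:-AGATCGGAAGAGC}"
MIN_LENGTH="${MIN_LENGTH:-18}"   # assumed cutoff, not stated in the source

trim_reads() {
    local raw_fastq="$1" trimmed_fastq="$2"
    # -a: 3' adapter to remove
    # -m: discard reads shorter than MIN_LENGTH after trimming
    # -o: output file (compression is inferred from the .gz extension)
    cutadapt -a "${ADAPTER}" -m "${MIN_LENGTH}" -o "${trimmed_fastq}" "${raw_fastq}"
}

# Usage (placeholder file names):
# trim_reads raw_reads.fastq.gz trimmed_reads.fastq.gz
```

The output file name matches the `trimmed_reads.fastq.gz` placeholder used by the STAR example for this step.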

STAR v2.4.0i

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for input trimmed reads
# Replace 'trimmed_reads.fastq.gz' with your actual trimmed RNA-seq reads file
INPUT_READS="trimmed_reads.fastq.gz"

# Placeholder for RepBase v18.04 STAR index directory
# Build this index once with STAR's --runMode genomeGenerate from the RepBase v18.04 sequences.
# Repeat consensus sequences are not spliced gene models, so no --sjdbGTFfile is passed here.
# Example: STAR --runMode genomeGenerate --genomeDir path/to/RepBase_v18.04_STAR_index --genomeFastaFiles RepBase_v18.04.fasta --runThreadN <threads>
REPBASE_STAR_INDEX="path/to/RepBase_v18.04_STAR_index"

# Output prefix for STAR alignment files
OUTPUT_PREFIX="repbase_alignment_"

# Number of threads to use
NUM_THREADS=8 # Adjust as needed

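# Map RNA-sequencing reads to repetitive elements (RepBase v18.04) using STAR
# --readFilesCommand zcat: decompress the gzipped FASTQ input on the fly
# --outReadsUnmapped Fastx: write unmapped reads to ${OUTPUT_PREFIX}Unmapped.out.mate1 for the next step
# --outFilterMultimapNmax 100: allow up to 100 alignments per read (an assumed value, not stated in the source)
STAR --runThreadN "${NUM_THREADS}" \
     --genomeDir "${REPBASE_STAR_INDEX}" \
     --readFilesIn "${INPUT_READS}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6 \
     --outReadsUnmapped Fastx \
     --outFilterMultimapNmax 100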

Reads that did not map to repetitive elements were then mapped to the human genome (hg19).
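The hand-off between the two mapping steps can be sketched as follows. This assumes single-end reads and that the repeat-mapping STAR run was given `--outReadsUnmapped Fastx`, in which case STAR writes the unaligned reads to `<prefix>Unmapped.out.mate1`. The function name `collect_unmapped_reads` and the file names are hypothetical.

```shell
#!/usr/bin/env bash
# Compress the reads that STAR left unmapped in the repeat-mapping step so
# they can serve as input for the hg19 alignment.
collect_unmapped_reads() {
    local star_prefix="$1" out_fastq_gz="$2"
    # File written by STAR when run with --outReadsUnmapped Fastx (single-end)
    local unmapped="${star_prefix}Unmapped.out.mate1"
    if [ ! -s "${unmapped}" ]; then
        echo "No unmapped reads found at ${unmapped}" >&2
        return 1
    fi
    gzip -c "${unmapped}" > "${out_fastq_gz}"
}

# Usage (placeholder names matching the examples on this page):
# collect_unmapped_reads repbase_alignment_ reads_unmapped_to_repeats.fastq.gz
```

The function fails with a non-zero status if the unmapped-reads file is missing or empty, which usually means the first STAR run did not request unmapped read output.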

STAR v2.7.10a (inferred with models/gemini-2.5-flash)

$ Bash example

# Define variables
INPUT_FASTQ="reads_unmapped_to_repeats.fastq.gz" # Placeholder for input reads after repetitive element filtering
GENOME_DIR="STAR_index_hg19" # Path to pre-built STAR genome index for hg19
OUTPUT_PREFIX="aligned_to_hg19"
NUM_THREADS=8 # Adjust as needed

# Reference genome: hg19 (UCSC source)
# Example for building the STAR index (run once; uncomment to use).
# The RefSeq GTF and --sjdbOverhang 100 are assumptions; the source does not describe the index.
# mkdir -p "${GENOME_DIR}"
# wget -O hg19.fa.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# wget -O hg19.ncbiRefSeq.gtf.gz http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/genes/hg19.ncbiRefSeq.gtf.gz
# gunzip hg19.fa.gz hg19.ncbiRefSeq.gtf.gz
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles hg19.fa \
#      --sjdbGTFfile hg19.ncbiRefSeq.gtf \
#      --sjdbOverhang 100 \
#      --runThreadN "${NUM_THREADS}"

# Run STAR alignment
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${INPUT_FASTQ} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --runThreadN ${NUM_THREADS}

# Index the resulting BAM file
samtools index ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam

Read count matrices were created using GENCODE (v19) gene annotations and featureCounts (v1.5.0).

featureCounts v1.5.0

$ Bash example

# Install featureCounts (part of Subread package)
# For Linux/macOS, download the specific version from SourceForge or a mirror:
# wget https://sourceforge.net/projects/subread/files/subread-1.5.0-Linux-x86_64.tar.gz # (adjust for your OS/architecture)
# tar -xzf subread-1.5.0-Linux-x86_64.tar.gz
# export PATH=$PATH:$(pwd)/subread-1.5.0-Linux-x86_64/bin # Adjust path as needed

# Or via Bioconda (ensure correct channel and version):
# conda install -c bioconda subread=1.5.0

# Reference Data: GENCODE (v19) gene annotations
# Download GENCODE v19 GTF file
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz
GENCODE_V19_GTF="gencode.v19.annotation.gtf"

# Example input BAM files (replace with actual sample files)
# Assuming aligned reads are in BAM format, e.g., from STAR or HISAT2
INPUT_BAMS="sample_1.bam sample_2.bam sample_3.bam" # List all BAM files to be included in the count matrix

# Output file for the read count matrix
OUTPUT_COUNTS="gene_read_counts.txt"

# Create read count matrices using featureCounts
# Parameters:
# -a: Annotation file (GTF/GFF) - required
# -o: Output file - required
# -F GTF: Specify annotation file format (GTF is common for GENCODE annotations)
# -t exon: Count reads mapping to 'exon' features (common for gene-level counts)
# -g gene_id: Group features by 'gene_id' attribute to get gene-level counts
# -T 8: Use 8 threads for parallel processing (adjust as needed based on available cores)
# -p: Specify if reads are paired-end (add this flag if your BAM files contain paired-end reads)
# -s 0: Specify strand specificity (0: unstranded, 1: stranded, 2: reverse stranded; adjust based on library prep)
#       For RNA-seq, 0 (unstranded) or 1/2 (stranded) are common. If not specified, 0 is the default.
featureCounts -a ${GENCODE_V19_GTF} -o ${OUTPUT_COUNTS} -F GTF -t exon -g gene_id -T 8 ${INPUT_BAMS}

The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
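Because the source says the RPKMs were derived from the read count matrices, the calculation can be sketched directly on the featureCounts output. This is a minimal sketch, not the authors' script. It assumes the standard featureCounts table layout (a leading `#` comment line, a header row, then `Geneid`, `Chr`, `Start`, `End`, `Strand`, `Length`, and one count column per BAM file) and uses the sum of counted reads per sample as the library size; the source does not say which denominator was used. The function name `counts_to_rpkm` is hypothetical.

```shell
#!/usr/bin/env bash
# Convert a featureCounts table to RPKM, one column per sample.
counts_to_rpkm() {
    local counts_file="$1"
    # The file is read twice: pass 1 sums the counts per sample (library size),
    # pass 2 applies RPKM = count * 1e9 / (feature_length * library_size).
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }                       # skip the featureCounts comment line
        $1 == "Geneid" {                    # header row
            if (NR != FNR) {                # print it only during pass 2
                line = "Geneid" OFS "Length"
                for (i = 7; i <= NF; i++) line = line OFS $i
                print line
            }
            next
        }
        NR == FNR {                         # pass 1: accumulate library sizes
            for (i = 7; i <= NF; i++) total[i] += $i
            next
        }
        {                                   # pass 2: emit RPKM values
            line = $1 OFS $6
            for (i = 7; i <= NF; i++) {
                rpkm = (total[i] > 0 && $6 > 0) ? $i * 1e9 / ($6 * total[i]) : 0
                line = line OFS sprintf("%.3f", rpkm)
            }
            print line
        }
    ' "$counts_file" "$counts_file"
}

# Usage (placeholder file names):
# counts_to_rpkm gene_read_counts.txt > gene_rpkm.txt
```

With `-g gene_id`, as in the featureCounts example on this page, the resulting values are gene-level; input and polysome fractions appear as separate columns when their BAM files are counted together.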

RSEM v1.3.3 (inferred with models/gemini-2.5-flash)

$ Bash example

# Install RSEM
# conda create -n rsem_env rsem -y
# conda activate rsem_env

# Note: the source states that RPKMs were calculated from the read count matrices.
# RSEM is an inferred alternative and reports TPM and FPKM rather than RPKM.

# Define reference paths and output directory
# Reference datasets: hg19 and GENCODE v19, matching the genome build and annotation named in the source.
GENOME_FASTA="path/to/hg19.fa" # Placeholder: Replace with actual path to the hg19 genome FASTA (e.g., from UCSC)
GTF_FILE="path/to/gencode.v19.annotation.gtf" # Placeholder: Replace with actual path to the GENCODE v19 GTF
RSEM_REF_DIR="rsem_ref_hg19"
RSEM_REF="${RSEM_REF_DIR}/hg19" # Reference name prefix shared by all RSEM commands
OUTPUT_DIR="rsem_quantification"
# RSEM expects alignments in transcript coordinates (e.g., STAR run with --quantMode TranscriptomeSAM
# against this reference), not genome-coordinate BAM files.
INPUT_BAM="input_fraction.bam" # Placeholder: Replace with actual path to input fraction BAM file
POLYSOME_BAM="polysome_fraction.bam" # Placeholder: Replace with actual path to polysome fraction BAM file
NUM_THREADS=8

mkdir -p "${RSEM_REF_DIR}"
mkdir -p "${OUTPUT_DIR}"

# Step 1: Prepare RSEM reference (if not already done)
# This step needs to be run only once for a given reference genome and GTF.
# The --star option also builds a STAR index for the reference transcripts (STAR must be on PATH).
if [ ! -f "${RSEM_REF}.idx.ok" ]; then
    echo "Preparing RSEM reference..."
    rsem-prepare-reference \
        --gtf "${GTF_FILE}" \
        --star \
        --num-threads "${NUM_THREADS}" \
        "${GENOME_FASTA}" \
        "${RSEM_REF}" \
        && touch "${RSEM_REF}.idx.ok" # Marker file written only if preparation succeeded
else
    echo "RSEM reference already prepared."
fi

# Step 2: Quantify expression for the input fraction
# --alignments marks the input as SAM/BAM/CRAM (older RSEM releases used --bam).
# --no-qualities declares reads without base qualities; remove it if your alignments carry them.
echo "Quantifying input fraction..."
rsem-calculate-expression \
    --alignments \
    --no-qualities \
    --num-threads "${NUM_THREADS}" \
    "${INPUT_BAM}" \
    "${RSEM_REF}" \
    "${OUTPUT_DIR}/input_fraction"

# Step 3: Quantify expression for the polysome fraction
echo "Quantifying polysome fraction..."
rsem-calculate-expression \
    --alignments \
    --no-qualities \
    --num-threads "${NUM_THREADS}" \
    "${POLYSOME_BAM}" \
    "${RSEM_REF}" \
    "${OUTPUT_DIR}/polysome_fraction"

echo "Quantification complete. Results are in ${OUTPUT_DIR}/"
# TPM and FPKM values are in the .isoforms.results and .genes.results files
# within the output directory, e.g., ${OUTPUT_DIR}/input_fraction.isoforms.results

Tools Used

cutadapt, STAR, featureCounts

Raw Source Text

RNA-sequencing reads were trimmed using cutadapt (v1.4.0) of adaptor sequences, and mapped to repetitive elements (RepBase v18.04) using the STAR (v2.4.0i). Reads did not map to repetitive elements were then mapped to the human genome (hg19). Using GENCODE (v19) gene annotations and featureCounts (v.1.5.0) to create read count matrices.
The transcript RPKMs of input and polysome fractions were calculated from the read count matrices.
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: RPKM
