GSE112782 Processing Pipeline

RIP-Seq code_examples 5 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell Stem Cell (2019), PMID 31588046

Dataset

The RNA helicase DDX6 regulates self-renewal and differentiation of human and mouse stem cells [CLIP-Seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR.

STAR v2.7.0f GitHub

$ Bash example

# Install STAR (e.g., via conda)
# conda install -c bioconda star

# --- Reference Data Preparation (Run once) ---
# Download RepBase 18.04 human-specific repetitive elements FASTA file.
# This file would contain sequences of repetitive elements from RepBase version 18.04.
# Example placeholder:
# repbase_fasta="RepBase18.04_human_repeats.fasta"
# genome_dir="star_repbase_index"

# Build STAR genome index for RepBase sequences.
# A full human genome index needs tens of GB of RAM; a repeat-only reference needs far less.
# For small references the STAR manual advises lowering --genomeSAindexNbases (default 14).
# STAR --runMode genomeGenerate \
#      --genomeDir ${genome_dir} \
#      --genomeFastaFiles ${repbase_fasta} \
#      --runThreadN 8 # Adjust threads as needed for index generation

# --- Alignment Step ---
# Define input and output paths
input_read1="reads_R1.fastq.gz" # Path to adapter-trimmed R1 FASTQ file
input_read2="reads_R2.fastq.gz" # Path to adapter-trimmed R2 FASTQ file (if paired-end)
output_prefix="aligned_to_repbase" # Prefix for output files
genome_dir="star_repbase_index" # Path to the STAR index built from RepBase 18.04

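# Execute STAR alignment
# --outReadsUnmapped Fastx is an addition (not stated in the source): it writes reads that do not
# align to RepBase to ${output_prefix}.Unmapped.out.mate1 (and .mate2 for paired-end data),
# so they can be passed on to the genome alignment step.
STAR --genomeDir ${genome_dir} \
     --readFilesIn ${input_read1} ${input_read2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${output_prefix}. \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outReadsUnmapped Fastx \
     --runThreadN 8 # Adjust threads as needed for alignment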


Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
GENOME_DIR="/path/to/STAR_index/hg19" # Placeholder for hg19 STAR genome index
INPUT_FASTQ="input_reads.fastq.gz" # Placeholder for the input FASTQ file after repeat-mapping reads were removed
OUTPUT_PREFIX="aligned_reads" # Prefix for output files

# Run STAR alignment
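# "Repeat-mapping reads were removed" refers to discarding reads that aligned to RepBase in the previous step,
# so INPUT_FASTQ should contain only reads that did not map to the repeat index.
# --outFilterMultimapNmax 1 (report uniquely mapping reads only) is an assumption, not stated in the source.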
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_FASTQ}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1 \
     --runThreadN 8 # Example: use 8 threads
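The repeat-removal idea in this step can be illustrated without any aligner. The sketch below is a toy under stated assumptions: `remove_reads_by_id` is a hypothetical helper, the ID list (one read name per line) is assumed to have been collected from the RepBase alignment, and the FASTQ is assumed to be uncompressed with exactly four lines per record. In practice STAR's `--outReadsUnmapped Fastx` option can write the non-repeat reads directly.

```shell
# remove_reads_by_id ID_LIST FASTQ   (hypothetical helper, for illustration only)
# Prints every FASTQ record whose read name is NOT listed in ID_LIST.
# The read name is the first whitespace-delimited word of the header line, without the "@".
remove_reads_by_id() {
    awk 'FILENAME == ARGV[1] { drop[$1] = 1; next }      # first file: names to discard
         FNR % 4 == 1 { keep = !(substr($1, 2) in drop) } # header line: decide per record
         keep' "$1" "$2"
}
```

Example call: `remove_reads_by_id repeat_read_ids.txt trimmed_reads.fastq > non_repeat_reads.fastq` (both file names are placeholders).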


PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5’ adapter and remaining reads retained as “usable reads”.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# Remove PCR duplicate reads using UMIs. This command assumes UMIs have already been extracted
# from the 5' adapter and appended to read names (e.g., by a preceding 'umi_tools extract' step
# or during adapter trimming). The '--paired' flag applies to paired-end data only.
# The input BAM must be coordinate-sorted and indexed (e.g., with 'samtools index').
# Replace 'input_aligned_reads.bam' with the aligned BAM whose read names carry the UMIs.
# Replace 'deduplicated_usable_reads.bam' with the desired output file name for usable reads.
# '--output-stats' takes a file name prefix rather than a single file; replace 'deduplication_stats' as needed.
umi_tools dedup \
    --stdin input_aligned_reads.bam \
    --stdout deduplicated_usable_reads.bam \
    --paired \
    --output-stats deduplication_stats
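Conceptually, UMI-based deduplication collapses reads that share a mapping position, strand, and UMI. The toy sketch below shows only that core idea on a tab-separated table; the table layout and the `dedup_by_umi` name are assumptions made for illustration, and it uses exact UMI matching only, whereas umi_tools by default also groups UMIs that differ by sequencing errors.

```shell
# dedup_by_umi   (hypothetical helper, for illustration only)
# Reads tab-separated records on stdin: chrom, start, strand, UMI, read_name.
# Keeps the first record seen for each (chrom, start, strand, UMI) combination.
dedup_by_umi() {
    awk -F '\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}
```

Example call: `dedup_by_umi < reads_with_umis.tsv > usable_reads.tsv` (placeholder file names).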


Peaks were called on the usable reads by CLIPper and assigned to gene regions annotated in Gencode (v19).

CLIPper (version not specified) GitHub

$ Bash example

# Install CLIPper (if not already installed)
# Clone https://github.com/YeoLab/clipper and install per its README (a PyPI package name is not confirmed here)

# Define input and output files
INPUT_BAM="usable_reads.bam" # Placeholder for usable reads (e.g., aligned reads in BAM format)
OUTPUT_PEAKS="clipper_peaks.bed"
GENCODE_V19_BED="gencode.v19.annotation.bed" # Placeholder for Gencode v19 gene regions in BED format

# Ensure Gencode v19 annotation file is available
# Example for downloading and converting Gencode v19 GTF to BED (adjust paths and tools as needed):
# wget -O gencode.v19.annotation.gtf.gz "ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz"
# gunzip gencode.v19.annotation.gtf.gz
# awk '$3 == "gene" {print $1"\t"$4-1"\t"$5"\t"$10"\t0\t"$7}' gencode.v19.annotation.gtf | sed 's/\"//g' | sed 's/;//g' > gencode.v19.annotation.bed

# 1. Call peaks using CLIPper
# Assumes the 'clipper' executable is on PATH; -b names the input BAM.
# Flag usage follows the CLIPper README; verify against the installed version.
clipper -b "${INPUT_BAM}" -s hg19 -o "${OUTPUT_PEAKS}"

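# 2. Assign peaks to gene regions using bedtools intersect
# Install bedtools (if not already installed)
# conda install -c bioconda bedtools
# -s requires same-strand overlap because CLIP reads are strand-specific (an assumption; the source does not state the overlap rule)
bedtools intersect -s -a "${OUTPUT_PEAKS}" -b "${GENCODE_V19_BED}" -wa -wb > "${OUTPUT_PEAKS%.bed}_annotated_genes.bed"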
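Once peaks carry a gene annotation, a per-gene tally is a quick sanity check. The sketch below is a minimal example, assuming the annotated file came from `bedtools intersect -wa -wb` against a six-column gene BED (chrom, start, end, gene ID, score, strand), so the gene ID is the third column from the end. `count_peaks_per_gene` is a hypothetical helper name.

```shell
# count_peaks_per_gene ANNOTATED_BED   (hypothetical helper, for illustration only)
# Prints "gene_id<TAB>peak_count", sorted by gene ID.
# Assumes the last six columns of each row are the gene BED6 fields,
# which places the gene ID at column NF-2.
count_peaks_per_gene() {
    awk -F '\t' '{ n[$(NF - 2)]++ } END { for (g in n) print g "\t" n[g] }' "$1" |
        LC_ALL=C sort -k1,1
}
```

A peak that overlaps several genes appears once per gene in the intersect output, so it contributes to each of those tallies.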


Each peak was normalized to the size-matched input (SMInput)

Custom Script (Inferred with models/gemini-2.5-flash, no version) GitHub

$ Bash example

# Installation of bedtools (if not already installed)
# conda install -c bioconda bedtools

# Installation of samtools (if not already installed)
# conda install -c bioconda samtools

# Define input files (replace with actual paths)
# IP_PEAKS: BED file containing the identified peaks for the IP sample.
# IP_BAM: Aligned reads (BAM format) for the Immunoprecipitation (IP) sample.
# SM_INPUT_BAM: Aligned reads (BAM format) for the Size-Matched Input (SMInput) sample.
# IP_LIB_SIZE_FILE: File containing the total number of mapped reads for the IP sample.
# SM_INPUT_LIB_SIZE_FILE: File containing the total number of mapped reads for the SMInput sample.
IP_PEAKS="ip_peaks.bed"
IP_BAM="ip_aligned.bam"
SM_INPUT_BAM="sm_input_aligned.bam"
IP_LIB_SIZE_FILE="ip_library_size.txt"
SM_INPUT_LIB_SIZE_FILE="sm_input_library_size.txt"

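# Example of how library size files might be generated (typically done after alignment).
# 'samtools view -c -F 260' counts primary mapped alignments (it excludes unmapped and secondary records)
# and returns a single number, unlike grepping 'samtools flagstat', which can match more than one line:
# samtools view -c -F 260 ${IP_BAM} > ${IP_LIB_SIZE_FILE}
# samtools view -c -F 260 ${SM_INPUT_BAM} > ${SM_INPUT_LIB_SIZE_FILE}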

# Read library sizes (total mapped reads) from the respective files
IP_LIB_SIZE=$(cat ${IP_LIB_SIZE_FILE})
SM_INPUT_LIB_SIZE=$(cat ${SM_INPUT_LIB_SIZE_FILE})

# 1. Count reads in peaks for both IP and SMInput samples using bedtools multicov (BAM files must be indexed).
# multicov appends one count column per BAM after the original BED columns, so the last two
# columns of 'peak_raw_counts.tsv' are IP_counts and SMInput_counts, whatever the BED width.
bedtools multicov -bams "${IP_BAM}" "${SM_INPUT_BAM}" -bed "${IP_PEAKS}" > peak_raw_counts.tsv

# 2. Perform normalization: calculate Reads Per Million (RPM) for both IP and SMInput,
# then compute the ratio (IP_RPM / SMInput_RPM) for each peak.
# Counts are read from the last two columns; the first four BED columns (chrom, start, end, name) are kept.
# Output columns: chrom, start, end, name, IP_counts, SMInput_counts, IP_RPM, SMInput_RPM, Normalized_Ratio.
# No pseudocount is added; peaks with zero SMInput reads get a ratio of "NA".
awk -v ip_lib_size="${IP_LIB_SIZE}" -v sm_input_lib_size="${SM_INPUT_LIB_SIZE}" 'BEGIN { FS = OFS = "\t"; print "chrom", "start", "end", "name", "IP_counts", "SMInput_counts", "IP_RPM", "SMInput_RPM", "Normalized_Ratio" } {
    ip_counts = $(NF - 1);
    sm_input_counts = $NF;

    # Calculate RPM for IP sample
    ip_rpm = (ip_lib_size > 0) ? (ip_counts / ip_lib_size) * 1000000 : 0;
    # Calculate RPM for SMInput sample
    sm_input_rpm = (sm_input_lib_size > 0) ? (sm_input_counts / sm_input_lib_size) * 1000000 : 0;

    # Calculate the normalized ratio (IP_RPM / SMInput_RPM)
    if (sm_input_rpm > 0) {
        normalized_ratio = ip_rpm / sm_input_rpm;
    } else {
        normalized_ratio = "NA"; # Handle cases where SMInput RPM is zero
    }
    print $1, $2, $3, $4, ip_counts, sm_input_counts, ip_rpm, sm_input_rpm, normalized_ratio;
}' peak_raw_counts.tsv > normalized_peaks.tsv

# Reference genome: hg19 (per the Genome_build field of the source; used for alignment and peak calling, not directly in this normalization step)
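The supplementary BED files are described as reporting log2(fold change) of IP over SMInput rather than a plain ratio. The sketch below shows one way such a value could be computed for a single peak. It is an assumption-laden illustration, not the authors' script: `log2_fold_change` is a hypothetical helper, and the pseudocount of 1 added to both counts is a common convention that the source does not specify.

```shell
# log2_fold_change IP_COUNT INPUT_COUNT IP_TOTAL INPUT_TOTAL [PSEUDOCOUNT]
# (hypothetical helper, for illustration only)
# Prints log2( ((IP_COUNT + pc) / IP_TOTAL) / ((INPUT_COUNT + pc) / INPUT_TOTAL) ).
# The pseudocount pc defaults to 1 (an assumption) and avoids division by zero.
log2_fold_change() {
    awk -v ip="$1" -v inp="$2" -v ip_total="$3" -v inp_total="$4" -v pc="${5:-1}" 'BEGIN {
        if (ip_total <= 0 || inp_total <= 0 || ip + pc <= 0 || inp + pc <= 0) {
            print "NA"   # undefined without positive totals and positive adjusted counts
            exit 1
        }
        ip_frac  = (ip + pc) / ip_total
        inp_frac = (inp + pc) / inp_total
        printf "%.4f\n", log(ip_frac / inp_frac) / log(2)
    }'
}
```

For instance, `log2_fold_change 7 1 1000000 1000000` compares 8 against 2 adjusted reads in equally sized libraries, a fourfold enrichment.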


Tools Used

STAR

Raw Source Text

Reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR. Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR. PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5’ adapter and remaining reads retained as “usable reads”.
Peaks were called on the usable reads by CLIPper, and assigned to gene regions annotated in Gencode (v19)
Each peak was normalized to the size-matched input (SMInput)
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: Each bed file was generated by read normalization between IP over SMInput. The columns in the bed files represent (chr, start, stop,-log10(pvalue),log2(fold change), strand). The CSV files contain the log2 fold change of reads upon IP over SMInput in annotated regions of each transcripts. Each bigwig file contains read distribution of each RBP, and bigbed contains clusters of predicted RBP binding.
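The BED column layout described above (column 4 is -log10(pvalue), column 5 is log2(fold change)) makes it straightforward to select enriched peaks. The sketch below is a minimal example under stated assumptions: `filter_enriched_peaks` is a hypothetical helper, and the default cutoffs of 3 for both columns are illustrative values that do not come from the source.

```shell
# filter_enriched_peaks BED_FILE [MIN_NEG_LOG10_P] [MIN_LOG2_FC]
# (hypothetical helper, for illustration only)
# Keeps rows whose column 4 (-log10 p-value) and column 5 (log2 fold change)
# both reach the given minimums. Defaults of 3 and 3 are assumptions.
filter_enriched_peaks() {
    awk -F '\t' -v min_p="${2:-3}" -v min_fc="${3:-3}" \
        '($4 + 0) >= (min_p + 0) && ($5 + 0) >= (min_fc + 0)' "$1"
}
```

Example call: `filter_enriched_peaks sample_peaks.bed 3 3 > enriched_peaks.bed` (placeholder file names).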
