GSE125809 Processing Pipeline

RIP-Seq code_examples 6 steps

Publication

The RNA Helicase DDX6 Controls Cellular Plasticity by Modulating P-Body Homeostasis.

Cell stem cell (2019) — PMID 31588046

Dataset

The RNA helicase DDX6 regulation in human skeletal myoblasts [eCLIP-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

library strategy: CLIP-seq

CLIP-seq vInfer from description GitHub

$ Bash example

# --- Installation Instructions (commented out) ---
# For STAR:
# conda install -c bioconda star

# For clipper:
# pip install clipper

# For merge_peaks (custom pipeline):
# git clone https://github.com/yeolab/merge_peaks.git
# # Follow installation instructions within the cloned repository, e.g.:
# # cd merge_peaks
# # python setup.py install

# --- Placeholder Variables ---
# Reference genome directory for STAR (e.g., human hg38)
# Replace with your actual path to the STAR index
GENOME_DIR="/path/to/STAR_index/hg38"
# Reference genome fasta file (e.g., human hg38)
GENOME_FASTA="/path/to/hg38.fa"
# GTF annotation file (e.g., Gencode for hg38)
GTF_FILE="/path/to/gencode.vXX.annotation.gtf"

# Input CLIP-seq raw reads (e.g., FastQ files)
# Replace with your actual input file paths
CLIP_READS_R1="sample_clip_R1.fastq.gz"
CLIP_READS_R2="sample_clip_R2.fastq.gz" # Omit if single-end
# Input control raw reads (e.g., FastQ files from input/mock IP)
CONTROL_READS_R1="sample_control_R1.fastq.gz"
CONTROL_READS_R2="sample_control_R2.fastq.gz" # Omit if single-end

# Output prefixes
CLIP_OUTPUT_PREFIX="sample_clip"
CONTROL_OUTPUT_PREFIX="sample_control"

# --- Step 1: Alignment of CLIP-seq and Control Reads with STAR ---
echo "--- Step 1: Aligning CLIP-seq reads with STAR ---"
# Align CLIP-seq reads
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${CLIP_READS_R1}" "${CLIP_READS_R2}" \
     --runThreadN 8 \
     --outFileNamePrefix "${CLIP_OUTPUT_PREFIX}_star_" \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode GeneCounts \
     --outFilterMultimapNmax 10 \
     --outFilterMismatchNmax 3 # Example parameters, adjust as needed

# Align Control reads
echo "--- Step 1: Aligning Control reads with STAR ---"
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${CONTROL_READS_R1}" "${CONTROL_READS_R2}" \
     --runThreadN 8 \
     --outFileNamePrefix "${CONTROL_OUTPUT_PREFIX}_star_" \
     --outSAMtype BAM SortedByCoordinate \
     --quantMode GeneCounts \
     --outFilterMultimapNmax 10 \
     --outFilterMismatchNmax 3

# Define aligned BAM files
CLIP_BAM="${CLIP_OUTPUT_PREFIX}_star_Aligned.sortedByCoord.out.bam"
CONTROL_BAM="${CONTROL_OUTPUT_PREFIX}_star_Aligned.sortedByCoord.out.bam"

# --- Step 2: Peak Calling with clipper ---
echo "--- Step 2: Calling peaks with clipper ---"
# clipper requires aligned BAM files and a genome size.
# Use 'hg38' as a placeholder for genome size, or provide a custom .sizes file.
clipper -b "${CLIP_BAM}" \
        -c "${CONTROL_BAM}" \
        -s hg38 \
        -o "${CLIP_OUTPUT_PREFIX}_peaks.bed" \
        -p 0.01 \
        --bonferroni # Example parameters, adjust as needed

# Define peak file from clipper
CLIP_PEAKS="${CLIP_OUTPUT_PREFIX}_peaks.bed"

# --- Step 3: IDR (Identifying Reproducible Peaks) with merge_peaks pipeline ---
# This step typically requires multiple biological replicates.
# Assuming two replicates for demonstration:
# Replace with actual peak files from replicate 1 and replicate 2
REP1_PEAKS="replicate1_clip_peaks.bed"
REP2_PEAKS="replicate2_clip_peaks.bed"
IDR_OUTPUT_PREFIX="combined_idr_peaks"

echo "--- Step 3: Performing IDR with merge_peaks pipeline (requires replicates) ---"
# The merge_peaks repository contains scripts for IDR analysis.
# This is a conceptual command representing the execution of an IDR script from that pipeline.
# You would typically navigate to the cloned merge_peaks directory and run a script like 'run_idr.py'
python /path/to/merge_peaks/scripts/run_idr.py \
       --peak_files "${REP1_PEAKS}" "${REP2_PEAKS}" \
       --output_prefix "${IDR_OUTPUT_PREFIX}" \
       --idr_threshold 0.05 \
       --peak_caller clipper # Specify the peak caller used

View on GitHub

Reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR.

STAR v2.7.3a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
INPUT_READS="trimmed_reads.fastq.gz" # Placeholder for adapter-trimmed reads
REPBASE_INDEX_DIR="/path/to/repbase_18_04_star_index" # Placeholder for STAR index built from RepBase 18.04 human-specific repetitive elements
OUTPUT_PREFIX="repbase_mapping_output"
NUM_THREADS=8 # Adjust as needed

# Note: The STAR index for RepBase 18.04 human-specific repetitive elements needs to be pre-built.
# This typically involves creating a FASTA file of the desired repetitive elements
# and then running STAR --runMode genomeGenerate ...

# Map adapter-trimmed reads to human-specific repetitive elements from RepBase
STAR --runThreadN ${NUM_THREADS} \
     --genomeDir ${REPBASE_INDEX_DIR} \
     --readFilesIn ${INPUT_READS} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 100 \
     --outFilterMismatchNmax 10 \
     --alignIntronMax 1 # Assuming repetitive elements do not contain large introns, or are treated as single exons

View on GitHub

Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR.

STAR v2.5.2a GitHub

$ Bash example

# conda install -c bioconda star bowtie2 samtools

# Placeholder for rRNA/repeat genome index (e.g., from UCSC hg19 rRNA or a custom repeat library).
# This step assumes 'repeat-mapping reads' refers to reads mapping to a dedicated repeat/rRNA genome,
# a common pre-processing step in eCLIP pipelines.
# bowtie2-build /path/to/rRNA.fa /path/to/rRNA_index

# Step 1: Remove repeat-mapping reads (e.g., rRNA reads) using Bowtie2.
# Reads that *do not* map to the repeat genome are passed to STAR.
bowtie2 -x /path/to/rRNA_index \
        -U reads.fastq.gz \
        --un-gz filtered_reads.fastq.gz \
        -S /dev/null # Discard mapped reads

# Step 2: Map remaining reads to human genome hg19 with STAR.
# Create STAR index (if not already present) for hg19.
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index/hg19 \
#      --genomeFastaFiles /path/to/hg19.fa \
#      --sjdbGTFfile /path/to/gencode.v19.annotation.gtf \
#      --runThreadN 8

STAR --genomeDir /path/to/STAR_index/hg19 \
     --readFilesIn filtered_reads.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix aligned_ \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 8

# Rename output file for clarity
mv aligned_Aligned.sortedByCoord.out.bam aligned.bam

View on GitHub

PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5â adapter and remaining reads retained as âusable readsâ.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install umi-tools
# conda install -c bioconda umi-tools=1.1.1

# Example command for UMI-based deduplication.
# This command assumes that the UMIs have already been extracted from the 5' adapter
# and appended to the read names (e.g., by a prior 'umi_tools extract' step).
# Replace 'input_aligned_with_umis.bam' with your actual input BAM file
# and 'output_deduplicated.bam' with your desired output file name.
# The --paired-end flag should be used if the input reads are paired-end.
# The --method unique is a common deduplication strategy.
umi_tools dedup \
    --extract-umis-from-read-names \
    --method unique \
    --paired-end \
    -I input_aligned_with_umis.bam \
    -S output_deduplicated.bam \
    --output-stats deduplication_stats.tsv

View on GitHub

Peaks were called on the usable reads by CLIPper ,and assigned to gene regions annotated in Gencode (v19)

CLIPper v0.0.1 GitHub

$ Bash example

# Install CLIPper (example, adjust based on actual installation method)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install

# Define input and output files
INPUT_BAM="input_reads.bam"
OUTPUT_BED="peaks.bed"

# Define reference files for Gencode v19 (corresponding to hg19/GRCh37)
# Download Gencode v19 GTF
# wget -O gencode.v19.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip gencode.v19.annotation.gtf.gz
GENCODE_GTF="gencode.v19.annotation.gtf"

# Download hg19/GRCh37 primary assembly FASTA (Ensembl release 75 corresponds to Gencode v19)
# wget -O Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz ftp://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
# gunzip Homo_sapiens.GRCh37.dna.primary_assembly.fa.gz
GENOME_FASTA="Homo_sapiens.GRCh37.dna.primary_assembly.fa"

# Generate chromosome sizes file from FASTA
# samtools faidx "${GENOME_FASTA}"
# cut -f1,2 "${GENOME_FASTA}".fai > hg19.chrom.sizes
GENOME_SIZES="hg19.chrom.sizes"

# Execute CLIPper to call peaks and assign to gene regions
python clipper.py -b "${INPUT_BAM}" \
                  -s "${GENOME_SIZES}" \
                  -o "${OUTPUT_BED}" \
                  -g "${GENOME_FASTA}" \
                  -a "${GENCODE_GTF}"

View on GitHub

Each peak was normalized to the size-matched input (SMInput)

normalize_peaks.py (from yeolab/skipper) (Inferred with models/gemini-2.5-flash) vFrom yeolab/skipper pipeline (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# --- Setup for normalize_peaks.py ---
# The normalize_peaks.py script is part of the yeolab/skipper pipeline.
# It can be found at: https://github.com/yeolab/skipper/blob/master/scripts/normalize_peaks.py
# Ensure this script is available in your PATH or specified by its full path.
# For example, you might clone the repository and use the script:
# git clone https://github.com/yeolab/skipper.git
# NORMALIZE_SCRIPT="./skipper/scripts/normalize_peaks.py" # Adjust path as needed

# Define variables for clarity
IP_PEAKS="ip_peaks.bed" # Placeholder: Path to your called IP peaks (e.g., from clipper output)
IP_BAM="ip_aligned.bam" # Placeholder: Path to your IP sample's aligned BAM file
SM_INPUT_BAM="sm_input_aligned.bam" # Placeholder: Path to your size-matched input sample's aligned BAM file
OUTPUT_NORMALIZED_PEAKS="normalized_peaks.bed" # Desired output file name for normalized peaks
GENOME_SIZE="2.7e9" # Placeholder: Effective genome size (e.g., 2.7e9 for human hg38, 2.9e9 for hg19)

# Install dependencies (if not already installed)
# conda install -c bioconda python bedtools samtools numpy scipy

# Execute the normalization script
# Ensure normalize_peaks.py is executable and its Python dependencies (e.g., numpy, scipy) are met.
python normalize_peaks.py \
    -p "${IP_PEAKS}" \
    -i "${IP_BAM}" \
    -c "${SM_INPUT_BAM}" \
    -o "${OUTPUT_NORMALIZED_PEAKS}" \
    -g "${GENOME_SIZE}"

View on GitHub

Tools Used

CLIP-seq STAR

Raw Source Text

library strategy: CLIP-seq
Reads were adapter-trimmed and mapped to human-specific repetitive elements from RepBase (version 18.04) by STAR. Repeat-mapping reads were removed and remaining reads mapped to the human genome assembly hg19 with STAR. PCR duplicate reads were removed using the unique molecular identifier (UMI) sequences in the 5â adapter and remaining reads retained as âusable readsâ.
Peaks were called on the usable reads by CLIPper ,and assigned to gene regions annotated in Gencode (v19)
Each peak was normalized to the size-matched input (SMInput)
Genome_build: Homo sapiens UCSC hg19
Supplementary_files_format_and_content: Each bed file was gernated by read normalization between IP over SMInput. The columns in the bed files represent (chr, start, stop,-log10(pvalue),log2(fold change), strand). The CSV files contain the log2 fold change of reads upon IP over SMInput in annotated regions of each transcripts. Each bigwig file contains read distribution of each RBP, and bigbed contains clusters of predicted RBP binding.

← Back to Analysis