GSE240521 Processing Pipeline

RIP-Seq, 9 steps

Publication

High-sensitivity in situ capture of endogenous RNA-protein interactions in fixed cells and primary tissues.

Nature communications (2024) — PMID 39152130

Dataset

An in situ method for identification of transcriptome-wide protein-RNA interactions in cells [eCLIP-seq]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Data were processed using the eCLIP pipeline, available at: http://github.com/yeolab/eclip

eCLIP (version not specified)

$ Bash example

# Install cwltool if not already installed
# pip install cwltool

# Clone the eCLIP pipeline repository
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Placeholder for input data and reference genome.
# Replace with actual paths to your FASTQ files and reference genome (mm10 for this dataset).
# Example:
# READ1_FASTQ="/path/to/sample_R1.fastq.gz"
# READ2_FASTQ="/path/to/sample_R2.fastq.gz"
# REFERENCE_GENOME_FASTA="/path/to/mm10.fa"
# REFERENCE_GENOME_GTF="/path/to/mm10.gtf" # Or appropriate annotation file

# Create an inputs.yaml file that matches the inputs declared by the workflow you run.
# NOTE: the workflow name 'eclip.cwl' and the keys below are illustrative placeholders,
# not the repository's actual schema. Consult the CWL definitions and example inputs
# in the repository for the real workflow names and required fields.
# Example simplified inputs.yaml structure:
# cat << EOF > inputs.yaml
# read1:
#   class: File
#   path: ${READ1_FASTQ}
# read2:
#   class: File
#   path: ${READ2_FASTQ}
# genome_fasta:
#   class: File
#   path: ${REFERENCE_GENOME_FASTA}
# genome_gtf:
#   class: File
#   path: ${REFERENCE_GENOME_GTF}
# output_prefix: "my_eclip_output"
# EOF

# Execute the eCLIP CWL pipeline.
# This command assumes the workflow file and 'inputs.yaml' are in the current directory.
cwltool eclip.cwl inputs.yaml


Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract

UMI-tools (version not specified; inferred with models/gemini-2.5-flash)

$ Bash example

# Install UMI-tools
# conda install -c bioconda umi-tools

# Example: Extract UMIs (assuming 12bp at the start of Read 1) from raw sequencing reads
# and append them to the read header.
# Input: raw_reads_R1.fastq.gz
# Output: umi_extracted_reads_R1.fastq.gz
umi_tools extract \
    --bc-pattern="^(?P<umi_1>.{12})" \
    --extract-method=regex \
    -I raw_reads_R1.fastq.gz \
    -S umi_extracted_reads_R1.fastq.gz \
    --log=umi_tools_extract.log
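
To make concrete what this step does to each record, here is a minimal awk sketch. It is not UMI-tools itself: it moves the first N bases of a read into the read name after an underscore, the suffix convention `umi_tools extract` uses by default. The helper name `extract_umi_sketch` and the 12-base UMI length are assumptions for illustration only.

```shell
# Sketch only: mimic the effect of UMI extraction on an uncompressed FASTQ stream.
# extract_umi_sketch is a hypothetical helper, not part of UMI-tools.
extract_umi_sketch() {
    local umi_len="$1"
    awk -v n="$umi_len" '
        NR % 4 == 1 { header = $0 }                 # remember the read name line
        NR % 4 == 2 {                               # sequence line
            umi = substr($0, 1, n)
            split(header, parts, " ")               # parts[1] is the read ID
            rest = substr(header, length(parts[1]) + 1)
            print parts[1] "_" umi rest             # read ID, then _UMI, then any comment
            print substr($0, n + 1)                 # sequence without the UMI
        }
        NR % 4 == 3 { print "+" }
        NR % 4 == 0 { print substr($0, n + 1) }     # trim qualities to match
    '
}

# One-record demonstration with a 12-base UMI
printf '@read1 1:N:0\nACGTACGTACGTTTTTGGGG\n+\nIIIIIIIIIIIIJJJJJJJJ\n' | extract_umi_sketch 12
```

Keeping the UMI in the read name is what lets a later deduplication step recover it from the aligned BAM without a separate lookup table.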


After UMI extraction, reads were trimmed of adapter and barcode sequences (eCLIP samples) using cutadapt.

cutadapt v4.0

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
INPUT_READS="sample_umi_extracted.fastq.gz"
OUTPUT_TRIMMED_READS="sample_trimmed.fastq.gz"

# Adapter sequences below are illustrative assumptions; the source does not state them.
# Substitute the adapters and barcodes used in your own library preparation.
# Illumina TruSeq indexed adapter (commonly seen as read-through at the 3' end)
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
# Illumina TruSeq Universal Adapter (contains the P5 sequence)
ADAPTER_5PRIME="AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT"

# Trimming parameters (illustrative values, not stated in the source)
MIN_LENGTH=18 # Minimum read length after trimming
QUALITY_CUTOFF=20 # Quality cutoff for 3' end trimming (Phred score)
THREADS=8 # Number of CPU threads to use

# Execute cutadapt for adapter and barcode trimming.
# Reads in which no adapter is found stay in the main output, so they are not lost downstream.
cutadapt \
    -a "${ADAPTER_3PRIME}" \
    -g "${ADAPTER_5PRIME}" \
    -o "${OUTPUT_TRIMMED_READS}" \
    --minimum-length "${MIN_LENGTH}" \
    --quality-cutoff "${QUALITY_CUTOFF}" \
    --cores "${THREADS}" \
    "${INPUT_READS}"


Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

STAR v2.7.10a

$ Bash example

# Install STAR (example)
# conda install -c bioconda star

# Define variables
# Placeholder for input trimmed reads. Adjust for paired-end if necessary.
READS_R1="trimmed_reads_R1.fastq.gz"
# READS_R2="trimmed_reads_R2.fastq.gz" # Uncomment for paired-end

# Placeholder for RepBase FASTA file. Obtaining RepBase might require a license or specific download steps.
REPBASE_FASTA="RepBase.fasta"

# Directory for the STAR genome index
STAR_INDEX_DIR="RepBase_STAR_index"

# Prefix for output files (STAR appends names such as Aligned.out.bam directly to it)
OUTPUT_PREFIX="repbase_filtered_"

# Number of threads to use
THREADS=8 # Adjust as needed

# --- Step 1: Build STAR index for RepBase (if not already built) ---
# This step is typically performed once. If the index already exists, skip this.
# For small references the STAR manual recommends scaling --genomeSAindexNbases down to
# min(14, log2(GenomeLength)/2 - 1); the value 10 below is a placeholder.
# STAR --runMode genomeGenerate \
#      --genomeDir "${STAR_INDEX_DIR}" \
#      --genomeFastaFiles "${REPBASE_FASTA}" \
#      --runThreadN "${THREADS}" \
#      --genomeSAindexNbases 10

# --- Step 2: Map trimmed reads against RepBase and extract unmapped reads ---
# --readFilesCommand zcat is required because the input FASTQ is gzip-compressed.
STAR --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${READS_R1}" \
     --readFilesCommand zcat \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outFilterMultimapNmax 30 \
     --alignEndsType EndToEnd \
     --outFilterMultimapScoreRange 1 \
     --outSAMmode Full \
     --outFilterType BySJout \
     --outSAMtype BAM Unsorted \
     --outFilterScoreMin 10 \
     --outReadsUnmapped Fastx \
     --outSAMattributes All

# Reads that did not map to RepBase are written, uncompressed, to
# ${OUTPUT_PREFIX}Unmapped.out.mate1 (and .mate2 for paired-end).
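
Because this step discards reads, it can help to record how many survive it. The following is a small sketch, assuming the unmapped-read file is uncompressed FASTQ with four lines per record; the helper name `count_fastq_reads` is hypothetical.

```shell
# Sketch only: count records in an uncompressed FASTQ stream (4 lines per read).
# count_fastq_reads is a hypothetical helper, not part of STAR.
count_fastq_reads() {
    awk 'END { print NR / 4 }'
}

# Demonstration on two inline records; in practice, redirect the
# Unmapped.out.mate1 file written by STAR into the function.
printf '@a\nACGT\n+\nIIII\n@b\nTTGA\n+\nIIII\n' | count_fastq_reads
```

Comparing this count with the number of trimmed reads gives the fraction removed as repetitive.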


Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

STAR v2.7.10a

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for clarity
# Replace /path/to/STAR_index/mm10 with the actual path to your mm10 STAR genome index
GENOME_DIR="/path/to/STAR_index/mm10"
# Replace remaining_reads.fastq.gz with your actual input FASTQ file
INPUT_FASTQ="remaining_reads.fastq.gz"
# Prefix for output files (e.g., aligned_mm10_Aligned.out.bam, aligned_mm10_Unmapped.out.mate1)
OUTPUT_PREFIX="aligned_mm10_"

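# Run STAR alignment
# --readFilesCommand zcat is required because the input is gzip-compressed;
# drop it if you pass STAR's uncompressed Unmapped.out.mate1 file directly.
STAR \
  --runThreadN 8 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${INPUT_FASTQ}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \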
  --outFilterMultimapNmax 1 \
  --alignEndsType EndToEnd \
  --outFilterMultimapScoreRange 1 \
  --outSAMmode Full \
  --outFilterType BySJout \
  --outSAMtype BAM Unsorted \
  --outFilterScoreMin 10 \
  --outReadsUnmapped Fastx \
  --outSAMattributes All


PCR duplicates were removed from uniquely mapped reads with umi_tools.

UMI-tools v1.1.2 (version inferred with models/gemini-2.5-flash)

$ Bash example

# Install UMI-tools (example using conda)
# conda install -c bioconda umi-tools

# Check UMI-tools version
# umi_tools --version

# Remove PCR duplicates from uniquely mapped reads.
# umi_tools dedup expects a coordinate-sorted, indexed BAM, whereas the STAR step above
# writes an unsorted BAM; sort and index it first (for example with samtools sort and samtools index).
# UMIs are read from the read ID. umi_tools extract appends them after an underscore,
# which is also the default --umi-separator; it is spelled out here for clarity.
# Use --extract-umi-method=tag with --umi-tag if UMIs are stored in a BAM tag instead.
# --output-stats takes a filename prefix, not a single file name.
# Placeholder filenames are used for input and output.
umi_tools dedup \
    -I mapped_reads.sorted.bam \
    -S deduplicated_reads.bam \
    --extract-umi-method=read_id \
    --umi-separator='_' \
    --output-stats=umi_tools_dedup_stats \
    --log=umi_tools_dedup.log
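
The idea behind UMI-based deduplication can be shown on a toy table. The sketch below keeps one read per combination of chromosome, start, strand, and UMI. That is the simplest exact-match strategy, not the error-aware grouping that UMI-tools applies by default. The four-column input layout and the helper name `dedup_unique_sketch` are assumptions made for this example.

```shell
# Sketch only: collapse reads that share chromosome, start, strand, and UMI.
# Input columns (tab-separated, hypothetical layout): chrom, start, strand, read_name,
# where read_name ends in _<UMI>.
dedup_unique_sketch() {
    awk -F'\t' '
        {
            n = split($4, parts, "_")          # the UMI follows the last underscore
            key = $1 FS $2 FS $3 FS parts[n]
            if (!(key in seen)) {              # keep the first read seen for each key
                seen[key] = 1
                print
            }
        }
    '
}

# r2 duplicates r1 (same position, strand, and UMI); r3 and r4 are kept.
printf 'chr1\t100\t+\tr1_AAAA\nchr1\t100\t+\tr2_AAAA\nchr1\t100\t+\tr3_CCCC\nchr1\t100\t-\tr4_AAAA\n' |
    dedup_unique_sketch
```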


Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper

CLIPper (version not specified)

$ Bash example

# Clone the CLIPper repository and install it following its README
# git clone https://github.com/YeoLab/clipper.git
# cd clipper

# Example usage for eCLIP peak calling against mouse mm10, the assembly used for this dataset.
# CLIPper calls clusters on the IP BAM alone; comparison with the size-matched input
# is handled by the enrichment step that follows.
# The options below are assumed from the CLIPper README. Confirm them with `clipper --help`
# for your installed version, including whether mm10 is an accepted --species value.
# Replace the placeholder file names with your own.

clipper \
    --bam deduplicated_reads.bam \
    --species mm10 \
    --outfile clipper_peaks.bed


Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl

eCLIP (version not specified)

$ Bash example

# --- Installation (commented out) ---
# Ensure Perl is installed
# sudo apt-get update && sudo apt-get install -y perl

# Clone the eCLIP repository to access the script
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# git checkout master # Or a specific commit/tag if available and desired
# cd ..

# --- Define Input Files ---
# NOTE: the script is assumed here to take positional arguments in the order shown below.
# That order is an assumption; read the usage notes at the top of
# overlap_peakfi_with_bam.pl before running.

# Placeholder for the deduplicated IP (CLIP) BAM file
IP_BAM="path/to/your/ip.bam"

# Placeholder for the size-matched input BAM file
SM_INPUT_BAM="path/to/your/sm_input.bam"

# Placeholder for the input peak file (e.g., output from CLIPper)
PEAK_FILE="path/to/your/input_peaks.bed"

# Placeholders for text files holding the mapped read count of each BAM
IP_READNUM="path/to/your/ip.mapped_readnum.txt"
SM_INPUT_READNUM="path/to/your/sm_input.mapped_readnum.txt"

# Output BED file for the input-normalized clusters
OUTPUT_BED="enriched_clusters.bed"

# --- Execute the Script ---
# The script path depends on where it sits in your clone of the eCLIP repository.
perl eclip/bin/overlap_peakfi_with_bam.pl \
    "${IP_BAM}" \
    "${SM_INPUT_BAM}" \
    "${PEAK_FILE}" \
    "${IP_READNUM}" \
    "${SM_INPUT_READNUM}" \
    "${OUTPUT_BED}"
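
The enrichment this step reports can be illustrated with a short calculation. The sketch below computes a log2 fold enrichment as the ratio of library-size-normalized read counts in IP versus SMInput for one cluster. It is a simplified illustration rather than the Perl script's exact procedure: it does no significance testing and does not handle zero counts. The helper name and the example numbers are hypothetical.

```shell
# Sketch only: log2 fold enrichment of IP over size-matched input for one cluster.
# Arguments: IP reads in cluster, total IP reads, input reads in cluster, total input reads.
# log2_fold_enrichment is a hypothetical helper; zero counts are not handled.
log2_fold_enrichment() {
    awk -v ip="$1" -v ip_total="$2" -v inp="$3" -v inp_total="$4" 'BEGIN {
        ratio = (ip / ip_total) / (inp / inp_total)   # normalize by library size first
        printf "%.3f\n", log(ratio) / log(2)          # awk log() is natural log
    }'
}

# 80 of 1,000,000 IP reads versus 10 of 1,000,000 input reads: an 8-fold ratio
log2_fold_enrichment 80 1000000 10 1000000
```

Normalizing by total mapped reads before taking the ratio keeps clusters comparable between libraries sequenced to different depths.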


Overlapping enriched clusters (peaks) were merged with a custom Perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl

eCLIP (version not specified)

$ Bash example

# Install Perl if not already available
# sudo apt-get update && sudo apt-get install perl # For Debian/Ubuntu
# yum install perl # For CentOS/RHEL

# Download the script if you have not cloned the eCLIP repository.
# The URL path below is a guess at the repository layout; adjust it to the script's actual location.
# wget https://raw.githubusercontent.com/yeolab/eclip/master/bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
# chmod +x compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl

# Example usage:
# The step merges overlapping clusters within one input-normalized peak file.
# The argument list (input BED, then output BED) is an assumption; check the script header.
perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl enriched_clusters.bed enriched_clusters.compressed.bed


Tools Used

eCLIP STAR

Raw Source Text

Data was processed using the eCLIP pipeline and available at: http://github.com/yeolab/eclip
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Remaining reads were mapped to the appropriate genome build (mm10) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Uniquely mapped reads were removed of PCR duplicates with umi_tools
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
Assembly: mm10
Supplementary files format and content: bigwigs contain RPM-normalized read densities of uniquely-mapped reads
Supplementary files format and content: BED files contain CLIPper peak clusters. Columns 4 and 5 describe the -log10(p-value) and log2(fold) enrichment IP over corresponding SMInput.
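
The two supplementary formats described above can be worked with using a few lines of awk. This sketch shows how an RPM scale factor follows from a uniquely mapped read count, and how BED rows can be filtered on columns 4 and 5. The helper names are hypothetical, and the cutoffs of 3 and 3 are example values, not thresholds stated in the source.

```shell
# Sketch only: helper names and thresholds are hypothetical, not from the source.

# Scale factor that converts raw coverage to reads per million (RPM),
# given the number of uniquely mapped reads in the library.
rpm_scale_factor() {
    awk -v n="$1" 'BEGIN { printf "%.6f\n", 1000000 / n }'
}

# Keep BED rows whose column 4 (-log10 p-value) and column 5 (log2 fold
# enrichment) both meet the given minimums.
filter_enriched_peaks() {
    awk -F'\t' -v min_p="$1" -v min_fc="$2" '$4 >= min_p && $5 >= min_fc'
}

# A library with 2,500,000 uniquely mapped reads
rpm_scale_factor 2500000

# Three toy clusters; only the first and third pass both cutoffs
printf 'chr1\t100\t200\t5.2\t4.1\t+\nchr1\t300\t400\t1.0\t6.0\t-\nchr2\t50\t90\t3.0\t3.0\t+\n' |
    filter_enriched_peaks 3 3
```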
