GSE230717 Processing Pipeline

RIP-Seq code_examples 9 steps

Publication

Large-scale map of RNA-binding protein interactomes across the mRNA life cycle.

Molecular cell (2024) — PMID 39303721

Dataset

Large-scale map of RNA binding protein interactomes across the mRNA life-cycle

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper

Skipper vN/A GitHub

$ Bash example

git clone https://github.com/YeoLab/skipper.git
cd skipper

# Configure the workflow by editing config/config.yaml and config/samples.yaml
# Example reference genome (hg38 inferred as common default for human eCLIP):
# In config/config.yaml, set:
#   species: human
#   assembly: hg38
#   genome_fasta: /path/to/your/hg38.fasta
#   gtf: /path/to/your/hg38.gtf
#   star_index: /path/to/your/star_index_hg38
#   ... other reference paths ...

# In config/samples.yaml, define your samples and their fastq files.

# (Optional) Create and activate the Snakemake environment
# conda env create -f envs/snakemake.yaml
# conda activate snakemake

# Execute the Skipper eCLIP processing workflow
# The --use-conda flag will create environments for individual rules as needed.
# Adjust --cores based on available resources.
snakemake --use-conda --cores 8

View on GitHub

Unique Molecular Identifiers (UMIs) were extracted using fastp 0.11.5 (https://github.com/OpenGene/fastp)

fastp v0.11.5 GitHub

$ Bash example

# Install fastp (if not already installed)
# conda install -c bioconda fastp

# Example command for UMI extraction using fastp.
# This command assumes UMIs are at the beginning of Read 1 and are 10 bases long.
# Adjust input/output filenames, UMI location (--umi_loc), and UMI length (--umi_len) as per your library preparation.
# fastp will prepend the UMI sequence to the read ID in the output FASTQ files.
fastp \
  -i input_read1.fastq.gz \
  -o output_read1_umi_extracted.fastq.gz \
  -I input_read2.fastq.gz \
  -O output_read2_umi_extracted.fastq.gz \
  --umi \
  --umi_loc per_read \
  --umi_len 10 \
  --thread 8 \
  --json fastp.json \
  --html fastp.html

View on GitHub

Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples)

eCLIP v2.10 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output files
INPUT_FASTQ="input_umi_extracted_reads.fastq"
OUTPUT_FASTQ="trimmed_reads.fastq"

# Define adapter sequences and minimum read length
# These are placeholder sequences. Actual sequences should be obtained from the library preparation protocol
# or the specific eCLIP protocol used (e.g., from the Yeo lab).
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example: Illumina Universal Adapter
ADAPTER_5PRIME="GTTCAGAGTTCTACAGTCCGACGATC" # Example: A common linker or 5' barcode sequence to be trimmed after UMI extraction
MIN_READ_LENGTH=18

# Execute cutadapt to trim adapter and barcode sequences
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -g "${ADAPTER_5PRIME}" \
  -m ${MIN_READ_LENGTH} \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"

View on GitHub

Mapping was then performed against the full human genome (hg38) including a database of splice junctions with STAR (v 2.7.6) allowing up to 100 multimapped regions.

STAR v2.7.6 GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.6

# Define variables
READ1_FASTQ="path/to/your/read1.fastq.gz" # Replace with your R1 FASTQ file
READ2_FASTQ="path/to/your/read2.fastq.gz" # Replace with your R2 FASTQ file (remove if single-end)
STAR_GENOME_DIR="path/to/your/star_hg38_index" # Replace with path to pre-built STAR hg38 genome index
OUTPUT_PREFIX="sample_name" # Prefix for output files
NUM_THREADS=8 # Number of threads to use

# Note: The STAR genome index (STAR_GENOME_DIR) must be pre-built using hg38 FASTA and a GTF/GFF file for splice junctions.
# Example command for building index (run once, adjust sjdbOverhang based on read length - 1):
# STAR --runMode genomeGenerate \
#      --genomeDir ${STAR_GENOME_DIR} \
#      --genomeFastaFiles path/to/hg38.fa \
#      --sjdbGTFfile path/to/gencode.vXX.annotation.gtf \
#      --sjdbOverhang 100 \
#      --runThreadN ${NUM_THREADS}

# Perform mapping with STAR
STAR --genomeDir ${STAR_GENOME_DIR} \
     --readFilesIn ${READ1_FASTQ} ${READ2_FASTQ} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outFilterMultimapNmax 100 \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --runThreadN ${NUM_THREADS}

View on GitHub

Reads were then PCR deduplicated using UMIcollaps (https://github.com/Daniel-Liu-c0deb0t/UMICollapse

UMIcollaps vlatest GitHub

$ Bash example

# Clone the repository if not already available
# git clone https://github.com/Daniel-Liu-c0deb0t/UMICollapse.git
# cd UMICollapse

# Execute UMIcollaps for PCR deduplication
# Assuming input_reads.fastq is the input file and deduplicated_reads.fastq is the desired output.
# The UMI length (-l) is an example (default or common value); adjust if the actual UMI length is known from experimental design.
python UMICollapse.py -i input_reads.fastq -o deduplicated_reads.fastq -l 10

View on GitHub

Enriched âwindowsâ (IP versus SM-Input) was called on deduplicated reads using a GC-bias aware beta-binomial model.

clipper (Inferred with models/gemini-2.5-flash) vPython 3.8 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install clipper (if not already installed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# # Ensure dependencies are met, e.g., pysam, numpy, scipy
# pip install pysam numpy scipy

# Define input and output files
IP_BAM="path/to/deduplicated_IP.bam" # Immunoprecipitated reads, deduplicated
SM_INPUT_BAM="path/to/deduplicated_SM_Input.bam" # Size-matched input reads, deduplicated
GENOME_ASSEMBLY="hg38" # Placeholder: Replace with the actual genome assembly (e.g., hg19, mm10)
OUTPUT_PREFIX="eCLIP_peaks"

# Run clipper to call enriched windows using a GC-bias aware beta-binomial model
python clipper.py \
    -s ${GENOME_ASSEMBLY} \
    -u ${IP_BAM} \
    -c ${SM_INPUT_BAM} \
    -o ${OUTPUT_PREFIX}

View on GitHub

Each window (~100 b.p.) were partitioned from Gencode v38 and associated with a specific type of genomic region (CDS, UTR, proximal introns near splice siteâ¦ etc).

GENCODE vv38 GitHub

$ Bash example

# Download Gencode v38 annotation GTF
wget -nc https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
gunzip -f gencode.v38.annotation.gtf.gz

# Download hg38 chrom.sizes for bedtools makewindows
wget -nc http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes

# # Installation for bedtools:
# # conda install -c bioconda bedtools

# Extract specific genomic regions from Gencode v38 GTF into BED format
# Note: This is a simplified representation. A robust pipeline would use dedicated GTF parsers
# or more complex scripting to accurately define all region types, especially introns.

# Extract CDS regions
awk '$3 == "CDS"' gencode.v38.annotation.gtf | \
awk -v OFS='\t' '{print $1, $4-1, $5, $10, $12, $7}' | \
sed 's/;//g; s/"//g' | \
sort -k1,1 -k2,2n > gencode.v38.cds.bed

# Extract UTR regions
awk '$3 == "UTR"' gencode.v38.annotation.gtf | \
awk -v OFS='\t' '{print $1, $4-1, $5, $10, $12, $7}' | \
sed 's/;//g; s/"//g' | \
sort -k1,1 -k2,2n > gencode.v38.utr.bed

# Extract Exon regions (often used to derive introns)
awk '$3 == "exon"' gencode.v38.annotation.gtf | \
awk -v OFS='\t' '{print $1, $4-1, $5, $10, $12, $7}' | \
sed 's/;//g; s/"//g' | \
sort -k1,1 -k2,2n > gencode.v38.exons.bed

# Conceptual step: Derive intron regions and proximal introns near splice sites.
# This typically involves custom scripting based on exon coordinates (e.g., using bedtools complement
# on exons within transcripts, then defining proximal regions +/- X bp around splice sites).
# For this example, we assume these files (e.g., gencode.v38.introns.bed, gencode.v38.proximal_introns.bed)
# would be generated by a preceding step or custom script.

# Create genome-wide windows of ~100 b.p.
bedtools makewindows -g hg38.chrom.sizes -w 100 > genome.100bp_windows.bed

# Associate each window with a specific type of genomic region
# This step involves intersecting the windows with the derived region BED files
# and assigning a primary region type based on overlap hierarchy (e.g., CDS > UTR > Intron).
# This is often done with a custom script after multiple bedtools intersect/map calls.

# Example of intersecting windows with CDS regions:
bedtools intersect -a genome.100bp_windows.bed -b gencode.v38.cds.bed -wa -wb > windows_overlapping_cds.bed

# Further processing (e.g., with a Python/R script) would then assign a single region type
# to each window based on the overlaps with CDS, UTR, Intron, etc., following a defined hierarchy.
# For instance:
# python assign_region_type.py genome.100bp_windows.bed windows_overlapping_cds.bed windows_overlapping_utr.bed ... > final_annotated_windows.bed

View on GitHub

Windows are filtered using FDR < 0.2 and only reproducible windows between two replicates were used.

merge_replicates.py (Inferred with models/gemini-2.5-flash) vlatest GitHub

$ Bash example

# Clone the merge_peaks repository
# git clone https://github.com/yeolab/merge_peaks.git
# cd merge_peaks

# Example usage of merge_replicates.py
# This command filters windows (peaks) from two replicate BED files
# using an FDR threshold of 0.2 and identifies reproducible windows.
# Replace 'replicate1_peaks.bed' and 'replicate2_peaks.bed' with your actual input peak files.
# The output will be prefixed with 'reproducible_windows'.
python merge_replicates.py --fdr_threshold 0.2 replicate1_peaks.bed replicate2_peaks.bed --output_prefix reproducible_windows

View on GitHub

Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper

Skipper vLatest commit GitHub

$ Bash example

# Clone the Skipper workflow repository
git clone https://github.com/YeoLab/skipper.git
cd skipper

# --- Configuration Setup (User needs to adapt these files) ---
# The workflow requires a configuration file (e.g., config/config.yaml)
# and a samplesheet (e.g., config/samples.tsv).
#
# Example config/config.yaml (adjust genome_dir and other parameters as needed):
# genome_assembly: "hg38"
# genome_dir: "/path/to/your/genome/data/hg38"
#
# Example config/samples.tsv (tab-separated, adjust paths to your FASTQ files):
# sample_id    fastq_r1    fastq_r2    replicate
# sample1_rep1 /path/to/sample1_rep1_R1.fastq.gz /path/to/sample1_rep1_R2.fastq.gz 1
# sample1_rep2 /path/to/sample1_rep2_R1.fastq.gz /path/to/sample1_rep2_R2.fastq.gz 2
#
# --- Environment Setup (Optional, but recommended) ---
# # Install mamba (if not already installed)
# # conda install -c conda-forge mamba
#
# # Create and activate the Snakemake environment
# # mamba env create -f envs/snakemake.yaml
# # conda activate snakemake

# --- Run the Skipper eCLIP workflow ---
# Adjust --cores based on available resources.
# Ensure config/config.yaml and config/samples.tsv are properly configured.
snakemake --use-conda --cores 8 --configfile config/config.yaml

View on GitHub

Tools Used

Skipper eCLIP STAR

Raw Source Text

Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper
Unique Molecular Identifiers (UMIs) were extracted using fastp 0.11.5 (https://github.com/OpenGene/fastp)
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples)
Mapping was then performed against the full human genome (hg38) including a database of splice junctions with STAR (v 2.7.6) allowing up to 100 multimapped regions.
Reads were then PCR deduplicated using UMIcollaps (https://github.com/Daniel-Liu-c0deb0t/UMICollapse
Enriched âwindowsâ (IP versus SM-Input) was called on deduplicated reads using a GC-bias aware beta-binomial model. Each window (~100 b.p.) were partitioned from Gencode v38 and associated with a specific type of genomic region (CDS, UTR, proximal introns near splice siteâ¦ etc). Windows are filtered using FDR < 0.2 and only reproducible windows between two replicates were used.
Data was processed using the eCLIP pipeline and available at https://github.com/YeoLab/skipper
Assembly: hg38
Supplementary files format and content: bed files contain Skipper windows.

← Back to Analysis