GSE199650 Processing Pipeline

RIP-Seq, 11 steps

Publication

Human CCR4 deadenylase homolog Angel1 is a non-stop mRNA decay factor.

RNA (New York, N.Y.) (2025) — PMID 40441874

Dataset

GSE199650

The 2',3' cyclic phosphatase Angel1 facilitates mRNA degradation during human ribosome-associated quality control

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Raw reads were processed using the eCLIP pipeline.

eCLIP v1.0.0

$ Bash example

# Create dummy input files for demonstration
# printf is used because plain echo does not expand \n in bash
mkdir -p inputs outputs
printf '@read1/1\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > inputs/sample_R1.fastq
printf '@read1/2\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > inputs/sample_R2.fastq
printf '@read1/1\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > inputs/control_R1.fastq
printf '@read1/2\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > inputs/control_R2.fastq

# Placeholder for reference genome (hg19 STAR index, FASTA, GTF)
# In a real scenario, these paths would point to actual reference files.
# The reads in this dataset were mapped to GRCh37 (hg19), so hg19 paths are used here.
# Example: /path/to/STAR_indexes/hg19
STAR_INDEX_DIR="/path/to/STAR_indexes/hg19"
GENOME_FASTA="/path/to/genome/hg19.fa"
GENOME_GTF="/path/to/annotations/gencode.v19.annotation.gtf"

# Create dummy reference directories/files for the script to run without error
# In a real scenario, these would be actual pre-built indices and files.
mkdir -p "${STAR_INDEX_DIR}"
touch "${STAR_INDEX_DIR}/SA" # Dummy file to make it a valid directory for CWL
mkdir -p "$(dirname "${GENOME_FASTA}")"
touch "${GENOME_FASTA}"
mkdir -p "$(dirname "${GENOME_GTF}")"
touch "${GENOME_GTF}"

# Create a sample CWL input YAML file
cat << EOF > inputs.yaml
sample_r1:
  class: File
  path: inputs/sample_R1.fastq
sample_r2:
  class: File
  path: inputs/sample_R2.fastq
control_r1:
  class: File
  path: inputs/control_R1.fastq
control_r2:
  class: File
  path: inputs/control_R2.fastq
star_index_dir:
  class: Directory
  path: ${STAR_INDEX_DIR}
genome_fasta:
  class: File
  path: ${GENOME_FASTA}
genome_gtf:
  class: File
  path: ${GENOME_GTF}
output_dir: outputs
EOF

# Install cwltool if not already installed
# pip install cwltool

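# Download the eCLIP CWL workflow definition (verify the file name and location in the repository)
# wget https://raw.githubusercontent.com/yeolab/eclip/master/eclip_pipeline.cwl

# Run the eCLIP pipeline using cwltool.
# cwltool options such as --outdir go before the workflow file; the job (inputs) file follows it.
# The input names in inputs.yaml are placeholders and must match the inputs declared in the workflow.
# Check that the container pinned in the CWL file corresponds to eCLIP v1.0.0.
cwltool --outdir outputs eclip_pipeline.cwl inputs.yaml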
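
Before handing paired-end FASTQ files to a pipeline like this, a quick structural check can catch truncated or mismatched inputs. The sketch below is a minimal pre-flight check, not part of the eCLIP pipeline: the function names fastq_count and check_pair and the toy file names are hypothetical, and it assumes uncompressed four-line FASTQ records.

```shell
#!/usr/bin/env bash
# Count records in an uncompressed FASTQ file; fail if the line count
# is not a multiple of 4 (header, sequence, '+', qualities per record).
fastq_count() {
    local lines
    lines=$(wc -l < "$1")
    if [ $((lines % 4)) -ne 0 ]; then
        echo "ERROR: $1 has a line count that is not a multiple of 4" >&2
        return 1
    fi
    echo $((lines / 4))
}

# Check that both mate files hold the same number of records.
check_pair() {
    local n1 n2
    n1=$(fastq_count "$1") || return 1
    n2=$(fastq_count "$2") || return 1
    if [ "$n1" -ne "$n2" ]; then
        echo "ERROR: mate files differ ($n1 vs $n2 records)" >&2
        return 1
    fi
    echo "OK: $n1 read pairs"
}

# Toy demonstration with two one-record files (hypothetical names).
demo_dir=$(mktemp -d)
printf '@read1/1\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > "$demo_dir/toy_R1.fastq"
printf '@read1/2\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > "$demo_dir/toy_R2.fastq"
check_pair "$demo_dir/toy_R1.fastq" "$demo_dir/toy_R2.fastq"
```

For gzip-compressed inputs the same check could be run on the output of zcat instead of reading the file directly.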

Reads were then trimmed with cutadapt

cutadapt v4.0

$ Bash example

# Install cutadapt via conda
# conda install -c bioconda cutadapt=4.0

# Trim adapters and low-quality bases from paired-end reads
# The poly-A and poly-G sequences below are placeholders; substitute the adapter sequences used in the experiment.
# -a 'A{100}': 3' adapter for R1, here a stretch of up to 100 A's
# -A 'G{100}': 3' adapter for R2, here a stretch of up to 100 G's (-A is the R2 counterpart of -a, not a 5' adapter)
# -q 20: Trim low-quality bases from the 3' end with a quality cutoff of 20
# --minimum-length 18: Discard reads shorter than 18 bp after trimming
# -o: Output file for R1 reads
# -p: Output file for R2 reads
# input_R1.fastq.gz input_R2.fastq.gz: Input paired-end FASTQ files
cutadapt \
    -a 'A{100}' \
    -A 'G{100}' \
    -q 20 \
    --minimum-length 18 \
    -o trimmed_R1.fastq.gz \
    -p trimmed_R2.fastq.gz \
    input_R1.fastq.gz input_R2.fastq.gz

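To make the effect of a minimum-length filter concrete, the following sketch applies the same 18 bp rule to a toy single-end FASTQ stream with awk. The function name filter_min_length and the toy reads are illustrative and not part of cutadapt. In paired-end mode cutadapt by default discards a pair when either mate fails the filter, which this single-end sketch does not model.

```shell
#!/usr/bin/env bash
# Keep only FASTQ records whose sequence has at least MIN_LEN bases.
# Usage: filter_min_length MIN_LEN < in.fastq > out.fastq
filter_min_length() {
    awk -v min_len="$1" '
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { seq = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            # $0 is the quality line that closes the record
            if (length(seq) >= min_len) print header "\n" seq "\n" plus "\n" $0
        }
    '
}

# Toy input: one 20 bp read and one 8 bp read; only the first passes.
printf '@long\nATGCATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIIIIIII\n@short\nATGCATGC\n+\nIIIIIIII\n' \
    | filter_min_length 18
```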

Reads were then trimmed again with cutadapt to remove double-ligation events.

cutadapt v2.10

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=2.10

# Define input and output files
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="trimmed_double_ligation.fastq.gz"

# Define the 3' adapter sequence for eCLIP, which can cause double-ligation events.
# This adapter sequence is commonly used in Yeo lab eCLIP workflows (e.g., in the CWL workflow).
# For specific experiments, verify the exact adapter sequence used.
ADAPTER_SEQUENCE="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Trim reads to remove the 3' adapter sequence, addressing double-ligation events.
# -a: 3' adapter sequence to be removed from the 3' end of the reads.
# -o: Output file for trimmed reads.
# --minimum-length: Discard reads shorter than this length after trimming (e.g., 18 bp, common in eCLIP).
# --quality-cutoff: Trim low-quality bases from the 3' end using a Phred quality score cutoff (e.g., 20).
# --cores: Number of CPU cores to use for parallel processing.
cutadapt \
    -a "${ADAPTER_SEQUENCE}" \
    -o "${OUTPUT_FASTQ}" \
    --minimum-length 18 \
    --quality-cutoff 20 \
    --cores 4 \
    "${INPUT_FASTQ}"


Trimmed and filtered reads were then mapped with STAR against a repeat element database

STAR v2.7.0f

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Note: A STAR genome index for repeat elements must be pre-built.
# Example command for building the index (replace 'repeat_elements.fasta' with your actual repeat FASTA file):
# STAR --runThreadN 8 --runMode genomeGenerate --genomeDir repeat_element_star_index --genomeFastaFiles repeat_elements.fasta --genomeSAindexNbases 12

# Map trimmed and filtered reads to the repeat element database
# Input: input_reads.fastq.gz (placeholder for trimmed and filtered reads)
# Output: mapped_repeats_Aligned.sortedByCoord.out.bam (sorted BAM file of mapped reads)
#         mapped_repeats_Unmapped.out.mate1 (unmapped reads, often used for subsequent genomic alignment in eCLIP)
# STAR appends its file names directly to --outFileNamePrefix, without adding a separator.
STAR \
  --runThreadN 8 \
  --genomeDir repeat_element_star_index \
  --readFilesIn input_reads.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix mapped_repeats_ \
  --outSAMtype BAM SortedByCoordinate \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 3 \
  --alignIntronMax 1 \
  --alignSJDBoverhangMin 1 \
  --outReadsUnmapped Fastx


Unmapped reads filtered of repeat elements were then mapped with STAR against a human genome (GRCh37)

STAR v2.7.0f

$ Bash example

# Install STAR (example using conda)
# conda create -n star_env star=2.7.0f -c bioconda -c conda-forge
# conda activate star_env

# Reference Genome (GRCh37/hg19) and Annotation (GENCODE v19)
# Download FASTA and GTF files:
# wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip hg19.fa.gz
# gunzip gencode.v19.annotation.gtf.gz

# Build STAR genome index (example, adjust --runThreadN and --sjdbOverhang as needed)
# mkdir -p /path/to/STAR_index/GRCh37_GENCODEv19
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index/GRCh37_GENCODEv19 \
#      --genomeFastaFiles hg19.fa \
#      --sjdbGTFfile gencode.v19.annotation.gtf \
#      --sjdbOverhang 99 \
#      --runThreadN 8

# Define variables for input and output
INPUT_READS="filtered_reads.fastq.gz" # Reads after filtering repeat elements
GENOME_DIR="/path/to/STAR_index/GRCh37_GENCODEv19" # Path to the pre-built STAR genome index for GRCh37
OUTPUT_PREFIX="mapped_reads_"

# Execute STAR mapping
STAR --runThreadN 8 \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_READS}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 1 \
     --alignIntronMax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66


Aligned reads were sorted with samtools

samtools v1.19

$ Bash example

# Install samtools (example using conda)
# conda install -c bioconda samtools

# Sort aligned reads by coordinate (default)
# Replace input.bam with your actual input aligned BAM file
# Replace output_sorted.bam with your desired output sorted BAM file
samtools sort -o output_sorted.bam input.bam


Sorted reads were collapsed with umi_tools.

UMI-tools (latest version; inferred with models/gemini-2.5-flash)

$ Bash example

# Install UMI-tools if not already installed
# conda install -c bioconda umi-tools

# Example usage:
# Assuming 'sorted_reads.bam' is the input BAM file with UMIs in read names
# and it is sorted by coordinate (e.g., using samtools sort).
# umi_tools dedup also needs a BAM index next to the input file:
# samtools index sorted_reads.bam
# The --method unique option collapses reads with identical UMIs and mapping positions.
# Other methods like 'cluster' or 'directional' might be used depending on the specific assay and desired stringency.
umi_tools dedup \
    --stdin sorted_reads.bam \
    --stdout collapsed_reads.bam \
    --method unique \
    --log dedup.log

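The idea behind collapsing with --method unique can be sketched without a BAM file: reads that share chromosome, position, strand, and UMI are treated as duplicates and only the first is kept. The tab-separated toy input and the collapse_unique function are assumptions for illustration, as is reading the UMI from the text after the last underscore in the read name (the default separator in UMI-tools). The real tool works on alignments and handles details this sketch ignores.

```shell
#!/usr/bin/env bash
# Collapse reads that share chromosome, start, strand, and UMI.
# Input columns (tab-separated): read_name, chrom, start, strand
collapse_unique() {
    awk -F'\t' '
        {
            # The UMI is the text after the last underscore in the read name
            n = split($1, parts, "_")
            umi = parts[n]
            key = $2 FS $3 FS $4 FS umi
            if (!(key in seen)) { seen[key] = 1; print }
        }
    '
}

# Toy input: r1 and r2 are duplicates (same position and UMI);
# r3 has a different UMI and r4 a different position, so both are kept.
printf 'r1_AACG\tchr1\t100\t+\nr2_AACG\tchr1\t100\t+\nr3_TTGC\tchr1\t100\t+\nr4_AACG\tchr1\t250\t+\n' \
    | collapse_unique
```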

BAM files were used to identify peak clusters with CLIPper.

CLIPper (version not specified)

$ Bash example

# Clone the CLIPper repository
# git clone https://github.com/yeolab/clipper.git
# cd clipper

# Ensure Python and required libraries (e.g., pysam, numpy, scipy) are installed
# conda install -c bioconda python pysam numpy scipy

# Placeholder for input BAM file (coordinate-sorted and indexed)
INPUT_BAM="input.bam"
# Placeholder for output peak file
OUTPUT_PEAKS="output_peaks.bed"
# Species/assembly label passed to CLIPper. The reads in this dataset were mapped to GRCh37 (hg19).
SPECIES="hg19"

# Execute CLIPper to identify peak clusters, after installing the package as described in the repository README.
# Thresholds such as the p-value or FDR cutoff are not specified in the description, so CLIPper defaults apply here.
# The flags follow the usage shown in the CLIPper README (-b BAM, -s species, -o output);
# confirm them with `clipper --help` for the installed version.
clipper -b "${INPUT_BAM}" -s "${SPECIES}" -o "${OUTPUT_PEAKS}"


Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.

eCLIP v0.1.5

$ Bash example

# Clone the eCLIP repository if not already available
# git clone https://github.com/yeolab/eclip.git
# export PATH=$PATH:/path/to/eclip/bin

# Define input and output files
PEAK_FILE="input_peak_clusters.bed" # Placeholder for the peak clusters file (e.g., from CLIPper)
IP_BAM="ip_sample.bam"             # Placeholder for the IP BAM file
INPUT_BAM="input_sample.bam"       # Placeholder for the INPUT BAM file
OUTPUT_PREFIX="normalized_peaks"   # Prefix for output files

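# Run overlap_peakfi_with_bam.pl for normalization.
# The argument order below is an assumption; check the script's usage message before running,
# since it may expect a different order or additional inputs such as mapped-read-count files.
overlap_peakfi_with_bam.pl "${PEAK_FILE}" "${IP_BAM}" "${INPUT_BAM}" "${OUTPUT_PREFIX}"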

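Normalizing IP against INPUT amounts to comparing the fraction of reads that fall in a peak in each library. The sketch below computes a log2 fold enrichment for one peak from four read counts. It is an illustration of the arithmetic only, not the logic of overlap_peakfi_with_bam.pl: the function name l2fc and the pseudocount of 1 (added so a peak with no INPUT reads does not divide by zero) are assumptions.

```shell
#!/usr/bin/env bash
# Log2 fold enrichment of IP over INPUT for a single peak.
# Usage: l2fc IP_READS_IN_PEAK IP_TOTAL_READS INPUT_READS_IN_PEAK INPUT_TOTAL_READS
l2fc() {
    awk -v ip="$1" -v ip_total="$2" -v inp="$3" -v inp_total="$4" 'BEGIN {
        pseudo = 1   # assumed pseudocount, avoids division by zero
        ratio = ((ip + pseudo) / ip_total) / ((inp + pseudo) / inp_total)
        printf "%.3f\n", log(ratio) / log(2)   # awk log() is the natural log
    }'
}

# 79 IP reads vs 9 INPUT reads in the peak, one million mapped reads per library:
# (80 / 1e6) / (10 / 1e6) = 8, so the result is log2(8) = 3.000
l2fc 79 1000000 9 1000000
```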

Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

eCLIP v0.1.5

$ Bash example

# Obtain the script. The processing description attributes it to eclip 0.1.5+;
# the merge_peaks repository is another place to look for it.
# git clone https://github.com/yeolab/merge_peaks.git
# cd merge_peaks

# Example usage: Merge overlapping normalized peak regions.
# Replace normalized_peak_rep1.bed, normalized_peak_rep2.bed, etc., with your actual input files.
# The arguments and the use of standard output shown here are assumptions; check the script's usage message before running.
perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl normalized_peak_rep1.bed normalized_peak_rep2.bed > merged_replicate_peaks.bed

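The rule the script applies to overlapping regions is not described here. As a simplified illustration of interval merging in general, the sketch below takes the union of overlapping BED intervals on the same chromosome. The function name merge_overlaps is hypothetical, strand and enrichment scores are ignored, and the real script may instead keep one representative peak per overlap.

```shell
#!/usr/bin/env bash
# Merge overlapping intervals in a 3-column BED stream (chrom, start, end).
# Intervals are half-open, so two intervals overlap when the next start
# is smaller than the current end.
merge_overlaps() {
    sort -k1,1 -k2,2n | awk 'BEGIN { OFS = "\t" }
        NR == 1 { chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0; next }
        $1 == chrom && $2 + 0 < cur_end {
            # Overlap: extend the current interval if needed
            if ($3 + 0 > cur_end) cur_end = $3 + 0
            next
        }
        {
            # No overlap: emit the finished interval and start a new one
            print chrom, cur_start, cur_end
            chrom = $1; cur_start = $2 + 0; cur_end = $3 + 0
        }
        END { if (NR > 0) print chrom, cur_start, cur_end }'
}

# Toy input: the first two chr1 intervals overlap and are merged.
printf 'chr1\t150\t300\nchr1\t100\t200\nchr1\t400\t500\nchr2\t100\t200\n' \
    | merge_overlaps
```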

Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR to determine reproducible peaks.

IDR v2.0.4 (inferred with models/gemini-2.5-flash)

$ Bash example

# Install IDR (if not already installed)
# conda install -c bioconda idr

# Example input files (output from make_informationcontent_from_peaks.pl)
# These files are assumed to be in narrowPeak format where the 5th column (score)
# has been replaced with the entropy score, as per the merge_peaks pipeline's
# make_informationcontent_from_peaks.pl script.
REP1_PEAKS="rep1.entropy_ranked.narrowPeak"
REP2_PEAKS="rep2.entropy_ranked.narrowPeak"
OUTPUT_PREFIX="idr_output"
IDR_THRESHOLD="0.05" # Common IDR threshold for reproducibility

# Run IDR using the entropy score (in the 5th column) for ranking
idr --samples "${REP1_PEAKS}" "${REP2_PEAKS}" \
    --output-file "${OUTPUT_PREFIX}.idr" \
    --rank score \
    --plot \
    --log-output-file "${OUTPUT_PREFIX}.idr.log" \
    --soft-idr-threshold "${IDR_THRESHOLD}"

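Ranking peaks by an entropy score can be sketched as follows. The formula used, p * log2(p / q) with p and q the fractions of IP and INPUT reads that fall in the peak, is an assumption based on how relative information content is commonly defined for eCLIP peaks; it is not taken from make_informationcontent_from_peaks.pl. The function name rank_by_entropy and the toy counts are hypothetical, and both counts must be non-zero.

```shell
#!/usr/bin/env bash
# Rank peaks by an assumed entropy score p * log2(p / q).
# Usage: rank_by_entropy IP_TOTAL_READS INPUT_TOTAL_READS < counts.tsv
# Input columns (tab-separated): peak_name, ip_reads_in_peak, input_reads_in_peak
rank_by_entropy() {
    awk -v ip_total="$1" -v in_total="$2" 'BEGIN { OFS = "\t" } {
        p = $2 / ip_total    # fraction of IP reads in this peak
        q = $3 / in_total    # fraction of INPUT reads in this peak
        print $1, sprintf("%.6f", p * log(p / q) / log(2))
    }' | sort -k2,2nr        # highest score first
}

# Toy input with one million mapped reads per library.
printf 'peakA\t80\t10\npeakB\t20\t10\npeakC\t40\t10\n' \
    | rank_by_entropy 1000000 1000000
```

The ranked scores would then take the place of the score column in the files passed to IDR.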

Tools Used

eCLIP STAR

Raw Source Text

Raw reads were processed using the eCLIP pipeline.
Reads were then trimmed with cutadapt
Reads were then trimmed again with cutadapt to remove double-ligation events.
Trimmed and filtered reads were then mapped with STAR against a repeat element database
Unmapped reads filtered of repeat elements were then mapped with STAR against a human genome (GRCh37)
Aligned reads were sorted with samtools
Sorted reads were collapsed with umi_tools.
BAM files were used to identify peak clusters with Clipper.
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR to determine reproducible peaks.
Assembly: hg19
Supplementary files format and content: BigWig files contain RPM-normalized read densities
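
RPM (reads per million) normalization divides coverage by the number of mapped reads in millions. The sketch below computes the scale factor and applies it to a toy bedGraph line. The function names rpm_scale and scale_bedgraph and the read count are hypothetical; how the authors generated their BigWig files is not described here. Coverage tools that accept a scale factor (for example the -scale option of bedtools genomecov) can take the same value.

```shell
#!/usr/bin/env bash
# RPM scale factor: one million divided by the number of mapped reads.
rpm_scale() {
    awk -v n="$1" 'BEGIN { printf "%.6f\n", 1000000 / n }'
}

# Multiply the coverage column (4th) of a bedGraph stream by a scale factor.
scale_bedgraph() {
    awk -v s="$1" 'BEGIN { OFS = "\t" } { $4 = sprintf("%.3f", $4 * s); print }'
}

# With 2,500,000 mapped reads the factor is 0.4,
# so a raw coverage of 5 becomes 2.000 RPM.
scale=$(rpm_scale 2500000)
printf 'chr1\t0\t100\t5\n' | scale_bedgraph "$scale"
```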
