GSE199650 Processing Pipeline

RIP-Seq, 11 steps

Publication

Human CCR4 deadenylase homolog Angel1 is a non-stop mRNA decay factor.

RNA (New York, N.Y.) (2025) — PMID 40441874

Dataset

GSE199650

The 2',3' cyclic phosphatase Angel1 facilitates mRNA degradation during human ribosome-associated quality control

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Raw reads were processed using the eCLIP pipeline.

    eCLIP v1.0.0
    $ Bash example
    # Create dummy input files for demonstration
    mkdir -p inputs outputs
    # printf is used because plain echo does not expand \n in Bash
    printf '@read1/1\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > inputs/sample_R1.fastq
    printf '@read1/2\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > inputs/sample_R2.fastq
    printf '@read1/1\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > inputs/control_R1.fastq
    printf '@read1/2\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > inputs/control_R2.fastq
    
    # Placeholder for reference genome (hg38 STAR index, FASTA, GTF)
    # In a real scenario, these paths would point to actual reference files.
    # For eCLIP, a common reference is hg38.
    # Example: /path/to/STAR_indexes/hg38
    STAR_INDEX_DIR="/path/to/STAR_indexes/hg38"
    GENOME_FASTA="/path/to/genome/hg38.fa"
    GENOME_GTF="/path/to/annotations/gencode.v38.annotation.gtf"
    
    # Create dummy reference directories/files for the script to run without error
    # In a real scenario, these would be actual pre-built indices and files.
    mkdir -p "${STAR_INDEX_DIR}"
    touch "${STAR_INDEX_DIR}/SA" # Dummy file to make it a valid directory for CWL
    mkdir -p "$(dirname "${GENOME_FASTA}")"
    touch "${GENOME_FASTA}"
    mkdir -p "$(dirname "${GENOME_GTF}")"
    touch "${GENOME_GTF}"
    
    # Create a sample CWL input YAML file
    cat << EOF > inputs.yaml
    sample_r1:
      class: File
      path: inputs/sample_R1.fastq
    sample_r2:
      class: File
      path: inputs/sample_R2.fastq
    control_r1:
      class: File
      path: inputs/control_R1.fastq
    control_r2:
      class: File
      path: inputs/control_R2.fastq
    star_index_dir:
      class: Directory
      path: ${STAR_INDEX_DIR}
    genome_fasta:
      class: File
      path: ${GENOME_FASTA}
    genome_gtf:
      class: File
      path: ${GENOME_GTF}
    output_dir: outputs
    EOF
    
    # Install cwltool if not already installed
    # pip install cwltool
    
    # Obtain the eCLIP CWL workflow definition from the eCLIP repository
    # (https://github.com/yeolab/eclip). The file name eclip_pipeline.cwl is a
    # placeholder, as are the input keys in inputs.yaml; use the names defined
    # by the workflow release you actually run.
    
    # Run the workflow with cwltool. Options such as --outdir must precede the
    # workflow file, and the job (inputs) file is passed as a positional argument.
    cwltool --outdir outputs eclip_pipeline.cwl inputs.yaml
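    Before launching the workflow, it can help to confirm that each FASTQ input is well formed, since malformed records tend to fail in confusing ways downstream. The following is a minimal sketch: fastq_check is a hypothetical helper (not part of eCLIP), and it assumes uncompressed FASTQ files with exactly four lines per record.

```shell
# Hypothetical helper: print the number of records in an uncompressed FASTQ
# file, or print "malformed" and return non-zero if any record is invalid.
fastq_check() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { bad = 1 }   # header line
        NR % 4 == 2 { seqlen = length($0) }                  # sequence line
        NR % 4 == 3 && substr($0, 1, 1) != "+" { bad = 1 }   # separator line
        NR % 4 == 0 && length($0) != seqlen { bad = 1 }      # quality line
        END {
            if (bad || NR % 4 != 0) { print "malformed"; exit 1 }
            print NR / 4
        }
    ' "$1"
}

# Demonstration on a one-record temporary file
tmp_fq="$(mktemp)"
printf '@read1/1\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > "${tmp_fq}"
fastq_check "${tmp_fq}"   # prints 1
rm -f "${tmp_fq}"
```

    For gzip-compressed inputs, the same check could be applied to a decompressed stream written to a temporary file first.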
  2.

    Reads were then trimmed with cutadapt

    cutadapt v4.0 GitHub
    $ Bash example
    # Install cutadapt via conda
    # conda install -c bioconda cutadapt=4.0
    
    # Trim adapters and low-quality bases from paired-end reads
    # -a "A{100}": Trim a 3' poly-A stretch (up to 100 A's) from R1
    # -A "G{100}": Trim a 3' poly-G stretch (up to 100 G's) from R2 (-A is the R2 counterpart of -a, not a 5' adapter option)
    # -q 20: Trim low-quality bases from the 3' end with a quality cutoff of 20
    # --minimum-length 18: Discard reads shorter than 18 bp after trimming
    # -o: Output file for R1 reads
    # -p: Output file for R2 reads
    # input_R1.fastq.gz input_R2.fastq.gz: Input paired-end FASTQ files
    cutadapt \
        -a "A{100}" \
        -A "G{100}" \
        -q 20 \
        --minimum-length 18 \
        -o trimmed_R1.fastq.gz \
        -p trimmed_R2.fastq.gz \
        input_R1.fastq.gz input_R2.fastq.gz
  3.

    Reads were then trimmed again with cutadapt to remove double-ligation events.

    cutadapt v2.10 GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=2.10
    
    # Define input and output files
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="trimmed_double_ligation.fastq.gz"
    
    # Define the 3' adapter sequence for eCLIP, which can cause double-ligation events.
    # This adapter sequence is commonly used in Yeo lab eCLIP workflows (e.g., in the CWL workflow).
    # For specific experiments, verify the exact adapter sequence used.
    ADAPTER_SEQUENCE="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
    
    # Trim reads to remove the 3' adapter sequence, addressing double-ligation events.
    # -a: 3' adapter sequence to be removed from the 3' end of the reads.
    # -o: Output file for trimmed reads.
    # --minimum-length: Discard reads shorter than this length after trimming (e.g., 18 bp, common in eCLIP).
    # --quality-cutoff: Trim low-quality bases from the 3' end using a Phred quality score cutoff (e.g., 20).
    # --cores: Number of CPU cores to use for parallel processing.
    cutadapt \
        -a "${ADAPTER_SEQUENCE}" \
        -o "${OUTPUT_FASTQ}" \
        --minimum-length 18 \
        --quality-cutoff 20 \
        --cores 4 \
        "${INPUT_FASTQ}"
  4.

    Trimmed and filtered reads were then mapped with STAR against a repeat element database

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Note: A STAR genome index for repeat elements must be pre-built. 
    # Example command for building the index (replace 'repeat_elements.fasta' with your actual repeat FASTA file):
    # STAR --runThreadN 8 --runMode genomeGenerate --genomeDir repeat_element_star_index --genomeFastaFiles repeat_elements.fasta --genomeSAindexNbases 12
    
    # Map trimmed and filtered reads to the repeat element database
    # Input: input_reads.fastq.gz (placeholder for trimmed and filtered reads)
    # Output: mapped_repeats_Aligned.sortedByCoord.out.bam (sorted BAM file of mapped reads)
    #         mapped_repeats_Unmapped.out.mate1 (unmapped reads, uncompressed FASTQ, often used for subsequent genomic alignment in eCLIP)
    STAR \
      --runThreadN 8 \
      --genomeDir repeat_element_star_index \
      --readFilesIn input_reads.fastq.gz \
      --readFilesCommand zcat \
      --outFileNamePrefix mapped_repeats_ \
      --outSAMtype BAM SortedByCoordinate \
      --outFilterMultimapNmax 1 \
      --outFilterMismatchNmax 3 \
      --alignIntronMax 1 \
      --alignSJDBoverhangMin 1 \
      --outReadsUnmapped Fastx
  5.

    Unmapped reads filtered of repeat elements were then mapped with STAR against a human genome (GRCh37)

    $ Bash example
    # Install STAR (example using conda)
    # conda create -n star_env star=2.7.0f -c bioconda -c conda-forge
    # conda activate star_env
    
    # Reference Genome (GRCh37/hg19) and Annotation (GENCODE v19)
    # Download FASTA and GTF files:
    # wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
    # wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
    # gunzip hg19.fa.gz
    # gunzip gencode.v19.annotation.gtf.gz
    
    # Build STAR genome index (example, adjust --runThreadN and --sjdbOverhang as needed)
    # mkdir -p /path/to/STAR_index/GRCh37_GENCODEv19
    # STAR --runMode genomeGenerate \
    #      --genomeDir /path/to/STAR_index/GRCh37_GENCODEv19 \
    #      --genomeFastaFiles hg19.fa \
    #      --sjdbGTFfile gencode.v19.annotation.gtf \
    #      --sjdbOverhang 99 \
    #      --runThreadN 8
    
    # Define variables for input and output
    INPUT_READS="filtered_reads.fastq.gz" # Reads after filtering repeat elements
    GENOME_DIR="/path/to/STAR_index/GRCh37_GENCODEv19" # Path to the pre-built STAR genome index for GRCh37
    OUTPUT_PREFIX="mapped_reads_"
    
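    # Execute STAR mapping
    # Note: --alignIntronMax 1 disallows unannotated splice junctions. This and the
    # other filter values are illustrative and are not stated in the source description.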
    STAR --runThreadN 8 \
         --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${INPUT_READS}" \
         --readFilesCommand zcat \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 1 \
         --alignIntronMax 1 \
         --outFilterMismatchNmax 3 \
         --outFilterScoreMinOverLread 0.66 \
         --outFilterMatchNminOverLread 0.66
  6.

    Aligned reads were sorted with samtools

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools (example using conda)
    # conda install -c bioconda samtools
    
    # Sort aligned reads by coordinate (default)
    # Replace input.bam with your actual input aligned BAM file
    # Replace output_sorted.bam with your desired output sorted BAM file
    samtools sort -o output_sorted.bam input.bam
  7.

    Sorted reads were collapsed with umi_tools.

    UMI-tools (latest version; inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install UMI-tools if not already installed
    # conda install -c bioconda umi-tools
    
    # Example usage:
    # Assuming 'sorted_reads.bam' is the input BAM file with UMIs in read names,
    # sorted by coordinate (e.g., using samtools sort) and indexed (samtools index sorted_reads.bam).
    # The --method unique option collapses reads with identical UMIs and mapping positions.
    # Other methods like 'cluster' or 'directional' might be used depending on the specific assay and desired stringency.
    umi_tools dedup \
        --stdin sorted_reads.bam \
        --stdout collapsed_reads.bam \
        --method unique \
        --log dedup.log
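    To make the "unique" collapsing rule concrete, the sketch below applies the same idea to a plain text table rather than a BAM file. It is an illustration only: dedup_unique is a hypothetical helper, the five-column input layout is an assumption, and umi_tools itself derives the position from the alignment (including strand and clipping) rather than from a precomputed column.

```shell
# Hypothetical input: one aligned read per line, tab-separated as
#   chrom  start  strand  UMI  read_name
# Keep the first read seen for each (chrom, start, strand, UMI) combination.
dedup_unique() {
    awk -F '\t' '!seen[$1 FS $2 FS $3 FS $4]++' "$@"
}

# r2 duplicates r1 (same position, strand, and UMI) and is dropped;
# r3 differs in UMI and r4 differs in strand, so both are kept.
printf 'chr1\t100\t+\tACGT\tr1\nchr1\t100\t+\tACGT\tr2\nchr1\t100\t+\tACGA\tr3\nchr1\t100\t-\tACGT\tr4\n' |
    dedup_unique
```

    Because only exact UMI matches are collapsed, a single sequencing error in a UMI leaves an extra read behind, which is the trade-off the 'cluster' and 'directional' methods address.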
  8.

    BAM files were used to identify peak clusters with Clipper.

    CLIPper (version not specified) GitHub
    $ Bash example
    # Clone the CLIPper repository
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    
    # Ensure Python and required libraries (e.g., pysam, numpy, scipy) are installed
    # conda install -c bioconda python pysam numpy scipy
    
    # Placeholder for input BAM file
    INPUT_BAM="input.bam"
    # Placeholder for output peak file
    OUTPUT_PEAKS="output_peaks.bed"
    # Species/assembly name passed to CLIPper. hg19 matches the GRCh37 assembly
    # stated for this dataset.
    SPECIES="hg19"
    
    # Execute CLIPper to identify peak clusters.
    # This assumes CLIPper has been installed so that the 'clipper' command is on PATH.
    # Thresholds and other tuning parameters are not specified in the description,
    # so CLIPper defaults are used here.
    clipper -b "${INPUT_BAM}" -s "${SPECIES}" -o "${OUTPUT_PEAKS}"
  9.

    Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.

    $ Bash example
    # Clone the eCLIP repository if not already available
    # git clone https://github.com/yeolab/eclip.git
    # export PATH=$PATH:/path/to/eclip/bin
    
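    # Define input and output files
    PEAK_FILE="input_peak_clusters.bed"   # Placeholder for the peak clusters file (e.g., from CLIPper)
    IP_BAM="ip_sample.bam"                # Placeholder for the IP BAM file
    INPUT_BAM="input_sample.bam"          # Placeholder for the INPUT BAM file
    READ_NUM_FILE="mapped_read_num.txt"   # Assumed: file listing usable read counts per BAM
    OUTPUT_FILE="normalized_peaks.bed"    # Placeholder for the normalized peak output
    
    # Run overlap_peakfi_with_bam.pl for normalization.
    # The positional argument order below (IP BAM, INPUT BAM, peaks, read counts, output)
    # is an assumption; confirm it against the usage message in the script before running.
    overlap_peakfi_with_bam.pl "${IP_BAM}" "${INPUT_BAM}" "${PEAK_FILE}" "${READ_NUM_FILE}" "${OUTPUT_FILE}"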
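    The core quantity behind IP-versus-INPUT normalization is a log2 fold enrichment of library-size-normalized read counts within each peak. The sketch below shows that arithmetic only; l2fc is a hypothetical helper, the pseudocount of 1 for peaks with no INPUT reads is an assumption, and the significance test that the Perl script also reports is not reproduced here.

```shell
# Hypothetical helper: log2 fold enrichment of IP over INPUT for one peak.
# Arguments: reads in peak (IP), total usable reads (IP),
#            reads in peak (INPUT), total usable reads (INPUT).
l2fc() {
    awk -v ip="$1" -v ip_total="$2" -v in_reads="$3" -v in_total="$4" 'BEGIN {
        # Assumption: substitute 1 read when INPUT has none, to avoid dividing by zero
        if (in_reads == 0) in_reads = 1
        printf "%.3f\n", log((ip / ip_total) / (in_reads / in_total)) / log(2)
    }'
}

# 80 of 1,000,000 IP reads versus 10 of 2,000,000 INPUT reads: 16-fold enrichment
l2fc 80 1000000 10 2000000   # prints 4.000
```

    Dividing by the library totals first matters: the raw counts above differ 8-fold, but the INPUT library is twice as deep, so the enrichment is 16-fold.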
  10.

    Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

    $ Bash example
    # Clone the merge_peaks repository
    # git clone https://github.com/yeolab/merge_peaks.git
    # cd merge_peaks
    
    # Example usage: merge overlapping normalized peak regions.
    # Replace normalized_peak_rep1.bed, normalized_peak_rep2.bed, etc., with your actual input files.
    # The argument list and the use of standard output shown here are assumptions;
    # check the script's usage message, which may expect explicit output file arguments.
    perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl normalized_peak_rep1.bed normalized_peak_rep2.bed > merged_replicate_peaks.bed
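    As a rough illustration of what collapsing overlapping peaks involves, the sketch below groups overlapping BED6 intervals on the same chromosome and strand and keeps one representative per group. It is not the Perl script's algorithm: compress_overlaps is a hypothetical helper, and the rule of keeping the interval with the highest value in column 5 is an assumption chosen for clarity.

```shell
# Hypothetical helper: for each cluster of overlapping BED6 intervals
# (same chromosome and strand), keep the interval with the highest score (column 5).
compress_overlaps() {
    LC_ALL=C sort -t "$(printf '\t')" -k1,1 -k6,6 -k2,2n "$@" |
    awk -F '\t' '
        function emit() { if (best != "") print best }
        {
            # Start a new cluster when chromosome or strand changes, or when the
            # interval begins at or after the cluster end (BED ends are exclusive).
            if ($1 != chrom || $6 != strand || $2 + 0 >= cluster_end) {
                emit()
                chrom = $1; strand = $6; cluster_end = $3 + 0
                best = $0; best_score = $5 + 0
                next
            }
            if ($3 + 0 > cluster_end) cluster_end = $3 + 0
            if ($5 + 0 > best_score) { best = $0; best_score = $5 + 0 }
        }
        END { emit() }
    '
}

# p1 and p2 overlap on the + strand (p2 has the higher score); p3 and p4 stand alone.
printf 'chr1\t100\t200\tp1\t3.5\t+\nchr1\t150\t250\tp2\t5.0\t+\nchr1\t300\t400\tp3\t2.0\t+\nchr1\t150\t250\tp4\t9.0\t-\n' |
    compress_overlaps
```

    Keeping strand in the grouping key prevents a peak on one strand from absorbing a peak on the other, which matters for strand-specific protocols such as this one.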
  11.

    Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR to determine reproducible peaks.

    IDR v2.0.4 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install IDR (if not already installed)
    # conda install -c bioconda idr
    
    # Example input files (output from make_informationcontent_from_peaks.pl)
    # These files are assumed to be in narrowPeak format where the 5th column (score)
    # has been replaced with the entropy score, as per the merge_peaks pipeline's
    # make_informationcontent_from_peaks.pl script.
    REP1_PEAKS="rep1.entropy_ranked.narrowPeak"
    REP2_PEAKS="rep2.entropy_ranked.narrowPeak"
    OUTPUT_PREFIX="idr_output"
    IDR_THRESHOLD="0.05" # Common IDR threshold; --soft-idr-threshold only affects reported statistics, while --idr-threshold filters the output
    
    # Run IDR using the entropy score (in the 5th column) for ranking
    idr --samples "${REP1_PEAKS}" "${REP2_PEAKS}" \
        --output-file "${OUTPUT_PREFIX}.idr" \
        --rank score \
        --plot \
        --log-output-file "${OUTPUT_PREFIX}.idr.log" \
        --soft-idr-threshold "${IDR_THRESHOLD}"
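    The entropy score used for ranking can be illustrated with a short calculation. The sketch below assumes the relative information content definition p * log2(p / q), where p and q are the fractions of all IP and INPUT reads that fall in the peak. entropy_score is a hypothetical helper, and whether make_informationcontent_from_peaks.pl computes exactly this quantity should be confirmed against the script.

```shell
# Hypothetical helper: relative information content of one peak.
# Arguments: reads in peak (IP), total usable reads (IP),
#            reads in peak (INPUT), total usable reads (INPUT).
# Both peak read counts are assumed to be greater than zero.
entropy_score() {
    awk -v ip="$1" -v ip_total="$2" -v in_reads="$3" -v in_total="$4" 'BEGIN {
        p = ip / ip_total          # fraction of IP reads in the peak
        q = in_reads / in_total    # fraction of INPUT reads in the peak
        printf "%.8f\n", p * log(p / q) / log(2)
    }'
}

# p = 8e-5, q = 5e-6, log2(p/q) = 4, so the score is 0.00032
entropy_score 80 1000000 10 2000000   # prints 0.00032000
```

    Unlike fold enrichment alone, this score weights enrichment by the share of IP reads in the peak, so a strongly enriched peak with very few reads ranks below a comparably enriched peak with many reads.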

Tools Used

Raw Source Text
Raw reads were processed using the eCLIP pipeline.
Reads were then trimmed with cutadapt
Reads were then trimmed again with cutadapt to remove double-ligation events.
Trimmed and filtered reads were then mapped with STAR against a repeat element database
Unmapped reads filtered of repeat elements were then mapped with STAR against a human genome (GRCh37)
Aligned reads were sorted with samtools
Sorted reads were collapsed with umi_tools.
BAM files were used to identify peak clusters with Clipper.
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR to determine reproducible peaks.
Assembly: hg19
Supplementary files format and content: BigWig files contain RPM-normalized read densities
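
RPM (reads per million) normalization scales raw coverage by 1,000,000 divided by the number of mapped reads in the library. The sketch below computes that scale factor; rpm_scale is a hypothetical helper, and the commented commands show one common way to apply it with bedtools and bedGraphToBigWig, which is an assumption rather than the method documented for this dataset.

```shell
# Hypothetical helper: RPM scale factor for a library with the given
# number of mapped reads.
rpm_scale() {
    awk -v n="$1" 'BEGIN { printf "%.6f\n", 1000000 / n }'
}

# A library with 2,500,000 mapped reads is scaled by 0.4
rpm_scale 2500000   # prints 0.400000

# One possible way to apply the factor (file names are placeholders):
# SCALE="$(rpm_scale "$(samtools view -c -F 4 collapsed_reads.bam)")"
# bedtools genomecov -ibam collapsed_reads.bam -bg -scale "${SCALE}" > sample.rpm.bedGraph
# LC_ALL=C sort -k1,1 -k2,2n sample.rpm.bedGraph > sample.rpm.sorted.bedGraph
# bedGraphToBigWig sample.rpm.sorted.bedGraph hg19.chrom.sizes sample.rpm.bw
```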