GSE199650 Processing Pipeline
Publication
Human CCR4 deadenylase homolog Angel1 is a non-stop mRNA decay factor. RNA (New York, N.Y.) (2025), PMID 40441874
Dataset
GSE199650: The 2',3' cyclic phosphatase Angel1 facilitates mRNA degradation during human ribosome-associated quality control
Processing Steps
1
Raw reads were processed using the eCLIP pipeline.
eCLIP v1.0.0
$ Bash example
# Create small placeholder FASTQ files for demonstration only.
mkdir -p inputs outputs
for name in sample_R1 sample_R2 control_R1 control_R2; do
    # printf interprets \n; plain echo would write the backslashes literally.
    printf '@read1\nATGCATGCATGCATGC\n+\nIIIIIIIIIIIIIIII\n' > "inputs/${name}.fastq"
done

# Placeholder reference paths (STAR index, FASTA, GTF). This record reports
# assembly hg19 (GRCh37); replace these with real, pre-built reference files.
STAR_INDEX_DIR="refs/STAR_indexes/hg19"
GENOME_FASTA="refs/genome/hg19.fa"
GENOME_GTF="refs/annotations/gencode.v19.annotation.gtf"

# Create empty stand-ins so the paths exist. They are not usable references.
mkdir -p "${STAR_INDEX_DIR}" "$(dirname "${GENOME_FASTA}")" "$(dirname "${GENOME_GTF}")"
touch "${STAR_INDEX_DIR}/SA" "${GENOME_FASTA}" "${GENOME_GTF}"

# Write a CWL job file. The input names below are illustrative and must match
# the inputs declared by the workflow you actually run.
cat << EOF > inputs.yaml
sample_r1:
  class: File
  path: inputs/sample_R1.fastq
sample_r2:
  class: File
  path: inputs/sample_R2.fastq
control_r1:
  class: File
  path: inputs/control_R1.fastq
control_r2:
  class: File
  path: inputs/control_R2.fastq
star_index_dir:
  class: Directory
  path: ${STAR_INDEX_DIR}
genome_fasta:
  class: File
  path: ${GENOME_FASTA}
genome_gtf:
  class: File
  path: ${GENOME_GTF}
output_dir: outputs
EOF

# Install cwltool if not already installed
# pip install cwltool

# Download the eCLIP CWL workflow definition
# wget https://raw.githubusercontent.com/yeolab/eclip/master/eclip_pipeline.cwl

# Run the workflow: cwltool takes its options first, then the workflow file and the job file.
# Confirm that the workflow and its container correspond to eCLIP v1.0.0 before running.
cwltool --outdir outputs eclip_pipeline.cwl inputs.yaml
2
Reads were then trimmed with cutadapt.
$ Bash example
# Install cutadapt via conda
# conda install -c bioconda cutadapt=4.0

# Trim adapters and low-quality bases from paired-end reads.
# The adapter choices below are illustrative; the description does not state which adapters were used.
# -a "A{100}": 3' adapter for read 1, here a poly-A run (A{100} expands to 100 A's)
# -A "G{100}": 3' adapter for read 2, here a poly-G run (G{100} expands to 100 G's)
# -q 20: Trim low-quality bases from the 3' end with a quality cutoff of 20
# --minimum-length 18: Discard read pairs shorter than 18 bp after trimming
# -o: Output file for R1 reads
# -p: Output file for R2 reads
# input_R1.fastq.gz input_R2.fastq.gz: Input paired-end FASTQ files
cutadapt \
    -a "A{100}" \
    -A "G{100}" \
    -q 20 \
    --minimum-length 18 \
    -o trimmed_R1.fastq.gz \
    -p trimmed_R2.fastq.gz \
    input_R1.fastq.gz input_R2.fastq.gz
3
Reads were then trimmed again with cutadapt to remove double-ligation events.
$ Bash example
# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=2.10

# Define input and output files
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="trimmed_double_ligation.fastq.gz"

# Define the 3' adapter sequence for eCLIP, which can cause double-ligation events.
# This adapter sequence is commonly used in Yeo lab eCLIP workflows (e.g., in the CWL workflow).
# For specific experiments, verify the exact adapter sequence used.
ADAPTER_SEQUENCE="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Trim reads to remove the 3' adapter sequence, addressing double-ligation events.
# -a: 3' adapter sequence to be removed from the 3' end of the reads.
# -o: Output file for trimmed reads.
# --minimum-length: Discard reads shorter than this length after trimming (e.g., 18 bp, common in eCLIP).
# --quality-cutoff: Trim low-quality bases from the 3' end using a Phred quality score cutoff (e.g., 20).
# --cores: Number of CPU cores to use for parallel processing.
cutadapt \
    -a "${ADAPTER_SEQUENCE}" \
    -o "${OUTPUT_FASTQ}" \
    --minimum-length 18 \
    --quality-cutoff 20 \
    --cores 4 \
    "${INPUT_FASTQ}"
4
Trimmed and filtered reads were then mapped with STAR against a repeat element database.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Note: A STAR genome index for repeat elements must be pre-built.
# Example command for building the index (replace 'repeat_elements.fasta' with your actual repeat FASTA file):
# STAR --runThreadN 8 --runMode genomeGenerate --genomeDir repeat_element_star_index \
#      --genomeFastaFiles repeat_elements.fasta --genomeSAindexNbases 12

# Map trimmed and filtered reads to the repeat element database
# Input:  input_reads.fastq.gz (placeholder for trimmed and filtered reads)
# Output: mapped_repeats_Aligned.sortedByCoord.out.bam (sorted BAM file of mapped reads)
#         mapped_repeats_Unmapped.out.mate1 (unmapped reads, often used for subsequent genomic alignment in eCLIP)
# STAR appends its file names directly to the prefix, with no extra separator.
STAR \
    --runThreadN 8 \
    --genomeDir repeat_element_star_index \
    --readFilesIn input_reads.fastq.gz \
    --readFilesCommand zcat \
    --outFileNamePrefix mapped_repeats_ \
    --outSAMtype BAM SortedByCoordinate \
    --outFilterMultimapNmax 1 \
    --outFilterMismatchNmax 3 \
    --alignIntronMax 1 \
    --alignSJDBoverhangMin 1 \
    --outReadsUnmapped Fastx
5
Reads that did not map to repeat elements were then mapped with STAR against the human genome (GRCh37).
$ Bash example
# Install STAR (example using conda)
# conda create -n star_env star=2.7.0f -c bioconda -c conda-forge
# conda activate star_env

# Reference Genome (GRCh37/hg19) and Annotation (GENCODE v19)
# Download FASTA and GTF files:
# wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz
# wget ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz
# gunzip hg19.fa.gz
# gunzip gencode.v19.annotation.gtf.gz

# Build STAR genome index (example, adjust --runThreadN and --sjdbOverhang as needed)
# mkdir -p /path/to/STAR_index/GRCh37_GENCODEv19
# STAR --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index/GRCh37_GENCODEv19 \
#      --genomeFastaFiles hg19.fa \
#      --sjdbGTFfile gencode.v19.annotation.gtf \
#      --sjdbOverhang 99 \
#      --runThreadN 8

# Define variables for input and output
INPUT_READS="filtered_reads.fastq.gz"                # Reads after filtering repeat elements
GENOME_DIR="/path/to/STAR_index/GRCh37_GENCODEv19"   # Path to the pre-built STAR genome index for GRCh37
OUTPUT_PREFIX="mapped_reads_"

# Execute STAR mapping
STAR --runThreadN 8 \
    --genomeDir "${GENOME_DIR}" \
    --readFilesIn "${INPUT_READS}" \
    --readFilesCommand zcat \
    --outFileNamePrefix "${OUTPUT_PREFIX}" \
    --outSAMtype BAM SortedByCoordinate \
    --outSAMattributes All \
    --outFilterMultimapNmax 1 \
    --alignIntronMax 1 \
    --outFilterMismatchNmax 3 \
    --outFilterScoreMinOverLread 0.66 \
    --outFilterMatchNminOverLread 0.66
6
Aligned reads were sorted with samtools.
$ Bash example
# Install samtools (example using conda)
# conda install -c bioconda samtools

# Sort aligned reads by coordinate (default)
# Replace input.bam with your actual input aligned BAM file
# Replace output_sorted.bam with your desired output sorted BAM file
samtools sort -o output_sorted.bam input.bam
7
Sorted reads were collapsed with umi_tools.
$ Bash example
# Install UMI-tools if not already installed
# conda install -c bioconda umi-tools

# Example usage:
# Assuming 'sorted_reads.bam' is the input BAM file with UMIs in read names
# and it is sorted by coordinate (e.g., using samtools sort).
# umi_tools dedup also needs a BAM index next to the input:
# samtools index sorted_reads.bam
# The --method unique option collapses reads with identical UMIs and mapping positions.
# Other methods like 'cluster' or 'directional' might be used depending on the specific assay and desired stringency.
umi_tools dedup \
    --stdin sorted_reads.bam \
    --stdout collapsed_reads.bam \
    --method unique \
    --log dedup.log
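To make the idea of "collapsing" concrete, here is a minimal sketch that works on a toy tab-separated table rather than a BAM file. The four-column layout (chromosome, 5' position, strand, UMI), the file name `toy_reads.tsv`, and the function name `collapse_by_umi` are assumptions for illustration; umi_tools itself reads BAM and applies its own logic.

```shell
# Hypothetical toy input: chrom, 5' position, strand, UMI (tab-separated).
# This is NOT how umi_tools reads data; it only illustrates the idea.
collapse_by_umi() {
    # Keep the first read seen for each (chrom, position, strand, UMI) key.
    awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++' "$1"
}

# Four toy reads: the first two are PCR duplicates (same position and UMI).
printf 'chr1\t100\t+\tACGT\nchr1\t100\t+\tACGT\nchr1\t100\t+\tTTGA\nchr1\t250\t-\tACGT\n' > toy_reads.tsv

# Prints three lines: one duplicate has been collapsed.
collapse_by_umi toy_reads.tsv
```

Reads that share a position but carry different UMIs are kept, since they likely come from distinct molecules.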
8
BAM files were used to identify peak clusters with Clipper.
$ Bash example
# Clone the CLIPper repository
# git clone https://github.com/yeolab/clipper.git
# cd clipper

# Ensure Python and required libraries (e.g., pysam, numpy, scipy) are installed
# conda install -c bioconda python pysam numpy scipy

# Placeholder for input BAM file
INPUT_BAM="input.bam"
# Placeholder for output peak file
OUTPUT_PEAKS="output_peaks.bed"
# Species/assembly label; this record reports assembly hg19 (GRCh37).
SPECIES="hg19"

# Execute CLIPper to identify peak clusters.
# Thresholds such as p-value cutoffs are not specified in the description, so defaults are used here.
# The flags below follow the CLIPper README (-b BAM, -s species, -o output);
# confirm them with `clipper --help` for the installed version.
clipper -b "${INPUT_BAM}" -s "${SPECIES}" -o "${OUTPUT_PEAKS}"
9
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.
$ Bash example
# Clone the eCLIP repository if not already available
# git clone https://github.com/yeolab/eclip.git
# export PATH=$PATH:/path/to/eclip/bin

# Define input and output files
PEAK_FILE="input_peak_clusters.bed"   # Placeholder for the peak clusters file (e.g., from CLIPper)
IP_BAM="ip_sample.bam"                # Placeholder for the IP BAM file
INPUT_BAM="input_sample.bam"          # Placeholder for the INPUT BAM file
OUTPUT_PREFIX="normalized_peaks"      # Prefix for output files

# Run overlap_peakfi_with_bam.pl for normalization.
# The argument order here is illustrative: check the script's usage message for the
# exact positional arguments (it may also expect mapped-read counts) before running.
overlap_peakfi_with_bam.pl "${PEAK_FILE}" "${IP_BAM}" "${INPUT_BAM}" "${OUTPUT_PREFIX}"
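The normalization step compares read counts in each peak between IP and INPUT. A minimal sketch of a library-size-normalized log2 fold enrichment is shown below. The function name `log2_fold_enrichment`, the pseudocount of 1, and the example counts are assumptions for illustration, not the internals of overlap_peakfi_with_bam.pl.

```shell
# log2 fold enrichment of IP over INPUT for one peak, normalized by library size.
# Arguments: reads in peak (IP), total usable reads (IP),
#            reads in peak (INPUT), total usable reads (INPUT).
# A pseudocount of 1 is added to each peak count to avoid division by zero (assumption).
log2_fold_enrichment() {
    awk -v ip="$1" -v ip_total="$2" -v inp="$3" -v inp_total="$4" 'BEGIN {
        ip_frac  = (ip  + 1) / ip_total
        inp_frac = (inp + 1) / inp_total
        # awk only provides the natural log, so divide by log(2).
        printf "%.3f\n", log(ip_frac / inp_frac) / log(2)
    }'
}

# 79 IP reads vs 9 INPUT reads in equally sized libraries: (80/10) = 8-fold, so log2 = 3.
log2_fold_enrichment 79 1000000 9 1000000   # prints 3.000
```

Dividing by the library totals matters when IP and INPUT were sequenced to different depths; with equal totals the terms cancel, as in the example.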
10
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+.
$ Bash example
# Clone the merge_peaks repository
# git clone https://github.com/yeolab/merge_peaks.git
# cd merge_peaks

# Example usage: Merge overlapping normalized peak regions from multiple replicate files.
# Replace normalized_peak_rep1.bed, normalized_peak_rep2.bed, etc., with your actual input files.
# This invocation assumes the script writes merged peaks to standard output;
# check the script's usage message for its actual arguments before running.
perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
    normalized_peak_rep1.bed normalized_peak_rep2.bed > merged_replicate_peaks.bed
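As a rough illustration of what merging overlapping peak regions involves, the sketch below performs a generic strand-aware interval merge on a BED6 file and keeps the highest score in each merged region. This is a generic technique, not the logic of the Perl script above; the function name `merge_overlapping_peaks`, the file `toy_peaks.bed`, and the choice to keep the maximum score are assumptions.

```shell
# Generic strand-aware merge of overlapping BED6 intervals (illustrative only).
# Input columns: chrom, start, end, name, score, strand (tab-separated).
merge_overlapping_peaks() {
    # Sort by chrom, strand, then numeric start so overlapping peaks are adjacent.
    LC_ALL=C sort -k1,1 -k6,6 -k2,2n "$1" | awk -F'\t' -v OFS='\t' '
        # Start a new region on the first line, a new chrom/strand, or no overlap.
        NR == 1 || $1 != chrom || $6 != strand || ($2 + 0) >= stop {
            if (NR > 1) print chrom, start, stop, ".", best, strand
            chrom = $1; start = $2 + 0; stop = $3 + 0; best = $5 + 0; strand = $6
            next
        }
        {
            # Overlapping peak: extend the region and keep the highest score.
            if (($3 + 0) > stop) stop = $3 + 0
            if (($5 + 0) > best) best = $5 + 0
        }
        END { if (NR > 0) print chrom, start, stop, ".", best, strand }'
}

# Toy peaks: p1 and p2 overlap on the + strand; p4 overlaps them but is on the - strand.
printf 'chr1\t100\t200\tp1\t3.0\t+\nchr1\t150\t260\tp2\t5.5\t+\nchr1\t400\t450\tp3\t2.0\t+\nchr1\t180\t220\tp4\t9.0\t-\n' > toy_peaks.bed

# Prints three regions; p1 and p2 become chr1 100 260 with score 5.5.
merge_overlapping_peaks toy_peaks.bed
```

Keeping strands separate matters for CLIP data, because peaks on opposite strands come from different transcripts.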
11
Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR to determine reproducible peaks.
$ Bash example
# Install IDR (if not already installed)
# conda install -c bioconda idr

# Example input files (output from make_informationcontent_from_peaks.pl)
# These files are assumed to be in narrowPeak format where the 5th column (score)
# has been replaced with the entropy score, as per the merge_peaks pipeline's
# make_informationcontent_from_peaks.pl script.
REP1_PEAKS="rep1.entropy_ranked.narrowPeak"
REP2_PEAKS="rep2.entropy_ranked.narrowPeak"
OUTPUT_PREFIX="idr_output"
IDR_THRESHOLD="0.05"   # Common IDR threshold for reproducibility

# Run IDR using the entropy score (in the 5th column) for ranking
idr --samples "${REP1_PEAKS}" "${REP2_PEAKS}" \
    --output-file "${OUTPUT_PREFIX}.idr" \
    --rank score \
    --plot \
    --log-output-file "${OUTPUT_PREFIX}.idr.log" \
    --soft-idr-threshold "${IDR_THRESHOLD}"
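The entropy score used for ranking can be sketched as a per-peak information content, p * log2(p / q), where p is the fraction of IP reads in the peak and q is the fraction of INPUT reads in the same peak. This formula is an assumption about what make_informationcontent_from_peaks.pl computes and should be checked against the script; the function name `information_content` and the example counts are hypothetical.

```shell
# Per-peak information content: p * log2(p / q)  (assumed definition).
# Arguments: reads in peak (IP), total usable reads (IP),
#            reads in peak (INPUT), total usable reads (INPUT).
information_content() {
    awk -v ip="$1" -v ip_total="$2" -v inp="$3" -v inp_total="$4" 'BEGIN {
        p = ip  / ip_total    # fraction of IP reads falling in this peak
        q = inp / inp_total   # fraction of INPUT reads falling in this peak
        printf "%.6f\n", p * log(p / q) / log(2)
    }'
}

# p = 8e-5, q = 1e-5, so the score is 8e-5 * log2(8) = 2.4e-4.
information_content 80 1000000 10 1000000   # prints 0.000240
```

Unlike a plain fold enrichment, this score also rewards peaks that contain more reads, so a strongly enriched peak with very few reads ranks lower.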
Raw Source Text
Raw reads were processed using the eCLIP pipeline.
Reads were then trimmed with cutadapt
Reads were then trimmed again with cutadapt to remove double-ligation events.
Trimmed and filtered reads were then mapped with STAR against a repeat element database
Unmapped reads filtered of repeat elements were then mapped with STAR against a human genome (GRCh37)
Aligned reads were sorted with samtools
Sorted reads were collapsed with umi_tools.
BAM files were used to identify peak clusters with Clipper.
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR to determine reproducible peaks.
Assembly: hg19
Supplementary files format and content: BigWig files contain RPM-normalized read densities
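The supplementary BigWig files hold RPM-normalized densities. RPM (reads per million) scales a raw count by the library size, as in this minimal sketch. The function name `rpm` and the example numbers are illustrative, and the source does not say which read total (for example, usable reads after deduplication) was used as the denominator.

```shell
# Reads per million: raw count scaled to a library of one million reads.
# Arguments: raw read count at a position or region, total mapped reads in the library.
rpm() {
    awk -v count="$1" -v total="$2" 'BEGIN { printf "%.2f\n", count * 1000000 / total }'
}

# 50 reads in a library of 2.5 million mapped reads.
rpm 50 2500000   # prints 20.00
```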