GSE180686 Processing Pipeline
Publication
Transcriptome-wide identification of RNA-binding protein binding sites using seCLIP-seq.Nature protocols (2022) — PMID 35322209
Dataset
GSE180686Transcriptome-wide identification of RNA binding protein binding sites using seCLIP-seq
Processing Steps
Generate Jupyter Notebook-
1
Raw reads were processed using the eCLIP pipeline v0.7.0
$ Bash example
# This script processes raw reads using the eCLIP pipeline v0.7.0. # The eCLIP pipeline is implemented as a CWL workflow. # Source: https://github.com/yeolab/eclip # --- Prerequisites --- # Ensure cwltool and Docker are installed. # pip install cwltool # apt-get install docker.io # or yum install docker # --- Download/Clone the eCLIP CWL workflow --- # If you haven't already, clone the eCLIP repository. # git clone https://github.com/yeolab/eclip.git # cd eclip # For a specific version (v0.7.0), you might need to checkout a specific commit # if v0.7.0 is not a tag. For this example, we assume the eclip.cwl is in the current directory. # You might need to adjust the path to eclip.cwl if it's not in the current directory. # --- Define Input Parameters --- # Create an input YAML file for the eCLIP CWL workflow. # Replace placeholder paths with your actual data and reference files. # Using hg38 as a placeholder reference genome. cat << EOF > eclip_inputs.yaml fastq_r1: class: File path: /path/to/your/sample_R1.fastq.gz # Path to your raw forward reads fastq_r2: class: File path: /path/to/your/sample_R2.fastq.gz # Path to your raw reverse reads (optional for SE) genome_fasta: class: File path: /path/to/your/reference/hg38.fa # Path to the reference genome FASTA (e.g., hg38) genome_gtf: class: File path: /path/to/your/reference/gencode.v38.annotation.gtf # Path to the reference genome GTF (e.g., Gencode v38) genome_chrom_sizes: class: File path: /path/to/your/reference/hg38.chrom.sizes # Path to chromosome sizes file genome_star_index: class: Directory path: /path/to/your/reference/STAR_index_hg38 # Path to pre-built STAR index for hg38 adapter_fasta: class: File path: /path/to/your/adapter_sequences.fa # Path to adapter sequences FASTA for trimming output_dir: "eclip_processing_output" # Directory for output files # Add any other specific parameters for v0.7.0 if known EOF # --- Execute the eCLIP CWL Workflow --- # The eclip.cwl file should be in the current directory or specified with its full path. cwltool --outdir eclip_processing_output eclip.cwl eclip_inputs.yaml
-
2
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).
$ Bash example
# Install UMI-tools (if not already installed) # conda install -c bioconda umi-tools=1.0.0 # Example command for reformatting reads to include UMIs in headers. # This assumes UMIs are at the beginning of Read 2 (e.g., 10bp long) and need to be moved to the header of both reads. # Adjust input/output filenames and --bc-pattern according to your specific data structure. umi_tools extract \ --input input_R1.fastq.gz \ --read2-in input_R2.fastq.gz \ --output-stat umi_extract_stats.tsv \ --stdout output_R1.fastq.gz \ --read2-out output_R2.fastq.gz \ --bc-pattern NNNNNNNNNN
-
3
Args: --random-seed 1 --bc-pattern NNNNNNNNNN
$ Bash example
# Install umi_tools # conda install -c bioconda umi_tools=1.1.2 # Example usage: Extract UMIs from FASTQ reads # Replace input.fastq.gz and output.fastq.gz with actual file paths umi_tools extract \ --random-seed 1 \ --bc-pattern NNNNNNNNNN \ -I input.fastq.gz \ -S output.fastq.gz
-
4
Reads were then trimmed with cutadapt (1.14).
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=1.14 # Example command for trimming adapter sequences and low-quality bases from paired-end reads. # Replace 'ADAPTER_SEQUENCE_R1' and 'ADAPTER_SEQUENCE_R2' with the actual adapter sequences used in your library preparation. # Replace 'input_R1.fastq.gz' and 'input_R2.fastq.gz' with your input FASTQ files. # The quality threshold (-q) and minimum length (--minimum-length) are common defaults and can be adjusted. cutadapt -a ADAPTER_SEQUENCE_R1 -A ADAPTER_SEQUENCE_R2 \ -q 20,20 --minimum-length 20 \ -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \ input_R1.fastq.gz input_R2.fastq.gz -
5
Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# Install cutadapt (example, comment out) # conda install -c bioconda cutadapt # Create a directory for reference data if it doesn't exist mkdir -p reference_data # Download adapter sequences from the YeoLab eCLIP repository wget -O reference_data/InvRNA1_Ril19.fasta https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA1/Ril19.fasta # Placeholder for input reads. Replace 'input_reads.fastq.gz' with your actual input file. # Example: wget -O input_reads.fastq.gz <URL_TO_YOUR_FASTQ_FILE> # Execute cutadapt for adapter trimming and quality filtering cutadapt \ --match-read-wildcards \ -O 1 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -a reference_data/InvRNA1_Ril19.fasta \ -o trimmed_reads.fastq.gz \ input_reads.fastq.gz
-
6
Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events.
$ Bash example
# Install cutadapt via conda # conda install -c bioconda cutadapt=1.14 # Define input and output file names INPUT_FASTQ="input.fastq.gz" OUTPUT_FASTQ="output.fastq.gz" # Define adapter sequences. # For "double-ligation events", this typically refers to removing the library adapter sequence. # The specific adapter sequence needs to be provided based on the library preparation kit used. # Example Illumina TruSeq 3' adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA # Example Illumina Nextera 3' adapter: CTGTCTCTTATACACATCTCCGAGCCCACGAGAC # For double-ligation, it might also involve trimming a 5' adapter or using -b for anywhere. # Here, we assume a common 3' adapter trimming scenario, which often addresses such events. ADAPTER_3PRIME="YOUR_3PRIME_ADAPTER_SEQUENCE" # Replace with actual adapter sequence # Minimum read length after trimming. Common to remove very short reads/adapter dimers. MIN_LENGTH=18 # Example value, adjust as needed based on expected insert size # Execute cutadapt to remove the 3' adapter and filter by length cutadapt -a "${ADAPTER_3PRIME}" \ --minimum-length "${MIN_LENGTH}" \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
7
Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
$ Bash example
# Install cutadapt (version 1.18 as used in YeoLab eCLIP pipeline) # conda create -n cutadapt_env cutadapt=1.18 # conda activate cutadapt_env # Download the adapter sequences specified in the description # These sequences are from the YeoLab eCLIP example inputs # Source: https://github.com/YeoLab/eclip/tree/master/example/inputs/InvRNA1/Ril19.fasta mkdir -p InvRNA1 wget -O InvRNA1/Ril19.fasta https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA1/Ril19.fasta # Placeholder for input FASTQ file. Replace 'input.fastq' with your actual input file. # Placeholder for output FASTQ file. Replace 'output.fastq' with your desired output file. # If processing paired-end reads, use -A for the second adapter and -p for the second output file. # Example: cutadapt -a ADAPTER1 -A ADAPTER2 -o output_R1.fastq -p output_R2.fastq input_R1.fastq input_R2.fastq cutadapt \ --match-read-wildcards \ -O 5 \ --times 1 \ -e 0.1 \ --quality-cutoff 6 \ -m 18 \ -a InvRNA1/Ril19.fasta \ -o output.fastq \ input.fastq
-
8
Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Placeholder for input files (replace with actual paths) # trimmed_filtered_R1.fastq.gz # trimmed_filtered_R2.fastq.gz # Placeholder for RepBase 18.05 STAR index directory. # This assumes a STAR index has been built from RepBase 18.05 sequences. # Example of how to build an index (usually done once): # mkdir repbase_18.05_index # STAR --runMode genomeGenerate \ # --genomeDir repbase_18.05_index \ # --genomeFastaFiles repbase_18.05.fasta \ # --genomeSAindexNbases 10 # Adjust based on sequence length, 10 is suitable for shorter sequences like repeats # Align trimmed and filtered reads to the RepBase 18.05 index STAR --runThreadN 8 \ --genomeDir repbase_18.05_index \ --readFilesIn trimmed_filtered_R1.fastq.gz trimmed_filtered_R2.fastq.gz \ --readFilesCommand zcat \ --outFileNamePrefix repbase_aligned_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes Standard \ --outFilterMultimapNmax 100 # Allow multiple mapping for reads mapping to repetitive elements -
9
Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Align reads using STAR # Note: '--genomeDir human_repbase' refers to a pre-built STAR index directory named 'human_repbase'. # Ensure this directory exists and contains the STAR genome index files. # Replace 'path/to/read1' with the actual path to your input FASTQ file. STAR --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
10
Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (GRCh38).
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.7.6a # Define variables STAR_VERSION="2.7.6a" GENOME_DIR="/path/to/STAR_index/GRCh38" # Placeholder for GRCh38 STAR index INPUT_FASTQ="filtered_unmapped_reads.fastq.gz" # Input reads after repeat filtering OUTPUT_PREFIX="aligned_GRCh38_" THREADS=8 # Example number of threads LIMIT_BAM_SORT_RAM="31000000000" # Example: 31GB RAM for sorting BAM # Ensure the genome directory exists and contains the STAR index. # You would typically generate this index once using STAR --runMode genomeGenerate. # Example command to generate index (run once): # STAR --runMode genomeGenerate \ # --genomeDir ${GENOME_DIR} \ # --genomeFastaFiles /path/to/GRCh38.fa \ # --sjdbGTFfile /path/to/GRCh38.gtf \ # --runThreadN ${THREADS} # Run STAR alignment STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${INPUT_FASTQ} \ --outFileNamePrefix ${OUTPUT_PREFIX} \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.04 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM ${LIMIT_BAM_SORT_RAM} \ --readFilesCommand zcat -
11
Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables (replace with actual paths and values) GENOME_DIR="/path/to/STAR_genome_index" # Directory containing STAR genome index files READ_FILES="/path/to/input_read1.fastq.gz" # Input FASTQ file(s). For paired-end, use "/path/to/read1.fastq.gz /path/to/read2.fastq.gz" OUT_PREFIX="aligned_reads" # Prefix for output files THREADS=16 # Number of threads to use # Run STAR alignment STAR --runThreadN "${THREADS}" \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ_FILES}" \ --outFileNamePrefix "${OUT_PREFIX}" \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline "ID:foo" \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd -
12
Aligned reads were sorted with samtools (1.6)
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.6 # Sort aligned reads # Assuming 'aligned_reads.bam' is the input aligned BAM file # And 'sorted_reads.bam' is the desired sorted output BAM file samtools sort -o sorted_reads.bam aligned_reads.bam
-
13
Sorted reads were collapsed with umi_tools (1.0.0).
$ Bash example
# Install umi_tools if not already installed # conda install -c bioconda umi_tools # Placeholder for input and output files # INPUT_BAM should be a sorted BAM file containing reads with UMIs. # UMIs are typically in the read ID or a custom tag (e.g., 'RX') added by a previous step (e.g., umi_tools extract). INPUT_BAM="sorted_reads.bam" OUTPUT_DEDUP_BAM="deduplicated_reads.bam" LOG_FILE="umi_tools_dedup.log" # Collapse (deduplicate) reads using umi_tools dedup. # The --method=directional is a common choice for RNA-seq and similar assays # to account for PCR duplicates while preserving biological signal. # Add --paired if the input reads are paired-end. umi_tools dedup \ --input=$INPUT_BAM \ --output=$OUTPUT_DEDUP_BAM \ --method=directional \ --log=$LOG_FILE # --paired # Uncomment if processing paired-end reads -
14
Args: --random-seed 1 --method unique
Custom script (Inferred with models/gemini-2.5-flash) vN/A$ Bash example
my_script --random-seed 1 --method unique
-
15
BAM files were used to identify peak clusters with Clipper (1.2.2).
$ Bash example
bash # Install CLIPper (assuming Python environment) # It's recommended to install CLIPper in a dedicated conda environment. # conda create -n clipper_env python=3.8 # conda activate clipper_env # pip install clipper==1.2.2 # Or clone the repo and run directly # git clone https://github.com/yeolab/clipper.git # cd clipper # python setup.py install # Placeholder for genome sizes file (e.g., hg38) # This file contains chromosome names and their lengths, crucial for peak calling. # Example: wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes GENOME_SIZES_FILE="hg38.chrom.sizes" # Replace with the actual path to your genome sizes file # Input BAM file(s) # The description implies one or more BAM files. For this example, we use a single placeholder. INPUT_BAM="input.bam" # Replace with the actual path to your input BAM file # Output prefix for peak clusters OUTPUT_PREFIX="clipper_peaks" # Run CLIPper (version 1.2.2) # Parameters like p-value (-p), window size (-w), and FDR (-f) are common for peak calling. # These are placeholders; actual values should be determined based on experimental design. python clipper.py \ -b "${INPUT_BAM}" \ -s "${GENOME_SIZES_FILE}" \ -o "${OUTPUT_PREFIX}" \ -p 0.001 \ -w 10 \ -f 0.05 # The output will typically be a BED file (e.g., clipper_peaks.bed) containing the identified peak clusters. -
16
Args: --species GRCh38_v29e --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Custom script/tool for gene-centric BAM processing (Inferred with models/gemini-2.5-flash) vUnknown$ Bash example
# This script processes a BAM file based on gene annotations, # potentially for gene-centric quantification or feature extraction. # Define input and output paths INPUT_BAM="path/to/input.bam" OUTPUT_BAM="path/to/output.bam" # Reference genome and annotation for GRCh38_v29e # GRCh38_v29e typically refers to the GRCh38 human genome assembly # and Gencode v29 annotation. # Example commands to download these reference files: # # # Download GRCh38 primary assembly FASTA (e.g., Ensembl release 92) # # wget -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # # gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz # # # Download Gencode v29 GTF annotation # # wget -O gencode.v29.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz # # gunzip gencode.v29.annotation.gtf.gz # Execute the inferred tool (placeholder name 'inferred_tool_script') # Ensure 'inferred_tool_script' is in your PATH or provide its full path. # If this is a Python script, it might be executed as 'python inferred_tool_script.py ...' # Assuming it's an executable script: inferred_tool_script \ --species GRCh38_v29e \ --bam "${INPUT_BAM}" \ --timeout 3600000 \ --maxgenes 1000000 \ --save-pickle \ --outfile "${OUTPUT_BAM}" -
17
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.
$ Bash example
# Install Perl if not already available # conda install -c bioconda perl # The script 'overlap_peakfi_with_bam.pl' is a custom utility often found within eCLIP pipeline setups. # It is mentioned as being included in 'eclip 0.1.5+'. # You might need to locate or download this specific script from your eCLIP pipeline environment. # Example: Assuming the script is in your PATH or current directory. # Define input and output files PEAK_CLUSTERS_FILE="peak_clusters.bed" IP_BAM_FILE="ip_sample.bam" INPUT_BAM_FILE="input_sample.bam" NORMALIZED_PEAK_CLUSTERS_FILE="normalized_peak_clusters.bed" # Execute the normalization script perl overlap_peakfi_with_bam.pl "${PEAK_CLUSTERS_FILE}" "${IP_BAM_FILE}" "${INPUT_BAM_FILE}" > "${NORMALIZED_PEAK_CLUSTERS_FILE}" -
18
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
$ Bash example
# Install Perl if not already available # conda install -c bioconda perl # Clone the merge_peaks repository to obtain the script # git clone https://github.com/yeolab/merge_peaks.git # cd merge_peaks # Define placeholder input and output file names. # These would be the normalized peak regions from individual replicates. # Replace with actual file paths from your pipeline. INPUT_PEAK_FILE_REP1="sample_replicate1_normalized_peaks.bed" INPUT_PEAK_FILE_REP2="sample_replicate2_normalized_peaks.bed" # Add more input files if there are more replicates, e.g.: # INPUT_PEAK_FILE_REP3="sample_replicate3_normalized_peaks.bed" OUTPUT_MERGED_PEAKS="merged_replicate_normalized_peaks.bed" # Create a file containing the list of input peak files. # This is a common way the script is used within the eCLIP pipeline (e.g., via run_idr.sh). echo "${INPUT_PEAK_FILE_REP1}" > input_peak_list.txt echo "${INPUT_PEAK_FILE_REP2}" >> input_peak_list.txt # If more replicates: # echo "${INPUT_PEAK_FILE_REP3}" >> input_peak_list.txt # Execute the Perl script to merge overlapping normalized peak regions. # Assuming the script 'compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl' # is in the current directory or its path is specified. # If cloned, it would typically be ./compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \ -f input_peak_list.txt \ -o "${OUTPUT_MERGED_PEAKS}" # Clean up the temporary file list # rm input_peak_list.txt -
19
Normalized peaks were filtered using a manually curated list blacklist file containing known artifacts/regions of nonspecific binding
$ Bash example
# Example: Install bedtools # conda install -c bioconda bedtools # Define input and output files INPUT_PEAKS="normalized_peaks.bed" # Placeholder for the input normalized peak file BLACKLIST_FILE="blacklist.bed" # Placeholder for the manually curated blacklist file (e.g., hg38_blacklist.bed) OUTPUT_PEAKS="filtered_peaks.bed" # Filter normalized peaks by removing any regions that overlap with the blacklist file # The '-v' option in bedtools intersect reports entries in A that *do not* overlap with any entries in B. bedtools intersect -v -a "${INPUT_PEAKS}" -b "${BLACKLIST_FILE}" > "${OUTPUT_PEAKS}" -
20
Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
$ Bash example
# Install IDR (Irreproducible Discovery Rate) tool # conda install -c bioconda idr=2.0.2 # Example input files (assuming they have been pre-ranked by entropy score # using make_informationcontent_from_peaks.pl and the score is in the 5th column, # typically 'signal.value' for narrowPeak files) REP1_PEAKS="rep1.entropy_ranked.narrowPeak" REP2_PEAKS="rep2.entropy_ranked.narrowPeak" OUTPUT_PREFIX="idr_reproducible_peaks" # Run IDR to determine reproducible peaks # --input-file-type narrowPeak: Specifies the input file format. # --rank-by signal.value: Ranks peaks based on the signal value (5th column), # which is assumed to contain the entropy score after pre-processing. # --output-file: Base name for the output reproducible peak file. # --plot: Generates diagnostic plots. # --log-output-file: Log file for IDR execution. idr --input-file-type narrowPeak \ --rank-by signal.value \ --output-file "${OUTPUT_PREFIX}.idr.out" \ --plot \ --log-output-file "${OUTPUT_PREFIX}.idr.log" \ "${REP1_PEAKS}" \ "${REP2_PEAKS}"
Raw Source Text
Raw reads were processed using the eCLIP pipeline v0.7.0 Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/) Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05). Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (GRCh38). Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd Aligned reads were sorted with samtools (1.6) Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species GRCh38_v29e --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+. Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+ Normalized peaks were filtered using a manually curated list blacklist file containing known artifacts/regions of nonspecific binding Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks. Genome_build: GRCh38 Supplementary_files_format_and_content: BigWig files contain RPM-normalized read densities