GSE180686 Processing Pipeline

RIP-Seq code_examples 20 steps

Publication

Transcriptome-wide identification of RNA-binding protein binding sites using seCLIP-seq.

Nature protocols (2022) — PMID 35322209

Dataset

Transcriptome-wide identification of RNA binding protein binding sites using seCLIP-seq

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Raw reads were processed using the eCLIP pipeline v0.7.0

eCLIP v0.7.0 GitHub

$ Bash example

# This script processes raw reads using the eCLIP pipeline v0.7.0.
# The eCLIP pipeline is implemented as a CWL workflow.
# Source: https://github.com/yeolab/eclip

# --- Prerequisites ---
# Ensure cwltool and Docker are installed.
# pip install cwltool
# apt-get install docker.io # or yum install docker

# --- Download/Clone the eCLIP CWL workflow ---
# If you haven't already, clone the eCLIP repository.
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# For a specific version (v0.7.0), you might need to checkout a specific commit
# if v0.7.0 is not a tag. For this example, we assume the eclip.cwl is in the current directory.
# You might need to adjust the path to eclip.cwl if it's not in the current directory.

# --- Define Input Parameters ---
# Create an input YAML file for the eCLIP CWL workflow.
# Replace placeholder paths with your actual data and reference files.
# Using hg38 as a placeholder reference genome.
cat << EOF > eclip_inputs.yaml
fastq_r1:
class: File
path: /path/to/your/sample_R1.fastq.gz # Path to your raw forward reads
fastq_r2:
class: File
path: /path/to/your/sample_R2.fastq.gz # Path to your raw reverse reads (optional for SE)
genome_fasta:
class: File
path: /path/to/your/reference/hg38.fa # Path to the reference genome FASTA (e.g., hg38)
genome_gtf:
class: File
path: /path/to/your/reference/gencode.v38.annotation.gtf # Path to the reference genome GTF (e.g., Gencode v38)
genome_chrom_sizes:
class: File
path: /path/to/your/reference/hg38.chrom.sizes # Path to chromosome sizes file
genome_star_index:
class: Directory
path: /path/to/your/reference/STAR_index_hg38 # Path to pre-built STAR index for hg38
adapter_fasta:
class: File
path: /path/to/your/adapter_sequences.fa # Path to adapter sequences FASTA for trimming
output_dir: "eclip_processing_output" # Directory for output files
# Add any other specific parameters for v0.7.0 if known
EOF

# --- Execute the eCLIP CWL Workflow ---
# The eclip.cwl file should be in the current directory or specified with its full path.
cwltool --outdir eclip_processing_output eclip.cwl eclip_inputs.yaml

View on GitHub

Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).

UMI-tools v1.0.0 GitHub

$ Bash example

# Install UMI-tools (if not already installed)
# conda install -c bioconda umi-tools=1.0.0

# Example command for reformatting reads to include UMIs in headers.
# This assumes UMIs are at the beginning of Read 2 (e.g., 10bp long) and need to be moved to the header of both reads.
# Adjust input/output filenames and --bc-pattern according to your specific data structure.

umi_tools extract \
  --input input_R1.fastq.gz \
  --read2-in input_R2.fastq.gz \
  --output-stat umi_extract_stats.tsv \
  --stdout output_R1.fastq.gz \
  --read2-out output_R2.fastq.gz \
  --bc-pattern NNNNNNNNNN

View on GitHub

Args: --random-seed 1 --bc-pattern NNNNNNNNNN

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools
# conda install -c bioconda umi_tools=1.1.2

# Example usage: Extract UMIs from FASTQ reads
# Replace input.fastq.gz and output.fastq.gz with actual file paths
umi_tools extract \
  --random-seed 1 \
  --bc-pattern NNNNNNNNNN \
  -I input.fastq.gz \
  -S output.fastq.gz

View on GitHub

Reads were then trimmed with cutadapt (1.14).

cutadapt v1.14 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.14

# Example command for trimming adapter sequences and low-quality bases from paired-end reads.
# Replace 'ADAPTER_SEQUENCE_R1' and 'ADAPTER_SEQUENCE_R2' with the actual adapter sequences used in your library preparation.
# Replace 'input_R1.fastq.gz' and 'input_R2.fastq.gz' with your input FASTQ files.
# The quality threshold (-q) and minimum length (--minimum-length) are common defaults and can be adjusted.
cutadapt -a ADAPTER_SEQUENCE_R1 -A ADAPTER_SEQUENCE_R2 \
         -q 20,20 --minimum-length 20 \
         -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
         input_R1.fastq.gz input_R2.fastq.gz

View on GitHub

Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

eCLIP v2.10 GitHub

$ Bash example

# Install cutadapt (example, comment out)
# conda install -c bioconda cutadapt

# Create a directory for reference data if it doesn't exist
mkdir -p reference_data

# Download adapter sequences from the YeoLab eCLIP repository
wget -O reference_data/InvRNA1_Ril19.fasta https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA1/Ril19.fasta

# Placeholder for input reads. Replace 'input_reads.fastq.gz' with your actual input file.
# Example: wget -O input_reads.fastq.gz <URL_TO_YOUR_FASTQ_FILE>

# Execute cutadapt for adapter trimming and quality filtering
cutadapt \
  --match-read-wildcards \
  -O 1 \
  --times 1 \
  -e 0.1 \
  --quality-cutoff 6 \
  -m 18 \
  -a reference_data/InvRNA1_Ril19.fasta \
  -o trimmed_reads.fastq.gz \
  input_reads.fastq.gz

View on GitHub

Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events.

cutadapt v1.14 GitHub

$ Bash example

# Install cutadapt via conda
# conda install -c bioconda cutadapt=1.14

# Define input and output file names
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="output.fastq.gz"

# Define adapter sequences.
# For "double-ligation events", this typically refers to removing the library adapter sequence.
# The specific adapter sequence needs to be provided based on the library preparation kit used.
# Example Illumina TruSeq 3' adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
# Example Illumina Nextera 3' adapter: CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
# For double-ligation, it might also involve trimming a 5' adapter or using -b for anywhere.
# Here, we assume a common 3' adapter trimming scenario, which often addresses such events.
ADAPTER_3PRIME="YOUR_3PRIME_ADAPTER_SEQUENCE" # Replace with actual adapter sequence

# Minimum read length after trimming. Common to remove very short reads/adapter dimers.
MIN_LENGTH=18 # Example value, adjust as needed based on expected insert size

# Execute cutadapt to remove the 3' adapter and filter by length
cutadapt -a "${ADAPTER_3PRIME}" \
         --minimum-length "${MIN_LENGTH}" \
         -o "${OUTPUT_FASTQ}" \
         "${INPUT_FASTQ}"

View on GitHub

Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

eCLIP v1.18 GitHub

$ Bash example

# Install cutadapt (version 1.18 as used in YeoLab eCLIP pipeline)
# conda create -n cutadapt_env cutadapt=1.18
# conda activate cutadapt_env

# Download the adapter sequences specified in the description
# These sequences are from the YeoLab eCLIP example inputs
# Source: https://github.com/YeoLab/eclip/tree/master/example/inputs/InvRNA1/Ril19.fasta
mkdir -p InvRNA1
wget -O InvRNA1/Ril19.fasta https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA1/Ril19.fasta

# Placeholder for input FASTQ file. Replace 'input.fastq' with your actual input file.
# Placeholder for output FASTQ file. Replace 'output.fastq' with your desired output file.
# If processing paired-end reads, use -A for the second adapter and -p for the second output file.
# Example: cutadapt -a ADAPTER1 -A ADAPTER2 -o output_R1.fastq -p output_R2.fastq input_R1.fastq input_R2.fastq

cutadapt \
  --match-read-wildcards \
  -O 5 \
  --times 1 \
  -e 0.1 \
  --quality-cutoff 6 \
  -m 18 \
  -a InvRNA1/Ril19.fasta \
  -o output.fastq \
  input.fastq

View on GitHub

Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05).

STAR v2.7.6a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for input files (replace with actual paths)
# trimmed_filtered_R1.fastq.gz
# trimmed_filtered_R2.fastq.gz

# Placeholder for RepBase 18.05 STAR index directory.
# This assumes a STAR index has been built from RepBase 18.05 sequences.
# Example of how to build an index (usually done once):
# mkdir repbase_18.05_index
# STAR --runMode genomeGenerate \
#      --genomeDir repbase_18.05_index \
#      --genomeFastaFiles repbase_18.05.fasta \
#      --genomeSAindexNbases 10 # Adjust based on sequence length, 10 is suitable for shorter sequences like repeats

# Align trimmed and filtered reads to the RepBase 18.05 index
STAR --runThreadN 8 \
     --genomeDir repbase_18.05_index \
     --readFilesIn trimmed_filtered_R1.fastq.gz trimmed_filtered_R2.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix repbase_aligned_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes Standard \
     --outFilterMultimapNmax 100 # Allow multiple mapping for reads mapping to repetitive elements

View on GitHub

Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Align reads using STAR
# Note: '--genomeDir human_repbase' refers to a pre-built STAR index directory named 'human_repbase'.
# Ensure this directory exists and contains the STAR genome index files.
# Replace 'path/to/read1' with the actual path to your input FASTQ file.
STAR --runThreadN 16 \
     --genomeDir human_repbase \
     --readFilesIn path/to/read1 \
     --outFileNamePrefix out_prefix \
     --outReadsUnmapped Fastx \
     --outSAMtype BAM Unsorted \
     --outSAMattributes All \
     --outSAMunmapped Within \
     --outSAMattrRGline ID:foo \
     --outFilterType BySJout \
     --outFilterMultimapNmax 30 \
     --outFilterMultimapScoreRange 1 \
     --outFilterScoreMin 10 \
     --alignEndsType EndToEnd

View on GitHub

Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (GRCh38).

STAR v2.7.6a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.6a

# Define variables
STAR_VERSION="2.7.6a"
GENOME_DIR="/path/to/STAR_index/GRCh38" # Placeholder for GRCh38 STAR index
INPUT_FASTQ="filtered_unmapped_reads.fastq.gz" # Input reads after repeat filtering
OUTPUT_PREFIX="aligned_GRCh38_"
THREADS=8 # Example number of threads
LIMIT_BAM_SORT_RAM="31000000000" # Example: 31GB RAM for sorting BAM

# Ensure the genome directory exists and contains the STAR index.
# You would typically generate this index once using STAR --runMode genomeGenerate.
# Example command to generate index (run once):
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/GRCh38.fa \
#      --sjdbGTFfile /path/to/GRCh38.gtf \
#      --runThreadN ${THREADS}

# Run STAR alignment
STAR --runThreadN ${THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${INPUT_FASTQ} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM ${LIMIT_BAM_SORT_RAM} \
     --readFilesCommand zcat

View on GitHub

Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables (replace with actual paths and values)
GENOME_DIR="/path/to/STAR_genome_index" # Directory containing STAR genome index files
READ_FILES="/path/to/input_read1.fastq.gz" # Input FASTQ file(s). For paired-end, use "/path/to/read1.fastq.gz /path/to/read2.fastq.gz"
OUT_PREFIX="aligned_reads" # Prefix for output files
THREADS=16 # Number of threads to use

# Run STAR alignment
STAR --runThreadN "${THREADS}" \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ_FILES}" \
     --outFileNamePrefix "${OUT_PREFIX}" \
     --outReadsUnmapped Fastx \
     --outSAMtype BAM Unsorted \
     --outSAMattributes All \
     --outSAMunmapped Within \
     --outSAMattrRGline "ID:foo" \
     --outFilterType BySJout \
     --outFilterMultimapNmax 1 \
     --outFilterMultimapScoreRange 1 \
     --outFilterScoreMin 10 \
     --alignEndsType EndToEnd

View on GitHub

Aligned reads were sorted with samtools (1.6)

samtools v1.6 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.6

# Sort aligned reads
# Assuming 'aligned_reads.bam' is the input aligned BAM file
# And 'sorted_reads.bam' is the desired sorted output BAM file
samtools sort -o sorted_reads.bam aligned_reads.bam

View on GitHub

Sorted reads were collapsed with umi_tools (1.0.0).

UMI-tools v1.0.0 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# Placeholder for input and output files
# INPUT_BAM should be a sorted BAM file containing reads with UMIs.
# UMIs are typically in the read ID or a custom tag (e.g., 'RX') added by a previous step (e.g., umi_tools extract).
INPUT_BAM="sorted_reads.bam"
OUTPUT_DEDUP_BAM="deduplicated_reads.bam"
LOG_FILE="umi_tools_dedup.log"

# Collapse (deduplicate) reads using umi_tools dedup.
# The --method=directional is a common choice for RNA-seq and similar assays
# to account for PCR duplicates while preserving biological signal.
# Add --paired if the input reads are paired-end.
umi_tools dedup \
    --input=$INPUT_BAM \
    --output=$OUTPUT_DEDUP_BAM \
    --method=directional \
    --log=$LOG_FILE
    # --paired # Uncomment if processing paired-end reads

View on GitHub

14
Args: --random-seed 1 --method unique

Custom script (Inferred with models/gemini-2.5-flash) vN/A
$ Bash example
```
my_script --random-seed 1 --method unique
```

BAM files were used to identify peak clusters with Clipper (1.2.2).

CLIPper v1.2.2 GitHub

$ Bash example

bash
# Install CLIPper (assuming Python environment)
# It's recommended to install CLIPper in a dedicated conda environment.
# conda create -n clipper_env python=3.8
# conda activate clipper_env
# pip install clipper==1.2.2 # Or clone the repo and run directly
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install

# Placeholder for genome sizes file (e.g., hg38)
# This file contains chromosome names and their lengths, crucial for peak calling.
# Example: wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
GENOME_SIZES_FILE="hg38.chrom.sizes" # Replace with the actual path to your genome sizes file

# Input BAM file(s)
# The description implies one or more BAM files. For this example, we use a single placeholder.
INPUT_BAM="input.bam" # Replace with the actual path to your input BAM file

# Output prefix for peak clusters
OUTPUT_PREFIX="clipper_peaks"

# Run CLIPper (version 1.2.2)
# Parameters like p-value (-p), window size (-w), and FDR (-f) are common for peak calling.
# These are placeholders; actual values should be determined based on experimental design.
python clipper.py \
    -b "${INPUT_BAM}" \
    -s "${GENOME_SIZES_FILE}" \
    -o "${OUTPUT_PREFIX}" \
    -p 0.001 \
    -w 10 \
    -f 0.05

# The output will typically be a BED file (e.g., clipper_peaks.bed) containing the identified peak clusters.

View on GitHub

Args: --species GRCh38_v29e --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

Custom script/tool for gene-centric BAM processing (Inferred with models/gemini-2.5-flash) vUnknown

$ Bash example

# This script processes a BAM file based on gene annotations,
# potentially for gene-centric quantification or feature extraction.

# Define input and output paths
INPUT_BAM="path/to/input.bam"
OUTPUT_BAM="path/to/output.bam"

# Reference genome and annotation for GRCh38_v29e
# GRCh38_v29e typically refers to the GRCh38 human genome assembly
# and Gencode v29 annotation.
# Example commands to download these reference files:
#
# # Download GRCh38 primary assembly FASTA (e.g., Ensembl release 92)
# # wget -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
# # gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
#
# # Download Gencode v29 GTF annotation
# # wget -O gencode.v29.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz
# # gunzip gencode.v29.annotation.gtf.gz

# Execute the inferred tool (placeholder name 'inferred_tool_script')
# Ensure 'inferred_tool_script' is in your PATH or provide its full path.
# If this is a Python script, it might be executed as 'python inferred_tool_script.py ...'
# Assuming it's an executable script:
inferred_tool_script \
  --species GRCh38_v29e \
  --bam "${INPUT_BAM}" \
  --timeout 3600000 \
  --maxgenes 1000000 \
  --save-pickle \
  --outfile "${OUTPUT_BAM}"

Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.

eCLIP v0.1.5 GitHub

$ Bash example

# Install Perl if not already available
# conda install -c bioconda perl

# The script 'overlap_peakfi_with_bam.pl' is a custom utility often found within eCLIP pipeline setups.
# It is mentioned as being included in 'eclip 0.1.5+'.
# You might need to locate or download this specific script from your eCLIP pipeline environment.
# Example: Assuming the script is in your PATH or current directory.

# Define input and output files
PEAK_CLUSTERS_FILE="peak_clusters.bed"
IP_BAM_FILE="ip_sample.bam"
INPUT_BAM_FILE="input_sample.bam"
NORMALIZED_PEAK_CLUSTERS_FILE="normalized_peak_clusters.bed"

# Execute the normalization script
perl overlap_peakfi_with_bam.pl "${PEAK_CLUSTERS_FILE}" "${IP_BAM_FILE}" "${INPUT_BAM_FILE}" > "${NORMALIZED_PEAK_CLUSTERS_FILE}"

View on GitHub

Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

eCLIP v0.1.5 GitHub

$ Bash example

# Install Perl if not already available
# conda install -c bioconda perl

# Clone the merge_peaks repository to obtain the script
# git clone https://github.com/yeolab/merge_peaks.git
# cd merge_peaks

# Define placeholder input and output file names.
# These would be the normalized peak regions from individual replicates.
# Replace with actual file paths from your pipeline.
INPUT_PEAK_FILE_REP1="sample_replicate1_normalized_peaks.bed"
INPUT_PEAK_FILE_REP2="sample_replicate2_normalized_peaks.bed"
# Add more input files if there are more replicates, e.g.:
# INPUT_PEAK_FILE_REP3="sample_replicate3_normalized_peaks.bed"

OUTPUT_MERGED_PEAKS="merged_replicate_normalized_peaks.bed"

# Create a file containing the list of input peak files.
# This is a common way the script is used within the eCLIP pipeline (e.g., via run_idr.sh).
echo "${INPUT_PEAK_FILE_REP1}" > input_peak_list.txt
echo "${INPUT_PEAK_FILE_REP2}" >> input_peak_list.txt
# If more replicates:
# echo "${INPUT_PEAK_FILE_REP3}" >> input_peak_list.txt

# Execute the Perl script to merge overlapping normalized peak regions.
# Assuming the script 'compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl'
# is in the current directory or its path is specified.
# If cloned, it would typically be ./compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
    -f input_peak_list.txt \
    -o "${OUTPUT_MERGED_PEAKS}"

# Clean up the temporary file list
# rm input_peak_list.txt

View on GitHub

Normalized peaks were filtered using a manually curated list blacklist file containing known artifacts/regions of nonspecific binding

bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 GitHub

$ Bash example

# Example: Install bedtools
# conda install -c bioconda bedtools

# Define input and output files
INPUT_PEAKS="normalized_peaks.bed" # Placeholder for the input normalized peak file
BLACKLIST_FILE="blacklist.bed" # Placeholder for the manually curated blacklist file (e.g., hg38_blacklist.bed)
OUTPUT_PEAKS="filtered_peaks.bed"

# Filter normalized peaks by removing any regions that overlap with the blacklist file
# The '-v' option in bedtools intersect reports entries in A that *do not* overlap with any entries in B.
bedtools intersect -v -a "${INPUT_PEAKS}" -b "${BLACKLIST_FILE}" > "${OUTPUT_PEAKS}"

View on GitHub

Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

IDR v2.0.2 GitHub

$ Bash example

# Install IDR (Irreproducible Discovery Rate) tool
# conda install -c bioconda idr=2.0.2

# Example input files (assuming they have been pre-ranked by entropy score
# using make_informationcontent_from_peaks.pl and the score is in the 5th column,
# typically 'signal.value' for narrowPeak files)
REP1_PEAKS="rep1.entropy_ranked.narrowPeak"
REP2_PEAKS="rep2.entropy_ranked.narrowPeak"
OUTPUT_PREFIX="idr_reproducible_peaks"

# Run IDR to determine reproducible peaks
# --input-file-type narrowPeak: Specifies the input file format.
# --rank-by signal.value: Ranks peaks based on the signal value (5th column),
#                         which is assumed to contain the entropy score after pre-processing.
# --output-file: Base name for the output reproducible peak file.
# --plot: Generates diagnostic plots.
# --log-output-file: Log file for IDR execution.
idr --input-file-type narrowPeak \
    --rank-by signal.value \
    --output-file "${OUTPUT_PREFIX}.idr.out" \
    --plot \
    --log-output-file "${OUTPUT_PREFIX}.idr.log" \
    "${REP1_PEAKS}" \
    "${REP2_PEAKS}"

View on GitHub

Tools Used

eCLIP STAR

Raw Source Text

Raw reads were processed using the eCLIP pipeline v0.7.0
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN
Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18  -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir human_repbase \  --readFilesIn path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (GRCh38). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.6)
Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique
BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species GRCh38_v29e --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peaks were filtered using a manually curated list blacklist file containing known artifacts/regions of nonspecific binding
Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Genome_build: GRCh38
Supplementary_files_format_and_content: BigWig files contain RPM-normalized read densities

← Back to Analysis