GSE180686 Processing Pipeline

RIP-Seq code_examples 20 steps

Publication

Transcriptome-wide identification of RNA-binding protein binding sites using seCLIP-seq.

Nature protocols (2022) — PMID 35322209

Dataset

GSE180686

Transcriptome-wide identification of RNA binding protein binding sites using seCLIP-seq

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Raw reads were processed using the eCLIP pipeline v0.7.0

    $ Bash example
    # This script processes raw reads using the eCLIP pipeline v0.7.0.
    # The eCLIP pipeline is implemented as a CWL workflow.
    # Source: https://github.com/yeolab/eclip
    
    # --- Prerequisites ---
    # Ensure cwltool and Docker are installed.
    # pip install cwltool
    # apt-get install docker.io # or yum install docker
    
    # --- Download/Clone the eCLIP CWL workflow ---
    # If you haven't already, clone the eCLIP repository.
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # For a specific version (v0.7.0), you might need to checkout a specific commit
    # if v0.7.0 is not a tag. For this example, we assume the eclip.cwl is in the current directory.
    # You might need to adjust the path to eclip.cwl if it's not in the current directory.
    
    # --- Define Input Parameters ---
    # Create an input YAML file for the eCLIP CWL workflow.
    # Replace placeholder paths with your actual data and reference files.
    # Using hg38 as a placeholder reference genome.
    cat << EOF > eclip_inputs.yaml
    fastq_r1:
      class: File
      path: /path/to/your/sample_R1.fastq.gz # Path to your raw forward reads
    fastq_r2:
      class: File
      path: /path/to/your/sample_R2.fastq.gz # Path to your raw reverse reads (optional for SE)
    genome_fasta:
      class: File
      path: /path/to/your/reference/hg38.fa # Path to the reference genome FASTA (e.g., hg38)
    genome_gtf:
      class: File
      path: /path/to/your/reference/gencode.v38.annotation.gtf # Path to the reference genome GTF (e.g., Gencode v38)
    genome_chrom_sizes:
      class: File
      path: /path/to/your/reference/hg38.chrom.sizes # Path to chromosome sizes file
    genome_star_index:
      class: Directory
      path: /path/to/your/reference/STAR_index_hg38 # Path to pre-built STAR index for hg38
    adapter_fasta:
      class: File
      path: /path/to/your/adapter_sequences.fa # Path to adapter sequences FASTA for trimming
    output_dir: "eclip_processing_output" # Directory for output files
    # Add any other specific parameters for v0.7.0 if known
    EOF
    
    # --- Execute the eCLIP CWL Workflow ---
    # The eclip.cwl file should be in the current directory or specified with its full path.
    cwltool --outdir eclip_processing_output eclip.cwl eclip_inputs.yaml
  2. 2

    Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).

    UMI-tools v1.0.0 GitHub
    $ Bash example
    # Install UMI-tools (if not already installed)
    # conda install -c bioconda umi-tools=1.0.0
    
    # Example command for reformatting reads to include UMIs in headers.
    # This assumes UMIs are at the beginning of Read 2 (e.g., 10bp long) and need to be moved to the header of both reads.
    # Adjust input/output filenames and --bc-pattern according to your specific data structure.
    
    umi_tools extract \
      --input input_R1.fastq.gz \
      --read2-in input_R2.fastq.gz \
      --output-stat umi_extract_stats.tsv \
      --stdout output_R1.fastq.gz \
      --read2-out output_R2.fastq.gz \
      --bc-pattern NNNNNNNNNN
  3. 3

    Args: --random-seed 1 --bc-pattern NNNNNNNNNN

    umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub
    $ Bash example
    # Install umi_tools
    # conda install -c bioconda umi_tools=1.1.2
    
    # Example usage: Extract UMIs from FASTQ reads
    # Replace input.fastq.gz and output.fastq.gz with actual file paths
    umi_tools extract \
      --random-seed 1 \
      --bc-pattern NNNNNNNNNN \
      -I input.fastq.gz \
      -S output.fastq.gz
  4. 4

    Reads were then trimmed with cutadapt (1.14).

    cutadapt v1.14 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=1.14
    
    # Example command for trimming adapter sequences and low-quality bases from paired-end reads.
    # Replace 'ADAPTER_SEQUENCE_R1' and 'ADAPTER_SEQUENCE_R2' with the actual adapter sequences used in your library preparation.
    # Replace 'input_R1.fastq.gz' and 'input_R2.fastq.gz' with your input FASTQ files.
    # The quality threshold (-q) and minimum length (--minimum-length) are common defaults and can be adjusted.
    cutadapt -a ADAPTER_SEQUENCE_R1 -A ADAPTER_SEQUENCE_R2 \
             -q 20,20 --minimum-length 20 \
             -o trimmed_R1.fastq.gz -p trimmed_R2.fastq.gz \
             input_R1.fastq.gz input_R2.fastq.gz
  5. 5

    Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

    $ Bash example
    # Install cutadapt (example, comment out)
    # conda install -c bioconda cutadapt
    
    # Create a directory for reference data if it doesn't exist
    mkdir -p reference_data
    
    # Download adapter sequences from the YeoLab eCLIP repository
    wget -O reference_data/InvRNA1_Ril19.fasta https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA1/Ril19.fasta
    
    # Placeholder for input reads. Replace 'input_reads.fastq.gz' with your actual input file.
    # Example: wget -O input_reads.fastq.gz <URL_TO_YOUR_FASTQ_FILE>
    
    # Execute cutadapt for adapter trimming and quality filtering
    cutadapt \
      --match-read-wildcards \
      -O 1 \
      --times 1 \
      -e 0.1 \
      --quality-cutoff 6 \
      -m 18 \
      -a reference_data/InvRNA1_Ril19.fasta \
      -o trimmed_reads.fastq.gz \
      input_reads.fastq.gz
  6. 6

    Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events.

    cutadapt v1.14 GitHub
    $ Bash example
    # Install cutadapt via conda
    # conda install -c bioconda cutadapt=1.14
    
    # Define input and output file names
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="output.fastq.gz"
    
    # Define adapter sequences.
    # For "double-ligation events", this typically refers to removing the library adapter sequence.
    # The specific adapter sequence needs to be provided based on the library preparation kit used.
    # Example Illumina TruSeq 3' adapter: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
    # Example Illumina Nextera 3' adapter: CTGTCTCTTATACACATCTCCGAGCCCACGAGAC
    # For double-ligation, it might also involve trimming a 5' adapter or using -b for anywhere.
    # Here, we assume a common 3' adapter trimming scenario, which often addresses such events.
    ADAPTER_3PRIME="YOUR_3PRIME_ADAPTER_SEQUENCE" # Replace with actual adapter sequence
    
    # Minimum read length after trimming. Common to remove very short reads/adapter dimers.
    MIN_LENGTH=18 # Example value, adjust as needed based on expected insert size
    
    # Execute cutadapt to remove the 3' adapter and filter by length
    cutadapt -a "${ADAPTER_3PRIME}" \
             --minimum-length "${MIN_LENGTH}" \
             -o "${OUTPUT_FASTQ}" \
             "${INPUT_FASTQ}"
  7. 7

    Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

    $ Bash example
    # Install cutadapt (version 1.18 as used in YeoLab eCLIP pipeline)
    # conda create -n cutadapt_env cutadapt=1.18
    # conda activate cutadapt_env
    
    # Download the adapter sequences specified in the description
    # These sequences are from the YeoLab eCLIP example inputs
    # Source: https://github.com/YeoLab/eclip/tree/master/example/inputs/InvRNA1/Ril19.fasta
    mkdir -p InvRNA1
    wget -O InvRNA1/Ril19.fasta https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA1/Ril19.fasta
    
    # Placeholder for input FASTQ file. Replace 'input.fastq' with your actual input file.
    # Placeholder for output FASTQ file. Replace 'output.fastq' with your desired output file.
    # If processing paired-end reads, use -A for the second adapter and -p for the second output file.
    # Example: cutadapt -a ADAPTER1 -A ADAPTER2 -o output_R1.fastq -p output_R2.fastq input_R1.fastq input_R2.fastq
    
    cutadapt \
      --match-read-wildcards \
      -O 5 \
      --times 1 \
      -e 0.1 \
      --quality-cutoff 6 \
      -m 18 \
      -a InvRNA1/Ril19.fasta \
      -o output.fastq \
      input.fastq
  8. 8

    Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05).

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Placeholder for input files (replace with actual paths)
    # trimmed_filtered_R1.fastq.gz
    # trimmed_filtered_R2.fastq.gz
    
    # Placeholder for RepBase 18.05 STAR index directory.
    # This assumes a STAR index has been built from RepBase 18.05 sequences.
    # Example of how to build an index (usually done once):
    # mkdir repbase_18.05_index
    # STAR --runMode genomeGenerate \
    #      --genomeDir repbase_18.05_index \
    #      --genomeFastaFiles repbase_18.05.fasta \
    #      --genomeSAindexNbases 10 # Adjust based on sequence length, 10 is suitable for shorter sequences like repeats
    
    # Align trimmed and filtered reads to the RepBase 18.05 index
    STAR --runThreadN 8 \
         --genomeDir repbase_18.05_index \
         --readFilesIn trimmed_filtered_R1.fastq.gz trimmed_filtered_R2.fastq.gz \
         --readFilesCommand zcat \
         --outFileNamePrefix repbase_aligned_ \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes Standard \
         --outFilterMultimapNmax 100 # Allow multiple mapping for reads mapping to repetitive elements
    
  9. 9

    Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Align reads using STAR
    # Note: '--genomeDir human_repbase' refers to a pre-built STAR index directory named 'human_repbase'.
    # Ensure this directory exists and contains the STAR genome index files.
    # Replace 'path/to/read1' with the actual path to your input FASTQ file.
    STAR --runThreadN 16 \
         --genomeDir human_repbase \
         --readFilesIn path/to/read1 \
         --outFileNamePrefix out_prefix \
         --outReadsUnmapped Fastx \
         --outSAMtype BAM Unsorted \
         --outSAMattributes All \
         --outSAMunmapped Within \
         --outSAMattrRGline ID:foo \
         --outFilterType BySJout \
         --outFilterMultimapNmax 30 \
         --outFilterMultimapScoreRange 1 \
         --outFilterScoreMin 10 \
         --alignEndsType EndToEnd
  10. 10

    Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (GRCh38).

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.7.6a
    
    # Define variables
    STAR_VERSION="2.7.6a"
    GENOME_DIR="/path/to/STAR_index/GRCh38" # Placeholder for GRCh38 STAR index
    INPUT_FASTQ="filtered_unmapped_reads.fastq.gz" # Input reads after repeat filtering
    OUTPUT_PREFIX="aligned_GRCh38_"
    THREADS=8 # Example number of threads
    LIMIT_BAM_SORT_RAM="31000000000" # Example: 31GB RAM for sorting BAM
    
    # Ensure the genome directory exists and contains the STAR index.
    # You would typically generate this index once using STAR --runMode genomeGenerate.
    # Example command to generate index (run once):
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/GRCh38.fa \
    #      --sjdbGTFfile /path/to/GRCh38.gtf \
    #      --runThreadN ${THREADS}
    
    # Run STAR alignment
    STAR --runThreadN ${THREADS} \
         --genomeDir ${GENOME_DIR} \
         --readFilesIn ${INPUT_FASTQ} \
         --outFileNamePrefix ${OUTPUT_PREFIX} \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped Within \
         --outFilterMultimapNmax 20 \
         --outFilterMismatchNmax 999 \
         --outFilterMismatchNoverLmax 0.04 \
         --alignIntronMin 20 \
         --alignIntronMax 1000000 \
         --alignMatesGapMax 1000000 \
         --limitBAMsortRAM ${LIMIT_BAM_SORT_RAM} \
         --readFilesCommand zcat
  11. 11

    Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables (replace with actual paths and values)
    GENOME_DIR="/path/to/STAR_genome_index" # Directory containing STAR genome index files
    READ_FILES="/path/to/input_read1.fastq.gz" # Input FASTQ file(s). For paired-end, use "/path/to/read1.fastq.gz /path/to/read2.fastq.gz"
    OUT_PREFIX="aligned_reads" # Prefix for output files
    THREADS=16 # Number of threads to use
    
    # Run STAR alignment
    STAR --runThreadN "${THREADS}" \
         --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ_FILES}" \
         --outFileNamePrefix "${OUT_PREFIX}" \
         --outReadsUnmapped Fastx \
         --outSAMtype BAM Unsorted \
         --outSAMattributes All \
         --outSAMunmapped Within \
         --outSAMattrRGline "ID:foo" \
         --outFilterType BySJout \
         --outFilterMultimapNmax 1 \
         --outFilterMultimapScoreRange 1 \
         --outFilterScoreMin 10 \
         --alignEndsType EndToEnd
  12. 12

    Aligned reads were sorted with samtools (1.6)

    samtools v1.6 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools=1.6
    
    # Sort aligned reads
    # Assuming 'aligned_reads.bam' is the input aligned BAM file
    # And 'sorted_reads.bam' is the desired sorted output BAM file
    samtools sort -o sorted_reads.bam aligned_reads.bam
  13. 13

    Sorted reads were collapsed with umi_tools (1.0.0).

    UMI-tools v1.0.0 GitHub
    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi_tools
    
    # Placeholder for input and output files
    # INPUT_BAM should be a sorted BAM file containing reads with UMIs.
    # UMIs are typically in the read ID or a custom tag (e.g., 'RX') added by a previous step (e.g., umi_tools extract).
    INPUT_BAM="sorted_reads.bam"
    OUTPUT_DEDUP_BAM="deduplicated_reads.bam"
    LOG_FILE="umi_tools_dedup.log"
    
    # Collapse (deduplicate) reads using umi_tools dedup.
    # The --method=directional is a common choice for RNA-seq and similar assays
    # to account for PCR duplicates while preserving biological signal.
    # Add --paired if the input reads are paired-end.
    umi_tools dedup \
        --input=$INPUT_BAM \
        --output=$OUTPUT_DEDUP_BAM \
        --method=directional \
        --log=$LOG_FILE
        # --paired # Uncomment if processing paired-end reads
  14. 14

    Args: --random-seed 1 --method unique

    Custom script (Inferred with models/gemini-2.5-flash) vN/A
    $ Bash example
    my_script --random-seed 1 --method unique
  15. 15

    BAM files were used to identify peak clusters with Clipper (1.2.2).

    CLIPper v1.2.2 GitHub
    $ Bash example
    bash
    # Install CLIPper (assuming Python environment)
    # It's recommended to install CLIPper in a dedicated conda environment.
    # conda create -n clipper_env python=3.8
    # conda activate clipper_env
    # pip install clipper==1.2.2 # Or clone the repo and run directly
    # git clone https://github.com/yeolab/clipper.git
    # cd clipper
    # python setup.py install
    
    # Placeholder for genome sizes file (e.g., hg38)
    # This file contains chromosome names and their lengths, crucial for peak calling.
    # Example: wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
    GENOME_SIZES_FILE="hg38.chrom.sizes" # Replace with the actual path to your genome sizes file
    
    # Input BAM file(s)
    # The description implies one or more BAM files. For this example, we use a single placeholder.
    INPUT_BAM="input.bam" # Replace with the actual path to your input BAM file
    
    # Output prefix for peak clusters
    OUTPUT_PREFIX="clipper_peaks"
    
    # Run CLIPper (version 1.2.2)
    # Parameters like p-value (-p), window size (-w), and FDR (-f) are common for peak calling.
    # These are placeholders; actual values should be determined based on experimental design.
    python clipper.py \
        -b "${INPUT_BAM}" \
        -s "${GENOME_SIZES_FILE}" \
        -o "${OUTPUT_PREFIX}" \
        -p 0.001 \
        -w 10 \
        -f 0.05
    
    # The output will typically be a BED file (e.g., clipper_peaks.bed) containing the identified peak clusters.
    
  16. 16

    Args: --species GRCh38_v29e --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

    Custom script/tool for gene-centric BAM processing (Inferred with models/gemini-2.5-flash) vUnknown
    $ Bash example
    # This script processes a BAM file based on gene annotations,
    # potentially for gene-centric quantification or feature extraction.
    
    # Define input and output paths
    INPUT_BAM="path/to/input.bam"
    OUTPUT_BAM="path/to/output.bam"
    
    # Reference genome and annotation for GRCh38_v29e
    # GRCh38_v29e typically refers to the GRCh38 human genome assembly
    # and Gencode v29 annotation.
    # Example commands to download these reference files:
    #
    # # Download GRCh38 primary assembly FASTA (e.g., Ensembl release 92)
    # # wget -O Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz ftp://ftp.ensembl.org/pub/release-92/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    # # gunzip Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz
    #
    # # Download Gencode v29 GTF annotation
    # # wget -O gencode.v29.annotation.gtf.gz ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_29/gencode.v29.annotation.gtf.gz
    # # gunzip gencode.v29.annotation.gtf.gz
    
    # Execute the inferred tool (placeholder name 'inferred_tool_script')
    # Ensure 'inferred_tool_script' is in your PATH or provide its full path.
    # If this is a Python script, it might be executed as 'python inferred_tool_script.py ...'
    # Assuming it's an executable script:
    inferred_tool_script \
      --species GRCh38_v29e \
      --bam "${INPUT_BAM}" \
      --timeout 3600000 \
      --maxgenes 1000000 \
      --save-pickle \
      --outfile "${OUTPUT_BAM}"
  17. 17

    Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.

    $ Bash example
    # Install Perl if not already available
    # conda install -c bioconda perl
    
    # The script 'overlap_peakfi_with_bam.pl' is a custom utility often found within eCLIP pipeline setups.
    # It is mentioned as being included in 'eclip 0.1.5+'.
    # You might need to locate or download this specific script from your eCLIP pipeline environment.
    # Example: Assuming the script is in your PATH or current directory.
    
    # Define input and output files
    PEAK_CLUSTERS_FILE="peak_clusters.bed"
    IP_BAM_FILE="ip_sample.bam"
    INPUT_BAM_FILE="input_sample.bam"
    NORMALIZED_PEAK_CLUSTERS_FILE="normalized_peak_clusters.bed"
    
    # Execute the normalization script
    perl overlap_peakfi_with_bam.pl "${PEAK_CLUSTERS_FILE}" "${IP_BAM_FILE}" "${INPUT_BAM_FILE}" > "${NORMALIZED_PEAK_CLUSTERS_FILE}"
  18. 18

    Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

    $ Bash example
    # Install Perl if not already available
    # conda install -c bioconda perl
    
    # Clone the merge_peaks repository to obtain the script
    # git clone https://github.com/yeolab/merge_peaks.git
    # cd merge_peaks
    
    # Define placeholder input and output file names.
    # These would be the normalized peak regions from individual replicates.
    # Replace with actual file paths from your pipeline.
    INPUT_PEAK_FILE_REP1="sample_replicate1_normalized_peaks.bed"
    INPUT_PEAK_FILE_REP2="sample_replicate2_normalized_peaks.bed"
    # Add more input files if there are more replicates, e.g.:
    # INPUT_PEAK_FILE_REP3="sample_replicate3_normalized_peaks.bed"
    
    OUTPUT_MERGED_PEAKS="merged_replicate_normalized_peaks.bed"
    
    # Create a file containing the list of input peak files.
    # This is a common way the script is used within the eCLIP pipeline (e.g., via run_idr.sh).
    echo "${INPUT_PEAK_FILE_REP1}" > input_peak_list.txt
    echo "${INPUT_PEAK_FILE_REP2}" >> input_peak_list.txt
    # If more replicates:
    # echo "${INPUT_PEAK_FILE_REP3}" >> input_peak_list.txt
    
    # Execute the Perl script to merge overlapping normalized peak regions.
    # Assuming the script 'compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl'
    # is in the current directory or its path is specified.
    # If cloned, it would typically be ./compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
    perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
        -f input_peak_list.txt \
        -o "${OUTPUT_MERGED_PEAKS}"
    
    # Clean up the temporary file list
    # rm input_peak_list.txt
  19. 19

    Normalized peaks were filtered using a manually curated list blacklist file containing known artifacts/regions of nonspecific binding

    bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 GitHub
    $ Bash example
    # Example: Install bedtools
    # conda install -c bioconda bedtools
    
    # Define input and output files
    INPUT_PEAKS="normalized_peaks.bed" # Placeholder for the input normalized peak file
    BLACKLIST_FILE="blacklist.bed" # Placeholder for the manually curated blacklist file (e.g., hg38_blacklist.bed)
    OUTPUT_PEAKS="filtered_peaks.bed"
    
    # Filter normalized peaks by removing any regions that overlap with the blacklist file
    # The '-v' option in bedtools intersect reports entries in A that *do not* overlap with any entries in B.
    bedtools intersect -v -a "${INPUT_PEAKS}" -b "${BLACKLIST_FILE}" > "${OUTPUT_PEAKS}"
  20. 20

    Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

    IDR v2.0.2 GitHub
    $ Bash example
    # Install IDR (Irreproducible Discovery Rate) tool
    # conda install -c bioconda idr=2.0.2
    
    # Example input files (assuming they have been pre-ranked by entropy score
    # using make_informationcontent_from_peaks.pl and the score is in the 5th column,
    # typically 'signal.value' for narrowPeak files)
    REP1_PEAKS="rep1.entropy_ranked.narrowPeak"
    REP2_PEAKS="rep2.entropy_ranked.narrowPeak"
    OUTPUT_PREFIX="idr_reproducible_peaks"
    
    # Run IDR to determine reproducible peaks
    # --input-file-type narrowPeak: Specifies the input file format.
    # --rank-by signal.value: Ranks peaks based on the signal value (5th column),
    #                         which is assumed to contain the entropy score after pre-processing.
    # --output-file: Base name for the output reproducible peak file.
    # --plot: Generates diagnostic plots.
    # --log-output-file: Log file for IDR execution.
    idr --input-file-type narrowPeak \
        --rank-by signal.value \
        --output-file "${OUTPUT_PREFIX}.idr.out" \
        --plot \
        --log-output-file "${OUTPUT_PREFIX}.idr.log" \
        "${REP1_PEAKS}" \
        "${REP2_PEAKS}"

Tools Used

Raw Source Text
Raw reads were processed using the eCLIP pipeline v0.7.0
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN
Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18  -a InvRNA1/Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir human_repbase \  --readFilesIn path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (GRCh38). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.6)
Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique
BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species GRCh38_v29e --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with overlap_peakfi_with_bam.pl, included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peaks were filtered using a manually curated list blacklist file containing known artifacts/regions of nonspecific binding
Filtered peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Genome_build: GRCh38
Supplementary_files_format_and_content: BigWig files contain RPM-normalized read densities
← Back to Analysis