GSE127944 Processing Pipeline

RIP-Seq code_examples 22 steps

Publication

In Vivo Screening Unveils Pervasive RNA-Binding Protein Dependencies in Leukemic Stem Cells and Identifies ELAVL1 as a Therapeutic Target.

Blood cancer discovery (2023) — PMID 36763002

Dataset

GSE127944

In vivo CRISPR screening unveils RNA binding protein dependencies for leukemic stem cells and identifies ELAVL1 as a potential therapeutic target [eC…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1).

eclipdemux v0.0.1 GitHub

$ Bash example

# Install eclipdemux (assuming it's part of the yeolab/eclip repository)
# git clone https://github.com/yeolab/eclip.git
# cd eclip/scripts
# Ensure Python environment is set up.

# Define input and output files
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_PREFIX="demuxed_reads"

# Define barcode and randomer lengths (example values, adjust as per experimental design)
# The description implies both inline barcodes are removed and randomers are added to headers.
BARCODE_LENGTH=6 # Example: length of the inline barcode to be removed
RANDOMER_LENGTH=10 # Example: length of the randomer to be extracted and added to header

# Execute eclipdemux to remove inline barcodes and add randomers to read headers
# The --gzip flag is used assuming input and output files are gzipped.
python eclipdemux.py \
    -i "${INPUT_FASTQ}" \
    -o "${OUTPUT_PREFIX}" \
    --barcode-length "${BARCODE_LENGTH}" \
    --randomer-length "${RANDOMER_LENGTH}" \
    --gzip

View on GitHub

2
Args: --length 10

Unknown (Inferred with models/gemini-2.5-flash) vUnknown
$ Bash example
```
your_tool_here --length 10
```

Reads were then trimmed with cutadapt (1.9.1).

cutadapt v1.9.1 GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=1.9.1

# Define input and output file names (placeholders)
# Replace with actual file paths
INPUT_READS_R1="input_reads_R1.fastq.gz"
INPUT_READS_R2="input_reads_R2.fastq.gz"
OUTPUT_TRIMMED_R1="trimmed_reads_R1.fastq.gz"
OUTPUT_TRIMMED_R2="trimmed_reads_R2.fastq.gz"

# Define common Illumina adapter sequences (placeholders)
# Adjust these based on the specific library preparation kit used
# Example: Illumina Universal Adapter (3' end of Read 1)
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# Example: Illumina Small RNA 3' Adapter (3' end of Read 2, reverse complement of Read 1 adapter if same)
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Trim reads with cutadapt
# -a: 3' adapter for R1
# -A: 3' adapter for R2
# -q: Quality trimming (e.g., trim bases with quality < 20 from 5' and 3' ends)
# -m: Minimum read length after trimming (e.g., 20 bp)
# -o: Output file for R1
# -p: Output file for R2
cutadapt -a "${ADAPTER_R1}" -A "${ADAPTER_R2}" \
         -q 20 -m 20 \
         -o "${OUTPUT_TRIMMED_R1}" -p "${OUTPUT_TRIMMED_R2}" \
         "${INPUT_READS_R1}" "${INPUT_READS_R2}"

View on GitHub

Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

eCLIP v0.1.5

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# This step uses cutadapt for adapter trimming, with adapter sequences
# (g_adapters.fasta, A_adapters.fasta, a_adapters.fasta) that are typically
# generated by a preceding parsebarcodes.sh step within the eCLIP pipeline.
# Input and output filenames are placeholders as they were not specified in the description.
cutadapt \
  --match-read-wildcards \
  -O 1 \
  --times 1 \
  -e 0.1 \
  -q 6 \
  -m 18 \
  -g g_adapters.fasta \
  -A A_adapters.fasta \
  -a a_adapters.fasta \
  -o trimmed_reads.fastq.gz \
  input_reads.fastq.gz

Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events.

cutadapt v1.9.1

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.9.1

# Define adapter sequences from eCLIP workflow (yeolab/eclip)
ADAPTER_A="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Illumina TruSeq Universal Adapter
ADAPTER_B="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Illumina TruSeq Read 2 Adapter

# Define input and output filenames
INPUT_READ1="input_R1.fastq.gz"
INPUT_READ2="input_R2.fastq.gz"
OUTPUT_READ1="trimmed_R1.fastq.gz"
OUTPUT_READ2="trimmed_R2.fastq.gz"

# Define number of threads
N_THREADS=8

# Trim reads to remove adapter sequences and potential double-ligation events
# -a: 3' adapter for R1
# -A: 3' adapter for R2
# -O 3: Minimum overlap of 3 bases required to trim an adapter
# -m 18: Discard reads shorter than 18 bp after trimming
# --cores: Number of CPU cores to use
cutadapt -a "${ADAPTER_A}" \
         -A "${ADAPTER_B}" \
         -O 3 \
         -m 18 \
         --cores "${N_THREADS}" \
         -o "${OUTPUT_READ1}" \
         -p "${OUTPUT_READ2}" \
         "${INPUT_READ1}" \
         "${INPUT_READ2}"

Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)

eCLIP v0.1.5 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.18

# Placeholder for input and output files:
# A_adapters.fasta: Fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline.
#                   This file contains the adapter sequences to be trimmed.
# input_reads.fastq: Raw sequencing reads (e.g., from a single-end eCLIP experiment).
# trimmed_reads.fastq: Output file with trimmed reads.

# Note on '-A A_adapters.fasta' in description vs. '-a file:A_adapters.fasta' in code:
# The eCLIP pipeline's CWL workflow typically uses '-a file:ADAPTERS.fasta' for 3' adapter trimming
# from a FASTA file on the first read. While the description uses '-A', which is for the second read
# in paired-end data, we infer '-a file:' based on common eCLIP pipeline implementation for single-end
# or primary read trimming, and the standard cutadapt syntax for reading adapters from a file.

cutadapt \
  --match-read-wildcards \
  -O 1 \
  --times 1 \
  -e 0.1 \
  --quality-cutoff 6 \
  -m 18 \
  -a file:A_adapters.fasta \
  -o trimmed_reads.fastq \
  input_reads.fastq

View on GitHub

Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR if not already available
# conda install -c bioconda star=2.4.0i

# Define paths and parameters
REPBASE_FASTA="RepBase18.05_mouse.fasta" # Placeholder for the mouse-specific RepBase FASTA file
STAR_INDEX_DIR="star_repbase_index"
TRIMMED_READS="trimmed_reads.fastq.gz" # Placeholder for your trimmed reads file (e.g., single-end)
# For paired-end reads, use:
# TRIMMED_READS_R1="trimmed_reads_R1.fastq.gz"
# TRIMMED_READS_R2="trimmed_reads_R2.fastq.gz"
# And modify --readFilesIn accordingly: --readFilesIn "${TRIMMED_READS_R1}" "${TRIMMED_READS_R2}"
OUTPUT_PREFIX="repbase_mapping_output"
NUM_THREADS=8 # Example number of threads

# 1. Create STAR genome index for the RepBase database
mkdir -p "${STAR_INDEX_DIR}"

# Note: For repeat element databases, splicing is not relevant, so sjdbOverhang is omitted.
# The genome is essentially the collection of repeat sequences.
STAR --runMode genomeGenerate \
     --genomeDir "${STAR_INDEX_DIR}" \
     --genomeFastaFiles "${REPBASE_FASTA}" \
     --runThreadN "${NUM_THREADS}"

# 2. Map trimmed reads to the RepBase index
# For repeat element mapping, it's common to allow more multimapping and disable splicing.
STAR --runThreadN "${NUM_THREADS}" \
     --genomeDir "${STAR_INDEX_DIR}" \
     --readFilesIn "${TRIMMED_READS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outBAMcompression 6 \
     --alignIntronMax 1 \
     --outFilterMultimapNmax 100 \
     --outFilterScoreMinOverLread 0.6 \
     --outFilterMatchNminOverLread 0.6

View on GitHub

Args: --runThreadN 16 \ --genomeDir mouse_repbase \ --readFilesIn path/to/read1 path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Placeholder for mouse genome index. For eCLIP, this would typically be a STAR index built from a mouse reference genome (e.g., mm10/GRCm38) potentially with repetitive elements (like from RepeatMasker) added to the FASTA for indexing.
GENOME_DIR="/path/to/STAR_genome_indices/mouse_mm10_repbase"
READ1="path/to/read1"
READ2="path/to/read2"
OUT_PREFIX="out_prefix"
THREADS=16

# Run STAR alignment
STAR \
  --runThreadN "${THREADS}" \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --outFileNamePrefix "${OUT_PREFIX}" \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outSAMattrRGline ID:foo \
  --outFilterType BySJout \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd

View on GitHub

Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm9).

STAR v2.4.0i GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.4.0i

# Define reference genome path for mm9
# The mm9 genome FASTA files can be downloaded from UCSC (e.g., http://hgdownload.soe.ucsc.edu/goldenPath/mm9/bigZips/chromFa.tar.gz).
# A corresponding GTF/GFF annotation file (e.g., from Ensembl or UCSC refGene) is also recommended for splice-aware alignment.
GENOME_DIR="/path/to/STAR_index/mm9"

# Example: Build STAR index for mm9 (if not already built)
# Replace 'Mus_musculus.mm9.fa' and 'Mus_musculus.mm9.gtf' with your actual file paths.
# Adjust --sjdbOverhang based on your read length (typically ReadLength - 1).
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles Mus_musculus.mm9.fa \
#      --sjdbGTFfile Mus_musculus.mm9.gtf \
#      --sjdbOverhang 100 \
#      --runThreadN 8

# Input reads (assuming unmapped reads filtered of repeat elements are in this file)
INPUT_READS="unmapped_filtered_reads.fastq.gz"
OUTPUT_PREFIX="mapped_star_mm9_"
NUM_THREADS=8 # Adjust based on available resources

STAR --runMode alignReads \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${INPUT_READS} \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --runThreadN ${NUM_THREADS}

View on GitHub

Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 /path/to/read2 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables for input/output and reference
# For RNA-based assays, a splice-aware aligner like STAR requires a pre-built genome index.
# The latest human reference genome assembly is GRCh38.
# Example: Download or build STAR index for GRCh38 (e.g., from Gencode or ENCODE project).
GENOME_DIR="/path/to/STAR_genome_index/GRCh38" 
READ1="/path/to/sample_R1.fastq.gz"
READ2="/path/to/sample_R2.fastq.gz"
OUTPUT_PREFIX="sample_aligned"

# Run STAR alignment
STAR \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outSAMattrRGline ID:foo \
  --outFilterType BySJout \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd

View on GitHub

Aligned reads were sorted with samtools (1.4.1)

samtools v1.4.1 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.4.1

# Sort aligned reads by coordinate
# Replace 'input.bam' with the actual path to your aligned BAM file
# Replace 'sorted_reads.bam' with the desired output path for the sorted BAM file
samtools sort -o sorted_reads.bam input.bam

View on GitHub

Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines.

eCLIP v0.1.5 GitHub

$ Bash example

# Install eclip (assuming Python 3 environment)
# It's recommended to install eclip within a conda environment
# conda create -n eclip_env python=3.8
# conda activate eclip_env
# pip install eclip==0.1.5 # If available on PyPI, otherwise clone and install from source
# If installing from source:
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# pip install .
# Ensure barcodecollapsepe.py is in your PATH or call it directly, e.g., python3 /path/to/eclip/tools/barcodecollapsepe.py

# Define input and output files
INPUT_R1="sorted_reads_R1.fastq.gz"
INPUT_R2="sorted_reads_R2.fastq.gz"
OUTPUT_R1_PREFIX="collapsed_reads_R1"
OUTPUT_R2_PREFIX="collapsed_reads_R2"

# Decompress input files for processing
gunzip -c "${INPUT_R1}" > "${INPUT_R1%.gz}"
gunzip -c "${INPUT_R2}" > "${INPUT_R2%.gz}"

# Execute barcodecollapsepe.py to collapse reads
# Using default parameters from eclip CWL workflow (min_qual, min_read_qual, min_seq_qual)
python3 /path/to/eclip/tools/barcodecollapsepe.py \
    -i "${INPUT_R1%.gz}" \
    -j "${INPUT_R2%.gz}" \
    -o "${OUTPUT_R1_PREFIX}.fastq" \
    -p "${OUTPUT_R2_PREFIX}.fastq" \
    -m 20 -q 20 -s 20

# Recompress output files
gzip "${OUTPUT_R1_PREFIX}.fastq"
gzip "${OUTPUT_R2_PREFIX}.fastq"

# Clean up uncompressed intermediate files (optional)
# rm "${INPUT_R1%.gz}" "${INPUT_R2%.gz}"

View on GitHub

Args: -o preRmDup.bam -m metrics_file -b rmDup.bam

Picard MarkDuplicates (Inferred with models/gemini-2.5-flash) v2.27.4 GitHub

$ Bash example

# Install Picard (e.g., via Conda)
# conda create -n picard_env picard -c bioconda -c conda-forge
# conda activate picard_env

# Define variables for clarity, mapping description's arguments to Picard's standard flags
# -b rmDup.bam -> Input BAM file
# -o preRmDup.bam -> Output BAM file (with duplicates marked)
# -m metrics_file -> Output metrics file
INPUT_BAM="rmDup.bam"
OUTPUT_BAM="preRmDup.bam"
METRICS_FILE="metrics_file"

# Execute Picard MarkDuplicates to identify and mark duplicate reads in a BAM file.
# ASSUME_SORTED=true is often used as input BAMs are typically sorted by coordinate.
# VALIDATION_STRINGENCY=SILENT is used to suppress verbose validation warnings.
picard MarkDuplicates \
    I="${INPUT_BAM}" \
    O="${OUTPUT_BAM}" \
    M="${METRICS_FILE}" \
    ASSUME_SORTED=true \
    VALIDATION_STRINGENCY=SILENT

View on GitHub

PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)

samtools v1.4.1 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.4.1

# Merge PCR de-duped reads from each inline barcode
# Replace input_deduped_barcode_1.bam, input_deduped_barcode_2.bam with your actual input files
samtools merge merged.bam input_deduped_barcode_1.bam input_deduped_barcode_2.bam

View on GitHub

Merged alignments were split to keep just read2 using samtools (1.4.1) view.

samtools v1.4.1 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.4.1

# Define input and output file names
INPUT_BAM="merged_alignments.bam"
OUTPUT_BAM="read2_alignments.bam"

# Split merged alignments to keep just read2 using samtools view
# -b: output BAM format
# -f 0x80: require all of these flags to be present (0x80 = second in pair)
samtools view -b -f 0x80 "${INPUT_BAM}" > "${OUTPUT_BAM}"

View on GitHub

Args: -h -b -f 128

Unknown (Inferred with models/gemini-2.5-flash) vUnknown

$ Bash example

# The specific tool could not be inferred from the provided description and arguments.
# Please specify the tool to generate a more accurate command.
# Example placeholder command:
unknown_tool -h -b -f 128

Read2 BAM files were used to identify peak clusters with Clipper (1.2.2).

CLIPper v1.2.2 GitHub

$ Bash example

# Install CLIPper (example, specific version might require cloning the repo or a specific conda/pip package)
# conda install -c bioconda clipper

# Define input and output files
INPUT_BAM="read2.bam" # Placeholder for "Read2 BAM files"
OUTPUT_PREFIX="peak_clusters"
GENOME_SIZES="hg38.chrom.sizes" # Placeholder for genome size file, often required by CLIPper

# Ensure the genome size file exists or provide a path to it.
# Example: Download from UCSC or ENCODE
# wget -O ${GENOME_SIZES} http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes

# Run CLIPper to identify peak clusters
# Using common default parameters as none were specified in the description.
# -b: Input BAM file
# -o: Output prefix for peak files
# -s: Genome size file
# -p: P-value threshold (example: 0.01)
# -f: FDR threshold (example: 0.05)
# -w: Window size (example: 20)
python clipper.py -b "${INPUT_BAM}" -o "${OUTPUT_PREFIX}" -s "${GENOME_SIZES}" -p 0.01 -f 0.05 -w 20

View on GitHub

Args: --species mm9 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

count_reads_in_genes.py (Inferred with models/gemini-2.5-flash) vPart of yeolab/eclip workflow GitHub

$ Bash example

# Installation instructions:
# This script is typically part of the yeolab/eclip repository.
# git clone https://github.com/yeolab/eclip.git
# cd eclip/scripts
# # Ensure Python dependencies are met, e.g., pysam, pandas, numpy
# # conda create -n eclip_env python=3.8
# # conda activate eclip_env
# # pip install pysam pandas numpy

# Define input/output paths and reference data
INPUT_BAM="path/to/input.bam"
OUTPUT_FILE="path/to/output.bam" # As specified in the description
SPECIES="mm9"
# Reference GTF file for mm9 is required by count_reads_in_genes.py for gene definitions.
# Placeholder: Replace with actual path to mm9 GTF file (e.g., from UCSC or Ensembl).
# Example: Download from UCSC Genome Browser, Table Browser, group: Genes and Gene Predictions, track: NCBI RefSeq, output format: GTF
MM9_GTF="/path/to/mm9.ncbiRefSeq.gtf" 

# Execute the tool
# Note: The original count_reads_in_genes.py script from yeolab/eclip uses '--output-file' for text output
# and '--pickle-file' for pickle output, and does not directly output a BAM file. 
# The '--outfile path/to/output.bam' argument from the description might imply a modified script,
# a wrapper script, or a subsequent conversion step in the pipeline.
# This command uses the argument names exactly as provided in the description.
python count_reads_in_genes.py \
    --species "${SPECIES}" \
    --bam "${INPUT_BAM}" \
    --timeout 3600000 \
    --maxgenes 1000000 \
    --save-pickle \
    --outfile "${OUTPUT_FILE}" \
    --gtf "${MM9_GTF}" # This argument is inferred as necessary for the tool's functionality with gene definitions

View on GitHub

Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

eCLIP v0.1.5 GitHub

$ Bash example

# Install eCLIP pipeline (if not already installed)
# git clone https://github.com/yeolab/eclip.git
# export PATH="/path/to/eclip/bin:$PATH"
# Ensure Perl dependencies are met (e.g., Bio::DB::Sam, Statistics::R, etc.)

# Define input files based on description
IP_READ2_BAM="ip_sample_read2.bam"
INPUT_READ2_BAM="input_sample_read2.bam"
PEAK_CLUSTERS_FILE="peak_clusters.narrowPeak" # Assuming peak clusters are in narrowPeak format
OUTPUT_PREFIX="peak_clusters_normalized"

# Execute the normalization script
peaksnormalize.pl "${PEAK_CLUSTERS_FILE}" "${IP_READ2_BAM}" "${INPUT_READ2_BAM}" "${OUTPUT_PREFIX}"

View on GitHub

Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

eCLIP v0.1.5 GitHub

$ Bash example

    # Clone the eCLIP repository to obtain the script
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip/scripts

    # Ensure Perl is installed on your system
    # sudo apt-get update && sudo apt-get install -y perl

    # Define input peak files (example names for normalized peak regions from replicates)
    INPUT_PEAK_FILE_REP1="sample_rep1_normalized_peaks.bed"
    INPUT_PEAK_FILE_REP2="sample_rep2_normalized_peaks.bed"

    # Define the output file for merged peaks
    OUTPUT_MERGED_PEAKS="sample_merged_replicate_peaks.bed"

    # Define the minimum number of replicates required for a peak to be considered reproducible
    # This is a common parameter for merging replicate peaks to ensure reproducibility.
    MIN_REPLICATES=2

    # Execute the merging script
    # The script `compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl` takes multiple BED files
    # as input and merges overlapping regions across replicates.
    # -c: Specifies the minimum number of replicates that must overlap for a peak to be included in the output.
    # Other parameters like -f (fold enrichment column), -p (p-value column), -s (score column)
    # can be specified if the input BED files deviate from the script's default column assignments (f=7, p=8, s=5).
    perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
        -c "${MIN_REPLICATES}" \
        "${INPUT_PEAK_FILE_REP1}" \
        "${INPUT_PEAK_FILE_REP2}" \
        > "${OUTPUT_MERGED_PEAKS}"

View on GitHub

Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

IDR v2.0.2 GitHub

$ Bash example

# Install IDR (if not already installed)
# conda install -c bioconda idr=2.0.2

# Define input files (placeholders)
# The description mentions "Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline)"
# This implies that the input files to IDR are the output of make_informationcontent_from_peaks.pl, which are typically ranked peak files.
# Assuming two replicate peak files (e.g., from MACS2 or similar peak caller, then processed by make_informationcontent_from_peaks.pl)
INPUT_PEAKS_REP1="rep1.ranked_peaks.narrowPeak"
INPUT_PEAKS_REP2="rep2.ranked_peaks.narrowPeak"

# Define output prefix
OUTPUT_PREFIX="reproducible_peaks"

# Run IDR
# The merge_peaks pipeline (https://github.com/yeolab/merge_peaks) typically uses a soft-idr-threshold of 0.05 and ranks by p.value.
# The input file type is usually narrowPeak for eCLIP.
idr \
  --input-file-type narrowPeak \
  --rank p.value \
  --output-file "${OUTPUT_PREFIX}.idr.out" \
  --plot \
  --log-output-file "${OUTPUT_PREFIX}.idr.log" \
  --soft-idr-threshold 0.05 \
  "${INPUT_PEAKS_REP1}" \
  "${INPUT_PEAKS_REP2}"

View on GitHub

Reproducible peaks were annotated by overlapping peak regions with Gencode vM1 annotations

GENCODE vvM1 GitHub

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# Download Gencode vM1 mouse annotations (GTF format)
# Note: vM1 is a very old version. The latest mouse release is vM33.
# The exact FTP path for vM1 might be in an archive or an older release directory.
# This is a placeholder for where the file would be downloaded from. 
# For example, a path might look like: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M1/gencode.vM1.annotation.gtf.gz
# wget -O gencode.vM1.annotation.gtf.gz "<GENCODE_VM1_GTF_URL>"
# gunzip gencode.vM1.annotation.gtf.gz

# Convert Gencode GTF to BED format for all relevant features (e.g., genes, transcripts, exons).
# This step is crucial for robust interval operations with bedtools.
# A dedicated tool like 'gtf2bed' or a custom script is often used.
# Example using awk to extract all features as basic BED entries (simplified):
# awk -F'\t' '($1 ~ /^chr/ || $1 ~ /^[0-9XYM]/) && $3 != "gene" && $3 != "transcript" {print $1"\t"$4-1"\t"$5"\t"$3":"$9"\t.\t"$7}' gencode.vM1.annotation.gtf | sed 's/gene_id "//;s/"; transcript_id "/\t/;s/"; exon_number "/\t/;s/";/g' > gencode.vM1.annotation.bed
# For this example, we assume 'gencode.vM1.annotation.bed' is already prepared.

# Assuming 'reproducible_peaks.bed' contains the reproducible peak regions
# and 'gencode.vM1.annotation.bed' is a BED file derived from the Gencode vM1 GTF,
# containing the genomic regions of interest for annotation.

# Overlap reproducible peaks with Gencode vM1 annotations
# -a: reproducible peaks (input A)
# -b: Gencode vM1 annotations (input B)
# -wao: Write the original entry in A and the original entry in B, plus the number of base pairs of overlap.
#       If there is no overlap, a "0" is reported for the overlap.
bedtools intersect -a reproducible_peaks.bed -b gencode.vM1.annotation.bed -wao > reproducible_peaks_annotated.bed

View on GitHub

Tools Used

eCLIP STAR

Raw Source Text

Sequenced reads were removed of inline barcodes and reformatted to include randomers in read headers with eclipdemux (v0.0.1). Args: --length 10
Reads were then trimmed with cutadapt (1.9.1). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -g g_adapters.fasta -A A_adapters.fasta -a a_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Reads were then trimmed once more with cutadapt (1.9.1) to remove double-ligation events. Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -A A_adapters.fasta (fasta sequences generated from parsebarcodes.sh within the eclip 0.1.5+ pipeline)
Trimmed reads were then mapped with STAR (2.4.0i) against a mouse-specific repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir mouse_repbase \  --readFilesIn path/to/read1 path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a mouse genome (mm9). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 /path/to/read2 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.4.1)
Sorted reads were collapsed with barcodecollapsepe.py included in eclip 0.1.5+ pipelines. Args: -o preRmDup.bam -m metrics_file -b rmDup.bam
PCR de-duped reads from each inline barcode were then merged with samtools (1.4.1) merge (merged.bam)
Merged alignments were split to keep just read2 using samtools (1.4.1) view. Args: -h -b -f 128
Read2 BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species mm9 --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using read2 BAM files for IP against read2 BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Reproducible peaks were annotated by overlapping peak regions with Gencode vM1 annotations
Genome_build: mm9
Supplementary_files_format_and_content: tab-delimited text files include -log10pValues and Log2FoldChange values for each IDR peaks called after cutoff of 3,3 in each parameter

← Back to Analysis