GSE180129 Processing Pipeline

RIP-Seq code_examples 24 steps

Publication

Identification of the global miR-130a targetome reveals a role for TBL1XR1 in hematopoietic stem cell self-renewal and t(8;21) AML.

Cell reports (2022) — PMID 35263585

Dataset

GSE180129

Global Identification of the miR-130a targetome in human HSPC Reveals a Novel Role for TBL1XR1 in Hematopoietic Stem Cell Self-Renewal and t(8;21) A…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).

UMI-tools v1.0.0 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi-tools

# Example: Extract 10bp UMI from the start of each read and add it to the read header.
# Replace 'input.fastq.gz' with your actual input file.
# Replace 'output.fastq.gz' with your desired output file.
# Adjust '--bc-pattern' based on the actual UMI length and position in your reads.
# For example, if the UMI is 10bp at the start of the read, use NNNNNNNNNN.
# If the UMI is extracted using a regex, use --extract-method=regex and --bc-pattern="^(?P<umi_1>.{10})"
# If working with paired-end reads, use --read2-in and --read2-out parameters.
umi_tools extract \
    --bc-pattern=NNNNNNNNNN \
    --stdin=input.fastq.gz \
    --stdout \
    --log=umi_tools_extract.log \
    --output-stat=umi_tools_extract.tsv \
    > output.fastq.gz

View on GitHub

Args: --random-seed 1 --bc-pattern NNNNNNNNNN

umi_tools extract (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi-tools

# Example usage of umi_tools extract
# Replace input.fastq.gz and output.fastq.gz with actual file paths
umi_tools extract \
    --random-seed 1 \
    --bc-pattern NNNNNNNNNN \
    -I input.fastq.gz \
    -S output.fastq.gz \
    --log umi_tools_extract.log

View on GitHub

Reads were then trimmed with cutadapt (1.14).

cutadapt v1.14 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

# Define input and output files
# Replace 'input.fastq.gz' with your actual input file
INPUT_FASTQ="input.fastq.gz"
OUTPUT_FASTQ="trimmed.fastq.gz"

# Define adapter sequence (replace with the actual adapter sequence used)
# A common Illumina 3' adapter sequence is: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
# If a 5' adapter was used, use -g instead of -a.
# If paired-end reads, use -A and -G for the respective adapters.
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# Run cutadapt to trim adapters and perform quality trimming
# -a: 3' adapter sequence
# -m: Minimum length of reads to keep after trimming (e.g., 15 bp)
# -q: Quality cutoff (e.g., 20 for Phred score 20) for 5' and 3' ends
# -o: Output file
cutadapt -a "${ADAPTER_SEQUENCE}" \
         -m 15 \
         -q 20 \
         -o "${OUTPUT_FASTQ}" \
         "${INPUT_FASTQ}"

View on GitHub

Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

eCLIP v1.18 GitHub

$ Bash example

# Install cutadapt
# conda install -c bioconda cutadapt=1.18

# Define input and output files
INPUT_FASTQ="input.fastq.gz" # Placeholder for your input FASTQ file (e.g., SRR1042713_1.fastq.gz from YeoLab/eclip example inputs)
OUTPUT_FASTQ="trimmed_output.fastq.gz" # Placeholder for your output trimmed FASTQ file
ADAPTER_FASTA="InvRNA.fasta" # Adapter sequences from YeoLab/eclip example inputs

# Download adapter file if not present (example)
# wget -O ${ADAPTER_FASTA} https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA.fasta

# Execute cutadapt command
cutadapt \
    --match-read-wildcards \
    -O 1 \
    --times 1 \
    -e 0.1 \
    --quality-cutoff 6 \
    -m 18 \
    -a file:${ADAPTER_FASTA} \
    -o ${OUTPUT_FASTQ} \
    ${INPUT_FASTQ}

View on GitHub

Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events.

cutadapt v1.14 GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=1.14

# Define input and output files
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_FASTQ="trimmed_reads.fastq.gz"

# Placeholder for the adapter sequence. This needs to be replaced with the actual adapter sequence
# that causes double-ligation events in your specific library preparation.
# Common adapter sequences include Illumina TruSeq adapters, but it depends on the kit.
# Example: ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_SEQUENCE="YOUR_DOUBLE_LIGATION_ADAPTER_SEQUENCE"

# Trim reads to remove double-ligation events (adapter sequences)
# -a ADAPTER_SEQUENCE: Removes the 3' adapter sequence. If the adapter is present anywhere, it will be removed.
# -o: Specifies the output file.
cutadapt -a "${ADAPTER_SEQUENCE}" -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}"

View on GitHub

Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

eCLIP vYeoLab/eclip repository (script `eclip.py`) GitHub

$ Bash example

# Install dependencies (example, adjust as needed)
# The eCLIP script requires Python, cutadapt, STAR, samtools, bedtools, R, pysam, pybedtools, numpy, scipy.
# conda create -n eclip_env python=3.8
# conda activate eclip_env
# conda install -c bioconda cutadapt star samtools bedtools
# pip install pysam pybedtools numpy scipy

# Clone the eCLIP repository if eclip.py is not already available
# git clone https://github.com/YeoLab/eclip.git
# cd eclip
# # Make sure eclip.py is executable or run with python
# chmod +x eclip.py
# # Add to PATH or use full path to eclip.py
# export PATH=$(pwd):$PATH

# Download the adapter sequence
wget https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/Ril19.fasta -O Ril19.fasta

# Execute the eCLIP command (assuming 'eclip' is in PATH or using 'python eclip.py')
eclip --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta

View on GitHub

For targeted miR-130a datasets, reads that did not contain the primer sequence (CAGUGCAAUGUUAAAAGGGCAU) were filtered out.

cutadapt (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt

    # Define primer sequence
    PRIMER_SEQUENCE="CAGUGCAAUGUUAAAAGGGCAU"

    # Define input and output files (placeholders)
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="filtered_output.fastq.gz"

    # Filter out reads that do not contain the 5' primer sequence
    cutadapt -g "${PRIMER_SEQUENCE}" --discard-untrimmed -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}"

View on GitHub

For total chimeric eCLIP datasets, 10nt from the 3'end of Read1 was trimmed with umi_tools.

UMI-tools v1.1.2 GitHub

$ Bash example

# Install umi-tools via conda
# conda install -c bioconda umi-tools=1.1.2

# Example usage: Trim 10nt from the 3' end of Read1 using umi_tools extract.
# This command assumes the 10nt at the 3' end of Read1 constitutes the UMI,
# which is extracted and subsequently removed from the read sequence, then appended to the read header.
# Replace input_R1.fastq.gz and output_R1_trimmed.fastq.gz with actual file names.
umi_tools extract \
    --bc-pattern=.*(?P<umi_1>.{10})$ \
    --extract-method=regex \
    -I input_R1.fastq.gz \
    -S output_R1_trimmed.fastq.gz

View on GitHub

Args: -u -9

Data Processing Script (Inferred with models/gemini-2.5-flash) vCustom

$ Bash example

# The arguments '-u' and '-9' are highly generic and do not point to a specific, well-known bioinformatics tool or a standard combination of flags for a single tool.
# Without further context (e.g., assay type, specific pipeline, or surrounding commands), it is impossible to definitively identify the tool.
# It is inferred to be a custom data processing script where '-u' and '-9' serve specific, custom-defined purposes (e.g., unique processing, a numerical threshold, or a specific mode).

# Placeholder command for a hypothetical custom script:
# Assuming 'my_pipeline_step_script.sh' is the tool that interprets these arguments.
# Replace 'input_file.txt' and 'output_file.txt' with actual file names.
my_pipeline_step_script.sh -u -9 input_file.txt > output_file.txt

Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05).

STAR v2.7.6a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Assuming paired-end reads. Adjust if single-end (e.g., READS_R2="")
READS_R1="trimmed_filtered_reads_R1.fastq.gz"
READS_R2="trimmed_filtered_reads_R2.fastq.gz" 
# Placeholder for the STAR index of the RepBase 18.05 database
# This index needs to be built once using STAR --runMode genomeGenerate
REPEAT_DB_INDEX="/path/to/STAR_index/RepBase_18.05" 
OUTPUT_DIR="star_repeat_alignment_output"
NUM_THREADS=8 # Example number of threads

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run STAR alignment against the repeat element database
# Parameters are chosen to be suitable for mapping to repetitive sequences:
# --outFilterMultimapNmax: Allows reads to map to multiple locations (important for repeats)
# --alignIntronMax 1: Disables splicing detection, as repeats are not typically spliced units
STAR \
  --runThreadN "${NUM_THREADS}" \
  --genomeDir "${REPEAT_DB_INDEX}" \
  --readFilesIn "${READS_R1}" "${READS_R2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_DIR}/" \
  --outSAMtype BAM SortedByCoordinate \
  --outReadsUnmapped Fastx \
  --outFilterMultimapNmax 100 \
  --alignIntronMax 1

View on GitHub

Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# GENOME_DIR should be the path to your pre-built STAR genome index for 'human_repbase'.
# For example, if 'human_repbase' refers to a GRCh38-based index:
# GENOME_DIR="/path/to/STAR_indices/human_repbase_GRCh38"
GENOME_DIR="human_repbase"
READ_FILE="path/to/read1"
OUTPUT_PREFIX="out_prefix"

# Execute STAR alignment
STAR --runThreadN 16 \
     --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ_FILE}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outReadsUnmapped Fastx \
     --outSAMtype BAM Unsorted \
     --outSAMattributes All \
     --outSAMunmapped Within \
     --outSAMattrRGline ID:foo \
     --outFilterType BySJout \
     --outFilterMultimapNmax 30 \
     --outFilterMultimapScoreRange 1 \
     --outFilterScoreMin 10 \
     --alignEndsType EndToEnd

View on GitHub

Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (hg19).

STAR v2.7.6a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.6a

# Define variables
STAR_VERSION="2.7.6a"
GENOME_DIR="/path/to/your/hg19_STAR_index" # Placeholder for hg19 STAR index
INPUT_FASTQ="unmapped_reads_filtered.fastq.gz" # Placeholder for input reads (e.g., from previous filtering step)
OUTPUT_PREFIX="mapped_reads"

# Note: The STAR genome index for hg19 must be pre-built using STAR --runMode genomeGenerate.
# Example command to build the index (commented out):
# mkdir -p ${GENOME_DIR}
# STAR --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/hg19.fa \
#      --sjdbGTFfile /path/to/hg19.gtf \
#      --runThreadN 8 # Adjust threads as needed

# Run STAR alignment
# The description implies mapping against a human genome (hg19).
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${INPUT_FASTQ} \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 8 # Adjust threads as needed based on available resources

# Rename output BAM for clarity
mv ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam mapped_reads.bam

View on GitHub

Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

STAR (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables (placeholders)
GENOME_DIR="/path/to/GRCh38_STAR_index" # Example: STAR index for human GRCh38
READ_FILE_1="/path/to/sample_R1.fastq.gz"
OUT_PREFIX="sample_aligned_"

# Execute STAR alignment command
STAR \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ_FILE_1}" \
  --outFileNamePrefix "${OUT_PREFIX}" \
  --outReadsUnmapped Fastx \
  --outSAMtype BAM Unsorted \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outSAMattrRGline ID:foo \
  --outFilterType BySJout \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd

View on GitHub

Aligned reads were sorted with samtools (1.6)

samtools v1.6 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.6

# Sort aligned reads (e.g., a BAM file)
# Replace input.bam with your actual input file and output.bam with your desired output file name
samtools sort -o sorted_reads.bam aligned_reads.bam

View on GitHub

Sorted reads were collapsed with umi_tools (1.0.0).

UMI-tools v1.0.0 GitHub

$ Bash example

# Install umi_tools (if not already installed)
# conda install -c bioconda umi-tools

# Example: Collapse sorted reads with umi_tools dedup
# This assumes UMIs have already been extracted and stored in a BAM tag (e.g., 'xf')
# Replace input.sorted.bam and output.deduplicated.bam with actual file names.
# Adjust --umi-tag and --extract-method if UMIs are stored differently (e.g., in read names).
umi_tools dedup \
    --extract-method=xf \
    --umi-tag=xf \
    --output-stats=dedup_stats.txt \
    -I input.sorted.bam \
    -S output.deduplicated.bam

View on GitHub

Args: --random-seed 1 --method unique

umi_tools dedup (Inferred with models/gemini-2.5-flash) v1.1.3 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi_tools

# Example usage: Deduplicate reads in a BAM file
# Replace input.bam and output.bam with actual file names
umi_tools dedup \
    --random-seed 1 \
    --method unique \
    -I input.bam \
    -S output.bam

View on GitHub

BAM files were used to identify peak clusters with Clipper (1.2.2).

CLIPper v1.2.2 GitHub

$ Bash example

# Install CLIPper (if not already installed)
# conda install -c bioconda clipper

# Run CLIPper to identify peak clusters from a BAM file
# Replace 'input.bam' with your actual input BAM file.
# Replace 'output_peaks' with your desired output file prefix.
# '-s 3.1e9' specifies the genome size for human (hg38). Adjust if using a different organism or assembly.
clipper.py -b input.bam -s 3.1e9 -o output_peaks

View on GitHub

Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

(Inferred with models/gemini-2.5-flash)

$ Bash example

# This script processes a BAM file, potentially performing gene-level analysis or filtering.
# The tool is inferred based on arguments like --maxgenes and --save-pickle, suggesting a custom Python script.
# The reference genome hg19 (GRCh37) is specified.

# Example command for the inferred tool
python inferred_tool_script.py \
    --species hg19 \
    --bam path/to/input.bam \
    --timeout 3600000 \
    --maxgenes 1000000 \
    --save-pickle \
    --outfile path/to/output.bam

Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

eCLIP v0.1.5 GitHub

$ Bash example

# Ensure Perl and necessary modules are installed.
# Clone the eCLIP repository and add its 'bin' directory to your PATH.
# For example:
# git clone https://github.com/yeolab/eclip.git
# export PATH=$PATH:$(pwd)/eclip/bin

# Define input files (placeholders)
IP_BAM="ip_sample.bam"
INPUT_BAM="input_sample.bam"
PEAK_FILE="peak_clusters.bed" # Assuming peak clusters are in BED format
OUTPUT_PREFIX="normalized_peaks"

# Normalize peak clusters using peaksnormalize.pl
# This script internally uses overlap_peakfi_with_bam_PE.pl as described.
peaksnormalize.pl \
  --peak_file "${PEAK_FILE}" \
  --ip_bam "${IP_BAM}" \
  --input_bam "${INPUT_BAM}" \
  --output_prefix "${OUTPUT_PREFIX}"

View on GitHub

Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

eCLIP v0.1.5 GitHub

$ Bash example

# Installation of eCLIP pipeline (if needed, typically via git clone)
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# This script is located in the 'scripts' directory of the eCLIP repository.

# Define input peak files (these would be the normalized peak regions from replicates)
# Replace with actual paths and filenames of your normalized peak BED files
INPUT_PEAK_FILE_REP1="sample_A_replicate1_normalized_peaks.bed"
INPUT_PEAK_FILE_REP2="sample_A_replicate2_normalized_peaks.bed"
# Add more replicate files if available, e.g., INPUT_PEAK_FILE_REP3="sample_A_replicate3_normalized_peaks.bed"

# Define the output prefix for the merged peak file
OUTPUT_PREFIX="sample_A_merged_replicate_peaks"

# Path to the script within the eCLIP repository
# Assuming the script is run from the eCLIP base directory or its path is known
ECLIP_SCRIPT_PATH="scripts/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl"

# Execute the script to merge overlapping normalized peak regions from replicates
perl "${ECLIP_SCRIPT_PATH}" \
    "${OUTPUT_PREFIX}" \
    "${INPUT_PEAK_FILE_REP1}" \
    "${INPUT_PEAK_FILE_REP2}"
    # Add more "${INPUT_PEAK_FILE_REP_N}" here if more replicates are used

View on GitHub

Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

IDR v2.0.2 GitHub

$ Bash example

# Install IDR (version 2.0.2)
# conda install -c bioconda idr=2.0.2

# Install merge_peaks dependencies (Perl script)
# No specific installation for make_informationcontent_from_peaks.pl, it's a Perl script.
# Ensure Perl is available.

# Define input peak files (placeholders)
REP1_PEAKS="rep1.compressed.bed"
REP2_PEAKS="rep2.compressed.bed"
OUTPUT_PREFIX="idr_output"

# Define reference genome files (placeholders, required by make_informationcontent_from_peaks.pl)
# Replace with actual paths to your genome fasta and chromosome sizes file.
GENOME_FASTA="path/to/genome.fa"
CHROM_SIZES="path/to/chrom.sizes"

# Step 1: Rank normalized peak files by entropy score
# make_informationcontent_from_peaks.pl is part of the yeolab/merge_peaks pipeline.
# It takes a BED file and adds an entropy score as the 5th column (0-indexed).
# The script requires a genome fasta and chrom sizes.

# Assuming make_informationcontent_from_peaks.pl is in the current directory or PATH
perl make_informationcontent_from_peaks.pl -i "${REP1_PEAKS}" -o "${REP1_PEAKS%.bed}.entropy.bed" -g "${GENOME_FASTA}" -s "${CHROM_SIZES}"
perl make_informationcontent_from_peaks.pl -i "${REP2_PEAKS}" -o "${REP2_PEAKS%.bed}.entropy.bed" -g "${GENOME_FASTA}" -s "${CHROM_SIZES}"

# Step 2: Sort the entropy-ranked peak files by the entropy score (5th column, descending)
# IDR expects peaks to be sorted by score.
sort -k5,5nr "${REP1_PEAKS%.bed}.entropy.bed" > "${REP1_PEAKS%.bed}.entropy.sorted.bed"
sort -k5,5nr "${REP2_PEAKS%.bed}.entropy.bed" > "${REP2_PEAKS%.bed}.entropy.sorted.bed"

# Step 3: Run IDR (Irreproducible Discovery Rate) to determine reproducible peaks
# IDR version 2.0.2
# The --rank parameter specifies which column to use for ranking (default is 5 for narrowPeak/broadPeak).
# Since we sorted by entropy score into the 5th column, we use --rank 5.
# A common IDR threshold is 0.05. The description doesn't specify, so a common default is used.

idr --samples "${REP1_PEAKS%.bed}.entropy.sorted.bed" "${REP2_PEAKS%.bed}.entropy.sorted.bed" \
    --output-file "${OUTPUT_PREFIX}.bed" \
    --rank 5 \
    --plot \
    --log-output-file "${OUTPUT_PREFIX}.log" \
    --soft-idr-threshold 0.05

View on GitHub

To capture chimeric reads, trimmed fastqs were converted to fasta and collapsed into unique sequences using seqtk seq (1.3-r106) and ctk fasta2collapse.pl (v1.0.8)

seqtk, ctk fasta2collapse.pl v1.3-r106, v1.0.8

$ Bash example

# Install seqtk (if not already installed)
# conda install -c bioconda seqtk

# Install ctk fasta2collapse.pl (if not already installed)
# This script is typically part of a larger pipeline or custom scripts, often found in CLIP-seq toolkits.
# It might require Perl and specific modules. For eCLIP assays, it's often found within the Yeo lab's eCLIP pipeline.
# Example: git clone https://github.com/yeolab/eclip.git
# Then ensure ctk_fasta2collapse.pl is in your PATH or specify its full path.

# Placeholder for input trimmed FASTQ file(s)
# Assuming 'trimmed_reads.fastq' as a single input file for demonstration.
# If multiple FASTQ files (e.g., paired-end reads), this step would typically be applied per file or after merging.
INPUT_FASTQ="trimmed_reads.fastq"

# Placeholder for output collapsed FASTA file
OUTPUT_FASTA="collapsed_unique_reads.fasta"

# Step 1: Convert trimmed FASTQ to FASTA using seqtk
# The output FASTA file will have the same base name as the input FASTQ, but with a .fasta extension.
seqtk seq -a "${INPUT_FASTQ}" > "${INPUT_FASTQ%.fastq}.fasta"

# Step 2: Collapse unique sequences from the FASTA file using ctk fasta2collapse.pl
# This script identifies and collapses identical sequences, often counting their occurrences, which is crucial for chimeric read analysis.
ctk fasta2collapse.pl "${INPUT_FASTQ%.fastq}.fasta" > "${OUTPUT_FASTA}"

Collapsed sequences were indexed and used to map mature miRNA (miRBase) using bowtie2 (v2.2.6)

Bowtie2 v2.2.6 GitHub

$ Bash example

# Install Bowtie2 (example using conda)
# conda install -c bioconda bowtie2=2.2.6

# --- Reference Data Preparation (prior step) ---
# Download mature miRNA sequences from miRBase (e.g., mature.fa from miRBase FTP)
# Example: wget http://www.mirbase.org/ftp/CURRENT/mature.fa.gz
# gunzip mature.fa.gz

# Build Bowtie2 index for mature miRNA sequences
# bowtie2-build mature.fa miRBase_mature_miRNA_index

# --- Mapping Step ---
# Map collapsed sequences to the miRBase mature miRNA index
# Input: collapsed_sequences.fastq (your collapsed sequence file)
# Reference Index: miRBase_mature_miRNA_index (the index built from miRBase mature miRNA sequences)
# Output: mapped_miRNA.sam (SAM format alignment file)
bowtie2 -x miRBase_mature_miRNA_index -U collapsed_sequences.fastq -S mapped_miRNA.sam

View on GitHub

Sequences that mapped to miRs whose mRNA portion were at least 18nt were re-assigned according to their original read ID, and mapped to the human genome as previously described.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR if not already installed
# conda install -c bioconda star

# Define variables
# Placeholder for human GRCh38 genome index. This directory should contain files like SA, SAindex, genome, etc.
GENOME_DIR="/path/to/STAR_genome_index/GRCh38" 
# Placeholder for the input FASTQ file containing sequences re-assigned after miR mapping.
INPUT_FASTQ="reassigned_reads.fastq.gz" 
# Prefix for output files (e.g., reassigned_to_genome_Aligned.out.bam)
OUTPUT_PREFIX="reassigned_to_genome"
# Number of threads to use for STAR
THREADS=8

# Align sequences to the human genome
# This command assumes the STAR genome index for GRCh38 has already been built.
# If not, you would first run:
# STAR --runThreadN ${THREADS} --runMode genomeGenerate \
#      --genomeDir ${GENOME_DIR} \
#      --genomeFastaFiles /path/to/GRCh38.primary_assembly.fa \
#      --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \
#      --sjdbOverhang 100 # Adjust sjdbOverhang based on read length (e.g., read_length - 1)
STAR --runThreadN ${THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${INPUT_FASTQ} \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped None

View on GitHub

Tools Used

eCLIP STAR Bowtie2

Raw Source Text

Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN
Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18  -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
For targeted miR-130a datasets, reads that did not contain the primer sequence (CAGUGCAAUGUUAAAAGGGCAU) were filtered out.
For total chimeric eCLIP datasets, 10nt from the 3'end of Read1 was trimmed with umi_tools. Args: -u -9
Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir human_repbase \  --readFilesIn path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (hg19). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.6)
Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique
BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
To capture chimeric reads, trimmed fastqs were converted to fasta and collapsed into unique sequences using seqtk seq (1.3-r106) and ctk fasta2collapse.pl (v1.0.8)
Collapsed sequences were indexed and used to map mature miRNA (miRBase) using bowtie2 (v2.2.6)
Sequences that mapped to miRs whose mRNA portion were at least 18nt were re-assigned according to their original read ID, and mapped to the human genome as previously described.
Genome_build: hg19
Supplementary_files_format_and_content: gk.Au9.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed and gk.Au13.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed contain enriched regions from the miR-130a targeted chimeric eCLIP dataset
Supplementary_files_format_and_content: cellsB_rep1.vs.cellsB_rep2.bed contains reproducible total chimeric eCLIP peaks across replicates
Supplementary_files_format_and_content: BigWig files contain RPM-normalized read densities

← Back to Analysis