GSE180129 Processing Pipeline

RIP-Seq code_examples 24 steps

Publication

Identification of the global miR-130a targetome reveals a role for TBL1XR1 in hematopoietic stem cell self-renewal and t(8;21) AML.

Cell reports (2022) — PMID 35263585

Dataset

GSE180129

Global Identification of the miR-130a targetome in human HSPC Reveals a Novel Role for TBL1XR1 in Hematopoietic Stem Cell Self-Renewal and t(8;21) A…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).

    UMI-tools v1.0.0 GitHub
    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi-tools
    
    # Example: Extract 10bp UMI from the start of each read and add it to the read header.
    # Replace 'input.fastq.gz' with your actual input file.
    # Replace 'output.fastq.gz' with your desired output file.
    # Adjust '--bc-pattern' based on the actual UMI length and position in your reads.
    # For example, if the UMI is 10bp at the start of the read, use NNNNNNNNNN.
    # If the UMI is extracted using a regex, use --extract-method=regex and --bc-pattern="^(?P<umi_1>.{10})"
    # If working with paired-end reads, use --read2-in and --read2-out parameters.
    umi_tools extract \
        --bc-pattern=NNNNNNNNNN \
        --stdin=input.fastq.gz \
        --stdout \
        --log=umi_tools_extract.log \
        --output-stat=umi_tools_extract.tsv \
        > output.fastq.gz
  2. 2

    Args: --random-seed 1 --bc-pattern NNNNNNNNNN

    umi_tools extract (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub
    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi-tools
    
    # Example usage of umi_tools extract
    # Replace input.fastq.gz and output.fastq.gz with actual file paths
    umi_tools extract \
        --random-seed 1 \
        --bc-pattern NNNNNNNNNN \
        -I input.fastq.gz \
        -S output.fastq.gz \
        --log umi_tools_extract.log
  3. 3

    Reads were then trimmed with cutadapt (1.14).

    cutadapt v1.14 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt
    
    # Define input and output files
    # Replace 'input.fastq.gz' with your actual input file
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="trimmed.fastq.gz"
    
    # Define adapter sequence (replace with the actual adapter sequence used)
    # A common Illumina 3' adapter sequence is: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
    # If a 5' adapter was used, use -g instead of -a.
    # If paired-end reads, use -A and -G for the respective adapters.
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    
    # Run cutadapt to trim adapters and perform quality trimming
    # -a: 3' adapter sequence
    # -m: Minimum length of reads to keep after trimming (e.g., 15 bp)
    # -q: Quality cutoff (e.g., 20 for Phred score 20) for 5' and 3' ends
    # -o: Output file
    cutadapt -a "${ADAPTER_SEQUENCE}" \
             -m 15 \
             -q 20 \
             -o "${OUTPUT_FASTQ}" \
             "${INPUT_FASTQ}"
  4. 4

    Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

    $ Bash example
    # Install cutadapt
    # conda install -c bioconda cutadapt=1.18
    
    # Define input and output files
    INPUT_FASTQ="input.fastq.gz" # Placeholder for your input FASTQ file (e.g., SRR1042713_1.fastq.gz from YeoLab/eclip example inputs)
    OUTPUT_FASTQ="trimmed_output.fastq.gz" # Placeholder for your output trimmed FASTQ file
    ADAPTER_FASTA="InvRNA.fasta" # Adapter sequences from YeoLab/eclip example inputs
    
    # Download adapter file if not present (example)
    # wget -O ${ADAPTER_FASTA} https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA.fasta
    
    # Execute cutadapt command
    cutadapt \
        --match-read-wildcards \
        -O 1 \
        --times 1 \
        -e 0.1 \
        --quality-cutoff 6 \
        -m 18 \
        -a file:${ADAPTER_FASTA} \
        -o ${OUTPUT_FASTQ} \
        ${INPUT_FASTQ}
  5. 5

    Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events.

    cutadapt v1.14 GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=1.14
    
    # Define input and output files
    INPUT_FASTQ="input_reads.fastq.gz"
    OUTPUT_FASTQ="trimmed_reads.fastq.gz"
    
    # Placeholder for the adapter sequence. This needs to be replaced with the actual adapter sequence
    # that causes double-ligation events in your specific library preparation.
    # Common adapter sequences include Illumina TruSeq adapters, but it depends on the kit.
    # Example: ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    ADAPTER_SEQUENCE="YOUR_DOUBLE_LIGATION_ADAPTER_SEQUENCE"
    
    # Trim reads to remove double-ligation events (adapter sequences)
    # -a ADAPTER_SEQUENCE: Removes the 3' adapter sequence. If the adapter is present anywhere, it will be removed.
    # -o: Specifies the output file.
    cutadapt -a "${ADAPTER_SEQUENCE}" -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}"
  6. 6

    Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

    eCLIP vYeoLab/eclip repository (script `eclip.py`) GitHub
    $ Bash example
    # Install dependencies (example, adjust as needed)
    # The eCLIP script requires Python, cutadapt, STAR, samtools, bedtools, R, pysam, pybedtools, numpy, scipy.
    # conda create -n eclip_env python=3.8
    # conda activate eclip_env
    # conda install -c bioconda cutadapt star samtools bedtools
    # pip install pysam pybedtools numpy scipy
    
    # Clone the eCLIP repository if eclip.py is not already available
    # git clone https://github.com/YeoLab/eclip.git
    # cd eclip
    # # Make sure eclip.py is executable or run with python
    # chmod +x eclip.py
    # # Add to PATH or use full path to eclip.py
    # export PATH=$(pwd):$PATH
    
    # Download the adapter sequence
    wget https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/Ril19.fasta -O Ril19.fasta
    
    # Execute the eCLIP command (assuming 'eclip' is in PATH or using 'python eclip.py')
    eclip --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a Ril19.fasta
  7. 7

    For targeted miR-130a datasets, reads that did not contain the primer sequence (CAGUGCAAUGUUAAAAGGGCAU) were filtered out.

    cutadapt (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
        # Install cutadapt (if not already installed)
        # conda install -c bioconda cutadapt
    
        # Define primer sequence
        PRIMER_SEQUENCE="CAGUGCAAUGUUAAAAGGGCAU"
    
        # Define input and output files (placeholders)
        INPUT_FASTQ="input.fastq.gz"
        OUTPUT_FASTQ="filtered_output.fastq.gz"
    
        # Filter out reads that do not contain the 5' primer sequence
        cutadapt -g "${PRIMER_SEQUENCE}" --discard-untrimmed -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}"
  8. 8

    For total chimeric eCLIP datasets, 10nt from the 3'end of Read1 was trimmed with umi_tools.

    UMI-tools v1.1.2 GitHub
    $ Bash example
    # Install umi-tools via conda
    # conda install -c bioconda umi-tools=1.1.2
    
    # Example usage: Trim 10nt from the 3' end of Read1 using umi_tools extract.
    # This command assumes the 10nt at the 3' end of Read1 constitutes the UMI,
    # which is extracted and subsequently removed from the read sequence, then appended to the read header.
    # Replace input_R1.fastq.gz and output_R1_trimmed.fastq.gz with actual file names.
    umi_tools extract \
        --bc-pattern=.*(?P<umi_1>.{10})$ \
        --extract-method=regex \
        -I input_R1.fastq.gz \
        -S output_R1_trimmed.fastq.gz
  9. 9

    Args: -u -9

    Data Processing Script (Inferred with models/gemini-2.5-flash) vCustom
    $ Bash example
    # The arguments '-u' and '-9' are highly generic and do not point to a specific, well-known bioinformatics tool or a standard combination of flags for a single tool.
    # Without further context (e.g., assay type, specific pipeline, or surrounding commands), it is impossible to definitively identify the tool.
    # It is inferred to be a custom data processing script where '-u' and '-9' serve specific, custom-defined purposes (e.g., unique processing, a numerical threshold, or a specific mode).
    
    # Placeholder command for a hypothetical custom script:
    # Assuming 'my_pipeline_step_script.sh' is the tool that interprets these arguments.
    # Replace 'input_file.txt' and 'output_file.txt' with actual file names.
    my_pipeline_step_script.sh -u -9 input_file.txt > output_file.txt
  10. 10

    Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05).

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    # Assuming paired-end reads. Adjust if single-end (e.g., READS_R2="")
    READS_R1="trimmed_filtered_reads_R1.fastq.gz"
    READS_R2="trimmed_filtered_reads_R2.fastq.gz" 
    # Placeholder for the STAR index of the RepBase 18.05 database
    # This index needs to be built once using STAR --runMode genomeGenerate
    REPEAT_DB_INDEX="/path/to/STAR_index/RepBase_18.05" 
    OUTPUT_DIR="star_repeat_alignment_output"
    NUM_THREADS=8 # Example number of threads
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Run STAR alignment against the repeat element database
    # Parameters are chosen to be suitable for mapping to repetitive sequences:
    # --outFilterMultimapNmax: Allows reads to map to multiple locations (important for repeats)
    # --alignIntronMax 1: Disables splicing detection, as repeats are not typically spliced units
    STAR \
      --runThreadN "${NUM_THREADS}" \
      --genomeDir "${REPEAT_DB_INDEX}" \
      --readFilesIn "${READS_R1}" "${READS_R2}" \
      --readFilesCommand zcat \
      --outFileNamePrefix "${OUTPUT_DIR}/" \
      --outSAMtype BAM SortedByCoordinate \
      --outReadsUnmapped Fastx \
      --outFilterMultimapNmax 100 \
      --alignIntronMax 1
  11. 11

    Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # GENOME_DIR should be the path to your pre-built STAR genome index for 'human_repbase'.
    # For example, if 'human_repbase' refers to a GRCh38-based index:
    # GENOME_DIR="/path/to/STAR_indices/human_repbase_GRCh38"
    GENOME_DIR="human_repbase"
    READ_FILE="path/to/read1"
    OUTPUT_PREFIX="out_prefix"
    
    # Execute STAR alignment
    STAR --runThreadN 16 \
         --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ_FILE}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outReadsUnmapped Fastx \
         --outSAMtype BAM Unsorted \
         --outSAMattributes All \
         --outSAMunmapped Within \
         --outSAMattrRGline ID:foo \
         --outFilterType BySJout \
         --outFilterMultimapNmax 30 \
         --outFilterMultimapScoreRange 1 \
         --outFilterScoreMin 10 \
         --alignEndsType EndToEnd
  12. 12

    Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (hg19).

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.7.6a
    
    # Define variables
    STAR_VERSION="2.7.6a"
    GENOME_DIR="/path/to/your/hg19_STAR_index" # Placeholder for hg19 STAR index
    INPUT_FASTQ="unmapped_reads_filtered.fastq.gz" # Placeholder for input reads (e.g., from previous filtering step)
    OUTPUT_PREFIX="mapped_reads"
    
    # Note: The STAR genome index for hg19 must be pre-built using STAR --runMode genomeGenerate.
    # Example command to build the index (commented out):
    # mkdir -p ${GENOME_DIR}
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/hg19.fa \
    #      --sjdbGTFfile /path/to/hg19.gtf \
    #      --runThreadN 8 # Adjust threads as needed
    
    # Run STAR alignment
    # The description implies mapping against a human genome (hg19).
    STAR --genomeDir ${GENOME_DIR} \
         --readFilesIn ${INPUT_FASTQ} \
         --outFileNamePrefix ${OUTPUT_PREFIX}_ \
         --outSAMtype BAM SortedByCoordinate \
         --runThreadN 8 # Adjust threads as needed based on available resources
    
    # Rename output BAM for clarity
    mv ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam mapped_reads.bam
  13. 13

    Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables (placeholders)
    GENOME_DIR="/path/to/GRCh38_STAR_index" # Example: STAR index for human GRCh38
    READ_FILE_1="/path/to/sample_R1.fastq.gz"
    OUT_PREFIX="sample_aligned_"
    
    # Execute STAR alignment command
    STAR \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READ_FILE_1}" \
      --outFileNamePrefix "${OUT_PREFIX}" \
      --outReadsUnmapped Fastx \
      --outSAMtype BAM Unsorted \
      --outSAMattributes All \
      --outSAMunmapped Within \
      --outSAMattrRGline ID:foo \
      --outFilterType BySJout \
      --outFilterMultimapNmax 1 \
      --outFilterMultimapScoreRange 1 \
      --outFilterScoreMin 10 \
      --alignEndsType EndToEnd
  14. 14

    Aligned reads were sorted with samtools (1.6)

    samtools v1.6 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools=1.6
    
    # Sort aligned reads (e.g., a BAM file)
    # Replace input.bam with your actual input file and output.bam with your desired output file name
    samtools sort -o sorted_reads.bam aligned_reads.bam
  15. 15

    Sorted reads were collapsed with umi_tools (1.0.0).

    UMI-tools v1.0.0 GitHub
    $ Bash example
    # Install umi_tools (if not already installed)
    # conda install -c bioconda umi-tools
    
    # Example: Collapse sorted reads with umi_tools dedup
    # This assumes UMIs have already been extracted and stored in a BAM tag (e.g., 'xf')
    # Replace input.sorted.bam and output.deduplicated.bam with actual file names.
    # Adjust --umi-tag and --extract-method if UMIs are stored differently (e.g., in read names).
    umi_tools dedup \
        --extract-method=xf \
        --umi-tag=xf \
        --output-stats=dedup_stats.txt \
        -I input.sorted.bam \
        -S output.deduplicated.bam
  16. 16

    Args: --random-seed 1 --method unique

    umi_tools dedup (Inferred with models/gemini-2.5-flash) v1.1.3 GitHub
    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi_tools
    
    # Example usage: Deduplicate reads in a BAM file
    # Replace input.bam and output.bam with actual file names
    umi_tools dedup \
        --random-seed 1 \
        --method unique \
        -I input.bam \
        -S output.bam
  17. 17

    BAM files were used to identify peak clusters with Clipper (1.2.2).

    CLIPper v1.2.2 GitHub
    $ Bash example
    # Install CLIPper (if not already installed)
    # conda install -c bioconda clipper
    
    # Run CLIPper to identify peak clusters from a BAM file
    # Replace 'input.bam' with your actual input BAM file.
    # Replace 'output_peaks' with your desired output file prefix.
    # '-s 3.1e9' specifies the genome size for human (hg38). Adjust if using a different organism or assembly.
    clipper.py -b input.bam -s 3.1e9 -o output_peaks
  18. 18

    Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

    (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # This script processes a BAM file, potentially performing gene-level analysis or filtering.
    # The tool is inferred based on arguments like --maxgenes and --save-pickle, suggesting a custom Python script.
    # The reference genome hg19 (GRCh37) is specified.
    
    # Example command for the inferred tool
    python inferred_tool_script.py \
        --species hg19 \
        --bam path/to/input.bam \
        --timeout 3600000 \
        --maxgenes 1000000 \
        --save-pickle \
        --outfile path/to/output.bam
  19. 19

    Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

    $ Bash example
    # Ensure Perl and necessary modules are installed.
    # Clone the eCLIP repository and add its 'bin' directory to your PATH.
    # For example:
    # git clone https://github.com/yeolab/eclip.git
    # export PATH=$PATH:$(pwd)/eclip/bin
    
    # Define input files (placeholders)
    IP_BAM="ip_sample.bam"
    INPUT_BAM="input_sample.bam"
    PEAK_FILE="peak_clusters.bed" # Assuming peak clusters are in BED format
    OUTPUT_PREFIX="normalized_peaks"
    
    # Normalize peak clusters using peaksnormalize.pl
    # This script internally uses overlap_peakfi_with_bam_PE.pl as described.
    peaksnormalize.pl \
      --peak_file "${PEAK_FILE}" \
      --ip_bam "${IP_BAM}" \
      --input_bam "${INPUT_BAM}" \
      --output_prefix "${OUTPUT_PREFIX}"
  20. 20

    Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

    $ Bash example
    # Installation of eCLIP pipeline (if needed, typically via git clone)
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # This script is located in the 'scripts' directory of the eCLIP repository.
    
    # Define input peak files (these would be the normalized peak regions from replicates)
    # Replace with actual paths and filenames of your normalized peak BED files
    INPUT_PEAK_FILE_REP1="sample_A_replicate1_normalized_peaks.bed"
    INPUT_PEAK_FILE_REP2="sample_A_replicate2_normalized_peaks.bed"
    # Add more replicate files if available, e.g., INPUT_PEAK_FILE_REP3="sample_A_replicate3_normalized_peaks.bed"
    
    # Define the output prefix for the merged peak file
    OUTPUT_PREFIX="sample_A_merged_replicate_peaks"
    
    # Path to the script within the eCLIP repository
    # Assuming the script is run from the eCLIP base directory or its path is known
    ECLIP_SCRIPT_PATH="scripts/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl"
    
    # Execute the script to merge overlapping normalized peak regions from replicates
    perl "${ECLIP_SCRIPT_PATH}" \
        "${OUTPUT_PREFIX}" \
        "${INPUT_PEAK_FILE_REP1}" \
        "${INPUT_PEAK_FILE_REP2}"
        # Add more "${INPUT_PEAK_FILE_REP_N}" here if more replicates are used
  21. 21

    Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

    IDR v2.0.2 GitHub
    $ Bash example
    # Install IDR (version 2.0.2)
    # conda install -c bioconda idr=2.0.2
    
    # Install merge_peaks dependencies (Perl script)
    # No specific installation for make_informationcontent_from_peaks.pl, it's a Perl script.
    # Ensure Perl is available.
    
    # Define input peak files (placeholders)
    REP1_PEAKS="rep1.compressed.bed"
    REP2_PEAKS="rep2.compressed.bed"
    OUTPUT_PREFIX="idr_output"
    
    # Define reference genome files (placeholders, required by make_informationcontent_from_peaks.pl)
    # Replace with actual paths to your genome fasta and chromosome sizes file.
    GENOME_FASTA="path/to/genome.fa"
    CHROM_SIZES="path/to/chrom.sizes"
    
    # Step 1: Rank normalized peak files by entropy score
    # make_informationcontent_from_peaks.pl is part of the yeolab/merge_peaks pipeline.
    # It takes a BED file and adds an entropy score as the 5th column (0-indexed).
    # The script requires a genome fasta and chrom sizes.
    
    # Assuming make_informationcontent_from_peaks.pl is in the current directory or PATH
    perl make_informationcontent_from_peaks.pl -i "${REP1_PEAKS}" -o "${REP1_PEAKS%.bed}.entropy.bed" -g "${GENOME_FASTA}" -s "${CHROM_SIZES}"
    perl make_informationcontent_from_peaks.pl -i "${REP2_PEAKS}" -o "${REP2_PEAKS%.bed}.entropy.bed" -g "${GENOME_FASTA}" -s "${CHROM_SIZES}"
    
    # Step 2: Sort the entropy-ranked peak files by the entropy score (5th column, descending)
    # IDR expects peaks to be sorted by score.
    sort -k5,5nr "${REP1_PEAKS%.bed}.entropy.bed" > "${REP1_PEAKS%.bed}.entropy.sorted.bed"
    sort -k5,5nr "${REP2_PEAKS%.bed}.entropy.bed" > "${REP2_PEAKS%.bed}.entropy.sorted.bed"
    
    # Step 3: Run IDR (Irreproducible Discovery Rate) to determine reproducible peaks
    # IDR version 2.0.2
    # The --rank parameter specifies which column to use for ranking (default is 5 for narrowPeak/broadPeak).
    # Since we sorted by entropy score into the 5th column, we use --rank 5.
    # A common IDR threshold is 0.05. The description doesn't specify, so a common default is used.
    
    idr --samples "${REP1_PEAKS%.bed}.entropy.sorted.bed" "${REP2_PEAKS%.bed}.entropy.sorted.bed" \
        --output-file "${OUTPUT_PREFIX}.bed" \
        --rank 5 \
        --plot \
        --log-output-file "${OUTPUT_PREFIX}.log" \
        --soft-idr-threshold 0.05
  22. 22

    To capture chimeric reads, trimmed fastqs were converted to fasta and collapsed into unique sequences using seqtk seq (1.3-r106) and ctk fasta2collapse.pl (v1.0.8)

    seqtk, ctk fasta2collapse.pl v1.3-r106, v1.0.8
    $ Bash example
    # Install seqtk (if not already installed)
    # conda install -c bioconda seqtk
    
    # Install ctk fasta2collapse.pl (if not already installed)
    # This script is typically part of a larger pipeline or custom scripts, often found in CLIP-seq toolkits.
    # It might require Perl and specific modules. For eCLIP assays, it's often found within the Yeo lab's eCLIP pipeline.
    # Example: git clone https://github.com/yeolab/eclip.git
    # Then ensure ctk_fasta2collapse.pl is in your PATH or specify its full path.
    
    # Placeholder for input trimmed FASTQ file(s)
    # Assuming 'trimmed_reads.fastq' as a single input file for demonstration.
    # If multiple FASTQ files (e.g., paired-end reads), this step would typically be applied per file or after merging.
    INPUT_FASTQ="trimmed_reads.fastq"
    
    # Placeholder for output collapsed FASTA file
    OUTPUT_FASTA="collapsed_unique_reads.fasta"
    
    # Step 1: Convert trimmed FASTQ to FASTA using seqtk
    # The output FASTA file will have the same base name as the input FASTQ, but with a .fasta extension.
    seqtk seq -a "${INPUT_FASTQ}" > "${INPUT_FASTQ%.fastq}.fasta"
    
    # Step 2: Collapse unique sequences from the FASTA file using ctk fasta2collapse.pl
    # This script identifies and collapses identical sequences, often counting their occurrences, which is crucial for chimeric read analysis.
    ctk fasta2collapse.pl "${INPUT_FASTQ%.fastq}.fasta" > "${OUTPUT_FASTA}"
  23. 23

    Collapsed sequences were indexed and used to map mature miRNA (miRBase) using bowtie2 (v2.2.6)

    $ Bash example
    # Install Bowtie2 (example using conda)
    # conda install -c bioconda bowtie2=2.2.6
    
    # --- Reference Data Preparation (prior step) ---
    # Download mature miRNA sequences from miRBase (e.g., mature.fa from miRBase FTP)
    # Example: wget http://www.mirbase.org/ftp/CURRENT/mature.fa.gz
    # gunzip mature.fa.gz
    
    # Build Bowtie2 index for mature miRNA sequences
    # bowtie2-build mature.fa miRBase_mature_miRNA_index
    
    # --- Mapping Step ---
    # Map collapsed sequences to the miRBase mature miRNA index
    # Input: collapsed_sequences.fastq (your collapsed sequence file)
    # Reference Index: miRBase_mature_miRNA_index (the index built from miRBase mature miRNA sequences)
    # Output: mapped_miRNA.sam (SAM format alignment file)
    bowtie2 -x miRBase_mature_miRNA_index -U collapsed_sequences.fastq -S mapped_miRNA.sam
  24. 24

    Sequences that mapped to miRs whose mRNA portion were at least 18nt were re-assigned according to their original read ID, and mapped to the human genome as previously described.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR if not already installed
    # conda install -c bioconda star
    
    # Define variables
    # Placeholder for human GRCh38 genome index. This directory should contain files like SA, SAindex, genome, etc.
    GENOME_DIR="/path/to/STAR_genome_index/GRCh38" 
    # Placeholder for the input FASTQ file containing sequences re-assigned after miR mapping.
    INPUT_FASTQ="reassigned_reads.fastq.gz" 
    # Prefix for output files (e.g., reassigned_to_genome_Aligned.out.bam)
    OUTPUT_PREFIX="reassigned_to_genome"
    # Number of threads to use for STAR
    THREADS=8
    
    # Align sequences to the human genome
    # This command assumes the STAR genome index for GRCh38 has already been built.
    # If not, you would first run:
    # STAR --runThreadN ${THREADS} --runMode genomeGenerate \
    #      --genomeDir ${GENOME_DIR} \
    #      --genomeFastaFiles /path/to/GRCh38.primary_assembly.fa \
    #      --sjdbGTFfile /path/to/gencode.v38.annotation.gtf \
    #      --sjdbOverhang 100 # Adjust sjdbOverhang based on read length (e.g., read_length - 1)
    STAR --runThreadN ${THREADS} \
         --genomeDir ${GENOME_DIR} \
         --readFilesIn ${INPUT_FASTQ} \
         --outFileNamePrefix ${OUTPUT_PREFIX}_ \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMunmapped None

Tools Used

Raw Source Text
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN
Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Reads were then trimmed again with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18  -a Ril19.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
For targeted miR-130a datasets, reads that did not contain the primer sequence (CAGUGCAAUGUUAAAAGGGCAU) were filtered out.
For total chimeric eCLIP datasets, 10nt from the 3'end of Read1 was trimmed with umi_tools. Args: -u -9
Trimmed and filtered reads were then mapped with STAR (2.7.6a) against a repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir human_repbase \  --readFilesIn path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.7.6a) against a human genome (hg19). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.6)
Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique
BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species (hg19) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
To capture chimeric reads, trimmed fastqs were converted to fasta and collapsed into unique sequences using seqtk seq (1.3-r106) and ctk fasta2collapse.pl (v1.0.8)
Collapsed sequences were indexed and used to map mature miRNA (miRBase) using bowtie2 (v2.2.6)
Sequences that mapped to miRs whose mRNA portion were at least 18nt were re-assigned according to their original read ID, and mapped to the human genome as previously described.
Genome_build: hg19
Supplementary_files_format_and_content: gk.Au9.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed and gk.Au13.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed contain enriched regions from the miR-130a targeted chimeric eCLIP dataset
Supplementary_files_format_and_content: cellsB_rep1.vs.cellsB_rep2.bed contains reproducible total chimeric eCLIP peaks across replicates
Supplementary_files_format_and_content: BigWig files contain RPM-normalized read densities
← Back to Analysis