GSE107767 Processing Pipeline

OTHER code_examples 33 steps

Publication

A protein-RNA interaction atlas of the ribosome biogenesis factor AATF.

Scientific reports (2019) — PMID 31363146

Dataset

GSE107767

Best practices for eCLIP experiments and analysis [poor quality]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Library strategy: eCLIP-seq

    $ Bash example
    #!/bin/bash
    
    # This script outlines a typical eCLIP-seq data processing pipeline,
    # drawing from the Yeo lab's eCLIP CWL workflow (https://github.com/yeolab/eclip).
    # It includes adapter trimming, alignment, deduplication, and peak calling.
    # For IDR (Identifying Reproducible Peaks), the merge_peaks.py tool is recommended.
    
    # --- Configuration --- 
    # Placeholder paths for reference genome and input files.
    # Please replace with actual paths on your system.
    
    # Reference Genome: Human hg38 (GRCh38) is used as a placeholder.
    # Ensure you have a STAR index built for your chosen genome.
    GENOME_FASTA="/path/to/reference/genome/hg38.fa"
    STAR_INDEX_DIR="/path/to/reference/genome/STAR_index/hg38"
    
    # Input FASTQ files (example: single-end eCLIP and corresponding input control)
    INPUT_FASTQ_SAMPLE="sample_eclip_rep1.fastq.gz"
    INPUT_FASTQ_CONTROL="sample_input_rep1.fastq.gz"
    
    # Output prefixes
    SAMPLE_PREFIX="sample_eclip_rep1"
    CONTROL_PREFIX="sample_input_rep1"
    
    # Adapter sequence (example: Illumina TruSeq, verify with your library prep kit)
    ADAPTER="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    
    # --- 1. Adapter Trimming (using Cutadapt) ---
    # Cutadapt is used to remove sequencing adapters and low-quality bases.
    # Installation: # conda install -c bioconda cutadapt
    
    echo "Trimming adapters for ${INPUT_FASTQ_SAMPLE}..."
    cutadapt -a "${ADAPTER}" -o "${SAMPLE_PREFIX}_trimmed.fastq.gz" "${INPUT_FASTQ_SAMPLE}" > "${SAMPLE_PREFIX}_cutadapt.log" 2>&1
    
    echo "Trimming adapters for ${INPUT_FASTQ_CONTROL}..."
    cutadapt -a "${ADAPTER}" -o "${CONTROL_PREFIX}_trimmed.fastq.gz" "${INPUT_FASTQ_CONTROL}" > "${CONTROL_PREFIX}_cutadapt.log" 2>&1
    
    # --- 2. Alignment (using STAR) ---
    # STAR is a splice-aware aligner suitable for CLIP-seq data.
    # Installation: # conda install -c bioconda star
    
    echo "Aligning ${SAMPLE_PREFIX}_trimmed.fastq.gz to genome..."
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${SAMPLE_PREFIX}_trimmed.fastq.gz" \
         --outFileNamePrefix "${SAMPLE_PREFIX}_aligned_" \
         --outSAMtype BAM SortedByCoordinate \
         --runThreadN 8 # Adjust number of threads as appropriate
    
    echo "Aligning ${CONTROL_PREFIX}_trimmed.fastq.gz to genome..."
    STAR --genomeDir "${STAR_INDEX_DIR}" \
         --readFilesIn "${CONTROL_PREFIX}_trimmed.fastq.gz" \
         --outFileNamePrefix "${CONTROL_PREFIX}_aligned_" \
         --outSAMtype BAM SortedByCoordinate \
         --runThreadN 8 # Adjust number of threads as appropriate
    
    # --- 3. Deduplication (using Picard MarkDuplicates) ---
    # Deduplication is crucial for CLIP-seq to remove PCR duplicates.
    # Installation: # conda install -c bioconda picard
    
    echo "Deduplicating ${SAMPLE_PREFIX}_aligned_Aligned.sortedByCoord.out.bam..."
    java -jar /path/to/picard.jar MarkDuplicates \
         I="${SAMPLE_PREFIX}_aligned_Aligned.sortedByCoord.out.bam" \
         O="${SAMPLE_PREFIX}_dedup.bam" \
         M="${SAMPLE_PREFIX}_dedup_metrics.txt" \
         REMOVE_DUPLICATES=true \
         AS=true # Assume sorted input
    
    echo "Deduplicating ${CONTROL_PREFIX}_aligned_Aligned.sortedByCoord.out.bam..."
    java -jar /path/to/picard.jar MarkDuplicates \
         I="${CONTROL_PREFIX}_aligned_Aligned.sortedByCoord.out.bam" \
         O="${CONTROL_PREFIX}_dedup.bam" \
         M="${CONTROL_PREFIX}_dedup_metrics.txt" \
         REMOVE_DUPLICATES=true \
         AS=true
    
    # Index BAM files for downstream tools like CLIPper
    # Installation: # conda install -c bioconda samtools
    echo "Indexing BAM files..."
    samtools index "${SAMPLE_PREFIX}_dedup.bam"
    samtools index "${CONTROL_PREFIX}_dedup.bam"
    
    # --- 4. Peak Calling (using CLIPper) ---
    # CLIPper is a dedicated peak caller for CLIP-seq data.
    # Installation: # conda install -c bioconda clipper
    
    echo "Calling peaks for ${SAMPLE_PREFIX} using CLIPper..."
    clipper -s hg38 \
            -i "${SAMPLE_PREFIX}_dedup.bam" \
            -c "${CONTROL_PREFIX}_dedup.bam" \
            -o "${SAMPLE_PREFIX}_peaks.bed" \
            -v # Verbose output
    
    # --- 5. IDR (Identifying Reproducible Peaks - using merge_peaks.py) ---
    # This step requires multiple replicates (e.g., rep1_peaks.bed, rep2_peaks.bed).
    # The merge_peaks.py script is part of the Yeo lab's custom IDR pipeline.
    # Source: https://github.com/yeolab/merge_peaks
    # Installation: # git clone https://github.com/yeolab/merge_peaks.git
    #               # cd merge_peaks && python setup.py install
    #
    # Example command (assuming you have peak files from two replicates):
    # echo "Performing IDR with merge_peaks.py for reproducible peaks..."
    # python /path/to/merge_peaks/merge_peaks.py \
    #        -i rep1_peaks.bed rep2_peaks.bed \
    #        -o reproducible_peaks.bed \
    #        --idr_threshold 0.1 # Example IDR threshold
    
  2. 2

    Takes output from raw files.

    FastQC (Inferred with models/gemini-2.5-flash) v0.12.1 GitHub
    $ Bash example
    # Install FastQC (if not already installed)
    # conda install -c bioconda fastqc
    
    # Run FastQC on raw sequencing reads to generate quality reports
    # Replace 'raw_reads.fastq.gz' with your actual input FASTQ file(s)
    # Replace 'output_directory' with your desired output location for reports
    fastqc raw_reads.fastq.gz -o output_directory
  3. 3

    Run to trim off both 5’ and 3’ adapters on both reads.

    cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub
    $ Bash example
    # Install cutadapt (e.g., using conda)
    # conda create -n cutadapt_env cutadapt=4.0 -y
    # conda activate cutadapt_env
    
    # Define input and output files
    READ1_IN="read1.fastq.gz"
    READ2_IN="read2.fastq.gz"
    READ1_OUT="trimmed_read1.fastq.gz"
    READ2_OUT="trimmed_read2.fastq.gz"
    REPORT_FILE="cutadapt_report.txt"
    
    # Define 3' adapter sequences (Illumina TruSeq, common for eCLIP/RNA-seq)
    # These are the sequences that appear at the 3' end of the read.
    ADAPTER_R1_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Adapter for Read 1 (3' end)
    ADAPTER_R2_3PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Adapter for Read 2 (3' end)
    
    # Define 5' adapter sequences (if applicable and known)
    # These are sequences that appear at the 5' end of the read.
    # For eCLIP, this might be a random barcode/UMI or a specific library adapter.
    # IMPORTANT: Replace "YOUR_5PRIME_ADAPTER_R1_SEQUENCE" and "YOUR_5PRIME_ADAPTER_R2_SEQUENCE"
    # with the actual 5' adapter sequences used in your library preparation.
    # If no specific 5' adapter sequence needs to be trimmed, these parameters can be omitted.
    ADAPTER_R1_5PRIME="YOUR_5PRIME_ADAPTER_R1_SEQUENCE" # Adapter for Read 1 (5' end)
    ADAPTER_R2_5PRIME="YOUR_5PRIME_ADAPTER_R2_SEQUENCE" # Adapter for Read 2 (5' end)
    
    # Number of CPU cores to use
    NUM_CORES=8
    
    # Run cutadapt to trim 5' and 3' adapters, perform quality trimming, and filter by length
    # -g: 5' adapter for R1 (searches from the beginning of the read)
    # -G: 5' adapter for R2 (searches from the beginning of the read)
    # -a: 3' adapter for R1 (searches from the end of the read)
    # -A: 3' adapter for R2 (searches from the end of the read)
    # -q: Quality cutoff (e.g., 20 for Phred score 20)
    # -m: Minimum length of reads after trimming
    # -o: Output file for R1
    # -p: Output file for R2
    # --cores: Number of CPU cores
    # --report=full: Generate a detailed report
    cutadapt \
      -g "${ADAPTER_R1_5PRIME}" \
      -G "${ADAPTER_R2_5PRIME}" \
      -a "${ADAPTER_R1_3PRIME}" \
      -A "${ADAPTER_R2_3PRIME}" \
      -q 20 \
      -m 15 \
      --cores "${NUM_CORES}" \
      -o "${READ1_OUT}" \
      -p "${READ2_OUT}" \
      "${READ1_IN}" \
      "${READ2_IN}" \
      > "${REPORT_FILE}" 2>&1
  4. 4

    Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

    quality-cutoff vN/A GitHub
    $ Bash example
    # This tool appears to be a custom script or a specific subcommand within a larger bioinformatics suite.
    # Installation instructions are not available for 'quality-cutoff' as a standalone package.
    # Please refer to the specific pipeline documentation where this command is used for setup instructions.
    
    quality-cutoff 6 \
      -m 18 \
      -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
      -g CTTCCGATCTACAAGTT \
      -g CTTCCGATCTTGGTCCT \
      -A AACTTGTAGATCGGA \
      -A AGGACCAAGATCGGA \
      -A ACTTGTAGATCGGAA \
      -A AGGACCAAGATCGGAA \
      -A CTTGT AGATCGGAAG \
      -A GACCAAGATCGGAAG \
      -A TTGTAGATCGGAAGA \
      -A ACCAAGATCGGAAGA \
      -A TGTAGATCGGAAGAG \
      -A CCAAGATCGGAAGAG \
      -A GTAGATCGGAAGAGC \
      -A CAAGATCGGAAGAGC \
      -A TAGATCGGAAGAGCG \
      -A AAGATCGGAAGAGCG \
      -A AGATCGGAAGAGCGT \
      -A GATCGGAAGAGCGTC \
      -A ATCGGAAGAGCGTCG \
      -A TCGGAAGAGCGTCGT \
      -A CGGAAGAGCGTCGTG \
      -A GGAAGAGCGTCGTGT \
      -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
      -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
      /full/path/to/files/file_R1.C01.fastq.gz \
      /full/path/to/files/file_R2.C01.fastq.gz \
      > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
  5. 5

    Takes output from cutadapt round 1.

    cutadapt v4.0 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output files
    INPUT_FASTQ="input_from_round1.fastq.gz"
    OUTPUT_FASTQ="trimmed_round2.fastq.gz"
    ADAPTER_SEQUENCE="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Example adapter for a second round of trimming in eCLIP (e.g., small RNA 3' adapter from yeolab/eclip CWL)
    
    # Execute cutadapt for a second round of trimming
    # Parameters inferred from typical eCLIP pipelines (e.g., yeolab/eclip or yeolab/skipper)
    # -a: 3' adapter sequence
    # -e: Maximum error rate
    # -q: Quality cutoff for 3' end trimming
    # -m: Minimum read length after trimming
    # -o: Output file
    cutadapt -a "${ADAPTER_SEQUENCE}" \
             -e 0.1 \
             -q 20 \
             -m 18 \
             -o "${OUTPUT_FASTQ}" \
             "${INPUT_FASTQ}"
  6. 6

    Run to trim off the 3’ adapters on read 2, to control for double ligation events.

    cutadapt (Inferred with models/gemini-2.5-flash) v3.4 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt=3.4
    
    # Define input and output files (placeholders)
    # INPUT_READ2: Path to the input FASTQ file for read 2
    # OUTPUT_TRIMMED_READ2: Path for the output FASTQ file after trimming
    # LOG_FILE: Path for the log file generated by cutadapt
    INPUT_READ2="read2.fastq.gz"
    OUTPUT_TRIMMED_READ2="read2_trimmed.fastq.gz"
    LOG_FILE="cutadapt_trim_r2.log"
    
    # 3' adapter sequence for read 2, commonly used in eCLIP assays
    # This adapter sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC) is derived from the
    # Illumina TruSeq Small RNA 3' Adapter, which can ligate to itself or other
    # adapters in double ligation events in eCLIP.
    ADAPTER_R2="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    
    # Run cutadapt to trim the 3' adapter from read 2
    # -a ADAPTER: Specifies a 3' adapter to be removed from the reads.
    #             This option searches for the adapter sequence at the 3' end of the read.
    # -o: Specifies the output file for the trimmed reads.
    cutadapt -a "${ADAPTER_R2}" -o "${OUTPUT_TRIMMED_READ2}" "${INPUT_READ2}" > "${LOG_FILE}" 2>&1
  7. 7

    Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

    cutadapt v1.18 GitHub
    $ Bash example
    # Install cutadapt (example using conda)
    # conda install -c bioconda cutadapt
    
    # Define input and output files
    INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
    INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
    OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
    METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics"
    
    # Define adapters (these are likely fragments of the Illumina universal adapter or specific library adapters)
    # The command specifies multiple -A arguments, which are 3' adapters for the R1 read.
    # Cutadapt will also search for the reverse complement of these adapters in the R2 read.
    ADAPTERS=(
        "AACTTGTAGATCGGA"
        "AGGACCAAGATCGGA"
        "ACTTGTAGATCGGAA"
        "GGACCAAGATCGGAA"
        "CTTGTAGATCGGAAG"
        "GACCAAGATCGGAAG"
        "TTGTAGATCGGAAGA"
        "ACCAAGATCGGAAGA"
        "TGTAGATCGGAAGAG"
        "CCAAGATCGGAAGAG"
        "GTAGATCGGAAGAGC"
        "CAAGATCGGAAGAGC"
        "TAGATCGGAAGAGCG"
        "AAGATCGGAAGAGCG"
        "AGATCGGAAGAGCGT"
        "GATCGGAAGAGCGTC"
        "ATCGGAAGAGCGTCG"
        "TCGGAAGAGCGTCGT"
        "CGGAAGAGCGTCGTG"
        "GGAAGAGCGTCGTGT"
    )
    
    # Construct the adapter arguments string
    ADAPTER_ARGS=""
    for adapter in "${ADAPTERS[@]}"; do
        ADAPTER_ARGS+=" -A ${adapter}"
    done
    
    # Execute cutadapt command
    cutadapt -f fastq \
        --match-read-wildcards \
        --times 1 \
        -e 0.1 \
        -O 5 \
        --quality-cutoff 6 \
        -m 18 \
        ${ADAPTER_ARGS} \
        -o "${OUTPUT_R1}" \
        -p "${OUTPUT_R2}" \
        "${INPUT_R1}" \
        "${INPUT_R2}" \
        > "${METRICS_FILE}"
    
  8. 8

    Takes output from cutadapt round 2.

    cutadapt v2.10 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=2.10
    
    # Define input and output files
    # INPUT_FASTQ is the output from cutadapt round 1 (typically 3' adapter trimming)
    INPUT_FASTQ="reads_r1_trimmed_3prime.fastq.gz"
    # OUTPUT_FASTQ will contain reads trimmed for both 3' and 5' adapters
    OUTPUT_FASTQ="reads_r1_trimmed_both.fastq.gz"
    
    # Define the 5' adapter sequence (e.g., eCLIP linker adapter)
    # This adapter sequence is specific to the library preparation protocol.
    # Example: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    # (This example assumes a common eCLIP linker adapter without a UMI at the 5' end for simplicity in this command)
    ADAPTER_5PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"
    
    # Perform 5' adapter trimming (cutadapt round 2 in a typical eCLIP pipeline)
    # -g: Specifies a 5' adapter to be removed from the beginning of the read.
    # -o: Specifies the output file for trimmed reads.
    # -q 20,20: Trims low-quality bases from both ends of the read with a quality cutoff of 20.
    # --minimum-length 15: Discards reads shorter than 15 bp after trimming.
    # --cores 0: Uses all available CPU cores for parallel processing.
    # --report=full: Generates a detailed trimming report.
    cutadapt -g "${ADAPTER_5PRIME}" \
             -o "${OUTPUT_FASTQ}" \
             -q 20,20 \
             --minimum-length 15 \
             --cores 0 \
             --report=full \
             "${INPUT_FASTQ}"
  9. 9

    Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Define variables
    INPUT_READS="input_reads.fastq.gz" # Replace with your input FASTQ file(s). For paired-end, use "input_reads_R1.fastq.gz input_reads_R2.fastq.gz"
    OUTPUT_PREFIX="filtered_reads"
    REPBASE_FASTA="human_repbase.fasta" # Placeholder for human specific RepBase sequences. This file needs to be prepared from RepBase.
    STAR_INDEX_DIR="star_repbase_index"
    NUM_THREADS=8 # Adjust as needed
    
    # --- Installation (commented out) ---
    # conda install -c bioconda star
    
    # --- Build STAR index for RepBase sequences (if not already built) ---
    # This step assumes you have a FASTA file containing human specific RepBase sequences.
    # You would typically download or generate this file from RepBase (e.g., http://www.girinst.org/repbase/).
    # Example of how to build the index (uncomment and run if needed):
    # mkdir -p ${STAR_INDEX_DIR}
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${STAR_INDEX_DIR} \
    #      --genomeFastaFiles ${REPBASE_FASTA} \
    #      --runThreadN ${NUM_THREADS}
    
    # --- Align reads to RepBase index and filter out mapped reads ---
    # Reads that map to the RepBase index are considered repetitive and are discarded.
    # The --outReadsUnmapped Fastx option outputs reads that did NOT map to RepBase.
    STAR --genomeDir ${STAR_INDEX_DIR} \
         --readFilesIn ${INPUT_READS} \
         --runThreadN ${NUM_THREADS} \
         --outFileNamePrefix ${OUTPUT_PREFIX}_repbase_ \
         --outSAMtype None \
         --outReadsUnmapped Fastx # Output unmapped reads to a FASTA/Q file
    
    # The unmapped reads are the ones that *do not* map to RepBase, which are the desired "filtered" reads.
    # The output file will be named ${OUTPUT_PREFIX}_repbase_Unmapped.out.mate1 (and mate2 for paired-end).
    # If input was gzipped, output will also be gzipped.
    
    # Rename for clarity
    mv ${OUTPUT_PREFIX}_repbase_Unmapped.out.mate1 ${OUTPUT_PREFIX}.fastq.gz
    # If paired-end, uncomment and adjust:
    # mv ${OUTPUT_PREFIX}_repbase_Unmapped.out.mate1 ${OUTPUT_PREFIX}_R1.fastq.gz
    # mv ${OUTPUT_PREFIX}_repbase_Unmapped.out.mate2 ${OUTPUT_PREFIX}_R2.fastq.gz
    
    # Clean up intermediate files (optional)
    # rm ${OUTPUT_PREFIX}_repbase_Log.final.out
    # rm ${OUTPUT_PREFIX}_repbase_Log.out
    # rm ${OUTPUT_PREFIX}_repbase_Log.progress.out
    # rm ${OUTPUT_PREFIX}_repbase_SJ.out.tab
  10. 10

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

    STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
  11. 11

    Takes output from STAR rmRep.

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # Placeholder for STAR genome index (e.g., human GRCh38)
    GENOME_DIR="/path/to/STAR_genome_index/GRCh38"
    # Placeholder for input FASTQ files
    READ_FILE_1="input_read_1.fastq.gz"
    READ_FILE_2="input_read_2.fastq.gz" # Remove if single-end
    OUTPUT_PREFIX="aligned_reads_rmRep" # Output file prefix
    THREADS=8 # Number of threads
    
    # Run STAR alignment with filtering for multimapping reads (interpreted as 'rmRep' for removing reads mapping to repetitive regions)
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READ_FILE_1}" "${READ_FILE_2}" \
         --runThreadN "${THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 10 \
         --outFilterMatchNminOverLread 0.66 \
         --outFilterScoreMinOverLread 0.66
  12. 12

    Maps unique reads to the human genome.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables for input and reference
    # Replace with actual paths and filenames
    READ1="input_R1.fastq.gz" # Path to your forward reads (gzipped FASTQ)
    READ2="input_R2.fastq.gz" # Path to your reverse reads (gzipped FASTQ)
    GENOME_DIR="/path/to/STAR_index/GRCh38" # Path to the pre-built STAR genome index for human GRCh38
    OUTPUT_PREFIX="sample_aligned" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Create output directory if it doesn't exist
    mkdir -p "./star_output"
    
    # Run STAR alignment
    # --runThreadN: Number of threads
    # --genomeDir: Path to the STAR genome index
    # --readFilesIn: Input FASTQ files (paired-end assumed)
    # --readFilesCommand: Command to decompress gzipped files on the fly
    # --outFileNamePrefix: Prefix for all output files
    # --outSAMtype BAM SortedByCoordinate: Output sorted BAM file
    # --outFilterMultimapNmax 1: Only report uniquely mapping reads (essential for 'unique reads')
    # --outFilterMismatchNmax 3: Maximum number of mismatches allowed (adjust as needed)
    # --outFilterScoreMinOverLread 0.66: Minimum ratio of alignment score to read length
    # --outFilterMatchNminOverLread 0.66: Minimum ratio of aligned bases to read length
    # --limitBAMsortRAM: Limit RAM for BAM sorting (e.g., 30GB, adjust based on available memory)
    STAR \
      --runThreadN ${THREADS} \
      --genomeDir ${GENOME_DIR} \
      --readFilesIn ${READ1} ${READ2} \
      --readFilesCommand zcat \
      --outFileNamePrefix "./star_output/${OUTPUT_PREFIX}_" \
      --outSAMtype BAM SortedByCoordinate \
      --outFilterMultimapNmax 1 \
      --outFilterMismatchNmax 3 \
      --outFilterScoreMinOverLread 0.66 \
      --outFilterMatchNminOverLread 0.66 \
      --limitBAMsortRAM 30000000000 # 30GB
    
    # Note: The human genome reference (GRCh38) needs to be pre-indexed using STAR's --runMode genomeGenerate command.
  13. 13

    Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

    STAR vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Define variables for input files and reference genome
    # Replace with actual paths and reference genome index
    GENOME_DIR="/path/to/your/GRCh38_STAR_index" # Example: /path/to/STAR_indices/GRCh38
    READ1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
    READ2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" # This prefix will be used for auxiliary files like Log.final.out, SJ.out.tab
    
    # STAR alignment command
    STAR \
      --runMode alignReads \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --genomeLoad LoadAndRemove \
      --readFilesIn "${READ1}" "${READ2}" \
      --outSAMunmapped Within \
      --outFilterMultimapNmax 1 \
      --outFilterMultimapScoreRange 1 \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outSAMattributes All \
      --outStd BAM_Unsorted \
      --outSAMtype BAM Unsorted \
      --outFilterType BySJout \
      --outReadsUnmapped Fastx \
      --outFilterScoreMin 10 \
      --outSAMattrRGline ID:foo \
      --alignEndsType EndToEnd > "${OUTPUT_BAM}"
  14. 14

    takes output from STAR genome mapping.

    STAR v2.5.2b
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    # STAR genome index for human GRCh38 (hg38) is used as a placeholder.
    # This directory should contain genome files (e.g., Genome.fa) and STAR index files (e.g., SA, SAindex).
    STAR_INDEX_DIR="/path/to/STAR_index/hg38"
    
    # Placeholder for input FASTQ file. The eCLIP CWL workflow typically uses single-end reads for alignment.
    INPUT_FASTQ="input_reads.fastq.gz"
    
    # Prefix for output files (e.g., aligned_reads.Aligned.sortedByCoordinate.out.bam)
    OUTPUT_PREFIX="aligned_reads"
    
    # Number of threads to use for alignment
    NUM_THREADS=8 # Adjust based on available CPU cores
    
    # Limit on RAM for sorting BAM. 30GB is used in the eCLIP CWL workflow.
    MAX_BAM_SORT_RAM=30000000000 # 30 GB
    
    # Run STAR alignment with parameters optimized for eCLIP (derived from yeolab/eclip CWL)
    STAR --runThreadN ${NUM_THREADS} \
         --genomeDir ${STAR_INDEX_DIR} \
         --readFilesIn ${INPUT_FASTQ} \
         --outFileNamePrefix ${OUTPUT_PREFIX}. \
         --outSAMtype BAM SortedByCoordinate \
         --outSAMattributes All \
         --outFilterMultimapNmax 1 \
         --outFilterMismatchNmax 3 \
         --alignIntronMax 1 \
         --outFilterType BySJout \
         --outFilterScoreMinOverLread 0.66 \
         --outFilterMatchNminOverLread 0.66 \
         --limitBAMsortRAM ${MAX_BAM_SORT_RAM}
  15. 15

    Custom random-mer-aware script for PCR duplicate removal.

    UMI-tools (Inferred with models/gemini-2.5-flash) v>=1.0.0 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install UMI-tools if not already installed (e.g., using conda)
    # conda install -c bioconda umi-tools
    
    # Example: Extract UMIs from FASTQ headers and then deduplicate BAM file
    # This assumes UMIs are already in the read names (e.g., from a previous custom script like extract_umi_from_fastq.py in Yeo lab eCLIP pipeline)
    # If UMIs are in a separate read or barcode, the 'extract' command would be used first.
    
    # Define input and output file paths
    INPUT_BAM="input_aligned_reads.bam"
    OUTPUT_DEDUP_BAM="output_deduplicated_reads.bam"
    OUTPUT_DEDUP_STATS="output_deduplication_stats.tsv"
    
    # Sort the BAM file by coordinate and then by UMI for efficient deduplication
    # This step is crucial for umi_tools dedup to work correctly with 'directional' or 'unique' methods.
    # The Yeo lab eCLIP pipeline often sorts by coordinate and then by read name (which contains the UMI).
    
    # Sort by coordinate and then by queryname (read name, where UMI is expected to be)
    # samtools sort -n -o ${INPUT_BAM%.bam}.qname_sorted.bam ${INPUT_BAM}
    # samtools fixmate -m ${INPUT_BAM%.bam}.qname_sorted.bam ${INPUT_BAM%.bam}.fixmate.bam
    # samtools sort -o ${INPUT_BAM%.bam}.position_sorted.bam ${INPUT_BAM%.bam}.fixmate.bam
    # samtools index ${INPUT_BAM%.bam}.position_sorted.bam
    
    # For simplicity, assuming input BAM is already sorted by coordinate and UMI is in read name (e.g., using a custom script beforehand)
    # If the UMI is in the read name, umi_tools dedup can directly use it.
    # The 'directional' method is common for eCLIP to handle potential UMI errors and strand specificity.
    
    umi_tools dedup \
      --method=directional \
      --extract-umi-method=read_id \
      -I "${INPUT_BAM}" \
      -S "${OUTPUT_DEDUP_BAM}" \
      --output-stats="${OUTPUT_DEDUP_STATS}"
    
    # Index the deduplicated BAM file
    samtools index "${OUTPUT_DEDUP_BAM}"
    
  16. 16

    Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

    barcode_collapse_pe.py (Inferred with models/gemini-2.5-flash) vNot explicitly versioned, part of Yeo Lab eCLIP pipeline (Skipper)
    $ Bash example
    # Install Miniconda or Anaconda if not already installed
    # wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
    # bash miniconda.sh -b -p $HOME/miniconda
    # export PATH="$HOME/miniconda/bin:$PATH"
    # conda init bash
    # source ~/.bashrc
    
    # Clone the Skipper repository
    # git clone https://github.com/yeolab/skipper.git
    # cd skipper
    
    # Create and activate the conda environment
    # conda env create -f environment.yaml
    # conda activate skipper_env
    
    # Define input and output files
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
    METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"
    
    # Execute the barcode collapse command
    # Assuming the script is run from the 'skipper' directory or its path is added to PATH
    # If not, specify the full path to the script, e.g., python /path/to/skipper/scripts/barcode_collapse_pe.py
    python scripts/barcode_collapse_pe.py \
        --bam "${INPUT_BAM}" \
        --out_file "${OUTPUT_BAM}" \
        --metrics_file "${METRICS_FILE}"
  17. 17

    Takes output from barcode collapse PE.

    cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub
    $ Bash example
    # Install cutadapt if not already installed
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output files
    # INPUT_R1 and INPUT_R2 are the paired-end FASTQ files from the barcode collapse step.
    INPUT_R1="collapsed_R1.fastq.gz"
    INPUT_R2="collapsed_R2.fastq.gz"
    OUTPUT_R1="trimmed_R1.fastq.gz"
    OUTPUT_R2="trimmed_R2.fastq.gz"
    OUTPUT_UNTRIMMED_R1="untrimmed_R1.fastq.gz"
    OUTPUT_UNTRIMMED_R2="untrimmed_R2.fastq.gz"
    LOG_FILE="cutadapt.log"
    
    # Define adapters (Illumina universal and small RNA 3' adapters, common in eCLIP)
    # These adapters are typically found in eCLIP library preparations.
    ADAPTER_A="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Adapter for Read 1
    ADAPTER_B="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Adapter for Read 2
    
    # Run cutadapt for paired-end adapter trimming and quality filtering.
    # -a: 3' adapter for Read 1
    # -A: 3' adapter for Read 2
    # -o: Output file for trimmed Read 1
    # -p: Output file for trimmed Read 2
    # --untrimmed-output: Output file for Read 1 reads that were not trimmed
    # --untrimmed-paired-output: Output file for Read 2 reads that were not trimmed (paired with untrimmed Read 1)
    # --minimum-length: Discard reads shorter than 18 bp after trimming (common in eCLIP workflows)
    # --quality-cutoff: Trim low-quality bases from 3' end (e.g., Q6)
    # --cores: Number of CPU cores to use for parallel processing
    cutadapt \
      -a "${ADAPTER_A}" \
      -A "${ADAPTER_B}" \
      -o "${OUTPUT_R1}" \
      -p "${OUTPUT_R2}" \
      --untrimmed-output "${OUTPUT_UNTRIMMED_R1}" \
      --untrimmed-paired-output "${OUTPUT_UNTRIMMED_R2}" \
      --minimum-length 18 \
      --quality-cutoff 6 \
      --cores 8 \
      "${INPUT_R1}" "${INPUT_R2}" > "${LOG_FILE}" 2>&1
  18. 18

    Sorts resulting bam file for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # Sort the BAM file by coordinate
    samtools sort -o output.sorted.bam input.bam
  19. 19

    Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

    Picard vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install Picard (often bundled with GATK or available as a standalone JAR)
    # For example, using conda:
    # conda install -c bioconda picard
    # Or download the JAR directly:
    # wget https://github.com/broadinstitute/picard/releases/download/<VERSION>/picard.jar
    # export PICARD_JAR="/path/to/picard.jar"
    
    # Define variables
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
    OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    TMP_DIR="/full/path/to/files/.queue/tmp"
    # The original command uses Queue.jar, which implies an older GATK setup where Picard tools were run via GATK's Queue framework.
    # If using a modern standalone Picard.jar, the command would typically be 'java -jar $PICARD_JAR SortSam ...'
    # For this specific command, we'll assume Queue.jar is available and contains the necessary Picard classes.
    GATK_QUEUE_JAR="/path/to/gatk/dist/Queue.jar" # Placeholder for the actual path to Queue.jar
    
    # Create temporary directory if it doesn't exist
    mkdir -p "${TMP_DIR}"
    
    # Execute Picard SortSam
    java -Xmx2048m \
         -XX:+UseParallelOldGC \
         -XX:ParallelGCThreads=4 \
         -XX:GCTimeLimit=50 \
         -XX:GCHeapFreeLimit=10 \
         -Djava.io.tmpdir="${TMP_DIR}" \
         -cp "${GATK_QUEUE_JAR}" \
         net.sf.picard.sam.SortSam \
         INPUT="${INPUT_BAM}" \
         TMP_DIR="${TMP_DIR}" \
         OUTPUT="${OUTPUT_BAM}" \
         VALIDATION_STRINGENCY=SILENT \
         SO=coordinate \
         CREATE_INDEX=true
  20. 20

    Takes output from sortSam, makes bam index for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash)
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools=1.10
    
    # Assume 'sorted.bam' is the output from sortSam
    samtools index sorted.bam
  21. 21

    Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

    samtools v(Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools
    
    # Define input and output paths
    INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"
    
    # Create an index for the BAM file
    samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"
  22. 22

    Takes inputs from multiple final bam files.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not available
    # conda install -c bioconda samtools
    
    # Example: Merging multiple BAM files into a single output BAM file.
    # This is a common step to combine technical replicates or lanes for a single sample
    # before downstream analysis (e.g., peak calling, quantification).
    # Replace input1.bam, input2.bam, etc., with the actual paths to your input BAM files.
    # Replace merged_output.bam with the desired name for the combined BAM file.
    samtools merge merged_output.bam input1.bam input2.bam input3.bam
  23. 23

    Merges the two technical replicates for further downstream analysis.

    samtools (Inferred with models/gemini-2.5-flash) v1.10 GitHub
    $ Bash example
    # Install samtools (if not already installed)
    # conda install -c bioconda samtools=1.10
    
    # Merge technical replicates (BAM files)
    # Replace 'replicate1.bam' and 'replicate2.bam' with your actual input BAM files.
    # Replace 'merged_replicates.bam' with your desired output file name.
    samtools merge merged_replicates.bam replicate1.bam replicate2.bam
  24. 24

    Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

    samtools v1.10 GitHub
    $ Bash example
    # Install samtools (e.g., using conda)
    # conda install -c bioconda samtools=1.10
    
    samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
  25. 25

    Takes output from sortSam, makes bam index for use downstream.

    samtools (Inferred with models/gemini-2.5-flash) v1.10 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools=1.10
    
    # Assume input sorted BAM file is named 'sorted.bam'
    # The output will be 'sorted.bam.bai'
    samtools index sorted.bam
  26. 26

    Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

    samtools v1.10 GitHub
    $ Bash example
    # samtools is a widely used tool for manipulating SAM/BAM/CRAM files.
    # It can be installed via conda:
    # conda install -c bioconda samtools
    
    samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
  27. 27

    Takes output from sortSam.

    samtools (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools
    
    # This step takes a sorted BAM file (output from sortSam, typically samtools sort)
    # and creates an index (.bai) file, which is essential for many downstream tools
    # that require random access to alignments (e.g., visualization, variant calling).
    # Replace input.sorted.bam with your actual sorted BAM file.
    samtools index input.sorted.bam
  28. 28

    Only outputs the second read in each pair for use with single stranded peak caller.

    reformat.sh (Inferred with models/gemini-2.5-flash) vLatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install BBMap (BBTools) if not already installed
    # conda install -c bioconda bbmap
    
    # Define input and output files
    # Assuming input are gzipped paired-end FASTQ files
    INPUT_R1="input_read1.fastq.gz"
    INPUT_R2="input_read2.fastq.gz"
    OUTPUT_R2_ONLY="second_reads_only.fastq.gz"
    
    # Command to output only the second read in each pair
    # 'in1' and 'in2' specify the paired-end input files.
    # 'out' specifies the output file for the second reads.
    # 'out1=null' discards the first reads.
    reformat.sh in1="${INPUT_R1}" in2="${INPUT_R2}" out="${OUTPUT_R2_ONLY}" out1=null
  29. 29

    This is the final bam file to perform analysis on.

    samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools
    
    # Assume 'input.bam' is the raw or aligned BAM file that needs to be finalized for analysis.
    # A 'final' BAM file is typically sorted by coordinate and indexed for efficient access by downstream tools.
    INPUT_BAM="input.bam"
    OUTPUT_BAM="final_sorted.bam"
    
    # Sort the BAM file by coordinate
    samtools sort -o "${OUTPUT_BAM}" "${INPUT_BAM}"
    
    # Index the sorted BAM file
    samtools index "${OUTPUT_BAM}"
    
    echo "Final BAM file ready for analysis: ${OUTPUT_BAM}"
    echo "Index file: ${OUTPUT_BAM}.bai"
  30. 30

    Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

    samtools v1.10 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools=1.10
    
    # Extract reads that are the second in a pair (flag 128) from a merged BAM file.
    # -h: Include header in the output BAM file.
    # -b: Output in BAM format.
    # -f 128: Only output alignments with all bits in 128 set (i.e., the second read in a pair).
    samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
  31. 31

    Takes results from samtools view.

    samtools v1.19 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools
    
    # Input BAM file (e.g., alignment output from a previous step)
    INPUT_BAM="aligned_reads.bam"
    
    # Output BAM file containing only mapped reads
    OUTPUT_MAPPED_BAM="mapped_reads.bam"
    
    # Use samtools view to filter for mapped reads (flag -F 4 excludes unmapped reads).
    # -b: output in BAM format
    # -h: include header
    samtools view -b -h -F 4 "${INPUT_BAM}" > "${OUTPUT_MAPPED_BAM}"
  32. 32

    Calls peaks on those files.

    clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    bash
    # Install clipper (if not already installed, e.g., via pip or a Docker image)
    # pip install clipper
    
    # Define input and output variables
    IP_BAM="input_ip.bam" # Replace with your IP BAM file
    CONTROL_BAM="input_control.bam" # Replace with your control BAM file (e.g., SMInput or IgG)
    GENOME_SIZE="hs" # Use 'hs' for human, 'mm' for mouse, or specify a numeric value (e.g., 3.1e9 for hg38)
    OUTPUT_PREFIX="eclip_peaks"
    
    # Call peaks using clipper with common eCLIP parameters
    # -b: IP BAM file
    # -c: Control BAM file
    # -s: Genome size (e.g., 'hs' for human, 'mm' for mouse)
    # -o: Output prefix
    # --fdr: False Discovery Rate threshold (default 0.05)
    # --window_size: Window size for peak calling (default 100)
    # --min_peak_reads: Minimum number of reads in a peak (default 5)
    # --min_peak_length: Minimum peak length (default 10)
    # --max_peak_length: Maximum peak length (default 1000)
    # -u: Use unique reads only (recommended for eCLIP)
    # -d: Discard duplicate reads (recommended for eCLIP)
    
    clipper -b "${IP_BAM}" \
            -c "${CONTROL_BAM}" \
            -s "${GENOME_SIZE}" \
            -o "${OUTPUT_PREFIX}" \
            --fdr 0.05 \
            --window_size 100 \
            --min_peak_reads 5 \
            --min_peak_length 10 \
            --max_peak_length 1000 \
            -u -d
    
    echo "Peak calling complete. Output files are prefixed with ${OUTPUT_PREFIX}"
    
  33. 33

    Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

    CLIPper vLatest (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install CLIPper (example using conda and pip)
    # conda create -n clipper_env python=3.8
    # conda activate clipper_env
    # pip install clipper-tools
    
    # Define input and output files
    INPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam"
    OUTPUT_BED="/full/path/to/files/CombinedID.merged.r2.peaks.bed"
    GENOME_ASSEMBLY="hg19" # Reference genome assembly
    
    # Execute CLIPper command
    clipper -b "${INPUT_BAM}" -s "${GENOME_ASSEMBLY}" -o "${OUTPUT_BED}" --bonferroni --superlocal --threshold-method binomial --save-pickle

Tools Used

Raw Source Text
Library strategy: eCLIP-seq
Takes output from raw files.  Run to trim off both 5’ and 3’ adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT  AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3’ adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding
← Back to Analysis