GSE86035 Processing Pipeline

RIP-Seq code_examples 32 steps

Publication

SONAR Discovers RNA-Binding Proteins from Analysis of Large-Scale Protein-Protein Interactomes.

Molecular cell (2016) — PMID 27720645

Dataset

SONAR discovers RNA binding proteins from analysis of large-scale protein-protein interactomes.

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Takes output from raw files.

(Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# The description "Takes output from raw files." is too generic to infer a specific bioinformatics command, tool, or assay type.
# Please provide more context about the type of raw files (e.g., FASTQ, BAM) and the intended processing step (e.g., quality control, alignment, quantification, peak calling) to generate a relevant bash command.

Run to trim off both 5' and 3' adapters on both reads.

cutadapt (Inferred with models/gemini-2.5-flash) v2.10 GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=2.10

# Define input and output files (placeholders)
INPUT_R1="input_read1.fastq.gz"
INPUT_R2="input_read2.fastq.gz"
OUTPUT_R1="trimmed_read1.fastq.gz"
OUTPUT_R2="trimmed_read2.fastq.gz"

# Define adapter sequences from Yeo lab eCLIP workflow (https://github.com/yeolab/eclip/blob/master/eclip.cwl)
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_5PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Define trimming parameters (from Yeo lab eCLIP workflow defaults)
MIN_LENGTH=18 # Minimum length of reads to keep after trimming
QUALITY_CUTOFF=20 # Quality cutoff for base trimming

# Run cutadapt to trim 5' and 3' adapters from both reads
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -A "${ADAPTER_3PRIME}" \
  -g "${ADAPTER_5PRIME}" \
  -G "${ADAPTER_5PRIME}" \
  --minimum-length "${MIN_LENGTH}" \
  --quality-cutoff "${QUALITY_CUTOFF}" \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}"

Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

cutadapt (Inferred; the recorded command is truncated and begins at quality-cutoff) GitHub

$ Bash example

# The recorded command begins with "quality-cutoff 6", which matches the
# "--quality-cutoff 6" option of the round-2 cutadapt command in this pipeline.
# The program name and any leading options therefore appear to have been cut off.
# Assumption: the program is cutadapt. Only the program name and the leading "--"
# are restored here; any other leading options are not recoverable from the
# source and are left out.
#
# Install cutadapt (example using conda):
# conda install -c bioconda cutadapt

cutadapt --quality-cutoff 6 \
    -m 18 \
    -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
    -g CTTCCGATCTACAAGTT \
    -g CTTCCGATCTTGGTCCT \
    -A AACTTGTAGATCGGA \
    -A AGGACCAAGATCGGA \
    -A ACTTGTAGATCGGAA \
    -A GGACCAAGATCGGAA \
    -A CTTGTAGATCGGAAG \
    -A GACCAAGATCGGAAG \
    -A TTGTAGATCGGAAGA \
    -A ACCAAGATCGGAAGA \
    -A TGTAGATCGGAAGAG \
    -A CCAAGATCGGAAGAG \
    -A GTAGATCGGAAGAGC \
    -A CAAGATCGGAAGAGC \
    -A TAGATCGGAAGAGCG \
    -A AAGATCGGAAGAGCG \
    -A AGATCGGAAGAGCGT \
    -A GATCGGAAGAGCGTC \
    -A ATCGGAAGAGCGTCG \
    -A TCGGAAGAGCGTCGT \
    -A CGGAAGAGCGTCGTG \
    -A GGAAGAGCGTCGTGT \
    -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
    -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
    /full/path/to/files/file_R1.C01.fastq.gz \
    /full/path/to/files/file_R2.C01.fastq.gz \
    > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

Takes output from cutadapt round 1.

cutadapt v2.10 GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=2.10

# Define input and output files
# INPUT_FASTQ is the output from cutadapt round 1 (e.g., adapter-trimmed reads)
INPUT_FASTQ="round1_trimmed.fastq.gz"
OUTPUT_FASTQ="round2_trimmed.fastq.gz"

# Define parameters for poly-A trimming (a common second round in eCLIP workflows)
# This command is inferred from the yeolab/eclip CWL workflow's poly-A trimming step.
POLY_A_ADAPTER="A{100}" # Trims up to 100 'A's from the 3' end
MIN_LENGTH=18           # Minimum read length after trimming, common for eCLIP

# Execute cutadapt for poly-A trimming
cutadapt -a "${POLY_A_ADAPTER}" \
         -o "${OUTPUT_FASTQ}" \
         --minimum-length "${MIN_LENGTH}" \
         "${INPUT_FASTQ}"

Run to trim off the 3' adapters on read 2, to control for double ligation events.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=4.0

# Define input and output files
INPUT_READ2="read2.fastq.gz"
OUTPUT_TRIMMED_READ2="read2_trimmed.fastq.gz"

# Define the 3' adapter sequence for double ligation events (eCLIP specific)
# This adapter sequence is derived from the yeolab/skipper workflow config.yaml
# (AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT)
ADAPTER_R2_DOUBLE_LIGATION="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"

# Define minimum length for reads after trimming (from yeolab/skipper config)
MIN_LENGTH=18

# Define number of threads
THREADS=4

# Run cutadapt to trim the 3' adapter from read 2
# This step controls for double ligation events by removing the 3' adapter sequence
# that might have ligated to itself or another adapter on read 2.
cutadapt \
    -a "${ADAPTER_R2_DOUBLE_LIGATION}" \
    -o "${OUTPUT_TRIMMED_READ2}" \
    --minimum-length "${MIN_LENGTH}" \
    --cores "${THREADS}" \
    "${INPUT_READ2}"

Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
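
The `-A` adapters in the command above appear to be overlapping 15-nt windows that slide across two barcode-plus-adapter sequences. The sketch below regenerates them. The two full-length sequences are an assumption, reconstructed here by overlapping the listed 15-mers rather than taken from the source, and `sliding_kmers` is a hypothetical helper. The script only prints the arguments and does not run cutadapt.

```shell
# Print every k-nt window of a sequence, one per line.
# $1 = sequence, $2 = window length k
sliding_kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
    }'
}

# Assumed full-length sequences (reconstructed from the listed 15-mers).
SEQ_A="AACTTGTAGATCGGAAGAGCGTCGTGT"
SEQ_B="AGGACCAAGATCGGAAGAGCGTCGTGT"

# Build de-duplicated "-A <15-mer>" arguments (printed, not executed).
{ sliding_kmers "$SEQ_A" 15; sliding_kmers "$SEQ_B" 15; } |
    awk '!seen[$0]++ { printf "-A %s ", $0 } END { print "" }'
```

The printed set contains the same 20 adapters as the recorded command, though in a different order.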

cutadapt v4.0 GitHub
$ Bash example
```
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
```
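
The `-e 0.1` and `-O 5` options interact: as I understand cutadapt's documented rule, the number of allowed errors is the error rate multiplied by the length of the matched adapter portion, rounded down. The arithmetic can be sketched as follows; `allowed_errors` is a hypothetical helper, and the rate is passed as a percentage to stay within integer arithmetic.

```shell
# Allowed mismatches for a matched adapter length, assuming
# floor(error_rate * matched_length).
# $1 = matched length in nt, $2 = error rate in percent (10 for -e 0.1)
allowed_errors() {
    echo $(( $1 * $2 / 100 ))
}

for len in 5 9 10 15; do
    echo "matched_length=${len} allowed_errors=$(allowed_errors "$len" 10)"
done
```

Under this rule a full 15-nt adapter match tolerates one mismatch, while a match at the minimum overlap of 5 nt must be exact.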

Takes output from cutadapt round 2.

cutadapt v3.4 GitHub

$ Bash example

# Installation (example using conda):
# conda install -c bioconda cutadapt=3.4

# Define input and output files
# INPUT_FASTQ is the output of cutadapt round 2, per the step description above.
INPUT_FASTQ="input_from_cutadapt_round2.fastq.gz"
OUTPUT_FASTQ="output_extra_trimmed.fastq.gz"

# Define parameters for an additional, illustrative trimming pass (poly-A trimming and quality filtering).
# Assumption: the source records no trimming command after round 2, so this pass is hypothetical.
# ADAPTER_ROUND2: Common for poly-A trimming (e.g., A{100} for 100 A's)
ADAPTER_ROUND2="A{100}"
MIN_LENGTH=18 # Minimum read length after trimming
QUALITY_CUTOFF=6 # Quality cutoff for 3' end trimming (Phred score)
NEXTSEQ_TRIM=20 # Trim low-quality bases from 3' end of NextSeq reads (Phred score)
NUM_CORES=8 # Number of CPU cores to use

cutadapt \
  -a "${ADAPTER_ROUND2}" \
  --minimum-length "${MIN_LENGTH}" \
  --quality-cutoff "${QUALITY_CUTOFF}" \
  --nextseq-trim "${NEXTSEQ_TRIM}" \
  --cores "${NUM_CORES}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"

Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA and other repetitive reads.

bowtie (Inferred with models/gemini-2.5-flash) v1.x (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Define variables
# Path to the Bowtie index for human rRNA and repetitive elements (e.g., built from hg38 rRNA and RepBase sequences).
# This index should be pre-built using `bowtie-build` from a FASTA file containing all relevant repetitive sequences.
# Example FASTA sources: RepBase (https://www.girinst.org/repbase/) and NCBI RefSeq for rRNA.
BOWTIE_INDEX_PREFIX="/path/to/human_rRNA_RepBase_hg38_index"

# Input FASTQ file (gzipped)
INPUT_FASTQ_GZ="input_reads.fastq.gz"

# Output FASTQ file for reads that did NOT align to repetitive elements (non-repetitive reads), initially uncompressed
OUTPUT_NON_REPETITIVE_FASTQ="output_non_repetitive_reads.fastq"

# Output SAM file containing reads that DID align to repetitive elements (can be discarded or used for QC)
OUTPUT_REPETITIVE_SAM="output_repetitive_reads.sam"

# Number of threads to use
THREADS=8

# Filter repetitive reads using Bowtie
# -q: input reads are FASTQ format
# -S: output alignments in SAM format
# --un: write reads that do not align to the specified file
# -p: number of threads
# `zcat` is used to decompress the gzipped input FASTQ and pipe it to bowtie.
# The `-` after `BOWTIE_INDEX_PREFIX` tells bowtie to read input from stdin.
zcat "${INPUT_FASTQ_GZ}" | bowtie -q -S --un "${OUTPUT_NON_REPETITIVE_FASTQ}" -p "${THREADS}" "${BOWTIE_INDEX_PREFIX}" - > "${OUTPUT_REPETITIVE_SAM}"

# Gzip the non-repetitive reads file for storage
gzip "${OUTPUT_NON_REPETITIVE_FASTQ}"

Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

STAR vUnknown (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
```
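
With `--outReadsUnmapped Fastx`, STAR writes unmapped reads to `Unmapped.out.mate1` and `Unmapped.out.mate2`, appended directly to the `--outFileNamePrefix` string. Because the prefix here ends in `.rep.bam`, the genome-mapping step in this pipeline reads files named `...rep.bamUnmapped.out.mate1` and `...mate2`. A small sketch of that naming, where `star_unmapped_mates` is a hypothetical helper:

```shell
# Given the --outFileNamePrefix of the repeat-mapping pass, print the
# unmapped-mate files that the next mapping pass takes as input.
star_unmapped_mates() {
    prefix="$1"
    printf '%s\n' "${prefix}Unmapped.out.mate1" "${prefix}Unmapped.out.mate2"
}

REP_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
star_unmapped_mates "$REP_PREFIX"
```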

Takes output from STAR rmRep.

STAR vNot explicitly versioned (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install CLIPper (if not already installed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install # or ensure clipper.py is in PATH or accessible

# Placeholder for input BAM file (e.g., output of a 'STAR rmRep' stage after deduplication)
INPUT_BAM="input.dedup.bam"

# Species/genome build passed to CLIPper's -s option.
# The recorded CLIPper command in this pipeline uses "-s hg19", a build name rather than a chromosome sizes file.
SPECIES="hg19"

# Output file for CLIPper peaks
OUTPUT_PEAKS="output_peaks.bed"

# Run CLIPper
# Assuming the clipper executable is in the PATH (e.g., if installed via setup.py)
clipper -b "${INPUT_BAM}" -s "${SPECIES}" -o "${OUTPUT_PEAKS}"

Maps unique reads to the human genome.

BWA (Inferred with models/gemini-2.5-flash) v0.7.17 GitHub

$ Bash example

# Install BWA (if not already installed)
# conda install -c bioconda bwa samtools

# Define variables
REF_GENOME="/path/to/human_genome_hg38.fa" # Placeholder for human genome GRCh38 (e.g., from GATK resource bundle or UCSC)
READ1="/path/to/sample_R1.fastq.gz"
READ2="/path/to/sample_R2.fastq.gz"
OUTPUT_BAM="aligned_reads.bam"
THREADS=8 # Number of CPU threads to use

# Index the reference genome (run once per reference)
# bwa index "${REF_GENOME}"

# Align paired-end reads to the human genome using BWA-MEM
# -M: Mark shorter split hits as secondary (for Picard compatibility)
# -t: Number of threads
# -R: Read group header. Replace with actual sample ID, library, platform, etc.
bwa mem -M -t "${THREADS}" \
    -R "@RG\tID:sample_id\tSM:sample_name\tPL:ILLUMINA\tLB:library_name" \
    "${REF_GENOME}" "${READ1}" "${READ2}" | \
    samtools view -bS - | \
    samtools sort -o "${OUTPUT_BAM}" -

# Index the sorted BAM file
samtools index "${OUTPUT_BAM}"

Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

STAR vUnknown (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# STAR is a splice-aware aligner for RNA-seq reads.
# It requires a pre-built genome index. Replace `/path/to/STAR_database_file` with the actual path to your STAR genome index.
# Replace `/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1` and `...mate2` with your actual input FASTQ files.
# Replace `/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam` with your desired output BAM file path.

# Example installation using Conda (uncomment to use):
# conda create -n star_env star -y
# conda activate star_env

STAR --runMode alignReads \
  --runThreadN 16 \
  --genomeDir /path/to/STAR_database_file \
  --genomeLoad LoadAndRemove \
  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
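
With `--outFilterMultimapNmax 1`, only reads with a single alignment should be reported as mapped. One way to sanity-check this is to count records whose `NH` tag equals 1. The sketch below does that on SAM text; the toy records are made up for illustration, and `count_unique_hits` and `toy_sam` are hypothetical helpers. In practice the input would come from `samtools view` on the aligned BAM.

```shell
# Count SAM records on stdin that carry NH:i:1 (one reported alignment).
count_unique_hits() {
    awk -F '\t' '!/^@/ {
        for (i = 12; i <= NF; i++) if ($i == "NH:i:1") { n++; break }
    } END { print n + 0 }'
}

# Toy SAM records (not real data): two unique hits and one multimapper.
toy_sam() {
    printf 'r1\t99\tchr1\t100\t255\t10M\t=\t200\t110\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\n'
    printf 'r2\t99\tchr2\t500\t1\t10M\t=\t600\t110\tACGTACGTAC\tIIIIIIIIII\tNH:i:3\n'
    printf 'r3\t147\tchr1\t200\t255\t10M\t=\t100\t-110\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\n'
}

toy_sam | count_unique_hits
```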

Takes output from STAR genome mapping.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# GENOME_DIR should point to the directory containing STAR genome index files (e.g., from hg38/GRCh38)
GENOME_DIR="/path/to/STAR_genome_index/GRCh38" 
READ1="sample_R1.fastq.gz" # Placeholder for input read 1 (gzipped FASTQ)
READ2="sample_R2.fastq.gz" # Placeholder for input read 2 (gzipped FASTQ, remove if single-end)
OUTPUT_PREFIX="sample_STAR_aligned_" # Prefix for output files
THREADS=8 # Number of threads to use

# Run STAR genome mapping
# Parameters are chosen to be suitable for RNA-based assays like eCLIP, 
# including splice-aware alignment and filtering for uniquely mapping reads.
STAR --runThreadN ${THREADS} \
     --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX} \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --sjdbScore 1 \
     --outFilterType BySJout \
     --outFilterScoreMinOverLread 0.3 \
     --outFilterMatchNminOverLread 0.3 \
     --limitBAMsortRAM 30000000000 # Adjust based on available RAM (e.g., 30GB)

Custom random-mer-aware script for PCR duplicate removal.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools and samtools
# conda install -c bioconda umi_tools samtools

# Define input and output files
INPUT_BAM="input.bam"
OUTPUT_DEDUP_BAM="output.dedup.bam"

# Note: `umi_tools extract` works on FASTQ files before alignment, not on BAM files.
# This example therefore assumes the random-mer (UMI) was already moved into the read
# name upstream, as the last "_"-separated field (the umi_tools default), e.g. with:
# umi_tools extract --bc-pattern=NNNNNN -I in_R1.fastq.gz -S out_R1.fastq.gz \
#     --read2-in=in_R2.fastq.gz --read2-out=out_R2.fastq.gz
# IMPORTANT: Adjust the number of N's to your assay's random-mer design.

# Step 1: Coordinate-sort and index the BAM file (required by umi_tools dedup)
samtools sort -o "input.position_sorted.bam" "${INPUT_BAM}"
samtools index "input.position_sorted.bam"

# Step 2: Run umi_tools dedup to remove PCR duplicates
# --paired: treat reads as pairs so that mates are kept together
# --method directional: network-based grouping of similar UMIs (the umi_tools default)
umi_tools dedup \
    -I "input.position_sorted.bam" \
    -S "${OUTPUT_DEDUP_BAM}" \
    --paired \
    --method directional \
    --log "dedup.log" \
    --output-stats "dedup_stats"

# Clean up intermediate files (optional)
rm input.position_sorted.bam input.position_sorted.bam.bai

Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

barcode_collapse_pe.py (part of Yeo Lab eCLIP pipeline) (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# The barcode_collapse_pe.py script is part of the Yeo Lab eCLIP pipeline.
# It is recommended to clone the repository and set up a conda environment.
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Create and activate a conda environment (example, adjust Python version as needed):
# conda create -n eclip_env python=3.8
# conda activate eclip_env

# Install necessary dependencies. The full eCLIP pipeline has many dependencies,
# but for barcode_collapse_pe.py, pysam is a key one.
# conda install -c bioconda pysam

# Ensure the script is executable and in your PATH, or provide its full path.
# For example, if cloned to /path/to/eclip, you might call it as:
# python /path/to/eclip/scripts/barcode_collapse_pe.py \
#   --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
#   --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
#   --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

# Execute the command as provided:
barcode_collapse_pe.py \
  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
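
The idea behind random-mer-aware duplicate removal can be sketched on a toy table: two reads count as PCR duplicates only if they share both the mapping position and the random-mer. This is an illustration of the concept under that assumption, not the implementation of `barcode_collapse_pe.py`; the input layout and the helper names are hypothetical.

```shell
# Keep the first record for each (chrom, start, strand, random-mer) key.
# Input columns (tab-separated): chrom, start, strand, random-mer, read name.
collapse_by_randomer() {
    awk -F '\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}

# Toy records (made up for illustration).
toy_reads() {
    printf 'chr1\t100\t+\tACGTA\tread1\n'
    printf 'chr1\t100\t+\tACGTA\tread2\n'   # same position and random-mer: duplicate
    printf 'chr1\t100\t+\tTTGCA\tread3\n'   # same position, different random-mer: kept
    printf 'chr1\t250\t-\tACGTA\tread4\n'   # different position: kept
}

toy_reads | collapse_by_randomer
```

Without the random-mer in the key, `read3` would be discarded along with `read2`, even though it likely came from a distinct molecule.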

Takes output from barcode collapse PE.

STAR (Inferred with models/gemini-2.5-flash) v2.7.9a GitHub

$ Bash example

# Install STAR if not already available
# conda install -c bioconda star

# Define input and output files
INPUT_FASTQ="collapsed_reads.fastq.gz" # Placeholder for output from barcode collapse PE
OUTPUT_PREFIX="aligned_reads"
GENOME_DIR="/path/to/STAR_index/hg38" # Placeholder for STAR genome index (e.g., hg38)

# Run STAR alignment
# --readFilesCommand zcat: needed because the input FASTQ is gzipped
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${INPUT_FASTQ}" \
  --readFilesCommand zcat \
  --runThreadN 8 \
  --outFileNamePrefix "${OUTPUT_PREFIX}_" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard

Sorts resulting bam file for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.15.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.15.1

# Sort the BAM file by coordinate
# Replace input.bam with your actual input BAM file
# Replace output.sorted.bam with your desired output sorted BAM file name
# Adjust -@ parameter based on available CPU cores
samtools sort -@ 4 -o output.sorted.bam input.bam

Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Picard vPre-2.0 GitHub

$ Bash example

# Install Picard (example using conda)
# conda install -c bioconda picard

# Or download the jar directly from Broad Institute's GitHub releases
# wget https://github.com/broadinstitute/picard/releases/download/<version>/picard.jar
# export PICARD_JAR="/path/to/picard.jar"

# The command provided uses a specific path to Queue.jar which likely bundles or provides access to Picard tools.
# This setup is characteristic of older GATK/Picard integrations (e.g., GATK 3.x).
java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
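
`SO=coordinate` asks SortSam for coordinate order, which is recorded in the `SO:` field of the `@HD` header line. A sketch that reads that field from header text follows; the toy header is made up, and `sort_order` is a hypothetical helper. In practice the header would come from `samtools view -H` on the sorted BAM.

```shell
# Read SAM header text on stdin and print the sort order from the @HD line,
# or "unknown" when no SO: field is present.
sort_order() {
    awk -F '\t' '$1 == "@HD" {
        for (i = 2; i <= NF; i++) if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
    } END { if (!found) print "unknown" }'
}

# Toy header for illustration.
printf '@HD\tVN:1.4\tSO:coordinate\n@SQ\tSN:chr1\tLN:249250621\n' | sort_order
```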

Takes output from sortSam, makes bam index for use downstream.

samtools index (Inferred with models/gemini-2.5-flash) v1.19 GitHub
$ Bash example
```
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam
```

Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

samtools v1.10 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.10

# Define input and output paths
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"

# Execute samtools index
samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"
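
Up to this step, the recorded commands name every intermediate file by appending a suffix to the read 1 FASTQ path. The sketch below prints that chain for one sample so the steps can be wired together consistently; `pipeline_outputs` is a hypothetical helper, and the suffixes are copied from the recorded commands.

```shell
# Print the intermediate file names derived from one R1 FASTQ path,
# in pipeline order (trim 1, trim 2, repeat map, genome map, dedup, sort, index).
pipeline_outputs() {
    base="$1.adapterTrim"
    printf '%s\n' \
        "${base}.fastq.gz" \
        "${base}.round2.fastq.gz" \
        "${base}.round2.rep.bam" \
        "${base}.round2.rmRep.bam" \
        "${base}.round2.rmRep.rmDup.bam" \
        "${base}.round2.rmRep.rmDup.sorted.bam" \
        "${base}.round2.rmRep.rmDup.sorted.bam.bai"
}

pipeline_outputs "/full/path/to/files/file_R1.C01.fastq.gz"
```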

Takes inputs from multiple final bam files.

samtools v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Example: Merge multiple final BAM files into a single BAM file.
# This step is often used to combine technical replicates or prepare files for downstream analysis.
# Parameters:
#   -o <output.bam>: Specify the output merged BAM file.
#   <input1.bam> <input2.bam> ...: List of input BAM files to be merged.

# Placeholder for input BAM files
# Replace with actual file paths for your multiple final BAM files
INPUT_BAM_FILES="input_replicate_1.bam input_replicate_2.bam input_replicate_3.bam"
OUTPUT_MERGED_BAM="merged_sample.bam"

samtools merge -o "${OUTPUT_MERGED_BAM}" ${INPUT_BAM_FILES}

# Index the merged BAM file (optional, but good practice for downstream tools)
samtools index "${OUTPUT_MERGED_BAM}"

Merges the two technical replicates for further downstream analysis.

samtools merge (Inferred with models/gemini-2.5-flash) v1.15.1 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge technical replicates (e.g., BAM files)
# Assuming input BAM files are replicate1.bam and replicate2.bam
# And output will be merged_replicates.bam
samtools merge merged_replicates.bam replicate1.bam replicate2.bam

Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools
# or
# sudo apt-get update && sudo apt-get install samtools

# Execute the samtools merge command
samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
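
When more than one sample needs merging, the command can be generated from the replicate IDs instead of typed by hand. The dry-run sketch below only prints the commands; `merge_commands` is a hypothetical helper, and the file-name pattern is an assumption taken from the recorded command (replicates `C01` and `D08`).

```shell
# Print (not run) the merge and index commands for a set of replicate IDs.
# $1 = directory, $2 = combined sample ID, remaining args = replicate IDs
merge_commands() {
    dir="$1"; out_id="$2"; shift 2
    inputs=""
    for rep in "$@"; do
        inputs="${inputs} ${dir}/file_R1.${rep}.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
    done
    echo "samtools merge ${dir}/${out_id}.merged.bam${inputs}"
    echo "samtools index ${dir}/${out_id}.merged.bam"
}

merge_commands /full/path/to/files CombinedID C01 D08
```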

Takes output from sortSam, makes bam index for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Assume 'sorted.bam' is the output from sortSam
# The 'samtools index' command creates an index file (.bai) for the input BAM file.
samtools index sorted.bam

Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Execute samtools index command
samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

Takes output from sortSam.

samtools (Inferred with models/gemini-2.5-flash) v1.19.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.19.1

# Define input and output file names
# INPUT_BAM is the sorted BAM file, which is the output from sortSam
INPUT_BAM="input_sorted.bam"

# Index the sorted BAM file for efficient access by downstream tools
samtools index "${INPUT_BAM}"

Only outputs the second read in each pair for use with single stranded peak caller.

samtools (Inferred with models/gemini-2.5-flash) v1.15.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# This command extracts only the second read in each pair from a paired-end BAM file
# and outputs them to a FASTQ file. This is useful for single-stranded peak callers
# that only require reads from one strand (e.g., the second read in a pair).
#
# -f 0x80: Selects reads that are the second in a pair (SAM flag 0x80).
# -N: Always append /1 and /2 to the read names.
# -s /dev/null: Discard singleton reads (reads whose mate was not found).
# -1 /dev/null: Discard the first read in a pair (output to /dev/null).
# -2 output_R2.fastq: Output the second read in a pair to 'output_R2.fastq'.
# input.bam: The input paired-end BAM file.
samtools fastq -f 0x80 -N -s /dev/null -1 /dev/null -2 output_R2.fastq input.bam

This is the final bam file to perform analysis on.

samtools (Inferred with models/gemini-2.5-flash) v1.9 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.9

# This command ensures the final BAM file is indexed, which is crucial for many downstream analysis tools.
# Replace 'sample.final.bam' with the actual path to your final BAM file.
samtools index sample.final.bam

Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

samtools v1.19 GitHub

$ Bash example

# Install samtools (e.g., via conda)
# conda install -c bioconda samtools=1.19

# Define input and output paths
INPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
OUTPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam"

# Filter BAM file to extract reads that are the second in a pair
samtools view -hb -f 128 "${INPUT_BAM}" > "${OUTPUT_BAM}"
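
The `-f 128` filter keeps only records whose SAM FLAG has bit 0x80 ("second in pair") set. The bit test can be sketched in plain shell arithmetic; `is_second_in_pair` is a hypothetical helper, and the flag values are common paired-end examples.

```shell
# Succeed if the SAM FLAG ($1) has bit 0x80 (128, second in pair) set.
is_second_in_pair() {
    [ $(( $1 & 128 )) -ne 0 ]
}

for flag in 99 147 83 163; do
    if is_second_in_pair "$flag"; then
        echo "flag ${flag}: read 2 (kept by -f 128)"
    else
        echo "flag ${flag}: not read 2 (dropped)"
    fi
done
```

Flags 147 and 163 include the 128 bit and are kept; 99 and 83 carry the 64 bit (first in pair) instead and are dropped.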

Takes results from samtools view.

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Example: Filter mapped reads from a BAM file and output to a new BAM file
# This command takes an input BAM file (input.bam)
# and outputs a new BAM file (output_mapped.bam) containing only mapped reads.
# -b: Output in BAM format
# -F 4: Exclude reads where the FLAG indicates the read is unmapped (0x4)
samtools view -b -F 4 input.bam > output_mapped.bam

Calls peaks on those files.

clipper (Inferred with models/gemini-2.5-flash) vlatest GitHub

$ Bash example

# Clone the clipper repository if not already available
# git clone https://github.com/yeolab/clipper.git
# cd clipper

# Ensure Python and necessary libraries (e.g., pysam, numpy) are installed
# conda install -c bioconda pysam numpy

# Define input and output files (placeholders)
IP_BAM="ip_sample.bam" # Placeholder: Path to the IP sample's aligned BAM file
OUTPUT_BED="peaks.bed" # Placeholder: Desired output BED file name for peaks
SPECIES="hg19" # Value passed to -s; hg19 is the genome build recorded for this dataset

# Execute clipper to call peaks
# Assumes the `clipper` entry point is on your PATH after installation.
# No control BAM is passed: the command recorded for this dataset uses only -b, -s, and -o
# plus thresholding options, so a control-sample option is not assumed here.
clipper -b "${IP_BAM}" -s "${SPECIES}" -o "${OUTPUT_BED}"


Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

CLIPper vNot specified GitHub

$ Bash example

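# Install CLIPper (example, adjust as needed; package availability below is unverified)
# conda install -c bioconda clipper
# or, from the source repository:
# git clone https://github.com/yeolab/clipper.git && cd clipper && pip install .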

# Run CLIPper for peak calling
clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
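
After peak calling, a quick summary of the output BED file can help confirm that the run produced plausible clusters. The sketch below counts peaks, reports the mean peak width, and tallies strands. It assumes a tab-separated BED layout with start and end in columns 2 and 3 and strand in column 6; the helper name `summarize_peaks` and the demo records are hypothetical and not part of the original pipeline.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the original pipeline): summarize a BED file.
# Assumes tab-separated BED with start/end in columns 2-3 and strand in column 6.
summarize_peaks() {
  LC_ALL=C awk -F'\t' '
    /^#/ || /^track/ || /^browser/ { next }   # skip comment and header lines
    NF >= 3 {
      n++
      total += $3 - $2
      if (NF >= 6) strand[$6]++
    }
    END {
      mean = (n > 0) ? total / n : 0
      printf "peaks=%d mean_width=%.1f plus=%d minus=%d\n", n, mean, strand["+"], strand["-"]
    }' "$1"
}

# Demo on a small made-up BED file (values are illustrative only).
demo_bed=$(mktemp)
printf 'chr1\t100\t150\tpeak1\t0.001\t+\n'  > "${demo_bed}"
printf 'chr1\t300\t380\tpeak2\t0.002\t-\n' >> "${demo_bed}"
printf 'chr2\t50\t70\tpeak3\t0.010\t+\n'   >> "${demo_bed}"
summarize_peaks "${demo_bed}"
rm -f "${demo_bed}"
```

The demo prints `peaks=3 mean_width=50.0 plus=2 minus=1`. To apply it to the real output, pass the path given to `-o` in the clipper command.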


Tools Used

STAR

Raw Source Text

Takes output from raw files.  Run to trim off both 5' and 3' adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
Takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding
