GSE77629 Processing Pipeline
Publication
Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP). Nature Methods (2016). PMID 27018577
Dataset
GSE77629: Enhanced CLIP (eCLIP) enables robust and scalable transcriptome-wide discovery and characterization of RNA binding protein binding sites [eCLIP - 293…
Processing Steps
1
Takes output from raw files.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output file names (placeholders)
INPUT_R1="sample_R1.fastq.gz"
INPUT_R2="sample_R2.fastq.gz"
OUTPUT_R1_TRIMMED="sample_R1_trimmed.fastq.gz"
OUTPUT_R2_TRIMMED="sample_R2_trimmed.fastq.gz"

# Define common Illumina adapters (adjust if different adapters were used)
# Forward adapter sequence
ADAPTER_FWD="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# Reverse adapter sequence
ADAPTER_REV="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Run cutadapt for adapter trimming and quality filtering
# -a: 3' adapter for read 1
# -A: 3' adapter for read 2
# -o: Output file for read 1
# -p: Output file for read 2
# -m: Minimum read length after trimming
# -q: Trim low-quality ends from reads (Phred score threshold)
cutadapt -a "${ADAPTER_FWD}" \
  -A "${ADAPTER_REV}" \
  -o "${OUTPUT_R1_TRIMMED}" \
  -p "${OUTPUT_R2_TRIMMED}" \
  -m 20 -q 20 \
  "${INPUT_R1}" "${INPUT_R2}"
-
2
Run to trim off both 5' and 3' adapters on both reads.
cutadapt v4.0 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
READ1_IN="input_R1.fastq.gz"
READ2_IN="input_R2.fastq.gz"
READ1_OUT="trimmed_R1.fastq.gz"
READ2_OUT="trimmed_R2.fastq.gz"

# Define 3' adapter sequences (Illumina universal adapters as placeholders)
# These are searched for at the 3' end of the reads.
ADAPTER_3PRIME_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # For Read 1
ADAPTER_3PRIME_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"  # For Read 2 (a different sequence, not the reverse complement of the Read 1 adapter)

# Define 5' adapter sequences (placeholders, replace with actual 5' adapters if known)
# These are searched for at the 5' end of the reads.
ADAPTER_5PRIME_R1="GCTCTTCCGATCT"  # Example 5' adapter sequence for Read 1
ADAPTER_5PRIME_R2="GCTCTTCCGATCT"  # Example 5' adapter sequence for Read 2

# Number of CPU threads to use
NUM_THREADS=$(nproc)

# Run cutadapt to trim 5' and 3' adapters from both paired-end reads
cutadapt \
  -j "${NUM_THREADS}" \
  -a "${ADAPTER_3PRIME_R1}" \
  -A "${ADAPTER_3PRIME_R2}" \
  -g "${ADAPTER_5PRIME_R1}" \
  -G "${ADAPTER_5PRIME_R2}" \
  -o "${READ1_OUT}" \
  -p "${READ2_OUT}" \
  "${READ1_IN}" \
  "${READ2_IN}"
-
3
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
$ Bash example
# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt

# Define input and output paths
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics"

# Round 1 trimming with cutadapt. The documented command is truncated before
# "quality-cutoff", so any options that preceded it are not reproduced here.
cutadapt \
  --quality-cutoff 6 \
  -m 18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}" > "${METRICS_FILE}"
-
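The twenty `-A` sequences in the command above look like 15-nt windows tiled across two longer 27-nt sequences. The source does not say this, so the two full-length sequences below are an assumption reconstructed by overlapping the listed 15-mers, and `tile_kmers` is a hypothetical helper. The sketch regenerates the `-A` arguments so the list does not have to be typed by hand.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: print every k-nt window of a sequence, one per line.
tile_kmers() {
    local seq=$1 k=$2 i
    for (( i = 0; i + k <= ${#seq}; i++ )); do
        printf '%s\n' "${seq:i:k}"
    done
}

# Assumed full-length sequences, reconstructed by overlapping the documented
# -A 15-mers; the source never states them explicitly.
SEQ_1="AACTTGTAGATCGGAAGAGCGTCGTGT"
SEQ_2="AGGACCAAGATCGGAAGAGCGTCGTGT"

# Collect the unique 15-mers from both sequences as repeated -A arguments.
ADAPTER_ARGS=()
while read -r kmer; do
    ADAPTER_ARGS+=( -A "$kmer" )
done < <( { tile_kmers "$SEQ_1" 15; tile_kmers "$SEQ_2" 15; } | sort -u )

echo "unique 15-mers: $(( ${#ADAPTER_ARGS[@]} / 2 ))"
printf '%s\n' "${ADAPTER_ARGS[*]}"
```

The array can then be expanded inside a cutadapt call as `"${ADAPTER_ARGS[@]}"`. The windows come out in sorted order rather than the order used in the documented command.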
4
Takes output from cutadapt round 1.
$ Bash example
# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.18

# Define input and output files
INPUT_FASTQ="round1_trimmed.fastq.gz"
OUTPUT_FASTQ="round2_trimmed.fastq.gz"

# Define common eCLIP 3' adapter sequence (Illumina TruSeq or similar)
# This adapter is commonly used in eCLIP workflows for 3' end trimming.
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"

# Define trimming parameters
MIN_LENGTH=18         # Minimum read length after trimming, common for eCLIP
QUALITY_THRESHOLD=20  # Quality threshold for 3' end trimming, common for eCLIP

# Execute cutadapt for round 2 trimming
# This command assumes further trimming of the 3' adapter and quality filtering
# after an initial trimming round.
cutadapt -a "${ADAPTER_3PRIME}" \
  -m "${MIN_LENGTH}" \
  -q "${QUALITY_THRESHOLD}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"
-
5
Run to trim off the 3' adapters on read 2, to control for double ligation events.
cutadapt v3.4 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=3.4

# Define input and output file paths (placeholders)
READ1_IN="input_R1.fastq.gz"
READ2_IN="input_R2.fastq.gz"
READ1_OUT="trimmed_R1.fastq.gz"
READ2_OUT="trimmed_R2.fastq.gz"

# Define the 3' adapter sequence for Read 2 (common eCLIP adapter, inferred from Yeo lab pipelines)
ADAPTER_R2="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Define minimum length and quality trim threshold (inferred from Yeo lab pipelines)
MIN_LEN=18
NEXTSEQ_TRIM_QUAL=20

# Run cutadapt to trim 3' adapters from Read 2
# -A (uppercase) applies the adapter to Read 2; lowercase -a would apply it to Read 1
cutadapt \
  -A "${ADAPTER_R2}" \
  -o "${READ1_OUT}" \
  -p "${READ2_OUT}" \
  --minimum-length "${MIN_LEN}" \
  --nextseq-trim "${NEXTSEQ_TRIM_QUAL}" \
  "${READ1_IN}" \
  "${READ2_IN}"
-
6
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
$ Bash example
# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt

# Define input and output paths
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics"

# Execute cutadapt command
cutadapt -f fastq \
  --match-read-wildcards \
  --times 1 \
  -e 0.1 \
  -O 5 \
  --quality-cutoff 6 \
  -m 18 \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}" > "${METRICS_FILE}"
-
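To get a feel for what `-e 0.1` and `-O 5` mean in the round 2 command, the sketch below computes how many mismatches would be tolerated for a given overlap length. It assumes cutadapt allows floor(error rate × matched adapter length) errors, which is how its documentation describes the rule as far as I can tell; `allowed_errors` is a hypothetical helper, not part of cutadapt.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: floor(rate_percent / 100 * overlap) in integer arithmetic.
allowed_errors() {
    local overlap=$1 rate_percent=$2
    echo $(( overlap * rate_percent / 100 ))
}

RATE_PERCENT=10   # -e 0.1 expressed as a percentage

# 5 nt is the minimum overlap (-O 5); 15 nt is a full-length match to one -A adapter.
for overlap in 5 9 10 15; do
    echo "overlap ${overlap} nt -> $(allowed_errors "$overlap" "$RATE_PERCENT") error(s) allowed"
done
```

Under that assumption, a match shorter than 10 nt must be exact, and a full 15-nt match tolerates a single error.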
7
Takes output from cutadapt round 2.
$ Bash example
# Install cutadapt if not already available
# conda install -c bioconda cutadapt

# Define input and output files
# INPUT_FASTQ is the output from the previous cutadapt round (round 2 in the description's context)
INPUT_FASTQ="sample_R1_trimmed_3prime.fastq.gz"
OUTPUT_FASTQ="sample_R1_trimmed_5prime.fastq.gz"
REPORT_FILE="sample_R1_trimmed_5prime.cutadapt.log"

# Define parameters for 5' adapter trimming and quality filtering
# The run of Ns below is an assumed placeholder and is not taken from the
# documented commands; replace it with the actual 5' adapter sequence.
# Using -g for 5' adapter trimming.
FIVE_PRIME_ADAPTER="NNNNNNNNNNNN"
ERROR_RATE=0.1
MIN_LENGTH=18
QUALITY_CUTOFF=20
NUM_CORES=8  # Adjust based on available resources

cutadapt \
  -g "${FIVE_PRIME_ADAPTER}" \
  -o "${OUTPUT_FASTQ}" \
  --error-rate "${ERROR_RATE}" \
  --minimum-length "${MIN_LENGTH}" \
  --quality-cutoff "${QUALITY_CUTOFF}" \
  --cores "${NUM_CORES}" \
  "${INPUT_FASTQ}" \
  > "${REPORT_FILE}" 2>&1
-
8
Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA (and other) repetitive reads.
bowtie2, latest version (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install bowtie2 if not available
# conda install -c bioconda bowtie2

# --- Reference Data Setup ---
# Prepare a FASTA file containing human repetitive elements (e.g., RepBase sequences, rRNA, mitochondrial DNA).
# This file should represent the "human specific version of RepBase" described above.
# RepBase data must be obtained separately and filtered to human-specific elements.

# Build the bowtie2 index for the repetitive elements
# Replace 'human_repetitive_elements.fasta' with your actual reference file
bowtie2-build human_repetitive_elements.fasta human_repetitive_elements_index

# --- Filtering Reads ---
# Input FASTQ file (gzipped or unzipped). Adjust for paired-end reads if necessary.
INPUT_FASTQ="input.fastq.gz"
# Output FASTQ file containing reads that *did not* map to repetitive elements (the filtered, clean reads)
OUTPUT_FILTERED_FASTQ="filtered_reads.fastq.gz"
# Output FASTQ file containing reads that *did* map to repetitive elements (the discarded repetitive reads)
OUTPUT_REPETITIVE_FASTQ="repetitive_reads.fastq.gz"
# Bowtie2 index prefix created above
INDEX_PREFIX="human_repetitive_elements_index"

# Align reads to the repetitive elements index and keep only the unmapped reads.
# These unmapped reads are considered free of the specified repetitive elements.
# --un-gz: output unmapped reads to a gzipped file
# --al-gz: output mapped reads to a gzipped file
# -p: number of threads (adjust as needed)
# -q: input reads are FASTQ format
# -x: index prefix
# -U: single-end reads (use -1 and -2 for paired-end reads)
# --very-fast-local: a preset for speed, adjust as needed for sensitivity vs. speed
bowtie2 -p 8 -q -x "${INDEX_PREFIX}" -U "${INPUT_FASTQ}" \
  --un-gz "${OUTPUT_FILTERED_FASTQ}" \
  --al-gz "${OUTPUT_REPETITIVE_FASTQ}" \
  --very-fast-local
-
9
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
$ Bash example
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir /path/to/RepBase_human_database_file \
  --genomeLoad LoadAndRemove \
  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
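Because the command above sends the BAM to standard output and also uses the BAM path as `--outFileNamePrefix`, STAR's side outputs end up with names that start with `...rep.bam`. The sketch below derives those names so that later steps can refer to them. The variable names are illustrative, and the `Log.final.out` suffix is STAR's usual summary log rather than something the source mentions.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Prefix passed to --outFileNamePrefix in the RepBase mapping command.
REP_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# With --outReadsUnmapped Fastx, read pairs that did not map to RepBase are
# written next to the BAM; STAR appends fixed suffixes to the prefix.
UNMAPPED_MATE1="${REP_PREFIX}Unmapped.out.mate1"
UNMAPPED_MATE2="${REP_PREFIX}Unmapped.out.mate2"
STAR_LOG="${REP_PREFIX}Log.final.out"

printf '%s\n' "$UNMAPPED_MATE1" "$UNMAPPED_MATE2" "$STAR_LOG"
```

The two `Unmapped.out.mate` files are the repeat-filtered reads that the genome mapping command takes as `--readFilesIn`.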
-
10
Takes output from STAR rmRep.
$ Bash example
# "rmRep" refers to repeat removal (the RepBase mapping step), not PCR duplicate removal.
# With --outReadsUnmapped Fastx, STAR writes the read pairs that did not map to RepBase
# next to the BAM, using the --outFileNamePrefix value as the base name.
# Those files are the input to the genome mapping step.
REP_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
READ_FILE_1="${REP_PREFIX}Unmapped.out.mate1"
READ_FILE_2="${REP_PREFIX}Unmapped.out.mate2"

# Confirm that both files exist before running the genome alignment
ls -lh "${READ_FILE_1}" "${READ_FILE_2}"
-
11
Maps unique reads to the human genome.
$ Bash example
# Install STAR (e.g., using conda)
# conda install -c bioconda star

# Create a placeholder for the human genome STAR index.
# Replace '/path/to/STAR_genome_index/human_GRCh38' with the actual path to your pre-built STAR index.
# The index can be built from a FASTA file (e.g., GRCh38 primary assembly from GENCODE or UCSC) and GTF annotation file.
# Example command to build index (run once):
# STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /path/to/STAR_genome_index/human_GRCh38 --genomeFastaFiles /path/to/human_GRCh38.fa --sjdbGTFfile /path/to/human_GRCh38.gtf --sjdbOverhang 100

# Map unique reads to the human genome
# Replace 'reads_R1.fastq.gz' and 'reads_R2.fastq.gz' with your actual input FASTQ files.
# Adjust '--runThreadN' based on available CPU cores.
# The '--outFilterMultimapNmax 1' parameter ensures only uniquely mapping reads are reported.
STAR --genomeDir /path/to/STAR_genome_index/human_GRCh38 \
  --readFilesIn reads_R1.fastq.gz reads_R2.fastq.gz \
  --runThreadN 8 \
  --outFileNamePrefix aligned_reads_ \
  --outSAMtype BAM SortedByCoordinate \
  --readFilesCommand zcat \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 10 \
  --outFilterScoreMinOverLread 0.66 \
  --outFilterMatchNminOverLread 0.66
-
12
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
$ Bash example
# Install STAR (example using Conda)
# conda install -c bioconda star

# Define variables
STAR_GENOME_DIR="/path/to/your/STAR_index/GRCh38"  # Example: GRCh38 human genome index
READ_FILE_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"  # Input mate 1 FASTQ file
READ_FILE_2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"  # Input mate 2 FASTQ file
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"  # Output file prefix (including .bam for the main output)
OUTPUT_BAM="${OUTPUT_PREFIX}"  # The final BAM file will be redirected to this path

# Execute STAR alignment
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${STAR_GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_1}" "${READ_FILE_2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd > "${OUTPUT_BAM}"
-
13
Takes output from STAR genome mapping.
$ Bash example
# Install STAR if not already installed
# conda install -c bioconda star

# Define variables
# Replace with actual paths and filenames
GENOME_DIR="/path/to/STAR_index/GRCh38"  # Placeholder for STAR genome index directory (e.g., GRCh38)
READ1_FASTQ="input_R1.fastq.gz"          # Placeholder for input Read 1 FASTQ file
READ2_FASTQ="input_R2.fastq.gz"          # Placeholder for input Read 2 FASTQ file (remove if single-end)
OUTPUT_PREFIX="sample_name"              # Prefix for output files
THREADS=8                                # Number of threads to use

# Run STAR genome mapping
# This command aligns RNA-seq reads (e.g., from eCLIP) to a reference genome.
# It outputs a sorted BAM file, filters for uniquely mapping reads, and allows for a few mismatches.
STAR --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}_" \
  --outSAMtype BAM SortedByCoordinate \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 3 \
  --outFilterScoreMinOverLread 0.66 \
  --outFilterMatchNminOverLread 0.66 \
  --runThreadN "${THREADS}"
-
14
Custom random-mer-aware script for PCR duplicate removal.
barcode_collapse_pe.py from the yeolab/eclip workflow (source repository inferred with models/gemini-2.5-flash)
$ Bash example
# Clone the eCLIP workflow repository to get the script
# git clone https://github.com/yeolab/eclip.git

# Install dependencies (e.g., pysam)
# pip install pysam

# Define input and output file paths (placeholders)
INPUT_BAM="aligned_reads.bam"  # Aligned BAM; the random-mer is assumed to be carried in each read name
OUTPUT_DEDUP_BAM="deduplicated_reads.bam"
METRICS_FILE="deduplicated_reads.metrics"

# Execute the custom random-mer-aware script for PCR duplicate removal
# Option names follow the documented command; the script is assumed to be on PATH
barcode_collapse_pe.py \
  --bam "${INPUT_BAM}" \
  --out_file "${OUTPUT_DEDUP_BAM}" \
  --metrics_file "${METRICS_FILE}"
-
15
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
$ Bash example
# Install Miniconda or Anaconda if not already installed
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# bash miniconda.sh -b -p $HOME/miniconda
# rm miniconda.sh
# export PATH="$HOME/miniconda/bin:$PATH"

# Create a conda environment for the eCLIP pipeline dependencies
# conda create -n eclip_env python=3.8 pysam numpy pandas -y
# conda activate eclip_env

# Clone the eCLIP pipeline repository to get the script
# git clone https://github.com/yeolab/eclip.git

# Location of the script inside the clone (an assumption; adjust to where the script actually lives)
SCRIPT_PATH="eclip/scripts/barcode_collapse_pe.py"

# Define input and output file paths
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"

# Execute the command
python "${SCRIPT_PATH}" \
  --bam "${INPUT_BAM}" \
  --out_file "${OUTPUT_BAM}" \
  --metrics_file "${METRICS_FILE}"
-
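The idea behind random-mer-aware duplicate removal can be sketched on toy data. This is not the logic of `barcode_collapse_pe.py`, which works on paired-end BAM records and is not described in the source. The five records and their column layout below are invented for illustration. Reads count as PCR duplicates only when they share both the mapping position and the random-mer.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Toy alignments (hypothetical): random-mer, chromosome, start, strand, read ID.
TOY_READS=$(printf '%s\n' \
    "ACGTA chr1 1000 + read1" \
    "ACGTA chr1 1000 + read2" \
    "TTGCA chr1 1000 + read3" \
    "ACGTA chr1 1000 - read4" \
    "ACGTA chr2 5000 + read5")

# Keep the first read seen for each (random-mer, chromosome, start, strand) key.
DEDUPED=$(awk '!seen[$1 FS $2 FS $3 FS $4]++' <<< "$TOY_READS")

echo "$DEDUPED"
echo "kept $(wc -l <<< "$DEDUPED" | tr -d ' ') of $(wc -l <<< "$TOY_READS" | tr -d ' ') reads"
```

Here `read2` is dropped because it repeats the position and random-mer of `read1`, while `read3` survives because its random-mer differs even though it maps to the same position.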
16
Takes output from barcode collapse PE.
$ Bash example
# Install samtools (example using conda)
# conda install -c bioconda samtools

# The barcode collapse step writes a deduplicated BAM file (not FASTQ),
# which is the input to the sorting step that follows.
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"

# Optional sanity check that the BAM is intact before sorting
samtools quickcheck "${INPUT_BAM}" && echo "BAM looks intact"
-
17
Sorts resulting bam file for use downstream.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Sort the BAM file. Replace 'input.bam' with your actual input file and 'output.bam' with your desired output file name.
samtools sort -o output.bam input.bam

# Index the sorted BAM file for downstream use (e.g., visualization)
samtools index output.bam
-
18
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
$ Bash example
# Install Picard (often bundled with GATK or available standalone)
# conda create -n picard_env picard -c bioconda -c conda-forge
# conda activate picard_env

# Define variables
# The command uses Queue.jar from GATK's distribution. This JAR is typically part of GATK 3.x
# and contains the necessary Picard classes or acts as an entry point for them.
PICARD_JAR="/path/to/gatk/dist/Queue.jar"  # Adjust this path as needed
DATA_DIR="/full/path/to/files"             # Adjust this path as needed
INPUT_BAM="${DATA_DIR}/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
OUTPUT_BAM="${DATA_DIR}/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
TMP_DIR="${DATA_DIR}/.queue/tmp"

# Create temporary directory if it doesn't exist
mkdir -p "${TMP_DIR}"

# Execute Picard SortSam
java -Xmx2048m \
  -XX:+UseParallelOldGC \
  -XX:ParallelGCThreads=4 \
  -XX:GCTimeLimit=50 \
  -XX:GCHeapFreeLimit=10 \
  -Djava.io.tmpdir="${TMP_DIR}" \
  -cp "${PICARD_JAR}" \
  net.sf.picard.sam.SortSam \
  INPUT="${INPUT_BAM}" \
  TMP_DIR="${TMP_DIR}" \
  OUTPUT="${OUTPUT_BAM}" \
  VALIDATION_STRINGENCY=SILENT \
  SO=coordinate \
  CREATE_INDEX=true
-
19
Takes output from sortSam, makes bam index for use downstream.
samtools v1.19 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.19

# Assume the sorted BAM file from sortSam is named 'input_sorted.bam'
# This command creates an index file (e.g., 'input_sorted.bam.bai') in the same directory.
samtools index input_sorted.bam
-
20
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools

samtools index \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
-
21
Takes inputs from multiple final bam files.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Example: Merge multiple BAM files for a single sample (e.g., technical replicates or lanes)
# This command takes multiple input BAM files and merges them into a single output BAM file.
# Replace input_file_1.bam, input_file_2.bam, etc., with your actual BAM file paths.
# Replace combined_output.bam with your desired output file name.
samtools merge -o combined_output.bam input_file_1.bam input_file_2.bam input_file_3.bam
-
22
Merges the two technical replicates for further downstream analysis.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Define input and output file names (example placeholders)
INPUT_REPLICATE_1="sample_replicate1.bam"
INPUT_REPLICATE_2="sample_replicate2.bam"
OUTPUT_MERGED_BAM="sample_merged_replicates.bam"

# Merge the BAM files of the two technical replicates
samtools merge "${OUTPUT_MERGED_BAM}" "${INPUT_REPLICATE_1}" "${INPUT_REPLICATE_2}"
-
23
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Define input and output files
OUTPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
INPUT_BAM_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
INPUT_BAM_2="/full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"

# Merge multiple sorted BAM files into a single sorted BAM file
samtools merge "${OUTPUT_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}"
-
24
Takes output from sortSam, makes bam index for use downstream.
samtools index v1.19 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.19

# Assume the output from sortSam is a sorted BAM file named 'input.sorted.bam'
# This command creates an index file 'input.sorted.bam.bai' in the same directory.
samtools index input.sorted.bam
-
25
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools

samtools index \
  /full/path/to/files/CombinedID.merged.bam \
  /full/path/to/files/CombinedID.merged.bam.bai
-
26
Takes output from sortSam.
samtools v1.10 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# This command takes a sorted BAM file (output from sortSam) and creates an index file (.bai).
# The index file is essential for quickly accessing regions of the BAM file,
# which is required by many downstream tools (e.g., genome browsers).
# Example: Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam
-
27
Outputs only the second read in each pair, for use with a single-stranded peak caller.
$ Bash example
# Install samtools (example using conda)
# conda install -c bioconda samtools=1.19

# Extract the second read of each pair from an aligned BAM file, keeping BAM format
# The 0x80 (128) flag in samtools view selects reads that are the 'second in pair'.
# Replace input.bam with your aligned BAM file.
# Replace output_r2.bam with your desired output BAM file name.
samtools view -hb -f 0x80 input.bam > output_r2.bam
-
28
This is the final bam file to perform analysis on.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Sort the BAM file by coordinate. This is a common step to prepare a "final" BAM for analysis.
# Replace 'input.bam' with the actual aligned BAM file name.
samtools sort -o final.bam input.bam

# Index the sorted BAM file. An index (.bai) file is crucial for many downstream tools and visualization.
samtools index final.bam
-
29
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
$ Bash example
# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.9

# Extract reads that are the second in a pair (R2) from a merged BAM file
# -h: Include header in the output
# -b: Output in BAM format
# -f 128: Select reads where the FLAG has the 0x80 bit set (read is the second in a pair)
samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
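The `-f 128` filter keeps a record only when bit 0x80 of its SAM FLAG is set. The sketch below applies the same bit test to a few FLAG values that commonly appear for properly paired reads. `is_second_in_pair` is a hypothetical helper written for this illustration and is not part of samtools.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: succeed when the FLAG has the 0x80 (128) bit set.
is_second_in_pair() {
    local flag=$1
    (( (flag & 128) != 0 ))
}

# Typical FLAG values for properly paired reads.
for flag in 83 163 99 147; do
    if is_second_in_pair "$flag"; then
        echo "FLAG ${flag}: second in pair, kept by -f 128"
    else
        echo "FLAG ${flag}: first in pair, dropped by -f 128"
    fi
done
```

FLAGs 163 and 147 include the 128 bit and would pass the filter, while 83 and 99 carry the 64 bit (first in pair) instead and would be dropped.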
-
30
Takes results from samtools view.
$ Bash example
# Install samtools (e.g., via conda)
# conda install -c bioconda samtools=1.10

# Example: Convert a sorted BAM file to SAM format
# This command takes an input BAM file and outputs its content in SAM format to standard output.
# The -h flag includes the header.
# Replace 'input.bam' with your actual input file.
# The output can be redirected to a file (e.g., > output.sam) or piped to another command.
samtools view -h input.bam > output.sam
-
31
Calls peaks on those files.
$ Bash example
# Install Miniconda if not already installed
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
# export PATH="$HOME/miniconda/bin:$PATH"
# conda init bash
# source ~/.bashrc

# Create a conda environment for clipper and its dependencies
# conda create -n clipper_env python=3.8 numpy scipy pysam -y
# conda activate clipper_env

# Install clipper from its repository
# pip install git+https://github.com/yeolab/clipper.git

# Define input and output (placeholders - replace with actual paths/values)
IP_BAM="path/to/your/ip_sample.r2.bam"  # BAM containing only the second read of each pair
SPECIES="hg19"                          # Genome build passed to -s in the documented command
OUTPUT_BED="eclip_peaks.bed"

# Call peaks using only the options that appear in the documented command
clipper \
  -b "${IP_BAM}" \
  -s "${SPECIES}" \
  -o "${OUTPUT_BED}" \
  --bonferroni \
  --superlocal \
  --threshold-method binomial \
  --save-pickle
-
32
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
$ Bash example
# Install CLIPper (example using conda)
# conda create -n clipper_env python=3.8
# conda activate clipper_env
# pip install git+https://github.com/yeolab/clipper.git

# Execute CLIPper command
clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
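The source commands derive each file name from the previous one: `CombinedID.merged.bam` becomes `CombinedID.merged.r2.bam`, which becomes `CombinedID.merged.r2.peaks.bed`. A minimal sketch of that convention using Bash parameter expansion follows. The helper name `clipper_cmd_for` and the dry-run approach are assumptions for illustration; the command is only printed, never executed.

```shell
#!/usr/bin/env bash
# Print (do not run) the clipper invocation for a merged BAM, deriving the
# R2-only BAM and peak file names the same way the source commands do.
# clipper_cmd_for is a hypothetical helper, not part of the original pipeline.
clipper_cmd_for() {
    local merged_bam=$1
    local stem=${merged_bam%.bam}          # strip the trailing ".bam"
    local r2_bam="${stem}.r2.bam"
    local peaks_bed="${stem}.r2.peaks.bed"
    local cmd=(clipper -b "$r2_bam" -s hg19 -o "$peaks_bed"
               --bonferroni --superlocal
               --threshold-method binomial --save-pickle)
    # Join the argument array with single spaces for display.
    printf '%s\n' "${cmd[*]}"
}

clipper_cmd_for /full/path/to/files/CombinedID.merged.bam
```

Printing the command first makes it easy to confirm the derived paths before launching a long-running peak call.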
Tools Used
Raw Source Text
Takes output from raw files. Run to trim off both 5' and 3' adapters on both reads.
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events.
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

Takes output from cutadapt round 2. Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

Takes output from STAR rmRep. Maps unique reads to the human genome.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

Takes output from STAR genome mapping. Custom random-mer-aware script for PCR duplicate removal.
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

Takes output from barcode collapse PE.
Sorts resulting bam file for use downstream.
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

Takes inputs from multiple final bam files. Merges the two technical replicates for further downstream analysis.
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

Takes output from sortSam. Only outputs the second read in each pair for use with single stranded peak caller. This is the final bam file to perform analysis on.
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

Takes results from samtools view. Calls peaks on those files.
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

Genome_build: hg19
Supplementary_files_format_and_content: bigWig, bigBed, bed (col1: chrom, col2: chromStart, col3: chromEnd, col4: -log10 pvalue, col5: log2 fold enrichment above input, col6: strand) format, contains clusters of predicted RBP binding