GSE86035 Processing Pipeline
Publication
SONAR Discovers RNA-Binding Proteins from Analysis of Large-Scale Protein-Protein Interactomes. Molecular Cell (2016), PMID 27720645
Dataset
GSE86035: SONAR discovers RNA binding proteins from analysis of large-scale protein-protein interactomes.
Processing Steps
-
1
Takes output from raw files.
-
2
Run to trim off both 5' and 3' adapters on both reads.
$ Bash example
# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=2.10

# Define input and output files (placeholders)
INPUT_R1="input_read1.fastq.gz"
INPUT_R2="input_read2.fastq.gz"
OUTPUT_R1="trimmed_read1.fastq.gz"
OUTPUT_R2="trimmed_read2.fastq.gz"

# Illustrative adapter sequences, attributed by the inferred example to the Yeo lab eCLIP workflow
# (https://github.com/yeolab/eclip/blob/master/eclip.cwl). The adapters actually used for this
# dataset are the ones listed in the deposited command in step 3.
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_5PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Illustrative trimming parameters (the deposited command uses a quality cutoff of 6 and -m 18)
MIN_LENGTH=18        # Minimum length of reads to keep after trimming
QUALITY_CUTOFF=20    # Quality cutoff for base trimming

# Run cutadapt to trim 5' and 3' adapters from both reads
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -A "${ADAPTER_3PRIME}" \
  -g "${ADAPTER_5PRIME}" \
  -G "${ADAPTER_5PRIME}" \
  --minimum-length "${MIN_LENGTH}" \
  --quality-cutoff "${QUALITY_CUTOFF}" \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}"
-
3
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
cutadapt (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
# Note: `quality-cutoff` is not a standalone tool. The deposited command appears to have lost
# its beginning: `quality-cutoff 6` matches cutadapt's `--quality-cutoff 6` option, and the
# remaining flags (-m, -a, -g, -A, -o, -p) are cutadapt options as well.
# The reconstruction below restores only `cutadapt --` (an assumption). Any other leading
# options (compare the round 2 command in step 6) cannot be recovered from this record.

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt

cutadapt --quality-cutoff 6 \
  -m 18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
  /full/path/to/files/file_R1.C01.fastq.gz \
  /full/path/to/files/file_R2.C01.fastq.gz \
  > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
-
4
Takes output from cutadapt round 1.
$ Bash example
# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=2.10

# Define input and output files
# INPUT_FASTQ is the output from cutadapt round 1 (adapter-trimmed reads)
INPUT_FASTQ="round1_trimmed.fastq.gz"
OUTPUT_FASTQ="round2_trimmed.fastq.gz"

# Illustrative poly-A trimming pass. This is an inferred example only: the deposited
# round 2 command (step 6) trims read 2 adapters in paired mode, not poly-A tails.
POLY_A_ADAPTER="A{100}"   # Trims up to 100 'A's from the 3' end
MIN_LENGTH=18             # Minimum read length after trimming

# Execute cutadapt for poly-A trimming
cutadapt -a "${POLY_A_ADAPTER}" \
  -o "${OUTPUT_FASTQ}" \
  --minimum-length "${MIN_LENGTH}" \
  "${INPUT_FASTQ}"
-
5
Run to trim off the 3' adapters on read 2, to control for double ligation events.
cutadapt (Inferred with models/gemini-2.5-flash) v4.0 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=4.0

# Define input and output files
INPUT_READ2="read2.fastq.gz"
OUTPUT_TRIMMED_READ2="read2_trimmed.fastq.gz"

# 3' adapter sequence, attributed by the inferred example to the yeolab/skipper config.yaml.
# The deposited command in step 6 uses a tiled set of shorter -A adapters instead.
ADAPTER_R2_DOUBLE_LIGATION="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"

MIN_LENGTH=18   # Minimum length for reads after trimming
THREADS=4       # Number of threads

# Trim the 3' adapter from read 2 to control for double ligation events.
# Caution: trimming read 2 on its own can leave the two FASTQ files out of sync when
# --minimum-length discards reads; the deposited command trims in paired mode (-A with -o/-p).
cutadapt \
  -a "${ADAPTER_R2_DOUBLE_LIGATION}" \
  -o "${OUTPUT_TRIMMED_READ2}" \
  --minimum-length "${MIN_LENGTH}" \
  --cores "${THREADS}" \
  "${INPUT_READ2}"
-
6
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
$ Bash example
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
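The twenty `-A` sequences in the command above do not look arbitrary: they appear to be every 15-nt window of two 27-nt adapter sequences that share a common tail. The Bash sketch below regenerates that list. The two full-length sequences are reconstructed here from the overlapping windows and are not stated in the source, and `tile_kmers` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Assumption: full-length sequences reconstructed from the overlapping -A windows.
ADAPTER_X="AACTTGTAGATCGGAAGAGCGTCGTGT"
ADAPTER_Y="AGGACCAAGATCGGAAGAGCGTCGTGT"
K=15

# Print every K-length window of a sequence, one per line.
tile_kmers() {
  local seq="$1" k="$2" i
  for ((i = 0; i + k <= ${#seq}; i++)); do
    printf '%s\n' "${seq:i:k}"
  done
}

# Tile both adapters and keep each distinct window once (first appearance wins).
adapter_windows="$( { tile_kmers "$ADAPTER_X" "$K"; tile_kmers "$ADAPTER_Y" "$K"; } | awk '!seen[$0]++' )"
n_windows="$(printf '%s\n' "$adapter_windows" | wc -l | tr -d ' ')"

# Emit the windows as cutadapt -A arguments.
printf '%s\n' "$adapter_windows" | sed 's/^/-A /'
echo "distinct windows: ${n_windows}"
```

Tiling the adapter this way lets cutadapt recognise read 2 fragments that begin partway into the adapter, which is one plausible reason the deposited command lists so many short sequences.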
-
7
Takes output from cutadapt round 2.
$ Bash example
# Installation (example using conda):
# conda install -c bioconda cutadapt=3.4

# Define input and output files
# INPUT_FASTQ is the output from the second round of cutadapt trimming.
INPUT_FASTQ="input_from_cutadapt_round2.fastq.gz"
OUTPUT_FASTQ="output_extra_trimmed.fastq.gz"

# Illustrative additional trimming pass (poly-A trimming and quality filtering).
# This is an inferred example only: in the deposited pipeline the round 2 output
# goes directly to the repeat-element alignment in step 9.
ADAPTER_EXTRA="A{100}"   # Poly-A pattern (up to 100 A's)
MIN_LENGTH=18            # Minimum read length after trimming
QUALITY_CUTOFF=6         # Quality cutoff for 3' end trimming (Phred score)
NEXTSEQ_TRIM=20          # Trim low-quality bases from 3' end of NextSeq reads (Phred score)
NUM_CORES=8              # Number of CPU cores to use

cutadapt \
  -a "${ADAPTER_EXTRA}" \
  --minimum-length "${MIN_LENGTH}" \
  --quality-cutoff "${QUALITY_CUTOFF}" \
  --nextseq-trim "${NEXTSEQ_TRIM}" \
  --cores "${NUM_CORES}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"
-
8
Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.
$ Bash example
# Illustrative alternative using Bowtie; the deposited command for this stage (step 9) uses STAR.

# Define variables
# Path to a Bowtie index for human rRNA and repetitive elements.
# The index must be pre-built with `bowtie-build` from a FASTA file of the repetitive sequences.
# Example FASTA sources: RepBase (https://www.girinst.org/repbase/) and NCBI RefSeq for rRNA.
BOWTIE_INDEX_PREFIX="/path/to/human_rRNA_RepBase_index"

# Input FASTQ file (gzipped)
INPUT_FASTQ_GZ="input_reads.fastq.gz"

# Reads that did NOT align to repetitive elements (initially uncompressed)
OUTPUT_NON_REPETITIVE_FASTQ="output_non_repetitive_reads.fastq"

# Reads that DID align to repetitive elements (can be discarded or used for QC)
OUTPUT_REPETITIVE_SAM="output_repetitive_reads.sam"

# Number of threads to use
THREADS=8

# Filter repetitive reads using Bowtie
# -q: input reads are FASTQ format
# -S: output alignments in SAM format
# --un: write reads that do not align to the specified file
# -p: number of threads
# `zcat` decompresses the gzipped input and pipes it to bowtie;
# the `-` after the index prefix tells bowtie to read from stdin.
zcat "${INPUT_FASTQ_GZ}" | \
  bowtie -q -S --un "${OUTPUT_NON_REPETITIVE_FASTQ}" -p "${THREADS}" "${BOWTIE_INDEX_PREFIX}" - \
  > "${OUTPUT_REPETITIVE_SAM}"

# Gzip the non-repetitive reads file for storage
gzip "${OUTPUT_NON_REPETITIVE_FASTQ}"
-
9
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
$ Bash example
STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
-
10
Takes output from STAR rmRep.
$ Bash example
# This stage consumes the reads left unmapped by the repeat-element alignment (step 9),
# which STAR writes as FASTQ files because --outReadsUnmapped Fastx is set.
# Paths follow the deposited commands; adjust them to your own layout.
UNMAPPED_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
UNMAPPED_R2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"

# Confirm both files exist and are non-empty before genome mapping
ls -lh "${UNMAPPED_R1}" "${UNMAPPED_R2}"
-
11
Maps unique reads to the human genome.
$ Bash example
# Illustrative alternative using BWA-MEM; the deposited command for this stage (step 12)
# uses STAR with --outFilterMultimapNmax 1 to keep uniquely mapping reads.

# Install BWA and samtools (if not already installed)
# conda install -c bioconda bwa samtools

# Define variables
REF_GENOME="/path/to/human_genome_hg19.fa"   # Placeholder; this dataset reports genome build hg19
READ1="/path/to/sample_R1.fastq.gz"
READ2="/path/to/sample_R2.fastq.gz"
OUTPUT_BAM="aligned_reads.bam"
THREADS=8   # Number of CPU threads to use

# Index the reference genome (run once per reference)
# bwa index "${REF_GENOME}"

# Align paired-end reads to the human genome using BWA-MEM
# -M: Mark shorter split hits as secondary (for Picard compatibility)
# -t: Number of threads
# -R: Read group header. Replace with actual sample ID, library, platform, etc.
bwa mem -M -t "${THREADS}" \
  -R "@RG\tID:sample_id\tSM:sample_name\tPL:ILLUMINA\tLB:library_name" \
  "${REF_GENOME}" "${READ1}" "${READ2}" | \
  samtools view -b - | \
  samtools sort -o "${OUTPUT_BAM}" -

# Index the sorted BAM file
samtools index "${OUTPUT_BAM}"
-
12
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
$ Bash example
# STAR is a splice-aware aligner for RNA-seq reads.
# It requires a pre-built genome index. Replace `/path/to/STAR_database_file` with the path to your STAR genome index.
# Replace the `...rep.bamUnmapped.out.mate1` and `...mate2` paths with your actual input FASTQ files.
# Replace the `...rmRep.bam` path with your desired output BAM file path.

# Example installation using Conda (uncomment to use):
# conda create -n star_env star -y
# conda activate star_env

STAR --runMode alignReads \
  --runThreadN 16 \
  --genomeDir /path/to/STAR_database_file \
  --genomeLoad LoadAndRemove \
  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
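Because `--outFileNamePrefix` is a literal prefix, STAR appends its fixed suffixes (such as `Unmapped.out.mate1`) directly to whatever string it is given. That is why the repeat-filtering step yields the `...rep.bamUnmapped.out.mate1` files used as input here. The sketch below only assembles and prints the file names that chain the trimming and mapping stages together; it runs no tools, and the variable names are illustrative rather than taken from the source.

```shell
#!/usr/bin/env bash
# Placeholder directory and sample name, copied from the commands in this record.
DIR="/full/path/to/files"
R1="${DIR}/file_R1.C01.fastq.gz"

TRIM1="${R1}.adapterTrim.fastq.gz"              # cutadapt round 1 output (read 1)
TRIM2="${R1}.adapterTrim.round2.fastq.gz"       # cutadapt round 2 output (read 1)
REP_PREFIX="${R1}.adapterTrim.round2.rep.bam"   # value passed to --outFileNamePrefix

# STAR appends these suffixes to the prefix when --outReadsUnmapped Fastx is set.
UNMAPPED_1="${REP_PREFIX}Unmapped.out.mate1"
UNMAPPED_2="${REP_PREFIX}Unmapped.out.mate2"

RMREP_BAM="${R1}.adapterTrim.round2.rmRep.bam"  # genome alignment, captured from stdout

printf '%s\n' "$TRIM1" "$TRIM2" "$UNMAPPED_1" "$UNMAPPED_2" "$RMREP_BAM"
```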
-
13
Takes output from STAR genome mapping.
$ Bash example
# This stage takes the unsorted BAM written by the STAR genome-mapping command (step 12).
# The path follows the deposited commands; adjust it to your own layout.
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

# Optional sanity check of the alignment before duplicate removal (requires samtools)
# conda install -c bioconda samtools
samtools flagstat "${INPUT_BAM}"
-
14
Custom random-mer-aware script for PCR duplicate removal.
$ Bash example
# Illustrative alternative using UMI-tools; the deposited command (step 15) uses the
# custom script barcode_collapse_pe.py instead.

# Install umi_tools and samtools
# conda install -c bioconda umi_tools samtools

# Define input and output files
INPUT_BAM="input.bam"
OUTPUT_DEDUP_BAM="output.dedup.bam"

# Assumption: the random-mer (UMI) is already the last field of each read name, separated by ":".
# `umi_tools extract` works on FASTQ files before alignment and cannot be run on a BAM file,
# so any UMI extraction has to happen upstream of mapping.

# Step 1: Coordinate-sort and index the BAM file, as required by umi_tools dedup
samtools sort -o "input.position_sorted.bam" "${INPUT_BAM}"
samtools index "input.position_sorted.bam"

# Step 2: Run umi_tools dedup to remove PCR duplicates
# --paired: treat the input as paired-end
# --method directional: network-based UMI grouping (the UMI-tools default)
umi_tools dedup \
  --stdin "input.position_sorted.bam" \
  --stdout "${OUTPUT_DEDUP_BAM}" \
  --paired \
  --method directional \
  --umi-separator ":" \
  --log "dedup.log" \
  --output-stats "dedup_stats"

# Clean up intermediate files (optional)
rm input.position_sorted.bam input.position_sorted.bam.bai
-
15
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
barcode_collapse_pe.py (part of Yeo Lab eCLIP pipeline) (Inferred with models/gemini-2.5-flash) vNot specified GitHub
$ Bash example
# The barcode_collapse_pe.py script is part of the Yeo Lab eCLIP pipeline.
# It is recommended to clone the repository and set up a conda environment.
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Create and activate a conda environment (example, adjust Python version as needed):
# conda create -n eclip_env python=3.8
# conda activate eclip_env

# Install necessary dependencies. The full eCLIP pipeline has many dependencies,
# but for barcode_collapse_pe.py, pysam is a key one.
# conda install -c bioconda pysam

# Ensure the script is executable and in your PATH, or provide its full path.
# For example, if cloned to /path/to/eclip, you might call it as:
# python /path/to/eclip/scripts/barcode_collapse_pe.py \
#   --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
#   --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
#   --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

# Execute the command as provided:
barcode_collapse_pe.py \
  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
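The idea behind random-mer-aware duplicate removal can be sketched on toy data: read pairs that share the same mapping coordinates and the same random-mer are treated as PCR duplicates and collapsed to one. The Bash sketch below is not the logic of `barcode_collapse_pe.py` itself. The tab-separated record layout (chromosome, strand, read 1 start, read 2 start, random-mer) is a hypothetical simplification chosen only for illustration.

```shell
#!/usr/bin/env bash
# Toy records: chrom, strand, read1_start, read2_start, randomer (hypothetical layout).
toy_pairs="$(printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 + 1000 1080 ACGTA \
  chr1 + 1000 1080 ACGTA \
  chr1 + 1000 1080 TTGCA \
  chr1 + 1005 1080 ACGTA \
  chr2 - 5000 4920 ACGTA)"

# Keep the first pair seen for each (position, strand, random-mer) key.
collapsed="$(printf '%s\n' "$toy_pairs" | awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4 FS $5]++')"

n_before="$(printf '%s\n' "$toy_pairs" | wc -l | tr -d ' ')"
n_after="$(printf '%s\n' "$collapsed" | wc -l | tr -d ' ')"
echo "pairs before: ${n_before}, after: ${n_after}"
```

Only the exact repeat of the first record is dropped; the pair with a different random-mer at the same position is kept, which is the point of making duplicate removal random-mer-aware.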
-
16
Takes output from barcode collapse PE.
$ Bash example
# This stage takes the duplicate-removed BAM written by barcode_collapse_pe.py (step 15).
# The path follows the deposited commands; adjust it to your own layout.
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"

# Optional: confirm the file is an intact BAM before sorting (requires samtools)
# conda install -c bioconda samtools
samtools quickcheck "${INPUT_BAM}" && echo "BAM looks intact"
-
17
Sorts resulting bam file for use downstream.
samtools (Inferred with models/gemini-2.5-flash) v1.15.1 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.15.1

# Sort the BAM file by coordinate
# Replace input.bam with your actual input BAM file
# Replace output.sorted.bam with your desired output sorted BAM file name
# Adjust -@ parameter based on available CPU cores
samtools sort -@ 4 -o output.sorted.bam input.bam
-
18
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
$ Bash example
# Install Picard (example using conda)
# conda install -c bioconda picard
# Or download the jar directly from Broad Institute's GitHub releases
# wget https://github.com/broadinstitute/picard/releases/download/<version>/picard.jar
# export PICARD_JAR="/path/to/picard.jar"

# The command provided uses a specific path to Queue.jar, which likely bundles or provides access to Picard tools.
# This setup is characteristic of older GATK/Picard integrations (e.g., GATK 3.x).
java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 \
  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp \
  -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam \
  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
  TMP_DIR=/full/path/to/files/.queue/tmp \
  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
-
19
Takes output from sortSam, makes bam index for use downstream.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam
-
20
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.10

# Define input and output paths
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"

# Execute samtools index
samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"
-
21
Takes inputs from multiple final bam files.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Example: Merge multiple final BAM files into a single BAM file.
# This step combines technical replicates ahead of downstream analysis.
# Usage: samtools merge <output.bam> <input1.bam> <input2.bam> ...

# Placeholder input BAM files, held in an array so that each path stays a separate argument
# Replace with the actual paths of your final BAM files
INPUT_BAM_FILES=("input_replicate_1.bam" "input_replicate_2.bam")
OUTPUT_MERGED_BAM="merged_sample.bam"

samtools merge "${OUTPUT_MERGED_BAM}" "${INPUT_BAM_FILES[@]}"

# Index the merged BAM file (optional, but good practice for downstream tools)
samtools index "${OUTPUT_MERGED_BAM}"
-
22
Merges the two technical replicates for further downstream analysis.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge technical replicates (e.g., BAM files)
# Assuming input BAM files are replicate1.bam and replicate2.bam
# and the output will be merged_replicates.bam
samtools merge merged_replicates.bam replicate1.bam replicate2.bam
-
23
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools
# or
# sudo apt-get update && sudo apt-get install samtools

# Execute the samtools merge command
samtools merge /full/path/to/files/CombinedID.merged.bam \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
-
24
Takes output from sortSam, makes bam index for use downstream.
samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assume 'sorted.bam' is the output from sortSam
# The 'samtools index' command creates an index file (.bai) for the input BAM file.
samtools index sorted.bam
-
25
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Execute samtools index command
samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
-
26
Takes output from sortSam.
samtools (Inferred with models/gemini-2.5-flash) v1.19.1 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.19.1

# Define input file name
# INPUT_BAM is the sorted BAM file, which is the output from sortSam
INPUT_BAM="input_sorted.bam"

# Index the sorted BAM file for efficient access by downstream tools
samtools index "${INPUT_BAM}"
-
27
Only outputs the second read in each pair for use with single stranded peak caller.
samtools (Inferred with models/gemini-2.5-flash) v1.15.1 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Keep only the second read of each pair and write BAM, so that the result can be
# passed directly to a single-stranded peak caller (this mirrors the deposited command in step 29).
#
# -h: Include the header in the output.
# -b: Output in BAM format.
# -f 128: Keep only reads with SAM flag 0x80 set (second read in a pair).
# input.bam / output.r2.bam: Placeholder file names.
samtools view -hb -f 128 input.bam > output.r2.bam
-
28
This is the final bam file to perform analysis on.
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.9

# This command ensures the final BAM file is indexed, which is crucial for many downstream analysis tools.
# Replace 'sample.final.bam' with the actual path to your final BAM file.
samtools index sample.final.bam
-
29
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
$ Bash example
# Install samtools (e.g., via conda)
# conda install -c bioconda samtools=1.19

# Define input and output paths
INPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
OUTPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam"

# Filter BAM file to extract reads that are the second in a pair
samtools view -hb -f 128 "${INPUT_BAM}" > "${OUTPUT_BAM}"
-
30
Takes results from samtools view.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Optional, illustrative filter that is not part of the deposited pipeline:
# keep only mapped reads from a BAM file and write them to a new BAM file.
# -b: Output in BAM format
# -F 4: Exclude reads whose FLAG marks them as unmapped (0x4)
samtools view -b -F 4 input.bam > output_mapped.bam
-
31
Calls peaks on those files.
$ Bash example
# Clone the clipper repository if not already available
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install

# Define input and output files (placeholders)
IP_BAM="ip_sample.r2.bam"   # Placeholder: read 2 only BAM file from the previous stage
OUTPUT_BED="peaks.bed"      # Placeholder: desired output BED file name for peaks
SPECIES="hg19"              # Genome build reported for this dataset

# Call peaks. The options mirror the deposited command in step 32,
# which passes a single BAM file and no control sample.
clipper -b "${IP_BAM}" -s "${SPECIES}" -o "${OUTPUT_BED}" \
  --bonferroni --superlocal --threshold-method binomial --save-pickle
-
32
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
$ Bash example
# Install CLIPper (example, adjust as needed). Conda or pip package names for CLIPper are not
# confirmed here, so this example installs from the Yeo lab repository instead.
# git clone https://github.com/yeolab/clipper.git
# cd clipper && python setup.py install

# Run CLIPper for peak calling
clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 \
  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed \
  --bonferroni --superlocal --threshold-method binomial --save-pickle
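Once the peak caller has written its BED file, a quick sanity check is to count the peaks and total their widths. The sketch below does this with `awk` on a few made-up BED lines. It assumes only the three standard leading BED columns (chromosome, start, end), and the coordinates are invented for illustration rather than taken from this dataset.

```shell
#!/usr/bin/env bash
# Toy BED records (chrom, start, end); values are invented for illustration.
toy_bed="$(printf '%s\t%s\t%s\n' \
  chr1 100 150 \
  chr1 400 480 \
  chr2 900 930)"

# Count peaks and sum their widths (end minus start, as BED intervals are half-open).
summary="$(printf '%s\n' "$toy_bed" | awk -F'\t' '
  { n++; total += $3 - $2 }
  END { printf "%d peaks, %d bp total, %.1f bp mean", n, total, total / n }')"
echo "$summary"
```

To run the same summary on real output, the `printf` of the toy records could be replaced with `cat` on the peaks file.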
Tools Used
Raw Source Text
Takes output from raw files. Run to trim off both 5' and 3' adapters on both reads. Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics Takes output from cutadapt round 2. Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

Takes output from STAR rmRep. Maps unique reads to the human genome.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

Takes output from STAR genome mapping. Custom random-mer-aware script for PCR duplicate removal.
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

Takes output from barcode collapse PE.
Sorts resulting bam file for use downstream.
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

Takes inputs from multiple final bam files. Merges the two technical replicates for further downstream analysis.
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

Takes output from sortSam. Only outputs the second read in each pair for use with single stranded peak caller. This is the final bam file to perform analysis on.
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

Takes results from samtools view. Calls peaks on those files.
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding
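
Each command above takes the previous step's output as its input, and every step appends a suffix to the read 1 FASTQ name. The Bash sketch below makes that naming chain explicit and prints, without running, the merge-to-peaks commands for two replicates. The function names stage_path and print_tail_commands and the stage labels are hypothetical helpers introduced for illustration; only the suffixes and command lines are taken from the text above. It is a dry-run aid under those assumptions, not the authors' pipeline driver.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the original pipeline): map a read 1 FASTQ
# path and a stage label to the intermediate file name used in the commands
# above. Pure string manipulation; no tools are executed.
stage_path() {
  local r1="$1" stage="$2"
  case "$stage" in
    trim1)  printf '%s\n' "${r1}.adapterTrim.fastq.gz" ;;                       # cutadapt round 1
    trim2)  printf '%s\n' "${r1}.adapterTrim.round2.fastq.gz" ;;                # cutadapt round 2
    rep)    printf '%s\n' "${r1}.adapterTrim.round2.rep.bam" ;;                 # STAR vs RepBase
    rmrep)  printf '%s\n' "${r1}.adapterTrim.round2.rmRep.bam" ;;               # STAR vs genome
    rmdup)  printf '%s\n' "${r1}.adapterTrim.round2.rmRep.rmDup.bam" ;;         # barcode collapse
    sorted) printf '%s\n' "${r1}.adapterTrim.round2.rmRep.rmDup.sorted.bam" ;;  # SortSam
    *) echo "unknown stage: $stage" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print (do not run) the tail of the pipeline for two
# replicates: merge, index, keep read 2 only, call peaks.
print_tail_commands() {
  local prefix="$1" rep1_r1="$2" rep2_r1="$3"
  local merged="${prefix}.merged.bam"
  local r2="${prefix}.merged.r2.bam"
  echo "samtools merge ${merged} $(stage_path "$rep1_r1" sorted) $(stage_path "$rep2_r1" sorted)"
  echo "samtools index ${merged} ${merged}.bai"
  # SAM flag 128 (0x80) marks the second read of a pair
  echo "samtools view -hb -f 128 ${merged} > ${r2}"
  echo "clipper -b ${r2} -s hg19 -o ${prefix}.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle"
}

# Example with the placeholder paths from the text above
print_tail_commands /full/path/to/files/CombinedID \
  /full/path/to/files/file_R1.C01.fastq.gz \
  /full/path/to/files/file_R1.D08.fastq.gz
```

Printing the commands first makes it easy to confirm that each input matches the output of the step before it. The printed lines can then be reviewed and run by hand.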