GSE78509 Processing Pipeline
Publication
Enhanced CLIP Uncovers IMP Protein-RNA Targets in Human Pluripotent Stem Cells Important for Cell Adhesion and Survival. Cell Reports (2016), PMID 27068461
Dataset
GSE78509: Enhanced CLIP uncovers IMP protein-RNA targets in human pluripotent stem cells important for cell adhesion and survival
Processing Steps
1
Library strategy: eCLIP-Seq
$ Bash example
# Install cwltool (if not already installed)
# conda install -c conda-forge cwltool

# Clone the eCLIP workflow repository and change into it to run the workflow
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# --- Placeholder for Reference Data ---
# Replace with actual paths to your reference genome and annotation files.
# Human hg38 is used here only as an example.
GENOME_DIR="/path/to/hg38_star_index"            # Directory containing STAR index files
GTF_FILE="/path/to/gencode.v38.annotation.gtf"   # GTF annotation file
FASTA_FILE="/path/to/hg38.fa"                    # Genome FASTA file
CHROM_SIZES="/path/to/hg38.chrom.sizes"          # Chromosome sizes file

# --- Placeholder for Input Data ---
# Replace with actual paths to your eCLIP and control FASTQ files.
# Assuming paired-end reads for both sample and control.
SAMPLE_R1="/path/to/sample_rep1_R1.fastq.gz"
SAMPLE_R2="/path/to/sample_rep1_R2.fastq.gz"
CONTROL_R1="/path/to/control_rep1_R1.fastq.gz"
CONTROL_R2="/path/to/control_rep1_R2.fastq.gz"

# Create an input YAML file for the eCLIP CWL workflow.
# The keys below are illustrative; use the input names defined by the workflow you run.
cat << EOF > eclip_inputs.yaml
fastq_r1:
  class: File
  path: ${SAMPLE_R1}
fastq_r2:
  class: File
  path: ${SAMPLE_R2}
control_fastq_r1:
  class: File
  path: ${CONTROL_R1}
control_fastq_r2:
  class: File
  path: ${CONTROL_R2}
genome_dir:
  class: Directory
  path: ${GENOME_DIR}
gtf_file:
  class: File
  path: ${GTF_FILE}
fasta_file:
  class: File
  path: ${FASTA_FILE}
chrom_sizes:
  class: File
  path: ${CHROM_SIZES}
output_prefix: "my_eclip_experiment"
adapter_sequence: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # Common Illumina adapter sequence
min_read_length: 15
max_read_length: 100
min_quality: 20
threads: 8
memory: 32  # GB
EOF

# Execute the eCLIP CWL workflow (eclip.cwl stands in for the workflow file you want to run)
cwltool eclip.cwl eclip_inputs.yaml
-
2
Takes the raw FASTQ files as input.
$ Bash example
# Install FastQC if not already installed
# conda install -c bioconda fastqc

# Define input and output paths
INPUT_DIR="raw_data"         # Directory containing raw FASTQ files
OUTPUT_DIR="fastqc_reports"
SAMPLE_ID="sample_name"      # Placeholder for a specific sample identifier

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run FastQC on raw FASTQ files
# Assuming paired-end reads: ${SAMPLE_ID}_R1.fastq.gz and ${SAMPLE_ID}_R2.fastq.gz
fastqc -o "${OUTPUT_DIR}" "${INPUT_DIR}/${SAMPLE_ID}_R1.fastq.gz" "${INPUT_DIR}/${SAMPLE_ID}_R2.fastq.gz"

# If single-end reads, use:
# fastqc -o "${OUTPUT_DIR}" "${INPUT_DIR}/${SAMPLE_ID}.fastq.gz"
-
3
Run to trim off both 5′ and 3′ adapters on both reads.
$ Bash example
# Installation (uncomment to install via conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output files (replace with actual filenames)
READ1_IN="raw_R1.fastq.gz"
READ2_IN="raw_R2.fastq.gz"
READ1_OUT="trimmed_R1.fastq.gz"
READ2_OUT="trimmed_R2.fastq.gz"

# Placeholder adapters: common Illumina TruSeq sequences (replace with the adapters used in your library)
ADAPTER_FWD="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"   # expected at the 3' end of Read 1
ADAPTER_REV="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"   # expected at the 3' end of Read 2

# Run cutadapt on paired-end reads.
# This sketch trims 3' adapters only; 5' adapters would be passed with -g (Read 1) and -G (Read 2).
# -a: 3' adapter sequence for Read 1
# -A: 3' adapter sequence for Read 2
# -o: Output file for Read 1
# -p: Output file for Read 2
# -m: Minimum length of reads after trimming (e.g., 18 bp, adjust as needed)
# --cores: Number of CPU cores to use for parallel processing
cutadapt \
  -a "${ADAPTER_FWD}" \
  -A "${ADAPTER_REV}" \
  -m 18 \
  --cores 4 \
  -o "${READ1_OUT}" \
  -p "${READ2_OUT}" \
  "${READ1_IN}" \
  "${READ2_IN}"
-
4
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
$ Bash example
# The recorded command for this step starts at "quality-cutoff 6", so its beginning is missing.
# The remaining options are cutadapt options (compare the round-2 command), so cutadapt is
# assumed here; anything that preceded --quality-cutoff in the original is not reproduced.
# conda install -c bioconda cutadapt

# Define input and output paths
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics"

# Round 1 of adapter trimming; the trimming report goes to the metrics file
cutadapt \
  --quality-cutoff 6 \
  -m 18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}" > "${METRICS_FILE}"
-
5
Takes output from cutadapt round 1.
$ Bash example
# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=3.4

# Define input and output files
# INPUT_FASTQ: Output from the first round of cutadapt processing
INPUT_FASTQ="input_from_cutadapt_round1.fastq.gz"
OUTPUT_FASTQ="output_cutadapt_round2.fastq.gz"

# Execute cutadapt for quality trimming and minimum length filtering.
# This is a common second step after initial adapter trimming; the cutoffs are illustrative.
# -q 20: Trim low-quality bases (quality below 20) from the 3' end of each read.
# -m 18: Discard reads shorter than 18 bp after trimming.
cutadapt -q 20 -m 18 -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}"
-
6
Run to trim off the 3′ adapters on read 2, to control for double ligation events.
$ Bash example
# Install cutadapt if not already available
# conda install -c bioconda cutadapt=3.4

# Define input and output files
INPUT_R2="sample_R2.fastq.gz"
OUTPUT_R2_TRIMMED="sample_R2_trimmed.fastq.gz"

# Placeholder 3' adapter (a common Illumina adapter sequence); replace it with the
# adapter used in your library.
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"

# Minimum length for reads after trimming
MIN_LENGTH=18

# Run cutadapt to trim the 3' adapter from Read 2
# -a: Specifies a 3' adapter to be removed from the 3' end of the read.
# -o: Output file for trimmed Read 2.
# --minimum-length: Discard reads shorter than this length after trimming.
# Reads without an adapter match are kept unchanged. --discard-untrimmed is deliberately
# not used, because it would throw away every read in which no adapter was found.
# For paired-end data, trim both mates in one call (-A/-p) so the pairs stay in sync.
cutadapt -a "${ADAPTER_R2}" \
  -o "${OUTPUT_R2_TRIMMED}" \
  --minimum-length "${MIN_LENGTH}" \
  "${INPUT_R2}"
-
7
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
$ Bash example
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
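The -A sequences in the round-2 command look like overlapping 15-nt windows, shifted one base at a time, over two 27-nt sequences (AACTTGT or AGGACCA followed by AGATCGGAAGAGCGTCGTGT). That reading is inferred from the list itself and is not stated by the source. A minimal bash sketch that regenerates the 20 distinct windows under that assumption (the function name tile_kmers and the variable names are hypothetical):

```shell
#!/usr/bin/env bash
# Print every k-nt window of a sequence, shifted one base at a time.
tile_kmers() {
  local seq="$1" k="$2" i
  for (( i = 0; i + k <= ${#seq}; i++ )); do
    printf '%s\n' "${seq:i:k}"
  done
}

# Assumed source sequences: a 7-nt prefix followed by a shared adapter stretch.
ADAPTER_CORE="AGATCGGAAGAGCGTCGTGT"
SEQ_A="AACTTGT${ADAPTER_CORE}"
SEQ_B="AGGACCA${ADAPTER_CORE}"

# Collect the distinct windows from both sequences.
mapfile -t WINDOWS < <( { tile_kmers "$SEQ_A" 15; tile_kmers "$SEQ_B" 15; } | sort -u )

# Turn them into cutadapt -A arguments (printed here, not executed).
ADAPTER_ARGS=()
for w in "${WINDOWS[@]}"; do
  ADAPTER_ARGS+=( -A "$w" )
done
printf '%s distinct windows\n' "${#WINDOWS[@]}"
printf '%s ' "${ADAPTER_ARGS[@]}"; echo
```

Tiling the adapter this way lets cutadapt find an adapter that begins at any offset within the barcode-plus-adapter stretch, at the cost of a long argument list.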
-
8
Takes output from cutadapt round 2.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
# Input files are the output from a previous cutadapt round (round 2)
INPUT_R1="cutadapt_round2_R1.fastq.gz"
INPUT_R2="cutadapt_round2_R2.fastq.gz"
OUTPUT_R1="cutadapt_round3_R1.fastq.gz"
OUTPUT_R2="cutadapt_round3_R2.fastq.gz"
REPORT_JSON="cutadapt_round3_report.json"

# Define adapter sequences (placeholders - replace with the sequences specific to the library preparation)
# Example Illumina adapter for the 3' end of read 1
ADAPTER_R1_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# Example Illumina adapter for the 3' end of read 2
ADAPTER_R2_3PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Execute cutadapt for further trimming and quality filtering (illustrative settings)
# -a: 3' adapter sequence for read 1
# -A: 3' adapter sequence for read 2
# -q: Quality trimming of read ends. Format: [5' cutoff,]3' cutoff
# --minimum-length: Discard reads shorter than this length after trimming
# --cores: Number of CPU cores to use for parallel processing
# --json: Write a JSON report with trimming statistics
# -o: Output file for read 1
# -p: Output file for read 2
cutadapt \
  -a "${ADAPTER_R1_3PRIME}" \
  -A "${ADAPTER_R2_3PRIME}" \
  -q 20,20 \
  --minimum-length 18 \
  --cores 4 \
  --json "${REPORT_JSON}" \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}"
-
9
Maps reads to a human-specific version of RepBase to remove repetitive elements; this helps control for spurious artifacts from rRNA (and other) repetitive reads.
$ Bash example
# bowtie2 is used here as a stand-in aligner; the recorded command for this step uses STAR.
# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2

# Define reference files and output directory
# REPETITIVE_ELEMENTS_FASTA is a placeholder for a FASTA file containing human repetitive sequences
# (e.g., rRNA, tRNA, and sequences from RepBase for the human genome).
# This file needs to be prepared beforehand, often by concatenating known repetitive sequences.
REPETITIVE_ELEMENTS_FASTA="path/to/human_repetitive_elements.fa"
BOWTIE2_INDEX_PREFIX="human_repetitive_elements_idx"
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_UNMAPPED_FASTQ="reads_without_repetitive_elements.fastq.gz"
OUTPUT_MAPPED_SAM="repetitive_reads.sam"

# Step 1: Build Bowtie2 index for repetitive elements (run once)
# This command creates index files (.bt2) from the FASTA reference.
# bowtie2-build "${REPETITIVE_ELEMENTS_FASTA}" "${BOWTIE2_INDEX_PREFIX}"

# Step 2: Align reads to the repetitive elements index
# Reads that align to repetitive elements are considered artifacts and are discarded.
# --un-gz: writes reads that *do not* align to the index to a gzipped FASTQ file.
# -S: writes all alignments (including those to repetitive elements) to a SAM file.
# -U: specifies the input FASTQ file (unpaired reads).
# -p: number of threads to use.
bowtie2 -p 8 -x "${BOWTIE2_INDEX_PREFIX}" -U "${INPUT_FASTQ}" \
  --un-gz "${OUTPUT_UNMAPPED_FASTQ}" -S "${OUTPUT_MAPPED_SAM}"

# The file "${OUTPUT_UNMAPPED_FASTQ}" contains the reads with repetitive elements removed,
# which can then be used for subsequent genomic alignment.
-
10
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables for clarity
# Reference: RepBase human repetitive element database
GENOME_DIR="/path/to/RepBase_human_database_file"
READS_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
READS_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
# Prefix for STAR's auxiliary output files (e.g., logs, unmapped reads)
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
# Destination for the unsorted BAM output from stdout
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# Execute STAR alignment
STAR --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READS_R1}" "${READS_R2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${OUTPUT_BAM}"
-
11
Takes output from STAR rmRep.
$ Bash example
# "rmRep" refers to the repetitive-element removal step (the STAR run against RepBase),
# not to duplicate removal. Reads that did not map to RepBase are written by
# --outReadsUnmapped Fastx as <prefix>Unmapped.out.mate1 and <prefix>Unmapped.out.mate2,
# and these two files are the input to genome mapping.
REP_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
UNMAPPED_MATE1="${REP_PREFIX}Unmapped.out.mate1"
UNMAPPED_MATE2="${REP_PREFIX}Unmapped.out.mate2"

# Confirm both files exist before starting genome mapping
ls -lh "${UNMAPPED_MATE1}" "${UNMAPPED_MATE2}"
-
12
Maps unique reads to the human genome.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with actual paths and filenames
GENOME_DIR="/path/to/star_genome_index_GRCh38"   # Directory containing STAR genome index for human GRCh38
READS_FILE="input_reads.fastq.gz"                # Input FASTQ file (gzipped)
OUTPUT_PREFIX="mapped_reads"                     # Prefix for output files
NUM_THREADS=8                                    # Number of threads to use (adjust based on available resources)

# Reference Genome: Human GRCh38 (or a specific build like hg38) is assumed.
# The STAR genome index (GENOME_DIR) must be pre-built using the human genome FASTA and GTF files.
# Example command to build index (run once):
# STAR --runMode genomeGenerate \
#   --genomeDir "${GENOME_DIR}" \
#   --genomeFastaFiles "/path/to/GRCh38.primary_assembly.fa" \
#   --sjdbGTFfile "/path/to/gencode.v44.annotation.gtf" \
#   --runThreadN "${NUM_THREADS}"

# Maps unique reads to the human genome
STAR --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS_FILE}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}_" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outSAMattributes Standard \
  --runThreadN "${NUM_THREADS}"

# Output BAM file will be: ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam
-
13
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Placeholder for STAR genome indices. Replace with the actual path to your reference genome.
GENOME_DIR="/path/to/STAR_genome_indices/GRCh38"

# Input FASTQ files (mate1 and mate2)
READ_FILE_MATE1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
READ_FILE_MATE2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"

# Output file prefix for STAR's auxiliary files (Log.out, SJ.out.tab, etc.)
# Note: The original command uses a .bam suffix in the prefix, which is unusual but reproduced here.
STAR_OUTPUT_FILE_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

# The final aligned BAM file, captured from STAR's standard output
ALIGNED_BAM_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

# Run STAR alignment
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_MATE1}" "${READ_FILE_MATE2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${STAR_OUTPUT_FILE_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${ALIGNED_BAM_FILE}"
-
14
Takes output from STAR genome mapping.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for reference genome and annotation files
# Replace with actual paths to your GRCh38 FASTA and GTF files
GENOME_FASTA="/path/to/GRCh38.fa"
GTF_FILE="/path/to/gencode.v38.annotation.gtf"
STAR_INDEX_DIR="/path/to/STAR_index/GRCh38"

# Placeholder for input FASTQ files
# Replace with actual paths to your R1 and R2 FASTQ files
INPUT_FASTQ_R1="input_R1.fastq.gz"
INPUT_FASTQ_R2="input_R2.fastq.gz"

# Placeholder for output prefix
OUTPUT_PREFIX="aligned_reads"

# Number of threads to use
NUM_THREADS=8

# Create STAR genome index (run once per genome version)
# This step is typically done before alignment and the index is reused.
# STAR --runThreadN "${NUM_THREADS}" \
#   --runMode genomeGenerate \
#   --genomeDir "${STAR_INDEX_DIR}" \
#   --genomeFastaFiles "${GENOME_FASTA}" \
#   --sjdbGTFfile "${GTF_FILE}" \
#   --sjdbOverhang 100   # Recommended for typical RNA-seq read lengths (e.g., 101bp)

# Perform genome mapping using STAR
# Parameters follow common RNA-seq alignment settings; they are illustrative and are not
# the settings of the recorded eCLIP command.
# --limitBAMsortRAM is ~30GB here; adjust based on available RAM.
STAR --runThreadN "${NUM_THREADS}" \
  --genomeDir "${STAR_INDEX_DIR}" \
  --readFilesIn "${INPUT_FASTQ_R1}" "${INPUT_FASTQ_R2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}_" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outSAMattributes Standard \
  --outFilterType BySJout \
  --outFilterMultimapNmax 20 \
  --alignSJDBoverhangMin 1 \
  --alignSJoverhangMin 8 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM 30000000000
-
15
Custom random-mer-aware script for PCR duplicate removal.
$ Bash example
# Install umi_tools if not already available
# conda install -c bioconda umi-tools

# umi_tools dedup is shown as a generic stand-in for the pipeline's custom
# random-mer-aware script (barcode_collapse_pe.py in the recorded command); the two
# tools are not guaranteed to give identical results.
# It assumes UMIs (random-mers) have already been extracted and appended to read names
# (e.g., by a prior 'umi_tools extract' step) and the input is a
# coordinate-sorted, indexed BAM file.

# Define input and output file paths
INPUT_BAM="path/to/your/aligned_and_sorted.bam"
OUTPUT_BAM="path/to/your/deduplicated.bam"
LOG_FILE="umi_tools_dedup.log"

# Execute umi_tools dedup
# --stdin: Input BAM file
# --stdout: Output BAM file with duplicates removed
# --method directional: Recommended method for UMI-based deduplication
# --umi-separator "_": Specifies the separator used when UMIs are in read names
# --log: Log file for deduplication statistics
# --paired: Use this flag if your data is paired-end (common for eCLIP)
umi_tools dedup \
  --stdin "${INPUT_BAM}" \
  --stdout "${OUTPUT_BAM}" \
  --method directional \
  --umi-separator "_" \
  --log "${LOG_FILE}" \
  --paired
-
16
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
$ Bash example
# barcode_collapse_pe.py is the custom script named in the recorded command.
# It is assumed here to be installed and on your PATH; where it is distributed is not
# stated in the record for this step.

# Define input and output files
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"

# Execute the barcode collapse command
barcode_collapse_pe.py \
  --bam "${INPUT_BAM}" \
  --out_file "${OUTPUT_BAM}" \
  --metrics_file "${METRICS_FILE}"
-
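The idea behind random-mer-aware duplicate removal is that two reads count as PCR duplicates only when they share both their mapping position and their random-mer; reads at the same position with different random-mers are kept as independent molecules. The following is a toy sketch of that idea on a small made-up table. It is not the logic of barcode_collapse_pe.py itself, and the function name, column layout, and read names are invented for illustration:

```shell
#!/usr/bin/env bash
# Keep the first record for each (chromosome, start, strand, random-mer) key.
# Input columns (whitespace-separated): read name, chromosome, start, strand, random-mer.
collapse_by_position_and_randomer() {
  awk '!seen[$2 SUBSEP $3 SUBSEP $4 SUBSEP $5]++'
}

# Made-up example records.
TOY_READS='readA chr1 1000 + ACGTACGTAC
readB chr1 1000 + ACGTACGTAC
readC chr1 1000 + TTGCATTGCA
readD chr1 2000 - ACGTACGTAC
readE chr1 2000 - ACGTACGTAC'

KEPT="$(printf '%s\n' "$TOY_READS" | collapse_by_position_and_randomer)"
printf '%s\n' "$KEPT"
```

Here readB and readE are dropped as duplicates of readA and readD, while readC survives because its random-mer differs even though it maps to the same position as readA.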
17
Takes output from barcode collapse PE.
$ Bash example
# barcode_collapse_pe.py writes a BAM file, so no realignment is needed at this point.
# RMDUP_BAM is the deduplicated BAM that the sorting step takes as input.
RMDUP_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"

# Optional sanity checks before sorting (requires samtools)
# conda install -c bioconda samtools
samtools quickcheck "${RMDUP_BAM}" && echo "BAM file looks intact"
samtools flagstat "${RMDUP_BAM}"   # read counts after duplicate removal
-
18
Sorts resulting bam file for use downstream.
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.10

# Sort the BAM file
# Replace 'input.bam' with your unsorted BAM file
# Replace 'output.sorted.bam' with your desired output sorted BAM file name
samtools sort -o output.sorted.bam input.bam
-
19
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
$ Bash example
# Define variables for input/output paths and resources
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
TMP_DIR="/full/path/to/files/.queue/tmp"
# Note: This path points to GATK's Queue.jar, which in some older GATK distributions
# might have bundled Picard classes.
PICARD_JAR_PATH="/path/to/gatk/dist/Queue.jar"

# Create temporary directory if it doesn't exist
mkdir -p "${TMP_DIR}"

# Execute Picard SortSam command
java -Xmx2048m \
  -XX:+UseParallelOldGC \
  -XX:ParallelGCThreads=4 \
  -XX:GCTimeLimit=50 \
  -XX:GCHeapFreeLimit=10 \
  -Djava.io.tmpdir="${TMP_DIR}" \
  -cp "${PICARD_JAR_PATH}" \
  net.sf.picard.sam.SortSam \
  INPUT="${INPUT_BAM}" \
  TMP_DIR="${TMP_DIR}" \
  OUTPUT="${OUTPUT_BAM}" \
  VALIDATION_STRINGENCY=SILENT \
  SO=coordinate \
  CREATE_INDEX=true
-
20
Takes output from sortSam, makes bam index for use downstream.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assume 'sorted.bam' is the output from sortSam
# This command creates an index file named 'sorted.bam.bai' in the same directory
samtools index sorted.bam
-
21
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
$ Bash example
# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

# Define input and output files
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"

# Execute samtools index command
samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"
-
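The recorded commands build each output name by appending a stage suffix to the previous file name. As a reading aid, the chain for one sample can be derived from the raw read-1 FASTQ name with plain parameter expansion. This is a sketch of the naming pattern seen in the commands above; the variable names are hypothetical and no tool is run:

```shell
#!/usr/bin/env bash
RAW_R1="/full/path/to/files/file_R1.C01.fastq.gz"

TRIM1="${RAW_R1}.adapterTrim.fastq.gz"        # cutadapt round 1
TRIM2="${TRIM1%.fastq.gz}.round2.fastq.gz"    # cutadapt round 2
REP_BAM="${TRIM2%.fastq.gz}.rep.bam"          # STAR against RepBase
RMREP_BAM="${TRIM2%.fastq.gz}.rmRep.bam"      # STAR against the genome
RMDUP_BAM="${RMREP_BAM%.bam}.rmDup.bam"       # barcode collapse
SORTED_BAM="${RMDUP_BAM%.bam}.sorted.bam"     # SortSam
SORTED_BAI="${SORTED_BAM}.bai"                # samtools index

printf '%s\n' "$TRIM1" "$TRIM2" "$REP_BAM" "$RMREP_BAM" "$RMDUP_BAM" "$SORTED_BAM" "$SORTED_BAI"
```

Because every name keeps the full history of suffixes, a file's processing stage can be read directly from its name.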
22
Takes inputs from multiple final bam files.
$ Bash example
# Install samtools if not available
# conda install -c bioconda samtools

# This command merges multiple sorted BAM files into a single sorted BAM file.
# Replace 'input_1.bam', 'input_2.bam', 'input_3.bam' with the actual paths to your input BAM files.
# Replace 'merged_output.bam' with the desired name for the merged output file.
# The '-o' flag is optional for samtools merge, as the output file is typically the first argument.
samtools merge merged_output.bam input_1.bam input_2.bam input_3.bam
-
23
Merges the two technical replicates for further downstream analysis.
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define input and output file paths
INPUT_REPLICATE_1="replicate1.bam"
INPUT_REPLICATE_2="replicate2.bam"
OUTPUT_MERGED_BAM="merged_replicates.bam"

# Merge the two technical replicate BAM files
samtools merge "${OUTPUT_MERGED_BAM}" "${INPUT_REPLICATE_1}" "${INPUT_REPLICATE_2}"
-
24
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
$ Bash example
# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

# Merge multiple sorted BAM files into a single merged BAM file
samtools merge /full/path/to/files/CombinedID.merged.bam \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
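When more than one pair of replicates has to be merged, the long input paths can be assembled from the sample IDs instead of being typed out. A minimal sketch, assuming the file naming seen in the recorded command (the IDs C01 and D08 come from that command; the variable names are hypothetical). The command is only printed, not executed:

```shell
#!/usr/bin/env bash
FILES_DIR="/full/path/to/files"
REPLICATE_IDS=(C01 D08)   # IDs as they appear in the recorded command
SUFFIX="fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"

# Output file first, then one input BAM per replicate.
MERGE_CMD=(samtools merge "${FILES_DIR}/CombinedID.merged.bam")
for id in "${REPLICATE_IDS[@]}"; do
  MERGE_CMD+=( "${FILES_DIR}/file_R1.${id}.${SUFFIX}" )
done

# Print the command for inspection; to run it, use: "${MERGE_CMD[@]}"
printf '%q ' "${MERGE_CMD[@]}"; echo
```

Keeping the command in an array avoids quoting problems if a path ever contains spaces.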
-
25
Takes output from sortSam, makes bam index for use downstream.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.10

# Assume the input sorted BAM file is named 'sorted.bam'
# This command creates an index file 'sorted.bam.bai' in the same directory.
samtools index sorted.bam
-
26
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
$ Bash example
# samtools is a widely used tool for manipulating SAM/BAM/CRAM files.
# Installation:
# conda install -c bioconda samtools
# or
# apt-get install samtools
samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
-
27
Takes output from sortSam.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# INPUT_BAM is the sorted BAM produced by the preceding steps
INPUT_BAM="sorted.bam"

# PCR duplicates were already removed by the random-mer-aware collapse step, so no
# further duplicate marking is done here. Confirm that the header declares
# coordinate sort order (SO:coordinate) before filtering reads from this file.
samtools view -H "${INPUT_BAM}" | grep '^@HD'
-
28
Outputs only the second read in each pair, for use with a single-stranded peak caller.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Read 2 is selected from the aligned BAM by its SAM flag (128 = second in pair),
# as in the recorded samtools view command for this pipeline.
# 'merged.bam' and 'merged.r2.bam' are placeholder file names.
samtools view -b -f 128 -o merged.r2.bam merged.bam
-
29
This is the final bam file to perform analysis on.
$ Bash example
# The description indicates a final BAM file ready for analysis.
# This typically implies the BAM file has been sorted and indexed.
# Samtools is the standard utility for these operations.

# Install samtools if not already available (e.g., via conda)
# conda install -c bioconda samtools

# Assuming 'input_unsorted.bam' is the BAM file after alignment and other processing steps
# and 'final.bam' is the desired output name for the sorted and indexed file.

# Sort the BAM file by coordinate
samtools sort -o final.bam input_unsorted.bam

# Index the sorted BAM file
samtools index final.bam
-
30
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Command: Filter BAM to keep only reads that are the second in a pair
# -h: Include header in the output
# -b: Output in BAM format
# -f 128: Only output reads with the FLAG 128 set (second in pair)
samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
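The `-f 128` filter works because the SAM FLAG field is a bit mask: bit 0x40 (64) marks the first read in a pair and bit 0x80 (128) marks the second. The sketch below shows the same bit test in plain shell arithmetic. The function name `mate_of_flag` is hypothetical and exists only for illustration.

```shell
# Hypothetical helper: classify a SAM FLAG value by its mate bits.
# 0x40 (64) = first in pair, 0x80 (128) = second in pair.
mate_of_flag() {
    flag="$1"
    if [ $(( flag & 128 )) -ne 0 ]; then
        echo "read2"    # would be kept by: samtools view -f 128
    elif [ $(( flag & 64 )) -ne 0 ]; then
        echo "read1"    # would be dropped by: samtools view -f 128
    else
        echo "other"    # neither mate bit is set
    fi
}

mate_of_flag 147   # 1 + 2 + 16 + 128: paired, proper pair, reverse strand, second in pair
mate_of_flag 99    # 1 + 2 + 32 + 64: paired, proper pair, mate on reverse strand, first in pair
```

The two example calls print `read2` and `read1` respectively, so only records like the first would end up in `CombinedID.merged.r2.bam`.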
-
31
Takes results from samtools view.
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.10

# Convert SAM to BAM
# This command takes a SAM file as input and outputs a compressed BAM file.
# -b: output BAM format
# -S: input is SAM format (optional, samtools can often infer)
# -h: include header in output
samtools view -bS -h input.sam > output.bam
-
32
Calls peaks on those files.
$ Bash example
# Install clipper (if not already installed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# pip install -r requirements.txt  # if requirements.txt exists and has dependencies

# Define input and output files
INPUT_BAM="input.r2.bam"   # Placeholder for the read-2-only BAM from the samtools view step
SPECIES="hg19"             # Genome build recorded for this dataset
OUTPUT_BED="peaks.bed"     # Placeholder for the output peak file

# Execute clipper to call peaks
# Options follow the command recorded for this dataset
# -b: Input BAM file
# -s: Species (genome build)
# -o: Output BED file name
clipper \
  -b "${INPUT_BAM}" \
  -s "${SPECIES}" \
  -o "${OUTPUT_BED}" \
  --bonferroni \
  --superlocal \
  --threshold-method binomial \
  --save-pickle
-
33
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
$ Bash example
# Install CLIPper (example using conda)
# conda create -n clipper_env python=3.8
# conda activate clipper_env
# conda install -c bioconda clipper
clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
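Once the peak file exists, a quick per-strand count is a cheap sanity check, since the peak caller is run on strand-specific read-2 data. The sketch below assumes the output is tab-separated BED with the strand in column 6. That layout is an assumption about the file, and `count_peaks_by_strand` is a hypothetical helper name.

```shell
# Hypothetical helper: count peaks per strand in a BED file.
# Assumes tab-separated columns with the strand ("+" or "-") in column 6.
count_peaks_by_strand() {
    awk -F'\t' '
        $6 == "+" { plus++ }
        $6 == "-" { minus++ }
        END { printf "plus=%d minus=%d\n", plus, minus }
    ' "$1"
}

# Example call with the placeholder output path from the command above
peaks="/full/path/to/files/CombinedID.merged.r2.peaks.bed"
if [ -f "$peaks" ]; then
    count_peaks_by_strand "$peaks"
fi
```

A file with peaks on only one strand would be worth a second look before further analysis.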
Raw Source Text
Library strategy: eCLIP-Seq

Takes output from raw files. Run to trim off both 5′ and 3′ adapters on both reads.
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

Takes output from cutadapt round 1. Run to trim off the 3′ adapters on read 2, to control for double ligation events.
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

Takes output from cutadapt round 2. Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

Takes output from STAR rmRep. Maps unique reads to the human genome.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

Takes output from STAR genome mapping. Custom random-mer-aware script for PCR duplicate removal.
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

Takes output from barcode collapse PE. Sorts resulting bam file for use downstream.
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

Takes inputs from multiple final bam files. Merges the two technical replicates for further downstream analysis.
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

Takes output from sortSam. Only outputs the second read in each pair for use with single stranded peak caller. This is the final bam file to perform analysis on.
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

Takes results from samtools view. Calls peaks on those files.
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

Genome_build: hg19

Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding, txt format contains counts of reads for both IP and Input for each gene in subtranscriptomic region, bigWigs are read densities for positive and negative strand genome wide
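The per-replicate file names in the commands above grow by one suffix per step, which makes it possible to tell from a file name how far a sample has progressed. The sketch below lists that chain for one R1 FASTQ path. The suffixes are copied from the commands in the raw text, while the function name `eclip_intermediates` is hypothetical.

```shell
# Hypothetical helper: print the intermediate file names derived from one
# R1 FASTQ path, in pipeline order. Suffixes are taken from the raw commands.
eclip_intermediates() {
    r1="$1"
    echo "${r1}.adapterTrim.fastq.gz"                        # cutadapt round 1
    echo "${r1}.adapterTrim.round2.fastq.gz"                 # cutadapt round 2
    echo "${r1}.adapterTrim.round2.rep.bam"                  # STAR vs. RepBase
    echo "${r1}.adapterTrim.round2.rmRep.bam"                # STAR vs. genome
    echo "${r1}.adapterTrim.round2.rmRep.rmDup.bam"          # barcode_collapse_pe.py
    echo "${r1}.adapterTrim.round2.rmRep.rmDup.sorted.bam"   # SortSam
}

# Example call with the placeholder path from the raw commands
eclip_intermediates "/full/path/to/files/file_R1.C01.fastq.gz"
```

The last name printed is the per-replicate BAM that `samtools merge` combines into `CombinedID.merged.bam`.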