GSE107766 Processing Pipeline
Publication
A protein-RNA interaction atlas of the ribosome biogenesis factor AATF. Scientific Reports (2019), PMID 31363146
Processing Steps
-
1
Library strategy: eCLIP-seq
$ Bash example
# --- Installation (commented out) ---
# conda create -n eclip_env python=3.8 r-base=4.0 star cutadapt picard samtools bedtools
# conda activate eclip_env
# pip install clipper
# git clone https://github.com/yeolab/merge_peaks.git
# export PATH=$PATH:$(pwd)/merge_peaks  # Add merge_peaks to PATH

# --- Configuration ---
# Placeholders use hg19 to match the -s hg19 in the pipeline's own CLIPPER command (step 33).
GENOME_DIR="/path/to/hg19_star_index"         # Placeholder: pre-built STAR index
GENOME_FASTA="/path/to/hg19.fa"               # Placeholder: reference genome FASTA
GTF_FILE="/path/to/hg19.gtf"                  # Placeholder: gene annotation GTF
GENOME_SIZE_FILE="/path/to/hg19.chrom.sizes"  # Placeholder: chromosome sizes (e.g., from UCSC goldenPath)
ADAPTERS="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"  # Common Illumina adapter; adjust if known
OUTPUT_DIR="eclip_output"
mkdir -p "$OUTPUT_DIR"

# --- Input Files (Placeholders) ---
R1="sample_R1.fastq.gz"
R2="sample_R2.fastq.gz"
CONTROL_R1="input_R1.fastq.gz"  # Assuming an input control
CONTROL_R2="input_R2.fastq.gz"

# --- Step 1: Adapter Trimming (using cutadapt) ---
echo "Step 1: Adapter Trimming with Cutadapt"
cutadapt -a "$ADAPTERS" -A "$ADAPTERS" \
  -o "$OUTPUT_DIR/trimmed_R1.fastq.gz" \
  -p "$OUTPUT_DIR/trimmed_R2.fastq.gz" \
  "$R1" "$R2" > "$OUTPUT_DIR/cutadapt_report.txt"
cutadapt -a "$ADAPTERS" -A "$ADAPTERS" \
  -o "$OUTPUT_DIR/control_trimmed_R1.fastq.gz" \
  -p "$OUTPUT_DIR/control_trimmed_R2.fastq.gz" \
  "$CONTROL_R1" "$CONTROL_R2" > "$OUTPUT_DIR/control_cutadapt_report.txt"

# --- Step 2: Alignment (using STAR) ---
echo "Step 2: Alignment with STAR"
STAR --genomeDir "$GENOME_DIR" \
  --readFilesIn "$OUTPUT_DIR/trimmed_R1.fastq.gz" "$OUTPUT_DIR/trimmed_R2.fastq.gz" \
  --readFilesCommand zcat \
  --outFileNamePrefix "$OUTPUT_DIR/sample_" \
  --outSAMtype BAM SortedByCoordinate \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 3 \
  --outFilterScoreMinOverLread 0.6 \
  --outFilterMatchNminOverLread 0.6 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --runThreadN 8
STAR --genomeDir "$GENOME_DIR" \
  --readFilesIn "$OUTPUT_DIR/control_trimmed_R1.fastq.gz" "$OUTPUT_DIR/control_trimmed_R2.fastq.gz" \
  --readFilesCommand zcat \
  --outFileNamePrefix "$OUTPUT_DIR/control_" \
  --outSAMtype BAM SortedByCoordinate \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 3 \
  --outFilterScoreMinOverLread 0.6 \
  --outFilterMatchNminOverLread 0.6 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --runThreadN 8

# Index BAM files
samtools index "$OUTPUT_DIR/sample_Aligned.sortedByCoord.out.bam"
samtools index "$OUTPUT_DIR/control_Aligned.sortedByCoord.out.bam"

# --- Step 3: Deduplication (using Picard MarkDuplicates) ---
echo "Step 3: Deduplication with Picard MarkDuplicates"
java -jar /path/to/picard.jar MarkDuplicates \
  I="$OUTPUT_DIR/sample_Aligned.sortedByCoord.out.bam" \
  O="$OUTPUT_DIR/sample_dedup.bam" \
  M="$OUTPUT_DIR/sample_dedup_metrics.txt" \
  REMOVE_DUPLICATES=true
java -jar /path/to/picard.jar MarkDuplicates \
  I="$OUTPUT_DIR/control_Aligned.sortedByCoord.out.bam" \
  O="$OUTPUT_DIR/control_dedup.bam" \
  M="$OUTPUT_DIR/control_dedup_metrics.txt" \
  REMOVE_DUPLICATES=true
samtools index "$OUTPUT_DIR/sample_dedup.bam"
samtools index "$OUTPUT_DIR/control_dedup.bam"

# --- Step 4: Peak Calling (using CLIPPER) ---
echo "Step 4: Peak Calling with CLIPPER"
# CLIPPER reads a BAM file directly via -b, as in the pipeline's own command (step 33).
# That command passes no control file; control_dedup.bam is kept for later comparison.
clipper -b "$OUTPUT_DIR/sample_dedup.bam" -s hg19 -o "$OUTPUT_DIR/sample_peaks.bed"

# --- Step 5: IDR (using merge_peaks) ---
echo "Step 5: IDR with merge_peaks"
# IDR needs peak calls from at least two replicates; run CLIPPER on each replicate first.
mkdir -p "$OUTPUT_DIR/clipper_peaks"
cp "$OUTPUT_DIR/sample_peaks.bed" "$OUTPUT_DIR/clipper_peaks/sample_rep1_peaks.bed"
# If you had a second replicate, you would copy it here:
# cp "$OUTPUT_DIR/sample_rep2_peaks.bed" "$OUTPUT_DIR/clipper_peaks/sample_rep2_peaks.bed"
# The call below is a placeholder: its script name and options are not taken from the
# merge_peaks documentation, so check the repository for the actual interface before running.
# python merge_peaks.py -i "$OUTPUT_DIR/clipper_peaks" \
#   -o "$OUTPUT_DIR/idr_output" \
#   -idr 0.05 \
#   -s "$GENOME_SIZE_FILE" \
#   --prefix "eCLIP_IDR"
-
2
Takes output from raw files.
$ Bash example
# Install FastQC (if not already installed)
# conda install -c bioconda fastqc

# Example usage: run FastQC on raw FASTQ files
# Replace sample_R1.fastq.gz and sample_R2.fastq.gz with your actual raw file names
# -o . writes the reports to the current directory
fastqc -o . sample_R1.fastq.gz sample_R2.fastq.gz
-
3
Run to trim off both 5' and 3' adapters on both reads.
$ Bash example
# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=4.0

# Define input and output files (placeholders)
READ1_IN="input_R1.fastq.gz"
READ2_IN="input_R2.fastq.gz"
READ1_OUT="output_R1.trimmed.fastq.gz"
READ2_OUT="output_R2.trimmed.fastq.gz"
LOG_FILE="cutadapt.log"

# Define adapter sequences (example Illumina TruSeq adapters)
# -a: 3' adapter on the forward read
# -A: 3' adapter on the reverse read
ADAPTER_FWD="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_REV="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Define trimming parameters
MIN_LEN=18         # Minimum length of read to keep
QUALITY_CUTOFF=20  # Quality cutoff at 3' end
ERROR_RATE=0.1     # Maximum error rate
OVERLAP=3          # Minimum overlap between read and adapter
THREADS=8          # Number of CPU threads

# Run cutadapt to trim adapters from paired-end reads
cutadapt \
  -a "${ADAPTER_FWD}" \
  -A "${ADAPTER_REV}" \
  -m "${MIN_LEN}" \
  -q "${QUALITY_CUTOFF}" \
  -e "${ERROR_RATE}" \
  --overlap "${OVERLAP}" \
  -j "${THREADS}" \
  -o "${READ1_OUT}" \
  -p "${READ2_OUT}" \
  "${READ1_IN}" "${READ2_IN}" \
  > "${LOG_FILE}" 2>&1
-
4
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
$ Bash example
# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

cutadapt \
  --quality-cutoff=6 \
  --minimum-length=18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
  /full/path/to/files/file_R1.C01.fastq.gz \
  /full/path/to/files/file_R2.C01.fastq.gz \
  > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
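The twenty -A adapters in this step appear to be every consecutive 15-nt window of two 27-nt sequences, AACTTGTAGATCGGAAGAGCGTCGTGT and AGGACCAAGATCGGAAGAGCGTCGTGT. This is read off the adapter list itself and is not stated by the source. A minimal sketch of generating such a list, assuming that reading is right; the helper name tile_adapter is hypothetical:

```bash
# Hypothetical helper: print every WIDTH-nt window of a sequence, one per line.
tile_adapter() {
    s=$1
    width=${2:-15}
    len=${#s}
    i=1
    while [ $((i + width - 1)) -le "$len" ]; do
        printf '%s\n' "$s" | cut -c "${i}-$((i + width - 1))"
        i=$((i + 1))
    done
}

# Tile both sequences, drop repeated windows (keeping the first occurrence),
# and prefix each window with -A so the result can be pasted into a cutadapt call.
adapter_args=$(
    { tile_adapter AACTTGTAGATCGGAAGAGCGTCGTGT
      tile_adapter AGGACCAAGATCGGAAGAGCGTCGTGT; } |
        awk '!seen[$0]++ { printf "-A %s ", $0 }'
)
echo "$adapter_args"
```

The windows come out grouped by source sequence rather than interleaved as in the command above, but the set of adapters is the same.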
-
5
Takes output from cutadapt round 1.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
# INPUT_FASTQ is the output from a previous cutadapt round (e.g., adapter trimming)
INPUT_FASTQ="input_from_cutadapt_round1.fastq.gz"
OUTPUT_FASTQ="output_cutadapt_round2.fastq.gz"

# Execute cutadapt for poly-A trimming, quality filtering, and minimum length filtering.
# This is a common second trimming step in eCLIP workflows after initial adapter trimming.
# -a "A{100}": Trims poly-A tails (up to 100 'A's).
# -q 20: Trims low-quality bases from the 3' end with a quality cutoff of 20.
# -m 18: Discards reads shorter than 18 bp after trimming.
cutadapt \
  -a "A{100}" \
  -q 20 \
  -m 18 \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"
-
6
Run to trim off the 3' adapters on read 2, to control for double ligation events.
$ Bash example
# Install cutadapt via conda
# conda install -c bioconda cutadapt=3.4

# Define input and output files
INPUT_R2="read2.fastq.gz"
OUTPUT_R2="read2_trimmed.fastq.gz"

# Define adapter sequence for 3' end of Read 2.
# This sequence (Illumina Small RNA 3' Adapter or similar) is commonly found
# at the 3' end of Read 2 in eCLIP and small RNA-seq libraries, often due to
# double ligation events or as part of the library preparation.
ADAPTER_SEQUENCE="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Define trimming parameters
THREADS=4           # Number of CPU cores to use for parallel processing
MIN_READ_LENGTH=18  # Discard reads shorter than this length after trimming
QUALITY_CUTOFF=20   # Trim low-quality bases from the 3' end using a Phred score cutoff

# Run cutadapt to trim the specified 3' adapter from Read 2
cutadapt \
  -a "${ADAPTER_SEQUENCE}" \
  -o "${OUTPUT_R2}" \
  --cores "${THREADS}" \
  --minimum-length "${MIN_READ_LENGTH}" \
  --quality-cutoff "${QUALITY_CUTOFF}" \
  "${INPUT_R2}"
-
7
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
$ Bash example
# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt

# Note: -f fastq dates from older cutadapt releases; newer releases detect the
# input format automatically and may not accept this option.
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 \
  --quality-cutoff 6 -m 18 \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz \
  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
  > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
-
8
Takes output from cutadapt round 2.
$ Bash example
# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
# INPUT_FASTQ is the output from cutadapt round 1
INPUT_FASTQ="sample_R1_round1_trimmed.fastq.gz"
OUTPUT_FASTQ="sample_R1_round2_trimmed.fastq.gz"

# Define trimming parameters for cutadapt round 2
# These are placeholders. Actual values depend on the specific adapters/barcodes
# to be removed in this second trimming step for eCLIP.
# For example, if round 1 removed sequencing adapters, round 2 might remove random barcodes.
# Replace 'YOUR_3PRIME_ADAPTER_SEQUENCE' and 'YOUR_5PRIME_ADAPTER_SEQUENCE' with actual sequences.
ADAPTER_3PRIME="YOUR_3PRIME_ADAPTER_SEQUENCE"  # Example: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
ADAPTER_5PRIME="YOUR_5PRIME_ADAPTER_SEQUENCE"  # Example: GTTCAGAGTTCTACAGTCCGACGATC
TRIM_5PRIME_BASES=0   # Number of fixed bases to remove from 5' end
TRIM_3PRIME_BASES=0   # Number of fixed bases to remove from 3' end (passed to cutadapt as a negative value)
QUALITY_THRESHOLD=20  # Phred quality score threshold for trimming
MIN_READ_LENGTH=18    # Minimum read length after trimming
THREADS=4             # Number of CPU threads to use

# Execute cutadapt round 2
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -g "${ADAPTER_5PRIME}" \
  -u "${TRIM_5PRIME_BASES}" \
  -u "-${TRIM_3PRIME_BASES}" \
  -q "${QUALITY_THRESHOLD}" \
  -m "${MIN_READ_LENGTH}" \
  --discard-untrimmed \
  -j "${THREADS}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"
-
9
Maps reads to a human-specific version of RepBase to remove repetitive elements; this helps control for spurious artifacts from rRNA and other repetitive reads.
$ Bash example
# Note: the pipeline's own command for this operation (step 10) aligns reads to a RepBase
# STAR index. Running RepeatMasker on the genome, as shown here, is a different way of
# handling repeats.

# Install RepeatMasker (if not already installed)
# RepeatMasker requires a repeat library, typically RepBase, which is often installed with it
# or separately configured. Ensure RepBase is configured for human species.
# For example, using conda:
# conda install -c bioconda repeatmasker

# Define input and output files
# Replace 'human_genome.fasta' with the actual path to your unmasked human reference genome.
GENOME_FASTA="human_genome.fasta"
OUTPUT_DIR="repeatmasker_output"
NUM_THREADS=8  # Adjust based on available CPU cores

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run RepeatMasker to identify and mask repetitive elements in the human genome
# -species human: Uses human-specific repeat libraries.
# -pa ${NUM_THREADS}: Specifies the number of processors to use for parallel execution.
# -dir ${OUTPUT_DIR}: Sets the directory where all output files will be stored.
# -xsmall: Returns repeats in lowercase (soft-masking) instead of replacing them with 'N's.
# -gff: Outputs the repeat annotations in GFF format, in addition to the standard .out and .tbl files.
RepeatMasker -species human -pa "${NUM_THREADS}" -dir "${OUTPUT_DIR}" -xsmall -gff "${GENOME_FASTA}"

# Expected output files in ${OUTPUT_DIR}:
# - human_genome.fasta.masked: The soft-masked genome FASTA file with repeats in lowercase.
# - human_genome.fasta.out: A detailed report of all identified repeats.
# - human_genome.fasta.tbl: A summary table of repeat classes and families.
# - human_genome.fasta.gff: Repeat annotations in GFF format (if -gff was used).
-
10
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
$ Bash example
# Install STAR using conda
# conda install -c bioconda star

# Define variables for input, output, and reference files
# The --genomeDir expects a STAR index built from the RepBase human sequences.
# Replace with the actual path to your STAR index.
GENOME_DIR="/path/to/RepBase_human_STAR_index"
READ1_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
READ2_FILE="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# The --outFileNamePrefix is used for other STAR output files (e.g., Log.out, SJ.out.tab).
# In this command, it's set to the same path as the output BAM, which means other files
# will be named like '/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamLog.out'.
# If you prefer these auxiliary files in a separate directory or with a different prefix,
# adjust this variable accordingly.
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# Execute the STAR alignment command
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ1_FILE}" "${READ2_FILE}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${OUTPUT_BAM}"
-
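With --outReadsUnmapped Fastx, STAR writes the reads it could not align to files whose names are the --outFileNamePrefix value with Unmapped.out.mate1 and Unmapped.out.mate2 appended directly, with no separator. That is why the genome-mapping command later in this pipeline reads files ending in rep.bamUnmapped.out.mate1. A small sketch of the name derivation; the variable names are illustrative:

```bash
# Prefix exactly as passed to --outFileNamePrefix in the repeat-mapping command.
prefix=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

# STAR appends its fixed file names to the prefix without adding a separator.
unmapped_mate1="${prefix}Unmapped.out.mate1"
unmapped_mate2="${prefix}Unmapped.out.mate2"
log_final="${prefix}Log.final.out"

printf '%s\n' "$unmapped_mate1" "$unmapped_mate2" "$log_final"
```

Ending the prefix with a dot or a slash would give more readable auxiliary file names, but the downstream command would then need the matching input names.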
11
Takes output from STAR rmRep.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.10

# Placeholder for input BAM from STAR alignment (e.g., from a previous STAR alignment step)
INPUT_BAM="star_aligned_reads.bam"
OUTPUT_DEDUP_BAM="star_aligned_reads.dedup.bam"

# Deduplication steps as typically performed in eCLIP pipelines (e.g., from yeolab/skipper)
# 1. Sort by read name
samtools sort -n "${INPUT_BAM}" -o "${INPUT_BAM%.bam}.namesort.bam"
# 2. Fixmate information
samtools fixmate -m "${INPUT_BAM%.bam}.namesort.bam" "${INPUT_BAM%.bam}.fixmate.bam"
# 3. Sort by coordinate
samtools sort "${INPUT_BAM%.bam}.fixmate.bam" -o "${INPUT_BAM%.bam}.positionsort.bam"
# 4. Mark and remove PCR duplicates
samtools markdup -r "${INPUT_BAM%.bam}.positionsort.bam" "${OUTPUT_DEDUP_BAM}"
# 5. Index the deduplicated BAM file
samtools index "${OUTPUT_DEDUP_BAM}"

# Optional: Clean up intermediate files
# rm "${INPUT_BAM%.bam}.namesort.bam" "${INPUT_BAM%.bam}.fixmate.bam" "${INPUT_BAM%.bam}.positionsort.bam"
-
12
Maps unique reads to the human genome.
BWA (Burrows-Wheeler Aligner) v0.7.17 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Note: the pipeline's own command for this operation (step 13) uses STAR;
# BWA-MEM is shown here only as an inferred stand-in.

# Install BWA and Samtools
# conda install -c bioconda bwa samtools

# Define reference genome and read files
REFERENCE_GENOME="/path/to/human_genome/GRCh38.fa"
READS_R1="reads_1.fastq.gz"
READS_R2="reads_2.fastq.gz"
OUTPUT_BAM="aligned_reads.sorted.bam"

# Index the reference genome (if not already indexed)
# This step only needs to be run once per reference genome
bwa index "${REFERENCE_GENOME}"

# Map reads to the human genome using BWA-MEM
# -t 8: Use 8 threads for alignment
# Pipe the SAM output directly to samtools for conversion to BAM and sorting, then index
bwa mem -t 8 "${REFERENCE_GENOME}" "${READS_R1}" "${READS_R2}" | \
  samtools view -bS - | \
  samtools sort -o "${OUTPUT_BAM}" -
samtools index "${OUTPUT_BAM}"
-
13
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
$ Bash example
# Reference genome directory placeholder. Replace with your actual STAR index path.
# Example: /path/to/your/STAR_index/GRCh38
GENOME_DIR="/path/to/STAR_database_file"

# Input read files: the reads left unmapped by the repeat-mapping STAR run.
# Replace with your actual input file paths.
READ_FILE_MATE1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
READ_FILE_MATE2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"

# Output prefix and final output file name.
# The command redirects stdout to the final BAM file, so the prefix only names the other
# STAR output files (e.g., Log.out, SJ.out.tab). It is set to the BAM path, as in the
# pipeline's own command.
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
FINAL_OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_MATE1}" "${READ_FILE_MATE2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${FINAL_OUTPUT_BAM}"
-
14
Takes output from STAR genome mapping.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables (replace with actual paths and names)
GENOME_DIR="/path/to/STAR_genome_index/hg38"  # Placeholder: hg38 genome index
READ1_FASTQ="input_read1.fastq.gz"
READ2_FASTQ="input_read2.fastq.gz"  # Omit if single-end
OUTPUT_PREFIX="aligned_reads"
NUM_THREADS=8  # Adjust as needed

# Run STAR genome mapping for paired-end reads
STAR \
  --runThreadN "${NUM_THREADS}" \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --sjdbScore 1 \
  --readFilesCommand zcat

# The output will be ${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam
-
15
Custom random-mer-aware script for PCR duplicate removal.
$ Bash example
# Note: the pipeline's own tool for this operation is barcode_collapse_pe.py (step 16);
# umi_tools is shown here as a stand-in.

# Install umi_tools
# conda install -c bioconda umi_tools

# Example usage for PCR duplicate removal using random-mers (UMIs).
# This command assumes that UMIs have already been extracted from the reads
# and appended to the read names (e.g., by a preceding `umi_tools extract` step).
# The UMI is expected to be separated from the original read name by a colon.
# The 'directional' method is generally recommended for its robustness.
# Replace 'input.bam' with your aligned BAM file containing UMIs in read names.
# Replace 'output.dedup.bam' with your desired output deduplicated BAM file name.
# Replace 'output.dedup.log' with your desired log file name.
umi_tools dedup \
  --stdin input.bam \
  --stdout output.dedup.bam \
  --umi-separator ":" \
  --method directional \
  --paired \
  --log output.dedup.log
-
16
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
barcode_collapse_pe.py (part of Skipper pipeline), latest version (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install Miniconda if not already installed
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
# export PATH="$HOME/miniconda/bin:$PATH"
# conda init bash
# source ~/.bashrc

# Clone the Skipper repository
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create and activate the conda environment (assuming environment.yml is available in the skipper directory)
# conda env create -f environment.yml
# conda activate skipper_env  # or the name specified in environment.yml

# Define variables for input and output files
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"

# Execute the barcode_collapse_pe.py script
# Assuming barcode_collapse_pe.py is in the current PATH or specified with its full path
# If running from the cloned skipper directory, the path would be ./scripts/barcode_collapse_pe.py
python barcode_collapse_pe.py \
  --bam "${INPUT_BAM}" \
  --out_file "${OUTPUT_BAM}" \
  --metrics_file "${METRICS_FILE}"
-
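The idea behind random-mer-aware duplicate removal can be shown on a toy table: reads that share a mapping position and strand are treated as PCR duplicates only if they also carry the same random-mer. The sketch below illustrates that idea with made-up columns and data. It is not the logic of barcode_collapse_pe.py, which the source does not show.

```bash
# Toy input (hypothetical): chrom, start, strand, random-mer, read name.
reads=$(printf '%s\n' \
    'chr1 100 + ACGTA read1' \
    'chr1 100 + ACGTA read2' \
    'chr1 100 + TTGCA read3' \
    'chr1 250 - ACGTA read4')

# Keep the first read seen for each (chrom, start, strand, random-mer) key.
deduped=$(printf '%s\n' "$reads" | awk '!seen[$1, $2, $3, $4]++')
printf '%s\n' "$deduped"
```

Here read2 is dropped as a duplicate of read1, while read3 survives because its random-mer differs even though it maps to the same position.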
17
Takes output from barcode collapse PE.
$ Bash example
# Note: in the pipeline's own command (step 16), barcode_collapse_pe.py writes a BAM file,
# which the following steps sort and index. The FASTQ names below are placeholders.

# Install STAR (example using conda)
# conda install -c bioconda star=2.7.9a

# Define variables
# Placeholder for STAR genome index, typically built from a reference genome like hg38.
# Example: /path/to/STAR_index/hg38_gencode_v38
GENOME_DIR="/path/to/STAR_index/hg38"
READ1="sample_R1.collapsed.fastq.gz"  # Placeholder input
READ2="sample_R2.collapsed.fastq.gz"  # Placeholder input
OUTPUT_PREFIX="sample_aligned"        # Prefix for output files
THREADS=8                             # Adjust based on available resources

# Run STAR alignment for paired-end reads
# --limitBAMsortRAM: example value of 30 GB of RAM for sorting; adjust as needed
STAR --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --runThreadN "${THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --twopassMode Basic \
  --limitBAMsortRAM 30000000000
-
18
Sorts resulting bam file for use downstream.
samtools v1.19 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.19

# Sort the BAM file by coordinate (default behavior)
samtools sort -o sorted_output.bam input.bam
-
19
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
$ Bash example
# Install Picard (e.g., via conda or by downloading the jar)
# conda install -c bioconda picard
# Or download the latest release from https://github.com/broadinstitute/picard/releases
# Note: net.sf.picard.sam.SortSam is the legacy Picard class name, loaded here from the
# GATK Queue.jar; current Picard releases name the class picard.sam.SortSam.
# Ensure Java is installed. -XX:+UseParallelOldGC was deprecated in JDK 14, so an older
# JDK (e.g., 8 or 11) is the safer choice for this exact command.

# Define variables for clarity
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
TMP_DIR="/full/path/to/files/.queue/tmp"
QUEUE_JAR_PATH="/path/to/gatk/dist/Queue.jar"

# Create temporary directory if it doesn't exist
mkdir -p "${TMP_DIR}"

# Execute Picard SortSam command
java -Xmx2048m \
  -XX:+UseParallelOldGC \
  -XX:ParallelGCThreads=4 \
  -XX:GCTimeLimit=50 \
  -XX:GCHeapFreeLimit=10 \
  -Djava.io.tmpdir="${TMP_DIR}" \
  -cp "${QUEUE_JAR_PATH}" \
  net.sf.picard.sam.SortSam \
  INPUT="${INPUT_BAM}" \
  TMP_DIR="${TMP_DIR}" \
  OUTPUT="${OUTPUT_BAM}" \
  VALIDATION_STRINGENCY=SILENT \
  SO=coordinate \
  CREATE_INDEX=true
-
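The garbage-collector flags in the SortSam command date from older JVMs: -XX:+UseParallelOldGC was deprecated in JDK 14. A hedged sketch of picking GC flags from a Java version string follows. The function names java_major and gc_flags_for are hypothetical, and substituting -XX:+UseParallelGC on newer JVMs is an assumption rather than part of the original pipeline.

```bash
# Extract the major version from a Java version string such as 1.8.0_292 or 17.0.2.
java_major() {
    case $1 in
        1.*) echo "$1" | cut -d. -f2 ;;  # legacy scheme: 1.8.0_292 -> 8
        *)   echo "$1" | cut -d. -f1 ;;  # current scheme: 17.0.2 -> 17
    esac
}

# Print GC flags suited to the given Java version (assumed substitution, see above).
gc_flags_for() {
    if [ "$(java_major "$1")" -lt 14 ]; then
        echo "-XX:+UseParallelOldGC -XX:ParallelGCThreads=4"
    else
        echo "-XX:+UseParallelGC -XX:ParallelGCThreads=4"
    fi
}

gc_flags_for 1.8.0_292
gc_flags_for 17.0.2
```

The version string itself would come from the local installation, for example from the first line of `java -version`.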
20
Takes output from sortSam, makes bam index for use downstream.
samtools index v1.19.1 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'sorted.bam' is the output from sortSam
# Replace 'sorted.bam' with the actual path to your sorted BAM file
samtools index sorted.bam
-
21
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
$ Bash example
# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

samtools index \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
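Every per-sample file name in the commands up to this point is built from the read 1 FASTQ path plus a growing suffix. A minimal sketch of that naming chain; the variable names are illustrative, and the suffixes are copied from the commands in this document:

```bash
# Read 1 FASTQ path as used in the commands in this document.
r1=/full/path/to/files/file_R1.C01.fastq.gz
stem="${r1}.adapterTrim"

trim1="${stem}.fastq.gz"                           # cutadapt round 1 output
trim2="${stem}.round2.fastq.gz"                    # cutadapt round 2 output
rep_bam="${stem}.round2.rep.bam"                   # STAR vs. RepBase
rmrep_bam="${stem}.round2.rmRep.bam"               # STAR vs. genome
rmdup_bam="${stem}.round2.rmRep.rmDup.bam"         # barcode_collapse_pe.py
sorted_bam="${stem}.round2.rmRep.rmDup.sorted.bam" # SortSam
sorted_bai="${sorted_bam}.bai"                     # samtools index

printf '%s\n' "$trim1" "$trim2" "$rep_bam" "$rmrep_bam" \
    "$rmdup_bam" "$sorted_bam" "$sorted_bai"
```

Deriving the names once and reusing the variables avoids the copy-and-paste slips that long literal paths invite.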
-
22
Takes inputs from multiple final bam files.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.19

# Merge multiple BAM files into a single output BAM file.
# Replace input1.bam, input2.bam, etc., with your actual BAM file paths.
# Replace merged_output.bam with your desired output file name.
samtools merge merged_output.bam input1.bam input2.bam input3.bam
-
23
Merges the two technical replicates for further downstream analysis.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge two technical replicate BAM files
# Replace replicate1.bam and replicate2.bam with actual input files
# Replace merged_replicates.bam with the desired output file name
samtools merge merged_replicates.bam replicate1.bam replicate2.bam
-
24
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge multiple sorted BAM files into a single sorted BAM file
samtools merge /full/path/to/files/CombinedID.merged.bam \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
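The two inputs to this merge differ only in the ID embedded in the file name (C01 and D08). A small sketch that assembles the same command from a list of IDs and prints it instead of running it; the variable names are illustrative, and the paths are assumed to contain no spaces:

```bash
dir=/full/path/to/files
suffix=fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

# Build the space-separated list of input BAMs from the IDs in the file names.
inputs=""
for id in C01 D08; do
    inputs="${inputs}${inputs:+ }${dir}/file_R1.${id}.${suffix}"
done

# Assemble the command as text so it can be inspected before it is run.
merge_cmd="samtools merge ${dir}/CombinedID.merged.bam ${inputs}"
echo "$merge_cmd"
```

Adding a replicate then means adding one ID to the loop rather than another long path to the command line.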
-
25
Takes output from sortSam, makes bam index for use downstream.
samtools v1.10 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.10

# Create BAM index for the sorted BAM file
# Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam
-
26
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
-
27
Takes output from sortSam.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.19.1

# This step takes a sorted BAM file (output from sortSam) and creates an index file (.bai).
# The index file is essential for many downstream applications that require random access to reads in the BAM file.
# Replace 'sorted.bam' with the actual name of your sorted BAM file.
samtools index sorted.bam
-
28
Only outputs the second read in each pair for use with single stranded peak caller.
reformat.sh (part of BBMap suite) v38.90 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Note: the pipeline's own command for this operation (step 30) is
#   samtools view -hb -f 128 on the merged BAM file.
# The reformat.sh call below works on FASTQ instead, and its r1=/r2= switches
# have not been checked against the BBMap documentation.

# Install BBMap (contains reformat.sh)
# conda install -c bioconda bbmap

# Define input and output files
# This command assumes the input is an interleaved FASTQ file containing paired-end reads.
INPUT_INTERLEAVED_FASTQ="input_interleaved_reads.fastq.gz"
OUTPUT_R2_FASTQ="output_second_reads.fastq.gz"

# Intended to output only the second read in each pair for use with a single-stranded peak caller.
reformat.sh in="${INPUT_INTERLEAVED_FASTQ}" out="${OUTPUT_R2_FASTQ}" r1=f r2=t
-
29
This is the final bam file to perform analysis on.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assume 'final.bam' is the input BAM file that has undergone alignment, sorting, and duplicate removal.
# Index the final BAM file to enable fast random access for downstream analysis tools.
samtools index final.bam
-
30
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools
samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
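The `-f 128` filter keeps only alignments whose SAM FLAG has bit 0x80 ("second in pair") set. A minimal sketch of that bit test in plain shell arithmetic, using a hypothetical helper `is_read2`; the flag values 99, 147, 83 and 163 are the usual combinations for properly paired reads and are chosen here purely for illustration:

```shell
# Hypothetical helper: succeeds when the SAM FLAG has bit 128 (0x80,
# "second in pair") set, the same bit that `samtools view -f 128` requires.
is_read2() {
    [ $(( $1 & 128 )) -ne 0 ]
}

# 99 and 83 are typical read-1 flags; 147 and 163 are typical read-2 flags.
for flag in 99 147 83 163; do
    if is_read2 "$flag"; then
        echo "$flag: read 2 (kept by -f 128)"
    else
        echo "$flag: not read 2 (dropped by -f 128)"
    fi
done
```

In samtools, `-f` requires every listed bit to be set, while `-F` excludes alignments that have any of the listed bits set.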
-
31
Takes results from samtools view.
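The hand-off between the read-2 extraction and the peak caller can be sketched as a dry run that derives each file name from the merged BAM and prints the commands rather than executing them. The variable names are illustrative; the paths and flags are copied from the commands recorded for this dataset.

```shell
# Dry run: build the file names for the read-2 extraction and the peak
# calling step from the merged BAM path, then print the commands.
MERGED_BAM="/full/path/to/files/CombinedID.merged.bam"
R2_BAM="${MERGED_BAM%.bam}.r2.bam"        # strip ".bam", append ".r2.bam"
PEAKS_BED="${R2_BAM%.bam}.peaks.bed"      # strip ".bam", append ".peaks.bed"

echo "samtools view -hb -f 128 $MERGED_BAM > $R2_BAM"
echo "clipper -b $R2_BAM -s hg19 -o $PEAKS_BED --bonferroni --superlocal --threshold-method binomial --save-pickle"
```

Printing the commands first makes it easy to check that the output of one step is exactly the input of the next before anything is run.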
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define input and output file paths
INPUT_BAM="input_aligned_reads.bam" # Placeholder for an input BAM file, e.g., the read-2 BAM from the previous step
OUTPUT_SAM="output_viewed_reads.sam" # Placeholder for an output SAM file

# Illustrative only: the recorded pipeline passes the read-2 BAM directly to the peak caller.
# This command dumps a BAM file as SAM text, including the header (-h),
# which is useful for inspecting the reads before peak calling.
samtools view -h "${INPUT_BAM}" > "${OUTPUT_SAM}"
-
32
Calls peaks on those files.
clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
# Install clipper (if not already installed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install

# Placeholder for input BAM file (aligned reads) and output peak file
INPUT_BAM="aligned_reads.bam"
OUTPUT_PEAKS="peaks.bed"

# The -s option names the species/genome build, not a genome size.
# The command recorded for this dataset uses hg19.
SPECIES="hg19"

# Execute clipper to call peaks (assumes the 'clipper' executable is on PATH after installation)
clipper -b "${INPUT_BAM}" -s "${SPECIES}" -o "${OUTPUT_PEAKS}"
-
33
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
$ Bash example
# Install CLIPper (if not already installed)
# CLIPper is a Python-based tool. Installation typically involves cloning the repository
# and running the setup script, or ensuring the main script is executable and in your PATH.
# Example installation (adjust if 'clipper' is not directly in PATH):
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install --user # Install to user's local site-packages
# # Or, if you just want to run the script directly:
# # python /path/to/clipper/clipper.py ...
clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
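Once the peak file exists, a quick per-strand tally is a cheap sanity check. A minimal sketch, assuming the output follows the standard BED6 layout with the strand in column 6 (an assumption about the peak file, not something the source states); `count_peaks_by_strand` is a hypothetical helper and the three records are made up for illustration:

```shell
# Hypothetical helper: count peaks per strand, assuming a tab-separated
# BED6 file with the strand in column 6.
count_peaks_by_strand() {
    awk -F '\t' '$6 == "+" { plus++ } $6 == "-" { minus++ }
         END { printf "plus=%d minus=%d\n", plus, minus }' "$1"
}

# Made-up example records, for illustration only.
demo_bed=$(mktemp)
printf 'chr1\t100\t150\tpeak_a\t0\t+\n' >  "$demo_bed"
printf 'chr1\t300\t360\tpeak_b\t0\t-\n' >> "$demo_bed"
printf 'chr2\t500\t540\tpeak_c\t0\t+\n' >> "$demo_bed"
count_peaks_by_strand "$demo_bed"
rm -f "$demo_bed"
```

On a real run, the helper would be pointed at the peak file written by the command above, e.g. `/full/path/to/files/CombinedID.merged.r2.peaks.bed`.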
Raw Source Text
Library strategy: eCLIP-seq
Takes output from raw files. Run to trim off both 5' and 3' adapters on both reads.
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events.
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2. Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep. Maps unique reads to the human genome.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
Takes output from STAR genome mapping. Custom random-mer-aware script for PCR duplicate removal.
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE. Sorts resulting bam file for use downstream.
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files. Merges the two technical replicates for further downstream analysis.
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam. Only outputs the second read in each pair for use with single stranded peak caller. This is the final bam file to perform analysis on.
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view. Calls peaks on those files.
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding