GSE107767 Processing Pipeline
Publication
A protein-RNA interaction atlas of the ribosome biogenesis factor AATF.Scientific reports (2019) — PMID 31363146
Processing Steps
Generate Jupyter Notebook-
1
Library strategy: eCLIP-seq
$ Bash example
#!/bin/bash # This script outlines a typical eCLIP-seq data processing pipeline, # drawing from the Yeo lab's eCLIP CWL workflow (https://github.com/yeolab/eclip). # It includes adapter trimming, alignment, deduplication, and peak calling. # For IDR (Identifying Reproducible Peaks), the merge_peaks.py tool is recommended. # --- Configuration --- # Placeholder paths for reference genome and input files. # Please replace with actual paths on your system. # Reference Genome: Human hg38 (GRCh38) is used as a placeholder. # Ensure you have a STAR index built for your chosen genome. GENOME_FASTA="/path/to/reference/genome/hg38.fa" STAR_INDEX_DIR="/path/to/reference/genome/STAR_index/hg38" # Input FASTQ files (example: single-end eCLIP and corresponding input control) INPUT_FASTQ_SAMPLE="sample_eclip_rep1.fastq.gz" INPUT_FASTQ_CONTROL="sample_input_rep1.fastq.gz" # Output prefixes SAMPLE_PREFIX="sample_eclip_rep1" CONTROL_PREFIX="sample_input_rep1" # Adapter sequence (example: Illumina TruSeq, verify with your library prep kit) ADAPTER="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # --- 1. Adapter Trimming (using Cutadapt) --- # Cutadapt is used to remove sequencing adapters and low-quality bases. # Installation: # conda install -c bioconda cutadapt echo "Trimming adapters for ${INPUT_FASTQ_SAMPLE}..." cutadapt -a "${ADAPTER}" -o "${SAMPLE_PREFIX}_trimmed.fastq.gz" "${INPUT_FASTQ_SAMPLE}" > "${SAMPLE_PREFIX}_cutadapt.log" 2>&1 echo "Trimming adapters for ${INPUT_FASTQ_CONTROL}..." cutadapt -a "${ADAPTER}" -o "${CONTROL_PREFIX}_trimmed.fastq.gz" "${INPUT_FASTQ_CONTROL}" > "${CONTROL_PREFIX}_cutadapt.log" 2>&1 # --- 2. Alignment (using STAR) --- # STAR is a splice-aware aligner suitable for CLIP-seq data. # Installation: # conda install -c bioconda star echo "Aligning ${SAMPLE_PREFIX}_trimmed.fastq.gz to genome..." STAR --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${SAMPLE_PREFIX}_trimmed.fastq.gz" \ --outFileNamePrefix "${SAMPLE_PREFIX}_aligned_" \ --outSAMtype BAM SortedByCoordinate \ --runThreadN 8 # Adjust number of threads as appropriate echo "Aligning ${CONTROL_PREFIX}_trimmed.fastq.gz to genome..." STAR --genomeDir "${STAR_INDEX_DIR}" \ --readFilesIn "${CONTROL_PREFIX}_trimmed.fastq.gz" \ --outFileNamePrefix "${CONTROL_PREFIX}_aligned_" \ --outSAMtype BAM SortedByCoordinate \ --runThreadN 8 # Adjust number of threads as appropriate # --- 3. Deduplication (using Picard MarkDuplicates) --- # Deduplication is crucial for CLIP-seq to remove PCR duplicates. # Installation: # conda install -c bioconda picard echo "Deduplicating ${SAMPLE_PREFIX}_aligned_Aligned.sortedByCoord.out.bam..." java -jar /path/to/picard.jar MarkDuplicates \ I="${SAMPLE_PREFIX}_aligned_Aligned.sortedByCoord.out.bam" \ O="${SAMPLE_PREFIX}_dedup.bam" \ M="${SAMPLE_PREFIX}_dedup_metrics.txt" \ REMOVE_DUPLICATES=true \ AS=true # Assume sorted input echo "Deduplicating ${CONTROL_PREFIX}_aligned_Aligned.sortedByCoord.out.bam..." java -jar /path/to/picard.jar MarkDuplicates \ I="${CONTROL_PREFIX}_aligned_Aligned.sortedByCoord.out.bam" \ O="${CONTROL_PREFIX}_dedup.bam" \ M="${CONTROL_PREFIX}_dedup_metrics.txt" \ REMOVE_DUPLICATES=true \ AS=true # Index BAM files for downstream tools like CLIPper # Installation: # conda install -c bioconda samtools echo "Indexing BAM files..." samtools index "${SAMPLE_PREFIX}_dedup.bam" samtools index "${CONTROL_PREFIX}_dedup.bam" # --- 4. Peak Calling (using CLIPper) --- # CLIPper is a dedicated peak caller for CLIP-seq data. # Installation: # conda install -c bioconda clipper echo "Calling peaks for ${SAMPLE_PREFIX} using CLIPper..." clipper -s hg38 \ -i "${SAMPLE_PREFIX}_dedup.bam" \ -c "${CONTROL_PREFIX}_dedup.bam" \ -o "${SAMPLE_PREFIX}_peaks.bed" \ -v # Verbose output # --- 5. IDR (Identifying Reproducible Peaks - using merge_peaks.py) --- # This step requires multiple replicates (e.g., rep1_peaks.bed, rep2_peaks.bed). # The merge_peaks.py script is part of the Yeo lab's custom IDR pipeline. # Source: https://github.com/yeolab/merge_peaks # Installation: # git clone https://github.com/yeolab/merge_peaks.git # # cd merge_peaks && python setup.py install # # Example command (assuming you have peak files from two replicates): # echo "Performing IDR with merge_peaks.py for reproducible peaks..." # python /path/to/merge_peaks/merge_peaks.py \ # -i rep1_peaks.bed rep2_peaks.bed \ # -o reproducible_peaks.bed \ # --idr_threshold 0.1 # Example IDR threshold -
2
Takes output from raw files.
$ Bash example
# Install FastQC (if not already installed) # conda install -c bioconda fastqc # Run FastQC on raw sequencing reads to generate quality reports # Replace 'raw_reads.fastq.gz' with your actual input FASTQ file(s) # Replace 'output_directory' with your desired output location for reports fastqc raw_reads.fastq.gz -o output_directory
-
3
Run to trim off both 5â and 3â adapters on both reads.
$ Bash example
# Install cutadapt (e.g., using conda) # conda create -n cutadapt_env cutadapt=4.0 -y # conda activate cutadapt_env # Define input and output files READ1_IN="read1.fastq.gz" READ2_IN="read2.fastq.gz" READ1_OUT="trimmed_read1.fastq.gz" READ2_OUT="trimmed_read2.fastq.gz" REPORT_FILE="cutadapt_report.txt" # Define 3' adapter sequences (Illumina TruSeq, common for eCLIP/RNA-seq) # These are the sequences that appear at the 3' end of the read. ADAPTER_R1_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Adapter for Read 1 (3' end) ADAPTER_R2_3PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Adapter for Read 2 (3' end) # Define 5' adapter sequences (if applicable and known) # These are sequences that appear at the 5' end of the read. # For eCLIP, this might be a random barcode/UMI or a specific library adapter. # IMPORTANT: Replace "YOUR_5PRIME_ADAPTER_R1_SEQUENCE" and "YOUR_5PRIME_ADAPTER_R2_SEQUENCE" # with the actual 5' adapter sequences used in your library preparation. # If no specific 5' adapter sequence needs to be trimmed, these parameters can be omitted. ADAPTER_R1_5PRIME="YOUR_5PRIME_ADAPTER_R1_SEQUENCE" # Adapter for Read 1 (5' end) ADAPTER_R2_5PRIME="YOUR_5PRIME_ADAPTER_R2_SEQUENCE" # Adapter for Read 2 (5' end) # Number of CPU cores to use NUM_CORES=8 # Run cutadapt to trim 5' and 3' adapters, perform quality trimming, and filter by length # -g: 5' adapter for R1 (searches from the beginning of the read) # -G: 5' adapter for R2 (searches from the beginning of the read) # -a: 3' adapter for R1 (searches from the end of the read) # -A: 3' adapter for R2 (searches from the end of the read) # -q: Quality cutoff (e.g., 20 for Phred score 20) # -m: Minimum length of reads after trimming # -o: Output file for R1 # -p: Output file for R2 # --cores: Number of CPU cores # --report=full: Generate a detailed report cutadapt \ -g "${ADAPTER_R1_5PRIME}" \ -G "${ADAPTER_R2_5PRIME}" \ -a "${ADAPTER_R1_3PRIME}" \ -A "${ADAPTER_R2_3PRIME}" \ -q 20 \ -m 15 \ --cores "${NUM_CORES}" \ -o "${READ1_OUT}" \ -p "${READ2_OUT}" \ "${READ1_IN}" \ "${READ2_IN}" \ > "${REPORT_FILE}" 2>&1 -
4
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
$ Bash example
# This tool appears to be a custom script or a specific subcommand within a larger bioinformatics suite. # Installation instructions are not available for 'quality-cutoff' as a standalone package. # Please refer to the specific pipeline documentation where this command is used for setup instructions. quality-cutoff 6 \ -m 18 \ -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \ -g CTTCCGATCTACAAGTT \ -g CTTCCGATCTTGGTCCT \ -A AACTTGTAGATCGGA \ -A AGGACCAAGATCGGA \ -A ACTTGTAGATCGGAA \ -A AGGACCAAGATCGGAA \ -A CTTGT AGATCGGAAG \ -A GACCAAGATCGGAAG \ -A TTGTAGATCGGAAGA \ -A ACCAAGATCGGAAGA \ -A TGTAGATCGGAAGAG \ -A CCAAGATCGGAAGAG \ -A GTAGATCGGAAGAGC \ -A CAAGATCGGAAGAGC \ -A TAGATCGGAAGAGCG \ -A AAGATCGGAAGAGCG \ -A AGATCGGAAGAGCGT \ -A GATCGGAAGAGCGTC \ -A ATCGGAAGAGCGTCG \ -A TCGGAAGAGCGTCGT \ -A CGGAAGAGCGTCGTG \ -A GGAAGAGCGTCGTGT \ -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \ -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \ /full/path/to/files/file_R1.C01.fastq.gz \ /full/path/to/files/file_R2.C01.fastq.gz \ > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
-
5
Takes output from cutadapt round 1.
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=4.0 # Define input and output files INPUT_FASTQ="input_from_round1.fastq.gz" OUTPUT_FASTQ="trimmed_round2.fastq.gz" ADAPTER_SEQUENCE="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Example adapter for a second round of trimming in eCLIP (e.g., small RNA 3' adapter from yeolab/eclip CWL) # Execute cutadapt for a second round of trimming # Parameters inferred from typical eCLIP pipelines (e.g., yeolab/eclip or yeolab/skipper) # -a: 3' adapter sequence # -e: Maximum error rate # -q: Quality cutoff for 3' end trimming # -m: Minimum read length after trimming # -o: Output file cutadapt -a "${ADAPTER_SEQUENCE}" \ -e 0.1 \ -q 20 \ -m 18 \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" -
6
Run to trim off the 3â adapters on read 2, to control for double ligation events.
cutadapt (Inferred with models/gemini-2.5-flash) v3.4 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install cutadapt (example using conda) # conda install -c bioconda cutadapt=3.4 # Define input and output files (placeholders) # INPUT_READ2: Path to the input FASTQ file for read 2 # OUTPUT_TRIMMED_READ2: Path for the output FASTQ file after trimming # LOG_FILE: Path for the log file generated by cutadapt INPUT_READ2="read2.fastq.gz" OUTPUT_TRIMMED_READ2="read2_trimmed.fastq.gz" LOG_FILE="cutadapt_trim_r2.log" # 3' adapter sequence for read 2, commonly used in eCLIP assays # This adapter sequence (AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC) is derived from the # Illumina TruSeq Small RNA 3' Adapter, which can ligate to itself or other # adapters in double ligation events in eCLIP. ADAPTER_R2="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Run cutadapt to trim the 3' adapter from read 2 # -a ADAPTER: Specifies a 3' adapter to be removed from the reads. # This option searches for the adapter sequence at the 3' end of the read. # -o: Specifies the output file for the trimmed reads. cutadapt -a "${ADAPTER_R2}" -o "${OUTPUT_TRIMMED_READ2}" "${INPUT_READ2}" > "${LOG_FILE}" 2>&1 -
7
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
$ Bash example
# Install cutadapt (example using conda) # conda install -c bioconda cutadapt # Define input and output files INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz" INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz" OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz" OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz" METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics" # Define adapters (these are likely fragments of the Illumina universal adapter or specific library adapters) # The command specifies multiple -A arguments, which are 3' adapters for the R1 read. # Cutadapt will also search for the reverse complement of these adapters in the R2 read. ADAPTERS=( "AACTTGTAGATCGGA" "AGGACCAAGATCGGA" "ACTTGTAGATCGGAA" "GGACCAAGATCGGAA" "CTTGTAGATCGGAAG" "GACCAAGATCGGAAG" "TTGTAGATCGGAAGA" "ACCAAGATCGGAAGA" "TGTAGATCGGAAGAG" "CCAAGATCGGAAGAG" "GTAGATCGGAAGAGC" "CAAGATCGGAAGAGC" "TAGATCGGAAGAGCG" "AAGATCGGAAGAGCG" "AGATCGGAAGAGCGT" "GATCGGAAGAGCGTC" "ATCGGAAGAGCGTCG" "TCGGAAGAGCGTCGT" "CGGAAGAGCGTCGTG" "GGAAGAGCGTCGTGT" ) # Construct the adapter arguments string ADAPTER_ARGS="" for adapter in "${ADAPTERS[@]}"; do ADAPTER_ARGS+=" -A ${adapter}" done # Execute cutadapt command cutadapt -f fastq \ --match-read-wildcards \ --times 1 \ -e 0.1 \ -O 5 \ --quality-cutoff 6 \ -m 18 \ ${ADAPTER_ARGS} \ -o "${OUTPUT_R1}" \ -p "${OUTPUT_R2}" \ "${INPUT_R1}" \ "${INPUT_R2}" \ > "${METRICS_FILE}" -
8
Takes output from cutadapt round 2.
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=2.10 # Define input and output files # INPUT_FASTQ is the output from cutadapt round 1 (typically 3' adapter trimming) INPUT_FASTQ="reads_r1_trimmed_3prime.fastq.gz" # OUTPUT_FASTQ will contain reads trimmed for both 3' and 5' adapters OUTPUT_FASTQ="reads_r1_trimmed_both.fastq.gz" # Define the 5' adapter sequence (e.g., eCLIP linker adapter) # This adapter sequence is specific to the library preparation protocol. # Example: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT # (This example assumes a common eCLIP linker adapter without a UMI at the 5' end for simplicity in this command) ADAPTER_5PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT" # Perform 5' adapter trimming (cutadapt round 2 in a typical eCLIP pipeline) # -g: Specifies a 5' adapter to be removed from the beginning of the read. # -o: Specifies the output file for trimmed reads. # -q 20,20: Trims low-quality bases from both ends of the read with a quality cutoff of 20. # --minimum-length 15: Discards reads shorter than 15 bp after trimming. # --cores 0: Uses all available CPU cores for parallel processing. # --report=full: Generates a detailed trimming report. cutadapt -g "${ADAPTER_5PRIME}" \ -o "${OUTPUT_FASTQ}" \ -q 20,20 \ --minimum-length 15 \ --cores 0 \ --report=full \ "${INPUT_FASTQ}" -
9
Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.
STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Define variables INPUT_READS="input_reads.fastq.gz" # Replace with your input FASTQ file(s). For paired-end, use "input_reads_R1.fastq.gz input_reads_R2.fastq.gz" OUTPUT_PREFIX="filtered_reads" REPBASE_FASTA="human_repbase.fasta" # Placeholder for human specific RepBase sequences. This file needs to be prepared from RepBase. STAR_INDEX_DIR="star_repbase_index" NUM_THREADS=8 # Adjust as needed # --- Installation (commented out) --- # conda install -c bioconda star # --- Build STAR index for RepBase sequences (if not already built) --- # This step assumes you have a FASTA file containing human specific RepBase sequences. # You would typically download or generate this file from RepBase (e.g., http://www.girinst.org/repbase/). # Example of how to build the index (uncomment and run if needed): # mkdir -p ${STAR_INDEX_DIR} # STAR --runMode genomeGenerate \ # --genomeDir ${STAR_INDEX_DIR} \ # --genomeFastaFiles ${REPBASE_FASTA} \ # --runThreadN ${NUM_THREADS} # --- Align reads to RepBase index and filter out mapped reads --- # Reads that map to the RepBase index are considered repetitive and are discarded. # The --outReadsUnmapped Fastx option outputs reads that did NOT map to RepBase. STAR --genomeDir ${STAR_INDEX_DIR} \ --readFilesIn ${INPUT_READS} \ --runThreadN ${NUM_THREADS} \ --outFileNamePrefix ${OUTPUT_PREFIX}_repbase_ \ --outSAMtype None \ --outReadsUnmapped Fastx # Output unmapped reads to a FASTA/Q file # The unmapped reads are the ones that *do not* map to RepBase, which are the desired "filtered" reads. # The output file will be named ${OUTPUT_PREFIX}_repbase_Unmapped.out.mate1 (and mate2 for paired-end). # If input was gzipped, output will also be gzipped. # Rename for clarity mv ${OUTPUT_PREFIX}_repbase_Unmapped.out.mate1 ${OUTPUT_PREFIX}.fastq.gz # If paired-end, uncomment and adjust: # mv ${OUTPUT_PREFIX}_repbase_Unmapped.out.mate1 ${OUTPUT_PREFIX}_R1.fastq.gz # mv ${OUTPUT_PREFIX}_repbase_Unmapped.out.mate2 ${OUTPUT_PREFIX}_R2.fastq.gz # Clean up intermediate files (optional) # rm ${OUTPUT_PREFIX}_repbase_Log.final.out # rm ${OUTPUT_PREFIX}_repbase_Log.out # rm ${OUTPUT_PREFIX}_repbase_Log.progress.out # rm ${OUTPUT_PREFIX}_repbase_SJ.out.tab -
10
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
$ Bash example
STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
-
11
Takes output from STAR rmRep.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # Placeholder for STAR genome index (e.g., human GRCh38) GENOME_DIR="/path/to/STAR_genome_index/GRCh38" # Placeholder for input FASTQ files READ_FILE_1="input_read_1.fastq.gz" READ_FILE_2="input_read_2.fastq.gz" # Remove if single-end OUTPUT_PREFIX="aligned_reads_rmRep" # Output file prefix THREADS=8 # Number of threads # Run STAR alignment with filtering for multimapping reads (interpreted as 'rmRep' for removing reads mapping to repetitive regions) STAR --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ_FILE_1}" "${READ_FILE_2}" \ --runThreadN "${THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 10 \ --outFilterMatchNminOverLread 0.66 \ --outFilterScoreMinOverLread 0.66 -
12
Maps unique reads to the human genome.
STAR (Inferred with models/gemini-2.5-flash) v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables for input and reference # Replace with actual paths and filenames READ1="input_R1.fastq.gz" # Path to your forward reads (gzipped FASTQ) READ2="input_R2.fastq.gz" # Path to your reverse reads (gzipped FASTQ) GENOME_DIR="/path/to/STAR_index/GRCh38" # Path to the pre-built STAR genome index for human GRCh38 OUTPUT_PREFIX="sample_aligned" # Prefix for output files THREADS=8 # Number of threads to use # Create output directory if it doesn't exist mkdir -p "./star_output" # Run STAR alignment # --runThreadN: Number of threads # --genomeDir: Path to the STAR genome index # --readFilesIn: Input FASTQ files (paired-end assumed) # --readFilesCommand: Command to decompress gzipped files on the fly # --outFileNamePrefix: Prefix for all output files # --outSAMtype BAM SortedByCoordinate: Output sorted BAM file # --outFilterMultimapNmax 1: Only report uniquely mapping reads (essential for 'unique reads') # --outFilterMismatchNmax 3: Maximum number of mismatches allowed (adjust as needed) # --outFilterScoreMinOverLread 0.66: Minimum ratio of alignment score to read length # --outFilterMatchNminOverLread 0.66: Minimum ratio of aligned bases to read length # --limitBAMsortRAM: Limit RAM for BAM sorting (e.g., 30GB, adjust based on available memory) STAR \ --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${READ1} ${READ2} \ --readFilesCommand zcat \ --outFileNamePrefix "./star_output/${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --outFilterScoreMinOverLread 0.66 \ --outFilterMatchNminOverLread 0.66 \ --limitBAMsortRAM 30000000000 # 30GB # Note: The human genome reference (GRCh38) needs to be pre-indexed using STAR's --runMode genomeGenerate command. -
13
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
$ Bash example
# Define variables for input files and reference genome # Replace with actual paths and reference genome index GENOME_DIR="/path/to/your/GRCh38_STAR_index" # Example: /path/to/STAR_indices/GRCh38 READ1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1" READ2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2" OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" # This prefix will be used for auxiliary files like Log.final.out, SJ.out.tab # STAR alignment command STAR \ --runMode alignReads \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --genomeLoad LoadAndRemove \ --readFilesIn "${READ1}" "${READ2}" \ --outSAMunmapped Within \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMattributes All \ --outStd BAM_Unsorted \ --outSAMtype BAM Unsorted \ --outFilterType BySJout \ --outReadsUnmapped Fastx \ --outFilterScoreMin 10 \ --outSAMattrRGline ID:foo \ --alignEndsType EndToEnd > "${OUTPUT_BAM}" -
14
takes output from STAR genome mapping.
STAR v2.5.2b$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables # STAR genome index for human GRCh38 (hg38) is used as a placeholder. # This directory should contain genome files (e.g., Genome.fa) and STAR index files (e.g., SA, SAindex). STAR_INDEX_DIR="/path/to/STAR_index/hg38" # Placeholder for input FASTQ file. The eCLIP CWL workflow typically uses single-end reads for alignment. INPUT_FASTQ="input_reads.fastq.gz" # Prefix for output files (e.g., aligned_reads.Aligned.sortedByCoordinate.out.bam) OUTPUT_PREFIX="aligned_reads" # Number of threads to use for alignment NUM_THREADS=8 # Adjust based on available CPU cores # Limit on RAM for sorting BAM. 30GB is used in the eCLIP CWL workflow. MAX_BAM_SORT_RAM=30000000000 # 30 GB # Run STAR alignment with parameters optimized for eCLIP (derived from yeolab/eclip CWL) STAR --runThreadN ${NUM_THREADS} \ --genomeDir ${STAR_INDEX_DIR} \ --readFilesIn ${INPUT_FASTQ} \ --outFileNamePrefix ${OUTPUT_PREFIX}. \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 3 \ --alignIntronMax 1 \ --outFilterType BySJout \ --outFilterScoreMinOverLread 0.66 \ --outFilterMatchNminOverLread 0.66 \ --limitBAMsortRAM ${MAX_BAM_SORT_RAM} -
15
Custom random-mer-aware script for PCR duplicate removal.
UMI-tools (Inferred with models/gemini-2.5-flash) v>=1.0.0 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install UMI-tools if not already installed (e.g., using conda) # conda install -c bioconda umi-tools # Example: Extract UMIs from FASTQ headers and then deduplicate BAM file # This assumes UMIs are already in the read names (e.g., from a previous custom script like extract_umi_from_fastq.py in Yeo lab eCLIP pipeline) # If UMIs are in a separate read or barcode, the 'extract' command would be used first. # Define input and output file paths INPUT_BAM="input_aligned_reads.bam" OUTPUT_DEDUP_BAM="output_deduplicated_reads.bam" OUTPUT_DEDUP_STATS="output_deduplication_stats.tsv" # Sort the BAM file by coordinate and then by UMI for efficient deduplication # This step is crucial for umi_tools dedup to work correctly with 'directional' or 'unique' methods. # The Yeo lab eCLIP pipeline often sorts by coordinate and then by read name (which contains the UMI). # Sort by coordinate and then by queryname (read name, where UMI is expected to be) # samtools sort -n -o ${INPUT_BAM%.bam}.qname_sorted.bam ${INPUT_BAM} # samtools fixmate -m ${INPUT_BAM%.bam}.qname_sorted.bam ${INPUT_BAM%.bam}.fixmate.bam # samtools sort -o ${INPUT_BAM%.bam}.position_sorted.bam ${INPUT_BAM%.bam}.fixmate.bam # samtools index ${INPUT_BAM%.bam}.position_sorted.bam # For simplicity, assuming input BAM is already sorted by coordinate and UMI is in read name (e.g., using a custom script beforehand) # If the UMI is in the read name, umi_tools dedup can directly use it. # The 'directional' method is common for eCLIP to handle potential UMI errors and strand specificity. umi_tools dedup \ --method=directional \ --extract-umi-method=read_id \ -I "${INPUT_BAM}" \ -S "${OUTPUT_DEDUP_BAM}" \ --output-stats="${OUTPUT_DEDUP_STATS}" # Index the deduplicated BAM file samtools index "${OUTPUT_DEDUP_BAM}" -
16
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
barcode_collapse_pe.py (Inferred with models/gemini-2.5-flash) vNot explicitly versioned, part of Yeo Lab eCLIP pipeline (Skipper)$ Bash example
# Install Miniconda or Anaconda if not already installed # wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh # bash miniconda.sh -b -p $HOME/miniconda # export PATH="$HOME/miniconda/bin:$PATH" # conda init bash # source ~/.bashrc # Clone the Skipper repository # git clone https://github.com/yeolab/skipper.git # cd skipper # Create and activate the conda environment # conda env create -f environment.yaml # conda activate skipper_env # Define input and output files INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam" METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics" # Execute the barcode collapse command # Assuming the script is run from the 'skipper' directory or its path is added to PATH # If not, specify the full path to the script, e.g., python /path/to/skipper/scripts/barcode_collapse_pe.py python scripts/barcode_collapse_pe.py \ --bam "${INPUT_BAM}" \ --out_file "${OUTPUT_BAM}" \ --metrics_file "${METRICS_FILE}" -
17
Takes output from barcode collapse PE.
$ Bash example
# Install cutadapt if not already installed # conda install -c bioconda cutadapt=4.0 # Define input and output files # INPUT_R1 and INPUT_R2 are the paired-end FASTQ files from the barcode collapse step. INPUT_R1="collapsed_R1.fastq.gz" INPUT_R2="collapsed_R2.fastq.gz" OUTPUT_R1="trimmed_R1.fastq.gz" OUTPUT_R2="trimmed_R2.fastq.gz" OUTPUT_UNTRIMMED_R1="untrimmed_R1.fastq.gz" OUTPUT_UNTRIMMED_R2="untrimmed_R2.fastq.gz" LOG_FILE="cutadapt.log" # Define adapters (Illumina universal and small RNA 3' adapters, common in eCLIP) # These adapters are typically found in eCLIP library preparations. ADAPTER_A="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Adapter for Read 1 ADAPTER_B="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Adapter for Read 2 # Run cutadapt for paired-end adapter trimming and quality filtering. # -a: 3' adapter for Read 1 # -A: 3' adapter for Read 2 # -o: Output file for trimmed Read 1 # -p: Output file for trimmed Read 2 # --untrimmed-output: Output file for Read 1 reads that were not trimmed # --untrimmed-paired-output: Output file for Read 2 reads that were not trimmed (paired with untrimmed Read 1) # --minimum-length: Discard reads shorter than 18 bp after trimming (common in eCLIP workflows) # --quality-cutoff: Trim low-quality bases from 3' end (e.g., Q6) # --cores: Number of CPU cores to use for parallel processing cutadapt \ -a "${ADAPTER_A}" \ -A "${ADAPTER_B}" \ -o "${OUTPUT_R1}" \ -p "${OUTPUT_R2}" \ --untrimmed-output "${OUTPUT_UNTRIMMED_R1}" \ --untrimmed-paired-output "${OUTPUT_UNTRIMMED_R2}" \ --minimum-length 18 \ --quality-cutoff 6 \ --cores 8 \ "${INPUT_R1}" "${INPUT_R2}" > "${LOG_FILE}" 2>&1 -
18
Sorts resulting bam file for use downstream.
samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Sort the BAM file by coordinate samtools sort -o output.sorted.bam input.bam
-
19
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
$ Bash example
# Install Picard (often bundled with GATK or available as a standalone JAR) # For example, using conda: # conda install -c bioconda picard # Or download the JAR directly: # wget https://github.com/broadinstitute/picard/releases/download/<VERSION>/picard.jar # export PICARD_JAR="/path/to/picard.jar" # Define variables INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam" OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam" TMP_DIR="/full/path/to/files/.queue/tmp" # The original command uses Queue.jar, which implies an older GATK setup where Picard tools were run via GATK's Queue framework. # If using a modern standalone Picard.jar, the command would typically be 'java -jar $PICARD_JAR SortSam ...' # For this specific command, we'll assume Queue.jar is available and contains the necessary Picard classes. GATK_QUEUE_JAR="/path/to/gatk/dist/Queue.jar" # Placeholder for the actual path to Queue.jar # Create temporary directory if it doesn't exist mkdir -p "${TMP_DIR}" # Execute Picard SortSam java -Xmx2048m \ -XX:+UseParallelOldGC \ -XX:ParallelGCThreads=4 \ -XX:GCTimeLimit=50 \ -XX:GCHeapFreeLimit=10 \ -Djava.io.tmpdir="${TMP_DIR}" \ -cp "${GATK_QUEUE_JAR}" \ net.sf.picard.sam.SortSam \ INPUT="${INPUT_BAM}" \ TMP_DIR="${TMP_DIR}" \ OUTPUT="${OUTPUT_BAM}" \ VALIDATION_STRINGENCY=SILENT \ SO=coordinate \ CREATE_INDEX=true -
20
Takes output from sortSam, makes bam index for use downstream.
samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash)$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools=1.10 # Assume 'sorted.bam' is the output from sortSam samtools index sorted.bam
-
21
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools # Define input and output paths INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam" OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai" # Create an index for the BAM file samtools index "${INPUT_BAM}" "${OUTPUT_BAI}" -
22
Takes inputs from multiple final bam files.
$ Bash example
# Install samtools if not available # conda install -c bioconda samtools # Example: Merging multiple BAM files into a single output BAM file. # This is a common step to combine technical replicates or lanes for a single sample # before downstream analysis (e.g., peak calling, quantification). # Replace input1.bam, input2.bam, etc., with the actual paths to your input BAM files. # Replace merged_output.bam with the desired name for the combined BAM file. samtools merge merged_output.bam input1.bam input2.bam input3.bam
-
23
Merges the two technical replicates for further downstream analysis.
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.10 # Merge technical replicates (BAM files) # Replace 'replicate1.bam' and 'replicate2.bam' with your actual input BAM files. # Replace 'merged_replicates.bam' with your desired output file name. samtools merge merged_replicates.bam replicate1.bam replicate2.bam
-
24
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
$ Bash example
# Install samtools (e.g., using conda) # conda install -c bioconda samtools=1.10 samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
-
25
Takes output from sortSam, makes bam index for use downstream.
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.10 # Assume input sorted BAM file is named 'sorted.bam' # The output will be 'sorted.bam.bai' samtools index sorted.bam
-
26
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
$ Bash example
# samtools is a widely used tool for manipulating SAM/BAM/CRAM files. # It can be installed via conda: # conda install -c bioconda samtools samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
-
27
Takes output from sortSam.
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # This step takes a sorted BAM file (output from sortSam, typically samtools sort) # and creates an index (.bai) file, which is essential for many downstream tools # that require random access to alignments (e.g., visualization, variant calling). # Replace input.sorted.bam with your actual sorted BAM file. samtools index input.sorted.bam
-
28
Only outputs the second read in each pair for use with single stranded peak caller.
reformat.sh (Inferred with models/gemini-2.5-flash) vLatest (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install BBMap (BBTools) if not already installed # conda install -c bioconda bbmap # Define input and output files # Assuming input are gzipped paired-end FASTQ files INPUT_R1="input_read1.fastq.gz" INPUT_R2="input_read2.fastq.gz" OUTPUT_R2_ONLY="second_reads_only.fastq.gz" # Command to output only the second read in each pair # 'in1' and 'in2' specify the paired-end input files. # 'out' specifies the output file for the second reads. # 'out1=null' discards the first reads. reformat.sh in1="${INPUT_R1}" in2="${INPUT_R2}" out="${OUTPUT_R2_ONLY}" out1=null -
29
This is the final bam file to perform analysis on.
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools # Assume 'input.bam' is the raw or aligned BAM file that needs to be finalized for analysis. # A 'final' BAM file is typically sorted by coordinate and indexed for efficient access by downstream tools. INPUT_BAM="input.bam" OUTPUT_BAM="final_sorted.bam" # Sort the BAM file by coordinate samtools sort -o "${OUTPUT_BAM}" "${INPUT_BAM}" # Index the sorted BAM file samtools index "${OUTPUT_BAM}" echo "Final BAM file ready for analysis: ${OUTPUT_BAM}" echo "Index file: ${OUTPUT_BAM}.bai" -
30
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools=1.10 # Extract reads that are the second in a pair (flag 128) from a merged BAM file. # -h: Include header in the output BAM file. # -b: Output in BAM format. # -f 128: Only output alignments with all bits in 128 set (i.e., the second read in a pair). samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
-
31
Takes results from samtools view.
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools # Input BAM file (e.g., alignment output from a previous step) INPUT_BAM="aligned_reads.bam" # Output BAM file containing only mapped reads OUTPUT_MAPPED_BAM="mapped_reads.bam" # Use samtools view to filter for mapped reads (flag -F 4 excludes unmapped reads). # -b: output in BAM format # -h: include header samtools view -b -h -F 4 "${INPUT_BAM}" > "${OUTPUT_MAPPED_BAM}" -
32
Calls peaks on those files.
clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
bash # Install clipper (if not already installed, e.g., via pip or a Docker image) # pip install clipper # Define input and output variables IP_BAM="input_ip.bam" # Replace with your IP BAM file CONTROL_BAM="input_control.bam" # Replace with your control BAM file (e.g., SMInput or IgG) GENOME_SIZE="hs" # Use 'hs' for human, 'mm' for mouse, or specify a numeric value (e.g., 3.1e9 for hg38) OUTPUT_PREFIX="eclip_peaks" # Call peaks using clipper with common eCLIP parameters # -b: IP BAM file # -c: Control BAM file # -s: Genome size (e.g., 'hs' for human, 'mm' for mouse) # -o: Output prefix # --fdr: False Discovery Rate threshold (default 0.05) # --window_size: Window size for peak calling (default 100) # --min_peak_reads: Minimum number of reads in a peak (default 5) # --min_peak_length: Minimum peak length (default 10) # --max_peak_length: Maximum peak length (default 1000) # -u: Use unique reads only (recommended for eCLIP) # -d: Discard duplicate reads (recommended for eCLIP) clipper -b "${IP_BAM}" \ -c "${CONTROL_BAM}" \ -s "${GENOME_SIZE}" \ -o "${OUTPUT_PREFIX}" \ --fdr 0.05 \ --window_size 100 \ --min_peak_reads 5 \ --min_peak_length 10 \ --max_peak_length 1000 \ -u -d echo "Peak calling complete. Output files are prefixed with ${OUTPUT_PREFIX}" -
33
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
$ Bash example
# Install CLIPper (example using conda and pip) # conda create -n clipper_env python=3.8 # conda activate clipper_env # pip install clipper-tools # Define input and output files INPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam" OUTPUT_BED="/full/path/to/files/CombinedID.merged.r2.peaks.bed" GENOME_ASSEMBLY="hg19" # Reference genome assembly # Execute CLIPper command clipper -b "${INPUT_BAM}" -s "${GENOME_ASSEMBLY}" -o "${OUTPUT_BED}" --bonferroni --superlocal --threshold-method binomial --save-pickle
Raw Source Text
Library strategy: eCLIP-seq Takes output from raw files. Run to trim off both 5â and 3â adapters on both reads. Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics Takes output from cutadapt round 1. Run to trim off the 3â adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics Takes output from cutadapt round 2. Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads. Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam Takes output from STAR rmRep. Maps unique reads to the human genome. Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam takes output from STAR genome mapping. Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics Takes output from barcode collapse PE. Sorts resulting bam file for use downstream. Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true Takes output from sortSam, makes bam index for use downstream. Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai Takes inputs from multiple final bam files. Merges the two technical replicates for further downstream analysis. Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam Takes output from sortSam, makes bam index for use downstream. Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai Takes output from sortSam. Only outputs the second read in each pair for use with single stranded peak caller. This is the final bam file to perform analysis on. Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam Takes results from samtools view. Calls peaks on those files. Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle Genome_build: hg19 Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding