GSE80039 Processing Pipeline


Publication

Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP).

Nature methods (2016) — PMID 27018577

Dataset

Enhanced CLIP (eCLIP) enables robust and scalable transcriptome-wide discovery and characterization of RNA binding protein binding sites [eCLIP - Hep…

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Library strategy: eCLIP-seq

eCLIP vCWL workflow (yeolab/eclip) GitHub

$ Bash example

# This command is a placeholder for running the eCLIP CWL workflow.
# It assumes 'eclip.cwl' is the main workflow definition file
# and 'eclip_inputs.yaml' contains paths to input FASTQ files,
# genome reference (e.g., hg38), and other necessary parameters.
#
# Example 'eclip_inputs.yaml' content for a human (hg38) sample:
# fastq_r1: { class: File, path: "sample_R1.fastq.gz" }
# fastq_r2: { class: File, path: "sample_R2.fastq.gz" }
# genome_fasta: { class: File, path: "/path/to/hg38.fa" }
# genome_star_index: { class: Directory, path: "/path/to/hg38_star_index" }
# adapter_sequence: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Example adapter
# output_dir: "eclip_results"
#
# For detailed setup and execution, refer to the yeolab/eclip GitHub repository:
# https://github.com/yeolab/eclip/
#
# Installation of cwltool (if not already installed):
# conda install -c conda-forge cwltool
# or
# pip install cwltool
#
# Clone the eCLIP CWL workflow repository:
# git clone https://github.com/yeolab/eclip.git
# cd eclip
#
# Execute the eCLIP CWL workflow:
# Replace 'eclip.cwl' and 'eclip_inputs.yaml' with actual paths.
cwltool eclip.cwl eclip_inputs.yaml


Takes output from raw files.

Trim Galore! (Inferred with models/gemini-2.5-flash) v0.6.7 GitHub

$ Bash example

# Install Trim Galore! (if not already installed)
# conda install -c bioconda trim-galore

# Define input raw FASTQ files (replace with actual file paths)
# Assuming paired-end raw FASTQ files as common input for many pipelines
INPUT_FASTQ_R1="sample_R1.fastq.gz"
INPUT_FASTQ_R2="sample_R2.fastq.gz"

# Define output directory for trimmed FASTQ files
OUTPUT_DIR="./trimmed_fastq"
mkdir -p "${OUTPUT_DIR}"

# Run Trim Galore! for adapter trimming and quality filtering
# This command processes paired-end reads, automatically detects adapters,
# and places the trimmed files in the specified output directory.
# Trim Galore! internally uses Cutadapt for trimming.
trim_galore --paired \
            --output_dir "${OUTPUT_DIR}" \
            "${INPUT_FASTQ_R1}" \
            "${INPUT_FASTQ_R2}"

View on GitHub
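Later steps need the names of the trimmed files. For gzipped paired-end input, Trim Galore! is expected to write `<name>_val_1.fq.gz` and `<name>_val_2.fq.gz` into the output directory. The sketch below derives those paths under that assumption; `trimgalore_out` is a hypothetical helper, not part of Trim Galore!, so check the real output directory before relying on it.

```shell
# Sketch: predict Trim Galore! paired-end output names for gzipped input.
# Assumption: outputs are named <input basename>_val_<mate>.fq.gz.
trimgalore_out() {
  # $1 = input FASTQ path, $2 = mate number (1 or 2), $3 = output directory
  name=$(basename "$1")
  name=${name%.gz}
  name=${name%.fastq}
  name=${name%.fq}
  printf '%s/%s_val_%s.fq.gz\n' "$3" "$name" "$2"
}

TRIMMED_R1=$(trimgalore_out "sample_R1.fastq.gz" 1 "./trimmed_fastq")
TRIMMED_R2=$(trimgalore_out "sample_R2.fastq.gz" 2 "./trimmed_fastq")
echo "$TRIMMED_R1"
echo "$TRIMMED_R2"
```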

Run to trim off both 5′ and 3′ adapters on both reads.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=4.0

# Define input and output file paths
READ1_IN="read1.fastq.gz"
READ2_IN="read2.fastq.gz"
READ1_OUT="trimmed_read1.fastq.gz"
READ2_OUT="trimmed_read2.fastq.gz"
REPORT_FILE="cutadapt_report.txt"

# Define adapter sequences (example Illumina TruSeq adapters from Yeo lab's skipper workflow)
# IMPORTANT: Replace these with the actual adapter sequences used in your library preparation.
# If distinct 5' adapters are used, replace ADAPTER_FWD_5PRIME and ADAPTER_REV_5PRIME accordingly.
ADAPTER_FWD_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_REV_3PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
# For 5' adapter trimming, often the same sequence or a specific 5' adapter is used.
# Using the same sequence as a placeholder if no distinct 5' adapter is specified.
ADAPTER_FWD_5PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Placeholder, replace if distinct 5' adapter exists
ADAPTER_REV_5PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Placeholder, replace if distinct 5' adapter exists

# Run cutadapt to trim both 5' and 3' adapters from both reads
# -a ADAPTER_FWD_3PRIME: 3' adapter for R1
# -A ADAPTER_REV_3PRIME: 3' adapter for R2
# -g ADAPTER_FWD_5PRIME: 5' adapter for R1
# -G ADAPTER_REV_5PRIME: 5' adapter for R2
# -q 20: Trim low-quality bases from the 3' end (Phred score < 20)
# --minimum-length 15: Discard reads shorter than 15 bp after trimming
# -e 0.1: Maximum error rate for adapter matching
# -o: Output file for R1
# -p: Output file for R2
cutadapt \
  -a "${ADAPTER_FWD_3PRIME}" \
  -A "${ADAPTER_REV_3PRIME}" \
  -g "${ADAPTER_FWD_5PRIME}" \
  -G "${ADAPTER_REV_5PRIME}" \
  -q 20 \
  --minimum-length 15 \
  -e 0.1 \
  -o "${READ1_OUT}" \
  -p "${READ2_OUT}" \
  "${READ1_IN}" "${READ2_IN}" > "${REPORT_FILE}" 2>&1

View on GitHub
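As a mental model for 3′ adapter trimming followed by a minimum-length filter, the toy below removes an exact adapter match and everything after it, then applies an 18-nt cutoff. It is only an illustration: cutadapt performs error-tolerant matching, and the reads, the shortened adapter, and the helper names here are made up for the example.

```shell
# Toy reads: a 20-nt or 16-nt insert followed by the adapter and trailing bases.
ADAPTER="AGATCGGAAGAGC"
MIN_LEN=18
READ_LONG="TTGACCTGAAGTCAGGCTAA${ADAPTER}ACACG"
READ_SHORT="TTGACCTGAAGTCAGG${ADAPTER}ACACG"

trim_3prime() {
  # Remove the first exact adapter occurrence and everything after it.
  printf '%s\n' "${1%%"$ADAPTER"*}"
}

keep_or_drop() {
  # Apply the minimum-length filter (the role of -m / --minimum-length).
  if [ "${#1}" -ge "$MIN_LEN" ]; then echo "keep"; else echo "drop"; fi
}

TRIMMED_LONG=$(trim_3prime "$READ_LONG")
TRIMMED_SHORT=$(trim_3prime "$READ_SHORT")
echo "$TRIMMED_LONG $(keep_or_drop "$TRIMMED_LONG")"
echo "$TRIMMED_SHORT $(keep_or_drop "$TRIMMED_SHORT")"
```

The second read is dropped because only 16 nt remain after trimming, which is below the cutoff.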

Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

quality-cutoff vAssociated with yeolab/eclip workflow (circa 2017-2020) GitHub

$ Bash example

# Install dependencies: Cutadapt
# conda install -c bioconda cutadapt

# Assumption: "quality-cutoff 6" in the recorded command is cutadapt's
# --quality-cutoff option, and the leading "cutadapt ..." portion of the
# command was lost. The -m/-a/-g/-A/-o/-p options are all cutadapt options,
# and the recorded round 2 command has the same shape. Any options that
# preceded --quality-cutoff in the original cannot be recovered from this record.
IN_R1="/full/path/to/files/file_R1.C01.fastq.gz"
IN_R2="/full/path/to/files/file_R2.C01.fastq.gz"

# Execute cutadapt with the recorded options
cutadapt --quality-cutoff 6 -m 18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT \
  -o "${IN_R1}.adapterTrim.fastq.gz" \
  -p "${IN_R2}.adapterTrim.fastq.gz" \
  "${IN_R1}" "${IN_R2}" \
  > "${IN_R1}.adapterTrim.metrics"


Takes output from cutadapt round 1.

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt if not already available
# conda install -c bioconda cutadapt=1.18

# Execute cutadapt for a second round of trimming.
# The options below are generic placeholders (quality trimming, length filtering,
# removal of reads containing Ns, and poly-A trimming) applied to a single FASTQ file.
# They are not the settings recorded for this dataset, whose round 2 command
# trims 3' adapters from read 2 with -A.
cutadapt \
  -q 20,20 \
  -m 18 \
  --max-n 0 \
  -a "A{10}" \
  -o output_cutadapt_round2.fastq.gz \
  input_from_cutadapt_round1.fastq.gz


Run to trim off the 3′ adapters on read 2, to control for double ligation events.

cutadapt (Inferred with models/gemini-2.5-flash) v2.10 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

# Define input and output files
READ1_INPUT="sample_R1.fastq.gz"
READ2_INPUT="sample_R2.fastq.gz"
READ1_TRIMMED="sample_R1_trimmed.fastq.gz"
READ2_TRIMMED="sample_R2_trimmed.fastq.gz"
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" # Standard Illumina TruSeq adapter for eCLIP

# Run cutadapt to trim 3' adapters from Read 2, outputting both R1 and R2
# -a ADAPTER: Specifies the 3' adapter to trim from the forward read (R2 in this case)
# -o: Output file for the forward read (R2)
# -p: Output file for the reverse read (R1), which is paired with the forward read
# -m 18: Discard reads shorter than 18 bp after trimming, as used in the eCLIP pipeline
cutadapt -a "${ADAPTER_SEQUENCE}" \
         -o "${READ2_TRIMMED}" \
         -p "${READ1_TRIMMED}" \
         -m 18 \
         "${READ2_INPUT}" \
         "${READ1_INPUT}"


Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

cutadapt v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt

# Define input and output files
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics"

# Define adapters (each -A flag and its sequence are separate array elements,
# so "${ADAPTERS[@]}" expands to correctly split arguments)
ADAPTERS=(
    -A AACTTGTAGATCGGA
    -A AGGACCAAGATCGGA
    -A ACTTGTAGATCGGAA
    -A GGACCAAGATCGGAA
    -A CTTGTAGATCGGAAG
    -A GACCAAGATCGGAAG
    -A TTGTAGATCGGAAGA
    -A ACCAAGATCGGAAGA
    -A TGTAGATCGGAAGAG
    -A CCAAGATCGGAAGAG
    -A GTAGATCGGAAGAGC
    -A CAAGATCGGAAGAGC
    -A TAGATCGGAAGAGCG
    -A AAGATCGGAAGAGCG
    -A AGATCGGAAGAGCGT
    -A GATCGGAAGAGCGTC
    -A ATCGGAAGAGCGTCG
    -A TCGGAAGAGCGTCGT
    -A CGGAAGAGCGTCGTG
    -A GGAAGAGCGTCGTGT
)

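# Execute cutadapt command
# Note: -f fastq is a cutadapt 1.x option. Newer releases detect the input format
# automatically and may warn about or reject -f; drop it if your version complains.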
cutadapt -f fastq \
    --match-read-wildcards \
    --times 1 \
    -e 0.1 \
    -O 5 \
    --quality-cutoff 6 \
    -m 18 \
    "${ADAPTERS[@]}" \
    -o "${OUTPUT_R1}" \
    -p "${OUTPUT_R2}" \
    "${INPUT_R1}" \
    "${INPUT_R2}" \
    > "${METRICS_FILE}"

View on GitHub
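The twenty `-A` sequences in the round 2 command appear to be every 15-nt window tiled across two longer sequences: a 7-nt prefix (`AACTTGT` or `AGGACCA`) followed by the shared stem `AGATCGGAAGAGCGTCGTGT`. That reading is an inference from the recorded list, not something the source states. Under that assumption, the list can be regenerated rather than typed by hand; `tile_windows` is a hypothetical helper.

```shell
# Print every k-nt window of a sequence, one per line.
tile_windows() {
  # $1 = sequence, $2 = window length k
  printf '%s\n' "$1" |
    awk -v k="$2" '{ for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k) }'
}

# Assumed reconstruction: two 7-nt prefixes followed by a shared 20-nt stem.
STEM="AGATCGGAAGAGCGTCGTGT"

# Tile both sequences and drop the windows they have in common.
WINDOWS=$(
  { tile_windows "AACTTGT${STEM}" 15; tile_windows "AGGACCA${STEM}" 15; } |
    awk '!seen[$0]++'
)

# Turn the windows into cutadapt -A arguments on a single line.
ADAPTER_ARGS=$(printf '%s\n' "$WINDOWS" | sed 's/^/-A /' | tr '\n' ' ')
echo "$ADAPTER_ARGS"
```

The windows come out grouped by source sequence rather than interleaved as in the recorded command, but the set of sequences is the same.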

Takes output from cutadapt round 2.

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.18

# Define input and output files
# INPUT_FASTQ is the output from the previous cutadapt round (adapter trimming)
INPUT_FASTQ="sample_R1_trimmed_adapter.fastq.gz"
OUTPUT_FASTQ="sample_R1_trimmed_polyA.fastq.gz"

# Run cutadapt for poly-A trimming (an inferred, illustrative step; the recorded commands do not include it)
# -a "A{100}": Trims a poly-A tail of up to 100 A's
# -q 10: Trims low-quality bases from the 3' end with a quality cutoff of 10
# --minimum-length 18: Discards reads shorter than 18 bp after trimming
# -e 0.1: Maximum error rate of 10% for adapter matching
# --overlap 3: Minimum overlap of 3 bases for adapter matching
# -j 8: Use 8 CPU cores for parallel processing
cutadapt -a "A{100}" \
         -q 10 \
         --minimum-length 18 \
         -e 0.1 \
         --overlap 3 \
         -j 8 \
         -o "${OUTPUT_FASTQ}" \
         "${INPUT_FASTQ}"


Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA (and other) repetitive reads.

BBDuk (Inferred with models/gemini-2.5-flash) v38.90 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install BBTools suite (which includes BBDuk)
# conda install -c bioconda bbmap

# Define variables
# Replace with your actual input FASTQ file(s)
INPUT_READS="input_reads.fastq.gz" 
OUTPUT_FILTERED_READS="filtered_reads.fastq.gz"

# This FASTA file contains sequences of human-specific repetitive elements from RepBase.
# It needs to be prepared beforehand, e.g., by extracting sequences from the RepBase database
# (Genetic Information Research Institute - GIRI) or by extracting repeat sequences
# identified by RepeatMasker on the human reference genome (e.g., GRCh38).
# For example, a combined FASTA of human rRNA, tRNA, and other RepBase elements.
HUMAN_REPBASE_FASTA="path/to/human_repbase_elements.fa"

# Run BBDuk to remove reads that match human-specific repetitive elements.
# BBDuk does not align reads; it discards reads that share k-mers with the reference FASTA.
# It is shown here as a stand-in for the STAR alignment against RepBase recorded for this step.
# in: Input FASTQ file (for paired-end data, also set in2= and out2=).
# ref: Reference FASTA file containing repetitive element sequences.
# out: Output FASTQ file with repetitive reads removed.
# k: Kmer length for matching (BBDuk's default is 27; 31 is set explicitly below).
# hdist: Hamming distance for kmer matching (BBDuk's default is 0; 1 allows one mismatch).
# stats: Output statistics about filtered reads to a specified file.
# overwrite: Allow overwriting output files if they exist.
# The description mentions "rRNA (& other) repetitive reads". If the HUMAN_REPBASE_FASTA
# includes rRNA sequences, then this single step can handle both rRNA and other RepBase elements.

bbduk.sh in="${INPUT_READS}" \
           ref="${HUMAN_REPBASE_FASTA}" \
           out="${OUTPUT_FILTERED_READS}" \
           k=31 \
           hdist=1 \
           stats="${OUTPUT_FILTERED_READS}.stats" \
           overwrite=true


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

STAR vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Reference genome directory for RepBase human repeats.
# This is a placeholder. Replace with the actual path to your STAR-indexed RepBase human genome directory.
GENOME_DIR="/path/to/RepBase_human_database_file"

# Input FASTQ files
READ1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
READ2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"

# Output BAM file prefix (for auxiliary files like Log.out, SJ.out.tab) and the final redirected BAM file.
# Note: The main alignment output is sent to stdout (--outStd BAM_Unsorted) and then redirected to FINAL_BAM_OUTPUT.
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
FINAL_BAM_OUTPUT="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

STAR --runMode alignReads \
     --runThreadN 16 \
     --genomeDir "${GENOME_DIR}" \
     --genomeLoad LoadAndRemove \
     --readFilesIn "${READ1}" "${READ2}" \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 30 \
     --outFilterMultimapScoreRange 1 \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMattributes All \
     --readFilesCommand zcat \
     --outStd BAM_Unsorted \
     --outSAMtype BAM Unsorted \
     --outFilterType BySJout \
     --outReadsUnmapped Fastx \
     --outFilterScoreMin 10 \
     --outSAMattrRGline ID:foo \
     --alignEndsType EndToEnd \
     > "${FINAL_BAM_OUTPUT}"

View on GitHub
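STAR appends its fixed file names directly to `--outFileNamePrefix` with no separator, which would explain why the genome-mapping step reads files named `...rep.bamUnmapped.out.mate1` and `...mate2`. The sketch below derives those names with plain shell parameter expansion; the variable names are illustrative only.

```shell
# Prefix as recorded in the repeat-mapping command.
PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# STAR concatenates the prefix and its fixed file names.
UNMAPPED_MATE1="${PREFIX}Unmapped.out.mate1"
UNMAPPED_MATE2="${PREFIX}Unmapped.out.mate2"
STAR_LOG="${PREFIX}Log.final.out"

# Name of the genome-mapping output: swap the .rep.bam suffix for .rmRep.bam.
RMREP_BAM="${PREFIX%.rep.bam}.rmRep.bam"

printf '%s\n' "$UNMAPPED_MATE1" "$UNMAPPED_MATE2" "$STAR_LOG" "$RMREP_BAM"
```

Using a prefix that ends in a separator (for example `.rep.`) would give cleaner auxiliary names, but the next step's input paths would then have to change to match.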

Takes output from STAR rmRep.

STAR v2.7.0f GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define variables (replace with actual paths and filenames)
# GENOME_DIR: Path to the STAR genome index (e.g., for hg38).
# READ1: Path to the input FASTQ file for read 1.
# READ2: Path to the input FASTQ file for read 2 (optional, remove if single-end).
# OUTPUT_PREFIX: Prefix for output files.
# THREADS: Number of threads to use for STAR alignment.
GENOME_DIR="/path/to/STAR_genome_index/hg38"
READ1="input_read1.fastq.gz"
READ2="input_read2.fastq.gz" # Remove this line if single-end reads
OUTPUT_PREFIX="aligned_reads"
THREADS=8

# 1. Align reads with STAR
# Input: the read pairs left unmapped by the repeat-element (rmRep) alignment.
# The filtering options below follow the genome-mapping command recorded for this
# dataset; threads, paths, and sorted output are choices made for this example.
# --runThreadN: Number of threads.
# --genomeDir: Path to the STAR genome index.
# --readFilesIn: Input FASTQ files. Use only READ1 if single-end.
# --readFilesCommand zcat: Needed because the example inputs are gzipped.
# --outFileNamePrefix: Prefix for output files.
# --outSAMtype BAM SortedByCoordinate: Output a coordinate-sorted BAM file.
# --outSAMunmapped Within: Keep unmapped reads in the BAM file.
# --outFilterMultimapNmax 1: Keep only uniquely mapping reads.
# --outFilterMultimapScoreRange 1: Score range used when counting multimappers.
# --outFilterType BySJout: Keep reads whose junctions pass filtering into SJ.out.tab.
# --outFilterScoreMin 10: Minimum alignment score.
# --alignEndsType EndToEnd: Align reads end to end, without soft clipping.
# --limitBAMsortRAM 30000000000: Limit RAM for BAM sorting (30GB).

STAR \
  --runThreadN ${THREADS} \
  --genomeDir ${GENOME_DIR} \
  --readFilesIn ${READ1} ${READ2} \
  --readFilesCommand zcat \
  --outFileNamePrefix ${OUTPUT_PREFIX}_ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFilterType BySJout \
  --outFilterScoreMin 10 \
  --alignEndsType EndToEnd \
  --limitBAMsortRAM 30000000000

# The above command produces a sorted BAM file: ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam

# 2. No duplicate removal at this point
# "rmRep" refers to reads from repetitive elements having been removed, not to
# duplicate or replicate removal. PCR duplicates are handled later by the
# random-mer-aware collapse script, so samtools markdup is not used here.

# Index the BAM file for downstream processing
samtools index ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam


Maps unique reads to the human genome.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star=2.7.10a

# --- Reference Data Setup (Example using GRCh38 and GENCODE v38) ---
# Download human genome primary assembly FASTA (e.g., from UCSC or NCBI)
# wget -P /path/to/references/ https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# gunzip /path/to/references/hg38.fa.gz

# Download GENCODE v38 GTF annotation (e.g., from GENCODE)
# wget -P /path/to/references/ https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_38/gencode.v38.annotation.gtf.gz
# gunzip /path/to/references/gencode.v38.annotation.gtf.gz

# Create STAR genome index (run once per reference genome)
# mkdir -p /path/to/STAR_index/GRCh38_gencode_v38
# STAR --runThreadN 8 \
# --runMode genomeGenerate \
# --genomeDir /path/to/STAR_index/GRCh38_gencode_v38 \
# --genomeFastaFiles /path/to/references/hg38.fa \
# --sjdbGTFfile /path/to/references/gencode.v38.annotation.gtf \
# --sjdbOverhang 100 # Recommended for RNA-seq, typically read length - 1

# --- Alignment Command ---
# Maps unique reads to the human genome (GRCh38) using STAR
# Input: input_reads.fastq.gz (replace with your actual FASTQ file)
# Output: output_prefix_Aligned.sortedByCoord.out.bam (BAM file sorted by coordinate)
#         output_prefix_ReadsPerGene.out.tab (Gene counts, if --quantMode GeneCounts is used)
STAR --runThreadN 8 \
--genomeDir /path/to/STAR_index/GRCh38_gencode_v38 \
--readFilesIn input_reads.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix output_prefix_ \
--outSAMtype BAM SortedByCoordinate \
--outFilterMultimapNmax 1 \
--outFilterMismatchNmax 10 \
--outFilterScoreMinOverLread 0.66 \
--outFilterMatchNminOverLread 0.66 \
--quantMode GeneCounts


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with your actual STAR genome directory (e.g., for hg38 or mm10).
# For eCLIP/RNA-based assays, hg38 is a common reference.
GENOME_DIR="/path/to/your/STAR_index/hg38" 

# Input FASTQ files: the read pairs left unmapped by the repeat-element alignment
# (written by STAR's --outReadsUnmapped Fastx in the previous step)
INPUT_READS_MATE1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
INPUT_READS_MATE2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"

# Output BAM file (the alignment output is redirected to this file from stdout)
OUTPUT_BAM_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

# Prefix for other STAR output files (e.g., Log.out, SJ.out.tab, etc.)
# Note: The original description uses the .bam file name as a prefix, which will result in files like "your.bamLog.out".
# If you prefer a cleaner prefix, you might change this to something like "/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep."
OUTPUT_FILE_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

# Run STAR alignment
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${INPUT_READS_MATE1}" "${INPUT_READS_MATE2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_FILE_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${OUTPUT_BAM_FILE}"


Takes output from STAR genome mapping.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with actual paths and filenames
GENOME_DIR="/path/to/STAR_genome_index/GRCh38" # Placeholder for GRCh38 genome index
READ1_FASTQ="input_R1.fastq.gz"
READ2_FASTQ="input_R2.fastq.gz" # Omit if single-end
OUTPUT_PREFIX="mapped_reads"
THREADS=8 # Number of threads to use

# Run STAR mapping
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
     --readFilesCommand zcat \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # ~30GB RAM for sorting, adjust as needed based on available memory


Custom random-mer-aware script for PCR duplicate removal.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools if not already installed
# conda install -c bioconda umi_tools=1.1.2

# Example: Deduplicate a BAM file using UMIs embedded in read IDs.
# This is a stand-in for the custom barcode_collapse_pe.py script, not the recorded method.
# It assumes UMIs have been extracted and appended to read IDs in a previous step
# (e.g., using 'umi_tools extract') and are separated by an underscore '_'.
# The input BAM must be coordinate-sorted and indexed.
# -I / -S / -L: input BAM, output BAM, and log file.
# --method=directional: umi_tools' default method for grouping similar UMIs.
# --paired: use both mates, since this dataset is paired-end.

umi_tools dedup \
    -I input.sorted.bam \
    -S output.dedup.bam \
    --extract-umi-method=read_id \
    --umi-separator='_' \
    --method=directional \
    --paired \
    -L dedup.log


Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

eCLIP pipeline (Yeo Lab) (Inferred with models/gemini-2.5-flash) v1.0.1 GitHub

$ Bash example

# Clone the eCLIP pipeline repository
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Create and activate the conda environment (if using conda) from the provided environment.yml
# conda env create -f environment.yml
# conda activate eclip

# Set the path to the eCLIP scripts directory
# Adjust this path to where you cloned the 'eclip' repository
ECLIP_SCRIPTS_DIR="/path/to/cloned/eclip/scripts"

# Define input and output file paths
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"

# Execute the barcode_collapse_pe.py script
python "${ECLIP_SCRIPTS_DIR}/barcode_collapse_pe.py" \
    --bam "${INPUT_BAM}" \
    --out_file "${OUTPUT_BAM}" \
    --metrics_file "${METRICS_FILE}"
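
The idea behind random-mer-aware duplicate removal can be shown on a toy table: reads that share both a mapping position and a random-mer are treated as PCR duplicates, while reads at the same position with different random-mers are kept. The sketch below is not the barcode_collapse_pe.py implementation; the tab-separated input layout and all values are hypothetical.

```shell
# Toy input: chrom, start, strand, random-mer, read name (tab-separated).
READS=$(printf '%s\t%s\t%s\t%s\t%s\n' \
  chr1 100 + ACGTACGTAC readA \
  chr1 100 + ACGTACGTAC readB \
  chr1 100 + TTGCATTGCA readC \
  chr1 250 - ACGTACGTAC readD)

# Keep the first read seen for each (chrom, start, strand, random-mer) key.
DEDUPED=$(printf '%s\n' "$READS" | awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++')

printf '%s\n' "$DEDUPED"
```

Here readB is dropped as a duplicate of readA, while readC survives because its random-mer differs even though it maps to the same position.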


Takes output from barcode collapse PE.

cutadapt (Inferred with models/gemini-2.5-flash) v1.18 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

# Define input files. Note: barcode_collapse_pe.py writes a BAM file, so FASTQ
# inputs like these would have to be produced separately (for example by
# converting the BAM back to FASTQ). These are placeholder names.
INPUT_R1="collapsed_reads_r1.fastq.gz"
INPUT_R2="collapsed_reads_r2.fastq.gz"

# Define output files for trimmed reads
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"

# Define adapter sequences (standard Illumina TruSeq adapters, used here as placeholders)
# The recorded eCLIP trimming commands use a different adapter set; replace as needed.
ADAPTER_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Execute cutadapt for paired-end adapter trimming
# -a: 3' adapter for R1
# -A: 3' adapter for R2
# -o: Output file for R1
# -p: Output file for R2
# -m 18: Discard reads shorter than 18 bp after trimming (as used in eCLIP pipeline)
cutadapt -a "${ADAPTER_R1}" -A "${ADAPTER_R2}" \
         -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \
         -m 18 \
         "${INPUT_R1}" "${INPUT_R2}"


Sorts resulting bam file for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Define input and output file names
INPUT_BAM="input.bam"
OUTPUT_SORTED_BAM="${INPUT_BAM%.bam}.sorted.bam"

# Sort the BAM file by coordinate
# -o: output file
# -@: number of threads (adjust as needed)
# -m: memory per thread (adjust as needed, e.g., 2G for 2GB)
samtools sort -o "${OUTPUT_SORTED_BAM}" -@ 8 -m 2G "${INPUT_BAM}"


Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Picard vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install Picard (example using conda)
# conda install -c bioconda picard

# Define variables for paths and files
GATK_QUEUE_JAR="/path/to/gatk/dist/Queue.jar" # Adjust this path to your GATK Queue.jar
DATA_DIR="/full/path/to/files" # Base directory for input/output files
INPUT_BAM="$DATA_DIR/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
OUTPUT_BAM="$DATA_DIR/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
TMP_DIR="$DATA_DIR/.queue/tmp"

# Create temporary directory if it doesn't exist
mkdir -p "$TMP_DIR"

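# Execute Picard SortSam via GATK Queue.jar
# Note: -XX:+UseParallelOldGC was removed in JDK 15; drop that flag if a newer JVM refuses to start.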
java -Xmx2048m \
     -XX:+UseParallelOldGC \
     -XX:ParallelGCThreads=4 \
     -XX:GCTimeLimit=50 \
     -XX:GCHeapFreeLimit=10 \
     -Djava.io.tmpdir="$TMP_DIR" \
     -cp "$GATK_QUEUE_JAR" \
     net.sf.picard.sam.SortSam \
     INPUT="$INPUT_BAM" \
     TMP_DIR="$TMP_DIR" \
     OUTPUT="$OUTPUT_BAM" \
     VALIDATION_STRINGENCY=SILENT \
     SO=coordinate \
     CREATE_INDEX=true


Takes output from sortSam, makes bam index for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam


Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

samtools v1.15.1 (Inferred from Skipper workflow) GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define input and output paths
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"

# Execute samtools index command
samtools index "$INPUT_BAM" "$OUTPUT_BAI"
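
Because later steps depend on the index matching the BAM, a quick staleness check can save a confusing failure. The sketch below uses a hypothetical helper, `index_is_fresh`, that only compares file timestamps; it does not validate the index contents, and `sorted.bam` is a placeholder path.

```shell
# Sketch: succeed only if "<bam>.bai" exists and is not older than the BAM.
index_is_fresh() {
  # $1 = path to a BAM file
  [ -f "$1.bai" ] && [ ! "$1" -nt "$1.bai" ]
}

BAM="sorted.bam"   # placeholder path
if index_is_fresh "$BAM"; then
  STATUS="index up to date"
else
  STATUS="run: samtools index $BAM $BAM.bai"
fi
echo "$STATUS"
```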


Takes inputs from multiple final bam files.

samtools merge (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Example: Merging multiple final BAM files into a single BAM file.
# This is a common step for combining technical or biological replicates.
# Replace 'replicate1.bam', 'replicate2.bam', etc., with your actual input BAM file names.
# Replace 'merged_replicates.bam' with your desired output BAM file name.
samtools merge -o merged_replicates.bam replicate1.bam replicate2.bam replicate3.bam


Merges the two technical replicates for further downstream analysis.

samtools merge (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Define input and output file names (replace with actual file paths)
INPUT_REPLICATE_1="replicate1.bam"
INPUT_REPLICATE_2="replicate2.bam"
OUTPUT_MERGED_BAM="merged_replicates.bam"

# Merge the two technical replicates into a single BAM file
samtools merge -o "${OUTPUT_MERGED_BAM}" "${INPUT_REPLICATE_1}" "${INPUT_REPLICATE_2}"

View on GitHub

Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

samtools v1.10 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Define input and output files
OUTPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
INPUT_BAM_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
INPUT_BAM_2="/full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"

# Merge sorted BAM files
samtools merge "${OUTPUT_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}"

View on GitHub
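
The two merge inputs follow the suffix convention visible in the recorded commands, where each step appends its own tag to the raw FASTQ name. As a rough sketch, assuming that convention holds for every replicate, the final per-replicate BAM path can be derived from the read-1 FASTQ path. The helper name `final_bam_for` is hypothetical and not part of the pipeline.

```shell
# Hypothetical helper: append the step suffixes seen in the recorded commands
# (adapterTrim -> round2 -> rmRep -> rmDup -> sorted) to a read-1 FASTQ path.
final_bam_for() {
    printf '%s.adapterTrim.round2.rmRep.rmDup.sorted.bam\n' "$1"
}

# Print the merge inputs for the two technical replicates (C01 and D08).
for fq in /full/path/to/files/file_R1.C01.fastq.gz \
          /full/path/to/files/file_R1.D08.fastq.gz; do
    final_bam_for "$fq"
done
```

The printed paths can then be passed to samtools merge in place of hand-typed file names.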

Takes output from sortSam, makes bam index for use downstream.

samtools index (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.19

# Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam

View on GitHub

Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

samtools v1.x GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Execute samtools index command
samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

View on GitHub
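
Downstream tools that need random access expect the index to match the BAM it sits beside. The following is a minimal sketch of a pre-flight check, assuming the index is stored as `<bam>.bai` as in the command above. The helper name `index_is_current` and the empty placeholder files are illustrative only, and the `-nt` test is widely supported by `sh` implementations but not strictly required by POSIX.

```shell
# Hypothetical helper: succeeds when the .bai next to a BAM exists and the
# BAM is not newer than it; fails otherwise.
index_is_current() {
    bam="$1"
    bai="${bam}.bai"
    [ -f "$bai" ] && ! [ "$bam" -nt "$bai" ]
}

# Demo on empty placeholder files (no real alignments involved).
demo_dir=$(mktemp -d)
touch -t 202001010000 "$demo_dir/CombinedID.merged.bam"
touch -t 202001020000 "$demo_dir/CombinedID.merged.bam.bai"
if index_is_current "$demo_dir/CombinedID.merged.bam"; then
    echo "index is current"
else
    echo "index missing or stale: re-run samtools index"
fi
rm -rf "$demo_dir"
```

A check like this is cheap to run before peak calling, since a BAM that was regenerated after indexing would otherwise be read with an outdated index.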

Takes output from sortSam.

samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Example: Index a sorted BAM file
# This step is commonly performed after sorting a BAM file to enable fast random access to alignments.
# Input: sorted.bam (output from sortSam)
# Output: sorted.bam.bai (BAM index file)
samtools index sorted.bam

View on GitHub

Only outputs the second read in each pair for use with single stranded peak caller.

samtools (Inferred with models/gemini-2.5-flash) v1.10 GitHub

$ Bash example

# Install samtools (example using conda)
# conda install -c bioconda samtools=1.10

# Example: Keep only the second read from each pair in an aligned BAM file
# This command filters for reads where the 'second in pair' flag (0x80) is set
# and writes them as BAM with the header, matching the recorded 'samtools view -hb -f 128' step.
# Input: aligned.bam (BAM file containing paired-end reads)
# Output: second_reads.bam (BAM file containing only the second read of each pair)
samtools view -hb -f 0x80 aligned.bam > second_reads.bam

View on GitHub

This is the final bam file to perform analysis on.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# The description "final bam file to perform analysis on" refers to the read-2-only BAM
# written by the 'samtools view' step; the recorded pipeline does not sort it again.
# If your own BAM is not yet coordinate-sorted and indexed, this block shows how to do that.
# Replace 'input.bam' with your actual unsorted BAM file and 'final.bam' with your desired output name.
samtools sort -o final.bam input.bam
samtools index final.bam

View on GitHub

Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Define input and output file paths
INPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
OUTPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam"

# Extract reads that are the second in a pair (flag 128) and output as BAM
samtools view -hb -f 128 "${INPUT_BAM}" > "${OUTPUT_BAM}"

View on GitHub
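
In the SAM specification, FLAG bit 0x80 (decimal 128) marks a record as the second read of its pair, which is what `-f 128` selects. The bit test can be sketched in plain shell arithmetic as shown below. The helper name `is_second_in_pair` is hypothetical, and the FLAG values are common paired-end examples rather than values taken from this dataset.

```shell
# Hypothetical helper: succeeds when the SAM FLAG has bit 128 (0x80) set.
is_second_in_pair() {
    [ $(( $1 & 128 )) -ne 0 ]
}

# 99 and 83 carry bit 64 (first in pair); 147 and 163 carry bit 128.
for flag in 99 147 83 163; do
    if is_second_in_pair "$flag"; then
        echo "FLAG $flag: kept by -f 128"
    else
        echo "FLAG $flag: dropped by -f 128"
    fi
done
```

Because `-f` requires every listed bit to be set, a record passes this filter only when bit 128 is present, whatever its other flag bits are.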

Takes results from samtools view.

samtools v1.19 GitHub

$ Bash example

# Install samtools (example using conda)
# conda install -c bioconda samtools=1.19

# This step takes the BAM file produced by the preceding 'samtools view' command.
# In the recorded pipeline that file is passed directly to the peak caller; the
# coordinate sort below is an optional, inferred step for inputs that are not already sorted.
samtools sort -o output.sorted.bam input.bam

View on GitHub

Calls peaks on those files.

clipper (Inferred with models/gemini-2.5-flash) version: from source (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install clipper (if not already available)
# clipper is a Python script, often run directly or installed via pip.
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# pip install .

# Define input files and parameters (placeholders)
# Replace with the actual path to your read-2-only IP BAM file
IP_BAM="path/to/your/ip_merged.r2.bam"
OUTPUT_BED="eclip_peaks.bed"
SPECIES="hg19" # Genome build recorded for this dataset

# Execute clipper to call peaks
# Only flags that appear in the recorded command for this dataset are used here;
# run 'clipper --help' to confirm what your installed version supports.
clipper -b "${IP_BAM}" \
        -s "${SPECIES}" \
        -o "${OUTPUT_BED}" \
        --bonferroni \
        --superlocal \
        --threshold-method binomial \
        --save-pickle

View on GitHub

Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

CLIPper (version unknown) GitHub

$ Bash example

# Installation instructions for CLIPper.
# It is recommended to use a virtual environment (e.g., conda or venv).
#
# Example using conda (the Python version here is an assumption; check the
# CLIPper README for the version your release requires):
# conda create -n clipper_env python=3.7
# conda activate clipper_env
#
# Then install from source:
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# pip install .
#
# Reference genome: hg19. Ensure the necessary genome files (e.g., FASTA, gene annotations)
# for hg19 are configured or available in the environment where CLIPper is run.

clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

View on GitHub

Tools Used

eCLIP STAR

Raw Source Text

Library strategy: eCLIP-seq
Takes output from raw files.  Run to trim off both 5' and 3' adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
Takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bigWig, bigBed, bed (col1: chrom, col2: chromStart, col3: chromEnd, col4: -log10 pvalue, col5: log2 fold enrichment above input, col6: strand) format, contains clusters of predicted RBP binding
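
The column layout described above (col4: -log10 p-value, col5: log2 fold enrichment) lends itself to simple threshold filtering with awk. This is a minimal sketch, assuming a tab-separated six-column BED file in that layout. The helper name `filter_peaks`, the cutoffs of 3 and 3, and the sample rows are all illustrative assumptions, not values or results from this dataset.

```shell
# Hypothetical helper: read BED rows on stdin and keep those whose column 4
# (-log10 p-value) and column 5 (log2 fold enrichment) meet the given minimums.
filter_peaks() {
    awk -F'\t' -v p="$1" -v fc="$2" '($4 + 0) >= (p + 0) && ($5 + 0) >= (fc + 0)'
}

# Made-up example rows: only the first passes both cutoffs.
printf 'chr1\t100\t150\t5.2\t4.1\t+\nchr1\t300\t340\t1.0\t6.0\t-\nchr2\t500\t560\t8.7\t2.5\t+\n' |
    filter_peaks 3 3
```

Adding `+ 0` forces awk to compare the fields as numbers rather than as strings.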
