GSE107766 Processing Pipeline


Publication

A protein-RNA interaction atlas of the ribosome biogenesis factor AATF.

Scientific Reports (2019), PMID 31363146

Dataset

Best practices for eCLIP experiments and analysis [uncertain quality]

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Library strategy: eCLIP-seq

eCLIP (based on the yeolab/eclip workflow, last updated 2020) GitHub

$ Bash example

# --- Installation (commented out) ---
# conda create -n eclip_env python=3.8 r-base=4.0 star cutadapt picard-tools samtools bedtools
# conda activate eclip_env
# pip install clipper
# git clone https://github.com/yeolab/merge_peaks.git
# export PATH=$PATH:$(pwd)/merge_peaks # Add merge_peaks to PATH

# --- Configuration ---
GENOME_DIR="/path/to/hg38_star_index" # Placeholder: Pre-built STAR index for hg38
GENOME_FASTA="/path/to/hg38.fa" # Placeholder: hg38 reference genome FASTA
GTF_FILE="/path/to/hg38.gtf" # Placeholder: hg38 gene annotation GTF
GENOME_SIZE_FILE="/path/to/hg38.chrom.sizes" # Placeholder: hg38 chromosome sizes (e.g., from UCSC goldenPath)
ADAPTERS="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Common Illumina adapters, adjust if known
OUTPUT_DIR="eclip_output"
mkdir -p "$OUTPUT_DIR"

# --- Input Files (Placeholders) ---
R1="sample_R1.fastq.gz"
R2="sample_R2.fastq.gz"
CONTROL_R1="input_R1.fastq.gz" # Assuming an input control for peak calling
CONTROL_R2="input_R2.fastq.gz"

# --- Step 1: Adapter Trimming (using cutadapt) ---
echo "Step 1: Adapter Trimming with Cutadapt"
cutadapt -a "$ADAPTERS" -A "$ADAPTERS" \
         -o "$OUTPUT_DIR/trimmed_R1.fastq.gz" \
         -p "$OUTPUT_DIR/trimmed_R2.fastq.gz" \
         "$R1" "$R2" > "$OUTPUT_DIR/cutadapt_report.txt"

cutadapt -a "$ADAPTERS" -A "$ADAPTERS" \
         -o "$OUTPUT_DIR/control_trimmed_R1.fastq.gz" \
         -p "$OUTPUT_DIR/control_trimmed_R2.fastq.gz" \
         "$CONTROL_R1" "$CONTROL_R2" > "$OUTPUT_DIR/control_cutadapt_report.txt"

# --- Step 2: Alignment (using STAR) ---
echo "Step 2: Alignment with STAR"
STAR --genomeDir "$GENOME_DIR" \
     --readFilesIn "$OUTPUT_DIR/trimmed_R1.fastq.gz" "$OUTPUT_DIR/trimmed_R2.fastq.gz" \
     --readFilesCommand zcat \
     --outFileNamePrefix "$OUTPUT_DIR/sample_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterScoreMinOverLread 0.6 \
     --outFilterMatchNminOverLread 0.6 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --runThreadN 8

STAR --genomeDir "$GENOME_DIR" \
     --readFilesIn "$OUTPUT_DIR/control_trimmed_R1.fastq.gz" "$OUTPUT_DIR/control_trimmed_R2.fastq.gz" \
     --readFilesCommand zcat \
     --outFileNamePrefix "$OUTPUT_DIR/control_" \
     --outSAMtype BAM SortedByCoordinate \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 3 \
     --outFilterScoreMinOverLread 0.6 \
     --outFilterMatchNminOverLread 0.6 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --runThreadN 8

# Index BAM files
samtools index "$OUTPUT_DIR/sample_Aligned.sortedByCoord.out.bam"
samtools index "$OUTPUT_DIR/control_Aligned.sortedByCoord.out.bam"

# --- Step 3: Deduplication (using Picard MarkDuplicates) ---
echo "Step 3: Deduplication with Picard MarkDuplicates"
java -jar /path/to/picard.jar MarkDuplicates \
         I="$OUTPUT_DIR/sample_Aligned.sortedByCoord.out.bam" \
         O="$OUTPUT_DIR/sample_dedup.bam" \
         M="$OUTPUT_DIR/sample_dedup_metrics.txt" \
         REMOVE_DUPLICATES=true

java -jar /path/to/picard.jar MarkDuplicates \
         I="$OUTPUT_DIR/control_Aligned.sortedByCoord.out.bam" \
         O="$OUTPUT_DIR/control_dedup.bam" \
         M="$OUTPUT_DIR/control_dedup_metrics.txt" \
         REMOVE_DUPLICATES=true

samtools index "$OUTPUT_DIR/sample_dedup.bam"
samtools index "$OUTPUT_DIR/control_dedup.bam"

# --- Step 4: Peak Calling (using CLIPPER) ---
echo "Step 4: Peak Calling with CLIPPER"
# CLIPPER reads a coordinate-sorted, indexed BAM directly (-b/--bam), so no BED conversion is needed.
# CLIPPER does not take the input control; in the eCLIP workflow, normalizing IP peaks
# against the size-matched input is a separate downstream step.
clipper -b "$OUTPUT_DIR/sample_dedup.bam" \
        -s hg38 \
        -o "$OUTPUT_DIR/sample_peaks.bed"

# --- Step 5: IDR (using merge_peaks) ---
echo "Step 5: IDR with merge_peaks"
# For proper IDR, multiple replicates are typically required.
# This example simulates with a single peak file for demonstration.
# In a real scenario, you would run CLIPPER on multiple replicates and then use merge_peaks.

mkdir -p "$OUTPUT_DIR/clipper_peaks"
cp "$OUTPUT_DIR/sample_peaks.bed" "$OUTPUT_DIR/clipper_peaks/sample_rep1_peaks.bed"
# If you had a second replicate, you would copy it here:
# cp "$OUTPUT_DIR/sample_rep2_peaks.bed" "$OUTPUT_DIR/clipper_peaks/sample_rep2_peaks.bed"

python merge_peaks.py -i "$OUTPUT_DIR/clipper_peaks" \
                      -o "$OUTPUT_DIR/idr_output" \
                      -idr 0.05 \
                      -s "$GENOME_SIZE_FILE" \
                      --prefix "eCLIP_IDR"
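
Because the script above calls several external programs, it can help to fail early when one of them is missing. The following is a minimal sketch, assuming bash; the function name `require_tools` is hypothetical, and the tool list simply mirrors commands used above and should be adjusted to the actual installation.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which required executables are missing from PATH.
# Returns 0 when every tool is found, 1 otherwise.
require_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Tools invoked by the pipeline above (an assumed list; adjust as needed).
if require_tools cutadapt STAR samtools java clipper; then
  echo "all tools found"
else
  echo "install the missing tools before running the pipeline"
fi
```

Placing such a check at the top of the script avoids discovering a missing aligner only after the trimming step has already run.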


Takes the raw FASTQ files as input.

FastQC (Inferred with models/gemini-2.5-flash) v0.11.9 GitHub

$ Bash example

# Install FastQC (if not already installed)
# conda install -c bioconda fastqc

# Example usage: Run FastQC on raw FASTQ files
# Replace sample_R1.fastq.gz and sample_R2.fastq.gz with your actual raw file names
# The -o . option specifies the output directory (current directory in this case)
fastqc -o . sample_R1.fastq.gz sample_R2.fastq.gz


Run to trim off both 5' and 3' adapters on both reads.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=4.0

# Define input and output files (placeholders)
READ1_IN="input_R1.fastq.gz"
READ2_IN="input_R2.fastq.gz"
READ1_OUT="output_R1.trimmed.fastq.gz"
READ2_OUT="output_R2.trimmed.fastq.gz"
LOG_FILE="cutadapt.log"

# Define adapter sequences (example Illumina TruSeq adapters)
# -a: 3' adapter on the forward read
# -A: 3' adapter on the reverse read
ADAPTER_FWD="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_REV="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Define trimming parameters
MIN_LEN=18       # Minimum length of read to keep
QUALITY_CUTOFF=20 # Quality cutoff at 3' end
ERROR_RATE=0.1   # Maximum error rate
OVERLAP=3        # Minimum overlap between read and adapter
THREADS=8        # Number of CPU threads

# Run cutadapt to trim adapters from paired-end reads
cutadapt \
    -a "${ADAPTER_FWD}" \
    -A "${ADAPTER_REV}" \
    -m "${MIN_LEN}" \
    -q "${QUALITY_CUTOFF}" \
    -e "${ERROR_RATE}" \
    --overlap "${OVERLAP}" \
    -j "${THREADS}" \
    -o "${READ1_OUT}" \
    -p "${READ2_OUT}" \
    "${READ1_IN}" "${READ2_IN}" \
    > "${LOG_FILE}" 2>&1


Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

cutadapt (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

cutadapt \
  --quality-cutoff=6 \
  --minimum-length=18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
  /full/path/to/files/file_R1.C01.fastq.gz \
  /full/path/to/files/file_R2.C01.fastq.gz \
  > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

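
The `-A` adapters in the command above look like 15-nt windows slid one base at a time across a barcode-plus-adapter junction. This is an observation from the listed sequences, not something the source states. The sketch below shows how such a list could be generated; `tile_kmers` is a hypothetical helper, and the 27-nt input is an assumption reconstructed by overlapping the listed windows that start at `AACTTGT`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every k-nt window of a sequence, one per line.
tile_kmers() {
  local seq="$1" k="${2:-15}" i
  for (( i = 0; i + k <= ${#seq}; i++ )); do
    printf '%s\n' "${seq:i:k}"
  done
}

# Junction sequence reconstructed from the listed -A windows (an assumption).
JUNCTION="AACTTGTAGATCGGAAGAGCGTCGTGT"

# Emit the windows as cutadapt -A arguments, one per line.
tile_kmers "$JUNCTION" 15 | sed 's/^/-A /'
```

The first window printed is `AACTTGTAGATCGGA` and the last is `GGAAGAGCGTCGTGT`, which correspond to the first and last `-A` sequences in the command.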

Takes output from cutadapt round 1.

cutadapt v4.0 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
# INPUT_FASTQ is the output from a previous cutadapt round (e.g., adapter trimming)
INPUT_FASTQ="input_from_cutadapt_round1.fastq.gz"
OUTPUT_FASTQ="output_cutadapt_round2.fastq.gz"

# Execute cutadapt for poly-A trimming, quality filtering, and minimum length filtering.
# This is a common second trimming step in eCLIP workflows after initial adapter trimming.
# -a "A{100}": Trims poly-A tails (up to 100 'A's).
# -q 20: Trims low-quality bases from the 3' end with a quality cutoff of 20.
# -m 18: Discards reads shorter than 18 bp after trimming.
cutadapt \
  -a "A{100}" \
  -q 20 \
  -m 18 \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"

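
To make the effect of `-m 18` concrete, the following sketch applies the same length rule to plain FASTQ text with awk. It only illustrates the filter and is not how cutadapt is implemented. The name `filter_min_length` is hypothetical, and the sketch assumes uncompressed, strictly four-line FASTQ records.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep only FASTQ records whose sequence is at least MIN nt long.
# Reads uncompressed four-line FASTQ records on stdin.
filter_min_length() {
  local min="$1"
  awk -v min="$min" '
    NR % 4 == 1 { header = $0 }
    NR % 4 == 2 { seq = $0 }
    NR % 4 == 3 { plus = $0 }
    NR % 4 == 0 {
      if (length(seq) >= min) {
        print header; print seq; print plus; print $0
      }
    }'
}

# Toy input: one 20-nt read (kept) and one 10-nt read (discarded).
printf '%s\n' \
  '@read_long'  'ACGTACGTACGTACGTACGT' '+' 'IIIIIIIIIIIIIIIIIIII' \
  '@read_short' 'ACGTACGTAC'           '+' 'IIIIIIIIII' \
  | filter_min_length 18
```

Only the four lines of `@read_long` are printed, since the second read falls below the 18-nt threshold.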

Run to trim off the 3' adapters on read 2, to control for double ligation events.

cutadapt (Inferred with models/gemini-2.5-flash) v3.4 GitHub

$ Bash example

# Install cutadapt via conda
# conda install -c bioconda cutadapt=3.4

# Define input and output files
INPUT_R2="read2.fastq.gz"
OUTPUT_R2="read2_trimmed.fastq.gz"

# Define adapter sequence for 3' end of Read 2.
# This sequence (Illumina Small RNA 3' Adapter or similar) is commonly found
# at the 3' end of Read 2 in eCLIP and small RNA-seq libraries, often due to
# double ligation events or as part of the library preparation.
ADAPTER_SEQUENCE="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Define trimming parameters
THREADS=4 # Number of CPU cores to use for parallel processing
MIN_READ_LENGTH=18 # Discard reads shorter than this length after trimming
QUALITY_CUTOFF=20 # Trim low-quality bases from the 3' end using a Phred score cutoff

# Run cutadapt to trim the specified 3' adapter from Read 2
cutadapt \
  -a "${ADAPTER_SEQUENCE}" \
  -o "${OUTPUT_R2}" \
  --cores "${THREADS}" \
  --minimum-length "${MIN_READ_LENGTH}" \
  --quality-cutoff "${QUALITY_CUTOFF}" \
  "${INPUT_R2}"

Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

cutadapt (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt

cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
```

Takes output from cutadapt round 2.

cutadapt v4.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
# INPUT_FASTQ is the output from cutadapt round 1
INPUT_FASTQ="sample_R1_round1_trimmed.fastq.gz"
OUTPUT_FASTQ="sample_R1_round2_trimmed.fastq.gz"

# Define trimming parameters for cutadapt round 2
# These are placeholders. Actual values depend on the specific adapters/barcodes
# to be removed in this second trimming step for eCLIP.
# For example, if round 1 removed sequencing adapters, round 2 might remove random barcodes.
# Replace 'YOUR_3PRIME_ADAPTER_SEQUENCE' and 'YOUR_5PRIME_ADAPTER_SEQUENCE' with actual sequences.
ADAPTER_3PRIME="YOUR_3PRIME_ADAPTER_SEQUENCE" # Example: AGATCGGAAGAGCACACGTCTGAACTCCAGTCA
ADAPTER_5PRIME="YOUR_5PRIME_ADAPTER_SEQUENCE" # Example: GTTCAGAGTTCTACAGTCCGACGATC
TRIM_5PRIME_BASES=0 # Number of fixed bases to remove from 5' end
TRIM_3PRIME_BASES=0 # Number of fixed bases to remove from 3' end (use negative value for cutadapt)
QUALITY_THRESHOLD=20 # Phred quality score threshold for trimming
MIN_READ_LENGTH=18 # Minimum read length after trimming
THREADS=4 # Number of CPU threads to use

# Execute cutadapt round 2
cutadapt \
    -a "${ADAPTER_3PRIME}" \
    -g "${ADAPTER_5PRIME}" \
    -u "${TRIM_5PRIME_BASES}" \
    -u "-${TRIM_3PRIME_BASES}" \
    -q "${QUALITY_THRESHOLD}" \
    -m "${MIN_READ_LENGTH}" \
    --discard-untrimmed \
    -j "${THREADS}" \
    -o "${OUTPUT_FASTQ}" \
    "${INPUT_FASTQ}"


Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA (and other) repetitive reads.

RepeatMasker (Inferred with models/gemini-2.5-flash) v4.1.2-p1 GitHub

$ Bash example

# Install RepeatMasker (if not already installed)
# RepeatMasker requires a repeat library, typically RepBase, which is often installed with it or separately configured. Ensure RepBase is configured for human species.
# For example, using conda:
# conda install -c bioconda repeatmasker

# Define input and output files
# Replace 'human_genome.fasta' with the actual path to your unmasked human reference genome (e.g., GRCh38/hg38).
GENOME_FASTA="human_genome.fasta"
OUTPUT_DIR="repeatmasker_output"
NUM_THREADS=8 # Adjust based on available CPU cores

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run RepeatMasker to identify and mask repetitive elements in the human genome
# -species human: Uses human-specific repeat libraries from RepBase.
# -pa ${NUM_THREADS}: Specifies the number of processors to use for parallel execution.
# -dir ${OUTPUT_DIR}: Sets the directory where all output files will be stored.
# -xsmall: Soft-masks identified repeats by writing them in lowercase letters (instead of replacing them with 'N's), which is often preferred for downstream analysis.
# -gff: Outputs the repeat annotations in GFF format, in addition to the standard .out and .tbl files.
RepeatMasker -species human -pa "${NUM_THREADS}" -dir "${OUTPUT_DIR}" -xsmall -gff "${GENOME_FASTA}"

# Expected output files in ${OUTPUT_DIR}:
# - human_genome.fasta.masked: The masked genome FASTA file (with -xsmall, repeats appear in lowercase rather than as 'N's).
# - human_genome.fasta.out: A detailed report of all identified repeats.
# - human_genome.fasta.tbl: A summary table of repeat classes and families.
# - human_genome.fasta.gff: Repeat annotations in GFF format (if -gff was used).


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

STAR v2.7.10a GitHub

$ Bash example

# Install STAR using conda
# conda install -c bioconda star

# Define variables for input, output, and reference files
# The --genomeDir expects a STAR index built from the RepBase human sequences.
# Replace with the actual path to your STAR index.
GENOME_DIR="/path/to/RepBase_human_STAR_index"
READ1_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
READ2_FILE="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# The --outFileNamePrefix is used for other STAR output files (e.g., Log.out, SJ.out.tab).
# In this command, it's set to the same path as the output BAM, which means other files
# will be named like '/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamLog.out'.
# If you prefer these auxiliary files in a separate directory or with a different prefix,
# adjust this variable accordingly.
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# Execute the STAR alignment command
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ1_FILE}" "${READ2_FILE}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd > "${OUTPUT_BAM}"
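
Since STAR concatenates `--outFileNamePrefix` directly with its fixed output names, the unmapped reads written by `--outReadsUnmapped Fastx` end up as `<prefix>Unmapped.out.mate1` and `<prefix>Unmapped.out.mate2`. The sketch below only prints the names that would result and does not run STAR; `star_output_names` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list file names STAR derives from --outFileNamePrefix.
# STAR appends its fixed suffixes to the prefix without adding a separator.
star_output_names() {
  local prefix="$1" suffix
  for suffix in Log.out Log.progress.out Log.final.out SJ.out.tab \
                Unmapped.out.mate1 Unmapped.out.mate2; do
    printf '%s%s\n' "$prefix" "$suffix"
  done
}

star_output_names "/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
```

The two `Unmapped.out` names printed here are the files that the later genome-mapping command reads as input.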


Takes output from STAR rmRep.

samtools v1.10 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.10

# Placeholder for input BAM from STAR alignment (e.g., from a previous STAR alignment step)
INPUT_BAM="star_aligned_reads.bam"
OUTPUT_DEDUP_BAM="star_aligned_reads.dedup.bam"

# Deduplication steps as typically performed in eCLIP pipelines (e.g., from yeolab/skipper)
# 1. Sort by read name
samtools sort -n "${INPUT_BAM}" -o "${INPUT_BAM%.bam}.namesort.bam"
# 2. Fixmate information
samtools fixmate -m "${INPUT_BAM%.bam}.namesort.bam" "${INPUT_BAM%.bam}.fixmate.bam"
# 3. Sort by coordinate
samtools sort "${INPUT_BAM%.bam}.fixmate.bam" -o "${INPUT_BAM%.bam}.positionsort.bam"
# 4. Mark and remove PCR duplicates
samtools markdup -r "${INPUT_BAM%.bam}.positionsort.bam" "${OUTPUT_DEDUP_BAM}"
# 5. Index the deduplicated BAM file
samtools index "${OUTPUT_DEDUP_BAM}"

# Optional: Clean up intermediate files
# rm "${INPUT_BAM%.bam}.namesort.bam" "${INPUT_BAM%.bam}.fixmate.bam" "${INPUT_BAM%.bam}.positionsort.bam"


Maps unique reads to the human genome.

BWA (Burrows-Wheeler Aligner) (Inferred with models/gemini-2.5-flash) v0.7.17 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install BWA and Samtools
# conda install -c bioconda bwa samtools

# Define reference genome and read files
REFERENCE_GENOME="/path/to/human_genome/GRCh38.fa"
READS_R1="reads_1.fastq.gz"
READS_R2="reads_2.fastq.gz"
OUTPUT_BAM="aligned_reads.sorted.bam"

# Index the reference genome (if not already indexed)
# This step only needs to be run once per reference genome
bwa index "${REFERENCE_GENOME}"

# Map reads to the human genome using BWA-MEM
# -t 8: Use 8 threads for alignment
# Pipe the SAM output directly to samtools for conversion to BAM, sorting, and indexing
bwa mem -t 8 "${REFERENCE_GENOME}" "${READS_R1}" "${READS_R2}" | \
  samtools view -bS - | \
  samtools sort -o "${OUTPUT_BAM}" -
samtools index "${OUTPUT_BAM}"


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

STAR (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Reference genome directory placeholder. Replace with your actual STAR index path.
# Example: /path/to/your/STAR_index/GRCh38
GENOME_DIR="/path/to/STAR_database_file"

# Input read files. These appear to be unmapped reads from a previous BAM file.
# Replace with your actual input file paths.
READ_FILE_MATE1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
READ_FILE_MATE2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"

# Output BAM file prefix and final output file name.
# The command redirects stdout to the final BAM file, so the prefix is mainly for other STAR output files (e.g., Log.out, SJ.out.tab).
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
FINAL_OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_MATE1}" "${READ_FILE_MATE2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${FINAL_OUTPUT_BAM}"

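
`--outFilterMultimapNmax 1` makes STAR report only reads with a single alignment. STAR records the number of alignments for a read in the `NH` SAM tag, so the same criterion can be illustrated on SAM text. The sketch below uses a hypothetical helper name and toy records. STAR applies the filter during alignment, so this is useful for illustration or sanity checks only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep SAM header lines and alignments tagged NH:i:1.
keep_unique_alignments() {
  awk -F'\t' '
    /^@/ { print; next }              # pass header lines through
    {
      for (i = 12; i <= NF; i++) {    # optional tags start at column 12
        if ($i == "NH:i:1") { print; next }
      }
    }'
}

# Toy SAM records (tab-separated): read1 maps once, read2 maps three times.
{
  printf '@HD\tVN:1.6\tSO:unsorted\n'
  printf 'read1\t0\tchr1\t100\t255\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:1\n'
  printf 'read2\t0\tchr1\t500\t1\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tNH:i:3\n'
} | keep_unique_alignments
```

The header line and `read1` pass the filter, while `read2` is dropped because its `NH` value is 3.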

Takes output from STAR genome mapping.

STAR v2.7.3a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables (replace with actual paths and names)
GENOME_DIR="/path/to/STAR_genome_index/hg38" # Placeholder: hg38 genome index
READ1_FASTQ="input_read1.fastq.gz"
READ2_FASTQ="input_read2.fastq.gz" # Omit if single-end
OUTPUT_PREFIX="aligned_reads"
NUM_THREADS=8 # Adjust as needed

# Run STAR genome mapping for paired-end reads
STAR \
  --runThreadN ${NUM_THREADS} \
  --genomeDir ${GENOME_DIR} \
  --readFilesIn ${READ1_FASTQ} ${READ2_FASTQ} \
  --outFileNamePrefix ${OUTPUT_PREFIX} \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --sjdbScore 1 \
  --readFilesCommand zcat

# The output will be ${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam


Custom random-mer-aware script for PCR duplicate removal.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools
# conda install -c bioconda umi_tools

# Example usage for PCR duplicate removal using random-mers (UMIs).
# This command assumes that UMIs have already been extracted from the reads
# and appended to the read names (e.g., by a preceding `umi_tools extract` step).
# The UMI is expected to be separated from the original read name by a colon.
# The 'directional' method is generally recommended for its robustness.
# Replace 'input.bam' with your aligned BAM file containing UMIs in read names.
# Replace 'output.dedup.bam' with your desired output deduplicated BAM file name.
# Replace 'output.dedup.log' with your desired log file name.

umi_tools dedup \
    --stdin input.bam \
    --stdout output.dedup.bam \
    --umi-separator ":" \
    --method directional \
    --paired \
    --log output.dedup.log
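
The idea behind random-mer-aware duplicate removal is that two reads count as PCR duplicates only when they share both the mapping position and the random-mer. The sketch below shows that rule on a toy table. It is a simplification, not the logic of `barcode_collapse_pe.py` or `umi_tools`. It assumes a hypothetical tab-separated input of read name, chromosome, start, and strand, with the random-mer stored as the part of the read name before the first colon.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep the first read for each (random-mer, chrom, start, strand).
# Input columns (tab-separated): read_name, chrom, start, strand.
# Assumption: the random-mer is the read-name prefix before the first colon.
collapse_by_randomer() {
  awk -F'\t' '
    {
      split($1, name_parts, ":")
      key = name_parts[1] FS $2 FS $3 FS $4
      if (!(key in seen)) { seen[key] = 1; print }
    }'
}

# Toy input: reads 1 and 2 are duplicates; read 3 differs only in its random-mer.
printf '%s\t%s\t%s\t%s\n' \
  'AACGT:read1' chr1 1000 + \
  'AACGT:read2' chr1 1000 + \
  'GGTCA:read3' chr1 1000 + \
  | collapse_by_randomer
```

Here `read2` is removed as a duplicate of `read1`, while `read3` is kept because its random-mer differs even though it maps to the same position.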


Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

barcode_collapse_pe.py (Part of Skipper pipeline) (Inferred with models/gemini-2.5-flash) vLatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Miniconda if not already installed
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
# bash Miniconda3-latest-Linux-x86_64.sh -b -p $HOME/miniconda
# export PATH="$HOME/miniconda/bin:$PATH"
# conda init bash
# source ~/.bashrc

# Clone the Skipper repository
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create and activate the conda environment (assuming environment.yml is available in the skipper directory)
# conda env create -f environment.yml
# conda activate skipper_env # or the name specified in environment.yml

# Define variables for input and output files
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"

# Execute the barcode_collapse_pe.py script
# Assuming barcode_collapse_pe.py is in the current PATH or specified with its full path
# If running from the cloned skipper directory, the path would be ./scripts/barcode_collapse_pe.py
python barcode_collapse_pe.py \
  --bam "${INPUT_BAM}" \
  --out_file "${OUTPUT_BAM}" \
  --metrics_file "${METRICS_FILE}"


Takes output from barcode collapse PE.

STAR (Inferred with models/gemini-2.5-flash) v2.7.9a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star=2.7.9a

# Define variables
# Placeholder for STAR genome index, typically built from a reference genome like hg38.
# Example: /path/to/STAR_index/hg38_gencode_v38
GENOME_DIR="/path/to/STAR_index/hg38" 
READ1="sample_R1.collapsed.fastq.gz" # Output from barcode collapse PE
READ2="sample_R2.collapsed.fastq.gz" # Output from barcode collapse PE
OUTPUT_PREFIX="sample_aligned" # Prefix for output files
THREADS=8 # Adjust based on available resources

# Run STAR alignment for paired-end reads
# This command aligns the collapsed paired-end reads to the reference genome.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --readFilesCommand zcat \
     --twopassMode Basic \
     --limitBAMsortRAM 30000000000 # Example: 30GB RAM for sorting, adjust as needed


Sorts resulting bam file for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.19

# Sort the BAM file by coordinate (default behavior)
samtools sort -o sorted_output.bam input.bam


Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Picard v2.x.x (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Picard (e.g., via conda or by downloading the jar)
# conda install -c bioconda picard
# Or download the latest release from https://github.com/broadinstitute/picard/releases
# Ensure Java is installed (e.g., OpenJDK 11 or later)

# Define variables for clarity
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
TMP_DIR="/full/path/to/files/.queue/tmp"
QUEUE_JAR_PATH="/path/to/gatk/dist/Queue.jar"

# Create temporary directory if it doesn't exist
mkdir -p "${TMP_DIR}"

# Execute Picard SortSam command
java -Xmx2048m \
  -XX:+UseParallelOldGC \
  -XX:ParallelGCThreads=4 \
  -XX:GCTimeLimit=50 \
  -XX:GCHeapFreeLimit=10 \
  -Djava.io.tmpdir="${TMP_DIR}" \
  -cp "${QUEUE_JAR_PATH}" \
  net.sf.picard.sam.SortSam \
  INPUT="${INPUT_BAM}" \
  TMP_DIR="${TMP_DIR}" \
  OUTPUT="${OUTPUT_BAM}" \
  VALIDATION_STRINGENCY=SILENT \
  SO=coordinate \
  CREATE_INDEX=true


Takes output from sortSam, makes bam index for use downstream.

samtools index (Inferred with models/gemini-2.5-flash) v1.19.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'sorted.bam' is the output from sortSam
# Replace 'sorted.bam' with the actual path to your sorted BAM file
samtools index sorted.bam


Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

samtools v1.19 GitHub

$ Bash example

# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
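
The commands in this pipeline build each output name by appending a suffix to the previous one. The sketch below prints that chain for one sample so the inputs and outputs of each step are easy to trace. The suffixes are copied from the commands shown in this section; the function name `eclip_name_chain` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the successive file names produced for one R1 FASTQ.
eclip_name_chain() {
  local base="$1" name suffix
  name="$base"
  printf '%s\n' "$name"
  for suffix in adapterTrim round2 rmRep rmDup sorted; do
    name="${name}.${suffix}"
    case "$suffix" in
      adapterTrim|round2) printf '%s\n' "${name}.fastq.gz" ;;  # trimmed FASTQ
      *)                  printf '%s\n' "${name}.bam" ;;       # BAM outputs
    esac
  done
  printf '%s\n' "${name}.bam.bai"                               # BAM index
}

eclip_name_chain "/full/path/to/files/file_R1.C01.fastq.gz"
```

The repeat-mapping output (`.adapterTrim.round2.rep.bam`) is a side branch and is left out of the chain on purpose, since only its unmapped reads continue to the genome-mapping step.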


Takes inputs from multiple final bam files.

samtools merge (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.19

# Merge multiple BAM files into a single output BAM file.
# Replace input1.bam, input2.bam, etc., with your actual BAM file paths.
# Replace merged_output.bam with your desired output file name.
samtools merge merged_output.bam input1.bam input2.bam input3.bam


Merges the two technical replicates for further downstream analysis.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge two technical replicate BAM files
# Replace replicate1.bam and replicate2.bam with actual input files
# Replace merged_replicates.bam with the desired output file name
samtools merge merged_replicates.bam replicate1.bam replicate2.bam


Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge multiple sorted BAM files into a single sorted BAM file
samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

View on GitHub
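The two replicate inputs in the command above follow one naming pattern and differ only in the barcode ID (C01, D08). A minimal sketch of assembling the same merge command from those IDs is shown below. The helper name `replicate_bam` and the dry-run printing are illustrative assumptions, not part of the original pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the per-replicate BAM name used in the
# command above from a directory and a barcode ID such as C01 or D08.
replicate_bam() {
    local dir=$1 id=$2
    printf '%s/file_R1.%s.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam\n' "$dir" "$id"
}

DIR="/full/path/to/files"   # placeholder directory, as in the source text

# Assemble the command as an array so paths with spaces stay intact.
merge_cmd=(samtools merge "$DIR/CombinedID.merged.bam")
for id in C01 D08; do
    merge_cmd+=("$(replicate_bam "$DIR" "$id")")
done

# Dry run: print the command instead of executing it.
printf '%s ' "${merge_cmd[@]}"; echo
```

To run the merge rather than print it, replace the final line with `"${merge_cmd[@]}"`.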

Takes output from sortSam, makes bam index for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.10

# Create BAM index for the sorted BAM file
# Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam

View on GitHub

Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

View on GitHub
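An index is only useful while it matches its BAM file. A minimal sketch of a freshness check follows, assuming the index sits next to the BAM as `<name>.bam.bai`, as in the command above. The helper name `needs_index` and the placeholder path are illustrative assumptions, not part of the original pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the BAM has no .bai next to it,
# or when the BAM was modified more recently than its index.
needs_index() {
    local bam=$1
    local bai="${bam}.bai"
    [[ ! -e "$bai" || "$bam" -nt "$bai" ]]
}

# Placeholder path; substitute the merged BAM from the step above.
BAM="CombinedID.merged.bam"

if [[ -e "$BAM" ]] && needs_index "$BAM"; then
    if command -v samtools >/dev/null 2>&1; then
        samtools index "$BAM" "${BAM}.bai"
    else
        echo "samtools not found; cannot index $BAM" >&2
    fi
fi
```

The check compares modification times only, so it will not detect an index built from a different file with the same name.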

Takes output from sortSam.

samtools (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.19.1

# This step takes a sorted BAM file (output from sortSam) and creates an index file (.bai).
# The index file is essential for many downstream applications that require random access to reads in the BAM file.
# Replace 'sorted.bam' with the actual name of your sorted BAM file.
samtools index sorted.bam

View on GitHub

Only outputs the second read in each pair for use with single stranded peak caller.

samtools (reformat.sh was inferred with models/gemini-2.5-flash; the source command for this step uses samtools view) GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define input and output files (placeholder names)
INPUT_BAM="merged.bam"
OUTPUT_R2_BAM="merged.r2.bam"

# Only outputs the second read in each pair for use with single stranded peak caller.
# '-f 128' keeps alignments flagged as the second read of a pair; '-h' keeps the header; '-b' writes BAM.
samtools view -hb -f 128 "${INPUT_BAM}" > "${OUTPUT_R2_BAM}"

View on GitHub

This is the final bam file to perform analysis on.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Assume 'final.bam' is the input BAM file that has undergone alignment, sorting, and duplicate removal.
# Index the final BAM file to enable fast random access for downstream analysis tools.
samtools index final.bam

View on GitHub

Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

samtools v1.10+ GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

View on GitHub
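The `-f 128` filter keeps alignments whose SAM FLAG has bit 0x80 set, which the SAM specification defines as the last segment of the template (read 2 of a pair). The bit test can be sketched in plain Bash arithmetic as below. The function name `is_read2` and the sample FLAG values are illustrative assumptions, not part of the pipeline.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when a SAM FLAG value has bit 128 (0x80) set,
# i.e. the alignment is the second read of its pair.
is_read2() {
    local flag=$1
    (( (flag & 128) != 0 ))
}

# Illustrative FLAG values: 83 and 99 are typical read-1 flags,
# 147 and 163 are typical read-2 flags for properly paired reads.
for flag in 83 99 147 163; do
    if is_read2 "$flag"; then
        echo "$flag read2"
    else
        echo "$flag read1"
    fi
done
```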

Takes results from samtools view.

samtools v1.19 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define input and output file paths
INPUT_BAM="input_aligned_reads.bam" # Placeholder for an input BAM file, e.g., the read-2 BAM from the previous step
OUTPUT_SAM="output_viewed_reads.sam" # Placeholder for an output SAM file

# The source text gives no separate command for this description: "Takes results from samtools view"
# refers to the read-2 BAM written by 'samtools view -f 128', which is the input to peak calling.
# The command below only converts a BAM file to SAM text, including the header (-h), for inspection.
samtools view -h "${INPUT_BAM}" > "${OUTPUT_SAM}"

View on GitHub

Calls peaks on those files.

clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install clipper (if not already installed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install # Or ensure clipper.py is executable and in PATH

# Placeholder for input BAM file (read-2 alignments) and output peak file
INPUT_BAM="aligned_reads.bam"
OUTPUT_PEAKS="peaks.bed"

# -s takes a species/genome build name, not a genome size.
# The source command and the Genome_build field both specify hg19.
SPECIES="hg19"

# Execute clipper to call peaks
# Assuming the 'clipper' executable is on PATH after installation
clipper -b "${INPUT_BAM}" -s "${SPECIES}" -o "${OUTPUT_PEAKS}"

View on GitHub

Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

CLIPper (version not specified) GitHub

$ Bash example

# Install CLIPper (if not already installed)
# CLIPper is a Python-based tool. Installation typically involves cloning the repository
# and running the setup script, or ensuring the main script is executable and in your PATH.
# Example installation (adjust if 'clipper' is not directly in PATH):
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# python setup.py install --user # Install to user's local site-packages
# # Or, if you just want to run the script directly:
# # python /path/to/clipper/clipper.py ...

clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

View on GitHub
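The peak file written by the command above is in BED format, so a quick sanity check is to count records per strand. A minimal sketch follows, assuming the standard BED layout with the strand in column 6. The helper name `count_by_strand` and the three demo records are made up for illustration and are not real peak calls.

```shell
#!/usr/bin/env bash
# Hypothetical helper: counts BED records per strand (column 6).
count_by_strand() {
    awk -F'\t' '{ n[$6]++ } END { for (s in n) print s "\t" n[s] }' "$1" | sort
}

# Made-up three-line BED file, for illustration only.
demo_bed=$(mktemp)
printf 'chr1\t100\t150\tpeak_1\t0.001\t+\n'  > "$demo_bed"
printf 'chr1\t900\t960\tpeak_2\t0.002\t-\n' >> "$demo_bed"
printf 'chr2\t500\t540\tpeak_3\t0.004\t+\n' >> "$demo_bed"

count_by_strand "$demo_bed"
```

For real data, pass the path given to `-o` instead of the demo file.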

Tools Used

eCLIP STAR

Raw Source Text

Library strategy: eCLIP-seq
Takes output from raw files.  Run to trim off both 5' and 3' adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding
