GSE134971 Processing Pipeline

RIP-Seq code_examples 32 steps

Publication

An in vivo genome-wide CRISPR screen identifies the RNA-binding protein Staufen2 as a key regulator of myeloid leukemia.

Nature Cancer (2020), PMID 34109316

Dataset

GSE134971

An In Vivo Genome-Wide CRISPR Screen Identifies the RNA-Binding Protein Staufen2 as a Key Regulator of Myeloid Leukemia

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Takes output from raw files.

(Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# "Takes output from raw files." describes the input to the first trimming
# step (the raw paired-end FASTQ files) rather than an operation of its own,
# so there is no command to run here.

Run to trim off both 5' and 3' adapters on both reads.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output file names
READ1_IN="input_R1.fastq.gz"
READ2_IN="input_R2.fastq.gz"
READ1_OUT="trimmed_R1.fastq.gz"
READ2_OUT="trimmed_R2.fastq.gz"

# Define adapter sequences
# These are common Illumina TruSeq adapters. 
# Replace with specific adapter sequences if known for your library preparation.
ADAPTER_R1_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # 3' adapter for Read 1
ADAPTER_R2_3PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # 3' adapter for Read 2 (standard TruSeq Read 2 sequence; it is not the reverse complement of the Read 1 adapter)

# If specific 5' adapters need to be removed from the start of the reads,
# use -g for Read 1 and -G for Read 2, e.g.:
# ADAPTER_R1_5PRIME="YOUR_5PRIME_ADAPTER_R1"
# ADAPTER_R2_5PRIME="YOUR_5PRIME_ADAPTER_R2"
# cutadapt ... -g "${ADAPTER_R1_5PRIME}" -G "${ADAPTER_R2_5PRIME}" ...

# Run cutadapt to trim 3' adapters from both reads
cutadapt \
  -a "${ADAPTER_R1_3PRIME}" \
  -A "${ADAPTER_R2_3PRIME}" \
  -o "${READ1_OUT}" \
  -p "${READ2_OUT}" \
  "${READ1_IN}" \
  "${READ2_IN}"

Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
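
The read 2 adapters in this command appear to be derived from the read 1 adapters: reverse-complementing each `-g` sequence gives a string whose first 15 bases match the first two `-A` entries. A minimal sketch for checking that relationship follows; `revcomp` is a hypothetical helper name, not part of cutadapt or the original pipeline.

```shell
# revcomp SEQ: print the reverse complement of a DNA sequence (uppercase A/C/G/T only).
# Hypothetical helper for illustration only.
revcomp() {
  printf '%s\n' "$1" |
    awk '{ out = ""; for (i = length($0); i >= 1; i--) out = out substr($0, i, 1); print out }' |
    tr 'ACGT' 'TGCA'
}

# The two 5' adapters passed with -g in the recorded command
for adapter in CTTCCGATCTACAAGTT CTTCCGATCTTGGTCCT; do
  printf '%s -> %s\n' "$adapter" "$(revcomp "$adapter")"
done
```

The results are AACTTGTAGATCGGAAG and AGGACCAAGATCGGAAG, whose first 15 bases are the `-A AACTTGTAGATCGGA` and `-A AGGACCAAGATCGGA` adapters.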

cutadapt (inferred from the command's flags) vNot specified

$ Bash example

# Install cutadapt
# conda install -c bioconda cutadapt

# The recorded command begins with "quality-cutoff 6", which is not an executable.
# Its flags (-a, -g, -A, -m, -o, -p) are cutadapt options, so the start of the
# line ("cutadapt ... --") appears to have been cut off. Options that may have
# preceded --quality-cutoff are unknown and are not reconstructed here.
# "-A CTTGT AGATCGGAAG" in the recorded command is written below as the single
# adapter CTTGTAGATCGGAAG, matching the round 2 command.

# Define input and output file paths
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics"

# Trim adapters from both reads, quality-trim at Phred 6, and drop reads shorter than 18 nt
cutadapt \
  --quality-cutoff 6 \
  -m 18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}" \
  > "${METRICS_FILE}"

Takes output from cutadapt round 1.

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=1.18

# Define input and output files
INPUT_FASTQ="input_from_cutadapt_round1.fastq.gz"
OUTPUT_FASTQ="trimmed_round2.fastq.gz"

# Define adapter sequence for round 2 (placeholder - replace with actual sequence).
# In eCLIP pipelines, a second round of cutadapt might trim a different linker,
# poly-A/T tails, or perform more stringent quality trimming.
# Example: A common 3' adapter for eCLIP might be a linker or RT primer sequence.
ADAPTER_SEQUENCE_ROUND2="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Placeholder: Example Illumina TruSeq adapter

# Execute cutadapt for round 2 trimming
# -a ADAPTER_SEQUENCE_ROUND2: Trim 3' adapter
# -q 20,20: Quality trim from both ends with a threshold of 20 (Phred score)
# --minimum-length 18: Discard reads shorter than 18 bp after trimming
# Reads without a remaining adapter are kept; the recorded round 2 command does not discard untrimmed reads.
cutadapt -a "${ADAPTER_SEQUENCE_ROUND2}" -q 20,20 --minimum-length 18 -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}"

Run to trim off the 3' adapters on read 2, to control for double ligation events.

cutadapt (Inferred with models/gemini-2.5-flash) v2.10 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=2.10

# Define input and output files (placeholders)
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1_TRIMMED="output_R1_trimmed.fastq.gz"
OUTPUT_R2_TRIMMED="output_R2_trimmed.fastq.gz"

# Common Illumina TruSeq 3' adapter sequence used in eCLIP
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Run cutadapt to trim 3' adapters from Read 2
# -A: 3' adapter sequence to be removed from Read 2
# -o: Output file for Read 1 (untrimmed or trimmed based on paired-end processing)
# -p: Output file for Read 2 (trimmed)
# --minimum-length: Discard reads shorter than this after trimming
# -e: Maximum error rate for adapter matching
# -j: Number of CPU cores to use
# --max-n: Maximum number of N bases allowed in a read
# --nextseq-trim=20: Trim low-quality bases from the 3' end of reads (NextSeq-specific quality trimming)
# Pairs without a remaining adapter are kept: this round only catches double ligations,
# and the recorded command does not discard untrimmed reads.
cutadapt -A "${ADAPTER_SEQUENCE}" \
         -o "${OUTPUT_R1_TRIMMED}" \
         -p "${OUTPUT_R2_TRIMMED}" \
         --minimum-length 18 \
         -e 0.1 \
         -j 4 \
         --max-n 0 \
         --nextseq-trim=20 \
         "${INPUT_R1}" "${INPUT_R2}"

Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

cutadapt vInferred GitHub

$ Bash example

# cutadapt is a tool to find and remove adapter sequences, primers, poly-A tails, and other unwanted sequences from high-throughput sequencing reads.
# For installation, you can use pip:
# pip install cutadapt
# Or conda:
# conda install -c bioconda cutadapt

cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
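
The 20 `-A` sequences in this command look like 15-nt windows slid one base at a time along two longer sequences, one per barcode. That would also explain why `CTTGT AGATCGGAAG` in the round 1 command is most likely a single 15-mer split by a stray space. The sketch below regenerates the list under that assumption. The two 27-nt full-length sequences are reconstructed here by overlapping adjacent entries and are not stated in the source, and `tile_adapter` is a hypothetical helper.

```shell
# tile_adapter SEQ [WIDTH]: print every WIDTH-nt window of SEQ (default width 15).
tile_adapter() {
  printf '%s\n' "$1" |
    awk -v w="${2:-15}" '{ for (i = 1; i + w - 1 <= length($0); i++) print substr($0, i, w) }'
}

# Assumed full-length read 2 adapters (reconstructed from the -A list, one per barcode)
FULL_A="AACTTGTAGATCGGAAGAGCGTCGTGT"
FULL_B="AGGACCAAGATCGGAAGAGCGTCGTGT"

# Tile both sequences, drop windows shared by the two, and print them as cutadapt -A flags
{ tile_adapter "$FULL_A"; tile_adapter "$FULL_B"; } |
  awk '!seen[$0]++ { printf "-A %s\n", $0 }'
```

Each sequence yields 13 windows; six of them fall entirely in the shared suffix, which leaves the 20 distinct adapters seen in the command (in a different order).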

Takes output from cutadapt round 2.

cutadapt v2.10 GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=2.10

# Define input and output files
# INPUT_FASTQ is assumed to be the output of the second cutadapt round
INPUT_FASTQ="input_from_cutadapt_round2.fastq.gz"
OUTPUT_FASTQ="trimmed_polyA.fastq.gz"

# Illustrative follow-up trimming (poly-A tails and low-quality ends).
# The recorded pipeline passes the round 2 output directly to STAR, so this extra pass is an inferred example, not a documented step.
cutadapt \
  -a "A{100}" \
  -q 20 \
  --minimum-length 18 \
  --nextseq-trim=20 \
  -e 0.1 \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"

Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA and other repetitive reads.

bowtie2 (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2

# Placeholder for the combined human repetitive elements FASTA file.
# This file should contain sequences from RepBase (human-specific) and rRNA.
# You would typically download RepBase, filter for human elements, and combine with known rRNA sequences.
# Example source for RepBase: https://www.girinst.org/repbase/
# Example source for rRNA: NCBI RefSeq or UCSC genome browser
HUMAN_REPETITIVE_FASTA="human_repbase_and_rRNA.fasta"

# Build bowtie2 index for the combined repetitive elements.
# This step needs to be done once for the reference.
# bowtie2-build "${HUMAN_REPETITIVE_FASTA}" "human_repbase_rRNA_index"

# Input FASTQ file (e.g., adapter-trimmed reads from the second cutadapt round)
INPUT_FASTQ="input_reads.fastq.gz"
# Output FASTQ file containing reads that did NOT align to repetitive elements
OUTPUT_FASTQ="non_repetitive_reads.fastq.gz"

# Align reads to the repetitive elements index and output unaligned reads.
# -x: specify the basename of the index
# -U: specify the input FASTQ file (unpaired reads)
# --un-gz: output unaligned reads to a gzipped FASTQ file
# -p: number of threads to use
bowtie2 -x human_repbase_rRNA_index -U "${INPUT_FASTQ}" \
        --un-gz "${OUTPUT_FASTQ}" \
        -p 8 > /dev/null # Redirect SAM output to null as we only care about unaligned reads

Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (e.g., via Conda)
# conda install -c bioconda star

# Define placeholder variables for input/output paths and reference genome
# NOTE: /path/to/RepBase_human_database_file should be a STAR index built from RepBase sequences.
# For a real analysis, you would need to build this index first.
GENOME_DIR="/path/to/RepBase_human_database_file"
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# Execute the STAR alignment command
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${INPUT_R1}" "${INPUT_R2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd > "${OUTPUT_PREFIX}"

Takes output from STAR rmRep.

STAR v2.18.27 GitHub

$ Bash example

# No separate tool is needed for this hand-off.
# 'rmRep' refers to the repeat-removal alignment above (reads mapped against RepBase),
# not to PCR duplicate removal. Because that STAR run used --outReadsUnmapped Fastx,
# the read pairs that did not map to repeats are written next to the BAM as
# <prefix>Unmapped.out.mate1 and <prefix>Unmapped.out.mate2.
REP_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
UNMAPPED_R1="${REP_PREFIX}Unmapped.out.mate1"
UNMAPPED_R2="${REP_PREFIX}Unmapped.out.mate2"

# Confirm both files exist and are non-empty before genome mapping
for f in "${UNMAPPED_R1}" "${UNMAPPED_R2}"; do
  [ -s "${f}" ] || { echo "Missing or empty: ${f}" >&2; exit 1; }
done

Maps unique reads to the human genome.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Define variables
NUM_THREADS=8 # Adjust based on available CPU cores
STAR_INDEX_DIR="/path/to/STAR_genome_index/GRCh38" # Path to pre-built STAR genome index for human GRCh38
READS_R1="input_reads_R1.fastq.gz" # Path to input gzipped FASTQ file (Read 1)
READS_R2="input_reads_R2.fastq.gz" # Path to input gzipped FASTQ file (Read 2, remove if single-end)
OUTPUT_PREFIX="aligned_unique_reads" # Prefix for output files

# --- Installation (commented out) ---
# conda install -c bioconda star=2.7.10a

# --- Map unique reads to the human genome using STAR ---
# The --outFilterMultimapNmax 1 parameter ensures only uniquely mapping reads are reported.
# Adjust --limitBAMsortRAM based on available system RAM (e.g., 30GB for 30,000,000,000 bytes).
STAR --runThreadN ${NUM_THREADS} \
     --genomeDir ${STAR_INDEX_DIR} \
     --readFilesIn ${READS_R1} ${READS_R2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 10 \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66 \
     --outReadsUnmapped Fastx \
     --limitBAMsortRAM 30000000000

Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

STAR vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
STAR_GENOME_DIR="/path/to/your/GRCh38_STAR_index" # Example: GRCh38 STAR index, typically built from FASTA and GTF/GFF files (e.g., from UCSC or Ensembl)
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
INPUT_R2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
OUTPUT_FILE_BASE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" # This path is used for both the redirected BAM and as the prefix for other STAR output files (e.g., Log.out, SJ.out.tab)

# Execute STAR alignment
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${STAR_GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${INPUT_R1}" "${INPUT_R2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_FILE_BASE}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd > "${OUTPUT_FILE_BASE}"
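
With `--outFilterMultimapNmax 1` and `--outSAMunmapped Within`, the output should contain only uniquely aligned reads (`NH:i:1`) plus unmapped reads. A minimal sketch for tabulating the `NH` tag from SAM text is shown below. `count_nh` is a hypothetical helper, the toy records are made up, and in practice the input would come from something like `samtools view -h file.bam`.

```shell
# count_nh: read SAM text on stdin and tabulate the values of the NH:i:<n> tag.
count_nh() {
  awk -F'\t' '
    /^@/ { next }                      # skip header lines
    {
      nh = "missing"
      for (i = 12; i <= NF; i++)       # optional tags start at column 12
        if ($i ~ /^NH:i:/) { nh = substr($i, 6); break }
      count[nh]++
    }
    END { for (v in count) printf "NH=%s\t%d\n", v, count[v] }
  ' | sort
}

# Toy SAM records (made up): two uniquely mapped reads and one unmapped read
toy_sam() {
  printf '@HD\tVN:1.4\tSO:unsorted\n'
  printf 'r1\t0\tchr1\t100\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tHI:i:1\n'
  printf 'r2\t16\tchr2\t200\t255\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1\tHI:i:1\n'
  printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tNH:i:0\n'
}

toy_sam | count_nh
```

Any `NH` value above 1 in real output would indicate that multimapping reads were retained.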

Takes output from STAR genome mapping.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Placeholder for human GRCh38 genome index. This directory should contain genome files (e.g., genome.fa) and STAR index files.
GENOME_DIR="/path/to/STAR_genome_index/GRCh38"
# Placeholder for a GTF annotation file (e.g., GENCODE v38). STAR reads plain-text GTF, so decompress a .gtf.gz download first.
GTF_FILE="/path/to/gencode.v38.annotation.gtf"
# Placeholder for input FASTQ files (paired-end example).
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz"
# Prefix for output files.
OUTPUT_PREFIX="sample_aligned_"
# Number of threads to use for mapping.
THREADS=8

# Run STAR genome mapping
# This command performs splice-aware alignment of RNA-seq reads to a reference genome.
STAR \
  --runThreadN "${THREADS}" \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.1 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --sjdbGTFfile "${GTF_FILE}" \
  --sjdbOverhang 100

Custom random-mer-aware script for PCR duplicate removal.

umi_tools (Inferred with models/gemini-2.5-flash) v1.0.1 GitHub

$ Bash example

# Install umi_tools (if not already installed)
# conda install -c bioconda umi_tools

# Define input and output files
INPUT_BAM="aligned_reads.bam" # Input BAM file (coordinate-sorted and indexed)
OUTPUT_BAM="deduplicated_reads.bam" # Output deduplicated BAM file
LOG_FILE="umi_tools_dedup.log" # Log file for umi_tools
STATS_PREFIX="umi_tools_dedup_stats" # Prefix for the deduplication statistics files

# umi_tools dedup reads the random-mer (UMI) from the read name or a BAM tag.
# It does not take an extraction pattern; patterns belong to 'umi_tools extract',
# which runs on FASTQ files before alignment.
# Assumption: the random-mer is the last field of each read name, after "_".
# The recorded pipeline used the custom barcode_collapse_pe.py script instead.

# Run umi_tools dedup for random-mer-aware PCR duplicate removal
umi_tools dedup \
    --stdin="${INPUT_BAM}" \
    --stdout="${OUTPUT_BAM}" \
    --log="${LOG_FILE}" \
    --method=unique \
    --extract-umi-method=read_id \
    --umi-separator="_" \
    --paired \
    --output-stats="${STATS_PREFIX}"

Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

skipper (Inferred with models/gemini-2.5-flash) v0.1.0 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
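# barcode_collapse_pe.py is the custom script named in the recorded command; it must be on your PATH.
# The skipper attribution above is inferred, so check which repository provides the script before installing.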

# Execute barcode_collapse_pe.py
barcode_collapse_pe.py \
  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
```
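
The idea behind random-mer-aware duplicate removal is that two reads count as PCR duplicates only if they share both a mapping position and a random-mer. The sketch below applies that rule to a made-up table. It is not the logic of `barcode_collapse_pe.py`, whose internals are not given in the source; the column layout and the `collapse_duplicates` name are assumptions.

```shell
# collapse_duplicates: keep the first record for each (random-mer, chrom, start, strand) key.
# Input columns (tab-separated, hypothetical layout): randomer, chrom, start, strand, read_name
collapse_duplicates() {
  awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}

# Made-up reads: readB duplicates readA; readC shares the position but has a
# different random-mer; readD has the same random-mer at a different position.
toy_reads() {
  printf 'ACGTA\tchr1\t100\t+\treadA\n'
  printf 'ACGTA\tchr1\t100\t+\treadB\n'
  printf 'TTGCA\tchr1\t100\t+\treadC\n'
  printf 'ACGTA\tchr1\t250\t-\treadD\n'
}

toy_reads | collapse_duplicates
```

Only readB is dropped. A position-only deduplicator would also have discarded readC, which is the case the random-mer is meant to rescue.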

Takes output from barcode collapse PE.

samtools (used here only for a sanity check; not named in the source for this step)

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# The barcode collapse step writes a deduplicated BAM file (not FASTQ), so no
# realignment is needed; the recorded pipeline passes this file to SortSam.
RMDUP_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"

# Optional sanity check before sorting: verify the BAM is intact and summarize its reads
samtools quickcheck "${RMDUP_BAM}" && samtools flagstat "${RMDUP_BAM}"

Sorts resulting bam file for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.19.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.19.1

# Sort the BAM file
# Replace input.bam with your actual input BAM file
# Replace output_sorted.bam with your desired output sorted BAM file name
# -@: number of threads (adjust as needed)
# -o: output file
samtools sort -@ 8 -o output_sorted.bam input.bam

Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Picard vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install Picard Tools (often bundled with GATK)
# For example, using conda:
# conda install -c bioconda picard
# Or download the JAR directly from Broad Institute GATK releases.

java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
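
The per-sample file names in the recorded commands grow by one suffix per stage, so every intermediate path can be derived from the raw read 1 FASTQ path. A minimal sketch of that naming chain follows, using only suffixes that appear in the recorded commands; `stage_paths` and its output labels are hypothetical.

```shell
# stage_paths R1_FASTQ: print the per-stage file names derived from a raw read 1 FASTQ path.
# Hypothetical helper; the suffixes are copied from the recorded commands.
stage_paths() {
  base="$1.adapterTrim.round2"
  printf 'round1_r1=%s\n'  "$1.adapterTrim.fastq.gz"
  printf 'round2_r1=%s\n'  "$base.fastq.gz"
  printf 'rep_bam=%s\n'    "$base.rep.bam"
  printf 'unmapped_1=%s\n' "$base.rep.bamUnmapped.out.mate1"
  printf 'unmapped_2=%s\n' "$base.rep.bamUnmapped.out.mate2"
  printf 'rmrep_bam=%s\n'  "$base.rmRep.bam"
  printf 'rmdup_bam=%s\n'  "$base.rmRep.rmDup.bam"
  printf 'sorted_bam=%s\n' "$base.rmRep.rmDup.sorted.bam"
}

stage_paths /full/path/to/files/file_R1.C01.fastq.gz
```

Deriving names this way keeps a wrapper script consistent with the recorded commands when it is run on other samples.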

Takes output from sortSam, makes bam index for use downstream.

samtools index (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam
```

Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

samtools v1.10 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.10

samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

Takes inputs from multiple final bam files.

samtools merge (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not available
# conda install -c bioconda samtools

# Example input BAM files
# Replace these with your actual input BAM file paths
INPUT_BAM_1="sample1_replicate1.bam"
INPUT_BAM_2="sample1_replicate2.bam"
INPUT_BAM_3="sample1_replicate3.bam"

# Define the output merged BAM file
OUTPUT_MERGED_BAM="sample1_merged.bam"

# Merge multiple final bam files into a single output bam file.
# This is a common step for combining technical replicates before downstream analysis.
samtools merge "${OUTPUT_MERGED_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}" "${INPUT_BAM_3}"

# Index the merged BAM file (optional, but highly recommended for downstream tools)
samtools index "${OUTPUT_MERGED_BAM}"

Merges the two technical replicates for further downstream analysis.

samtools merge (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge two technical replicate BAM files into a single BAM file.
# This command assumes the input files are BAM files and merges them into a new BAM file.
# Replace 'replicate1.bam', 'replicate2.bam' with the actual input technical replicate BAM files.
# Replace 'merged_replicates.bam' with the desired output merged BAM file name.
samtools merge merged_replicates.bam replicate1.bam replicate2.bam

Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

samtools v1.19.1 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge multiple sorted BAM files into a single sorted BAM file
samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
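
When several replicate sets need merging, the input paths can be built from the replicate IDs (`C01` and `D08` in the recorded command). The sketch below only prints the command rather than running it. `merge_command` is a hypothetical helper, the path pattern is copied from the recorded command, and paths containing spaces are not handled.

```shell
# merge_command OUT_BAM ID...: print a samtools merge command for the given replicate IDs.
merge_command() {
  out="$1"
  shift
  cmd="samtools merge $out"
  for id in "$@"; do
    cmd="$cmd /full/path/to/files/file_R1.$id.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
  done
  printf '%s\n' "$cmd"
}

# Dry run: prints the same command as the recorded one
merge_command /full/path/to/files/CombinedID.merged.bam C01 D08
```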

Takes output from sortSam, makes bam index for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming the output from sortSam is 'sorted.bam'
# This command creates an index file 'sorted.bam.bai'
samtools index sorted.bam

Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

samtools v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define input and output file paths
INPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
OUTPUT_BAI="/full/path/to/files/CombinedID.merged.bam.bai"

# Create a BAM index file for the input BAM file
samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"

Takes output from sortSam.

samtools (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub

$ Bash example

# This step takes a sorted BAM file as input.
# A common and essential next step after sorting a BAM file is to index it.
# Indexing allows for fast random access to reads by genomic region, which is required
# by many downstream analysis tools (e.g., IGV, samtools view with regions, featureCounts).

# Install samtools if not already installed
# conda install -c bioconda samtools

# Define input file name
# The input BAM file is assumed to be the output from a 'sortSam' step.
INPUT_SORTED_BAM="input_sorted.bam" # Placeholder for the actual sorted BAM file

# Index the sorted BAM file
# This command creates an accompanying .bai index file (e.g., input_sorted.bam.bai).
samtools index "${INPUT_SORTED_BAM}"

Only outputs the second read in each pair for use with single stranded peak caller.

seqtk (Inferred with models/gemini-2.5-flash) v1.3 GitHub

$ Bash example

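# conda install -c bioconda seqtk
# Placeholder file names; the input is assumed to be one interleaved paired-end FASTQ file.
INPUT_INTERLEAVED_FASTQ="interleaved_reads.fastq.gz"
OUTPUT_R2_ONLY_FASTQ="read2_only.fastq"
# Extract only the second read (R2) from each pair.
# (The recorded pipeline does this on the merged BAM with samtools view -f 128.)
seqtk seq -2 "${INPUT_INTERLEAVED_FASTQ}" > "${OUTPUT_R2_ONLY_FASTQ}"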

This is the final bam file to perform analysis on.

samtools (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Assume 'input.bam' is the BAM file that needs to be finalized (e.g., sorted and indexed)
# Sort the BAM file by coordinate to make it ready for most downstream analyses
samtools sort -o final.bam input.bam

# Index the sorted BAM file, which is required for many tools to quickly access regions
samtools index final.bam

Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

samtools v1.9 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Extract reads that are the second in a pair (-f 128) from a merged BAM file
# Output in BAM format (-b) and include header (-h)
samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
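
`-f 128` keeps alignments whose SAM FLAG has bit 0x80 (decimal 128, "second read in pair") set. A minimal sketch of that bit test using shell arithmetic follows; `is_read2` is a hypothetical helper and the flag values are common paired-end examples chosen for illustration.

```shell
# is_read2 FLAG: succeed if the SAM FLAG has the 0x80 (second-in-pair) bit set.
is_read2() {
  [ $(( $1 & 128 )) -ne 0 ]
}

# 99 and 83 are typical first-in-pair flags; 147 and 163 are typical second-in-pair flags
for flag in 99 147 163 83; do
  if is_read2 "$flag"; then
    echo "$flag: read 2 (kept by -f 128)"
  else
    echo "$flag: not read 2 (dropped by -f 128)"
  fi
done
```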

Takes results from samtools view.

samtools v1.19 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Example: Convert a BAM file to SAM format for viewing.
# Replace 'input.bam' with your actual input file and 'output.sam' with your desired output file.
samtools view input.bam > output.sam

Calls peaks on those files.

clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install clipper (if not already installed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# pip install -r requirements.txt # Ensure Python and dependencies like pybedtools are available

# Define input files (placeholders)
# Replace 'ip.bam' with the path to your IP (immunoprecipitation) BAM file.
IP_BAM="ip.bam"
OUTPUT_BED="clipper_peaks.bed"
SPECIES="hg19" # Species/assembly name passed to -s, as in the recorded command below

# Execute clipper to call peaks.
# The recorded command passes only the IP BAM, so no control BAM option is used here.
python clipper/clipper.py -b "${IP_BAM}" -s "${SPECIES}" -o "${OUTPUT_BED}"

Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

CLIPper v0.0.1 GitHub

$ Bash example

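# Install CLIPper (if not already installed); see https://github.com/yeolab/clipper for instructions.
# A PyPI package named 'clipper' may not be this tool, so verify before using pip.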

# Execute CLIPper command
clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle


Tools Used

STAR

Raw Source Text

Takes output from raw files.  Run to trim off both 5' and 3' adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
Takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding
Supplementary_files_format_and_content: 764.01v02.IDR.out.0102merged.bed.narrowPeak.bed: a combined sample between 764_01 and 764_02, normalized against an input background (764_INPUT).
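
The per-sample commands in the raw source text chain together through a suffix-based naming convention, with each step appending to the previous output name. As a rough sketch, the shell function below prints that sequence of file names for one read-1 FASTQ. The function name `pipeline_outputs` is hypothetical; the suffixes are copied from the recorded commands.

```shell
# Print the per-step output names derived from one read-1 FASTQ name.
# pipeline_outputs is an illustrative helper, not part of the original pipeline.
pipeline_outputs() {
    base=$1
    echo "${base}.adapterTrim.fastq.gz"                       # cutadapt round 1
    echo "${base}.adapterTrim.round2.fastq.gz"                # cutadapt round 2
    echo "${base}.adapterTrim.round2.rep.bam"                 # STAR against RepBase
    echo "${base}.adapterTrim.round2.rmRep.bam"               # STAR against the genome
    echo "${base}.adapterTrim.round2.rmRep.rmDup.bam"         # barcode_collapse_pe.py
    echo "${base}.adapterTrim.round2.rmRep.rmDup.sorted.bam"  # SortSam
}

pipeline_outputs "file_R1.C01.fastq.gz"
```

The final sorted BAM of each technical replicate is what `samtools merge` combines into `CombinedID.merged.bam`.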
