GSE78509 Processing Pipeline


Publication

Enhanced CLIP Uncovers IMP Protein-RNA Targets in Human Pluripotent Stem Cells Important for Cell Adhesion and Survival.

Cell reports (2016) — PMID 27068461

Dataset

GSE78509

Enhanced CLIP uncovers IMP protein-RNA targets in human pluripotent stem cells important for cell adhesion and survival

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Library strategy: eCLIP-Seq

eCLIP vlatest (from GitHub repository) GitHub

$ Bash example

# Install cwltool (if not already installed)
# conda install -c conda-forge cwltool

# Clone the eCLIP workflow repository
# git clone https://github.com/yeolab/eclip.git
# cd eclip # Navigate into the cloned directory to run the workflow

# --- Placeholder for Reference Data ---
# Replace with actual paths to your reference genome and annotation files.
# For human hg38 as an example:
GENOME_DIR="/path/to/hg38_star_index" # Directory containing STAR index files
GTF_FILE="/path/to/gencode.v38.annotation.gtf" # GTF annotation file
FASTA_FILE="/path/to/hg38.fa" # Genome FASTA file
CHROM_SIZES="/path/to/hg38.chrom.sizes" # Chromosome sizes file

# --- Placeholder for Input Data ---
# Replace with actual paths to your eCLIP and control FASTQ files.
# Assuming paired-end reads for both sample and control.
SAMPLE_R1="/path/to/sample_rep1_R1.fastq.gz"
SAMPLE_R2="/path/to/sample_rep1_R2.fastq.gz"
CONTROL_R1="/path/to/control_rep1_R1.fastq.gz"
CONTROL_R2="/path/to/control_rep1_R2.fastq.gz"

# Create an input YAML file for the eCLIP CWL workflow
cat << EOF > eclip_inputs.yaml
fastq_r1:
  class: File
  path: ${SAMPLE_R1}
fastq_r2:
  class: File
  path: ${SAMPLE_R2}
control_fastq_r1:
  class: File
  path: ${CONTROL_R1}
control_fastq_r2:
  class: File
  path: ${CONTROL_R2}
genome_dir:
  class: Directory
  path: ${GENOME_DIR}
gtf_file:
  class: File
  path: ${GTF_FILE}
fasta_file:
  class: File
  path: ${FASTA_FILE}
chrom_sizes:
  class: File
  path: ${CHROM_SIZES}
output_prefix: "my_eclip_experiment"
adapter_sequence: "AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Common Illumina adapter sequence
min_read_length: 15
max_read_length: 100
min_quality: 20
threads: 8
memory: 32 # GB
EOF

# Execute the eCLIP CWL workflow
cwltool eclip.cwl eclip_inputs.yaml


Takes output from raw files.

FastQC (Inferred with models/gemini-2.5-flash) v0.11.9 GitHub

$ Bash example

# Install FastQC if not already installed
# conda install -c bioconda fastqc

# Define input and output paths
INPUT_DIR="raw_data" # Directory containing raw FASTQ files
OUTPUT_DIR="fastqc_reports"
SAMPLE_ID="sample_name" # Placeholder for a specific sample identifier

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run FastQC on raw FASTQ files
# Assuming paired-end reads: ${SAMPLE_ID}_R1.fastq.gz and ${SAMPLE_ID}_R2.fastq.gz
fastqc -o "${OUTPUT_DIR}" "${INPUT_DIR}/${SAMPLE_ID}_R1.fastq.gz" "${INPUT_DIR}/${SAMPLE_ID}_R2.fastq.gz"

# If single-end reads, use:
# fastqc -o "${OUTPUT_DIR}" "${INPUT_DIR}/${SAMPLE_ID}.fastq.gz"


Run to trim off both 5' and 3' adapters on both reads.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub

$ Bash example

# Installation (uncomment to install via conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output files (replace with actual filenames)
READ1_IN="raw_R1.fastq.gz"
READ2_IN="raw_R2.fastq.gz"
READ1_OUT="trimmed_R1.fastq.gz"
READ2_OUT="trimmed_R2.fastq.gz"

# Define common Illumina TruSeq adapters (replace with actual adapters if known)
# These are the TruSeq adapter sequences usually seen at the 3' end of Read 1 and Read 2, respectively.
ADAPTER_FWD="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_REV="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Run cutadapt to trim 3' adapters from paired-end reads.
# Note: -a/-A remove 3' adapters only; 5' adapters would need -g (Read 1) or -G (Read 2).
# -a: 3' adapter sequence for Read 1
# -A: 3' adapter sequence for Read 2
# -m: Minimum length of reads after trimming (e.g., 18 bp, adjust as needed)
# --cores: Number of CPU cores to use for parallel processing
# -o: Output file for Read 1
# -p: Output file for Read 2
cutadapt \
    -a "${ADAPTER_FWD}" \
    -A "${ADAPTER_REV}" \
    -m 18 \
    --cores 4 \
    -o "${READ1_OUT}" \
    -p "${READ2_OUT}" \
    "${READ1_IN}" \
    "${READ2_IN}"


Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
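
To make the difference between the 3' (-a, -A) and 5' (-g) adapter options in the command above concrete, here is a toy sketch in plain bash. The function names and the example read are hypothetical, and matching is exact only; cutadapt itself aligns adapters with tolerance for mismatches and partial occurrences, so this is an illustration rather than a substitute.

```shell
#!/usr/bin/env bash
# Toy illustration of 3' versus 5' adapter trimming (exact matches only).

# Remove a 3' adapter: drop the adapter and everything after its first occurrence.
trim3() {
  local read="$1" adapter="$2"
  printf '%s\n' "${read%%"${adapter}"*}"
}

# Remove a 5' adapter: drop everything up to and including its first occurrence.
trim5() {
  local read="$1" adapter="$2"
  printf '%s\n' "${read#*"${adapter}"}"
}

# Succeed only if the read is at least min_len bases long (mirrors -m 18).
passes_min_length() {
  local read="$1" min_len="$2"
  (( ${#read} >= min_len ))
}

# Hypothetical read: a 20 nt insert followed by the start of the Read 1 adapter.
insert="ACGTACGTACGTACGTACGT"
read1="${insert}AGATCGGAAGAGCACACG"
trimmed="$(trim3 "${read1}" "AGATCGGAAGAGC")"
echo "${trimmed}"
if passes_min_length "${trimmed}" 18; then
  echo "kept"
fi
```

A read without the adapter passes through both functions unchanged, which matches cutadapt's default of keeping untrimmed reads.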

cutadapt (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# "quality-cutoff 6" in the recorded command is most likely cutadapt's --quality-cutoff option;
# the start of that command (program name and any leading options) appears to be truncated.
# The -a/-g/-A/-o/-p options that follow are all cutadapt options.
# Installation (example using conda):
# conda install -c bioconda cutadapt

# Define input and output paths
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics"

# Run cutadapt round 1: quality-trim, remove 5' (-g) and 3' (-a) adapters from Read 1 and
# 3' adapters (-A) from Read 2, and discard pairs in which a read is shorter than 18 nt (-m 18).
# The round 2 command additionally passes -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5;
# whether round 1 used the same leading options is not recorded here.
cutadapt --quality-cutoff 6 -m 18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}" > "${METRICS_FILE}"


Takes output from cutadapt round 1.

cutadapt v3.4 GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt=3.4

# Define input and output files
# INPUT_FASTQ: Output from the first round of cutadapt processing
INPUT_FASTQ="input_from_cutadapt_round1.fastq.gz"
OUTPUT_FASTQ="output_cutadapt_round2.fastq.gz"

# Execute cutadapt for quality trimming and minimum length filtering.
# This is a common second step after initial adapter trimming.
# -q 20: Trim low-quality ends from reads. The 3' end is trimmed until the quality score is at least 20.
# -m 18: Discard reads shorter than 18 bp after trimming.
cutadapt -q 20 -m 18 -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}"


Run to trim off the 3' adapters on read 2, to control for double ligation events.

cutadapt (Inferred with models/gemini-2.5-flash) v3.4 GitHub

$ Bash example

# Install cutadapt if not already available
# conda install -c bioconda cutadapt=3.4

# Define input and output files
INPUT_R2="sample_R2.fastq.gz"
OUTPUT_R2_TRIMMED="sample_R2_trimmed.fastq.gz"

# Define the 3' adapter sequence (placeholder)
# This is the reverse complement of the Illumina TruSeq universal adapter. The recorded
# command for this step instead passes a set of short 15-mer adapters with -A.
ADAPTER_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"

# Minimum length for reads after trimming
MIN_LENGTH=18

# Run cutadapt to trim the 3' adapter from Read 2
# -a: Specifies a 3' adapter to be removed from the 3' end of the read.
# -o: Output file for trimmed Read 2.
# --minimum-length: Discard reads shorter than this length after trimming.
# Note: --discard-untrimmed is deliberately not used. It would drop every read in which no
#       adapter is found, which includes most reads with a full-length insert.
# Trimming Read 2 on its own leaves Read 1 unfiltered, so the pairs can fall out of sync;
# for paired-end data, pass both files with -A/-p as in the recorded command.
cutadapt -a "${ADAPTER_R2}" \
         -o "${OUTPUT_R2_TRIMMED}" \
         --minimum-length "${MIN_LENGTH}" \
         "${INPUT_R2}"


Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
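
The twenty -A adapters in the command above look like overlapping 15-mers tiled, one base at a time, across two longer sequences that differ only in their first 7 nt. The sketch below regenerates that list in plain bash under this reading. The two full-length sequences (`FULL_A`, `FULL_B`) and the function name `tile_kmers` are reconstructions made for illustration, not values stated in the source.

```shell
#!/usr/bin/env bash
# Print every k-mer of a sequence, sliding one base at a time.
tile_kmers() {
  local seq="$1" k="$2" i
  for (( i = 0; i + k <= ${#seq}; i++ )); do
    printf '%s\n' "${seq:i:k}"
  done
}

# Assumed full-length sequences, reconstructed from the -A list:
# a 7 nt variable prefix followed by a shared 20 nt adapter stretch.
SHARED="AGATCGGAAGAGCGTCGTGT"
FULL_A="AACTTGT${SHARED}"
FULL_B="AGGACCA${SHARED}"

# Tile both sequences into 15-mers and drop duplicates while preserving order.
ADAPTERS=()
while IFS= read -r kmer; do
  ADAPTERS+=("${kmer}")
done < <(
  { tile_kmers "${FULL_A}" 15; tile_kmers "${FULL_B}" 15; } | awk '!seen[$0]++'
)

# Turn the list into repeated -A options that could be passed to cutadapt.
ADAPTER_ARGS=()
for a in "${ADAPTERS[@]}"; do
  ADAPTER_ARGS+=(-A "${a}")
done

echo "${#ADAPTERS[@]} distinct 15-mers"
printf '%s ' "${ADAPTER_ARGS[@]}"
echo
```

The generated set matches the adapters in the recorded command, although the order differs. Tiling an adapter this way lets an exact-ish short match catch the adapter wherever a read happens to run into it.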

cutadapt (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
```

Takes output from cutadapt round 2.

cutadapt v4.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
# Input files are the output from a previous cutadapt round (round 2)
INPUT_R1="cutadapt_round2_R1.fastq.gz"
INPUT_R2="cutadapt_round2_R2.fastq.gz"
OUTPUT_R1="cutadapt_round3_R1.fastq.gz"
OUTPUT_R2="cutadapt_round3_R2.fastq.gz"
REPORT_JSON="cutadapt_round3_report.json"

# Define adapter sequences (placeholders - replace with actual sequences specific to the library preparation)
# For eCLIP, these are typically Illumina adapters or custom adapters.
# Example TruSeq adapter expected at the 3' end of read 1
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# Example TruSeq adapter expected at the 3' end of read 2 (despite the variable name, it is passed to -A below)
ADAPTER_5PRIME="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Execute cutadapt for further trimming and quality filtering
# -a: 3' adapter sequence for read 1
# -A: 3' adapter sequence for read 2 (use -G instead to trim a 5' adapter from read 2)
# -q: Quality trimming. Trim low-quality ends from reads. With two values the format is 5' cutoff,3' cutoff
# --minimum-length: Discard reads shorter than this length after trimming
# --cores: Number of CPU cores to use for parallel processing
# --json: Write a JSON report with trimming statistics
# -o: Output file for read 1
# -p: Output file for read 2
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -A "${ADAPTER_5PRIME}" \
  -q 20,20 \
  --minimum-length 18 \
  --cores 4 \
  --json "${REPORT_JSON}" \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}"


Maps reads to a human-specific version of RepBase to remove repetitive elements; this helps control for spurious artifacts from rRNA (and other) repetitive reads.

bowtie2 (Inferred with models/gemini-2.5-flash) v2.x GitHub

$ Bash example

# Install bowtie2 if not already installed
# conda install -c bioconda bowtie2

# Define reference files and output directory
# REPETITIVE_ELEMENTS_FASTA is a placeholder for a FASTA file containing human repetitive sequences
# (e.g., rRNA, tRNA, and sequences from RepBase for the human genome, like hg38). 
# This file needs to be prepared beforehand, often by concatenating known repetitive sequences.
REPETITIVE_ELEMENTS_FASTA="path/to/human_repetitive_elements.fa" 
BOWTIE2_INDEX_PREFIX="human_repetitive_elements_idx"
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_UNMAPPED_FASTQ="reads_without_repetitive_elements.fastq.gz"
OUTPUT_MAPPED_SAM="repetitive_reads.sam"

# Step 1: Build Bowtie2 index for repetitive elements (run once)
# This command creates index files (.bt2) from the FASTA reference.
# bowtie2-build "${REPETITIVE_ELEMENTS_FASTA}" "${BOWTIE2_INDEX_PREFIX}"

# Step 2: Align reads to the repetitive elements index
# Reads that align to repetitive elements are considered artifacts and are discarded.
# --un-gz: writes reads that *do not* align to the index to a gzipped FASTQ file.
# -S: writes all alignments (including those to repetitive elements) to a SAM file.
# -U: specifies the input FASTQ file (unpaired reads).
# -p: number of threads to use.
bowtie2 -p 8 -x "${BOWTIE2_INDEX_PREFIX}" -U "${INPUT_FASTQ}" \
    --un-gz "${OUTPUT_UNMAPPED_FASTQ}" -S "${OUTPUT_MAPPED_SAM}"

# The file "${OUTPUT_UNMAPPED_FASTQ}" contains the reads with repetitive elements removed,
# which can then be used for subsequent genomic alignment.


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables for clarity
GENOME_DIR="/path/to/RepBase_human_database_file" # Reference: RepBase human repetitive element database
READS_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
READS_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam" # Prefix for STAR's auxiliary output files (e.g., logs, unmapped reads if not stdout)
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam" # Destination for the unsorted BAM output from stdout

# Execute STAR alignment
STAR --runMode alignReads \
     --runThreadN 16 \
     --genomeDir "${GENOME_DIR}" \
     --genomeLoad LoadAndRemove \
     --readFilesIn "${READS_R1}" "${READS_R2}" \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 30 \
     --outFilterMultimapScoreRange 1 \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMattributes All \
     --readFilesCommand zcat \
     --outStd BAM_Unsorted \
     --outSAMtype BAM Unsorted \
     --outFilterType BySJout \
     --outReadsUnmapped Fastx \
     --outFilterScoreMin 10 \
     --outSAMattrRGline ID:foo \
     --alignEndsType EndToEnd \
     > "${OUTPUT_BAM}"


Takes output from STAR rmRep.

samtools v1.19.2 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.19.2

# Note: "rmRep" most likely refers to repeat removal, i.e. the STAR mapping against RepBase above,
# not to duplicate removal. In the recorded commands, the reads left unmapped by that step
# (*.rep.bamUnmapped.out.mate1 and *.rep.bamUnmapped.out.mate2) are the input to genome mapping.
# The samtools markdup call below is a generic, inferred duplicate-removal example. It is not one of
# the recorded commands; the recorded pipeline removes PCR duplicates with barcode_collapse_pe.py.

# Assuming 'star_aligned.bam' is the BAM file output from STAR.
# If the STAR output is not coordinate-sorted, it must be sorted first:
# samtools sort -o star_aligned.sorted.bam star_aligned.bam
# For paired-end data, samtools markdup also expects the mate tags added by 'samtools fixmate -m',
# which is run on name-sorted input before coordinate sorting.

# Remove duplicates from the STAR aligned BAM file.
# The '-r' option removes duplicate reads instead of only flagging them.
# The input BAM file must be coordinate-sorted.
# Replace 'star_aligned.sorted.bam' with the actual path to your sorted STAR alignment file.
# Replace 'deduplicated_reads.bam' with your desired output file name.
samtools markdup -r star_aligned.sorted.bam deduplicated_reads.bam


Maps unique reads to the human genome.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
# Replace with actual paths and filenames
GENOME_DIR="/path/to/star_genome_index_GRCh38" # Directory containing STAR genome index for human GRCh38
READS_FILE="input_reads.fastq.gz" # Input FASTQ file (gzipped)
OUTPUT_PREFIX="mapped_reads" # Prefix for output files
NUM_THREADS=8 # Number of threads to use (adjust based on available resources)

# Reference Genome: Human GRCh38 (or a specific build like hg38) is assumed.
# The STAR genome index (GENOME_DIR) must be pre-built using the human genome FASTA and GTF files.
# Example command to build index (run once):
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "/path/to/GRCh38.primary_assembly.fa" \
#      --sjdbGTFfile "/path/to/gencode.v44.annotation.gtf" \
#      --runThreadN "${NUM_THREADS}"

# Maps unique reads to the human genome
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_FILE}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --runThreadN "${NUM_THREADS}"

# Output BAM file will be: ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Placeholder for STAR genome indices. Replace with the actual path to your reference genome.
GENOME_DIR="/path/to/STAR_genome_indices/GRCh38" 

# Input FASTQ files (mate1 and mate2)
READ_FILE_MATE1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
READ_FILE_MATE2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"

# Output file prefix for STAR's auxiliary files (Log.out, SJ.out.tab, etc.)
# Note: The original command uses a .bam suffix in the prefix, which is unusual but reproduced here.
STAR_OUTPUT_FILE_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

# The final aligned BAM file, captured from STAR's standard output
ALIGNED_BAM_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

# Run STAR alignment
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_MATE1}" "${READ_FILE_MATE2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${STAR_OUTPUT_FILE_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${ALIGNED_BAM_FILE}"


Takes output from STAR genome mapping.

STAR v2.7.3a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for reference genome and annotation files
# Replace with actual paths to your GRCh38 FASTA and GTF files
GENOME_FASTA="/path/to/GRCh38.fa"
GTF_FILE="/path/to/gencode.v38.annotation.gtf"
STAR_INDEX_DIR="/path/to/STAR_index/GRCh38"

# Placeholder for input FASTQ files
# Replace with actual paths to your R1 and R2 FASTQ files
INPUT_FASTQ_R1="input_R1.fastq.gz"
INPUT_FASTQ_R2="input_R2.fastq.gz"

# Placeholder for output prefix
OUTPUT_PREFIX="aligned_reads"

# Number of threads to use
NUM_THREADS=8

# Create STAR genome index (run once per genome version)
# This step is typically done before alignment and the index is reused.
# STAR --runThreadN ${NUM_THREADS} \
#      --runMode genomeGenerate \
#      --genomeDir ${STAR_INDEX_DIR} \
#      --genomeFastaFiles ${GENOME_FASTA} \
#      --sjdbGTFfile ${GTF_FILE} \
#      --sjdbOverhang 100 # Recommended for typical RNA-seq read lengths (e.g., 101bp)

# Perform genome mapping using STAR
# Parameters are commonly used for RNA-seq alignment, adapted from eCLIP workflows.
STAR --runThreadN ${NUM_THREADS} \
     --genomeDir ${STAR_INDEX_DIR} \
     --readFilesIn ${INPUT_FASTQ_R1} ${INPUT_FASTQ_R2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --limitBAMsortRAM 30000000000 # ~30GB, adjust based on available RAM


Custom random-mer-aware script for PCR duplicate removal.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

    # Install umi_tools if not already available
    # conda install -c bioconda umi-tools

    # This script performs PCR duplicate removal using umi_tools dedup,
    # which is designed to handle Unique Molecular Identifiers (UMIs) or random-mers.
    # It assumes UMIs have already been extracted and appended to read names
    # (e.g., by a prior 'umi_tools extract' step) and the input is a
    # coordinate-sorted BAM file.

    # Define input and output file paths
    INPUT_BAM="path/to/your/aligned_and_sorted.bam"
    OUTPUT_BAM="path/to/your/deduplicated.bam"
    LOG_FILE="umi_tools_dedup.log"

    # Execute umi_tools dedup
    # --stdin: Input BAM file
    # --stdout: Output BAM file with duplicates removed
    # --method directional: Recommended method for UMI-based deduplication
    # --umi-separator "_": Specifies the separator used when UMIs are in read names
    # --log: Log file for deduplication statistics
    # --paired: Use this flag if your data is paired-end (common for eCLIP)
    umi_tools dedup \
        --stdin "${INPUT_BAM}" \
        --stdout "${OUTPUT_BAM}" \
        --method directional \
        --umi-separator "_" \
        --log "${LOG_FILE}" \
        --paired


Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
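
The idea behind random-mer-aware duplicate removal can be sketched on a toy table: reads that share both their mapping position and their random-mer are treated as PCR copies of one molecule, while reads at the same position with different random-mers are kept. This is a simplified illustration in bash and awk with made-up values and a hypothetical function name; it does not reproduce the actual logic of barcode_collapse_pe.py, which works on paired-end BAM records.

```shell
#!/usr/bin/env bash
# Toy input: chrom, start, strand, random-mer (tab-separated). All values are made up.
reads="$(printf '%s\t%s\t%s\t%s\n' \
  chr1 100 + AACGT \
  chr1 100 + AACGT \
  chr1 100 + GGTCA \
  chr1 250 - AACGT \
  chr1 250 - AACGT)"

# Keep the first read for each (chrom, start, strand, random-mer) combination.
collapse_by_randomer() {
  awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}

deduped="$(printf '%s\n' "${reads}" | collapse_by_randomer)"
total="$(printf '%s\n' "${reads}" | wc -l | tr -d ' ')"
kept="$(printf '%s\n' "${deduped}" | wc -l | tr -d ' ')"
echo "kept ${kept} of ${total} reads"
```

Position-only deduplication would have kept two reads here; including the random-mer in the key retains the third read at chr1:100 as an independent molecule.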

Skipper (Inferred with models/gemini-2.5-flash) vlatest GitHub

$ Bash example

# Install Miniconda/Anaconda if not already installed
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# bash miniconda.sh -b -p $HOME/miniconda
# export PATH="$HOME/miniconda/bin:$PATH"
# conda init bash
# source ~/.bashrc

# Clone the Skipper repository
# git clone https://github.com/yeolab/skipper.git
# cd skipper

# Create and activate the conda environment using the provided environment.yaml
# conda env create -f environment.yaml
# conda activate skipper # Assuming the environment name is 'skipper' or derived from the directory

# Define input and output files
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"

# Execute the barcode collapse command
# The location of 'barcode_collapse_pe.py' inside the cloned repository is an assumption;
# locate the script in your checkout and adjust the path below if it differs.
python scripts/barcode_collapse_pe.py --bam "${INPUT_BAM}" --out_file "${OUTPUT_BAM}" --metrics_file "${METRICS_FILE}"


Takes output from barcode collapse PE.

STAR (Inferred with models/gemini-2.5-flash) v2.7.0f GitHub

$ Bash example

# Define variables (placeholders)
# Replace these with actual paths and values for your specific analysis.
# For human (Homo sapiens), hg38 (GRCh38) is the latest assembly.
GENOME_DIR="/path/to/STAR_index/hg38" # Example: /path/to/STAR_index/GRCh38_gencode_v38
GTF_FILE="/path/to/annotations/gencode.v38.annotation.gtf" # Example: /path/to/annotations/gencode.v38.annotation.gtf
READ1_FASTQ="collapsed_reads_R1.fastq.gz" # Input: R1 FASTQ (assumed; the recorded barcode collapse step writes a BAM file, not FASTQ)
READ2_FASTQ="collapsed_reads_R2.fastq.gz" # Input: R2 FASTQ (assumed; see the note on R1)
OUTPUT_PREFIX="aligned_reads_" # Prefix for output files (e.g., aligned_reads_Aligned.sortedByCoord.out.bam)
THREADS=8 # Number of threads to use for STAR
LIMIT_BAM_SORT_RAM=60000000000 # 60GB, adjust based on available RAM for sorting BAM

# Run STAR alignment
STAR \
  --runThreadN "${THREADS}" \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --sjdbGTFfile "${GTF_FILE}" \
  --limitBAMsortRAM "${LIMIT_BAM_SORT_RAM}"


Sorts resulting bam file for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.10 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.10

# Sort the BAM file
# Replace 'input.bam' with your unsorted BAM file
# Replace 'output.sorted.bam' with your desired output sorted BAM file name
samtools sort -o output.sorted.bam input.bam


Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Picard (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Define variables for input/output paths and resources
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
TMP_DIR="/full/path/to/files/.queue/tmp"
PICARD_JAR_PATH="/path/to/gatk/dist/Queue.jar" # Note: This path points to GATK's Queue.jar, which in some older GATK distributions might have bundled Picard classes.

# Create temporary directory if it doesn't exist
mkdir -p "${TMP_DIR}"

# Execute Picard SortSam command
java -Xmx2048m \
     -XX:+UseParallelOldGC \
     -XX:ParallelGCThreads=4 \
     -XX:GCTimeLimit=50 \
     -XX:GCHeapFreeLimit=10 \
     -Djava.io.tmpdir="${TMP_DIR}" \
     -cp "${PICARD_JAR_PATH}" \
     net.sf.picard.sam.SortSam \
     INPUT="${INPUT_BAM}" \
     TMP_DIR="${TMP_DIR}" \
     OUTPUT="${OUTPUT_BAM}" \
     VALIDATION_STRINGENCY=SILENT \
     SO=coordinate \
     CREATE_INDEX=true


Takes output from sortSam, makes bam index for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Assume 'sorted.bam' is the output from sortSam
# This command creates an index file named 'sorted.bam.bai' in the same directory
samtools index sorted.bam


Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
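
The recorded commands build every intermediate file name by appending suffixes to the read 1 FASTQ path. The helper below collects that chain in one place so the inputs and outputs of each step are easy to trace. It is a sketch: the function name `eclip_names` and the `key=value` labels are made up for illustration, the suffixes are copied from the commands above, and read 2 outputs (which derive from the read 2 path) are left out.

```shell
#!/usr/bin/env bash
# Print the intermediate file names that the recorded commands derive from a read 1 path.
eclip_names() {
  local r1="$1"
  local round2="${r1}.adapterTrim.round2"
  printf '%s\n' \
    "trim_round1=${r1}.adapterTrim.fastq.gz" \
    "trim_round2=${round2}.fastq.gz" \
    "repeat_bam=${round2}.rep.bam" \
    "unmapped_mate1=${round2}.rep.bamUnmapped.out.mate1" \
    "unmapped_mate2=${round2}.rep.bamUnmapped.out.mate2" \
    "genome_bam=${round2}.rmRep.bam" \
    "dedup_bam=${round2}.rmRep.rmDup.bam" \
    "sorted_bam=${round2}.rmRep.rmDup.sorted.bam" \
    "sorted_bai=${round2}.rmRep.rmDup.sorted.bam.bai"
}

eclip_names "/full/path/to/files/file_R1.C01.fastq.gz"
```

Note that the unmapped-mate names come from STAR appending `Unmapped.out.mate1` and `Unmapped.out.mate2` directly to the `--outFileNamePrefix` value, which in the repeat-mapping command ends in `.rep.bam`.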

samtools v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

# Define input and output files
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"

# Execute samtools index command
samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"


Takes inputs from multiple final bam files.

samtools (Inferred with models/gemini-2.5-flash) v1.19.2 GitHub

$ Bash example

# Install samtools if not available
# conda install -c bioconda samtools

# This command merges multiple sorted BAM files into a single sorted BAM file.
# Replace 'input_1.bam', 'input_2.bam', 'input_3.bam' with the actual paths to your input BAM files.
# Replace 'merged_output.bam' with the desired name for the merged output file.
# The '-o' flag is optional for samtools merge, as the output file is typically the first argument.
samtools merge merged_output.bam input_1.bam input_2.bam input_3.bam


Merges the two technical replicates for further downstream analysis.

samtools (Inferred with models/gemini-2.5-flash) v1.9 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Define input and output file paths
INPUT_REPLICATE_1="replicate1.bam"
INPUT_REPLICATE_2="replicate2.bam"
OUTPUT_MERGED_BAM="merged_replicates.bam"

# Merge the two technical replicate BAM files
samtools merge "${OUTPUT_MERGED_BAM}" "${INPUT_REPLICATE_1}" "${INPUT_REPLICATE_2}"


Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

samtools v1.19 GitHub

$ Bash example

# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

# Merge multiple sorted BAM files into a single merged BAM file
samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
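
The two inputs in the recorded command differ only in the barcode part of the file name (`C01` and `D08`). As a sketch under that assumption, the paths can be built from the barcodes and checked before merging; `replicate_bam` and `merge_replicates` are hypothetical helper names, and `FILES_DIR` is the same placeholder directory used above.

```bash
FILES_DIR="/full/path/to/files"   # placeholder directory from the recorded command
SUFFIX="fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"

# Build the sorted, deduplicated BAM path for one barcode (e.g. C01).
replicate_bam() {
    local barcode="$1"
    printf '%s/file_R1.%s.%s\n' "${FILES_DIR}" "${barcode}" "${SUFFIX}"
}

# Merge the BAMs for the given barcodes, refusing to run if any input
# is missing or empty.
merge_replicates() {
    local out="$1"; shift
    local inputs=() bc bam
    for bc in "$@"; do
        bam="$(replicate_bam "${bc}")"
        if [ ! -s "${bam}" ]; then
            echo "missing or empty input: ${bam}" >&2
            return 1
        fi
        inputs+=("${bam}")
    done
    samtools merge "${out}" "${inputs[@]}"
}

# Usage (commented out; requires samtools and the real input files):
# merge_replicates "${FILES_DIR}/CombinedID.merged.bam" C01 D08
```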


Takes output from sortSam, makes bam index for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.10

# Assume the input sorted BAM file is named 'sorted.bam'
# This command creates an index file 'sorted.bam.bai' in the same directory.
samtools index sorted.bam


Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

samtools (version not specified; Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# samtools is a widely used tool for manipulating SAM/BAM/CRAM files.
# Installation:
# conda install -c bioconda samtools
# or
# apt-get install samtools

samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai


Takes output from sortSam.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Define input and output file names
# INPUT_BAM is the output from the sortSam step
INPUT_BAM="sorted.bam"
OUTPUT_DEDUP_BAM="deduplicated.bam"

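# Remove duplicates from the coordinate-sorted BAM file
# Note: samtools markdup expects mate score tags added beforehand by 'samtools fixmate -m'
# -r: Remove duplicate reads (default is to just flag them)
# -s: Print basic statistics to stderr
samtools markdup -r -s "${INPUT_BAM}" "${OUTPUT_DEDUP_BAM}"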

# Index the deduplicated BAM file for quick access
samtools index "${OUTPUT_DEDUP_BAM}"
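
Note that the recorded pipeline removes PCR duplicates with the random-mer-aware `barcode_collapse_pe.py` script, not with `samtools markdup`. If `markdup` is used instead, the samtools documentation describes a collate, fixmate, sort, markdup sequence. The sketch below only prints those commands so they can be reviewed first; the function name and the intermediate file names are assumptions.

```bash
# Hypothetical dry-run helper: print the samtools commands that prepare a BAM
# for markdup and then remove duplicates. Nothing is executed.
print_markdup_workflow() {
    local in_bam="$1" prefix="$2"
    cat <<EOF
samtools collate -o ${prefix}.collated.bam ${in_bam}
samtools fixmate -m ${prefix}.collated.bam ${prefix}.fixmate.bam
samtools sort -o ${prefix}.possort.bam ${prefix}.fixmate.bam
samtools markdup -r -s ${prefix}.possort.bam ${prefix}.dedup.bam
EOF
}

print_markdup_workflow sorted.bam sample
```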


Only outputs the second read in each pair for use with single stranded peak caller.

reformat.sh (Inferred with models/gemini-2.5-flash) v38.90 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install BBTools (if not already installed)
# conda install -c bioconda bbmap

# Example usage:
# Assuming input paired-end FASTQ files are named 'sample_R1.fastq.gz' and 'sample_R2.fastq.gz'
# and the desired output file for the second read is 'sample_R2_only.fastq.gz'

# This command takes both R1 and R2 files as input, but explicitly outputs only the R2 reads
# to a new file, discarding R1.
reformat.sh in1=sample_R1.fastq.gz in2=sample_R2.fastq.gz out1=null out2=sample_R2_only.fastq.gz


This is the final bam file to perform analysis on.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# The description indicates a final BAM file ready for analysis.
# This typically implies the BAM file has been sorted and indexed.
# Samtools is the standard utility for these operations.

# Install samtools if not already available (e.g., via conda)
# conda install -c bioconda samtools

# Assuming 'input_unsorted.bam' is the BAM file after alignment and other processing steps
# and 'final.bam' is the desired output name for the sorted and indexed file.

# Sort the BAM file by coordinate
samtools sort -o final.bam input_unsorted.bam

# Index the sorted BAM file
samtools index final.bam
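
Before handing a "final" BAM to analysis, a quick sanity check can catch empty or truncated files. The following is a sketch, not part of the recorded pipeline: it checks the gzip magic bytes that every BGZF-compressed BAM starts with and, if samtools is installed, additionally runs `samtools quickcheck`. The function names are hypothetical.

```bash
# Succeed if the file starts with the gzip magic bytes 0x1f 0x8b,
# which BGZF-compressed BAM files share.
looks_like_bgzf() {
    local file="$1" magic
    magic="$(head -c 2 "${file}" | od -An -tx1 | tr -d ' \n')"
    [ "${magic}" = "1f8b" ]
}

# Combine the cheap checks with samtools quickcheck when it is available.
check_final_bam() {
    local bam="$1"
    [ -s "${bam}" ] || { echo "missing or empty: ${bam}" >&2; return 1; }
    looks_like_bgzf "${bam}" || { echo "not gzip/BGZF: ${bam}" >&2; return 1; }
    if command -v samtools >/dev/null 2>&1; then
        samtools quickcheck "${bam}" || return 1
    fi
    return 0
}

# Usage (commented out; requires a real BAM file):
# check_final_bam final.bam && echo "final.bam looks usable"
```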


Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

samtools v1.15 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Command: Filter BAM to keep only reads that are the second in a pair
# -h: Include header in the output
# -b: Output in BAM format
# -f 128: Only output reads with the FLAG 128 set (second in pair)
samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
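
The `-f 128` filter is a bitwise test on the SAM FLAG field: a record is kept only if bit 128 (second read in the pair) is set. The sketch below illustrates that arithmetic on a few common paired-end FLAG values; the function name `is_read2` is hypothetical.

```bash
# SAM FLAG bit for "second read in the pair" (from the SAM specification).
FLAG_READ2=128

# Succeed if the given FLAG value has the read-2 bit set,
# i.e. the record would pass 'samtools view -f 128'.
is_read2() {
    local flag="$1"
    (( (flag & FLAG_READ2) != 0 ))
}

# 99 and 83 are typical read-1 flags; 147 and 163 are typical read-2 flags.
for flag in 99 147 83 163; do
    if is_read2 "${flag}"; then
        echo "${flag}: kept by -f 128"
    else
        echo "${flag}: dropped by -f 128"
    fi
done
```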


Takes results from samtools view.

samtools v1.10 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.10

# Convert SAM to BAM
# This command takes a SAM file as input and outputs a compressed BAM file.
# -b: output BAM format
# -S: accepted for backward compatibility only; samtools 1.x detects the input format automatically
# -h: include header in output (BAM output always carries the header)
samtools view -bS -h input.sam > output.bam


Calls peaks on those files.

clipper (Inferred with models/gemini-2.5-flash; version not specified) GitHub

$ Bash example

# Install clipper (if not already installed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# pip install -r requirements.txt # if requirements.txt exists and has dependencies

# Define input and output files
INPUT_BAM="input.bam" # Placeholder for the input BAM file (read 2 only)
SPECIES="hg19" # Genome build; the recorded pipeline command uses hg19
OUTPUT_PREFIX="peaks"

# Execute clipper to call peaks
# Only flags that appear in the recorded pipeline command are used here;
# check `clipper --help` for the options of your installed version.
# Assumes the `clipper` entry point is on PATH.
# -b: Input BAM file
# -s: Species / genome build
# -o: Output BED file name
# --bonferroni: Apply Bonferroni correction to peak p-values
# --superlocal: Also use a local window around each peak for the significance cutoff
# --threshold-method binomial: Method used to set the read-count threshold
# --save-pickle: Also save the results as a Python pickle file
clipper \
    -b "${INPUT_BAM}" \
    -s "${SPECIES}" \
    -o "${OUTPUT_PREFIX}.bed" \
    --bonferroni \
    --superlocal \
    --threshold-method binomial \
    --save-pickle


Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

CLIPper v0.0.1 (Inferred from GitHub repository) GitHub

$ Bash example

# Install CLIPper (example using conda)
# conda create -n clipper_env python=3.8
# conda activate clipper_env
# conda install -c bioconda clipper

clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
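
Once peaks are called, a per-strand count is a quick way to confirm the BED file is populated on both strands. This is a sketch that assumes the output is tab-delimited BED with the strand in column 6; the function name `count_peaks_by_strand` is hypothetical.

```bash
# Count peaks on the plus and minus strands of a BED file.
# Assumption: tab-delimited BED with at least 6 columns, strand in column 6.
count_peaks_by_strand() {
    local bed="$1"
    awk -F '\t' '
        /^(track|browser|#)/ { next }      # skip header and comment lines
        NF >= 6 { n[$6]++ }                # tally by strand column
        END { printf "plus=%d minus=%d\n", n["+"], n["-"] }
    ' "${bed}"
}

# Usage (commented out; requires the real peaks file):
# count_peaks_by_strand /full/path/to/files/CombinedID.merged.r2.peaks.bed
```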


Tools Used

eCLIP STAR

Raw Source Text

Library strategy: eCLIP-Seq
Takes output from raw files.  Run to trim off both 5' and 3' adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
Takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding, txt format contains counts of reads for both IP and Input for each gene in subtranscriptomic region, bigWigs are read densities for positive and negative strand genome wide
