GSE179634 Processing Pipeline

RIP-Seq code_examples 32 steps

Publication

Splicing factor SRSF1 deficiency in the liver triggers NASH-like pathology and cell death.

Nature communications (2023) — PMID 36759613

Dataset

Splicing Factor SRSF1 Deficiency in the Liver Triggers NASH-like Pathology via R-Loop Induced DNA Damage and Cell Death

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps



Takes output from raw files.

N/A (Inferred with models/gemini-2.5-flash) vN/A

Run to trim off both 5' and 3' adapters on both reads.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=4.0

# Define input and output files
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"

# Define adapter sequences
# These are placeholders; the actual adapters should be determined from the library prep.
# ADAPTER_FWD is the Illumina TruSeq adapter commonly trimmed from Read 1
ADAPTER_FWD="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# ADAPTER_REV is the Illumina TruSeq adapter commonly trimmed from Read 2
ADAPTER_REV="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Run cutadapt to trim 3' adapters from both reads.
# -a (Read 1) and -A (Read 2) define regular 3' adapters: the adapter and
# everything after it is removed from the read. They do not trim 5' adapters;
# those need -g (Read 1) or -G (Read 2), as in the recorded command for this
# step, which passes two -g sequences.
#
# Optional common parameters (not included in the core command as not specified in description):
# -j <threads>: Number of CPU threads to use.
# -m <min_len>: Discard reads shorter than <min_len> after trimming.
# -q <qual_trim>: Trim low-quality bases from 3' end.
cutadapt -a "${ADAPTER_FWD}" \
         -A "${ADAPTER_REV}" \
         -o "${OUTPUT_R1}" \
         -p "${OUTPUT_R2}" \
         "${INPUT_R1}" \
         "${INPUT_R2}"

Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

eclip vN/A (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# The recorded command above starts at "quality-cutoff 6", so its leading part appears to be truncated.
# Assumption: the tool is cutadapt and the flag is --quality-cutoff, as in the round 2 command of this pipeline.
# Any other leading options that were lost are not reconstructed here.
# conda install -c bioconda cutadapt

# Define input and output paths
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics"

# Note: the recorded command contains '-A CTTGT AGATCGGAAG'.
# The round 2 command lists the same adapter as CTTGTAGATCGGAAG, so the space is
# treated as a typo here and the two pieces are joined into one -A value.
cutadapt \
  --quality-cutoff 6 \
  -m 18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}" > "${METRICS_FILE}"


Takes output from cutadapt round 1.

cutadapt v2.10 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=2.10

# Define input and output files
# INPUT_FASTQ is the output from a previous cutadapt round 1 (e.g., 3' adapter trimming)
INPUT_FASTQ="round1_trimmed.fastq.gz"
OUTPUT_FASTQ="round2_trimmed.fastq.gz"

# Define parameters for cutadapt round 2 (e.g., 5' adapter trimming and quality filtering)
# Replace ADAPTER_5PRIME_SEQUENCE with the actual 5' adapter sequence for your assay.
# The value below is only a stand-in: it is the TruSeq Read 1 adapter, which normally occurs at the 3' end.
ADAPTER_5PRIME_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Stand-in value, replace with actual
QUALITY_CUTOFF="20,20" # Trim low-quality bases from both ends (e.g., Phred score < 20)
MINIMUM_LENGTH="15" # Discard reads shorter than 15 bp after trimming
NUM_THREADS=$(nproc) # Use all available CPU cores

cutadapt \
  -g "${ADAPTER_5PRIME_SEQUENCE}" \
  -q "${QUALITY_CUTOFF}" \
  --minimum-length "${MINIMUM_LENGTH}" \
  --cores "${NUM_THREADS}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"


Run to trim off the 3' adapters on read 2, to control for double ligation events.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=4.0

# Define input and output file paths
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"

# Define the 3' adapter sequence for Read 2.
# Placeholder: this is the stretch covered by the last 15-nt -A fragments in the recorded
# round 2 command (AGATCGGAAGAGCGT through GGAAGAGCGTCGTGT); verify it against the library prep.
# This adapter is trimmed to control for double ligation events.
ADAPTER_R2="AGATCGGAAGAGCGTCGTGT"

# Run cutadapt to trim the 3' adapter from Read 2.
# -A: Specifies the 3' adapter sequence for Read 2.
# -o: Output file for Read 1 (untrimmed in this specific step, but paired with R2 output).
# -p: Output file for Read 2 (trimmed).
# --minimum-length: Discard reads shorter than this length after trimming (e.g., 18bp is common in eCLIP).
# -j: Number of CPU threads to use for parallel processing (e.g., 8).
cutadapt -A "${ADAPTER_R2}" \
         -o "${OUTPUT_R1}" \
         -p "${OUTPUT_R2}" \
         --minimum-length 18 \
         -j 8 \
         "${INPUT_R1}" "${INPUT_R2}"


Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

cutadapt v1.18 GitHub
$ Bash example
```
# conda install -c bioconda cutadapt
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 \
  --quality-cutoff 6 -m 18 \
  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT \
  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz \
  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
  > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
```
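The 20 `-A` sequences in the round 2 command look like every 15-nt window of two 27-nt sequences that differ only in their first 7 bases. The sketch below regenerates the list under that assumption. The two full-length sequences are reconstructed from the fragments themselves, not taken from the source, and `tile_kmers` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Print every k-nt window of a sequence, one per line.
tile_kmers() {
  local seq="$1" k="$2" i
  for ((i = 0; i + k <= ${#seq}; i++)); do
    printf '%s\n' "${seq:i:k}"
  done
}

# Assumption: full-length sequences inferred from the 15-mers in the recorded command.
FULL_SEQS=("AACTTGTAGATCGGAAGAGCGTCGTGT" "AGGACCAAGATCGGAAGAGCGTCGTGT")

# Collect each distinct 15-mer once and turn it into a cutadapt -A argument.
ADAPTER_ARGS=()
SEEN=" "
for seq in "${FULL_SEQS[@]}"; do
  while read -r kmer; do
    case "$SEEN" in
      *" $kmer "*) ;;  # already collected from the other sequence
      *) SEEN+="$kmer "; ADAPTER_ARGS+=(-A "$kmer") ;;
    esac
  done < <(tile_kmers "$seq" 15)
done

echo "$(( ${#ADAPTER_ARGS[@]} / 2 )) distinct 15-mers"
printf '%s ' "${ADAPTER_ARGS[@]}"; echo
```

The array can be spliced into a call as `cutadapt "${ADAPTER_ARGS[@]}" ...`. The set of fragments matches the recorded command, but the order differs, because the recorded command interleaves the two series.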

Takes output from cutadapt round 2.

cutadapt v4.0 GitHub

$ Bash example

# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

# Define input and output files
# INPUT_FASTQ represents the output from a previous cutadapt round (round 1).
INPUT_FASTQ="input_from_cutadapt_round1.fastq.gz"
OUTPUT_FASTQ="output_cutadapt_round2.fastq.gz"
REPORT_FILE="cutadapt_round2_report.txt"

# Define adapter sequences and trimming parameters for round 2.
# These are placeholders; actual values depend on the specific eCLIP library preparation
# and what was trimmed in round 1. Round 2 might focus on secondary adapters,
# more stringent quality trimming, or length filtering.
# Example 3' adapter (e.g., Illumina universal or specific RT primer adapter).
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# ADAPTER_5PRIME="GTTCAGAGTTCTACAGTCCGACGATC" # Uncomment and set if a 5' adapter needs trimming
QUALITY_CUTOFF=20 # Phred quality score cutoff
MIN_LENGTH=18     # Minimum read length after trimming
CORES=4           # Number of CPU cores to use for parallel processing

# Execute cutadapt for round 2 trimming.
# This command assumes single-end reads. For paired-end reads, use -A and -G for the reverse read.
# --discard-untrimmed drops every read in which no adapter was found. It does not appear in the
# recorded commands of this pipeline, so remove it unless that filtering is intended.
cutadapt \
    -a "${ADAPTER_3PRIME}" \
    --quality-cutoff="${QUALITY_CUTOFF}" \
    --minimum-length="${MIN_LENGTH}" \
    --discard-untrimmed \
    --cores="${CORES}" \
    -o "${OUTPUT_FASTQ}" \
    "${INPUT_FASTQ}" \
    > "${REPORT_FILE}" 2>&1

# Note: For paired-end reads, the command would be more complex, e.g.:
# cutadapt \
#     -a "${ADAPTER_3PRIME_R1}" \
#     -A "${ADAPTER_3PRIME_R2}" \
#     --quality-cutoff="${QUALITY_CUTOFF}" \
#     --minimum-length="${MIN_LENGTH}" \
#     --discard-untrimmed \
#     --cores="${CORES}" \
#     -o "${OUTPUT_FASTQ_R1}" \
#     -p "${OUTPUT_FASTQ_R2}" \
#     "${INPUT_FASTQ_R1}" \
#     "${INPUT_FASTQ_R2}" \
#     > "${REPORT_FILE}" 2>&1


Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA (and other) repetitive reads.

bbduk (Inferred with models/gemini-2.5-flash) vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install BBMap suite if not available
# conda install -c bioconda bbmap

# Placeholder for human specific RepBase repeats FASTA file.
# This file would typically be generated by extracting human repetitive elements from RepBase
# or by using a pre-compiled contaminant file that includes common human repeats (e.g., rRNA, tRNAs, SINEs, LINEs).
# For example, a file like 'human_repbase_repeats.fa' would contain sequences of known human repetitive elements.
HUMAN_REPBASE_FASTA="/path/to/human_repbase_repeats.fa"

# Input FASTQ file (e.g., raw reads from eCLIP)
INPUT_FASTQ="input_reads.fastq.gz"

# Output FASTQ file containing reads with repetitive elements removed
OUTPUT_FASTQ="filtered_non_repetitive_reads.fastq.gz"

# Remove repetitive reads by k-mer matching against the human RepBase repeats FASTA.
# Note: the recorded command for this step uses STAR, not bbduk; this block is an inferred alternative.
# 'k=31' specifies a k-mer size of 31, common for contaminant filtering.
# 'hdist=1' allows 1 mismatch per k-mer (Hamming distance).
# 'stats=repbase_filter_stats.txt' will output statistics on the reads removed.
# '-Xmx4g' allocates 4GB of memory, adjust as needed based on input file size and system resources.
bbduk.sh in="$INPUT_FASTQ" \
         out="$OUTPUT_FASTQ" \
         ref="$HUMAN_REPBASE_FASTA" \
         k=31 hdist=1 \
         stats="repbase_filter_stats.txt" \
         -Xmx4g


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables for clarity
GENOME_DIR="/path/to/RepBase_human_database_file"
READ_FILE_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
READ_FILE_2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
FINAL_OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# Execute STAR alignment
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_1}" "${READ_FILE_2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${FINAL_OUTPUT_BAM}"

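The repeat-mapping command passes a prefix that already ends in `.rep.bam`, and the later genome-mapping command reads files named `...rep.bamUnmapped.out.mate1` and `...mate2`. That follows from STAR appending its fixed output names directly to `--outFileNamePrefix` when `--outReadsUnmapped Fastx` is set. A minimal sketch of the naming, where `star_unmapped_paths` is a hypothetical helper:

```shell
#!/usr/bin/env bash
# STAR concatenates its output names onto the prefix without adding a separator.
star_unmapped_paths() {
  local prefix="$1"
  printf '%s\n' "${prefix}Unmapped.out.mate1" "${prefix}Unmapped.out.mate2"
}

REP_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# These two paths are the --readFilesIn arguments of the genome-mapping step.
star_unmapped_paths "${REP_PREFIX}"
```

These unmapped-read files are written as plain text, which would explain why the genome-mapping command has no `--readFilesCommand zcat`.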

Takes output from STAR rmRep.

STAR v1.10 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.10

# Input BAM file (e.g., aligned_reads.bam)
# samtools markdup expects coordinate-sorted input that has already been passed through
# 'samtools fixmate -m', which adds the mate tags markdup relies on.
INPUT_BAM="aligned_reads.bam"
OUTPUT_BAM="aligned_reads.markdup.bam"
METRICS_FILE="markdup_metrics.txt"

# Remove PCR duplicates from the aligned BAM file
# -r: Remove duplicates (rather than just marking them)
# -s: Print statistics to stderr (captured in a file here with 2>)
samtools markdup -r -s "$INPUT_BAM" "$OUTPUT_BAM" 2> "$METRICS_FILE"


Maps unique reads to the mouse genome.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Placeholder for STAR genome index directory.
# The mouse genome (e.g., mm10/GRCm38) STAR index needs to be pre-built or downloaded.
# Example command to build index (run once, replace paths):
# STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /path/to/STAR_index/mm10 \
#      --genomeFastaFiles /path/to/mouse_genome.fa --sjdbGTFfile /path/to/mouse_annotations.gtf \
#      --sjdbOverhang 100 # Adjust sjdbOverhang based on read length - 1

# Align unique reads to the mouse genome
# Input: reads.fastq.gz (replace with your actual input FASTQ file)
# Output: aligned_Aligned.sortedByCoord.out.bam
STAR --genomeDir /path/to/STAR_index/mm10 \
     --readFilesIn reads.fastq.gz \
     --outFileNamePrefix aligned_ \
     --outSAMtype BAM SortedByCoordinate \
     --runThreadN 8 \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 10 \
     --outFilterScoreMinOverLread 0.66 \
     --outFilterMatchNminOverLread 0.66


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

STAR vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install STAR (example using conda):
# conda install -c bioconda star

# Define variables for paths
GENOME_DIR="/path/to/STAR_database_file" # Genome index path from the recorded command; this step maps to the mouse genome
READ_FILE_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
READ_FILE_2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

# Execute STAR alignment command
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_1}" "${READ_FILE_2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd > "${OUTPUT_BAM}"


Takes output from STAR genome mapping.

STAR v2.7.10a (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (example using conda)
# conda create -n star_env star=2.7.10a samtools -c bioconda -c conda-forge
# conda activate star_env

# --- Reference Data Setup (mouse genome, per the pipeline description) ---
# This step assumes you have already built a STAR genome index.
# If not, you would typically run:
# STAR --runThreadN <num_threads> --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index_mouse \
#      --genomeFastaFiles /path/to/mouse_genome.fa \
#      --sjdbGTFfile /path/to/mouse_annotations.gtf \
#      --sjdbOverhang 100 # (or read length - 1)

# --- Define variables ---
GENOME_DIR="/path/to/STAR_index_mouse" # Placeholder for STAR genome index directory (mouse)
READ1="sample_R1.fastq.gz"           # Placeholder for input FASTQ file (Read 1)
READ2="sample_R2.fastq.gz"           # Placeholder for input FASTQ file (Read 2, remove if single-end)
OUTPUT_PREFIX="sample_"              # Prefix for output files
THREADS=8                            # Number of threads to use

# --- Run STAR alignment ---
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --quantMode GeneCounts \
     --outFilterType BySJout \
     --outFilterMultimapNmax 20 \
     --alignSJDBoverhangMin 1 \
     --alignSJoverhangMin 8 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000

# --- Index the output BAM file ---
# STAR names coordinate-sorted output <prefix>Aligned.sortedByCoord.out.bam
samtools index "${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam"


Custom random-mer-aware script for PCR duplicate removal.

dedup_umi.py (Inferred with models/gemini-2.5-flash) vPart of yeolab/eclip workflow

$ Bash example

# Hypothetical sketch: the script name dedup_umi.py and the -i/-o/-l flags below were inferred
# and are not confirmed by the source. The recorded command for this step uses
# barcode_collapse_pe.py with --bam, --out_file and --metrics_file.
# pysam may be required:
# pip install pysam

# Define paths and parameters
# Replace with the actual path to the script
SCRIPT_PATH="/path/to/yeolab/eclip/scripts/dedup_umi.py"

INPUT_BAM="aligned_reads_with_umis.bam" # Input BAM file containing UMI-tagged reads
OUTPUT_BAM="deduplicated_reads.bam"     # Output BAM file with PCR duplicates removed
UMI_LENGTH=6                            # Length of the random-mer (UMI) in base pairs. Common for eCLIP.

# Execute the custom random-mer-aware PCR duplicate removal script
python "${SCRIPT_PATH}" -i "${INPUT_BAM}" -o "${OUTPUT_BAM}" -l "${UMI_LENGTH}"

Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

barcode_collapse_pe.py (Inferred with models/gemini-2.5-flash) v1.2 (from yeolab/eclip pipeline) GitHub

$ Bash example

# Install Miniconda or Anaconda if not already installed
# wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
# bash miniconda.sh -b -p $HOME/miniconda
# export PATH="$HOME/miniconda/bin:$PATH"
# conda init bash
# source ~/.bashrc

# Create and activate a conda environment for eCLIP tools (requires Python 2.7 and pysam)
# conda create -n eclip_env python=2.7 pysam=0.10.0 -y
# conda activate eclip_env

# Clone the eclip repository to get the script
# git clone https://github.com/yeolab/eclip.git
# cd eclip/src

# Execute the barcode collapse command
# Ensure you are in the directory containing barcode_collapse_pe.py or it's in your PATH
python barcode_collapse_pe.py \
    --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
    --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
    --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

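The idea behind random-mer-aware duplicate removal can be shown on toy data: two reads count as PCR duplicates only when they share both their mapping position and their random-mer. The sketch below is not the logic of `barcode_collapse_pe.py`, which works on paired-end BAM records and whose exact key the source does not give. The tab-separated records and the `collapse_by_randomer` helper are hypothetical.

```shell
#!/usr/bin/env bash
# Keep the first record seen for each (chrom, start, strand, random-mer) key.
collapse_by_randomer() {
  awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}

# Hypothetical toy records: chrom, start, strand, random-mer.
TOY_RECORDS=$'chr1\t100\t+\tACGTA\nchr1\t100\t+\tACGTA\nchr1\t100\t+\tTTGCA\nchr2\t500\t-\tACGTA'

# The second record repeats the first (same position, same random-mer) and is dropped.
# The third shares the position but carries a different random-mer, so it is kept.
printf '%s\n' "${TOY_RECORDS}" | collapse_by_randomer
```

A position-only approach would also have collapsed the third record, which is the over-collapsing that the random-mer is meant to prevent.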

Takes output from barcode collapse PE.

STAR (Inferred with models/gemini-2.5-flash) v2.7.0f

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

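# Define variables (replace with actual paths and filenames)
# Note: the recorded barcode_collapse_pe.py command writes a BAM (--out_file ...rmDup.bam), and the
# next recorded step sorts that BAM. The FASTQ inputs and re-alignment below are an inferred, hypothetical example.
GENOME_DIR="/path/to/STAR_index/mouse" # Placeholder for a mouse genome index
READ1_FASTQ="collapsed_R1.fastq.gz" # Hypothetical Read 1 FASTQ
READ2_FASTQ="collapsed_R2.fastq.gz" # Hypothetical Read 2 FASTQ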
OUTPUT_PREFIX="aligned_sample_prefix_"
THREADS=8 # Number of threads to use

# Run STAR alignment
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
  --runThreadN "${THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --quantMode GeneCounts \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNoverLmax 0.04 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --limitBAMsortRAM 30000000000 # 30GB RAM for sorting, adjust as needed

Sorts resulting bam file for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.10

# Define input and output file names
INPUT_BAM="input.bam"
OUTPUT_SORTED_BAM="output_sorted.bam"

# Sort the BAM file by coordinate
# The -o flag specifies the output file.
samtools sort -o "${OUTPUT_SORTED_BAM}" "${INPUT_BAM}"


Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Picard vNot specified GitHub

$ Bash example

# Picard tools are typically run via Java. You can download the latest Picard JAR from the Broad Institute GitHub releases.
# For example, using conda:
# conda install -c bioconda picard

java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 \
  -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 \
  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp \
  -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam \
  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
  TMP_DIR=/full/path/to/files/.queue/tmp \
  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  VALIDATION_STRINGENCY=SILENT \
  SO=coordinate \
  CREATE_INDEX=true


Takes output from sortSam, makes bam index for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.15.1 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.15.1

# Define input BAM file (output from sortSam)
INPUT_BAM="sorted.bam" # Placeholder for the sorted BAM file

# Create BAM index for downstream use
samtools index "${INPUT_BAM}"


Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.19

# Create an index for the sorted BAM file
samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai


Takes inputs from multiple final bam files.

samtools (Inferred with models/gemini-2.5-flash) v1.19.2 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Example: Merge multiple final BAM files into a single BAM file.
# This step takes multiple input BAM files (e.g., from technical replicates or different lanes)
# and combines them into one consolidated BAM file for downstream analysis.
# Replace input1.bam, input2.bam, etc., with your actual input BAM file paths.
# Replace merged_output.bam with your desired output merged BAM file name.
# -@ specifies the number of threads to use.
samtools merge -@ 4 merged_output.bam input1.bam input2.bam input3.bam


Merges the two technical replicates for further downstream analysis.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.19

# Define input and output file paths
INPUT_REPLICATE1_BAM="replicate1.bam"
INPUT_REPLICATE2_BAM="replicate2.bam"
OUTPUT_MERGED_BAM="merged_replicates.bam"

# Merge the two technical replicates (BAM files)
samtools merge "${OUTPUT_MERGED_BAM}" "${INPUT_REPLICATE1_BAM}" "${INPUT_REPLICATE2_BAM}"


Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

samtools vInfer from description (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools (e.g., using conda)
# conda install -c bioconda samtools

# Define input and output files
INPUT_BAM_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
INPUT_BAM_2="/full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
OUTPUT_MERGED_BAM="/full/path/to/files/CombinedID.merged.bam"

# Execute samtools merge command
samtools merge "${OUTPUT_MERGED_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}"

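The two merge inputs differ only in the ID embedded in the file name (`C01` and `D08`), so the input list can be derived from the IDs instead of being typed out. The sketch below builds the merge command as an array and prints it as a dry run. The variable names are illustrative, and the paths are the placeholders from the recorded command.

```shell
#!/usr/bin/env bash
BASE_DIR="/full/path/to/files"
REPLICATE_IDS=(C01 D08)  # IDs that appear in the recorded merge command
SUFFIX="fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"

# Build one sorted-BAM path per ID.
INPUT_BAMS=()
for id in "${REPLICATE_IDS[@]}"; do
  INPUT_BAMS+=("${BASE_DIR}/file_R1.${id}.${SUFFIX}")
done

# Assemble the command as an array so paths with spaces stay intact.
MERGE_CMD=(samtools merge "${BASE_DIR}/CombinedID.merged.bam" "${INPUT_BAMS[@]}")

# Dry run: print the command instead of executing it.
printf '%s\n' "${MERGE_CMD[*]}"
```

Once the input files exist, the same array can be executed with `"${MERGE_CMD[@]}"`.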

Takes output from sortSam, makes bam index for use downstream.

samtools index (Inferred with models/gemini-2.5-flash) v1.19.1 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam
```
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

samtools vNot specified (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
# Install samtools if not already installed
# conda install -c bioconda samtools

samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
```

Takes output from sortSam.

samtools (Inferred with models/gemini-2.5-flash) v1.19.1 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.19.1

# This step takes a sorted BAM file (output from sortSam) and creates an index (.bai) file.
# The index file is crucial for efficient random access to reads within the BAM file,
# enabling many downstream tools to function correctly and quickly.
# Replace 'sorted_input.bam' with the actual path to your sorted BAM file.
samtools index sorted_input.bam


Only outputs the second read in each pair for use with single stranded peak caller.

reformat.sh (BBMap) (Inferred with models/gemini-2.5-flash) v38.90 GitHub

$ Bash example

# Install BBMap (part of BBTools)
# conda install -c bioconda bbmap

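# Inferred alternative that works on FASTQ input; the recorded command for this step
# instead filters the merged BAM with 'samtools view -f 128'.
# This command takes paired-end FASTQ files (input_R1.fastq.gz and input_R2.fastq.gz)
# and writes only the second read (R2) to a new file (output_R2_only.fastq.gz).
# Assumption: out1=null is accepted as a discard target; check the reformat.sh help text before use.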
reformat.sh in1=input_R1.fastq.gz in2=input_R2.fastq.gz out1=null out2=output_R2_only.fastq.gz


This is the final bam file to perform analysis on.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Assume 'input.bam' is an aligned BAM file that needs to be finalized.
# Sort the BAM file by coordinate, which is often a prerequisite for downstream analysis.
samtools sort -o final.bam input.bam

# Index the sorted BAM file, which is necessary for quick access and visualization.
samtools index final.bam


Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

samtools v1.9 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.9

# Define input and output file paths
INPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
OUTPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam"

# Extract reads that are the second in a pair (flag 128)
# -h: Include header in the output
# -b: Output in BAM format
# -f 128: Only output reads with flag 128 set (second in pair)
samtools view -hb -f 128 "${INPUT_BAM}" > "${OUTPUT_BAM}"

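The `-f 128` filter keeps an alignment when bit 0x80 of its SAM FLAG is set, which marks the second read of a pair. The sketch below checks that bit for four FLAG values that are common for properly paired reads. `is_read2` is a hypothetical helper, and the FLAG values are illustrative rather than taken from this dataset.

```shell
#!/usr/bin/env bash
# Succeed when the FLAG has bit 0x80 (decimal 128, second read in pair) set.
is_read2() {
  local flag="$1"
  (( (flag & 128) != 0 ))
}

for flag in 83 163 99 147; do
  if is_read2 "$flag"; then
    echo "$flag: read 2, kept by -f 128"
  else
    echo "$flag: read 1, dropped by -f 128"
  fi
done
```

Here 163 and 147 contain the 128 bit and are kept, while 83 and 99 contain the 64 bit (first in pair) and are dropped.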

Takes results from samtools view.

samtools v1.9 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.9

# Convert SAM (Sequence Alignment/Map) format to BAM (Binary Alignment/Map) format.
# This is a common initial step after alignment to reduce file size and enable faster processing.
# Input: aligned_reads.sam (e.g., output from an aligner like STAR or HISAT2)
# Output: aligned_reads.bam
# Parameters:
#   -b: Output in BAM format.
#   -S: Ignored by samtools 1.x, which auto-detects the input format; kept only for compatibility.
samtools view -bS aligned_reads.sam > aligned_reads.bam

Calls peaks on those files.

clipper (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Clone the clipper repository if not already available
# git clone https://github.com/yeolab/clipper.git
# cd clipper

# Ensure Python and required libraries (e.g., pysam) are installed
# conda install -c bioconda pysam

# Define the input file, species, and output file
# Replace with the actual path to your IP BAM file
IP_BAM="path/to/your/ip.bam"
SPECIES="mm9" # The source pipeline uses the mouse mm9 build
OUTPUT_BED="eclip_peaks.bed"

# Execute clipper to call peaks.
# The documented command passes a single BAM via -b and no separate control BAM.
clipper -b "${IP_BAM}" -s "${SPECIES}" -o "${OUTPUT_BED}"

Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s mm9 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

CLIPper v0.0.1 GitHub

$ Bash example

# Install CLIPper
# conda install -c bioconda clipper

# Execute CLIPper
clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s mm9 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
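
Because the input BAM and output BED share a path prefix, the command can be assembled from that prefix and the species name and printed for review before it is run. A minimal dry-run sketch; the function name print_clipper_cmd and its variables are hypothetical, and the flags are copied from the documented command rather than from CLIPper's help text.

```shell
# Hypothetical dry-run helper: print the CLIPper command for a given
# path prefix and species without executing it.
print_clipper_cmd() {
    clipper_prefix="$1"   # e.g. /full/path/to/files/CombinedID.merged.r2
    clipper_species="$2"  # must match the genome build used for mapping
    # Reuse the positional parameters as the argument list, then join with spaces.
    set -- clipper -b "${clipper_prefix}.bam" -s "${clipper_species}" \
        -o "${clipper_prefix}.peaks.bed" \
        --bonferroni --superlocal --threshold-method binomial --save-pickle
    printf '%s\n' "$*"
}

print_clipper_cmd "/full/path/to/files/CombinedID.merged.r2" mm9
```

The printed line can be compared against the documented command before running it on real data.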

Tools Used

eCLIP STAR

Raw Source Text

Takes output from raw files.  Run to trim off both 5' and 3' adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the mouse genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s mm9  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: mm9
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding
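
The per-replicate file names in the commands above form a suffix chain built on the read-1 FASTQ path, which makes it possible to tell from a file name which steps have already run. A minimal sketch that reproduces the chain; the variable names are hypothetical, while the path and suffixes are copied from the commands above.

```shell
# Derive each step's output name from the read-1 FASTQ path by appending
# the suffixes that appear in the raw source commands.
r1="/full/path/to/files/file_R1.C01.fastq.gz"

trim1="${r1}.adapterTrim.fastq.gz"                           # cutadapt round 1
trim2="${r1}.adapterTrim.round2.fastq.gz"                    # cutadapt round 2
rep_bam="${r1}.adapterTrim.round2.rep.bam"                   # STAR vs. RepBase
genome_bam="${r1}.adapterTrim.round2.rmRep.bam"              # STAR vs. genome
dedup_bam="${r1}.adapterTrim.round2.rmRep.rmDup.bam"         # barcode collapse
sorted_bam="${r1}.adapterTrim.round2.rmRep.rmDup.sorted.bam" # SortSam

printf '%s\n' "$trim1" "$trim2" "$rep_bam" "$genome_bam" "$dedup_bam" "$sorted_bam"
```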