GSE179634 Processing Pipeline
Publication
Splicing factor SRSF1 deficiency in the liver triggers NASH-like pathology and cell death. Nature Communications (2023), PMID 36759613
Dataset
GSE179634: Splicing Factor SRSF1 Deficiency in the Liver Triggers NASH-like Pathology via R-Loop Induced DNA Damage and Cell Death
Processing Steps
1
Takes output from raw files.
-
2
Run to trim off both 5' and 3' adapters on both reads.
cutadapt v4.0 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install cutadapt if not already installed
# conda install -c bioconda cutadapt=4.0

# Define input and output files (placeholders)
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"

# Placeholder Illumina TruSeq adapter sequences as they appear at the 3' end of
# Read 1 and Read 2. The adapters actually used are listed in the command of step 3.
ADAPTER_FWD="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_REV="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# -a (Read 1) and -A (Read 2) remove a 3' adapter together with everything after it.
# 5' adapters need -g (Read 1) or -G (Read 2); the source command in step 3 passes two -g sequences.
#
# Optional parameters, not used here:
# -j <threads>: number of CPU threads
# -m <min_len>: discard reads shorter than <min_len> after trimming
# -q <qual_trim>: trim low-quality bases from the 3' end
cutadapt -a "${ADAPTER_FWD}" \
  -A "${ADAPTER_REV}" \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}"
-
3
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
$ Bash example
# The source command is truncated at its start ("quality-cutoff 6 ..."). Its options match the
# round-2 cutadapt call in step 6, so it is assumed here to be a cutadapt invocation;
# any options that preceded --quality-cutoff are not given in the source and are left out.
# conda install -c bioconda cutadapt

# Define input and output paths
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics"

# The raw source text prints one Read 2 adapter as "CTTGT AGATCGGAAG". It is treated here as the
# single 15-mer CTTGTAGATCGGAAG, which is how the round-2 command in step 6 lists it.
cutadapt \
  --quality-cutoff 6 \
  -m 18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" \
  "${INPUT_R2}" > "${METRICS_FILE}"
-
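The twenty Read 2 adapters in the command above look like every 15-nt window of two longer sequences that differ only in their first seven bases. The following is a minimal sketch under that assumption: the two full-length sequences were reconstructed by overlapping the 15-mers, and the helper name `tile_adapters` is hypothetical; neither comes from the source.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print every k-nt window of the given sequences, skipping repeats.
tile_adapters() {
  local k=$1; shift
  local seen=" " seq i win
  for seq in "$@"; do
    for (( i = 0; i + k <= ${#seq}; i++ )); do
      win=${seq:i:k}
      case "$seen" in
        *" $win "*) ;;                              # window already emitted
        *) seen+="$win "; printf '%s\n' "$win" ;;
      esac
    done
  done
}

# Assumed full-length sequences (reconstructed from the 15-mers, not stated in the source)
ADAPTER_X="AACTTGTAGATCGGAAGAGCGTCGTGT"
ADAPTER_Y="AGGACCAAGATCGGAAGAGCGTCGTGT"

mapfile -t TILES < <(tile_adapters 15 "$ADAPTER_X" "$ADAPTER_Y")

# Turn the windows into repeated -A arguments for cutadapt
ADAPTER_ARGS=()
for t in "${TILES[@]}"; do
  ADAPTER_ARGS+=(-A "$t")
done

echo "${#TILES[@]} distinct 15-mers"
printf '%s ' "${ADAPTER_ARGS[@]}"; echo
```

The windows come out grouped by sequence rather than interleaved as in the command, but the set of 15-mers is the same.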
4
Takes output from cutadapt round 1.
$ Bash example
# Install cutadapt if not already installed
# conda install -c bioconda cutadapt

# Inputs are the paired outputs of cutadapt round 1 (file names from the source command in step 3)
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"

# Illustrative round-2 call with a single Read 2 adapter.
# The full adapter list and the remaining options are given in step 6.
cutadapt \
  -A AACTTGTAGATCGGA \
  --quality-cutoff 6 \
  -m 18 \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  "${INPUT_R1}" "${INPUT_R2}"
-
5
Run to trim off the 3' adapters on read 2, to control for double ligation events.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=4.0

# Define input and output file paths (placeholders)
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"

# One of the Read 2 3' adapter fragments from the source command in step 6.
# The source passes 20 such -A fragments; a single one is shown here for brevity.
ADAPTER_R2="AACTTGTAGATCGGA"

# -A: 3' adapter for Read 2
# -o / -p: outputs for Read 1 and Read 2 (-A does not trim Read 1, but the pairs stay in sync)
# --minimum-length 18: discard reads shorter than 18 nt after trimming (value from the source command)
# -j 8: number of CPU threads (placeholder)
cutadapt -A "${ADAPTER_R2}" \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  --minimum-length 18 \
  -j 8 \
  "${INPUT_R1}" "${INPUT_R2}"
-
6
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
$ Bash example
# conda install -c bioconda cutadapt
# The options are copied from the source. "-f fastq" is older cutadapt syntax,
# and newer releases may not accept it.
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz \
  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
  > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
-
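The `-e 0.1 -O 5` pair in the round-2 command controls how loosely an adapter may match. The sketch below assumes cutadapt's documented rule that the number of allowed errors is the error rate times the length of the matched part, rounded down; the helper name `max_errors` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: allowed mismatches for a given overlap length.
# The error rate is passed as a percentage so that bash integer arithmetic can floor it.
max_errors() {
  local overlap=$1 rate_pct=$2
  echo $(( overlap * rate_pct / 100 ))
}

MIN_OVERLAP=5    # -O 5 in the source command
ADAPTER_LEN=15   # length of each -A fragment
RATE_PCT=10      # -e 0.1 in the source command

for (( len = MIN_OVERLAP; len <= ADAPTER_LEN; len++ )); do
  printf 'overlap %2d nt: %d mismatch(es) allowed\n' \
    "$len" "$(max_errors "$len" "$RATE_PCT")"
done
```

Under that rule, partial matches of 5 to 9 nt must be exact, and only overlaps of 10 nt or more tolerate a single mismatch.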
7
Takes output from cutadapt round 2.
$ Bash example
# The paired outputs of cutadapt round 2 (file names from the source command in step 6)
# are the inputs to the repeat-mapping stage that follows.
ROUND2_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
ROUND2_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"

# Confirm that both files exist and are non-empty before mapping
for f in "${ROUND2_R1}" "${ROUND2_R2}"; do
  [ -s "${f}" ] || { echo "missing or empty: ${f}" >&2; exit 1; }
done
-
8
Maps reads to a human-specific version of RepBase to remove repetitive elements; this helps control for spurious artifacts from rRNA (and other) repetitive reads.
bbduk, version not specified (tool inferred with models/gemini-2.5-flash)
$ Bash example
# Install BBMap suite if not available
# conda install -c bioconda bbmap

# Placeholder FASTA of human repetitive elements extracted from RepBase
# (for example rRNA, tRNA, SINE and LINE sequences).
HUMAN_REPBASE_FASTA="/path/to/human_repbase_repeats.fa"
INPUT_FASTQ="input_reads.fastq.gz"                      # adapter-trimmed reads
OUTPUT_FASTQ="filtered_non_repetitive_reads.fastq.gz"   # reads without a repeat k-mer hit

# bbduk filters reads by shared k-mers rather than by alignment.
# The source command in step 9 uses STAR for this stage instead.
# k=31: k-mer length
# hdist=1: allow one mismatch per k-mer
# stats=...: write per-reference hit statistics
# -Xmx4g: let the JVM use up to 4 GB of heap; adjust to the data and machine
bbduk.sh in="$INPUT_FASTQ" \
  out="$OUTPUT_FASTQ" \
  ref="$HUMAN_REPBASE_FASTA" \
  k=31 hdist=1 \
  stats="repbase_filter_stats.txt" \
  -Xmx4g
-
9
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables for clarity
GENOME_DIR="/path/to/RepBase_human_database_file"
READ_FILE_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
READ_FILE_2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
FINAL_OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# Execute STAR alignment
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_1}" "${READ_FILE_2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${FINAL_OUTPUT_BAM}"
-
10
Takes output from STAR rmRep.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.10

# Placeholder input. samtools markdup expects coordinate-sorted input that has already been
# through "samtools fixmate -m"; STAR's unsorted BAM does not meet this as-is.
# The source pipeline itself removes duplicates with barcode_collapse_pe.py (step 15).
INPUT_BAM="aligned_reads.bam"
OUTPUT_BAM="aligned_reads.markdup.bam"
METRICS_FILE="markdup_metrics.txt"

# -r: remove duplicates rather than only marking them
# -s: print statistics; they go to stderr, hence the 2> redirect
samtools markdup -r -s "$INPUT_BAM" "$OUTPUT_BAM" 2> "$METRICS_FILE"
-
11
Maps unique reads to the mouse genome.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# A STAR index for the mouse genome must be built or downloaded first
# (the source lists genome build mm9).
# Example index build (run once, replace paths):
# STAR --runThreadN 8 --runMode genomeGenerate --genomeDir /path/to/STAR_index/mm9 \
#      --genomeFastaFiles /path/to/mouse_genome.fa --sjdbGTFfile /path/to/mouse_annotations.gtf \
#      --sjdbOverhang 100   # read length minus 1

# Align reads to the mouse genome, keeping only uniquely mapping reads
# Input: reads.fastq.gz (placeholder; gzip-compressed, so --readFilesCommand zcat is required)
# Output: aligned_Aligned.sortedByCoord.out.bam
STAR --genomeDir /path/to/STAR_index/mm9 \
  --readFilesIn reads.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix aligned_ \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN 8 \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 10 \
  --outFilterScoreMinOverLread 0.66 \
  --outFilterMatchNminOverLread 0.66
-
12
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
$ Bash example
# Install STAR (example using conda):
# conda install -c bioconda star

# Define variables for paths
GENOME_DIR="/path/to/STAR_database_file"   # STAR index of the mouse genome (path as in the source command)
READ_FILE_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
READ_FILE_2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

# Execute STAR alignment command
STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_1}" "${READ_FILE_2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd > "${OUTPUT_BAM}"
-
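The genome-mapping command reads `...rep.bamUnmapped.out.mate1` and `.mate2` because STAR, when run with `--outReadsUnmapped Fastx`, writes unmapped reads to files named after `--outFileNamePrefix` plus `Unmapped.out.mate1` and `Unmapped.out.mate2`. Those files are uncompressed, which would explain why `--readFilesCommand zcat` appears only in the repeat-mapping call. The dry-run sketch below prints both invocations from one shared option list so the differences stand out; the helper `star_cmd` and the short placeholder file names are hypothetical.

```bash
#!/usr/bin/env bash
# Options that appear in both STAR commands in the source
COMMON=(--runMode alignReads --runThreadN 16 --genomeLoad LoadAndRemove
        --outSAMunmapped Within --outFilterMultimapScoreRange 1
        --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted
        --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10
        --outSAMattrRGline ID:foo --alignEndsType EndToEnd)
# Options that differ between the two stages
REPEAT_ONLY=(--outFilterMultimapNmax 30 --readFilesCommand zcat)
GENOME_ONLY=(--outFilterMultimapNmax 1)

# Hypothetical helper: print the STAR command for a stage instead of running it
star_cmd() {
  local stage=$1 genome_dir=$2 r1=$3 r2=$4 prefix=$5
  local -a extra
  if [ "$stage" = repeat ]; then
    extra=("${REPEAT_ONLY[@]}")
  else
    extra=("${GENOME_ONLY[@]}")
  fi
  echo STAR "${COMMON[@]}" "${extra[@]}" --genomeDir "$genome_dir" \
       --readFilesIn "$r1" "$r2" --outFileNamePrefix "$prefix"
}

REP_PREFIX="sample.adapterTrim.round2.rep.bam"     # placeholder prefix
MATE1="${REP_PREFIX}Unmapped.out.mate1"            # written by the repeat stage
MATE2="${REP_PREFIX}Unmapped.out.mate2"

star_cmd repeat /path/to/RepBase_human_database_file \
  sample_R1.adapterTrim.round2.fastq.gz sample_R2.adapterTrim.round2.fastq.gz "$REP_PREFIX"
star_cmd genome /path/to/STAR_database_file \
  "$MATE1" "$MATE2" sample.adapterTrim.round2.rmRep.bam
```

The repeat stage tolerates up to 30 mapping locations per read, while the genome stage keeps only reads that map to a single location.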
13
Takes output from STAR genome mapping.
$ Bash example
# Install STAR (example using conda)
# conda create -n star_env star=2.7.10a samtools -c bioconda -c conda-forge
# conda activate star_env

# --- Reference Data Setup (placeholder paths; the source lists genome build mm9) ---
# This step assumes a STAR genome index has already been built. If not, run:
# STAR --runThreadN <num_threads> --runMode genomeGenerate \
#      --genomeDir /path/to/STAR_index_mouse \
#      --genomeFastaFiles /path/to/mouse_genome.fa \
#      --sjdbGTFfile /path/to/mouse_annotations.gtf \
#      --sjdbOverhang 100   # read length minus 1

# --- Define variables ---
GENOME_DIR="/path/to/STAR_index_mouse"   # Placeholder for the STAR genome index directory
READ1="sample_R1.fastq.gz"               # Placeholder input FASTQ (Read 1)
READ2="sample_R2.fastq.gz"               # Placeholder input FASTQ (Read 2, remove if single-end)
OUTPUT_PREFIX="sample_"                  # Prefix for output files
THREADS=8                                # Number of threads to use

# --- Run STAR alignment ---
# --readFilesCommand zcat is needed because the inputs are gzip-compressed.
STAR --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --readFilesCommand zcat \
  --runThreadN "${THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outSAMattributes Standard \
  --quantMode GeneCounts \
  --outFilterType BySJout \
  --outFilterMultimapNmax 20 \
  --alignSJDBoverhangMin 1 \
  --alignSJoverhangMin 8 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000

# --- Index the output BAM file (STAR names it <prefix>Aligned.sortedByCoord.out.bam) ---
samtools index "${OUTPUT_PREFIX}Aligned.sortedByCoord.out.bam"
-
14
Custom random-mer-aware script for PCR duplicate removal.
barcode_collapse_pe.py (script named in the source command of step 15; part of the yeolab/eclip workflow)
$ Bash example
# The script is distributed with the yeolab/eclip workflow and is assumed to need Python with pysam.
# pip install pysam

# Placeholder paths; the three options are the ones used in the source command of step 15.
SCRIPT_PATH="/path/to/barcode_collapse_pe.py"
INPUT_BAM="aligned_reads.rmRep.bam"            # genome-mapped reads carrying the random-mer
OUTPUT_BAM="aligned_reads.rmRep.rmDup.bam"     # reads left after PCR duplicate removal
METRICS_FILE="aligned_reads.rmRep.rmDup.metrics"

# Execute the random-mer-aware PCR duplicate removal script
python "${SCRIPT_PATH}" \
  --bam "${INPUT_BAM}" \
  --out_file "${OUTPUT_BAM}" \
  --metrics_file "${METRICS_FILE}"
-
15
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
barcode_collapse_pe.py v1.2 (tool and version inferred with models/gemini-2.5-flash; from the yeolab/eclip pipeline)
$ Bash example
# The environment below is an assumption (an older Python with pysam);
# check the repository for the script's actual requirements and location.
# conda create -n eclip_env python=2.7 pysam -y
# conda activate eclip_env
# git clone https://github.com/yeolab/eclip.git

# Run from the directory that contains barcode_collapse_pe.py, or put the script on PATH
python barcode_collapse_pe.py \
  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam \
  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
-
16
Takes output from barcode collapse PE.
samtools (chosen for this illustration; the source names no tool for this line)
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Output of barcode_collapse_pe.py (step 15). It is already a BAM file, so no re-alignment is needed.
RMDUP_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"

# Confirm the BAM is intact before it is sorted in the next step
samtools quickcheck "${RMDUP_BAM}" && echo "BAM looks intact: ${RMDUP_BAM}"
-
17
Sorts resulting bam file for use downstream.
samtools v1.10 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.10

# Define input and output file names
INPUT_BAM="input.bam"
OUTPUT_SORTED_BAM="output_sorted.bam"

# Sort the BAM file by coordinate; -o names the output file
samtools sort -o "${OUTPUT_SORTED_BAM}" "${INPUT_BAM}"
-
18
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
$ Bash example
# Picard is run through Java. The class path below (GATK Queue.jar, net.sf.picard namespace)
# is the one in the source command; a standalone Picard install
# (for example: conda install -c bioconda picard) is invoked differently.
java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 \
  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp \
  -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam \
  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam \
  TMP_DIR=/full/path/to/files/.queue/tmp \
  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
-
19
Takes output from sortSam, makes bam index for use downstream.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.15.1

# Define input BAM file (output from sortSam)
INPUT_BAM="sorted.bam"   # Placeholder for the sorted BAM file

# Create BAM index for downstream use
samtools index "${INPUT_BAM}"
-
20
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.19

# Create an index for the sorted BAM file
samtools index \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
-
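Every per-sample file name in the source commands up to this point is the raw Read 1 name plus a stage suffix. The sketch below lists that chain in order; the array names are hypothetical, and the suffixes are copied from the source commands.

```bash
#!/usr/bin/env bash
RAW_R1="/full/path/to/files/file_R1.C01.fastq.gz"

# Stage labels and the suffix each stage appends to the raw Read 1 name
STAGE_NAMES=("cutadapt round 1" "cutadapt round 2" "STAR repeat mapping"
             "STAR genome mapping" "barcode collapse" "sortSam" "samtools index")
STAGE_SUFFIXES=(".adapterTrim.fastq.gz"
                ".adapterTrim.round2.fastq.gz"
                ".adapterTrim.round2.rep.bam"
                ".adapterTrim.round2.rmRep.bam"
                ".adapterTrim.round2.rmRep.rmDup.bam"
                ".adapterTrim.round2.rmRep.rmDup.sorted.bam"
                ".adapterTrim.round2.rmRep.rmDup.sorted.bam.bai")

for i in "${!STAGE_NAMES[@]}"; do
  printf '%-20s %s%s\n' "${STAGE_NAMES[$i]}" "$RAW_R1" "${STAGE_SUFFIXES[$i]}"
done

# The sorted BAM is the per-sample file that later feeds samtools merge
SORTED_BAM="${RAW_R1}${STAGE_SUFFIXES[5]}"
```

Read 2 files exist only for the two trimming stages; from STAR onward each sample is tracked under its Read 1 name.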
21
Takes inputs from multiple final bam files.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Example: merge multiple final BAM files (for example technical replicates or lanes)
# into a single BAM file for downstream analysis.
# Replace input1.bam, input2.bam, etc. with the actual input paths,
# and merged_output.bam with the desired output name.
# -@ sets the number of threads.
samtools merge -@ 4 merged_output.bam input1.bam input2.bam input3.bam
-
22
Merges the two technical replicates for further downstream analysis.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.19

# Define input and output file paths
INPUT_REPLICATE1_BAM="replicate1.bam"
INPUT_REPLICATE2_BAM="replicate2.bam"
OUTPUT_MERGED_BAM="merged_replicates.bam"

# Merge the two technical replicates (BAM files)
samtools merge "${OUTPUT_MERGED_BAM}" "${INPUT_REPLICATE1_BAM}" "${INPUT_REPLICATE2_BAM}"
-
23
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
$ Bash example
# Install samtools (e.g., using conda)
# conda install -c bioconda samtools

# Define input and output files
INPUT_BAM_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
INPUT_BAM_2="/full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
OUTPUT_MERGED_BAM="/full/path/to/files/CombinedID.merged.bam"

# Execute samtools merge command
samtools merge "${OUTPUT_MERGED_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}"
-
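The two inputs to the merge differ only in the label embedded in the file name (`C01` and `D08`). As a sketch, the command can be assembled from a list of such labels, which scales to more replicates without retyping paths; the variable names are hypothetical, and the command is printed rather than run.

```bash
#!/usr/bin/env bash
DIR="/full/path/to/files"
LABELS=(C01 D08)   # replicate labels taken from the source file names

# Build one sorted-BAM path per label
INPUTS=()
for label in "${LABELS[@]}"; do
  INPUTS+=("${DIR}/file_R1.${label}.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam")
done

# Assemble the command as an array so that each path stays a single argument
MERGE_CMD=(samtools merge "${DIR}/CombinedID.merged.bam" "${INPUTS[@]}")

# Dry run: print the command; replace this line with "${MERGE_CMD[@]}" to execute it
printf '%s ' "${MERGE_CMD[@]}"; echo
```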
24
Takes output from sortSam, makes bam index for use downstream.
samtools index v1.19.1 (tool and version inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'sorted.bam' is the output from sortSam
samtools index sorted.bam
-
25
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools
samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
-
26
Takes output from sortSam.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools=1.19.1

# This step takes a sorted BAM file (output from sortSam) and creates an index (.bai) file.
# The index gives downstream tools fast random access to reads within the BAM file.
# Replace 'sorted_input.bam' with the actual path to the sorted BAM file.
samtools index sorted_input.bam
-
27
Only outputs the second read in each pair for use with single stranded peak caller.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# At this point the data are aligned BAM records, not FASTQ, so the second read of each
# pair is selected by SAM flag. File names are placeholders; the source command is in step 29.
# -f 128: keep only records flagged as second in pair
samtools view -hb -f 128 merged.bam > merged.r2.bam
-
28
This is the final bam file to perform analysis on.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Assume 'input.bam' is an aligned BAM file that needs to be finalized.
# Sort the BAM file by coordinate, which is often a prerequisite for downstream analysis.
samtools sort -o final.bam input.bam

# Index the sorted BAM file, which is necessary for quick access and visualization.
samtools index final.bam
-
29
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.9

# Define input and output file paths
INPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
OUTPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam"

# Extract reads that are the second in a pair (flag 128)
# -h: include the header in the output
# -b: output in BAM format
# -f 128: only output reads with flag 128 set (second in pair)
samtools view -hb -f 128 "${INPUT_BAM}" > "${OUTPUT_BAM}"
-
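`-f 128` works because the SAM FLAG field is a bit mask in which the value 128 (0x80) marks the second read of a pair. The sketch below applies the same bit test to a few common paired-end flag values; the helper name `is_read2` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds when the 128 bit (second in pair) is set in a SAM flag
is_read2() {
  (( ($1 & 128) != 0 ))
}

# Typical flag values for properly paired reads
for flag in 99 147 83 163; do
  if is_read2 "$flag"; then
    echo "flag ${flag}: second in pair, kept by -f 128"
  else
    echo "flag ${flag}: not second in pair, dropped by -f 128"
  fi
done
```

Flags 147 and 163 pass the filter, while 99 and 83 (first in pair) do not.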
30
Takes results from samtools view.
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.9

# The read-2-only BAM written by samtools view (step 29) is the input for peak calling.
R2_BAM="/full/path/to/files/CombinedID.merged.r2.bam"

# Sanity check: the total record count and the count of records flagged
# as second in pair (-f 128) should be the same number.
samtools view -c "${R2_BAM}"
samtools view -c -f 128 "${R2_BAM}"
-
31
Calls peaks on those files.
$ Bash example
# Clone the clipper repository if not already available, then follow its installation instructions
# git clone https://github.com/yeolab/clipper.git

# Define the input file and species (placeholders, apart from mm9, which the source command uses)
IP_BAM="path/to/your/ip.r2.bam"   # read-2-only BAM
SPECIES="mm9"                     # value passed to -s in the source command
OUTPUT_BED="eclip_peaks.bed"

# Execute clipper to call peaks
clipper -b "${IP_BAM}" -s "${SPECIES}" -o "${OUTPUT_BED}"
-
32
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s mm9 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
$ Bash example
# Install CLIPper (for example from https://github.com/yeolab/clipper)

# Execute CLIPper
clipper -b /full/path/to/files/CombinedID.merged.r2.bam \
  -s mm9 \
  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed \
  --bonferroni --superlocal --threshold-method binomial --save-pickle
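The peak file is described in the source as BED-format clusters of predicted RBP binding. A quick per-chromosome summary can be sketched with awk, assuming only the standard BED layout (chromosome, start and end in the first three columns); the three records below are made up for illustration and are not results from this dataset.

```bash
#!/usr/bin/env bash
# Made-up BED lines standing in for CombinedID.merged.r2.peaks.bed
PEAKS=$'chr1\t1000\t1050\tpeak_a\t0.001\t+\nchr1\t2000\t2120\tpeak_b\t0.010\t-\nchr2\t300\t340\tpeak_c\t0.002\t+'

# Count peaks and total covered bases per chromosome
SUMMARY=$(printf '%s\n' "$PEAKS" |
  awk -F'\t' '{ n[$1]++; w[$1] += $3 - $2 }
              END { for (c in n) printf "%s\t%d peaks\t%d nt covered\n", c, n[c], w[c] }' |
  sort)

printf '%s\n' "$SUMMARY"
```

To summarize a real run, replace the `printf '%s\n' "$PEAKS"` stage with `cat` on the peak file.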
Raw Source Text
Takes output from raw files. Run to trim off both 5' and 3' adapters on both reads. Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics Takes output from cutadapt round 2. Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam Takes output from STAR rmRep. Maps unique reads to the mouse genome. Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam takes output from STAR genome mapping. Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics Takes output from barcode collapse PE. 
Sorts resulting bam file for use downstream. Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true Takes output from sortSam, makes bam index for use downstream. Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai Takes inputs from multiple final bam files. Merges the two technical replicates for further downstream analysis. Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam Takes output from sortSam, makes bam index for use downstream. Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai Takes output from sortSam. Only outputs the second read in each pair for use with single stranded peak caller. This is the final bam file to perform analysis on. Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam Takes results from samtools view. Calls peaks on those files. Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s mm9 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle Genome_build: mm9 Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding
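
The long run of -A arguments in the two cutadapt commands above looks like every 15-nt window tiled across two 27-nt read 2 adapter sequences. That reading is inferred from the listed sequences rather than stated in the source text, and the helper names below (`tile_kmers`, `adapter_args`) and the two full-length sequences are reconstructions for illustration only. A minimal shell sketch that generates such a list instead of typing it by hand:

```bash
# Print every k-nt window of a sequence, one per line.
# $1 = sequence, $2 = window length
tile_kmers() {
    printf '%s\n' "$1" | awk -v k="$2" '{
        for (i = 1; i + k - 1 <= length($0); i++) print substr($0, i, k)
    }'
}

# Print one "-A <kmer>" line per distinct window across all given sequences.
# $1 = window length, remaining arguments = sequences.
# Duplicate windows are dropped; the first occurrence is kept.
adapter_args() {
    k="$1"
    shift
    for seq in "$@"; do
        tile_kmers "$seq" "$k"
    done | awk '!seen[$0]++ { printf "-A %s\n", $0 }'
}

# Assumption: full-length sequences reconstructed from the overlapping
# 15-mers in the source command; they are not given explicitly in the source.
ADAPTER_1="AACTTGTAGATCGGAAGAGCGTCGTGT"
ADAPTER_2="AGGACCAAGATCGGAAGAGCGTCGTGT"

# List the generated arguments (20 distinct 15-mers for these two inputs).
adapter_args 15 "$ADAPTER_1" "$ADAPTER_2"
```

The generated set matches the 20 sequences in the source commands, but the order differs: the source interleaves the two adapters, while this sketch lists all windows of the first before the second.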