GSE113193 Processing Pipeline
Publication
Tissue-selective restriction of RNA editing of CaV1.3 by splicing factor SRSF9.Nucleic acids research (2018) — PMID 29733375
Processing Steps
Generate Jupyter Notebook-
1
Takes output from raw files.
FastQC (Inferred with models/gemini-2.5-flash) v0.11.9 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Define input raw sequencing files (example) RAW_FILES="sample_R1.fastq.gz sample_R2.fastq.gz" # Define output directory for quality control reports OUTPUT_DIR="fastqc_reports" # Create output directory if it doesn't exist mkdir -p "${OUTPUT_DIR}" # Install FastQC (example using conda) # conda install -c bioconda fastqc # Run FastQC to generate quality control reports from raw files fastqc ${RAW_FILES} -o "${OUTPUT_DIR}" -
2
Run to trim off both 5â and 3â adapters on both reads.
$ Bash example
# Install fastp (if not already installed) # conda install -c bioconda fastp=0.23.2 # Define input and output file paths INPUT_READ1="read1.fastq.gz" INPUT_READ2="read2.fastq.gz" OUTPUT_READ1="trimmed_read1.fastq.gz" OUTPUT_READ2="trimmed_read2.fastq.gz" REPORT_JSON="fastp_report.json" REPORT_HTML="fastp_report.html" # Run fastp to trim adapters and perform quality filtering # -i: input read 1 # -o: output read 1 # -I: input read 2 # -O: output read 2 # --detect_adapter_for_pe: automatically detect and trim adapters for paired-end reads # --qualified_quality_phred 15: filter reads with quality score < 15 # --length_required 20: discard reads shorter than 20 bp after trimming # --trim_poly_g: trim polyG tails (common in Illumina NextSeq/NovaSeq) # --trim_poly_x: trim polyX tails # -j 8: use 8 threads # -h: HTML report output # -j: JSON report output fastp \ -i "${INPUT_READ1}" \ -o "${OUTPUT_READ1}" \ -I "${INPUT_READ2}" \ -O "${OUTPUT_READ2}" \ --detect_adapter_for_pe \ --qualified_quality_phred 15 \ --length_required 20 \ --trim_poly_g \ --trim_poly_x \ -j "${REPORT_JSON}" \ -h "${REPORT_HTML}" \ -w 8 -
3
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
cutadapt (Inferred with models/gemini-2.5-flash) v4.x (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install cutadapt (example using conda) # conda install -c bioconda cutadapt # Define input and output file paths INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz" INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz" OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz" OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz" METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics" # Define adapter sequences ADAPTER_R1="NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC" ADAPTER_G1="CTTCCGATCTACAAGTT" ADAPTER_G2="CTTCCGATCTTGGTCCT" ADAPTER_A_LIST=( "AACTTGTAGATCGGA" "AGGACCAAGATCGGA" "ACTTGTAGATCGGAA" "GGACCAAGATCGGAA" "CTTGT AGATCGGAAG" # Note: space in original command, likely a typo or specific sequence "GACCAAGATCGGAAG" "TTGTAGATCGGAAGA" "ACCAAGATCGGAAGA" "TGTAGATCGGAAGAG" "CCAAGATCGGAAGAG" "GTAGATCGGAAGAGC" "CAAGATCGGAAGAGC" "TAGATCGGAAGAGCG" "AAGATCGGAAGAGCG" "AGATCGGAAGAGCGT" "GATCGGAAGAGCGTC" "ATCGGAAGAGCGTCG" "TCGGAAGAGCGTCGT" "CGGAAGAGCGTCGTG" "GGAAGAGCGTCGTGT" ) # Construct the -A options string ADAPTER_A_OPTIONS="" for adapter in "${ADAPTER_A_LIST[@]}"; do ADAPTER_A_OPTIONS+=" -A \"${adapter}\"" done # Execute cutadapt for adapter and quality trimming # The original command 'quality-cutoff 6' is interpreted as 'cutadapt -q 6' for quality trimming. # The multiple -g and -A options are handled by cutadapt. cutadapt -q 6 \ -m 18 \ -a "${ADAPTER_R1}" \ -g "${ADAPTER_G1}" \ -g "${ADAPTER_G2}" \ ${ADAPTER_A_OPTIONS} \ -o "${OUTPUT_R1}" \ -p "${OUTPUT_R2}" \ "${INPUT_R1}" \ "${INPUT_R2}" > "${METRICS_FILE}" -
4
Takes output from cutadapt round 1.
$ Bash example
# Install cutadapt if not already available # conda install -c bioconda cutadapt=1.18 # Define input and output files INPUT_FASTQ="reads_from_cutadapt_round1.fastq.gz" OUTPUT_FASTQ="reads_trimmed_round2.fastq.gz" # Execute cutadapt for quality, length, and poly-A trimming # This step assumes a previous cutadapt round (round 1) for primary adapter trimming. # -a A{100}: Trim poly-A tails (up to 100 'A's) from the 3' end # -q 20,20: Trim low-quality bases from both ends (Phred score < 20) # --minimum-length 18: Discard reads shorter than 18 bp after trimming # -o: Output file cutadapt -a A{100} -q 20,20 --minimum-length 18 -o "${OUTPUT_FASTQ}" "${INPUT_FASTQ}" -
5
Run to trim off the 3â adapters on read 2, to control for double ligation events.
$ Bash example
# Install cutadapt (example using conda) # conda install -c bioconda cutadapt=4.0 # Define variables INPUT_R1="input_R1.fastq.gz" INPUT_R2="input_R2.fastq.gz" OUTPUT_R1="trimmed_R1.fastq.gz" OUTPUT_R2="trimmed_R2.fastq.gz" # Adapter sequence for the 3' end of Read 2. # For eCLIP assays, this is often the rP5 adapter or a variant, # which is ligated to the 3' end of the RNA fragment. # Example sequence from yeolab/skipper eCLIP pipeline: ADAPTER_R2_SEQUENCE="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT" # Run cutadapt to trim 3' adapters from Read 2 # -A ADAPTER_R2_SEQUENCE: Trims the given adapter from the 3' end of Read 2. # This is crucial for controlling double ligation events # where the adapter might ligate to itself or another adapter. # -q 20: Trims low-quality bases from the 3' end (Phred score < 20). # -m 18: Discard reads shorter than 18 bp after trimming. # -o: Output file for Read 1 # -p: Output file for Read 2 cutadapt -A "${ADAPTER_R2_SEQUENCE}" \ -q 20 \ -m 18 \ -o "${OUTPUT_R1}" \ -p "${OUTPUT_R2}" \ "${INPUT_R1}" \ "${INPUT_R2}" -
6
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
$ Bash example
# Install cutadapt (e.g., using conda) # conda install -c bioconda cutadapt cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
-
7
Takes output from cutadapt round 2.
$ Bash example
# Install cutadapt (if not already installed) # conda install -c bioconda cutadapt=4.0 # Define input and output files INPUT_FASTQ="input_from_cutadapt_round2.fastq.gz" OUTPUT_FASTQ="trimmed_round3.fastq.gz" LOG_FILE="cutadapt_round3.log" # Define parameters (these are placeholders and should be adjusted based on specific experimental design) # This step assumes further trimming is needed after an initial cutadapt round. # It could involve removing a different adapter, more stringent quality trimming, or length filtering. # Example: Trim remaining 3' adapters (e.g., poly-A or a different library adapter) # Example: Quality trim from 3' end with a cutoff of 20 # Example: Discard reads shorter than 18 bp ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Placeholder for a common Illumina adapter or specific library adapter MIN_LENGTH=18 QUALITY_CUTOFF=20 # Execute cutadapt cutadapt \ -a "${ADAPTER_SEQUENCE}" \ --quality-cutoff="${QUALITY_CUTOFF}" \ --minimum-length="${MIN_LENGTH}" \ -o "${OUTPUT_FASTQ}" \ "${INPUT_FASTQ}" \ > "${LOG_FILE}" 2>&1 -
8
Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.
$ Bash example
# Install bbtools (if not already installed) # conda install -c bioconda bbtools # Download and prepare a human-specific RepBase library for bbduk. # This typically involves downloading RepBase sequences (e.g., from GIRI RepBase website), # filtering for human-specific elements, and compiling them into a single FASTA file. # Example placeholder for the RepBase reference file. The actual download and processing # of RepBase data may require registration or specific scripts. # For instance, one might download 'Homo_sapiens.fa' from RepBase and use it as the reference. REPBASE_REF="path/to/human_repbase.fasta" # Placeholder for the actual path to the RepBase reference file # Input FASTQ file (e.g., raw reads or adapter-trimmed reads) INPUT_FASTQ="input_reads.fastq.gz" # Output FASTQ file with repetitive reads removed OUTPUT_FASTQ="filtered_non_repetitive_reads.fastq.gz" # Run bbduk to remove reads mapping to human repetitive elements # k=31: kmer size for matching (default for contaminant removal) # hdist=1: allow 1 mismatch in kmer # stats: output statistics about removed reads # -Xmx: memory allocation bbduk.sh \ in="$INPUT_FASTQ" \ out="$OUTPUT_FASTQ" \ ref="$REPBASE_REF" \ k=31 \ hdist=1 \ stats=bbduk_repbase_filter_stats.txt \ -Xmx4g -
9
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables for clarity GENOME_DIR="/path/to/RepBase_human_database_file" READ_FILE_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz" READ_FILE_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz" OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam" FINAL_OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam" # Execute STAR alignment STAR --runMode alignReads \ --runThreadN 16 \ --genomeDir "${GENOME_DIR}" \ --genomeLoad LoadAndRemove \ --readFilesIn "${READ_FILE_R1}" "${READ_FILE_R2}" \ --outSAMunmapped Within \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMattributes All \ --readFilesCommand zcat \ --outStd BAM_Unsorted \ --outSAMtype BAM Unsorted \ --outFilterType BySJout \ --outReadsUnmapped Fastx \ --outFilterScoreMin 10 \ --outSAMattrRGline ID:foo \ --alignEndsType EndToEnd > "${FINAL_OUTPUT_BAM}" -
10
Takes output from STAR rmRep.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star # Define variables GENOME_DIR="/path/to/STAR_genome_index/hg38" # Placeholder for STAR genome index directory (e.g., human hg38) READ1_FASTQ="input_read1.fastq.gz" # Placeholder for input FASTQ file (read 1) READ2_FASTQ="input_read2.fastq.gz" # Placeholder for input FASTQ file (read 2, remove if single-end) OUTPUT_PREFIX="aligned_reads" # Prefix for output files THREADS=8 # Number of threads to use RG_ID="sample_id" # Read group ID RG_SM="sample_name" # Sample name RG_LB="library_name" # Library name RG_PU="platform_unit" # Platform unit (e.g., flowcell-lane) # Create output directory if it doesn't exist mkdir -p star_output # Run STAR alignment (parameters adapted from Yeo lab eCLIP workflow) STAR \ --runThreadN "${THREADS}" \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \ --readFilesCommand zcat \ --outFileNamePrefix "star_output/${OUTPUT_PREFIX}_" \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --outFilterMultimapNmax 1 \ --outFilterMismatchNmax 999 \ --outFilterMismatchNoverLmax 0.05 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --alignSJoverhangMin 8 \ --alignSJDBoverhangMin 1 \ --sjdbScore 1 \ --limitSjdbInsertNsj 2000000 \ --outFilterType BySJout \ --outFilterScoreMinOverLread 0.3 \ --outFilterMatchNoverLread 0.3 \ --winAnchorMultimapNmax 50 \ --seedSearchStartLmax 30 \ --seedPerReadNmax 100000 \ --seedPerWindowNmax 1000 \ --alignTranscriptsPerReadNmax 100000 \ --alignTranscriptsPerWindowNmax 10000 \ --quantMode GeneCounts \ --outSAMattrRGline ID:"${RG_ID}" SM:"${RG_SM}" LB:"${RG_LB}" PL:ILLUMINA PU:"${RG_PU}" -
11
Maps unique reads to the human genome.
$ Bash example
# Install BWA and Samtools if not already available # conda install -c bioconda bwa samtools # Placeholder for human genome reference (e.g., hg38) # Ensure the reference genome is indexed with bwa index # bwa index human_genome.fa # Align reads to the human genome and filter for uniquely mapped reads # -t: number of threads # human_genome.fa: path to the indexed human reference genome # reads.fastq.gz: path to the input FASTQ file # | samtools view -bS -F 4 -q 20 - : pipe output to samtools to filter for mapped reads with MAPQ >= 20 # | samtools sort -o aligned_unique_reads.bam - : pipe output to samtools to sort and save as BAM bwa mem -t 8 human_genome.fa reads.fastq.gz | \ samtools view -bS -F 4 -q 20 - | \ samtools sort -o aligned_unique_reads.bam - # Index the sorted BAM file samtools index aligned_unique_reads.bam
-
12
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
$ Bash example
# Install STAR (example using Conda): # conda install -c bioconda star # Define variables STAR_GENOME_DIR="/path/to/star_index/GRCh38" # Placeholder: Replace with the actual path to your STAR genome index directory (e.g., for human hg38) INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1" # Replace with actual path to mate1 FASTQ/FASTA file INPUT_R2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2" # Replace with actual path to mate2 FASTQ/FASTA file OUTPUT_BAM_PATH="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" # The final aligned BAM file name and prefix for auxiliary files # Ensure the output directory exists mkdir -p "$(dirname "${OUTPUT_BAM_PATH}")" # Run STAR alignment STAR \ --runMode alignReads \ --runThreadN 16 \ --genomeDir "${STAR_GENOME_DIR}" \ --genomeLoad LoadAndRemove \ --readFilesIn "${INPUT_R1}" "${INPUT_R2}" \ --outSAMunmapped Within \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFileNamePrefix "${OUTPUT_BAM_PATH}" \ --outSAMattributes All \ --outStd BAM_Unsorted \ --outSAMtype BAM Unsorted \ --outFilterType BySJout \ --outReadsUnmapped Fastx \ --outFilterScoreMin 10 \ --outSAMattrRGline ID:foo \ --alignEndsType EndToEnd > "${OUTPUT_BAM_PATH}" -
13
Takes output from STAR genome mapping.
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star # Define variables # Placeholder for STAR genome index (e.g., hg38). # This directory should contain files generated by STAR --runMode genomeGenerate. GENOME_DIR="/path/to/STAR_index_hg38" READ1_FASTQ="sample_R1.fastq.gz" READ2_FASTQ="sample_R2.fastq.gz" # Comment out or remove if single-end OUTPUT_PREFIX="sample_" THREADS=8 # Number of threads to use # Run STAR genome mapping STAR \ --genomeDir "${GENOME_DIR}" \ --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \ --runThreadN "${THREADS}" \ --outFileNamePrefix "${OUTPUT_PREFIX}" \ --outSAMtype BAM SortedByCoordinate \ --outSAMunmapped Within \ --outSAMattributes Standard \ --readFilesCommand zcat \ --limitBAMsortRAM 30000000000 # Adjust based on available RAM, ~30GB recommended for large genomes -
14
Custom random-mer-aware script for PCR duplicate removal.
$ Bash example
# Install umi_tools (example using conda) # conda install -c bioconda umi-tools # Example usage for PCR duplicate removal with umi_tools dedup # This assumes UMIs have already been extracted and appended to read names # (e.g., using 'umi_tools extract' prior to alignment). # The '--method directional' is commonly used for UMI-based deduplication. # The '--paired' flag is used for paired-end sequencing data. umi_tools dedup \ -I input_aligned_reads.bam \ -S deduplicated_reads.bam \ --method directional \ --paired \ --log deduplication.log -
15
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
$ Bash example
# Install dependencies (if not already installed) # This script requires pysam # conda install -c bioconda pysam # If barcode_collapse_pe.py is not in your PATH, you might need to clone the repository # and add its scripts directory to PATH, or call it directly: # git clone https://github.com/yeolab/eclip.git # # export PATH=$PATH:$(pwd)/eclip/scripts # # python eclip/scripts/barcode_collapse_pe.py ... # Define input and output file paths as they appear in the description INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam" METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics" # Execute the command barcode_collapse_pe.py --bam "${INPUT_BAM}" --out_file "${OUTPUT_BAM}" --metrics_file "${METRICS_FILE}" -
16
Takes output from barcode collapse PE.
$ Bash example
# Install STAR (if not already installed) # conda install -c bioconda star=2.7.10a # Define variables # INPUT_FASTQ: Placeholder for the gzipped FASTQ file containing deduplicated paired-end reads # from the barcode collapse PE step. INPUT_FASTQ="deduplicated_reads_PE.fastq.gz" # OUTPUT_PREFIX: Prefix for output files (e.g., aligned BAM, log files) OUTPUT_PREFIX="aligned_reads" # GENOME_DIR: Path to the pre-built STAR genome index directory. # For eCLIP, this is typically a human reference genome like hg38. # Replace with the actual path to your genome index. GENOME_DIR="/path/to/STAR_genome_index/hg38" # THREADS: Number of CPU threads to use for alignment. THREADS=8 # RAM_LIMIT: Maximum RAM (in bytes) to be used for sorting BAM. Adjust based on available system RAM. # Example: 30GB = 30 * 1024^3 bytes RAM_LIMIT=30000000000 # --- STAR Alignment Command --- # This command aligns the deduplicated paired-end reads to the reference genome. # It outputs a sorted BAM file and gene counts. STAR --runThreadN ${THREADS} \ --genomeDir ${GENOME_DIR} \ --readFilesIn ${INPUT_FASTQ} \ --readFilesCommand zcat \ --outFileNamePrefix ${OUTPUT_PREFIX}_ \ --outSAMtype BAM SortedByCoordinate \ --outSAMattributes All \ --quantMode GeneCounts \ --outFilterMultimapNmax 20 \ --outFilterMismatchNmax 3 \ --alignIntronMin 20 \ --alignIntronMax 1000000 \ --alignMatesGapMax 1000000 \ --limitBAMsortRAM ${RAM_LIMIT} -
17
Sorts resulting bam file for use downstream.
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools=1.10 # Sort the BAM file by coordinate # Replace 'input.bam' with your actual input BAM file # Replace 'output.sorted.bam' with your desired output sorted BAM file name # Adjust '-@ 8' for the number of threads/CPUs you want to use samtools sort -@ 8 -o output.sorted.bam input.bam
-
18
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
$ Bash example
# Install Picard (e.g., via Conda or by downloading the JAR) # conda install -c bioconda picard # Or download the picard.jar from https://github.com/broadinstitute/picard/releases (for standalone versions) # Define variables for paths and files # NOTE: The original command uses Queue.jar from a GATK distribution for the classpath. # This implies that the Picard classes were bundled within that specific GATK Queue.jar, # or that Queue.jar was part of a larger GATK distribution where Picard classes were accessible. # Adjust GATK_DIST_PATH to your specific GATK distribution path where Queue.jar resides. GATK_DIST_PATH="/path/to/gatk/dist" # Adjust this to your GATK distribution path INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam" # Input BAM file OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam" # Output sorted BAM file TMP_DIR="/full/path/to/files/.queue/tmp" # Temporary directory # Create the temporary directory if it doesn't exist mkdir -p "${TMP_DIR}" # Execute Picard SortSam command java -Xmx2048m \ -XX:+UseParallelOldGC \ -XX:ParallelGCThreads=4 \ -XX:GCTimeLimit=50 \ -XX:GCHeapFreeLimit=10 \ -Djava.io.tmpdir="${TMP_DIR}" \ -cp "${GATK_DIST_PATH}/Queue.jar" \ net.sf.picard.sam.SortSam \ INPUT="${INPUT_BAM}" \ TMP_DIR="${TMP_DIR}" \ OUTPUT="${OUTPUT_BAM}" \ VALIDATION_STRINGENCY=SILENT \ SO=coordinate \ CREATE_INDEX=true -
19
Takes output from sortSam, makes bam index for use downstream.
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Assuming 'sorted.bam' is the output from sortSam INPUT_BAM="sorted.bam" samtools index "${INPUT_BAM}" -
20
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.10 # Create an index for the sorted BAM file samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
-
21
Takes inputs from multiple final bam files.
$ Bash example
bash # Install samtools if not already available # conda install -c bioconda samtools # Merges multiple sorted BAM files into a single sorted BAM file. # Replace input_1.bam, input_2.bam, etc., with your actual input BAM files. # Replace merged_output.bam with your desired output file name. samtools merge merged_output.bam input_1.bam input_2.bam input_3.bam
-
22
Merges the two technical replicates for further downstream analysis.
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # Merge technical replicates (BAM files) # Replace replicate1.bam and replicate2.bam with your actual input BAM files. # Replace merged_replicates.bam with your desired output merged BAM file name. samtools merge merged_replicates.bam replicate1.bam replicate2.bam
-
23
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
$ Bash example
# Install samtools (e.g., using conda) # conda install -c bioconda samtools # Define input and output files OUTPUT_BAM="/full/path/to/files/CombinedID.merged.bam" INPUT_BAM_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam" INPUT_BAM_2="/full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam" # Execute samtools merge samtools merge "${OUTPUT_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}" -
24
Takes output from sortSam, makes bam index for use downstream.
samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools=1.10 # Assuming 'sorted.bam' is the output from sortSam samtools index sorted.bam
-
25
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
-
26
Takes output from sortSam.
$ Bash example
# Install samtools if not already installed # conda install -c bioconda samtools # This command creates an index file (.bai) for a sorted BAM file, # which is essential for many downstream applications and visualization. # Replace 'sorted.bam' with the actual output file from sortSam. samtools index sorted.bam
-
27
Only outputs the second read in each pair for use with single stranded peak caller.
reformat.sh (Inferred with models/gemini-2.5-flash) v38.90 (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install BBMap suite (contains reformat.sh) # conda install -c bioconda bbmap # Define input and output file names INPUT_R1="input_R1.fastq.gz" INPUT_R2="input_R2.fastq.gz" OUTPUT_R2_ONLY="output_R2_only.fastq.gz" # Only outputs the second read in each pair for use with single stranded peak caller. # 'in1' and 'in2' specify the paired-end input files. # 'out1=null' discards the first read. # 'out2' specifies the output file for the second read. reformat.sh in1="${INPUT_R1}" in2="${INPUT_R2}" out1=null out2="${OUTPUT_R2_ONLY}" -
28
This is the final bam file to perform analysis on.
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools # Placeholder for the input BAM file (e.g., from an alignment step) INPUT_BAM="input.bam" # Define the output sorted BAM file name OUTPUT_SORTED_BAM="final.sorted.bam" # Sort the BAM file to prepare it for indexing and downstream analysis samtools sort -o "${OUTPUT_SORTED_BAM}" "${INPUT_BAM}" # Index the sorted BAM file for quick access to specific regions samtools index "${OUTPUT_SORTED_BAM}" echo "Final BAM file '${OUTPUT_SORTED_BAM}' and its index '${OUTPUT_SORTED_BAM}.bai' are ready for analysis." -
29
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
$ Bash example
# Install samtools (if not already installed) # conda install -c bioconda samtools # Define input and output file paths INPUT_BAM="/full/path/to/files/CombinedID.merged.bam" OUTPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam" # Filter BAM file to extract reads that are the second in a pair # -h: Include header in the output # -b: Output in BAM format # -f 128: Only output reads with the FLAG bit 128 set (second in pair) samtools view -hb -f 128 "${INPUT_BAM}" > "${OUTPUT_BAM}" -
30
Takes results from samtools view.
$ Bash example
# Install samtools if not already available # conda install -c bioconda samtools # This command takes an input BAM file and outputs a SAM file, including the header. # Replace 'input.bam' with your actual input file and 'output.sam' with your desired output file. samtools view -h input.bam > output.sam
-
31
Calls peaks on those files.
clipper (Inferred with models/gemini-2.5-flash) vlatest (Inferred with models/gemini-2.5-flash) GitHub$ Bash example
# Install clipper (if not already installed) # git clone https://github.com/yeolab/clipper.git # cd clipper # pip install -r requirements.txt # Check if requirements.txt exists, if not, assume basic python dependencies # For demonstration, let's assume clipper.py is in the current PATH or directory # And placeholder BAM files exist: input_ip.bam, input_control.bam # Download hg38 chromosome sizes if not available (using UCSC as source) # wget -qO- http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes > hg38.chrom.sizes # Call peaks using clipper # Assuming input_ip.bam and input_control.bam are the input files # And hg38.chrom.sizes is the genome size file (placeholder for latest assembly) python clipper.py -b input_ip.bam -c input_control.bam -s hg38.chrom.sizes -o peaks_output
-
32
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
$ Bash example
# Install CLIPper (example, adjust as needed) # CLIPper is a Python-based tool. It can often be run directly after cloning or installed via pip. # git clone https://github.com/yeolab/clipper.git # cd clipper # python setup.py install # Ensure required Python dependencies are installed (e.g., numpy, scipy, pysam) # conda install -c bioconda numpy scipy pysam # Define input and output paths INPUT_BAM="/full/path/to/files/CombinedID.merged.r2.bam" OUTPUT_BED="/full/path/to/files/CombinedID.merged.r2.peaks.bed" GENOME_ASSEMBLY="hg19" # Reference genome assembly for CLIPper # Execute CLIPper command for peak calling clipper -b "${INPUT_BAM}" -s "${GENOME_ASSEMBLY}" -o "${OUTPUT_BED}" --bonferroni --superlocal --threshold-method binomial --save-pickle
Tools Used
Raw Source Text
Takes output from raw files. Run to trim off both 5â and 3â adapters on both reads. Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics Takes output from cutadapt round 1. Run to trim off the 3â adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics Takes output from cutadapt round 2. Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads. Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam Takes output from STAR rmRep. Maps unique reads to the human genome. Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam Takes output from STAR genome mapping. Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics Takes output from barcode collapse PE. Sorts resulting bam file for use downstream. Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true Takes output from sortSam, makes bam index for use downstream. Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai Takes inputs from multiple final bam files. Merges the two technical replicates for further downstream analysis. Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam Takes output from sortSam, makes bam index for use downstream. Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai Takes output from sortSam. Only outputs the second read in each pair for use with single stranded peak caller. This is the final bam file to perform analysis on. Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam Takes results from samtools view. Calls peaks on those files. Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle Genome_build: hg19 Supplementary_files_format_and_content: bigWig format. Maps the raw reads to both hg19 and mm10 respectively. Describes the mapped reads from the eCLIP experiments. Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding (on hg19 and mm10 respectively); column 4 contains -log10(p-value) and column 5 contains log2(fold-enrichment) in eCLIP versus paired size-matched input