GSE103225 Processing Pipeline
Publication
Transcriptome-pathology correlation identifies interplay between TDP-43 and the expression of its kinase CK1E in sporadic ALS. Acta Neuropathologica (2018), PMID 29881994
Dataset
GSE103225: Transcriptome-pathology correlations predict CSNK1E-mediated TDP-43 phosphorylation in sporadic amyotrophic lateral sclerosis
Processing Steps
1
Takes output from raw files.
Generic Raw File Processing (Inferred with models/gemini-2.5-flash) vN/A
$ Bash example
# The step description "Takes output from raw files." is too generic to infer a specific tool, command, or parameters.
# This typically refers to the initial input stage of a pipeline where raw sequencing data (e.g., FASTQ files) are provided.
# Subsequent steps would involve quality control, alignment, or other specific processing based on the assay type.
# No specific command can be inferred from this generic description.
2
Run to trim off both 5' and 3' adapters on both reads.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda -c conda-forge cutadapt=4.0

# Define input and output file paths
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"

# Define adapter sequences based on Yeo lab eCLIP protocol (from skipper workflow)
# These are inferred placeholders, not the submitter's values; the adapter list
# actually used for this dataset appears in the "Command:" line of step 3.
ADAPTER_3PRIME_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_3PRIME_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
ADAPTER_5PRIME_R1="GATCGTCGGACTGTAGAACTCTGAAC"
ADAPTER_5PRIME_R2="GATCGTCGGACTGTAGAACTCTGAAC"

# Minimum read length after trimming
MIN_READ_LENGTH=18
# Number of CPU cores to use
NUM_CORES=4

# Run cutadapt to trim 5' and 3' adapters from both reads
cutadapt \
  -a "${ADAPTER_3PRIME_R1}" \
  -A "${ADAPTER_3PRIME_R2}" \
  -g "${ADAPTER_5PRIME_R1}" \
  -G "${ADAPTER_5PRIME_R2}" \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  --minimum-length "${MIN_READ_LENGTH}" \
  --cores "${NUM_CORES}" \
  "${INPUT_R1}" "${INPUT_R2}"
3
Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
cutadapt (Inferred with models/gemini-2.5-flash) v2.10 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install cutadapt (e.g., via conda)
# conda install -c bioconda cutadapt

# Round 1: quality-trim, then remove 5' and 3' adapters from both reads.
# Each -A adapter must be a single unbroken token (no embedded spaces).
cutadapt -q 6 -m 18 \
  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
  -g CTTCCGATCTACAAGTT \
  -g CTTCCGATCTTGGTCCT \
  -A AACTTGTAGATCGGA \
  -A AGGACCAAGATCGGA \
  -A ACTTGTAGATCGGAA \
  -A GGACCAAGATCGGAA \
  -A CTTGTAGATCGGAAG \
  -A GACCAAGATCGGAAG \
  -A TTGTAGATCGGAAGA \
  -A ACCAAGATCGGAAGA \
  -A TGTAGATCGGAAGAG \
  -A CCAAGATCGGAAGAG \
  -A GTAGATCGGAAGAGC \
  -A CAAGATCGGAAGAGC \
  -A TAGATCGGAAGAGCG \
  -A AAGATCGGAAGAGCG \
  -A AGATCGGAAGAGCGT \
  -A GATCGGAAGAGCGTC \
  -A ATCGGAAGAGCGTCG \
  -A TCGGAAGAGCGTCGT \
  -A CGGAAGAGCGTCGTG \
  -A GGAAGAGCGTCGTGT \
  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
  /full/path/to/files/file_R1.C01.fastq.gz \
  /full/path/to/files/file_R2.C01.fastq.gz \
  > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
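A single stray space inside an adapter string silently changes what cutadapt receives, so a small pre-flight check can be useful before launching the trimming step. The sketch below is a hypothetical helper (`check_adapters` is not part of cutadapt or of the submitter's pipeline). It assumes, as holds for the read-2 adapters listed in this step, that every `-A` adapter is exactly 15 nt of A/C/G/T.

```bash
#!/usr/bin/env bash
# Hypothetical pre-flight check; not part of cutadapt or the submitter's pipeline.
# Returns 0 only when every argument is exactly 15 characters of A/C/G/T.
check_adapters() {
  local adapter status=0
  for adapter in "$@"; do
    if [[ ! "$adapter" =~ ^[ACGT]{15}$ ]]; then
      echo "suspicious adapter: '$adapter'" >&2
      status=1
    fi
  done
  return "$status"
}

# A subset of the read-2 adapters from this step.
GOOD=(AACTTGTAGATCGGA AGGACCAAGATCGGA CTTGTAGATCGGAAG)
# The same adapter split by a stray space becomes two short tokens.
BAD=(CTTGT AGATCGGAAG)

check_adapters "${GOOD[@]}" && echo "good list passes"
check_adapters "${BAD[@]}" 2>/dev/null || echo "split adapter is flagged"
```

The length check is specific to this adapter list; the `-a` and `-g` adapters in the same command have other lengths and would need their own rule.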
4
Takes output from cutadapt round 1.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.18

# Define input and output files
# INPUT_FASTQ represents the output from cutadapt round 1
INPUT_FASTQ="round1_trimmed.fastq.gz"
OUTPUT_FASTQ="round2_trimmed.fastq.gz"

# Define adapter sequence (common Illumina 3' adapter used in eCLIP workflows)
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Execute cutadapt for adapter and quality trimming
# Parameters are inferred from common eCLIP cutadapt usage (e.g., Yeo lab eCLIP CWL workflow)
cutadapt \
  -a "${ADAPTER_SEQUENCE}" \
  -q 20 \
  --minimum-length 18 \
  --error-rate 0.1 \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"
5
Run to trim off the 3' adapters on read 2, to control for double ligation events.
cutadapt (Inferred with models/gemini-2.5-flash) v4.0 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=4.0

# Define variables
# ADAPTER_R2_3PRIME is a common Illumina TruSeq adapter sequence (inferred placeholder)
ADAPTER_R2_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
MIN_READ_LENGTH=18   # Minimum read length to keep after trimming
INPUT_READ2="sample_R2.fastq.gz"
OUTPUT_READ2_TRIMMED="sample_R2_trimmed.fastq.gz"
NUM_THREADS=8        # Number of CPU cores to use

# Run cutadapt to trim 3' adapters on Read 2
# -a: Specifies a 3' adapter sequence for the forward read (or single-end read).
#     Here Read 2 is passed as a single-end input, so -a applies to it.
# -o: Output file for the trimmed reads.
# --minimum-length: Discard reads shorter than this length after trimming.
# --cores: Number of CPU cores to use for parallel processing.
cutadapt -a "${ADAPTER_R2_3PRIME}" \
  -o "${OUTPUT_READ2_TRIMMED}" \
  --minimum-length "${MIN_READ_LENGTH}" \
  --cores "${NUM_THREADS}" \
  "${INPUT_READ2}"
6
Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
$ Bash example
# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt

# Define input and output files
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics"

# Define adapter sequences
ADAPTERS=(
  "AACTTGTAGATCGGA" "AGGACCAAGATCGGA" "ACTTGTAGATCGGAA" "GGACCAAGATCGGAA"
  "CTTGTAGATCGGAAG" "GACCAAGATCGGAAG" "TTGTAGATCGGAAGA" "ACCAAGATCGGAAGA"
  "TGTAGATCGGAAGAG" "CCAAGATCGGAAGAG" "GTAGATCGGAAGAGC" "CAAGATCGGAAGAGC"
  "TAGATCGGAAGAGCG" "AAGATCGGAAGAGCG" "AGATCGGAAGAGCGT" "GATCGGAAGAGCGTC"
  "ATCGGAAGAGCGTCG" "TCGGAAGAGCGTCGT" "CGGAAGAGCGTCGTG" "GGAAGAGCGTCGTGT"
)

# Build the adapter arguments as an array so each "-A <adapter>" pair stays intact
ADAPTER_ARGS=()
for ADAPTER in "${ADAPTERS[@]}"; do
  ADAPTER_ARGS+=(-A "${ADAPTER}")
done

# Execute cutadapt command
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 \
  --quality-cutoff 6 -m 18 \
  "${ADAPTER_ARGS[@]}" \
  -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \
  "${INPUT_R1}" "${INPUT_R2}" \
  > "${METRICS_FILE}"
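The 20 read-2 adapters in this command are not arbitrary: they appear to be every 15-nt window tiled across two 27-nt sequences that differ only in their first seven bases. The sketch below regenerates the list under that assumption. The two full-length sequences were reconstructed by overlapping the listed 15-mers and are not stated in the source; the variable names are illustrative.

```bash
#!/usr/bin/env bash
# Assumption: these two 27-nt sequences were reconstructed by overlapping the
# 15-mers listed in the command; the source does not state them explicitly.
SEQS=(
  AACTTGTAGATCGGAAGAGCGTCGTGT
  AGGACCAAGATCGGAAGAGCGTCGTGT
)
WINDOW=15

TILES=()
seen=" "
for seq in "${SEQS[@]}"; do
  last=$(( ${#seq} - WINDOW ))
  for (( i = 0; i <= last; i++ )); do
    tile="${seq:i:WINDOW}"
    # Keep each distinct window once, in first-seen order.
    case "$seen" in
      *" $tile "*) ;;
      *) seen+="$tile "; TILES+=("$tile") ;;
    esac
  done
done

# Build the repeated "-A <adapter>" arguments as an array so quoting stays safe.
ADAPTER_ARGS=()
for tile in "${TILES[@]}"; do
  ADAPTER_ARGS+=(-A "$tile")
done

echo "distinct windows: ${#TILES[@]}"
printf '%s\n' "${ADAPTER_ARGS[*]}"
```

The script yields the same set of 20 adapters as the command, in a different order. Windows that start after the first seven bases are identical for both sequences, which is why the count is 20 rather than 26.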
7
Takes output from cutadapt round 2.
$ Bash example
# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.18

# This command performs poly-A trimming. It is an inferred example only: the
# submitter's "cutadapt round 2" (step 6) trims 3' adapters from read 2, and its
# output feeds the repeat-element mapping in step 9. The version and parameters
# (min_length, quality_cutoff) are inferred from the Yeo lab eCLIP CWL workflow.
# Input: FASTQ file that has already undergone initial adapter trimming (e.g., from "cutadapt round 1").
# Output: FASTQ file with poly-A tails removed.
cutadapt \
  -a "A{100}" \
  -m 18 \
  -q 20 \
  -o sample_round2_trimmed.fastq.gz \
  sample_round1_trimmed.fastq.gz
8
Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA (and other) repetitive reads.
$ Bash example
# Inferred alternative using BBDuk; the submitter performed this step with STAR (see the command in step 9).
# Install BBTools (if not already installed); the bioconda package name is assumed to be "bbmap"
# conda install -c bioconda bbmap

# Define input and output files
# Replace 'input_reads.fastq.gz' with your actual input FASTQ file(s).
# For paired-end reads, use 'in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz out1=filtered_R1.fastq.gz out2=filtered_R2.fastq.gz'
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_FASTQ="filtered_reads.fastq.gz"
STATS_FILE="repbase_filtering_stats.txt"

# Define the RepBase reference file.
# This file should contain human-specific repetitive elements (including rRNA sequences)
# derived from RepBase. You would typically download and prepare this file beforehand.
# Placeholder: Replace with the actual path to your RepBase FASTA file.
HUMAN_REPBASE_FASTA="path/to/human_repbase.fasta"

# Run bbduk to remove reads matching human RepBase sequences.
# k=31 is a common kmer size for contaminant filtering.
# hdist=1 allows for 1 mismatch in kmer matching.
# stats= outputs statistics about filtered reads.
bbduk.sh \
  in="$INPUT_FASTQ" \
  out="$OUTPUT_FASTQ" \
  ref="$HUMAN_REPBASE_FASTA" \
  k=31 \
  hdist=1 \
  stats="$STATS_FILE" \
  overwrite=true
9
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
STAR, version not specified (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install STAR (e.g., using Conda)
# conda install -c bioconda star

# Define variables for clarity (optional, but good practice)
GENOME_DIR="/path/to/RepBase_human_database_file"
READ_FILE_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
READ_FILE_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_R1}" "${READ_FILE_R2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${OUTPUT_BAM}"
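The genome-mapping command later in the pipeline reads files ending in `.rep.bamUnmapped.out.mate1` and `.mate2`. Those names follow from how STAR builds output paths: it appends fixed suffixes directly to `--outFileNamePrefix`, and with `--outReadsUnmapped Fastx` the reads that did not map to the repeat database are written as `Unmapped.out.mate1` and `Unmapped.out.mate2`. A minimal sketch of that naming, with variable names chosen for illustration:

```bash
#!/usr/bin/env bash
# STAR appends fixed suffixes directly to --outFileNamePrefix (no separator is added).
PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# With --outReadsUnmapped Fastx, unmapped mates are written alongside the other outputs.
UNMAPPED_MATE1="${PREFIX}Unmapped.out.mate1"
UNMAPPED_MATE2="${PREFIX}Unmapped.out.mate2"
STAR_LOG="${PREFIX}Log.final.out"

printf '%s\n' "$UNMAPPED_MATE1" "$UNMAPPED_MATE2" "$STAR_LOG"
```

Because the prefix here ends in `.bam` and the alignments themselves go to standard output, the redirected BAM and the prefix share the same string; ending the prefix with a `.` or `_` would give more readable auxiliary file names.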
10
Takes output from STAR rmRep.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for genome index and input read files
# Replace with actual paths to your genome index and FASTQ files
GENOME_DIR="/path/to/star_genome_index/"
READS_R1="/path/to/input_reads_R1.fastq.gz"
READS_R2="/path/to/input_reads_R2.fastq.gz"   # Remove if single-end reads
OUTPUT_PREFIX="./star_output/aligned_reads"

# Create output directory if it doesn't exist
mkdir -p "$(dirname "${OUTPUT_PREFIX}")"

# Run STAR alignment
# Note: "rmRep" here refers to the repeat-removal mapping against RepBase
# (the STAR command in step 9), not to duplicate removal. The reads left
# unmapped by that step are the input to genome mapping.
# This command is a generic STAR alignment producing a sorted BAM file; the
# submitter's actual genome-mapping command is given in step 12.
STAR \
  --runThreadN 8 \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READS_R1}" "${READS_R2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard \
  --outFilterType BySJout \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.1 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --alignSJoverhangMin 8 \
  --alignSJDBoverhangMin 1
11
Maps unique reads to the human genome.
$ Bash example
# Define variables
READ1="input_R1.fastq.gz"        # Path to input R1 FASTQ file (gzipped)
READ2="input_R2.fastq.gz"        # Path to input R2 FASTQ file (gzipped)
OUTPUT_PREFIX="aligned_sample"   # Prefix for output files
GENOME_DIR="/path/to/human_genome_star_index_GRCh38"   # Path to STAR genome index directory (e.g., for GRCh38)
THREADS=8                        # Number of threads to use

# --- Installation (commented out) ---
# Install STAR using conda
# conda create -n star_env star -c bioconda -y
# conda activate star_env

# --- Reference Genome Indexing (run once, commented out) ---
# Placeholder for human genome FASTA and GTF files (e.g., GRCh38)
# GENOME_FASTA="/path/to/human_genome/GRCh38.primary_assembly.genome.fa"
# GTF_FILE="/path/to/human_genome/gencode.v44.annotation.gtf"   # Or other relevant GTF
# Create STAR genome index (if not already created)
# mkdir -p ${GENOME_DIR}
# STAR --runMode genomeGenerate \
#   --genomeDir ${GENOME_DIR} \
#   --genomeFastaFiles ${GENOME_FASTA} \
#   --sjdbGTFfile ${GTF_FILE} \
#   --sjdbOverhang 100 \
#   --runThreadN ${THREADS}

# --- Alignment Command ---
# Maps unique reads to the human genome using STAR
STAR --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --readFilesCommand zcat \
  --outFileNamePrefix "${OUTPUT_PREFIX}_" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 10 \
  --runThreadN "${THREADS}"

# Output files will include:
# ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam
# ${OUTPUT_PREFIX}_Log.final.out
# ${OUTPUT_PREFIX}_Log.out
# ${OUTPUT_PREFIX}_Log.progress.out
# ${OUTPUT_PREFIX}_SJ.out.tab
12
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
GENOME_DIR="/path/to/STAR_database_file"   # Path to the STAR genome index directory
INPUT_READS_MATE1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
INPUT_READS_MATE2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
# Prefix for STAR's auxiliary output files (logs, junctions, etc.)
OUTPUT_FILE_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
# Path for the final aligned BAM file (redirected stdout)
FINAL_BAM_OUTPUT="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"

STAR --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${INPUT_READS_MATE1}" "${INPUT_READS_MATE2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_FILE_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${FINAL_BAM_OUTPUT}"
13
Takes output from STAR genome mapping.
$ Bash example
# Install STAR (e.g., using conda)
# conda install -c bioconda star

# Create a STAR genome index (if not already present)
# This step needs to be run once for a given genome and annotation.
# Replace /path/to/genome_fasta/hg38.fa and /path/to/gtf/gencode.v38.annotation.gtf with actual paths.
# mkdir -p /path/to/star_index/hg38_star_index
# STAR \
#   --runThreadN 8 \
#   --runMode genomeGenerate \
#   --genomeDir /path/to/star_index/hg38_star_index \
#   --genomeFastaFiles /path/to/genome_fasta/hg38.fa \
#   --sjdbGTFfile /path/to/gtf/gencode.v38.annotation.gtf \
#   --sjdbOverhang 100   # Typically (ReadLength - 1)

# Align reads using STAR
# Replace /path/to/star_index/hg38_star_index with the actual path to your genome index.
# Replace read1.fastq.gz and read2.fastq.gz with your input FASTQ files.
# Replace sample_name with your desired output prefix.
# Note: --alignIntronMax 1 disables spliced alignment. It is an inferred setting
# and is not part of the submitter's genome-mapping command in step 12.
STAR \
  --runThreadN 8 \
  --genomeDir /path/to/star_index/hg38_star_index \
  --readFilesIn read1.fastq.gz read2.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix sample_name_ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 3 \
  --alignIntronMax 1
14
Custom random-mer-aware script for PCR duplicate removal.
umi_tools (Inferred with models/gemini-2.5-flash) v1.1.1 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Inferred alternative; the submitter used a custom script (see the command in step 15).
# Install umi_tools (e.g., via conda)
# conda install -c bioconda umi-tools=1.1.1

# This command assumes random mers (NNNs) have been extracted from the adapter
# and appended to the read name, separated by an underscore.
# For example, a read name might look like: 'read_id_NNNNNN'
# The input BAM must be coordinate-sorted and indexed.
umi_tools dedup \
  --stdin input.bam \
  --stdout output.dedup.bam \
  --extract-umi-method read_id \
  --umi-separator '_' \
  --paired \
  --log dedup.log
15
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
barcode_collapse_pe.py (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash)
$ Bash example
# This script is part of the eCLIP pipeline developed by the Yeo Lab.
# It is typically run with Python 2.7.
#
# To install necessary dependencies (example using conda):
# conda create -n python2_env python=2.7 pysam
# conda activate python2_env
#
# To obtain the script 'barcode_collapse_pe.py', you would typically clone the eclip repository:
# git clone https://github.com/yeolab/eclip.git
# cd eclip/scripts
#
# For execution, ensure 'barcode_collapse_pe.py' is in your PATH or specify its full path.

# Define input and output file paths based on the description
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"

# Execute the barcode collapsing command
# The script 'barcode_collapse_pe.py' from the yeolab/eclip repository is designed to be run with python2.
python2 barcode_collapse_pe.py \
  --bam "${INPUT_BAM}" \
  --out_file "${OUTPUT_BAM}" \
  --metrics_file "${METRICS_FILE}"
16
Takes output from barcode collapse PE.
$ Bash example
# Install umi_tools (e.g., using conda):
# conda install -c bioconda umi-tools=1.1.2

# Define input and output files
# INPUT_BAM is assumed to be a coordinate-sorted, indexed BAM in which
# UMIs have been extracted and appended to read IDs.
INPUT_BAM="input_barcode_collapsed.bam"
OUTPUT_BAM="output_deduplicated.bam"
LOG_FILE="umi_tools_dedup.log"

# Run umi_tools dedup for paired-end data
# --extract-umi-method=read_id assumes UMIs are already in the read ID (e.g., from a previous umi_tools extract step)
# --umi-separator=':' specifies the separator used when UMIs were appended to read IDs
# --paired indicates that the input BAM contains paired-end reads
umi_tools dedup \
  -I "${INPUT_BAM}" \
  -S "${OUTPUT_BAM}" \
  --extract-umi-method=read_id \
  --umi-separator=':' \
  --paired \
  --log="${LOG_FILE}"
17
Sorts resulting bam file for use downstream.
samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Sort the BAM file
# Replace input.bam with your actual input BAM file
# Replace output.sorted.bam with your desired output sorted BAM file name
samtools sort -o output.sorted.bam -@ 4 input.bam
18
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true
$ Bash example
# Install Java (e.g., OpenJDK 11 or 8)
# conda install -c conda-forge openjdk=11

# Picard tools are often distributed as a single JAR file. The command uses 'Queue.jar' as the classpath,
# which might be a GATK wrapper or a specific setup where Picard classes are accessible via this JAR.
# Ensure the Queue.jar (or the appropriate Picard JAR) is accessible at the specified path.

# Define variables for paths
DATA_DIR="/full/path/to/files"                 # Replace with your actual data directory
QUEUE_JAR_PATH="/path/to/gatk/dist/Queue.jar"  # Replace with the actual path to Queue.jar
INPUT_BAM="${DATA_DIR}/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
OUTPUT_SORTED_BAM="${DATA_DIR}/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
TMP_DIR="${DATA_DIR}/.queue/tmp"

# Create the temporary directory if it doesn't exist
mkdir -p "${TMP_DIR}"

# Execute the Picard SortSam command
java -Xmx2048m \
  -XX:+UseParallelOldGC \
  -XX:ParallelGCThreads=4 \
  -XX:GCTimeLimit=50 \
  -XX:GCHeapFreeLimit=10 \
  -Djava.io.tmpdir="${TMP_DIR}" \
  -cp "${QUEUE_JAR_PATH}" net.sf.picard.sam.SortSam \
  INPUT="${INPUT_BAM}" \
  TMP_DIR="${TMP_DIR}" \
  OUTPUT="${OUTPUT_SORTED_BAM}" \
  VALIDATION_STRINGENCY=SILENT \
  SO=coordinate \
  CREATE_INDEX=true
19
Takes output from sortSam, makes bam index for use downstream.
samtools index (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'input.sorted.bam' is the output from sortSam
# Replace 'input.sorted.bam' with the actual sorted BAM file name
samtools index input.sorted.bam
20
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.19

# Define input and output paths
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"

# Create BAM index
samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"
21
Takes inputs from multiple final bam files.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Example: Merging multiple final BAM files into a single BAM file.
# This is a common step when combining technical replicates or preparing for downstream analysis.
# Replace input_1.bam, input_2.bam, ... with your actual BAM file paths.
# Replace merged_output.bam with your desired output file name.
samtools merge merged_output.bam input_1.bam input_2.bam input_3.bam

# Index the merged BAM file for efficient access (optional but recommended)
samtools index merged_output.bam
22
Merges the two technical replicates for further downstream analysis.
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge two technical replicate BAM files into a single BAM file.
# Replace 'replicate_1.bam' and 'replicate_2.bam' with your actual input files.
# The output will be 'merged_replicates.bam'.
samtools merge merged_replicates.bam replicate_1.bam replicate_2.bam
23
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
$ Bash example
# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

samtools merge /full/path/to/files/CombinedID.merged.bam \
  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam \
  /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
24
Takes output from sortSam, makes bam index for use downstream.
samtools index (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'input.sorted.bam' is the output from sortSam
# This command creates an index file 'input.sorted.bam.bai' in the same directory.
samtools index input.sorted.bam
25
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.19

# Define input and output paths
INPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
OUTPUT_BAI="/full/path/to/files/CombinedID.merged.bam.bai"

# Create index for the BAM file
samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"
26
Takes output from sortSam.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.19

# Inferred example; the submitter's pipeline removes PCR duplicates with
# barcode_collapse_pe.py (step 15), not with samtools markdup.
# Assuming input.bam is the sorted BAM file from sortSam.
# Note: samtools markdup expects mate tags added beforehand by 'samtools fixmate -m'.
# -r: Remove duplicate reads
# -s: Report basic statistics (optional, but good for QC)
# -f: Write those statistics to the named file
# -@: Number of threads to use (adjust as needed for parallel processing)
samtools markdup -r -s -f markdup_stats.txt -@ 4 input.bam output.bam
27
Only outputs the second read in each pair for use with single stranded peak caller.
samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Install samtools if not already installed
# conda install -c bioconda samtools

# Inferred example; the submitter keeps read 2 in BAM format with 'samtools view' (step 29).
# This command extracts only the second read in each pair from a BAM file
# and outputs them to a FASTQ file. The -f 0x80 flag selects reads that are the second in a pair.
# The -N flag always appends /1 or /2 to the read names.
samtools fastq -f 0x80 -N -o output_R2.fastq input.bam
28
This is the final bam file to perform analysis on.
STAR 2.7.10a, Samtools 1.17 (Inferred with models/gemini-2.5-flash)
$ Bash example
# Inferred example. In the submitter's pipeline the "final bam file" is the
# read-2-only BAM produced by the 'samtools view' command in step 29.
# Install STAR (example, uncomment if needed)
# conda install -c bioconda star
# Install Samtools (example, uncomment if needed)
# conda install -c bioconda samtools

# Define reference genome and STAR index path
GENOME_DIR="/path/to/STAR_index/hg38"   # Placeholder for human genome hg38 STAR index
READ1="input_read1.fastq.gz"
READ2="input_read2.fastq.gz"
OUTPUT_PREFIX="aligned_reads"
UNSORTED_BAM="${OUTPUT_PREFIX}_Aligned.out.bam"
FINAL_BAM="final.bam"

# 1. Perform alignment with STAR
# This command aligns paired-end reads to the genome and outputs an unsorted BAM file.
# --readFilesCommand zcat is needed because the inputs are gzipped.
STAR --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${READ1}" "${READ2}" \
  --readFilesCommand zcat \
  --runThreadN 8 \
  --outFileNamePrefix "${OUTPUT_PREFIX}_" \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped Within \
  --outSAMattributes Standard \
  --quantMode GeneCounts   # Optional: for gene quantification, if desired

# 2. Sort the BAM file by coordinate
# This is a crucial step for most downstream analyses and for indexing.
samtools sort -@ 8 -o "${FINAL_BAM}" "${UNSORTED_BAM}"

# 3. Index the sorted BAM file
# This creates the .bai index file, which is crucial for random access to the BAM file.
samtools index "${FINAL_BAM}"

# Clean up intermediate unsorted BAM if desired
# rm "${UNSORTED_BAM}"
29
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam
$ Bash example
# Install samtools (e.g., using conda)
# conda install -c bioconda samtools

# Extract reads that are the second in a pair (R2 reads) from a merged BAM file
# -h: Include header in the output
# -b: Output in BAM format
# -f 128: Select reads with flag 128 (second in pair)
samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam \
  > /full/path/to/files/CombinedID.merged.r2.bam
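The `-f 128` filter works on the bitwise FLAG field of each alignment record: bit 0x80 (decimal 128) is set for the second read of a pair, and `-f` keeps only records with all requested bits set. The sketch below decodes a few common paired-end FLAG values with plain shell arithmetic; the function name `is_read2` is illustrative and not part of samtools.

```bash
#!/usr/bin/env bash
# Bit 0x80 (decimal 128) of the SAM FLAG field marks the second read of a pair.
is_read2() {
  local flag=$1
  (( (flag & 128) != 0 ))
}

# 99 and 83 are typical read-1 flags; 147 and 163 are typical read-2 flags.
for flag in 99 147 163 83; do
  if is_read2 "$flag"; then
    echo "FLAG $flag: read 2, kept by -f 128"
  else
    echo "FLAG $flag: read 1, dropped by -f 128"
  fi
done
```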
30
Takes results from samtools view.
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools=1.9

# Inferred example: filter a BAM file to keep only primary alignments and mapped reads, outputting to BAM.
# -F 260 excludes records with either bit set: 256 (secondary alignment) or 4 (unmapped).
# Replace 'input.bam' with your actual input file and 'filtered.bam' with your desired output file.
samtools view -F 260 -b input.bam > filtered.bam
31
Calls peaks on those files.
$ Bash example
# Install clipper using conda
# The package name, channel, Python version, and CLIPper version below are inferred
# and may not match an available release; check the CLIPper repository for current instructions.
# conda create -n clipper_env python=3.8
# conda activate clipper_env
# conda install -c bioconda clipper=0.0.3

# Define input and output files
INPUT_BAM="aligned_reads.bam"   # Placeholder for your aligned BAM file
OUTPUT_BED="peaks.bed"
GENOME_ASSEMBLY="hg38"          # Placeholder for the reference genome assembly (e.g., hg19, mm10); the submitter used hg19

# Execute CLIPper peak calling
clipper -b "${INPUT_BAM}" -s "${GENOME_ASSEMBLY}" -o "${OUTPUT_BED}"
32
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
$ Bash example
# Install CLIPper (example using pip or conda)
# The package names below are inferred and may not resolve to the Yeo lab CLIPper tool.
# pip install clipper
# Or:
# conda install -c bioconda clipper

# Run CLIPper for peak calling
clipper -b /full/path/to/files/CombinedID.merged.r2.bam \
  -s hg19 \
  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed \
  --bonferroni --superlocal --threshold-method binomial --save-pickle
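Once peaks are called, a quick sanity check on the resulting BED file is to count clusters per strand. The sketch below assumes only that the output follows the standard BED column order, with the strand in column 6; the meaning of the other columns in CLIPper output is not stated in the source. The function name and the three demo rows are hypothetical and do not represent real results.

```bash
#!/usr/bin/env bash
# Count peaks per strand; column 6 is the strand in the standard BED layout.
count_peaks_by_strand() {
  awk -F'\t' '{ n[$6]++ } END { printf "plus=%d minus=%d\n", n["+"], n["-"] }' "$1"
}

# Hypothetical three-peak file for demonstration (not real results).
DEMO_BED="$(mktemp)"
printf 'chr1\t100\t150\tpeak_1\t0\t+\n'  > "$DEMO_BED"
printf 'chr1\t900\t960\tpeak_2\t0\t-\n' >> "$DEMO_BED"
printf 'chr2\t40\t95\tpeak_3\t0\t+\n'   >> "$DEMO_BED"

count_peaks_by_strand "$DEMO_BED"
rm -f "$DEMO_BED"
```

To apply it to the pipeline output, pass the path given to `clipper -o`, for example `/full/path/to/files/CombinedID.merged.r2.peaks.bed`.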
Tools Used
Raw Source Text
Takes output from raw files. Run to trim off both 5â and 3â adapters on both reads. Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGT AGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics Takes output from cutadapt round 1. Run to trim off the 3â adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics Takes output from cutadapt round 2. Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads. 
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

Takes output from STAR rmRep. Maps unique reads to the human genome.
Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

Takes output from STAR genome mapping. Custom random-mer-aware script for PCR duplicate removal.
Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

Takes output from barcode collapse PE.
Sorts resulting bam file for use downstream.
Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

Takes inputs from multiple final bam files. Merges the two technical replicates for further downstream analysis.
Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

Takes output from sortSam, makes bam index for use downstream.
Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

Takes output from sortSam. Only outputs the second read in each pair for use with single-stranded peak caller. This is the final bam file to perform analysis on.
Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

Takes results from samtools view. Calls peaks on those files.
Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding
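
Each per-replicate step in the commands above names its output by appending a suffix to the read 1 file name, which makes the chain of intermediate files easy to lose track of. The following is a minimal sketch that only prints those names for a given read 1 path; it runs none of the tools. The function name eclip_outputs and the step labels are hypothetical and were introduced for illustration; the suffixes themselves are copied from the commands above.

```bash
#!/usr/bin/env bash
# Sketch only: print the intermediate file produced by each per-replicate step.
# eclip_outputs and the step labels are illustrative names, not part of the pipeline.

eclip_outputs() {
    local r1="$1"                                      # read 1 FASTQ, e.g. file_R1.C01.fastq.gz
    local trim1="${r1}.adapterTrim.fastq.gz"           # cutadapt round 1 (-o)
    local trim2="${r1}.adapterTrim.round2.fastq.gz"    # cutadapt round 2 (-o)
    local rep="${r1}.adapterTrim.round2.rep.bam"       # STAR vs RepBase; also used as --outFileNamePrefix
    local mate1="${rep}Unmapped.out.mate1"             # read 1 of pairs not mapped to repeats
    local rmrep="${r1}.adapterTrim.round2.rmRep.bam"   # STAR vs human genome
    local rmdup="${rmrep%.bam}.rmDup.bam"              # barcode_collapse_pe.py --out_file
    local sorted="${rmdup%.bam}.sorted.bam"            # SortSam OUTPUT

    # printf reuses the format for every label/path pair.
    printf '%s\t%s\n' \
        cutadapt_round1  "$trim1" \
        cutadapt_round2  "$trim2" \
        star_repbase     "$rep" \
        unmapped_mate1   "$mate1" \
        star_genome      "$rmrep" \
        barcode_collapse "$rmdup" \
        sort_sam         "$sorted"
}

# Usage with the placeholder path from the commands above.
eclip_outputs /full/path/to/files/file_R1.C01.fastq.gz
```

Running the function once per technical replicate (C01 and D08 in the commands above) gives the two sorted BAM paths that samtools merge takes as input.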