GSE103225 Processing Pipeline

RIP-Seq code_examples 32 steps

Publication

Transcriptome-pathology correlation identifies interplay between TDP-43 and the expression of its kinase CK1E in sporadic ALS.

Acta neuropathologica (2018) — PMID 29881994

Dataset

GSE103225

Transcriptome-pathology correlations predict CSNK1E-mediated TDP-43 phosphorylation in sporadic amyotrophic lateral sclerosis

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps


Takes output from raw files.

Generic Raw File Processing (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# The step description "Takes output from raw files." is too generic to infer a specific tool, command, or parameters.
# This typically refers to the initial input stage of a pipeline where raw sequencing data (e.g., FASTQ files) are provided.
# Subsequent steps would involve quality control, alignment, or other specific processing based on the assay type.
# No specific command can be inferred from this generic description.
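
Although no command is recorded for this step, the recorded commands in later steps use paired inputs named `file_R1.C01.fastq.gz` and `file_R2.C01.fastq.gz`. The following is a minimal sketch, assuming that `_R1.`/`_R2.` naming convention holds for all samples, of deriving the mate path from the read 1 path and checking that both files are present. The variable names are illustrative only.

```bash
# Hypothetical example path, following the naming used by the recorded commands
R1="/full/path/to/files/file_R1.C01.fastq.gz"

# Derive the mate path by swapping the first "_R1." for "_R2."
R2=$(printf '%s\n' "$R1" | sed 's/_R1\./_R2./')

echo "read 1: $R1"
echo "read 2: $R2"

# Warn early if either file is missing, before any trimming starts
for f in "$R1" "$R2"; do
    [ -f "$f" ] || echo "warning: $f not found" >&2
done
```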

Run to trim off both 5' and 3' adapters on both reads.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda -c conda-forge cutadapt=4.0

# Define input and output file paths
INPUT_R1="input_R1.fastq.gz"
INPUT_R2="input_R2.fastq.gz"
OUTPUT_R1="trimmed_R1.fastq.gz"
OUTPUT_R2="trimmed_R2.fastq.gz"

# Define adapter sequences based on Yeo lab eCLIP protocol (from skipper workflow)
# These are common Illumina universal and small RNA 3' adapters, and specific 5' adapters.
ADAPTER_3PRIME_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_3PRIME_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"
ADAPTER_5PRIME_R1="GATCGTCGGACTGTAGAACTCTGAAC"
ADAPTER_5PRIME_R2="GATCGTCGGACTGTAGAACTCTGAAC"

# Minimum read length after trimming
MIN_READ_LENGTH=18

# Number of CPU cores to use
NUM_CORES=4

# Run cutadapt to trim 5' and 3' adapters from both reads
cutadapt \
  -a "${ADAPTER_3PRIME_R1}" \
  -A "${ADAPTER_3PRIME_R2}" \
  -g "${ADAPTER_5PRIME_R1}" \
  -G "${ADAPTER_5PRIME_R2}" \
  -o "${OUTPUT_R1}" \
  -p "${OUTPUT_R2}" \
  --minimum-length "${MIN_READ_LENGTH}" \
  --cores "${NUM_CORES}" \
  "${INPUT_R1}" "${INPUT_R2}"

Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics

cutadapt (Inferred with models/gemini-2.5-flash) v2.10 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
# Install cutadapt (e.g., via conda)
# conda install -c bioconda cutadapt

cutadapt -q 6 -m 18 \
-a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
-g CTTCCGATCTACAAGTT \
-g CTTCCGATCTTGGTCCT \
-A AACTTGTAGATCGGA \
-A AGGACCAAGATCGGA \
-A ACTTGTAGATCGGAA \
-A GGACCAAGATCGGAA \
-A CTTGTAGATCGGAAG \
-A GACCAAGATCGGAAG \
-A TTGTAGATCGGAAGA \
-A ACCAAGATCGGAAGA \
-A TGTAGATCGGAAGAG \
-A CCAAGATCGGAAGAG \
-A GTAGATCGGAAGAGC \
-A CAAGATCGGAAGAGC \
-A TAGATCGGAAGAGCG \
-A AAGATCGGAAGAGCG \
-A AGATCGGAAGAGCGT \
-A GATCGGAAGAGCGTC \
-A ATCGGAAGAGCGTCG \
-A TCGGAAGAGCGTCGT \
-A CGGAAGAGCGTCGTG \
-A GGAAGAGCGTCGTGT \
-o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz \
-p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz \
/full/path/to/files/file_R1.C01.fastq.gz \
/full/path/to/files/file_R2.C01.fastq.gz \
> /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
```
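
The `-A` sequences in the command above look like overlapping 15-nt windows taken from two longer read 2 adapters that differ only in their first seven bases. The sketch below regenerates such a list by sliding a window along both sequences. The two full-length sequences (`FULL_A`, `FULL_B`) are an assumption reconstructed from the overlaps between the listed windows, not something stated in the source, and `tile_adapters` is an illustrative helper name.

```bash
# Assumption: full-length adapters reconstructed from the overlapping -A windows
FULL_A="AACTTGTAGATCGGAAGAGCGTCGTGT"
FULL_B="AGGACCAAGATCGGAAGAGCGTCGTGT"
WINDOW=15

# Print every WINDOW-nt window of both adapters, interleaved, skipping repeats
tile_adapters() {
    a=$1; b=$2; w=$3
    seen=" "
    start=1
    while [ $((start + w - 1)) -le "${#a}" ]; do
        end=$((start + w - 1))
        for s in "$a" "$b"; do
            kmer=$(printf '%s' "$s" | cut -c"${start}-${end}")
            case "$seen" in
                *" $kmer "*) ;;   # already emitted (shared adapter tail)
                *) seen="$seen$kmer "; printf '%s\n' "$kmer" ;;
            esac
        done
        start=$((start + 1))
    done
}

TILED=$(tile_adapters "$FULL_A" "$FULL_B" "$WINDOW")
N_TILED=$(printf '%s\n' "$TILED" | wc -l | tr -d ' ')
echo "$N_TILED tiled adapters"

# Turn the list into repeated -A options for a cutadapt call
ADAPTER_OPTS=$(printf '%s\n' "$TILED" | sed 's/^/-A /' | tr '\n' ' ')
echo "$ADAPTER_OPTS"
```

Under these assumptions the sketch produces 20 distinct windows. Keeping the adapters as two source sequences plus a window size is easier to audit than a hand-typed list, where a stray space or dropped base is hard to spot.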

Takes output from cutadapt round 1.

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.18

# Define input and output files
# INPUT_FASTQ represents the output from cutadapt round 1
INPUT_FASTQ="round1_trimmed.fastq.gz"
OUTPUT_FASTQ="round2_trimmed.fastq.gz"

# Define adapter sequence (common Illumina 3' adapter used in eCLIP workflows)
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Execute cutadapt for adapter and quality trimming
# Parameters are inferred from common eCLIP cutadapt usage (e.g., Yeo lab eCLIP CWL workflow)
cutadapt \
  -a "${ADAPTER_SEQUENCE}" \
  -q 20 \
  --minimum-length 18 \
  --error-rate 0.1 \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"


Run to trim off the 3' adapters on read 2, to control for double ligation events.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (e.g., using conda)
# conda install -c bioconda cutadapt=4.0

# Define variables
# ADAPTER_R2_3PRIME is a common Illumina TruSeq Small RNA 3' Adapter sequence used in eCLIP workflows
ADAPTER_R2_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
MIN_READ_LENGTH=18 # Minimum read length to keep after trimming
INPUT_READ2="sample_R2.fastq.gz"
OUTPUT_READ2_TRIMMED="sample_R2_trimmed.fastq.gz"
NUM_THREADS=8 # Number of CPU cores to use

# Run cutadapt to trim 3' adapters on Read 2
# -a: Specifies a 3' adapter sequence for the forward read (or single-end read)
#     In this context, it's applied to Read 2 to remove the 3' adapter.
# -o: Output file for the trimmed reads.
# --minimum-length: Discard reads shorter than this length after trimming.
# --cores: Number of CPU cores to use for parallel processing.
cutadapt -a "${ADAPTER_R2_3PRIME}" \
         -o "${OUTPUT_READ2_TRIMMED}" \
         --minimum-length "${MIN_READ_LENGTH}" \
         --cores "${NUM_THREADS}" \
         "${INPUT_READ2}"


Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics

cutadapt vN/A (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (example using conda)
# conda install -c bioconda cutadapt

# Define input and output files
INPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics"

# Define adapter sequences
ADAPTERS=(
    "AACTTGTAGATCGGA"
    "AGGACCAAGATCGGA"
    "ACTTGTAGATCGGAA"
    "GGACCAAGATCGGAA"
    "CTTGTAGATCGGAAG"
    "GACCAAGATCGGAAG"
    "TTGTAGATCGGAAGA"
    "ACCAAGATCGGAAGA"
    "TGTAGATCGGAAGAG"
    "CCAAGATCGGAAGAG"
    "GTAGATCGGAAGAGC"
    "CAAGATCGGAAGAGC"
    "TAGATCGGAAGAGCG"
    "AAGATCGGAAGAGCG"
    "AGATCGGAAGAGCGT"
    "GATCGGAAGAGCGTC"
    "ATCGGAAGAGCGTCG"
    "TCGGAAGAGCGTCGT"
    "CGGAAGAGCGTCGTG"
    "GGAAGAGCGTCGTGT"
)

# Build the repeated -A options as an array so each adapter stays a separate argument
ADAPTER_ARGS=()
for ADAPTER in "${ADAPTERS[@]}"; do
    ADAPTER_ARGS+=(-A "${ADAPTER}")
done

# Execute cutadapt command
cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 \
    --quality-cutoff 6 -m 18 \
    "${ADAPTER_ARGS[@]}" \
    -o "${OUTPUT_R1}" -p "${OUTPUT_R2}" \
    "${INPUT_R1}" "${INPUT_R2}" > "${METRICS_FILE}"

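
The `-e 0.1` and `-O 5` options in the recorded command interact. The cutadapt documentation describes the number of tolerated errors as the length of the adapter match multiplied by the maximum error rate, rounded down. The following is a minimal sketch of that arithmetic, assuming the rule applies unchanged to the cutadapt version used for this dataset. `ERROR_RATE_PCT` and `allowed_errors` are illustrative names, not cutadapt options.

```bash
# Allowed errors = floor(match length x maximum error rate), using integer math.
# ERROR_RATE_PCT=10 corresponds to -e 0.1 in the recorded command.
ERROR_RATE_PCT=10

allowed_errors() {
    # $1: length of the adapter match in nucleotides
    echo $(( $1 * ERROR_RATE_PCT / 100 ))
}

MIN_OVERLAP_ERRORS=$(allowed_errors 5)    # shortest match accepted with -O 5
FULL_ADAPTER_ERRORS=$(allowed_errors 15)  # full-length 15-nt adapter

echo "errors tolerated at 5 nt overlap:  $MIN_OVERLAP_ERRORS"
echo "errors tolerated at 15 nt overlap: $FULL_ADAPTER_ERRORS"
```

Under this rule a 5-nt partial match has to be exact, while a full 15-nt adapter match may contain one error.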

Takes output from cutadapt round 2.

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=1.18

# This command performs poly-A trimming, often considered a "round 2" trimming step
# after initial adapter trimming in eCLIP pipelines. The version and parameters
# (min_length, quality_cutoff) are inferred from the Yeo lab eCLIP CWL workflow.
# Input: FASTQ file that has already undergone initial adapter trimming (e.g., from "cutadapt round 1").
# Output: FASTQ file with poly-A tails removed.
cutadapt \
  -a "A{100}" \
  -m 18 \
  -q 20 \
  -o sample_round2_trimmed.fastq.gz \
  sample_round1_trimmed.fastq.gz


Maps reads to a human-specific version of RepBase to remove repetitive elements, which helps control for spurious artifacts from rRNA and other repetitive reads.

bbduk (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install BBTools (if not already installed)
# conda install -c bioconda bbtools

# Define input and output files
# Replace 'input_reads.fastq.gz' with your actual input FASTQ file(s).
# For paired-end reads, use 'in1=reads_R1.fastq.gz in2=reads_R2.fastq.gz out1=filtered_R1.fastq.gz out2=filtered_R2.fastq.gz'
INPUT_FASTQ="input_reads.fastq.gz"
OUTPUT_FASTQ="filtered_reads.fastq.gz"
STATS_FILE="repbase_filtering_stats.txt"

# Define the RepBase reference file.
# This file should contain human-specific repetitive elements (including rRNA sequences)
# derived from RepBase. You would typically download and prepare this file beforehand.
# Example: A FASTA file containing human repetitive sequences.
# Placeholder: Replace with the actual path to your RepBase FASTA file.
HUMAN_REPBASE_FASTA="path/to/human_repbase.fasta"

# Run bbduk to remove reads matching human RepBase sequences.
# k=31 is a common kmer size for contaminant filtering.
# hdist=1 allows for 1 mismatch in kmer matching.
# stats= outputs statistics about filtered reads.
bbduk.sh \
  in="$INPUT_FASTQ" \
  out="$OUTPUT_FASTQ" \
  ref="$HUMAN_REPBASE_FASTA" \
  k=31 \
  hdist=1 \
  stats="$STATS_FILE" \
  overwrite=true


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 30 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam

STAR vNot specified, inferring a recent stable version (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (e.g., using Conda)
# conda install -c bioconda star

# Define variables for clarity (optional, but good practice)
GENOME_DIR="/path/to/RepBase_human_database_file"
READ_FILE_R1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz"
READ_FILE_R2="/full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_R1}" "${READ_FILE_R2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 30 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --readFilesCommand zcat \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd > "${OUTPUT_BAM}"


Takes output from STAR rmRep.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for genome index and input read files
# Replace with actual paths to your genome index and FASTQ files
GENOME_DIR="/path/to/star_genome_index/"
READS_R1="/path/to/input_reads_R1.fastq.gz"
READS_R2="/path/to/input_reads_R2.fastq.gz" # Remove if single-end reads
OUTPUT_PREFIX="./star_output/aligned_reads"

# Create output directory if it doesn't exist
mkdir -p "$(dirname "${OUTPUT_PREFIX}")"

# Run STAR alignment
# Note: "rmRep" refers to the repeat-removal step, in which reads are first mapped to RepBase.
# The reads left unmapped by that step (written by --outReadsUnmapped Fastx) are the input here;
# PCR duplicates are handled later, after genome alignment.
# This command performs a standard STAR alignment, producing a sorted BAM file.
STAR \
  --runThreadN 8 \
  --genomeDir ${GENOME_DIR} \
  --readFilesIn ${READS_R1} ${READS_R2} \
  --readFilesCommand zcat \
  --outFileNamePrefix ${OUTPUT_PREFIX} \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes Standard \
  --outFilterType BySJout \
  --outFilterMultimapNmax 20 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.1 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --alignSJoverhangMin 8 \
  --alignSJDBoverhangMin 1

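
The link between the repeat-mapping step and genome mapping is the pair of unmapped-read files that STAR writes when `--outReadsUnmapped Fastx` is set. The sketch below derives those file names from the `--outFileNamePrefix` value used in the recorded repeat-mapping command. The variable names are illustrative; the paths are the placeholder paths from the recorded commands.

```bash
# Prefix passed to --outFileNamePrefix in the repeat-mapping step
REP_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam"

# With --outReadsUnmapped Fastx, STAR writes unmapped mates next to the prefix
UNMAPPED_MATE1="${REP_PREFIX}Unmapped.out.mate1"
UNMAPPED_MATE2="${REP_PREFIX}Unmapped.out.mate2"

# Output name for the genome-mapping step: swap the ".rep.bam" suffix for ".rmRep.bam"
GENOME_BAM="${REP_PREFIX%.rep.bam}.rmRep.bam"

echo "genome mapping input 1: $UNMAPPED_MATE1"
echo "genome mapping input 2: $UNMAPPED_MATE2"
echo "genome mapping output:  $GENOME_BAM"
```

Because the prefix ends in `.bam` rather than a separator, the mate files carry the slightly odd `.rep.bamUnmapped.out.mate1` suffix. They are uncompressed FASTQ, so the genome-mapping call does not need `--readFilesCommand zcat`.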

Maps unique reads to the human genome.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Define variables
READ1="input_R1.fastq.gz" # Path to input R1 FASTQ file (gzipped)
READ2="input_R2.fastq.gz" # Path to input R2 FASTQ file (gzipped)
OUTPUT_PREFIX="aligned_sample" # Prefix for output files
GENOME_DIR="/path/to/human_genome_star_index_GRCh38" # Path to STAR genome index directory (e.g., for GRCh38)
THREADS=8 # Number of threads to use

# --- Installation (commented out) ---
# # Install STAR using conda
# # conda create -n star_env star -c bioconda -y
# # conda activate star_env

# # --- Reference Genome Indexing (run once, commented out) ---
# # Placeholder for human genome FASTA and GTF files (e.g., GRCh38)
# # GENOME_FASTA="/path/to/human_genome/GRCh38.primary_assembly.genome.fa"
# # GTF_FILE="/path/to/human_genome/gencode.v44.annotation.gtf" # Or other relevant GTF

# # Create STAR genome index (if not already created)
# # mkdir -p ${GENOME_DIR}
# # STAR --runMode genomeGenerate \
# #      --genomeDir ${GENOME_DIR} \
# #      --genomeFastaFiles ${GENOME_FASTA} \
# #      --sjdbGTFfile ${GTF_FILE} \
# #      --sjdbOverhang 100 \
# #      --runThreadN ${THREADS}

# --- Alignment Command ---
# Maps unique reads to the human genome using STAR
STAR --genomeDir ${GENOME_DIR} \
     --readFilesIn ${READ1} ${READ2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 1 \
     --outFilterMismatchNmax 10 \
     --runThreadN ${THREADS}

# Output files will include:
# ${OUTPUT_PREFIX}_Aligned.sortedByCoord.out.bam
# ${OUTPUT_PREFIX}_Log.final.out
# ${OUTPUT_PREFIX}_Log.out
# ${OUTPUT_PREFIX}_Log.progress.out
# ${OUTPUT_PREFIX}_SJ.out.tab


Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam

STAR v2.7.x GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
GENOME_DIR="/path/to/STAR_database_file" # Path to the STAR genome index directory
INPUT_READS_MATE1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
INPUT_READS_MATE2="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
OUTPUT_FILE_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" # Prefix for STAR's auxiliary output files (logs, junctions, etc.)
FINAL_BAM_OUTPUT="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" # Path for the final aligned BAM file (redirected stdout)

STAR --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${INPUT_READS_MATE1}" "${INPUT_READS_MATE2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_FILE_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${FINAL_BAM_OUTPUT}"


Takes output from STAR genome mapping.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (e.g., using conda)
# conda install -c bioconda star

# Create a STAR genome index (if not already present)
# This step needs to be run once for a given genome and annotation.
# Replace /path/to/genome_fasta/hg38.fa and /path/to/gtf/gencode.v38.annotation.gtf with actual paths.
# mkdir -p /path/to/star_index/hg38_star_index
# STAR \
#   --runThreadN 8 \
#   --runMode genomeGenerate \
#   --genomeDir /path/to/star_index/hg38_star_index \
#   --genomeFastaFiles /path/to/genome_fasta/hg38.fa \
#   --sjdbGTFfile /path/to/gtf/gencode.v38.annotation.gtf \
#   --sjdbOverhang 100 # Typically (ReadLength - 1)

# Align reads using STAR
# Replace /path/to/star_index/hg38_star_index with the actual path to your genome index.
# Replace read1.fastq.gz and read2.fastq.gz with your input FASTQ files.
# Replace sample_name with your desired output prefix.

STAR \
  --runThreadN 8 \
  --genomeDir /path/to/star_index/hg38_star_index \
  --readFilesIn read1.fastq.gz read2.fastq.gz \
  --readFilesCommand zcat \
  --outFileNamePrefix sample_name_ \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outFilterMultimapNmax 1 \
  --outFilterMismatchNmax 3 \
  --alignIntronMax 1 # Inferred setting intended to suppress spliced alignments; the recorded STAR genome-mapping command does not use it


Custom random-mer-aware script for PCR duplicate removal.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.1 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install umi_tools (e.g., via conda)
# conda install -c bioconda umi-tools=1.1.1

# This command assumes random mers (NNNs) have been extracted from the adapter
# and appended to the read name, typically separated by an underscore.
# For example, a read name might look like: 'read_id_NNNNNN'

umi_tools dedup \
    --stdin input.bam \
    --stdout output.dedup.bam \
    --extract-umi-method read_id \
    --umi-separator '_' \
    --log dedup.log


Command: barcode_collapse_pe.py --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics

barcode_collapse_pe.py (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# This script is part of the eCLIP pipeline developed by the Yeo Lab.
# It is typically run with Python 2.7.
#
# To install necessary dependencies (example using conda):
# # conda create -n python2_env python=2.7 pysam
# # conda activate python2_env
#
# To obtain the script 'barcode_collapse_pe.py', you would typically clone the eclip repository:
# # git clone https://github.com/yeolab/eclip.git
# # cd eclip/scripts
#
# For execution, ensure 'barcode_collapse_pe.py' is in your PATH or specify its full path.
# The command below assumes 'barcode_collapse_pe.py' is directly executable or called with 'python2'.

# Define input and output file paths based on the description
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam"
OUTPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
METRICS_FILE="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics"

# Execute the barcode collapsing command
# The script 'barcode_collapse_pe.py' from the yeolab/eclip repository is designed to be run with python2.
python2 barcode_collapse_pe.py \
    --bam "${INPUT_BAM}" \
    --out_file "${OUTPUT_BAM}" \
    --metrics_file "${METRICS_FILE}"

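
To illustrate what "random-mer-aware" means here, the toy sketch below compares collapsing reads by mapping position alone with collapsing by position plus random-mer. It is an illustration of the general idea only, on a made-up four-read table, and is not the logic of `barcode_collapse_pe.py` itself. The column layout and variable names are assumptions.

```bash
# Toy alignment table (hypothetical): chromosome, start, strand, random-mer
READS="chr1 100 + ACGTA
chr1 100 + ACGTA
chr1 100 + TTGCA
chr2 500 - ACGTA"

# Position-only collapsing: the random-mer column is ignored
N_POSITION=$(printf '%s\n' "$READS" | cut -d' ' -f1-3 | sort -u | wc -l | tr -d ' ')

# Random-mer-aware collapsing: position and random-mer must both match
N_RANDOMER=$(printf '%s\n' "$READS" | sort -u | wc -l | tr -d ' ')

echo "reads kept, position only:    $N_POSITION"
echo "reads kept, random-mer aware: $N_RANDOMER"
```

In this toy case the two reads at `chr1 100 +` with different random-mers are kept as separate molecules, whereas position-only collapsing would merge them.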

Takes output from barcode collapse PE.

umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub

$ Bash example

# Install umi_tools (e.g., using conda):
# conda install -c bioconda umi-tools=1.1.2

# Define input and output files
# INPUT_BAM should be the output from the barcode collapse PE step, 
# where UMIs have been extracted and appended to read IDs.
INPUT_BAM="input_barcode_collapsed.bam"
OUTPUT_BAM="output_deduplicated.bam"
LOG_FILE="umi_tools_dedup.log"

# Run umi_tools dedup for paired-end data
# --extract-umi-method=read_id assumes UMIs are already in the read ID (e.g., from a previous umi_tools extract step)
# --umi-separator=':' specifies the separator used when UMIs were appended to read IDs
# --paired indicates that the input BAM contains paired-end reads
umi_tools dedup \
    -I "${INPUT_BAM}" \
    -S "${OUTPUT_BAM}" \
    --extract-umi-method=read_id \
    --umi-separator=':' \
    --paired \
    --log="${LOG_FILE}"


Sorts resulting bam file for use downstream.

samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Sort the BAM file
# Replace input.bam with your actual input BAM file
# Replace output.sorted.bam with your desired output sorted BAM file name
samtools sort -o output.sorted.bam -@ 4 input.bam


Command: java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -Djava.io.tmpdir=/full/path/to/files/.queue/tmp -cp /path/to/gatk/dist/Queue.jar net.sf.picard.sam.SortSam INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam TMP_DIR=/full/path/to/files/.queue/tmp OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam VALIDATION_STRINGENCY=SILENT SO=coordinate CREATE_INDEX=true

Picard vN/A (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install Java (e.g., OpenJDK 11 or 8)
# conda install -c conda-forge openjdk=11

# Picard tools are often distributed as a single JAR file. The command uses 'Queue.jar' as the classpath,
# which might be a GATK wrapper or a specific setup where Picard classes are accessible via this JAR.
# Ensure the Queue.jar (or the appropriate Picard JAR) is accessible at the specified path.

# Define variables for paths
DATA_DIR="/full/path/to/files" # Replace with your actual data directory
QUEUE_JAR_PATH="/path/to/gatk/dist/Queue.jar" # Replace with the actual path to Queue.jar

INPUT_BAM="${DATA_DIR}/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam"
OUTPUT_SORTED_BAM="${DATA_DIR}/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
TMP_DIR="${DATA_DIR}/.queue/tmp"

# Create the temporary directory if it doesn't exist
mkdir -p "${TMP_DIR}"

# Execute the Picard SortSam command
java -Xmx2048m -XX:+UseParallelOldGC -XX:ParallelGCThreads=4 \
    -XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 \
    -Djava.io.tmpdir="${TMP_DIR}" \
    -cp "${QUEUE_JAR_PATH}" net.sf.picard.sam.SortSam \
    INPUT="${INPUT_BAM}" \
    TMP_DIR="${TMP_DIR}" \
    OUTPUT="${OUTPUT_SORTED_BAM}" \
    VALIDATION_STRINGENCY=SILENT \
    SO=coordinate \
    CREATE_INDEX=true


Takes output from sortSam, makes bam index for use downstream.

samtools index (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'input.sorted.bam' is the output from sortSam
# Replace 'input.sorted.bam' with the actual sorted BAM file name
samtools index input.sorted.bam


Command: samtools index /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.19

# Define input and output paths
INPUT_BAM="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam"
OUTPUT_BAI="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai"

# Create BAM index
samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"


Takes inputs from multiple final bam files.

samtools merge (Inferred with models/gemini-2.5-flash) v1.x GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Example: Merging multiple final BAM files into a single BAM file.
# This is a common step when combining technical replicates or preparing for downstream analysis.
# Replace input_1.bam, input_2.bam, ... with your actual BAM file paths.
# Replace merged_output.bam with your desired output file name.

samtools merge merged_output.bam input_1.bam input_2.bam input_3.bam

# Index the merged BAM file for efficient access (optional but recommended)
samtools index merged_output.bam


Merges the two technical replicates for further downstream analysis.

samtools (Inferred with models/gemini-2.5-flash) v1.10 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge two technical replicate BAM files into a single BAM file.
# Replace 'replicate_1.bam' and 'replicate_2.bam' with your actual input files.
# The output will be 'merged_replicates.bam'.
samtools merge merged_replicates.bam replicate_1.bam replicate_2.bam


Command: samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam

samtools v1.19 GitHub

$ Bash example

# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

samtools merge /full/path/to/files/CombinedID.merged.bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam


Takes output from sortSam, makes bam index for use downstream.

samtools index (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Assuming 'input.sorted.bam' is the output from sortSam
# This command creates an index file 'input.sorted.bam.bai' in the same directory.
samtools index input.sorted.bam


Command: samtools index /full/path/to/files/CombinedID.merged.bam /full/path/to/files/CombinedID.merged.bam.bai

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.19

# Define input and output paths
INPUT_BAM="/full/path/to/files/CombinedID.merged.bam"
OUTPUT_BAI="/full/path/to/files/CombinedID.merged.bam.bai"

# Create index for the BAM file
samtools index "${INPUT_BAM}" "${OUTPUT_BAI}"


Takes output from sortSam.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.19

# Assuming input.bam is the sorted BAM file from sortSam.
# This command removes PCR duplicates from the sorted BAM file.
# Note: markdup expects mate tags added beforehand by samtools fixmate -m.
# -r: Remove duplicate reads
# -s: Report duplicate statistics (optional, but good for QC)
# -f: Write those statistics to the named file instead of stderr
# -@: Number of threads to use (adjust as needed for parallel processing)
samtools markdup -r -s -f markdup_stats.txt -@ 4 input.bam output.bam


Only outputs the second read in each pair for use with single stranded peak caller.

samtools (Inferred with models/gemini-2.5-flash) v1.10 (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# This command extracts only the second read in each pair from a BAM file
# and writes them to a new BAM file, since the peak caller takes aligned reads rather than FASTQ.
# The -f 0x80 flag selects reads that are the second in a pair.
# The -b flag writes BAM output and -h keeps the header.
samtools view -hb -f 0x80 -o output_R2.bam input.bam

This is the final bam file to perform analysis on.

STAR v2.7.10a (Inferred with models/gemini-2.5-flash), Samtools v1.17 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install STAR (example, uncomment if needed)
# conda install -c bioconda star

# Install Samtools (example, uncomment if needed)
# conda install -c bioconda samtools

# Define reference genome and STAR index path
GENOME_DIR="/path/to/STAR_index/hg38" # Placeholder for human genome hg38 STAR index
READ1="input_read1.fastq.gz"
READ2="input_read2.fastq.gz"
OUTPUT_PREFIX="aligned_reads"
UNSORTED_BAM="${OUTPUT_PREFIX}_Aligned.out.bam"
FINAL_BAM="final.bam"

# 1. Perform alignment with STAR
# This command aligns paired-end reads to the genome and outputs an unsorted BAM file.
# --readFilesCommand zcat is needed because the input FASTQ files are gzipped.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --runThreadN 8 \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM Unsorted \
     --outSAMunmapped Within \
     --outSAMattributes Standard \
     --quantMode GeneCounts # Optional: for gene quantification, if desired

# 2. Sort the BAM file by coordinate
# This is a crucial step for most downstream analyses and for indexing.
samtools sort -@ 8 -o "${FINAL_BAM}" "${UNSORTED_BAM}"

# 3. Index the sorted BAM file
# This creates the .bai index file, which is crucial for random access to the BAM file.
samtools index "${FINAL_BAM}"

# Clean up intermediate unsorted BAM if desired
# rm "${UNSORTED_BAM}"


Command: samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

samtools v1.19 GitHub

$ Bash example

# Install samtools (e.g., using conda)
# conda install -c bioconda samtools

# Extract reads that are the second in a pair (R2 reads) from a merged BAM file
# -h: Include header in the output
# -b: Output in BAM format
# -f 128: Select reads with flag 128 (second in pair)
samtools view -hb -f 128 /full/path/to/files/CombinedID.merged.bam > /full/path/to/files/CombinedID.merged.r2.bam

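
The `-f 128` filter keeps an alignment only when bit `0x80` (second in pair) is set in its SAM flag. The following is a small sketch of that bit test on a few typical paired-end flag values. The helper name `is_second_in_pair` and the chosen example flags are illustrative.

```bash
SECOND_IN_PAIR=128   # 0x80 in the SAM flag field

is_second_in_pair() {
    # Succeeds when the 0x80 bit is set in the flag value passed as $1
    [ $(( $1 & SECOND_IN_PAIR )) -ne 0 ]
}

KEPT=""
for flag in 83 163 99 147; do
    if is_second_in_pair "$flag"; then
        KEPT="$KEPT $flag"
    fi
done
echo "flags kept by -f 128:$KEPT"
```

Flags 163 and 147 include the 128 bit and pass the filter. Flags 83 and 99 carry the 64 bit (first in pair) instead and are dropped.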

Takes results from samtools view.

samtools v1.9 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools=1.9

# Example: Keep only mapped, primary alignments and write BAM output.
# -F 260 excludes reads with flag 4 (unmapped) or flag 256 (secondary alignment) set.
# Replace 'input.bam' with your actual input file and 'filtered.bam' with your desired output file.
samtools view -b -F 260 input.bam > filtered.bam
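Exclusion flags can be combined into a single mask by OR-ing the individual bits, which avoids relying on how repeated `-F` options are handled (this may differ between samtools versions). The helper name `combine_flags` is an illustrative assumption, and the sketch only prints the resulting command.

```shell
# Hypothetical helper: OR several SAM FLAG bits into one mask for a single -F option.
combine_flags() {
    local mask=0 bit
    for bit in "$@"; do
        mask=$(( mask | bit ))
    done
    echo "$mask"
}

# 4 = read unmapped, 256 = secondary alignment
EXCLUDE_MASK=$(combine_flags 4 256)

# Print the command for review; input.bam and filtered.bam are placeholders.
echo "samtools view -b -F ${EXCLUDE_MASK} input.bam > filtered.bam"
```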


Calls peaks on those files.

clipper (Inferred with models/gemini-2.5-flash) v0.0.3 GitHub

$ Bash example

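# Install clipper using conda (the package name, channel, and Python version below are
# unverified assumptions; check the CLIPper repository for current installation instructions)
# conda create -n clipper_env python=3.8
# conda activate clipper_env
# conda install -c bioconda clipper=0.0.3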

# Define input and output files
INPUT_BAM="aligned_reads.bam" # Placeholder for your aligned BAM file
OUTPUT_BED="peaks.bed"
GENOME_ASSEMBLY="hg19" # Reference genome assembly; hg19 is the build recorded for this dataset

# Execute CLIPper peak calling
clipper -b "${INPUT_BAM}" -s "${GENOME_ASSEMBLY}" -o "${OUTPUT_BED}"


Command: clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle

CLIPper v0.0.1 GitHub

$ Bash example

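# Install CLIPper (the pip and conda package names below are unverified assumptions;
# the YeoLab/clipper repository documents installation from source)
# pip install clipper
# Or:
# conda install -c bioconda clipper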

# Run CLIPper for peak calling
clipper -b /full/path/to/files/CombinedID.merged.r2.bam -s hg19 -o /full/path/to/files/CombinedID.merged.r2.peaks.bed --bonferroni --superlocal --threshold-method binomial --save-pickle
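Once peak calling finishes, a quick sanity check is to count the called clusters per strand. The sketch below assumes the output follows the standard BED layout with strand in column 6; the function name `count_peaks_by_strand` and the three demo records are made up for illustration and are not real CLIPper output.

```shell
# Hypothetical helper: count BED records per strand (column 6),
# skipping track/browser/comment headers and blank lines.
count_peaks_by_strand() {
    awk -F'\t' '
        /^(track|browser|#)/ || NF == 0 { next }
        { n[$6]++ }
        END { for (s in n) print s "\t" n[s] }
    ' "$1" | LC_ALL=C sort
}

# Demo on a small, made-up BED file.
demo_bed=$(mktemp)
printf 'chr1\t100\t150\tpeak_a\t0.001\t+\n'  > "$demo_bed"
printf 'chr1\t300\t340\tpeak_b\t0.002\t-\n' >> "$demo_bed"
printf 'chr2\t500\t560\tpeak_c\t0.003\t+\n' >> "$demo_bed"
count_peaks_by_strand "$demo_bed"
rm -f "$demo_bed"
```

A strongly lopsided strand count, or an empty result, is a cheap signal that the input BAM or the species argument should be rechecked.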


Tools Used

STAR

Raw Source Text

Takes output from raw files.  Run to trim off both 5' and 3' adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3' adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 30  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps unique reads to the human genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 1  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam
Takes output from STAR genome mapping.  Custom random-mer-aware script for PCR duplicate removal. Command: barcode_collapse_pe.py  --bam /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --out_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  --metrics_file /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.metrics
Takes output from barcode collapse PE.  Sorts resulting bam file for use downstream.  Command: java  -Xmx2048m  -XX:+UseParallelOldGC  -XX:ParallelGCThreads=4  -XX:GCTimeLimit=50  -XX:GCHeapFreeLimit=10  -Djava.io.tmpdir=/full/path/to/files/.queue/tmp  -cp /path/to/gatk/dist/Queue.jar  net.sf.picard.sam.SortSam  INPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.bam  TMP_DIR=/full/path/to/files/.queue/tmp  OUTPUT=/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  VALIDATION_STRINGENCY=SILENT  SO=coordinate  CREATE_INDEX=true
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam.bai
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam /full/path/to/files/file_R1.D08.fastq.gz.adapterTrim.round2.rmRep.rmDup.sorted.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools  index  /full/path/to/files/CombinedID.merged.bam  /full/path/to/files/CombinedID.merged.bam.bai
Takes output from sortSam.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128  /full/path/to/files/CombinedID.merged.bam  >  /full/path/to/files/CombinedID.merged.r2.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  -b /full/path/to/files/CombinedID.merged.r2.bam  -s hg19  -o /full/path/to/files/CombinedID.merged.r2.peaks.bed  --bonferroni  --superlocal  --threshold-method binomial  --save-pickle
Genome_build: hg19
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding
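
The commands above chain together by appending a suffix to the read 1 file name at each stage. As a rough sketch of that naming convention, the hypothetical helper `derive_names` below lists the per-replicate file names from adapter trimming through the sorted, deduplicated BAM; the suffixes are taken from the recorded commands and nothing is executed.

```shell
# Hypothetical helper: list the per-replicate file names produced at each stage,
# following the suffix convention in the recorded commands.
derive_names() {
    local r1=$1
    local base="${r1}.adapterTrim"
    printf '%s\n' \
        "${base}.fastq.gz" \
        "${base}.round2.fastq.gz" \
        "${base}.round2.rep.bam" \
        "${base}.round2.rmRep.bam" \
        "${base}.round2.rmRep.rmDup.bam" \
        "${base}.round2.rmRep.rmDup.sorted.bam"
}

# Order: cutadapt round 1, cutadapt round 2, STAR repeat mapping,
# STAR genome mapping, barcode collapse, SortSam.
derive_names /full/path/to/files/file_R1.C01.fastq.gz
```

The last name printed for each replicate is the file passed to `samtools merge` in the recorded pipeline.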
