GSE135435 Processing Pipeline

RIP-Seq code_examples 29 steps

Publication

The <i>Thermus thermophilus</i> DEAD-box protein Hera is a general RNA binding protein and plays a key role in tRNA metabolism.

RNA (New York, N.Y.) (2020) — PMID 32669294

Dataset

GSE135435

The Thermus thermophilus DEAD-box protein Hera is a general RNA chaperone and plays a key role in tRNA metabolism

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Takes output from raw files.

FastQC (Inferred with models/gemini-2.5-flash) v0.11.9 GitHub

$ Bash example

# Install FastQC (example using conda)
# conda install -c bioconda fastqc=0.11.9

# Define input and output file paths
# Assuming 'raw_files' refers to FASTQ sequencing reads.
# Replace 'input_raw_reads.fastq.gz' with your actual raw sequencing file.
INPUT_FASTQ="input_raw_reads.fastq.gz"
OUTPUT_DIR="fastqc_reports"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Run FastQC to generate quality control reports from raw sequencing reads
fastqc "${INPUT_FASTQ}" -o "${OUTPUT_DIR}"

View on GitHub

Run to trim off both 5â and 3â adapters on both reads.

cutadapt (Inferred with models/gemini-2.5-flash) v3.4 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=3.4

# Define input and output file paths
READ1_IN="input_read1.fastq.gz"
READ2_IN="input_read2.fastq.gz"
READ1_OUT="trimmed_read1.fastq.gz"
READ2_OUT="trimmed_read2.fastq.gz"

# Define adapter sequences
# These are common Illumina adapters. Adjust if specific adapters are known for your assay.
# For 3' adapters (adapters at the 3' end of the read)
ADAPTER_3PRIME_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # Illumina universal adapter
ADAPTER_3PRIME_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT" # Illumina TruSeq adapter, often used for R2

# For 5' adapters (adapters at the 5' end of the read, e.g., primer remnants or adapter dimers)
# If specific 5' adapters are known, replace these placeholders.
# Using the same sequences as 3' adapters is a common fallback if specific 5' adapters are not provided,
# as it helps remove adapter dimers or partial adapters that might appear at the 5' end.
ADAPTER_5PRIME_R1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER_5PRIME_R2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Number of threads for parallel processing
THREADS=8

# Quality trimming threshold (Phred score)
QUALITY_THRESHOLD=20

# Minimum read length after trimming
MIN_LENGTH=20

# Execute cutadapt for paired-end trimming
cutadapt \
  -j "${THREADS}" \
  -q "${QUALITY_THRESHOLD}","${QUALITY_THRESHOLD}" \
  --minimum-length "${MIN_LENGTH}" \
  --pair-filter=any \
  -a "${ADAPTER_3PRIME_R1}" \
  -A "${ADAPTER_3PRIME_R2}" \
  -g "${ADAPTER_5PRIME_R1}" \
  -G "${ADAPTER_5PRIME_R2}" \
  -o "${READ1_OUT}" \
  -p "${READ2_OUT}" \
  "${READ1_IN}" \
  "${READ2_IN}"

View on GitHub

Command: quality-cutoff 6 -m 18 -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R1.X1A.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.metrics

Skipper (quality-cutoff.py script) (Inferred with models/gemini-2.5-flash) v0.1.0 GitHub

$ Bash example

# Install Skipper (if not already installed)
# It's recommended to install Skipper in a dedicated conda environment.
# conda create -n skipper_env python=3.8
# conda activate skipper_env
# pip install skipper

# Define input/output files and parameters
INPUT_R1="/full/path/to/files/file_R1.X1A.fastq.gz"
INPUT_R2="/full/path/to/files/file_R2.C01.fastq.gz"
OUTPUT_R1="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz"
OUTPUT_R2="/full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz"
METRICS_FILE="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.metrics"

QUALITY_THRESHOLD="6"
MIN_LENGTH="18"

ADAPTER_R1_1="NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
ADAPTER_R1_2="CTTCCGATCTACAAGTT"
ADAPTER_R1_3="CTTCCGATCTTGGTCCT"

# Adapters for R2
ADAPTER_R2_LIST=(
    "AACTTGTAGATCGGA"
    "AGGACCAAGATCGGA"
    "ACTTGTAGATCGGAA"
    "GGACCAAGATCGGAA"
    "CTTGTAGATCGGAAG"
    "GACCAAGATCGGAAG"
    "TTGTAGATCGGAAGA"
    "ACCAAGATCGGAAGA"
    "TGTAGATCGGAAGAG"
    "CCAAGATCGGAAGAG"
    "GTAGATCGGAAGAGC"
    "CAAGATCGGAAGAGC"
    "TAGATCGGAAGAGCG"
    "AAGATCGGAAGAGCG"
    "AGATCGGAAGAGCGT"
    "GATCGGAAGAGCGTC"
    "ATCGGAAGAGCGTCG"
    "TCGGAAGAGCGTCGT"
    "CGGAAGAGCGTCGTG"
    "GGAAGAGCGTCGTGT"
)

# Construct the -A arguments for R2
ADAPTER_R2_ARGS=""
for adapter in "${ADAPTER_R2_LIST[@]}"; do
    ADAPTER_R2_ARGS+=" -A ${adapter}"
done

# Execute the command
quality-cutoff "${QUALITY_THRESHOLD}" \
    -m "${MIN_LENGTH}" \
    -a "${ADAPTER_R1_1}" \
    -g "${ADAPTER_R1_2}" \
    -g "${ADAPTER_R1_3}" \
    ${ADAPTER_R2_ARGS} \
    -o "${OUTPUT_R1}" \
    -p "${OUTPUT_R2}" \
    "${INPUT_R1}" \
    "${INPUT_R2}" > "${METRICS_FILE}"

View on GitHub

Takes output from cutadapt round 1.

cutadapt v4.0 GitHub

$ Bash example

# Install cutadapt (e.g., via conda)
# conda create -n cutadapt_env cutadapt=4.0
# conda activate cutadapt_env

# Define input and output files
# Assuming input is a single-end FASTQ file that has already undergone a first round of trimming
INPUT_FASTQ="sample_R1_trimmed_round1.fastq.gz"
OUTPUT_FASTQ="sample_R1_trimmed_round2.fastq.gz"
REPORT_JSON="sample_R1_trimmed_round2_report.json"

# Define adapter for round 2 (e.g., poly-A tail, common in eCLIP after initial adapter trimming)
# This is an inferred adapter for a second round of trimming in eCLIP workflows (e.g., Skipper)
POLY_A_ADAPTER="A{30}"

# Execute cutadapt for the second round of trimming
# Parameters: -a for adapter, -m for minimum length, -q for quality trimming, --cores for parallel processing, --json for report
cutadapt \
  -a "${POLY_A_ADAPTER}" \
  -m 18 \
  -q 20 \
  --cores 8 \
  --json "${REPORT_JSON}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"

View on GitHub

Run to trim off the 3â adapters on read 2, to control for double ligation events.

cutadapt (Inferred with models/gemini-2.5-flash) v4.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt=4.0

# Define input and output file paths (placeholders)
INPUT_R1="read1.fastq.gz"
INPUT_R2="read2.fastq.gz"
OUTPUT_R1="trimmed_read1.fastq.gz"
OUTPUT_R2="trimmed_read2.fastq.gz"

# Define the 3' adapter sequence for eCLIP (common Illumina TruSeq adapter)
# Source: https://github.com/yeolab/skipper/blob/master/config/config.yaml
ADAPTER_3_PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"

# Run cutadapt to trim 3' adapters from read 2
# -A: 3' adapter for read 2
# -q 20: Trim low-quality bases from 3' end (Phred score < 20)
# -m 18: Discard reads shorter than 18 bp after trimming
# -o: Output file for read 1
# -p: Output file for read 2
cutadapt -A "${ADAPTER_3_PRIME}" \
         -q 20 \
         -m 18 \
         -o "${OUTPUT_R1}" \
         -p "${OUTPUT_R2}" \
         "${INPUT_R1}" \
         "${INPUT_R2}"

View on GitHub

Command: cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.metrics

cutadapt v4.1 GitHub

$ Bash example

# Install cutadapt (e.g., via conda)
# conda install -c bioconda cutadapt=4.1

cutadapt -f fastq --match-read-wildcards --times 1 -e 0.1 -O 5 --quality-cutoff 6 -m 18 -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.metrics

View on GitHub

Takes output from cutadapt round 2.

cutadapt v1.18 GitHub

$ Bash example

# Install cutadapt (if not already installed)
# conda install -c bioconda cutadapt

# Define input and output files
# INPUT_FASTQ represents the output from cutadapt round 2
INPUT_FASTQ="sample_r2.fastq.gz"
OUTPUT_FASTQ="sample.r3.fastq.gz"

# Define parameters based on the Yeo Lab eCLIP workflow (cutadapt_r3 step)
# Adapter sequence: Illumina TruSeq Small RNA 3' adapter
ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
MINIMUM_LENGTH=18
ERROR_RATE=0.1
NEXTSEQ_TRIM=20 # Trim low-quality ends from NextSeq data (default in eCLIP workflow)
QUALITY_CUTOFF=20 # Quality cutoff for 3' end trimming
CORES=4 # Number of CPU cores to use (adjust based on available resources)

cutadapt \
  -a "${ADAPTER_SEQUENCE}" \
  --nextseq-trim="${NEXTSEQ_TRIM}" \
  -q "${QUALITY_CUTOFF}" \
  -m "${MINIMUM_LENGTH}" \
  -e "${ERROR_RATE}" \
  -j "${CORES}" \
  -o "${OUTPUT_FASTQ}" \
  "${INPUT_FASTQ}"

View on GitHub

Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.

RepeatMasker (Inferred with models/gemini-2.5-flash) v4.1.5 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install RepeatMasker (example using conda)
# conda install -c bioconda repeatmasker

# Download human genome assembly (e.g., GRCh38/hg38) as a placeholder if not already available
# wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# gunzip hg38.fa.gz

# Run RepeatMasker to identify and mask repetitive elements in the human genome.
# The -species human flag ensures the use of the human-specific RepBase library.
# -pa 8: Use 8 parallel processors for faster execution.
# -gff: Output results in GFF format, which is useful for downstream analysis.
# -dir repeatmasker_output: Specify an output directory for all generated files.
# -s: Perform a sensitive search, which is generally recommended for thorough masking.
# -xsmall: Mask simple repeats and low complexity regions with 'x' instead of 'N'.
# hg38.fa: Placeholder for the input human genome FASTA file.
RepeatMasker -species human -pa 8 -gff -dir repeatmasker_output hg38.fa

View on GitHub

Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All --readFilesCommand zcat --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam

STAR vInferred: 2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables for clarity
# GENOME_DIR is a placeholder for a STAR index built from RepBase for human GRCh38.
# RepBase is a database of repetitive DNA elements.
GENOME_DIR="/path/to/STAR_index_RepBase_human_GRCh38"
READ1="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.fastq.gz"
READ2="/full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz"
OUTPUT_BAM="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam"

# Note: The --outFileNamePrefix parameter will set the prefix for other STAR output files
# (e.g., Log.out, SJ.out.tab). Since --outStd BAM_Unsorted is used, the primary BAM output
# is redirected to stdout and then to the OUTPUT_BAM file.
OUTPUT_PREFIX="${OUTPUT_BAM}"

STAR --runMode alignReads \
     --runThreadN 16 \
     --genomeDir "${GENOME_DIR}" \
     --genomeLoad LoadAndRemove \
     --readFilesIn "${READ1}" "${READ2}" \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 10 \
     --outFilterMultimapScoreRange 1 \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMattributes All \
     --readFilesCommand zcat \
     --outStd BAM_Unsorted \
     --outSAMtype BAM Unsorted \
     --outFilterType BySJout \
     --outReadsUnmapped Fastx \
     --outFilterScoreMin 10 \
     --outSAMattrRGline ID:foo \
     --alignEndsType EndToEnd > "${OUTPUT_BAM}"

View on GitHub

Takes output from STAR rmRep.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star=2.7.10a

# Install samtools (example using conda)
# conda install -c bioconda samtools

# Define variables
INPUT_FASTQ="input.fastq.gz"
GENOME_DIR="/path/to/STAR_genome_index/hg38" # Replace with actual path to STAR genome index
GTF_FILE="/path/to/gencode.v38.annotation.gtf" # Replace with actual path to GTF annotation file
OUTPUT_PREFIX="aligned_reads"
THREADS=8

# 1. STAR Alignment
# This step aligns reads to the reference genome and generates a sorted BAM file.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${INPUT_FASTQ}" \
     --runThreadN "${THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --quantMode GeneCounts \
     --sjdbGTFfile "${GTF_FILE}"

# 2. Remove PCR duplicates (inferred "rmRep" step using samtools markdup)
# This step removes PCR duplicates from the aligned BAM file, which is a common post-alignment step in eCLIP.
# First, fixmate information is added, then sorted by query name, then duplicates are marked.
# The output is a deduplicated, coordinate-sorted BAM file.

samtools fixmate -m "${OUTPUT_PREFIX}Aligned.sortedByCoordinate.out.bam" "${OUTPUT_PREFIX}.fixmate.bam"
samtools sort -n "${OUTPUT_PREFIX}.fixmate.bam" -o "${OUTPUT_PREFIX}.queryname_sorted.bam"
samtools markdup -r -s "${OUTPUT_PREFIX}.queryname_sorted.bam" "${OUTPUT_PREFIX}.dedup.bam"

# Index the deduplicated BAM file
samtools index "${OUTPUT_PREFIX}.dedup.bam"

# Clean up intermediate files (optional)
# rm "${OUTPUT_PREFIX}.fixmate.bam" "${OUTPUT_PREFIX}.queryname_sorted.bam"

View on GitHub

Maps reads to the HB27 (GCF_000008125.1) genome.

bwa (Inferred with models/gemini-2.5-flash) v0.7.17 GitHub

$ Bash example

# Install bwa and samtools if not already present
# conda install -c bioconda bwa samtools

# Define reference genome and output paths
GENOME_ACCESSION="GCF_000008125.1"
GENOME_NAME="GRCh37.p13"
GENOME_FASTA_GZ="${GENOME_ACCESSION}_${GENOME_NAME}_genomic.fna.gz"
GENOME_DIR="ref_genome"
INDEX_PREFIX="${GENOME_DIR}/human_hb27_grch37"
DECOMPRESSED_FASTA="${INDEX_PREFIX}.fna"

# Create directory for reference genome
mkdir -p "${GENOME_DIR}"

# Download reference genome if not exists
if [ ! -f "${GENOME_DIR}/${GENOME_FASTA_GZ}" ]; then
    echo "Downloading reference genome ${GENOME_FASTA_GZ}..."
    wget -O "${GENOME_DIR}/${GENOME_FASTA_GZ}" "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/${GENOME_ACCESSION}/${GENOME_ACCESSION}_${GENOME_NAME}/${GENOME_FASTA_GZ}"
fi

# Decompress the genome for indexing
if [ ! -f "${DECOMPRESSED_FASTA}" ]; then
    echo "Decompressing genome..."
    gunzip -c "${GENOME_DIR}/${GENOME_FASTA_GZ}" > "${DECOMPRESSED_FASTA}"
fi

# Build BWA index
# Check if index files exist (e.g., .bwt)
if [ ! -f "${INDEX_PREFIX}.bwt" ]; then
    echo "Building BWA index for ${DECOMPRESSED_FASTA}..."
    bwa index -p "${INDEX_PREFIX}" "${DECOMPRESSED_FASTA}"
fi

# Define input FASTQ files (replace with actual filenames)
# Assuming paired-end reads for a robust example. Adjust if single-end.
READ1="input_reads_R1.fastq.gz"
READ2="input_reads_R2.fastq.gz"

# Define output BAM file
SORTED_BAM="aligned_reads.sorted.bam"

echo "Mapping reads to ${GENOME_NAME} (${GENOME_ACCESSION}) using BWA MEM..."
# Map reads using bwa mem, pipe to samtools for conversion to BAM, sorting, and indexing
bwa mem -t 8 "${INDEX_PREFIX}" "${READ1}" "${READ2}" | \
samtools view -Sb - | \
samtools sort -@ 8 -o "${SORTED_BAM}" -
samtools index "${SORTED_BAM}"

echo "Alignment complete. Sorted BAM: ${SORTED_BAM}"

View on GitHub

Command: STAR --runMode alignReads --runThreadN 16 --genomeDir /path/to/STAR_database_file --genomeLoad LoadAndRemove --readFilesIn /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1 /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2 --outSAMunmapped Within --outFilterMultimapNmax 10 --outFilterMultimapScoreRange 1 --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam --outSAMattributes All --outStd BAM_Unsorted --outSAMtype BAM Unsorted --outFilterType BySJout --outReadsUnmapped Fastx --outFilterScoreMin 10 --outSAMattrRGline ID:foo --alignEndsType EndToEnd > /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam

STAR v2.7.10a GitHub

$ Bash example

# Define variables for paths
# Replace with actual paths to your STAR genome index and input/output files/directories
GENOME_DIR="/path/to/STAR_genome_indices/GRCh38" # Example: /path/to/STAR_genome_indices/hg38_gencode_v38
READ_FILE_1="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1"
READ_FILE_2="/full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2"
OUTPUT_PREFIX="/full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam" # This will be the prefix for STAR's auxiliary output files (e.g., Log.out, SJ.out.tab, etc.)
FINAL_BAM_OUTPUT="/full/path/to/files/file_R1.X1AUnmappedAligned.out.bam" # This is the primary alignment BAM output, redirected from stdout

# Install STAR (e.g., via conda)
# conda install -c bioconda star

STAR \
  --runMode alignReads \
  --runThreadN 16 \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad LoadAndRemove \
  --readFilesIn "${READ_FILE_1}" "${READ_FILE_2}" \
  --outSAMunmapped Within \
  --outFilterMultimapNmax 10 \
  --outFilterMultimapScoreRange 1 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outSAMattributes All \
  --outStd BAM_Unsorted \
  --outSAMtype BAM Unsorted \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outFilterScoreMin 10 \
  --outSAMattrRGline ID:foo \
  --alignEndsType EndToEnd \
  > "${FINAL_BAM_OUTPUT}"

View on GitHub

Takes output from STAR genome alignments.

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda create -n star_env star=2.7.10a -c bioconda -c conda-forge
# conda activate star_env

# --- Placeholder Variables ---
# Path to the STAR genome index directory (e.g., built from hg38/GRCh38)
# To build a STAR index, you would typically run:
# STAR --runMode genomeGenerate --genomeDir /path/to/star_index/hg38 \
#      --genomeFastaFiles /path/to/hg38.fa --sjdbGTFfile /path/to/gencode.vXX.annotation.gtf \
#      --runThreadN <num_threads>
GENOME_DIR="/path/to/star_index/hg38" 

# Input FASTQ files (gzipped)
FASTQ_R1="sample_R1.fastq.gz"
FASTQ_R2="sample_R2.fastq.gz" # Use "" if single-end

# Prefix for output files
OUTPUT_PREFIX="aligned_sample_"

# Output directory
OUTPUT_DIR="star_output"
mkdir -p "${OUTPUT_DIR}"

# --- STAR Alignment Command ---
STAR \
  --genomeDir "${GENOME_DIR}" \
  --readFilesIn "${FASTQ_R1}" "${FASTQ_R2}" \
  --runThreadN 8 \
  --outFileNamePrefix "${OUTPUT_DIR}/${OUTPUT_PREFIX}" \
  --outSAMtype BAM SortedByCoordinate \
  --outSAMattributes All \
  --outFilterMultimapNmax 20 \
  --alignSJDBoverhangMin 8 \
  --alignSJoverhangMin 8 \
  --alignIntronMin 20 \
  --alignIntronMax 1000000 \
  --alignMatesGapMax 1000000 \
  --outFilterMismatchNmax 999 \
  --outFilterMismatchNoverLmax 0.04 \
  --outFilterScoreMinOverLread 0.0 \
  --outFilterMatchNminOverLread 0.0 \
  --limitBAMsortRAM 30000000000 \
  --readFilesCommand zcat

View on GitHub

Run samtools sort.

samtools v1.15.1 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools

# Sort a BAM file by coordinate. This is the default sorting order for samtools sort.
# Replace 'input.bam' with the path to your unsorted BAM file.
# Replace 'output.sorted.bam' with the desired name for the sorted output BAM file.
samtools sort -o output.sorted.bam input.bam

View on GitHub

15
Command: samtools sort -o /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam

samtools v1.9 (Inferred with models/gemini-2.5-flash) GitHub
$ Bash example
```
samtools sort -o /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam
```
View on GitHub

Takes output from sortSam, makes bam index for use downstream.

samtools index (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.19

# Assume the output from sortSam is a sorted BAM file, e.g., 'sample_sorted.bam'
# This command creates an index file (e.g., 'sample_sorted.bam.bai') for the BAM file.
samtools index sample_sorted.bam

View on GitHub

Command: samtools index /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam

samtools v1.19 GitHub

$ Bash example

bash
# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

samtools index /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam

View on GitHub

Takes inputs from multiple final bam files.

samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not available
# conda install -c bioconda samtools

# This step takes multiple final BAM files and merges them into a single BAM file.
# This is a common operation when combining technical or biological replicates
# before downstream analyses like peak calling or quantification.
# Replace 'input_file_1.bam input_file_2.bam input_file_3.bam' with your actual input BAM file paths.
# Replace 'merged_output.bam' with your desired output merged BAM file name.
# The '-@' flag specifies the number of threads to use.
samtools merge -@ 4 merged_output.bam input_file_1.bam input_file_2.bam input_file_3.bam

View on GitHub

Merges the two technical replicates for further downstream analysis.

samtools (Inferred with models/gemini-2.5-flash) v1.x (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Merge technical replicates (e.g., BAM files)
# Assuming replicate1.bam and replicate2.bam are the input BAM files
# and merged_replicates.bam is the output merged BAM file.
samtools merge merged_replicates.bam replicate1.bam replicate2.bam

View on GitHub

Command: samtools merge HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1BUnmappedAligned.out.sorted.bam

samtools v1.19 GitHub

$ Bash example

# Install samtools (e.g., using conda)
# conda install -c bioconda samtools=1.19

# Define input and output files
INPUT_BAM_1="/full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam"
INPUT_BAM_2="/full/path/to/files/file_R1.X1BUnmappedAligned.out.sorted.bam"
OUTPUT_BAM="HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam"

# Execute samtools merge
samtools merge "${OUTPUT_BAM}" "${INPUT_BAM_1}" "${INPUT_BAM_2}"

View on GitHub

Takes output from samtools sort, makes bam index for use downstream.

samtools v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools=1.19

# Create a BAM index for the sorted BAM file
# input.sorted.bam is the output from samtools sort
samtools index input.sorted.bam

View on GitHub

Command: samtools index HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam

samtools v1.9 GitHub

$ Bash example

# Install samtools (e.g., via conda)
# conda install -c bioconda samtools=1.9

samtools index HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam

View on GitHub

Takes output from samtools sort.

samtools v1.10 GitHub

$ Bash example

# Install samtools (if not already installed)
# conda install -c bioconda samtools=1.10

# This step takes a sorted BAM file as input and creates an index (.bai) file.
# Replace 'input.sorted.bam' with your actual sorted BAM file name.
samtools index input.sorted.bam

View on GitHub

Only outputs the second read in each pair for use with single stranded peak caller.

bbtools reformat.sh (Inferred with models/gemini-2.5-flash) v39.01 GitHub

$ Bash example

# Install bbmap suite (which includes reformat.sh)
# conda install -c bioconda bbmap

# This command takes paired-end FASTQ files (input_R1.fastq.gz and input_R2.fastq.gz)
# and outputs a new FASTQ file (output_R2_only.fastq.gz) containing only the second reads.
# 'r1=f' ensures read 1 is not outputted, and 'r2=t' ensures read 2 is outputted.
reformat.sh in1=input_R1.fastq.gz in2=input_R2.fastq.gz out=output_R2_only.fastq.gz r1=f r2=t

View on GitHub

This is the final bam file to perform analysis on.

samtools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Define placeholder for the input BAM file before final processing
# Replace 'input_raw.bam' with the actual path to your unsorted or intermediate BAM file.
INPUT_BAM="input_raw.bam"

# Define the name for the final sorted and indexed BAM file
FINAL_BAM="final.bam"

# Sort the BAM file by coordinate
# The -o option specifies the output file name
samtools sort -o "${FINAL_BAM}" "${INPUT_BAM}"

# Index the sorted BAM file
# This creates an index file (e.g., final.bam.bai) in the same directory
samtools index "${FINAL_BAM}"

View on GitHub

Command: samtools view -hb -f 128 -o HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam

samtools v1.10 GitHub

$ Bash example

# Install samtools (e.g., via conda)
# conda install -c bioconda samtools=1.10

samtools view -hb -f 128 -o HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam

View on GitHub

Takes results from samtools view.

samtools v1.19 GitHub

$ Bash example

# Install samtools if not already available
# conda install -c bioconda samtools

# Example: Convert a BAM file to SAM format, including the header,
# to "view" the alignment results. This command takes an input BAM file
# and outputs its content in SAM format to standard output, redirected to a file.
# Replace 'input.bam' with your actual input alignment file.
# Replace 'output.sam' with your desired output SAM file name.
samtools view -h input.bam > output.sam

View on GitHub

Calls peaks on those files.

clipper (Inferred with models/gemini-2.5-flash) vv1.0.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install clipper (if not already installed)
# pip install clipper

# Define input files and genome assembly
IP_BAM="ip_sample.bam" # Placeholder: Replace with the path to your IP sample BAM file
CONTROL_BAM="control_sample.bam" # Placeholder: Replace with the path to your control sample BAM file (e.g., input, IgG)
GENOME_ASSEMBLY="hg38" # Placeholder: Use the appropriate genome assembly (e.g., hg19, mm10)
OUTPUT_PREFIX="eclip_peaks"

# Execute clipper to call peaks
# Common parameters: -b (IP BAM), -c (Control BAM), -s (genome assembly), -o (output prefix)
# Optional parameters might include -p (p-value threshold, default 0.01), -w (window size, default 10)
clipper -b "${IP_BAM}" -c "${CONTROL_BAM}" -s "${GENOME_ASSEMBLY}" -o "${OUTPUT_PREFIX}"

View on GitHub

Command: clipper --bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam --species hb27 --outfile /full/path/to/files/CombinedID.merged.r2.peaks.bed --maxgenes 1000000 --save-pickle

CLIPper vNot specified

$ Bash example

# Installation instructions (example, adjust path as needed)
# git clone https://github.com/yeolab/clipper.git
# cd clipper
# # Ensure Python environment is set up, e.g., with conda
# # conda create -n clipper_env python=3.8
# # conda activate clipper_env
# # pip install -r requirements.txt # If a requirements.txt exists
# # Add clipper to your PATH or run directly using python
# # export PATH=$PATH:$(pwd)

# Define input and output files
INPUT_BAM="HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam"
OUTPUT_BED="/full/path/to/files/CombinedID.merged.r2.peaks.bed"
SPECIES_ID="hb27" # Custom species identifier, likely referring to a specific human genome build and annotation used by the Yeo lab.

# Execute CLIPper peak calling
clipper --bam "${INPUT_BAM}" \
        --species "${SPECIES_ID}" \
        --outfile "${OUTPUT_BED}" \
        --maxgenes 1000000 \
        --save-pickle

Tools Used

STAR

Raw Source Text

Takes output from raw files.  Run to trim off both 5â and 3â adapters on both reads. Command: quality-cutoff 6  -m 18  -a NNNNNAGATCGGAAGAGCACACGTCTGAACTCCAGTCAC  -g CTTCCGATCTACAAGTT -g CTTCCGATCTTGGTCCT  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz  -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R1.X1A.fastq.gz /full/path/to/files/file_R2.C01.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.metrics
Takes output from cutadapt round 1. Run to trim off the 3â adapters on read 2, to control for double ligation events. Command: cutadapt -f fastq --match-read-wildcards  --times 1  -e 0.1  -O 5  --quality-cutoff 6  -m 18  -A AACTTGTAGATCGGA -A AGGACCAAGATCGGA -A ACTTGTAGATCGGAA -A GGACCAAGATCGGAA -A CTTGTAGATCGGAAG -A GACCAAGATCGGAAG -A TTGTAGATCGGAAGA -A ACCAAGATCGGAAGA -A TGTAGATCGGAAGAG -A CCAAGATCGGAAGAG -A GTAGATCGGAAGAGC -A CAAGATCGGAAGAGC -A TAGATCGGAAGAGCG -A AAGATCGGAAGAGCG -A AGATCGGAAGAGCGT -A GATCGGAAGAGCGTC -A ATCGGAAGAGCGTCG -A TCGGAAGAGCGTCGT -A CGGAAGAGCGTCGTG -A GGAAGAGCGTCGTGT  -o /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.fastq.gz  -p /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz  /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.fastq.gz  /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.fastq.gz > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.metrics
Takes output from cutadapt round 2.  Maps to human specific version of RepBase used to remove repetitive elements, helps control for spurious artifacts from rRNA (& other) repetitive reads.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir /path/to/RepBase_human_database_file --genomeLoad LoadAndRemove  --readFilesIn /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.fastq.gz /full/path/to/files/file_R2.X1A.fastq.gz.adapterTrim.round2.fastq.gz --outSAMunmapped Within  --outFilterMultimapNmax 10  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam --outSAMattributes All  --readFilesCommand zcat  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bam
Takes output from STAR rmRep.  Maps reads to the HB27 (GCF_000008125.1) genome.  Command: STAR  --runMode alignReads  --runThreadN 16  --genomeDir  /path/to/STAR_database_file --genomeLoad LoadAndRemove  --readFilesIn  /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate1  /full/path/to/files/file_R1.X1A.fastq.gz.adapterTrim.round2.rep.bamUnmapped.out.mate2  --outSAMunmapped Within  --outFilterMultimapNmax 10  --outFilterMultimapScoreRange 1  --outFileNamePrefix /full/path/to/files/file_R1.C01.fastq.gz.adapterTrim.round2.rmRep.bam  --outSAMattributes All  --outStd BAM_Unsorted  --outSAMtype BAM Unsorted  --outFilterType BySJout  --outReadsUnmapped Fastx  --outFilterScoreMin 10  --outSAMattrRGline ID:foo  --alignEndsType EndToEnd > /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam
Takes output from STAR genome alignments. Run samtools sort. Command: samtools sort -o /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.bam
Takes output from sortSam, makes bam index for use downstream.  Command: samtools index  /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam
Takes inputs from multiple final bam files.  Merges the two technical replicates for further downstream analysis.  Command: samtools  merge  HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam /full/path/to/files/file_R1.X1AUnmappedAligned.out.sorted.bam /full/path/to/files/file_R1.X1BUnmappedAligned.out.sorted.bam
Takes output from samtools sort, makes bam index for use downstream.  Command: samtools index HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam
Takes output from samtools sort.  Only outputs the second read in each pair for use with single stranded peak caller.  This is the final bam file to perform analysis on.  Command: samtools view -hb -f 128 -o HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.bam
Takes results from samtools view.  Calls peaks on those files.  Command: clipper  --bam HERA.HER1_CLIP.X1AUnmappedAligned.out.bam.sorted.merged.r2.bam  --species hb27  --outfile /full/path/to/files/CombinedID.merged.r2.peaks.bed --maxgenes 1000000 --save-pickle
Genome_build: hb27
Supplementary_files_format_and_content: bed format, contains clusters of predicted RBP binding

← Back to Analysis