GSE243675 Processing Pipeline

RIP-Seq code_examples 9 steps

Publication

A phage nucleus-associated RNA-binding protein is required for jumbo phage infection.

Nucleic acids research (2024) — PMID 38554115

Dataset

A phage nucleus-associated RNA-binding protein required for jumbo phage infection

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

R1 data was processed using the single ended eCLIP pipeline and available at: http://github.com/yeolab/eclip.

eCLIP vlatest (as of 2020-11-17)

$ Bash example

# Clone the eCLIP pipeline repository if not already present
# git clone https://github.com/yeolab/eclip.git
# cd eclip

# Define placeholder paths for input data and reference files
# Replace these with your actual file paths and ensure they are accessible.
# Reference genome: Human hg38 is a common choice for eCLIP.
# Adapter sequences can be found in the eCLIP repository (e.g., eclip/data/adapters.fa).
# Blacklist regions for hg38 can be obtained from ENCODE.
# Spike-in sequences are optional and depend on experimental design.

R1_FASTQ="path/to/your_sample_R1.fastq.gz"
GENOME_FASTA="path/to/reference/hg38.fa"
GENOME_GTF="path/to/reference/hg38.gtf"
STAR_INDEX_DIR="path/to/reference/star_index_hg38" # Directory containing STAR index files
CHROM_SIZES="path/to/reference/hg38.chrom.sizes"
BLACKLIST_BED="path/to/reference/hg38_blacklist.bed"
ADAPTER_FASTA="path/to/eclip_repo/data/adapters.fa" # Example path from the eCLIP repo
SPIKEIN_FASTA="path/to/reference/spikein.fa" # Optional, if spike-ins were used
SPIKEIN_GTF="path/to/reference/spikein.gtf" # Optional
SPIKEIN_STAR_INDEX_DIR="path/to/reference/spikein_star_index" # Optional

# Create a CWL input configuration file (e.g., inputs.yaml)
# This example assumes single-ended data as per the description.
cat << EOF > eclip_inputs.yaml
fastq_r1:
  class: File
  path: ${R1_FASTQ}
# fastq_r2: # Uncomment and provide path if paired-end
#   class: File
#   path: path/to/your_sample_R2.fastq.gz
genome_fasta:
  class: File
  path: ${GENOME_FASTA}
genome_gtf:
  class: File
  path: ${GENOME_GTF}
star_index:
  class: Directory
  path: ${STAR_INDEX_DIR}
chrom_sizes:
  class: File
  path: ${CHROM_SIZES}
blacklist_bed:
  class: File
  path: ${BLACKLIST_BED}
adapter_fasta:
  class: File
  path: ${ADAPTER_FASTA}
spikein_fasta:
  class: File
  path: ${SPIKEIN_FASTA}
spikein_gtf:
  class: File
  path: ${SPIKEIN_GTF}
spikein_star_index:
  class: Directory
  path: ${SPIKEIN_STAR_INDEX_DIR}
output_dir: "eclip_output" # Specify an output directory
threads: 8 # Number of threads to use
memory: 32 # Memory in GB
EOF

# Run the eCLIP CWL workflow using cwltool
# Ensure cwltool is installed: # pip install cwltool
cwltool eclip.cwl eclip_inputs.yaml

Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract

UMI-tools vInferred with models/gemini-2.5-flash GitHub

$ Bash example

# Install umi-tools if not already installed
# conda install -c bioconda umi-tools

# Example command for umi_tools extract
# This command assumes a 10bp UMI at the start of Read 1.
# Adjust --bc-pattern according to your specific library preparation and UMI location.
# Replace 'raw_reads.fastq.gz' with your actual input file and 'umi_extracted_reads.fastq.gz' with your desired output file name.
umi_tools extract \
    --input raw_reads.fastq.gz \
    --output umi_extracted_reads.fastq.gz \
    --bc-pattern="^(?P<umi_1>.{10})(?P<discard_1>.*)$" \
    --log umi_tools_extract.log

View on GitHub

Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.

cutadapt v4.0 GitHub

$ Bash example

# Install cutadapt (e.g., via conda)
# conda install -c bioconda cutadapt=4.0

# Define input and output files
INPUT_FASTQ="post_umi_extracted_reads.fastq.gz"
OUTPUT_FASTQ="trimmed_reads.fastq.gz"

# Define adapter sequences based on common eCLIP practices (e.g., from yeolab/skipper workflow)
# 3' adapter (Illumina TruSeq Small RNA 3' Adapter)
ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
# 5' adapter/barcode (e.g., Illumina 5' adapter, or specific barcode sequence to be trimmed)
# The yeolab/skipper workflow uses GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCA as a 5' adapter.
ADAPTER_5PRIME="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCA"

# Run cutadapt to trim adapter and barcode sequences
# -a: 3' adapter
# -g: 5' adapter/barcode
# -o: Output file
# --minimum-length: Discard reads shorter than this length after trimming
# -q: Quality trim from both ends (e.g., 20,20 for phred score 20)
# -j: Number of CPU cores to use
cutadapt \
  -a "${ADAPTER_3PRIME}" \
  -g "${ADAPTER_5PRIME}" \
  -o "${OUTPUT_FASTQ}" \
  --minimum-length 18 \
  -q 20,20 \
  -j 8 \
  "${INPUT_FASTQ}"

View on GitHub

Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda create -n star_env star=2.7.10a -c bioconda -c conda-forge
# conda activate star_env

# Define variables
INPUT_READS="input_trimmed_reads.fastq.gz" # Path to your trimmed FASTQ reads
REPBASE_STAR_INDEX_DIR="path/to/repbase_star_index" # Path to the STAR genome index built from RepBase sequences
OUTPUT_PREFIX="repbase_alignment" # Prefix for output files
THREADS=8 # Number of threads to use

# Build STAR index for RepBase (if not already done)
# This step is typically done once for a reference. Example:
# STAR --runMode genomeGenerate \
#      --genomeDir ${REPBASE_STAR_INDEX_DIR} \
#      --genomeFastaFiles repbase.fasta \
#      --genomeSAindexNbases 10 # Adjust based on genome size, 10 for smaller genomes like RepBase

# Run STAR to map reads against RepBase and output unmapped reads
STAR --runThreadN ${THREADS} \
     --genomeDir ${REPBASE_STAR_INDEX_DIR} \
     --readFilesIn ${INPUT_READS} \
     --outFileNamePrefix ${OUTPUT_PREFIX}_ \
     --outFilterMultimapNmax 30 \
     --alignEndsType EndToEnd \
     --outFilterMultimapScoreRange 1 \
     --outSAMmode Full \
     --outFilterType BySJout \
     --outSAMtype BAM Unsorted \
     --outFilterScoreMin 10 \
     --outReadsUnmapped Fastx \
     --outSAMattributes All

# The unmapped reads will be in ${OUTPUT_PREFIX}_Unmapped.out.fastq
# The mapped reads will be in ${OUTPUT_PREFIX}_Aligned.out.bam

View on GitHub

Remaining reads were mapped to the appropriate genome build (Pa_P01 + PhiPA3) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

STAR v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star=2.7.10a

# Assume genome index has been built for the custom reference "Pa_P01 + PhiPA3"
# Example for building index (replace with actual fasta files):
# STAR --runMode genomeGenerate \
#      --genomeDir path/to/Pa_P01_PhiPA3_STAR_index \
#      --genomeFastaFiles Pa_P01.fasta PhiPA3.fasta \
#      --runThreadN 8 # Adjust threads as needed

STAR --genomeDir path/to/Pa_P01_PhiPA3_STAR_index \
     --readFilesIn input_reads.fastq.gz \
     --outFileNamePrefix aligned_reads_ \
     --outFilterMultimapNmax 1 \
     --alignEndsType EndToEnd \
     --outFilterMultimapScoreRange 1 \
     --outSAMmode Full \
     --outFilterType BySJout \
     --outSAMtype BAM Unsorted \
     --outFilterScoreMin 10 \
     --outReadsUnmapped Fastx \
     --outSAMattributes All \
     --runThreadN 8 # Adjust threads as needed

View on GitHub

Uniquely mapped reads were removed of PCR duplicates with umi_tools

UMI-tools v1.1.2 GitHub

$ Bash example

# Install UMI-tools (e.g., using conda)
# conda create -n umi_tools_env umi-tools=1.1.2 -c bioconda -c conda-forge
# conda activate umi_tools_env

# Assuming 'uniquely_mapped.bam' is the input BAM file containing uniquely mapped reads
# and UMIs are present in the read names, separated by a colon.
# Adjust --umi-separator and --extract-method if UMIs are in a different format or location.
# If reads are paired-end, include --paired-end.

umi_tools dedup \
    --umi-separator=':' \
    --extract-method=string \
    --paired-end \
    --output-stats deduplication_stats.txt \
    -I uniquely_mapped.bam \
    -S deduplicated.bam

View on GitHub

Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper

CLIPper vunspecified GitHub

$ Bash example

# Install CLIPper (if not already installed)
# git clone https://github.com/YeoLab/clipper.git
# cd clipper
# # Ensure Python dependencies are met, e.g., numpy, scipy, pysam
# # pip install numpy scipy pysam

# Define input and output files (placeholders)
# Replace with actual paths to your data
INPUT_BAM="path/to/your_experiment.bam"
CONTROL_BAM="path/to/your_control.bam" # Optional, but highly recommended for robust peak calling
OUTPUT_PREFIX="clipper_peaks"

# Define reference genome files (placeholders for human hg38)
# Source: UCSC Genome Browser or GENCODE for GRCh38/hg38
# Example paths for hg38:
GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
GENES_GTF="path/to/gencode.v38.annotation.gtf"

# Run CLIPper to identify peak clusters
# Parameters are examples; adjust based on experimental design and CLIPper documentation
# -b: Input BAM file (required)
# -c: Control BAM file (optional, but recommended)
# -o: Output prefix for peak files (required)
# -g: Reference genome FASTA file (required for some features, good practice)
# -a: Gene annotation GTF file (required for some features, good practice)
# -p: P-value threshold (default 0.01)
# -f: FDR threshold (default 0.05)
python CLIPper.py \
    -b "${INPUT_BAM}" \
    -c "${CONTROL_BAM}" \
    -o "${OUTPUT_PREFIX}" \
    -g "${GENOME_FASTA}" \
    -a "${GENES_GTF}" \
    -p 0.01 \
    -f 0.05

View on GitHub

Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl

eCLIP veCLIP pipeline (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Clone the eCLIP repository to get the script
# git clone https://github.com/yeolab/eclip.git
# cd eclip
# export PATH=$PATH:$(pwd)/bin

# Define input and output files
PEAK_FILE="eclip_peaks.bed" # Placeholder for the identified clusters/peaks
SM_INPUT_BAM="sm_input.bam" # Placeholder for the size-matched input BAM file
OUTPUT_FILE="enriched_clusters.bed"

# Define enrichment parameters (placeholders, adjust as needed based on experimental design)
MIN_OVERLAP_FRACTION=0.1 # Minimum fraction of peak overlapping with BAM reads
MIN_READS_IN_OVERLAP=10  # Minimum reads in the overlapping region
MIN_FOLD_ENRICHMENT=2.0  # Minimum fold enrichment over SMInput
MIN_PVALUE=0.05          # Maximum p-value for enrichment
MIN_LOG2_FOLD_ENRICHMENT=1.0 # Minimum log2 fold enrichment (equivalent to 2-fold)
MIN_LOG10_PVALUE=1.3     # Minimum -log10(p-value) (equivalent to -log10(0.05))

# Execute the overlap_peakfi_with_bam.pl script
overlap_peakfi_with_bam.pl \
    "${PEAK_FILE}" \
    "${SM_INPUT_BAM}" \
    "${OUTPUT_FILE}" \
    "${MIN_OVERLAP_FRACTION}" \
    "${MIN_READS_IN_OVERLAP}" \
    "${MIN_FOLD_ENRICHMENT}" \
    "${MIN_PVALUE}" \
    "${MIN_LOG2_FOLD_ENRICHMENT}" \
    "${MIN_LOG10_PVALUE}"

View on GitHub

Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl

eCLIP vN/A GitHub

$ Bash example

# Install Perl if not already available
# conda install -c conda-forge perl

# Download the script from the eCLIP repository
# wget https://raw.githubusercontent.com/yeolab/eclip/master/bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
# chmod +x compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl

# Define input and output files (placeholders for enriched peak files in BED format)
# Replace 'replicate1_peaks.bed', 'replicate2_peaks.bed' with actual input peak files.
# The script can take multiple input files.
INPUT_PEAK_FILE_1="replicate1_peaks.bed"
INPUT_PEAK_FILE_2="replicate2_peaks.bed"
OUTPUT_MERGED_PEAKS="merged_replicate_peaks.bed"

# Execute the script to merge overlapping enriched peaks from replicates
perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl "${INPUT_PEAK_FILE_1}" "${INPUT_PEAK_FILE_2}" > "${OUTPUT_MERGED_PEAKS}"

View on GitHub

Tools Used

eCLIP STAR

Raw Source Text

R1 data was processed using the single ended eCLIP pipeline and available at: http://github.com/yeolab/eclip.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Remaining reads were mapped to the appropriate genome build (Pa_P01 + PhiPA3) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Uniquely mapped reads were removed of PCR duplicates with umi_tools
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
Assembly: PhiPA3
Assembly: Pa_PA01
Supplementary files format and content: compressed.bed files contain single replicate peaks (-log10 p-value, log2 fold enrichment above size-matched input in columns 4 and 5 respectively)
Supplementary files format and content: *vs* files contain reproducible peaks using IDR

← Back to Analysis