GSE243675 Processing Pipeline

RIP-Seq code_examples 9 steps

Publication

A phage nucleus-associated RNA-binding protein is required for jumbo phage infection.

Nucleic acids research (2024) — PMID 38554115

Dataset

GSE243675

A phage nucleus-associated RNA-binding protein required for jumbo phage infection

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    R1 data was processed using the single ended eCLIP pipeline and available at: http://github.com/yeolab/eclip.

    eCLIP vlatest (as of 2020-11-17)
    $ Bash example
    # Clone the eCLIP pipeline repository if not already present
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    
    # Define placeholder paths for input data and reference files
    # Replace these with your actual file paths and ensure they are accessible.
    # Reference genome: Human hg38 is a common choice for eCLIP.
    # Adapter sequences can be found in the eCLIP repository (e.g., eclip/data/adapters.fa).
    # Blacklist regions for hg38 can be obtained from ENCODE.
    # Spike-in sequences are optional and depend on experimental design.
    
    R1_FASTQ="path/to/your_sample_R1.fastq.gz"
    GENOME_FASTA="path/to/reference/hg38.fa"
    GENOME_GTF="path/to/reference/hg38.gtf"
    STAR_INDEX_DIR="path/to/reference/star_index_hg38" # Directory containing STAR index files
    CHROM_SIZES="path/to/reference/hg38.chrom.sizes"
    BLACKLIST_BED="path/to/reference/hg38_blacklist.bed"
    ADAPTER_FASTA="path/to/eclip_repo/data/adapters.fa" # Example path from the eCLIP repo
    SPIKEIN_FASTA="path/to/reference/spikein.fa" # Optional, if spike-ins were used
    SPIKEIN_GTF="path/to/reference/spikein.gtf" # Optional
    SPIKEIN_STAR_INDEX_DIR="path/to/reference/spikein_star_index" # Optional
    
    # Create a CWL input configuration file (e.g., inputs.yaml)
    # This example assumes single-ended data as per the description.
    cat << EOF > eclip_inputs.yaml
    fastq_r1:
      class: File
      path: ${R1_FASTQ}
    # fastq_r2: # Uncomment and provide path if paired-end
    #   class: File
    #   path: path/to/your_sample_R2.fastq.gz
    genome_fasta:
      class: File
      path: ${GENOME_FASTA}
    genome_gtf:
      class: File
      path: ${GENOME_GTF}
    star_index:
      class: Directory
      path: ${STAR_INDEX_DIR}
    chrom_sizes:
      class: File
      path: ${CHROM_SIZES}
    blacklist_bed:
      class: File
      path: ${BLACKLIST_BED}
    adapter_fasta:
      class: File
      path: ${ADAPTER_FASTA}
    spikein_fasta:
      class: File
      path: ${SPIKEIN_FASTA}
    spikein_gtf:
      class: File
      path: ${SPIKEIN_GTF}
    spikein_star_index:
      class: Directory
      path: ${SPIKEIN_STAR_INDEX_DIR}
    output_dir: "eclip_output" # Specify an output directory
    threads: 8 # Number of threads to use
    memory: 32 # Memory in GB
    EOF
    
    # Run the eCLIP CWL workflow using cwltool
    # Ensure cwltool is installed: # pip install cwltool
    cwltool eclip.cwl eclip_inputs.yaml
  2. 2

    Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract

    UMI-tools vInferred with models/gemini-2.5-flash GitHub
    $ Bash example
    # Install umi-tools if not already installed
    # conda install -c bioconda umi-tools
    
    # Example command for umi_tools extract
    # This command assumes a 10bp UMI at the start of Read 1.
    # Adjust --bc-pattern according to your specific library preparation and UMI location.
    # Replace 'raw_reads.fastq.gz' with your actual input file and 'umi_extracted_reads.fastq.gz' with your desired output file name.
    umi_tools extract \
        --input raw_reads.fastq.gz \
        --output umi_extracted_reads.fastq.gz \
        --bc-pattern="^(?P<umi_1>.{10})(?P<discard_1>.*)$" \
        --log umi_tools_extract.log
  3. 3

    Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.

    cutadapt v4.0 GitHub
    $ Bash example
    # Install cutadapt (e.g., via conda)
    # conda install -c bioconda cutadapt=4.0
    
    # Define input and output files
    INPUT_FASTQ="post_umi_extracted_reads.fastq.gz"
    OUTPUT_FASTQ="trimmed_reads.fastq.gz"
    
    # Define adapter sequences based on common eCLIP practices (e.g., from yeolab/skipper workflow)
    # 3' adapter (Illumina TruSeq Small RNA 3' Adapter)
    ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
    # 5' adapter/barcode (e.g., Illumina 5' adapter, or specific barcode sequence to be trimmed)
    # The yeolab/skipper workflow uses GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCA as a 5' adapter.
    ADAPTER_5PRIME="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCA"
    
    # Run cutadapt to trim adapter and barcode sequences
    # -a: 3' adapter
    # -g: 5' adapter/barcode
    # -o: Output file
    # --minimum-length: Discard reads shorter than this length after trimming
    # -q: Quality trim from both ends (e.g., 20,20 for phred score 20)
    # -j: Number of CPU cores to use
    cutadapt \
      -a "${ADAPTER_3PRIME}" \
      -g "${ADAPTER_5PRIME}" \
      -o "${OUTPUT_FASTQ}" \
      --minimum-length 18 \
      -q 20,20 \
      -j 8 \
      "${INPUT_FASTQ}"
  4. 4

    Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

    $ Bash example
    # Install STAR (example using conda)
    # conda create -n star_env star=2.7.10a -c bioconda -c conda-forge
    # conda activate star_env
    
    # Define variables
    INPUT_READS="input_trimmed_reads.fastq.gz" # Path to your trimmed FASTQ reads
    REPBASE_STAR_INDEX_DIR="path/to/repbase_star_index" # Path to the STAR genome index built from RepBase sequences
    OUTPUT_PREFIX="repbase_alignment" # Prefix for output files
    THREADS=8 # Number of threads to use
    
    # Build STAR index for RepBase (if not already done)
    # This step is typically done once for a reference. Example:
    # STAR --runMode genomeGenerate \
    #      --genomeDir ${REPBASE_STAR_INDEX_DIR} \
    #      --genomeFastaFiles repbase.fasta \
    #      --genomeSAindexNbases 10 # Adjust based on genome size, 10 for smaller genomes like RepBase
    
    # Run STAR to map reads against RepBase and output unmapped reads
    STAR --runThreadN ${THREADS} \
         --genomeDir ${REPBASE_STAR_INDEX_DIR} \
         --readFilesIn ${INPUT_READS} \
         --outFileNamePrefix ${OUTPUT_PREFIX}_ \
         --outFilterMultimapNmax 30 \
         --alignEndsType EndToEnd \
         --outFilterMultimapScoreRange 1 \
         --outSAMmode Full \
         --outFilterType BySJout \
         --outSAMtype BAM Unsorted \
         --outFilterScoreMin 10 \
         --outReadsUnmapped Fastx \
         --outSAMattributes All
    
    # The unmapped reads will be in ${OUTPUT_PREFIX}_Unmapped.out.fastq
    # The mapped reads will be in ${OUTPUT_PREFIX}_Aligned.out.bam
  5. 5

    Remaining reads were mapped to the appropriate genome build (Pa_P01 + PhiPA3) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star=2.7.10a
    
    # Assume genome index has been built for the custom reference "Pa_P01 + PhiPA3"
    # Example for building index (replace with actual fasta files):
    # STAR --runMode genomeGenerate \
    #      --genomeDir path/to/Pa_P01_PhiPA3_STAR_index \
    #      --genomeFastaFiles Pa_P01.fasta PhiPA3.fasta \
    #      --runThreadN 8 # Adjust threads as needed
    
    STAR --genomeDir path/to/Pa_P01_PhiPA3_STAR_index \
         --readFilesIn input_reads.fastq.gz \
         --outFileNamePrefix aligned_reads_ \
         --outFilterMultimapNmax 1 \
         --alignEndsType EndToEnd \
         --outFilterMultimapScoreRange 1 \
         --outSAMmode Full \
         --outFilterType BySJout \
         --outSAMtype BAM Unsorted \
         --outFilterScoreMin 10 \
         --outReadsUnmapped Fastx \
         --outSAMattributes All \
         --runThreadN 8 # Adjust threads as needed
  6. 6

    Uniquely mapped reads were removed of PCR duplicates with umi_tools

    UMI-tools v1.1.2 GitHub
    $ Bash example
    # Install UMI-tools (e.g., using conda)
    # conda create -n umi_tools_env umi-tools=1.1.2 -c bioconda -c conda-forge
    # conda activate umi_tools_env
    
    # Assuming 'uniquely_mapped.bam' is the input BAM file containing uniquely mapped reads
    # and UMIs are present in the read names, separated by a colon.
    # Adjust --umi-separator and --extract-method if UMIs are in a different format or location.
    # If reads are paired-end, include --paired-end.
    
    umi_tools dedup \
        --umi-separator=':' \
        --extract-method=string \
        --paired-end \
        --output-stats deduplication_stats.txt \
        -I uniquely_mapped.bam \
        -S deduplicated.bam
  7. 7

    Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper

    CLIPper vunspecified GitHub
    $ Bash example
    # Install CLIPper (if not already installed)
    # git clone https://github.com/YeoLab/clipper.git
    # cd clipper
    # # Ensure Python dependencies are met, e.g., numpy, scipy, pysam
    # # pip install numpy scipy pysam
    
    # Define input and output files (placeholders)
    # Replace with actual paths to your data
    INPUT_BAM="path/to/your_experiment.bam"
    CONTROL_BAM="path/to/your_control.bam" # Optional, but highly recommended for robust peak calling
    OUTPUT_PREFIX="clipper_peaks"
    
    # Define reference genome files (placeholders for human hg38)
    # Source: UCSC Genome Browser or GENCODE for GRCh38/hg38
    # Example paths for hg38:
    GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa"
    GENES_GTF="path/to/gencode.v38.annotation.gtf"
    
    # Run CLIPper to identify peak clusters
    # Parameters are examples; adjust based on experimental design and CLIPper documentation
    # -b: Input BAM file (required)
    # -c: Control BAM file (optional, but recommended)
    # -o: Output prefix for peak files (required)
    # -g: Reference genome FASTA file (required for some features, good practice)
    # -a: Gene annotation GTF file (required for some features, good practice)
    # -p: P-value threshold (default 0.01)
    # -f: FDR threshold (default 0.05)
    python CLIPper.py \
        -b "${INPUT_BAM}" \
        -c "${CONTROL_BAM}" \
        -o "${OUTPUT_PREFIX}" \
        -g "${GENOME_FASTA}" \
        -a "${GENES_GTF}" \
        -p 0.01 \
        -f 0.05
  8. 8

    Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl

    eCLIP veCLIP pipeline (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Clone the eCLIP repository to get the script
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip
    # export PATH=$PATH:$(pwd)/bin
    
    # Define input and output files
    PEAK_FILE="eclip_peaks.bed" # Placeholder for the identified clusters/peaks
    SM_INPUT_BAM="sm_input.bam" # Placeholder for the size-matched input BAM file
    OUTPUT_FILE="enriched_clusters.bed"
    
    # Define enrichment parameters (placeholders, adjust as needed based on experimental design)
    MIN_OVERLAP_FRACTION=0.1 # Minimum fraction of peak overlapping with BAM reads
    MIN_READS_IN_OVERLAP=10  # Minimum reads in the overlapping region
    MIN_FOLD_ENRICHMENT=2.0  # Minimum fold enrichment over SMInput
    MIN_PVALUE=0.05          # Maximum p-value for enrichment
    MIN_LOG2_FOLD_ENRICHMENT=1.0 # Minimum log2 fold enrichment (equivalent to 2-fold)
    MIN_LOG10_PVALUE=1.3     # Minimum -log10(p-value) (equivalent to -log10(0.05))
    
    # Execute the overlap_peakfi_with_bam.pl script
    overlap_peakfi_with_bam.pl \
        "${PEAK_FILE}" \
        "${SM_INPUT_BAM}" \
        "${OUTPUT_FILE}" \
        "${MIN_OVERLAP_FRACTION}" \
        "${MIN_READS_IN_OVERLAP}" \
        "${MIN_FOLD_ENRICHMENT}" \
        "${MIN_PVALUE}" \
        "${MIN_LOG2_FOLD_ENRICHMENT}" \
        "${MIN_LOG10_PVALUE}"
  9. 9

    Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl

    $ Bash example
    # Install Perl if not already available
    # conda install -c conda-forge perl
    
    # Download the script from the eCLIP repository
    # wget https://raw.githubusercontent.com/yeolab/eclip/master/bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
    # chmod +x compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
    
    # Define input and output files (placeholders for enriched peak files in BED format)
    # Replace 'replicate1_peaks.bed', 'replicate2_peaks.bed' with actual input peak files.
    # The script can take multiple input files.
    INPUT_PEAK_FILE_1="replicate1_peaks.bed"
    INPUT_PEAK_FILE_2="replicate2_peaks.bed"
    OUTPUT_MERGED_PEAKS="merged_replicate_peaks.bed"
    
    # Execute the script to merge overlapping enriched peaks from replicates
    perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl "${INPUT_PEAK_FILE_1}" "${INPUT_PEAK_FILE_2}" > "${OUTPUT_MERGED_PEAKS}"

Tools Used

Raw Source Text
R1 data was processed using the single ended eCLIP pipeline and available at: http://github.com/yeolab/eclip.
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Remaining reads were mapped to the appropriate genome build (Pa_P01 + PhiPA3) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
Uniquely mapped reads were removed of PCR duplicates with umi_tools
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
Assembly: PhiPA3
Assembly: Pa_PA01
Supplementary files format and content: compressed.bed files contain single replicate peaks (-log10 p-value, log2 fold enrichment above size-matched input in columns 4 and 5 respectively)
Supplementary files format and content: *vs* files contain reproducible peaks using IDR
← Back to Analysis