GSE243675 Processing Pipeline
Publication
A phage nucleus-associated RNA-binding protein is required for jumbo phage infection.Nucleic acids research (2024) — PMID 38554115
Processing Steps
Generate Jupyter Notebook-
1
R1 data was processed using the single ended eCLIP pipeline and available at: http://github.com/yeolab/eclip.
eCLIP vlatest (as of 2020-11-17)$ Bash example
# Clone the eCLIP pipeline repository if not already present # git clone https://github.com/yeolab/eclip.git # cd eclip # Define placeholder paths for input data and reference files # Replace these with your actual file paths and ensure they are accessible. # Reference genome: Human hg38 is a common choice for eCLIP. # Adapter sequences can be found in the eCLIP repository (e.g., eclip/data/adapters.fa). # Blacklist regions for hg38 can be obtained from ENCODE. # Spike-in sequences are optional and depend on experimental design. R1_FASTQ="path/to/your_sample_R1.fastq.gz" GENOME_FASTA="path/to/reference/hg38.fa" GENOME_GTF="path/to/reference/hg38.gtf" STAR_INDEX_DIR="path/to/reference/star_index_hg38" # Directory containing STAR index files CHROM_SIZES="path/to/reference/hg38.chrom.sizes" BLACKLIST_BED="path/to/reference/hg38_blacklist.bed" ADAPTER_FASTA="path/to/eclip_repo/data/adapters.fa" # Example path from the eCLIP repo SPIKEIN_FASTA="path/to/reference/spikein.fa" # Optional, if spike-ins were used SPIKEIN_GTF="path/to/reference/spikein.gtf" # Optional SPIKEIN_STAR_INDEX_DIR="path/to/reference/spikein_star_index" # Optional # Create a CWL input configuration file (e.g., inputs.yaml) # This example assumes single-ended data as per the description. cat << EOF > eclip_inputs.yaml fastq_r1: class: File path: ${R1_FASTQ} # fastq_r2: # Uncomment and provide path if paired-end # class: File # path: path/to/your_sample_R2.fastq.gz genome_fasta: class: File path: ${GENOME_FASTA} genome_gtf: class: File path: ${GENOME_GTF} star_index: class: Directory path: ${STAR_INDEX_DIR} chrom_sizes: class: File path: ${CHROM_SIZES} blacklist_bed: class: File path: ${BLACKLIST_BED} adapter_fasta: class: File path: ${ADAPTER_FASTA} spikein_fasta: class: File path: ${SPIKEIN_FASTA} spikein_gtf: class: File path: ${SPIKEIN_GTF} spikein_star_index: class: Directory path: ${SPIKEIN_STAR_INDEX_DIR} output_dir: "eclip_output" # Specify an output directory threads: 8 # Number of threads to use memory: 32 # Memory in GB EOF # Run the eCLIP CWL workflow using cwltool # Ensure cwltool is installed: # pip install cwltool cwltool eclip.cwl eclip_inputs.yaml -
2
Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract
$ Bash example
# Install umi-tools if not already installed # conda install -c bioconda umi-tools # Example command for umi_tools extract # This command assumes a 10bp UMI at the start of Read 1. # Adjust --bc-pattern according to your specific library preparation and UMI location. # Replace 'raw_reads.fastq.gz' with your actual input file and 'umi_extracted_reads.fastq.gz' with your desired output file name. umi_tools extract \ --input raw_reads.fastq.gz \ --output umi_extracted_reads.fastq.gz \ --bc-pattern="^(?P<umi_1>.{10})(?P<discard_1>.*)$" \ --log umi_tools_extract.log -
3
Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt.
$ Bash example
# Install cutadapt (e.g., via conda) # conda install -c bioconda cutadapt=4.0 # Define input and output files INPUT_FASTQ="post_umi_extracted_reads.fastq.gz" OUTPUT_FASTQ="trimmed_reads.fastq.gz" # Define adapter sequences based on common eCLIP practices (e.g., from yeolab/skipper workflow) # 3' adapter (Illumina TruSeq Small RNA 3' Adapter) ADAPTER_3PRIME="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA" # 5' adapter/barcode (e.g., Illumina 5' adapter, or specific barcode sequence to be trimmed) # The yeolab/skipper workflow uses GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCA as a 5' adapter. ADAPTER_5PRIME="GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCA" # Run cutadapt to trim adapter and barcode sequences # -a: 3' adapter # -g: 5' adapter/barcode # -o: Output file # --minimum-length: Discard reads shorter than this length after trimming # -q: Quality trim from both ends (e.g., 20,20 for phred score 20) # -j: Number of CPU cores to use cutadapt \ -a "${ADAPTER_3PRIME}" \ -g "${ADAPTER_5PRIME}" \ -o "${OUTPUT_FASTQ}" \ --minimum-length 18 \ -q 20,20 \ -j 8 \ "${INPUT_FASTQ}" -
4
Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
$ Bash example
# Install STAR (example using conda) # conda create -n star_env star=2.7.10a -c bioconda -c conda-forge # conda activate star_env # Define variables INPUT_READS="input_trimmed_reads.fastq.gz" # Path to your trimmed FASTQ reads REPBASE_STAR_INDEX_DIR="path/to/repbase_star_index" # Path to the STAR genome index built from RepBase sequences OUTPUT_PREFIX="repbase_alignment" # Prefix for output files THREADS=8 # Number of threads to use # Build STAR index for RepBase (if not already done) # This step is typically done once for a reference. Example: # STAR --runMode genomeGenerate \ # --genomeDir ${REPBASE_STAR_INDEX_DIR} \ # --genomeFastaFiles repbase.fasta \ # --genomeSAindexNbases 10 # Adjust based on genome size, 10 for smaller genomes like RepBase # Run STAR to map reads against RepBase and output unmapped reads STAR --runThreadN ${THREADS} \ --genomeDir ${REPBASE_STAR_INDEX_DIR} \ --readFilesIn ${INPUT_READS} \ --outFileNamePrefix ${OUTPUT_PREFIX}_ \ --outFilterMultimapNmax 30 \ --alignEndsType EndToEnd \ --outFilterMultimapScoreRange 1 \ --outSAMmode Full \ --outFilterType BySJout \ --outSAMtype BAM Unsorted \ --outFilterScoreMin 10 \ --outReadsUnmapped Fastx \ --outSAMattributes All # The unmapped reads will be in ${OUTPUT_PREFIX}_Unmapped.out.fastq # The mapped reads will be in ${OUTPUT_PREFIX}_Aligned.out.bam -
5
Remaining reads were mapped to the appropriate genome build (Pa_P01 + PhiPA3) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All)
$ Bash example
# Install STAR (example using conda) # conda install -c bioconda star=2.7.10a # Assume genome index has been built for the custom reference "Pa_P01 + PhiPA3" # Example for building index (replace with actual fasta files): # STAR --runMode genomeGenerate \ # --genomeDir path/to/Pa_P01_PhiPA3_STAR_index \ # --genomeFastaFiles Pa_P01.fasta PhiPA3.fasta \ # --runThreadN 8 # Adjust threads as needed STAR --genomeDir path/to/Pa_P01_PhiPA3_STAR_index \ --readFilesIn input_reads.fastq.gz \ --outFileNamePrefix aligned_reads_ \ --outFilterMultimapNmax 1 \ --alignEndsType EndToEnd \ --outFilterMultimapScoreRange 1 \ --outSAMmode Full \ --outFilterType BySJout \ --outSAMtype BAM Unsorted \ --outFilterScoreMin 10 \ --outReadsUnmapped Fastx \ --outSAMattributes All \ --runThreadN 8 # Adjust threads as needed -
6
Uniquely mapped reads were removed of PCR duplicates with umi_tools
$ Bash example
# Install UMI-tools (e.g., using conda) # conda create -n umi_tools_env umi-tools=1.1.2 -c bioconda -c conda-forge # conda activate umi_tools_env # Assuming 'uniquely_mapped.bam' is the input BAM file containing uniquely mapped reads # and UMIs are present in the read names, separated by a colon. # Adjust --umi-separator and --extract-method if UMIs are in a different format or location. # If reads are paired-end, include --paired-end. umi_tools dedup \ --umi-separator=':' \ --extract-method=string \ --paired-end \ --output-stats deduplication_stats.txt \ -I uniquely_mapped.bam \ -S deduplicated.bam -
7
Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper
$ Bash example
# Install CLIPper (if not already installed) # git clone https://github.com/YeoLab/clipper.git # cd clipper # # Ensure Python dependencies are met, e.g., numpy, scipy, pysam # # pip install numpy scipy pysam # Define input and output files (placeholders) # Replace with actual paths to your data INPUT_BAM="path/to/your_experiment.bam" CONTROL_BAM="path/to/your_control.bam" # Optional, but highly recommended for robust peak calling OUTPUT_PREFIX="clipper_peaks" # Define reference genome files (placeholders for human hg38) # Source: UCSC Genome Browser or GENCODE for GRCh38/hg38 # Example paths for hg38: GENOME_FASTA="path/to/GRCh38.primary_assembly.genome.fa" GENES_GTF="path/to/gencode.v38.annotation.gtf" # Run CLIPper to identify peak clusters # Parameters are examples; adjust based on experimental design and CLIPper documentation # -b: Input BAM file (required) # -c: Control BAM file (optional, but recommended) # -o: Output prefix for peak files (required) # -g: Reference genome FASTA file (required for some features, good practice) # -a: Gene annotation GTF file (required for some features, good practice) # -p: P-value threshold (default 0.01) # -f: FDR threshold (default 0.05) python CLIPper.py \ -b "${INPUT_BAM}" \ -c "${CONTROL_BAM}" \ -o "${OUTPUT_PREFIX}" \ -g "${GENOME_FASTA}" \ -a "${GENES_GTF}" \ -p 0.01 \ -f 0.05 -
8
Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl
$ Bash example
# Clone the eCLIP repository to get the script # git clone https://github.com/yeolab/eclip.git # cd eclip # export PATH=$PATH:$(pwd)/bin # Define input and output files PEAK_FILE="eclip_peaks.bed" # Placeholder for the identified clusters/peaks SM_INPUT_BAM="sm_input.bam" # Placeholder for the size-matched input BAM file OUTPUT_FILE="enriched_clusters.bed" # Define enrichment parameters (placeholders, adjust as needed based on experimental design) MIN_OVERLAP_FRACTION=0.1 # Minimum fraction of peak overlapping with BAM reads MIN_READS_IN_OVERLAP=10 # Minimum reads in the overlapping region MIN_FOLD_ENRICHMENT=2.0 # Minimum fold enrichment over SMInput MIN_PVALUE=0.05 # Maximum p-value for enrichment MIN_LOG2_FOLD_ENRICHMENT=1.0 # Minimum log2 fold enrichment (equivalent to 2-fold) MIN_LOG10_PVALUE=1.3 # Minimum -log10(p-value) (equivalent to -log10(0.05)) # Execute the overlap_peakfi_with_bam.pl script overlap_peakfi_with_bam.pl \ "${PEAK_FILE}" \ "${SM_INPUT_BAM}" \ "${OUTPUT_FILE}" \ "${MIN_OVERLAP_FRACTION}" \ "${MIN_READS_IN_OVERLAP}" \ "${MIN_FOLD_ENRICHMENT}" \ "${MIN_PVALUE}" \ "${MIN_LOG2_FOLD_ENRICHMENT}" \ "${MIN_LOG10_PVALUE}" -
9
Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl
$ Bash example
# Install Perl if not already available # conda install -c conda-forge perl # Download the script from the eCLIP repository # wget https://raw.githubusercontent.com/yeolab/eclip/master/bin/compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl # chmod +x compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl # Define input and output files (placeholders for enriched peak files in BED format) # Replace 'replicate1_peaks.bed', 'replicate2_peaks.bed' with actual input peak files. # The script can take multiple input files. INPUT_PEAK_FILE_1="replicate1_peaks.bed" INPUT_PEAK_FILE_2="replicate2_peaks.bed" OUTPUT_MERGED_PEAKS="merged_replicate_peaks.bed" # Execute the script to merge overlapping enriched peaks from replicates perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl "${INPUT_PEAK_FILE_1}" "${INPUT_PEAK_FILE_2}" > "${OUTPUT_MERGED_PEAKS}"
Raw Source Text
R1 data was processed using the single ended eCLIP pipeline and available at: http://github.com/yeolab/eclip. Unique Molecular Identifiers (UMIs) were extracted from raw sequencing reads with umi_tools extract Post-umi-extracted reads were trimmed for adapter sequences and barcode sequences (eCLIP samples) using cutadapt. Trimmed reads were mapped against RepBase with STAR to remove reads mapping to repetitive sequences (--outFilterMultimapNmax 30 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All) Remaining reads were mapped to the appropriate genome build (Pa_P01 + PhiPA3) using STAR aligner (--outFilterMultimapNmax 1 --alignEndsType EndToEnd --outFilterMultimapScoreRange 1 --outSAMmode Full --outFilterType BySJout --outSAMtype BAM Unsorted --outFilterScoreMin 10 --outReadsUnmapped Fastx --outSAMattributes All) Uniquely mapped reads were removed of PCR duplicates with umi_tools Peak clusters were identified with CLIPper, available at: https://github.com/YeoLab/clipper Clusters enriched over corresponding size-matched input (SMInput) were identified using a custom Perl script, available in the main eCLIP repository as: overlap_peakfi_with_bam.pl Overlapping enriched clusters (peaks) were merged with a custom perl script, available in the main eCLIP repository as: compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl Assembly: PhiPA3 Assembly: Pa_PA01 Supplementary files format and content: compressed.bed files contain single replicate peaks (-log10 p-value, log2 fold enrichment above size-matched input in columns 4 and 5 respectively) Supplementary files format and content: *vs* files contain reproducible peaks using IDR