GSE284636 Processing Pipeline
Publication
The IFIT2-IFIT3 antiviral complex targets short 5' untranslated regions on viral mRNAs for translation inhibition. Nature Microbiology (2025), PMID 41093992
Dataset
GSE284636: Short 5' UTRs serve as a marker for mRNA translation inhibition by the IFIT2-IFIT3 antiviral complex
Processing Steps
1
Standard processing of eCLIP data was performed as previously described (Blue SB, et al. Nature Protocols 2022).
$ Bash example
# Install cwltool if not already installed
# pip install cwltool

# Placeholder paths for input files and reference genome; replace with your own.
# Reference files must be indexed/prepared as required by the workflow.
# This dataset was mapped to hg19 (293T) or mm10 (MEF) plus VSV, so the
# placeholders below use hg19; substitute mm10 paths for MEF samples.
FASTQ_R1="/path/to/your/sample_R1.fastq.gz"
FASTQ_R2="/path/to/your/sample_R2.fastq.gz"  # For single-end data, write 'fastq_r2: null' in the YAML instead
GENOME_FASTA="/path/to/your/reference/hg19.fa"
GENOME_GTF="/path/to/your/reference/hg19.gtf"
STAR_INDEX_DIR="/path/to/your/reference/STAR_index_hg19"
CHROM_SIZES="/path/to/your/reference/hg19.chrom.sizes"
BLACKLIST_BED="/path/to/your/reference/hg19_blacklist.bed"
OUTPUT_DIR="./eclip_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Create a CWL input YAML file for the eCLIP workflow.
# NOTE: the input keys below are illustrative; they must match the inputs
# declared by the workflow you actually run.
cat << EOF > eclip_inputs.yaml
fastq_r1:
  class: File
  path: ${FASTQ_R1}
fastq_r2:
  class: File
  path: ${FASTQ_R2}
genome_fasta:
  class: File
  path: ${GENOME_FASTA}
genome_gtf:
  class: File
  path: ${GENOME_GTF}
star_index:
  class: Directory
  path: ${STAR_INDEX_DIR}
chrom_sizes:
  class: File
  path: ${CHROM_SIZES}
blacklist_regions:
  class: File
  path: ${BLACKLIST_BED}
output_dir: ${OUTPUT_DIR}
EOF

# Execute the eCLIP CWL workflow using cwltool.
# Replace '/path/to/yeolab/eclip/workflow.cwl' with the actual path to the
# workflow file from the cloned yeolab/eclip repository.
cwltool --outdir "${OUTPUT_DIR}" /path/to/yeolab/eclip/workflow.cwl eclip_inputs.yaml
2
Mapping was performed to a custom genome index that included both the VSV genome and either hg19 (for 293T experiments) or mm10 (for MEF experiments), as described below.
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# CUSTOM_GENOME_INDEX_DIR should point to a pre-built STAR index created from
# the VSV genome combined with either hg19 (293T experiments) or mm10 (MEF experiments).
# Example for a 293T experiment (adjust paths and filenames as needed):
CUSTOM_GENOME_INDEX_DIR="/path/to/star_index_vsv_hg19"  # Placeholder for the custom genome index directory (e.g., VSV+hg19)
READ1_FASTQ="sample_R1.fastq.gz"  # Placeholder for input Read 1 FASTQ file
READ2_FASTQ="sample_R2.fastq.gz"  # Placeholder for input Read 2 FASTQ file (remove if single-end reads)
OUTPUT_PREFIX="aligned_sample"    # Prefix for output files
NUM_THREADS=8                     # Number of threads to use for alignment

# Perform alignment using STAR
# --genomeDir: path to the STAR genome index
# --readFilesIn: input FASTQ files (space-separated for paired-end, single file for single-end)
# --runThreadN: number of threads for parallel processing
# --outFileNamePrefix: prefix for all output files generated by STAR
# --outSAMtype BAM SortedByCoordinate: output a sorted BAM file
# --readFilesCommand zcat: decompress gzipped FASTQ files on the fly
STAR --genomeDir "${CUSTOM_GENOME_INDEX_DIR}" \
  --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
  --runThreadN "${NUM_THREADS}" \
  --outFileNamePrefix "${OUTPUT_PREFIX}_" \
  --outSAMtype BAM SortedByCoordinate \
  --readFilesCommand zcat
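The custom index described above needs a single reference FASTA that holds both the host and the viral sequence. The following is a minimal sketch of one way to build it, not the authors' documented procedure. The function name `combine_reference` and the file names are hypothetical, the VSV FASTA is assumed to contain one record, and the record is renamed to `chrVSV` only because that is the chromosome name used by the later bigWigToWig step.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write host records followed by the VSV record,
# with the VSV header replaced by ">chrVSV".
combine_reference() {
    local host_fa="$1" vsv_fa="$2" out_fa="$3"
    {
        # awk re-emits each line with a trailing newline, so a host file
        # lacking a final newline cannot fuse with the next header
        awk '{ print }' "$host_fa"
        # Replace the (single) viral header line; sequence lines pass through
        awk '/^>/ { print ">chrVSV"; next } { print }' "$vsv_fa"
    } > "$out_fa"
}

# Example (paths are placeholders):
# combine_reference hg19.fa vsv_gfp.fa hg19_and_vsv.fa
# STAR --runMode genomeGenerate --genomeDir hg19_vsv_genome_dir \
#      --genomeFastaFiles hg19_and_vsv.fa --runThreadN 8
```

Keeping the viral sequence as its own named record is what later allows VSV-mapped signal to be pulled out by chromosome name.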
3
UMI removal and adapter trimming (cutadapt): umi_tools extract --random-seed 1 --bc-pattern NNNNNNNNNN --stdin 23145FL-08-01-37_S37_L001_R1_001.fastq.gz --log /dev/null | cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 1 - --report minimal 2> EV225_IP.cut.adapt.log | cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 5 - -o EV225_IP.trim.gz --report minimal >>EV225_IP.cut.adapt.log
$ Bash example
# Reference dataset 'a_adapters.fasta' source is not specified.
# The commands are reproduced as deposited. Note that cutadapt reads adapters
# from a FASTA file via the 'file:' prefix (e.g. -a file:a_adapters.fasta);
# adjust if your cutadapt run requires it.
umi_tools extract --random-seed 1 --bc-pattern NNNNNNNNNN \
    --stdin 23145FL-08-01-37_S37_L001_R1_001.fastq.gz --log /dev/null | \
  cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 \
    -a a_adapters.fasta -O 1 - --report minimal 2> EV225_IP.cut.adapt.log | \
  cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 \
    -a a_adapters.fasta -O 5 - -o EV225_IP.trim.gz --report minimal >>EV225_IP.cut.adapt.log
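To make the effect of `--bc-pattern NNNNNNNNNN` concrete, here is a small illustrative sketch of what UMI extraction does to each FASTQ record: the first 10 bases are removed from the read and appended to the read ID, and the matching quality characters are dropped. This is a simplified stand-in written in awk, not the umi_tools implementation. The function name `extract_umi` is hypothetical, and four-line FASTQ records are assumed.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of UMI extraction on uncompressed FASTQ from stdin.
# Usage: extract_umi [umi_length] < reads.fastq
extract_umi() {
    local n="${1:-10}"
    awk -v n="$n" '
        NR % 4 == 1 { header = $0; next }          # hold the header line
        NR % 4 == 2 {
            umi = substr($0, 1, n)                 # leading n bases are the UMI
            sp = index(header, " ")
            if (sp > 0)                            # append _UMI to the read ID
                header = substr(header, 1, sp - 1) "_" umi substr(header, sp)
            else
                header = header "_" umi
            print header
            print substr($0, n + 1)                # read without the UMI
            next
        }
        NR % 4 == 3 { print; next }                # separator line
        NR % 4 == 0 { print substr($0, n + 1) }    # trim qualities to match
    '
}
```

Because the UMI travels in the read name, it survives alignment and is available to the later deduplication step.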
4
Repeat mapping (takes adapter-trimmed FASTQ files and maps them to a repeat database to filter repeat elements prior to genome mapping): STAR --runMode alignReads --runThreadN 8 --alignEndsType EndToEnd --genomeDir [mus_musculus_repbase_v2 or homo_sapiens_repbase_v2] --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix EV225_IP.repeat.map --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped None --outStd Log --readFilesCommand zcat --readFilesIn EV225_IP.trim.gz
$ Bash example
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
INPUT_FASTQ="EV225_IP.trim.gz"
OUTPUT_PREFIX="EV225_IP.repeat.map"

# Choose the appropriate repeat database genome directory:
# For Mus musculus: GENOME_DIR="mus_musculus_repbase_v2"
# For Homo sapiens: GENOME_DIR="homo_sapiens_repbase_v2"
GENOME_DIR="[mus_musculus_repbase_v2 or homo_sapiens_repbase_v2]"  # Placeholder - replace with actual path
THREADS=8

# Execute STAR command for repeat mapping
STAR --runMode alignReads \
  --runThreadN "${THREADS}" \
  --alignEndsType EndToEnd \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterMultimapNmax 100 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outSAMattrRGline ID:foo \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped None \
  --outStd Log \
  --readFilesCommand zcat \
  --readFilesIn "${INPUT_FASTQ}"
5
Unique genomic mapping (takes repeat-removed reads and maps them to the reference genome): STAR --runMode alignReads --runThreadN 8 --alignEndsType EndToEnd --genomeDir [hg19+VSV or mm10+VSV] --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix EV225_IP.genome.map --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped None --outStd Log --readFilesIn EV225_IP.repeat.mapUnmapped.out.mate1
$ Bash example
# Install STAR (example using conda)
# conda install -c bioconda star

# Placeholder for genome directory.
# This directory should contain the STAR-indexed genome (e.g., hg19 with VSV sequences).
# Example for genome generation:
# STAR --runMode genomeGenerate --genomeDir hg19_vsv_genome_dir --genomeFastaFiles hg19.fa vsv.fa --sjdbGTFfile annotation.gtf --runThreadN 8
GENOME_DIR="hg19_vsv_genome_dir"  # Replace with the actual path to your indexed genome (e.g., hg19+VSV or mm10+VSV)

# Input reads file (unmapped reads from the repeat-mapping step)
READS_FILE="EV225_IP.repeat.mapUnmapped.out.mate1"

# Output prefix
OUTPUT_PREFIX="EV225_IP.genome.map"

STAR \
  --runMode alignReads \
  --runThreadN 8 \
  --alignEndsType EndToEnd \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outSAMattrRGline ID:foo \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped None \
  --outStd Log \
  --readFilesIn "${READS_FILE}"
6
Sort mapped reads: samtools sort -@ 8 -m 2G -o EV225_IP.genome.map.bam *Aligned.out.bam
$ Bash example
# Install samtools if not already available
# conda install -c bioconda samtools

# Define input and output file paths.
# The description uses the glob '*Aligned.out.bam'. STAR appends 'Aligned.out.bam'
# to the output prefix, so the genome-mapping step with prefix
# 'EV225_IP.genome.map' writes the file named below.
INPUT_BAM="EV225_IP.genome.mapAligned.out.bam"
OUTPUT_BAM="EV225_IP.genome.map.bam"

# Sort mapped reads by coordinate
samtools sort -@ 8 -m 2G -o "${OUTPUT_BAM}" "${INPUT_BAM}"
7
Remove PCR duplicate reads: samtools index EV225_IP.genome.map.bam; umi_tools dedup --random-seed 1 --stdin EV225_IP.genome.map.bam --method unique --stdout EV225_IP.bam
$ Bash example
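# Index the sorted BAM (required by umi_tools dedup)
samtools index EV225_IP.genome.map.bam

# Collapse PCR duplicates using the UMIs stored in the read names
umi_tools dedup --random-seed 1 --stdin EV225_IP.genome.map.bam --method unique --stdout EV225_IP.bam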
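The `--method unique` setting treats reads as PCR duplicates only when they share a mapping position and an identical UMI, with no error tolerance between UMIs. The sketch below illustrates that grouping rule on a plain tab-separated table rather than a BAM file. It is a simplified illustration, not how umi_tools is implemented: the function name `dedup_unique` and the column layout (chrom, start, strand, UMI, read ID) are assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of "unique" UMI deduplication.
# Input (stdin): tab-separated chrom, start, strand, UMI, read ID.
# Output: the first read seen for each (chrom, start, strand, UMI) combination.
dedup_unique() {
    awk -F'\t' '!seen[$1 FS $2 FS $3 FS $4]++'
}

# Example:
# printf 'chrVSV\t100\t+\tAAAA\tr1\nchrVSV\t100\t+\tAAAA\tr2\n' | dedup_unique
```

Under this rule, two reads at the same position whose UMIs differ by a single base are both kept, which is the main difference from the network-based methods that umi_tools also offers.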
8
Make bigwigs: samtools index EV225_IP.bam; b2bw EV225_IP.bam --cpus 1
$ Bash example
# Install samtools
# conda install -c bioconda samtools
# Install deepTools
# conda install -c bioconda deeptools

# Index the BAM file, a prerequisite for bigWig generation
samtools index EV225_IP.bam

# Generate a bigWig file from the indexed BAM file.
# The description uses 'b2bw', which is inferred to be a custom script or alias;
# deepTools bamCoverage is used here as a stand-in, and the bin size and
# normalization below are common defaults, not parameters stated in the source.
bamCoverage -b EV225_IP.bam \
  -o EV225_IP.bigwig \
  --numberOfProcessors 1 \
  --binSize 10 \
  --normalizeUsing RPKM
9
Isolate VSV-mapped reads only: bigWigToWig EV225.pos.bw -chrom=chrVSV EV225.pos.bw.chrVSV.wig
$ Bash example
# Install UCSC Genome Browser utilities if not already installed
# conda install -c bioconda ucsc-bigwigtowig

# Export only the chrVSV signal from the bigWig as a wig file
bigWigToWig EV225.pos.bw -chrom=chrVSV EV225.pos.bw.chrVSV.wig
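After isolating the VSV signal, it can be useful to check whether the exported file contains any coverage at all. The sketch below sums signal over chrVSV. It assumes the exported file uses bedGraph-style four-column lines (chrom, start, end, value), which bigWigToWig commonly emits; files written in fixedStep or variableStep sections would need different parsing. The function name `vsv_signal_total` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check: total signal (value x interval length) on chrVSV.
# Comment lines and any line that is not a four-column chrVSV interval are skipped.
vsv_signal_total() {
    awk '
        /^#/ { next }
        NF == 4 && $1 == "chrVSV" { total += ($3 - $2) * $4 }
        END { print total + 0 }
    ' "$1"
}

# Example (file name taken from the step above):
# vsv_signal_total EV225.pos.bw.chrVSV.wig
```

A result of 0 indicates a sample with no VSV-mapped signal on that strand.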
10
Convert the VSV-only wig back to bigWig: wigToBigWig EV225.pos.bw.chrVSV.wig /path/to/hg19_and_vsv.chrom.sizes EV242.pos.bw.chrVSV.wig.bw
$ Bash example
# Install UCSC tools if not already available
# conda install -c bioconda ucsc-wigtobigwig

# Define input, reference, and output files
INPUT_WIG="EV225.pos.bw.chrVSV.wig"
# This file should contain chromosome names and their sizes for hg19 and VSV.
# A standard hg19.chrom.sizes can be obtained from UCSC, but 'hg19_and_vsv' implies a custom file.
CHROM_SIZES="/path/to/hg19_and_vsv.chrom.sizes"
OUTPUT_BIGWIG="EV242.pos.bw.chrVSV.wig.bw"

# Execute the wigToBigWig command
wigToBigWig "${INPUT_WIG}" "${CHROM_SIZES}" "${OUTPUT_BIGWIG}"
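wigToBigWig needs a two-column chromosome sizes file that also lists the viral sequence, and the source does not say how `hg19_and_vsv.chrom.sizes` was produced. One possible way to derive it from the combined reference FASTA is sketched below. The function name `fasta_to_chrom_sizes` and the input file name are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print "name<TAB>length" for every record in a FASTA file.
fasta_to_chrom_sizes() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len   # flush the previous record
            name = substr($1, 2)                  # ID = first token without ">"
            len = 0
            next
        }
        { len += length($0) }                     # accumulate sequence length
        END { if (name != "") print name "\t" len }
    ' "$1"
}

# Example (paths are placeholders):
# fasta_to_chrom_sizes hg19_and_vsv.fa > hg19_and_vsv.chrom.sizes
```

The chromosome names in this file must match those in the wig being converted, including `chrVSV`.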
Tools Used
Raw Source Text
Standard processing of eCLIP data was performed as previously described (Blue SB, et al. Nature Protocols 2022), with mapping performed to a custom genome index that included both the VSV genome and either hg19 (for 293T experiments) or mm10 (for MEF experiments) as described below.
UMI removal and cut adapt: umi_tools extract --random-seed 1 --bc-pattern NNNNNNNNNN --stdin 23145FL-08-01-37_S37_L001_R1_001.fastq.gz --log /dev/null | cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 1 - --report minimal 2> EV225_IP.cut.adapt.log | cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 5 - -o EV225_IP.trim.gz --report minimal >>EV225_IP.cut.adapt.log
Repeat mapping (takes adapter-trimmed fastq files, and maps to a repeat database to filter repeat elements prior to genome mapping: STAR --runMode alignReads --runThreadN 8 --alignEndsType EndToEnd --genomeDir [mus_musculus_repbase_v2 or homo_sapiens_repbase_v2] --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix EV225_IP.repeat.map --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped None --outStd Log --readFilesCommand zcat --readFilesIn EV225_IP.trim.gz
Unique genomic mapping (takes repeat-removed reads and maps to the reference genome: STAR --runMode alignReads --runThreadN 8 --alignEndsType EndToEnd --genomeDir [hg19+VSV or mm10+VSV] --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix EV225_IP.genome.map --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped None --outStd Log --readFilesIn EV225_IP.repeat.mapUnmapped.out.mate1
Sort mapped reads: samtools sort -@ 8 -m 2G -o EV225_IP.genome.map.bam *Aligned.out.bam
Remove PCR duplicate reads: samtools index EV225_IP.genome.map.bam umi_tools dedup --random-seed 1 --stdin EV225_IP.genome.map.bam --method unique --stdout EV225_IP.bam
Make bigwigs: samtools index EV225_IP.bam b2bw EV225_IP.bam --cpus 1
Isolate VSV-mapped reads only: bigWigToWig EV225.pos.bw -chrom=chrVSV EV225.pos.bw.chrVSV.wig wigToBigWig EV225.pos.bw.chrVSV.wig /path/to/hg19_and_vsv.chrom.sizes EV242.pos.bw.chrVSV.wig.bw
Assembly: hg19 (for 293T) or mm10 (for MEF) with the addition of VSV (using accession NC_001560.1 with GFP was inserted between G and L as described in PMID 10400792)
Supplementary files format and content: bw files contain read density (in bigwig format) for the VSV genome (positive or negative strand as indicated in filename). Bigwig files are not included for datasets with zero reads