GSE284636 Processing Pipeline

Publication

The IFIT2-IFIT3 antiviral complex targets short 5' untranslated regions on viral mRNAs for translation inhibition.

Nature microbiology (2025) — PMID 41093992

Dataset

Short 5' UTRs serve as a marker for mRNA translation inhibition by the IFIT2-IFIT3 antiviral complex

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Standard processing of eCLIP data was performed as previously described (Blue SB, et al. Nature Protocols 2022).

eCLIP CWL workflow (inferred with models/gemini-2.5-flash)

$ Bash example

```
# Install cwltool if not already installed
# pip install cwltool

# Define placeholder paths for input files and reference genome
# Replace these with your actual file paths and ensure reference files are indexed/prepared as required by the workflow.
# For human (hg38) reference, ensure you have the FASTA, GTF, STAR index, chromosome sizes, and blacklist regions.
FASTQ_R1="/path/to/your/sample_R1.fastq.gz"
FASTQ_R2="/path/to/your/sample_R2.fastq.gz" # Set to "null" if single-end, e.g., FASTQ_R2="null"
GENOME_FASTA="/path/to/your/reference/hg38.fa"
GENOME_GTF="/path/to/your/reference/hg38.gtf"
STAR_INDEX_DIR="/path/to/your/reference/STAR_index_hg38"
CHROM_SIZES="/path/to/your/reference/hg38.chrom.sizes"
BLACKLIST_BED="/path/to/your/reference/hg38_blacklist.bed"
OUTPUT_DIR="./eclip_output"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Create a CWL input YAML file for the eCLIP workflow
# Note: If FASTQ_R2 is "null", the entry for fastq_r2 in the YAML should be 'fastq_r2: null'
# Note: The input names below (fastq_r1, star_index, ...) are placeholders; check them
# against the inputs declared in the workflow file you actually run.
cat << EOF > eclip_inputs.yaml
fastq_r1:
  class: File
  path: ${FASTQ_R1}
fastq_r2:
  class: File
  path: ${FASTQ_R2}
genome_fasta:
  class: File
  path: ${GENOME_FASTA}
genome_gtf:
  class: File
  path: ${GENOME_GTF}
star_index:
  class: Directory
  path: ${STAR_INDEX_DIR}
chrom_sizes:
  class: File
  path: ${CHROM_SIZES}
blacklist_regions:
  class: File
  path: ${BLACKLIST_BED}
output_dir: ${OUTPUT_DIR}
EOF

# Execute the eCLIP CWL workflow using cwltool
# Replace '/path/to/yeolab/eclip/workflow.cwl' with the actual path to the workflow.cwl file
# from the cloned yeolab/eclip repository.
cwltool --outdir "${OUTPUT_DIR}" /path/to/yeolab/eclip/workflow.cwl eclip_inputs.yaml
```

Mapping was performed to a custom genome index that included both the VSV genome and either hg19 (for 293T experiments) or mm10 (for MEF experiments), as described below.

STAR v2.7.10a (inferred with models/gemini-2.5-flash)

$ Bash example

```
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables for input and output
# CUSTOM_GENOME_INDEX_DIR should point to the pre-built STAR index
# This index would have been created from a combination of the VSV genome and either hg19 (for 293T experiments) or mm10 (for MEF experiments).
# Example for a 293T experiment (adjust paths and filenames as needed):
CUSTOM_GENOME_INDEX_DIR="/path/to/star_index_vsv_hg19" # Placeholder for the custom genome index directory (e.g., VSV+hg19)
READ1_FASTQ="sample_R1.fastq.gz" # Placeholder for input Read 1 FASTQ file
READ2_FASTQ="sample_R2.fastq.gz" # Placeholder for input Read 2 FASTQ file (remove if single-end reads)
OUTPUT_PREFIX="aligned_sample" # Prefix for output files
NUM_THREADS=8 # Number of threads to use for alignment

# Perform alignment using STAR
# --genomeDir: Path to the STAR genome index
# --readFilesIn: Input FASTQ files (space-separated for paired-end, single file for single-end)
# --runThreadN: Number of threads for parallel processing
# --outFileNamePrefix: Prefix for all output files generated by STAR
# --outSAMtype BAM SortedByCoordinate: Output a sorted BAM file
# --readFilesCommand zcat: Command to decompress gzipped FASTQ files on-the-fly
STAR --genomeDir "${CUSTOM_GENOME_INDEX_DIR}" \
     --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
     --runThreadN "${NUM_THREADS}" \
     --outFileNamePrefix "${OUTPUT_PREFIX}_" \
     --outSAMtype BAM SortedByCoordinate \
     --readFilesCommand zcat
```
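
The custom index combines a host assembly with the VSV genome, and later steps address the viral contig as chrVSV. The following is a minimal sketch of assembling such a combined FASTA before indexing it with STAR. The file names and toy sequences are hypothetical, and the source does not give the authors' actual viral sequence (which carries a GFP insertion) or their build procedure.

```shell
# Toy stand-ins for the real references (hypothetical file names and sequences)
printf '>chr1\nACGTACGT\n' > host.fa
printf '>NC_001560.1\nGGGGCCCC\n' > vsv.fa

# Rename the single viral contig to chrVSV and append it to the host FASTA
sed 's/^>.*/>chrVSV/' vsv.fa | cat host.fa - > host_plus_vsv.fa

# List the contig names in the combined FASTA
grep '^>' host_plus_vsv.fa
```

The sed expression rewrites every header line, so it assumes the viral FASTA holds exactly one contig.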

UMI removal and cut adapt: umi_tools extract --random-seed 1 --bc-pattern NNNNNNNNNN --stdin 23145FL-08-01-37_S37_L001_R1_001.fastq.gz --log /dev/null | cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 1 - --report minimal 2> EV225_IP.cut.adapt.log | cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 5 - -o EV225_IP.trim.gz --report minimal >>EV225_IP.cut.adapt.log

cutadapt v4.0
$ Bash example
```
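# The source of 'a_adapters.fasta' is not specified. Note that cutadapt reads adapters from a
# FASTA file only when given the 'file:' prefix (-a file:a_adapters.fasta); the commands below
# are kept as deposited.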
umi_tools extract --random-seed 1 --bc-pattern NNNNNNNNNN --stdin 23145FL-08-01-37_S37_L001_R1_001.fastq.gz --log /dev/null | \
cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 1 - --report minimal 2> EV225_IP.cut.adapt.log | \
cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 5 - -o EV225_IP.trim.gz --report minimal >>EV225_IP.cut.adapt.log
```
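
To illustrate what `--bc-pattern NNNNNNNNNN` does, here is a minimal awk sketch. It is not the umi_tools implementation: it only mimics the default behaviour of moving the first 10 bases of each read into the read name after an underscore and trimming the matching quality characters. The toy record and the file names `toy.fastq` and `toy.umi.fastq` are made up for illustration.

```shell
# Toy FASTQ record (hypothetical data): a 10-nt UMI followed by 8 nt of insert
printf '@read1 1:N:0\nACGTACGTACTTTTGGGG\n+\nIIIIIIIIIIJJJJJJJJ\n' > toy.fastq

# Move the first n bases into the read name and trim sequence and qualities
awk -v n=10 '
  NR % 4 == 1 { header = $0; next }              # hold the header until the UMI is known
  NR % 4 == 2 { umi = substr($0, 1, n)
                split(header, parts, " ")
                rest = substr(header, length(parts[1]) + 1)
                print parts[1] "_" umi rest       # read ID, underscore, UMI, then any description
                print substr($0, n + 1); next }
  NR % 4 == 0 { print substr($0, n + 1); next }   # trim qualities to match the sequence
  { print }                                        # "+" separator line
' toy.fastq > toy.umi.fastq

cat toy.umi.fastq
```

Carrying the UMI in the read name is what lets the later deduplication step recover it from the aligned BAM.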

Repeat mapping (takes adapter-trimmed fastq files and maps them to a repeat database to filter repeat elements prior to genome mapping): STAR --runMode alignReads --runThreadN 8 --alignEndsType EndToEnd --genomeDir [mus_musculus_repbase_v2 or homo_sapiens_repbase_v2] --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix EV225_IP.repeat.map --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped None --outStd Log --readFilesCommand zcat --readFilesIn EV225_IP.trim.gz

STAR (version not specified; inferred with models/gemini-2.5-flash)

$ Bash example

```
# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables
INPUT_FASTQ="EV225_IP.trim.gz"
OUTPUT_PREFIX="EV225_IP.repeat.map"
# Choose the appropriate repeat database genome directory:
# For Mus musculus: GENOME_DIR="mus_musculus_repbase_v2"
# For Homo sapiens: GENOME_DIR="homo_sapiens_repbase_v2"
GENOME_DIR="[mus_musculus_repbase_v2 or homo_sapiens_repbase_v2]" # Placeholder - replace with actual path
THREADS=8

# Execute STAR command for repeat mapping
STAR --runMode alignReads \
     --runThreadN "${THREADS}" \
     --alignEndsType EndToEnd \
     --genomeDir "${GENOME_DIR}" \
     --genomeLoad NoSharedMemory \
     --outBAMcompression 10 \
     --outFileNamePrefix "${OUTPUT_PREFIX}" \
     --outFilterMultimapNmax 100 \
     --outFilterMultimapScoreRange 1 \
     --outFilterScoreMin 10 \
     --outFilterType BySJout \
     --outReadsUnmapped Fastx \
     --outSAMattrRGline ID:foo \
     --outSAMattributes All \
     --outSAMmode Full \
     --outSAMtype BAM Unsorted \
     --outSAMunmapped None \
     --outStd Log \
     --readFilesCommand zcat \
     --readFilesIn "${INPUT_FASTQ}"
```

Unique genomic mapping (takes repeat-removed reads and maps them to the reference genome): STAR --runMode alignReads --runThreadN 8 --alignEndsType EndToEnd --genomeDir [hg19+VSV or mm10+VSV] --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix EV225_IP.genome.map --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped None --outStd Log --readFilesIn EV225_IP.repeat.mapUnmapped.out.mate1

STAR v2.7.9a

$ Bash example

```
# Install STAR (example using conda)
# conda install -c bioconda star

# Placeholder for genome directory.
# This directory should contain the STAR-indexed genome (e.g., hg19 with VSV sequences).
# Example for genome generation:
# STAR --runMode genomeGenerate --genomeDir hg19_vsv_genome_dir --genomeFastaFiles hg19.fa vsv.fa --sjdbGTFfile annotation.gtf --runThreadN 8
GENOME_DIR="hg19_vsv_genome_dir" # Replace with the actual path to your indexed genome (e.g., hg19+VSV or mm10+VSV)

# Input reads file
READS_FILE="EV225_IP.repeat.mapUnmapped.out.mate1"

# Output prefix
OUTPUT_PREFIX="EV225_IP.genome.map"

STAR \
  --runMode alignReads \
  --runThreadN 8 \
  --alignEndsType EndToEnd \
  --genomeDir "${GENOME_DIR}" \
  --genomeLoad NoSharedMemory \
  --outBAMcompression 10 \
  --outFileNamePrefix "${OUTPUT_PREFIX}" \
  --outFilterMultimapNmax 1 \
  --outFilterMultimapScoreRange 1 \
  --outFilterScoreMin 10 \
  --outFilterType BySJout \
  --outReadsUnmapped Fastx \
  --outSAMattrRGline ID:foo \
  --outSAMattributes All \
  --outSAMmode Full \
  --outSAMtype BAM Unsorted \
  --outSAMunmapped None \
  --outStd Log \
  --readFilesIn "${READS_FILE}"
```

Sort mapped reads: samtools sort -@ 8 -m 2G -o EV225_IP.genome.map.bam *Aligned.out.bam

samtools v1.9

$ Bash example

```
# Install samtools if not already available
# conda install -c bioconda samtools

# Define input and output file paths
# Replace 'sample_name_Aligned.out.bam' with your actual input BAM file.
# The description uses '*Aligned.out.bam' as a placeholder for the input file.
INPUT_BAM="sample_name_Aligned.out.bam"
OUTPUT_BAM="EV225_IP.genome.map.bam"

# Sort mapped reads
samtools sort -@ 8 -m 2G -o "${OUTPUT_BAM}" "${INPUT_BAM}"
```

Remove PCR duplicate reads: samtools index EV225_IP.genome.map.bam; umi_tools dedup --random-seed 1 --stdin EV225_IP.genome.map.bam --method unique --stdout EV225_IP.bam

UMI-tools (version not specified; inferred with models/gemini-2.5-flash)
$ Bash example
```
samtools index EV225_IP.genome.map.bam
umi_tools dedup --random-seed 1 --stdin EV225_IP.genome.map.bam --method unique --stdout EV225_IP.bam
```
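
With `--method unique`, reads are treated as duplicates only when they share both a mapping position and an identical UMI, with no error correction between similar UMIs. The following toy sketch illustrates that grouping rule on a small table; it is not how umi_tools reads BAM files, and the table layout and file names are hypothetical.

```shell
# Toy table (hypothetical): contig, 5' position, strand, UMI for five reads
printf 'chrVSV\t100\t+\tAAAA\nchrVSV\t100\t+\tAAAA\nchrVSV\t100\t+\tAAAT\nchrVSV\t250\t+\tAAAA\nchrVSV\t100\t+\tAAAA\n' > toy_reads.tsv

# Keep one read per unique (contig, position, strand, UMI) combination
sort -u toy_reads.tsv > toy_dedup.tsv

# Number of reads remaining after deduplication
wc -l < toy_dedup.tsv
```

The three copies at position 100 with UMI AAAA collapse to one read, while the read with UMI AAAT is kept as distinct even though it differs by a single base.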

Make bigwigs: samtools index EV225_IP.bam; b2bw EV225_IP.bam --cpus 1

samtools; deepTools (versions not specified; the listed v3.5.1 matches no samtools release and may refer to deepTools)

$ Bash example

```
# Install samtools
# conda install -c bioconda samtools

# Install deepTools
# conda install -c bioconda deeptools

# Index the BAM file, a prerequisite for bigWig generation
samtools index EV225_IP.bam

# Generate a bigWig file from the indexed BAM file.
# The description uses 'b2bw', which is not a standard tool; it is assumed here to be a
# custom script or alias, and deepTools bamCoverage is substituted with common parameters.
# Note: the next step reads a strand-specific file (EV225.pos.bw), which this unstranded
# sketch does not produce. bamCoverage's --filterRNAstrand option can split coverage by
# strand, but which setting corresponds to "pos" depends on the library protocol.
bamCoverage -b EV225_IP.bam \
            -o EV225_IP.bigwig \
            --numberOfProcessors 1 \
            --binSize 10 \
            --normalizeUsing RPKM
```

Isolate VSV-mapped reads only: bigWigToWig EV225.pos.bw -chrom=chrVSV EV225.pos.bw.chrVSV.wig

bigWigToWig (UCSC Genome Browser utilities; version not specified; inferred with models/gemini-2.5-flash)
$ Bash example
```
# Install UCSC Genome Browser utilities if not already installed
# conda install -c bioconda ucsc-bigwigtowig

bigWigToWig EV225.pos.bw -chrom=chrVSV EV225.pos.bw.chrVSV.wig
```
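
The dataset notes that bigWig files are not included for samples with zero reads, so it can help to check the extracted wig before converting it back. The following is a hedged sketch of such a check, not part of the deposited pipeline. The toy file name and values are hypothetical, and the header patterns assume the bedGraph-style or step-style text that bigWigToWig typically writes.

```shell
# Toy wig in the bedGraph-style layout bigWigToWig can emit (hypothetical values)
printf '#bedGraph section chrVSV:0-20\nchrVSV\t0\t10\t3\nchrVSV\t10\t20\t5\n' > toy.chrVSV.wig

# Count data lines, ignoring comment, track, browser, and step-declaration lines
n_data=$(grep -Evc '^(#|track|browser|fixedStep|variableStep)' toy.chrVSV.wig || true)

if [ "$n_data" -gt 0 ]; then
  echo "toy.chrVSV.wig has ${n_data} data lines; convert with wigToBigWig"
else
  echo "toy.chrVSV.wig has no signal; skip conversion"
fi
```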

wigToBigWig EV225.pos.bw.chrVSV.wig /path/to/hg19_and_vsv.chrom.sizes EV242.pos.bw.chrVSV.wig.bw

UCSC tools (version not specified)

$ Bash example

```
# Install UCSC tools if not already available
# conda install -c bioconda ucsc-wigtobigwig

# Define input, reference, and output files.
# CHROM_SIZES should list chromosome names and sizes for hg19 and VSV. A standard
# hg19.chrom.sizes can be obtained from UCSC, but 'hg19_and_vsv' implies a custom file.
INPUT_WIG="EV225.pos.bw.chrVSV.wig"
CHROM_SIZES="/path/to/hg19_and_vsv.chrom.sizes"
OUTPUT_BIGWIG="EV242.pos.bw.chrVSV.wig.bw"

# Execute the wigToBigWig command
wigToBigWig "${INPUT_WIG}" "${CHROM_SIZES}" "${OUTPUT_BIGWIG}"
```
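
wigToBigWig needs a chrom.sizes file that lists every contig of the combined genome, including chrVSV. The source does not say how the authors produced theirs; one possible approach, sketched here with hypothetical file names and toy sequences, is to derive it from the combined FASTA with awk.

```shell
# Toy combined FASTA (hypothetical sequences; real contigs are far longer)
printf '>chr1\nACGTACGT\nACGT\n>chrVSV\nGGGGCC\n' > toy_genome.fa

# Sum sequence length per contig and print "name<TAB>length"
awk '
  /^>/ { if (name != "") print name "\t" len     # flush the previous contig
         name = substr($1, 2); len = 0; next }   # contig name is the first word after ">"
  { len += length($0) }                          # accumulate bases across wrapped lines
  END  { if (name != "") print name "\t" len }   # flush the last contig
' toy_genome.fa > toy_genome.chrom.sizes

cat toy_genome.chrom.sizes
```

Running `samtools faidx` on the FASTA and taking the first two columns of the resulting `.fai` index is a common alternative.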

Tools Used

eCLIP, STAR, UCSC tools

Raw Source Text

Standard processing of eCLIP data was performed as previously described (Blue SB, et al. Nature Protocols 2022), with mapping performed to a custom genome index that included both the VSV genome and either hg19 (for 293T experiments) or mm10 (for MEF experiments) as described below.
UMI removal and cut adapt:  umi_tools extract   --random-seed 1   --bc-pattern NNNNNNNNNN   --stdin 23145FL-08-01-37_S37_L001_R1_001.fastq.gz   --log /dev/null | cutadapt   -j 8   --match-read-wildcards   --times 1   -e 0.1   --quality-cutoff 6   -m 18   -a a_adapters.fasta   -O 1   -   --report minimal   2> EV225_IP.cut.adapt.log | cutadapt   -j 8  --match-read-wildcards   --times 1   -e 0.1   --quality-cutoff 6   -m 18   -a a_adapters.fasta   -O 5   -   -o EV225_IP.trim.gz   --report minimal   >>EV225_IP.cut.adapt.log
Repeat mapping (takes adapter-trimmed fastq files, and maps to a repeat database to filter repeat elements prior to genome mapping:     STAR     --runMode alignReads     --runThreadN 8     --alignEndsType EndToEnd     --genomeDir  [mus_musculus_repbase_v2 or homo_sapiens_repbase_v2]   --genomeLoad NoSharedMemory     --outBAMcompression 10     --outFileNamePrefix EV225_IP.repeat.map     --outFilterMultimapNmax 100     --outFilterMultimapScoreRange 1     --outFilterScoreMin 10     --outFilterType BySJout     --outReadsUnmapped Fastx    --outSAMattrRGline ID:foo     --outSAMattributes All     --outSAMmode Full     --outSAMtype BAM Unsorted     --outSAMunmapped None     --outStd Log     --readFilesCommand zcat     --readFilesIn EV225_IP.trim.gz
Unique genomic mapping (takes repeat-removed reads and maps to the reference genome: STAR   --runMode alignReads   --runThreadN 8   --alignEndsType EndToEnd   --genomeDir [hg19+VSV or mm10+VSV] --genomeLoad NoSharedMemory   --outBAMcompression 10   --outFileNamePrefix EV225_IP.genome.map   --outFilterMultimapNmax 1   --outFilterMultimapScoreRange 1   --outFilterScoreMin 10   --outFilterType BySJout   --outReadsUnmapped Fastx   --outSAMattrRGline ID:foo   --outSAMattributes All   --outSAMmode Full   --outSAMtype BAM Unsorted   --outSAMunmapped None   --outStd Log   --readFilesIn EV225_IP.repeat.mapUnmapped.out.mate1
Sort mapped reads: samtools sort -@ 8 -m 2G -o EV225_IP.genome.map.bam *Aligned.out.bam
Remove PCR duplicate reads: samtools index EV225_IP.genome.map.bam umi_tools dedup   --random-seed 1   --stdin EV225_IP.genome.map.bam   --method unique   --stdout EV225_IP.bam
Make bigwigs: samtools index EV225_IP.bam b2bw EV225_IP.bam  --cpus 1
Isolate VSV-mapped reads only: bigWigToWig EV225.pos.bw -chrom=chrVSV EV225.pos.bw.chrVSV.wig
wigToBigWig EV225.pos.bw.chrVSV.wig /path/to/hg19_and_vsv.chrom.sizes EV242.pos.bw.chrVSV.wig.bw
Assembly: hg19 (for 293T) or mm10 (for MEF) with the addition of VSV (using accession NC_001560.1 with GFP was inserted between G and L as described in PMID 10400792)
Supplementary files format and content: bw files contain read density (in bigwig format) for the VSV genome (positive or negative strand as indicated in filename). Bigwig files are not included for datasets with zero reads
