GSE284636 Processing Pipeline

Publication

The IFIT2-IFIT3 antiviral complex targets short 5' untranslated regions on viral mRNAs for translation inhibition.

Nature Microbiology (2025), PMID 41093992

Dataset

GSE284636

Short 5’ UTRs serve as a marker for mRNA translation inhibition by the IFIT2-IFIT3 antiviral complex

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1.

    Standard processing of eCLIP data was performed as previously described (Blue SB, et al. Nature Protocols 2022).

    eCLIP (CWL workflow; Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install cwltool if not already installed
    # pip install cwltool
    
    # Define placeholder paths for input files and reference genome
    # Replace these with your actual file paths and ensure reference files are indexed/prepared as required by the workflow.
    # This study mapped to hg19 (293T) or mm10 (MEF) combined with the VSV genome; the hg19+VSV names below are placeholders.
    FASTQ_R1="/path/to/your/sample_R1.fastq.gz"
    FASTQ_R2="/path/to/your/sample_R2.fastq.gz" # Paired-end only; the commands in the later steps use a single read file
    GENOME_FASTA="/path/to/your/reference/hg19_vsv.fa"
    GENOME_GTF="/path/to/your/reference/hg19_vsv.gtf"
    STAR_INDEX_DIR="/path/to/your/reference/STAR_index_hg19_vsv"
    CHROM_SIZES="/path/to/your/reference/hg19_and_vsv.chrom.sizes"
    BLACKLIST_BED="/path/to/your/reference/hg19_blacklist.bed"
    OUTPUT_DIR="./eclip_output"
    
    # Create output directory if it doesn't exist
    mkdir -p "${OUTPUT_DIR}"
    
    # Create a CWL input YAML file for the eCLIP workflow
    # The input keys below are illustrative placeholders; check them against the inputs declared by the workflow you run.
    # For single-end data, replace the whole fastq_r2 entry with 'fastq_r2: null' rather than setting its path to "null".
    cat << EOF > eclip_inputs.yaml
    fastq_r1:
      class: File
      path: ${FASTQ_R1}
    fastq_r2:
      class: File
      path: ${FASTQ_R2}
    genome_fasta:
      class: File
      path: ${GENOME_FASTA}
    genome_gtf:
      class: File
      path: ${GENOME_GTF}
    star_index:
      class: Directory
      path: ${STAR_INDEX_DIR}
    chrom_sizes:
      class: File
      path: ${CHROM_SIZES}
    blacklist_regions:
      class: File
      path: ${BLACKLIST_BED}
    output_dir: ${OUTPUT_DIR}
    EOF
    
    # Execute the eCLIP CWL workflow using cwltool
    # Replace '/path/to/yeolab/eclip/workflow.cwl' with the actual path to the workflow.cwl file
    # from the cloned yeolab/eclip repository.
    cwltool --outdir "${OUTPUT_DIR}" /path/to/yeolab/eclip/workflow.cwl eclip_inputs.yaml
    
  2.

    Mapping was performed to a custom genome index that included both the VSV genome and either hg19 (for 293T experiments) or mm10 (for MEF experiments), as described below.

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables for input and output
    # CUSTOM_GENOME_INDEX_DIR should point to the pre-built STAR index
    # This index would have been created from a combination of the VSV genome and either hg19 (for 293T experiments) or mm10 (for MEF experiments).
    # Example for a 293T experiment (adjust paths and filenames as needed):
    CUSTOM_GENOME_INDEX_DIR="/path/to/star_index_vsv_hg19" # Placeholder for the custom genome index directory (e.g., VSV+hg19)
    READ1_FASTQ="sample_R1.fastq.gz" # Placeholder for input Read 1 FASTQ file
    READ2_FASTQ="sample_R2.fastq.gz" # Placeholder for input Read 2 FASTQ file (remove if single-end reads)
    OUTPUT_PREFIX="aligned_sample" # Prefix for output files
    NUM_THREADS=8 # Number of threads to use for alignment
    
    # Perform alignment using STAR
    # --genomeDir: Path to the STAR genome index
    # --readFilesIn: Input FASTQ files (space-separated for paired-end, single file for single-end)
    # --runThreadN: Number of threads for parallel processing
    # --outFileNamePrefix: Prefix for all output files generated by STAR
    # --outSAMtype BAM SortedByCoordinate: Output a sorted BAM file
    # --readFilesCommand zcat: Command to decompress gzipped FASTQ files on-the-fly
    STAR --genomeDir "${CUSTOM_GENOME_INDEX_DIR}" \
         --readFilesIn "${READ1_FASTQ}" "${READ2_FASTQ}" \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}_" \
         --outSAMtype BAM SortedByCoordinate \
         --readFilesCommand zcat
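
The source states that the index combined the VSV genome with hg19 or mm10, and later commands address the viral contig as chrVSV, but it does not say how the combined FASTA was assembled. One plausible way, sketched below with only cat and awk, is to append the viral sequence to the host FASTA under that contig name before building the STAR index. The helper name make_combined_fasta and all file names are hypothetical, and the sketch assumes a single-record viral FASTA that already contains the GFP insertion mentioned in the dataset notes.

```shell
# Hypothetical helper (not from the source): build a host+virus FASTA for indexing.
# Usage: make_combined_fasta HOST.fa VIRUS.fa CONTIG_NAME OUT.fa
make_combined_fasta() {
  host_fa="$1"; virus_fa="$2"; contig="$3"; out_fa="$4"
  # Host records are copied unchanged
  cat "$host_fa" > "$out_fa"
  # Each viral header line is replaced by ">CONTIG_NAME"; sequence lines pass through
  awk -v name="$contig" '/^>/ { print ">" name; next } { print }' "$virus_fa" >> "$out_fa"
}

# Tiny demonstration with made-up sequences in a temporary directory
demo_dir="$(mktemp -d)"
printf '>chr1\nACGTACGT\n' > "$demo_dir/host.fa"
printf '>NC_001560.1 example header\nGGGGCCCC\n' > "$demo_dir/vsv.fa"
make_combined_fasta "$demo_dir/host.fa" "$demo_dir/vsv.fa" chrVSV "$demo_dir/host_vsv.fa"
grep '^>' "$demo_dir/host_vsv.fa"   # lists the contig names in the combined file
```

The combined file would then be given to STAR --runMode genomeGenerate through --genomeFastaFiles.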
    
  3.

    UMI removal and adapter trimming (cutadapt): umi_tools extract --random-seed 1 --bc-pattern NNNNNNNNNN --stdin 23145FL-08-01-37_S37_L001_R1_001.fastq.gz --log /dev/null | cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 1 - --report minimal 2> EV225_IP.cut.adapt.log | cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 5 - -o EV225_IP.trim.gz --report minimal >>EV225_IP.cut.adapt.log

    cutadapt v4.0 GitHub
    $ Bash example
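    # The source of 'a_adapters.fasta' is not specified.
    # The command is reproduced as given. cutadapt reads adapters from a FASTA file only when the
    # argument is written as '-a file:a_adapters.fasta'; adjust if your adapter list is a FASTA file.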
    umi_tools extract --random-seed 1 --bc-pattern NNNNNNNNNN --stdin 23145FL-08-01-37_S37_L001_R1_001.fastq.gz --log /dev/null | \
    cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 1 - --report minimal 2> EV225_IP.cut.adapt.log | \
    cutadapt -j 8 --match-read-wildcards --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a a_adapters.fasta -O 5 - -o EV225_IP.trim.gz --report minimal >>EV225_IP.cut.adapt.log
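
To make the --bc-pattern NNNNNNNNNN argument concrete: the ten N characters tell umi_tools extract to treat the first 10 bases of each read as the UMI, remove them from the sequence and quality strings, and append them to the read name. The awk function below only illustrates that transformation on an uncompressed FASTQ stream. The name extract_umi_sketch and the example read are made up, and the function is not a substitute for umi_tools.

```shell
# Illustrative only: mimic what a 10-nt UMI pattern does to each FASTQ record.
# extract_umi_sketch is a hypothetical helper, not part of umi_tools.
# Usage: extract_umi_sketch UMI_LENGTH < in.fastq > out.fastq
extract_umi_sketch() {
  awk -v n="$1" '
    NR % 4 == 1 { header = $0 }                 # @name [description]
    NR % 4 == 2 {
      umi = substr($0, 1, n)                    # first n bases are the UMI
      seq = substr($0, n + 1)                   # the rest is the insert
      sp = index(header, " ")                   # append UMI to the read ID only
      if (sp > 0) header = substr(header, 1, sp - 1) "_" umi substr(header, sp)
      else        header = header "_" umi
      print header
      print seq
    }
    NR % 4 == 3 { print "+" }
    NR % 4 == 0 { print substr($0, n + 1) }     # trim qualities to match the sequence
  '
}

# Tiny demonstration on a made-up read (10-nt UMI followed by an 8-nt insert)
printf '@read1\nACGTACGTACTTTTGGGG\n+\nIIIIIIIIIIJJJJJJJJ\n' | extract_umi_sketch 10
```

Carrying the UMI in the read name is what lets the later deduplication step recover it from the aligned BAM.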
  4.

    Repeat mapping (takes adapter-trimmed FASTQ files and maps them to a repeat database to filter repeat elements prior to genome mapping): STAR --runMode alignReads --runThreadN 8 --alignEndsType EndToEnd --genomeDir [mus_musculus_repbase_v2 or homo_sapiens_repbase_v2] --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix EV225_IP.repeat.map --outFilterMultimapNmax 100 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped None --outStd Log --readFilesCommand zcat --readFilesIn EV225_IP.trim.gz

    STAR (Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star
    
    # Define variables
    INPUT_FASTQ="EV225_IP.trim.gz"
    OUTPUT_PREFIX="EV225_IP.repeat.map"
    # Choose the repeat database index that matches the sample species:
    #   Mus musculus (MEF):  mus_musculus_repbase_v2
    #   Homo sapiens (293T): homo_sapiens_repbase_v2
    GENOME_DIR="homo_sapiens_repbase_v2" # Placeholder - replace with the actual path to the STAR index
    THREADS=8
    
    # Execute STAR command for repeat mapping
    STAR --runMode alignReads \
         --runThreadN "${THREADS}" \
         --alignEndsType EndToEnd \
         --genomeDir "${GENOME_DIR}" \
         --genomeLoad NoSharedMemory \
         --outBAMcompression 10 \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outFilterMultimapNmax 100 \
         --outFilterMultimapScoreRange 1 \
         --outFilterScoreMin 10 \
         --outFilterType BySJout \
         --outReadsUnmapped Fastx \
         --outSAMattrRGline ID:foo \
         --outSAMattributes All \
         --outSAMmode Full \
         --outSAMtype BAM Unsorted \
         --outSAMunmapped None \
         --outStd Log \
         --readFilesCommand zcat \
         --readFilesIn "${INPUT_FASTQ}"
  5.

    Unique genomic mapping (takes repeat-removed reads and maps them to the reference genome): STAR --runMode alignReads --runThreadN 8 --alignEndsType EndToEnd --genomeDir [hg19+VSV or mm10+VSV] --genomeLoad NoSharedMemory --outBAMcompression 10 --outFileNamePrefix EV225_IP.genome.map --outFilterMultimapNmax 1 --outFilterMultimapScoreRange 1 --outFilterScoreMin 10 --outFilterType BySJout --outReadsUnmapped Fastx --outSAMattrRGline ID:foo --outSAMattributes All --outSAMmode Full --outSAMtype BAM Unsorted --outSAMunmapped None --outStd Log --readFilesIn EV225_IP.repeat.mapUnmapped.out.mate1

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Placeholder for genome directory.
    # This directory should contain the STAR-indexed genome (e.g., hg19 with VSV sequences).
    # Example for genome generation:
    # STAR --runMode genomeGenerate --genomeDir hg19_vsv_genome_dir --genomeFastaFiles hg19.fa vsv.fa --sjdbGTFfile annotation.gtf --runThreadN 8
    GENOME_DIR="hg19_vsv_genome_dir" # Replace with the actual path to your indexed genome (e.g., hg19+VSV or mm10+VSV)
    
    # Input reads file
    READS_FILE="EV225_IP.repeat.mapUnmapped.out.mate1"
    
    # Output prefix
    OUTPUT_PREFIX="EV225_IP.genome.map"
    
    STAR \
      --runMode alignReads \
      --runThreadN 8 \
      --alignEndsType EndToEnd \
      --genomeDir "${GENOME_DIR}" \
      --genomeLoad NoSharedMemory \
      --outBAMcompression 10 \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outFilterMultimapNmax 1 \
      --outFilterMultimapScoreRange 1 \
      --outFilterScoreMin 10 \
      --outFilterType BySJout \
      --outReadsUnmapped Fastx \
      --outSAMattrRGline ID:foo \
      --outSAMattributes All \
      --outSAMmode Full \
      --outSAMtype BAM Unsorted \
      --outSAMunmapped None \
      --outStd Log \
      --readFilesIn "${READS_FILE}"
  6.

    Sort mapped reads: samtools sort -@ 8 -m 2G -o EV225_IP.genome.map.bam *Aligned.out.bam

    samtools v1.9 GitHub
    $ Bash example
    # Install samtools if not already available
    # conda install -c bioconda samtools
    
    # Define input and output file paths
    # The description uses the glob '*Aligned.out.bam'. With the output prefix from the unique
    # genomic mapping step, STAR writes 'EV225_IP.genome.mapAligned.out.bam'.
    INPUT_BAM="EV225_IP.genome.mapAligned.out.bam"
    OUTPUT_BAM="EV225_IP.genome.map.bam"
    
    # Sort mapped reads
    samtools sort -@ 8 -m 2G -o "${OUTPUT_BAM}" "${INPUT_BAM}"
  7.

    Remove PCR duplicate reads: samtools index EV225_IP.genome.map.bam; umi_tools dedup --random-seed 1 --stdin EV225_IP.genome.map.bam --method unique --stdout EV225_IP.bam

    UMI-tools (version not specified; Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    samtools index EV225_IP.genome.map.bam
    umi_tools dedup --random-seed 1 --stdin EV225_IP.genome.map.bam --method unique --stdout EV225_IP.bam
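
The --method unique option is the simplest umi_tools grouping rule: reads at the same mapping position count as duplicates only when their UMI sequences are identical, with no merging of similar UMIs. The awk sketch below illustrates that rule on SAM text. It is a simplification and not the umi_tools implementation: dedup_unique_sketch is a hypothetical name, it keeps the first record seen in each group, and it keys on the leftmost POS field, whereas umi_tools works from the read's 5' end and accounts for soft clipping.

```shell
# Hypothetical helper: keep one SAM record per (contig, position, strand, UMI).
# Assumes the UMI is the last "_"-separated field of the read name.
dedup_unique_sketch() {
  awk -F '\t' '
    /^@/ { print; next }                        # pass SAM header lines through
    {
      n = split($1, parts, "_")
      umi = parts[n]
      strand = (int($2 / 16) % 2) ? "-" : "+"   # FLAG bit 0x10 marks the reverse strand
      key = $3 SUBSEP $4 SUBSEP strand SUBSEP umi
      if (!(key in seen)) { seen[key] = 1; print }
    }
  '
}

# Demonstration on truncated, made-up SAM records (first four columns only):
# r2 duplicates r1; r3 differs by UMI and r4 by strand, so both are kept.
printf 'r1_AAAA\t0\tchrVSV\t100\nr2_AAAA\t0\tchrVSV\t100\nr3_CCCC\t0\tchrVSV\t100\nr4_AAAA\t16\tchrVSV\t100\n' \
  | dedup_unique_sketch
```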
  8.

    Make bigwigs: samtools index EV225_IP.bam; b2bw EV225_IP.bam --cpus 1

    samtools v3.5.1 GitHub
    $ Bash example
    # Install samtools
    # conda install -c bioconda samtools
    
    # Install deepTools
    # conda install -c bioconda deeptools
    
    # Index the BAM file, a prerequisite for bigWig generation
    samtools index EV225_IP.bam
    
    # Generate bigWig file from the indexed BAM file
    # 'b2bw' is not a standard tool; it is assumed here to be a custom script or alias, and deepTools bamCoverage is used as a stand-in.
    # --binSize and --normalizeUsing are illustrative choices, not taken from the source. The next step reads a
    # strand-specific file (EV225.pos.bw), which this unstranded command does not produce.
    bamCoverage -b EV225_IP.bam \
                -o EV225_IP.bigwig \
                --numberOfProcessors 1 \
                --binSize 10 \
                --normalizeUsing RPKM
  9.

    Isolate VSV-mapped reads only: bigWigToWig EV225.pos.bw -chrom=chrVSV EV225.pos.bw.chrVSV.wig

    bigWigToWig (UCSC Genome Browser utilities; Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install UCSC Genome Browser utilities if not already installed
    # conda install -c bioconda ucsc-bigwigtowig
    
    bigWigToWig EV225.pos.bw -chrom=chrVSV EV225.pos.bw.chrVSV.wig
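
The dataset notes state that bigWig files are not included for datasets with zero reads, which suggests checking whether the extracted wig contains any data before converting it back. How the authors handled this is not described, so the sketch below is one assumed approach. wig_has_data is a hypothetical helper that succeeds when a file has at least one line that is not blank, a comment, a track or browser line, or a fixedStep/variableStep declaration.

```shell
# Hypothetical helper: exit status 0 if the wig/bedGraph text file holds any data line.
wig_has_data() {
  awk '
    /^(#|track|browser|fixedStep|variableStep)/ { next }   # skip comments and declarations
    NF > 0 { found = 1; exit }                              # one data line is enough
    END { exit !found }
  ' "$1"
}

# Demonstration on two made-up files in a temporary directory
demo_dir="$(mktemp -d)"
printf '#bedGraph section chrVSV:0-20\nchrVSV\t0\t20\t3\n' > "$demo_dir/with_reads.wig"
printf '' > "$demo_dir/empty.wig"
for wig in "$demo_dir/with_reads.wig" "$demo_dir/empty.wig"; do
  if wig_has_data "$wig"; then
    echo "$(basename "$wig"): has data, convert with wigToBigWig"
  else
    echo "$(basename "$wig"): no data, skip"
  fi
done
```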
  10.

    Convert the VSV-only wig back to bigWig: wigToBigWig EV225.pos.bw.chrVSV.wig /path/to/hg19_and_vsv.chrom.sizes EV242.pos.bw.chrVSV.wig.bw

    $ Bash example
    # Install UCSC tools if not already available
    # conda install -c bioconda ucsc-wigtobigwig
    
    # Define input, reference, and output files
    INPUT_WIG="EV225.pos.bw.chrVSV.wig"
    CHROM_SIZES="/path/to/hg19_and_vsv.chrom.sizes" # This file should contain chromosome names and their sizes for hg19 and VSV. A standard hg19.chrom.sizes can be obtained from UCSC, but 'hg19_and_vsv' implies a custom file.
    OUTPUT_BIGWIG="EV242.pos.bw.chrVSV.wig.bw" # Name kept as given in the source; note it says EV242 while the input is EV225
    
    # Execute the wigToBigWig command
    wigToBigWig "${INPUT_WIG}" "${CHROM_SIZES}" "${OUTPUT_BIGWIG}"
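
wigToBigWig needs a two-column chrom.sizes file (contig name, length in bases) that includes chrVSV, and a stock hg19.chrom.sizes would lack that entry. The source does not say how /path/to/hg19_and_vsv.chrom.sizes was made. One way, assuming the combined FASTA used for the STAR index is available, is to compute the lengths directly from it. fasta_chrom_sizes below is a hypothetical helper that uses only awk; the example sequences are made up.

```shell
# Hypothetical helper: print "name<TAB>length" for every record in a FASTA.
# Reads the files given as arguments, or standard input when none are given.
fasta_chrom_sizes() {
  awk '
    /^>/ {
      if (name != "") print name "\t" len   # flush the previous record
      name = substr($1, 2)                  # contig name = header up to first whitespace
      len = 0
      next
    }
    { len += length($0) }                   # sequence may span several lines
    END { if (name != "") print name "\t" len }
  ' "$@"
}

# Tiny demonstration with made-up sequences
printf '>chr1 host contig\nACGTACGT\nACGT\n>chrVSV\nGGGGCC\n' | fasta_chrom_sizes
```

Running samtools faidx on the same FASTA and taking the first two columns of the resulting .fai file gives the same table.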

Raw Source Text
Standard processing of eCLIP data was performed as previously described (Blue SB, et al. Nature Protocols 2022), with mapping performed to a custom genome index that included both the VSV genome and either hg19 (for 293T experiments) or mm10 (for MEF experiments) as described below.
UMI removal and cut adapt:  umi_tools extract   --random-seed 1   --bc-pattern NNNNNNNNNN   --stdin 23145FL-08-01-37_S37_L001_R1_001.fastq.gz   --log /dev/null | cutadapt   -j 8   --match-read-wildcards   --times 1   -e 0.1   --quality-cutoff 6   -m 18   -a a_adapters.fasta   -O 1   -   --report minimal   2> EV225_IP.cut.adapt.log | cutadapt   -j 8  --match-read-wildcards   --times 1   -e 0.1   --quality-cutoff 6   -m 18   -a a_adapters.fasta   -O 5   -   -o EV225_IP.trim.gz   --report minimal   >>EV225_IP.cut.adapt.log
Repeat mapping (takes adapter-trimmed fastq files, and maps to a repeat database to filter repeat elements prior to genome mapping:     STAR     --runMode alignReads     --runThreadN 8     --alignEndsType EndToEnd     --genomeDir  [mus_musculus_repbase_v2 or homo_sapiens_repbase_v2]   --genomeLoad NoSharedMemory     --outBAMcompression 10     --outFileNamePrefix EV225_IP.repeat.map     --outFilterMultimapNmax 100     --outFilterMultimapScoreRange 1     --outFilterScoreMin 10     --outFilterType BySJout     --outReadsUnmapped Fastx    --outSAMattrRGline ID:foo     --outSAMattributes All     --outSAMmode Full     --outSAMtype BAM Unsorted     --outSAMunmapped None     --outStd Log     --readFilesCommand zcat     --readFilesIn EV225_IP.trim.gz
Unique genomic mapping (takes repeat-removed reads and maps to the reference genome: STAR   --runMode alignReads   --runThreadN 8   --alignEndsType EndToEnd   --genomeDir [hg19+VSV or mm10+VSV] --genomeLoad NoSharedMemory   --outBAMcompression 10   --outFileNamePrefix EV225_IP.genome.map   --outFilterMultimapNmax 1   --outFilterMultimapScoreRange 1   --outFilterScoreMin 10   --outFilterType BySJout   --outReadsUnmapped Fastx   --outSAMattrRGline ID:foo   --outSAMattributes All   --outSAMmode Full   --outSAMtype BAM Unsorted   --outSAMunmapped None   --outStd Log   --readFilesIn EV225_IP.repeat.mapUnmapped.out.mate1
Sort mapped reads: samtools sort -@ 8 -m 2G -o EV225_IP.genome.map.bam *Aligned.out.bam
Remove PCR duplicate reads: samtools index EV225_IP.genome.map.bam umi_tools dedup   --random-seed 1   --stdin EV225_IP.genome.map.bam   --method unique   --stdout EV225_IP.bam
Make bigwigs: samtools index EV225_IP.bam b2bw EV225_IP.bam  --cpus 1
Isolate VSV-mapped reads only: bigWigToWig EV225.pos.bw -chrom=chrVSV EV225.pos.bw.chrVSV.wig
wigToBigWig EV225.pos.bw.chrVSV.wig /path/to/hg19_and_vsv.chrom.sizes EV242.pos.bw.chrVSV.wig.bw
Assembly: hg19 (for 293T) or mm10 (for MEF) with the addition of VSV (accession NC_001560.1, with GFP inserted between G and L as described in PMID 10400792)
Supplementary files format and content: bw files contain read density (in bigwig format) for the VSV genome (positive or negative strand as indicated in filename). Bigwig files are not included for datasets with zero reads