GSE173508 Processing Pipeline

GSE code_examples 19 steps

Publication

Discovery and functional interrogation of SARS-CoV-2 protein-RNA interactions.

Research square (2022) — PMID 35313591

Dataset

GSE173508

Discovery and functional interrogation of the virus and host RNA interactome of SARS-CoV-2 proteins

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.
  1. 1

    Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0).

    UMI-tools v1.0.0
    $ Bash example
    # Install UMI-tools (if not already installed)
    # conda install -c bioconda umi-tools=1.0.0
    
    # Reformat sequenced reads to extract Unique Molecular Identifiers (UMIs) and include them in read headers.
    # This command assumes UMIs are at the start of Read 1 and are 6 base pairs long.
    # Adjust the --bc-pattern according to the actual UMI structure and position in your reads.
    # For example:
    #   - If the UMI is 6bp at the start of Read 1: --bc-pattern=NNNNNN
    #   - If the UMI is 6bp at the start of Read 2 (for paired-end data): --bc-pattern=NNNNNN --read2-in=read2.fastq.gz --read2-out=read2_umi_extracted.fastq.gz
    #   - If the UMI is in an index read: --bc-pattern=NNNNNN --index-in=index.fastq.gz
    
    umi_tools extract \
        --input input_reads.fastq.gz \
        --output output_reads_with_umi_in_header.fastq.gz \
        --bc-pattern=NNNNNN \
        --log umi_tools_extract.log
  2. 2

    Args: --random-seed 1 --bc-pattern NNNNNNNNNN

    umi_tools (Inferred with models/gemini-2.5-flash) v1.1.2 GitHub
    $ Bash example
    # Install umi_tools via conda
    # conda install -c bioconda umi_tools
    
    # Example usage of umi_tools extract for processing Unique Molecular Identifiers (UMIs).
    # Replace input.fastq.gz and output.fastq.gz with your actual input and output file paths.
    # The --bc-pattern NNNNNNNNNN specifies a 10-nucleotide barcode at the start of the read.
    # The --random-seed 1 ensures reproducibility for any random processes within umi_tools.
    umi_tools extract --random-seed 1 --bc-pattern NNNNNNNNNN -I input.fastq.gz -S output.fastq.gz
  3. 3

    Reads were then trimmed with cutadapt (1.14).

    cutadapt v1.14 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=1.14
    
    # Define input and output files (placeholders)
    INPUT_FASTQ="input.fastq.gz"
    OUTPUT_FASTQ="trimmed.fastq.gz"
    
    # Define common adapter sequence (e.g., Illumina universal adapter)
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    
    # Define trimming parameters
    MINIMUM_LENGTH=15  # Minimum read length after trimming
    QUALITY_CUTOFF=20  # Quality cutoff for 3' end trimming
    NUM_CORES=4        # Number of cores for parallel processing
    
    # Execute cutadapt command
    cutadapt \
      -a "${ADAPTER_SEQUENCE}" \
      -o "${OUTPUT_FASTQ}" \
      -m ${MINIMUM_LENGTH} \
      -q ${QUALITY_CUTOFF} \
      --cores ${NUM_CORES} \
      "${INPUT_FASTQ}"
  4. 4

    Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

    eCLIP vNot explicitly versioned, using latest available from yeolab/eclip repository GitHub
    $ Bash example
    # Install STAR aligner (if not already installed)
    # conda install -c bioconda star
    
    # Clone the eCLIP repository (if not already cloned)
    # git clone https://github.com/YeoLab/eclip.git
    # cd eclip
    # export PATH=$(pwd)/src:$PATH # Add eclip scripts to PATH
    
    # Define input/output and reference paths (placeholders - replace with actual paths)
    INPUT_FASTQ="path/to/your/input_reads.fastq.gz" # Example: reads.fastq.gz or R1.fastq.gz R2.fastq.gz for paired-end
    OUTPUT_PREFIX="eclip_alignment_output" # Prefix for output files
    GENOME_DIR="path/to/STAR_index/hg38" # Path to the STAR genome index (e.g., for human hg38)
    
    # Download InvRNA*.fasta files (if not already present)
    # mkdir -p example_inputs
    # for i in $(seq 1 1560); do
    #     wget -P example_inputs https://raw.githubusercontent.com/YeoLab/eclip/master/example/inputs/InvRNA_"$i".fasta
    # done
    # ANNOTATION_FASTA_FILES="example_inputs/InvRNA_*.fasta"
    
    # For demonstration, assuming InvRNA*.fasta are in the current directory or specified path
    ANNOTATION_FASTA_FILES="InvRNA*.fasta" # Adjust path if files are elsewhere
    
    # Execute the eclip_align_star.py script
    python eclip_align_star.py \
      -i "${INPUT_FASTQ}" \
      -o "${OUTPUT_PREFIX}" \
      -g "${GENOME_DIR}" \
      --match-read-wildcards \
      -O 1 \
      --times 1 \
      -e 0.1 \
      --quality-cutoff 6 \
      -m 18 \
      -a ${ANNOTATION_FASTA_FILES}
  5. 5

    Reads were then trimmed once more with cutadapt (1.14) to remove double-ligation events.

    cutadapt v1.14 GitHub
    $ Bash example
    # Install cutadapt (if not already installed)
    # conda install -c bioconda cutadapt=1.14
    
    # Define input and output file paths
    # Assuming single-end reads for this example. Adjust for paired-end if necessary.
    INPUT_READS="input_reads_previously_trimmed.fastq.gz"
    OUTPUT_TRIMMED_READS="reads_trimmed_double_ligation.fastq.gz"
    
    # Define the 3' adapter sequence commonly associated with double-ligation events in eCLIP.
    # This is a placeholder; the exact adapter might vary based on the specific library preparation
    # and the nature of the double-ligation event. A common eCLIP 3' adapter is used here.
    # Example eCLIP 3' adapter from Yeo lab protocols: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC
    ADAPTER_SEQUENCE="AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC"
    
    # Trim reads with cutadapt to remove the 3' adapter (double-ligation events).
    # The description implies a re-trimming step, so the input reads are already partially processed.
    # -a: 3' adapter sequence to be removed from the 3' end of the reads.
    # -o: Output file for trimmed reads.
    # -m 18: Discard reads shorter than 18 bp after trimming (a common minimum length for eCLIP reads).
    # -q 20,20: Trim low-quality bases from the 5' and 3' ends with a quality threshold of 20.
    # --cores=4: Use 4 CPU cores for parallel processing (adjust based on available resources).
    cutadapt -a "${ADAPTER_SEQUENCE}" \
             -o "${OUTPUT_TRIMMED_READS}" \
             -m 18 \
             -q 20,20 \
             --cores=4 \
             "${INPUT_READS}"
    
  6. 6

    Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)

    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi_tools
    
    # Reference dataset: InvRNA_adapters.fasta
    # This file (e.g., InvRNA_adapters.fasta from https://github.com/YeoLab/eclip/tree/master/example/inputs/)
    # typically contains adapter sequences used for trimming reads, often with tools like cutadapt,
    # which usually precedes UMI deduplication. The '-a InvRNA*.fasta' argument in the description
    # is characteristic of adapter trimming, but the other arguments are specific to umi_tools dedup.
    # For this step, we assume the primary operation is UMI deduplication.
    
    # Placeholder for input BAM file (e.g., from alignment and adapter trimming step)
    # The input BAM should be sorted by coordinate for umi_tools dedup.
    INPUT_BAM="input_aligned_trimmed.bam"
    OUTPUT_BAM="output_deduplicated.bam"
    
    umi_tools dedup \
      --match-read-wildcards \
      -O 5 \
      --times 1 \
      -e 0.1 \
      --quality-cutoff 6 \
      -m 18 \
      -I "${INPUT_BAM}" \
      -S "${OUTPUT_BAM}"
  7. 7

    Trimmed reads were then mapped with STAR (2.4.0i) against a repeat element database (RepBase 18.05).

    $ Bash example
    # Install STAR (if not already installed)
    # conda install -c bioconda star=2.4.0i
    # conda install -c bioconda samtools # For indexing BAM files
    
    # --- Reference Data Setup ---
    # Placeholder for RepBase 18.05 FASTA file. 
    # Actual download might require a license or specific access from GIRINST.
    # For demonstration, we assume 'RepBase18.05.fasta' is available in the working directory.
    # Example download (replace with actual source if available and licensed):
    # wget -O RepBase18.05.fasta.gz "http://www.girinst.org/repbase/update/RepBase18.05.fasta.gz"
    # gunzip RepBase18.05.fasta.gz
    
    repbase_fasta="RepBase18.05.fasta"
    genomeDir="RepBase_STAR_index"
    
    # Create STAR index for RepBase 18.05
    # The --genomeSAindexNbases parameter is adjusted for shorter sequences typical of repeat elements.
    # --runThreadN: Use multiple threads for indexing to speed up the process.
    mkdir -p "${genomeDir}"
    STAR --runMode genomeGenerate \
         --genomeDir "${genomeDir}" \
         --genomeFastaFiles "${repbase_fasta}" \
         --genomeSAindexNbases 10 \
         --runThreadN 8
    
    # --- Alignment Step ---
    # Input: Trimmed reads (placeholder). Replace with your actual trimmed reads file.
    trimmed_reads="trimmed_reads.fastq.gz" 
    
    # Output prefix for STAR alignment files
    output_prefix="aligned_to_repeats"
    
    # Align trimmed reads to the RepBase 18.05 index
    # --outSAMtype BAM SortedByCoordinate: Output sorted BAM file.
    # --outReadsUnmapped Fastx: Output unmapped reads to a separate FASTQ file.
    # --outFilterMultimapNmax 100: Allow reads to map to up to 100 locations, which is common for repetitive sequences.
    # --alignIntronMax 1: Set to 1 to effectively disable splicing, as repeats are not typically spliced regions.
    # --runThreadN: Use multiple threads for alignment.
    STAR --runMode alignReads \
         --genomeDir "${genomeDir}" \
         --readFilesIn "${trimmed_reads}" \
         --outFileNamePrefix "${output_prefix}" \
         --outSAMtype BAM SortedByCoordinate \
         --outReadsUnmapped Fastx \
         --outFilterMultimapNmax 100 \
         --alignIntronMax 1 \
         --runThreadN 8
    
    # Index the resulting BAM file for easier viewing and downstream processing
    samtools index "${output_prefix}Aligned.sortedByCoord.out.bam"
  8. 8

    Args: --runThreadN 16 \ --genomeDir human_repbase \ --readFilesIn path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 30 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) v(Inferred with models/gemini-2.5-flash) GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Note: 'human_repbase' refers to a pre-indexed STAR genome directory.
    # If this is a custom index including repetitive elements, ensure it's generated correctly.
    # For a standard human genome (e.g., hg38), you would first generate the genome index:
    # STAR --runThreadN 16 --runMode genomeGenerate --genomeDir /path/to/STAR_index/hg38 --genomeFastaFiles /path/to/hg38.fa --sjdbGTFfile /path/to/gencode.vXX.annotation.gtf
    
    # Align reads using STAR
    STAR \
        --runThreadN 16 \
        --genomeDir human_repbase \
        --readFilesIn path/to/read1 \
        --outFileNamePrefix out_prefix \
        --outReadsUnmapped Fastx \
        --outSAMtype BAM Unsorted \
        --outSAMattributes All \
        --outSAMunmapped Within \
        --outSAMattrRGline ID:foo \
        --outFilterType BySJout \
        --outFilterMultimapNmax 30 \
        --outFilterMultimapScoreRange 1 \
        --outFilterScoreMin 10 \
        --alignEndsType EndToEnd
  9. 9

    Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19/ChlSab2).

    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    READS_FILE="filtered_unmapped_reads.fastq.gz" # Placeholder for input reads (e.g., from a previous filtering step)
    GENOME_DIR="/path/to/STAR_genome_index/hg19" # Placeholder for hg19 STAR genome index directory
    OUTPUT_PREFIX="mapped_reads_" # Prefix for output files (e.g., mapped_reads_Aligned.sortedByCoord.out.bam)
    NUM_THREADS=8 # Example number of threads, adjust based on available CPU cores
    
    # Run STAR alignment
    STAR --genomeDir "${GENOME_DIR}" \
         --readFilesIn "${READS_FILE}" \
         --runThreadN "${NUM_THREADS}" \
         --outFileNamePrefix "${OUTPUT_PREFIX}" \
         --outSAMtype BAM SortedByCoordinate \
         --outBAMcompression 6 \
         --limitBAMsortRAM 30000000000 # Example: 30GB, adjust based on available RAM and expected BAM file size
  10. 10

    Args: --runThreadN 16 \ --genomeDir genomedir \ --readFilesIn /path/to/read1 \ --outFileNamePrefix out_prefix \ --outReadsUnmapped Fastx \ --outSAMtype BAM Unsorted \ --outSAMattributes All \ --outSAMunmapped Within \ --outSAMattrRGline ID:foo \ --outFilterType BySJout \ --outFilterMultimapNmax 1 \ --outFilterMultimapScoreRange 1 \ --outFilterScoreMin 10 \ --alignEndsType EndToEnd

    STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub
    $ Bash example
    # Install STAR (example using conda)
    # conda install -c bioconda star
    
    # Define variables
    GENOME_DIR="/path/to/STAR_genome_index_GRCh38" # Placeholder for GRCh38 genome index
    READ1_FILE="/path/to/sample_R1.fastq.gz" # Placeholder for input read 1 file
    OUTPUT_PREFIX="sample_aligned" # Placeholder for output file prefix
    
    # Execute STAR alignment
    STAR \
      --runThreadN 16 \
      --genomeDir "${GENOME_DIR}" \
      --readFilesIn "${READ1_FILE}" \
      --outFileNamePrefix "${OUTPUT_PREFIX}" \
      --outReadsUnmapped Fastx \
      --outSAMtype BAM Unsorted \
      --outSAMattributes All \
      --outSAMunmapped Within \
      --outSAMattrRGline ID:foo \
      --outFilterType BySJout \
      --outFilterMultimapNmax 1 \
      --outFilterMultimapScoreRange 1 \
      --outFilterScoreMin 10 \
      --alignEndsType EndToEnd
  11. 11

    Aligned reads were sorted with samtools (1.6)

    samtools v1.6 GitHub
    $ Bash example
    # Install samtools if not already installed
    # conda install -c bioconda samtools=1.6
    
    # Sort an aligned BAM file
    # Replace input.bam with your actual input file
    # Replace output.sorted.bam with your desired output file name
    samtools sort -o output.sorted.bam input.bam
  12. 12

    Sorted reads were collapsed with umi_tools (1.0.0).

    UMI-tools v1.0.0 GitHub
    $ Bash example
    # Install umi_tools if not already installed
    # conda install -c bioconda umi_tools=1.0.0
    
    # Placeholder for input sorted BAM file
    INPUT_SORTED_BAM="input.sorted.bam"
    # Placeholder for output deduplicated BAM file
    OUTPUT_DEDUPLICATED_BAM="output.deduplicated.bam"
    
    # Collapse sorted reads using umi_tools dedup.
    # This command assumes UMIs are present in the read names and uses the 'directional' method
    # for deduplication, which is common for RNA-seq data to preserve strand information.
    # Adjust --umi-separator or --extract-umi-method/--umi-tag if UMIs are stored differently
    # (e.g., in a BAM tag).
    umi_tools dedup \
        --input "${INPUT_SORTED_BAM}" \
        --output "${OUTPUT_DEDUPLICATED_BAM}" \
        --method directional \
        --log "umi_tools_dedup.log"
  13. 13

    Args: --random-seed 1 --method unique

    Generic Data Processing Script (Inferred with models/gemini-2.5-flash) vUnknown
    $ Bash example
    # This is a placeholder command as the specific tool and context are not provided.
    # It is assumed to be a custom script or a generic data processing step.
    # Replace 'my_tool_or_script.sh' with the actual executable name and specify input/output files if applicable.
    my_tool_or_script.sh --random-seed 1 --method unique
  14. 14

    BAM files were used to identify peak clusters with Clipper (1.2.2).

    CLIPper v1.2.2 GitHub
    $ Bash example
    # Install CLIPper (if not already installed)
    # pip install clipper
    
    # Example: Download a genome size file (e.g., for hg38) if not already present
    # wget -O hg38.chrom.sizes http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.chrom.sizes
    
    # Run CLIPper to identify peak clusters from BAM files
    # Replace 'input.bam' with your actual BAM file(s)
    # Replace 'hg38.chrom.sizes' with the appropriate genome size file for your reference genome
    # The p-value threshold (-p) is a common parameter; adjust as needed.
    clipper.py -b input.bam -s hg38.chrom.sizes -o output_peaks.bed -p 0.01
  15. 15

    Args: --species (hg19/ChlSab2_Sars) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam

    Custom Python Script (Inferred with models/gemini-2.5-flash) vN/A GitHub
    $ Bash example
    # Define input and output paths
    INPUT_BAM="path/to/input.bam"
    OUTPUT_BAM="path/to/output.bam"
    
    # Define species (reference genome)
    # hg19 is a human reference genome, commonly available from UCSC or NCBI.
    # ChlSab2_Sars is likely a specific reference, possibly for SARS-CoV-2 on Chlorocebus sabaeus (green monkey).
    SPECIES="hg19" 
    
    # Define other parameters
    TIMEOUT_MS=3600000 # 1 hour in milliseconds
    MAX_GENES=1000000
    
    # Assuming a custom Python script based on the arguments like --save-pickle and --maxgenes,
    # which are not typical for standard bioinformatics tools. The actual script name
    # would need to be known from the pipeline context. For demonstration, we'll use
    # a placeholder 'process_bam_genes.py'. This script would likely require a Python
    # environment with necessary libraries (e.g., pysam).
    
    # Example installation (commented out)
    # conda create -n myenv python=3.8 pysam
    # conda activate myenv
    
    # Execute the inferred custom Python script
    python process_bam_genes.py \
        --species "${SPECIES}" \
        --bam "${INPUT_BAM}" \
        --timeout "${TIMEOUT_MS}" \
        --maxgenes "${MAX_GENES}" \
        --save-pickle \
        --outfile "${OUTPUT_BAM}"
  16. 16

    Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.

    $ Bash example
    # Clone the eclip repository if not already available
    # git clone https://github.com/yeolab/eclip.git
    # export PATH=$PATH:$(pwd)/eclip/bin
    
    # Example input files (replace with actual paths)
    # PEAK_FILE should be a peak file with read counts, typically generated by a script like overlap_peakfi_with_bam_PE.pl
    # (as mentioned in the description) or another read counting tool.
    PEAK_FILE="peaks_with_counts.bed" 
    IP_BAM="ip.bam"
    INPUT_BAM="input.bam"
    OUTPUT_PREFIX="normalized_peaks"
    
    # Run peaksnormalize.pl
    # This script normalizes peak clusters using IP and INPUT BAM files.
    peaksnormalize.pl "${PEAK_FILE}" "${IP_BAM}" "${INPUT_BAM}" "${OUTPUT_PREFIX}"
  17. 17

    Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+

    $ Bash example
    # Install Perl if not available
    # conda install -c bioconda perl
    
    # Clone the eCLIP pipeline repository to get the script
    # git clone https://github.com/yeolab/eclip.git
    # cd eclip/scripts
    
    # Define input and output file names (replace with actual paths and names)
    # These would be the normalized peak regions from individual replicates
    INPUT_PEAK_FILE_REP1="replicate1_normalized_peaks.bed"
    INPUT_PEAK_FILE_REP2="replicate2_normalized_peaks.bed"
    # Add more input files if there are more replicates
    
    OUTPUT_MERGED_PEAKS="merged_replicate_overlapping_peaks.bed"
    
    # Execute the merging script
    # Assuming the script 'compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl' is in the current PATH or specified with its full path
    perl compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl \
        "${INPUT_PEAK_FILE_REP1}" \
        "${INPUT_PEAK_FILE_REP2}" \
        > "${OUTPUT_MERGED_PEAKS}"
  18. 18

    Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.

    IDR v2.0.2 GitHub
    $ Bash example
    # Install merge_peaks pipeline (which includes make_informationcontent_from_peaks.pl and a custom IDR script)
    # git clone https://github.com/yeolab/merge_peaks.git
    # export PATH=$PATH:$(pwd)/merge_peaks/bin
    
    # Example input files (normalized and compressed .bed files)
    # Assuming 'compressed.bed' refers to standard BED files that might have been gzipped and unzipped,
    # or simply a naming convention for processed BED files.
    INPUT_REP1="replicate1.bed"
    INPUT_REP2="replicate2.bed"
    OUTPUT_PREFIX="idr_output"
    
    # Step 1: Rank peaks by entropy score (information content)
    # make_informationcontent_from_peaks.pl is part of the merge_peaks pipeline.
    # It takes a BED file and outputs a new BED file with an added 'signal.value' column
    # representing the information content, which IDR uses for ranking.
    make_informationcontent_from_peaks.pl "${INPUT_REP1}" > "${INPUT_REP1%.bed}.ranked.bed"
    make_informationcontent_from_peaks.pl "${INPUT_REP2}" > "${INPUT_REP2%.bed}.ranked.bed"
    
    # Step 2: Use ranked peaks as inputs to IDR (version 2.0.2 logic, as implemented in merge_peaks)
    # The 'idr' script used here is from the merge_peaks pipeline (merge_peaks/bin/idr).
    # It implements the IDR 2.0.2 logic for determining reproducible peaks.
    idr --input-file-type bed \
        --rank-by signal.value \
        --output-file "${OUTPUT_PREFIX}.idr.bed" \
        "${INPUT_REP1%.bed}.ranked.bed" \
        "${INPUT_REP2%.bed}.ranked.bed"
  19. 19

    Reproducible peaks were filtered for those ≥20 bases in length, and not overlapping with WT negative control samples.

    bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 GitHub
    $ Bash example
    # Install bedtools if not already installed
    # conda install -c bioconda bedtools
    
    # Filter peaks for length >= 20 bases
    awk 'BEGIN{OFS="\t"} {if ($3 - $2 >= 20) print}' reproducible_peaks.bed > reproducible_peaks_length_filtered.bed
    
    # Filter out peaks overlapping with WT negative control samples
    bedtools subtract -a reproducible_peaks_length_filtered.bed -b wt_negative_control.bed > final_filtered_peaks.bed

Tools Used

Raw Source Text
Sequenced reads were reformatted to include randomers in read headers with umi_tools (1.0.0). Args: --random-seed 1 --bc-pattern NNNNNNNNNN
Reads were then trimmed with cutadapt (1.14). Args: --match-read-wildcards -O 1 --times 1 -e 0.1 --quality-cutoff 6 -m 18 -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Reads were then trimmed once more with cutadapt (1.14) to remove double-ligation events. Args: --match-read-wildcards -O 5 --times 1 -e 0.1 --quality-cutoff 6 -m 18  -a InvRNA*.fasta (fasta sequences can be found at: https://github.com/YeoLab/eclip/tree/master/example/inputs/)
Trimmed reads were then mapped with STAR (2.4.0i) against a repeat element database (RepBase 18.05). Args: --runThreadN 16 \  --genomeDir human_repbase \  --readFilesIn path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 30 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Unmapped reads filtered of repeat elements were then mapped with STAR (2.4.0i) against a human genome (hg19/ChlSab2). Args: --runThreadN 16 \  --genomeDir genomedir \  --readFilesIn /path/to/read1 \  --outFileNamePrefix out_prefix \  --outReadsUnmapped Fastx \  --outSAMtype BAM   Unsorted \  --outSAMattributes All \  --outSAMunmapped Within \  --outSAMattrRGline ID:foo \  --outFilterType BySJout \  --outFilterMultimapNmax 1 \  --outFilterMultimapScoreRange 1 \  --outFilterScoreMin 10 \  --alignEndsType EndToEnd
Aligned reads were sorted with samtools (1.6)
Sorted reads were collapsed with umi_tools (1.0.0). Args: --random-seed 1 --method unique
BAM files were used to identify peak clusters with Clipper (1.2.2). Args: --species (hg19/ChlSab2_Sars) --bam path/to/input.bam --timeout 3600000 --maxgenes 1000000 --save-pickle --outfile path/to/output.bam
Peak clusters were normalized using BAM files for IP against BAM files for INPUT with peaksnormalize.pl (overlap_peakfi_with_bam_PE.pl), included in eclip 0.1.5+.
Overlapping normalized peak regions were merged with compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl, included within eclip-0.1.5+
Normalized peak (compressed.bed) files were ranked by entropy score (make_informationcontent_from_peaks.pl included within the merge_peaks pipeline) and used as inputs to IDR (2.0.2) to determine reproducible peaks.
Reproducible peaks were filtered for those ≥20 bases in length, and not overlapping with WT negative control samples.
Genome_build: hg19
Genome_build: ChlSab2
Genome_build: Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome (MN908947.3)
← Back to Analysis