GSE51556 Processing Pipeline

RNA-Seq code_examples 14 steps

Publication

The dsRBP and inactive editor ADR-1 utilizes dsRNA binding to regulate A-to-I RNA editing across the C. elegans transcriptome.

Cell reports (2014) — PMID 24508457

Dataset

GSE51556

The dsRBP and inactive editor, ADR-1, utilizes dsRNA binding to regulate A-to-I RNA editing across the C. elegans transcriptome

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

ssRNAseq: The adr-1(-);adr-2(-) sample was sequenced on one lane of Illumina's HiSeq 2000 yielding 216 million single-end 76nt reads.

STAR (Inferred with models/gemini-2.5-flash) vNot specified GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Assuming the raw sequencing data is available as a gzipped FASTQ file.
READS_FILE="adr1_adr2_ssRNAseq.fastq.gz" # Placeholder for the raw single-end 76nt reads
# Reference genome and annotation for C. elegans (inferred from adr-1/adr-2 genes)
# Latest C. elegans assembly is ce11.
REFERENCE_FASTA="ce11.fa" # Placeholder for C. elegans reference genome FASTA file (e.g., from UCSC or Ensembl)
GTF_FILE="ce11.gtf"       # Placeholder for C. elegans gene annotation GTF file (e.g., from UCSC or Ensembl)
GENOME_DIR="STAR_genome_ce11" # Directory where STAR genome index will be stored
OUTPUT_DIR="star_alignment_output"
NUM_THREADS=8 # Number of threads to use for STAR

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# --- Genome Indexing (Run this step once for the reference genome) ---
# This command generates the STAR genome index.
# It requires the reference FASTA file and a GTF annotation file for splice junction information.
# STAR --runMode genomeGenerate \
#      --genomeDir "${GENOME_DIR}" \
#      --genomeFastaFiles "${REFERENCE_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --runThreadN "${NUM_THREADS}" \
#      --sjdbOverhang 75 # Recommended: read length - 1 (76nt reads -> 75)

# --- Alignment of single-end RNA-seq reads ---
# This command aligns the single-end 76nt reads to the C. elegans genome.
# It outputs a sorted BAM file.
STAR --genomeDir "${GENOME_DIR}" \
     --readFilesIn "${READS_FILE}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_DIR}/adr1_adr2_ssRNAseq_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMunmapped Within \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.04 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --alignSJoverhangMin 8 \
     --alignSJDBoverhangMin 1 \
     --sjdbScore 1 \
     --runThreadN "${NUM_THREADS}" \
     --limitBAMsortRAM 30000000000 # Allocate ~30GB RAM for sorting, adjust based on system resources and dataset size

View on GitHub

Each other sample was sequenced on a lane of Illumina GAII yielding between 37 and 42 million reads of the same type.

FastQC (Inferred with models/gemini-2.5-flash) v0.11.9 GitHub

$ Bash example

# Install FastQC if not already available
# conda install -c bioconda fastqc

# Run FastQC on the generated sequencing reads to assess quality and verify characteristics
# such as read count, read length, and quality metrics, consistent with the described yield.
# 'sample_reads.fastq.gz' is a placeholder for the actual input FASTQ file(s).
fastqc -o fastqc_output_dir sample_reads.fastq.gz

View on GitHub

Illumina Casava1.8.2 software used for basecalling.

Illumina Casava v1.8.2

$ Bash example

# Illumina Casava 1.8.2 is proprietary instrument software used for basecalling
# and demultiplexing directly on Illumina sequencing platforms.
# It does not have a user-executable command-line interface for post-sequencing analysis.
# The basecalling process is performed automatically by the instrument during the run.
# The output of Casava basecalling is BCL (Base Call) files.
# Subsequent steps typically involve converting these BCL files to FASTQ format
# using tools like bcl2fastq, which is a separate software.

# This placeholder command signifies the completion of the basecalling step
# performed by the Illumina Casava 1.8.2 software on the sequencing instrument.
echo "Illumina Casava 1.8.2 basecalling completed on instrument, producing BCL files."

Mapping: Sequenced reads were mapped to the C. elegans reference genome (ce10) using TopHat aligner (version 2.0.6) allowing only uniquely-mapped reads with up to two mismatches each with command line options -Mx 1 and -N 2

TopHat v2.0.6 GitHub

$ Bash example

# Installation (example using conda)
# conda install -c bioconda tophat=2.0.6

# Define variables
READS="reads.fastq.gz" # Placeholder for input sequenced reads (e.g., from a FASTQ file)
REFERENCE_INDEX="/path/to/ce10_tophat_index" # Placeholder for the C. elegans ce10 TopHat/Bowtie2 index
OUTPUT_DIR="tophat_ce10_output"

# Run TopHat alignment
tophat -Mx 1 -N 2 -o "${OUTPUT_DIR}" "${REFERENCE_INDEX}" "${READS}"

View on GitHub

Variant calling: sites with RNA-DNA differences were identified by SAMtools mpileup (version 0.1.18) tallying up to 1000 alignments per site.

samtools v0.1.18 GitHub

$ Bash example

# Install samtools version 0.1.18 (if not already installed)
# conda install -c bioconda samtools=0.1.18

# Define input and output files, and reference genome
INPUT_BAM="input.bam" # Replace with your input alignment file (e.g., sorted.bam)
REFERENCE_FASTA="reference.fasta" # Replace with your reference genome in FASTA format (e.g., hg38.fa)
OUTPUT_PILEUP="output.pileup" # Replace with your desired output pileup file name

# Run samtools mpileup to identify sites with RNA-DNA differences
# -d 1000: Tally up to 1000 alignments per site (max-depth)
# -f: Specify the reference genome file
samtools mpileup -d 1000 -f "${REFERENCE_FASTA}" "${INPUT_BAM}" > "${OUTPUT_PILEUP}"

View on GitHub

Additional command line options used were -D -I and -g.

Generic Bioinformatics Tool (Inferred with models/gemini-2.5-flash) vUnknown (Inferred with models/gemini-2.5-flash)

$ Bash example

# Install generic_bio_tool (example, replace with actual installation if tool is known)
# conda install -c bioconda generic_bio_tool

# Example usage of a generic bioinformatics tool with options -D, -I, and -g.
# The exact meaning of these options depends on the specific tool.
# For this example, we assume -I specifies an input file and -g specifies a genome assembly.
# -D is often used for debug mode or defining parameters.
# Replace 'input.fastq' with your actual input file.
# Replace 'output.txt' with your desired output file.
# Replace 'hg38' with the actual genome assembly if known; otherwise, hg38 is a common placeholder for the latest human assembly.

generic_bio_tool -D -I input.fastq -g hg38 > output.txt

Site filters: Annotated SNPs were obtained from Illumina's iGenomes collection for C. elegans (ce10) and unannotated variants were extracted from the adr-1(-);adr-2(-) RNA-Seq dataset.

RNA-seq v4.2.6.1

$ Bash example

# --- Installation (commented out) ---
# # Install GATK (e.g., via conda)
# # conda create -n gatk_env openjdk=11 gatk=4.2.6.1 -c bioconda -c conda-forge
# # conda activate gatk_env

# # Install STAR (e.g., via conda)
# # conda install -c bioconda star=2.7.10a

# # Install samtools (e.g., via conda) - often a dependency or useful for BAM manipulation
# # conda install -c bioconda samtools=1.16.1

# --- Define Variables ---
# Reference genome (C. elegans ce10 from Illumina iGenomes)
REF_DIR="/path/to/illumina_igenomes/Caenorhabditis_elegans/UCSC/ce10/Sequence/WholeGenomeFasta"
REF_FASTA="${REF_DIR}/genome.fa"
GTF_FILE="/path/to/illumina_igenomes/Caenorhabditis_elegans/UCSC/ce10/Annotation/Genes/genes.gtf" # Or .gff for STAR

# Known SNPs for GATK BaseRecalibrator (derived from Illumina iGenomes annotated SNPs)
# This VCF would contain the "Annotated SNPs" mentioned in the description.
KNOWN_SNPS_VCF="/path/to/illumina_igenomes/Caenorhabditis_elegans/UCSC/ce10/Annotation/Variations/annotated_snps.vcf.gz"

# Input RNA-Seq FASTQ files for adr-1(-);adr-2(-)
# Assuming paired-end reads
READ1="/path/to/adr1_adr2_rnaseq_R1.fastq.gz"
READ2="/path/to/adr1_adr2_rnaseq_R2.fastq.gz"
SAMPLE_NAME="adr1_adr2_rnaseq"
OUTPUT_DIR="./variant_calling_output"

mkdir -p "${OUTPUT_DIR}"

# --- Step 1: STAR Alignment ---
# Build STAR genome index (if not already built)
# This step needs to be run once per reference genome.
# STAR --runMode genomeGenerate \
#      --genomeDir "${OUTPUT_DIR}/star_index" \
#      --genomeFastaFiles "${REF_FASTA}" \
#      --sjdbGTFfile "${GTF_FILE}" \
#      --sjdbOverhang 100 # Adjust based on read length (e.g., ReadLength - 1)

# Align reads
STAR --genomeDir "${OUTPUT_DIR}/star_index" \
     --readFilesIn "${READ1}" "${READ2}" \
     --readFilesCommand zcat \
     --outFileNamePrefix "${OUTPUT_DIR}/${SAMPLE_NAME}_" \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMismatchNmax 3 \
     --outFilterMultimapNmax 20 \
     --runThreadN 8 # Adjust number of threads as needed

ALIGNED_BAM="${OUTPUT_DIR}/${SAMPLE_NAME}_Aligned.sortedByCoord.out.bam"
PROCESSED_BAM="${OUTPUT_DIR}/${SAMPLE_NAME}_processed.bam"
DEDUP_BAM="${OUTPUT_DIR}/${SAMPLE_NAME}_dedup.bam"
BQSR_TABLE="${OUTPUT_DIR}/${SAMPLE_NAME}_bqsr.table"
RECAL_BAM="${OUTPUT_DIR}/${SAMPLE_NAME}_recal.bam"
RAW_VCF="${OUTPUT_DIR}/${SAMPLE_NAME}_raw_variants.vcf.gz"

# --- Step 2: GATK Pre-processing (AddOrReplaceReadGroups, MarkDuplicates) ---
# Add or Replace Read Groups - essential for GATK tools
gatk AddOrReplaceReadGroups \
     -I "${ALIGNED_BAM}" \
     -O "${PROCESSED_BAM}" \
     -RGID "${SAMPLE_NAME}" \
     -RGLB "lib1" \
     -RGPL "ILLUMINA" \
     -RGPU "unit1" \
     -RGSN "${SAMPLE_NAME}" \
     --CREATE_INDEX true

# Mark Duplicates - removes PCR duplicates
gatk MarkDuplicates \
     -I "${PROCESSED_BAM}" \
     -O "${DEDUP_BAM}" \
     -M "${OUTPUT_DIR}/${SAMPLE_NAME}_marked_duplicates.txt" \
     --CREATE_INDEX true

# --- Step 3: GATK Base Quality Score Recalibration (BQSR) ---
# This step uses known SNPs to improve base quality scores, which is crucial for accurate variant calling.
# It's optional but highly recommended for RNA-seq variant calling to reduce false positives.
gatk BaseRecalibrator \
     -I "${DEDUP_BAM}" \
     -R "${REF_FASTA}" \
     --known-sites "${KNOWN_SNPS_VCF}" \
     -O "${BQSR_TABLE}"

gatk ApplyBQSR \
     -I "${DEDUP_BAM}" \
     -R "${REF_FASTA}" \
     --bqsr-recal-file "${BQSR_TABLE}" \
     -O "${RECAL_BAM}"

# --- Step 4: GATK HaplotypeCaller for Variant Calling ---
# Extract unannotated variants from the recalibrated BAM file.
gatk HaplotypeCaller \
     -R "${REF_FASTA}" \
     -I "${RECAL_BAM}" \
     -O "${RAW_VCF}" \
     -ERC GVCF \
     -ploidy 2 \
     --dont-use-soft-clipped-bases \
     --standard-min-confidence-threshold-for-calling 20 \
     --standard-min-confidence-threshold-for-emitting 20

These genomic variants were filtered from the putative sites in all other strains reducing the number of false-positive predictions.

bcftools (Inferred with models/gemini-2.5-flash) v1.19 GitHub

$ Bash example

# Install bcftools (if not already installed)
# conda install -c bioconda bcftools

# Filter genomic variants: remove variants present in 'all_other_strains.vcf.gz' from 'current_strain.vcf.gz'
# This command identifies variants unique to 'current_strain.vcf.gz' that are not found in 'all_other_strains.vcf.gz',
# thereby reducing false-positive predictions by excluding common variants.
bcftools isec -C current_strain.vcf.gz all_other_strains.vcf.gz -o filtered_variants.vcf

View on GitHub

Read filters: Each read aligned to one the remaining putative sites was filtered out if: a) it was a suspected PCR duplicate, according to SAMtools rmdup (version 0.1.18); b) it had a junction overhang < 10nt according to its SAMtools CIGAR string; c) it had > 1 non-A2G or non-C2T mismatch or any short indel, per its MD tag; d) it had a mismatch less than 25nt away from either end of the read

samtools v0.1.18

$ Bash example

# Install samtools 0.1.18 (if not already installed)
# Note: samtools 0.1.18 is an old version. Modern samtools versions are recommended for better performance and features.
# For example, using conda:
# conda create -n samtools_0_1_18 samtools=0.1.18
# conda activate samtools_0_1_18

# Define input and output file names
INPUT_BAM="aligned_reads.bam" # Placeholder for the input aligned BAM file
DEDUP_BAM="deduplicated_reads.bam" # Output BAM after duplicate removal
# FILTERED_BAM="final_filtered_reads.bam" # Output BAM after all filters (conceptual)

# Ensure input BAM is sorted by coordinate for samtools rmdup
# If your input BAM is not sorted, uncomment and run the following lines:
# samtools sort -o "${INPUT_BAM%.bam}.sorted.bam" "$INPUT_BAM"
# INPUT_BAM="${INPUT_BAM%.bam}.sorted.bam"

# a) Filter out suspected PCR duplicates using samtools rmdup
# This command removes reads that are considered PCR duplicates.
samtools rmdup "$INPUT_BAM" "$DEDUP_BAM"

# The remaining filtering criteria (b, c, d) involve complex parsing of CIGAR and MD tags.
# samtools version 0.1.18 does not provide direct command-line options to implement these specific filters.
# These steps would typically be performed by custom scripts (e.g., in Python or Perl)
# that iterate through the reads in the BAM file, parse the CIGAR and MD tags for each read,
# and apply the specified logical conditions to filter them.

# b) Filter reads with junction overhang < 10nt according to its SAMtools CIGAR string
# This requires parsing the CIGAR string to identify splice junctions ('N' operations)
# and checking the length of the matched segments ('M' operations) immediately flanking them.

# c) Filter reads with > 1 non-A2G or non-C2T mismatch or any short indel, per its MD tag
# This requires parsing the MD tag to count specific types of mismatches (excluding A2G/C2T)
# and parsing the CIGAR string for any insertion ('I') or deletion ('D') operations.

# d) Filter reads with a mismatch less than 25nt away from either end of the read
# This requires parsing the MD tag to identify mismatch positions and comparing them
# to the read's start and end coordinates.

# For a complete pipeline, these complex filters would follow the rmdup step,
# likely using a custom script that reads from "$DEDUP_BAM" and writes to a new filtered BAM file.

Identify sites: Putative RNA editing sites were identified from A2G variants on the sense strand and T2C variants on the antisense strand that were covered by more than 5 reads which passed the filters in step 5, including the stringent 25nt threshold for filter 5d).

REDItools (Inferred with models/gemini-2.5-flash) v2.0

$ Bash example

# Define variables
REFERENCE_GENOME="path/to/human_genome.fa" # Placeholder for a human reference genome, e.g., hg38
INPUT_BAM="path/to/filtered_rna_seq_reads.bam" # This BAM is assumed to have passed step 5 filters, including the stringent 25nt threshold for filter 5d)
OUTPUT_REDI="rna_editing_sites.txt"

MIN_COVERAGE=6 # "covered by more than 5 reads" means >= 6 reads

# REDItools installation (example)
# conda install -c bioconda reditools

# Execute REDItoolsDnaRna.py to identify A-to-I RNA editing sites
# REDItoolsDnaRna.py identifies A-to-I (A>G on sense, T>C on antisense) by default.
# -s 1: strand specific (important for sense/antisense distinction)
# -c: minimum coverage for a site to be considered
# -q: minimum base quality (default 25, can be adjusted if needed, but input reads are filtered)
# -m: minimum mapping quality (default 20, can be adjusted if needed, but input reads are filtered)
# -f: reference genome FASTA file
# -i: input RNA-seq BAM file (assumed to be filtered as per description)
# -o: output file for identified RNA editing sites

REDItoolsDnaRna.py \
    -s 1 \
    -c "${MIN_COVERAGE}" \
    -q 25 \
    -m 20 \
    -f "${REFERENCE_GENOME}" \
    -i "${INPUT_BAM}" \
    -o "${OUTPUT_REDI}"

Quantify sites: The extent of editing at each site and our confidence in that prediction were quantified by a novel extension of the classical Bayesian model used for genomic variants, which is described in more detail in the next section.

Custom Bayesian RNA Editing Quantifier (Inferred with models/gemini-2.5-flash) v1.0.0

$ Bash example

# This is an inferred custom script based on the description of a "novel extension of the classical Bayesian model"
# for quantifying RNA editing sites. The actual tool name and parameters would depend on the specific implementation
# described in the "next section" of the original source.

# Example installation (hypothetical, assuming a Python-based tool)
# git clone https://github.com/your_lab/rna_editing_quantifier.git # Placeholder URL
# cd rna_editing_quantifier
# pip install . # Or python setup.py install

# Define input files (placeholders)
ALIGNED_READS="sample.bam" # Path to the aligned RNA-seq reads (BAM/CRAM)
REFERENCE_GENOME="hg38.fa" # Path to the reference genome FASTA file (e.g., human hg38)
OUTPUT_DIR="rna_editing_results"
mkdir -p "${OUTPUT_DIR}"

# Execute the inferred RNA editing quantification tool
# This command is hypothetical and based on common practices for such tools.
# It would typically take aligned reads and a reference genome, and output a table
# of editing sites with quantification and confidence scores.
custom_bayesian_rna_editing_quantifier \
    --input_bam "${ALIGNED_READS}" \
    --reference_fasta "${REFERENCE_GENOME}" \
    --output_file "${OUTPUT_DIR}/editing_sites.tsv" \
    --min_coverage 10 \
    --min_base_quality 20 \
    --threads 8 \
    --report_confidence

To increase the accuracy and confidence of our predictions, we used additional reads from the relaxed version of filter 5d) that overlap the sites identified in step 6.

bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 GitHub

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools

# This step uses additional reads (from a relaxed filter) to re-quantify signal
# at previously identified sites (e.g., peaks or binding sites from step 6).
# This re-quantification helps increase the accuracy and confidence of predictions.

# Input:
#   relaxed_filtered_reads.bam: BAM file containing reads from the relaxed version of filter 5d).
#   step6_identified_sites.bed: BED file containing the genomic sites identified in step 6.
# Output:
#   site_read_counts.bed: A BED file with an additional column indicating the number of
#                         relaxed reads overlapping each site.

# Example command using bedtools coverage to count overlapping reads:
bedtools coverage -a step6_identified_sites.bed -b relaxed_filtered_reads.bam -counts > site_read_counts.bed

View on GitHub

Moreover, we dropped sites that exhibited editing in 100% of the reads (suggesting a genomic variant not filtered out by step 4) and those with very low editing (less than 10%), which would have been hard to distinguish from sequencing errors.

awk (Inferred with models/gemini-2.5-flash) vN/A GitHub

$ Bash example

# This command filters a tab-separated file of RNA editing sites.
# It assumes the editing percentage (a numerical value between 0 and 100) is located in the 5th column.
# Sites exhibiting 100% editing are excluded, as they are likely genomic variants.
# Sites with less than 10% editing are also excluded, as they are difficult to distinguish from sequencing errors.
# The 'NR==1' condition ensures that the header row of the input file is preserved in the output.
awk -F'\t' 'NR==1 || ($5 >= 10 && $5 < 100)' input_editing_sites.tsv > filtered_editing_sites.tsv

View on GitHub

The predicted RNA editing sites from each strain were characterized according to their position in annotated genic regions (introns, exons, 3'/5' UTRs, etc.) and according to their overlap with other strains.

bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 GitHub

$ Bash example

# Install bedtools (if not already installed)
# conda install -c bioconda bedtools

# Define input files (placeholders)
STRAIN1_SITES="strain1_predicted_editing_sites.bed"
STRAIN2_SITES="strain2_predicted_editing_sites.bed"
STRAIN3_SITES="strain3_predicted_editing_sites.bed" # Example for multi-strain overlap

# Reference genome annotation (GTF/GFF) from which genic regions would be derived.
# Example: Homo_sapiens.GRCh38.109.gtf
# Source: http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz

# Placeholder for a pre-processed BED file containing genic regions (e.g., exons, introns, UTRs).
# This file would typically be generated from a GTF/GFF file using tools like gtf2bed or custom scripts.
# It should have at least 4 columns: chrom, start, end, feature_type (e.g., exon, intron, 3UTR, 5UTR).
GENIC_REGIONS_BED="all_genic_regions.bed"

# --- Step 1: Characterize according to their position in annotated genic regions ---
# Intersect predicted RNA editing sites from each strain with annotated genic regions.
# The -wao (write original and overlap) option adds columns for overlapping features.
# The -s (strand-specific) option is used if sites and features have strand information.
# Post-processing (e.g., using awk or a Python script) would then categorize each site
# based on the overlapping features (e.g., "exonic", "intronic", "intergenic").
bedtools intersect -a "${STRAIN1_SITES}" -b "${GENIC_REGIONS_BED}" -wao -s > "${STRAIN1_SITES%.bed}_genic_annotated.bed"
bedtools intersect -a "${STRAIN2_SITES}" -b "${GENIC_REGIONS_BED}" -wao -s > "${STRAIN2_SITES%.bed}_genic_annotated.bed"
# Repeat for other strains as needed.

# --- Step 2: Characterize according to their overlap with other strains ---
# Use bedtools multiinter to identify regions that overlap across multiple strains.
# The output indicates which input files (strains) each interval overlaps.
# The 4th column of the output indicates the number of overlaps.
# The last column lists the indices of the input files that overlap.
bedtools multiinter -i "${STRAIN1_SITES}" "${STRAIN2_SITES}" "${STRAIN3_SITES}" > multi_strain_editing_site_overlap.bed

# Further analysis would combine the genic annotation and multi-strain overlap information.

View on GitHub

Tools Used

TopHat RNA-seq

Raw Source Text

ssRNAseq: The adr-1(-);adr-2(-) sample was sequenced on one lane of Illumina's HiSeq 2000 yielding 216 million single-end 76nt reads. Each other sample was sequenced on a lane of Illumina GAII yielding between 37 and 42 million reads of the same type. Illumina Casava1.8.2 software used for basecalling.
Mapping: Sequenced reads were mapped to the C. elegans reference genome (ce10) using TopHat aligner (version 2.0.6) allowing only uniquely-mapped reads with up to two mismatches each with command line options -Mx 1 and -N 2
Variant calling: sites with RNA-DNA differences were identified by SAMtools mpileup (version 0.1.18) tallying up to 1000 alignments per site. Additional command line options used were -D -I and -g.
Site filters: Annotated SNPs were obtained from Illumina's iGenomes collection for C. elegans (ce10) and unannotated variants were extracted from the adr-1(-);adr-2(-) RNA-Seq dataset. These genomic variants were filtered from the putative sites in all other strains reducing the number of false-positive predictions.
Read filters: Each read aligned to one the remaining putative sites was filtered out if: a) it was a suspected PCR duplicate, according to SAMtools rmdup (version 0.1.18); b) it had a junction overhang < 10nt according to its SAMtools CIGAR string; c) it had > 1 non-A2G or non-C2T mismatch or any short indel, per its MD tag; d) it had a mismatch less than 25nt away from either end of the read
Identify sites: Putative RNA editing sites were identified from A2G variants on the sense strand and T2C variants on the antisense strand that were covered by more than 5 reads which passed the filters in step 5, including the stringent 25nt threshold for filter 5d).
Quantify sites: The extent of editing at each site and our confidence in that prediction were quantified by a novel extension of the classical Bayesian model used for genomic variants, which is described in more detail in the next section.
To increase the accuracy and confidence of our predictions, we used additional reads from the relaxed version of filter 5d) that overlap the sites identified in step 6. Moreover, we dropped sites that exhibited editing in 100% of the reads (suggesting a genomic variant not filtered out by step 4) and those with very low editing (less than 10%), which would have been hard to distinguish from sequencing errors.
The predicted RNA editing sites from each strain were characterized according to their position in annotated genic regions (introns, exons, 3'/5' UTRs, etc.) and according to their overlap with other strains.
Supplementary_files_format_and_content: tab-delimited text file containing #READS and %EDITING values for each editing site
Genome_build: ce10

← Back to Analysis