GSE90963 Processing Pipeline

RNA-Seq code_examples 35 steps

Publication

Pseudouridine synthases modify human pre-mRNA co-transcriptionally and affect pre-mRNA processing.

Molecular cell (2022) — PMID 35051350

Dataset

Transcriptome-wide Profiling of Multiple RNA Modifications Simultaneously at Single-base Resolution

Warning: Pipeline descriptions and code snippets may be inferred or AI-generated. Use them only as a starting point to guide analysis, and validate before use.

Processing Steps

Generate Jupyter Notebook

Reference genome: Novoindex (2.8) (http://www.novocraft.com) was used to create a standard and bisulfite reference index on a combination of hg19 chromosome, scaffold and splice junction sequences.

Novoindex v2.8 GitHub

$ Bash example

# Install Novoalign (which includes Novoindex)
# conda install -c bioconda novocraft-novoalign

# Placeholder for the combined hg19 reference sequences.
# This file would contain hg19 chromosomes, scaffolds, and splice junction sequences.
# Example: cat hg19.fa hg19_scaffolds.fa hg19_splice_junctions.fa > hg19_combined.fa
REFERENCE_FASTA="hg19_combined.fa"

# Create a standard reference index
novoindex hg19_standard.nix "${REFERENCE_FASTA}"

# Create a bisulfite reference index
novoindex -b hg19_bisulfite.nix "${REFERENCE_FASTA}"

View on GitHub

Splice junction sequences were generated with USeq (v8.8.8) 1 MakeTranscriptome using Ensembl transcript annotations (build 74).

Ensembl v8.8.8

$ Bash example

# Download Ensembl GRCh37 build 74 genome and GTF
# For human (Homo sapiens)
# wget ftp://ftp.ensembl.org/pub/release-74/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.74.dna.primary_assembly.fa.gz
# wget ftp://ftp.ensembl.org/pub/release-74/gtf/homo_sapiens/Homo_sapiens.GRCh37.74.gtf.gz
# gunzip Homo_sapiens.GRCh37.74.dna.primary_assembly.fa.gz
# gunzip Homo_sapiens.GRCh37.74.gtf.gz

# Install USeq (assuming Java is installed)
# wget https://github.com/davidnix/USeq/releases/download/v8.8.8/USeq_8.8.8.zip
# unzip USeq_8.8.8.zip
# mv USeq_8.8.8/USeq.jar . # Or place it in your PATH

# Define variables
GENOME_FASTA="Homo_sapiens.GRCh37.74.dna.primary_assembly.fa"
ANNOTATION_GTF="Homo_sapiens.GRCh37.74.gtf"
OUTPUT_PREFIX="ensembl_74_transcriptome" # Output prefix for transcriptome FASTA and potentially other files
USEQ_JAR="USeq.jar" # Adjust path if necessary

# Generate transcriptome and associated junction information using USeq MakeTranscriptome
# The description states "Splice junction sequences were generated".
# MakeTranscriptome primarily generates a transcriptome FASTA, which implicitly defines junctions.
# It can also output a BED file of junctions. The output files will be prefixed with ${OUTPUT_PREFIX}.
java -Xmx4G -jar "${USEQ_JAR}" MakeTranscriptome \
    -g "${GENOME_FASTA}" \
    -a "${ANNOTATION_GTF}" \
    -o "${OUTPUT_PREFIX}"

Alignment parameters: BS and NBS samples were aligned to the transcriptome indexes described above using Novoalign (v2.08.01) (http://www.novocraft.com) allowing up to 50 alignments and trimming adaptor sequences (TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC GATCGTCGGACTGTAGAACTCTGAAC).

Novoalign v2.08.01 GitHub

$ Bash example

# Install Novoalign (Note: Novoalign is commercial software, installation typically involves downloading and licensing from Novocraft)
# For example, if you have the executable in your PATH:

# Define variables for input, output, and reference
READ1="sample_R1.fastq.gz"
READ2="sample_R2.fastq.gz"
TRANSCRIPTOME_INDEX="transcriptome_index.nix" # Placeholder for the transcriptome index file
OUTPUT_SAM="sample_aligned.sam"

ADAPTER_FWD="TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC"
ADAPTER_REV="GATCGTCGGACTGTAGAACTCTGAAC"

# Align reads using Novoalign
novoalign -d "${TRANSCRIPTOME_INDEX}" \
          -f "${READ1}" "${READ2}" \
          -a "${ADAPTER_FWD}" "${ADAPTER_REV}" \
          -r All 50 \
          -o SAM > "${OUTPUT_SAM}"

# Convert SAM to BAM, sort, and index (common post-alignment steps)
# samtools view -bS "${OUTPUT_SAM}" | samtools sort -o "${OUTPUT_SAM%.sam}.bam" -
# samtools index "${OUTPUT_SAM%.sam}.bam"

View on GitHub

Alignment processing : The alignment files were processed with USeq SamTranscriptomeParser, which reports the best alignment location for each read in bam format and converts the coordinates of splice junction alignments to genomic space.

USeq SamTranscriptomeParser v8.0.0

$ Bash example

# Install Java Development Kit (if not already installed)
# sudo apt-get update
# sudo apt-get install default-jdk

# Download USeq JAR (replace with actual download link or build instructions if needed)
# The USeq project is typically built from source or a pre-compiled JAR is used.
# As of version 8.0.0, you might need to build it using Maven:
# git clone https://github.com/davebx/useq.git
# cd useq
# mvn clean install # Requires Maven to be installed
# The USeq.jar file will then be located in the 'target/' directory.

# Define input, output, and reference files
INPUT_BAM="input_transcriptome_aligned.bam" # Placeholder for your transcriptome-aligned BAM file
OUTPUT_BAM="output_genomic_aligned.bam"   # Placeholder for the output BAM file with genomic coordinates
GENOME_FASTA="hg38.fasta"                 # Placeholder for the reference genome FASTA file (e.g., hg38.fasta)
GENE_ANNOTATION_GTF="gencode.v44.annotation.gtf" # Placeholder for the gene annotation GTF file (e.g., Gencode v44 for hg38)
USEQ_JAR="USeq.jar"                       # Path to the USeq JAR file (e.g., target/USeq.jar if built from source)

# Execute USeq SamTranscriptomeParser
# -Xmx4G allocates 4GB of memory to the Java Virtual Machine, adjust as needed.
java -Xmx4G -jar "${USEQ_JAR}" SamTranscriptomeParser \
    -bamFile "${INPUT_BAM}" \
    -genomeFasta "${GENOME_FASTA}" \
    -geneModelFile "${GENE_ANNOTATION_GTF}" \
    -saveFile "${OUTPUT_BAM}"

One alignment location is selected at random if a read aligned to multiple locations in the genome with equal confidence.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (if not already installed)
# conda install -c bioconda star

# Define variables (replace with actual paths and names)
GENOME_DIR="/path/to/your/genome_index/hg38"
READ1_FASTQ="read1.fastq.gz"
READ2_FASTQ="read2.fastq.gz" # If paired-end, otherwise remove
OUTPUT_PREFIX="aligned_reads"
NUM_THREADS=8

# Run STAR alignment with random selection for multi-mappers with equal confidence
STAR \
  --genomeDir ${GENOME_DIR} \
  --readFilesIn ${READ1_FASTQ} ${READ2_FASTQ} \
  --runThreadN ${NUM_THREADS} \
  --outFileNamePrefix ${OUTPUT_PREFIX}. \
  --outSAMtype BAM SortedByCoordinate \
  --outMultimapperOrder Random \
  --outFilterMultimapNmax 100 \
  --outSAMmultNmax 1 \
  --quantMode GeneCounts # Optional: Add gene quantification if needed

View on GitHub

Alignments were reordered using Picard (v1.87) (http://picard.sourceforge.net) ReorderBam and all alignments were set as primary using the set_sam_primary.py (https://github.com/HuntsmanCancerInstitute/RBSSeqTools).

Picard v1.87 GitHub

$ Bash example

# Install Picard (example using conda)
# conda install -c bioconda picard

# Download set_sam_primary.py (example)
# git clone https://github.com/HuntsmanCancerInstitute/RBSSeqTools.git
# SCRIPT_DIR="RBSSeqTools" # Assuming the clone creates this directory

# Define input and output files
INPUT_BAM="input.bam"
REORDERED_BAM="reordered.bam"
OUTPUT_BAM="output.bam" # Final output after setting primary
REFERENCE_DICT="reference.fasta.dict" # Placeholder for reference dictionary (e.g., from GATK resource bundle)

# Reorder alignments using Picard ReorderSam (v1.87)
# Note: Picard uses ReorderSam for both SAM and BAM files.
java -jar picard.jar ReorderSam \
    I="${INPUT_BAM}" \
    O="${REORDERED_BAM}" \
    REFERENCE="${REFERENCE_DICT}"

# Set all alignments as primary using set_sam_primary.py
# Assuming the script is located at ${SCRIPT_DIR}/set_sam_primary.py or in PATH
python ${SCRIPT_DIR}/set_sam_primary.py \
    -i "${REORDERED_BAM}" \
    -o "${OUTPUT_BAM}"

View on GitHub

The alignments were then processed with GATK (v3.4-0) 2 SplitNCigarReads, which splits spliced alignments into multiple alignment segments without N cigar elements.

GATK v3.4

$ Bash example

# GATK 3.4 is an older version. It's typically run as a Java JAR file.
# Download GATK 3.4 from the Broad Institute archives or use a specific conda package if available.
# Example manual download and setup:
# wget https://software.broadinstitute.org/gatk/download/auth?package=GATK-3.4.tar.bz2 -O GATK-3.4.tar.bz2
# tar -xjvf GATK-3.4.tar.bz2
# export GATK_JAR="/path/to/GATK-3.4/GenomeAnalysisTK.jar"

# Define input/output files and reference genome
INPUT_BAM="input_alignments.bam" # Replace with your input BAM file
OUTPUT_BAM="output_split_alignments.bam" # Replace with your desired output BAM file
REFERENCE_FASTA="human_g1k_v37.fasta" # Placeholder: Replace with your reference genome FASTA file (e.g., hg19, GRCh37)

# Run GATK SplitNCigarReads
# This command splits spliced alignments (containing 'N' cigar elements) into multiple segments.
java -jar "${GATK_JAR}" \
    -T SplitNCigarReads \
    -R "${REFERENCE_FASTA}" \
    -I "${INPUT_BAM}" \
    -o "${OUTPUT_BAM}" \
    -rf ReassignOneMappingQuality \
    -RMQF 255 \
    -U AllowOriginalQualities

Reads containing deletions four base pairs or longer were removed using remove_deletions.py (https://github.com/HuntsmanCancerInstitute/RBSSeqTools).

remove_deletions.py vNot specified GitHub

$ Bash example

# Installation: Clone the repository and ensure Python environment is set up
# git clone https://github.com/HuntsmanCancerInstitute/RBSSeqTools.git
# cd RBSSeqTools
# Assuming remove_deletions.py is in the current directory or accessible via PATH
# Example usage with placeholder input/output files
python remove_deletions.py -d 4 -o filtered_reads.bam input_reads.bam

View on GitHub

m5C analysis pipeline: BS and NBS sample reads were aligned and processed as outlined above.

Bismark (Inferred with models/gemini-2.5-flash) v0.24.2 GitHub

$ Bash example

# Install Bismark and its dependencies (Bowtie2/HISAT2, Samtools, etc.)
# conda install -c bioconda bismark
# conda install -c bioconda bowtie2
# conda install -c bioconda samtools

# Define reference genome and sample names
# Using hg38 as a placeholder for the latest human assembly
REF_GENOME_DIR="/path/to/your/hg38_bisulfite_genome"
UNCONVERTED_REF_GENOME="/path/to/your/hg38_unconverted_genome/hg38.fa"
BS_SAMPLE_R1="bs_sample_R1.fastq.gz"
BS_SAMPLE_R2="bs_sample_R2.fastq.gz" # If paired-end
NBS_SAMPLE_R1="nbs_sample_R1.fastq.gz"
NBS_SAMPLE_R2="nbs_sample_R2.fastq.gz" # If paired-end

# --- m5C Analysis for BS Samples (Core Pipeline) ---

# 1. Prepare bisulfite reference genome (if not already done)
# This step only needs to be run once per reference genome for Bismark.
# bismark_genome_preparation --bowtie2 ${REF_GENOME_DIR}

# 2. Align BS sample reads to the bisulfite-converted genome
# Using paired-end alignment with 4 threads and outputting to a specified directory.
bismark --bowtie2 --temp_dir . -p 4 --output_dir ./bismark_output_bs ${REF_GENOME_DIR} -1 ${BS_SAMPLE_R1} -2 ${BS_SAMPLE_R2}

# 3. Extract methylation information from aligned BS reads
# Generates methylation calls, bedGraph, and cytosine report.
bismark_methylation_extractor --paired-end --no_overlap --report --bedGraph --cytosine_report --output ${REF_GENOME_DIR} ./bismark_output_bs/${BS_SAMPLE_R1%.fastq.gz}_bismark_bt2_pe.bam

# --- Processing for NBS Samples (for comparison within the m5C pipeline) ---
# NBS samples are typically used as controls to identify non-methylation related C>T conversions or for general genomic context.
# They are usually aligned with a standard aligner to the unconverted genome.

# Example: Align NBS sample reads to the standard (unconverted) reference genome using Bowtie2
# bowtie2 -x ${UNCONVERTED_REF_GENOME} -1 ${NBS_SAMPLE_R1} -2 ${NBS_SAMPLE_R2} | samtools view -bS - > ./nbs_output/${NBS_SAMPLE_R1%.fastq.gz}.bam
# samtools sort ./nbs_output/${NBS_SAMPLE_R1%.fastq.gz}.bam -o ./nbs_output/${NBS_SAMPLE_R1%.fastq.gz}.sorted.bam
# samtools index ./nbs_output/${NBS_SAMPLE_R1%.fastq.gz}.sorted.bam

# Further processing for NBS (e.g., variant calling, general read coverage) would be integrated with BS data for differential methylation analysis or other comparative studies.

View on GitHub

Alignments were further processed using CreateMethTable (https://github.com/HuntsmanCancerInstitute/RBSSeqTools), which iterates over the alignment file and reports forward alignment base counts at all reference ÃCÃ positions and in reverse alignment base counts at all reference ÃGÃ positions.

CreateMethTable vNot specified (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install RBSSeqTools (assuming Python environment)
# git clone https://github.com/HuntsmanCancerInstitute/RBSSeqTools.git
# cd RBSSeqTools
# python setup.py install # Or ensure CreateMethTable.py is executable and in PATH

# Define input and output files
INPUT_ALIGNMENT="input.bam" # Placeholder for the alignment file (e.g., sorted BAM)
REFERENCE_GENOME="hg38.fa" # Placeholder for the reference genome FASTA file (e.g., GRCh38)
OUTPUT_METH_TABLE="output.meth.table" # Placeholder for the output methylation table

# Execute CreateMethTable
# The script iterates over the alignment file and reports base counts at C/G positions.
CreateMethTable.py -i "${INPUT_ALIGNMENT}" -r "${REFERENCE_GENOME}" -o "${OUTPUT_METH_TABLE}"

View on GitHub

A Fisher's exact test p-value was calculated for each position by comparing the observed count of unconverted and converted bases to the expected count, which was calculated assuming an error rate of 0.001.

Custom Python Script using SciPy (Inferred with models/gemini-2.5-flash) vSciPy GitHub

$ Bash example

# Install necessary libraries
# conda install -c anaconda scipy pandas
# or
# pip install scipy pandas

# Create a Python script (e.g., calculate_fisher_pvalue.py)
cat << 'EOF' > calculate_fisher_pvalue.py
import pandas as pd
from scipy.stats import fisher_exact
import sys

def calculate_fisher_pvalue(input_file, output_file, error_rate=0.001, alternative='greater'):
    """
    Calculates Fisher's exact test p-value for each position.

    Compares observed unconverted/converted counts to expected counts
    derived from a background error rate.
    """
    try:
        df = pd.read_csv(input_file, sep='\t')
    except FileNotFoundError:
        print(f"Error: Input file '{input_file}' not found.", file=sys.stderr)
        sys.exit(1)

    if 'unconverted_count' not in df.columns or 'converted_count' not in df.columns:
        print("Error: Input file must contain 'unconverted_count' and 'converted_count' columns.", file=sys.stderr)
        sys.exit(1)

    results = []
    # Use a large, fixed total for the hypothetical background sample
    # to represent the population error rate. This ensures the 'expected' row
    # in the Fisher's test table accurately reflects the error_rate without
    # being tied to the specific total observed counts at each position.
    # A very large number makes the expected proportions stable.
    hypothetical_total_for_background = 1_000_000_000 # 1 billion
    expected_unconverted_bg = round(hypothetical_total_for_background * (1 - error_rate))
    expected_converted_bg = round(hypothetical_total_for_background * error_rate)

    for index, row in df.iterrows():
        observed_unconverted = row['unconverted_count']
        observed_converted = row['converted_count']

        # Create the 2x2 contingency table
        # Row 1: Observed counts at the current position
        # Row 2: Hypothetical background counts reflecting the error rate
        table = [
            [observed_unconverted, observed_converted],
            [expected_unconverted_bg, expected_converted_bg]
        ]

        # Perform Fisher's exact test
        # 'greater' alternative tests if observed conversion is significantly higher than background
        odds_ratio, p_value = fisher_exact(table, alternative=alternative)
        results.append(p_value)

    df['fisher_p_value'] = results
    df.to_csv(output_file, sep='\t', index=False)
    print(f"Fisher's exact test p-values calculated and saved to '{output_file}'.")

if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: python calculate_fisher_pvalue.py <input_tsv_file> <output_tsv_file> [error_rate] [alternative]", file=sys.stderr)
        sys.exit(1)

    input_file = sys.argv[1]
    output_file = sys.argv[2]
    error_rate = float(sys.argv[3]) if len(sys.argv) > 3 else 0.001
    alternative = sys.argv[4] if len(sys.argv) > 4 else 'greater'

    calculate_fisher_pvalue(input_file, output_file, error_rate, alternative)
EOF

# Example usage:
# Assume 'base_conversion_counts.tsv' is an input file with 'unconverted_count' and 'converted_count' columns.
# Example input file content:
# position\tunconverted_count\tconverted_count
# 1\t1000\t1
# 2\t500\t5
# 3\t2000\t0

python calculate_fisher_pvalue.py base_conversion_counts.tsv base_conversion_pvalues.tsv 0.001 'greater'

View on GitHub

The table was filtered to create a final list of candidates by selecting only those sites which had a read depth Â³10 in both BS and NBS datasets, a non-conversion rate of Â³20% in the BS dataset and a Fisher exact p-value Â²0.05.

awk (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Assuming input file is 'candidates.tsv' and output file is 'filtered_candidates.tsv'
# Assuming the input table is tab-separated and contains a header row.
# Column assignments (these are placeholders and should be adjusted based on the actual table structure):
# $4: read_depth_BS
# $5: read_depth_NBS
# $6: non_conversion_rate_BS (as a decimal, e.g., 0.20 for 20%)
# $7: fisher_p_value

awk 'BEGIN {OFS="\t"} NR==1 {print; next} $4 >= 10 && $5 >= 10 && $6 >= 0.20 && $7 <= 0.05' candidates.tsv > filtered_candidates.tsv

These sites were further screened manually for evidence of mis-mapping due to loss of complexity, or nonconversion arising from nondenatured secondary structure.

Manual Review vN/A

$ Bash example

# This step describes a manual screening process.
# No specific software tool or command is applicable for this manual review step.
# The screening involves checking for mis-mapping due to loss of complexity or nonconversion from nondenatured secondary structure.

m1A analysis pipeline: BS and NBS sample reads were aligned and processed as listed above.

Bismark (Inferred with models/gemini-2.5-flash) v0.24.0

$ Bash example

# Install Bismark and STAR (if not already installed)
# conda install -c bioconda bismark=0.24.0
# conda install -c bioconda star=2.7.10a

# Define reference genome and input files
GENOME_DIR="./genome_hg38"
BS_READS_R1="bs_sample_R1.fastq.gz"
BS_READS_R2="bs_sample_R2.fastq.gz"
NBS_READS_R1="nbs_sample_R1.fastq.gz"
NBS_READS_R2="nbs_sample_R2.fastq.gz"

# Placeholder for reference genome FASTA (e.g., human hg38) and GTF (e.g., Gencode)
# Download hg38.fa from UCSC or Ensembl
# mkdir -p ${GENOME_DIR}
# wget -O ${GENOME_DIR}/hg38.fa.gz "http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz"
# gunzip ${GENOME_DIR}/hg38.fa.gz
# wget -O ${GENOME_DIR}/gencode.v44.annotation.gtf.gz "https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.annotation.gtf.gz"
# gunzip ${GENOME_DIR}/gencode.v44.annotation.gtf.gz

# 1. Prepare Bismark genome index (for BS reads)
# This step needs to be run only once per reference genome. Uses Bowtie2 as aligner.
# bismark_genome_preparation --path_to_aligner bowtie2 ${GENOME_DIR}

# 2. Prepare STAR genome index (for NBS reads)
# This step needs to be run only once per reference genome.
# mkdir -p ${GENOME_DIR}/STAR_index
# STAR --runThreadN 8 --runMode genomeGenerate --genomeDir ${GENOME_DIR}/STAR_index --genomeFastaFiles ${GENOME_DIR}/hg38.fa --sjdbGTFfile ${GENOME_DIR}/gencode.v44.annotation.gtf --sjdbOverhang 100

# 3. Align BS sample reads with Bismark
# Using --pbat for PBAT libraries, common in RNA BS-seq. Adjust if library type is different.
# --score_min L,0,-0.6 is a common setting for RNA BS-seq to allow for more mismatches due to conversions.
# Output BAM files will be in the specified output_dir.
bismark --genome ${GENOME_DIR} \
        -1 ${BS_READS_R1} \
        -2 ${BS_READS_R2} \
        --output_dir ./bismark_bs_output \
        --temp_dir ./bismark_tmp \
        --bowtie2 \
        --pbat \
        --score_min L,0,-0.6 \
        --multicore 4

# 4. Align NBS sample reads with STAR (assuming NBS is non-bisulfite RNA-seq control)
# Adjust --runThreadN based on available cores.
STAR --genomeDir ${GENOME_DIR}/STAR_index \
     --readFilesIn ${NBS_READS_R1} ${NBS_READS_R2} \
     --readFilesCommand zcat \
     --outFileNamePrefix ./star_nbs_output/nbs_sample_ \
     --outSAMtype BAM SortedByCoordinate \
     --outSAMattributes All \
     --outFilterMultimapNmax 20 \
     --outFilterMismatchNmax 999 \
     --outFilterMismatchNoverLmax 0.1 \
     --alignIntronMin 20 \
     --alignIntronMax 1000000 \
     --alignMatesGapMax 1000000 \
     --runThreadN 8

# 5. Extract methylation information from BS aligned reads (post-processing for BS)
# This step generates coverage files and methylation reports.
bismark_methylation_extractor \
    --output ./bismark_bs_output \
    --multicore 4 \
    --no_overlap \
    --report \
    --comprehensive \
    --bedGraph \
    ./bismark_bs_output/${BS_READS_R1%.fastq.gz}_bismark_bt2_pe.bam

# Note: Further processing for m1A analysis would involve comparing BS and NBS data,
# potentially using custom scripts or specialized tools to identify m1A sites based on conversion rates.

CreateMethTable was used to report forward alignment base counts at all reference ÃAÃ positions with at least 100x coverage and in reverse alignment base counts at all reference ÃTÃ positions with at least 100x coverage.

CreateMethTable

$ Bash example

# Assuming CreateMethTable is a custom script or executable.
# Placeholder for input BAM file and reference genome.
# Replace with actual file paths.
INPUT_BAM="aligned_reads.bam"
REFERENCE_FASTA="hg38.fasta" # Using hg38 as a common latest assembly placeholder
OUTPUT_TABLE="methylation_report.tsv"

# Execute CreateMethTable with inferred parameters.
# Parameter names are speculative as it's likely a custom script.
CreateMethTable \
  --input "$INPUT_BAM" \
  --reference "$REFERENCE_FASTA" \
  --output "$OUTPUT_TABLE" \
  --min-coverage 100 \
  --report-forward-A \
  --report-reverse-T

A one-tailed Fisher's exact test, corrected for multiple comparisons using the Benjamini-Hochberg (BH) procedure, was used to test if the mismatch rate in the NBS sample was greater than the BS sample.

R (Inferred with models/gemini-2.5-flash) v>= 4.0.0

$ Bash example

# Install R if not already installed (example for Ubuntu)
# sudo apt-get update
# sudo apt-get install r-base

# Create a dummy R script to perform the analysis
cat << 'EOF' > fisher_bh_test.R
# Simulate data for multiple comparisons
# In a real scenario, this data would be loaded from a file (e.g., CSV, TSV)
# Each row represents a comparison (e.g., a specific genomic position or feature)
# Columns: NBS_mismatches, NBS_matches, BS_mismatches, BS_matches
data <- data.frame(
  NBS_mismatches = c(10, 5, 12, 8, 15),
  NBS_matches = c(90, 95, 88, 92, 85),
  BS_mismatches = c(3, 2, 4, 3, 6),
  BS_matches = c(97, 98, 96, 97, 94)
)

# Initialize a vector to store p-values
p_values <- numeric(nrow(data))

# Perform one-tailed Fisher's exact test for each comparison
for (i in 1:nrow(data)) {
  # Create a 2x2 contingency table for the current comparison
  # Rows: NBS, BS
  # Columns: Mismatches, Matches
  contingency_table <- matrix(c(
    data$NBS_mismatches[i], data$BS_mismatches[i],
    data$NBS_matches[i], data$BS_matches[i]
  ), nrow = 2, byrow = TRUE)

  # Perform one-tailed Fisher's exact test (alternative = "greater")
  # This tests if the odds ratio (NBS_mismatches/NBS_matches) / (BS_mismatches/BS_matches) is greater than 1
  test_result <- fisher.test(contingency_table, alternative = "greater")
  p_values[i] <- test_result$p.value
}

# Apply Benjamini-Hochberg (BH) correction for multiple comparisons
adjusted_p_values <- p.adjust(p_values, method = "BH")

# Combine results
results <- data.frame(
  Comparison = 1:nrow(data),
  NBS_mismatches = data$NBS_mismatches,
  NBS_matches = data$NBS_matches,
  BS_mismatches = data$BS_mismatches,
  BS_matches = data$BS_matches,
  P_value = p_values,
  Adjusted_P_value_BH = adjusted_p_values
)

# Print results
print("Fisher's Exact Test Results with BH Correction:")
print(results)

# Optionally, save results to a CSV file
# write.csv(results, "fisher_bh_results.csv", row.names = FALSE)
EOF

# Execute the R script
Rscript fisher_bh_test.R

One-tailed binomial tests, corrected using BH, were used to test if the number of C, G or T bases were greater than expected given error rate (0.001).

(Inferred with models/gemini-2.5-flash)

$ Bash example

# This command is a conceptual representation as the specific tool
# and its command-line interface are not explicitly provided in the description.
# It assumes a custom script or a module within a larger framework
# that performs one-tailed binomial tests with Benjamini-Hochberg (BH) correction.

# Parameters inferred from the description:
# - expected_error_rate: 0.001
# - bases_to_test: C, G, T
# - correction_method: BH (Benjamini-Hochberg)

# Placeholder for a hypothetical execution command:
run_base_enrichment_test \
    --input_data "sequencing_data.bam" \
    --expected_error_rate 0.001 \
    --test_bases C G T \
    --correction_method BH \
    --output_report "base_enrichment_results.tsv"

Positions where 1) the mismatch rate of the NBS sample was significantly higher than the BS sample, 2) two of C, G or T were significantly higher than the error rate and 3) the FDR or C was not lower than both G and T were reported as m1A sites.

Custom Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This script identifies m1A sites based on the criteria described.
# It assumes an input TSV file with pre-calculated metrics for each potential site.
# The specific column names and statistical thresholds (e.g., p-values, FDRs) are illustrative
# and should be adjusted based on the actual upstream analysis output.

# Example Python script (m1a_site_caller.py) to implement the logic:
# import pandas as pd
# import sys

# def identify_m1a_sites(input_file, output_file, p_threshold_mismatch=0.05, p_threshold_base_error=0.05):
#     """
#     Identifies m1A sites based on specific criteria from a given input file.

#     Args:
#         input_file (str): Path to the input TSV file containing site metrics.
#                           Expected columns (example names):
#                           'site_id', 'mismatch_rate_NBS', 'mismatch_rate_BS', 'pvalue_NBS_vs_BS',
#                           'C_rate', 'G_rate', 'T_rate', 'error_rate',
#                           'pvalue_C_vs_error', 'pvalue_G_vs_error', 'pvalue_T_vs_error',
#                           'FDR_C', 'FDR_G', 'FDR_T'
#         output_file (str): Path to the output TSV file for identified m1A sites.
#         p_threshold_mismatch (float): P-value threshold for NBS vs BS mismatch rate.
#         p_threshold_base_error (float): P-value threshold for C/G/T vs error rate.
#     """
#     try:
#         df = pd.read_csv(input_file, sep='\t')
#     except FileNotFoundError:
#         print(f"Error: Input file '{input_file}' not found.", file=sys.stderr)
#         sys.exit(1)
#     except Exception as e:
#         print(f"Error reading input file: {e}", file=sys.stderr)
#         sys.exit(1)

#     # Ensure required columns exist (adjust names as per actual data)
#     required_cols = [
#         'site_id', 'mismatch_rate_NBS', 'mismatch_rate_BS', 'pvalue_NBS_vs_BS',
#         'C_rate', 'G_rate', 'T_rate', 'error_rate',
#         'pvalue_C_vs_error', 'pvalue_G_vs_error', 'pvalue_T_vs_error',
#         'FDR_C', 'FDR_G', 'FDR_T'
#     ]
#     missing_cols = [col for col in required_cols if col not in df.columns]
#     if missing_cols:
#         print(f"Error: Missing required columns in input file: {', '.join(missing_cols)}", file=sys.stderr)
#         sys.exit(1)

#     m1a_sites = []

#     for index, row in df.iterrows():
#         # Condition 1: Mismatch rate of the NBS sample was significantly higher than the BS sample
#         cond1 = (row['mismatch_rate_NBS'] > row['mismatch_rate_BS']) and \
#                 (row['pvalue_NBS_vs_BS'] < p_threshold_mismatch)

#         # Condition 2: Two of C, G or T were significantly higher than the error rate
#         significant_bases = 0
#         if row['pvalue_C_vs_error'] < p_threshold_base_error:
#             significant_bases += 1
#         if row['pvalue_G_vs_error'] < p_threshold_base_error:
#             significant_bases += 1
#         if row['pvalue_T_vs_error'] < p_threshold_base_error:
#             significant_bases += 1
#         cond2 = (significant_bases >= 2)

#         # Condition 3: The FDR of C was not lower than both G and T
#         # This means (FDR_C >= FDR_G) OR (FDR_C >= FDR_T)
#         cond3 = (row['FDR_C'] >= row['FDR_G']) or (row['FDR_C'] >= row['FDR_T'])

#         if cond1 and cond2 and cond3:
#             m1a_sites.append(row)

#     if m1a_sites:
#         pd.DataFrame(m1a_sites).to_csv(output_file, sep='\t', index=False)
#         print(f"Identified {len(m1a_sites)} m1A sites and saved to '{output_file}'.")
#     else:
#         print("No m1A sites identified based on the given criteria.")
#         pd.DataFrame(columns=df.columns).to_csv(output_file, sep='\t', index=False)

# if __name__ == "__main__":
#     if len(sys.argv) != 3:
#         print("Usage: python m1a_site_caller.py <input_site_metrics.tsv> <output_m1a_sites.tsv>", file=sys.stderr)
#         sys.exit(1)
    
#     input_tsv = sys.argv[1]
#     output_tsv = sys.argv[2]
    
#     identify_m1a_sites(input_tsv, output_tsv)

# Assuming the Python script 'm1a_site_caller.py' is available in the current directory
# and an input file 'site_metrics.tsv' contains the necessary pre-calculated metrics.
# Replace 'site_metrics.tsv' with your actual input file and 'm1A_sites.tsv' with your desired output file.

# Example execution command:
python m1a_site_caller.py site_metrics.tsv m1A_sites.tsv

Pseudouridine analysis pipeline: BS and NBS samples were aligned to the transcriptome indexes described above.

STAR (Inferred with models/gemini-2.5-flash) v2.7.10a GitHub

$ Bash example

# Install STAR (example using conda)
# conda install -c bioconda star

# Define variables
# Replace 'path/to/STAR_transcriptome_index' with the actual path to your STAR index directory.
# This index should have been generated from the transcriptome FASTA file(s).
INDEX_DIR="path/to/STAR_transcriptome_index"

# Replace 'BS_sample_R1.fastq.gz' and 'NBS_sample_R1.fastq.gz' with your actual input FASTQ files.
# Assuming single-end reads, which is common for some modification sequencing assays.
BS_SAMPLE_FASTQ="BS_sample_R1.fastq.gz"
NBS_SAMPLE_FASTQ="NBS_sample_R1.fastq.gz"

OUTPUT_DIR="./aligned_reads"

# Create output directory if it doesn't exist
mkdir -p "${OUTPUT_DIR}"

# Align BS sample reads to the transcriptome index
# --genomeDir: Path to the STAR genome/transcriptome index.
# --readFilesIn: Input FASTQ file(s).
# --outFileNamePrefix: Prefix for all output files.
# --outSAMtype BAM SortedByCoordinate: Output BAM file, sorted by coordinate.
# --runThreadN: Number of threads to use.
STAR \
  --genomeDir "${INDEX_DIR}" \
  --readFilesIn "${BS_SAMPLE_FASTQ}" \
  --outFileNamePrefix "${OUTPUT_DIR}/BS_sample_" \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN 8

# Align NBS sample reads to the transcriptome index
STAR \
  --genomeDir "${INDEX_DIR}" \
  --readFilesIn "${NBS_SAMPLE_FASTQ}" \
  --outFileNamePrefix "${OUTPUT_DIR}/NBS_sample_" \
  --outSAMtype BAM SortedByCoordinate \
  --runThreadN 8

View on GitHub

Pseudouridine positions were called using ScorePsuedouridinePositions (https://github.com/HuntsmanCancerInstitute/RBSSeqTools).

ScorePsuedouridinePositions v(Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Clone the RBSSeqTools repository if not already present
# git clone https://github.com/HuntsmanCancerInstitute/RBSSeqTools.git
# cd RBSSeqTools

# Define input and output files (replace with actual paths)
INPUT_BAM="path/to/your/input.bam" # e.g., alignment file from RNA-seq
PSEUDOURIDINE_SITES_BED="path/to/your/pseudouridine_sites.bed" # BED file defining pseudouridine sites to score
GENOME_FASTA="path/to/your/genome.fa" # Reference genome FASTA file (e.g., hg38.fa)
OUTPUT_PREFIX="pseudouridine_scores"

# Execute the ScorePsuedouridinePositions script
# This script calculates scores for pseudouridine positions based on read coverage and modifications.
python RBSSeqTools/ScorePsuedouridinePositions.py \
    -b "${INPUT_BAM}" \
    -p "${PSEUDOURIDINE_SITES_BED}" \
    -g "${GENOME_FASTA}" \
    -o "${OUTPUT_PREFIX}"

View on GitHub

This application iterates over a bam file, tracking the number of deletions and overall coverage at each genomic position.

samtools (Inferred with models/gemini-2.5-flash) v1.19 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Install samtools if not already installed
# conda install -c bioconda samtools

# Define input and output files
INPUT_BAM="input.bam"
REFERENCE_FASTA="reference.fa" # Placeholder: Replace with actual reference genome path (e.g., GRCh38/hg38)

# Ensure reference genome is indexed (if not already)
# samtools faidx "${REFERENCE_FASTA}"

# 1. Track overall coverage at each genomic position
# Output format: <chrom> <pos> <depth>
samtools depth -a "${INPUT_BAM}" > "${INPUT_BAM%.bam}_coverage.tsv"

# 2. Generate pileup information which contains deletion data at each genomic position
# The pileup output's 'read_bases' column contains indel information (e.g., -[length][sequence])
# that can be parsed to count deletions.
# This command generates the raw data from which deletion counts can be derived.
# Note: Extracting precise deletion counts per position from the pileup format
# typically requires a dedicated parsing script (e.g., in Python or Perl)
# due to the complex encoding of indels.
samtools mpileup -f "${REFERENCE_FASTA}" "${INPUT_BAM}" > "${INPUT_BAM%.bam}_pileup.tsv"

View on GitHub

Positions are skipped if 1) BS sample < 5 deletions, 2) BS sample fraction deletion < 0.01, 3) BS sample coverage < 10 reads, 4) NBS sample fraction deletion > 0.01 or NBS sample coverage < 10 reads.

bcftools (Inferred with models/gemini-2.5-flash) v1.18

$ Bash example

# Install bcftools if not already installed
# conda install -c bioconda bcftools=1.18

# Define input and output VCF files
INPUT_VCF="input.vcf" # Placeholder for the input VCF file
OUTPUT_VCF="filtered_deletions.vcf.gz" # Placeholder for the output VCF file (using .gz for compressed output)

# Note: This command assumes that the input VCF file has two samples,
# where the first sample (index 0) corresponds to the 'BS sample'
# and the second sample (index 1) corresponds to the 'NBS sample'.
# It also assumes that the VCF FORMAT fields 'NDEL' (Number of Deletions),
# 'FDEL' (Fraction Deletion), and 'DP' (Depth) are present and correctly populated
# for both samples. These fields are custom and would need to be added by an
# upstream variant annotation step.

# Filter positions based on the specified criteria
# A position is KEPT if it DOES NOT meet any of the skip conditions.
# Skip conditions (OR logic):
# 1) BS sample (index 0) NDEL < 5
# 2) BS sample (index 0) FDEL < 0.01
# 3) BS sample (index 0) DP < 10
# 4) NBS sample (index 1) FDEL > 0.01
# 5) NBS sample (index 1) DP < 10
#
# The filter expression below negates these conditions to select positions to KEEP:
# (BS_NDEL >= 5 AND BS_FDEL >= 0.01 AND BS_DP >= 10 AND NBS_FDEL <= 0.01 AND NBS_DP >= 10)
bcftools filter \
    --include 'FORMAT/NDEL[0]>=5 && FORMAT/FDEL[0]>=0.01 && FORMAT/DP[0]>=10 && FORMAT/FDEL[1]<=0.01 && FORMAT/DP[1]>=10' \
    "${INPUT_VCF}" \
    -o "${OUTPUT_VCF}" \
    -Oz # Output compressed VCF

# Reference genome: GRCh38 (inferred as no specific reference was mentioned)
# Note: This filtering step does not directly use a reference genome, but it's a common context for variant calling pipelines.

Adjacent positions are merged into a single deletion group on the initial analysis pass.

vt normalize (Inferred with models/gemini-2.5-flash) v0.577 GitHub

$ Bash example

# Install vt (if not already installed)
# conda install -c bioconda vt=0.577

# Placeholder for reference genome. Replace with the actual path to your reference FASTA file.
# For example, using GRCh38.p14 as a common latest assembly.
REFERENCE_GENOME="/path/to/GRCh38.p14.fa"

# Input VCF/BCF file (e.g., from variant calling)
INPUT_VCF="input.vcf.gz"

# Output normalized VCF/BCF file
OUTPUT_VCF="normalized.vcf.gz"

# Merge adjacent positions into a single deletion group by normalizing indels.
# vt normalize left-aligns and trims indels, ensuring a canonical representation.
# This process can consolidate complex or adjacent indel representations into a single, simpler form.
# -r: Reference genome FASTA file for indel normalization.
vt normalize -r "${REFERENCE_GENOME}" "${INPUT_VCF}" -o "${OUTPUT_VCF}"

View on GitHub

Deletion groups are pruned by removing positions with fraction deletions less than half the maximum observed faction deletion in the group.

Custom script (Inferred with models/gemini-2.5-flash) vN/A (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Define input and output VCF files
INPUT_VCF="input.vcf"
OUTPUT_VCF="filtered_deletions.vcf"

# Create a dummy input VCF for demonstration purposes.
# This VCF includes deletion variants at chr1:101 and chr1:200.
# - At chr1:101, Deletion 1 (AF=0.2) should be pruned because 0.2 < (0.5 * max(0.2, 0.9) = 0.45).
# - At chr1:200, Deletion 3 (AF=0.3) should be pruned because 0.3 < (0.5 * max(0.3, 0.7) = 0.35).
cat << 'EOF' > "${INPUT_VCF}"
##fileformat=VCFv4.2
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	SAMPLE1
chr1	100	.	T	TA	.	PASS	AF=0.8	GT:AD	0/1:10,40
chr1	101	.	TA	T	.	PASS	AF=0.2	GT:AD	0/1:40,10
chr1	101	.	TA	T	.	PASS	AF=0.9	GT:AD	0/1:10,90
chr1	102	.	C	G	.	PASS	AF=0.5	GT:AD	0/1:25,25
chr1	200	.	GAT	G	.	PASS	AF=0.3	GT:AD	0/1:35,15
chr1	200	.	GA	G	.	PASS	AF=0.7	GT:AD	0/1:15,35
EOF

# Define the custom Python script inline. This script implements the logic:
# "Deletion groups are pruned by removing positions with fraction deletions less than half the maximum observed faction deletion in the group."
# It groups deletions by (CHROM, POS), calculates the maximum allele frequency (AF) for deletions within each group,
# and then filters out deletions whose AF is less than half of that maximum.
cat << 'EOF' > prune_deletion_groups.py
#!/usr/bin/env python
import sys
from collections import defaultdict

def parse_vcf_line(line):
    if line.startswith('#'):
        return None
    parts = line.strip().split('\t')
    chrom, pos, _, ref, alt_str, qual, filt, info = parts[:8]
    pos = int(pos)
    alts = alt_str.split(',')
    
    is_deletion = False
    for alt in alts:
        if len(alt) < len(ref):
            is_deletion = True
            break

    af = 0.0
    if 'AF=' in info:
        try:
            # Assuming AF is a comma-separated list, take the first for simplicity
            af = float(info.split('AF=')[1].split(';')[0].split(',')[0])
        except (ValueError, IndexError):
            pass # AF not properly formatted or missing

    return {'chrom': chrom, 'pos': pos, 'ref': ref, 'alts': alts, 'af': af, 'line': line.strip(), 'is_deletion': is_deletion}

def main():
    if len(sys.argv) != 3:
        print("Usage: python prune_deletion_groups.py <input_vcf> <output_vcf>", file=sys.stderr)
        sys.exit(1)

    input_vcf_path = sys.argv[1]
    output_vcf_path = sys.argv[2]

    deletion_groups = defaultdict(list) # Key: (chrom, pos), Value: list of deletion records
    header_lines = []
    all_records_by_line = {} # Store all records by their original line for re-ordering

    with open(input_vcf_path, 'r') as infile:
        for line in infile:
            if line.startswith('##'):
                header_lines.append(line.strip())
                continue
            if line.startswith('#CHROM'):
                header_lines.append(line.strip())
                continue
            
            record = parse_vcf_line(line)
            if record:
                all_records_by_line[record['line']] = record
                if record['is_deletion']:
                    deletion_groups[(record['chrom'], record['pos'])].append(record)

    filtered_deletion_lines = set()

    for (chrom, pos), group_records in deletion_groups.items():
        if not group_records:
            continue

        max_fraction_deletion = 0.0
        for rec in group_records:
            max_fraction_deletion = max(max_fraction_deletion, rec['af'])

        threshold = 0.5 * max_fraction_deletion

        for rec in group_records:
            if rec['af'] >= threshold:
                filtered_deletion_lines.add(rec['line'])
    
    # Collect all lines to be written, maintaining original order as much as possible
    output_lines = []
    for line in header_lines:
        output_lines.append(line)

    # Iterate through original input lines to preserve order for non-deletions and filtered deletions
    with open(input_vcf_path, 'r') as infile:
        for line in infile:
            line_stripped = line.strip()
            if line_stripped in all_records_by_line:
                record = all_records_by_line[line_stripped]
                if not record['is_deletion']:
                    output_lines.append(line_stripped)
                elif record['is_deletion'] and line_stripped in filtered_deletion_lines:
                    output_lines.append(line_stripped)

    with open(output_vcf_path, 'w') as outfile:
        for line in output_lines:
            outfile.write(line + '\n')

if __name__ == "__main__":
    main()
EOF

# Make the script executable
chmod +x prune_deletion_groups.py

# Execute the custom script
./prune_deletion_groups.py "${INPUT_VCF}" "${OUTPUT_VCF}"

# Optional: Clean up dummy files
# rm "${INPUT_VCF}" prune_deletion_groups.py

View on GitHub

If pruning results in non-continuous regions, both are used in the remaining analysis.

bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# This script implements the policy: "If pruning results in non-continuous regions, both are used in the remaining analysis."
# This implies that any fragments (continuous or non-continuous) resulting from a pruning operation are retained.

# Define input and output files (placeholders)
INPUT_REGIONS="input.bed" # e.g., initial peaks or genomic regions
REGIONS_TO_REMOVE="regions_to_remove.bed" # e.g., blacklist regions, low-confidence areas
PRUNED_OUTPUT_REGIONS="pruned_regions_for_analysis.bed"

# Reference genome (placeholder, adjust as needed for specific assay)
GENOME_ASSEMBLY="hg38" # Using a common latest assembly as a placeholder

# --- Installation (commented out) ---
# conda install -c bioconda bedtools

# --- Simulate the 'pruning' operation using bedtools subtract ---
# This step removes regions_to_remove from input_regions.
# The output file (PRUNED_OUTPUT_REGIONS) will contain the remaining fragments.
# If an input region is split by a removed region, it will result in non-continuous fragments.
# According to the policy, these non-continuous fragments are NOT discarded.
bedtools subtract -a "${INPUT_REGIONS}" -b "${REGIONS_TO_REMOVE}" > "${PRUNED_OUTPUT_REGIONS}"

# --- Policy Implementation ---
# The description states that if pruning results in non-continuous regions, "both are used".
# This means that the output of the 'bedtools subtract' command, which can contain
# both continuous and non-continuous fragments, is directly passed to the next analysis step.
# No further filtering based on continuity is performed here.

echo "Pruning operation completed. All resulting regions (continuous or non-continuous) are retained."
echo "Retained regions for further analysis are in: ${PRUNED_OUTPUT_REGIONS}"

# Example of how the output might be used in a subsequent step:
# next_analysis_tool --input_regions "${PRUNED_OUTPUT_REGIONS}" --genome "${GENOME_ASSEMBLY}"

View on GitHub

Deletion groups are then scanned for potential pseudouridine locations using the following steps: 1) T at the position with highest fraction deletion (FirstDelBase) 2) T in the deletion group (AltPositionBase) 3) T in a homopolymer region (considering bisulfite treatment) with deletion at the leftmost position of the homopolymer (LeftAlignBase) or 4) nearest T within 25 bp of deletion (Neighborhood).

Pseudouridine Deletion Scanner (Inferred with models/gemini-2.5-flash) v1.0 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# This step describes a custom algorithm for identifying pseudouridine locations
# based on deletion patterns in sequencing data, likely after bisulfite treatment.
# No specific tool is named in the description, so the command below is conceptual
# and represents the execution of a hypothetical script or program implementing
# the described logic.

# Input: A file containing processed deletion information (e.g., from aligned RNA-seq data).
# This file would typically contain coordinates, deletion counts, and possibly base identities.
INPUT_DELETION_DATA="processed_deletion_data.tsv" # Placeholder for input data
OUTPUT_PSEUDOURIDINE_SITES="potential_pseudouridine_sites.bed" # Placeholder for output

# Reference genome (e.g., for identifying homopolymer regions).
# Using hg38 as a common human reference genome placeholder.
REFERENCE_GENOME_FASTA="/path/to/human_genome/hg38.fa"

# Conceptual command to execute the pseudouridine deletion scanning.
# The parameters are illustrative, reflecting the rules described:
# 1) T at the position with highest fraction deletion (FirstDelBase)
# 2) T in the deletion group (AltPositionBase)
# 3) T in a homopolymer region (considering bisulfite treatment) with deletion at the leftmost position (LeftAlignBase)
# 4) nearest T within 25 bp of deletion (Neighborhood)
# The actual implementation would be a custom script or a specialized tool.

# Example installation (conceptual, as no specific tool is named)
# # conda create -n pseudouridine_env python=3.9
# # conda activate pseudouridine_env
# # pip install pandas biopython # if a custom script uses these libraries

# This command represents the execution of the described pseudouridine detection logic.
# The parameters are inferred from the description's rules.
run_pseudouridine_deletion_scanner.sh \
  --input_deletions "$INPUT_DELETION_DATA" \
  --output_sites "$OUTPUT_PSEUDOURIDINE_SITES" \
  --genome_fasta "$REFERENCE_GENOME_FASTA" \
  --enable_first_del_base_check \
  --enable_alt_position_base_check \
  --enable_left_align_base_check \
  --neighborhood_distance 25 \
  --bisulfite_mode true

View on GitHub

If multiple deletion groups share the same predicted pseudouridine position, only the group with the highest fraction deletion is used.

Custom Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This step filters predicted pseudouridine positions.
# If multiple deletion groups share the same predicted pseudouridine position,
# only the group with the highest fraction deletion is retained.

# Input: A tab-separated file containing predicted pseudouridine positions,
#        including chromosome, position, strand, deletion group identifier,
#        and fraction deletion. (e.g., predicted_pseudouridines.tsv)
# Output: A filtered tab-separated file with unique pseudouridine positions
#         based on the highest fraction deletion. (e.g., filtered_pseudouridines.tsv)

# This command assumes a custom Python script 'filter_pseudouridine_deletions.py'
# is available in the PATH or current directory. The script is expected to take
# an input file and an output file as arguments.

python filter_pseudouridine_deletions.py \
  --input predicted_pseudouridines.tsv \
  --output filtered_pseudouridines.tsv

A p-value is calculated for each group using the binomial test comparing observed deletion rate to the expected deletion rate.

Python (scipy.stats.binomtest) (Inferred with models/gemini-2.5-flash) vscipy >= 1.7.0 GitHub

$ Bash example

# Install Python and scipy if not already available
# conda install -c conda-forge python scipy pandas

# Create a dummy input file for demonstration purposes.
# In a real pipeline, this file would be generated by a previous step
# and contain columns like 'group_id', 'observed_deletions', 'total_observations', 'expected_rate'.
echo -e "group_id\tobserved_deletions\ttotal_observations\texpected_rate" > deletion_data.tsv
echo -e "GroupA\t5\t100\t0.03" >> deletion_data.tsv
echo -e "GroupB\t12\t150\t0.05" >> deletion_data.tsv
echo -e "GroupC\t3\t80\t0.02" >> deletion_data.tsv

# Execute the Python script to calculate p-values using the binomial test
python -c "
import pandas as pd
from scipy.stats import binomtest

input_file = 'deletion_data.tsv'
output_file = 'deletion_pvalues.tsv'

# Read the input data
df = pd.read_csv(input_file, sep='\t')

p_values = []
# Iterate through each group to calculate the p-value
for index, row in df.iterrows():
    k = row['observed_deletions']
    n = row['total_observations']
    p_expected = row['expected_rate']

    # Perform the binomial test. By default, binomtest performs a two-sided test.
    # For a one-sided test (e.g., observed rate > expected rate), use alternative='greater'.
    result = binomtest(k, n, p_expected)
    p_values.append(result.pvalue)

# Add the calculated p-values to the DataFrame and save to an output file
df['p_value'] = p_values
df.to_csv(output_file, sep='\t', index=False)
print(f'P-values saved to {output_file}')
"

View on GitHub

The expected deletion rate is calculated using an error rate of 0.001.

Custom Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# The expected deletion rate is calculated using an error rate of 0.001.
# This value is typically used as a parameter for downstream analysis or simulation tools.
ERROR_RATE="0.001"
echo "Using an error rate of: $ERROR_RATE"
# The specific calculation or tool that derives the 'expected deletion rate'
# from this error rate is not specified in the description.
# This value might be directly used as an input for a model or simulation.

Deletion groups are flagged as false positives if: 1) NBS deletion p-value < 0.01, 2) deletion occurs within 4bp of an exon boundary 3) within 6bp or longer homopolymer or 4) within 5bp of a natural deletion observed in the NBS sample.

Custom Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# Define input and output files (placeholders)
INPUT_DELETIONS="input_deletion_calls.vcf" # Assuming VCF for variant calls
OUTPUT_FILTERED_DELETIONS="filtered_deletion_calls.vcf"
OUTPUT_FALSE_POSITIVES="false_positive_deletions.vcf"

# Define reference datasets (placeholders - use latest assembly as a placeholder if not specified)
# For human, GRCh38 (hg38) is a commonly used latest assembly.
GENOME_FASTA="path/to/reference/genome/GRCh38.fa"
GENE_ANNOTATION_BED="path/to/reference/annotations/GRCh38_exons.bed" # e.g., derived from Ensembl/GENCODE GTF
NBS_NATURAL_DELETIONS="path/to/reference/nbs_natural_deletions.vcf" # Custom dataset of known natural deletions in NBS samples

# Define filtering parameters based on the description
P_VALUE_THRESHOLD=0.01
EXON_BOUNDARY_DISTANCE=4
HOMOPOLYMER_LENGTH=6
NATURAL_DELETION_DISTANCE=5

# This step involves custom logic to flag deletion groups as false positives.
# It would typically be implemented in a custom script (e.g., Python, R, Perl).
# The script would read deletion calls, evaluate each against the defined criteria,
# and output filtered calls and flagged false positives.

# Conceptual Python script (filter_deletions.py) execution:
# This script would:
# 1. Parse input deletion calls (e.g., VCF records).
# 2. For each deletion:
#    a. Check if NBS deletion p-value < P_VALUE_THRESHOLD.
#    b. Check proximity to exon boundaries using GENE_ANNOTATION_BED.
#    c. Check if it overlaps with a homopolymer of length >= HOMOPOLYMER_LENGTH using GENOME_FASTA.
#    d. Check proximity to known natural deletions in NBS_NATURAL_DELETIONS.
# 3. If any of the conditions (a, b, c, or d) are met, flag as false positive.
# 4. Write true positives to OUTPUT_FILTERED_DELETIONS and false positives to OUTPUT_FALSE_POSITIVES.

# Example command for executing a hypothetical custom Python script:
python3 filter_deletions.py \
    --input "$INPUT_DELETIONS" \
    --output "$OUTPUT_FILTERED_DELETIONS" \
    --false-positives "$OUTPUT_FALSE_POSITIVES" \
    --p-value-threshold "$P_VALUE_THRESHOLD" \
    --exon-boundary-distance "$EXON_BOUNDARY_DISTANCE" \
    --homopolymer-length "$HOMOPOLYMER_LENGTH" \
    --natural-deletion-distance "$NATURAL_DELETION_DISTANCE" \
    --genome-fasta "$GENOME_FASTA" \
    --gene-annotation-bed "$GENE_ANNOTATION_BED" \
    --nbs-natural-deletions "$NBS_NATURAL_DELETIONS"

Deletion groups that pass these criteria are annotated if they are within a gene or repetitive element.

bedtools (Inferred with models/gemini-2.5-flash) v2.31.0 GitHub

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# --- Reference Data Preparation (example for hg38) ---
# These steps are illustrative for obtaining gene and repetitive element annotations.
# Replace with actual paths to your reference files.

# Download gene annotations (e.g., RefSeq genes from UCSC Table Browser)
# wget -O refGene.txt.gz "http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz"
# gunzip refGene.txt.gz
# awk 'BEGIN{OFS="\t"}{print $3, $5, $6, $2}' refGene.txt | sort -k1,1 -k2,2n > hg38.genes.bed

# Download repetitive element annotations (e.g., RepeatMasker from UCSC Table Browser)
# wget -O rmsk.txt.gz "http://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/rmsk.txt.gz"
# gunzip rmsk.txt.gz
# awk 'BEGIN{OFS="\t"}{print $6, $7, $8, $11, $12}' rmsk.txt | sort -k1,1 -k2,2n > hg38.repeats.bed

# Combine gene and repetitive element annotations into a single sorted BED file.
# This creates a comprehensive annotation file for genomic features.
# cat hg38.genes.bed hg38.repeats.bed | sort -k1,1 -k2,2n > hg38.gene_repeat_elements.bed
# --- End Reference Data Preparation ---

# Input file: deletion_groups.bed (assumed to be a BED file with deletion coordinates)
# Reference annotation file: hg38.gene_repeat_elements.bed (a BED file containing combined gene and repetitive element annotations)
# Output file: annotated_deletion_groups.bed (input file with an additional column for overlap count)

# Annotate deletion groups by adding a column indicating if they overlap with any gene or repetitive element.
# The new column will contain the count of overlapping features from hg38.gene_repeat_elements.bed.
# A count > 0 indicates that the deletion group is "within" a gene or repetitive element.
bedtools map -a deletion_groups.bed -b hg38.gene_repeat_elements.bed -c 1 -o count > annotated_deletion_groups.bed

View on GitHub

Finally, FDR values are calculated using the Benjamini Hochberg procedure.

R (Inferred with models/gemini-2.5-flash) vlatest stable

$ Bash example

# Assume 'p_values.txt' is a tab-separated file with at least a 'p_value' column.
# For demonstration, we create a dummy file.

# Install R if not already installed
# conda install -c r r-base

# Create a dummy p_values.txt for demonstration purposes
echo "gene_id\tp_value" > p_values.txt
echo "geneA\t0.001" >> p_values.txt
echo "geneB\t0.01" >> p_values.txt
echo "geneC\t0.05" >> p_values.txt
echo "geneD\t0.1" >> p_values.txt
echo "geneE\t0.0005" >> p_values.txt

# R script to calculate Benjamini-Hochberg False Discovery Rate (FDR)
Rscript -e '
  # Read the input file containing p-values
  data <- read.table("p_values.txt", header=TRUE, sep="\t")
  
  # Calculate FDR using Benjamini-Hochberg procedure
  # The "BH" method corresponds to Benjamini-Hochberg
  data$fdr <- p.adjust(data$p_value, method="BH")
  
  # Write the output with original data and new FDR column
  write.table(data, "output_fdr.txt", sep="\t", quote=FALSE, row.names=FALSE)
'

Positions were further filtered manually by removing sites that were classified as ÃNeighborhoodÃ when predicting the true pseudouridine position or had a high fraction of insertions at deletion site.

Custom Script (Inferred with models/gemini-2.5-flash) vN/A

$ Bash example

# This step describes manual filtering based on specific criteria.
# Assuming an input file 'pseudouridine_predictions.tsv' that contains predicted pseudouridine positions
# with columns including 'classification' and 'insertion_fraction_at_deletion_site'.
# For this example, let's assume 'classification' is the 5th column and 'insertion_fraction_at_deletion_site' is the 6th column.
# A 'high' fraction of insertions is arbitrarily defined as > 0.2 for this example.

# Filter out sites classified as 'Neighborhood' OR sites with an insertion fraction > 0.2.
# This awk command keeps only sites where the classification is NOT 'Neighborhood' AND the insertion fraction is <= 0.2.
awk -F'\t' '($5 != "Neighborhood" && $6 <= 0.2)' pseudouridine_predictions.tsv > filtered_pseudouridine_positions.tsv

Reads were aligned and processed using the pipeline outlined above, except the Novoalign bisuflite alignment mode was set to 2 and different adapters were used (AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT).

Novoalign v4.04.00 (Inferred with models/gemini-2.5-flash) GitHub

$ Bash example

# Novoalign is commercial software. Installation typically involves downloading and licensing from Novocraft.
# Example:
# wget http://www.novocraft.com/downloads/novoalign/novoalignV4.04.00.zip
# unzip novoalignV4.04.00.zip
# cd novoalignV4.04.00
# export PATH=$PWD:$PATH

# Reference genome (e.g., hg38) and index preparation
# Download hg38 reference genome:
# wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
# gunzip hg38.fa.gz
# Build Novoalign index (this can take a long time and requires significant memory):
# novoindex -k 14 -s 2 hg38.nix hg38.fa

# Placeholder for input FASTQ files
READ1="input_R1.fastq.gz"
READ2="input_R2.fastq.gz"
OUTPUT_BAM="output.bam"
REFERENCE_INDEX="hg38.nix" # Path to your Novoalign index file

# Adapters specified in the description
ADAPTER1="AGATCGGAAGAGCACACGTCTGAACTCCAGTCA"
ADAPTER2="AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT"

# Number of threads to use
THREADS=8

novoalign -d "${REFERENCE_INDEX}" \
          -f "${READ1}" "${READ2}" \
          -o BAM \
          -b 2 \
          -a "${ADAPTER1}" "${ADAPTER2}" \
          -S 1 \
          -r All \
          -k \
          -t "${THREADS}" \
          -Q 30 \
          -R '@RG\tID:sample\tSM:sample\tLB:library\tPL:ILLUMINA' \
          > "${OUTPUT_BAM}"

View on GitHub

Validation calls were intersected with the RBS-Seq calls using the predicted pseudouridine position.

bedtools (Inferred with models/gemini-2.5-flash) v2.30.0 GitHub

$ Bash example

# Install bedtools if not already installed
# conda install -c bioconda bedtools

# Assuming 'validation_calls.bed' and 'rbs_seq_calls.bed' are the input files
# and contain predicted pseudouridine positions as genomic intervals.
# The output will contain the overlapping regions from validation_calls.bed that intersect with rbs_seq_calls.bed.
bedtools intersect -a validation_calls.bed -b rbs_seq_calls.bed > intersected_pseudouridine_positions.bed

View on GitHub

Raw Source Text

Reference genome: Novoindex (2.8) (http://www.novocraft.com) was used to create a standard and bisulfite reference index on a combination of hg19 chromosome, scaffold and splice junction sequences. Splice junction sequences were generated with USeq (v8.8.8) 1 MakeTranscriptome using Ensembl transcript annotations (build 74).
Alignment parameters: BS and NBS samples were aligned to the transcriptome indexes described above using Novoalign (v2.08.01) (http://www.novocraft.com) allowing up to 50 alignments and trimming adaptor sequences (TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC GATCGTCGGACTGTAGAACTCTGAAC).
Alignment processing : The alignment files were processed with USeq SamTranscriptomeParser, which reports the best alignment location for each read in bam format and converts the coordinates of splice junction alignments to genomic space. One alignment location is selected at random if a read aligned to multiple locations in the genome with equal confidence.
Alignments were reordered using Picard (v1.87) (http://picard.sourceforge.net) ReorderBam and all alignments were set as primary using the set_sam_primary.py (https://github.com/HuntsmanCancerInstitute/RBSSeqTools). The alignments were then processed with GATK (v3.4-0) 2 SplitNCigarReads, which splits spliced alignments into multiple alignment segments without N cigar elements. Reads containing deletions four base pairs or longer were removed using remove_deletions.py (https://github.com/HuntsmanCancerInstitute/RBSSeqTools).
m5C analysis pipeline: BS and NBS sample reads were aligned and processed as outlined above. Alignments were further processed using CreateMethTable (https://github.com/HuntsmanCancerInstitute/RBSSeqTools), which iterates over the alignment file and reports forward alignment base counts at all reference ÃCÃ positions and in reverse alignment base counts at all reference ÃGÃ positions. A Fisher's exact test p-value was calculated for each position by comparing the observed count of unconverted and converted bases to the expected count, which was calculated assuming an error rate of 0.001. The table was filtered to create a final list of candidates by selecting only those sites which had a read depth Â³10 in both BS and NBS datasets, a non-conversion rate of Â³20% in the BS dataset and a Fisher exact p-value Â²0.05. These sites were further screened manually for evidence of mis-mapping due to loss of complexity, or nonconversion arising from nondenatured secondary structure.
m1A analysis pipeline: BS and NBS sample reads were aligned and processed as listed above. CreateMethTable was used to report forward alignment base counts at all reference ÃAÃ positions with at least 100x coverage and in reverse alignment base counts at all reference ÃTÃ positions with at least 100x coverage. A one-tailed Fisher's exact test, corrected for multiple comparisons using the Benjamini-Hochberg (BH) procedure, was used to test if the mismatch rate in the NBS sample was greater than the BS sample. One-tailed binomial tests, corrected using BH, were used to test if the number of C, G or T bases were greater than expected given error rate (0.001). Positions where 1) the mismatch rate of the NBS sample was significantly higher than the BS sample, 2) two of C, G or T were significantly higher than the error rate and 3) the FDR or C was not lower than both G and T were reported as m1A sites.
Pseudouridine analysis pipeline: BS and NBS samples were aligned to the transcriptome indexes described above. Pseudouridine positions were called using ScorePsuedouridinePositions (https://github.com/HuntsmanCancerInstitute/RBSSeqTools). This application iterates over a bam file, tracking the number of deletions and overall coverage at each genomic position. Positions are skipped if 1) BS sample < 5 deletions, 2) BS sample fraction deletion < 0.01, 3) BS sample coverage < 10 reads, 4) NBS sample fraction deletion > 0.01 or NBS sample coverage < 10 reads. Adjacent positions are merged into a single deletion group on the initial analysis pass. Deletion groups are pruned by removing positions with fraction deletions less than half the maximum observed faction deletion in the group. If pruning results in non-continuous regions, both are used in the remaining analysis. Deletion groups are then scanned for potential pseudouridine locations using the following steps: 1) T at the position with highest fraction deletion (FirstDelBase) 2) T in the deletion group (AltPositionBase) 3) T in a homopolymer region (considering bisulfite treatment) with deletion at the leftmost position of the homopolymer (LeftAlignBase) or 4) nearest T within 25 bp of deletion (Neighborhood). If multiple deletion groups share the same predicted pseudouridine position, only the group with the highest fraction deletion is used. A p-value is calculated for each group using the binomial test comparing observed deletion rate to the expected deletion rate. The expected deletion rate is calculated using an error rate of 0.001. Deletion groups are flagged as false positives if: 1) NBS deletion p-value < 0.01, 2) deletion occurs within 4bp of an exon boundary 3) within 6bp or longer homopolymer or 4) within 5bp of a natural deletion observed in the NBS sample. Deletion groups that pass these criteria are annotated if they are within a gene or repetitive element. Finally, FDR values are calculated using the Benjamini Hochberg procedure. Positions were further filtered manually by removing sites that were classified as ÃNeighborhoodÃ when predicting the true pseudouridine position or had a high fraction of insertions at deletion site.
Reads were aligned and processed using the pipeline outlined above, except the Novoalign bisuflite alignment mode was set to 2 and different adapters were used (AGATCGGAAGAGCACACGTCTGAACTCCAGTCA AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT). Validation calls were intersected with the RBS-Seq calls using the predicted pseudouridine position.
Genome_build: hg19
Supplementary_files_format_and_content: Excel files contaning m1A, m5C and pseudouridine candidtate sites. Column descriptions can be found on the first tab of the spreadsheet

← Back to Analysis