Abstract
The high transmission and mutation rates of viruses call for efficient, precise classification methods to support public health responses. Traditional diagnostic techniques, often constrained by culturing requirements and narrow taxonomic scope, struggle to keep pace with the rapidly expanding genomic landscape. This study introduces a multi-virus genome sequence classification approach that uses a k-mer-based dictionary for rapid extraction of genomic patterns, combined with natural language processing (NLP) and machine learning (ML) models. Our contributions are threefold. First, we introduce a k-mer dictionary approach that improves the speed and accuracy of virus classification by capturing essential genomic patterns. Second, we implement a suite of optimized classifiers (Multinomial Naive Bayes, Random Forest, K-Nearest Neighbor, and nu-SVM) that achieve an accuracy of 0.95, a precision of 0.96, a specificity of 0.94, and a sensitivity of 0.99. Third, we provide a comparative analysis against deep learning architectures, such as CNN, CNN-LSTM, and CNN-Bidirectional LSTM, in which our method demonstrates superior performance. These findings validate the integration of ML and NLP techniques in viral genome classification, set new benchmarks in bioinformatics, and offer a scalable solution for rapid virus detection and identification.
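To make the k-mer dictionary idea concrete, the following is a minimal sketch of how overlapping k-mers might be extracted from nucleotide sequences and turned into count vectors of the kind a classifier such as Multinomial Naive Bayes consumes. The abstract does not state the value of k or how the dictionary is built, so the choice k = 3, the toy sequences, and the function names `extract_kmers`, `build_kmer_dictionary`, and `count_vector` are all illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter


def extract_kmers(sequence, k=3):
    """Return all overlapping k-mers of a nucleotide sequence, in order."""
    seq = sequence.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_kmer_dictionary(sequences, k=3):
    """Map each distinct k-mer seen in the corpus to a column index."""
    vocab = sorted({kmer for s in sequences for kmer in extract_kmers(s, k)})
    return {kmer: idx for idx, kmer in enumerate(vocab)}


def count_vector(sequence, dictionary, k=3):
    """Build a count vector for one sequence; unknown k-mers are skipped."""
    counts = Counter(extract_kmers(sequence, k))
    vec = [0] * len(dictionary)
    for kmer, n in counts.items():
        if kmer in dictionary:
            vec[dictionary[kmer]] = n
    return vec


# Toy sequences (hypothetical, not real viral genomes).
toy_genomes = ["ATGCGATG", "ATGCCC"]
kmer_dict = build_kmer_dictionary(toy_genomes, k=3)
vectors = [count_vector(g, kmer_dict, k=3) for g in toy_genomes]
```

Treating each k-mer as a "word" is what lets bag-of-words style NLP tooling apply to genome sequences: the resulting non-negative count vectors can be passed to any classifier that accepts such features.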
