Abstract

The United States’ current list-based approach to biodefense is limited because it considers only known biological agents. Alternatively, developing and adopting a system based on agent-agnostic signatures would enable detection and characterization of both known and novel agents, thereby engendering greater adaptability in the face of an evolving threat landscape. Machine learning (ML) could aid in such a transition, as it can recognize and encode highly complex patterns from multiple input data modalities and has already demonstrated success in many healthcare and defense applications. Functionalizing ML for environmental biodetection requires understanding current technical capabilities. In this article, we provide a systematic review of existing ML platforms and discuss anticipated development efforts needed to achieve effective ML-enabled, agnostic biodetection.

Introduction

The COVID-19 pandemic highlighted the uncertainty and misinformation that novel biological threats can engender, underscoring the need for “agnostic” analyses capable of detecting both known and unknown biological agents, regardless of their identity, origin (natural vs engineered), or sequencing history. Even prepandemic, recognition of limitations to the current list-based approach and the importance of agent-agnostic biodetection was growing. A 2018 National Academies of Sciences, Engineering, and Medicine report stated that “an overreliance on the Select Agent List is a systematic weakness.”¹ Recently, researchers proposed adopting “bioagent-agnostic signatures” to detect and characterize existing and novel agents, an approach believed to “enable a more flexible and resilient biodefense posture.”² The idea has gained traction, with others suggesting steps such as designing host-based screening strategies and filling gaps in data acquisition and handling.³

Machine learning (ML) systems could enable real-time, multiscale, multidimensional agnostic assessment of a perturbation’s nature and source. ML models can recognize and encode highly complex patterns from multiple data modalities, including images, text, and biological/chemical/physical spectra, and can execute assessments that traditionally require human operation. In healthcare, ML models aid diagnosis of clinical threats, including infection, cancer, and strokes,^4-6 though larger, more diverse training datasets and strategies for handling missing data are needed to improve reliability and efficacy. Moreover, ML advances could support more efficient raw data analysis. Recently, large language models have been deployed for biological function prediction.^7-9

Interest in environmental biosurveillance is increasing across application spaces and sampling media, including wastewater,^10,11 drinking water,¹² and aerosols.^13-15 Expansion of antibiotic resistance has amplified interest in surveillance of nonclinical settings.¹⁶ While ML is used broadly in the clinical sciences, its role in environmental biodetection is currently limited. Functionalizing ML-based strategies for environmental detection of potential biological threats requires an understanding of the existing technical landscape. In this review, we systematically evaluate existing ML platforms applied in clinical and nonclinical contexts, focusing on workflows with potential for biosurveillance in nonclinical environments surrounding human populations, and we discuss development efforts needed for effective ML-enabled, agnostic biodetection.

Platform Assessments

Pathogens are organisms that cause disease in their host.¹⁷ Existing environmental biodetection systems rely largely on defined pathogens’ nucleic acid-based signatures via quantitative polymerase chain reaction (qPCR) or next-generation sequencing. Methods for profiling pathogens’ physicochemical properties, including mass spectrometry and Raman spectroscopy, have also been evaluated. Adopting an agent-agnostic paradigm requires less reliance on static, predefined signatures. We hypothesize that features in raw datasets could distinguish health or environmental threats and that analyzing data at its source could promote “reference-independent” systems—self-contained platforms not requiring external databases. While such platforms would still require existing datasets for model training, inference could be performed without access to these datasets/references. Further, if generalizable features for pathogenic function are identified in raw signal, their discriminatory capacity may demonstrate greater extensibility to novel threats.

Direct Processing of Raw DNA Signal

Raw instrument output obtained prior to DNA base assignment could contain unique optical, pH/chemical, or electrical signals and facilitate direct aerosol biodetection.^18,19 During nanopore sequencing, for example, changes in electrical current are measured as nucleotides pass through the nanopore and are then assigned to bases. The raw electrical signal (“squiggle”) carries information beyond the base sequence.²⁰ Biological sample classification from squiggles has been explored using deep learning (DL),^20,21 probabilistic models,²² and reference conversion.²³ A convolutional neural network (CNN) called SquiggleNet, developed using multiple human and bacterial reference DNA datasets and 4,500 electrical signals from 2 million sequence reads,²⁰ achieved 75% to 95% accuracy when predicting whether DNA sequences were human or microbial. Nanopore sequencing instruments have a small footprint and could be a useful, rapid diagnostic tool.²⁴

Other platforms include alternative deep neural networks (DNNs) for selective sequencing applications^21,25; lightweight platforms for rapid analysis²⁶; and platforms leveraging graphics processing unit (GPU)-based computing architecture,^27,28 gene-level assessments,²⁹ and algorithm-architecture codesign.³⁰ SquiggleNet and other nascent models demonstrate proof-of-concept for ML-based detection from raw output (Figure 1).

Figure 1.

Schematic of the integration of nanopore sequencing and ML, including some potential applications. A sample of DNA is supplied to the nanopore sequencer, which outputs an electrical “squiggle” based on the underlying sequence of nucleotides. ML models trained on such squiggles could be used to detect artificial nucleotides or epigenetic modifications or to classify the sample’s taxonomy. Abbreviation: ML, machine learning.

Analysis speed is a potential advantage of using raw DNA-derived data, as human vs bacterial ML classifiers require only about 1 second of sequencing data and limited memory. These systems could theoretically detect artificial nucleotides and nonencoded properties, like DNA methylation, using random forest and support vector machine models via multiple instance learning.³¹ Thanks to the “read-until” ability³² of nanopore sequencing, such platforms already serve as a preprocessing step for eliminating nontarget (eg, human) DNA from environmental samples.³³
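The read-until preprocessing step can be sketched as follows. This is a minimal illustration, not the logic of any cited platform: the median-based normalization, the function names, the toy classifier, and the current values are all assumptions made for the example.

```python
from statistics import median

def mad_normalize(signal):
    """Center a signal by its median and scale by its median absolute deviation."""
    center = median(signal)
    spread = median(abs(x - center) for x in signal)
    if spread == 0:
        raise ValueError("signal has no variation to normalize")
    return [(x - center) / spread for x in signal]

def read_until_decision(chunk, looks_nontarget):
    """Return 'eject' to reject the read early, or 'continue' to keep sequencing."""
    return "eject" if looks_nontarget(mad_normalize(chunk)) else "continue"

# Hypothetical stand-in for a trained classifier: flags chunks with a large excursion.
def toy_classifier(normalized):
    return max(abs(x) for x in normalized) > 3.0

first_chunk = [80.0, 82.0, 81.0, 79.0, 120.0, 80.5, 81.5]  # made-up current values
decision = read_until_decision(first_chunk, toy_classifier)
print(decision)
```

In a real workflow, a model trained on squiggles would replace `toy_classifier`, and the decision would be sent to the sequencer while the molecule is still in the pore.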

These platforms’ long-term biodetection utility depends on generating test and training datasets for feasibility testing. Future evaluation could use simulated data³⁴ to assess whether raw data enables categorization of microbes into functionally useful bins.

ML for Gene Virulence Prediction

Determining the taxonomic identity and function of genetic sequences is critical for biodetection. SeqScreen³⁵ characterizes short gene or protein sequences by predicting taxonomic and functional labels. SeqScreen’s developers created 32 custom functions of sequences of concern (FunSoCs) to describe microbial pathogenesis functions encoded in viral and bacterial sequences.³⁶ SeqScreen’s pipeline takes input sequences of at least 50 nucleotides, aligns them to UniProt entries,³⁷ and assigns FunSoC labels through ensemble learning, after representing each sequence as a high-dimensional feature vector (Figures 2 and 3A). A feature vector, depicted in Figure 2, is an ordered list of numbers that describes an object’s real-world properties. Feature vectors often serve as input to ML models, algorithms that are trained to identify patterns in existing data and make predictions for unseen data.
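As a minimal illustration of the feature-vector idea, a protein sequence can be encoded as the fraction of each amino acid it contains. This is not SeqScreen’s featurization, which is not detailed here; the composition encoding, the function name, and the example sequence are assumptions chosen for simplicity.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector is comparable.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_vector(sequence):
    """Return the fraction of each standard amino acid in `sequence`."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]

vector = composition_vector("MKTAYIAKQR")  # made-up 10-residue example
print(len(vector))                          # one entry per amino acid
print(vector[AMINO_ACIDS.index("K")])       # fraction of lysine
```

Sequences of any length map to a vector of the same size, which is what allows them to be passed to a single downstream model.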

Figure 2.

Schematic of a generalized ML workflow and its main components: data representations, feature vectors, models, and embeddings. Biological data can take many forms, including as a 1D word or sequence, a 2D image or spectrum, or a 3D object or structure. Regardless of its dimensionality, input data can be re-represented as a feature vector, an ordered list of numbers that describes an object’s real-world properties. The different types of numbers shown in the feature vectors in the figure are meant to illustrate that there are multiple ways of encoding the same data. In this schematic and those following, numeric values in feature vectors and embeddings are randomly assigned. Abbreviation: ML, machine learning.

Figure 3.

(A) Schematic of the SeqScreen pipeline.³⁵ The pipeline takes a sample gene or protein sequence as input, aligns it with reference sequences in UniProt, produces a feature vector of the sequence, and predicts with which microbial pathogenesis functions it might be associated. (B) Schematic of the FUTUSA pipeline.³⁹ The pipeline is designed to take an unknown, segmented protein sequence and, using a convolutional neural network, predict the functional impact of each amino acid. By substituting point mutations into the input sequence, FUTUSA could be used to predict the functional impacts of sequence variations. Abbreviations: FunSoCs, functions of sequences of concern; FUTUSA, function teller using sequence alone; ML, machine learning.

Three ML models, selected because they handle class imbalances (negative samples outnumber positive samples), participate in majority voting to decide labels. SeqScreen training data (98,283 samples) included manually curated virulence-positive sequences, labeled with FunSoCs, and virulence-negative sequences from SwissProt.³⁸ The majority vote classifier outperformed individual classifiers, achieving high precision (0.90) and recall (0.82) values, with 1.0 indicating full precision or recall.
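Majority voting and the precision and recall metrics can be sketched in a few lines. The labels and per-model predictions below are invented for illustration and do not reproduce SeqScreen’s classifiers or data.

```python
def majority_vote(predictions):
    """Combine binary labels from an odd number of classifiers by majority."""
    n_models = len(predictions)
    return [1 if sum(votes) * 2 > n_models else 0 for votes in zip(*predictions)]

def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical labels (1 = virulence-positive) and three models' predictions.
y_true  = [1, 1, 0, 0, 0, 0, 1, 0]
model_a = [1, 0, 0, 0, 1, 0, 1, 0]
model_b = [1, 1, 0, 0, 0, 0, 0, 0]
model_c = [0, 1, 0, 1, 1, 0, 1, 0]

ensemble = majority_vote([model_a, model_b, model_c])
precision, recall = precision_recall(y_true, ensemble)
print(ensemble, precision, recall)
```

In this toy case each model misses at least one positive sample, while the vote recovers all three at the cost of one false positive.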

Pathogens from different taxonomic kingdoms sometimes use similar infection mechanisms. ML approaches that predict virulence could aid taxonomic-agnostic pathogen detection. However, SeqScreen’s taxonomic gene classifiers generated more false positives when specific pathogen data were removed from the reference database. Moreover, because many nonpathogens contain threat-associated genes and SeqScreen assigns threat scores to individual sequence reads, an abundance of false positives can arise from nonthreat organisms. Further refinement is needed for practical application to real-world metagenomic samples. Applicability could be expanded by incorporating more fine-grained virulence types, including FunSoCs for fungal sequences, training on multiorganism samples, and reducing reliance on sequence databases, perhaps by adapting current models to make predictions directly from sequence data.

DL Embeddings for Prediction of Protein Characteristics and Function

Case Study: DL Classification of Function From Protein Sequence

Protein structural data often aids function prediction, but most proteins lack such data. Methods that predict protein function from sequence data alone would therefore be valuable for agnostic biodetection. The developers of FUTUSA (function teller using sequence alone)³⁹ evaluated function prediction by inputting whole and segmented protein sequences, finding that segmentation produced optimal results (Figure 3B). Using data on oxidoreductases, acetyltransferases, and demethylases, they tested how simulated point mutations affect function prediction. For segmented sequences, mutations had the largest impact on function prediction scores for segments overlapping important functional regions. Scores for unsegmented input sequences remained unchanged, even with mutations in critical regions. FUTUSA, based on segmented sequences of size 64 amino acids, demonstrated improved performance over Protein Basic Local Alignment Search Tool (BLASTP) in all categories for at least 1 of 5 assessed metrics.
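The localization effect of segmentation can be illustrated with a short sketch. It assumes nonoverlapping 64-residue windows; FUTUSA’s exact windowing and its scoring model are not reproduced, and the function names are hypothetical.

```python
def segment(sequence, size=64, stride=64):
    """Split a sequence into (start, segment) windows; the last may be shorter."""
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    return [(start, sequence[start:start + size])
            for start in range(0, len(sequence), stride)]

def point_mutant(sequence, position, residue):
    """Return a copy of `sequence` with one residue substituted."""
    if not 0 <= position < len(sequence):
        raise IndexError("position outside sequence")
    return sequence[:position] + residue + sequence[position + 1:]

wild_type = "A" * 150                      # placeholder sequence, not a real protein
mutant = point_mutant(wild_type, 70, "W")  # substitution at index 70

# Only the window covering the mutated position differs between the two inputs.
changed = [i for i, (wt, mt) in enumerate(zip(segment(wild_type), segment(mutant)))
           if wt != mt]
print(changed)
```

Because a single substitution alters only one window, a per-segment score can point to where in the protein the change matters, whereas a whole-sequence score has to absorb the change across the full input.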

Another DL model,⁴⁰ ProtCNN, assigns existing Pfam annotations to full-length protein domains, which could aid in predicting Pfam class, inferring coverage, and discovering protein families. The DL study used unaligned domain amino acid sequences from 17,939 classes from Pfam as input, added annotations to approximately 6.8 million sequence regions, and predicted the function of 360 human reference proteome proteins with no previous Pfam annotation. Other promising DL models included ProtENN and ProtREP.

DL models can predict function directly from raw, unannotated protein sequences. Advantages include reduced need for structural information, greater availability of training data, and prediction of functionally essential protein regions using simulated mutations. Limitations include not capturing hierarchical relationships of functional assignments and ignoring prior knowledge of segmented regions and their importance in protein function.

Assessing differences between expected and observed contributions to whole protein function and multicall classification could improve the model. Finally, applying the Pfam annotation platform for biodetection would require further development to categorize additional pathogenic classes.

Case Study: ProteinBERT

The applicability of DL models for proteins extends beyond function-from-sequence predictions. ProteinBERT,⁷ based on bidirectional encoder representations from transformers (BERT),⁴¹ is a deep language model (DLM) for learning global and local feature vector representations of protein sequences of any length, which could be used for predicting protein characteristics, such as posttranslational modifications, structure, or protein–protein interactions. With an attention-transformer architecture, ProteinBERT was pretrained on a large unlabeled dataset (106 million proteins from UniRef90³⁷ and 8,943 Gene Ontology annotations) and then fine-tuned. During pretraining, a percentage of input sequences and Gene Ontology annotations were “corrupted” to ensure the model could recover the “uncorrupted” versions. Smaller and faster than models with more parameters, ProteinBERT achieves comparable accuracy.^42,43 Current applications include providing input to protein–protein interaction graph neural networks (NNs),⁴⁴ predicting protein toxicity,⁴⁵ and designing proteins⁸ and antibodies.⁴⁶
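The corruption step used in this style of pretraining can be sketched as follows. The random-substitution scheme and the 15% rate are assumptions for illustration; the exact procedure used for ProteinBERT is not specified here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def corrupt(sequence, rate, rng):
    """Replace each residue with a randomly drawn one with probability `rate`.

    A drawn residue may equal the original, so the effective change rate is
    slightly below `rate`.
    """
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else residue
        for residue in sequence
    )

rng = random.Random(0)               # fixed seed so the sketch is repeatable
original = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
training_pair = (corrupt(original, 0.15, rng), original)  # (model input, target)
print(training_pair)
```

During pretraining, the model would receive the first element of each pair and be penalized for failing to reconstruct the second.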

ML models with an attention or transformer architecture can effectively predict protein function but can require extensive design. They are portable and can be fine-tuned for specific tasks after 1 pretraining step. Ideally, the subjects of the original and new tasks would be similar enough that knowledge gained from the original task would also be relevant for the new task. However, fine-tuned models could be prone to poor performance if transferable information is insufficient. Designed for sequence-like data, these models are often alignment independent. Uses could include identifying remote homology and predicting function for distantly related sequences.

Like the DL models described earlier, ProteinBERT does not capture the hierarchical nature of Gene Ontology functional annotations. Additionally, applying DLMs to larger contexts is challenging; so-called “language of life” tasks are currently nonviable. Predicting gene clusters, metabolic networks, protein complexes, and pathway functions would represent a breakthrough for DLMs. Current efforts to predict pairwise protein interactions⁴⁴ and immune system gene function⁴⁷ seek to address this gap.

Case Study: Deep Embeddings to Understand Microbial Protein Space

Some ML algorithms learn and output “embeddings” (Figure 2). Embeddings are low-dimensional, ordered lists of numbers that reflect key (though often abstract) qualities of the original objects, such as proteins. ML algorithms determine embeddings from the input data in such a way that the more similar 2 objects are with respect to those key qualities, the more similar their embeddings. Embeddings can then be used in downstream analysis, such as clustering or visualization.
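One common way to compare embeddings is cosine similarity. The three-dimensional vectors and protein labels below are invented for illustration; real embeddings typically have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

# Hypothetical embeddings for two functionally similar proteins and one unrelated one.
kinase_a = [0.90, 0.10, 0.30]
kinase_b = [0.80, 0.20, 0.25]
transporter = [-0.20, 0.90, -0.40]

print(cosine_similarity(kinase_a, kinase_b))
print(cosine_similarity(kinase_a, transporter))
```

The same pairwise measure underlies downstream steps such as clustering, where proteins with nearby embeddings are grouped together.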

For proteins, embeddings can describe characteristics related to protein structure and function in entire microbiomes. For instance, a 3-layer bidirectional long short-term memory (BiLSTM) model was trained with microbial protein sequences from the Unified Human Gastrointestinal Genome catalog⁴⁸ and used to generate embeddings for nontraining set proteins.⁴⁹ Fed an input protein sequence, the model outputs a vector representation that summarizes the sequence and encodes structure and function features.

The model was validated using bacterial SwissProt database proteins³⁸ to ensure it could recapitulate known protein properties and relationships, with deep embeddings scoring as well as, or better than, other methods. On label recovery tests, the F1 score for deep embeddings ranged from approximately 0.5 to 0.9, compared with approximately 0.1 to 0.95 for other methods. Model developers visualized SwissProt bacterial protein dataset embeddings using the Uniform Manifold Approximation and Projection method and found that embeddings clustered around labels defined by the KEGG Orthology ID.⁵⁰

Embeddings obtained from the BiLSTM model accurately describe the 3D structure and function of microbial proteins, and the model could apply to a broader protein set if trained on a more comprehensive dataset. By avoiding sequence alignments, this DL approach overcomes a limitation of traditional homology-based methods^51,52 and enables inference for novel sequences. Functional clustering in the Uniform Manifold Approximation and Projection visualizations could reveal how novel and known proteins relate functionally. Though initial model training is computationally expensive, creating and analyzing new embeddings is fast and efficient.

DNNs for Taxonomic Classification of DNA Sequences

BERT is useful for both proteins⁷ and DNA. BERTax,⁵³ based on BERT,⁴¹ contains additional NN layers and can classify DNA sequences’ taxonomic superkingdom and phylum using natural language processing. Because it does not rely on reference genomes, BERTax can make predictions for sequences without close relatives in existing databases. BERTax performs comparably to current methods when related sequences are in the training data, and it outperforms them on novel sequences.

BERTax was trained with unsupervised pretraining to learn “DNA language” structure, followed by fine-tuning to learn to predict taxonomic classes. The pretraining data included approximately 2.5 million genomic fragments of 1,500 nucleotides from across the 4 superkingdoms with a sequence similarity constraint of at least 80%.

BERTax is reference-independent, making it inherently pathogen agnostic and more generalizable. Consequently, it can make better predictions for metagenomic samples and novel DNA sequences, and it can also be combined with database approaches to capitalize on available information. However, classifying only superkingdom and phylum is a significant limitation; further taxonomic refinement would make BERTax more useful. Because BERTax requires significant training time and runs slower than other DL methods, one might first use a high-throughput method and pass only unassigned or low-confidence sequences to BERTax.
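The two-stage strategy can be sketched as a simple cascade. The classifier stubs, labels, confidence values, and the 0.9 threshold are hypothetical placeholders; neither stub reflects the interface of BERTax or any other tool.

```python
def cascade(sequences, fast_classifier, slow_classifier, threshold=0.9):
    """Label each sequence with the fast model, deferring uncertain calls to the slow one."""
    results = {}
    for name, sequence in sequences.items():
        label, confidence = fast_classifier(sequence)
        if label is None or confidence < threshold:
            label, _ = slow_classifier(sequence)
            results[name] = (label, "slow")
        else:
            results[name] = (label, "fast")
    return results

# Hypothetical stand-ins for real tools; neither reflects an actual classifier.
def fast_stub(sequence):
    return ("Bacteria", 0.95) if sequence.startswith("ATG") else (None, 0.0)

def slow_stub(sequence):
    return ("Viruses", 0.60)

calls = cascade({"read1": "ATGCGT", "read2": "TTGACA"}, fast_stub, slow_stub)
print(calls)
```

Recording which stage produced each label makes it possible to audit how much of a sample actually required the slower model.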

DL, Image-Based Approach to Classify Fungi

DL can be used for image data as well as sequence data. An ML approach to classify microscopic fungi images was developed based on DNNs and bag-of-words,⁵⁴ and it is faster and cheaper than traditional techniques that involve human visual inspection and biochemical tests. The method extracts image features using steps from previously trained DNNs, clusters and aggregates the features to reduce dimensionality, and classifies original images into fungal species using a support vector machine or random forest model.⁵⁴
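The aggregation step of a bag-of-words representation can be sketched as follows. The two-dimensional descriptors and the two-entry codebook are toy values; in the study, features come from pretrained DNNs and the codebook from clustering, neither of which is reproduced here.

```python
def nearest_codeword(descriptor, codebook):
    """Index of the codebook entry closest to `descriptor` (squared Euclidean distance)."""
    distances = [sum((d - c) ** 2 for d, c in zip(descriptor, word))
                 for word in codebook]
    return distances.index(min(distances))

def bag_of_words(descriptors, codebook):
    """Normalized histogram of codeword assignments for one image."""
    if not descriptors:
        raise ValueError("at least one descriptor is required")
    histogram = [0] * len(codebook)
    for descriptor in descriptors:
        histogram[nearest_codeword(descriptor, codebook)] += 1
    return [count / len(descriptors) for count in histogram]

# Toy codebook (cluster centers) and local descriptors from a single image.
codebook = [(0.0, 0.0), (10.0, 10.0)]
image_descriptors = [(1.0, 0.0), (0.0, 1.0), (9.0, 10.0)]
image_vector = bag_of_words(image_descriptors, codebook)
print(image_vector)
```

Each image, regardless of how many local descriptors it yields, is reduced to one fixed-length vector that a support vector machine or random forest can consume.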

To test the DL classifier, 180 preprocessed microscopic fungal scans from the Digital Images of Fungus Species database were used. Data was partitioned by sample preparation, and parameters were optimized through 5-fold cross validation. The Fisher vector representation outperformed bag-of-words, and classification accuracy was approximately 75% to 97% for all but 2 species (accuracy 50% to 60%). The DL classifier struggled with samples displaying high variation in arrangement, appearance, and preparation.⁵⁴

By reducing the need for sequencing, image-based DL techniques could decrease sample classification time and cost. This preliminary study shows proof-of-concept for repurposing DNNs and coupling them with flexible feature representations encapsulating varied data, though further developments are required to fully realize these benefits. Poor performance on highly variable samples might foreshadow difficulty translating to biodetection contexts. Expanding the Digital Images of Fungus Species database to include higher-resolution images, multiple coexistent fungal species, and other sample preparation protocols would enhance applicability and performance.

Determining virulence from images could boost relevance, since threat estimation could be more informative than pathogen identity. While the potential gains from an image-based DL method are high, the example classifier is limited by reliance on traditional cell culture. Nevertheless, the current implementation could identify samples requiring sequencing and further analysis.

ML for Spectral Data

Mass spectrometry (MS) analysis of proteomic or metabolomic factors holds promise for phenotype prediction and diagnosis and is explored broadly in clinical settings.^55-59 Pathogen detection via MS is well studied for emerging viral diseases⁶⁰ but requires intensive curation. Classifying directly from raw MS data using DL models,⁶¹ especially those designed for natural image classification,⁶² might overcome these hurdles.

Case Study: MS Analytics for Sample Categorization

In an article published in 2021,⁶² researchers transformed MS profiles into images and encoded them into feature vectors, which they used as input to logistic regression, support vector machine, random forest, and gradient boosted tree classifiers (Figure 4A). They used multiple publicly available image models that had been pretrained on “natural images.”⁶³ Classifiers predicted whether MS profiles from cancer tissue biopsies were derived from a malignant tumor, with a peak performance of 0.876 area under the receiver operating characteristic curve.
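The area under the receiver operating characteristic curve can be computed directly from its pairwise interpretation: the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one. The labels and scores below are invented for illustration and are unrelated to the study’s data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count half)."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Hypothetical classifier outputs: 1 = malignant, 0 = benign.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
print(round(auc, 3))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the reported 0.876.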

Figure 4.

(A) Schematic of the SWATH-MS-based pipeline.⁶² The pipeline could be used to distinguish between harmless and harmful samples based on their mass spectra, though those spectra first need to be processed into images before they can be ingested into the machine learning workflow. (B) Schematic of the DLearnMS pipeline.⁶⁴ The pipeline consists of a neural network that distinguishes between 2 sets of samples (eg, harmless vs harmful, healthy vs diseased) based directly on their raw mass spectra. Abbreviations: ML, machine learning; SWATH-MS, sequential window acquisition of all theoretical mass spectra.

These analyses show that public pretrained models can generalize to MS image analysis, despite being pretrained on different image classes. Further development of DL models designed and architected for proteomics applications could enhance biodetection by identifying signature pathogen proteins. However, large data storage solutions would be necessary.

Case Study: DL Approach to Detect Biomarkers in LC-MS Proteomics Data

DLearnMS is an NN designed to classify liquid chromatography-mass spectrometry (LC-MS) maps as diseased or healthy with minimal preprocessing (Figure 4B).⁶⁴ Lacking experimental data, researchers simulated the training spectra. From UniProt, they randomly selected 20 human proteome peptides for the “healthy class,” then added 9 spiked peptides for the “diseased class.” The DLearnMS algorithm relies on class labels and employs layer-wise relevance propagation for network and feature selection, detecting differentially abundant peaks as biomarkers. Published LC-MS data were used as a benchmark to evaluate performance. DLearnMS recovered 7 of the 9 spiked peptides, detected fewer false positives than other methods, and required less preprocessing.
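The simulated study design and the recovery metric can be sketched as follows. The peptide identifiers are placeholders, the random selection is seeded arbitrarily, and the "detected" set is constructed by hand to mirror the reported 7 of 9; no part of DLearnMS itself is implemented.

```python
import random

def simulate_classes(peptide_pool, n_background=20, n_spiked=9, seed=0):
    """Pick background peptides for both classes plus spiked peptides for 'diseased'."""
    rng = random.Random(seed)
    chosen = rng.sample(peptide_pool, n_background + n_spiked)
    healthy = set(chosen[:n_background])
    spiked = set(chosen[n_background:])
    return healthy, healthy | spiked, spiked

def recovery(detected, spiked):
    """Fraction of spiked peptides recovered, and the number of false positives."""
    return len(detected & spiked) / len(spiked), len(detected - spiked)

pool = [f"PEP{i:03d}" for i in range(100)]  # hypothetical peptide identifiers
healthy, diseased, spiked = simulate_classes(pool)

detected = set(sorted(spiked)[:7])          # pretend a method flagged 7 spiked peptides
fraction, false_positives = recovery(detected, spiked)
print(round(fraction, 2), false_positives)
```

Because the spiked peptides are known in advance, recovery can be scored exactly, which is the convenience of simulated data and also the source of the overfitting risk.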

Because DLearnMS uses raw MS, it avoids information loss and dimensionality reduction, increases interpretability, and minimizes steps between data collection and model prediction. ML techniques developed on synthetic data require quality assurance to ensure transferability to real-world data. Overfitting is a risk since the distinguishing elements in a synthetic dataset are known. Given this study’s limited scope, it is unclear if DLearnMS could be used for high-throughput applications. Finally, because DLearnMS is based on observing differences between 2 data classes, the model might need to be retrained for environmental samples, which would have different background signals than clinical samples.

Case Study: Raman Spectral Processing for Pathogen/Toxin Detection

Image-based DL is also applied to Raman spectra for clinical diagnostics⁶⁵ and holds potential for agnostic biosurveillance. Surface-enhanced Raman scattering, for example, has been used for environmental and foodborne pathogen detection⁶⁶ and to predict multidrug-resistance profiles in nosocomial pathogens.⁶⁷ A CNN model was trained with 2,000 bacterial Raman spectra to distinguish 30 microbial pathogens.⁶⁸ Support vector machine evaluation of Raman spectra identified bacterial toxins and their concentrations from spectral data.⁶⁹ While most studies combining Raman spectroscopy (RS) and ML (RS+ML) have leveraged ML for dimensionality reduction, classification, validation, regression, and clustering, others applied uncommon or novel methods.^70-81

Current approaches require spectral preprocessing to remove background signal.⁸² Dimensionality reduction is often performed before model training, resulting in model features that are composite quantities. Full Raman spectra can contain complex patterns, making it difficult to determine individual structural components in a sample, but analyzing minimally preprocessed data with ML increases available information.

RS+ML performs well in biomedical settings.⁶⁵ However, most studies used small sample sizes and/or did not validate results. Model performance varied by biological sample and choice of ML method, so RS+ML analyses must first be standardized and optimized if they are to be reliable in biodetection settings. Additional research is needed on samples with complex environmental backgrounds. Indeed, differences in growth conditions and sample preparation affect an organism’s mass and Raman spectra, presenting a major challenge for agnostic biodetection.

Transcriptional Signatures of Infection

The preceding methods were designed to identify and characterize pathogens. Alternative, host-based strategies could provide complementary information and unique advantages. For example, transcriptional signatures in host cells could be used to diagnose infection.⁸³

A 2022 meta-analysis⁸³ evaluated published host signatures of infection^84-92 against 17,105 Gene Expression Omnibus⁹³ transcriptional profiles from whole blood cells and peripheral blood mononuclear cells. The profiles included bacterial and viral infections, plus some parasitic infections and noninfectious conditions. To score different signatures, the authors devised a standardized framework that relates gene expression levels for positive and negative genes associated with the signatures. Most signatures were robust at detecting viral or bacterial infection with median area under the receiver operating characteristic curve values greater than 0.7, though some signatures demonstrated cross-reactivity with unintended infection types.
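One simple way to relate positive and negative signature genes is a difference of mean expression levels. This formula is an assumption for illustration; the meta-analysis’s actual scoring framework is not specified here, and the gene names and values are placeholders.

```python
from statistics import mean

def signature_score(expression, positive_genes, negative_genes):
    """Mean expression of up-regulated genes minus mean of down-regulated genes."""
    up = [expression[g] for g in positive_genes if g in expression]
    down = [expression[g] for g in negative_genes if g in expression]
    if not up:
        raise ValueError("no positive signature genes found in the profile")
    return mean(up) - (mean(down) if down else 0.0)

# Hypothetical log-scale expression profile for one blood sample.
profile = {"UP1": 8.0, "UP2": 6.0, "DOWN1": 3.0, "OTHER": 5.0}
score = signature_score(profile, ["UP1", "UP2"], ["DOWN1"])
print(score)
```

Scores computed this way for many samples can then be ranked against known infection status to obtain an area under the receiver operating characteristic curve for each signature.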

By identifying nascent infections, host transcriptional signatures could enable earlier and broader detection and reduce diagnosis time. While host-based approaches might struggle to identify individual pathogens, this loss of specificity might not be a drawback for pathogen-agnostic detection.

The potential value in host-based techniques is clear, but their feasibility for complex, pooled, or environmental samples is undetermined. This scoring system was designed for biomedical applications but could be extended if the environment is treated as the “host” (Figure 5). There could be nonmicrobial environmental signatures that reflect the presence of a given pathogen or class. The applicability range for such an indirect signature system is unclear, but the approach should be considered, as it is inherently pathogen agnostic.

Figure 5.

Conceptualization of host-based signatures in an environmental context. A range of potential bioagents could impact various aspects of our environment, from the soil and air to our wastewater and food and water supply. These impacts could be evaluated with analytical techniques such as sequencing, imaging, and spectroscopy. The data from these assessments could be used to identify bioagent-agnostic signatures that differ based on threat absence or presence. Adapted from Leiser et al.² Abbreviation: BAS, bioagent-agnostic signatures.

Discussion and Conclusions

Discussions about nontargeted and One Health assays for outbreak preparedness^94-96 highlight a clear need for agnostic diagnostics. Although environmental biodetection technologies are less common, ML platforms pretrained for clinical/biomedical use could be adapted for biosurveillance of environmental threats. We examined strengths and weaknesses of ML biodetection approaches with varying input data type and format, ML algorithms, use cases, performance, and maturity level. The unifying theme is identification via raw or minimally preprocessed data, with the aim of achieving untargeted detection or diagnostics readouts.

The systems’ framework depends on models capable of classifying input data and effective methods for learning representations of these datasets. The latter supports reference independence, as flexible representations could capture threat function even if the encoding of that function does not exist in reference knowledgebases.

The DL approach for classifying fungi based on microscopic images⁵⁴ avoids cell culture but requires more training data to bolster performance and reliability. Recently developed DL models for bacterial image classification were trained on thousands of images.^97,98 Alternatively, if insufficient training images are available, models pretrained on nonmicrobial images could be leveraged. A recent model for classifying bacterial images as Gram-positive or Gram-negative was pretrained on the ImageNet-1k dataset and then retrained on bacterial images.⁹⁸

Generalizability hurdles pose another challenge for agent-agnostic biodetection. Given the high resolution of RS and the variability in performance of different combinations of biological samples and ML methods, specific use cases might need a specialized approach. The development cost and delay could render these methods less suitable. Instead, if spectral signatures for threat function could be identified using existing models, focusing on these more generalizable features could aid first-pass threat identification, with further assessments reserved for characterizing potential threats.

FUTUSA³⁹ and SeqScreen³⁵ both predict protein function from sequence, though SeqScreen’s approach seems more mature. In contrast to FUTUSA, SeqScreen provides well-defined functional labels, incorporates manual data curation, combines several functional annotation algorithms, and assigns both taxonomic and functional labels. However, FUTUSA, unlike SeqScreen, can predict the functional impact of point mutations.

Reference-independent prediction of unknown protein function is appealing. While initial training requires reference sequences, the resultant embeddings and classification model can assign functions to previously unseen sequences and are more portable than reference databases. Because BERTax⁵³ operates independently of reference genome databases, it could characterize unknown DNA sequences when other approaches fail.
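The portability argument above can be illustrated with a minimal sketch in Python. It assumes a toy embedding (amino acid frequencies) as a stand-in for learned embeddings, and a nearest-centroid rule as a stand-in for a trained classifier. The function names, example sequences, and functional labels are hypothetical and are not drawn from BERTax or any other reviewed platform.

```python
from collections import Counter
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Toy embedding: amino acid frequencies (stand-in for a learned embedding)."""
    counts = Counter(sequence)
    length = max(len(sequence), 1)
    return [counts[aa] / length for aa in AMINO_ACIDS]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def fit_centroids(labeled_sequences):
    """Average the embeddings of the training sequences for each label."""
    grouped = {}
    for sequence, label in labeled_sequences:
        grouped.setdefault(label, []).append(embed(sequence))
    return {
        label: [sum(column) / len(vectors) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict_function(sequence, centroids):
    """Return (label, similarity) for the centroid closest to the sequence."""
    query = embed(sequence)
    return max(
        ((label, cosine(query, centroid)) for label, centroid in centroids.items()),
        key=lambda pair: pair[1],
    )


# Hypothetical reference sequences, needed only during training.
training = [
    ("KKKKRRKKLLKK", "membrane-disrupting"),
    ("RRKKLKKRRKLK", "membrane-disrupting"),
    ("DDEEDDGGEEDD", "metal-binding"),
    ("EEDDGEEDDEED", "metal-binding"),
]
centroids = fit_centroids(training)

# A previously unseen sequence is labeled from the centroids alone.
label, score = predict_function("KRKKLLRKKRKL", centroids)
```

Once `centroids` has been computed, the reference sequences are no longer needed at prediction time, which is the sense in which an embedding-plus-classifier pair is more portable than a reference database.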

Host-based strategies are also appealing because they would be inherently pathogen agnostic. Host signatures could be more robust at detecting pathogens, since host response patterns may be “evolutionarily conserved” across “a wide range of pathogens and toxins that elicit disease.”² By defining molecular host pathways of disease and assessing how a pathogen would disrupt those pathways, researchers could discover unknown pathogens and functional categories that SeqScreen³⁵ might not capture.

DL-based tools⁹⁹ can already predict protein structure and could also potentially predict function. It could be worthwhile to explore improving performance and generalizability of large language models, like that used in ProteinBERT,⁷ given their success in other domains and the language-like nature of protein sequences. As frameworks like large language models become larger and more complex, their capacity to serve as pretrained models that can be fine-tuned for biodetection will be amplified.

Currently, no single algorithm or model addresses all biosurveillance needs. Using complementary platforms together could help overcome limitations and enhance potential (Figure 6). For example, platforms operating on processed⁶² and unprocessed⁶⁴ mass spectra could be paired. DLearnMS⁶⁴ could be used to identify detailed and interpretable distinguishing features between healthy and diseased samples, which could then inform high-throughput, multisubject analysis from the SWATH-MS (sequential window acquisition of all theoretical mass spectra)-based platform (Figure 6A).⁶²

Figure 6.

(A) Schematic illustrating how the DLearnMS and SWATH-MS-based pipelines^62,64 could be applied in tandem. (B) Schematic illustrating how the FUTUSA³⁹ and BiLSTM model⁴⁹ pipelines could be applied in tandem. (C) Schematic illustrating how nanopore sequencing and SeqScreen³⁵ could be used in tandem. Abbreviations: BiLSTM, bidirectional long short-term memory; FUTUSA, function teller using sequence alone; ML, machine learning; MS, mass spectrum; SWATH-MS, sequential window acquisition of all theoretical mass spectra.

Coupling FUTUSA³⁹ and BiLSTM for protein embeddings⁴⁹ could be fruitful for probing protein function (Figure 6B). FUTUSA could first be used to create new functional annotations for mutated protein sequences. Those new annotations could then be used to augment the training dataset and retrain the BiLSTM model. Updated embeddings could be generated from the retrained BiLSTM model and used to assess protein sequences from environmental surveillance, potentially giving insight into the effects of mutations on protein function and protein-protein similarity.
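The annotate-augment-retrain loop described above might be organized as in the following Python sketch. The `annotate` and `retrain` callables are hypothetical stand-ins and do not reflect the actual interfaces of FUTUSA or the BiLSTM model. The confidence filter (`min_confidence`) is an added assumption, included to avoid training on uncertain annotations.

```python
def augment_and_retrain(training_set, mutated_sequences, annotate, retrain,
                        min_confidence=0.8):
    """Add confidently annotated mutants to the training set, then retrain.

    training_set: list of (sequence, label) pairs
    annotate: callable, sequence -> (label, confidence); stand-in for a
        sequence-based function predictor
    retrain: callable, list of (sequence, label) -> model; stand-in for
        retraining an embedding model
    """
    known = {sequence for sequence, _ in training_set}
    augmented = list(training_set)
    for sequence in mutated_sequences:
        if sequence in known:
            continue  # skip sequences that are already in the training set
        label, confidence = annotate(sequence)
        if confidence >= min_confidence:
            augmented.append((sequence, label))
            known.add(sequence)
    return retrain(augmented), augmented


def toy_annotate(sequence):
    """Hypothetical rule: lysine-rich sequences are labeled as binders."""
    fraction = sequence.count("K") / len(sequence)
    if fraction >= 0.5:
        return "binds_target", fraction
    return "no_binding", 1.0 - fraction


def toy_retrain(examples):
    """Hypothetical 'model': label counts over the training examples."""
    counts = {}
    for _, label in examples:
        counts[label] = counts.get(label, 0) + 1
    return counts


base = [("KKKA", "binds_target"), ("AAAA", "no_binding")]
mutants = ["KKKK", "KKAA", "AAAK", "KKKA"]
model, augmented = augment_and_retrain(base, mutants, toy_annotate, toy_retrain)
```

In this toy run, only `"KKKK"` is added: `"KKAA"` and `"AAAK"` fall below the confidence cutoff, and `"KKKA"` is already present in the training set.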

Similarly, combining nanopore sequencing and SeqScreen³⁵ could enhance their value (Figure 6C). Nanopore sequencing could be performed on wastewater samples to generate input sequences for SeqScreen, which could identify FunSoCs. Those identifications could then be used to train an ML model to characterize future wastewater samples or classify them as concerning or not concerning.
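As a rough illustration of the final step, the Python sketch below trains a small logistic regression model on per-sample FunSoC hit counts. The category names, counts, and labels are invented for illustration and do not come from SeqScreen output. Logistic regression is one simple choice among many, not a method proposed in the reviewed studies.

```python
from math import exp

# Hypothetical FunSoC-style category names; actual SeqScreen labels may differ.
FEATURES = ["cytotoxicity", "adhesion", "secretion", "antibiotic_resistance"]


def to_vector(funsoc_counts, total_reads):
    """Convert per-sample FunSoC hit counts to hits per 1,000 reads."""
    return [1000.0 * funsoc_counts.get(name, 0) / total_reads for name in FEATURES]


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def train_logistic(samples, labels, epochs=2000, learning_rate=0.1):
    """Fit logistic regression by batch gradient descent."""
    weights = [0.0] * len(FEATURES)
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * len(FEATURES)
        grad_b = 0.0
        for x, y in zip(samples, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            grad_w = [g + error * xi for g, xi in zip(grad_w, x)]
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_concern(x, weights, bias):
    """Estimated probability that a sample is 'concerning'."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Invented training samples: (FunSoC hit counts, total reads, label).
training = [
    ({"cytotoxicity": 40, "adhesion": 10, "secretion": 25}, 10000, 1),
    ({"cytotoxicity": 30, "antibiotic_resistance": 20}, 10000, 1),
    ({"adhesion": 5}, 10000, 0),
    ({"cytotoxicity": 2, "adhesion": 8}, 10000, 0),
]
samples = [to_vector(counts, reads) for counts, reads, _ in training]
labels = [label for _, _, label in training]
weights, bias = train_logistic(samples, labels)

new_sample = to_vector(
    {"cytotoxicity": 70, "adhesion": 10, "secretion": 25, "antibiotic_resistance": 20},
    20000,
)
background = to_vector({"adhesion": 6}, 10000)
p_new = predict_concern(new_sample, weights, bias)
p_background = predict_concern(background, weights, bias)
```

Normalizing counts by sequencing depth (`total_reads`) is a design choice made here so that samples sequenced to different depths remain comparable. A real deployment would need far more samples and independent validation.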

Although future biodetection needs are complex and uncertain, the assessments above indicate that ML capabilities are well positioned to address emerging challenges (Figure 7). Efforts to extend and synergize such platforms are still needed to bring their potential utility to practice.

Figure 7.

Decision flowchart to aid in identifying the most appropriate pipeline(s) to use based on the available input data and desired task. Pairs of pipelines in orange, red, and purple could be used complementarily, as described in the text and depicted in Figure 6. Abbreviations: BiLSTM, bidirectional long short-term memory; DL, deep learning; FUTUSA, function teller using sequence alone; ML, machine learning; SWATH-MS, sequential window acquisition of all theoretical mass spectra.

Moreover, to bolster confidence in risk evaluation and minimize false positives, evidence from multiple algorithms should be weighed before acting. Thus, an integrated, complementary approach to implementing these technologies will likely best serve the goal of agnostic biosurveillance.
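One way to make "weighing evidence from multiple algorithms" concrete is sketched below in Python. The decision rule is an assumption made for illustration: a sample is flagged only if the weighted mean score is high and at least two sources independently exceed the threshold. The source names, weights, and thresholds are likewise hypothetical. The rule is not taken from any reviewed platform.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    source: str    # name of the algorithm or platform (hypothetical labels)
    score: float   # estimated probability of a threat, in [0, 1]
    weight: float  # relative trust placed in this source


def weigh_evidence(evidence, score_threshold=0.7, min_agreeing=2):
    """Combine scores from several algorithms before recommending action.

    Returns (decision, combined_score). A sample is flagged only when the
    weighted mean score meets the threshold AND at least `min_agreeing`
    sources individually meet it, which guards against acting on a single
    false positive.
    """
    total_weight = sum(e.weight for e in evidence)
    if not evidence or total_weight <= 0:
        return "insufficient evidence", 0.0
    combined = sum(e.score * e.weight for e in evidence) / total_weight
    agreeing = sum(1 for e in evidence if e.score >= score_threshold)
    if combined >= score_threshold and agreeing >= min_agreeing:
        return "flag for follow-up", combined
    if agreeing >= 1:
        return "collect more evidence", combined
    return "no action", combined


corroborated = [
    Evidence("sequence_function_screen", 0.9, 2.0),
    Evidence("spectral_classifier", 0.8, 1.0),
    Evidence("host_signature", 0.4, 1.0),
]
uncorroborated = [
    Evidence("spectral_classifier", 0.95, 1.0),
    Evidence("host_signature", 0.2, 1.0),
]
decision_a, score_a = weigh_evidence(corroborated)
decision_b, score_b = weigh_evidence(uncorroborated)
```

Here the first sample is flagged because two sources agree and the weighted mean clears the threshold, whereas the second prompts further data collection because a single high score is uncorroborated.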

Footnotes

Acknowledgments

We thank Monica Borucki for helpful discussion and feedback, Melanie Mendez for editorial assistance, and Jeremy Turner for graphic design support. We also thank the leadership and team members of the Department of Homeland Security Science and Technology Directorate (DHS S&T), Hazard Awareness and Characterization Technology Center (HAC-TC) for helpful discussion and review. Molecular graphics were created with UCSF ChimeraX, developed by the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco, with support from National Institutes of Health R01-GM129325 and the National Institute of Allergy and Infectious Diseases Office of Cyber Infrastructure and Computational Biology. This work was performed under the auspices of the US Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. This effort was funded by DHS S&T HAC-TC contract number 70RSAT23KPM000036.

References

1. National Academies of Sciences, Engineering, and Medicine. Biodefense in the Age of Synthetic Biology. Washington, DC: National Academies Press; 2018. Accessed January 27, 2025. https://doi.org/10.17226/24890

2. Leiser, Hobbs, Sims, Korch, Taylor. Beyond the list: bioagent-agnostic signatures could enable a more flexible and resilient biodefense posture than an approach based on priority agent lists alone. Pathogens. 2021; 10(11):1497.

3. Lin, Torres, Hobbs, et al. Computational and systems biology advances to enable bioagent agnostic signatures. Health Secur. 2024; 22(2):130-139.

4. Zhou, Yu, Wang, et al. A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics. Nat Biomed Eng. 2023; 7(6):743-755.

5. Herzog, Kook, Hamann, et al. Deep learning versus neurologists: functional outcome prediction in LVO stroke patients undergoing mechanical thrombectomy. Stroke. 2023; 54(7):1761-1769.

6. Radak, Lafta, Fallahi. Machine learning and deep learning techniques for breast cancer diagnosis and classification: a comprehensive review of medical imaging studies. J Cancer Res Clin Oncol. 2023; 149(12):10473-10491.

7. Brandes, Ofer, Peleg, Rappoport, Linial. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022; 38(8):2102-2110.

8. Ferruz, Schmidt, Höcker. ProtGPT2 is a deep unsupervised language model for protein design. Nat Commun. 2022; 13(1):4348.

9. Nijkamp, Ruffolo, Weinstein, Naik, Madani. ProGen2: exploring the boundaries of protein language models. Preprint. arXiv. arXiv:2206.13517 [cs.LG]. Submitted June 27, 2022. Accessed January 27, 2025. https://doi.org/10.48550/arXiv.2206.13517

10. Parkins, Lee, Acosta, et al. Wastewater-based surveillance as a tool for public health action: SARS-CoV-2 and beyond. Clin Microbiol Rev. 2024; 37(1):e0010322.

11. Keshaviah, Diamond, Wade, Scarpino; Global Wastewater Action Group. Wastewater monitoring can anchor global disease surveillance systems. Lancet Glob Health. 2023; 11(6):e976-e981.

12. Darling, Patton, Rasheduzzaman, et al. Microbiological and chemical drinking water contaminants and associated health outcomes in rural Appalachia, USA: a systematic review and meta-analysis. Sci Total Environ. 2023; 892:164036.

13. Bøifot, Gohli, Skogan, Dybwad. Performance evaluation of high-volume electret filter air samplers in aerosol microbiome research. Environ Microbiome. 2020; 15(1):14.

14. […], Thissen, Fofanov, et al. Metagenomic analysis of the airborne environment in urban spaces. Microb Ecol. 2015; 69(2):346-355.

15. Pyrri, Stamatelopoulou, Pardali, Maggos. The air and dust invisible mycobiome of urban domestic environments. Sci Total Environ. 2023; 904:166228.

16. Pillay, Calderón-Franco, Urhan, Abeel. Metagenomic-based surveillance systems for antibiotic resistance in non-clinical settings. Front Microbiol. 2022; 13:1066995.

17. Balloux, van Dorp. Q&A: what are pathogens, and what have they done to and for us? BMC Biol. 2017; 15(1):91.

18. Clare, Economou, Bennett, et al. Measuring biodiversity from DNA in the air. Curr Biol. 2022; 32(3):693-700.e5.

19. Lynggaard, Bertelsen, Jensen, et al. Airborne environmental DNA for terrestrial vertebrate community monitoring. Curr Biol. 2022; 32(3):701-707.e5.

20. Bao, Wadden, Erb-Downward, et al. SquiggleNet: real-time, direct classification of nanopore signals. Genome Biol. 2021; 22(1):298.

21. Senanayake, Gamaarachchi, Herath, Ragel. DeepSelectNet: deep neural network based selective sequencing for Oxford Nanopore sequencing. BMC Bioinformatics. 2023; 24(1):31.

22. Kovaka, Fan, Ni, Timp, Schatz. Targeted nanopore sequencing by real-time mapping of raw electrical signal with UNCALLED. Nat Biotechnol. 2021; 39(4):431-441.

23. Zhang, Li, Jain, et al. Real-time mapping of nanopore raw signals. Bioinformatics. 2021; 37(suppl 1):i477-i483.

24. Gorzynski, Goenka, Shafin, et al. Ultrarapid nanopore genome sequencing in a critical care setting. N Engl J Med. 2022; 386(7):700-702.

25. Sadasivan, Wadden, Goliya, et al. Rapid real-time squiggle classification for read until using RawMap. Arch Clin Biomed Res. 2023; 7(1):45-57.

26. Noordijk, Nijland, Carrion, Raaijmakers, de Ridder, de Lannoy. baseLess: lightweight detection of sequences in raw MinION data. Bioinform Adv. 2023; 3(1):vbad017.

27. […], Li, Song, Wang. Crescent: a GPU-based targeted nanopore sequence selector. Presented at: 2022 IEEE International Conference on Bioinformatics and Biomedicine; December 7, 2022; Las Vegas, NV. Accessed January 27, 2025. https://doi.org/10.1109/BIBM55620.2022.9995449

28. Sadasivan, Stiffler, Tirumala, Israeli, Narayanasamy. Accelerated dynamic time warping on GPU for selective nanopore sequencing. J Biotechnol Biomed. 2024; 7:137-148.

29. Nykrynova, Jakubicek, Barton, Bezdicek, Lengerova, Skutkova. Using deep learning for gene detection and classification in raw nanopore signals. Front Microbiol. 2022; 13:942179.

30. Mutlu, Firtina. Accelerating genome analysis via algorithm-architecture co-design. Preprint. arXiv. arXiv:2305.00492. Submitted April 30, 2023. Last revised May 31, 2023. Accessed January 27, 2025. https://arxiv.org/abs/2305.00492

31. Wan, Hendra, Pratanwanich, Göke. Beyond sequencing: machine learning algorithms extract biology hidden in nanopore signal data. Trends Genet. 2022; 38(3):246-257.

32. Loose, Malla, Stout. Real-time selective sequencing using nanopore technology. Nat Methods. 2016; 13(9):751-754.

33. Masutani, Morishita. A framework and an algorithm to detect low-abundance DNA by a handy sequencer and a palm-sized computer. Bioinformatics. 2019; 35(4):584-592.

34. […], Wang, Bi, Qiu, Li, Gao. DeepSimulator1.5: a more powerful, quicker and lighter simulator for nanopore sequencing. Bioinformatics. 2020; 36(8):2578-2580.

35. Balaji, Kille, Kappell, et al. SeqScreen: accurate and sensitive functional screening of pathogenic sequences via ensemble learning. Genome Biol. 2022; 23(1):133.

36. Godbold, Kappell, LeSassier, Treangen, Ternus. Categorizing sequences of concern by function to better assess mechanisms of microbial pathogenesis. Infect Immun. 2022; 90(5):e0033421.

37. UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023; 51(D1):D523-D531.

38. Boutet, Lieberherr, Tognolli, Schneider, Bairoch. UniProtKB/Swiss-Prot. Methods Mol Biol. 2007; 406:89-112.

39. […], Huh, Park. Deep learning program to predict protein functions based on sequence information. MethodsX. 2022; 9:101622.

40. Bileschi, Belanger, Bryant, et al. Using deep learning to annotate the protein universe. Nat Biotechnol. 2022; 40(6):932-937.

41. Devlin, Chang, Lee, Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. Preprint. arXiv. arXiv:1810.04805. Submitted October 11, 2018. Last revised May 24, 2019. Accessed January 30, 2025. https://arxiv.org/abs/1810.04805

42. Elnaggar, Heinzinger, Dallago, et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2022; 44(10):7112-7127.

43. Rives, Meier, Sercu, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021; 118(15):e2016239118.

44. Jha, Saha, Singh. Prediction of protein-protein interaction using graph neural networks. Sci Rep. 2022; 12(1):8360.

45. Morozov, Rodrigues CHM, Ascher. CSM-Toxin: a web-server for predicting protein toxicity. Pharmaceutics. 2023; 15(2):431.

46. Khan, Cowen-Rivers, Grosnit, et al. Toward real-world automated antibody design with combinatorial Bayesian optimization. Cell Rep Methods. 2023; 3(1):100374.

47. Miller, Stern, Burstein. Deciphering microbial gene function using natural language processing. Nat Commun. 2022; 13(1):5731.

48. Almeida, Nayfach, Boland, et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat Biotechnol. 2021; 39(1):105-114.

49. Odrzywolek, Karwowska, Majta, Byrski, Milanowska-Zabel, Kosciolek. Deep embeddings to comprehend and visualize microbiome protein space. Sci Rep. 2022; 12(1):10332.

50. Kanehisa, Sato, Kawashima, Furumichi, Tanabe. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 2016; 44(D1):D457-D462.

51. Finn, Clements, Eddy. HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011; 39(Web Server issue):W29-W37.

52. Altschul, Gish, Miller, Myers, Lipman. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403-410.

53. Mock, Kretschmer, Kriese, Böcker, Marz. Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks. Proc Natl Acad Sci U S A. 2022; 119(35):e2122636119.

54. Zieliński, Sroka-Oleksiak, Rymarczyk, Piekarczyk, Brzychczy-Włoch. Deep learning approach to describe and classify fungi microscopic images. PLoS One. 2020; 15(6):e0234806.

55. Creighton. Clinical proteomics towards multiomics in cancer. Mass Spectrom Rev. 2024; 43(6):1255-1269.

56. Demichev, Tober-Lau, Nazarenko, et al. A proteomic survival predictor for COVID-19 patients in intensive care. PLOS Digit Health. 2022; 1(1):e0000007.

57. Torres-Sangiao, Leal Rodriguez, García-Riestra. Application and perspectives of MALDI-TOF mass spectrometry in clinical microbiology laboratories. Microorganisms. 2021; 9(7):1539.

58. Chatterjee, Zaia. Proteomics-based mass spectrometry profiling of SARS-CoV-2 infection from human nasopharyngeal samples. Mass Spectrom Rev. 2024; 43(1):193-229.

59. Solntceva, Kostrzewa, Larrouy-Maumus. Detection of species-specific lipids by routine MALDI TOF mass spectrometry to unlock the challenges of microbial identification and antimicrobial susceptibility testing. Front Cell Infect Microbiol. 2021; 10:621452.

60. Mahmud, Garrett. Mass spectrometry techniques in emerging pathogens studies: COVID-19 perspectives. J Am Soc Mass Spectrom. 2020; 31(10):2013-2024.

61. Wang, Zhu, Zhou, Cheng, Yang. MSpectraAI: a powerful platform for deciphering proteome profiling of multi-tumor mass spectrometry data by using deep neural networks. BMC Bioinformatics. 2020; 21(1):439.

62. Cadow, Manica, Mathis, Guo, Aebersold, Rodríguez Martínez. On the feasibility of deep learning applications using raw mass spectrometry data. Bioinformatics. 2021; 37(suppl 1):i245-i253.

63. Cui, Song, Sun, Howard, Belongie. Large scale fine-grained categorization and domain-specific transfer learning. Presented at: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition; June 18-23, 2018; Salt Lake City, UT. Accessed January 30, 2025. https://doi.org/10.1109/CVPR.2018.00432

64. Iravani, Conrad TOF. An interpretable deep learning approach for biomarker detection in LC-MS proteomics data. IEEE/ACM Trans Comput Biol Bioinform. 2023; 20(1):151-161.

65. Ralbovsky, Lednev. Towards development of a novel universal medical diagnostic method: Raman spectroscopy and machine learning. Chem Soc Rev. 2020; 49(20):7428-7453.

66. Zhao, Li, Xu. Detection of foodborne pathogens by surface enhanced Raman spectroscopy. Front Microbiol. 2018; 9:1236.

67. Lyu, Zhang, Tang, et al. Rapid prediction of multidrug-resistant Klebsiella pneumoniae through deep learning analysis of SERS spectra. Microbiol Spectr. 2023; 11(2):e0412622.

68. […], Jean, Hogan, et al. Rapid identification of pathogenic bacteria using Raman spectroscopy and deep learning. Nat Commun. 2019; 10(1):4927.

69. Koya, Brusatori, Martin, et al. Rapid detection of Clostridium difficile toxins in serum by Raman spectroscopy. J Surg Res. 2018; 232:195-201.

70. Marro, Nieva, de Juan, Sierra. Unravelling the metabolic progression of breast cancer cells to bone metastasis by coupling Raman spectroscopy and a novel use of MCR-ALS algorithm. Anal Chem. 2018; 90(9):5594-5602.

71. de Juan, Jaumot, Tauler. Multivariate Curve Resolution (MCR). Solving the mixture analysis problem. Anal Methods. 2014; 6(14):4964-4976.

72. Fallahzadeh, Dehghani-Bidgoli, Assarian. Raman spectral feature selection using ant colony optimization for breast cancer diagnosis. Lasers Med Sci. 2018; 33(8):1799-1806.

73. Dorigo, Birattari, Stutzle. Ant colony optimization - artificial ants as a computational intelligence technique. IEEE Comput Intell Mag. 2006; 1(4):28-39.

74. Maitra, Morais CLM, Lima KMG, Ashton, Date, Martin. Raman spectral discrimination in human liquid biopsies of oesophageal transformation to adenocarcinoma. J Biophotonics. 2020; 13(3):e201960132.

75. Katoch, Chauhan, Kumar. A review on genetic algorithm: past, present, and future. Multimed Tools Appl. 2021; 80(5):8091-8126.

76. González-Solís. Discrimination of different cancer types clustering Raman spectra by a super paramagnetic stochastic network approach. PLoS One. 2019; 14(3):e0213621.

77. Blatt, Wiseman, Domany. Superparamagnetic clustering of data. Phys Rev Lett. 1996; 76(18):3251-3254.

78. […], Li, Zhang, Xu. An improved k-nearest neighbour method to diagnose breast cancer. Analyst. 2018; 143(12):2807-2811.

79. Sohail, Khan, Ullah, Qureshi, Bilal, Khan. Analysis of hepatitis C infection using Raman spectroscopy and proximity based classification in the transformed domain. Biomed Opt Express. 2018; 9(5):2041-2055.

80. Garcia, da Silva Filho, Silveira, et al. Analysis of Raman spectroscopy data with algorithms based on paraconsistent logic for characterization of skin cancer lesions. Vib Spectrosc. 2019; 103:102929.

81. Krbcova, Kukal, Mares, Habartova, Setnicka. Variational approach to cancerous tissue identification from in vivo Raman spectra. Biomed Signal Process Control. 2019; 49:520-527.

82. Hernández-Vidales, Guevara, Olivares-Illana, Gonzalez. Characterization of wild-type and mutant p53 protein by Raman spectroscopy and multivariate methods. J Raman Spectrosc. 2019; 50(10):1388-1394.

83. Chawla, Cappuccio, Tamminga, Sealfon, Zaslavsky, Kleinstein. Benchmarking transcriptional host response signatures for infection diagnosis. Cell Syst. 2022; 13(12):974-988.e7.

84. Andres-Terre, McGuire, Pouliot, et al. Integrated, multi-cohort analysis identifies conserved transcriptional signatures across multiple respiratory viruses. Immunity. 2015; 43(6):1199-1211.

85. Suarez, Bunsow, Falsey, Walsh, Mejias, Ramilo. Superiority of transcriptional profiling over procalcitonin for distinguishing bacterial from viral lower respiratory tract infections in hospitalized adults. J Infect Dis. 2015; 212(2):213-222.

86. Sweeney, Wong, Khatri. Robust classification of bacterial and viral infections via integrated host gene expression diagnostics. Sci Transl Med. 2016; 8(346):346ra91.

87. Tsalik, Henao, Nichols, et al. Host gene expression classifiers diagnose acute respiratory illness etiology. Sci Transl Med. 2016; 8(322):322ra11.

88. Herberg, Kaforou, Wright, et al; IRIS Consortium. Diagnostic test accuracy of a 2-transcript host RNA signature for discriminating bacterial vs viral infection in febrile children. JAMA. 2016; 316(8):835-845.

89. Smith, Dampier, Tozeren, Brown, Magid-Slav. Identification of common biological pathways and drug targets across multiple respiratory viruses based on human host gene expression analysis. PLoS One. 2012; 7(3):e33174.

90. Smith, Magid-Slav, Brown. Host response to respiratory bacterial pathogens as identified by integrated analysis of human gene expression data. PLoS One. 2013; 8(9):e75607.

91. Zaas, Chen, Hero, Lucas, Carin, Ginsburg. Response: improving development of the molecular signature for diagnosis of acute respiratory viral infections. Cell Host Microbe. 2010; 7(2):P102.

92. […], Yu, Crosby, Storch. Gene expression profiles in febrile children with defined viral and bacterial infection. Proc Natl Acad Sci U S A. 2013; 110(31):12792-12797.

93. Barrett, Wilhite, Ledoux, et al. NCBI GEO: archive for functional genomics data sets—update. Nucleic Acids Res. 2013; 41(Database issue):D991-D995.

94. Gauthier NPG, Chorlton, Krajden, Manges. Agnostic sequencing for detection of viral pathogens. Clin Microbiol Rev. 2023; 36(1):e0011922.

95. Simner, Miller, Carroll. Understanding the promises and hurdles of metagenomic next-generation sequencing as a diagnostic tool for infectious diseases. Clin Infect Dis. 2018; 66(5):778-788.

96. Aarestrup, Bonten, Koopmans. Pandemics- One Health preparedness for the next. Lancet Reg Health Eur. 2021; 9:100210.

97. Worth, Espina. ScanGrow: deep learning-based live tracking of bacterial growth in broth. Front Microbiol. 2022; 13:900596.

98. Kim, Maros, Miethke, Kittel, Siegel, Ganslandt. Lightweight visual transformers outperform convolutional neural networks for Gram-stained image classification: an empirical study. Biomedicines. 2023; 11(5):1333.

99. Jumper, Evans, Pritzel, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021; 596(7873):583-589.
