
Abstract

An estimated 20 million chemical transformations occur in an average human cell every second; however, the vast majority of nature’s small-molecule “dark chemical matter” remains uncharted, limiting our understanding of basic biological processes and throttling progress in human health. High-throughput metabolomics coupled with artificial intelligence and machine learning can decode life’s chemistry, illuminating novel metabolites involved in giving rise to living systems and transforming human health.

The sequencing of the human genome marked a monumental moment in scientific history, unlocking the blueprint of life and fueling hopes for transformative advances in medicine and industry.¹ Life-changing medicines have resulted from genomics research, including gene therapies and nucleic acid-based therapeutics, bringing new hope to thousands of patients by targeting the genetic blueprint of diseases to alter disease progression.^2,3

We are all comfortable with the familiar framing of proteins as the workhorses of the cell but rarely ask, “What work do proteins do?” To paraphrase the biophysicist Harold Morowitz, every protein, indirectly or directly, makes matter cycle or energy flow.⁴ Viewed through another lens, the point of genes, transcripts, and proteins is to transform carbon, that is, metabolism. This chemical layer of biology, perhaps its very point, is largely ignored by science.

The human genome comprises roughly 20,000 protein-coding genes, a finite and well-mapped terrain compared with the chemical universe.⁵ The human metabolome, by contrast, is so poorly charted that we do not yet have a confident estimate of its size, although experts suggest it likely runs to millions of compounds. Nature as a whole is an almost inexhaustible library of molecular diversity: an estimated 99% of natural compounds remain unknown.⁶ Their structures, functions, and potential applications are hidden in the shadows of what researchers call “dark chemical matter.” These molecules are the products of billions of years of evolutionary experimentation, and they hold the keys to life’s resilience, adaptation, and complexity.

This chemical richness has already proven to be an unparalleled source of innovation. Nearly two-thirds of all approved drugs originate from natural products or their derivatives, including antibiotics, anticancer agents, and immunosuppressants.⁷ Nature’s compounds, shaped by evolutionary pressures, exhibit bioactivity unmatched by synthetic chemistry. And yet, despite their success in medicine and industry, we’ve barely scratched the surface of this chemical treasure trove.

A Slow and Iterative Process

Why is this chemical dark matter so elusive? Unlike genomic sequencing, the task of identifying and characterizing natural chemicals remains a laborious and complex endeavor. While mass spectrometry, the primary tool for analyzing molecular identities, predates genomics, it is still incomplete. Chemical compounds are often found in complex mixtures, without annotation of structure or biological function. Traditionally, activity in complex extracts is chased down to individual bioactive compounds using a process called bioactivity-guided fractionation. This technique identifies bioactive compounds from a natural extract by repeatedly separating the extract into smaller fractions, testing each fraction for biological activity, and then further isolating the most active fractions until the pure bioactive compound is obtained; essentially, it is a method to systematically identify the active components within a complex mixture based on their biological effects.

The main challenge of bioactivity-guided fractionation is that the bioactivity is often lost during fractionation, before the responsible molecule has been isolated. This can happen for several reasons, including degradation of bioactive compounds during purification, low concentrations of active molecules that make isolation and purification challenging, and synergistic effects between multiple compounds that disappear once those compounds are separated.^8,9 When compounded by the need for time-consuming and laborious assays to isolate and identify the active principle from a complex mixture, the iterative nature of the process becomes unscalable.¹⁰
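The loss of activity through synergy can be illustrated with a toy simulation. The sketch below is only an illustration under simplifying assumptions: the compound names, the binary split, the `fractionate` helper, and both assay functions are hypothetical and do not model any real extract or laboratory protocol.

```python
from typing import Callable, Optional, Sequence

# An assay maps a fraction (a list of compound names) to an activity score.
Assay = Callable[[Sequence[str]], float]


def fractionate(extract: Sequence[str], assay: Assay, threshold: float) -> Optional[str]:
    """Repeatedly split the fraction in two and follow the more active half.

    Returns the isolated compound, or None if activity drops below
    `threshold` at any step (the activity is "lost").
    """
    fraction = list(extract)
    if not fraction or assay(fraction) < threshold:
        return None
    while len(fraction) > 1:
        mid = len(fraction) // 2
        best = max((fraction[:mid], fraction[mid:]), key=assay)
        if assay(best) < threshold:
            return None  # neither half is active on its own
        fraction = best
    return fraction[0]


# Hypothetical extract of eight compounds.
compounds = [f"c{i}" for i in range(1, 9)]


def single_active(fraction: Sequence[str]) -> float:
    """Toy assay: one compound carries all of the activity."""
    return 1.0 if "c6" in fraction else 0.0


def synergistic_pair(fraction: Sequence[str]) -> float:
    """Toy assay: activity requires two compounds to be present together."""
    return 1.0 if "c2" in fraction and "c7" in fraction else 0.0


isolated = fractionate(compounds, single_active, threshold=0.5)
lost = fractionate(compounds, synergistic_pair, threshold=0.5)
print(isolated, lost)
```

In this toy setting the single active compound is recovered, whereas the synergistic pair is separated by the very first split and the search ends with nothing, even though the crude extract was fully active.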

Even if one were fortunate enough to track molecules successfully using bioactivity-guided fractionation, current structural elucidation methods limit the annotation of chemical dark matter. These methods involve isolating the desired compound at high purity and resolving its structure using nuclear magnetic resonance (NMR), with bioactivity-guided fractionation used to guide elucidation of the biological activity or function of the target compound. This process is time-consuming and expensive, and therefore impractical to apply to the millions of compounds found within complex biological mixtures.¹¹

In short, our conventional wet lab pipeline for discovering biochemicals—from extraction and bioassay to isolation, NMR, and functional testing—is inherently low-throughput. It is no wonder that chemical space remains the last great frontier; we simply have not had the tools to efficiently survey it.

Technological Bottlenecks in Metabolomics

The advent of modern metabolomics—using high-performance analytical chemistry to profile metabolites on a large scale—promised to accelerate the exploration of chemical diversity. Indeed, mass spectrometry can detect hundreds or thousands of metabolites in a single sample.¹²

In mass spectrometry, each molecule produces a characteristic spectral “fingerprint” as it fragments, offering clues to its identity. In principle, these spectra allow us to cast a much wider net for unknown compounds. In practice, however, metabolomics has revealed an ironic truth: the more data we collect, the more unknowns we encounter. Several bottlenecks have prevented us from converting the deluge of spectral data into chemical knowledge, including:

Limited reference databases: The gold-standard approach for identifying a metabolite is to match its mass spectrum against a library of spectra from known compounds. But public tandem mass spectrometry (MS/MS) reference libraries cover only a tiny fraction of chemical space—on the order of tens of thousands of compounds—mostly those that researchers have isolated and cataloged before.¹³

Everything outside this narrow library remains invisible to database matching. In untargeted metabolomics experiments, only a small fraction of detected spectra can be matched to known molecules.¹² The rest—often >80% of the spectral features in a sample—have no reference spectrum and thus remain unannotated. In other words, most peaks in a metabolomics dataset represent molecules that have never been seen before in any database.¹⁴
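To make the library-matching step concrete, the following is a minimal sketch of spectrum matching by cosine similarity. It is an assumption-laden illustration rather than the scoring used by any particular tool: the peak lists, the compound labels, the 1 Da bin width, the 0.7 score cutoff, and the function names are all hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def to_vector(peaks: List[Peak], bin_width: float = 1.0) -> Dict[int, float]:
    """Bin peaks by m/z so that two spectra can be compared bin by bin."""
    vec: Dict[int, float] = defaultdict(float)
    for mz, intensity in peaks:
        vec[round(mz / bin_width)] += intensity
    return dict(vec)


def cosine(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Cosine similarity between two binned spectra (0.0 if either is empty)."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def annotate(query: List[Peak], library: Dict[str, List[Peak]],
             min_score: float = 0.7) -> Tuple[Optional[str], float]:
    """Return the best library match, or (None, score) if nothing scores high enough."""
    query_vec = to_vector(query)
    best_name, best_score = None, 0.0
    for name, peaks in library.items():
        score = cosine(query_vec, to_vector(peaks))
        if score > best_score:
            best_name, best_score = name, score
    if best_score < min_score:
        return None, best_score  # the spectrum stays unannotated
    return best_name, best_score


# Made-up reference spectra for two hypothetical library compounds.
library = {
    "compound_A": [(195.1, 100.0), (138.1, 60.0), (110.1, 20.0)],
    "compound_B": [(301.2, 100.0), (153.0, 45.0)],
}

known_query = [(195.1, 90.0), (138.1, 70.0), (110.1, 15.0)]
unknown_query = [(412.3, 100.0), (250.2, 30.0)]

known_hit = annotate(known_query, library)
unknown_hit = annotate(unknown_query, library)
print(known_hit, unknown_hit)
```

The second query shares no peaks with either reference, so it falls below the cutoff and returns no name. This is the situation described above for the bulk of the features in an untargeted experiment.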

Data volume and complexity: A single high-resolution mass spectrometer (MS) run can produce gigabytes of data. Large-scale studies (e.g., profiling thousands of samples or environmental metagenomes) can generate millions of distinct spectra.¹⁵ Handling these massive datasets is a challenge in itself—from storing and transferring files to processing the spectra with algorithms. Standard analytical software struggles with the big data aspect of metabolomics (e.g., aligning peaks across samples or picking out significant signals). Moreover, each spectrum is a complex, high-dimensional signal—a product of a molecule’s fragmentation behavior under specific conditions. Interpreting these patterns is not trivial, especially when signals from many compounds overlap. The complexity and sheer scale of the data mean that brute-force or manual analysis approaches are futile.

Lack of annotation and context: For the vast majority of spectra that do not match any known compound, making sense of them is like reading an unknown language. Each unknown spectrum is essentially a cryptogram awaiting decryption. Without a reference or some prior knowledge (e.g., what class of molecule it might be), manual interpretation is impractical for more than a few spectra.

Chemists can sometimes propose partial structures or substructures from fragmentation patterns, but doing this for tens of thousands of unknowns in a study is impossible for most labs. This leads to a paradox: We can measure countless metabolite signals, but we cannot identify or use most of them. The result is a growing pile of spectral data rich in potential insights but effectively opaque—the dark matter of the metabolome.

These limitations have real consequences. In metabolomics studies of human health, for instance, researchers often observe biomarker signals correlated with disease—but if those metabolites cannot be identified, their biological interpretation remains a mystery. In natural product discovery, chemists may keep rediscovering the same familiar compounds because unknown ones can’t be recognized and prioritized from complex extracts. And in drug discovery, vast chemical diversity is going unmined: pharmaceutical libraries end up biased toward known scaffolds, while the long tail of rare or novel chemotypes remains untouched, possibly containing cures or chemistries we desperately need.^7,16 In short, the pace of biomedical innovation is throttled by our inability to read the molecular lexicon that nature has written.

Artificial Intelligence and Machine Learning: Navigating the Chemical Dark Matter

This is where cutting-edge innovations in artificial intelligence (AI) and machine learning come into play. Just as AI has revolutionized proteomics (think of how AlphaFold predicts protein structures), it is now beginning to tackle the interpretation of complex chemical data.^17–24 The idea is straightforward: can a computer be taught to decode mass spectra of unknown molecules, effectively translating the cryptic signals into likely chemical structures? Recent advances suggest the answer is yes.^17–24 In fact, researchers are now deploying the same technologies behind modern language translation and image recognition to “learn” the language of mass spectrometry. One promising approach uses deep learning models, in particular transformer neural networks akin to those in large language models, to recognize patterns in mass spectra. These models excel at handling high-dimensional, sequential data with intricate contextual relationships.

Mass spectrometry generates a wealth of data that reflects the fundamental chemical and physical properties of a molecule, including its mass-to-charge ratio, fragmentation patterns, and isotopic distributions. Unlike molecular sequences, which are a linear representation of atomic connectivity, mass spectra provide a high-dimensional fingerprint capturing both connectivity and fragmentation behaviors under defined conditions. These inherently rich data lend themselves to deciphering complex chemical structures that may involve nonlinear or cyclic arrangements, which are often obscured in sequence-based representations.

Transformer models, including attention-based architectures like Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT), excel at processing this type of high-dimensional, nonlinear data with intricate relationships.²⁵ Transformers, which have revolutionized natural language processing, can ingest a spectrum and tease out the correlations between fragment peaks that correspond to specific substructures, stereochemistry, or functional groups. BERT uses a training technique known as masked language modeling, where parts of a sentence are “masked” to teach the model the structure of natural language by predicting the missing elements based on the remaining context. For instance, in the sentence “The dog tried to follow the person,” the word “follow” could be masked, presenting the model with “The dog tried to [MASK] the person.” The model then learns to predict the hidden word using the contextual clues provided by the rest of the sentence (Fig. 1).¹⁹

FIG. 1. Masked language modeling.
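The masking step from the sentence example can be sketched in a few lines. This is only an illustration of how a training pair is built. The `mask_tokens` helper and the whitespace tokenization are assumptions made for brevity, and the prediction model itself is omitted.

```python
from typing import Dict, List, Sequence, Tuple

MASK = "[MASK]"


def mask_tokens(tokens: Sequence[str],
                positions: Sequence[int]) -> Tuple[List[str], Dict[int, str]]:
    """Hide the tokens at `positions` and return (masked input, hidden targets)."""
    masked = list(tokens)
    targets: Dict[int, str] = {}
    for index in positions:
        targets[index] = masked[index]
        masked[index] = MASK
    return masked, targets


# Whitespace tokenization is a simplification; real models use subword tokens.
sentence = "The dog tried to follow the person".split()
masked, targets = mask_tokens(sentence, positions=[4])

print(" ".join(masked))  # The dog tried to [MASK] the person
print(targets)           # {4: 'follow'}
```

During training, the model sees only the masked input and is scored on how well it recovers the hidden targets from the surrounding context.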

Transformers that have been tailored for tandem mass spectra essentially learn the grammar of chemistry. Early studies have demonstrated that such models can predict molecular features more accurately and comprehensively than traditional rule-based methods.¹⁹

Different research groups have taken slightly different approaches. Some have developed encoder–decoder networks that directly generate a molecule’s structure (for instance, a Simplified Molecular Input Line Entry System [SMILES] string) from its spectrum, analogous to speech-to-text translation. One such model, appropriately named Spec2Mol, uses an encoder to create a learned representation of the spectrum and a decoder to output plausible molecular structures, drawing from the knowledge of chemical syntax.²⁶ Other efforts (such as CSI:FingerID, MS2DeepScore, and MSNovelist) combine machine learning with clever heuristic or rule-based steps to expand the search for candidate structures beyond known libraries.^20,27,28

AI is enabling a leap from relying solely on known spectra to predicting the unknown. The application of these advances is profound. Imagine being able to rapidly identify the metabolites in a person’s blood sample that have no entries in any database—AI could flag a novel molecule, providing the first clues to its identity. In drug discovery, instead of randomly testing fractions, one could prioritize specific spectra that the AI predicts to have “drug-like” substructures or interesting novelty. AI models can also learn from a vast corpus of uncharacterized spectra, finding patterns across experiments and labs that humans might never notice. In essence, machine learning offers a scalable way to illuminate the chemical dark matter, turning masses of raw data into actionable hypotheses about molecules.

Bridging the Gap: A New Era of Scalable Chemical Discovery

Realizing this vision requires not just clever algorithms but also massive amounts of data and computational muscle. This is leading to the rise of foundation models for chemistry—large AI models trained on extremely broad datasets of molecular data, analogous to how GPT-4 was trained on the internet’s text (see Box 1).

Box 1. Pioneering Foundation Models for Decoding the Dark Chemical Space

A recent example is work by our company, Enveda, an AI-powered metabolomics company seeking to unlock nature’s chemistry. Enveda is leveraging AI, particularly transformer models such as those used by large language models like ChatGPT, to bridge the gap between mass spectral data and chemical knowledge.^17–19 By training AI on large datasets, Enveda’s approach essentially translates the cryptic language of mass spectra into the chemical structures they represent. This approach is akin to teaching a machine to read a foreign language, enabling it to uncover the identities and properties of previously unknown molecules.

Enveda’s most recent model is a foundation model trained on an unprecedented 1.2 billion small-molecule mass spectra.¹⁹ This model, called PRISM (Pretrained Representations Informed by Spectral Masking), represents (to our knowledge) the largest training set of tandem MS data assembled to date. Importantly, the vast majority of those 1.2 billion spectra were unannotated—no one knows what compounds produced them. By training in a self-supervised manner (learning the patterns within spectra themselves), the model effectively imbibes the latent grammar of fragmentation across an almost astronomical number of molecules.
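The same masking idea can be carried over from words to spectral peaks. The sketch below is a hypothetical illustration of self-supervised peak masking on a single unannotated spectrum. The source does not describe PRISM's actual masking scheme, so the `mask_peaks` function, the 25% mask fraction, and the example peaks are all assumptions.

```python
import random
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def mask_peaks(peaks: List[Peak], mask_fraction: float = 0.25,
               seed: int = 0) -> Tuple[List[Peak], List[Peak]]:
    """Split a spectrum into visible peaks (model input) and hidden peaks (targets)."""
    if not peaks:
        return [], []
    rng = random.Random(seed)
    n_mask = min(len(peaks), max(1, round(len(peaks) * mask_fraction)))
    hidden = set(rng.sample(range(len(peaks)), n_mask))
    visible = [peak for i, peak in enumerate(peaks) if i not in hidden]
    targets = [peak for i, peak in enumerate(peaks) if i in hidden]
    return visible, targets


# A made-up, unannotated spectrum: no compound label is needed to build the pair.
spectrum = [
    (57.1, 12.0), (85.0, 30.0), (110.1, 20.0), (138.1, 60.0),
    (153.0, 45.0), (195.1, 100.0), (250.2, 8.0), (301.2, 15.0),
]

visible, targets = mask_peaks(spectrum)
print(len(visible), len(targets))
```

Because the targets come from the spectrum itself, every raw spectrum yields a training example, which is what allows learning from data that no one has annotated.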

Enveda’s automated pipeline for mass spectral profiling has already generated hundreds of millions of MS/MS spectra, and as we scale our search for new molecules, we plan to scale the size and diversity of experimental data. Large amounts of raw training data translate into better predictive models, which in turn will enable scientists to decode the chemistry of life to find interesting new biomarkers.

AI models can be further trained on chemical and biological properties of specific interest for targeting disease etiology. Furthermore, the self-supervised learning approaches used for PRISM, like those behind large language models, allow the model to learn from abundant unlabeled mass spectrometry data.²⁹

Crucially, approaches like this break from the past reliance on only labeled data. Traditionally, machine learning in metabolomics was limited to using spectra of known compounds (e.g., to classify a spectrum or to compare against candidates). But annotated spectra represent only a tiny fraction of all available data—on the order of maybe 50,000–60,000 unique molecules in public libraries. In contrast, repositories of raw experimental data (such as Global Natural Products Social Molecular Networking [GNPS], MetaboLights, and the Metabolomics Workbench) contain hundreds of millions of spectra, reflecting a far broader swath of chemical space. By harnessing this mountain of dark data, next-generation AI models do not need to wait for a molecule to be isolated and cataloged to learn its signature—they learn directly from the uncharacterized spectra, effectively mapping the wilderness without a predefined guidebook.

Conclusion

The implications of closing the metabolomic knowledge gap are enormous. In biotechnology and pharmaceutical research, a more complete understanding of nature’s chemical playbook could open up avenues to new drug leads, enzymes, and biomolecules that humanity has never encountered. In diagnostics, comprehensively mapping the human metabolome (and how it changes in disease) could reveal novel biomarkers, potentially early warning signals or mechanistic clues for conditions like cancer, neurodegeneration, inflammation, or metabolic disorders. In agriculture and ecology, identifying the myriad chemical signals used in plant–microbe interactions or animal communication could lead to sustainable innovations (natural pest control agents, growth promoters, etc.). And in basic science, illuminating the structures and pathways of unknown metabolites will fill glaring gaps in our understanding of physiology and biochemistry, much like sequencing the human genome revealed previously “hidden” genes and regulatory elements. Recall that not long ago, vast stretches of the genome were dubbed “junk DNA” and thought to be irrelevant.³⁰ We now know those noncoding regions hold crucial regulatory roles, a testament to how dark matter in science can surprise us once illumination is possible.

Today’s chemical dark matter may likewise harbor treasures we can scarcely imagine—from new therapeutics to fundamental biological insights. The convergence of high-throughput metabolomics and AI is finally creating a path to explore this last frontier. By systematically identifying the metabolites that have eluded us, science is poised to complete the map of life’s molecular landscape.

Chemistry is steadily yielding its secrets. As we learn to read the chemical alphabet at scale, we move closer to a future where no biologically relevant molecule will lurk in the shadows, and where understanding life in its totality is a genuine possibility.

The genome may be life’s blueprint, but chemistry is its language. To understand life—and to harness its full potential—we must learn to read the messages hidden in nature’s molecules. In doing so, we will open a new chapter in science, one where the unknown chemistry of nature takes center stage in shaping the future of human health.

Author Disclosure Statement

The authors are affiliated with Enveda Therapeutics, Inc.

Funding Information

No funding was received for this article.

References

1. Lander, Linton, Birren, et al.; International Human Genome Sequencing Consortium. Initial sequencing and analysis of the human genome. Nature, 2001; 409(6822):860–921; doi: 10.1038/35057062

2. Mullard. 2024 FDA approvals. Nature, 2025; doi: 10.1038/d41573-025-00001-5

3. Baylot, Le, Taïeb, et al. Between hope and reality: Treatment of genetic diseases through nucleic acid-based drugs. Commun Biol, 2024; 7(1):489; doi: 10.1038/s42003-024-06121-9

4. Morowitz, Smith. Energy flow and the organization of life. Wiley Periodicals, Inc. Complexity, 2007; 13:51–59; doi: 10.1002/cplx.20191

5. Piovesan, Antonaros, Vitale, et al. Human protein-coding genes and gene feature statistics in 2019. BMC Res Notes, 2019; 12(1):315; doi: 10.1186/s13104-019-4343-8

6. Addicoat. Only 1% of chemicals in the universe have been discovered. Here’s how scientists are hunting for the rest. livescience.com. 2023. Available from: https://www.livescience.com/chemistry/only-1-of-chemicals-in-the-universe-have-been-discovered-heres-how-scientists-are-hunting-for-the-rest

7. Patridge, Gareiss, Kinch, et al. An analysis of FDA-approved drugs: Natural products and their derivatives. Drug Discov Today, 2016; 21(2):204–207; doi: 10.1016/j.drudis.2015.01.009

8. Zhu, Bai. Separation of biologically active compounds by membrane operations. Curr Pharm Des, 2017; 23(2):218–230; doi: 10.2174/1381612822666161027153823

9. Kellogg, Todd, Egan, et al. Biochemometrics for natural products research: Comparison of data analysis approaches and application to identification of bioactive compounds. J Nat Prod, 2016; 79(2):376–386; doi: 10.1021/acs.jnatprod.5b01014

10. Eloff. Quantification of the bioactivity of plant extracts during screening and bioassay guided fractionation. Phytomedicine, 2004; 11(4):370–371; doi: 10.1078/0944711041495218

11. Leggett, Wang, Li, et al. Identification of unknown metabolomics mixture compounds by combining NMR, MS, and cheminformatics. Methods Enzymol, 2019; 615:407–422; doi: 10.1016/bs.mie.2018.09.003

12. Hoffmann, Nothias, Ludwig, et al. High-confidence structural annotation of metabolites absent from spectral libraries. Nat Biotechnol, 2022; 40(3):411–421; doi: 10.1038/s41587-021-01045-9

13. de Jonge, Louwen JJR, Chekmeneva, et al. MS2Query: Reliable and scalable MS2 mass spectra-based analogue search. Nat Commun, 2023; 14(1):1752; doi: 10.1038/s41467-023-37446-4

14. da Silva, Dorrestein, Quinn. Illuminating the dark matter in metabolomics. Proc Natl Acad Sci U S A, 2015; 112(41):12549–12550; doi: 10.1073/pnas.1516878112

15. Corbally, Freye. How much data is too much? An analysis of the pros and cons of high-resolution mass spectral data. LCGC Supplements, 2023; 41(s5):12–14.

16. Ayon. High-Throughput screening of natural product and synthetic molecule libraries for antibacterial drug discovery. Metabolites, 2023; 13(5):625; doi: 10.3390/metabo13050625

17. Butler, Frandsen, Lightheart, et al. MS2Mol: A transformer model for illuminating dark chemical space from mass spectra. ChemRxiv, 2023; doi: 10.26434/chemrxiv-2023-vsmpx-v4. This content is a preprint and has not been peer-reviewed.

18. Voronov, Lightheart, Frandsen, et al. MS2Prop: A machine learning model that directly generates de novo predictions of drug-likeness of natural products from unannotated MS/MS spectra. bioRxiv, 2024; doi: 10.1101/2022.10.09.511482

19. Healey, Domingo-Fernández, Taylor, et al. PRISM: A foundation model for life’s chemistry. 2024. Available from: https://enveda.com/prism-a-foundation-model-for-lifes-chemistry/

20. Stravs, Dührkop, Böcker, et al. MSNovelist: De novo structure generation from mass spectra. Nat Methods, 2022; 19(7):865–870; doi: 10.1038/s41592-022-01486-3

21. van der Hooft JJJ, Wandy, Barrett, et al. Topic modeling for untargeted substructure exploration in metabolomics. Proc Natl Acad Sci U S A, 2016; 113(48):13738–13743; doi: 10.1073/pnas.1608041113

22. Huber, Ridder, Verhoeven, et al. Spec2Vec: Improved mass spectral similarity scoring through learning of structural relationships. PLoS Comput Biol, 2021; 17(2):e1008724; doi: 10.1371/journal.pcbi.1008724

23. Bushuiev, Bushuiev, Samusevich, et al. Emergence of molecular structures from self-supervised learning on mass spectra. ChemRxiv, 2023; doi: 10.26434/chemrxiv-2023-kss3r. This content is a preprint and has not been peer-reviewed.

24. Asher, Campbell, Geremia, et al. LSM1-MS2: A self-supervised foundation model for tandem mass spectrometry applications, encompassing extensive chemical property predictions and spectral matching. ChemRxiv, 2024; doi: 10.26434/chemrxiv-2024-k06gb. This content is a preprint and has not been peer-reviewed.

25. Lin, Wang, et al. A survey of transformers. AI Open, 2022; 3:111–132; doi: 10.1016/j.aiopen.2022.10.001

26. Litsa, Chenthamarakshan, Das, et al. An end-to-end deep learning framework for translating mass spectra to de-novo molecules. Commun Chem, 2023; 6(1):132; doi: 10.1038/s42004-023-00932-3

27. Huber, van der Burg, van der Hooft JJJ, et al. MS2DeepScore: A novel deep learning similarity measure to compare tandem mass spectra. J Cheminform, 2021; 13(1):84; doi: 10.1186/s13321-021-00558-4

28. Dührkop, Shen, Meusel, et al. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc Natl Acad Sci U S A, 2015; 112(41):12580–12585; doi: 10.1073/pnas.1509788112

29. Devlin, Chang, Lee, et al. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv, 2019; arXiv:1810.04805 [cs.CL].

30. Fagundes NJR, Bisso-Machado, Figueiredo PICC, et al. What we talk about when we talk about “Junk DNA”. Genome Biol Evol, 2022; 14(5):evac055; doi: 10.1093/gbe/evac055

The Great Unknown: How Chemistry Remains the Last Frontier to Understanding Life