Focusing on Data to Improve Machine Learning-Guided Antibiotic Discovery

Abstract

Machine learning (ML) is poised to accelerate antibiotic discovery by rapidly identifying and generating compounds with desirable properties. Despite focused effort, algorithmic advances alone have yielded only modest improvements in real-world performance. Greater gains will likely come from improved data acquisition, data representation, and model output interpretation by domain experts. Field-wide efforts in more standardized data curation, benchmarking, and publication practices are also essential to ensure that ML methods reach their full potential to help us efficiently discover new antibiotics to address unmet clinical needs. This review focuses on the data-centric choices necessary to build ML pipelines for antibiotic discovery that are robust, reliable, efficient, and biologically grounded.

Keywords

antibiotics, machine learning, data
