
Abstract

Statistical models that accurately predict the binding affinity of a ligand–protein pair can greatly accelerate drug discovery. Such models are trained on available ligand–protein interaction data sets, which may contain biases that lead the models to learn data set-specific, spurious patterns instead of generalizable relationships. As a result, their prediction performance drops dramatically on previously unseen biomolecules. Existing approaches to improving model generalizability either have limited applicability or risk degrading overall prediction performance. In this article, we present DebiasedDTA, a novel training framework for drug–target affinity (DTA) prediction models that addresses data set biases to improve generalizability. DebiasedDTA relies on reweighting the training samples to achieve robust generalization, and is thus applicable to most DTA prediction models. Extensive experiments with different biomolecule representations, model architectures, and data sets demonstrate that DebiasedDTA improves generalizability in predicting drug–target affinities.
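To make the idea of sample reweighting concrete, the following is a minimal, generic sketch of a weighted regression loss and a weighted least-squares fit on toy data. It is not the DebiasedDTA algorithm: the abstract does not specify how the weights are computed, so the per-sample `scores` below, the single-feature linear model, and all function names are hypothetical and chosen only for illustration.

```python
def normalize_weights(scores):
    """Turn non-negative per-sample scores into weights that sum to 1.

    The scores are a stand-in for whatever importance measure a training
    framework assigns to each sample (an assumption, not from the article).
    """
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must have a positive sum")
    return [s / total for s in scores]


def weighted_mse(preds, targets, weights):
    """Weighted mean squared error; weights are assumed to sum to 1."""
    return sum(w * (p - t) ** 2 for p, t, w in zip(preds, targets, weights))


def fit_weighted_line(xs, ys, weights):
    """Closed-form weighted least squares for y ~ a * x + b.

    Samples with larger weights pull the fitted line toward themselves.
    """
    x_bar = sum(w * x for w, x in zip(weights, xs))
    y_bar = sum(w * y for w, y in zip(weights, ys))
    cov = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    var = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    a = cov / var
    b = y_bar - a * x_bar
    return a, b


if __name__ == "__main__":
    # Toy data: one made-up feature per ligand-protein pair and a made-up affinity.
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [0.0, 1.0, 2.0, 10.0]

    uniform = normalize_weights([1, 1, 1, 1])
    emphasized = normalize_weights([1, 1, 1, 5])  # up-weight the last sample

    for name, weights in (("uniform", uniform), ("emphasized", emphasized)):
        a, b = fit_weighted_line(xs, ys, weights)
        preds = [a * x + b for x in xs]
        print(name, "slope:", a, "intercept:", b,
              "weighted MSE:", weighted_mse(preds, ys, weights))
```

Because reweighting only changes how much each training sample contributes to the loss, the same pattern carries over to any model whose training objective accepts per-sample weights, which is consistent with the abstract's claim that the approach applies to most DTA prediction models.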




A Framework for Improving the Generalizability of Drug–Target Affinity Prediction Models
