Published April 01, 2025
Drug repositioning holds great promise for reducing the time and cost associated with traditional drug discovery, but it faces significant challenges related to data imbalance and noise in negative samples. In this article, we introduce a novel method leveraging high negative oversampling (HNO) to address these challenges. Our approach integrates HNO with advanced techniques such as network-based graph mining, matrix factorization, and Bayesian inference, specifically designed for imbalanced data scenarios. Constructing high-quality negative samples is crucial to mitigate the detrimental effects of noisy negative data and enhance model performance. Experimental results demonstrate the efficacy of our approach in enhancing the performance of drug discovery models by effectively managing data imbalance and refining the selection of negative samples. This methodology provides a robust framework for improving drug repositioning, with potential applications in broader biomedical domains.
Drug-disease associations, over-sample, protein associations, imbalanced data, drug repositioning, Bayesian inference
In the context of today’s increasingly complex diseases and rapidly mutating variants posing significant risks to humans, there is an urgent need for treatment methods and drugs that can swiftly respond to these health challenges. Meanwhile, the traditional process of drug discovery and development is becoming less appealing due to its considerable costs and time requirements. 1 Bringing a new drug to market typically spans over a decade and may incur costs amounting to several billion US dollars. 2 In this context, drug repurposing, the process of finding new applications for already approved drugs, emerges as a promising strategy to optimize investment in drug development. Supported by readily available data on drug safety and efficacy, drug repurposing can significantly reduce the time and resources required compared with developing a new drug from scratch. 3 A crucial initial step in this process is the selection of a candidate drug and the identification of a new indication for it before proceeding to preclinical trials. To enhance the efficiency of this process, various computational methods have been developed, employing different strategies. Among the most common are similarity-based techniques, which infer new indications by analyzing the similarity of drugs and suggesting treatment methods for related disease conditions. Recent advancements in computational drug discovery highlight the potential of advanced machine learning techniques. For instance, TransEDRP 4 employs a dual-transformer framework to integrate chemical and pharmacological properties, achieving a 22.67% improvement in accuracy across multiple datasets.
Similarly, MilGNet, 5 using heterogeneous graph neural networks, enhances drug-disease association prediction and outperforms 10 state-of-the-art methods. These approaches have also been validated through real-world applications, such as identifying Methotrexate for mismatch repair cancer syndrome. Furthermore, Gottlieb et al introduced PREDICT, 6 a method leveraging drug-drug and disease-disease similarities based on their characteristics to uncover new drug structures and disease associations from phenotypic data.
In their latest study, Yu et al 7 employed similarity analysis to explore the relationships between drugs, based on their 5 principal attributes, as well as the similarities between diseases. This effort culminated in the development of a novel technique, layer attention graph convolutional network (LAGCN), aimed at drug repositioning. By integrating diverse data, this method not only enhances the reliability of identifying similarities between drugs and diseases but also broadens the potential to discover new links between them. However, as highlighted in prior studies, the number of drug-disease relationships experimentally confirmed to date remains limited compared to the total potential relationships. 8 In this context, clearly distinguishing between “positive” (known) and “negative” (unidentified or non-existent) drug-disease pairs emerges as a significant challenge, especially in constructing an effective “negative” dataset to improve the accuracy of machine learning models.
For instance, in the study by Li et al, 9 the number of drugs was 2593, and the number of diseases was 19 941, but known drug-disease associations represent only a very small proportion of the approximately 50 million possible drug-disease associations. In machine learning tasks, these known relationships are labeled as positive. However, pinpointing the high negative drug-disease pairs among the remaining possible pairs presents a significant challenge, as both potential positive and negative elements are hidden within them. In this study, we propose a method to efficiently construct a high negative dataset, clearly distinguishing it from the known positive elements, thereby enhancing the accuracy of the model. 3,10 In addition, a major challenge is data imbalance, as observed in the example above, where the positive elements account for only 0.34%. This imbalance affects the precision and reliability of predictions in machine learning models. Hence, this study aims to address the issue of data imbalance and offers specific solutions to improve the quality of the drug repositioning process.
In this study, we address the issues mentioned above and make the following contributions: (a) We analyze the drug repositioning problem of predicting new drug-disease relationships through Bayesian inference, the need for a robust, high-quality high negative dataset, and the data imbalance encountered in solving this problem. (b) We propose a method for constructing a robust, high-quality high negative dataset and apply imbalance-handling techniques to it. The advantages and disadvantages of current methodologies for addressing imbalanced data are investigated and assessed. (c) Through experimental results, we demonstrate that our proposed approach, which combines the creation of a high-quality negative dataset with oversampling methods, effectively mitigates data imbalance, thereby enhancing the efficiency and reliability of identifying new drug-disease relationships.
Drug repositioning, which seeks new therapeutic uses for existing drugs, offers a strategy that is both time-efficient and cost-efficient due to pre-existing knowledge of drugs’ pharmacological and safety profiles. 11 Success stories, such as sildenafil (Viagra) for erectile dysfunction and minoxidil for hair loss, are uncommon and typically accidental. 12 The process is hindered by its dependence on prior knowledge and the prohibitive cost of clinical trials, making wide-scale application challenging. Manually screening roughly fifty million potential drug-disease pairings is infeasible, highlighting the need for computational methods to discover new drug applications efficiently.
The rapid increase in drug and disease-related data, along with advancements in machine learning, has spurred the creation of diverse theoretical methods. These are designed to reveal new drug uses by identifying potential drug-disease associations. These computational strategies are primarily categorized into 3 principal approaches: drug-based, disease-based, and network-based. Each category adopts a distinct premise for association prediction, leveraging the intrinsic properties and similarities within drugs and diseases to facilitate the discovery of novel therapeutic applications.
Drug-based and disease-based methodologies operate on the foundational hypothesis that drugs with similar structural or functional traits tend to be effective against diseases sharing analogous pathogenic processes or symptoms. This concept has been supported by numerous studies, establishing a solid theoretical basis for these approaches. In this context, Wang et al 13 developed a support vector machine (SVM) model to detect potential drug-disease interactions. This model integrates a wide range of data, including molecular structure, molecular activity, and phenotype information, thereby substantially improving the accuracy of therapeutic connection predictions. Echoing this approach, Khalid and Sezerman 14 presented an integrative method that uses a similarity-based framework to predict approved and novel drug targets and their new disease associations. This method integrates protein-protein interactions (PPI), biological pathways, binding site structural similarities, and disease-disease similarity metrics to enhance prediction accuracy. Further contributing to this field, Zhang et al 15 unveiled a novel Similarity Constrained Matrix Factorization for Drug-Disease Association (SCMFDD) prediction aimed at elucidating the connections between drugs and diseases. Leveraging existing drug-disease associations along with drug characteristics and disease semantic data, the SCMFDD projects these relationships into bidimensional spaces to uncover latent features of both drugs and diseases.
The network-based approach follows the “guilt-by-association” principle: drugs that treat the same disease share structural or network properties, and diseases treated by the same drug share phenotypic or network properties. In their seminal work, Yang et al 16 constructed 3 causal networks targeting cardiovascular diseases, diabetes mellitus, and neoplasms using a causal inference-probabilistic matrix factorization (CIPMF) methodology. This approach aimed to predict and classify drug-disease associations, thereby aiding in the identification of new drug repositioning opportunities. It entailed the integration of multilevel systematic relationships between drugs and diseases from diverse databases to establish causal networks that link drug-target-pathway-gene-disease. Liu et al, 17 in a related vein, developed a heterogeneous network comprising drug-drug similarity, disease-disease similarity, and known drug-disease association networks. They introduced a novel 2-pass random walk with restart algorithm to predict novel indications for approved drugs. Zhang et al 18 further advanced the domain by formulating drug-disease associations as a bipartite network and implementing a network topological similarity-based inference method, which leverages linear neighborhood similarity to predict unobserved drug-disease associations. Building on these foundational studies, Yue et al 19 conducted a rigorous evaluation of 11 distinct graph embedding methodologies across 3 critical biomedical link prediction tasks: drug-disease association (DDA), drug-drug interaction (DDI), and PPI predictions. Their analysis extended to 2 node classification tasks, specifically the classification of medical term semantic types and the prediction of protein functions. This comprehensive assessment seeks to shed light on the efficacy and practicality of graph embedding techniques in biomedical research, setting a new benchmark for future investigations in this rapidly evolving field.
Within the current landscape of methodologies, 2 principal challenges are prevalent. First, supervised learning-based approaches typically require both positive and negative samples for training predictive models. However, because experimentally validated negative samples are scarce, these methods often assume that unknown drug-disease pairs belong to the negative class. This assumption yields an inaccurate negative sample set, as unknown pairs are in fact unlabeled and could be either positive or negative. Therefore, there is a critical need for a methodology capable of identifying robust negative elements that are clearly distinct from known drug-disease pairs. To address this first challenge, prior studies have proposed various approaches to enhance the reliability of the negative sample set. Methods like EMP-SVD 20 and TS-SVD 21 enhance negative dataset reliability by excluding pairs sharing common proteins or short-path associations within heterogeneous networks, reducing noise from arbitrary assumptions. These methods improve the quality of negative datasets and contribute to the reliability of downstream predictions.
The second challenge arises from the inherent data imbalance prevalent in drug-disease pair datasets, as is common in biological data, where the proportion of positive class elements is substantially lower than that of the negative class. This imbalance significantly affects the predictive efficacy of models. Thus, a method that effectively addresses this imbalance is essential. In this article, we propose an approach that not only constructs a robust set of negative class elements but also tackles the issue of data imbalance, thereby enhancing the predictive accuracy of the model.
Drug repositioning in Bayesian inference
In what follows, the notation
The drug-disease prediction process is modeled as graph mining on a heterogeneous network whose nodes are drugs, proteins, and diseases. Importantly, protein examination in drug repositioning should take into account how a drug interacts with a protein and how a protein is related to a disease. Diseases are often caused by mutations that affect the binding interface or lead to biochemically dysfunctional allosteric changes in proteins. 23
Considering the conventional association of a drug, a protein, and a disease, the probability of these objects
For instance, the original probability
Since the conventional association of a drug, a protein, and a disease
Clearly, such Bayesian inference allows us to include proteins in drug repositioning showing how a new drug can be proposed for a disease through data of confirmed interactions between drugs, proteins, and diseases.
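One way to read the protein-mediated inference described above is as a marginalization over proteins. The following is a minimal sketch under stated assumptions, not a reproduction of the article's equations: the matrix names `drug_protein` and `protein_disease`, the row normalization used to turn counts into conditional probabilities, and the helper `protein_mediated_score` are all hypothetical, and the sketch assumes a disease depends on a drug only through the proteins they share.

```python
import numpy as np

def row_normalize(a):
    """Turn each row of a non-negative matrix into a probability distribution."""
    a = np.asarray(a, dtype=float)
    sums = a.sum(axis=1, keepdims=True)
    # Rows without any known link stay all-zero instead of dividing by zero.
    return np.divide(a, sums, out=np.zeros_like(a), where=sums > 0)

def protein_mediated_score(drug_protein, protein_disease):
    """score[r, d] = sum over proteins p of P(p | r) * P(d | p)."""
    p_protein_given_drug = row_normalize(drug_protein)        # shape (n, k)
    p_disease_given_protein = row_normalize(protein_disease)  # shape (k, m)
    return p_protein_given_drug @ p_disease_given_protein     # shape (n, m)

# Toy data (hypothetical): 2 drugs, 3 proteins, 2 diseases.
drug_protein = np.array([[1, 1, 0],
                         [0, 0, 1]])
protein_disease = np.array([[1, 0],
                            [1, 1],
                            [0, 1]])
scores = protein_mediated_score(drug_protein, protein_disease)
print(scores)
```

In this toy case the first drug reaches the first disease through both of its proteins, so that pair receives the larger share of its probability mass.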
High negative samples and oversampling
Indeed, the learning process for drug repositioning requires training data that consists of pairs of drug
The positive drug-disease relation is identified by checking the approved indications of drugs from pharmaceutical companies:
The negative drug-disease relation covers those drug-disease pairs for which no indication information is available:
There is a specific imbalance in the training data. For any drug
A classification where the data set has skewed class proportions is called imbalanced. Classes that have a large proportion of the data set are called majority classes. Those that make up a smaller proportion are minority classes. The imbalance may cause learning issues. If the number of positive samples is too small relative to negative samples, the training process will spend most of its time on negative samples and positive samples are not learned enough. The above issue leads us to apply a solution as follows.
The finding that the prior probability of protein
Similarly, there is another way to identify the prior probability of protein
Hence, the prior probabilities are joined in a general view from the drug prospect as well as from the disease prospect. This produces the final probability for protein
For improving the quality of learning, the negative samples for training can be selected from the set of samples where their probability
The training set of negative samples by equation (12) is likely stronger than the original one due to the sample selection already described:
We have thus shown that training data with high-potential negative samples can be constructed selectively by evaluating the prior probability for each protein, which completes the data preparation task with the following set of samples for training:
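The selection step can be sketched as a ranking of unlabeled pairs. This is only an illustration under assumptions: it supposes that each drug-disease pair already has a protein-based score, that a lower score means less evidence of association, and that the pairs with the lowest scores are kept as high negative samples. The article's own selection rule in its equations is not reproduced here, and the function name `select_high_negatives` is hypothetical.

```python
import numpy as np

def select_high_negatives(scores, known_positive, n_negatives):
    """Return (drug, disease) index pairs of the lowest-scoring unlabeled pairs."""
    scores = np.asarray(scores, dtype=float)
    known_positive = np.asarray(known_positive, dtype=bool)
    # Only unlabeled pairs are candidates; known positives are never selected.
    candidates = np.argwhere(~known_positive)
    # Boolean indexing walks the matrix in the same row-major order as argwhere.
    candidate_scores = scores[~known_positive]
    order = np.argsort(candidate_scores, kind="stable")  # least evidence first
    return candidates[order[:n_negatives]]

# Toy data (hypothetical): 2 drugs x 2 diseases.
scores = np.array([[0.75, 0.25],
                   [0.00, 1.00]])
known_positive = np.array([[True, False],
                           [False, True]])
negatives = select_high_negatives(scores, known_positive, n_negatives=1)
print(negatives.tolist())
```

Ranking rather than thresholding makes the number of selected negatives explicit, which is convenient when the negative set is later balanced against the positives.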
Prediction in drug repositioning
We have presented the results of training data performed as a part of a balancing number of samples for classes in relation to drug and disease
The Bayesian inference (equation (4)), therefore, shows a prediction for interaction between a drug
There is thus a Bayesian inference from the above training data, although deeper, using reference of protein in probabilities of
After passing through a relation of drug and protein
With appropriate inference through drug-disease-drug and again drug-disease, the probability
It is then necessary to account for the protein with the disease in relation
The aim of learning is to maximize belief of reasoning and so get the approximate posterior as close as possible to the true posterior.
To obtain a final prediction for test data with pairs of (
During the experimental phase, we strictly followed the described method. Initially, we introduced and conducted a detailed analysis of the dataset used in our research to better understand the interrelationships among its components, particularly focusing on drug-disease and protein associations. We then demonstrated that drug discovery using high negative sampling combined with oversampling techniques such as Gaussian-synthetic minority oversampling technique (SMOTE) 24 yields consistent and reliable results.
Data analysis
Data integrity plays a critical role and is rigorously evaluated before being used in experimental design setups. In this research, we used 2 datasets: the dataset introduced by Wu et al 20 and the B-dataset introduced by Zhang et al 15 and Zhao et al. 25 Wu et al 20 constructed their dataset from 3 components: disease-protein, drug-protein, and drug-disease associations. Specifically, the disease-protein data was extracted from OMIM, 26 the drug-protein data was sourced from DrugBank, 27 and the drug-disease data was retrieved from Gottlieb’s research. 28 Table 1 provides an overview of the disease-protein data sources, illustrating the relationships between 449 diseases and 1467 proteins. Notably, there were 1365 verified disease-protein interactions (considered positive samples), compared with 657 318 unverified interactions (negative samples), resulting in a positive-to-negative sample ratio of 0.207%. Similarly, the drug-protein data included 1186 drugs and 1467 proteins, with 4642 positive samples and 1 735 220 unverified samples, translating to a ratio of 0.267%. The drug-disease data encompassed 1186 drugs and 449 diseases, with 1827 positive samples versus 530 687 unverified samples, giving a ratio of 0.344%. In this study, we proceeded to select high-probability negative samples using the formulas previously analyzed. 29
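The unverified counts and ratios for the Wu et al dataset can be checked from the matrix dimensions alone: the number of unverified pairs is the number of rows times the number of columns minus the number of positives. The short sketch below performs that arithmetic; the function name `imbalance` is hypothetical.

```python
def imbalance(n_rows, n_cols, n_positive):
    """Return (unverified pair count, positive-to-unverified ratio in percent)."""
    unverified = n_rows * n_cols - n_positive
    return unverified, 100.0 * n_positive / unverified

# (rows, columns, verified positives) as stated in the text.
tables = {
    "disease-protein": (449, 1467, 1365),
    "drug-protein": (1186, 1467, 4642),
    "drug-disease": (1186, 449, 1827),
}

results = {name: imbalance(*dims) for name, dims in tables.items()}
for name, (unverified, ratio) in results.items():
    print(f"{name}: {unverified} unverified pairs, ratio {ratio:.4f}%")
```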
The B-dataset includes 269 drugs, 598 diseases, and 1021 proteins. It contains 18 416 drug-disease associations, 3110 drug-protein associations, and 5898 disease-protein associations, as detailed in Table 2. Within this dataset, we constructed positive and negative samples in a manner similar to the one previously introduced. In this article, the matrix
Here, n is the number of drugs, m is the number of diseases, and k is the number of proteins. As introduced earlier, the positive samples in the experimental dataset were identified using equation (12), while the status of the remaining samples remained undetermined: they could be either positive or negative.
According to equations (9) to (13), we extract high negative samples (HNS) using Algorithm 1. Ultimately, a new dataset was generated by combining the positive samples with these high-quality negative samples according to equation (15); this new dataset is called HNdataset. In addition, we constructed another dataset, called FNdataset, based on the method for filtering negative samples (FNS) introduced by Wu et al. 20
Drug repositioning with heterogeneous network
A heterogeneous drug-protein-disease network was constructed, where each node represents a drug
Extraction of drug-disease features with singular value decomposition
The drug-disease matrices
where U contains the left singular vectors.
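As an illustration of how singular value decomposition can turn an association matrix into low-dimensional features, the sketch below uses `numpy.linalg.svd`. Several choices are assumptions rather than details taken from the article: the rank, the scaling of the singular vectors by the square root of the singular values, the concatenation used to build a pair feature, and the names `svd_features` and `pair_feature`.

```python
import numpy as np

def svd_features(assoc, rank):
    """Return rank-dimensional row and column features of an association matrix."""
    u, s, vt = np.linalg.svd(np.asarray(assoc, dtype=float), full_matrices=False)
    scale = np.sqrt(s[:rank])
    row_features = u[:, :rank] * scale     # left singular vectors, one row per drug
    col_features = vt[:rank, :].T * scale  # right singular vectors, one row per disease
    return row_features, col_features

def pair_feature(row_features, col_features, i, j):
    """Feature vector of pair (i, j): row feature followed by column feature."""
    return np.concatenate([row_features[i], col_features[j]])

# Toy association matrix (hypothetical): 3 drugs x 2 diseases.
assoc = np.array([[1, 0],
                  [0, 1],
                  [1, 1]])
drug_feat, disease_feat = svd_features(assoc, rank=2)
feature = pair_feature(drug_feat, disease_feat, 2, 0)
```

With this scaling, the product of the row and column features reproduces the best rank-limited approximation of the matrix, so a smaller rank trades reconstruction accuracy for more compact features.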
High negative samples and oversampling
To evaluate and compare efficiency, we categorized the selected methods into 3 groups. The first group includes under-sampling data balancing algorithms. These are SPY, 30 NearMiss, 31 TomekLinks, 32 RandomUnderSampler, OneSidedSelection, 33 and NeighbourhoodCleaningRule. 33 The second group consists of oversampling data balancing methods. These methods are SMOTE, 34 Borderline-SMOTE, 35 Clustering-based Under-sampling and Over-sampling Using Synthetic Minority Over-sampling Technique (CURE-SMOTE), 36 SMOTE-TomekLinks, 37 Automated Noise Detection Synthetic Minority Over-sampling Technique (AND-SMOTE), 38 SMOTE-D, 39 Random-SMOTE, 40 Kmean-SMOTE, 41 Gaussian-SMOTE, 24 and SMOTE-WB. 42 The third group comprises techniques from previous research for data balancing prior to machine learning. We implemented these methods on 2 standardized datasets, HNdataset and FNdataset. All experiments were conducted under identical conditions to ensure a fair comparison.
The synthetic minority oversampling technique (SMOTE), widely recognized for its ability to generate synthetic samples for the minority class by creating new data along the line connecting a minority class instance and a certain number of its same-class neighbors, has shown substantial potential in mitigating data imbalance issues. Further advancing this, the synthetic minority oversampling technique with boosting and noise detection (SMOTE-WB) represents a hybrid approach that combines SMOTE and random oversampling (ROS), incorporating additional boosting and noise detection techniques. The objective of this combination is to enhance the efficacy of synthetic sample creation, thereby improving the accuracy of classification models. The boosting technique amplifies the features of synthetic data samples by focusing on difficult-to-classify cases, while noise detection minimizes the impact of noisy data on the training process, ensuring the high quality of synthetic samples.
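The interpolation idea behind SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the basic technique only, not the `smote_variants` implementation used in the experiments and not Gaussian-SMOTE or SMOTE-WB; the function name `smote_sketch` and its parameters are hypothetical, and at least 2 minority samples are assumed.

```python
import numpy as np

def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points on segments between minority neighbors."""
    rng = np.random.default_rng(seed)
    x = np.asarray(minority, dtype=float)
    k = min(k, len(x) - 1)  # cannot have more neighbors than other samples
    # Pairwise Euclidean distances; a sample is never its own neighbor.
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, len(x), size=n_new)                # seed samples
    picked = neighbors[base, rng.integers(0, k, size=n_new)]  # one neighbor each
    gap = rng.random((n_new, 1))                              # position on the segment
    return x[base] + gap * (x[picked] - x[base])

# Toy minority class (hypothetical): 4 points in 2 dimensions.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(minority, n_new=10, k=2)
```

Because every synthetic point is a convex combination of 2 real minority samples, it always lies inside the bounding box of the minority class, which is also why plain SMOTE can amplify noisy minority samples.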
Performance analysis
In the evaluation of machine learning models, a variety of metrics are used to assess performance accurately. Selecting metrics that suit each model and the characteristics of each dataset is crucial. In scenarios involving severely imbalanced datasets, sensitivity (SE) and specificity (SP) are frequently used metrics; see equations (27) and (28) in Appendix 1. Kubat and Matwin 33 introduced the geometric mean (G-Mean) to assess machine learning models on imbalanced data (equation (29)). In addition, metrics such as accuracy (ACC) (equation (30)), recall (REC) (equation (31)), precision (PRE) (equation (32)), F1-score (equation (33)), Matthews correlation coefficient (MCC) (equation (34)), area under the curve (AUC), and area under the precision-recall curve (AUPR, also reported as PR-AUC) are used to evaluate and compare the effectiveness of methodologies against recent research.
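The threshold-based metrics listed above can all be computed from the 4 entries of a confusion matrix. The sketch below uses the standard textbook definitions, which may differ in notation from the equations in Appendix 1; the function name `confusion_metrics` and the example counts are hypothetical, and the code assumes each class has at least one sample and at least one sample is predicted positive.

```python
from math import sqrt

def confusion_metrics(tp, fp, tn, fn):
    """Compute common imbalance-aware metrics from confusion-matrix counts."""
    se = tp / (tp + fn)   # sensitivity, identical to recall
    sp = tn / (tn + fp)   # specificity
    pre = tp / (tp + fp)  # precision
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * pre * se / (pre + se)
    g_mean = sqrt(se * sp)
    mcc_den = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else 0.0
    return {"SE": se, "SP": sp, "PRE": pre, "REC": se, "ACC": acc,
            "F1": f1, "G-Mean": g_mean, "MCC": mcc}

# Illustrative counts only: 100 positives and 900 negatives.
metrics = confusion_metrics(tp=80, fp=10, tn=890, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this example accuracy is 0.97 while the F1-score is about 0.84, which shows why accuracy alone is a weak indicator on imbalanced data.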
All experiments were conducted on a system running Microsoft Windows 11 Pro (Build 22631) with an Intel Core i5-12400 processor and 16 GB of DDR4 RAM. The experiments used Python 3.11.5 and Scikit-learn 1.3.0 for machine learning model development and evaluation. The smote_variants library 43 was employed to implement various oversampling techniques for handling imbalanced datasets. All oversampling methods were applied using the library’s default parameter settings, as these configurations are well-documented and have been validated in prior studies. This choice ensures consistency and reproducibility across experiments while focusing on the evaluation of the proposed method.
First, we explored the undersampling techniques, applying them to both the HNdataset and the FNdataset. To ensure an objective and reliable evaluation of the model’s performance, we used a 5-fold cross-validation framework. By treating all drug-disease relationships in the test fold as unknown during training, we ensured complete independence between the training and testing processes. This practice mimics real-world scenarios, where the model is expected to predict novel drug-disease interactions without prior knowledge. We used 3 model evaluation metrics: F1, G-Mean, and PR-AUC. The results for each metric, corresponding to each undersampling method on each dataset, are detailed in Table 3 and illustrated in Figure 1.
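Treating test relationships as unknown during training can be sketched as masking the held-out pairs in the association matrix before any features are derived from it. The generator below is a simplified, non-stratified illustration; the name `masked_folds`, the random fold assignment, and the toy data are assumptions and do not describe the article's actual pipeline.

```python
import numpy as np

def masked_folds(assoc, pairs, n_splits=5, seed=0):
    """Yield (train_idx, test_idx, train_assoc) with test-fold pairs hidden."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(pairs)), n_splits)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        train_assoc = assoc.copy()
        # Held-out pairs are set to 0 so they look unknown during training.
        train_assoc[pairs[test_idx, 0], pairs[test_idx, 1]] = 0
        yield train_idx, test_idx, train_assoc

# Toy data (hypothetical): 2 drugs x 3 diseases, every pair is a sample.
assoc = np.array([[1, 0, 1],
                  [0, 1, 0]])
pairs = np.argwhere(np.ones_like(assoc))
folds = list(masked_folds(assoc, pairs, n_splits=3))
```

Features and any oversampling would then be computed from `train_assoc` and the training pairs only, so no information about the test fold leaks into the model.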
Overall, superior results were observed with the HNdataset. Specifically, the TomekLinks method on the HNdataset produced an F1 score (equation (33)) of 83.98% and a PR-AUC of 87.90%, while the OneSidedSelection method on the HNdataset achieved a G-Mean of 89.11%.
Figure 1 displays the area beneath the PRE and REC curves for both datasets, with Figure 1A for the FNdataset and Figure 1B for the HNdataset. The PR-AUC values from these curves underscore the model’s high precision and recall levels for most methods applied to the HNdataset, indicating a robust capability to accurately identify positive cases across a broad range of thresholds, especially within the context of the HNdataset.
Figure 2 displays the area beneath the Precision-Recall (PR) curve for both datasets, with Figure 2A for the FNdataset and Figure 2B for the HNdataset. The PR-AUC values from these curves underscore the model’s high precision and recall levels for most methods applied to the HNdataset, indicating a robust capability to accurately identify positive cases across a broad range of thresholds, especially within the context of the HNdataset.
Following that, we examined the HNdataset and FNdataset using oversampling approaches derived from SMOTE variants. The results of these experiments are compiled in Table 4 and depicted in Figure 3. It is clear that all evaluated performance metrics for the HNdataset significantly surpassed those for the FNdataset.
In particular, the CURE-SMOTE method on the HNdataset achieved notable results, with an F1 score of 84.53% and a PR-AUC of 88.36%. Concurrently, the SMOTE-WB 42 technique on the same dataset produced the highest G-Mean of 89.70% and a competitive PR-AUC of 87.67%. As shown in Tables 3 and 4, the HNdataset consistently outperformed the FNdataset across all metrics. These results highlight the effectiveness of oversampling techniques, particularly SMOTE-WB, 42 in handling imbalanced datasets and improving model performance.
Figure 4 illustrates the area under the Precision-Recall (PR) curve for both datasets, with Figure 4A representing the FNdataset and Figure 4B depicting the HNdataset within the oversampling approach derived from a SMOTE variant. The derived PR-AUC metrics from these curves highlight the high levels of precision and recall achieved by most methods when applied to the HNdataset. This suggests a strong ability of the model to reliably identify positive instances over a diverse range of thresholds, particularly in the case of the HNdataset.
To evaluate the performance of the proposed methods, we compared 4 primary configurations: (1) Original, using the raw dataset without applying High Negative Dataset or Gaussian-SMOTE; (2) Gaussian-SMOTE, applying the Gaussian-SMOTE technique without using High Negative Dataset; (3) High Negative, using the High Negative Dataset without applying Gaussian-SMOTE; and (4) Our Method, which combines both High Negative Dataset and Gaussian-SMOTE. All experiments were conducted using a 5-fold cross-validation procedure to ensure consistency and fairness in evaluation. The results indicate that our method outperforms all baselines across the 3 evaluation metrics, achieving an F1-score of 85.09%, a G-mean of 89.80%, and a PR-AUC of 88.39% (Table 5).
In addition, we conducted a 2-sample t test to assess the statistical significance of the improvements. The results (Table 6) show that our method significantly outperforms Gaussian-SMOTE across all metrics with P-values <.0002. When compared with the High Negative Dataset, our method demonstrates statistically significant improvements with P-values <.05 for F1-score and PR-AUC, and a P-value of .000313 for G-mean. These findings confirm that the combination of High Negative Dataset and Gaussian-SMOTE provides substantial advantages in improving classification performance, particularly for imbalanced datasets.
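For readers who want to see what a 2-sample t test computes, the sketch below derives the test statistic and degrees of freedom with the standard library only. It assumes Welch's unequal-variance form, which the article does not specify, and the input samples are toy numbers chosen for easy arithmetic, not results from the experiments; the function name `welch_t` is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(sample_a, sample_b):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    n_a, n_b = len(sample_a), len(sample_b)
    se_a = variance(sample_a) / n_a  # squared standard error of each mean
    se_b = variance(sample_b) / n_b
    t_stat = (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    dof = (se_a + se_b) ** 2 / (se_a ** 2 / (n_a - 1) + se_b ** 2 / (n_b - 1))
    return t_stat, dof

# Toy samples (not experimental results): 5 values per group.
t_stat, dof = welch_t([1, 2, 3, 4, 5], [2, 3, 4, 5, 6])
print(t_stat, dof)
```

Converting the statistic to a P-value requires the t distribution; with SciPy installed, `scipy.stats.ttest_ind(a, b, equal_var=False)` performs the same test and returns the P-value directly.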
Our proposed method is designed to leverage the strengths of both the High Negative Dataset and Gaussian-SMOTE. The High Negative Dataset enriches the model’s ability to discriminate by providing high-quality negative samples, while Gaussian-SMOTE generates synthetic samples from the minority class, reducing data imbalance and enhancing generalization. Experimental results demonstrate that this combination not only improves performance but also yields a more stable and reliable model compared with the baseline, as evidenced by its superiority across the 3 key metrics: F1-score, G-mean, and PR-AUC.
These findings emphasize the effectiveness of oversampling methods, particularly SMOTE variants, in enhancing model performance on imbalanced datasets. Our testing showed that, across all methods, balancing the data in combination with the proposed negative sampling method consistently yields superior performance, notably improving the F1, G-Mean, and PR-AUC metrics.
In this section, we evaluate the performance of our method relative to 7 prominent studies previously introduced, each designed to predict drug-disease associations (DDAs) using heterogeneous networks. Here is a brief overview of each methodology.
Deep drug repositioning (deepDR) 44 employs a network-based deep learning framework to repurpose drugs in silico by integrating 10 related networks and using a multi-modal deep autoencoder to learn and transform drug features into a lower-dimensional representation. A variational autoencoder encodes and decodes these features along with clinically reported drug-disease pairs to predict new applications for approved drugs. The drug-disease associations by using geometric deep learning (DDAGDL) 25 framework applies geometric deep learning to a heterogeneous information network to predict drug-disease associations, incorporating biological data and an attention mechanism to effectively manage the non-Euclidean structure of biomedical networks. Heterogeneous information network graph representation learning (HINGRL) 45 leverages a heterogeneous information network that integrates biological knowledge with drug-disease, drug-protein, and protein-disease relationships, employing graph representation learning and a Random Forest classifier to enhance drug repositioning. HNet-DNN 46 proposes a deep neural network approach using a drug-disease heterogeneous network that constructs drug-drug and disease-disease similarity networks, integrates them with existing drug-disease associations, extracts topological features, and uses them to train the DNN.
The drug repositioning approach based on weighted bilinear neural collaborative filtering (DRWBNCF) 47 employs a deep learning model that integrates various similarity networks and uses a novel weighted bilinear graph convolution technique along with a multilayer perceptron optimized by specific loss functions and graph regularization to enhance drug repositioning by predicting new drug-disease relationships. Drug Repositioning based on the Heterogeneous information fusion graph convolutional network (DRHGCN) 48 uses graph convolutional networks to analyze and integrate data from drug-drug, disease-disease, and drug-disease association networks, employing a layer attention mechanism to refine features from multiple network layers. Attention-aware multi-modal fusion using a dual-graph transformer (AMDDT) 49 is based on dual-graph transformer modules, leveraging advanced graph neural networks for predicting drug-disease associations.
In this study, we applied 5-fold cross-validation consistently across all experiments, including those that used the high negative sampling (HNS) and full negative sampling (FNS) techniques. The comparison of model performance with HNS and FNS is presented in the last two rows of Table 7. The results show that the HNS strategy outperforms FNS on every performance metric. Specifically, AUPR improved from 0.892 to 0.915, AUC from 0.959 to 0.966, PRE from 0.835 to 0.862, REC from 0.843 to 0.851, ACC from 0.932 to 0.938, MCC from 0.793 to 0.817, and the F1-score from 0.835 to 0.856. These consistent improvements underscore the effectiveness of the HNS strategy in leveraging informative negative samples to enhance model training quality.
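For readers less familiar with the threshold-based metrics reported here (PRE, REC, ACC, MCC, and F1-score), the following is a minimal sketch of how they follow from a confusion matrix. It is not the authors' evaluation code: the function names and the toy labels are assumptions made for illustration, and the ranking-based metrics (AUPR and AUC) are not covered.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = association, 0 = negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def threshold_metrics(y_true, y_pred):
    """Return PRE, REC, ACC, F1 and MCC for one set of binary predictions."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    # MCC uses all four cells, so it stays informative under class imbalance.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"PRE": pre, "REC": rec, "ACC": acc, "F1": f1, "MCC": mcc}


# Toy labels (made up for illustration): 3 positives, 5 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = threshold_metrics(y_true, y_pred)
print(metrics)
```

Under 5-fold cross-validation, such a function would typically be evaluated once per held-out fold and the five results averaged.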
Furthermore, compared with prior studies, our method consistently demonstrated superior performance, as shown in Table 7. Notable results achieved by our approach include AUPR: 0.980, AUC: 0.983, REC: 0.940, ACC: 0.946, MCC: 0.890, and F1-score: 0.935. These values surpass those of all compared approaches, supporting the predictive capability of our model when employing the HNS strategy for drug-disease interaction prediction. We have uploaded the complete implementation, along with a detailed README file, to GitHub: https://github.com/hunglm11/BI-DD-HNSO.
These findings not only confirm the effectiveness of the HNS strategy but also highlight its potential applications in other complex predictive tasks. This approach maximizes the extraction of valuable information from potential negative samples, enhancing both the accuracy and reliability of the model. The integration of an advanced model with an efficient negative sampling strategy offers promising avenues for drug-disease interaction research, contributing to the advancement of prediction methodologies in this domain.
The strong performance of our method can largely be attributed to combining an oversampling technique with the high-negative sampling approach. This strategy improves the accuracy of the model by preventing the trained classifier from overfitting to the majority class in imbalanced datasets, a common challenge in medical data analysis. It also helps capture rare but medically significant patterns within the data, contributing to the model’s high recall and precision. As a result, our method performs well not only in balanced scenarios but also under the data sparsity and imbalance that characterize drug-disease association prediction.
The drug-disease relationships extracted from DrugBank comprise 1187 positive samples, supplemented by 530 687 unknown drug-disease pairs categorized as high-negative samples. To balance the dataset for model training, the SMOTE-WB technique was employed. The model was then trained on these unknown pairs to predict potential drug-disease relationships. The results were rigorously validated through a review of credible biomedical literature.
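To make the idea of oversampling the minority (positive) class concrete, the sketch below shows plain SMOTE-style interpolation: each synthetic sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is a simplified stand-in and not SMOTE-WB, which adds noise handling and boosting on top of this basic step; the function name, parameters, and toy feature vectors are assumptions for illustration only.

```python
import random


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation.

    minority: list of feature vectors (lists of floats), at least two of them.
    k: number of nearest minority neighbours considered for each base sample.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base` (squared Euclidean distance),
        # excluding the base sample itself.
        neighbours = sorted(
            (x for x in minority if x is not base),
            key=lambda x: sum((a - b) ** 2 for a, b in zip(base, x)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic


# Toy positive-class feature vectors (made up for illustration).
positives = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote_like_oversample(positives, n_new=6, k=2)
print(len(new_samples), "synthetic samples")
```

Because every synthetic point is a convex combination of two real minority samples, it always stays within the region those samples span; in practice the oversampling would be applied only to the training folds, not to held-out data.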
The top 20 predictions from the model are presented in Table 8, with 12 of these predictions being substantiated by authoritative biomedical reports. Notably, Levobunolol has been shown to be effective in treating various forms of glaucoma, including Primary Open Angle Glaucoma (Disease ID: 137760) and Glaucoma 1, Primary Open Angle (Disease ID: 601682). Gliclazide has demonstrated efficacy against several types of Maturity-Onset Diabetes of the Young (MODY), such as MODY3 (Disease ID: 600496), MODY1 (Disease ID: 125850), and MODY2 (Disease ID: 125851). In addition, the relationships of Triamcinolone and Prednisone with autoimmune and inflammatory disorders like Multiple Sclerosis (Disease ID: 126200) and Otitis Media (Disease ID: 166760), as well as Estradiol’s connection to Hereditary Prostate Cancer type 1 (Disease ID: 601518), highlight the intricate interactions between drug mechanisms and disease pathologies. The use of Betamethasone and Hydrocortisone in Sarcoidosis (Disease ID: 181000), Cimetidine in Helicobacter Pylori infections (Disease ID: 600263), and Daunorubicin in Classic Hodgkin Lymphoma (Disease ID: 236000) further illustrates the significant role these drugs play in specific disease management strategies. Although some predicted drug-disease associations lack direct documentary evidence, these findings open new directions for clinical trials and research, potentially reducing both the time and cost of drug development.
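Selecting a top-k list of candidate associations amounts to ranking the unknown drug-disease pairs by their predicted score. The following is a minimal sketch of that step, assuming the scores are held in a dictionary keyed by (drug, disease); the function name, the placeholder identifiers, and the score values are all hypothetical and do not correspond to entries in Table 8.

```python
import heapq


def top_k_pairs(scores, k=20):
    """Return the k highest-scoring (pair, score) items, best first.

    scores: dict mapping (drug, disease) -> predicted association score.
    """
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Placeholder pairs and scores, made up purely for illustration.
toy_scores = {
    ("drug_A", "disease_1"): 0.42,
    ("drug_A", "disease_2"): 0.77,
    ("drug_B", "disease_1"): 0.91,
    ("drug_C", "disease_3"): 0.13,
}
for (drug, disease), score in top_k_pairs(toy_scores, k=2):
    print(drug, disease, score)
```

The ranked candidates would then be checked against the literature, as described above.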
These findings not only validate the predictive model but also lay the groundwork for further clinical research and the development of personalized medical treatments, thereby enhancing the accuracy and efficiency of disease management protocols.
In this study, we proposed a novel approach by combining data balancing techniques with the high-confidence negative sample selection method to predict the relationship between drugs and diseases. Our findings demonstrated promising results, indicating the efficacy of this approach in enhancing the reliability of predictive models. First, we employed Bayesian theory to analyze the relationships between drugs and diseases, constructing a heterogeneous drug-protein-disease network. From this, we extracted richly informative drug-disease feature vectors. Second, we demonstrated that our method of selecting high-confidence negative samples effectively eliminated unreliable negative instances, contributing significantly to the overall improvement of model performance. This enhancement not only increased the accuracy of predictions but also fostered trust in the model’s outputs. Furthermore, by recommending certain drugs for diseases, some of which have been scientifically validated through clinical trials and are currently employed in treatment regimens, we have bolstered the credibility of our model’s predictions.
Looking ahead, we aim to continue refining our method for selecting reliable samples and enhancing data balancing techniques to further improve the efficiency of our model. This ongoing pursuit of refinement will ensure that our predictive model remains at the forefront of precision medicine, offering valuable insights into drug-disease relationships for clinical decision-making and therapeutic advancements.
We acknowledge support from the Electric Power University and Academy of Policy and Development.
Manh Hung Le, Nam Anh Dao, Xuan Tho Dang
Bioinformatics and Biology Insights
Vol 2025
10.1177/11779322251328269