Abstract
BACKGROUND:
Drug repositioning (DR) refers to a method used to find new targets for existing drugs. This method can effectively reduce the development cost of drugs, save time on drug development, and reduce the risks of drug design. The traditional experimental methods related to DR are time-consuming, expensive, and have a high failure rate. Several computational methods have been developed with the increase in data volume and computing power. In the last decade, matrix factorization (MF) methods have been widely used in DR issues. However, these methods still have some challenges. (1) The model easily falls into a bad local optimal solution due to the high noise and high missing rate in the data. (2) Single similarity information makes the learning power of the model insufficient in terms of identifying the potential associations accurately.
OBJECTIVE:
We proposed self-paced learning with dual similarity information and MF (SPLDMF), which introduced the self-paced learning method and more information related to drugs and targets into the model to improve prediction performance.
METHODS:
First, integrating self-paced learning effectively alleviated the model’s tendency to fall into a bad local optimal solution caused by the high noise and high missing rate in the data. Then, we incorporated more similarity data into the model to improve its capacity for learning.
RESULTS:
Our model achieved the best results on each dataset tested. For example, the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPR) of SPLDMF were 0.982 and 0.815, respectively, outperforming the state-of-the-art methods.
CONCLUSION:
The experimental results on five benchmark datasets and two extended datasets demonstrated the effectiveness of our approach in predicting drug-target interactions.
Keywords
Introduction
Predicting drug-target interaction (DTI) is a crucial phase in drug discovery (DD) [1] and drug repositioning (DR) [2] for discovering novel targets of existing drugs [3, 4, 5]. The traditional methods for new DD are time-consuming and have a high failure rate, which makes traditional new drug development a poor choice [3, 6]. Various computational prediction methods have been proposed in recent years to improve the efficiency of new drug research and discovery and to reduce expenditure to a certain extent. According to previous works [7, 8, 9], the current methods are mainly categorized into three groups [10, 11, 12, 13, 14, 15, 16, 17]: (1) molecular docking (MD) methods, (2) ligand-based methods, and (3) chemical genomics methods.
The MD methods involve simulation experiments based on the 3D structures of the drug and the protein [11, 18]. However, simulating the 3D structures of massive numbers of ligands and targets, and running the corresponding MD-based calculations, requires a lot of time and computing equipment [19, 20]. The ligand-based methods assume that similar drugs have similar functional properties and may therefore share targets. They predict the drug target using ligand similarity. However, this approach cannot predict targets that have no known ligands. In addition, errors in chemical structure and physiological effects beyond structural relationships (e.g., the metabolites may be active molecules) may limit its use in drug repurposing. The chemical genomics method facilitates rapid and large-scale DTI predictions to generate drug candidates and targets, making it the most efficient method in drug research [21, 22]. With the continuous increase in drug-related data and the launch of a large number of databases, such as DrugBank [23], KEGG [24], PubChem [25], BRENDA [26], and SuperTarget [27], adopting this method for DTI prediction has become a prominent research topic.
Recently, chemical genomics-based computational approaches for DTI prediction have advanced rapidly. They are mainly categorized into three groups: classification-based methods, network diffusion (network propagation) methods, and matrix factorization (MF) methods. The classification-based methods treat a DR prediction task as a binary classification task that decides whether an association exists between a drug and a target. These methods have not yet been validated by wet-lab experiments. In 2008, Yamanishi et al. [28] established a bipartite network technique that combines chemical and genomic spaces to predict DTIs for four target classes: G protein-coupled receptors (GPCRs), nuclear receptors (NR), ion channels (IC), and enzymes (E). Yamanishi’s dataset [28] is regarded as the gold standard by many researchers; several newly developed algorithms based on it have displayed better performance. Based on this benchmark dataset, Bleakley et al. [29] suggested a novel supervised inference method for predicting unknown DTIs, namely, a kernel-based support vector machine (KN-SVM) model.
In recent years, MF methods have been widely used in many DR prediction works; they approximate the interaction matrix by the product of two low-rank matrices. Liu et al. [30] proposed a neighborhood regularized logistic MF model. Hao et al. [31] designed a logistic MF based on a dual network (DNILMF) approach to predict DTIs. Yang et al. [32] applied the nonlinear MF technique and the negative sampling technique to DR prediction. SPLCMF, a collaborative MF method combined with self-paced learning (SPL), is an efficient DTI prediction method proposed by Xia et al. [33]. Yang et al. [34] developed an MF method based on multi-similarities bilinear MF for DR prediction. Ding et al. [35] developed a multiple kernel-based triple collaborative MF method to predict DTIs. Wang et al. [36] used a neighborhood regularized logistic MF method based on extracted features from a neural tangent kernel to predict DTIs. These previous studies showed the feasibility of MF in DR prediction tasks, but two challenges remain. (1) The model easily falls into a bad local optimal solution due to the high noise and high missing rate in the data. (2) Single similarity information makes the learning power of the model insufficient in terms of identifying the potential associations accurately.
To cope with the aforementioned challenges, we propose a model named Self-Paced Learning with Dual similarity information and Matrix Factorization (SPLDMF), which integrates the self-paced learning method into MF. Furthermore, more similarity information related to drugs and targets is integrated into the model to improve the prediction performance. First, many previous works demonstrate that SPL can relieve the problem of bad local optima, especially when data are sparse [37, 38]. Inspired by the human learning process, the core idea of SPL is to automatically include more samples from simple to complex for training in a purely self-paced manner. Thus, we improve MF with the SPL mechanism to adapt it to data with high noise and a high missing rate. Then, the SPLDMF method also incorporates more data into the model to improve its capacity for learning, so that it can predict potential relationships more accurately. Experimental results on five benchmark datasets and two extended datasets demonstrate the effectiveness of our approach in predicting drug-target interactions. Our model obtains the best results on each dataset we tested; for example, the AUC and AUPR of SPLDMF reach 0.982 and 0.815, respectively, outperforming, to our knowledge, state-of-the-art models among similar methods.
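The easy-to-complex training idea described above can be illustrated with a minimal sketch. It is not the SPLDMF objective itself: it assumes a plain squared-error MF with L2 regularization and a hard-threshold SPL weighting (an entry is used for training only while its current loss is below an age parameter `lam`, which grows over time), and it omits the similarity terms. The function name `spl_matrix_factorization` and all hyperparameter values are hypothetical choices made for illustration.

```python
import numpy as np

def spl_matrix_factorization(Y, mask, rank=2, lam=0.5, growth=1.3,
                             outer_iters=10, inner_iters=50,
                             lr=0.05, reg=0.01, seed=0):
    """Toy self-paced MF: Y ~ U @ V.T, trained on 'easy' entries first.

    Y    : (n, m) interaction matrix
    mask : (n, m) boolean array, True where an entry is observed
    lam  : SPL age parameter (assumed hard threshold on per-entry loss)
    """
    rng = np.random.default_rng(seed)
    n, m = Y.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    history = []  # number of selected entries per outer iteration
    W = np.zeros_like(Y, dtype=float)
    for _ in range(outer_iters):
        # Step 1: fix U and V, select observed entries whose loss is below lam.
        loss = (Y - U @ V.T) ** 2
        W = ((loss < lam) & mask).astype(float)
        history.append(int(W.sum()))
        # Step 2: fix the weights, take gradient steps on the weighted loss.
        for _ in range(inner_iters):
            R = W * (Y - U @ V.T)          # weighted residual
            U_new = U + lr * (R @ V - reg * U)
            V = V + lr * (R.T @ U - reg * V)
            U = U_new
        # Step 3: raise the threshold so harder entries can enter later.
        lam *= growth
    return U, V, W, history

if __name__ == "__main__":
    Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]], dtype=float)
    mask = np.ones_like(Y, dtype=bool)
    U, V, W, history = spl_matrix_factorization(Y, mask)
    print("selected entries per round:", history)
    print("scores:\n", np.round(U @ V.T, 2))
```

Alternating between selecting samples and updating the factors is the usual way SPL objectives are optimized; the growth schedule for `lam` controls how quickly harder (possibly noisy) entries are admitted.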
Materials
The Yamanishi [28], Kuang [39], and Hao [31] datasets are three critical datasets used for validating DTI-related algorithms. The Yamanishi dataset, often called the benchmark dataset, contains drug-target relationships from databases such as KEGG BRITE [40], BRENDA [41], SuperTarget [27], and DrugBank [23], target protein sequences from the KEGG Gene Database [40], and drug compounds from the KEGG Drug and Compound Database [40]. Moreover, the Yamanishi dataset is divided into four datasets: NR, GPCR, IC, and E. It contains 445 drugs and 664 targets in E, 210 drugs and 204 targets in IC, 223 drugs and 95 targets in GPCR, and 54 drugs and 26 targets in NR. The details of the datasets are given in Table 1. The Kuang dataset has 3681 known interaction pairs [39] among 786 drugs and 809 targets (Table 1). The Hao dataset comprises 829 drugs, 733 targets, and 3688 identified interaction pairs [31] (Table 1).
Summary of four benchmark and two expanded datasets
For targeted analysis and prediction, we ensured that each drug contained at least one FDA-approved ATC code in the dataset.
This study introduced a novel DTI prediction model, self-paced learning with dual similarity information and MF method (SPLDMF), to predict unknown DTIs.
Task description
Five matrices
Four scenarios of DTI predictions. The pair with orange background represents (a) known drug-known target; (b) known drug-new target; (c) new drug-known target; and (d) new drug-new target.
Process of our proposed model.
In this protocol, a “known drug” is an experimental drug that has at least one interaction with the targets (e.g.,
Suppose
In this study, the attributive and topological properties of the drug and the target were used. The drug and target attributive features referred to the drug structure and the amino acid sequence of the target protein, respectively. Yamanishi et al. [28] also collected a dataset including the attributive feature similarity of the drug and the target. The topological features referred to the structural data of the network nodes. The topological features of drugs and targets were extracted from the DTI network using the Node2vec method [43], and the drug-drug and target-target topological feature similarities were then measured using the cosine similarity method.
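As a minimal sketch of the similarity step described above, the snippet below computes a pairwise cosine similarity matrix from node embedding vectors. It assumes the embeddings are already available as rows of an array (for example, produced by a Node2vec implementation, which is not shown); the function name `cosine_similarity_matrix` and the toy vectors are illustrative only.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Return the (n, n) matrix of cosine similarities between row vectors."""
    X = np.asarray(embeddings, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0] = 1.0   # avoid division by zero for all-zero rows
    Xn = X / norms            # unit-length rows
    return Xn @ Xn.T          # dot products of unit vectors

if __name__ == "__main__":
    # Three hypothetical drug embeddings of dimension 2.
    drug_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(np.round(cosine_similarity_matrix(drug_embeddings), 3))
```

The same function would apply unchanged to target embeddings to obtain a target-target topological similarity matrix.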
The DTI matrix
For ease of description, the drug-drug topological feature similarity matrix can be represented as
The goal of MF was to factorize the identified DTI matrix
First,
where
Solving for Eq. (3) might directly lead to overfitting during training. Therefore, the
where
Based on the idea that drugs with a higher degree of similarity tend to act on a similar set of targets, and vice versa, we integrated drug-related similarity matrices
Therefore, we added the drug similarity matrix
where
The objective function of the most recent MF-based approaches for DTI prediction is nonconvex. As a result, the optimization can easily become trapped in local minima, particularly when dealing with heavy noise and a large amount of missing data. Many studies showed that SPL could keep the model from falling into a bad local optimal solution because of its training strategy of selecting samples from easy to complex [44, 45]. Thus, we integrated the SPL algorithm into the MF model to improve its robustness. Consequently, Eq. (3.3) could be modified as:
where
According to Zhao et al. [44], the optimal
where
The alternative search strategy (ASS) was used to calculate
We fixed
Similarly, we fixed
where
Algorithms 1 and 2 explain the process of assessing individual parameters. The potential drug characteristic representation
The drugs (compounds) and targets (small molecules) could be determined based on the prediction result, that is, the scoring and ranking of matrix
The performance of the proposed model was first compared with that of other methods in simulation experiments under different missing rates and noise ratios. Then, it was compared with that of advanced models in four application scenarios. Further, two realistic and challenging extended datasets were selected for experimental comparison. We used four metrics to evaluate the effectiveness of SPLDMF: root-mean-squared error (RMSE), mean absolute error (MAE), area under the receiver operating characteristic curve (AUC), and area under the precision-recall curve (AUPR).
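The four evaluation metrics named above can be sketched in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not the evaluation code used in the experiments: AUC is computed as the fraction of correctly ordered positive-negative pairs (ties count one half), and AUPR is approximated by average precision, which is one of several common ways to summarize the precision-recall curve. All function names are hypothetical.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error between two arrays."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error between two arrays."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]   # all positive-negative pairs
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)

def average_precision(labels, scores):
    """Average precision, used here as an approximation of AUPR."""
    labels = np.asarray(labels).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(-scores, kind="stable")   # descending by score
    hits = labels[order]
    precision_at_k = np.cumsum(hits) / np.arange(1, len(hits) + 1)
    return float(precision_at_k[hits].sum() / hits.sum())

if __name__ == "__main__":
    labels = [1, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.1]
    print("AUC :", auc(labels, scores))
    print("AUPR:", average_precision(labels, scores))
    print("RMSE:", rmse([1, 0], [0.5, 0.5]), "MAE:", mae([1, 0], [0.5, 0.5]))
```

RMSE and MAE suit the synthetic-data experiments, where a real-valued matrix is reconstructed, whereas AUC and AUPR suit the ranking of candidate drug-target pairs; AUPR is typically the more informative of the two when known interactions are rare.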
Simulation data experiment
Simulation experiments were carried out to test the robustness of the model under different missing rates and noise ratios. We compared the proposed SPLDMF with two popular DTI prediction methods: MF and SVD. According to the studies by Xia et al. [33], Zheng et al. [46], and Zhao et al. [44], a matrix
Performance comparison of MF, SVD, and SPLDMF on synthetic data in terms of MAE and RMSE
We used the same dataset and cross-validation technique to compare our method with state-of-the-art methods (i.e., five repetitions of 10-fold cross-validation using Yamanishi’s benchmark dataset in four different application scenarios) to validate the performance of the model. Four cross-validation settings were used to better evaluate the model in these four scenarios: (1) CVP, which was based on the cross-validation of drug-target pairs; (2) CVR, which was based on cross-validation on rows; (3) CVC, which was based on cross-validation on columns; and (4) CV4S, which was based on random cross-validation. Table 3 depicts the application scenarios as well as the optimal potential feature dimensionality settings in our experiments. We employed the CVP settings to predict known drug-known target interactions (i.e., scenario 1, named CVPS). Figure 3 illustrates the model’s AUPR and AUC values for several potential feature dimensionalities. The findings revealed that a higher potential feature dimensionality yielded more consistent AUPR and AUC values. In the CVP scenario, the GPCR dataset also reached the optimal feature dimensionality at
Application scenarios and dataset settings and optimal feature dimensionality
Performance comparison of SPLDMF and other advanced models, and the influence and change of r on AUC and AUPR in different scenarios. (a) Changes in AUC and AUPR under different feature dimensions under CVPS. (b) Changes in AUC and AUPR under different feature dimensions under CVRS. (c) Variation in AUC and AUPR under different feature dimensions under CVCS. (d) Performance comparison of SPLDMF and other advanced models under the GPCR dataset in four scenarios.
The values are the average findings of 30 runs. The best results are shown in bold, and the values in parentheses are standard deviations.
The value was found to be the highest at
The CVC configuration was applied (i.e., scenario 2, named CVCS) for predicting new target-known drug interactions. Figure 3c illustrates the model’s AUPR and AUC values for several potential feature dimensionalities. The experimental findings revealed that the AUC curves in the CVC scenario differed significantly from those in the CVP and CVR scenarios, particularly with the possible feature dimensionality
The fourth of the four scenarios (CV4S, new drug-new target) was the most difficult for DTI prediction. Because this type of cross-validation was random, and the training and test datasets were also generated randomly, the test dataset might contain samples of new drugs and new targets, so that drug-target pairs in the new drug-new target category were included (
Comparison of the metrics of the major algorithms in CVPS, CVRS, CVCS, and CV4S scenarios based on the GPCR dataset
Top 10 drug-target relationship prediction scores and their validation
Comparison of the metrics of the DNILMF, SPLCMF, and SPLDMF algorithms in four scenarios based on the Kuang and Hao datasets
We conducted comparative experiments for the aforementioned four scenarios to verify the effectiveness of the proposed method. Specifically, we compared SPLDMF with three other state-of-the-art methods, and the results are depicted in Table 4. The results indicated that the AUC and AUPR of SPLDMF were the best among the compared methods. Our method could deal with noisy data more robustly due to the introduction of the SPL strategy, thus achieving better performance. SPLDMF outperformed NRLMF and DNILMF in AUC and AUPR under all scenarios, suggesting that the proposed SPLDMF was more robust when using ligand-based methods to anticipate the interactions between ligands and target proteins. Our method also outperformed SPLCMF, which likewise used the SPL strategy, in all scenarios. A plausible explanation was that we leveraged more drug-drug and target-target similarities to improve the predictive capacity for unknown outcomes. The results also demonstrated that SPLDMF had an improvement of 0.054 and 0.006 in AUC and AUPR, respectively, compared with SPLCMF in the most difficult scenario, CV4S.
The prediction matrix was scored using Eq. (12). We took the 10 DTI pairs with the highest prediction scores after synthesizing the DTI prediction scores of NR, GPCR, IC, and E. Data validation was performed using the ChEMBL, DrugBank, and KEGG databases, labeled C, D, and K, respectively. We validated part of the prediction results based on previous studies. The fifth and sixth columns of Table 5 list the databases used for data validation and the studies referred to for validation, respectively. Table 5 lists the top 10 predicted DTIs. The highest-scoring predicted interaction was between DB00661 (verapamil) and P35499 (SCN4A), with a prediction score of 0.983. This predicted relationship was found in the three databases C, D, and K. Furthermore, it was also reported in previous studies (Shafi et al., 2022; Stee et al., 2020). Except for the fifth item, all predictions were found in relevant reports in the databases and literature, which verified these predictions to a certain extent. The fifth pair, the relationship between norethindrone (DB00717) and ESR1 (P03372), had no relevant reports in the current databases and literature.
According to the FDA, the drug norethindrone (DB00717), similar to the drug diethylstilbestrol (DB00255), is a progestin used for contraception, the prevention of endometrial hyperplasia in hormone replacement therapy, and the treatment of other hormone-mediated diseases such as endometriosis. Diethylstilbestrol is also used to treat diseases such as breast and prostate cancer, but it is listed as a known carcinogen. The predicted results indicated that norethindrone has the same target (ESR1) as diethylstilbestrol. Besides its proven contraceptive use, norethindrone may also be used to treat breast cancer, prostate cancer, and other diseases based on the target principle. We verified our speculation through the KEGG pathway analysis experiment.
Besides simulated data and common benchmark datasets, the proposed SPLDMF was also tested with additional expanded datasets (prepared by Kuang [39] and Hao [31]) to fully verify the effectiveness of the suggested model on various datasets. A total of 3681 known interactions, 786 drugs, and 809 targets were detected in the Kuang dataset. Moreover, 3688 known interactions, 829 drugs, and 733 targets were detected in the Hao dataset. Table 6 depicts the performance comparison of SPLDMF and other methods on the expanded dataset, indicating that SPLDMF achieved the best prediction performance on both augmented datasets. This was mainly attributed to the fact that the SPL strategy improved the generalization performance of the model, enabling it to perform more robustly on noisy data. Meanwhile, the use of more feature similarity also enhanced the prediction accuracy, which was conducive to the discovery of potential DTIs.
Discussion and conclusion
Several computational-based methods, including similarity-based methods, standard machine learning methods, and MF-based methods, have been developed in recent years to achieve efficient and accurate DTI prediction. A recent study by Shi et al. [48] revealed that MF-based methods had the best prediction accuracy. Existing MF-based methods, however, might easily fall into bad local minima due to noise and missing data, as well as the nonconvex pattern of MF models. Meanwhile, the lack of prior information made it challenging for the model to accurately predict more potential associations. Therefore, we proposed a DTI prediction model based on an SPL strategy and incorporated more similarity information. The novelty of SPLDMF might be attributed to a combination of several factors. First, introducing the SPL strategy enabled the model to avoid falling into a bad local optimum solution and thus had stronger robustness. The proposed SPLDMF had better prediction performance when the data were affected by noise. Moreover, we employed more prior similarity information to improve the feature extraction capability of the model, thus enabling the model to observe more potential DTIs accurately.
Extensive experiments on synthetic data and four benchmark datasets were performed to assess the validity of the proposed SPLDMF method, which was then compared with three state-of-the-art DTI prediction methods. Two extended datasets were also used to verify the validity of each method. Comprehensive analysis results demonstrated that our proposed SPLDMF outperformed other state-of-the-art approaches. For example, the experiments on synthetic data showed that SPLDMF was more robust to noisy and missing data. Furthermore, it outperformed the compared methods in all four scenarios and on the two expanded datasets in terms of common machine learning evaluation metrics. The prediction results revealed that 9 of the top 10 DTI pairs were found in the databases and literature, and they were proven or considered effective. An unproven DTI pair (DB00717-P03372) was also preliminarily supported by pathway enrichment experiments. These results suggested that SPLDMF might provide a useful tool for predicting new DTIs and redirecting the use of existing drugs.
Footnotes
Acknowledgments
This work was supported in part by the Macau Science and Technology Development (Grant no. 0056/2020/AFJ) from the Macau Special Administrative Region of the People’s Republic of China and the Key Project from the University of Educational Commission of Guangdong Province of China (Natural, grant no. 2019GZDXM005).
Conflict of interest
None to report.
