Abstract
Identifying potential relationships between diseases and drugs is important for public health care and drug discovery. Determining drug-disease interactions experimentally is very expensive in both time and money, and many drug-disease associations remain undiscovered. Developing computational methods to explore the relationship between drugs and diseases is therefore essential. Many computational methods for predicting drug-disease associations use known interactions to learn the potential interactions of unknown drug-disease pairs. In this paper, we propose 3 new main groups of meta-paths based on the heterogeneous biological network of drug-protein-disease objects. For each meta-path, we design a machine learning model, and these models are then combined into an ensemble learning method. We evaluated our approach on 3 standard datasets: DrugBank, OMIM, and Gottlieb’s dataset. Experimental results demonstrate that the proposed method outperforms several recent methods, including EMP-SVD, LRSSL, MBiRW, MPG-DDA, and SCMFDD, on measures such as AUC, AUPR, and F1-score.
Keywords
Introduction
Traditional drug development involves lengthy procedures, including target discovery, discovery screening, lead optimization, ADMET (Absorption, Distribution, Metabolism, Excretion and Toxicity) testing, development and registration. The process is complicated and costly, and it carries a high risk of failure. Developing a pharmaceutical product still takes at least 10 to 15 years and can cost between $500 million and $2 billion, 1 with substantial investments directed toward basic science, technology development, and the exploration of new organizational and management models.
In particular, finding new uses for existing drugs can substantially reduce development cost compared with “de novo” drug discovery and development. 2 Many recent publications 3-14 have focused on drug repurposing, in which additional indications were sometimes discovered unexpectedly. Chlorpromazine (dopamine receptor blockade) was initially developed as an antiemetic/antihistamine, but Henri Laborit discovered its tranquilizing effect and it became a staple of psychiatric treatment. More recently, galantamine (acetylcholinesterase inhibition), 15 which had been considered a drug for treating polio, paralysis and anesthesia, was approved in many countries for Alzheimer’s disease. Importantly, because repositioning candidates have frequently already been tested in development for their initial indication, several phases common to de novo drug discovery and development can be avoided. Drug repositioning therefore offers a chance to reduce the time and risk of development. Building on this concept, many researchers explore drug-disease associations to predict new uses for existing drugs, relying on the assumption that similar drugs tend to treat similar diseases. 16
Biological data on genome sequences, gene expression status, protein interactions and patients has grown sharply in recent years. However, most databases are dedicated to a specific type of information, and the relationship between different datasets, for example between gene production and epigenetic status, is still poorly understood. In this complex data landscape, combining different datasets yields an integrated heterogeneous dataset that is almost always as good as, and in several cases significantly better than, any single dataset. Heterogeneous transfer learning methods, in which heterogeneous networks span multiple biological data sources, have also been implemented with promising results. 17 We adapt this approach to build a heterogeneous network that supports drug repositioning using the relevant protein information. There is evidence that, when associations between drugs, proteins and diseases are available, a heterogeneous network combined with Singular Value Decomposition (SVD) can learn from a set of meta-paths. 18
For drug repositioning, as for other transfer learning methods, the meta-paths need a clearly defined logical structure. In this work, we present a new method for detecting new drug-disease associations based on meta-paths. Our first contribution is to derive drug-drug, protein-protein and disease-disease associations and to construct a heterogeneous network from them. Second, we propose to analyze drug-disease associations through 3 new meta-paths. Once the drug-disease associations from these meta-paths are combined, latent features are extracted by dimensionality reduction. Finally, we apply a classification model to the heterogeneous network. The proposed approach is designed for drug repositioning in a biological heterogeneous network and may also serve as an effective model for label transfer on other heterogeneous data.
Related Work
To position our work, it is useful to review how drug-disease associations are currently studied in drug repurposing. A large number of computational works have sought either to define a representation of drug-disease associations or to design a learning model for drug repurposing. We review the works related to our approach in the following 2 subsections.
Drug-disease association presentation
Approaches to representing drug-disease associations differ in feature extraction, similarity estimation, and matrix factorization. Typically, drug-protein, drug-disease and protein-disease associations are encoded as binary labels indicating the presence or absence of an interaction. Fixed-length feature vectors, often accompanied by the binary labels, are used to represent the features of drug-disease associations. 19 To improve performance and efficiency, some works apply dimension reduction techniques that transform feature arrays from a high-dimensional space into a low-dimensional space while retaining meaningful properties of the drug-disease association. Different techniques can be used to predict drug-disease interactions, including Support Vector Machines (SVM) 17 and Random Forest. 3 However, SVM performs poorly on highly imbalanced data, especially in complex tasks. In that study, only the weighted SVM model was considered, without any data sampling or filtering, which may not be sufficient to address data imbalance. The general limitations of Random Forests reported in other studies, such as high computational complexity and limited interpretability, could also apply here.
Similarity-based approaches, in which associations are labeled according to the similarity of their features, have also been studied in this domain. Because drug-drug and disease-disease similarity can be computed with similarity or distance functions, interactions can be predicted: given the matrix of known drug-disease interactions, similarity measures produce estimates for unknown drug-disease pairs. A number of similarity-based methods, including those of Zhang et al 4 and Shi et al, 5 have been proposed that address the similarity scores of drug-drug, disease-disease or drug-disease associations. Furthermore, Euclidean distance was used in a nearest neighbor algorithm applied to the interactions. 6 The genomic similarity of protein sequences and the pharmacological similarity of drugs, together with topological properties of the drug-protein-disease network, were also suggested for drug-disease interaction prediction. 7 However, the ratio of known drug-disease interactions to the total number of possible interactions is very low, and this is the main disadvantage of similarity-based methods.
We also consider works that represent drug-disease interactions through latent factors. Of particular interest is matrix factorization, which can represent drug-disease interactions by factors. This is possible when there are consistent associations between the characteristics of drugs and the characteristics of diseases. Unlike the similarity approach, which is based on the characteristics of drugs and diseases, matrix factorization is based on measuring the strength of drug-disease interactions when drugs and diseases are located within the same spatial region. 8 The works mentioned above outline different ways to represent features of drug-disease associations. The method proposed in this paper uses binary classes to represent drug-disease associations and the matrix factorization approach to obtain the major factors of the feature matrices. However, we propose three novel meta-paths instead of using similarity measurement.
Drug re-purpose learning model design
Recent research has explored network-based, deep learning and hybrid methods for designing drug repurposing learning models. In network-based methods, the data structure is a set of objects represented by nodes, with their relationships represented by edges. These methods have attracted attention in machine learning research because of the expressive power of networks. A semi-supervised heterogeneous network embedding was described by Song et al. 9 Specifically, the network is built from drug similarity, drug-disease associations, and protein-protein interactions. On the other hand, Zhao et al 10 proposed a multi-graph based method in which a graph convolutional network was combined with a graph embedding approach to represent features of drug-disease associations. Further examples of networks applied to heterogeneous drug-disease data include the convolutional neural networks of Öztürk et al 11 and the multilayer perceptrons of You et al. 12
In the deep learning direction, Zhao et al 13 proposed a geometric deep learning method for predicting drug-disease associations from heterogeneous information. Feature representation was obtained by projecting geometric prior knowledge of the network structure into a latent feature space.
Most of the methods introduced above perform well on 1 task or 1 dataset. Other methods combine existing methods that were applied in the field or transferred from other fields. Combinations can draw on feature-based methods, matrix factorization, networks and deep learning. For example, feature-based and similarity-based machine learning approaches were integrated by Agarwal et al. 14 Such hybrid methods are generally productive because they optimize feature extraction to capture the complex hidden features of drugs and diseases. Joining 2 machine learning methods for drug-disease interaction prediction often yields favorable results, as the strengths of the component methods can be exploited simultaneously. However, integration brings high complexity, either computational or operational, that must be handled. For our drug repurposing learning model, we selected the hybrid approach to combine the advantages of its component methods, including feature extraction, SVD and three new meta-paths designed specifically for heterogeneous drug and disease data.
The Method
This section outlines the main tasks of our method for predicting drug-disease associations. First, we describe a heterogeneous network built from biological databases related to drugs, proteins and diseases (step 1 in Figure 1). The network is then extended by adding drug-drug, protein-protein, and disease-disease relationships, for which we build 6 new matrices describing the connections (step 2 in Figure 1). To support the prediction of drug-disease associations, we bring the newly constructed matrices into learning through three meta-paths that reflect the drug-disease associations (step 3 in Figure 1). Features are then extracted from the drug-disease associations and used for learning in the final task (steps 4 and 5 in Figure 1).

Figure 1. General workflow containing 5 main steps.
Heterogeneous network construction
To avoid being distracted by the details, we use
Heterogeneous network
Let
The elements of the heterogeneous network in Table 1 show that the network contains interconnected nodes and links of different types. The nodes represent drugs, diseases, and proteins, so step 1 in Figure 2 shows the nodes in 3 colors according to their type. The edges of the network are drawn with 3 line styles to indicate the different edge types.
Table 1. Elements of the heterogeneous network of drug-protein-disease.

Figure 2. Construct the drug-protein-disease heterogeneous network and 6 new matrices representing the relationships of drug-drug, protein-protein, and disease-disease.
Network expansion by adding associations
In designing a heterogeneous network, there are choices about the types of edges. The constructed network has edges for drug-protein, disease-protein and drug-disease associations. We propose to add 3 new types of edges: drug-drug, disease-disease, and protein-protein.
The drug-drug association
A drug-drug association can be derived from the drug-disease or drug-protein associations. Specifically, two drugs are considered associated if there exists a disease with which both drugs are associated. The drug-drug association created with a disease as intermediary is represented by a matrix
The protein-protein association
Several techniques can be used to define this association. One way of detecting a protein-protein association is to search for a drug that acts as an intermediary in a protein-drug-protein relation, which is used to calculate the association matrix
The disease-disease association
This kind of association links one disease to another. We extract the disease-drug-disease relation by checking the drug-disease associations. Once 2 diseases are associated with the same drug, they are marked as associated in the matrix
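The intermediary-based associations described in this subsection can be illustrated with a minimal sketch. It assumes binary incidence matrices stored as nested lists, marks two entities as associated when they share at least one intermediary, and leaves the diagonal at zero. The function name `shared_neighbor_association`, the toy matrices and the zero-diagonal choice are illustrative assumptions, not the authors' implementation.

```python
def shared_neighbor_association(incidence):
    """Binary square matrix linking rows that share at least one column.

    incidence[i][k] == 1 means entity i is associated with intermediary k.
    Assumption: self-associations (the diagonal) are left at 0.
    """
    n = len(incidence)
    assoc = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and any(a and b for a, b in zip(incidence[i], incidence[j])):
                assoc[i][j] = 1
    return assoc


def transpose(matrix):
    """Swap rows and columns of a nested-list matrix."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical toy data: 3 drugs x 2 diseases and 3 drugs x 3 proteins.
drug_disease = [
    [1, 0],
    [1, 1],
    [0, 0],
]
drug_protein = [
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
]

# Drug-drug via a shared disease.
drug_drug = shared_neighbor_association(drug_disease)
# Disease-disease via a shared drug.
disease_disease = shared_neighbor_association(transpose(drug_disease))
# Protein-protein via a shared drug.
protein_protein = shared_neighbor_association(transpose(drug_protein))

print(drug_drug)        # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
print(disease_disease)  # [[0, 1], [1, 0]]
print(protein_protein)  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

The same helper could be applied with a different intermediary type, for example drug-drug via a shared protein, by passing the corresponding incidence matrix.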
Three new meta-paths
Meta-path
In a heterogeneous network, a meta-path X is defined as a path on a schema
where
A common point of view is that a meta-path for drug disease association will start by a drug and end by a disease (
When studying associations of
If we look at 3 associations of drug-drug
In general, each new meta-path therefore comprises several sub-paths, each of which produces drug-disease associations.
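A common way to turn a meta-path into numbers is to multiply the association matrices along the path, so that each entry counts the path instances connecting a drug to a disease. The sketch below shows this for a drug-protein-disease path. That path, the helper names `matmul` and `compose`, and the toy matrices are illustrative assumptions and do not necessarily correspond to any of the three meta-paths proposed here.

```python
def matmul(a, b):
    """Multiply two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def compose(*matrices):
    """Chain matrix products along a meta-path, left to right."""
    result = matrices[0]
    for m in matrices[1:]:
        result = matmul(result, m)
    return result


# Hypothetical toy data: 2 drugs x 3 proteins and 3 proteins x 2 diseases.
drug_protein = [
    [1, 1, 0],
    [0, 1, 0],
]
protein_disease = [
    [1, 0],
    [1, 1],
    [0, 1],
]

# Entry [i][j] counts the drug -> protein -> disease paths from drug i to disease j.
path_counts = compose(drug_protein, protein_disease)
print(path_counts)  # [[2, 1], [1, 1]]
```

Longer sub-paths would be obtained by passing more matrices to `compose`, provided the inner dimensions match.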
Feature extraction
As defined by Wu et al, 18 the element

Figure 3. Extract features with singular value decomposition, then predict drug-disease.
However, the number of known drug-disease pairs is extremely small compared with the total number of drug-disease pairs, which considerably affects the construction of an effective machine learning model. We therefore apply singular value decomposition (SVD) to extract a small number of features. Some studies have also reported that, when SVD is used for dimensionality reduction in prediction problems on a heterogeneous biological network, useful data is preserved while redundant information is removed. 20 Notably, the 3 proposed meta-paths with their 128 subpaths reflect many aspects of the heterogeneous network, thereby revealing many relationships relevant to drug-disease treatment. Base classifiers were then constructed for each meta-path to predict drug-disease treatment relationships. Finally, we integrated these base classifiers into an ensemble classifier, as shown in Figure 3. The classifier used in our method is an ensemble classifier with a voting strategy for selecting the best one.
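As a rough illustration of SVD-based feature extraction, the sketch below factorizes a small drug-by-disease matrix with NumPy, keeps the top k singular components, and concatenates a drug vector with a disease vector to form the feature vector of one pair. The function name `svd_features`, the square-root scaling of the singular values, the random toy matrix and the value of k are assumptions made for illustration. The exact construction used in the paper is not recoverable from the text.

```python
import numpy as np


def svd_features(matrix, k):
    """Return k-dimensional latent features for the rows and columns of matrix.

    Assumption: singular values are split evenly (square root) between the
    row and column factors, so row_feats @ col_feats.T approximates matrix.
    """
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    scale = np.sqrt(s[:k])
    row_feats = u[:, :k] * scale
    col_feats = vt[:k, :].T * scale
    return row_feats, col_feats


# Hypothetical meta-path matrix: 6 drugs x 4 diseases with small path counts.
rng = np.random.default_rng(0)
path_matrix = rng.integers(0, 3, size=(6, 4)).astype(float)

drug_feats, disease_feats = svd_features(path_matrix, k=2)

# Feature vector for the pair (drug 0, disease 1): concatenated latent vectors.
pair_vector = np.concatenate([drug_feats[0], disease_feats[1]])
print(drug_feats.shape, disease_feats.shape, pair_vector.shape)  # (6, 2) (4, 2) (4,)
```

With k equal to the full rank, the product of the two factors reproduces the input matrix. Smaller values of k discard the components with the smallest singular values.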
Prediction of drug-disease associations
An ensemble classifier is applied to improve performance. Suppose
In summary, data collected from different biological databases are integrated to form a heterogeneous drug-protein-disease network. We explained how the 128 subpaths of the three new meta-paths enrich the associations in the heterogeneous network. This noticeably aids the construction of feature vectors that represent the relationship between drugs and diseases in the network.
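One simple form of voting over base classifiers is majority voting, sketched below for binary labels. This is only an assumed reading of the voting strategy: the exact rule used to combine the per-meta-path classifiers is not recoverable from the text, and the function name `majority_vote` and the toy predictions are hypothetical.

```python
def majority_vote(predictions):
    """Combine binary predictions from several classifiers by majority vote.

    predictions: one list of 0/1 labels per base classifier, all the same length.
    A pair is labeled 1 when more than half of the classifiers predict 1.
    """
    n_models = len(predictions)
    return [1 if 2 * sum(labels) > n_models else 0 for labels in zip(*predictions)]


# Hypothetical outputs of three base classifiers (one per meta-path)
# for four candidate drug-disease pairs.
base_predictions = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 1, 0],
]

ensemble_labels = majority_vote(base_predictions)
print(ensemble_labels)  # [1, 1, 1, 0]
```

An odd number of base classifiers avoids ties. With an even number, the rule above resolves a tie to 0.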
Experiments
This study evaluates the prediction of new drug-disease interactions from the available interactions between drugs, proteins and diseases. This section describes the data and parameters used in our experiments and then discusses the learning results.
Data
Our study on finding drugs for diseases requires reliable and accurate drug-disease, drug-protein and disease-protein data. The data come from OMIM, 21 Gottlieb’s data set 22 and DrugBank, 23 as selected by Wu et al 18 and shown in Table 2. The sources provide the data in widely varying formats and data types. OMIM 21 reports 1365 disease-protein interactions covering 449 diseases and 1147 proteins. Gottlieb’s data set provides 1827 drug-disease interactions covering 302 diseases and 551 drugs. From DrugBank, 23 4642 drug-protein interactions were gathered, covering 1186 drugs and 1147 proteins. The first point to note is the heterogeneity of the dataset, which is collected from different sources. Second, studying the relation between a drug and a disease amounts to checking whether a route connecting that drug and that disease exists within the network of diseases, proteins and drugs. Since the edges of the network were constructed from the different sources mentioned above, the network is heterogeneous.
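The scale of the class imbalance can be estimated from the counts reported for Gottlieb's data set. The short calculation below assumes that the candidate space is every pairing of the 551 drugs with the 302 diseases, which is an assumption about how unknown pairs are enumerated rather than a statement from the source.

```python
# Counts reported for Gottlieb's data set.
n_drugs = 551
n_diseases = 302
n_known = 1827

# Assumption: every drug-disease pairing is a candidate.
n_pairs = n_drugs * n_diseases
density = n_known / n_pairs
unknown_per_known = (n_pairs - n_known) / n_known

print(n_pairs)                     # 166402
print(f"{density:.2%}")            # 1.10%
print(round(unknown_per_known, 1)) # 90.1
```

Under this assumption, roughly one pair in a hundred is a known association, which is consistent with the imbalance discussed in the feature extraction step.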
Table 2. Data in experiments. 18
Parameters
In the algorithm 1 for meta-path 1, the drug parameter
To enable cross validation, the drug-disease interaction data in each learning option were split randomly 5 times, providing a training set and a test set each time. Several metrics, namely ACC, PRE, REC, MCC and F1 (A1-A5), were used in the cross validation to evaluate learning performance. The Area Under the Precision-Recall Curve (AUPR) and the Area Under the Receiver Operating Characteristic Curve (AUC) were also used in our tests.
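The threshold-based metrics can be computed from the confusion counts of a binary prediction. The sketch below uses the standard textbook definitions of accuracy, precision, recall, F1 and the Matthews correlation coefficient, which are assumed to match the appendix formulas (A1-A5). The function names and the toy labels are hypothetical.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true, y_pred):
    """Compute ACC, PRE, REC, F1 and MCC, returning 0.0 for undefined ratios."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    acc = (tp + tn) / total if total else 0.0
    pre = tp / (tp + fp) if (tp + fp) else 0.0
    rec = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * pre * rec / (pre + rec) if (pre + rec) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "PRE": pre, "REC": rec, "F1": f1, "MCC": mcc}


# Hypothetical labels for 8 drug-disease pairs in one test fold.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

In a 5-fold setting, these values would be computed on each test fold and then averaged. AUC and AUPR need ranked prediction scores rather than hard labels and are not covered by this sketch.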
Discussion
We now illustrate how the selected parameters correspond to the accuracy obtained. The map in Figure 4 shows this correspondence, with the vertical axis covering the parameters of path 1 and path 2 and the horizontal axis covering the parameters of path 3. Accuracy higher than 0.9 is seen in several places of the map. To report the best results of implementing the 3 paths with 28 options, Table 3 uses a specific notation for each option, which consists of its parameters. For path 1, the notation m1(ds) indicates the drug parameter (d), with accuracy varying from 0.906 to 0.908. Actually, there is only 1 note of

Figure 4. Accuracy by ensemble path given parameters of each meta-path.
Table 3. The best results of ensemble path by parameters of the 3 meta-paths.
Furthermore, the available results of related works are summarized in a comparative report. By incorporating drug chemistry information and gene ontology annotation information, Liang et al 24 proposed the Laplacian regularized sparse subspace learning (LRSSL) approach for predicting drug-disease interactions. Luo et al 25 used the bi-random walk algorithm (MBiRW), together with an analysis of medications and disorders, to evaluate new drug-disease interactions. Applying meta-paths with ensemble learning methods, Kawichai et al 26 associated drugs and diseases through Gene Ontology terms.
To estimate drug-disease interactions, Zhang et al introduced a linear neighborhood similarity 27 and a network topological similarity. 10 A similarity constrained matrix factorization method (SCMFDD) can then be applied, analyzing drug features, disease semantics and information on drug-disease associations. 20 After representing similarities and interactions between diseases, medications, and therapeutic targets, Wang et al 28 proposed a three-layer heterogeneous network model (TL-HGBI) as a computational framework. Tho Dang et al 29 applied EMP-SVD 18 with 5 other new meta-paths, which improved some performance metrics.
Since SVD was proposed to extract a small number of features, it is important to show its effect with a small experiment in which SVD was not used and all features were used for training. The scores of this experiment are given in Table 4, with ACC = 0.848, which is lower than in the case with SVD. The method has three major meta-paths. For instance, the method with 1 meta-path consists of running EMP-SVD, proposed by Wu et al, 18 once. The method designed with 2 meta paths consists of first run of the EMP-SVD and then run different combination
Table 4. Performance of related methods. The best scores are printed in bold.
In this study, we proposed to analyze drug-disease associations through three novel meta-paths. The experiments support this as the new element and main contribution of the article, demonstrating the role of the three new meta-paths. However, a limitation of the article is that the results have not been explained on biomedical grounds, which would establish their practical significance. Such an explanation would be a significant contribution, but it is very difficult, and current research, like ours, mainly relies on experiments and measurements for evaluation.
Case Studies
When drug-disease associations are transferred from known associations to new ones, the new associations can be checked against the literature for confirming or refuting reports. We used the label transfer method with 3 paths to search for new drug-disease associations in the dataset covering drug-disease, disease-protein and drug-protein associations. A number of new drug-disease associations were found that were unassociated both in the initial dataset and by the original 5-path method.
For each newly found drug-disease association, we searched the available publications on the drug and its use in treating the disease. For many drug-disease associations created by the label transfer method, no report of the association can be found. Although it is hard to find publications for new associations, we can obtain confirmation for some of them. For example, after label transfer raised an association between the drug fludrocortisone and the disease hypertension, we found a paper by Veazie et al 31 on this association. Orthostatic hypotension is an excessive drop in blood pressure while standing. It results from a decrease in cardiac output or from defective or insufficient vasoconstrictor mechanisms. Fludrocortisone is a mineralocorticoid that expands blood volume and raises blood pressure. It is regarded as the first- or second-line pharmacological therapy for orthostatic hypotension, alongside mechanical and positional methods.
As another example, a new association between the drug oseltamivir and the disease encephalopathy was created by our label transfer method. Encephalopathy refers to any disease of the brain that changes brain structure or function. Oseltamivir is an antiviral drug sold under the brand name Tamiflu. It is commonly used to prevent and treat influenza A and influenza B, the viruses that cause the flu. A case of encephalopathy treated with oseltamivir was reported in detail by Yen et al. 32 Flu-like symptoms and progressive encephalopathy were observed in a 25-year-old female patient. Influenza B was detected by polymerase chain reaction on a nasopharyngeal swab. The patient was treated with oseltamivir, and her mental status recovered within days.
Conclusions
We presented a new method for improving the performance of drug-disease interaction prediction and applied it to the analysis of heterogeneous biomedical data. The method includes 3 paths designed for training on a dataset of interactions between 3 types of objects: drugs, proteins and diseases. The contribution of this paper is to present only 3 meta-paths with a full 27 options, which allow us to update the prior of these objects through their interactions with neighboring objects. In the experiments, all learning options were tested with cross validation, which shows which options improve accuracy. As a result, the method improved most performance metrics, including accuracy and F1-measure. The combination of prior updating, applicability to heterogeneous biomedical data, and flexible training yields a computational framework that is effective for data collected from different sources.
Footnotes
Appendix
Acknowledgements
We acknowledge support from the Vietnam Ministry of Education and Training, Hanoi National University of Education, Electric Power University, and Academy of Policy and Development.
Credit Authorship Contribution Statement
Nam Anh Dao: Conceptualization of this study, Software. Manh Hung Le: Data curation, Writing - Related works. Xuan Tho Dang: Data curation, Methodology. Xuan Tho Dang et al.: Preprint submitted to Evolutionary Bioinformatics.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Vietnam Ministry of Education and Training, project B2022-SPH-04.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
