Abstract
Analyzing the relationships among drugs is an essential problem in computational biology. Different kinds of informative knowledge, such as drug repurposing candidates, can be extracted from drug-drug relationships. Scientific literature is a rich source of knowledge about the relationships between biological concepts, mainly drug-drug, disease-disease, and drug-disease relationships. In this paper, we propose DDREL, a general-purpose method that applies deep learning to scientific literature to automatically extract the graph of syntactic and semantic relationships among drugs. DDREL remarkably outperforms the existing human drug network method and a random network with respect to the average similarity of drugs’ Anatomical Therapeutic Chemical (ATC) codes. DDREL is able to shed light on existing deficiencies of the ATC codes in various drug groups. The history of drug discovery becomes visible in the DDREL graph. In addition, the drugs with repurposing score 1 (diflunisal, pargyline, fenofibrate, guanfacine, chlorzoxazone, doxazosin, oxymetholone, azathioprine, drotaverine, demecarium, omifensine, yohimbine) are already used in additional indications. The proposed DDREL method demonstrates the predictive power of textual data in PubMed abstracts. DDREL shows that such data can be used to (1) predict repurposing drugs with high accuracy, and (2) reveal existing deficiencies of the ATC codes in various drug groups.
Introduction
Recently, discovering relationships among drugs, side effects, and many other biological concepts has attracted much attention [1, 2, 3, 4, 5, 6, 7, 8]. Biological literature is a rich source of knowledge for discovering relationships among drugs [9, 10, 11, 12, 13, 14] as well as diseases [15]; accordingly, computational text mining approaches play an important role in current drug studies [16, 17, 18]. Recognizing the relationships among drugs leads to a knowledge graph that can be used efficiently to discover repurposing drugs [19, 20]. Drug repurposing is the process of discovering new uses for existing drugs and is currently widely discussed in the drug discovery domain [21, 22, 23, 24, 25, 26].
In this paper, we propose a method called DDREL that applies a deep learning method to 29 million article abstracts extracted from the PubMed dataset. DDREL builds the graph of drug-drug relationships among 2,061 drugs. For evaluation, DDREL uses a homophily measure based on the average number of similar ATC levels of connected drugs (discussed in detail in Section 2, Step 6 and Eq. (3)). As an application, DDREL first clusters the graph of drug-drug relationships and then uses outlier detection mechanisms to discover repurposing drugs in each cluster.
The structure of this paper is as follows: Section 2 overviews available methods. DDREL is described in detail in Section 3. The empirical results are discussed in Section 4. Section 5 provides the explanation and interpretation of results. Section 6 concludes and discusses future work.
Background
Traditional methods for discovering drug relationships from biological texts generally calculated the similarity of drugs by applying text mining techniques such as co-occurrence of drugs, drug frequency, n-gram, and TF-IDF [27, 28, 29]. Griffith et al. [27] and Wagner et al. [28] applied co-occurrence analysis to discover the interactions among drugs and genes and built drug-gene interaction databases. Rahmani et al. [30] used the random walk algorithm to construct a Human Drug Network (HDN) for 200 different drugs based on functional and structural information. Stenetorp et al. [29] implemented a text mining-based clustering approach with n-grams to study the effect of corpus size and domain in the biological domain. Frijters et al. [31] proposed CoPub, a text-based method that finds new hidden relationships among biological concepts using co-occurrence. Sun et al. [32] used a new approach called X-Cluster to discover new goals for medications. The main drawback of most traditional methods is that they consider only the syntactic relations among drugs and neglect the semantic relationships. Syntactic methods consider each term individually (through features such as term frequency (TF) and inverse document frequency (IDF)) or with limited context around terms (through local features such as n-grams and word occurrence). In contrast, semantic methods consider a wide range of context (k words before and k words after the term). Word2Vec is a well-known semantic approach that the literature has shown to capture the context of terms in text very efficiently [33, 34, 35].
Among the methods that inherently consider semantic relationships among concepts, deep neural network-based approaches have recently attracted much attention [36, 37, 38, 39]. There are several ways to implement deep learning, the best known of which are convolutional and recurrent networks [40]. Deep neural networks extract features from text using Word2Vec [41, 42, 43, 44] and GloVe [33, 45, 46]. Moen et al. [36] applied Word2Vec and n-grams to cluster the words in biological text. Kosmopoulos et al. [37] used Word2Vec and TF-IDF to classify biological text in the PubMed dataset. Allahgholi et al. [47] applied Word2Vec to PubMed abstracts to predict interaction types among drugs. They recommended alternatives for drug-drug interactions (DDIs) that have Negative Health Effects Types (NHETs). Ma et al. [38] proposed a recurrent neural network (RNN) approach to predict risks based on electronic health record (EHR) data. Dey et al. [39] used an RNN on the chemical structures of various drugs to discover substructures that cause adverse drug reactions (ADRs).
Patrick et al. [20] applied a word-embedding based machine learning approach to predict drug repurposing for immune-mediated cutaneous diseases. The results of Patrick et al. [20] are limited to a small range of immune-mediated cutaneous diseases (353 drugs). Zhou et al. [48] and Singh et al. [49] discovered repurposing drugs for COVID-19 using artificial intelligence techniques. Kingsmore et al. [50] introduced treatments for rheumatic autoimmune inflammatory diseases using machine learning and drug repurposing. Udrescu et al. [51] extracted drug-drug relationships from DrugBank (1141 drugs) and then applied modularity-based clustering to the network. They labeled drugs that were inconsistent with respect to drug properties as repurposing drugs. Zhou et al. [52] used the Tanimoto similarity coefficient [53] to construct drug-drug interaction networks. They then applied the Markov clustering algorithm to cluster 589 drugs into 98 groups. Finally, they recommended the drugs with the most different functions in each cluster as repurposing drugs.
The Anatomical Therapeutic Chemical (ATC) Classification System is a widely used method to classify drugs by their chemical and anatomical properties [54]. Some methods try to predict ATC codes from compound structure [55]. Predicting a drug’s ATC code can point to a novel indication for that drug. The NetInfer method used drugs’ structures to predict their ATC codes and, through them, potential novel therapeutic uses [56, 57]. The ATC code itself is a fuzzy annotation, which means it does not contain all indications of a drug [58]. We used the ATC codes to show how the DDREL method can give a historical perspective on drug repurposing.
Method
The main steps of our proposed DDREL method are shown in Fig. 1. These steps are as follows:
The block diagram of DDREL, which contains seven steps: Datasets and their pre-processing, constructing Word2Vec model, extracting drugs, building graph, homophily evaluation, and discovering repurposing drugs.
DDREL accepts three datasets as input data.
The first dataset is the PubMed dataset.
To detect the drug terms in the raw text, we use the DrugBank dataset.
DDREL uses a graph of drug-drug relationships to discover repurposing drugs. The third dataset used in DDREL is the drug repurposing dataset RepoDB [60], which contains 1571 drugs. DDREL uses RepoDB to evaluate the novel repurposing drugs it discovers. DDREL applies the CBOW Word2Vec model [61] to the PubMed dataset to describe each biological term in PubMed. The configuration of Word2Vec is shown in Table 1.
Parameter configuration of the Word2Vec model in DDREL
The minimum count (min-count) is the minimum number of occurrences a word needs in order to be included in the word vectors. The vector dimension (dim) is the size of the learned word vectors. Sub-sampling (samp) is the process of reducing the number of occurrences of frequent words. The context window size (win) is the range of words around the target word that are included as its context [33, 62, 63]. The values of these parameters are set in Table 1 according to the best practices of similar work in the literature [42, 33, 45, 35, 64, 34, 65].
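As an illustration of how two of these parameters act on the text, the following minimal sketch lists the (context, target) pairs that a CBOW model would be trained on. The function name `cbow_training_pairs`, the toy corpus, and the values `win=1` and `min_count=2` are assumptions made for this example and are not the settings of Table 1. The sketch uses a fixed window and leaves out sub-sampling and the vector dimension, which affect training rather than pair extraction.

```python
from collections import Counter


def cbow_training_pairs(sentences, win=2, min_count=1):
    """Return (context_words, target_word) pairs for a tokenized corpus.

    Words that occur fewer than min_count times in the whole corpus are
    dropped before the windows are formed; win is the number of words
    taken on each side of the target word.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, target in enumerate(kept):
            left = kept[max(0, i - win):i]
            right = kept[i + 1:i + 1 + win]
            context = left + right
            if context:  # a lone surviving word has no context to learn from
                pairs.append((context, target))
    return pairs


# Hypothetical toy corpus; real input would be tokenized PubMed abstracts.
toy_corpus = [
    ["aspirin", "inhibits", "cox1", "and", "cox2"],
    ["ibuprofen", "inhibits", "cox1"],
]
pairs = cbow_training_pairs(toy_corpus, win=1, min_count=2)
for context, target in pairs:
    print(context, "->", target)
```

With `min_count=2`, only "inhibits" and "cox1" survive in the toy corpus, so rare drug names would never receive a vector; this is why the choice of min-count bounds the set of drugs that can enter the graph.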
For the 2,061 drugs that appear in both the DrugBank dataset and the PubMed corpus, DDREL computes the cosine similarity between their Word2Vec vectors as their semantic similarity (see Eq. (1)).
In Eq. (1),
10 synonym drugs with high word2vec similarity
DDREL considered only the most informative discovered relationships among drugs by pruning the relationships with semantic similarity value less than
Homophily of graph and number of relationships among drugs in different value of
DDREL constructed graph
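The similarity, pruning, and graph-construction steps described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the three-dimensional vectors, the function names, and the threshold value 0.8 are hypothetical, chosen only so that the example runs. Real vectors would come from the trained Word2Vec model.

```python
import math
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_u * norm_v)


def build_drug_graph(vectors, threshold):
    """Keep an edge for every drug pair whose similarity is >= threshold."""
    edges = {}
    for a, b in combinations(sorted(vectors), 2):
        sim = cosine_similarity(vectors[a], vectors[b])
        if sim >= threshold:  # pairs below the threshold are pruned
            edges[(a, b)] = sim
    return edges


# Hypothetical toy embeddings, not taken from the trained model.
toy_vectors = {
    "ibuprofen": [0.9, 0.1, 0.0],
    "naproxen": [0.8, 0.2, 0.1],
    "diazepam": [0.0, 0.1, 0.9],
}
edges = build_drug_graph(toy_vectors, threshold=0.8)
print(edges)
```

Comparing all pairs is quadratic in the number of drugs, which is still manageable for roughly two thousand drugs.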
DDREL evaluated the built graph using the homophily measure [66]. Homophily is a measure used in network science that shows how similar connected nodes are [67]. We calculated the homophily of the drug-drug network by counting the number of similar levels between the ATC codes of two interacting drugs in the network. The ATC code classifies drugs into groups at five different levels:
Anatomical main group (one letter). There are 14 main groups (see Table 4). 14 anatomical main groups
Therapeutic subgroup (two digits).
Therapeutic/pharmacological subgroup (one letter).
Chemical/therapeutic/pharmacological subgroup (one letter).
Chemical substance (two digits).
For example, for acetaminophen:
N: Nervous system group
N02: Analgesic drug subgroup
N02B: Other analgesic drugs subgroup
N02BE: Anilide subgroup
N02BE01: Acetaminophen drug
To compare the similarity of drugs, the similarity of the first 4 levels of the ATC code will be measured.
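The level structure above lends itself to a simple prefix comparison. The sketch below splits an ATC code into its five hierarchical prefixes and counts how many of the first four levels two codes share. It assumes that levels are compared hierarchically, so that a mismatch at one level ends the comparison; the function names are chosen for this example only.

```python
# Cumulative prefix length of each ATC level:
# 1 letter, +2 digits, +1 letter, +1 letter, +2 digits.
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)


def atc_levels(code):
    """Split a full ATC code into its hierarchical prefixes."""
    return [code[:n] for n in ATC_LEVEL_LENGTHS if len(code) >= n]


def shared_atc_levels(code_a, code_b, max_levels=4):
    """Count the leading ATC levels two codes share, up to max_levels."""
    shared = 0
    levels_a = atc_levels(code_a)[:max_levels]
    levels_b = atc_levels(code_b)[:max_levels]
    for a, b in zip(levels_a, levels_b):
        if a != b:
            break  # assumption: deeper levels cannot match once a level differs
        shared += 1
    return shared


print(atc_levels("N02BE01"))
print(shared_atc_levels("N02BE01", "N02BA01"))
print(shared_atc_levels("N02BE01", "M01AE01"))
```

For the acetaminophen code N02BE01, the first call returns the five prefixes N, N02, N02B, N02BE, and N02BE01 listed in the example above.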
Equation (3) calculates the homophily of edge
Where
To calculate the homophily for the whole graph, we use Eq. (4) that calculates the average similar ATC levels among all edges.
Where
Based on similar ATC codes of the connected drugs, Eq. (5) calculates the accuracy of discovered drug-drug relationships. DDREL uses the same homophily measure to compare its approach with existing methods and a family of graphs generated randomly with a similar number of nodes and edges.
In Eq. (5) “Number of correct drug-drug relationships detected” is the number of drugs pair (
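A minimal sketch of the graph-level evaluation follows. It assumes that the homophily of the graph is the mean number of shared ATC levels over all edges, and that the accuracy at a given level is the fraction of edges whose two drugs agree on at least that many leading levels. Both readings are inferred from the surrounding text rather than from the equations themselves. The edges are invented toy data, and each drug is given a single ATC code for simplicity.

```python
def shared_levels(code_a, code_b, max_levels=4):
    """Count leading ATC levels (prefix lengths 1, 3, 4, 5) two codes share."""
    shared = 0
    for n in (1, 3, 4, 5)[:max_levels]:
        if len(code_a) < n or len(code_b) < n or code_a[:n] != code_b[:n]:
            break
        shared += 1
    return shared


def graph_homophily(edges, atc):
    """Average number of shared ATC levels over all edges of the graph."""
    scores = [shared_levels(atc[a], atc[b]) for a, b in edges]
    return sum(scores) / len(scores) if scores else 0.0


def level_accuracy(edges, atc, level):
    """Fraction of edges whose drugs share at least `level` ATC levels."""
    if not edges:
        return 0.0
    correct = sum(1 for a, b in edges if shared_levels(atc[a], atc[b]) >= level)
    return correct / len(edges)


# Toy data: one ATC code per drug and three invented edges.
toy_atc = {
    "ibuprofen": "M01AE01",
    "naproxen": "M01AE02",
    "diflunisal": "N02BA11",
    "aspirin": "N02BA01",
}
toy_edges = [
    ("ibuprofen", "naproxen"),
    ("ibuprofen", "diflunisal"),
    ("diflunisal", "aspirin"),
]
print(graph_homophily(toy_edges, toy_atc))
print(level_accuracy(toy_edges, toy_atc, level=1))
```

Because the same two functions can be applied to any edge list, the identical measure can score the DDREL graph, a competing network, and randomly generated graphs.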
One of the main applications of DDREL is discovering novel repurposing drugs. For this purpose, DDREL first applies the Glay algorithm [68] to cluster the graph of drug-drug relationships. Second, DDREL uses an outlier detection mechanism to discover the drugs whose ATC codes differ most from those of the other members of the same cluster (see Eq. (6)). In DDREL, we consider a drug an outlier if it differs in the first level of the ATC code from the other drugs in its cluster. If
Pseudocode of DDREL is provided in Algorithm 1. As in previous work [69, 70], DDREL uses the accuracy measure to evaluate the detected repurposing drugs.
DDREL algorithm
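The outlier-detection step can be illustrated with the short sketch below. It assumes that the outlier score of a drug is the fraction of its co-cluster drugs whose first ATC level differs from its own, so that a score of 1 means the drug disagrees with every other member of its cluster. This definition is a reading of the description above, not a reproduction of Eq. (6). The cluster is a toy example with one ATC code per drug.

```python
def outlier_scores(cluster, atc):
    """Map each drug to the fraction of co-cluster drugs whose first
    ATC level (the first letter of the code) differs from its own."""
    scores = {}
    for drug in cluster:
        others = [d for d in cluster if d != drug]
        if not others:
            scores[drug] = 0.0  # a singleton cluster has nothing to differ from
            continue
        differing = sum(1 for d in others if atc[d][0] != atc[drug][0])
        scores[drug] = differing / len(others)
    return scores


# Toy cluster: three antirheumatic agents (M01A) and one N02B drug.
toy_cluster = ["ibuprofen", "naproxen", "phenylbutazone", "diflunisal"]
toy_atc = {
    "ibuprofen": "M01AE01",
    "naproxen": "M01AE02",
    "phenylbutazone": "M01AA01",
    "diflunisal": "N02BA11",
}
scores = outlier_scores(toy_cluster, toy_atc)
candidates = [drug for drug, score in scores.items() if score == 1.0]
print(scores)
print(candidates)
```

In this toy cluster only diflunisal reaches a score of 1, while each of the other drugs scores 1/3 because it disagrees with diflunisal alone.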
The accuracy of DDREL in terms of the number of similar ATC code levels between the two drugs that are connected in the graph.
DDREL succeeded in discovering 3,430 relationships among 1,297 drugs. In the following subsections, we first evaluate DDREL, and then we discuss one of its possible uses, the discovery of repurposing drugs.
Homophily evaluation
We calculated the network homophily measure [66] of the drug-drug graph created by Word2Vec by considering the ATC codes of each drug-drug relationship. Because the comparison uses the first four ATC levels, DDREL computed, for each pair of interacting drugs in the graph, the number of those levels that their ATC codes share. Figure 2 shows the accuracy of DDREL based on the number of similar ATC levels of the connected drugs. In Fig. 2, the accuracy of 94% in the first level of the ATC means that 94% of the connected drugs were similar with respect to the first level of their ATC codes (
Application of DDREL: Discovering repurposing drugs
Drug repositioning refers to the unexpected use of a drug for a different disease indication, most likely due to the accidental discovery of new targets or new mechanisms of action [71, 72, 73, 74, 24, 75]. For example, sildenafil (Viagra) was initially developed as an antihypertensive drug but was later found to be more effective in treating erectile dysfunction. Axitinib is a VEGFR inhibitor approved for renal cancer, but it was recently discovered that axitinib has an off-target, BCR-ABL, through which it can treat leukemia.
The DDREL network clustered with the Glay clustering method. Drugs with repurposing score 1 are shown with a red border. Some of the clusters containing these drugs are enlarged in the results section. The network clusters and components show the major target areas of current drug discovery. The network has many small components that contain specific drugs. To explore the DDREL network further, see [59].
The first step can be to visualize and cluster the DDREL network (Fig. 3). The Glay algorithm [68] is applied to the drug-drug graph to partition it into clusters of related drugs. Glay is suitable for the decomposition, display, and exploratory analysis of large biological networks because of its high performance [68]. The result of the Glay clustering algorithm is available at [59]. According to Fig. 3, the DDREL network has many components that represent the general therapeutic modalities. The network’s giant component consists of the two largest classes of drugs: the drugs targeting the central nervous system and the antihypertensive agents. The second-largest component contains the various antibiotics. Glay clustered the network according to the currently available target groups.
The main hypothesis behind discovering repurposing drugs is the detection of outlier drugs in each cluster. Equation (6) calculates the outlier-score for each drug
Twelve drugs with outlier-score 1. Eight of them are listed in the RepoDB dataset [60] as repurposing drugs
Sample cluster of similar drugs – all of these drugs’ ATC code starts with A02B.
Number of drugs for different values of outlier score.
There are 12 drugs with outlier-score 1 in Fig. 5. The first level of the ATC codes of these drugs is completely different from that of their co-cluster drugs. Figure 6 shows sample clusters for these drugs, namely the clusters containing diflunisal and doxazosin. The cluster in Fig. 6a consists of non-steroidal anti-inflammatory drugs (NSAIDs). The bottom part of Fig. 6a shows the COX2 inhibitors, such as valdecoxib and lumiracoxib. The top part shows the COX1 inhibitors, with ibuprofen at one end and phenylbutazone at the other. This mirrors the order in which the NSAIDs were discovered: first the COX1 inhibitors, then the COX2 inhibitors. The DDREL graph thus shows the evolution of NSAIDs. The cluster is not connected to any other part of the DDREL graph, which makes this therapeutic class a separate entity. All of its members have an ATC code starting with M01A except diflunisal. The ATC code of the outlier diflunisal starts with N02B. Like aspirin, it is a salicylic acid derivative. It has anti-inflammatory and analgesic activity, like all NSAIDs. Diflunisal is also used in rheumatoid arthritis. Its classification follows from the ATC guideline: “All preparations containing salicylic acid and derivatives are classified in N02BA – Salicylic acid and derivatives, as it is difficult to differentiate between the use of salicylates in rheumatic conditions and other therapeutic uses.” Based on our data, it may be worth considering diflunisal simply as an NSAID, like many other drugs in this cluster.
The cluster in Fig. 6b contains the urologic agents with ATC code G04. At the top of the cluster are sildenafil and its derivatives vardenafil and tadalafil. These cGMP-specific phosphodiesterase type 5 inhibitors were developed as antihypertensive agents and became the definitive treatment of erectile dysfunction [76]. The bottom part of the cluster contains alfuzosin, tamsulosin, and terazosin, which are alpha 1 adrenergic inhibitors. These agents are used in benign prostate hypertrophy to relax the smooth muscle of the urethra [77]. The outlier in the cluster is doxazosin, which also targets the alpha 1 adrenergic receptor but is classified as an antihypertensive agent with the ATC code C02. Doxazosin connects the urological agents to the various antihypertensive agents on the DDREL graph (Fig. 3). This classification connects the two therapeutic groups (antihypertensive agents and urological agents) through their mechanism on the alpha 1 adrenergic receptor. Doxazosin is already used in benign prostate hypertrophy even though it was developed as an antihypertensive agent [78].
Table 5 focuses on 12 drugs with an outlier-score 1. Eight drugs out of 12 drugs with outlier-score 1 are indicated in RepoDB dataset [60] as repurposing drugs. All the clusters containing outlier drugs are available at [59].
Clusters containing doxazosin and diflunisal – a) The cluster containing diflunisal – Excluding diflunisal, all the remaining drugs in this cluster are antirheumatic agents (ATC: M01A), whereas diflunisal is in the N02B ATC class. b) The cluster containing doxazosin – Excluding doxazosin, all the remaining drugs in this cluster belong to the urological subgroup (ATC starts with G04), but doxazosin is an antihypertensive agent (ATC: C02); it is therefore considered an outlier and has a repurposing score of 1.
Rahmani et al. [30] proposed a method to build relationships among human drugs. They used a random-walk-based approach to discover relationships among 200 drugs. We compared the graph discovered by DDREL with their work. As a comparison measure, we used the average number of similar ATC levels between connected drugs (Eq. (4)) for the network of Rahmani et al. and for DDREL. This measure is 2.7 for Rahmani et al. and 3.1 for DDREL. Additionally, as a random control, we built 100 graphs that keep the degree distribution of the DDREL network. In this case, the ATC level similarity was 2.1 (
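The text does not state how the 100 degree-preserving random graphs were generated. One common way to build such a control, shown here purely as an assumption, is to apply repeated double-edge swaps to the original edge list: two edges (a, b) and (c, d) are replaced by (a, d) and (c, b), which leaves every node's degree unchanged. The function names and the toy graph are hypothetical, and the input is assumed to be a simple graph without self-loops or duplicate edges.

```python
import random
from collections import Counter


def degree_preserving_shuffle(edges, n_swaps, seed=0):
    """Randomize an undirected simple graph by double-edge swaps."""
    rng = random.Random(seed)
    edge_list = [tuple(sorted(e)) for e in edges]
    if len(edge_list) < 2:
        return edge_list
    edge_set = set(edge_list)
    done, attempts = 0, 0
    while done < n_swaps and attempts < 100 * n_swaps:
        attempts += 1
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if rng.random() < 0.5:
            c, d = d, c  # try both orientations of the second edge
        if a == d or c == b:
            continue  # the swap would create a self-loop
        new1, new2 = tuple(sorted((a, d))), tuple(sorted((c, b)))
        if new1 == new2 or new1 in edge_set or new2 in edge_set:
            continue  # the swap would create a duplicate edge
        edge_set.difference_update({edge_list[i], edge_list[j]})
        edge_set.update({new1, new2})
        edge_list[i], edge_list[j] = new1, new2
        done += 1
    return edge_list


def degrees(edges):
    """Degree of every node in an undirected edge list."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


# Toy graph: a six-node ring with one chord.
toy_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"),
             ("e", "f"), ("f", "a"), ("a", "d")]
shuffled = degree_preserving_shuffle(toy_edges, n_swaps=10, seed=0)
print(degrees(shuffled) == degrees(toy_edges))
```

Running the shuffle with 100 different seeds and averaging the homophily of the resulting graphs would give a random baseline of the kind described above.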
Drug repurposing is an approach for discovering new applications for approved or investigational drugs that are outside the scope of the primary medical indication [73]. The accuracy of DDREL for detecting repurposing drugs is 81% (this accuracy must be read in light of the incomplete annotation of indications). DDREL was able to discover 8 repurposing drugs (see, e.g., Table 5). The supporting papers, which indicate that the drugs in Table 5 are repurposing drugs, are [79, 84, 80, 81, 82, 83, 86, 85] and are listed in Table 5. All the clusters containing repurposing drugs are available at [59].
Moreover, because of the large number of repurposing drugs in the nervous system, we focused on the drug clusters in which the ATC codes of all members start with “N”. We found a cluster of 26 drugs in which the codes of 24 drugs start with “N05”, meaning psycholeptics: drugs used in schizophrenia, but also hypnotics and sedatives. Most of these drugs are benzodiazepines (see Fig. 7a). The two drugs without the N05 label are methylphenobarbital and trimethadione. Their ATC codes start with “N03”, meaning anticonvulsant drugs administered in epilepsy. Methylphenobarbital is a barbiturate derivative. Both it and its metabolite phenobarbital were used in epilepsy; however, like all other barbiturates, it has sedative effects. It has been withdrawn from the market [89]. Trimethadione was one of the first drugs used in petit mal epilepsy, but it was later found to be teratogenic [90, 91, 92]. This example shows the problem with the ATC categories: methylphenobarbital is an outlier of the barbiturate family, which is in the N05 group (N05CA). Surprisingly, both drugs are mentioned in two very recent papers [93, 94] as repurposing drugs (phenobarbital against psoriasis and trimethadione against retinoblastoma).
Another cluster contains 11 drugs, of which 10 have ATC codes starting with “N03”, meaning that these drugs are anticonvulsants according to their ATC code. The 11th drug, clobazam, is a psycholeptic with an ATC code starting with “N05”; however, clobazam is also used in epileptic seizures (see Fig. 7b). This drug is mentioned in [95] as a repurposing drug.
These examples show the problem of the ATC classification because one drug can have multiple indications [58]. DDREL was able to find these inconsistencies based on the text-mined similarity graph.
Clusters containing methylphenobarbital, trimethadione and clobazam drugs a) The cluster containing methylphenobarbital and trimethadione drugs – Excluding methylphenobarbital and trimethadione, all the remaining drugs in this cluster have N05 ATC class, but methylphenobarbital and trimethadione drugs have N03 ATC class, which are considered as repurposing drugs. b) The cluster containing clobazam drug – Excluding clobazam, all the remaining drugs in this cluster have N03 ATC class, but clobazam has an N05 ATC class, which is considered as a repurposing drug.
In recent years, much effort has been invested in discovering new relationships among drugs. To the best of our knowledge, most of the existing text mining approaches did not consider semantic relations among drugs. In this paper, we proposed DDREL, a method that applies deep learning to discover syntactic and semantic relationships among drugs in order to predict repurposing drugs. The experimental results show that, in terms of the average number of similar ATC levels (Eq. (4)), DDREL significantly outperformed both the existing human drug network method and a random network built 100 times. DDREL was able to find discrepancies in the ATC classification and to show how the indications of various drugs changed over time (e.g., in the case of alpha-adrenergic antagonists). Of course, selecting drugs for repurposing requires expert knowledge, but DDREL can be a first step that suggests which drugs need further evaluation.
For future work, first, we will extend DDREL to detect drug combinations (i.e., synergistic or antagonistic interactions) in the network of drug-drug relationships; fortunately, comprehensive databases such as DrugComb exist that may serve as ground-truth data sources.
Declaration of competing interest
AB is a shareholder of Healx Ltd and Pharmenable Ltd.
Supplementary materials
Supplementary material associated with this article can be found at [59].
