Abstract
Machine Learning methods have been introduced in the Semantic Web for solving problems such as link and type prediction and ontology enrichment and completion (both at the terminological and the assertional level). Whilst early work mainly focussed on symbol-based solutions, numeric-based approaches have recently received major attention, motivated by the need to scale to the very large Web of Data. In this paper, the most representative proposals belonging to these categories are surveyed, jointly with an analysis of their main peculiarities and drawbacks. Afterwards, the main envisioned research directions for further developing Machine Learning solutions for the Semantic Web are presented.
Introduction
The Semantic Web (SW) vision has been introduced with the goal of making the Web machine-readable [3], by enriching resources with metadata whose formal semantics is defined in OWL1.
In order to fill some of these gaps, machine learning (ML) methods have been proposed [13]. Problems such as query answering, instance retrieval and link prediction have been regarded as classification problems. Suitable machine learning methods, often inspired by symbol-based solutions in the Inductive Logic Programming (ILP) field (aiming at inducing a hypothesised logic program from a background knowledge and a collection of examples), have been proposed [14,26,32,38,56]. Most of them are able to cope with the expressive SW representations and with the Open World Assumption (OWA) typically adopted, in contrast to the Closed World Assumption (CWA) that is usually made in traditional ML settings. Problems such as ontology refinement and enrichment at the terminological level, e.g. assessing disjointness axioms or complex descriptions for a given concept name, have been regarded as concept learning problems to be solved via supervised/unsupervised inductive learning methods for Description Logics [2] (DLs) representations [20,21,23,37,50,59].
Nowadays, the adoption of ML methods represents a major trend in several research fields such as computer vision, bioinformatics, image recognition, natural language processing and artificial intelligence. This is mostly due to the impressive scalability that recent methods, mainly grounded on numeric approaches (also called sub-symbolic), such as embeddings and deep learning [18], have shown. This trend has also occurred in the SW. Indeed, motivated by the need for scaling, many of the recent ML-based solutions, e.g. for performing link/type prediction, as well as data-intensive tasks exploiting the Web of Data and the emerging Knowledge Graphs (KGs) as background knowledge, are mainly grounded on embeddings [11,42,45], that is, methods for mapping high-dimensional representations into relatively low-dimensional vector spaces. Nevertheless, the important gain in terms of scalability that ML methods for the SW are obtaining comes at a cost: a) the interpretability of the models resulting from the learning process; b) the ability to exploit deductive (and complementary forms of) reasoning capabilities; c) the expressiveness of the SW representations and the compliance with the OWA.
In the following, the main problems and ML methods that have been developed in the SW are surveyed along two categories: symbol-based (Section 2) and numeric-based (Section 3), and their fundamental peculiarities and issues are discussed. The main envisioned research directions that need to be pursued for developing ML methods for the SW are illustrated in Section 4. Conclusions are drawn in Section 5.
Symbol-based methods for the Semantic Web
The first efforts in developing ML methods for the SW were devoted to solving deductive reasoning tasks over ontologies under an inductive perspective. This was motivated by the necessity of offering an alternative way to perform some forms of reasoning when deductive reasoning was not applicable, for instance because of inconsistencies within ontologies, but also by the need for a solution for reasoning in the presence of incompleteness (that is, when information is missing with respect to a certain domain of reference) and/or in the presence of noise (that is, when ontologies are consistent but the information therein is partly wrong with respect to a reference domain, e.g. missing disjointness axioms, missing and/or wrong assertions). Particularly, the incompleteness of knowledge bases, both at the assertional and at the schema level, drove the development of ML methods trying to specifically tackle this problem. The overall idea consisted in exploiting the evidence coming from assertional knowledge for drawing plausible conclusions, to be possibly represented with intensional models. In the following, the tasks that received major attention are reported, jointly with an analysis of the main solutions for them.
Instance retrieval
One of the first problems that has been investigated is the instance retrieval problem, which amounts to assessing whether an individual is an instance of a given concept. It has been regarded as a classification problem aiming at assessing the class-membership of an individual with respect to a query concept. Similarity-based methods, such as K-Nearest Neighbor and Support Vector Machine, have been developed since they are well known to be noise-tolerant [6,14,47]. This required coping with: 1) the OWA rather than the CWA generally adopted in ML; 2) the non-disjointness of the classes (since an individual can be an instance of more than one concept at the same time), while in the usual ML setting classes are assumed to be disjoint; 3) the definition of new similarity measures and kernel functions for exploiting the expressiveness of SW representations. Additionally, because of the OWA, new metrics for the evaluation of the classification results have been defined [14]. This is because, by using standard metrics such as precision, recall and F-measure, new inductive results were deemed mistakes whilst they could turn out to be correct inferences when judged by a knowledge engineer. The proposed solutions experimentally proved their ability to perform inductive instance retrieval when compared to a standard deductive reasoner. They also proved their ability to induce new knowledge that was not logically derivable, but they did not prove fully able to work at large scale.
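The nearest-neighbour idea can be sketched as follows. This is a minimal illustration, not any of the cited systems: the individuals, features and the Jaccard similarity are invented for the example, and a 0 label stands for membership that is simply unknown under the OWA (so the classifier abstains when no labelled evidence is available).

```python
from collections import Counter

def knn_membership(query, labeled, sim, k=3):
    """Classify `query` w.r.t. a concept from its k most similar individuals.

    `labeled` maps individuals to +1 (known member), -1 (known non-member)
    or 0 (unknown, as allowed under the OWA); unknown neighbours cast no vote.
    """
    neighbours = sorted(labeled, key=lambda x: sim(query, x), reverse=True)[:k]
    votes = Counter(labeled[n] for n in neighbours if labeled[n] != 0)
    if not votes:
        return 0  # abstain: no labelled evidence among the neighbours
    return votes.most_common(1)[0][0]

# Toy data: individuals described by feature sets, compared via Jaccard.
features = {
    "alice": {"Person", "Researcher"},
    "bob": {"Person", "Researcher", "Professor"},
    "acme": {"Organization", "Company"},
    "carol": {"Person"},
}
labeled = {"alice": +1, "bob": +1, "acme": -1}  # membership in some query concept
jaccard = lambda a, b: len(features[a] & features[b]) / len(features[a] | features[b])

print(knn_membership("carol", labeled, jaccard, k=3))  # prints 1
```

Note how the 0 label lets the method distinguish "known negative" from "unknown", which is exactly what the CWA-based standard setting cannot express.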
Methods characterized by more interpretable models have also been defined [23,51]. Inspired by the ILP literature concerning the induction of decision trees in clausal representation [5], a solution for inducing a Terminological Decision Tree (TDT) has been formalized [23]. A TDT is a tree structure, naturally compliant with the OWA, employing a DL language for representing nodes and inference services as the corresponding tests on the nodes. The tree-induction algorithm adopts a classical top-down divide-and-conquer strategy with the use of refinement operators for DL concept descriptions. Once a TDT is induced, similarly to logical decision trees, a definition for the target concept (namely the concept with respect to which classification is performed) can be drawn by exploiting the nodes in the tree structure. This solution showed the interesting ability to provide an interpretable model, but it turned out slightly less effective than similarity-based classification methods.
Concept learning for ontology enrichment
With the purpose of enriching ontologies at the terminological level, methods for learning concept descriptions for a concept name have been proposed. The problem has been regarded as a supervised concept learning problem aiming at approximating an intensional DL definition, given a set of individuals of an ontology acting as positive/negative training examples.
Various solutions have been proposed for this task. They proved able to learn approximated concept descriptions for a target concept name, but only relatively small ontological knowledge bases have been considered in the experiments.
Knowledge completion
Knowledge completion consists in finding new information at the assertional level, that is, facts that are missing from a considered knowledge base. This task has become very popular with the development of KGs, which are well known to be incomplete, and it is also strongly related to the link prediction task (see Section 3).
One of the best-known systems for knowledge completion of RDF knowledge bases is AMIE [26]. Inspired by the literature on association rule mining [1] and by ILP methods for learning Horn clauses, AMIE aims to mine logic rules from RDF knowledge bases with the final goal of predicting new assertions. AMIE (and its optimized version AMIE+ [25]) currently represents the most scalable rule mining system for learning Horn rules on large RDF data collections and is also explicitly tailored to support the OWA. However, it does not exploit any form of deductive reasoning. A related rule mining system, similarly based on a level-wise generate-and-test strategy, has been proposed in [16]. It aims to learn SWRL rules [31] from OWL ontologies while exploiting schema-level information and deductive reasoning during the rule learning process. Both AMIE and the solution presented in [16] showed the ability to mine useful rules and to predict new assertional knowledge. However, the solution proposed in [16] showed reduced scalability due to the exploitation of the reasoning capabilities.
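The core measures behind this kind of rule mining can be illustrated on a toy triple set. The sketch below is an illustration in the spirit of AMIE, not its implementation: the rule shape (two body atoms sharing a variable), the predicates and the facts are all invented for the example. It computes a rule's support, its standard (CWA-style) confidence, and a PCA-style confidence that restricts the denominator to subjects for which some head fact is known, so that unknown facts are not automatically counted as false.

```python
def mine_rule(triples, p1, p2, p_head):
    """Support and confidences of the rule p1(x,z) & p2(z,y) => p_head(x,y)
    over a collection of (subject, predicate, object) triples."""
    facts = set(triples)
    # All (x, y) bindings satisfying the two-atom body.
    body = {(s1, o2) for s1, q1, o1 in facts if q1 == p1
                     for s2, q2, o2 in facts if q2 == p2 and s2 == o1}
    support = sum((x, p_head, y) in facts for x, y in body)
    std_conf = support / len(body)
    # PCA-style confidence: only count bindings whose subject has at least
    # one known p_head fact, in line with the OWA.
    known = {s for s, q, o in facts if q == p_head}
    pca_body = [b for b in body if b[0] in known]
    pca_conf = support / len(pca_body) if pca_body else 0.0
    return support, std_conf, pca_conf

triples = [
    ("anna", "worksAt", "uni"), ("uni", "locatedIn", "rome"),
    ("anna", "livesIn", "rome"),
    ("ben", "worksAt", "lab"), ("lab", "locatedIn", "paris"),
    # ben's city of residence is simply unknown, not false
]
print(mine_rule(triples, "worksAt", "locatedIn", "livesIn"))  # prints (1, 0.5, 1.0)
```

The gap between the two confidences (0.5 vs. 1.0) shows why an OWA-aware measure matters: ben's missing residence fact penalizes the rule under the CWA reading but not under the PCA reading.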
Learning disjointness axioms
Disjointness axioms are essential for making explicit the negative knowledge about a domain, yet they are often overlooked during the modeling process (thus affecting the efficacy of reasoning services). To tackle this problem, automated methods for discovering these axioms from the data distribution have been devised.
A solution grounded on association rule mining [1] has been proposed in [59,60]. It is based on studying the correlation between classes comparatively, namely via association rules, negative association rules and the correlation coefficient. Background knowledge and reasoning capabilities are used to a limited extent.
A different solution has been proposed in [50]. Starting from the assumption that two or more concepts may be mutually disjoint when the sets of their (known) instances do not overlap, the problem has been regarded as a clustering problem, aiming at finding partitions of similar individuals of the knowledge base according to a cohesion criterion quantifying the degree of homogeneity of the individuals in an element of the partition. Specifically, the problem has been cast as a conceptual clustering problem, where the goal is both to find the best possible partitioning of the individuals and to induce intensional definitions of the corresponding classes expressed in the standard representation languages. Emerging disjointness axioms are captured by the employment of terminological cluster trees (TCTs) and by minimizing the risk of mutual overlap between concepts. Once the TCT is grown, groups of (disjoint) clusters located at sibling nodes identify concepts involved in candidate disjointness axioms to be derived. Unlike [59,60], which rely on the statistical correlation between instances, the empirical evaluation of [50] showed its ability to discover disjointness axioms also involving complex concept descriptions, thanks to the exploitation of the underlying ontology as background knowledge.
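The extensional intuition shared by these approaches can be made concrete with a small sketch. This is a deliberately simple heuristic invented for illustration (none of the cited systems works exactly this way): it proposes a disjointness axiom for every pair of concepts whose known instance sets are both non-trivial and do not overlap, and the comment stresses that under the OWA an empty overlap is only evidence, not proof, of disjointness.

```python
from itertools import combinations

def candidate_disjointness(instances, min_size=2):
    """Propose candidate disjointness axioms for concept pairs whose known
    instance sets do not overlap. Purely extensional: under the OWA an empty
    overlap is evidence of disjointness, not proof of it."""
    axioms = []
    for a, b in combinations(sorted(instances), 2):
        ia, ib = instances[a], instances[b]
        if len(ia) >= min_size and len(ib) >= min_size and not ia & ib:
            axioms.append((a, b))
    return axioms

# Toy assertional knowledge: known instances per concept name.
instances = {
    "Person": {"alice", "bob", "carol"},
    "Organization": {"acme", "unibo"},
    "Researcher": {"alice", "bob"},   # overlaps Person, so never proposed
}
print(candidate_disjointness(instances))
# prints [('Organization', 'Person'), ('Organization', 'Researcher')]
```

A `min_size` threshold is included because proposing disjointness from a single known instance would be far too weak a signal; the clustering-based approach in [50] replaces this naive pairwise check with cohesion-driven partitions and intensional cluster descriptions.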
Numeric-based methods for the Semantic Web
Whilst symbolic methods adopt symbols for representing the entities and relationships of a domain and infer generalizations that provide new insights into the data and are ideally readily interpretable, numeric-based methods typically adopt (propositional) feature-vector representations: they cannot provide interpretable models, but they usually prove rather scalable [40].
The problem that has been mainly investigated in the SW context by adopting numeric solutions is link prediction, which amounts to predicting the existence (or the probability of correctness) of triples in (a portion of) the Web of Data. The data are considered in their graph representation: mostly the RDF representation language has been targeted, almost no reasoning is exploited, and the more expressive SW languages are basically discarded. The attention towards this problem has also grown with the rise of KGs, which are known to be often missing facts [61]. In the KG context, link prediction is also referred to as knowledge graph completion. Methods borrowed from Statistical Relational Learning (SRL) [27] (whose main goal is the creation of statistical models for relational/graph-based data) have mostly been developed. In the following, the main classes of methods and solutions targeting link prediction in the SW are analyzed.
Probabilistic latent variable models
Probabilistic latent variable models explain relations between entities by associating each resource with a set of intrinsic latent attributes (i.e. attributes not directly observable in the data) and by conditioning the probability distribution of the relations between two resources on their latent attributes. All relations are considered conditionally independent given the latent attributes. This allows the information to propagate through the network of interconnected latent variables.
One of the first numeric-based link prediction solutions belonging to this category is the Infinite Hidden Semantic Model (IHSM) [48]. It formalizes a probabilistic latent variable model that associates a latent class variable with each resource/node and makes use of constraints expressed in First Order Logic during the learning process. IHSM showed promising results but proved limited in scaling to large SW data collections because of the complexity of probabilistic inference and learning, which is intractable in general [34].
Embedding models
With the goal of scaling to very large SW data collections, embedding models have been investigated. Similarly to probabilistic latent variable models, in embedding models each resource/node is represented with a continuous embedding vector encoding its intrinsic latent features within the data collection. Models in this class do not necessarily rely on probabilistic inference for learning the optimal embedding vectors, which makes it possible to avoid the issues related to the normalization of probability distributions, which may lead to intractable problems.
One of the first solutions belonging to this category is
Nevertheless, since embedding models proved an interesting ability to scale while maintaining performance comparable to probabilistic latent variable models in terms of predictive accuracy [7], with the goal of improving the model training phase employed by
An aspect that needs to be highlighted is that, since the RDF representation is targeted, most of the considered data collections only contain positive (training) examples, since false facts are usually not encoded. As training a learning model on all-positive examples can be tricky, because the model might easily overgeneralize, two different approaches are generally adopted for obtaining negative examples: either perturbing true/observed triples with the goal of generating plausible negative examples, or making a local closed-world assumption (LCWA) under which the data collection is assumed to be locally complete [44].
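The first strategy, corrupting observed triples, can be sketched as follows. The triples and entities are invented for the example; the rejection of corruptions that happen to be observed facts is a common precaution (a corrupted triple already present in the data would be a true fact, not a negative), though under the OWA even the accepted corruptions are only presumed false.

```python
import random

def corrupt(triple, entities, observed, rng):
    """Generate a plausible negative example by replacing the subject or the
    object of an observed triple, rejecting candidates that are themselves
    observed (those might be true facts rather than negatives)."""
    s, p, o = triple
    while True:
        e = rng.choice(entities)
        cand = (e, p, o) if rng.random() < 0.5 else (s, p, e)
        if cand not in observed:
            return cand

# Toy positive-only training data.
observed = {("anna", "worksAt", "uni"), ("ben", "worksAt", "lab"),
            ("uni", "locatedIn", "rome")}
entities = sorted({x for s, p, o in observed for x in (s, o)})
rng = random.Random(0)
negatives = [corrupt(t, entities, observed, rng) for t in sorted(observed)]
print(negatives)
```

During the training of an embedding model, one such corruption is typically drawn per positive triple at every epoch, so the model learns to score observed triples above their perturbed counterparts.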
Vector space embeddings for propositionalization
A complementary research direction has focused on the exploitation of vector space embeddings for obtaining a propositional feature-vector representation of RDF data collections. Specifically, inspired by the data mining (DM) literature on propositionalization [35], that is, a collection of methods for transforming a relational data representation into a (numeric) propositional feature-vector representation so that scalable propositional DM/ML methods can be applied, RDF2Vec [49] has been proposed. It formalizes a solution for learning latent numeric representations of entities in RDF graphs by adapting language modeling approaches. A two-step approach is adopted: first, the RDF graph is converted into a set of sequences of entities (for this purpose, two different approaches using local information are exploited: graph walks and Weisfeiler–Lehman Subtree RDF graph kernels); in the second step, the obtained sequences are used to train a neural language model estimating the likelihood of a sequence of entities appearing in a graph. The outcome of the training process is a representation of each entity in the graph as a vector of latent numerical features. DBpedia and Wikidata have been processed. In order to show that the obtained vector representation is independent of task and algorithm, an experimental evaluation involving a number of classification and regression tasks has been performed.
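The first of the two steps, turning the graph into sequences via random walks, can be sketched as below. This is a simplified illustration, not the RDF2Vec implementation: the tiny graph is invented, and the resulting token sequences would in practice be fed to a neural language model (e.g. a word2vec-style skip-gram model) to obtain the entity vectors.

```python
import random

def random_walks(graph, depth=4, walks_per_entity=2, seed=0):
    """Convert an RDF graph, given as (s, p, o) triples, into sequences of
    entity and predicate tokens via random graph walks of bounded depth."""
    rng = random.Random(seed)
    out = {}
    for s, p, o in graph:
        out.setdefault(s, []).append((p, o))
    walks = []
    for entity in sorted(out):
        for _ in range(walks_per_entity):
            walk, node = [entity], entity
            # Each hop appends a predicate and an object token.
            while node in out and len(walk) < 2 * depth + 1:
                p, o = rng.choice(out[node])
                walk += [p, o]
                node = o
            walks.append(walk)
    return walks

# Toy RDF graph (entity names invented for the example).
graph = [
    ("dbr:Berlin", "dbo:country", "dbr:Germany"),
    ("dbr:Germany", "dbo:capital", "dbr:Berlin"),
    ("dbr:Berlin", "dbo:population", "3.6M"),
]
for w in random_walks(graph, depth=2):
    print(" -> ".join(w))
```

Each walk is a "sentence" whose tokens are resources and predicates; the language model then places entities that occur in similar graph contexts close together in the vector space.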
An upgrade of RDF2Vec has been presented in [11]. The proposed solution is grounded on the exploitation of global patterns, differently from RDF2Vec, which exploits local patterns. Neither of the two solutions can cope with literals.
Machine Learning for the Semantic Web: Next research directions
In this section, the most challenging envisioned research problems are illustrated. Then, additional ML settings and methods that could be usefully adopted for SW-related issues are briefly discussed.
Research problems
The need to cope with the fast growth of the Web of Data and the emerging very large KGs has required the SW community to show its ability to manage such a tremendous amount of data and knowledge.
This has mostly motivated the attention towards numeric ML methods, particularly for providing scalable solutions to manage the inherent incompleteness of the Web of Data. Indeed, current symbolic methods are not actually comparable, in terms of scalability, to numeric-based solutions. This gain does not come for free. It is obtained by giving up the expressive representation languages, such as OWL, that the SW community contributed to standardize with the goal of formalizing rich and expressive knowledge, but also by forgetting one of the most powerful characteristics of these languages, that is, being empowered with deductive reasoning capabilities that allow new knowledge to be derived. This means losing knowledge that is already available. Indeed, as illustrated in Section 3, almost all numeric methods focus on RDF as a representation language, and nearly no reasoning capabilities are exploited. Furthermore, differently from symbolic methods, numeric-based solutions lack the ability to provide interpretable models (see Section 3), thus limiting the possibility of interpreting and understanding the motivations for the returned results. Additionally, tasks such as learning concept descriptions or disjointness axioms cannot be performed without symbol-based methods, which can certainly benefit from very large amounts of information to provide potentially more accurate results.
Some discussions in this direction have been developed by the Neural-Symbolic Learning and Reasoning community [17,29], which seeks to integrate principles from neural network learning and logical reasoning. The main conclusion has been that neural-symbolic integration appears particularly suitable for applications characterized by the joint availability of large amounts of (heterogeneous) data and knowledge descriptions, which is actually the case for the Web of Data. A set of key challenges and opportunities has been outlined [17], such as: how to represent expressive logics within neural networks, how neural networks should reason with variables, or how to extract symbolic representations from trained neural networks.
Some preliminary results for some of these challenges have recently been provided. For instance, SimplE [33] has been proposed: a scalable tensor-factorization model that is able to learn interpretable embeddings and to incorporate logical rules through weight tying. Ideas for extracting propositional rules from trained neural networks given a SW background knowledge have also been illustrated [36], showing that the exploitation of background knowledge allows the extracted rule set to be reduced while reproducing the input-output function of the trained neural network. A conceptual sketch for explaining the classification behavior of artificial neural networks in a non-propositional setting, while using SW background knowledge bases, has been proposed in [53]. These initial results show the feasibility of this research direction while remarking the importance of pursuing this goal.
Hence, providing an explanation means opening the box of the reasoning process and making it understandable. In a complex setting such as the Web of Data, where knowledge may result from an automatic information acquisition and integration process over different sources, and may thus be noisy and contain conflicting information, multiple reasoning paradigms may be required, e.g. deduction (when rules and a theory are available), induction (for building models from the available knowledge), abduction (for filling in partial models and coping with an incomplete theory), commonsense reasoning, etc. Large research efforts have been devoted to studying each paradigm; however, in the considered complex scenario, multiple paradigms could be needed at the same time. This may require the formalization of a unifying reasoning framework.
Additional Machine Learning settings
As for the settings to be exploited, multiple research directions still need to be investigated. Several problems, such as instance retrieval but also link prediction and assertional knowledge completion, have been solved by casting them as classification tasks. However, as discussed in Section 2, when assessing the concept membership of an individual, it may turn out to be an instance of more than one concept at the same time. As such, a more suitable way to regard the problem is as a multi-label classification task [62], in which multiple labels (concepts, in this specific case) may be assigned to each instance. Some preliminary research has been presented in [41], focussing on type prediction in RDF data collections, where limited information from the available background knowledge is considered.
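The multi-label view can be illustrated with a binary-relevance sketch: each candidate type is decided independently, so an individual may receive several types at once. Everything here is invented for the example: the feature sets, the per-type scoring (average Jaccard similarity to the type's known members) is a deliberately simple stand-in for a trained per-label classifier, and the threshold is arbitrary.

```python
def predict_types(instance, type_instances, features, threshold=0.5):
    """Binary-relevance multi-label type prediction: score each type
    independently and return all types whose score clears the threshold."""
    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0
    predicted = []
    for t, members in sorted(type_instances.items()):
        scores = [jaccard(instance, features[m]) for m in members]
        if sum(scores) / len(scores) >= threshold:
            predicted.append(t)
    return predicted

# Toy data: individuals described by the properties they are involved in.
features = {
    "alice": {"hasDegree", "authorOf"},
    "bob": {"hasDegree", "authorOf", "teaches"},
    "acme": {"hasCEO"},
}
type_instances = {"Person": ["alice", "bob"], "Researcher": ["alice", "bob"],
                  "Company": ["acme"]}
new_individual = {"hasDegree", "authorOf", "teaches"}
print(predict_types(new_individual, type_instances, features))
# prints ['Person', 'Researcher']
```

The key point is that the output is a set of labels rather than a single one, matching the non-disjointness of SW concepts discussed in Section 2.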
Multiple-instance learning (MIL) [8] is also a setting that would deserve investigation. It deals with the problem of incomplete knowledge concerning labels in training sets, as happens in SW knowledge bases due to the OWA. MIL is a type of supervised learning where training instances are not individually labeled but are collected in labeled bags. From a collection of labeled bags, the learner tries either (i) to induce a concept that will label individual instances correctly, or (ii) to learn how to label bags without inducing the concept. It may be fruitfully exploited for discovering correlations among resources and/or emerging concepts.
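The standard multiple-instance assumption underlying this setting can be stated in a few lines. The scenario below is invented for illustration (groups of resources as bags, with "being asserted a Researcher" as the hidden instance-level property): a bag is positive iff at least one of its individually unlabelled instances is positive, which is exactly the kind of weak supervision that incomplete, OWA-governed assertional knowledge provides.

```python
def label_bag(bag, is_positive):
    """Standard multiple-instance assumption: a bag of (individually
    unlabelled) instances is positive iff at least one instance is."""
    return any(is_positive(x) for x in bag)

# Toy scenario: bags are groups of resources; instance-level labels are
# incomplete, only bag-level labels would be available to the learner.
researchers = {"alice", "bob"}
bags = {"group1": {"alice", "acme"}, "group2": {"acme", "zorg"}}
labels = {name: label_bag(bag, lambda x: x in researchers)
          for name, bag in sorted(bags.items())}
print(labels)  # prints {'group1': True, 'group2': False}
```

An MIL learner would work in the opposite direction: given only the bag labels, it would try to recover which instance-level property (here, being a researcher) explains them.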
Other settings that would be useful for coping with the large number of unlabelled instances are semi-supervised learning (SSL) [9] and learning from imbalanced data. SSL makes use of both labeled and unlabeled instances during the learning process, in order to surpass the classification performance that could be obtained by discarding the unlabeled data (as would happen in a supervised learning setting). Very few research efforts have been made in this direction; some initial results have been presented in [43], where a link prediction problem is solved in a transductive learning framework. In learning from imbalanced data [28,39], that is, data collections where the label distribution is not uniform, sampling techniques are usually adopted in order to create a balanced dataset to be subsequently used for the learning task. Ensemble methods, which combine multiple learning algorithms to obtain better predictive performance, could also be fruitfully adopted, as illustrated in [24,51], where a boosting [24] and a bagging [51] technique are respectively employed.
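One of the simplest sampling techniques mentioned above, random oversampling, can be sketched as follows; the dataset is invented for the example, and in practice more refined techniques (undersampling, synthetic example generation) are often preferred since plain duplication can encourage overfitting.

```python
import random

def oversample(examples, labels, seed=0):
    """Random oversampling: duplicate minority-class examples at random
    until every class reaches the size of the majority class."""
    rng = random.Random(seed)
    by_label = {}
    for x, y in zip(examples, labels):
        by_label.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_label.values())
    out = []
    for y, xs in sorted(by_label.items()):
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out += [(x, y) for x in xs + extra]
    return out

# Toy imbalanced dataset: four positives, one negative.
data = ["a1", "a2", "a3", "a4", "b1"]
labels = ["pos", "pos", "pos", "pos", "neg"]
balanced = oversample(data, labels)
print(sum(1 for _, y in balanced if y == "neg"))  # prints 4
```

After oversampling, both classes contribute four examples, so a learner trained on `balanced` no longer sees the skewed label distribution of the original data.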
As a last point, considering the increasing volume of the Web of Data, online and incremental learning, in which input data is continuously used to further train and extend the learned model, should naturally be investigated. To the best of our knowledge, no research efforts have been made in this direction.
Conclusions
In this paper, the progress that has been made in the SW by exploiting ML methods has been surveyed. Specifically, symbol-based and numeric-based solutions have been analyzed, highlighting their main peculiarities and drawbacks. Finally, the main envisioned research directions have been outlined.
