Abstract
We present a novel approach for learning embeddings of
Introduction
In this paper we design, implement and evaluate a novel method of learning concept embeddings in knowledge bases for the
The neural reasoner consists of two modules – a reasoner head, which is a neural network classifier trained to classify whether subsumption axioms hold for a given knowledge base, and capable of constructing an embedding for arbitrarily complex
In our work we hypothesize, and experimentally show support for, the idea that
We purposefully limit ourselves to the taxonomical part of
The main contributions of the work are as follows:
A novel architecture consisting of two parts: a reasoner head, employing a recursive neural network, and a collection of embedding layers. Together, they enable the construction of reason-able embeddings.
A training procedure employing multiple knowledge bases at once to shape the topology of the embedding space.
Extensive experimental evaluation aimed at establishing the properties of the proposed approach. In the experiments, we show that the proposed approach indeed forces embeddings to represent the formal semantics in a shared way, that the resulting embeddings are of high quality, and that using this kind of shared, latent semantics does not incur a substantial performance penalty.
On top of that, we present three open-source technical contributions implemented in Python that may be useful for the community:
An implementation of the presented method
A random axiom generator for the
An extended interface for the FaCT++ reasoner
Our work is structured as follows. In Section 2, we briefly introduce deep learning (Section 2.1) and description logics (Section 2.2), along with the related notation that we use throughout this text. In Section 3, we give an overview of the current state of the art, in particular of knowledge base embeddings (Section 3.1), and of deep deductive reasoners (Section 3.2). Section 4 presents the details of the architecture for learning reason-able embeddings: in Section 4.1 we present the details of the reasoner head, and in Section 4.2 of the embedding layer. We then introduce a relaxed variant of the architecture in Section 4.3, and describe the training procedure in Section 4.4. The axiom generator for
Preliminaries
Deep learning
While machine learning is ubiquitous in research these days, we give a very brief introduction to ensure there is no terminological confusion. In this, we follow [31], and the reader is referred to the book for a more extended introduction.
Binary classification is frequently solved by constructing a function predicting the probability of an object to belong to the positive class, and then thresholding on the probability to obtain the label. Throughout this work, we employ this approach, writing
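The threshold-on-probability scheme can be sketched as follows. This is a minimal plain-Python illustration; the function names and the default threshold of 0.5 are ours, not taken from the text.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classify(score: float, threshold: float = 0.5) -> int:
    """Threshold the predicted probability to obtain a hard class label."""
    return 1 if sigmoid(score) >= threshold else 0

# A large positive score maps to the positive class, a negative one to 0.
label_pos = classify(10.0)   # 1
label_neg = classify(-10.0)  # 0
```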
A classifier usually contains a number of
For gradient descent algorithms, it is established to use a
Due to the stochastic nature of training, too aggressively optimizing the metric can lead to
It is frequently the case that a learning algorithm has some configuration options that cannot be automatically adjusted using the training set, e.g., because that would immediately lead to overfitting. These options are called
A
Over the years, many activation functions were considered. Throughout the paper we use two of them: the sigmoid
Finally, by an
Description logics
As mentioned, DLs are a family of logics, which means that there are many DLs of varying expressiveness. Since in our work we focus on the
For readers that are unfamiliar with DLs we begin with an intuitive overview. In later sections we describe the notation we use, and define the syntax and semantics of the
Overview
At a high level, a KB divides knowledge into
The TBox contains
The semantics of DLs provide an
Notation
Formally, a KB in the
Semantics
An interpretation is defined as a pair
The set of terminological axioms
The notion can be extended to arbitrary axioms
Syntax and semantics of
Syntax and semantics of
While the Web Ontology Language (OWL) is underpinned by a DL, it uses a different nomenclature. In particular, knowledge bases are called
Terminological axioms are called class axioms, and a subsumption
Semantic reasoners
The main benefit of using a knowledge representation paradigm with well-founded semantics is the ability to perform automated reasoning. In this work, we use the
Many
In this work, FaCT++ is the reasoner of choice, as it is capable enough to deal with
Related work
The neural and symbolic paradigms of artificial intelligence are vastly different. On one hand, the currently dominant neural paradigm shows how well neural networks perform on large-scale data sets, ranging from simple classification and regression, to language models so powerful that their output is almost indistinguishable from human-written text [9], and generative image models that can synthesize photorealistic images from text prompts [74]. However, neural models often produce nonsensical results, for example paragraphs of text that contradict themselves. Because neural models are not easily interpretable, it is difficult to find the cause of such problems. On the other hand, the symbolic paradigm offers methods for explicitly describing knowledge and reliably performing reasoning over that knowledge. Symbolic methods can provide step-by-step explanations of their inferences, because they use deductive reasoning. Unfortunately, symbolic methods are not well suited for learning from data, as real-life knowledge is often seemingly contradictory. Symbolic methods also have trouble processing large-scale data sets, because of the high time complexity of the algorithms they use. The research field of neuro-symbolic integration aims to combine the large-scale learning ability of neural models with the ability of symbolic methods to express knowledge and perform reasoning, all while preserving interpretability, thus combining the benefits and avoiding the pitfalls of both paradigms [23]. There seem to be two major avenues where the Semantic Web can benefit from the neural paradigm: ontology and knowledge graph embeddings, and deep deductive reasoners [35].
Knowledge base embeddings
One way to bridge the neural and symbolic paradigms is to represent symbolic knowledge in terms of vectors in a high-dimensional real vector space. Such vectors can then be used as additional inputs to machine learning models based on neural networks, to improve their performance by allowing them to use an approximation of expert knowledge [87]. A good method of learning embeddings from symbolic knowledge should ideally leverage the structural information present in relations between abstract concepts, and should not try to learn embeddings for abstract concepts with the aid of word embeddings, due to the ambiguity of language, its limited abstraction, and other problems [11].
Among the first embedding methods tailored to KBs to gain traction were NTN, proposed by Socher et al. [84], and TransE, proposed by Bordes et al. [7,8], later unified into a single framework by Yang et al. [97], as both leverage linear and bilinear mappings. The notion of modeling relations as operations in some vector space was further extended by Wang et al. with TransH [94], Lin et al. with TransR [51] and PTransE [50], Sun et al. with RotatE [85], Zhang et al. with HAKE [98], Wang et al. with InterHT [90], and Zhou et al. with Path-RotatE [99].
A separate category of embeddings was established by ConvE, which employs convolutional neural networks [20]. This line of research was continued by Nguyen et al. with ConvKB [62], Jiang et al. with ConvR [42], and Demir et al. with ConEx [19]. Over time, even more complex neural architectures were employed, e.g., capsule networks in CapsE by Nguyen et al. [63], recurrent skipping networks by Guo et al. [33], graph convolutional networks in GCN-Align by Wang et al. [93] and R-GCN by Schlichtkrull et al. [82], and recurrent transformers by Werner et al. [95].
Yet another approach was taken in RESCAL by Nickel et al., where tensor-based techniques were used [65]. Similar approaches were taken in TOAST by Jachnik et al. [39], TATEC by García-Durán et al. [28], DistMult by Yang et al. [97], HolE by Nickel et al. [64], ComplEx by Trouillon et al. [88], and ANALOGY by Liu et al. [52].
In the context of the Semantic Web technologies, notable examples of embeddings are RDF2Vec by Ristoski et al. [75], OWL2Vec* by Chen et al. [12], and
Embeddings are used in various downstream tasks, such as disambiguation [80], ontology learning [71], handling concept drift in streams [13], question answering [54] or fact classification [43]. Over the years multiple survey papers were published on the topic of KB embeddings [10,14,16,91,92].
To the best of our knowledge, none of the proposed methods aims to capture the semantics as a topology of the embedding space, nor enables computing the embeddings of complex expressions from the embeddings of their parts.
Deep deductive reasoners
Compared to the work on embeddings, much less attention has been devoted to the notion of approximating or emulating the results of deductive reasoning with machine learning, and in particular deep learning [23].
Earlier works, employing traditional machine learning approaches, were mostly devoted to approximating a deductive reasoner in order to obtain results faster than from the reasoner itself. For example, Rizzo et al. [76,77] proposed terminological trees and forests to tackle this problem, while Paulheim and Stuckenschmidt concentrated on the problem of approximate reasoning with the ABox [70].
One of the first works leveraging deep learning for deductive reasoning over ontologies were those of Makni and Hendler on using recurrent neural networks for RDFS reasoning [56], and of Hohenecker and Lukasiewicz on using recursive reasoning networks [37]. Eberhart et al. employed long short-term memory networks (LSTMs) for reasoning in the
Ebrahimi et al. in [24] introduced a few dimensions along which a reasoner can be classified:
Neural reasoner

A high-level overview of the neural reasoner architecture. An embedding layer for knowledge base
Our neural reasoner (hereafter,
The main purpose of the classifier is to facilitate the computation of the gradient of the loss function with respect to the weights of the embedding layer. In other words, when the reasoner is given an axiom, it builds its representation bottom-up from the KB-specific embeddings in the embedding layer. That representation is then used by a neural network in the reasoner head to classify whether the axiom is entailed by that KB. The classification output is contrasted with the target output computed by a semantic reasoner, and the value of the loss function is used to adjust weights of the embedding layer with backpropagation through structure [30].
In our work we chose the reasoner to be an entailment classifier for subsumption axioms in
Each element of the outer product of two vectors
The classification output for an entailment query
Recursive neural networks have been successfully used for encoding expression trees as fixed-size vectors that can be used as inputs to machine learning models [30]. The recursively defined DL concepts also have a tree structure [49], so it is appropriate to use a recursive neural network as the embedding layer in our reasoner architecture. The key hypothesis that defines our reasoner is that for each KB, we can train an embedding layer that embeds concepts from that KB in an embedding space with a topology that makes it easy for the reasoner head to classify entailment. An embedding topology that is beneficial to entailment classification is formed by jointly training the reasoner head and embedding layers for multiple KBs, which forces the reasoner head to generalize, and the embedding layers to output embeddings in the shared embedding space.
We note in passing that we distinguish between a recursive and a recurrent neural network: in a recursive neural network, different parts of the network are applied to the input multiple times, depending on a structured input, as with recursive grammars, where the same production rule can be applied again and again; in a recurrent neural network (e.g., GRU, LSTM) the output of the network is connected back to the network, as part of its input.
In general, concept embeddings are represented by vectors in real vector space
The embeddings of the bottom and top concepts are stored by the vectors
To obtain the embedding for the complement of a given concept, one first recursively obtains the embedding of that concept, and then passes it as an input to the
We also define in Equation (8) the embedding of a doubly negated concept as the embedding of that concept, by the double negation elimination rule, which speeds up the construction of embeddings.
To obtain the embedding for an intersection of concepts, one first recursively obtains the embeddings of the two concepts, computes their interaction map, and passes it as the input to the

An example interpretation of a knowledge base with role assertions is shown on the left side. The individuals
To obtain the embedding for an existential restriction, one first obtains the embedding of the given concept, and passes it to the
In deep learning, activation functions must be used after each hidden layer, as otherwise the network would merely compose linear transformations, which is equivalent to a single linear transformation. Note that even though our constructor networks do not have an activation function, the result of the
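The bottom-up construction described in this section can be sketched as follows. This is a minimal NumPy illustration, not the actual PyTorch implementation: the tuple encoding of concepts, the single-matrix constructor networks, and the toy atomic embeddings are our simplifications, and the existential-restriction constructor is omitted for brevity. The intersection constructor receives the flattened outer product (interaction map) of its arguments, and double negation is eliminated before embedding, as in Equation (8).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4  # illustrative embedding dimension

# Constructor networks, shown here as single linear maps (the constructors
# in the text likewise have no activation function of their own).
W_not = rng.standard_normal((d, d))      # complement: R^d -> R^d
W_and = rng.standard_normal((d, d * d))  # intersection: interaction map -> R^d

# KB-specific embeddings of atomic concepts (a hypothetical toy KB).
atomic = {"A": rng.standard_normal(d), "B": rng.standard_normal(d)}

def embed(concept):
    """Recursively build the embedding of a concept expression, bottom-up."""
    op = concept[0]
    if op == "atom":
        return atomic[concept[1]]
    if op == "not":
        inner = concept[1]
        if inner[0] == "not":  # double negation elimination (Eq. (8))
            return embed(inner[1])
        return W_not @ embed(inner)
    if op == "and":
        u, v = embed(concept[1]), embed(concept[2])
        return W_and @ np.outer(u, v).ravel()  # flattened interaction map
    raise ValueError(f"unknown constructor: {op}")

# Embedding of (not A) and B, built from the embeddings of its parts.
e = embed(("and", ("not", ("atom", "A")), ("atom", "B")))
```

Gradients flowing back through such a structure are exactly what backpropagation through structure [30] computes in the trained model.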
To extend
An important property of the embedding layer, that we introduced in the previous section, is that the weights of the neural networks
While sharing constructors mimics the logic, it may not be an appropriate assumption in a deep learning model and may thus hinder learning good embeddings. To verify whether this is the case, we consider a variant of the reasoner that allows the embedding layer to learn KB-specific concept constructor networks
In the relaxed architecture, the only weights shared between different KBs are the ones learned by the classifier
Training procedure
The training procedure for our reasoner consists of two steps. In the first step, we train both the reasoner head and embedding layers for as many diverse KBs as possible. This results in a reasoner head that learned to classify whether subsumption axioms hold in any KB, given that an appropriate embedding layer is provided. By appropriate embedding layer we mean a layer that learned to embed KB-specific concepts in a space that minimizes the classifier loss. We simply call this step
In the second step, we freeze the reasoner head and train the embedding layers for KBs that were not seen in the first step. This results in embedding layers that can embed concepts in a space in which the reasoner is good at classification. We think that if our reasoner can accurately classify whether subsumption axioms are entailed by a KB, then the embeddings used as inputs to the classifier
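The freeze-then-train step can be sketched in PyTorch as follows. This is a schematic with illustrative layer sizes and module choices of our own; the actual head and embedding-layer architectures are those described in Section 4.

```python
import torch
from torch import nn

d = 8  # illustrative embedding dimension

# A stand-in for the pre-trained reasoner head (the real architecture differs).
reasoner_head = nn.Sequential(nn.Linear(d * d, 16), nn.ReLU(), nn.Linear(16, 1))

# Step 2: freeze the reasoner head so that only the embedding layer
# receives gradient updates.
for p in reasoner_head.parameters():
    p.requires_grad = False

# Train a fresh embedding layer for a previously unseen KB.
unseen_kb_embeddings = nn.Embedding(num_embeddings=100, embedding_dim=d)
optimizer = torch.optim.Adam(unseen_kb_embeddings.parameters(), lr=1e-3)
```

With the head frozen, the optimizer adjusts only the new KB's embeddings toward the region of the shared space in which the head classifies well.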
Axiom generator
Our reasoner learns embeddings by learning to classify entailment queries
One algorithm for pseudo-randomly generating
Axioms are generated by starting with the grammar rule
The grammar rule
The grammar rule
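A simplified version of such a grammar-driven generator can be sketched as follows. The tuple encoding of concepts, the restricted rule set, and the uniform rule choice are our simplifications; the actual generator covers the full grammar described above.

```python
import random

def random_concept(rng: random.Random, atoms, depth: int):
    """Pseudo-randomly expand grammar rules into a concept expression,
    bounded by a maximum nesting depth."""
    if depth == 0:  # force a terminal at the depth limit
        return ("atom", rng.choice(atoms))
    rule = rng.choice(["atom", "not", "and"])
    if rule == "atom":
        return ("atom", rng.choice(atoms))
    if rule == "not":
        return ("not", random_concept(rng, atoms, depth - 1))
    return ("and",
            random_concept(rng, atoms, depth - 1),
            random_concept(rng, atoms, depth - 1))

def random_subsumption(rng: random.Random, atoms, depth: int):
    """A subsumption axiom, represented as a pair (C, D) for C ⊑ D."""
    return (random_concept(rng, atoms, depth),
            random_concept(rng, atoms, depth))

rng = random.Random(42)  # fixed seed for reproducibility
axiom = random_subsumption(rng, ["A", "B", "C"], depth=3)
```

Each generated axiom is then labeled by the semantic reasoner to obtain the expected class for the entailment query.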
Experimental evaluation
Goals
Although the design of the proposed approach mimics the
The main hypothesis of our research is that using a transferable reasoner head forces the embeddings to reflect the semantics of the underlying KBs in a shared way. To address it, we pose the following research question
In Section 4.3, we constructed a relaxed variant of the reasoner, hypothesising that the restricted variant may be too restrictive and impede the learning process. From this, we formulate
The overarching goal of this work is to present a method to train embeddings for
We extend
Finally, we tackle the problem of identifying what is hard for the reasoner by posing
Setup
We implemented the proposed approach in Python 3.9.7 and Cython [4] (a superset of Python that compiles to C or C++), and used PyTorch 1.10.1 to implement the neural reasoner [68]. The source code is available at
We use pseudo-randomly generated numbers to create our data sets and in training. For data generation we use the NumPy implementation of the Permuted Congruential Generator [66]. The internal PyTorch pseudo-random number generator is used during training. To ensure reproducibility of our experiments, we set the initial states of the pseudo-random number generators to a known initial value.
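For instance, with NumPy the Permuted Congruential Generator state can be fixed as follows (the seed value here is arbitrary, not the one used in our experiments); the analogous call on the PyTorch side is torch.manual_seed.

```python
import numpy as np

SEED = 2022  # illustrative seed; any fixed known value works

# NumPy's PCG64 bit generator is the Permuted Congruential Generator [66].
rng = np.random.default_rng(np.random.PCG64(SEED))
sample = rng.standard_normal(5)

# Re-creating the generator with the same seed reproduces the same stream,
# which is what makes the experiments reproducible.
rng_again = np.random.default_rng(np.random.PCG64(SEED))
assert np.allclose(sample, rng_again.standard_normal(5))
```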
During training, our reasoner uses inferences made by a semantic reasoner as the expected class in classification. The Semantic Web community traditionally uses Java as the language of choice, a language which does not interface well with Python. In particular, using either HermiT [61] or Pellet [83], state-of-the-art semantic reasoners, required executing the reasoners as separate Java Virtual Machine (JVM) processes, one process per inference. This introduced unacceptable overhead, which we alleviated by employing FaCT++ [89], a semantic reasoner implemented in C++, which makes interfacing with Python much easier. We improved a Python interface for FaCT++ by wrobell,2
In the experiments dealing with OWL ontologies, initially we used Owlready2 [47] to parse OWL ontologies in the RDF/XML format. Eventually, to improve performance and avoid some problems with the parser, we switched to converting the ontologies to the OWL functional-style syntax [69] using the ROBOT command-line tool [40], and parsing them using a custom, high-performance parser implemented in Cython.
Both our parser and the extended version of the Python interface for FaCT++ are available at
As the quality of the learned embeddings cannot be measured directly, we perform a so-called extrinsic evaluation by computing classification metrics for a trained reasoner. We stipulate that if the classification metrics indicate good performance, then the embeddings learned by the embedding layer must capture the semantics of a given KB well. If that were not the case and the embeddings did not encode useful information for inference in the
For simplicity, in threshold-sensitive metrics we choose a threshold of 0.5, i.e.,
Our classifier learns binary classification, so we chose appropriate metrics [86]. Firstly, we include accuracy (Equation (16)) in the set of evaluation metrics. Because the data set that we generated for the experiments is slightly imbalanced, we also compute precision (Equation (17)), recall (Equation (18)), and the F1-score (Equation (19)).
We also compute the values of the area under the ROC curve (AUC-ROC) and the area under the PR curve (AUC-PR), which are threshold-invariant metrics with a minimum value of 0 and a maximum value of 1. The ROC curve shows the performance of a classification model at all classification thresholds, by plotting the true positive rate (TPR; also called recall) against the false positive rate (FPR) (Equation (20)). Similarly, the PR curve shows the performance of a model at all thresholds by plotting precision against recall [79]. For both ROC and PR curves, a larger area under the curve suggests a better classifier [26,79].
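The threshold-dependent metrics follow directly from the confusion matrix, as in this plain-Python sketch (equation numbers refer to the text; in practice a library such as scikit-learn also provides the threshold-invariant AUC-ROC and AUC-PR):

```python
def binary_metrics(y_true, y_pred):
    """Confusion-matrix based metrics for binary classification
    (cf. Equations (16)-(19))."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```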
In experiments leveraging multiple KBs, we compute all metrics separately for each query set, and then compute the average and standard deviation across KBs. We do this because the KBs in the test set may be unequally difficult to classify, so it is beneficial to measure the variance of the reasoner performance across different KBs.
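The per-KB aggregation amounts to averaging each metric across the query sets of the individual KBs and reporting the spread, e.g.:

```python
from statistics import mean, stdev

def per_kb_summary(metric_by_kb):
    """Average a metric across KBs and report its standard deviation,
    so that unequal per-KB difficulty is visible in the results."""
    values = list(metric_by_kb.values())
    return mean(values), stdev(values)

# Hypothetical AUC-ROC values for three KBs from a test set.
avg, sd = per_kb_summary({"kb1": 0.9, "kb2": 0.8, "kb3": 1.0})
```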
Data sets
Synthetic data set
Finding a large number of
For each KB, we also randomly choose the number of terminological axioms
For each knowledge base
We assign the first 40 KBs as the training set and the remaining 20 KBs as the test set. We then create a validation set from 20% of the queries from the training set. In total, there are 64,000 queries in the training set, 16,000 queries in the validation set, and 40,000 queries in the test set. In every data set, approximately 21.5% of queries have class
Data set of real-world ontologies
We ignore object property axioms, since role axioms are not expressible in
We ignore axioms with number or value restrictions, since they are not expressible in
We ignore individuals and ABox axioms, since our reasoner does not support ABox reasoning.
We remove roles that do not appear in any TBox axiom kept after pre-processing. This is done because of a limitation in FaCT++, which makes it raise an exception, when constructing a concept with an unused role name.
The pizza ontology in the RDF/XML format is available at
The ontology contains 99 classes and 8 object properties, which are equivalent to concept names and role names, respectively. There are 15 equivalence axioms and 15 class disjointness axioms. Two of the classes are unsatisfiable.
The preprocessing is not without some influence on the ontology:
Software Ontology5 Path in the repository:
Stuff ontology6 Path in the repository:
African Wildlife Ontology7 Path in the repository:
Dementia Ambient Care ontology8 Path in the repository:
Ontology of Datatypes9 Path in the repository:
The ontologies were preprocessed according to the procedure described earlier.
In this experiment, we consider
We test our architecture by first training the reasoner head and embedding layers on the training set (as usual), which results in a skillful reasoner. Then we create a no-skill reasoner head, that is not trained, but only randomly initialized. For each of the two reasoner heads, we train embedding layers on the test set, but keep the reasoner head weights frozen. After training the embedding layers on the test set for a fixed number of epochs, if the classification metrics for the reasoner with the trained head are significantly greater than for the reasoner with the random head, then the trained reasoner head learned useful relations in the
Conversely, if the differences between classification metrics of both reasoner heads are not significant, then the trained reasoner actually has little or no skill and the only skill in classification comes from the embedding layer, which would mean that the reasoner head is not transferable.
Training details
The embedding dimension is set to
We train both reasoner variants with mini-batch gradient descent for 15 epochs, which was enough for the validation loss to stop decreasing. During testing, we train the embedding layers (while the reasoner head weights are frozen) for 10 epochs, as the test loss stabilized after that. We set the batch size to 32, as small batch sizes have been shown to improve generalization [58]. Unless stated otherwise, all weights are randomly initialized using the Xavier initialization [29].
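Xavier initialization [29] draws each weight uniformly at a scale set by the layer's fan-in and fan-out; a NumPy sketch is shown below (PyTorch provides this directly as nn.init.xavier_uniform_).

```python
import math
import numpy as np

def xavier_uniform(fan_in: int, fan_out: int, rng=None):
    """Xavier/Glorot uniform initialization: weights drawn from U(-a, a)
    with a = sqrt(6 / (fan_in + fan_out)), keeping the variance of
    activations roughly constant across layers."""
    rng = rng or np.random.default_rng(0)
    a = math.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_out, fan_in))

# Illustrative layer sizes, not those of the actual reasoner head.
W = xavier_uniform(64, 16)
```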
Evaluation of reasoning ability
We evaluate the reasoning ability of the restricted variant of our reasoner by monitoring the training and validation loss and AUC-ROC values during training. The only goal of this assessment is to verify that the reasoner can effectively learn to classify entailment in the training set, and does not suffer from underfitting. Training and validation losses steadily decreasing, and validation AUC-ROC increasing with each training epoch, suggests that the reasoner architecture is sufficient for learning to classify entailment in a given data set.

Training and test progress of the restricted reasoner. The reported training loss for each epoch is the average mini-batch loss in that epoch. We also show the standard deviation of the mini-batch loss for each epoch. In test progress, the trained reasoner is shown in red, while the random reasoner is shown in blue. The reasoner was
As shown in Fig. 3, the training loss decreases and validation AUC-ROC increases for the entirety of the training, with smaller gains after epoch 10. Furthermore, the validation loss does not start increasing in later epochs, which suggests that the restricted reasoner variant is not prone to overfitting.
After training the restricted reasoner, we froze the reasoner head and trained embedding layers on the test set. We also trained separate embedding layers in conjunction with a randomly initialized, frozen reasoner head. The test loss decreased quickly for the reasoner with the trained head, while the loss for the random head decreased so slowly that its test loss curve appears stationary on the plot. For the reasoner with the trained head, the test AUC-ROC quickly increased to almost 0.8 after epoch 2, and approached 1 after the last epoch.
Overall, training embedding layers for the reasoner with the trained head was much faster than for the reasoner with the randomly initialized head, as the reasoner with the trained head achieved an average AUC-ROC greater than 0.8 after epoch 2, while it took the reasoner with the random head 10 epochs to do the same. Furthermore, the trained head allowed the reasoner to achieve an average AUC-ROC close to 1 on the test set after 10 epochs, while the reasoner with the random head only achieved an average AUC-ROC of around 0.8.
The metrics after the last training epoch on the test data are reported in Table 2. The extremely high recall of the reasoner with the randomly initialized head is an artifact of it classifying most queries as class 1. Given the very low precision of the reasoner with the random head, its high recall should be ignored. The restricted reasoner with the trained head has lower variance of AUC-ROC and accuracy than the reasoner with the random head. However, the restricted reasoner with the random head has lower variance for the F1-score, precision, and recall. The trained reasoner outperforms the random reasoner on all measures except recall.
Test set metrics for the restricted reasoner. Metric values were averaged across different KBs in the test set. In addition to the averages, the standard deviation values are shown
Test set metrics for the restricted reasoners with varying embedding size () and the number of neurons in the reasoner head (k). Metric values were averaged across different KBs in the test set
To determine the effect of the embedding size and the number of neurons in the reasoner head we set the embedding size to 1 and/or the number of neurons to 1, and repeated the training procedure we described earlier. A reasoner with 1 neuron is a linear classifier and thus must rely on the embeddings to convey almost all the necessary information. Conversely, an embedding of size 1 is sufficient if all the necessary information is stored in the reasoner.
We report the results in Table 3. Following [41], we used repeated measures ANOVA on the AUC-ROC and obtained a p-value below 0.001, i.e., for at least one pair
The variant with
Answering
In this experiment we answer
Evaluation of reasoning ability
First, we trained the relaxed reasoner on the training set. The training progress for this variant is shown in Fig. 4. The training loss decreases and the validation AUC-ROC increases for the entirety of the training, with smaller gains after epoch 10. The validation loss decreases until epoch 10, and then starts increasing, which indicates overfitting.
When comparing Fig. 4 with Fig. 3, we observe that the AUC-ROC metric for the validation set increases more slowly for the relaxed reasoner than for the restricted reasoner. We attribute the faster convergence of the restricted variant to the shared concept constructor networks
Evaluation of knowledge transfer

Training and test progress of the relaxed reasoner. The reported training loss for each epoch is the average mini-batch loss in that epoch. We also show the standard deviation of the mini-batch loss for each epoch. In test progress, the trained reasoner is shown in red, while the random reasoner is shown in blue. The reasoner was
After training the relaxed reasoner, we froze the reasoner head, and trained embedding layers on the test set. We also trained embedding layers in conjunction with a randomly initialized reasoner head. The test progress is shown on the right side of Fig. 4. At the beginning, the test loss for the trained reasoner head was higher than the test loss for the randomly initialized reasoner head. The test loss decreased quickly for the reasoner with the trained head, while the loss for the random head decreased very slowly. For the reasoner with the trained head, the test AUC-ROC quickly increased to about 0.7 after epoch 2, and approached 1 after the last epoch. The average test AUC-ROC for the random head slowly increased from around 0.5 in the beginning to around 0.8 after the last epoch.
Overall, training embedding layers for the relaxed reasoner with the trained head was much faster than for the reasoner with the randomly initialized head. Moreover, the trained head allowed the reasoner to achieve an average AUC-ROC close to 1 on the test set after 10 epochs, while the reasoner with the random head only achieved an average AUC-ROC of around 0.8.
We report the final metrics on the test data in Table 4. For the relaxed architecture, the trained model is strictly better than the randomly initialized one, as all metric values are higher for the former. Moreover, the trained model has lower variance of metrics across KBs in the test set than the random model. As with the restricted reasoner, the reasoner with the trained head outperforms the reasoner with the random head by a fair margin.
Comparing the metrics for the restricted reasoner in Table 2 and for the relaxed reasoner in Table 4, we observe that, as expected, for both variants the reasoners with trained heads achieved superior performance on the test set, which shows that both variants are indeed transferable.
Between the two random reasoners, the relaxed variant achieves better metric values, except for recall, due to the very high recall of the restricted random reasoner. This was expected: in the relaxed variant, the complement and intersection constructor networks can adjust to the randomly initialized classifier to minimize the classification error, while in the restricted variant this is not possible.
Test set metrics for the relaxed reasoner. Metric values were averaged across different KBs in the test set. In addition to averages, the standard deviation values are shown
When comparing the trained reasoners, the relaxed variant achieves slightly higher values with lower variance for all metrics (except precision, for which the restricted variant has slightly lower variance). However, the differences are not statistically significant. For each metric, we conducted a paired t-test to check whether the metric values for KBs from the test set differ significantly between the relaxed and restricted reasoners. The resulting
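The paired t-test compares the two variants KB by KB on the same test set. The test statistic can be sketched as follows (the sample values are hypothetical; the p-value additionally requires the t distribution with n − 1 degrees of freedom, e.g., via scipy.stats.ttest_rel):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(xs, ys):
    """t statistic of a paired t-test: the mean of the per-pair differences
    divided by its standard error."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical per-KB metric values for two reasoner variants.
t = paired_t_statistic([1.0, 2.0, 4.0], [0.0, 1.0, 2.0])
```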
Performance metrics aside, the restricted reasoner has an advantage over the relaxed reasoner – it has fewer learnable parameters. Since one of the main goals of our reasoner is transfer learning, a lower number of parameters in the embedding layer is preferable, as it speeds up training for new KBs. The average embedding layer training time per epoch was approximately 21.72 seconds for the restricted variant and 26.84 seconds for the relaxed variant, which is about 24% slower.
Answering
The previous experiments showed that our reasoner is capable of learning good concept embeddings for synthetic KBs, so the next step is to answer
Training set
The data set for this experiment consists of 32,000 unique random queries that we generate using the algorithm described in Section 5. We set the maximum axiom depth to
Training procedure
As mentioned, we repeat learning three times, which results in three reasoners:
In every run of the experiment we trained the model for 30 epochs, with learning rate set to
We expected the reasoner with the unfrozen head to achieve the best metric values and to learn the best embeddings of the three reasoners, because with no frozen weights it can fit the data set more closely than the reasoners with frozen heads. The only obstacle to learning good embeddings with an unfrozen head is the randomly generated queries, which may not contain entailments that are useful for the pizza ontology; however, the other two reasoners learn from the same data set, so the comparison is fair.
Based on the results of Experiment 1, we expected the reasoner with transfer to achieve higher classification metrics than the reasoner with randomly initialized frozen head. In Experiment 1, the trained reasoner head was better at classifying queries for the unseen KBs from the test set, than the randomly initialized reasoner head, so we expected the same to be true for the pizza ontology.
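The three configurations can be sketched in PyTorch as follows. The module shapes and names are illustrative stand-ins, not the actual architecture from Section 4; the point is only how freezing restricts gradient flow to the embedding layer.

```python
from typing import Optional

import torch.nn as nn

def build_reasoner(freeze_head: bool, pretrained_head: Optional[nn.Module] = None):
    # Stand-in head and per-KB embedding layer (illustrative sizes).
    head = pretrained_head if pretrained_head is not None else nn.Linear(32, 1)
    embeddings = nn.Embedding(100, 32)  # always trainable
    if freeze_head:
        for p in head.parameters():
            p.requires_grad = False  # gradients flow only into the embeddings
    return head, embeddings

# Unfrozen: build_reasoner(False)
# Frozen, randomly initialized: build_reasoner(True)
# Frozen, pre-trained: build_reasoner(True, pretrained_head=saved_head)
```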
Evaluation of reasoning ability
Classification metrics for reasoners in Experiment 3 after training for 30 epochs
The classification metrics of the three reasoners are shown in Table 6. As expected, the unfrozen reasoner achieves strictly better classification performance than the reasoners with frozen heads: all of its metric values are the highest. The reasoner with the frozen pre-trained head is second-best, and is strictly better than the reasoner with the randomly initialized head.
The reasoner with the randomly initialized frozen head was the worst of the three reasoners, although it is better than a random guesser, with AUC-ROC of around 0.90. This reasoner has very high recall, but relatively low precision, which is reflected in the significantly lower F1-score and AUC-PR, when compared to the better reasoners. Even though the reasoner with the randomly initialized head was the worst of the three, it still achieved relatively high metric values, which shows that good embeddings can compensate for a bad reasoner head.
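The reported metrics can be computed from entailment labels and raw reasoner scores with scikit-learn. The labels and scores below are toy data for illustration; note that AUC-ROC and AUC-PR use the scores before thresholding, while precision, recall, and F1 use the thresholded predictions.

```python
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, average_precision_score)

# Toy data: does the axiom hold (y_true), and the reasoner's score (y_score).
y_true  = [1, 0, 1, 1, 0, 0, 1, 0]
y_score = [0.9, 0.2, 0.7, 0.6, 0.65, 0.1, 0.8, 0.3]
y_pred  = [int(s >= 0.5) for s in y_score]  # thresholded at 0.5

metrics = {
    "precision": precision_score(y_true, y_pred),
    "recall":    recall_score(y_true, y_pred),
    "f1":        f1_score(y_true, y_pred),
    "auc_roc":   roc_auc_score(y_true, y_score),          # raw scores
    "auc_pr":    average_precision_score(y_true, y_score),  # raw scores
}
```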
Above, we discussed the differences between the three reasoners trained in Experiment 3. Considering the differences between the reasoners with frozen heads, we conclude that the transfer of knowledge from the randomly generated KBs of Experiment 1 to the pizza ontology was a success. The reasoner with the pre-trained head achieved results similar to the unfrozen reasoner, which is very promising, given that the pizza ontology is certainly different from the randomly generated KBs used as the training set in Experiment 1.
It should also be mentioned that the total training time of the reasoners with frozen heads is shorter than that of the unfrozen reasoner, because no time is spent updating the reasoner head weights.
Embedding analysis
In addition to evaluating the reasoning and transfer ability of our reasoner, which showed that it can successfully learn to classify entailment in a real-world KB, we visualize the learned embeddings. A 2D visualization that tries to preserve the distances between concepts in the high-dimensional embedding space enables visual assessment of the quality of the learned embeddings. We think that if semantically similar concepts are placed closer to each other than to dissimilar concepts, then the learned embeddings capture the semantics of a given KB.
We set the concept embedding dimension to
The UMAP visualizations of the embeddings learned in this experiment are shown in Fig. 5. The visualization of embeddings learned by the unfrozen reasoner suggests that the embeddings capture the semantics of the pizza ontology well. As expected, the unsatisfiable
The embeddings learned by the reasoner with the pre-trained head look somewhat worse than those of the unfrozen reasoner. The pizzas and the toppings are separated, but within the topping and pizza clusters the embeddings are not as well organized.
The embeddings learned by the reasoner with the randomly initialized head do not look good when visualized. A few concept names form clusters that make sense, but most look like they are randomly scattered. General concepts are close to the top concept, which is good, but the unsatisfiable
In general, we think that the embeddings for the unfrozen reasoner are the best, the embeddings for the reasoner with the frozen pre-trained head are good, and the embeddings for the reasoner with the frozen randomly initialized head are not good at all. Again, the results of the visual assessment of the learned embeddings are consistent with the classification metrics of the reasoners.
Overall, the presented results indicate that the answer to

UMAP visualization of the learned concept embeddings. We replaced “Topping” in concept names with “T” to improve readability. We use different shapes, sizes, and colors of markers as follows: by default, concepts are square and gray; the top concept, bottom concept, and concept expressions are cyan; toppings are diamond-shaped, and pizzas are round; vegetarian pizzas and toppings are light green, except vegetable toppings, which are dark green; seafood toppings are pink; non-vegetarian pizzas and meat toppings are dark red; cheesy pizzas and cheese toppings are marked with a yellow disk or yellow diamond inside, respectively; pepper toppings are marked with an orange diamond inside; spicy things are marked with a red “x” inside; spiciness levels are light red.
In Experiment 3 we found that a reasoner with a frozen pre-trained head achieves similar classification metrics to a reasoner trained from scratch. To see whether this result holds for a more diverse set of ontologies we posed
Training set
In this experiment we use six real-world ontologies: five ontologies from the CQ2SPARQLOWL data set, and the pizza ontology. For each ontology, we generate 32,000 unique random queries according to the algorithm described in Section 5. We set the maximum axiom depth to
Training procedure
Similarly to Experiment 3, we repeat learning three times, which results in three reasoner heads, each with six embedding layers.
We use exactly the same reasoner architecture and training setup as in Experiment 3, except that in this experiment each mini-batch can contain queries for any of the six ontologies, as the embedding layers are trained in parallel.
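The setup of one shared head with per-ontology embedding layers can be sketched in PyTorch as below. The module names and sizes are illustrative; for brevity the two concept embeddings are concatenated here, whereas the actual model combines them via an outer product.

```python
import torch
import torch.nn as nn

ONTOLOGIES = ["AWO", "DemCare", "Stuff", "SWO", "OntoDT", "Pizza"]

# One embedding layer per ontology (illustrative vocabulary size / dimension).
embedding_layers = nn.ModuleDict({o: nn.Embedding(200, 32) for o in ONTOLOGIES})

# A single classifier head shared by all six ontologies.
shared_head = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))

def score(ontology, lhs_idx, rhs_idx):
    # Look up both sides of the subsumption query in the ontology's own
    # embedding layer, then classify with the shared head.
    emb = embedding_layers[ontology]
    pair = torch.cat([emb(lhs_idx), emb(rhs_idx)], dim=-1)
    return torch.sigmoid(shared_head(pair)).squeeze(-1)
```

Because only the lookup depends on the ontology, a mini-batch mixing queries from several ontologies updates each embedding layer from its own queries while the shared head accumulates gradients from all of them.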
Results
The metric values for the pizza ontology are slightly different from the ones in Experiment 3, as this time, the reasoner head was trained on six ontologies at once, in contrast to exclusively training on the pizza ontology. We report the classification metrics for each reasoner head, averaged over all ontologies, in Table 7, and we give the values for each ontology separately in Appendix A.
Classification metrics for reasoners for the AWO, Dem@Care, Stuff, SWO, OntoDT, and Pizza ontologies. Values were averaged over all ontologies
In Table 8 we report the
The results indicate that using a pre-trained head is a viable choice for real-world ontologies, as the performance drop is negligible, and thus the answer to
One can wonder what types of axioms are particularly hard for the reasoner. To answer this, we started with the reasoner and the test set from Experiment 1. Then, for each query in the test set, we computed the following features:
Next, we computed the Pearson correlation coefficient between each feature and the classification error of the reasoner (before thresholding). The absolute values of the obtained coefficients (Table 9) were all below 0.05, indicating that none of the features has a significant impact on the performance of the reasoner.
The Pearson correlation coefficient between each feature and the classification error of the reasoner
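The per-feature analysis can be sketched with SciPy; the feature values and errors below are randomly generated stand-ins, not the actual test-set data:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
depth = rng.integers(1, 6, size=200).astype(float)  # hypothetical per-query axiom depths
error = rng.uniform(0.0, 0.2, size=200)             # |score - label| before thresholding

# r near 0 means the feature does not linearly predict the error;
# in the paper, |r| < 0.05 held for every feature.
r, p = pearsonr(depth, error)
```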
Currently, we do not have a decisive answer to
In this work, we introduced a novel method of learning data-driven concept embeddings, called reason-able embeddings, in
We also show that using recursive neural networks to construct embeddings of arbitrarily complex concepts obviates the need for manually designed concept vectorization schemes and avoids the pitfalls of recurrent neural networks operating on textual representations of concepts. Instead, concept embeddings can be learned in a data-driven way, by simply asking entailment queries for a given knowledge base.
Finally, we show that a significant part of our reasoner is transferable across knowledge bases in the
We hope that our neural reasoner architecture will allow for greater use of knowledge in models based on neural networks, both by providing an effective way of learning concept embeddings, and learning an accurate entailment classifier for knowledge bases in description logics, thus, making a small step towards the integration of the neural and symbolic paradigms in artificial intelligence.
We identified many opportunities to improve, extend, and apply our neural reasoner that were out of scope for this work but look like promising avenues for future research. In our work we used small neural networks, but deeper and wider concept constructor networks and subsumption entailment classifier networks could be examined. The number of parameters in the reasoner could also be reduced while preserving the quality of embeddings and the accuracy of entailment classification. Currently, the number of parameters scales quadratically with the embedding dimension, because the reasoner uses the outer product of embeddings as an input to its neural networks.
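The quadratic scaling can be seen from the shapes alone. The sketch below is illustrative (the dimension and hidden size are not the paper's actual values): a head consuming the flattened outer product of two d-dimensional embeddings sees d*d inputs, so its first dense layer with h hidden units holds d*d*h weights, versus 2*d*h if the embeddings were merely concatenated.

```python
import numpy as np

d = 32  # embedding dimension (illustrative)
a, b = np.ones(d), np.ones(d)

# The outer product of two d-vectors is a d-by-d matrix; flattened, it
# becomes the d*d-dimensional input to the head's first dense layer.
flat_outer = np.outer(a, b).reshape(-1)
assert flat_outer.shape == (d * d,)

h = 16  # hidden units (illustrative)
weights_outer  = d * d * h  # quadratic in d
weights_concat = 2 * d * h  # linear in d
```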
Finally, it would be interesting to see if recursive neural networks could be applied in reverse to how we use them – to generate concepts, given learned concept embeddings. That would make it possible to not only learn concept embeddings by classifying entailment, but also to induce new concepts by sampling the embedding space, e.g., to construct a scalable algorithm for explainable artificial intelligence [81].
Acknowledgements
This paper is, in part, a summary of the master thesis of Dariusz Max Adamski, done under the supervision of Jedrzej Potoniec. This research was partially supported by TAILOR, a project funded by EU Horizon 2020 research and innovation programme under GA No 952215. We would like to thank Prof. Agnieszka Ławrynowicz for helpful feedback on our work.
Detailed results of Experiment 4
In Table 10 we report the values of classification metrics separately for each of the six ontologies considered in Experiment 4, and for each of the three runs.
Detailed statistics of the synthetic dataset
In Tables 11–13, we report detailed statistics about the queries of the synthetic dataset, introduced in Section 6.4.1. We computed them separately for subsumption queries and for disjointness queries (see Equation (13)). Moreover, for subsumption queries, we also computed them for only the left-hand sides (LHS) and only for the right-hand sides (RHS), since subsumptions are not commutative.
We computed the following statistics:
We report statistics requiring aggregation in the following format:
