Abstract

Using knowledge graph embedding models (KGEMs) is a popular approach for predicting links in knowledge graphs (KGs). Traditionally, the performance of KGEMs for link prediction is assessed using rank-based metrics, which evaluate their ability to give high scores to ground-truth entities. However, the literature claims that the KGEM evaluation procedure would benefit from adding supplementary dimensions to assess. That is why, in this paper, we extend our previously introduced metric Sem@K that measures the capability of models to predict valid entities w.r.t. domain and range constraints. In particular, we consider a broad range of KGs and take their respective characteristics into account to propose different versions of Sem@K. We also perform an extensive study to qualify the abilities of KGEMs as measured by our metric. Our experiments show that Sem@K provides a new perspective on KGEM quality. Its joint analysis with rank-based metrics offers different conclusions on the predictive power of models. Regarding Sem@K, some KGEMs are inherently better than others, but this semantic superiority is not indicative of their performance w.r.t. rank-based metrics. In this work, we generalize conclusions about the relative performance of KGEMs w.r.t. rank-based and semantic-oriented metrics at the level of families of models. The joint analysis of the aforementioned metrics gives more insight into the peculiarities of each model. This work paves the way for a more comprehensive evaluation of KGEM adequacy for specific downstream tasks.

Keywords

Knowledge graph embeddings, link prediction, semantic web, model evaluation, semantic-oriented metrics

1. Introduction

A knowledge graph (KG) is commonly seen as a directed multi-relational graph in which two nodes can be linked through potentially several semantic relationships. More formally, a knowledge graph is a tuple $KG = (E, R, T)$, where $E$, $R$ and $T \subseteq E \times R \times E$ are a set of entities (nodes), relations (edge labels) and triples, respectively. A KG is represented as a collection of such triples – a.k.a. facts – denoted as $(h, r, t) \in T$ where $h \in E$ and $t \in E$ are two entities of the graph and are respectively named the head and tail of the triple, while $r \in R$ is a predicate that qualifies the nature of the relationship holding between these entities. For instance, in the sample KG depicted in Fig. 1: $E =$ {BarackObama, MichelleObama, EmmanuelMacron, USA, France} and $R =$ {spouseOf, presidentOf, supports, livesIn}.

Fig. 1.

Excerpt of a KG containing some influential political figures and relations holding between them.

KGs are inherently incomplete, incorrect, or overlapping and thus major refinement tasks include entity matching, question answering, and link prediction [48,72]. The latter is the focus of this paper. Link prediction (LP) aims at completing KGs by leveraging existing facts to infer missing ones. In the LP task, one is provided with a set of incomplete triples, where the missing head (resp. tail) needs to be predicted. This amounts to holding a set of triples $T^{'}$ where, for each triple, either the head h or the tail t is missing. This task can be subdivided into a head prediction phase – which consists in predicting the most plausible head h for each $(?, r, t)$ – and a tail prediction phase – which consists in predicting the most plausible tail t for each $(h, r, ?)$ . In the sample KG depicted in Fig. 1, an example of triple to be predicted during the tail prediction phase would be (EmmanuelMacron,livesIn,?), where the expected tail to be inferred is France. Training a Knowledge Graph Embedding Model (KGEM) firstly requires corrupting existing triples by replacing either their head h or their tail t with another entity to generate negative counterparts. This procedure is called negative sampling [7,23,30]. Secondly, the KGEM iteratively learns to assign higher scores to true triples than to their negative counterparts.
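The corruption step described above can be sketched as follows. This is a minimal illustration, assuming uniform sampling over all entities and rejection of candidates that are already observed facts; the function name `corrupt` and the toy triples (loosely inspired by Fig. 1, not copied from it) are hypothetical, and actual KGEM implementations may sample negatives differently.

```python
import random

def corrupt(triple, entities, known_triples, rng):
    """Replace the head or the tail of `triple` with a random entity."""
    h, r, t = triple
    while True:
        e = rng.choice(entities)
        # Corrupt the head or the tail with equal probability (assumption).
        candidate = (e, r, t) if rng.random() < 0.5 else (h, r, e)
        # Reject candidates that are observed facts, including `triple` itself.
        if candidate not in known_triples:
            return candidate

# Hypothetical toy KG.
entities = ["BarackObama", "MichelleObama", "EmmanuelMacron", "USA", "France"]
known_triples = {
    ("BarackObama", "spouseOf", "MichelleObama"),
    ("BarackObama", "livesIn", "USA"),
    ("EmmanuelMacron", "livesIn", "France"),
}

rng = random.Random(0)
negative = corrupt(("EmmanuelMacron", "livesIn", "France"), entities, known_triples, rng)
print(negative)
```

Note that nothing in this uniform scheme prevents a semantically invalid negative such as (EmmanuelMacron, livesIn, BarackObama) from being generated.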

The performance of KGEM for LP is evaluated using rank-based metrics such as Hits@K, Mean Rank (MR), and Mean Reciprocal Rank (MRR) that assess whether ground-truth entities are indeed given higher scores [54,72]. However, various works recently raised some caveats about such metrics [5,19,64]. For instance, they are not well-suited for drawing comparisons across datasets [5]. More importantly, they only provide a partial picture of KGEM performance [5]. Indeed, LP can lead to nonsensical triples, such as (BarackObama,isFatherOf,USA), being predicted as highly plausible facts, although they violate constraints on the domain and range of relations [23,73]. KGEMs with such issues may nevertheless reach a satisfying performance in terms of rank-based metrics.

Few works propose to go beyond the mere traditional quantitative performance of KGEMs and address their ability to capture the semantics of the original KG, e.g., domain and range constraints, hierarchy of classes [22,40,49]. According to Berrendorf et al. [5], this would give a more complete picture of the performance of a KGEM. This is why we advocate for additional qualitative and semantic-oriented metrics to supplement traditional rank-based metrics and propose Sem@K to address this need. The relevance of using semantic-oriented metrics – more specifically Sem@K – is clearly visible in Fig. 2: Sem@K provides a supplementary dimension to the evaluation procedure and makes it possible to choose confidently between two models. The two models are equally good in terms of rank-based metrics, but Model A predicts entities that are semantically valid w.r.t. the range constraint.

More specifically, in this work, our goal is to assess the ability of popular KGEMs to capture the semantic profile of relations in a LP task, i.e., whether KGEMs predict entities that respect the domain and range of relations. Henceforth, we refer to this aspect as the semantic awareness of KGEMs. To do so, we build on Sem@K, a semantic-oriented metric that we previously introduced [20,21]. In [20], Sem@K was specifically defined for the recommendation task which was seen as predicting tails for a unique target relation. Sem@K was then extended in [21] to the more generic LP task, where not only tails but also heads are corrupted and all relations are considered. In the present work, we deepen the study of the semantic awareness of KGEMs by proposing different versions of Sem@K that take into account the different characteristics of KGs (e.g., hierarchy of types). Moreover, the semantic awareness of a wider range of KGEMs is analyzed – especially convolutional models. Likewise, a broader array of KGs is used, in order to benchmark the semantic capabilities of KGEMs on mainstream LP datasets. Thus, the following research questions are addressed:

RQ1: how semantic-aware are agnostic KGEMs?

RQ2: how should the evaluation of KGEM semantic awareness adapt to the typology of KGs?

RQ3: does the evaluation of KGEM semantic awareness offer different conclusions on the relative superiority of some KGEMs compared to rank-based metrics?

Accordingly, the main contributions of this work are:

to evaluate KGEM semantic awareness on any kind of KG, we extend a previously defined semantic-oriented metric and tailor it to support a broader range of KGs.

we perform an extensive study of the semantic awareness of state-of-the-art KGEMs on mainstream KGs. We show that most of the observed trends apply at the level of families of models.

we perform a dynamic study of the evolution of KGEM semantic awareness vs. their performance in terms of rank-based metrics along training epochs. We show that a trade-off may exist for most KGEMs.

our study supports the view that agnostic KGEMs are quickly able to infer the semantics of KG entities and relations.

The remainder of the paper is structured as follows. Related work is presented in Section 2. Section 3 outlines the main motivations for assessing the semantic awareness of KGEMs. Section 4 subsequently presents Sem@K, the semantic-oriented metric that fulfills this purpose. Sem@K comes in different flavors based on the typology of the datasets at hand and the intended use cases. In Section 5, we detail the datasets and KGEMs used in this work, before presenting the experimental findings in Section 6. A thorough discussion is provided in Section 7. Lastly, Section 8 outlines future research directions.

2. Related work

2.1. Link prediction using knowledge graph embedding models

Several LP approaches have been proposed to complete KGs. Symbolic approaches relying on rule-based [16,17,31,38,47] or path-based reasoning [12,32,79] are somewhat popular but are not considered in the present work. Instead, KGEMs are the focus of this paper.

In particular, this work is concerned with assessing the semantic awareness of agnostic KGEMs. The semantic awareness of such models is defined as their ability to score higher entities whose types belong to the domain and range of relations. Agnostic KGEMs are defined as KGEMs that rely solely on the structure of the KG to learn entity and relation representations. In this respect, these models differ according to several criteria, such as the nature of the embedding space or the type of scoring function [25,54,70,72]. In this work, KGEMs are considered with respect to three main families of models that are traditionally distinguished in the literature.

Geometric models are additive models that consider relations as geometric operations in the latent embedding space. A head entity h is spatially transformed using an operation τ that depends on the relation embedding r. Then the distance between the resulting vector and the tail entity t is used as a measure for assessing the plausibility of a fact $(h, r, t)$. A distance-based function δ is used to define the scoring function f of such KGEMs: $f (h, r, t) = δ (τ (h, r), t)$. A large array of geometric models is purely translational and is based on TransE [7] and its extensions. TransE was the earliest introduced geometric model for link prediction. It enforces the sum of the head and relation embeddings to lie in a close neighborhood of the tail embedding. The distance-based function δ is usually the $L_1$ or $L_2$ norm. TransE does not properly handle 1-to-N, N-to-1, or N-to-N relations [72] and yet has been found to be very efficient in multi-relational settings [9]. A myriad of translational models have been proposed since then, such as TransH [74], TransR [35], TransD [24], TransA [26] and TransG [77]. In addition to these purely translational models, some recent geometric models either replace or combine the translation operation with rotation-wise transformations to make their models even more expressive and deal with difficult relational patterns such as symmetric or anti-symmetric relations. A case in point is RotatE [61], which considers relations as rotations in a complex latent space. h, r and t are all embedded in $\mathbb{C}^{d}$ where d denotes the embeddings’ dimension. The use of the rotation operation allows RotatE to properly address many relational patterns. In particular, RotatE is able to account for symmetry – which is not modeled correctly by TransE. In the wake of RotatE, a handful of other roto-translational models have subsequently been proposed in the literature, e.g. QuatE [82], DualE [8] and HAKE [84].
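To make the contrast between translation and rotation concrete, the sketch below scores a triple and its reversed counterpart with TransE-like and RotatE-like functions. It is an illustration under stated assumptions: the score is taken as the negative $L_1$ distance (so that higher means more plausible), embeddings are plain Python lists, and all numeric values are made up for the example.

```python
import cmath
import math

def transe_score(h, r, t):
    """Negative L1 distance between h + r and t."""
    return -sum(abs(hi + ri - ti) for hi, ri, ti in zip(h, r, t))

def rotate_score(h, phases, t):
    """Negative L1 distance between the rotated head and t.
    Each relation component is the unit-modulus complex number exp(i * phase)."""
    return -sum(abs(hi * cmath.exp(1j * p) - ti) for hi, p, ti in zip(h, phases, t))

# TransE: a perfect fit for (h, r, t) cannot also fit (t, r, h) unless r = 0.
h, r, t = [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]
forward_transe = transe_score(h, r, t)
backward_transe = transe_score(t, r, h)

# RotatE: a rotation by pi is its own inverse, so both directions fit.
hc, phases, tc = [1 + 0j, 0 + 1j], [math.pi, math.pi], [-1 + 0j, 0 - 1j]
forward_rotate = rotate_score(hc, phases, tc)
backward_rotate = rotate_score(tc, phases, hc)
```

With these toy values, the TransE score drops from 0 to -4 when head and tail are swapped, whereas both RotatE scores are zero up to floating-point error, which is the behavior expected for a symmetric relation.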

Semantic matching models are named this way as they usually use a similarity-based function to define their own scoring function. Semantic matching models are also referred to as matrix factorization models or tensor decomposition models, in the sense that a KG can be seen as a 3D adjacency matrix, in other words a three-way tensor. This tensor can be decomposed into a set of low-dimensional vectors that actually represent the entity and relation embeddings. Semantic matching models are subdivided into bilinear and non-bilinear models. Bilinear models share the characteristic of capturing interactions between two entity vectors using multiplicative terms. RESCAL [45] is the earliest introduced bilinear model for LP. It models entities as vectors and relations as matrices. The components $w_{i, j}$ of the relation matrix $W_{r} \in \mathbb{R}^{d \times d}$ account for the interaction intensity between the i-th embedding component of $e_{h} \in \mathbb{R}^{d}$ and the j-th embedding component of $e_{t} \in \mathbb{R}^{d}$ under the relation r. As pointed out by the original author of RESCAL [43], the parameter complexity of this model can drastically increase when the embedding dimension is large. DistMult [81] alleviates this scalability issue by enforcing all relation matrices $W_{r}$ to be diagonal. Doing so, DistMult trades expressiveness for computational efficiency. Indeed, DistMult is not able to model anti-symmetric relations. However, DistMult still achieves state-of-the-art performance in most cases [28]. ComplEx [66] relies on complex-valued representations for both entities and relations. Because its scoring function takes the complex conjugate of the tail embedding, the ComplEx score is not symmetric in the head and tail, and the model can properly handle anti-symmetric relations. Other bilinear models with distinctive features were later proposed, such as Analogy [36], SimplE [29] and DihEdral [80] that are supposedly more expressive models. 
Another category of semantic matching models leverages interactions between entities and relations using non-bilinear operations. For instance, HolE [44] uses the circular correlation operation, while TuckER [4] relies on the Tucker decomposition.
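The difference between DistMult and ComplEx with respect to anti-symmetry can be checked on a toy example. The sketch below assumes the usual formulations, i.e. a trilinear product for DistMult and the real part of a trilinear product with a conjugated tail for ComplEx; the embedding values are arbitrary and chosen only so that the arithmetic is easy to follow.

```python
def distmult_score(h, r, t):
    """Trilinear product with a diagonal relation matrix (real-valued)."""
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))

def complex_score(h, r, t):
    """Real part of the trilinear product with the conjugated tail."""
    return sum(hi * ri * ti.conjugate() for hi, ri, ti in zip(h, r, t)).real

# DistMult: swapping head and tail never changes the score.
h, r, t = [1.0, 2.0], [0.5, -1.0], [3.0, 4.0]
dm_forward = distmult_score(h, r, t)
dm_backward = distmult_score(t, r, h)

# ComplEx: a purely imaginary relation embedding flips the sign of the score.
hc, rc, tc = [1 + 2j], [0 + 1j], [3 + 1j]
cx_forward = complex_score(hc, rc, tc)
cx_backward = complex_score(tc, rc, hc)
```

Here both DistMult scores equal -6.5, while the ComplEx scores are -5 and 5: the same pair of entities can be plausible in one direction and implausible in the other.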

Neural network-based models rely on neural networks to perform LP. In neural networks, the parameters (e.g. weights and biases) are organized into different layers, with usually non-linear activation functions between each of these layers. The first introduced neural-network based model for LP is Neural Tensor Network (NTN) [59]. It can be seen as a combination of multi-layer perceptrons (MLPs) and bilinear models [25]. NTN defines a distinct neural network for each relation. This choice of parameterization makes NTN similar to RESCAL in the sense that both models achieve great expressiveness at the expense of computational concerns. The most recent neural network-based models rely on more sophisticated layers to perform a broader set of operations. Convolutional models are by far the most representative family of such models. They use convolutional layers to learn deep and expressive features from the input data, which pass through such layers to undergo convolution with low-dimensional filters ω. The resulting feature maps subsequently go through dense layers to obtain a final plausibility score. Compared with fully connected neural networks, convolution-based models are able to capture complex relationships with fewer parameters by learning non-linear features. While ConvE [13] reshapes head entity and relation embeddings before concatenating them into a unique input matrix to pass through convolutional layers, ConvKB [42] does not perform any reshaping and also puts the tail embedding into the concatenated input matrix. Other models were later proposed, such as ConvR [27] and InteractE [67] that both process triples independently. Another branch of convolutional models also considers the local neighborhood around each central entity. These are based on Graph Neural Networks (GNNs). The most representative GNN-based model for LP is R-GCN [57]. The key idea is to accumulate messages from the local neighborhood of the central node over multi-hop relations. 
By doing so, R-GCN is better able to model a long range of relational dependencies. However, R-GCN does not outperform baselines for the LP task [15]. Subsequent models have claimed superiority over R-GCN: SACN [58] introduces a weighted GCN to adjust the amount of aggregated information from the local neighborhood and KBGAT [41] relies on attention mechanism to generate more accurate embeddings. More recently, CompGCN [68] and DisenKGAT [75] showcased impressive performance with regard to the LP task.

2.2. Combining embeddings and semantics

The possibility of using additional semantic information has been extensively studied in recent works [10,23,30,37,46,71,78]. In general, the semantic information stems directly from an ontology, originally defined by Gruber as an “explicit specification of a conceptualization” [18]. Ontologies formally describe a specific application domain of interest (e.g. education, pharmacology, etc.) in which several classes (or concepts) and relations are identified and formally specified. Ontologies support KG construction by providing a schema that specifies the nature of entities, the semantic profile of relations, and other constraints that give the KG a semantic coherence.

A significant part of the literature incorporates such semantic information to constrain the negative sampling procedure and generate meaningful negative triples [23,30]. For instance, type-constrained negative sampling (TCNS) [30] replaces the head or the tail of a triple with a random entity belonging to the same type (rdf:type) as the ground-truth entity. Jain et al. [23] go a step further and use ontological reasoning to iteratively improve KGEM performance by retraining the model on inconsistent predictions.
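As a rough illustration of the idea behind type-constrained negative sampling, the sketch below corrupts a tail only with entities that share a type with the ground-truth tail. It follows the one-sentence description above rather than the implementation in [30]; the function name and the `Person`/`Country` type assignments are hypothetical.

```python
import random

def tcns_corrupt_tail(triple, types, rng):
    """Replace the tail with a random entity sharing a type with it."""
    h, r, t = triple
    # Candidate pool: entities other than t with at least one type in common.
    pool = sorted(e for e, ts in types.items() if e != t and ts & types[t])
    return (h, r, rng.choice(pool)) if pool else None

# Hypothetical type assignments for the entities of the sample KG.
types = {
    "BarackObama": {"Person"},
    "MichelleObama": {"Person"},
    "EmmanuelMacron": {"Person"},
    "USA": {"Country"},
    "France": {"Country"},
}

negative = tcns_corrupt_tail(("EmmanuelMacron", "livesIn", "France"), types, random.Random(0))
print(negative)
```

In this toy setting, the only admissible replacement for France is USA, so the negative triple is semantically valid by construction.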

Semantic information can also be embedded in the model itself. In fact, some KGEMs leverage ontological information such as entity types and hierarchy.

Embedding models project entities and relations of a KG into a vector space. Thus, the semantics of the original KG may not be fully preserved [23,49]. As stated by Paulheim [49], because embeddings are not meant to preserve the semantics of the KG, they are not interpretable and this can severely hinder explainability in domains such as recommender systems. Consequently, Paulheim [49] advocates for semantic embeddings. Similarly, Jain et al. [22] perform a thorough evaluation of popular KGEMs to better assess whether embeddings can express similarities between entities of the same type. A key finding is that because of overlapping relations among entities of different types, fine-grained semantics cannot be properly reflected by embeddings. For instance, the task of finding semantically similar entities does not always provide satisfying results when working with entity embeddings [22].

2.3. Evaluating KGEM performance for link prediction

KGEM performance is evaluated in two stages. During the validation phase, KGEM performance is evaluated on the validation set $T_{valid}$ after regular – often uniform – intervals of epochs. This way, the best epoch is identified. During the test phase, KGEM performance is ultimately evaluated on the test set $T_{test}$ after retrieving the optimal model parameters achieved on the best epoch of validation. Whether during the validation or test phase, KGEM performance is evaluated the same way: sifting through every triple of the test (resp. validation) set, both head and tail predictions are performed. In the case of head prediction, this amounts to taking a triple $(h, r, t)$ from the test (resp. validation) set, hiding the ground-truth head entity – resulting in $(?, r, t)$ – and letting the KGEM assign a score to every possible entity as a candidate for the head position. These scores are finally ordered and reflect the plausibility of such facts. Tail prediction is performed analogously on $(h, r, ?)$ .

In both cases, the rank of the ground-truth entity from the test (resp. validation) set is used to compute aggregated rank-based metrics based on the top-K scored entities. The rank of the ground-truth entity can be determined in two different ways that depend on how observed facts – i.e. facts that already exist in the KG – are considered. In the raw setting, observed facts outranking the ground-truth are not filtered out, while this is the case in the filtered setting. For instance, assuming head prediction is performed on the given ground-truth triple (BarackObama, livesIn, USA), a KGEM may assign a lower score to this triple than to the following triple: (MichelleObama, livesIn, USA). The latter triple actually represents an observed fact. In the raw setting, this triple would not be filtered out from top-K scored triples. This can cause the evaluation procedure to not properly assess the KGEM performance. This is why in practice, the filtered setting is commonly preferred. In the present work, the filtered setting is also used.
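The difference between the raw and filtered settings can be illustrated on the head prediction example above. This is a minimal sketch: the scores are invented, the helper name `rank_of` is hypothetical, and ties are resolved optimistically, which is only one of several possible conventions.

```python
def rank_of(ground_truth, scores, observed=frozenset()):
    """Rank of the ground-truth entity among all scored candidates.
    Entities in `observed` (other known answers) are ignored."""
    target = scores[ground_truth]
    higher = [e for e, s in scores.items() if s > target and e not in observed]
    return 1 + len(higher)

# Hypothetical scores for candidate heads of (?, livesIn, USA).
scores = {
    "MichelleObama": 0.9,
    "BarackObama": 0.8,
    "EmmanuelMacron": 0.3,
    "USA": 0.1,
    "France": 0.05,
}

raw_rank = rank_of("BarackObama", scores)
filtered_rank = rank_of("BarackObama", scores, observed={"MichelleObama"})
```

In the raw setting, BarackObama is ranked second because MichelleObama outranks it; once this observed fact is filtered out, the ground-truth entity is ranked first.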

KGEM performance is almost exclusively assessed using the following rank-based metrics: Hits@K, Mean Rank (MR), and Mean Reciprocal Rank (MRR) [19]. Disagreements exist as to how and when these metrics can be used and compared properly. In the following, we recall their definitions and discuss their limits.

Hits@K (Eq. (1)) accounts for the proportion of ground-truth triples appearing in the first K top-scored triples: $$Hits@K = \frac{1}{|B|} \sum_{q \in B} \mathbb{1}[rank(q) \leq K] \tag{1}$$ where $B$ is the batch of ground-truth triples, $rank(q)$ is the position of the ground-truth triple q in the sorted list of triples, and $\mathbb{1}[rank(q) \leq K]$ yields 1 if q is ranked between 1 and K, 0 otherwise. This metric is bounded in the $[0, 1]$ range and its values increase with K, where the higher the better.

Mean Rank (MR) (Eq. (2)) corresponds to the arithmetic mean over ranks of ground-truth triples: $$MR = \frac{1}{|B|} \sum_{q \in B} rank(q) \tag{2}$$ This metric is bounded in the $[1, |E|]$ interval, where $|E|$ stands for the number of entities in the KG. The lower the MR, the better.

Mean Reciprocal Rank (MRR) (Eq. (3)) corresponds to the arithmetic mean over reciprocals of ranks of ground-truth triples: $$MRR = \frac{1}{|B|} \sum_{q \in B} \frac{1}{rank(q)} \tag{3}$$ Contrary to MR, MRR is a metric bounded in the $[0, 1]$ interval. Higher results indicate better performance. Because this metric, unlike Hits@K, does not rely on any threshold K, it is less sensitive to outliers. In addition, it is often used for performing early stopping and for tracking the best epoch during training [5,19].
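The three definitions above translate directly into code. The sketch below computes them on a hypothetical batch of three ground-truth ranks; exact fractions are used only to keep the arithmetic transparent.

```python
from fractions import Fraction

def hits_at_k(ranks, k):
    """Proportion of ground-truth triples ranked in the top K (Eq. (1))."""
    return Fraction(sum(1 for r in ranks if r <= k), len(ranks))

def mean_rank(ranks):
    """Arithmetic mean of the ranks (Eq. (2))."""
    return Fraction(sum(ranks), len(ranks))

def mean_reciprocal_rank(ranks):
    """Arithmetic mean of the reciprocal ranks (Eq. (3))."""
    return sum(Fraction(1, r) for r in ranks) / len(ranks)

ranks = [1, 3, 10]  # hypothetical ranks of three ground-truth triples

print(hits_at_k(ranks, 1), hits_at_k(ranks, 3), hits_at_k(ranks, 10))
print(mean_rank(ranks), mean_reciprocal_rank(ranks))
```

For these ranks, Hits@1 = 1/3, Hits@3 = 2/3, Hits@10 = 1, MR = 14/3 and MRR = 43/90.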

As mentioned in Section 1, these metrics present some caveats. LP is often used to complete knowledge graphs, where the Open World Assumption (OWA) prevails. KGs are incomplete and, due to the OWA, an unobserved triple used as a negative one can still be positive. It follows that traditional evaluation methods based on rank-based metrics may systematically underestimate the true performance of a KGEM [73].

In addition, the aforementioned rank-based metrics have intrinsic and theoretical flaws, as pointed out in several works [5,19,64]. For example, Hits@K does not take into account triples whose rank is larger than K. As such, a model scoring the ground-truth in position $K + 1$ would be considered equally good as another model scoring the ground-truth in position $K + d$ with $d \gg 1$. It follows that Hits@K is not a suitable metric for drawing comparisons between models [19]. MR alleviates this concern as it does not consider any threshold K. Therefore, MR allows comparing KGEM performance on the same dataset. Nonetheless, MR is sensitive to the number of KG entities (see Eq. (2)) [5]: an MR of 10 indicates very good performance if the set of entities is in the thousands, but it would indicate poor performance if the set of entities is much more restricted. Therefore, MR does not allow comparisons across datasets.

Recent works recommend using adjusted versions of the aforementioned metrics. The Adjusted Mean Rank (AMR) proposed in [1] compares the mean rank against the expected mean rank under a model with random scores. In [5], Berrendorf et al. transform the AMR to define the Adjusted Mean Rank Index (AMRI) bounded in the $[-1, 1]$ interval. This way, a value of 1 indicates an optimal performance of the model. A value of 0 indicates a model performance similar to a model assigning random scores. A negative value indicates that the model performs worse than random. However, all these attempts at producing a better evaluation framework still focus on the quantitative assessment of KGEMs, i.e. the improvement of already existing rank-based metrics.
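The adjusted metrics can be sketched as follows. The formulas used here are stated as assumptions rather than quoted from [1,5]: the expected rank of a query with n candidates under random scoring is taken to be (n + 1) / 2, AMR is the ratio of MR to the expected MR, and AMRI is 1 - (MR - 1) / (E[MR] - 1). The number of candidates per query is a made-up value.

```python
from fractions import Fraction

def expected_mean_rank(num_candidates):
    """Expected MR under random scores: mean of (n + 1) / 2 over queries."""
    return sum(Fraction(n + 1, 2) for n in num_candidates) / len(num_candidates)

def adjusted_mean_rank(ranks, num_candidates):
    """Ratio of the observed MR to the expected MR (assumed AMR form)."""
    return Fraction(sum(ranks), len(ranks)) / expected_mean_rank(num_candidates)

def amri(ranks, num_candidates):
    """Assumed AMRI form: 1 - (MR - 1) / (E[MR] - 1)."""
    mr = Fraction(sum(ranks), len(ranks))
    return 1 - (mr - 1) / (expected_mean_rank(num_candidates) - 1)

num_candidates = [99, 99, 99]  # hypothetical: 99 candidates per query

best = amri([1, 1, 1], num_candidates)
chance = amri([50, 50, 50], num_candidates)
worst = amri([99, 99, 99], num_candidates)
```

Under these assumptions, a model that always ranks the ground-truth first obtains 1, a model whose ranks match the random expectation obtains 0, and a model that always ranks the ground-truth last obtains -1.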

In Section 2.2, several approaches incorporating the semantics of entities and relations into the embeddings were mentioned. However, in such cases, the use of semantic information such as entity types and the hierarchy of classes is intended to improve KGEM performance in terms of the aforementioned rank-based metrics only. The underlying semantics of KGs is considered as an additional source of information during training but the ability of KGEMs to generate predictions in accordance with these semantic constraints is never directly addressed. This encourages further assessment of the semantic capabilities of KGEMs, as firstly suggested in [21]. In our work, we directly address this issue by assessing to what extent KGEMs are able to give high scores to triples whose head (resp. tail) belongs to the domain (resp. range) of the relation. When such information is not available – i.e. the KG does not rely on a schema containing rdfs:domain and rdfs:range properties – extensional constraints can still be used to evaluate the semantic capabilities of KGEMs, as detailed in Section 4.

3. Motivations and problem formulation

3.1. Motivating example

To motivate the use of Sem@K to evaluate KGEM quality, this section builds upon a minimalist example which is representative of the issue encountered while benchmarking the performance of several KGEMs on the same dataset. As depicted in Fig. 2, two KGEMs that have been trained on the whole training set are tested on a batch of test triples. These KGEMs are referred to as Model A and Model B. Without loss of generality, it is assumed that the test set only comprises the three triples shown in Fig. 2. For the sake of clarity, only the tail prediction pass and the top-5 ranked candidate entities are depicted in Fig. 2. It should be noted that the performance of both models is strictly equal in terms of MR, MRR and Hits@K with $K \leq 5$: MR=8/3, Hits@1=1/3, Hits@3=2/3 and Hits@5=1. MRR can only be computed knowing the total number of entities in the KG but two models having the same MR on the same dataset have de facto the same MRR as well. Distinguishing between these two models by relying solely on traditional rank-based metrics is not possible. One might even draw the misleading conclusion that these models are equally good. But they are not: Model A gives high scores to semantically valid entities with regard to the range of relations, while Model B's semantic awareness is very low. In other words, Model B does not capture the semantic profile of relations well. This case in point illustrates the need for using semantic-oriented metrics alongside common rank-based metrics, so as to better assess the overall quality of KGEMs.

Fig. 2.

Motivating example. Tail prediction is performed for the three test triples contained in the upper insert. Model A and Model B output scores for each possible entity and only the top-five ranked tail candidates are depicted here. Model A and Model B have the same Hits@1, Hits@3 and Hits@5 values. But Model A has better semantic capabilities. Green, blue and white cells respectively denote the ground-truth entity, entities other than the ground-truth and semantically valid, and entities other than the ground-truth and semantically invalid.

3.2. Problem formulation

The traditional evaluation of KGEMs solely based on rank-based metrics can be flawed for several reasons. First, KGEMs benchmarked on the same test sets can exhibit very similar results. Using only rank-based metrics, the final choice only depends on the best achieved MRR and/or Hits@K. This raises the question of whether the chosen model is actually the best one, or whether its slight superiority over other KGEMs can be due to other factors such as better hyperparameter tuning or better modeling of a relational pattern highly present in the test set. Moreover, using only rank-based metrics does not provide the full picture of KGEM quality for the downstream LP task, as some dimensions of KGEMs are left unassessed (see Section 2). In this work, contrary to the mainstream approach consisting in comparing KGEM performance exclusively in terms of rank-based metrics, the trained KGEMs are also evaluated in terms of Sem@K, which measures the ability of KGEMs to predict semantically valid triples with respect to the domain and range of relations.

4. Measuring KGEMs semantic awareness with Sem@K

The standard LP evaluation protocol consists in reporting aggregated results, considering the rank-based metrics presented in Section 2.3. As mentioned above, these metrics only provide a partial picture of KGEM performance [5]. To give a more comprehensive assessment of KGEMs, we aim at assessing their semantic awareness using our proposed metric called Sem@K [20,21]. In [20], Sem@K was specifically defined for the recommendation task which was seen as predicting tails for a unique target relation. Sem@K was then extended in [21] to the more generic LP task, where not only tails but also heads are corrupted and all relations are considered. In this work, this original formalization for LP is presented in Section 4.1, and enriched to take into account schemaless KGs (Section 4.3), or KGs with a class hierarchy (Section 4.4). As a consequence, Sem@K comes in 3 different versions (respectively denoted Sem@K[base], Sem@K[ext], and Sem@K[wup]) so as to adapt to KG typology. These distinct versions and their adequacy regarding the KG at hand are summarized in Table 1 and further detailed below. In the following, when no suffix is provided, it is assumed that we are concerned with Sem@K in general, regardless of the actual version.

Table 1
Typology of KGs and their respective adequacy for the presented Sem@K versions

KG type	Sem@K[ext]	Sem@K[base]	Sem@K[wup]
Schemaless	×
Schema-defined, w/o class hierarchy	×	×
Schema-defined, w/ class hierarchy	×	×	×

4.1. Definition of Sem@K[base]

This version of Sem@K (Eq. (4)) accounts for the proportion of triples that are semantically valid in the first K top-scored triples: $$Sem@K = \frac{1}{|B|} \sum_{q \in B} \frac{1}{K} \sum_{q' \in S_{q}^{K}} compatibility(q, q') \tag{4}$$ where, given a ground-truth triple $q = (h, r, t)$, $S_{q}^{K}$ is the list of the top-K candidate triples scored by a given KGEM (i.e. by predicting the tail for $(h, r, ?)$ or the head for $(?, r, t)$). The function $compatibility(q, q')$ (Eq. (5)) assesses whether the candidate triple $q'$ is semantically compatible with its ground-truth counterpart q. In this work, by semantic compatibility we refer to the fact that the predicted head (resp. tail) belongs to the domain (resp. range) of the relation: $$compatibility(q, q') = \begin{cases} 1, & \text{if } types(q'_{h}) \cap domain(q_{r}) \neq \emptyset \land types(q'_{t}) \cap range(q_{r}) \neq \emptyset \\ 0, & \text{otherwise} \end{cases} \tag{5}$$ where $types(e)$ returns the types of entity e and $domain(r)$ (resp. $range(r)$) is the domain (resp. range) of the relation r. $q_{r}$, $q'_{h}$, and $q'_{t}$ denote the ground-truth relation, the head and the tail of the ranked triple $q'$, respectively. It is noteworthy that with this formula, we allow entities to instantiate multiple types, and domains and ranges to be defined with multiple types. Sem@K is bounded in the $[0, 1]$ interval. Compared to Hits@K (Eq. (1)), Sem@K is non-monotonic: increasing K can lead to either lower or higher Sem@K values.
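Eqs. (4) and (5) can be sketched in a few lines. The example below is an illustration under assumptions and not the authors' implementation: the type assignments, the domain and range of livesIn, and the ranked list of candidate tails are all hypothetical, and each ranked list is assumed to contain at least K candidates.

```python
from fractions import Fraction

def compatibility(candidate, types, domain, range_):
    """Eq. (5): 1 if head and tail types intersect the domain and range."""
    h, r, t = candidate
    head_ok = bool(types.get(h, set()) & domain[r])
    tail_ok = bool(types.get(t, set()) & range_[r])
    return int(head_ok and tail_ok)

def sem_at_k(ranked_lists, k, types, domain, range_):
    """Eq. (4): mean share of semantically valid triples in each top-K list."""
    total = Fraction(0)
    for candidates in ranked_lists:
        valid = sum(compatibility(c, types, domain, range_) for c in candidates[:k])
        total += Fraction(valid, k)
    return total / len(ranked_lists)

# Hypothetical schema and typing for the sample KG.
types = {
    "BarackObama": {"Person"}, "MichelleObama": {"Person"},
    "EmmanuelMacron": {"Person"}, "USA": {"Country"}, "France": {"Country"},
}
domain = {"livesIn": {"Person"}}
range_ = {"livesIn": {"Country"}}

# Hypothetical ranking of candidate tails for (EmmanuelMacron, livesIn, ?).
ranked_tails = ["France", "BarackObama", "USA", "MichelleObama"]
ranked_lists = [[("EmmanuelMacron", "livesIn", t) for t in ranked_tails]]

sem_values = [sem_at_k(ranked_lists, k, types, domain, range_) for k in (1, 2, 3)]
```

The resulting values are 1, 1/2 and 2/3 for K = 1, 2 and 3, which illustrates the non-monotonic behavior mentioned above.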

4.2. A note on Sem@K, untyped entities, and the Open World Assumption

Traditional KGEM evaluation can be performed in all situations, regardless of whether entities are typed or whether the KG comes with a proper schema. When measuring the semantic capability of KGEMs – e.g. with Sem@K – some concerns arise. For instance, a fair question to ask is the following: how should untyped entities be considered? Indeed, in some KGs, some entities are left untyped. For example, in DBpedia the entity dbr:1._FC_Union_Berlin_players does not belong to any class other than owl:Thing. In other cases, as in DB15K and DB100K [11], entities have incomplete typing.

Under the OWA, an untyped entity should not count in the calculation of Sem@K since it is not possible to determine whether this entity has no types or no known types. Although this seems to be a fair option, it raises a major issue: it makes it possible to score different sets of entities with rank-based metrics and Sem@K, which is not desirable. If there are M untyped entities in an ordered list of ranked entities, Hits@K, MR and MRR are calculated regardless of this fact, i.e. still taking into account the M untyped entities. However, Sem@K cannot be calculated for these M untyped entities. As such, MR, MRR and Hits@K would be calculated on the original entity set, whereas Sem@K would be computed on a different set of entities. This issue would be even more acute in the case of Sem@1 when the first ranked entity is untyped: Hits@1 and Sem@1 would be calculated on two different entities, which is not acceptable. Consequently, one strategy consists in removing untyped entities from the evaluation protocol, for both rank-based and semantic-oriented metrics. By doing so, the ranked list of entities is the same for rank-based and semantic metrics.
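The filtering strategy can be sketched as follows. This is only an illustration: the helper names are hypothetical, the entity names other than dbr:1._FC_Union_Berlin_players are made up, and treating an entity typed only with owl:Thing as untyped is an assumption based on the example given above.

```python
from typing import Dict, List, Set

ROOT = "owl:Thing"  # assumption: the root class alone does not count as a type


def is_typed(entity: str, types: Dict[str, Set[str]]) -> bool:
    """An entity is typed if it instantiates at least one class besides the root."""
    return bool(types.get(entity, set()) - {ROOT})


def filter_typed(ranked: List[str], types: Dict[str, Set[str]]) -> List[str]:
    """Drop untyped entities while preserving the ranking order."""
    return [e for e in ranked if is_typed(e, types)]


def rank_of(entity: str, ranked: List[str]) -> int:
    """1-based rank of an entity in a ranked list."""
    return ranked.index(entity) + 1


# Toy ranking for one query; the first-ranked entity is untyped.
types = {
    "dbr:1._FC_Union_Berlin_players": {"owl:Thing"},
    "dbr:EntityA": {"owl:Thing", "dbo:Country"},  # hypothetical entity
    "dbr:EntityB": {"dbo:Country"},               # hypothetical entity
}
ranked = ["dbr:1._FC_Union_Berlin_players", "dbr:EntityA", "dbr:EntityB"]
filtered = filter_typed(ranked, types)

rank_before = rank_of("dbr:EntityA", ranked)
rank_after = rank_of("dbr:EntityA", filtered)
```

Both rank-based metrics and Sem@K would then be computed on `filtered`, so that Hits@1 and Sem@1 refer to the same first-ranked entity.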

4.3. Sem@K[ext] for schemaless KGs

Not all KGs come with a proper schema, e.g. relations do not appear in any rdfs:domain or rdfs:range clauses. In that particular situation, it can still be desirable to assess the semantic awareness of KGEMs. One approach is to maintain a list of all entities that have been observed as heads (resp. tails) of each relation r: $domain (r) = \{ e : \exists (e, r, t) \in T \}$ (resp. $range (r) = \{ e : \exists (h, r, e) \in T \}$ ). Hence, in this case, Sem@K[ext] is defined as in Eq. (4) and (5) but using these definitions of domain and range. Therefore, an entity will be considered as a semantically valid head (resp. tail) with respect to the relation if this entity appears as a head (resp. tail) in any other triple observed in the KG with the same relation.
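A minimal sketch of these extensional definitions is given below, assuming the KG is available as a list of (head, relation, tail) tuples. The function names and the three toy triples are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def extensional_domain_range(triples: Iterable[Triple]):
    """Collect, for each relation, the entities observed as heads and as tails."""
    domain: Dict[str, Set[str]] = defaultdict(set)
    rng: Dict[str, Set[str]] = defaultdict(set)
    for h, r, t in triples:
        domain[r].add(h)
        rng[r].add(t)
    return dict(domain), dict(rng)


def ext_compatibility(candidate: Triple, domain, rng) -> int:
    """Eq. (5) with the observed heads and tails used as domain and range."""
    h, r, t = candidate
    return int(h in domain.get(r, set()) and t in rng.get(r, set()))


# Toy KG (hypothetical triples).
kg = [("BarackObama", "presidentOf", "USA"),
      ("EmmanuelMacron", "presidentOf", "France"),
      ("BarackObama", "spouseOf", "MichelleObama")]
domain, rng = extensional_domain_range(kg)

valid = ext_compatibility(("BarackObama", "presidentOf", "France"), domain, rng)
invalid = ext_compatibility(("BarackObama", "presidentOf", "MichelleObama"), domain, rng)
```

Here France is accepted as a tail of presidentOf because it was observed in that position, whereas MichelleObama was never observed as a tail of presidentOf and is therefore rejected.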

4.4. Sem@K[wup]: A hierarchy-aware version of Sem@K

Sem@K as previously defined equally penalizes all entities that are not of the expected type. However, KGs may be equipped with a class hierarchy that, in turn, can support a more fine-grained penalty for entities depending on the distance or similarity between their type and the expected domain (resp. range) in this hierarchy. To illustrate, consider Fig. 3 that depicts a subset of the DBpedia ontology dbo class hierarchy. Using the hierarchy-free version of Sem@K for the test triple (dbr:The_Social_Network, dbo:director, dbr:David_Fincher), predicting dbr:Friends or dbr:Central_Park as head would be penalized the same way in the compatibility function. However, it is clear that an entity of class dbo:TelevisionShow is semantically closer to dbo:Film – the domain of the relation dbo:director – than an entity of class dbo:Park, and thus should be less penalized.

To leverage such a semantic relatedness between concepts in Sem@K, the compatibility function can be adapted accordingly: $\begin{matrix} (6) & compatibility (q, q^{'}) = min (max_{\begin{array}{c} c \in type (q_{h}^{'}) \\ c^{'} \in domain (q_{r}) \end{array}} σ (c, c^{'}), max_{\begin{array}{c} c \in type (q_{t}^{'}) \\ c^{'} \in range (q_{r}) \end{array}} σ (c, c^{'})) \end{matrix}$ where $σ (c, c^{'})$ measures the semantic similarity between the two classes c and $c^{'}$ based on the class hierarchy. It should be noted that in this formula $type (e)$ only returns the most specific classes instantiated by e.

Several similarity measures σ have been proposed in the literature [33,34,52,53,76]. In this work, the Wu-Palmer similarity [76] (Eq. (7)) is used: $\begin{matrix} (7) & σ (c, c^{'}) = \frac{2 \times δ (c \land c^{'}, ρ)}{δ (c, c \land c^{'}) + δ (c^{'}, c \land c^{'}) + 2 \times δ (c \land c^{'}, ρ)} \end{matrix}$ where ρ is the root of the hierarchy (e.g. owl:Thing), $δ (c, c^{'})$ is the number of edges linking c to $c^{'}$ , and $c \land c^{'}$ represents the least common subsumer of c and $c^{'}$ . The Wu-Palmer similarity is well suited to a class hierarchy and provides a good indication of the semantic relatedness between the domain (resp. range) class and the classes of the chosen entity. This gives rise to the Sem@K[wup] version.

Considering the example in Fig. 3, when the Wu-Palmer score is used in the calculation of Sem@K, the head predictions dbr:Friends and dbr:Central_Park for the ground-truth triple (dbr:The_Social_Network, dbo:director, dbr:David_Fincher) are now penalized differently. The instance dbr:Friends is of type dbo:TelevisionShow, so we have: $σ (dbo:TelevisionShow, dbo:Film) = 1 / 2$ . The instance dbr:Central_Park is of type dbo:Park, so we have: $σ (dbo:Park, dbo:Film) = 0$ . This illustrates that incorporating the Wu-Palmer score into Sem@K calculation leads to more precise semantic comparisons that take into account the available class hierarchy. It should be noted that comparing the classes of a candidate entity with the expected class can result in the same penalization as the vanilla version of Sem@K. However, in most cases, two classes do not lie in a completely disjoint part of the hierarchy of classes. As such, the Wu-Palmer score between the classes of a candidate entity and the expected class is rarely 0.
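The two values above can be reproduced with a short sketch of Eq. (6) and (7). The hierarchy below is a hand-written toy excerpt with a single parent per class; it is an assumption made for illustration and may not match Fig. 3 exactly. Returning 1 when both classes are the root (where Eq. (7) is undefined) is also an assumption.

```python
from typing import Dict, List, Set

# Toy hierarchy (assumption), encoded as child -> parent.
PARENT: Dict[str, str] = {
    "dbo:Work": "owl:Thing",
    "dbo:Film": "dbo:Work",
    "dbo:TelevisionShow": "dbo:Work",
    "dbo:Place": "owl:Thing",
    "dbo:Park": "dbo:Place",
    "dbo:Agent": "owl:Thing",
    "dbo:Person": "dbo:Agent",
}


def path_to_root(c: str) -> List[str]:
    """Classes from c up to the root, c included."""
    path = [c]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def wu_palmer(c1: str, c2: str) -> float:
    """Eq. (7), with edge counts taken along the toy hierarchy."""
    p1, p2 = path_to_root(c1), path_to_root(c2)
    ancestors2 = set(p2)
    lcs = next(a for a in p1 if a in ancestors2)  # least common subsumer
    d1, d2 = p1.index(lcs), p2.index(lcs)         # edges from each class to the LCS
    depth = len(path_to_root(lcs)) - 1            # edges from the LCS to the root
    denom = d1 + d2 + 2 * depth
    return 1.0 if denom == 0 else 2 * depth / denom


def compatibility_wup(head_types: Set[str], tail_types: Set[str],
                      domain: Set[str], rng: Set[str]) -> float:
    """Eq. (6): the weaker of the best head match and the best tail match."""
    head_sim = max(wu_palmer(c, d) for c in head_types for d in domain)
    tail_sim = max(wu_palmer(c, d) for c in tail_types for d in rng)
    return min(head_sim, tail_sim)


sim_tv = wu_palmer("dbo:TelevisionShow", "dbo:Film")
sim_park = wu_palmer("dbo:Park", "dbo:Film")
# Head prediction dbr:Friends with the ground-truth tail of type dbo:Person.
score_friends = compatibility_wup({"dbo:TelevisionShow"}, {"dbo:Person"},
                                  {"dbo:Film"}, {"dbo:Person"})
```

With this toy hierarchy, the least common subsumer of dbo:TelevisionShow and dbo:Film is dbo:Work (one edge below the root), which gives 2 / (1 + 1 + 2) = 1/2, while dbo:Park and dbo:Film only share the root, which gives 0.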

Fig. 3.

Excerpt from the DBpedia class hierarchy.

5. Experimental setting

5.1. Datasets

In order to draw reliable and general conclusions, a broad range of KGs are used in our experiments. They have been chosen due to their mainstream adoption in recent research works around KGEMs for LP and the fact that they have different characteristics (e.g. entities, relations, classes, presence of a class hierarchy). In this section, the schema-defined and schemaless KGs used in the experiments are detailed. Note that in our experiments, all the schema-defined KGs come with a class hierarchy inherited from either Freebase [6], DBpedia [2], YAGO [60], or schema.org. To meet requirements for evaluating KGEMs w.r.t. Sem@K (i.e. classes instantiated by entities, domain and range for relations), when necessary, subsets of schema-defined KGs are used, as explained in Section 5.1.1. Among the 4 schema-defined KGs presented hereafter, FB15K237-ET, DB93K and YAGO3-37K are derived from already existing KGs, while YAGO4-19K was specifically created and is made available on Zenodo (https://doi.org/10.5281/zenodo.7526244) and GitHub (https://github.com/pmonnin/YAGO4-LP). The other KGs are made available on GitHub (https://github.com/nicolas-hbt/benchmark-sematk).

5.1.1. Schema-defined KGs

The statistics of the datasets FB15K237-ET, DB93K, YAGO3-37K and YAGO4-19K used in our experiments are provided in Table 2. As discussed in Section 4.2, to create an experimental evaluation setting as unbiased and flawless as possible, the schema-defined KGs used in the experiments are filtered to keep typed entities only. This way, Sem@K is calculated under the CWA.

FB15K237-ET derives from FB15K [7] – a dataset extracted from the cross-domain Freebase KG [6]. In these experiments, we do not use FB15K, as it has been noted that this dataset suffers from test leakage, i.e., a large number of test triples can be obtained by simply inverting the position of the head and tail in the train triples [65]. The later introduced FB15K237 dataset [65] is a subset of the original FB15K without these inverse relations. To the best of our knowledge, there is no schema-defined version of FB15K237. However, a schema-defined version of FB15K is provided in [78]. Consequently, we relied on this version of FB15K and mapped the extracted entity types, relation domains and ranges to the entities and relations found in FB15K237. The resulting schema-defined version of FB15K237 is named FB15K237-ET and includes only typed entities. Besides, validation and test sets contain only triples whose relation has a well-defined domain (resp. range), as well as more than 10 possible candidates as head (resp. tail). This ensures Sem@K is not unduly penalized and can be calculated on the same set of entities as Hits@K and MRR, at least until $K = 10$ .
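The kind of test leakage described above can be checked with a simple count. The sketch below is an illustrative assumption about how such a check might look, not the procedure used in [65]; the function name and the toy triples are hypothetical.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def inverse_leakage(train: List[Triple], test: List[Triple]) -> float:
    """Share of test triples (h, r, t) for which some train triple links t to h."""
    train_pairs = {(h, t) for h, _, t in train}
    leaked = [(h, r, t) for h, r, t in test if (t, h) in train_pairs]
    return len(leaked) / len(test) if test else 0.0


# Toy splits (hypothetical): the first test triple is the inverse of a train triple.
train = [("a", "hasPart", "b"), ("c", "hasPart", "d")]
test = [("b", "partOf", "a"), ("e", "partOf", "f")]
leak_rate = inverse_leakage(train, test)
```

A high value of such a ratio signals that a model could answer many test queries by memorizing inverted training pairs.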

DB93K is a subset of DB100K, which was first introduced in [14]. A slightly modified version of DB100K has been proposed in [11]. Contrary to the initial version of DB100K, the latter version is schema-defined: entities are properly typed and most relations have a domain and/or a range. This second version is considered in the following experiments. However, some inconsistencies were found in the dataset. Some DBpedia entities only instantiate Wikidata (https://www.wikidata.org/) or schema.org (https://schema.org/) classes, even though instantiations of classes from the DBpedia ontology actually exist. Moreover, some entities are only partially typed. It must also be noted that domains and ranges of relations have been extracted from DBpedia more than two years ago. DBpedia is a community-driven, open-source project: its classification system relies on human curation, which sometimes implies a lack of coverage for some resources and updates for others. Consequently, we associated all relations of DB100K with their most up-to-date domains and ranges (SPARQL queries were run against DBpedia as of November 9, 2022). Similarly, we associated all entities with their most up-to-date classes. Finally, we removed all untyped entities as well as validation and test relations having fewer than 10 observed entities, so as not to unfairly penalize Sem@K results in the validation and test phases.

YAGO3-37K derives from the schema-defined YAGO39K dataset [37] extracted from the cross-domain YAGO3 KG [60]. Compared to the original YAGO39K, in our experiments only typed entities are kept. In addition, relations having fewer than 10 observed heads or tails in the training set are discarded from the validation and test splits, since keeping them would not reflect the actual Sem@K values. It should be noted that in the YAGO3 ontology, most relations have very generic domains and ranges which are very close to the root of the ontology hierarchy. To produce a more challenging evaluation setting of the models’ semantic awareness, a subset of hard relations was identified and only validation and test triples whose relation belongs to this subset are kept. The resulting dataset is named YAGO3-37K.
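The relation filter described above can be sketched as follows. Function names and toy triples are hypothetical, and counting distinct heads and distinct tails separately is an assumption about how "observed heads or tails" is measured; the threshold of 10 comes from the text.

```python
from collections import defaultdict
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]


def frequent_relations(train: List[Triple], min_count: int = 10) -> Set[str]:
    """Relations with at least min_count distinct heads AND distinct tails in train."""
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in train:
        heads[r].add(h)
        tails[r].add(t)
    return {r for r in heads
            if len(heads[r]) >= min_count and len(tails[r]) >= min_count}


def filter_eval(triples: List[Triple], kept: Set[str]) -> List[Triple]:
    """Keep only evaluation triples whose relation passed the filter."""
    return [x for x in triples if x[1] in kept]


# Toy example with a threshold of 2 instead of 10 to keep it short.
train = [("a", "r1", "c"), ("b", "r1", "d"), ("a", "r2", "b")]
test = [("a", "r1", "d"), ("a", "r2", "b")]
kept = frequent_relations(train, min_count=2)
test_filtered = filter_eval(test, kept)
```

Only r1 survives in this toy case, since r2 has a single observed head and a single observed tail.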

We built YAGO4-19K with several other subsets of the YAGO4 knowledge graph [63]. Similarly to other datasets, we focused on relations with a defined domain and range, and more than 10 triples to constitute the validation and test sets. We purposely favored difficult relations to feature in the validation and test sets. To enrich the training set, additional relations were added based on a manual selection. Selected relations in validation and test sets as well as additional relations in the training set can be found on the GitHub repository of the dataset. It should be noted that the class hierarchy associated with entities in YAGO4 is schema.org.

Table 2
Statistics of the schema-defined, hierarchical KGs used in the experiments

Dataset	$| E |$	$| R |$	$| C |$	$| T_{train} |$	$| T_{valid} |$	$| T_{test} |$
FB15K237-ET	14,541	237	532	271,575	15,337	17,928
DB93K	92,574	277	311	237,062	18,059	36,424
YAGO3-37K	37,335	33	132	351,599	4,220	4,016
YAGO4-19K	18,960	74	1,232	27,447	485	463

5.1.2. Extensional KGs

Another range of datasets used in these experiments do not come with an ontological schema. In particular, this means relations do not have a clearly-defined domain or range. Although Codex-S and Codex-M are based on the Wikidata schema which does possess property constraints linking subject types to value type constraints, we limit ourselves to KGs that represent this information with rdfs:domain and rdfs:range predicates. Consequently, in this work, we only report Sem@K[ext] results for Codex-S and Codex-M. The statistics of the datasets Codex-S, Codex-M and WN18RR used in our experiments are provided in Table 3.

Table 3
Statistics of the schemaless KGs used in the experiments

Dataset	$| E |$	$| R |$	$| T_{train} |$	$| T_{valid} |$	$| T_{test} |$
Codex-S	2,034	42	32,888	1,827	1,828
Codex-M	17,050	51	185,584	10,310	10,311
WN18RR	40,943	11	86,834	3,034	3,134

WN18RR [13] originates from WN18, which is a subset of the WordNet KG [39]. As for FB15K, Toutanova et al. [65] reported a huge test leakage in the original WN18 dataset. More specifically, 94% of the train triples have inverse relations that are linked to test triples. Dettmers et al. [13] removed all inverse relations to propose WN18RR. In WN18RR, entities are nouns, verbs, and adjectives. Relations such as hypernym and derivationally related form hold between observed entities. Such relations are not linked to any rdfs:domain or rdfs:range predicates. Besides, relations such as derivationally related form can have nouns, verbs or adjectives as either head or tail. It follows that it is not possible to infer any expected entity type for relations – in this case the word qualifier. This is why WN18RR is used in the extensional setting.

Codex-S [56] and Codex-M [56] are datasets extracted from Wikidata and Wikipedia. They cover a wider scope and purposely contain harder facts than most KGs [56]. Consequently, these datasets prove to be more challenging for the link prediction task. Compared to WN18RR, Codex-S and Codex-M contain entity types, relation descriptions and Wikipedia page extracts. Nonetheless, Wikidata does not contain rdfs:domain or rdfs:range predicates. The property constraints present in Wikidata are harder to manipulate than the rdfs:domain and rdfs:range clauses found in DBpedia. As such, in our experiments Codex-S and Codex-M are used in the extensional setting. Codex-S and Codex-M initially come with already generated hard negatives. In our experiments, we do not directly use these provided negative triples. Instead, we use the same Uniform Random Negative Sampling scheme as for other datasets.

5.2. Baseline models

In this work, the semantic awareness of the most popular semantically agnostic KGEMs is analyzed. More specifically, the translational models TransE [7] and TransH [74], the semantic matching models DistMult [81], ComplEx [66] and SimplE [29], and the convolutional models ConvE [13], ConvKB [42], R-GCN [57] and CompGCN [68] are considered. Note that in the analysis of the results in Section 6, a distinction will be made between pure convolutional KGEMs (ConvE, ConvKB) and GNNs (R-GCN, CompGCN). Although the latter have convolutional layers, they are able to capture long-range interactions between entities due to their ability to consider k-hop neighborhoods. The characteristics of the models used in our experiments are mentioned hereinafter and summarized in Table 4.

Table 4
Summary of the KGEMs used in the experiments

Model family	Model	Scoring function	Loss function
Geometric	TransE	$‖ e_{h} + e_{r} - e_{t} ‖_{p}$	Pairwise Hinge
	TransH	$‖ e_{h_{⊥}} + d_{r} - e_{t_{⊥}} ‖_{p}$	Pairwise Hinge
Semantic Matching	DistMult	$⟨ e_{h}, W_{r}, e_{t} ⟩$	Pairwise Hinge
	ComplEx	$Re (e_{h} ⊙ e_{r} ⊙ {\overline{e}}_{t})$	Pointwise Logistic
	SimplE	$\frac{1}{2} (⟨ e_{h}^{h}, e_{r}, e_{t}^{t} ⟩ + ⟨ e_{h}^{t}, e_{r^{- 1}}, e_{t}^{h} ⟩)$	Pointwise Logistic
Convolutional	ConvE	$g (vec (g (concat ({\hat{e}}_{h}, {\hat{e}}_{r}) * ω)) W) \cdot e_{t}$	Binary Cross-Entropy
	ConvKB	$concat (g ([e_{h}, e_{r}, e_{t}] * ω)) \cdot w$	Pointwise Logistic
	R-GCN	DistMult decoder	Binary Cross-Entropy
	CompGCN	ConvE decoder	Binary Cross-Entropy

TransE [7] is the earliest translational model. It learns representations of entities and relations such that for a triple $(h, r, t)$ , $e_{h} + e_{r} \approx e_{t}$ , where $e_{h}$ , $e_{r}$ and $e_{t}$ are the head, relation and tail embeddings, respectively. The scoring function is $\begin{matrix} (8) & f (h, r, t) = d (e_{h} + e_{r} - e_{t}) \end{matrix}$ where d is a distance function, usually the $L_{1}$ or $L_{2}$ norm.
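As a minimal sketch of Eq. (8), the function below computes the $L_{p}$ norm of $e_{h} + e_{r} - e_{t}$ on plain Python lists. The function name and the two-dimensional toy embeddings are assumptions for illustration; actual implementations operate on learned tensors.

```python
from typing import Sequence


def transe_score(e_h: Sequence[float], e_r: Sequence[float],
                 e_t: Sequence[float], p: int = 1) -> float:
    """Eq. (8): L_p norm of e_h + e_r - e_t (p = 1 or 2)."""
    diff = [h + r - t for h, r, t in zip(e_h, e_r, e_t)]
    return sum(abs(x) ** p for x in diff) ** (1 / p)


# Toy embeddings (hypothetical values).
e_h, e_r = [1, 0], [0, 1]
perfect_tail = [1, 1]   # exactly e_h + e_r
distant_tail = [4, 5]

score_perfect = transe_score(e_h, e_r, perfect_tail, p=1)
score_l1 = transe_score(e_h, e_r, distant_tail, p=1)
score_l2 = transe_score(e_h, e_r, distant_tail, p=2)
```

In this distance form, a smaller value indicates a more plausible triple: the tail located exactly at $e_{h} + e_{r}$ gets a distance of 0.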

TransH [74] is an extension of TransE. It allows entities to have distinct representations when involved in different relations. Specifically, $e_{h}$ and $e_{t}$ are projected onto relation-specific hyperplanes defined by their normal vectors $w_{r}$ . If $(h, r, t)$ holds, the projected entities $e_{h_{⊥}} = e_{h} - w_{r}^{T} e_{h} w_{r}$ and $e_{t_{⊥}} = e_{t} - w_{r}^{T} e_{t} w_{r}$ are expected to be linked by the relation-specific translation vector $d_{r}$ . Thus, the scoring function is $\begin{matrix} (9) & f (h, r, t) = d (e_{h_{⊥}} + d_{r} - e_{t_{⊥}}) \end{matrix}$ TransH often showcases better performance than TransE with only slightly more parameters [74].
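The projection step and Eq. (9) can be sketched as follows, assuming $w_{r}$ has unit norm. Function names and toy vectors are hypothetical.

```python
from typing import List, Sequence


def project(e: Sequence[float], w: Sequence[float]) -> List[float]:
    """Project e onto the hyperplane with unit normal w: e - (w^T e) w."""
    dot = sum(a * b for a, b in zip(w, e))
    return [a - dot * b for a, b in zip(e, w)]


def transh_score(e_h, d_r, e_t, w_r, p: int = 2) -> float:
    """Eq. (9): L_p norm of the translation error between projected entities."""
    h_p, t_p = project(e_h, w_r), project(e_t, w_r)
    diff = [h + d - t for h, d, t in zip(h_p, d_r, t_p)]
    return sum(abs(x) ** p for x in diff) ** (1 / p)


# Toy vectors (hypothetical): the hyperplane normal is the third axis.
w_r = [0, 0, 1]
e_h = [1, 0, 5]
e_t = [1, 1, -2]
d_r = [0, 1, 0]

h_proj = project(e_h, w_r)
t_proj = project(e_t, w_r)
score = transh_score(e_h, d_r, e_t, w_r)
```

The two entities differ strongly along the normal direction, but this component is discarded by the projection, so the triple is scored as a perfect translation for this relation.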

DistMult [81] is a semantic matching model. It is characterized as such because it uses a similarity-based scoring function and matches the latent semantics of entities and relations by leveraging their vector space representations. More specifically, DistMult is a bilinear diagonal model that uses a trilinear dot product as its scoring function: $\begin{matrix} (10) & f (h, r, t) = ⟨ e_{h}, W_{r}, e_{t} ⟩ \end{matrix}$ It is similar to RESCAL [45] – the very first semantic matching model – but restricts relation matrices $W_{r} \in R^{d \times d}$ to be diagonal.

ComplEx [66] is also a semantic matching model. It extends DistMult by using complex-valued vectors to represent entities and relations: $e_{h}, e_{r}, e_{t} \in C^{d}$ . As a result, ComplEx is better able to model antisymmetric relations than DistMult [61]. Its scoring function uses the Hadamard product: $\begin{matrix} (11) & f (h, r, t) = Re (e_{h} ⊙ e_{r} ⊙ {\overline{e}}_{t}) \end{matrix}$ where ${\overline{e}}_{t}$ denotes the conjugate of $e_{t}$ .
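The difference between DistMult and ComplEx regarding antisymmetric relations can be illustrated with a small sketch of Eq. (10) and (11). In this sketch the element-wise product of Eq. (11) is summed over dimensions to obtain a scalar score, which is an assumption made explicit here; the function names and toy embeddings are also hypothetical.

```python
from typing import Sequence


def distmult_score(e_h: Sequence[float], w_r: Sequence[float],
                   e_t: Sequence[float]) -> float:
    """Eq. (10) with a diagonal W_r stored as a vector: sum_i h_i * w_i * t_i."""
    return sum(h * w * t for h, w, t in zip(e_h, w_r, e_t))


def complex_score(e_h: Sequence[complex], e_r: Sequence[complex],
                  e_t: Sequence[complex]) -> float:
    """Eq. (11): real part of sum_i h_i * r_i * conj(t_i)."""
    return sum(h * r * t.conjugate() for h, r, t in zip(e_h, e_r, e_t)).real


# DistMult: swapping head and tail never changes the score.
h, w, t = [1.0, 2.0], [0.5, -1.0], [3.0, 1.0]
dm_forward = distmult_score(h, w, t)
dm_backward = distmult_score(t, w, h)

# ComplEx: a purely imaginary relation embedding flips the sign when swapping.
ch, cr, ct = [1 + 0j], [1j], [1j]
cx_forward = complex_score(ch, cr, ct)
cx_backward = complex_score(ct, cr, ch)
```

DistMult assigns the same score to $(h, r, t)$ and $(t, r, h)$ by construction, whereas ComplEx can assign opposite scores to the two directions thanks to the conjugate of the tail embedding.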

SimplE [29] models each fact in both a direct and an inverse form. To do so, an entity e is simultaneously represented by two vectors $e^{h}, e^{t} \in R^{d}$ . Depending on whether e appears as head or tail in a given triple, either $e^{h}$ or $e^{t}$ is used. Consequently, the two entity representations $e^{h}$ and $e^{t}$ are learned independently. Likewise, each relation r comes with a direct vector $e_{r}$ and an inverse vector $e_{r^{- 1}}$ . The scoring function reflects the interaction between all the aforementioned entity and relation embeddings: $\begin{matrix} (12) & f (h, r, t) = \frac{1}{2} (⟨ e_{h}^{h}, e_{r}, e_{t}^{t} ⟩ + ⟨ e_{h}^{t}, e_{r^{- 1}}, e_{t}^{h} ⟩) \end{matrix}$
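A minimal sketch of Eq. (12) is given below. The argument names mirror the superscripts of the equation, and the toy vectors are hypothetical.

```python
from typing import Sequence


def trilinear(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Trilinear dot product: sum_i a_i * b_i * c_i."""
    return sum(x * y * z for x, y, z in zip(a, b, c))


def simple_score(h_as_head, t_as_tail, h_as_tail, t_as_head, r, r_inv) -> float:
    """Eq. (12): average of the direct and the inverse trilinear products."""
    direct = trilinear(h_as_head, r, t_as_tail)
    inverse = trilinear(h_as_tail, r_inv, t_as_head)
    return 0.5 * (direct + inverse)


# Toy embeddings (hypothetical): each entity has a head-role and a tail-role vector.
h_as_head, h_as_tail = [1, 0], [0, 2]
t_as_head, t_as_tail = [3, 1], [1, 1]
r, r_inv = [1, 1], [2, 0]

score = simple_score(h_as_head, t_as_tail, h_as_tail, t_as_head, r, r_inv)
```

Here the direct term equals 1 and the inverse term equals 0, so the averaged score is 0.5.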

ConvE [13] first reshapes entity and relation embeddings and then concatenates them into a 2D matrix [h; r]. To model the interactions between entities and relations, ConvE subsequently uses 2D convolution over embeddings and layers of nonlinear features. The output is ultimately scored against the tail embedding t using the dot product. More precisely, the following scoring function is used: $\begin{matrix} (13) & f (h, r, t) = g (vec (g (concat ({\hat{e}}_{h}, {\hat{e}}_{r}) * ω)) W) \cdot e_{t} \end{matrix}$ where g denotes a non-linear function, vec is the vectorization operation reshaping a tensor into a vector, concat is the concatenation operator, $*$ and $\cdot$ denote a convolution and a dot product, respectively, $\hat{e}$ denotes a 2D reshaping of e and ω is the set of convolutional filters.

ConvKB [42] also represents entities and relations as same-sized vectors. However, ConvKB does not reshape the embeddings of entities and relations. In addition, ConvKB considers the tail embedding in the concatenation operation, thus obtaining the 2D matrix [h; r; t] after concatenation. Convolution by a set ω of T filters of shape $1 \times 3$ is applied on this input. Each filter produces a feature map with one entry per embedding dimension; the T feature maps are concatenated and passed through a dense layer with one neuron and a weight vector $w$ . Finally, the following scoring function assesses the plausibility of a given triple $(h, r, t)$ : $\begin{matrix} (14) & f (h, r, t) = concat (g ([e_{h}, e_{r}, e_{t}] * ω)) \cdot w \end{matrix}$ It has been claimed that the concatenation of a set of feature maps generated by convolution should increase the learning ability of latent features compared to ConvE [25]. However, recent works point out issues with the evaluation procedure used in the original implementation of ConvKB [58,62], which may lead to overly optimistic results on the benchmark datasets FB15K237 and WN18RR. Nonetheless, due to the popularity of this model, we choose to include it in our experiments.

R-GCN [57] extends the idea of applying graph convolutional networks (GCNs) to multi-relational data. R-GCN operates on local graph neighborhoods and applies a convolution operation to the neighbors of each entity. By aggregating the messages coming from all the neighbors of an entity, the embedding of the latter is updated accordingly. Each entity thus has a hidden representation which directly depends on the hidden representations of its neighbors. This process of accumulating messages (i.e. the hidden representations of neighboring entities) and aggregating them so as to update the hidden representation of the central node is performed for each layer of the R-GCN model. More formally, the hidden representation of the entity i in the layer $(l + 1)$ is defined as: $\begin{matrix} (15) & h_{i}^{(l + 1)} = σ (W_{0}^{(l)} h_{i}^{(l)} + \sum_{r \in R} \sum_{j \in N_{i}^{r}} \frac{1}{c_{i, r}} W_{r}^{(l)} h_{j}^{(l)}) \end{matrix}$ where $σ (.)$ can be any element-wise activation function, $N_{i}^{r}$ denotes the set of neighboring entities of entity i considering the relation r, $W_{r}^{(l)}$ is the relation-specific weight matrix for layer l and $c_{i, r}$ is a normalization constant. To update the hidden representation of an entity while taking its previous state into account, a self-connection (the term $W_{0}^{(l)} h_{i}^{(l)}$ ) is also included inside the activation function $σ (.)$ . Such layers can be stacked multiple times in order to better learn interactions and dependencies across several relational steps. However, as applying Eq. (15) directly would dramatically increase the number of parameters for KGs having lots of different relation types, basis-decomposition and block-diagonal-decomposition are proposed in [57] to reduce model parameter size and prevent overfitting. In the case of basis decomposition, which is used in the original paper, the relation-specific weight matrix $W_{r}^{(l)}$ is decomposed into a linear combination of basis transformations $V_{b}^{(l)}$ with coefficients $a_{r b}^{(l)}$ : $\begin{matrix} (16) & W_{r}^{(l)} = \sum_{b = 1}^{B} a_{r b}^{(l)} V_{b}^{(l)} \end{matrix}$
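The effect of Eq. (16) on model size can be illustrated with a short sketch. The functions below are illustrative assumptions; in particular, the embedding size of 100 and the 5 bases used in the parameter count are hypothetical values, not the hyperparameters of our experiments, and the count only covers the relation-specific matrices of a single layer.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def compose_weight(coeffs: Sequence[float], bases: Sequence[Matrix]) -> Matrix:
    """Eq. (16): W_r = sum_b a_rb * V_b, computed entry by entry."""
    n_rows, n_cols = len(bases[0]), len(bases[0][0])
    return [[sum(a * V[i][j] for a, V in zip(coeffs, bases))
             for j in range(n_cols)] for i in range(n_rows)]


def n_params_full(n_rel: int, d: int) -> int:
    """One d x d matrix per relation."""
    return n_rel * d * d


def n_params_basis(n_rel: int, d: int, n_bases: int) -> int:
    """n_bases shared d x d matrices plus n_bases coefficients per relation."""
    return n_bases * d * d + n_rel * n_bases


# Two toy basis matrices and the coefficients of one relation.
V1 = [[1, 0], [0, 1]]
V2 = [[0, 1], [1, 0]]
W_r = compose_weight([2, 3], [V1, V2])

# 237 relations as in FB15K237-ET; d = 100 and B = 5 are hypothetical.
full = n_params_full(237, 100)
shared = n_params_basis(237, 100, 5)
```

With these hypothetical sizes, the relation-specific weights of one layer shrink from 2,370,000 to 51,185 parameters.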

CompGCN [68] improves over R-GCN by not only learning entity representations but also relation representations. Concretely, CompGCN performs a composition operation $ϕ (.)$ over each edge in the neighborhood of a central node. The composed embeddings are subsequently convolved with direction-specific weight matrices $W_{O}$ and $W_{I}$ – for original and inverse relations, respectively. In addition to the entity representation update, relation representations are also updated individually: $\begin{matrix} (17) & h_{r}^{(l + 1)} = W_{rel}^{(l)} h_{r}^{(l)} \end{matrix}$ where $W_{rel}$ is a learnable transformation matrix which projects all the relations into the same embedding space as entities.

5.3. Implementation details and hyperparameters

For the sake of comparisons, MRR, Hits@K and Sem@K all need to rely on the same code implementation. More specifically, for R-GCN (https://github.com/toooooodo/RGCN-LinkPrediction) and CompGCN (https://github.com/malllabiisc/CompGCN), existing implementations were reused and Sem@K values were calculated on the trained models. Other KGEMs were implemented in PyTorch. To avoid time-consuming hyperparameter tuning, we took inspiration from the hyperparameters provided by LibKGE (https://github.com/uma-pi1/kge) for Codex-S, Codex-M, WN18RR and FB15K237. However, LibKGE does not benchmark all the datasets and models used in our experiments. For such models, we stick to the hyperparameters provided by the original authors, when available. For the models with no reported best hyperparameters, as well as for the remaining datasets used in the experiments – i.e. DB93K, YAGO3-37K and YAGO4-19K – different combinations of hyperparameters were manually tried. We first trained our KGEMs for 1,000 epochs, then noticed the best achieved results were found around epoch 400 or below. Consequently, we stick to a maximum of 400 epochs of training as in LibKGE (except R-GCN, which is trained for 4,000 epochs due to slower convergence to the best achieved results). For each positive triple in the training set, one corresponding negative triple is generated. To ensure fair comparisons between our models, we stick with Uniform Random Negative Sampling [7]. The chosen hyperparameters leading to the best performance on the validation dataset are provided in Appendix A (Table 7). Recall that the objective of this work is to perform a fair and insightful assessment of the semantic awareness of KGEMs. As such, the intended purpose was not to reach optimal performance in terms of rank-based metrics. Instead, the objective is to identify a set of hyperparameters that provides satisfying and stable performance, and then study how Sem@K behaves, both for a fixed epoch – e.g. for the best epoch in terms of MRR – and dynamically w.r.t. the number of epochs.
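As an illustration of the negative sampling step, the sketch below generates one negative per positive triple by replacing either the head or the tail with an entity drawn uniformly at random. Several details are assumptions rather than a description of our implementation: the choice between head and tail is made with probability 1/2, the replacement is redrawn until it differs from the original entity, and negatives are not checked against the set of known positive triples.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def corrupt(triple: Triple, entities: Sequence[str], rng: random.Random) -> Triple:
    """Replace the head or the tail with a uniformly drawn, different entity."""
    h, r, t = triple
    replace_head = rng.random() < 0.5
    original = h if replace_head else t
    new = rng.choice(entities)
    while new == original:  # requires at least two distinct entities
        new = rng.choice(entities)
    return (new, r, t) if replace_head else (h, r, new)


def sample_negatives(positives: List[Triple], entities: Sequence[str],
                     seed: int = 0) -> List[Triple]:
    """One negative per positive triple, reproducible through the seed."""
    rng = random.Random(seed)
    return [corrupt(p, entities, rng) for p in positives]


# Toy data (hypothetical triples).
entities = ["BarackObama", "MichelleObama", "EmmanuelMacron", "USA", "France"]
positives = [("BarackObama", "presidentOf", "USA"),
             ("EmmanuelMacron", "presidentOf", "France")]
negatives = sample_negatives(positives, entities)
```

Because candidates are drawn uniformly from all entities, many negatives violate the domain or range of the relation, which is one reason why semantically agnostic models may still pick up type information during training.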

6. Results

In the following, we perform an extensive analysis of the results obtained using the aforedescribed KGEMs and KGs. For the sake of clarity, the complete range of tables and plots are placed in Appendix B and C, respectively. When necessary to support our claim, some of them are duplicated in the body text.

6.1. Semantic awareness of KGEMs

This section draws on the Sem@K values (see Tables 8 and 9) achieved at the best epoch in terms of MRR to provide an analysis of the semantic awareness of state-of-the-art KGEMs. In other words, for such models we only consider a snapshot of their best epochs in terms of rank-based metrics.

A major finding is that models performing well with respect to rank-based metrics are not necessarily the most competitive when it comes to their semantic capabilities. On YAGO3-37K (see Table 5) for instance, ConvE showcases impressive MRR and Hits@K values. However, it is far from being the best KGEM in terms of Sem@K, as it is outperformed by CompGCN, R-GCN and all translational models.

Table 5
Results on YAGO3-37K. Bold fonts and gray cells denote the best achieved results and the worst achieved results among the models reported in the table, respectively. Full results are available in Appendix B, Table 8

From a coarse-grained viewpoint, conclusions about the relative superiority of models with the distinct consideration of rank-based metrics and semantic awareness can be generalized at the level of model families. For example, semantic matching models (DistMult, ComplEx, SimplE) globally achieve better MRR and Hits@K values while their semantic capabilities are in most cases lower than the ones of translational models (TransE, TransH) – see Table 8 and Table 9 for detailed results w.r.t. rank-based and semantic-oriented metrics. A condensed view of the comparison between MRR and Sem@10 is also reported in Fig. 4 and Fig. 5. The respective hierarchies of such models for the benchmarked schema-defined and schemaless KGs are depicted in Fig. 6 and Fig. 7, respectively. It is clearly visible that KGEMs are grouped by family. In particular, GNNs and translational models showcase very promising semantic capabilities. GNNs are almost always the best regarding Sem@K[ext] values – not only for schemaless KGs (Fig. 7), but also for schema-defined KGs (see Table 8 for full results, and Fig. 4 for a quick glimpse). This means GNNs are more capable of predicting entities that have been observed as head (resp. tail) of a given relation. Translational models are very competitive in terms of Sem@K[base]. In Fig. 6, we clearly see that they consistently rank among the best performing models regarding Sem@K[base]. Interestingly, the semantic matching models DistMult, ComplEx and SimplE perform relatively poorly. This observation holds regardless of the nature of the KG, as they systematically rank among the worst performing models for schema-defined (Fig. 6) and schemaless (Fig. 7) KGs.

Fig. 4.

MRR and Sem@10 results achieved at the best epoch in terms of MRR for each model and on each schema-defined dataset.

Fig. 5.

MRR and Sem@10 results achieved at the best epoch in terms of MRR for each model and on each schemaless dataset.

Fig. 6.

Sem@K[base] comparisons between KGEMs on the 4 benchmarked schema-defined KGs. Colors indicate the family of models: blue, purple, green, and yellow cells denote GNNs, translational, convolutional, and semantic matching models, respectively. Regarding Sem@K, the relative hierarchy of models is consistent across KGs and we can clearly see that KGEMs are grouped by families of models.

Fig. 7.

Sem@K[ext] comparisons between KGEMs on the 3 benchmarked schemaless KGs. Colors indicate the family of models: blue, purple, green, and yellow cells denote GNNs, translational, convolutional, and semantic matching models, respectively. Regarding Sem@K, the relative hierarchy of models is consistent across KGs and we can clearly see that KGEMs are grouped by families of models.

Therefore, it seems that translational models are better able to recover the semantics of entities and relations and thus to predict entities that are in the domain (resp. range) of a given relation, while semantic matching models might sometimes be better at predicting entities already observed in the domain (resp. range) of a given relation (e.g. DistMult reaches very high Sem@K[ext] values on Codex-S, as evidenced in Fig. 7a). In cases when translational and semantic matching models provide similar results in terms of rank-based metrics, the nature of the dataset at hand – whether it is schema-defined or schemaless – might thus strongly influence the final choice of a KGEM.

Interestingly, CompGCN, which is by far the most recent and sophisticated model used in our experiments, outperforms all the other models in terms of semantic awareness: with very limited exceptions, it is the best in terms of Sem@K[base], Sem@K[ext] and Sem@K[wup]. In addition, it should be noted that R-GCN provides satisfying results as well. Except in very few cases (e.g. outperformed by ComplEx on Codex-S and WN18RR in terms of Sem@1), R-GCN showcases better semantic awareness than semantic matching models. Most of the time, R-GCN also provides comparable or even higher semantic capabilities than translational models. In particular, Sem@K values achieved with R-GCN are similar to the ones achieved with TransE and TransH (e.g. on FB15K237-ET and YAGO3-37K, see Tables 8a and 8c) while the latter models are actually outperformed by R-GCN in terms of Sem@K[ext]. This appears clearly on Codex-M and WN18RR (Tables 9b and 9c), although the conclusion holds for all datasets.

Our experimental results suggest that the structure of GNNs is able to encode the latent semantics of the entities and relations of the graph. This ability may be attributed to the fact that, contrary to translational models which only model the local neighborhood of each triple, GNNs update entity embeddings (and relation embeddings in the case of CompGCN) based on the information found in the extended, h-hop neighborhood of the central node. While translational and semantic matching models treat each triple independently, GNNs model interactions between entities over a wide range of semantic relations. It is likely that this extended neighborhood comprises signals or patterns that help the model infer the classes of entities, thus providing very promising semantic capabilities in all experimental conditions.
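To make the notion of an h-hop neighborhood concrete, the following sketch collects the entities within a given number of hops of a node. It is only an illustration: the triples are hypothetical (they reuse the entity and relation names of the sample KG from Section 1 but are not taken from Fig. 1), edge direction is ignored, and `h_hop_neighborhood` is a helper name introduced here, not part of R-GCN or CompGCN.

```python
# Hypothetical triples, reusing entity/relation names from the sample KG.
triples = [
    ("BarackObama", "spouseOf", "MichelleObama"),
    ("BarackObama", "presidentOf", "USA"),
    ("MichelleObama", "livesIn", "USA"),
    ("EmmanuelMacron", "presidentOf", "France"),
    ("EmmanuelMacron", "supports", "BarackObama"),
]


def h_hop_neighborhood(node, hops):
    """Entities reachable from `node` within `hops` edges (direction ignored)."""
    adjacency = {}
    for head, _, tail in triples:
        adjacency.setdefault(head, set()).add(tail)
        adjacency.setdefault(tail, set()).add(head)
    frontier, seen = {node}, {node}
    for _ in range(hops):
        # Expand one hop further, keeping only entities not visited yet.
        frontier = {n for f in frontier for n in adjacency.get(f, ())} - seen
        seen |= frontier
    return seen - {node}


one_hop = h_hop_neighborhood("USA", 1)
three_hop = h_hop_neighborhood("USA", 3)
```

With a single hop, only the direct neighbors of `USA` are visible; with three hops, `France` enters the neighborhood although no triple links it to `USA` directly. This is the kind of extended context a GNN layer stack can aggregate.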

6.2. Dynamic appraisal of KGEM semantic awareness

Fig. 8.

Evolution of MRR (green —), Sem@1 (red - -), Sem@3 (blue - · -), and Sem@10 (purple $\dots$ ) for translational (TransE, TransH) and semantic matching models (DistMult, ComplEx) on FB15K237-ET.

For certain models, rank-based metrics performance and semantic capabilities improve jointly. For others, the enhancement of their performance in terms of rank-based metrics comes at the expense of their semantic awareness. Interestingly, trends emerge relative to families of models. First, we observe that a trade-off exists for semantic matching models. Results are particularly striking on FB15K237-ET (see Fig. 8): after reaching their best Sem@K values within a few epochs, the Sem@K values of DistMult and ComplEx quickly drop while MRR continues rising. Conversely, translational models are more robust to Sem@K degradation throughout the epochs. Even though their best Sem@K values are also reached in the very first epochs, these values then remain stable for the remaining epochs of training. This might be due to the geometric nature of such KGEMs, which organize the representation space so that $h + r$ falls in a region where the neighbors of the ground-truth tail $t$ are all entities of the expected type. This is closely related to the block structure property, which is a common statistical pattern found in KGs [43]. It refers to the fact that entities can naturally belong to different groups (blocks), such that all the entities of a given group are linked to entities of another group through the same relationship. In this case, each group comprises entities of the same class. Translational models will naturally group entities of the same class in the same region of the representation space, as this is determined by the translation vector in that space.
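The geometric intuition can be illustrated with a toy sketch. It assumes a TransE-style scoring in which candidate tails are ranked by their Euclidean distance to $h + r$; the 2-D coordinates and the `livesIn` translation vector are hand-picked for illustration (they are not learned embeddings), and `rank_tails` is a hypothetical helper.

```python
import math

# Hand-picked 2-D embeddings (illustrative assumptions, not learned values).
# Persons sit near the origin, countries near (1, 1).
entities = {
    "BarackObama": (0.0, 0.0),
    "EmmanuelMacron": (0.2, 0.1),
    "MichelleObama": (0.1, -0.1),
    "USA": (1.0, 1.0),
    "France": (1.2, 1.1),
}
relations = {"livesIn": (1.0, 1.0)}  # translation vector (assumed)


def rank_tails(head, relation):
    """Rank all entities by their distance to h + r (closest first)."""
    hx, hy = entities[head]
    rx, ry = relations[relation]
    target = (hx + rx, hy + ry)
    return sorted(entities, key=lambda e: math.dist(target, entities[e]))


top2 = rank_tails("EmmanuelMacron", "livesIn")[:2]
```

In this toy space, translating any person by `livesIn` lands in the region occupied by countries, so the closest candidates share the same class whether or not the ground-truth tail is ranked first.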

Plots of the joint evolution of MRR and Sem@K values show that most KGEMs reach their best Sem@K values after a small number of epochs. This means that predictions become semantically valid in the early stages of training. As previously mentioned, Sem@K then usually starts to decrease, as noted for semantic matching models in particular. In this respect, an excerpt of the head and tail predictions of DistMult on YAGO3-37K is depicted in Fig. 9. Even though the ground-truth entity shows up in neither the head nor the tail top-K list, we clearly see that after only 30 epochs of training, predictions made by DistMult are more meaningful than after 400 epochs of training. This relative trade-off between making semantically valid predictions and ranking the ground-truth entity higher in the top-K list calls for finding a compromise in terms of training. The LP task is usually addressed in terms of rank-based metrics only, hence the choice of performing more and more training epochs so as to find the optimal KGEM in terms of MRR and Hits@K. However, as discussed in the present work, adding training steps may improve KGEM performance at the expense of its semantic awareness. In many cases, rank-based metric values only slightly increase, whereas Sem@K values drop drastically. For instance, comparing the evolution of MRR and Sem@K for DistMult on FB15K237-ET (Fig. 10c), we clearly see that after a moderate number of epochs, any additional epoch of training only provides a very slight improvement in terms of MRR, while it is very detrimental to Sem@K values. Depending on the use case, such a decline in the semantic capabilities of the model is not desirable, and a compromise is to be found between training more to increase KGEM predictive performance and stopping training early enough so as not to deteriorate its semantic awareness too much.
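One possible way to operationalize such a compromise is to select the training checkpoint with a weighted combination of MRR and Sem@K rather than with MRR alone. The sketch below is an assumption made for illustration, not a procedure used in this paper: the `select_epoch` helper and the weight `alpha` are hypothetical, and the metric values are invented to mimic the qualitative trend described above.

```python
def select_epoch(history, alpha=0.5):
    """Return the epoch maximizing alpha * MRR + (1 - alpha) * Sem@K.

    history: list of (epoch, mrr, sem_at_k) tuples recorded on validation data.
    alpha:   weight given to MRR; alpha = 1.0 recovers MRR-only selection.
    """
    best = max(history, key=lambda rec: alpha * rec[1] + (1 - alpha) * rec[2])
    return best[0]


# Invented values (NOT experimental results): MRR keeps rising slowly
# while Sem@K peaks early and then degrades.
history = [
    (10, 0.10, 0.95),
    (30, 0.20, 0.97),
    (100, 0.27, 0.80),
    (400, 0.29, 0.55),
]

mrr_only = select_epoch(history, alpha=1.0)
balanced = select_epoch(history, alpha=0.5)
```

Selecting on MRR alone keeps the last checkpoint, whereas giving Sem@K equal weight favors a much earlier one. The appropriate weight depends on the downstream use case.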

Fig. 9.

Top-ten ranked entities for head and tail predictions at epochs 30 and 400 for a sample triple from YAGO3-37K. Green, blue and white cells respectively denote the ground-truth entity, entities other than the ground-truth and semantically valid, and entities other than the ground-truth and semantically invalid. In this case, semantic validity is based on the domain and range of the relation <hasCapital>.

6.3. On the use of Sem@K for different kinds of KGs

As reported in Table 1, KGs based upon a schema and a class hierarchy are candidates for the computation of all the versions of Sem@K. For the schema-defined KGs used in our experiments, we choose to report values for all these metrics so as to enable multi-view comparisons across models. From Table 8 it can be clearly seen that the relative superiority of models is consistent throughout the different Sem@K definitions. From a higher perspective, this means that even for schema-defined KGs with a class hierarchy, Sem@K[ext] is already a good proxy. It may thus be a good option to rely only on Sem@K[ext] whenever the computation of Sem@K[base] is too expensive, due to the entity type checking part. This is even more true for Sem@K[wup], which requires an additional step of semantic relatedness computation.
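The difference between the two validity checks can be sketched as follows. This is a simplified illustration that assumes Sem@K is the average fraction of the top-K predictions deemed semantically valid, that Sem@K[base] tests entity types against the declared range, and that Sem@K[ext] tests whether the entity was already observed as a tail of the relation; the exact definitions are those of Section 4. All entity types, the range, and the observed tails below are hypothetical.

```python
def sem_at_k(ranked_lists, is_valid, k):
    """Mean fraction of the first k predictions judged valid by `is_valid`."""
    per_query = [sum(is_valid(e) for e in preds[:k]) / k for preds in ranked_lists]
    return sum(per_query) / len(per_query)


# Hypothetical toy data for tail prediction with the relation livesIn.
entity_types = {
    "Paris": {"City", "Capital"},
    "Seattle": {"City"},
    "Lyon": {"City"},
    "France": {"Country"},
}
range_of_lives_in = {"City"}               # declared range (needs a schema)
observed_tails_lives_in = {"Paris", "Seattle"}  # tails seen with livesIn in training


def valid_base(entity):
    # Schema-based check: requires looking up the types of each entity.
    return bool(entity_types[entity] & range_of_lives_in)


def valid_ext(entity):
    # Extensional check: a set lookup, usable without any schema.
    return entity in observed_tails_lives_in


predictions = [["Paris", "Lyon", "France"], ["Seattle", "France", "Lyon"]]
base_score = sem_at_k(predictions, valid_base, k=3)
ext_score = sem_at_k(predictions, valid_ext, k=3)
```

In this toy case the extensional check is stricter (Lyon is a city but was never observed as a tail of `livesIn`), yet it avoids the type lookup entirely.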

7. Discussion

Three major research questions were formulated in Section 1. Based on the analysis presented in Section 6, we discuss each research question individually. Finally, we discuss the potential for further consideration of semantics in KGEMs.

7.1. RQ1: How semantic-aware are agnostic KGEMs?

From a coarse-grained viewpoint, we noted that KGEMs trained in an agnostic way prove capable of giving higher scores to semantically valid triples. However, disparities exist between models. Interestingly, these disparities seem to derive from the family of such models. Globally, translational models and GNNs – represented by R-GCN and CompGCN in this work – provide promising results. It appears that the two aforementioned families of KGEMs are better able than semantic matching models (DistMult, ComplEx, SimplE) to recover the semantics of entities and relations and to give higher scores to semantically valid triples. In fact, semantic matching models are almost systematically the worst performing models in terms of semantic awareness. From a dynamic standpoint, it is worth noting the high semantic capabilities that KGEMs reach during the first epochs of training. In most cases, the optimal semantic awareness is even attained during these first epochs.

7.2. RQ2: How should the evaluation of KGEM semantic awareness adapt to the typology of KGs?

Drawing on the initial version of Sem@K as presented in [21] – referred to herein as Sem@K[base] – an issue is quickly encountered when it comes to schemaless KGs, which do not contain any rdfs:domain (resp. rdfs:range) clause to indicate the class that candidate heads (resp. tails) should belong to. Our work introduces Sem@K[ext] – a new version of Sem@K that overcomes the aforementioned limitation. In addition, even with schema-defined KGs, Sem@K[base] is not necessarily sufficient in itself. This metric can be further enriched whenever a KG comes with a class hierarchy. In Section 4.4, we integrate class hierarchy into Sem@K by means of a similarity measure between concepts. We subsequently provide an example using the Wu-Palmer similarity score. The resulting Sem@K[wup] is used in the experiments in Section 6 and provides a finer-grained measure of KGEM semantic awareness.
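For reference, the Wu-Palmer similarity between two classes $c_1$ and $c_2$ is commonly defined as $2 \cdot depth(lcs) / (depth(c_1) + depth(c_2))$, where $lcs$ is their least common subsumer. The sketch below computes it on a hypothetical single-inheritance hierarchy. The classes, the `parent` map, and the convention that the root has depth 1 are assumptions made for illustration; Section 4.4 gives the definition actually used for Sem@K[wup].

```python
# Hypothetical class hierarchy (child -> parent); "Thing" is the root.
parent = {
    "Thing": None,
    "Place": "Thing",
    "City": "Place",
    "Capital": "City",
    "Country": "Place",
    "Person": "Thing",
}


def ancestors(cls):
    """Path from `cls` up to the root, both included."""
    path = []
    while cls is not None:
        path.append(cls)
        cls = parent[cls]
    return path


def depth(cls):
    return len(ancestors(cls))  # assumed convention: the root has depth 1


def wu_palmer(c1, c2):
    """2 * depth(lcs) / (depth(c1) + depth(c2)) on the toy hierarchy."""
    ancestors_c2 = set(ancestors(c2))
    # The first shared class met when walking up from c1 is the deepest one.
    lcs = next(a for a in ancestors(c1) if a in ancestors_c2)
    return 2 * depth(lcs) / (depth(c1) + depth(c2))


close = wu_palmer("Capital", "City")   # classes on the same branch
far = wu_palmer("City", "Person")      # classes sharing only the root
```

A graded score of this kind lets a prediction of a closely related class count as partially valid instead of being rejected outright, which is what makes the resulting metric finer-grained than a binary type check.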

7.3. RQ3: Does the evaluation of KGEM semantic awareness offer different conclusions on the relative superiority of some KGEMs?

A major finding is that models performing well with respect to rank-based metrics are not necessarily the most competitive regarding their semantic capabilities. We previously noted that translational models globally showcase better Sem@K values compared to semantic matching models. Considering MRR and Hits@K, the opposite conclusion is often drawn. Hence, the performance of KGEMs in terms of rank-based metrics is not indicative of their semantic capabilities. The only possible exception is GNNs, which perform well both in terms of rank-based metrics and semantic-oriented measures.

The answers provided to the research questions also raise new questions. As evidenced in Table 8 and Table 9, some KGs are more challenging with regard to Sem@K results. Due to its tailored extraction strategy that purposely favored difficult relations to feature in the validation and test sets, YAGO4-19K is the schema-defined KG with the lowest achieved Sem@K. This observation raises a deeper question: what characteristics of a KG make it inherently challenging for KGEMs to recover the semantics of entities and relations? An extensive study of the influence of KG characteristics on the semantic capabilities of KGEMs would require benchmarking them on a broad set of KGs with varying dimensions, so as to determine those that are the most prevalent. Such characteristics can be the total number of relations, the average number of instances per class, or a combination of different factors. We leave this experimental study for future work.

Recall that this work is motivated by the possibility of going beyond a mere assessment of KGEM performance regarding rank-based metrics. We showed that these metrics only evaluate one aspect of such models, thereby providing only a partial view on the quality of KGEMs. Our proposal for further assessing KGEM semantic capabilities aims at diving deeper into their predictive expressiveness and measuring to what extent their predictions are semantically valid. However, this second evaluation component does not shed full light on the respective KGEM peculiarities. Other evaluation components may be added, such as the storage and computational requirements of KGEMs [51,69] and the environmental impact of their training and testing procedure [50]. Furthermore, the explainability of KGEMs is another dimension that deserves great attention [55,83].

7.4. Towards further considerations of semantics in knowledge graph embedding models

The Sem@K metric presented in this work allows for a more comprehensive evaluation of KGEMs. Based on domains and ranges of relations, Sem@K assesses to what extent the predictions of a model are semantically valid. The present work constitutes one of several stepping stones toward the further consideration of ontological and semantic information in KGEM design and evaluation.

It should be noted that, because it considers only domains and ranges, Sem@K cannot indicate whether predictions are logically consistent with other constraints posed by the ontology. This is in contrast with Inc@K presented in [23], which takes a broader set of ontological axioms into account. However, Inc@K and Sem@K intrinsically assess distinct dimensions of predictions. While the former is concerned with the logical consistency of predictions, the latter focuses on whether these predictions are semantically valid. For instance, an ontology can specify that City is the range of livesIn, that Seattle is a City but not a Capital, and that entities of type President should be linked to a Capital through the relation livesIn. Hence, it would still be meaningful and semantically correct to predict (BarackObama,livesIn,Seattle). However, this prediction is not logically consistent w.r.t. the ontology specifications. Inc@K and Sem@K thus consider triples at different levels: while a given triple can be meaningful and semantically correct on its own, its combination with other triples may not be semantically valid or consistent with the ontology. In future work, we will consider more expressive ontologies and see how the broad collection of axioms that constitute them can be incorporated into Sem@K.
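The Seattle example can be made concrete with a toy check. In this sketch, `range_valid` mimics the range test underlying Sem@K, while `violates_president_axiom` is a hypothetical stand-in for a single additional ontological axiom; it is not how Inc@K is computed in [23]. The type assignments and the `WashingtonDC` entity are assumptions added for illustration.

```python
# Hypothetical type assertions and range declaration.
types = {
    "BarackObama": {"Person", "President"},
    "Seattle": {"City"},
    "WashingtonDC": {"City", "Capital"},
}
range_of = {"livesIn": "City"}


def range_valid(triple):
    """Range check only: is the tail an instance of the relation's range?"""
    _, relation, tail = triple
    return range_of[relation] in types[tail]


def violates_president_axiom(triple):
    """Toy axiom: a President must be linked to a Capital through livesIn."""
    head, relation, tail = triple
    return (
        relation == "livesIn"
        and "President" in types[head]
        and "Capital" not in types[tail]
    )


predicted = ("BarackObama", "livesIn", "Seattle")
is_semantically_valid = range_valid(predicted)
is_inconsistent = violates_president_axiom(predicted)
```

The predicted triple passes the range check but violates the additional axiom, which is exactly the gap between the two kinds of assessment described above.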

KGEMs evaluated in this work are all agnostic to ontological information in their design. However, some models that consider or ensure specific ontological or logical properties exist. For example, HAKE [84] is constructed with the purpose of preserving hierarchies, Logic Tensor Networks [3] are designed to ensure logical reasoning, and the training of TransOWL and TransROWL [11] is enriched with additional triples deduced from, e.g., inverse predicates or equivalent classes. Because of the integration of semantic information in their design or training, one could wonder if they present improved semantic awareness compared to agnostic models. Additionally, KGEMs can also be used to predict triples that represent class instantiations. A possible extension of the present work thus consists in studying whether predicted links and class instantiations are consistent and lead to increased Sem@K values. This would further qualify and highlight the semantic awareness and the consistency of predictions of KGEMs. We leave these questions for future work.

8. Conclusion

In this work, we consider the link prediction task and extend our previously introduced Sem@K metric to measure the ability of KGEMs to assign higher scores to triples that are semantically valid. In particular, to adapt to different types of KGs (e.g., schemaless, class hierarchy), we introduce Sem@K[base], Sem@K[ext], and Sem@K[wup]. Compared with the traditional evaluation approach that solely relies on rank-based metrics, we show that the evaluation procedure is enhanced with the addition of semantic-oriented metrics that bring an additional perspective on KGEM quality. Our experiments with different types of KGs highlight that there is no clear correlation between the performance of KGEMs in terms of traditional rank-based metrics and their performance regarding semantic-oriented ones. In some cases, however, a trade-off does exist. Consequently, this calls for monitoring KGEM training more closely. Our experiments also point out that most of the conclusions that have been drawn actually hold at the level of families of models.

In future work, we will conduct experiments considering a broader array of KGEM families (e.g., KGEMs that include semantics) and propose evaluation metrics that consider additional and more expressive ontological constraints.

Acknowledgements

This work is supported by the AILES PIA3 project (see https://www.projetailes.com/). Experiments presented in this paper were carried out using the Grid’5000 testbed, supported by a scientific interest group hosted by Inria and including CNRS, RENATER and several Universities as well as other organizations (see https://www.grid5000.fr).

Hyperparameters

For datasets with no reported optimal hyperparameters, a grid search over a curated set of hyperparameters was performed. The full hyperparameter space is provided in Table 6. Chosen hyperparameters for each pair of KGEM and dataset are provided in Table 7.

Results achieved with the best reported hyperparameters

Results achieved with the best reported hyperparameters are presented in Tables 8 and 9.

Evolution of MRR and Sem@K values with respect to the number of epochs

The evolution of MRR and Sem@K values with respect to the number of epochs is presented in Figs. 10 to 16. For the sake of fairness and clarity, we choose to present 2 KGEMs for each family of models (translational models, semantic matching models, CNNs, and GNNs). Regarding semantic matching models, DistMult and ComplEx are chosen, as the evolution of their MRR and Sem@K values is less erratic than that of SimplE. The evolution of MRR and Sem@K values for SimplE is made available on the GitHub repository of the datasets (https://github.com/nicolas-hbt/benchmark-sematk).

References

1. Ali, Berrendorf, C.T. Hoyt, Vermue, Galkin, Sharifzadeh, Fischer, Tresp and Lehmann, Bringing light into the dark: A large-scale evaluation of knowledge graph embedding models under a unified framework, IEEE Trans. Pattern Anal. Mach. Intell. 44(12) (2022), 8825–8845. doi:10.1109/TPAMI.2021.3124805.

2. Auer, Bizer, Kobilarov, Lehmann, Cyganiak and Z.G. Ives, DBpedia: A nucleus for a web of open data, in: The Semantic Web, 6th International Semantic Web Conf., 2nd Asian Semantic Web Conf., ISWC + ASWC, Lecture Notes in Computer Science, Vol. 4825, Springer, 2007, pp. 722–735.

3. Badreddine, A.S. d’Avila Garcez, Serafini and Spranger, Logic tensor networks, Artificial Intelligence 303 (2022), 103649. doi:10.1016/j.artint.2021.103649.

4. Balazevic, Allen and T.M. Hospedales, TuckER: Tensor factorization for knowledge graph completion, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3–7, 2019, Association for Computational Linguistics, 2019, pp. 5184–5193. doi:10.18653/v1/D19-1522.

5. Berrendorf, Faerman, Vermue and Tresp, On the ambiguity of rank-based evaluation of entity alignment or link prediction methods, 2020, arXiv preprint arXiv:2002.06914.

6. K.D. Bollacker, Evans, P.K. Paritosh, Sturge and Taylor, Freebase: A collaboratively created graph database for structuring human knowledge, in: Proc. of the ACM SIGMOD International Conf. on Management of Data, ACM, 2008, pp. 1247–1250.

7. Bordes, Usunier, García-Durán, Weston and Yakhnenko, Translating embeddings for modeling multi-relational data, in: Conf. on Neural Information Processing Systems (NeurIPS), 2013, pp. 2787–2795.

8. Cao, Xu, Yang, Cao and Huang, Dual quaternion knowledge graph embeddings, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, the Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2–9, 2021, AAAI Press, 2021, pp. 6894–6902, https://ojs.aaai.org/index.php/AAAI/article/view/16850.

9. Chowdhury, Srilakshmi, Chain and Sarkar, Neural factorization for offer recommendation using knowledge graph embeddings, in: Proc. of the SIGIR Workshop on eCommerce, Vol. 2410, 2019.

10. Cui, Kapanipathi, Talamadupula, Gao and Ji, Type-augmented relation prediction in knowledge graphs, in: Thirty-Fifth AAAI Conference on Artificial Intelligence, AAAI 2021, Thirty-Third Conference on Innovative Applications of Artificial Intelligence, IAAI 2021, the Eleventh Symposium on Educational Advances in Artificial Intelligence, EAAI 2021, Virtual Event, February 2–9, 2021, AAAI Press, 2021, pp. 7151–7159, https://ojs.aaai.org/index.php/AAAI/article/view/16879.

11. d’Amato, N.F. Quatraro and Fanizzi, Injecting background knowledge into embedding models for predictive tasks on knowledge graphs, in: The Semantic Web – 18th International Conference, ESWC 2021, Virtual Event, June 6–10, 2021, Proceedings, Lecture Notes in Computer Science, Vol. 12731, Springer, 2021, pp. 441–457. doi:10.1007/978-3-030-77385-4_26.

12. Das, Dhuliawala, Zaheer, Vilnis, Durugkar, Krishnamurthy, Smola and McCallum, Go for a walk and arrive at the answer: Reasoning over paths in knowledge bases using reinforcement learning, in: 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30–May 3, 2018, Conference Track Proceedings, OpenReview.net, 2018, https://openreview.net/forum?id=Syg-YfWCW.

13. Dettmers, Minervini, Stenetorp and Riedel, Convolutional 2D knowledge graph embeddings, in: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th Innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2–7, 2018, AAAI Press, 2018, pp. 1811–1818, https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17366.

14. Ding, Wang, Wang and Guo, Improving knowledge graph embedding using simple constraints, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15–20, 2018, Volume 1: Long Papers, Association for Computational Linguistics, 2018, pp. 110–121. doi:10.18653/v1/P18-1011.

15. Ferrari, Frisoni, Italiani, Moro and Sartori, Comprehensive analysis of knowledge graph embedding techniques benchmarked on link prediction, Electronics 11(23) (2022).

16. Galárraga, Teflioudi, Hose and F.M. Suchanek, Fast rule mining in ontological knowledge bases with AMIE+, VLDB J. 24(6) (2015), 707–730. doi:10.1007/s00778-015-0394-1.

17. L.A. Galárraga, Teflioudi, Hose and F.M. Suchanek, AMIE: Association rule mining under incomplete evidence in ontological knowledge bases, in: 22nd International World Wide Web Conference, WWW ’13, Rio de Janeiro, Brazil, May 13–17, 2013, International World Wide Web Conferences Steering Committee / ACM, 2013, pp. 413–422. doi:10.1145/2488388.2488425.

18. T.R. Gruber, A translation approach to portable ontology specifications, Knowl. Acquis. 5(2) (1993), 199–220. doi:10.1006/knac.1993.1008.

19. C.T. Hoyt, Berrendorf, Galkin, Tresp and B.M. Gyori, A unified framework for rank-based evaluation metrics for link prediction in knowledge graphs, 2022, arXiv preprint arXiv:2203.07544.

20. Hubert, Monnin, Brun and Monticolo, New strategies for learning knowledge graph embeddings: The recommendation case, in: EKAW – 23rd International Conf. on Knowledge Engineering and Knowledge Management, Springer, 2022, pp. 66–80. doi:10.1007/978-3-031-17105-5_5.

21. Hubert, Monnin, Brun and Monticolo, Knowledge graph embeddings for link prediction: Beware of semantics! in: DL4KG@ISWC 2022: Workshop on Deep Learning for Knowledge Graphs, Held as Part of ISWC 2022: The 21st International Semantic Web Conference, Virtual, China, 2022.

22. Jain, Kalo, Balke and Krestel, Do embeddings actually capture knowledge graph semantics? in: The Semantic Web – 18th International Conf., ESWC, LNCS, Vol. 12731, Springer, 2021, pp. 143–159.

23. Jain, Tran, M.H. Gad-Elrab and Stepanova, Improving knowledge graph embeddings with ontological reasoning, in: The Semantic Web – International Semantic Web Conf. ISWC, Vol. 12922, 2021, pp. 410–426.

24. Ji, He, Xu, Liu and Zhao, Knowledge graph embedding via dynamic mapping matrix, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26–31, 2015, Beijing, China, Volume 1: Long Papers, The Association for Computer Linguistics, 2015, pp. 687–696. doi:10.3115/v1/p15-1067.

25. Ji, Pan, Cambria, Marttinen and P.S. Yu, A survey on knowledge graphs: Representation, acquisition, and applications, IEEE Trans. Neural Networks Learn. Syst. 33(2) (2022), 494–514. doi:10.1109/TNNLS.2021.3070843.

26. Jia, Wang, Lin, Jin and Cheng, Locally adaptive translation for knowledge graph embedding, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, February 12–17, 2016, AAAI Press, 2016, pp. 992–998, http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12018.

27. Jiang, Wang and Wang, Adaptive convolution for multi-relational learning, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2–7, 2019, Volume 1 (Long and Short Papers), Association for Computational Linguistics, 2019, pp. 978–987. doi:10.18653/v1/n19-1103.

28. Kadlec, Bajgar and Kleindienst, Knowledge base completion: Baselines strike back, in: Proc. of the 2nd Workshop on Representation Learning for NLP, Rep4NLP@ACL, 2017, pp. 69–74. doi:10.18653/v1/W17-2609.

29. S.M. Kazemi and Poole, SimplE embedding for link prediction in knowledge graphs, in: Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, Montréal, Canada, December 3–8, 2018, 2018, pp. 4289–4300.

30. Krompaß, Baier and Tresp, Type-constrained representation learning in knowledge graphs, in: The Semantic Web – 14th International Semantic Web Conf. (ISWC), Vol. 9366, Springer, 2015, pp. 640–655.

31. Lajus, Galárraga and F.M. Suchanek, Fast and exact rule mining with AMIE 3, in: The Semantic Web – 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31–June 4, 2020, Proceedings, Lecture Notes in Computer Science, Vol. 12123, Springer, 2020, pp. 36–52. doi:10.1007/978-3-030-49461-2_3.

32. Lao, T.M. Mitchell and W.W. Cohen, Random walk inference and learning in a large scale knowledge base, in: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27–31 July 2011, John McIntyre Conference Centre, Edinburgh, UK, a Meeting of SIGDAT, a Special Interest Group of the ACL, ACL, 2011, pp. 529–539, https://aclanthology.org/D11-1049/.

33. Leacock and Chodorow, Combining local context and WordNet similarity for word sense identification, 1998, p. 265.

34. Li, Bandar and McLean, An approach for measuring semantic similarity between words using multiple information sources, IEEE Trans. Knowl. Data Eng. 15(4) (2003), 871–882. doi:10.1109/TKDE.2003.1209005.

35. Lin, Liu, Sun, Liu and Zhu, Learning entity and relation embeddings for knowledge graph completion, in: Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, Austin, Texas, USA, January 25–30, 2015, AAAI Press, 2015, pp. 2181–2187, http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/view/9571.

36. Liu, Wu and Yang, Analogical inference for multi-relational embeddings, in: Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6–11 August 2017, Proceedings of Machine Learning Research, Vol. 70, PMLR, 2017, pp. 2168–2178, http://proceedings.mlr.press/v70/liu17d.html.

37. Lv, Hou, Li and Liu, Differentiating concepts and instances for knowledge graph embedding, in: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31–November 4, 2018, Association for Computational Linguistics, 2018, pp. 1971–1979. doi:10.18653/v1/d18-1222.

38. Meilicke, M.W. Chekol, Ruffinelli and Stuckenschmidt, Anytime bottom-up rule learning for knowledge graph completion, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10–16, 2019, ijcai.org, 2019, pp. 3137–3143. doi:10.24963/ijcai.2019/435.

39. G.A. Miller, WordNet: A lexical database for English, Commun. ACM 38(11) (1995), 39–41. doi:10.1145/219717.219748.

40. Monnin, Raïssi, Napoli and Coulet, Discovering alignment relations with graph convolutional networks: A biomedical case study, Semantic Web 13(3) (2022), 379–398. doi:10.3233/SW-210452.

41. Nathani, Chauhan, Sharma and Kaul, Learning attention-based embeddings for relation prediction in knowledge graphs, in: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28–August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 4710–4723. doi:10.18653/v1/p19-1466.

42. D.Q. Nguyen, T.D. Nguyen, D.Q. Nguyen and D.Q. Phung, A novel embedding model for knowledge base completion based on convolutional neural network, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1–6, 2018, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 327–333. doi:10.18653/v1/n18-2053.

43. Nickel, Murphy, Tresp and Gabrilovich, A review of relational machine learning for knowledge graphs, Proc. IEEE 104(1) (2016), 11–33. doi:10.1109/JPROC.2015.2483592.

44. Nickel, Rosasco and T.A. Poggio, Holographic embeddings of knowledge graphs, in: Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, Arizona, USA, February 12–17, 2016, AAAI Press, 2016, pp. 1955–1961, http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12484.

45. Nickel, Tresp and Kriegel, A three-way model for collective learning on multi-relational data, in: Proc. of the 28th International Conf. on Machine Learning, ICML, 2011, pp. 809–816.

46. Niu, Li, Zhang, Pu and Li, AutoETER: Automated entity type representation with relation-aware attention for knowledge graph embedding, in: Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16–20 November 2020, Findings of ACL, Vol. EMNLP 2020, Association for Computational Linguistics, 2020, pp. 1172–1181. doi:10.18653/v1/2020.findings-emnlp.105.

47. Ott, Meilicke and Samwald, SAFRAN: An interpretable, rule-based link prediction method outperforming embedding models, in: 3rd Conference on Automated Knowledge Base Construction, AKBC 2021, Virtual, October 4–8, 2021, 2021. doi:10.24432/C5MK57.

48. Paulheim, Knowledge graph refinement: A survey of approaches and evaluation methods, Semantic Web 8(3) (2017), 489–508. doi:10.3233/SW-160218.

49. Paulheim, Make embeddings semantic again! in: Proc. of the ISWC Posters & Demonstrations, Industry and Blue Sky Ideas Tracks, CEUR Workshop Proceedings, Vol. 2180, 2018.

50. Peng, Chen, Lin and Stevenson, Highly efficient knowledge graph embedding learning with orthogonal procrustes analysis, in: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, Online, 2021, pp. 2364–2375, https://aclanthology.org/2021.naacl-main.187. doi:10.18653/v1/2021.naacl-main.187.

51. Portisch, Hladik and Paulheim, RDF2Vec light – a lightweight approach for knowledge graph embeddings, in: Proceedings of the ISWC 2020 Demos and Industry Tracks: From Novel Ideas to Industrial Practice Co-Located with 19th International Semantic Web Conference (ISWC 2020), Globally online, November 1–6, 2020 (UTC), CEUR Workshop Proceedings, Vol. 2721, CEUR-WS.org, 2020, pp. 79–84, http://ceur-ws.org/Vol-2721/paper520.pdf.

52. Rada, Mili, Bicknell and Blettner, Development and application of a metric on semantic nets, IEEE Trans. Syst. Man Cybern. 19(1) (1989), 17–30. doi:10.1109/21.24528.

53. Resnik, Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language, J. Artif. Intell. Res. 11 (1999), 95–130. doi:10.1613/jair.514.

54.

Rossi,

Barbosa,

Firmani,

Matinata and

Merialdo, Knowledge graph embedding for link prediction: A comparative analysis, ACM Transactions on Knowledge Discovery from Data15(2) (2021), 14:1–14:49.

55.

Rossi,

Firmani,

Merialdo and

Teofili, Explaining link prediction systems based on knowledge graph embeddings, in: Proceedings of the 2022 International Conference on Management of Data, SIGMOD ’22, Association for Computing Machinery, New York, NY, USA, 2022, pp. 2062–2075. ISBN 9781450392495. doi:10.1145/3514221.3517887.

56.

Safavi and Koutra, CoDEx: A comprehensive knowledge graph completion benchmark, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16–20, 2020, Association for Computational Linguistics, 2020, pp. 8328–8350. doi:10.18653/v1/2020.emnlp-main.669.

57.

M.S. Schlichtkrull, T.N. Kipf, Bloem, van den Berg, Titov and Welling, Modeling relational data with graph convolutional networks, in: The Semantic Web – 15th International Conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, Proceedings, Lecture Notes in Computer Science, Vol. 10843, Springer, 2018, pp. 593–607. doi:10.1007/978-3-319-93417-4_38.

58.

Shang, Tang, Huang, Bi, He and Zhou, End-to-end structure-aware convolutional networks for knowledge base completion, in: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, the Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, the Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27–February 1, 2019, 2019, pp. 3060–3067. doi:10.1609/aaai.v33i01.33013060.

59.

Socher, Chen, C.D. Manning and A.Y. Ng, Reasoning with neural tensor networks for knowledge base completion, in: Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a Meeting Held December 5–8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 926–934, https://proceedings.neurips.cc/paper/2013/hash/b337e84de8752b27eda3a12363109e80-Abstract.html .

60.

F.M. Suchanek, Kasneci and Weikum, Yago: A core of semantic knowledge, in: Proc. of the 16th International Conf. on World Wide Web, WWW, ACM, 2007, pp. 697–706.

61.

Sun, Deng, Nie and Tang, RotatE: Knowledge graph embedding by relational rotation in complex space, in: 7th International Conf. on Learning Representations, ICLR, 2019.

62.

Sun, Vashishth, Sanyal, P.P. Talukdar and Yang, A re-evaluation of knowledge graph completion methods, in: Proc. of the 58th Annual Meeting of the Association for Computational Linguistics ACL, Association for Computational Linguistics, 2020, pp. 5516–5522. doi:10.18653/v1/2020.acl-main.489.

63.

T.P. Tanon, Weikum and F.M. Suchanek, YAGO 4: A reason-able knowledge base, in: The Semantic Web – 17th International Conference, ESWC 2020, Heraklion, Crete, Greece, May 31–June 4, 2020, Proceedings, Lecture Notes in Computer Science, Vol. 12123, Springer, 2020, pp. 583–596. doi:10.1007/978-3-030-49461-2_34.

64.

Tiwari, Bansal and C.R. Rivero, Revisiting the evaluation protocol of knowledge graph completion methods for link prediction, in: WWW ’21: The Web Conf., ACM / IW3C2, 2021, pp. 809–820. doi:10.1145/3442381.3449856.

65.

Toutanova and Chen, Observed versus latent features for knowledge base and text inference, in: Proc. of the 3rd Workshop on Continuous Vector Space Models and Their Compositionality, Association for Computational Linguistics, 2015, pp. 57–66.

66.

Trouillon, Welbl, Riedel, É. Gaussier and Bouchard, Complex embeddings for simple link prediction, in: Proc. of the 33rd International Conf. on Machine Learning, ICML, Vol. 48, 2016, pp. 2071–2080.

67.

Vashishth, Sanyal, Nitin, Agrawal and P.P. Talukdar, InteractE: Improving convolution-based knowledge graph embeddings by increasing feature interactions, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, the Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, the Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, AAAI Press, 2020, pp. 3009–3016, https://ojs.aaai.org/index.php/AAAI/article/view/5694 .

68.

Vashishth, Sanyal, Nitin and P.P. Talukdar, Composition-based multi-relational graph convolutional networks, in: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30, 2020, OpenReview.net, 2020, https://openreview.net/forum?id=BylA_C4tPr .

69.

Wang, Lian and Gao, A lightweight knowledge graph embedding framework for efficient inference and storage, in: Proceedings of the 30th ACM International Conference on Information and Knowledge Management, CIKM ’21, Association for Computing Machinery, New York, NY, USA, 2021, pp. 1909–1918. ISBN 9781450384469. doi:10.1145/3459637.3482224.

70.

Wang, Qiu and Wang, A survey on knowledge graph embeddings for link prediction, Symmetry 13(3) (2021), 485. doi:10.3390/sym13030485.

71.

Wang, Zhou, Liu and Zhou, Transet: Knowledge graph embedding with entity types, Electronics 10(12) (2021), 1407. doi:10.3390/electronics10121407.

72.

Wang, Mao, Wang and Guo, Knowledge graph embedding: A survey of approaches and applications, IEEE Trans. Knowl. Data Eng. 29(12) (2017), 2724–2743. doi:10.1109/TKDE.2017.2754499.

73.

Wang, Ruffinelli, Gemulla, Broscheit and Meilicke, On evaluating embedding models for knowledge base completion, in: Proc. of the 4th Workshop on Representation Learning for NLP, RepL4NLP@ACL, 2019, pp. 104–112.

74.

Wang, Zhang, Feng and Chen, Knowledge graph embedding by translating on hyperplanes, in: Proc. of the Twenty-Eighth AAAI Conf. on Artificial Intelligence, 2014, pp. 1112–1119.

75.

Wu, Shi, Cao, Chen, Lei, Zhang, Wu and He, DisenKGAT: Knowledge graph embedding with disentangled graph attention network, in: CIKM ’21: The 30th ACM International Conference on Information and Knowledge Management, Virtual Event, Queensland, Australia, November 1–5, 2021, ACM, 2021, pp. 2140–2149. doi:10.1145/3459637.3482424.

76.

Wu and Palmer, Verbs semantics and lexical selection, in: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics, ACL ’94, Association for Computational Linguistics, USA, 1994, pp. 133–138. doi:10.3115/981732.981751.

77.

Xiao, Huang and Zhu, TransG: A generative model for knowledge graph embedding, in: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7–12, 2016, Berlin, Germany, Volume 1: Long Papers, The Association for Computer Linguistics, 2016. doi:10.18653/v1/p16-1219.

78.

Xie, Liu and Sun, Representation learning of knowledge graphs with hierarchical types, in: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI 2016, New York, NY, USA, 9–15 July 2016, IJCAI/AAAI Press, 2016, pp. 2965–2971, http://www.ijcai.org/Abstract/16/421 .

79.

Xiong, Hoang and W.Y. Wang, DeepPath: A reinforcement learning method for knowledge graph reasoning, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9–11, 2017, Association for Computational Linguistics, 2017, pp. 564–573. doi:10.18653/v1/d17-1060.

80.

Xu and Li, Relation embedding with dihedral group in knowledge graph, in: Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28–August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 263–272. doi:10.18653/v1/p19-1026.

81.

Yang, Yih, He, Gao and Deng, Embedding entities and relations for learning and inference in knowledge bases, in: 3rd International Conf. on Learning Representations, ICLR, 2015.

82.

Zhang, Tay, Yao and Liu, Quaternion knowledge graph embeddings, in: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, Vancouver, BC, Canada, December 8–14, 2019, 2019, pp. 2731–2741, https://proceedings.neurips.cc/paper/2019/hash/d961e9f236177d65d21100592edb0769-Abstract.html .

83.

Zhang, Deng, Wang, Chen, Zhang and Chen, XTransE: Explainable knowledge graph embedding for link prediction with lifestyles in e-commerce, in: Semantic Technology. JIST 2019, 2020, pp. 78–87. ISBN 978-981-15-3411-9. doi:10.1007/978-981-15-3412-6_8.

84.

Zhang, Cai, Zhang and Wang, Learning hierarchy-aware knowledge graph embeddings for link prediction, in: The Thirty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2020, the Thirty-Second Innovative Applications of Artificial Intelligence Conference, IAAI 2020, the Tenth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2020, New York, NY, USA, February 7–12, 2020, AAAI Press, 2020, pp. 3065–3072, https://ojs.aaai.org/index.php/AAAI/article/view/5701 .

Table 1
Typology of KGs and their respective adequacy for the presented Sem@K versions

| KG type | Sem@K[ext] | Sem@K[base] | Sem@K[wup] |
|---|---|---|---|
| Schemaless | × | | |
| Schema-defined, w/o class hierarchy | × | × | |
| Schema-defined, w/ class hierarchy | × | × | × |

¹ https://doi.org/10.5281/zenodo.7526244
² https://github.com/pmonnin/YAGO4-LP
³ https://github.com/nicolas-hbt/benchmark-sematk (the other KGs are made available in this GitHub repository)

Table 3
Statistics of the schemaless KGs used in the experiments

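| Dataset | $\lvert E \rvert$ | $\lvert R \rvert$ | $\lvert T_{train} \rvert$ | $\lvert T_{valid} \rvert$ | $\lvert T_{test} \rvert$ |
|---|---|---|---|---|---|
| Codex-S | 2,034 | 42 | 32,888 | 1,827 | 1,828 |
| Codex-M | 17,050 | 51 | 185,584 | 10,310 | 10,311 |
| WN18RR | 40,943 | 11 | 86,834 | 3,034 | 3,134 |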

Table 5
Results on YAGO3-37K. Bold fonts and gray cells denote the best achieved results and the worst achieved results among the models reported in the table, respectively. Full results are available in Appendix B, Table 8