Abstract
RDF Graph Summarization is the process of extracting concise but meaningful summaries from RDF Knowledge Bases (KBs), representing as closely as possible the actual contents of the KB, both in terms of structure and data. RDF Summarization allows for better exploration and visualization of the underlying RDF graphs, optimization of queries or query evaluation in multiple steps, better understanding of the connections in Linked Datasets, and many other applications. The literature reports several efforts presenting algorithms for extracting summaries from RDF KBs. These efforts, though, produce different results when applied to the same KB, so a way to compare the produced summaries and decide on their quality and fitness for specific tasks, in the form of a quality framework, is necessary. In this work, we propose a comprehensive Quality Framework for RDF Graph Summarization that allows a better, deeper and more complete understanding of the quality of the different summaries and facilitates their comparison. We work at two levels: the level of the ideal summary of the KB, which could be provided by an expert user, and the level of the instances contained in the KB. For the first level, we compute how close the proposed summary is to the ideal solution (when this is available) by defining and computing its precision, recall and F-measure against the ideal solution. For the second level, we compute whether, and to what degree, the existing instances are covered (i.e. can be retrieved) by the proposed summary; again, we define and compute precision, recall and F-measure, this time against the data contained in the original KB. We also compute the connectivity of the proposed summary compared to the ideal one, since in many cases (e.g. when we want to query) this is an important factor, and RDF datasets are usually linked. We use our quality framework to test the results of three of the best RDF Graph Summarization algorithms when summarizing KBs that are different in content and diverse in total size and in the numbers of instances, classes and predicates, and we present comparative results for them. We conclude this work by discussing these results and the suitability of the proposed quality framework for obtaining useful insights into the quality of the presented results.
Introduction
RDF has become one of the major standards for describing and publishing data, establishing what we call the Semantic Web. The amount of available RDF data is thus increasing fast, both in size and in complexity, making RDF Knowledge Bases (KBs) with millions or even billions of triples commonplace. Given that RDF is built on the promise of linking together relevant datasets or KBs, and with the appearance of the Linked Open Data (LOD) cloud, we can now query KBs (standalone or distributed) holding millions or billions of triples altogether. This increased size and complexity of RDF KBs has a direct impact on the evaluation of the RDF queries we express against them. Especially on the LOD cloud, we observe that a query against a big, complex, interlinked and distributed RDF KB might retrieve no results in the end, either because the association between the different RDF KBs is weak (based only on a few associative links) or because there is an association at the schema level that has never been instantiated at the actual data level. Moreover, many of these RDF KBs carry no schema information at all, or only partial schema information (they mainly contain instances built and described separately). Additionally, in the LOD cloud the number of KBs which do not use the full schema, or which use multiple schemas, has increased, due to the absence of schema information describing the interlinks between the datasets and the combinatorial way of mixing vocabularies.
One way to address the concerns described above is to create summaries of the RDF KBs. The user or the system is then able to decide whether or not to post a query, since the summary already indicates whether the information is present. This would provide significant cost savings in processing time, since we would substitute queries on complex RDF KBs with queries first on the summaries (much simpler structures with no instances), followed by queries only towards the KBs that we know will produce useful results. Graph summarization techniques allow the creation of a concise representation of the KB regardless of whether schema information exists in the KB. The summary then represents the actual situation in the KB, i.e. it should capture the classes and relationships actually used by the instances, and not what the schema proposes (and might never have been used). This should facilitate query building for end users, with the additional benefit of exploring the contents of the KB based on the summary. This holds regardless of whether we use heterogeneous or homogeneous, linked or not, standalone or distributed KBs. In all these cases we can use the RDF summary to concisely describe the data in the RDF KB and possibly add information useful for RDF graph queries, like the distribution and the number of instances for each involved entity.
In the literature we can find various efforts proposing summarization techniques for RDF graphs. These techniques, presented briefly in Section 3, come from various scientific backgrounds, ranging from generic graph summarization to explicit RDF graph summarization. While all promise that they provide correct, concise and well-built summaries, so far there has been very little effort to address, in a comprehensive and coherent way, the problem of evaluating these summaries against different criteria and to establish formal metrics describing the quality of the results. Only sparse efforts have been reported, usually tailored to a specific method or algorithm. With this paper, we aim to cover this gap in the literature and provide a comprehensive Quality Framework for RDF Graph Summarization that allows a better, deeper and more complete understanding of the quality of the different summaries and facilitates their comparison. We propose to compare the summary against two levels of information possibly available for an RDF KB. When an ideal summary is available, either because it has been proposed by a human expert or because we can assume that an existing schema perfectly represents the data graph, we compare the summary provided by the algorithms with it and use similarity measures to compute its precision and recall against the ideal summary. If this is not available, or usually in addition to it, we compute the percentage of the instances represented by the summary (including both class and property instances). This tells us how well the summary covers the KB. Moreover, we introduce a metric to cover the coherence dimension of the problem, i.e. how well connected the computed summary graph is. One can combine the two overall metrics at the end or use them independently. In order to validate the proposed quality metrics, we evaluated three of the most promising RDF graph summarization algorithms and report on the quality of their results over different datasets with diverse characteristics. We should note here that the proposed Quality Framework is independent of the algorithms evaluated, but is suitable for providing a common ground to compare them.
This is why we could summarize our contribution as presenting a quality framework that:
Evaluates the quality of RDF Graph Summaries, where a combined effort is made to summarize while preserving important existing semantics, basic structure and coherence;
Works at different levels, comparing the two summaries (ideal and computed) at both the schema and the instance level, while previous approaches mainly dealt with one level (which corresponds to the instance level in our approach);
Provides novel, customized definitions of precision and recall for summaries, thus allowing better capturing of the quality of the results – going beyond the standard precision and recall definitions;
Adds a discussion on the connectivity of the computed summary and promotes summaries that are more connected; this is quite crucial if we later want to query the summary using standard RDF tools.
So, the proposed framework allows for understanding the quality of the different summaries at different levels. Users can pick the metrics that best fit the task for which they need a summary.
The paper is structured as follows: Section 2 introduces some of the foundations of RDF and RDFS, which are useful for defining some concepts later in our work; Section 3 provides a review of the existing works on quality metrics in graph summarization; Section 4 presents our proposed Quality Metrics for RDF Graph Summaries. Section 5 presents three of the most promising RDF Graph Summarization algorithms in the literature, which are compared using the proposed Quality Framework in Section 6, where we report the extensive experiments performed to validate the appropriateness of the proposed metrics. We conclude the paper in Section 7.
Preliminaries
RDF
As per the W3C standards, the RDF data model represents data on the Web as a set of triples of the form ⟨subject, predicate, object⟩, where the subject is a URI or a blank node, the predicate is a URI, and the object is a URI, a blank node or a literal.
Let
An RDF schema graph
We note
(RDF data graph).
An RDF data graph
We note

Fig. 1. RDF schema and data graphs.
The upper part of Fig. 1 shows a visualization of an RDF schema graph example for the cultural domain, representing only class nodes, while properties are illustrated as edges between classes. For example, the class Painter denotes the set of instances representing painter entities, while the property paints relates instances of class Painter to instances of class Painting. The lower part of Fig. 1 depicts an instance (data) graph built on this schema. This graph represents 6 different resources. For example, the resource Picasso is an instance of the Painter class, having the properties fname, lname and paints.
The set of properties associated with the Picasso node in our example is {fname, lname, paints}.
We denote by
(Property Instances).
We denote by
The set of instances of the class Painting in our example is
A knowledge pattern (or simply pattern from now on) characterizes a set of instances in an RDF data graph that share a common set of types and a common set of properties. More precisely:
(Knowledge Pattern).
A knowledge pattern
We introduce the term knowledge pattern because it is not certain that all summarization algorithms will produce something that can necessarily be defined as an RDF class or RDF property, and also because we want to differentiate it from the classes/properties of the ideal summary when we compare the two.
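To make the notion concrete, a knowledge pattern can be modeled as a plain data structure. The following Java sketch (ours, not the paper's code; all names are illustrative) represents a pattern by its set of types, its set of properties and its support, i.e. the number of instances it characterizes:

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Illustrative representation of a knowledge pattern: a set of types, a set
// of properties and the number of instances (support) that share them.
public final class KnowledgePattern {
    private final Set<String> types;       // e.g. {"MusicArtist"}
    private final Set<String> properties;  // e.g. {"name", "made", "img"}
    private final long support;            // number of instances matching the pattern

    public KnowledgePattern(Set<String> types, Set<String> properties, long support) {
        this.types = Collections.unmodifiableSet(new HashSet<>(types));
        this.properties = Collections.unmodifiableSet(new HashSet<>(properties));
        this.support = support;
    }

    public Set<String> getTypes()      { return types; }
    public Set<String> getProperties() { return properties; }
    public long getSupport()           { return support; }
}
```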
Table 1 shows possible patterns which can be extracted from the RDF instance graph depicted in Fig. 1 based on a forward bisimilarity relation.
Knowledge patterns example (computed based on the bisimilarity relation)
In order to properly address the problem of creating RDF summaries of LOD/RDF graphs, we need to define what an RDF Summary is. Following Definition 5 of the Knowledge Pattern, we define the summary as the set of Knowledge Patterns that the algorithms compute. The proposed Quality Framework will help the user understand which proposed summary best represents the original KB in terms of structure, coverage and connectivity. So, the Summary graph is defined as follows (Summary graph): Let
Bisimilarity relation
Bisimilarity in a directed labeled graph is an equivalence relation defined on a set of nodes N, such that two nodes u and v are bisimilar if, for every edge with label a leading from u to some node u′, there is an edge with label a leading from v to some node v′ such that u′ and v′ are also bisimilar, and vice versa.
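As a first intuition, the depth-one variant of this grouping, where two nodes end up together when they have identical sets of outgoing edge labels, can be sketched as follows (our illustration; a full forward bisimulation would additionally refine the blocks iteratively until the labeled edges of grouped nodes also lead into equal blocks):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Depth-one forward grouping: nodes with identical sets of outgoing edge
// labels fall into the same block. Triples are modeled as String[3] arrays
// {subject, predicate, object}; only subjects are grouped here.
public final class ForwardGrouping {
    public static Map<Set<String>, Set<String>> group(List<String[]> triples) {
        Map<String, Set<String>> labels = new HashMap<>();
        for (String[] t : triples) {
            labels.computeIfAbsent(t[0], k -> new TreeSet<>()).add(t[1]);
        }
        // Invert the map: identical label sets -> one block of nodes.
        Map<Set<String>, Set<String>> blocks = new HashMap<>();
        for (Map.Entry<String, Set<String>> e : labels.entrySet()) {
            blocks.computeIfAbsent(e.getValue(), k -> new HashSet<>()).add(e.getKey());
        }
        return blocks;
    }
}
```

Each resulting block is then a candidate knowledge pattern, with the block size as its support.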
Related work
RDF graph summarization has been intensively studied, with various approaches and techniques proposed for summarizing RDF graphs; these can be grouped into four main categories:
Aggregation and grouping approaches [25,30–32,36], which group the nodes of an input RDF graph G into clusters/groups based on the similarity of attribute values and on the neighborhood relationships associated with the nodes of G.
Structural extraction approaches [2,5,8,10,16–21,28,29,35], which define an equivalence relation on the nodes of the RDF data graph G, usually based on the set of incident graph paths. This allows extracting a form of schema for G by representing the equivalence classes of nodes of G as nodes in the summary graph, each characterized by its set of incident paths.
Logical compression approaches [13,14], which compress RDF datasets by generating a set of logical rules from the dataset and removing triples that can be inferred from these rules. The summary is then represented by a compressed graph and a set of logical decompression rules, with the drawback that such approaches do not produce RDF graphs as summaries.
Pattern-mining-based approaches [6,37,38], which extract frequent patterns from the RDF graph and then compose them to build an approximated summary graph.
Typically, the RDF summarization methods proposed so far do not address in depth the problem of the quality of the produced RDF summaries. A noticeable exception is the work in [4], which proposes a model for evaluating the precision of the graph summary against a gold standard summary, namely a forward and backward bisimulation summary. The main idea of the precision model is to count the edges or paths that exist in the summary and/or in the gold summary graph. The precision of a summary is evaluated in the standard way, based on the number of true positives (the number of edges existing in the summary and in the input graph) and false positives (the number of invalid edges and paths existing in the summary but not in the input graph). The first limitation of this quality model [4] is that it works only with summaries generated by an algorithm that uses a bisimulation relation. Similarly to our quality framework, it considers precision at the instance level, i.e. how many of the summary class/property instances are correctly matched in the original KB. Unlike our work, it does not consider recall at the instance level, claiming that the way summarization algorithms work does not allow them to miss any instance. But this is not always correct: approximate RDF summarization algorithms like [37,38] might miss a lot of instances. As is well known, precision alone cannot accurately assess the quality, since high precision can be achieved at the expense of poor recall by returning only a few (even if correct) common paths. Additionally, and unlike our work, this model does not consider at all the quality of the summary at the schema level, e.g. what happens if one class/property of the ideal summary is missing, an extra one is added, or a property is assigned to the wrong class. In all these cases the result will be the same, while it obviously should not be. Finally, [4] completely lacks any notion of evaluating the connectivity of the final summarization result.
One more effort addressing the quality of hierarchical dataset summaries, [7], is reported in the literature. The hierarchical dataset summary is based on grouping the entities in the KB using their types and the values of their attributes. The quality of a given/computed hierarchical grouping of entities is based on three metrics: (1) the weighted average coverage of the hierarchical grouping, i.e. the average percentage of the entities of the original graph that are covered by each group in the summary; (2) the average cohesion of the hierarchical grouping, where the cohesion of a subgroup measures the extent to which the entities in it form a united whole; and (3) the height of the hierarchical grouping, i.e. the number of edges on a longest path between the root and a leaf. The main limitation of this approach is that it works only with hierarchical dataset summaries, since metrics like the cohesion of the hierarchical groups or the height of the hierarchy cannot be computed in other cases. Moreover, the proposed groupings provide a summary that can be used for a quick inspection of the KB but cannot be queried by any of the standard semantic query languages. On the other hand, and similarly to our quality framework, [7] considers recall (named coverage) at the instance level, i.e. how many of the instances of the original KB are correctly covered by the summary concepts. Contrary to our work, this model does not consider at all the quality of the summary at the schema level. Notions from [7] can also be found in the current paper, where algorithms like [37,38] that rely on approximation get penalized if they approximate too much, in effect losing the cohesion of the instances represented by the computed knowledge patterns.
Besides that, only a few efforts have been reported in the literature addressing the quality of schema summarization methods in general [3,27,33], i.e. the quality of the RDF schema that can be obtained through RDF summarization. The quality of the RDF schema summary in [27] is based on expert ground truth and is calculated as the ratio of the number of classes identified both by the expert users and by the summarization tool over the total number of classes in the summary. The main limitation of this approach is that it uses a boolean match of classes and fails to take into account similarity between classes when classes are close but not exactly the same as in the ground truth, or when classes in the ground truth are represented by more than one class in the summary. Works in schema matching (e.g. [33]) also use, to some extent, metrics like recall, precision and F1-measure, commonly used in Information Retrieval, but they are not relevant to our work: even if we consider an RDF graph summary as an RDF schema, we are not interested in matching its classes and properties one by one; as stated above, this binary view of the summary results does not offer much in the quality discussion. Additionally, these works do not take into account issues like the size of the summary.
To the best of our knowledge, this is the first effort in the literature to provide a comprehensive Quality Framework for RDF Graph Summarization that is independent of the type and specific results of the algorithms used and of the size, type and content of the KBs. We provide metrics that help us understand not only whether a summary is valid, but also whether one summary is better than another in terms of the specified quality characteristics. And we can do this by assessing information, where available, both at the schema and at the instance level.
Quality assessment model
In this section we present a comprehensive and coherent way to measure the quality of RDF summaries produced by any algorithm that summarizes RDF graphs. The framework is independent of the way the algorithms work and makes no assumptions about the type or structure of either the input or the final results, besides their being expressed in RDF; this is required in order to guarantee the validity of the result, but the framework can easily be extended to other cases of semantic summarization, e.g. graphs expressed in OWL or Description Logics. In order to achieve this, we work at two levels:
schema level, where, if an ideal summary exists, the summary is compared with it by computing the precision and recall of each class and its neighborhood (the properties and attributes having that class as domain) in the produced summary against the ideal one; we also compute the precision and recall of the whole summary against the ideal one. The first captures the quality of the summary at the local (class) level, while the second gives us the overall quality in terms of class and property/attribute precision and recall.
instance level, where the coverage that the summary provides for class and property instances is calculated, i.e. how many instances would be retrieved if we queried the whole summary graph. We again use precision and recall, this time against the contents of the original KB.
At the end, a metric is presented that provides an indication of the quality of the graph summary by measuring whether or not the summary is a connected graph. Ideally, a summary should be a connected graph, but this also depends on the actual data stored in the Knowledge Base. Thus a disconnected graph could be an indication of the data quality in the KB and not necessarily a problem of the summarization process. Nevertheless, we present it here as another indicator of the quality process, especially when the summary is compared with an ideal one, but for the reason mentioned above we avoid combining it with the rest of the presented metrics. Finally, we discuss some results that combine these metrics and interpret their meaning.
Summary description of the proposed schema measures
In this section we present our quality assessment framework to evaluate the quality of an RDF graph summary against a ground truth summary
Precision is the fraction of the retrieved classes and properties of the summary that are relevant. If a knowledge pattern of a summary carries a typeof link, then the pattern is relevant to a specific class if the typeof points to that class; otherwise it is not relevant to that class. If no typeof information exists, then we use the available properties and attributes to evaluate the similarity between a class and a pattern. Thus we define the
We define the
Based on the
However, neither precision nor recall alone can accurately assess the match quality. In particular, recall can easily be maximized at the expense of poor precision by returning as many correspondences as possible; on the other side, high precision can be achieved at the expense of poor recall by returning only a few (correct) correspondences. Hence it is necessary to consider both measures and express them through a combined measure; we use the F-Measure for this purpose, namely the harmonic mean F = 2 · P · R / (P + R) of precision P and recall R.
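For illustration, assuming a plain set-overlap instantiation over the property set props(c) of a class c and the property set props(p) of a pattern p (the paper's exact weighted definitions may differ), these measures take the familiar form:

\[
P(c,p) = \frac{|props(c) \cap props(p)|}{|props(p)|}, \qquad
R(c,p) = \frac{|props(c) \cap props(p)|}{|props(c)|}, \qquad
F(c,p) = \frac{2\,P(c,p)\,R(c,p)}{P(c,p) + R(c,p)}
\]

This is consistent with the working example later in the paper, where a pattern covering 4 of the 5 properties of the MusicArtist class yields a recall of 4/5 = 0.8.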
We note that the schema precision at the property level in our experiments is always equal to 1 (see Section 6), which means that in our examples there are no false positives for properties. Summarization algorithms do not invent new properties, although they might report some properties, present in the KB, that are not present in the ground truth summary. So, the precision for properties, namely
Thus, the F-Measure for the schema properties, namely
Summary description of the proposed instance measures
We compute the number of connected components of the summary (and, in the same manner, of the ground truth) using the breadth-first search algorithm: given a particular node n, we find the entire connected component containing n (and no more) before returning. To find all the connected components of a summary (or ground truth) graph, we loop through the nodes, starting a new breadth-first search whenever the loop reaches a node that has not already been included in a previously found connected component. This metric gives an indication of the connectivity of a generated summary. If it is 1, the summary graph is as connected as the ground truth graph; if it is bigger than 1, the summary is more disconnected than desired. The higher the connectivity value, the more links are missing between the classes of the computed graph compared to the ground truth; this even correctly captures a completely disconnected summary graph. The metric allows us to penalize (if needed) summary graphs that are disconnected compared to the ground truth, with progressive linear penalties. It is also theoretically possible for the summary graph to be more connected than the ground truth graph, which yields values less than 1. The value of the connectivity can tend to, but never reaches, 0.
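The component counting just described can be sketched in a few lines of Java (our illustration; we read the metric as the ratio of the number of connected components of the summary to that of the ground truth, treating edges as undirected, so the adjacency map below is assumed symmetric and non-empty):

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Counts connected components of a graph with breadth-first search and
// derives the connectivity metric as a ratio against the ground truth:
// 1 = as connected as the ground truth, > 1 = more disconnected, < 1 = more connected.
public final class Connectivity {
    public static int countComponents(Map<String, Set<String>> adj) {
        Set<String> visited = new HashSet<>();
        int components = 0;
        for (String start : adj.keySet()) {
            if (visited.contains(start)) continue;
            components++;                          // new component discovered
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);
            visited.add(start);
            while (!queue.isEmpty()) {             // standard BFS
                String n = queue.poll();
                for (String m : adj.getOrDefault(n, Collections.<String>emptySet())) {
                    if (visited.add(m)) queue.add(m);
                }
            }
        }
        return components;
    }

    public static double connectivity(Map<String, Set<String>> summary,
                                      Map<String, Set<String>> groundTruth) {
        return (double) countComponents(summary) / countComponents(groundTruth);
    }
}
```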
We measure the quality with regard to the instances by introducing the notion of the coverage of the instances of the original KB, i.e. how many of the original class and property instances are successfully represented by the computed RDF summary graph (e.g. can be retrieved in the case of a SPARQL query). This requires computing both the precision and recall at the class instance and at the property instance levels. Table 3 gives us a summary description of the proposed instance level metrics.
We denote by
Thus, we can define the total number of class instances
We define
We also need to make one last point about the computation of the instance-level metrics, for the case where our KB contains no schema information. In this case, in order to make the instance-level class precision and recall computable, we need to annotate the KB with
Representative algorithms for validating the quality framework
Algorithms’ description
As we have already mentioned in Section 3, RDF graph summarization algorithms can be grouped into four main categories. Based on the results reported in the literature, we have chosen three of the best-performing RDF graph summarization algorithms [5,16,37], according to their authors. Our selection of these algorithms was also based on specific properties and features that they demonstrate: (a) they do not require the presence of RDF schema (triples) in order to work properly, (b) they work on both homogeneous and heterogeneous KBs, (c) they provide statistical information about the available data (which can be used to estimate a query's expected result size), and (d) they provide a summary graph that is considerably smaller than the original graph.

Fig. 2. An artificial dataset about music artists and their productions.
Binary Matrix Mapper: Transforms the RDF graph into a binary matrix, where the rows represent the subjects and the columns represent the predicates. This step preserves the semantics of the information by capturing distinct types (if present) and all attributes and properties (capturing property participation both as subject and as object for an instance); see the sketch after this step list.
Graph Pattern Identification: The binary matrix created in the previous step is fed to a calibrated version of the PaNDa+ [26] algorithm, which allows experimenting with different cost functions while retrieving the best approximate RDF graph patterns. Each extracted pattern identifies a set of subjects (rows) all having approximately the same properties (columns). The patterns are extracted so as to minimize errors and to maximize the coverage (i.e. provide a richer description) of the input data. A pattern thus encompasses a set of concepts (type, property, attribute) of the RDF dataset, holding at the same time information about the number of instances that support this set of concepts.

Fig. 3. The ideal summary of the dataset depicted in Fig. 2.
Constructing the RDF summary graph: A process that reconstructs the summary as a valid RDF graph from the extracted patterns is applied at the end. The process exploits information already embedded in the binary matrix and constructs a valid RDF schema summarizing the KB.
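A minimal sketch of the first step, the binary matrix construction, could look as follows (our illustration: rows are subjects, columns are predicates; the actual algorithm also encodes types and object-side participation, which we omit for brevity):

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Builds a binary subject-by-predicate matrix from a list of triples,
// each modeled as a String[3] array {subject, predicate, object}.
// Cell (i, j) is true when subject i has an outgoing triple with predicate j.
public final class BinaryMatrixMapper {
    public static boolean[][] map(List<String[]> triples,
                                  List<String> subjectsOut, List<String> predicatesOut) {
        Map<String, Integer> subjIdx = new LinkedHashMap<>();
        Map<String, Integer> predIdx = new LinkedHashMap<>();
        for (String[] t : triples) {
            subjIdx.putIfAbsent(t[0], subjIdx.size());
            predIdx.putIfAbsent(t[1], predIdx.size());
        }
        boolean[][] matrix = new boolean[subjIdx.size()][predIdx.size()];
        for (String[] t : triples) {
            matrix[subjIdx.get(t[0])][predIdx.get(t[1])] = true;
        }
        subjectsOut.addAll(subjIdx.keySet());   // row labels
        predicatesOut.addAll(predIdx.keySet()); // column labels
        return matrix;
    }
}
```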
We implemented the three algorithms used in the experiments hereafter ourselves. The implementations of two of them (ExpLOD and Campinas et al.) were not available from the original authors, so we implemented them in Java based on the corresponding papers. We validated the implementations by running tests with the datasets described in the original papers; since we obtained the same results, we are quite confident that the implementations are correct. Given that performance benchmarking is out of the scope of this work, we did not pursue any aggressive optimizations.
Working example
In an effort to better explain how our quality assessment framework works and captures the differences among the different summaries, we provide a working example. We have created an artificial dataset containing information about music artists and their productions. Figure 2 shows a visualization of the RDF instance graph of this dataset. There are 3000 resources describing music artists, all of which have the name and made properties, while only 2500 resources have the rdf:type property, 2049 have the homepage property, 2850 have the img property and 50 have the biography property. There are also 5000 resources describing records, all of which have the date, image, track and maker properties, while 4995 have the title property and only 28 have the description property. Finally, there are 45000 resources describing tracks, all of which have the rdf:type, title, track-number and available-as properties, while only 5 resources have the olga property (used to link a track to a Document for tracking in the On-Line Guitar Archive). The tracks are available as Playlist and/or ED2K formats. Figure 3 shows an ideal summary for this dataset, as suggested by an expert.
Tables 4, 5 and 6 present the three RDF summaries generated using the three discussed algorithms: ExpLOD, Campinas et al. and Zneika et al., respectively. The first column shows the pattern id, the second shows the predicates involved in the pattern, the third shows the corresponding ideal summary class for the pattern, and the last column shows the number of instances per pattern. Figures 4, 5 and 6 visualize the three RDF summaries generated using ExpLOD, Campinas et al. and Zneika et al., respectively.
ExpLOD summary for the dataset depicted in Fig. 2
Campinas et al. summary for the dataset depicted in Fig. 2
Zneika et al. summary for the dataset depicted in Fig. 2
Here we calculate the precision for the MusicArtist class for the three summaries. We start with the ExpLOD summary described in Table 4,
Now let us take the Campinas et al. summary described in Table 5. In this table we can see that two patterns, Pa1 and Pa2, represent the MusicArtist class, so

Fig. 4. The ExpLOD summary of the dataset depicted in Fig. 2.

Fig. 5. The Campinas et al. summary of the dataset depicted in Fig. 2.

Fig. 6. The Zneika et al. summary of the dataset depicted in Fig. 2.
Schema metrics at class level
Now let us compute the precision of the MusicArtist class for the Zneika et al. summary depicted in Table 6,
Following the same procedure, we can calculate the precision for each class in the set of classes of the ideal summary; these results are reported in Table 7a. We should also note that the class Document, which is reported in the summaries of ExpLOD and Campinas et al., is not a class of the ideal summary.
Table 7b shows the values of the recall for the list of ideal summary classes. We can note that for ExpLOD and Campinas et al. all recall values are 1, as their patterns cover all the properties in the ideal summary, while for Zneika et al. the recall for MusicArtist is 0.8, because pattern Pa1, which represents the MusicArtist class, does not cover the biography property, so its recall equals 4 properties over the 5 in the ideal summary, i.e. 4/5 = 0.8.
Table 8 shows the recall and precision values for the list of ideal summary properties. Table 8a shows the schema-level property precision: we notice that each of the ExpLOD and Campinas et al. summaries has 16 properties, 13 of which are included in the ideal summary, while the other three (discography, description and olga) are not. That makes the property precision for each of these two summaries 13/16 ≈ 0.81.
Concerning the recall at the property level, the ExpLOD and Campinas et al. recall equals 1, as they include all the properties of the ideal summary, while Zneika et al. misses one property, biography, so its recall is 12/13 ≈ 0.92.
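Written out, and treating the 13 matched properties as the complete ideal property set (which the recall of 1 for ExpLOD and Campinas et al. implies), these schema-level property computations are:

\[
P_{prop}^{ExpLOD} = P_{prop}^{Campinas} = \frac{13}{16} \approx 0.81, \qquad
R_{prop}^{ExpLOD} = R_{prop}^{Campinas} = \frac{13}{13} = 1, \qquad
R_{prop}^{Zneika} = \frac{12}{13} \approx 0.92
\]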
Schema metrics at property level
Instance metrics at property level
Table 9 shows the recall values for the list of distinct properties of the dataset depicted in Fig. 2. We can note that for ExpLOD and Campinas et al. all recall values are 1, as their patterns cover all the property instances of the dataset, while for Zneika et al. the property instance recall values for biography, discography and description are 0, because these properties are completely missing from the Zneika et al. summary.
Table 9a shows the values of the property instance precision. We can note that for ExpLOD all precision values are 1, as its patterns described in Table 4 correctly identify all the property instances of the dataset. For the running example, the property
Now let us try to compute the instance precision value for the
Now let us take the Zneika et al. summary described in Table 6. In this table we can see that only pattern Pa1 has the
Following the same procedure, we can calculate the instance property precision for all the dataset properties; these results are reported in Table 9a.
On the other hand, the class precision and recall at the instance level in this example are always equal to 1 or almost 1 (in only one case are a few class instances missing), so their computation provides no further insights for this example. This is why the corresponding tables were omitted.
Connectivity
Table 10 reports the connectivity metric values for the summaries produced by the three discussed algorithms. ExpLOD has a value of 6 for this metric, because its summary ends up with 6 separate components while the ideal summary depicted in Fig. 3 has exactly one connected component; this means that ExpLOD provides a disconnected summary. The two other algorithms report a value of 1, which means that they provide summaries as connected as the ideal one (one connected component in this case).
In this section, we compare the quality of the generated summaries of the three RDF graph summarization approaches covered in Section 5. We implemented these three approaches in Java 1.8 using the Nxparser library.
Nxparser: https://github.com/nxparser/nxparser.
Descriptive statistics of the datasets
Table 11 shows the datasets from the LOD cloud that are considered in the experiments. The first seven columns show the following information about each dataset: its name, the number of triples it contains, and the numbers of instances, classes, predicates, properties and attributes. The eighth column shows the class instance distribution metric, which provides an indication of how instances are spread across the classes and is defined as the standard deviation (SD) of the number of instances per class: when the numbers of instances per class are close to each other the standard deviation is small, while when there are considerable differences it is relatively large. The ninth column shows the property instance distribution metric, which provides an indication of how instances are spread across the properties and is likewise defined as the standard deviation of the number of instances per property.
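Concretely, if $n_c$ is the number of instances of class $c$, $C$ the set of classes and $\bar{n}$ the mean number of instances per class, the class instance distribution is (assuming the population variant of the standard deviation):

\[
SD = \sqrt{\frac{1}{|C|} \sum_{c \in C} \left(n_c - \bar{n}\right)^2}
\]

with the property instance distribution defined analogously over the properties.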
The main goal of our dataset selection is to use real-world datasets from diverse domains, with different sizes (numbers of triples) and with different numbers of classes (and class instances) and properties (and property instances). We are also interested in the distribution of the data, which might indicate whether the structure of the KB or the size of the represented knowledge affects the quality of the generated summaries. So we have datasets ranging from 270 thousand (Jpeel) to 263 million triples (Lobid), from one (Bank) to 53 unique classes (LinkedMDB), from about 76 thousand (Jpeel) to about 18 million unique instances/entities, and from 12 to 222 predicates. These datasets range from very homogeneous (the Bank dataset, where all subjects have the same list of attributes and properties) to very heterogeneous (LinkedMDB, where the attributes and properties vary widely across types). The diversity of the datasets can help us better understand how the selected approaches work in different situations and thus validate that the proposed quality metrics capture the different behaviors correctly.
Evaluation results
In this section, we discuss the quality results of the RDF graph summarization approaches covered in Section 5, evaluated over all the datasets described in Table 11 for the following two cases:
Typed Dataset: the KB contains schema information, such as definitions of classes and properties, and, more importantly, a significant number of instances of the dataset have at least one typeof link/property.
Untyped Dataset: there is no schema information in the KB and, more importantly, none of the dataset's subjects/objects or properties has a defined type (we explicitly checked and deleted all of them).
This distinction is important for the experimentation because some algorithms try to exploit schema-related information (mainly typeof links) in order to gain insights into the structure of the KB. While using this information wherever available can be valuable, we also want to test the summarization algorithms in cases where it is not available. With that we can validate that the proposed Quality Framework correctly captures the differences in the results and correctly identifies, for example, algorithms that work well in both cases.
Implementation of the quality framework
We implemented our Quality Framework as a tool that takes as input the results of any RDF Graph Summarization algorithm together with the ideal summary, and computes the different metrics required to capture the quality of the results at the different levels described earlier. It outputs the values of the different metrics in an automated fashion and computes F-measures where applicable. In principle, it can be used to compare the quality of any summary against an ideal one, or to understand how close two summaries are to one another. It is implemented in Java and is available as open source software here: https://github.com/ETIS-MIDI/Quality-Metrics-For-RDF-Graph-Summarization.
We describe the different steps in the form of algorithmic pseudocode that allows tracking the computations taking place at the different levels at which the Quality Framework operates. The pseudocode of Algorithm 1 gives an overview of our implementation of the computations at the schema level. The function computing the schema class recall is shown in Algorithm 2, and the one computing the schema class precision in Algorithm 3. The function computing the schema property precision and recall is shown in Algorithm 4. In the same manner, the pseudocode of Algorithm 5 gives an overview of the computations at the instance level. The function computing the instance class recall is shown in Algorithm 6, and the one computing the instance class precision in Algorithm 7. The function computing the instance property precision and recall is shown in Algorithm 8; a simplified illustration of the instance-level computation follows the algorithm listings below.

Algorithm 1. Schema Level Metrics
Algorithm 2. Function Schema Class Recall
Algorithm 3. Function Schema Class Precision
Algorithm 4. Function Schema Property Precision and Recall
Algorithm 5. Instance Level Metrics
Algorithm 6. Function Instance Class Recall
Algorithm 7. Function Instance Class Precision
Algorithm 8. Function Instance Property Precision and Recall
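As a simplified illustration of the instance-level computation (not the paper's exact Algorithm 8; the per-property counting scheme below is our assumption), precision penalizes instances the summary claims but the KB does not contain, while recall penalizes instances of the KB the summary fails to cover:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified instance-level property metrics: for each property, the summary
// claims some number of supporting instances (the pattern supports), while the
// KB holds the actual number. True positives are the instances both claimed
// and present, counted per property.
public final class InstancePropertyMetrics {
    public static double[] precisionRecall(Map<String, Long> claimedPerProperty,
                                           Map<String, Long> actualPerProperty) {
        Set<String> all = new HashSet<>(claimedPerProperty.keySet());
        all.addAll(actualPerProperty.keySet());
        long truePositives = 0, claimed = 0, actual = 0;
        for (String property : all) {
            long c = claimedPerProperty.getOrDefault(property, 0L);
            long a = actualPerProperty.getOrDefault(property, 0L);
            truePositives += Math.min(c, a); // instances both claimed and present
            claimed += c;
            actual += a;
        }
        double precision = claimed == 0 ? 0.0 : (double) truePositives / claimed;
        double recall = actual == 0 ? 0.0 : (double) truePositives / actual;
        return new double[] { precision, recall };
    }
}
```

On the working example, a summary that completely misses the biography property contributes 0 matched instances for it, driving the per-property recall for biography to 0, as in Table 9.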
Table 13 reports the precision, recall and F-Measure values at the schema level for the classes and properties of the generated RDF summaries over the datasets depicted in Table 11, for the typed and untyped cases. The left part of Table 13 shows the results for the typed datasets, while the right part shows the results for the untyped datasets. Figures 7 and 8 show the overall schema F-Measure and the class precision values, respectively; they were picked as visualization examples among the computed metrics because, rendered as charts, they offer more detail.
We can note from Table 13 that the schema property recall, schema property precision and the schema property F-Measure, reported in columns
Another notable observation, from Table 13g and Fig. 7b, is that for the Bank dataset the perfect value (equal to 1) of the overall schema F-Measure is reported for the Zneika et al. and Campinas et al. algorithms. This is because the Bank dataset is a fully typed and homogeneous dataset (each subject has at least one typeof link/property) and, as we explained earlier, the Campinas et al. algorithm groups the original nodes based only on their types when types exist, hence no properties are missed or added in this case.
For the Sec dataset, Table 13e shows that the values of schema class precision reported in column
Tables 13s and 13t report the schema-level metric values for the RDF summaries generated by the Campinas et al. and Zneika et al. algorithms over the DBpedia dataset, for the typed and untyped cases. We do not report results for the ExpLOD algorithm because our ExpLOD implementation was bound to datasets that fit in main memory, and DBpedia did not. We notice from these tables that the values of the schema class precision reported in column

Fig. 7. F-measure results for typed/untyped presented datasets at the schema level.

Fig. 8. Class precision results for typed/untyped presented datasets at the schema level.
Table 13 clearly shows that algorithms like ExpLOD do not provide quality summaries in extreme cases like the Bank dataset (where we have only one class) or in heterogeneous datasets like LinkedMDB, Linkedct and DBLP, where they report very low class precision values, because instances of the same class in these cases have quite different properties and cannot be grouped together by ExpLOD. This is because the ExpLOD algorithm depends on the notion of forward bisimulation [15], which groups the original nodes based on the existence of common typeof and property links: two nodes v and u are bisimilar, and will end up in the same equivalence class (pattern), if they have exactly the same set of types and properties. Thus, it might generate a summary where many ideal summary classes are represented by several knowledge patterns. For example, in the Bank dataset case, whose ideal summary contains only one class, ExpLOD generated 79 knowledge patterns. As mentioned in Section 4.1, we have included in our framework a way to penalize these cases by introducing the exponential function W(c) (see equation (7)). Table 13 and Figs 7 and 8 also demonstrate that the Zneika et al. algorithm gives better results than the other two algorithms over all the presented datasets, and showcase that it works well with heterogeneous datasets like LinkedMDB, unlike ExpLOD and Campinas et al., which give low class precision on the heterogeneous datasets.
By comparing the results for the typed datasets depicted in Fig. 8a with those for the untyped datasets depicted in Fig. 8b, we can easily observe that the behavior of the Zneika et al. and ExpLOD algorithms in the untyped case is the same as in the typed case, which means that the quality of their summaries is not affected by the presence (or not) of schema information in the KB. In contrast, we can easily observe the significant impact that the absence of typeof schema information has on the Campinas et al. algorithm.
The discussion so far provides some insights into how the proposed Quality Framework can be used to assess the quality of the summaries produced by the different algorithms. Since we are comparing the quality of the computed summary to a ground truth summary provided by an expert, in general we can observe that:
the summarization algorithms usually capture correctly the properties involved in the data, but miss, at different levels (and for different reasons), some of the classes. The Quality Framework provides enough resolution to clearly identify the algorithms that provide a better summary in terms of the classes reported and the quality of this report (e.g. are all properties reported, is the class present as one entity in the computed summary, etc.).
the summarization algorithms do not capture well cases where the data (instances) are multiply classified or where there are quite widespread subsumption relationships.
the summarization algorithms can differ considerably when reporting on the contents of the KB, and the quality of the summaries can vary greatly, mostly because of differences in the precision of reporting the classes in the summary, including the penalization of verbose descriptions (like those reported by ExpLOD). So we can capture even fine differences, for example when a single class in the ground truth is represented by two classes in the computed summary.
Table 14 reports the precision, recall and F-Measure of the RDF summaries at the instance level, based on the same datasets and algorithms as before. The left part of Table 14 shows the results for the typed datasets, while the right part shows the results for the untyped datasets. For each dataset, we report the precision, recall and F-measure values at the class and property level. We note that ExpLOD produces the best results (actually perfect ones, always 1), since it misses no property or class instance: ExpLOD groups even just two instances if they have the same set of attributes and types, and thus does not add any false positives. We can also note that the instance class precision and recall reported in columns
Table 14 also shows that the behavior of the Zneika et al. and ExpLOD algorithms in the untyped case is the same, or approximately the same, as in the typed case, which means that, for these two algorithms, the quality of the summary with regard to instance coverage is not affected by the presence (or not) of schema information in the KB. On the other hand, we can easily observe the significant positive impact that the absence of typeof schema information has on the Campinas et al. algorithm.
Also, Tables 14s and 14t report the instance-level metric values for the RDF summaries generated by the Campinas et al. and Zneika et al. algorithms over the DBpedia dataset, for the typed and untyped cases respectively. From Table 14t we note that Campinas et al. produces perfect results, missing no property or class instance: in the untyped case, Campinas et al. groups instances that have the same set of properties, regardless of how many there are, and thus adds no false positives. On the contrary, Table 14s shows that Campinas et al. produces a very poor value for the instance-level property precision reported in column
From this discussion, we can observe that the summarization algorithms provide results of good quality concerning the coverage of the instances in the KB. The proposed quality metrics clearly show that relying only on this aspect is not adequate to judge the quality of a summary, since many of the algorithms report perfect scores in all measures. Still, there are cases where we can distinguish the quality of the results based on the instances covered by the computed summary, especially when algorithms use approximative methods to compute the summary (one algorithm in our case). It is worth noting that our Quality Framework can capture both under-coverage (when not all instances are represented in the final result) and over-coverage (when some instances are represented more than once, or some fictitious instances are included). With the metrics at the instance level we can capture these fine differences in whether, and to what extent, the instances of the KB are correctly covered.
Results combining schema- and instance-level metrics
By comparing the results in both cases, it becomes clear why it is important to take into consideration quality metrics that capture information both at the instance and at the conceptual level. Otherwise, behaviors like the one demonstrated by ExpLOD cannot be captured, and flawed summaries might be indistinguishable from better ones. Overall, we argue that the Quality Framework introduced in Section 4 is adequate for capturing the fine differences in the quality of the summaries produced by the three algorithms. A closer look at the results also lets us gain, or verify, insights into how specific algorithms work and into the quality of the summaries they produce.
One final metric to be considered is whether the final graph is connected or appears as more than one connected component. A disconnected result means that the summarization algorithm, while correctly capturing the important properties and classes in the KB, fails to provide a connected graph at the end. This is important because it might determine whether the summary graph is usable, for example, for answering SPARQL queries. Table 12 reports the connectivity metric values for the summaries produced by the three discussed algorithms over all the datasets described in Table 11. It shows that ExpLOD always has high values for this metric, which means it provides disconnected summaries, while the two other algorithms always report 1, which means that they provide connected summaries (at least as connected as the ideal summary; fully connected in our examples).
Connectivity metric results
So, measuring the quality at the schema level and the instance level, together with the connected components of the graph, can give us a detailed view of the strengths and weaknesses of a summary, letting us decide whether to use it depending on the intended use and application. We avoided combining all the measures together, because this might blur the final picture. The idea is not necessarily to prove an algorithm better or worse (we can do this to a great extent through the different F-measures), but mainly to help the user understand the different qualities of the summaries and choose the best one for the needs of each of the diverse use cases.
Precision, Recall and F-measure at the schema level.
Precision, Recall and F-measure at the instance level.
In this paper, we introduced a quality framework defined by a set of metrics that can be used to comprehensively evaluate any RDF summarization algorithm reported in the literature. The proposed metrics are independent of the algorithm, the KB (and thus the data) and the existence or not of schema information within the KB. The Quality Framework proposed in this paper correctly captures various desirable properties of the original KB. Specifically, it accounts for:
the conciseness of the summary, by:
penalizing verboseness in the form of multiple patterns representing a single class of the ideal summary;
capturing the similarity of the different patterns or groups created by the summarization algorithm with the corresponding parts of the ideal summary, even if this similarity is not 100%;
the connectedness of the summary, by:
introducing a metric on the connectivity of the summary, thus prioritizing connected summaries over less connected ones;
the comprehensibility of the summary, by:
covering the schema part, and thus understanding how good a summary is at the structural level;
covering the instance part, and thus understanding how good a summary is at covering the instances in the KB;
understanding how well connected the summary, and thus the content of the KB, is;
capturing subtle differences in the resulting summary, like the omission of just one property or the approximation of the number of instances, which allows the user to really understand why and where there is a problem;
the overall quality of the summary, so that it can be compared with other summaries, by:
combining the different metrics, like precision, recall and F-measure at the different levels, with connectedness, in order to allow for an overall comparison, while the individual metrics still provide a more detailed idea of where a computed summary has problems.
We made a substantial effort to validate that the proposed Quality Framework correctly captures the differences present in different summaries, by evaluating three different algorithms (that work in substantially different ways) over ten different and diverse datasets, showcasing that the different quality aspects are indeed captured correctly and that the results are easily matched to the status of the KB.
To the best of our knowledge, no other effort in the literature tries to capture the quality properties of RDF graph summaries both at the concept (schema) and at the instance level in a complete and comprehensive way. The experiments showed that, using the proposed set of metrics, we are now able to compare, at different levels, the quality of the RDF summaries produced by different algorithms from the literature, applied on different and diverse datasets, and to extract useful insights about their suitability for various tasks.
We plan to extend this work by applying the framework to Linked Data sources where quality results might be different for each part of the linked datasets. We would like to explore both theoretically and experimentally whether there are ways to provide consolidated quality metrics treating the linked KBs as one, which will go beyond simply averaging the individual quality results. We would also like to use the framework to assess the quality of the results of more algorithms, in order to validate experimentally its suitability.
