RDF graph validation using rule-based reasoning

Abstract

The correct functioning of Semantic Web applications requires that given RDF graphs adhere to an expected shape. This shape depends on the RDF graph and the application’s supported entailments of that graph. During validation, RDF graphs are assessed against sets of constraints, and found violations help refine the RDF graphs. However, existing validation approaches cannot always explain the root causes of violations (inhibiting refinement), and cannot fully match the entailments supported during validation with those supported by the application. As a result, these approaches either cannot accurately validate RDF graphs or must combine multiple systems, which deteriorates the validator’s performance. In this paper, we present an alternative validation approach using rule-based reasoning, capable of fully customizing the used inferencing steps. We compare to existing approaches, and present a formal ground and a practical implementation, “Validatrr”, based on N3Logic and the EYE reasoner. Our approach – supporting an equivalent number of constraint types compared to the state of the art – better explains the root cause of the violations due to the reasoner’s generated logical proof, and returns an accurate number of violations due to the customizable inferencing rule set. Performance evaluation shows that Validatrr is performant for smaller datasets, and scales linearly w.r.t. the RDF graph size. The detailed root cause explanations can guide future validation report description specifications, and the fine-grained level of configuration can be employed to support different constraint languages. This foundation allows further research into handling recursion, validating RDF graphs based on their generation description, and providing automatic refinement suggestions.

Keywords

Constraints rule-based reasoning validation

1. Introduction

Semantic Web data is represented using the Resource Description Framework (RDF), forming an RDF graph [25]. The quality of an RDF graph – its “fitness for use” [86] – heavily influences the results of a Semantic Web application [58]. An RDF graph’s fitness for use depends on its shape, i.e., the RDF graph itself and the application’s supported entailments of that RDF graph. For example, some applications support inferring rdfs:subClassOf entailments [19], whereas other applications require the RDF graph to explicitly contain all classifying triples (i.e., rdfs:subClassOf entailment is not supported).

RDF graphs are validated by assessing their adherence to a set of constraints [57], and different applications (i.e., different use cases) specify different sets of constraints. Via validation, we discover (portions of) RDF graphs that do not conform to these constraints, i.e., the violations that occur. These violations guide the user to the resources and relationships related to the constraints. Refining these resources and relationships results in an RDF graph of higher quality [31]; thus, RDF graph validation is an important element for the correct functioning of Semantic Web applications.

1.1. Validation problems

Let us consider the following example: an RDF graph containing people and their birthdates is validated. The use case dictates the set of constraints and the supported entailments. Specifically, we validate formula (1),¹ with a relevant ontology represented in formula (2).

¹ For the remainder of the paper, empty prefixes denote the fictional schema http://example.com/; other prefixes conform to the results of https://prefix.cc.

\begin{array}{ll} (1) & \texttt{:Bob :firstname "Bob" ;} \\ & \quad \texttt{:birthdate "1970-01-01"\^{}\^{}xsd:date .} \\ (2) & \texttt{:birthdate rdfs:domain :Person .} \\ (3) & \texttt{:Bob a :Person .} \end{array}

Problem 1 (P1): Finding the root causes of violations. For example, a use case dictates that every resource should have either a firstname and lastname, or a nickname. This constraint, $c_{compound}$, is thus a compound of several constraints. When the RDF graph contains formula (1) and formula (3), :Bob should be marked as a resource violating $c_{compound}$. However, the RDF graph cannot be refined solely by knowing which resources violate the constraint. The root cause of the violation is needed: does the resource lack firstname, lastname, or nickname?

For constraint types such as compound constraints, existing validation approaches typically return the resource that violates the constraint. More detailed descriptions are typically not provided, and manual inspection is needed to discover the root cause of a violation, i.e., why a resource violates a constraint. Without the root cause, it is hard to (automatically) refine the RDF graph and improve its quality.
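The gap between "which resource violates the constraint" and "why it does" can be illustrated with a minimal Python sketch. The in-memory triple set, the `check_compound` helper, and the report layout are assumptions made for illustration only; they do not reproduce any existing validator.

```python
# Hypothetical in-memory graph: a set of (subject, predicate, object) triples
# mirroring formula (1) and formula (3).
GRAPH = {
    (":Bob", ":firstname", "Bob"),
    (":Bob", ":birthdate", "1970-01-01"),
    (":Bob", "rdf:type", ":Person"),
}


def properties_of(graph, resource):
    """Return the set of predicates used with `resource` as subject."""
    return {p for s, p, _ in graph if s == resource}


def check_compound(graph, resource):
    """Check c_compound: (firstname AND lastname) OR nickname.

    Returns None when the resource conforms. Otherwise returns a report
    that lists, per alternative, which properties are missing.
    """
    present = properties_of(graph, resource)
    alternatives = [{":firstname", ":lastname"}, {":nickname"}]
    if any(alternative <= present for alternative in alternatives):
        return None
    return {
        "resource": resource,
        "constraint": "c_compound",
        "missing_per_alternative": [
            sorted(alternative - present) for alternative in alternatives
        ],
    }


report = check_compound(GRAPH, ":Bob")
print(report)
```

A report that stops at the `resource` and `constraint` fields corresponds to what is typically returned for compound constraints; the `missing_per_alternative` field is the kind of extra detail a refinement step would need.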

Problem 2 (P2): The number of found violations depends on the supported entailments. A mismatch between which entailments are supported during validation and which entailments are supported by the use case influences, e.g., whether formula (3) is inferred or not. Thus, either too many or too few violations can be returned [14]. This difference in number of found violations gives a biased idea of the real quality of the validated RDF graph.

Too many violations: formula (2) specifies the domain of :birthdate. Let us validate that “every resource in the RDF graph that has a birthdate, is a person” given formula (1). When the entailments of formula (2) are not supported, this would result in a violation: formula (3) is missing in the RDF graph. However, when the entailments of formula (2) are supported, we can infer formula (3), and no violation is returned.

Too few violations: Let us validate that “every person in the RDF graph adheres to constraint $c_{compound}$” given formula (1). Formula (3) is not explicitly stated and the entailments of formula (2) are not supported. No violations are found: :Bob is not explicitly classified as a :Person, thus :Bob is not targeted by $c_{compound}$. However, supporting those entailments can create new statements to be validated, and lead to new violations. For example, by inferring formula (3) using formula (2), :Bob is targeted by – and violates – $c_{compound}$. Such violations are not found in the original RDF graph, but discovered due to the supported entailments.
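Both cases can be reproduced with a small Python sketch that validates formula (1) once without and once with rdfs:domain entailment. The set-of-tuples graph representation and all function names are hypothetical and chosen only for this illustration.

```python
RDF_TYPE = "rdf:type"

# Formula (1) as data, formula (2) as the only axiom.
DATA = {
    (":Bob", ":firstname", "Bob"),
    (":Bob", ":birthdate", "1970-01-01"),
}
AXIOMS = {(":birthdate", "rdfs:domain", ":Person")}


def apply_domain_entailment(data, axioms):
    """Add `s rdf:type C` for every `s p o` where `p rdfs:domain C`."""
    domains = {(p, c) for p, pred, c in axioms if pred == "rdfs:domain"}
    inferred = {
        (s, RDF_TYPE, c) for s, p, _ in data for q, c in domains if p == q
    }
    return data | inferred


def birthdate_implies_person(graph):
    """Violations of: every resource with a birthdate is a :Person."""
    return sorted(
        s for s, p, _ in graph
        if p == ":birthdate" and (s, RDF_TYPE, ":Person") not in graph
    )


def persons_satisfy_compound(graph):
    """Violations of: every :Person has (firstname and lastname) or nickname."""
    violations = []
    for s, p, o in graph:
        if p == RDF_TYPE and o == ":Person":
            props = {q for t, q, _ in graph if t == s}
            if not ({":firstname", ":lastname"} <= props or ":nickname" in props):
                violations.append(s)
    return sorted(violations)


def validate(graph):
    return {
        "domain": birthdate_implies_person(graph),
        "compound": persons_satisfy_compound(graph),
    }


without_entailment = validate(DATA)
with_entailment = validate(apply_domain_entailment(DATA, AXIOMS))
print("without rdfs:domain entailment:", without_entailment)
print("with rdfs:domain entailment:   ", with_entailment)
```

Without the entailment, the domain constraint reports :Bob and the compound constraint reports nothing; with the entailment, the situation is reversed. Which of the two results is the accurate one depends on the entailments the use case supports.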

Customizing the set of inferencing steps during validation (e.g., whether rdfs:domain entailments are supported or not) makes it possible to match the entailments supported by the use case with those of the validation approach. However, support for customizable inferencing steps is limited. When a fixed set (or no set) of inferencing steps is supported, a separate reasoning process is needed to infer unsupported entailments, and edge cases handling this fixed set cannot be validated accurately. For example, let us look at the W3C recommended Shapes Constraint Language (SHACL): a language for validating RDF graphs against a set of constraints [56]. SHACL specifies a fixed set of inferencing steps during validation, namely, rdfs:subClassOf entailment when targeting resources of a certain class. Thus, one cannot validate, e.g., whether an RDF graph explicitly contains all triples that link resources to all their classes given a set of rdfs:subClassOf axioms, as rdfs:subClassOf triples are inferred by a conformant SHACL validator.² RDF graphs that do not contain all classifying triples will be valid according to SHACL validators; however, they are handled poorly by applications that do not support rdfs:subClassOf entailment.

² For a detailed example, please see https://idlabresearch.github.io/validatrr/blog/2019/09/shacl-subclassof.html.
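The edge case can be sketched in Python as a check for explicitly stated classifying triples. The class hierarchy (:Student, :Person, :Agent) and the helper names are invented for this example and do not come from the paper or from SHACL.

```python
# Hypothetical rdfs:subClassOf axioms as (subclass, superclass) pairs.
SUBCLASS = {(":Student", ":Person"), (":Person", ":Agent")}
DATA = {(":Bob", "rdf:type", ":Student")}


def superclasses(cls, subclass_axioms):
    """Return all superclasses of `cls`, following rdfs:subClassOf transitively."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for sub, sup in subclass_axioms:
            if sub == current and sup not in found:
                found.add(sup)
                todo.append(sup)
    return found


def missing_type_triples(data, subclass_axioms):
    """Classifying triples that the axioms imply but the graph does not state."""
    typed = {(s, o) for s, p, o in data if p == "rdf:type"}
    return sorted(
        (s, "rdf:type", sup)
        for s, cls in typed
        for sup in superclasses(cls, subclass_axioms)
        if (s, sup) not in typed
    )


def materialize(data, subclass_axioms):
    """What a validator with built-in subclass entailment effectively sees."""
    return data | set(missing_type_triples(data, subclass_axioms))


explicit_check = missing_type_triples(DATA, SUBCLASS)
after_entailment = missing_type_triples(materialize(DATA, SUBCLASS), SUBCLASS)
print(explicit_check)
print(after_entailment)
```

Once the entailment is always applied before the check, as in `materialize`, the list of missing triples is empty by construction, so the absence of explicit classifying triples can no longer be detected.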

Problem 3 (P3): Combining validation with a reasoning preprocessing step decreases performance. Entailments can be inferred by performing reasoning as a preprocessing step prior to validation [14], thus combining multiple systems. The resulting RDF graph then explicitly contains all supported entailments, given that the reasoner can be configured to only infer the entailments that are supported by the use case. The number of found violations is then accurate with respect to the use case (solving P2). However, this requires a sequence of independent systems. Thus, the preprocessing step possibly produces entailments not relevant for validation [14]. This independent generation of unnecessary entailments can decrease the performance compared to a single validation system. Moreover, due to this sequence of independent systems, finding the root causes involves investigating the results of both systems: the validator that detects violations, and the reasoner that infers entailments.

1.2. Hypotheses

To solve the aforementioned validation problems, we pose the following hypotheses.

Hypothesis 1. Root causes can be explained more accurately compared to existing validation approaches when using a logical framework that can be configured declaratively.

Hypothesis 2. A more accurate number of violations is found compared to existing validation approaches when supporting a custom set of inferencing steps.

Hypothesis 3. A validation approach supporting more accurate root cause explanations and a custom set of inferencing steps can support an equivalent number of constraint types compared to existing approaches.

Hypothesis 4. A validation approach supporting a custom set of inferencing steps is faster than an approach including the same inferencing as a preprocessing step.

1.3. Contributions

In this paper, we propose an approach for RDF graph validation that uses a rule-based reasoner as its underlying technology. Rule-based reasoners can generate a proof stating which rules were triggered for which returned violation. Thus, the root causes of violations can be accurately explained (solving P1).

A validation approach using rule-based reasoning natively supports the inclusion of a custom set of inferencing steps by adding custom rules. The supported entailments during validation can thus be matched to the entailments supported by the use case, and the validation returns an accurate number of found violations (solving P2).

Moreover, rule-based reasoners only need a single language to declare both the constraints and the set of inferencing rules, and only a single system to execute the validation. Compared to a combination of a reasoner and a validation system, this approach does not lead to the generation of entailments unnecessary to the validation step, making it potentially faster than including an inferencing preprocessing step (solving P3).

Our contributions are as follows.

An analysis of existing validation approaches and comparison to a rule-based reasoning approach.

A formal ground for using rule-based reasoning for validation.

An application of that formal ground by providing an implementation using N3Logic [11] to define the inferencing and validation rules, executed using the EYE reasoner [84], supporting general constraint types as described by Hartmann et al. [41,43].

An evaluation of our approach, positioning it within the state of the art by functionally validating the hypotheses and comparing the validation speed.

We validated that (a) the formal logical proof explains the root cause of a violation in more detail than the state of the art; (b) an accurate number of violations is returned by using a custom set of inferencing rules up to at least OWL-RL complexity and expressiveness; (c) the number of supported constraint types is equivalent to existing validation approaches; and (d) our implementation is faster than a combined system, and faster than an existing validation approach when RDF graphs are smaller than one hundred thousand triples.

The remainder of the paper is organized as follows. We start by giving an overview of the state of the art (Section 2), after which we position and compare rule-based reasoning as validation approach (Section 3). Then, we discuss the logical requirements (Section 4) and apply them to achieve a practical implementation (Section 5). Finally, we evaluate our proposed approach (Section 6) and summarize our conclusions (Section 7).

2. State of the art

In this work, we propose an alternative validation approach using rule-based reasoning. We first provide a background on validation and reasoning in Section 2.1. Then, we give an overview of existing validation approaches in Section 2.2, and of related vocabularies and ontologies in Section 2.3. We conclude with an overview of general constraint types in Section 2.4, which allows us to functionally compare validation approaches. Our categorization is derived from the general quality surveys of Zaveri et al. [86], Ellefi et al. [34], and Tomaszuk [81], and from the “Validating RDF Data” book [58]. The related works are extended with recent works published in, among others, the major Semantic Web conferences (ESWC and ISWC), and the major Semantic Web journals (Journal of Web Semantics and Semantic Web Journal).

2.1. Background

Validation. Data quality can be assessed by employing a set of data quality assessment metrics [12]. Quality assessment for the Semantic Web – and more specifically, for Linked Data – spans multiple dimensions, further categorized in accessibility, intrinsic, trust, dataset dynamicity, contextual, and representational dimensions [86]. Validating an RDF graph directly relates to intrinsic quality dimensions, as defined by Zaveri et al. [86]: (i) independent of the user’s context, and (ii) checking if information correctly and compactly represents the real world data and is logically consistent in itself, i.e., the graph’s adherence to a certain schema or shape. In this paper, we specifically focus on RDF graph validation, i.e., the intrinsic dimensions.

Validation of an RDF graph can be automated by using a set of test cases, each assessing a specific constraint [57]. Violations of those constraints are then indicated when a validation returns negative results. Validation is typically achieved following the Closed World Assumption (CWA): what is not known to be true must be false. For example, a validation assesses for a specific RDF graph if all objects linked via the predicate schema:birthdate are a valid xsd:date, or if all subjects and objects linked via the predicate foaf:knows are explicitly listed to be of type :Human. When this is not the case, negative results are returned, indicating violations.

Reasoning. Ontologies are prevalent in the Semantic Web community to represent the knowledge of a domain. Ontology languages are used to annotate asserted facts (axioms). Examples include RDF Schema (RDFS) [19] and the Web Ontology Language (OWL) [46]. Reasoning on top of these axioms is achieved, as the calculus of the used logic specifies a set of inferencing steps, inferring logical consequences (entailments) from these axioms [30]. Logics for the Semantic Web – given the open nature of the Web – typically follow the Open World Assumption (OWA): what is not known to be true is simply unknown.

Semantic Web reasoners are typically description logic-based reasoners supporting OWL-DL or subprofiles such as OWL-QL [61], or rule-based reasoners [66]. Description logic-based reasoners are typically optimized for specific description logics, such as KAON2³ for $\mathcal{SHIQ}$ and FaCT++⁴ for $\mathcal{SROIQ}$. Rule-based reasoners typically follow two types of inferencing algorithms: forward chaining and backward chaining [66]. Whereas forward chaining tries to infer as much new information as possible, backward chaining is goal-driven: the reasoner starts with a list of goals and tries to verify if there are statements and rules available that support any of these goals [66]. The employed rules define the logic followed by rule-based reasoners such as EYE [84] or cwm [9]. Whereas description logic-based reasoners have (optimized) inferencing steps for, e.g., rdfs:subClassOf and other RDFS or OWL constructs embedded, rule-based reasoners commonly rely on the general “implies” construct. Each rule specifies “A implies B”, where both the antecedent “A” and the consequence “B” can consist of statements [66]. Certain constructs such as rdfs:subClassOf can be translated into one or more rules.⁵

³ http://kaon2.semanticweb.org/

⁴ http://owl.cs.manchester.ac.uk/tools/fact/

⁵ http://eulersharp.sourceforge.net/#theories
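To make the “A implies B” construct concrete, the following Python sketch expresses rdfs:subClassOf as two generic rules and applies them by naive forward chaining. It is not how EYE or cwm are implemented; the tuple-based rule encoding, the `?x` variable convention, and the example class hierarchy are assumptions for illustration.

```python
def match(pattern, triple, bindings):
    """Unify one triple pattern with a triple; return extended bindings or None."""
    new = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new


def solve(patterns, facts, bindings):
    """Yield every variable binding that satisfies all patterns."""
    if not patterns:
        yield bindings
        return
    for fact in facts:
        extended = match(patterns[0], fact, bindings)
        if extended is not None:
            yield from solve(patterns[1:], facts, extended)


def substitute(pattern, bindings):
    return tuple(bindings.get(term, term) for term in pattern)


def forward_chain(facts, rules):
    """Apply all rules until no new triple can be inferred (fixpoint)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedent, consequent in rules:
            # Materialize the bindings first so the fact set is not
            # modified while it is being iterated.
            for bindings in list(solve(antecedent, facts, {})):
                new = substitute(consequent, bindings)
                if new not in facts:
                    facts.add(new)
                    changed = True
    return facts


# rdfs:subClassOf written as two "antecedent implies consequent" rules.
SUBCLASS_RULES = [
    ([("?x", "rdf:type", "?c"), ("?c", "rdfs:subClassOf", "?d")],
     ("?x", "rdf:type", "?d")),
    ([("?c", "rdfs:subClassOf", "?d"), ("?d", "rdfs:subClassOf", "?e")],
     ("?c", "rdfs:subClassOf", "?e")),
]

FACTS = {
    (":Student", "rdfs:subClassOf", ":Person"),
    (":Person", "rdfs:subClassOf", ":Agent"),
    (":Bob", "rdf:type", ":Student"),
}

closure = forward_chain(FACTS, SUBCLASS_RULES)
for triple in sorted(closure - FACTS):
    print(triple)
```

Removing a rule from `SUBCLASS_RULES` removes the corresponding entailment, which is the kind of customization that fixed, embedded inferencing steps do not offer.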

There is a clear distinction between ontologies and the constraint set for RDF graph validation: ontologies focus on the representation of a domain, whereas RDF graph validation checks whether the resources of that graph conform to a desired schema [58]. It is not required that the representation of a domain aligns with the schema for validation. However, they can complement each other. The usage of ontologies prescribes a set of inferencing steps, for example, the FOAF ontology declares the rdfs:range of the foaf:knows predicate as foaf:Person [20]. Whether or not these inferencing steps are taken into account during validation influences the number of found violations [14].

2.2. Validation approaches

In this section, we discuss RDF graph validation approaches. Tools and surveys that cover quality dimensions other than the intrinsic dimensions, such as accessibility or representational dimensions, are out of scope. We discuss the approaches roughly in chronological order: hard-coded, using integrity constraints, query-based, and using a high-level language. Except for hard-coded systems, these validation approaches propose or use some kind of declarative means to describe RDF graph constraints.

2.2.1. Hard-coded

Hard-coded systems are a black box where the business logic lies within the code base: the implementation embeds both description and validation of constraints. Hogan et al. analyzed common quality problems both for publishing and intrinsic quality dimensions [47], providing an initial set of best practices [48]. Efforts focus on a limited set of configurable settings (i.e., turning constraint rules on or off) [60].

2.2.2. Integrity constraints

For these validation approaches (so-called “logic-based approaches”), the axioms of vocabularies and ontologies used by the validated RDF graph are interpreted as integrity constraints [62,67,79]. For example, disjointness forces a description logic-based reasoner to throw an error, which is interpreted as a violation. To combine CWA typically assumed for validation with OWA assumed in ontology languages, alternative semantics for these ontology languages are proposed. The underlying technology used is a description logic-based reasoner or a SPARQL endpoint.

Description logic-based reasoner. Motik et al. [62] propose semantic redefinitions, where a certain subset of axioms is designated as constraints. To know which alternative semantics for OWL apply, constraints have to be marked as such. They propose to integrate their implementation with KAON2. Furthermore, custom integrity constraints for Wordnet have been verified using Protégé [63] with FaCT++ [23].

SPARQL endpoint. Tao et al. [79] propose using OWL expressions with Closed World assumption and a weak variant of Unique Name assumption to express integrity constraints. OWL semantics are redefined, without being explicitly stated as such during validation. They use SPARQL [1] for axioms described in RDF, RDFS, and OWL [79], e.g., using SPARQL property paths to simulate rdfs:subClassOf entailment. Tao et al. work in a general OWL setting, where their approach is sound but not complete. In an RDF setting the approach is both sound and complete, as there is only a single model that needs to be considered [67]. This implementation is incorporated into Stardog ICV [71]. Patel-Schneider separates validation into integrity constraints and Closed World recognition [67], showing that RDF and RDFS entailment can be implemented for both by translation to SPARQL queries.

2.2.3. Query-based

In query-based approaches, constraints are described and interpreted similar to SPARQL queries [43,68]: only RDF graphs whose structure is compatible with the defined structure are returned. These approaches use an embedded or external SPARQL endpoint as underlying technology.

CLAMS [35] is a system to discover and resolve Linked Data inconsistencies. They define a violation as a minimal set of triples that cannot coexist. The system identifies all violations by executing a SPARQL query set. Knublauch et al. propose the SPARQL Inference Notation (SPIN) [55]: a SPARQL-based rule and constraint language. The SPARQL query is described using RDF statements instead of using the original SPARQL syntax. Kontokostas et al. [57] propose Data Quality Test Patterns (DQTP): tuples of typed pattern variables and a SPARQL query template to declare test case patterns. The validation framework that validates these DQTPs is called RDFUnit. The DQTPs are transformed into SPARQL queries, where every SPARQL query is a test case. RDFUnit additionally allows automatically generated test cases, depending on the used schema.
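The idea of turning a pattern plus bindings into concrete SPARQL test cases can be sketched in Python as follows. The pattern text, the `$property` and `$cls` placeholders (Python `string.Template` syntax), and the bindings are hypothetical; they do not reproduce RDFUnit's actual pattern notation or pattern library.

```python
from string import Template

# Hypothetical "property domain" pattern: resources that use a property
# without being typed with the expected class are returned as violations.
PROPERTY_DOMAIN_PATTERN = Template(
    "SELECT DISTINCT ?resource WHERE {\n"
    "  ?resource <$property> ?value .\n"
    "  FILTER NOT EXISTS { ?resource a <$cls> }\n"
    "}"
)


def instantiate(pattern, bindings):
    """Turn one pattern plus one binding of its variables into a concrete query."""
    return pattern.substitute(bindings)


# Assumed bindings; a framework could also derive these from a schema.
BINDINGS = [
    {"property": "http://example.com/birthdate",
     "cls": "http://example.com/Person"},
    {"property": "http://xmlns.com/foaf/0.1/knows",
     "cls": "http://xmlns.com/foaf/0.1/Person"},
]

test_cases = [instantiate(PROPERTY_DOMAIN_PATTERN, b) for b in BINDINGS]
for query in test_cases:
    print(query, end="\n\n")
```

Each generated query is one test case; executing it against a SPARQL endpoint (not shown here) would return the violating resources as bindings of `?resource`.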

RDFUnit is also used to validate Linked Data generation rules in the RDF Mapping Language (RML) [32], by manually defining different DQTPs to target the generation description instead of the generated RDF graph [31]. This means the RDF graph can be validated before any data is generated, as the generation description reflects how the RDF graph will be formed.

2.2.4. High-level language

These approaches use a terse high-level language specifically designed to describe constraints for validation [58]. These languages are independent of underlying technologies, and alternative implementation strategies can be devised. We first discuss initial high-level languages, after which we discuss high-level languages with wide adoption from the community: ShEx and SHACL.

Description Set Profiles (DSP) [64] define a set of constraints using Description Templates, targeted specifically to Dublin Core Application Profiles, and implemented using SPIN [16]. Other high-level languages to describe constraints include OSLC Resource Shapes [76] – part of IBM Resource Shapes – and RDF Data Descriptions [36]. Luzzu [28] uses a custom declarative constraint language (Luzzu Quality Metric Language, LQML). Any metric that can be expressed in a SPARQL query can be defined using LQML. Moreover, quality dimensions other than the intrinsic dimensions are also expressible using LQML. Luzzu supports basic metrics and custom Java code, allowing users to implement custom metrics.

ShEx. Shape Expressions (ShEx) [73,74] is a structural schema language which can be used for RDF graph validation. The grammar of ShEx is inspired by Turtle and RelaxNG, its semantics are well-founded, and its complexity and expressiveness are formalized [13,78]. ShEx provides an extension point to handle advanced constraints via Semantic Actions, which allow evaluating a part of the validated RDF graph using a custom function.

SHACL. The Shapes Constraint Language (SHACL) is the W3C Recommendation for validating RDF graphs against a set of constraints [56]. The core of SHACL is independent of SPARQL, which promotes the development of new algorithms and approaches to validate RDF graphs [59]. Unlike ShEx, the original specification does not include a denotational semantics; however, the recent work of Corman et al. proposes a concise formal semantics for SHACL’s core constraint components, and a way of handling recursion in combination with negation [24]. Advanced features of SHACL include SHACL Rules (to derive inferred triples from the validated RDF graph) and SHACL Functions (to evaluate a part of the validated RDF graph using a custom function) [54].

2.3. Validation reports

Validation reports handle identification of which data quality dimensions are assessed in general, and the representation of violations in particular.

To identify data quality dimensions, Radulvic et al. extended the Dataset Quality Ontology (daQ) [29] to include all data quality dimensions as identified by Zaveri et al. [86], leading to the Data Quality Vocabulary [75]. This allows the comparison of data quality dimension coverage of different frameworks.

Table 1
Comparing the prominent validation approaches with rule-based reasoning, using factors explanation, time, customization, inferencing steps, and reasoning preprocessing. The time row indicates which approaches’ execution time is influenced due to the reasoning preprocessing using an asterisk. The asterisk in the inferencing steps row indicates that approaches based on integrity constraints cannot combine with a custom set of inferencing steps that overlaps with the integrity constraints, as their semantics are redefined

The violations report itself makes it possible to distribute and compare the violations found in an RDF graph, and can refer to the dimension specifications using the aforementioned general vocabularies. For example, the Quality Problem Report Ontology assembles detailed quality reports for all data quality dimensions [28]. The Reasoning Violations Ontology (RVO) is used to represent integrity constraint violations [18], and Kontokostas et al. [57] use the RDF Logging Ontology⁶ (RLOG) to describe RDFUnit’s violation results. Both ShEx and SHACL provide violation report descriptions, with means to specify the violating resources, using a ShapeMap [73] and a Focus node [56], respectively.

⁶ http://persistence.uni-leipzig.org/nlp2rdf/ontologies/rlog#

2.4. Constraint types

Hartmann né Bosch et al. identify eighty-one general constraint types [17,43]. These constraint types are an abstraction of specific constraints, independent of the constraint language used to describe them. A constraint type can be defined in different ways. For example, the property domain constraint type specifies that resources that use a specific property should be classified via a specific class, e.g., all resources using the :birthdate property that are not classified as a :Person are violating resources. Using RDFS [19], the property domain constraint type can be assessed by interpreting rdfs:domain as an integrity constraint. Using SHACL, this can be achieved by defining a sh:property with sh:class for a sh:targetSubjectsOf shape [56].

Moreover, Hartmann et al. provide a logical underpinning stating the requirements for a validation approach to support all constraint types [14]. For thirty-five out of eighty-one constraint types ($43.2\%$), reasoning (up to OWL-DL expressiveness) can improve the validation: without reasoning, either too many or too few violations can be returned.

3. Comparative analysis

Different types of validation approaches are proposed in the state of the art. The most prominent approaches are hard-coded, based on integrity constraints, query-based, and using high-level languages. In this section, we compare them with our proposed rule-based reasoning approach. Our analysis is summarized in Table 1.

We adapt the framework presented by Pauwels et al. [70], which introduces comparative factors of key implementation strategies for compliance checking applications. We adjust these factors with respect to the validation problems identified in Section 1.1. We generalize the factors time, customization, and inferencing steps, and introduce explanation and reasoning preprocessing as validation-specific factors.

Explanation. The explanation as to why a certain violation occurs (i.e., the root cause). The more specifically a validator can explain a violation, the easier it is to (automatically) refine the RDF graph and improve its quality. Existing approaches typically have the means to explain violations up to the level of which resource violates which constraint. Explanations of hard-coded approaches either need to be explicitly implemented, or are provided by inspecting the code base. When using integrity constraints, approaches exist for resolving inconsistencies. These approaches perform some sort of root cause analysis, but are usually targeted at refining the axioms of the ontologies themselves [39]. It is not a standard feature to produce proofs of the results of description logic-based reasoners [65]. In a query-based approach, the used SPARQL endpoint returns bindings [1]. In the case of validation, it returns the violating resources, without additional explanation. High-level languages can have mechanisms to additionally include the violating resources in the validation report. For example, ShEx and SHACL provide ShapeMaps [73] and Focus nodes [56], respectively. SHACL’s Focus nodes can further specify which predicate and object cause the violation, except for, e.g., compound constraints. Using rule-based reasoning allows the generation of a logical proof, as rule-based reasoning relies on a general “implies” construct to describe rules, and rule-based reasoners typically do not contain description logic optimizations. Such a logical proof declares which rules were triggered to arrive at a certain conclusion, giving a precise explanation for the root causes of constraint violations. Where existing approaches typically have the means to explain violations up to the level of which resource violates which shape, a logical proof can provide a more detailed explanation.

Time. The time needed to execute the validation: short versus long. Typically, specialized approaches allow for optimizations, making them faster than general approaches. Hard-coded is usually the fastest and needs the shortest processing time, followed by systems that use high-level languages: both can be optimized for validation tasks. The other approaches (using integrity constraints, query-based, and rule-based reasoning) are typically built using an underlying existing technology (description logic-based reasoners, SPARQL endpoints, and rule-based reasoners, respectively). They are not built (or optimized) for validation tasks. This makes them independent of the constraint language, but can also slow down the validation. The total execution time of validation approaches depends on whether a reasoning preprocessing step to include additional inferencing steps is required or not. Using rule-based reasoning is thus potentially slower than existing approaches; however, it does not require inclusion of reasoning preprocessing.

Customization. The extent of customization each type of approach enables. Typically, ease of customization is improved by using a declarative language. Customization of a hard-coded system requires development effort, as the business logic is embedded within the code. Other approaches rely on declarations to customize the validation. Declarations are decoupled, i.e., independent of the tool’s implementation. Thus, they can be shared and more easily customized to a certain use case. Description logic-based reasoners used to identify integrity constraints are typically optimized for description logics such as OWL-QL and OWL-DL. Customization is limited to the description logic that the reasoner is optimized for. Query-based approaches allow customization by defining additional SPARQL queries and registering custom functions [40]. Systems using high-level languages are customized using the declarations as specified by the used language. The adoption of ShEx and SHACL shows that these languages provide sufficient customization. The extension mechanisms of these languages, such as Semantic Actions [73] and SHACL Advanced Features [54], respectively, allow customizing the validation even further. Using rule-based reasoning allows customization by adding and removing rules. As opposed to existing approaches, users can customize both the constraint types and the set of inferencing steps within the same declarative language.

Inferencing steps. Whether the validation approach supports a (custom) set of inferencing steps. Hard-coded systems can support a fixed set of inferencing steps, but this set cannot be inspected or altered without investigating the code base. Approaches that use integrity constraints for validation propose alternative semantics of commonly agreed upon ontology languages to include, among others, some form of CWA [62,79]. This leads to ambiguity in the Semantic Web as an existing, globally agreed upon logic is changed [4]. It is not possible to combine such validation with a (custom) set of inferencing steps within a description logic: the same inferencing step has different semantics depending on whether it is used for validation or for inferring new statements. SPARQL endpoints used for query-based approaches can support up to OWL-RL reasoning [53], or support up to RDF and RDFS entailment via translation of the SPARQL queries using property paths [79]. High-level languages such as SHACL allow specifying the entailment regime used [38]: SHACL validators may operate on RDF graphs that include entailments using the sh:entailment property [56]. Furthermore, SHACL Rules [54] can be used to a certain extent to generate inferred statements during validation. By design, rule-based reasoning allows inclusion of a set of additional (custom) inferencing rules [66]. Whereas existing approaches mostly allow configuration to support, e.g., a specific entailment regime, the customization of the set of inferencing steps is more fine-grained for rule-based reasoners. This can increase complexity, but also allows catering the validation to use cases that depend on a specific set of inferencing steps. The importance of such use cases is evidenced by the fact that SHACL Rules is proposed as an advanced feature to the SHACL specification [54].

Reasoning preprocessing Existing approaches have no support for including a custom set of inferencing steps, propose alternative semantics, or allow only a specific entailment regime. By including a reasoning step as preprocessing step to these approaches (see Fig. 1.1), the entailments valid during validation can be matched with the entailments valid for the use case, even when that use case requires a custom set of inferencing steps [14]. First, a reasoner – optionally, hence the dashed line – infers all valid entailments of the original RDF graph (Fig. 1.1, Reasoner), taking into account the axioms of the relevant ontologies and vocabularies (Axioms). Then, the newly generated RDF graph (RDF graph*) is validated with respect to the specified constraints (Fig. 1.1, Validator).

Fig. 1.

The preprocessing approach: first (optionally, hence the dashed line), a reasoner is used to generate intermediate data (RDF graph*). That intermediate data is then the input data for the Validator. Using a rule-based reasoner only needs a single system and language to combine reasoning and validation.

By using a preprocessed inferred RDF graph, multiple systems (i.e., the reasoner and the validator) need to be combined, configured, and maintained. This separates concerns, but it also means that different languages may need to be learned and combined for specifying the inferencing steps and constraints. As these systems are not aligned, the reasoner could infer a large number of new triples that are irrelevant to the defined constraints, which could lead to poor scaling (Fig. 1.1, RDF graph*). Explaining violations is also hindered. Even when the reasoner can differentiate between the original triples and the inferred triples, finding the root causes involves investigating the output of both systems: the validator detecting the violations, and the reasoner inferring the supported entailments.

Reasoning preprocessing is not required when using rule-based reasoning. The set of inferencing steps and the set of constraints can be defined using the same declaration (Fig. 1.2, Inferencing rules and Constraints*), and executed simultaneously on the RDF graph and the axioms. Which statements need to be inferred can be optimized guided by the set of constraints, and only the output of a single system needs to be investigated to explain the found violations.

4. Logical requirements

In this section, we discuss the logical requirements needed for RDF graph validation, and argue for using a rule-based logic.

Constraint languages need to cope with different constraint types depending on users’ needs. Each constraint type implies logical requirements. The constraint types and the requirements they entail are investigated by Hartmann et al., claiming that Closed World Assumption (CWA) and Unique Name Assumption (UNA) are crucial for validation [14]. These requirements typically do not apply to logics for the Semantic Web, as data on the Web is decentralized, information is spread (“anyone can say anything about anything” [25]), and single resources can have multiple URIs. Instead, relevant logics such as OWL-DL assume OWA and in general non-Unique Name Assumption [61]. Hartmann et al. emphasize the difference between reasoning and validation, and favor query-based approaches for validation. When needed, query-based approaches can be combined with reasoning (e.g., OWL-DL or OWL-QL) as a preprocessing step.

However, in this section, we show how rule-based reasoning can be used for validation in a Semantic Web context, even though this reasoning typically does not follow CWA and UNA. Specifically, we state that the requirements for using rule-based reasoning are (i) supporting Scoped Negation as Failure (SNAF) [26,51,72] instead of CWA (Section 4.1), (ii) containing predicates to compare URIs and literals instead of supporting UNA (Section 4.2), and (iii) supporting expressive built-ins, as validation often deals with, e.g., string comparison and mathematical calculations (Section 4.3).

4.1. Scoped negation as failure

Existing works claim that CWA is needed to perform validation [14,67,79]. Given that most Web logics assume OWA, this would require semantic redefinitions to include inferencing during validation [62], which leads to ambiguity. However, as validation copes with the local knowledge base, and not the entire Web, we claim Scoped Negation as Failure (SNAF) is sufficient. This is an interpretation of logical negation: instead of stating that ρ does not hold (i.e., $\neg \rho$), it is stated that reasoning fails to infer ρ within a specific scope [26,51,72]. This scope needs to be explicitly stated. As such, SNAF keeps monotonicity.

To understand the idea behind Scoped Negation as Failure, let us validate the following RDF graph: $\begin{array}{ll} (4) & \texttt{:Kurt a :Researcher;} \\ (5) & \quad \texttt{:name "Kurt01".} \end{array}$ We validate the constraint “every individual which is declared as a researcher is also declared as a person”. This means a violation is returned when an individual is found during validation which is a researcher, but not a person: $\begin{array}{ll} (6) & \forall x : ((x \ \texttt{a :Researcher}) \land \neg (x \ \texttt{a :Person})) \to (\texttt{:constraint :is :violated.}) \end{array}$ As stated, this constraint cannot be tested with OWA: the knowledge base contains the triple of formula (4), but not of: $\begin{array}{ll} (7) & \texttt{:Kurt a :Person.} \end{array}$ The problem is more general: given the open nature of the Web, we cannot guarantee that there is no document in the entire Web which declares the triple of formula (7).

This changes if we take into account SNAF. Suppose that $K$ is the set of triples we can derive (either with or without reasoning) from our knowledge base of formulas (4) and (5). Having $K$ at our disposal, we can test: $\begin{array}{ll} (8) & \forall x : (((x \ \texttt{a :Researcher}) \in K) \land \neg ((x \ \texttt{a :Person}) \in K)) \to (\texttt{:constraint :is :violated.}) \end{array}$ The second conjunct is not a simple negation: it is a negation with a certain scope, in this case $K$. If we add new data to our knowledge base, e.g., the triple of formula (7), we would have a different knowledge base $K'$ for which other statements hold. The truth value of formula (8) would not change, since this formula explicitly mentions $K$. SNAF is what we actually need for validation: we do not validate the Web in general, we validate a specific RDF graph.
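The scoping in formula (8) can be illustrated with a minimal Python sketch. It is not part of Validatrr: triples are modeled as plain tuples, and the function name `researcher_not_person` is a hypothetical helper introduced only for this illustration. The negation is evaluated against an explicitly passed scope, so extending the knowledge base produces a new scope and leaves the result for $K$ untouched.

```python
# Triples are (subject, predicate, object) tuples; K is an explicit, closed scope
# holding the triples of formulas (4) and (5).
K = frozenset({
    (":Kurt", "a", ":Researcher"),
    (":Kurt", ":name", '"Kurt01"'),
})


def researcher_not_person(scope):
    """Return subjects typed :Researcher for which 'a :Person' is absent from `scope`."""
    researchers = {s for (s, p, o) in scope if p == "a" and o == ":Researcher"}
    # Scoped negation as failure: absence is only checked within `scope`.
    return sorted(s for s in researchers if (s, "a", ":Person") not in scope)


violations_K = researcher_not_person(K)

# Adding the triple of formula (7) yields a different scope K'.
K_prime = K | {(":Kurt", "a", ":Person")}
violations_K_prime = researcher_not_person(K_prime)

print(violations_K)
print(violations_K_prime)
# Re-evaluating against K gives the same answer as before: K itself never changed.
print(researcher_not_person(K) == violations_K)
```

Because `K` is immutable and always passed explicitly, the statement about `K` stays true after `K_prime` is built, which mirrors why SNAF keeps monotonicity.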

4.2. Predicates for name comparison

UNA is deemed required for validation [14], i.e., every resource taken into account can only have one single name (a single URI in our case) [52]. UNA is in general difficult to obtain for the Semantic Web and Web logics due to its distributed nature: different RDF graphs can – and actually do – use different names for the same individual or concept. For instance, the URI dbpedia:London refers to the same place in Britain as, e.g., dbpedia-nl:London. That fact is even stated in the corresponding datasets using the predicate owl:sameAs. The usage of owl:sameAs conflicts with UNA and influences validation [14].

Let us look into the following example. We assume dbo:capital is an owl:InverseFunctionalProperty. Our knowledge base contains: $\begin{array}{ll} (9) & \texttt{:Britain dbo:capital :London.} \\ (10) & \texttt{:England dbo:capital :London.} \end{array}$ Since both :Britain and :England have :London as their capital and dbo:capital is an inverse functional property, a description logic-based reasoner would derive that $\begin{array}{ll} (11) & \texttt{:Britain owl:sameAs :England.} \end{array}$ This influences the validation result. Such a derivation cannot be made if UNA holds, since UNA explicitly excludes this possibility.

The related constraint – defined as INVFUNC by Kontokostas et al. [57] – specifies that each resource should contain exactly one relationship via dbo:capital, i.e., the capital is different for every resource. The constraint INVFUNC is related to owl:InverseFunctionalProperty, but it is slightly different: while OWL’s inverse functional property refers to the resources that are in the domain of dbo:capital, the validation constraint INVFUNC refers to the representation of those resources. The RDF graph of formulas (9) and (10) thus violates the INVFUNC constraint. Even if our logic does not follow UNA, this violation can be detected if the logic offers predicates to compare the (string) representation of resources.

Fig. 2.

Components view of our approach. All double-snipped rectangles are rule sets, the single-snipped rectangles are RDF graphs or constraint declarations. The large overlapping rectangle is the rule-based reasoner. By taking all rule sets into account, the rule-based validator is formed. Four parts can be identified within the validation execution: (i) possibly guided by provided Axioms, all supported entailments of the given RDF graph can be generated using the Inferencing rules, resulting in RDF graph*; (ii) the general Constraints* are inferred from the given Constraints using a set of rules for Constraint translation; (iii) the rules for Validation generate Violations; and (iv) the returned Violations* are structured given a set of rules that specify the Report format.

4.3. Expressive built-ins

Validation often deals with, e.g., string comparison and mathematical calculations. These functionalities are widespread in rule-based logics in the form of built-in functions. While it typically depends on the designers of a logic which features are supported, there are also common standards. One of them is the Rule Interchange Format (RIF), whose aim is to provide a formalism to exchange rules in the Web [50]. Being the result of a W3C working group consisting of developers and users of different rule-based languages, RIF can also be understood as a reference for the most common features rule-based logics might have.

Let us take a closer look at the comparison of URIs from the previous section. func:compare can be used to compare two strings. This function takes two string values as input, and returns -1 if the first string is smaller than the second one regarding a string order, 0 if the two strings are the same, and 1 if the second is smaller than the first. The example above gives: $\begin{array}{ll} (12) & \texttt{("http://example.com/Britain" "http://example.com/England") func:compare -1.} \end{array}$ To refer to a URI value, RIF provides the predicate pred:iri-string, which converts a URI to a string and vice versa. To enable a rule to detect whether the two URI names are equal or not, an additional function is needed: the reasoner has to detect whether the comparison’s result is different from zero. That can be checked using the predicate pred:numeric-not-equal, which is the RIF version of ≠ for numerical values. In the example, the comparison would be true since $-1 \neq 0$. Using these RIF built-ins, a reasoner can check the name equality between :Britain and :England, and return a violation. Whether a rule-based Web logic is suited for validation highly depends on its built-ins. If it supports all RIF predicates, this can be seen as a strong indication that it is expressive enough.
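The name comparison described above can be sketched in Python. This is only an illustration under stated assumptions: `compare` mimics the three-way result described for func:compare using Python's code-point string order (which may differ from the collation a RIF implementation applies), and `names_differ` and `invfunc_violations` are hypothetical helper names that do not appear in RIF or in Validatrr.

```python
def compare(a: str, b: str) -> int:
    """Three-way string comparison: -1, 0, or 1 (code-point order, an assumption)."""
    return (a > b) - (a < b)


def names_differ(uri_a: str, uri_b: str) -> bool:
    """Stand-in for applying pred:numeric-not-equal to the comparison result and 0."""
    return compare(uri_a, uri_b) != 0


def invfunc_violations(triples, prop):
    """Pairs of differently named subjects that share the same object via `prop`."""
    by_object = {}
    for s, p, o in triples:
        if p == prop:
            by_object.setdefault(o, []).append(s)
    violations = []
    for o, subjects in sorted(by_object.items()):
        for i, a in enumerate(subjects):
            for b in subjects[i + 1:]:
                # Only the string representation of the names is compared (no UNA needed).
                if names_differ(a, b):
                    violations.append((a, b, o))
    return violations


# The RDF graph of formulas (9) and (10), written with full example.com URIs.
graph = [
    ("http://example.com/Britain", "dbo:capital", "http://example.com/London"),
    ("http://example.com/England", "dbo:capital", "http://example.com/London"),
]

print(compare("http://example.com/Britain", "http://example.com/England"))
print(invfunc_violations(graph, "dbo:capital"))
```

The check never asks whether the two resources denote the same individual; it only compares their names, which is why it works without assuming UNA.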

5. Application

In this section, we present our approach that uses rule-based reasoning for validation. We discuss the different components and the workflow in Section 5.1, the underlying technologies in Section 5.2, and implementation in Section 5.3. We end with an example using rules in Section 5.4.

5.1. Customizable validation

Our validator consists of multiple components that can be configured by adjusting the different rule sets (Fig. 2). The execution is primarily handled using the rule-based reasoner as underlying technology.

The set of Inferencing rules specifies the supported entailments during validation. This set can either be a predefined set to support, e.g., RDFS entailment [19], or can be fully customized. Optionally, the relevant axioms are provided during validation. As such, the entailments supported by the use case can be matched during validation.

The set of rules forming the Constraint translation allows our validator to infer the general constraint types – common across existing constraint languages [17,43] – from specific constraint descriptions. It can thus infer these types from the constraints described in a specific language such as SHACL [56]. The general constraint types are described using RDF-CV, which generalizes the constraint types into a coherent structure [17]. The purpose of RDF-CV is not to invent a new constraint language: it is a concise ontology which is deemed universal enough to describe constraints expressible by other constraint languages such as SHACL.7

⁷
For a detailed description of RDF-CV, we refer to the original papers [15,17], or the source: https://github.com/boschthomas/RDF-Constraints-Vocabulary.

Our rule-based validator is thus constraint language-independent.

The set of rules forming the Validation allows our validator to infer violations on the RDF graph with all supported entailments, based on the general constraint types. This set of rules specifies how to detect each constraint type.

The set of rules forming the Report allows our validator to infer the resulting violations in the required format. This set can be adapted to, e.g., the SHACL report format [56].

As a result, this declarative approach is decoupled from ontology language, constraint language, and report format. When no additional rule sets are included (i.e., only the Validation rule set is used), this validator does not infer any entailments, only validates constraints described using RDF-CV, and returns a report in a format based on RDF-CV.

All rule sets and input data are taken into account during a single reasoner execution. As opposed to using a reasoning preprocessing step, the inferred entailments can be geared towards the specified constraints (when making use of a backward chaining reasoner), and no unnecessary entailments are produced. For example, when an axiom specifies the range of a certain path, but no constraints are related to that path, this range might not need to be inferred. Moreover, as only a single system is involved, finding the root cause does not require investigation of multiple systems: the logical proof contains the complete overview of which rules were used to generate which entailments and which violations.

5.2. Used technologies

The most important technological considerations are the rule-based web logic and reasoner in accordance with that logic.

Rule-based web logic Rule-based web logics include the Semantic Web Rule Language (SWRL) [49], the Datalog+/− framework [22] and N3Logic [11].8

⁸
For a more thorough discussion of relevant rule languages, we refer to Section 3.2 of [83].

We use N3Logic as it fulfills all requirements: SWRL does not support the logical requirement SNAF,9

⁹

https://github.com/protegeproject/swrlapi/wiki/SWRLLanguageFAQ#Does_SWRL_support_Negation_As_Failure

and the Datalog+/− framework does not support production of logical proofs. N3Logic is being actively supported and used, as evidenced by recent papers and patents [27,69,82,85], and by the recently founded W3C Notation 3 (N3) Community Group fostering development, implementation, and standardization.10

¹⁰

https://www.w3.org/community/n3-dev/

We follow the formalized semantics of N3Logic [83] as implemented in the EYE reasoner [5]: a clear formal definition of Notation3’s semantics was missing from its initial proposal [11]. Verborgh et al. formalised the basics of the model theory of a logic with similar properties to N3Logic, excluding the constructs which lead to different interpretations (mainly nested implicit quantification) [83]. This work also proves the correctness of the calculus N3 reasoners use. Thus, the results of the reasoners are correct if the defined model theory is followed. Arndt et al. expanded on this work, specifically investigating the excluded constructs, and defined two different mappings from N3 syntax to core logic syntax covering two possible interpretations of N3Logic [5]. Even though this work defines two possible semantics, the difference between them does not influence the use of N3 in our paper: the semantic differences are only relevant for deeply nested formulas, and our formulas are not of that nature (see Listing 3).

Moreover, N3Logic supports at least OWL-RL inferencing [2,3], which can be included during validation: the rules for OWL-RL are specified11

¹¹

https://www.w3.org/TR/owl2-profiles/#Reasoning_in_OWL_2_RL_and_RDF_Graphs_using_Rules

and are supported by every rule language that is at least as expressive as Datalog. This includes N3Logic: the concrete realisation of these rules in N3 can be found online.12

¹²

http://eulersharp.sourceforge.net/2003/03swap/eye-owl2.html

N3Logic, among others, covers existential rules, thus typically rendering the logic undecidable. This brings three trade-offs. First, we note that decidability does not imply that reasoning times are acceptable: even decidable logics can result in reasoner time-outs. Second, we expect the validation rules to be used in a distributed context (the Web). Thus, even though relevant research investigates the maximal subset of existential rules which are still decidable [6,21,80], we have no control over all potential rules used together with our validation rules, and cannot use these well-studied mechanisms ensuring a set of existential rules to be decidable. These mechanisms need to consider all rules together. Third, for example, the logic framework Prolog is a widely used Turing complete programming language. Even though this is a desirable property for a programming language, making it very expressive, checking properties over a Turing complete language is undecidable. Prolog remains a popular choice: we can conclude that using this undecidable logic allows for expressiveness, without necessarily introducing a performance bottleneck.

The rule language introduced together with N3Logic is N3 [10,11]. Everything covered by RDF 1.1 Semantics [44] is covered in N3. Syntactically, it is a superset of Turtle [7]. N3 allows declaring inferencing rules, axioms, and constraints in the same language. As in RDF, blank nodes are understood as existentially quantified variables, and the co-occurrence of two triples, as in the RDF graph of formulas (9) and (10), is understood as their conjunction. Moreover, N3 supports universally quantified variables, indicated by a leading question mark ?. $\begin{array}{ll} (13) & \texttt{?x :likes :IceCream.} \end{array}$ stands for “Everyone likes ice cream.”, or in first-order logic $\begin{array}{ll} (14) & \forall x : \text{likes}(x, \text{ice-cream}). \end{array}$ Rules are written using curly brackets { } and the implication symbol =>. An rdfs:subClassOf relation such as :Researcher rdfs:subClassOf :Person can be expressed as: $\begin{array}{ll} (15) & \texttt{\{?x a :Researcher\} => \{?x a :Person\}.} \end{array}$ The general rdfs:subClassOf relation can be expressed as: $\begin{array}{ll} (16) & \texttt{\{?C rdfs:subClassOf ?D. ?X a ?C\} => \{?X a ?D\}.} \end{array}$
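To make the effect of the general rule of formula (16) concrete, the following Python sketch applies it by naive forward chaining until no new triples appear. It is an illustration only, not how an N3 reasoner such as EYE evaluates rules; the function name `apply_subclass_rule` and the class `:Agent` are assumptions added for the example.

```python
def apply_subclass_rule(triples):
    """Apply {?C rdfs:subClassOf ?D. ?X a ?C} => {?X a ?D} until a fixpoint is reached."""
    closure = set(triples)
    while True:
        subclass = {(c, d) for (c, p, d) in closure if p == "rdfs:subClassOf"}
        inferred = {
            (x, "a", d)
            for (x, p, c) in closure if p == "a"
            for (c2, d) in subclass if c2 == c
        }
        if inferred <= closure:
            # Nothing new was derived, so the closure is complete.
            return closure
        closure |= inferred


graph = {
    (":Kurt", "a", ":Researcher"),
    (":Researcher", "rdfs:subClassOf", ":Person"),
    (":Person", "rdfs:subClassOf", ":Agent"),  # hypothetical extra axiom
}

closure = apply_subclass_rule(graph)
print(sorted(t for t in closure - graph))
```

The second iteration derives `:Kurt a :Agent` from the triple inferred in the first iteration, which is why the rule is applied repeatedly rather than once.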

Reasoner Reasoners that support N3Logic include FuXi, cwm, and EYE. FuXi13

¹³

http://code.google.com/p/fuxi/

is a forward chaining production system for N3 whose reasoning is based on the Rete algorithm [37]. The forward chaining cwm [9] reasoner is a general-purpose data processing tool which can be used for querying, checking, transforming and altering information. EYE14

¹⁴

https://github.com/josd/eye

[84] is a high-performance reasoner written in Prolog, enhanced with Euler path detection, allowing the creator of the rules to decide when to do forward reasoning and when backwards. EYE has generous support for built-in functions,15

¹⁵

http://eulersharp.sourceforge.net/2003/03swap/eye-builtins.html

among which, the RIF functions.

We choose the EYE reasoner as it fulfills the requirements as presented in Section 4. Furthermore, its ability to combine forward and backward chaining proves especially useful since constraint types are mostly localized to single relationships [14]. This means backward chaining has a potentially large impact on the performance: reasoning during validation can be very targeted, and in most cases, only facts that are relevant to the defined constraints are inferred.

5.3. Implementation

Our implementation is dubbed “Validatrr”: a validator using rule-based reasoning. A Node.js JavaScript framework was created to discover and retrieve the vocabularies and ontologies as required by the use case, manage the command-line arguments, etc. The implementation is available at https://github.com/IDLabResearch/validatrr, and the set of validation rules (Fig. 2, center) is available at https://github.com/IDLabResearch/data-validation.

5.4. Execution example

As an example, we validate an RDF graph with a custom set of inferencing steps using SHACL constraints. We take into account the example of the introduction (formula (1)), but the case where :Bob has two birthdates defined. The implications of rdfs:domain (formula (2)) should be taken into account as defined in RDFS [19] during validation, and the SHACL constraint states that each person should have exactly one birthdate (Listing 1). The result should be in the SHACL validation report format. Using this example, we can detail every step as shown in Fig. 2: the RDF graph with all supported entailments (RDF graph*) and general constraint types (Constraints*) are inferred using a (custom) set of inferencing rules (Inferencing rules) and constraint translation rules (Constraint translation), after which the validation occurs (Validation), and the resulting violations are translated via rules (Report) in a specific report format (Violations*).

Listing 1.

Person shape in SHACL

To make sure rdfs:domain is correctly interpreted during validation, we include additional inferencing rules16

¹⁶

http://eulersharp.sourceforge.net/2003/03swap/rdfs-domain.html

(Inferencing rules), described in N3 as

$\begin{array}{ll} (17) & \texttt{\{?P rdfs:domain ?C. ?X ?P ?Y\} => \{?X a ?C\}.} \end{array}$

Given formula (17), it is inferred that :Bob is a person (RDF graph*).

To make sure SHACL constraints are correctly interpreted, SHACL translation rules need to be included during validation (Constraint translation). The general “Exact Qualified Cardinality Restrictions” RDF-CV constraint is inferred from the SHACL constraint of Listing 1, using the rules of Listing 2 (Constraints*).

Listing 2.

Translate the SHACL shape to a general constraint type

Validation makes use of general rules, i.e., Listing 3 (Validation). Lines 11–14 define how to find a violation, relying on built-ins: gather a set of resources in a list (e:findall), calculate the length of that list (e:length), and mathematically compare numbers (math:notEqualTo). For all objects of a certain class or datatype related using predicate ?p (in this case :birthdate) where the number of objects is different from the constraint value ?v (in this case 1), a violation is returned (lines 16–21).

Listing 3.

Validate using general constraint types
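The combination of the rdfs:domain rule of formula (17) with an exact cardinality check can be sketched in Python as follows. This is a simplified illustration, not the N3 rules of Listing 3: the comments only indicate which step loosely corresponds to the built-ins e:findall, e:length, and math:notEqualTo, and the function names, the two birthdate literals, and the assumption that :birthdate has domain :Person are introduced for this example.

```python
def infer_domain(triples):
    """Formula (17): {?P rdfs:domain ?C. ?X ?P ?Y} => {?X a ?C}."""
    domains = {(p, c) for (p, q, c) in triples if q == "rdfs:domain"}
    inferred = {
        (x, "a", c)
        for (x, p, _y) in triples
        for (p2, c) in domains if p2 == p
    }
    return set(triples) | inferred


def exact_cardinality_violations(triples, cls, prop, expected):
    """Report members of `cls` whose number of `prop` values differs from `expected`."""
    violations = []
    members = sorted({s for (s, p, o) in triples if p == "a" and o == cls})
    for s in members:
        values = [o for (s2, p, o) in triples if s2 == s and p == prop]  # cf. e:findall
        count = len(values)                                              # cf. e:length
        if count != expected:                                            # cf. math:notEqualTo
            violations.append({"focus": s, "path": prop, "found": count, "expected": expected})
    return violations


graph = {
    (":Bob", ":birthdate", '"1970-01-01"'),  # hypothetical literal
    (":Bob", ":birthdate", '"1971-01-01"'),  # hypothetical literal
    (":birthdate", "rdfs:domain", ":Person"),  # assumed axiom
}

without_inference = exact_cardinality_violations(graph, ":Person", ":birthdate", 1)
with_inference = exact_cardinality_violations(infer_domain(graph), ":Person", ":birthdate", 1)

print(without_inference)
print(with_inference)
```

Without the domain rule, :Bob is not known to be a person and the constraint has no target; with the rule included, the violation is found.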

The general violations are translated into a report format (Fig. 2, Violations*), e.g., using the SHACL Validation Report [56] (see Listing 4). The result is a set of triples using the exact same input and output as a SHACL processor. However, the RDF graph’s supported entailments can be matched to the use case, and the process is a single reasoning execution with transparent rule sets.

Listing 4.

Translate the general violations to the SHACL validation report

Moreover, different constraint descriptions are easily supported via the general constraint types. Given the OWL restriction of Listing 5, a different set of rules translates this restriction into the same constraint type (Listing 6). The validation process then continues in exactly the same way.

Listing 5.

An OWL restriction

Listing 6.

Translate the OWL restriction to the general constraint type

6. Hypothesis validation

To validate the hypotheses of Section 1.2, we compare Validatrr to different validation approaches. We show that Validatrr (i) accurately explains the root cause of why a violation occurs in more cases than specified in SHACL, given the SHACL core constraint components (accepting Hypothesis 1, see Section 6.1); (ii) returns an accurate number of validation results with respect to the used set of inferencing steps, compared to an integrity constraints validator with a fixed set of inferencing steps using RDFUnit (accepting Hypothesis 2, see Section 6.2); and (iii) supports an equivalent number of constraint types compared to existing approaches (accepting Hypothesis 3, see Section 6.3). The performance evaluation shows that our implementation is faster than the state of the art when combining inferencing and validation for commonly published datasets (accepting Hypothesis 4, see Section 6.4).

6.1. Root cause explanation of constraint violations

Using the logical proof, we increase the explanation’s accuracy compared to what is currently expected of a validation approach. SHACL is a W3C Recommendation standardizing the description of constraints and violation reports for RDF graph validation. We show that the logical proof produced by the rule-based reasoning execution provides more detailed root cause explanations of constraint violations, compared to SHACL’s violation report description.

The SHACL recommendation provides a set of test cases, enabling implementations to prove compliance.17

¹⁷
https://github.com/w3c/data-shapes/tree/gh-pages/data-shapes-test-suite/tests

The validation report denotes the violating resources via sh:focusNode, and in some cases can further specify the violating path via sh:resultPath and the violating value via sh:value [56]. However, it is not always possible to retrieve such additional information about the root cause. We revisit the previous example constraint that, given a resource r, this resource has $(r_{firstname} \land r_{lastname}) \lor (r_{nickname})$.18

¹⁸

This example is similar to the following SHACL test case: https://github.com/w3c/data-shapes/blob/gh-pages/data-shapes-test-suite/tests/core/node/or-001.ttl.

Validation of formula (1) using a conforming SHACL implementation results in a validation report similar to Listing 7. The validation report does not provide any further details to explain why :Bob is invalid.19

¹⁹

https://www.w3.org/TR/shacl/#validator-OrConstraintComponent

Listing 7.

Validation report of an OR constraint

The rule-based reasoning execution of Validatrr can generate a proof, showing the rules used to reach a conclusion. This logical proof allows determining, for each violation, which part of the RDF graph is the root cause of the violation, and which axiom of the used ontology triggered an inference causing the violation. Listing 8 shows the part of the proof which contains the rules deriving the violation. For :firstname, :lastname, and :nickname, we query objects that are linked using the respective predicate (Listing 8, lines 12–15, 18–21, and 24–27). $K$ is the scope of our knowledge base, in which we look for violations. We count the number of objects found and compare them with the needed number. For :firstname, one linked object is found (Listing 8, lines 16–17). However, no linked object is found for :lastname or :nickname (Listing 8, lines 22–23 and 28–29), so a violation is returned.

Listing 8.

Validation proof of an OR constraint
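The kind of evidence such a proof exposes can be approximated with a small Python sketch. It does not reproduce the proof format of the reasoner: it only records, per predicate, how many linked objects were found in the scope, so that the failing branches of the disjunction become visible. The function names and the literal `"Bob"` are assumptions made for this illustration.

```python
def count_values(scope, subject, predicate):
    """Count the objects linked to `subject` via `predicate` within `scope`."""
    return sum(1 for (s, p, _o) in scope if s == subject and p == predicate)


def check_name_constraint(scope, subject):
    """Check (firstname AND lastname) OR nickname, keeping the counts as evidence."""
    predicates = (":firstname", ":lastname", ":nickname")
    counts = {p: count_values(scope, subject, p) for p in predicates}
    full_name = counts[":firstname"] >= 1 and counts[":lastname"] >= 1
    nickname = counts[":nickname"] >= 1
    return {
        "focus": subject,
        "conforms": full_name or nickname,
        "counts": counts,
        # Predicates without any linked object explain why each branch failed.
        "missing": sorted(p for p, n in counts.items() if n == 0),
    }


# Scope K: :Bob only has a first name (the literal value is hypothetical).
K = {(":Bob", ":firstname", '"Bob"')}

result = check_name_constraint(K, ":Bob")
print(result)
```

A report that only lists the focus node corresponds to the `focus` and `conforms` fields; the `counts` and `missing` fields carry the additional root cause information discussed above.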

Due to this proof, Validatrr can provide detailed explanations for the root causes of violations for all SHACL core constraint components, whereas SHACL-conforming implementations can do so for 46%–75% of these components. Analysis of the SHACL specification shows that, out of the 28 core constraint components, 13 (46%) provide a full explanation of the root cause (summarized in Table 2). For eight of the remaining components (an additional 29%), the validation report returns which resource violates which constraint, but does not return a detailed explanation. For example, a sh:class violation occurs when the targeted node is a literal, or when the targeted node is not classified accordingly, but this disjunction is not reflected in the validation report. For the remaining seven components, the validation report does not provide an explanation at all. For example, violations of nested shapes are not reflected in the validation report, only violations of top-level shapes.

Table 2

Analysis of root cause explanation of violations for SHACL core constraint components. Validatrr can provide more detailed explanations for up to 54% of the components compared to SHACL-conforming implementations

Compared to SHACL-conforming implementations, Validatrr supports, among others, explanation of disjunction and nested shapes. Our approach provides detailed explanations for all core components of W3C’s recommended high-level language to describe constraints. We thus accept Hypothesis 1.

6.2. Accurate number of found violations

Validatrr finds a more accurate number of violations compared to the state of the art. To prove this, we first compare Validatrr with the state of the art functionally, and then include a set of inferencing steps to clarify the difference.

Specifically, we compare with RDFUnit [57]. Hartmann et al. explicitly proposed using query-based approaches for validation [15], and RDFUnit is such a query-based approach, relying on a SPARQL endpoint, and describing the constraints using SPARQL templates named Data Quality Test Patterns (DQTP). As such, RDFUnit is highly configurable and one of the implementations that supports SHACL.20

²⁰
https://w3c.github.io/data-shapes/data-shapes-test-suite/

Functional comparison We compare with the original pattern library of RDFUnit [57]. This pattern library is the closest to the constraint types as introduced by Hartmann et al. [17,43]: the mapping between those two is presented in previous work [4]. We test all unit tests defined by RDFUnit21

²¹

https://github.com/AKSW/RDFUnit/tree/master/rdfunit-core/src/test/resources/org/aksw/rdfunit/validate/data

after retrieving them as-is from the RDFUnit repository. As Validatrr validates general constraint types, a custom profile was created that translates the RDFUnit patterns to general constraint types. For a detailed explanation of the different test cases, we refer to the original RDFUnit paper [57].

The validation results depend on the used set of inferencing steps. RDFUnit implicitly takes “every resource is an rdfs:Resource” and the rdfs:subClassOf construct into account, forming the custom set of inferencing steps υ. We compare RDFUnit with Validatrr using three sets of inferencing steps, taking into account (i) no entailment at all (∅), (ii) the custom set of inferencing steps (υ), and (iii) full RDFS entailment (ρ).

Table 3 summarizes the results. For each constraint, we mention the test case’s name, the number of violations that RDFUnit detects, and the number of violations that Validatrr detects using the different sets of inferencing steps. The table shows the impact of using different sets of inferencing steps: depending on the set, Validatrr finds a different number of violations. Moreover, Validatrr detects more violations using the same set of inferencing steps: there is a higher number of found violations for Validatrr under υ compared to RDFUnit.

Table 3

Comparing RDFUnit to Validatrr using different sets of inferencing steps (∅, υ, and ρ). Validatrr finds more violations given the same set of inferencing steps, and the set of inferencing steps used impacts the result. Test cases where Validatrr outperforms RDFUnit are starred. Rows where Validatrr and RDFUnit differ are marked gray

Test case	Number of found violations
	RDFUnit (υ)	Validatrr (∅)	Validatrr (υ)	Validatrr (ρ)
invfunc_correct	0	0	0	0
INVFUNC_wrong	2	0	2	2
owlcardt_correct	0	0	0	0
owlcardt_wrong_exact	6	6	6	6
owlcardt_wrong_max	2	2	2	2
owlcardt_wrong_min	2	2	2	2
owldisjc_correct	0	0	0	2
OWLDISJC_wrong	6	2	6	6
owlqcardt_correct	0	0	0	0
owlqcardt_wrong_exact	6	6	6	6
owlqcardt_wrong_max	2	2	2	2
owlqcardt_wrong_min	2	2	2	2
rdflangstring_correct	0	0	0	0
RDFLANGSTRING_wrong	2	2	2	0
RDFSRANGE-MISS_wrong*	1	3	3	0
rdfsranged_correct	0	0	0	0
RDFSRANGED_wrong*	2	3	3	0
RDFSRANGE_correct*	0	5	4	0
RDFSRANGE_wrong*	1	3	3	3
rdfsrang_lit_correct	0	0	0	0
RDFSRANG_LIT_wrong	3	3	3	1

Validatrr finds more violations and supports more constraint types than RDFUnit, denoted as starred test cases rdfsrange-miss_wrong, rdfsranged_wrong, rdfsrange_correct, and rdfsrange_wrong. RDFUnit does not yet support the constraint type multiple ranges: when a certain predicate is used, each resource linked as an object to that predicate should be classified into multiple classes. In all other cases, both solutions identify the same number of violations when using the same set of inferencing steps. Validatrr thus functionally outperforms the pattern library (i.e., the corresponding constraint types) of RDFUnit.

Impact of including sets of inferencing steps during validation Running Validatrr using different sets of inferencing steps impacts the number of found violations. Validatrr is designed to easily configure this set using inferencing rules (Fig. 2, top-left). The results are found in Table 3, comparing the different Validatrr columns. On the one hand, certain violations are not found without entailment (∅), as is the case for invfunc_wrong and owldisjc_wrong. On the other hand, violations are resolved early-on when including RDFS entailment (ρ), as is the case for rdflangstring_wrong.

Compared to existing validation approaches, our approach allows including custom sets of inferencing steps during validation. The inferencing provenance is retained in the proof, as all inferencing occurs during a single reasoning execution. The logical proof can thus distinguish between violations that are caused due to constraint violations in the original RDF graph, or due to entailment during validation. We thus accept Hypothesis 2.
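The distinction between violations in the original RDF graph and violations caused by entailment can be illustrated with a small Python sketch. It is only an analogy for what a logical proof records, not the proof format of the EYE reasoner: the provenance dictionary, the disjointness check, and the example triples are hypothetical.

```python
# Hypothetical sketch: track whether each triple was asserted or inferred.
TYPE, SUBCLASS = "rdf:type", "rdfs:subClassOf"

def infer_with_provenance(graph):
    """Return {triple: derivation}; the derivation is None for asserted triples."""
    prov = {t: None for t in graph}
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(prov):
            if p != TYPE:
                continue
            for (c, q, d) in list(prov):
                derived = (s, TYPE, d)
                if q == SUBCLASS and c == o and derived not in prov:
                    # Record the rule and its premises, as a proof step would.
                    prov[derived] = (SUBCLASS, [(s, p, o), (c, q, d)])
                    changed = True
    return prov

def disjointness_violations(prov, class_a, class_b):
    """Report resources typed with both classes, and where the violation stems from."""
    report = []
    for s in sorted({s for (s, p, _) in prov if p == TYPE}):
        a, b = (s, TYPE, class_a), (s, TYPE, class_b)
        if a in prov and b in prov:
            inferred = prov[a] is not None or prov[b] is not None
            report.append((s, "entailment" if inferred else "original graph"))
    return report

GRAPH = {
    ("ex:x", TYPE, "ex:Cat"), ("ex:x", TYPE, "ex:Dog"),    # both types asserted
    ("ex:y", TYPE, "ex:Cat"), ("ex:y", TYPE, "ex:Puppy"),  # ex:Dog only via subclass
    ("ex:Puppy", SUBCLASS, "ex:Dog"),
}

report = disjointness_violations(infer_with_provenance(GRAPH), "ex:Cat", "ex:Dog")
print(report)
```

Because inference and validation share one data structure here, each violation can be traced back to either asserted or inferred triples, which a separate preprocessing step would not retain.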

6.3. Equivalent number of constraint types

Validatrr can support an equivalent number of constraint types compared to existing validation approaches such as RDFUnit and SHACL. In the previous section, we showed we functionally outperform the original pattern library of RDFUnit whilst including a custom set of inferencing steps during validation. In this section, we compare our number of supported constraint types to that of SHACL [56].

We test Validatrr against general constraint types [42,43], to show that the number of supported constraint types is equivalent to SHACL. We do not test specifically against SHACL’s test cases, as Validatrr is independent of the constraint language. We provide a set of test cases, used to test these different constraint types.²²

²² https://github.com/IDLabResearch/data-validation

Hartmann et al. investigated the constraint type support of SHACL, and stated that its coverage is 52% [42]. We updated the coverage report as presented by Hartmann et al. to take the latest SHACL specification and advanced features into account [54,56]. The relevant data is available at Section x, and online.²³ This updated report shows that SHACL’s constraint type coverage is 84%.

²³ https://github.com/IDLabResearch/constraint-types-coverage

Validatrr can cover up to 94% of all constraint types – given the current expressive support for built-ins – and has been tested to cover a similar number of constraint types as SHACL.²⁴ After including the rules for the remaining constraint types, we support an equivalent number of constraint types compared to SHACL. We thus accept Hypothesis 3.

²⁴ The test report is available at https://github.com/IDLabResearch/validatrr/blob/v0.2.0/reports/validatrr-rdfcv-earl.ttl.

Achieving 100% coverage (i.e., the remaining five constraint types) requires additional development on the reasoner to support specific built-ins. “Whitespace Handling” and “HTML Handling” require parsing built-ins, and “Valid Identifiers” requires a built-in to test URIs’ dereferenceability. The remaining two types (“Structure” and “Data Model Consistency”) are general constraint types, defined by Hartmann et al., requiring SPARQL support. Supporting these constraint types requires a translation from SPARQL queries to N3 rules, for which we refer to related work [77].

6.4. Speed

A validation approach that supports a custom set of inferencing steps is faster than a validation system that includes a reasoning preprocessing step. We first compare the performance of Validatrr to that of RDFUnit, both without and with a custom set of inferencing steps.

For these performance evaluations, we used 300 data sets with sizes ranging from ten to one million triples, and an executing machine consisting of 24 cores (Intel Xeon CPU E5-2620 v3 @ 2.40 GHz) and 128 GB RAM. All evaluations were performed using untampered Docker images for both approaches to maintain reproducibility, and the different tests were orchestrated using custom scripts. All timings include the Docker images’ initialization time. The data is available online.²⁵

²⁵ https://github.com/IDLabResearch/validation-benchmark/tree/master/data/validation-journal

Fig. 3.

Validatrr’s execution speed (dotted line) is up to an order of magnitude faster than RDFUnit’s (solid line) when the number of triples per RDF graph is below 100,000.

Performance comparison We compare the execution time of Validatrr to RDFUnit, following RDFUnit’s original evaluation method. We use a default set of constraints for a fixed set of schemas, as defined by Kontokostas et al. [57]. We consider six commonly used schemas: FOAF, GeoSPARQL, OWL, DC terms, SKOS, and Prov-O. For each schema, we use RDF graphs of varying size. The validated RDF graphs’ sizes range from ten triples to one million triples, in logarithmic steps of base ten. At most ten different RDF graphs – per schema, per RDF graph size – were downloaded, by querying LODLaundromat’s SPARQL endpoint [8].

We validate the different RDF graphs against their respective schema using the default set of constraints and set of inferencing steps (υ) of RDFUnit, and measure total execution time of Validatrr and RDFUnit. The median execution time across all schemas is plotted against RDF graph size per approach in a log-log scale (see Fig. 3). To make sure we can combine execution times across schemas, we tested the null hypothesis that no significant difference in execution time was found between schemas, by performing an ANOVA statistical test with single factor “used schema” for measurement variable “execution time per triple”, executed pairwise for all used schemas. The null hypothesis with $α = 0.05$ was accepted for every pair. The number of found violations is not plotted, as statistical analysis shows no large correlation between execution time and number of found violations, neither for Validatrr nor for RDFUnit ($-0.0203$ and 0.0458, respectively).
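The two statistics mentioned above can be sketched in plain Python. The sample values below are hypothetical and are not taken from the benchmark data. The sketch computes only the one-way ANOVA F statistic and the Pearson correlation coefficient; deriving a p-value to compare against $α = 0.05$ would additionally require the F distribution, for instance via a statistics library.

```python
from math import sqrt

def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA over a list of sample groups."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical "execution time per triple" samples (ms) for two schemas.
foaf_ms_per_triple = [1.0, 2.0, 3.0]
skos_ms_per_triple = [2.0, 3.0, 4.0]
f_stat = one_way_anova_f([foaf_ms_per_triple, skos_ms_per_triple])

# Hypothetical execution times (s) and violation counts for five RDF graphs.
times_s = [1.2, 1.5, 2.1, 9.8, 101.0]
violations = [40, 3, 57, 12, 30]
r = pearson_r(times_s, violations)

print(f"F = {f_stat:.3f}, r = {r:.3f}")
```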

Validatrr’s execution time is highly correlated with the number of triples of the validated RDF graph. Regression analysis shows an R-squared value of 0.9998, and the null hypothesis with $α = 0.05$ is accepted: Validatrr’s execution time grows linearly with respect to the size of the validated RDF graph. Meanwhile, the execution time of RDFUnit remains constant at around 30 s. This could largely be due to the set-up time required by RDFUnit; however, the timings attained via RDFUnit’s Docker image do not allow us to draw further conclusions. The set-up time of RDFUnit thus possibly dominates the total execution time.
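As a reminder of what the reported R-squared value measures, the following Python sketch fits a least-squares line to (graph size, execution time) pairs and computes the coefficient of determination. The measurements are invented for illustration and do not reproduce the benchmark data.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = intercept + slope * x, with R squared."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot

# Hypothetical median execution times (s) per RDF graph size (triples).
sizes = [10, 100, 1_000, 10_000, 100_000, 1_000_000]
times = [1.4, 1.6, 1.5, 2.7, 11.3, 101.8]

slope, intercept, r_squared = linear_fit(sizes, times)
print(f"time ~ {intercept:.2f} s + {slope:.2e} s/triple * size, R^2 = {r_squared:.4f}")
```

An R-squared value close to 1 indicates that a straight line explains nearly all variation in execution time, which is the sense in which execution time is said to grow linearly with graph size.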

Without customizing the set of inferencing steps and Docker images, Validatrr is faster for small RDF graphs. Validatrr is about an order of magnitude faster up to 10,000 triples, namely, 1–2 s per RDF graph compared to 30 s per RDF graph for RDFUnit. Beyond 100,000 triples, Validatrr is slower than RDFUnit, as Validatrr’s linearly growing execution time surpasses RDFUnit’s execution time.
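A rough estimate of where the two approaches break even can be derived from figures reported in this section, under a purely linear model. The sketch below assumes a fixed cost of 1.5 s for Validatrr (the midpoint of the reported 1–2 s), uses the 116 s reported for one million triples, and treats RDFUnit as constant at 30 s. The function name and the model are illustrative assumptions, not part of the evaluation.

```python
def crossover_size(t_fixed, t_large, n_large, t_constant):
    """Graph size at which a linear-time approach reaches a constant-time one.

    Models the linear approach as t(n) = t_fixed + slope * n, with the slope
    chosen so that t(n_large) equals t_large.
    """
    slope = (t_large - t_fixed) / n_large
    return (t_constant - t_fixed) / slope

# Assumed inputs: 1.5 s fixed cost, 116 s at one million triples, 30 s for RDFUnit.
break_even = crossover_size(t_fixed=1.5, t_large=116.0, n_large=1_000_000, t_constant=30.0)
print(f"estimated break-even at about {break_even:,.0f} triples")
```

Under these assumptions the break-even point lies at roughly a quarter of a million triples, which is compatible with the observation that Validatrr is still faster at 100,000 triples and slower at one million, given that graph sizes were sampled in steps of base ten.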

Custom inferencing steps’ performance impact We compare the execution time of Validatrr to RDFUnit when using a custom set of inferencing steps. We use RDFS entailment (ρ): it is commonly used, and the evaluation of Section 6.2 showed it affects the number of violations found. For Validatrr, we include the RDFS rules during validation. For RDFUnit, we include an RDFS entailment preprocessing step, as RDFUnit’s Docker image does not allow configuration to use a SPARQL engine that has inferencing capabilities. However, even if it were possible to use a different SPARQL engine, a reasoning preprocessing step would still be needed for use cases that require support for a specific set of inferencing steps, not covered by typical entailment regimes [1].

To keep the measures comparable, we use the EYE reasoner as used in Validatrr with the same RDFS entailment rule set to execute the reasoning preprocessing step. This also precludes the need to compare with other sets of inferencing steps than RDFS entailment: the conclusions will be similar due to the usage of the same reasoner. Figure 4 depicts the timings of RDFUnit and Validatrr. For RDFUnit, it depicts the combined timings of RDFS entailment as preprocessing step and validation on the newly inferred RDF graph (RDFUnit (ρ)), and it depicts solely the validation timings on the newly inferred graph (RDFUnit). For Validatrr, it depicts the timings of the validation with the two sets of inferencing rules (Validatrr (ρ) and Validatrr (υ), respectively).

Fig. 4.

Validatrr’s performance is not affected when including the RDFS inferencing rules (dotted line, compared to the lighter dotted line), whereas the reasoning preprocessing time deteriorates RDFUnit’s performance (solid line, compared to the lighter solid line).

Validatrr’s performance is not affected by using a different set of inferencing steps, whereas the preprocessing step deteriorates RDFUnit’s performance. This effect is noticeable starting from RDF graphs of 10,000 triples. For RDF graphs of one million triples, compared to the previous evaluation, median execution time rises from 27 s to 210 s for RDFUnit, largely due to the reasoning preprocessing step.

The number of found violations inversely affects the validation execution speed. Most original violations concern missing domain and range classes, which are inferred under RDFS entailment. Statistical analysis does not allow us to accept the null hypothesis that the number of violations found is inversely correlated to the execution time. However, we notice increased performance for both approaches when fewer violations need to be handled. Compared to the previous evaluation, for one million triples, execution time (without reasoning preprocessing) drops from 27 s to 21 s for RDFUnit, and from 116 s to 80 s for Validatrr.
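The reported timings for one million triples allow a back-of-the-envelope breakdown, sketched below in Python. It assumes that the combined RDFUnit timing is simply the sum of preprocessing and validation, which need not hold exactly for medians; the variable names are illustrative.

```python
# Timings (s) for RDF graphs of one million triples, as reported in the text.
rdfunit_combined = 210   # RDFS preprocessing plus validation
rdfunit_validation = 21  # validation on the already inferred graph
validatrr_rdfs = 80      # validation with RDFS rules included

# Assumption: the combined timing is the sum of both RDFUnit steps.
preprocessing = rdfunit_combined - rdfunit_validation
preprocessing_share = preprocessing / rdfunit_combined
ratio = rdfunit_combined / validatrr_rdfs

print(f"implied preprocessing time: {preprocessing} s ({preprocessing_share:.0%} of total)")
print(f"RDFUnit with preprocessing takes {ratio:.1f}x as long as Validatrr")
```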

The performance evaluations show that the execution time of Validatrr outperforms RDFUnit for small RDF graphs up to 100,000 triples, and its linear scaling behavior is not affected by including RDFS entailment during validation. Validatrr outperforms RDFUnit when reasoning preprocessing is needed, i.e., when the used SPARQL endpoint does not support inferencing up to the needed expressiveness, or cannot be sufficiently customized to the use case. Where RDFUnit first needs to infer all implicit data before validation, Validatrr can infer this data during validation, and thus performs better. We thus accept Hypothesis 4.

7. Conclusion and future work

In this section, we discuss our proposed rule-based reasoning validation approach and introduced implementation. We provide concluding remarks and guide towards future work with respect to (i) the detailed root cause explanations, (ii) the fine-grained level of configuration, (iii) the number of constraint types supported by our approach, and (iv) the scaling behavior of Validatrr’s performance. We close by providing some further research perspectives.

The logical proof of a validation execution, generated by the rule-based reasoner, provides a more detailed root cause explanation of why a violation occurs than the state of the art. Our evaluation does not imply that existing approaches and implementations are not capable of providing a similar level of detail. However, it does show the feasibility of more detailed explanations, and the capability of our approach to generate them. To improve the level of detail of explanations provided in the validation report, our work can guide future iterations of, e.g., SHACL’s validation report descriptions, and the algorithms that generate them.

Our approach is fully configurable by adjusting different rule sets: only a single declaration and single implementation is needed to support different constraint languages, sets of inferencing steps, and validation report descriptions. This level of control considerably increases expressiveness and complexity of the validator, and a small change in a rule set could have large effects on the validation results. However, such fine-grained configuration is not needed for every use case. Future work requires investigation into configuration defaults for, among others, ShEx and SHACL: to what extent can Validatrr be configured to function as a compliant ShEx or SHACL validator, and what will the combination of inferencing rule sets look like? A short-term goal is showing that Validatrr with the right configuration passes the core SHACL tests and is included as a compliant SHACL validator in the respective W3C documentation.²⁶ As such, we can provide a compliant SHACL validator where sh:entailment is accurately supported: the user can choose exactly which inferencing rule set is supported during validation, and can choose not to rely on the predefined custom set of inferencing steps (i.e., support for rdfs:subClassOf, but no other RDFS entailments) as currently specified in SHACL [56].

²⁶ https://w3c.github.io/data-shapes/data-shapes-test-suite/

Our approach supports an equivalent number of constraint types compared to the state of the art, with description logic-expressiveness up to at least OWL-RL. An important point of interest is handling recursion, one of the main differences between ShEx and SHACL. The semantics of ShEx are defined, also for recursion [13], and – as it is currently undefined in the SHACL specification [56] – current works are investigating recursion in combination with negation for SHACL [24]. Future work for our approach is investigating recursion, taking into account the conclusions and mentioned complexity issues of aforementioned works. Accepting that the general problem is NP-hard, using rule-based reasoning gives us a strong tool to handle recursion. A rule-based reasoner such as the EYE reasoner has path detection: different validations calling each other can be handled, as path detection prevents the reasoner from applying the same rule to the same data twice. In this regard, we can further investigate whether the strategies of Answer Set Programming [33] help to solve related problems, taking into account their two kinds of negation (Negation as Failure and strong negation). After investigating which rules are needed to handle recursion, the user can choose whether or not recursion should be supported during validation, as these extra rules can be added or not.
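The idea of never applying the same rule to the same data twice can be illustrated with a recursive shape check in Python. This is an analogy only and does not describe how the EYE reasoner implements path detection. The shape, the predicates, and the choice to treat a resource that is already being checked as conforming are assumptions; other recursion semantics are possible.

```python
def conforms(node, graph, applied=None):
    """Hypothetical recursive shape: a person has an ex:name, and everyone
    they ex:knows also conforms to the same shape."""
    applied = set() if applied is None else applied
    key = ("PersonShape", node)
    if key in applied:
        # This rule was already applied to this node on the current run:
        # do not apply it again, which breaks cycles in the data.
        return True
    applied.add(key)
    has_name = any(s == node and p == "ex:name" for (s, p, o) in graph)
    friends = [o for (s, p, o) in graph if s == node and p == "ex:knows"]
    return has_name and all(conforms(f, graph, applied) for f in friends)

CYCLIC_OK = {
    ("ex:alice", "ex:name", "Alice"), ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:name", "Bob"), ("ex:bob", "ex:knows", "ex:alice"),
}
CYCLIC_BAD = {
    ("ex:alice", "ex:name", "Alice"), ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:alice"),  # ex:bob has no ex:name
}

print(conforms("ex:alice", CYCLIC_OK), conforms("ex:alice", CYCLIC_BAD))
```

Without the `applied` set, the mutual ex:knows links would lead to unbounded recursion; with it, each (rule, resource) pair is evaluated at most once.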

Validatrr is up to an order of magnitude faster than RDFUnit for RDF graphs up to 100,000 triples, and its execution time scales linearly w.r.t. the number of triples in the RDF graph. However, it scales less well than RDFUnit, making Validatrr less suitable for large RDF graphs. As such, a trade-off must be made: our approach, which performs better for smaller RDF graphs, allows fine-grained configuration and detailed explanation, whereas other approaches scale better but do not provide the same level of detail. For future work, further investigation into related works that aim to improve the performance of rule-based reasoners, such as the work of Arndt et al. [3], can be used to improve the current scaling behavior of Validatrr.

Further research perspectives include validation of RDF graph generation descriptions, and automatic graph refinement based on violation explanations. The combination reduces the effort required to provide high-quality RDF graph generation descriptions, and is being further investigated by Heyvaert et al. [45].

On the one hand, a declarative description for generating an RDF graph – e.g., using the RDF Mapping Language (RML) [32] – can be validated, to show whether that description produces a valid RDF graph [31]. Certain constraints that apply to the description can be inferred based on the constraints that apply to the RDF graph. By including a custom inferencing rule set that reflects such inferencing in Validatrr, the generation description can be validated based on the set of constraints that apply to the RDF graph. As such, only a single set of constraints needs to be maintained and understood. The requirements of this custom inferencing rule set, and which constraint types can be applied to generation descriptions, are future work.

On the other hand, rules that accurately explain why a violation is returned can provide suggestions to (automatically) resolve the violation. For example, the constraint specifying “every book should have either an ISSN or an ISBN number” is violated by a resource that has both numbers. Suggestions include removing the ISSN number and removing the ISBN number. Which types of suggestions can be provided, and in which order these should be applied, are future work.
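The ISSN/ISBN example can be sketched in Python as follows. The predicates ex:issn and ex:isbn, the function name, and the shape of the returned suggestions are hypothetical; the case of a book with neither number goes beyond the example in the text and is included only to make the sketch complete.

```python
def repair_suggestions(resource, graph):
    """Alternative repairs for the constraint: exactly one of ex:issn or ex:isbn.

    Returns a list of alternatives; each alternative is a list of
    (action, triple) pairs. An empty list means the constraint holds.
    """
    issn = sorted(t for t in graph if t[0] == resource and t[1] == "ex:issn")
    isbn = sorted(t for t in graph if t[0] == resource and t[1] == "ex:isbn")
    if issn and isbn:
        # Both numbers present: removing either kind resolves the violation.
        return [[("remove", t) for t in issn], [("remove", t) for t in isbn]]
    if not issn and not isbn:
        # Neither present: a value has to be supplied, so it is left open (None).
        return [[("add", (resource, "ex:issn", None))],
                [("add", (resource, "ex:isbn", None))]]
    return []

GRAPH = {
    ("ex:book1", "ex:issn", "1234-5678"),
    ("ex:book1", "ex:isbn", "978-3-16-148410-0"),
    ("ex:book2", "ex:isbn", "978-0-00-000000-2"),
}

for alternative in repair_suggestions("ex:book1", GRAPH):
    print(alternative)
```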

Acknowledgements

The described research activities were funded by Ghent University, imec, Flanders Innovation & Entrepreneurship (VLAIO), and the European Union. Ruben Verborgh is a postdoctoral fellow of the Research Foundation – Flanders (FWO).

Updated constraint types coverage

Tables 4 and 5 summarize the updated constraint types coverage.

References

C.B.

Aranda,

Corby,

Das,

Feigenbaum,

Gearon,

Glimm,

Harris,

Hawke,

Herman,

Humfrey,

Michaelis,

Ogbuji,

Perry,

Passant,

Polleres,

Prud’hommeaux,

Seaborne and

G.T.

Williams, SPARQL 1.1 overview, Recommendation, World Wide Web Consortium (W3C), 2013.

Arndt,

Bonte,

Dejonghe,

Verborgh,

De Turck and

Ongenae, SENSdesc: Connect sensor queries and context, in: Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies,

Zwiggelaar,

Bamboa,

Fred and

Bermúdez i Badia, eds, Vol. 5, SCITEPRESS – Science and Technology Publications, Setúbal, Portugal, 2018, pp. 671–679. doi:10.5220/0006733106710679.

Arndt,

De Meester,

Bonte,

Schaballie,

Bhatti,

Dereuddre,

Verborgh,

Ongenae,

De Turck,

Van de Walle and

Mannens, Improving OWL RL reasoning in N3 by using specialized rules, in: Ontology Engineering: 12th International Experiences and Directions Workshop on OWL,

Tamma,

Dragoni,

Gonçalves and

Ławrynowicz, eds, Lecture Notes in Computer Science, Vol. 9557, Springer, Cham, 2016, pp. 93–104. doi:10.1007/978-3-319-33245-1_10.

Arndt,

De Meester,

Dimou,

Verborgh and

Mannens, Using rule-based reasoning for RDF validation, in: Rules and Reasoning: International Joint Conference, RuleML+RR 2017, London, UK, July 12–15, 2017,

Constantini,

Franconi,

Van Woensel,

Kontchakov,

Sadri and

Roman, eds, Lecture Notes in Computer Science, Vol. 10364, Springer, Cham, 2017, pp. 22–36. doi:10.1007/978-3-319-61252-2_3.

Arndt,

Schrijvers,

De Roo and

Verborgh, Implicit quantification made explicit: How to interpret blank nodes and universal variables in Notation3 Logic, Journal of Web Semantics58 (2019), 100501. doi:10.1016/j.websem.2019.04.001.

J.-F.

Baget,

Leclère,

M.-L.

Mugnier and

Salvat, On rules with existential variables: Walking the decidability line, Artificial Intelligence175(9–10) (2011), 1620–1654. doi:10.1016/j.artint.2011.03.002.

Beckett,

Berners-Lee,

Prud’hommeaux and

Carothers, RDF 1.1 Turtle – Terse RDF triple language, Recommendation, World Wide Web Consortium (W3C), 2014.

Beek,

Rietveld,

H.R.

Bazoobandi,

Wielemaker and

Schlobach, LOD Laundromat: A Uniform Way of Publishing Other People’s Dirty Data, in: The Semantic Web – ISWC 2014,

Mika,

Tudorache,

Bernstein,

Welty,

Knoblock,

Vrandečić,

Groth,

Noy,

Janowicz and

Goble, eds, Lecture Notes in Computer Science, Vol. 8796, Springer, Cham, 2014, pp. 213–228. doi:10.1007/978-3-319-11964-9_14.

Berners-Lee, Cwm, World Wide Web Consortium (W3C), 2000, http://www.w3.org/2000/10/swap/doc/cwm.html.

10.

Berners-Lee, Notation 3 Logic, World Wide Web Consortium (W3C), 2005, http://www.w3.org/DesignIssues/N3Logic.

11.

Berners-Lee,

Connolly,

Kagal,

Scharf and

Hendler, N3Logic: A logical framework for the World Wide Web, Theory and Practice of Logic Programming8(3) (2008), 249–269. doi:10.1017/S1471068407003213.

12.

Bizer and

Cyganiak, Quality-driven information filtering using the WIQA policy framework, Web Semantics: Science, Services and Agents on the World Wide Web7(1) (2009), 1–10. doi:10.1016/j.websem.2008.02.005.

13.

Boneva,

J.E.

Labra Gayo and

Prud’hommeaux, Semantics and validation of shapes schemas for RDF, in: The Semantic Web – ISWC 2017,

d’Amato,

Fernandez,

Tamma,

Lecue,

Cudré-Mauroux,

Sequeda,

Lange and

Heflin, eds, Lecture Notes in Computer Science, Vol. 10587, Springer, Cham, 2017, pp. 104–120. doi:10.1007/978-3-319-68288-4_7.

14.

Bosch,

Acar,

Nolle and

Eckert, The role of reasoning for RDF validation, in: Proceedings of the 11th International Conference on Semantic Systems,

Hellmann,

J.X.

Parreira and

Polleres, eds, Association for Computing Machinery, New York, NY, 2015, pp. 33–40. doi:10.1145/2814864.2814867.

15.

Bosch and

Eckert, Requirements on RDF constraint formulation and validation, in: Proceedings of the International Conference on Dublin Core and Metadata Applications,

Moen and

Rushing, eds, Dublin Core Metadata Initiative, 2014, pp. 95–108.

16.

Bosch and

Eckert, Towards description set profiles for RDF using SPARQL as intermediate language, in: Proceedings of the International Conference on Dublin Core and Metadata Applications,

Moen and

Rushing, eds, Dublin Core Metadata Initiative, 2014, pp. 129–137.

17.

Bosch,

Nolle,

Acar and

Eckert, RDF validation requirements – Evaluation and logical underpinning, Preprint, 2015, http://arxiv.org/abs/1501.03933.

18.

Bozic,

Brennan,

Feeney and

Mendel-Gleason, Describing reasoning results with RVO, the reasoning violations ontology, in: Joint Proceedings of the 2nd Workshop on Managing the Evolution and Preservation of the Data Web (MEPDaW 2016) and the 3rd Workshop on Linked Data Quality (LDQ 2016) Co-Located with 13th European Semantic Web Conference (ESWC 2016),

Debattista,

Umbrich and

J.D.

Fernándex, eds, CEUR Workshop Proceedings, Vol. 1585, CEUR-WS.org, 2016, pp. 62–69.

19.

Brickley and

R.V.

Guha, RDF Schema 1.1, Recommendation, World Wide Web Consortium (W3C), 2014.

20.

Brickley and

Miller, FOAF vocabulary specification 0.99, Namespace Document, 2014.

21.

Calì,

Gottlob and

Kifer, Taming the infinite chase: Query answering under expressive relational constraints, Journal of Artificial Intelligence Research48 (2013), 115–174. doi:10.1613/jair.3873.

22.

Calì,

Gottlob,

Lukasiewicz and

Pieris, Datalog+/−: A family of languages for ontology querying, in: Datalog Reloaded,

de Moor,

Gottlob,

Furche and

Sellers, eds, Lecture Notes in Computer Science, Vol. 6702, Springer, Berlin, 2011, pp. 351–368. doi:10.1007/978-3-642-24206-9_20.

23.

Chalub and

Rademaker, Verifying integrity constraints of a RDF-based WordNet, in: Global WordNet Conference, 2016, pp. 309–316.

24.

Corman,

J.L.

Reutter and

Savković, Semantics and validation of recursive SHACL, in: The Semantic Web – ISWC 2018,

Vrandečić,

Bontcheva,

M.C.

Suárez-Figueroa,

Presutti,

Celino,

Sabou,

L.-A.

Kaffee and

Simperl, eds, Lecture Notes in Computer Science, Vol. 11136, Springer, Cham, 2018, pp. 318–336. doi:10.1007/978-3-030-00671-6_19.

25.

Cyganiak,

Wood and

Lanthaler, RDF 1.1 concepts and abstract syntax, Recommendation, World Wide Web Consortium (W3C), 2014.

26.

C.V.

Damásio,

Analyti,

Antoniou and

Wagner, Supporting open and closed world reasoning on the Web, in: Principles and Practice of Semantic Web Reasoning,

J.J.

Alferes,

Bailey,

May and

Schwertel, eds, Lecture Notes in Computer Science, Vol. 4187, Springer, Berlin, 2006, pp. 149–163. doi:10.1007/11853107_11.

27.

De Roo,

Mels,

Sun and

Colaert, Specialisation mechanism for terminology reasoning, US Patent Application 15/120,165, 2017.

28.

Debattista,

Auer and

Lange, Luzzu – A methodology and framework for linked data quality assessment, Journal of Data and Information Quality8(1) (2016), 4:1–4:32. doi:10.1145/2992786.

29.

Debattista,

Dekkers,

Guéret,

Lee,

Mihindukulasooriya and

Zaveri, Data on the Web best practices: Data quality vocabulary, Working Group Note, World Wide Web Consortium, 2016.

30.

Dentler,

Cornet,

ten Teije and

de Keizer, Comparison of reasoners for large ontologies in the OWL 2 EL profile, Semantic Web Journal2(2) (2011), 71–87. doi:10.3233/SW-2011-0034.

31.

Dimou,

Kontokostas,

Freudenberg,

Verborgh,

Lehmann,

Mannens,

Hellmann and

Van de Walle, Assessing and refining mappings to RDF to improve dataset quality, in: The Semantic Web – ISWC 2015,

Arenas,

Corcho,

Simperl,

Strohmaier,

d’Aquin,

Srinivas,

Groth,

Dumontier,

Heflin,

Thirunarayan and

Staab, eds, Lecture Notes in Computer Science, Vol. 9367, Springer, Cham, 2015, pp. 133–149. doi:10.1007/978-3-319-25010-6_8.

32.

Dimou,

Vander Sande,

Colpaert,

Verborgh,

Mannens and

Van de Walle, RML: A generic language for integrated RDF mappings of heterogeneous data, in: Proceedings of the 7th Workshop on Linked Data on the Web,

Bizer,

Heath,

Auer and

Berners-Lee, eds, CEUR Workshop Proceedings, Vol. 1184, CEUR-WS.org, 2014.

33.

Eiter,

Ianni and

Krennwallner, Answer set programming: A primer, in: Reasoning Web. Semantic Technologies for Information Systems,

Tessaris,

Franconi,

Eiter,

Gutierrez,

Handschuh,

M.-C.

Rousset and

R.A.

Schmidt, eds, Lecture Notes in Computer Scinece, Vol. 5689, Springer, Berlin, 2009, pp. 40–110. doi:10.1007/978-3-642-03754-2_2.

34.

M.B.

Ellefi,

Bellahsene,

Breslin,

Demidova,

Dietze,

Szymanski and

Todorov, RDF dataset profiling – A survey of features, methods, vocabularies and applications, Semantic Web Journal9(5) (2018), 677–705. doi:10.3233/SW-180294.

35.

Farid,

Roatis,

I.F.

Ilyas,

H.-F.

Hoffmann and

Chu, CLAMS: Bringing quality to data lakes, in: Proceedings of the 2016 International Conference on Management of Data (SIGMOD),

Özcan and

Koutrika, eds, Association for Computing Machinery, New York, NY, 2016, pp. 2089–2092. doi:10.1145/2882903.2899391.

36.

P.M.

Fischer,

Lausen,

Schätzle and

Schmidt, RDF constraint checking, in: Proceedings of the Workshops of the EDBT/ICDT 2015 Joint Conference (EDBT/ICDT 2015),

P.M.

Fischer,

Alonso,

Arenas and

Geerts, eds, CEUR Workshop Proceedings, Vol. 1330, CEUR-WS.org, 2015, pp. 205–212.

37.

C.L.

Forgy, Rete: A fast algorithm for the many pattern/many object pattern match problem, Artificial Intelligence19(1) (1982), 17–37. doi:10.1016/0004-3702(82)90020-0.

38.

Glimm and

Ogbuji, SPARQL 1.1 entailment regimes, Recommendation, World Wide Web Consortium (W3C), 2013.

39.

Haase and

Qi, An analysis of approaches to resolving inconsistencies in DL-based ontologies, in: Proceedings of the International Workshop on Ontology Dynamics (IWOD-07),

Flouris and

d’Aquin, eds, 2007, pp. 97–109.

40.

Harris and

Seaborne, SPARQL 1.1 query language, Recommendation, World Wide Web Consortium (W3C), 2013.

41.

Hartmann, Validation framework for RDF-based constraint languages, PhD thesis, Karlsruher Institut für Technologie (KIT), 2016. doi:10.5445/ir/1000056458.

42.

Hartmann, Validation framework for RDF-based constraint languages – PhD thesis appendix, Technical report, Karlsruher Institut für Technologie (KIT), 2016. doi:10.5445/ir/1000054062.

43.

Hartmann,

Zapilko,

Wackerow and

Eckert, Validating RDF data quality using constraints to direct the development of constraint languages, in: IEEE Tenth International Conference on Semantic Computing (ICSC), IEEE, 2016, pp. 116–123. doi:10.1109/icsc.2016.43.

44.

P.J.

Hayes and

P.F.

Patel-Schneider, RDF 1.1 semantics, Recommendation, World Wide Web Consortium (W3C), 2014.

45.

Heyvaert,

Dimou,

De Meester and

Verborgh, Rule-driven inconsistency resolution for knowledge graph generation rules, Semantic Web Journal10(6) (2019), 1071–1086. doi:10.3233/SW-190358.

46.

Hitzler,

Krötzsch,

Parsia,

P.F.

Patel-Schneider and

Rudolph, OWL 2 Web ontology language – Primer (second edition), Recommendation, World Wide Web Consortium (W3C), 2012.

47.

Hogan,

Harth,

Passant,

Decker and

Polleres, Weaving the pedantic Web, in: 3rd International Workshop on Linked Data on the Web,

Bizer,

Heath,

Berners-Lee and

Hausenblas, eds, CEUR Workshop Proceedings, Vol. 628, CEUR-WS.org, 2010.

48.

Hogan,

Umbrich,

Harth,

Cyganiak,

Polleres and

Decker, An empirical survey of linked data conformance, Journal of Web Semantics14 (2012), 14–44. doi:10.1016/j.websem.2012.02.001.

49.

Horrocks,

P.F.

Patel-Schneider,

Boley,

Tabet,

Grosof and

Dean, SWRL: A semantic web rule language combining OWL and RuleML, Member Submission, World Wide Web Consortium (W3C), 2004.

50.

Kifer, Rule interchange format: The framework, in: RR 2008: Web Reasoning and Rule Systems,

Calvanese and

Lausen, eds, Lecture Notes in Computer Science, Vol. 5341, Springer, Berlin, 2008, pp. 1–11. doi:10.1007/978-3-540-88737-9_1.

51.

Kifer,

de Bruijn,

Boley and

Fensel, A realistic architecture for the Semantic Web, in: Rules and Rule Markup Languages for the Semantic Web,

Adi,

Stoutenburg and

Tabet, eds, Lecture Notes in Computer Science, Vol. 3791, Springer, Berlin, 2005, pp. 17–29. doi:10.1007/11580072_3.

52.

Kifer,

Lausen and

Wu, Logical foundations of object-oriented and frame-based languages, Journal of the ACM42(4) (1995), 741–843. doi:10.1145/210332.210335.

53.

Knublauch, OWL 2 RL in SPARQL, Documentation, TopBraid.

54.

Knublauch,

Allemang and

Steyskal, SHACL advanced features, Working Group Note, World Wide Web Consortium (W3C), 2017.

55.

Knublauch,

J.A.

Hendler and

Idehen, SPIN – Overview and motivation, Member Submission, World Wide Web Consortium (W3C), 2011.

56.

Knublauch and

Kontokostas, Shapes constraint language (SHACL), Recommendation, World Wide Web Consortium (W3C), 2017.

57.

Kontokostas,

Westphal,

Auer,

Hellmann,

Lehmann,

Cornelissen and

Zaveri, Test-driven evaluation of linked data quality, in: Proceedings of the 23rd International Conference on World Wide Web,

C.-W.

Chung, ed., Association for Computing Machinery, New York, NY, 2014, pp. 747–757. doi:10.1145/2566486.2568002.

58.

J.E.

Labra Gayo,

Prud’hommeaux,

Boneva and

Kontokostas, Validating RDF Data, Vol. 7, Morgan & Claypool Publishers LLC, 2017, pp. 1–328. doi:10.2200/s00786ed1v01y201707wbe016.

59.

J.E.

Labra Gayo,

Prud’hommeaux,

Solbrig and

Boneva, Validating and describing linked data portals using shapes, Preprint, 2017, https://arxiv.org/abs/1701.08924.

60.

P.N.

Mendes,

Mühleisen and

Bizer, Sieve: Linked data quality assessment and fusion, in: Proceedings of the 2012 Joint EDBT/ICDT Workshops,

Srivastava and

Ari, eds, Association for Computing Machinery, New York, NY, 2012, pp. 116–123. doi:10.1145/2320765.2320803.

61.

Motik,

B.C.

Grau,

Horrocks,

Wu,

Fokoue and

Lutz, OWL 2 Web ontology language profiles (second edition), Recommendation, World Wide Web Consortium (W3C), 2012.

62.

Motik, Horrocks and Sattler, Bridging the gap between OWL and relational databases, Journal of Web Semantics 7(2) (2009), 74–89. doi:10.1016/j.websem.2009.02.001.

63.

M.A. Musen, The Protégé Project: A look back and a look forward, AI Matters 1(4) (2015), 4–12. doi:10.1145/2757001.2757003.

64.

Nilsson, Description set profiles: A constraint language for Dublin Core Application Profiles, Working Draft, Dublin Core Metadata Initiative (DCMI), 2008.

65.

Parsia, Matentzoglu, Gonçalves, Glimm and Steigmiller, The OWL reasoner evaluation (ORE) 2015 competition report, Journal of Automated Reasoning 59 (2017), 455–482. doi:10.1007/s10817-017-9406-8.

66.

Paschke, Rules and logic programming for the Web, in: Reasoning Web. Semantic Technologies for the Web of Data, Polleres, d’Amato, Arenas, Handschuh, Kroner, Ossowski and P.F. Patel-Schneider, eds, Lecture Notes in Computer Science, Vol. 6848, Springer, Berlin, 2011, pp. 326–381. doi:10.1007/978-3-642-23032-5_6.

67.

P.F. Patel-Schneider, Using description logics for RDF constraint checking and closed-world recognition, in: Proceedings of the 29th AAAI Conference on Artificial Intelligence, Bonet and Koenig, eds, AAAI Press, 2015, pp. 247–253.

68.

P.F. Patel-Schneider, Diverging views of SHACL, Nuance Communications, 2016, https://research.nuance.com/diverging-views-of-shacl/.

69.

Pauwels, Mendes de Farias, Zhang, Roxin, Beetz, De Roo and Nicolle, A performance benchmark over semantic rule checking approaches in construction industry, Advanced Engineering Informatics 33 (2017), 68–88. doi:10.1016/j.aei.2017.05.001.

70.

Pauwels and Zhang, Semantic rule-checking for regulation compliance checking: An overview of strategies and approaches, in: Proceedings of the 32nd International CIB W78 Conference, Beetz, van Berlo, Hartmann and Amor, eds, 2015.

71.

Pérez-Urbina, Sirin and Clark, Validating RDF with OWL integrity constraints, Technical report, Clark & Parsia, LLC, 2012.

72.

Polleres, Feier and Harth, Rules with contextually scoped negation, in: The Semantic Web: Research and Applications: 3rd European Semantic Web Conference, ESWC 2006, Budva, Montenegro, June 11–14, 2006. Proceedings, Sure and Domingue, eds, Lecture Notes in Computer Science, Vol. 4011, Springer, Berlin, 2006, pp. 332–347. doi:10.1007/11762256_26.

73.

Prud’hommeaux, Boneva, J.E. Labra Gayo and Kellogg, Shape expressions language 2.1, Draft Community Group Report, World Wide Web Consortium (W3C), 2018.

74.

Prud’hommeaux, J.E. Labra Gayo and Solbrig, Shape expressions: An RDF validation and transformation language, in: Proceedings of the 10th International Conference on Semantic Systems, Sack, Filipowska, Lehmann and Hellmann, eds, Association for Computing Machinery, New York, NY, 2014, pp. 32–40. doi:10.1145/2660517.2660523.

75.

Radulovic, Mihindukulasooriya, García-Castro and Gómez-Pérez, A comprehensive quality model for linked data, Semantic Web Journal 9(1) (2017), 3–24. doi:10.3233/sw-170267.

76.

A.G. Ryman, A.J. Le Hors and Speicher, OSLC resource shape: A language for defining constraints on linked data, in: Proceedings of the WWW2013 Workshop on Linked Data on the Web, Bizer, Heath, Berners-Lee, Hausenblas and Auer, eds, CEUR Workshop Proceedings, Vol. 996, CEUR-WS.org, 2013.

77.

J.H. Soltren, Query-based database policy assurance using Semantic Web technologies, Master’s thesis, Massachusetts Institute of Technology, 2009.

78.

Staworko, Boneva, J.E. Labra Gayo, Hym, Prud’hommeaux and Solbrig, Complexity and expressiveness of ShEx for RDF, in: LIPIcs – Leibniz International Proceedings in Informatics, Arenas and Ugarte, eds, Leibniz International Proceedings in Informatics (LIPIcs), Vol. 31, Schloss Dagstuhl – Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2015, pp. 195–211. doi:10.4230/LIPIcs.ICDT.2015.195.

79.

Tao, Sirin, Bao and D.L. McGuinness, Integrity constraints in OWL, in: Proceedings of the 24th AAAI Conference on Artificial Intelligence, Fox and Poole, eds, AAAI Press, Menlo Park, CA, 2010, pp. 1443–1448.

80.

Thomazo, Compact rewritings for existential rules, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence (IJCAI), Rossi, ed., AAAI Press, Menlo Park, CA, 2013, pp. 1125–1131.

81.

Tomaszuk, RDF validation: A brief survey, in: Beyond Databases, Architectures and Structures. Towards Efficient Solutions for Data Analysis and Knowledge Representation, Kozielski, Mrozek, Kasprowski and Małysiak-Mrozek, eds, Communications in Computer and Information Science, Vol. 716, Springer, Cham, 2017, pp. 344–355. doi:10.1007/978-3-319-58274-0_28.

82.

Tomaszuk, Inference rules for OWL-P in N3Logic, in: Communication Papers of the 2018 Federated Conference on Computer Science and Information Systems, Ganza, Maciaszek and Paprzycki, eds, ACSIS, Vol. 17, Polskie Towarzystwo Informatyczne, 2018, pp. 27–33. doi:10.15439/2018f102.

83.

Verborgh, Arndt, Van Hoecke, De Roo, Mels, Steiner and Gabarró, The pragmatic proof: Hypermedia API composition and execution, Theory and Practice of Logic Programming 17(1) (2017), 1–48. doi:10.1017/S1471068416000016.

84.

Verborgh and De Roo, Drawing conclusions from linked data on the Web: The EYE reasoner, IEEE Software 32(5) (2015), 23–27. doi:10.1109/MS.2015.63.

85.

Yuksel, Gonul, Banu Laleci Erturkmen, Anil Sinaci, Invernizzi, Facchinetti, Migliavacca, Bergvall, Depraetere and De Roo, An interoperability platform enabling reuse of electronic health records for signal verification studies, BioMed Research International 2016 (2016), 6741418. doi:10.1155/2016/6741418.

86.

Zaveri, Rula, Maurino, Pietrobon, Lehmann and Auer, Quality assessment for linked data: A survey, Semantic Web Journal 7(1) (2015), 63–93. doi:10.3233/SW-150175.

Footnotes

¹ For the remainder of the paper, empty prefixes denote the fictional schema http://example.com/; other prefixes conform to the results of https://prefix.cc.

³ http://kaon2.semanticweb.org/

⁷ For a detailed description of RDF-CV, we refer to the original papers [15,17], or the source: https://github.com/boschthomas/RDF-Constraints-Vocabulary.

⁸ For a more thorough discussion of relevant rule languages, we refer to Section 3.2 of [83].

¹⁷ https://github.com/w3c/data-shapes/tree/gh-pages/data-shapes-test-suite/tests

²⁰ https://w3c.github.io/data-shapes/data-shapes-test-suite/

²² https://github.com/IDLabResearch/data-validation

²⁵ https://github.com/IDLabResearch/validation-benchmark/tree/master/data/validation-journal