
Abstract

Flexible querying techniques can enhance users’ access to complex, heterogeneous datasets in settings such as Linked Data, where the user may not always know how a query should be formulated in order to retrieve the desired answers. This paper presents query processing algorithms for a fragment of SPARQL 1.1 incorporating regular path queries (property path queries), extended with query approximation and relaxation operators. Our flexible query processing approach is based on query rewriting and returns answers incrementally according to their “distance” from the exact form of the query. We formally show the soundness, completeness and termination properties of our query rewriting algorithm. We also present empirical results that show promising query processing performance for the extended language.

Keywords

Semantic Web SPARQL 1.1 path queries query approximation query relaxation

1. Introduction

Flexible querying techniques have the potential to enhance users’ access to complex, heterogeneous datasets. In particular, users querying Linked Data may lack full knowledge of the structure of the data, its irregularities, and the URIs used within it. Moreover, the schemas and URIs used can also evolve over time. This makes it difficult for users to formulate queries that precisely express their information retrieval requirements. Hence, providing users with flexible querying capabilities is desirable.

SPARQL is the predominant language for querying RDF data and, in its latest version, SPARQL 1.1, it supports property path queries (i.e. regular path queries) over the RDF graph. However, it does not support notions of query approximation and relaxation (apart from the OPTIONAL operator).

Example 1.
Suppose a user wishes to find events that took place in London on 15th September 1940 and poses the following query on the YAGO knowledge base (http://www.mpi-inf.mpg.de/yago-naga/yago/), which is derived from multiple sources such as Wikipedia, WordNet and GeoNames: $\begin{array}{l} (x, on, “ 15 / 09 / 1940 ”) AND (x, in, “ London ”) \end{array}$ (The above is not a complete SPARQL query, but is sufficient to illustrate the problem we address.) This query returns no results because there are no property edges named “on” or “in” in YAGO.

Approximating “on” by “happenedOnDate” and “in” by “happenedIn” (which do appear in YAGO) gives the following query: $\begin{array}{l} (x, happenedOnDate, “ 15 / 09 / 1940 ”) AND \\ (x, happenedIn, “ London ”) \end{array}$ This still returns no answers, since “happenedIn” does not connect event instances directly to literals such as “London”. However, now relaxing $(x, happenedIn, “ London ”)$ to $(x, type, Event)$ , using knowledge encoded in YAGO that the domain of “happenedIn” is $Event$ , will return all events that occurred on 15th September 1940, including those occurring in London. In this particular instance only one answer is returned, namely the event “Battle of Britain”, but other events could in principle have been returned. So the query exhibits better recall than the original query, but possibly low precision.

Alternatively, instead of relaxing the second triple above, another approximation step can be applied to it, inserting the property “label” that connects URIs to their labels and yielding the following query: $\begin{array}{l} (x, happenedOnDate, “ 15 / 09 / 1940 ”) AND \\ (x, happenedIn / label, “ London ”) \end{array}$ This query now returns the only event that occurred on 15th September 1940 in London, that is “Battle of Britain”. It exhibits both better recall than the original query and also high precision.
Example 2.
Suppose the user wishes to find the geographic coordinates of the “Battle of Waterloo” event by posing the query $\begin{array}{l} (⟨ Battle_of_Waterloo ⟩, \\ happenedIn / (hasLongitude | hasLatitude), x) . \end{array}$ in which angle brackets delimit a URI. We see that this query uses the property paths extension of SPARQL, specifically the concatenation (/) and disjunction (|) operators. In the query, the property “happenedIn” is concatenated with either “hasLongitude” or “hasLatitude”, thereby finding a connection between the event and its location (in our case Waterloo), and from the location to both its coordinates.

This query does not return any answers from YAGO since YAGO does not store the geographic coordinates of Waterloo. However, by applying an approximation step, we can insert “isLocatedIn” after “happenedIn” which connects the URI representing Waterloo with the URI representing Belgium. The resulting query is $\begin{array}{l} (⟨ Battle_of_Waterloo ⟩, happenedIn / isLocatedIn / \\ (hasLongitude | hasLatitude), x) . \end{array}$ This query returns 16 answers that may be relevant to the user, since YAGO does store the geographic coordinates of some (unspecified) locations in Belgium, increasing recall but with possibly low precision.

Moreover, YAGO does in fact store directly the coordinates of the “Battle of Waterloo” event, so if we apply an approximation step that deletes the property “happenedIn”, instead of adding “isLocatedIn”, the resulting query $\begin{array}{l} (⟨ Battle_of_Waterloo ⟩, \\ (hasLongitude | hasLatitude), x) \end{array}$ returns the desired answers, showing both high precision and high recall.

In this paper we describe an extension of a fragment of SPARQL 1.1 with query approximation and query relaxation operations that automatically generate rewritten queries such as those illustrated in the above examples, calling the extended language SPARQL^AR. We first presented SPARQL^AR in [5], focussing on its syntax, semantics and complexity of query answering. We showed that the introduction of the query approximation and query relaxation operators does not increase the theoretical complexity of the language, and we provided complexity bounds for several language fragments. In this paper, we review and extend these results to a larger SPARQL language fragment. We also explore in more detail the theoretical and performance aspects of our query processing algorithms for SPARQL^AR, examining their correctness and termination properties, and presenting the results of a performance study over the YAGO dataset.

The rest of the paper is structured as follows. Section 2 describes related work on flexible querying for the Semantic Web, and on query approximation and relaxation more generally. Section 3 presents the theoretical foundation of our approach, summarising the syntax, semantics and complexity of SPARQL^AR. Section 4 presents in detail our query processing approach for SPARQL^AR, which is based on query rewriting. We present our query processing algorithms, and formally show the soundness and completeness of our query rewriting algorithm, as well as its termination. We include a discussion in Section 4.4 on how users may be helped in formulating queries and interpreting results in a system which includes query approximation and relaxation. Section 5 presents and discusses the results of a performance study over the YAGO dataset. Finally, Section 6 gives our concluding remarks and directions for further work.
2. Related work

There have been several previous proposals for applying flexible querying to the Semantic Web, mainly employing similarity measures to retrieve additional answers of possible relevance. For example, in [10] matching functions are used for constants such as strings and numbers, while in [14] an extension of SPARQL is developed called iSPARQL which uses three different matching functions to compute string similarity. In [7], the structure of the RDF data is exploited and a similarity measurement technique is proposed which matches paths in the RDF graph with respect to the query. Ontology-driven similarity measures are proposed in [11,12,20] which use the RDFS ontology to retrieve extra answers and assign a score to them.

In [8] methods for relaxing SPARQL-like triple pattern queries automatically are presented. Query relaxations are produced by means of statistical language models for structured RDF data and queries. The query processing algorithms merge the results of different relaxations into a unified results list.

Recently, a fuzzy approach has been proposed to extend the XPath query language with the aim of providing mechanisms to assign priorities to queries and to rank query answers [2]. These techniques are based on fuzzy extensions of the Boolean operators.

Flexible querying approaches for SQL have been discussed in [21] where the authors describe a system that enables a user to issue an SQL aggregation query, see results as they are produced, and adjust the processing as the query runs. This approach allows users to write flexible queries containing linguistic terms, observe the progress of their aggregation queries, and control execution on the fly.

An approximation technique for conjunctive queries on probabilistic databases has been investigated in [9]. The authors use propositional formulas for approximating the queries. Formulas and queries are connected in the following way: given an input database where every tuple is annotated by a distinct variable, each tuple t in the query answer is annotated by a formula over the input tuples that contributed to t.

Another flexible querying technique for relational databases is described in [4]. The authors present an extension to SQL (Soft-SQL) which permits so-called soft conditions. Such conditions tolerate degrees of under-satisfaction of a query by exploiting the flexibility offered by fuzzy set theory.

In [18] the authors show how a conjunctive regular path query language can be effectively extended with approximation and relaxation techniques, using similar notions of approximation and relaxation as we use here. Finally, in [23] the authors describe and provide technical details of the implementation of a flexible querying evaluator for conjunctive regular path queries, extending the work in [18].

In contrast to all the above work, our focus is on the SPARQL 1.1 language. In [5] we extended, for the first time, a fragment of this language with query approximation and query relaxation operators, terming the extended language SPARQL^AR. Here, we add the UNION operator to SPARQL^AR and derive additional complexity results. Moreover, we present in detail our query processing algorithms for SPARQL^AR. Our query processing approach is based on query rewriting, whereby we incrementally generate a set of SPARQL 1.1 queries from the original SPARQL^AR query, evaluate these queries using existing technologies, and return answers ranked according to their “distance” from the original query. We examine the correctness and termination properties of our query rewriting algorithm and we present the results of a performance study on the YAGO dataset.

3. Theoretical foundation

In this section we give definitions of the syntax and semantics of SPARQL^AR summarising and extending the syntax and semantics from [5], and also the complexity results from that paper. We begin with some necessary definitions.

Definition 1 (Sets, triples and variables).

We assume pairwise disjoint infinite sets U and L of URIs and literals, respectively. An RDF triple is a tuple $⟨ s, p, o ⟩ \in U \times U \times (U \cup L)$ , where s is the subject, p the predicate and o the object of the triple. We assume also an infinite set V of variables that is disjoint from U and L. We abbreviate any union of the sets U, L and V by concatenating their names; for instance, $UL = U \cup L$ .

Note that in the above definition we modify the definition of triples from [16] by omitting blank nodes, since their use is discouraged for Linked Data because they represent a resource without specifying its name and are identified by an ID which may not be unique in the dataset [3].

Definition 2 (RDF-Graph).

An RDF-Graph G is a directed graph $(N, D, E)$ where: N is a finite set of nodes such that $N \subset UL$ ; D is a finite set of predicates such that $D \subset U$ ; E is a finite set of labelled, weighted edges of the form $⟨ ⟨ s, p, o ⟩, c ⟩$ such that the edge source (subject) $s \in N \cap U$ , the edge target (object) $o \in N$ , the edge label $p \in D$ and the edge weight c is a non-negative number.

Note that, in the above definition, we modify the definition of an RDF-Graph from [16] to add weights to the edges, which are needed to formalise our flexible querying semantics. Initially, these weights are all 0.

We next define the ontology of an RDF dataset, using a fragment of the RDF-Schema (RDFS) vocabulary.

Definition 3 (Ontology).

An ontology K is a directed graph $(N_{K}, E_{K})$ where each node in $N_{K}$ represents either a class or a property, and each edge in $E_{K}$ is labelled with a symbol from the set $\{sc, sp, dom, range\}$ . These edge labels encompass a fragment of the RDFS vocabulary, namely rdfs:subClassOf, rdfs:subPropertyOf, rdfs:domain and rdfs:range, respectively.

In an RDF-graph $G = (N, D, E)$ , we assume that each node in N represents an instance or a class and each edge in E a property (even though, more generally, RDF does not distinguish between instances, classes and properties; in fact, in RDF it is possible to use a property as a node of the graph). The predicate $type$ representing the RDF vocabulary rdf:type, can be used in E to connect an instance of a class to a node representing that class. In an ontology $K = (N_{K}, E_{K})$ , each node in $N_{K}$ represents a class (a “class node”) or a property (a “property node”). The intersection of N and $N_{K}$ is contained in the set of class nodes of $N_{K}$ . D is contained in the set of property nodes of $N_{K}$ .

Definition 4 (Triple pattern).

A triple pattern is a tuple $⟨ x, z, y ⟩ \in UV \times UV \times UVL$ . Given a triple pattern $⟨ x, z, y ⟩$ , $var (⟨ x, z, y ⟩)$ is the set of variables occurring in it.

Note that again we modify the definition from [16] to exclude blank nodes.

Definition 5 (Mapping).

A mapping μ from $ULV$ to $UL$ is a partial function $μ : ULV \to UL$ . We assume that $μ (x) = x$ for all $x \in UL$ , i.e. μ maps URIs and literals to themselves. The set $var (μ)$ is the subset of V on which μ is defined. Given a triple pattern $⟨ x, z, y ⟩$ and a mapping μ such that $var (⟨ x, z, y ⟩) \subseteq var (μ)$ , $μ (⟨ x, z, y ⟩)$ is the triple obtained by replacing the variables in $⟨ x, z, y ⟩$ by their image according to μ.
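As an illustration of Definitions 4 and 5, the following minimal Python sketch applies a mapping to a triple pattern. The encoding is an assumption made for this example only: terms are plain strings, variables are marked by a leading `?`, and the helper names `is_var`, `var` and `apply_mapping` are hypothetical.

```python
from typing import Dict, Set, Tuple

TriplePattern = Tuple[str, str, str]
Mapping = Dict[str, str]  # partial function from variables to URIs/literals


def is_var(term: str) -> bool:
    # Assumption of this sketch: variables carry a leading "?".
    return term.startswith("?")


def var(pattern: TriplePattern) -> Set[str]:
    """Set of variables occurring in a triple pattern."""
    return {t for t in pattern if is_var(t)}


def apply_mapping(mu: Mapping, pattern: TriplePattern) -> TriplePattern:
    """Replace each variable by its image under mu; constants map to themselves."""
    if not var(pattern) <= set(mu):
        raise ValueError("var(pattern) must be contained in var(mu)")
    return tuple(mu[t] if is_var(t) else t for t in pattern)


triple = apply_mapping({"?x": "Battle_of_Britain"},
                       ("?x", "happenedIn", "London"))
```

Here `triple` is the RDF triple obtained from the pattern once `?x` is bound.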

3.1. Syntax of SPARQL^AR queries

Definition 6 (Regular expression pattern).

A regular expression pattern $P \in RegEx (U)$ is defined as follows: $\begin{matrix} P := ϵ ∣ _ ∣ p ∣ (P_{1} | P_{2}) ∣ (P_{1} / P_{2}) ∣ P^{*} \end{matrix}$ where $P_{1}, P_{2} \in RegEx (U)$ are also regular expression patterns, ϵ represents the empty pattern, $p \in U$ and $_$ is a symbol that denotes the disjunction of all URIs in U.

This definition of regular expression patterns is the same as that in [6]. Our query pattern syntax is also based on that of [6], but includes also our query approximation and relaxation operators APPROX and RELAX.

Definition 7 (Query Pattern).

A SPARQL^AR query pattern Q is defined as follows: $\begin{array}{l} Q := UV \times UV \times UVL ∣ UV \times RegEx (U) \times UVL ∣ \\ Q_{1} AND Q_{2} ∣ Q_{1} UNION Q_{2} ∣ Q FILTER R ∣ \\ RELAX (UV \times RegEx (U) \times UVL) ∣ \\ APPROX (UV \times RegEx (U) \times UVL) \end{array}$ where R is a SPARQL built-in condition and $Q_{1}$ , $Q_{2}$ are also query patterns. We denote by $var (Q)$ the set of all variables occurring in a query pattern Q.

(In the W3C SPARQL syntax, a dot (.) is used for conjunction but, for greater clarity, we use AND instead. Note also that ϵ and $_$ cannot be specified in property paths in SPARQL 1.1.)

A SPARQL^AR query has the form ${SELECT}_{\vec{w}}$ WHERE Q, with $\vec{w} \subseteq var (Q)$ . We may omit here the keyword WHERE for simplicity. Given $Q^{'} = {SELECT}_{\vec{w}} Q$ , the head of $Q^{'}$ , $head (Q^{'})$ , is $\vec{w}$ if $\vec{w} \neq \emptyset$ and $var (Q)$ otherwise.

3.2. Semantics of SPARQL^AR queries

We extend the semantics of SPARQL with regular expression query patterns given in [6] in order to handle the weight/cost of edges in an RDF-Graph and the cost of applying the approximation and relaxation operators. These costs determine the ranking of the answers returned to the user: exact answers (of cost 0) are returned first, followed by answers of increasing cost.

We extend the notion of SPARQL query evaluation from returning a set of mappings to returning a set of pairs of the form $⟨ μ, c ⟩$ , where μ is a mapping and c is a non-negative integer that indicates the cost of the answers arising from this mapping.

Two mappings $μ_{1}$ and $μ_{2}$ are said to be compatible if $\forall x \in var (μ_{1}) \cap var (μ_{2})$ , $μ_{1} (x) = μ_{2} (x)$ . The union of two mappings $μ = μ_{1} \cup μ_{2}$ can be computed only if $μ_{1}$ and $μ_{2}$ are compatible. The resulting μ is a mapping such that $var (μ) = var (μ_{1}) \cup var (μ_{2})$ and: for each x in $var (μ_{1}) \cap var (μ_{2})$ , we have $μ (x) = μ_{1} (x) = μ_{2} (x)$ ; for each x in $var (μ_{1})$ but not in $var (μ_{2})$ , we have $μ (x) = μ_{1} (x)$ ; and for each x in $var (μ_{2})$ but not in $var (μ_{1})$ , we have $μ (x) = μ_{2} (x)$ .

We finally define the union and join of two sets of query evaluation results, $M_{1}$ and $M_{2}$ :

$M_{1} \cup M_{2} = \{⟨ μ, c ⟩ ∣ ⟨ μ, c_{1} ⟩ \in M_{1} \text{ or } ⟨ μ, c_{2} ⟩ \in M_{2}, \text{ with } c = c_{1} \text{ if } ∄ c_{2} . ⟨ μ, c_{2} ⟩ \in M_{2}, \ c = c_{2} \text{ if } ∄ c_{1} . ⟨ μ, c_{1} ⟩ \in M_{1}, \text{ and } c = \min (c_{1}, c_{2}) \text{ otherwise}\}$ .

$M_{1} ⋈ M_{2} = \{⟨ μ_{1} \cup μ_{2}, c_{1} + c_{2} ⟩ ∣ ⟨ μ_{1}, c_{1} ⟩ \in M_{1} \text{ and } ⟨ μ_{2}, c_{2} ⟩ \in M_{2} \text{ with } μ_{1} \text{ and } μ_{2} \text{ compatible mappings}\}$ .
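The union and join of costed result sets can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: a mapping is a `dict`, a result set is a list of `(mapping, cost)` pairs, and the function names `compatible`, `union_results` and `join_results` are hypothetical. The union keeps the minimum cost per mapping, as in the definition above; the join follows the definition literally and may therefore return the same merged mapping with several costs.

```python
from typing import Dict, List, Tuple

Mapping = Dict[str, str]
Result = List[Tuple[Mapping, int]]  # pairs <mu, c>


def compatible(m1: Mapping, m2: Mapping) -> bool:
    """True if m1 and m2 agree on every variable they share."""
    return all(m1[x] == m2[x] for x in m1.keys() & m2.keys())


def union_results(M1: Result, M2: Result) -> Result:
    """Union of result sets, keeping the minimum cost for each mapping."""
    best = {}
    for mu, c in list(M1) + list(M2):
        key = frozenset(mu.items())  # hashable view of the mapping
        if key not in best or c < best[key]:
            best[key] = c
    return [(dict(key), c) for key, c in best.items()]


def join_results(M1: Result, M2: Result) -> Result:
    """Join of result sets: merge compatible mappings and add their costs."""
    return [({**m1, **m2}, c1 + c2)
            for m1, c1 in M1
            for m2, c2 in M2
            if compatible(m1, m2)]


u = union_results([({"?x": "a"}, 1)],
                  [({"?x": "a"}, 3), ({"?x": "b"}, 2)])
j = join_results([({"?x": "a"}, 1)],
                 [({"?x": "a", "?y": "c"}, 2), ({"?x": "b", "?y": "d"}, 0)])
```

In the usage above, the mapping binding `?x` to `a` occurs in both inputs of the union and keeps the smaller cost, while the join discards the pair whose bindings for `?x` disagree.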

3.2.1. Exact semantics

The semantics of a triple pattern t that may include a regular expression pattern as its second component, with respect to a graph G, denoted ${[[t]]}_{G}$ , is defined recursively as follows: $\begin{array}{l} {[[⟨ x, ϵ, y ⟩]]}_{G} = \{⟨ μ, 0 ⟩ ∣ var (μ) = var (⟨ x, ϵ, y ⟩) \land \exists c \in N . μ (x) = μ (y) = c\} \\ {[[⟨ x, z, y ⟩]]}_{G} = \{⟨ μ, c ⟩ ∣ var (μ) = var (⟨ x, z, y ⟩) \land ⟨ μ (⟨ x, z, y ⟩), c ⟩ \in E\} \\ {[[⟨ x, P_{1} | P_{2}, y ⟩]]}_{G} = {[[⟨ x, P_{1}, y ⟩]]}_{G} \cup {[[⟨ x, P_{2}, y ⟩]]}_{G} \\ {[[⟨ x, P_{1} / P_{2}, y ⟩]]}_{G} = {[[⟨ x, P_{1}, z ⟩]]}_{G} ⋈ {[[⟨ z, P_{2}, y ⟩]]}_{G} \\ {[[⟨ x, P^{*}, y ⟩]]}_{G} = {[[⟨ x, ϵ, y ⟩]]}_{G} \cup {[[⟨ x, P, y ⟩]]}_{G} \cup \\ \quad ⋃_{n ⩾ 1} \{⟨ μ, c ⟩ ∣ ⟨ μ, c ⟩ \in {[[⟨ x, P, z_{1} ⟩]]}_{G} ⋈ {[[⟨ z_{1}, P, z_{2} ⟩]]}_{G} ⋈ \dots ⋈ {[[⟨ z_{n}, P, y ⟩]]}_{G}\} \end{array}$ where P, $P_{1}$ , $P_{2}$ are regular expression patterns, x, y, z are in $ULV$ , and $z, z_{1}, \dots, z_{n}$ are fresh variables.
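To make the recursive definition concrete, the sketch below evaluates a regular expression pattern over a small weighted graph held in memory. It is only an illustration under several assumptions: patterns are encoded as nested tuples (`("eps",)`, `("uri", p)`, `("alt", P1, P2)`, `("seq", P1, P2)`, `("star", P)`), the graph is a dictionary from triples to weights, only the minimum cost per (source, target) pair is kept, and the names `evaluate`, `compose` and `merge` are hypothetical. It is a direct in-memory evaluator, not the query-rewriting approach described in this paper.

```python
INF = float("inf")


def merge(r1, r2):
    """Union of two relations {(s, o): cost}, keeping the minimum cost."""
    out = dict(r1)
    for pair, c in r2.items():
        if c < out.get(pair, INF):
            out[pair] = c
    return out


def compose(r1, r2):
    """Join r1 and r2 on the intermediate node, adding costs."""
    by_source = {}
    for (m, o), c2 in r2.items():
        by_source.setdefault(m, []).append((o, c2))
    out = {}
    for (s, m), c1 in r1.items():
        for o, c2 in by_source.get(m, []):
            if c1 + c2 < out.get((s, o), INF):
                out[(s, o)] = c1 + c2
    return out


def evaluate(pattern, nodes, edges):
    """Return {(s, o): min cost} for node pairs connected by the pattern."""
    kind = pattern[0]
    if kind == "eps":  # empty pattern: every node reaches itself at cost 0
        return {(n, n): 0 for n in nodes}
    if kind == "uri":  # single edge label
        return {(s, o): c for (s, p, o), c in edges.items() if p == pattern[1]}
    if kind == "alt":  # P1 | P2
        return merge(evaluate(pattern[1], nodes, edges),
                     evaluate(pattern[2], nodes, edges))
    if kind == "seq":  # P1 / P2
        return compose(evaluate(pattern[1], nodes, edges),
                       evaluate(pattern[2], nodes, edges))
    if kind == "star":  # P*: iterate to a fixpoint (weights are non-negative)
        base = evaluate(pattern[1], nodes, edges)
        result = {(n, n): 0 for n in nodes}
        while True:
            extended = merge(result, compose(result, base))
            if extended == result:
                return result
            result = extended
    raise ValueError(f"unknown pattern kind: {kind}")


# Hypothetical toy graph; the weights are invented for illustration.
nodes = {"a", "b", "c"}
edges = {("a", "p", "b"): 0, ("b", "q", "c"): 2, ("b", "p", "c"): 1}
concat = evaluate(("seq", ("uri", "p"), ("alt", ("uri", "q"), ("uri", "r"))),
                  nodes, edges)
closure = evaluate(("star", ("uri", "p")), nodes, edges)
```

The fixpoint loop for `P*` terminates because the graph is finite and edge weights are non-negative, so minimum costs can only improve finitely often.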

A mapping satisfies a condition R, denoted $μ ⊧ R$ , as follows:

R is $x = a$ : $μ ⊧ R$ if $x \in var (μ)$ , $a \in LU$ and $μ (x) = a$ ;

R is $x = y$ : $μ ⊧ R$ if $x, y \in var (μ)$ and $μ (x) = μ (y)$ ;

R is $isURI (x)$ : $μ ⊧ R$ if $x \in var (μ)$ and $μ (x) \in U$ ;

R is $isLiteral (x)$ : $μ ⊧ R$ if $x \in var (μ)$ and $μ (x) \in L$ ;

R is $R_{1} \land R_{2}$ : $μ ⊧ R$ if $μ ⊧ R_{1}$ and $μ ⊧ R_{2}$ ;

R is $R_{1} \lor R_{2}$ : $μ ⊧ R$ if $μ ⊧ R_{1}$ or $μ ⊧ R_{2}$ ;

R is $\neg R_{1}$ : $μ ⊧ R$ if it is not the case that $μ ⊧ R_{1}$ .
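The satisfaction relation above translates almost line by line into a recursive check. The following Python sketch assumes a tuple encoding of conditions (`("eq_const", x, a)`, `("eq_var", x, y)`, `("isURI", x)`, `("isLiteral", x)`, `("and", R1, R2)`, `("or", R1, R2)`, `("not", R1)`) and an explicit set `uris` standing in for U, with every other value treated as a literal; both the encoding and the function name `satisfies` are hypothetical.

```python
def satisfies(mu, R, uris):
    """Return True if the mapping mu satisfies the built-in condition R."""
    op = R[0]
    if op == "eq_const":  # x = a
        _, x, a = R
        return x in mu and mu[x] == a
    if op == "eq_var":  # x = y
        _, x, y = R
        return x in mu and y in mu and mu[x] == mu[y]
    if op == "isURI":
        return R[1] in mu and mu[R[1]] in uris
    if op == "isLiteral":
        # Assumption: any bound value that is not a URI is a literal.
        return R[1] in mu and mu[R[1]] not in uris
    if op == "and":
        return satisfies(mu, R[1], uris) and satisfies(mu, R[2], uris)
    if op == "or":
        return satisfies(mu, R[1], uris) or satisfies(mu, R[2], uris)
    if op == "not":
        return not satisfies(mu, R[1], uris)
    raise ValueError(f"unknown condition: {op}")


mu = {"?x": "London_uri", "?y": "1940"}  # hypothetical bindings
uris = {"London_uri"}
```

Note that, following the definition of negation given above, a negated condition over an unbound variable is satisfied, since the inner condition does not hold.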

The overall semantics of queries (excluding APPROX and RELAX) is as follows, where Q, $Q_{1}$ , $Q_{2}$ are query patterns and the projection operator $π_{\vec{w}}$ selects only the subsets of the mappings relating to the variables in $\vec{w}$ : $\begin{array}{l} {[[Q_{1} AND Q_{2}]]}_{G} = {[[Q_{1}]]}_{G} ⋈ {[[Q_{2}]]}_{G} \\ {[[Q_{1} UNION Q_{2}]]}_{G} = {[[Q_{1}]]}_{G} \cup {[[Q_{2}]]}_{G} \\ {[[Q FILTER R]]}_{G} = {⟨ μ, c ⟩ \in {[[Q]]}_{G} ∣ μ ⊧ R} \\ {[[{SELECT}_{\vec{w}} Q]]}_{G} = π_{\vec{w}} ({[[Q]]}_{G}) \end{array}$ We will omit the $SELECT$ keyword from a query Q if $\vec{w} = vars (Q)$ .

3.2.2. Query relaxation

Our relaxation operator is based on that in [18] and relies on a fragment of the RDFS entailment rules known as ρDF [15]. An RDFS graph $K_{1}$ entails an RDFS graph $K_{2}$ , denoted $K_{1} ⊧_{RDFS} K_{2}$ , if $K_{2}$ can be derived by applying the rules in Fig. 1 iteratively to $K_{1}$ . For the fragment of RDFS that we consider, $K_{1} ⊧_{RDFS} K_{2}$ if and only if $K_{2} \subseteq cl (K_{1})$ , with $cl (K_{1})$ being the closure of the RDFS Graph $K_{1}$ under these rules. Notice that if $K_{1}$ is finite then also $cl (K_{1})$ is finite.

Fig. 1.

RDFS entailment rules.

Fig. 2.

Additional rules for extended reduction of an RDFS ontology.

Applying a rule means adding a triple that is deducible by the rule to G or K. Specifically, if there are two triples t, $t^{'}$ that match the antecedent of a rule, then it is possible to insert the triple implied by the consequent of the rule. For example, the triple pattern $⟨ x, startsExistingOnDate, y ⟩$ can be deduced from $⟨ x, wasBornOnDate, y ⟩$ and $⟨ wasBornOnDate, sp, startsExistingOnDate ⟩$ by applying rule 2.
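Rule application and closure can be sketched as a simple fixpoint computation. The Python code below assumes the rule numbering suggested by the surrounding text (rule 1: transitivity of sp; rule 2: sub-property inheritance; rule 3: transitivity of sc; rule 4: type inheritance via sc; rule 5: typing via dom); a rule for range is omitted because it is not described in this section. Triples are plain tuples and the function name `rdfs_closure` is hypothetical.

```python
def rdfs_closure(triples):
    """Close a set of (s, p, o) triples under the five assumed rules."""
    closed = set(triples)
    while True:
        new = set()
        for (a, label, b) in closed:
            for (s, p, o) in closed:
                # Rule 1: (a sp b), (b sp o) => (a sp o)
                if label == "sp" and p == "sp" and s == b:
                    new.add((a, "sp", o))
                # Rule 2: (a sp b), (s a o) => (s b o)
                if label == "sp" and p == a:
                    new.add((s, b, o))
                # Rule 3: (a sc b), (b sc o) => (a sc o)
                if label == "sc" and p == "sc" and s == b:
                    new.add((a, "sc", o))
                # Rule 4: (a sc b), (s type a) => (s type b)
                if label == "sc" and p == "type" and o == a:
                    new.add((s, "type", b))
                # Rule 5: (a dom b), (s a o) => (s type b)
                if label == "dom" and p == a:
                    new.add((s, "type", b))
        if new <= closed:  # fixpoint reached: nothing new is derivable
            return closed
        closed |= new


# The instance "alice" and the date literal are hypothetical.
data = {("alice", "wasBornOnDate", "1900-01-01"),
        ("wasBornOnDate", "sp", "startsExistingOnDate")}
derived = rdfs_closure(data)
```

The loop terminates on finite input because only finitely many triples can be built from the terms that already occur, which mirrors the remark that the closure of a finite graph is finite.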

In order to apply relaxation to queries, the extended reduction of an ontology K is required [13]. Given an ontology K, its extended reduction $extRed (K)$ is computed as follows: (i) compute $cl (K)$ ; (ii) apply the rules of Fig. 2 in reverse until no more rules can be applied (after applying this step the ontology generated is unique); (iii) apply rules 1 and 3 of Fig. 1 in reverse until no more rules can be applied.

In order to generate a unique extended reduction we alter step (iii) of the procedure in [13] as follows: for every pair of triples $(a, sp, b)$ , $(b, sp, c)$ (or $(a, sc, b)$ , $(b, sc, c)$ respectively) in K, apply rule 1 (rule 3) of Fig. 1 in reverse unless there exists a URI d such that $(c, sp, d)$ and $(a, sp, d)$ ( $(c, sc, d)$ and $(a, sc, d)$ ) are also contained in K. We thank one of the reviewers for pointing out that, without such an extra condition, the extended reduction may not be unique.

Applying a rule in reverse means removing a triple deducible by the rule from G or K. Specifically, if there are two triples t and $t^{'}$ that match the antecedent of a rule then it is possible to remove a triple that can be derived from t and $t^{'}$ by that rule.

Henceforth, we assume that $K = extRed (K)$ , which allows direct relaxations to be applied to queries (see below), corresponding to the ‘smallest’ relaxation steps. This is necessary for associating an unambiguous cost to queries, so that query answers can then be returned to users incrementally in order of increasing cost.

If we did not use the extended reduction of the ontology K, then the relaxation steps applied would not necessarily be the “smallest”. For example, consider the following ontology $K = \{(b, dom, c), (a, sp, b), (a, dom, c)\}$ , where $K \neq extRed (K)$ . If we relax the triple pattern $(x, a, y)$ with respect to K, then as a first step we could apply rule 5 to generate $(x, type, c)$ . However, the same triple pattern can also be generated in two steps of relaxation, by applying rule 1 first and then rule 5 of Fig. 1.

As a further condition, we require that the ontology K is acyclic in order for relaxed queries to have unambiguous costs (a detailed analysis can be found in [13]).

Example 3.

Given the following cyclic ontology $K = \{⟨ a, sp, b ⟩, ⟨ b, sp, a ⟩, ⟨ a, dom, c ⟩, ⟨ b, dom, c ⟩\}$ , we have $cl (K) = K \cup \{⟨ a, sp, a ⟩, ⟨ b, sp, b ⟩\}$ . By applying steps (ii) and (iii) above we could generate two possible ontologies, $K^{'} = \{⟨ a, sp, b ⟩, ⟨ b, sp, a ⟩, ⟨ b, dom, c ⟩\}$ and $K^{″} = \{⟨ a, sp, b ⟩, ⟨ b, sp, a ⟩, ⟨ a, dom, c ⟩\}$ , both of which are extended reductions of K.

Consider now the query $Q = RELAX (x, a, y)$ which can be relaxed to $(x, b, y)$ with $K^{'}$ . Applying a second step of relaxation we obtain $(x, type, c)$ . If instead we used ontology $K^{″}$ , a first step of relaxation would immediately generate $(x, type, c)$ . Therefore, a cyclic ontology might generate the same triple pattern at two different relaxation distances from the original triple pattern, depending on which reduced ontology is used.

Following the terminology of [13], a triple pattern $⟨ x, p, y ⟩$ directly relaxes to a triple pattern $⟨ x^{'}, p^{'}, y^{'} ⟩$ with respect to an ontology $K = extRed (K)$ , denoted $⟨ x, p, y ⟩ ≺_{i} ⟨ x^{'}, p^{'}, y^{'} ⟩$ , if $var (⟨ x, p, y ⟩) = var (⟨ x^{'}, p^{'}, y^{'} ⟩)$ and $⟨ x^{'}, p^{'}, y^{'} ⟩$ is derived from $⟨ x, p, y ⟩$ by applying rule i from Fig. 1.

A triple pattern $⟨ x, p, y ⟩$ relaxes to a triple pattern $⟨ x^{'}, p^{'}, y^{'} ⟩$ , denoted $⟨ x, p, y ⟩ ⩽_{K} ⟨ x^{'}, p^{'}, y^{'} ⟩$ , if starting from $⟨ x, p, y ⟩$ there is a sequence of direct relaxations that derives $⟨ x^{'}, p^{'}, y^{'} ⟩$ . The relaxation cost of deriving $⟨ x^{'}, p^{'}, y^{'} ⟩$ from $⟨ x, p, y ⟩$ , denoted $rcost (⟨ x, p, y ⟩, ⟨ x^{'}, p^{'}, y^{'} ⟩)$ , is the minimum cost of applying such a sequence of direct relaxations.
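Since the relaxation cost is a minimum over sequences of direct relaxations, it can be computed with a shortest-path search over triple patterns. The sketch below is one way to do this, under assumptions that are not taken from the paper: each rule has a fixed cost given in a hypothetical `rule_cost` dictionary (defaulting to 1), only the direct relaxations by rules 2, 4 and 5 mentioned in the text are generated, and variables are strings with a leading `?`. The function names are hypothetical.

```python
import heapq

INF = float("inf")


def pattern_vars(tp):
    """Variables of a triple pattern (assumed to start with '?')."""
    return {t for t in tp if t.startswith("?")}


def direct_relaxations(tp, K, rule_cost):
    """One-step relaxations of tp with respect to the ontology triples K."""
    x, p, y = tp
    candidates = []
    for (a, label, b) in K:
        if label == "sp" and a == p:  # rule 2: replace p by a super-property
            candidates.append(((x, b, y), rule_cost[2]))
        if label == "sc" and p == "type" and a == y:  # rule 4: super-class
            candidates.append(((x, "type", b), rule_cost[4]))
        if label == "dom" and a == p:  # rule 5: type given by the domain
            candidates.append(((x, "type", b), rule_cost[5]))
    # A direct relaxation must preserve the set of variables.
    return [(new, c) for new, c in candidates
            if pattern_vars(new) == pattern_vars(tp)]


def relaxations(tp, K, rule_cost=None):
    """Map every pattern reachable from tp to its minimum relaxation cost."""
    rule_cost = rule_cost or {2: 1, 4: 1, 5: 1}
    best = {tp: 0}
    heap = [(0, tp)]
    while heap:
        cost, cur = heapq.heappop(heap)
        if cost > best[cur]:
            continue  # stale heap entry
        for nxt, step in direct_relaxations(cur, K, rule_cost):
            if cost + step < best.get(nxt, INF):
                best[nxt] = cost + step
                heapq.heappush(heap, (cost + step, nxt))
    return best


# The non-reduced ontology discussed above; "o" is a hypothetical constant.
K = {("b", "dom", "c"), ("a", "sp", "b"), ("a", "dom", "c")}
costs = relaxations(("?x", "a", "o"), K)
```

With this non-reduced ontology the pattern with predicate `type` and object `c` is reachable both in one step and in two steps; the search reports the smaller cost, which is the ambiguity that working with the extended reduction is meant to avoid.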

The semantics of the RELAX operator in SPARQL^AR are as follows: $\begin{array}{l} {[[RELAX (x, p, y)]]}_{G, K} = {[[⟨ x, p, y ⟩]]}_{G} \cup \\ \quad \{⟨ μ, c + rcost (⟨ x, p, y ⟩, ⟨ x^{'}, p^{'}, y^{'} ⟩) ⟩ ∣ ⟨ x, p, y ⟩ ⩽_{K} ⟨ x^{'}, p^{'}, y^{'} ⟩ \land ⟨ μ, c ⟩ \in {[[⟨ x^{'}, p^{'}, y^{'} ⟩]]}_{G}\} \\ {[[RELAX (x, P_{1} | P_{2}, y)]]}_{G, K} = {[[RELAX (x, P_{1}, y)]]}_{G, K} \cup {[[RELAX (x, P_{2}, y)]]}_{G, K} \\ {[[RELAX (x, P_{1} / P_{2}, y)]]}_{G, K} = {[[RELAX (x, P_{1}, z)]]}_{G, K} ⋈ {[[RELAX (z, P_{2}, y)]]}_{G, K} \\ {[[RELAX (x, P^{*}, y)]]}_{G, K} = {[[⟨ x, ϵ, y ⟩]]}_{G} \cup {[[RELAX (x, P, y)]]}_{G, K} \cup \\ \quad ⋃_{n ⩾ 1} \{⟨ μ, c ⟩ ∣ ⟨ μ, c ⟩ \in {[[RELAX (x, P, z_{1})]]}_{G, K} ⋈ {[[RELAX (z_{1}, P, z_{2})]]}_{G, K} ⋈ \dots ⋈ {[[RELAX (z_{n}, P, y)]]}_{G, K}\} \end{array}$ where P, $P_{1}$ , $P_{2}$ are regular expression patterns, x, $x^{'}$ , y, $y^{'}$ are in $ULV$ , p, $p^{'}$ are in U, and z, $z_{1}$ , …, $z_{n}$ are fresh variables.

Example 4.

Consider the following portion $K = (N_{K}, E_{K})$ of the YAGO ontology, where $N_{K}$ is $\begin{array}{l} \{hasFamilyName, hasGivenName, label, actedIn, \\ Actor, English_politicians, politician\}, \end{array}$ and $E_{K}$ is $\begin{array}{l} \{(hasFamilyName, sp, label), \\ (hasGivenName, sp, label), \\ (actedIn, dom, Actor), \\ (English_politicians, sc, politician)\} \end{array}$ Suppose the user is looking for the family names of all the actors who played in the film “Tea with Mussolini” and poses this query: SELECT * WHERE { ?x actedIn <Tea_with_Mussolini> . ?x hasFamilyName ?z }

The above query returns 4 answers. However, some actors have only a single name (for example Cher), or have their full name recorded using the “label” property directly. By applying relaxation to the second triple pattern using rule (2), we can replace the predicate $hasFamilyName$ by “label”. This causes the relaxed query to return also the given names of actors in that film recorded through the property “hasGivenName” (hence returning Cher), as well as actors’ full names recorded through the property “label”: a total of 255 results.

As another example, suppose the user poses the following query: SELECT * WHERE { ?x type <English_politicians> . ?x wasBornIn/isLocatedIn* <England>}

which returns every English politician born in England. By applying relaxation to the first triple pattern using rule (4), it is possible to replace the class $English_politicians$ by $politician$ . This relaxed query will return every politician who was born in England, possibly giving additional answers of relevance to the user.

3.2.3. Query approximation

For query approximation, we apply edit operations which transform a regular expression pattern P into a new expression pattern $P^{'}$ . Specifically, we apply the edit operations deletion, insertion and substitution, defined as follows (other possible edit operations are transposition and inversion, which we leave as future work): $\begin{array}{lll} A / p / B & ⇝ (A / ϵ / B) & deletion \\ A / p / B & ⇝ (A / _ / B) & substitution \\ A / p / B & ⇝ (A / _ / p / B) & left insertion \\ A / p / B & ⇝ (A / p / _ / B) & right insertion \end{array}$ Here, A and B denote any regular expression and the symbol $_$ represents every URI from U, so the edit operations allow us to insert any URI and to substitute a URI by any other. The application of an edit operation $op$ has a non-negative cost $c_{op}$ associated with it.

These rules can be applied to a URI p in order to approximate it to a regular expression P. We write $p ⇝^{*} P$ if a sequence of edit operations can be applied to p to derive P. The edit cost of deriving P from p, denoted $ecost (p, P)$ , is the minimum cost of applying such a sequence of edit operations.
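The four edit operations can be illustrated with a small Python sketch that enumerates the one-step approximations of a concatenation path. The representation is an assumption made for this example: a path is a tuple of URIs, the wildcard is the string `"_"`, the empty tuple stands for ϵ, wildcard positions are not edited further, and the cost names in `costs` and the function name `one_step_edits` are hypothetical.

```python
def one_step_edits(path, costs):
    """Return (new_path, cost) pairs for every single edit applied to path."""
    out = []
    for i, p in enumerate(path):
        if p == "_":
            continue  # assumption: edits apply to URIs, not to the wildcard
        before, after = path[:i], path[i + 1:]
        # deletion: A/p/B ~> A/eps/B (the step p simply disappears)
        out.append((before + after, costs["deletion"]))
        # substitution: A/p/B ~> A/_/B
        out.append((before + ("_",) + after, costs["substitution"]))
        # left insertion: A/p/B ~> A/_/p/B
        out.append((before + ("_", p) + after, costs["insertion"]))
        # right insertion: A/p/B ~> A/p/_/B
        out.append((before + (p, "_") + after, costs["insertion"]))
    return out


unit = {"deletion": 1, "substitution": 1, "insertion": 1}  # hypothetical costs
edits = one_step_edits(("happenedIn",), unit)
```

Applying `one_step_edits` repeatedly, and keeping the cheapest way of reaching each path, would give the edit cost defined above; deleting `happenedIn` from the path `happenedIn/hasLongitude`, for instance, corresponds to the deletion step of Example 2.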

The semantics of the APPROX operator in SPARQL^AR are as follows: $\begin{array}{l} {[[APPROX (x, p, y)]]}_{G} = {[[⟨ x, p, y ⟩]]}_{G} \cup \\ \quad \{⟨ μ, c + ecost (p, P) ⟩ ∣ p ⇝^{*} P \land ⟨ μ, c ⟩ \in {[[⟨ x, P, y ⟩]]}_{G}\} \\ {[[APPROX (x, P_{1} | P_{2}, y)]]}_{G} = {[[APPROX (x, P_{1}, y)]]}_{G} \cup {[[APPROX (x, P_{2}, y)]]}_{G} \\ {[[APPROX (x, P_{1} / P_{2}, y)]]}_{G} = {[[APPROX (x, P_{1}, z)]]}_{G} ⋈ {[[APPROX (z, P_{2}, y)]]}_{G} \\ {[[APPROX (x, P^{*}, y)]]}_{G} = {[[⟨ x, ϵ, y ⟩]]}_{G} \cup {[[APPROX (x, P, y)]]}_{G} \cup \\ \quad ⋃_{n ⩾ 1} \{⟨ μ, c ⟩ ∣ ⟨ μ, c ⟩ \in {[[APPROX (x, P, z_{1})]]}_{G} ⋈ {[[APPROX (z_{1}, P, z_{2})]]}_{G} ⋈ \dots ⋈ {[[APPROX (z_{n}, P, y)]]}_{G}\} \end{array}$ where P, $P_{1}$ , $P_{2}$ are regular expression patterns, x, y are in $ULV$ , p is in U, and z, $z_{1}$ , …, $z_{n}$ are fresh variables.

Example 5.
Suppose that the user is looking for all discoveries made between 1700 and 1800 AD, and queries the YAGO dataset as follows: SELECT ?p ?z ?y WHERE{ ?p discovered ?x . ?x discoveredOnDate ?y . ?x label ?z . FILTER(?y >= 1700/1/1 and ?y <= 1800/1/1)}

Approximating the third triple pattern, it is possible to substitute the predicate “label” by “ $_$ ”. The query will then return more information concerning that discovery, such as its preferred name ( $hasPreferredName$ ) and the Wikipedia abstract ( $hasWikipediaAbstract$ ), improving recall and maintaining good precision. As another example, consider the following query, which is intended to return every German politician: SELECT * WHERE{ ?x isPoliticianOf ?y . ?x wasBornIn/isLocatedIn* <Germany>}

This query returns no answers since the predicate “ $isPoliticianOf$ ” only connects persons to states of the United States in YAGO. If the first triple pattern is approximated by substituting the predicate “ $isPoliticianOf$ ” with “ $_$ ”, then the query will return the expected results, matching the correct predicate to retrieve the desired answers, which is “ $holdsPoliticalPosition$ ”. It will also retrieve all the other persons that are born in Germany (thus showing improved recall, but lower precision).
Observation 1.
By the semantics of RELAX and APPROX, we observe that given a triple pattern $⟨ x, P, y ⟩$ , ${[[⟨ x, P, y ⟩]]}_{G, K} \subseteq {[[APPROX (x, P, y)]]}_{G}$ and ${[[⟨ x, P, y ⟩]]}_{G, K} \subseteq {[[RELAX (x, P, y)]]}_{G, K}$ for every graph G and ontology K.
3.3. Complexity of query answering

We now study the combined, data and query complexity of SPARQL^AR. We extend the complexity results of [16,17,22] for simple SPARQL queries and of [1] for SPARQL with regular expression patterns to cover our new flexible query constructs, and we extend the results of [5] to include UNION in SPARQL^AR.

Table 1
Complexity of various SPARQL^AR fragments

Operators	Data Complexity	Query Complexity	Combined Complexity
AND, FILTER	$O (| E |)$	$O (| Q |)$	$O (| E | \cdot | Q |)$
AND, FILTER, RegEx	$O (| E |)$	$O (| Q |^{2})$	$O (| E | \cdot | Q |^{2})$
RELAX, APPROX	$O (| E |)$	P-Time	P-Time
RELAX, APPROX, AND, FILTER, RegEx	$O (| E |)$	P-Time	P-Time
AND, SELECT	P-Time	NP-Complete	NP-Complete
RELAX, APPROX, AND, FILTER, RegEx, SELECT	P-Time	NP-Complete	NP-Complete
RELAX, APPROX, AND, UNION, FILTER, RegEx	$O (| E |)$	NP	NP
RELAX, APPROX, AND, UNION, FILTER, RegEx, SELECT	P-Time	NP-Complete	NP-Complete

The complexity of query evaluation is based on the following decision problem, which we denote EVALUATION: Given as input a graph $G = (N, D, E)$ , an ontology K, a query Q and a pair $⟨ μ, cost ⟩$ , is it the case that $⟨ μ, cost ⟩ \in {[[Q]]}_{G, K}$ ? Considering data complexity, the decision problem becomes the following: Given as input a graph G, ontology K and a pair $⟨ μ, cost ⟩$ , is it the case that $⟨ μ, cost ⟩ \in {[[Q]]}_{G, K}$ , with Q a fixed query? Finally, the decision problem for query complexity is the following: Given as input an ontology K, a query Q and a pair $⟨ μ, cost ⟩$ , is it the case that $⟨ μ, cost ⟩ \in {[[Q]]}_{G, K}$ , with G a fixed graph?

We have the following results, the proofs of which are given in the Appendix.

Theorem 1.

EVALUATION can be solved in time $O (| E | \cdot | Q |)$ for queries not containing regular expression patterns, and constructed using only the AND and $FILTER$ operators.

Theorem 2.

EVALUATION can be solved in time $O (| E | \cdot | Q |^{2})$ for queries that may contain regular expression patterns and that are constructed using only the AND and $FILTER$ operators.

Theorem 3.

EVALUATION is NP-complete for queries that are constructed using only the AND and $SELECT$ operators.

Lemma 1.

EVALUATION of ${[[APPROX (x, P, y)]]}_{G, K}$ and ${[[RELAX (x, P, y)]]}_{G, K}$ can be accomplished in polynomial time.

Theorem 4.

EVALUATION is NP-complete for queries that may contain regular expression patterns and that are constructed using the operators AND, $FILTER$ , $RELAX$ , $APPROX$ and $SELECT$ .

Theorem 5.

EVALUATION is PTIME in data complexity for queries that may contain regular expression patterns and that are constructed using the operators AND, $FILTER$ , $RELAX$ , $APPROX$ and $SELECT$ .

The complexity study of SPARQL^AR in [5] is summarised in the first six lines of Table 1, where the combined, data and query complexity are shown for specific language fragments and combinations of operators.

The results for query complexity follow from Lemma 1 and Theorems 1, 2 and 3.

We next show three new complexity results which extend those of [5], summarised in the last two lines of Table 1.

Theorem 6.

EVALUATION is in NP for queries containing AND, UNION, FILTER, APPROX, RELAX and regular expression patterns.

Theorem 7.

EVALUATION is NP-complete for queries that may contain regular expression patterns and that are constructed using the operators AND, UNION, FILTER, RELAX, APPROX and $SELECT$ .

Theorem 8.

EVALUATION is PTIME in data complexity for queries that may contain $SELECT$ and regular expression patterns, and that are constructed using the operators AND, UNION, $FILTER$ , $RELAX$ and $APPROX$ .

The results for query complexity follow from Lemma 1 and Theorems 6 and 7.

We conclude our complexity study by confirming that adding the UNION operator to SPARQL^AR does not increase the overall complexity: EVALUATION remains NP-complete for queries that contain regular expression patterns and that are constructed using the operators AND, UNION, $FILTER$ , $RELAX$ , $APPROX$ and $SELECT$ .

3.4. OPTIONAL operator

The OPTIONAL operator can be added to a SPARQL query in order to retrieve information only when it is available. In other words, it allows optional matching of query patterns. If in query Q the OPTIONAL operator is applied to query pattern $Q^{″}$ , that is $Q = Q^{'} OPTIONAL {Q^{″}}$ , then ${[[Q]]}_{G}$ will return all the mappings in ${[[Q^{'}]]}_{G} ⋈ {[[Q^{″}]]}_{G}$ plus all the mappings in ${[[Q^{'}]]}_{G}$ that are not compatible with any mappings in ${[[Q^{″}]]}_{G}$ .
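The OPTIONAL semantics just described amounts to a left outer join over sets of mappings. The following Python fragment is a minimal sketch of that idea, assuming mappings are represented as dictionaries from variable names to values; the function names and the toy data are hypothetical and are not taken from the system described in this paper.

```python
def compatible(m1, m2):
    """Two mappings are compatible if they agree on every shared variable."""
    return all(m2[v] == val for v, val in m1.items() if v in m2)


def join(left, right):
    """Merge every pair of compatible mappings."""
    return [{**m1, **m2} for m1 in left for m2 in right if compatible(m1, m2)]


def optional(left, right):
    """Joined mappings, plus the left mappings that have no compatible partner."""
    result = join(left, right)
    result += [m1 for m1 in left
               if not any(compatible(m1, m2) for m2 in right)]
    return result


# Hypothetical toy data: every person is kept, with an e-mail when one exists.
people = [{"?x": "alice"}, {"?x": "bob"}]
emails = [{"?x": "alice", "?e": "alice@example.org"}]
answers = optional(people, emails)
```

Here `answers` keeps the mapping for `bob` even though no e-mail address is known for him, which is the behaviour the OPTIONAL operator is meant to provide.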

It is possible to add the OPTIONAL operator to SPARQL^AR, allowing APPROX and RELAX to be applied to triple patterns occurring within an OPTIONAL clause, with the same semantics as specified earlier. However, the complexity of SPARQL with the OPTIONAL clause is PSPACE-complete [16]. Therefore, by our earlier results, the complexity of SPARQL^AR would also increase similarly.

4. Query processing

We evaluate SPARQL^AR queries by making use of a query rewriting algorithm, following a similar approach to [11,12,20]. In particular, given a query Q which may contain the APPROX and/or RELAX operators, we incrementally build a set of queries $\{Q_{0}, Q_{1}, \dots\}$ that do not contain these operators such that $⋃_{i} {[[Q_{i}]]}_{G, K} = {[[Q]]}_{G, K}$ .

We present the algorithm in Section 4.1, proving its correctness in Section 4.2 and termination in Section 4.3. Some practical issues relating to how users might benefit from a flexible querying system such as this are discussed in Section 4.4.

Algorithm 1. Flexible Query Evaluation

Algorithm 2. Rewriting algorithm

Algorithm 3. applyApprox

Algorithm 4. approxRegex

4.1. Query rewriting

Our query rewriting algorithm (Algorithm 2 below) starts by considering the query $Q_{0}$ which returns the exact answers to the query Q, i.e. ignoring the APPROX and RELAX operators. To keep track of which triple patterns need to be relaxed or approximated, we label such triple patterns with A for approximation and R for relaxation.

The function $toCQS$ (“to conjunctive query set”) takes as input a query Q, and returns a set of pairs $⟨ Q_{i}, 0 ⟩$ such that $⋃_{i} {[[Q_{i}]]}_{G} = {[[Q]]}_{G}$ and no $Q_{i}$ contains the UNION operator. The function $toCQS$ exploits the following equality: $\begin{array}{l} {[[(Q_{1} UNION Q_{2}) AND Q_{3}]]}_{G} = \\ ({[[Q_{1}]]}_{G} \cup {[[Q_{2}]]}_{G}) ⋈ {[[Q_{3}]]}_{G} = \\ ({[[Q_{1}]]}_{G} ⋈ {[[Q_{3}]]}_{G}) \cup ({[[Q_{2}]]}_{G} ⋈ {[[Q_{3}]]}_{G}) = \\ ({[[Q_{1} AND Q_{3}]]}_{G}) \cup ({[[Q_{2} AND Q_{3}]]}_{G}) \end{array}$
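The distribution of AND over UNION that $toCQS$ relies on can be sketched as follows. This is only an illustration under assumed conventions: queries are nested tuples of the form `("AND", q1, q2)` or `("UNION", q1, q2)`, triple patterns are plain strings, and the name `to_cqs` is hypothetical. Each resulting UNION-free query is paired with cost 0, as in the description above.

```python
from itertools import product


def union_free(query):
    """Return the UNION-free queries whose union is equivalent to `query`."""
    if isinstance(query, tuple) and query[0] == "UNION":
        # A union contributes the alternatives of both branches.
        return union_free(query[1]) + union_free(query[2])
    if isinstance(query, tuple) and query[0] == "AND":
        # A conjunction combines every alternative on the left
        # with every alternative on the right.
        return [("AND", left, right)
                for left, right in product(union_free(query[1]),
                                           union_free(query[2]))]
    return [query]  # a single triple pattern


def to_cqs(query):
    """Pair every UNION-free query with the initial cost 0."""
    return [(q, 0) for q in union_free(query)]


# (t1 UNION t2) AND t3 becomes {t1 AND t3, t2 AND t3}.
pairs = to_cqs(("AND", ("UNION", "t1", "t2"), "t3"))
```

Note that a query with several UNION operators can expand into a number of conjunctive queries that grows multiplicatively, since every combination of alternatives is generated.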

Algorithm 5. applyRelax

Algorithm 6. relaxTriplePattern

We assign to the variable $oldGeneration$ the set of queries returned by $toCQS (Q_{0})$ . For each query $Q^{'}$ in the set $oldGeneration$ , each triple pattern $⟨ x_{i}, P_{i}, y_{i} ⟩$ in $Q^{'}$ labelled with A (R), and each URI p that appears in $P_{i}$ , we apply one step of approximation (relaxation) to p, and we assign the cost of applying that approximation (relaxation) to the resulting query. The applyApprox and applyRelax functions invoked by Algorithm 2 are shown as Algorithms 3 and 5, respectively. From each query constructed in this way, we next generate a new set of queries by applying a second step of approximation or relaxation. We continue to generate queries iteratively in this way. The cost of each query generated is the summed cost of the sequence of approximations or relaxations that have generated it. If the same query is generated more than once, only the one with the lowest cost is retained. Moreover, the set of queries generated is kept sorted by increasing cost. For practical reasons, we limit the number of queries generated by bounding the cost of queries up to a maximum value c.

In Algorithm 2, the $addTo$ operator accepts two arguments: the first is a collection C of query/cost pairs, while the second is a single query/cost pair $⟨ Q, c ⟩$ . The operator adds $⟨ Q, c ⟩$ to C. If C already contains a pair $⟨ Q, c^{'} ⟩$ such that $c^{'} ⩾ c$ , then $⟨ Q, c^{'} ⟩$ is replaced by $⟨ Q, c ⟩$ in C.
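A minimal Python sketch of the behaviour described for $addTo$ is given below. It assumes that the collection C is represented as a dictionary from queries to costs; this representation and the name `add_to` are illustrative choices rather than details of the implementation.

```python
def add_to(collection, query, cost):
    """Add (query, cost) to the collection, keeping the cheaper cost per query.

    `collection` is a dict mapping a query to its lowest known cost.
    """
    if query not in collection or collection[query] >= cost:
        collection[query] = cost
    return collection


queries = {}
add_to(queries, "Q", 2)   # first occurrence is stored
add_to(queries, "Q", 1)   # a cheaper derivation replaces it
add_to(queries, "Q", 3)   # a more expensive derivation is ignored
```

With a dictionary, the check for an existing pair takes constant expected time, which matters because the same query can be reached through many different sequences of edits and relaxations.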

To compute the query answers (Algorithm 1) we apply an evaluation function, $eval$ , to each query generated by the rewriting algorithm (in order of increasing cost of the queries) and to each mapping returned by $eval$ we assign the cost of the query. If we generate a particular mapping more than once, only the one with the lowest cost is retained. In Algorithm 1, rewrite is the query rewriting algorithm (Algorithm 2) and the set of mappings M is maintained in order of increasing cost.
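The evaluation loop described above can be sketched in Python as follows. This is not the paper's Algorithm 1 but a simplified illustration: `flexible_eval`, the `evaluate` callback and the toy answer table are hypothetical, and mappings are assumed to be hashable values.

```python
def flexible_eval(rewritten, evaluate):
    """Evaluate rewritten queries by increasing cost.

    `rewritten` is an iterable of (query, cost) pairs and `evaluate` maps a
    query to an iterable of hashable mappings. Each mapping keeps the cost of
    the cheapest query that produced it.
    """
    best = {}
    for query, cost in sorted(rewritten, key=lambda pair: pair[1]):
        for mapping in evaluate(query):
            # Queries are visited by increasing cost, so the first time a
            # mapping is seen it already has its lowest cost.
            if mapping not in best:
                best[mapping] = cost
    return sorted(best.items(), key=lambda pair: pair[1])


# Hypothetical answers for an exact query Q0 and one rewriting Q1.
toy_answers = {
    "Q0": [(("?x", "a"),)],
    "Q1": [(("?x", "a"),), (("?x", "b"),)],
}
ranked = flexible_eval([("Q1", 1), ("Q0", 0)], lambda q: toy_answers[q])
```

In this toy run the mapping for `a` is reported at cost 0 even though the rewriting at cost 1 produces it again, and the mapping for `b` appears only at cost 1.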

The applyApprox and applyRelax functions invoke, respectively, the functions approxRegex and relaxTriplePattern, shown as Algorithms 4 and 6. In Algorithm 6, z, $z_{1}$ and $z_{2}$ are fresh new variables. The relaxTriplePattern function might generate regular expressions containing a URI ${type}^{-}$ , which are matched to edges in E by reversing the subject and the object and using the property label $type$ . The predicate ${type}^{-}$ is generated when we apply rule 6 of Fig. 1 to a triple pattern. Given a triple pattern $⟨ x, a, y ⟩$ where x is a constant and y is a variable, and an ontology statement $⟨ a, range, d ⟩$ , we can deduce the triple pattern $⟨ y, type, d ⟩$ . If instead the predicate a appears in a triple pattern containing a regular expression such as $⟨ x, a / b, z ⟩$ (which is equivalent to $⟨ x, a, y ⟩ AND ⟨ y, b, z ⟩$ ), then we cannot simply replace it with $⟨ y, type, d ⟩$ as the regular expression would be broken apart and two triple patterns would result. By using $⟨ d, {type}^{-}, y ⟩$ , we correctly construct the triple pattern $⟨ d, {type}^{-} / b, z ⟩$ .

In the following example, we illustrate how the rewriting algorithm works by showing the queries it generates, starting from a SPARQL^AR query.

Example 6.

Consider the following ontology K (satisfying $K = extRed (K)$ ), which is a fragment of the YAGO knowledge base: $\begin{array}{l} K = (\{happenedIn, placedIn, Event\}, \\ \quad \{⟨ happenedIn, sp, placedIn ⟩, \\ \quad ⟨ happenedIn, dom, Event ⟩\}) \end{array}$

Suppose a user wishes to find every event which took place in London on 15th September 1940 and poses the following query Q: $\begin{array}{l} APPROX (x, happenedOnDate, “ 15 / 09 / 1940 ”) \\ AND RELAX (x, happenedIn, “ London ”) . \end{array}$ As pointed out in Example 1, without applying APPROX or RELAX this query does not return any answers when evaluated on the YAGO endpoint (because “happenedIn” connects to URIs representing places and “London” is a literal, not a URI). After the first step of approximation and relaxation, the following queries are generated: $\begin{array}{l} Q_{1} = {(x, ϵ, “ 15 / 09 / 1940 ”)}_{A} AND \\ {(x, happenedIn, “ London ”)}_{R} \\ Q_{2} = {(x, happenedOnDate /_, “ 15 / 09 / 1940 ”)}_{A} AND \\ {(x, happenedIn, “ London ”)}_{R} \\ Q_{3} = {(x,_/ happenedOnDate, “ 15 / 09 / 1940 ”)}_{A} AND \\ {(x, happenedIn, “ London ”)}_{R} \\ Q_{4} = {(x,_, “ 15 / 09 / 1940 ”)}_{A} AND \\ {(x, happenedIn, “ London ”)}_{R} \\ Q_{5} = {(x, happenedOnDate, “ 15 / 09 / 1940 ”)}_{A} AND \\ {(x, placedIn, “ London ”)}_{R} \\ Q_{6} = {(x, happenedOnDate, “ 15 / 09 / 1940 ”)}_{A} AND \\ {(x, type, Event)}_{R} \end{array}$ Each of these also returns empty results, with the exception of query $Q_{6}$ which returns every event occurring on 15/09/1940 (YAGO contains only one such event, namely “Battle of Britain”).
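The four approximated forms $Q_{1}$ to $Q_{4}$ in Example 6 correspond to one deletion, two insertions and one substitution applied to the single predicate $happenedOnDate$ . The Python fragment below is a simplified sketch of such a step, assuming the property path is a plain concatenation of predicate labels held in a list and that every edit has the same cost. It is not the approxRegex function of Algorithm 4, which handles general regular expressions; the names `approx_step` and `show` are hypothetical.

```python
def approx_step(path, cost=1):
    """Apply every single edit to a concatenation of predicate labels.

    Returns a list of (new_path, cost) pairs. "_" stands for the wildcard
    and the empty list for the empty path (epsilon).
    """
    results = []
    for i in range(len(path)):
        results.append((path[:i] + path[i + 1:], cost))          # deletion
        results.append((path[:i] + ["_"] + path[i + 1:], cost))  # substitution
    for i in range(len(path) + 1):
        results.append((path[:i] + ["_"] + path[i:], cost))      # insertion
    return results


def show(path):
    """Render a path in the notation used in Example 6."""
    return "/".join(path) if path else "ε"


rewritings = [show(p) for p, _ in approx_step(["happenedOnDate"])]
```

For the one-predicate path, `rewritings` contains the empty path, the wildcard, and the two paths with a wildcard inserted before and after $happenedOnDate$ , matching the predicates of $Q_{1}$ to $Q_{4}$ .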

4.2. Correctness of the rewriting algorithm

We now discuss the soundness, completeness and termination of the rewriting algorithm. As we stated earlier, this takes as input a cost that limits the number of queries generated. Therefore the classic definitions of soundness and completeness need to be modified. To handle this, we use an operator $CostProj (M, c)$ to select mappings with a cost less than or equal to a given value c from a set M of pairs of the form $⟨ μ, cost ⟩$ . We denote by $rew {(Q)}_{c}$ the set of queries generated by the rewriting algorithm from an initial query Q which have cost less than or equal to c.

Definition 8 (Containment).

Given a graph G, an ontology K, and queries Q and $Q^{'}$ , ${[[Q]]}_{G, K} \subseteq {[[Q^{'}]]}_{G, K}$ if for each pair $⟨ μ, c ⟩ \in {[[Q]]}_{G, K}$ there exists a pair $⟨ μ, c ⟩ \in {[[Q^{'}]]}_{G, K}$ .

Definition 9 (Soundness).

The rewriting of Q, $rew {(Q)}_{c}$ , is sound if the following holds: $⋃_{Q^{'} \in rew {(Q)}_{c}} {[[Q^{'}]]}_{G, K} \subseteq CostProj ({[[Q]]}_{G, K}, c)$ for every graph G and ontology K.

Definition 10 (Completeness).

The rewriting of Q, $rew {(Q)}_{c}$ , is complete if the following holds: $CostProj ({[[Q]]}_{G, K}, c) \subseteq ⋃_{Q^{'} \in rew {(Q)}_{c}} {[[Q^{'}]]}_{G, K}$ for every graph G and ontology K.

To show the soundness and completeness of the query rewriting algorithm, we will require the following lemmas and corollary.

Lemma 2.
Given four sets of evaluation results $M_{1}$ , $M_{2}$ , $M_{1}^{'}$ and $M_{2}^{'}$ such that $M_{1} \subseteq M_{1}^{'}$ and $M_{2} \subseteq M_{2}^{'}$ , it holds that: $\begin{array}{l} (1) & M_{1} \cup M_{2} \subseteq M_{1}^{'} \cup M_{2}^{'} \\ (2) & M_{1} ⋈ M_{2} \subseteq M_{1}^{'} ⋈ M_{2}^{'} \end{array}$

The following result follows from Lemma 2:

Corollary 1.
Given four sets of evaluation results $M_{1}$ , $M_{2}$ , $M_{1}^{'}$ and $M_{2}^{'}$ such that $M_{1} = M_{1}^{'}$ and $M_{2} = M_{2}^{'}$ , it holds that: $\begin{array}{l} (3) & M_{1} \cup M_{2} = M_{1}^{'} \cup M_{2}^{'} \\ (4) & M_{1} ⋈ M_{2} = M_{1}^{'} ⋈ M_{2}^{'} \end{array}$
Lemma 3.
Given queries $Q_{1}$ and $Q_{2}$ , graph G and ontology K, the following equations hold: $\begin{array}{l} CostProj ({[[Q_{1}]]}_{G, K} ⋈ {[[Q_{2}]]}_{G, K}, c) = \\ \quad CostProj (CostProj ({[[Q_{1}]]}_{G, K}, c) ⋈ CostProj ({[[Q_{2}]]}_{G, K}, c), c) \\ CostProj ({[[Q_{1}]]}_{G, K} \cup {[[Q_{2}]]}_{G, K}, c) = \\ \quad CostProj ({[[Q_{1}]]}_{G, K}, c) \cup CostProj ({[[Q_{2}]]}_{G, K}, c) \end{array}$
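The two equations of Lemma 3 can be checked on a small example. The Python sketch below assumes that evaluation results are sets of (mapping, cost) pairs, that costs are non-negative, and that the join of two compatible mappings carries the sum of their costs; the last point is an assumption made here for illustration, since the join semantics is defined outside this section. All names and data are hypothetical.

```python
def fs(mapping):
    """Freeze a dict mapping so that it can be stored in a set."""
    return frozenset(mapping.items())


def cost_proj(pairs, c):
    """Keep the (mapping, cost) pairs whose cost is at most c."""
    return {(m, cost) for (m, cost) in pairs if cost <= c}


def join(left, right):
    """Join compatible mappings; the joined cost is assumed to be the sum."""
    out = set()
    for m1, c1 in left:
        for m2, c2 in right:
            d1, d2 = dict(m1), dict(m2)
            if all(d2[v] == val for v, val in d1.items() if v in d2):
                out.add((fs({**d1, **d2}), c1 + c2))
    return out


M1 = {(fs({"?x": "a"}), 0), (fs({"?x": "b"}), 2)}
M2 = {(fs({"?x": "a", "?y": "c"}), 1), (fs({"?x": "b", "?y": "d"}), 1)}


def lemma3_join_holds(c):
    lhs = cost_proj(join(M1, M2), c)
    rhs = cost_proj(join(cost_proj(M1, c), cost_proj(M2, c)), c)
    return lhs == rhs


def lemma3_union_holds(c):
    return cost_proj(M1 | M2, c) == cost_proj(M1, c) | cost_proj(M2, c)
```

The intuition is that, with non-negative costs, a joined pair of total cost at most c can only arise from two operands that each have cost at most c, so projecting the operands first discards nothing that the outer projection would keep.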
Theorem 9.
The Rewriting Algorithm is sound and complete.

4.3. Termination of the rewriting algorithm

We are able to show that the rewriting algorithm terminates after a finite number of steps:

Theorem 10.
Given a query Q, ontology K and maximum query cost c, the Rewriting Algorithm terminates after at most $⌈ c / c^{'} ⌉$ iterations, where $c^{'}$ is the lowest cost of an edit or relaxation operation, assuming that $c^{'} > 0$ .
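As a worked instance of this bound, consider the settings used in the experiments of Section 5, where every edit operation and every relaxation rule has cost one and the maximum cost is two. The second line uses a purely hypothetical minimum cost of 0.75 to show the effect of the ceiling.

```latex
c = 2,\quad c' = 1 \quad\Longrightarrow\quad \lceil c / c' \rceil = \lceil 2 / 1 \rceil = 2 \\
c = 2,\quad c' = 0.75 \quad\Longrightarrow\quad \lceil c / c' \rceil = \lceil 2.67 \rceil = 3
```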

4.4. Practical considerations

In the previous sections we have concentrated on theoretical aspects of SPARQL^AR. In practice SPARQL^AR would be used as part of a framework allowing users to search RDF data in a flexible way. A front-end could allow users to pose queries using keywords or natural language. Such user queries could then be translated into SPARQL (cf. [19]).

Suppose a user poses the following question to the system: “What event happened in London on 15/09/1940?”. This question could be translated to the query ( $(x, happenedOnDate, “ 15 / 09 / 1940 ”) AND (x, happenedIn, “ London ”)$ ) from Example 6. The system could automatically apply APPROX to the first triple pattern and RELAX to the second triple pattern (as shown in the example) by determining that $happenedIn$ appears in the given ontology, whereas $happenedOnDate$ does not. As we saw earlier, applying approximation and relaxation to this query produces the answer that the user is looking for.

Applying the APPROX operator without an upper bound on cost will eventually return every connected node of the RDF graph. This negatively affects the precision of answers but ensures 100% recall. When applying the substitution operation of APPROX to a triple pattern, we do not specify the predicate that needs to be replaced, but instead insert the wildcard $(_)$ that allows the insertion/replacement of any predicate. Of course, this is a drawback in terms of the precision of answers retrieved. However, for each predicate in a query, it is possible to specify a set of predicates that are semantically similar to it in order to increase the precision of the retrieved answers. A similarity matching algorithm, based either on syntactic or semantic similarity, could be used to compare the predicates of the RDF dataset. The semantic similarity could exploit dictionaries such as WordNet.³

³ https://wordnet.princeton.edu/.

Moreover, we could assign different costs for substitution by different (sets of) predicates, depending on how similar they are to the original predicate. This would allow for a finer ranking of the answers.

Similarly, it is possible to add a finer ranking of the answers arising from the RELAX operator. When we relax a triple pattern to derive a direct relaxation, we make use of a triple from the ontology K. Instead of assigning a cost to each rule of Fig. 1, we could assign a cost to each triple in K, reflecting domain experts’ views of the semantic closeness of concepts. Therefore, the direct relaxation would have a cost depending on which triple in K is used.

Fig. 3. SPARQL^AR system architecture.

Finally, in order to help users interpret answers to their queries, the system could provide information about which rewritten query returned which answers. Showing only queries to users might not be particularly helpful, especially if the original query was simply in the form of keywords. Instead, showing the sequence of steps by which the original terms used by the user were approximated or relaxed could help them decide whether the answers returned were meaningful or not.

5. Experimental results

We have implemented the query evaluation algorithms described above in Java, using Jena for SPARQL query evaluation. Figure 3 illustrates the system architecture, which consists of three layers: the GUI layer, the System layer, and the Data layer. The GUI layer supports user interaction with the system, allowing queries to be submitted, costs of the edit and relaxation operators to be set, data sets and ontologies to be selected, and query answers to be incrementally displayed to the user. The System layer comprises three components: the Utilities, containing classes providing the core logic of the system; the Domain Classes, providing classes relating to the construction of SPARQL^AR queries; and the Query Evaluator in which query rewriting, optimisation and evaluation are undertaken. The Data layer connects the system to the selected RDF dataset and ontology using the Jena API; RDF datasets are stored as a TDB database⁴ and RDF schemas can be stored in multiple RDF formats (e.g. Turtle, N-Triples, RDF/XML).

⁴ https://jena.apache.org/documentation/tdb/.

User queries are submitted to the GUI, which invokes a method of the SPARQL^ARParser to parse the query string and construct an object of the class SPARQL^ARQuery. The GUI also invokes the Data/Ontology Loader which creates an object of the class Data/Ontology Wrapper, and the Approx/Relax Constructor which creates objects of the classes Approx and Relax. Once these objects have been initialised, they are passed to the Query Evaluator by invoking the Rewriting Algorithm. This generates the set of SPARQL queries to be executed over the RDF dataset. The set of queries is passed to the Evaluator, which interacts with the Optimiser and the Cache to improve query performance – we discuss the Optimiser and the Cache in Section 5.2. The Evaluator uses the Jena Wrapper to invoke Jena library methods for executing SPARQL queries over the RDF dataset. The Jena Wrapper also gathers the query answers and passes them to the Answer Wrapper. Finally, the answers are displayed by the Answers Window, in ranked order.

We have conducted empirical trials over the YAGO dataset and the Lehigh University Benchmark (LUBM).⁵

⁵ http://swat.cse.lehigh.edu/projects/lubm/.

Our empirical results using the LUBM are described in [5], where we ran a small set of queries comprising 1 to 4 triple patterns on increasing sizes of datasets, with and without the APPROX/RELAX operators. In all cases, the approximated/relaxed versions of the queries returned more answers than the exact query. Response times were good for most of the queries.

For the rest of this section, we focus on our empirical trials over the YAGO dataset, firstly without any optimisations, and then in Section 5.2 with an optimised query evaluator. YAGO contains over 120 million triples (4.83 GB in Turtle format) which we downloaded and stored in a TDB database. The size of the TDB database is 9.70 GB, and the nodes of the YAGO graph are stored in a 1.1 GB file.

We ran our experiments on a Windows PC with a 2.4 GHz i5 dual-core processor and 8 GB of RAM. We executed 10 queries over the database, comprising increasing numbers of triple patterns (1 up to 10), listed below. The aim of this performance study was to further gauge the practical feasibility of our techniques and to discover major performance bottlenecks requiring further investigation. A more comprehensive and detailed performance study is planned for future work.

Q1 = SELECT ?a WHERE { RELAX(?a rdf:type <location>)}

Q2 = SELECT ?n WHERE { ?a rdfs:label ?n . RELAX(?a <happenedIn> <Berlin>)}

Q3 = SELECT ?n ?d WHERE { ?a rdfs:label ?n . RELAX(?a <happenedIn> <Berlin>) . ?a <happenedOnDate> ?d}

Q4 = SELECT ?n ?m WHERE { ?a rdfs:label ?n . ?a <livesIn> ?b . ?a <actedIn> ?m . RELAX(?m <isLocatedIn> ?b)}

Q5 = SELECT ?n1 ?n2 WHERE { ?a rdfs:label ?n1 . ?b rdfs:label ?n2 . RELAX(?a <isMarriedTo> ?b) . APPROX(?a <livesIn>/<isLocatedIn>* ?p) . APPROX(?b <livesIn>/<isLocatedIn>* ?p)}

Q6 = SELECT ?n WHERE { APPROX(?a <actedIn>/<isLocatedIn> <Australia>) . ?a rdfs:label ?n . RELAX(?a rdf:type <actor>) . ?city <isLocatedIn> <China> . ?a <wasBornIn> ?city . APPROX(?a <directed>/<isLocatedIn> <United_States>)}

Q7 = SELECT ?n1 ?n2 WHERE { APPROX(?a rdf:type <event>) . RELAX(?a <happenedIn> ?b) . ?p <wasBornIn> ?b . ?p <wasBornOnDate> ?d . RELAX(?a <happenedOnDate> ?d) . ?a rdfs:label ?n1 . ?p rdfs:label ?n2}

Q8 = SELECT ?c ?n ?p ?l ?d WHERE { ?a <hasFamilyName> ?n . ?a rdfs:label ?c . ?a <hasWonPrize> ?p . ?a <wasBornIn> ?l . RELAX(?a <wasBornOnDate> ?d) . APPROX(?a rdf:type <scientist>) . ?a <isMarriedTo> ?b1 . ?a <isMarriedTo> ?b2 . FILTER(?b1 != ?b2)}

Q9 = SELECT ?c ?n ?p ?l ?d WHERE { ?a <hasFamilyName> ?n . ?a rdfs:label ?c . ?a <hasWonPrize> ?p . ?a <wasBornIn> ?l . ?a <wasBornOnDate> ?d . RELAX(?a rdf:type <scientist>) . ?a <isMarriedTo> ?b . ?b <wasBornOnDate> ?d . RELAX(?l <isLocatedIn>* <Germany>)}

Q10 = SELECT ?n ?n1 ?n2 WHERE { ?a rdfs:label ?n . RELAX(?a rdf:type <actor>) . APPROX(?a <wasBornIn> ?city) . ?a <actedIn> ?m1 . ?m1 <isLocatedIn> <Australia> . ?a <directed> ?m2 . ?m2 <isLocatedIn> <Australia> . APPROX(?city <isLocatedIn> <United_States>) . ?m1 rdfs:label ?n1 . ?m2 rdfs:label ?n2}

Table 2

Numbers of answers (Exact and A/R) and numbers of rewritten queries (A/R)

	$Q_{1}$	$Q_{2}$	$Q_{3}$	$Q_{4}$	$Q_{5}$
Exact	6491	116	106	8546	585150
A/R	6494	60614	6867	8586	N/A
# of queries	2	5	5	2	95

The reader will notice that we used the APPROX operator only on triple patterns containing a regular expression in which either the subject or object is a constant. This is due to the fact that if we apply APPROX to simple triple patterns of the form $(? x, p, ? y)$ , the rewriting algorithm will generate the following two triple patterns: $(? x,_, ? y)$ which returns every triple in the database, and $(? x, ϵ, ? y)$ which returns every node in the database.

For each query $Q_{1}$ to $Q_{10}$ , we ran both the exact form of the query (without any APPROX or RELAX operators) and the version of the query as specified above. For the latter queries, we set the cost of applying each edit operation of approximation and each RDFS entailment rule of Fig. 1 to one, and requested answers of maximum cost two. We ran each query 6 times, ignored the first timing as a Jena cache warm-up, and took the mean of the other 5 timings. We restarted our system each time we ran a query; this avoids the possibility that the warm-up caching of a previous query enhances the execution performance of other queries.
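The timing protocol just described (six runs, the first discarded as warm-up, mean of the remaining five) can be sketched in Python as follows. The experiments themselves were run in Java against Jena; the helper names below and the `run_query` callback are hypothetical and serve only to make the procedure concrete.

```python
import time
from statistics import mean


def mean_after_warmup(timings, warmup=1):
    """Discard the first `warmup` timings and average the rest."""
    return mean(timings[warmup:])


def timed_mean(run_query, runs=6, warmup=1):
    """Time `run_query` `runs` times and report the mean after warm-up."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return mean_after_warmup(timings, warmup)


# Hypothetical timings in seconds: the slow first run is ignored.
example = mean_after_warmup([9.0, 1.0, 2.0, 3.0, 4.0, 5.0])
```

Dropping the first run keeps one-off start-up costs, such as cache population, out of the reported mean.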

The numbers of answers returned by each query, for both the exact form and the APPROX/RELAX (A/R) form, are shown in Tables 2 and 3, along with the number of rewritten queries in each case (# of queries). Tables 4 and 5 list the execution times for the exact queries, the A/R queries and the A/R queries with a simple caching optimisation implemented (optimised A/R). We discuss the results without this optimisation in the next subsection, and the results with this optimisation applied in Section 5.2.

5.1. Initial results

Query $Q_{1}$ returns every location stored in YAGO. The rewriting algorithm generates only the following additional query

SELECT ?a WHERE {?a rdf:type <Resource>}

which returns only 3 answers. Increasing the maximum cost does not result in the rewriting algorithm generating any more queries, and no other answers would be returned at higher cost.

Table 3
Numbers of answers (Exact and A/R) and numbers of rewritten queries (A/R)

	$Q_{6}$	$Q_{7}$	$Q_{8}$	$Q_{9}$	$Q_{10}$
Exact	28	5	1540	0	0
A/R	14431	N/A	22540	0	0
# of queries	154	36	17	29	47

Table 4

Query execution time (in seconds)

	$Q_{1}$	$Q_{2}$	$Q_{3}$	$Q_{4}$	$Q_{5}$
Exact	0.321	0.008	0.009	1.512	7670
A/R	0.340	66.32	0.81	1.571	N/A
optimised A/R	0.440	60.4	2.31	1.01	N/A

Table 5

Query execution time (in seconds)

	$Q_{6}$	$Q_{7}$	$Q_{8}$	$Q_{9}$	$Q_{10}$
Exact	0.123	5	0.173	1.23	323.100
A/R	N/A	N/A	272.875	N/A	N/A
optimised A/R	60.23	N/A	12.475	0.08	100.4

Query $Q_{2}$ returns every event that happened in Berlin. When the second triple pattern is relaxed the rewriting algorithm generates a query that returns every event in YAGO. This explains the long execution time of the relaxed version of $Q_{2}$ compared to its exact form.

Query $Q_{3}$ returns every event that happened in Berlin along with its date, while query $Q_{4}$ returns every actor who acted in movies located in the same city where the actor lived. Queries $Q_{3}$ and $Q_{4}$ return additional answers in their relaxed form compared to their exact form. Both queries exhibit reasonable performance.

Query $Q_{5}$ returns all married couples who live in the same city. The long execution time of the exact form of this query is due to the presence of Kleene-closure in two of the query conjuncts. Moreover, the rewriting algorithm generates 95 queries which are time-consuming to evaluate due to the presence not only of the Kleene-closure but also of the wild-card symbol “ $_$ ”. We were not able to complete the execution of query $Q_{5}$ with approximation and relaxation. It might be possible to overcome this problem by replacing “ $_$ ” with a selected disjunction of predicates. Such predicates would be chosen using knowledge of the graph structure. For example, when we approximate the triple pattern $⟨ ? a, livesIn / isLocatedIn *, ? p ⟩$ in query $Q_{5}$ , we generate $⟨ ? a, livesIn /_/ isLocatedIn *, ? p ⟩$ using the insertion edit operator and, during query evaluation, the symbol $_$ is replaced with a disjunction of all the predicates in YAGO. We could instead replace $_$ with a disjunction of the predicates that are known to connect $livesIn$ and $isLocatedIn$ , i.e. the predicates p such that there is a path $livesIn / p / isLocatedIn$ in YAGO. This type of optimisation is currently being investigated.
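The idea of restricting the wildcard to predicates that actually connect $livesIn$ and $isLocatedIn$ can be illustrated on a toy graph. The Python sketch below assumes the graph is a set of (subject, predicate, object) triples held in memory; the function name and the example triples are hypothetical, and the optimisation itself is described above only as work under investigation.

```python
def connecting_predicates(edges, before, after):
    """Return the predicates p for which a path before/p/after exists."""
    # Nodes reachable by following an edge labelled `before`.
    reached = {o for s, p, o in edges if p == before}
    # Nodes from which an edge labelled `after` leaves.
    leaving = {s for s, p, o in edges if p == after}
    return {p for s, p, o in edges if s in reached and o in leaving}


# Hypothetical toy graph, not taken from YAGO.
toy_edges = {
    ("alice", "livesIn", "soho"),
    ("soho", "partOf", "london"),
    ("london", "isLocatedIn", "uk"),
    ("alice", "knows", "bob"),
}
candidates = connecting_predicates(toy_edges, "livesIn", "isLocatedIn")
```

On this graph only `partOf` qualifies, so the wildcard in $livesIn /_/ isLocatedIn$ would be replaced by a single predicate instead of a disjunction over every predicate in the dataset.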

Query $Q_{6}$ returns every Chinese actor who acted in Australian films and directed American films. The A/R version of the query takes many hours to evaluate (the rewriting algorithm generates 154 queries) due to the $_$ symbol we use for the insertion and substitution approximation operations. Applying the optimisation technique described in Section 5.2 below decreases the execution time dramatically and returns results in a more reasonable time. Similarly to query $Q_{5}$ , it would also be possible to replace the $_$ symbol with a selected disjunction of predicates, which is likely to return answers more quickly still.

Query $Q_{7}$ returns every event and person such that the person was born in the same place and on the same day that the event occurred. When the rewriting algorithm is applied to $Q_{7}$ , it generates many queries that return no answers or that return answers already computed. The first version of caching that we have implemented is not sophisticated enough to help with the A/R version of $Q_{7}$ , for the reason explained in Section 5.2, and further work is required here.

Query $Q_{8}$ returns every scientist who has married twice and has won a prize. The rewriting algorithm generates 17 queries from $Q_{8}$ . The long execution time of the A/R form of $Q_{8}$ is due to use of the “ $_$ ” symbol. The running time of the query is improved significantly by the optimisation described in Section 5.2.

Query $Q_{9}$ returns every scientist who was born in Germany, has won a prize, and was married to someone with the same date of birth. This query returns no answers. The execution of the A/R form of the query takes many hours, due to the Kleene-closure. However, the running time of the query is again dramatically improved by the optimisation described in Section 5.2.

Finally, query $Q_{10}$ returns every actor who directed and acted in Australian movies and was born in the United States. The exact form of this query returns no answers. The rewriting algorithm generates 47 queries and the A/R form of the query takes a very long time to evaluate. Once again, the optimisation described in Section 5.2 gives a significant reduction in the running time.

5.2. Optimised evaluation

Algorithm 7. Flexible Query Evaluation – Optimised

In Tables 4 and 5 we also show the query execution times for the A/R forms of all the queries using an optimised query evaluator. The optimisation is based on a caching technique, in which we pre-compute some of the answers in advance. The Optimiser module stores the cached answers in memory using the Java class HashSet,⁶ which enables answers to be retrieved efficiently. Algorithm 7 shows the optimised evaluation. We leave it as future work to investigate other join strategies, such as sort-merge join or sideways information passing.

⁶ Java documentation: https://docs.oracle.com/javase/6/docs/api/java/util/HashSet.html.

In Algorithm 7, we start by splitting a query into two parts: the triple patterns which do not have APPROX or RELAX applied to them (which we call the exact part) and those which have (which we call the A/R part). We first evaluate the exact part of the query and store the results. We then apply the rewriting algorithm to the A/R part. Each triple pattern of the latter is evaluated individually; all possible pairs of triple patterns are also evaluated. The answers to each are stored in $cache$ , a data structure that contains these partial evaluation results. To avoid memory overflow, we place an upper limit on the size of $cache$ . We then compute the answers of the A/R part with the newEval function which exploits the answers already computed and stored in $cache$ . In other words, if part of the query has been already computed, it retrieves the answers and joins them with the part of the query that has not been executed. Finally, we join the answers of the exact part of the query with those of the A/R part.
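The first step of this procedure, splitting a query into its exact part and its A/R part, together with a size-limited cache, can be sketched as follows. This is an illustration only: the tagged-tuple representation of triple patterns, the `BoundedCache` class and its entry-count limit are assumptions made here, since the text does not specify how the upper limit on $cache$ is enforced. The example patterns are those of query $Q_{3}$ .

```python
def split_query(patterns):
    """Separate the exact triple patterns from the APPROX/RELAX ones."""
    exact = [p for p in patterns if p[0] == "EXACT"]
    flexible = [p for p in patterns if p[0] in ("APPROX", "RELAX")]
    return exact, flexible


class BoundedCache:
    """Store partial results, refusing new keys once the limit is reached."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.store = {}

    def put(self, key, answers):
        if key in self.store or len(self.store) < self.max_entries:
            self.store[key] = answers
            return True
        return False  # limit reached: the caller must evaluate without caching

    def get(self, key):
        return self.store.get(key)


q3 = [
    ("EXACT", "?a rdfs:label ?n"),
    ("RELAX", "?a <happenedIn> <Berlin>"),
    ("EXACT", "?a <happenedOnDate> ?d"),
]
exact_part, ar_part = split_query(q3)

cache = BoundedCache(max_entries=1)
stored_first = cache.put("p1", [{"?a": "e1"}])
stored_second = cache.put("p2", [{"?a": "e2"}])
```

For $Q_{3}$ the exact part contains the label and date patterns, which between them match every event with a label and a date; this is consistent with the observation below that caching can be counter-productive when the exact part returns many answers.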

For query $Q_{1}$ the optimised evaluation slightly worsens the computation time. This is due to the extra time spent by the evaluation algorithm in undertaking the caching. In fact, in general for single-conjunct queries the optimisation does not speed up the computation.⁷

⁷ In the final version of the system, the optimisation module would be disabled for single-conjunct queries.

For queries $Q_{2}$ and $Q_{4}$ the optimised evaluation decreases the execution time somewhat. For query $Q_{2}$ , since the number of answers is rather large, it is hard to compute all these answers in a shorter amount of time even with the optimised evaluation.

The optimised evaluation of query $Q_{3}$ performs worse than the simple evaluation. The main reason is that the exact sub-query returns a large number of answers, namely, every event along with its label and date. These answers are stored in the cache and then retrieved for the final join with the relaxed triple pattern.

Queries $Q_{6}$ and $Q_{8}$ can now be computed in a reasonable amount of time. $Q_{9}$ can also be computed with the optimised algorithm. The time taken is less than 0.1 seconds due to the fact that its exact part returns no answers, making the computation of the rest of the query redundant.

Query $Q_{10}$ can now be computed and, in fact, the optimised algorithm managed to run the A/R form of the query faster than its exact form. It is possible that the Jena SPARQL evaluator does not perform optimally for this particular query, but splitting the query into multiple parts and joining the results separately, as we do in our optimisation technique, improves the evaluation time considerably.

We are still unable to execute the A/R forms of queries $Q_{5}$ and $Q_{7}$. For query $Q_{5}$, the long evaluation time is due to the presence of the Kleene-closure and the wild-card symbol “$\_$”. Query $Q_{7}$, on the other hand, cannot be computed because of the join structure of the exact part of the query, which is the following:

?p <wasBornIn> ?b . ?p <wasBornOnDate> ?d . ?p rdfs:label ?n2 . ?a rdfs:label ?n1 .

We can see that the variable $?p$ appears in the first three triple patterns, while the variables of the last triple pattern do not appear anywhere else in the query. Therefore, Jena has to compute a Cartesian product between the 2,954,875 answers retrieved for the last triple pattern and the 500,000 answers retrieved for the first three triple patterns. More sophisticated optimisation techniques need to be investigated to improve the performance of these queries.

The overall results show that the evaluation of SPARQL^AR queries through a query rewriting approach is promising. The difference between the execution time of the exact form and the A/R form of the queries is acceptable for queries with fewer than 5 conjuncts. For most of the other queries, the simple optimisation technique described above also brings down the running times of the A/R forms to more reasonable levels. Clearly, for more complex queries, more sophisticated optimisation techniques need to be investigated and developed.

6. Conclusions

In this paper we have presented query processing algorithms for an extended fragment of the SPARQL 1.1 language, incorporating approximation and relaxation operators. Our query processing approach is based on query rewriting whereby, given a query $Q$ containing the APPROX and/or RELAX operators, we incrementally generate a set of queries $\{Q_{0}, Q_{1}, \dots\}$ that do not contain these operators such that $\bigcup_{i} [[Q_{i}]]_{G, K} = [[Q]]_{G, K}$, and we return results according to their “distance” from the exact form of $Q$.

We have formally shown the soundness, completeness and termination of our query rewriting algorithm. Our empirical studies show promising query processing performance, but also that further optimisations are required.

An advantage of adopting a query rewriting approach is that existing techniques for SPARQL query optimisation and evaluation can be reused to evaluate the queries generated by our rewriting algorithm. Our ongoing work involves investigating optimisations to the rewriting algorithm itself, since it can generate a large number of queries. Specifically, we are studying the query containment problem for SPARQL^AR and how query costs impact on this. Following this investigation, we plan to implement optimisations for the rewriting algorithm. For example, for a query $Q = Q_{1} \text{ AND } Q_{2}$ it is possible to decrease the number of queries generated by the rewriting algorithm if we know that $[[Q_{1}]]_{G, K} \subseteq [[Q_{2}]]_{G, K}$, in which case $[[Q]]_{G, K} = [[Q_{1}]]_{G, K}$.
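The containment identity in the example above can be checked on a small case. The following sketch assumes that answers are sets of mappings over the same variables (as is the case for conjunctions of triple patterns without OPTIONAL) and that AND is evaluated as a join of compatible mappings. The helper names `mapping` and `join` are hypothetical, and the data is invented for illustration.

```python
def mapping(**bindings):
    """A hashable variable-to-value mapping."""
    return frozenset(bindings.items())

def join(left, right):
    """Join two sets of mappings: merge every pair of compatible mappings."""
    out = set()
    for a in left:
        for b in right:
            da, db = dict(a), dict(b)
            if all(da[v] == db[v] for v in da.keys() & db.keys()):
                out.add(frozenset({**da, **db}.items()))
    return out

# Invented answer sets with answers_q1 contained in answers_q2.
answers_q1 = {mapping(x="e1")}
answers_q2 = {mapping(x="e1"), mapping(x="e2")}

answers_q = join(answers_q1, answers_q2)   # answers of Q1 AND Q2
print(answers_q == answers_q1)
```

When such a containment is known in advance, the second conjunct contributes nothing to the answers, which is why the rewriting algorithm would not need to generate further variants of it.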

Another area of ongoing work involves the construction of synopses (or data guides) of RDF-datasets in order to speed up query evaluation. In our context, such a synopsis is a graph $S$ constructed from an RDF-dataset $G$ that will have the following property: if we consider $G$ and $S$ as automata, then $L(G) \subseteq L(S)$. Hence, given a query $Q$, if $eval(Q, S) = \emptyset$ then $eval(Q, G) = \emptyset$. Since the synopsis $S$ will be considerably smaller than $G$, we can evaluate $Q$ over $S$ for each query $Q$ generated by the rewriting algorithm; if $eval(Q, S)$ returns no answer, then we do not need to execute $Q$ over $G$. Moreover, the synopsis $S$ can be exploited in order to remove the $\_$ symbol that is generated when we apply APPROX to a triple pattern of a query. Given a triple pattern $\langle x, P, y \rangle$ from query $Q$, we compute $A = M_{P} \cap S$, where $M_{P}$ is the automaton that recognises $L(P)$. Subsequently, we replace $\langle x, P, y \rangle$ with $\langle x, P_{A}, y \rangle$ in $Q$, where $P_{A}$ is a property path that does not contain the symbol $\_$ such that $L(P_{A}) = L(A)$.
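A minimal sketch of the pruning idea follows. It makes two simplifying assumptions that are not from the paper: the synopsis is built by collapsing nodes into classes through a user-supplied function (which guarantees that every label sequence spelled by a path in $G$ is also spelled by a path in $S$), and a query is just a fixed sequence of edge labels rather than a full property path. All names and data are hypothetical.

```python
def build_synopsis(graph, node_class):
    """Collapse each node to its class; labels of every path are preserved."""
    return {(node_class(s), p, node_class(o)) for s, p, o in graph}

def has_path(graph, labels):
    """True if some path in the graph spells the given sequence of labels."""
    frontier = {s for s, _, _ in graph} | {o for _, _, o in graph}
    for label in labels:
        frontier = {o for s, p, o in graph if p == label and s in frontier}
        if not frontier:
            return False
    return True

def eval_with_synopsis(labels, graph, synopsis):
    """Check the small synopsis first; only touch the full graph if needed."""
    if not has_path(synopsis, labels):
        return False  # no match over S implies no match over G
    return has_path(graph, labels)

# Invented toy dataset and node classification.
graph = {("alice", "wasBornIn", "London"),
         ("bob", "wasBornIn", "Paris"),
         ("London", "isLocatedIn", "UK")}
classes = {"alice": "Person", "bob": "Person",
           "London": "City", "Paris": "City", "UK": "Country"}
synopsis = build_synopsis(graph, classes.get)

print(eval_with_synopsis(["isLocatedIn", "wasBornIn"], graph, synopsis))
print(eval_with_synopsis(["wasBornIn", "isLocatedIn"], graph, synopsis))
```

The first query is rejected using only the two-edge synopsis. The converse does not hold: a match over the synopsis does not guarantee a match over the dataset, so the full graph is still consulted in that case.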

Another direction of research is the extension of our approximation and relaxation operators, query evaluation and query optimisation techniques to flexible federated query processing for SPARQL 1.1. Finally, we plan a detailed comparison of the query rewriting approach to query approximation and relaxation presented here with the “native” implementation of similar operators described in [23].

Acknowledgement

Andrea Calì acknowledges partial support by the EPSRC project “Logic-based Integration and Querying of Unindexed Data” (EP/E010865/1).

References

1. Alkhateeb, J.-F. Baget and Euzenat, Extending SPARQL with regular expression patterns (for querying RDF), Web Semant. 7(2) (2009), 57–73, [Online]. Available: doi:10.1016/j.websem.2009.02.002.

2. J.M. Almendros-Jiménez, Luna and Moreno, Fuzzy XPath queries in XQuery, in: On the Move to Meaningful Internet Systems: OTM 2014 Conference Proceedings – Confederated International Conferences: CoopIS, and ODBASE 2014, Amantea, Italy, October 27–31, Meersman, Panetto, T.S. Dillon, Missikoff, Liu, Pastor, Cuzzocrea and Sellis, eds, 2014, pp. 457–472, [Online]. Available: doi:10.1007/978-3-662-45563-0_27.

3. Bizer, Cyganiak and Heath, How to publish Linked Data on the Web, Web page, 2007, revised 2008, Accessed 22/02/2010. [Online]. Available: http://www4.wiwiss.fu-berlin.de/bizer/pub/LinkedDataTutorial/.

4. Bordogna and Psaila, Customizable flexible querying in classical relational databases, in: Handbook of Research on Fuzzy Information Processing in Databases, Galindo, ed., IGI Global, 2008, pp. 191–217, [Online]. Available: http://dblp.uni-trier.de/db/books/collections/Galindo2008.html#BordognaP08.

5. Calì, Frosini, Poulovassilis and P.T. Wood, Flexible querying for SPARQL, in: On the Move to Meaningful Internet Systems: OTM 2014 Conference Proceedings – Confederated International Conferences: CoopIS, and ODBASE 2014, Amantea, Italy, October 27–31, Meersman, Panetto, T.S. Dillon, Missikoff, Liu, Pastor, Cuzzocrea and Sellis, eds, 2014, pp. 473–490, [Online]. Available: doi:10.1007/978-3-662-45563-0_28.

6. M.W. Chekol, Euzenat, Genevès and Layaïda, PSPARQL query containment, in: Proc. of the 13th International Symposium on Database Programming Languages – DBPL 201, Seattle, Washington, USA, August 29, 2011, Foster and Kementsietsidis, eds, 2011, [Online]. Available: http://www.cs.cornell.edu/conferences/dbpl2011/papers/dbpl11-chekol.pdf.

7. De Virgilio, Maccioni and Torlone, A similarity measure for approximate querying over RDF data, in: Proc. of the Joint EDBT/ICDT 2013 Workshops, EDBT’13, Guerrini, ed., ACM, New York, NY, USA, 2013, pp. 205–213, [Online]. Available: http://doi.acm.org/10.1145/2457317.2457352.

8. Elbassuoni, Ramanath and Weikum, Query relaxation for entity-relationship search, in: Proc. of the 8th Extended Semantic Web Conference on the Semanic Web: Research and Applications – Volume Part II, ESWC’11, Antoniou, Grobelnik, E.P.B. Simperl, Parsia, Plexousakis, P.D. Leenheer and J.Z. Pan, eds, Springer-Verlag, Berlin, Heidelberg, 2011, pp. 62–76, [Online]. Available: http://dl.acm.org/citation.cfm?id=2017936.2017942.

9. Fink and Olteanu, On the optimal approximation of queries using tractable propositional languages, in: Proc. of the 14th International Conference on Database Theory, ICDT’11, Milo, ed., ACM, New York, NY, USA, 2011, pp. 174–185, [Online]. Available: http://doi.acm.org/10.1145/1938551.1938575.

10. Hogan, Mellotte, Powell and Stampouli, Towards fuzzy query-relaxation for RDF, in: The Semantic Web: Research and Applications, Simperl, Cimiano, Polleres, Corcho and Presutti, eds, Lecture Notes in Computer Science, Vol. 7295, Springer, Berlin, Heidelberg, 2012, pp. 687–702, [Online]. Available: doi:10.1007/978-3-642-30284-8_53.

11. Huang and Liu, Query relaxation for star queries on RDF, in: Proc. of the 11th International Conference on Web Information Systems Engineering, WISE’10, Chen, Triantafillou and Suel, eds, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 376–389, [Online]. Available: http://dl.acm.org/citation.cfm?id=1991336.1991379.

12. Huang, Liu and Zhou, Computing relaxed answers on RDF databases, in: Proc. of the 9th International Conference on Web Information Systems Engineering, WISE’08, Bailey, Maier, Schewe, Thalheim and X.S. Wang, eds, Springer-Verlag, Berlin, Heidelberg, 2008, pp. 163–175, [Online]. Available: doi:10.1007/978-3-540-85481-4_14.

13. C.A. Hurtado, Poulovassilis and P.T. Wood, Query relaxation in RDF, J. Data Semantics 10 (2008), 31–61, [Online]. Available: doi:10.1007/978-3-540-77688-8_2.

14. Kiefer, Bernstein and Stocker, The fundamentals of iSPARQL: A virtual triple approach for similarity-based semantic web tasks, in: Proc. of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC’07/ASWC’07, Aberer, Choi, N.F. Noy, Allemang, Lee, L.J.B. Nixon, Golbeck, Mika, Maynard, Mizoguchi, Schreiber and Cudré-Mauroux, eds, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 295–309, [Online]. Available: http://dl.acm.org/citation.cfm?id=1785162.1785185.

15. Muñoz, Pérez and Gutierrez, Minimal deductive systems for RDF, in: Proc. of the 4th European Conference on the Semantic Web: Research and Applications, ESWC’07, Franconi, Kifer and May, eds, Springer-Verlag, Berlin, Heidelberg, 2007, pp. 53–67, [Online]. Available: doi:10.1007/978-3-540-72667-8_6.

16. Pérez, Arenas and Gutierrez, Semantics and complexity of SPARQL, in: Proc. of the 5th International Semantic Web Conference, ISWC’06, I.F. Cruz, Decker, Allemang, Preist, Schwabe, Mika, Uschold and Aroyo, eds, Springer-Verlag, Berlin, Heidelberg, 2006, pp. 30–43, [Online]. Available: doi:10.1007/11926078_3.

17. Pérez, Arenas and Gutierrez, Semantics and complexity of SPARQL, ACM Trans. Database Syst. 34(3) (Sep. 2009), 16:1–16:45, [Online]. Available: http://doi.acm.org/10.1145/1567274.1567278.

18. Poulovassilis and P.T. Wood, Combining approximation and relaxation in semantic web path queries, in: Proc. of the 9th International Semantic Web Conference, ISWC’10, P.F. Patel-Schneider, Pan, Hitzler, Mika, Zhang, J.Z. Pan, Horrocks and Glimm, eds, Springer-Verlag, Berlin, Heidelberg, 2010, pp. 631–646, [Online]. Available: http://dl.acm.org/citation.cfm?id=1940281.1940322.

19. Pradel, Haemmerlé and Hernandez, Natural language query interpretation into SPARQL using patterns, in: Fourth International Workshop on Consuming Linked Data – COLD 2013, Kent State University, Sydney, AU, 2013, pp. 1–12, thanks to Ceur-ws editor. The definitive version is available at http://ceur-ws.org/Vol-1034/. [Online]. Available: http://oatao.univ-toulouse.fr/12893/.

20. B.R.K. Reddy and P.S. Kumar, Efficient approximate SPARQL querying of web of linked data, in: URSW, Bobillo, R.N. Carvalho, P.C.G. da Costa, d’Amato, Fanizzi, K.B. Laskey, K.J. Laskey, Lukasiewicz, Martin, Nickles and Pool, eds, CEUR Workshop Proceedings, Vol. 654, CEUR-WS.org, 2010, pp. 37–48, [Online]. Available: http://dblp.uni-trier.de/db/conf/semweb/ursw2010.html#ReddyK10.

21. Sassi, Tlili and Ounelli, Approximate query processing for database flexible querying with aggregates, in: Transactions on Large-Scale Data- and Knowledge-Centered Systems V, Hameurlain, Küng and Wagner, eds, Springer-Verlag, Berlin, Heidelberg, 2012, pp. 1–27, [Online]. Available: http://dl.acm.org/citation.cfm?id=2184170.2184171.

22. Schmidt, Foundations of sparql query optimization, Ph.D. dissertation, Albert-Ludwigs-Universitat Freiburg, 2009, [Online]. Available: http://www.informatik.uni-freiburg.de/~mschmidt/docs/diss_final01122010.pdf.

23. Selmer, Poulovassilis and P.T. Wood, Implementing flexible operators for regular path queries, in: Proc. of the Workshops of the EDBT/ICDT 2015 Joint Conference (EDBT/ICDT), Brussels, Belgium, March 27th, 2015, P.M. Fischer, Alonso, Arenas and Geerts, eds, 2015, pp. 149–156, [Online]. Available: http://ceur-ws.org/Vol-1330/paper-25.pdf.

Theorem 10.
Given a query Q, ontology K and maximum query cost c, the Rewriting Algorithm terminates after at most $⌈ c / c^{'} ⌉$ iterations, where $c^{'}$ is the lowest cost of an edit or relaxation operation, assuming that $c^{'} > 0$ .

4.4. Practical considerations

³
https://wordnet.princeton.edu/.

⁴
https://jena.apache.org/documentation/tdb/.

Table 3
Numbers of answers (Exact and A/R) and numbers of rewritten queries (A/R)

| | $Q_{6}$ | $Q_{7}$ | $Q_{8}$ | $Q_{9}$ | $Q_{10}$ |
|---|---|---|---|---|---|
| Exact | 28 | 5 | 1540 | 0 | 0 |
| A/R | 14431 | N/A | 22540 | 0 | 0 |
| # of queries | 154 | 36 | 17 | 29 | 47 |