Abstract
Flexible querying techniques can enhance users’ access to complex, heterogeneous datasets in settings such as Linked Data, where the user may not always know how a query should be formulated in order to retrieve the desired answers. This paper presents query processing algorithms for a fragment of SPARQL 1.1 incorporating regular path queries (property path queries), extended with query approximation and relaxation operators. Our flexible query processing approach is based on query rewriting and returns answers incrementally according to their “distance” from the exact form of the query. We formally show the soundness, completeness and termination properties of our query rewriting algorithm. We also present empirical results that show promising query processing performance for the extended language.
Introduction
Flexible querying techniques have the potential to enhance users’ access to complex, heterogeneous datasets. In particular, users querying Linked Data may lack full knowledge of the structure of the data, its irregularities, and the URIs used within it. Moreover, the schemas and URIs used can also evolve over time. This makes it difficult for users to formulate queries that precisely express their information retrieval requirements. Hence, providing users with flexible querying capabilities is desirable.
SPARQL is the predominant language for querying RDF data and, in the latest extension of SPARQL 1.1, it supports property path queries (i.e. regular path queries) over the RDF graph. However, it does not support notions of query approximation and relaxation (apart from the OPTIONAL operator).
Suppose a user wishes to find events that took place in London on 15th September 1940 and poses the following query on the YAGO knowledge base,1
Approximating “on” by “happenedOnDate” and “in” by “happenedIn” (which do appear in YAGO) gives the following query:
Alternatively, instead of relaxing the second triple above, another approximation step can be applied to it, inserting the property “label” that connects URIs to their labels and yielding the following query:
Suppose the user wishes to find the geographic coordinates of the “Battle of Waterloo” event by posing the query
This query does not return any answers from YAGO since YAGO does not store the geographic coordinates of Waterloo. However, by applying an approximation step, we can insert “isLocatedIn” after “happenedIn” which connects the URI representing Waterloo with the URI representing Belgium. The resulting query is
Moreover, YAGO does in fact store directly the coordinates of the “Battle of Waterloo” event, so if we apply an approximation step that deletes the property “happenedIn”, instead of adding “isLocatedIn”, the resulting query
In this paper we describe an extension of a fragment of SPARQL 1.1 with query approximation and query relaxation operations that automatically generate rewritten queries such as those illustrated in the above examples, calling the extended language SPARQL AR . We first presented SPARQL AR in [5], focussing on its syntax, semantics and complexity of query answering. We showed that the introduction of the query approximation and query relaxation operators does not increase the theoretical complexity of the language, and we provided complexity bounds for several language fragments. In this paper, we review and extend these results to a larger SPARQL language fragment. We also explore in more detail the theoretical and performance aspects of our query processing algorithms for SPARQL AR , examining their correctness and termination properties, and presenting the results of a performance study over the YAGO dataset.
The rest of the paper is structured as follows. Section 2 describes related work on flexible querying for the Semantic Web, and on query approximation and relaxation more generally. Section 3 presents the theoretical foundation of our approach, summarising the syntax, semantics and complexity of SPARQL AR . Section 4 presents in detail our query processing approach for SPARQL AR , which is based on query rewriting. We present our query processing algorithms, and formally show the soundness and completeness of our query rewriting algorithm, as well as its termination. We include a discussion in Section 4.4 on how users may be helped in formulating queries and interpreting results in a system which includes query approximation and relaxation. Section 5 presents and discusses the results of a performance study over the YAGO dataset. Finally, Section 6 gives our concluding remarks and directions for further work.
There have been several previous proposals for applying flexible querying to the Semantic Web, mainly employing similarity measures to retrieve additional answers of possible relevance. For example, in [10] matching functions are used for constants such as strings and numbers, while in [14] an extension of SPARQL is developed called iSPARQL which uses three different matching functions to compute string similarity. In [7], the structure of the RDF data is exploited and a similarity measurement technique is proposed which matches paths in the RDF graph with respect to the query. Ontology-driven similarity measures are proposed in [11,12,20] which use the RDFS ontology to retrieve extra answers and assign a score to them.
In [8] methods for relaxing SPARQL-like triple pattern queries automatically are presented. Query relaxations are produced by means of statistical language models for structured RDF data and queries. The query processing algorithms merge the results of different relaxations into a unified results list.
Recently, a fuzzy approach has been proposed to extend the XPath query language with the aim of providing mechanisms to assign priorities to queries and to rank query answers [2]. These techniques are based on fuzzy extensions of the Boolean operators.
Flexible querying approaches for SQL have been discussed in [21] where the authors describe a system that enables a user to issue an SQL aggregation query, see results as they are produced, and adjust the processing as the query runs. This approach allows users to write flexible queries containing linguistic terms, observe the progress of their aggregation queries, and control execution on the fly.
An approximation technique for conjunctive queries on probabilistic databases has been investigated in [9]. The authors use propositional formulas for approximating the queries. Formulas and queries are connected in the following way: given an input database where every tuple is annotated by a distinct variable, each tuple t in the query answer is annotated by a formula over the input tuples that contributed to t.
Another flexible querying technique for relational databases is described in [4]. The authors present an extension to SQL (Soft-SQL) which permits so-called soft conditions. Such conditions tolerate degrees of under-satisfaction of a query by exploiting the flexibility offered by fuzzy set theory.
In [18] the authors show how a conjunctive regular path query language can be effectively extended with approximation and relaxation techniques, using similar notions of approximation and relaxation as we use here. Finally, in [23] the authors describe and provide technical details of the implementation of a flexible querying evaluator for conjunctive regular path queries, extending the work in [18].
In contrast to all the above work, our focus is on the SPARQL 1.1 language. In [5] we extended, for the first time, a fragment of this language with query approximation and query relaxation operators, terming the extended language SPARQL AR . Here, we add the UNION operator to SPARQL AR and derive additional complexity results. Moreover, we present in detail our query processing algorithms for SPARQL AR . Our query processing approach is based on query rewriting, whereby we incrementally generate a set of SPARQL 1.1 queries from the original SPARQL AR query, evaluate these queries using existing technologies, and return answers ranked according to their “distance” from the original query. We examine the correctness and termination properties of our query rewriting algorithm and we present the results of a performance study on the YAGO dataset.
Theoretical foundation
In this section we give definitions of the syntax and semantics of SPARQL AR summarising and extending the syntax and semantics from [5], and also the complexity results from that paper. We begin with some necessary definitions.
(Sets, triples and variables).
We assume pairwise disjoint infinite sets U and L of URIs and literals, respectively. An RDF triple is a tuple
Note that in the above definition we modify the definition of triples from [16] by omitting blank nodes, since their use is discouraged for Linked Data because they represent a resource without specifying its name and are identified by an ID which may not be unique in the dataset [3].
(RDF-Graph).
An RDF-Graph G is a directed graph
Note that, in the above definition, we modify the definition of an RDF-Graph from [16] to add weights to the edges, which are needed to formalise our flexible querying semantics. Initially, these weights are all 0.
We next define the ontology of an RDF dataset, using a fragment of the RDF-Schema (RDFS) vocabulary.
(Ontology).
An ontology K is a directed graph
In an RDF-graph
(Triple pattern).
A triple pattern is a tuple
Note that again we modify the definition from [16] to exclude blank nodes. A mapping μ from (Mapping).
Syntax of SPARQL AR queries
(Regular expression pattern).
A regular expression pattern
This definition of regular expression patterns is the same as that in [6]. Our query pattern syntax is also based on that of [6], but includes also our query approximation and relaxation operators APPROX and RELAX.
(Query Pattern).
A SPARQL
AR
query pattern Q is defined as follows:
(In the W3C SPARQL syntax, a dot (.) is used for conjunction but, for greater clarity, we use AND instead. Note also that ϵ and
A SPARQL
AR
query has the form
Semantics of SPARQL AR queries
We extend the semantics of SPARQL with regular expression query patterns given in [6] in order to handle the weight/cost of edges in an RDF-Graph and the cost of applying the approximation and relaxation operators. These costs are used to rank the answers. In particular, when we introduce the APPROX and RELAX operators below these costs determine the ranking of answers returned to the user, with exact answers (of cost 0) being returned first, followed by answers with increasing costs.
We extend the notion of SPARQL query evaluation from returning a set of mappings to returning a set of pairs of the form
Two mappings
We finally define the union and join of two sets of query evaluation results,
Exact semantics
The semantics of a triple pattern t that may include a regular expression pattern as its second component, with respect to a graph G, denoted
A mapping satisfies a condition R, denoted
R is
R is
R is
R is
R is
R is
R is
The overall semantics of queries (excluding APPROX and RELAX) is as follows, where Q,
Query relaxation
Our relaxation operator is based on that in [18] and relies on a fragment of the RDFS entailment rules known as ρDF [15]. An RDFS graph

RDFS entailment rules.

Additional rules for extended reduction of an RDFS ontology.
Applying a rule means adding a triple that is deducible by the rule to G or K. Specifically, if there are two triples t,
In order to apply relaxation to queries, the extended reduction of an ontology K is required [13]. Given an ontology K, its extended reduction
In order to generate a unique extended reduction we alter step (iii) of the procedure in [13] as follows: for every pair of triples
Applying a rule in reverse means removing a triple deducible by the rule from G or K. Specifically, if there are two triples t and
Henceforth, we assume that
If we did not use the extended reduction of the ontology K, then the relaxation steps applied would not necessarily be the “smallest”. For example, consider the following ontology
As a further condition, we require that the ontology K is acyclic in order for relaxed queries to have unambiguous costs (a detailed analysis can be found in [13]).
Given the following cyclic ontology
Consider now the query
Following the terminology of [13], a triple pattern
A triple pattern
The semantics of the RELAX operator in SPARQL
AR
are as follows:
Consider the following portion
The above query returns 4 answers. However, some actors have only a single name (for example Cher), or have their full name recorded using the “label” property directly. By applying relaxation to the second triple pattern using rule (2), we can replace the predicate
As another example, suppose the user poses the following query:
which returns every English politician born in England. By applying relaxation to the first triple pattern using rule (4), it is possible to replace the class
For query approximation, we apply edit operations which transform a regular expression pattern P into a new expression pattern
These rules can be applied to a URI p in order to approximate it to a regular expression P. We write
The semantics of the APPROX operator in SPARQL
AR
are as follows:
Suppose that the user is looking for all discoveries made between 1700 and 1800 AD, and queries the YAGO dataset as follows:
Approximating the third triple pattern, it is possible to substitute the predicate “label” by “
This query returns no answers since the predicate “
By the semantics of RELAX and APPROX, we observe that given a triple pattern
We now study the combined, data and query complexity of SPARQL AR , extending the complexity results from [16,17,22] for simple SPARQL queries, from [1] for SPARQL with regular expression patterns to include our new flexible query constructs, and from [5] to include now UNION in SPARQL AR .
Complexity of various SPARQL
AR
fragments
Complexity of various SPARQL AR fragments
The complexity of query evaluation is based on the following decision problem, which we denote EVALUATION: Given as input a graph
We have the following results, the proofs of which are given in the Appendix.
EVALUATION can be solved in time
EVALUATION can be solved in time
EVALUATION is NP-complete for queries that are constructed using only the AND and
EVALUATION of
EVALUATION is NP-complete for queries that may contain regular expression patterns and that are constructed using the operators AND,
EVALUATION is PTIME in data complexity for queries that may contain regular expression patterns and that are constructed using the operators AND,
The complexity study of SPARQL AR in [5] is summarised in the first six lines of Table 1, where the combined, data and query complexity are shown for specific language fragments and combinations of operators.
The results for query complexity follow from Lemma 1 and Theorems 1, 2 and 3.
We next show three new complexity results which extend those of [5], summarised in the last two lines of Table 1.
EVALUATION is in NP for queries containing AND, UNION, FILTER, APPROX, RELAX and regular expression patterns.
EVALUATION is NP-complete for queries that may contain regular expression patterns and that are constructed using the operators AND, UNION, FILTER, RELAX, APPROX and
EVALUATION is PTIME in data complexity for queries that may contain
The results for query complexity follow from Lemma 1 and Theorems 6 and 7.
We conclude our complexity study confirming that adding the UNION operator to SPARQL
AR
does not increase the overall complexity. EVALUATION is NP-complete for queries that contain regular expression patterns and that are constructed using the operators AND, UNION,
The OPTIONAL operator can be added to a SPARQL query in order to retrieve information only when it is available. In other words, it allows optional matching of query patterns. If in query Q the OPTIONAL operator is applied to query pattern
It is possible to add the OPTIONAL operator to SPARQL AR , allowing APPROX and RELAX to be applied to triple patterns occurring within an OPTIONAL clause, with the same semantics as specified earlier. However, the complexity of SPARQL with the OPTIONAL clause is PSPACE-complete [16]. Therefore, by our earlier results, the complexity of SPARQL AR would also increase similarly.
Query processing
We evaluate SPARQL
AR
queries by making use of a query rewriting algorithm, following a similar approach to [11,12,20]. In particular, given a query Q which may contain the APPROX and/or RELAX operators, we incrementally build a set of queries
We present the algorithm in Section 4.1, proving its correctness in Section 4.2 and termination in Section 4.3. Some practical issues relating to how users might benefit from a flexible querying system such as this are discussed in Section 4.4.

Flexible Query Evaluation

Rewriting algorithm

applyApprox

approxRegex
Our query rewriting algorithm (Algorithm 2 below) starts by considering the query
The function

applyRelax

relaxTriplePattern
We assign to the variable
In Algorithm 2, the
To compute the query answers (Algorithm 1) we apply an evaluation function,
The applyApprox and applyRelax functions, respectively, invoke the functions approxRegex and replaceTriplePattern, shown as Algorithms 4 and 6. In Algorithm 6, z,
In the following example, we illustrate how the rewriting algorithm works by showing the queries it generates, starting from a SPARQL AR query.
Consider the following ontology K (satisfying
Suppose a user wishes to find every event which took place in London on 15th September 1940 and poses the following query Q:
We now discuss the soundness, completeness and termination of the rewriting algorithm. As we stated earlier, this takes as input a cost that limits the number of queries generated. Therefore the classic definitions of soundness and completeness need to be modified. To handle this, we use an operator
(Containment).
Given a graph G, an ontology K, and queries Q and
(Soundness).
The rewriting of Q,
(Completeness).
The rewriting of Q,
To show the soundness and completeness of the query rewriting algorithm, we will require the following lemmas and corollary.
Given four sets of evaluation results
The following result follows from Lemma 2:
Given four sets of evaluation results
Given queries
The Rewriting Algorithm is sound and complete.
We are able to show that the rewriting algorithm terminates after a finite number of steps:
Given a query Q, ontology K and maximum query cost c, the Rewriting Algorithm terminates after at most
In the previous sections we have concentrated on theoretical aspects of SPARQL AR . In practice SPARQL AR would be used as part of a framework allowing users to search RDF data in a flexible way. A front-end could allow users to pose queries using keywords or natural language. Such user queries could then be translated into SPARQL (cf. [19]).
Suppose a user poses the following question to the system: “What event happened in London on 15/09/1940?”. This question could be translated to the query (
Applying the APPROX operator without an upper bound on cost will eventually return every connected node of the RDF graph. This negatively affects the precision of answers but ensures 100% recall. When applying the substitution operation of APPROX to a triple pattern, we do not specify the predicate that needs to be replaced, but instead insert the wildcard
Similarly, it is possible to add a finer ranking of the answers arising from the RELAX operator. When we relax a triple pattern to derive a direct relaxation, we make use of a triple from the ontology K. Instead of assigning a cost to each rule of Fig. 1, we could assign a cost to each triple in K, reflecting domain experts’ views of the semantic closeness of concepts. Therefore, the direct relaxation would have a cost depending on which triple in K is used.

SPARQL AR system architecture.
Finally, in order to help users interpret answers to their queries, the system could provide information about which rewritten query returned which answers. Showing only queries to users might not be particularly helpful, especially if the original query was simply in the form of keywords. Instead, showing the sequence of steps by which the original terms used by the user were approximated or relaxed could help them decide whether the answers returned were meaningful or not.
We have implemented the query evaluation algorithms described above in Java, using Jena for SPARQL query evaluation. Figure 3 illustrates the system architecture, which consists of three layers: the GUI layer, the System layer, and the Data layer. The GUI layer supports user interaction with the system, allowing queries to be submitted, costs of the edit and relaxation operators to be set, data sets and ontologies to be selected, and query answers to be incrementally displayed to the user. The System layer comprises three components: the Utilities, containing classes providing the core logic of the system; the Domain Classes, providing classes relating to the construction of SPARQL AR queries; and the Query Evaluator in which query rewriting, optimisation and evaluation are undertaken. The Data layer connects the system to the selected RDF dataset and ontology using the JENA API; RDF datasets are stored as a TDB database4
User queries are submitted to the GUI, which invokes a method of the SPARQL AR Parser to parse the query string and construct an object of the class SPARQL AR Query. The GUI also invokes the Data/Ontology Loader which creates an object of the class Data/Ontology Wrapper, and the Approx/Relax Constructor which creates objects of the classes Approx and Relax. Once these objects have been initialised, they are passed to the Query Evaluator by invoking the Rewriting Algorithm. This generates the set of SPARQL queries to be executed over the RDF dataset. The set of queries is passed to the Evaluator, which interacts with the Optimiser and the Cache to improve query performance – we discuss the Optimiser and the Cache in Section 5.2. The Evaluator uses the Jena Wrapper to invoke Jena library methods for executing SPARQL queries over the RDF dataset. The Jena Wrapper also gathers the query answers and passes them to the Answer Wrapper. Finally, the answers are displayed by the Answers Window, in ranked order.
We have conducted empirical trials over the YAGO dataset and the Lehigh University Benchmark (LUBM).5
For the rest of this section, we focus on our empirical trials over the YAGO dataset, firstly without any optimisations, and then in Section 5.2 with an optimised query evaluator. YAGO contains over 120 million triples (4.83 GB in Turtle format) which we downloaded and stored in a TDB database. The size of the TDB database is 9.70 GB, and the nodes of the YAGO graph are stored in a 1.1 GB file.
We ran our experiments on a Windows PC with a 2.4 Ghz i5 dual-core processor and 8 GB of RAM. We executed 10 queries over the database, comprising increasing numbers of triple patterns (1 up to 10), listed below. The aim of this performance study was to further gauge the practical feasibility of our techniques and to discover major performance bottlenecks requiring further investigation. A more comprehensive and detailed performance study is planned for future work.
Numbers of answers (Exact and A/R) and numbers of rewritten queries (A/R)
The reader will notice that we used the APPROX operator only on triple patterns containing a regular expression in which either the subject or object is a constant. This is due to the fact that if we apply APPROX to simple triple patterns of the form
For each query
The numbers of answers returned by each query, for both the exact form and the APPROX/RELAX (A/R) form, are shown in Tables 2 and 3, along with the number of rewritten queries in each case (# of queries). Tables 4 and 5 list the execution times for the exact queries, the A/R queries and the A/R queries with a simple caching optimisation implemented (optimised A/R). We discuss the results without this optimisation in the next subsection, and the results with this optimisation applied in Section 5.2.
Query
which returns only 3 answers. Increasing the maximum cost does not result in the rewriting algorithm generating any more queries, and no other answers would be returned at higher cost.
Numbers of answers (Exact and A/R) and numbers of rewritten queries (A/R)
Numbers of answers (Exact and A/R) and numbers of rewritten queries (A/R)
Query execution time (in seconds)
Query execution time (in seconds)
Query
Query
Query
Query
Query
Query
Query
Finally, query

Flexible Query Evaluation – Optimised
In Tables 4 and 5 we also show the query execution times for the A/R forms of all the queries using an optimised query evaluator. The optimisation is based on a caching technique, in which we pre-compute some of the answers in advance. The Optimiser module stores the cached answers in memory using the Java class HashSet6
Java documentation:
In Algorithm 7, we start by splitting a query into two parts: the triple patterns which do not have APPROX or RELAX applied to them (which we call the exact part) and those which have (which we call the A/R part). We first evaluate the exact part of the query and store the results. We then apply the rewriting algorithm to the A/R part. Each triple pattern of the latter is evaluated individually; all possible pairs of triple patterns are also evaluated. The answers to each are stored in
For query
In the final version of the system, the optimisation module would be disabled for single-conjunct queries.
For queries
The optimised evaluation of query
Queries
Query
We are still unable to execute the A/R forms of queries
We can see that the variable
The overall results show that the evaluation of SPARQL AR queries through a query rewriting approach is promising. The difference between the execution time of the exact form and the A/R form of the queries is acceptable for queries with fewer than 5 conjuncts. For most of the other queries, the simple optimisation technique described above also brings down the running times of the A/R forms to more reasonable levels. Clearly, for more complex queries, more sophisticated optimisation techniques need to be investigated and developed.
In this paper we have presented query processing algorithms for an extended fragment of the SPARQL 1.1 language, incorporating approximation and relaxation operators. Our query processing approach is based on query rewriting whereby, given a query Q containing the APPROX and/or RELAX operators, we incrementally generate a set of queries
We have formally shown the soundness, completeness and termination of our query rewriting algorithm. Our empirical studies show promising query processing performance, but also that further optimisations are required.
An advantage of adopting a query rewriting approach is that existing techniques for SPARQL query optimisation and evaluation can be reused to evaluate the queries generated by our rewriting algorithm. Our ongoing work involves investigating optimisations to the rewriting algorithm itself, since it can generate a large number of queries. Specifically, we are studying the query containment problem for SPARQL
AR
and how query costs impact on this. Following this investigation, we plan to implement optimisations for the rewriting algorithm. For example, for a query
Another area of ongoing work involves the construction of synopses (or data guides) of RDF-datasets in order to speed up query evaluation. In our context, such a synopsis is a graph S constructed from an RDF-dataset G that will have the following property: if we consider G and S as automata, then
Another direction of research is the extension of our approximation and relaxation operators, query evaluation and query optimisation techniques to flexible federated query processing for SPARQL 1.1. Finally, also planned is a detailed comparison of the query rewriting approach to query approximation and relaxation presented here with the “native” implementation of similar operators described in [23].
Footnotes
Acknowledgement
Andrea Calì acknowledges partial support by the EPSRC project “Logic-based Integration and Querying of Unindexed Data” (EP/E010865/1).
