This paper demonstrates that the presence of blank nodes in RDF data represents a problem for distributed processing of SPARQL queries. It is shown that the usual decomposition strategies from the literature will leak information – even when that information derives from a single source. It is argued that this leakage, and the proper remedial measures, need to be accounted for in a formal semantics. To this end a set semantics for SPARQL is generalized with a parameter representing execution contexts. This makes it possible to keep tabs on the naming of blank nodes across execution contexts, which in turn makes it possible to articulate a decomposition strategy that is provably sound and complete wrt. any selection of RDF sources even when blank nodes are allowed. Alas, this strategy is not computationally tractable. However, there are ways of utilizing knowledge about the sources, if one has it, that will help considerably.
This paper is concerned with giving a general theoretical foundation for the semantics of distributed SPARQL processing. It was inspired in part by the observation that whilst a number of distributed SPARQL processors exist already, foundational work that allows one to study their relative behaviour in a principled manner is still in short supply.
In most ways, the present study can be regarded as a sequel to [35], in which the concept of a federation scheme was first introduced. Essentially, a federation scheme is a theoretical device that captures formally the conditions under which a particular federation strategy will yield sound and complete answer sets. Contrapositively, it can be regarded as a way of characterizing the behaviour of the federation strategy in question by making theoretically explicit when it will miss answers and why. While [35] gave several examples of sound and complete ways of decomposing and evaluating a query over a set of sources, these results were valid under the proviso that the contributing sources do not employ blank nodes for encoding information. The main contribution of the present paper is to remove this restriction, thus giving, for the first time we believe, a fully general declarative semantics for distributed processing of SPARQL basic graph patterns.
There is a singular tension in the Resource Description Framework (RDF) between, on the one hand, the strictly local validity of blank nodes and, on the other, the global pretensions of the data model. RDF affords a knowledge representation language for making assertions that are meant to be valid in a global or Web-wide scope. Yet, it recognizes the existence of entities that cannot be referenced outside their source of origin, and that cannot be identified across query execution contexts.
The Semantic Web community is deeply ambivalent in its attitude towards blank nodes. The Linked Data community generally shuns them – in Bizer [18] they are discussed under the heading “RDF features best avoided in the Linked Data context”. On the other hand, asserting the existence of something that one is not willing to name is generally recognized as indispensable for design patterns that rely on expressing general logical concepts such as n-ary relations, lists, sets or classes. However one sees it, there is no getting around the well-documented prevalence of blank nodes on the Web [19] – a fact which is not likely to change anytime soon.
In rough outline, the problem with blank nodes is that if a federator is to combine information across SPARQL endpoints, then it will have to join suitably small fragments of information from different sources. This in turn involves evaluating suitably small subqueries to get at all the partial answers that are required for answering the global query in full. Alas, in the process, patterns that match a partial answer in a single source where a blank node figures in join position may be broken up. If so, then that partial answer cannot be put back together afterwards, because the blank nodes that were previously joined will no longer be identifiable as the same node. The situation poses somewhat of a dilemma, and resolving it requires rather profound, though not too complex, amendments to the semantics of evaluating SPARQL basic graph patterns.
Admittedly, the fragment of SPARQL that consists only of basic graph patterns is a fragment that ignores many interesting and useful features of SPARQL. Yet, it is a natural fragment to start with since it is simple and mathematically well-behaved. Granted, it is an interesting question how, say, the non-monotonic features of SPARQL such as negation would alter the picture. Nevertheless, this is a question that is considered out of scope for the purposes of the present paper, though one that, it is hoped, will be easier to explore in the future on the basis of this paper.
The layout and specific contributions of this paper are as follows: Section 2 identifies and describes the adverse effect of blank nodes on distributed query processing, whilst Section 3 surveys related work in the area. Section 4 provides the semantic prerequisites for working out a solution to the said dilemma. This consists mainly in generalizing the SPARQL 1.1 semantics by adding a formal parameter for execution contexts to the notion of an answer set. The concept of an execution context is left largely implicit in the SPARQL 1.1 specification. In the distributed setting, it is a condition sine qua non for an adequate semantics and therefore needs to be formalized. Section 5 flexes the new semantic machinery by providing an existence theorem showing that if an answer to a query exists in a set of contributing sources taken as a whole, then there is a way to retrieve it, even when allowing for blank nodes. Section 6 reviews the federation-schemes framework as developed in [35] as a prerequisite for the subsequent section. It does not contribute any new results per se, but merely adapts the relevant concepts and definitions to the generalized semantics developed in Section 4. Finally, Section 7 ties the threads together by abstracting from the existence theorem in Section 5 a federation scheme that gives sound and complete answer sets in the absence of any knowledge about the structure of the sources, blank nodes allowed. However this section also shows that there is a rather steep price to pay for ignorance when blank nodes are involved. More specifically, it is shown that distributed query answering in the presence of blank nodes when no knowledge is available is inherently hard. Different ways to mitigate the situation are sketched based on obtaining knowledge either by regimentation of the sources or by probing them, pointing to interesting lines of future research.
Problem description
It seems reasonable, by default at least, to require of a distributed query processor that it neither misses nor invents answers; all and only the answers that are warranted by the sum total of information contained in the contributing sources should, if there are no overriding reasons to the contrary, be returned.
To illustrate what is meant here by ‘the sum total of information’, consider the two RDF graphs in Figs 1 and 2 respectively, call them source A and B. As should be apparent, they encode information about trophies won by Roger Federer and Raphael Nadal respectively in Grand Slam tournaments.1
It is a slightly modified version of an example from [19].
Now, according to source A the Wimbledon tournament is a Grand Slam tournament but the French Open is not. One may assume that this is an unintentional omission on the part of the curator of the data set. The missing piece of information is provided by source B, however, so when the two sources are merged, as in Fig. 3, all tennis events become appropriately classified. Therefore, the query in Fig. 4 – which asks for the name of a player and the year in which he won a Grand Slam tournament – when evaluated against the merge of the sources A and B, returns the answer in Fig. 5.
RDF source A.
RDF source B.
This answer set strictly includes that which is obtained by evaluating the query in Fig. 4 against each source separately before taking the union of the result, viz. Fig. 6. In other words, the whole is bigger than the sum of its parts. Specifically, the total amount of information contained by the two sources combined resides not only in what each of them can contribute separately, but also in the combination or join of elements across sources.
The union of sources A and B modulo renaming of blank nodes.
Get Grand Slam triumphs.
Answer over the merge of A and B.
Union of answers over A and B.
A distributed query processor should, in the absence of a good reason for sacrificing completeness, query the whole. That is, by combining results from partial queries evaluated separately against source A and B, it should return the table in Fig. 5 as the answer to the query in Fig. 4.
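The gap can be made concrete with a small executable sketch. The graphs below are illustrative stand-ins for Figs 1 and 2 (the names, the exact triples and the minimal BGP evaluator are assumptions of this sketch, not the paper's formal apparatus); the point is only that the union of the per-source answer sets misses the solution that requires the cross-site join, whereas querying the merge finds it.

```python
# A minimal BGP evaluator over RDF-like graphs, used to replay the
# tennis example: querying the merge of two sources yields a solution
# that the union of per-source answers misses.

def is_var(t):
    return t.startswith("?")

def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` maps onto `triple`, or return None."""
    b = dict(binding)
    for p, v in zip(pattern, triple):
        if is_var(p):
            if b.get(p, v) != v:
                return None
            b[p] = v
        elif p != v:
            return None
    return b

def evaluate(bgp, graph):
    """All solution mappings for a basic graph pattern over a graph."""
    solutions = [{}]
    for pattern in bgp:
        solutions = [b2 for b in solutions
                     for t in graph
                     if (b2 := match(pattern, t, b)) is not None]
    return solutions

# Source A: Federer's 2009 win is recorded, but :frenchOpen lacks its type.
A = {("_:b1", ":event", ":frenchOpen"),
     ("_:b1", ":year", "2009"),
     ("_:b1", ":winner", ":federer"),
     (":wimbledon", "rdf:type", ":GrandSlam")}

# Source B: supplies the missing classification of :frenchOpen.
B = {("_:b2", ":event", ":frenchOpen"),
     ("_:b2", ":year", "2008"),
     ("_:b2", ":winner", ":nadal"),
     (":frenchOpen", "rdf:type", ":GrandSlam")}

# Who won which Grand Slam tournament, and when?
query = [("?t", ":event", "?e"),
         ("?t", ":year", "?y"),
         ("?t", ":winner", "?w"),
         ("?e", "rdf:type", ":GrandSlam")]

per_source = evaluate(query, A) + evaluate(query, B)   # union of the parts
merged = evaluate(query, A | B)                        # query over the merge
```

Evaluated per source, only Nadal's win is found (source A alone never classifies the French Open); over the merge, Federer's 2009 win appears as well.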
Generalizing from the present example – keeping the restriction to basic graph patterns in mind – one thus arrives at the following intuitive criterion of adequacy for a distributed query processor: the same answer should be returned as that which would be obtained were the query to be evaluated over the merge of the contributing sources. This concept of adequacy will have to be made more precise later, for there is leeway for different interpretations of it due to the appeal to the ‘sameness’ of answer sets. For now, however, and in relation to the example at hand, the question becomes: how must the query in Fig. 4 be decomposed and processed (vagueness intended) in order to satisfy the adequacy criterion?
Given the obviousness of all that has been said so far, it is somewhat surprising that there is no simple answer to this question. This is due to the fact that source A and B – as recommended by the Semantic Web Best Practices and Deployment Working Group – use blank nodes for expressing three-place predicates, in this case the predicate “X wins Y in year Z”. Due to this, the information expressed by the graphs depends on joins on blank nodes. In the distributed case, such a join, if it is not handled with special care, will quickly become a drain through which information will leak. As we shall see, this has to do with anaphoric reference being lost whenever the same blank node is processed in two separate execution contexts.
It is interesting to register that none of the more straightforward and better known query-decomposition strategies from the literature will work for this example. Consider for instance the even decomposition, so called in [35] (cf. Section 6), as implemented in DARQ [29]. This is the decomposition that evaluates each triple pattern of the global query (as it will be called) against every source that may contain an answer for it (meaning that the RDF property from the triple pattern in question occurs in that source). For instance, the even decomposition will evaluate both of the triple patterns ?x :event ?event and ?x :year ?year from the query in Fig. 4 separately against each of A and B. Collecting the solutions, say, from source A in separate tables, we have the answer sets in Figs 7 and 8.
Answers to ?x :event ?event over A.
Answers to ?x :year ?year over A.
Here, the identifiers for blank nodes have been given distinct subscripts c and d to signify that they are not to be treated as the same names. This behaviour is mandatory according to the SPARQL 1.1 specification, which requires every distinct query to be treated as a separate query execution context. More precisely, every query defines a distinct and sealed scope for blank node identifiers, which means that a blank node from one execution context cannot be referenced in another. From a logical point of view this is an entirely legitimate restriction. Blank nodes are similar to existential variables, and existential variables are anaphors within the same quantificational context only. Nevertheless, it is detrimental to the even decomposition, because join information is not preserved by the two tables. Therefore, the even decomposition is not a decomposition strategy that can be relied on to remain faithful to the ‘sum total’ of information contained in the contributing sources.
The problem generalizes, for no other well-known decomposition strategy fares any better. Take the one implemented in FedX [34] for instance. The FedX decomposition is similar to DARQ except that it groups triple patterns that are satisfiable by a single source only – so called exclusive groups – into the same subquery. This strategy is studied under the name of the standard decomposition in [35].
Exclusive groups anchor the triples involved to the same execution context, so they may in fortuitous cases solve the problem. However, the example currently under consideration shows that this is not in general so. Indeed, in the cases where it is so, it comes down to luck, because exclusivity has nothing to do essentially with blank nodes. In particular, there are no exclusive groups in the query in Fig. 4 wrt. source A and B, so the even and standard decompositions are the same in this case.
As a final example, consider the decomposition strategy implemented in ADERIS [25], which is that of assigning to each selected source G the maximal subset P′ of the global query P such that the signature of P′ is covered by G. That is, if all of the RDF properties that occur in P′ also occur in G, then P′ is routed to G as a single conjunctive query pattern. The partial results obtained are then combined by taking their union. It is not difficult to see that this strategy falls short of our proposed criterion of adequacy. The running example constitutes a counterexample, for all of the RDF properties in the query occur in both of the sources A and B. Hence, on the ADERIS approach the whole query would be evaluated against each of these sources, and the final result would be the union of the respective answer sets. But, as we have seen, this produces the table in Fig. 6 which misses the cross-site join that relates Roger Federer to his 2009 win.
Taking stock, these examples can be taken to show the following: If answering a query involves joins on blank nodes, then the granularity of the decomposition of that query matters a great deal. If the query is split too finely, then answers from a single source may be lost due to the loss of join information linking the partial answers. If on the other hand the query is split too coarsely, then cross-site joins may be lost. From a semantical point of view, distributed query answering is essentially about balancing these two opposing forces.
The delicateness of this balancing act has not, it seems, been fully appreciated: while it is reasonable to expect a distributed query processor to do better than merely collecting solutions to the global query from each of the sources, the very attempt to do so – that is, the attempt to harvest information by utilizing cross-site joins – may be the very thing that causes it to do worse, because if there are joins on blank nodes in a contributing source, then that source will leak solutions if the global query is decomposed too finely.
As for the question of whether there even is a decomposition that can be made to work, the answer is affirmative, for the example at hand at least, viz. the decomposition described next.
The crucial thing about this decomposition is, first, that it groups together those triple patterns that match a join on a blank node, thus ensuring that the join arguments are kept within one and the same execution context so as not to lose anaphoric reference. Secondly, it is also of crucial importance that all other triple patterns are shipped as singletons, or else cross-site joins would be lost. It is not difficult to show that the procedure that takes the union of the results for each subquery over the sources, before folding the results into a single answer set by joins, produces the outcome in Fig. 5. Yet, this solution is specific to the problem at hand and depends on detailed knowledge about the shape of the graphs being queried. For instance, one needs to know that the variable ?x will be mapped to a blank node – something that cannot, needless to say, be read off from the query alone.
We have our work cut out for us, then: first, it is necessary to show that irrespective of the shape of graphs, if a solution to a query is warranted by the merge of the contributing sources, then there is a decomposition from which that solution can be reconstructed. Secondly, it must be shown that it is possible to abstract from these particular solutions a general federation scheme that yields a correct and exhaustive answer set in the absence of any knowledge about the sources, even whilst allowing for joins on blank nodes. In order to do so, however, it will be necessary to generalize the SPARQL 1.1 semantics with an extra parameter representing execution contexts – a topic for the next section.
Related work
A tripartite classification of approaches to SPARQL federation is explicit in Görlitz and Staab [12] and implicit in Betz et al. [7] and Hose et al. [20]:
Lookup-based federation: Also known as federation-by-traversal, or follow-your-nose traversal [4], this approach iteratively downloads and evaluates data by following links (usually as prescribed by the Linked Data principles [8]) from one dataset to another. The answer to a query is composed by cumulatively adding answers from incoming data during the traversal process. This approach is exemplified by e.g. [14–17] and [23].
Warehousing: Broadly understood, warehousing covers approaches that seek to collect or assemble all data of relevance ahead of query time into a repository that behaves as if it is a centralized single store. This area naturally shades off into cloud computing [32], distributed file systems, and big data technologies. Recent efforts in this direction focus on the use of cluster technology such as MapReduce and Hadoop e.g. [10,19,21,22,24] and [28], Spark e.g. [9] and GraphX e.g. [31].
Distributed query processing: Distributed query processing relies on analysing a query to identify a set of relevant RDF graphs – with or without the aid of statistics – to which subqueries can be assigned. Like federation-by-traversal and unlike warehousing, queries are evaluated remotely. Examples of this approach include [3,6,27,29] and [34].
If the issue were pressed, the present paper would be most naturally classified as belonging to the field of distributed query processing. However, it does not really advocate ‘an approach’ to SPARQL federation. Rather, it aims to articulate and develop a set of semantical primitives in terms of which any federation engine ought to be describable. To the authors’ knowledge it is the first foundational study of distributed query processing that allots to blank nodes a clearly described role in a strictly logico-mathematical theory.
It should be said that the tripartite classification mentioned above is very rough, and cuts across important and interesting distinctions in the literature. Conversely, there are subfields within federation that cut across it. It can plausibly be argued for instance that scalability and optimization are issues that are largely independent, or at least not tightly coupled with, any one particular approach. On the contrary, a lot of work seems to be going on here with more general significance.
To mention but a few, there is the area of adaptive query answering that explores how the evaluation of a query may be adjusted at run-time, e.g. by reordering joins [2] or by exploiting approximate membership functions to reduce HTTP requests [36]. There is also a fair bit of activity in the field of caching, data replication and load balancing. For instance, [26] explores the use of replicated fragments of data graphs in order to minimize the size of intermediate results, whereas [37,38] address client-server load balancing through the use of so-called Triple Pattern Fragments (TPFs).
As regards the relationship between the present theory and the SPARQL 1.1 federation extension, the two are essentially orthogonal to each other: the latter, by providing the SERVICE keyword for source selection, provides a way of evaluating different parts of a query against different endpoints, but it does not tell you which parts. The present paper, in contrast, is concerned precisely with the question of which parts one can rely on to retain soundness and completeness, i.e. it is concerned with defining decompositions that are logically well-behaved.
Adding execution contexts to the SPARQL semantics
The formal developments that follow are based on the set version of the SPARQL 1.1 semantics as defined in [5] – or at any rate, as far as it goes. We say “as far as it goes”, because that semantics is designed for the case where a single query is evaluated against a single source in a single execution context. But what is distinctive about the distributed setting, is that the overall answer to the distributed query is assembled from partial answers to partial queries each one of which represents a distinct execution context. These contexts matter a great deal from a semantical point of view as they constitute disjoint name spaces for blank nodes, which in turn influences the semantics of joins.
Execution contexts are not formally defined in the SPARQL 1.1 semantics, which therefore lacks the expressive power to define distributed query processing. However, the following passage provides a fairly robust basis for a formal reconstruction: “Since SPARQL treats blank node identifiers in a results format document as scoped to the document, they cannot be understood as identifying nodes in the active graph of the dataset. If DS is the dataset of a query, pattern solutions are therefore understood to be not from the active graph of DS itself, but from an RDF graph, called the scoping graph, which is graph-equivalent to the active graph of DS but shares no blank nodes with DS or with [the query]” [13].
The scoping graph mentioned here is a purely theoretical construct – it is just another namespace for blank nodes. In other words, SPARQL distinguishes between the scopes of a query, the queried data and the result, stipulating that blank nodes cannot be shared between them. This has important formal ramifications that will be explored shortly, after the requisite semantic concepts have been introduced.
The following notational conventions will be adopted from here on: curly braces are omitted from singletons in set-theoretic expressions as well as from arguments of functions if no confusion is likely to ensue, e.g. a instead of {a} and f(a) instead of f({a}). Also, when f is a function and A a subset of f’s domain, then f[A] will be used as a shorthand for the set of elements b such that f(a) = b for some a ∈ A. If f is a function, dom(f) and ran(f) are its domain and range respectively.
Turning now to RDF specifics, let I, B and L denote pairwise disjoint infinite sets of IRIs, blank nodes, and literals respectively. In conformity with the nomenclature of [5], IB abbreviates I ∪ B and T abbreviates I ∪ B ∪ L. These latter sets will be referred to collectively as RDF terms, and RDF terms will be denoted individually by u, with or without subscripts. Here, as everywhere else, indexes will be omitted when idle. An RDF triple is an element of IB × I × T. If it is mnemonically convenient, terms in triples may be denoted by their roles as subject, predicate and object of that triple; (s, p, o). If the elements of a triple are irrelevant to the discussion, an RDF triple will be denoted a, where the ‘a’ is meant to stand for ‘assertion’. An RDF graph is a finite set of RDF triples. RDF graphs are denoted G with or without subscripts. RDF graphs may also be referred to as RDF sources. These terms are intended to mean the same thing, but will tend to be used for different emphasis and in order to make the terminology more suggestive.
Turning now to SPARQL queries, let V be an infinite set of variables. Variables are denoted by lower case letters in the range of x to z with a question mark prepended. Variables may also be counted as terms, in which case they too are denoted u.
A basic graph pattern (BGP) is either
a singleton {t} where t is a triple pattern in (I ∪ V) × (I ∪ V) × (I ∪ L ∪ V), or
a union P1 ∪ P2 of BGPs P1 and P2.
Thus, t denotes triple patterns and P denotes BGPs. For any BGP P, var(P) denotes the set of variables in P. In order to avoid tedious limiting cases in proofs, it will be assumed that the set of triple patterns is disjoint from the set of RDF triples, that is, a triple pattern is assumed to contain at least one variable. By Definition 4.1(1), it is also assumed not to contain blank nodes. These assumptions incur no loss of generality.
Following [5], SPARQL queries will be identified with basic graph patterns. In other words, the distinction between the syntax and the semantics of SPARQL queries will deliberately be blurred. Concretely, this means that queries will be treated as sets of triple patterns rather than as syntactic entities in and of themselves. This is mathematically convenient and simplifies formal developments without loss of precision. Abiteboul et al. [1] establishes precedent in this respect.
In the SPARQL 1.1 specification, answers to simple conjunctive queries, aka. basic graph patterns, are formalized in terms of solution mappings, which are partial functions μ of type V → T. The relevant definition, giving the answer to a basic graph pattern P evaluated over an RDF graph G, reads:
Let G be an RDF graph, and P a BGP. Let μ(t) denote the triple obtained by replacing the variables in t according to μ, and let var(t) denote the set of variables occurring in t. Then
[[t]]_G = {μ : dom(μ) = var(t) and μ(t) ∈ G}, if P is a triple pattern t, or
[[P1 ∪ P2]]_G = [[P1]]_G ⋈ [[P2]]_G, if P = P1 ∪ P2.
Here μ(P) denotes the set of all μ(t) for t ∈ P. Individual μ will be called solutions whereas sets [[P]]_G will be called answer sets. The join ⋈ of answer sets is defined as the set of pairwise unions of compatible maps, where μ1 and μ2 are compatible if x ∈ dom(μ1) ∩ dom(μ2) implies μ1(x) = μ2(x).
This is the standard definition of answer sets as it looks in the set version of the SPARQL 1.1 semantics. Yet, keeping the SPARQL 1.1 specification’s notion of a scoping graph in mind, it is already not entirely accurate – not even for a single source and a single execution context. For if a solution μ to t over G happens to contain a blank node, it follows that μ(t) is not strictly speaking an element of G. Rather, the relationship between μ(t) and G is a weaker one: μ being a solution to t over G means that there is an a ∈ G such that μ(t) and a are the same up to renaming of blank nodes. This in turn is just to say that μ(t) and a simply entail each other, where the following theorem from [19] may serve as the definition of simple entailment:
(Simple entailment).
An RDF graph G1 simply entails an RDF graph G2, written G1 ⊨ G2, iff there is an RDF homomorphism h from G2 to G1.
The concept of an RDF homomorphism that figures in this definition will be defined shortly. Henceforth, whenever two answer sets are said to be the same, this is intended to mean that they simply entail each other, i.e. that they are simply equivalent, represented symbolically as ≡.
The upshot of this is that even in the single-source case, SPARQL semantics can be seen to have an equivalential basis, and is thus not really expressible in terms of membership alone. This is not all bad news. On the contrary, it fits hand-in-glove with distributed query processing, for as soon as the equivalential basis of answer sets over a single source is made formally explicit it generalizes straightforwardly to answer sets obtained by combining solutions from different execution contexts.
The notational apparatus will have to be amended slightly. Starting with a countably infinite set C of contexts (definition pending), what is required is a family of pairwise disjoint sets of blank nodes to match. It suffices to stipulate that {B_c}, for c ∈ C, is the family of sets obtained by making a copy of B in which each element is indexed by the context c. To lighten the notation, the union of all these sets of blank nodes, one for each context, will henceforth be denoted B_C. The set of SPARQL terms may now be defined as T_S := I ∪ B_C ∪ L.
This is to be kept distinct from T, the set of RDF terms. The latter is the stuff that RDF graphs are made of, the former is the stuff that solutions to queries are made of. The same blank node is never contained in both.
Returning now to RDF homomorphisms:
Given two RDF graphs G1 and G2, an RDF homomorphism from G1 to G2 is a function h on terms such that all of the following hold:
h(u) = u for u ∈ I ∪ L,
h(u) ∈ T ∪ B_C if u ∈ B ∪ B_C, and
(h(s), h(p), h(o)) ∈ G2 if (s, p, o) ∈ G1.
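Simple entailment in the sense of Theorem 4.1 can be decided by brute force on small graphs: search for a mapping that is the identity on IRIs and literals and sends the blank nodes of the entailed graph to terms of the entailing one. The sketch below is an illustrative, exponential-time checker under that reading, not an implementation from the paper.

```python
# Brute-force simple entailment: g1 entails g2 iff some mapping h that
# fixes IRIs and literals and maps blank nodes to arbitrary terms of g1
# sends every triple of g2 into g1. Exponential, but fine for toy graphs.

from itertools import product

def is_bnode(t):
    return t.startswith("_:")

def entails(g1, g2):
    """True iff there is an RDF homomorphism from g2 into g1."""
    bnodes = sorted({t for tr in g2 for t in tr if is_bnode(t)})
    terms = sorted({t for tr in g1 for t in tr})
    if not bnodes:
        return g2 <= g1
    for image in product(terms, repeat=len(bnodes)):
        h = dict(zip(bnodes, image))
        mapped = {tuple(h.get(t, t) for t in tr) for tr in g2}
        if mapped <= g1:
            return True
    return False
```

For instance, {(:a, :p, :b)} entails {(_:x, :p, :b)} via _:x ↦ :a, but not conversely, since the IRI :a cannot be renamed away.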
This definition differs from the standard only in that the range of functions has been extended with the new (countable) set B_C of blank nodes. As for the concept of an execution context, viz. the following definition:
(Execution context).
An execution context is a bijection c : T → I ∪ B_c ∪ L which is such that
u ∈ I ∪ L implies c(u) = u,
u ∈ B implies c(u) ∈ B_c.
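The definition can be mirrored operationally: a context is the identity on IRIs and literals and relabels blank nodes into a pool reserved for that context alone. The suffix convention below is an illustrative encoding of the indexed copies B_c, not the paper's notation.

```python
# A concrete execution context in the sense of Definition 4.4: the
# identity on IRIs and literals, and an injection of blank nodes into
# a set B_c private to this context (modelled by suffixing the label).

def make_context(label):
    def c(term):
        if term.startswith("_:"):      # u in B implies c(u) in B_c
            return term + "@" + label
        return term                    # u in I or L implies c(u) = u
    return c

def apply_context(c, graph_or_solution):
    """Apply c pointwise to a set of triples or to a solution mapping."""
    if isinstance(graph_or_solution, dict):
        return {v: c(t) for v, t in graph_or_solution.items()}
    return {tuple(c(t) for t in tr) for tr in graph_or_solution}
```

Applying a context pointwise to the triples of a graph illustrates the content of Lemma 4.2: the relabelling preserves triples, so it behaves as a homomorphism onto its image.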
The idea now is to let the partial map μ, as defined in the standard semantics, figure merely as the mathematical relationship between a graph pattern and a graph that is to be portrayed by solutions proper. Stated differently, μ is no longer considered a solution per se; rather, a solution is a mapping μ modulo a context c that provides the solution with a unique set of names for its blank nodes: a solution proper is the composite map c ∘ μ, and the contextualized answer set [[P]]_G^c is [[P]]_G with each μ replaced by c ∘ μ.
The following pair of lemmas is unglamorous but important:
An execution context c is an RDF homomorphism from μ(P) to (c ∘ μ)(P).
Suppose , and for , where , , . Then, by Definition 4.6(1) i.e. . By the RDF semantics, from which it follows that by Definition 4.5(1), hence . Now, assume wlog. that , , as the case for , would proceed as for above.
By Definition 4.5(1) and (2), we then have , and by Definition 4.4(1) we have .
Therefore and hence c satisfies Definition 4.3(3). That c also satisfies Definition 4.3(1) and (2) is immediate from Definition 4.4. □
The converse c⁻¹ of a context c is an RDF homomorphism.
By Lemma 4.2, c is a homomorphism which by Definition 4.4 is bijective. Hence c⁻¹ is a graph homomorphism. To verify that it is in fact an RDF homomorphism, it suffices to note that Definition 4.3(1) imposes the same restriction as Definition 4.4(1) and that Definition 4.4(2) is a stronger requirement than Definition 4.3(2). □
Taking stock so far, there is no longer such a thing as the set of answers to a BGP P over a graph G, strictly speaking. When blank nodes are involved there are infinitely many distinct such sets, one for each context c. These sets are all simply equivalent, though, which just means that they answer the query in essentially the same way:
If c and d are execution contexts, then [[P]]_G, [[P]]_G^c and [[P]]_G^d are all simply equivalent, irrespective of the choice of c and d.
By Lemmas 4.2 and 4.3, c is an RDF homomorphism from [[P]]_G to [[P]]_G^c and c⁻¹ is an RDF homomorphism in the converse direction. By Theorem 4.1, then, [[P]]_G ≡ [[P]]_G^c. By similar reasoning [[P]]_G ≡ [[P]]_G^d, and by the property of euclideanness for equivalence relations [[P]]_G^c ≡ [[P]]_G^d. □
Corollary 4.4 thus expresses formally what was hinted at in Section 2 when it was said that the ‘sameness’ of answer sets would have to be made precise.
Turning now to distributed query answering and its semantics, its formalization is a matter of generalizing the ⋈ relation to apply to answer sets from different sources, parametrized by different execution contexts, as in [[P1]]_G1^c ⋈ [[P2]]_G2^d. Such expressions will henceforth be stipulated to denote the set of unions of pairs of compatible maps c ∘ μ1 and d ∘ μ2. There are a few things to note:
There is in general no context e such that [[P1]]_G1^c ⋈ [[P2]]_G2^d = [[P1 ∪ P2]]_(G1 ∪ G2)^e.
It suffices to argue the case where G1 = G2 = G. Suppose for reductio that there is such a context e for every P and every source G. Then it is possible to choose P, G, c and d in such a way that some solution in [[P1]]_G^c ⋈ [[P2]]_G^d binds variables to blank nodes in B_c and in B_d, although every solution in [[P1 ∪ P2]]_G^e takes its blank nodes from B_e alone. But then B_e ∩ B_c ≠ ∅ and B_e ∩ B_d ≠ ∅, either of which contradicts clause (2) of the Definition 4.4 of contexts. □
What this means is that there is in general no algebra on contexts and sources that allows one to reduce [[P1]]_G1^c ⋈ [[P2]]_G2^d to a single execution context. In particular we do not have [[P1]]_G1^c ⋈ [[P2]]_G2^d = [[P1 ∪ P2]]_(G1 ∪ G2)^e, or anything of the sort. This is as it should be. Distributed queries do not run in a single execution context. Theorem 4.5 simply reflects this fact.
As a consequence of this, there are two different things that need to be recognized as solutions to a query: first, any solution to the query over a single source, and secondly, any solution obtained by joining partial answers from different sources or contexts. Call the latter intercontextual joins, and let them be denoted with ρ. Intercontextual joins generalize cross-site joins: whereas a cross-site join is an intercontextual join, the converse does not necessarily hold. The ρ notation will be used primarily as a convention for indicating in the relevant parts of a proof or line of reasoning that an appeal to a context is neither required nor apt. We have:
Let and be two RDF graphs that are standardized apart. If then there is a context e such that and .
By Definition 4.6(2) it suffices to show the property for a in . So suppose . Then by dataset monotony, whence for in the first and in the second of the join arguments. By Corollary 4.4 for an arbitrarily chosen e, and for the same e. From the former it follows by Theorem 4.1 that there is a pair of RDF homomorphisms such that maps into and maps back into . From the latter it follows that there is a pair linking and in the same way. The proof now proceeds from this to show that defined as is a solution to : since entails , it suffices to show that A) is an RDF homomorphism from to and that B) is a homomorphism in the converse direction.
For A) it suffices to show that is well-defined, i.e. that and are compatible. So, suppose for reduction that there is a term such that . By the definition of an RDF homomorphism u must be a blank node. However, the contexts c and d, by Definition 4.4, have disjoint ranges when restricted to blank nodes, so contrary to assumption.
For B) the reasoning is entirely similar to A) so suppose for reduction that there is a term such that . Then again u must be a blank node, by the definition of an RDF homomorphism. However and have been assumed to be standardized apart so contrary to assumption. □
As this argument can obviously be extended to all finite unions of BGPs by a straightforward induction, it is in effect akin to a soundness result: as long as execution contexts do not share blank node identifiers with one another or with RDF graphs, and as long as the scopes of RDF graphs are treated as mutually exclusive, distributed query processing, understood as assembling intercontextual joins across sources and contexts, never produces solutions that are not warranted by the union of those sources.
There is always a solution
To recapitulate briefly, the point of introducing an explicit parameter for execution contexts is to be able to reason about the total information content held by a set of answers each of which was generated by a different query over a distinct source – the total information content being given by the join over all these sets. Now, as each query will set up a separate scope for blank node identifiers that is disjoint from all others, answer sets cannot in the distributed case be joined freely. In particular, any join on blank nodes, if it spans two or more execution contexts, must be deemed a false positive. Conversely, any join on blank nodes that refers to a single context is legitimate. In other words, execution contexts are essentially a way to keep tabs on legitimate and illegitimate joins.
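To make this bookkeeping concrete, here is a minimal sketch, in our own notation rather than the paper's formalism, of a compatibility check in which every blank-node binding carries the execution context that produced it. The dictionary representation of solution mappings and the `compatible` helper are illustrative assumptions, not part of the formal semantics.

```python
def compatible(mu1, mu2):
    """mu1, mu2: dicts mapping variables to (kind, value, ctx) triples,
    where kind is 'iri', 'lit' or 'bnode', and ctx names the execution
    context that produced a blank node (None for IRIs and literals)."""
    for var in mu1.keys() & mu2.keys():
        k1, v1, c1 = mu1[var]
        k2, v2, c2 = mu2[var]
        if (k1, v1) != (k2, v2):
            return False          # plain disagreement on a shared variable
        if k1 == 'bnode' and c1 != c2:
            return False          # a join on a blank node across contexts
                                  # is a false positive, so reject it
    return True

# A join on an IRI across contexts is legitimate:
a = {'?x': ('iri', 'ex:alice', None), '?y': ('bnode', '_:b1', 'c1')}
b = {'?x': ('iri', 'ex:alice', None)}
# A join on a blank node across contexts is not:
c = {'?y': ('bnode', '_:b1', 'c2')}
```

Here `compatible(a, b)` succeeds, while `compatible(a, c)` fails even though the blank-node identifiers coincide, since the identifiers derive from different contexts.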
The present section is concerned with the first of the tasks defined towards the end of Section 2, which, now that execution contexts have been introduced into the SPARQL semantics, it is possible to address: the aim is to show that, irrespective of the shape of graphs, if a solution to a query is warranted by the merge of the contributing sources, then there is a decomposition from which that solution can be reconstructed.
The investigative strategy that is adopted here is to reverse-engineer the problem by showing that every solution over the merge of the contributing sources induces a decomposition that fits the bill. More specifically, the idea is to let those parts of the given solution that have joins on blank nodes dictate how the corresponding query pattern is to be decomposed; for if a graph pattern is connected by blank nodes, it must derive from a single source, so the subquery that matches it must be grouped and evaluated in a single execution context.
Thus, we shall be interested in those subpatterns of a global query P that select subgraphs of the source being queried that are connected by blank nodes in the following sense:
(b-connectedness).
Let G and be RDF graphs, then
is b-connected,
is b-connected if G is b-connected and G and a share a blank node.
It is worth noting that, in the limiting case, the definition makes b-connected every triple pattern that, parameterized by , does not select an assertion containing blank nodes.
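The maximal b-connected subsets of a graph can be computed by merging triples that share a blank node, transitively, while ground triples remain singletons. The following is a sketch under our own encoding assumptions (blank nodes as strings prefixed with `_:`), not an implementation from the paper.

```python
def is_bnode(term):
    return term.startswith('_:')

def b_components(graph):
    """Maximal b-connected subgraphs: union-find over triples, merging
    every pair of triples that mention a common blank node."""
    graph = list(graph)
    parent = list(range(len(graph)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        parent[find(i)] = find(j)

    seen = {}                               # blank node -> first triple index
    for i, triple in enumerate(graph):
        for term in triple:
            if is_bnode(term):
                if term in seen:
                    union(i, seen[term])
                else:
                    seen[term] = i

    comps = {}
    for i in range(len(graph)):
        comps.setdefault(find(i), set()).add(graph[i])
    return list(comps.values())

g = [('_:x', 'ex:p', '_:y'),    # linked to the next triple via _:y
     ('_:y', 'ex:q', 'ex:o'),
     ('ex:s', 'ex:r', 'ex:o')]  # ground triple: its own component
```

On `g` this yields two components: the two triples chained through `_:y`, and the ground triple on its own, mirroring the behaviour the definition demands in the limiting case.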
The definition of a b-connected set (wrt. ) induces a corresponding notion of a b-component (wrt. ):
(b-component).
Let be any solution in , where is assumed to be standardized apart, and let . Then is a b-component of P iff is a maximal b-connected subset of .
Note that b-connected sets are RDF graphs, whereas b-components are SPARQL query patterns. Consider the following example:
Consider the merge of the RDF graphs A and B in Fig. 3 and the query in Fig. 4. Let be the solution that assigns
Then the b-components of the query, relative to , are given by the decomposition in Section 2, namely
Although Example 5.1 may seem to indicate otherwise, it is a fact of some importance that being a b-component does not entail that P is connected. Consider the following example:
Put where b is assumed to be a blank node. Let . Then any maps both and to b. Since G is connected, and maximally so, is consequently a b-component. Yet, P itself is not connected.
The significance of this is that one cannot limit one's view to connected subqueries when searching for a complete federation scheme. There will be more to say about this in the next section.
For stocktaking and for the sake of contrast, it might be instructive to review a couple of cases that are excluded by Definitions 5.1 and 5.2: First of all, and rather obviously, a maximal b-connected set is not the same as a set which is maximally such that each pair of triples is connected by a blank node. Consider Fig. 9, where for simplicity we let the properties name the triples they belong to. The subgraphs that are pairwise b-connected, and maximally so, are and . The triples denoted by and do not share blank nodes and therefore do not belong to the same pairwise b-connected set, i.e. they constitute a cutoff point for the purposes of this example. In contrast, the only maximal b-connected subgraph of Fig. 9 according to Definition 5.1 is , that is, the whole graph.
Considering how blank nodes are renamed in each execution context, one can appreciate why it is that must be deemed to reflect the correct concept of b-connectedness: Since and themselves share blank nodes, it follows that if they did induce separate b-components with variables in the positions occupied by the blank nodes, then the blank node that is shared between them would be renamed in two separate execution contexts whence that join would be missed.
Elaborating on this example a bit, the lopsidedness of the concept of pairwise b-connectedness could be counteracted by stipulating instead that a b-connected set is one in which all pairs of triples are connected by an undirected path on which all interior nodes are blank nodes. It is easily verified that the only subgraph of Fig. 9 which is maximal in this respect is the entire graph, as this example demands.
However, it is not difficult to rig up another graph which shows that this time the target has been overshot, insofar as the latter concept is not discriminating enough. Consider the graph in Fig. 10. There is a path of the kind described between every pair of triples, so the graph itself is the only connected component according to this latter concept of connectedness. However, and contain no blank nodes at all, so any query that does not treat the corresponding triple patterns as separate subqueries will risk losing cross-site joins. In other words, the maximal b-connected components of the graph in Fig. 10 should be deemed to be , and , which is precisely what Definition 5.1 yields.
Fig. 9. Counterexample 1.
Fig. 10. Counterexample 2.
It will be convenient to have a notation that assembles all the b-components that a particular solution induces:
The set of b-components of P relative to is denoted by .
The next two lemmas record some important properties about f as so defined.
partitions P.
According to Definition 5.1 the empty set is not b-connected, hence any is non-empty. For exhaustiveness we need to show that
The right-to-left inclusion is trivial. For the converse suppose . If t is a b-component then , so there is nothing to prove. If not then, since is vacuously b-connected, t is not maximal in this respect. In other words, can be extended to a set such that is b-connected and is maximal among the subsets of P for which this holds. Hence , which completes the verification of exhaustiveness.
It remains to show that the elements of are pairwise disjoint. So, let , be an arbitrarily chosen pair of elements of and assume for the sake of contradiction that for . There are two cases to consider:
contains no blank nodes: Then, is b-connected according to Definition 5.1, and there is clearly no such that and is b-connected. Hence, by Definition 5.2 is itself a b-component, from which it follows that and therefore contrary to assumption.
contains a blank node: There are two subcases: either , or .
: Then is not maximally b-connected as per Definition 5.2, hence is not a b-component (contradiction).
: Since they are both b-components, then by Definition 5.2, and are each b-connected. However, since they overlap on t, this means that is also a b-component contradicting the maximality of and . □
f is a function.
Let and suppose , , we need to show that . Suppose for reduction that , say wlog. . Since it is a b-component of P relative to . By Lemma 5.1, is a partition of P, so . However, , thus there are two cases to consider:
is a proper subset of a cell in , that is, for some . Then contrary to assumption is not a b-component.
is split among cells in , that is, there exists a s.t. but . Since and are b-components, and each satisfy Definition 5.1, hence are b-connected. But since this is true also of contradicting the maximality of and in this respect. □
The combined significance of these two lemmas may be explained as follows: Lemma 5.2 fixes a single set of subpatterns of P, which according to Lemma 5.1 mutually exhaust P. Taken together, therefore, they show that talk of as one decomposition of P is legitimate. This decomposition has special properties:
Let P be a query, a set of sources that are standardized apart, and . Then, for any there exists a source such that and for any execution context c.
Put . Then . By Corollary 4.4 it suffices to show that for some . The proof subdivides into two cases:
contains blank nodes: Suppose for reduction that there is no single with . Then means that must span at least two sources . Now, is b-connected, since , so it is possible to choose triple patterns such that , and such that is b-connected. By Definition 5.1 and share a blank node. However, this contradicts the assumption that and are standardized apart.
does not contain a blank node: Then is a singleton , so the desired result follows immediately. □
In words, Lemma 5.3 demonstrates that every element of is such that there exists a single source in the set of sources in question capable of answering in a manner equivalent to . Needless to say, therein lies the essence of federation, or more accurately, one half of it. The other “half” is given by Lemma 5.4:
Suppose , and . Then .
Suppose the conditions of the theorem hold. Since it suffices to show . By Theorem 4.1, this in turn reduces to demonstrating the existence of an RDF homomorphism from left to right and one from right to left. By the supposition of the theorem the property holds for each of the component patterns. That is, there is a pair of RDF homomorphisms of to and back, and a pair of to and back. We show that is a pair of RDF homomorphisms of to and back.
For , it is necessary to verify first of all that it is well-defined, i.e. that for any if then . The verification subdivides into two cases depending on the sets to which u and v belong. To ease the notation put .
. If then . Similarly, if then . Suppose therefore wlog. that and . Then it follows by Definition 4.4(2) that contrary to assumption.
: Then since and are RDF homomorphisms, they are the identity on u and v, whence
It remains to verify that is an RDF homomorphism, so suppose . Then either or . Hence, either or . In both cases is an RDF homomorphism, since both and are. This completes the case for .
For , put . To verify that is well-defined it suffices to check the case where , the other one being similar to that for .
Suppose that either or holds. In the former case and in the latter . The desired result follows immediately from either by the functionality of and . It suffices, therefore, to confirm that and do not share blank nodes.
Suppose for contradiction and without loss of generality that and . Then is b-connected as per Definition 5.1. Choose such that . Then . Since partitions P and it follows that . Therefore so is not maximal in the sense of Definition 5.2. This contradicts and completes the verification that is well-defined.
The verification that is an RDF homomorphism is exactly like that for , so the proof is complete. □
It may be worthwhile to pause to register what it is that makes this proof work: it is essentially a question of building larger RDF homomorphisms from smaller ones. This requires certain simple conditions to be met, most importantly that the union of functions be a function. Since the smaller functions in question are RDF homomorphisms, and RDF homomorphisms by definition map IRIs and literals to themselves, the only way this could go wrong is if the mapped element were a blank node. However, this cannot happen for the left to right direction, since execution contexts have disjoint ranges, and it cannot happen for the right to left direction since the solutions in question do not share blank nodes. Thus the theorem flows from two things: the relativization of answer sets to execution contexts and the splitting of queries into b-components.
It is now possible to prove the converse of Theorem 4.6 and the main result of this section:
Let be a set of sources, standardized apart, and let . Then there is a partition of P such that , for arbitrarily chosen and .
Suppose the conditions of the corollary hold. Put . By Lemma 5.1 we have . Lemma 5.3 entails that for every there exists a source, say , a context, say , and a solution mapping, say , such that and . By extension of Lemma 5.4 to finite unions it follows that . Therefore, selecting proves the corollary. □
Whereas Theorem 4.6 says that cross-site joins can never be incorrect, provided that tabs are kept on execution contexts, Corollary 5.5 states that every solution to a query P returned by the merge of the contributing sources is contained in the join of the separate evaluation of the cells of some decomposition of P. Thus there is always a solution distributed in the contributing sources if there is a solution in their union modulo renaming of blank nodes – i.e. in their merge.
What Corollary 5.5 does not say is how to obtain this decomposition when we do not know the structure of the sources. It was only possible to prove that a decomposition exists by assuming prior knowledge of a solution that could be used to partition the query. However, different solutions may require different decompositions. Yet, not all solutions can be assumed to be available in advance if federation is to produce information one does not already possess. It remains, therefore, to abstract from Theorem 4.6 and Corollary 5.5 a general federation scheme that will deliver all solutions, and only solutions, over any set of contributing sources in the absence of any knowledge about them. For this, we shall need the concept of a federation scheme.
Federation schemes
The concept of a federation scheme was first introduced in [35] as a mathematical abstraction for reasoning about the distributed query answering process. It is meant to highlight the two complementary aspects of this process, which may be taken to consist, respectively, in decomposing a query over a set of sources and in assembling an answer to it from the partial answers returned by each contributing source. A federation scheme is thus a pair of a decomposition function δ and an evaluation rule. Intuitively, a decomposition function is responsible for assigning subqueries to contributing sources, whereas an evaluation rule is responsible for applying joins and unions (or in general any operation on answer sets) in the right measure and order.
The soundness and completeness of a federation scheme is a relationship that holds wrt. a set of sets of contributing sources – or, in a more suggestive terminology, a set of selections of contributing sources: the federation scheme specifies how a graph pattern is to be distributed and evaluated over a set/selection of sources. Conversely, the set of selections defines the permissible ways of choosing sources over which to distribute the decomposed query, where permissible means that soundness and completeness of query answering is preserved. Stated differently, one might say that the set of selections specifies in semantic terms the assumptions about the structure of the data that are built into the federation scheme.
The present section recalls the essentials of this formal framework, and generalizes it to the new equivalence-based semantics for distributed queries as set out in the previous section.
The minimal amount of knowledge one needs about a set of sources in order to distribute a query over them, is their signatures. Here a signature is understood essentially as the set of concrete predicates of an RDF graph or of a SPARQL pattern. The following definition covers both cases:
Let S be a BGP or a set of RDF triples. The signature of S written is defined as
where denotes the projection of S onto its second coordinates.
Note that variables and blank nodes do not add to the signature of a graph pattern, and that blank nodes do not add to the signature of an RDF graph. Using signatures to route subqueries is a common stratagem in the literature (cf. [11] and [34]).
If P is non-empty, then
Otherwise .
It is a simple consequence of Definitions 6.1 and 6.2 that if for some triple pattern a, that is, if a has a variable in predicate position, then for any . A consequence of this is that a triple pattern that has a blank node or variable in predicate position will be routed to all sources in a selection.
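As an illustration of signature-based routing, the following sketch adopts our own encoding assumptions (terms as strings, variables prefixed with `?`, blank nodes with `_:`); the helper names are ours, not the paper's. Per Definition 6.1, only concrete predicates contribute to a signature, so a triple pattern with a variable in predicate position is routed to every source in the selection.

```python
def signature(triples):
    """Set of concrete predicates; variables and blank nodes in
    predicate position contribute nothing (Definition 6.1)."""
    return {p for (_, p, _) in triples
            if not p.startswith('?') and not p.startswith('_:')}

def covering_sources(pattern, sources):
    """Names of the sources (a dict of graphs) whose signature covers
    that of the pattern."""
    sig = signature(pattern)
    return [name for name, g in sources.items() if sig <= signature(g)]

sources = {'A': [('s', 'ex:p', 'o')],
           'B': [('s', 'ex:q', 'o')]}
```

Here `covering_sources([('?x', 'ex:p', '?y')], sources)` routes the pattern to source `A` alone, whereas a pattern with a variable predicate has the empty signature and is routed to both `A` and `B`.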
(Decomposition function).
A decomposition function is a binary function δ that takes a set of RDF graphs and a BGP P and returns a set of pairs such that and such that all of the following hold:
,
if and then ,
.
Note that this definition does not necessarily partition P; whether it does depends on the signatures of the sources. Clause (1) prevents a decomposition function from behaving erratically when it comes to assigning subqueries to sources. That is, a BGP is only assigned to sources that can potentially answer it. Clause (2) ensures that the value of a decomposition function on any selection of sources and any BGP P is itself a function, that is, that every subquery of P is assigned to exactly one subset of . Finally, clause (3) expresses a necessary condition for soundness: δ distributes P over only if P is decomposed into subqueries that jointly exhaust P. Note that δ as so defined is a partial function, that is, (1) and (3) cannot be satisfied for arbitrary choices of and P, reflecting the fact that the accumulated signature of the sources may not cover the signature of the global query. If it does, then δ will henceforth be said to be defined at and P.
Examples of decomposition functions
Section 2 mentioned the notion of an exclusive group of a query P relative to a selection of sources . This concept can be defined as:
The subpattern of BGP P that is exclusive to an RDF graph G, relative to a set of RDF graphs , is the set:
A triple pattern that is not an element of an exclusive group is called a non-exclusive triple pattern relative to the same selection of graphs.
The following theorem, the proof of which can be found in [35], gives examples of decomposition functions:
Let be any set of RDF graphs and P any BGP. Stipulate that iff and one of the following conditions holds:
Even decomposition: is a singleton.
Standard decomposition: either
is a non-exclusive singleton, or
for some.
Prudent decomposition: either
is a non-exclusive singleton, or
is a maximal connected subset of for some .
Then δ is a decomposition function in the sense of Definition 6.3.
The effect of the second clause under the prudent decomposition is to avoid useless cartesian products in cells of the decomposition, something that is not excluded by the standard decomposition. The reader may wish to consult [35] for more detail.
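The exclusive groups on which the standard and prudent decompositions rely can be sketched as follows; the encoding (variables as `?`-prefixed strings, sources as a dict of graphs) and the helper names are our own illustrative assumptions.

```python
def signature(triples):
    return {p for (_, p, _) in triples if not p.startswith('?')}

def exclusive_group(pattern, name, sources):
    """Triple patterns of `pattern` whose signature is covered by the
    source `name` and by no other source in the selection `sources`."""
    return {t for t in pattern
            if signature([t]) <= signature(sources[name])
            and not any(signature([t]) <= signature(g)
                        for n, g in sources.items() if n != name)}

# ex:p occurs only in G1 and ex:q only in G2, while ex:r occurs in both,
# so the ex:r pattern is non-exclusive.
sources = {'G1': [('s', 'ex:p', 'o'), ('s', 'ex:r', 'o')],
           'G2': [('s', 'ex:q', 'o'), ('s', 'ex:r', 'o')]}
pattern = [('?x', 'ex:p', '?y'), ('?x', 'ex:q', '?y'), ('?y', 'ex:r', '?z')]
```

On this selection, `exclusive_group(pattern, 'G1', sources)` contains only the `ex:p` pattern, and the `ex:r` pattern belongs to no exclusive group at all.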
Query LS5 from FedBench.
Consider the query in Listing 1. This is query LS5 from the FedBench suite [33] which asks for all drugs from Drugbank, together with the URL of the corresponding page stored in KEGG and the URL to the image derived from ChEBI.
The signature of the query divides among the FedBench datasets as specified in Table 1. Presupposing this division, Table 2 gives the even, standard and prudent decompositions respectively. Blocks in a column correspond to subqueries and are labelled with the datasets to which that subquery is assigned. For instance, the standard decomposition assigns the subquery consisting of lines 2 and 3 to Drugbank, since lines 2 and 3 form an exclusive group for it. The non-exclusive triple pattern in line 4, in contrast, is assigned to both KEGG and ChEBI.
Decomposition of the signature of query LS5

Signature element/property    Dataset
rdf:type                      All
drugbank:keggCompoundId       Drugbank
drugbank:genericName          Drugbank
bio2rdf:url                   KEGG, ChEBI
purl:title                    KEGG, ChEBI
bio2rdf:image                 ChEBI
Decomposition of LS5 over KEGG, ChEBI and Drugbank

Nr.  Triple pattern                             Even         Standard     Prudent
1    ?drug rdf:type drugbank:drugs              All          All          All
2    ?drug drugbank:keggCompoundId ?keggDrug    Drugbank     Drugbank     Drugbank
3    ?drug drugbank:genericName ?drugBankName   Drugbank     (with 2)     (with 2)
4    ?keggDrug bio2rdf:url ?keggUrl             KEGG, ChEBI  KEGG, ChEBI  KEGG, ChEBI
5    ?chebiDrug purl:title ?drugBankName        KEGG, ChEBI  KEGG, ChEBI  KEGG, ChEBI
6    ?chebiDrug bio2rdf:image ?chebiImage       ChEBI        ChEBI        ChEBI
The standard and prudent decompositions of LS5 over KEGG, ChEBI and Drugbank are identical for this particular selection of datasets. The difference between them is revealed by reducing the selection to ChEBI and Drugbank only, viz. Table 3. Here, the standard decomposition assigns the subquery consisting of lines 4, 5 and 6 to ChEBI, since this larger set now constitutes an exclusive group for ChEBI. Note, however, that its join-graph is not connected, since none of the variables in line 4 occur in line 5 or 6. Therefore, the prudent decomposition divides it into its two join-connected components and , which are both assigned to ChEBI, only this time as separate subqueries.
Decompositions of LS5 over ChEBI and Drugbank

Nr.  Triple pattern                             Even      Standard    Prudent
1    ?drug rdf:type drugbank:drugs              Both      Both        Both
2    ?drug drugbank:keggCompoundId ?keggDrug    Drugbank  Drugbank    Drugbank
3    ?drug drugbank:genericName ?drugBankName   Drugbank  (with 2)    (with 2)
4    ?keggDrug bio2rdf:url ?keggUrl             ChEBI     ChEBI       ChEBI
5    ?chebiDrug purl:title ?drugBankName        ChEBI     (with 4)    ChEBI
6    ?chebiDrug bio2rdf:image ?chebiImage       ChEBI     (with 4)    (with 5)
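The split performed by the prudent decomposition can be sketched as computing the connected components of a subquery's join-graph, in which triple patterns are linked by shared variables. The encoding below (variables as `?`-prefixed strings) is our own; the patterns mirror lines 4-6 of LS5.

```python
def variables(t):
    return {x for x in t if x.startswith('?')}

def join_connected_components(pattern):
    """Maximal subsets of `pattern` whose join-graphs are connected,
    i.e. patterns linked, possibly transitively, by shared variables."""
    comps = []                      # list of (patterns, variables) pairs
    for t in pattern:
        vs = variables(t)
        linked = [c for c in comps if c[1] & vs]
        for c in linked:
            comps.remove(c)         # merge all components t touches
        comps.append(({t}.union(*(c[0] for c in linked)),
                      vs.union(*(c[1] for c in linked))))
    return [p for p, _ in comps]

t4 = ('?keggDrug', 'bio2rdf:url', '?keggUrl')
t5 = ('?chebiDrug', 'purl:title', '?drugBankName')
t6 = ('?chebiDrug', 'bio2rdf:image', '?chebiImage')
```

On `[t4, t5, t6]` this yields the two components `{t4}` and `{t5, t6}`, which is exactly the split the prudent decomposition performs over ChEBI.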
Evaluation rules
Turning now to the notion of an evaluation rule, it suffices to define it abstractly as a function that produces a set of answers, in the form of intercontextual joins (cf. the previous section), from a decomposition:
(Evaluation rule).
An evaluation rule is a function that takes any decomposition and returns a set of variable mappings .
Obviously, this definition hides great variety. Next:
A federation scheme is a pair consisting of an evaluation rule and a decomposition function.
The following particular evaluation rule is a generalization to a semantics parametrized by contexts of the so-called collect-and-combine rule from [35]:
(Collect-and-combine rule).
Put . If δ is not defined at and P then , otherwise
where and is a vector of n distinct execution contexts.
The collect-and-combine rule is relatively simple if one looks beyond the bureaucracy that comes with execution contexts. It takes each element in a decomposition, executes it against each source that covers its signature, and collects the results. Then it forms the coordinatewise join of the resulting sets. Peeking ahead to the next section, this rule turns out to be too simple for the case where blank nodes are involved, but it may serve as a concrete example to aid intuition.
The reader should note that in order to be completely precise formally, the set ought to have had an additional parameter denoting the contexts . However as the sets and are simply equivalent for any choice of and and since expressions of this form will figure only as arguments to intercontextual joins, we let the contexts drop out of view at this point.
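The shape of the rule can be rendered as follows in a deliberately simplified sketch that ignores the context bookkeeping (which only matters when blank nodes occur). Solution mappings are modelled as frozensets of (variable, value) pairs, and `evaluate` is a caller-supplied stand-in for BGP evaluation, not a real SPARQL engine; all names are our own.

```python
def compatible(m1, m2):
    d1, d2 = dict(m1), dict(m2)
    return all(d1[v] == d2[v] for v in d1.keys() & d2.keys())

def join(A, B):
    return {m1 | m2 for m1 in A for m2 in B if compatible(m1, m2)}

def collect_and_combine(decomposition, evaluate):
    """decomposition: iterable of (pattern, sources) pairs. For each
    cell, collect the union of the answers from each assigned source,
    then combine the cells with a coordinatewise join."""
    result = None
    for pattern, sources in decomposition:
        cell = set()
        for g in sources:
            cell |= evaluate(pattern, g)
        result = cell if result is None else join(result, cell)
    return result

# Two cells with canned answer sets: only the ?x = a mappings join.
ans1 = {frozenset({('?x', 'a')}), frozenset({('?x', 'b')})}
ans2 = {frozenset({('?x', 'a'), ('?y', 'c')})}

def demo(pattern, g):
    return {'P1': ans1, 'P2': ans2}[pattern]
```

Running `collect_and_combine([('P1', ['G1']), ('P2', ['G2'])], demo)` keeps only the mapping that agrees on `?x`, discarding the incompatible `?x = b` branch.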
Completeness
Recapitulating briefly, the properties of soundness and completeness are relations that hold between federation schemes on the one hand and sets of selections of RDF graphs on the other. The former specifies how a graph pattern is to be distributed and evaluated, the latter defines the permissible ways of selecting sources over which to distribute the decomposed query.
One will want the federation scheme in question and the selection of sources to be perfectly matched, meaning that if a global query is processed according to the scheme it will neither invent solutions nor ignore any that are warranted by the sum total of information in the contributing sources. Recalling that sameness of solutions is defined in terms of simple equivalence, this gives the following definition of the soundness and completeness of a federation scheme wrt. a set of selections of sources:
Let be a set of selections of RDF graphs. A federation scheme is complete wrt. if for every , implies that there is a such that . It is sound if the converse holds.
This definition differs from that of [35] where soundness and completeness are defined in terms of set-inclusion as . Set inclusion is fine if one disregards blank nodes, i.e. assumes that there are none, for if there are no blank nodes then contexts are idle and answer sets do not need to be parameterized. The inclusion-based semantics of [5] will then be sufficiently expressive to give a characterisation of the semantics of distributed queries. In the opposite case, however, contexts are needed to keep tabs on the separate scopes for blank nodes. Therefore emphasis must be shifted from set-inclusion onto simple equivalence, as explained in Section 4. Definition 6.8 is adjusted accordingly.
Taking sets of selections of sources (that is, sets of sets of sources) as the semantic correlate of decomposition schemes facilitates the formalization of different degrees of knowledge about the structure and content of the sources that a query is distributed over. At the extreme end of this spectrum one finds the zero-knowledge case, i.e. the case where nothing is known about the tributary sources apart from their respective signatures. The zero-knowledge case, of course, corresponds to the absence of any restriction on how to make a selection of sources to distribute a query over, i.e. it corresponds to the set of all selections of sources. Stolpe [35] proves the following:
Suppose δ is the even, standard or prudent decomposition. Then is sound and complete wrt. the set of all selections of RDF graphs, provided none of the selected graphs contain blank nodes.
However, Theorem 6.2 does not generalize. That is, neither the even-, standard nor prudent decomposition guarantees completeness of query answering under the collect-and-combine rule when blank nodes are allowed (the reader may wish to refer back to the example in Section 4).
Summary of notation

Nomenclature    Description
t               a triple pattern
a               an assertion, aka an RDF triple
P               a conjunctive graph pattern
G               an RDF graph
                the signature/the set of predicates of P
                the sources in whose signatures cover P
                a set of RDF graphs
                the variables occurring in P
μ               a partial function from variables to RDF terms
                a solution mapping whose blank nodes derive from c
ρ               a join across multiple execution contexts
                the domain of
                P evaluated over G in context c
                P evaluated over the union of in context c
                the union of evaluating P over each
                the b-components of P relative to
A zero-knowledge complete federation scheme
Consider the following distribution function and evaluation rule:
Let be a set of sources, P a query pattern, and put . Then
It is immediate from this definition that if is defined at and P, then includes at least one partition of P. Indeed, peeking ahead to Theorem 7.1, it can be shown to contain a sufficient number of subsets of P to reconstruct any partition of P that is induced by a solution over the merge of . This is what matters to completeness, of course.
Let and P be as in Definition 7.1. For any distribution Δ of a query pattern P let denote the set of partitions of P contained in Δ. Then
It should be evident that is a straightforward generalization of . More precisely, is simply the union of the results of applying to each partition of P that the distribution function delivers. We have:
The federation scheme is sound and complete wrt. the set of all selections of sources.
For completeness, let be any selection of sources and suppose . By the definition of it suffices to find a set such that there is a with . Now, by Corollary 5.5 there is a partition of P such that, for arbitrarily chosen , and . We need to show first of all that this partition is an element of . This is immediate, because and implies that for every thus satisfying Definition 7.1. Now, by the definition of we have so it suffices to show that ρ is in the right hand side of this equality. By Definition 6.7 we have for every . By monotony for ⋈ under set inclusion, therefore, , as desired.
As for the soundness direction of the proof, this is just Theorem 4.6, so the proof is complete. □
A few comments on this result are in order: Although the completeness of is achieved, at the end of the day, by grouping together in a single subquery those triple patterns that are needed to capture a join on a blank node, this does not entail that the subqueries themselves are connected. Example 5.2 showed as much. Therefore, in the absence of knowledge about the structure of the data, the distribution function has to cast a very wide net indeed. thus contains one pair for every subset of P such that the signature of P in its entirety is covered by at least one source.
When one cannot make any assumptions about the structure and content of the contributing sources, there seems to be no other option than to evaluate all the possible partitions of the original query that involve only cells whose signature is covered by a source. Indeed, this can easily be seen to be not only a sufficient but also a necessary condition for completeness, for if one of these partitions is skipped it is straightforward to tailor a selection of sources that corresponds to just that partition.
Alas, in the limiting case where contains sources with identical signatures, the distribution is isomorphic to the powerset of P, which in turn means that the number of partitions that must be processed by the evaluation rule is the Bell number corresponding to the cardinality of P. This requires an exponential number of steps and so is not computationally tractable. As remarked by one of the reviewers, query evaluation is already computationally intractable, even for a single source that does not contain blank nodes and even for conjunctive queries. So it is not that sound and complete decomposition makes query evaluation intractable. Rather, it is that computing the decomposition adds another exponentially hard problem to the already intractable problem of query answering.
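To make the growth concrete, the Bell numbers can be computed with the Bell triangle; already a six-pattern query such as LS5 admits B(6) = 203 partitions, and ten patterns admit B(10) = 115975.

```python
def bell(n):
    """Number of partitions of an n-element set, via the Bell triangle:
    each row starts with the last entry of the previous row, and each
    subsequent entry adds the entry above it."""
    if n == 0:
        return 1
    row = [1]
    for _ in range(n - 1):
        nxt = [row[-1]]
        for x in row:
            nxt.append(nxt[-1] + x)
        row = nxt
    return row[-1]
```

For instance, `[bell(n) for n in range(1, 7)]` gives `[1, 2, 5, 15, 52, 203]`, illustrating how quickly the number of partitions the evaluation rule must process outruns the size of the query.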
The obvious remedy for this is to make sure that one has some knowledge about the sources, either by imposing requirements on them or by probing them. By ‘probing’ is here understood an initial round of queries, typically but not necessarily SPARQL ASK queries, designed to detect patterns in the data, typically concerning the placement of blank nodes, that can be exploited for decomposing the global query.
Recall that denotes the b-components of P parameterized by . A condition that simplifies things a great deal, and that involves both regimentation and probing, is the following:
(Uniformity).
A set of RDF graphs is a uniform selection of sources, or just uniform, if for all patterns P it is the case that if and then .
To be sure, uniformity is a lot to ask for. It is a global requirement that demands that the same type of information be encoded by the same graph pattern in all the sources where it occurs. Uniformity does not, however, limit the space of eligible patterns. That is, it does not say anything specific about how the data must look, only that an encoding pattern must be employed consistently across all contributing sources. This need not be as strict as it sounds, since it will be the case for instance if each source is individually consistent in this sense and if in addition the contributing sources all provide solutions to different subpatterns.
From a theoretical point of view, uniformity is interesting for two reasons. First, a very limited amount of probing suffices: in fact, a single solution to the global query is enough to determine a sound and complete federation scheme. Second, the evaluation rule only has to process a single partition, and thus reduces to . All this is recorded in the following simple definition and accompanying theorem:
(Induced distribution).
Let . The distribution induced by is defined as the function such that iff
,
, and
.
Let be a uniform selection of sources. Then is sound and complete wrt. .
Soundness is already established. For completeness, suppose . We need to show that there is a such that . Clearly there is a with the required properties since is induced by . But by uniformity so we are done. □
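Under uniformity, the single partition processed by the induced distribution can be read off from one solution to the global query. The following Python sketch illustrates the bookkeeping, assuming (hypothetically) that we are told, for each triple pattern, which source matched it in that solution; the index-based provenance map and the function name are illustrative and not part of the paper's formal definition.

```python
def induced_distribution(pattern, provenance):
    """Group the triple patterns of `pattern` into one cell per
    contributing source, as witnessed by a single solution.
    `provenance[i]` names the source that matched pattern[i]."""
    cells = {}
    for idx, src in provenance.items():
        cells.setdefault(src, []).append(pattern[idx])
    # One (cell, source) pair per source: the single partition that
    # the evaluation rule has to process under uniformity.
    return [(tuple(cell), src) for src, cell in cells.items()]
```

The point of the theorem is that, for uniform selections, routing each cell of this one partition to its witnessing source loses no answers, whereas in general every covered partition would have to be tried.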
Summing up, one might say that Theorems 7.1 and 7.2 constitute different extremes of a combinatorial spectrum. The former gives a complete federation scheme in which the number of required partitions of a query grows exponentially with the size of the query, whereas the latter gives a federation scheme in which the number of required partitions is constant. Any blend of probing and regimentation will land you somewhere in between.
Of course, with perfect knowledge about the structure of the sources, probing will not be necessary at all. A perfect-knowledge case would be one where the contributing sources can be expected to conform to a predefined schema, e.g. a resource shape [30]. In such a case, distributed query processing can be done efficiently without sacrificing the expressive power that comes with blank nodes.
Exploring the space between Theorems 7.1 and 7.2 should provide a rich field for future research.
Conclusion
It has been demonstrated that the presence of blank nodes in RDF data represents a problem for distributed processing of SPARQL queries. Even though the facts of the matter are fairly simple, they seem as yet to have slipped by unnoticed, and none of the usual decomposition strategies from the literature solve it.
To mend the situation, this paper has introduced a semantics for distributed query processing in which the notion of an execution context is explicit. This makes it possible to keep tabs on the naming of blank nodes across execution contexts, which in turn makes it possible to articulate a decomposition strategy that is provably sound and complete wrt. any selection of RDF sources even when blank nodes are allowed. Alas, this strategy is not computationally tractable. However, there are ways of utilizing knowledge about the sources, if one has it, that will help considerably.
References
1.
S.Abiteboul, R.Hull and V.Vianu, Foundations of Databases, Addison-Wesley, 1995, http://webdam.inria.fr/Alice/. ISBN 0-201-53771-0.
2.
M.Acosta and M.-E.Vidal, Networks of linked data eddies: An adaptive web query processing engine for RDF data, in: The Semantic Web – ISWC 2015 – 14th International Semantic Web Conference, Proceedings, Part I, Bethlehem, PA, USA, October 11–15, 2015, M.Arenas, Ó.Corcho, E.Simperl, M.Strohmaier, M.d’Aquin, K.Srinivas, P.T.Groth, M.Dumontier, J.Heflin, K.Thirunarayan and S.Staab, eds, Lecture Notes in Computer Science, Vol. 9366, Springer, 2015, pp. 111–127. doi:10.1007/978-3-319-25007-6_7.
3.
M.Acosta, M.-E.Vidal, T.Lampo, J.Castillo and E.Ruckhaus, ANAPSID: An adaptive query processing engine for SPARQL endpoints, in: The Semantic Web – ISWC 2011 – 10th International Semantic Web Conference, Proceedings, Part I, Bonn, Germany, October 23–27, 2011, L.Aroyo, C.Welty, H.Alani, J.Taylor, A.Bernstein, L.Kagal, N.F.Noy and E.Blomqvist, eds, Lecture Notes in Computer Science, Vol. 7031, Springer, 2011, pp. 18–34. doi:10.1007/978-3-642-25073-6_2.
4.
Z.Akar, T.G.Halaç, E.E.Ekinci and O.Dikenelli, Querying the web of interlinked datasets using VOID descriptions, in: WWW2012 Workshop on Linked Data on the Web, Lyon, France, 16 April, 2012, C.Bizer, T.Heath, T.Berners-Lee and M.Hausenblas, eds, CEUR Workshop Proceedings, Vol. 937, CEUR-WS.org, 2012, http://ceur-ws.org/Vol-937/ldow2012-paper-06.pdf.
5.
M.Arenas, C.Gutierrez and J.Pérez, Foundations of RDF databases, in: Reasoning Web. Semantic Technologies for Information Systems, 5th International Summer School 2009, Tutorial Lectures, Brixen-Bressanone, Italy, August 30–September 4, 2009, S.Tessaris, E.Franconi, T.Eiter, C.Gutierrez, S.Handschuh, M.-C.Rousset and R.A.Schmidt, eds, Lecture Notes in Computer Science, Vol. 5689, Springer, 2009, pp. 158–204. doi:10.1007/978-3-642-03754-2_4.
6.
C.Basca and A.Bernstein, Avalanche: Putting the spirit of the web back into semantic web querying, in: Proceedings of the ISWC 2010 Posters & Demonstrations Track: Collected Abstracts, Shanghai, China, November 9, 2010, A.Polleres and H.Chen, eds, CEUR Workshop Proceedings, Vol. 658, CEUR-WS.org, 2010, http://ceur-ws.org/Vol-658/paper527.pdf.
7.
H.Betz, F.Gropengießer, K.Hose and K.-U.Sattler, Learning from the history of distributed query processing – A heretic view on linked data management, in: Proceedings of the Third International Workshop on Consuming Linked Data, COLD 2012, Boston, MA, USA, November 12, 2012, J.Sequeda, A.Harth and O.Hartig, eds, CEUR Workshop Proceedings, Vol. 905, CEUR-WS.org, 2012, http://ceur-ws.org/Vol-905/BetzEtAl_COLD2012.pdf.
8.
C.Bizer, T.Heath and T.Berners-Lee, Linked data – The story so far, International Journal on Semantic Web and Information Systems 5(3) (2009), 1–22. doi:10.4018/jswis.2009081901.
9.
X.Chen, H.Chen, N.Zhang and S.Zhang, Sparkrdf: Elastic discreted RDF graph processing engine with distributed memory, in: Proceedings of the ISWC 2014 Posters & Demonstrations Track a Track Within the 13th International Semantic Web Conference, ISWC 2014, Riva del Garda, Italy, October 21, 2014, M.Horridge, M.Rospocher and J.van Ossenbruggen, eds, CEUR Workshop Proceedings, Vol. 1272, CEUR-WS.org, 2014, pp. 261–264, http://ceur-ws.org/Vol-1272/paper_43.pdf.
10.
H.Choi, J.Son, Y.Cho, M.K.Sung and Y.D.Chung, SPIDER: A system for scalable, parallel/distributed evaluation of large-scale RDF data, in: Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, Hong Kong, China, November 2–6, 2009, D.W.-L.Cheung, I.-Y.Song, W.W.Chu, X.Hu and J.J.Lin, eds, ACM, 2009, pp. 2087–2088. doi:10.1145/1645953.1646315.
11.
O.Görlitz and S.Staab, SPLENDID: SPARQL endpoint federation exploiting VOID descriptions, in: Proceedings of the Second International Workshop on Consuming Linked Data (COLD2011), Bonn, Germany, October 23, 2011, O.Hartig, A.Harth and J.Sequeda, eds, CEUR Workshop Proceedings, Vol. 782, CEUR-WS.org, 2011, http://ceur-ws.org/Vol-782/GoerlitzAndStaab_COLD2011.pdf.
12.
O.Görlitz and S.Staab, Federated data management and query optimization for linked open data, in: New Directions in Web Data Management 1, A.Vakali and L.C.Jain, eds, Studies in Computational Intelligence, Vol. 331, Springer, 2011, pp. 109–137. doi:10.1007/978-3-642-17551-0_5.
13.
S.Harris and A.Seaborne, SPARQL 1.1 query language, W3C Recommendation, 21 March 2013, https://www.w3.org/TR/sparql11-query/.
14.
O.Hartig, Zero-knowledge query planning for an iterator implementation of link traversal based query execution, in: The Semantic Web: Research and Applications – 8th Extended Semantic Web Conference, ESWC 2011, Proceedings, Part I, Heraklion, Crete, Greece, May 29–June 2, 2011, G.Antoniou, M.Grobelnik, E.P.B.Simperl, B.Parsia, D.Plexousakis, P.De Leenheer and J.Z.Pan, eds, Lecture Notes in Computer Science, Vol. 6643, Springer, 2011, pp. 154–169. doi:10.1007/978-3-642-21034-1_11.
15.
O.Hartig, SPARQL for a web of linked data: Semantics and computability, in: The Semantic Web: Research and Applications – 9th Extended Semantic Web Conference, ESWC 2012, Proceedings, Heraklion, Crete, Greece, May 27–31, 2012, E.Simperl, P.Cimiano, A.Polleres, Ó.Corcho and V.Presutti, eds, Lecture Notes in Computer Science, Vol. 7295, Springer, 2012, pp. 8–23. doi:10.1007/978-3-642-30284-8_8.
16.
O.Hartig, C.Bizer and J.C.Freytag, Executing SPARQL queries over the web of linked data, in: The Semantic Web – ISWC 2009, 8th International Semantic Web Conference, ISWC 2009, Proceedings, Chantilly, VA, USA, October 25–29, 2009, A.Bernstein, D.R.Karger, T.Heath, L.Feigenbaum, D.Maynard, E.Motta and K.Thirunarayan, eds, Lecture Notes in Computer Science, Vol. 5823, Springer, 2009, pp. 293–309. doi:10.1007/978-3-642-04930-9_19.
17.
O.Hartig and G.Pirrò, A context-based semantics for SPARQL property paths over the web, in: The Semantic Web. Latest Advances and New Domains – 12th European Semantic Web Conference, ESWC 2015, Proceedings, Portoroz, Slovenia, May 31–June 4, 2015, F.Gandon, M.Sabou, H.Sack, C.d’Amato, P.Cudré-Mauroux and A.Zimmermann, eds, Lecture Notes in Computer Science, Vol. 9088, Springer, 2015, pp. 71–87. doi:10.1007/978-3-319-18818-8_5.
18.
T.Heath and C.Bizer, Linked Data: Evolving the Web Into a Global Data Space, Synthesis Lectures on the Semantic Web, Morgan & Claypool Publishers, 2011. doi:10.2200/S00334ED1V01Y201102WBE001.
19.
A.Hogan, M.Arenas, A.Mallea and A.Polleres, Everything you always wanted to know about blank nodes, Journal of Web Semantics 27 (2014), 42–69. doi:10.1016/j.websem.2014.06.004.
20.
K.Hose, R.Schenkel, M.Theobald and G.Weikum, Database foundations for scalable RDF processing, in: Reasoning Web. Semantic Technologies for the Web of Data – 7th International Summer School 2011, Tutorial Lectures, Galway, Ireland, August 23–27, 2011, A.Polleres, C.d’Amato, M.Arenas, S.Handschuh, P.Kroner, S.Ossowski and P.F.Patel-Schneider, eds, Lecture Notes in Computer Science, Vol. 6848, Springer, 2011, pp. 202–249. doi:10.1007/978-3-642-23032-5_4.
21.
M.F.Husain, J.P.McGlothlin, M.M.Masud, L.R.Khan and B.M.Thuraisingham, Heuristics-based query processing for large RDF graphs using cloud computing, IEEE Transactions on Knowledge and Data Engineering 23(9) (2011), 1312–1327. doi:10.1109/TKDE.2011.103.
23.
G.Ladwig and T.Tran, Sihjoin: Querying remote and local linked data, in: The Semantic Web: Research and Applications – 8th Extended Semantic Web Conference, ESWC 2011, Proceedings, Part I, Heraklion, Crete, Greece, May 29–June 2, 2011, G.Antoniou, M.Grobelnik, E.P.B.Simperl, B.Parsia, D.Plexousakis, P.De Leenheer and J.Z.Pan, eds, Lecture Notes in Computer Science, Vol. 6643, Springer, 2011, pp. 139–153. doi:10.1007/978-3-642-21034-1_10.
24.
C.Liu, J.Qu, G.Qi, H.Wang and Y.Yu, Hadoopsparql: A hadoop-based engine for multiple SPARQL query answering, in: The Semantic Web: ESWC 2012 Satellite Events – ESWC 2012 Satellite Events, Revised Selected Papers, Heraklion, Crete, Greece, May 27–31, 2012, E.Simperl, B.Norton, D.Mladenic, E.Della Valle, I.Fundulaki, A.Passant and R.Troncy, eds, Lecture Notes in Computer Science, Vol. 7540, Springer, 2012, pp. 474–479. doi:10.1007/978-3-662-46641-4_48.
25.
S.J.Lynden, I.Kojima, A.Matono and Y.Tanimura, Adaptive integration of distributed semantic web data, in: Databases in Networked Information Systems, 6th International Workshop, DNIS 2010, Proceedings, Aizu-Wakamatsu, Japan, March 29–31, 2010, S.Kikuchi, S.Sachdeva and S.Bhalla, eds, Lecture Notes in Computer Science, Vol. 5999, Springer, 2010, pp. 174–193. doi:10.1007/978-3-642-12038-1_12.
26.
G.Montoya, H.Skaf-Molli, P.Molli and M.-E.Vidal, Federated SPARQL queries processing with replicated fragments, in: The Semantic Web – ISWC 2015 – 14th International Semantic Web Conference, Proceedings, Part I, Bethlehem, PA, USA, October 11–15, 2015, M.Arenas, Ó.Corcho, E.Simperl, M.Strohmaier, M.d’Aquin, K.Srinivas, P.T.Groth, M.Dumontier, J.Heflin, K.Thirunarayan and S.Staab, eds, Lecture Notes in Computer Science, Vol. 9366, Springer, 2015, pp. 36–51. doi:10.1007/978-3-319-25007-6_3.
27.
G.Montoya, M.-E.Vidal and M.Acosta, A heuristic-based approach for planning federated SPARQL queries, in: Proceedings of the Third International Workshop on Consuming Linked Data, COLD 2012, Boston, MA, USA, November 12, 2012, J.Sequeda, A.Harth and O.Hartig, eds, CEUR Workshop Proceedings, Vol. 905, CEUR-WS.org, 2012, http://ceur-ws.org/Vol-905/MontoyaEtAl_COLD2012.pdf.
28.
N.Papailiou, I.Konstantinou, D.Tsoumakos and N.Koziris, H2RDF: Adaptive query processing on RDF data in the cloud, in: Proceedings of the 21st World Wide Web Conference, WWW 2012 (Companion Volume), Lyon, France, April 16–20, 2012, A.Mille, F.L.Gandon, J.Misselis, M.Rabinovich and S.Staab, eds, ACM, 2012, pp. 397–400. doi:10.1145/2187980.2188058.
29.
B.Quilitz and U.Leser, Querying distributed RDF data sources with SPARQL, in: The Semantic Web: Research and Applications, 5th European Semantic Web Conference, ESWC 2008, Proceedings, Tenerife, Canary Islands, Spain, June 1–5, 2008, S.Bechhofer, M.Hauswirth, J.Hoffmann and M.Koubarakis, eds, Lecture Notes in Computer Science, Vol. 5021, Springer, 2008, pp. 524–538. doi:10.1007/978-3-540-68234-9_39.
30.
A.G.Ryman, A.Le Hors and S.Speicher, OSLC resource shape: A language for defining constraints on linked data, in: Proceedings of the WWW2013 Workshop on Linked Data on the Web, Rio de Janeiro, Brazil, 14 May, 2013, C.Bizer, T.Heath, T.Berners-Lee, M.Hausenblas and S.Auer, eds, CEUR Workshop Proceedings, Vol. 996, CEUR-WS.org, 2013, http://ceur-ws.org/Vol-996/papers/ldow2013-paper-02.pdf.
31.
A.Schätzle, M.Przyjaciel-Zablocki, T.Berberich and G.Lausen, S2X: Graph-parallel querying of RDF with GraphX, in: Biomedical Data Management and Graph Online Querying – VLDB 2015 Workshops, Big-O(Q) and DMAH, Revised Selected Papers, Waikoloa, HI, USA, August 31–September 4, 2015, F.Wang, G.Luo, C.Weng, A.Khan, P.Mitra and C.Yu, eds, Lecture Notes in Computer Science, Vol. 9579, Springer, 2015, pp. 155–168. doi:10.1007/978-3-319-41576-5_12.
32.
S.Schenk, C.Saathoff, S.Staab and A.Scherp, Semaplorer – Interactive semantic exploration of data and media based on a federated cloud infrastructure, Journal of Web Semantics 7(4) (2009), 298–304. doi:10.1016/j.websem.2009.09.006.
33.
M.Schmidt, O.Görlitz, P.Haase, G.Ladwig, A.Schwarte and T.Tran, Fedbench: A benchmark suite for federated semantic data query processing, in: The Semantic Web – ISWC 2011 – 10th International Semantic Web Conference, Proceedings, Part I, Bonn, Germany, October 23–27, 2011, L.Aroyo, C.Welty, H.Alani, J.Taylor, A.Bernstein, L.Kagal, N.F.Noy and E.Blomqvist, eds, Lecture Notes in Computer Science, Vol. 7031, Springer, 2011, pp. 585–600. doi:10.1007/978-3-642-25073-6_37.
34.
A.Schwarte, P.Haase, K.Hose, R.Schenkel and M.Schmidt, Fedx: Optimization techniques for federated query processing on linked data, in: The Semantic Web – ISWC 2011 – 10th International Semantic Web Conference, Proceedings, Part I, Bonn, Germany, October 23–27, 2011, L.Aroyo, C.Welty, H.Alani, J.Taylor, A.Bernstein, L.Kagal, N.F.Noy and E.Blomqvist, eds, Lecture Notes in Computer Science, Vol. 7031, Springer, 2011, pp. 601–616. doi:10.1007/978-3-642-25073-6_38.
35.
A.Stolpe, A logical characterisation of SPARQL federation, Semantic Web 6(6) (2015), 565–584. doi:10.3233/SW-140160.
36.
M.Vander Sande, R.Verborgh, J.Van Herwegen, E.Mannens and R.Van de Walle, Opportunistic linked data querying through approximate membership metadata, in: The Semantic Web – ISWC 2015 – 14th International Semantic Web Conference, Proceedings, Part I, Bethlehem, PA, USA, October 11–15, 2015, M.Arenas, Ó.Corcho, E.Simperl, M.Strohmaier, M.d’Aquin, K.Srinivas, P.T.Groth, M.Dumontier, J.Heflin, K.Thirunarayan and S.Staab, eds, Lecture Notes in Computer Science, Vol. 9366, Springer, 2015, pp. 92–110. doi:10.1007/978-3-319-25007-6_6.
37.
R.Verborgh, O.Hartig, B.De Meester, G.Haesendonck, L.De Vocht, M.Vander Sande, R.Cyganiak, P.Colpaert, E.Mannens and R.Van de Walle, Querying datasets on the web with high availability, in: The Semantic Web – ISWC 2014 – 13th International Semantic Web Conference, Proceedings, Part I, Riva del Garda, Italy, October 19–23, 2014, P.Mika, T.Tudorache, A.Bernstein, C.Welty, C.A.Knoblock, D.Vrandecic, P.T.Groth, N.F.Noy, K.Janowicz and C.A.Goble, eds, Lecture Notes in Computer Science, Vol. 8796, Springer, 2014, pp. 180–196. doi:10.1007/978-3-319-11964-9_12.
38.
R.Verborgh, M.Vander Sande, P.Colpaert, S.Coppens, E.Mannens and R.Van de Walle, Web-scale querying through linked data fragments, in: Proceedings of the Workshop on Linked Data on the Web Co-Located with the 23rd International World Wide Web Conference (WWW 2014), Seoul, Korea, April 8, 2014, C.Bizer, T.Heath, S.Auer and T.Berners-Lee, eds, CEUR Workshop Proceedings, Vol. 1184, CEUR-WS.org, 2014, http://ceur-ws.org/Vol-1184/ldow2014_paper_04.pdf.