Abstract
The paper provides a logical characterisation of distributed processing of SPARQL queries. The principal notion analysed is that of a distribution scheme; a pair consisting of an evaluation rule and a distribution function. Different choices of evaluation rules and distribution functions give different federation schemes that are proved to be sound and complete wrt. different selections of RDF datasets. Three distribution functions are singled out for special attention, called the even-, standard- and prudent distribution respectively. All yield sound and complete federation schemes when combined with an evaluation rule named the collect-and-combine rule. The completeness results thus obtained are next compared to existing federation systems with the aim of illustrating the potential utility of a framework in which salient properties can be formulated and explored. The final section, relates distribution schemes to the heuristic notion of a join-ordering to yield sound and complete execution plans for the aforementioned federation schemes.
Keywords
Introduction
That the exploitation and dissemination of Semantic Web data requires powerful federation engines, is a claim that hardly needs an argument. Unedited and decentralized, the Semantic Web is also a singularly fluid computational environment. Therefore any federation engine that aims to be generic must be capable of operating in a dynamic network topology where datasets may be discovered at run time, and where little is known about the structure and content of these datasets prior to querying.
SPARQL federation—the decomposition and distribution of SPARQL queries—is one approach that seems both principled and promising. Its emerging importance is mirrored by the number of publications devoted to this topic in the last five to ten years. However, the wealth of experimental systems and empirical research contrasts sharply with the paucity of foundational work, and attempts to formulate a theoretical framework in which to explore and compare different approaches mathematically are largely absent.
This remains true even when the query language is abstracted away so as to include the comparatively rich literature on distributed databases. Thus, whilst the 150+ references in Kossman’s excellent “The State of the Art in Distributed Query Processing” [23] together with the 150+ references in Hose et al.’s “Database foundations for scalable RDF processing” [20] span more than 30 years of research, none of the listed references gives a general logical characterisation of distributed query answering.
The present paper provides some tentative indications that the working out of such a framework may be both an interesting and worthwhile pursuit. Although the concepts introduced ought to be of more general validity, focus will be placed on federation in a zero-knowledge environment, that is, on federation performed in the absence of specific information about the structure and content of contributing datasets (apart from their signatures—a concept to be defined).
Also, only the fragment of SPARQL that consists of simple conjunctive queries aka. basic graph patterns will be considered. The operative assumption is that these restrictions form a natural base-line for a theory, from which more complex cases can be derived.
The present paper does not advocate ‘an approach’ to SPARQL federation. Rather, it is more aptly seen as giving expression to properties and observations that are already part of the ‘folklore’ of the Semantic Web community, and in the process developing a formal apparatus in which the relevant distinctions can be drawn and their ramifications explored.
Indeed, the results that are to be presented are not even essentially tied to SPARQL,
but could easily be generalized to apply to any class of conjunctive queries [1, chap. 4] expressible in first-order
logic. Thus, the relationship between the present theory and the SPARQL 1.1.
federation extension [11,33] is essentially one of abstraction: the
latter, by providing the
The paper is organized as follows: After a review of related work in Section 2, Section 3 recalls a few
essential notions of SPARQL semantics, fixes notation and declares theoretical
assumptions for subsequent developments. Section 4
introduces the principal unit of analysis, which is that of a
Related work
Fairly recent surveys of the state of the art wrt. RDF federation can be found in Betz et al. [8], Görlitz and Staab [15] and Hose et al. [20]. Based on these sources, it is possible to distinguish between at least three major trends in RDF federation, namely lookup-based federation, warehousing and distributed query processing. Lookup-based federation, which may also be called federation-by-traversal (or follow-your-nose traversal, as in [4]), iteratively downloads and evaluates data based on links (usually as prescribed by the Linked Data principles [9]) from one dataset into another. The answer to a query is composed by cumulatively adding answers from incoming data during the traversal process. This approach is exemplified by e.g. [18,19] and [24].
Approaches based on warehousing, on the other hand, collect all data in advance and combines it into a central triple store against which queries are evaluated. Recent efforts in this direction focus on the use of cluster technology such as MapReduce and Hadoop, e.g. [12,21,22] and [30].
Finally, approaches based on distributed query processing rely on analysing a query to identify a set of relevant RDF graphs—with or without the aid of statistics—to which subqueries can be assigned. Like federation-by-traversal and unlike warehousing, queries are evaluated remotely. Recent example of this approach include [3,7,28,31] and [35].
It should be said that this classification is very rough, cutting across important and interesting topics such as adaptivity and opportunism, indexing, join-order optimization, statistics-generation and more. The reader is referred to the aforementioned surveys for more detail.
The present paper is most comfortably subsumed under the category of distributed query processing, although it ought to be of relevance to the other two as well. To the author’s knowledge it is the first foundational study of distributed query processing that approaches the discipline from a purely logical point of view. There do exist a few formal studies of what can be considered the more general topic of querying the Web, for instance Abiteboul and Vianu [1] and Bouquet et al. [10]. The former is concerned with the computability of queries over the Web expressed in Datalog, and predates the SPARQL query language. The latter describes three different ways in which a query can be answered and characterizes their answers semantically in complete abstraction from query languages and distribution algorithms. These references have little in common with the present study, apart from being formal and about queries on the Semantic Web.
Preliminaries
In order to lighten the notation somewhat, curly braces will be omitted from
singletons in set-theoretic expressions and applications of functions if no
confusion is likely to ensue, e.g.
Turning now to RDF specifics, let
An RDF graph is a finite set of RDF triples.
An RDF graph will sometimes also be referred to as an RDF
Let
A basic graph pattern (BGP), aka. a conjunctive graph pattern, is defined as
follows: Any If
The notation
As in [5] conjunctive graph patterns are
conceived of as queries matching RDF graphs. Answers are eligible substitutions of
values from the RDF graph in question for variables in the graph pattern in
question. In conformity to [5], sets of
answers will be formalized in terms of partial functions
The set of answers to a BGP
Let
Taking stock, the reader should note two things: first, as announced in the
introduction, the present theory only considers the fragment of SPARQL that consists
of conjunctive queries. Secondly, the distinction between the syntax and the
semantics of SPARQL queries has been blurred, and this is deliberate. Usually, one
would treat a SPARQL graph pattern as a syntactic object, and the mapping that links
it to an RDF dataset as a semantic object. However, defining a basic graph pattern
as a
Now, considered from an abstract or logical point of view, the problem of
characterizing SPARQL federation is essentially that of generalizing
Definition 3.3 to the case that
Let
Here
Presupposing a selection of RDF datasets
Therefore any
The last of these concepts will be discussed in more detail in the next section. As
regards the former two, the general idea expounded in the following is to distribute
a (global) query over a selection of datasets by decomposing it into subqueries that
can be evaluated directly against the sources that cover its
Let
Note that variables and blank nodes do not add to the signature of a graph pattern, and that blank nodes do not add to the signature of an RDF graph.
If
It is a simple consequence of Definitions 4.1 and 4.2 that if
A distribution function is a binary function
if
Clause (1) prevents a distribution function from behaving erratically when it comes
to assigning subqueries to datasets. That is, a BGP is only assigned to datasets
that can potentially answer it. Clause (2) ensures that the value of a distribution
function on any selection of datasets
A distribution function will be said to be
(Egalitarian distribution).
A distribution function is
Egalitarianism has an important bearing on zero-knowledge completeness. It is
therefore investigated in more detail in Section 5.
Several of the distribution functions introduced in the next section are
egalitarian. If
Examples of distribution functions
Let the notion of an
The subpattern of BGP
The following theorem gives examples of egalitarian distribution functions:
The proof considers only the standard distribution. The other cases are
similar. So suppose
Here, the join-connectedness of a BGP
Let
Stated in words, a BGP being join-connected means that it cannot be split into sub-patterns that do not share a variable.
Anticipating subsequents developments, the even-, standard- and prudent distribution are all suitable for a zero-knowledge environment insofar as they yield sound and complete federation schemes (wrt. the set of all selections of RDF graphs) when paired with the collect-and-combine evaluation rule that will be introduced shortly.
Moreover, all three distribution functions can be said to heed the fact that one
dataset may be able to generate new results when joining with
Distribution of the signature of query LS5

Query LS5 from FedBench.
Consider the query in Listing 1. This is query LS5 from the FedBench suite [34] which asks for all drugs from Drugbank, together with the URL of the corresponding page stored in KEGG and the URL to the image derived from ChEBI.
The signature of the query divides among the FedBench datasets as specified in Table 1. Presupposing this division, Table 2 gives the even, standard and prudent distributions respectively. Blocks in a column correspond to subqueries and are labelled with the datasets to which that subquery is assigned. For instance, the standard distribution assigns the subquery consisting of line 2 and 3 to KEGG, since 2 and 3 form and exclusive group for it. The non-exclusive triple pattern in line 4, in contrast, is assigned to both KEGG and ChEBI.
The standard- and prudent distributions of LS5 over KEGG, ChEBI and Drugbank
are identical for this particular selection of datasets. The difference
between them is revealed by reducing the selection to ChEBI and Drugbank
only, viz. Table 3. Here, the standard
distribution assigns the subquery consisting of line 4, 5 and 6 to ChEBI,
since this larger set now constitutes an exclusive group for ChEBI. Note,
however, that its join-graph is not connected since none of the variables in
line 4 occur in line 5 or 6. Therefore, the prudent distribution divides
Distribution of LS5 over KEGG, ChEBI and Drugbank
Distributions of LS5 over ChEBI and Drugbank
Turning now to the notion of an evaluation rule, the second proposed component of federation schemes, it suffices for now to define it in complete abstraction simply as a function that produces a set of answers from a distribution:
An evaluation rule is a function
Obviously, this definition hides great variety. Next:
A federation scheme is a pair
The central claim of the present paper is that a federation scheme is the proper unit of analysis for distributed query processing—or at any rate, for the purposes of logical analysis. This means that meta-properties such as soundness and completeness are relations that hold between federation schemes on the one hand and sets of selections of RDF datasets on the other. The former specifies how a graph pattern is to be distributed and evaluated, the latter demarcates the permissible ways of selecting sources over which to distribute the decomposed query:
Let
Summary of notation
As mentioned in Section 3, the problem of giving a
logical characterisation of SPARQL federation can be construed as the problem of
generalizing the semantics of conjunctive queries to multiple RDF graphs. Standard
SPARQL semantics defines the evaluation
The distributed query processing semantics that is developed in the present paper
sticks to the general idiom of the standard semantics but relaxes restrictions a)
and b). Considered in the abstract, this involves working out an account of
combinations
This section focuses on one particularly natural set of such pairs, namely those
formed from distributions functions that are egalitarian and satisfy a condition
called
The set of federation schemes that are sound and complete in a zero-knowledge environment—that is, sound and complete wrt. to the set of all selections of RDF datasets—is larger then this, though, and in fact does not presuppose neither granularity nor egalitarianism. Examples of non-granular and non-egalitarian, but zero-knowledge sound and complete, federation schemes are provided in Section 5.2.
The account is built up from simple principles, starting with the following lemma:
Immediate from the assumption that no RDF graphs share blank nodes. □
Although dataset monotony does require the non-trivial property of disjointness for every pair of elements in a set of RDF graphs wrt. blank nodes, this is not a real restriction since renaming of blank nodes does not change the semantics of an RDF graph. Hence, one can always assume that blank nodes have been standardized apart in a pre-processing step.
The next lemma gives the case of Eq. (1) for
For the left-to-right direction suppose
The restriction on the families of datasets
As regards the interplay between
Suppose
The converse does not, but there is the following qualified version of it:
By Definition 3.4(1),
A distribution function that satisfies both directions will be said to be
A distribution function
As an example of a distribution function that is not agnostic, consider the
trivial distribution defined by putting
The significance of the property of agnosticism consists in the fact that for agnostic distribution functions there exists a natural and simple evaluation rule that yields a sound and complete federation scheme for the zero-knowledge case. This is the announced collect-and-combine rule:
Put
For a concrete example of how the collect-and-com- bine rule is implemented in a
popular SPARQL 1.0 federation system, the reader may want to peak ahead to
Section 6.1. From a theoretical point of view the
rule may be motivated follows: To produce a result set from a distribution, each
cell in the distribution has to be evaluated against its designated datasets and the
partial answers that are returned have to be combined in order to yield an answer to
the global query. The question, is how to unpack ‘combine’. Note that just fixating
on one operator will not do. Consider for instance, the case where
The collect-and-combine rule is designed to balance the need for both ∪ and ⋈ as result-set forming operators, and to apply them in equal measure so to speak.
Let
As the next subsection will show, the even-, standard- and prudent distributions are all agnostic and egalitarian in this sense. Hence, Tables 2 and 3 give examples of agnostic and egalitarian decompositions of FedBench query LS5.
Theorem 5.5 specialises in a straightforward way to the even-, standard- and prudent distribution by way of the property egalitarianism (Definition 4.4) and the following property of granularity:
(Granularity).
A distribution function
It is evident by inspection of Definition 4.3 that the even-, standard- and prudent distribution are all granular in this sense. Stretching the terminology somewhat, the federation scheme itself will be said to be granular if its distribution function is.
Granular distribution functions have the important property that they make sets
Suppose
Thus,
Now, if a distribution function is egalitarian then granularity suffices for
agnosticism:
Let
It is a fact of some importance—and one that there shall be occasion to refer
back to in the next section—that neither granularity nor egalitarianism are
A coalescent distribution of LS5 over
KEGG, ChEBI and Drugbank
A coalescent distribution of LS5 over KEGG, ChEBI and Drugbank
Now, consider BGPs that are as unconstrained as can be in the sense of having no joins and only variables or blank nodes in subject or object position:
An if
Referring back to Table 2, the BGP that consists of row 3 and 4 is unconstrained in this sense, as no two triples in it share a variable (5.4(1)) and as none of the terms in subject or object position is a URL or a literal value (5.4(2)). Indeed, it is a maximal unconstrained subset of 1–6, for row 1 has a URL in object position, whereas all other triple patterns share a variable with either 3 or 4.
A distribution function that lumps together triple patterns in unconstrained
BGPs, in one way or the other (there are of course many), will henceforth be
said to be
Call
Consider the decomposition of the LS5 query from the FedBench suite given by Table 5. This decomposition is coalescent but not granular: line 2 and 3 form an exclusive group relative to the Drugbank dataset, line 6 is an exclusive group relative to the ChEBI dataset—albeit a singleton one—whereas line 1 is a non-exclusive triple pattern assigned to all involved datasets. As regards lines 4 and 5, they form a group that is neither a singleton nor assigned to a single dataset, whence it is a group that violates the property of granularity. The two lines do not share variables, however, and do not have instantiated subjects or objects, and is therefore unconstrained in the sense of Definition 5.4. It follows that the decomposition is coalescent in the sence of Definition 5.5.
The thing about an unconstrained BGP
Let
The following lemma verifies that the splitting of a result set suffices to recover all information:
By induction on the complexity of
For the left-to-right direction of the principal case put
For the induction step, suppose
Lemma 5.11 suggests the following evaluation rule:
Put
Given the conditions on the definition of a distribution function, it is not difficult to show that:
The proof is a straightforward rerun of Theorem 5.5
based on generalizing the property of agnosticism so that it reflects the
A similar result can easily be shown to hold for the property of egalitarianism,
that is, there are non-egalitarian distribution functions that induce complete
federation schemes wrt. the set of all selections of RDF graphs. Indeed, there
are non-egalitarian
To see this, let
Explained in words, for every unconstrained subquery
Although these examples of non-granular and non-egalitarian distribution
functions are admittedly somewhat contrived, their mere existence is a fact of
some importance insofar as they prove that granularity and egalitarianism are
not necessary for completeness. Contrapositively therefore, non-granularity and
non-egalitarianism are not sufficient for
With new federation engines being developed virtually by the hour, only a small
sample can be discussed in the present section. The systems that
The primary concern of the present section is to illustrate the application of the
conceptual framework with an eye to the completeness of a federation scheme wrt. a
set of selections of RDF graphs. It is only secondarily an evaluation of existing
systems, and not at all an
Yet, if the said arguments miss their mark for that reason alone, that is if the systems discussed turn out to behave differently than described, then one way of looking at that fact is to take it as underlining the general theme of this paper: there is a need for a framework to allow implementation to be derived from theory rather than vice versa.
FedBench setup for the Life
Sciences 6 query
FedBench setup for the Life Sciences 6 query
Keeping the possibility open that there are also other valid reasons for deviating from completeness, the material that follows should not be taken as an argument to the effect that any of the approaches being discussed is inherently flawed.
The standard distribution has emerged as a point of convergence between several distributed query processing engines that have been proposed in the last half decade or so. Although one should probably be cautious about crediting any single reference with the idea, an early comer seems to have been the DARQ system proposed in Quilitz and Leser [31]. Later FedX [35] and SPLENDID [14] were based on the same idea—no dependencies implied.
All these approaches share three key observations: 1) that RDF properties can be used to assign sub-queries to RDF sources, 2) that if the property of triple pattern occurs in multiple datasets, then that triple pattern must be sent as a singleton subquery to each of datasets in question in order not to neglect cross-site joins, and 3) if the signature of a subquery is covered by exactly one source, then that subquery does not need to be decomposed further but can be assigned to the source in question as-is. These three steps taken together amount to the standard distribution function.
It is has proved difficult to find a clear statement of evaluation rules—the importance of which seems largely to be ignored. However, Schwarte et al. [35] at least offer an example of how the Life Sciences 6 query from the FedBench suite would be evaluated according to their algorithm. The LS6 query is here reproduced in Table 6 which also displays the signature coverage of each triple wrt. to the datasets KEGG, drugbank and DBpedia. The query execution plan from [35] is given in Fig. 1, with minor adjustments of notation: the leaves are numbered according to Table 6, the brackets indicate the grouping of the sub-queries, and the ‘@’-tag above a leaf indicates a selection against the dataset suffixed to the tag.

Execution plan for LS6 from Schwarte et al. [35].
There is reason to believe, however, that this execution plan contains a mistake,
for the group
Triple pattern 3 is
However, by separating 3 from 4 and inserting the branch
Distribution of LS5 over KEGG, ChEBI and Drugbank according to DEFENDER
DEFENDER [29] is a query decomposition engine that is built on top of the better known ANAPSID federator [3]. The main feature of the latter is that it adapts query execution plans to data availability and run-time conditions by providing physical SPARQL operators that detect when a source becomes blocked. These operators produce answers opportunistically as soon as data becomes available.
Given the opportunistic nature of DEFENDER that it inherits from ANAPSID, completeness of query answering wrt. to the selected datasets is not necessarily an overriding concern—the discussion towards the end of [29] indicates that it is not. Nevertheless, it is natural to expect DEFENDER to at least approximate completeness in the limiting case that the contributing sources are sufficiently responsive. That is, it seems reasonable to expect completeness to act as a norm from which DEFENDER as well as ANAPSID itself will deviate only if circumstances dictate it.
The DEFENDER decomposition algorithm is based on evaluating so-called star-shaped
query patterns, a notion which in turn derives from Vidal et al. [37]: Every pattern
Distribution of LS5 over ChEBI 1, ChEBI 2, KEGG and drugbank according to DEFENDER
In other words, DEFENDER assigns to each endpoint a
The DEFENDER decomposition algorithm can now be described with more precision: it
decomposes a query into
Certain features of the induced distribution function can be read off from this
table directly. For one, the distribution is egalitarian: the set of endpoints
that a star is assigned to is precisely the set of endpoints that cover its
signature. Indeed, this property generalizes and is due to the fact that stars
are maximal in the global sense—interestingly, maximality in the local sense
does
Moreover the decomposition in Table 7 is granular
too: every subquery is either a singleton or assigned to a singleton. Yet, this
is incidental to the example: to see this, add another dataset encoded in, say,
the ChEBI vocabulary to the datasets over which the LS5 query is distributed,
call the two ChEBI datasets ChEBI 1 and ChEBI 2. In terms of the decomposition
algorithm this has the effect of copying the stars assigned to ChEBI 1 into the
columns of ChEBI 2 yielding the distribution in Table 8 which is
Distribution of LS5 over KEGG, ChEBI and Drugbank according to ADERIS
By the results of Section 5.2, the non-granularity and unegalitarianism of a distribution function is not sufficient warrant to conclude that a federator is incomplete wrt. to the set of all selections of RDF graphs. Nevertheless, for DEFENDER this turns out to be so:
Put
Distribution of LS5 over KEGG, ChEBI and Drugbank according to AVALANCHE
The Adaptive Distributed Endpoint RDF Integration System (ADERIS) from Lynden et al. [25] is another example of an adaptive distributed query processor. That is, ADERIS seems to have a principal design goal that is similar to that of DEFENDER/ANAPSID, namely to have join-orderings change during query execution as a means of continual refinement and optimization of the execution plan at run-time.
ADERIS too is a distributed query processing system in the sense described in Section 2, which means that it relies on a decomposition algorithm to distribute subqueries to contributing sources.
The ADERIS decomposition algorithm is rather different from that of DEFENDER, however, and is based on generating “for each data source, a source query (…) that contains all of the triple patterns from the federated query that could possibly match the triples in the given data source” [25, p. 182]. Unlike Montoya et al. [29], Lynden et al. [25] state explicitly that completeness is a property they wish the decomposition algorithm to have: “Source queries are constructed for each of these data sources, with the aim of retrieving all the data needed to answer the query” (p. 181).
In mathematical terms, the ADERIS distribution translates to a function which for
every RDF source
Table 9 shows the result of applying the ADERIS distribution to the LS5 example. This particular decomposition, is granular but not egalitarian. The group consisting of line 4 and 5 constitutes a counterexample to egalitarianism: since the signature of this group is covered by both KEGG and ChEBI, egalitarianism dictates that it be assigned to both, which it is not.
Granularity is again incidental in precisely the same sense as for the
distribution in Table 7, i.e. non-granularity can
be shown by column-copying. Moreover, ADERIS and DEFENDER are in complete
agreement wrt. to Example 6.1 insofar as they both
yield the same distribution. It follows that the ADERIS too is incomplete wrt.
the set of all selections of RDF graphs for any reasonable choice of evaluation
rule
AVALANCHE
The AVALANCHE system is a query answering system for the Semantic Web that, in the words of Başca and Bernstein [6], is designed to embrace the WWW uncertainties rather than to try to dispense with them. According to Başca and Bernstein, this means querying the Semantic Web without making any prior assumptions about the data location or distribution, schema-alignment, pertinent statistics, data evolution, or accessibility of servers [6, p. 2].
This is not to say, though, that AVALANCHE is a zero-knowledge federator in the
present sense of the term—it is not. It merely means that AVALANCHE does not
utilize
In the general case, AVALANCHE, like ADERIS and DEFENDER/ANAPSID, is opportunistic: it is designed to deliver partial results as they become available, favouring those that are easier to assemble. Like the other two approaches it relies on cost-reducing heuristics that adjust to the current computational environment, but it also relies on heuristics that is applied to query planning regardless.
For instance, the AVALANCHE query planner restricts the search for an optimal
plan by considering only those distributions that assign a triple pattern to
exactly one source (henceforth this will be called the
Başca and Bernstein [6, listing 2, page
7] illustrate this decomposition algorithm by giving an example distribution of
the LS5 query reproduced here in Table 10 (note
that the example assumes the
It is immediate by inspection of the table that this particular decomposition is granular. This time granularity is not an incidental property of the example but a consequence of the single-source restriction. Equally obviously, and for the same reason, AVALANCHE is not egalitarian—viz. row 6 is assigned only to KEGG although it is also covered by ChEBI.
In fact, AVALANCHE like DEFENDER/ANAPSID and ADERIS, can easily be seen to be
incomplete wrt. to the set of all sets of RDF graphs for any reasonable choice
of evaluation rule
As an afterthought, it is instructive to contrast this conclusion with the
discussion of the homonymous notions of completeness in [6] itself: it cannot be the aim of AVALANCHE to deliver
This notion of query-contextual completeness does not coincide with the
Discussion
The preceding discussion cannot pretend to do justice to the systems discussed. They are all much broader in scope and ambition than the narrow focus on completeness suggests, covering besides a plethora of interesting and important topics such as adaptivity, opportunism, indexing, join-order optimization, statistics-generation and more.
As to the question of the significance of completeness itself, a two-fold answer is probably most appropriate: from a pragmatic point of view completeness is essentially subservient to user needs. An incomplete answer that returns in reasonable time is usually preferable to a complete one that makes excessive demands on time and memory.
Looked at from a theoretical perspective, however, a completeness result offers
an exhaustive characterisation of the
As argued throughout this paper, an exhaustive characterisation of a federation strategy is obtained by turning three knobs: the way a query is decomposed into sub-queries, the way the partial answers are subsequently combined, and the way that a selection of RDF graphs is chosen for federation.
Depending on how much one turns one knob, it may not have any effect turning
another. For instance, Example 6.1 has shown that
the DEFENDER/ANAPSID, ADERIS and AVALANCHE distribution functions are incomplete
wrt. to the set of all selections of RDF graphs for all reasonable choices of
evaluation rule
Needless to say, there is large variety of ways in which a set of selections of RDF sources can be constrained in this manner, and therefor also a large number of distribution schemes to match.
A corollary wrt. join-order heuristics
The account of distributed query processing given so far is entirely abstract and declarative, and it is far from obvious how one would move towards the implementation level from the theory. As a stepping stone, this section operationalizes the distribution semantics by showing how to fit a federation scheme into a join-ordering in order to obtain an execution plan for a given distribution that is provably sound and complete. Attention is restricted to zero-knowledge completeness and the collect-and-combine rule.
A join-ordering will be represented as usual as an operator tree, based on the following definitions from [16]:
(Tree domain).
A tree domain For each
For
each
Given a tree domain
Given a set Σ of symbols, a tree over Σ is a function
Given a tree
Every element of a domain
(Ordered distribution).
Let Δ be a distribution. A join-ordering of Δ is a tree if
Definition 7.4 represents a generalisation of the usual notion of the join-ordering of a BGP. It is a generalization in the sense that the carrier set of the poset is not a set of triples but a distribution, that is, itself a set of BGPs.
In computational terms it is natural to see this generalization as an adaptation in which the join-ordering also distributes the responsibility for executing joins: a distribution assigns a subquery of a global query to a set of contributing sources. If the subquery in question contains more than one triple pattern—which will be the case if it is e.g. an exclusive group—then the subquery is handed over to the contributing sources along with the responsibility for executing the joins involved.
It should be fairly obvious how to relate Definition 7.4 to the completeness results from Section 5:
(Evaluation).
The evaluation of an ordered distribution
otherwise it is the application of
We have:
By Definition 7.4(2)
and Definition 7.4(2) we have that
None of this is surprising, which is why it is presented as a corollary.
Nevertheless there are a couple of things to note: First, the unit of heuristic significance in a distributed setting is the subpattern, not the triple pattern, and all subpatterns are by default equally privileged. That is, there is no constraint on the shape of a tree that dictates that e.g. exclusive groups be processed first. As hinted at in Section 6.1, this contrast with how evaluation is performed by e.g. FedX where exclusive groups are given priority based on the claim that they are usually more constrained than singleton triple patterns [35, p. 607].
This is probably true, but it is not
The second thing to note, is the generality of the collect-and-combine rule which
yields sound and complete heuristics for each of the even-, standard and prudent
distributions in one fell swoop. Indeed, one may speculate that there is little
reason for employing a different evaluation rule with these distributions
functions—or, at any rate, at least in the zero knowledge case. If there were, then
it would be for heuristic reasons. However, in the absence of detailed knowledge
about the contributing sources there seems to be no reason to think that some other
particular evaluation order would do better than the collect-and-combine rule.
Whilst evaluating the two pairs
This foundational study has provided some first steps towards a logical framework for reasoning about distributed query processing in SPARQL in a formal manner. The analysis is based on the general idea of correlating a ‘syntactic’ object consisting of a distribution function and an evaluation rule with a ‘semantic’ object consisting of a family of sets of RDF graphs, each one of which represents an eligible selection of sources for the federation process.
The combination of granular and egalitarian distributions function with the collect-and-combine evaluation rule has emerged as a trait that induces completeness in a zero knowledge computational environment for several natural examples of distribution functions, including the standard distribution, so-called, which occurs frequently in the literature.
Federation schemes have a simple and transparent interface towards the heuristic notion of a join-ordering of a query. This interface connects execution plans to soundness and completeness results in a manner that makes it evident that whole classes of execution plans preserve these properties.
An interesting line of future research would be to investigate natural candidates for some-knowledge federation schemes, and to furnish them with completeness results by restricting the set of eligible selections. This should include a characterization of existing some-knowledge approaches. For instance; what needs to be assumed about the distribution of data over a selection of RDF datasets for a distribution scheme based on maximal stars in the manner of Vidal et al. [37] to be complete? Or, are there any established resource shapes [32] that could be analysed this way? This ought to be a rich field for future work.
Footnotes
Acknowledgements
The author is grateful to Carlos Buil-Aranda, Jürgen Umbrich, and one other anonymous reviewer for constructive criticism and insightful questions regarding earlier drafts of this paper. The author also wishes to thank Jonas Halvorsen and Bjørn Jervell-Hansen for stimulating discussions leading up to and advancing the present line of research.
