Abstract
Finding a good query plan is key to the optimization of query runtime. This holds in particular for cost-based federation engines, which make use of cardinality estimations to achieve this goal. A number of studies compare SPARQL federation engines across different performance metrics, including query runtime, result set completeness and correctness, number of sources selected and number of requests sent. Albeit informative, these metrics are generic and unable to quantify and evaluate the accuracy of the cardinality estimators of cost-based federation engines. To thoroughly evaluate cost-based federation engines, the effect of estimated cardinality errors on the overall query runtime performance must be measured. In this paper, we address this challenge by presenting novel evaluation metrics targeted at a fine-grained benchmarking of cost-based federated SPARQL query engines. We evaluate five cost-based federated SPARQL query engines using existing as well as novel evaluation metrics by using LargeRDFBench queries. Our results provide a detailed analysis of the experimental outcomes that reveal novel insights, useful for the development of future cost-based federated SPARQL query processing engines.
Introduction
The availability of increasing amounts of data published in RDF has led to the genesis of many federated SPARQL query engines. These engines vary widely in their approaches to generating a good query plan [5,25,39,50]. In general, there exist several possible plans that a federation engine can consider when executing a given query. These plans have different costs in terms of the resources required and the overall query execution time. Selection of the best possible plan with minimum cost is hence of key importance when devising cost-based federation engines; a fact which is corroborated by a plethora of works in database research [27,29].
In SPARQL query federation, index-free (heuristics-based) [16,31,47] and index-assisted (cost-based) [9,11,13,17,20,26,28,30,36,43,49] engines are most commonly used for federated query processing [39]. The heuristics-based federation engines do not store any pre-computed statistics and hence mostly use different heuristics to optimize their query plans [47]. Cost-based engines make use of an index with pre-computed statistics about the datasets [39]. Using cardinality estimates as principal input, such engines make use of cost models to calculate the cost of different query joins and generate optimized query plans. Most state-of-the-art cost-based federated SPARQL processing engines [9,13,17,20,26,28,30,43,49] achieve the goal of optimizing their query plan by first estimating the cardinality of the query’s triple patterns. Subsequently, they use this information to estimate the cardinality of the joins involved in the query. A cost model is then used to compute the cost of performing different query joins while considering network communication costs. One of the query plans with minimum execution costs is finally selected for result retrieval. Since the principal input for cost-based query planning is the cardinality estimates, the accuracy of these estimates is crucial to achieve a good query plan.
The performance of federated SPARQL query processing engines has been evaluated in many recent studies [1–3,9,10,12,18,23,24,30,39–41,43,48] using different federated benchmarks [4,7,14,19,32,33,38,45,46]. Performance metrics, including query execution time, number of sources selected, source selection time, query planning time, continuous efficiency of query processing, answer completeness and correctness, time for the first answer, and throughput, are usually reported in these studies. While these metrics allow the evaluation of certain components (e.g., the source selection model), they cannot be used to evaluate the accuracy of the cardinality estimators of the cost-based federation engines. Consequently, they are unable to show how the estimated cardinality errors affect the overall query runtime performance of federation engines.
In this paper, we address the problem of measuring the accuracy of the cardinality estimators of federated SPARQL engines, as well as the effect of these errors on the overall query runtime performance. In particular, we propose metrics to quantify these estimation errors.
Our proposed metrics are open-source and available online at the project homepage.
In summary, the contributions of this work are as follows:
We propose metrics to measure the errors in cardinality estimations of cost-based federated engines. These metrics allow a fine-grained evaluation of cost-based federated SPARQL query engines and uncover novel insights about the performance of these types of federation engines that were not reported in previous works.
We measure the correlation between the values of the novel metrics and the overall query runtimes. We show that some of these metrics have a strong correlation with runtimes and can hence be used as predictors for the overall query execution performance.
We present an empirical evaluation of five state-of-the-art cost-based SPARQL federation engines – CostFed [43], Odyssey [30], SemaGrow [9], LHD [49] and SPLENDID [13] – on LargeRDFBench [38], using the proposed metrics along with existing metrics that affect query runtime performance.
The rest of the paper is organized as follows: In Section 2, we present related work. A motivating example is given in Section 3. In Section 4, we present our novel metrics for the evaluation of cost-based federation engines. In Section 5, we give an overview of the cardinality estimators of selected cost-based federation engines. The evaluation of these engines with proposed as well as existing metrics is shown in Section 6. Finally, we conclude in Section 7.
Metrics used in the existing federated SPARQL query processing systems.
In this section, we focus on the performance metrics used in the state of the art to compare federated SPARQL query processing engines. Based on previous federated SPARQL benchmarks [14,38,45] and performance evaluations [1–3,9,12,13,18,30,36,43,47,49] (see Table 1 for an overview), the performance metrics used for comparing federated SPARQL engines can be categorized as:
All of these metrics are helpful for evaluating the performance of different components of federated query engines. However, none of them can be used to evaluate the accuracy of the cardinality estimators of cost-based federation engines. Consequently, the effect of estimated cardinality errors on the overall query runtime performance of federation engines cannot be studied based on these metrics. To overcome these limitations, we propose metrics for measuring errors in the cardinality estimations of triple patterns, joins between triple patterns, and the overall query plan, and show how these metrics affect the overall runtime performance of federation engines.

Motivating example: a sample SPARQL query and the corresponding query plans of two different federation engines.
In this section, we present an example to motivate our work and to understand the proposed metrics. We assume that the reader is familiar with the concepts of SPARQL and RDF, including the notions of a triple pattern, the joins between triple patterns, the cardinality (result size) of a triple pattern, and left-deep query execution plans. As aforementioned, most cost-based SPARQL federation engines first estimate individual triple pattern cardinality and use this information to estimate the cardinality of joins found in the query. Finally, the query execution plan is generated by ordering the joins. In general, the optimizer first selects the triple patterns and joins with minimum estimated cardinalities [43].
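The greedy strategy just described can be illustrated with a few lines of code. This is a simplified sketch, not the planner of any particular engine: it orders triple patterns purely by ascending estimated cardinality and ignores join connectivity; the labels and numbers are hypothetical.

```python
# Hypothetical sketch of greedy left-deep planning: repeatedly pick the
# triple pattern with the smallest estimated cardinality. Not the actual
# algorithm of any of the evaluated engines.

def greedy_left_deep_plan(estimates):
    """Order triple patterns by ascending estimated cardinality.

    estimates: dict mapping triple-pattern label -> estimated cardinality.
    Returns the join order as a list of labels.
    """
    return sorted(estimates, key=estimates.get)

plan = greedy_left_deep_plan({"TP1": 100, "TP2": 8, "TP3": 40})
print(plan)  # ['TP2', 'TP3', 'TP1']
```

Real optimizers additionally restrict each step to triple patterns that share a join variable with the partial plan, so the sketch only conveys the ordering intuition.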
Figure 1 shows a motivating example containing a SPARQL query with three triple patterns – namely TP1, TP2 and TP3 – and two joins. Consider two different cost-based federation engines with different cardinality estimators. Figure 1(a) shows the real (i.e., actual) cardinalities of the triple patterns and of the joins between them, while Fig. 1(b) and Fig. 1(c) show the query plans generated by the two engines.
The motivating example clearly shows that good cardinality estimations are essential to produce a better query plan. The question we aim to answer pertains to how much the accuracy of cardinality estimations affects the overall query plan and the overall query runtime performance. To answer this question, the q-error (Q in Fig. 1) was introduced in [29] in the database literature. In the next section, we define this measure and propose new metrics based on similarities to measure the overall estimation errors for triple patterns, joins, and the complete query plan.
Cardinality estimation-related metrics
Now we formally define the q-error and our proposed metrics, namely the triple patterns error, the joins error, and the query plan error.
q-error
The q-error is the factor by which an estimated cardinality value differs from the actual cardinality value [29].
Definition (q-error). Let $c$ be the real (non-zero) cardinality and $\hat{c}$ the estimated cardinality. The q-error is defined as $\max(\hat{c}/c,\; c/\hat{c})$. The q-error of a query plan is the maximum of the q-errors of all triple pattern and join cardinality estimations in the plan.
In this definition, over- and underestimations are treated symmetrically [29]. In the motivating example given in Fig. 1, the real cardinality of TP1 is 100.
The q-error makes use of the ratio instead of an absolute or quadratic difference and is hence able to capture the intuition that only relative differences matter for making planning decisions. In addition, the q-error provides a theoretical upper bound for the plan quality if the q-error of a query is bounded. Since it only considers the maximum value amongst those calculated, it is possible that plans with good average estimations are regarded as poor by this measure. Consider the query plans given in Fig. 1(b) and Fig. 1(c). Both have a q-error of 3, yet the query plan in Fig. 1(b) is optimal, while the query plan in Fig. 1(c) is not. To solve this problem, we introduce the additional metrics defined below.
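The following snippet illustrates this weakness of the max-based definition. The cardinalities are hypothetical, chosen only to mimic the situation of Fig. 1(b) and Fig. 1(c), where one plan errs on a single estimate and the other on every estimate, yet both obtain a q-error of 3.

```python
# The q-error of a plan is the maximum per-estimate q-error. Both plans
# below obtain the same q-error of 3, even though plan B errs on a single
# estimate while plan C errs on all three. The values are hypothetical.

def q_error(estimated, real):
    return max(estimated / real, real / estimated)

def plan_q_error(estimates, reals):
    return max(q_error(e, r) for e, r in zip(estimates, reals))

plan_b = plan_q_error([30, 10, 10], [10, 10, 10])  # one bad estimate
plan_c = plan_q_error([30, 30, 30], [10, 10, 10])  # three bad estimates
print(plan_b, plan_c)  # 3.0 3.0
```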
Similarity errors
The overall similarity error of query triple patterns is formalised as follows:
Definition (Triple Patterns Error). Let Q be a SPARQL query containing triple patterns $tp_1, \ldots, tp_n$, let $r = (c(tp_1), \ldots, c(tp_n))$ denote the vector of real cardinalities of these triple patterns, and let $e = (\hat{c}(tp_1), \ldots, \hat{c}(tp_n))$ denote the vector of estimated cardinalities. The triple patterns error of Q is the similarity error between $r$ and $e$, computed as $\|r - e\| \,/\, (\|r\| + \|e\|)$, which ranges from 0 (perfect estimation) to 1.
In the motivating example given in Fig. 1, the real cardinalities vector of the triple patterns is compared with each engine's estimated cardinalities vector to compute the corresponding triple patterns error.
Definition (Joins Error). Let Q be a SPARQL query containing joins $j_1, \ldots, j_m$, let $r = (c(j_1), \ldots, c(j_m))$ be the vector of real join cardinalities and $e = (\hat{c}(j_1), \ldots, \hat{c}(j_m))$ the vector of estimated join cardinalities. The joins error of Q is the similarity error $\|r - e\| \,/\, (\|r\| + \|e\|)$.
Definition (Query Plan Error). Let Q be a SPARQL query and let $r$ and $e$ be the vectors of real and estimated cardinalities, respectively, of all triple patterns and joins in the query plan of Q. The query plan error of Q is the similarity error $\|r - e\| \,/\, (\|r\| + \|e\|)$.
In the motivating example given in Fig. 1(b), the real cardinalities vector of all triple patterns and joins is compared against the corresponding vector of estimated cardinalities to obtain the query plan error of that engine.
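All three metrics rely on the same similarity computation. The sketch below assumes the similarity error is the Euclidean distance between the real and estimated cardinality vectors, normalised by the sum of the vectors' norms; this is an assumption, but it reproduces the 0.86 example value reported in Section 6. The vectors themselves are made up.

```python
import math

# Assumed form of the similarity error: Euclidean distance between the
# real and estimated cardinality vectors, normalised by the sum of their
# norms, yielding values in [0, 1] (0 = perfect estimation).

def similarity_error(real, estimated):
    dist = math.dist(real, estimated)
    return dist / (math.hypot(*real) + math.hypot(*estimated))

# Triple patterns error: vectors over the query's triple patterns.
tp_err = similarity_error([100, 20, 5], [100, 20, 5])
# Query plan error: one vector over all triple patterns and joins.
plan_err = similarity_error([100, 20, 5, 50, 2], [100, 40, 5, 150, 2])
print(round(tp_err, 2), round(plan_err, 2))  # 0.0 0.34
```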
Selected federation engines
In this section, we give a brief overview of the selected cost-based SPARQL federation engines. In particular, we describe how the cardinality estimations for triple patterns and joins between triple patterns are performed in these engines.
Let D represent a dataset, or source for short.
From these notations, the cardinality of a triple pattern in a dataset can be derived.
In the query planning phase, SPARQL expressions E are defined recursively [8,9]: every triple pattern is a SPARQL expression, and if $E_1$ and $E_2$ are SPARQL expressions, then so are their combinations via the SPARQL operators, such as join and union.
Triple pattern cardinality is estimated as follows:
In these equations, the required counts are taken from the pre-computed dataset statistics.
In this section, the letters with a question mark (e.g., ?s, ?o) denote variables.
For single triple pattern cardinality estimation, the selectivity of each part is estimated as follows:
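The LHD equations themselves did not survive extraction; the following is only a rough sketch in the spirit of VoID-based estimators, in which a bound subject (object) contributes a selectivity of one over the number of distinct subjects (objects), a bound predicate contributes the fraction of triples using that predicate, and unbound positions contribute 1. All statistics and names below are made up.

```python
# Sketch of a VoID-style triple pattern cardinality estimator; an
# illustrative assumption, not LHD's exact equations.

def triple_pattern_cardinality(stats, s_bound, p, o_bound):
    sel_s = 1.0 / stats["distinct_subjects"] if s_bound else 1.0
    sel_p = stats["triples_with_p"].get(p, 0) / stats["triples"] if p else 1.0
    sel_o = 1.0 / stats["distinct_objects"] if o_bound else 1.0
    return stats["triples"] * sel_s * sel_p * sel_o

stats = {"triples": 10000, "distinct_subjects": 500,
         "distinct_objects": 2000, "triples_with_p": {"foaf:name": 400}}
# ?s foaf:name ?o  ->  only the predicate is bound:
print(triple_pattern_cardinality(stats, False, "foaf:name", False))  # 400.0
```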
Given two triple patterns T1 and T2, LHD calculates the join selectivity by using the following equations:
Using the join selectivity values, join cardinality is estimated by the following equation:
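The concrete equation is not reproduced above; a common textbook form, shown here as an assumption rather than LHD's exact formula, multiplies the two triple-pattern cardinalities by a join selectivity estimated as one over the larger number of distinct join-variable bindings.

```python
# Assumed sketch of selectivity-based join cardinality estimation:
# card(T1 JOIN T2) = card(T1) * card(T2) * sel, with sel approximated as
# 1 / max(d1, d2), where d1 and d2 are the numbers of distinct values
# each pattern binds to the join variable. Numbers are illustrative.

def join_selectivity(distinct1, distinct2):
    return 1.0 / max(distinct1, distinct2)

def join_cardinality(card1, card2, distinct1, distinct2):
    return card1 * card2 * join_selectivity(distinct1, distinct2)

# e.g. 400 and 1200 matching triples, joining on a variable with
# 300 resp. 1000 distinct bindings:
print(join_cardinality(400, 1200, 300, 1000))  # 480.0
```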
VoID vocabulary:
SemaGrow contains a Resource Discovery component, which returns the list of relevant sources for a triple pattern along with statistics. The statistics related to a data source include (1) the number of estimated distinct subjects, predicates and objects matching the triple pattern, and (2) the number of triples in the data source matching the triple pattern. The cardinality of a triple pattern is provided by the Resource Discovery component. For more complex expressions, however, SemaGrow needs to make an estimation based on the available statistics. To estimate complex expressions from these basic statistics, SemaGrow adopts the formulas described by LHD [49]. The cardinality of each expression (E) in a data source S is defined as follows:
For estimating the join cardinality, we need to calculate the join selectivity of the two expressions involved.
Odyssey represents statistics about entities using characteristic sets (CSs).
For arbitrarily-shaped queries, Odyssey also considers the connections (links) between different CSs. Characteristic pairs (CPs) describe the links between characteristic sets (CSs) via properties: for entities connected by a property, the pair of their characteristic sets together with that property forms a CP.
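To make the idea of CS-based estimation concrete, the toy sketch below groups entities by the exact set of properties they use and estimates the number of entities matching a star-shaped pattern. It illustrates the general technique, not Odyssey's implementation; the data are made up.

```python
from collections import Counter

# Sketch of characteristic-set (CS) statistics: entities are grouped by
# the exact set of properties they use; a star pattern asking for the
# properties in `wanted` matches every entity whose CS is a superset.

def characteristic_sets(triples):
    props = {}
    for s, p, o in triples:
        props.setdefault(s, set()).add(p)
    return Counter(frozenset(ps) for ps in props.values())

def star_cardinality(cs_counts, wanted):
    # number of distinct entities that have at least the wanted properties
    return sum(n for cs, n in cs_counts.items() if wanted <= cs)

triples = [("e1", "name", "a"), ("e1", "age", "b"),
           ("e2", "name", "c"), ("e3", "name", "d"), ("e3", "age", "e")]
cs = characteristic_sets(triples)
print(star_cardinality(cs, {"name", "age"}))  # 2
```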
In this section, we discuss the results we obtained in our evaluation. All results are also available at the project homepage. First, we evaluate our novel metrics in terms of how they are correlated with the overall query runtime performance of state-of-the-art federated query engines. Thereafter, we compare existing cost-based SPARQL federation engines using the proposed metrics and discuss the evaluation results.
Experiment setup and hardware
We used all queries except the large data queries (L1–L10) in our experiments. The reason for excluding L1–L10 is that the evaluation results presented in [38] show that most engines are not yet able to execute these queries. LargeRDFBench comprises 13 real-world RDF datasets of varying sizes. We loaded each dataset into a Virtuoso 7.2 server.
Regression experiments
Throughout our regression experiments, our null hypothesis was that there is no correlation between the runtimes of queries and the error measurements (i.e., q-error or similarity error) used in the experiments. We began by investigating the dependency between the metrics we proposed and the overall query runtime performance of the federation engines selected for our experiments. Figure 2 shows the results of a simple linear regression experiment aiming to compute the dependency between the q-error and similarity errors and the overall query runtimes. For a particular engine, the left figure shows the dependency between the q-error and the overall runtime, while the right figure in the same row shows the result of the correlation of runtime with the similarity error. The higher coefficients (dubbed R in the figure) computed in the experiments with similarity errors suggest that the similarity errors are likely a better predictor for runtime. The positive value of the coefficient suggests that an increase in similarity error also means an increase in the overall runtime. It can be observed from the figure that outliers are potentially contaminating the results. Hence, we applied robust regression [21,35,37] using the Huber loss function [22] in a second series of experiments to lessen the effect of the outliers on the results (especially for q-errors) (see Fig. 3). We observe that after reducing the influence of outliers using robust regression, the average R-values of the similarity-based error correlation further increase. The lower p-values in the similarity-error-based experiments further confirm that our metrics are more likely to be a better predictor for runtime than the q-error. The reason for this result is that our measure exploits more information and is hence less affected by outliers. This is not the case for the q-error, which can be perturbed significantly by a single outlier.

q-error and similarity error vs. runtime (simple linear regression analysis). The grey shaded areas represent the confidence intervals (bands) around the regression line.

q-error and similarity error vs. runtime (robust regression analysis). The grey areas represent the confidence intervals (bands) around the regression line.
Spearman’s rank correlation coefficients between query plan features and query runtimes for all queries
Spearman’s rank correlation coefficients between query plan features and query runtimes after linear regression (only for common queries between all systems)
Spearman’s rank correlation coefficients between query plan features and query runtimes after robust regression (only for common queries between all systems).
To further investigate the correlation between metrics and runtimes, we measured Spearman's correlation coefficient between query runtimes and the corresponding errors of each of the first six metrics. The results are shown in Table 2, which shows that the proposed metrics on average have positive correlations with query runtimes, i.e., the smaller the error, the smaller the query runtime. The similarity error of the overall query plan shows the highest correlation with runtime among the considered metrics.
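Spearman's coefficient used here is simply Pearson's correlation computed on ranks. A minimal, tie-unaware sketch on made-up error/runtime pairs:

```python
# Minimal Spearman's rank correlation: rank both series, then compute
# Pearson's correlation on the ranks. No tie correction; data are made up.

def ranks(xs):
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def spearman(x, y):
    return pearson(ranks(x), ranks(y))

errors = [0.1, 0.4, 0.2, 0.9]      # hypothetical similarity errors
runtimes = [120, 800, 300, 5000]   # hypothetical runtimes (ms)
print(round(spearman(errors, runtimes), 6))  # 1.0
```

A perfectly monotonic relation yields a coefficient of 1 regardless of how non-linear the runtime growth is, which is why Spearman's coefficient suits this analysis.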
In order to make a fair comparison, we only consider the common queries that every system passed. We eliminate LHD [49] because it failed on 20/32 benchmark queries (passing only 12 simple queries), making it inadequate for comparison. We apply Spearman's correlation again on the common queries. Table 3 shows that the proposed metrics have a positive correlation with query runtime when only common queries are considered, with the similarity error of the overall plan correlating most strongly.
Furthermore, we reduced the influence of outliers by applying robust regression to both the q-error and the proposed similarity error metrics. Robust regression is done by Iteratively Re-weighted Least Squares (IRLS) [21]. We used Huber weights [22] as the weighting function in IRLS. This approach further fine-tuned the results and made the correlation between our proposed similarity error and runtime stronger. Table 4 shows that all metrics have a positive correlation; for our proposed metrics, however, the difference is pronounced, with the similarity error of the overall query plan remaining the strongest predictor.
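The IRLS procedure with Huber weights can be sketched as follows: ordinary weighted least squares is alternated with a re-weighting step in which points whose residuals exceed a threshold (the Huber tuning constant k = 1.345 times a robust scale estimate) are down-weighted proportionally to their residual. The data below are synthetic, with a single planted outlier; the scale estimate (median absolute residual over 0.6745) and fallback are implementation choices of this sketch, not necessarily those of the paper's tooling.

```python
# Sketch of robust simple linear regression via IRLS with Huber weights.
# Synthetic data: y = 2x + 1 with one extreme outlier; the fitted slope
# should stay close to the underlying trend.

def weighted_linear_fit(x, y, w):
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    num = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    den = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = num / den
    return slope, my - slope * mx

def irls_huber(x, y, k=1.345, iters=25):
    w = [1.0] * len(x)
    for _ in range(iters):
        slope, intercept = weighted_linear_fit(x, y, w)
        resid = [yi - (slope * xi + intercept) for xi, yi in zip(x, y)]
        # robust scale: median absolute residual / 0.6745 (1.0 if zero)
        scale = sorted(abs(r) for r in resid)[len(resid) // 2] / 0.6745 or 1.0
        w = [1.0 if abs(r) <= k * scale else k * scale / abs(r) for r in resid]
    return slope, intercept

x = list(range(10))
y = [2.0 * xi + 1.0 for xi in x]
y[7] = 100.0                      # single extreme outlier
slope, intercept = irls_huber(x, y)
print(round(slope, 2))  # slope recovered despite the outlier
```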

Similarity and q-error of query plan.
Another important factor worth mentioning is that robust regression does not rely on normality assumptions. Comparing the p-values (at the 5% significance level) of simple linear regression and robust regression suggests that the data is sufficiently normally distributed for simple linear regression.
Overall, the results show that the proposed similarity errors correlate better with query runtimes than the q-error. Moreover, the correct estimation of the overall plan is clearly the most crucial part of plan generation. Thus, it is important for federation engines to pay particular attention to the cardinality estimation of the overall query plan. However, given that this estimation commonly depends on triple pattern and join estimations, better means for approximating triple pattern and join cardinalities should lead to better plans. The weak to moderate correlation of the similarity errors with query runtimes suggests that query runtime is a complex measure affected by multiple dimensions, such as the metrics given in Table 1 and SPARQL features such as the number of triple patterns, their selectivities, the use of projection variables, and the number of joins and their types [44]. Therefore, it is rather hard to pinpoint a single metric or SPARQL feature that has a high correlation with the runtime [38,44]. The proposed similarity error metrics relate to the query planning component of the federation engines and are useful for evaluating the quality of the query plans these engines generate.
In the robust regression model, outliers are re-weighted according to the Huber loss function. For the similarity error, the queries re-weighted after applying robust regression are: C2, C1, S14 and CH7 in CostFed; S2 in SemaGrow; S8 and S2 in SPLENDID; and S8 in Odyssey. For the q-error, the re-weighted queries are: C6, C2, C4, CH7 and S3 in CostFed; CH3, CH4, S13 and C2 in SemaGrow; CH6, C2, C7 and S5 in SPLENDID; and S11, C2, C1 and S4 in Odyssey.
In these queries, the residual values are either significantly higher or lower than the regression line. For example, in CostFed the average of the similarity errors across all queries is 0.272 and the range of the residual values for unmodified queries is between
Finally, the overall q-error is more affected by robust regression than the similarity error. This is because the q-error takes the maximum of all errors in the cardinality estimation of the joins and triple patterns. Consequently, some queries produce very high q-error values due to a single poor cardinality estimate for a join or a triple pattern.
Combined regression-based comparison analysis
Recall that our null hypothesis was that there is no correlation between query runtime and error measurement. Based on the results shown in Figs 2 and 3, we can make the following observations:
We can reject the null hypothesis in 62.5% (i.e., 5 out of 8) of the experiments for the similarity error, while the same can only be done in 12.5% (1 out of 8) of the experimental settings for the q-error. The similarity error is significantly correlated with the runtimes of CostFed (simple and robust regression), SemaGrow (simple and robust regression) and SPLENDID (robust regression). On the other hand, the q-error is solely significantly correlated with the runtime of SemaGrow (robust regression). In the one case where the p-values achieved by both measures allow us to reject the null hypothesis (i.e., for SemaGrow using the robust regression analysis), the R-value of the similarity error is higher than that of the q-error (0.56 vs. 0.53). For Odyssey, neither the similarity error nor the q-error produced significant results in our experiments. This suggests that the two errors do not capture the phenomena that influence the performance of Odyssey. A deeper look into Odyssey's runtime performance suggests that it performs worst w.r.t. its source selection time (see Table 8), a factor which is not captured by the errors considered herein.
Our observations suggest that the similarity error is more likely to be significantly correlated with the runtime of a federated query engine than the q-error. However, for some systems (like Odyssey in our case) it may not produce significant results. Interestingly, the correlation between similarity error and runtimes is significant and highest for the best-performing federated query engine, CostFed (in terms of average query runtime, see Fig. 5). We hypothesize that this result might indicate that the similarity error is most useful for systems which are already optimized to generate good plans. However, this hypothesis needs to be confirmed through further experiments. Still, the usefulness of the similarity error seems especially evident when one compares the behaviour of the similarity error and the q-error when faced with single cardinality estimation errors. For example, suppose we have 3 joins in a query with estimated cardinalities 10, 10 and 100 and with real cardinalities 10, 10 and 1, respectively. The q-error of the plan would be 100 even though only a single join estimation was not optimal. As shown by the equation in Section 4.1, the q-error is sensitive to single estimation errors of high magnitude. This is not the case for the similarity error, which would return 0.86.
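This behaviour is easy to verify numerically. The snippet below reproduces the example, assuming the similarity error is the Euclidean distance between the vectors normalised by the sum of their norms (the formal definition was lost in extraction, but this form reproduces the quoted 0.86):

```python
import math

# Worked version of the example: three joins with estimated cardinalities
# (10, 10, 100) and real cardinalities (10, 10, 1). The similarity error is
# computed here under the assumed normalised-distance form.

est = [10, 10, 100]
real = [10, 10, 1]

q_error = max(max(e / r, r / e) for e, r in zip(est, real))
sim_error = math.dist(est, real) / (math.hypot(*est) + math.hypot(*real))

print(q_error)               # 100.0
print(round(sim_error, 2))   # 0.86
```

A single bad estimate dominates the q-error completely, while the similarity error reflects that two of the three estimates were perfect.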
q-error and similarity-based errors
We now present a comparison of the selected cost-based engines based on the 6 metrics given in Fig. 4. Overall, the similarity errors of query plans given in Fig. 4(a) suggest that CostFed produces the smallest errors, followed by SPLENDID, LHD, SemaGrow, and Odyssey. CostFed produces smaller errors than SPLENDID in 10/17 comparable queries (excluding queries with timeout and runtime errors). SPLENDID produces smaller errors than LHD in 12/14 comparable queries. LHD produces smaller errors than SemaGrow in 6/12 comparable queries. In turn, SemaGrow produces smaller errors than Odyssey in 9/15 comparable queries.
An overall evaluation of the q-error of query plans given in Fig. 4(b) leads to the following result: CostFed produces the smallest errors followed by SPLENDID, SemaGrow, Odyssey, and LHD. In particular, CostFed produces smaller errors than SPLENDID in 9/17 comparable queries (excluding queries with timeout and runtime error). SPLENDID produces smaller errors than SemaGrow in 9/17 comparable queries. SemaGrow produces smaller errors than Odyssey in 8/13 comparable queries. Odyssey is superior to LHD in 5/8 cases.
An overall evaluation of the similarity error in joins leads to a different picture (see Fig. 4(c)). While CostFed remains the best system and produces the smallest errors, it is followed by Odyssey, SPLENDID, SemaGrow, and LHD. In particular, CostFed outperforms Odyssey in 12/17 comparable queries (excluding queries with timeout and runtime errors). Odyssey produces fewer errors than SPLENDID in 7/14 comparable queries. SPLENDID is superior to SemaGrow in 11/17 comparable queries. SemaGrow outperforms LHD in 7/12 comparable queries.
As an overall evaluation of the q-error of joins given in Fig. 4(d), CostFed produces the smallest errors, followed by SPLENDID, SemaGrow, Odyssey, and LHD. CostFed produces fewer errors than SPLENDID in 12/17 comparable queries (excluding queries with timeout and runtime errors). SPLENDID produces fewer errors than SemaGrow in 9/17 comparable queries. SemaGrow produces fewer errors than Odyssey in 9/13 comparable queries. Odyssey produces fewer errors than LHD in 4/8 comparable queries.
Overall, the evaluation of the similarity errors of triple patterns given in Fig. 4(e) reveals that CostFed produces the smallest errors followed by SPLENDID, Odyssey, SemaGrow, and LHD. CostFed produces smaller errors than SPLENDID in 10/17 comparable queries (excluding queries with timeout and runtime error). SPLENDID produces smaller errors than Odyssey in 15/17 comparable queries. Odyssey produces smaller errors than SemaGrow in 7/14 comparable queries. SemaGrow outperformed LHD in 6/12 queries.
An overall evaluation of the q-error of triple patterns is reported in Fig. 4 as well.
In general, the accuracy of the estimation depends on the level of detail of the statistics stored in the index or data summaries. Furthermore, it is important to pay special attention to the different types of triple patterns (with bound and unbound subjects, predicates and objects) and join types (subject-subject, subject-object, object-object) for better cardinality estimations. CostFed is more accurate because of its more detailed data summaries, which handle the different types of triple patterns and joins between triple patterns. Its use of buckets allows more accurate cardinality estimates for triple patterns with the most common predicates in the dataset. Furthermore, it handles multi-valued predicates. The Odyssey statistics are more detailed than those of SPLENDID and SemaGrow (both using VoID statistics). The distributed characteristic set (CS) and characteristic pair (CP) statistics generally lead to better cardinality estimations for joins.
How much does an efficient cardinality estimation really matter?
We observed that it is possible for a federation engine to produce quite a high cardinality estimation error (e.g., 0.99 is the overall similarity error for the S11 query in SemaGrow), yet it produces the optimal query plan. This leads to the question, how much does the efficiency of cardinality estimators of federation engines matter to generate optimal query plans? To this end, we analyzed query plans generated by each of the selected engines for the benchmark queries. In our analysis, there are three possible cases in each plan:
Table 5 shows the query plans generated by the query planners of the selected engines according to the aforementioned three cases possible for each plan. Since LHD failed to generate any query plan for the majority of the LargeRDFBench queries, we omit it from further discussion. In our evaluation, CostFed produced the fewest sub-optimal plans (i.e., 6), followed by Odyssey (i.e., 11), SemaGrow (i.e., 12), and SPLENDID (i.e., 14). CostFed's small number of sub-optimal plans is due to the fact that it has the fewest cardinality estimation errors, as discussed in the previous section. In addition, it generates the highest number of possible only-plans (which can be regarded as optimal plans for the given source selection information). This is because CostFed's source selection is more efficient in terms of the total triple pattern-wise sources selected without losing recall (see Table 8).
Query plans generated by query engines for all queries (Simple, Complex, Complex + High Dimensional Queries). Failed: engine failed to produce a query plan; OptP: optimal query plan generated by the engine; subOpt: sub-optimal plan generated by the engine; OnlyP: only one plan possible.
In Table 5, we can see that only a few sub-optimal query plans were generated for simple queries. This is due to the fact that the simple-category queries of LargeRDFBench contain very few joins (2.6 on avg. [38]) to be executed by the federation engines. Thus, it is relatively easy to find the best join execution order. However, for complex and complex-plus-high-data-sources queries, more sub-optimal plans were generated. This is because these queries contain more joins (around 4 on avg. [38]); hence, a more accurate join cardinality estimation is required to generate the optimal join ordering. In conclusion, efficient cardinality estimation is most important for complex queries with many possible join orderings.
Number of transferred tuples. "Green color" means lowest value among all systems, and "red color" means highest value among all systems.
Table 6 shows the number of tuples sent and received during query execution for the selected federation engines. The number of sent tuples is related to the number of endpoint requests sent by the federation engine during query processing [30,47]. The number of received tuples can be regarded as the number of intermediate results produced by the federation engine during query processing [30]. A smaller number of transferred tuples is considered important for fast query processing [30]. In this regard, CostFed ranked first with 31 green boxes (i.e., it had the best results among the selected engines), followed by Odyssey with 24 green boxes, SemaGrow with 12 green boxes, LHD with 10 green boxes, and then SPLENDID with 9 green boxes.
In most queries, CostFed and Odyssey produced only-plans, i.e., only one query plan (excluding the left join for the OPTIONAL SPARQL operator) was possible and was locally executed by the federation engine. Consequently, these engines transfer fewer tuples in comparison to the other approaches. The largest difference is observed for S13, where CostFed and Odyssey clearly outperform the other approaches, transferring 500 times fewer tuples. The number of received tuples in LHD is significantly higher in comparison to the other approaches. This is because it does not produce normal tree-like query plans. Rather, LHD focuses on generating independent tasks that can be run in parallel. These independent tasks retrieve many intermediate results, which need to be joined locally in order to get the final query result set.
Comparison of index construction time (Index Gen. Time) and index size for selected federation engines.
Another advantage that CostFed and Odyssey have over the other approaches is their join-aware approach to triple pattern-wise source selection (TPWSS). This join-aware behaviour avoids transferring many tuples because it overestimates fewer sources. CostFed also performs better because it maintains a cache for ASK requests, saving many queries from being sent to the different sources. Another important factor worth mentioning here is that the number of transferred tuples does not consider the number of columns (i.e., the number of projection variables in the query) in the result set, but only counts the number of rows (i.e., the number of results) returned by or sent to the endpoints. We also observed that in the case of an only-plan or an optimal plan, the number of received tuples is smaller compared to sub-optimal plans, clearly indicating that a smaller number of transferred tuples is key to fast query processing. The amalgamated average over all queries could also be misleading because, in complex queries, some systems have more failed/timeout queries while others produce answers. Therefore, we calculated a separate average for each category of queries, i.e., simple, complex, and complex-and-high-data. From our analysis of the results, we conclude that if an engine produces an optimal plan or an only-plan, the number of intermediate results decreases.
A small index is essential for fast index lookup during source selection, but it can lack important information. In contrast, a large index slows down index lookup and is hard to manage, but may lead to better cardinality estimations. It is therefore important to compare the sizes of the indexes generated by the selected federation engines. Table 7 shows a comparison of the index/data summaries' construction time and the index size.
The index size is given by the size of the summaries used for cardinality estimation (in MB).
Comparison of selected federation engines in terms of source selection time. Green color means the lowest value among all systems, and red color means the highest value among all systems
According to [38], the efficiency of source selection can be measured in terms of: (1) the total number of triple pattern-wise sources selected (#T), (2) the number of SPARQL ASK requests sent to the endpoints (#A) during source selection, and (3) the source selection time. Table 8 compares the source selection algorithms of the selected engines across these metrics. As discussed previously, a smaller #T leads to better query plan generation [38]. A smaller #A leads to a smaller source selection time, which in turn leads to a smaller query execution time. In this regard, CostFed ranked first (83 green boxes, i.e., the best results among the selected engines), followed by Odyssey with 56 green boxes, LHD with 15, SPLENDID with 10, and SemaGrow with 9 green boxes.
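The three metrics from [38] are straightforward to compute from per-query source selection logs. The record layout below is illustrative (field names are our own, not taken from any engine's actual API); #T is simply the sum of selected sources over all triple patterns of the query.

```python
from dataclasses import dataclass

@dataclass
class SourceSelectionLog:
    """Hypothetical per-query log of an engine's source selection phase."""
    query_id: str
    sources_per_triple_pattern: dict   # triple pattern -> set of selected endpoints
    ask_requests_sent: int             # SPARQL ASK requests issued (#A)
    selection_time_ms: float           # wall-clock source selection time

def source_selection_metrics(log):
    """Compute the three efficiency metrics of [38]: #T, #A and time."""
    num_t = sum(len(srcs) for srcs in log.sources_per_triple_pattern.values())
    return {"#T": num_t, "#A": log.ask_requests_sent,
            "time_ms": log.selection_time_ms}

# Example: a query with two triple patterns
log = SourceSelectionLog(
    query_id="S1",
    sources_per_triple_pattern={
        "?s :p1 ?o": {"dbpedia", "geonames"},
        "?o :p2 ?x": {"dbpedia"},
    },
    ask_requests_sent=4,
    selection_time_ms=12.5,
)
print(source_selection_metrics(log))  # {'#T': 3, '#A': 4, 'time_ms': 12.5}
```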
Approaches that perform join-aware and hybrid (SPARQL ASK + index) source selection lead to a smaller #T [38]. Both Odyssey and CostFed perform join-aware source selection and hence lead to a smaller #T than the other selected approaches. The highest number of SPARQL ASK requests is sent by index-free federation engines, followed by hybrid (SPARQL ASK + index) engines, which in turn are followed by index-only federation engines [38]. This is because for index-free federation engines, such as FedX, the complete source selection is based on SPARQL ASK queries. Hybrid engines such as CostFed, SPLENDID, SemaGrow and Odyssey make use of both an index and SPARQL ASK queries to perform source selection; thus, some SPARQL ASK requests are skipped thanks to the information stored in the index. Index-only engines, such as LHD, rely solely on the index to perform the complete source selection. Thus, these engines do not issue a single SPARQL ASK request during source selection. The source selection time for such engines is much smaller, since it requires only index lookups without sending any requests to the endpoints. However, they select more sources (#T) than hybrid (SPARQL ASK + index) source selection approaches.
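The interplay between index lookups, ASK probes and an ASK cache (as maintained by CostFed) can be sketched as follows. This is a simplified illustration under our own assumptions, not any engine's actual algorithm: we assume a predicate-keyed index and fall back to ASK probes only when the predicate is unbound or absent from the index.

```python
def hybrid_select_sources(triple_pattern, endpoints, index, ask_cache, send_ask):
    """Hybrid (index + SPARQL ASK) source selection sketch.

    triple_pattern: (subject, predicate, object) strings, '?x' for variables.
    index:          predicate -> set of endpoints known to serve it.
    ask_cache:      (endpoint, predicate) -> bool, filled by earlier probes;
                    cache hits skip the network ASK request entirely.
    send_ask:       callable simulating a SPARQL ASK request to an endpoint.
    """
    predicate = triple_pattern[1]
    if not predicate.startswith("?") and predicate in index:
        # Pure index lookup: no network requests at all (index-only behaviour).
        return set(index[predicate])
    selected = set()
    for ep in endpoints:
        key = (ep, predicate)
        if key not in ask_cache:            # cached answer avoids a new ASK
            ask_cache[key] = send_ask(ep, triple_pattern)
        if ask_cache[key]:
            selected.add(ep)
    return selected

# Usage: an indexed predicate needs zero ASK requests; an unindexed one is
# probed once per endpoint and then served from the cache.
calls = []
def fake_ask(ep, tp):
    calls.append(ep)
    return ep == "dbpedia"

index = {":name": {"dbpedia", "geonames"}}
cache = {}
hybrid_select_sources(("?s", ":name", "?o"), ["dbpedia", "lmdb"], index, cache, fake_ask)
hybrid_select_sources(("?s", ":born", "?o"), ["dbpedia", "lmdb"], index, cache, fake_ask)
```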
Finally, we present the query runtime results of the selected federation engines across the different query categories of LargeRDFBench. Figure 5 gives an overview of our results. In our runtime evaluation on simple queries (S1–S14) (see Fig. 5(a)), CostFed has the shortest runtimes, followed by SemaGrow, LHD, Odyssey, and SPLENDID. CostFed's runtimes are shorter than SemaGrow's on 13/13 comparable queries (excluding queries with timeouts and runtime errors), with an average runtime of 0.5 sec for CostFed vs. 2.5 sec for SemaGrow. SemaGrow outperforms LHD on 4/11 comparable queries with an average runtime of 2.5 sec for SemaGrow vs. 2.7 sec for LHD. LHD's runtimes are shorter than Odyssey's on 8/10 comparable queries, with an average runtime of 8.5 sec for Odyssey. Finally, Odyssey is clearly faster than SPLENDID on 8/12 comparable queries, with an average runtime of 131 sec for SPLENDID.
Our runtime evaluation on the complex queries (C1–C10) (see Fig. 5(b)) leads to a different ranking: CostFed produces the shortest runtimes followed by SemaGrow, Odyssey, and SPLENDID. CostFed outperforms SemaGrow in 6/6 comparable queries (excluding queries with timeout and runtime error) with an average runtime of 3 sec for CostFed vs. 9 sec for SemaGrow. SemaGrow’s runtimes are shorter than Odyssey’s in 3/4 comparable queries with an average runtime of 63 sec for Odyssey. Odyssey is better than SPLENDID in 5/5 comparable queries, where SPLENDID’s average runtime is 98 sec.
The runtime evaluation on the complex and high data sources queries (CH1–CH8) given in Fig. 5(c) establishes CostFed as the best query federation engine, followed by SPLENDID and then SemaGrow. CostFed's runtimes are smaller than SemaGrow's in 3/3 comparable queries (excluding queries with timeouts and runtime errors), with an average runtime of 4 sec for CostFed vs. 191 sec for SemaGrow. SPLENDID has no comparable queries with CostFed and SemaGrow. LHD and Odyssey both fail to produce results for this category of queries.

Average execution time of LargeRDFBench and FedBench queries.
In this paper, we presented an extensive evaluation of existing cost-based federated query engines. We used existing metrics from relational database research and proposed new metrics to measure the quality of cardinality estimators of selected engines. To the best of our knowledge, this work is the first evaluation of cost-based SPARQL federation engines focused on the quality of the cardinality estimations.
The proposed similarity-based errors have a stronger positive correlation with runtimes, i.e., the smaller the error values, the better the query runtimes. Thus, this metric helps developers to design more efficient query execution planners for federation engines. Our proposed approach produces more significant results than the q-error; however, there is still room for further improvement. The higher correlation coefficients (R values) of the similarity errors, as opposed to the q-error, suggest that the proposed similarity errors are a better predictor of runtime than the q-error. The smaller p-values of the similarity errors, as compared to the q-error, further confirm that the similarity errors are more likely to be the better runtime predictor. Errors in the cardinality estimation of triple patterns have a higher correlation with runtimes than errors in the cardinality estimation of joins. Thus, cost-based federation engines must pay particular attention to attaining accurate cardinality estimations of triple patterns. The number of transferred tuples has a direct correlation with query runtime, i.e., the smaller the number of transferred tuples, the smaller the query runtime. A smaller number of triple pattern-wise sources selected is key to generating the maximum number of only possible query plans. On average, the CostFed engine produces the fewest estimation errors and has the shortest execution time for the majority of LargeRDFBench queries. The weak to moderate correlation of the cardinality errors with query execution time suggests that query runtime is a complex measure affected by multi-dimensional performance metrics and SPARQL query features. The proposed similarity error metric is related to the query planning component of the federation engines and is useful for evaluating the quality of the query plans generated by these engines.
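To make the comparison concrete, here is a minimal sketch of the two error families and of the correlation coefficient (R) used above. The q-error is the classical max-ratio measure from the database literature; the similarity-error formula below is an illustrative bounded stand-in, not necessarily the exact formulation used in this paper.

```python
def q_error(real, est):
    """Classical q-error: max(est/real, real/est); >= 1, unbounded above."""
    real, est = max(real, 1.0), max(est, 1.0)  # guard against zero cardinalities
    return max(est / real, real / est)

def similarity_error(real, est):
    """Illustrative similarity-based error, bounded in [0, 1]:
    0 = perfect estimate, values near 1 = maximally wrong."""
    if real == est:
        return 0.0
    return abs(real - est) / max(real, est)

def pearson_r(xs, ys):
    """Pearson correlation coefficient R between per-query error values
    and observed query runtimes."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# A 10x overestimation yields q-error 10 but a bounded similarity error of 0.9.
print(q_error(100, 1000))            # 10.0
print(similarity_error(100, 1000))   # 0.9
```

Computing `pearson_r` between each error metric and the measured runtimes (and comparing the resulting R values and p-values, e.g. via `scipy.stats.pearsonr`) is how such a "better predictor of runtime" claim can be checked.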
The proposed cardinality estimating metrics are generic and can be applied to non-federated cardinality-based query processing engines as well.
The impact of our work is to provide new measures for the development of better cost-based federated SPARQL query engines. Furthermore, our proposed metrics help determine the quality of the generated query plans, e.g., by indicating whether or not the join orders are correct. This kind of information is not revealed by the query runtime alone, because the overall query runtime is affected by all of the metrics given in Table 1. As future work, we want to compare heuristics-based (index-free) federated SPARQL query processing engines with cost-based federated engines. In particular, we want to investigate how much an index helps a cost-based federated SPARQL engine to generate optimized query execution plans.
Acknowledgements
The work has been supported by the EU H2020 Marie Skłodowska-Curie project KnowGraphs (no. 860801), BMVI-funded project LIMBO (Grant no. 19F2029I), BMVI-funded project OPAL (no. 19F2028A), and BMBF-funded EuroStars project SOLIDE (no. 13N14456). This work has also been supported by the National Research Foundation of Korea (NRF) (grant funded by the Korea government (MSIT) (no. NRF-2018R1A2A2A05023669)).
