Abstract
The Web of Data has grown enormously over the last years. Currently, it comprises a large compendium of interlinked and distributed datasets from multiple domains. Running complex queries on this compendium often requires accessing data from different endpoints within a single query. The abundance of datasets and the need for running complex queries have thus motivated a considerable body of work on SPARQL query federation systems, the dedicated means to access data distributed over the Web of Data. However, the granularity of previous evaluations of such systems has not allowed the derivation of insights concerning their behavior in the different steps involved in federated query processing. In this work, we perform extensive experiments to compare state-of-the-art SPARQL endpoint federation systems using the comprehensive performance evaluation framework FedBench. In addition to considering the traditional query runtime as an evaluation criterion, we extend the scope of our performance evaluation by considering criteria which have received little attention in previous studies. In particular, we consider the number of sources selected, the total number of SPARQL ASK requests used, the completeness of answers, and the source selection time, and we show that they have a significant impact on the overall query runtime of existing systems. Moreover, we extend FedBench to mirror a highly distributed data environment and assess the behavior of existing systems by using the same performance criteria. As a result, we provide a detailed analysis of the experimental outcomes that reveals novel insights for improving current and future SPARQL federation systems.
Introduction
The transition from the Web of Documents to the Web of Data has resulted in a large
compendium of interlinked datasets from diverse domains. Currently, the Linking Open Data
(LOD) Cloud1
Current evaluations [1,7,21,26,27,33,38] of SPARQL query
federation systems compare some of the federation systems based on the central criterion of
overall query runtime. While optimizing the query runtime of federation systems is the
ultimate goal of the research performed on this topic, the granularity of current
evaluations fails to provide results that allow understanding why the query runtimes of
systems can differ drastically and is thus insufficient to detect the components of systems
that need to be improved. For example, key metrics such as a smart source selection in terms
of the
The aim of this paper is to experimentally evaluate a large number of SPARQL 1.0 query
federation systems within a more fine-granular setting in which we can measure the time
required to complete different steps of the SPARQL query federation process. To achieve this
goal, we conducted a public survey2
In the next step and like in previous evaluations, we compared the remaining six systems [1,7,20,25,33,38] with respect to the traditional performance criterion, that is, the query execution time, using the commonly used benchmark FedBench. In addition, we also compared these six systems with respect to their
All survey responses can be found at
To provide a quantitative analysis of the effect of
Our main contributions are summarized as follows:
We present the results of a public survey which allows us to provide a crisp overview of the categories of SPARQL federation systems as well as of their implementation details, features, and supported SPARQL clauses.
We present (to the best of our knowledge) the most comprehensive experimental evaluation of open-source SPARQL federation systems in terms of their source selection and overall query runtime, using two different evaluation setups.
Along with the central evaluation criterion (i.e., the overall query runtime), we measure three further criteria, i.e., the total number of sources selected, the total number of SPARQL ASK requests used, and the source selection time. By these means, we obtain deeper insights into the behavior of SPARQL federation systems.
We extend both FedBench and SP2Bench to mirror highly distributed data environments and test SPARQL endpoint federation systems for their parallel processing capabilities.
We provide a detailed discussion of experimental results and reveal novel insights for improving existing and future federation systems.
Our survey results point to research opportunities in the area of federated semantic data processing.
The rest of this paper is structured as follows: In Section 2, we provide an overview of state-of-the-art SPARQL federated query processing approaches. Section 3 provides a detailed description of the design of and the responses to our public survey of SPARQL query federation systems. Section 4 provides an introduction to SPARQL query federation and the approaches selected for our experimental evaluation. Section 5 describes known variables that may impact the behavior of federated SPARQL query engines. Section 6 describes our evaluation framework and experimental results, including key performance metrics, a description of the used benchmarks (FedBench, SP2Bench, SlicedBench), and the data slice generator. Section 7 provides a further discussion of the results. Finally, Section 8 concludes our work and gives an overview of possible future extensions.
In this section, we provide a two-pronged overview of existing works. First, we give an overview of existing evaluations of federated SPARQL query systems. Here, we focus on the description of different surveys/evaluations of SPARQL query federation systems and argue for the need for a new fine-grained evaluation of federated SPARQL query engines. Thereafter, we give an overview of benchmarks for SPARQL query processing engines. In addition, we provide reasons for our benchmark selection in this evaluation.
Federation systems evaluations
Several surveys of SPARQL query federation systems have been published over the last years. Rakhmawati et al. [26] present a survey of SPARQL endpoint federation systems in which the details of the query federation process are described along with a comparison of the query evaluation strategies used in these systems. Moreover, systems that support both SPARQL 1.0 and SPARQL 1.1 are explained. However, this survey does not provide any experimental evaluation of the discussed SPARQL query federation systems. In addition, the systems' implementation details and supported features are not discussed in much detail. We address these drawbacks in Tables 1 and 2, respectively. Hartig [10] provides a general overview of Linked Data federation. In particular, the specific challenges that need to be addressed are introduced, with a focus on possible strategies for executing Linked Data queries. However, this survey does not provide an experimental evaluation of the discussed SPARQL query federation systems either. Umbrich et
al. [36] provide a detailed study of the recall
and effectiveness of Link Traversal Querying for the Web of Data. Schwarte et al. [34] present an experimental study of large-scale RDF
federations on top of the Bio2RDF data sources using a particular federation system, i.e.,
FedX [33]. They focus on design decisions,
technical aspects, and experiences made in setting up and optimizing the Bio2RDF
federation. Betz et al. [5] identify various
drawbacks of federated Linked Data query processing. The authors propose that Linked Data
as a service has the potential to solve some of the identified problems. Görlitz and Staab
[8] present limitations in Linked Data
federated query processing and implications of these limitations. Moreover, this paper
presents a query optimization approach based on semi-joins and dynamic programming. Ladwig
and Tran [18] identify various strategies for
processing federated queries over Linked Data. Umbrich et al. [37] provide an experimental evaluation of the different data
summaries used in live query processing over Linked Data. Montoya et al. [22] provide a detailed discussion of the limitations
of the existing testbeds used for the evaluation of SPARQL query federation systems. Some
other experimental evaluations [1,7,21,22,27,33,38] of SPARQL query federation systems compare some of the
federation systems based on their overall query runtime. For example, Görlitz and Staab
[7] compare their approach with three other
approaches ([25,33], and Sesame AliBaba5).
All experimental evaluations above compare only a small number of SPARQL query federation systems using a subset of the queries available in current benchmarks and with respect to a single performance criterion (query execution time). Consequently, they fail to provide deeper insights into the behavior of these systems in the different steps (e.g., source selection) required during query federation. In this work, we evaluate six open-source federated SPARQL query engines experimentally on two different evaluation frameworks. To the best of our knowledge, this is currently the most comprehensive evaluation of open-source SPARQL query federation systems. Furthermore, along with the central performance criterion of query runtime, we compare these systems with respect to their source selection. Our results show (Section 6) that the different steps of the source selection affect the overall query runtime considerably. Thus, the insights gained through our evaluation w.r.t. these criteria provide valuable findings for optimizing SPARQL query federation.
There is also a rich literature on benchmarks for comparing SPARQL query processing systems. These include the Berlin SPARQL Benchmark (BSBM), SP2Bench, FedBench, the Lehigh University Benchmark (LUBM), and the DBpedia SPARQL Benchmark (DBPSB). Both BSBM and
SP2Bench are mainly designed for the evaluation of triple stores that keep
their data in a single large repository.
FedBench is (to the best of our knowledge) the only benchmark that encompasses real-world datasets, commonly used queries, and a distributed data environment. Furthermore, it is commonly used in the evaluation of SPARQL query federation systems [7,21,29,33]. Therefore, we chose this benchmark as the main evaluation benchmark in this paper. We also decided to use SP2Bench in parts of our experiments to ensure that our queries cover most of SPARQL. Note that neither FedBench nor SP2Bench provide SPARQL 1.1 federated queries. Devising such a benchmark remains future work.
Federated engines public survey
In order to provide a comprehensive overview of existing SPARQL federation engines, we designed and conducted a survey of SPARQL query federation engines. In this section, we present the principles and ideas behind the design of the survey as well as its results and their analysis.
Survey design
The aim of the survey was to compare the existing SPARQL query federation engines,
regardless of their implementation or code availability. To reach this aim, we interviewed
domain experts and designed a survey with three sections: system information,
requirements, and supported SPARQL clauses.6 The survey can be found at
The questions from the
Overview of implementation details of federated SPARQL query engines
The survey was open and free for all to participate in. To contact potential
participants, we used Google Scholar to retrieve papers that contained the keywords
“SPARQL” and “query federation”. After a manual filtering of the results, we contacted the
main authors of the papers and informed them of the existence of the survey while asking
them to participate. Moreover, we sent messages to the W3C Linked Open Data mailing
list7
Survey outcome: System’s features
Based on our survey results,9 the surveyed systems can be grouped into three main categories: query federation over multiple SPARQL endpoints, query federation over Linked Data, and query federation on top of Distributed Hash Tables (DHTs).
Each of the above main categories can be further divided into three sub-categories (see Table 1).
Survey outcome: System’s Support for SPARQL Query Constructs
Notes: QP = Query Predicates, QS = Query Subjects
Table 1 provides a classification along with the implementation details of the 14 systems which participated in the survey. Overall, we received responses mainly for systems which implement the SPARQL endpoint federation and hybrid query processing paradigms in Java. Only Atlas [16] implements DHT federation, whereas WoDQA [2], LDQPS [18], and SIHJoin [19] implement federation over Linked Data (LDF). Most of the surveyed systems are provided under a “General Public License”, with the exception of [18] and [19], which are provided under the “Scala” license, whereas the authors of [4] and [11] have not yet decided which license type will hold for their tools. Five of the surveyed systems ([2,4,24,29] and [33]) implement caching mechanisms. Only [1] and [2] provide support for catalog/index updates, whereas two systems do not require this mechanism by virtue of being index/catalog-free approaches.
Table 2 summarizes the survey outcome w.r.t. the different features supported by the systems. Only three systems ([24,33] and QWIDVD) claim that they achieve result completeness, and only Avalanche [4] and DAW [29] support partial result retrieval in their implementations. Note that FedX claims result completeness when the cache that it relies on is up-to-date. Five of the considered systems (Avalanche, ANAPSID, ADERIS, LDQPS, and SIHJoin) support adaptive query processing. Only DAW [29] supports duplicate detection, whereas DHT and Avalanche [4] claim to support partial duplicate detection. Granatum [11,12,15] is the only system that implements privacy and provenance. None of the considered systems implement top-k query processing or query runtime estimation.
Table 3 lists the SPARQL clauses supported by each of the 14 systems. GRANATUM and QWIDVD are the only two systems that support all of the query constructs used in our survey. It is important to note that most of these query constructs are based on the query characteristics defined in BSBM.
After having given a general overview of SPARQL query federation systems, we now present the six SPARQL endpoint federation engines [1,7,20,25,33,38] with public implementations that were used within our experiments. We begin by presenting an overview of key concepts that underpin federated query processing and are used in the performance evaluation. We then use these key concepts to present the aforementioned six systems in more detail.
Federated query processing
Given a SPARQL query
Overview of the selected approaches
DARQ [25] makes use of an index known as a service description, which describes the predicates available at each data source. Source selection is performed by matching the predicates of the query triple patterns against this index; consequently, triple patterns with unbound predicates are not supported. For data integration, DARQ relies on nested loop joins.
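To make the index-only strategy more concrete, the following minimal Python sketch (our own illustration under simplifying assumptions, not DARQ's actual code; all identifiers and IRIs are hypothetical) selects candidate sources by looking up the predicate of each triple pattern in a predicate-to-source index, and falls back to all sources when the predicate is unbound, which is the behavior described below for LHD and ADERIS.

```python
# Minimal sketch of index-only source selection (illustrative only, not DARQ's code).
# predicate_index maps a predicate IRI to the set of endpoints whose service
# description lists that predicate.

def index_only_selection(triple_patterns, predicate_index, all_sources):
    """Return the sources selected per triple pattern and the total TP-source count."""
    selected = {}
    for i, (s, p, o) in enumerate(triple_patterns):
        if p is None:                                  # unbound predicate: the index
            selected[i] = set(all_sources)             # cannot help, select all sources
        else:
            selected[i] = set(predicate_index.get(p, ()))
    total_tp_sources = sum(len(srcs) for srcs in selected.values())
    return selected, total_tp_sources

# Hypothetical example: two endpoints and a two-pattern query.
index = {"http://xmlns.com/foaf/0.1/name": {"endpoint1"},
         "http://www.w3.org/2002/07/owl#sameAs": {"endpoint1", "endpoint2"}}
patterns = [("?s", "http://xmlns.com/foaf/0.1/name", "?n"),
            ("?s", "http://www.w3.org/2002/07/owl#sameAs", "?o")]
print(index_only_selection(patterns, index, ["endpoint1", "endpoint2"])[1])  # prints 3
```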
SPLENDID [7] makes use of VoiD descriptions as index along with SPARQL ASK queries to perform the source selection step. A SPARQL ASK query is used when any of the subject or object of the triple pattern is bound. This query is forwarded to all of the data sources and those sources which pass the SPARQL ASK test are selected. A dynamic programming strategy [35] is used to optimize the join order of SPARQL basic graph patterns.
Known variables that impact the behavior of SPARQL federated query engines
Notes: #ASK = Total number of SPARQL ASK requests used during source selection, #TP = total triple pattern-wise sources selected
FedX [33] is an index-free SPARQL query federation system. The source selection relies completely on SPARQL ASK queries and a cache. The cache is used to store recent SPARQL ASK operations for relevant data source selection. As shown by our evaluation, the use of this cache greatly reduces the source selection and query execution time.
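The effect of ASK probing and caching can be illustrated with a minimal Python sketch (our own simplified code, assuming the SPARQLWrapper library; it is not the implementation of FedX or SPLENDID, and the endpoint URLs and helper names are hypothetical). One ASK request is issued per (endpoint, triple pattern) pair on a cache miss; with a warm cache no requests are needed at all.

```python
# Minimal sketch of index-free, ASK-based source selection with a cache
# (illustrative of the strategy described above, not FedX's actual implementation).
from SPARQLWrapper import SPARQLWrapper, JSON

def ask(endpoint_url, triple_pattern):
    """Probe whether an endpoint can contribute results for a triple pattern.
    The pattern terms are assumed to be given in SPARQL syntax already."""
    s, p, o = triple_pattern
    client = SPARQLWrapper(endpoint_url)
    client.setQuery("ASK { %s %s %s }" % (s, p, o))
    client.setReturnFormat(JSON)
    return client.query().convert()["boolean"]

def ask_based_selection(triple_patterns, endpoints, cache):
    """Select sources per triple pattern; cache stores previous ASK results."""
    selected, ask_requests = {}, 0
    for tp in triple_patterns:
        relevant = set()
        for endpoint in endpoints:
            key = (endpoint, tp)
            if key not in cache:          # cache miss: one SPARQL ASK request is sent
                cache[key] = ask(endpoint, tp)
                ask_requests += 1
            if cache[key]:
                relevant.add(endpoint)
        selected[tp] = relevant
    return selected, ask_requests         # with a fully warm cache, ask_requests == 0
```

For a query with five triple patterns evaluated against ten endpoints, such a strategy issues 50 ASK requests on a cold cache and none on a warm one, which mirrors the behavior discussed for FedBench CD3 in Section 6.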
The publicly available implementation of LHD [38] only makes use of the VoiD descriptions to perform source selection. The source selection algorithm is similar to DARQ. However, it also supports query triple patterns with unbound predicates. In such cases, LHD simply selects all of the available data sources as relevant. This strategy often overestimates the number of capable sources and can thus lead to high overall runtimes. LHD performs a pipeline hash join to integrate sub-queries in parallel.
ANAPSID [1] is an adaptive query engine that adapts its query execution schedulers to the data availability and runtime conditions of SPARQL endpoints. This framework provides physical SPARQL operators that detect when a source becomes blocked or data traffic is bursty. The operators produce results as quickly as data arrives from the sources. ANAPSID makes use of both a catalog and ASK queries along with heuristics defined in [21] to perform the source selection step. This heuristic-based source selection can greatly reduce the total number of triple pattern-wise selected sources.
Finally, ADERIS [20] is an index-only approach for adaptive integration of data from multiple SPARQL endpoints. The source selection algorithm is similar to DARQ’s. However, this framework also selects all of the available data sources for triple patterns with unbound predicates. ADERIS does not support several SPARQL 1.0 clauses such as UNION and OPTIONAL. For the data integration, the framework implements the pipelined index nested loop join operator.
In the next section, we describe known variables that may impact the performance of the federated SPARQL query engines.
Table 4 shows known variables that may impact the behavior of
federated SPARQL query engines. According to [21],
these variables can be grouped into two categories (i.e., independent and dependent
variables) that affect the overall performance of federated query SPARQL engines. Dependent
(also called observed) variables are usually the performance metrics and are normally
influenced by independent variables. Dependent variables include: (1) total number of SPARQL
ASK requests used during source selection, (2) total number of triple pattern-wise selected sources, (3) source selection time, (4) overall query runtime, and (5) answer completeness.
Independent variables can be grouped into four dimensions: query, data, platform, and endpoint [21]. The query dimension includes: the type of query (star, path, hybrid [29]), the number of basic graph patterns, the instantiations (bound/unbound) of the tuples (subject, predicate, object) of the query triple patterns, the selectivity of the joins between triple patterns, the query result set size, and the use of different SPARQL clauses. The data dimension comprises general dataset properties such as the dataset size, its type of partition (horizontal, vertical, hybrid), and the data frequency distribution (e.g., the number of subjects, predicates, and objects). The platform dimension consists of the use of a cache, the number of processors, and the amount of RAM available. The following parameters belong to the endpoint dimension:
the number of endpoints used in the federation and their types (e.g., Fuseki, Sesame, Virtuoso etc., and single vs. clustered server),
the relationship between the number of instances, graphs and endpoints of the systems used during the evaluation, and
network latency (in case of live SPARQL endpoints) and different endpoint configuration parameters such as answer size limit, maximum resultset size etc.
In our evaluation, we measured all of the five dependent variables reported in Table 4. Most of the query parameters (an independent variable) are covered by using the complete query sets of both FedBench and SP2Bench. However, as pointed out in [21], the join selectivity cannot be fully covered due to the limitations of both FedBench and SP2Bench. Among the data parameters, the dataset size cannot be fully explored in the selected SPARQL query federation benchmarks. This is because both FedBench and SP2Bench do not contain very large datasets (the largest dataset in these benchmarks contains only 108M triples, see Table 6) such as Linked TCGA (20.4 billion triples11) or UniProt.
While the dependent variables
In this section we present the data and hardware used in our evaluation. Moreover, we explain the key metrics underlying our experiments as well as the corresponding results.
Experimental setup
We used two settings to evaluate the selected federation systems. Within the first
evaluation, we used the query execution time as central evaluation parameter and made use
of the FedBench [31] federated SPARQL querying
benchmark. In the second evaluation, we extended both FedBench and SP2Bench to
simulate a highly federated environment. Here, we focused especially on analyzing the
effect of data partitioning on the performance of federation systems. We call this
extension SlicedBench.
Specifications of the systems hosting the SPARQL endpoints
Statistics of the datasets used in our benchmarks
Only used in SlicedBench
Query characteristics
Notes: #T = Total number of Triple patterns, #Res = Total number of query results,
Only used in SlicedBench
All of the data used in both evaluations along with the portable Virtuoso SPARQL
endpoints can be downloaded from the project website.13
FedBench is commonly used to evaluate the performance of SPARQL query federation systems [7,21,29,33]. The benchmark is explicitly designed to represent SPARQL query federation on real-world datasets. The datasets can be varied along several dimensions such as size, diversity, and number of interlinks. The benchmark queries resemble typical requests on these datasets and their structure ranges from simple star [29] and chain queries to complex graph patterns. Details about the FedBench datasets used in our evaluation along with some statistical information are given in Table 6.
The queries included in FedBench are divided into three categories: Cross-domain (CD),
Life Sciences (LS), Linked Data (LD). In addition, it includes SP2Bench
queries. The distribution of the queries along with the result set sizes are given in
Table 7. Details on the datasets and various advanced
statistics are provided at the FedBench project page.14
In this evaluation setting, we selected all queries from CD, LS, and LD, thus performing (to the best of our knowledge) the first evaluation of SPARQL query federation systems on the complete benchmark data of FedBench. It is important to note that SP2Bench was designed with the main goal of evaluating query engines that access data kept in a single repository. Thus, a complete query is answered by a single dataset, whereas a federated query collects results from multiple datasets. For this reason, we did not include the SP2Bench queries in our first evaluation. We included all of these queries in SlicedBench because there the data is distributed over 10 different datasets and each SP2Bench query spans more than one dataset, thus fulfilling the criterion of a federated query.
As pointed out in [22], data partitioning can affect the overall performance of SPARQL query federation engines. To quantify this effect, we created 10 slices of each of the 10 datasets given in Table 6 and distributed this data across 10 local Virtuoso SPARQL endpoints (one slice per SPARQL endpoint). Thus, every SPARQL endpoint contained one slice from each of the 10 datasets. This creates a highly fragmented data environment where a federated query possibly has to collect data from all of the 10 SPARQL endpoints. This characteristic of the benchmark stands in contrast to FedBench, where the data is not highly fragmented. Moreover, each of the SPARQL endpoints contained a comparable amount of triples (load balancing). To facilitate the distribution of the data, we used the Slice Generator tool employed in [29]. This tool allows setting a discrepancy across the slices, where the
This tool generates slices based on horizontal partitioning of the data. Table 8 shows the discrepancy values used for slice generation for each of the 10 datasets; the discrepancy value varies with the size of the dataset. For the query runtime evaluation, we selected all of the queries from both FedBench and SP2Bench given in Table 7. The reason for this selection was to cover the majority of the SPARQL query clauses and types along with variable result set sizes (from 1 to 40 million). For each of the CD, LS, and LD queries used in SlicedBench, the number of results remained the same as given in Table 7. Analogously to FedBench, each of the SlicedBench data sources is a Virtuoso SPARQL endpoint.
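To illustrate the kind of horizontal partitioning performed here, the following minimal Python sketch (our own simplified illustration, not the actual Slice Generator tool of [29]) splits a list of triples into a given number of slices whose sizes deviate from the mean according to a discrepancy parameter; the precise definition of discrepancy in the original tool may differ.

```python
# Minimal sketch of horizontal partitioning into slices of unequal size
# (our own simplified illustration; the Slice Generator of [29] may define
# the discrepancy parameter differently).

def horizontal_slices(triples, num_slices, discrepancy):
    """Split a list of triples into num_slices contiguous slices whose sizes
    range roughly from (1 - discrepancy) to (1 + discrepancy) times the mean."""
    n = len(triples)
    weights = [1 - discrepancy + 2 * discrepancy * i / max(num_slices - 1, 1)
               for i in range(num_slices)]
    total = sum(weights)
    sizes = [round(n * w / total) for w in weights]
    sizes[-1] = n - sum(sizes[:-1])        # absorb rounding errors in the last slice
    slices, start = [], 0
    for size in sizes:
        slices.append(triples[start:start + size])
        start += size
    return slices

# Hypothetical usage: 10 slices of one dataset, one slice per SPARQL endpoint.
example = [("s%d" % i, "p", "o%d" % i) for i in range(1000)]
print([len(s) for s in horizontal_slices(example, 10, 0.5)])
```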
Evaluation criteria
We selected five metrics for our evaluation: (1) the total number of triple pattern-wise selected sources, (2) the total number of SPARQL ASK requests used during source selection, (3) the source selection time, (4) the overall query runtime, and (5) answer completeness.
Dataset slices used in SlicedBench

Total triple pattern-wise selected sources example. (TP. = triple pattern-wise selected.)
The total number of triple pattern-wise selected sources for a query is calculated as
follows: Let TP = {tp1, ..., tpn} be the set of triple patterns of a query and let R(tpi) be the set of sources selected for triple pattern tpi. The total number of triple pattern-wise selected sources is then the sum of |R(tpi)| over all triple patterns of the query, i.e., |R(tp1)| + ... + |R(tpn)|.
An overestimation of triple pattern-wise selected sources increases the source selection time and thus the query execution time. Furthermore, such an overestimation increases the number of irrelevant results which are excluded after joining the results of the different sources, therewith increasing both the network traffic and query execution time. In the next section we explain how such overestimations occur in the selected approaches.
Triple pattern-wise selected sources
Table 9 shows the total number of triple pattern-wise sources (TP sources for short) selected by each approach both for the FedBench and SlicedBench queries. ANAPSID is the most accurate system in terms of TP sources followed by both FedX and SPLENDID whereas similar results are achieved by the other three systems, i.e., LHD, DARQ, and ADERIS. Both FedX and SPLENDID select the optimal number of TP sources for individual query triple patterns. This is because both make use of ASK queries when any of the subject or object is bound in a triple pattern. However, they do not consider whether a source can actually contribute results after performing a join between results with other query triple patterns. Therefore, both can overestimate the set of capable sources that can actually contribute results. ANAPSID uses a catalog and ASK queries along with heuristics [21] about triple pattern joins to reduce the overestimation of sources. LHD (the publicly available version), DARQ, and ADERIS are index-only approaches and do not use SPARQL ASK queries when any of the subject or object is bound. Consequently, these three approaches tend to overestimate the TP sources per individual triple pattern. It is important to note that DARQ does not support queries where any of the predicates in a triple pattern is unbound (e.g., CD1, LS2) and ADERIS does not support queries which feature FILTER or UNION clauses (e.g., CD1, LS1, LS2, LS7). In case of triple patterns with unbound predicates (such as CD1, LS2) both LHD and ADERIS simply select all of the available sources as relevant. This overestimation can significantly increase the overall query execution time.
Comparison of triple pattern-wise total number of sources selected for FedBench and
SlicedBench
Notes: NS stands for “not supported”, RE for “runtime error”, SPL for SPLENDID, ANA
for ANAPSID and ADE for ADERIS. Key results are in bold
The effect of overestimation can be clearly seen by taking a fine-granular look at how the different systems process FedBench query CD3 given in Listing 1. The optimal number of TP sources for this query is 5. The query has a total of five triple patterns. To process this query, FedX sends a SPARQL ASK query to all of the 10 benchmark SPARQL endpoints for each of the triple patterns, summing up to a total of 50 (i.e., 5 triple patterns × 10 endpoints) SPARQL ASK requests.

FedBench CD3. Prefixes are ignored for simplicity.
Comparison of number of SPARQL ASK requests used for source selection both in FedBench and SlicedBench
Notes: NS stands for “not supported”, RE for “runtime error”, SPL for SPLENDID, ANA
for ANAPSID and ADE for ADERIS. Key results are in bold
In the SlicedBench results, we can clearly see that the TP values are increased for each of the FedBench queries, which means that a query spans more data sources, thus simulating a highly fragmented environment suitable for testing the federation systems for effective parallel query processing. The highest number of TP sources is reported for the second SP2Bench query, where a total of up to 92 TP sources are selected. This query contains 10 triple patterns, and index-free approaches (e.g., FedX) need 100 (i.e., 10 triple patterns × 10 endpoints) SPARQL ASK requests for its source selection.
The queries for which some systems did not retrieve complete results
Notes: The values inside brackets show the actual result sizes. “–” means that result completeness cannot be determined because the query execution timed out. Incomplete results are highlighted in bold
Table 10 shows the total number of SPARQL ASK requests used to perform source selection for each of the queries of FedBench and SlicedBench. Index-only approaches (DARQ, ADERIS, LHD) only make use of their index to perform source selection. Therefore, they do not require any ASK requests to process queries. As mentioned before, FedX only makes use of ASK requests (along with a cache) to perform source selection. The results presented in Table 10 are for FedX(cold or first run), where the FedX cache is empty. This is basically the lower bound of the performance of FedX. For FedX(100% cached), the complete source selection is performed by using cache entries only. Hence, in that case, the number of SPARQL ASK requests is zero for each query. This is the upper bound of the performance of FedX on the data at hand. The results clearly show that index-free approaches (e.g., FedX) can be very expensive in terms of the number of SPARQL ASK requests used. This can greatly affect the source selection time and the overall query execution time if no cache is used. Both for FedBench and SlicedBench, SPLENDID is the most efficient hybrid approach in terms of SPARQL ASK requests consumed during source selection.
For SlicedBench, all data sources are likely to contain the same set of distinct predicates (because each data source contains at least one slice from each data dump). Therefore, the index-free and hybrid source selection approaches are bound to consume more SPARQL ASK requests. It is important to note that ANAPSID combines more than one triple pattern into a single SPARQL ASK query. The time required to execute these more complex SPARQL ASK operations is generally higher than that of SPARQL ASK queries having a single triple pattern, as used in FedX and SPLENDID. Consequently, even though ANAPSID requires fewer SPARQL ASK requests for many of the FedBench queries, its source selection time is greater than that of all other selected approaches. This behavior will be further elaborated upon in the subsequent section. Tables 9 and 10 clearly show that using SPARQL ASK queries for source selection leads to an efficient source selection in terms of TP sources selected. However, in the next section we will see that they increase both the source selection time and the overall query runtime. A smart source selection approach should select a small number of TP sources while using a minimal number of SPARQL ASK requests.
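To illustrate what combining triple patterns into a single ASK probe means, consider the following sketch (SPARQL strings embedded in Python; the patterns and predicates are hypothetical examples, and ANAPSID's actual grouping heuristics [1,21] are more involved than shown here).

```python
# One ASK probe per triple pattern, as issued by FedX- or SPLENDID-style probing:
ask_tp1 = "ASK { ?drug <http://www.w3.org/2002/07/owl#sameAs> ?same }"
ask_tp2 = "ASK { ?drug <http://xmlns.com/foaf/0.1/name> ?name }"

# A single ASK probe covering both (joined) triple patterns: only one request is
# sent per endpoint, but the endpoint now has to evaluate a join, which typically
# makes the individual ASK operation more expensive.
ask_combined = """ASK { ?drug <http://www.w3.org/2002/07/owl#sameAs> ?same .
                        ?drug <http://xmlns.com/foaf/0.1/name>       ?name . }"""
```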
As pointed out in [21], an important criterion in the performance evaluation of federated SPARQL query engines is result set completeness. Two or more engines are only comparable to each other if they provide the same result set for a given query. A federated engine may miss results due to various reasons, including the type of source selection used, the use of an outdated cache or index, the type of network, the endpoint result size limit, or even the join implementation. In our case, the sole possible reason for missing results across all six engines is the join implementation: all of the selected engines overestimate the set of capable sources (i.e., they never generate false negatives w.r.t. the capable sources), the cache and index are always up-to-date, the endpoint result size limit is greater than the query result sizes, and we used a local dedicated network with negligible network delay. Table 11 shows the queries and federated engines for which we did not receive complete results. As an overall answer completeness evaluation, only FedX is always able to retrieve complete results. It is important to note that these results are directly connected to the answer completeness results presented in the survey (Table 2), which shows that, among the selected systems, only FedX claims to provide complete results.
Source selection time

Comparison of source selection time for FedBench and SlicedBench.

Comparison of source selection time for SP2Bench queries in SlicedBench.
Figures 2 and 3 show the source selection time for each of the selected approaches, both for FedBench and SlicedBench. In contrast to the TP results, the index-only approaches require less time than the hybrid approaches, even though they overestimate the TP sources in comparison with the hybrid approaches. This is because index-only approaches do not have to send any SPARQL ASK queries during the source selection process. Since the index is usually pre-loaded into memory before query execution, the runtime of the predicate look-up in index-only approaches is minimal. Consequently, we observe a trade-off between intelligent source selection and the time required to perform this process. To reduce the costs associated with ASK operations, FedX implements a cache to store the results of recent SPARQL ASK operations. Figure 2 shows that the source selection time of FedX with cached entries is significantly smaller than that of FedX's first run with no cached entries.
As expected, the source selection time for FedBench queries is smaller than that for SlicedBench, particularly for the hybrid approaches. This is because the number of TP sources for SlicedBench queries is increased due to the data partitioning. Consequently, the number of SPARQL ASK requests grows and increases the overall source selection time. As mentioned before, an overestimation of TP sources in highly federated environments can greatly increase the source selection time. For example, consider query LD4. SPLENDID selects the optimal number of sources (i.e., five) for FedBench and the source selection time is 218 ms. However, it overestimates the number of TP sources for SlicedBench by selecting 30 instead of 5 sources. As a result, the source selection time increases significantly to 1035 ms, which directly affects the overall query runtime. The effect of such overestimation is even worse for the SP2B-2 and SP2B-4 queries of SlicedBench.
The lesson learned from the evaluation of the first three metrics is that using ASK queries for source selection leads to smart source selection in terms of the total number of TP sources selected. On the other hand, ASK queries significantly increase the overall query runtime when no caching is used. FedX makes use of an intelligent combination of parallel ASK query processing and caching to perform the source selection process. This parallel execution of SPARQL ASK queries is more time-efficient than the ASK query processing approaches implemented in both ANAPSID and SPLENDID. Nevertheless, the source selection of FedX could be improved further by using heuristics such as ANAPSID's to reduce the overestimation of TP sources.

Comparison of query execution time for FedBench and SlicedBench.

Comparison of query execution time for SlicedBench: SP2Bench queries.
Figures 4 and 5 show the query execution time for both experimental setups. The negligibly small standard deviation error bars (shown on top of each bar) indicate that the data points tend to be very close to the mean, thus suggesting a high consistency of the query runtimes in most frameworks. As an overall query execution time evaluation, FedX(cached) outperforms all of the remaining approaches in the majority of the queries. FedX(cached) is followed by FedX(first run), which is in turn followed by LHD, SPLENDID, ANAPSID, ADERIS, and DARQ. Deciding between DARQ and ADERIS is not trivial because the latter does not produce results for most of the queries. The exact number of queries for which one system is better than another is given in the next section (Section 7.1). Furthermore, the number of queries for which one system significantly outperforms another (using the Wilcoxon signed rank test) is also given in the next section.
Interestingly, while ANAPSID ranks first (among the selected systems) in terms of the triple pattern-wise source selection results, it ranks fourth in terms of query execution performance. There are two reasons for this: (1) ANAPSID does not make use of a cache. As a result, it spends more time (8 ms for FedX vs. 1265 ms for ANAPSID on average over both setups) performing source selection, which worsens its query execution time, and (2) bushy trees (used in ANAPSID) only perform better than left-deep trees (used in FedX) when the queries are more complex and the triple pattern joins are more selective [3,14]. However, the FedBench queries (excluding SP2Bench) are not very selective and are rather simple, e.g., the number of triple patterns per query ranges from 2 to 7. In addition, the query result set sizes are small (10 queries have a result set size smaller than 16) and the average query execution time is small (about 3 seconds on average for FedX over both setups). The SP2Bench queries are more complex and their result set sizes are large. However, the selected systems were not able to execute the majority of the SP2Bench queries. It would be interesting to compare these systems on a more complex, Big Data benchmark. The use of a cache improves FedX's performance by 10.5% in terms of average query execution time for FedBench and by 4.14% for SlicedBench.
The effect of the overestimation of the TP sources on query execution can be observed
on the majority of the queries for different systems. For instance, for FedBench’s LD4
query SPLENDID selects the optimal number of TP sources (i.e., five) and the query
execution time is 318 ms of which 218 ms are used for selecting sources. For
SlicedBench, SPLENDID overestimates the TP sources by 25 (i.e., selects 30 instead of 5
sources), resulting in a query execution time of 10693 ms, of which 1035 ms are spent in the source selection process. Consequently, the pure query execution time of this query is only 100 ms for FedBench (318-218) and 9658 ms (10693-1035) for SlicedBench. This means that an overestimation of TP sources does not only increase the source selection time but also produces results which are excluded after performing the join operations between query triple patterns. This retrieval of irrelevant results increases the network traffic and disrupts the query execution plan. For example, both FedX and SPLENDID
considered 285412 irrelevant triples due to the overestimation of 8 TP sources only for
For queries such as CD4, CD6, LS3, LD11, and SP2B-11 we observe that the query execution time for DARQ is more than 2 minutes. In some cases, it even reaches the 30 minute timeout used in our experiments. The reason for this behavior is that the simple nested loop join it implements floods SPARQL endpoints by submitting too many requests. FedX overcomes this problem by using a block nested loop join, where the number of endpoint requests depends on the block size. Furthermore, we can see that many systems do not produce results for the SP2Bench queries. A possible reason for this is the fact that SP2Bench queries contain up to 10 triple patterns combined with different SPARQL clauses such as DISTINCT, ORDER BY, and complex FILTERs.
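The difference between the two join strategies can be made concrete with a small request-count sketch (our own illustration; it abstracts from how the bound bindings are actually encoded, e.g., as a UNION of bound patterns or a VALUES clause, and the numbers used are hypothetical).

```python
# Minimal sketch contrasting the endpoint request counts of a simple nested loop
# join and a block-based variant (illustrative only, not the engines' actual code).
import math

def nested_loop_requests(num_bindings):
    """Simple nested loop join: one endpoint request per intermediate binding."""
    return num_bindings

def block_nested_loop_requests(num_bindings, block_size):
    """Block-based join: bindings are grouped into blocks, one request per block."""
    return math.ceil(num_bindings / block_size)

# Hypothetical example: 1000 intermediate bindings, block size 20.
print(nested_loop_requests(1000))            # 1000 requests flood the endpoint
print(block_nested_loop_requests(1000, 20))  # 50 requests
```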

Overall performance evaluation (ms).
The comparison of the overall performance of each approach is summarized in Fig. 6, where we show the average query execution time for the queries in the CD, LS, LD, and SP2Bench sub-groups. As an overall performance evaluation based on FedBench, FedX(cached) outperformed FedX(first run) on all of the 25 queries. FedX(first run) in turn outperformed LHD on 17 out of 22 commonly supported queries (LHD retrieved zero results for three queries). LHD is better than SPLENDID in 13 out of 22 comparable queries. SPLENDID outperformed ANAPSID in 15 out of 24 queries, while ANAPSID outperformed DARQ in 16 out of 22 commonly supported queries. For SlicedBench, FedX(cached) outperformed FedX(first run) in 29 out of 36 comparable queries. In turn, FedX(first run) outperformed LHD in 17 out of 24 queries. LHD is better than SPLENDID in 17 out of 24 comparable queries. SPLENDID outperformed ANAPSID in 17 out of 26 queries, which in turn outperformed DARQ in 12 out of 20 commonly supported queries. ADERIS retrieved no results for the majority of the queries and is therefore not included in this comparison. All of the above improvements are significant based on a Wilcoxon signed rank test with the significance level set to 0.05.
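For reference, the kind of paired significance test used above can be sketched in a few lines of Python (assuming the SciPy library; the runtime vectors below are hypothetical placeholders, not our measured values).

```python
# Minimal sketch of a Wilcoxon signed rank test over paired per-query runtimes
# (assuming SciPy; the values are hypothetical, not our measurements).
from scipy.stats import wilcoxon

# Runtimes (ms) of two engines over the same ten benchmark queries.
runtimes_engine_a = [120, 340, 95, 410, 280, 150, 605, 90, 230, 310]
runtimes_engine_b = [180, 360, 140, 700, 300, 190, 640, 120, 260, 450]

statistic, p_value = wilcoxon(runtimes_engine_a, runtimes_engine_b)
print("p-value:", p_value)
if p_value < 0.05:  # significance level used in our evaluation
    print("The runtime difference between the two engines is significant.")
```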
Discussion
The subsequent discussion of our findings can be divided into two main categories.
Effect of the source selection time

Comparison of pure query execution time (without source selection time) for FedBench and SlicedBench.
To the best of our knowledge, the effect of the source selection runtime has not been considered in SPARQL query federation system evaluations [1,7,21,26,33] so far. However, after analyzing all of the results presented above, we noticed that this metric greatly affects the overall query execution time. To show this effect, we compared the pure query execution time (excluding the source selection time). To calculate the pure query execution time, we simply subtracted the source selection time from the overall query execution time and plot the resulting times in Fig. 7.
We can see that the overall query execution time (including source selection, given in Fig. 4) of SPLENDID is better than that of FedX(cached) in only one out of the 25 FedBench queries. However, Fig. 7 suggests that SPLENDID is better in 8 out of the 25 queries in terms of pure query execution time. This means that SPLENDID is slower than FedX(cached) in 33% of the queries only due to the source selection process. Furthermore, our results also suggest that the use of SPARQL ASK queries for source selection is expensive without caching. On average, SPLENDID's source selection time is 235 ms for FedBench and 591 ms in the case of SlicedBench. On the other hand, FedX(cached)'s source selection time is 8 ms for both FedBench and SlicedBench. ANAPSID's average source selection time is 507 ms for FedBench and 2014 ms for SlicedBench, which is one of the reasons for ANAPSID's poor performance compared to FedX(cached).

Effect of the data partitioning.
In our SlicedBench experiments, we extended FedBench to test the federation systems' behavior in a highly federated data environment. This extension can also be utilized to test the capability of parallel query execution in SPARQL endpoint federation systems. To show the effect of data partitioning, we calculated the average query execution time of the LD, CD, and LS queries for both benchmarks and compared the effect on each of the selected approaches. The performance of DARQ improved with partitioning, while the performance of FedX(first run), FedX(cached), SPLENDID, ANAPSID, and LHD was reduced. As an overall evaluation result, FedX(first run)'s performance is reduced by 214%, FedX(cached)'s by 199%, SPLENDID's by 227%, LHD's by 293%, and ANAPSID's by 382%, while, interestingly, DARQ's is improved by 36%. These results suggest that FedX is the best system in terms of parallel execution of queries, followed by SPLENDID, LHD, and ANAPSID. The performance improvement for DARQ occurs because the flooding of a particular endpoint with too many nested loop requests is now reduced, due to the different distribution of the relevant results among many SPARQL endpoints. One of the reasons for the performance reduction of LHD is its significant overestimation of TP sources in SlicedBench. The reduction of both SPLENDID's and ANAPSID's performance is due to the increase in ASK operations in SlicedBench and the increase in the number of triple pattern-wise selected sources, which greatly affects the overall performance of the systems when no cache is used.
In this paper, we evaluated six SPARQL endpoint federation systems based on extended performance metrics and an extended evaluation framework. We kept the main experimental metric (i.e., query execution time) unchanged and showed that three other metrics (i.e., the total number of triple pattern-wise selected sources, the total number of SPARQL ASK requests used during source selection, and the source selection time), which have not received much attention so far, can significantly affect the main metric. We also measured the effect of data partitioning on these systems to test the parallel processing capabilities of each federation system. Overall, our results suggest that a combination of caching and ASK queries with accurate heuristics for source selection (as implemented in ANAPSID) has the potential to lead to a significant improvement of the overall runtime of federated SPARQL query processing systems.
In future work, we aim to get access to and evaluate the systems from our survey which do not provide a public implementation as well as those which were published recently, e.g., HiBISCuS [28], which emphasizes efficient source selection for SPARQL endpoint federation, and SAFE [17], which performs policy-based source selection. We will also measure the effect of a range of further features (e.g., RAM size, SPARQL endpoint capability restrictions, vertical and hybrid partitioning, and duplicate detection) on the overall runtime of federated SPARQL engines. Result set correctness is another important metric to be considered in the future; a tool which automatically measures precision, recall, and F1 scores is on our roadmap. Furthermore, we will assess these systems on a Big Data SPARQL query federation benchmark.
Acknowledgements
This work was partially supported by the EU FP7 projects GeoKnow (GA: 318159) and BioASQ (GA: 318652) as well as the DFG project LinkingLOD and the BMWi project SAKE.
