Abstract

There is an emerging demand for efficiently archiving and (temporally) querying different versions of evolving Semantic Web data. As novel archiving systems are starting to address this challenge, foundations/standards for benchmarking RDF archives are needed to evaluate their storage space efficiency and the performance of different retrieval operations. To this end, we provide theoretical foundations on the design of data and queries to evaluate emerging RDF archiving systems. Then, we instantiate these foundations along a concrete set of queries on the basis of real-world evolving datasets. Finally, we perform an extensive empirical evaluation of current archiving techniques and querying strategies, which is meant to serve as a baseline for future developments on querying archives of evolving RDF data.

Keywords

RDF archiving, Semantic Data Versioning, Evolving Web data, SPARQL, Benchmark

1. Introduction

Nowadays, RDF data is ubiquitous. In less than a decade, and thanks to active projects such as the Linked Open Data (LOD) [5] effort or schema.org, researchers and practitioners have built a continuously growing interconnected Web of Data. In parallel, a novel generation of semantically enhanced applications leverages this infrastructure to build services which can answer questions not possible before (thanks to the availability of SPARQL [21], which enables structured queries over this data). As previously reported [23,42], this published data is continuously undergoing changes (on a data and schema level). These changes naturally happen without centralized monitoring or a pre-defined policy, following the scale-free nature of the Web. Applications and businesses leveraging the availability of certain data over time, and seeking to track data changes or conduct studies on the evolution of data, thus need to build their own infrastructures to preserve and query data over time. Moreover, at the schema level, evolving vocabularies complicate re-use as inconsistencies may be introduced between data relying on a previous version of the ontology.

Thus, archiving policies for Linked Open Data (LOD) collections emerge as a novel – and open – challenge aimed at assuring quality and traceability of Semantic Web data over time. While sharing the same overall objectives with traditional Web archives, such as the Internet Archive (http://archive.org/), archives for the Web of Data should additionally offer capabilities for time-traversing structured queries. Recently, initial works on RDF archiving policies/strategies [13,15] have started to offer such time-based capabilities, such as knowing whether a dataset or a particular entity has changed, which is neither natively supported by SPARQL nor by any of the existing temporal extensions of SPARQL [16,35,40,48].

This paper discusses the emerging problem of evaluating the efficiency of the required retrieval demands in RDF archives. To the best of our knowledge, only a few, very preliminary works have been proposed to systematically benchmark RDF archives. EvoGen [27] is a recent suite that extends the traditional LUBM benchmark [19] to provide a dataset generator for versioned RDF data. However, the system is limited to a unique dataset and very constrained synthetic data. The recent HOBBIT (http://project-hobbit.eu/) H2020 EU project on benchmarking Big Linked Data is starting to face similar challenges [34]. Existing RDF versioning and archiving solutions focus so far on providing feasible proposals for partial coverage of possible use case demands. Somewhat related, but not covering the specifics of (temporal) querying over archives, existing RDF/SPARQL benchmarks focus on static [1,6,38], federated [30] or streaming data [9] in centralized or distributed repositories: they do not cover the particularities of RDF archiving, where querying entity changes across time is a crucial aspect.

In order to fill this gap, our main contributions are:

we analyse current RDF archiving proposals and provide theoretical foundations on the design of benchmark data and specific queries for RDF archiving systems;

we provide a concrete instantiation of such queries using AnQL [48], a query language for annotated RDF data;

we present a prototypical BEnchmark of RDF ARchives (referred to as BEAR), a test suite composed of three real-world datasets from the Dynamic Linked Data Observatory [23] (referred to as BEAR-A), DBpedia Live [22] (BEAR-B) and the European Open Data portal (http://data.europa.eu/) (BEAR-C). We describe queries with varying complexity, covering a broad range of archiving use cases;

we implement RDF archiving systems on different RDF stores and archiving strategies, and we evaluate them, together with other existing archiving systems in the literature, using BEAR. This evaluation is aimed at establishing an (extensible) baseline and illustrating our foundations.

The paper is organized as follows. First, Section 2 reviews current RDF archiving proposals. We establish the theoretical foundations in Section 3, formalizing the key features to characterize data and queries to evaluate RDF archives. Section 4 instantiates these guidelines and presents the proposed BEAR test suite. In Section 5, we detail the implemented RDF archives and we evaluate BEAR with different archiving systems. Finally, we conclude and point out future work in Section 6. Appendixes A, B and C provide further details on the BEAR-A, BEAR-B and BEAR-C test suites, respectively.

Fig. 1.
Example of RDF graph versions.

Table 1
Classification and examples of retrieval needs

| Focus ∖ Type | Materialisation | Structured queries (single time) | Structured queries (cross time) |
|---|---|---|---|
| Version | Version materialisation, e.g. get snapshot at time $t_{i}$ | Single-version structured queries, e.g. lectures given by certain teacher at time $t_{i}$ | Cross-version structured queries, e.g. subjects who have played the role of student and teacher of the same course |
| Delta | Delta materialisation, e.g. get delta at time $t_{i}$ | Single-delta structured queries, e.g. students leaving a course between two consecutive snapshots, i.e. between $t_{i - 1}$ , $t_{i}$ | Cross-delta structured queries, e.g. largest variation of students in the history of the archive |

2. Preliminaries

We briefly summarise the necessary findings of our previous survey on current archiving techniques for dynamic Linked Open Data [13]. The use case is depicted in Fig. 1, showing an evolving RDF graph with three versions $V_{1}$ , $V_{2}$ and $V_{3}$ : the initial version $V_{1}$ models two students ex:S1 and ex:S2 of a course ex:C1, whose professor is ex:P1. In $V_{2}$ , the student ex:S2 disappears in favour of a new student, ex:S3. Finally, in $V_{3}$ , the former professor ex:P1 leaves the course to a new professor ex:P2, and the former student ex:S2 reappears also as a professor.

2.1. Retrieval functionality

Given the relative novelty of archiving and querying evolving semantic Web data, retrieval needs are neither fully described nor broadly implemented in practical implementations (described below). Table 1 shows a first classification [13,39] that distinguishes six different types of retrieval needs, mainly regarding the query type (materialisation or structured queries) and the main focus (version/delta) of the query.

Version materialisation is a basic demand in which a full version is retrieved. In fact, this is the most common feature provided by revision control systems and other large scale archives, such as current Web archiving that mostly dereferences URLs across a given time point (see the Internet Archive effort, http://archive.org/web/).

Single-version structured queries are queries which are performed on a specific version. One could expect to exploit current state-of-the-art query resolution in RDF management systems, with the additional difficulty of maintaining and switching between all versions.

Cross-version structured queries, also called time-traversal queries, must be satisfied across different versions, hence they introduce novel complexities for query optimization.

Delta materialisation retrieves the differences (deltas) between two or more given versions. This functionality is largely related to RDF authoring and other operations from revision control systems (merge, conflict resolution, etc.).

Single-delta structured queries and cross-delta structured queries are the counterparts of the aforementioned version-focused queries, but they must be satisfied on change instances of the dataset.
2.2. Archiving policies and retrieval process

Main efforts addressing the challenge of RDF archiving fall in one of the following three storage strategies [13]: independent copies (IC), change-based (CB) and timestamp-based (TB) approaches.

Independent Copies (IC) [25,33] is a basic policy that manages each version as a different, isolated dataset. It is, however, expected that IC faces scalability problems as static information is duplicated across the versions. Besides simple retrieval operations such as version materialisation, other operations require non-negligible processing efforts. A potential retrieval mediator should be placed on top of the versions, with the challenging tasks of (i) computing deltas at query time to satisfy delta-focused queries, (ii) loading/accessing the appropriate version/s and solving the structured queries, and (iii) performing both previous tasks for the case of structured queries dealing with deltas.

Change-based approach (CB) [11,46,47] partially addresses the previous scalability issue by computing and storing the differences (deltas) between versions. For the sake of simplicity, in this paper we focus on low-level deltas (added or deleted triples).

A query mediator for this policy manages a materialised version and the subsequent deltas. Thus, CB requires additional computational costs for delta propagation, which affects version-focused retrieval operations. Although an alternative policy could always keep a materialisation of the current version and store reverse deltas with respect to the latter [39], such deltas still need to be propagated to access previous versions.
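The delta propagation step can be illustrated with a minimal Python sketch. It assumes forward low-level deltas stored as (added, deleted) pairs on top of a materialised first version; the function name `materialise` and the predicates `ex:study` and `ex:teach` are hypothetical, chosen only to mirror the running example.

```python
from typing import FrozenSet, List, Tuple

Triple = Tuple[str, str, str]
# A low-level delta: (added triples, deleted triples).
Delta = Tuple[FrozenSet[Triple], FrozenSet[Triple]]


def materialise(base: FrozenSet[Triple], deltas: List[Delta], target: int) -> FrozenSet[Triple]:
    """Rebuild version `target` (1-based) from version 1 plus forward deltas.

    deltas[k] holds the changes leading from version k + 1 to version k + 2.
    """
    if not 1 <= target <= len(deltas) + 1:
        raise ValueError(f"unknown version {target}")
    current = set(base)
    # Delta propagation: the work grows with the distance to the base version.
    for added, deleted in deltas[: target - 1]:
        current -= deleted
        current |= added
    return frozenset(current)


# Hypothetical predicates; entities follow the running example.
S1 = ("ex:S1", "ex:study", "ex:C1")
S2 = ("ex:S2", "ex:study", "ex:C1")
S3 = ("ex:S3", "ex:study", "ex:C1")
P1 = ("ex:P1", "ex:teach", "ex:C1")
P2 = ("ex:P2", "ex:teach", "ex:C1")
S2_TEACHES = ("ex:S2", "ex:teach", "ex:C1")

V1 = frozenset({S1, S2, P1})
DELTAS: List[Delta] = [
    (frozenset({S3}), frozenset({S2})),              # V1 -> V2
    (frozenset({P2, S2_TEACHES}), frozenset({P1})),  # V2 -> V3
]

V3 = materialise(V1, DELTAS, 3)
```

With reverse deltas the same loop would run backwards from the latest version, so the cost shifts to older versions rather than disappearing.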

Timestamp-based approach (TB) [8,20,48] can be seen as a particular case of time modelling in RDF, where each triple is annotated with its temporal validity. Likewise, in RDF archiving, each triple locally holds the timestamp of the version. To avoid repetitions, compression techniques can be used to minimize the space overhead, e.g. using self-indexes, such as in v-RDFCSA [8], or delta compression in B+Trees [15].
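As a rough illustration of the TB idea, and not of v-RDFCSA or any other cited system, the following Python sketch keeps one entry per distinct triple and collapses its version labels into maximal runs of consecutive versions. All names (`build_index`, `intervals`, `holds_in`) and the predicate `ex:study` are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]
AnnotatedTriple = Tuple[str, str, str, int]  # (s, p, o) : [i]


def build_index(annotated: Iterable[AnnotatedTriple]) -> Dict[Triple, Set[int]]:
    """Group version labels by triple, so each triple is stored once."""
    index: Dict[Triple, Set[int]] = defaultdict(set)
    for s, p, o, i in annotated:
        index[(s, p, o)].add(i)
    return dict(index)


def intervals(versions: Set[int]) -> List[Tuple[int, int]]:
    """Collapse version labels into maximal runs of consecutive labels."""
    runs: List[Tuple[int, int]] = []
    for v in sorted(versions):
        if runs and v == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], v)  # extend the current run
        else:
            runs.append((v, v))          # start a new run
    return runs


def holds_in(runs: List[Tuple[int, int]], version: int) -> bool:
    """Check whether a triple annotated with `runs` is valid in `version`."""
    return any(lo <= version <= hi for lo, hi in runs)


# Hypothetical predicate; ex:S2 studies ex:C1 in version 1 only,
# ex:S1 in versions 1 to 3.
DATA = [
    ("ex:S1", "ex:study", "ex:C1", 1),
    ("ex:S1", "ex:study", "ex:C1", 2),
    ("ex:S1", "ex:study", "ex:C1", 3),
    ("ex:S2", "ex:study", "ex:C1", 1),
]
INDEX = build_index(DATA)
S1_RUNS = intervals(INDEX[("ex:S1", "ex:study", "ex:C1")])
```

Storing runs instead of individual labels is what keeps the annotation small when most triples are stable across versions.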

Hybrid-based approaches (HB) [15,32,39] combine previous policies to inspect other space/performance tradeoffs. On the one hand, Dong-Hyuk et al. [11] and the TailR [29] archiving system adopt a hybrid IC/CB approach (referred to as ${HB}^{IC / CB}$ hereinafter), which can be complemented with a theoretical cost model [39] to decide when a fresh materialised version (IC) should be computed. These costs highly depend on the difficulties of constructing and reconstructing versions and deltas, which may depend on multiple and variable factors. On the other hand, R43ples [17] and other practical approaches [15,32,44] follow a TB/CB approach (referred to as ${HB}^{TB / CB}$ hereinafter) in which triples can be time-annotated only when they are added or deleted (if present). In these practical approaches, versions/deltas are often managed under named/virtual graphs, so that the retrieval mediator can rely on existing solutions providing named/virtual graphs. Except for delta materialisation, all retrieval demands can be satisfied with some extra effort, given that (i) version materialisation requires rebuilding the delta similarly to CB, and (ii) structured queries may need to skip irrelevant triples [32].

Finally, [41] builds a partial order index keeping a hierarchical track of changes. This proposal, though, is a limited variation of delta computation and it is only tested with datasets having some thousand triples.

3. Evaluation of RDF archives: Challenges and guidelines

Previous considerations on RDF archiving policies and retrieval functionality set the basis of future directions on evaluating the efficiency of RDF archives. The design of a benchmark for RDF archives should meet three requirements:

The benchmark should be archiving-policy agnostic both in the dataset design/generation and the selection of queries to do a fair comparison of different archiving policies.

Early benchmarks should mainly focus on simpler queries against an increasing number of snapshots and introduce complex querying once the policies and systems are better understood.

While new retrieval features must be incorporated to benchmark archives, one should consider lessons learnt in previous recommendations on benchmarking RDF data management systems [1].

Although many benchmarks are defined for RDF stores [1,6] (see the Linked Data Benchmark Council project [7] for a general overview) and related areas such as relational databases (e.g. the well-known TPC (http://www.tpc.org/) and recent TPC-H and TPC-C extensions to add temporal aspects to queries [24]) and graph databases [10], to the best of our knowledge, none of them is designed to address these particular considerations in RDF archiving. The preliminary EvoGen [27] data generator is one of the first attempts in this regard, based on extending the Lehigh University Benchmark (LUBM) [19] with evolution patterns. However, the work is focused on the creation of such synthetic evolving RDF data, and the functionality is restricted to the LUBM scenario. Nonetheless, most of the well-established benchmarks share important and general principles. We briefly recall here the four most important criteria when designing a domain-specific benchmark [18], which are also considered in our approach: relevancy (to measure the performance when performing typical operations of the problem domain, i.e. archiving retrieval features), portability (easy to implement on different systems and architectures, i.e. RDF archiving policies), scalability (apply to small and large computer configurations, which should be extended in our case also to data size and number of versions), and simplicity (to evaluate a set of easy-to-understand and extensible retrieval features).

We next formalize the most important features to characterize data and queries to evaluate RDF archives. These will be instantiated in the next section to provide a concrete experimental testbed.
3.1. Dataset configuration

We first provide semantics for RDF archives and adapt the notion of temporal RDF graphs by Gutierrez et al. [20]. In this paper, we make a syntactic-sugar modification to put the focus on version labels instead of temporal labels. Note that time labels are a more general concept that could lead to time-specific operators (intersect, overlaps, etc.), which is complementary – and not mandatory – to RDF archives. Let $N$ be a finite set of version labels in which a total order is defined.

Definition 1 (RDF Archive).

A version-annotated triple is an RDF triple $(s, p, o)$ with a label $i \in N$ representing the version in which this triple holds, denoted by the notation $(s, p, o) : [i]$ . An RDF archive graph $A$ is a set of version-annotated triples.

Definition 2 (RDF Version).

An RDF version of an RDF archive $A$ at snapshot i is the RDF graph $A (i) = \{(s, p, o) \mid (s, p, o) : [i] \in A\}$ . We use the notation $V_{i}$ to refer to the RDF version $A (i)$ .
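As a minimal sketch of Definitions 1 and 2, an archive can be modelled in Python as a set of version-annotated triples from which a version is obtained by filtering on the label. The predicate names `ex:study` and `ex:teach` are assumptions (the text does not name the predicates of Fig. 1), as is the reading that ex:S2 teaches ex:C1 in the third version.

```python
from typing import FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]
AnnotatedTriple = Tuple[str, str, str, int]  # (s, p, o) : [i]

# Running example with three versions; predicate names are hypothetical.
ARCHIVE: Set[AnnotatedTriple] = {
    ("ex:S1", "ex:study", "ex:C1", 1),
    ("ex:S2", "ex:study", "ex:C1", 1),
    ("ex:P1", "ex:teach", "ex:C1", 1),
    ("ex:S1", "ex:study", "ex:C1", 2),
    ("ex:S3", "ex:study", "ex:C1", 2),
    ("ex:P1", "ex:teach", "ex:C1", 2),
    ("ex:S1", "ex:study", "ex:C1", 3),
    ("ex:S3", "ex:study", "ex:C1", 3),
    ("ex:P2", "ex:teach", "ex:C1", 3),
    ("ex:S2", "ex:teach", "ex:C1", 3),
}


def version(archive: Set[AnnotatedTriple], i: int) -> FrozenSet[Triple]:
    """A(i): the plain RDF graph of all triples annotated with label i."""
    return frozenset((s, p, o) for (s, p, o, label) in archive if label == i)


V1, V2, V3 = (version(ARCHIVE, i) for i in (1, 2, 3))
```

This flat representation is policy-agnostic: IC, CB and TB can all be seen as different physical encodings of the same set of annotated triples.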

As basis for comparing different archiving policies, we introduce four main features to describe the dataset configuration, namely data dynamicity, data static core, total version-oblivious triples and RDF vocabulary.

Data dynamicity

This feature measures the number of changes between versions, considering these differences at the level of triples (low-level deltas [47]). Thus, it is mainly described by the change ratio and the data growth between versions. We note that there are various definitions of change and growth metrics conceivable, and we consider our framework extensible in this respect with other, additional metrics. At the moment, we consider the following definitions of change ratio, insertion ratio, deletion ratio and data growth:

Definition 3 (Change ratio).

Given two versions $V_{i}$ and $V_{j}$ , with $i < j$ , let $Δ_{i, j}^{+}$ and $Δ_{i, j}^{-}$ be two sets respectively denoting the triples added and deleted between these versions, i.e. $Δ_{i, j}^{+} = V_{j} ∖ V_{i}$ and $Δ_{i, j}^{-} = V_{i} ∖ V_{j}$ . The change ratio between two versions, denoted by $δ_{i, j}$ , is defined by $\begin{matrix} δ_{i, j} = \frac{| Δ_{i, j}^{+} \cup Δ_{i, j}^{-} |}{| V_{i} \cup V_{j} |} . \end{matrix}$

That is, the change ratio between two versions should express the ratio of all triples in $V_{i} \cup V_{j}$ that have changed, i.e., that have been either inserted or deleted. In contrast, the insertion and deletion ratios provide further details on the proportion of inserted and deleted triples with respect to the original version:

Definition 4 (Insertion ratio, deletion ratio).

The insertion ratio $δ_{i, j}^{+} = \frac{| Δ_{i, j}^{+} |}{| V_{i} |}$ and the deletion ratio $δ_{i, j}^{-} = \frac{| Δ_{i, j}^{-} |}{| V_{i} |}$ denote the ratio of “new” and “removed” triples, respectively, with respect to the original version.

Finally, the data growth rate compares the number of triples between two versions:

Definition 5 (Data growth).

Given two versions $V_{i}$ and $V_{j}$ , having $| V_{i} |$ and $| V_{j} |$ different triples respectively, the data growth of $V_{j}$ with respect to $V_{i}$ , denoted by $growth (V_{i}, V_{j})$ , is defined by $\begin{matrix} growth (V_{i}, V_{j}) = \frac{| V_{j} |}{| V_{i} |} . \end{matrix}$
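Definitions 3 to 5 translate directly into set operations. The following Python sketch computes them for the first two versions of the running example; the predicate names are hypothetical, and returning 0 for two empty versions is a convention assumed here, since the definitions leave that case open.

```python
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def change_ratio(vi: Graph, vj: Graph) -> float:
    """delta_{i,j}: changed triples over all triples seen in either version."""
    added, deleted = vj - vi, vi - vj
    union = vi | vj
    return len(added | deleted) / len(union) if union else 0.0


def insertion_ratio(vi: Graph, vj: Graph) -> float:
    """delta^+_{i,j}: added triples relative to |V_i| (undefined for empty V_i)."""
    return len(vj - vi) / len(vi)


def deletion_ratio(vi: Graph, vj: Graph) -> float:
    """delta^-_{i,j}: deleted triples relative to |V_i| (undefined for empty V_i)."""
    return len(vi - vj) / len(vi)


def growth(vi: Graph, vj: Graph) -> float:
    """growth(V_i, V_j) = |V_j| / |V_i| (undefined for empty V_i)."""
    return len(vj) / len(vi)


# Hypothetical predicates; entities follow the running example.
V1: Graph = frozenset({
    ("ex:S1", "ex:study", "ex:C1"),
    ("ex:S2", "ex:study", "ex:C1"),
    ("ex:P1", "ex:teach", "ex:C1"),
})
V2: Graph = frozenset({
    ("ex:S1", "ex:study", "ex:C1"),
    ("ex:S3", "ex:study", "ex:C1"),
    ("ex:P1", "ex:teach", "ex:C1"),
})

# One triple added and one deleted out of four distinct triples overall.
RATIO = change_ratio(V1, V2)
```

Here the growth is exactly 1 although half of the distinct triples changed, which is why growth alone is not a sufficient description of dynamicity.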

In archiving evaluations, one should provide details on three related aspects, $δ_{i, j}$ , $δ_{i, j}^{+}$ and $δ_{i, j}^{-}$ , as well as the complementary version data growth, for all pairs of consecutive versions. Additionally, one important aspect of measurement could be the rate of changed triples accumulated overall across non-consecutive versions. That is, as opposed to the (absolute) metrics defined so far, which compare between the original and the final version only, here we want to also be able to take all intermediate changes into account. To this end, we can also define an accumulated change rate $δ_{i, j}^{*}$ between two (not necessarily consecutive) versions as follows:

Definition 6.
The accumulated change ratio $δ_{i, j}^{*}$ between two versions $V_{i}$ , $V_{j}$ with $j = i + h$ , with $h > 0$ , is defined as $\begin{matrix} δ_{i, j}^{*} = \frac{\sum_{k = i}^{j - 1} δ_{k, k + 1}}{h} . \end{matrix}$

The rationale here is that $δ_{i, j}^{*}$ should be 1 iff all triples changed in each version (even if eventually the changes are reverted and $V_{i} = V_{j}$ ), 0 if $V_{i} = V_{k}$ for each $i ⩽ k ⩽ j$ , and non-0 otherwise, i.e. measuring the accumulation of changes over time.
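The difference between the absolute and the accumulated metric can be checked on a small, hypothetical history in which every triple changes and the change is then reverted. The Python sketch below averages the change ratio over the h consecutive steps between the two versions, which is the reading consistent with dividing by h; list positions are 0-based, and the data is invented for illustration.

```python
from typing import FrozenSet, Sequence, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def change_ratio(vi: Graph, vj: Graph) -> float:
    union = vi | vj
    # Symmetric difference = added triples plus deleted triples.
    return len(vi ^ vj) / len(union) if union else 0.0


def accumulated_change_ratio(versions: Sequence[Graph], i: int, j: int) -> float:
    """Average change ratio over the h = j - i consecutive steps from
    versions[i] to versions[j] (0-based list positions)."""
    h = j - i
    if h <= 0 or i < 0 or j >= len(versions):
        raise ValueError("need 0 <= i < j < len(versions)")
    steps = (change_ratio(versions[k], versions[k + 1]) for k in range(i, j))
    return sum(steps) / h


A: Graph = frozenset({("ex:a", "ex:p", "ex:b")})
B: Graph = frozenset({("ex:c", "ex:p", "ex:d")})
HISTORY = [A, B, A]  # every triple changes at each step, then is reverted

DIRECT = change_ratio(HISTORY[0], HISTORY[2])             # first vs last only
ACCUMULATED = accumulated_change_ratio(HISTORY, 0, 2)     # all intermediate steps
```

The direct comparison sees no change at all, while the accumulated ratio reaches its maximum, matching the rationale given for reverted changes.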

Note that most archiving policies are affected by the frequency and also the type of changes, that is both absolute change metrics and accumulated change rates play a role. For instance, IC policy duplicates the static information between two consecutive versions $V_{i}$ and $V_{j}$ , whereas the size of $V_{j}$ increases with the added information ( $δ_{i, j}^{+}$ ) and decreases with the number of deletions ( $δ_{i, j}^{-}$ ), given that the latter are not represented. In contrast, CB and TB approaches store all changes, hence they are affected by the general dynamicity ( $δ_{i, j}$ ).

Data static core

This feature measures the triples that are present in all versions:

Definition 7 (Static core).

For an RDF archive $A$ , the static core is $C_{A} = \{(s, p, o) \mid \forall i \in N, (s, p, o) : [i] \in A\}$ .

This feature is particularly important for those archiving policies that, whether implicitly or explicitly, represent such static core. In a change-based approach, the static core is not represented explicitly, but it inherently comprises the triples that are not duplicated in the versions, which is an advantage over other policies such as IC. It is worth mentioning that the static core can be easily computed by taking the first version and applying all the subsequent deletions.
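Both ways of obtaining the static core, the definition itself and the shortcut via deletions, can be compared in a short Python sketch. The function names and the predicates of the running example are assumptions made for illustration.

```python
from typing import FrozenSet, Sequence, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]


def static_core_by_intersection(versions: Sequence[Graph]) -> Graph:
    """Definition 7 read literally: triples present in every version."""
    if not versions:
        raise ValueError("an archive needs at least one version")
    return frozenset.intersection(*versions)


def static_core_by_deletions(versions: Sequence[Graph]) -> Graph:
    """Start from the first version and apply every subsequent deletion.
    Later insertions are ignored: such triples are missing from version 1."""
    if not versions:
        raise ValueError("an archive needs at least one version")
    core = set(versions[0])
    for previous, current in zip(versions, versions[1:]):
        core -= previous - current  # low-level deletions of this step
    return frozenset(core)


# Hypothetical predicates; entities follow the running example.
S1 = ("ex:S1", "ex:study", "ex:C1")
S2 = ("ex:S2", "ex:study", "ex:C1")
S3 = ("ex:S3", "ex:study", "ex:C1")
P1 = ("ex:P1", "ex:teach", "ex:C1")
P2 = ("ex:P2", "ex:teach", "ex:C1")
S2_TEACHES = ("ex:S2", "ex:teach", "ex:C1")

VERSIONS = [
    frozenset({S1, S2, P1}),
    frozenset({S1, S3, P1}),
    frozenset({S1, S3, P2, S2_TEACHES}),
]
CORE = static_core_by_intersection(VERSIONS)
```

The second variant only touches the deltas, so a CB archive can report its static core without materialising any version beyond the first.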

Total version-oblivious triples

This computes the total number of different triples in an RDF archive independently of the timestamp. Formally speaking:

Definition 8 (Version-oblivious triples).

For an RDF archive $A$ , the version-oblivious triples are $O_{A} = \{(s, p, o) \mid \exists i \in N, (s, p, o) : [i] \in A\}$ .

This feature serves two main purposes. First, it points to the diverse set of triples managed by the archive. Note that an archive could be composed of few triples that are frequently added or deleted. This could be the case of data denoting the presence or absence of certain information, e.g. a particular case of RDF streaming. Second, the total version-oblivious triples are in fact the set of triples annotated by temporal RDF [20] and other representations based on annotation (e.g. AnQL [48]), where different annotations for the same triple are merged in an annotation set (often resulting in an interval or a set of intervals).
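The gap between the number of annotated triples and the number of version-oblivious triples can be seen in a small Python sketch. The sensor data below is entirely hypothetical and only mimics a triple whose presence is switched on and off across five versions.

```python
from typing import FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]
AnnotatedTriple = Tuple[str, str, str, int]  # (s, p, o) : [i]


def version_oblivious(archive: Set[AnnotatedTriple]) -> FrozenSet[Triple]:
    """O_A: every triple that holds in at least one version, labels dropped."""
    return frozenset((s, p, o) for (s, p, o, _label) in archive)


# Hypothetical data: one stable triple and one "presence" triple that
# holds only in versions 1, 3 and 5.
LOCATION = ("ex:sensor1", "ex:locatedIn", "ex:room1")
FLAG = ("ex:sensor1", "ex:status", "ex:active")

ARCHIVE: Set[AnnotatedTriple] = (
    {(*LOCATION, i) for i in range(1, 6)} | {(*FLAG, i) for i in (1, 3, 5)}
)
OBLIVIOUS = version_oblivious(ARCHIVE)
```

Eight annotated triples reduce to two version-oblivious ones, which is the set an annotation-based representation would actually index.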

RDF vocabulary

In general, we cover under this feature the main aspects regarding the different subjects ( $S_{A}$ ), predicates ( $P_{A}$ ), and objects ( $O_{A}$ ) in the RDF archive $A$ . Namely, we put the focus on the RDF vocabulary per version and delta and the vocabulary set dynamicity, defined as follows:

Definition 9 (RDF vocabulary per version).

For an RDF archive $A$ , the vocabulary per version is the set of subjects ( $S_{V_{i}}$ ), predicates ( $P_{V_{i}}$ ) and objects ( $O_{V_{i}}$ ) for each version $V_{i}$ in the RDF archive $A$ .

Definition 10 (RDF vocabulary per delta).

For an RDF archive $A$ , the vocabulary per delta is the set of subjects ( $S_{Δ_{i, j}^{+}}$ and $S_{Δ_{i, j}^{-}}$ ), predicates ( $P_{Δ_{i, j}^{+}}$ and $P_{Δ_{i, j}^{-}}$ ) and objects ( $O_{Δ_{i, j}^{+}}$ and $O_{Δ_{i, j}^{-}}$ ) for all consecutive (i.e., $j = i + 1$ ) $V_{i}$ and $V_{j}$ in $A$ .

Definition 11 (RDF vocabulary set dynamicity).

The dynamicity of a vocabulary set K, where K is one of $\{S, P, O\}$ , over two versions $V_{i}$ and $V_{j}$ , with $i < j$ , denoted by $vdyn (K, V_{i}, V_{j})$ , is defined by $\begin{array}{l} vdyn (K, V_{i}, V_{j}) = \frac{| (K_{V_{i}} ∖ K_{V_{j}}) \cup (K_{V_{j}} ∖ K_{V_{i}}) |}{| K_{V_{i}} \cup K_{V_{j}} |} . \end{array}$

Likewise, the vocabulary set dynamicity for insertions and deletions is defined by ${vdyn}^{+} (K, V_{i}, V_{j}) = \frac{| K_{V_{j}} ∖ K_{V_{i}} |}{| K_{V_{i}} \cup K_{V_{j}} |}$ and ${vdyn}^{-} (K, V_{i}, V_{j}) = \frac{| K_{V_{i}} ∖ K_{V_{j}} |}{| K_{V_{i}} \cup K_{V_{j}} |}$ respectively.
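Definition 11 can be evaluated per triple position. In the Python sketch below, the vocabulary set K is selected by an index (0 for subjects, 1 for predicates, 2 for objects); this encoding, the function names and the predicates of the example are assumptions made for illustration.

```python
from typing import FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]

SUBJECT, PREDICATE, OBJECT = 0, 1, 2


def vocabulary(graph: Graph, k: int) -> Set[str]:
    """K_V: the distinct terms at position k of the triples in a version."""
    return {triple[k] for triple in graph}


def vdyn(k: int, vi: Graph, vj: Graph) -> float:
    a, b = vocabulary(vi, k), vocabulary(vj, k)
    union = a | b
    return len((a - b) | (b - a)) / len(union) if union else 0.0


def vdyn_plus(k: int, vi: Graph, vj: Graph) -> float:
    a, b = vocabulary(vi, k), vocabulary(vj, k)
    union = a | b
    return len(b - a) / len(union) if union else 0.0


def vdyn_minus(k: int, vi: Graph, vj: Graph) -> float:
    a, b = vocabulary(vi, k), vocabulary(vj, k)
    union = a | b
    return len(a - b) / len(union) if union else 0.0


# Hypothetical predicates; entities follow the running example.
V1: Graph = frozenset({
    ("ex:S1", "ex:study", "ex:C1"),
    ("ex:S2", "ex:study", "ex:C1"),
    ("ex:P1", "ex:teach", "ex:C1"),
})
V2: Graph = frozenset({
    ("ex:S1", "ex:study", "ex:C1"),
    ("ex:S3", "ex:study", "ex:C1"),
    ("ex:P1", "ex:teach", "ex:C1"),
})
SUBJECT_DYNAMICITY = vdyn(SUBJECT, V1, V2)
```

In this example only the subject vocabulary changes, so a dictionary-based store would need new IDs for subjects but none for predicates or objects.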

The evolution (cardinality and dynamicity) of the vocabulary is especially relevant in RDF archiving, since traditional RDF management systems use dictionaries (mappings between terms and integer IDs) to efficiently manage RDF graphs. Finally, whereas additional graph-based features (e.g. in-out-degree, clustering coefficient, presence of cliques, etc.) are interesting and complementary to our work, our proposed properties are feasible (efficient to compute and analyse) and grounded in the state of the art of archiving policies.

3.2. Design of benchmark queries

There is neither a standard language to query RDF archives, nor an agreed way for the more general problem of querying temporal graphs. Nonetheless, most of the proposals (such as T-SPARQL [16], stSPARQL [3], SPARQL-ST [35] and the most recent SPARQ-LTL [14]) are based on SPARQL modifications.

In this scenario, previous experiences on benchmarking SPARQL resolution in RDF stores show that benchmark queries should report on the query type, result size, graph pattern shape, and query atom selectivity [37]. Conversely, for RDF archiving, one should put the focus on data dynamicity, without forgetting the strong impact played by query selectivity in most RDF triple stores and query planning strategies [1].

Let us briefly recall and adapt definitions of query cardinality and selectivity [1,2] to RDF archives. Given a SPARQL query Q, where we restrict to SPARQL Basic Graph Patterns (BGPs, i.e. sets of triple patterns, potentially including a FILTER condition, in which all triple patterns must match) hereafter, the evaluation of Q over a general RDF graph $G$ results in a bag of solution mappings ${[[Q]]}_{G}$ , where Ω denotes its underlying set. The function ${card}_{{[[Q]]}_{G}}$ maps each mapping $μ \in Ω$ to its cardinality in ${[[Q]]}_{G}$ . Then, for comparison purposes, we introduce three main features, namely archive-driven result cardinality and selectivity, version-driven result cardinality and selectivity, and version-driven result dynamicity, defined as follows.
Definition 12 (Archive-driven result cardinality).

The archive-driven result cardinality of Q over the RDF archive $A$ is defined by $\begin{matrix} CARD (Q, A) = \sum_{μ \in Ω} {card}_{{[[Q]]}_{A}} (μ) . \end{matrix}$ In turn, the archive-driven query selectivity accounts for how selective the query is, and it is defined by $SEL (Q, A) = | Ω | / | A |$ .

Definition 13 (Version-driven result cardinality).

The version-driven result cardinality of Q over a version $V_{i}$ , is defined by $\begin{matrix} CARD (Q, V_{i}) = \sum_{μ \in Ω_{i}} {card}_{{[[Q]]}_{V_{i}}} (μ), \end{matrix}$ where $Ω_{i}$ denotes the underlying set of the bag ${[[Q]]}_{V_{i}}$ . Then, the version-driven query selectivity is defined by $SEL (Q, V_{i}) = | Ω_{i} | / | V_{i} |$ .

Definition 14 (Version-driven result dynamicity).

The version-driven result dynamicity of the query Q over two versions $V_{i}$ and $V_{j}$ , with $i < j$ , denoted by $dyn (Q, V_{i}, V_{j})$ is defined by $\begin{matrix} dyn (Q, V_{i}, V_{j}) = \frac{| (Ω_{i} ∖ Ω_{j}) \cup (Ω_{j} ∖ Ω_{i}) |}{| Ω_{i} \cup Ω_{j} |} . \end{matrix}$

Likewise, we define the version-driven result insertion ${dyn}^{+} (Q, V_{i}, V_{j}) = \frac{| Ω_{j} ∖ Ω_{i} |}{| Ω_{i} \cup Ω_{j} |}$ and deletion ${dyn}^{-} (Q, V_{i}, V_{j}) = \frac{| Ω_{i} ∖ Ω_{j} |}{| Ω_{i} \cup Ω_{j} |}$ dynamicity.
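The version-driven features can be computed once a query result is available as a bag. The Python sketch below assumes a deliberately tiny evaluator for a single triple pattern (variables start with `?`), not a SPARQL engine, and represents the bag as a `Counter` of solution mappings; all names and the example predicates are hypothetical.

```python
from collections import Counter
from typing import FrozenSet, Tuple

Triple = Tuple[str, str, str]
Graph = FrozenSet[Triple]
Mapping = FrozenSet[Tuple[str, str]]  # a solution mapping as (variable, value) pairs


def match(pattern: Triple, graph: Graph) -> Counter:
    """Evaluate one triple pattern; returns the bag [[Q]]_V as a Counter.
    With a single pattern over a set of triples every multiplicity is 1."""
    bag: Counter = Counter()
    for triple in graph:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break  # same variable bound to two different values
            elif term != value:
                break      # constant does not match
        else:
            bag[frozenset(binding.items())] += 1
    return bag


def card(bag: Counter) -> int:
    """CARD: sum of the multiplicities of all mappings."""
    return sum(bag.values())


def sel(bag: Counter, graph: Graph) -> float:
    """SEL: distinct mappings over the number of triples (graph must be non-empty)."""
    return len(bag) / len(graph)


def dyn(bag_i: Counter, bag_j: Counter) -> float:
    """dyn: mappings found in exactly one of the two results, over all mappings."""
    omega_i, omega_j = set(bag_i), set(bag_j)
    union = omega_i | omega_j
    return len(omega_i ^ omega_j) / len(union) if union else 0.0


V1: Graph = frozenset({
    ("ex:S1", "ex:study", "ex:C1"),
    ("ex:S2", "ex:study", "ex:C1"),
    ("ex:P1", "ex:teach", "ex:C1"),
})
V2: Graph = frozenset({
    ("ex:S1", "ex:study", "ex:C1"),
    ("ex:S3", "ex:study", "ex:C1"),
    ("ex:P1", "ex:teach", "ex:C1"),
})
Q: Triple = ("?s", "ex:study", "ex:C1")  # students of the course
R1, R2 = match(Q, V1), match(Q, V2)
```

Both versions yield two students, so cardinality and selectivity are identical, yet two of the three distinct answers differ; this is the situation the result dynamicity is meant to expose.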

The archive-driven result cardinality is reported as a feature directly inherited from traditional SPARQL querying, as it disregards the versions and evaluates the query over the set of triples present in the RDF archive. Although this feature could be only of peripheral interest, the knowledge of this feature can help in the interpretation of version-agnostic retrieval purposes (e.g. ASK queries).

As stated, result cardinality and query selectivity are main influencing factors for the query performance, and should be considered in the benchmark design and also known for the result analysis. In RDF archiving, both processes require particular care, given that the results of a query can highly vary in different versions. Knowing the version-driven result cardinality and selectivity helps to interpret the behaviour and performance of a query across the archive. For instance, selecting only queries with the same cardinality and selectivity across all versions should guarantee that the index performance is always the same and as such, potential retrieval time differences can be attributed to the archiving policy. Finally, the version-driven result dynamicity does not just focus on the number of results, but how these are distributed in the archive timeline.

In the following, we introduce five foundational query atoms to cover the broad spectrum of emerging retrieval demands in RDF archiving. Rather than providing a complete catalog, our main aim is to reflect basic retrieval features on RDF archives, which can be combined to serve more complex queries. We elaborate these atoms on the basis of related literature, with special attention to the needs of the well-established Memento Framework [43], which can provide access to prior states of RDF resources using datetime negotiation in HTTP.

Version materialisation, $Mat (Q, V_{i})$ : it provides the SPARQL query resolution of the query Q at the given version $V_{i}$ . Formally, $Mat (Q, V_{i}) = {[[Q]]}_{V_{i}}$ .

Within the Memento Framework, this operation is needed to provide mementos (URI-M) that encapsulate a prior state of the original resource (URI-R).

Delta materialisation, $Diff (Q, V_{i}, V_{j})$ : it provides the different results of the query Q between the given $V_{i}$ and $V_{j}$ versions. Formally, let us consider that the output is a pair of mapping sets, corresponding to the results that are present in $V_{i}$ but not in $V_{j}$ , that is $(Ω_{i} ∖ Ω_{j})$ , and vice versa, i.e. $(Ω_{j} ∖ Ω_{i})$ .

A particular case of delta materialisation is to retrieve all the differences between $V_{i}$ and $V_{j}$ , which corresponds to the aforementioned $Δ_{i, j}^{+}$ and $Δ_{i, j}^{-}$ .

Version Query, $Ver (Q)$ : it provides the results of the query Q annotated with the version label in which each of them holds. In other words, it facilitates the ${[[Q]]}_{V_{i}}$ solution for those $V_{i}$ that contribute with results.

Cross-version join, $Join (Q_{1}, V_{i}, Q_{2}, V_{j})$ : it serves the join between the results of $Q_{1}$ in $V_{i}$ , and $Q_{2}$ in $V_{j}$ . Intuitively, it is similar to $Mat (Q_{1}, V_{i}) ⋈ Mat (Q_{2}, V_{j})$ .

Change materialisation, $Change (Q)$ : it provides those consecutive versions in which the given query Q produces different results. Formally, $Change (Q)$ reports the labels $i, j$ (referring to the versions $V_{i}$ and $V_{j}$ ) $\Leftrightarrow Diff (Q, V_{i}, V_{j}) \neq \emptyset, j = i + 1$ .

Within the Memento Framework, change materialisation is needed to provide timemap information to compile the list of all mementos (URI-T) for the original resource, i.e. the basis of datetime negotiation handled by the timegate (URI-G).
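The five query atoms can be prototyped over an IC-style in-memory archive. The Python sketch below makes several simplifying assumptions: a query is a single triple pattern, results are sets rather than bags, versions are plain sets keyed by their label, and the predicates `ex:study` and `ex:teach` are hypothetical. It illustrates the semantics only and says nothing about how an optimized archive would evaluate these atoms.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Triple = Tuple[str, str, str]
Mapping = FrozenSet[Tuple[str, str]]
Archive = Dict[int, FrozenSet[Triple]]  # version label -> RDF version


def evaluate(pattern: Triple, graph: FrozenSet[Triple]) -> Set[Mapping]:
    out: Set[Mapping] = set()
    for triple in graph:
        binding: Dict[str, str] = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                if binding.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            out.add(frozenset(binding.items()))
    return out


def mat(q: Triple, archive: Archive, i: int) -> Set[Mapping]:
    return evaluate(q, archive[i])


def diff(q: Triple, archive: Archive, i: int, j: int) -> Tuple[Set[Mapping], Set[Mapping]]:
    omega_i, omega_j = mat(q, archive, i), mat(q, archive, j)
    return omega_i - omega_j, omega_j - omega_i


def ver(q: Triple, archive: Archive) -> Dict[int, Set[Mapping]]:
    results = {i: evaluate(q, archive[i]) for i in sorted(archive)}
    return {i: res for i, res in results.items() if res}  # contributing versions only


def join(q1: Triple, i: int, q2: Triple, j: int, archive: Archive) -> Set[Mapping]:
    out: Set[Mapping] = set()
    for m1 in mat(q1, archive, i):
        left = dict(m1)
        for m2 in mat(q2, archive, j):
            # Compatible mappings agree on every shared variable.
            if all(left.get(var, val) == val for var, val in m2):
                out.add(m1 | m2)
    return out


def change(q: Triple, archive: Archive) -> List[Tuple[int, int]]:
    labels = sorted(archive)
    return [(i, j) for i, j in zip(labels, labels[1:])
            if diff(q, archive, i, j) != (set(), set())]


ARCHIVE: Archive = {
    1: frozenset({("ex:S1", "ex:study", "ex:C1"), ("ex:S2", "ex:study", "ex:C1"),
                  ("ex:P1", "ex:teach", "ex:C1")}),
    2: frozenset({("ex:S1", "ex:study", "ex:C1"), ("ex:S3", "ex:study", "ex:C1"),
                  ("ex:P1", "ex:teach", "ex:C1")}),
    3: frozenset({("ex:S1", "ex:study", "ex:C1"), ("ex:S3", "ex:study", "ex:C1"),
                  ("ex:P2", "ex:teach", "ex:C1"), ("ex:S2", "ex:teach", "ex:C1")}),
}
STUDENTS: Triple = ("?x", "ex:study", "ex:C1")
TEACHERS: Triple = ("?x", "ex:teach", "ex:C1")

# Who studied the course in version 1 and teaches it in version 3?
BOTH_ROLES = join(STUDENTS, 1, TEACHERS, 3, ARCHIVE)
```

The join reproduces the cross-version example of Table 1 (a subject playing both the student and the teacher role), and `change` reports the version pairs at which the answers to a query differ.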

These query features can be instantiated in domain-specific query languages (e.g. DIACHRON QL [28]) and existing temporal extensions of SPARQL (e.g. T-SPARQL [16], stSPARQL [3], SPARQL-ST [35], and SPARQ-LTL [14]). We include below an instantiation of these five queries in AnQL [48], as well as a discussion of how these AnQL queries could be evaluated over off-the-shelf RDF stores using “pure” SPARQL. However, since such an approach would typically render rather inefficient SPARQL queries, in the following sections, we focus on tailored implementations using optimized storage techniques to serve these features.

3.3. Instantiation in a concrete query language: AnQL

In order to “ground” the five concrete query cases outlined above, we herein propose the syntactic abstraction of AnQL [48], a query language that provides some syntactic sugar for (time-)annotated RDF data and queries on top of SPARQL. This abstraction helps us – as a tradeoff between concrete instantiation in SPARQL as a query language and implementation issues underneath – to illustrate differences between IC, CB and TB as storage strategies from the viewpoint of an off-the-shelf RDF store.

AnQL is a query language defined as a – relatively straightforward – extension of SPARQL, where a SPARQL triple pattern t is allowed to be annotated with a (temporal) label l as an annotated triple pattern of the form $t : l$ (note that in [48] we also discuss various other annotation domains). In our case, we assume for simplicity that the domain of annotations consists simply of (consecutive) version numbers, i.e. :s :p :o :[ $v_{i}$ ] and :s :p :o :[ $v_{i}$ , $v_{j}$ ], resp., would indicate that the triple pattern :s :p :o is valid in version $v_{i}$ or, resp., between versions $v_{i}, v_{j} \in N$ such that $v_{i} ⩽ v_{j}$ .

Moreover, for simplicity, we extend an AnQL BAP (basic annotated pattern), that is, a SPARQL basic graph pattern (BGP) which may contain such annotated triple patterns, as follows: let P be a SPARQL graph pattern; then we write $P : l$ as a syntactic shortcut for an annotated pattern such that each triple pattern $t \in P$ is replaced by $t : l$ .

Using this notation, we can “instantiate” the queries from above as follows in AnQL.

$Mat (Q, v_{i})$ :

$Diff (Q, v_{i}, v_{j})$ :

Here, the newly bound variable ?V is used to show which solutions appear only in version ?V but not in the other version, which is a simple way to describe the changeset [26].

$Ver (Q)$ :

$Join (Q_{1}, v_{i}, Q_{2}, v_{j})$ :

$Change (Q)$ :

Based on these queries, a naive implementation of IC, TB and CB on top of an off-the-shelf triple store could now look as follows:
3.3.1. IC

All triples of each instance/version would be stored in named graphs with the version name being the graph name and respective metadata about the version number on the default graph. That is, a triple (:s :p :o) in version $v_{i}$ would be stored in the respective named graph, e.g. :version_v1, along with a triple (:version_v1 :version_number $v_{i}$ ) in the default graph.

Then, each annotated pattern $P : l$ in the AnQL queries above could be translated into a native SPARQL graph pattern as:

3.3.2. TB

All triples appearing in any instance/version could each be stored as a single reified triple, with additional meta-information giving the disjoint from-to version ranges in which that triple was true. That is, a triple (:s :p :o) which was true in versions $v_{i}$ until $v_{j}$ could be represented as follows:

Note that this representation allows for a compact representation of several disjoint (maximal) validity intervals of the same triple, thus causing less overhead than the graph-based representation discussed for IC. The translation for annotated query patterns $P : l$ in the AnQL syntax could proceed by replacing each triple pattern $t = (s p o)$ in P as follows, where ?t_start and ?t_end are fresh variables unique per t:

Unfortunately, this “recipe” does not work for $Ver (Q)$ and $Change (Q)$ , since it would result in l being an unbound variable in the FILTER expression. Thus we provide separate translations for $Ver (Q)$ and $Change (Q)$ , where both would use the same replacement, but without the FILTER expression per triple pattern.

As for $Ver (Q)$ , the overall result only holds in case the intersection of all [?t_start $_{i}$ ,?t_end $_{i}$ ] intervals is non-empty for any binding returned for the resp. BGP $Q = \{t_{1}, \dots, t_{n}\}$ . So, an overall FILTER which checks this condition needs to be added for the whole BGP Q. To this end, we first translate each triple pattern $t_{i} = (s_{i} p_{i} o_{i})$ with $1 ⩽ i ⩽ n$ in Q separately as before, to the following pattern (without the single FILTER per triple):

Let us call the BGP translated this way $Q^{'}$ ; then $Ver (Q)$ could be realized with the following combination of BIND and FILTER clauses:⁸

⁸
Note that, unfortunately, strictly speaking the functions min() and max() used here exist in SPARQL only as aggregates for subqueries and not as functions over value lists, but for instance an expression BIND(min( $x_{1}$ , … $x_{n}$ ) AS ?X) can be easily emulated using a combination of IF and BIND, as follows:

BIND( IF( $x_{1}$ < $x_{2}$ , $x_{1}$ , $x_{2}$ ) AS ?X $_{2}$ )

BIND( IF(?X $_{2}$ < $x_{3}$ ,?X $_{2}$ , $x_{3}$ ) AS ?X $_{3}$ )

…

BIND( IF(?X $_{n - 1}$ < $x_{n}$ ,?X $_{n - 1}$ , $x_{n}$ ) AS ?X)

Analogously, $Change (Q)$ could in turn (in a naive implementation just demonstrating expressive feasibility) re-use this implementation of $Ver (Q)$ to determine between which exact versions the result has actually changed: in fact that is the case, exactly before and after the ?t_start and ?t_end labels returned by $Ver (Q)$ . That is, using $Ver (Q)$ as a subquery you could formulate $Change (Q)$ as follows:

Note that this works because $Ver (Q)$ just returns the (maximum) intervals where query Q returned the same results. Therefore, each time before or after such an interval, some change in the result of Q must have occurred. Note further that we need the UNION of start and end of these intervals, since $Ver (Q)$ might actually leave gaps, i.e. there might be intervals in between where there are no results for Q at all.
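The interval reasoning behind $Ver (Q)$ and $Change (Q)$ in the TB case can be sketched outside SPARQL. The following Python fragment is only an illustration under assumptions: the function names and the example intervals are hypothetical, and it models a single solution of a BGP whose matched triples each carry a list of disjoint [start, end] version intervals. The intersection of one interval per triple is non-empty exactly when the largest start does not exceed the smallest end.

```python
from itertools import product


def common_interval(intervals):
    """Intersect closed version intervals; return None if the intersection is empty."""
    start = max(s for s, _ in intervals)
    end = min(e for _, e in intervals)
    return (start, end) if start <= end else None


def ver(per_triple_intervals):
    """Intervals in which all matched triples of one solution hold at the same time.

    per_triple_intervals holds one list of disjoint (start, end) intervals per triple.
    """
    result = []
    for combo in product(*per_triple_intervals):
        shared = common_interval(combo)
        if shared is not None:
            result.append(shared)
    return sorted(result)


def change_boundaries(ver_intervals):
    """Union of the start and end labels of the Ver(Q) intervals."""
    starts = {s for s, _ in ver_intervals}
    ends = {e for _, e in ver_intervals}
    return sorted(starts | ends)


# Hypothetical solution: t1 is valid in [1, 4] and [7, 9], t2 is valid in [3, 8].
intervals_t1 = [(1, 4), (7, 9)]
intervals_t2 = [(3, 8)]
ver_result = ver([intervals_t1, intervals_t2])
boundaries = change_boundaries(ver_result)
```

In this example the solution holds in versions 3 to 4 and 7 to 8, with a gap in between, which is why both the start and the end labels are collected.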

Finally, let us note that the implementation sketched here only works for Q being a BGP (as we originally assumed). As for more complex patterns such as OPTIONAL, MINUS, NOT EXISTS or patterns involving complex FILTERS or even aggregations, a simple translation like the one sketched here would not return correct results in the general case.
3.3.3. CB

We emphasize that a change-based storage of RDF triples has no trivial implementation in an off-the-shelf RDF store. Again, change deltas (triple additions and deletions between versions) could be stored in separate graphs, starting with an original graph :version_v0_add and separate graphs labelled, e.g. :version_vi_add and :version_vi_del per new version, plus again metadata triples in the default graph, e.g.:

Then the validity of a triple pattern t in a particular version $v_{i}$ can be checked as follows, intuitively testing whether the triple has been added in a prior version and not been removed since:

The translation of whole AnQL queries in the case of CB is therefore by no means trivial, as this covers only single triple patterns. Whereas we do not provide the full translation for CB here, we hope that the sketch here, along with the translations for IC and TB above, has served to illustrate that an implementation of RDF archives and queries in off-the-shelf RDF stores and using SPARQL is a non-trivial exercise; even the translated patterns for CB and IC sketched above would likely not scale to large archives of dynamic RDF data and complex queries. Therefore, in our current evaluation (Section 5), we focus on tailored implementations using efficient, optimized storage techniques to implement these features, using rather simple triple pattern queries and joins of triple patterns, as opposed to full SPARQL BGPs.
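The per-triple validity test described for CB (added in a prior version and not removed since) can be illustrated with a small Python sketch. It is not the SPARQL translation; the data layout (a list of add/delete set pairs, with index 0 playing the role of :version_v0_add) and the function name are assumptions made for the example, and each delta is assumed not to add and delete the same triple.

```python
def holds_in_version(triple, deltas, version):
    """Decide validity from the most recent delta (up to `version`) touching the triple."""
    for v in range(version, -1, -1):
        added, deleted = deltas[v]
        if triple in added:
            return True   # last change was an addition
        if triple in deleted:
            return False  # last change was a deletion
    return False          # never added up to this version


# Hypothetical archive with three versions; deltas[i] = (adds, deletes) leading to version i.
t1, t2, t3 = ("s1", "p", "o1"), ("s2", "p", "o2"), ("s3", "p", "o3")
deltas = [
    ({t1, t2}, set()),  # version 0: initial graph
    ({t3}, {t1}),       # version 1
    ({t1}, {t2}),       # version 2
]
```

Even for a single triple pattern the check has to walk the chain of deltas, which hints at why whole-query translations for CB become expensive.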

4. BEAR: A test suite for RDF archiving

This section presents BEAR, a prototypical (and extensible) test suite to demonstrate the new capabilities in benchmarking the efficiency of RDF archives using our foundations, and to highlight current challenges and potential improvements in RDF archiving. BEAR comprises three main datasets, namely BEAR-A, BEAR-B, and BEAR-C, each having different characteristics. We first detail the dataset descriptions and the query set covering basic retrieval needs for each of these datasets in Sections 4.1–4.3. In the next section (5) we will evaluate BEAR on different archiving systems. The complete test suite (data corpus, queries, archiving system source codes, evaluation and additional results) is available at the BEAR repository.9

⁹
https://aic.ai.wu.ac.at/qadlod/bear

4.1. BEAR-A: Dynamic linked data

The first benchmark we consider provides a realistic scenario on queries about the evolution of Linked Data in practice.

4.1.1. Dataset description

Table 2
BEAR-A Dataset configuration

Versions	$| V_{0} |$	$| V_{57} |$	$\overline{growth}$	$\overline{δ}$	$\overline{δ^{-}}$	$\overline{δ^{+}}$	$C_{A}$	$O_{A}$
58	30 m	66 m	101%	31%	32%	27%	3.5 m	376 m

Fig. 2.

Dataset description.

We build our RDF archive on the data hosted by the Dynamic Linked Data Observatory,¹⁰ monitoring more than 650 different domains across time and serving weekly crawls of these domains. BEAR data are composed of the first 58 weekly snapshots, i.e. 58 versions, from this corpus. Each original week consists of triples annotated with their RDF document provenance, in N-Quads format. In this paper we focus on archiving of a single RDF graph, so we remove the context information and manage the resultant set of triples, disregarding duplicates. The extension to multiple graph archiving can be seen as future work. In addition, we replaced Blank Nodes with Skolem IRIs¹¹ (with a prefix http://example.org/bnode/) in order to simplify the computation of diffs.

¹⁰
http://swse.deri.org/dyldo/

¹¹
https://www.w3.org/TR/rdf11-concepts/#section-skolemization

We report the data configuration features (cf. Section 3) that are relevant for our purposes. Table 2 lists basic statistics of our dataset, further detailed in Fig. 2, which shows the figures per version and the vocabulary evolution. Data growth behaviour (dynamicity) can be identified at a glance: although the number of statements in the last version ( $| V_{57} |$ ) is more than double the initial size ( $| V_{0} |$ ), the mean version data growth ( $\overline{growth}$ ) between versions is almost marginal ( $101\%$ ).

A closer look at Fig. 2 (a) shows that the latest versions contribute most to this increase. Similarly, the version change ratios¹² in Table 2 ( $\overline{δ}$ , $\overline{δ^{-}}$ and $\overline{δ^{+}}$ ) point to the concrete add and delete operations. Thus, one can see that a mean of $31\%$ of the data change between two versions and that each new version deletes a mean of $27\%$ of the previous triples, and adds $32\%$ . Nonetheless, Fig. 2 (b) points to particular corner cases (in spite of a common stability), such as $V_{31}$ , in which no deletes are present, and it also highlights the noticeable dynamicity in the last versions.

¹²
Note that $\overline{δ} = δ_{1, n}^{*}$ , so we use them interchangeably.

Conversely, the number of version-oblivious triples ( $O_{A}$ ), 376 m, points to a relatively low number of different triples in all the history if we compare this against the number of versions and the size of each version. This fact is in line with the $\overline{δ}$ dynamicity values, stating that a mean of $31\%$ of the data change between two versions. The same reasoning applies for the remarkably small static core ( $C_{A}$ ), 3.5 m.
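To make the reported features concrete, the following Python sketch computes them on a toy archive. It rests on assumptions about the Section 3 definitions, which are not restated here: the static core is taken as the set of triples present in every version, the version-oblivious triples as the union over all versions, and the change ratio between two versions as the number of added plus deleted triples divided by the size of the union of both versions. Function names and data are hypothetical.

```python
def change_ratio(v_i, v_j):
    """Assumed definition: |adds ∪ deletes| / |V_i ∪ V_j|."""
    adds = v_j - v_i
    dels = v_i - v_j
    return len(adds | dels) / len(v_i | v_j)


def static_core(versions):
    """Triples present in every version (assumed meaning of C_A)."""
    return set.intersection(*map(set, versions))


def version_oblivious(versions):
    """Distinct triples over the whole history (assumed meaning of O_A)."""
    return set.union(*map(set, versions))


# Toy archive with three versions; single letters stand in for triples.
versions = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "d", "e"}]

core = static_core(versions)
oblivious = version_oblivious(versions)
ratios = [change_ratio(versions[i], versions[i + 1]) for i in range(len(versions) - 1)]
mean_ratio = sum(ratios) / len(ratios)
```

Here only one triple survives in all versions while five distinct triples occur overall, mirroring on a small scale the gap between $C_{A}$ and $O_{A}$ in Table 2.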

Finally, Figs 2 (c)–(e) show the RDF vocabulary (different subjects, predicates and objects) per version and per delta (adds and deletes). As can be seen, the number of different subjects and predicates remains stable except for the noticeable increase in the latest versions, as already identified in the number of statements per version. However, the number of added and deleted subjects and objects fluctuates greatly and remains high (one order of magnitude of the total number of elements, except for the aforementioned $V_{31}$ in which no deletes are present). In turn, the number of predicates is proportionally smaller, but it presents a similar behaviour.

4.1.2. Test queries

BEAR-A provides triple pattern queries Q to test each of the five atomic operations defined in our foundations (Section 3). Note that, although such queries do not cover the full spectrum of SPARQL queries, triple patterns (i) constitute the basis for more complex queries, (ii) are the main operation served by lightweight clients such as the Linked Data Fragments [45] proposal, and (iii) are the required operation to retrieve prior states of a resource in the Memento Framework. For simplicity, we present here atomic lookup queries Q in the form (S??), (?P?), and (??O), which are then extended to the rest of triple patterns (SP?), (S?O), (?PO), and (SPO).¹³ For instance, Listing 1 shows an example of a materialization of a basic predicate lookup query in version 3.

¹³
The triple pattern (???) retrieves all the information, so no sampling technique is required.

Listing 1.
Materialization of a (?P?) triple pattern in version 3

As for the generation of queries, we randomly select such triple patterns from the 58 versions of the Dynamic Linked Data Observatory. In order to provide comparable results, we consider entirely dynamic queries, meaning that the results always differ between consecutive versions. In other words, for each of our selected queries Q, and all the versions $V_{i}$ and $V_{j}$ ( $i < j$ ), we ensure that $dyn (Q, V_{i}, V_{j}) > 0$ . To do so, we first extract subjects, predicates and objects that appear in all $Δ_{i, j}$ .

Then, we follow the foundations and try to minimise the influence of the result cardinality on the query performance. For this purpose, we sample queries which return, for all versions, result sets of similar size, that is, $CARD (Q, V_{i}) \approx CARD (Q, V_{j})$ for all queries and versions. We introduce here the notion of an ϵ-stable query, that is, a query for which the min and max result cardinality over all versions do not vary by more than a factor of $1 \pm ϵ$ from the mean cardinality, i.e., $\max_{\forall i \in N} CARD (Q, V_{i}) ⩽ (1 + ϵ) \cdot \frac{\sum_{\forall i \in N} CARD (Q, V_{i})}{| N |}$ and $\min_{\forall i \in N} CARD (Q, V_{i}) ⩾ (1 - ϵ) \cdot \frac{\sum_{\forall i \in N} CARD (Q, V_{i})}{| N |}$ .
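The ϵ-stability condition translates directly into a check over the per-version result cardinalities of a query. The following Python sketch is a minimal illustration; the function name and the sample cardinalities are hypothetical.

```python
def is_epsilon_stable(cardinalities, eps):
    """True if min and max cardinality stay within a factor (1 ± eps) of the mean."""
    mean = sum(cardinalities) / len(cardinalities)
    upper_ok = max(cardinalities) <= (1 + eps) * mean
    lower_ok = min(cardinalities) >= (1 - eps) * mean
    return upper_ok and lower_ok


# Hypothetical result sizes of one query over four versions (mean cardinality 10).
cards = [10, 11, 9, 10]
stable_loose = is_epsilon_stable(cards, eps=0.2)   # bounds 8..12
stable_tight = is_epsilon_stable(cards, eps=0.05)  # bounds 9.5..10.5
```

With the looser threshold the query qualifies, whereas the tighter one rejects it because version cardinalities 9 and 11 fall outside the band around the mean.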

Thus, the previously selected dynamic queries are effectively run over each version in order to collect the result cardinality. Next, we split subject, predicate and object queries into those producing low ( $Q_{L}^{S}$ , $Q_{L}^{P}$ , $Q_{L}^{O}$ ) and high ( $Q_{H}^{S}$ , $Q_{H}^{P}$ , $Q_{H}^{O}$ ) cardinalities. Finally, we filter these sets to sample at most 50 subject, predicate and object queries which can be considered ϵ-stable for a given ϵ. Table 3 shows the selected query sets with their epsilon value, mean cardinality and mean dynamicity. Although, in general, one could expect to have queries with a low ϵ (i.e. cardinalities are equivalent between versions), we test higher ϵ values in objects and predicates in order to have queries with higher cardinalities. Even with this relaxed restriction, the number of predicate queries that fulfil the requirements is just 6 and 10 for low and high cardinalities respectively.

Table 3
Overview of BEAR-A lookup queries

Query set	lookup position	$\overline{CARD}$	$\overline{dyn}$	#queries
$Q_{L}^{S}$ , $ϵ = 0.2$	subject	6.7	0.46	50
$Q_{L}^{P}$ , $ϵ = 0.6$	predicate	178.66	0.09	6
$Q_{L}^{O}$ , $ϵ = 0.1$	object	2.18	0.92	50
$Q_{H}^{S}$ , $ϵ = 0.1$	subject	55.22	0.78	50
$Q_{H}^{P}$ , $ϵ = 0.6$	predicate	845.3	0.12	10
$Q_{H}^{O}$ , $ϵ = 0.6$	object	55.62	0.64	50

Section 5 provides an evaluation of (i) version materialisation, (ii) delta materialisation and (iii) version queries for these lookup queries under different state-of-the-art archiving policies. Appendix A extends the lookup queries to triple patterns (SP?), (S?O) and (?PO). We additionally sample 50 (SPO) queries from the static core.

Table 4
BEAR-B Dataset configuration

Granularity	versions	$| V_{0} |$	$| V_{last} |$	$\overline{growth}$	$\overline{δ}$	$\overline{δ^{-}}$	$\overline{δ^{+}}$	$C_{A}$	$O_{A}$
instant	21,046	33,502	43,907	100.001%	0.011%	0.007%	0.004%	32,094	234,588
hour	1,299	33,502	43,907	100.090%	0.304%	0.197%	0.107%	32,303	178,618
day	89	33,502	43,907	100.744%	1.778%	1.252%	0.526%	32,448	83,134

4.2. BEAR-B: DBpedia Live

Our next benchmark, rather than looking at arbitrary Linked Data, focuses on the evolution of DBpedia, which directly reflects Wikipedia edits, where we can expect quite different change/evolution characteristics.

4.2.1. Dataset description

The BEAR-B dataset has been compiled from DBpedia Live changesets¹⁴ over the course of three months (August to October 2015). DBpedia Live [22] records all updates to Wikipedia articles and hence re-extracts and instantly updates the respective DBpedia Live resource descriptions. BEAR-B contains the resource descriptions of the 100 most volatile resources along with their updates. The most volatile resource (dbr:Deaths_in_2015) changes 1,305 times, the least volatile resource contained in the dataset (dbr:Once_Upon_a_Time_(season_5)) changes 263 times.

¹⁴
http://live.dbpedia.org/changesets/

As dataset updates in DBpedia Live occur instantly, for every single update the dataset shifts to a new version. In practice, one would possibly aggregate such updates in order to have fewer dataset modifications. Therefore, we also aggregated these updates on an hourly and daily level. Hence, we get three time granularities from the changesets for the very same dataset: instant (21,046 versions), hour (1,299 versions), and day (89 versions).

Detailed characteristics of the dataset granularities are listed in Table 4. The dataset grows almost continuously from 33,502 triples to 43,907 triples. Since the time granularities differ in the number of intermediate versions, they show different change characteristics: a longer update cycle also results in more extensive updates between versions, so the average version change ratio increases from very small portions of 0.011% for instant updates to 1.8% at the daily level. It can also be seen that the aggregation of updates leads to omission of changes: whereas the instant updates handle 234,588 version-oblivious triples, the daily aggregates only have 83,134 (hourly: 178,618), i.e. a reasonable number of triples exists only for a short period of time before they get deleted again. Likewise, from the different sizes of the static core, we see that triples which have been deleted at some point are re-inserted after a short period of time (in the case of DBpedia Live this may happen when changes made to a Wikipedia article are reverted shortly after).
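The effect of aggregation on the version-oblivious triples can be reproduced on a toy changeset stream. The Python sketch below is an assumption-laden illustration, not the procedure used to build BEAR-B: the function names, the minute-based timestamps and the bucketing rule (one version per time bucket, taken at the end of the bucket) are all hypothetical.

```python
def aggregate_versions(initial, changesets, bucket_of):
    """Replay time-ordered (timestamp, adds, dels) changesets, one version per bucket."""
    versions = [frozenset(initial)]
    current = set(initial)
    previous_bucket = None
    for timestamp, adds, dels in changesets:
        bucket = bucket_of(timestamp)
        if previous_bucket is not None and bucket != previous_bucket:
            versions.append(frozenset(current))  # close the previous bucket
        current = (current - dels) | adds
        previous_bucket = bucket
    if previous_bucket is not None:
        versions.append(frozenset(current))  # close the last bucket
    return versions


def version_oblivious(versions):
    """Distinct triples seen in any version."""
    return set().union(*versions)


# Timestamps in minutes; "b" is added and reverted again within the same hour.
initial = {"a"}
changesets = [(5, {"b"}, set()), (20, set(), {"b"}), (70, {"c"}, set())]

instant = aggregate_versions(initial, changesets, bucket_of=lambda ts: ts)
hourly = aggregate_versions(initial, changesets, bucket_of=lambda ts: ts // 60)
```

At the instant granularity the short-lived triple "b" appears in one version, whereas the hourly aggregation never observes it, so the hourly archive has fewer version-oblivious triples.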
4.2.2. Test queries

BEAR-B allows one to use the same sampling methodology as BEAR-A to retrieve dynamic queries. Nonetheless, we exploit the real-world usage of DBpedia to provide realistic queries. Thus, we extract the 200 most frequent triple patterns from the DBpedia query set of Linked SPARQL Queries dataset (LSQ) [36] and filter those that produce results in our BEAR-B corpus. We then obtain a batch of 62 lookup queries, mixing (?P?) and (?PO) queries, evaluated in Section 5. The full batch has a $\overline{CARD} = 80$ in BEAR-B-day and BEAR-B-hour, and $\overline{CARD} = 54$ in BEAR-B-instant. Finally, we build 20 join cases using the selected triple patterns, such as the join in Listing 2. Further statistics on each query are available at the BEAR repository.

Listing 2.

Example of a join query in BEAR-B

4.3. BEAR-C: Open data portals

The third dataset is taken from the Open Data Portal Watch project, a framework that monitors over 260 Open Data portals on a weekly basis and performs a quality assessment. The framework harvests the dataset descriptions in the portals and converts them to their DCAT representation. We refer to [31] for more details.

4.3.1. Dataset description

For BEAR-C, we decided to take the dataset descriptions of the European Open Data portal¹⁵ for 32 weeks, or 32 snapshots respectively. Table 5 and Fig. 3 show the main characteristics of the dataset. Each snapshot consists of roughly 500,000 triples (cf. Table 5) with a very limited growth as most of the updates are modifications on the metadata, i.e. adds and deletes report similar figures as shown in Fig. 3 (a)–(b). Note also that this dynamicity is also reflected in the subject and object vocabulary (Figs 3 (c)–(d)), whereas the metadata is always described with the same predicate vocabulary (Fig. 3 (e)), in spite of a minor modification in versions 24 and 25. An excerpt of the RDF data is shown in Listing 7 (Appendix C). Note that, as in BEAR-A, we also replaced Blank Nodes with Skolem IRIs.

¹⁵
http://data.europa.eu/euodp/en/data/

Table 5
BEAR-C Dataset configuration

Granularity	versions	$| V_{0} |$	$| V_{last} |$	$\overline{growth}$	$\overline{δ}$	$\overline{δ^{-}}$	$\overline{δ^{+}}$	$C_{A}$	$O_{A}$
portal	32	485,179	563,738	100.478%	67.617%	33.671%	33.946%	178,484	9,403,540

Fig. 3.
Dataset description.
4.3.2. Test queries

Selected triple patterns in BEAR-A cover queries whose dynamicity is well-defined, hence they allow for a fine-grained evaluation of different archiving strategies (and particular systems). In turn, BEAR-B adopts a realistic approach and gathers real-world queries from DBpedia. Thus, we provide complex queries for BEAR-C that, although they cannot be resolved in current archiving strategies in a straightforward and optimized way (as discussed in Section 3.3 for the CB approach), could help to foster the development and benchmarking of novel strategies and query resolution optimizations in archiving scenarios.

With the help of Open Data experts, we created 10 queries that retrieve different information from datasets and files (referred to as distributions, where each dataset refers to one or more distributions) in the European Open Data portal. For instance, Q1 in Listing 3 retrieves all the datasets and their file URLs. Appendix C includes the full list of queries.

Listing 3.

BEAR-C Q1: Retrieve portals and their files

Note that queries are provided as group graph patterns, such that they can be integrated in the AnQL notation¹⁶ (see Section 3.3).

¹⁶
BEAR-C queries intentionally include UNION and OPTIONAL to extend the application beyond Basic Graph Patterns.

5. Evaluation of RDF archiving systems

We illustrate the use of our foundations to evaluate RDF archiving systems. To do so, we built two RDF archiving systems using Jena’s TDB store¹⁷ (referred to as Jena hereinafter) and HDT [12], considering different state-of-the-art archiving policies (IC, CB, TB and hybrid approaches ${HB}^{IC / CB}$ and ${HB}^{TB / CB}$ ). Then, we use our prototypical BEAR to evaluate the influence of the concrete store and policy.

¹⁷
https://jena.apache.org/documentation/tdb/, v3.2.0.

Note that we considered these particular open RDF stores given that they are (i) easy to extend in order to implement the suggested archiving strategies, (ii) representative in the community and (iii) useful for potential archiving adopters. Jena is widely used in the community and can be considered as the de-facto standard implementation of most W3C efforts in RDF querying (SPARQL) and reasoning. In turn, HDT is a compressed store that considerably reduces space requirements of state-of-the-art stores (e.g. Virtuoso), hence it perfectly fits space efficiency requirements for archives. Furthermore, HDT is the underlying store of potential archiving adopters such as the crawling system LOD Laundromat,¹⁸ which generates new versions in each crawling process.¹⁹

¹⁸
http://lodlaundromat.org/

¹⁹
LOD Laundromat only serves the last crawled version of a dataset.

We implemented the different policies in Jena as follows. For the IC policy (referred to as Jena-IC), we index each version in an independent TDB instance. Likewise, for the CB policy (Jena-CB), we create one index for the added and one for the deleted statements of each version, again using independent TDB stores. In the TB policy (Jena-TB), we indexed all triples in one single TDB instance, using named graphs to indicate the versions of each triple. Listing 4 shows an example (in TriG notation [4]) with a triple (_:Jon foaf:name "Doe") in versions 1 and 2 and a triple (:Jon foaf:email "j@example.org") in versions 1 and 3. The graph http://example.org/versions lists the concrete version label of each named graph.

Listing 4.
Example of realization of a TB approach

Then, we implemented the hybrid ${HB}^{TB / CB}$ approach (Jena- ${HB}^{TB / CB}$ ) following the approach of [17,44] and indexed all deltas using two named graphs per version (adds and deletes) in one single TDB instance. Last, we implemented the ${HB}^{IC / CB}$ approach (Jena- ${HB}^{IC / CB}$ ), in which the system manages a set of IC and CB stores for the same dataset.

We follow the same strategy to develop the IC and CB strategies in HDT [12] (referred to as HDT-IC, HDT-CB and HDT- ${HB}^{IC / CB}$ ), which provides a compressed representation and indexing of RDF. The TB and ${HB}^{TB / CB}$ policies cannot be implemented as current HDT implementations²⁰ do not support quads, hence triples cannot be annotated with the version.

²⁰
We use the HDT C++ libraries at http://www.rdfhdt.org/.

In addition, we compare these systems with three state-of-the-art RDF archiving systems: R43ples [17] (v. 0.8.7²¹), which follows a hybrid TB/CB approach that stores deltas²² in named graphs on top of Jena, v-RDFCSA [8] (v. 2016 and the vpt sampling=64 as default configuration), a pure TB strategy that makes use of compression notions similarly to HDT, and TailR [29] (v. Dec-2016²³), which archives Linked Data descriptions (RDF triples of a given subject) using a hybrid IC/CB approach and implements the Memento protocol.

²¹
https://github.com/plt-tud/r43ples

²²
R43ples stores the recent version fully materialized, and previous versions can be queried by applying deltas in a reverse way.

²³
https://github.com/SemanticMultimedia/tlr

Tests were performed on a computer with 2× Intel Xeon E5-2650v2 @ 2.6 GHz (16 cores), 171 GB RAM, 4 HDDs in RAID 5 configuration (2.7 TB net storage), and Ubuntu 14.04.5 LTS running on a VM with QEMU/KVM hypervisor. We report elapsed times in a warm scenario, given that all systems are disk-based except for HDT and v-RDFCSA, which operate in memory.

Table 6
Space of the different archiving systems and policies

Dataset	rawdata (gzip)	diffdata (gzip)	Jena TDB IC	Jena TDB CB	Jena TDB TB	HDT IC	HDT CB	v-RDFCSA (TB)	R43ples ( ${HB}^{TB / CB}$ )	TailR ( ${HB}^{IC / CB}$ )
BEAR-A	23 GB	14 GB	230 GB	138 GB	83 GB	48 GB	28 GB	7.0 GB	NA	NA
BEAR-B-instant	12 GB	0.16 GB	158 GB	7.4 GB	–	63 GB	0.33 GB	NA	0.42 GB	0.28 GB
BEAR-B-hour	475 MB	10 MB	6238 MB	479 MB	3679 MB	2229 MB	35 MB	36 MB	149 MB	19 MB
BEAR-B-day	37 MB	1 MB	421 MB	44 MB	24 MB	149 MB	7 MB	5 MB	63 MB	9 MB
BEAR-C	243 MB	205 MB	2151 MB	2271 MB	2012 MB	421 MB	439 MB	313 MB	8339 MB	1607 MB

5.1. RDF storage space results

Table 6 shows the required on-disk space for the raw data of the corpus, the GNU diff of such data, and the space required by the Jena and HDT²⁴ archiving systems under the different implemented policies. We also include the space requirements of the existing v-RDFCSA, R43ples and TailR systems, whose archiving policies (TB, ${HB}^{TB / CB}$ and ${HB}^{IC / CB}$ respectively) are predefined and inherent to each system.

²⁴
We include the space overheads of the provided HDT indexes to solve all lookups.

Several comments are in order. As expected, the diff data take much less space than the raw gzipped data, and the space savings are highly affected by the dynamicity of the data. For example, both BEAR-A and BEAR-C are highly dynamic ( $\overline{δ} = 31\%$ and $\overline{δ} = 67\%$ , respectively) and the diff data save 40% and 15% of the space respectively (i.e. the more changes, the less space savings in the diff). In contrast, changes between versions are more limited in the aggregation of days, hours and instants in BEAR-B (with $\overline{δ} < 2\%$ in all cases), hence the diff data only take 3%, 2% and 1% of the original size respectively.
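The quoted percentages follow from the rawdata and diffdata columns of Table 6. As a small worked check in Python (sizes are copied from the table in the unit given per row; the variable names are illustrative only):

```python
# (raw gzip size, diff gzip size) per dataset, in the same unit within each pair.
sizes = {
    "BEAR-A": (23, 14),            # GB
    "BEAR-C": (243, 205),          # MB
    "BEAR-B-day": (37, 1),         # MB
    "BEAR-B-hour": (475, 10),      # MB
    "BEAR-B-instant": (12, 0.16),  # GB
}

# Share of the raw size taken by the diffs, and the corresponding saving.
diff_share = {name: diff / raw for name, (raw, diff) in sizes.items()}
saving = {name: 1 - share for name, share in diff_share.items()}

for name in sizes:
    print(f"{name}: diff is {diff_share[name]:.1%} of raw, saving {saving[name]:.1%}")
```

The highly dynamic BEAR-A and BEAR-C keep most of their raw size in the diffs, while the BEAR-B granularities shrink to a few percent.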

A comparison of these figures against the size of the different systems and policies allows for describing their inherent overheads. First, Jena-CB and HDT-CB highly reduce the space needs of their IC counterparts, following the same tendency as the diff, i.e., CB policies achieve better space results in less dynamic (i.e. small δ) datasets. For instance, in BEAR-B-day, Jena-CB only takes 10% of the space of IC, and only 5% in BEAR-B-instant. The only exception is BEAR-C, where data are so dynamic that the additional index overhead in CB (for adds and deletes) produces slightly bigger sizes than IC, both in Jena and HDT. Interestingly, R43ples shows a similar behaviour given its ${HB}^{TB / CB}$ policy, which only stores changing triples in add and delete named graphs. Thus, R43ples effectively manages datasets with low dynamicity (i.e. small δ) such as BEAR-B, but is highly penalized in others such as BEAR-C. In fact, R43ples was unable to load the bigger BEAR-A dataset and cannot be included in the analysis.

Similar remarks apply to TailR: for those datasets it could index, TailR shows very competitive space performance in datasets with low dynamicity, given that it makes use of a ${HB}^{IC / CB}$ policy. In fact, it achieves better results than HDT in BEAR-B-hour and BEAR-B-instant. Also, note that TailR groups all the triples of a given subject (following a Linked Data philosophy), hence it particularly excels in BEAR-B, with few different subjects. In contrast, TailR reports poor performance in BEAR-C, where the dataset is more dynamic.

In turn, the IC policy is highly affected by the number of versions. In BEAR-A and BEAR-C, both comprising a reasonable number of versions (fewer than 60), the IC policy indexing in Jena requires roughly ten times more space than the raw data, mainly due to the data decompression and the built-in Jena indexes. In turn, the compact HDT indexes in the IC policy just double the size of the gzipped raw data, serving the required retrieval operations in such compressed space. In contrast, in BEAR-B Jena-IC and HDT-IC are both penalized by the increasing number of versions. For instance, in BEAR-B-instant, Jena-IC takes 13 times the space of the raw gzipped data, while HDT requires 5 times such space. It is worth noting that both Jena and HDT have structures with a minimum fixed size, so the IC strategy has a fixed minimum increase regardless of how small each version is, such as in BEAR-B-instant.

Finally, the TB policy in Jena and v-RDFCSA reports overall good space figures, as it stores each triple once (i.e. the final size depends on the version-oblivious triples $O_{A}$ ), using the named graph to denote the versions as previously explained (see example in Listing 4). In fact, v-RDFCSA reports the best space results in all datasets (except for a small difference in BEAR-B-hour). However, note that TB approaches introduce overheads with an increasing number of versions, as the fourth named-graph component must also be indexed to speed up queries. In fact, Jena-TB shows poor performance in BEAR-B-hour, with 1,299 versions, and both Jena-TB and v-RDFCSA even failed to load BEAR-B-instant, with 21,046 versions. Thus, the notation for graphs and versions in TB can present scalability challenges at larger numbers of versions (even if each of them is of a limited size). This limitation is partially overcome by the hybrid ${HB}^{TB / CB}$ policies, such as the one implemented in R43ples.
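The observation that a TB archive stores each triple once, so that its size follows $O_{A}$ , can be illustrated with a toy in-memory structure. This Python sketch is not how Jena-TB or v-RDFCSA are implemented; the class and method names are hypothetical, and a plain dictionary from triple to version labels stands in for the named-graph annotation. The example data follow the two triples described for Listing 4.

```python
from collections import defaultdict


class TimestampStore:
    """Toy TB archive: every distinct triple is kept once with its version labels."""

    def __init__(self):
        self.versions_of = defaultdict(set)

    def add_version(self, version, triples):
        for triple in triples:
            self.versions_of[triple].add(version)

    def materialize(self, version):
        """Return the triples annotated with the given version."""
        return {t for t, labels in self.versions_of.items() if version in labels}

    def stored_triples(self):
        """Number of distinct (version-oblivious) triples held by the store."""
        return len(self.versions_of)


name = (":Jon", "foaf:name", '"Doe"')
email = (":Jon", "foaf:email", '"j@example.org"')

store = TimestampStore()
store.add_version(1, {name, email})
store.add_version(2, {name})
store.add_version(3, {email})
```

Three versions holding four triple occurrences in total result in only two stored triples, while the per-triple version labels are the part that grows with the number of versions.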

Table 7
Space of Jena and HDT Hybrid-Based approaches (HB)

Dataset	Jena TDB IC	Jena TDB CB	Jena TDB ${HB}_{S}^{IC / CB}$	Jena TDB ${HB}_{M}^{IC / CB}$	Jena TDB ${HB}_{L}^{IC / CB}$	Jena TDB ${HB}^{TB / CB}$	HDT IC	HDT CB	HDT ${HB}_{S}^{IC / CB}$	HDT ${HB}_{M}^{IC / CB}$	HDT ${HB}_{L}^{IC / CB}$
BEAR-A	230 GB	138 GB	163 GB	152 GB	143 GB	353 GB	48 GB	28 GB	34 GB	31 GB	29 GB
BEAR-B-instant	158 GB	7.4 GB	9.7 GB	7.7 GB	7.4 GB	0.10 GB	63 GB	0.33 GB	1.4 GB	0.46 GB	0.36 GB
BEAR-B-hour	6238 MB	479 MB	662 MB	563 MB	529 MB	54 MB	2229 MB	35 MB	103 MB	69 MB	52 MB
BEAR-B-day	421 MB	44 MB	137 MB	90 MB	65 MB	23 MB	149 MB	7 MB	43 MB	25 MB	15 MB
BEAR-C	2151 MB	2271 MB	2356 MB	2286 MB	2310 MB	3735 MB	421 MB	439 MB	458 MB	444 MB	448 MB

Table 7 shows the space for the selected hybrid approaches in HDT and Jena, i.e., ${HB}^{IC / CB}$ in HDT and Jena, and ${HB}^{TB / CB}$ in Jena. For the first one, we evaluated three different archives, with a small (S), medium (M) and large (L) gap between ICs: ${HB}_{S}^{IC / CB}$ , ${HB}_{M}^{IC / CB}$ and ${HB}_{L}^{IC / CB}$ stand for a policy in which an IC version is stored after 4, 8 and 16 CB versions respectively. For BEAR-B-hour, we use gaps of 32, 64 and 128 versions, and for BEAR-B-instant we use 64, 512 and 2048 versions respectively.
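The gap-based layout can be sketched in a few lines of Python. This is only an illustration under the assumption that version 0 is stored as an IC and that each IC is followed by exactly `gap` CB versions; the function names are hypothetical.

```python
def is_snapshot(version, gap):
    """True if `version` is stored as a full copy (IC), assuming version 0 is
    an IC and every IC is followed by `gap` delta (CB) versions."""
    return version % (gap + 1) == 0


def deltas_to_apply(version, gap):
    """Number of deltas to replay on top of the closest preceding IC."""
    return version % (gap + 1)


# Layout of the first 11 versions with the small gap (4 CB versions per IC).
layout = ["IC" if is_snapshot(v, 4) else "CB" for v in range(11)]
print(layout)
print(deltas_to_apply(9, 4))
```

Under these assumptions a larger gap stores fewer full copies, but it also raises the worst-case number of deltas that must be replayed to reach a version.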

Results firstly show that Jena-${HB}^{TB / CB}$ confirms the aforementioned remarks for R43ples, as it also uses named graphs to store the delta in each version, hence it effectively manages datasets with low dynamicity such as BEAR-B. In this particular case, it outperforms all Jena approaches, and it even improves on HDT in BEAR-B-instant (due to the aforementioned fixed minimum size per index in HDT). In contrast, ${HB}^{TB / CB}$ shows the worst results in highly dynamic datasets such as BEAR-A and BEAR-C. Note that, although the CB and TB policies manage the same delta sets, TB uses a single Jena instance and stores a named graph for the triples, so additional “context” indexes are required.

Finally, the ${HB}^{IC / CB}$ policies behave as expected: the shorter the gap (e.g. in ${HB}_{S}^{IC / CB}$ ), the more IC copies are present, and thus the closer the size is to that of the pure IC strategy; the larger the gap (e.g. in ${HB}_{L}^{IC / CB}$ ), the more CB copies are present, and the closer the final size is to that of the pure CB approach.

These initial results confirm current RDF archiving scalability problems at large scale, where specific RDF compression techniques such as HDT and RDFCSA emerge as an ideal solution [13]. For example, Jena-IC requires overall almost 4 times the size of HDT-IC, whereas Jena-CB takes more than 10 times the space required by HDT-CB. Results also point to the influence of the number of versions and the dynamicity of the dataset, considered in our δ metrics, in the selection of the proper strategy (as well as an input for hybrid approaches in order to decide when and how to materialize a version).
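The two ratios quoted above can be checked against Table 7 with a short calculation. The sketch below takes the unweighted mean of the per-dataset ratios; this averaging is our assumption about how "overall" may be read, not a statement of how the figures were originally derived.

```python
# Sizes copied from Table 7; each row uses the unit printed there (GB or MB),
# which cancels out because only within-row ratios are computed.
sizes = {
    "BEAR-A":         {"jena_ic": 230,  "hdt_ic": 48,   "jena_cb": 138,  "hdt_cb": 28},
    "BEAR-B-instant": {"jena_ic": 158,  "hdt_ic": 63,   "jena_cb": 7.4,  "hdt_cb": 0.33},
    "BEAR-B-hour":    {"jena_ic": 6238, "hdt_ic": 2229, "jena_cb": 479,  "hdt_cb": 35},
    "BEAR-B-day":     {"jena_ic": 421,  "hdt_ic": 149,  "jena_cb": 44,   "hdt_cb": 7},
    "BEAR-C":         {"jena_ic": 2151, "hdt_ic": 421,  "jena_cb": 2271, "hdt_cb": 439},
}


def mean_ratio(numerator, denominator):
    """Unweighted mean over datasets of numerator/denominator sizes."""
    ratios = [row[numerator] / row[denominator] for row in sizes.values()]
    return sum(ratios) / len(ratios)


ic_ratio = mean_ratio("jena_ic", "hdt_ic")
cb_ratio = mean_ratio("jena_cb", "hdt_cb")
print(f"Jena-IC / HDT-IC: {ic_ratio:.1f}x, Jena-CB / HDT-CB: {cb_ratio:.1f}x")
```

Under this reading the mean ratios come out at roughly 3.6 for IC and 10.5 for CB, in line with "almost 4 times" and "more than 10 times"; the per-dataset ratios vary considerably, especially for CB.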

Fig. 4.
Query times for $Mat$ queries.
5.2. Retrieval performance

From our foundations, we consider the five aforementioned query atoms: (i) version materialisation, (ii) delta materialisation, (iii) version queries, (iv) cross-version joins and (v) change materialisation. As stated, we focus on evaluating the well-described triple patterns in the selected BEAR-A queries (see Section 4.1.2) and the real-world patterns in BEAR-B queries (see Section 4.2.2). In both cases, each triple pattern acts as the target query Q in the version materialisation, delta materialisation, version queries and change materialisation. The evaluation of cross-version joins makes use of the joins defined for BEAR-B.

In general, our evaluation confirmed our assumptions about the characteristics of the policies (see Section 2), but also pointed out differences between the archiving systems. The IC, TB and TB/CB policies show a very constant behaviour in all our tests, while the retrieval times of the CB and IC/CB policies increase if more deltas have to be queried. Next, we present and discuss selected plots for each query operation. For the sake of clarity, we present below only the results for the subject lookup queries (with a high number of results) in BEAR-A, and the selected queries in BEAR-B-hour. Results for the rest of the queries show a very similar tendency, and are presented in Appendix A (for BEAR-A) and Appendix B (for BEAR-B).

Version materialisation. Fig. 4 reports, for each version, the average query time (in ms, with a logarithmic scale on the Y axis) over all queries in the selected query set. Figures 4 (a)–(c) show the results for subject lookup queries in BEAR-A, for v-RDFCSA, R43ples and the pure IC, CB and TB approaches in Jena and HDT (Fig. 4(a)), hybrid approaches in HDT (Fig. 4(b)) and hybrid approaches in Jena (Fig. 4(c)). Likewise, Figs 4 (d)–(f) focus on BEAR-B-hour. Note that the v-RDFCSA system currently only supports subject and object lookups [8] and TailR only resolves subject lookups. In order to provide additional evaluation for these systems, we extend the BEAR-B queries to measure subject lookups (presented below). In turn, R43ples and TailR are reported only for BEAR-B, as we failed to load BEAR-A into them. Also, it is worth mentioning that R43ples is accessed via a SPARQL interface with an extended syntax to support versioning (called revisions in R43ples). Although the interface is queried locally, the SPARQL protocol might introduce a small overhead which is not present in the other systems (Jena, HDT and v-RDFCSA), which provide a direct API to the versioned triple store.

Fig. 5.

BEAR-B $Q^{S}$ $Mat$ queries.

Fig. 6.

Query times for $Diff$ queries with increasing intervals.

First, we can observe from Fig. 4 (a) that v-RDFCSA, which implements a TB policy, outperforms any of the other systems, remaining close to HDT with an IC policy. In practice, v-RDFCSA makes use of fast, self-compressed indexes to represent both the triples and their annotated versions, hence it speeds up materialisation queries irrespective of the concrete retrieved version. Appendix A shows that HDT is slightly faster than v-RDFCSA in object lookups, although v-RDFCSA remains competitive in any case. In turn, as shown in Fig. 4 (d), R43ples is significantly slower than any of the other systems, in particular when initial versions are demanded, making its use impractical with a large number of versions. In fact, given its delays, we report only sampled versions $(0, 200, 400, \dots, 1200, 1298)$ in order to show the tendency of its performance. Note that R43ples materializes the latest version, and previous versions can be queried by applying the deltas in reverse order. Thus, R43ples performs faster at more recent versions, as it requires fewer delta materializations, while it quickly degrades at older versions. Nonetheless, R43ples allows for materializing some intermediate versions (at the cost of increasing the size of the archive), which can emulate a hybrid IC/CB system. Further inspection of this tradeoff is left to future work.

In turn, Figs 4 (a) and (d) show that the HDT archiving system generally outperforms Jena. In both systems, the IC policy provides the best and most constant retrieval time. In contrast, the CB policy shows a clear trend: query performance degrades as we query later versions, since more deltas have to be queried and their added and deleted triples processed. The degradation of the performance highly depends on the system and the type of query. For instance, HDT-CB seems to degrade faster in BEAR-A than Jena-CB but, conversely, the performance degradation is more pronounced for Jena-CB in BEAR-B-hour (Fig. 4 (d)), due to the large number of versions. In turn, the TB policy in Jena performs worse than IC, as TB has to query a single but potentially large dataset (and Jena indexes are not as optimized as in v-RDFCSA). This causes the remarkably poor performance of Jena-TB in BEAR-A (cf. Fig. 4 (a)), where the volume of data is high. In contrast, Jena-TB can outperform Jena-CB in BEAR-B when the number of versions is large.
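The growing cost of CB materialisation follows directly from how a version is rebuilt. The sketch below is a simplified illustration, assuming the archive keeps the first version in full plus one (added, deleted) pair per subsequent version; `materialise_cb` is a hypothetical helper, not the Jena-CB or HDT-CB code.

```python
def materialise_cb(base, deltas, version):
    """Rebuild `version` from the base snapshot V0 by replaying deltas.

    `deltas[i]` is a pair (added, deleted) that turns V_i into V_{i+1},
    so reaching `version` requires replaying the first `version` deltas.
    """
    triples = set(base)
    for added, deleted in deltas[:version]:
        triples -= deleted
        triples |= added
    return triples


base = {"a", "b"}
deltas = [({"c"}, {"a"}), ({"d"}, set())]
print(sorted(materialise_cb(base, deltas, 2)))
```

The loop length equals the requested version number, which matches the trend described above: the later the version, the more deltas are touched.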

Then, Figs 4 (b) and (e) show the behaviour of hybrid IC/CB approaches in HDT for the selected BEAR-A and BEAR-B datasets respectively. As expected, the performance degrades as soon as the archive has to query the CB copies, while it drops to the IC time when the fully materialized version is available.

Figures 4 (c) and (f) present the hybrid IC/CB approaches in Jena, which behave similarly to what is discussed above, and compare them with the hybrid TB/CB approach. For the latter, it is interesting to note that it first retrieves all results matching the query, ordered by named graphs (adds and deletes per version [17,44]), and then processes the graphs in order to apply the changes. Nonetheless, the cost of this second step is negligible with respect to the first operation (in particular in large datasets), hence the time is almost stable as the number of versions increases. As such, the performance is always worse than IC, but it can greatly improve on CB in the presence of a large number of versions, such as in BEAR-B (cf. Fig. 4 (f)).
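The two-step TB/CB evaluation described above (one retrieval over all delta graphs, then an in-order replay) can be sketched as follows. The row format `(version, kind, triple)` and the function name are assumptions made for this illustration only.

```python
def materialise_from_delta_graphs(matches, version):
    """Replay query matches found in the added/deleted graphs.

    `matches` holds (version, kind, triple) rows for one query, where kind is
    "add" or "del". Rows are processed in version order up to `version`.
    """
    result = set()
    for v, kind, triple in sorted(matches, key=lambda row: row[0]):
        if v > version:
            break  # later deltas do not affect the requested version
        if kind == "add":
            result.add(triple)
        else:
            result.discard(triple)
    return result


matches = [(2, "add", "t3"), (0, "add", "t1"), (1, "del", "t1"), (0, "add", "t2")]
print(sorted(materialise_from_delta_graphs(matches, 2)))
```

Only the matching triples are replayed, not whole versions, which is one way to see why the second step can be cheap compared with the retrieval itself.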

Finally, in order to test the performance of TailR (and v-RDFCSA), which offer limited subject lookup retrieval, we extend the real-world BEAR-B queries and consider lookups of its 100 different subjects, named $Q^{S}$ . Figure 5 (a) reports the average time per $Mat$ query (for such subjects) in BEAR-B-day. Figure 5 (b) shows the same value for BEAR-B-hour, but taking some representative versions in order to show the tendency of the performance. Results show that TailR is competitive with the HDT and Jena archiving systems, especially in BEAR-B with a large number of versions and low dynamicity (δ). Although it can be one order of magnitude slower than an HDT approach, the query time of TailR remains below 100 ms in any case. Results of $Mat$ queries in both datasets also show that v-RDFCSA is again the fastest approach, also scaling to the large number of versions managed in BEAR-B-hour.

Delta materialisation queries. We performed diffs between the initial version and increasing intervals of 5 versions, i.e., $diff (Q, V_{0}, V_{i})$ for $i$ in $\{5, 10, 15, \dots, n\}$ . Figure 6 shows again the plots for selected query sets in BEAR-A (a–c) and BEAR-B (d–f), while additional results can be found in Appendixes A and B.

As expected, the TB policy in v-RDFCSA and Jena behaves similarly to the $Mat$ case, given that TB always inspects the full store. Thus, v-RDFCSA is again the fastest system for its currently supported queries, i.e., restricted to subject and object lookups. In turn, R43ples is again the slowest approach in BEAR-B, as shown in Fig. 6 (d), even though it stores the deltas in named graphs, given that the system first materializes the versions and then performs the diff, similarly to the AnQL syntax for $diff (Q, v_{i}, v_{j})$ in Section 3.3. Jena and HDT report the expected constant retrieval performance of the IC policy (cf. Fig. 6 (a) and (d)), which always needs to query only two versions to compute the delta in memory. In contrast, the query time for the CB policy increases with the interval between the versions, given that more deltas have to be inspected. Thus, the CB policy is consistently slower than IC as the version interval increases. Interestingly, HDT outperforms Jena under the same policy (IC or CB), i.e. HDT implements the policy more efficiently. However, an IC policy in Jena can be faster than a CB policy in HDT. For instance, Jena-IC outperforms HDT-CB after the 5th version in BEAR-A (Fig. 6 (a)).
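The constant cost of delta materialisation under IC can be illustrated with a small sketch: whatever the distance between the two versions, only two query evaluations and two set differences are needed. The function `diff` below is a hypothetical in-memory illustration in which a query is modelled as a predicate over triples.

```python
def diff(query, v_i, v_j):
    """Return (added, deleted) between the results of `query` on two versions.

    Each version is a set of triples stored in full (IC), so the cost does not
    depend on how many versions lie between v_i and v_j.
    """
    res_i = {t for t in v_i if query(t)}
    res_j = {t for t in v_j if query(t)}
    return res_j - res_i, res_i - res_j


v0 = {("s1", "p", "o1"), ("s1", "p", "o2"), ("s2", "p", "o3")}
v5 = {("s1", "p", "o2"), ("s1", "p", "o4"), ("s2", "p", "o9")}
added, deleted = diff(lambda t: t[0] == "s1", v0, v5)  # subject lookup on s1
print(added, deleted)
```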

Table 8

Average query time (in ms) for $ver (Q)$ queries

Query set	Jena TDB							HDT					v-RDFCSA	R43ples

	IC	CB	TB	HB				IC	CB	HB			TB	${HB}^{TB / CB}$

				${HB}_{S}^{IC / CB}$	${HB}_{M}^{IC / CB}$	${HB}_{L}^{IC / CB}$	${HB}^{TB / CB}$			${HB}_{S}^{IC / CB}$	${HB}_{M}^{IC / CB}$	${HB}_{L}^{IC / CB}$
BEAR-A $Q_{H}^{S}$	101	72	56693	76	75	89	44	4.98	7.98	10.94	13.59	18.32	0.49	NA
BEAR-B-hour	1189	120	6473	147	138	132	24	111.61	2.49	18.60	17.26	20.45	NA	$> 21600000$

Last, the hybrid approaches reported in Figs 6 (b)–(f) show a behaviour similar to that of the $Mat$ queries. The only consideration is that the performance of IC/CB highly depends on the particular two versions in the diff, which report the expected IC or CB times depending on which of the versions is already materialized (IC).

Version queries. Table 8 reports the average query time over each $ver (Q)$ query. Similarly to the previous operations, we summarize our findings by presenting the results for subject lookup queries (with high number of results) in BEAR-A, and queries in BEAR-B-hour, while Appendixes A and B show all results.

As can be seen, v-RDFCSA is again the fastest approach in BEAR-A (1–2 orders of magnitude faster), while the HDT archiving system clearly outperforms Jena in all scenarios, taking advantage of its efficient indexing. Nonetheless, the policies play an important role: as opposed to the previous $Mat$ and $Diff$ operations, Jena-CB outperforms Jena-IC in these version queries, which is even more noticeable in BEAR-B with a large number of versions. The explanation of such behaviour is that, in version queries, all versions have to be queried, hence the query of a version $V_{i}$ in CB can leverage the already materialized result for the previous version $V_{i - 1}$ (note that, in contrast, the IC approach has to perform two queries over the full $V_{i - 1}$ and $V_{i}$ versions). For the same reason, HDT-CB outperforms HDT-IC in BEAR-B. The only exception is BEAR-A, where the efficiency of the HDT indexes (in a case with few but very large versions) still predominates over the aforementioned gain in CB. In turn, all the ${HB}^{IC / CB}$ policies follow the same behaviour, with a compromise between the CB benefit and the number of IC versions.
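The advantage of CB in version queries can be made concrete with a minimal sketch of an incremental evaluation: the result for $V_{i}$ is obtained from the result for $V_{i - 1}$ by touching only the delta. The function `ver_cb` and its data layout (a full first version plus one (added, deleted) pair per later version) are assumptions made for illustration.

```python
def ver_cb(base, deltas, query):
    """Answer `query` on every version by updating the previous answer.

    `deltas[i]` is the (added, deleted) pair turning V_i into V_{i+1}; only
    the triples in each delta are inspected after the first version.
    """
    current = {t for t in base if query(t)}
    results = [set(current)]
    for added, deleted in deltas:
        current -= {t for t in deleted if query(t)}
        current |= {t for t in added if query(t)}
        results.append(set(current))
    return results


base = {("s1", "p", "o1"), ("s2", "p", "o2")}
deltas = [
    ({("s1", "p", "o3")}, {("s1", "p", "o1")}),
    ({("s2", "p", "o4")}, set()),
]
per_version = ver_cb(base, deltas, lambda t: t[0] == "s1")
print(per_version)
```

When deltas are small relative to the versions, each step after the first inspects far fewer triples than a fresh query over a full copy would.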

Note that R43ples shows poor performance in $ver (Q)$ queries, as the current system forces the user to specify a revision via a $REVISION (i)$ keyword (i.e. performing the query at the given version i), hence all versions (called revisions) have to be materialized at query time. Thus, queries in BEAR-B-hour were stopped after a timeout of 6 hours. As shown in Appendix B, R43ples managed to complete these queries in BEAR-B-day (with a smaller number of versions), taking an average of 20 minutes, which is in any case inefficient compared to any other approach.

Finally, it is worth mentioning that the Jena-${HB}^{TB / CB}$ approach emerges as the fastest Jena approach for version queries, as it only requires a single query over the full store, whose results are then split by version. In contrast, the pure Jena-TB approach is seriously compromised by the fact that it needs to query and iterate through all the occurrences of the results in all graphs (which are potentially numerous given the Jena indexes).
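The "query once, then split by version" idea can be sketched over a store in which every triple carries the set of versions where it holds. This is a simplified illustration with hypothetical names; the inner loop over versions loosely corresponds to the per-occurrence iteration that becomes expensive when a result appears in many graphs.

```python
from collections import defaultdict


def ver(annotated_triples, query):
    """Group the matches of `query` by version in a single pass.

    `annotated_triples` maps each triple to the set of versions where it holds.
    """
    by_version = defaultdict(set)
    for triple, versions in annotated_triples.items():
        if query(triple):
            for v in versions:  # one step per (triple, version) occurrence
                by_version[v].add(triple)
    return dict(by_version)


store = {
    ("s1", "p", "o1"): {0, 1},
    ("s1", "p", "o2"): {1},
    ("s2", "p", "o3"): {0},
}
grouped = ver(store, lambda t: t[0] == "s1")
print(grouped)
```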

Cross-version join queries. We make use of the joins defined in BEAR-B to test the performance of the systems that support cross-version joins, namely HDT and Jena under different archiving policies, and R43ples. In order to construct the cross-version join, we split the joins into two triple patterns, $tp_{1}$ and $tp_{2}$ , matching the first one in the initial version and the second one at increasing intervals of 5 versions, i.e., $join (tp_{1}, V_{0}, tp_{2}, V_{i})$ for $i$ in $\{5, 10, 15, \dots, n - 5, n\}$ . Listing 5 depicts an example of such a join using the AnQL notation.
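Independently of the AnQL notation, the semantics of such a cross-version join can be illustrated with a small Python sketch. It is a naive nested-loop join over in-memory versions, with variables written as strings starting with `?`; the helper names and the sample data are hypothetical and are not taken from BEAR-B.

```python
def match(pattern, triple):
    """Return variable bindings if `triple` matches `pattern`, else None."""
    bindings = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A repeated variable must bind to the same value everywhere.
            if bindings.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return bindings


def cross_version_join(tp1, version_a, tp2, version_b):
    """Join matches of tp1 in version_a with matches of tp2 in version_b
    on their shared variables (nested-loop join, for illustration only)."""
    results = []
    for t1 in version_a:
        b1 = match(tp1, t1)
        if b1 is None:
            continue
        for t2 in version_b:
            b2 = match(tp2, t2)
            if b2 is not None and all(b1[k] == b2[k] for k in b1.keys() & b2.keys()):
                results.append({**b1, **b2})
    return results


v0 = {("ex:film1", "ex:director", "ex:alice")}
v5 = {("ex:alice", "ex:name", "Alice"), ("ex:bob", "ex:name", "Bob")}
rows = cross_version_join(("?film", "ex:director", "?d"), v0,
                          ("?d", "ex:name", "?n"), v5)
print(rows)
```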

Fig. 7.

Query times for join queries with increasing intervals.

Listing 5.

Example of a cross-version join query in BEAR-B

Figure 7 (a) shows the plots for the selected joins in BEAR-B-hour for all supported systems, whereas Figs 7 (b) and (d) report the HDT and Jena hybrid approaches, respectively. The figures show a similar tendency to the $Mat$ queries, where HDT remains the fastest approach, building on top of fast triple pattern resolution. In turn, R43ples and the CB approaches pay the price of materializing the deltas. As expected, given the reverse delta approach, R43ples improves with more recent versions. Finally, in order to test R43ples in the most favourable condition, we perform an additional test in BEAR-B-hour, where one triple pattern is fixed to the latest version. Thus, we measure $join (tp_{1}, V_{n}, tp_{2}, V_{i})$ for $i$ in $\{0, 200, 400, \dots, 1200, 1298\}$ , shown in Fig. 7 (c). Results point out that, even in a favourable case, R43ples is still penalized (in particular with older versions) and can only compete with a Jena-CB policy.

Table 9

Average query time (in ms) for $change (Q)$ queries

Query set	Jena TDB							HDT					R43ples

	IC	CB	TB	HB				IC	CB	HB			${HB}^{TB / CB}$

				${HB}_{S}^{IC / CB}$	${HB}_{M}^{IC / CB}$	${HB}_{L}^{IC / CB}$	${HB}^{TB / CB}$			${HB}_{S}^{IC / CB}$	${HB}_{M}^{IC / CB}$	${HB}_{L}^{IC / CB}$
BEAR-A $Q_{H}^{S}$	151	143	63543	212	307	730	49	24.15	6.93	41.33	29.66	22.80	NA
BEAR-B-hour	1690	196	12295	4569	7182	13546	88	1876.81	3.73	127.32	104.61	95.29	487

Change materialisation queries. Finally, we evaluate the performance of $change (Q)$ queries in all systems except for v-RDFCSA and TailR, which do not support this type of query. As we explained, $change (Q)$ queries can be implemented using $diff (Q)$ queries for each version, but we decide here to look at specific, tailored optimizations for $change (Q)$ queries. Thus, in R43ples and in Jena using a hybrid TB/CB approach, we translate these queries to make intensive use of the added and deleted named graphs, hence change queries are sped up by avoiding materialization. Listing 6 shows an example of a query in R43ples that efficiently inspects changes in film directors. To do so, we resolve the given triple pattern in all added and deleted graphs (i.e. deltagraph in the example), whose metadata is stored in a particular revision graph (http://example.org/r43ples-revisions).

Listing 6.

Example of a change query in R43ples

In turn, in all HDT and Jena cases, we optimize the resolution by marking a change between two versions as soon as we find the first different result between two versions, hence we avoid a full inspection of the $Δ^{+}$ and $Δ^{-}$ sets. In HDT, the resolution can be improved as soon as we find discrepancies in the RDF vocabulary (see Definition 9 in our metrics).
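The early-exit strategy can be sketched as follows. This is a simplified in-memory illustration with hypothetical function names; it assumes each version is available as a set of triples and reports the versions whose query results differ from those of their predecessor.

```python
def changed(query, v_i, v_j):
    """True as soon as one result of `query` differs between two versions."""
    res_j = {t for t in v_j if query(t)}
    seen = 0
    for t in v_i:
        if query(t):
            if t not in res_j:
                return True  # first discrepancy found: stop immediately
            seen += 1
    # Every result of v_i is also in v_j; a change remains only if v_j has more.
    return seen != len(res_j)


def change(query, versions):
    """Return the indexes i such that the results differ between V_{i-1} and V_i."""
    return [i + 1 for i in range(len(versions) - 1)
            if changed(query, versions[i], versions[i + 1])]


versions = [
    {("s1", "p", "o1"), ("s2", "p", "o2")},
    {("s1", "p", "o1"), ("s2", "p", "o9")},   # only s2 changes
    {("s1", "p", "o5"), ("s2", "p", "o9")},   # s1 changes
    {("s1", "p", "o5"), ("s1", "p", "o6"), ("s2", "p", "o9")},  # s1 gains a triple
]
print(change(lambda t: t[0] == "s1", versions))
```

The function returns a Boolean per version pair rather than the full $Δ^{+}$ and $Δ^{-}$ sets, which is what allows it to stop at the first difference.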

Table 9 reports the average query time over each $change (Q)$ query, for subject lookup queries (with a high number of results) in BEAR-A, and queries in BEAR-B-hour. Appendixes A and B show all results. As in $ver (Q)$ queries, results show that the CB approach generally outperforms IC, given that changes can be quickly detected in CB by inspecting the added and deleted sets. Interestingly, the hybrid IC/CB approaches, in particular in Jena, pay the price of materializing some versions. In particular, when inspecting changes between versions $V_{i}$ and $V_{i + 1}$ , if version $V_{i}$ is stored as a delta and $V_{i + 1}$ is a fully materialized version, then $V_{i}$ has to be fully materialized in order to inspect the differences. In turn, the Jena-TB approach is again compromised, as it needs to iterate through all the occurrences. Finally, it is worth mentioning that the aforementioned TB/CB optimizations improve the performance significantly: Jena-${HB}^{TB / CB}$ outperforms all other strategies in Jena, and R43ples is much more competitive than in other query atoms such as version materialisation.

6. Conclusions and future work

RDF archiving is still in an early stage of research. Novel solutions have to face the additional challenge of comparing their performance against other archiving policies or storage schemes, as there is no standard way of defining either a specific data corpus for RDF archiving or the relevant retrieval functionalities. To this end, we have provided foundations to guide future evaluation of RDF archives. First, we formalized dynamic notions of archives, which allow the data corpus to be described effectively. Then, we described the main retrieval facilities involved in RDF archiving, and provided guidelines on the selection of relevant and comparable queries. We provided a concrete instantiation of archiving queries using AnQL [48] and instantiated our foundations in a prototypical benchmark suite, BEAR, composed of three real-world and well-described data corpora and query testbeds. Finally, we implemented state-of-the-art archiving policies using independent copies (IC), change-based (CB), timestamp-based (TB) and hybrid (HB) approaches in two stores (Jena TDB and HDT). We used BEAR to evaluate our implementations as well as existing state-of-the-art archiving systems (v-RDFCSA, TailR, R43ples). Results clearly confirm the challenges (in terms of scalability) and strengths of current archiving approaches, and highlight the influence of the number of versions and the dynamicity of the dataset on the selection of the right strategy (as well as an input for hybrid approaches in order to decide when and how to materialize a version), guiding future developments. In particular, in terms of space, CB, TB and hybrid policies (such as TB/CB in R43ples and IC/CB in TailR) achieve better results than IC in less dynamic datasets, but they are penalized in highly dynamic datasets due to index overheads. In this case, the TB policy reports overall good space figures, but it can be penalized as the number of versions increases.
Regarding query resolution performance, the evaluated archiving policies excel at different operations but, in general, the IC, TB and TB/CB policies show a very constant behaviour, while the CB and IC/CB policies degrade if more deltas have to be queried. Results also show that specific functional RDF compression techniques such as HDT and RDFCSA emerge as promising solutions for RDF archiving in terms of space requirements and query performance.

We currently focus on exploiting the presented benchmark to build a customizable generator of evolving synthetic RDF data which can preserve user-defined characteristics while scaling to any dataset size and number of versions. We also work on extending the benchmark for multiple versioned graphs in a federated scenario.

Acknowledgements

Funded by Austrian Science Fund (FWF): M1720-G11, European Union’s Horizon 2020 research and innovation programme under grant 731601 (SPECIAL), MINECO-AEI/FEDER-UE ETOME-RDFD3: TIN2015-69951-R, by Austrian Research Promotion Agency (FFG): grant no. 849982 (ADEQUATe) and grant 861213 (CitySpin), and the German Government, Federal Ministry of Education and Research under the project number 03WKCJ4D. Javier D. Fernández was funded by WU post-doc research contracts, and Axel Polleres was supported by the “Distinguished Visiting Austrian Chair” program as a visiting professor hosted at The Europe Center and the Center for Biomedical Research (BMIR) at Stanford University. Special thanks to Sebastian Neumaier for his support with the Open Data Portal Watch.

BEAR-A performance results

This appendix comprises the performance results for all subject, predicate and object lookups (S??, ?P? and ??O respectively) in BEAR-A (see Section 4.1 for a description of the corpus), and the corresponding triple patterns (SP?), (S?O), (?PO) and (SPO). Figures 8 and 9 show the results for Mat queries with pure IC, CB and TB approaches in HDT and Jena. Figures 10 and 11 compare such results with hybrid IC/CB approaches in HDT, whereas Figs 12 and 13 perform the comparison with IC/CB and TB/CB approaches in Jena. Diff queries are presented in Figs 14–19. Finally, Tables 10 and 11 report the results for the Ver query, and Tables 12 and 13 show the Change queries.

BEAR-B queries

This appendix shows the performance results of BEAR-B (see Section 4.2 for a description of the corpus). We focus here on reporting BEAR-B-day and BEAR-B-hour results, as current systems were unable to efficiently query the 21,046 versions in BEAR-B-instant; a report can be found in the BEAR repository (https://aic.ai.wu.ac.at/qadlod/bear).

Figures 20–22 show the results for Mat queries with pure IC, CB and TB approaches, hybrid IC/CB approaches in HDT, and hybrid IC/CB and TB/CB approaches in Jena, respectively. Figures 23–25 present Diff queries, and Fig. 26 reports join performance. Last, Table 14 reports the results for the Ver query and Table 15 shows the results for Change queries.

BEAR-C queries

This appendix lists the 10 selected queries for BEAR-C (see Section 4.3 for a description of the corpus). First, Listing 7 shows an excerpt from the corpus (in RDF Turtle, https://www.w3.org/TR/turtle/). Then, the queries are described in Listings 8–17.

References

1. Aluç, Hartig, M.T. Özsu and Daudjee, Diversified stress testing of RDF data management systems, in: Proc. of ISWC, 2014, pp. 197–212.

2. Arenas, Gutierrez and Pérez, On the semantics of SPARQL, in: Semantic Web Information Management, 2009, pp. 281–307.

3. Bereta, Smeros and Koubarakis, Representation and querying of valid time of triples in linked geospatial data, in: Proc. of ESWC, 2013, pp. 259–274.

4. Bizer and Cyganiak, Rdf 1.1 trig, W3C recommendation, 110, 2014.

5. Bizer, Heath and Berners-Lee, Linked data – The story so far, Int. J. Semantic Web Inf. Syst. 5 (2009), 1–22.

6. Bizer and Schultz, The Berlin SPARQL benchmark, Int. J. Semantic Web Inf. Syst. 5(2) (2009), 1–24. doi:10.4018/jswis.2009040101.

7. Boncz, Fundulaki, Gubichev, Larriba-Pey and Neumann, The linked data benchmark council project, Datenbank-Spektrum 13(2) (2013), 121–129. doi:10.1007/s13222-013-0125-y.

8. Cerdeira-Pena, Farina, J.D. Fernández and M.A. Martínez-Prieto, Self-indexing rdf archives, in: Proc. of DCC, 2016.

9. Dell’Aglio, J.P. Calbimonte, Balduini, Corcho and Della Valle, On correctness in RDF stream processor benchmarking, in: Proc. of ISWC, 2013, pp. 326–342.

10. Dominguez-Sal, Martinez-Bazan, Muntes-Mulero, Baleta and J.L. Larriba-Pey, A discussion on the design of graph database benchmarks, in: Performance Evaluation, Measurement and Characterization of Complex Systems, Springer, 2010, pp. 25–40.

11. Dong-Hyuk, Sang-Won and Hyoung-Joo, A version management framework for RDF triple stores, Int. J. Softw. Eng. Know. 22(1) (2012), 85–106. doi:10.1142/S0218194012500040.

12. J.D. Fernández, M.A. Martínez-Prieto, Gutiérrez, Polleres and Arias, Binary RDF representation for publication and exchange (HDT), JWS 19 (2013), 22–41. doi:10.1016/j.websem.2013.01.002.

13. J.D. Fernández, Polleres and Umbrich, Towards efficient archiving of dynamic linked open data, in: Proc. of DIACHRON, 2015.

14. Fionda, M.W. Chekol and Pirrò, Gize: A time warp in the web of data, in: Proc. of ISWC, 2016.

15. Gao, Gu and Zaniolo, Rdf-tx: A fast, user-friendly system for querying the history of rdf knowledge bases, in: Proc. of EDBT, 2016.

16. Grandi, T-SPARQL: A TSQL2-like temporal query language for RDF, in: Proc. of ADBIS, 2010, pp. 21–30.

17. Graube, Hensel and Urbas, R43ples: Revisions for triples, in: Proc. of LDQ, CEUR-WS, Vol. 1215, 2014, paper 3.

18. Gray, Benchmark Handbook: For Database and Transaction Processing Systems, Morgan Kaufmann Publishers Inc., 1992.

19. Guo, Pan and Heflin, Lubm: A benchmark for owl knowledge base systems, Web Semantics: Science, Services and Agents on the World Wide Web 3(2) (2005), 158–182. doi:10.1016/j.websem.2005.06.005.

20. Gutierrez, C.A. Hurtado and Vaisman, Introducing time into RDF, IEEE T. Knowl. Data En. 19(2) (2007), 207–218. doi:10.1109/TKDE.2007.34.

21. Harris and Seaborne, SPARQL 1.1 query language. W3C Recom., 2013.

22. Hellmann, Stadler, Lehmann and Auer, DBpedia Live extraction, in: On the Move to Meaningful Internet Systems: OTM 2009, Vol. 5871, Springer, 2009, pp. 1209–1223. doi:10.1007/978-3-642-05151-7_33.

23. Käfer, Abdelrahman, Umbrich, O’Byrne and Hogan, Observing linked data dynamics, in: Proc. of ESWC, 2013, pp. 213–227.

24. Kaufmann, Kossmann, May and Tonder, Benchmarking databases with history support, Technical report, Eidgenössische Technische Hochschule Zürich, 2013.

25. Klein, Fensel, Kiryakov and Ognyanov, Ontology versioning and change detection on the web, in: Proc. of EKAW, 2002, pp. 197–212.

26. Knuth, Reddy, Dimou, Vahdati and Kastrinakis, Towards linked data update notifications reviewing and generalizing the SparqlPuSH approach, in: Proceedings of the Workshop on Negative or Inconclusive Results in Semantic Web (NoISE2015), Portoroz, Slovenia, June 2015.

27. Meimaris and Papastefanatos, The evogen benchmark suite for evolving rdf data, in: Proc. of MEPDaW, CEUR, Vol. 1585, 2016, pp. 20–35.

28. Meimaris, Papastefanatos, Viglas, Stavrakas and Pateritsas, A query language for multi-version data web archives, Technical report, Institute for the Management of Information Systems, Greece, 2015.

29. Meinhardt, Knuth and Sack, Tailr: A platform for preserving history on the web of data, in: Proc. of SEMANTiCS, ACM, 2015, pp. 57–64.

30. Montoya, M.E. Vidal, Corcho, Ruckhaus and Buil Aranda, Benchmarking federated SPARQL query engines: Are existing testbeds enough?, in: Proc. of ISWC, 2012, pp. 313–324.

31. Neumaier, Umbrich and Polleres, Automated quality assessment of metadata across open data portals, ACM Journal of Data and Information Quality (JDIQ) (2016, forthcoming).

32. Neumann and Weikum, x-RDF-3X: Fast querying, high update rates, and consistency for RDF databases, in: Proc. of VLDB Endowment, Vol. 3, 2010, pp. 256–263.

33. N.F. Noy and M.A. Musen, Ontology versioning in an ontology management framework, IEEE Intelligent Systems 19(4) (2004), 6–13. doi:10.1109/MIS.2004.33.

34. Papakonstantinou, Flouris, Fundulaki, Stefanidis and Roussakis, Versioning for linked data: Archiving systems and benchmarks, in: Proc. of BLINK, CEUR, Vol. 1700, 2016.

35. Perry, Jain and A.P. Sheth, SPARQL-ST: Extending SPARQL to support spatiotemporal queries, Geospatial Semantics and the Semantic Web 12 (2011), 61–86. doi:10.1007/978-1-4419-9446-2_3.

36. Saleem, M.I. Ali, Hogan, Mehmood and A.-C. Ngonga Ngomo, LSQ: The linked SPARQL queries dataset, in: The Semantic Web – ISWC 2015, Springer, 2015.

37. Saleem, Mehmood and A.-C. Ngonga Ngomo, FEASIBLE: A feature-based SPARQL benchmark generation framework, in: Proc. of ISWC, 2015, pp. 52–69.

38. Schmidt, Hornung, Lausen and Pinkel, SP2Bench: A SPARQL performance benchmark, in: Proc. of ICDE, 2009, pp. 222–233.

39. Stefanidis, Chrysakis and Flouris, On designing archiving policies for evolving RDF datasets on the web, in: Proc. of ER, 2014, pp. 43–56.

40. Tappolet and Bernstein, Applied temporal RDF: Efficient temporal querying of RDF data with SPARQL, in: Proc. of ESWC, 2009, pp. 308–322.

41. Tzitzikas, Theoharis and Andreou, On storage policies for semantic web repositories that support versioning, in: Proc. of ESWC, 2008, pp. 705–719.

42. Umbrich, Hausenblas, Hogan, Polleres and Decker, Towards dataset dynamics: Change frequency of linked open data sources, in: Proc. of LDOW, 2010.

43. Van de Sompel, Sanderson, M.L. Nelson, Balakireva, Shankar and Ainsworth, An HTTP-based versioning mechanism for linked data, in: Proc. of LDOW, 2010.

44. Vander Sande, Colpaert, Verborgh, Coppens, Mannens and Van de Walle, R&Wbase: Git for triples, in: Proc. of LDOW, 2013.

45. Verborgh, Hartig, De Meester, Haesendonck, De Vocht, Vander Sande, Cyganiak, Colpaert, Mannens and Van de Walle, Querying datasets on the Web with high availability, in: Proc. of ISWC, 2014, pp. 180–196.

46. Volkel, Winkler, Sure, S.R. Kruk and Synak, Semversion: A versioning system for RDF and ontologies, in: Proc. of ESWC, 2005.

47. Zeginis, Tzitzikas and Christophides, On computing deltas of RDF/s knowledge bases, ACM Transactions on the Web (TWEB) 5(3) (2011), 14.

48. Zimmermann, Lopes, Polleres and Straccia, A general framework for representing, reasoning and querying with annotated semantic web data, JWS 12 (2012), 72–95. doi:10.1016/j.websem.2011.08.006.

Evaluating query and storage strategies for RDF archives
