Abstract
The proliferation of large and ever-growing resource description framework (RDF) datasets has sparked a need for robust and performant RDF archiving systems. To tackle this challenge, several solutions have been proposed throughout the years, including archiving systems based on independent copies, time-based indexes, and change-based approaches. In recent years, modern solutions have combined several of the above-mentioned paradigms. In particular, aggregated changesets of time-annotated triples have showcased a noteworthy ability to handle and query relatively large RDF archives. However, such approaches still suffer from scalability issues, notably at ingestion time. This makes the use of these solutions prohibitive for large revision histories. Furthermore, applications of such systems often remain constrained by their limited querying abilities, where SPARQL is often left out in favor of single triple-pattern queries. In this article, we propose a hybrid storage approach based on aggregated changesets, snapshots, and multiple delta chains that additionally provides full SPARQL querying on RDF archives. This is done by interfacing our system with a modified SPARQL query engine. We evaluate our system with different snapshot creation strategies on the BEAR benchmark for RDF archives and showcase improvements of up to one order of magnitude in ingestion speed compared to state-of-the-art approaches, while keeping competitive querying performance. Furthermore, we demonstrate our SPARQL query processing capabilities on the BEAR-C variant of BEAR. This is, to the best of our knowledge, the first openly available endeavor that provides full SPARQL querying on RDF archives.
Introduction
The exponential growth of resource description framework (RDF) data and the emergence of large collaborative knowledge graphs have driven research in the field of efficient RDF archiving (Fernández et al., 2019; Pelgrin et al., 2021), the task of managing the change history of RDF graphs. This offers invaluable benefits to both data maintainers and consumers. For data maintainers, RDF archives serve as the foundation for version control (Arndt et al., 2019). This not only enables data mining tasks, such as identifying temporal and correction patterns (Pellissier Tanon et al., 2019), but in general opens the door to advanced data analytics of evolving graphs (Brunsmann, 2010; Gür et al., 2018; Hose, 2021; Polleres et al., 2023; Roussakis et al., 2015). For data consumers, RDF archives provide a valuable means to access historical data and delve into the evolution of specific knowledge domains (Aebeloe et al., 2021; Huet et al., 2013; Tanon & Suchanek, 2019). In essence, these archives offer a way to query past versions of RDF data, allowing for a deeper understanding of how knowledge has developed over time.
However, building and maintaining RDF archives presents substantial technical challenges, primarily due to the large size of contemporary knowledge graphs. For instance, DBpedia, as of April 2022, comprises 220 million entities and 1.45 billion triples.1 The number of changes between consecutive releases can reach millions (Pelgrin et al., 2021). Yet, dealing with large changesets is not the sole obstacle faced by state-of-the-art RDF archive systems. Efficient querying also remains an open challenge, since support for full SPARQL is rare among existing systems (Fernández et al., 2019; Pelgrin et al., 2021).
To address these challenges, we propose an approach for ingesting, storing, and querying long revision histories on large RDF archives. Our approach, which combines multiple snapshots and delta chains, has been previously detailed in our prior work (Pelgrin et al., 2023) and outperforms existing state-of-the-art systems in terms of ingestion time and query runtime for archive queries on single triple patterns. This article builds on top of this prior work and extends it with the following contributions:
- The design and implementation of a full SPARQL querying middleware on top of our multi-snapshot RDF archiving engine.
- A novel representation for the versioning metadata stored in our indexes, designed to improve ingestion time and disk usage without increasing query runtime.
- An evaluation of the two aforementioned contributions, in addition to an extended evaluation of our prior work with additional baselines.
In general, we evaluate the effectiveness of our enhanced approach using the BEAR benchmark (Fernández et al., 2019). Our results demonstrate remarkable improvements, namely up to several orders of magnitude faster ingestion times, reduced disk usage, and overall improved querying speed compared to existing baselines. Additionally, we showcase our new SPARQL querying capabilities on the BEAR-C variant of the BEAR benchmark. To the best of our knowledge, this is the first time that a system has completed this benchmark suite and published its results.
The remainder of this article is organized as follows: Section 2 explains the background concepts used throughout the paper. In Section 3, we discuss the state-of-the-art in RDF archiving, in particular existing approaches to store and query RDF archives. In Section 4, we detail our storage architecture, which builds upon multiple delta chains, and propose several strategies to handle the materialization of new delta chains. Section 5 details the algorithms employed for processing single triple patterns over our multiple-delta-chain architecture, while Section 6 describes our new versioning metadata serialization method and showcases how it improves ingestion times and disk usage. In Section 7, we explain our solution to support full SPARQL 1.1 archive queries on top of our storage architecture. Section 8 describes our extensive experimental evaluation. The article concludes with Section 9, which summarizes our contributions and discusses future research directions.
A resource description framework (RDF) graph archive
Notations Summary.
RDF = resource description framework.
In this section, we discuss the current state of RDF archiving in the literature. We first present how RDF archives are usually queried. We then discuss the existing storage paradigms for RDF archives and how they perform on the different query types. We conclude this section by detailing the inner workings of OSTRICH (Taelman et al., 2019), a prominent solution for managing RDF archives, which we use as a baseline for our proposed design.
Querying RDF Archives
In contrast to conventional RDF, the presence of multiple versions within an RDF archive requires the definition of novel query categories. Several categorizations for versioning queries over RDF archives have been proposed in the literature (Fernández et al., 2019; Papakonstantinou et al., 2017; Polleres et al., 2023). In this work, we build upon the proposal of Fernández et al. (2019) due to its greater adoption by the community. They identify five query types, which we explain in the following through a hypothetical RDF archive that stores information about countries and their diplomatic relationships:
Both CV and CD queries build upon the other types of queries, that is, VM, DM, and V queries. Therefore, full support for VM, DM, and V queries is the minimum requirement for applications relying on RDF archives. Papakonstantinou et al. (2017), on the other hand, propose a categorization into two main categories,
SPARQL is the W3C-recommended standard to query RDF data; however, adapting versioned query types to standard SPARQL remains one of the main challenges in RDF archiving. Indeed, current RDF archiving systems are often limited to queries on single triple patterns (Pelgrin et al., 2021; Taelman et al., 2019). This puts the burden of combining the results of single triple pattern queries onto the user, further raising the barrier for the adoption of RDF archiving systems. While support for standard SPARQL on RDF archives is nonexistent, multiple endeavors have proposed either novel query languages or temporal extensions for SPARQL. Fernández et al. (2019), for example, formulate their categories of versioning queries using the AnQL (Zimmermann et al., 2012) query language. AnQL is a SPARQL extension operating on quad patterns instead of triple patterns. The additional component can be mapped to any term
All in all, there is currently no widely accepted standard for the representation of versioned queries over RDF archives within the community. Instead, current proposals are often tailored to specific use cases and applications, and no standardization effort has been proposed.
In this work, we formulate complex queries on RDF archives as standard SPARQL queries, but we assume that revisions in the archive are modeled logically as RDF graphs named according to a particular convention (explained in Section 7). This design decision makes our solution suitable for any standard RDF/SPARQL engine with support for named graphs.
Several solutions have been proposed for storing the history of RDF graphs efficiently. We review the most prominent approaches in this section and refer the reader to Pelgrin et al. (2021) for a detailed survey. We distinguish three main design paradigms: independent copies (IC), change-based solutions (CB), and timestamp-based systems (TB).
Finally, recent approaches borrow inspiration from more than one paradigm. QuitStore (Arndt et al., 2019), for instance, stores the data in fragments, for which it implements a selective IC approach. This means that only modified fragments generate new copies, whereas the latest version is always materialized in main memory. OSTRICH (Taelman et al., 2019) proposes a customized CB approach based on aggregated changesets with version-annotated triples. This approach has shown great potential both in terms of scalability and query performance (Pelgrin et al., 2021; Taelman et al., 2019). For this reason, our solutions use this system as their underlying architecture. We describe OSTRICH in detail in the following section.
OSTRICH’s Architecture and Storage Paradigm
Change-based (CB) approaches work by storing the first revision of an archive as a fully materialized snapshot, and subsequent versions as deltas, that is, sets of added and deleted triples w.r.t. the previous version.
As such, OSTRICH (Taelman et al., 2019) proposes instead the use of aggregated changesets, where each delta is expressed w.r.t. the initial snapshot rather than the preceding revision.
OSTRICH stores the initial snapshot in an HDT file (Fernández et al., 2013), which provides a dictionary and a compressed, yet easy to read, serialization for the triples. As standard for RDF engines, the dictionary maps RDF terms to integer identifiers, which are used in all data structures for efficiency.
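For illustration, the following minimal sketch shows the dictionary-encoding idea described above; the class is purely illustrative and does not reflect HDT's actual compressed dictionary structures.

```python
# Illustrative sketch of dictionary encoding as used by RDF engines:
# RDF terms are mapped to integer IDs once, and all indexes operate on
# compact ID triples instead of full strings.
class Dictionary:
    def __init__(self):
        self.term_to_id, self.id_to_term = {}, []

    def encode(self, term: str) -> int:
        if term not in self.term_to_id:
            self.term_to_id[term] = len(self.id_to_term)
            self.id_to_term.append(term)
        return self.term_to_id[term]

    def decode(self, tid: int) -> str:
        return self.id_to_term[tid]
```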
In contrast to the initial snapshot, the delta chain is stored in two separate triple stores, one for additions and one for deletions. Each triple store consists of three differently ordered indexes (SPO, POS, and OSP), as illustrated in Figure 3. Those indexes are stored as clustered B+ trees (Taelman et al., 2019). Each triple is annotated with versioning metadata, which serves several purposes: first, it reduces data redundancy in the delta chain, allowing each triple to be stored only once; second, it can be used to accelerate query processing. As shown in Figure 3, this metadata differs between addition and deletion triples. All triples feature a collection of mappings from version to a local change flag, which indicates whether the triple reverts a previous change in the delta chain. For example, consider the quad

OSTRICH delta chain storage overview.
We notice that deleted triples are associated with an additional vector that stores the triple's relative position in the delta for every possible triple pattern order. This allows OSTRICH to optimize offset queries, and enables fast computation of deletion counts for any triple pattern and version. Since HDT files cannot be edited, the delta chain also has its own writable dictionary for the RDF terms that were added after the first snapshot. More details about OSTRICH's storage system can be found in the original paper (Taelman et al., 2019).
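To make this layout concrete, the sketch below models the per-triple entries of the addition and deletion stores; the class and field names are illustrative and do not mirror OSTRICH's actual code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Triple = Tuple[int, int, int]  # dictionary-encoded (subject, predicate, object)

@dataclass
class AdditionEntry:
    triple: Triple
    # version -> local change flag: True if this addition merely reverts a
    # deletion made earlier in the same delta chain.
    local_changes: Dict[int, bool] = field(default_factory=dict)

@dataclass
class DeletionEntry:
    triple: Triple
    local_changes: Dict[int, bool] = field(default_factory=dict)
    # version -> relative position of the triple within the delta, one entry
    # per triple pattern order; this supports offset queries and fast
    # deletion counts for any pattern and version.
    positions: Dict[int, List[int]] = field(default_factory=dict)
```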
Based on our running example from Figure 1, we illustrate the contents of the additions and deletions stores in Figure 4.

Contents of the additions and deletions stores based on the running example from Figure 1.

Delta chain architectures: (a) single delta chain, (b) single aggregated delta chain, and (c) multiple aggregated delta chains.
OSTRICH supports VM, DM, and V queries on single triple patterns natively. Aggregated changesets have been shown to speed up VM and DM queries significantly w.r.t. a standard CB approach. As shown by Pelgrin et al. (2021) and Taelman et al. (2019), OSTRICH is the only available solution that can handle histories for large RDF graphs, such as DBpedia. That said, scalability still remains a challenge for OSTRICH because aggregated changesets grow monotonically. This leads to prohibitive ingestion times for large histories (Pelgrin et al., 2021; Taelman et al., 2022)—even when the original changesets are small. In this article, we build upon OSTRICH and propose a solution to this problem.
Multiple Delta Chains
As discussed in Section 3.3, ingesting new revisions as aggregate changesets can quickly become prohibitive for long revision histories when the RDF archive is stored in a single delta chain (see Figure 2(b)). In such cases, we propose the creation of a fresh snapshot that becomes the new reference for subsequent deltas. Those new deltas will be smaller and thus easier to build and maintain. They will also constitute a new delta chain as depicted in Figure 2(c).
While creating a fresh snapshot with a new delta chain should presumably reduce ingestion time for subsequent revisions, its impact on query efficiency remains unclear. For instance, V queries will have to be evaluated on multiple delta chains, becoming more challenging to answer. In contrast, VM queries defined on revisions already materialized as snapshots should be executed much faster. Storage size and DM response time may be highly dependent on the actual evolution of the data. If a new version includes many deletions, fresh snapshots may be smaller than aggregated deltas. We highlight that in our proposed architecture, revisions stored as snapshots also exist as aggregated deltas w.r.t. the previous snapshot—as shown for revision 2 in Figure 2(c). Such a design decision allows us to speed up DM queries as explained later.
It follows from the previous discussion that introducing multiple snapshots and delta chains raises a natural question: when, during the ingestion of a new revision, should the system create a new snapshot?
Strategies for Snapshot Creation
A key aspect of our proposed design is to determine the right moment to place a snapshot, as this decision is subject to a trade-off among ingestion speed, storage size, and query performance. We formalize this decision via an abstract snapshot creation strategy, which we instantiate in several concrete ways (see the sketch after the table below).
Creation of Snapshots According to the Different Strategies on the Toy RDF Archive From Figure 1.
Here |S0| is the size of the initial snapshot, whereas the |Δi| denote the sizes of the aggregated deltas and k is the revision number of the latest snapshot according to the change-ratio strategy. Except for the ingestion time t, all the other values are computed from the example graph. An S denotes a snapshot, whereas a Δ denotes an aggregated changeset.
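The strategies referenced throughout this article (snapshots at a fixed periodicity, on the change ratio, and on the ingestion time ratio) can be pictured as predicates evaluated at ingestion time. The sketch below uses illustrative parameter names (period, gamma, theta) and is not the exact formalization of Section 4.2.

```python
# Illustrative snapshot-creation strategies: each predicate decides, while
# ingesting revision i, whether to start a new snapshot and delta chain.

def periodic(i: int, last_snapshot: int, period: int) -> bool:
    # Create a snapshot every `period` revisions.
    return i - last_snapshot >= period

def change_ratio(delta_size: int, snapshot_size: int, gamma: float) -> bool:
    # Create a snapshot once the aggregated delta grows past a fraction
    # gamma of the reference snapshot's size.
    return delta_size / snapshot_size >= gamma

def time_ratio(delta_ingest_time: float, snapshot_ingest_time: float,
               theta: float) -> bool:
    # Create a snapshot once ingesting a delta takes too long relative to
    # the time needed to ingest the reference snapshot.
    return delta_ingest_time / snapshot_ingest_time >= theta
```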
We implemented the proposed snapshot creation strategies and query algorithms for RDF archives on top of OSTRICH (Taelman et al., 2019). We briefly explain the most important aspects of our implementation.
Single Queries on Archives With Multiple Delta Chains
In the following, we detail our algorithms to compute version materialization (VM), delta materialization (DM), and version (V) queries on RDF archives with multiple delta chains. Our algorithms focus on answering single triple pattern queries, since these constitute the building blocks for answering arbitrary SPARQL queries—which we address in Section 7. All the routines described next are defined w.r.t. an implicit RDF archive
VM Queries
In a single delta chain with aggregated deltas and reference snapshot
Algorithm 1 provides a high-level description of the query algorithm used for version materialization (VM) queries. Our baseline, OSTRICH, uses a similar algorithm where
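Conceptually, a VM query at version v in a single delta chain combines the snapshot's matches with the aggregated changeset valid at v. The following is a minimal sketch under assumed interfaces (match methods on hypothetical snapshot/additions/deletions objects), not Algorithm 1's exact pseudocode.

```python
def vm_query(pattern, v, snapshot, additions, deletions):
    """Sketch of a VM query on one delta chain (cf. Algorithm 1)."""
    if snapshot.version == v:
        return set(snapshot.match(pattern))        # fast path: v is materialized
    # Otherwise: snapshot matches, plus aggregated additions at v,
    # minus aggregated deletions at v.
    result = set(snapshot.match(pattern))
    result |= set(additions.match(pattern, version=v))
    result -= set(deletions.match(pattern, version=v))
    return result
```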
DM Queries
Algorithm 2 describes the procedure
We now turn our attention to archives with multiple delta chains. The procedure
We now have the elements to explain the main DM query procedure (
If revisions
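As a reference for the semantics of the procedures above, the sketch below gives a naive but correct formulation of DM; Algorithms 2 and 3 compute the same result without materializing either version, by comparing aggregated changesets directly within a chain and, across chains, by composing two partial diffs through the intermediate snapshot revision (which also exists as an aggregated delta, see Figure 2(c)). The helper archive.chain_of(v) is assumed to return the (snapshot, additions, deletions) chain covering version v.

```python
def dm_query(pattern, i, j, archive):
    """Naive reference semantics for a DM query between versions i < j,
    reusing the vm_query sketch above."""
    before = vm_query(pattern, i, *archive.chain_of(i))
    after = vm_query(pattern, j, *archive.chain_of(j))
    return after - before, before - after   # (additions, deletions)
```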
V Queries
Algorithm 4 describes how V queries are executed in a single delta chain setup. This is akin to how our baseline, OSTRICH, processes queries, and is used by our multiple snapshot query algorithm. We assume that each triple is annotated with its version validity, that is, a vector of versions in which the triple exists. In OSTRICH, this is stored directly in the delta chain as versioning metadata, and, therefore, does not need additional computation. For the deletion triples, this metadata contains the list of versions where the triple is absent instead. The core of the algorithm iterates over the triples (line 6) that match triple pattern
Algorithm 5 describes the process of executing a V query
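A minimal sketch of the multi-chain V procedure follows, assuming each chain exposes a hypothetical match_with_versions iterator that yields matching triples together with their version-validity metadata; per-triple buffers from all chains are merged before emission, as in Algorithm 5.

```python
def v_query(pattern, archive):
    """Sketch of a V query over multiple delta chains (cf. Algorithms 4-5)."""
    merged = {}
    for chain in archive.chains:
        # Each chain contributes (triple, versions) pairs derived from its
        # snapshot and the version-validity metadata in its delta chain.
        for triple, versions in chain.match_with_versions(pattern):
            merged.setdefault(triple, set()).update(versions)
    return merged
```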
Optimization of Versioning Metadata Serialization
The versioning metadata stored in the delta chain indexes is paramount to the functioning of our solution and influences multiple aspects of the system's performance. One of the current limitations of our architecture based on aggregated deltas is that it does not scale well in terms of ingestion speed and disk usage as the number of versions grows. As we show in this section, this happens because the delta chain indexes still contain a lot of redundancy that could be removed with a proper compression scheme. We therefore discuss the limitations of the current serialization of the versioning metadata and propose an alternative serialization scheme that brings significant improvements in terms of ingestion speed and disk usage.
Versioning Metadata Encoding
As discussed in Section 3.3, OSTRICH indexes additions and deletions in separate triple stores for each delta chain. Because aggregated deltas introduce redundancy, OSTRICH annotates triples with additional version metadata that prevents the system from storing the same triple multiple times. This versioning metadata is, in turn, leveraged during querying to filter triples based on their version validity.
In Table 3, we show side by side the versioning metadata of an arbitrary triple as stored by OSTRICH (see Section 3.3) and as stored using our proposed representation. Although triples are stored only once, we observe that the index entries still store a lot of repeated information. Consider the example in Table 3(a), where the local change flag is always set to true for each version. Our proposed representation compresses this information by storing intervals of versions where the value does not change. In our example, in Table 3(b), the local change flag is stored as an interval
Representation of the Versioning Metadata in the Indexes for Arbitrary Example Triples in OSTRICH and Compressed in Our New Implementation.
LC denotes the local change flag.
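The interval idea can be sketched as follows; the function below works on half-open (start, end, value) runs and illustrates the compression principle only, not our on-disk byte layout.

```python
from typing import Dict, List, Tuple

def compress_flags(flags: Dict[int, bool]) -> List[Tuple[int, int, bool]]:
    """Compress per-version flags into runs of equal value,
    e.g. {1: True, 2: True, 3: True} -> [(1, 4, True)]."""
    intervals: List[Tuple[int, int, bool]] = []
    for version in sorted(flags):
        value = flags[version]
        if intervals and intervals[-1][1] == version and intervals[-1][2] == value:
            start, _, _ = intervals[-1]
            intervals[-1] = (start, version + 1, value)  # extend the current run
        else:
            intervals.append((version, version + 1, value))
    return intervals
```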
Deletion indexes contain more metadata than the addition indexes. Indeed, they also store the relative position of the triple within its respective delta for all triple pattern combinations, as illustrated in Table 3(c). This data can be large, especially for long delta chains, and can be costly both to create during ingestion and to deserialize during querying. OSTRICH alleviates these issues by restricting this metadata to the SPO index. We propose to replace this representation with a delta-compressed vector list, as illustrated in Table 3(d). The first position vector in the list is stored plainly, as before, but we only store deltas for subsequent changes. In cases where no changes occur in a given revision, as in version 4 of our example, the vector is empty. In the next section, we elaborate on the implementation details of this serialization scheme.
As depicted in Table 3(d), some of the entries in the position vectors of the deleted triples can be empty. We handle those empty entries by means of an 8-bit header mask that precedes the position vector. Consider the second column vector (version 3) in our example in Table 3(d). This vector contains two values:

Representation of the position vectors for version 3 of our example, without and with compression.
Furthermore, we highlight a key advantage of delta encoding: since most vector entries consist of small values, our representation can benefit from further compression via variable size integer encoding—which we also evaluate in our experimental section. The compression ultimately depends on how often the value of the positions changes between versions. Our experiments (Section 8.4) demonstrate the efficacy of our approach in practice.
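A sketch of both ideas combined is shown below: a header mask flags which pattern-order entries changed, followed by variable-size integer (varint) encodings of the deltas. The byte layout is illustrative, under assumed helper names, and not our exact format.

```python
from typing import List, Optional

def varint(n: int) -> bytes:
    """Variable-size integer encoding (zigzag, then base-128 with a
    continuation bit), so small deltas occupy a single byte."""
    n = (n << 1) ^ (n >> 63)            # zigzag: interleave signed values
    out = bytearray()
    while True:
        byte, n = n & 0x7F, n >> 7
        out.append(byte | (0x80 if n else 0))
        if not n:
            return bytes(out)

def encode_position_delta(prev: List[Optional[int]],
                          curr: List[Optional[int]]) -> bytes:
    """Delta-encode one position vector against its predecessor: an 8-bit
    header mask flags which entries are present and changed, followed by
    varint deltas for those entries only."""
    mask, payload = 0, bytearray()
    for i, (p, c) in enumerate(zip(prev, curr)):
        if c is None or c == p:
            continue                    # empty/unchanged entries cost nothing
        mask |= 1 << i
        payload += varint(c - (p or 0))
    return bytes([mask]) + bytes(payload)
```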
Section 5 describes the algorithms to answer versioned queries on single triple patterns on top of our multi-snapshot storage engine. In this section, we describe our solution to support SPARQL queries over RDF archives. We will first discuss how to formulate versioned queries using SPARQL. We then provide details of the proposed architecture and query engine.
SPARQL Versioned Queries
As discussed in Section 2, there have been a few efforts to write versioned queries comprising multiple triple patterns. All those endeavors rely on ad-hoc extensions to the SPARQL language, and none of these extensions has reached broad community acceptance. As a consequence, we have opted for a query middleware based on native SPARQL that models revisions as named graphs (Graube et al., 2014). Versioning requirements are, therefore, expressed using the SPARQL GRAPH keyword.
We illustrate the different queries with an example RDF archive
Example of SPARQL Representation and Results for VM, DM, and V Queries Using the GRAPH Keyword.
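For illustration, the hypothetical queries below show how the three query types can be phrased with the GRAPH keyword, assuming revisions are exposed under an illustrative http://example.org/version/i naming convention (the actual convention is configurable); DM queries rely on FILTER NOT EXISTS to express the difference between two revisions.

```python
# Illustrative versioned SPARQL queries over named-graph revisions.

VM_QUERY = """
SELECT ?s ?o WHERE {                     # VM: match only in revision 3
  GRAPH <http://example.org/version/3> { ?s <http://example.org/p> ?o }
}"""

DM_QUERY = """
SELECT ?s ?o WHERE {                     # DM: additions between 1 and 3
  GRAPH <http://example.org/version/3> { ?s <http://example.org/p> ?o }
  FILTER NOT EXISTS {                    # (deletions are the symmetric query)
    GRAPH <http://example.org/version/1> { ?s <http://example.org/p> ?o }
  }
}"""

V_QUERY = """
SELECT ?version ?s ?o WHERE {            # V: bind every revision that matches
  GRAPH ?version { ?s <http://example.org/p> ?o }
}"""
```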
In order to support full SPARQL query processing over RDF archives, we make use of the Comunica query engine (Taelman et al.).
Figure 6 illustrates the query processing pipeline of our solution for SPARQL queries on RDF archives. There are two main software components interacting with each other: the first is the storage layer consisting of our multi-snapshot version of OSTRICH (see Sections 4 and 5), and the second is the Comunica (Taelman et al.) query engine, which includes several modules.

SPARQL query processing pipeline.
A versioned SPARQL query, like the ones in Table 4 (Section 7.1), is first transformed back into a graph- and filter-free SPARQL query, that is, one without the special GRAPH URIs and/or FILTER clauses, annotated with a versioning context. This versioning context depends on the query type (e.g., revisions for VM queries and changesets for DM queries) and the target version(s) when relevant, and is used to select the type of triple pattern queries to send to OSTRICH for execution. The communication between Comunica and OSTRICH is done through a NodeJS native addon. Our implementation is open source2,3 and a demonstration system is available at Pelgrin et al. (2023).
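The rewriting step can be pictured with the sketch below; this regex-based version is a strong simplification of the actual Comunica module (which operates on the parsed query algebra), and the URI convention is the illustrative one used above. Applied to the hypothetical DM_QUERY from Section 7.1, it would yield {"type": "DM", "from": 1, "to": 3}.

```python
import re

def extract_version_context(query: str) -> dict:
    """Sketch: detect version-graph URIs in a versioned SPARQL query and
    emit a versioning context that selects which triple pattern query type
    (VM, DM, or V) to push down to the storage layer."""
    versions = [int(v) for v in
                re.findall(r'GRAPH\s+<[^>]*/version/(\d+)>', query)]
    if not versions:
        return {"type": "V"}                        # no pinned revision
    if len(set(versions)) == 1:
        return {"type": "VM", "version": versions[0]}
    return {"type": "DM", "from": min(versions), "to": max(versions)}
```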
To determine the effectiveness of our multi-snapshot approach for RDF archiving, we evaluate the four proposed snapshot creation strategies described in Section 4 along three dimensions: ingestion time (Section 8.2.1), disk usage (Section 8.2.2), and query runtime for VM, DM, and V queries (Section 8.3). Thereafter, in Section 8.4, we delve into the effectiveness of our versioning metadata representation described in Section 6. This is done by comparing its performance against the original representation—across the three aforementioned evaluation dimensions. Section 8.5 concludes our experiments with an evaluation of our full SPARQL query capabilities. The source code of our implementation as well as the experimental scripts to reproduce our results are available in a Zenodo archive.4
Experimental Setup
We resort to the BEAR benchmark for RDF archives (Fernández et al., 2019) for our evaluation. BEAR comes in three flavors: BEAR-A, BEAR-B, and BEAR-C, which comprise a representative selection of different RDF graphs and query loads. Table 5 summarizes the characteristics of the experimental datasets and query loads. We use OSTRICH (Taelman et al., 2019) as the baseline solution and excluded approaches such as QuitStore (Arndt et al., 2019) or R43ples (Graube et al., 2014), as Pelgrin et al. (2021) showed that OSTRICH is the only system that can handle archives and changesets with millions of triples. We also excluded the BEAR benchmark reference systems as they have been outperformed by OSTRICH (Taelman et al., 2019). That said, OSTRICH could only ingest one-third of BEAR-B instant's long history (7,063 out of 21,046 revisions) after one month of execution—before crashing. In a similar vein, OSTRICH took one month to ingest the first 18 revisions (out of 58) of BEAR-A. Despite the dataset's short history, changesets in BEAR-A are in the order of millions of changes, which also makes ingestion intractable in practice. On these grounds, the original OSTRICH paper (Taelman et al., 2019) excluded BEAR-B instant from its evaluation and considered only the first 10 versions of BEAR-A. Multi-snapshot solutions, on the other hand, allow us to manage these datasets. We provide, nevertheless, the results of the baseline strategy (OSTRICH) for the entire history of BEAR-A. We emphasize, however, that ingesting this archive was beyond what would be considered reasonable for any use case: it took more than five months of execution. We provide those results as a baseline (albeit an easy one to beat) to highlight the challenge of scaling to large datasets. All our experiments were run on a Linux server with a 16-core CPU (AMD EPYC 7281), 256 GB of RAM, and an 8 TB hard disk drive.
Dataset Characteristics.
Here |Vi| is the size of the individual revisions, and |Δi| denotes the average size of the individual changesets.
We evaluate the different strategies for snapshot creation detailed in Section 4.2 along ingestion speed, storage size, and query runtime. Except for our baseline (OSTRICH), all our strategies are defined by parameters that we adjust according to the dataset:
Table 6 summarizes the important information about evaluated methods and systems. We omit the reference systems included with the BEAR benchmark since they are outperformed by OSTRICH (Taelman et al., 2019).
List of the Evaluated Storage Strategies and the Resulting Number of Snapshots per BEAR Dataset.
A dash ‘–’ means that the architecture was not evaluated on the given dataset.
Ingestion Time
Table 7(a) depicts the total time to ingest the experimental datasets. Since we always test two different values of

Detailed ingestion times (log scale) per revision. We include the first 1,500 revisions for BEAR-B instant since the runtime pattern is recurrent along the entire history: (a) BEAR-B daily, (b) BEAR-B hourly, and (c) BEAR-B instant.
Time and Disk Space Used by Our Different Strategies to Ingest the Data of the BEAR Benchmark Datasets.
BEAR = BEnchmark of RDF ARchives; CR = change-ratio.
Unlike ingestion time, where shorter delta chains are clearly beneficial, the gains in terms of disk usage depend on the dataset, as shown in Table 7(b). Overall, more delta chains tend to increase disk usage because snapshots can be large. For BEAR-B daily, the most space-efficient multi-snapshot strategy uses twice as much space as the baseline where frequent snapshots (high periodicity

Disk usage evolution per revision for the BEAR-A and BEAR-B datasets: (a) BEAR-A, (b) BEAR-B daily, (c) BEAR-B hourly, and (d) BEAR-B instant.
Interestingly, for BEAR-A, the change-ratio
In this section, we evaluate the impact of our snapshot creation strategies on query runtime. We use the queries provided with the BEAR benchmark for BEAR-A and BEAR-B; these are DM, VM, and V queries on single triple patterns. Each individual query was executed five times and the runtimes averaged. All the query results are depicted in Figure 9.

Query results for the BEAR benchmark: (a) BEAR-A VM, (b) BEAR-A DM, (c) BEAR-A V, (d) BEAR-B daily VM, (e) BEAR-B daily DM, (f) BEAR-B daily V, (g) BEAR-B hourly VM, (h) BEAR-B hourly DM, (i) BEAR-B hourly V, (j) BEAR-B instant VM, (k) BEAR-B instant DM, and (l) BEAR-B instant V. VM = version materialization; DM = delta materialization; V = version query.
We report the average runtime of the benchmark VM queries for each version
Using multiple delta chains is consistently beneficial for VM query runtime, which is best when the target revision is materialized as a snapshot. When that is not the case, runtime is proportional to the
DM Queries
We report for each revision
V Queries
Figure 9(c), (f), (i), and (l) shows the total runtime of the benchmark V queries on the different datasets. V queries are the most challenging queries for the multi-snapshot archiving strategies as suggested in Figure 9(f) and (i). As described in Algorithm 5, answering V queries requires us to query each delta chain individually, buffer the intermediate results, and then merge them. It follows that runtime scales proportionally to the number of delta chains, which means that, contrary to DM and VM queries, many short delta chains are detrimental to V query performance. The only exception is BEAR-A, where the change-ratio strategies can outperform the baseline strategy. This indicates that delta chains with very large aggregated deltas can also be detrimental to V query performance. However, BEAR-A is the only dataset showcasing such a behavior in our experiments. Nonetheless, due to their prohibitive ingestion cost, querying datasets such as BEAR-A and BEAR-B instant is only possible with a multi-snapshot solution.
Experiments on the Metadata Representation
We now evaluate our proposed encoding for versioning metadata, described in Section 6. We conduct the evaluation across the dimensions of ingestion time, disk usage, and query performance on the BEAR-B variants of the BEAR benchmark. For the sake of legibility, we apply our new encoding to archives stored using a single-snapshot strategy, that is, the baseline OSTRICH, and one multi-snapshot strategy. We chose the change-ratio (CR) strategy with
In a first stage, we study the performance of using variable-size (also variable-length) integer encoding (VSI)—a common and simple compression scheme—in the delta chains. We first evaluate it without our novel versioning metadata encoding on both the baseline OSTRICH and our multi-snapshot strategy. Figure 10 shows the results for VSI alone according to our different evaluation criteria on the BEAR-B daily dataset—the results for the other datasets lead to the same conclusions. We first highlight that VSI has a negligible impact on ingestion time for both the baseline and our method (Figure 10(a)). In terms of disk usage, VSI alone reduces storage consumption by 25% for the baseline OSTRICH; however, its benefits are less evident in a multi-snapshot setting (Figure 10(b)). This happens because HDT snapshots account for most of the disk usage in the archive and they do not use VSI. Like any other compression scheme, VSI affects query performance; however, its impact is minimal on VM queries (see Figure 10(c)) and still reasonable for DM and V queries (Figure 10(d) and (e)). Based on this analysis, we decided to adopt VSI by default, that is, all subsequent experiments have this compression scheme enabled.

BEAR-B daily performance impact on ingestion time, storage, and VM, DM, and V query runtime for the compression using variable size integer (VSI) on both OSTRICH and our multi-snapshot strategy with change ratio
Figure 11 shows the ingestion time and disk usage of the BEAR-B variants with the original versioning metadata encoding as used in OSTRICH and our proposed compressed encoding—denoted with the “

Ingestion times (top row) and disk usage (bottom row) for OSTRICH and a multi-snapshot storage strategy applied on the different BEAR-B flavors with the original version metadata representation and our new compressed representation: (a) BEAR-B daily ingestion times, (b) BEAR-B hourly ingestion times, (c) BEAR-B instant ingestion times, (d) BEAR-B daily disk usage, (e) BEAR-B hourly disk usage, and (f) BEAR-B instant disk usage.
In Figure 12, we show the performance of queries with the two encodings of versioning metadata. Contrary to ingestion times and disk usage, the picture for query runtime is more nuanced. Overall, our new encoding scheme does not provide a clear advantage over the previous encoding in terms of query performance. For BEAR-B daily, as shown in Figure 12(a) to (c), the archives using the new encoding are systematically slower at resolving queries than the archives with the original encoding. As for BEAR-B hourly, Figure 12(d) to (f), we note that queries are faster with the new encoding on the baseline strategy, but slightly slower with the change-ratio strategy. We can explain this by the large amounts of data that must be retrieved from long delta chains. In such cases, the gains obtained by reading less data—thanks to compression—outweigh the costs of decompression, which translates into overall faster retrieval times. Finally, for BEAR-B instant, the compressed representation is slower for VM queries, similar for DM queries, and slightly faster for V queries when compared to the original representation. All in all, compressing the versioned metadata reduces the redundancy in the delta chains, which goes in the same direction as using shorter delta chains. This explains why the compressed representation performs worse for querying on multiple delta chains: redundancy has already been fully or partially removed by the use of multiple snapshots. This diminishes the gains of further compression with our approach, which comes with a performance penalty due to decompression.

Query results for the BEAR benchmark with the original version metadata encoding and the new compressed encoding: (a) BEAR-B daily VM, (b) BEAR-B daily DM, (c) BEAR-B daily V, (d) BEAR-B hourly VM, (e) BEAR-B hourly DM, (f) BEAR-B hourly V, (g) BEAR-B instant VM, (h) BEAR-B instant DM, and (i) BEAR-B instant V. VM = version materialization; DM = delta materialization; V = version query.
We evaluate our solution for full SPARQL support on the BEAR-C dataset. BEAR-C is based on 32 weekly snapshots of the European Open Data Portal taken from the Open Data Portal Watch project (Neumaier et al., 2016). Each version contains between 485K and 563K triples, which puts BEAR-C between BEAR-B daily and BEAR-A in terms of size. Table 5 in Section 8.1 summarizes the characteristics of the datasets. BEAR-C’s query workload consists of 11 full SPARQL queries. The queries contain between two and 12 triple patterns and include the operators
Figure 13 illustrates the average execution time for each category of versioned query (VM, DM, and V) on BEAR-C. The results are displayed per individual query and averaged across all revisions for VM queries, and pairs of revisions

BEAR-C average query execution time in seconds for (a) VM, (b) DM, and (c) V queries. (log scale). VM = version materialization; DM = delta materialization; V = version query.
In Figure 14, we plot the runtime of VM and DM queries across revisions for queries #1 and #2 of BEAR-C. The figures for all the other queries can be found in Appendix A. We selected these queries due to their representative runtime behavior. Query #1 has a relatively stable runtime for VM queries, with a slight increase in later revisions. Conversely, query #2 sees a linear increase in runtime for VM queries as the target revision increases. In contrast, the runtime of DM queries is stable. We can observe that the CR strategies consistently outperform the baseline strategy on VM queries, while the baseline is faster on average for DM queries on query #1, and on par with CR

BEAR-C average query execution time in seconds for VM, DM, and V queries: (a) DM and VM runtime for BEAR-C query #1 and (b) DM and VM runtime for BEAR-C query #2. VM = version materialization; DM = delta materialization; V = version query.
A queryable SPARQL endpoint on BEAR-C deployed using our multi-snapshot storage architecture is available online5 and described by Pelgrin et al. (2023).
We now summarize our findings in previous sections and draw a few design lessons for efficient RDF archiving.
Ingestion time
For storage architectures based on aggregated deltas, our results confirm that multi-snapshot strategies pay off in terms of ingestion time. We argue they could compete with or even beat pure IC systems. To see why, we remind the reader that a baseline IC system has to process
While our approach could beat IC implementations at ingestion time, it cannot compete with classical pure CB approaches, because those have an ingestion complexity of
Disk usage
The more redundancy a delta chain contains, the more storage efficient a multi-snapshot strategy will be. This redundancy can be caused by various factors, such as large changesets and long change histories, but also by the nature of the changes, for example, changes that revert previous changes. It follows that for small datasets, small changesets, or relatively short histories, multi-snapshot strategies induce an overhead, as the performance of the baseline OSTRICH on BEAR-B daily suggests. In such cases, frequent snapshots are not a good idea. As our results on BEAR-A show, bulky updates benefit from snapshots at a moderate periodicity. Having said that, we expect multiple delta chains to be consistently more storage efficient than an IC approach if the snapshot frequency is judiciously set. In contrast, pure CB systems would be more difficult to beat with the uncompressed version of our approach—in particular if the CB system also uses HDT for the changesets. Since we can see our delta chains as TB stores of the triples that changed after a snapshot revision, we expect our solution to have a better storage footprint than a pure TB store. To see why, bear in mind that the size of our stores is capped every time a new delta chain is started. Depending on the edition patterns of the changesets, new delta chains could avoid storing deleted triples that will not be added again.
Query performance
For “easy” archives (small and/or with short histories), the overhead of multi-snapshot strategies does not pay off in terms of query runtime. This observation is particularly striking for V queries for which runtime increases with the number of delta chains. Conversely, short delta chains are mostly beneficial for VM and DM queries because these query types require us to iterate over changes within two delta chains in the worst case (for DM queries). They also systematically translate into faster ingestion times while being detrimental to storage consumption.
That said, when individual deltas are very bulky, as in BEAR-A and BEAR-C, multiple delta chains can be beneficial to V query performance, and can use less disk space than a single-snapshot storage strategy. Change-ratio strategies strike an interesting trade-off because they take into account the amount of data stored in the delta chain as criterion to create a snapshot. This ultimately has a direct positive effect on ingestion time, VM/DM querying, and storage size. Given that our solution outperforms the baseline OSTRICH for VM and DM queries, we argue that multi-snapshot architectures should outperform CB and TB approaches (as they are outperformed by OSTRICH (Taelman et al., 2019)) for these types of queries. They could be competitive to IC approaches with efficient storage, for example, HDT. The situation is different for V queries, which as reported by Taelman et al. (2019), are best handled by TB and CB approaches with HDT. Since multiple delta chains add an overhead to V query processing, we do not expect our solution to beat the classical systems in this scenario.
Finally, we highlight that the performance of full SPARQL queries on RDF archives is subject to the same performance trade-offs as queries on single triple patterns. Like most query engines, Comunica relies upon cardinality estimates for triple patterns to determine its query plans. Since our approach can accurately provide such estimates, Comunica is able to produce reasonable query plans: both VM and DM queries do not take longer than 1,000 s, and some queries, such as
Compression
In general, compressing the version metadata stored in the delta chain is a sensible alternative: compression increases ingestion speed and reduces disk storage. While it can increase query runtime, its impact is usually minimal and depends on the amount of data that needs to be fetched from disk. For very large delta chains (e.g., delta chains with big deltas), compression can even be beneficial for query performance because the overhead of decompression is insignificant compared to the savings in terms of retrieved data. This observation holds promise for distributed settings. Since compression mitigates redundancy in the delta chain, we cannot expect the snapshot strategies to have the same performance on the compressed and uncompressed versions of the archive.
Design lessons
The bottom line is that the snapshot creation strategy for RDF archives is subject to a trade-off among ingestion time, disk consumption, and query runtime for VM, DM, and V queries. As shown in our experimental section, there is no one-size-fits-all strategy. The suitability of a strategy depends on the application, namely the users' priorities or constraints, the characteristics of the archive (snapshot size, history length, and changeset size), and the query load. For example, implementing version control for a collaborative RDF graph will likely yield an archive like BEAR-B instant, that is, a very long history with many small changes and VM/DM queries mostly executed on the latest revisions. Depending on the server's capabilities and the frequency of the changes, the storage strategy could therefore rely on the change ratio or the ingestion time ratio and be tuned to offer arbitrary latency guarantees for ingestion. On a different note, a user doing data analytics on the published versions of DBpedia (as done by Pelgrin et al. (2021)) may be confronted with a dataset like BEAR-A and, therefore, resort to numerous snapshots, unless their query load includes many real-time V queries. Table 8 summarizes these lessons for different common requirements when managing RDF archives.
Design Recommendations for Multi-Snapshot RDF Archives.
VM = version materialization; DM = delta materialization; V = version query.
Furthermore, we showcased our results for full SPARQL processing over RDF archives on the BEAR-C benchmark. To the best of our knowledge, this is the first approach that provides a solution for BEAR-C. We believe this work unlocks several perspectives for efficient querying over RDF archives. Firstly, our integration with Comunica enables support for CV and CD archive queries, because these query types build upon the composition of simpler queries (VM, DM, and V). It follows that our approach can now support those CV and CD queries that are expressible in SPARQL. Secondly, our proposal highlights the lack of standardization for SPARQL querying on RDF archives, which has encouraged solution providers to come up with their own language extensions and ad-hoc implementations. None of them, however, has attained wide acceptance within the research and developer communities. Our solution relies on standard SPARQL and can be easily adopted by the managers of RDF archives—at least until a canonical SPARQL extension for archives arrives. Thirdly, our work reveals the limited diversity of benchmarks for SPARQL query workloads on RDF archives (Pelgrin et al., 2023). In BEAR, for example, only the BEAR-C dataset offers full SPARQL queries. Those 11 queries are, alas, insufficient to provide a comprehensive evaluation of the capabilities of novel systems, as they target a single application case and consider queries of similar topology (star-shaped patterns joined on one variable). Alternatives, such as SPBv (Papakonstantinou et al., 2017), have not seen similar adoption by the community, probably because they are not easy to deploy.7 We expect this work to prepare the ground for the emergence of more efficient, standardized, and expressive solutions for managing RDF archives.
In this article, we have presented a hybrid storage architecture for RDF archiving based on multiple snapshots and chains of aggregated deltas, with support for full SPARQL versioned queries. We have evaluated this architecture with several snapshot creation strategies in terms of ingestion time, disk usage, and query performance using the BEAR benchmark. The benefits of this architecture are bolstered by a novel and efficient compression scheme for versioning metadata, which has yielded impressive improvements over the original serialization scheme. This has further improved the scalability of our system when handling large datasets with long version histories. All these building blocks cleared the way for a new SPARQL processing system on top of our storage architecture. We are now capable of answering full SPARQL VM, DM, and V queries over RDF archives.
Our evaluation shows that our architecture can handle very long version histories, at a scale not possible before with previous techniques. We used our experimental results on the different snapshot creation strategies to draw a set of design lessons that can help users choose the best storage policy based on their data and application needs. We showcased our ability to handle the BEAR-C variant of the BEAR benchmark—the first evaluation on this dataset to the best of our knowledge. This is a first step towards the support of more sophisticated applications on top of large RDF archives, and we hope it will expedite research and development in this area.
As future work, we plan to further explore different snapshot creation strategies, for example, using machine learning, to further improve the management of complex and large RDF archives. Furthermore, we plan to investigate novel approaches for the compact representation of semantic data (Perego et al., 2021; Sagi et al., 2022), which could offer a promising alternative to the use of B+ trees. Future experiments could also evaluate the performance of our storage solutions on solid-state disks, which perform better on I/O operations. We envision further efforts towards the practical implementation of versioning use cases for RDF, such as the integration of version control features, like branching and tagging, into our system. Such features are paramount to real-world uses of versioning software and can benefit RDF dataset maintainers (Arndt et al., 2019; Graube et al., 2014). With the recent popularity of RDF-star (Abuoda et al., 2023; Hartig, 2017), which can be used to capture versioning in the form of metadata, we also plan to look into recent advances in this area. Finally, the lack of an accepted standard for expressing versioning queries with SPARQL limits the wider adoption of RDF archiving systems. We aim to work towards a standardization effort, notably on a novel syntax and a formal definition of the semantics of versioned queries.
Acknowledgments
This research was partially funded by the Danish Council for Independent Research (DFF) under grant agreement no. DFF-8048-00051B, the Poul Due Jensen Foundation, and the TAILOR Network (EU Horizon 2020 research and innovation program under GA 952215). Ruben Taelman is a postdoctoral fellow of the Research Foundation – Flanders (FWO) (1202124N).
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
