The publication and interchange of RDF datasets online has experienced significant growth in recent years, promoted by different but complementary efforts, such as Linked Open Data, the Web of Things and RDF stream processing systems. However, the current Linked Data infrastructure does not cater for the storage and exchange of sensitive or private data. On the one hand, data publishers need means to limit access to confidential data (e.g. health, financial, personal, or other sensitive data). On the other hand, the infrastructure needs to compress RDF graphs in a manner that minimises the amount of data that is both stored and transferred over the wire. In this paper, we demonstrate how HDT – a compressed serialization format for RDF – can be extended to support encryption. We propose a number of different graph partitioning strategies and discuss the benefits and tradeoffs of each approach.
In recent years, we have seen an increase in the amount of structured data published online using the Resource Description Framework (RDF), in a manner that not only lends itself to data integration but also supports data exchange. Although Linked Data publishers focus on exposing and linking open data, there are scenarios where individuals and organisations need to store and share sensitive or private data. Additionally, there are a number of regulations concerning financial, medical, personal, or otherwise sensitive data that require companies to employ strong data protection mechanisms, such as encryption and anonymisation. In order to ensure confidentiality, it is necessary to encrypt the data not only when it is in transit but also when it is at rest. In such scenarios, where multiple users have different access rights to different parts of the data, users should only be able to access the data they are allowed to access.
When it comes to Linked Data protection, to date research has focused on the encryption of partial RDF graphs using eXtensible Markup Language (XML) encryption techniques [12–14] or proposing strategies for querying encrypted RDF data [21]. One of the primary challenges of existing encryption strategies is that they result in a verbose serialization that prevents their use at scale. RDF compression is an emerging research area that focuses on reducing the space requirements of traditional RDF serializations. One approach to efficient data exchange is a (binary) RDF serialization format known as HDT (Header Dictionary Triples) [10] that can be used to compress large datasets in a manner that can be queried without prior decompression [25]. Together, encryption and compression mechanisms could be used to cater for the compact storage and efficient exchange of confidential data.
In this paper, we combine “compression+encryption” functionality for RDF datasets, thus allowing service providers to store and share confidential data while reducing storage and bandwidth usage. In particular, we propose HDTcrypt, an extension of HDT to represent encrypted datasets for multiple users with different access rights (i.e. users can only access particular subgraphs of the RDF dataset). To do so, we assume a service provider defines the different “access restricted” subgraphs of a dataset, and we investigate different partitioning strategies to better capture and represent the redundancy (i.e. repeated triples and terms) between them in HDT.
The contributions of our paper can be summarised as follows, we: (i) demonstrate how HDT compression can be extended to cater for encrypted RDF data; (ii) examine a number of alternative partitioning strategies that can be used to reduce the number of duplicates in encrypted HDT (referred to as HDTcrypt); and (iii) compare the different partitioning strategies in terms of bandwidth and performance. Experiments show that each of our partitioning strategies achieves space savings over the compression baseline (up to 31%), and that the strategies are comparable in terms of query performance. We present different space/performance tradeoffs and discuss how partitioning strategies are influenced both by the number of access restricted subgraphs and the distribution of triples across subgraphs.
The rest of the paper is structured as follows: In Section 2 we discuss related work on RDF encryption and compression. Section 3 provides the necessary background information on HDT, and Section 4 describes how compression can be combined with encryption. Section 5 details the different partitioning strategies that can be used in conjunction with graph based encryption. In Section 6 we evaluate HDTcrypt using both real-world and synthetic RDF datasets and discuss the trade-off between space and performance. Finally, we conclude and highlight future work in Section 7.
Related work
When it comes to encryption and RDF, the focus to date has been on proposing strategies for the partial encryption of RDF graphs [12–14] or the querying of encrypted data [21]. Giereth [13,14] demonstrates how XML based encryption techniques can be used to encrypt confidential data in an RDF graph, while all non-confidential data is left as plaintext. Gerbracht [12] built on this work by examining how encryption techniques can be used to encrypt RDF elements and RDF subgraphs, in a manner that reduces the storage overhead. Kasten et al. [21] in turn discuss how data can be encrypted and queried according to SPARQL triple patterns. However, this proposal suffers from scalability problems given that each triple is encrypted multiple times depending on whether or not access to the subject, predicate and/or object is restricted. A recent work by Fernández et al. [8] uses Predicate-based Encryption [22] to enable controlled access to encrypted RDF data, i.e., data providers can generate query keys based on (triple-)patterns, whereby one decryption key can decrypt all triples that match its associated triple pattern. In the database and cloud community, Searchable Symmetric Encryption (SSE) [6] has been extensively applied to store and search data in a secure manner. SSE techniques focus on the encryption of outsourced data such that an external user can encrypt their query and subsequently evaluate it against the encrypted data. The more recent Fully Homomorphic Encryption (FHE) [11] technique allows any general circuit/computation over encrypted data; however, it is prohibitively slow for most operations [4,29]. None of these works examine the interplay between encryption and compression, which is the focus of our present paper. In particular, we investigate different HDT compression strategies for RDF datasets, which are organised into different RDF graphs that need to be encrypted with different keys.
However, our approach could be adapted to work with partially encrypted graphs.
Following the categorization in [27], an RDF compressor can be classified as either syntactic or semantic. Syntactic compressors try to detect redundancy at the serialisation level, whereas semantic compressors try to eliminate logical redundancies. HDT was designed as a binary serialisation format for RDF graphs, but its optimised encodings mean that HDT also excels as a syntactic RDF compressor [10,25]. In HDT, RDF data is encoded into two main data-driven components: a Dictionary that maps all distinct terms in the dataset to unique identifiers (IDs) (reducing symbolic redundancy), and a triple component that encodes the inner RDF structure as a compact graph of IDs (reducing structural redundancy). This kind of redundancy is also addressed in k²-triples [1]. However, in the case of k²-triples the authors perform a predicate-based partition of the dataset into disjoint subsets of (subject, object) pairs. These subsets are highly compressed as (sparse) binary matrices that also allow for efficient data retrieval. RDF compression can also benefit from semantic redundancy. Theoretic foundations of exploiting logical redundancies with respect to rules and grammars have been investigated by [28] and [24], respectively. In particular, the recent compressor gRePair [24] reports the best compression ratios over the structure of RDF graphs (i.e. the graph after ID replacement), to the best of our knowledge.
Likewise, Joshi et al. [20] use rules to discard triples that can be inferred from others, and they only encode these “primitive triples”. In doing so they reduce the number of triples and consequently save space. The authors also propose a combination of semantic and syntactic compression, by integrating their approach with syntactic HDT compression techniques. Interestingly, the results were similar to those obtained by simply using HDT. Recently, Wu et al. [27] have proposed SSP, a hybrid syntactic and semantic compressor. Their evaluation demonstrates that SSP+bzip2 is slightly better than HDT+bzip2. Other approaches, like HDT-FoQ [25] or WaterFowl [5], also enable compressed data to be retrieved without the need for decompression. Both techniques, based on HDT serialization, report competitive performance at the price of using more space than other compressors such as k²-triples or gRePair. We also use HDT compression; however, specifically we examine the syntactic redundancy between RDF graphs that need to be encrypted separately, and propose and evaluate four alternative HDT compression strategies. The exploitation of semantic redundancies within HDT is out of scope and left for future work (for more details on semantic compression and HDT we refer the reader to the work by Hernández-Illera et al. [18]).
Preliminaries
Before we present our approach, we need to introduce some concepts and terminology from RDF and HDT. Thereafter, in Section 4, we propose a general mechanism to extend HDT with encryption, termed HDTcrypt.
As usual, an RDF Graph G is a finite set of triples from (I ∪ B) × I × (I ∪ B ∪ L), where I, B and L denote IRIs, blank nodes and RDF literals, respectively [16]. Figure 1 shows an example of an RDF graph representing two individuals ex:Bob and ex:Alice, and the past project of the latter (linked via ex:pastProject). In this paper, we discuss different ways to compress and encrypt such datasets, using HDT, a particular compression format for RDF graphs.
Example of an RDF graph G.
HDT Dictionary and Triples for our full graph G.
HDT [10] is a binary, compressed serialization format for optimized RDF storage and transmission, which also allows certain lookups and queries over compressed data. It is therefore very suitable for the efficient exchange and querying of large datasets. HDT encodes an RDF graph G into three components: the Header component H holds metadata, including relevant information necessary for discovery and parsing; the Dictionary component D is a catalogue that encodes all RDF terms in G and maps each of them to a unique identifier; the Triple component T compactly encodes G’s graph structure as tuples of three IDs that are used to represent the directed labelled edges in an RDF graph.
Figure 2 shows the Dictionary component (a), the underlying graph structure (b) and the final Triple component (c) for the previous RDF graph G (Fig. 1).
HDT dictionary component D
This component organises the terms in a graph G according to their positions in RDF triples, thus we also write D(G) to denote the dictionary component constructed from graph G: the section SO manages terms occurring both as subject and object, and maps them to the ID-range [1, |SO|], where |SO| is the number of such terms acting as subject and object. Sections S and O comprise terms that only occur as subjects or objects, respectively. Both sections are mapped from |SO|+1, ranging up to |SO|+|S| and |SO|+|O|, respectively. Finally, section P organises all predicate terms, which are mapped to the range [1, |P|]. It is worth noting that no ambiguity is possible once we know the role (i.e. the position in a triple, being subject, predicate or object) played by the corresponding ID. For further details, we refer to [26]. For convenience, we write id(x) for the particular ID assigned to an RDF term x, whereas we refer to all IDs and all RDF terms mapped in a dictionary component D as ids(D) and terms(D), respectively. Note that, for simplicity, we omit the “role” parameter in these functions, which should be provided in case the terms in subjects (or objects) and predicates are not disjoint [26]. Also, it is worth mentioning that in the original HDT proposal, blank nodes are treated exactly as any other term [10], considering an optional skolemization of blank nodes as a pre-processing step.
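The role-based ID assignment described above can be sketched in a few lines of Python. This is an illustrative sketch only (it is not the HDT implementation, and the ordering of terms within each section is simplified to a lexicographic sort):

```python
def build_dictionary(triples):
    """Assign HDT-style IDs to the terms of a set of (s, p, o) triples.

    Terms acting as both subject and object (section SO) take IDs
    [1, |SO|]; subject-only and object-only terms continue from
    |SO|+1 in their respective ID spaces; predicates are numbered
    independently from 1."""
    subjects = {s for s, _, _ in triples}
    objects = {o for _, _, o in triples}
    so = sorted(subjects & objects)  # shared SO section comes first

    subj_ids = {t: i + 1 for i, t in enumerate(so + sorted(subjects - objects))}
    obj_ids = {t: i + 1 for i, t in enumerate(so + sorted(objects - subjects))}
    pred_ids = {t: i + 1 for i, t in enumerate(sorted({p for _, p, _ in triples}))}
    return subj_ids, pred_ids, obj_ids
```

Note that a term in the SO section receives the same ID in both the subject and the object mapping, so no ambiguity arises once the role of an ID is known.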
HDT triple component T
This component encodes the structure of the RDF graph after ID substitution, taking into consideration a particular dictionary D; thus, we write T(G, D) to denote a triple component that was constructed from the triples in G using the IDs in dictionary D. More concretely, RDF triples are encoded as groups of three IDs: (id_s, id_p, id_o), where id_s, id_p, and id_o are the IDs of the corresponding subject, predicate, and object terms in the dictionary. T organises all triples into a forest of trees, one per different subject: the subject is the root; the middle level comprises the ordered list of predicates reachable from the corresponding subject; and the leaves list the object IDs related to each (subject, predicate) pair. This underlying representation (illustrated in Fig. 2(b)) is effectively encoded following the BitmapTriples approach [10]. It comprises two sequences: Sp and So, concatenating all predicate IDs in the middle level and all object IDs in the leaves, respectively; and two bitsequences: Bp and Bo, which are aligned with Sp and So respectively, using a 1-bit to mark the end of each list (Fig. 2(c)). In practice, each ID in Sp and So is encoded with a fixed-length encoding, using ⌈log₂(n)⌉ bits, where n is the maximum ID in the sequence [10]. Again, we use ids(T) to refer to all IDs used in a triple component T.
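The BitmapTriples layout above can be sketched as follows. This is a simplified illustration (plain Python lists stand in for the compact bit- and ID-sequences of the actual format):

```python
from collections import defaultdict

def bitmap_triples(id_triples):
    """Encode ID-triples as BitmapTriples: Sp/So concatenate the
    predicate and object IDs level by level, while the aligned
    bitsequences Bp/Bo mark the end of each adjacency list with a
    1-bit. Assumes subject IDs 1..n appear consecutively."""
    tree = defaultdict(lambda: defaultdict(list))
    for s, p, o in id_triples:
        tree[s][p].append(o)

    sp, bp, so, bo = [], [], [], []
    for s in sorted(tree):                 # one tree per subject
        preds = sorted(tree[s])
        for p in preds:
            sp.append(p)
            bp.append(1 if p == preds[-1] else 0)
            objs = sorted(tree[s][p])
            for o in objs:
                so.append(o)
                bo.append(1 if o == objs[-1] else 0)
    return sp, bp, so, bo
```

Since each subject's predicate list ends with a 1-bit in Bp (and likewise for objects in Bo), the subject of every position can be recovered by counting 1-bits, which is why subjects need not be stored explicitly.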
HDT header component H
The HDT Header includes (i) the machine-readable metadata that is necessary to process an HDT file (format metadata); and (ii) additional human-readable information to describe the dataset (usually in the form of VoID
http://www.w3.org/TR/void/
descriptions). The format metadata is mainly focused on characterising the dictionary and triple formats. In general, an HDT file of a graph G consists of a single header H, dictionary D and triples T. Nonetheless, the HDT specification [9] is flexible and allows for several dictionaries or triple components to be specified in H as soon as the interpretation of their relationship is provided in the header. It was envisaged that this would be used to split huge RDF graphs into several chunks or streams, where a sequential order of the components is assumed by default [9]. In the following section we exploit and expand this feature to encode a partition of the graph G with several dictionaries and triples.
An access-restricted RDF dataset such that G comprises three separate access-restricted subgraphs G1, G2 and G3; the graph’s canonical partition is comprised of four non-empty subgraphs, whereas the terms in these graphs can be partitioned into five non-empty subsets with corresponding dictionaries.
HDTcrypt: Extending HDT for encryption
We introduce HDTcrypt, an extension of HDT that involves encryption of RDF graphs. We first define the notion of access-restricted RDF datasets and the implications for HDT (Section 4.1). Then, we show an extension of the HDT header component to cope with access-restricted RDF datasets (Section 4.2), which leads to the final encoding. Finally, as HDTcrypt can manage several HDT Dictionary components, we describe the required operations to integrate different Dictionary components within an HDT collection (Section 4.3). These operations will be the basis to represent the shared components between access-restricted datasets efficiently, addressed in Section 5.
Representing access-restricted RDF datasets
We consider hereinafter that users wishing to publish access-restricted RDF datasets divide their complete graph of RDF triples G into (named) graphs that are accessible to other users, i.e. we assume that access rights are already materialised per user group in the form of a set (cover) of separate, possibly overlapping, RDF graphs, each of which is accessible to a different set of users.
Borrowing terminology from [17], an access-restricted RDF dataset (or just “dataset” in the following) is a set {G, (g1, G1), …, (gn, Gn)} consisting of a (non-named) default graph G and named graphs G1, …, Gn, s.t. g1, …, gn are graph names, where in our setting we require that {G1, …, Gn} is a cover
In the set-theoretic sense.
of G. We further call {G1, …, Gn} a partition of G if Gi ∩ Gj = ∅ for any i ≠ j. Note that from any dataset, a canonical partition can be trivially constructed (but may be exponential in size), consisting of all non-empty (at most 2ⁿ − 1) subsets of triples G_I corresponding to an index set I ⊆ {1, …, n}, such that G_I contains exactly those triples occurring in all Gi with i ∈ I and in no other graph.
Figure 3 shows an example of such a dataset composed of three access-restricted subgraphs (or just “subgraphs” in the following) G1, G2 and G3 for the previous full graph G (Fig. 2(a)). Intuitively, this corresponds to a scenario with three access rights: users who can access general information about projects in an organisation (graph G1); users who have access to public email accounts and relations between members in the organisation (graph G2); and finally, users who can view personal information of members, such as the salary and personal email accounts (graph G3). As can be seen, the triple (ex:Alice foaf:mbox “alice@example.org”) is repeated in subgraphs G2 and G3, showing a redundancy which can produce significant overheads in realistic scenarios with large-scale datasets and highly overlapping graphs. Canonical partitioning groups these triples into disjoint sets so that no repetitions are present. In our example in Fig. 3, the subset shared by G2 and G3, written G23, holds this single triple, (ex:Alice foaf:mbox “alice@example.org”), hence this triple is not present in the subsets exclusive to G2 and G3. In this simple scenario, the subset exclusive to G1 is equivalent to G1 itself, as G1 does not share triples with other graphs.
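The construction of the canonical partition can be sketched as follows: each triple is keyed by the index set of the graphs it occurs in. This is a minimal sketch; the concrete example triples below (beyond the shared foaf:mbox one taken from the text) are hypothetical stand-ins for the figure's content:

```python
from collections import defaultdict

def canonical_partition(named_graphs):
    """Group the triples of possibly-overlapping named graphs into
    disjoint subsets G_I, keyed by the index set I of the graphs in
    which each triple occurs."""
    membership = defaultdict(set)
    for name, graph in named_graphs.items():
        for triple in graph:
            membership[triple].add(name)
    partition = defaultdict(set)
    for triple, names in membership.items():
        partition[frozenset(names)].add(triple)
    return dict(partition)
```

For the running example, the shared triple ends up alone in the subset keyed by {2, 3}, while G1's triples form a subset equivalent to G1 itself.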
Thus, we consider hereinafter an HDT collection corresponding to a dataset as a single H, plus sets D = {D1, …} and T = {T1, …} of dictionary and triple components, respectively, such that the union of triple components encodes a cover of G, i.e. the overall graph of all triples in the dataset. We do not assume that there is a one-to-one correspondence between individual triple components in T and graphs in the dataset; different options of mapping subgraphs to HDT components will be discussed in Section 5 below. The relation between the dictionaries and the triple components (in other words, which dictionaries are used to codify which triple components) is also flexible and must be specified through metadata properties. In our case, we assume H to contain a relation R ⊆ D × T, which we call the dictionary-triples map, with the implicit meaning that dictionary components encode terms used in the corresponding triple components, plus additional header metadata (as mentioned above, the header contains a variety of further (meta-)information in standard HDT [9], which we skip for the considerations herein). It is worth noting that we do not prescribe that the components in either D or T do not overlap. However, it is clear that one should find an unambiguous correspondence to decode the terms under R.
Thus, we define the following admissibility condition for R: an HDT collection is called admissible if every ID used in a triple component can be unambiguously decoded by the dictionary components related to that triple component under R.
For any admissible HDT collection we define the T-restricted collection as the collection obtained by removing: (i) all triple components other than T itself; (ii) the corresponding dictionary components that are related to some removed triple component in R but not to T; and (iii) the affected relations from R. This allows an HDT collection to be filtered by erasing all dictionary and triple components that are not required for T.
HDTcrypt encoding
We now introduce the final encoding of the HDTcrypt extension. HDTcrypt uses AES (Advanced Encryption Standard) [7] to encrypt the dictionary and triple components of an HDT collection, and extends the header H with a keymap (kmap) that maps encrypted components to identifiers (IRIs), which denote AES keys that can be used to decrypt these components.
Thus, an HDTcrypt collection consists of the header H extended with the keymap, plus the sets D and T, where the components in D and T are encrypted with the keys identified in the keymap.
The operations to encrypt and decrypt the dictionary and triples are described as follows. First, the encrypt operation takes one or more dictionary and triple components and encrypts the components with a given key. Formally, we write enc(c, k), where c ∈ D ∪ T, to denote the component obtained by encrypting c with the key k. In addition, we add an identifier of the encrypted component to the header metadata; in other words, a pair (id_c, id_k) is added to the kmap, where id_c denotes the ID of the component in D ∪ T and id_k a unique identifier for the symmetric key.
For the decryption, it is assumed that an authorized user u has partial knowledge of these keys, i.e. they have access to a partial function keys(u) that maps a finite set of “user-owned” key IDs to the set of AES (symmetric) keys K. The decryption simply takes the given compressed component(s) and performs the decryption with the given symmetric key. Formally, we write dec(c, k), where c ∈ D ∪ T, to denote the component obtained from decrypting c with the key k. Further, for a user u we consider the non-encrypted HDT collection consisting of all dictionary and triple components which can be decrypted with the keys in keys(u); the T-restriction of this collection is defined analogously to the above.
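The keymap bookkeeping can be sketched as follows. Note that the stand-in cipher below is NOT AES and NOT secure – it is a dependency-free toy (a SHA-256-derived XOR keystream) used only so the sketch is self-contained and involutive; in HDTcrypt the components are encrypted with AES [7]:

```python
import hashlib

def toy_cipher(data: bytes, key: bytes) -> bytes:
    """XOR with a SHA-256-derived keystream. A stand-in for AES only:
    applying it twice with the same key recovers the plaintext."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

def encrypt_collection(components, keys):
    """Encrypt each serialized component with its own key and record,
    in a header keymap, which key IRI decrypts which component.

    components: component ID -> serialized bytes
    keys:       component ID -> (key IRI, symmetric key)"""
    kmap, encrypted = {}, {}
    for comp_id, payload in components.items():
        key_iri, key = keys[comp_id]
        encrypted[comp_id] = toy_cipher(payload, key)
        kmap[comp_id] = key_iri  # (id_c, id_k) entry of the keymap
    return {"kmap": kmap}, encrypted
```

A user who owns the key behind a given IRI looks it up in the header's kmap, decrypts exactly the components listed there, and ignores the rest.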
Integration operations
Finally, we define two different ways of integrating dictionaries within an HDT collection: D-union and D-merge. In the former, we replace a set of dictionaries with a new dictionary that includes the union of all their terms. In the latter, we establish one of the dictionaries as the baseline and renumber the IDs of the other dictionaries.
D-Union
The D-union of a set of dictionaries D1, …, Dk is only defined if the following condition holds on R: all T-components depending on dictionaries in the set depend only on these dictionaries. Then, we can define a trivial D-union, written D1 ∪ … ∪ Dk, as follows:
replace the dictionaries D1, …, Dk with a single dictionary containing the union of all their terms, whose ID mapping is obtained by sequentially numbering the terms upon an (arbitrary) total order, e.g. lexicographically ordering the terms (as is done in HDT dictionaries by default);
replace all relations (Di, T) in R, for each Di in the set, with new relations (D1 ∪ … ∪ Dk, T′), where T′ is obtained from T by replacing the original IDs from Di with their corresponding new IDs in the union dictionary.
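The two D-union steps above can be sketched as follows. This is a minimal illustration that ignores HDT's per-role ID spaces (SO/S/O/P) and treats each dictionary as a flat term → ID map:

```python
def d_union(dict_triple_pairs):
    """D-union: replace several dictionaries (term -> ID maps) with a
    single dictionary over the union of their terms, numbering terms
    lexicographically, and rewrite the IDs of the dependent triple
    components accordingly."""
    all_terms = sorted(set().union(*(d.keys() for d, _ in dict_triple_pairs)))
    union = {term: i + 1 for i, term in enumerate(all_terms)}
    rewritten = []
    for d, triples in dict_triple_pairs:
        # old local ID -> new ID in the union dictionary
        remap = {old_id: union[term] for term, old_id in d.items()}
        rewritten.append({(remap[s], remap[p], remap[o]) for s, p, o in triples})
    return union, rewritten
```

Because every triple component that depended on one of the input dictionaries now points at the single union dictionary, the admissibility condition is preserved.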
D-Merge
In the more general case, where the condition for D-unions does not hold on R, we can define another operation, D-merge, written D1 ⊳ D2. We start with the binary case, where only two dictionaries D1 and D2 are involved; D1 ⊳ D2 is obtained as follows:
We use the directed operator ⊳ instead of ∪ here, since this operation is not commutative.
take D1 as the baseline, keeping its IDs, and append the terms of D2, renumbering their IDs to continue after the ID range of D1, such that we:
replace all relations (D1, T) in R with (D1 ⊳ D2, T);
replace all relations (D2, T) in R with (D1 ⊳ D2, T′), where T′ is obtained from T by analogous ID changes.
D-merge can then be generalized to a sequence of dictionaries D1 ⊳ D2 ⊳ … ⊳ Dk assuming left-associativity of the ⊳ operator, i.e. D1 ⊳ D2 ⊳ D3 = (D1 ⊳ D2) ⊳ D3.
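The generalized D-merge can be sketched as follows. As with the D-union sketch, per-role ID spaces are collapsed into one for brevity, and the input dictionaries are assumed pairwise disjoint (as they are in the canonical term partition used later):

```python
def d_merge(dicts):
    """D-merge D1 ⊳ D2 ⊳ ...: the first dictionary keeps its IDs as
    the baseline; the terms of each later dictionary are shifted past
    all preceding IDs, in merge order."""
    merged, offset = {}, 0
    for d in dicts:
        for term, local_id in d.items():
            merged[term] = local_id + offset
        offset += len(d)
    return merged
```

Reordering the operands yields different ID assignments, which is why ⊳, unlike ∪, is not commutative.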
For convenience, we extend the notation T(G, D) from Section 3.2 to D-unions and D-merges: let D1, …, Dk be a sequence of dictionaries and G an RDF graph whose terms are covered by these dictionaries. Then we will write T(G, D1 ∪ … ∪ Dk) and T(G, D1 ⊳ … ⊳ Dk) for the triples part generated from G according to the combined dictionary of the D-union and the D-merge, respectively. Finally, we note that for any admissible HDT collection, both D-union and D-merge preserve admissibility.
Efficient partitioning
After having introduced the general idea of HDTcrypt and the two different ways of integrating dictionaries within an HDT collection, we now discuss four alternative strategies that can be used for distributing a dataset across dictionary and triple components in an HDTcrypt collection. These alternatives, referred to as HDTcrypt-A, HDTcrypt-B, HDTcrypt-C and HDTcrypt-D, provide different space/performance tradeoffs that will be evaluated in Section 6. We note that HDT behaves differently than the normal RDF merge regarding blank nodes in different “partitions” as, by default, HDT does not rename the blank nodes to avoid shared labels [19]: the original blank nodes are skolemized to constants (unique per RDF graph) and preserved across partitions, so that we do not need to consider blank node (re-)naming separately.
HDTcrypt-A: A dictionary and triples per named graph in the dataset
The baseline approach is straightforward: we construct separate HDT components Di and Ti per graph Gi in the dataset (see Fig. 4); thereafter each of these components is encrypted with a respective, separate key ki, identified by a unique IRI, i.e., enc(Di, ki) and enc(Ti, ki). For re-obtaining graph Gi a user must only have access to the key ki, and can thereby decrypt Di and Ti and extract the restricted collection, which corresponds to Gi. Obviously, this approach encodes a lot of overlaps in both dictionary and triples parts: that is, for our running example from Fig. 4, the IRI for ex:Alice is encoded in each individual D component, and the overlapping triples in graphs G2 and G3 appear in both T2 and T3 respectively (cf. Fig. 4).
HDTcrypt-A, create and encrypt one HDT per partition.
HDTcrypt-B, extracting non-overlapping triples.
HDTcrypt-B: Extracting non-overlapping triples in T
In order to avoid the overlaps in the triple components, a more efficient approach is to split the graphs in the dataset according to their canonical partition and again construct separate (D, T)-pairs for each subset G_I (see Fig. 5). That is, we create D_I and T_I per subset G_I, where I denotes the index set identifying a (non-empty) subset of the canonical partition. R in turn contains the pairs (D_I, T_I), and the keymap contains entries for keys k_I per I, used for the encryption/decryption of the relevant D_I and T_I. The difference for decryption now is that any user who is allowed access to Gi must have all keys corresponding to any G_I such that i ∈ I in order to re-obtain the original graph Gi.
First, the user will decrypt all the components for which they have keys, i.e. obtaining a non-encrypted collection consisting of the components (D_I, T_I) corresponding to a partition of Gi. Then, for decompressing the original graph Gi, we create separate T_I-restricted HDTs, which are decompressed separately, with Gi being the union of the resulting subgraphs.
Union of dictionaries (in HDTcrypt-C) to codify the non-overlapping dictionaries of a partition.
HDTcrypt-C: Extracting non-overlapping dictionaries in D
Note that in the previous approach, we have duplicates in the dictionary components. An alternative strategy is to create a canonical partition of terms instead of triples, and create separate dictionaries D_I for each non-empty term-subset,
Again, here I represents an index set.
respectively. Figure 6 shows the canonical partition of terms in our running example: as can be seen, the original dictionary is split into five non-empty term-subsets corresponding to the dictionaries D123 (terms shared in all three graphs), D23 (terms shared in graphs G2 and G3 that are not in G1) and D1, D2, D3 (terms occurring only in G1, G2 or G3 respectively, which are not shared between graphs). This partition can be computed efficiently, thanks to the HDT dictionary D of the full graph G, which we assume to be available.
All strategies are evaluated from an existing full graph G. Our evaluation in Section 6 also reports the time to create the HDT representation of the full graph G.
This auxiliary structure is maintained just at compression time and it is not shipped with the encrypted information.
We create an auxiliary bitsequence per graph Gi (see Fig. 6, top left), each of the size of the full dictionary D. Then, we iterate through the triples in each graph Gi and, for each term, we search its ID in D, marking such position with a 1-bit in the bitsequence of Gi. Finally, the dictionaries of the subsets can be created by inspecting the combinations of 1-bits in the bitsequences: terms in D_I will be those with a 1-bit in the bitsequences of the graphs indexed by I and 0-bits in all other graphs. For instance, in Fig. 6, D123 is constituted only by ex:Alice, because it is the only term with a 1-bit in the bitsequences of all of G1, G2 and G3. In contrast, ex:Project1 will be part of D1 as it has a 1-bit only in the bitsequence of G1.
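This bitsequence-based computation of the term partition can be sketched as follows. Plain Python lists stand in for the compact bitsequences, and the terms and graph contents below are illustrative stand-ins for the figure:

```python
def term_partition(dictionary_terms, graph_terms):
    """Build one bitsequence per graph over the full HDT dictionary
    (bit j set iff term j occurs in that graph), then group terms by
    their combination of 1-bits into the subsets D_I."""
    bitseqs = {g: [1 if t in terms else 0 for t in dictionary_terms]
               for g, terms in graph_terms.items()}
    partition = {}
    for j, term in enumerate(dictionary_terms):
        index_set = frozenset(g for g, bits in bitseqs.items() if bits[j])
        partition.setdefault(index_set, []).append(term)
    return bitseqs, partition
```

Because the auxiliary bitsequences are only needed at compression time, they can be discarded once the term-subset dictionaries have been written out.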
The number of triple components in this approach is as in HDTcrypt-A: one per graph Gi. However, they are constructed slightly differently as, in this case, we have a canonical partition of terms, and a user will just receive the dictionaries corresponding to the term-subsets that contain terms of the graphs they have been granted access to. In other words, the IDs used in each Ti should unambiguously correspond to terms, but these terms may be distributed across several dictionaries.
Given the partition definition, it is nonetheless true that a term appears in one and only one term-subset.
Thus, we encode triples with a D-union (see Section 4.3) of the dictionaries D_I such that i ∈ I. That is, for each Gi we construct Ti over this union dictionary, and add the respective (D_I, Ti) pairs to R.
HDTcrypt-D, extracting non-overlapping dictionaries and triples.
Figure 7 illustrates this union of dictionaries for the graph G1 and the respective construction of T1. The decompression process after decryption is the exact opposite: for decompressing the graph Gi, the decrypted dictionaries are used to create a D-union, which can then be used to decompress the triples in one go. Finally, as a performance improvement at compression time, note that, although the canonical partition of terms has to be built to be shipped in the compressed output, we can actually skip the creation of the D-union dictionaries to encode the IDs in the triples. To do so, we make use of the bitsequences to get the final IDs that are used in the triples: one should note that the ID of a term in a D-union of a graph Gi is the number of previous 1-bits in the bitsequence of Gi (for each SO, S, O, and P section). For instance, in our example in Fig. 7, ex:Project1 is encoded with ID 2. Instead of creating the D-union dictionary, we can see that in the bitsequence of G1 (see Fig. 6, top right) we have two 1-bits in the predicate section up to the position where ex:Project1 is stored in the original dictionary, hence its ID is 2.
HDTcrypt-D: Extracting non-overlapping dictionaries and triples in D and T
In HDTcrypt-D, we combine the methods of both HDTcrypt-B and HDTcrypt-C. That is, we first create a canonical partition of terms as in HDTcrypt-C, and a canonical partition of triples as in HDTcrypt-B. Then, we codify the IDs in the subsets of the triple partition with the IDs from the term-subset dictionaries. Note, however, that in this case there is – potentially – an n:m correspondence between the resulting dictionary and triple components. In other words, triples in a subset G_I can include terms that are not only in D_I, as the terms may be distributed across several term-subsets. For instance, in our running example, G23 in HDTcrypt-B includes ex:Alice (see Fig. 5), which is stored in D123 in HDTcrypt-C (see Fig. 6). One alternative could be to create a D-union for each triple subset and codify its triples with the corresponding IDs. However, it is trivial to see that this would lead to an exponential number of D-union dictionaries, one per triple component. In addition, we would need to physically recreate all these dictionaries at compression time, and also at decompression time in order to decompress each single graph Gi. Thus, we perform a D-merge approach (see the definition in Section 4.3), which fits perfectly with such n:m relations. This is illustrated in Fig. 8. As can be seen, the triples in each subset of the canonical partition are encoded with an appropriate D-merge of term-subsets. A practical example is shown in Fig. 9, representing the encoding of the subset of triples exclusive to G3. As defined in D-merge, IDs are assigned in order, that is, for a merge D123 ⊳ D23 ⊳ D3, the IDs in D3 are shifted past those of D123 and D23. For instance, in our example, the predicate ex:salary will be encoded with final ID 2, because its local ID in D3 is 1, and it has to be shifted by the single preceding predicate (foaf:mbox in D23). Note that here we restrict the merge to operate per dictionary section (SO, S, O and P), given the special numbering of IDs in HDT, where S and O IDs follow from those in SO, as explained in Section 3.1. This is illustrated in our example, e.g. the object “30K” with local ID 1 in D3 is mapped in the D-merge dictionary to 3, as it sums up all the previous object and subject IDs in D123 and D23.
Merge of dictionaries (in HDTcrypt-D) to codify the non-overlapping dictionaries and triples of a partition.
It is worth mentioning that no ambiguity is present in the order of the D-merge, as it is implicitly given by the canonical term partition. Thus, the decompression follows the opposite process: for each subset in the partition of the graph Gi, the user processes each ID and, depending on its value, they get the associated term from the appropriate term-subset. For instance, if the user is accessing a predicate ID in our example, one can easily see from the accumulated counts of predicate IDs in the preceding dictionaries which dictionary has to be used.
We abuse notation and use the cardinality of a dictionary set to denote the maximum ID represented in it.
The local ID to look up is then obtained by subtracting this cardinality, hence the corresponding predicate entry is inspected and foaf:pastProject is retrieved. Finally, note that not all terms in a D-merge are necessarily used when encoding a particular triple subset. For instance, in our example in Fig. 9, the object “alice@example.org” is present in the D-merge but not used in the encoded graph. However, its ID is “blocked”: it cannot be reused by a different object, as it is taken into account when encoding the objects that are present (“30K” and “personal@example.org”) once we apply the shift explained above. The same applies to subjects, so that subject IDs are not necessarily consecutive. This constitutes a problem for the HDT Bitmap Triples encoding (presented in Section 3.2), given that it represents subjects implicitly by assuming that they are consecutive. Thus, HDTcrypt-D has to state the ID of each subject explicitly, which constitutes a space overhead and a drawback of this approach, despite the fact that duplicate terms and triples are avoided. Technically, instead of a forest of trees, triples are codified as tuples of three IDs, using an existing HDT triples representation called Plain Triples [9].
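The ID shifting and lookup described above can be sketched as follows. This is a minimal illustration assuming plain Python lists as per-section dictionaries (it is not the HDTcrypt implementation); local IDs are 1-based, as in HDT.

```python
# Minimal sketch of D-merge ID handling: each component dictionary is an
# ordered list of terms for one section (here, predicates); the merge
# concatenates ID ranges by shifting local IDs.

class DMerge:
    def __init__(self, dictionaries):
        # dictionaries: list of term lists, in canonical partition order
        self.dictionaries = dictionaries
        # offsets[i] = number of IDs assigned before dictionary i
        self.offsets = []
        total = 0
        for d in dictionaries:
            self.offsets.append(total)
            total += len(d)

    def global_id(self, dict_index, local_id):
        # local IDs are 1-based, as in HDT; shift by preceding cardinalities
        return self.offsets[dict_index] + local_id

    def lookup(self, gid):
        # decompression: find the dictionary whose ID range contains gid,
        # then translate back to a local (1-based) ID
        for i in reversed(range(len(self.offsets))):
            if gid > self.offsets[i]:
                return self.dictionaries[i][gid - self.offsets[i] - 1]
        raise ValueError("ID out of range")

# Toy example with two hypothetical predicate dictionaries
merge = DMerge([["foaf:pastProject"], ["ex:salary"]])
assert merge.global_id(1, 1) == 2      # ex:salary: local ID 1, shifted by 1
assert merge.lookup(2) == "ex:salary"
assert merge.lookup(1) == "foaf:pastProject"
```

In the real format the same shifting is applied independently per dictionary section, so that the special HDT numbering (S and O IDs following the shared section) is preserved.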
Evaluation
This section evaluates the performance of HDTcrypt by comparing each of the aforementioned partitioning strategies with respect to the performance of the algorithms and the size of the compressed encrypted dataset. We first describe our experimental setup in detail. Then, we present our evaluation results in terms of three distinct yet related tasks: (i) performance of the compression and encryption algorithms and size of the resulting datasets; (ii) performance of the decryption and decompression algorithms; and (iii) performance of triple pattern queries9
Matching RDF triples in which each component may be a variable.
over the compressed datasets, which constitute the basis for SPARQL’s graph pattern matching [17].
Finally, we provide a summary and discussion of the results in Section 6.5. Additional experiments can be found in the Appendix.
The evaluation is performed on three different datasets, described in Table 1.
First, we selected DBpedia, the well-known RDF knowledge base extracted from Wikipedia, chosen due to the volume and variety of the data and the large number of dictionary terms therein. We used two different versions, DBpedia 3.8 and DBpedia 2016-10, the latter being double the size of the former. Hereinafter, we use the term DBpedia to refer to both versions, as the results are comparable. Then, we chose a realistic scenario based on the configuration used in SAFE [23], a query federation engine with access control. The SAFE dataset includes public statistical data (referred to as external) and anonymised clinical data (internal).
Additionally, in order to test the scalability of the various partitioning strategies, we use the Lehigh University Benchmark (LUBM) [15] data generator to obtain synthetic datasets of increasing size, from 1,000 universities (LUBM1K, comprising 0.13 billion triples) to 4,000 universities (LUBM4K, 0.53 billion triples). Table 1 shows the original dataset sizes in plain N-Triples (NT). In addition, we provide details of the size of the datasets compressed with gzip, HDT and HDT+gz (gzip compression over the HDT file). This shows that our HDT compression ratios are in line with the original proposal [10]. Finally, the last column of the table shows the time (in minutes) to compute the HDT representation of each dataset. The HDT creation time for LUBM grows linearly with the number of triples. This result is in accordance with the HDT technique, which reports linear scalability with respect to the input size and the number of terms in the dictionary (cf. [10]). The two versions of DBpedia show a similar behaviour: DBpedia 2016-10 doubles the number of triples of DBpedia 3.8 and triples the number of dictionary terms. As a result, the HDT creation time increases by a factor of 2.6.
For the LUBM dataset we group data based on the rdf:type of resources and use these groupings to generate three different subgraph datasets (the size of each subgraph is shown in Table 2):
12 subgraphs, composed of UnderGraduateStudent, Courses, Publication, GraduateStudent, Department, ResearchAssistant, AssociateProfessor, TeachingAssistant, FullProfessor, AssistantProfessor, University and Lecturer.
9 subgraphs, composed of the union of UnderGraduateStudent and GraduateStudent, Courses, Publication, the union of AssistantProfessor, ResearchAssistant and TeachingAssistant, Department, AssociateProfessor, FullProfessor, University and Lecturer.
6 subgraphs, composed of the union of UnderGraduateStudent and GraduateStudent, the union of AssistantProfessor, ResearchAssistant, TeachingAssistant, Lecturer, AssociateProfessor and FullProfessor, Courses, Publication, Department and University.
When triples represent relations between resources of different types, all incoming/outgoing relations are replicated in both subgraphs.
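The type-based grouping can be sketched as follows. This is an illustrative simplification (hypothetical names, a single rdf:type per resource), not the evaluation code itself.

```python
# Sketch: partition triples into subgraphs by the rdf:type of their subject,
# replicating triples whose subject and object belong to different groups
# into both subgraphs, as described above.

RDF_TYPE = "rdf:type"

def partition_by_type(triples, type_to_group):
    # first pass: resource -> group, derived from rdf:type triples
    group_of = {}
    for s, p, o in triples:
        if p == RDF_TYPE and o in type_to_group:
            group_of[s] = type_to_group[o]
    # second pass: assign each triple to the subject's group and, for
    # cross-group relations, also to the object's group (replication)
    subgraphs = {}
    for s, p, o in triples:
        for g in {group_of.get(s), group_of.get(o)} - {None}:
            subgraphs.setdefault(g, []).append((s, p, o))
    return subgraphs

triples = [
    ("ub:student1", RDF_TYPE, "ub:UndergraduateStudent"),
    ("ub:course1", RDF_TYPE, "ub:Course"),
    ("ub:student1", "ub:takesCourse", "ub:course1"),  # cross-group: replicated
]
groups = partition_by_type(triples, {"ub:UndergraduateStudent": "students",
                                     "ub:Course": "courses"})
assert ("ub:student1", "ub:takesCourse", "ub:course1") in groups["students"]
assert ("ub:student1", "ub:takesCourse", "ub:course1") in groups["courses"]
```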
For DBpedia (both versions), we generate 6, 9 and 12 subgraphs, each containing randomly selected triples amounting to 10% of the entire corpus (thus ensuring overlaps among subgraphs). Triples that do not appear in any subgraph are subsequently distributed evenly among the subgraphs.
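A sketch of this random selection with overlaps follows; the round-robin assignment of leftover triples is an assumption, as the exact distribution policy is not detailed in the text.

```python
# Sketch: each of k subgraphs draws 10% of the corpus (so overlaps arise),
# and triples not covered by any draw are spread evenly (here: round-robin).
import random

def random_overlapping_partition(triples, k, fraction=0.1, seed=42):
    rnd = random.Random(seed)
    n = max(1, int(len(triples) * fraction))
    subgraphs = [rnd.sample(triples, min(n, len(triples))) for _ in range(k)]
    covered = set().union(*map(set, subgraphs))
    leftover = [t for t in triples if t not in covered]
    for i, t in enumerate(leftover):        # even (round-robin) distribution
        subgraphs[i % k].append(t)
    return subgraphs

corpus = [("s%d" % i, "p", "o%d" % i) for i in range(100)]
parts = random_overlapping_partition(corpus, k=6)
assert set(corpus) == set().union(*map(set, parts))  # every triple is covered
```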
In the case of SAFE, the dataset is already organised into 8 subgraphs: 5 external graphs, including statistical data from well-known organisations such as Eurostat and FAO, and 3 internal graphs, including aggregated clinical data represented as RDF data cubes [23].
The complexity of the partitioning is directly related to the number of duplicates across subgraphs; the size of each subgraph and the overall duplicate ratio are presented in column Dup % of Table 2. Note that the type-based selection of subgraphs in LUBM generates a skewed distribution of subgraph sizes but a similar duplicate ratio (of approximately 38%) at increasing sizes (LUBM1K to LUBM4K). Thus, the comparison between techniques focuses on the effect of the 6/9/12 subgraphs and the efficiency at large scale. In contrast, the even distribution of DBpedia is reflected in the similar size of its subgraphs. Given that the number of duplicates increases with the number of subgraphs (12%, 22% and 33% for 6/9/12 subgraphs respectively), the effect of duplicates is also evaluated. In SAFE, the given 8 subgraphs contain few repeated triples (less than 0.01%). Note that the internal subgraphs correspond to the smallest graphs in Table 2, i.e. the public external information corresponds to the biggest partitions.
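The duplicate ratio can be sketched as follows, under the assumption that it is defined as the share of triple occurrences across subgraphs that are redundant copies:

```python
# Sketch: duplicate ratio of a partition, assumed here to be the fraction of
# triple occurrences across all subgraphs that are redundant duplicates.

def duplicate_ratio(subgraphs):
    total = sum(len(g) for g in subgraphs)            # occurrences, with repeats
    distinct = len(set().union(*map(set, subgraphs))) # distinct triples
    return (total - distinct) / total

g1 = [("a", "p", "b"), ("a", "p", "c")]
g2 = [("a", "p", "b"), ("d", "p", "e")]
assert duplicate_ratio([g1, g2]) == 0.25  # 1 duplicate out of 4 occurrences
```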
In the following, we show the performance results of each of the algorithms (compression and encryption, decryption and decompression, integration and querying). Experiments were performed on a commodity server (Intel Xeon E5-2650v2 @ 2.6 GHz, 16 cores, 180 GB RAM, Debian 7.9). All reported (elapsed) times are the average of three independent executions in a cold-cache scenario (caches are empty at the start of each process).
Performance of compression and encryption algorithms
Subgraphs | Dataset | Compression time (minutes): crypt-A / B / C / D | Encryption time (seconds): crypt-A / B / C / D | Size (GB): crypt-A / B / C / D
6 | DBpedia 3.8 | 102 / 197 / 62 / 121 | 98.17 / 108.84 / 80.85 / 91.07 | 9.64 / 9.33 / 7.43 / 8.59
6 | DBpedia 2016-10 | 306 / 565 / 197 / 378 | 144.54 / 143.06 / 109.95 / 123.91 | 18.91 / 18.38 / 14.04 / 16.25
9 | DBpedia 3.8 | 117 / 225 / 72 / 142 | 124.65 / 140.40 / 98.24 / 125.94 | 11.64 / 10.91 / 7.92 / 8.76
9 | DBpedia 2016-10 | 267 / 520 / 175 / 315 | 182.61 / 172.50 / 113.18 / 128.33 | 23.12 / 21.78 / 15.00 / 16.58
12 | DBpedia 3.8 | 131 / 214 / 81 / 152 | 156.52 / 245.16 / 187.90 / 221.89 | 13.87 / 12.49 / 8.49 / 8.91
12 | DBpedia 2016-10 | 300 / 485 / 202 / 326 | 228.92 / 201.26 / 128.29 / 138.70 | 27.79 / 25.11 / 16.14 / 16.88
8 | SAFE | 9 / 18 / 8 / 14 | 4.07 / 4.91 / 4.27 / 5.20 | 0.53 / 0.61 / 0.53 / 0.65
6 | LUBM1K | 34 / 41 / 21 / 33 | 12.25 / 11.21 / 9.85 / 10.94 | 1.40 / 1.08 / 1.05 / 1.05
6 | LUBM2K | 78 / 94 / 47 / 73 | 21.88 / 18.74 / 17.95 / 18.24 | 2.86 / 2.19 / 2.15 / 2.16
6 | LUBM3K | 125 / 143 / 72 / 112 | 32.24 / 26.82 / 25.72 / 26.00 | 4.35 / 3.31 / 3.28 / 3.30
6 | LUBM4K | 169 / 191 / 97 / 151 | 54.56 / 33.83 / 33.17 / 33.90 | 5.65 / 4.45 / 4.41 / 4.45
9 | LUBM1K | 37 / 42 / 21 / 36 | 12.96 / 11.93 / 11.33 / 11.89 | 1.44 / 1.09 / 1.06 / 1.04
9 | LUBM2K | 78 / 88 / 45 / 73 | 22.80 / 19.56 / 18.97 / 19.98 | 2.93 / 2.21 / 2.17 / 2.14
9 | LUBM3K | 126 / 144 / 71 / 114 | 33.79 / 28.02 / 27.61 / 27.58 | 4.45 / 3.34 / 3.31 / 3.26
9 | LUBM4K | 174 / 194 / 98 / 158 | 60.11 / 35.66 / 35.53 / 35.36 | 5.97 / 4.49 / 4.44 / 4.42
12 | LUBM1K | 36 / 44 / 23 / 39 | 12.78 / 13.48 / 12.21 / 12.98 | 1.45 / 1.11 / 1.06 / 1.05
12 | LUBM2K | 75 / 94 / 49 / 82 | 23.32 / 21.62 / 20.23 / 21.38 | 2.96 / 2.25 / 2.17 / 2.15
12 | LUBM3K | 116 / 142 / 73 / 126 | 33.92 / 29.03 / 28.35 / 29.50 | 4.50 / 3.41 / 3.31 / 3.26
12 | LUBM4K | 158 / 190 / 99 / 175 | 60.80 / 37.85 / 38.20 / 37.36 | 6.03 / 4.56 / 4.44 / 4.44
Number of dictionaries/triples in each approach
Subgraphs | Dataset | Dictionaries: crypt-A / crypt-B / crypt-C & crypt-D | Triples: crypt-A & crypt-C / crypt-B & crypt-D
6 | DBpedia 3.8 | 6 / 63 / 63 | 6 / 63
9 | DBpedia 3.8 | 9 / 510 / 511 | 9 / 510
12 | DBpedia 3.8 | 12 / 3836 / 4095 | 12 / 3836
6 | DBpedia 2016-10 | 6 / 63 / 63 | 6 / 63
9 | DBpedia 2016-10 | 9 / 511 / 511 | 9 / 511
12 | DBpedia 2016-10 | 12 / 3904 / 4095 | 12 / 3904
8 | SAFE | 8 / 32 / 48 | 8 / 32
6 | LUBM | 6 / 20 / 23 | 6 / 20
9 | LUBM | 9 / 39 / 64 | 9 / 39
12 | LUBM | 12 / 55 / 122 | 12 / 55
Compression and encryption
Table 3 shows the compression and encryption times as well as the resulting file sizes15
Note that encryption produces negligible size overheads on the compressed files.
of the datasets for different partitioning strategies, whereas Table 4 shows the respective number of resulting dictionary and triple components.
The results show that HDTcrypt-C is both the fastest and produces the most compact representation (only marginally outperformed in space by HDTcrypt-D in some LUBM cases). HDTcrypt-C is 37% faster than the baseline approach HDTcrypt-A in DBpedia (we refer to the average over both DBpedia versions hereafter), and 40% faster in LUBM. In SAFE, with few duplicates, HDTcrypt-C is still 18% faster.
In contrast, HDTcrypt-B is the slowest approach, with a mean overhead of 68% over the baseline, because it needs to create many dictionaries (e.g. 3904 in DBpedia 2016-10, as shown in Table 4) with overlapping terms. In turn, HDTcrypt-D is highly influenced by the number of dictionary components, due to the additional complexity of creating the respective triple components from the D-merge. Thus, HDTcrypt-D is faster than the baseline in LUBM with 6 or 9 subgraphs, which have few components as shown in Table 4, but it performs worse in LUBM with 12 subgraphs, as well as in all DBpedia and SAFE datasets.
Note that, as stated in Section 5, the creation of HDTcrypt-B, HDTcrypt-C and HDTcrypt-D assumes that the HDT representation of the full graph G is already computed.16
In fact, HDT is becoming a popular format for publishers and third parties to store and serve large datasets, and a large portion of the datasets in the Linked Open Data cloud is already available in HDT thanks to the LOD Laundromat project [2], which crawls and serves HDT conversions of datasets (http://lodlaundromat.org/wardrobe/).
Otherwise, the HDT creation time (reported in Table 1) should be considered a once-off overhead. In the worst case (i.e. the conversion is done for the sole purpose of encrypting a single dataset with a particular number of subgraphs), adding this time would make HDTcrypt-C perform similarly to the baseline in LUBM. In DBpedia, with a richer dictionary of terms, HDTcrypt-C would be 35–50% slower than the baseline.
Additionally, when compared with the baseline approach HDTcrypt-A, HDTcrypt-C achieves a mean of 33% space saving in DBpedia and 26% in LUBM. In general, HDTcrypt-B, HDTcrypt-C and HDTcrypt-D benefit from an increasing number of overlapping dictionaries/triples; hence the even distribution of DBpedia produces larger space savings. For the same reason, an increasing number of subgraphs leads to more duplicates and larger space savings w.r.t. the baseline; e.g. HDTcrypt-C in LUBM achieves 24%, 26% and 27% savings with 6, 9 and 12 subgraphs respectively. It is worth mentioning that, although HDTcrypt-D isolates the non-overlapping dictionaries and triples, there is an overhead in its representation because we use Plain Triples instead of Bitmap Triples (as stated in Section 5.4). This is more noticeable in DBpedia, with its long predicate and object lists. It is worth highlighting that, in SAFE, with almost no duplicates, only HDTcrypt-C is competitive in space with the baseline, while HDTcrypt-B and HDTcrypt-D pay a slight overhead for keeping the different structures, which cannot leverage the minimal duplication across subgraphs.
Encryption times are only a small portion of the publication process. HDTcrypt-C is generally the fastest approach, except for DBpedia 3.8 with 12 subgraphs and SAFE, for which HDTcrypt-A is the fastest, and for LUBM3K/LUBM4K with 9 subgraphs as well as LUBM4K with 12 subgraphs, where HDTcrypt-D is marginally faster. Thus, we can conclude that both the number of files that need to be encrypted and their respective file sizes influence the overall encryption time. Finally, it is worth noting that, as expected, compression and encryption times, as well as the resulting file sizes, grow linearly with increasing LUBM dataset sizes.
Performance of the decryption and decompression algorithms for half of the 6/9/12 subgraphs, including the smallest/average/largest subgraphs
Subgraphs | Dataset | Decryption time (seconds): crypt-A / B / C / D | Decompression time (minutes): crypt-A / B / C / D
6 | DBpedia 3.8 | 61.56 / 79.08 / 64.92 / 79.80 | 22 / 18 / 14 / 18
6 | DBpedia 2016-10 | 108.64 / 125.36 / 108.69 / 127.51 | 51 / 46 / 39 / 53
9 | DBpedia 3.8 | 88.64 / 148.52 / 111.84 / 129.31 | 26 / 22 / 17 / 25
9 | DBpedia 2016-10 | 146.93 / 200.97 / 151.41 / 171.56 | 49 / 45 / 36 / 50
12 | DBpedia 3.8 | 93.10 / 220.46 / 195.11 / 242.85 | 22 / 22 / 17 / 26
12 | DBpedia 2016-10 | 160.88 / 256.05 / 179.65 / 206.75 | 37 / 34 / 27 / 37
6 | LUBM1K | 10.82 / 11.37 / 9.80 / 13.74 | 8 / 7 / 5 / 7
6 | LUBM2K | 19.24 / 22.83 / 17.15 / 27.62 | 16 / 14 / 11 / 15
6 | LUBM3K | 28.35 / 31.65 / 24.78 / 45.14 | 24 / 20 / 16 / 22
6 | LUBM4K | 48.56 / 43.03 / 33.70 / 59.46 | 32 / 27 / 21 / 29
9 | LUBM1K | 12.84 / 13.36 / 11.86 / 17.52 | 8 / 10 / 6 / 8
9 | LUBM2K | 22.77 / 24.47 / 20.63 / 33.15 | 17 / 21 / 12 / 16
9 | LUBM3K | 32.94 / 37.32 / 30.30 / 48.95 | 26 / 32 / 18 / 23
9 | LUBM4K | 48.00 / 52.35 / 51.36 / 70.12 | 34 / 41 / 24 / 32
12 | LUBM1K | 10.75 / 11.54 / 11.73 / 15.84 | 7 / 6 / 5 / 7
12 | LUBM2K | 18.50 / 20.30 / 18.99 / 30.40 | 14 / 13 / 10 / 14
12 | LUBM3K | 26.60 / 31.08 / 27.00 / 45.35 | 21 / 19 / 15 / 20
12 | LUBM4K | 36.62 / 39.48 / 39.09 / 66.57 | 29 / 25 / 19 / 27
Performance of the decryption and decompression algorithms for the larger and smaller accessed subsets of the SAFE dataset
Dataset | Decryption time (seconds): crypt-A / B / C / D | Decompression time (seconds): crypt-A / B / C / D
SAFE (larger subset) | 3.98 / 4.45 / 4.01 / 4.70 | 182 / 169 / 118 / 174
SAFE (smaller subset) | 1.01 / 2.75 / 1.05 / 2.14 | 6 / 74 / 4 / 56
Decryption and decompression
According to our use case scenario, we assume that a user has been granted access to more than one named graph, but not to the whole dataset. For a fair comparison, given the skewed size distribution of subgraphs in LUBM (see Table 2), we set up a scenario where the user has been granted access to half of the total subgraphs, including the smallest, average and largest subgraphs, in the case of 6, 9 and 12 subgraphs respectively. As for the SAFE dataset, we consider a scenario where a subset of the external and internal datasets is accessed. In particular, we also took half of the subgraphs, including the largest external dataset as well as smaller ones.
Table 5 shows the time to decrypt and decompress each of the respective subgraphs in the case of DBpedia and LUBM, while Table 6 shows the results for SAFE.
Decryption times are almost negligible compared to the decompression time – similar to encryption vs. compression time. Again, the number of files is the dominating factor, hence HDTcrypt-A is the fastest approach regarding decryption.
Regarding decompression, as with compression, HDTcrypt-C is the fastest approach, achieving a mean of 30% time savings in DBpedia and LUBM w.r.t. the baseline HDTcrypt-A. In DBpedia, given the even distribution, 6 subgraphs are always slightly faster than 9 and 12 subgraphs, as the latter generate more duplicates. Regarding the number of graphs in LUBM, 6 and 12 subgraphs behave similarly, while the decompression of 9 subgraphs is slightly slower. We could verify that the difference between 9 and 12 subgraphs is due to the slightly bigger total file size produced by the 9-subgraph selection in comparison to the 12-subgraph one, whereas the difference between 9 and 6 subgraphs is a consequence of the larger number of dictionary/triple components generated for 9 subgraphs (as shown in Table 4). As with compression, performance times increase linearly with increasing dataset sizes.
Finally, although the results for the SAFE dataset (shown in Table 6) follow a similar behaviour, it is worth mentioning that HDTcrypt-B and HDTcrypt-D have to pay the price of loading additional structures (even in the presence of minimal duplication). Results show that, while this pays off in the case of the larger subset, for the small subset HDTcrypt-A and HDTcrypt-C are clearly faster than HDTcrypt-B and HDTcrypt-D.
Querying compressed data
One of the main advantages of HDT compression is that SPARQL triple pattern queries can be performed directly on the compressed data [25]. Whereas this also holds for HDTcrypt-A, as it already consists of one file per subgraph, the other approaches, HDTcrypt-B, HDTcrypt-C and HDTcrypt-D, split a subgraph into different dictionary (D) and triple (T) components. For these latter approaches, query resolution can follow two strategies:
Querying an integrated HDT: This strategy integrates all the dictionary and triple components of a subgraph into a new HDT (i.e. converting to the baseline HDTcrypt-A) which can be then queried.
Local query on each dictionary and triple component: In this case, the query is performed locally on each dictionary and triple component and the results are then integrated. Note that this strategy is not viable for HDTcrypt-C, as it would require performing the D-union of all the dictionaries in order to look up the triple IDs, which is equivalent to integrating HDTcrypt-C into a new HDT to be queried.
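The local-query strategy can be sketched as follows. The component layout is a deliberately simplified stand-in: one flat dictionary per component instead of HDT's four sections, and no Bitmap Triples index.

```python
# Sketch of local query resolution (strategy 2): a triple pattern is resolved
# against each dictionary/triple component pair, and results are concatenated.

def resolve_pattern(pattern, components):
    # pattern: (s, p, o) with None as wildcard; components: list of
    # (dictionary, triples), where triples use the component's local IDs
    results = []
    for dictionary, triples in components:
        inv = {term: i for i, term in enumerate(dictionary)}
        ids = tuple(None if t is None else inv.get(t, -1) for t in pattern)
        if -1 in ids:            # a constant missing from this dictionary
            continue             # cannot match within this component
        for triple in triples:
            if all(q is None or q == v for q, v in zip(ids, triple)):
                results.append(tuple(dictionary[v] for v in triple))
    return results

comp1 = (["ex:Alice", "foaf:name", '"Alice"'], [(0, 1, 2)])
comp2 = (["ex:Bob", "foaf:name", '"Bob"'], [(0, 1, 2)])
out = resolve_pattern((None, "foaf:name", None), [comp1, comp2])
assert out == [("ex:Alice", "foaf:name", '"Alice"'),
               ("ex:Bob", "foaf:name", '"Bob"')]
```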
The following evaluation first inspects the performance overhead of the integration required by the former strategy. Then, we evaluate the query performance of the latter. For exemplary purposes, we present the average results of the DBpedia datasets, while the performance for LUBM and SAFE can be found in the Appendix.
Note that, although there are a number of strategies for querying encrypted data directly (see e.g., [3]), we consider these orthogonal and leave combining them with our partitioning for future work.
Integration of the dictionary and triple components into one HDT per subgraph in DBpedia (average of the performance in both DBpedia versions).
Integrating dictionary and triple components into a new HDT
Following our use case scenario, we assume that a user has decrypted half of the subgraphs. Figure 10 shows the time required by each strategy (i.e. HDTcrypt-B, HDTcrypt-C and HDTcrypt-D) to integrate their dictionary and triple components into one HDT per subgraph, similarly to the baseline HDTcrypt-A. This integration is performed as follows. First, all dictionary components are fed into a new dictionary, reorganising the mapping between all terms and their corresponding IDs (as defined in Section 3.1). This first process is similar to the first step of the D-union (see Section 4.3.1). Then, we read the triple components and use the new dictionary to convert the triples to the new IDs, integrating all of them into a single new triple component per subgraph.17
Note that this process slightly differs from the D-union, as the latter only replaces the new IDs in each of the input triple components.
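This integration can be sketched as follows, again treating each dictionary as a flat sorted term list instead of HDT's sectioned dictionary:

```python
# Sketch of component integration: build one sorted dictionary from all
# dictionary components, then re-encode the triple IDs against it.

def integrate(components):
    # components: list of (dictionary, triples) with local 0-based IDs
    merged_terms = sorted(set(t for d, _ in components for t in d))
    new_id = {term: i for i, term in enumerate(merged_terms)}
    merged_triples = []
    for dictionary, triples in components:
        for s, p, o in triples:
            merged_triples.append((new_id[dictionary[s]],
                                   new_id[dictionary[p]],
                                   new_id[dictionary[o]]))
    return merged_terms, sorted(set(merged_triples))

d1 = (["ex:Alice", "foaf:knows"], [(0, 1, 0)])
d2 = (["ex:Bob", "foaf:knows"], [(0, 1, 0)])
terms, triples = integrate([d1, d2])
assert terms == ["ex:Alice", "ex:Bob", "foaf:knows"]
assert triples == [(0, 2, 0), (1, 2, 1)]
```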
Figure 10(a) presents the time to integrate the dictionary and triple components into the corresponding subgraphs for DBpedia. Yet again, HDTcrypt-C is the fastest approach, being 29% and 56% faster than HDTcrypt-B and HDTcrypt-D respectively. In general, all approaches show a linear increase over dataset sizes, as shown in the Appendix.
A comparison in terms of the number of subgraphs is shown in Fig. 10(b), reporting the merging times for 6, 9 and 12 subgraphs in DBpedia (the trends are similar for all datasets). As expected, given that the integration process amounts to a partial decompression of the dictionary and triple components, the integration performance follows the same pattern as decompression. That is, the even distribution of DBpedia results in a faster performance for 6 subgraphs, whereas the excessive duplicates of 12 subgraphs penalise its performance.
Performance of Triple Patterns over DBpedia (average of the performance in both DBpedia versions).
Query performance
We evaluate the query performance of all partitioning strategies in our use case scenario. Thus, for each of the accessed subgraphs (in both DBpedia versions, LUBM and SAFE) we first generate 30 random queries for each triple pattern type,18
All queries are available at the HDTcrypt repository.
ensuring an even presence of different predicates. Figure 11 shows the average execution time of the selected queries for both DBpedia versions (the results for LUBM4K and SAFE are presented in the Appendix). Note that, as shown in the previous section, the integration into a new HDT takes non-negligible time. Thus, for HDTcrypt-A, HDTcrypt-B and HDTcrypt-D we follow the strategy where queries are executed locally on each dictionary and triple component. In contrast, query execution in HDTcrypt-C would require the D-union of all the dictionaries to be created, which is equivalent to integrating HDTcrypt-C into a new HDT to be queried. As such, the performance time for HDTcrypt-C is presented as the sum of the time taken to create one integrated HDT (performed once), as explained in Section 6.4.1, and the time to subsequently query the integrated HDT (note that the latter is equivalent to querying HDTcrypt-A).
Regarding the comparison between our partitioning strategies, results show that HDTcrypt-A and HDTcrypt-B have the best performance for all patterns. This can be attributed to the fact that they benefit from efficient Bitmap Triples indexes, while HDTcrypt-D must use Plain Triples (as stated in Section 5.4), which perform sequential scans to resolve queries. Note that HDTcrypt-D is only competitive for (?p?) queries (i.e. retrieving all subjects and objects related to a given predicate), given that most of the triples are returned and the total time is similar to that of a full sequential scan; in addition, Bitmap Triples indexes are less efficient for such query types [25]. As previously stated, HDTcrypt-C behaves as HDTcrypt-A, but there is a once-off overhead associated with merging all dictionary and triple components into one HDT (represented in red in Fig. 14).
In turn, it is also worth mentioning that HDTcrypt-B query performance is closest to the baseline HDTcrypt-A in the scenario with 6 subgraphs. This is mainly due to the larger number of dictionary/triple components to be queried in scenarios with more subgraphs (as shown in Table 4), which penalises the HDTcrypt-B and HDTcrypt-D methods. In this scenario, HDTcrypt-A is the most efficient approach for query execution. The noticeable performance difference with respect to the other partitioning approaches suggests that the once-off merging required for HDTcrypt-C can be amortised if the dataset is meant for intensive querying after decryption.
Summary of the performance of the different strategies, where ⋆ ⋆ ⋆ stands for the best performance
Strategy | Comp. & encryp.: time / size / preconditions | Decryp. & decomp.: time | Querying: time / preconditions
crypt-A | ⋆ ⋆ / ⋆ / none | ⋆ | ⋆ ⋆ ⋆ / none
crypt-B | ⋆ / ⋆ ⋆ / HDT of full graph G | ⋆ ⋆ | ⋆ ⋆ / none
crypt-C | ⋆ ⋆ ⋆ / ⋆ ⋆ ⋆ / HDT of full graph G | ⋆ ⋆ ⋆ | ⋆ ⋆ / once-off integration into a new HDT
crypt-D | ⋆ / ⋆ ⋆ ⋆ / HDT of full graph G | ⋆ ⋆ | ⋆ / none
Influence of the increasing number of subgraphs and duplicates on the performance of the different strategies (from very positive to very negative)
Strategy | Comp. & encryp. time | Comp. & encryp. size | Decryp. & decomp. time | Querying time
crypt-A | − | − | − | −
crypt-B | − | − | − |
crypt-C | − | − | − |
crypt-D | − | | |
Discussion of the results
Overall, our empirical evaluation shows interesting results and allows us to draw conclusions on the applicability of each strategy. We summarise a ranking of the results for each scenario in Table 7, and we outline the influence of an increasing number of subgraphs and duplicates in Table 8. In more detail:
HDTcrypt-C is the most effective technique in terms of compression and decompression times, as well as compression sizes. In particular, it achieves an additional 26–33% space saving over the already compressed baseline (HDTcrypt-A), and it is 37–40% faster to compress and 30% faster to decompress. Note that the impact of these space and time savings is even more noticeable when dealing with big data. As noted, if the original HDT of the full graph is not available beforehand, then the creation of HDTcrypt-C can take more time than the baseline (approximately the same time in LUBM, and 35–50% slower in DBpedia, with its rich dictionary of terms), but it retains the aforementioned space savings. In the extreme case of isolated subgraphs with few duplicates, as in SAFE, HDTcrypt-C takes the same space as the baseline and is still 18% faster to encrypt.
In contrast, HDTcrypt-C does not allow the user to directly query the exchanged information. If such a scenario is required, this can be solved with a once-off conversion from HDTcrypt-C to HDTcrypt-A. This conversion can be done for any strategy, but it is indeed faster for HDTcrypt-C.
HDTcrypt-B and HDTcrypt-D also reduce the size of the baseline (HDTcrypt-A), and can be directly queried. Results show that HDTcrypt-B and HDTcrypt-D gain 6–24% and 24–26% space over the baseline respectively, at the cost of an extra 68% and 23% compression time (spent only once by the data publisher). In turn, their decompression times outperform the baseline by 7% and 9% respectively. In the extreme case of isolated subgraphs with few duplicates, as in SAFE, HDTcrypt-B and HDTcrypt-D suffer a slight space overhead (15–23%) over the baseline, as well as non-negligible additional decompression times.
The performance of directly querying several subgraphs in HDTcrypt-B is close to the baseline HDTcrypt-A. Nonetheless, it is penalised by a larger number of partitions (such as 12 in our experiments) and a larger number of duplicates (such as our even distribution in DBpedia). HDTcrypt-D suffers from the additional problem of performing sequential scans, and is only competitive for queries that retrieve a large number of results.
Encryption and decryption times are almost negligible compared to the compression/decompression counterparts.
Compression sizes, compression and decompression times show linear growth with increasing dataset size.
In general, an increasing number of subgraphs leads to more duplicates and larger space savings of our novel HDTcrypt-B, HDTcrypt-C and HDTcrypt-D partitioning approaches over the baseline HDTcrypt-A. In turn, smaller file sizes result in faster decompression for our novel approaches. In contrast, compression time is penalised, given that more components have to be generated. Our experiments also show that the number of subgraphs does not have a strong influence on query performance, but a skewed distribution of sizes and a large number of components (such as in DBpedia) can result in slight differences between scenarios.
Integration of the dictionary and triple components into one HDT per subgraph.
Conclusions and future work
To date, Linked Data publishers have focused on exposing and linking open data; however, the Linked Data infrastructure could be extended to cater for the storage and exchange of confidential data. In this paper, we discussed how HDT compression can be extended to cater for RDF datasets that need to be encrypted. Specifically, we proposed a number of compatible partitioning strategies and demonstrated the need for careful integration when it comes to compressed encrypted RDF data. From our evaluation we can see that our proposal HDTcrypt-C outperforms the other partitioning strategies both in compression and decompression time, and it also produces the most compact representation, resulting in 26–31% space savings over the already compressed baseline. HDTcrypt-B and HDTcrypt-D also reduce the size of the baseline significantly. When it comes to querying, HDTcrypt-A and HDTcrypt-B outperform HDTcrypt-C, which incurs additional overhead as the dictionaries and triples need to be integrated in order to support querying. Additionally, we note that compression, decompression and query performance are influenced both by the number of access-restricted subgraphs and by the distribution of triples across subgraphs, especially in HDTcrypt-D. In future work, we plan to extend our work to cater for querying over encrypted compressed data without the need for decryption. Our current work considers basic triple pattern resolution, while the HDT approach can be used as the basic engine to resolve full SPARQL queries; we plan to support this on compressed and encrypted data as well.
Acknowledgements
Supported by the European Union’s Horizon 2020 research and innovation programme under grant 731601, Austrian Science Fund (FWF): M1720-G11, the Austrian Research Promotion Agency (FFG) under grant 845638 (SHAPE) and grant 861213 (CitySpin), and MINECO-AEI/FEDER-UE ETOME-RDFD3:TIN2015-69951-R. Axel Polleres was supported by the “Distinguished Visiting Austrian Chair” program as a visiting professor hosted at The Europe Center and the Center for Biomedical Research (BMIR) at Stanford University. Special thanks to Ratnesh Sahay for providing the SAFE dataset.
Additional performance results
This appendix comprises the performance results for all datasets. See Section 6 for a description of the corpus and the complete discussion of results.
References
1.
S.Álvarez-García, N.Brisaboa, J.D.Fernández, M.A.Martínez-Prieto and G.Navarro, Compressed vertical partitioning for efficient RDF management, Knowl. Inf. Syst.44(2) (2015), 439–474. doi:10.1007/s10115-014-0770-y.
2.
W.Beek, L.Rietveld, H.R.Bazoobandi, J.Wielemaker and S.Schlobach, LOD laundromat: A uniform way of publishing other people’s dirty data, in: Proc. of ISWC, LNCS, Vol. 8796, Springer, 2014, pp. 213–228.
3.
M.Bellare, A.Boldyreva and A.O’Neill, Deterministic and efficiently searchable encryption, in: Proc. of CRYPTO, Springer, 2007, pp. 535–552.
O.Curé, G.Blin, D.Revuz and D.C.Faye, WaterFowl: A compact, self-indexed and inference-enabled immutable RDF store, in: Proc. of ESWC, LNCS, Vol. 8465, 2014, pp. 302–316.
6. R. Curtmola, J. Garay, S. Kamara and R. Ostrovsky, Searchable symmetric encryption: Improved definitions and efficient constructions, in: Proc. of CCS, ACM, 2006, pp. 79–88.
7. J. Daemen and V. Rijmen, The Design of Rijndael: AES – The Advanced Encryption Standard, Springer, 2013. ISBN 3540425802.
8. J.D. Fernández, S. Kirrane, A. Polleres and S. Steyskal, Self-enforcing access control for encrypted RDF, in: Proc. of ESWC, 2017.
9. J.D. Fernández, M.A. Martínez-Prieto, C. Gutiérrez and A. Polleres, Binary RDF representation for publication and exchange (HDT), W3C Member Submission, 2011, https://www.w3.org/Submission/HDT/.
10. J.D. Fernández, M.A. Martínez-Prieto, C. Gutiérrez, A. Polleres and M. Arias, Binary RDF representation for publication and exchange, J. Web Semant. 19 (2013), 22–41. doi:10.1016/j.websem.2013.01.002.
11. C. Gentry, Fully homomorphic encryption using ideal lattices, in: Proc. of STOC, ACM, 2009, pp. 169–178.
12. S. Gerbracht, Possibilities to encrypt an RDF-graph, in: Proc. of ICTTA, IEEE, 2008, pp. 1–6.
13. M. Giereth, On partial encryption of RDF-graphs, in: Proc. of ISWC, LNCS, Vol. 3729, Springer, 2005, pp. 308–322, http://dblp.uni-trier.de/db/conf/semweb/iswc2005.html#Giereth05. ISBN 3-540-29754-5.
14. M. Giereth, PRE4J – A partial RDF encryption API for Jena, in: Proc. of the Jena User Conference, 2006, http://jena.hpl.hp.com/juc2006/proceedings/giereth/paper.pdf.
15. Y. Guo, Z. Pan and J. Heflin, LUBM: A benchmark for OWL knowledge base systems, J. Web Semant. 3(2) (2005), 158–182. doi:10.1016/j.websem.2005.06.005.
16. C. Gutiérrez, C. Hurtado, A.O. Mendelzon and J. Pérez, Foundations of semantic web databases, J. Comput. Syst. Sci. 77 (2011), 520–541.
A. Hernández-Illera, M.A. Martínez-Prieto and J.D. Fernández, Serializing RDF in compressed space, in: Proc. of the Data Compression Conference (DCC), IEEE, 2015, pp. 363–372. doi:10.1109/DCC.2015.16.
19. A. Hogan, M. Arenas, A. Mallea and A. Polleres, Everything you always wanted to know about blank nodes, J. Web Semant. (2014), 42–69.
20. A. Joshi, P. Hitzler and G. Dong, Logical linked data compression, in: Proc. of ESWC, LNCS, Vol. 7882, Springer, 2013, pp. 170–184.
21. A. Kasten, A. Scherp, F. Armknecht and M. Krause, Towards search on encrypted graph data, in: Proc. of PrivOn, Vol. 1121, CEUR-WS.org, 2013, pp. 46–57.
22. J. Katz, A. Sahai and B. Waters, Predicate encryption supporting disjunctions, polynomial equations, and inner products, J. Cryptology (2013), 1–34.
23. Y. Khan, M. Saleem, M. Mehdi, A. Hogan, Q. Mehmood, D. Rebholz-Schuhmann and R. Sahay, SAFE: SPARQL federation over RDF data cubes with access control, J. Biomed. Semant. 8(1) (2017), 5. doi:10.1186/s13326-017-0112-6.
24. S. Maneth and F. Peternek, Grammar-based graph compression, Inf. Syst. 76 (2018), 19–45. ISSN 0306-4379. doi:10.1016/j.is.2018.03.002.
25. M.A. Martínez-Prieto, M. Arias and J.D. Fernández, Exchange and consumption of huge RDF data, in: Proc. of ESWC, LNCS, Vol. 7295, Springer, 2012, pp. 437–452.
26. M.A. Martínez-Prieto, J.D. Fernández and R. Cánovas, Querying RDF dictionaries in compressed space, SIGAPP Appl. Comput. Rev. 12(2) (2012), 64–77. doi:10.1145/2340416.2340422.
27. J.Z. Pan, J.M. Gómez-Pérez, Y. Ren, H. Wu and M. Zhu, SSP: Compressing RDF data by summarisation, serialisation and predictive encoding, Technical report, K-Drive, 2014.
28. R. Pichler, A. Polleres, S. Skritek and S. Woltran, Complexity of redundancy detection on RDF graphs in the presence of rules, constraints, and queries, Semantic Web J. 4(4) (2013).
29. R.A. Popa, N. Zeldovich and H. Balakrishnan, CryptDB: A practical encrypted relational DBMS, Technical report MIT-CSAIL-TR-2011-005, MIT, 2011.