Abstract
Amazon Neptune is a graph database service that supports two graph models: W3C’s Resource Description Framework (RDF) and Labeled Property Graphs (LPG). Customers choose one or the other model. This choice determines which data modeling features can be used and – perhaps more importantly – which query languages are available. The choice between the two technology stacks is difficult and time-consuming. It requires consideration of data modeling aspects, query language features, their adequacy for current and future use cases, as well as developer knowledge. Even in cases where customers evaluate the pros and cons and make a conscious choice that fits their use case, over time we often see requirements from new use cases emerge that could be addressed more easily with a different data model or query language. It is therefore highly desirable that the choice of the query language can be made without consideration of what graph model is chosen and can be easily revised or complemented at a later point. To this end, we advocate and explore the idea of OneGraph (“1G” for short), a single, unified graph data model that embraces both RDF and LPGs. The goal of 1G is to achieve interoperability at both the data level, by supporting the co-existence of RDF and LPG in the same database, and the query level, by enabling queries and updates over the unified data model with a query language of choice. In this paper, we sketch our vision and investigate technical challenges towards a unification of the two graph data models.
Introduction
The Amazon Neptune graph database service supports both the data model of the Resource Description Framework (RDF) [5] with its query language SPARQL [10] and Labeled Property Graphs (LPG) [22,24]
Note that we use the term “LPG” rather loosely, since there are differences even between different LPG implementations.
Some RDF stores like Stardog and GraphDB offer proprietary constructs for path search queries. While this overcomes some SPARQL limitations, it harms portability as these constructs are not part of the SPARQL specification.
Occasionally, we also receive requests for interoperability (i.e., cross-use) from experienced users who are proficient with RDF or LPG, or both, and from customers with use cases that are sometimes better served via RDF and at other times better served via LPG. The latter often happens when the initial requirements fall either into the data integration space (e.g., data alignment, master data management, data exchange) where RDF with standardized exchange formats and built-in ontology management capabilities has unique strengths, or in the analytics space (e.g., graph analytics, graph traversals) where the vertex/edge centric LPG abstraction with traversal-based query languages is often a more natural fit. In both scenarios, especially when graph technology is extended across departments, new use cases may suggest new and/or different modeling requirements. Concretely, it is likely that the users of a data integration use case will later ask for analytics and the users of an analytics use case will later ask for data integration.
Surveying the two data models in more detail, RDF offers a formal model that supports global identifiers (IRIs), well-defined graph merging, a natural way to break a graph into subgraphs, and query federation. This makes RDF graphs well suited to use cases where multiple independent data sources are used (see, for example, the Linked Open Data project
What if users could choose between different query languages, independent of what graph model they have decided to use? From interactions with Neptune customers we have noticed that there are often strong preferences for a particular query language, and there are also situations where one query language is simply better suited because of its particular features. For instance, expressiveness of graph traversals and path queries in Gremlin vs.
Therefore, in this paper we examine the idea of graph interoperability, that is, removing the obstacles that prevent us from using SPARQL over LPGs, Gremlin or openCypher over RDF, etc. The goal is not merely to be able to cross-use query languages, but to do so in a manner where the user does not have to be careful about how the interoperability is achieved. In other words, we are interested in a data model with a unified semantics that generalizes the specifics of the RDF and LPG data models. In addition to flexibility in choosing the query language, the idea is that this data model would also allow graph users to combine and interlink data sets maintained in both RDF and LPG formats. More generally, a unified data model removes the need for customers to choose a data format ahead of time, thereby removing a major obstacle to graph database adoption.
In this context, it should be noted that we are less interested in some kind of “qualified interoperability” where the cross-use of query languages would require one to understand the implementation of the underlying graph models. In other words, we want to stay away from, say, implementing
e.g.,
The discussion of LPGs in this paper is limited to the form of LPGs used by Gremlin and openCypher. Other graph query languages and implementations may come with additional challenges to consider. When discussing RDF, we consider features of RDF-star as well.
As a basis for the discussion in this paper, we introduce a data-model-independent notation that has an interpretation in both the RDF and the LPG world and, thus, enables us to present examples using a data-model-independent syntax. Specifically, in our examples we represent graph data via so-called statements of the following form:
In the context of RDF, such statements shall be interpreted as triples with
While we do believe that the concept of elements with identity is crucial to overcome existing gaps between RDF and LPGs, we emphasize that the statement notation with explicit SIDs used throughout the paper should not be understood as an attempt to propose this notation as a new graph data model. A complete formalization that bridges all the gaps between the triples-based view on graphs taken by RDF and the vertex/edge-centric paradigm underlying LPGs may well differ from, or go beyond, the (deliberately simple) statement notation chosen within this paper. Also, we would like to stress that this notation does not imply a necessity (neither does it suggest a preference) for a triples-based or a quads-based physical indexing scheme for graph data. The sole purpose of the statement notation is to provide a syntax that allows for a semi-formal presentation of examples, to illustrate characteristics of the two data models and the gaps between them.
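For readers who want to experiment with the notation, the following is a minimal sketch in Python. It assumes that a statement is a four-position record made up of a subject, a predicate, an object, and a statement identifier (SID). The class name `Statement`, the helper `refers_to_statement`, and all example identifiers are hypothetical; they are not part of the notation itself and imply no particular storage scheme.

```python
from typing import NamedTuple, Set


class Statement(NamedTuple):
    """One statement: three positions plus an explicit statement identifier (SID)."""
    s: str    # subject position
    p: str    # predicate position
    o: str    # object position (an identifier or a literal, kept as a string here)
    sid: str  # identifier of this statement (assumed unique within the graph)


def refers_to_statement(stmt: Statement, sids: Set[str]) -> bool:
    """True if the subject of stmt is the SID of some statement (assumed convention)."""
    return stmt.s in sids


# Hypothetical toy data: one edge and one statement about that edge.
graph = [
    Statement(":alice", ":knows", ":bob", "sid1"),
    Statement("sid1", ":since", "2020", "sid2"),
]
sids = {st.sid for st in graph}
about_statements = [st for st in graph if refers_to_statement(st, sids)]
```

In this toy graph only the second statement refers to another statement, which is the pattern that later sections interpret as an edge property in the LPG view.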
As a concrete example of the notation, consider the following three statements:
Informally, these statements represent a graph that captures a
Conceptually, we distinguish between simple statements, as statements that do not contain SIDs in the first three positions, and meta-statements, represented as statements with SIDs in the
We do not see a strong use case for supporting SIDs in the
With the statement notation at hand, we can now discuss interoperability questions by discussing possible “views” of RDF, RDF-star, and LPG over a set of statements, where a view is defined by a (possible) mapping from our notation to these data models. We sketch this idea by providing possible interpretations of our toy graph, for RDF, RDF-star, and LPG, respectively:
With LPG, meta-statements about edges are viewed as edge properties, so our LPG interpretation may consist of the following vertices and edges:
In this section, we discuss challenges to interoperability broadly, ranging from semantics to potential implementation issues. Note that the issue of schema languages is out of scope for this paper.
Challenge #1: Edge properties, multiple edge instances, and reification
One of the most fundamental (structural) differences perceived between RDF and LPGs is that RDF does not offer built-in support for edge properties (except as “statements about statements” via reification, which is generally understood to be cumbersome and inefficient). We sketched this aspect already when talking about possible interpretations for edge properties in RDF in the context of the example in Section 2. From our experience, the convenience of having edge properties as a built-in construct is one of the core strengths of LPGs over RDF.
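To make the contrast concrete, the sketch below places an LPG-style edge with a single built-in property next to the same information expressed with the standard RDF reification vocabulary (`rdf:Statement`, `rdf:subject`, `rdf:predicate`, `rdf:object`). The data, the dictionary layout of the edge, and the blank node label `_:r1` are illustrative assumptions; triples are modeled as plain Python tuples rather than with an RDF library.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def reify(s, p, o, stmt_node):
    """Return the four triples of the standard RDF reification vocabulary for (s, p, o)."""
    return [
        (stmt_node, RDF + "type", RDF + "Statement"),
        (stmt_node, RDF + "subject", s),
        (stmt_node, RDF + "predicate", p),
        (stmt_node, RDF + "object", o),
    ]


# LPG style: the property is part of the edge itself (hypothetical data).
lpg_edge = {
    "from": ":alice", "label": ":knows", "to": ":bob",
    "properties": {":since": "2020"},
}

# RDF style: the asserted triple, its reification, and the annotation
# attached to the reification node.
triples = [(":alice", ":knows", ":bob")]
triples += reify(":alice", ":knows", ":bob", "_:r1")
triples.append(("_:r1", ":since", "2020"))
```

One edge with one property thus becomes six triples in this encoding, which gives a sense of why reification is generally regarded as cumbersome.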
The RDF-star effort is adding edge properties to RDF, but not in a way that is fully aligned with LPGs. That is, in RDF-star, a quoted triple
Of course, one could easily argue that our interpretation of the above model is wrong, because there is no logical dependency between the
Prohibiting multiple edge instances would seriously hamper how LPGs are used. There is a clear use case behind the distinction of the two edges in the example above. So, it would make sense to capture this expressiveness in a potential 1G model for graph database interoperability. The question then becomes what an RDF-star view (in which the scenario cannot be expressed “naturally,” with built-in mechanisms) over such data would look like. One option is to collapse multiple edge instances (in our example, the statements with
There are other open questions related to reification, apart from the edge identity challenge discussed so far. Those include aspects such as multi-level reification (which, in contrast to the scenario above, can be expressed with RDF-star but not with LPGs), the ability to reference statements in the
We have submitted some use case examples to W3C that we believe should be considered in the eventual RDF-star specification [14].
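The edge identity issue discussed in this section can be illustrated with a small sketch. It assumes a view in which annotations attach to the triple (s, p, o) itself rather than to an individual edge instance, so that multiple edge instances are collapsed. The data and the function `collapse` are hypothetical and show only one of several possible designs.

```python
from collections import defaultdict

# Hypothetical LPG data: two distinct edges with the same endpoints and label,
# each carrying its own properties.
edges = [
    {"id": "e1", "from": ":alice", "label": ":worksFor", "to": ":acme",
     "props": {":since": "2015", ":role": "engineer"}},
    {"id": "e2", "from": ":alice", "label": ":worksFor", "to": ":acme",
     "props": {":since": "2019", ":role": "manager"}},
]


def collapse(edges):
    """Group edge properties by (from, label, to), discarding edge identity."""
    view = defaultdict(lambda: defaultdict(set))
    for e in edges:
        key = (e["from"], e["label"], e["to"])
        for name, value in e["props"].items():
            view[key][name].add(value)
    return view


view = collapse(edges)
key = (":alice", ":worksFor", ":acme")
```

After collapsing, `view[key]` holds two values for `:since` and two for `:role`, but the pairing between them (which role started in which year) can no longer be recovered from the view alone.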
At the very basic level, RDF graphs are defined as sets of triples, whereas LPGs are defined as (optionally labeled) vertices with properties that can be connected via labeled edges. The statement notation used in this paper gives us a straightforward mapping to RDF triples: in the absence of reification scenarios discussed in the previous section (i.e., in cases without meta-statements that have SIDs in their
Note that RDF, effectively, does not have vertices that could exist in such a “stand-alone” sense. Instead, there is an infinite space of identifiers, any of which could be used as vertices in edges. This is not merely a philosophical notion, as we now see.
Several other interesting questions related to the notions of vertices, edges, vertex properties, and edge properties arise. For instance, Gremlin supports queries to enumerate all vertices using the expression
A potential 1G model requires a unified type system over RDF and LPGs. RDF builds upon the XML Schema definition of datatypes and utilizes primitive XML Schema datatypes such as strings, numbers, and dates [2]. Since RDF is defined along an Open World paradigm, its datatypes tend to be more extensible and flexible than the types of values in LPGs. One specific aspect where RDF differs from LPG type systems is the absence of validation. For instance, nothing prevents users from adding ill-typed values such as
The type system for LPGs, on the other hand, is not defined formally (to the best of our knowledge). Semantics of datatypes in LPGs are opaque and are typically “delegated” to the underlying implementation language, making it potentially hard to unify graph representation. Generally, we have a menagerie of datatypes to reconcile, and needless “baggage” because of the reliance on implementation languages. The challenge for a unifying graph data model is to define a meta type system that captures and aligns these different types, and gives them a concrete semantics. In contrast to RDF, however, LPGs typically support different composite datatypes (e.g., lists and sets) as built-in types. In other words, while RDF uses the graph structure to model composite types, in LPGs a property value itself may be an instance of a composite type. This feature reflects the general notion of semi-structured JSON documents as property values, and those composite types are also an integral part of LPG query languages.
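The difference between the two treatments of composite types can be sketched as follows. The LPG side stores a list directly as a property value, while the RDF side spells the same list out as graph structure using the standard collection vocabulary (`rdf:first`, `rdf:rest`, `rdf:nil`). The vertex, the property name, and the blank node labels are illustrative assumptions, and triples are again modeled as plain tuples.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def rdf_list(items, prefix="_:n"):
    """Encode a Python list as an RDF collection; return (head node, triples)."""
    if not items:
        return RDF + "nil", []
    nodes = [f"{prefix}{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        # The last cell points to rdf:nil, every other cell to its successor.
        rest = nodes[i + 1] if i + 1 < len(items) else RDF + "nil"
        triples.append((nodes[i], RDF + "first", item))
        triples.append((nodes[i], RDF + "rest", rest))
    return nodes[0], triples


# LPG style: the list is the property value itself (hypothetical vertex).
vertex = {"id": "v1", "properties": {"nicknames": ["Al", "Ally"]}}

# RDF style: the same list expressed through graph structure.
head, triples = rdf_list(vertex["properties"]["nicknames"])
triples.append((":v1", ":nicknames", head))
```

A two-element list already requires five triples on the RDF side, whereas the LPG side keeps it as a single property value.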
One possible approach to align the type systems of RDF and LPGs more closely – with a focus on composite types – would be to leverage the notion of user-defined literals of RDF, which is an extension mechanism that can be used to syntactically represent arbitrary (simple and complex) types. As an alternative to the structural approach taken by RDF, a list could be represented as a literal whose lexical form is, say,
The syntax makes no assertion about how the data are represented internally.
RDF defines the notion of named graphs [5, Section 4], which are often used to support subgraph management use cases. Named graphs are sometimes thought of as an extension of the triple model to a quad model with the addition of a (sub)graph identifier. Some users have chosen to treat named graphs as containers (sometimes of a single triple) to make “statements about statements” (or sets of statements) in lieu of using the reification mechanism. This approach is, however, outside the formal semantics of RDF, since named graphs do not have any semantic theory associated with them in the RDF specification.
The absence of an explicit subgraph container mechanism in LPGs raises the question whether it is possible to map the concept of named graphs to existing LPG constructs. To sketch one possible approach, we extend our statement mechanism by a dedicated graph membership relation, where
The motivation behind this approach is to restore symmetry to the data model instead of privileging named graphs as somehow special, treating named graphs as an application of the SID, much like aspects in [15]. Note, however, that such a representation does not dictate how a database system physically organizes the graph (i.e., a physical storage scheme may account for the special role of named graphs and encode them in a more compact form).
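A minimal sketch of this idea, assuming statements are four-position tuples (subject, predicate, object, SID) and using `:inGraph` as a purely hypothetical name for the graph membership relation, could derive an RDF-style quad view as follows. It says nothing about how a database would store such data physically.

```python
# Statements as (s, p, o, sid); ":inGraph" is a hypothetical membership relation.
statements = [
    (":alice", ":knows", ":bob", "sid1"),
    (":bob", ":knows", ":carol", "sid2"),
    ("sid1", ":inGraph", ":g1", "sid3"),
    ("sid2", ":inGraph", ":g2", "sid4"),
]


def quads(statements, membership=":inGraph"):
    """Derive a quad view (s, p, o, g) from membership statements over SIDs."""
    by_sid = {sid: (s, p, o) for s, p, o, sid in statements}
    result = []
    for s, p, o, _ in statements:
        # A membership statement has the SID of another statement as its subject.
        if p == membership and s in by_sid:
            result.append(by_sid[s] + (o,))
    return result


named_graph_view = quads(statements)
```

Because membership is just another statement about a SID, nothing in this representation prevents one statement from belonging to several graphs; whether that should be allowed is a design decision the sketch leaves open.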
Looking at named graphs from this perspective reveals another idea: in Section 3.1 we used exactly the same pattern of SIDs in the
Challenge #5: Graph merging and external identifiers
RDF has a definition for graph merging [12, Section 4.1], which is one of the distinguished benefits of RDF and, conversely, one of the weakest aspects of LPGs. Whenever multiple data sources are used, particularly from multiple organizations, graph merging is a key functionality. IRIs as global external identifiers are an essential part of this mechanism. Of course, it is possible to use IRIs as identifiers in LPGs, but there is more to graph merging: edge properties and particularly multiple edge instances (cf. Section 3.1) complicate any merging semantics. Specifically, under which conditions may two “similar” edges in two graphs that are to be merged be considered the “same” edge?
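The question of edge sameness can be made tangible with a small sketch. For ground RDF triples (ignoring blank nodes, which require extra care), merging amounts to a set union. For LPG edges, the outcome depends on which notion of sameness is chosen. The two predicates `same_loose` and `same_strict` and all data are hypothetical; neither is proposed as the right answer.

```python
# RDF: merging ground triples is a set union, so identical triples coincide.
rdf_a = {(":alice", ":knows", ":bob")}
rdf_b = {(":alice", ":knows", ":bob"), (":bob", ":knows", ":carol")}
merged_rdf = rdf_a | rdf_b

# LPG: two "similar" edges with different local ids and properties.
lpg_a = [{"id": "e1", "from": "alice", "label": "knows", "to": "bob",
          "props": {"since": 2015}}]
lpg_b = [{"id": "e7", "from": "alice", "label": "knows", "to": "bob",
          "props": {"since": 2019}}]


def same_loose(e, f):
    """Edges are the same if endpoints and label agree."""
    return (e["from"], e["label"], e["to"]) == (f["from"], f["label"], f["to"])


def same_strict(e, f):
    """Edges are the same only if their properties agree as well."""
    return same_loose(e, f) and e["props"] == f["props"]


def merge(a, b, same_edge):
    """Add edges from b unless an edge already present is considered the same."""
    merged = list(a)
    for f in b:
        if not any(same_edge(e, f) for e in merged):
            merged.append(f)
    return merged


loose = merge(lpg_a, lpg_b, same_loose)
strict = merge(lpg_a, lpg_b, same_strict)
```

The loose policy yields one edge and silently drops the second edge's properties, while the strict policy keeps two edges; local identifiers such as `e1` and `e7` offer no help, since they carry no meaning across graphs.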
Allowing both RDF and LPG data to be represented in a single unifying model requires the co-existence of global identifiers (i.e., IRIs coming from RDF data) and local identifiers (i.e., vertex and edge identifiers from LPGs). User-defined default namespaces could then be used to expose local identifiers as if they were IRIs, whenever they are queried via SPARQL; conversely, IRIs originating from RDF data could be shortened according to existing namespace prefixes, to make querying and syntactic result representation for LPG query languages more user friendly.
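Both directions of this mapping can be sketched in a few lines. The default namespace, the prefix table, and the function names `to_iri` and `shorten` are assumptions made for illustration; a real system would also need to decide how to escape local identifiers that contain characters not permitted in IRIs.

```python
DEFAULT_NS = "http://example.org/local#"          # hypothetical default namespace
PREFIXES = {"ex": "http://example.org/vocab/"}    # hypothetical prefix table


def to_iri(local_id, ns=DEFAULT_NS):
    """Expose an LPG-local identifier as an IRI, e.g. for access via SPARQL."""
    return ns + local_id


def shorten(iri, prefixes=PREFIXES):
    """Shorten an IRI to prefix:local form if a known namespace matches."""
    # Try longer namespaces first so that nested namespaces resolve predictably.
    for prefix, ns in sorted(prefixes.items(), key=lambda kv: len(kv[1]), reverse=True):
        if iri.startswith(ns):
            return f"{prefix}:{iri[len(ns):]}"
    return iri  # no matching namespace: leave the IRI unchanged


vertex_iri = to_iri("v42")
label = shorten("http://example.org/vocab/Country")
```

Here `vertex_iri` is the local identifier `v42` lifted into the default namespace, and `label` is the shortened form `ex:Country` as it might be presented to an LPG query language.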
While such a distinction would make it possible to load data via both RDF and LPG data formats into the same logical graph, it would result in a mere co-existence, without any (initial) overlap in vertex identifiers, labels, etc. Our vision for a unifying graph data model, however, goes beyond just co-existence of local and global identifiers: users may want to unify graph elements (such as vertices and edges) originating from RDF and LPGs. Assume, for instance, a user who maintains internal data about countries, which is available as an LPG in which the countries are identified via a simple country code string. Suppose the user wants to augment this data with information coming from an RDF dump of the Geonames
Unlike SPARQL and SQL, existing LPG query languages – by and large – lack strict formal semantics (in the form of, say, a query algebra), which makes it hard to assess semantic compatibility of queries in different languages. Similarly, unlike for RDF, there does not exist a formal semantics for LPGs (or one exists only post hoc, as in [17,25]). For LPG query languages, semantics is typically defined informally, either via documentation and examples or via an implementation. The Gremlin implementation in Neptune, for instance, is based in significant part on our interpretation of the informal TinkerPop specification
In order to be able to subsume both RDF(-star) and LPGs, a unifying graph model needs to be as expressive as the “most expressive” model, in each of the considered dimensions. As we have illustrated in previous examples, certain extensions that are defined for the more expressive model may not have a natural representation in the less expressive model, thus introducing “dimensions” that are invisible when looking at the data from the less expressive model’s perspective (e.g., recall the example of multiple edge instances as discussed in Section 3.1).
While semantics of read queries can be defined unambiguously by mapping a unified data model to a lower-dimensional level, the situation becomes more complex for queries that manipulate the data. Consider, for instance, the example graph sketched in Section 3.1 and assume a SPARQL update statement (not SPARQL-star) to drop the triple (
The common root cause for the ambiguity of all these scenarios is that we request updates over a simplified, dimensionally reduced view of the data. While the discussion of all these questions goes beyond the scope of this paper, the semantics of such operations needs to be carefully designed in order to avoid unwanted side effects and to obtain a “natural” behavior in scenarios where data is queried and manipulated via different query languages.
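The ambiguity can be illustrated with a small sketch, assuming statements are four-position tuples (subject, predicate, object, SID) and that a plain triple view simply hides SIDs and statements about statements. The data and the deletion policy shown are hypothetical; the policy is one of several conceivable options, not a recommendation.

```python
# Two edge instances with the same (s, p, o), each with its own annotation.
statements = [
    (":alice", ":worksFor", ":acme", "sid1"),
    (":alice", ":worksFor", ":acme", "sid2"),
    ("sid1", ":since", "2015", "sid3"),
    ("sid2", ":since", "2019", "sid4"),
]


def triple_view(statements):
    """Reduced view: drop SIDs and hide statements whose subject is a SID."""
    sids = {sid for *_, sid in statements}
    return {(s, p, o) for s, p, o, _ in statements if s not in sids}


def delete_all_instances(statements, triple):
    """One possible policy: remove every instance of the triple, plus any
    statement whose subject is the SID of a removed instance."""
    doomed = {sid for s, p, o, sid in statements if (s, p, o) == triple}
    return [st for st in statements
            if st[3] not in doomed and st[0] not in doomed]


visible = triple_view(statements)
remaining = delete_all_instances(statements, (":alice", ":worksFor", ":acme"))
```

The reduced view shows a single triple, yet deleting it under this policy removes all four underlying statements. Alternative policies (removing only one instance, or rejecting the update as ambiguous) are equally conceivable, and the sketch cascades only one level, so deeper nesting of statements about statements would need additional handling.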
Concluding remarks and the way forward
A recent survey of organizations working on, or considering adopting, knowledge graphs found that interoperability and standards have a very high priority among survey respondents, and data integration was seen as the dominant use case [1]. These findings could be interpreted to suggest a need for RDF/LPG compatibility and unification. While it may seem that making RDF and LPGs fully compatible is not possible (as per the official RDF specifications and the emerging RDF-star work), we believe there is a way forward. Minimally, we must address the challenges of edge identity (multiple similar edges), graph merging, and well-defined update semantics across languages. One way to make progress would be to define some kind of “compatibility subset” to cover enough ground so that most RDF and LPG applications work with no or minimal modifications. Lack of interoperability slows the overall adoption of graph technologies and, thus, should be of high priority to be addressed by the graph data community.
In this paper we have sketched the 1G vision of a unifying model as a source of ideas about where to go next. The open questions we have posed imply lots of interesting research, the outcomes of which will have significant practical relevance. We look forward to working with the broader community to explore solutions to these topics.
