Abstract
RDF triplestores have become an appealing option for storing and publishing humanities data, but available technologies for querying this data have drawbacks that make them unsuitable for many applications. Gravsearch (Virtual Graph Search), a SPARQL transformer developed as part of a web-based API, is designed to support complex searches that are desirable in humanities research, while avoiding these disadvantages. It does this by introducing server software that mediates between the client and the triplestore, transforming an input SPARQL query into one or more queries executed by the triplestore. This design suggests a practical way to go beyond some limitations of the ways that RDF data has generally been made available.
Introduction
Gravsearch transforms SPARQL queries and results to facilitate the use of humanities data stored as RDF.1 The source code of the Gravsearch implementation, and its design documentation, are freely available online; see
A Gravsearch query is a virtual SPARQL query, i.e. it is processed by a server application, which translates it into one or more SPARQL queries to be processed by the triplestore. Therefore, it can offer support for data structures that are especially useful in the humanities, such as text markup and calendar-independent historical dates, and are not included in RDF standards. More generally, a Gravsearch query can use data structures that are simpler than the ones used in the triplestore, thus improving ease of use. A virtual query also allows the application to filter results according to user permissions, enforce the paging of results to improve scalability, take into account the versioning of data in the triplestore, and return responses in a form that is more convenient for web application development. The input SPARQL is independent of the triplestore implementation used, and the transformer backend generates vendor-specific SPARQL as needed, taking into account the triplestore’s implementation of inference, full-text searches, and the like. Instead of simply returning a set of triples, a Gravsearch query can produce a JSON-LD response whose structure facilitates web application development.
The current implementation of Gravsearch is closely tied to a particular application, but its design is of more general interest to developers of RDF-based systems. It demonstrates that, for an application that manages data in an RDF triplestore and provides a web-based API, there is considerable value in a SPARQL-based API, which accepts SPARQL queries from the client, transforms them in the application, runs the transformed queries in the triplestore, and transforms the results before returning them to the client. It also suggests solutions to some of the practical issues that arise in the development of such a system. Consideration of this approach may lead others to develop similar application-specific tools, or to envisage more generic approaches to SPARQL transformation.
Gravsearch has been developed as part of Knora (Knowledge Organization, Representation, and Annotation), an application developed by the Data and Service Center for the Humanities (DaSCH) [22] to ensure the long-term availability and reusability of research data in the humanities. The DaSCH as an institution is responsible for research data created by humanities projects. However, this does not rule out the possibility that the technical solutions developed by the DaSCH are useful to other fields of research as well.
It is not feasible or cost-effective for the DaSCH to maintain a multitude of different data formats and storage systems over the long term. Moreover, part of the DaSCH’s mission is to make all the data it stores interoperable, so that it is possible to search for data across projects and academic disciplines in a generic way, regardless of the specific data structures used in each project. For example, many projects store text with markup, and it should be possible to search the markup of all these texts, regardless of whether they are letters, poems, books, or anything else. When markup contains references to other data, it is useful for humanities researchers to search
The DaSCH must also mitigate the risks associated with technological and institutional change. The more its data storage is based on open standards, the easier it will be to migrate data to some other format if it becomes necessary to do so in the future.
With these requirements in mind, the DaSCH has chosen to store most data in an RDF triplestore. RDF’s flexibility allows it to accommodate all sorts of data structures. Its standardisation, along with the variety of triplestore implementations available, helps reduce the risks of long-term preservation. Even if RDF technology disappears over time, the RDF standards will make it possible to migrate RDF graphs to other types of graph storage.
Knora is therefore based on an RDF triplestore and a base ontology. The base ontology defines basic data types and abstract data structures; it is generic and does not make any assumptions about semantics.5 For details of the Knora base ontology, see the documentation at
Knora is a server application written in Scala; it provides a web-based API that allows data to be queried and updated, and supports the creation of virtual research environments that can work with heterogeneous research data from different disciplines.6 For more information on Knora, see
The DaSCH’s mission of long-term preservation requires it to minimise the risk of vendor lock-in. Knora must therefore, as far as possible, avoid relying on any particular triplestore implementation, or on any type of data storage that has not been standardised and does not have multiple well-maintained implementations. This means that as long as there is no widely implemented RDF standard for some of the DaSCH’s requirements, such as access control and versioning, Knora has to handle these tasks itself to ensure consistent behaviour, regardless of which triplestore is used. (One exception is full-text search, which is implemented by the triplestore but is not standardised.) For an overview of different approaches to RDF access control, see [17]. See The Fundamentals of Semantic Versioned Querying for a proposed versioning extension to RDF. The RDF*/SPARQL* reification proposal [13] may also lead to new ways of solving these problems.
Knora’s API required a query language that allows for complex searches in specific projects, as well as cross-project searches using ontologies shared by different projects, such as
The query language also needs to allow queries to use data structures that are simpler than the ones stored in the triplestore. The DaSCH’s requirement to have a generic storage system for humanities data, as outlined in Section 1.1, introduces complexities in the RDF data structures used in the triplestore, and these complexities should be hidden from clients for the sake of usability.
For example, in humanities research, it is useful to search for dates independently of the calendar in which they are written in source materials. If an ancient Chinese manuscript records a solar eclipse or a sighting of a comet, and an ancient Greek manuscript records the same astronomical event at the same time, a search for texts mentioning the event and the date should find both texts, even though the date was written in different calendars. Astronomers and historians have long used a calendar-independent representation of dates, called Julian Day Numbers (JDNs), to facilitate such comparisons [11,19]. A Julian Day Number is an integer, and can therefore be efficiently compared in a database query. Moreover, historical dates are often imprecise, e.g. when only the year is known, but not the month or day, or when a date is known only to fall within a certain range.
Knora therefore stores each date in the triplestore as an instance of its Projects such as
JDNs are clearly not a convenient representation for clients to use. Search requests and responses should instead use whatever calendars the client prefers. The client should not have to send or receive JDNs itself, or to deal with the complexities of comparing JDN ranges in SPARQL. In the Knora API, all dates are input and output as calendar dates; Knora converts between JDNs and calendar dates using the International Components for Unicode library.
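To make the idea of a calendar-independent date concrete, the following minimal Python sketch converts a Gregorian calendar date to a JDN with a standard integer-arithmetic formula, and represents a date known only to the year as a range of JDNs. It is an illustration only: Knora itself relies on the International Components for Unicode library, and the names `gregorian_to_jdn`, `JdnRange` and `year_to_range` are hypothetical.

```python
from dataclasses import dataclass


def gregorian_to_jdn(year: int, month: int, day: int) -> int:
    """Convert a (proleptic) Gregorian calendar date to a Julian Day Number."""
    a = (14 - month) // 12        # 1 for January and February, otherwise 0
    y = year + 4800 - a           # shifted year, counted from March
    m = month + 12 * a - 3        # March = 0, ..., February = 11
    return (day + (153 * m + 2) // 5 + 365 * y
            + y // 4 - y // 100 + y // 400 - 32045)


@dataclass(frozen=True)
class JdnRange:
    """Hypothetical container: an imprecise date as an inclusive JDN range."""
    start: int
    end: int


def year_to_range(year: int) -> JdnRange:
    """A date known only to the year spans 1 January to 31 December."""
    return JdnRange(gregorian_to_jdn(year, 1, 1),
                    gregorian_to_jdn(year, 12, 31))
```

Under these assumptions, `gregorian_to_jdn(1700, 1, 1)` evaluates to 2341973, which agrees with the value used in the date comparison example later in this paper, and `year_to_range(1700)` covers the 365 days of that year as two integers that a database can compare efficiently.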
Another example concerns text with markup, which Knora can store as RDF data (see Section 3.3). A humanities researcher might wish to search a large number of texts for, say, a particular word marked up as a noun. Knora could optimise this search by using a full-text search index. This also involves rather complex SPARQL, which is partly specific to the type of full-text indexing software being used. The client should not have to deal with these details.
SPARQL lacks other features required in this context. Knora must restrict access to data according to user permissions. It also implements a system for versioning data in the triplestore, such that the most recent version is returned by default, but the version history of resources can be requested. To improve scalability, Knora should enforce the paging of search results, rather than leaving this up to the client as in SPARQL.
One way of making RDF data publicly available and queryable is by means of a SPARQL endpoint backed directly by a triplestore. Prominent examples include DBpedia [7], Wikidata [28], and Europeana [8]. While this approach offers great flexibility and allows for complex queries, its drawbacks have been criticised. In a widely cited blog post [21], Dave Rogers argues that SPARQL endpoints are an inherently poor design that cannot possibly scale, and that RESTful APIs should be used instead. For example, a SPARQL endpoint allows a client to request all the data in the repository; this could easily place unreasonable demands on the server, particularly if many such requests are submitted concurrently.
GraphQL [12] is a newer development and – despite its name – not restricted to graph databases. It is meant to be a query language that integrates different API endpoints. Instead of making several requests to different APIs and processing the results individually, GraphQL is intended to allow the client to make a single request that defines the structure of the expected response. (See Section 2.1 for a discussion of extensions to GraphQL for querying RDF.)
From our perspective, triplestore-backed SPARQL endpoints and GraphQL both have limitations that make them unsuitable for Knora and for humanities data in general. They assume that the data structures in the triplestore are the same as the ones to be returned to the client. They offer no standard way to restrict query results according to the client’s permissions. They do not enforce the paging of results, but leave this to the client. And they provide no way to work with data that has a version history (so that ordinary queries return only the latest version of each item). These requirements led us to develop a different approach.
A hybrid between a SPARQL endpoint and a web API
One option would be to create a domain-specific language, but it was simpler to use SPARQL, leveraging its standardisation and library support, while integrating it into Knora’s web API. Gravsearch therefore accepts as input a subset of standard SPARQL query syntax, and requires queries to follow certain additional rules.
Gravsearch is thus a hybrid between a SPARQL endpoint and a web API, aimed at combining the advantages of both. By supporting SPARQL syntax, it enables clients to submit queries based on complex graph patterns given in a
This extra layer of processing enables Gravsearch to avoid the disadvantages of SPARQL endpoints backed directly by a triplestore, and to provide additional features. Certain data structures can be queried in a more convenient way, results are filtered according to the user’s permissions, the versioning of data in the triplestore is taken into account (only the most recent version of the data is returned), and scalability is improved by returning results in pages of limited size.
Scope of Gravsearch
Gravsearch was developed as part of Knora and is used as part of the Knora API. Knora is an integrated system, taking care of creating, updating and reading data by mediating between the triplestore and the client. Gravsearch is not meant to replace direct communication with a triplestore via a SPARQL endpoint, but rather to provide an integrated system with a flexible way of querying data. The result of a Gravsearch query is returned in the same format used by the rest of the Knora API, making it suitable for web applications.
HyperGraphQL [14], an extension to GraphQL, makes it possible to query SPARQL endpoints using GraphQL queries, by converting them to SPARQL. Its intended advantages include the reduction of complexity on the client side and a more controlled way of accessing a SPARQL endpoint, avoiding some of the problems discussed in Rogers’s blog post [21]. A similar approach has been taken with GraphQL-LD [25], differing from HyperGraphQL in that GraphQL queries are translated to SPARQL on the client side. We think that both HyperGraphQL and GraphQL-LD could be viable approaches to generating SPARQL that is further processed before being sent to the triplestore, thus integrating with Gravsearch. We consider this a promising avenue for future development.
When the full flexibility of SPARQL is needed, projects always have the possibility to make their data openly accessible via a SPARQL endpoint. In such a case, a named graph can be made available for queries but not for updates, and possibly transformed into a simpler data model than the one used internally, in order to facilitate integration with other data sets.
It is true that, even with pagination, it is possible to write ‘inefficient or complex SPARQL that returns only a few results’ [21], and this could also be true of Gravsearch queries. As a last resort, some triplestores provide a way to set arbitrary limits on execution time or the number of triples or rows returned, but this still means consuming significant resources before rejecting the query. Moreover, setting a limit on the number of triples returned by a
It would be better to reject a badly written query without running it, and users would appreciate error messages explaining what needs to be changed to improve the query. The presence of an application layer that analyses and transforms the input query provides opportunities to do this. This is why, for example, Gravsearch currently does not allow subqueries. It would also be possible to reject a triple pattern like
Although
Ontology schemas
A design goal of Gravsearch is to enable queries to work with data structures that are simpler than the ones actually used in the triplestore, thus hiding some complexity from the user. To make this possible, Knora implements
Knora’s built-in ontologies as well as all project-specific ontologies are available in each schema, but in the triplestore, all ontologies are stored only in the internal schema; the other schemas are generated on the fly as needed. An ontology always has a different IRI in each schema, but the triplestore sees only the IRIs for the internal schema. In Knora, the IRIs of built-in and project-specific ontologies must conform to certain patterns; this allows Knora to convert ontology IRIs automatically, following simple rules, when converting ontologies and data between schemas. An additional convention is that Knora’s base ontology is called
In the internal schema, the smallest unit of research data is a Knora
For clients that need a read-only view of the data, Knora provides a
Gravsearch queries can thus be written in either of Knora’s external schemas, and results can also be returned in either of these schemas.
To illustrate the differences between these schemas, Listing 1 in Appendix A shows a Gravsearch query that searches for a letter in the Bernoulli-Euler Online (BEOL) project (see Section 3), using the complex schema. The query searches for a letter from the mathematician Anders Johan Lexell (identified by his IRI), specifying that the text of the letter must refer to a person whose birthdate is after 1706 CE. To do this, the query in Listing 1 searches the triples representing the text’s standoff markup (see Section 3.3) for a link to a
To run this query, the Gravsearch transformer generates a SPARQL
If the client requested a JSON-LD response in the complex schema, it will look like Listing 3. Here each date is represented as a calendar date, the text with its markup has been converted to an XML document (one of several possible representations that the client can request), and each link to a
This JSON-LD representation in the complex schema is designed to be convenient to use in the development of web applications. The ARK URLs in this listing can be opened in a web browser to show an example of such an application.13 For example, see
If the client requested a JSON-LD response in the simple schema, it will look like Listing 4. (The client can also request an equivalent response in Turtle or RDF/XML.) Here, the structure of the response has been simplified, so that values are represented as simple literals. For example, the object of
The simple schema also makes it possible to use standard ontologies in the Gravsearch query itself. Listing 5 shows a Gravsearch query that uses the
The query in Listing 5 specifies that the object of
In short, Gravsearch transforms queries, ontologies, and data on the fly according to the ontology schema that the client is using. The implementation is partly object-oriented (the Scala classes representing different sorts of RDF entities have methods for generating representations of themselves in different schemas) and partly based on sets of transformation rules for each schema (e.g. to remove certain properties and add others, and to change OWL cardinalities). Additional external schemas could be added in the future.
In Knora, each resource and each value has role-based permissions attached to it. Internally, permissions are represented as string literals in a compact format that optimises query performance. For example, a Knora value could contain this triple:
This means that the value can be viewed by members of the specified group. With a SPARQL endpoint backed directly by a triplestore, there would be no way to prevent other users from querying the value. Therefore, the application must filter query results according to user permissions.
To determine whether a particular user can view the value, Knora must compute the intersection of the set of groups that the user belongs to and the set of groups that have view permission on the value. If the intersection is empty, the user may not view the value, and Knora removes it from the results of the Gravsearch query.
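The permission check described here amounts to a set intersection followed by a filter. The sketch below shows that logic in Python under simplifying assumptions: the groups with view permission are taken to be already parsed out of the stored permission literal, and the field name `view_groups` and the group names are hypothetical.

```python
from typing import Dict, List, Set


def can_view(user_groups: Set[str], groups_with_view: Set[str]) -> bool:
    """A user may view a value if they share at least one group with it."""
    return len(user_groups & groups_with_view) > 0


def filter_visible(values: List[Dict], user_groups: Set[str]) -> List[Dict]:
    """Drop every value whose view groups do not intersect the user's groups."""
    return [v for v in values if can_view(user_groups, v["view_groups"])]


# Hypothetical query results, each carrying its own set of permitted groups.
values = [
    {"id": "value-1", "view_groups": {"project-member"}},
    {"id": "value-2", "view_groups": {"project-admin"}},
]
visible = filter_visible(values, user_groups={"project-member", "known-user"})
```

Here `visible` contains only `value-1`, because the user shares no group with `value-2`.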
Versioning
Internally, a resource is connected only to the current version of each of its values. Each value version is connected to the previous version via the property
Gravsearch is designed to query only current data. This is easily achieved, because the only way to obtain a value in Gravsearch is to follow the connection between the resource and the value, which is always the current version. Knora’s external ontology schemas do not expose the version history data structure at all (e.g. they do not provide the property
Gravsearch syntax and semantics
Syntactically, a Gravsearch query is a SPARQL
Results are returned by default as a JSON-LD array, with one element per search result. Each search result contains the ‘main’ or top-level resource that matched the query. If the query requests other resources that are connected to the main resource, these are nested as JSON-LD objects within the main resource. To make this possible, a Gravsearch query must specify (in the

Figure: Processing and execution of a Gravsearch query.
Gravsearch uses the SPARQL constructs
In processing Gravsearch queries, the API server is free to use a SPARQL design that best suits the performance characteristics of the triplestore. For example, as described below, our implementation transforms each input query into multiple SPARQL queries that are run in the triplestore, and generates different SPARQL for different triplestores. Clients and users need not be aware of this.
In theory, such transformations could be implemented using a more generic rule system, such as Rule Interchange Format (RIF),15
Another example of a mapping language for RDF is R2RML, which is intended for defining mappings between relational databases and RDF datasets. Similarly, EPNet [5] maps SPARQL queries onto heterogeneous queries in different sorts of databases, and Ontop [4] allows for querying relational databases with SPARQL.
Rather than introduce an additional programming language for query transformation, we chose to implement a traditional compiler design in Scala, taking advantage of the RDF4J18
For the reasons explained in Section 1.2, the generated SPARQL is considerably more complex than the provided Gravsearch query, and deals with data structures in the internal schema. Each Gravsearch query is converted to two SPARQL queries to improve performance. First, a
The prequery’s purpose is to get one page of IRIs of matching resources and values. It is a
The prequery’s result consists of the IRIs of one page of matching main resources, along with their values and linked resources. Since correct paging requires the query to return one row per matching main resource, the results are grouped by main resource IRI, and the IRIs of matching values and of linked resources are aggregated using
The following sample Gravsearch query gets all the entries with sequence number equal to 10 from different manuscripts.
The resulting prequery looks like this:
The variables in the prequery’s
Generation of the main query
The result of the prequery is a collection of IRIs of matching resources and values, grouped by main resource. The main query is then generated; it is a SPARQL
Unlike the prequery, the main query’s
The code below shows a snippet from the main query.
The main query’s results are the contents of a page of resources and values that matched the input query’s
Finally, the application orders the main query’s results according to the order in which the main resources were returned by the prequery, returning a JSON-LD array with one element per main resource.
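The final ordering step can be sketched as follows. This is a minimal Python illustration rather than Knora’s Scala implementation: it assumes that the prequery yields an ordered list of main resource IRIs and that each main query result is a JSON-LD-like dictionary identified by the standard `@id` key. The function name and the example IRIs are hypothetical.

```python
from typing import Dict, List


def order_like_prequery(page_iris: List[str],
                        main_results: List[Dict]) -> List[Dict]:
    """Return main query results in the order given by the prequery.

    Resources missing from the main results (assumed here to be possible,
    e.g. after permission filtering) are skipped.
    """
    by_iri = {resource["@id"]: resource for resource in main_results}
    return [by_iri[iri] for iri in page_iris if iri in by_iri]


# Hypothetical data: the main query returns resources in arbitrary order.
page_iris = ["http://example.org/r1", "http://example.org/r2",
             "http://example.org/r3"]
main_results = [{"@id": "http://example.org/r3"},
                {"@id": "http://example.org/r1"}]
ordered = order_like_prequery(page_iris, main_results)
```

Building a dictionary keyed by IRI keeps the reordering linear in the page size, which is bounded because the prequery returns only one page at a time.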
Type checking and inference
SPARQL does not provide type checking; if a SPARQL query uses a property with an object that is not compatible with the property definition, the query will simply return no results.
However, Gravsearch requires the types of the entities used in a query so it can generate the correct SPARQL. Specifically, if a query uses the simple schema, it needs to be expanded to work with the internal schema, by taking into account an additional layer of value entities rather than simple literal values. The compiler therefore needs to know:
The type of each variable or entity IRI used as the subject or object of a statement.
The type that is expected as the object of each property used.
For the sake of efficiency, it is desirable to obtain this information without doing additional SPARQL queries, using only the information provided in the query itself along with the available ontologies in the triplestore (which Knora keeps in memory). This approach contrasts with a mechanism such as SHACL [27], which can run SPARQL queries in the triplestore.
Gravsearch therefore implements a simple type inference algorithm, focusing on identifying the types that are relevant to the compiler. The Knora base ontology determines the set of types that the algorithm needs to identify, and thus simplifies the algorithm. For an attempt at more complete type inference for SPARQL queries, see [24].
At the end of the pipeline, each typeable entity should have exactly one type from the set of types that are relevant to the compiler. Unlike a SPARQL endpoint, if Gravsearch cannot determine the type of an entity, or finds that an entity has been used inconsistently (i.e. with two different types in that set), it returns an error message rather than an empty response.
The first type inspector reads type annotations. Two annotations are supported:
Annotations are needed when a query uses a non-Knora ontology such as
The second inspector in the pipeline infers types by using class and property definitions in ontologies. It runs each typeable entity through a pipeline of inference rules. These include rules such as the following:
The type of a property’s object is inferred from the expected object type of the property (which is specified in the definition of each Knora property).
The expected object type of a property is inferred from the type of its actual object. (Thus if the query specifies
If a
Function arguments are inferred to have the required types for the function.
Since the output of one rule may allow another rule to infer additional information, the pipeline of inference rules is run repeatedly until no new information can be determined. In practice, two iterations are sufficient for most realistic queries.
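The repeated application of rules until nothing changes is a fixed-point iteration. The Python sketch below illustrates it with only the two property-related rules listed above (object type from a property’s expected object type, and expected object type from the actual object), followed by the check that each entity ends up with exactly one type. It is a simplified model, not the Gravsearch implementation: the function names, the triple representation and the example identifiers are all hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Statement = Tuple[str, str, str]  # (subject, property, object)


def infer_types(statements: Iterable[Statement],
                property_object_types: Dict[str, str],
                annotations: Dict[str, str]):
    """Apply the two rules repeatedly until no new type is learned."""
    statements = list(statements)
    entity_types: Dict[str, Set[str]] = {e: {t} for e, t in annotations.items()}
    property_types: Dict[str, Set[str]] = {
        p: {t} for p, t in property_object_types.items()}

    changed = True
    while changed:
        changed = False
        for _subject, prop, obj in statements:
            obj_types = entity_types.setdefault(obj, set())
            prop_types = property_types.setdefault(prop, set())
            new_for_obj = prop_types - obj_types    # rule 1: property -> object
            new_for_prop = obj_types - prop_types   # rule 2: object -> property
            if new_for_obj or new_for_prop:
                obj_types |= new_for_obj
                prop_types |= new_for_prop
                changed = True
    return entity_types, property_types


def single_type(entity: str, types: Set[str]) -> str:
    """Fail loudly if an entity is untyped or used with conflicting types."""
    if len(types) != 1:
        raise ValueError(f"cannot determine a unique type for {entity}: {sorted(types)}")
    return next(iter(types))


# Hypothetical query: ?x and ?y are both objects of ex:p; only ?y is annotated.
statements = [("?a", "ex:p", "?x"), ("?b", "ex:p", "?y")]
entity_types, _ = infer_types(statements, {}, {"?y": "xsd:string"})
```

In this example the first pass learns the expected object type of `ex:p` from `?y`, and the second pass propagates it to `?x`, which shows why a single pass over the rules is not always enough.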
Appendix B shows an input query in the simple schema. It searches for books that have a particular publisher (identified by IRI), and returns them along with the family names of all the persons that have some connection with those books (e.g. as author or editor).
In this example, the definition of the property
One project that is using Gravsearch is Bernoulli-Euler Online (BEOL),21
Most of the texts that are currently integrated in the BEOL platform are letters exchanged between mathematicians. On the project’s landing page, we would like to present the letters arranged by their authors and recipients. With Gravsearch, it is not necessary to make a custom API operation for this kind of query in Knora. Instead, a Gravsearch template can be used, with variables for the correspondents.
Appendix C shows a template for a Gravsearch query that finds all the letters exchanged between two persons. Each person is represented as a resource in the triplestore. It would be possible to use the IRIs of these resources to identify mathematicians, but since these IRIs are not yet stable during development, it is more convenient to use the property
This query is simple enough to be written in the simple schema. For example, this allows the object of
Gravsearch transforms these two lines to the following SPARQL:
Since values in Knora can be marked as deleted, the generated query uses
Example 2: A user interface for creating queries
Users can also create custom queries that are not based on a predefined template. For this purpose, a user-interface widget generates Gravsearch, without requiring the user to write any code (Appendix E).
For example, a user can create a query to search for all letters written since 1 January 1700 CE (the user specifies the Gregorian calendar) by Johann I Bernoulli, that mention Leonhard Euler but not Daniel I Bernoulli, and that contain the word
In generating SPARQL to perform the requested search, the Gravsearch compiler converts the date comparison to one that uses a JDN. In the example, the input SPARQL requests a date greater than a date literal in the Gregorian calendar:
Gravsearch converts this to a JDN comparison. The Gregorian date 1 January 1700 is converted to the JDN 2341973. In the generated SPARQL, a matching date’s end point must be greater than or equal to that JDN:
Other date comparisons work as follows:
Two
Two
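The end-point comparison described above for the 1 January 1700 example can be sketched in Python as follows. The sketch assumes that each stored date is an inclusive range of JDNs; the class and field names are hypothetical, and only the single comparison spelled out in the text is modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredDate:
    """Hypothetical stand-in for a stored date: an inclusive JDN range."""
    start_jdn: int
    end_jdn: int


# JDN of 1 January 1700 in the Gregorian calendar, as given in the text.
QUERY_JDN = 2341973


def matches(stored: StoredDate, query_jdn: int) -> bool:
    """A date matches if its end point is greater than or equal to the JDN."""
    return stored.end_jdn >= query_jdn


year_1699 = StoredDate(start_jdn=2341608, end_jdn=2341972)  # the whole of 1699
year_1700 = StoredDate(start_jdn=2341973, end_jdn=2342337)  # the whole of 1700
spanning = StoredDate(start_jdn=2341608, end_jdn=2342337)   # 1699 to 1700
```

With these values, `year_1700` and `spanning` match and `year_1699` does not, since the imprecise date only needs to reach the queried day at its end point.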
To specify that the text of the letter must contain the word
Gravsearch converts this function to triplestore-specific SPARQL that (unlike the standard SPARQL
Example 3: Searching for text markup
Here we are looking for a text containing the word ‘Acta’ that refers to an article published by Jacob Bernoulli in
Knora can store text markup as ‘standoff markup’: each markup tag is represented as an entity in the triplestore, with start and end positions referring to a substring in the text. This makes it straightforward to represent non-hierarchical structures in markup (see the TEI guidelines [26, Chapter 20, Non-hierarchical Structures] for a discussion of this problem).
To search for text markup, the input query must be written in the complex schema. The input query uses the
Gravsearch translates this
An optimisation that searches in the full-text search index to find all texts containing this word.
A regular expression match that determines whether, in each text, the word is located within a substring that is marked up as a paragraph.
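The two steps above can be sketched outside the triplestore as follows. This Python illustration makes several assumptions that are not taken from Knora: a plain substring test stands in for the full-text index, tag positions are treated as 0-based with an exclusive end, and the names `StandoffTag` and `word_in_tag` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StandoffTag:
    """A markup tag stored apart from the text it annotates."""
    name: str
    start: int  # 0-based, inclusive (assumption)
    end: int    # 0-based, exclusive (assumption)


def word_in_tag(text: str, tags: List[StandoffTag],
                word: str, tag_name: str) -> bool:
    # Step 1: cheap pre-filter, standing in for the full-text index lookup.
    if word not in text:
        return False
    # Step 2: regular expression match inside each substring carrying the tag.
    # Python slices are 0-based; SPARQL's SUBSTR counts from 1, so a
    # translation to SPARQL would have to add 1 to `start`.
    pattern = re.compile(rf"\b{re.escape(word)}\b")
    return any(tag.name == tag_name and pattern.search(text[tag.start:tag.end])
               for tag in tags)


text = "Heading Acta\nBody text here."
tags = [StandoffTag("heading", 0, 12), StandoffTag("paragraph", 13, 28)]
```

In this example the word ‘Acta’ passes the pre-filter, but it is found only inside the heading, so a search restricted to paragraphs returns no match.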
The resulting generated SPARQL looks like this (Knora uses 0-based indexes in standoff markup, but SPARQL uses 1-based indexes):
To ensure the long-term accessibility of research data, a case can be made for storing most data in RDF, in a way that works with any standards-compliant triplestore, and avoiding non-standard technologies and vendor lock-in. However, this approach implies that clients cannot be given direct access to the triplestore, both because the necessary generic data structures are inconvenient for clients to use, and because features such as versioning and access control cannot be handled by the triplestore in a standard way. There are also scalability issues associated with using a SPARQL endpoint backed directly by the triplestore. Yet humanities researchers want to be able to do the sorts of powerful graph searches that they can do in SPARQL.
The solution proposed here is to let users submit SPARQL, but to a virtual endpoint (the Knora API) rather than to the triplestore. In this way, the Knora API can serve its main purpose of keeping research data available for the long term without relying on vendor-specific triplestore features, while providing flexible, controllable, and consistent access to data. By sending SPARQL queries to the Knora API, users can access only the data they have permission to see, only the current version of the data is served, and implementation details are hidden. Results are returned using a paging mechanism controlled by the Knora API. By default, the query results are returned in a tree structure in JSON-LD. This data structure is suitable for web application development, while maintaining machine-readability of the data as well as interoperability with other RDF-based tools.
Acknowledgements
This work was supported by the Swiss National Science Foundation (166072) and the Swiss Data and Service Center for the Humanities.
