Abstract
Building and publishing knowledge graphs (KGs) as Linked Data, either on the Web or within private companies, has become a relevant and crucial process in many domains. This process requires users to perform a large number of tasks that conform to the life cycle of a KG, and these tasks usually involve different unrelated research topics, such as RDF materialisation or link discovery. There is already a large corpus of tools and methods designed to perform these tasks; however, the lack of a single tool that gathers them all leads practitioners to develop ad-hoc pipelines that are not generic and, thus, not reusable. As a result, building and publishing a KG has become a complex and resource-consuming process. In this paper, a generic framework called Helio is presented. The framework aims to cover a set of requirements elicited from the KG life cycle and to provide a tool capable of performing the different tasks required to build and publish KGs. Helio thereby aims at reducing the effort that practitioners must invest in this process and at preventing the development of ad-hoc pipelines. Furthermore, the Helio framework has been applied in many different contexts, from European projects to research work.
Introduction
The presence of knowledge graphs (KGs), published openly on the Web or privately as Linked Data, has grown in the last decade [57]. This growth stems from the fact that many domains demand data to be published homogeneously under a common representation, which sometimes requires translating existing heterogeneous data from a set of data sources [11]. To this end, the data of a KG can be built using Semantic Web technologies [33], like RDF, and then published following the Linked Data principles [3]. Building a KG is not a simple process, since it may involve many tasks that belong to different research topics [74]; from the translation of data into RDF using materialisers [30], to the generation of links among the resources of different KGs using link discovery tools [58].
Numerous tools aim at performing one or more of the tasks, related to these research topics, that are involved in building and publishing a KG [38]. However, these tools were designed with a narrow scope, aiming to solve a reduced set of very specific tasks, usually involving a novel research topic. As a result, many of these tools were developed for standalone use and, thus, using and coordinating different tools is, in most cases, not possible without developing custom ad-hoc code [74]. Furthermore, to the best of the authors' knowledge, no tool is able to cover all the tasks that conform to the KG life cycle and that are required for building and publishing KGs [74].
Consequently, building and publishing a KG becomes a complex and resource-consuming task that is not within the reach of all practitioners. On the one hand, practitioners must learn a wide spectrum of tools, some of which are research prototypes that are not suitable for a production environment or whose usability is hindered by the lack of fundamental documentation. On the other hand, the fact that these tools cannot be directly interconnected to work together requires practitioners to develop ad-hoc pipelines to build and publish a KG [24,81,84,85]. Developing these ad-hoc pipelines has a high cost in time and human resources, and requires long cycles of debugging and maintenance, decreasing a project's productivity.
In this paper, a framework known as Helio is presented. The goal of the framework is to provide a tool that is able to perform all the tasks required for building and publishing a KG and, in case new functionalities are required, to allow practitioners to integrate them through independent plugins without modifying the framework's source code. To ensure this goal, Helio has been developed on top of a list of requirements that support the KG life cycle [34]. These requirements profile a system that is able to assist practitioners during the whole life cycle of a KG and that publishes the KG according to the Linked Data principles [3].
The Helio architecture has a modular design that, on the one hand, allows Helio to use some existing tools to perform these tasks and, on the other hand, allows practitioners to extend the framework in order to cope with new scenarios. In addition, since plugins are highly reusable, Helio fosters their development, either for wrapping existing tools or for implementing new functionalities. As a result, plugins prevent the development of ad-hoc pipelines and allow other practitioners to cope with common scenarios without spending additional effort.
The Helio framework has been used in several contexts: A) European research projects from different domains, namely: VICINITY1
The rest of this article is structured as follows: Section 2 reports the history, motivation, and the list of requirements on top of which Helio has been built; Section 3 introduces an analysis of proposals from the literature; Section 4 presents the framework design and its architecture; Section 5 provides a discussion of the framework introduced and how it meets the elicited requirements; Section 6 reports real-world cases where Helio has been used; and, finally, Section 7 recaps our findings and conclusions.
Knowledge graphs have a well-defined and established life cycle [34], which is depicted in Fig. 1. It consists of several steps, depicted as rounded boxes, each with one or more associated tasks that should be performed in that specific step, depicted as squared boxes in Fig. 1. These tasks are usually related to one or more research problems that are either still open nowadays or lack a standard recommendation, which explains the numerous existing tools tackling the same problems. As Fig. 1 depicts, some of these tasks are also related to the Linked Data principles. The different steps of the life cycle are the following:

Fig. 1. KG life cycle [34] and related tasks.
As depicted by Fig. 1, the different steps of the life cycle imply a set of related tasks that practitioners may perform using several of the existing tools. Nevertheless, to the best of the authors' knowledge, these proposals usually focus on some tasks or steps, but they do not cover the whole KG life cycle [74]. To this end, a set of requirements has been elicited. These requirements find their origin mostly, but not only, in real-world scenarios (like those presented in Section 6). They profile a system that, by implementing them, potentially covers the whole KG life cycle. Furthermore, a system that implements these requirements builds and publishes the KG data according to the Linked Data principles, fostering the good practices promoted by the W3C. The requirements are the following:
(KG Creation) R01: The system allows practitioners to provide as input an RDF file, in all likelihood created manually, for feeding the life cycle.
(KG Creation) R02: The system provides a materialisation tool to translate the heterogeneous data, i.e., non-RDF, from a set of heterogeneous data sources into RDF.
(KG Creation) R03: The materialisation tool of the system understands more than one mapping language, reducing the chances that users need to learn a new mapping language and providing bespoke features of one mapping language that are missing in others [25].
(KG Creation) R04: The materialisation tool of the system relies on a mapping language that allows expressing a set of functions, and the tool implements these functions. This allows practitioners to use these functions to clean data before translating it into RDF.
(KG Creation) R05: The materialisation tool of the system allows defining link rules and applying such rules for linking resources that belong to the RDF data of the KG.
(KG Creation) R06: The system provides a mechanism to use other existing materialisation tools. This allows practitioners to use a materialisation tool they already know and, therefore, avoid learning a new mapping language.
(KG Creation) R07: The system provides reusable extension mechanisms that allow extending the provided materialisation tool and other system features in order to cope with new scenarios without forcing practitioners to develop ad-hoc software.
(KG Hosting) R08: The system provides different configurable options for storing the RDF data [74]. For instance, the system may be configured to store in-memory RDF data for quick retrieval.
(KG Hosting) R09: The system provides mechanisms to synchronise the stored RDF data generated by one or more materialisation tools and the original heterogeneous data.
(KG Curation) R10: The system allows practitioners to use existing tools that aim at enriching, validating, or linking RDF data that has been previously stored by the system.
(KG Deployment) R11: The system provides a REST API that publishes each resource in the RDF data through its URI using either the HTTP or HTTPS protocol.
(KG Deployment) R12: The system provides a SPARQL endpoint according to the W3C specification [32]. As a result, the system allows practitioners to query the RDF data of the KG.
(KG Deployment) R13: The system provides content negotiation so practitioners can consume the RDF data and/or the SPARQL results in different serialisations.
(KG Deployment) R14: The system publishes HTML views for assisting practitioners during data consumption.
(KG Deployment) R15: The system provides mechanisms to customise the HTML views, e.g., to allow practitioners to change the aesthetics of the HTML views.
(KG Deployment) R16: The system provides mechanisms to customise and embed meta-annotations in the published HTML views. As a result, the system allows transforming the plain HTML into HTML + RDFa [41].
Notice that the previous requirements do not only describe a system that is able to assist practitioners during the whole life cycle of a KG. A system that implements all these requirements also publishes the RDF data of a KG according to the Linked Data principles [3], namely: 1) use URIs as names for things, covered by R01 and/or R02; 2) use HTTP URIs so that people can look up those names, covered by R11; 3) when someone looks up a URI, provide useful information, using the standards (RDF, SPARQL), covered by R11 and R12; and 4) include links to other URIs, so that they can discover more things, covered by R10.
Elicited requirements met by existing tool types
A wide number of tools from the literature have been designed to address specific tasks or steps of the KG life cycle depicted by Fig. 1. However, to the best of the authors' knowledge, none is able to cope with the whole KG life cycle, as also pointed out by Simsek et al. [74]. In this section, the different tools are analysed from the point of view of the requirements elicited in Section 2 and of the step, or steps, of the KG life cycle that they address.
Table 1 shows the different categories of these tools from the literature. Additionally, for each category, Table 1 reports which requirements are covered by all the tools in the category (✓), which are not covered (–), and which are partially covered by only some of the tools.
Knowledge graph creation
RDF materialisation is a widely used approach to generate the RDF data of a KG from a set of heterogeneous sources, and it counts a large number of existing tools [22,39,40,46,50,73]. Although some of these may differ in efficiency or suitability when applied in certain contexts, in general they share the same workflow. First, a practitioner manually writes a set of translation mappings; then, these mappings are provided as input to the materialiser; finally, the materialiser fetches and translates the data, producing RDF data that is written to a file. As a result, since these tools provide an RDF file as output, they only cover the KG Creation step of the life cycle, i.e., they exclude any requirement from R08 onwards.
Although all the materialisers implement requirement R02, not all of them implement the other KG Creation requirements: only a few materialisation tools are able to understand more than one mapping language (R03); a large number of materialisers are able to apply functions when translating heterogeneous data into RDF, but these are mainly meant for cleaning data rather than linking RDF resources (R04 and R05); and, since these tools are developed to translate heterogeneous data, some of them are not able to take data already in RDF as input (R01). Finally, to the best of the authors' knowledge, none of these tools is designed to work in combination with another materialisation tool (R06), nor do they provide extension mechanisms to cope with new scenarios (R07), e.g., a new format from which to translate data into RDF.
Ontology-Based Data Integration (OBDI) and Ontology-Based Data Access (OBDA) tools are used when there is a non-RDF database with large amounts of data for which materialisation proposals fall short [65]. These tools focus on providing a SPARQL endpoint and translating the SPARQL queries received into one or more query languages. OBDA tools translate a SPARQL query into just one language [4,6,12,54,66,72], whereas OBDI tools are able to translate a SPARQL query into multiple languages at once [31,48,51]. Since these tools do not actually build a KG, they do not cover any requirement of the KG Creation step of the life cycle. Instead, they cover some requirements of KG Deployment, especially those related to SPARQL, since they allow consuming the heterogeneous data from the databases by means of SPARQL queries.
OBDI and OBDA tools allow answering SPARQL queries over data from heterogeneous databases as if a KG with such RDF data existed. However, some of these tools expect the queries to be provided programmatically rather than through a published SPARQL endpoint (R12), and some of them only support SELECT queries, or SELECT queries without special statements like FILTER, which may be a serious limitation when querying the data of a KG. Moreover, since these tools perform a translation of queries, and although SPARQL supports functions that could be used for cleaning or linking, not all of them are able to cope with such functions during query translation (R03 and R04). Finally, since these tools only translate queries instead of building and publishing RDF data, they do not cover the requirements from R08 onwards, with the exception of R12 for some tools.
Knowledge graph curation
KG Curation involves a large number of tools from the literature, which can be divided into four categories: RDF enriching tools [64], RDF link discovery tools [58], RDF validation tools [76], and RDF quality assessment tools [80]. Nevertheless, this categorisation is far from complete, since the number of tools that may fall under this topic is vast. These categories group the tools that the authors consider most likely to appear in the KG life cycle, but other tools could also be considered related to knowledge graph curation without falling into any of the previous categories.
Notice that KG Curation is a step that occurs once the RDF data has been stored, after the KG Hosting step. It is worth mentioning that the tasks involved in KG Curation are not necessarily blocking for those happening in KG Deployment, meaning that they are fully optional. In fact, any of these tools could be used by a system covering the KG life cycle, as stated by R10; however, by themselves, they do not cover any of the elicited requirements. For this reason, the tools analysed in this subsection are not included in Table 1.
RDF enriching tools pursue a wide number of goals. For instance, some tools aim at completing existing RDF data with new information [69], whereas others aim at summarising existing RDF data [86].
RDF link discovery aims at producing relationships between local RDF resources and RDF resources located in different KGs [58]. On the one hand, a wide number of tools aim at producing link rules [16–18,23,43,44,60,61,75], i.e., restrictions under which two RDF resources are linked. On the other hand, other tools focus on applying those rules efficiently and producing the links among resources [26,59,79].
RDF validators are tools that specify whether data expressed in RDF conforms to a set of restrictions. There is a W3C standard specification for expressing these restrictions, i.e., SHACL [49], as well as other non-standard specifications [77]. These tools usually take as input an excerpt of RDF data and a set of restrictions, and produce a validation report.
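As an illustration of the kind of restrictions these validators consume, a minimal SHACL shape could state that every sensor must have exactly one numeric value. The `ex:` namespace and the class and property names below are invented for the example:

```turtle
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex:  <http://example.org/ns#> .

# Every instance of ex:Sensor must have exactly one ex:value typed as xsd:float
ex:SensorShape a sh:NodeShape ;
    sh:targetClass ex:Sensor ;
    sh:property [
        sh:path ex:value ;
        sh:datatype xsd:float ;
        sh:minCount 1 ;
        sh:maxCount 1
    ] .
```

A validator would include in its report a violation for any ex:Sensor instance that lacks such a value or whose value has a different datatype.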
RDF quality tools aim at investigating and quantifying the quality of a KG and the parameters influencing such quality [45]. To this end, the literature offers proposals of a different nature: from tools [13] to metrics [80] and other approaches [83].
Knowledge graph hosting and deployment
RDF frameworks aim at providing practitioners with several functionalities belonging to different steps of the KG life cycle, e.g., programmatically choosing different environments in which to host their RDF data [5,52,62,63,82]. Some tools like Stardog5
In most cases, KGs are deployed by storing their RDF data in triple stores [2,47,55,67,71]. These stores host RDF data and provide a SPARQL endpoint (R12). Some triple stores also publish each resource under a URL (R11), and others implement content negotiation for the SPARQL endpoint and the RDF resources (R13), including HTML documents. Furthermore, some triple stores provide curation techniques, e.g., SHACL validation. Nevertheless, triple stores are not suitable for any other task within the KG life cycle.
RDF publishers aim at providing human interfaces (HTML) for those SPARQL endpoints and resources published exclusively with machine interfaces. Some tools, like YASGUI [68], publish a human SPARQL interface for a given SPARQL endpoint (R12 and R13). Others, like Pubby [27], also publish human interfaces for the resources provided by the SPARQL endpoint (R14). Some others, like Elda,7
Helio is a framework built to meet the requirements previously elicited and explained. The goal of Helio is to build a KG from heterogeneous data sources (which may include RDF sources) and publish the KG data following the Linked Data principles. In order to meet all the requirements, the Helio framework is divided into four logic modules, each of which aims at implementing a set of the elicited requirements. These modules are implemented as Java artefacts, although they could be implemented or viewed as microservices alternatively. The modules and the requirements that they cover, depicted in Fig. 2, are the following:

Fig. 2. Helio framework.
In the following subsections, these modules are explained in detail, providing an insight into their implementation. Then, Section 5 explains how the framework allows publishing KGs according to the Linked Data principles and how it covers the elicited requirements.
RDF Generator module

The RDF Generator Module is in charge of generating the RDF data of a KG and providing this data to the Hosting Module. In order to fulfil its goal, this module is built upon two generic components that must be instantiated in an implementation, i.e., data providers and data handlers, and a component to translate data into RDF if required, i.e., the data translator. The details of the translation process are specified in a bespoke Helio mapping language, the conceptual mapping, which must be provided to the RDF Generator Module as input. Additionally, a last component, named the resources orchestrator, organises the whole translation process and pushes the generated RDF data into the Hosting Module.
The data providers are components in charge of retrieving data from one data source. These components are agnostic to the format of the data; their only goal is to deal with the protocols for retrieving it. After a data provider obtains the data, such data is passed to the data handlers. The current Helio implementation includes several data provider instantiations.11
The URLProvider is able to retrieve data from a URL using several protocols, such as HTTP, HTTPS, FTP, or file. Nevertheless, the URLProvider is agnostic to the format of the data retrieved.
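The provider abstraction can be sketched as follows. This is a simplified, hypothetical rendering of the component, not Helio's actual code: the provider only deals with the retrieval protocol and exposes a raw stream, never inspecting the payload.

```java
import java.io.InputStream;
import java.net.URL;

// Hypothetical sketch of the data provider abstraction: a provider handles
// only the retrieval protocol and returns raw bytes, never parsing them.
interface DataProvider {
    InputStream getData() throws Exception;
}

// Minimal URLProvider: delegates protocol handling (http, https, ftp, file)
// to java.net.URL, remaining agnostic to the format of the retrieved data.
class URLProvider implements DataProvider {
    private final String url;

    URLProvider(String url) { this.url = url; }

    public InputStream getData() throws Exception {
        return new URL(url).openStream();
    }
}
```

Because the provider is format-agnostic, the same class serves JSON, CSV, or XML sources alike; the format-specific work is deferred to the data handlers described next.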
The data handlers are components that focus on fetching fragments of information from the data provided by a data provider. These components are highly tied to the format of the data, since they need to iterate over it or access specific positions in it in order to fetch the fragments. Notice that they are totally agnostic to the protocols involved in retrieving such data. The current Helio implementation includes several data handler instantiations.12
Assuming that the data retrieved in Example 1 was a JSON file, the JsonHandler should be used for iterating over the file and retrieving different values by means of JSONPath expressions.
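The handler abstraction can be sketched as follows. The `FlatJsonHandler` below is a deliberately naive stand-in (a real JsonHandler would delegate to a JSONPath engine); it only resolves expressions of the form `$.key` against a flat JSON object, but it illustrates the contract: format-specific fragment extraction, with no knowledge of retrieval protocols.

```java
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of the data handler abstraction: a handler is bound to a
// data format and fetches fragments addressed by a filtering expression
// (e.g., a JSONPath for JSON data), agnostic to how the data was retrieved.
interface DataHandler {
    String fetchFragment(InputStream data, String expression) throws Exception;
}

// Toy handler resolving only "$.key" expressions over a flat JSON object.
class FlatJsonHandler implements DataHandler {
    public String fetchFragment(InputStream data, String expression) throws Exception {
        String json = new String(data.readAllBytes());
        String key = expression.replace("$.", "");
        // Match "key": value, where value may or may not be quoted.
        Matcher m = Pattern.compile("\"" + key + "\"\\s*:\\s*\"?([^\",}]+)\"?").matcher(json);
        return m.find() ? m.group(1).trim() : null;
    }
}
```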
The RDF translator takes as input a conceptual mapping that specifies which data providers shall be used and to which data handlers they have to pass the retrieved data. In addition, these mappings hold a set of translation rules that are related to the data handlers. The rules usually contain filtering expressions and, optionally, cleaning functions; processing the filtering expressions requires fetching fragments of information from the data retrieved by the data providers through their related data handlers. Additionally, the conceptual mappings may include some linkage rules to be applied after the translation. Once the RDF translator has been initialised with a conceptual mapping, this component initialises and connects the different data providers with their respective data handlers, and then remains on standby. If the RDF translator is provided with valid RDF data, i.e., no translation is required since the data handler is meant for RDF, this component will automatically provide the RDF data as a result.
The resources orchestrator is the component that triggers the RDF translator when required on-demand by any other module. At that moment, the resources orchestrator invokes the RDF translator that will generate the RDF data. Then, the resources orchestrator pushes this data into the Hosting Module. Alternatively, if specified in the conceptual mappings, the resources orchestrator can trigger the RDF translator periodically instead of on-demand.
Notice that any time RDF data is pushed to the Hosting Module, this module may handle it in different ways. For instance, if an excerpt of RDF is already present in the triple store, it could replace the old triples with the new ones. Alternatively, the Hosting Module could store the triples in different named graphs, so that any time an excerpt is given, it is stored into a new named graph without overriding the old version of such an excerpt (this would be especially interesting for keeping versions of the data or even historical data).
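The translation flow described above can be condensed into a self-contained toy (again, not Helio's real API): a rule template whose placeholders are filtering expressions is instantiated with the fragments the data handlers have fetched, yielding RDF triples.

```java
import java.util.Map;

// Toy rendering of the translation step: subject and object templates contain
// {expression} placeholders that are replaced with fragments fetched from the
// source data, producing one triple in N-Triples-like syntax.
class ToyTranslator {
    // 'fragments' maps each filtering expression to the value a handler fetched.
    static String translate(String subjectTemplate, String predicate,
                            String objectTemplate, Map<String, String> fragments) {
        String s = fill(subjectTemplate, fragments);
        String o = fill(objectTemplate, fragments);
        return "<" + s + "> <" + predicate + "> \"" + o + "\" .";
    }

    private static String fill(String template, Map<String, String> fragments) {
        String out = template;
        for (Map.Entry<String, String> e : fragments.entrySet())
            out = out.replace("{" + e.getKey() + "}", e.getValue());
        return out;
    }
}
```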

Fig. 3. Helio conceptual mappings model.
The Conceptual Mappings13
A Datasource describes a source of data, which has a unique identifier (id) and a refreshing time (refreshTime) that specifies whether the generation of data is performed on-demand (if null) or periodically. In addition, a Datasource comprises two other elements: a data Handler and a data Provider.
A Provider and a Handler have a type that refers to a specific instantiation, for instance the name of a class such as JsonHandler or URLProvider, and an input that must be a JSON document for configuration, for instance, for the URLProvider, the URL from which the data must be fetched.
A ResourceRule describes how the data from one or more sources is translated into RDF. It has a unique identifier (id), a set of Datasource identifiers (datasources), and a subject that specifies how the subject of a set of triples is generated. Additionally, a ResourceRule is related to zero or more PropertyRules, each of which specifies how to generate a predicate and an object related to the former subject (predicateTemplate and objectPredicate, respectively) and, also, whether the object is a literal (isLiteral) and the datatype of such a literal (datatype).
A LinkingRule describes how to link resources (the subjects) from the RDF generated with the rules of two ResourceRules. The LinkingRule has two ResourceRule identifiers, one related to the subject of the link (sourceId) and one related to the object of the link (targetId). Also, the LinkingRule has an expression that is a link rule [23], an RDF predicate to relate both subjects (property), and a predicate that will be generated in the inverse order (inverseProperty), linking the target subject with the source subject.
Notice that a LinkingRule can only relate RDF resources within the KG generated by the framework. Linking the RDF resources from different KGs is out of the scope of these linking rules. Therefore, they must not be considered as a curation technique.
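The conceptual mapping model just described can be sketched as a handful of plain data classes. Field names follow the paper (id, refreshTime, subject, predicateTemplate, and so on), but the actual Helio classes and their types may differ; this is only a hedged reading of the model.

```java
import java.util.List;

// Hedged sketch of the Conceptual Mapping data model described above.
record Provider(String type, String input) {}
record Handler(String type, String input) {}
// refreshTime == null means on-demand generation; otherwise periodic.
record Datasource(String id, Long refreshTime, Provider provider, Handler handler) {}
record PropertyRule(String predicateTemplate, String objectPredicate,
                    boolean isLiteral, String datatype) {}
record ResourceRule(String id, List<String> datasources, String subject,
                    List<PropertyRule> properties) {}
// 'expression' is a link rule; 'inverseProperty' links target back to source.
record LinkingRule(String sourceId, String targetId, String expression,
                   String property, String inverseProperty) {}
```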

Fig. 4. Example of conceptual mappings for a REST API.
Figure 4 depicts a simple Conceptual Mapping instantiation specifying how to integrate data from a REST API that publishes JSON data about sensors that measure luminance. A sample payload is the following:
Notice that the Conceptual Mappings are data structures that the RDF Generator component handles internally. This entails that the input provided to this component may have different serialisations that are translated into this data structure internally. For instance, the Conceptual Mapping depicted in Fig. 4 can be the result of translating an equivalent mapping from a JSON serialisation (as shown in Appendix A), or translating an equivalent mapping from RML (as shown in Appendix B), or from a WoT-Mapping14
The RDF Generator Module has several internal translators in order to understand different mapping languages, like RML,15
As has been explained, the RDF Generator Module is capable of generating RDF from heterogeneous data sources, cleaning the data, and linking the RDF resources generated. Although it includes several data providers and handlers for achieving this task, new scenarios may introduce protocols or formats currently not supported by the module.
For this reason, the RDF Generator Module includes a dynamic system for loading plugins.17
As an example, in the repository of plugins18
Notice that the RDF translator component is capable of dealing with data that is already expressed in RDF. Therefore, a new data provider that relies on existing materialisation techniques could be implemented. This provider would receive as input the mappings understandable for that technique, and the technique would be invoked as a regular data provider. It is worth mentioning that, similarly to materialisers, OBDI or OBDA techniques could also be included as data providers.
As a result, thanks to the plugin system, the RDF Generator Module is capable of reusing code and prevents the generation of non-reusable ad-hoc pipelines. Additionally, although it generates RDF from heterogeneous sources, it could be used with plugins that rely on third-party techniques for RDF generation.
Hosting module

This module publishes a SPARQL endpoint for the other modules to store, read, or update RDF data. To this end, the current Helio implementation relies on SAIL configurations that enable a user to specify where to store the RDF data. For instance, the following configuration stores the data in an existing triple store.
Instead, the configuration below specifies that the triples must be stored in the file system.
The Hosting Module is configured with one of these SAIL Configurations and then publishes a SPARQL endpoint for the rest of the modules to be used. Notice that this flexibility allows users to adapt to certain scenarios where the computational resources are limited (e.g., deploying Helio in a Raspberry Pi board) and thus, choosing a suitable environment becomes paramount.
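Since the concrete listings depend on each deployment, the following is only a representative sketch of what an RDF4J SAIL configuration for a disk-based native store looks like; the repository identifier and the index setting are invented for the example:

```turtle
@prefix rep:  <http://www.openrdf.org/config/repository#> .
@prefix sr:   <http://www.openrdf.org/config/repository/sail#> .
@prefix sail: <http://www.openrdf.org/config/sail#> .
@prefix ns:   <http://www.openrdf.org/config/sail/native#> .

# Hypothetical configuration: persist triples on the file system via a
# NativeStore SAIL instead of an in-memory store or a remote triple store.
[] a rep:Repository ;
   rep:repositoryID "helio-storage" ;
   rep:repositoryImpl [
      rep:repositoryType "openrdf:SailRepository" ;
      sr:sailImpl [
         sail:sailType "openrdf:NativeStore" ;
         ns:tripleIndexes "spoc,posc"
      ]
   ] .
```

Swapping the `sr:sailImpl` block is what lets the Hosting Module target an in-memory store, a disk-based store, or an existing triple store without touching the rest of the framework.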
Curation module
The Curation Module aims at performing different curation tasks, for instance, linking resources or completing RDF triples. Therefore, this module can have one or more implementations depending on the task at hand.
The current Helio framework allows for any Curation Module implementation to interact with the rest of the framework by relying on a standard SPARQL interface. These implementations must access the generated data by means of the Hosting Module, which publishes the SPARQL endpoint, perform the desired curation task, and then store the output by using the Hosting Module through the SPARQL endpoint. As a result, the RDF of the KG published by the Publisher Module will include these modifications.
Notice that this mechanism allows any user to define a service to perform a specific curation task; its only requirement is to interact with the Hosting Module through a SPARQL 1.1 interface. As a result, this service could be reused by any third-party entity that will have to deal with a similar, or the same, curation challenge.
Notice that the current Helio implementation allows linking RDF resources that are the result of translating data from heterogeneous sources. In other words, Helio does not implement a linker among different KGs (which would be a Curation Module), only among the resources of a local KG generated by Helio.
Publisher module
The Publisher Module is in charge of making the RDF data from the Hosting Module available through the HTTP protocol, i.e., it publishes a REST API for consuming the data. The current implementation of the Publisher Module is a Spring Boot Java service.
The data from the Hosting Module is published by this module at three levels: the RDF resource level, where, when the URL of a specific existing RDF resource is requested, the Publisher Module outputs its triples; the SPARQL level, where the Publisher Module enables a standard SPARQL 1.1 endpoint for querying all the data stored by the Hosting Module; and the dataset level, where the Publisher Module provides a dump containing the triples that constitute the dataset stored in the Hosting Module.
The Publisher Module implements content negotiation by means of HTTP headers, enabling clients to consume any of the published data in different formats. For instance, the module will provide a client with an HTML view if a request with the text/html media type is performed; instead, if the same request uses text/turtle, the same data will be output as raw RDF in Turtle. Figure 5 from Appendix D shows the standard HTML views that the Publisher Module provides for its SPARQL endpoint (implemented with YASGUI [68]) (shown by Fig. 5(a)), any RDF resource (shown by Fig. 5(b)), and the dataset (shown by Fig. 5(c)).
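The negotiation logic amounts to mapping the media types in the Accept header onto the supported serialisations. The sketch below is illustrative only (the actual Publisher Module is a Spring Boot service and its supported formats and fallback may differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative content negotiation: the first supported media type found in
// the Accept header selects the serialisation; otherwise a default is used.
class ContentNegotiator {
    private static final Map<String, String> FORMATS = new LinkedHashMap<>();
    static {
        FORMATS.put("text/html", "HTML view");
        FORMATS.put("text/turtle", "Turtle");
        FORMATS.put("application/ld+json", "JSON-LD");
        FORMATS.put("application/rdf+xml", "RDF/XML");
    }

    static String negotiate(String acceptHeader) {
        for (String type : acceptHeader.split(",")) {
            String media = type.split(";")[0].trim(); // drop q-values etc.
            if (FORMATS.containsKey(media)) return FORMATS.get(media);
        }
        return "Turtle"; // assumed fallback for unsupported media types
    }
}
```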
Besides the standard views of RDF resources (shown by Fig. 5(b)), the Publisher Module implements a mechanism to customise the HTML views of the resources, as depicted by Fig. 5(d) from Appendix D. It allows associating the URLs of these resources to a specific HTML file in which the information is dynamically injected. As a result, a user can customise the HTML views of the resources. Furthermore, these views can also include RDFa annotations.
Finally, the Publisher Module allows defining dynamic views, which are HTML documents in which the injected data is the result of a SPARQL query. In other words, a user can choose a subset of data from the Hosting Module by means of a SPARQL query and associate such a view with a URL that does not exist in the dataset. Nevertheless, if a client requests such a URL from the Publisher Module, the module will automatically fetch the data and inject the result into a previously provided customised HTML view. A sample of this kind of usage is available at
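The essence of the dynamic-view mechanism is template injection. The sketch below is hypothetical (Helio's actual templating syntax and API are not documented here); the SPARQL results are simulated as a plain list of strings:

```java
// Hypothetical sketch of a dynamic view: an HTML template containing a
// placeholder is filled with the results of a SPARQL query (simulated here)
// and served under a URL that corresponds to no RDF resource in the dataset.
class DynamicView {
    static String render(String htmlTemplate, java.util.List<String> sparqlResults) {
        // Inject the query results where the template declares the placeholder.
        return htmlTemplate.replace("{{results}}", String.join(", ", sparqlResults));
    }
}
```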
Communication between modules
The communication among the different Helio modules depicted in Fig. 2, as well as the nature of the information exchanged and the way in which it is exchanged, is heavily tied to specific implementations of the framework.
Currently, the RDF Generator Module of the framework is distributed as a Java dependency. The Publisher Module is distributed as a Spring Boot Service that imports the RDF Generator Module; therefore, they communicate through the interfaces of the Java library described by the Helio framework repository.22
Finally, any Curator Module can interact with the Host Module through its SPARQL endpoint when the module is implemented using a remote triple store. Notice that if the Host Module is implemented as an embedded RDF4J triple store, Curator Modules cannot be used. Nevertheless, having an embedded triple store is suitable for some scenarios, as mentioned in Appendix E.
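For illustration, a Curator Module that performs link discovery could materialise a discovered link by issuing a SPARQL UPDATE against the Host Module's endpoint, along the lines of the following sketch (both URIs are hypothetical):

```sparql
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Hypothetical curation step: assert an owl:sameAs link discovered
# between a local resource and an external one (URIs are illustrative).
INSERT DATA {
  <http://example.org/helio/resource/1>
      owl:sameAs <http://dbpedia.org/resource/Madrid> .
}
```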
Notice that the modularity of the Helio framework suits a microservice-based implementation, which is one of the future aims of the authors. In such an implementation, the communication among the components will be totally different from the current one.
This section provides a discussion divided into several subsections. The first explains how the framework provides the means for publishing a KG following the Linked Data principles; the second, how the framework meets the requirements elicited in Section 2. Finally, a brief explanation of the use cases is given.
Enabling Linked Data principles
The Linked Data principles establish good practices that must be followed when publishing a KG [3]. These principles are namely:
The Helio framework enables
Finally, the framework allows generating links among the RDF resources of the same dataset thanks to the linking rules supported by the Generator Module. Nevertheless, other linking techniques can also be used as a Curator Module implementation. As a result, the framework enables the
Notice that none of the tools analysed in Section 3 allows users to fully follow the Linked Data principles; to follow all of them, a user would need to rely on several of these tools combined.
Requirements met
The requirements elicited in Section 2 are grouped by KG life-cycle step. Similarly, the modules of Helio are split along the same steps, easing the analysis of how these requirements are covered.
The Generator Module meets the requirements related to KG Creation, i.e.,
The Hosting Module allows choosing, by means of an RDF SAIL configuration, different environments to store the data, from an existing triple store to disk-based persistence (
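As an illustration of such a SAIL configuration, an RDF4J repository definition selecting a disk-based NativeStore might look as follows. This is a sketch following RDF4J's classic configuration vocabulary; the repository ID and index settings are assumptions.

```turtle
@prefix rep:  <http://www.openrdf.org/config/repository#> .
@prefix sr:   <http://www.openrdf.org/config/repository/sail#> .
@prefix sail: <http://www.openrdf.org/config/sail#> .
@prefix ns:   <http://www.openrdf.org/config/sail/native#> .

# Hypothetical configuration: a SailRepository backed by a disk-based
# NativeStore; swapping the sail:sailType for "openrdf:MemoryStore"
# would instead yield an in-memory store.
[] a rep:Repository ;
   rep:repositoryID "helio-storage" ;
   rep:repositoryImpl [
      rep:repositoryType "openrdf:SailRepository" ;
      sr:sailImpl [
         sail:sailType "openrdf:NativeStore" ;
         ns:tripleIndexes "spoc,posc"
      ]
   ] .
```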
The Host Module allows plugging any Curator Module (
Finally, the Publisher Module meets the requirements related to the KG Deployment, i.e.,
As a final remark, notice that none of the tools analysed in Section 3 was able to meet all the requirements elicited in Section 2. To the best of the authors' knowledge, the Helio framework is the first tool to meet all these requirements, allowing users to cover the whole life cycle of a KG.
Architecture details & use cases
The Helio framework is built upon the different modules shown in Fig. 2. Although, in principle, all these modules are paramount for the framework, not all of them need to be present in a real-world deployment. To this end, the use cases identified for the framework are described in Appendix E. They show that the combination of either the Publisher Module or the RDF Generator Module with the Host Module is always required, whereas the Curator Module is always optional.
Helio framework adoption
Although the Helio framework lacks formal experimentation, it has been widely used in different contexts; this wide adoption is an indicator of its usability and usefulness.
As a result of this project, a standalone proposal called eWoT, which enables semantic interoperability for ecosystems of sensors, was released [22]. This proposal relies on Helio to translate, on the fly, the data required to answer an issued query.
Additionally, in this project, Helio was used to publish JSON data stored in a Hyperledger blockchain as RDF, allowing users to easily consume such data. In this context, Helio was extended to include custom templates for publishing the data and to enable a validation mechanism that ensures data quality.
This research line originated from a master's thesis [29] in which the feasibility of using Helio to publish data stored in a blockchain regardless of its implementation (e.g., Bitcoin, Ethereum, or Hyperledger) was studied and analysed.
The latter Bachelor’s work aimed at studying factors related to the propagation of the rabies virus [56]. For this purpose, Helio integrated a file with data provided by the student and several sources of information, namely: the PanTHERIA database,34
In this article, the Helio framework for building and publishing KGs as Linked Data has been presented. The framework is built upon several requirements that establish the life cycle of KGs, meeting these requirements and allowing practitioners to publish KGs following the Linked Data principles. Furthermore, the framework features a plugin system that prevents the generation of non-reusable ad-hoc code when addressing novel challenges identified in new scenarios.
Future work on Helio will consist of adding new functionalities, improving the framework, and providing tests for it. Regarding improvements, the Curator Modules will be extended with novel functionalities, such as ODRL policies [42]. In addition, the architecture of Helio will be broken down into pieces to constitute a distributed architecture capable of dealing with larger and more complex scenarios.
In the future, the Helio framework is going to be evaluated from different perspectives. As mentioned, the current framework lacks formal experimentation apart from a set of JUnit tests that check correct software behaviour. On the one hand, the RDF Generator Module is going to be tested with a benchmark from the literature, e.g., GTFS Bench [11], and the Publisher Module with a stress-testing tool like JMeter, obtaining response-time results for different operations, e.g., retrieving a resource or running a SPARQL query. Furthermore, the whole framework may be evaluated from a user perspective, using questionnaires to get feedback about its usability.
On the other hand, it would be interesting to define a set of unit tests for ensuring that a tool meets the requirements elicited in this article and conforms to the different standards that it uses.
Acknowledgements
This work is partially funded by the European Union’s Horizon 2020 Research and Innovation Programme through the AURORAL project, Grant Agreement No. 101016854.
JSON mapping serialisation for instantiating the Conceptual Mapping depicted by Fig. 4
RML mapping serialisation for instantiating the Conceptual Mapping depicted by Fig. 4
WoT-mapping serialisation for instantiating the Conceptual Mapping depicted by Fig. 4
Helio Publisher Module HTML interfaces depicted by Fig. 5
Helio Publisher Module interfaces.
Helio framework: Use cases and real-world applications
The Helio framework provides end-users with several use cases related to the KG life cycle, as depicted in Fig. 6. Notice that the actor depicted is a user, which could be a person or a third-party application.
In light of the previous use cases, it can be inferred that some modules of the framework are mandatory, whereas others are not. The mandatory modules are one of these combinations: the KG Generator and the Host Module; the Publisher and the Host Module; or the KG Generator, the Publisher, and the Host Module. Besides, the Curator Module is optional in the different use cases.
Notice that these use cases describe a high and abstract level of usage; depending on the specific scenario, they will be implemented differently. For instance, the use case Generate KG could be implemented as Use the RDF Generator Module with RML mappings and store the result into GraphDB, or as Use the framework with a third-party RDF generator storing the data into Fuseki. Notice that this last implementation would require an end-user to code a Provider that wraps a third-party RDF generator (like RMLMapper) and whose configuration points to the respective mappings. As a result, the Provider passes the generated RDF directly to the other components.
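A Provider of this kind can be sketched as below. The `Provider` interface shown is an assumed simplification, since the real interface is defined in the Helio framework repository, and the wrapped command-line invocation is illustrative.

```java
/** Assumed simplification of Helio's data Provider interface; the real
 *  interface in the Helio repository may differ. */
interface Provider {
    String getData() throws Exception; // raw data handed to the next component
}

/** Wraps an external RDF generator invoked as a command-line tool
 *  (e.g., RMLMapper); the concrete command and its flags are illustrative. */
class ExternalGeneratorProvider implements Provider {

    private final String[] command;

    ExternalGeneratorProvider(String... command) {
        this.command = command;
    }

    @Override
    public String getData() throws Exception {
        // Run the wrapped generator and capture the RDF it writes to stdout
        Process process = new ProcessBuilder(command).start();
        String rdf = new String(process.getInputStream().readAllBytes());
        process.waitFor();
        return rdf;
    }
}
```

Under this sketch, the framework would only see the `Provider` interface, so swapping RMLMapper for another generator requires no change to the remaining components.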
In the current Helio documentation, some of the implementations of the available use cases have been identified and documented.37
The communication among these components is enabled thanks to the CIM gateway.38
The CIM relies on Helio for translating data from local infrastructures into a KG, which is published on the XMPP cloud at two levels: as RDF resources and through a SPARQL endpoint.39
In this project, Helio was used for translating heterogeneous data (namely, XML and JSON data) into RDF. This data was stored in an embedded RDF4J triple store and published through XMPP, wrapping the regular Helio publisher API.41
In this project, Helio is needed for publishing the large number of ontologies gathered under the project scope (materials and manufacturing). The starting point of this scenario was an already curated KG containing these ontologies (rather than a set of heterogeneous data sources). As a result, Helio is used solely as a data publisher that provides a data portal containing these ontologies.42
