Abstract
We present UnifiedViews, an Extract-Transform-Load (ETL) framework that allows users to define, execute, monitor, debug, schedule, and share data processing tasks, which may employ custom plugins (data processing units) created by users. UnifiedViews natively supports processing of RDF data. In this paper, we: (1) introduce UnifiedViews’ basic concepts and features, (2) demonstrate the maturity of the tool by presenting exemplary projects where UnifiedViews is successfully deployed, and (3) outline research projects and future directions in which UnifiedViews is exploited. Based on our practical experience with the tool, we found that UnifiedViews simplifies the creation and maintenance of Linked Data publication processes.
Introduction
The advent of Linked Data [1] accelerates the evolution of the Web into an exponentially growing information space in which the unprecedented volume of structured data offers information consumers a level of information integration that was previously not possible. Data consumers can create mashups of Linked Data that leverage various data sources to support use cases which were not intended by the original data publishers.
Suppose a data wrangler wants to build an RDF data mart that integrates information from various RDF and non-RDF sources. The wrangler’s data processing task then involves the following activities:
getting the data from certain data sources,
transforming the data to RDF data format,
cleaning the data,
interlinking it with other (external) data sources,
solving data conflicts to prepare the integrated data mart.
There are numerous tools used by the Linked Data community that can support the individual activities above. To combine them into a working whole, however, the data wrangler typically has to:
configure every such tool differently,
write a custom script downloading and unpacking data from various data sources,
prepare a script executing a set of SPARQL Update queries curating the data,
implement custom transformers, which, e.g., enrich processed data with the data from the DBpedia knowledge base [2], and
write a custom script ensuring that the tools are executed in the required order, so that every tool has all the desired inputs when being launched.
Maintaining data processing tasks of increasing complexity is challenging. Suppose, for example, that the wrangler defines tens of such data processing tasks, which should run every month. Apart from the activities described above, the data wrangler then also has to configure a scheduling script or use an external tool, such as cron, to launch the tasks at the desired times.
The task of compiling and setting up various tools from different vendors for multiple data analysis settings is cumbersome and often repetitive. In combination with the lack of integrated debugging and maintenance support, the immediate consequence is a negative impact on a data wrangler’s productivity. On a larger scale, we believe that the current lack of easy-to-use frameworks for Linked Data preparation and publication prevents many institutions from providing their datasets for public utilization as Linked Data.
Therefore, instead of requiring data wranglers to write most of the logic for defining, executing, monitoring, scheduling, and sharing data processing tasks themselves, we provide UnifiedViews, an open-source ETL framework with native support for RDF data that covers all of these activities.
This paper is organized as follows. In Section 2, we present the basic concepts of UnifiedViews, how a data wrangler may interact with the tool, and describe its architecture. To demonstrate the maturity of UnifiedViews, we introduce in Section 3 a number of exemplary projects in which the framework is successfully used. In Section 4, we summarize lessons learned from Section 3 and introduce current UnifiedViews research projects and future work. In Section 5 we provide an outline of related work in this problem domain and draw our conclusions in Section 6.
UnifiedViews
An Extract-Transform-Load (ETL) process performs: (1) extraction of data from the original data sources, (2) transformation of the data to the proper format and structure for the purposes of subsequent querying and/or data analysis, and (3) loading of the resulting data into the target data source (also called an operational data store, or data warehouse/data mart). An Extract-Transform-Load (ETL) framework allows users at least to define, monitor, and execute ETL processes.
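To make the three phases concrete, consider the following minimal, self-contained Java sketch (the data and method names are illustrative only, not tied to UnifiedViews or any other framework):

    import java.util.List;
    import java.util.stream.Collectors;

    public class MinimalEtl {
        // Extract: obtain raw records from an original data source (here: hard-coded).
        static List<String> extract() {
            return List.of("alice;42", "bob;17");
        }

        // Transform: convert each record to the structure expected by the target store.
        static List<String[]> transform(List<String> raw) {
            return raw.stream().map(r -> r.split(";")).collect(Collectors.toList());
        }

        // Load: write the transformed records to the target data store (here: stdout).
        static void load(List<String[]> rows) {
            rows.forEach(r -> System.out.println(r[0] + " -> " + r[1]));
        }

        public static void main(String[] args) {
            load(transform(extract()));
        }
    }

In UnifiedViews, each of these phases corresponds to one or more data processing units, as described below.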
UnifiedViews is an open-source ETL framework with a focus on processing RDF data. It allows users to define, execute, monitor, debug, schedule, and share Linked Data processing and publishing tasks. A sample task is the preparation of a data mart by the wrangler as introduced in Section 1.
A data processing task is represented in UnifiedViews as a data processing pipeline (or simply pipeline). Every pipeline consists of one or more data processing units (DPUs) and data flows between these DPUs.
Every DPU may declare certain mandatory or optional inputs, encapsulates certain business logic that processes the data (e.g., a DPU may extract data from a SPARQL endpoint, apply a SPARQL query, or transform CSV data to RDF data), and may produce certain outputs. DPUs may also provide a configuration dialog, so that a pipeline designer (e.g., the data wrangler mentioned above) may configure them differently in different pipelines. Administrators of the particular UnifiedViews installations may set up the default configurations of the DPUs and also prepare various alternative configurations, which may be directly used by pipeline designers.
A data unit is a container for the data consumed or produced by a DPU. We distinguish input and output data units. An input data unit contains data used as input during a DPU’s execution. An output data unit holds the data produced in the course of a DPU’s execution. Every data flow between two DPUs X and Y consists of the output data unit of DPU X producing the data and the input data unit of DPU Y consuming the data. Every data unit has a name, which is assigned by a DPU developer and is mandatory. Data unit names are visible to pipeline designers, so that they may, e.g., map an output data unit of one DPU to an input data unit of another DPU. Every DPU may declare zero or more input data units and zero or more output data units.
UnifiedViews supports three types of data units which can be both input and output data units and are distinguished by the type of information they can hold:
RDF data units can contain RDF graphs, Files data units can contain files and folders, and Relational data units can contain tables from relational databases.
Every data unit can hold zero or more entries of the particular supported type.
There are four types of DPUs, which are determined by the number of input and output data units they declare, as well as by their intended purpose (a code sketch illustrating DPUs and data units follows this list):
Extractor: A DPU that does not define any input data unit. Input data to such a DPU is not provided by the UnifiedViews framework, but rather obtained from external sources by the business logic of the DPU. For instance, an extractor may query data from a remote SPARQL endpoint or download files from a certain set of URLs.
Transformer: A DPU that transforms inputs to outputs. It defines both input and output data units. UnifiedViews must ensure that proper inputs are prepared for the DPU and must also handle the outputs produced by the DPU. Examples of transformers are DPUs that transform tabular data to RDF data or execute SPARQL (Update) queries.
Loader: A DPU that defines an input data unit, but does not define any output data unit. Output data produced by such a DPU is not maintained by the UnifiedViews framework, but rather intended for storage in external repositories outside of UnifiedViews. Examples of loaders are DPUs uploading data to a remote SPARQL endpoint or disseminating new records to a CKAN catalog.
Quality Assessor: A DPU that assesses the quality of the input data and produces a quality assessment report as the output. We decided to distinguish this type of DPU from transformers because they work differently – they do not produce transformed data at the output, but rather a quality assessment report. For instance, quality assessor DPUs may check to which extent the input data is complete or whether data type literals in the resulting data contain correct values.
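The following Java sketch illustrates the shape of a transformer DPU and its data units; the types shown are simplified, hypothetical stand-ins and do not reproduce the actual UnifiedViews plugin API:

    import java.util.List;

    // Hypothetical, simplified stand-in for a UnifiedViews data unit.
    interface RdfDataUnit {
        List<String> getGraphs();        // names of the contained RDF graphs
        void addGraph(String graphName);
    }

    class SparqlTransformer {
        RdfDataUnit input;   // declared input data unit, named "input"
        RdfDataUnit output;  // declared output data unit, named "output"

        // Business logic: here we merely derive one output graph per input graph;
        // a real transformer would, e.g., apply a SPARQL Update query to each graph.
        void execute() {
            for (String graph : input.getGraphs()) {
                output.addGraph(graph + "-transformed");
            }
        }
    }

On a real pipeline, the framework populates the input data unit with the outputs of the preceding DPU before the DPU’s business logic is invoked.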
UnifiedViews distinguishes DPU templates and DPU instances. The DPUs available in the system to be used on pipelines by data wranglers are called DPU templates. Every DPU template defines a template configuration of the DPU. When a data wrangler places such a DPU template on a pipeline, this placement is called DPU instantiation and the result is called a DPU instance. The DPU instance has its own instance configuration, based on the template configuration of the DPU. A DPU instance is always associated with the given pipeline. One pipeline may contain multiple DPU instances of the same DPU template. Every DPU instance always points to the DPU template from which it was created.
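The template/instance relationship can be sketched in Java as follows (the classes are hypothetical, not the actual UnifiedViews implementation):

    import java.util.HashMap;
    import java.util.Map;

    class DpuTemplate {
        final String name;
        final Map<String, String> templateConfig = new HashMap<>();
        DpuTemplate(String name) { this.name = name; }

        // Instantiation: the instance starts from a copy of the template configuration.
        DpuInstance instantiate(String pipelineId) {
            return new DpuInstance(this, pipelineId, new HashMap<>(templateConfig));
        }
    }

    class DpuInstance {
        final DpuTemplate template;                // points back to its template
        final String pipelineId;                   // bound to exactly one pipeline
        final Map<String, String> instanceConfig;  // may diverge from the template
        DpuInstance(DpuTemplate t, String p, Map<String, String> c) {
            template = t; pipelineId = p; instanceConfig = c;
        }
    }

In this sketch, editing an instance configuration on one pipeline leaves the template configuration, and hence all other instances, untouched.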

Fig. 1. UnifiedViews framework – definition of a data processing task.
UnifiedViews provides a graphical user interface to define, manage, execute, monitor, debug, schedule, and share DPUs and pipelines. A screenshot of this interface is shown in Fig. 1. It shows a data processing pipeline consisting of five DPUs (colored boxes) and four data flows (arrows connecting the boxes) between these DPUs. DPUs may be added to a pipeline by drag & dropping them onto the pipeline canvas. Data flows between two DPUs may be defined by drawing an edge between them. Labels on the data flow edges clarify which output data units are mapped to which input data units of the DPUs. In Fig. 1, all data flows map an output data unit with the name output to an input data unit of the next DPU with the name input; however, custom data unit names may be used by DPU developers, which are then reflected in the user interface.
As the pipeline is being prepared, UnifiedViews provides data wranglers with debugging capabilities; the data wrangler may execute a selected fragment of the pipeline at any time and browse or query (using the SPARQL query language) the entries in the input/output data units that are consumed/produced by each DPU.
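Conceptually, such debugging boils down to evaluating a SPARQL query against the graphs held in the working store; a minimal sketch using the Sesame API (not the actual UnifiedViews debugging implementation) could look as follows:

    import org.openrdf.query.QueryLanguage;
    import org.openrdf.query.TupleQueryResult;
    import org.openrdf.repository.Repository;
    import org.openrdf.repository.RepositoryConnection;

    public class DataUnitInspector {
        // Dump (up to) the first 100 triples held in the working store, the way
        // a data wrangler would inspect an intermediate data unit.
        public static void dump(Repository workingStore) throws Exception {
            RepositoryConnection con = workingStore.getConnection();
            try {
                TupleQueryResult result = con.prepareTupleQuery(
                        QueryLanguage.SPARQL,
                        "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 100").evaluate();
                while (result.hasNext()) {
                    System.out.println(result.next()); // one binding set per row
                }
            } finally {
                con.close();
            }
        }
    }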
When data wranglers are satisfied with the prepared pipelines, they can execute them manually and verify the results. Alternatively, it is possible to schedule pipelines for execution (1) once at a certain time, (2) at regular intervals, or (3) after another pipeline has been successfully executed. Data wranglers may also get notifications about the pipelines’ execution states – either for all executions of the selected pipelines or only for those which ended with an error. It is also possible to get daily summaries of the executions of the selected pipelines in the last 24 hours.
UnifiedViews currently provides more than 35 core DPUs, which support:
obtaining data from external sources (CSV, DBF, XLS, XML files, RDF data, or relational tables),
transforming data between various formats (e.g., CSV files to RDF data, relational tables to RDF data),
executing typical transformations such as executing SPARQL Construct/Update queries, executing XSL transformations, linking/fusing RDF data, unpacking/packing/filtering files, and
loading the transformed and curated data to external systems (loaders to various RDF databases and generic SPARQL endpoints, loaders of files via FTP/SFTP/etc., loaders to relational databases).
Apart from these core DPUs, the UnifiedViews team maintains further, more specialized DPUs.
All pipelines and DPU templates prepared by data wranglers are by default private (available only to the user who prepared them and also to administrators – see Section 2.2 for more details on roles). Nevertheless, UnifiedViews allows data wranglers to share prepared pipelines with other data wranglers, so that others can either see the pipeline (sharing in read-only mode) or even collaborate on the pipeline preparation (sharing with write access to the pipeline). Data wranglers may also create custom DPU templates (DPU templates available in the system with custom configurations) and share such DPU templates with others, so that others can use such custom DPU templates in their pipelines.
UnifiedViews is composed of three main components:
Graphical user interface: Being the primary means of interaction with the framework, the graphical user interface (henceforth referred to as frontend) supports the definition, management, execution, monitoring, debugging, scheduling, and sharing of data processing pipelines and the management of DPUs. The frontend is implemented in Java as a Web application using the Vaadin framework.
Pipeline execution engine: It is responsible for running the (scheduled) pipelines and is implemented as a stand-alone Java application (henceforth referred to as backend).
REST API administration service: It allows users to define, manage, execute, and monitor data processing pipelines without using the frontend. For example, external applications may execute pipelines and obtain the results of the executions by interacting with this component (see the sketch below).
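Assuming a hypothetical endpoint layout (the path shown is illustrative, not the documented UnifiedViews API), an external Java application might trigger a pipeline execution like this:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class PipelineTrigger {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Hypothetical resource path: start a new execution of pipeline 42.
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("http://localhost:8080/unifiedviews/api/pipelines/42/executions"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }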

Fig. 2. UnifiedViews framework – architecture.
The frontend and the backend communicate via a relational database, which stores all configuration information, such as pipeline setups, DPU configurations, execution states, or scheduled events.
To support scalability, multiple backend instances can run on different machines, effectively executing pipelines in parallel (see Fig. 2). Each backend has its own identifier, and all backends observe pending executions. If an execution is pending, the first backend to notice it records its identifier next to that pending execution and executes it.
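A minimal sketch of this coordination pattern, assuming an illustrative table layout rather than the actual UnifiedViews database schema: the atomic UPDATE guarantees that even if several backends notice the same pending execution, only one succeeds in claiming it.

    import java.sql.Connection;
    import java.sql.PreparedStatement;

    public class ExecutionClaimer {
        // A backend claims a pending execution by atomically writing its own
        // identifier next to it; table and column names are illustrative.
        public static boolean claim(Connection db, long executionId, String backendId)
                throws Exception {
            PreparedStatement st = db.prepareStatement(
                    "UPDATE exec_pipeline SET backend_id = ? " +
                    "WHERE id = ? AND backend_id IS NULL");
            st.setString(1, backendId);
            st.setLong(2, executionId);
            // exactly one backend sees one updated row and proceeds to execute
            return st.executeUpdate() == 1;
        }
    }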
Every backend uses its own RDF Working Store for storing the temporary data produced by a pipeline during its execution. As the RDF Working Store, we currently support Sesame.
Every DPU is an OSGi bundle. As a result, DPUs can be added or reloaded while the framework is running, and each DPU’s library dependencies are isolated, so two DPUs may depend on different versions of the same library without clashes.
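A minimal sketch of what runtime loading of a DPU bundle can look like with the standard OSGi API (the jar location is hypothetical):

    import org.osgi.framework.Bundle;
    import org.osgi.framework.BundleContext;

    public class DpuLoader {
        // Install and start a DPU bundle at runtime; OSGi keeps each bundle's
        // dependencies isolated from those of other bundles.
        public static Bundle load(BundleContext context, String jarLocation)
                throws Exception {
            Bundle dpu = context.installBundle(jarLocation); // e.g., "file:dpu/my-dpu.jar"
            dpu.start();
            return dpu;
        }
    }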
UnifiedViews also supports authentication and authorization of users. Two roles are supported by default – Users and Administrators. Each such role can be associated with a list of permissions, e.g., a permission to import new DPU templates. Anytime a data wrangler wants to interact (view, edit, save, delete, etc.) with a certain entity (pipeline, DPU, scheduled event, etc.), UnifiedViews checks whether that user is authorized to do so. Spring Security is used to implement the authentication and authorization mechanisms.
There are two ways in which DPUs on a pipeline may communicate information to a data wrangler during execution. They may either publish an event (an important message), e.g., that the DPU e-sparqlExtractor was successfully executed and extracted 1000 triples, or log something using the standard Java logging mechanisms. Both kinds of messages – events and logs – support various levels of severity (error, warning, info, etc.) and are displayed in the frontend of UnifiedViews as the pipeline is being executed. The reason why we distinguish events and logs (and also display them differently in the frontend) is to give data wranglers a high-level overview of what happened during pipeline execution (through events) and the possibility to examine logs in case more information is needed. DPUs may also throw an execution exception, which is semantically equivalent to sending an event with the severity error. When such an exception is thrown, the pipeline execution is stopped.
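The following sketch contrasts the two channels; SLF4J is a standard Java logging facade, while the DpuContext type and its sendMessage method are hypothetical stand-ins for the actual UnifiedViews API:

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ExtractorBody {
        private static final Logger LOG = LoggerFactory.getLogger(ExtractorBody.class);

        // Hypothetical stand-in for the execution context handed to a DPU.
        public interface DpuContext {
            enum Severity { INFO, WARNING, ERROR }
            void sendMessage(Severity severity, String message);
        }

        public void execute(DpuContext context) {
            LOG.debug("Opening connection to the SPARQL endpoint ..."); // log: low-level detail
            int triples = 1000; // pretend result of the extraction
            // event: high-level overview, displayed prominently in the frontend
            context.sendMessage(DpuContext.Severity.INFO,
                    "Extracted " + triples + " triples.");
        }
    }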
The source code of UnifiedViews is published under an open-source license which is a combination of GPLv3 and LGPLv3. The source code is organized into four repositories:
Plugin-devEnv: It contains UnifiedViews APIs (for DPUs, configuration dialogs, data units, etc.) and also a set of helper classes, which simplify development of new DPUs.
Core: It contains implementations of the UnifiedViews APIs from Plugin-devEnv, including implementations of the supported data units, and the frontend, backend, and REST API administration service components.
Plugins: It contains a set of core DPUs.
Plugins-QualityAssessment: It contains DPUs which assess the quality of processed data.
All information about UnifiedViews, including the documentation and tutorial for building DPUs, may be found at the project’s website.20
Overview of the projects
In this section, we describe projects in which UnifiedViews has been successfully deployed and used. For each project, we describe in separate sections:
the organization responsible for realizing the project (for details about the introduced organizations, such as their addresses/full names, please see the affiliations of the authors),
the motivation and goals of the project,
the approach we took and the achievements we reached,
the challenges we faced and the lessons learned.
Furthermore, Table 1 summarizes for each such project:
the number of UnifiedViews pipelines prepared in that project,
information about the scheduling of the pipelines (whether they are scheduled for execution at regular intervals or executed on demand),
the (approximate) number of RDF triples produced in each project (as of October 2016).
Czech Trade Inspection Authority
This project was realized at the Czech Trade Inspection Authority in the Czech Republic (CTIA).
Before CTIA published their core datasets – data about inspections, bans, and sanctions – as open data, many parties (citizens, companies) requested access to particular aspects of the data (e.g., information about inspections and sanctions concerning a company X) based on the Czech act on free access to information.
CTIA successfully used UnifiedViews to publish their core datasets about inspections, bans, and sanctions in CSV and RDF data formats according to recommendations of the OpenData.cz Initiative.
We found out that there are governmental institutions willing to publish their data as (Linked) open data and UnifiedViews was able to help them realize that goal.
Charles University’s Department of Software Engineering, which deployed UnifiedViews at CTIA, had to help CTIA with the initial installation and updates of UnifiedViews and with the pipeline design.
CTIA data wranglers had two issues when creating pipelines in UnifiedViews – (1) they were not Linked Data/RDF experts, thus they did not know which ontologies they should use to publish their data as Linked Data in a correct and reusable way, and (2) they sometimes did not know how certain DPUs should be interconnected in the pipelines to realize their particular need. Addressing issue (1) is more difficult, and being able to semi-automatically suggest suitable RDF ontologies to represent source data is future work; to address (2), we are working on tutorials explaining how the DPUs should be interconnected in typical data transformation and curation tasks.
Unfortunately, CTIA is currently (2016) not publishing new data in the RDF data format, because the employees involved in the preparation of the RDF data left CTIA and there was a political decision not to continue that effort.
Council open data initiative
The Council of the European Union (EU Council) is, together with the European Parliament, the legislative body of the EU. In 2015, the decision was taken to provide public information to EU citizens not only as documents but also as machine-readable data. The first dataset to be published was the Council’s voting results. Later, the Public Register (metadata on the Council’s documents) and the Requests for Access to Documents were published.
Motivation and goals
Until 2015, the votes of the EU Council were only available in an unstructured format, as a picture embedded in a PDF document. Since voting is a core element of democratic accountability, there is considerable interest among practitioners and researchers in voting patterns at the EU level, including those of the EU Council.
The goal of the project is to ensure the transparency of information about the votes of the EU Council, and to empower experts, journalists, and citizens to re-use the data and analyse the votes, as well as to build visualizations, applications, etc., on top of the EU Council dataset. The EU Council vote dataset does not only contain the votes but also information about, e.g., the act type (regulation, directive, decision, or position), the act number (as published in the EU’s Official Journal), the document number (submitted to the Council for adoption), the inter-institutional number, and much more.
The positive experience with the voting results dataset led to the second phase of the project during 2016, with the goal to provide the Public Register and the Requests for Access to Documents datasets as Linked Data. The Public Register is a metadata catalog of all documents that are publicly accessible. The Requests for Access to Documents dataset provides insights into the requests the EU Council receives from the public for access to documents related to a certain topic. For instance, one can ask which documents relate to the CETA treaty.
From a technology perspective, the Council Open Data Initiative implements a mechanism to extract data from the EU Council’s original database. Making use of UnifiedViews, this data is then automatically converted into the RDF data format by adoption of the Data Cube vocabulary and published using the OpenLink Virtuoso RDF store. To achieve this, Semantic Web Company, realizing this project for the EU Council, developed a pipeline which is scheduled and executed every two hours.
During the second phase, the Linked Data publishing environment has been made more robust and upgraded to the latest version of UnifiedViews by TenForce.
As a first example of data re-use, three data visualizations have been created, among them a map visualization.
Due to internal policies, the project requires the use of MS SQL Server instead of open-source databases like MySQL or PostgreSQL, which were so far supported by UnifiedViews. We therefore adapted UnifiedViews to also operate on an MS SQL Server for storing its internal data, i.e., pipeline definitions, scheduling information, or user management data.
The second phase was confronted with a higher volume of data (about 3 million documents), and the information had to be retrieved from an Oracle database. It turned out that getting access to the data was easy; the default extraction DPU of UnifiedViews handled that case. The challenge, however, was throughput performance. It turned out that simply turning the Oracle database views into RDF data (without any post-processing) was sufficiently performant, but adding further transformations caused the throughput to drop significantly. An in-depth analysis showed that the throughput is determined by the RDF working store used in UnifiedViews. The default Sesame working store is very inefficient when dealing with graph management operations and the deletion of triples. When using the Sesame working store, the size of the pipelines (the number of DPUs) combined with the input volume determines the throughput. By batching the input and splitting the pipelines into smaller ones, a situation was established in which each pipeline execution requires at most 30 minutes.
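The batching idea can be sketched as follows (illustrative only; the batch size and the pipeline trigger are placeholders): instead of one long-running execution, the input is split into chunks, and each chunk is processed by a separate, small pipeline run, which keeps the working-store overhead per run bounded.

    import java.util.List;

    public class BatchedRuns {
        // Split the input into chunks so that each pipeline execution stays small.
        public static <T> void processInBatches(List<T> documents, int batchSize) {
            for (int i = 0; i < documents.size(); i += batchSize) {
                List<T> batch =
                        documents.subList(i, Math.min(i + batchSize, documents.size()));
                runPipelineOn(batch); // one (smaller) pipeline execution per batch
            }
        }

        static <T> void runPipelineOn(List<T> batch) {
            // placeholder: trigger one pipeline execution for this batch
        }
    }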
A main lesson learned is the importance of high-quality source data. This includes both the enforcement of strict syntax validation for all data elements and an increased focus on using controlled vocabularies wherever applicable. Timeliness and up-to-dateness of the data are also crucial for this use case, which can be achieved with UnifiedViews’ advanced scheduling functionality for the data extraction pipeline.
Another major insight gained from the project was that we were able to lower the barrier to starting to work with Linked Data extraction and conversion methods. For the voting results dataset, Semantic Web Company provided both the technology and the transformation pipelines for achieving the project’s goals. Using only email interaction, Semantic Web Company was able to instruct the Council’s staff in maintaining (i.e., controlling and observing) the setup. Given that prior to this project the EU Council had only minor experience with Linked Data, this can be considered a major achievement. During the second phase, this earlier experience paid off and established a working environment in which the development and the later handover of the improved EU Council Linked Data processing setup went smoothly. At the time of writing, the EU Council is working with internal resources to publish new datasets using UnifiedViews.
We can therefore conclude that UnifiedViews, as a user-friendly graphical tool, can encourage institutions to publish their data in an open format on the Web, contributing to increased availability of high quality Linked Data.
Open data support: First pan-European open data portal
The European Commission Directorate General for Communications Networks, Content & Technology (DG Connect) is the organization behind this project.
DG Connect launched the 3-year project Open Data Support with two ambitions: to raise awareness of (Linked) Open Data among European public administrations, and to aggregate and harmonise the metadata of datasets published on European open data portals.
Awareness of (Linked) Open Data has increased, as the project trained over 1200 persons active in governmental administration in almost every member state.
The second ambition initiated the creation of the DCAT-AP specification, a common application profile of the DCAT vocabulary for describing datasets in European data portals.
Despite being the first commercial application of UnifiedViews, the benefits of the UnifiedViews approach for this challenge were immediately visible. Once the first DPUs for this task were ready (i.e., extractors from and loaders to CKAN catalogs), the actual aggregation and harmonisation work could start. This allowed a first production-ready setup to be reached within a short amount of time (less than 2 months). Thereafter, an iterative approach was followed to stepwise improve the quality of the already harmonised datasets and to add new functionality such as versioning, automated translation of descriptions and titles, and DCAT-AP compliance reporting (quality reports). At the same time, a methodology for harvesting any new data portal was established, which reduced the effort of including a new data portal from (initially) 10 days to 2 days.
For this project, the UnifiedViews RDF working store was switched from Sesame to OpenLink Virtuoso. This was done to achieve better throughput performance. Although this choice realized the needed performance improvement, it created another challenge: proper support of the SPARQL query language. It turned out that Virtuoso does not support exactly the same SPARQL queries as Sesame; therefore, SPARQL queries had to be rewritten to satisfy the restrictions imposed by OpenLink Virtuoso.
During the 3 years of the project, the maintenance of the pipelines was done by several persons. Each handover was rather smooth; the only prior knowledge each maintainer needed was general Linked Data experience. This indicates that UnifiedViews also assists in the knowledge transfer that is required in any long-lasting project.
Westtoer datahub
Westtoer is the agency responsible for tourism and recreation in the Belgian province of West Flanders.
In the context of Westtoer’s role as a knowledge center for touristic information, touristic data is collected and made available. In the recent past, Westtoer established a datahub: a data portal from which machine-processable data is made available.
For the Westtoer datahub, UnifiedViews is deployed as a dockerized solution exporting the data to an OpenLink Virtuoso RDF store. End-users can access the data via the DataTank.
At Westtoer, a locally developed tool was taking care of the data conversion. The drawbacks of that tool (limited maintainability due to complex specifications, UTF-8 encoding problems, etc.), together with the need to transform (new) data sources to a new vocabulary, encouraged the team to replace it with UnifiedViews.
After an extensive test period, the setup has been validated by the active data consumers (there are running products on the datahub). As a result, the new setup with UnifiedViews will be taken into production at the end of 2016.
As experienced in other use cases, the creation of the UnifiedViews pipelines is labor-intensive work requiring knowledge of the source and target vocabularies. The large domain and the wide variety of sources meant that each source required an extra deep assessment. Furthermore, the target domain is open-ended, so during pipeline creation the understanding grows by inspecting and evaluating the results with the end-users. As this is a human process, progress was slow.
The initial pipelines, which were based entirely on SPARQL (Update) queries, did not always reach the required throughput performance. Applying the knowledge from other projects created some improvements, but these were still not sufficient, especially for one source which was provided as a set of XML documents. The solution was to rewrite the original XSLT that turned the input into basic RDF data. A further reduction was reached by incorporating some domain knowledge. In the end, the execution time went from 100 hours to only 45 minutes.
The Westtoer datahub UnifiedViews setup is now in its first release. The experience so far has allowed specific improvement actions to be identified. The knowledge that those actions can be implemented without interfering with the whole setup, but only with the part that must be addressed, creates comfort for the maintainer.
Slovak environmental agency
The Slovak Environmental Agency (SEA) is the provider of data from the environmental domain; this includes data about environmental burdens, protected sites, land cover, waste dumps, etc. SEA is also an infrastructure provider – it hosts DB servers and web services working with the environmental data.
Motivation and goals
SEA wanted to explore the potential to increase the re-use of their data if published as Linked Data. SEA decided to publish the following datasets as Linked Data: protected sites, species distribution, bio-geographical regions, land cover, and contaminated sites registered as environmental burdens. These datasets are available in the Geography Markup Language (GML) via an API provided by the Web Feature Service, typically in the INSPIRE format.
UnifiedViews was successfully deployed (as one of the components of the Open Data Node (ODN) publication platform) to transform these datasets and publish them as Linked Data.
An initial barrier we had to overcome was that vocabularies mapping the INSPIRE XML schemas to RDF were not available, so we had to provide the mappings ourselves.
UnifiedViews was able to transform, enrich and publish RDF data in a simple way, allowing easy maintenance for the future. A key benefit of the RDF version of the SEA datasets is that it is straightforward to combine them with third-party datasets.
OpenData.cz initiative
The OpenData.cz initiative is the organization responsible for this project.
The goal of the initiative is to extract, transform and publish Czech open data in the form of Linked Data, so that the initiative contributes to the Czech Linked Open Data cloud. The initiative focuses mainly on Czech governmental data.
Approach and achievements
For this effort, the UnifiedViews framework has been successfully used since September 2013; so far, the OpenData.cz initiative has published over 70 datasets and hundreds of millions of triples. The list of published datasets is available online.
Using UnifiedViews to manage the data processing tasks of the OpenData.cz initiative simplified those tasks and at the same time kept them documented.
The OpenData.cz initiative realized that certain fragments of pipelines tend to be repeated across many pipelines, e.g., the pipeline fragment producing DCAT-AP metadata for the published data; to simplify the creation of new pipelines and the maintenance of existing ones, UnifiedViews should have the possibility to pack these fragments so that they may be placed on pipelines in the same way as other DPUs.
Further, although it was really effective to manage pipelines in UnifiedViews, it sometimes happened that a scheduled pipeline suddenly did not produce the expected results; the reasons for that were typically twofold – either the structure of the source data had changed in the meantime, or the pipeline designer had made a mistake while fine-tuning the pipeline definition. In these cases, UnifiedViews should send alerts to the pipeline designer that the results of the scheduled pipeline have suddenly changed dramatically.
Summary of lessons learned and future work
Most of the projects in Section 3 confirmed that UnifiedViews provides easy pipeline management via its user interfaces. The Open Data Support project also mentioned that UnifiedViews assists in the knowledge transfer (between data wranglers) that is required in any long-lasting project.
Predefined UnifiedViews plugins, which may be used out of the box and without a need for heavy programming, speed up the preparation of data processing tasks; the Open Data Support project explicitly quantified that the time needed to include yet another data source to be extracted was reduced from 10 days to 2 days when UnifiedViews was used.
The Westtoer datahub and Open Data Support projects mentioned performance issues when processing larger amounts of data with UnifiedViews; we plan to address these issues by adding solid support for OpenLink Virtuoso as the RDF working store and by optimizing the way RDF data is passed between DPUs.
The OpenData.cz initiative realized that certain fragments of pipelines tend to be repeated often; to simplify the creation of new pipelines and the maintenance of existing ones, UnifiedViews should have the possibility to pack these fragments so that they may be placed on pipelines in the same way as other DPUs. The initiative also claimed that UnifiedViews should send alerts to pipeline designers when something suspicious happens during pipeline execution, e.g., when the results of a scheduled pipeline suddenly change dramatically. We plan to provide the possibility for DPU developers to add RDF validation features to their DPUs via a single line of code; as a result, for DPUs supporting the RDF validation features, data wranglers may define sets of (SPARQL ASK) queries verifying the produced RDF data.
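As a sketch of the envisioned validation (the query, the constraint, and the helper class are illustrative), a data wrangler could register a SPARQL ASK query such as the following, where a true result signals a constraint violation:

    import org.openrdf.query.QueryLanguage;
    import org.openrdf.repository.RepositoryConnection;

    public class OutputValidator {
        // A SPARQL ASK query that returns true signals a violation; here we
        // check for dcterms:title values that are not literals.
        public static boolean violatesConstraint(RepositoryConnection con)
                throws Exception {
            String ask =
                    "PREFIX dcterms: <http://purl.org/dc/terms/> " +
                    "ASK { ?s dcterms:title ?t . FILTER(!isLiteral(?t)) }";
            return con.prepareBooleanQuery(QueryLanguage.SPARQL, ask).evaluate();
        }
    }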
As reported by the Westtoer datahub and Czech Trade Inspection Authority projects, data wranglers had difficulties when preparing data processing pipelines, because (1) they are not Linked Data experts and do not know the source and target vocabularies they should use, and (2) they sometimes did not know how certain DPUs should be interconnected in the pipelines to realize their particular need. To address (1), as future work we plan to include algorithms for the (semi-)automatic suggestion of suitable RDF ontologies representing source data; to address (2), we are preparing tutorials explaining how DPUs should be interconnected in typical tasks.
UnifiedViews in research projects
Currently, we also plan to use UnifiedViews in several research projects.
The goal of the EU-funded ALIGNED project is to develop tools and methods for quality-centric, co-evolving software and data engineering; UnifiedViews is planned to be used there for Linked Data processing tasks.
ADEQUATe, a project funded by the Austrian Research Promotion Agency (FFG), focuses on improving the quality of (open) data and its metadata; UnifiedViews is planned to support the associated data processing tasks.
Within the EU-funded YourDataStories project, UnifiedViews is planned to be used to prepare and integrate (open) data for data-driven journalism use cases.
Related work
There are plenty of ETL frameworks for preparing tabular data to be loaded into data warehouses, some of which are also open-source. These frameworks, however, do not natively support the processing of RDF data, which is the focus of UnifiedViews.
ODCleanStore [8] is a Java-based Linked Data management framework developed at Charles University in Prague, Department of Software Engineering; UnifiedViews may be considered its successor. Linked Data Manager (LDM) is another Java-based suite for scheduling and monitoring Linked Data ETL tasks.
DERI Pipes [10] is an engine and graphical environment for general Web data transformations. DERI Pipes supports the creation of custom DPUs; however, an adjustment of the core is needed every time a new DPU should be added, whereas in UnifiedViews it is possible to reload DPUs while the framework is running. DERI Pipes also does not provide any solution for library version clashes; in UnifiedViews, on the other hand, DPUs are loaded as OSGi bundles, so it is possible to use two DPUs requiring two different versions of the same dependency (library) without clashes. In DERI Pipes, it is not possible to debug the inputs and outputs of DPUs. Lastly, DERI Pipes seems to have been unmaintained for years.
Linked Data Integration Framework (LDIF) [12] is an open-source Linked Data integration framework that can be used to transform Web data. The framework consists of a predefined set of DPUs, whose behaviour may be adjusted via their configuration; however, new DPUs cannot be easily added.
Grafter is a Clojure library for transforming tabular data into RDF; unlike UnifiedViews, transformations are defined programmatically rather than in a graphical user interface.
Booth [3] presents an approach to automate data production pipelines using Semantic Web technologies. Every pipeline consists of nodes composed of two parts: an updater and a wrapper. A wrapper is a standard component that is responsible for invoking the updater, communicating with other nodes, and caching results; an updater executes the business logic of the node. The approach is decentralized – every node in a pipeline can be easily distributed across multiple servers with a minimal change to the pipeline definition and no change to the node’s updater. The approach has been implemented in the RDF Pipeline Framework.
Rautenberg et al. [11] present LODFlow, a Linked Data workflow management system, which provides an environment for planning, executing, reusing, and documenting Linked Data workflows. Nevertheless, the authors focus mainly on the description of a comprehensive ontological model, the Linked Data Workflow Project Ontology, for describing the workflows and a workflow execution engine; the actual implementation of the workflow system is ongoing and mainly future work.
Open Refine [13] is a tool for cleansing, transforming, and enriching tabular data. It also has an RDF extension which provides a service to disambiguate cell values to Linked Data entities, e.g., from the DBpedia knowledge base. Nevertheless, when compared with UnifiedViews, the purpose of the tool is different – it is used for manual refinement of the data, whereas UnifiedViews is used for preparing tasks which may be executed repeatedly without user interaction.
LinkedPipes ETL (LP-ETL) [7] is an RDF-based ETL framework with a similar set of features, developed based on the experience gathered while publishing Linked Data using UnifiedViews. Compared to UnifiedViews, LP-ETL does not provide features such as relational data units in the pipelines, multilingual user interfaces, and granular user and permission management. LP-ETL refocuses on Linked Data and the definition of the pipelines themselves. From the technical point of view, LP-ETL uses RDF as the native format for storing pipelines and configurations, it facilitates the sharing of pipelines and their fragments directly via their URLs and URL dereferencing, and it has a more granular, RESTful API providing access to all functionality. From the user point of view, the process of pipeline creation is more intuitive, as algorithms for suggesting the next possible DPUs based on various factors are employed, and sample pipeline fragments are attached to the documentation of individual DPUs, ready to be reused directly. UnifiedViews pipelines can be imported into LP-ETL, provided that the unsupported features are not used.
Conclusions
We presented UnifiedViews, an open-source ETL framework for processing RDF data, which addresses the problem of efficiently creating, debugging, and maintaining Linked Data processing tasks.
UnifiedViews combines several aspects that are crucial for its success in these projects:
Support for complex data processing tasks which may contain forking and merging of data flows; as a result, complex RDF data processing flows combining data from different sources may be easily prepared in UnifiedViews.
A robust data processing engine; this was proven, e.g., by the OpenData.cz project, involving the preparation and regular execution of tens of pipelines processing millions of triples (see Table 1).
Flexibility to choose from a large number of existing DPUs – UnifiedViews provides more than 35 core DPUs covering typical extraction, transformation, and loading steps (see Section 2).
A simple way to create custom DPUs, supported by an extensive set of tutorials and the helper classes of the Plugin-devEnv repository.
RDF data debugging. Data wranglers may debug (RDF) data flows between DPUs as they prepare pipelines, which decreases the time needed to prepare them.
An intuitive user interface, which was confirmed by data wranglers from most of the projects in Section 3, e.g., the project with Council Open Data Initiative.
The exemplary use cases introduced in Section 3 also allowed us to reveal certain shortcomings of UnifiedViews, which were summarized in Section 4.
UnifiedViews is pushed forward by a unique collaboration of a diverse group of partners – research institutes and SMEs across Europe.
Acknowledgements
This work was supported by the Seventh Framework Programme of the European Union, Grant Agreement number 611358, by the Czech Science Foundation (GAČR), grant number 16-09713S and by the project SVV 260451.
