Abstract
Ontology Based Data Access (OBDA) refers to a range of techniques, algorithms and systems that can be used to deal with the heterogeneity of data that is common inside many organisations as well as in inter-organisational settings and more openly on the Web. In OBDA, ontologies are used to provide a global view over multiple local datasets, and mappings are commonly used to describe the relationships between such global and local schemas. Since its inception, this area has evolved in several directions. Initially, the focus was on the translation of original sources into a global schema, and its materialisation, including non-OBDA approaches such as the use of Extract Transform Load (ETL) workflows in data warehouses and, more recently, in data lakes. Then OBDA-based query translation techniques, relying on mappings, were proposed with the aim of removing the need for materialisation, something especially useful for very dynamic data sources. We think that we are now witnessing the emergence of a new generation of OBDA approaches, driven by the fact that a new set of declarative mapping languages, most of which stem from the W3C Recommendation R2RML for Relational Databases (RDB), are being created. In this vision paper, we enumerate the reasons why new mapping languages are being introduced. We discuss why it may be relevant to work on translations among them, so as to benefit from the engines associated with each of them whenever one language and/or engine is more suitable than another. We discuss the emerging concept of “mapping translation”, the basis for this new generation of OBDA, together with some of its desirable properties: information preservation and query result preservation. We show several scenarios where mapping translation can be, or is already being, applied, even though this term has not necessarily been used in the existing literature.
Introduction
Database technologies play a vital role in the development of information systems for all sorts of organisations. So far, relational databases (RDB) remain the dominant type of structure and technology used for data management inside organisations, although other formats (e.g. JSON, spreadsheets, XML) and types of databases (e.g. NoSQL, graph databases) have also emerged as alternatives for data representation and management in recent decades.
In the early days of information system development, it was natural for organisations to develop their own data models, which were strongly aligned with their activities. This led to a large heterogeneity across organisations, and even across different departments inside the same organisation. Such heterogeneity was especially evident in the case of organisational changes, mergers, etc. Similarly, data warehouses were also used in order to align and materialise data from different sources, normally from the same organisation, so as to provide support for analytical queries and for the generation of reports. These situations made researchers and professionals start working on solutions for data integration, where data from several sources needed to be accessible according to a unified and global view over such local heterogeneous data sources. Popular technologies used in production systems worldwide included the use of Extract-Transform-Load (ETL) [25] workflows to overcome heterogeneity and ensure the availability of data in such data warehouses or in integrated databases. Indeed, these approaches are still widely used nowadays.
In the meantime, data integration challenges became even more pressing around two decades ago, when organisations started using Web technologies to provide access to their data (via Web Services, REST APIs, or Semantic Web [3] and Linked Data [4] approaches), both for their own information system development and for data sharing, and later on when public administrations started publishing open data according to public-sector information reuse initiatives. The availability and heterogeneity of data (both in terms of content and format) are nowadays at an unprecedented level. Following the aforementioned ETL approaches, the term data lake has been rather recently coined to refer to an evolution of data warehouses that considers not only structured data but also the other (semi-)structured and unstructured formats in which data is made available nowadays, as discussed above.
Over these decades, several approaches have been proposed to tackle data integration challenges. We are especially interested in those that fall under the area of Ontology Based Data Access (OBDA) and Integration (OBDI) [18]. From now on we will refer to both of them, in a general manner, as OBDA. In OBDA, ontologies are used as a global view over heterogeneous data sources. It is quite common to use a mediator-based approach [26], where mediators and wrappers are used as intermediaries to overcome the differences between the local schemas and the global view. In this setting, mappings are commonly used to describe such relationships in a declarative manner. These mappings may normally be exploited in two directions: for data translation, where the content of the local sources is materialised according to the global view, and for query translation, where queries posed over the global view are rewritten into queries over the local sources.
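To make these two directions concrete, the sketch below (in Python, with entirely hypothetical names and a deliberately simplified mapping structure, not a real mapping language) shows how a single declarative mapping can drive both data translation, by materialising triples, and query translation, by rewriting a global-view lookup into a source query:

```python
# A toy mapping from a global-view class to a local table.
# All names are illustrative; this is not a real mapping language.
MAPPING = {
    "class": "Person",
    "table": "employees",
    "subject_col": "emp_id",
    "predicates": {"name": "full_name"},
}

ROWS = [{"emp_id": 1, "full_name": "Ada"}, {"emp_id": 2, "full_name": "Grace"}]

def materialise(mapping, rows):
    """Data translation: apply the mapping up front to produce triples."""
    triples = []
    for row in rows:
        subj = f"{mapping['class']}/{row[mapping['subject_col']]}"
        triples.append((subj, "rdf:type", mapping["class"]))
        for pred, col in mapping["predicates"].items():
            triples.append((subj, pred, row[col]))
    return triples

def translate_query(mapping, predicate):
    """Query translation: rewrite a global-view lookup into a source query,
    leaving the data in place."""
    col = mapping["predicates"][predicate]
    return f"SELECT {mapping['subject_col']}, {col} FROM {mapping['table']}"
```

The same mapping is the single declarative artefact behind both functions, which is precisely what makes mappings such a convenient unit for translation.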
Many different types of OBDA mapping languages have been proposed over the last decades [12], with a large variety of syntaxes and formats, especially in the early ones. Since the standardisation of languages like RDF and OWL, several languages were proposed that focused on the transformation of relational databases into RDF (e.g. D2R, R2O). This led to the creation of the RDB2RDF W3C Working Group, which published two recommendations for transforming the content of relational databases into RDF: Direct Mapping [1] and R2RML [6]. The Direct Mapping approach specifies simple transformations that require no intervention from users. R2RML allows specifying transformation rules, such as how URIs should be generated, which columns should be used for the transformation, etc. A bit after R2RML was recommended, and because of its use in different types of contexts, new needs and requirements arose, especially in relation to supporting other formats beyond relational databases. This resulted in the creation of many new mapping languages, such as RML [8] (to deal with CSV, JSON and XML data sources), xR2RML [17] (to deal with MongoDB), KR2RML [23] (to deal with nested data) and CSVW [24] (to describe tabular data on the Web), among others.
There are several reasons why new mapping languages are needed. The first and main reason is that a typical mapping language is designed to work with a specific type of data source, while new data formats and database systems continue to appear.

Timeline of data integration techniques. ETL approaches, which started in the 1970s with data translation techniques, were followed by the current generation of OBDA systems, which incorporate query translation techniques, and by the next generation of OBDA systems, in which mapping translation approaches are to be applied.
Therefore, the current situation of an OBDA practitioner who needs to provide access to a varied set of heterogeneous data sources is that there are many different options to select from, and it is difficult to determine which one is best for each situation. Languages are not necessarily interoperable, and many of them come associated with a very specific engine that supports them. However, at the same time, it is clear that most of these languages share many common aspects, such as the description of where the data comes from, how URIs can be created for resources, how triples need to be generated (in a materialised or virtual way), etc. Having the possibility of translating among these different languages, covering at least those common characteristics that are shared across languages, would allow practitioners to select from a wider set of engines to implement their OBDA solutions.
In this paper, we lay out our vision that the next generation of OBDA systems should take advantage of this proliferation of mapping languages. In other words, in addition to the data translation and query translation techniques that have been widely addressed in the state of the art of OBDA so far, the OBDA research community will need to think carefully about how to address mapping translation (see Fig. 1).
The paper is organised as follows. In Section 2 we informally discuss the concept of mapping translation and some of its desirable properties. A deeper formalisation of the concept and properties is out of the scope of this paper, although we consider it a relevant topic to better understand and characterise ongoing activities in this area. Several scenarios where mapping translation is already being applied or where we think that mapping translation would be clearly applicable are presented in Section 3. Finally, conclusions and practical implications of this vision are discussed in Section 4.
We define the mapping translation concept as a function that transforms a set of mappings described in one language (we call them original mappings) into a set of mappings described in another language (we call them target mappings).
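As a minimal illustration of this definition, the following Python sketch (all structures are hypothetical simplifications, not actual mapping-language syntax) implements such a function: it translates a compact, YARRRML-style mapping dictionary into a more verbose, R2RML-style one:

```python
def translate_mapping(original):
    """Translate a compact, YARRRML-style mapping (hypothetical
    simplification) into a more verbose, R2RML-style structure."""
    return {
        "logicalTable": {"tableName": original["sources"][0]},
        "subjectMap": {"template": original["s"]},
        "predicateObjectMaps": [
            {"predicate": p, "objectMap": {"column": c}}
            for p, c in original["po"].items()
        ],
    }
```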
Our next step is to attach desirable properties to such a function. Along this line, we propose to use and adapt some properties that have been described in [22] and [11]. To be more specific, those properties are the following:
The information preservation property (IPP) holds when the application of the original and the target mappings over the same data produces equivalent results.

Mapping translator properties. The results (triangles) may satisfy the IPP after the application of the original and target mappings over the same data. In the same way, the query results (rectangles) may satisfy the QRPP when equivalent queries are evaluated over the original and target results.
The query result preservation property (QRPP) holds when the evaluation of equivalent queries over the results produced by the original and the target mappings returns the same answers.
Finally, using these two properties, we define the concepts of weak and strong semantics preservation for a mapping translation function, as follows: a mapping translation function exhibits weak semantics preservation when it satisfies the QRPP, and strong semantics preservation when it satisfies both the IPP and the QRPP.
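These two properties can be sketched programmatically. In the toy Python example below (hypothetical mappings over an in-memory table, not a formal account), two mappings that emit the same triples in different order satisfy both checks:

```python
def mapping_a(rows):
    """A hypothetical mapping producing one 'name' triple per row."""
    return [(f"person/{r['id']}", "name", r["name"]) for r in rows]

def mapping_b(rows):
    """A second mapping emitting the same triples in reverse order."""
    return list(reversed(mapping_a(rows)))

def ipp_holds(map_a, map_b, data):
    """IPP: both mappings yield the same set of triples over the same data."""
    return set(map_a(data)) == set(map_b(data))

def qrpp_holds(map_a, map_b, data, query):
    """QRPP: an equivalent query over both results gives the same answers."""
    return set(query(map_a(data))) == set(query(map_b(data)))
```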
Summary of mapping translation approaches
In this section we identify a set of scenarios and challenges in the creation and use of OBDA mapping languages where mapping translation is relevant. For each one, we describe the challenge and provide references to work in the literature that addresses or acknowledges it. The presented use cases are summarised in Table 1.
Improving mapping creation and maintenance
Creating and maintaining OBDA mappings is usually difficult, since mapping languages have been created so that they can be consumed by the corresponding OBDA engines, and they commonly suffer from readability and compactness problems. With respect to readability, alternative serialisations have been proposed.
YARRRML [14] is a serialisation of RML mappings that uses the YAML (a human-readable data serialization language) format.
In the case of multidimensional data (e.g. official statistics data), the W3C RDF Data Cube recommendation is the ontology that is commonly used as a global view in an OBDA setting. In most cases, the number of mappings that would need to be created to link the original data source with the ontology is rather large, and these mappings have a similar structure. Therefore, there is a high risk that the [R2]RML mapping document(s) generated in the end will contain clerical errors due to copy-paste-edit operations. Furthermore, they will be difficult to maintain. As a result, RMLC-Iterator [7] was proposed as a simplified mapping language specifically designed for this type of data, adding two new properties to the R2RML specification: a property to define the array of access columns and a corresponding dictionary in case the value of a header needs to be changed. With this approach, the mapping size is drastically reduced. Additionally, a tool is provided to translate RMLC-Iterator mappings into R2RML, hence allowing the use of any R2RML-compliant OBDA engine.
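The core idea behind this kind of translation can be sketched as follows (a hypothetical simplification in Python, not the actual RMLC-Iterator syntax): a compact mapping declaring an array of columns and an optional renaming dictionary is expanded into one mapping rule per column:

```python
def expand_columns(compact):
    """Expand a compact mapping that lists an array of columns (plus an
    optional dictionary renaming headers) into one rule per column."""
    rename = compact.get("dictionary", {})
    return [
        {"predicate": rename.get(col, col), "objectMap": {"column": col}}
        for col in compact["columns"]
    ]
```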
Introduced in 2000, REST [10] has now become the most popular architecture for the provision of web services and the implementation of Web-based applications. However, the complexity of software development continues to evolve, and aspects that used to receive little attention, such as the size of the data being exchanged or the number of API calls being made, are now becoming more relevant in the context of mobile application development. As a result, problems like the over-fetching and under-fetching of data have gained visibility, and frameworks such as GraphQL have emerged to address them.
The two main components of a GraphQL server are the GraphQL schema, which provides a unified view of the underlying data, and the resolvers, which are the functions in charge of obtaining from the data sources the data required to answer each part of a query.
From this basic description of the GraphQL framework, the analogy with the OBDA architecture is clear. Typically, the following tasks need to be done to set up a GraphQL server:
A domain expert will analyse the underlying datasets, propose a unified GraphQL schema and describe how the underlying data sources will need to be mapped into it. Note that there is no standard mechanism to represent these mappings.
A software developer will then implement those mappings as GraphQL resolvers. Generating GraphQL resolvers is difficult even for a standard-sized dataset, which typically contains more than a handful of tables and hundreds of properties. This situation is even worse if the underlying dataset evolves, considering that the corresponding resolvers have to be updated as well.
In a recent paper [19] we proposed the use of the mapping translation concept to facilitate the generation of GraphQL resolvers. We propose specifying mappings in R2RML, which is a well-defined and formalised mapping language, and applying a mapping translation technique to automatically generate the corresponding GraphQL schemas and resolvers in different programming languages. Our intuition is that, following this approach, GraphQL resolvers will be easier to maintain, as the underlying mappings are declarative and independent from any programming language.
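A minimal sketch of this generation step is shown below, assuming a drastically simplified mapping structure and a hypothetical `context.db.query` API on the JavaScript side; the actual approach in [19] works on full R2RML mappings:

```python
def generate_resolver(mapping):
    """Emit the source code of a (hypothetical) JavaScript resolver
    from a simplified table mapping."""
    cols = ", ".join(mapping["columns"])
    return (
        f"{mapping['type']}: (parent, args, context) =>\n"
        f"  context.db.query('SELECT {cols} FROM {mapping['table']}"
        f" WHERE {mapping['key']} = ?', [args.id])"
    )
```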
Semi-structured data formats are among the most widely used formats to publish data on the Web. Although existing mapping languages provide support for this type of data source, existing engines are mostly focused on the generation (materialisation) of RDF-based knowledge graphs, with only a few proposals (e.g. xR2RML [17]) focused on the application of query translation techniques (virtualisation) over such data sources.
In the specific case of spreadsheets (CSV), providing access to this format is difficult for two main reasons: (i) CSV does not provide its own query language, and (ii) some transformations are commonly needed when treating data available in CSV files. To solve the first issue, query translation techniques have been applied to this format by considering a CSV file as a single table that can be loaded into an RDB. For the second issue, extensions of well-known mapping languages (RML together with the Function Ontology [16]) and annotations following the CSVW specification [24] can be used.
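The first of these ideas, treating a CSV file as a single RDB table so that SQL becomes available as its query language, can be sketched in a few lines of Python using only the standard library (illustrative only; real engines handle typing, escaping and large files):

```python
import csv
import io
import sqlite3

def query_csv(csv_text, sql):
    """Load a CSV file into a single in-memory RDB table named 'data' and
    evaluate a SQL query over it (no typing or sanitising is performed)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    con = sqlite3.connect(":memory:")
    con.execute(f"CREATE TABLE data ({', '.join(header)})")
    placeholders = ", ".join("?" * len(header))
    con.executemany(f"INSERT INTO data VALUES ({placeholders})", body)
    return con.execute(sql).fetchall()
```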
Morph-CSV is a framework that exploits these extensions: it uses CSVW annotations and RML+FnO mappings to normalise and enrich the input CSV files, loading them into an RDB and translating the mappings into R2RML, so that any R2RML-compliant query translation engine can be used over the result.
To the best of our knowledge, there has not been yet any formal study of the relationship between R2RML and the Direct Mapping recommendations, and among the many different mapping languages that have arisen recently, as pointed out in Section 1.
For the first case (R2RML and Direct Mapping), we may intuitively consider that the Direct Mapping is a subset of R2RML, given the expressive power provided by the latter. However, it would be interesting to know how expressive the Direct Mapping may be if, for instance, views are generated over the underlying data sources. Our intuition is that, given the possibility of creating a database view from an existing database, there exists a fragment of R2RML that can be translated into the Direct Mapping, such that the application of the Direct Mapping over the view generates equivalent results to the application of the R2RML mappings over the original database. Finding such a fragment has a practical implication: it would lower the barrier for transforming data into RDF and enable people to use Direct Mapping engines, which are in general easier to use than R2RML engines for those who are used to managing databases.
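This intuition can be sketched as follows (a hypothetical simplification in Python, not a complete characterisation of the fragment): an R2RML-style column selection and renaming is rewritten as a database view, so that a Direct Mapping engine applied to the view exposes the intended labels:

```python
def r2rml_to_view(mapping, view_name):
    """Rewrite a (highly simplified) R2RML-style column selection as a
    database view, so that a Direct Mapping engine applied to the view
    exposes the labels the R2RML mapping would have produced."""
    cols = ", ".join(
        f"{col} AS {label}" for label, col in mapping["columns"].items()
    )
    return f"CREATE VIEW {view_name} AS SELECT {cols} FROM {mapping['table']}"
```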
Similarly, this analysis may be extended to other combinations of mapping languages, so as to enable mapping translations among them that would allow exploiting the specific characteristics of each associated implementation, as well as formally describing their semantics, especially in those cases where no formal specification of the semantics has been provided yet.
Ontop [21] is an OBDA system that comes with both data and query translation techniques. Ontop translates R2RML mappings into its own mapping language, called “OBDA mappings”. These mappings are represented as datalog rules, allowing formalisation and semantic optimisation techniques (e.g. self-join elimination) to be performed, and generating more efficient SQL queries that can be evaluated in less time by the underlying databases.
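As an illustration of this kind of semantic optimisation, the Python sketch below (a toy representation of conjunctive query atoms, not Ontop's actual datalog machinery) merges two atoms over the same relation and join key, which is the essence of self-join elimination:

```python
def eliminate_self_joins(atoms):
    """Merge atoms over the same relation that join on the same key variable,
    unifying their column-to-variable bindings. Each atom is a toy triple
    (relation, key_variable, bindings)."""
    merged = {}
    for rel, key, bindings in atoms:
        merged.setdefault((rel, key), {}).update(bindings)
    return [(rel, key, b) for (rel, key), b in merged.items()]
```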
Conclusions and practical implications
In this vision paper, we have discussed the concept of mapping translation, which had not been explicitly addressed before in the literature. We have shown how this concept has actually been implemented in some approaches addressing the readability and maintenance of mappings, the generation of programming code to provide access to heterogeneous data sources, or the enrichment of original data sources, among others.
We think that this concept needs to be explored further, as this would enable a new range of OBDA approaches that may be part of a new OBDA generation, as claimed in the title of this paper. In our opinion, the OBDA community should see this variety of mapping languages not only as a challenge (e.g. for interoperability) but also, and mainly, as an opportunity for further research and development in this area, to address the need to cover more types of data sources while taking advantage of all the work that has been done on advanced aspects like query translation. Providing mapping translation services across mapping languages would bring further benefits and increase the availability of ontology-based data for its exploitation by search engines and query answering systems at Web scale.
Acknowledgements
The work presented in this paper is supported by Ministerio de Economía, Industria y Competitividad and EU FEDER funds under the DATOS 4.0: RETOS Y SOLUCIONES – UPM Spanish national project (TIN2016-78011-C4-4-R) and by an FPI grant (BES-2017-082511). This work has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 820621.
