RODI: Benchmarking relational-to-ontology mapping generation quality

Abstract

Accessing and utilizing enterprise or Web data that is scattered across multiple data sources is an important task for both applications and users. Ontology-based data integration, where an ontology mediates between the raw data and its consumers, is a promising approach to facilitate such scenarios. This approach crucially relies on useful mappings to relate the ontology and the data, the latter being typically stored in relational databases. A number of systems to support the construction of such mappings have recently been developed. A generic and effective benchmark for reliable and comparable evaluation of the practical utility of such systems would make an important contribution to the development of ontology-based data integration systems and their application in practice. We have proposed such a benchmark, called RODI. In this paper, we present a new version of RODI, which significantly extends our previous benchmark, and we evaluate various systems with it. RODI includes test scenarios from the domains of scientific conferences, geographical data, and oil and gas exploration. Scenarios consist of databases, ontologies, and queries for checking expected results. Systems that compute relational-to-ontology mappings can be evaluated using RODI by checking how well they can handle various features of relational schemas and ontologies, and how well the computed mappings work for query answering. Using RODI, we conducted a comprehensive evaluation of seven systems.

Keywords

Mappings, relational databases, RDB2RDF, R2RML, benchmarking, bootstrapping

1. Introduction

1.1. Motivation

Accessing and utilizing enterprise or Web data that is scattered across multiple databases is an important task for both applications and users in many scenarios [13,31]. Ontology-based data integration is a promising approach to this task, and recently it has been successfully applied in practice (e.g., [27]). The main idea behind this approach is to employ an ontology to mediate between data consumers and databases. Mappings can then be used to either export data to consumers or to translate (or rewrite) consumer queries into queries over the underlying databases on the fly. The latter approach is often referred to as ontology-based data access (OBDA).

Ontology-based data integration crucially depends on usable and useful ontologies and mappings. Ontology development has attracted a lot of attention in the last decade, and ontologies have been developed for various domains including life sciences (e.g., [3]), medicine (e.g., [16]), the energy sector (e.g., [28]), and others. Many of these ontologies are generic enough to be useful as a target ontology in a significant number of ontology-based integration scenarios.

The development of reusable mappings has, however, received much less attention. Moreover, mappings are typically tailored to relate one specific ontology to one specific database. As a result, mappings cannot be reused across integration scenarios as easily as ontologies. Thus, each new integration scenario essentially requires the development of its own mappings. This is a complex and time-consuming process that calls for automatic or semi-automatic support, i.e., for systems that (semi-)automatically construct useful mappings. In order to address this challenge, a number of systems that generate relational-to-ontology mappings have recently been developed [6,17,25,29,40,47,56].

Whether such generated relational-to-ontology mappings are useful in practice is usually evaluated using self-designed and therefore potentially biased benchmarks. This makes it particularly difficult to compare results across systems. Consequently, there is not enough evidence to select an adequate mapping generation system for ontology-based data integration projects. Ultimately, what matters in practice is whether the generated mappings are usable and useful for the task at hand. We therefore consider mapping quality as mapping utility with respect to a query workload posed against the mapped data.¹

¹ Utility has also been referred to as fitness for use in similar contexts in parts of the literature, cf. [58].

Note that this definition of mapping quality is narrower than the notion of multi-dimensional quality that is also frequently used in the literature (e.g., [11,54,58]). Utility is of particular importance in large-scale industrial projects where support from (semi-)automatic systems is vital (e.g., [27]). To help ontology-based data integration find its way into mainstream practice, there is a need for a generic and effective benchmark that can be used to reliably evaluate mapping generation systems w.r.t. their utility under actual query workloads. RODI, our mapping generation quality benchmark for Relational-to-Ontology Data Integration scenarios, addresses this challenge.

1.2. RODI benchmark approach

The RODI benchmark is composed of (i) a software framework to test systems that generate mappings between relational schemata [18] and OWL 2 ontologies [10], (ii) a scoring function to measure the utility of system-generated mappings under a query workload, (iii) different datasets and queries for benchmarking, which we call benchmark scenarios, and (iv) a mechanism to extend the benchmark with additional scenarios. Using RODI, one can evaluate the quality (i.e., actual utility) of relational-to-ontology mappings produced by systems for ontology-based data integration indirectly, by querying the resulting data.

To make this possible, RODI is designed as an end-to-end benchmark. That is, we consider systems that can produce mappings directly between relational databases and ontologies. Also, we evaluate mappings according to their utility for an actual query workload over real-world or realistic databases.

Naturally, such an approach has both advantages and disadvantages. The end-to-end setup allows almost any system that maps data between relational schemata and ontologies to participate in the benchmark, even if it does not support certain standards or languages. However, it also means that we cannot directly analyze mapping rules or other intermediate artifacts of the process. Our use of real-world databases, and of other realistic databases that closely emulate a real-world case, brings the benefit of testing systems under conditions similar to those they would encounter in real life, rather than under purely academic considerations. The same argument also holds for using a query workload. On the downside, the distribution of tasks and challenges cannot be controlled systematically, and it is not backed by empirical evidence but rather built on qualitative experience from individual applications. We compensate for the latter disadvantage by (a) including a wide range of scenarios from different application domains and (b) allowing users to extend the benchmark with further scenarios and domains.

We believe that this real-world, end-to-end approach is currently the most suitable way for testing the utility of relational-to-ontology mapping generation systems.

Figure 1 schematically depicts the RODI architecture: the benchmark comes with a number of benchmark scenarios, which the framework initializes and sets up for use. Candidate systems then read their input from the active scenario and produce mappings, which are in turn evaluated by the framework.

Fig. 1.

RODI benchmark overview.

1.3. Contributions

In this section we summarize the main contributions of the RODI benchmark. We note that an earlier version of RODI has been introduced in [39]. In this paper we significantly extend our earlier results in several important directions: we extended the systematic analyses of challenges and related approaches; we now cover several new benchmark scenarios as well as additional test categories; we significantly extended the scope of the experimental study and now we cover seven different systems.

In the following we describe the main characteristics of RODI and highlight the enhancements with respect to its predecessor [39].

Systematic analyses of challenges and existing approaches in relational-to-ontology mapping generation: These support and explain the types of challenges tested by RODI. This paper contains an updated summary of previous work in addition to a newly added discussion of mapping approaches.

Evaluation scenarios: RODI consists of data integration test scenarios from the domains of conferencing, geographical data, and oil and gas exploration. Scenarios consist of databases, ontologies, and queries for checking expected results. Components of the scenarios are selected so that they cover the key challenges of relational-to-ontology mapping generation. The version of RODI presented in this paper contains 18 scenarios in three different domains, as opposed to only 9 scenarios from two domains in the previous version of the benchmark [39]. The newly added scenarios focus on features that are important to test in real-world settings, such as high semantic heterogeneity or complex query workloads in different application domains.

The RODI framework: The RODI software package, including all scenarios, has been implemented and made available for public download under an open-source license.²

² Ready-to-use RODI distribution available at: http://www.cs.ox.ac.uk/isg/tools/RODI/. Source code available on GitHub: https://github.com/chrpin/rodi.

In this paper, we describe the new version of the framework in greater detail than before, so that readers can thoroughly understand and independently judge RODI’s evaluation procedures. Readers should also be able to use the paper as a starting point for applying the benchmark themselves. To this end, we also describe key implementation details for the first time.

System Evaluation: We have used RODI to evaluate seven relational-to-ontology mapping systems: BootOX [25], COMA++ [2], IncMap [40], MIRROR [17], the -ontop- bootstrapper [7], D2RQ [6], and Karma [29]. The systems were chosen to cover the breadth of recent and traditional approaches in (semi-)automatic schema-to-ontology mapping generation. The insights gained from the evaluation allow us to point out specific strengths and weaknesses of individual systems and to propose how they can be improved. Compared to our preliminary experiments in [39], the study presented in this paper not only covers twice as many benchmark scenarios and three additional systems (COMA++, D2RQ, and Karma), but also gives much greater detail on several aspects of the results, such as support for 1:n and n:1 mappings. For the first time, it also includes semi-automatic experiments. In total, we present numbers for seven different reporting categories and drill-downs, compared to only two in our preliminary study. The accompanying discussion also adds significant detailed insights over the earlier paper.

In the new version of RODI, we have also modified all benchmark scenarios to produce more specific individual scores rather than aggregated values for relevant categories of tests. In addition, we have extended the benchmark framework to allow detailed debugging of the results for each individual test. On that basis, we were able to identify individual issues and bugs in several systems, some of which have already been addressed by the authors of the evaluated systems.

1.4. Outline

First, we present our analysis of the different types of mapping challenges for relational-to-ontology mapping generation in Section 2. Then, Section 3 discusses differences between mapping generation approaches that need to be considered when designing appropriate evaluation procedures. Section 4 presents our benchmark suite and the evaluation procedure. Afterwards, Section 5 discusses some implementation details that should help researchers and practitioners understand how their systems can be evaluated with our benchmark suite. Section 6 then presents our evaluation, including a detailed discussion of results. Finally, Section 7 summarizes related work, and Section 8 concludes the paper and provides an outlook on future work.

2. Mapping challenges

In the following we give a summary of our classification of different types of mapping challenges in relational-to-ontology data integration scenarios. For a more detailed discussion, please refer to [39]. As a high-level classification, we use the standard classification for data integration described by Batini et al. [5]: naming conflicts, structural heterogeneity, and semantic heterogeneity. For each of these classes, we list and briefly describe the key challenges that we have identified.

2.1. Naming conflicts

Table 1
Detailed list of specific structural mapping challenges. RDB patterns may correspond to some of the “guiding” ontology axioms and language constructs. Specific difficulties explain particular hurdles in constructing mappings.

#	Challenge type	RDB pattern	Examples of relevant guiding OWL language constructs	Specific difficulty
(1)	Normalization	Weak entity table (depends on other table, e.g., in a part-of relationship)	owl:Class	JOIN to extract full IDs
(2)	Normalization	1:n attribute	owl:DatatypeProperty	JOIN to relate attribute with entity ID
(3)	Normalization	1:n relation	owl:ObjectProperty, owl:InverseFunctionalProperty	JOIN to relate entity IDs
(4)	Normalization	n:m relation	owl:ObjectProperty	3-way JOIN to relate entity IDs
(5)	Normalization	Indirect n:m relation (using additional intermediary tables)	owl:ObjectProperty	k-way JOIN to relate entity IDs
(6)	Denormalization	Correlated entities (in shared table)	owl:Class	Filter condition
(7)	Denormalization	Multi-value	owl:DatatypeProperty, owl:minCardinality [> 1], owl:maxCardinality [> 1], owl:cardinality [> 1]	Handling of duplicate IDs
(8)	Class hierarchies	1:n property match	rdfs:subClassOf, owl:unionOf, owl:disjointWith	UNION to assemble redundant properties
(9)	Class hierarchies	n:1 class match with type column	rdfs:subClassOf, owl:unionOf	Filter condition
(10)	Class hierarchies	n:1 class match without type column	rdfs:subClassOf, owl:unionOf	JOIN condition as implicit filter
(11)	Key conflicts	Plain composite key	owl:Class, owl:hasKey	Technical handling (e.g., Skolemization)
(12)	Key conflicts	Composite key, n:1 class matching to partial keys	owl:Class, owl:hasKey, rdfs:subClassOf	Choice of correct partial keys
(13)	Key conflicts	Missing key (e.g., no UNIQUE constraint on secondary key)	owl:Class, owl:hasKey	Choice of correct non-key attribute as ID
(14)	Key conflicts	Missing reference (no foreign key where relevant relation exists)	owl:ObjectProperty, owl:DatatypeProperty	Unconstrained attributes as references
(15)	Dependency conflicts	1:n attribute	owl:FunctionalProperty, owl:minCardinality [> 1], owl:maxCardinality [> 1], owl:cardinality [> 1]	Misleading guiding axioms; possible restriction violations
(16)	Dependency conflicts	1:n relation	owl:FunctionalProperty, owl:minCardinality [> 1], owl:maxCardinality [> 1], owl:cardinality [> 1]	Misleading guiding axioms; possible restriction violations
(17)	Dependency conflicts	n:m relation	owl:FunctionalProperty, owl:InverseFunctionalProperty, owl:minCardinality [> 1], owl:maxCardinality [> 1], owl:cardinality [> 1]	Misleading guiding axioms; possible restriction violations

Typically, relational database schemata and ontologies use different conventions to name their artifacts even when they model the same domain based on the same specification and thus should use a similar terminology. The main challenge here is to be able to find similar names despite the different naming patterns. We are particularly interested in differences that are specific to inter-model matching and come on top of other naming differences, which commonly exist in other schema matching cases as well.
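As a rough illustration of why differing naming conventions matter, the following Python sketch normalizes identifiers written in different styles (e.g., SQL-style CONF_PAPER_AUTHOR versus OWL-style hasAuthor) into lower-case tokens and compares them by token overlap (Jaccard similarity). The splitting rules and the similarity measure are our own illustrative assumptions, not the method of RODI or of any evaluated system; real matchers also need to handle abbreviations (e.g., conf vs. conference) and synonyms, which this sketch does not.

```python
import re


def name_tokens(identifier: str) -> set[str]:
    """Split an SQL or OWL identifier into lower-case word tokens.

    Illustrative assumption: separators are '_', '-' or whitespace, and a
    lower-case letter or digit followed by an upper-case letter starts a new word.
    """
    spaced = re.sub(r"[_\-\s]+", " ", identifier)           # snake_case, kebab-case
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", spaced)  # camelCase boundaries
    return {tok.lower() for tok in spaced.split() if tok}


def name_similarity(a: str, b: str) -> float:
    """Jaccard similarity of the token sets of two identifiers."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(name_tokens("CONF_PAPER_AUTHOR"), name_tokens("hasAuthor"))
print(name_similarity("PAPER_AUTHOR", "paperAuthor"))
```

For example, PAPER_AUTHOR and paperAuthor yield identical token sets, while conf_paper and Paper share only one of their two distinct tokens.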

2.2. Structural heterogeneity

The most important differences in relational-to-ontology integration scenarios, compared to other integration scenarios, are structural heterogeneities. Table 1 lists all specific testable relational-to-ontology structural challenges that we have identified.

In brief, there are type conflicts resulting from normalization, denormalization or different modeling of class hierarchies, key conflicts, and dependency conflicts.

2.2.1. Type conflicts

Most real-world relational schemata and corresponding ontologies cannot be related by any simple canonical mapping. This is because the same concepts are often modeled in very different ways (i.e., type conflicts). One reason for these large differences is that relational schemata are often optimized towards a given workload (e.g., they are normalized for update-intensive workloads or denormalized for read-intensive workloads). Ontologies, on the other hand, model a domain on the conceptual level, albeit with different degrees of expressiveness and thus conceptual richness. Another reason is that some modeling elements have no single canonical translation (e.g., class hierarchies in ontologies can be mapped to relational schemata in different ways). In the following, we list the different type conflicts covered by RODI:

Normalization artifacts: Often properties that belong to a class in an ontology are spread over different tables in the relational schema as a consequence of normalization.

Denormalization artifacts: For read-intensive workloads, tables are often denormalized. Thus, properties of different classes in the ontology might map to attributes in the same table.

Class hierarchies: Ontologies typically make use of explicit class hierarchies. Relational models implement class hierarchies implicitly, typically using one of three common modeling patterns (cf. [18, Chap. 3]). (i) The relational schema materializes several subclasses in the same table and uses additional attributes to indicate the subclass of each individual. With this variant, mapping systems have to resolve n:1 matches, i.e., they need to filter a single table to extract information about different classes. (ii) The schema uses one table per most specific class in the class hierarchy and materializes the inherited attributes in each table separately. Thus, the same property of the ontology must be mapped to several tables, leading to 1:n matches. (iii) The schema uses one table for each class in the hierarchy, including possibly abstract superclasses. Tables then use primary key-foreign key references to indicate the subclass relationship.
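To make the three class hierarchy patterns concrete, the following sketch uses Python's built-in sqlite3 module with a hypothetical conference-style example: an ontology class :FullPaper below :Paper and a property :title (all table, class, and property names are invented for illustration). It shows the SQL a mapping would need under each pattern: a filter condition for (i), a JOIN acting as an implicit filter for (iii), and a UNION for the 1:n property match in (ii), corresponding to rows (8) to (10) of Table 1.

```python
import sqlite3

# Hypothetical schemas for the three patterns; the ontology is assumed to
# contain :FullPaper rdfs:subClassOf :Paper and a datatype property :title.
con = sqlite3.connect(":memory:")
con.executescript("""
-- (i) several subclasses in one table, with a type column
CREATE TABLE paper_i (id INTEGER PRIMARY KEY, title TEXT, kind TEXT);
INSERT INTO paper_i VALUES (1, 'A', 'full'), (2, 'B', 'short'), (3, 'C', 'full');

-- (ii) one table per most specific class, inherited attribute repeated
CREATE TABLE full_paper_ii (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE short_paper_ii (id INTEGER PRIMARY KEY, title TEXT);
INSERT INTO full_paper_ii VALUES (1, 'A'), (3, 'C');
INSERT INTO short_paper_ii VALUES (2, 'B');

-- (iii) one table per class, subclass rows reference the superclass table
CREATE TABLE paper_iii (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE full_paper_iii (id INTEGER PRIMARY KEY REFERENCES paper_iii(id));
INSERT INTO paper_iii VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO full_paper_iii VALUES (1), (3);
""")

# (i) n:1 class match with a type column: a filter condition selects :FullPaper.
full_i = con.execute(
    "SELECT id FROM paper_i WHERE kind = 'full' ORDER BY id").fetchall()

# (iii) n:1 class match without a type column: the JOIN acts as an implicit filter.
full_iii = con.execute(
    "SELECT p.id FROM paper_iii p JOIN full_paper_iii f ON f.id = p.id ORDER BY p.id"
).fetchall()

# (ii) 1:n property match: :title is spread over two tables and needs a UNION.
titles_ii = con.execute(
    "SELECT id, title FROM full_paper_ii "
    "UNION SELECT id, title FROM short_paper_ii ORDER BY id"
).fetchall()

print(full_i, full_iii, titles_ii)
```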

2.2.2. Key conflicts

In ontologies and relational schemata, keys and references are represented differently.

Keys: Keys in databases are usually implemented using primary keys and unique constraints, while ontologies use IRIs for individuals. The challenge is that integration tools must be able to compose or skolemize appropriate IRIs.

References: While relational schemata typically model references as foreign keys, ontologies use object properties. Moreover, relational databases sometimes do not model foreign key constraints at all.
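The following Python sketch illustrates both challenges: a (possibly composite) primary key is turned into a deterministic IRI, similar in spirit to the template-based IRI generation of W3C R2RML, and a foreign-key reference is turned into an object property triple between two such IRIs. The base IRI, table names, property IRI, and the concrete key-to-IRI scheme are hypothetical choices for illustration only.

```python
from urllib.parse import quote

# Hypothetical base IRI for generated individuals.
BASE = "http://example.org/data/"


def entity_iri(table: str, key: dict[str, object]) -> str:
    """Build a deterministic IRI from a (possibly composite) primary key.

    Key columns are sorted by name so the same key always yields the same IRI,
    and values are percent-encoded so they are safe inside an IRI path.
    """
    parts = [f"{col}={quote(str(val), safe='')}" for col, val in sorted(key.items())]
    return f"{BASE}{quote(table, safe='')}/" + ";".join(parts)


def reference_triple(src_table, src_key, prop_iri, tgt_table, tgt_key):
    """Turn one foreign-key reference into an object property triple."""
    return (entity_iri(src_table, src_key), prop_iri, entity_iri(tgt_table, tgt_key))


# A review identified by the composite key (paper_id, reviewer_id) refers to a paper.
triple = reference_triple(
    "review", {"paper_id": 7, "reviewer_id": 3},
    "http://example.org/onto#reviewOf",  # hypothetical object property
    "paper", {"id": 7},
)
print(triple)
```

Sorting the key columns is one simple way to make IRIs independent of column order; when a foreign key constraint is missing (row (14) of Table 1), a system would first have to guess which attributes act as references before it could generate such triples.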

2.2.3. Dependency conflicts

These conflicts arise when a group of concepts is related by different dependencies (i.e., 1:1, 1:n, n:m) in the relational schema and in the ontology. Relational schemata also often model n:m relationships using an additional connecting table.
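A minimal sqlite3 sketch of the n:m case (row (4) of Table 1): the relationship between papers and persons is stored in a connecting table, so producing the pairs of a hypothetical object property :hasAuthor requires a 3-way JOIN. The last lines show how such data can clash with a misleading guiding axiom: if the ontology declared :hasAuthor as an owl:FunctionalProperty (an assumption made only for illustration), paper P1 with two authors would violate it. All names and data are hypothetical.

```python
import sqlite3
from collections import Counter

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT);
-- connecting table that models the n:m relationship
CREATE TABLE authorship (
  paper_id INTEGER REFERENCES paper(id),
  person_id INTEGER REFERENCES person(id),
  PRIMARY KEY (paper_id, person_id)
);
INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob');
INSERT INTO paper VALUES (10, 'P1'), (11, 'P2');
INSERT INTO authorship VALUES (10, 1), (10, 2), (11, 2);
""")

# Pairs for the hypothetical object property :hasAuthor (paper -> person): 3-way JOIN.
has_author = con.execute("""
  SELECT pa.title, pe.name
  FROM paper pa
  JOIN authorship a ON a.paper_id = pa.id
  JOIN person pe ON pe.id = a.person_id
  ORDER BY pa.title, pe.name
""").fetchall()

# If :hasAuthor were (wrongly) declared functional, these papers would violate it.
authors_per_paper = Counter(title for title, _ in has_author)
violations = [title for title, n in authors_per_paper.items() if n > 1]
print(has_author, violations)
```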

2.3. Semantic heterogeneity

Semantic heterogeneity plays a central role in data integration in general. Therefore, RODI extensively tests scenarios with significant semantic heterogeneity.

Besides the usual semantic differences between any two conceptual models of the same domain, three additional factors apply in relational-to-ontology data integration: (i) the impedance mismatch caused by the object-relational gap, i.e., ontologies group information around entities (objects) while relational databases encode them in a series of values that are structured in relations; (ii) the impedance mismatch between the closed-world assumption (CWA) in databases and the open-world assumption (OWA) in ontologies; and (iii) the difference in semantic expressiveness, i.e., databases may model some concepts or data explicitly where they are derived logically in ontologies.

While some forms of inference on relational databases have also received regular attention in database research, that research more often focuses on correctness and efficiency (e.g., in the case of query containment [18]) than on knowledge systems, with only a few exceptions (e.g., [4]). For a more detailed comparison, please refer to [34]. Modern relational database systems also typically offer several ways to programmatically extend core database functionality with almost arbitrary business logic, including stored procedures, triggers, user-defined functions (UDFs), and others. These features enable calculations whose effect is equivalent to some forms of logical inference. For instance, a stored procedure or calculated view could list all persons based on their roles as authors (or other activities that imply the involvement of a human being), but in our experience such applications of these features are not very common in practice. We do not consider such programmatic schema extensions, as they are powerful, partially non-declarative, and partially non-standardized features that are difficult to analyze in the general case.

All three factors listed above are inherent to every relational-to-ontology mapping problem.
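To illustrate the kind of programmatic extension mentioned above, the following sqlite3 sketch defines a calculated view that lists all persons based on their roles as authors or reviewers. Its effect resembles inference from ontology axioms stating that both roles are subclasses of a person class. All table and column names are hypothetical, and, as stated above, RODI does not consider such extensions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE author (email TEXT PRIMARY KEY, name TEXT);
CREATE TABLE reviewer (email TEXT PRIMARY KEY, name TEXT);
INSERT INTO author VALUES ('ann@x.org', 'Ann'), ('bob@x.org', 'Bob');
INSERT INTO reviewer VALUES ('bob@x.org', 'Bob'), ('cy@x.org', 'Cy');
-- Calculated view with an effect similar to Author and Reviewer being
-- rdfs:subClassOf Person; UNION removes the duplicate entry for Bob.
CREATE VIEW person AS
  SELECT email, name FROM author
  UNION
  SELECT email, name FROM reviewer;
""")

persons = con.execute("SELECT name FROM person ORDER BY name").fetchall()
print(persons)
```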

3. Analysis of mapping approaches

Different mapping generation systems make different assumptions and implement different approaches. Thus, a benchmark needs to consider each approach appropriately.

3.1. Differences in availability and relevance of input

Different input may be available to an automatic mapping generator. In relational-to-ontology data integration, the main difference in available input concerns the target ontology. The ontology could be specified entirely and in detail, or it could still be incomplete (or even missing) when mapping construction starts. Other differences concern the availability of data or of a query workload.

The case where both the relational database schema and the ontology are completely available can be motivated by different situations. For example, a company may wish to integrate a relational data source into an existing, mature Semantic Web application. In this case, the target ontology would already be well defined and populated with some A-Box data. In addition, a SPARQL query workload might be known and available as additional input to a mapping generator.

On the other hand, relational-to-ontology data integration might be motivated by a large-scale industrial data integration scenario (e.g., [19,26]). In such a scenario, the task at hand is to make complex and confusing schemata easier to understand for experts who write specialized queries. In this case, no real ontology is given at the beginning; at best, there might be an initial, incomplete vocabulary.

Essentially, the different scenarios can all be distinguished by the following question: which information is available as input, besides the relational database? We always assume that the relational source database is completely accessible (both schema and data), as this is a fundamental requirement without which relational-to-ontology data integration applications cannot reasonably be motivated. Besides the availability of input for mapping generation, there could be additional knowledge about which parts of the input are even relevant. For instance, it may be clear that only parts of the ontology that are being used by a certain query workload need to be mapped. If so, this information could also be leveraged by the mapping generation system (e.g., by analyzing the query workload).
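If a query workload is available, a mapping generator could use it to restrict attention to the ontology terms that the queries actually use. The following Python sketch approximates this by collecting prefixed names (e.g., conf:Paper) from SPARQL query strings with a regular expression. This is a deliberate simplification for illustration: it ignores full IRIs in angle brackets, does not expand prefixes, and covers only part of the SPARQL prefixed-name grammar; a real system would rather use a proper SPARQL parser. The conf: namespace and the queries are hypothetical.

```python
import re

# Prefixed names such as conf:Paper or :hasAuthor. The lookbehind keeps the
# pattern from matching inside longer tokens or inside <...> IRIs.
PREFIXED_NAME = re.compile(r"(?<![\w<])([A-Za-z][\w-]*)?:([A-Za-z_][\w-]*)")


def workload_vocabulary(queries: list[str]) -> set[str]:
    """Collect the prefixed ontology terms used by a SPARQL query workload."""
    terms = set()
    for query in queries:
        for prefix, local in PREFIXED_NAME.findall(query):
            terms.add(f"{prefix}:{local}")
    return terms


workload = [
    "PREFIX conf: <http://example.org/conf#>\n"
    "SELECT ?p ?a WHERE { ?p a conf:Paper ; conf:hasAuthor ?a . }",
    "PREFIX conf: <http://example.org/conf#>\n"
    "SELECT ?n WHERE { ?x conf:name ?n }",
]
print(sorted(workload_vocabulary(workload)))
```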

It should be noted that other motivations to work with relational-to-ontology mappings exist as well: for instance, a database schema might be developed or generated to serve as a storage engine for an existing ontology (e.g., [21]). We do not consider these cases, but rather think of them as the inverse of relational-to-ontology mapping generation. Similarly, we consider questions of mapping evolution a related but separate problem.

For the RODI benchmark design we consider different forms of input. In particular we vary the input database, ontology, data and queries.

3.2. Differences in the mapping process

Further differences arise from the process by which mapping generation is approached, which can be either fully automatic or semi-automatic. Truly semi-automatic approaches are usually iterative [15], as they consist of a sequence of mapping generation steps that are interrupted to allow for human feedback, corrections, or other input. Their process is driven by the human perspective rather than by an automatic component. Since we want to better adapt our benchmark to semi-automatic approaches, we first discuss the different ways that are known for the semi-automatic case.

Fig. 2.

Overview of RODI benchmark scenarios.

Heyvaert et al. [20] have recently identified four different approaches to manual relational-to-ontology mapping creation. Each of these approaches implies a different interaction paradigm between the system and the user and thus solicits different forms of human input: users can edit mappings based on either the source or the target definitions, they can drive the process by providing result examples, or they could theoretically even edit mappings in an abstract fashion, irrespective of either source or target. Some of us have also previously identified two core user perspectives on mapping generation [38], which are also discussed in [20]. Moreover, while some approaches consider manual corrections only at the end of the mapping process, more thoroughly semi-automatic approaches allow or even require such input during the process.

In terms of their evaluation, iterative approaches of this kind must be characterized by two additional properties: first, whether iterative human input is mandatory or optional; second, whether the input is only used to improve the mapping as such, or whether the system also exploits it as feedback for its next automated iteration. Systems that solicit input only optionally and do not use it as feedback can be evaluated like non-iterative systems on a fully automatic baseline without limitations. Systems with only optional input that do learn from feedback (if provided) can still be evaluated on the same baseline, but may not demonstrate their full potential. Where input is mandatory, systems need to be either steered by an actual human user or at least supplied with simulated human input produced by an oracle.
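The case distinction above can be summarized as a small decision function. The following Python sketch merely restates the text; the enum and function names are hypothetical.

```python
from enum import Enum


class Setting(Enum):
    """Evaluation settings distinguished in the text (names are hypothetical)."""
    BASELINE = "fully automatic baseline, without limitations"
    BASELINE_MAY_UNDERESTIMATE = "fully automatic baseline, may not show full potential"
    USER_OR_ORACLE = "requires a human user or simulated input from an oracle"


def evaluation_setting(input_mandatory: bool, learns_from_feedback: bool) -> Setting:
    """Map the two characteristics of an iterative system to an evaluation setting."""
    if input_mandatory:
        return Setting.USER_OR_ORACLE
    if learns_from_feedback:
        return Setting.BASELINE_MAY_UNDERESTIMATE
    return Setting.BASELINE


print(evaluation_setting(input_mandatory=False, learns_from_feedback=True).value)
```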

Next, the kind of human input that a system can process makes a difference for evaluation settings. Most semi-automatic systems either provide suggestions that users can confirm or delete, or they allow users to manually adjust the mapping. An alternative approach is mapping by example, where users provide expected results. In addition, however, some systems may require complex or indirect interactions, or simply resort to more unusual forms of input that cannot easily be foreseen.

Each mapping generation system is usually tied to one specific approach and does not allow for much freedom.

We therefore decided that an end-to-end evaluation that allows the use of different types of input is best. Since semi-automatic approaches are becoming more and more relevant, we decided to support them using an automated oracle that simulates user input where possible.
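A simulated oracle can be as simple as a component that answers yes/no questions about suggested correspondences by consulting a reference alignment. The following Python sketch shows one feedback round under that assumption. The class and function names, the correspondence format (pairs of source and target identifiers), and counting questions as a proxy for user effort are illustrative assumptions; they do not describe RODI's actual oracle implementation.

```python
from dataclasses import dataclass


@dataclass
class SimulatedOracle:
    """Stands in for a user who confirms or rejects suggested correspondences.

    `reference` is a hypothetical gold-standard set of (source, target) pairs.
    """
    reference: frozenset
    questions_asked: int = 0

    def confirm(self, correspondence: tuple[str, str]) -> bool:
        self.questions_asked += 1  # rough proxy for user effort
        return correspondence in self.reference


def feedback_round(suggestions, oracle: SimulatedOracle):
    """One iteration: keep only the suggestions that the oracle confirms."""
    return [c for c in suggestions if oracle.confirm(c)]


oracle = SimulatedOracle(frozenset({("paper.title", "conf:title"),
                                    ("person.name", "conf:name")}))
suggested = [("paper.title", "conf:title"),
             ("paper.title", "conf:name"),
             ("person.name", "conf:name")]
accepted = feedback_round(suggested, oracle)
print(accepted, oracle.questions_asked)
```

A system that learns from feedback would feed the accepted and rejected suggestions into its next automated iteration; this sketch only filters.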

4. RODI benchmark suite

In the following, we present the details of our RODI benchmark: we first give an overview, then we discuss the data sets (relational schemata and ontologies) that can be used, as well as the queries. Finally, we present our scoring function to evaluate the benchmark results.

4.1. Overview

Figure 2 gives an overview of the scenarios used in our benchmark. The benchmark ships with data sets from three different application domains: conferences, geodata, and the oil & gas exploration domain. In its basic mode of operation, the benchmark provides one or more target ontologies for each of those domains (T-Box only) together with relational source databases for each ontology (schema and data). For some of the ontologies there are different variants of accompanying relational schemata that systematically vary some of the targeted mapping challenges.

The benchmark asks the systems to create mapping rules from the different source databases to their corresponding target ontologies. We call each such combination of a database and an ontology a benchmark scenario. For evaluation, we provide for each scenario a series of query pairs to test a range of mapping challenges as illustrated in Fig. 3. Every query pair consists of a SPARQL query (“test query”) against the ontology, and a semantically equivalent SQL query (“reference query”) against the provided SQL database. The test query runs against RDF data that results from applying the mapping rules of the matching system under consideration. The reference query is directly evaluated by RODI against the SQL database. The results are compared for each query pair and are aggregated in the light of different mapping challenges using our scoring function. For this, all query pairs are tagged with categories, relating them to different mapping challenges.

Fig. 3.

Query pair evaluation.
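The comparison of a single query pair can be sketched in Python as follows. The reference SQL query runs against a hypothetical sqlite3 database, and a hard-coded list of tuples stands in for the result of the SPARQL test query over the mapped RDF data. Results are compared as multisets and summarized by precision and recall. This is an illustrative assumption, not RODI's scoring function, which is defined separately; a real comparison would also have to normalize values between RDF literals and SQL types.

```python
import sqlite3
from collections import Counter


def compare_results(expected, actual):
    """Precision and recall of `actual` w.r.t. `expected`, both treated as multisets of tuples.

    Conventions for empty results are our own choice: two empty results count
    as a perfect match, and an empty answer to a non-empty expectation scores 0.
    """
    exp, act = Counter(expected), Counter(actual)
    overlap = sum((exp & act).values())
    precision = overlap / sum(act.values()) if act else (0.0 if exp else 1.0)
    recall = overlap / sum(exp.values()) if exp else 1.0
    return precision, recall


# Reference query, evaluated directly on the (hypothetical) source database.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE paper (id INTEGER PRIMARY KEY, title TEXT);
INSERT INTO paper VALUES (1, 'A'), (2, 'B'), (3, 'C');
""")
expected = con.execute("SELECT title FROM paper").fetchall()

# Stand-in for the rows returned by the SPARQL test query over the mapped RDF data.
actual = [("A",), ("B",), ("X",)]
print(compare_results(expected, actual))
```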

While challenges that result from different naming or from semantic heterogeneity are mostly covered by complete scenarios, we target structural challenges at a more fine-grained level, with individual query tests that receive a dedicated score. To this end, we add a corresponding category tag to query tests that address certain challenges. Each structural challenge listed in Table 1 is targeted in one or more scenarios.
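Once every query test carries one or more category tags, per-challenge scores can be obtained by grouping test scores by tag. The following Python sketch averages the scores per tag; the input format, the tag names, and the use of a plain arithmetic mean are assumptions for illustration, not a description of how RODI aggregates its scores.

```python
from collections import defaultdict
from statistics import mean


def scores_by_category(results):
    """Average per-query scores for each category tag.

    `results` holds (query_id, tags, score) triples; a query with several tags
    contributes its score to each of them.
    """
    buckets = defaultdict(list)
    for _query_id, tags, score in results:
        for tag in tags:
            buckets[tag].append(score)
    return {tag: mean(scores) for tag, scores in buckets.items()}


results = [
    ("q1", {"n:1 class match"}, 1.0),
    ("q2", {"n:1 class match", "missing key"}, 0.0),
    ("q3", {"missing key"}, 0.5),
]
print(scores_by_category(results))
```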

Multi-source integration can be tested as a sequence of different scenarios that share the same target ontology. Some challenges specific to multi-source integration, especially conflicts introduced by different sources, may not become visible in sequential tests; nevertheless, this setup covers a wide range of multi-source mapping challenges. We include specialized scenarios for such testing in the conference domain.

In order to be open for other data sets and different domains, our benchmark can be easily extended to include scenarios with real-world ontologies and databases.

While all of our default scenarios focus on schema-level matching, some real-world cases additionally require data transformations to work as expected. These comprise translations between different representations of date and time (e.g., a dedicated date type versus Epoch time stamps), simple numeric unit transformations (e.g., MB vs. GB), unit transformations requiring more complex formulae (e.g., degrees Celsius vs. Fahrenheit), string-based data cleansing (e.g., removing trailing whitespace), string compositions (e.g., concatenating a first and last name), more complex string modifications (e.g., breaking up a string based on a learned regular expression), table-based name translations (e.g., replacing names using a thesaurus), noise removal (e.g., ignoring erroneous tuples), etc. Our extension mechanism (see Section 4.5) is suited to adding dedicated scenarios for testing such conversions. However, we excluded them from our default benchmark for a purely practical reason: to the best of our knowledge, no current relational-to-ontology mapping generation system implements any such transformation functionality, so there is little practical use in benchmarking it.
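To make some of the conversions listed above concrete, the following minimal Python sketch implements a few of them as plain functions. The function names are hypothetical and are not part of RODI or of any mapping system; the use of decimal units (1 GB = 1,000 MB) is an assumption.

```python
from datetime import datetime, timezone


# Hypothetical value-level conversions of the kind listed above;
# none of them is part of RODI's default scenarios.

def epoch_to_xsd_date(ts: int) -> str:
    """Epoch time stamp (seconds, UTC) -> xsd:date lexical form."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).date().isoformat()


def mb_to_gb(mb: float) -> float:
    """Simple numeric unit transformation (assumes decimal units)."""
    return mb / 1000.0


def celsius_to_fahrenheit(c: float) -> float:
    """Unit transformation that needs a formula, not just a scale factor."""
    return c * 9.0 / 5.0 + 32.0


def strip_trailing(s: str) -> str:
    """String-based cleansing: remove trailing whitespace."""
    return s.rstrip()


def full_name(first: str, last: str) -> str:
    """String composition: concatenate first and last name."""
    return f"{first} {last}"
```

For example, `epoch_to_xsd_date(0)` yields `"1970-01-01"`, and `celsius_to_fahrenheit(100)` yields `212.0`. A mapping generator supporting such transformations would have to discover which of these functions to attach to which column, which is what makes them a separate challenge.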

Our main design objective with RODI is to provide an end-to-end testing framework that closely mimics the challenges encountered by mapping generation in the real world. Hence, it is based on different real-world scenarios, and it is extensible. As a consequence of this goal, there is always a risk of subjective imbalance introduced by the query workload of the particular application that we build on. RODI scores in each case reflect the query workload being tested and may therefore mirror some of this subjectivity. Even in scenarios where we attempt to strike a balance and vary some of the challenges, we do so by modifying only the structure of the source database, not the workload. Scores are thus designed to offer a clear indication of system performance, not to provide an unquestionable source of truth. We strongly encourage users of the benchmark to inspect the individual test results logged by the benchmark framework before discussing and interpreting the results.

In the following, we present the data sources (i.e., ontologies and relational schemata) as well as the combinations used as integration scenarios for the benchmark in more detail. RODI ships with scenarios based on data sources from three different application domains.

4.2. Conference scenarios

We chose the conference domain as our primary testing domain since (i) it is well understood and comprehensible even for non-domain experts, (ii) it is complex enough for realistic testing, and (iii) it has been successfully used as the domain of choice in other benchmarks before (e.g., [8,12,55]). While we ship several variants of scenarios for this domain, which vary in size and complexity, they are all built around a core fragment comprising 23 classes and 77 properties, with varying additions. The corresponding databases range in size from 32 tables with a total of 85 columns to 66 tables with 125 columns. Each scenario runs between 19 and 39 query tests.

4.2.1. Ontologies

The conference ontologies in this benchmark are provided by the Ontology Alignment Evaluation Initiative (OAEI) [8,12,55] and were originally developed by the OntoFarm project [52]. We selected three particular ontologies (Cmt, Sigkdd, Conference) based on a number of criteria: variation in size, the presence of cardinality information (especially, functionality of relationships), the coverage of the domain, variations in modeling style, and the expressive power of the ontology language used. The different modeling styles result from the fact that each ontology was modeled by different people based on different views of the domain, e.g., according to an existing conference management tool, expert insider knowledge, or a conference website. To cover our mapping challenges (Section 2), we selectively modified the ontologies as follows: (i) we added annotations such as labels and comments, as these can help to identify correspondences lexically and thus add interesting lexical matching challenges; (ii) we added a few datatype properties where they were scarce, as they test mapping challenges other than those posed by classes and object properties; and (iii) we fixed a total of seven inconsistencies that we discovered in Sigkdd when adding A-Box facts (e.g., each place with a zip code automatically became a sponsor, which was modeled as a subclass of person).

4.2.2. Relational schemata

We synthetically derived different relational schemata for each of the ontologies, focusing on different mapping challenges. We provide benchmark scenarios as combinations of those derived schemata either with their ontologies of origin or, for more advanced testing, with any of the other ontologies. First, for each ontology we derived a relational schema using a canonical mapping as described in [21]: the algorithm derives an entity-relationship (ER) model from an OWL ontology and then translates this ER model into a relational schema according to textbook rules (e.g., [18]). For this paper, we extended this algorithm to cover the full range of expected relational design patterns. Additionally, we extended it to consider ontology instance data in order to derive more accurate functionalities of properties (rather than only looking at the T-Box, as the previous algorithms do). Otherwise, the generated canonical relational schemata would have contained an unrealistically high number of n:m relationship tables. The canonical schemata are guaranteed to be in fourth normal form (4NF), fulfilling the normalization requirements of standard design practice. Thus, they already include various normalization artifacts as mapping challenges.
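The idea of using instance data to decide on functionality can be illustrated with a simplified sketch. It assumes A-Box facts given as (subject, property, object) triples and treats a property as functional if no subject has more than one value for it; a functional relationship then becomes a foreign-key column, any other one an n:m table. This is not the algorithm of [21] or its extension; it only illustrates the single decision described above, and all names are hypothetical.

```python
from collections import defaultdict


def is_functional(triples, prop):
    """True if no subject has more than one object for `prop` in the data."""
    objects = defaultdict(set)
    for s, p, o in triples:
        if p == prop:
            objects[s].add(o)
    return all(len(values) <= 1 for values in objects.values())


def placement(triples, prop, domain_table):
    """Where a relationship ends up in a (simplified) canonical schema."""
    if is_functional(triples, prop):
        return f"FK column {prop} in table {domain_table}"
    return f"n:m table {domain_table}_{prop}"


# Hypothetical A-Box facts.
abox = [
    ("paper1", "hasAuthor", "anna"),
    ("paper1", "hasAuthor", "ben"),
    ("paper1", "submittedTo", "conf1"),
    ("paper2", "submittedTo", "conf1"),
]
```

Here `submittedTo` becomes a foreign-key column of the paper table, while `hasAuthor` needs an n:m table. The trade-off of this design is that functionality derived from data only holds for the sample at hand.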

From the canonical schema corresponding to each of the ontologies, we created different variants by introducing different aspects on how a real-world schema may differ from the canonical one and thus to test different mapping challenges:

Adjusted Naming: As described in Section 2.1, ontology designers typically follow different naming schemes than database architects do, even when implementing the same (verbal) specification. Those differences include longer vs. shorter names, “speaking” prefixes, human-readable property IRIs vs. technical abbreviations (e.g., “hasRole” vs. “RID”), camel case vs. underscore tokenization, preferred use of singular vs. plural, and others. For each canonical schema, we automatically generated a variant with identifier names changed in this way.

Restructured Hierarchies: The structural challenge that is most critical in terms of difficulty comes from the different relational design patterns that model class hierarchies more or less implicitly. As discussed in Section 2.2, these patterns introduce significant structural dissimilarities between source and target. We automatically derive variants of all canonical schemata in which different hierarchy design patterns are used. The choice of design pattern in each case is determined algorithmically by a “best fit” approach that considers the number of specific and shared (inherited) attributes of each class. For instance, a small number of sibling classes would be split over several tables if they mostly use different properties, but they would rather be merged into a single table if they mostly use the same set of properties.
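A “best fit” decision of this kind could look roughly like the sketch below. The overlap measure (average pairwise Jaccard similarity of the siblings' property sets) and the threshold of 0.5 are our own assumptions for illustration, not the exact rule used to generate the RODI schemata.

```python
from itertools import combinations


def jaccard(a: set, b: set) -> float:
    """Share of properties two classes have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def choose_pattern(siblings: dict, threshold: float = 0.5) -> str:
    """siblings maps each sibling class to its (own + inherited) properties."""
    pairs = list(combinations(siblings.values(), 2))
    if not pairs:
        return "table per class"
    overlap = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
    # Mostly shared properties -> one table with a type discriminator column;
    # mostly specific properties -> one table per class.
    if overlap >= threshold:
        return "single table with type column"
    return "table per class"


# Hypothetical property sets.
waters = {"Lake": {"name", "area", "depth"},
          "River": {"name", "length", "source"}}
people = {"Author": {"name", "email", "affiliation"},
          "Reviewer": {"name", "email", "affiliation", "expertise"}}
```

With these inputs, `Lake` and `River` share only one of five properties (overlap 0.2) and are kept in separate tables, whereas `Author` and `Reviewer` share three of four (overlap 0.75) and are merged into one table.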

Combined Case: In the real world, both of the previous cases (i.e., adjusted naming and hierarchies) would usually apply at the same time. To find out how tools cope with such a situation, we also built scenarios where both are combined.

Removing Foreign Keys: Although it is considered bad style, databases without foreign keys are not uncommon in real-world applications. This can be the result of lazy design or of legacy applications (e.g., the popular open-source DBMS MySQL introduced plugin-free support for foreign keys only half a decade ago). The mapping challenge is that mapping tools must find the join paths that connect tables of different entities. Additionally, they sometimes even need to guess a join path to read attributes of the same entity if its data is split over several tables as a consequence of normalization. We have therefore created one dedicated scenario to test this challenge with the Conference ontology, based on the schema variant with restructured hierarchies.
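One common heuristic for guessing join paths without declared foreign keys is to look for inclusion dependencies: a column whose non-null values are all contained in a column with unique values is a foreign-key candidate. The sketch below applies this heuristic to small in-memory tables. It is our own illustration of the challenge, not a technique prescribed by RODI, and the table and column names are hypothetical.

```python
def candidate_joins(tables):
    """tables: table name -> list of row dicts (all rows share the same keys).
    Returns (child column, parent column) pairs that satisfy an inclusion
    dependency towards a column with unique values."""
    columns = {}
    for table, rows in tables.items():
        for col in (rows[0] if rows else {}):
            vals = [r[col] for r in rows if r[col] is not None]
            columns[(table, col)] = (set(vals), len(set(vals)) == len(vals))
    guesses = []
    for (ct, cc), (child_vals, _) in columns.items():
        for (pt, pc), (parent_vals, unique) in columns.items():
            if ct != pt and unique and child_vals and child_vals <= parent_vals:
                guesses.append((f"{ct}.{cc}", f"{pt}.{pc}"))
    return guesses


# Hypothetical tables without declared foreign keys.
tables = {
    "person": [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bo"}],
    "paper": [{"pid": 10, "author": 1}, {"pid": 11, "author": 2},
              {"pid": 12, "author": 1}],
}
```

On this data, the only guess is that `paper.author` references `person.id`. Such heuristics can also produce false positives, e.g., between unrelated integer ID columns whose value ranges happen to overlap, which is part of what makes this scenario hard.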

Partial Denormalization: In many cases, schemata are partially denormalized to optimize for a certain read-mostly workload. Denormalization essentially means that correlated (yet separate) information is jointly stored in the same table and is partially redundant. We provide one such scenario for the Cmt ontology. As denormalization requires conscious design choices, this schema is the only one that we had to hand-craft. It is based on the variant with restructured hierarchies.

4.2.3. Integration scenarios

For each of our three main ontologies, Cmt, Conference, and Sigkdd, the benchmark includes five scenarios (including basic and cross-matching scenarios), each with a different variant of the database schema (discussed before). Table 2 lists the different (basic) scenario variants.

Table 2
Basic scenario variants (non-default scenarios are put in parentheses)

	Cmt	Conference	Sigkdd
Canonical	(✓)	(✓)	(✓)
Adjusted Naming	✓	✓	✓
Restructured Hierarchies	✓	✓	✓
Combined Case	(✓)	(✓)	✓
Missing FKs	-	✓	-
Denormalized	✓	-	-

As discussed before, Canonical closely mimics the structure of the original ontology, but the schemata are normalized, and thus the scenario contains the challenge of normalization artifacts. Adjusted Naming adds the naming conflicts discussed before. Restructured Hierarchies tests the critical structural challenge of different relational patterns to model class hierarchies, which, among others, subsumes the challenge of correctly building n:1 mappings between classes and tables. In the Combined Case, naming conflicts and restructured hierarchies are employed together and their effects are tested in combination; this is a more advanced test case. A special challenge arises from databases with no (or few) foreign key constraints (Missing FKs). In such a scenario, mapping tools must guess the join paths that connect tables corresponding to different entity types. The technical mapping challenge arising from Denormalized schemata consists in identifying the correct partial key for each of the correlated entities and in identifying which attributes and relations belong to which of the types.

To keep the number of scenarios in the default setup small, we differentiate between default and non-default scenarios. We excluded scenarios with the most trivial schema versions. In addition, we limited the number of combinations for the most complex schema versions by including only one of each type as a default scenario. While the default scenarios are mandatory to cover all mapping challenges, the non-default scenarios are optional (i.e., users may decide to run them to gain additional insights) and are not executed in a default run of the benchmark. Non-default scenarios are put in parentheses in Table 2.

We also include cross-matching scenarios that require mapping a schema to one of the other ontologies (e.g., mapping a Cmt database schema variant to the Sigkdd ontology). They represent more advanced data integration scenarios and belong to the default scenarios.

4.2.4. Data

We provide data to fill both the databases and the ontologies. The conference ontologies are originally provided as T-Boxes only, i.e., without an A-Box. We first generate data as A-Box facts for the different ontologies and then transform them into the corresponding relational data using the same process as for translating the T-Box. For the technical process of evaluating generated mappings, data is only needed in the relational databases, so generating ontology A-Boxes would not be necessary for this purpose alone. However, this procedure simplifies data generation, since all databases can be automatically derived from the given ontologies as described before. Our conference data generator deterministically produces a scalable amount of synthetic facts around key concepts in the ontologies, such as conferences, papers, authors, reviewers, and others. In total, we generate data for 23 classes, 66 object properties (including inverse properties), and 11 datatype properties (some of which apply to several classes). However, not all of those concepts and properties are supported by every ontology. For each ontology, we only generate facts for the subset of classes and properties that have an equivalent in the relational schema in question.

4.2.5. Queries

For the conference scenarios, all scenarios draw on the same pool of 56 query pairs, accordingly translated for each ontology and schema, with each scenario supporting a different subset. The benchmark queries on the conference scenarios are rather simple, using either only one concept and a property of it, or one relationship and two concepts with one property each. The same query may face different challenges in different scenarios, e.g., a simple 1:1 mapping between a class and table in a canonical scenario can turn into a complicated n:1 mapping problem in a scenario with restructured hierarchies.

Query pairs are grouped into three basic categories to test the correct mapping of class instances, instantiations of datatype properties and object properties, respectively. Additional categories relate queries to n:1 and n:m mapping problems or prolonged property join paths resulting from normalization artifacts. A specific category exists for the de-normalization challenge.

4.3. Geodata domain – mondial scenarios

As a second application domain, RODI ships scenarios in the domain of geographical data.

The Mondial database [32] is a manually curated database containing information about countries, cities, organizations, and geographic features such as waters (with subclasses lakes, rivers, and seas), mountains, and islands. It has been designed as a medium-sized case study for several scientific aspects and data models.³

³ http://www.dbis.informatik.uni-goettingen.de/Mondial/.

The Mondial database contains 42 tables and a total of 160 columns, while its ontology comprises around 30 classes and 50 properties. They are tested by 50 query pairs.

Based on Mondial, we have developed a number of benchmark scenarios, which combine the Mondial OWL ontology with a series of different relational schemata. The OWL ontology is quite sophisticated, using many OWL constructs, and providing many potential challenges, e.g.:

properties whose domain is neither a single class nor some kind of top class (as is usual for the name property) but a union of several classes (e.g., area, which is a property of countries, provinces, lakes, islands etc.).

multiple properties between the same domain and range, distinguishable by the cardinality: Country.capital is functional with cities as range, while Country.hasCity is 1:n.

properties that are functional on some subdomain, and n:m on another subdomain (e.g., locatedOnIsland which is functional for mountains and n:m for cities).

properties that have a named union class as range: the range of City.locatedAt is waters, with rivers, lakes and seas as subclasses.

For the Mondial scenarios, we use a query workload that mainly approximates real-world explorative queries on the data, although limited to queries of low or medium complexity. The queries typically combine more than one concept, or several attributes, or several relationships with a common class. The degree of difficulty in the Mondial scenarios is therefore generally higher than in the conference domain scenarios.

There is only a single default scenario, which is based on the original relational Mondial database (42 tables, 160 columns, 60 foreign keys). It features a wide range of relational modeling patterns, and it also differs from the canonical relational schema in some well-chosen aspects. With these changed aspects, it mimics a “real-life” (legacy) relational database schema:

classical relational keys/foreign keys built upon literal-valued attributes. Most of them are unary (usually, the name attribute), but e.g. City(name, country, province) is a weak entity type whose key consists of multiple components.

E.g., the above-mentioned n:m property City.locatedAt cannot be stored in a single n:m table, since the foreign keys of the range would have to reference several tables for the different subclasses of waters. Therefore, a different, even slightly denormalized modeling locatedAt(city, province, country, river, lake, sea) has been chosen (where a city located at two rivers and one sea requires two tuples, and the relation has no key since each of the water-type references may be null in some tuples).
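To see why this modeling is a mapping challenge, consider how the city-water pairs have to be read back: each of the three nullable columns contributes to the same ontology property, so a correct mapping effectively needs a UNION over them. The following sketch uses Python's built-in sqlite3 module with hypothetical placeholder rows (not actual Mondial data) for a city located at two rivers and one sea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE locatedAt (
    city TEXT, province TEXT, country TEXT,
    river TEXT, lake TEXT, sea TEXT)""")
# Hypothetical placeholder data: two tuples for one city at two rivers and one sea.
conn.executemany(
    "INSERT INTO locatedAt VALUES (?, ?, ?, ?, ?, ?)",
    [("CityA", "ProvA", "CtryA", "RiverA", None, "SeaA"),
     ("CityA", "ProvA", "CtryA", "RiverB", None, "SeaA")],
)

# One branch per water-type column; UNION (not UNION ALL) also removes the
# duplicate (CityA, SeaA) pair that appears in both tuples.
LOCATED_AT_PAIRS = """
SELECT city, province, country, river FROM locatedAt WHERE river IS NOT NULL
UNION
SELECT city, province, country, lake FROM locatedAt WHERE lake IS NOT NULL
UNION
SELECT city, province, country, sea FROM locatedAt WHERE sea IS NOT NULL
"""
pairs = sorted(conn.execute(LOCATED_AT_PAIRS).fetchall())
```

The result contains exactly three city-water pairs (two rivers and one sea), which is what a SPARQL query over City.locatedAt should return after a correct mapping.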

In addition, we have designed a systematic series of further scenarios with synthetically modified variants of the canonical relational schema. To keep the number of tested scenarios manageable, we do not consider those additional synthetic variants part of the default benchmark. Instead, we recommend them as optional tests to dig deeper into specific patterns. These scenarios are similar to the different variants produced in the conference domain, with the additional feature that the database schema and the queries are explicitly designed to test the following crucial elements:

for all scenarios based on the canonical relational schema, all keys/foreign keys are the IRIs;

different modeling variants of the class hierarchy wrt. Water/River/Lake/Sea and Mountain/Volcano (cf. Section 2.2);

a variant where all classes mentioned in the ontology, including typically abstract ones like PoliticalBody, have their own table, each functional property is stored with the most abstract class, and foreign keys also reference the most abstract class;

different modeling variants of functional properties whose domain is a union of classes like area, population and capital (ranging over cities);

different modeling variants of City.locatedOnIsland (1:n) and Mountain.locatedOnIsland (n:m);

case-sensitive vs. case-insensitive;

a variant where all properties are n:m.

The challenges for the matching systems are not only to produce the appropriate mapping, but also to generate appropriate rules incorporating join paths, which is checked by the design of the benchmark queries. With these scenarios, the behavior of a system can be checked systematically, and even further database variants can easily be designed.

A typical (but already high-end) query is, e.g., one where information from the City table (functional: name) must be combined with the n:m property City.locatedOnIsland; for Mountain, both properties are functional and might be found in the Mountain table, and a lookup of the Island's name must be made.

4.4. Oil & gas domain – NPD FactPages scenarios

Finally, we include an example of an actual real-world database and ontology, in the oil and gas domain: The Norwegian Petroleum Directorate (NPD) FactPages [49]. Our test set contains a small relational database with a relatively complex structure (70 tables, ≈1,000 columns and ≈100 foreign keys), and an ontology covering the domain of the database. The database is constructed from a publicly available dataset containing reference data about past and ongoing activities in the Norwegian petroleum industry, such as oil and gas production and exploration. The corresponding ontology contains ≈300 classes and ≈350 properties.

With this pair of a database and an ontology, we have constructed two scenarios that feature different series of tests on the data. First, there are queries built from information needs collected from real users of the FactPages, which cover large parts of the dataset. Those queries are highly complex compared to the ones in other scenarios (by query complexity, we refer to the number of different schema elements that need to be accessed). Each of them thus requires a significant number of schema elements to be correctly mapped at the same time to return any results. The principal benefit of such queries is that they represent actual real-world information needs much more accurately than any simplified query workload does. A realistic workload can thus be seen as a measure of real-world utility, as opposed to simplified queries, which artificially set the bar far too low and may thus convey a false impression of actual utility. Even if today's mapping generation systems perform poorly on such queries, they are the most relevant tests to pass eventually. We have collected 17 such queries in the scenario npd_user_tests. In addition, we have generated a large number of small, atomic query tests for baseline testing. These are similar to the ones used in the conference domain, i.e., they test whether individual classes or properties are correctly mapped. A total of 439 such queries have been compiled in the scenario npd_atomic_tests to cover all of the non-empty fields in our sample database.

A specific feature resulting from the structure of the FactPages database and ontology is a high number of 1:n matches, i.e., concepts or properties in the ontology that require a UNION over several relations to return complete results. 1:n matches as a structural feature can therefore best be tested in the npd_atomic_tests scenario.

4.5. Extension scenarios

Our benchmark suite is designed to be extensible, i.e., additional scenarios can be easily added. The primary aim of supporting such extensions is to allow domain-specific, real-world mapping challenges to be tested alongside our default scenarios. Users of our benchmark can add extension scenarios without any programming effort. The creation and addition of scenarios is described in the user documentation of the RODI benchmark suite [42].

4.6. Challenge coverage and query category tags

Table 3
Category tags used in different default scenarios, with brief description and relevant challenges where applicable. Note that some challenges require a check on a combination of tags or are checked at the scenario level, not at the query level

Tag ID	Meaning	Default Scenarios	Relevant Challenges
class	Matching class	All	–
attrib	Matching datatype property	All	–
link	Matching object property	All	–
1-1	1:1 match	All conference	–
n-1	n:1 match	All restructured conference	6, 9, 10, 12
1-n	1:n match	NPD, some conference	8, 15, 16
union-n	Requires n UNIONs to build match (form of 1:n match with explicit n)	NPD	8, 15, 16
superclass	Test on entities that should comprise sub class entities (form of 1:n)	All conference adjusted naming	8
in-table	Datatype match that finds the data value in the same table that also defines the related entity	All conference	7
other-table	Datatype match that finds the data value in a table other than the one that defines the related entity (requires joins)	All conference	2, 15
path-0	Object property match that finds both related entities in the same table (1:1 or denormalized)	Some conference	7
path-1, path-2	Object property match that finds related entities in two tables that can be directly joined (single JOIN) or joined through one intermediate table (two JOINs). Note: NPD uses join-1, join-2 to denote the same aspect	All conference, NPD	3, 4, 16
path-n	Object property match that requires n JOINs ( $n > 2$ ) to connect the tables that define entities on both sides. Note: NPD uses join-n to denote the same aspect	Some conference, NPD	5
path-X	Additional tag for all queries that are tagged path-n, with any $n > 1$ (denotes multi-hop JOIN of any length)	All conference	4, 5, 17
denorm	Type filtering required due to denormalization	CMT denormalized	6, 7, 12
no-fk	JOINs without leading foreign keys	Conference missing FKs	14

The scenarios included in RODI add up to jointly cover all mapping challenges identified and discussed in the previous sections.

On a per-scenario level, they represent examples from three different application domains, and they have different degrees of complexity. This concerns schema size: NPD is the largest, Mondial is in between, and the conference scenarios range from small to medium. Different degrees of query workload complexity are included as well: query workloads are modestly complex in the conference scenarios and NPD atomic tests, more demanding in Mondial, and most complex in the NPD user tests. Moreover, scenarios exhibit different degrees of semantic heterogeneity. There is rather little semantic heterogeneity between each conference ontology and its directly corresponding database schemata. Heterogeneity is higher for Mondial and NPD. The highest degree of semantic heterogeneity, however, is reached in the conference cross-matching scenarios, where we match an ontology to a database schema derived from a different ontology of the same domain.

On a more fine-grained level, several challenges are tested through a subset of queries. We tag query tests by categories and report separate scores not only for each scenario, but also for each category in each scenario.

Table 3 lists all tags used in our default scenarios, a brief description of their purpose, and the scenarios that include them. Not all category tags correspond to a particular challenge; some serve to allow drill-downs into basic aspects of the schema, e.g., to report separately on class matches and property matches. Where a tag does correspond to a challenge, we use it to report on that challenge in the query tests.

4.7. Evaluation criteria – scoring function

It is our aim to measure the practical usefulness of mappings. We are therefore interested in the utility of query results, rather than in comparing mappings directly to a reference mapping set or in measuring precision and recall on all elements of the schemata. This is important because a number of different mappings might effectively produce the same data w.r.t. a specific input database. Also, the mere number of facts is no indicator of their semantic importance for answering queries (e.g., the total number of conference venues is much smaller than the number of all paper submission dates, yet the venues are just as important in a query retrieving information about any of these papers). In addition, in many cases only a subset of the information is relevant in practice, and we define our queries on a meaningful subset of information needs.

We therefore observe a score that reflects the utility of the mappings with relation to our query tests as our main measure. Intuitively, this score reports the percentage of successful queries for each scenario.

However, in a number of cases, queries may return correct but incomplete results, or a mix of correct and incorrect results. In these cases, we measure per-query accuracy by means of a local per-query F-measure. Technically, our reported overall score for each scenario is the average of the F-measures of all query tests, rather than a simple percentage of successful queries. To calculate these per-query F-measures, we also need to consider query results that contain IRIs.

Different mapping generators will typically generate different IRIs for the same entities represented in the relational database, e.g., by choosing different prefixes. F-measures for query results containing IRIs are therefore calculated w.r.t. the degree to which they satisfy structural equivalence with a reference result. For practical reasons, we use query results on the original, underlying SQL databases as technical reference during evaluation. Structural equivalence effectively means that if same-as links were established appropriately, then both results would be semantically identical.

Formally, structural equivalence and fitting measures for precision and recall are defined as follows:

Definition 1 (Structural tuple set equivalence).

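Let $V = IRI \cup Lit \cup Blank$ be the set of all IRIs, literals and blank nodes, and let $T = V \times \dots \times V$ be the set of all n-tuples over $V$. Then two tuple sets $t_{1}, t_{2} \in \mathcal{P}(T)$ are structurally equivalent if there is an isomorphism $\phi : (IRI \cap t_{1}) \to (IRI \cap t_{2})$, i.e., a bijection between the IRIs occurring in $t_{1}$ and those occurring in $t_{2}$ such that replacing every IRI $i$ in $t_{1}$ by $\phi(i)$ yields $t_{2}$.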

For instance, {(urn:p-1, ‘John Doe’)} and {(http://my#john, ‘John Doe’)} are structurally equivalent. On this basis, we can easily define the equivalence of query results w.r.t. a mapping target ontology:

Definition 2 (Tuple set equivalence w.r.t. ontology).

Let $O$ be a target ontology of a mapping, $I \subset IRI$ the set of IRIs used in $O$, and $t_{1}, t_{2} \in \mathcal{P}(T)$ result sets of queries $q_{1}$ and $q_{2}$ evaluated on a superset of $O$ (i.e., over $O$ plus A-Box facts added by a mapping).

Then $t_{1} \sim_{O} t_{2}$ ($t_{1}$ and $t_{2}$ are structurally equivalent w.r.t. $O$) iff $t_{1}$ and $t_{2}$ are structurally equivalent and $\forall i \in I : \phi(i) = i$.
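Definitions 1 and 2 can be checked mechanically on small result sets. The brute-force Python sketch below is our own illustration, not RODI's implementation: it marks IRIs with a hypothetical `IRI` string subclass, tries every bijection between the IRIs of both tuple sets, and uses a `fixed` set to model the IRIs $I$ defined in the target ontology, which may only correspond to themselves. Its running time is factorial in the number of IRIs, so it is only meant for toy inputs.

```python
from itertools import permutations


class IRI(str):
    """Marks a value as an IRI; plain str values are treated as literals."""


def iris(tuples):
    """Collect all IRIs occurring anywhere in a set of tuples."""
    return {v for t in tuples for v in t if isinstance(v, IRI)}


def structurally_equivalent(t1, t2, fixed=frozenset()):
    """Search for a bijection phi between the IRIs of t1 and t2 such that
    renaming t1 by phi yields exactly t2 (Definition 1). IRIs in `fixed`
    (the ontology IRIs I of Definition 2) may only correspond to themselves."""
    a, b = sorted(iris(t1)), sorted(iris(t2))
    if len(a) != len(b):
        return False
    target = set(t2)
    for image in permutations(b):
        phi = dict(zip(a, image))
        # An ontology IRI must map to itself and cannot be the image of
        # any other IRI.
        if any(phi[x] != x and (x in fixed or phi[x] in fixed) for x in a):
            continue
        renamed = {tuple(phi[v] if isinstance(v, IRI) else v for v in t)
                   for t in t1}
        if renamed == target:
            return True
    return False


t1 = {(IRI("urn:p-1"), "John Doe")}
t2 = {(IRI("http://my#john"), "John Doe")}
```

With these inputs, `structurally_equivalent(t1, t2)` holds, while `structurally_equivalent(t1, t2, fixed={"http://my#john"})` does not, matching the example discussed in the text.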

With the same example as before, the two tuple sets are structurally equivalent w.r.t. $O$ iff http://my#john is not already defined in the target ontology. Finally, we can define precision and recall:

Definition 3 (Precision and recall).

Let $t_{r} \in \mathcal{P}(T)$ be a reference tuple set, $t_{t} \in \mathcal{P}(T)$ a test tuple set, and $t_{rsub}, t_{tsub} \in \mathcal{P}(T)$ maximal subsets of $t_{r}$ and $t_{t}$, respectively, s.t. $t_{rsub} \sim_{O} t_{tsub}$.

Then the precision of the test set $t_{t}$ is $P = \frac{| t_{tsub} |}{| t_{t} |}$ and recall is $R = \frac{| t_{rsub} |}{| t_{r} |}$ .

Table 4 shows an example with a query test that asks for the names of all authors. Result set A is structurally equivalent to the reference result set, i.e., it has found all authors and did not return anything else, so both precision and recall are 1.0. Result set B is equivalent to only a subset of the reference result (e.g., it did not include those authors who are also reviewers). Here, precision is still 1.0, but recall is only 0.5. In the case of result set C, all expected authors are included, but so is another person, James. Here, precision is 0.66 but recall is 1.0.

Table 4
Example results from a query pair asking for author names, e.g., SQL: SELECT name FROM persons WHERE person_type=2; SPARQL: SELECT ?name WHERE {?p a :Author; foaf:name ?name}
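For result sets that contain only literals, as in the author-name query of Table 4, the maximal structurally equivalent subsets of Definition 3 reduce to the intersection of the two sets. The sketch below reproduces the precision and recall values discussed for result sets A, B and C. The author names Alice and Bob are hypothetical placeholders (only James appears in the text), and returning 0.0 for an empty set is our own assumption.

```python
def precision_recall(reference: set, test: set) -> tuple:
    """Definition 3 for literal-only result sets.

    Without IRIs, structural equivalence is plain equality, so the maximal
    equivalent subsets of both sets are their intersection.
    """
    common = reference & test
    precision = len(common) / len(test) if test else 0.0    # assumption for empty sets
    recall = len(common) / len(reference) if reference else 0.0
    return precision, recall

# Hypothetical author names; only "James" is taken from the example.
reference = {("Alice",), ("Bob",)}
results = {
    "A": {("Alice",), ("Bob",)},               # exactly the expected authors
    "B": {("Alice",)},                         # misses one author
    "C": {("Alice",), ("Bob",), ("James",)},   # one extra person
}
for name, result in results.items():
    p, r = precision_recall(reference, result)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

For C, the sketch prints a precision of 0.67, the rounded value of 2/3, which the text gives truncated as 0.66.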

To aggregate results of individual query pairs, a scoring function averages the per-query numbers for each scenario and for each challenge category. For instance, we calculate the average over all queries testing 1:n mappings. Thus, for each scenario there is a number of scores that rate performance on different technical challenges. Also, the benchmark can log detailed per-query output for debugging purposes.
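A minimal sketch of such a scoring function, assuming that per-query scores are already available and that a single query may count towards several challenge categories. The query ids, category labels and scores are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def aggregate(per_query: dict) -> tuple:
    """Return the scenario average and the per-category averages.

    per_query maps a query id to (categories, score); a query may count
    towards several challenge categories.
    """
    by_category = defaultdict(list)
    for categories, score in per_query.values():
        for category in categories:
            by_category[category].append(score)
    scenario = mean(score for _, score in per_query.values())
    return scenario, {c: mean(s) for c, s in by_category.items()}

# Hypothetical query ids, categories and per-query scores.
per_query = {
    "q1": ({"1:1", "class"}, 1.0),
    "q2": ({"1:n", "class"}, 0.5),
    "q3": ({"1:n", "data property"}, 0.0),
}
scenario, categories = aggregate(per_query)
print(scenario, categories)
```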

Fig. 4.

RODI framework architecture.

4.8. Mapping system requirements

With RODI, we can test mapping generators that work in either one or two stages: either they map data directly from the relational source database to the target ontology in one stage (e.g., [2,40]), or they bootstrap their own ontology, which they use as an intermediate mapping target. For such two-stage systems, to obtain the full end-to-end mappings that we can test, the intermediate ontology and the actual target ontology need to be integrated via ontology alignment in a second stage. Two-stage systems may either include a dedicated ontology alignment stage (e.g., [25]) or deliver only the first (intermediate) stage [6,7,17]. In the latter case, RODI can step in to fill the missing second stage with a standard ontology alignment setup [48].

Our tests check the accuracy of SPARQL query results. Queries ask for individuals of a certain type (or their aggregates), properties correlating them, associated values and combinations thereof, sometimes also using additional SPARQL language features such as filters to narrow down the result set. This means that mapped data will be deemed correct if it contains correct RDF triples for all tested cases. For entities, systems need to construct one correctly typed IRI for each entity of a certain type. For object properties, they need to construct triples that correctly relate those typed IRIs, and for datatype properties, they need to assign the correct literal values to each of the entity IRIs using the appropriate predicates. Systems therefore do not strictly need to understand or to produce any OWL axioms in the target ontology. However, our target ontologies are in OWL 2, with different degrees of expressiveness. Axioms and other language constructs in the target ontology can be important as guidance for one-stage systems to identify suitable correspondences. Similarly, if two-stage systems construct expressive axioms in their intermediate ontology, this may guide the second stage of ontology alignment. For instance, if a predicate is known to be an object property in the target ontology, results will suffer if a mapping generation tool assigns literal values using this property. Also, if a property is known to be functional, it might be a better match for an n:1 relation than a non-functional property would be.
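The following sketch illustrates the three kinds of triples that the query tests rely on: one typed IRI per entity, object property triples relating these IRIs, and datatype property triples attaching literal values. The namespaces, table layout and ontology terms are hypothetical and not taken from RODI. The IRI minting scheme is deliberately arbitrary, since evaluation relies on structural equivalence rather than on particular IRIs.

```python
# Hypothetical namespaces and ontology terms, for illustration only.
DATA = "http://example.org/data/"
ONTO = "http://example.org/onto#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

# Hypothetical source rows: persons(id, name) and papers(id, title, author_id).
PERSONS = [(10, "Alice"), (11, "Bob")]
PAPERS = [(1, "Mapping Relational Data", 10), (2, "Benchmarking", 11)]

def entity(kind: str, key: int) -> str:
    """Mint one IRI per entity; any scheme works under structural equivalence."""
    return f"<{DATA}{kind}/{key}>"

def literal(text: str) -> str:
    """Plain N-Triples string literal with backslashes and quotes escaped."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'

def map_rows() -> list:
    triples = []
    for pid, name in PERSONS:
        person = entity("person", pid)
        triples.append(f"{person} {RDF_TYPE} <{ONTO}Person> .")         # typed IRI
        triples.append(f"{person} <{ONTO}name> {literal(name)} .")      # datatype property
    for pid, title, author_id in PAPERS:
        paper = entity("paper", pid)
        triples.append(f"{paper} {RDF_TYPE} <{ONTO}Paper> .")
        triples.append(f"{paper} <{ONTO}title> {literal(title)} .")
        # Object property: relates two typed entity IRIs.
        triples.append(f"{paper} <{ONTO}hasAuthor> {entity('person', author_id)} .")
    return triples

for line in map_rows():
    print(line)
```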

5. Framework implementation

In this section, we discuss some implementation details in order to guide researchers and practitioners to include their system in our benchmarking suite.

5.1. Architecture of the benchmarking suite

Figure 4 depicts the overall architecture of our benchmarking suite. The framework requires upfront initialization per scenario. Artifacts generated or provided during initialization are depicted in blue in the figure. After initialization, a mapping tool can access the database (directly or via the framework’s API) and the target ontology (via the Sesame API⁴, using SPARQL, or serialized as an RDF file). Finally, it submits the generated R2RML⁵ mappings to a special folder on the file system, so that evaluation can be triggered. As an alternative, mapping tools could also execute mappings themselves and submit the final mapped data instead of R2RML. This would be the preferred procedure for tools that support other mapping languages but not R2RML. More generally, mapping tools that cannot comply with the assisted benchmark workflow can always trigger individual aspects of initialization or evaluation separately. For instance, they could use the framework to set up a test environment, then perform mapping generation, reasoning and mapping execution in their own workflow, and finally trigger evaluation.

⁴ http://rdf4j.org.
⁵ http://www.w3.org/TR/r2rml/.

5.2. Details on the evaluation phase

In the evaluation phase, unless a mapping system under evaluation decides to skip individual steps (i.e., to implement them independently), the benchmark suite will: (i) read submitted R2RML mappings and execute them on the database, (ii) materialize the resulting A-Box facts in a Sesame repository together with the target ontology (T-Box), (iii) optionally apply reasoning through an external OWL API [22] compatible reasoner to infer additional facts that may be requested for evaluation, (iv) evaluate all query pairs of the scenario on the repository and on the relational database, and (v) produce a detailed evaluation report. Additionally, as mentioned in Section 4.8, RODI also supports (two-stage) systems that require assistance in aligning their generated ontology with the target ontology. Information about how individual steps are invoked can be found in RODI’s user documentation [42].

We evaluate query results as described in Section 4.7 by attempting to construct an isomorphism ϕ between the query result set and the reference results. Technically, we use the results of the SQL queries from query pairs to calculate the reference result set. For each SQL query in a query pair, we flag attributes that together serve as keys, so that keys can be matched with IRIs rather than with literal values. Obviously, literal values need to be exact matches. IRIs always need to match the same unique value from the database in each matching tuple. However, the unique values used in the tuple can differ from the string components that make up an IRI. For instance, a tuple describing a person might have a synthetic integer ID column in the relational database, whereas the corresponding person IRIs are composed of a namespace prefix and each person’s first and last name, with an additional suffix added only where several different persons share identical names.

For constructing ϕ, we first index all individual IRIs (i.e., IRIs that identify instances of some class) in the query result. Next, we build a corresponding index for keys in the reference set. For both sets, we determine binding dependencies across tuples (i.e., re-occurrences of the same IRI or key in different tuples). As a next step, we narrow down match candidates to tuples where all corresponding literal values are exact matches. After this step, we have a set of tuple pairs from the two result sets that are candidates for being structurally equivalent, because all their literals are identical. Finally, we check whether these candidates are actually structurally equivalent, i.e., we also check for viable correspondences between keys (in the reference tuples) and IRIs (in the matching result tuples).

As discussed, the criterion for a viable match between a key and an IRI is that for each occurrence of this particular key and of this particular IRI in any of the tuples, both need to be matched with the same partner. In principle, the same IRIs (and keys) can appear in any number of tuples and in different positions of each tuple. A simple example of a query with repeated occurrences of the same IRI in several positions of the result tuple could be a request for papers with their authors and reviewers (where each author and reviewer may appear in many different tuples, and the same person’s IRI may appear as an author in one tuple and as a reviewer in another). All occurrences of all IRIs in any tuple need to match the same keys in all cases for a structurally equivalent overall match. Hence, we can have transitive n-hop dependencies between n tuples or even cyclic dependencies. The step of finding viable matches across all tuples therefore corresponds to identifying a maximal common subgraph (MCS) between the dependency graphs of tuples on both sides, i.e., it corresponds to the MCS-isomorphism problem.⁶

⁶ The MCS isomorphism problem is a well-studied optimization problem and is known to be NP-hard.
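The following brute-force sketch illustrates the matching step on small inputs; it is not RODI’s implementation. It assumes a hypothetical encoding in which reference tuples tag flagged key attributes as ("key", value), result tuples tag IRIs as ("iri", text), and literals are tagged ("lit", text) on both sides. It first filters candidate pairs by exact literal equality and then searches exhaustively for the largest set of tuple pairs whose key/IRI correspondences form one consistent bijection. This search is exponential in the worst case, in line with the NP-hardness of the MCS problem.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

Value = Tuple[str, str]          # ("lit", text), ("key", db_value) or ("iri", iri)
Row = Tuple[Value, ...]
Maps = Tuple[Dict[str, str], Dict[str, str]]

def compatible(ref: Row, res: Row) -> bool:
    """Literals must match exactly; every key must face an IRI."""
    if len(ref) != len(res):
        return False
    for (rk, rv), (tk, tv) in zip(ref, res):
        if rk == "lit" and (tk != "lit" or rv != tv):
            return False
        if rk == "key" and tk != "iri":
            return False
    return True

def extend(ref: Row, res: Row, k2i: Dict[str, str],
           i2k: Dict[str, str]) -> Optional[Maps]:
    """Extend the key<->IRI bijection with this tuple pair, or return None."""
    k2i, i2k = dict(k2i), dict(i2k)
    for (rk, key), (_, iri) in zip(ref, res):
        if rk != "key":
            continue
        if k2i.get(key, iri) != iri or i2k.get(iri, key) != key:
            return None  # key or IRI is already bound to a different partner
        k2i[key], i2k[iri] = iri, key
    return k2i, i2k

def max_matches(refs: List[Row], results: List[Row]) -> int:
    """Exhaustive search for the largest consistent set of matched tuple pairs."""
    best = 0

    def search(idx: int, used: FrozenSet[int], k2i, i2k, count: int) -> None:
        nonlocal best
        if count + len(refs) - idx <= best:
            return  # cannot beat the best matching found so far
        if idx == len(refs):
            best = count
            return
        for j, res in enumerate(results):
            if j in used or not compatible(refs[idx], res):
                continue
            maps = extend(refs[idx], res, k2i, i2k)
            if maps is not None:
                search(idx + 1, used | {j}, maps[0], maps[1], count + 1)
        search(idx + 1, used, k2i, i2k, count)  # leave refs[idx] unmatched

    search(0, frozenset(), {}, {}, 0)
    return best

# One author (key 7) wrote both papers, but the result uses two different IRIs.
refs = [(("key", "7"), ("lit", "Paper A")), (("key", "7"), ("lit", "Paper B"))]
res = [(("iri", "http://ex.org/alice"), ("lit", "Paper A")),
       (("iri", "http://ex.org/bob"), ("lit", "Paper B"))]
print(max_matches(refs, res))
```

With the resulting number m of matched pairs, precision and recall follow as m/|res| and m/|ref|.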

For efficiency reasons, we approximate the MCS if dependency graphs contain transitive dependencies, breaking them down to fully connected subgraphs. However, it is usually possible to formulate queries so that their results do not contain any such transitive dependencies, by avoiding inter-dependent IRIs in SPARQL SELECT results in favor of a set of significant literals describing them. In the case of the above example (authors and their papers), correct matches for literals, such as author names and paper titles, are easy to check. If, however, authors and papers were instead identified by IRIs, such checks would become considerably more difficult: each query result that relates an author IRI with a paper IRI creates a dependency between our interpretations of those IRIs w.r.t. the corresponding database IDs. For instance, if a PhD student co-authored all of their papers with their supervisor, while the supervisor had the same PhD student participate in all of their recent paper projects, then mistaking the author IDs of the two would at first look like a perfectly viable match. The mistake can be resolved only once tuples about older papers begin to appear that were co-authored by the supervisor but not by the PhD student. Still, seeing such tuples may not be a clarifying moment after all, as they may also originate from partially incorrect mappings. In that case, switching our interpretation of the ID/IRI mappings could only make things worse. To be fair to all tested systems, we must assume that their intended interpretations are the ones under which the highest possible number of tuples is considered a correct match, i.e., in our example, the interpretation based on the MCS of author/paper dependencies across all tuples. All queries shipped with this benchmark are free of transitive dependencies; hence, the algorithm is exact for all delivered scenarios.
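To illustrate what inter-dependent IRIs look like, the following sketch groups result tuples into connected components (two tuples are connected when they share an IRI, computed with a union-find structure) and reports which distinct IRIs each component binds. A component that binds more than one IRI, such as tuples pairing author IRIs with paper IRIs, is where chained dependencies like the PhD student/supervisor case can arise. A result with at most one IRI per tuple always yields single-IRI components. This is a diagnostic illustration under our own tagged encoding, not the check RODI performs.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set, Tuple

Value = Tuple[str, str]  # ("iri", text) or ("lit", text)

def iri_components(rows: Sequence[Sequence[Value]]) -> List[Set[str]]:
    """Union tuples that share an IRI; return the IRIs bound by each group."""
    parent = list(range(len(rows)))

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    owner: Dict[str, int] = {}  # first tuple in which each IRI occurred
    for idx, row in enumerate(rows):
        for kind, text in row:
            if kind != "iri":
                continue
            if text in owner:
                parent[find(idx)] = find(owner[text])
            else:
                owner[text] = idx
    groups: Dict[int, Set[str]] = defaultdict(set)
    for idx, row in enumerate(rows):
        groups[find(idx)].update(text for kind, text in row if kind == "iri")
    return list(groups.values())

def has_interdependent_iris(rows) -> bool:
    """True if any group of linked tuples binds more than one IRI."""
    return any(len(group) > 1 for group in iri_components(rows))

# Author IRIs with paper titles (literals): bindings stay independent.
rows1 = [(("iri", "ex:alice"), ("lit", "Paper A")),
         (("iri", "ex:alice"), ("lit", "Paper B")),
         (("iri", "ex:bob"), ("lit", "Paper C"))]
# Author IRIs with paper IRIs: one chain links all four IRIs.
rows2 = [(("iri", "ex:alice"), ("iri", "ex:paperA")),
         (("iri", "ex:bob"), ("iri", "ex:paperA")),
         (("iri", "ex:bob"), ("iri", "ex:paperB"))]
print(has_interdependent_iris(rows1), has_interdependent_iris(rows2))
```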

Finally, we count tuples that could not be matched in the result and reference set, respectively. Precision is then calculated as $\frac{| res | - | unmatched (res) |}{| res |}$ and recall as $\frac{| ref | - | unmatched (ref) |}{| ref |}$ in accordance with our precision and recall measures for structural equivalence (Def. 3). Aggregated numbers are calculated per query pair category as the averages of precision and recall of all queries in each category.
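Tables 5 and 6 report averages of per-test F-measure. Assuming the usual balanced F-measure (the harmonic mean of precision and recall), which this section does not spell out, a single test’s score can be computed from the unmatched counts as sketched below; returning 0.0 for empty sets is our own assumption. The example uses the counts of result set C from Table 4.

```python
def scores(n_res: int, unmatched_res: int, n_ref: int, unmatched_ref: int):
    """Precision and recall from unmatched counts, plus their harmonic mean."""
    precision = (n_res - unmatched_res) / n_res if n_res else 0.0
    recall = (n_ref - unmatched_ref) / n_ref if n_ref else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Result set C of Table 4: 3 returned tuples, 1 unmatched (James);
# 2 reference tuples, none unmatched.
precision, recall, f_measure = scores(3, 1, 2, 0)
print(f"P={precision:.2f} R={recall:.2f} F={f_measure:.2f}")
```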

5.3. Presentation of measurements

RODI offers several ways to inspect test results at different levels of detail. By default, a report of different scores is provided for each scenario. Output formats can be chosen between a tabular format and a human-readable log. Each report includes a break-down into the categories used within the corresponding scenario. To better understand the details of category composition for each scenario, users can look up the corresponding queries, which are stored in human-readable text files.

In addition, a debug mode of the benchmark supports detailed per-query test output. When enabled, one report for each query test will be produced and includes details that fully explain test criteria, including full provided and expected results and any gaps between them. This information can be used, together with the database and ontology from a scenario, to understand exactly where and why a certain system has failed any of the tests, or tests associated with a particular score.

6. Benchmark results

6.1. Evaluated systems

We have performed an in-depth analysis using RODI on a wide range of systems. These include current contenders in the automatic segment (BootOX [19,25,27] and IncMap [37,40,41]), more general-purpose mapping generators (-ontop- [7], MIRROR [17] and D2RQ [6]), as well as a much earlier, yet state-of-the-art system in inter-model matching (COMA++ [2]). In a specialized semi-automatic series of experiments, we have also evaluated Karma [29,53], which does not support a fully automatic mapping generation mode and works with a sophisticated model of human intervention. As a consequence, it requires a specific experimental setup. Note that BootOX, -ontop-, MIRROR and D2RQ are two-stage systems (see Section 4.8), that is, they do not generate mappings targeting the ontology provided in each RODI scenario, but generate their own (putative or bootstrapped) ontology instead. BootOX includes a built-in ontology alignment system that integrates the bootstrapped ontology with the target (scenario) ontology. In order to evaluate -ontop-, MIRROR and D2RQ with RODI, we also aligned their generated ontologies with the target ontology, in a setup similar to the one used in BootOX.

BootOX (B.OX) is based on the approach called direct mapping by the W3C⁷: every table in the database (except for those representing n:m relationships) is mapped to one class in the ontology; every data attribute is mapped to one data property; and every foreign key to one object property. Explicit and implicit database constraints from the schema are also used to enrich the bootstrapped ontology with axioms about the classes and properties from these direct mappings. Afterwards, BootOX performs an alignment with the target ontology using the LogMap system [1,24,50].

⁷ http://www.w3.org/TR/rdb-direct-mapping/.

IncMap (IncM.) maps an available ontology directly to the relational schema. IncMap represents both the ontology and the schema uniformly, using a structure-preserving meta-graph for both. It runs in two phases, using lexical and structural matching. We evaluate a current (as yet unpublished) work-in-progress version of IncMap, as opposed to the initial version previously evaluated in [39]. The main differences between the two versions of IncMap are improvements in lexical matching and mapping selection, as well as engineering improvements that increase mapping quality.

MIRROR (MIRR.) is a tool for generating an ontology and R2RML direct mappings automatically from an RDB schema. MIRROR has been implemented as a module of the RDB2RDF engine morph-RDB [44]. Its output is oblivious of the required target ontology, though, so we post-process it with ontology alignment.

The -ontop- Protege Plugin (ontop) is a mapping generator developed for -ontop- [7]. -ontop- is a full-fledged query rewriting system [45] with limited ontology and mapping bootstrapping capabilities. As mentioned above, we need to post-process results with ontology alignment.

COMA++ (COMA) has been a contender in the field of schema matching for several years already; it is still widely considered state of the art. In contrast to other systems from the same era, COMA++ is built explicitly also for inter-model matching. To evaluate the system, we had to perform a translation of its output into modern R2RML.

The D2RQ platform (D2RQ) is a full-fledged system to access relational databases as virtual RDF graphs. Since D2RQ relies on its own native mapping language and RODI only supports standard R2RML mappings, we executed the mappings using D2RQ and provided RODI with the materialized data. Just as with MIRROR and -ontop-, post-processing with ontology alignment is required.

Karma is one of the most prominent modern relational-to-ontology mapping generation systems. It is strictly semi-automatic, i.e., there is no fully automatic baseline that we could use for non-interactive evaluation. In addition, Karma’s iterative mode is designed to benefit mostly from integrating a series of data sources into the same target ontology. Karma is therefore not well suited for single-scenario evaluations, and we evaluate it only in a dedicated line of experiments that suits its specifications.

6.2. Experimental setup

We conduct the default benchmark experiments as described in Section 4 for all systems except Karma. These include a selection of nine prototypical scenarios from the conference domain, one from the geodata domain and two from the oil & gas domain, as well as six different cross-matching (conference) scenarios. For all of these main experiments, we observe and report overall RODI scores as well as selected scores in individual categories.

Table 5
Overall scores in default scenarios (scores based on average of per-test F-measure). Best numbers per scenario in bold print

Scenario	B.OX	IncM.	ontop	MIRR.	COMA	D2RQ
	Conference domain, adjusted naming
CMT	0.76	0.45	0.28	0.28	0.48	0.31
Conference	0.51	0.53	0.26	0.27	0.36	0.26
SIGKDD	0.86	0.76	0.38	0.30	0.66	0.38
	Conference domain, restructured
CMT	0.41	0.44	0.14	0.17	0.38	0.14
Conference	0.41	0.41	0.13	0.23	0.31	0.21
SIGKDD	0.52	0.38	0.21	0.11	0.41	0.28
	Conference domain, combined case
SIGKDD	0.48	0.38	0.21	0.11	0.28	0.21
	Conference domain, missing FKs
Conference	0.33	0.41	–	0.17	0.21	0.18
	Conference domain, denormalized
CMT	0.44	0.40	0.20	0.22	–	0.20
	Geodata
Classic Rel.	0.13	0.08	–	–	–	0.06
	Oil & gas domain
User Queries	0.00	0.00	0.00	0.00	–	0.00
Atomic	0.14	0.12	0.10	0.00	0.02	0.08

In addition, we perform two different semi-automatic experiments on selected scenarios for Karma and IncMap, respectively. For Karma, we had to conduct experiments with an actual human in the loop to perform steps that Karma could not automate. For IncMap, we could simulate human feedback by answering its suggestions with responses from the benchmark that indicate changes in mapping quality. In both semi-automatic cases, we chiefly observe the number of interactions.

6.3. Default scenarios: Overall results

Table 5 shows scores for all systems on all basic default scenarios. At first glance, we can observe that all tested systems manage to solve some parts of the scenarios, but with declining success as scenario complexity increases.

For instance, relational schemata in the conference “adjusted naming” scenarios follow modeling patterns from their corresponding ontologies most closely, and all systems without exception perform best in this part of the experiments. Quality drops for all other types of scenarios, i.e., whenever we introduce additional challenges that are specific to the relational-to-ontology modeling gap. The drop in accuracy between the “adjusted naming” and “restructured” settings is mostly due to the n:1 mapping challenge introduced by one of the relational patterns for representing class hierarchies, which groups data for several subclasses in a single table. In the most advanced conference cases, systems lose further ground due to the additional challenges, although to different degrees. The good news is that some of the actively developed current systems, BootOX and IncMap, could improve their scores compared to previous numbers recorded in January 2015 [39]. A somewhat disappointing general observation, however, is that measured quality is overall still modest compared to results known from ontology alignment tasks involving some of the same ontologies (cf. [8,12,55]). This is especially notable as state-of-the-art ontology alignment software is employed in some of the systems. It could indicate that the specific challenges in relational-to-ontology mapping generation cannot convincingly be solved with the same technology that is successful in ontology alignment, but may call for more specialized approaches.

Table 6
Overall scores in cross-matching scenarios (scores based on average of per-test F-measure). Best numbers per scenario in bold print

Source	B.OX	IncM.	ontop	MIRR.	COMA	D2RQ
	Target ontology: CMT
Conference	0.20	0.35	0.10	0.00	0.00	0.10
SIGKDD	0.33	0.33	0.19	0.00	0.14	0.19
	Target ontology: Conference
CMT	0.20	0.34	0.05	0.00	0.05	0.05
SIGKDD	0.13	0.30	0.09	0.00	0.04	0.09
	Target ontology: SIGKDD
CMT	0.51	0.57	0.19	0.00	0.24	0.26
Conference	0.24	0.44	0.13	0.00	0.09	0.14

While all of the conference scenarios test a wide range of specific relational-to-ontology mapping challenges, they do so in a highly controlled fashion, on schemata of at most medium size and complexity, and using a largely simplified query workload. For instance, queries in the conference domain scenarios would separately check for mappings of authors, person names, and papers. They would not, however, pose any queries such as asking for the names of authors who participated in at least five different papers. The crucial difference is that, if two out of three of these elements were mapped correctly, the simple, atomic queries would report an average score of 0.66, while the single, more application-like query that correlates the same elements would not retrieve anything, thus resulting in a score of 0.00. None of the systems managed to solve even a single test on this challenge. Real-world queries of this kind, which mimic an actual application query workload, are precisely the focus of the remaining default scenarios, which are set in the geodata and oil & gas exploration domains. Consequently, scores are lower again in those scenarios. In the geodata scenario, only a minority of query tests could be solved. Detailed debugging showed that the reason for this lies in the more complex nature of the queries, most of which go beyond returning simple results for just a single mapped element. In the oil & gas case, the situation becomes even more problematic. Here, the schema and ontology are again more complex than in the geodata scenario, and so is the explorative query workload (“user queries”). None of the systems was able to answer any of these queries correctly after a round of automatic mapping. To retrieve meaningful results, we added a second scenario on the same data, but with a synthetic workload of atomic queries (“atomic”). On this scenario, results could be computed, but overall scores remain low due to the size and complexity of the schema and ontology, which entail a large search space, as well as many 1:n matches.

Table 6 showcases results from the most advanced scenarios in the conference domain. All of them are built on the “combined case” scenarios, i.e., they contain a mix of all of the standard relational-to-ontology mapping challenges except for denormalization and lazy modeling of constraints. In addition, they increase the level of semantic heterogeneity by asking for mappings between a schema derived from one ontology and a rather different, independent ontology in the same domain. Scores are generally lower than in the basic conference cases discussed above, but reasonable scores can still be achieved by some systems. Also, the overall performance trend between the systems mostly remains the same as in the basic scenarios, with a few exceptions. Somewhat surprisingly, COMA loses out more than other contenders. Moreover, the performance of BootOX is noticeably low compared to its baseline results from the basic scenarios in Table 5. This is unexpected, as BootOX essentially applies ontology alignment technology that has proven itself in tasks with high semantic heterogeneity [1]. It could, again, indicate that out-of-the-box ontology alignment techniques cannot gain the same leverage here that they do when aligning original ontologies.

The big picture shows that two of the most specialized and also actively developed systems, BootOX and IncMap, are leading the field. Of these two, BootOX has a clear advantage in scenarios where the inter-model gap between relational schema and ontology is small (e.g., “adjusted naming”). IncMap gains ground when more specific inter-model mapping challenges are added. MIRROR, -ontop- and D2RQ generally show weaker results. It has to be noted, though, that these systems were originally designed and optimized for a somewhat different task than the full end-to-end mapping generation setup tested with RODI. MIRROR and -ontop- also fail to execute some of the scenarios due to technical difficulties. For MIRROR in particular, we have encountered a number of so far unresolved difficulties that may also have a detrimental effect on its scores. COMA keeps up well, given that it is no longer actively developed and improved. Also, while COMA has been constructed to support inter-model matching in general, it has not been explicitly optimized for the specific case of relational-to-ontology matching.

As part of our detailed analysis of the results, we could also identify, and in part even fix, a number of technical shortcomings in tested systems. For instance, we encountered issues with MIRROR in certain multi-schema matching cases on PostgreSQL and implemented a solution in collaboration with the authors of the system. In another example, IncMap’s poor performance in the geodata scenario could in part be explained by its failure to understand the specification of property domains and ranges as a union of several concrete classes. This pattern led IncMap to skip such properties altogether. While not yet fixed, the observation points to concrete technical improvements in IncMap. In BootOX, incomplete and unfavorable reasoning settings were detected and fixed.

6.4. Default scenarios: Drill-down

Table 7
Score break-down for queries on different match types with adjusted naming conference scenarios. ‘C’ stands for queries on classes, ‘D’ for data properties, ‘O’ for object properties

Scenario	B.OX			IncM.			ontop			MIRR.			COMA			D2RQ
	C	D	O	C	D	O	C	D	O	C	D	O	C	D	O	C	D	O
CMT	0.92	0.73	0.50	0.58	0.46	0.17	0.67	0.00	0.00	0.56	0.00	0.00	0.75	0.46	0.00	0.67	0.00	0.17
Conference	0.81	0.27	0.38	0.81	0.53	0.13	0.63	0.00	0.00	0.53	0.00	0.00	0.50	0.40	0.00	0.63	0.00	0.00
SIGKDD	1.00	0.90	0.25	0.80	0.70	0.25	0.73	0.00	0.00	0.46	0.00	0.00	0.80	0.70	0.00	0.73	0.00	0.00

Table 8

Score break-down for queries that test n:1 matches in restructured conference domain scenarios. 1:1 and n:1 stand for queries involving 1:1 or n:1 mappings between classes and tables, respectively

Scenario	B.OX		IncM.		ontop		MIRR.		COMA		D2RQ

	1:1	n:1	1:1	n:1	1:1	n:1	1:1	n:1	1:1	n:1	1:1	n:1
CMT	0.86	0.00	0.79	0.00	0.57	0.00	0.00	0.00	0.58	0.00	0.57	0.00
Conference	0.78	0.00	0.89	0.00	0.56	0.00	0.00	0.00	0.56	0.00	0.67	0.00
SIGKDD	1.00	0.00	0.86	0.00	0.86	0.00	0.00	0.00	0.86	0.00	0.86	0.00

Table 9

Score break-down for queries that require 1:n class matches on the Oil & Gas atomic tests scenario

Scenario	B.OX			IncM.			ontop			MIRR.			COMA			D2RQ

	1:1	1:2	1:3	1:1	1:2	1:3	1:1	1:2	1:3	1:1	1:2	1:3	1:1	1:2	1:3	1:1	1:2	1:3
Oil & Gas Atomic	0.17	0.11	0.07	0.20	0.01	0.03	0.10	0.09	0.07	0.00	0.00	0.00	0.03	0.00	0.00	0.11	0.00	0.07

All systems struggle with correctly identifying properties, as Table 7 shows. A further drill-down shows that this is in part due to the challenge of normalization artifacts, with systems struggling to detect any properties that map to multi-hop join paths in the tables. Mapping data to class types appears to be generally easier for all contenders. BootOX performs best in most cases with all kinds of properties, with IncMap coming in second. This represents a change over the previous versions of both systems benchmarked earlier, where IncMap was clearly leading on properties [39].

Tables 8 and 9 show the behavior of systems for finding n:1 and 1:n matches between ontology classes and table content, respectively. We highlight the n:1 case in restructured conference scenarios and 1:n matches in the oil & gas scenario, as they include the highest number of tests in their respective categories. In both cases, results are sobering, with all systems failing the large majority of tests. For 1:n matches, the situation is slightly better than for n:1 matches. This is not particularly surprising, as 1:n matches can be composed in mapping rules by adding up several correct 1:1 matches. A correct mapping of n:1 matches between classes and tables, on the other hand, usually requires the much more challenging task of filtering entities of a particular type from a table that holds entities of different types.
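The asymmetry between the two cases can be sketched as follows. All table contents are hypothetical; the person_type discriminator mirrors the SQL query in Table 4, where person_type = 2 selects authors. For the n:1 case, each class mapping must add a row filter; for the 1:n case, a plain union of 1:1 mappings suffices.

```python
# Hypothetical rows of one "persons" table that stores several classes,
# told apart by a person_type code (as in Table 4, where person_type = 2
# selects authors). Columns: id, name, person_type.
PERSONS = [(1, "Alice", 2), (2, "Bob", 3), (3, "Carol", 2)]
# Hypothetical second table whose rows also belong to the Person class.
EXTERNAL_REVIEWERS = [(4, "Dave", 3)]

def map_class(rows, cls, type_code=None):
    """Type one entity per row; with type_code, keep only rows of that type."""
    return {(f"person/{pid}", cls) for pid, _name, ptype in rows
            if type_code is None or ptype == type_code}

# n:1 (several classes, one table): each class mapping needs a row filter.
authors = map_class(PERSONS, "Author", type_code=2)
# Dropping the filter wrongly types every person in the table as an Author.
unfiltered = map_class(PERSONS, "Author")

# 1:n (one class, several tables): a union of plain 1:1 mappings suffices.
people = map_class(PERSONS, "Person") | map_class(EXTERNAL_REVIEWERS, "Person")
print(sorted(authors), len(unfiltered), len(people))
```

The hard part for a mapping generator is not the filter expression itself but discovering that a discriminator column exists and which of its values corresponds to which class.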

Fig. 5.

Karma multi-source integration counting human interactions.

6.5. Semi-automatic, iterative scenarios

We have also conducted semi-automatic, iterative experiments on RODI scenarios with two different systems, Karma and IncMap. While IncMap was also evaluated in its fully automatic mode in the main line of experiments, Karma does not support such a baseline mode and always requires human intervention in different forms. This is mainly due to Karma’s need for so-called Python transformations, essentially tiny Python scripts, to mint entity IRIs. In contrast to class and property matches, Karma does not learn those transformations. Also, the two systems follow completely different semi-automatic processes. Karma is designed for multi-source integration and learns from human interactions in one scenario to provide suggestions in the next ones. IncMap, on the other hand, adjusts its suggestions after simple yes/no feedback within a single scenario but has no memory between any two scenarios.

For these reasons, a direct experimental comparison between the two systems is not feasible. Instead, we run a separate dedicated experiment for each of them and identify similarities and differences in performance in the following discussion.

With Karma, we ran three experiments, each of which consists of a series of three related scenarios on the same target ontology. This translates to three different source schemata that Karma needs to integrate in a row. As Karma cannot produce any results completely automatically, we conducted this experiment interactively and recorded the number of human interactions needed to complete the mapping for each of the data sources. Figure 5 shows that in all cases the total number of required interactions drops for later data sources over previous ones. The drop in manual class matches and property matches is made possible by type learning. Python transformations remain approximately constant across subsequent data sources as no learning support and suggestions are available for these transformations.

Owing to the manual input, the mappings resulting from Karma's semi-automatic process are generally of high quality (cf. Table 10). Seven of the nine scores are 0.97 or higher, and the lowest is 0.85 (second data source mapped to CMT).

For IncMap, we ran a series of regular single-scenario tests, but in an incremental, semi-automatic setup [36]. For each scenario, we simulated human feedback in the form of choosing a suggestion from shortlists of three suggestions each. To simulate this feedback, we used the benchmark as an oracle to identify the best pick. We observed how the score achieved by IncMap's mappings changes over a number of iterations, i.e., we report the score at k human interactions [35].
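The oracle-based simulation can be illustrated with a deliberately simplified Python sketch. It is not IncMap's algorithm. IncMap also adjusts its remaining suggestions after each piece of feedback, and this sketch omits that step. The sketch makes the following assumptions:

* Each schema element (all names are hypothetical) has a fixed ranked list of candidate ontology terms.
* The automatic baseline takes the top candidate.
* Each interaction shows a shortlist of three candidates for the next element, and the oracle (the reference mapping) picks the correct term if it appears.
* The score is simply the fraction of correctly mapped elements. This is an assumed stand-in for RODI's query-based score.

```python
from typing import Dict, List


def score(mapping: Dict[str, str], reference: Dict[str, str]) -> float:
    """Fraction of schema elements mapped to their reference ontology term."""
    correct = sum(1 for elem, term in mapping.items() if reference.get(elem) == term)
    return correct / len(reference)


def score_at_k(ranked: Dict[str, List[str]], reference: Dict[str, str],
               shortlist_size: int = 3) -> List[float]:
    """Return [score@0, score@1, ...] under oracle-simulated shortlist feedback."""
    # Fully automatic baseline (@0): every element takes its top-ranked candidate.
    mapping = {elem: cands[0] for elem, cands in ranked.items()}
    scores = [score(mapping, reference)]
    # One simulated interaction per element, in the given order.
    for elem, cands in ranked.items():
        shortlist = cands[:shortlist_size]
        # The oracle picks the correct term if it is shown; otherwise the
        # current (top-ranked) choice stays in place.
        if reference[elem] in shortlist:
            mapping[elem] = reference[elem]
        scores.append(score(mapping, reference))
    return scores


if __name__ == "__main__":
    ranked = {
        "persons": ["Person", "Author", "Paper"],
        "papers": ["Paper", "Review"],
        "reviews": ["Document", "Comment", "Review"],
        "chairs": ["Person", "Author", "Topic", "Chair"],
    }
    reference = {"persons": "Person", "papers": "Paper",
                 "reviews": "Review", "chairs": "Chair"}
    print(score_at_k(ranked, reference))  # [0.5, 0.5, 0.5, 0.75, 0.75]
```

The last toy element shows the limit of shortlist feedback. If the correct term is not among the top three candidates, an interaction is spent without improving the score. A system that adjusts its rankings after feedback could move the term into the shortlist.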

Table 11 shows these scores for three conference domain scenarios before feedback (@0) and after 6, 12, and 24 interactions. Scores clearly increase with ongoing feedback. The system profits most from the early rounds of feedback: in all three scenarios, most of the improvement is achieved within the first 12 interactions, after which gains are moderate.

Note that these changes in score are based on feedback over several iterations on the same scenario. It would be particularly interesting to evaluate a system that combines the approaches of Karma and IncMap. The results available so far show that each approach has its own benefits. A direct comparison is not possible, though, because the two systems follow quite different processes (multi-source vs. single-source) and request different forms of human input (e.g., Python transformations in Karma).

Table 10
Semi-automatic Karma mappings: generally very high scores thanks to human input

Series (target ontology)	1st source	2nd source	3rd source
To CMT	0.97	0.85	0.99
To Conference	0.90	1.00	1.00
To SIGKDD	1.00	0.99	1.00

Table 11

Impact of incremental mapping: scores for IncMap after k interactions in adjusted naming scenarios

Scenario	@0	@6	@12	@24
CMT	0.45	0.73	0.92	0.96
Conference	0.53	0.61	0.68	0.77
SIGKDD	0.76	0.85	1.00	1.00
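The statement about diminishing returns can be checked directly against Table 11. The short Python sketch below computes the average score gain per interaction between consecutive checkpoints and compares the gain over the first 12 interactions with the gain over the next 12. The values are copied from the table, and nothing else is assumed.

```python
# Scores copied from Table 11 (IncMap, adjusted naming scenarios).
CHECKPOINTS = [0, 6, 12, 24]
SCORES = {
    "CMT": [0.45, 0.73, 0.92, 0.96],
    "Conference": [0.53, 0.61, 0.68, 0.77],
    "SIGKDD": [0.76, 0.85, 1.00, 1.00],
}


def gain_per_interaction(scores, checkpoints=CHECKPOINTS):
    """Average score gain per interaction between consecutive checkpoints."""
    return [
        round((s1 - s0) / (k1 - k0), 4)
        for k0, k1, s0, s1 in zip(checkpoints, checkpoints[1:], scores, scores[1:])
    ]


for name, scores in SCORES.items():
    first_12 = scores[2] - scores[0]   # gain over interactions 1-12
    next_12 = scores[3] - scores[2]    # gain over interactions 13-24
    print(f"{name}: per-interaction gains {gain_per_interaction(scores)}, "
          f"+{first_12:.2f} in first 12, +{next_12:.2f} in next 12")
```

In every scenario, the gain over the first 12 interactions exceeds the gain over the following 12. For SIGKDD, the per-interaction gain between interactions 6 and 12 is actually larger than in the first six. "Early rounds" is therefore best read as the first dozen interactions rather than the very first few.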

7. Related work

Mappings between ontologies are usually evaluated only on the basis of their underlying correspondences (usually referred to as ontology alignments). The Ontology Alignment Evaluation Initiative (OAEI) [8,12,55] provides tests and benchmarks for such alignments that can be considered a de facto standard. Mappings between relational databases, in contrast, are typically not evaluated with a common benchmark. Instead, authors typically compare their tools against one or more industry-standard systems (e.g., [2,14]) in a scenario of their own choice. A new TPC benchmark, TPC-DI [43], was recently created to close this gap. However, no results have been reported on the TPC-DI website so far. To the best of our knowledge, no benchmark specifically measuring the quality of inter-model relational-to-ontology mappings was available before the original release of RODI [39].
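The correspondence-based evaluation common in ontology matching can be sketched in a few lines. A system alignment is compared with a reference alignment as sets of correspondences, which yields precision, recall, and F-measure. This simplified Python sketch assumes exact matching of (source entity, target entity, relation) triples and ignores confidence values. The entity names are hypothetical, and the sketch does not reproduce the OAEI's own evaluation tooling.

```python
from typing import Set, Tuple

# A correspondence: (source entity, target entity, relation such as "=").
Correspondence = Tuple[str, str, str]


def evaluate_alignment(system: Set[Correspondence],
                       reference: Set[Correspondence]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure of a system alignment vs. a reference."""
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    reference = {("cmt:Paper", "conf:Paper", "="),
                 ("cmt:Author", "conf:Author", "="),
                 ("cmt:Review", "conf:Review", "=")}
    system = {("cmt:Paper", "conf:Paper", "="),
              ("cmt:Author", "conf:Person", "="),  # wrong target entity
              ("cmt:Review", "conf:Review", "=")}
    print(evaluate_alignment(system, reference))
```

Scores of this kind reflect the correspondences alone. They do not show whether data mapped along these correspondences actually answers queries as expected.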

Similarly, evaluations of relational-to-ontology mapping generation systems have been based on one or several data sets deemed appropriate by the authors and are therefore not comparable. In one of the most comprehensive evaluations so far, QODI [56] was evaluated on several real-world data sets, though some of the reference mappings were rather simple. IncMap [40] was first evaluated on a selection of real-world mapping problems based on data from two different domains. Such domain-specific mapping problems could easily be integrated into our benchmark through its extension mechanism.

A number of papers discuss different quality aspects of relational-to-ontology mapping generation in a more general way. Console and Lenzerini have devised a series of theoretical OBDA data quality checks w.r.t. consistency [9]. As such, these checks could also be used to judge mapping quality to a certain degree. However, the focus of that work is clearly different. The approach is agnostic of actual requirements and expectations and only considers the consistency of the data in itself. Westphal et al. have proposed a more multi-dimensional approach [57]. Their proposal does not include a single, globally comparable score. Instead, it offers a collection of different measures, which can be sampled and combined as suitable in different scenarios. Tarasowa et al. propose a similarly generic approach to measuring the quality of relational-to-ontology mappings [54]. Dimou et al. [11] have proposed unit testing as a generic and domain-independent quality measure for relational-to-ontology mappings. Imprialou et al. [23] present a benchmark composed of a series of synthetic queries to measure the correctness and completeness of relational-to-ontology query rewriting. Their approach presupposes complete and correct mappings. Mora and Corcho discuss issues and possible solutions for benchmarking the query rewriting step in OBDA systems [33], where mappings are assumed to be given as immutable input. The NPD benchmark [30] measures the performance of OBDA query evaluation. Neither of these latter two papers, however, addresses the issue of measuring mapping quality.

A comprehensive overview of relational-to-ontology efforts, including related approaches to automatic mapping generation, can be found in two surveys [46,51].

8. Conclusion

We have presented RODI, a novel benchmark suite for testing the quality of system-generated relational-to-ontology mappings. The prime application area of RODI is ontology-based data integration. RODI tests a wide range of data mapping challenges that are specific to relational-to-ontology mappings and that we identified in this paper.

Using RODI, we conducted a thorough evaluation of seven prominent relational-to-ontology mapping generation systems from different research groups. We identified strengths and weaknesses of each system and in some cases could even pinpoint specific erroneous behavior. We communicated our observations to the authors of BootOX, IncMap, MIRROR, D2RQ, and -ontop-. Most of them have already used our feedback to improve their systems and the quality of the computed mappings. Overall, the systems cope well with relatively simple mapping challenges. However, all tested tools perform poorly on most of the more advanced challenges, which come close to actual real-world problems. Further research is therefore needed to address these challenges.

Future work includes repeated evaluations of a growing number of relational-to-ontology mapping generation systems. It would be particularly interesting to evaluate semi-automatic tools more comprehensively and to compare different tools directly under identical settings. We also expect several of the tested systems to address issues pointed out by our evaluation with RODI. Another avenue for future work is extending the benchmark suite, e.g., by adding scenarios from other application domains relevant to ontology-based data integration.

Acknowledgements

This research is funded by the Seventh Framework Program (FP7) of the European Commission under Grant Agreement 318338, "Optique". Ernesto Jiménez-Ruiz, Evgeny Kharlamov and Ian Horrocks were also supported by the EPSRC projects MaSI³, Score!, DBOnto and ED3.

References

1. Jiménez-Ruiz and Cuenca Grau, LogMap: Logic-based and scalable ontology matching, in: Proceedings, Part I, The Semantic Web – ISWC 2011: 10th International Semantic Web Conference, Aroyo, Welty, Alani, Taylor, Bernstein, Kagal, Noy and Blomqvist, eds, Springer, Berlin Heidelberg, 2011. doi:10.1007/978-3-642-25073-6_18.

2. Aumueller, H.-H. Do, Massmann and Rahm, Schema and ontology matching with COMA++, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14–16, 2005, Özcan, ed., ACM Press, 2005, pp. 906–908. doi:10.1145/1066157.1066283.

3. Bada, Stevens, Goble, Gil, Ashburner, Blake, J.M. Cherry, Harris and Lewis, A short study on the success of the Gene Ontology, Journal of Web Semantics 1(2) (2004), 235–240. doi:10.1016/j.websem.2003.12.003.

4. J.F. Baldwin and S.Q. Zhou, A fuzzy relational inference language, Fuzzy Sets and Systems 14(2) (1984). doi:10.1016/0165-0114(84)90098-8.

5. Batini, Lenzerini and S.B. Navathe, A comparative analysis of methodologies for database schema integration, ACM Computing Surveys 18(4) (1986), 323–364. doi:10.1145/27633.27634.

6. Bizer and Seaborne, D2RQ – treating non-RDF databases as virtual RDF graphs, in: Proceedings of Poster Track – Third International Semantic Web Conference (ISWC) 2004, Hiroshima, Japan, J.J. Carrol, ed., 2004, http://iswc2004.semanticweb.org/posters/PID-SMCVRKBT-1089637165.pdf.

7. Calvanese, Cogrel, Komla-Ebri, Kontchakov, Lanti, Rezk, Rodriguez-Muro and Xiao, Ontop: Answering SPARQL queries over relational databases, Semantic Web Journal 8(3) (2017), 471–487. doi:10.3233/SW-160217.

8. Cheatham, Dragisic, Euzenat, Faria, Ferrara, Flouris, Fundulaki, Granada, Ivanova, Jiménez-Ruiz, Lambrix, Montanelli, Pesquita, Saveta, Shvaiko, Solimando, Trojahn and Zamazal, Results of the ontology alignment evaluation initiative 2015, in: Proceedings of the 10th International Workshop on Ontology Matching Collocated with the 14th International Semantic Web Conference (ISWC 2015), Bethlehem, PA, USA, October 12, 2015, CEUR Workshop Proceedings, Vol. 1545, CEUR-WS.org, 2015, pp. 60–115, http://ceur-ws.org/Vol-1545/oaei15_paper0.pdf.

9. Console and Lenzerini, Data quality in ontology-based data access: The case of consistency, in: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, Québec City, Québec, Canada, July 27–31, 2014, C.E. Brodley and Stone, eds, AAAI Press, 2014, pp. 1020–1026, http://www.aaai.org/ocs/index.php/AAAI/AAAI14/paper/view/8552.

10. Cuenca Grau, Horrocks, Motik, Parsia, P.F. Patel-Schneider and Sattler, OWL 2: The next step for OWL, Journal of Web Semantics 6(4) (2008), 309–322. doi:10.1016/j.websem.2008.05.001.

11. Dimou, Kontokostas, Freudenberg, Verborgh, Lehmann, Mannens, Hellmann and Van de Walle, Assessing and refining mappings to RDF to improve dataset quality, in: Proceedings, Part II, The Semantic Web – ISWC 2015 – 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11–15, 2015, Arenas, Ó. Corcho, Simperl, Strohmaier, d'Aquin, Srinivas, P.T. Groth, Dumontier, Heflin, Thirunarayan and Staab, eds, Lecture Notes in Computer Science, Vol. 9367, Springer, 2015, pp. 133–149. doi:10.1007/978-3-319-25010-6_8.

12. Dragisic, Eckert, Euzenat, Faria, Ferrara, Granada, Ivanova, Jiménez-Ruiz, Oskar Kempf, Lambrix, Montanelli, Paulheim, Ritze, Shvaiko, Solimando, C.T. dos Santos, Zamazal and Cuenca Grau, Results of the ontology alignment evaluation initiative 2014, in: Proceedings of the 9th International Workshop on Ontology Matching Collocated with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Trentino, Italy, October 20, 2014, Shvaiko, Euzenat, Mao, Jiménez-Ruiz, Li and Ngonga, eds, CEUR Workshop Proceedings, Vol. 1317, CEUR-WS.org, 2014, pp. 61–104, http://ceur-ws.org/Vol-1317/oaei14_paper0.pdf.

13. Duggan, A.J. Elmore, Stonebraker, Balazinska, Howe, Kepner, Madden, Maier, Mattson and Zdonik, The BigDAWG polystore system, SIGMOD Record 44(2) (2015). doi:10.1145/2814710.2814713.

14. Fagin, L.M. Haas, M.A. Hernández, R.J. Miller, Popa and Velegrakis, Clio: Schema mapping creation and data exchange, in: Conceptual Modeling: Foundations and Applications – Essays in Honor of John Mylopoulos, Borgida, V.K. Chaudhri, Giorgini and E.S.K. Yu, eds, Lecture Notes in Computer Science, Vol. 5600, Springer, 2009, pp. 198–236. doi:10.1007/978-3-642-02463-4_12.

15. S.M. Falconer and Fridman Noy, Interactive techniques to support ontology matching, in: Schema Matching and Mapping, Data-Centric Systems and Applications, Bellahsene, Bonifati and Rahm, eds, Springer, 2011, pp. 29–51. doi:10.1007/978-3-642-16518-4_2.

16. Freitas, Schulz and Moraes, Survey of current terminologies and ontologies in biology and medicine, RECIIS – Electronic Journal in Communication, Information and Innovation in Health 3 (2009), 7–18.

17. Frontino de Medeiros, Priyatna and Ó. Corcho, MIRROR: automatic R2RML mapping generation from relational databases, in: Proceedings, Engineering the Web in the Big Data Era – 15th International Conference, ICWE 2015, Rotterdam, The Netherlands, June 23–26, 2015, Cimiano, Frasincar, G.-J. Houben and Schwabe, eds, Lecture Notes in Computer Science, Vol. 9114, Springer, 2015, pp. 326–343. doi:10.1007/978-3-319-19890-3_21.

18. Garcia-Molina, J.D. Ullman and Widom, Database Systems – the Complete Book, 2nd edn, Prentice Hall Press, Upper Saddle River, NJ, USA, 2008.

19. Giese, Soylu, Vega-Gorgojo, Waaler, Haase, Jiménez-Ruiz, Lanti, Rezk, Xiao, Ö.L. Özçep and Rosati, Optique: Zooming in on big data, IEEE Computer 48(3) (2015), 60–67. doi:10.1109/MC.2015.82.

20. Heyvaert, Dimou, Verborgh, Mannens and Van de Walle, Towards approaches for generating RDF mapping definitions, in: Proceedings of the ISWC 2015 Posters & Demonstrations Track Co-Located with the 14th International Semantic Web Conference (ISWC-2015), Bethlehem, PA, USA, October 11, 2015, Villata, J.Z. Pan and Dragoni, eds, CEUR Workshop Proceedings, Vol. 1486, CEUR-WS.org, 2015, http://ceur-ws.org/Vol-1486/paper_70.pdf.

21. Hornung and May, Experiences from a TBox reasoning application: Deriving a relational model by OWL schema analysis, in: Proceedings of the 10th International Workshop on OWL: Experiences and Directions (OWLED 2013) Co-Located with 10th Extended Semantic Web Conference (ESWC 2013), Montpellier, France, May 26–27, 2013, Rodriguez-Muro, Jupp and Srinivas, eds, CEUR Workshop Proceedings, Vol. 1080, CEUR-WS.org, 2013, pp. 26–27, http://ceur-ws.org/Vol-1080/owled2013_3.pdf.

22. Horridge and Bechhofer, The OWL API: A Java API for OWL ontologies, Semantic Web 2(1) (2011), 11–21. doi:10.3233/SW-2011-0025.

23. Imprialou, Stoilos and Cuenca Grau, Benchmarking ontology-based query rewriting systems, in: Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence, Toronto, Ontario, Canada, July 22–26, 2012, Hoffmann and Selman, eds, AAAI Press, 2012, http://www.aaai.org/ocs/index.php/AAAI/AAAI12/paper/view/4910.

24. Jiménez-Ruiz, Cuenca Grau, Zhou and Horrocks, Large-scale interactive ontology matching: Algorithms and implementation, in: ECAI 2012 – 20th European Conference on Artificial Intelligence, L. De Raedt, Bessière, Dubois, Doherty, Frasconi, Heintz and P.J.F. Lucas, eds, Frontiers in Artificial Intelligence and Applications, Vol. 242, IOS Press, 2012. doi:10.3233/978-1-61499-098-7-444.

25. Jiménez-Ruiz, Kharlamov, Zheleznyakov, Horrocks, Pinkel, M.G. Skjæveland, Thorstensen and Mora, BootOX: Practical mapping of RDBs to OWL 2, in: Proceedings, Part II, The Semantic Web – ISWC 2015 – 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11–15, 2015, Arenas, Ó. Corcho, Simperl, Strohmaier, d'Aquin, Srinivas, P.T. Groth, Dumontier, Heflin, Thirunarayan and Staab, eds, Lecture Notes in Computer Science, Vol. 9367, Springer, 2015, pp. 113–132. doi:10.1007/978-3-319-25010-6_7.

26. Kharlamov, Giese, Jiménez-Ruiz, M.G. Skjæveland, Soylu, Zheleznyakov, Bagosi, Console, Haase, Horrocks, Marciuska, Pinkel, Rodriguez-Muro, Ruzzi, Santarelli, D.F. Savo, Sengupta, Schmidt, Thorstensen, Trame and Waaler, Optique 1.0: Semantic access to big data: The case of Norwegian petroleum directorate's FactPages, in: Proceedings of the ISWC 2013 Posters & Demonstrations Track, Sydney, Australia, October 23, 2013, Blomqvist and Groza, eds, CEUR Workshop Proceedings, Vol. 1035, CEUR-WS.org, 2013, pp. 65–68, http://ceur-ws.org/Vol-1035/iswc2013_demo_17.pdf.

27. Kharlamov, Hovland, Jiménez-Ruiz, Lanti, Lie, Pinkel, Rezk, M.G. Skjæveland, Thorstensen, Xiao, Zheleznyakov and Horrocks, Ontology based access to exploration data at Statoil, in: Proceedings, Part II, The Semantic Web – ISWC 2015 – 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11–15, 2015, Arenas, Ó. Corcho, Simperl, Strohmaier, d'Aquin, Srinivas, P.T. Groth, Dumontier, Heflin, Thirunarayan and Staab, eds, Lecture Notes in Computer Science, Vol. 9367, Springer, 2015, pp. 93–112. doi:10.1007/978-3-319-25010-6_6.

28. Kharlamov, Solomakhina, Ö. Lütfü Özçep, Zheleznyakov, Hubauer, Lamparter, Roshchin, Soylu and Watson, How semantic technologies can enhance data access at Siemens Energy, in: Proceedings, Part I, The Semantic Web – ISWC 2014 – 13th International Semantic Web Conference, Riva del Garda, Italy, October 19–23, 2014, Mika, Tudorache, Bernstein, Welty, C.A. Knoblock, Vrandečić, P.T. Groth, N.F. Noy, Janowicz and C.A. Goble, eds, Lecture Notes in Computer Science, Vol. 8796, Springer, 2014, pp. 601–619. doi:10.1007/978-3-319-11964-9_38.

29. C.A. Knoblock, P.A. Szekely, J.L. Ambite, Goel, Gupta, Lerman, Muslea, Taheriyan and Mallick, Semi-automatically mapping structured sources into the semantic web, in: Proceedings, The Semantic Web: Research and Applications – 9th Extended Semantic Web Conference, ESWC 2012, Crete, Greece, May 27–31, 2012, Simperl, Cimiano, Polleres, Ó. Corcho and Presutti, eds, Lecture Notes in Computer Science, Vol. 7295, Springer, 2012, pp. 375–390. doi:10.1007/978-3-642-30284-8_32.

30. Lanti, Rezk, Slusnys, Calvanese and Xiao, The NPD benchmark for OBDA systems, in: Proceedings of the 10th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS 2014), CEUR Electronic Workshop Proceedings, 2014.

31. Luna Dong and Srivastava, Big data integration, Proceedings of the VLDB Endowment 6(11) (2013), 1188–1189. doi:10.14778/2536222.2536253.

32. May, Information Extraction and Integration with Florid: The Mondial Case Study, Technical report, Universität Freiburg, Institut für Informatik, 1999, http://www.dbis.informatik.uni-goettingen.de/Mondial/.

33. Mora and Ó. Corcho, Towards a systematic benchmarking of ontology-based query rewriting systems, in: Proceedings, Part II, The Semantic Web – ISWC 2013 – 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21–25, 2013, Alani, Kagal, Fokoue, P.T. Groth, Biemann, J.X. Parreira, Aroyo, N.F. Noy, Welty and Janowicz, eds, Lecture Notes in Computer Science, Vol. 8219, Springer, 2013, pp. 376–391. doi:10.1007/978-3-642-41338-4_24.

34. Motik, Horrocks and Sattler, Bridging the gap between OWL and relational databases, Journal of Web Semantics 7(2) (2009), 74–89. doi:10.1016/j.websem.2009.02.001.

35. Paulheim, Hertling and Ritze, Towards evaluating interactive ontology matching tools, in: The Semantic Web: Semantics and Big Data. 10th International Conference (ESWC 2013), Lecture Notes in Computer Science, Vol. 7882, Springer, Berlin, 2013, pp. 31–45. doi:10.1007/978-3-642-38288-8_3.

36. Pinkel, Interactive pay as you go relational-to-ontology mapping, in: Proceedings, Part II, The Semantic Web – ISWC 2013 – 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21–25, 2013, Alani, Kagal, Fokoue, P.T. Groth, Biemann, J.X. Parreira, Aroyo, N.F. Noy, Welty and Janowicz, eds, Lecture Notes in Computer Science, Vol. 8219, Springer, 2013, pp. 456–464. doi:10.1007/978-3-642-41338-4_31.

37. Pinkel, i³MAGE: Incremental, Interactive, Inter-Model Mapping Generation, PhD thesis, University of Mannheim, 2016.

38. Pinkel, Binnig, Haase, Martin, Sengupta and Trame, How to best find a partner? An evaluation of editing approaches to construct R2RML mappings, in: Proceedings, The Semantic Web: Trends and Challenges – 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25–29, 2014, Presutti, d'Amato, Gandon, d'Aquin, Staab and Tordai, eds, Lecture Notes in Computer Science, Vol. 8465, Springer, 2014, pp. 675–690. doi:10.1007/978-3-319-07443-6_45.

39. Pinkel, Binnig, Jiménez-Ruiz, May, Ritze, M.G. Skjæveland, Solimando and Kharlamov, RODI: A benchmark for automatic mapping generation in relational-to-ontology data integration, in: Proceedings, The Semantic Web. Latest Advances and New Domains – 12th European Semantic Web Conference, ESWC 2015, Portoroz, Slovenia, May 31–June 4, 2015, Gandon, Sabou, Sack, d'Amato, Cudré-Mauroux and Zimmermann, eds, Lecture Notes in Computer Science, Vol. 9088, Springer, 2015, pp. 21–37. doi:10.1007/978-3-319-18818-8_2.

40. Pinkel, Binnig, Kharlamov and Haase, IncMap: Pay as you go matching of relational schemata to OWL ontologies, in: Proceedings of the 8th International Workshop on Ontology Matching Co-Located with the 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 21, 2013, Shvaiko, Euzenat, Srinivas, Mao and Jiménez-Ruiz, eds, CEUR Workshop Proceedings, Vol. 1111, CEUR-WS.org, 2013, pp. 37–48, http://ceur-ws.org/Vol-1111/om2013_Tpaper4.pdf.

41. Pinkel, Binnig, Kharlamov and Haase, Pay as you go matching of relational schemata to OWL ontologies with IncMap, in: Proceedings of the ISWC 2013 Posters & Demonstrations Track, Sydney, Australia, October 23, 2013, Blomqvist and Groza, eds, CEUR Workshop Proceedings, Vol. 1035, CEUR-WS.org, 2013, pp. 225–228, http://ceur-ws.org/Vol-1035/iswc2013_poster_12.pdf.

42. Pinkel and Jiménez-Ruiz, Technical User Manual, http://www.cs.ox.ac.uk/isg/tools/RODI/manual.pdf.

43. Poess, Rabl, H.-A. Jacobsen and Caufield, TPC-DI: the first industry benchmark for data integration, Proceedings of the VLDB Endowment 7(13) (2014), 1367–1378. doi:10.14778/2733004.2733009.

44. Priyatna, Ó. Corcho and Sequeda, Formalisation and experiences of R2RML-based SPARQL to SQL query translation using morph, in: 23rd International World Wide Web Conference, WWW '14, Seoul, Republic of Korea, April 7–11, 2014, Chung, A.Z. Broder, Shim and Suel, eds, ACM, 2014, pp. 479–490. doi:10.1145/2566486.2567981.

45. Rodriguez-Muro and Rezk, Efficient SPARQL-to-SQL with R2RML mappings, Journal of Web Semantics 33 (2015), 141–169. doi:10.1016/j.websem.2015.03.001.

46. Sequeda, S.H. Tirmizi, Ó. Corcho and D.P. Miranker, Survey of directly mapping SQL databases to the semantic web, The Knowledge Engineering Review 26(4) (2011), 445–486. doi:10.1017/S0269888911000208.

47. J.F. Sequeda and D.P. Miranker, Ultrawrap Mapper: A semi-automatic relational database to RDF (RDB2RDF) mapping tool, in: Proceedings of the ISWC 2015 Posters & Demonstrations Track Co-Located with the 14th International Semantic Web Conference (ISWC-2015), Bethlehem, PA, USA, October 11, 2015, Villata, J.Z. Pan and Dragoni, eds, CEUR Workshop Proceedings, Vol. 1486, CEUR-WS.org, 2015, http://ceur-ws.org/Vol-1486/paper_105.pdf.

48. Shvaiko and Euzenat, Ontology matching: State of the art and future challenges, IEEE Transactions on Knowledge and Data Engineering 25(1) (2013), 158–176. doi:10.1109/TKDE.2011.253.

49. M.G. Skjæveland, E.H. Lian and Horrocks, Publishing the Norwegian Petroleum Directorate's FactPages as Semantic Web Data, in: Proceedings, Part II, The Semantic Web – ISWC 2013 – 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21–25, 2013, Alani, Kagal, Fokoue, P.T. Groth, Biemann, J.X. Parreira, Aroyo, N.F. Noy, Welty and Janowicz, eds, Lecture Notes in Computer Science, Vol. 8219, Springer, 2013, pp. 162–177. doi:10.1007/978-3-642-41338-4_11.

50. Solimando, Jiménez-Ruiz and Guerrini, Detecting and correcting conservativity principle violations in ontology-to-ontology mappings, in: Proceedings, Part II, The Semantic Web – ISWC 2014: 13th International Semantic Web Conference, Riva del Garda, Italy, Mika, Tudorache, Bernstein, Welty, Knoblock, Vrandečić, Groth, Noy, Janowicz and Goble, eds, Lecture Notes in Computer Science, Vol. 8797, Springer International Publishing, 2014, pp. 1–16. doi:10.1007/978-3-319-11915-1_1.

51. D.-E. Spanos, Stavrou and Mitrou, Bringing relational databases into the semantic web: A survey, Semantic Web 3(2) (2012), 169–209. doi:10.3233/SW-2011-0055.

52. Šváb, Svátek, Berka, Rak and Tomášek, OntoFarm: Towards an experimental collection of parallel ontologies, in: The Fourth International Semantic Web Conference (ISWC2005): Posters and Demos, Galway, Ireland, 2005, http://data.semanticweb.org/conference/iswc/2005/poster-demo-proceedings/paper-35.

53. Taheriyan, C.A. Knoblock, P.A. Szekely and J.L. Ambite, Learning the semantics of structured data sources, Journal of Web Semantics 37–38 (2016), 152–169. doi:10.1016/j.websem.2015.12.003.

54. Tarasowa, Lange and Auer, Measuring the quality of Relational-to-RDF mappings, in: Proceedings, Knowledge Engineering and Semantic Web – 6th International Conference, KESW 2015, Moscow, Russia, September 30–October 2, 2015, Klinov and Mouromtsev, eds, Communications in Computer and Information Science, Vol. 518, Springer, 2015, pp. 210–224. doi:10.1007/978-3-319-24543-0_16.

55. The Ontology Alignment Evaluation Initiative (OAEI), http://oaei.ontologymatching.org.

56. Tian, Sequeda and D.P. Miranker, QODI: query as context in automatic data integration, in: Proceedings, Part I, The Semantic Web – ISWC 2013 – 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21–25, 2013, Alani, Kagal, Fokoue, P.T. Groth, Biemann, J.X. Parreira, Aroyo, N.F. Noy, Welty and Janowicz, eds, Lecture Notes in Computer Science, Vol. 8218, Springer, 2013, pp. 624–639. doi:10.1007/978-3-642-41335-3_39.

57. Westphal, Stadler and Lehmann, Quality Assurance of RDB2RDF Mappings, Technical report, University of Leipzig, 2014.

58. Zaveri, Rula, Maurino, Pietrobon, Lehmann and Auer, Quality assessment for linked data: A survey, Semantic Web 7(1) (2015), 63–93. doi:10.3233/SW-150175.


Footnotes


¹ Utility has also been referred to as fitness for use in similar contexts in parts of the literature, cf. [58].

² Ready-to-use RODI distribution available at: http://www.cs.ox.ac.uk/isg/tools/RODI/. Source code available on GitHub: https://github.com/chrpin/rodi.

Table 2
Basic scenario variants (non-default scenarios are put in parentheses)

Variant	CMT	Conference	SIGKDD
Canonical	(✓)	(✓)	(✓)
Adjusted Naming	✓	✓	✓
Restructured Hierarchies	✓	✓	✓
Combined Case	(✓)	(✓)	✓
Missing FKs	-	✓	-
Denormalized	✓	-	-

³ http://www.dbis.informatik.uni-goettingen.de/Mondial/.

Table 4
Example results from a query pair asking for author names, e.g., SQL: SELECT name FROM persons WHERE person_type=2; SPARQL: SELECT ?name WHERE {?p a :Author; foaf:name ?name}

⁴ http://rdf4j.org.

⁶ The MCS isomorphism problem is a well-studied optimization problem and is known to be NP-hard.

⁷ http://www.w3.org/TR/rdb-direct-mapping/.
