Given two datasets, i.e., two sets of tuples of constants, representing positive and negative examples, logical separability is the reasoning task of finding a formula in a certain target query language that separates them. As already pointed out in previous work, this task is relevant in several application scenarios, such as concept learning and generating referring expressions. Besides, if we think of the input datasets of positive and negative examples as composed of tuples of constants classified, respectively, positively and negatively by a black-box model, then the separating formula can be used to provide global post-hoc explanations of such a model. In this paper, we study the separability task in the context of Ontology-based Data Management (OBDM), in which a domain ontology provides a high-level, logic-based specification of a domain of interest, semantically linked through suitable mapping assertions to the data source layer of an information system. Since a formula that properly separates two input datasets (a proper separation) does not always exist, our first contribution is to propose (best) approximations of proper separation, called (minimally) complete and (maximally) sound separations. We do this by presenting a general framework for separability in OBDM. Then, in a scenario that uses by far the most popular languages for the OBDM paradigm, our second contribution is a comprehensive study of three natural computational problems associated with the framework, namely Verification (check whether a given formula is a proper, complete, or sound separation of two given datasets), Existence (check whether a proper or best approximated separation of two given datasets exists at all), and Computation (compute a proper or best approximated separation of two given datasets).
The separability problem deals with finding an intensional representation of two datasets, i.e., sets of data items, interpreted as positive and negative examples. One is given two sets of data items, one with positive and the other with negative examples, and is asked to provide a query whose evaluation over the database contains all the data items in the set of positive examples and none of the data items in the set of negative examples. We say that a solution to this problem is a query that separates the given datasets. A special case of this problem arises when only one set of positive examples is given as input, and one is interested in finding a query whose evaluation over the database coincides exactly with the data items in such a set. In this paper, we refer to the latter special case with the term characterizability, and we say that a solution to this problem is a query that characterizes the given dataset.
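To make the two notions concrete, here is a minimal Python sketch (the representation of answers as sets of tuples of constants and the function names are our own; a precomputed answer set stands in for actual query evaluation over the database):

```python
def separates(answers, positives, negatives):
    """A query separates the two datasets iff its evaluation contains
    every positive example and no negative example."""
    answers = set(answers)
    return positives <= answers and answers.isdisjoint(negatives)

def characterizes(answers, dataset):
    """Characterizability is the special case in which the evaluation
    must coincide exactly with the single input dataset."""
    return set(answers) == set(dataset)

positives = {("john",), ("mary",)}
negatives = {("matt",)}
answers = {("john",), ("mary",), ("lucy",)}
assert separates(answers, positives, negatives)   # lucy is neither positive nor negative
assert not characterizes(answers, positives)      # the evaluation also returns lucy
```

Note that a separating query may return tuples outside both input datasets (such as `("lucy",)` above), whereas a characterizing query may not.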
The separability problem was initially studied for relational databases and is known in that community as the query-by-example problem.
On the other hand, the characterizability problem is known in the literature as query definability.
Over the years, researchers have found several interesting applications of the separability problem, spanning from simplifying query formulation by non-experts to debugging facilities for data engineers. Indeed, the problem has been studied as a useful tool for data exploration, concept learning, data analysis, usability, data security, and more [51,54]. Moreover, as already observed in [37], the problem has been studied in two special cases in which the input datasets consist of a single tuple. In separability, this special case has been studied for entity comparison in RDF graphs, where the goal is to find a meaningful description that separates one entity from another. Similarly, in characterizability, this special case has been studied for generating referring expressions (GRE), where one is interested in describing a single data item by a logical expression that separates it from all other data items. With the rise of Machine Learning (ML), we argue that this topic acquires primary importance for providing meaningful explanations of any typical supervised black-box model used for classification tasks. When applied to classification, the ultimate goal of supervised learning is to construct models that are able to predict the target output (i.e., the class) of the proposed inputs. To achieve this, the learning algorithm is provided with training examples that demonstrate the intended relation between input and output values. Then, the learned model is supposed to correctly classify instances that were not shown during training. A crucial problem for the wise and safe adoption of ML-based black-box models is that, especially in high-risk domains such as healthcare and finance, it is often very hard to understand the rationale behind a classification made by these models. This may lead to unintended discriminatory biases in the classification and, more surprisingly, biases of which the designers were unaware.
In this paper, we assume that the classification task is performed in an organization that adopts an Ontology-based Data Management (OBDM) approach [46,48]. (We prefer the acronym OBDM over the similar OBDA, which stands for Ontology-based Data Access [58], because data access is just one aspect, although one of the most important, of the more general notion of data management.) OBDM is a paradigm for accessing data through a conceptual representation of the domain of interest expressed as an ontology. The OBDM paradigm relies on a three-level architecture, consisting of the data layer, the ontology, and the mapping between the two. Consequently, an OBDM specification is a triple which, together with an -database D, forms a so-called OBDM system . We tackle the separability problem by leveraging the notion of evaluation of a query with respect to an OBDM system, in turn based on the notion of certain answers to a query over an OBDM system. Intuitively, given an OBDM system and two D-datasets and , our goal is to derive a query expression over that separates and in Σ (called here a proper separation).
Let be the OBDM specification in which asserts that both math students and foreign students are students, and the mapping consists of the following assertions:
Consider now the OBDM system , where J is the OBDM specification illustrated above and D is the -database . Let the D-datasets of positive and negative examples be and , respectively. One can see that the query over separates and in Σ because its set of certain answers over Σ contains all the positive examples in and none of the negative examples in .
An important contribution of our work is to provide approximated results for all the cases in which it is not possible to provide a separating query. We argue that, in these cases, reasonable and useful ontological characterizations can still be provided. We propose to resort to suitable approximations of the proper separating query, by introducing the notions of sound and complete separating queries. The former is a query whose certain answers have empty intersection with the D-dataset , whereas the certain answers of the latter form a superset of the D-dataset . Obviously, we are interested in computing the best approximated separating queries, which we call maximally sound and minimally complete separations, respectively. A maximally sound (resp., minimally complete) separation is a sound (resp., complete) separation such that no other sound (resp., complete) separation exists that better approximates the input datasets. Moreover, for all our results we cover the special cases in which the input datasets consist of a single tuple, and we refer to them as the single-tuple variants of the problems we deal with.
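The three notions can be sketched as simple set conditions on the certain answers (a hedged Python illustration with our own names; `cert` stands for the set of certain answers of the candidate query, and "better approximates" is assessed by set inclusion of certain answers):

```python
def is_complete(cert, positives):
    """Complete separation: the certain answers cover every positive
    example (no missed positives, but possibly some captured negatives)."""
    return positives <= cert

def is_sound(cert, negatives):
    """Sound separation: the certain answers avoid every negative
    example (no captured negatives, but possibly some missed positives)."""
    return cert.isdisjoint(negatives)

def is_proper(cert, positives, negatives):
    """Proper separation: complete and sound at the same time."""
    return is_complete(cert, positives) and is_sound(cert, negatives)

positives = {("mary",), ("jane",)}
negatives = {("matt",)}
assert is_complete({("mary",), ("jane",), ("matt",)}, positives)  # complete, not sound
assert is_sound({("mary",)}, negatives)                           # sound, not complete
assert is_proper({("mary",), ("jane",)}, positives, negatives)
```

Intuitively, a maximally sound separation is a sound one whose certain answers cannot be enlarged without hitting a negative example, and dually for minimally complete separations.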
In this context, the training set used in the classification task, which is a collection of data items labeled as positive and negative examples, is seen as two sets of tuples over the database schema. The query derived by solving the separability problem provides an intensional definition of such a training set, and can be considered an explanation of the intensional properties of the training set. The same principle can also be applied to a set of tuples that has not been seen during the training of the model. In this scenario, one can consider the black-box model as an oracle that assigns a class to all given tuples. Then, the query derived by solving the separability problem in this new context can be considered an explanation of the intrinsic behaviour of the model. Traditionally, there are two different types of explanations: global and local. We refer to the former for explanations of the general behaviour of the model, and to the latter for explanations of the output of the model with respect to a specific object. The present work lays the foundations for providing both kinds of explanations. Indeed, it deals with global explanations when the separating query is sought with respect to positive and negative examples containing an arbitrary number of tuples, and with local explanations when the sets of positive and negative examples for which one searches for a separating query contain a single tuple.
Our procedure fits into the definition of post-hoc explanations of black-box models, i.e., a set of techniques aimed at approximating the behaviour of a black-box model with a surrogate interpretable model. We now describe how, in our context, the role of the surrogate model is played by the query resulting from the solution of the separability problem. Suppose an organization that adopts the OBDM paradigm wants to train a model for predicting which candidates in a selection process are the most likely to perform well in a certain job. For training the model, the organization is given the curricula of current employees together with feedback on their performance, which makes it possible to divide the training set into two classes: the good and the bad performers. For the sake of simplicity, suppose John, Mary, and Jane, who studied Biology, Medicine, and Math respectively, perform well. Suppose also that Matt, Angeline, and Jess, who studied Music, Linguistics, and Fashion respectively, perform badly. The ML-based black-box model is trained on this dataset and optimised to reach the highest possible accuracy. Now suppose some candidates apply for the job and are evaluated by the model. The model states that Lucy, Mara, and George, who studied Math, Chemistry, and Physics respectively, have a high probability of performing well in the job. At the same time, it states that Lucas and Paul, who studied Art and Classics respectively, are believed to perform badly. The organization now wonders why the black-box model divided the candidates in this way. By instantiating a separability problem with the two sets of positive and negative candidates, the organization finds out that, no matter how sophisticated the internal details of the black-box model are, the resulting classification is such that all positive candidates are answers to the query “return all the candidates with a scientific background”, and none of the negative candidates are.
This separating query provides an intuitive explanation of the actual behaviour of the model. Of course, a given instance of the separability problem may in general admit many valid separating queries, just as in many cases it may admit no valid separating query at all.
Contributions of the paper The contributions of this paper can be summarized as follows:
We present a formal framework for separability in OBDM. In particular, we first cast the classical notion of separating query in the OBDM context (called here proper separation), and then we propose the relaxations mentioned above (complete separation and sound separation) as well as their optimal versions (minimally complete separation and maximally sound separation). We do exactly the same for the special case of characterization, which deals only with a dataset of positive examples rather than both positive and negative examples.
We study the Verification problem for both separability and characterizability in OBDM, i.e., checking whether a given query is a proper, complete, or sound separation (resp., characterization) of two given datasets (resp., a given dataset). More specifically, we introduce three families of decision problems for both separability and characterizability, called X-(, , ) (resp., X-(, , )) for X ∈ {Proper, Complete, Sound}, which are parametric with respect to an ontology language , a mapping language , and a query language . We provide tight computational complexity bounds for the most common languages used in OBDM, i.e., is , is GLAV, and is UCQ. The results are summarized in Table 1.
We study the Computation problem. We provide two algorithms that, taking as input an OBDM system and two D-datasets and , where is a ontology and is a GLAV mapping, return, respectively, a UCQ-minimally complete Σ-separation of and and a UCQ-maximally sound Σ-separation of and . As a consequence, this proves that in the scenario under consideration the two best approximated versions of proper separations always exist.
We study the Existence problem for both separability and characterizability in OBDM. Since for the scenario under consideration their two best approximated versions always exist, we only focus on the existence of a proper separation (resp., characterization), i.e., checking whether a proper separation (resp., characterization) in a target query language of two given datasets (resp., a given dataset) exists at all. More specifically, we introduce a family of decision problems for both separability and characterizability, called (, , ) (resp., (, , )), which are parametric with respect to an ontology language , a mapping language , and a query language . Also in this case, we provide tight computational complexity bounds for the most common languages used in OBDM. The results are summarized in Table 2.
For X ∈ {Proper, Complete, Sound}, the table reports the exact computational complexity of X-(, , ) and X-(, , ) when is , is GLAV, and is UCQ. We point out that all the lower bounds already hold for the single-tuple version of the considered problem (i.e., when all the input datasets consist of a single tuple) and when is ∅ (i.e., the ontology language allowing only for ontologies without assertions), is GAV∩LAV (i.e., the mapping language allowing only for assertions that are both GAV and LAV), and is CQ.
X-(, GLAV, UCQ) / X-(, GLAV, UCQ)
X = Complete: -complete
X = Sound: -complete
X = Proper: -complete
For , the table reports the exact computational complexity of and when is and is UCQ. Again, all the lower bounds already hold for the single-tuple version of the considered problem and when is ∅. Moreover, such lower bounds hold for both and
-complete
-complete
-complete
This paper is an extended version of our CIKM’21 conference paper [22]. We point out that the conference paper focused only on the characterizability reasoning task, whereas here we illustrate a formal framework and study the related computational problems for both separability and characterizability. Furthermore, in this extended version we provide the full proofs that were only sketched in [22]. Finally, we observe that the decision problem was incorrectly claimed to be -complete in the conference version of this paper. In this extended version, we fix this error and show that this decision problem is actually -complete.
To the best of our knowledge, the present work is the first to introduce separability and characterizability in the OBDM context. In particular, while separability and characterizability have been studied in various ontology-enriched query answering settings, this is the first work that takes into account the full layered stack of the OBDM paradigm, thus including also the mapping layer, and that proposes and studies natural relaxations of the separability and characterizability notions. Nevertheless, we remark that in both the Computation and the Existence problems for separability and characterizability we focus mainly on UCQ as a query language, while other works on the subject also investigate the typically more challenging scenario of CQ as a query language (see, e.g., [7,30,34,64]).
Finally, we mention that, since is insensitive to the adoption of the unique name assumption (UNA) for UCQ answering [5], all our results hold both with and without the UNA.
Outline of the paper The paper is organized as follows. After the discussion of related work in Section 2, Section 3 introduces the relevant background for our study, and Section 4 illustrates a formal framework for separation in OBDM. Then, Sections 5, 6, and 7 present the results on the three computational problems Verification, Computation, and Existence, respectively. We deal with the Computation problem before the Existence problem because the solutions we provide for the former will help us solve some tasks related to the latter. Finally, Section 8 concludes the paper with a perspective on future work.
Related work
Query definability and query-by-example have long been studied for classical databases, starting from [70] up to many recent studies [3,7,41,52,64,65,68]. Our analysis of the complexity of our decision problems builds on the results in [3,7]. We found the work in [65] inspirational, since it deals with the problem of deriving a metric for establishing how close a non-proper separating query is to the optimal solution. We also found the survey in [52] important for the whole research problem, as it presents a well-described motivating example and outlines the main open problems in this area.
Our work has also been inspired by the notion of Abstraction [19,20,23]. Although the goal there is still to derive a query expression over the ontology, in that work the input is a query over the data layer of the OBDM architecture, whereas in query definability (resp., query-by-example) the input is a set of tuples (resp., two sets of tuples). It follows that the two tasks are completely different and require different technical solutions. We postpone a detailed comparison between Abstraction and the separability notion investigated in this paper to Section 4.1, when we will have all the technical tools in hand to provide an in-depth comparison.
The works closest to ours are those in [4,34,55]. In [4], the authors study the existence and verification problems for both query definability and query-by-example, as well as the computation problem for the query definability case. In their work, the ontology is expressed as an RDF graph, and they consider several fragments of SPARQL as the query language to be used for separating the examples. Differently from the present work, they do not aim at finding the best approximated separation. We share with [55] the expressive power of the language used for the ontology (DL-Lite) and of the language used for the query separating the input examples (UCQs). However, the work in [55] does not deal with the cases in which proper separations of the input examples do not exist, and it is based on a slightly different framework from ours: since they study the problems in the context of ontology-mediated queries, they do not have the mapping layer, which instead is part of our more general OBDM paradigm.
The work in [34] studies the query-by-example problem for expressive Horn description logic ontologies, namely Horn-. Apart from the clear difference in the expressive power of the ontologies, their work does not consider the case of approximated separations of the input examples, and it does not consider the mapping layer.
This work has also been inspired by [11,31,38,44]. The goal of all these papers is to learn a concept expression that best captures a given set of examples (or two sets of positive and negative examples for query-by-example). We differ from these works because our goal is to derive a full-blown query that separates the input examples.
The problem of checking whether there exists a formula separating positive and negative examples in the presence of an ontology has recently been studied in [37,40]. Besides verifying that the ontology entails the sought formula for all positive examples, the authors conducted an in-depth analysis of the so-called separability problem, accounting for both weak separability, i.e., the notion we study in this paper, and strong separability, i.e., checking whether the negation of the separating formula is entailed by the KB. They also consider the case of enriching the separating formula with helper symbols that are not originally present in the ontology, and study the complexity of the decision problems for a wide range of languages for expressing both the ontology and the separating formula. In this paper, we are interested in studying the weak separability problem in the context of OBDM, also considering the cases in which the separating formula does not exist and one wants to search for sound and complete approximations of it.
Another important line of research related to the present work is that regarding post-hoc explanations of opaque machine learning models. As also highlighted by other works [25,26,49,60], the query-by-example problem can easily be adapted for explaining the output of a black-box machine learning classification model. For example, consider the case of a binary classifier labelling a set of examples with two classes, 1 and 0. In this scenario, the solution of the query-by-example problem is considered a surrogate of the machine learning model, so that the examples labelled with class 1 are answers to the reverse-engineered query, while none of the examples labelled with class 0 are. The query thus acts as an explanation because it provides a more human-understandable way of classifying the given examples into the two classes, especially in our framework, where the query is based on the knowledge of the ontology. Although relevant to our work, all the above-cited papers differ from it. In [60], the authors map the inputs of the machine learning classifier to the ontology and then use a concept learning tool to find a class expression over the ontology that best describes the positive examples. In [25], for explaining the behaviour of a black-box classifier, the authors build another black-box classifier (a neural network) and then project the output of this latter model onto a so-called rule space, where each coordinate represents the activation of a rule described in First Order Logic. In [26], the authors present the TREPAN algorithm, i.e., a way of building a decision tree whose nodes are linked to an ontology, which is used as a means for explaining the input positive and negative examples. We consider the work in [49] relevant to ours, even though it is rather preliminary and does not specify many important details, such as the expressive power of the language of the ontology and the language of the query they search for.
The biggest differences with the present work are that they do not consider the mapping layer between the ontology and the data, and that they do not focus on the notions of maximally sound and minimally complete solutions for the cases in which a perfect solution does not exist. On the contrary, they define the best approximated query as the one minimising the Jaccard distance between the answers of the query and the set of positive examples in the input.
Inductive Logic Programming (ILP) [43] has long been considered related to the query definability and query-by-example tasks. We also considered it inspiring for our work, but we soon acknowledged that the expressive powers of the languages used for representing the knowledge base are incomparable, and that ILP is interested in searching for explanations of a set of logical facts rather than a set of tuples.
The Active Learning task initially introduced in [2] has been studied as a possible framework for learning queries from examples in relational databases [64] and in the presence of ontologies [30]. Moreover, instances of the framework in [2] can be found in [42] and [57]. In particular, the goal of [42] is to learn an ontology that is equivalent to a target ontology, while the goal of [57] is to learn an ontology that is query inseparable with respect to a target ontology and a query language , i.e., given a set of ground atoms (ABox) , the learned ontology must be such that if and only if , for every .
Finally, query inseparability has been studied even outside the learning task. For example, [9] studies query inseparability for (fragments of) Horn- as ontology language and CQ as a query language, whereas [10] studies query inseparability for both the ontology languages and and for (fragments of) UCQ as a query language.
Preliminaries
We recall some notions about relational databases [1], Description Logics (DLs) [6], and Ontology-based Data Management (OBDM) [47]. We define , , , and to be the pairwise disjoint, countably infinite sets of symbols for database predicates, ontology predicates, constants, and variables, respectively. We further assume that is partitioned into the disjoint sets and for atomic concepts and atomic roles, respectively.
Databases, datasets, and queries A relational database schema (or simply schema) is a finite subset of . Given a schema , an -database D is a finite set of facts (a.k.a. ground atoms) of the form , where s is an n-ary predicate in , and is an n-tuple of constants from . We denote by the finite subset of of those constants occurring in D. Given a schema and an -database D, a D-dataset λ of arity n is simply a finite set of n-tuples of constants occurring in D, i.e., .
A query over a schema is an expression in a certain query language using the predicate symbols of , whose predicate arguments are variables from , i.e., we disallow constants from occurring in queries. Each query has an associated arity. The evaluation of a query of arity n over an -database D is a set of answers, each being an n-tuple of constants occurring in , i.e., . A query of arity 0 over a schema is called a boolean query, and we denote by (resp., ) the fact that (resp., ).
Following the terminology of [7,63], we say that a query over a schema explains two D-datasets and (resp., defines a D-dataset) inside an -database D if and (resp., ). We also say that and are -explainable (resp., is -definable) inside D, for a query language , if there exists a query that explains and (resp., defines ) inside D.
We are particularly interested in conjunctive queries and unions thereof. A conjunctive query (CQ) over a schema is an expression of the form such that (i) , called the target list of , is an n-tuple of distinguished variables from , where n is the arity of ; (ii) is an m-tuple of existential variables from ; and (iii) , called the body of , is a finite conjunction of atoms of the form , where s is a p-ary predicate in and is either a distinguished or an existential variable, i.e., for each . A union of conjunctive queries (UCQ) over a schema is a finite set of CQs over with the same arity, called its disjuncts.
For a conjunction of atoms , we denote by the set of all the atoms occurring in ϕ. For a set of atoms and a tuple of constants, we denote by the CQ , where (i) is the conjunction of all the atoms occurring in the set of atoms , where is obtained from by replacing everywhere each constant occurring in with a fresh variable and each constant c not occurring in with a fresh variable , (ii) , and (iii) is the tuple of all variables occurring in that do not occur in .
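This construction can be illustrated with a small Python sketch (our own representation: atoms are pairs of a predicate name and a tuple of constants, and the fresh variable for a constant c is written "?c", reused everywhere c occurs, as in the definition above):

```python
def atoms_to_cq(atoms, target):
    """Turn a set of ground atoms into a CQ: every constant is replaced
    everywhere by one fresh variable; constants in `target` become the
    distinguished variables, all remaining ones are existential."""
    var = lambda c: f"?{c}"
    body = [(pred, tuple(var(c) for c in args)) for pred, args in sorted(atoms)]
    distinguished = tuple(var(c) for c in target)
    return distinguished, body

head, body = atoms_to_cq({("Student", ("john",)), ("knows", ("john", "mary"))},
                         ("john",))
# head is ("?john",); "?mary" occurs only in the body, hence is existential
```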
Given a set of atoms , we denote by the set of all constants and variables occurring in . Observe that . Let and be two sets of atoms. We say that a function is a homomorphism from to if , where is the image of under h, i.e., with for each atom . For two sets of atoms and and two tuples of terms and , we write if there is a function h from to such that (i) h is a homomorphism from to , and (ii) (where, for a tuple of terms , ); we write otherwise.
We define the semantics of (U)CQs in terms of homomorphisms. For an -database D and a CQ over of arity n, we let be the set of n-tuples of constants occurring in D for which . Furthermore, for an -database D and a UCQ over , we let .
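The homomorphism semantics can be made concrete with a short Python sketch (again with our own encoding: facts and body atoms are (predicate, tuple) pairs, and query variables are strings such as "?x"; since queries contain no constants, every atom argument is treated as a variable):

```python
def homomorphisms(body, facts):
    """Enumerate variable-to-constant assignments under which every body
    atom of the CQ maps to some fact of the database."""
    def extend(atoms, h):
        if not atoms:
            yield dict(h)
            return
        (pred, args), rest = atoms[0], atoms[1:]
        for fact_pred, fact_args in facts:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            h2, ok = dict(h), True
            for v, c in zip(args, fact_args):
                if h2.get(v, c) != c:   # an already-bound variable must agree
                    ok = False
                    break
                h2[v] = c
            if ok:
                yield from extend(rest, h2)
    yield from extend(list(body), {})

def eval_cq(head, body, facts):
    """The evaluation of a CQ: project each homomorphism onto the target list."""
    return {tuple(h[v] for v in head) for h in homomorphisms(body, facts)}

def eval_ucq(cqs, facts):
    """The evaluation of a UCQ is the union of the evaluations of its disjuncts."""
    return set().union(*(eval_cq(head, body, facts) for head, body in cqs))
```

For instance, with `facts = {("Student", ("john",)), ("MathStudent", ("mary",))}`, the CQ with head `("?x",)` and body `[("MathStudent", ("?x",))]` evaluates to `{("mary",)}`.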
Syntax and semantics of In this paper, a DL ontology (or simply ontology) is a TBox (“Terminological Box”) expressed in a specific DL, that is, a finite set of assertions stating general properties of atomic concepts and roles built according to the syntax of the specific DL, which represents the intensional knowledge of a modeled domain. The alphabet of an ontology is the finite subset of of atomic concepts and atomic roles mentioned in the assertions of , and we assume that every ontology comprises the special bottom concept ⊥ in its alphabet. In this paper, whenever we speak of a query over an ontology , we implicitly mean a query in a certain query language that uses the atomic concepts and the atomic roles in the alphabet of as predicate symbols.
The semantics of DL ontologies is specified through the notion of first-order interpretations: an interpretation is a pair , where the interpretation domain is a non-empty, possibly infinite set of objects, and the interpretation function assigns to each constant an object , to each atomic concept a subset of (with the requirement ), and to each atomic role a subset of . We say that an interpretation satisfies an ontology , denoted by , if satisfies every assertion in . When convenient, with a slight abuse of notation, we treat an interpretation as the (potentially infinite) set of facts of the form and that are true in, i.e., that are such that and .
We are particularly interested in DL ontologies expressed in , the member of the family [15] that underpins , i.e., the profile especially designed for efficient query answering [53]. A ontology is a finite set of assertions of the form:
where , are basic concepts, i.e., expressions of the form A, , or , with and , and and basic roles, i.e., expressions of the form P, or with . We assume that the special bottom concept ⊥ never occurs in the right-hand side of inclusion assertions. This is without loss of generality, since each inclusion assertion of the form is logically equivalent to the disjointness assertion .
For the constructs of , the interpretation function of an interpretation extends to basic concepts and basic roles as follows: and . Finally, an interpretation satisfies a concept inclusion assertion (respectively, role inclusion assertion ) if (respectively, ), and it satisfies a concept disjointness assertion (respectively, role disjointness assertion ) if (respectively, ).
Syntax and semantics of ontology-based data management According to [47,58], an Ontology-based Data Management (OBDM) specification is a triple , where is a DL ontology, is a relational database schema, also called source schema, and is a mapping, i.e., a finite set of assertions relating the source schema to the ontology . An OBDM system is a pair , where is an OBDM specification and D is an -database. For readability purposes, with a slight abuse of notation, we sometimes denote an OBDM system as a quadruple , where is an OBDM specification and D is an -database.
In this paper, each assertion in the mapping component of an OBDM system is a Global-And-Local-As-View (GLAV) assertion [27], that is, an assertion of the form , where and are CQs over and over , respectively, with the same target list . Special cases of GLAV assertions highly considered in the data integration literature are Global-As-View (GAV) and Local-As-View (LAV) assertions [45]: in a GAV (resp., LAV) assertion, (resp., ) is simply an atom without existential variables. Finally, a GLAV (resp., GAV, LAV, GAV∩LAV) mapping consists of a finite set of GLAV (resp., GAV, LAV, both GAV and LAV) assertions.
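To illustrate how mapping assertions relate source facts to ontology facts, here is a hedged Python sketch for the simplest GAV∩LAV case, where both sides of each assertion are a single atom (the predicate names `s_math` and `MathStudent` are hypothetical, in the spirit of the students example; real GLAV assertions allow CQs and existential variables on either side):

```python
def apply_gav_assertions(mapping, database):
    """Apply GAV∩LAV assertions (one source atom mapped to one ontology
    atom): each source fact matching an assertion body yields an ontology fact."""
    abox = set()
    for (src_pred, src_vars), (ont_pred, ont_vars) in mapping:
        for fact_pred, fact_args in database:
            if fact_pred != src_pred or len(fact_args) != len(src_vars):
                continue
            binding, ok = {}, True
            for v, c in zip(src_vars, fact_args):
                if binding.get(v, c) != c:  # a repeated variable must bind consistently
                    ok = False
                    break
                binding[v] = c
            if ok:
                abox.add((ont_pred, tuple(binding[v] for v in ont_vars)))
    return abox

mapping = [(("s_math", ("x",)), ("MathStudent", ("x",)))]
database = {("s_math", ("mary",)), ("s_cs", ("john",))}
# apply_gav_assertions(mapping, database) == {("MathStudent", ("mary",))}
```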
Given a GLAV mapping relating a source schema to an ontology , an interpretation , and an -database D, we denote by the fact that implies for each possible mapping assertion in and for each possible tuple of constants occurring in D. Here, denotes the evaluation of over the interpretation seen as a (potentially infinite) set of facts.
As customary in OBDM, we define the semantics of OBDM systems by specifying the first-order models of the OBDM specification relative to the -database D. Specifically, we say that an interpretation is a model of an OBDM system if (i) , and (ii) . Finally, we say that an OBDM system Σ is consistent if it has at least one model, and inconsistent otherwise.
For an OBDM system , with , and a query over , we denote by the set of certain answers of w.r.t. Σ, i.e., the set of tuples of constants occurring in D such that for each model of Σ. If Σ is inconsistent, then the set of certain answers of any query over w.r.t. Σ is simply the set of all possible tuples of constants occurring in D whose arity is that of the query. Finally, we say that two queries and are equivalent w.r.t. an OBDM system if .
Query rewriting Given a UCQ over a ontology , we denote by the UCQ computed by executing the algorithm [15] on and (slightly extended to deal also with the bottom concept ⊥). In a nutshell, the algorithm uses the concept/role inclusions of the input ontology as rewriting rules applied to the input UCQ . In this way, it compiles into the query all the knowledge provided by that is relevant to answering the query. repeatedly applies the following two steps until a fixpoint is reached: step (i) uses concept/role inclusions as rewriting rules to rewrite query atoms one by one, each time producing a new CQ to be added to the final output; step (ii) unifies query atoms to enable further executions of step (i). For additional details, we refer the reader to [15].
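To convey the flavour of step (i) only, the following sketch is a drastic simplification of the algorithm of [15] (ours): it handles only atomic concept inclusions A ⊑ B, used right-to-left as rewriting rules, ignores step (ii) (unification), and represents a CQ as a frozenset of unary atoms.

```python
# Drastically simplified rewriting sketch (ours): only atomic concept
# inclusions A ⊑ B, applied right-to-left; no unification step.
# A CQ is a frozenset of (concept, variable) atoms; a UCQ is a set of CQs.

def rewrite(ucq, inclusions):
    """Saturate the input UCQ under the rule: if A ⊑ B, an atom B(x) in a
    disjunct may be rewritten to A(x), yielding a new disjunct."""
    result = set(ucq)
    frontier = set(ucq)
    while frontier:
        new = set()
        for cq in frontier:
            for (pred, var) in cq:
                for (a, b) in inclusions:
                    if b == pred:  # B(x) can be rewritten to A(x)
                        rewritten = (cq - {(pred, var)}) | {(a, var)}
                        if rewritten not in result:
                            new.add(frozenset(rewritten))
        result |= new
        frontier = new
    return result
```

For instance, with the inclusion Student ⊑ Person (names ours), the one-disjunct UCQ {Person(x)} is rewritten into the two-disjunct UCQ {Person(x), Student(x)}; termination of this toy loop follows because only finitely many CQs over the given atoms exist.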
Let be an OBDM specification where , i.e., has no assertions, and is a GLAV mapping. From results of [17,29], it is well-known that, given a UCQ over , by splitting the GLAV mapping into a GAV mapping followed by a LAV mapping over an intermediate alphabet, it is always possible to compute a UCQ over , denoted by , such that for each -database D.
Now, let be an OBDM specification where is a ontology and is a GLAV mapping. Given a UCQ over , in what follows, we denote by the following UCQ over : . By combining the above observation regarding with the correctness of the algorithm [15] and the fact that is insensitive to the adoption of the UNA for UCQ answering [5], we derive that holds for each UCQ over and for each -database D such that is consistent. Furthermore, by combining the above observation regarding with results of [58], we derive that, given an -database D, the OBDM system is inconsistent if and only if . Here, is the -violation query, i.e., the boolean UCQ over constituted by the disjunct and a disjunct of the form (respectively, , , and ) for each disjointness assertion (respectively, or , , and ) occurring in , where an atom of the form stands for either if R denotes an atomic role P, or if R denotes the inverse of an atomic role, i.e., .
Canonical structure Given an -database D and a GLAV mapping relating a source schema to an ontology , the chase [13] of D with respect to , denoted by , is the set of atoms computed as follows: (i) is initially set to the empty set of atoms; then (ii) for every GLAV assertion in and for every tuple of constants occurring in D such that , we add to the image of the set of atoms under , that is, we set , where extends h by assigning to each variable z occurring in a different fresh variable of still not present in . Observe that is guaranteed to be finite and can always be computed in exponential time.
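The chase step just described can be sketched as follows (ours). A toy evaluator for single-atom source bodies stands in for full CQ evaluation, and fresh labelled nulls play the role of the fresh variables assigned to existential head variables.

```python
from itertools import count

def eval_single_atom(body, target, D):
    """Toy CQ evaluator: body is a single atom without repeated variables."""
    (pred, args), = body
    for fact_pred, fact_args in D:
        if fact_pred == pred and len(fact_args) == len(args):
            yield dict(zip(args, fact_args))

def chase(D, mapping, evaluate=eval_single_atom):
    """One pass of the GLAV chase: for every assertion (body, head, target)
    and every answer of body over D, add the image of the head atoms,
    inventing a fresh null for each existential variable of the head."""
    fresh = count()
    out = set()
    for body, head, target in mapping:
        for h in evaluate(body, target, D):
            ext = dict(h)
            for _, args in head:
                for a in args:
                    if a.startswith("?") and a not in ext:
                        ext[a] = f"_n{next(fresh)}"  # fresh labelled null
            out |= {(p, tuple(ext.get(a, a) for a in args)) for p, args in head}
    return out
```

On the database {emp(ann)} with the assertion emp(x) ~> Person(x) ∧ worksFor(x, y) (names ours), the sketch produces Person(ann) and worksFor(ann, _n0) with _n0 a fresh null, mirroring the construction in the text.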
We conclude this section with the following observation used for the technical development of the next sections. Let be an OBDM system where is a ontology and is a GLAV mapping. The canonical structure of with respect to and D, denoted by , is the (potentially infinite) set of atoms obtained by first computing as described before, and then chasing with respect to the inclusion assertions of as described in [15, Definition 5], but using the alphabet of variables whenever a new element is needed in the chase. Observe that the latter is a fair deterministic strategy, i.e., one ensuring that if at some point an assertion is applicable, then it will eventually be applied. By combining [28, Proposition 4.2] with [15, Theorem 29], it is well known that, for a UCQ over and a tuple of constants occurring in D, if is consistent, then the following holds: if and only if for some .
Formal framework
In what follows, we implicitly use to denote an OBDM system, where is an OBDM specification and D is an -database. Intuitively, given two sets and of tuples of constants occurring in D (i.e., and are D-datasets) of positive and negative examples, respectively, we aim at finding a query over the ontology in a certain target query language that logically separates and w.r.t. Σ. Since the evaluation of queries in OBDM systems is based on certain answers, we are naturally led to the following definition.
is a proper Σ-separation of and in the query language , if both conditions (i) and (ii) hold.
The next example illustrates the above definition.
Recall the OBDM system , the D-datasets and , and the CQ over the ontology illustrated in Example 1. As already observed, we have that , and therefore is a proper Σ-separation of and in CQ.
Consider now a slight variation of the negative examples, i.e., consider and . Since and are such that and , and since and are such that and , one can verify that no proper Σ-separation of and in UCQ exists.
Clearly, the more expressive the target query language , the easier it is to distinguish (w.r.t. the OBDM system) the properties of the tuples in and by means of the operators in , and therefore the more likely it is that a proper separation in exists. Unfortunately, the next example shows that, even in trivial cases and without any restriction on the target query language , proper separations are not guaranteed to exist.
Let be the following OBDM system: (i) is the OBDM specification in which , i.e., contains no assertions, , and with and ; and (ii) D is the -database .
For the D-datasets and , one can trivially verify that, whatever the query language is, there can be no query over whose set of certain answers includes the tuple but not the tuple . To see this, note that . It follows that, whatever the query language is, there can be no query over that is a proper Σ-separation of and in .
Notice the importance of the role played by the mapping in order to reach this conclusion. Indeed, if we replace with , then a proper Σ-separation of and in UCQ would simply be the CQ over the ontology .
Borrowing ideas from [23], to remedy situations where proper separations do not exist, we now introduce approximations of proper separations in terms of completeness and soundness. More specifically, a complete separation is a query that captures all the positive examples in its set of certain answers w.r.t. the OBDM system (i.e., satisfies condition (i) of Definition 1), whereas a sound separation is a query that contains none of the negative examples in its set of certain answers w.r.t. the OBDM system (i.e., satisfies condition (ii) of Definition 1).
is a complete (resp., sound) Σ-separation of and in the query language , if (resp., ).
We observe that the condition for being a complete (resp., sound) separation does not involve (resp., ).
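Given an oracle for certain answers, the three notions reduce to plain set comparisons. The following minimal sketch is ours; `cert` stands for a certain-answer procedure, which is where all the computational difficulty of the framework lies.

```python
# Minimal sketch (ours): cert(q) is an assumed oracle returning the set of
# certain answers of query q w.r.t. the OBDM system.

def is_complete(cert, q, pos):
    # Condition (i) of Definition 1: every positive example is a certain answer.
    return pos <= cert(q)

def is_sound(cert, q, neg):
    # Condition (ii) of Definition 1: no negative example is a certain answer.
    return not (neg & cert(q))

def is_proper(cert, q, pos, neg):
    # A proper separation satisfies both conditions at once.
    return is_complete(cert, q, pos) and is_sound(cert, q, neg)
```

Note how soundness indeed ignores the positive dataset and completeness ignores the negative one, as remarked above.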
Refer to Example 2. We have that and are complete Σ-separations of and in UCQ, whereas and are sound Σ-separations of and in UCQ.
As the above example shows, there may be several complete and sound separations. In such cases, the interest is unquestionably in those queries that best approximate the proper one, at least relative to a target query language . Informally, a -minimally complete separation is a complete separation in whose set of certain answers contains a minimal (w.r.t. set containment) set of negative examples, whereas a -maximally sound separation is a sound separation in whose set of certain answers captures a maximal (w.r.t. set containment) set of positive examples.
is a -minimally complete (resp., -maximally sound) Σ-separation of and , if is a complete (resp., sound) Σ-separation of and in and there is no satisfying both the following two conditions:
is a complete (resp., sound) Σ-separation of and in ;
(resp., ).
By definition, it is immediate to verify that a proper Σ-separation of and in a query language is both a -minimally complete Σ-separation of and and a -maximally sound Σ-separation of and .
Refer again to Example 2. One can verify that the CQ is a UCQ-minimally complete Σ-separation of and , whereas is not. Moreover, both and are CQ-maximally sound Σ-separations of and , but neither of them is a UCQ-maximally sound Σ-separation of and . Indeed, a UCQ-maximally sound Σ-separation of and is .
We point out that all the above definitions generalize the definitions illustrated in [22], which deal only with datasets of positive examples, rather than with datasets of both positive and negative examples as done here. More specifically, in [22], as well as in all the works addressing the separability task, when only a D-dataset of positive examples is provided, one implicitly associates with it the D-dataset , where n is the arity of the tuples in . With this remark in mind, we can now specialize the above definitions to the case of a single dataset of positive examples and report the definitions given in [22].
is a proper (resp., complete, sound) Σ-characterization of in the query language , if (resp., , ).
is a -minimally complete (resp., -maximally sound) Σ-characterization of , if is a complete (resp., sound) Σ-characterization of in and there is no satisfying both the following two conditions:
is a complete (resp., sound) Σ-characterization of in ;
(resp., ).
In other words, given an OBDM system with , a D-dataset of arity n, and a query , we have that is a proper in (resp., complete in , sound in , -minimally complete, -maximally sound) Σ-characterization of if and only if is a proper in (resp., complete in , sound in , -minimally complete, -maximally sound) Σ-separation of and , where .
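The reduction from characterizability to separability described above amounts to materializing the implicit negative dataset, i.e., all n-tuples of constants occurring in D that are not positive examples. A sketch (ours):

```python
from itertools import product

def implicit_negatives(constants, pos):
    """The D-dataset implicitly associated with a positive dataset: the
    complement of pos over all n-tuples of constants occurring in D,
    where n is the arity of the tuples in pos (assumed non-empty)."""
    n = len(next(iter(pos)))
    return set(product(constants, repeat=n)) - pos
```

With this in hand, any procedure for separability also decides characterizability, at the cost of a negative dataset whose size is exponential in the arity n (but polynomial for fixed n).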
Relation with the abstraction reasoning task
We now discuss the relation between the notion of separation in OBDM introduced here and the notion of Abstraction [19,20], recently introduced in [18] and studied in various scenarios [21,23,24,50] for addressing several reverse engineering tasks in OBDM. In Abstraction, we are given an OBDM specification and a query over , and the aim is to find a query over , called a perfect J-abstraction of , such that for each -database D for which is consistent. Conversely, in the separation task the -database D is also given, and instead of a query we have two sets and of tuples of constants taken from D; the aim is to find a query over such that both and hold.
Following [23], we also say that a query over is a complete (resp., sound) J-abstraction of if (resp., ) for each -database D for which is consistent. The next proposition establishes a precise relationship between the notion of separation in OBDM introduced here and the notion of abstraction.
Letbe a consistent OBDM system,andbe two D-datasets, andbe a query that explainsandinside D. If a queryis a perfect (resp., complete, sound) J-abstraction of, thenis a proper (resp., complete, sound) Σ-separation ofandin.
Suppose is a perfect (resp., complete, sound) J-abstraction of , i.e., (resp., , ) for each -database for which is consistent. Since by assumption is consistent, we have that (resp., , ). Now, since both and hold by the assumption that explains and inside D, we derive that both and (resp., only , only ) hold, which means that is a proper (resp., complete, sound) Σ-separation of and in . □
The next example shows that the converse of the above proposition does not necessarily hold, thus stressing the fact that the two problems are indeed different.
Let be as follows. is the OBDM specification in which , i.e., contains no assertions, , and consists of the following two GAV assertions:
Let also the -database be and the D-datasets and be and . Consider, moreover, the CQ . One can see that the query explains and inside D because . Consider now the CQ over . One can see that is a proper (and therefore, also a complete and a sound) Σ-separation of and in CQ because .
Notice, however, that for the -database we have that (i) is a consistent OBDM system, (ii) , and (iii) . Thus, is neither a complete nor a sound (and therefore, nor a perfect) J-abstraction of . Indeed both (witnessing that is not a complete J-abstraction of ) and (witnessing that is not a sound J-abstraction of ) hold.
Computational problems associated with the framework
Given the general framework introduced so far, there are (at least) three computational problems to consider, with respect to an ontology language , a mapping language , and a query language . Given an OBDM system and two D-datasets and , where and :
Verification: given also a query , check whether is a proper (resp., complete, sound) Σ-separation of and in .
Computation: compute any proper in (resp., any -minimally complete, any -maximally sound) Σ-separation of and , provided it exists.
Existence: check whether there exists a query that is a proper in (resp., a -minimally complete, a -maximally sound) Σ-separation of and .
Analogous computational problems can be defined when in the input we have only one D-dataset of arity n, rather than two D-datasets and , and we implicitly think of as .
In what follows, if not otherwise stated, we refer to the following scenario which considers by far the most popular languages for the OBDM paradigm: (i) is , (ii) is GLAV, and (iii) is UCQ. In this scenario, there are two interesting properties that are worth mentioning.
Letbe an OBDM system, andandbe two D-datasets of arity n such that. Ifandare UCQ-minimally complete (resp., UCQ-maximally sound) Σ-separations ofand, then they are equivalent w.r.t. Σ.
We first address the case of UCQ-maximally sound, and then the case of UCQ-minimally complete.
Assume that and are UCQ-maximally sound Σ-separations of and and suppose, for the sake of contradiction, that they are not equivalent w.r.t. Σ. This implies the existence of a tuple such that and . Observe that, since , must be such that , otherwise would not be a sound Σ-separation of and in UCQ, thus reaching a contradiction. But then, the UCQ is such that (i) since both and are sound Σ-separations of and in UCQ, Q is a sound Σ-separation of and in UCQ as well, and (ii) holds because of tuple . Obviously, this contradicts the fact that is a UCQ-maximally sound Σ-separation of and .
Assume now that and are UCQ-minimally complete Σ-separations of and and suppose, for the sake of contradiction, that they are not equivalent w.r.t. Σ. This implies the existence of a tuple such that and . Observe that, since , must be such that , otherwise would not be a complete Σ-separation of and in UCQ, thus reaching a contradiction. But then, consider the query Q such that . Obviously, since and are UCQs, Q exists and can be expressed as a UCQ. Moreover, (i) since and are complete Σ-separations of and in UCQ, Q is a complete Σ-separation of and in UCQ as well, and (ii) holds because of tuple . Obviously, this contradicts the fact that is a UCQ-minimally complete Σ-separation of and . □
This also means that, in our scenario, for an OBDM system and two D-datasets and of arity n such that , if a proper Σ-separation of and in UCQ exists, then it is unique up to equivalence w.r.t. Σ. Furthermore, for the characterizability case where is implicitly set to be , proper in UCQ, UCQ-minimally complete, and UCQ-maximally sound Σ-characterizations of are always unique up to equivalence w.r.t. Σ, provided they exist.
Secondly, in this scenario, as one may expect, proper separations are less likely to exist than explanations in the plain relational database case.
Letbe a consistent OBDM system, andandbe two D-datasets. If there exists a proper Σ-separation ofandin UCQ, thenandare UCQ-explainable inside D.
Suppose there exists a proper Σ-separation of and in UCQ, i.e., there is a UCQ over for which and . Recall from Section 3 that the UCQ over is such that for each -database for which is consistent. Since is consistent by assumption, we have that . Thus, is such that both and hold, from which immediately follows that explains and inside D by definition, and thus and are UCQ-explainable inside D. □
In general, the converse of the above proposition does not hold. Indeed, in Example 3, while there is no proper Σ-separation of and in , whatever the target query language is, the CQ witnesses that and are CQ-definable inside D.
Note that Definition 1 can be seen as a generalization of the classical notion of explanation in the plain relational database case [7], when one also has to deal with mapping assertions and ontology assertions. In the specific case of OBDM systems such that and contains assertions of the form and providing a one-to-one correspondence between the alphabet of the ontology and the predicate symbols of the source schema , we observe that, given two D-datasets and (resp., a D-dataset ), there exists a proper Σ-separation of and (resp., a proper Σ-characterization of ) in UCQ if and only if and are UCQ-explainable (resp., is UCQ-definable) inside D.
We conclude this section by observing that, in the case of plain relational databases, two D-datasets and (resp., a D-dataset ) are UCQ-explainable (resp., is UCQ-definable) inside D if and only if the union of the CQs describing the tuples in achieves the separation (resp., the definition). More formally, and (resp., ) are UCQ-explainable (resp., is UCQ-definable) inside D if and only if explains and inside D (resp., defines inside D), where . When casting this problem to the OBDM setting, one of the key challenges is that the separating query needs to be formulated over the ontology level, while data reside at the source level, with a mapping layer mediating the two. One of the contributions of this paper is to provide a similar characterization as in the relational database case for the OBDM setting (cf. Corollary 3).
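In the plain relational case, the union of "describing" CQs mentioned above can be built mechanically: the body of the CQ for a tuple t is the whole database with every constant renamed to a variable, and t's positions become the answer variables. The sketch below is ours, with an illustrative encoding of facts as (predicate, argument tuple) pairs.

```python
def describing_cq(D, t):
    """The CQ describing tuple t inside database D: every constant of D
    becomes a variable, t's constants become the answer variables, and all
    remaining variables are existential.  Assumes t's constants occur in D."""
    var = {c: f"?v_{c}" for _, args in D for c in args}
    body = sorted((p, tuple(var[c] for c in args)) for p, args in D)
    return tuple(var[c] for c in t), body
```

The union of `describing_cq(D, t)` over all t in the positive dataset is then the canonical candidate query: by the characterization recalled above, the datasets are UCQ-explainable (resp., UCQ-definable) inside D exactly when this union already achieves the separation (resp., the definition).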
The verification problem
We now define the verification problems for X-separability (X-V) and X-characterization (X-V), where . These decision problems are parametric with respect to the ontology language to express the ontology , the mapping language to express the mapping , and the target query language to express the query over .
We also introduce two important special cases of the above decision problems, namely X-VST(, , ) and X-VST(, , ), which are special cases of X-V(, , ) and X-V(, , ), respectively, in which all the input D-datasets are singleton sets (i.e., each consists of just a single tuple).
In what follows, given a syntactic object x such as a query, an ontology, or a mapping, we denote by its size, which is the number of symbols needed to write it, with names of predicates, variables, etc. counting as one.
We start by analyzing the upper bounds for the case . The proofs of the next theorems rely on the fact that, given an OBDM specification of our considered scenario and a UCQ over , each disjunct in can be obtained after a polynomial number of transformations of an initial disjunct in [15], and, analogously, each disjunct in can be obtained after a polynomial number of transformations of an initial disjunct in [17]. This implies that, although there may be an exponential number of disjuncts in the UCQ , the size of each disjunct is always bounded by a polynomial in the size of , , and . More precisely, the size of each disjunct in is at most and the size of each disjunct in is at most .
Complete-V(, GLAV, UCQ) and Complete-V(, GLAV, UCQ) are in.
We address Complete-V(, GLAV, UCQ). The case of Complete-V(, GLAV, UCQ) can be addressed in exactly the same way (recall that is immaterial for the complete case). In particular, we now show how to check in whether is a complete Σ-separation of and in UCQ (i.e., ), where with in which is a ontology and is a GLAV mapping.
Let n be the arity of the tuple(s) in the D-datasets and . For each n-tuple of constants , we first guess (i) a CQ over which is either of arity n and size at most , or a boolean one capturing a disjointness assertion d (e.g., capturing ); (ii) a sequence of ontology assertions; (iii) a CQ over of size at most which is either of arity n and of the form , or a boolean one of the form ; (iv) a sequence of mapping assertions; and (v) a function f from the variables occurring in to .
Then, we check in polynomial time whether (i) by means of , either we can rewrite a disjunct of into through (i.e., ), or we can rewrite a disjunct of into through (i.e., ); (ii) by means of , we can rewrite into through (i.e., , and thus either or ); and finally (iii) f is a homomorphism witnessing either , i.e., (and therefore , which means ), or , i.e., (and therefore , which means that Σ is inconsistent and thus by definition of certain answers). □
By exploiting the above result, we now address the upper bounds for the case .
Sound-V(, GLAV, UCQ) and Sound-V(, GLAV, UCQ) are in.
We start with Sound-V(, GLAV, UCQ), and then we consider Sound-V(, GLAV, UCQ). In particular, we now show how to check in whether is not a sound Σ-separation of and in UCQ (i.e., ), where with in which is a ontology and is a GLAV mapping.
We first guess (i) a tuple of constants , and, exactly as in the proof of Theorem 1, we also guess (ii) , , , , and f. Then, we check in polynomial time whether (i) , and (ii) using , , , , and f, we follow exactly the same polynomial time procedure described in the proof of Theorem 1 to check whether .
As for the Sound-V(, GLAV, UCQ) case, it is sufficient to replace check (i) with the check , where n is the arity of the tuple(s) in the D-dataset . Clearly, the latter check can also be done in polynomial time, by first checking that every constant of effectively occurs in and then simply checking that . □
We recall that the complexity class , introduced in [56], resides at the second level of the polynomial time hierarchy [61], and it is composed of all those decision problems that are the intersection of a decision problem in and a decision problem in , that is, .
Since is a proper Σ-separation of and in if and only if it is both a sound and a complete Σ-separation of and in , we immediately derive the following upper bounds for .
Proper-V(, GLAV, UCQ) and Proper-V(, GLAV, UCQ) are in.
We now provide matching lower bounds, proving that all of them hold already for the special cases with singleton datasets. More specifically, we show that they hold for the same fixed OBDM system , the same fixed D-datasets and (resp., D-dataset ) containing only a single unary tuple, and for CQs as queries. Furthermore, the fixed OBDM specification is such that (i.e., contains no ontology assertions) and is a GAV∩LAV mapping (i.e., is both a GAV and a LAV mapping). To simplify the presentation, with a slight abuse of notation, from now on we denote by the ontology language allowing only for ontologies , i.e., ontologies without assertions.
There is an OBDM systemsuch thatandis a GAV∩LAV mapping, and two D-datasetsand(resp., a D-dataset) containing only one unary tuple for which:
Complete-VST(∅, GAV∩LAV, CQ) and Complete-VST(∅, GAV∩LAV, CQ) are-hard;
Sound-VST(∅, GAV∩LAV, CQ) and Sound-VST(∅, GAV∩LAV, CQ) are-hard;
Proper-VST(∅, GAV∩LAV, CQ) and Proper-VST(∅, GAV∩LAV, CQ) are-hard.
Let be the fixed OBDM system such that (i) is an OBDM specification in which contains no assertions and whose alphabet consists of two atomic roles and , , and consists of the following two GAV∩LAV assertions: and , which simply mirrors source predicate to atomic role , for , and (ii) D is the -database composed of the following facts:
Let, moreover, and be the fixed D-datasets and . Note that is needed only for the decision problems related to separability but not for the decision problems related to characterizability.
Let be a finite and undirected graph without loops (in graph theory, a loop is an edge that connects a vertex with itself [8]) or isolated nodes, where . We define a CQ over as follows:
Notice that can be constructed in from an input graph G as above.
By inspecting the OBDM system , for any graph G as above, the set of certain answers must necessarily be an element of the power set of . More specifically, the following property holds:
For bothand, we have that a graphis i-colourable if and only if.
First of all, notice that is composed of the following facts:
“Only-if part:” Suppose is 3-colourable (resp., 4-colourable), that is, there exists a function (resp., ) such that for each . Let be the body of , and consider the extension of f that assigns the constant (resp., ) to the distinguished variable x of . It can be readily seen that f is a homomorphism from to such that (resp., ). In other words, f witnesses that (resp., ). Thus, (resp., ), as required.
“If part:” Suppose is not 3-colourable (resp., not 4-colourable), that is, each possible function (resp., ) is such that for some . Clearly, this implies that (resp., ). Thus, (resp., ), as required. □
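The property just proved rests on the classical fact that a graph is k-colourable if and only if it admits a homomorphism into the complete graph on k vertices, which is exactly what evaluating the CQ over the fixed database checks. A brute-force rendering of that fact (ours, purely illustrative; exponential in the number of vertices):

```python
from itertools import product

def is_k_colourable(vertices, edges, k):
    """G is k-colourable iff some assignment of k colours to its vertices
    gives every edge two distinct endpoints, i.e., iff G maps
    homomorphically into the complete graph on k vertices."""
    vertices = list(vertices)
    for colours in product(range(k), repeat=len(vertices)):
        c = dict(zip(vertices, colours))
        if all(c[u] != c[v] for u, v in edges):
            return True
    return False
```

For instance, a triangle is 3-colourable but not 2-colourable, matching the intuition behind the reductions from 3- and 4-colourability used below.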
With the above property in hand, and the fact that is an element of the power set of for each possible graph , we are now ready to prove the claimed lower bounds.
As for the complete case, the proof of -hardness is by a reduction from the 4-colourability problem, which is -complete [32]. In particular, from the above claim a graph G is 4-colourable if and only if , i.e., if and only if , which is the condition for to be a complete Σ-separation of and in CQ (resp., a complete Σ-characterization of in CQ).
As for the sound case, the proof of -hardness is by a reduction from the complement of the 3-colourability problem, which is -complete [32]. In particular, from the above claim a graph G is not 3-colourable if and only if , i.e., if and only if (resp., ), which is the condition for to be a sound Σ-separation of and in CQ (resp., a sound Σ-characterization of in CQ).
Finally, as for the proper case, the proof of -hardness is by a reduction from the exact-4-colourability problem, which is -complete [59]. In particular, a graph G is exact-4-colourable (i.e., 4-colourable and not 3-colourable) if and only if , i.e., if and only if and (resp., ), which is the condition for to be a proper Σ-separation of and in CQ (resp., a proper Σ-characterization of in CQ). □
For the scenario under investigation in this paper, we can now establish the precise computational complexity of all the Verification decision problems introduced at the beginning of this section.
The following holds:
Complete-V(, GLAV, UCQ) and Complete-V(, GLAV, UCQ) are-complete. The lower bounds already hold for Complete-VST(∅, GAV∩LAV, CQ) and Complete-VST(∅, GAV∩LAV, CQ);
Sound-V(, GLAV, UCQ) and Sound-V(, GLAV, UCQ) are-complete. The lower bounds already hold for Sound-VST(∅, GAV∩LAV, CQ) and Sound-VST(∅, GAV∩LAV, CQ);
Proper-V(, GLAV, UCQ) and Proper-V(, GLAV, UCQ) are-complete. The lower bounds already hold for Proper-VST(∅, GAV∩LAV, CQ) and Proper-VST(∅, GAV∩LAV, CQ).
Finally, from the lower bounds given in Theorem 3, we can derive two interesting novel results also in the context of explainability and definability in the plain relational database case. More specifically, since the fixed OBDM specification used in the proof of that theorem is such that and is both a GAV and a LAV mapping, the proof can be straightforwardly adapted to the plain relational database case. Thus, given a schema , an -database D, two D-datasets and (resp., a D-dataset ), and a UCQ over , the problem of deciding whether explains and (resp., defines ) inside D is -complete (membership of these decision problems directly follows from Corollary 1).
The computation problem
In this section, we address the Computation problem. We start by considering the case where the given OBDM system Σ is inconsistent. Given an inconsistent OBDM system with and two D-datasets and (resp., a D-dataset ) of arity n, we point out that any query of arity n over is a -minimally complete Σ-separation (resp., Σ-characterization) of and (resp., of ). This is so because, by definition, the set of certain answers of any query of arity n w.r.t. an inconsistent OBDM system Σ is the set of all possible n-tuples of constants occurring in D, i.e., . Furthermore, if and , then any query of arity n over is also a -maximally sound (and therefore a proper in ) Σ-separation (resp., Σ-characterization) of and (resp., of ); otherwise, no sound (and therefore no -maximally sound and no proper in ) Σ-separation (resp., Σ-characterization) of and (resp., of ) exists. Since for OBDM systems of our scenario it is always possible to check whether they are consistent (cf. Section 3), from the above observations one can trivially derive suitable algorithms for the Computation problem in all the cases in which the input OBDM system Σ is inconsistent.
Having thoroughly covered the case of inconsistent OBDM systems, in what follows in this section, unless otherwise stated, we implicitly assume to only deal with consistent OBDM systems.
Specifically, we now provide two algorithms that, given a consistent OBDM system and two D-datasets and (resp., a D-dataset ), always terminate and return, respectively, a UCQ-minimally complete and a UCQ-maximally sound Σ-separation (resp., Σ-characterization) of and (resp., of ). This proves that, in our investigated scenario, such separations always exist and can be computed. The algorithms we provide focus only on the separability case; algorithms for the characterizability case can be immediately derived from them by simply computing , where n is the arity of the tuple(s) in .
Before delving into the details of the two algorithms, we provide some crucial properties about the canonical structure that will be used to establish the correctness of such algorithms.
Letwithbe an OBDM system,be a UCQ over, andandbe two tuples of constants such that. If, then.
If Σ is inconsistent, then the claim is trivial. If Σ is consistent, from Section 3 we know that implies the existence of a disjunct in for which . Let h be the homomorphism witnessing that , and let be the homomorphism witnessing that , which exists by the premises of the proposition. The composite function is then a homomorphism witnessing that . It follows that , as required. □
Letwithbe a consistent OBDM system,andbe two tuples of constants, andbe the CQover. We have thatif and only if.
Suppose that , and let h be the homomorphism witnessing this. Consider the query . Observe that is obtained from by appropriately replacing each occurrence of each constant either with a distinguished variable or with an existential variable . This means that h can be immediately transformed into a homomorphism witnessing that , thus implying that .
Suppose now that . Since Σ is consistent, it follows that there is a homomorphism h witnessing that , where . By considering again the relationship between and , the homomorphism h can be immediately transformed into a homomorphism that witnesses . By construction of the canonical structure , it is now trivial to verify that can be properly extended into a homomorphism witnessing that . □
We are now ready to present our algorithms for the Computation problem. We start with the complete case, and provide the Algorithm 1 (MinCompSeparation) for computing UCQ-minimally complete separations.
MinCompSeparation
Informally, for each positive example , the algorithm obtains from the set of atoms the CQ . The output query is then the union of all the CQs obtained in this way.
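Abstracting from the details of the canonical structure, the informal description above can be sketched as follows (ours): we assume the relevant set of chase atoms is finite and already computed, and follow the convention that every term of the chase (constant or labelled null) becomes a variable of the disjunct.

```python
# Sketch (ours): build one disjunct per positive tuple from a finite set of
# chase atoms, encoded as (predicate, args) pairs; nulls are strings too.

def min_comp_separation(chase_atoms, pos):
    """One disjunct per positive tuple t: the CQ whose body is the set of
    chase atoms with every term turned into a variable, and whose answer
    variables are the (variables corresponding to the) positions of t."""
    var = {c: f"?v_{c}" for _, args in chase_atoms for c in args}
    body = sorted((p, tuple(var[c] for c in args)) for p, args in chase_atoms)
    return [(tuple(var[c] for c in t), body) for t in sorted(pos)]
```

Each disjunct is, by construction, satisfied by its own positive tuple over the canonical structure, which is the intuition behind the completeness argument in Theorem 4 below.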
Let be the same OBDM specification as in Example 2. One can verify that for the -database and the D-datasets and , MinCompSeparation(, , ) returns the UCQ ∪ , where and . Note that the query returned by the algorithm is a UCQ-minimally complete Σ-separation of and , where .
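To make the procedure concrete, here is a minimal Python sketch of the idea behind MinCompSeparation. All names and data structures (atoms encoded as `(predicate, args)` pairs, the canonical structure given as a precomputed set of atoms) are illustrative assumptions, not the paper's actual notation:

```python
def cq_of_tuple(chase_atoms, tup):
    """Build the CQ for one positive example: distinguished variables
    x1..xn for the constants of the example tuple, fresh existential
    variables for all remaining terms of the canonical structure."""
    var = {}
    for i, c in enumerate(tup):          # distinguished variables
        var.setdefault(c, f"x{i+1}")

    def term(t):
        if t not in var:                 # fresh existential variable
            var[t] = f"y{len(var)}"
        return var[t]

    body = [(p, tuple(term(t) for t in args)) for p, args in chase_atoms]
    head = tuple(var[c] for c in tup)
    return head, frozenset(body)

def min_comp_separation(chase_atoms, positives):
    """Minimally complete separation: one disjunct per positive tuple,
    the output UCQ being the union of all such CQs."""
    return [cq_of_tuple(chase_atoms, t) for t in positives]
```

The negative examples play no role here; they matter only in the sound case, where disjuncts are filtered out.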
The following theorem establishes termination and correctness of Algorithm 1 (MinCompSeparation).
Let , with , be a consistent OBDM system and and be two D-datasets. We have that MinCompSeparation(Σ, , ) terminates and returns a UCQ-minimally complete Σ-separation of and .
Termination of the algorithm as well as completeness of the UCQ returned are straightforward. In particular, due to Proposition 5, it is obvious that each is such that , where .
To prove that is also a UCQ-minimally complete Σ-separation of and , note that it is enough to show that any query q over that is a complete Σ-separation of and in UCQ is such that . We do this by contraposition. Let q be a UCQ over for which , i.e., for a tuple of constants we have but . The latter means that for some with occurring in . By Proposition 5, one can see that implies . By Proposition 4, it follows that each UCQ over containing the tuple in its set of certain answers w.r.t. Σ must also contain the tuple in that set, i.e., implies for any UCQ over . Thus, since , we derive that as well. Now, since and , this immediately implies that q is not a complete Σ-separation of and in UCQ, as required. □
We now turn to the sound case, providing Algorithm 2 (MaxSoundSeparation) for computing UCQ-maximally sound Σ-separations.
MaxSoundSeparation
Intuitively, starting from the query , which is a UCQ-minimally complete Σ-separation of and , the algorithm simply discards all those disjuncts whose set of certain answers w.r.t. Σ contains at least one tuple . We recall from Section 3 that the set of certain answers of a CQ w.r.t. a consistent OBDM system can always be computed by first obtaining its reformulation over the source schema , and then evaluating the latter query directly over the -database D. In other words, the if-condition of line 5 can be equivalently reformulated as: .
Refer to Example 7. Since the certain answers of the CQ w.r.t. include also , MaxSoundSeparation(Σ, ,) returns the CQ . Note that is a UCQ-maximally sound Σ-separation of and .
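The filtering step admits a very short sketch. In the following illustrative Python fragment, `certain_answers` is assumed to be a black box (e.g., realized by reformulating each CQ over the source schema and evaluating it over D, as recalled above); all names are assumptions:

```python
def max_sound_separation(disjuncts, negatives, certain_answers):
    """From a minimally complete UCQ (given as a list of disjuncts),
    keep only the disjuncts whose certain answers contain no negative
    example; the union of the survivors is the maximally sound UCQ."""
    return [q for q in disjuncts
            if not any(n in certain_answers(q) for n in negatives)]
```

For instance, a disjunct whose certain answers include some negative tuple is dropped, exactly as in the example above.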
The following theorem establishes termination and correctness of Algorithm 2 (MaxSoundSeparation).
Let , with , be a consistent OBDM system, and and be two D-datasets. We have that MaxSoundSeparation(Σ, , ) terminates and returns a UCQ-maximally sound Σ-separation of and .
Termination of the algorithm as well as soundness of the UCQ returned are straightforward. In particular, by construction, all the disjuncts of are such that there is no tuple for which .
To prove that is also a UCQ-maximally sound Σ-separation of and , note that it is enough to show that any query q over that is a sound Σ-separation of and in UCQ is such that . We do this by contraposition. Let q be a UCQ over for which , i.e., for a tuple of constants we have but . Since and , it is easy to see that the algorithm discarded the disjunct (otherwise, we would trivially derive that , and thus , contradicting the fact that ). By construction of the algorithm, the only reason was discarded is that for at least one tuple . By Proposition 5, one can see that implies . By Proposition 4, it follows that each UCQ over containing the tuple in its set of certain answers w.r.t. Σ must also contain the tuple in that set, i.e., implies for any UCQ over . Thus, since , we derive that as well. Now, since and , this immediately implies that q is not a sound Σ-separation of and in UCQ, as required. □
We now briefly discuss the running time of the two algorithms above. First, we observe that their running time differs depending on the mapping language in which the input mapping is formulated. A key difference between the mapping language GLAV and its special cases GAV and LAV lies in the size of the computed . With GLAV mappings, due to the simultaneous presence of queries involving multiple source predicates in the left-hand side of mapping assertions and of joined existential variables in their right-hand side, the set of atoms can be exponentially large. For example, considering a database and a mapping containing the GLAV assertion , the number of atoms occurring in the set is . Conversely, with both LAV and GAV mappings, is always polynomially bounded, since the former does not allow multiple source predicates in the left-hand side of mapping assertions, whereas the latter does not allow existential variables in the right-hand side of mapping assertions, and the arity of ontology predicates is fixed to at most 2.
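The blow-up can be seen by counting the homomorphisms of a GLAV assertion's left-hand-side CQ into the source database, since each match triggers one application of the assertion (and thus one fresh block of existential witnesses in the chase). A naive counting sketch, under an illustrative encoding of atoms as `(predicate, args)` pairs:

```python
from itertools import product

def lhs_matches(lhs, db):
    """Count the homomorphisms of the left-hand-side CQ `lhs` into the
    database `db`; each one triggers one application of the assertion."""
    vars_ = sorted({v for _, args in lhs for v in args})
    consts = sorted({c for _, args in db for c in args})
    count = 0
    for assign in product(consts, repeat=len(vars_)):
        h = dict(zip(vars_, assign))
        # check that every atom of the LHS is mapped onto a fact of db
        if all((p, tuple(h[v] for v in args)) in db for p, args in lhs):
            count += 1
    return count
```

With k independent atoms r(x1), …, r(xk) in the left-hand side and n matching facts in the database, there are n^k matches, in line with the exponential bound discussed above.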
Furthermore, the running time required to compute differs in the LAV and GAV cases. Indeed, can always be computed in polynomial time whenever is a LAV mapping, whereas, in general, already when is a GAV mapping, since there are CQs in the left-hand side of mapping assertions that need to be evaluated over the database D, it takes exponential time to deterministically compute (unless ).
As immediate consequences of the above observations, we derive that, in general, Algorithm 1 (MinCompSeparation) runs in exponential time with respect to the size of the input, while it runs in polynomial time whenever the mapping of the input OBDM system is a LAV mapping. As for Algorithm 2 (MaxSoundSeparation), since after computing one also needs to compute the certain answers of each for all in the input dataset , and deterministically computing the certain answers of CQs with respect to an OBDM system takes exponential time (unless ), we derive that, in general, the algorithm runs in double exponential time with respect to the size of the input, while it runs in exponential time whenever the mapping of the input OBDM system is either a GAV or a LAV mapping. This is because is guaranteed to be of polynomial size whenever is either a GAV or a LAV mapping (and therefore the various CQs are also of polynomial size with respect to the size of the input).
We conclude this section by providing a semantic test for the existence of proper separations and characterizations in UCQ in the OBDM settings. We observe that, in all the cases in which a proper separation exists, it is clear that both algorithms return the same query . Therefore, as a direct consequence of both Theorem 4 and Theorem 5, we get the following result.
Letbe an OBDM specification, andandtwo D-datasets. Either the UCQis a proper Σ-separation ofandin UCQ, or no proper Σ-separation ofandin UCQ exists.
The combination of Corollary 3 and Proposition 5 allows us to provide the following semantic tests for the existence of proper separations and proper characterizations in UCQ in the OBDM setting, which can be seen as analogues of the semantic tests given in [7] and [55] for the plain relational database setting and the ontology-enriched query answering setting, respectively.
test for UCQs in OBDM: given a consistent OBDM system and two D-datasets and , there exists a proper Σ-separation of and in UCQ if and only if it is the case that for all and all .
test for UCQs in OBDM: given a consistent OBDM system and a D-dataset of arity n, there exists a proper Σ-characterization of in UCQ if and only if it is the case that for all and all .
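Both tests ultimately reduce to checking the (non-)existence of a homomorphism between canonical structures with distinguished tuples. A naive backtracking check, under the illustrative encoding of structures as sets of `(predicate, args)` atoms (all names here are assumptions, not the paper's notation):

```python
def homomorphism(src_atoms, dst_atoms, src_tup, dst_tup):
    """Search for a homomorphism h from src_atoms to dst_atoms with
    h(src_tup) = dst_tup pointwise; returns the mapping or None.
    A pair of examples passes the semantic test iff no such
    homomorphism exists."""
    h0 = {}
    for a, b in zip(src_tup, dst_tup):
        if h0.setdefault(a, b) != b:        # inconsistent tuple constraint
            return None
    atoms = sorted(src_atoms)

    def extend(i, h):
        if i == len(atoms):
            return h
        p, args = atoms[i]
        for q, dargs in dst_atoms:          # try every matching target fact
            if q != p or len(dargs) != len(args):
                continue
            h2, ok = dict(h), True
            for a, b in zip(args, dargs):
                if h2.setdefault(a, b) != b:
                    ok = False
                    break
            if ok:
                res = extend(i + 1, h2)
                if res is not None:
                    return res
        return None

    return extend(0, h0)
```

This brute-force search is exponential in the worst case, consistent with the hardness results established below.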
The existence problem
We now address the existence problem. First of all, for the scenario under consideration in this paper, the existence problem for both UCQ-minimally complete and UCQ-maximally sound separations (and also characterizations) is trivial, since by Theorems 4 and 5 they always exist. Thus, in this section we only consider the remaining proper case, by defining variants of the decision problems of [55] in which a mapping in some mapping language is also given as input.
We also introduce two important special cases of the above decision problems, namely ST(, , ) and ST(, , ), which are special cases of (, , ) and (, , ), respectively, in which all the input D-datasets are singleton sets (i.e., they consist of just a single tuple).
In what follows, we show that the computational complexity of the above decision problems differs only in the mapping language adopted. As already discussed in Section 6, a key difference between GLAV and the special cases GAV and LAV lies both in the size of the computed , i.e., exponential for GLAV mappings and polynomial for both GAV and LAV mappings, and in the effort required to compute , i.e., polynomial time for LAV mappings and exponential time (unless ) for both GAV and GLAV mappings.
We start by characterizing the computational complexity of and (and their respective subproblems) in the simplest LAV case, then we address the GAV case, and finally we focus on the most general GLAV case. Interestingly, all the provided matching lower bounds hold even for fixed ontologies , i.e., ontologies without assertions, and fixed D-datasets containing only a single unary tuple.
Importantly, for the scenario under consideration, from the results of the previous section, the questions in and can be reformulated equivalently as follows: “is also a sound (and so, a proper) Σ-separation of and in UCQ?” and “is also a sound (and so, a proper) Σ-characterization of in UCQ?”, respectively.
(, LAV, UCQ) and(, LAV, UCQ) are-complete. The lower bounds already hold for ST(∅, GAV∩LAV, UCQ) and ST(∅, GAV∩LAV, UCQ).
Upper bound: We only discuss (, LAV, UCQ); the (, LAV, UCQ) case is similar and therefore omitted. Given an OBDM system with in which is a ontology and is a LAV mapping, and two D-datasets and , we first compute in polynomial time (recall that is a LAV mapping), and therefore also the query , which is a UCQ-minimally complete Σ-separation of and . Then, exactly as illustrated in Theorem 2, we can check in whether is also a sound (and hence proper) Σ-separation of and in UCQ.
Lower bound: We start with ST(∅, GAV∩LAV, UCQ), and then we address the ST(∅, GAV∩LAV, UCQ) case. The proof of -hardness is by a reduction from the complement of the 3-colourability problem. We define a fixed OBDM specification in which contains no assertions and whose alphabet consists of an atomic role e and an atomic concept A, , and consists of the following two GAV∩LAV mapping assertions: and , which simply mirror the source predicates and s to e and A, respectively.
Let be a finite and undirected graph without loops or isolated nodes, where . Without loss of generality, we may assume that and that G is connected, i.e., there is a path from c to for any pair of nodes . Then, we define an -database composed of the following facts:
Let, moreover, and be the fixed -datasets and .
Notice that the -database can be constructed in from an input graph G as above. Let the OBDM system be . We now show that a graph G is not 3-colourable if and only if there exists a proper -separation of and in UCQ, thus proving the claimed lower bound.
Given a graph G, there exists a proper-separation ofandin UCQ if and only if G is not 3-colourable.
First of all, note that . We recall that a proper -separation of and in UCQ exists if and only if the CQ is also a sound (and so, a proper) -separation of and in UCQ.
“Only-if part:” Suppose is 3-colourable, that is, there exists a function such that for each . Without loss of generality, we may assume that (indeed, the existence of f clearly implies the existence of a function with and such that for each holds as well). Consider the extension h of f assigning to each . It can be readily seen that h is a homomorphism witnessing that , which directly implies that . It follows that is not a sound -separation of and in UCQ, and therefore no proper -separation of and in UCQ exists.
“If part:” Suppose there exists no proper -separation of and in UCQ, i.e., the CQ is such that . It follows that there exists a homomorphism h witnessing that . Now, since the graph G is connected and since , by construction of , the homomorphism h must necessarily be such that for each constant c representing a node . But then, due to the fact that none of , , and occur in , we derive that h is such that for each , and therefore for each as well. Thus, we can conclude that G is 3-colourable. □
As for the -hardness of ST(∅, GAV∩LAV, UCQ), it is possible to use exactly the same reduction provided above by discarding and considering only . In particular, due to the fact that and are the only A-facts occurring in , the set of certain answers of the query w.r.t. is always either or , and which one of the two depends on the 3-colourability of G. □
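The reduction hinges on the classical fact that a graph is 3-colourable iff there is a homomorphism from it to the triangle K3. A brute-force check of this property (purely illustrative; the reduction itself never computes colourings):

```python
from itertools import product

def three_colourable(nodes, edges):
    """G is 3-colourable iff some assignment of 3 colours makes every
    edge bichromatic, i.e., iff there is a homomorphism G -> K3."""
    for col in product(range(3), repeat=len(nodes)):
        c = dict(zip(nodes, col))
        if all(c[u] != c[v] for u, v in edges):
            return True
    return False
```

In the reduction above, a proper separation of the two fixed datasets exists exactly when this check fails on the input graph.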
We now turn to consider GAV mappings. We recall that the complexity class has many equivalent characterizations: for instance, it consists of the problems solvable in polynomial time with a constant number of rounds of parallel queries to an oracle for a decision problem in [12] (we refer the reader to [67] for further characterizations of this complexity class). By a round of parallel queries, it is meant that the Turing machine can ask polynomially many non-adaptive queries to the oracle.
(, GAV, UCQ) and(, GAV, UCQ) are-complete. The lower bounds already hold for ST(∅, GAV, UCQ) and ST(∅, GAV, UCQ).
Upper bound: We only discuss (, GAV, UCQ); the (, GAV, UCQ) case is similar and therefore omitted. Given an OBDM system with in which is a ontology and is a GAV mapping, and two D-datasets and , as a first step we compute in polynomial time with a single round of parallel queries to an oracle. More specifically, for each pair of constants (resp., constant ) and for each atomic role P (resp., atomic concept A) in the alphabet of we ask, all together in a single round of parallel queries to an oracle, whether (resp., ). It is clear that deciding whether (resp., ) for a given pair of constants (resp., constant ), a GAV mapping , and a database D is an -complete problem because the left-hand sides of mapping assertions are CQs [1].
Once obtained as described above, we construct in polynomial time the UCQ . Then, due to Theorem 2, with a second and final round of parallel queries, we can ask with a single query to an oracle whether is also a sound (and so, a proper) Σ-separation of and in UCQ.
Lower bound: We start with ST(∅, GAV, UCQ), and then we address the ST(∅, GAV, UCQ) case. The proof of -hardness is by a reduction from the odd clique problem, which is a -complete problem [66]. Odd clique is the problem of deciding, given a finite and undirected graph without loops, whether the maximum clique size of G is an odd number. Without loss of generality, we may assume that E contains at least one edge and that the cardinality of V is an even number (indeed, it is always possible to add fresh isolated nodes to the graph G without changing its maximum clique size).
Given a graph as above with , we define an OBDM system as follows: is an OBDM specification in which contains no assertions, , and, for each odd , the mapping has the following two GAV assertions:
where is an atomic concept in the alphabet of and, for each natural number p, is the conjunction of atoms:
.
Intuitively, asks whether G contains a clique of size p. Finally, is the -database . Let, moreover, and be the fixed -datasets and .
Notice that and are fixed, whereas the OBDM system can be constructed in from an input graph G as above.
The correctness of the reduction is mainly based on the next property:
Letbe an odd number. We have that:
if and only if G contains a clique of size i.
if and only if G contains a clique of size.
As for 1, since , it is easy to see that the query is such that if and only if G has a clique of size i. Thus, due to the GAV assertion occurring in , we have if and only if G has a clique of size i.
As for 2, since , it is easy to see that the query is such that if and only if G has a clique of size . Thus, due to the GAV assertion occurring in , if G has a clique of size , then . Conversely, suppose that G does not have a clique of size . On the one hand, the assertion does not make true in . On the other hand, since , not even the assertion makes true in . Thus, . □
With the above property in hand, we can now prove that the maximum clique size of a graph G is an odd number if and only if the CQ is also a sound (and so, a proper) -separation of and in UCQ, thus showing the claimed lower bound.
The maximum clique size of a graph G is an odd number if and only if the CQis also a sound (and so, a proper)-separation ofandin UCQ.
“Only-if part:” Suppose that the maximum clique size of G is p, where p is an odd number. Due to Claim 3, we have that . Observe that because G does not have a clique of size by assumption. Thus, , where . It is straightforward to verify that but . It follows that , i.e., is a proper -separation of and in UCQ.
“If part:” Suppose that the maximum clique size of G is r, where r is an even number. Due to Claim 3, we have that . Observe that and because by assumption G has a clique of size r but not of size . Thus, , where . It is straightforward to verify that . It follows that , and therefore is not a sound (and so, not a proper) -separation of and in UCQ. □
As for the -hardness of ST(∅, GAV, UCQ), it is possible to use exactly the same reduction provided above by discarding and considering only . In particular, the set of certain answers of the query w.r.t. is always either or , and which one of the two depends on the parity of the maximum clique size of G. □
We now consider the remaining more general case of GLAV mappings.
(, GLAV, UCQ) and(, GLAV, UCQ) are-complete. The lower bounds already hold for ST(∅, GLAV, UCQ) and ST(∅, GLAV, UCQ).
The remainder of this section is dedicated to the proof of the above theorem and is organized as follows. We first provide the lower bound proof in Section 7.1, and then we provide a matching upper bound in Section 7.2.
To simplify the presentation of the lower bound proof, we first assume that ontologies may also contain ternary predicates in their alphabet. Then, by applying reification, we will show how to remove this assumption. We start with (∅, GLAV, UCQ), and then we address the (∅, GLAV, UCQ) case.
With ternary predicates: The proof of -hardness is by a reduction from the complement of the -3CNF problem. Given a formula ϕ of the form , such that is a disjunction of exactly three literals for , -3CNF is the problem of deciding whether ϕ is satisfiable. It is the prototypical -complete problem [62].
The reduction is as follows. Let be an input formula for the -3CNF problem, where , , and . In what follows, for each , we denote by , , and the variable in appearing, respectively, in the first, second, and third literal of the clause . We build an OBDM system as follows:
.
is the -database consisting of the following facts:
. So, we have a constant for each variable ;
and for each . In this way, a formula of the form has possible assignments, each corresponding to an assignment to the variables in ϕ;
for each ;
and ;
for each , we have 7 different facts of the form , with being either 0 or 1 for . These 7 facts represent all the possible assignments that satisfy the clause of ϕ. For example, if ϕ contains the clause , then would contain the facts , , , , , , , and it would not contain the fact corresponding to an assignment that does not satisfy the clause .
. The alphabet of contains a ternary predicate for each and a binary predicate . Informally, will store and , where 0, 1, and b are constants. Moreover, for each , will store , where a is a constant and is the constant representing the variable in ϕ. The role played by the ternary predicates is twofold: (i) they represent the clauses of ϕ, with the constants coming from , each variable replaced with either 0 or 1, and the variables as existential variables generated by a GLAV assertion in ; (ii) they also represent the 7 possible assignments that satisfy the clause .
Finally, the mapping consists of the following GLAV assertions:
is the GLAV assertion
, where . Recall that, for each , we have that , , and denote, respectively, the variable in appearing in the first, second, and third literal of the clause ;
for , we have the assertion ;
;
.
Intuitively, populates a total of formulae based on ϕ. These formulae are obtained by assigning all possible combinations of 0 and 1 constants to the variables . It is important to notice that in each of these formulae the variables in are always represented by the same constants in , while the variables in are represented each time by fresh variables, generated at each application of the mapping assertion corresponding to a combination of 0 and 1 constants for the variables (i.e., the set of variables generated by an application of the assertion is disjoint from the set of variables generated by another application of the assertion ).
For each , the assertion generates on the predicate all the possible assignments that satisfy the clause in the formula ϕ. Finally, the assertions and are used to generate in the facts for each and the two facts and , respectively.
Finally, the fixed -datasets are and . Notice that and are fixed, whereas the OBDM system can be constructed in from an input formula for the -3CNF problem.
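The count of 7 facts per clause used in the construction above comes from the fact that a 3-literal clause excludes exactly one of the 8 truth assignments to its variables, namely the one falsifying all three literals. A small sketch enumerating them (literal polarities given as booleans; names are illustrative):

```python
from itertools import product

def satisfying_assignments(signs):
    """All 0/1 assignments to the three variables of a clause that
    satisfy it; `signs[l]` is True iff the l-th literal is positive.
    Exactly 7 of the 8 assignments survive."""
    return [bits for bits in product((0, 1), repeat=3)
            if any(b == (1 if s else 0) for b, s in zip(bits, signs))]
```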
We are now going to prove that if and only if ϕ is satisfiable. This proves the correctness of the reduction, and therefore that (∅, GLAV, UCQ) is -hard. Indeed, according to the semantic test given in Section 6, we have that there exists a proper -separation of and in UCQ if and only if . In other words, is equivalent to say that there does not exist a proper -separation of and in UCQ.
if and only if ϕ is satisfiable.
Since , we have that and coincide. Notice that, by definition, every function f witnessing that must be such that . As a consequence, due to the -facts occurring in , the function f must associate to each constant representing a variable of the formula ϕ either the constant 0 or the constant 1, i.e., for each we have that either or .
Now, if ϕ is satisfiable, then it is not hard to verify that there exists an assignment of the constants to 0 and 1 that satisfies all the formulae represented in . In particular, for each of these formulae, f also needs to associate to the existential variables generated by the application of corresponding to that formula the constants 0 and 1 that satisfy all the clauses of that formula. Thus, if ϕ is satisfiable, then we have that .
Conversely, if ϕ is not satisfiable, then it is not hard to verify that, for every possible assignment made by a function f for the constants , there exists at least one of those formulae for which it is not possible to assign to the variables of that formula the constants 0 and 1 so as to satisfy all the clauses of that formula. Thus, if ϕ is not satisfiable, then we have that . □
As for the case of (∅, GLAV, UCQ), it is possible to use exactly the same reduction provided above by discarding and considering only . In particular, it is immediate to verify that for each unary tuple with c being a constant different from b (while whether it is the case that depends solely on the satisfiability of ϕ).
We now discuss the necessary changes to be made to the above reduction to work in the presence of only binary predicates (atomic roles) in the ontology alphabet.
Without ternary predicates (only atomic roles): Let be an input formula for the -3CNF problem, where , , and . We build an OBDM system as follows:
. Differently from the previous case, for each , we have three binary predicates , , and instead of a single ternary relation .
is the -database consisting of the following facts:
;
and for each ;
and ;
for each , we make use of 7 different constants . Moreover, for each and , we have three facts of the form , , and with being either 0 or 1 for . These 7 constants represent all the possible assignments that satisfy the clause of ϕ, and the value assumed by each is the value corresponding to the j-th assignment in the l-th position of clause . For example, if ϕ contains the clause , then would contain the following facts: , , and ; , , and ; , , and ; , , and ; , , and ; , , and ; , , and . It would not contain the facts , , and corresponding to an assignment that does not satisfy the clause .
. The alphabet of contains the atomic role and three atomic roles , , and for each .
Finally, the mapping consists of the following GLAV assertions:
is the GLAV assertion
, where . Recall that, for each , we have that , , and denote, respectively, the variable in appearing in the first, second, and third literal of the clause ;
for , we have the assertions , , and ;
;
.
Finally, exactly as in the previous case of ternary predicates, the fixed -datasets are and . Notice that and are fixed, whereas the OBDM system can be constructed in from an input formula for the -3CNF problem.
Intuitively, differently from the case of ternary predicates, each of the formulae generated by is represented by binary predicates , with the additional freshly introduced existential variables representing the resulting clauses of that formula. Thus, for each of the formulae, a function f witnessing that must associate to each variable of that formula one of the 7 constants . Using analogous arguments as those given in the proof of Claim 5, it is immediate to verify that if and only if ϕ is satisfiable, from which we derive that (∅, GLAV, UCQ) is -hard (observe that ). As in the previous case of ternary predicates, for the case of (∅, GLAV, UCQ), it is possible to use exactly the same reduction provided above by discarding and considering only . Since we always have that for each unary tuple with c being a constant different from b (while whether it is the case that depends solely on the satisfiability of ϕ), we derive that (∅, GLAV, UCQ) is -hard as well.
First of all, we assume we deal only with consistent OBDM systems Σ. Indeed, given an inconsistent OBDM system and two D-datasets and (resp., a D-dataset ) of arity n, we point out that there exists a proper Σ-separation (resp., Σ-characterization) of and (resp., of ) in UCQ if and only if , where n is the arity of . Since we will provide a upper bound for the case of consistent OBDM systems, and since, for an OBDM system such that is a ontology and is a GLAV mapping, checking whether Σ is inconsistent can be done in using the technique described in the proof of Theorem 1, to simplify the presentation of the proof we can assume to deal only with consistent OBDM systems.
Recall from the semantic tests given at the end of Section 6 that there exists a proper Σ-separation (resp. Σ-characterization) of and (resp., of ) in UCQ if and only if it is the case that for all and all (resp., all ). With this observation at hand, we can solve the complements of (, GLAV, UCQ) and (, GLAV, UCQ) in using the following intuition: given an OBDM system such that is a ontology and is a GLAV mapping, and given two D-datasets and (resp., a D-dataset ) of arity n, we first guess a tuple , a tuple , and a function from to . Then, we check whether , (resp., ), is such that and every constant in is mapped to an existing term in . Finally, we call an oracle for the problem of checking whether can be extended to a function f witnessing that .
In order to realize the presented intuition, we first introduce some preliminary results that will allow us to simplify the presentation of the problem, as well as some naming conventions that we will adopt for the fresh variables generated when computing starting from D.
We start by observing that checking whether is equivalent to checking whether , where is obtained from simply by substituting every occurrence of a constant with its fresh copy .
if and only if.
“Only-if part”: Let h be a homomorphism witnessing , i.e., it holds that: (i) and (ii) implies for each . Let be a function such that, for every term t (i.e., either a constant or a variable): (i) if for a constant c, then , i.e., assigns to t the copy of c; (ii) if for a variable v, then , i.e., assigns to t the same variable v. Since h is such that , we have that . Moreover, since implies for each , by construction we have that implies for each . Thus, witnesses that .
“If part”: The proof can be obtained by following the same considerations of the “Only-if” part, by switching the roles of the two functions h and . □
As a consequence of the above lemma, we can show that it is possible to reduce the problem of verifying whether to the equivalent problem of verifying whether , where is computed from D by substituting every occurrence of a constant with its fresh copy . Although not strictly necessary, in the problem of verifying whether (equivalent to ) this simple step allows us to simplify the presentation of the upper bound proof, because we can distinguish between the canonical structure on the left-hand side of the arrow and the canonical structure on the right-hand side of the arrow.
if and only if.
This is a straightforward consequence of Lemma 1 and the fact that and coincide (up to variable renaming). □
In what follows, given an -database D and a tuple of constants from , we will implicitly assume that , called the copy of D, represents the -database obtained from D by substituting every occurrence of a constant with its fresh copy , and that represents the copy of the tuple . It is clear that both and can be computed in polynomial time from D and .
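The copy operation itself is straightforward to realize; a minimal sketch, using a trailing prime to mark fresh copies (an illustrative naming choice, not the paper's):

```python
def primed_copy(db, tup):
    """Replace every constant c occurring in the database and in the
    tuple by its fresh copy c'; clearly polynomial in the input size."""
    prime = lambda c: c + "'"
    db2 = {(p, tuple(prime(t) for t in args)) for p, args in db}
    return db2, tuple(prime(c) for c in tup)
```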
The following fundamental lemma tells us that, when verifying whether , on the left-hand side of the arrow we can actually restrict the attention only at .
if and only if.
The “Only-if” part is trivial, since by restricting the homomorphism witnessing to the terms in we derive a homomorphism witnessing .
As for the “If” part, let h be a homomorphism witnessing . Consider any concept inclusion assertion (resp., role inclusion assertion ) in such that for some (resp., for some terms ). This means that (resp., ). Now, since by assumption we have that (resp., ) because h is a homomorphism from to , we derive that (resp., ). With this observation at hand, it is easy to see that any homomorphism h witnessing can be extended to a homomorphism witnessing . □
We now introduce some naming conventions we will adopt for the fresh variables generated when computing starting from D.
Naming convention for the chase of the database with respect to the mapping: Let be a GLAV mapping relating a source schema to an ontology , and let D be an -database. We call a single application of on D, where , with , is a mapping assertion in , and is a homomorphism from to D. We denote by the set of atoms , where extends by assigning to the existential variable the variable , for . Notice that, up to variable renaming, coincides with , where is the set of all the possible single applications of on D. It is important to notice that, in this way, each variable generated when chasing D with respect to carries in its name both the application that generated it and the existential variable of m it instantiates.
Naming convention for the chase with respect to the ontology: Let be a GLAV mapping relating a source schema to an ontology , let D be an -database, and let be a ontology. We follow standard conventions for the names given to the freshly introduced variables when computing the chase of with respect to (see, e.g., [33]). More specifically, contains all the constants and variables occurring in and all the variables of the form , with , , and a basic role for each , such that:
there is a basic concept B with and , but for all and basic role with ;
and , for each .
As an example, suppose that for a term t that is either a constant in D or a fresh variable introduced when chasing D with respect to , and let . Then, , where and are the fresh variables introduced when chasing with respect to .
Before illustrating the non-deterministic algorithms for solving the complements of (, GLAV, UCQ) and (, GLAV, UCQ), we introduce a new notion and some fundamental technical results. Let be an OBDM system such that is a ontology and is a GLAV mapping, D be an -database, be a tuple of constants from , be the copy of D, and be a tuple of constants from . A partial witnessing function (in short, pwf) of on is a function from to such that . We call such a function partial because only the constants of have an assignment to a term in , while the variables of have not been assigned yet. The following lemma follows by definition.
if and only if there exists a pwf of on that can be extended to a function f from to witnessing that , i.e., such that implies for each .
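The extension step in the lemma above can be sketched procedurally. The following Python fragment is a minimal illustration, not the paper's algorithm: atoms are plain tuples, variables are (by an assumed convention) capitalized strings, and `extend_pwf` searches by backtracking for an extension of a partial witnessing function to a full homomorphism into a finite set of target atoms.

```python
def is_variable(term):
    # Illustrative convention: variables start with an uppercase letter.
    return term[0].isupper()

def extend_pwf(query_atoms, pwf, target_atoms):
    """Try to extend a partial witnessing function pwf (defined on constants
    only) to all variables, so that every query atom is mapped onto some
    atom in target_atoms. Returns a witnessing function, or None."""
    # Every constant occurring in the query atoms must already be mapped.
    for atom in query_atoms:
        for t in atom[1:]:
            if not is_variable(t) and t not in pwf:
                return None

    def search(assignment, remaining):
        if not remaining:
            return assignment
        pred, *args = remaining[0]
        for cand in target_atoms:
            if cand[0] != pred or len(cand) != len(remaining[0]):
                continue
            new, ok = dict(assignment), True
            for a, b in zip(args, cand[1:]):
                if a in new and new[a] != b:
                    ok = False
                    break
                new[a] = b
            if ok:
                result = search(new, remaining[1:])
                if result is not None:
                    return result
        return None  # backtrack

    return search(dict(pwf), list(query_atoms))
```

For instance, `extend_pwf([("R", "a", "X"), ("S", "X", "b")], {"a": "a", "b": "b"}, [("R", "a", "v1"), ("S", "v1", "b")])` extends the pwf with the assignment of the variable X to the term v1.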
We now present a fundamental technical result that will allow us to restrict attention to those pwfs whose range contains only constants and variables whose name lengths are polynomial in the size of the input OBDM system Σ.
Let be the set of all the possible single applications of on D, and let be a pwf of on . Then, we have that can be extended to a function f from to witnessing that if and only if the following holds:
for every single application of on D, there is a function from variables into such that implies , where denotes the function from to such that if t is a constant and if t is a variable.
The proof can be obtained by simply combining these two observations: (i) all the constants in are already assigned to a term in by the function ; (ii) every single application of on D produces new variables in without ever reusing the variables introduced in other applications of the mapping. In other words, the set of all variables produced by a single application p of on D and the set of all variables produced by all the other possible single applications of on D are disjoint. Thus, a function f extending and witnessing can always be split into as many functions as in the statement of the lemma, and, vice versa, if the functions as in the statement of the lemma exist, then they can be combined into an overall function f witnessing . □
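The decomposition argument in this proof can be phrased as a one-line check. In the hypothetical sketch below, `can_extend_single` stands for the per-application test of the lemma; since distinct applications never share fresh variables, the global witnessing extension exists exactly when every per-application test succeeds independently.

```python
def witnessing_extension_exists(applications, pwf, can_extend_single):
    """Lemma-style decomposition (illustrative names): a pwf extends to a
    global witnessing function iff it extends over the fresh variables of
    each single mapping application in isolation, because distinct
    applications produce pairwise-disjoint sets of fresh variables."""
    return all(can_extend_single(pwf, app) for app in applications)
```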
As anticipated, the above lemma has a crucial consequence. In particular, let be an OBDM system. Since the number of atoms in generated by a single application p of on D is at most polynomial with respect to the size of (i.e., it is at most the number of atoms occurring in the longest right-hand side among all ), from the above lemma and due to the forest-shaped form of the canonical structure , it is not hard to see that the following holds: if there exists a pwf of on that can be extended to a function f from to witnessing that , then there exists another pwf that can be extended to a function from to witnessing that and such that each variable in the range of (and therefore also in the range of ) has a name length that is bounded by a polynomial in the size of and , which implies that the variable can be generated after polynomially many chase steps of with respect to .
We are now finally ready to present the non-deterministic algorithms for solving the complements of (, GLAV, UCQ) and (, GLAV, UCQ). We first concentrate on (, GLAV, UCQ) by providing the non-deterministic Algorithm 3 (CharExistence).
CharExistence
Notice that Algorithm 3 (CharExistence) uses an oracle for the auxiliary decision problem CheckTotalExtension, which is the problem of deciding, given an OBDM system and a pwf of on , whether can actually be extended to a function f from to witnessing that .
Let be a consistent OBDM system and be a D-dataset. We have that CharExistence(Σ, ) terminates and returns true if and only if there exists a proper Σ-characterization of in UCQ.
Termination of the algorithm is immediate, while its correctness can be obtained by simply combining the semantic test given at the end of Section 6 with Lemmata 2, 3, and 4. □
As for the running time, we observe that, due to the discussion after Lemma 5, we can concentrate only on pwfs whose range contains only constants and variables whose name lengths are polynomial in the size of the input OBDM system Σ, and therefore the overall is of polynomial size with respect to Σ, since the number of constants in D is polynomial. Furthermore, we can check in polynomial time whether is a pwf of on , i.e., whether all the terms in the range of occur in , as follows. For each term t in the range of , we compute , where is the guessed application of on for the term t (of course, we first need to check whether is a homomorphism from the atoms on the left-hand side of m to ). Starting from the set of atoms, we then iteratively apply the guessed ontology assertions in the sequence for the term t one after the other, each time applying the considered ontology assertion in a single chase step over all the atoms obtained so far. Finally, we check whether the term t occurs in the overall set of atoms obtained in this way.
Thus, with respect to the size of its input, the overall running time of the non-deterministic Algorithm 3 (CharExistence) is polynomial with an oracle for the decision problem CheckTotalExtension. As a consequence, we derive that the complement of (, GLAV, UCQ) is in with an oracle in , i.e., , where is the computational complexity of CheckTotalExtension. As for the case of (, GLAV, UCQ), we can use a slight adaptation of Algorithm 3 (CharExistence), which takes as input an additional D-dataset and modifies step 5 of the algorithm by checking instead of checking . This means that the complement of (, GLAV, UCQ) is in as well, where is the computational complexity of CheckTotalExtension.
To conclude the proof of the upper bound, we are now going to show that CheckTotalExtension is in (thus, ), which gives us the desired upper bound for the complements of both (, GLAV, UCQ) and (, GLAV, UCQ), and therefore this will show that both (, GLAV, UCQ) and (, GLAV, UCQ) are in , as required.
Upper bound for the decision problem CheckTotalExtension
CheckTotalExtension is in.
We recall that CheckTotalExtension is the problem of deciding, given an OBDM system and a pwf of on , whether can be extended to a function f from to witnessing that , i.e., such that implies for each .
By Lemma 5, we can immediately derive the following non-deterministic algorithm to solve the complement of the CheckTotalExtension decision problem:
Guess a pair , where is a mapping assertion in , and is a function from the variables in to ;
Check whether p is an application of m on D, which amounts to checking whether and is a homomorphism from to D (if not, then return false);
Compute ;
Call an oracle for the decision problem CheckSingleExtension with input Σ, , ;
If the oracle accepts, then return false; otherwise return true.
Notice that the above non-deterministic algorithm uses an oracle for the auxiliary decision problem CheckSingleExtension, which is the problem of deciding, given an OBDM system , a pwf of on , and a set of atoms from , whether can actually be extended to a function f from to such that f is a homomorphism from to , i.e., implies for each . Notice the difference between the decision problems CheckTotalExtension and CheckSingleExtension: while CheckTotalExtension takes as input only Σ and , CheckSingleExtension additionally takes as input the set of atoms .
The termination of the algorithm is straightforward. Furthermore, due to Lemma 5 and the definition of the auxiliary decision problem CheckSingleExtension, it is immediate to verify that the above non-deterministic algorithm solves the complement of CheckTotalExtension, i.e., returns true if and only if cannot be extended to a function f from to such that implies for each . As for its running time, we observe that the size of the guessed p is clearly polynomial with respect to the size of the input, and both checking whether is a homomorphism from to D, and computing the set of atoms, given , can be performed in polynomial time with respect to the size of the input. As a consequence, we derive that the complement of CheckTotalExtension is in with an oracle in , i.e., , where is the computational complexity of CheckSingleExtension. Thus, CheckTotalExtension is in . In the following lemma, we are going to show that CheckSingleExtension is in (thus, ), which concludes the proof of the desired upper bound for the decision problem CheckTotalExtension. □
Upper bound for the decision problem CheckSingleExtension
CheckSingleExtension is in.
Recall that CheckSingleExtension is the problem of deciding, given an OBDM system , a pwf of on , and a set such that , whether can be extended to a homomorphism f from to . In other words, the problem asks for a function that assigns to each variable in a term occurring in such that implies for each , where is the function such that if t is a constant, and if t is a variable.
Obviously, for a homomorphism f from to to exist, we do not need to generate all the atoms in : we actually need at most as many atoms as there are in . Furthermore, due to the forest-shaped form of the canonical structure , each such atom can be generated after polynomially many chase steps of with respect to .
From the above consideration, we can immediately derive the following non-deterministic algorithm to solve the CheckSingleExtension decision problem.
;
For each :
Guess a pair and a sequence of ontology assertions in . Here, is a mapping assertion in and is a function from the variables in to ;
Check whether is a homomorphism from to (if not, then return false);
Compute and set ;
Let ;
For each :
Let be the set of all the fresh atoms obtained by applying the ontology assertion to all the atoms in in a single chase step;
Set ;
End For
End For
Guess a function from the variables in to ;
Let be the function from to such that if t is a constant and if t is a variable;
If f is a homomorphism from to , then return true; otherwise return false.
In a nutshell, for each atom , the algorithm generates a set of atoms in by first computing the guessed application of on , and then by iteratively applying the guessed ontology assertions in the sequence one after the other, each time by applying the considered ontology assertion in a single chase step over all the atoms obtained so far. Finally, the algorithm checks whether the given can be extended to a homomorphism from to , where is the set of all the atoms obtained as above.
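A single chase step of the kind applied in the loop above can be illustrated as follows. The sketch assumes a DL-Lite-style existential axiom ("every instance of concept B has an R-successor") and a fresh-variable naming scheme that records the generating term and role, in the spirit of the naming convention introduced earlier; all names are illustrative, not the paper's exact definitions.

```python
def chase_step(atoms, concept, role):
    """Apply the axiom  concept ⊑ ∃role  once, in a single chase step, to
    every term that is an instance of `concept` and has no role-successor.
    Fresh variable names encode the generating term and the role."""
    atoms = set(atoms)
    fresh = []
    for (pred, *args) in sorted(atoms):
        if pred != concept:
            continue
        t = args[0]
        if any(a[0] == role and a[1] == t for a in atoms):
            continue  # t already has a role-successor: axiom satisfied
        w = f"{t}.{role}"  # fresh variable; name records its derivation
        atoms.add((role, t, w))
        fresh.append(w)
    return atoms, fresh
```

Applying the step twice with the same axiom introduces no further variables, since the axiom is already satisfied after the first application.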
The termination and the correctness of the algorithm are straightforward. It is indeed immediate to verify that the above non-deterministic algorithm solves CheckSingleExtension, i.e., returns true if and only if can be extended to a homomorphism f from to .
As for its running time, as already observed, we can restrict attention to those atoms of that can be generated after polynomially many chase steps of with respect to . Thus, for each , both the guessed and the guessed sequence are of polynomial size with respect to the input, which implies that the overall number of atoms in is of polynomial size with respect to the input. Furthermore, checking whether is a homomorphism from to , computing the set of atoms, applying in sequence the ontology assertions in in single chase steps, and checking whether f is a homomorphism from to (i.e., whether implies for each ) are all steps that can be carried out in polynomial time with respect to the size of the input. This means that the above non-deterministic algorithm runs in polynomial time with respect to the size of the input, and therefore that CheckSingleExtension is in , as required. □
Conclusion and future work
In this paper, we have studied logical separability in OBDM. As a first contribution, we have illustrated a general framework for separability in OBDM, also proposing natural relaxations of the classical separability notion (called here proper separation). In the general vision of separability as a tool for providing global post-hoc explanations of black-box models, we argue that such relaxations (especially the best approximations) are crucial in order to always be able to provide meaningful explanations, even when proper separations do not exist. As a second contribution, by instantiating the general framework with the most common languages used in OBDM, we have provided a comprehensive study of three computational problems associated with the framework, namely Verification, Computation, and Existence. For the decision problems related to Verification and Existence, we have provided tight complexity results, whereas for the Computation problem we have devised two algorithms for computing the two forms of best approximated separations considered in this paper, thus proving that they always exist.
We conclude the paper by discussing some interesting avenues for future work that deserve further investigation.
More expressive scenarios It would be interesting to study extensions of the scenario considered in this paper for the computational problems. For example, one may consider more expressive target query languages that go beyond UCQ, in order to capture proper separations in more cases. Natural candidates for this are , i.e., UCQ with inequalities, first-order logic, and [16]. However, differently from our case, when the considered scenario is extended with one of these query languages, whether or not the UNA is adopted makes a difference.
As other interesting extensions of the considered scenario, one may investigate highly expressive ontology languages, such as [36] or even [35], where the separability task has not yet been studied even in the ontology-enriched query answering setting (i.e., without considering mapping assertions).
Strong separability In [37], separability comes in two flavours: weak separability and strong separability. In weak separability, which is the one we addressed in this paper, the requirement on the negative examples is that none of them is included in the set of certain answers of the separating query. On the other hand, in strong separability, the requirement on the negative examples is that all of them are included in the set of the certain answers of the negation of the separating query. Thus, a natural problem to be addressed in our considered OBDM scenario is strong separability, by considering relaxations also in this case.
Quantitative metrics for best approximations In this work, we have defined best approximations of the proper separation notion by adopting the set-inclusion metric. Another common choice that would be interesting to investigate is the cardinality criterion. Moreover, for a more fine-grained usage of separability as a tool for explanation, it would be natural to assign weights to both positive and negative examples, and to consider such weights when computing best approximations of proper separations.
Restricted signature A possible constraint that can be imposed when finding a separating query is that the query expression uses only a specific subset of the predicates available in the alphabet of the ontology . This may be important to meet some users’ requirements regarding the separating query: for example, a user may ask for a separating query that does not make use of a specific concept. The separation task under a restricted signature has been studied in [39] in the ontology-enriched query answering setting, in the particular case of expressive ontology languages, such as or . Typically, the computational complexity of checking for a proper separation increases under the restricted signature constraint. Thus, another notable direction is to study separability in the considered OBDM scenario under the restricted signature constraint.
Intelligible queries A delicate question from an end-user perspective is the number of atoms involved in each of the disjuncts of the separating queries. While under both GAV and LAV mappings our algorithms presented for the computation problem ensure that this number of atoms is “only” polynomially related to the size of the mapping and the database, the same is not true in the presence of GLAV mappings. In particular, it remains an interesting open question whether or not, under GLAV mappings, there are cases in which separations (or their best approximated versions) must necessarily be of exponential size with respect to the size of the mapping and the database.
More generally speaking, the separating queries may sometimes be very hard to understand from an end-user perspective. Thus, it would be natural to also consider the length of each disjunct, as well as the number of disjuncts, as additional parameters when considering approximations. This requires novel definitions and techniques that may allow obtaining (approximated) separating queries that are more intelligible from end users’ perspectives.
Implementation Finally, we mention that we are currently implementing the algorithms and techniques proposed in this paper using the OBDM engine Mastro [14]. The implementation we are working on follows the recently introduced Human-in-the-loop Artificial Intelligence (HitAI) paradigm [69]. By running some experiments with a first prototype in real-world settings, we observed that the above-mentioned issues for future work turn out to be crucial, especially the ones concerning intelligible queries and quantitative metrics.
Acknowledgements
This work has been supported by MUR under the PRIN 2017 project HOPE (prot. 2017MMJJRE) and under the PNRR project FAIR (PE0000013), by the EU under the H2020-EU.2.1.1 project TAILOR (grant id. 952215), and by European Research Council under the European Union’s Horizon 2020 Programme through the ERC Advanced Grant WhiteMech (No. 834228). Gianluca Cima has been fully supported by MUR under the PNRR project FAIR (PE0000013).
References
1.
S.Abiteboul, R.Hull and V.Vianu, Foundations of Databases, Addison Wesley Publ. Co., 1995.
2.
D.Angluin, Queries and concept learning, Machine Learning2(4) (1987), 319–342.
3.
M.Arenas and G.I.Diaz, The exact complexity of the first-order logic definability problem, ACM Trans. Database Syst.41(2) (2016), 13. doi:10.1145/2886095.
4.
M.Arenas, G.I.Diaz and E.V.Kostylev, Reverse engineering SPARQL queries, in: Proceedings of the 25th International Conference on World Wide Web, 2016.
5.
A.Artale, D.Calvanese, R.Kontchakov and M.Zakharyaschev, The DL-lite family and relations, Journal of Artificial Intelligence Research36 (2009), 1–69. doi:10.1613/jair.2820.
6.
F.Baader, D.Calvanese, D.McGuinness, D.Nardi and P.F.Patel-Schneider (eds), The Description Logic Handbook: Theory, Implementation and Applications, Cambridge University Press, 2003.
7.
P.Barceló and M.Romero, The complexity of reverse engineering problems for conjunctive queries, in: 20th International Conference on Database Theory (ICDT 2017), M.Benedikt and G.Orsi, eds, Leibniz International Proceedings in Informatics (LIPIcs), Vol. 68, Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2017, pp. 7:1–7:17, ISSN 1868-8969, http://drops.dagstuhl.de/opus/volltexte/2017/7052. ISBN 978-3-95977-024-8. doi:10.4230/LIPIcs.ICDT.2017.7.
8.
A.Bondy and M.R.Murty, Graph Theory, Graduate Texts in Mathematics, Springer, 2008.
9.
E.Botoeva, R.Kontchakov, V.Ryzhikov, F.Wolter and M.Zakharyaschev, Games for query inseparability of description logic knowledge bases, Artificial Intelligence234 (2016), 78–119. doi:10.1016/j.artint.2016.01.010.
10.
E.Botoeva, C.Lutz, V.Ryzhikov, F.Wolter and M.Zakharyaschev, Query inseparability for ALC ontologies, Artif. Intell.272(C) (2019), 1–51. doi:10.1016/j.artint.2018.09.003.
11.
L.Bühmann, J.Lehmann, P.Westphal and S.Bin, DL-Learner structured machine learning on Semantic Web data, in: Companion Proceedings of the The Web Conference 2018, WWW ’18, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2018, pp. 467–471. ISBN 978-1-4503-5640-4. doi:10.1145/3184558.3186235.
12.
S.R.Buss and L.Hay, On truth-table reducibility to SAT, Information and Computation91(1) (1991), 86–102. doi:10.1016/0890-5401(91)90075-D.
13.
A.Calì, G.Gottlob and M.Kifer, Taming the infinite chase: Query answering under expressive relational constraints, Journal of Artificial Intelligence Research48 (2013), 115–174. doi:10.1613/jair.3873.
14.
D.Calvanese, G.De Giacomo, D.Lembo, M.Lenzerini, A.Poggi, M.Rodriguez-Muro, R.Rosati, M.Ruzzi and D.F.Savo, The mastro system for ontology-based data access, Semantic Web Journal2(1) (2011), 43–53. doi:10.3233/SW-2011-0029.
15.
D.Calvanese, G.De Giacomo, D.Lembo, M.Lenzerini and R.Rosati, Tractable reasoning and efficient query answering in description logics: The DL-lite family, Journal of Automated Reasoning39(3) (2007), 385–429. doi:10.1007/s10817-007-9078-x.
16.
D.Calvanese, G.De Giacomo, D.Lembo, M.Lenzerini and R.Rosati, EQL-Lite: Effective first-order query processing in description logics, in: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 2007), 2007, pp. 274–279.
17.
D.Calvanese, G.De Giacomo, M.Lenzerini and M.Y.Vardi, Query processing under GLAV mappings for relational and graph databases, Proceedings of the Very Large Database Endowment6(2) (2012), 61–72.
18.
G.Cima, Preliminary results on ontology-based open data publishing, in: Proceedings of the Thirtieth International Workshop on Description Logics (DL 2017), CEUR Electronic Workshop Proceedings, Vol. 1879, 2017. http://ceur-ws.org/.
19.
G.Cima, Abstraction in Ontology-Based Data Management, Frontiers in Artificial Intelligence and Applications, Vol. 348, IOS Press, 2022.
20.
G.Cima, M.Console, M.Lenzerini and A.Poggi, Abstraction in data integration, in: Proceedings of the Thirty-Sixth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS 2021), IEEE, 2021, pp. 1–11.
21.
G.Cima, M.Console, M.Lenzerini and A.Poggi, Monotone abstractions in ontology-based data management, in: Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI 2022), 2022, pp. 5556–5563.
22.
G.Cima, F.Croce and M.Lenzerini, Query definability and its approximations in ontology-based data management, in: CIKM’21: The 30th ACM International Conference on Information and Knowledge Management, ACM, 2021.
23.
G.Cima, M.Lenzerini and A.Poggi, Semantic characterization of data services through ontologies, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019), 2019, pp. 1647–1653.
24.
G.Cima, M.Lenzerini and A.Poggi, Non-monotonic ontology-based abstractions of data services, in: Proceedings of the Seventeenth International Conference on Principles of Knowledge Representation and Reasoning (KR 2020), 2020, pp. 243–252. doi:10.24963/kr.2020/25.
25.
G.Ciravegna, F.Giannini, M.Gori, M.Maggini and S.Melacci, Human-driven FOL explanations of deep learning, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 2020.
26.
R.Confalonieri, T.Weyde, T.R.Besold and F.M.del Prado Martín, Using ontologies to enhance human understandability of global post-hoc explanations of black-box models, Artif. Intell. (2021).
27.
A.Doan, A.Y.Halevy and Z.G.Ives, Principles of Data Integration, Morgan Kaufmann, 2012. ISBN 978-0-12-416044-6.
28.
R.Fagin, P.G.Kolaitis, R.J.Miller and L.Popa, Data exchange: Semantics and query answering, Theoretical Computer Science336(1) (2005), 89–124. doi:10.1016/j.tcs.2004.10.033.
29.
M.Friedman, A.Levy and T.Millstein, Navigational plans for data integration, in: Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI 1999), AAAI Press, 1999, pp. 67–73.
30.
M.Funk, J.C.Jung and C.Lutz, Frontiers and exact learning of ELI queries under DL-lite ontologies, in: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI 2022), 2022, pp. 2627–2633.
31.
M.Funk, J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Learning description logic concepts: When can positive and negative examples be separated? in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 1682–1688. doi:10.24963/ijcai.2019/233.
32.
M.R.Garey, D.S.Johnson and L.J.Stockmeyer, Some simplified NP-complete graph problems, Theoretical Computer Science1(3) (1976), 237–267. doi:10.1016/0304-3975(76)90059-1.
33.
V.Gutiérrez-Basulto, Y.A.Ibáñez-García, R.Kontchakov and E.V.Kostylev, Queries with negation and inequalities over lightweight ontologies, Journal of Web Semantics35 (2015), 184–202. doi:10.1016/j.websem.2015.06.002.
34.
V.Gutiérrez-Basulto, J.C.Jung and L.Sabellek, Reverse engineering queries in ontology-enriched systems: The case of expressive horn description logic ontologies, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, 2018, pp. 1847–1853. doi:10.24963/ijcai.2018/255.
35.
I.Horrocks, O.Kutz and U.Sattler, The even more irresistible SROIQ, in: Proceedings of the Tenth International Conference on the Principles of Knowledge Representation and Reasoning (KR 2006), 2006, pp. 57–67.
36.
I.Horrocks, U.Sattler and S.Tobies, Reasoning with individuals for the description logic SHIQ, in: Proceedings of the Seventeenth International Conference on Automated Deduction (CADE 2000), D.McAllester, ed., Lecture Notes in Computer Science, Vol. 1831, Springer, 2000, pp. 482–496.
37.
J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Logical separability of incomplete data under ontologies, in: Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning, KR 2020, Rhodes, Greece, 2020.
38.
J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Separating data examples by description logic concepts with restricted signatures, in: Proceedings of the Eighteenth International Conference on Principles of Knowledge Representation and Reasoning, International Joint Conferences on Artificial Intelligence Organization, 2021.
39.
J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Separating data examples by description logic concepts with restricted signatures, in: Proceedings of the 18th International Conference on Principles of Knowledge Representation and Reasoning, KR 2021, Online Event, November 3–12, 2021, 2021, pp. 390–399. doi:10.24963/kr.2021/37.
40.
J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Logical separability of labeled data examples under ontologies, Artificial Intelligence313 (2022), 103785. doi:10.1016/j.artint.2022.103785.
41.
D.V.Kalashnikov, L.V.S.Lakshmanan and D.Srivastava, FastQRE: Fast query reverse engineering, in: Proceedings of the 2018 International Conference on Management of Data, Association for Computing Machinery, 2018.
42.
B.Konev, A.Ozaki and F.Wolter, A model for learning description logic ontologies based on exact learning, in: AAAI, 2016.
43.
M.Law, A.Russo and K.Broda, Inductive learning of answer set programs, in: Logics in Artificial Intelligence, 2014.
44.
J.Lehmann and P.Hitzler, Concept learning in description logics using refinement operators, Mach. Learn. (2010).
45.
M.Lenzerini, Data integration: A theoretical perspective, in: Proceedings of the Twentyfirst ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 2002), 2002, pp. 233–246.
46.
M.Lenzerini, Ontology-based data management, in: Proceedings of the 20th ACM Conference on Information and, Knowledge Management, CIKM, 2011.
47.
M.Lenzerini, Ontology-based data management, in: Proceedings of the Twentieth International Conference on Information and Knowledge Management (CIKM 2011), 2011, pp. 5–6. doi:10.1145/2063576.2063582.
48.
M.Lenzerini, Managing data through the lens of an ontology, AI Magazine39(2) (2018), 65–74. doi:10.1609/aimag.v39i2.2802.
49.
J.Liartis, E.Dervakos, O.M.Mastromichalakis, A.Chortaras and G.Stamou, Semantic queries explaining opaque machine learning classifiers, in: Proceedings of the Workshop on Data Meets Applied Ontologies in Explainable AI, 2021.
50.
C.Lutz, J.Marti and L.Sabellek, Query expressibility and verification in ontology-based data access, in: Principles of Knowledge Representation and Reasoning: Proceedings of the Sixteenth International Conference (KR 2018), 2018, pp. 389–398.
51.
D.M.L.Martins, Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities, Information Systems83 (2019), 89–100. doi:10.1016/j.is.2019.03.002.
52.
D.M.L.Martins, Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities, Information Systems83 (2019), 89–100. doi:10.1016/j.is.2019.03.002.
53.
B.Motik, B.Cuenca Grau, I.Horrocks, Z.Wu, A.Fokoue and C.Lutz, OWL 2 Web Ontology Language Profiles (Second Edition), W3C Recommendation, World Wide Web Consortium, 2012, available at http://www.w3.org/TR/owl2-profiles/.
54.
D.Mottin, M.Lissandrini, Y.Velegrakis and T.Palpanas, New trends on exploratory methods for data analytics, Proc. VLDB Endow.10(12) (2017), 1977–1980. doi:10.14778/3137765.3137824.
55.
M.Ortiz, Ontology-mediated queries from examples: A glimpse at the DL-Lite case, in: Proceedings of the Fifth Global Conference on Artificial Intelligence, EPiC Series in Computing, Vol. 65, 2019, pp. 1–14.
56.
C.H.Papadimitriou and M.Yannakakis, The complexity of facets (and some facets of complexity), Journal of Computer and System Sciences28(2) (1984), 244–259. doi:10.1016/0022-0000(84)90068-0.
A.Poggi, D.Lembo, D.Calvanese, G.De Giacomo, M.Lenzerini and R.Rosati, Linking data to ontologies, Journal on Data SemanticsX (2008), 133–173.
59.
J.Rothe, Exact complexity of exact-four-colorability, Information Processing Letters87(1) (2003), 7–12. doi:10.1016/S0020-0190(03)00229-1.
60.
M.K.Sarker, N.Xie, D.Doran, M.Raymer and P.Hitzler, Explaining trained neural networks with Semantic Web technologies: First steps, in: Proceedings of the Twelfth International Workshop on Neural-Symbolic Learning and Reasoning, 2017.
61.
L.J.Stockmeyer, The polynomial-time hierarchy, Theoretical Computer Science3(1) (1976), 1–22. doi:10.1016/0304-3975(76)90061-X.
62.
L.J.Stockmeyer, The polynomial-time hierarchy, Theoretical Computer Science3(1) (1976), 1–22. doi:10.1016/0304-3975(76)90061-X.
63.
B.ten Cate and V.Dalmau, The product homomorphism problem and applications, in: Proceedings of the Eigthteenth International Conference on Database Theory (ICDT 2015), LIPIcs, Vol. 31, 2015, pp. 161–176.
64.
B.ten Cate and V.Dalmau, Conjunctive queries: Unique characterizations and exact learnability, in: 24th International Conference on Database Theory (ICDT 2021), Vol. 186, 2021.
65.
Q.T.Tran, C.-Y.Chan and S.Parthasarathy, Query Reverse Engineering, The VLDB Journal23(5) (2014).
66.
K.W.Wagner, More complicated questions about maxima and minima, and some closures of NP, Theoretical Computer Science51 (1987), 53–80. doi:10.1016/0304-3975(87)90049-1.
Y.Y.Weiss and S.Cohen, Reverse engineering SPJ-queries from examples, in: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2017.
M.M.Zloof, Query-by-example: The invocation and definition of tables and forms, in: Proceedings of the 1st International Conference on Very Large Data Bases, VLDB ’75, ACM, New York, NY, USA, 1975, pp. 1–24. ISBN 978-1-4503-3920-9. doi:10.1145/1282480.1282482.