Given two datasets, i.e., two sets of tuples of constants, representing positive and negative examples, logical separability is the reasoning task of finding a formula in a certain target query language that separates them. As already pointed out in previous work, this task is relevant in several application scenarios, such as concept learning and generating referring expressions. Besides, if we think of the input datasets of positive and negative examples as composed of tuples of constants classified, respectively, positively and negatively by a black-box model, then the separating formula can be used to provide global post-hoc explanations of such a model. In this paper, we study the separability task in the context of Ontology-based Data Management (OBDM), in which a domain ontology provides a high-level, logic-based specification of a domain of interest, semantically linked through suitable mapping assertions to the data source layer of an information system. Since a formula that properly separates two input datasets (a proper separation) does not always exist, our first contribution is to propose (best) approximations of proper separation, called (minimally) complete and (maximally) sound separations. We do this by presenting a general framework for separability in OBDM. Then, in a scenario that uses by far the most popular languages for the OBDM paradigm, our second contribution is a comprehensive study of three natural computational problems associated with the framework, namely Verification (check whether a given formula is a proper, complete, or sound separation of two given datasets), Existence (check whether a proper or best approximated separation of two given datasets exists at all), and Computation (compute a proper or best approximated separation of two given datasets).
The separability problem deals with finding an intensional representation of two datasets, i.e., sets of data items, interpreted as positive and negative examples. One is given two sets of data items, one with positive and the other with negative examples, and is asked to provide a query whose evaluation over the database contains all the data items in the set of positive examples and none of the data items in the set of negative examples. We say that a solution to this problem is a query that separates the given datasets. A special case of this problem arises when only one set of positive examples is given as input, and one is interested in finding a query whose evaluation over the database coincides exactly with the data items in such a set. In this paper, we refer to the latter special case with the term characterizability, and we say that a solution to this problem is a query that characterizes the given dataset.
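To make the two notions concrete, here is a minimal Python sketch (the representation of answers as sets of tuples of constants and the function names are our own; a precomputed answer set stands in for actual query evaluation over the database):

```python
def separates(answers, positives, negatives):
    """A query separates the two datasets iff its evaluation contains
    every positive example and no negative example."""
    answers = set(answers)
    return positives <= answers and answers.isdisjoint(negatives)

def characterizes(answers, dataset):
    """Characterizability is the special case in which the evaluation
    must coincide exactly with the single input dataset."""
    return set(answers) == set(dataset)

positives = {("john",), ("mary",)}
negatives = {("matt",)}
answers = {("john",), ("mary",), ("lucy",)}
assert separates(answers, positives, negatives)   # lucy is neither positive nor negative
assert not characterizes(answers, positives)      # the evaluation also returns lucy
```

Note that a separating query may return tuples outside both input datasets (such as `("lucy",)` above), whereas a characterizing query may not.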
The separability problem was initially studied for relational databases and is known in that community as the query-by-example problem.
On the other hand, the characterizability problem is known in the literature as query definability.
Over the years, researchers have found several interesting applications of the separability problem, spanning from simplifying query formulation by non-experts to debugging facilities for data engineers. Indeed, the problem has been studied as a useful tool for data exploration, concept learning, data analysis, usability, data security, and more [51,54]. Moreover, as already observed in [37], the problem has been studied in two special cases in which the input datasets consist of a single tuple. In separability, this special case has been studied for entity comparison in RDF graphs, where the goal is to find a meaningful description that separates one entity from another. Similarly, in characterizability, this special case has been studied for generating referring expressions (GRE), where one is interested in describing a single data item by a logical expression that separates it from all other data items. With the rise of Machine Learning (ML), we argue that this topic acquires primary importance for providing meaningful explanations of any typical supervised black-box model used for classification tasks. When applied to classification, the ultimate goal of supervised learning is to construct models that are able to predict the target output (i.e., the class) of the proposed inputs. To achieve this, the learning algorithm is provided with training examples that demonstrate the intended relation between input and output values. Then, the learned model is supposed to correctly classify instances that were not shown during training. A crucial problem for the wise and safe adoption of ML-based black-box models is that, especially in high-risk domains such as healthcare and finance, it is often very hard to understand the rationale behind a classification made by these models. This may lead to unintended discriminatory biases in the classification and, more surprisingly, biases of which the designers were unaware.
In this paper, we assume that the classification task is performed in an organization that adopts an Ontology-based Data Management (OBDM) approach [46,48]. (We prefer the acronym OBDM over the similar OBDA, which stands for Ontology-based Data Access [58], because data access is just one aspect, although one of the most important, of the more general notion of data management.) OBDM is a paradigm for accessing data through a conceptual representation of the domain of interest expressed as an ontology. The OBDM paradigm relies on a three-level architecture, consisting of the data layer, the ontology, and the mapping between the two. Consequently, an OBDM specification is a triple which, together with an -database D, forms a so-called OBDM system . We tackle the separability problem by leveraging the notion of evaluation of a query with respect to an OBDM system, in turn based on the notion of certain answers to a query over an OBDM system. Intuitively, given an OBDM system and two D-datasets and , our goal is to derive a query expression over that separates and in Σ (called here a proper separation).
Let be the OBDM specification in which asserts that both math students and foreign students are students, and the mapping consists of the following assertions:
Consider now the OBDM system , where J is the OBDM specification illustrated above and D is the -database . Let the D-datasets of positive and negative examples be and , respectively. One can see that the query over separates and in Σ because its set of certain answers over Σ contains all the positive examples in and none of the negative examples in .
An important contribution of our work is to provide approximated results for all the cases in which it is not possible to provide a separating query. We argue that, in these cases, reasonable and useful ontological characterizations can still be provided. We propose to resort to suitable approximations of the proper separating query, by introducing the notions of sound and complete separating queries. The former is a query whose certain answers have empty intersection with the D-dataset , whereas the certain answers of the latter form a superset of the D-dataset . Obviously, we are interested in computing the best approximated separating queries, which we call maximally sound and minimally complete separations, respectively. A maximally sound (resp., minimally complete) separation is a sound (resp., complete) separation such that no other sound (resp., complete) separation exists that better approximates the input datasets. Moreover, for all our results we cover the special cases in which the input datasets consist of a single tuple, and we refer to them as the single-tuple variants of the problems we deal with.
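The three notions can be sketched as simple set conditions on the certain answers (a hedged Python illustration with our own names; `cert` stands for the set of certain answers of the candidate query, and "better approximates" is assessed by set inclusion of certain answers):

```python
def is_complete(cert, positives):
    """Complete separation: the certain answers cover every positive
    example (no missed positives, but possibly some captured negatives)."""
    return positives <= cert

def is_sound(cert, negatives):
    """Sound separation: the certain answers avoid every negative
    example (no captured negatives, but possibly some missed positives)."""
    return cert.isdisjoint(negatives)

def is_proper(cert, positives, negatives):
    """Proper separation: complete and sound at the same time."""
    return is_complete(cert, positives) and is_sound(cert, negatives)

positives = {("mary",), ("jane",)}
negatives = {("matt",)}
assert is_complete({("mary",), ("jane",), ("matt",)}, positives)  # complete, not sound
assert is_sound({("mary",)}, negatives)                           # sound, not complete
assert is_proper({("mary",), ("jane",)}, positives, negatives)
```

Intuitively, a maximally sound separation is a sound one whose certain answers cannot be enlarged without hitting a negative example, and dually for minimally complete separations.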
In this context, the training set used in the classification task, which is a collection of data items labeled as positive and negative examples, is seen as two sets of tuples over the database schema. The query derived by solving the separability problem provides an intensional definition of such a training set, and can be considered an explanation of the intensional properties of the training set. The same principle can also be applied to a set of tuples that has not been seen during the training of the model. In this scenario, one can consider the black-box model as an oracle that assigns a class to all given tuples. Then, the query derived by solving the separability problem in this new context can be considered an explanation of the intrinsic behaviour of the model. Traditionally, there are two different types of explanations: global and local. We refer to the former for explanations of the general behaviour of the model, and to the latter for explanations of the output of the model with respect to a specific object. The present work lays the foundations for providing both kinds of explanations. Indeed, it deals with global explanations when the separating query is sought with respect to positive and negative examples containing an arbitrary number of tuples, and with local explanations when the sets of positive and negative examples for which one searches for a separating query contain a single tuple.
Our procedure fits into the definition of post-hoc explanations of black-box models, i.e., a set of techniques aimed at approximating the behaviour of a black-box model with a surrogate interpretable model. We now describe how, in our context, the role of the surrogate model is played by the query resulting from the solution of the separability problem. Suppose an organization that adopts the OBDM paradigm wants to train a model for predicting which candidates in a selection process are the most likely to perform well in a certain job. For training the model, the organization is given the curricula of current employees together with feedback on their performance, which makes it possible to divide the training set into two classes: the good and the bad performers. For the sake of simplicity, suppose John, Mary, and Jane, who studied Biology, Medicine, and Math respectively, perform well. Suppose also that Matt, Angeline, and Jess, who studied Music, Linguistics, and Fashion respectively, perform badly. The ML-based black-box model is trained on this dataset and optimised to reach the highest possible accuracy. Now suppose some candidates apply for the job and are evaluated by the model. The model states that Lucy, Mara, and George, who studied Math, Chemistry, and Physics respectively, have a high probability of performing well in the job. At the same time, it states that Lucas and Paul, who studied Art and Classics respectively, are believed to perform badly. The organization now wonders why the black-box model divided the candidates in this way. By instantiating a separability problem with the two sets of positive and negative candidates, the organization finds out that, no matter how sophisticated the internal details of the black-box model are, the resulting classification is such that all positive candidates are answers to the query “return all the candidates with a scientific background”, and none of the negative candidates are.
This separating query provides an intuitive explanation of the actual behaviour of the model. Of course, a given instance of the separability problem may in general admit many valid separating queries, just as in many cases it may admit no valid separating query at all.
Contributions of the paper The contributions of this paper can be summarized as follows:
We present a formal framework for separability in OBDM. In particular, we first cast the classical notion of separating query in the OBDM context (called here proper separation), and then we propose the relaxations mentioned above (complete separation and sound separation) as well as their optimal versions (minimally complete separation and maximally sound separation). We do exactly the same for the special case of characterization, which deals only with a dataset of positive examples rather than both positive and negative examples.
We study the Verification problem for both separability and characterizability in OBDM, i.e., checking whether a given query is a proper, complete, or sound separation (resp., characterization) of two given datasets (resp., a given dataset). More specifically, we introduce three families of decision problems for both separability and characterizability, called X-(, , ) (resp., X-(, , )) for X ∈ {Proper, Complete, Sound}, which are parametric with respect to an ontology language , a mapping language , and a query language . We provide tight computational complexity bounds for the most common languages used in OBDM, i.e., is , is GLAV, and is UCQ. The results are summarized in Table 1.
We study the Computation problem. We provide two algorithms that, taking as input an OBDM system and two D-datasets and , where is a ontology and is a GLAV mapping, return, respectively, a UCQ-minimally complete Σ-separation of and and a UCQ-maximally sound Σ-separation of and . As a consequence, this proves that in the scenario under consideration the two best approximated versions of proper separations always exist.
We study the Existence problem for both separability and characterizability in OBDM. Since for the scenario under consideration their two best approximated versions always exist, we only focus on the existence of a proper separation (resp., characterization), i.e., checking whether a proper separation (resp., characterization) in a target query language of two given datasets (resp., a given dataset) exists at all. More specifically, we introduce a family of decision problems for both separability and characterizability, called (, , ) (resp., (, , )), which are parametric with respect to an ontology language , a mapping language , and a query language . Also in this case, we provide tight computational complexity bounds for the most common languages used in OBDM. The results are summarized in Table 2.
For X ∈ {Proper, Complete, Sound}, the table reports the exact computational complexity of X-(, , ) and X-(, , ) when is , is GLAV, and is UCQ. We point out that all the lower bounds already hold for the single-tuple version of the considered problem (i.e., when all the input datasets consist of a single tuple) and when is ∅ (i.e., the ontology language allowing only for ontologies without assertions), is GAV∩LAV (i.e., the mapping language allowing only for assertions that are both GAV and LAV), and is CQ.
X-(, GLAV, UCQ) / X-(, GLAV, UCQ)
X = Complete: -complete
X = Sound: -complete
X = Proper: -complete
For , the table reports the exact computational complexity of and when is and is UCQ. Again, all the lower bounds already hold for the single-tuple version of the considered problem and when is ∅. Moreover, such lower bounds hold for both and
-complete
-complete
-complete
This paper is an extended version of our CIKM’21 conference paper [22]. We point out that the conference paper focused only on the characterizability reasoning task, whereas here we illustrate a formal framework and study the related computational problems for both separability and characterizability. Furthermore, in this extended version we provide the full proofs that were only sketched in [22]. Finally, we observe that the decision problem was incorrectly claimed to be -complete in the conference version of this paper. In this extended version, we fix this error and show that this decision problem is actually -complete.
To the best of our knowledge, the present work is the first to introduce separability and characterizability in the OBDM context. In particular, while separability and characterizability have been studied in various ontology-enriched query answering settings, this is the first work that takes into account the full layered stack of the OBDM paradigm, thus including also the mapping layer, and that proposes and studies natural relaxations of the separability and characterizability notions. Nevertheless, we remark that in both the Computation and the Existence problems for separability and characterizability we focus mainly on UCQ as a query language, while other works on the subject also investigate the typically more challenging scenario of CQ as a query language (see, e.g., [7,30,34,64]).
Finally, we mention that, since is insensitive to the adoption of the unique name assumption (UNA) for UCQ answering [5], all our results hold both with and without the UNA.
Outline of the paper The paper is organized as follows. After the discussion of related work in Section 2, Section 3 introduces the relevant background for our study, and Section 4 illustrates a formal framework for separation in OBDM. Then, Sections 5, 6, and 7 present the results on the three computational problems Verification, Computation, and Existence, respectively. We deal with the Computation problem before the Existence problem because the solutions we provide for the former will help us solve some tasks related to the latter. Finally, Section 8 concludes the paper with a perspective on future work.
Related work
Query definability and query-by-example have long been studied for classical databases, starting from [70] up to many recent studies [3,7,41,52,64,65,68]. Our analysis of the complexity of our decision problems builds on the results in [3,7]. We found the work in [65] inspirational, since it deals with the problem of deriving a metric for establishing how close a non-proper separating query is to the optimal solution. We also found the survey in [52] important for the whole research problem, as it presents a well-described motivating example and outlines the main open problems in this area.
Our work has also been inspired by the notion of Abstraction [19,20,23]. Although the goal there is still to derive a query expression over the ontology, in that work the input is a query over the data layer of the OBDM architecture, whereas in query definability (resp., query-by-example) the input is a set of tuples (resp., two sets of tuples). It follows that the two tasks are completely different and require different technical solutions. We postpone a detailed comparison between Abstraction and the separability notion investigated in this paper to Section 4.1, when we will have all the technical tools in hand to provide an in-depth comparison.
The works closest to ours are those in [4,34,55]. In [4], the authors study the existence and verification problems for both query definability and query-by-example, as well as the computation problem for the query definability case. In their work, the ontology is expressed as an RDF graph, and they consider several fragments of SPARQL as the query language to be used for separating the examples. Differently from the present work, they do not aim at finding the best approximated separation. We share with [55] the expressive power of the language used for the ontology (DL-Lite) and of the language used for the query separating the input examples (UCQs). However, the work in [55] does not deal with the cases in which proper separations of the input examples do not exist, and it is based on a slightly different framework from ours: since they study the problems in the context of ontology-mediated queries, they do not have the mapping layer, which instead is part of our more general OBDM paradigm.
The work in [34] studies the query-by-example problem for expressive Horn description logic ontologies, namely Horn-. Apart from the clear difference in the expressive power of the ontologies, their work does not consider the case of approximated separations of the input examples, and it does not consider the mapping layer.
This work has also been inspired by [11,31,38,44]. The goal of all these papers is to learn a concept expression that best captures a given set of examples (or two sets of positive and negative examples for query-by-example). We differ from these works because our goal is to derive a full-blown query that separates the input examples.
The problem of checking whether there exists a formula separating positive and negative examples in the presence of an ontology has recently been studied in [37,40]. Besides verifying that the ontology entails the sought formula for all positive examples, the authors conducted an in-depth analysis of the so-called separability problem, accounting for both weak separability, i.e., the notion we study in this paper, and strong separability, i.e., checking whether the negation of the separating formula is entailed by the KB. They also consider the case of enriching the separating formula with helper symbols that are not originally present in the ontology, and study the complexity of the decision problems for a wide range of languages for expressing both the ontology and the separating formula. In this paper, we are interested in studying the weak separability problem in the context of OBDM, also considering the cases in which the separating formula does not exist and one wants to search for sound and complete approximations of it.
Another important line of research related to the present work is that regarding post-hoc explanations of opaque machine learning models. As also highlighted by other works [25,26,49,60], the query-by-example problem can easily be adapted for explaining the output of a black-box machine learning classification model. For example, consider the case of a binary classifier labelling a set of examples with two classes, 1 and 0. In this scenario, the solution of the query-by-example problem is considered a surrogate of the machine learning model, so that the examples labelled with class 1 are answers to the reverse-engineered query, while none of the examples labelled with class 0 are. The query thus acts as an explanation because it provides a more human-understandable way of classifying the given examples into the two classes, especially in our framework, where the query is based on the knowledge of the ontology. Although relevant to our work, all the above-cited papers differ from it. In [60], the authors map the inputs of the machine learning classifier to the ontology and then use a concept learning tool to find a class expression over the ontology that best describes the positive examples. In [25], for explaining the behaviour of a black-box classifier, the authors build another black-box classifier (a neural network) and then project the output of this latter model onto a so-called rule space, where each coordinate represents the activation of a rule described in First Order Logic. In [26], the authors present the TREPAN algorithm, i.e., a way of building a decision tree whose nodes are linked to an ontology, which is used as a means for explaining the input positive and negative examples. We consider the work in [49] relevant to ours, even though it is rather preliminary and does not specify many important details, such as the expressive power of the language of the ontology and the language of the query they search for.
The biggest differences with the present work are that they do not consider the mapping layer between the ontology and the data, and that they do not focus on the notions of maximally sound and minimally complete solutions for the cases in which a perfect solution does not exist. On the contrary, they define the best approximated query as the one minimising the Jaccard distance between the answers of the query and the set of positive examples in the input.
Inductive Logic Programming (ILP) [43] has long been considered related to the query definability and query-by-example tasks. We also considered it inspiring for our work, but we soon acknowledged that the expressive powers of the languages used for representing the knowledge base are incomparable, and that ILP is interested in searching for explanations of a set of logical facts rather than a set of tuples.
The Active Learning task initially introduced in [2] has been studied as a possible framework for learning queries from examples in relational databases [64] and in the presence of ontologies [30]. Moreover, instances of the framework in [2] can be found in [42] and [57]. In particular, the goal of [42] is to learn an ontology that is equivalent to a target ontology, while the goal of [57] is to learn an ontology that is query inseparable with respect to a target ontology and a query language , i.e., given a set of ground atoms (ABox) , the learned ontology must be such that if and only if , for every .
Finally, query inseparability has been studied even outside the learning task. For example, [9] studies query inseparability for (fragments of) Horn- as ontology language and CQ as a query language, whereas [10] studies query inseparability for both the ontology languages and and for (fragments of) UCQ as a query language.
Preliminaries
We recall some notions about relational databases [1], Description Logics (DLs) [6], and Ontology-based Data Management (OBDM) [47]. We define , , , and to be the pairwise disjoint, countably infinite sets of symbols for database predicates, ontology predicates, constants, and variables, respectively. We further assume that is partitioned into the disjoint sets and for atomic concepts and atomic roles, respectively.
Databases, datasets, and queries A relational database schema (or simply schema) is a finite subset of . Given a schema , an -database D is a finite set of facts (a.k.a. ground atoms) of the form , where s is an n-ary predicate in , and is an n-tuple of constants from . We denote by the finite subset of of those constants occurring in D. Given a schema and an -database D, a D-dataset λ of arity n is simply a finite set of n-tuples of constants occurring in D, i.e., .
A query over a schema is an expression in a certain query language using the predicate symbols of , whose predicate arguments are variables from , i.e., we disallow constants from occurring in queries. Each query has an associated arity. The evaluation of a query of arity n over an -database D is a set of answers, each being an n-tuple of constants occurring in , i.e., . A query of arity 0 over a schema is called a boolean query, and we denote by (resp., ) the fact that (resp., ).
Following the terminology of [7,63], we say that a query over a schema explains two D-datasets and (resp., defines a D-dataset) inside an -database D if and (resp., ). We also say that and are -explainable (resp., is -definable) inside D, for a query language , if there exists a query that explains and (resp., defines ) inside D.
We are particularly interested in conjunctive queries and unions thereof. A conjunctive query (CQ) over a schema is an expression of the form such that (i) , called the target list of , is an n-tuple of distinguished variables from , where n is the arity of ; (ii) is an m-tuple of existential variables from ; and (iii) , called the body of , is a finite conjunction of atoms of the form , where s is a p-ary predicate in and is either a distinguished or an existential variable, i.e., for each . A union of conjunctive queries (UCQ) over a schema is a finite set of CQs over with the same arity, called its disjuncts.
For a conjunction of atoms , we denote by the set of all the atoms occurring in ϕ. For a set of atoms and a tuple of constants, we denote by the CQ , where (i) is the conjunction of all the atoms occurring in the set of atoms , where is obtained from by replacing everywhere each constant occurring in with a fresh variable and each constant c not occurring in with a fresh variable , (ii) , and (iii) is the tuple of all variables occurring in that do not occur in .
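This construction can be illustrated with a small Python sketch (our own representation: atoms are pairs of a predicate name and a tuple of constants, and the fresh variable for a constant c is written "?c", reused everywhere c occurs, as in the definition above):

```python
def atoms_to_cq(atoms, target):
    """Turn a set of ground atoms into a CQ: every constant is replaced
    everywhere by one fresh variable; constants in `target` become the
    distinguished variables, all remaining ones are existential."""
    var = lambda c: f"?{c}"
    body = [(pred, tuple(var(c) for c in args)) for pred, args in sorted(atoms)]
    distinguished = tuple(var(c) for c in target)
    return distinguished, body

head, body = atoms_to_cq({("Student", ("john",)), ("knows", ("john", "mary"))},
                         ("john",))
# head is ("?john",); "?mary" occurs only in the body, hence is existential
```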
Given a set of atoms , we denote by the set of all constants and variables occurring in . Observe that . Let and be two sets of atoms. We say that a function is a homomorphism from to if , where is the image of under h, i.e., with for each atom . For two sets of atoms and and two tuples of terms and , we write if there is a function h from to such that (i) h is a homomorphism from to , and (ii) (where, for a tuple of terms , ); we write otherwise.
We define the semantics of (U)CQs in terms of homomorphisms. For an -database D and a CQ over of arity n, we let be the set of n-tuples of constants occurring in D for which . Furthermore, for an -database D and a UCQ over , we let .
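The homomorphism semantics can be made concrete with a short Python sketch (again with our own encoding: facts and body atoms are (predicate, tuple) pairs, and query variables are strings such as "?x"; since queries contain no constants, every atom argument is treated as a variable):

```python
def homomorphisms(body, facts):
    """Enumerate variable-to-constant assignments under which every body
    atom of the CQ maps to some fact of the database."""
    def extend(atoms, h):
        if not atoms:
            yield dict(h)
            return
        (pred, args), rest = atoms[0], atoms[1:]
        for fact_pred, fact_args in facts:
            if fact_pred != pred or len(fact_args) != len(args):
                continue
            h2, ok = dict(h), True
            for v, c in zip(args, fact_args):
                if h2.get(v, c) != c:   # an already-bound variable must agree
                    ok = False
                    break
                h2[v] = c
            if ok:
                yield from extend(rest, h2)
    yield from extend(list(body), {})

def eval_cq(head, body, facts):
    """The evaluation of a CQ: project each homomorphism onto the target list."""
    return {tuple(h[v] for v in head) for h in homomorphisms(body, facts)}

def eval_ucq(cqs, facts):
    """The evaluation of a UCQ is the union of the evaluations of its disjuncts."""
    return set().union(*(eval_cq(head, body, facts) for head, body in cqs))
```

For instance, with `facts = {("Student", ("john",)), ("MathStudent", ("mary",))}`, the CQ with head `("?x",)` and body `[("MathStudent", ("?x",))]` evaluates to `{("mary",)}`.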
Syntax and semantics of In this paper, a DL ontology (or simply ontology) is a TBox (“Terminological Box”) expressed in a specific DL, that is, a finite set of assertions stating general properties of atomic concepts and roles built according to the syntax of the specific DL, which represents the intensional knowledge of a modeled domain. The alphabet of an ontology is the finite subset of of atomic concepts and atomic roles mentioned in the assertions of , and we assume that every ontology comprises the special bottom concept ⊥ in its alphabet. In this paper, whenever we speak of a query over an ontology , we implicitly mean a query in a certain query language that uses the atomic concepts and the atomic roles in the alphabet of as predicate symbols.
The semantics of DL ontologies is specified through the notion of first-order interpretations: an interpretation is a pair , where the interpretation domain is a non-empty, possibly infinite set of objects, and the interpretation function assigns to each constant an object , to each atomic concept a subset of (with the requirement ), and to each atomic role a subset of . We say that an interpretation satisfies an ontology , denoted by , if satisfies every assertion in . When convenient, with a slight abuse of notation, we treat an interpretation as the (potentially infinite) set of facts of the form and that are true in, i.e., that are such that and .
We are particularly interested in DL ontologies expressed in , the member of the family [15] that underpins , i.e., the profile especially designed for efficient query answering [53]. A ontology is a finite set of assertions of the form:
where , are basic concepts, i.e., expressions of the form A, , or , with and , and and basic roles, i.e., expressions of the form P, or with . We assume that the special bottom concept ⊥ never occurs in the right-hand side of inclusion assertions. This is without loss of generality, since each inclusion assertion of the form is logically equivalent to the disjointness assertion .
For the constructs of , the interpretation function of an interpretation extends to basic concepts and basic roles as follows: and . Finally, an interpretation satisfies a concept inclusion assertion (respectively, role inclusion assertion ) if (respectively, ), and it satisfies a concept disjointness assertion (respectively, role disjointness assertion ) if (respectively, ).
Syntax and semantics of ontology-based data management According to [47,58], an Ontology-based Data Management (OBDM) specification is a triple , where is a DL ontology, is a relational database schema, also called source schema, and is a mapping, i.e., a finite set of assertions relating the source schema to the ontology . An OBDM system is a pair , where is an OBDM specification and D is an -database. For readability purposes, with a slight abuse of notation, we sometimes denote an OBDM system as a quadruple , where is an OBDM specification and D is an -database.
In this paper, each assertion in the mapping component of an OBDM system is a Global-And-Local-As-View (GLAV) assertion [27], that is, an assertion of the form , where and are CQs over and over , respectively, with the same target list . Special cases of GLAV assertions highly considered in the data integration literature are Global-As-View (GAV) and Local-As-View (LAV) assertions [45]: in a GAV (resp., LAV) assertion, (resp., ) is simply an atom without existential variables. Finally, a GLAV (resp., GAV, LAV, GAV∩LAV) mapping consists of a finite set of GLAV (resp., GAV, LAV, both GAV and LAV) assertions.
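To illustrate how mapping assertions relate source facts to ontology facts, here is a hedged Python sketch for the simplest GAV∩LAV case, where both sides of each assertion are a single atom (the predicate names `s_math` and `MathStudent` are hypothetical, in the spirit of the students example; real GLAV assertions allow CQs and existential variables on either side):

```python
def apply_gav_assertions(mapping, database):
    """Apply GAV∩LAV assertions (one source atom mapped to one ontology
    atom): each source fact matching an assertion body yields an ontology fact."""
    abox = set()
    for (src_pred, src_vars), (ont_pred, ont_vars) in mapping:
        for fact_pred, fact_args in database:
            if fact_pred != src_pred or len(fact_args) != len(src_vars):
                continue
            binding, ok = {}, True
            for v, c in zip(src_vars, fact_args):
                if binding.get(v, c) != c:  # a repeated variable must bind consistently
                    ok = False
                    break
                binding[v] = c
            if ok:
                abox.add((ont_pred, tuple(binding[v] for v in ont_vars)))
    return abox

mapping = [(("s_math", ("x",)), ("MathStudent", ("x",)))]
database = {("s_math", ("mary",)), ("s_cs", ("john",))}
# apply_gav_assertions(mapping, database) == {("MathStudent", ("mary",))}
```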
Given a GLAV mapping relating a source schema to an ontology , an interpretation , and an -database D, we denote by the fact that implies for each possible mapping assertion in and for each possible tuple of constants occurring in D. Here, denotes the evaluation of over the interpretation seen as a (potentially infinite) set of facts.
As customary in OBDM, we define the semantics of OBDM systems by specifying the first-order models of the OBDM specification relative to the -database D. Specifically, we say that an interpretation is a model of an OBDM system if (i) , and (ii) . Finally, we say that an OBDM system Σ is consistent if it has at least one model, and inconsistent otherwise.
For an OBDM system , with , and a query over , we denote by the set of certain answers of w.r.t. Σ, i.e., the set of tuples of constants occurring in D such that for each model of Σ. If Σ is inconsistent, then the set of certain answers of any query over w.r.t. Σ is simply the set of all possible tuples of constants occurring in D whose arity is that of the query. Finally, we say that two queries and are equivalent w.r.t. an OBDM system if .
Query rewriting Given a UCQ over a ontology , we denote by the UCQ computed by executing the algorithm [15] on and (slightly extended to deal also with the bottom concept ⊥). In a nutshell, the algorithm uses the concept/role inclusions of the input ontology as rewriting rules applied to the input UCQ . In this way, it compiles into the query all the knowledge provided by that is relevant to answering the query. repeatedly applies the following two steps until a fixpoint is reached: step (i) uses concept/role inclusions as rewriting rules to rewrite query atoms one by one, each time producing a new CQ to be added to the final output; step (ii) unifies query atoms to enable further executions of step (i). For additional details, we refer the reader to [15].
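To convey the flavour of step (i) only, the following sketch is a drastic simplification of the algorithm of [15] (ours): it handles only atomic concept inclusions A ⊑ B, used right-to-left as rewriting rules, ignores step (ii) (unification), and represents a CQ as a frozenset of unary atoms.

```python
# Drastically simplified rewriting sketch (ours): only atomic concept
# inclusions A ⊑ B, applied right-to-left; no unification step.
# A CQ is a frozenset of (concept, variable) atoms; a UCQ is a set of CQs.

def rewrite(ucq, inclusions):
    """Saturate the input UCQ under the rule: if A ⊑ B, an atom B(x) in a
    disjunct may be rewritten to A(x), yielding a new disjunct."""
    result = set(ucq)
    frontier = set(ucq)
    while frontier:
        new = set()
        for cq in frontier:
            for (pred, var) in cq:
                for (a, b) in inclusions:
                    if b == pred:  # B(x) can be rewritten to A(x)
                        rewritten = (cq - {(pred, var)}) | {(a, var)}
                        if rewritten not in result:
                            new.add(frozenset(rewritten))
        result |= new
        frontier = new
    return result
```

For instance, with the inclusion Student ⊑ Person (names ours), the one-disjunct UCQ {Person(x)} is rewritten into the two-disjunct UCQ {Person(x), Student(x)}; termination of this toy loop follows because only finitely many CQs over the given atoms exist.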
Let be an OBDM specification where , i.e., has no assertions, and is a GLAV mapping. From results of [17,29], it is well-known that, given a UCQ over , by splitting the GLAV mapping into a GAV mapping followed by a LAV mapping over an intermediate alphabet, it is always possible to compute a UCQ over , denoted by , such that for each -database D.
Now, let be an OBDM specification where is a ontology and is a GLAV mapping. Given a UCQ over , in what follows, we denote by the following UCQ over : . By combining the above observation regarding with the correctness of the algorithm [15] and the fact that is insensitive to the adoption of the UNA for UCQ answering [5], we derive that holds for each UCQ over and for each -database D such that is consistent. Furthermore, by combining the above observation regarding with results of [58], we derive that, given an -database D, the OBDM system is inconsistent if and only if . Here, is the -violation query, i.e., the boolean UCQ over constituted by the disjunct and a disjunct of the form (respectively, , , and ) for each disjointness assertion (respectively, or , , and ) occurring in , where an atom of the form stands for either if R denotes an atomic role P, or if R denotes the inverse of an atomic role, i.e., .
Canonical structure Given an -database D and a GLAV mapping relating a source schema to an ontology , the chase [13] of D with respect to , denoted by , is the set of atoms computed as follows: (i) is initially set to the empty set of atoms; then (ii) for every GLAV assertion in and for every tuple of constants occurring in D such that , we add to the image of the set of atoms under , that is, we set , where extends h by assigning to each variable z occurring in a different fresh variable of still not present in . Observe that is guaranteed to be finite and can always be computed in exponential time.
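The chase step just described can be sketched as follows (ours). A toy evaluator for single-atom source bodies stands in for full CQ evaluation, and fresh labelled nulls play the role of the fresh variables assigned to existential head variables.

```python
from itertools import count

def eval_single_atom(body, target, D):
    """Toy CQ evaluator: body is a single atom without repeated variables."""
    (pred, args), = body
    for fact_pred, fact_args in D:
        if fact_pred == pred and len(fact_args) == len(args):
            yield dict(zip(args, fact_args))

def chase(D, mapping, evaluate=eval_single_atom):
    """One pass of the GLAV chase: for every assertion (body, head, target)
    and every answer of body over D, add the image of the head atoms,
    inventing a fresh null for each existential variable of the head."""
    fresh = count()
    out = set()
    for body, head, target in mapping:
        for h in evaluate(body, target, D):
            ext = dict(h)
            for _, args in head:
                for a in args:
                    if a.startswith("?") and a not in ext:
                        ext[a] = f"_n{next(fresh)}"  # fresh labelled null
            out |= {(p, tuple(ext.get(a, a) for a in args)) for p, args in head}
    return out
```

On the database {emp(ann)} with the assertion emp(x) ~> Person(x) ∧ worksFor(x, y) (names ours), the sketch produces Person(ann) and worksFor(ann, _n0) with _n0 a fresh null, mirroring the construction in the text.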
We conclude this section with the following observation used for the technical development of the next sections. Let be an OBDM system where is a ontology and is a GLAV mapping. The canonical structure of with respect to and D, denoted by , is the (potentially infinite) set of atoms obtained by first computing as described before, and then chasing with respect to the inclusion assertions of as described in [15, Definition 5], but using the alphabet of variables whenever a new element is needed in the chase. Observe that the latter is a fair deterministic strategy, i.e., one ensuring that if at some point an assertion is applicable, then it will eventually be applied. By combining [28, Proposition 4.2] with [15, Theorem 29], it is well known that, for a UCQ over and a tuple of constants occurring in D, if is consistent, then the following holds: if and only if for some .
Formal framework
In what follows, we implicitly use to denote an OBDM system, where is an OBDM specification and D is an -database. Intuitively, given two sets and of tuples of constants occurring in D (i.e., and are D-datasets) of positive and negative examples, respectively, we aim at finding a query over the ontology in a certain target query language that logically separates and w.r.t. Σ. Since the evaluation of queries in OBDM systems is based on certain answers, we are naturally led to the following definition.
is a proper Σ-separation of and in the query language , if both conditions (i) and (ii) hold.
The next example illustrates the above definition.
Recall the OBDM system , the D-datasets and , and the CQ over the ontology illustrated in Example 1. As already observed, we have that , and therefore is a proper Σ-separation of and in CQ.
Consider now a slight variation of the negative examples, i.e., consider and . Since and are such that and , and since and are such that and , one can verify that no proper Σ-separation of and in UCQ exists.
Clearly, the more expressive the target query language , the easier it is to distinguish (w.r.t. the OBDM system) the properties of the tuples in and by means of the operators in , and therefore the more likely it is that a proper separation in exists. Unfortunately, the next example shows that, even in trivial cases and without any restriction on the target query language , proper separations are not guaranteed to exist.
Let be the following OBDM system: (i) is the OBDM specification in which , i.e., contains no assertions, , and with and ; and (ii) D is the -database .
For the D-datasets and , one can trivially verify that, whatever the query language is, there can be no query over whose set of certain answers includes the tuple but not the tuple . To see this, note that . It follows that, whatever the query language is, there can be no query over that is a proper Σ-separation of and in .
Notice the importance of the role played by the mapping in order to reach this conclusion. Indeed, if we replace with , then a proper Σ-separation of and in UCQ would simply be the CQ over the ontology .
Borrowing ideas from [23], to remedy situations where proper separations do not exist, we now introduce approximations of proper separations in terms of completeness and soundness. More specifically, a complete separation is a query that captures all the positive examples in its set of certain answers w.r.t. the OBDM system (i.e., satisfies condition (i) of Definition 1), whereas a sound separation is a query that contains none of the negative examples in its set of certain answers w.r.t. the OBDM system (i.e., satisfies condition (ii) of Definition 1).
is a complete (resp., sound) Σ-separation of and in the query language , if (resp., ).
We observe that the condition for being a complete (resp., sound) separation does not involve (resp., ).
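Given an oracle for certain answers, the three notions reduce to plain set comparisons. The following minimal sketch is ours; `cert` stands for a certain-answer procedure, which is where all the computational difficulty of the framework lies.

```python
# Minimal sketch (ours): cert(q) is an assumed oracle returning the set of
# certain answers of query q w.r.t. the OBDM system.

def is_complete(cert, q, pos):
    # Condition (i) of Definition 1: every positive example is a certain answer.
    return pos <= cert(q)

def is_sound(cert, q, neg):
    # Condition (ii) of Definition 1: no negative example is a certain answer.
    return not (neg & cert(q))

def is_proper(cert, q, pos, neg):
    # A proper separation satisfies both conditions at once.
    return is_complete(cert, q, pos) and is_sound(cert, q, neg)
```

Note how soundness indeed ignores the positive dataset and completeness ignores the negative one, as remarked above.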
Refer to Example 2. We have that and are complete Σ-separations of and in UCQ, whereas and are sound Σ-separations of and in UCQ.
As the above example shows, there may be several complete and sound separations. In such cases, the interest is unquestionably in those queries that best approximate the proper one, at least relative to a target query language . Informally, a -minimally complete separation is a complete separation in whose set of certain answers contains a minimal (w.r.t. set containment) set of negative examples, whereas a -maximally sound separation is a sound separation in whose set of certain answers captures a maximal (w.r.t. set containment) set of positive examples.
is a -minimally complete (resp., -maximally sound) Σ-separation of and , if is a complete (resp., sound) Σ-separation of and in and there is no satisfying both the following two conditions:
is a complete (resp., sound) Σ-separation of and in ;
(resp., ).
By definition, it is immediate to verify that a proper Σ-separation of and in a query language is both a -minimally complete Σ-separation of and and a -maximally sound Σ-separation of and .
Refer again to Example 2. One can verify that the CQ is a UCQ-minimally complete Σ-separation of and , whereas is not. Moreover, both and are CQ-maximally sound Σ-separations of and , but neither of them is a UCQ-maximally sound Σ-separation of and . Indeed, a UCQ-maximally sound Σ-separation of and is .
We point out that all the above definitions generalize the definitions illustrated in [22], which deal only with datasets of positive examples, rather than with datasets of both positive and negative examples as done here. More specifically, in [22], as well as in all the works addressing the separability task, when only a D-dataset of positive examples is provided, one implicitly associates with it the D-dataset , where n is the arity of the tuples in . With this remark in mind, we can now specialize the above definitions to the case of a single dataset of positive examples and report the definitions given in [22].
is a proper (resp., complete, sound) Σ-characterization of in the query language , if (resp., , ).
is a -minimally complete (resp., -maximally sound) Σ-characterization of , if is a complete (resp., sound) Σ-characterization of in and there is no satisfying both the following two conditions:
is a complete (resp., sound) Σ-characterization of in ;
(resp., ).
In other words, given an OBDM system with , a D-dataset of arity n, and a query , we have that is a proper in (resp., complete in , sound in , -minimally complete, -maximally sound) Σ-characterization of if and only if is a proper in (resp., complete in , sound in , -minimally complete, -maximally sound) Σ-separation of and , where .
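The reduction from characterizability to separability described above amounts to materializing the implicit negative dataset, i.e., all n-tuples of constants occurring in D that are not positive examples. A sketch (ours):

```python
from itertools import product

def implicit_negatives(constants, pos):
    """The D-dataset implicitly associated with a positive dataset: the
    complement of pos over all n-tuples of constants occurring in D,
    where n is the arity of the tuples in pos (assumed non-empty)."""
    n = len(next(iter(pos)))
    return set(product(constants, repeat=n)) - pos
```

With this in hand, any procedure for separability also decides characterizability, at the cost of a negative dataset whose size is exponential in the arity n (but polynomial for fixed n).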
Relation with the abstraction reasoning task
We now discuss the relation between the notion of separation in OBDM introduced here and the notion of Abstraction [19,20], recently introduced in [18] and studied in various scenarios [21,23,24,50] for addressing several reverse engineering tasks in OBDM. In Abstraction, we are given an OBDM specification and a query over , and the aim is to find a query over , called a perfect J-abstraction of , such that for each -database D for which is consistent. Conversely, in the separation task the -database D is also given, and instead of a query we have two sets and of tuples of constants taken from D; the aim is to find a query over such that both and hold.
Following [23], we also say that a query over is a complete (resp., sound) J-abstraction of if (resp., ) for each -database D for which is consistent. The next proposition establishes a precise relationship between the notion of separation in OBDM introduced here and the notion of abstraction.
Letbe a consistent OBDM system,andbe two D-datasets, andbe a query that explainsandinside D. If a queryis a perfect (resp., complete, sound) J-abstraction of, thenis a proper (resp., complete, sound) Σ-separation ofandin.
Suppose is a perfect (resp., complete, sound) J-abstraction of , i.e., (resp., , ) for each -database for which is consistent. Since by assumption is consistent, we have that (resp., , ). Now, since both and hold by the assumption that explains and inside D, we derive that both and (resp., only , only ) hold, which means that is a proper (resp., complete, sound) Σ-separation of and in . □
The next example shows that the converse of the above proposition does not necessarily hold, thus stressing the fact that the two problems are indeed different.
Let be as follows. is the OBDM specification in which , i.e., contains no assertions, , and consists of the following two GAV assertions:
Let also the -database be and the D-datasets and be and . Consider, moreover, the CQ . One can see that the query explains and inside D because . Consider now the CQ over . One can see that is a proper (and therefore, also a complete and a sound) Σ-separation of and in CQ because .
Notice, however, that for the -database we have that (i) is a consistent OBDM system, (ii) , and (iii) . Thus, is neither a complete nor a sound (and therefore, nor a perfect) J-abstraction of . Indeed both (witnessing that is not a complete J-abstraction of ) and (witnessing that is not a sound J-abstraction of ) hold.
Computational problems associated with the framework
Given the general framework introduced so far, there are (at least) three computational problems to consider, with respect to an ontology language , a mapping language , and a query language . Given an OBDM system and two D-datasets and , where and :
Verification: given also a query , check whether is a proper (resp., complete, sound) Σ-separation of and in .
Computation: compute any proper in (resp., any -minimally complete, any -maximally sound) Σ-separation of and , provided it exists.
Existence: check whether there exists a query that is a proper in (resp., a -minimally complete, a -maximally sound) Σ-separation of and .
Analogous computational problems can be defined when in the input we have only one D-dataset of arity n, rather than two D-datasets and , and we implicitly think of as .
In what follows, if not otherwise stated, we refer to the following scenario which considers by far the most popular languages for the OBDM paradigm: (i) is , (ii) is GLAV, and (iii) is UCQ. In this scenario, there are two interesting properties that are worth mentioning.
Letbe an OBDM system, andandbe two D-datasets of arity n such that. Ifandare UCQ-minimally complete (resp., UCQ-maximally sound) Σ-separations ofand, then they are equivalent w.r.t. Σ.
We first address the case of UCQ-maximally sound, and then the case of UCQ-minimally complete.
Assume that and are UCQ-maximally sound Σ-separations of and and suppose, for the sake of contradiction, that they are not equivalent w.r.t. Σ. This implies the existence of a tuple such that and . Observe that, since , must be such that , otherwise would not be a sound Σ-separation of and in UCQ, thus reaching a contradiction. But then, the UCQ is such that (i) since both and are sound Σ-separations of and in UCQ, Q is a sound Σ-separation of and in UCQ as well, and (ii) holds because of tuple . Obviously, this contradicts the fact that is a UCQ-maximally sound Σ-separation of and .
Assume now that and are UCQ-minimally complete Σ-separations of and and suppose, for the sake of contradiction, that they are not equivalent w.r.t. Σ. This implies the existence of a tuple such that and . Observe that, since , must be such that , otherwise would not be a complete Σ-separation of and in UCQ, thus reaching a contradiction. But then, consider the query Q such that . Obviously, since and are UCQs, Q exists and can be expressed as a UCQ. Moreover, (i) since and are complete Σ-separations of and in UCQ, Q is a complete Σ-separation of and in UCQ as well, and (ii) holds because of tuple . Obviously, this contradicts the fact that is a UCQ-minimally complete Σ-separation of and . □
This also means that, in our scenario, for an OBDM system and two D-datasets and of arity n such that , if a proper Σ-separation of and in UCQ exists, then it is unique up to equivalence w.r.t. Σ. Furthermore, for the characterizability case where is implicitly set to be , proper in UCQ, UCQ-minimally complete, and UCQ-maximally sound Σ-characterizations of are always unique up to equivalence w.r.t. Σ, provided they exist.
Secondly, in this scenario, as one may expect, proper separations are less likely to exist than explanations in the plain relational database case.
Letbe a consistent OBDM system, andandbe two D-datasets. If there exists a proper Σ-separation ofandin UCQ, thenandare UCQ-explainable inside D.
Suppose there exists a proper Σ-separation of and in UCQ, i.e., there is a UCQ over for which and . Recall from Section 3 that the UCQ over is such that for each -database for which is consistent. Since is consistent by assumption, we have that . Thus, is such that both and hold, from which immediately follows that explains and inside D by definition, and thus and are UCQ-explainable inside D. □
In general, the converse of the above proposition does not hold. Indeed, in Example 3, while there is no proper Σ-separation of and in , whatever the target query language is, the CQ witnesses that and are CQ-definable inside D.
Note that Definition 1 can be seen as a generalization of the classical notion of explanation in the plain relational database case [7], when one also has to deal with mapping assertions and ontology assertions. In the specific case of OBDM systems such that and contains assertions of the form and providing a one-to-one correspondence between the alphabet of the ontology and the predicate symbols of the source schema , we observe that, given two D-datasets and (resp., a D-dataset ), there exists a proper Σ-separation of and (resp., a proper Σ-characterization of ) in UCQ if and only if and are UCQ-explainable (resp., is UCQ-definable) inside D.
We conclude this section by observing that, in the case of plain relational databases, two D-datasets and (resp., a D-dataset ) are UCQ-explainable (resp., is UCQ-definable) inside D if and only if the union of the CQs describing the tuples in achieves the separation (resp., the definition). More formally, and (resp., ) are UCQ-explainable (resp., is UCQ-definable) inside D if and only if explains and inside D (resp., defines inside D), where . When casting this problem to the OBDM setting, one of the key challenges is that the separating query needs to be formulated over the ontology level, while data reside at the source level, with a mapping layer mediating the two. One of the contributions of this paper is to provide a similar characterization as in the relational database case for the OBDM setting (cf. Corollary 3).
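In the plain relational case, the union of "describing" CQs mentioned above can be built mechanically: the body of the CQ for a tuple t is the whole database with every constant renamed to a variable, and t's positions become the answer variables. The sketch below is ours, with an illustrative encoding of facts as (predicate, argument tuple) pairs.

```python
def describing_cq(D, t):
    """The CQ describing tuple t inside database D: every constant of D
    becomes a variable, t's constants become the answer variables, and all
    remaining variables are existential.  Assumes t's constants occur in D."""
    var = {c: f"?v_{c}" for _, args in D for c in args}
    body = sorted((p, tuple(var[c] for c in args)) for p, args in D)
    return tuple(var[c] for c in t), body
```

The union of `describing_cq(D, t)` over all t in the positive dataset is then the canonical candidate query: by the characterization recalled above, the datasets are UCQ-explainable (resp., UCQ-definable) inside D exactly when this union already achieves the separation (resp., the definition).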
The verification problem
We now define the verification problems for X-separability (X-V) and X-characterization (X-V), where . These decision problems are parametric with respect to the ontology language to express the ontology , the mapping language to express the mapping , and the target query language to express the query over .
We also introduce two important special cases of the above decision problems, namely X-VST(, , ) and X-VST(, , ), which are special cases of X-V(, , ) and X-V(, , ), respectively, in which all the input D-datasets are singleton sets (i.e., each consists of just a single tuple).
In what follows, given a syntactic object x such as a query, an ontology, or a mapping, we denote by its size, which is the number of symbols needed to write it, with names of predicates, variables, etc. counting as one.
We start by analyzing the upper bounds for the case . The proofs of the next theorems rely on the fact that, given an OBDM specification of our considered scenario and a UCQ over , each disjunct in can be obtained after a polynomial number of transformations of an initial disjunct in [15], and, analogously, each disjunct in can be obtained after a polynomial number of transformations of an initial disjunct in [17]. This implies that, although there may be an exponential number of disjuncts in the UCQ , the size of each disjunct is always bounded by a polynomial in the size of , , and . More precisely, the size of each disjunct in is at most and the size of each disjunct in is at most .
Complete-V(, GLAV, UCQ) and Complete-V(, GLAV, UCQ) are in.
We address Complete-V(, GLAV, UCQ). The case of Complete-V(, GLAV, UCQ) can be addressed in exactly the same way (recall that is immaterial for the complete case). In particular, we now show how to check in whether is a complete Σ-separation of and in UCQ (i.e., ), where with in which is a ontology and is a GLAV mapping.
Let n be the arity of the tuple(s) in the D-datasets and . For each n-tuple of constants , we first guess (i) a CQ over which is either of arity n and size at most , or a boolean one capturing a disjointness assertion d (e.g., capturing ); (ii) a sequence of ontology assertions; (iii) a CQ over of size at most which is either of arity n and of the form , or a boolean one of the form ; (iv) a sequence of mapping assertions; and (v) a function f from the variables occurring in to .
Then, we check in polynomial time whether (i) by means of , either we can rewrite a disjunct of into through (i.e., ), or we can rewrite a disjunct of into through (i.e., ); (ii) by means of , we can rewrite into through (i.e., , and thus either or ); and finally (iii) f is a homomorphism witnessing either , i.e., (and therefore , which means ), or , i.e., (and therefore , which means that Σ is inconsistent and thus by definition of certain answers). □
By exploiting the above result, we now address the upper bounds for the case .
Sound-V(, GLAV, UCQ) and Sound-V(, GLAV, UCQ) are in.
We start with Sound-V(, GLAV, UCQ), and then we consider Sound-V(, GLAV, UCQ). In particular, we now show how to check in whether is not a sound Σ-separation of and in UCQ (i.e., ), where with in which is a ontology and is a GLAV mapping.
We first guess (i) a tuple of constants , and, exactly as in the proof of Theorem 1, we also guess (ii) , , , , and f. Then, we check in polynomial time whether (i) , and (ii) using , , , , and f, we follow exactly the same polynomial time procedure described in the proof of Theorem 1 to check whether .
As for the Sound-V(, GLAV, UCQ) case, it is sufficient to replace check (i) with the check , where n is the arity of the tuple(s) in the D-dataset . Clearly, the latter check can also be done in polynomial time, by first checking that every constant of effectively occurs in and then simply checking that . □
We recall that the complexity class , introduced in [56], resides at the second level of the polynomial time hierarchy [61], and it is composed of all those decision problems that are the intersection of a decision problem in and a decision problem in , that is, .
Since is a proper Σ-separation of and in if and only if it is both a sound and a complete Σ-separation of and in , we immediately derive the following upper bounds for .
Proper-V(, GLAV, UCQ) and Proper-V(, GLAV, UCQ) are in.
We now provide matching lower bounds, proving that all of them hold already for the special cases with singleton datasets. More specifically, we show that they hold for the same fixed OBDM system , the same fixed D-datasets and (resp., D-dataset ) containing only a single unary tuple, and for CQs as queries. Furthermore, the fixed OBDM specification is such that (i.e., contains no ontology assertions) and is a GAV∩LAV mapping (i.e., is both a GAV and a LAV mapping). To simplify the presentation, with a slight abuse of notation, from now on we denote by the ontology language allowing only for ontologies , i.e., ontologies without assertions.
There is an OBDM systemsuch thatandis a GAV∩LAV mapping, and two D-datasetsand(resp., a D-dataset) containing only one unary tuple for which:
Complete-VST(∅, GAV∩LAV, CQ) and Complete-VST(∅, GAV∩LAV, CQ) are-hard;
Sound-VST(∅, GAV∩LAV, CQ) and Sound-VST(∅, GAV∩LAV, CQ) are-hard;
Proper-VST(∅, GAV∩LAV, CQ) and Proper-VST(∅, GAV∩LAV, CQ) are-hard.
Let be the fixed OBDM system such that (i) is an OBDM specification in which contains no assertions and whose alphabet consists of two atomic roles and , , and consists of the following two GAV∩LAV assertions: and , which simply mirrors source predicate to atomic role , for , and (ii) D is the -database composed of the following facts:
Let, moreover, and be the fixed D-datasets and . Note that is needed only for the decision problems related to separability but not for the decision problems related to characterizability.
Let be a finite and undirected graph without loops (in graph theory, a loop is an edge that connects a vertex with itself [8]) or isolated nodes, where . We define a CQ over as follows:
Notice that can be constructed in from an input graph G as above.
By inspecting the OBDM system , for any graph G as above, the set of certain answers must necessarily be an element of the power set of . More specifically, the following property holds:
For bothand, we have that a graphis i-colourable if and only if.
First of all, notice that is composed of the following facts:
“Only-if part:” Suppose is 3-colourable (resp., 4-colourable), that is, there exists a function (resp., ) such that for each . Let be the body of , and consider the extension of f that assigns the constant (resp., ) to the distinguished variable x of . It can be readily seen that f is a homomorphism from to such that (resp., ). In other words, f witnesses that (resp., ). Thus, (resp., ), as required.
“If part:” Suppose is not 3-colourable (resp., not 4-colourable), that is, each possible function (resp., ) is such that for some . Clearly, this implies that (resp., ). Thus, (resp., ), as required. □
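The property just proved rests on the classical fact that a graph is k-colourable if and only if it admits a homomorphism into the complete graph on k vertices, which is exactly what evaluating the CQ over the fixed database checks. A brute-force rendering of that fact (ours, purely illustrative; exponential in the number of vertices):

```python
from itertools import product

def is_k_colourable(vertices, edges, k):
    """G is k-colourable iff some assignment of k colours to its vertices
    gives every edge two distinct endpoints, i.e., iff G maps
    homomorphically into the complete graph on k vertices."""
    vertices = list(vertices)
    for colours in product(range(k), repeat=len(vertices)):
        c = dict(zip(vertices, colours))
        if all(c[u] != c[v] for u, v in edges):
            return True
    return False
```

For instance, a triangle is 3-colourable but not 2-colourable, matching the intuition behind the reductions from 3- and 4-colourability used below.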
With the above property in hand, and the fact that is an element of the power set of for each possible graph , we are now ready to prove the claimed lower bounds.
As for the complete case, the proof of -hardness is by a reduction from the 4-colourability problem, which is -complete [32]. In particular, from the above claim a graph G is 4-colourable if and only if , i.e., if and only if , which is the condition for to be a complete Σ-separation of and in CQ (resp., a complete Σ-characterization of in CQ).
As for the sound case, the proof of -hardness is by a reduction from the complement of the 3-colourability problem, which is -complete [32]. In particular, from the above claim a graph G is not 3-colourable if and only if , i.e., if and only if (resp., ), which is the condition for to be a sound Σ-separation of and in CQ (resp., a sound Σ-characterization of in CQ).
Finally, as for the proper case, the proof of -hardness is by a reduction from the exact-4-colourability problem, which is -complete [59]. In particular, a graph G is exact-4-colourable (i.e., 4-colourable and not 3-colourable) if and only if , i.e., if and only if and (resp., ), which is the condition for to be a proper Σ-separation of and in CQ (resp., a proper Σ-characterization of in CQ). □
For the scenario under investigation in this paper, we can now establish the precise computational complexity of all the Verification decision problems introduced at the beginning of this section.
The following holds:
Complete-V(, GLAV, UCQ) and Complete-V(, GLAV, UCQ) are-complete. The lower bounds already hold for Complete-VST(∅, GAV∩LAV, CQ) and Complete-VST(∅, GAV∩LAV, CQ);
Sound-V(, GLAV, UCQ) and Sound-V(, GLAV, UCQ) are-complete. The lower bounds already hold for Sound-VST(∅, GAV∩LAV, CQ) and Sound-VST(∅, GAV∩LAV, CQ);
Proper-V(, GLAV, UCQ) and Proper-V(, GLAV, UCQ) are-complete. The lower bounds already hold for Proper-VST(∅, GAV∩LAV, CQ) and Proper-VST(∅, GAV∩LAV, CQ).
Finally, from the lower bounds given in Theorem 3, we can derive two interesting novel results also in the context of explainability and definability in the plain relational database case. More specifically, since the fixed OBDM specification used in the proof of that theorem is such that and is both a GAV and a LAV mapping, the proof can be straightforwardly adapted to the plain relational database case. Thus, given a schema , an -database D, two D-datasets and (resp., a D-dataset ), and a UCQ over , the problem of deciding whether explains and (resp., defines ) inside D is -complete (membership of these decision problems directly follows from Corollary 1).
The computation problem
In this section, we address the Computation problem. We start by considering the case where the given OBDM system Σ is inconsistent. Given an inconsistent OBDM system with and two D-datasets and (resp., a D-dataset ) of arity n, we point out that any query of arity n over is a -minimally complete Σ-separation (resp., Σ-characterization) of and (resp., of ). This is so because, by definition, the set of certain answers of any query of arity n w.r.t. an inconsistent OBDM system Σ is the set of all possible n-tuples of constants occurring in D, i.e., . Furthermore, if and , then any query of arity n over is also a -maximally sound (and therefore a proper in ) Σ-separation (resp., Σ-characterization) of and (resp., of ); otherwise, no sound (and therefore no -maximally sound and no proper in ) Σ-separation (resp., Σ-characterization) of and (resp., of ) exists. Since for OBDM systems of our scenario it is always possible to check whether they are consistent (cf. Section 3), from the above observations one can trivially derive suitable algorithms for the Computation problem in all the cases in which the input OBDM system Σ is inconsistent.
Having thoroughly covered the case of inconsistent OBDM systems, in what follows in this section, unless otherwise stated, we implicitly assume to only deal with consistent OBDM systems.
Specifically, we now provide two algorithms that, given a consistent OBDM system and two D-datasets and (resp., a D-dataset ), always terminate and return, respectively, a UCQ-minimally complete and a UCQ-maximally sound Σ-separation (resp., Σ-characterization) of and (resp., of ). This proves that, in our investigated scenario, such separations always exist and can be computed. The algorithms we provide focus only on the separability case; algorithms for the characterizability case can be immediately derived from them by simply computing , where n is the arity of the tuple(s) in .
Before delving into the details of the two algorithms, we provide some crucial properties about the canonical structure that will be used to establish the correctness of such algorithms.
Letwithbe an OBDM system,be a UCQ over, andandbe two tuples of constants such that. If, then.
If Σ is inconsistent, then the claim is trivial. If Σ is consistent, from Section 3 we know that implies the existence of a disjunct in for which . Let h be the homomorphism witnessing that , and let be the homomorphism witnessing that , which exists by the premises of the proposition. The composite function is then a homomorphism witnessing that . It follows that , as required. □
Letwithbe a consistent OBDM system,andbe two tuples of constants, andbe the CQover. We have thatif and only if.
Suppose that , and let h be the homomorphism witnessing this. Consider the query . Observe that is obtained from by appropriately replacing each occurrence of each constant either with a distinguished variable or with an existential variable . This means that h can be immediately transformed into a homomorphism witnessing that , thus implying that .
Suppose now that . Since Σ is consistent, it follows that there is a homomorphism h witnessing that , where . By considering again the relationship between and , the homomorphism h can be immediately transformed into a homomorphism that witnesses . By construction of the canonical structure , it is now trivial to verify that can be properly extended into a homomorphism witnessing that . □
We are now ready to present our algorithms for the Computation problem. We start with the complete case, and provide the Algorithm 1 (MinCompSeparation) for computing UCQ-minimally complete separations.
MinCompSeparation
Informally, for each positive example , the algorithm obtains from the set of atoms the CQ . The output query is then the union of all the CQs obtained in this way.
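Abstracting from the details of the canonical structure, the informal description above can be sketched as follows (ours): we assume the relevant set of chase atoms is finite and already computed, and follow the convention that every term of the chase (constant or labelled null) becomes a variable of the disjunct.

```python
# Sketch (ours): build one disjunct per positive tuple from a finite set of
# chase atoms, encoded as (predicate, args) pairs; nulls are strings too.

def min_comp_separation(chase_atoms, pos):
    """One disjunct per positive tuple t: the CQ whose body is the set of
    chase atoms with every term turned into a variable, and whose answer
    variables are the (variables corresponding to the) positions of t."""
    var = {c: f"?v_{c}" for _, args in chase_atoms for c in args}
    body = sorted((p, tuple(var[c] for c in args)) for p, args in chase_atoms)
    return [(tuple(var[c] for c in t), body) for t in sorted(pos)]
```

Each disjunct is, by construction, satisfied by its own positive tuple over the canonical structure, which is the intuition behind the completeness argument in Theorem 4 below.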
Let be the same OBDM specification as in Example 2. One can verify that for the -database and the D-datasets and , MinCompSeparation(, , ) returns the UCQ ∪ , where and . Note that the query returned by the algorithm is a UCQ-minimally complete Σ-separation of and , where .
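To make the procedure concrete, here is a minimal Python sketch of the idea behind MinCompSeparation. All names and data structures (atoms encoded as `(predicate, args)` pairs, the canonical structure given as a precomputed set of atoms) are illustrative assumptions, not the paper's actual notation:

```python
def cq_of_tuple(chase_atoms, tup):
    """Build the CQ for one positive example: distinguished variables
    x1..xn for the constants of the example tuple, fresh existential
    variables for all remaining terms of the canonical structure."""
    var = {}
    for i, c in enumerate(tup):          # distinguished variables
        var.setdefault(c, f"x{i+1}")

    def term(t):
        if t not in var:                 # fresh existential variable
            var[t] = f"y{len(var)}"
        return var[t]

    body = [(p, tuple(term(t) for t in args)) for p, args in chase_atoms]
    head = tuple(var[c] for c in tup)
    return head, frozenset(body)

def min_comp_separation(chase_atoms, positives):
    """Minimally complete separation: one disjunct per positive tuple,
    the output UCQ being the union of all such CQs."""
    return [cq_of_tuple(chase_atoms, t) for t in positives]
```

The negative examples play no role here; they matter only in the sound case, where disjuncts are filtered out.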
The following theorem establishes termination and correctness of Algorithm 1 (MinCompSeparation).
Let , with , be a consistent OBDM system and and be two D-datasets. We have that MinCompSeparation(Σ, , ) terminates and returns a UCQ-minimally complete Σ-separation of and .
Termination of the algorithm as well as completeness of the UCQ returned are straightforward. In particular, due to Proposition 5, it is obvious that each is such that , where .
To prove that is also a UCQ-minimally complete Σ-separation of and , note that it is enough to show that any query q over that is a complete Σ-separation of and in UCQ is such that . We do this by contraposition. Let q be a UCQ over for which , i.e., for a tuple of constants we have but . The latter means that for some with occurring in . By Proposition 5, one can see that implies . By Proposition 4, it follows that each UCQ over containing the tuple in its set of certain answers w.r.t. Σ must also contain the tuple in that set, i.e., implies for any UCQ over . Thus, since , we derive that as well. Now, since and , this immediately implies that q is not a complete Σ-separation of and in UCQ, as required. □
We now turn to the sound case, providing Algorithm 2 (MaxSoundSeparation) for computing UCQ-maximally sound Σ-separations.
MaxSoundSeparation
Intuitively, starting from the query , which is a UCQ-minimally complete Σ-separation of and , the algorithm simply discards all those disjuncts whose set of certain answers w.r.t. Σ contains at least one tuple . We recall from Section 3 that the set of certain answers of a CQ w.r.t. a consistent OBDM system can always be computed by first obtaining its reformulation over the source schema , and then evaluating the latter query directly over the -database D. In other words, the if-condition of line 5 can be equivalently reformulated as: .
Refer to Example 7. Since the certain answers of the CQ w.r.t. include also , MaxSoundSeparation(Σ, ,) returns the CQ . Note that is a UCQ-maximally sound Σ-separation of and .
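The filtering step admits a very short sketch. In the following illustrative Python fragment, `certain_answers` is assumed to be a black box (e.g., realized by reformulating each CQ over the source schema and evaluating it over D, as recalled above); all names are assumptions:

```python
def max_sound_separation(disjuncts, negatives, certain_answers):
    """From a minimally complete UCQ (given as a list of disjuncts),
    keep only the disjuncts whose certain answers contain no negative
    example; the union of the survivors is the maximally sound UCQ."""
    return [q for q in disjuncts
            if not any(n in certain_answers(q) for n in negatives)]
```

For instance, a disjunct whose certain answers include some negative tuple is dropped, exactly as in the example above.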
The following theorem establishes termination and correctness of Algorithm 2 (MaxSoundSeparation).
Let , with , be a consistent OBDM system, and and be two D-datasets. We have that MaxSoundSeparation(Σ, , ) terminates and returns a UCQ-maximally sound Σ-separation of and .
Termination of the algorithm as well as soundness of the UCQ returned are straightforward. In particular, by construction, all the disjuncts of are such that there is no tuple for which .
To prove that is also a UCQ-maximally sound Σ-separation of and , note that it is enough to show that any query q over that is a sound Σ-separation of and in UCQ is such that . We do this by contraposition. Let q be a UCQ over for which , i.e., for a tuple of constants we have but . Since and , it is easy to see that the algorithm discarded the disjunct (otherwise, we would trivially derive that , and thus , contradicting the fact that ). By construction of the algorithm, the only reason was discarded is that for at least one tuple . By Proposition 5, one can see that implies . By Proposition 4, it follows that each UCQ over containing the tuple in its set of certain answers w.r.t. Σ must also contain the tuple in that set, i.e., implies for any UCQ over . Thus, since , we derive that as well. Now, since and , this immediately implies that q is not a sound Σ-separation of and in UCQ, as required. □
We now briefly discuss the running time of the two algorithms above. First, we observe that their running time differs depending on the mapping language in which the input mapping is formulated. A key difference between the mapping language GLAV and its special cases GAV and LAV lies in the size of the computed . With GLAV mappings, due to the simultaneous presence of queries involving multiple source predicates in the left-hand side of mapping assertions and of joined existential variables in their right-hand side, the set of atoms can be exponentially large. For example, considering a database and a mapping containing the GLAV assertion , the number of atoms occurring in the set is . Conversely, with both LAV and GAV mappings, is always polynomially bounded, since the former does not allow multiple source predicates in the left-hand side of mapping assertions, whereas the latter does not allow existential variables in the right-hand side of mapping assertions, and the arity of ontology predicates is fixed to at most 2.
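The blow-up can be seen by counting the homomorphisms of a GLAV assertion's left-hand-side CQ into the source database, since each match triggers one application of the assertion (and thus one fresh block of existential witnesses in the chase). A naive counting sketch, under an illustrative encoding of atoms as `(predicate, args)` pairs:

```python
from itertools import product

def lhs_matches(lhs, db):
    """Count the homomorphisms of the left-hand-side CQ `lhs` into the
    database `db`; each one triggers one application of the assertion."""
    vars_ = sorted({v for _, args in lhs for v in args})
    consts = sorted({c for _, args in db for c in args})
    count = 0
    for assign in product(consts, repeat=len(vars_)):
        h = dict(zip(vars_, assign))
        # check that every atom of the LHS is mapped onto a fact of db
        if all((p, tuple(h[v] for v in args)) in db for p, args in lhs):
            count += 1
    return count
```

With k independent atoms r(x1), …, r(xk) in the left-hand side and n matching facts in the database, there are n^k matches, in line with the exponential bound discussed above.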
Furthermore, the running time required to compute differs in the LAV and GAV cases. Indeed, can always be computed in polynomial time whenever is a LAV mapping, whereas, in general, already when is a GAV mapping, since there are CQs in the left-hand side of mapping assertions that need to be evaluated over the database D, it takes exponential time to deterministically compute (unless ).
As immediate consequences of the above observations, we derive that, in general, Algorithm 1 (MinCompSeparation) runs in exponential time with respect to the size of the input, while it runs in polynomial time whenever the mapping of the input OBDM system is a LAV mapping. As for Algorithm 2 (MaxSoundSeparation), since after computing one also needs to compute the certain answers of each for all in the input dataset , and deterministically computing the certain answers of CQs with respect to an OBDM system takes exponential time (unless ), we derive that, in general, the algorithm runs in double exponential time with respect to the size of the input, while it runs in exponential time whenever the mapping of the input OBDM system is either a GAV or a LAV mapping. This is because is guaranteed to be of polynomial size whenever is either a GAV or a LAV mapping (and therefore the various CQs are also of polynomial size with respect to the size of the input).
We conclude this section by providing a semantic test for the existence of proper separations and characterizations in UCQ in the OBDM settings. We observe that, in all the cases in which a proper separation exists, it is clear that both algorithms return the same query . Therefore, as a direct consequence of both Theorem 4 and Theorem 5, we get the following result.
Letbe an OBDM specification, andandtwo D-datasets. Either the UCQis a proper Σ-separation ofandin UCQ, or no proper Σ-separation ofandin UCQ exists.
The combination of Corollary 3 and Proposition 5 allows us to provide the following semantic tests for the existence of proper separations and proper characterizations in UCQ in the OBDM setting, which can be seen as analogues of the semantic tests given in [7] and [55] for the plain relational database setting and the ontology-enriched query answering setting, respectively.
test for UCQs in OBDM: given a consistent OBDM system and two D-datasets and , there exists a proper Σ-separation of and in UCQ if and only if it is the case that for all and all .
test for UCQs in OBDM: given a consistent OBDM system and a D-dataset of arity n, there exists a proper Σ-characterization of in UCQ if and only if it is the case that for all and all .
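Both tests ultimately reduce to checking the (non-)existence of a homomorphism between canonical structures with distinguished tuples. A naive backtracking check, under the illustrative encoding of structures as sets of `(predicate, args)` atoms (all names here are assumptions, not the paper's notation):

```python
def homomorphism(src_atoms, dst_atoms, src_tup, dst_tup):
    """Search for a homomorphism h from src_atoms to dst_atoms with
    h(src_tup) = dst_tup pointwise; returns the mapping or None.
    A pair of examples passes the semantic test iff no such
    homomorphism exists."""
    h0 = {}
    for a, b in zip(src_tup, dst_tup):
        if h0.setdefault(a, b) != b:        # inconsistent tuple constraint
            return None
    atoms = sorted(src_atoms)

    def extend(i, h):
        if i == len(atoms):
            return h
        p, args = atoms[i]
        for q, dargs in dst_atoms:          # try every matching target fact
            if q != p or len(dargs) != len(args):
                continue
            h2, ok = dict(h), True
            for a, b in zip(args, dargs):
                if h2.setdefault(a, b) != b:
                    ok = False
                    break
            if ok:
                res = extend(i + 1, h2)
                if res is not None:
                    return res
        return None

    return extend(0, h0)
```

This brute-force search is exponential in the worst case, consistent with the hardness results established below.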
The existence problem
We now address the existence problem. First of all, for the scenario under consideration in this paper, the existence problem for both UCQ-minimally complete and UCQ-maximally sound separations (and also characterizations) is trivial, since by Theorems 4 and 5 they always exist. Thus, in this section we only consider the remaining proper case, by defining variants of the decision problems of [55] in which a mapping in some mapping language is also given as input.
We also introduce two important special cases of the above decision problems, namely ST(, , ) and ST(, , ), which are special cases of (, , ) and (, , ), respectively, in which all the input D-datasets are singleton sets (i.e., they consist of just a single tuple).
In what follows, we show that the computational complexity of the above decision problems differs only in the mapping language adopted. As already discussed in Section 6, a key difference between GLAV and the special cases GAV and LAV lies both in the size of the computed , i.e., exponential for GLAV mappings and polynomial for both GAV and LAV mappings, and in the effort required to compute , i.e., polynomial time for LAV mappings and exponential time (unless ) for both GAV and GLAV mappings.
We start by characterizing the computational complexity of and (and their respective subproblems) in the simplest LAV case, then we address the GAV case, and finally we focus on the most general GLAV case. Interestingly, all the provided matching lower bounds hold even for fixed ontologies , i.e., ontologies without assertions, and fixed D-datasets containing only a single unary tuple.
Importantly, for the scenario under consideration, from the results of the previous section, the questions in and can be reformulated equivalently as follows: “is also a sound (and so, a proper) Σ-separation of and in UCQ?” and “is also a sound (and so, a proper) Σ-characterization of in UCQ?”, respectively.
(, LAV, UCQ) and(, LAV, UCQ) are-complete. The lower bounds already hold for ST(∅, GAV∩LAV, UCQ) and ST(∅, GAV∩LAV, UCQ).
Upper bound: We only discuss (, LAV, UCQ); the (, LAV, UCQ) case is similar and therefore omitted. Given an OBDM system with in which is a ontology and is a LAV mapping, and two D-datasets and , we first compute in polynomial time (recall that is a LAV mapping), and therefore also the query , which is a UCQ-minimally complete Σ-separation of and . Then, exactly as illustrated in Theorem 2, we can check in whether is also a sound (and hence proper) Σ-separation of and in UCQ.
Lower bound: We start with ST(∅, GAV∩LAV, UCQ), and then we address the ST(∅, GAV∩LAV, UCQ) case. The proof of -hardness is by a reduction from the complement of the 3-colourability problem. We define a fixed OBDM specification in which contains no assertions and whose alphabet consists of an atomic role e and an atomic concept A, , and consists of the following two GAV∩LAV mapping assertions: and , which simply mirror the source predicates and s to e and A, respectively.
Let be a finite and undirected graph without loops or isolated nodes, where . Without loss of generality, we may assume that and that G is connected, i.e., there is a path from c to for any pair of nodes . Then, we define an -database composed of the following facts:
Let, moreover, and be the fixed -datasets and .
Notice that the -database can be constructed in from an input graph G as above. Let the OBDM system be . We now show that a graph G is not 3-colourable if and only if there exists a proper -separation of and in UCQ, thus proving the claimed lower bound.
Given a graph G, there exists a proper-separation ofandin UCQ if and only if G is not 3-colourable.
First of all, note that . We recall that a proper -separation of and in UCQ exists if and only if the CQ is also a sound (and so, a proper) -separation of and in UCQ.
“Only-if part:” Suppose is 3-colourable, that is, there exists a function such that for each . Without loss of generality, we may assume that (indeed, the existence of f clearly implies the existence of a function with and such that for each holds as well). Consider the extension h of f assigning to each . It can be readily seen that h is a homomorphism witnessing that , which directly implies that . It follows that is not a sound -separation of and in UCQ, and therefore no proper -separation of and in UCQ exists.
“If part:” Suppose there exists no proper -separation of and in UCQ, i.e., the CQ is such that . It follows that there exists a homomorphism h witnessing that . Now, since the graph G is connected and since , by construction of , the homomorphism h must necessarily be such that for each constant c representing a node . But then, due to the fact that none of , , and occur in , we derive that h is such that for each , and therefore for each as well. Thus, we can conclude that G is 3-colourable. □
As for the -hardness of ST(∅, GAV∩LAV, UCQ), it is possible to use exactly the same reduction provided above by discarding and considering only . In particular, due to the fact that and are the only A-facts occurring in , the set of certain answers of the query w.r.t. is always either or , and which one of the two depends on the 3-colourability of G. □
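The reduction hinges on the classical fact that a graph is 3-colourable iff there is a homomorphism from it to the triangle K3. A brute-force check of this property (purely illustrative; the reduction itself never computes colourings):

```python
from itertools import product

def three_colourable(nodes, edges):
    """G is 3-colourable iff some assignment of 3 colours makes every
    edge bichromatic, i.e., iff there is a homomorphism G -> K3."""
    for col in product(range(3), repeat=len(nodes)):
        c = dict(zip(nodes, col))
        if all(c[u] != c[v] for u, v in edges):
            return True
    return False
```

In the reduction above, a proper separation of the two fixed datasets exists exactly when this check fails on the input graph.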
We now turn to consider GAV mappings. We recall that the complexity class has many equivalent characterizations: for instance, it consists of the problems solvable in polynomial time with a constant number of rounds of parallel queries to an oracle for a decision problem in [12] (we refer the reader to [67] for further characterizations of this complexity class). By a round of parallel queries, it is meant that the Turing machine can ask polynomially many non-adaptive queries to the oracle.
(, GAV, UCQ) and(, GAV, UCQ) are-complete. The lower bounds already hold for ST(∅, GAV, UCQ) and ST(∅, GAV, UCQ).
Upper bound: We only discuss (, GAV, UCQ); the (, GAV, UCQ) case is similar and therefore omitted. Given an OBDM system with in which is a ontology and is a GAV mapping, and two D-datasets and , as a first step we compute in polynomial time with a single round of parallel queries to an oracle. More specifically, for each pair of constants (resp., constant ) and for each atomic role P (resp., atomic concept A) in the alphabet of we ask, all together in a single round of parallel queries to an oracle, whether (resp., ). It is clear that deciding whether (resp., ) for a given pair of constants (resp., constant ), a GAV mapping , and a database D is an -complete problem because the left-hand sides of mapping assertions are CQs [1].
Once obtained as described above, we construct in polynomial time the UCQ . Then, due to Theorem 2, with a second and final round of parallel queries, we can ask with a single query to an oracle whether is also a sound (and so, a proper) Σ-separation of and in UCQ.
Lower bound: We start with ST(∅, GAV, UCQ), and then we address the ST(∅, GAV, UCQ) case. The proof of -hardness is by a reduction from the odd clique problem, which is a -complete problem [66]. Odd clique is the problem of deciding, given a finite and undirected graph without loops, whether the maximum clique size of G is an odd number. Without loss of generality, we may assume that E contains at least one edge and that the cardinality of V is an even number (indeed, it is always possible to add fresh isolated nodes to the graph G without changing its maximum clique size).
Given a graph as above with , we define an OBDM system as follows: is an OBDM specification in which contains no assertions, , and, for each odd , the mapping has the following two GAV assertions:
where is an atomic concept in the alphabet of and, for each natural number p, is the conjunction of atoms:
.
Intuitively, asks whether G contains a clique of size p. Finally, is the -database . Let, moreover, and be the fixed -datasets and .
Notice that and are fixed, whereas the OBDM system can be constructed in from an input graph G as above.
The correctness of the reduction is mainly based on the next property:
Letbe an odd number. We have that:
if and only if G contains a clique of size i.
if and only if G contains a clique of size.
As for 1, since , it is easy to see that the query is such that if and only if G has a clique of size i. Thus, due to the GAV assertion occurring in , we have if and only if G has a clique of size i.
As for 2, since , it is easy to see that the query is such that if and only if G has a clique of size . Thus, due to the GAV assertion occurring in , if G has a clique of size , then . Conversely, suppose that G does not have a clique of size . On the one hand, the assertion does not make true in . On the other hand, since , not even the assertion makes true in . Thus, . □
With the above property in hand, we can now prove that the maximum clique size of a graph G is an odd number if and only if the CQ is also a sound (and so, a proper) -separation of and in UCQ, thus showing the claimed lower bound.
The maximum clique size of a graph G is an odd number if and only if the CQis also a sound (and so, a proper)-separation ofandin UCQ.
“Only-if part:” Suppose that the maximum clique size of G is p, where p is an odd number. Due to Claim 3, we have that . Observe that because G does not have a clique of size by assumption. Thus, , where . It is straightforward to verify that but . It follows that , i.e., is a proper -separation of and in UCQ.
“If part:” Suppose that the maximum clique size of G is r, where r is an even number. Due to Claim 3, we have that . Observe that and because by assumption G has a clique of size r but not of size . Thus, , where . It is straightforward to verify that . It follows that , and therefore is not a sound (and so, not a proper) -separation of and in UCQ. □
As for the -hardness of ST(∅, GAV, UCQ), it is possible to use exactly the same reduction provided above by discarding and considering only . In particular, the set of certain answers of the query w.r.t. is always either or , and which one of the two depends on the parity of the maximum clique size of G. □
We now consider the remaining more general case of GLAV mappings.
(, GLAV, UCQ) and(, GLAV, UCQ) are-complete. The lower bounds already hold for ST(∅, GLAV, UCQ) and ST(∅, GLAV, UCQ).
The remainder of this section is dedicated to the proof of the above theorem and is organized as follows. We first provide the lower bound proof in Section 7.1, and then we provide a matching upper bound in Section 7.2.
To simplify the presentation of the lower bound proof, we first assume that ontologies may also contain ternary predicates in their alphabet. Then, by applying reification, we will show how to remove this assumption. We start with (∅, GLAV, UCQ), and then we address the (∅, GLAV, UCQ) case.
With ternary predicates: The proof of -hardness is by a reduction from the complement of the -3CNF problem. Given a formula ϕ of the form , such that is a disjunction of exactly three literals for , -3CNF is the problem of deciding whether ϕ is satisfiable. It is the prototypical -complete problem [62].
The reduction is as follows. Let be an input formula for the -3CNF problem, where , , and . In what follows, for each , we denote by , , and the variable in appearing, respectively, in the first, second, and third literal of the clause . We build an OBDM system as follows:
.
is the -database consisting of the following facts:
. So, we have a constant for each variable ;
and for each . In this way, a formula of the form has possible assignments, each corresponding to an assignment to the variables in ϕ;
for each ;
and ;
for each , we have 7 different facts of the form , with being either 0 or 1 for . These 7 facts represent all the possible assignments that satisfy the clause of ϕ. For example, if ϕ contains the clause , then would contain the facts , , , , , , , and it would not contain the fact corresponding to an assignment that does not satisfy the clause .
. The alphabet of contains a ternary predicate for each and a binary predicate . Informally, will store and , where 0, 1, and b are constants. Moreover, for each , will store , where a is a constant and is the constant representing the variable in ϕ. The role played by the ternary predicates is twofold: (i) they represent the clauses of ϕ, with the constants coming from , each variable replaced with either 0 or 1, and the variables as existential variables generated by a GLAV assertion in ; (ii) they also represent the 7 possible assignments that satisfy the clause .
Finally, the mapping consists of the following GLAV assertions:
is the GLAV assertion
, where . Recall that, for each , we have that , , and denote, respectively, the variable in appearing in the first, second, and third literal of the clause ;
for , we have the assertion ;
;
.
Intuitively, populates a total of formulae based on ϕ. These formulae are obtained by assigning all possible combinations of 0 and 1 constants to the variables . It is important to notice that in each of these formulae the variables in are always represented by the same constants in , while the variables in are represented each time by fresh variables, generated at each application of the mapping assertion corresponding to a combination of 0 and 1 constants for the variables (i.e., the set of variables generated by an application of the assertion is disjoint from the set of variables generated by another application of the assertion ).
For each , the assertion generates on the predicate all the possible assignments that satisfy the clause in the formula ϕ. Finally, the assertions and are used to generate in the facts for each and the two facts and , respectively.
Finally, the fixed -datasets are and . Notice that and are fixed, whereas the OBDM system can be constructed in from an input formula for the -3CNF problem.
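The count of 7 facts per clause used in the construction above comes from the fact that a 3-literal clause excludes exactly one of the 8 truth assignments to its variables, namely the one falsifying all three literals. A small sketch enumerating them (literal polarities given as booleans; names are illustrative):

```python
from itertools import product

def satisfying_assignments(signs):
    """All 0/1 assignments to the three variables of a clause that
    satisfy it; `signs[l]` is True iff the l-th literal is positive.
    Exactly 7 of the 8 assignments survive."""
    return [bits for bits in product((0, 1), repeat=3)
            if any(b == (1 if s else 0) for b, s in zip(bits, signs))]
```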
We are now going to prove that if and only if ϕ is satisfiable. This proves the correctness of the reduction, and therefore that (∅, GLAV, UCQ) is -hard. Indeed, according to the semantic test given in Section 6, we have that there exists a proper -separation of and in UCQ if and only if . In other words, is equivalent to say that there does not exist a proper -separation of and in UCQ.
if and only if ϕ is satisfiable.
Since , we have that and coincide. Notice that, by definition, every function f witnessing that must be such that . As a consequence, due to the -facts occurring in , the function f must associate to each constant representing a variable of the formula ϕ either the constant 0 or the constant 1, i.e., for each we have that either or .
Now, if ϕ is satisfiable, then it is not hard to verify that there exists an assignment of the constants to 0 and 1 that satisfies all the formulae represented in . In particular, for each of these formulae, f also needs to associate to the existential variables generated by the application of corresponding to that formula the constants 0 and 1 that satisfy all the clauses of that formula. Thus, if ϕ is satisfiable, then we have that .
Conversely, if ϕ is not satisfiable, then it is not hard to verify that, for every possible assignment made by a function f for the constants , there exists at least one of those formulae for which it is not possible to assign to the variables of that formula the constants 0 and 1 so as to satisfy all the clauses of that formula. Thus, if ϕ is not satisfiable, then we have that . □
As for the case of (∅, GLAV, UCQ), it is possible to use exactly the same reduction provided above by discarding and considering only . In particular, it is immediate to verify that for each unary tuple with c being a constant different from b (while whether it is the case that depends solely on the satisfiability of ϕ).
We now discuss the necessary changes to be made to the above reduction to work in the presence of only binary predicates (atomic roles) in the ontology alphabet.
Without ternary predicates (only atomic roles): Let be an input formula for the -3CNF problem, where , , and . We build an OBDM system as follows:
. Differently from the previous case, for each , we have three binary predicates , , and instead of a single ternary relation .
is the -database consisting of the following facts:
;
and for each ;
and ;
for each , we make use of 7 different constants . Moreover, for each and , we have three facts of the form , , and with being either 0 or 1 for . These 7 constants represent all the possible assignments that satisfy the clause of ϕ, and the value assumed by each is the value corresponding to the j-th assignment in the l-th position of clause . For example, if ϕ contains the clause , then would contain the following facts: , , and ; , , and ; , , and ; , , and ; , , and ; , , and ; , , and . It would not contain the facts , , and corresponding to an assignment that does not satisfy the clause .
. The alphabet of contains the atomic role and three atomic roles , , and for each .
Finally, the mapping consists of the following GLAV assertions:
is the GLAV assertion
, where . Recall that, for each , we have that , , and denote, respectively, the variable in appearing in the first, second, and third literal of the clause ;
for , we have the assertions , , and ;
;
.
Finally, exactly as in the previous case of ternary predicates, the fixed -datasets are and . Notice that and are fixed, whereas the OBDM system can be constructed in from an input formula for the -3CNF problem.
Intuitively, differently from the case of ternary predicates, each of the formulae generated by is represented by binary predicates , with the additional freshly introduced existential variables representing the resulting clauses of that formula. Thus, for each of the formulae, a function f witnessing that must associate to each variable of that formula one of the 7 constants . Using analogous arguments as those given in the proof of Claim 5, it is immediate to verify that if and only if ϕ is satisfiable, from which we derive that (∅, GLAV, UCQ) is -hard (observe that ). As in the previous case of ternary predicates, for the case of (∅, GLAV, UCQ), it is possible to use exactly the same reduction provided above by discarding and considering only . Since we always have that for each unary tuple with c being a constant different from b (while whether it is the case that depends solely on the satisfiability of ϕ), we derive that (∅, GLAV, UCQ) is -hard as well.
First of all, we assume we deal only with consistent OBDM systems Σ. Indeed, given an inconsistent OBDM system and two D-datasets and (resp., a D-dataset ) of arity n, we point out that there exists a proper Σ-separation (resp., Σ-characterization) of and (resp., of ) in UCQ if and only if , where n is the arity of . Since we will provide a upper bound for the case of consistent OBDM systems, and since, for an OBDM system such that is a ontology and is a GLAV mapping, checking whether Σ is inconsistent can be done in using the technique described in the proof of Theorem 1, to simplify the presentation of the proof we can assume to deal only with consistent OBDM systems.
Recall from the semantic tests given at the end of Section 6 that there exists a proper Σ-separation (resp. Σ-characterization) of and (resp., of ) in UCQ if and only if it is the case that for all and all (resp., all ). With this observation at hand, we can solve the complements of (, GLAV, UCQ) and (, GLAV, UCQ) in using the following intuition: given an OBDM system such that is a ontology and is a GLAV mapping, and given two D-datasets and (resp., a D-dataset ) of arity n, we first guess a tuple , a tuple , and a function from to . Then, we check whether , (resp., ), is such that and every constant in is mapped to an existing term in . Finally, we call an oracle for the problem of checking whether can be extended to a function f witnessing that .
In order to realize the presented intuition, we first introduce some preliminary results that will allow us to simplify the presentation of the problem, as well as some naming conventions that we will adopt for the fresh variables generated when computing starting from D.
We start by observing that checking whether is equivalent to checking whether , where is obtained from simply by substituting every occurrence of a constant with its fresh copy .
if and only if.
“Only-if part”: Let h be a homomorphism witnessing , i.e., it holds that: (i) and (ii) implies for each . Let be a function such that, for every term t (i.e., either a constant or a variable): (i) if for a constant c, then , i.e., assigns to t the copy of c; (ii) if for a variable v, then , i.e., assigns to t the same variable v. Since h is such that , we have that . Moreover, since implies for each , by construction we have that implies for each . Thus, witnesses that .
“If part”: The proof can be obtained by following the same considerations of the “Only-if” part, by switching the roles of the two functions h and . □
As a consequence of the above lemma, we can show that it is possible to reduce the problem of verifying whether to the equivalent problem of verifying whether , where is computed from D by substituting every occurrence of a constant with its fresh copy . Although not strictly necessary, in the problem of verifying whether (equivalent to ) this simple step allows us to simplify the presentation of the upper bound proof, because we can distinguish between the canonical structure on the left-hand side of the arrow and the canonical structure on the right-hand side of the arrow.
if and only if.
This is a straightforward consequence of Lemma 1 and the fact that and coincide (up to variable renaming). □
In what follows, given an -database D and a tuple of constants from , we will implicitly assume that , called the copy of D, represents the -database obtained from D by substituting every occurrence of a constant with its fresh copy , and that represents the copy of the tuple . It is clear that both and can be computed in polynomial time from D and .
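The copy operation itself is straightforward to realize; a minimal sketch, using a trailing prime to mark fresh copies (an illustrative naming choice, not the paper's):

```python
def primed_copy(db, tup):
    """Replace every constant c occurring in the database and in the
    tuple by its fresh copy c'; clearly polynomial in the input size."""
    prime = lambda c: c + "'"
    db2 = {(p, tuple(prime(t) for t in args)) for p, args in db}
    return db2, tuple(prime(c) for c in tup)
```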
The following fundamental lemma tells us that, when verifying whether , on the left-hand side of the arrow we can actually restrict the attention only at .
if and only if.
The “Only-if” part is trivial, since by restricting the homomorphism witnessing to the terms in we derive a homomorphism witnessing .
As for the “If” part, let h be a homomorphism witnessing . Consider any concept inclusion assertion (resp., role inclusion assertion ) in such that for some (resp., for some terms ). This means that (resp., ). Now, since by assumption we have that (resp., ) because h is a homomorphism from to , we derive that (resp., ). With this observation at hand, it is easy to see that any homomorphism h witnessing can be extended to a homomorphism witnessing . □
We now introduce some naming conventions we will adopt for the fresh variables generated when computing starting from D.
Naming convention for the chase of the database with respect to the mapping: Let be a GLAV mapping relating a source schema to an ontology , and let D be an -database. We call a single application of on D, where , with , is a mapping assertion in , and is a homomorphism from to D. We denote by the set of atoms , where extends by assigning to the existential variable the variable , for . Notice that, up to variable renaming, coincides with , where is the set of all the possible single applications of on D. It is important to notice that, in this way, each variable generated when chasing D with respect to carries in its name both the application that generated it and the existential variable of m it instantiates.
Naming convention for the chase with respect to the ontology: Let be a GLAV mapping relating a source schema to an ontology , let D be an -database, and let be a ontology. We follow standard conventions for the names given to the freshly introduced variables when computing the chase of with respect to (see, e.g., [33]). More specifically, contains all the constants and variables occurring in and all the variables of the form , with , , and a basic role for each , such that:
there is a basic concept B with and , but for all and basic role with ;
and , for each .
As an example, suppose that for a term t that is either a constant in D or a fresh variable introduced when chasing D with respect to , and let . Then, , where and are the fresh variables introduced when chasing with respect to .
Before illustrating the non-deterministic algorithms for solving the complements of (, GLAV, UCQ) and (, GLAV, UCQ), we introduce a new notion and some fundamental technical results. Let be an OBDM system such that is a ontology and is a GLAV mapping, D be an -database, be a tuple of constants from , be the copy of D, and be a tuple of constants from . A partial witnessing function (in short, pwf) of on is a function from to such that . We call such a function partial because only the constants of have an assignment to a term in , while the variables of have not been assigned yet. The following lemma follows by definition.
if and only if there exists a pwf of on that can be extended to a function f from to witnessing that , i.e., such that implies for each .
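The extension step in the lemma above can be sketched procedurally. The following Python fragment is a minimal illustration, not the paper's algorithm: atoms are plain tuples, variables are (by an assumed convention) capitalized strings, and `extend_pwf` searches by backtracking for an extension of a partial witnessing function to a full homomorphism into a finite set of target atoms.

```python
def is_variable(term):
    # Illustrative convention: variables start with an uppercase letter.
    return term[0].isupper()

def extend_pwf(query_atoms, pwf, target_atoms):
    """Try to extend a partial witnessing function pwf (defined on constants
    only) to all variables, so that every query atom is mapped onto some
    atom in target_atoms. Returns a witnessing function, or None."""
    # Every constant occurring in the query atoms must already be mapped.
    for atom in query_atoms:
        for t in atom[1:]:
            if not is_variable(t) and t not in pwf:
                return None

    def search(assignment, remaining):
        if not remaining:
            return assignment
        pred, *args = remaining[0]
        for cand in target_atoms:
            if cand[0] != pred or len(cand) != len(remaining[0]):
                continue
            new, ok = dict(assignment), True
            for a, b in zip(args, cand[1:]):
                if a in new and new[a] != b:
                    ok = False
                    break
                new[a] = b
            if ok:
                result = search(new, remaining[1:])
                if result is not None:
                    return result
        return None  # backtrack

    return search(dict(pwf), list(query_atoms))
```

For instance, `extend_pwf([("R", "a", "X"), ("S", "X", "b")], {"a": "a", "b": "b"}, [("R", "a", "v1"), ("S", "v1", "b")])` extends the pwf with the assignment of the variable X to the term v1.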
We now present a fundamental technical result that will allow us to restrict attention to those pwfs whose range contains only constants and variables whose name lengths are polynomial in the size of the input OBDM system Σ.
Let be the set of all the possible single applications of on D, and let be a pwf of on . Then, we have that can be extended to a function f from to witnessing that if and only if the following holds:
for every single application of on D, there is a function from variables into such that implies , where denotes the function from to such that if t is a constant and if t is a variable.
The proof can be obtained by simply combining these two observations: (i) all the constants in are already assigned to a term in by the function ; (ii) every single application of on D produces new variables in without ever reusing the variables introduced in other applications of the mapping. In other words, the set of all variables produced by a single application p of on D and the set of all variables produced by all the other possible single applications of on D are disjoint. Thus, a function f extending and witnessing can always be split into as many functions as in the statement of the lemma, and, vice versa, if the functions as in the statement of the lemma exist, then they can be combined into an overall function f witnessing . □
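The decomposition argument in this proof can be phrased as a one-line check. In the hypothetical sketch below, `can_extend_single` stands for the per-application test of the lemma; since distinct applications never share fresh variables, the global witnessing extension exists exactly when every per-application test succeeds independently.

```python
def witnessing_extension_exists(applications, pwf, can_extend_single):
    """Lemma-style decomposition (illustrative names): a pwf extends to a
    global witnessing function iff it extends over the fresh variables of
    each single mapping application in isolation, because distinct
    applications produce pairwise-disjoint sets of fresh variables."""
    return all(can_extend_single(pwf, app) for app in applications)
```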
As anticipated, the above lemma has a crucial consequence. In particular, let be an OBDM system. Since the number of atoms in generated by a single application p of on D is at most polynomial with respect to the size of (i.e., it is at most the number of atoms occurring in the longest right-hand side among all ), from the above lemma and due to the forest-shaped form of the canonical structure , it is not hard to see that the following holds: if there exists a pwf of on that can be extended to a function f from to witnessing that , then there exists another pwf that can be extended to a function from to witnessing that and such that each variable in the range of (and therefore also in the range of ) has a name length that is bounded by a polynomial in the size of and , which implies that the variable can be generated after polynomially many chase steps of with respect to .
We are now finally ready to present the non-deterministic algorithms for solving the complements of (, GLAV, UCQ) and (, GLAV, UCQ). We first concentrate on (, GLAV, UCQ) by providing the non-deterministic Algorithm 3 (CharExistence).
CharExistence
Notice that Algorithm 3 (CharExistence) uses an oracle for the auxiliary decision problem CheckTotalExtension, which is the problem of deciding, given an OBDM system and a pwf of on , whether can actually be extended to a function f from to witnessing that .
Let be a consistent OBDM system and be a D-dataset. We have that CharExistence(Σ, ) terminates and returns true if and only if there exists a proper Σ-characterization of in UCQ.
Termination of the algorithm is immediate, while its correctness can be obtained by simply combining the semantic test given at the end of Section 6 with Lemmata 2, 3, and 4. □
As for the running time, we observe that, due to the discussion after Lemma 5, we can concentrate only on pwfs whose range contains only constants and variables whose name lengths are polynomial in the size of the input OBDM system Σ, and therefore the overall is of polynomial size with respect to Σ, since the number of constants in D is polynomial. Furthermore, we can check in polynomial time whether is a pwf of on , i.e., whether all the terms in the range of occur in , as follows. For each term t in the range of , we compute , where is the guessed application of on for the term t (of course, we first need to check whether is a homomorphism from the atoms on the left-hand side of m to ). Starting from the set of atoms, we then iteratively apply the guessed ontology assertions in the sequence for the term t one after the other, each time applying the considered ontology assertion in a single chase step over all the atoms obtained so far. Finally, we check whether the term t occurs in the overall set of atoms obtained in this way.
Thus, with respect to the size of its input, the overall running time of the non-deterministic Algorithm 3 (CharExistence) is polynomial with an oracle for the decision problem CheckTotalExtension. As a consequence, we derive that the complement of (, GLAV, UCQ) is in with an oracle in , i.e., , where is the computational complexity of CheckTotalExtension. As for the case of (, GLAV, UCQ), we can use a slight adaptation of Algorithm 3 (CharExistence), which takes as input an additional D-dataset and modifies step 5 of the algorithm by checking instead of checking . This means that the complement of (, GLAV, UCQ) is in as well, where is the computational complexity of CheckTotalExtension.
To conclude the proof of the upper bound, we are now going to show that CheckTotalExtension is in (thus, ), which gives us the desired upper bound for the complements of both (, GLAV, UCQ) and (, GLAV, UCQ), and therefore this will show that both (, GLAV, UCQ) and (, GLAV, UCQ) are in , as required.
Upper bound for the decision problem CheckTotalExtension
CheckTotalExtension is in.
We recall that CheckTotalExtension is the problem of deciding, given an OBDM system and a pwf of on , whether can be extended to a function f from to witnessing that , i.e., such that implies for each .
By Lemma 5, we can immediately derive the following non-deterministic algorithm to solve the complement of the CheckTotalExtension decision problem:
Guess a pair , where is a mapping assertion in , and is a function from the variables in to ;
Check whether p is an application of m on D, which amounts to checking whether and is a homomorphism from to D (if not, then return false);
Compute ;
Call an oracle for the decision problem CheckSingleExtension with input Σ, , ;
If the oracle accepts, then return false; otherwise return true.
Notice that the above non-deterministic algorithm uses an oracle for the auxiliary decision problem CheckSingleExtension, which is the problem of deciding, given an OBDM system , a pwf of on , and a set of atoms from , whether can actually be extended to a function f from to such that f is a homomorphism from to , i.e., implies for each . Notice the difference between the decision problems CheckTotalExtension and CheckSingleExtension: while CheckTotalExtension takes as input only Σ and , CheckSingleExtension additionally takes as input the set of atoms .
The termination of the algorithm is straightforward. Furthermore, due to Lemma 5 and the definition of the auxiliary decision problem CheckSingleExtension, it is immediate to verify that the above non-deterministic algorithm solves the complement of CheckTotalExtension, i.e., returns true if and only if cannot be extended to a function f from to such that implies for each . As for its running time, we observe that the size of the guessed p is clearly polynomial with respect to the size of the input, and both checking whether is a homomorphism from to D, and computing the set of atoms, given , can be performed in polynomial time with respect to the size of the input. As a consequence, we derive that the complement of CheckTotalExtension is in with an oracle in , i.e., , where is the computational complexity of CheckSingleExtension. Thus, CheckTotalExtension is in . In the following lemma, we are going to show that CheckSingleExtension is in (thus, ), which concludes the proof of the desired upper bound for the decision problem CheckTotalExtension. □
Upper bound for the decision problem CheckSingleExtension
CheckSingleExtension is in.
Recall that CheckSingleExtension is the problem of deciding, given an OBDM system , a pwf of on , and a set such that , whether can be extended to a homomorphism f from to . In other words, the problem asks for a function that assigns to each variable in a term occurring in such that implies for each , where is the function such that if t is a constant, and if t is a variable.
Obviously, for a homomorphism f from to to exist, we do not need to generate all the atoms in : we actually need at most as many atoms as there are in . Furthermore, due to the forest-shaped form of the canonical structure , each such atom can be generated after polynomially many chase steps of with respect to .
From the above consideration, we can immediately derive the following non-deterministic algorithm to solve the CheckSingleExtension decision problem.
;
For each :
Guess a pair and a sequence of ontology assertions in . Here, is a mapping assertion in and is a function from the variables in to ;
Check whether is a homomorphism from to (if not, then return false);
Compute and set ;
Let ;
For each :
Let be the set of all the fresh atoms obtained by applying the ontology assertion to all the atoms in in a single chase step;
Set ;
End For
End For
Guess a function from the variables in to ;
Let be the function from to such that if t is a constant and if t is a variable;
If f is a homomorphism from to , then return true; otherwise return false.
In a nutshell, for each atom , the algorithm generates a set of atoms in by first computing the guessed application of on , and then by iteratively applying the guessed ontology assertions in the sequence one after the other, each time by applying the considered ontology assertion in a single chase step over all the atoms obtained so far. Finally, the algorithm checks whether the given can be extended to a homomorphism from to , where is the set of all the atoms obtained as above.
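A single chase step of the kind applied in the loop above can be illustrated as follows. The sketch assumes a DL-Lite-style existential axiom ("every instance of concept B has an R-successor") and a fresh-variable naming scheme that records the generating term and role, in the spirit of the naming convention introduced earlier; all names are illustrative, not the paper's exact definitions.

```python
def chase_step(atoms, concept, role):
    """Apply the axiom  concept ⊑ ∃role  once, in a single chase step, to
    every term that is an instance of `concept` and has no role-successor.
    Fresh variable names encode the generating term and the role."""
    atoms = set(atoms)
    fresh = []
    for (pred, *args) in sorted(atoms):
        if pred != concept:
            continue
        t = args[0]
        if any(a[0] == role and a[1] == t for a in atoms):
            continue  # t already has a role-successor: axiom satisfied
        w = f"{t}.{role}"  # fresh variable; name records its derivation
        atoms.add((role, t, w))
        fresh.append(w)
    return atoms, fresh
```

Applying the step twice with the same axiom introduces no further variables, since the axiom is already satisfied after the first application.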
The termination and the correctness of the algorithm are straightforward. It is indeed immediate to verify that the above non-deterministic algorithm solves CheckSingleExtension, i.e., returns true if and only if can be extended to a homomorphism f from to .
As for its running time, as already observed, we can restrict attention to those atoms of that can be generated after polynomially many chase steps of with respect to . Thus, for each , both the guessed and the guessed sequence are of polynomial size with respect to the input, which implies that the overall number of atoms in is of polynomial size with respect to the input. Furthermore, checking whether is a homomorphism from to , computing the set of atoms, applying in sequence the ontology assertions in in single chase steps, and checking whether f is a homomorphism from to (i.e., whether implies for each ) are all steps that can be carried out in polynomial time with respect to the size of the input. This means that the above non-deterministic algorithm runs in polynomial time with respect to the size of the input, and therefore that CheckSingleExtension is in , as required. □
Conclusion and future work
In this paper, we have studied logical separability in OBDM. As a first contribution, we have illustrated a general framework for separability in OBDM, also proposing natural relaxations of the classical separability notion (called here proper separation). In the general vision of separability as a tool for providing global post-hoc explanations of black-box models, we argue that such relaxations (especially the best approximations) are crucial in order to always be able to provide meaningful explanations, even when proper separations do not exist. As a second contribution, by instantiating the general framework with the most common languages used in OBDM, we have provided a comprehensive study of three computational problems associated with the framework, namely Verification, Computation, and Existence. For the decision problems related to Verification and Existence, we have provided tight complexity results, whereas for the Computation problem we have devised two algorithms for computing the two forms of best approximated separations considered in this paper, thus proving that they always exist.
We conclude the paper by discussing some interesting avenues for future work that deserve further investigation.
More expressive scenarios It would be interesting to study extensions of the scenario considered in this paper for the computational problems. For example, one may consider more expressive target query languages that go beyond UCQ, in order to capture proper separations in more cases. Natural candidates for this are , i.e., UCQ with inequalities, first-order logic, and [16]. However, differently from our case, when the considered scenario is extended with one of these query languages, whether or not the UNA is adopted makes a difference.
As other interesting extensions of the considered scenario, one may investigate highly expressive ontology languages, such as [36] or even [35], where the separability task has not yet been studied even in the ontology-enriched query answering setting (i.e., without considering mapping assertions).
Strong separability In [37], separability comes in two flavours: weak separability and strong separability. In weak separability, which is the one we addressed in this paper, the requirement on the negative examples is that none of them is included in the set of certain answers of the separating query. On the other hand, in strong separability, the requirement on the negative examples is that all of them are included in the set of the certain answers of the negation of the separating query. Thus, a natural problem to be addressed in our considered OBDM scenario is strong separability, by considering relaxations also in this case.
Quantitative metrics for best approximations In this work, we have defined best approximations of the proper separation notion by adopting the set-inclusion metric. Another common choice that would be interesting to investigate is the cardinality criterion. Moreover, for a more fine-grained usage of separability as a tool for explanation, it would be natural to assign weights to both positive and negative examples, and to consider such weights when computing best approximations of proper separations.
Restricted signature A possible constraint that can be imposed when finding a separating query is that the query expression uses only a specific subset of the predicates available in the alphabet of the ontology . This may be important to meet some users’ requirements regarding the separating query: for example, a user may ask for a separating query that does not make use of a specific concept. The separation task under a restricted signature has been studied in [39] in the ontology-enriched query answering setting, in the particular case of expressive ontology languages, such as or . Typically, the computational complexity of checking for a proper separation increases under the restricted signature constraint. Thus, another notable direction is to study separability in the considered OBDM scenario under the restricted signature constraint.
Intelligible queries A delicate question from an end-user perspective is the number of atoms involved in each of the disjuncts of the separating queries. While under both GAV and LAV mappings our algorithms presented for the computation problem ensure that this number of atoms is “only” polynomially related to the size of the mapping and the database, the same is not true in the presence of GLAV mappings. In particular, it remains an interesting open question whether or not, under GLAV mappings, there are cases in which separations (or their best approximated versions) must necessarily be of exponential size with respect to the size of the mapping and the database.
More generally speaking, the separating queries may sometimes be very hard to understand from an end-user perspective. Thus, it would be natural to also consider the length of each disjunct, as well as the number of disjuncts, as additional parameters when considering approximations. This requires novel definitions and techniques that may allow obtaining (approximated) separating queries that are more intelligible from end users’ perspectives.
Implementation Finally, we mention that we are currently implementing the algorithms and techniques proposed in this paper using the OBDM engine Mastro [14]. The implementation we are working on follows the recently introduced Human-in-the-loop Artificial Intelligence (HitAI) paradigm [69]. By running some experiments with a first prototype in real-world settings, we observed that the above-mentioned issues for future work turn out to be crucial, especially the ones concerning intelligible queries and quantitative metrics.
Acknowledgements
This work has been supported by MUR under the PRIN 2017 project HOPE (prot. 2017MMJJRE) and under the PNRR project FAIR (PE0000013), by the EU under the H2020-EU.2.1.1 project TAILOR (grant id. 952215), and by European Research Council under the European Union’s Horizon 2020 Programme through the ERC Advanced Grant WhiteMech (No. 834228). Gianluca Cima has been fully supported by MUR under the PNRR project FAIR (PE0000013).
References
1.
S.Abiteboul, R.Hull and V.Vianu, Foundations of Databases, Addison Wesley Publ. Co., 1995.
2.
D.Angluin, Queries and concept learning, Machine Learning2(4) (1987), 319–342.
3.
M.Arenas and G.I.Diaz, The exact complexity of the first-order logic definability problem, ACM Trans. Database Syst.41(2) (2016), 13. doi:10.1145/2886095.
4.
M.Arenas, G.I.Diaz and E.V.Kostylev, Reverse engineering SPARQL queries, in: Proceedings of the 25th International Conference on World Wide Web, 2016.
5.
A.Artale, D.Calvanese, R.Kontchakov and M.Zakharyaschev, The DL-lite family and relations, Journal of Artificial Intelligence Research36 (2009), 1–69. doi:10.1613/jair.2820.
6.
F.Baader, D.Calvanese, D.McGuinness, D.Nardi and P.F.Patel-Schneider (eds), The Description Logic Handbook: Theory, Implementation and Applications, Cambridge University Press, 2003.
7.
P.Barceló and M.Romero, The complexity of reverse engineering problems for conjunctive queries, in: 20th International Conference on Database Theory (ICDT 2017), M.Benedikt and G.Orsi, eds, Leibniz International Proceedings in Informatics (LIPIcs), Vol. 68, Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2017, pp. 7:1–7:17, ISSN 1868-8969, http://drops.dagstuhl.de/opus/volltexte/2017/7052. ISBN 978-3-95977-024-8. doi:10.4230/LIPIcs.ICDT.2017.7.
8.
A.Bondy and M.R.Murty, Graph Theory, Graduate Texts in Mathematics, Springer, 2008.
9.
E.Botoeva, R.Kontchakov, V.Ryzhikov, F.Wolter and M.Zakharyaschev, Games for query inseparability of description logic knowledge bases, Artificial Intelligence234 (2016), 78–119. doi:10.1016/j.artint.2016.01.010.
10.
E.Botoeva, C.Lutz, V.Ryzhikov, F.Wolter and M.Zakharyaschev, Query inseparability for ALC ontologies, Artif. Intell.272(C) (2019), 1–51. doi:10.1016/j.artint.2018.09.003.
11.
L.Bühmann, J.Lehmann, P.Westphal and S.Bin, DL-Learner structured machine learning on Semantic Web data, in: Companion Proceedings of the The Web Conference 2018, WWW ’18, International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 2018, pp. 467–471. ISBN 978-1-4503-5640-4. doi:10.1145/3184558.3186235.
12.
S.R.Buss and L.Hay, On truth-table reducibility to SAT, Information and Computation91(1) (1991), 86–102. doi:10.1016/0890-5401(91)90075-D.
13.
A.Calì, G.Gottlob and M.Kifer, Taming the infinite chase: Query answering under expressive relational constraints, Journal of Artificial Intelligence Research48 (2013), 115–174. doi:10.1613/jair.3873.
14.
D.Calvanese, G.De Giacomo, D.Lembo, M.Lenzerini, A.Poggi, M.Rodriguez-Muro, R.Rosati, M.Ruzzi and D.F.Savo, The mastro system for ontology-based data access, Semantic Web Journal2(1) (2011), 43–53. doi:10.3233/SW-2011-0029.
15.
D.Calvanese, G.De Giacomo, D.Lembo, M.Lenzerini and R.Rosati, Tractable reasoning and efficient query answering in description logics: The DL-lite family, Journal of Automated Reasoning39(3) (2007), 385–429. doi:10.1007/s10817-007-9078-x.
16.
D.Calvanese, G.De Giacomo, D.Lembo, M.Lenzerini and R.Rosati, EQL-Lite: Effective first-order query processing in description logics, in: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI 2007), 2007, pp. 274–279.
17.
D.Calvanese, G.De Giacomo, M.Lenzerini and M.Y.Vardi, Query processing under GLAV mappings for relational and graph databases, Proceedings of the Very Large Database Endowment6(2) (2012), 61–72.
18.
G.Cima, Preliminary results on ontology-based open data publishing, in: Proceedings of the Thirtieth International Workshop on Description Logics (DL 2017), CEUR Electronic Workshop Proceedings, Vol. 1879, 2017. http://ceur-ws.org/.
19.
G.Cima, Abstraction in Ontology-Based Data Management, Frontiers in Artificial Intelligence and Applications, Vol. 348, IOS Press, 2022.
20.
G.Cima, M.Console, M.Lenzerini and A.Poggi, Abstraction in data integration, in: Proceedings of the Thirty-Sixth Annual ACM/IEEE Symposium on Logic in Computer Science (LICS 2021), IEEE, 2021, pp. 1–11.
21.
G.Cima, M.Console, M.Lenzerini and A.Poggi, Monotone abstractions in ontology-based data management, in: Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI 2022), 2022, pp. 5556–5563.
22.
G.Cima, F.Croce and M.Lenzerini, Query definability and its approximations in ontology-based data management, in: CIKM’21: The 30th ACM International Conference on Information and Knowledge Management, ACM, 2021.
23.
G.Cima, M.Lenzerini and A.Poggi, Semantic characterization of data services through ontologies, in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI 2019), 2019, pp. 1647–1653.
24.
G.Cima, M.Lenzerini and A.Poggi, Non-monotonic ontology-based abstractions of data services, in: Proceedings of the Seventeenth International Conference on Principles of Knowledge Representation and Reasoning (KR 2020), 2020, pp. 243–252. doi:10.24963/kr.2020/25.
25.
G.Ciravegna, F.Giannini, M.Gori, M.Maggini and S.Melacci, Human-driven FOL explanations of deep learning, in: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, 2020.
26.
R.Confalonieri, T.Weyde, T.R.Besold and F.M.del Prado Martín, Using ontologies to enhance human understandability of global post-hoc explanations of black-box models, Artif. Intell. (2021).
27.
A.Doan, A.Y.Halevy and Z.G.Ives, Principles of Data Integration, Morgan Kaufmann, 2012. ISBN 978-0-12-416044-6.
28.
R.Fagin, P.G.Kolaitis, R.J.Miller and L.Popa, Data exchange: Semantics and query answering, Theoretical Computer Science336(1) (2005), 89–124. doi:10.1016/j.tcs.2004.10.033.
29.
M.Friedman, A.Levy and T.Millstein, Navigational plans for data integration, in: Proceedings of the Sixteenth National Conference on Artificial Intelligence (AAAI 1999), AAAI Press, 1999, pp. 67–73.
30.
M.Funk, J.C.Jung and C.Lutz, Frontiers and exact learning of ELI queries under DL-lite ontologies, in: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI 2022), 2022, pp. 2627–2633.
31.
M.Funk, J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Learning description logic concepts: When can positive and negative examples be separated? in: Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 1682–1688. doi:10.24963/ijcai.2019/233.
32.
M.R.Garey, D.S.Johnson and L.J.Stockmeyer, Some simplified NP-complete graph problems, Theoretical Computer Science1(3) (1976), 237–267. doi:10.1016/0304-3975(76)90059-1.
33.
V.Gutiérrez-Basulto, Y.A.Ibáñez-García, R.Kontchakov and E.V.Kostylev, Queries with negation and inequalities over lightweight ontologies, Journal of Web Semantics35 (2015), 184–202. doi:10.1016/j.websem.2015.06.002.
34.
V.Gutiérrez-Basulto, J.C.Jung and L.Sabellek, Reverse engineering queries in ontology-enriched systems: The case of expressive horn description logic ontologies, in: Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI-18, International Joint Conferences on Artificial Intelligence Organization, 2018, pp. 1847–1853. doi:10.24963/ijcai.2018/255.
35.
I.Horrocks, O.Kutz and U.Sattler, The even more irresistible SROIQ, in: Proceedings of the Tenth International Conference on the Principles of Knowledge Representation and Reasoning (KR 2006), 2006, pp. 57–67.
36.
I.Horrocks, U.Sattler and S.Tobies, Reasoning with individuals for the description logic SHIQ, in: Proceedings of the Seventeenth International Conference on Automated Deduction (CADE 2000), D.McAllester, ed., Lecture Notes in Computer Science, Vol. 1831, Springer, 2000, pp. 482–496.
37.
J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Logical separability of incomplete data under ontologies, in: Proceedings of the 17th International Conference on Principles of Knowledge Representation and Reasoning, KR 2020, Rhodes, Greece, 2020.
38.
J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Separating data examples by description logic concepts with restricted signatures, in: Proceedings of the Eighteenth International Conference on Principles of Knowledge Representation and Reasoning, International Joint Conferences on Artificial Intelligence Organization, 2021.
39.
J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Separating data examples by description logic concepts with restricted signatures, in: Proceedings of the 18th International Conference on Principles of Knowledge Representation and Reasoning, KR 2021, Online Event, November 3–12, 2021, 2021, pp. 390–399. doi:10.24963/kr.2021/37.
40.
J.C.Jung, C.Lutz, H.Pulcini and F.Wolter, Logical separability of labeled data examples under ontologies, Artificial Intelligence313 (2022), 103785. doi:10.1016/j.artint.2022.103785.
41.
D.V.Kalashnikov, L.V.S.Lakshmanan and D.Srivastava, FastQRE: Fast query reverse engineering, in: Proceedings of the 2018 International Conference on Management of Data, Association for Computing Machinery, 2018.
42.
B.Konev, A.Ozaki and F.Wolter, A model for learning description logic ontologies based on exact learning, in: AAAI, 2016.
43.
M.Law, A.Russo and K.Broda, Inductive learning of answer set programs, in: Logics in Artificial Intelligence, 2014.
44.
J.Lehmann and P.Hitzler, Concept learning in description logics using refinement operators, Mach. Learn. (2010).
45.
M.Lenzerini, Data integration: A theoretical perspective, in: Proceedings of the Twentyfirst ACM SIGACT SIGMOD SIGART Symposium on Principles of Database Systems (PODS 2002), 2002, pp. 233–246.
46.
M.Lenzerini, Ontology-based data management, in: Proceedings of the 20th ACM Conference on Information and, Knowledge Management, CIKM, 2011.
47.
M.Lenzerini, Ontology-based data management, in: Proceedings of the Twentieth International Conference on Information and Knowledge Management (CIKM 2011), 2011, pp. 5–6. doi:10.1145/2063576.2063582.
48.
M.Lenzerini, Managing data through the lens of an ontology, AI Magazine39(2) (2018), 65–74. doi:10.1609/aimag.v39i2.2802.
49.
J.Liartis, E.Dervakos, O.M.Mastromichalakis, A.Chortaras and G.Stamou, Semantic queries explaining opaque machine learning classifiers, in: Proceedings of the Workshop on Data Meets Applied Ontologies in Explainable AI, 2021.
50.
C.Lutz, J.Marti and L.Sabellek, Query expressibility and verification in ontology-based data access, in: Principles of Knowledge Representation and Reasoning: Proceedings of the Sixteenth International Conference (KR 2018), 2018, pp. 389–398.
51.
D.M.L.Martins, Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities, Information Systems83 (2019), 89–100. doi:10.1016/j.is.2019.03.002.
52.
D.M.L.Martins, Reverse engineering database queries from examples: State-of-the-art, challenges, and research opportunities, Information Systems83 (2019), 89–100. doi:10.1016/j.is.2019.03.002.
53.
B.Motik, B.Cuenca Grau, I.Horrocks, Z.Wu, A.Fokoue and C.Lutz, OWL 2 Web Ontology Language Profiles (Second Edition), W3C Recommendation, World Wide Web Consortium, 2012, available at http://www.w3.org/TR/owl2-profiles/.
54.
D.Mottin, M.Lissandrini, Y.Velegrakis and T.Palpanas, New trends on exploratory methods for data analytics, Proc. VLDB Endow.10(12) (2017), 1977–1980. doi:10.14778/3137765.3137824.
55.
M.Ortiz, Ontology-mediated queries from examples: A glimpse at the DL-Lite case, in: Proceedings of the Fifth Global Conference on Artificial Intelligence, EPiC Series in Computing, Vol. 65, 2019, pp. 1–14.
56.
C.H.Papadimitriou and M.Yannakakis, The complexity of facets (and some facets of complexity), Journal of Computer and System Sciences28(2) (1984), 244–259. doi:10.1016/0022-0000(84)90068-0.
A.Poggi, D.Lembo, D.Calvanese, G.De Giacomo, M.Lenzerini and R.Rosati, Linking data to ontologies, Journal on Data SemanticsX (2008), 133–173.
59.
J.Rothe, Exact complexity of exact-four-colorability, Information Processing Letters87(1) (2003), 7–12. doi:10.1016/S0020-0190(03)00229-1.
60.
M.K.Sarker, N.Xie, D.Doran, M.Raymer and P.Hitzler, Explaining trained neural networks with Semantic Web technologies: First steps, in: Proceedings of the Twelfth International Workshop on Neural-Symbolic Learning and Reasoning, 2017.
61.
L.J.Stockmeyer, The polynomial-time hierarchy, Theoretical Computer Science3(1) (1976), 1–22. doi:10.1016/0304-3975(76)90061-X.
62.
L.J.Stockmeyer, The polynomial-time hierarchy, Theoretical Computer Science3(1) (1976), 1–22. doi:10.1016/0304-3975(76)90061-X.
63.
B.ten Cate and V.Dalmau, The product homomorphism problem and applications, in: Proceedings of the Eigthteenth International Conference on Database Theory (ICDT 2015), LIPIcs, Vol. 31, 2015, pp. 161–176.
64.
B.ten Cate and V.Dalmau, Conjunctive queries: Unique characterizations and exact learnability, in: 24th International Conference on Database Theory (ICDT 2021), Vol. 186, 2021.
65.
Q.T.Tran, C.-Y.Chan and S.Parthasarathy, Query Reverse Engineering, The VLDB Journal23(5) (2014).
66.
K.W.Wagner, More complicated questions about maxima and minima, and some closures of NP, Theoretical Computer Science51 (1987), 53–80. doi:10.1016/0304-3975(87)90049-1.
Y.Y.Weiss and S.Cohen, Reverse engineering SPJ-queries from examples, in: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, 2017.
M.M.Zloof, Query-by-example: The invocation and definition of tables and forms, in: Proceedings of the 1st International Conference on Very Large Data Bases, VLDB ’75, ACM, New York, NY, USA, 1975, pp. 1–24. ISBN 978-1-4503-3920-9. doi:10.1145/1282480.1282482.