Sage Journals: Discover world-class research

Abstract

Cyberphysical systems (CPSs) have been widely applied in a variety of applications to collect data, while data is often dirty in reality. We pay attention to the way of evaluating data inconsistency which is a major concern for evaluating quality of data and its source. This paper is the first study on data inconsistency evaluation problem for CPS based on conditional functional dependencies. Given a database instance D including n tuples and a CFD set $Σ$ including r CFDs, data inconsistency is defined as the ratio of the size of minimum culprit in D, where a culprit is a set of tuples leading to integrity errors. Firstly, we give a sufficient analysis on the complexity and inapproximability of minimum culprit problem. Then, we provide a practical algorithm that gives a 2-approximation of the data dirtiness in $O (r n \log n)$ time based on independent residual subgraph. To deal with the large dynamic data, we provide a compact structure based on B-tree for storing independent residual subgraph in order to update inconsistency efficiently. At last, we test our algorithm on both synthetic and real-life datasets; the experiment results show the scalability of our algorithm and the quality of the evaluation result.

1. Introduction

Cyberphysical Systems (CPSs) have been widely applied in a variety of applications to collect data, such as temperature, heart rate, and speed, from the physical world and make decisions based on the analysis of the data, thereby controlling and optimizing the physical objects in the real world, and they have a great influence on the way we observe and change the world [1]. CPS obtains the information of the physical world through many sensors and impacts the environment by actuators. Data sensed and sampled by sensors usually contains valuable information about the physical world, and its volume is growing. For better understanding and changing the physical environment, data collection and analysis are of the essence [2]. The knowledge extracted from the data also guides the behavior of actuators in CPS; for instance, sensors and actuators cooperate with each other to monitor some area [3] and react when a certain event is detected [4, 5]. Data gathered by sensors will not just be thrown away when they have been transmitted to the processors, but they will also be stored for further analysis. Unfortunately, not all the information gathered by different CPSs is reliable due to hardware and communication limits [6]. Many deployment experiences have shown that low data quality is the most serious problem that impacts CPS performance. Tolle et al. pointed out that faulty data can occur in various unexpected ways and less than 69% of their data could be used for meaningful interpretation [7]. Szewczyk et al. also found that about 30% of data are faulty in their deployment [8]. What makes the situation worse is that the quality of data is not easily judged. It is important to find a way to identify the quality of data gathered by CPSs to estimate the availability of data. Meanwhile, the data quality also reflects the reliability of the system. In this paper, we utilized data inconsistency to measure the data quality and we store all the data in a relational database.

Once these systems get pervasive and ubiquitously available, large amounts of data will be collected. They may include faked information. Such case highlights the quality of the data in the decision-making system and other CPSs; it is crucial for the success of the applications. Without high-quality data, no high-quality service based on the right decision could be provided, for instance, aggregation and routing services [9–15].

1.1. Motivation

In CPSs, data is collected mostly from the physical world. However, data availability would be reduced by faulty data which is not reporting the real value of the monitoring objects. The idea in this paper is that the database techniques of data inconsistency can be utilized to model and manage the data quality for CPS, in order to evaluate data source quality, CPS data quality, and so on. Based on this, we propose the new measurement and technique for its efficient computation.

In database technique, data consistency is one of the most important aspects of data quality; it is usually defined based on integrity constraints. They are semantic conditions that a database should satisfy in order to be an appropriate model of external reality. In practice, a database may not satisfy those integrity constraints, and for that reason it is said to be inconsistent or dirty. As a type of integrity constraint, conditional functional dependency (CFD) [16] has been proposed to capture inconsistency in data, which is a generalization of functional dependency (FD) [17] and has more powerful expressibility than FD. Based on CFD, many works on data quality have appeared; for example, [18–20] focus on inconsistency detection problem, while [21–26] focus on the data repairing problem. Besides inconsistency detection and data repairing, an important problem is data inconsistency evaluation which aims to quantify how dirty the data is.

Traditional evaluation methods for data source are mostly based on statistics. Compared with them, the logical method proposed in this paper is more flexible and fundamental. And the proposed method has a higher capability of expression. To the best of our knowledge, there is no existing work on providing a specific formula quantifying the data inconsistency based on CFD. We now give an example for modeling CPS data using CFDs. Consider the example below.

Example 1.

A CPS group maintains a relation of sensing data for its laboratory for several years:

\begin{matrix} C M (s i d, l o c, t i m e, w e e k, d a t e, t e m p, v i b r a t e) . \end{matrix}

(1)

Each climate monitor tuple contains information about a record t (a unique sensor id

s i d

, location of sensor loc, time information about data reporting (time, week, and date), temperature, and vibrate). A sampled fragment D from all the data is shown in Table 1.

Two CFDs defined over such sampled data are shown as follows:

\begin{matrix} φ_{1} : (loc, time, date ⟶ vibrate, \{t_{p} (_,_,_,_)\}), \\ φ_{2} : (sid, week ⟶ loc, \{t_{p^{'}} (_,_,_), t_{p} (s 817,_, 6 : 8)\}); \end{matrix}

(2)

intuitively,

φ_{1}

restrains the notion that the vibrate status of each location is the same at the same time, while

φ_{2}

specifies that the location of each sensor cannot be changed in the same week; however, for the special one “

s 817

,” its position cannot be changed no matter the time. According to such two CFDs, D is inconsistent, since there exists violation in them as follows:

(i)

Tuple pair $(t_{1}, t_{4})$ are violations with respect to $φ_{1}$ , because they reported different “vibrate” status.

(ii)

Tuple $t_{4}$ is a violation with respect to $φ_{2}$ , because “s817” at location “6:8” cannot be changed.

(iii)

Tuple pairs $(t_{1}, t_{3})$ and $(t_{2}, t_{3})$ are violations with respect to $φ_{2}$ , because “s816” reported different “location” at the same week.

It is easy to see that the size of minimum culprit is

2

because any culprit cannot have less than

2

tuples; for example, subset

{t_{3}, t_{4}}

is a minimum culprit. That is to say, the data we sampled is not very reliable, and, to make it clear, at least 33.3% of the data should be cleaned.

Table 1

Sampled data.

	sid	loc	Time	Week	Date	Temp.	Vibrate
$t_{1}$	s816	6:8.1	14:10	0079	11-04	48	0
$t_{2}$	s816	6:8.1	14:10	0079	11-05	24	1
$t_{3}$	s816	7:4.2	14:10	0079	11-06	22	1
$t_{4}$	s817	6:8.1	14:10	0079	11-04	24	1
$t_{5}$	s817	6:8	14:10	0080	11-09	24	0
$t_{6}$	s817	6:8	14:10	0080	11-10	22	0

Motivated by this, we consider how to efficiently compute this inconsistency measurement when the integrity constraint is conditional functional dependency in this paper. Technically, to the best of our knowledge, there is no existing work considering this aspect. There are some detection techniques [18–20, 27] but they are not able to reveal how dirty the data is directly. For confidence computation [28], our problem generalizes the confidence of a single CFD; actually, this measurement is also the confidence of a set of CFDs. For repairing techniques, our problem can be seen as a special case of [29], because the complementary minimum culprit can be seen as C-repair (cardinality repair) of an inconsistent database; however, it is much more expensive using the techniques proposed by [29] directly, especially for dynamic data, and the algorithm given in this paper is more efficient and seems optimal. Briefly, there are three challenges. $(1)$ The first challenge is how to evaluate the inconsistency efficiently. It is proved that the inconsistency evaluation problem we study is NP-complete even if there are only two CFDs in rule set and there are only three attributes in relation schema. Therefore, we should provide an efficient approximation algorithm for evaluating the data inconsistency. $(2)$ Because most existing repair algorithms that could perform repair encounter a huge searching space when data is large and they have to take an expensive cost on performance, the approximation algorithm is also expected to be more efficient than the C-repair algorithms and to be able to guarantee the approximation ratio and evaluate large data more than data in memory. $(3)$ For the dynamic data, an external memory data structure is necessary to make the approximation algorithm able to deal with the update of tuples efficiently rather than recomputing the inconsistency from scratch.

1.2. Contributions

This paper first studies how to compute the data inconsistency for CPS data efficiently with respect to CFDs; the main contributions in this paper are as follows: (a)

We formally define the inconsistency evaluation problem. The inconsistency of a given database instance D is defined based on minimum culprit, the minimum subset of D, whose complementary value in D is consistent with respect to all the given CFDs, and we use the proportion of minimum culprit to quantize the inconsistency of a database. It is proved that it is monotonic and insensitive to a small change on the database. And we prove that the minimum culprit problem is still NP-complete even if $(1)$ Σ has only two variable CFDs; $(2)$ the relation has only $3$ attributes; and $(3)$ the number of violations caused by each tuple is not more than $6$ .

(b)

Based on the conflict graph model, we transform the inconsistency evaluation problem into the minimum vertex cover problem based on conflict graph model. Based on finding the maximal matching of independent residual subgraph, we give a $2$ -approximation algorithm with $O (r n \log n)$ time complexity where r is the number of given CFDs and always a small constant. To deal with large dynamic data, we design a compact structure for indexing all tuples and give the method for its maintenance. Some useful properties of independent residual subgraph prevent storing edges in the compact structure so that the storage cost of the graph is $O (r n)$ and the update cost is only $O (r \log n)$ .

(c)

Using TPCH for large-scale data and IMDB and DBLP for real-life data, we conduct experiments on PC. We find that the adjusted counterpart outperforms the evaluation algorithm if several CFDs are of small confidence while the others are not. In addition, our algorithms scale well with both the size of the data and the number of CFDs.

2. Background

An l-ary relation schema can be represented by $R (A_{1}, A_{2}, \dots, A_{l})$ , where R is the relation name and all $A_{j}$ 's ( $1 \leq j \leq l$ ) are the attributes of R. Let $a t t r (R)$ be ${A_{1}, \dots, A_{l}}$ , and let $d o m (A_{j})$ be the domain of attribute $A_{j}$ . An instance D of relation R is a set of l-ary tuples, denoted by $D = {t_{1}, t_{2}, \dots, t_{n}}$ , where each tuple $t_{i} (1 \leq i \leq n)$ belongs to the set $d o m (A_{1}) \times d o m (A_{2}) \times \dots \times d o m (A_{l})$ . Let $t_{i} [A_{j}]$ be the value of $t_{i}$ on attribute $A_{j}$ .

Conditional functional dependency, CFD for short, is a class of integrity constraints capturing the consistency of data, whose formal definition can be found in [24]. Next, the syntax and semantic definitions of CFD will be reviewed briefly. (i)

Syntax. A CFD rule φ defined over relation R is a pair $(X \to Y, T_{p})$ , where X and Y are two distinctive attribute lists satisfying $X \cup Y \subseteq a t t r (R)$ , $X \to Y$ is a standard FD, and $T_{p}$ is a pattern tableau over attributes $X \cup Y$ . For each tuple $t_{p} \in T_{p}$ and each attribute $A \in X \cup Y$ , the value $t_{p} [A]$ can be either a constant “a” in $d o m (A)$ or a wild card “_”. For a rule φ, we use $L H S (φ)$ to denote X and $R H S (φ)$ to denote Y.

(ii)

Semantic. Given a tuple t and a pattern tuple $t_{p}$ , t is said to match $t_{p}$ , denoted by $t ≍ t_{p}$ , if either $t [A] = t_{p} [A]$ or $t_{p} [A]$ = “_” is satisfied for each attribute A. Two tuples $t_{1}$ and $t_{2}$ satisfy φ, denoted by $(t_{1}, t_{2}) ⊨ φ$ , if when $t_{1} [X] = t_{2} [X] ≍ t_{p} [X]$ , there must be $t_{1} [Y] = t_{2} [Y] ≍ t_{p} [Y]$ ; if $t_{1} [X] = t_{2} [X] ≍ t_{p} [X]$ but $t_{1} [Y] \neq t_{2} [Y]$ , then tuple pair $(t_{1}, t_{2})$ is a violation. Particularly, a single tuple $t ⊨ φ$ , if when $t [X] ≍ t_{p} [X]$ , we must have $t [Y] ≍ t_{p} [Y]$ . Given a relational instance D and a CFD rule φ, D is said to satisfy φ (i.e., $D ⊨ φ$ ) iff, (a) for each tuple $t \in D$ , $t ⊨ φ$ and, (b) for any two tuples $t_{1}$ and $t_{2}$ in D, $(t_{1}, t_{2}) ⊨ φ$ . Given a CFD set Σ, D is consistent with respect to Σ, if it satisfies all rules in set Σ; otherwise, it is inconsistent or dirty, denoted by $D ⊭ Σ$ . For example, in Table 1, tuple pair $(t_{1}, t_{2})$ is a violation with respect to $φ_{1}$ shown in Example 1 (i.e., $(t_{1}, t_{2}) ⊭ φ_{1}$ ) due to the fact that “ $t_{1} [l o c, t i m e, d a t e] = t_{4} [l o c, t i m e, d a t e] ≍ t_{p} [l o c, t i m e, d a t e]$ ”, but “ $t_{1} [v i b r a t e] =$ ‘0’ $\neq t_{4} [v i b r a t e] =$ ‘1’”; therefore, D is inconsistent or dirty because of the existence of violation.

A CFD is said to be simple if there is only one row in its pattern tableau such as both CFDs shown in Example 1. Additionally, two special fragments of simple CFD can be defined as follows: (i)

A simple CFD $φ = (X \to Y, T_{p})$ is said to be a variable CFD, if, for each $A \in Y$ , $t_{p} [A] \neq$ “_”; for example, $φ_{1}$ is a variable CFD.

(ii)

A simple CFD $φ = (X \to Y, T_{p})$ is said to be a constant CFD, if, for each $A \in Y$ , $t_{p} [A] =$ “_”; for example, the second pattern of $φ_{2}$ can be changed into a constant CFD.

Intuitively, a constant CFD can capture inconsistencies on single tuple, while a variable CFD can capture inconsistencies between two tuples.

In fact, given any CFD, it can be rewritten by some simple CFDs naïvely by splitting its tableau horizontally, while a simple CFD can be rewritten by at most a constant one and a variable one. Therefore, without loss of generality, only simple CFDs are used in this paper.

3. Problem Definition

This section first formally defines the data inconsistency evaluation problem and then proves that it is NP-complete.

Given a CFD set Σ and a database instance D such that $D ⊭ Σ$ , intuitively, the dirty part $D^{'}$ is a subset of D such that the deletion of $D^{'}$ will make the data clean. We can formalize this idea as follows.

Definition 2 (culprit).

Given a database instance D and a set of CFD rules Σ, culprit $C (D)$ is a subset of D satisfying $D - C (D) ⊨ Σ$ .

Obviously, for fixed Σ and D, there may be many culprits. In this paper, to measure the data dirtiness, we only care about the minimum culprit. $C_{m i n} (D)$ is the minimum culprit if, for any culprit $C (D)$ , $| C_{m i n} (D) | \leq | C (D) |$ .

Definition 3 (data dirtiness evaluation problem).

Given a database instance D and a set of CFD rules Σ, we want to compute the dirtiness of a database instance D, which is $d i r t (D, Σ) = | C_{m i n} (D) | / | D |$ .

Property 1 (minimality).

Given any instance D and any CFD set Σ, $d i r t (D, Σ)$ is the portion of tuples that need to be edited at least in any exact repair algorithm.

Measurement $d i r t (D, Σ)$ is also monotonic and insensitive to a small change Δ (i.e., set of tuples) on instance D as the following proposition states.

Property 2 (monotonic and insensitive).

Given an instance D, a set of tuples Δ, and CFD set Σ, we have $0 \leq | C_{m i n} (D \cup Δ) | - | C_{m i n} (D) | \leq | Δ |$ .

This implies $0 \leq | d i r t (D \cup Δ, Σ) - d i r t (D, Σ) | \leq | Δ | / (| D | + | Δ |)$ .

Remark 4.

That is to say, the inconsistency of data measured by Definition 2 changes gently with small update. Usually, this trend of variation agrees with the reality; this is really because “(1) most of the data is often correct, especially for large data, and (2) a small update has a tiny impact on the dirtiness of the entire dataset.”

Similar to [30], we next have the following theorem on the complexity of minimum culprit problem with more restricted condition on the input. Here, the decision version of minimum culprit problem, k-culprit problem for short, is that, given a database instance D and a CFD set Σ, it is to decide whether there is a culprit C of D with respect to Σ and the size $| C | \leq k$ .

Theorem 5.

Given an instance D of relation R and CFD set Σ, k-culprit problem is NP-complete, even if (1) there are only 2 variable CFDs in Σ, (2) R is a 3-ary relation, and (3) for each tuple $t \in D$ there are at most 6 violations including t.

Proof.

NP. There is an NP algorithm as follows: $(1)$ guess a subset C of D with size of k; $(2)$ check whether $D - C ⊨ Σ$ , that is, to check whether each tuple pair in $D - C$ satisfies all CFDs of Σ; $(3)$ output “ $y e s$ ” when $D - C ⊨ Σ$ and “ $n o$ ” otherwise. For each instantiation, the checking step can be done in polynomial time. Thus, the problem is in NP.

NP-Hardness. The lower bound is established by a reduction from $3$ -SAT problem to k-culprit problem. An instance of $3$ -SAT problem includes a set U of n variables $x_{0}, \dots, x_{n - 1}$ and a collection S of m clauses $s_{0}, \dots, s_{m - 1}$ , while, in each clause $s_{j} = α_{j 1} + α_{j 2} + α_{j 3}$ , each $α_{j q} (1 \leq q \leq 3)$ is the qth literal of $s_{j}$ . Given an instance of $3$ -SAT problem, it is to decide whether there is a satisfying truth assignment for S. The $3$ -SAT problem is NP-complete, and it remains NP-complete even if, for each $x_{i} \in U$ , there are at most $5$ clauses in S that contain either $x_{i}$ or ${\bar{x}}_{i}$ .

A polynomial reduction from $3$ -SAT to k-culprit problem can be constructed as follows. $(1)$ Given an instance of $3$ -SAT, we introduce a 3-ary relation $R (A, B, C)$ and a variable CFD set Σ including $φ_{1} : (A \to B, \{t_{p} (_,_)\})$ and $φ_{2} : (C \to B, \{t_{p} (_,_)\})$ . $(2)$ Build an instance D over R. For each variable $x_{i}$ , insert two tuples $t_{2 i} (x_{i}, x_{i}, x_{i})$ and $t_{2 i + 1} (x_{i}, {\bar{x}}_{i}, x_{i})$ into D. For each literal $α_{j q}$ in clause $S_{j}$ , if it is a positive literal of variable $x_{i}$ , add tuple $t_{2 n + 3 j + k} (s_{j}, x_{i}, x_{i})$ to D; if it is a positive literal of variable $x_{i}$ , add tuple $t_{2 n + 3 j + k} (s_{j}, {\bar{x}}_{i}, x_{i})$ to D. (3) At last, let $k = n + 2 m$ . Note that $(1)$ the instance D can be constructed in $O (n + m)$ ; $(2)$ there are $3$ attributes in R and two variable CFDs in Σ; $(3)$ for each tuple t in T the number of violations caused by t is at most $6$ , if each variable exhibits at most $5$ clauses.

Suppose that the $3$ -SAT instance is satisfiable; that is, there is a satisfying truth assignment $ρ : U \to {\{0,1\}}^{n}$ for S; then, there is a culprit C of D such that its size is at most $n + 2 m$ . Concretely, it can be computed as follows, for each variable $x_{i}$ ; $(1)$ if $ρ (x_{i}) = 1$ , delete tuple $t_{2 i}$ from D. Then, for each clause $s_{j}$ , if $α_{j q}$ is a positive literal of $x_{i}$ and $t_{2 n + 3 j + 1}, t_{2 n + 3 j + 2}, t_{2 n + 3 j + 3} \in D$ , delete $t_{2 n + 3 j + q}$ from D; $(2)$ if $ρ (x_{i}) = 0$ , delete tuple $t_{2 i + 1}$ from D. Then, for each clause $s_{j}$ , if $α_{j q}$ is a negative literal of $x_{i}$ and $t_{2 n + 3 j + 1}$ , $t_{2 n + 3 j + 2}$ , and $t_{2 n + 3 j + 3} \in D$ , delete $t_{2 n + 3 j + q}$ from D. We have that, for each i, either $t_{2 i}$ or $t_{2 i + 1}$ is deleted from D, and, for each j, either of $\{t_{2 n + 3 j + 1}, t_{2 n + 3 j + 2}, t_{2 n + 3 j + 3}\}$ is deleted from D for each j. This is because, in each clause, there is at least one literal that is made $t r u e$ by assignment ρ . Therefore, there is a set C of the rest of the tuples such that $D - C ⊨ Σ$ and $| C | \leq n + 2 m = k$ .

To see the converse, let C be the culprit such that $| C | \leq k = n + 2 m$ . CFD $φ_{1}$ restricts the notion that either $t_{2 i}$ or $t_{2 i + 1}$ should be deleted from D, for each $0 \leq i \leq n - 1$ , and at least two tuples of $\{t_{2 n + 3 j + 1}, t_{2 n + 3 j + 1}, t_{2 n + 3 j + 1}\}$ should be deleted from D. That is, the size of C is at least $n + 2 m$ . Therefore $| C |$ is exactly $n + 2 m$ . Moreover, CFD $φ_{2}$ restricts the notion that there is only one literal of each variable in $D - C$ . Then, there is a satisfying truth assignment ρ for S such that, for each $0 \leq i \leq n - 1$ ,

\begin{matrix} ρ (x_{i}) = \{\begin{cases} 0 & i f t_{2 i} \in C, \\ 1 & i f t_{2 i + 1} \in C . \end{cases} \end{matrix}

(3)

4. Evaluation Algorithm

For any database instance D, let $D_{s}$ be the subset of D where each tuple violates at least one constant CFD in Σ; then, we have

\begin{matrix} C_{m i n} (D) = |D_{s}| + |C_{m i n} (D - D_{s})|; \end{matrix}

(4)

this is because any culprit C of D satisfying

C - D_{s}

must be larger than

C_{m i n} (D - D_{s})

. Therefore, data dirtiness can be computed as

\begin{matrix} d i r t (D, Σ) = \frac{|D_{s}| + |C_{m i n} (D - D_{s})|}{|D|}, \end{matrix}

(5)

where

D_{s}

can be detected by scanning the database once. Without loss of generality, we do not let Σ contain constant CFD from now on.

Definition 6 (conflict graph [31]).

Given an instance D and a CFD set Σ with r CFDs, the conflict graph $G (D, Σ)$ is an undirected graph $〈V, E〉$ , where V is the vertex set and E is the edge set. In conflict graph $G (D, Σ)$ , each vertex $v_{i} \in V$ refers to the tuple $t_{i} \in D$ and edge $v_{i} v_{j} \in E$ , if $\exists φ \in Σ$ , $(t_{i}, t_{j}) ⊭ φ$ .

Example 7.

One has instance D with tuples $t_{1} ~ t_{9}$ as in (6) and $Σ = \{φ_{1} : (A \to B, \{t_{p} (_,_)\}), φ_{2} : (C \to D, \{t_{p} (_,_)\})\}$ .

Instance D is shown as follows:

\begin{matrix} \begin{matrix} \begin{matrix} \begin{matrix} A & B & C & D \\ t_{1} & a & a & a & a \\ t_{2} & a & b & b & a \\ t_{3} & a & c & a & b \\ t_{4} & a & b & b & b \\ t_{5} & a & c & a & c \\ t_{6} & b & a & b & c \\ t_{7} & b & a & b & b \\ t_{8} & b & a & b & c \\ t_{9} & b & a & a & b \end{matrix} \end{matrix} \end{matrix} \end{matrix}

(6)

Conflict graphs $G (D, Σ)$ , $G (D, \{φ_{1}\})$ , and $G (D, \{φ_{2}\})$ are shown in Figures 1, 2, and 3. We see that $v_{1}$ is adjacent to $v_{2}$ because $t_{1}$ conflicts with $t_{2}$ with respect to $φ_{1}$ ; they have the same value on A but different values on B. Obviously, both $G (D, \{φ_{1}\})$ and $G (D, \{φ_{2}\})$ are the subgraphs of $G (D, Σ)$ .

Figure 1

Conflict graph $G (D, Σ)$ .

Figure 2

$G (D$ , $\{φ_{1}\})$ .

Figure 3

$G (D$ , $\{φ_{2}\})$ .

There is a naïve $2$ -approximate algorithm (Algorithm 1). The minimum culprit can be transformed as the minimum vertex cover on the conflict graph which can be built by the input database and CFDs, and it is easy to see that the size of minimum vertex cover of the conflict graph $G (D, Σ)$ equals the size of minimum culprit of D. Therefore, a naïve $2$ -approximation algorithm (Algorithm 1) will be obtained immediately.

Algorithm 1: Linear algorithm.

Input: Database instance D = $\{t_{1}, \dots, t_{n}\}$ , CFD set $Σ = \{φ_{1}, \dots, φ_{r}\}$ .

Output: $d i r t (D, Σ)$ which is the dirtiness of database D w.r.t. $Σ$ .

(1) $G (V, E) \leftarrow (⌀, ⌀)$ , $M \leftarrow ⌀$ ;

(2) for all tuple pair $(t_{i}, t_{j})$ in D do

(3) for all $φ_{k} \in Σ$ such that $1 \leq k \leq r$ do

(4) if $(t_{i}, t_{j}) ⊭ φ_{k}$ then

(5) add vertices $v_{i}$ and $v_{j}$ into V;

(6) add edge $v_{i} v_{j}$ into E;

(7) GOTO line (2);

(8) $M \leftarrow M a x i m a l M a t c h i n g (G (V, E))$ ;

(9) return $2 |M| / |D|$ ;

Algorithm 1 works like this: lines (1)–(6): build a conflict graph G for the given database instance D and the CFD set Σ; line (7): compute the minimum vertex cover approximately. One can call the classic approximation algorithm [32] to find maximal matching $M (G (D, Σ))$ greedily, where a matching in G is a set of pairwise nonadjacent edges and the matching is maximal if it is not a proper subset of any other matching in G. For any maximal matching M, the amount of vertexes in M (i.e., $2 | M |$ ) is at most twice as much as the size of minimum vertex cover, a $2$ -approximation of $| C_{m i n} (D) |$ .

5. Reduce Quadratic for Large Dynamic Data

We propose another $2$ -approximation dirtiness evaluation algorithm DDEva to overcome the shortcomings stated above. Then, an index based on B-tree is designed to enable $O (r n \log n)$ time and $O (r n)$ space implementation of DDEva over large data, where r is the number of CFDs given in a general form rather than a simple CFD. It prevents the potential quadratic storage of edges. At last, an $O (r \log n)$ update method based on the efficient update of maximal matching and conflict graph is proposed to deal with dynamic data. Generally, the number of general CFDs, r, is a small constant, so that the algorithm we proposed works efficiently as shown in the experiments. We still use simple CFD in order to simplify the description below, but note that our algorithm can process the general CFDs natively.

5.1. Some Notations and Observations

For clarity, we first declare the following notations formally. (A) Given a CFD set Σ, it has r variable CFDs, and the jth CFD is $φ_{j}$ . The conflict graph $G (D, \{φ_{j}\})$ is denoted by $G_{j}$ for short. Recall Definition 6; obviously, $G_{j}$ is a subgraph of $G (D, Σ)$ . (B) For any matching M, let $V_{M}$ be the set of all vertices in M. The size of M is denoted by $| M |$ which is the number of edges in it; obviously, $| V_{M} | = 2 \times | M |$ . For any graph G, let $M (G)$ be the maximal matching of G; (C) given a graph $G (V, E)$ and a matching M, let $G - M$ be a graph $G^{'} (V^{'}, E^{'})$ , where $V^{'} = V ∖ V_{M}$ and $E^{'}$ is obtained by removing all edges covered by $V_{M}$ from E; (D) let $K_{ω_{1}, \dots, ω_{l}}$ represent a complete multipartite graph with l vertex equiv-classes, such that any pair of vertices in the same equiv-class ω are nonadjacent while any pair of vertices in different equiv-classes are adjacent. For each equiv-class $ω_{i}$ , let $| ω_{i} |$ be the number of vertices in it.

Given a CFD set Σ of r variable CFDs, let $G_{j}$ be the conflict graph $G (D, \{φ_{j}\})$ . Interestingly, we have the following useful observation of conflict graph with respect to one CFD.

Observation 1.

Each conflict graph $G_{j}$ is a forest of multipartite graph; that is, it is composed of several nonoverlapping connected components, and each component is a complete multipartite graph.

It is easy to find maximal matching for each complete multipartite connected component in each $G_{j}$ . However, the sum of each $G_{j}$ 's maximal matching sizes is not a $2$ -approximation of minimum vertex cover due to the overlaps among those matching scenarios. In order to remove these overlaps, we next define a series of independent residual subgraphs $〈Δ_{1}, \dots, Δ_{r}〉$ for $G (D, Σ)$ , in which each $Δ_{j}$ is a counterpart of the conflict graph $G_{j}$ .

Definition 8 (independent residual subgraph).

Given a database instance D and CFD set $Σ = \{φ_{j} ∣ 1 \leq j \leq r\}$ , the independent residual subgraph is a subgraph of $G (D, Σ)$ , ir-subgraph for short, such that

\begin{matrix} Δ_{j} = \{\begin{cases} G_{1} & if j = 1, \\ G_{j} - M_{1, j - 1} & if 1 < j \leq r, \end{cases} \end{matrix}

(7)

where

M_{1, j - 1} = ⋃_{i = 1}^{j - 1} ‍ M (Δ_{i})

Following Example 7, Figure 4 shows $Δ_{1} = G_{1}$ with the dash line, and Figure 5 shows $Δ_{2} = G_{2} - M (Δ_{1})$ ; $Δ_{2}$ is obtained by removing all the vertexes $v_{1}$ , $v_{2}$ , $v_{3}$ , and $v_{4}$ and their adjacent edges from G since such four vertexes are all in the maximal matching of $Δ_{1}$ represented by dash edges.

Figure 4

$Δ_{1}$ and $M (Δ_{1})$ .

Figure 5

$Δ_{2}$ and $M (Δ_{2})$ .

Observation 2.

Each ir-subgraph $Δ_{j}$ is also a forest of complete multipartite graph.

This observation is correct because $Δ_{j}$ is still a complete multipartite graph when any vertex v and its adjacent edges are removed from $G_{j}$ . For example, in Figure 5, $Δ_{2}$ is still a forest of complete multipartite graph.

Interestingly, we find that the union of maximal matchings $M (Δ_{1})$ (dash edges in Figure 4) and $M (Δ_{2})$ (dash edges in Figure 5) is exactly a maximal matching of $G (D, Σ)$ (dash edges in Figure 6). This inspires a proposition for computing the maximal matching of conflict graph $G (D, Σ)$ as follows.

Figure 6

Maximal matching $M (G)$ .

Proposition 9.

$M_{1, r}$ is a maximal matching of $G (D, Σ)$ , and $|M_{1, r}| = \sum_{j = 1}^{r} |M (Δ_{j})|$ .

5.2. Algorithm for Dirtiness Evaluation

In contrast to the naïve algorithm, we propose Algorithm 2 to compute the data dirtiness in an $O (r n \log n)$ time rather than the quadratic cost while r is a small constant generally. It works as follows: r ir-subgraphs are built first instead of the conflict graph $G (D, Σ)$ ; compute maximal matching of each ir-subgraph independently to get the value $| V_{M_{1, r}} |$ which is a $2$ -approximation of the size of minimum vertex cover. Proposition 9 guarantees the correctness of this algorithm.

Algorithm 2: $D D E v a (D, Σ)$ .

Input: Database instance D, CFD set $Σ = \{φ_{j} ∣ 1 \leq j \leq r\}$ .

Output: $d i r t (D, Σ)$ which is the dirtiness of database D with respect to $Σ$ .

(1) $M \leftarrow ⌀$ ;

(2) for all j such that $1 \leq j \leq r$ do

(3) build $G_{j}$ for D with respect to $φ_{j}$ ;

(4) $Δ_{j} \leftarrow G_{j}$ ;

(5) for all $v_{i} \in Δ_{j}$ do

(6) if $v_{i} \in M$ then

(7) remove $v_{i}$ and its adjacent edges from $Δ_{j}$ ;

(8) $M (Δ_{j}) \leftarrow G r e e d y M a x i m a l M a t c h i n g (Δ_{j})$ ;

(9) $M \leftarrow M \cup M (Δ_{j})$ ;

(10) return $| V_{M} | / | D |$ ;

Briefly, there are two key points to reduce the quadratic cost, respectively; each ir-subgraph $Δ_{j}$ can be built within $O (n \log n)$ , and the maximal matching of each ir-subgraph can be specified quickly without scanning any edge of it.

5.2.1. Building ir-Subgraphs

Let the jth CFD be $φ_{j} : (X \to Y, T_{p})$ ; to get $Δ_{j}$ , we will build the conflict graph $G_{j}$ first and then remove all vertices of $V_{M_{1, j - 1}}$ and their adjacent edges. Recall Observation 1; each connected component of $G_{j}$ is a complete multipartite graph which can be built by partitioning without pairwise comparison. Concretely, we can partition all tuples of D according to the attribute values on X and Y; each connected component refers to the tuples with the same attribute value on X, and each equiv-class in the component refers to the tuples with the same attribute value on Y. Because we do not need to store edges in complete multipartite graph, only vertices in $V_{M_{1, j - 1}}$ need to be removed from $G_{j}$ . As the size of $V_{M_{1, j - 1}}$ is always no more than n, it will take at most $O (\log n)$ time cost to check whether a vertex $v \in V_{M_{1, j - 1}}$ based on a lookup data structure. Therefore, an ir-subgraph can be built within an $O (r n \log n)$ time.

5.2.2. Finding Maximal Matching Greedily

Due to Observation 2, each connected component of ir-subgraph is also a complete multipartite graph. We can quickly specify an arbitrary maximal matching of each connected component without scanning any edge so that the time cost for computing the maximal matching only depends on the number of vertices rather than edges.

Algorithm 3 is proposed to find a maximal matching quickly. Concretely, the maximal matching of each ir-subgraph can be obtained by the union of the maximal matchings of each component. For each component $P_{i}$ containing l equiv-classes, namely, $ω_{1}, \dots, ω_{l}$ , we group the l equiv-classes into two groups $〈L, R〉$ . During the scan on $P_{i}$ , each equiv-class is added to the group with smaller cardinality greedily where cardinality of a group S is $|S| = \sum_{i = 1}^{s} |ω_{i}|$ , if $S = \{ω_{1}, \dots, ω_{s}\}$ . Then, for a grouping $〈L, R〉$ , we build the maximal matching $M 〈L, R〉$ like this; for each vertex in group L, match it with a vertex of R orderly. The time cost of Algorithm 3 does not depend on the number of edges but only depends on the number of vertexes.

Algorithm 3: $M a x i m a l M a t c h i n g (Δ)$ .

Input: ir-subgraph Δ .

Output: a maximal matching M (Δ).

(1) $M \leftarrow ⌀$ ;

(2) for each component $C_{i} \in Δ$ do

(3) $L, R \leftarrow ⌀$ ;

(4) for all i such that $1 \leq i \leq l$ do

(5) if $| L | \leq | R |$ then

(6) put $ω_{i}$ into L;

(7) else

(8) put $ω_{i}$ into R;

(9) $M \leftarrow M \cup M 〈L, R〉$ ;

(10) return M;

Example 10.

Following Example 7, the grouping of $Δ_{1}$ and $Δ_{2}$ is shown in Figures 7(a) and 7(b); maximal matchings are represented by dashes. Recall Example 7; there are two components $P_{1}$ and $P_{2}$ in the ir-subgraph $Δ_{1}$ . In $P_{1}$ , equiv-class $ω_{1}$ only contains $v_{1}$ , equiv-class $ω_{2}$ includes $v_{2}$ and $v_{4}$ , and equiv-class $ω_{3}$ includes $v_{3}$ and $v_{5}$ . Algorithm 3 first adds $ω_{1}$ to group L. Since $| R | = 0$ , then $ω_{2}$ is added to group R by Algorithm 3. At last, $ω_{3}$ is added to group L due to $| R | > | L |$ . The maximal matching of $P_{1}$ is obtained by matching the vertices between L and R one by one. However, in $P_{2}$ , there is only one equiv-class so that group R is empty; thus, the maximal matching of $P_{2}$ is an empty set. Therefore, we get a maximal matching $\{(v_{1}, v_{2}), (v_{3}, v_{4})\}$ for $Δ_{1}$ . In a similar way, we find a maximal matching $\{(v_{6}, v_{7}), (v_{5}, v_{9})\}$ for $Δ_{2}$ . Finally, the union of both matchings is exactly a maximal matching of $G (D, Σ)$ as shown in Figure 6 which is represented by dashes. Obviously, maximal matching finding can be done in an $O (n)$ time for each ir-subgraph. Additionally, we have an observation that, in each component $P_{i}$ , all of the unmatched vertexes belong to one equiv-class $ω_{j}$ ( $1 \leq j \leq l$ ) which is called tail class $τ (P_{i})$ , such as in $Δ_{1}$ ; $τ (P_{1})$ is $ω_{3}$ and $τ (P_{2})$ is $ω_{1}$ as shown by a dashed rectangle in Figure 7. Obviously, there is at most one tail class in a component.

Figure 7

Computing maximal matching independently.

5.3. Update

According to Definition 8, all the ir-subgraphs are updated based on update of maximal matching. We next show how to maintain the maximal matching, following an efficient ir-subgraph update method.

5.3.1. Update Maximal Matching

Given an ir-subgraph Δ, when a vertex update $(v, o p)$ arises, subroutine $U p d a t e G r o u p i n g (Δ, v, o p)$ will update the grouping of component P that v is involved in. Concretely, suppose that v belongs to some equiv-class ω of P; then, $U p d a t e G r o u p i n g$ will update grouping $〈L, R〉$ upon the following two cases (E1)~(E2): (E1)

Vertex Deletion ( $o p = d e l e t e$ ). Without loss of generality, let $|L| > |R|$ . It is obvious that tail class $τ (P) \in L$ currently; then, grouping $〈L, R〉$ needs to be updated, iff (a) $ω \in R$ and (b) $|L - \{τ (P)\}| = |R|$ . If (a) and (b) are satisfied, $U p d a t e G r o u p i n g$ will delete vertex v from ω and switch $τ (P)$ into R from L; otherwise, it only delete v from ω . And if $|L| < |R|$ , the opposite will occur.

(E2)

Vertex Insertion ( $o p = i n s e r t$ ). Without loss of generality, let $|L| > |R|$ . Grouping $〈L, R〉$ needs to be updated, iff (a) $|L - \{τ (P)\}| = |R|$ and (b) $ω \in L$ ( $ω \neq τ (P)$ ). If (a) and (b) are satisfied, $U p d a t e G r o u p i n g$ will insert v into ω and switch ω into R from L; otherwise, it only inserts v from ω . And if $|L| < |R|$ , the opposite will occur.

After updating $〈L, R〉$ , a new maximal matching $M^{'}$ can be obtained in greedy order.

Observation 3.

Let M be the maximal matching of component P before grouping update, while $M^{'}$ is the new maximal matching obtained in greedy order after grouping update. We have that if $M^{'} \neq M$ , there must be one and only one vertex $v^{'}$ ( $v^{'} \neq v$ ) such that (a) $v^{'} \in M^{'} (Δ)$ but $v^{'} \notin M (Δ)$ or (b) $v^{'} \in M (Δ)$ but $v^{'} \notin M^{'} (Δ)$ .

5.3.2. Update ir-Subgraph

Algorithm 4 shows how to update the ir-subgraphs. When a tuple update arises, all ir-subgraphs are updated one by one. Specifically, starting from $Δ_{1}$ , $U p d a t e S u b g r a p h$ updates the ir-subgraph according to the parameter “ $o p$ ” by calling $U p d a t e G r o u p i n g$ . For vertex deletion, parameter “ $o p$ ” is set as “ $d e l e t e$ ”; for vertex insertion, it is set as “ $i n s e r t$ ”; Algorithm 4 ends until r ir-subgraphs have been processed.

Algorithm 4: $U p d a t e S u b g r a p h$ ( $t, o p$ ).

Input: tuple need to be processed t, operator parameter op.

Output: $i r - s u b g r a p h s$ updated.

(1) for $i \leftarrow 1$ to r do

(2) $M^{'} (Δ_{i})$ , $v^{'} \leftarrow U p d a t e G r o u p i n g (Δ_{i}, v, o p)$ ;

(3) if $v^{'} \in M^{'} (Δ_{i})$ and $v^{'} \notin M (Δ_{i})$ then

(4) $v \leftarrow v^{'}$ , $o p \leftarrow d e l e t e$ ;

(5) if $v^{'} \in M (Δ_{i})$ and $v^{'} \notin M^{'} (Δ_{i})$ then

(6) $v \leftarrow v^{'}$ , $o p \leftarrow i n s e r t$ ;

This algorithm is correct. Actually, for each $Δ_{j}$ , if update v does not result in a change on maximal matching $M (Δ_{j})$ , then there must be v that does not belong to $M (Δ_{j})$ , and it still should be inserted into or deleted from the following ir-subgraphs. If update v results in a change on maximal matching $M (Δ_{j})$ , according to Observation 3, there must be one and only one vertex $v^{'}$ such that “it was matched but becomes unmatched” or “it was unmatched but becomes matched.” Then, according to the definition of ir-subgraph, $v^{'}$ should be either inserted into or deleted from the following ir-subgraphs. Therefore, there is also an important observation that each ir-subgraph needs to be processed only once by $U p d a t e S u b g r a p h$ . In the next section, we organized each ir-subgraph as a compact structure, in which the costs of vertex operations including insertion, deletion, and lookup are $O (\log n)$ . That is, in the processing procedure of each ir-subgraph, $U p d a t e G r o u p i n g$ (line (2)) and vertex checking (lines (3) and (5)) can be done in $O (\log n)$ . Therefore, at most $O (r \log n)$ time cost will be taken to update the r ir-subgraphs.

Example 11.

Following Example 10, given two tuple updates,

\begin{matrix} (U 1) : i n s e r t t_{10} : 〈a, a, a, a〉 into D; \\ (U 2) : i n s e r t t_{11} : 〈a, a, a, b〉 into D, \end{matrix}

(8)

the grouping updated after each insertion was shown in Figures 8(a) and 8(b).

Figure 8

Example for update processing.

Respectively, we show the procedure of processing $(U 1)$ and $(U 2)$ as follows: (U1)

In $Δ_{1}$ , according to the attribute values on X and Y, vertex $v_{10}$ belongs to $ω_{1}$ of $P_{1}$ . In $P_{1}$ , $ω_{1}$ belongs to L, and vertex $v_{10}$ does not result in a change on the grouping and $ω_{3}$ is still the tail class since $|L| > |R|$ ; however, vertex $v_{3}$ is not matched any more; intuitively, it is squeezed out from the matching; it should be inserted into $Δ_{2}$ . After update on $Δ_{2}$ , class $ω_{2}$ has become the tail class of component $P_{2}$ in $Δ_{2}$ .

(U2)

After insertion of vertex $v_{11}$ , $U p d a t e S u b g r a p h$ switches the tail class $ω_{3}$ into R from L since the size of $|L - \{ω_{3}\}| < R$ ; however, because $v_{3}$ has become matched after the switch, then it is deleted from $Δ_{2}$ and $v_{9}$ is matched with $v_{5}$ in $Δ_{2}$ .

5.4. Implementation

In this subsection, a compact structure is given to support the following efficient implementation scenarios: $(1)$ answer the membership query of whether a vertex belongs to the maximal matching in $O (\log n)$ time and $(2)$ update each ir-subgraph and its maximal matching in $O (\log n)$ time once a tuple update arises.

5.4.1. Compact Structure for ir-Subgraph

As shown in Figure 9, we store each ir-subgraph $Δ_{i}$ ( $1 \leq i \leq r$ ) as the index $I_{i}$ of database D (with B-tree implementation in this paper). Concretely, given D and variable rule $φ_{i} : 〈X \to Y, T_{p}〉$ , $I_{i}$ only indexes those tuples satisfying $\exists t_{p} \in T_{p}$ , $t [X] ≍ t_{p} [X]$ ; the index key is $(X, Y, i d)$ of each tuple. In B-tree implementation, (a) each entry in an index node refers to a vertex of $Δ_{i}$ ; (b) all the vertices of each equiv-class and all the equiv-classes of each component are, respectively, organized as a double linked list for constant time update once the maximal matching changes. Additionally, two kinds of header entries are settled in the index.

Figure 9

Compact Structure for storing ir-subgraph Δ₁.

K-Header. For each component P, a K-header is settled for keeping relative information about the corresponding component as follows:

$y (P) :$ the attribute value on Y corresponding to the tail class $τ (P)$ . Actually, each equiv-class in a component will be identified uniquely by the attribute value on Y.

$e (P) :$ the id of the last matched vertex in $τ (P)$ . It is exactly less than the ids of all unmatched vertices because all vertices are sorted by id inside each equiv-class. For example, in $Δ_{1}$ shown in Figure 7, $e (P_{1}) =$ “3”, while we let “ $e (P_{2}) = - \infty$ ” as there is no vertex matched in component $P_{2}$ .

$e (L) :$ it points to the tail of the double linked list of L.

$e (R) :$ it points to the tail of the double linked list of R.

W-Header. For each equiv-class ω, a W-header is settled for keeping relative information about it as follows:

$e (ω) :$ it points to the tail of the double linked list of ω .

$g (ω) :$ it indicates which group ω belongs to.

5.4.2. Supporting Membership Query

Given a vertex v and ir-subgraph $Δ_{i}$ , the membership query of whether it is in the maximal matching $M (Δ_{i})$ will be answered in $O (\log n)$ . Let v refer to the tuple t and find K-header of component P by key value $(t [X], - \infty, - \infty)$ ; then, we have $v \in M (Δ_{i})$ if (a) $t [Y] = y (P)$ , which means $v \in τ (P)$ , and (b) $t [i d] \leq e (P)$ .

5.4.3. Supporting Update on ir-Subgraphs

In the B-tree implementation, it will take only $O (\log n)$ time on average to insert or delete a vertex in an ir-subgraph. Once a tuple update results in a change on a maximal matching, $U p d a t e G r o u p i n g$ only updates the corresponding K-header and W-header in a constant time after finding both headers in $O (\log n)$ time.

Example 12.

Following Example 10, Figure 9 shows the storage implementation of ir-subgraph $Δ_{1}$ for D with respect to $φ_{1} : A \to B$ . For component $P_{1}$ , its K-header can be found by key “ $(A : a, B : - \infty, i d : - \infty)$ .” In K-header of $P_{1}$ , the size of maximal matching of $P_{1}$ is recorded as $| M | = 2$ with respect to the grouping in Figure 7 where $| L | = 4$ and $| R | = 2$ . The K-header has also recorded $y (P_{1}) =$ “c” for tail class $ω_{3}$ which can be identified uniquely in $P_{1}$ by value “c” (the attribute value on “B”), while $e (P)$ is set as $10$ which refers to end vertex $v_{10}$ . The pointer $e (L)$ (resp., $e (R)$ ) in K-header points to the W-header of the last class $ω_{3}$ (resp., $ω_{2}$ ) of group L (resp., R). In the index, equiv-classes $ω_{1}$ and $ω_{3}$ (resp., $ω_{2}$ ) in group L (resp., R) are organized as a double linked list by pointers placed in the corresponding W-header. All the vertices inside each class are also organized as a double linked list by pointers placed in the corresponding entry.

Considering the insertion of $t_{11}$ , $U p d a t e C R G$ will update $Δ_{1}$ in the first iteration by implementation as the following three steps.

Step 1 (update the double linked list of vertices in each equiv-class).

Insert $v_{11}$ as the new tail of the double linked list of $ω_{1}$ ; that is, find the last vertex before insertion of $v_{11}$ and reset its pointer and then change the last vertex in W-header of $ω_{1}$ as “ $11$ ” which is the id of vertex $v_{11}$ . Here, the key of the last vertex of $ω_{1}$ can be fetched from W-header of $ω_{1}$ .

Step 2 (update the double linked list of groups L and R).

Because $| L | - | R | > | ω_{3} |$ after the insertion of $v_{11}$ , then $ω_{3}$ need to be switched into R from L. Concretely, $(1)$ delete W-header of $ω_{3}$ from the double linked list of group L (its key is obtained by the value of attributes A and B of $t_{11}$ , i.e., $〈A : a, B : c, i d : - \infty〉$ ). This is implemented by setting the pointer $e (ω_{1})$ as “ $n u l l$ ” and changing the pointer $e (L)$ to point to $ω_{1}$ ; $(2)$ insert a W-header for $ω_{3}$ into the double linked list of group R. This is implemented by setting the pointer $e (ω_{2})$ to be pointing to $ω_{3}$ and changing the pointer $e (R)$ to point to $ω_{1}$ , meanwhile changing $g (ω_{3})$ as “R.”

Step 3 (update the relative information recorded in K-header).

Since $| R | > | L |$ , then $ω_{3}$ remains the tail class of this component, but $e (P_{1})$ is updated as $3$ because new point $v_{11}$ matches $v_{3}$ ; at last, a membership query of whether $v_{3}$ is matched is necessary to decide how to update the following ir-subgraph $Δ_{2}$ in the next iteration of $U p d a t e I n c G$ . $τ (P_{1})$ and $e (P_{1})$ can be fetched from the K-header of component $P_{1}$ ; then, it is checked that $v_{3} \in ω_{3}$ and $v_{3} \cdot i d \leq e (P_{1}) = 5$ .

It is easy to see that, in each iteration of $U p d a t e C R G$ , such steps can be achieved by querying B-tree index in $O (\log n)$ time and updating the linked list in $O (1)$ time. After at most r iterations, the update finishes in $O (r \log n)$ time.

6. Optimizations and Extensions

6.1. Key Value Compression

The sort key of each tuple consists of the values on X and Y and tuple id with respect to a CFD ( $X \to Y, T_{p}$ ). Reducing the size of the key implies improving the efficiency of finding vertex in the index. However, we can build two prefix-trees, one for X and the other for Y, respectively. Then, we assign each leaf a unique id. Then, each string with arbitrary size will be transformed into an integer; that is, each key value is compressed as a triple of integers. As the size of string of each attribute is not more than a fixed constant, then each prefix-tree has a fixed height; thus, this transformation can be done in a constant time.

6.2. The Number of Indexes

In practice, CFD is always given in a general form and it can be transformed into lots of simple rules. That is, r may be very large and many indexes need to be built; thus, there will be lots of copies of isolated vertexes stored. However, actually, our algorithm can process each CFD with general form natively; this is really because the conflict graph with respect to a general CFD is also a forest of complete multipartite graph. Therefore, only one index needs to be built for one general CFD. In practice, the number of indexes to be built equals the number of the general CFDs.

6.3. Minimum Space Cost

Due to the definition of the ir-subgraph, each ir-subgraph has to store many copies of vertexes which are unmatched in the previous ir-subgraphs. Reducing the size of each index implies the improvement of the efficiency of the index. The size of all the indexes depends on the processing order of r indexes. To reduce the space cost caused by the redundancy, we should choose the best processing order of the CFDs. However, the best order that minimizes the overall space cost cannot be precomputed and it also will change with the update of data. Therefore, we chose the processing order of CFDs as the decreasing order of factor $s u p p (ϕ) / c o n f (ϕ)$ , in which $s u p p (ϕ)$ and $c o n f (ϕ)$ are the support and confidence of a given CFD ϕ, respectively, and such two values can be obtained by sampling method [28]. Intuitively, in each index, the bigger the ratio $s u p p (ϕ) / c o n f (ϕ)$ is, the more the tuples will be matched as early as possible, so that $(1)$ there may be less tuple copies storing in the indexes of the following ir-subgraphs and $(2)$ the number of queries will be also reduced possibly when building and updating the indexes of all ir-subgraphs.

7. Experiments

We next present an experimental study of data dirtiness evaluation algorithms, measuring elapsed time and the quality of the evaluation result. Using both synthetic data TPC-H and real-life data including DBLP and IMDB, we focus on their scalability by varying the following three parameters: $(1)$ $| D |$ : the size of the original database; $(2)$ $|Δ D|$ : the size of updates; $(3)$ $|Σ|$ : the number of CFDs.

7.1. Experimental Settings

We used synthetic and real-life data.

7.1.1. Datasets

(a) TPC-H [33]: we built a wider table by joining all the $8$ tables. The data ranges from 2 million tuples (i.e., $1$ M) to $10$ million tuples (i.e., $10$ M). Note that the size of $10$ M tuples is almost as large as $10$ GB. (b) IMDB [34]: we extracted a $1.12$ GB relation from its XML data. The data scales from $1$ M tuples to $4$ M tuples where the size of $4$ M tuples is almost as large as $1.12$ GB. (c) DBLP [35]: we extracted a $1.4$ GB relation from its XML data. The data scales from $500$ K tuples to $3.6$ M tuples where the size of $3.6$ M tuples is almost as large as $1.4$ GB.

7.1.2. CFDs

We designed CFDs manually, varied by modifying patterns. (a) TPC-H: the number $| Σ |$ of variable CFDs ranges from $20$ to $100$ including $5 %$ FDs with $40$ by default. (b) IMDB: $| Σ |$ scales from $5$ to $20$ variable CFDs including $3$ FDs, with $10$ by default. (c) DBLP: $| Σ |$ scales from $5$ to $20$ including $3$ FDs, with $10$ by default.

7.1.3. Updates

Updates contain $90$ % insertions and $10$ % deletions. The size of updates is up to $10$ GB (about $10$ M tuples) for TPC-H, up to $3$ M tuples for DBLP, and up to $3$ M tuples for IMDB.

7.2. Implementation

We denote by DDEva the straightforward implementation of our evaluation algorithms, while adjusted-DDEva refers to the order adjusting method based on sampling. We compare our algorithm with the naïve algorithm. In the implementation of the naïve algorithm, we use the adjacency list to store the conflict graph $G (D, Σ)$ and build an index for all vertexes based on their ids so that each vertex can be found efficiently. In order to lower the cost of finding all the violations as much as possible, for each CFD $(X \to Y, T_{p})$ , we partition the database into different blocks according to the value of X and check all the tuple pairs for a violation in each block, rather than checking all possible tuple pairs naïvely.

All codes were written in C/C++ and compiled by Visual Studio 2005 and QT4 library. We run our algorithms on Windows 7 platform on Dell PC OptiPlex 790 with 3.10 GHz Intel Core i5 CPU, 4 GB memory, and hard disk of 5400 rpm. In the following, the algorithms are run five times under each setting and the average time is taken. In each run, we use large amount of random data to wipe I/O cache.

7.3. Experimental Results for Evaluation Algorithm

Exp-1: Impact of $| D |$ . In the first set of experiments, we show the impact of the size of the database D on the performance of evaluation algorithm of inconsistencies in static data. Fixing $| Σ | = 40$ (including 5% FDs), the size of D (i.e., $| D |$ ) is varied from $2$ M to $10$ M tuples ( $10$ GB) for TPC-H. And for IMDB and DBLP, $| D |$ is varied from $500$ K to $3$ M while fixing $| Σ | = 10$ (including $2$ FDs) for both datasets.

The elapsed time in seconds is shown in Figure 10 when varying $| D |$ . From the results, it is first shown that the naïve algorithm takes too much time to perform computation; we manually terminate the program when the elapsed time exceeds the top boundary. It is also shown that adjusted-DDEva outperforms DDEva for both real-life datasets while sometimes it does not for TPC-H. This is really because both real-life datasets are much less dirty with respect to many CFDs such that most tuples are matched earlier avoiding redundancy in the following indexes and the factor $s u p p (ϕ) / c o n f (ϕ)$ just captures this. Figure 10 also shows that both DDEva and adjusted-DDEva scale well with the size $| D |$ .

Figure 10

Elapsed time of DDEva and adjusted-DDEva.

Our algorithm works well on both synthetic data TPC-H and real-life data DBLP and IMDB demonstrating that $D D E v a$ algorithm is able to deal with large dataset efficiently.

Exp-2: Impact of $|Δ D|$ . In the second set of experiments, we show how the size of changes $| Δ D |$ to the database affects the performance of inconsistency evaluation algorithm. Fixing $| Σ | = 50$ and $| D | = 2$ M, the size of $| Δ D |$ is varied from $2$ M to $10$ M tuples for TPC-H. $| Δ D |$ is varied from $500$ K to $3000$ K tuples for DBLP and IMDB while fixing $| D | = 500$ K and $| Σ | = 16$ . The elapsed times in seconds when varying $| Δ D |$ for TPC-H (resp., DBLP and IMDB) are shown in Figure 11(a) (resp., Figures 11(b) and 11(c)).

Figure 11

Elapsed time of $D D E v a$ and adjusted-DDEva.

As shown in Figures 11(a), 11(b), and 11(c), the elapsed times of adjusted-DDEva scale well up with $| Δ D |$ , for example, $55$ seconds when $| D |$ is updated from $2$ M to $4$ M and $110$ seconds when $| Δ D |$ is updated from $8$ M to $10$ M as shown in Figure 11(a).

Also, adjusted-DDEva updates the result much more efficiently than DDEva in both real-life datasets and it has a slower growth in contrast to DDEva. That is, in experiment 2, the results have shown that adjusted-DDEva updates the result much more efficiently than DDEva in both real-life datasets.

Exp-3: Impact of $| Σ |$ . In this set of experiments, we study the impact of the number of variable CFDs on data dirtiness evaluation. Fixing $| D | = 2$ M and $| Δ D | = 10$ M for TPC-H, we varied the number of CFDs $| Σ |$ from $20$ to $100$ including $5 %$ FDs. Moreover, fixing $| D | = 500$ K and $| Δ D | = 2000$ K for DBLP and IMDB, we varied $| Σ |$ from $8$ to $20$ including $3$ FDs. The elapsed times when varying $| Σ |$ from $20$ to $100$ for TPC-H (resp., from $8$ to $20$ for DBLP and IMDB) are shown in Figure 12(a) (resp., Figures 12(b) and 12(c)). Both $D D E v a$ and $a d j u s t e d$ - $D D E v a$ are able to evaluate the data dirtiness with good scalability when varying $| Σ |$ .

Figure 12

Elapsed time of $D D E v a$ and adjusted-DDEva.

As the number of indexes will increase with $| Σ |$ , the elapsed time of DDEva will increase with $| Σ |$ . However, the elapsed time of adjusted-DDEva performance is better than that of DDEva since the size of indexes with higher rank is very small after adjusting the processing order of CFDs in Σ . The results demonstrate that adjusted-DDEva has good scalability with $| Σ |$ , and it works well on a larger number of CFDs.

Note that, in Figures 12(b) and 12(c), we can see that the increase of size $| Σ |$ does not lead to a fast growth of the adjusted-DDEva; that is really because the number of FDs included in the CFD set for DBLP and IMDB is fixed and captures most conflicts so that a large amount of random I/Os in the following indexes is prevented in practice.

Exp-4: Space Cost. In this set of experiments, we study the sizes of indexes that our algorithm needs to build. Fixing the number of general CFDs $| Σ | = 5$ , each with $200$ pattern tuples generated randomly, and setting $| D |$ as $10$ M for TPC-H (i.e., $10$ GB) and $4$ M for DBLP and IMDB (about $1$ GB), we record the size of each index for the three datasets. The results are shown in Figure 13.

Figure 13

Space cost of $D D E v a$ and adjusted-DDEva.

First, for each dataset, the index size decreases with the processing order; this is consistent with the definition of ir-subgraph. Second, the total space our algorithm takes does not depend on the width of the dataset due to key compression. Actually, in practice, a pair of 32-bit or 64-bit integers is enough to partition the dataset according to the values on X and Y without errors. Third, as shown in the results of this experiment, $a d j u s t e d$ - $D D E v a$ takes less total space cost than its counterpart, and the first few indexes almost cover most of the matched tuples.

Exp-5: Quality of Evaluation Result. In the last set of experiments, we study the evaluation result quality of our evaluation method based on minimum culprit with respect to CFD set Σ (MC), in contrast with naïve methods based on conflicts counting (CC). Here, we introduce a variable “ $σ_{i, i - 1}$ ” representing the difference of data inconsistency between the ith update and the $(i - 1)$ th update; concretely, $(1)$ “ $σ_{i, i - 1}^{M C}$ ” is the difference of minimum culprit size estimated with respect to CFD set Σ; $(2)$ “ $σ_{i, i - 1}^{C C}$ ” is the difference of conflicts detected, respectively. To measure the result quality of two evaluation methods under assumption in $(1)$ , we compute the standard deviation of “ $σ_{i, i - 1}$ ” on $5$ samples with $100$ tuples which lead to inconsistency.

Figure 14 shows the standard deviation of variable “ $σ_{i, i - 1}$ ” computed for TPC-H ( $D = 2$ M, $| Σ | = 100$ , and $| Δ D |$ is varied from $1$ M to $5$ M including only insert operation) and DBLP and IMDB ( $D = 500$ K, $| Σ | = 20$ , and $| Δ D |$ is varied from $500$ K to $2000$ K including only insert operation).

Figure 14

Elapsed time of $D D E v a$ and adjusted-DDEva.

The figure tells us that, for each dataset, $σ_{i, i - 1}^{M C}$ is insensitive to a single update operation, but $σ_{i, i - 1}^{C C}$ is very sensitive to each single update operation which will be inconsistent with existing tuples. That is to say, the evaluation method MC studied in this paper will give a very smooth monitoring curve rather than the naïve method CC.

7.4. Summary

We find the following conclusions from the results of the experiments conducted on both synthetic data TPC-H and real-life data DBLP and IMDB. $(1)$ Our evaluation algorithms scale well with respect to the size of database $| D |$ , the size of changes to database $| Δ D |$ , and the number of CFDs $| Σ |$ for large data (Exp-1 to Exp-3). $(2)$ The modified version of algorithm adjusted-DDEva outperforms its counterpart DDEva much more especially for both real-life datasets and larger $| Σ |$ because a few CFDs in Σ have small confidence while the others do not. $(3)$ The evaluation method proposed based on minimum culprit with respect to CFD set Σ substantially revealed how dirty the data is under the typical assumption: “ $(1)$ most of the data is often correct, especially for large data, and $(2)$ an update of a tuple with error has a very tiny impact on the dirtiness of the entire dataset” (Exp-5).

8. Related Works

Conditional functional dependency (CFD) was first proposed by the authors of [16] while the SQL techniques they provided have been applied in data cleaning broadly, which can be used to detect the inconsistencies of databases. However, there is no existing work focusing on computing the inconsistency of a database based on CFDs efficiently. The most relevant works to this paper can be categorized into inconsistency detection and resolution.

For inconsistency detection, there exist some detection techniques which are able to detect errors efficiently; SQL techniques for detecting CFD violations were given by [18]; practical algorithms for detecting violations of CFDs in fragmented and distributed relations were provided by [19] and an incremental detection algorithm was proposed by [20]. In contrast to inconsistency detection, inconsistency evaluation needs to compute the quantized dirtiness value of the data, rather than finding all violations.

For data repair, there are two kinds of works which are based on FDs/CFDs; they both aim to directly resolve the inconsistency of database. One kind of method is to repair data based on minimizing the repair cost, for example, [22, 24, 29, 36, 37]. Given the data edit operations (including tuple-level and cell-level), minimum cost repair will output repaired data with minimizing the difference between it and the original one. Our problem can be seen as a special case of [29], because the complementary minimum culprit can be seen as C-repair (cardinality repair) of an inconsistent database; however, it is much more expensive using the techniques of the authors of [27] directly, especially for dynamic data, and the algorithm given in this paper is more efficient and seems optimal. There are some other repair definitions, such as “minimum description length (MDL)” [23] and “relative trust” [21]. To the best of our knowledge, there is almost no polynomial approximation algorithm with a good ratio bound for repairing inconsistent data based on CFDs except for a few approximation algorithms with constant ratio which were provided, such as [25], while the ratio could not reach $2$ , although the repair algorithm starts to need an approximate minimum vertex cover in a conflict graph, which is why they do not consider how to compute it efficiently; moreover, they cannot deal with large dynamic data well because it starts by finding all FD violations and a conflict hypergraph with respect to all FDs/CFDs should be built first which may take quadratic time and space. Another kind of method is consistent query answer (CQA) [29, 36, 38–43]; for a fixed boolean query q, $C Q A (q)$ is the following problem: given a database D, decide whether q evaluates to true on every repair of D. Such method will not edit the data but will find a query answer among all possible repairs of original database, and, unfortunately, there is no technique in CQA that can be used in this paper directly. Moreover, there are some works considering the theoretical result of CQA under C-repair [38, 44–46], but they do not provide the technique able to solve the problem this paper studied. Additionally, if the minimality assumption fails or there are multiple optimal repairs of the data, output of repair approximation algorithm will be meaningless sometimes even if it has an accuracy guarantee. In contrast to data repair, this paper aims to output the value of dirtiness for data quality evaluation, monitoring, and so on. Therefore, approximation with constant factor is also the lower bound of repairing cost.

9. Conclusions

This paper studied the data consistency evaluation based on CFDs, in order to give a quantized quality value to users. We proved that the complexity of dirtiness evaluation is NP-complete even if the condition is simple enough; moreover, for any $ε > 0$ , it is hard to give an approximation within $2 - ε$ in polynomial time. The time complexity of our $2$ -approximate algorithm is $O (n \log n)$ , and it scales well. To deal with the larger data and its update, the compact structure reduces the storage of conflict graph to $O (r n)$ and reduces the time of update to $O (r \log n)$ . The experiments show that our algorithm scales well with data size and the quality of its evaluation result is good.

Footnotes

Competing Interests

The authors declare that they have no competing interests.

Acknowledgments

This work is supported in part by the National Basic Research Program of China (973 Program) under Grant no. 2012CB316200, the National Natural Science Foundation of China (NSFC) under Grant no. 61190115, 61370217, the Fundamental Research Funds for the Central Universities under Grant no. HIT.KISTP201415, the National Science Foundation (NSF) under Grants nos. CNS-1152001 and CNS-1252292, the Research Fund for the Doctoral Program of Higher Education of China under Grant no. 20132302120045, and the Natural Scientific Research Innovation Foundation in Harbin Institute of Technology under Grant no. HIT.NSRIF.2014070.

References

Rajkumar

R. R.

Lee

Sha

Stankovic

Cyber-physical systems: the next computing revolution

Proceedings of the 47th Design Automation Conference (DAC '10)

June 2010

Anaheim, Calif, USA

ACM

731 736

10.1145/1837274.1837461

2-s2.0-77956217277

Cheng

Gao

Cai

Approximate physical world reconstruction algorithms in sensor networks

IEEE Transactions on Parallel and Distributed Systems 2014 25 12 3099 3110

10.1109/TPDS.2013.2297121

2-s2.0-84908187999

Cheng

Thaeler

Xue

Chen

TPS: a time-based positioning scheme for outdoor wireless sensor networks

Proceedings of the 23rd Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '04)

March 2004

2685 2696

10.1109/infcom.2004.1354687

2-s2.0-8344222870

Ding

Chen

Xing

Cheng

Localized fault-tolerant event boundary detection in sensor networks

Proceedings of the 24th IEEE Annual Joint Conference of the IEEE Computer and Communications Societies (INFOCOM '05)

March 2005

Miami, Fla, USA

902 913

2-s2.0-25844521520

Hua

Q. S.

Lau

F. C. M.

Minimum-latency aggregation scheduling in wireless sensor networks under physical interference model

Proceedings of the 13th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems (MSWiM '10)

October 2010

Bodrum, Turkey

ACM

360 367

10.1145/1868521.1868581

2-s2.0-78650222013

Sha

Zeadally

Data quality challenges in cyber-physical systems

Journal of Data and Information Quality 2015 6 2-3, article 8

10.1145/2740965

2-s2.0-84930974854

Tolle

Polastre

Szewczyk

Culler

Turner

Burgess

Dawson

Buonadonna

Gay

Hong

A macroscope in the redwoods

Proceedings of the 3rd ACM International Conference on Embedded Networked Sensor Systems (SenSys '05)

November 2005

ACM

51 63

10.1145/1098918.1098925

2-s2.0-84905819805

Szewczyk

Polastre

Mainwaring

Culler

Lessons from a sensor network expedition

Proceedings of the 1st EuropeanWorkshop (EWSN '04)

2004

Berlin, Germany

307 322

Cai

Chen

Z.-Z.

Lin

3.4713

-approximation algorithm for the capacitated multicast tree routing problem

Theoretical Computer Science 2009 410 52 5415 5424

10.1016/j.tcs.2009.05.013

MR2567642

2-s2.0-70449123289

10.

Cai

Chen

Z.-Z.

Lin

Wang

An improved approximation algorithm for the capacitated multicast tree routing problem

Combinatorial Optimization and Applications: Second International Conference, COCOA 2008, St. John's, NL, Canada, August 21–24, 2008. Proceedings 2008 5165

Berlin, Germany

Springer

286 295 Lecture Notes in Computer Science

10.1007/978-3-540-85097-7_27

MR2733023

11.

Cai

Goebel

Lin

Size-constrained tree partitioning: approximating the multicast k-tree routing problem

Theoretical Computer Science 2011 412 3 240 245

10.1016/j.tcs.2009.05.031

MR2789646

2-s2.0-78650610629

12.

Cai

Lin

Xue

Improved approximation algorithms for the capacitated multicast routing problem

Computing and Combinatorics: 11th Annual International Conference, COCOON 2005 Kunming, China, August 16–19, 2005 Proceedings 2005 3595

Berlin, Germany

Springer

136 145 Lecture Notes in Computer Science

10.1007/11533719_16

13.

Guo

Cai

Minimum-latency aggregation scheduling in wireless sensor network

Journal of Combinatorial Optimization 2016 31 1 279 310

10.1007/s10878-014-9748-7

MR3440260

ZBL1332.90110

2-s2.0-84953639722

14.

Cai

Cheng

Wang

Approximate aggregation for tracking quantiles in wireless sensor networks

Proceedings of the 8th International Conference on Combinatorial Optimization and Applications (COCOA '14)

December 2014

Maui, Hawaii, USA

161 172

15.

Cai

Cheng

Wang

Approximate aggregation for tracking quantiles and range countings in wireless sensor networks

Theoretical Computer Science 2015 607 381 390

10.1016/j.tcs.2015.07.056

MR3429060

2-s2.0-84939525641

16.

Bohannon

Fan

Geerts

Jia

Kementsietsidis

Conditional functional dependencies for data cleaning

Proceedings of the 23rd International Conference on Data Engineering (ICDE '07)

April 2007

Istanbul, Turkey

746 755

10.1109/icde.2007.367920

2-s2.0-34548731840

17.

Abiteboul

Hull

Vianu

Foundations of Databases 1995

18.

Fan

Geerts

Jia

Kementsietsidis

Conditional functional dependencies for capturing data inconsistencies

ACM Transactions on Database Systems 2008 33 2

1366103

10.1145/1366102.1366103

2-s2.0-46649106686

19.

Fan

Geerts

Müller

Detecting inconsistencies in distributed data

Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE '10)

March 2010

Long Beach, Calif, USA

64 75

10.1109/icde.2010.5447855

2-s2.0-77952749687

20.

Fan

Tang

Incremental detection of inconsistencies in distributed data

Proceedings of the IEEE 28th International Conference on Data Engineering (ICDE '12)

April 2012

Washington, DC, USA

IEEE

318 329

10.1109/icde.2012.82

2-s2.0-84864198280

21.

Beskales

Ilyas

I. F.

Golab

Galiullin

On the relative trust between inconsistent data and inaccurate constraints

Proceedings of the 29th International Conference on Data Engineering (ICDE '13)

April 2013

Brisbane, Australia

541 552

10.1109/icde.2013.6544854

2-s2.0-84881320841

22.

Bohannon

Fan

Flaster

Rastogi

A cost-based model and effective heuristic for repairing constraints by value modification

Proceedings of the 2005 ACM SIGMOD international conference on Management of data (SIGMOD '05)

June 2005

Baltimore, Md, USA

ACM

143 154

10.1145/1066157.1066175

2-s2.0-29844436973

23.

Chiang

Miller

R. J.

A unified model for data and constraint repair

Proceedings of the IEEE 27th International Conference on Data Engineering (ICDE '11)

April 2011

Hannover, Germany

IEEE

446 457

10.1109/icde.2011.5767833

2-s2.0-79957823829

24.

Cong

Fan

Geerts

Jia

Improving data quality: consistency and accuracy

Proceedings of the 33rd International Conference on Very Large Data Bases (VLDB '07)

2007

315 326

25.

Kolahi

Lakshmanan

L. V. S.

On approximating optimum repairs for functional dependency violations

Proceedings of the 12th International Conference on Database Theory (ICDT '09)

March 2009

Saint-Petersburg, Russia

ACM

53 62

10.1145/1514894.1514901

2-s2.0-70349143135

26.

Yakout

Elmagarmid

A. K.

Neville

Ouzzani

Ilyas

I. F.

Guided data repair

Proceedings of the VLDB Endowment 2011 4 5 279 289

10.14778/1952376.1952378

27.

Miao

Liu

On the complexity of sampling query feedback restricted database repair of functional dependency violations

Theoretical Computer Science 2016 609 3 594 605

10.1016/j.tcs.2015.02.010

MR3429092

2-s2.0-84948808360

28.

Cormode

Golab

Flip

McGregor

Srivastava

Zhang

Estimating the confidence of conditional functional dependencies

Proceedings of the 35th ACM SIGMOD International Conference on Management of Data

July 2009

Providence, RI, USA

469 482

10.1145/1559845.1559895

2-s2.0-70849101505

29.

Lopatenko

Bravo

Efficient approximation algorithms for repairing inconsistent databases

Proceedings of the 23rd International Conference on Data Engineering ICDE 2007

April 2007

Istanbul, Turkey

216 225

10.1109/icde.2007.367867

2-s2.0-34548775857

30.

Miao

Liu

Gao

Vertex cover in conflict graphs: complexity and a near optimal approximation

Combinatorial Optimization and Applications: 9th International Conference, COCOA 2015, Houston, TX, USA, December 18–20, 2015, Proceedings 2015 9486

Berlin, Germany

Springer

395 408 Lecture Notes in Computer Science

10.1007/978-3-319-26626-8_29

31.

Arenas

Bertossi

L. E.

Chomicki

Scalar aggregation in fd-inconsistent databases

Proceedings of the 8th International Conference on Database Theory (ICDT '01)

2001

London, UK

Springer

39 53

32.

Gavril

Algorithms for minimum coloring, maximum clique, minimum covering by cliques, and maximum independent set of a chordal graph

SIAM Journal on Computing 1972 1 2 180 187

10.1137/0201013

MR0327580

33.

Tpc-h benchmark, http://www.tpc.org

34.

Imdb ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/

35.

Dblp

http://dblp.uni-trier.de/xml/

36.

Arenas

Bertossi

Chomicki

Consistent query answers in inconsistent databases

Proceedings of the 18th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS' 99)

June 1999

ACM

68 79

2-s2.0-0032640897

37.

Winkler

W. E.

Methods for evaluating and creating data quality

Information Systems 2004 29 7 531 550

10.1016/j.is.2003.12.003

2-s2.0-2942709772

38.

Chomicki

Marcinkowski

Minimal-change integrity maintenance using tuple deletions

Information and Computation 2005 197 1-2 90 121

10.1016/j.ic.2004.04.007

MR2126364

2-s2.0-14744293228

39.

Bertossi

Bravo

Franconi

Lopatenko

Complexity and approximation of fixing numerical attributes in databases under integrity constraints

Database Programming Languages 2005

Berlin, Germany

Springer

262 278

40.

Calì

Lembo

Rosati

On the decidability and complexity of query answering over inconsistent and incomplete databases

Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS '03)

June 2003

260 271

2-s2.0-1142299764

41.

Fuxman

Fazli

Miller

R. J.

ConQuer: efficient management of inconsistent databases

Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '05)

June 2005

Baltimore, Md, USA

155 166

10.1145/1066157.1066176

2-s2.0-29844448776

42.

Fuxman

Miller

R. J.

First-order query rewriting for inconsistent databases

Journal of Computer and System Sciences 2007 73 4 610 635

10.1016/j.jcss.2006.10.013

MR2320187

ZBL1112.68042

2-s2.0-33947319488

43.

Wijsen

Database repairing using updates

ACM Transactions on Database Systems 2005 30 3 722 768

10.1145/1093382.1093385

2-s2.0-33745206041

44.

Arenas

Bertossi

Chomicki

Answer sets for consistent query answering in inconsistent databases

Theory and Practice of Logic Programming 2003 3 4-5 393 424

10.1017/S1471068403001832

MR2014092

ZBL1079.68026

2-s2.0-0345393818

45.

Kolahi

Libkin

An information-theoretic analysis of worst-case redundancy in database design

ACM Transactions on Database Systems 2010 35 1, article 5

10.1145/1670243.1670248

2-s2.0-77249147548

46.

Lopatenko

Bertossi

Complexity of consistent query answering in databases under cardinality-based and incremental repair semantics

Proceedings of the 11th International Conference on Database Theory (ICDT '07) 2006

Berlin, Germany

Springer

179 193

Data Inconsistency Evaluation for Cyberphysical System

Abstract

1. Introduction

1.1. Motivation

Example 1.

1.2. Contributions

2. Background

3. Problem Definition

Definition 2 (culprit).

Definition 3 (data dirtiness evaluation problem).

Property 1 (minimality).

Property 2 (monotonic and insensitive).

Remark 4.

Theorem 5.

Proof.

4. Evaluation Algorithm

Definition 6 (conflict graph [31]).

Example 7.

Algorithm 1: Linear algorithm.

5. Reduce Quadratic for Large Dynamic Data

5.1. Some Notations and Observations

Observation 1.

Definition 8 (independent residual subgraph).

Observation 2.

Proposition 9.

5.2. Algorithm for Dirtiness Evaluation

Algorithm 2: D D E v a ( D , Σ ) .

5.2.1. Building ir-Subgraphs

5.2.2. Finding Maximal Matching Greedily

Algorithm 3: M a x i m a l M a t c h i n g ( Δ ) .

Example 10.

5.3. Update

5.3.1. Update Maximal Matching

Observation 3.

5.3.2. Update ir-Subgraph

Algorithm 4: U p d a t e S u b g r a p h ( t , o p ).

Example 11.

5.4. Implementation

5.4.1. Compact Structure for ir-Subgraph

5.4.2. Supporting Membership Query

5.4.3. Supporting Update on ir-Subgraphs

Example 12.

Step 1 (update the double linked list of vertices in each equiv-class).

Step 2 (update the double linked list of groups L and R).

Step 3 (update the relative information recorded in K-header).

6. Optimizations and Extensions

6.1. Key Value Compression

6.2. The Number of Indexes

6.3. Minimum Space Cost

7. Experiments

7.1. Experimental Settings

7.1.1. Datasets

7.1.2. CFDs

7.1.3. Updates

7.2. Implementation

7.3. Experimental Results for Evaluation Algorithm

7.4. Summary

8. Related Works

9. Conclusions

Footnotes

Competing Interests

Acknowledgments

References

Algorithm 2: $D D E v a (D, Σ)$ .

Algorithm 3: $M a x i m a l M a t c h i n g (Δ)$ .

Algorithm 4: $U p d a t e S u b g r a p h$ ( $t, o p$ ).