Abstract
The application of the Internet of Things (IoT) has produced large amounts of data in different scenarios, accompanied by problems such as consistency and integrity violations. Existing research on dealing with data availability violations is insufficient. In this work, the detection and repair of data availability violations (DRAV) framework is proposed to detect and repair data violations in IoT under a distributed parallel computing environment. DRAV uses algorithms in the MapReduce programming framework, including detection and repair algorithms based on enhanced conditional functional dependency for data consistency violations, and MapJoin and ReduceJoin algorithms based on master data.
Introduction
The Internet of Things (IoT) connects multiple information producers with sensors and actuators. These producers collect and transmit surrounding information based on application demands. IoT encounters many problems as it develops. The terminal sensing devices of IoT are mostly used in areas with harsh environments, and the application scenarios are complex. To some extent, the trust rank of the data in an IoT system determines how extensively the system can be used. The data generated by IoT are multi-sourced and thus challenging to process accurately. Given that data with low availability are collected, Cisco has expanded the list of factors that could lead to IoT failure.1 Data availability and trustworthiness must be enhanced to meet the application requirements of real-time IoT.
Data availability describes the availability of data in business processes. The degree of data availability directly affects the results of business analyses. Invalid and erroneous data interfere with the regular operation of business workflows, inevitably reducing the availability of entity information. Data availability can be assessed from five aspects, namely, consistency, accuracy, integrity, timeliness, and entity identity. Concerning violations in IoT, data availability is described primarily by analyzing consistency and integrity performance. Data consistency means that the related information in a data information system is compatible and does not cause conflicts. Data integrity means that a data set contains data that fully satisfy the requirements for performing various operations on the data. Our work mainly investigates data consistency and integrity and considers the detection and repair of violating elements in data sets. Distributed computing divides the problem to be solved into many small parts, distributes the parts to many computers for processing, and combines the calculation results to obtain the outcome. Existing methods for data availability in IoT2–7 are not described in a distributed programming model, and only a few implementations exist for the detection and repair of distributed data availability violations. Data availability violations are random and unpredictable and cannot be quickly resolved by distributed problem-solving. The concurrency and multi-distribution of the parallel environment in IoT complicate the detection and repair of data availability violations.
In this article, we investigate the issue of data availability violations in IoT and propose the detection and repair of data availability violations (DRAV) framework to detect and repair these violations. To address the shortcomings of conditional functional dependency (CFD), DRAV proposes a semantic extension of CFD and corresponding solutions in terms of data consistency. DRAV improves the existing algorithm using the clustering method and proposes a strategy to detect and repair data integrity violations. For distributed application scenarios, DRAV proposes algorithms in the MapReduce programming framework, including the consistency violation detection and repair algorithm based on enhanced conditional functional dependencies (xCFDs),
The main contributions of this article are as follows:
The proposed xCFD extends existing functional dependencies, integrates high-quality data into the logic system of CFD, and further enhances the ability to detect and repair data consistency violations by eliminating conflicts.
An improved k-NN algorithm is used to automatically calculate the best value of
The distributed solution of detecting and repairing data availability violations in IoT is realized by designing related algorithms in the MapReduce programming framework.
Related work
This section summarizes the current research on data consistency, data integrity, and violations in data availability, focusing on the detection and repair of data consistency and integrity errors in IoT.
Data consistency
Data consistency is an essential sub-property of data availability. Improper design of the data model and the integration of multiple data sources may lead to data inconsistency. Data consistency concerns how the data of a data set are expressed and the related theoretical issues, and it lays the foundation for the judgment, detection, and repair of data consistency errors. Research on the theoretical system of data consistency can be divided into two categories: data consistency theory based on semantic rules and that based on statistical methods.8
The research focuses on the theory of data consistency based on semantic rules, in which CFD is one of the classical theories. An improved mechanism called CFD9 was proposed based on functional dependency (FD). The problem of computing the smallest tuple set to delete from the original data in the process of evaluating data consistency has been studied and proven to be non-deterministic polynomial complete (NP-C). Besides, CFDs implement data constraints by binding specific values and expressing data consistency semantics. Compared with classical FD, CFD performs better in conditional expression. Thus, many studies have used CFD. Bohannon et al.10 presented a method of consistency error detection for SQL data based on CFD. The method has been widely used in data cleaning. Fan et al.11 proposed an algorithm for mining CFDs in data sets. The algorithm has been used to generate CFDs automatically. Miao et al.12 systematically studied the problem of data consistency determination using CFD and measured the consistency quality of a data set as the ratio of the tuples in the largest subset satisfying the CFD rule set to the total number of tuples in the data set. These researchers also studied the computational complexity and approximation of the problem. Zhou and Bu13 proposed an improved strategy for existing cleaning schemes, including introducing support into the mining of dependency rules, and presented approximate functional dependency. Moreover, the application of CFD in data cleaning systems in the general domain has been studied. Yang14 reported that editing rules supplement the expressive ability of CFDs. Based on CFDs, Jin et al.15 introduced a series of external knowledge bases, including hard, quantitative, equivalent, and non-equivalent constraints, to assist the expression of consistency constraint rules. Based on CFDs and external knowledge bases, they proposed an incremental data consistency error detection and repair algorithm. Salem and Abdo16 presented editing rules to supplement the expressive ability of FD. Extended conditional functional dependencies17 (eCFDs) can express semantics beyond those of CFDs.
Data integrity
Data integrity is another essential sub-property of data availability. Human input errors, the absence of non-null constraints, and the attribute recognition of semi-structured data may lead to data integrity loss. Most studies on the detection and repair of data integrity errors have focused on filling missing values. Filling data sources can be divided into internal and external: the internal data source is the data set being detected and repaired, and the external data source is the data source of another business system or data in the network.
Liu et al.18 studied the problem of data integrity determination and proposed a measurement model of data integrity. The model proved that the problem of data integrity determination is NP-C. Afterward, the researchers proposed an optimization-based integrity determination algorithm. For the algorithm to apply to the processing of large amounts of data, they also proposed an approximation algorithm for big data. Razniewski and Nutt19 studied the problem of determining the integrity of geographic data and improved the relevant decision algorithms. A method to determine the data integrity of time series20 was proposed. Emrann et al.21,22 developed data integrity assessment measures for microbial genome databases. Libkin23 studied the theoretical basis of filling missing values. Farooq et al.24 provided a well-defined security architecture that protects the confidentiality of user privacy and data integrity, which could result in the architecture's wide adoption. Several studies25–29 used the method of filling in internal values. Hao et al.30 proposed a filling algorithm based on double clustering. Concerning external data filling, some studies31,32 investigated the problem of filling missing values with data on the network. A method of filling missing values using internal and external data sources33 was proposed. To minimize the number of network queries, Li et al.34 developed a method of filling missing values for network data, with the number of queries as the optimization goal.
Detection and repair of data availability violations
As data volumes increase, research on data consistency still focuses on centralized relational data sets, and theoretical studies on distributed and non-relational data sets are few. In recent years, however, an increasing number of researchers have focused on improving the data quality of large data sets.
Most existing studies are based on theoretical systems derived from functional dependency and are aimed at structured data. An algorithm for distributed multi-functional dependency conflict detection was proposed in a previous work.35 Fan et al.36 presented an incremental online algorithm for detecting data consistency errors in distributed databases. Other studies37,38 developed MapReduce algorithms for consistency error detection and repair of data files using CFD in the Hadoop environment. An algorithm that extends the stand-alone offline method for batch data39 was proposed for the Hadoop platform. Yang et al.40 optimized a series of big data algorithms based on task merging and MapReduce. Ding et al.41 studied the relationship among the five sub-properties of data availability and pointed out that integrity errors may be repaired in three dimensions: attribute values, tuples, and data tables. The filled data may contain inconsistency errors, which may violate the consistency constraints after repair.
Therefore, the repair of integrity violations cannot be entirely separated from data consistency. In contrast, the repair of consistency errors modifies data that violate the semantic rules into data that satisfy the consistency constraints; it does not delete data, thereby avoiding the introduction of missing values that would lead to integrity errors.
Detection and repair of consistency violation
Based on CFD and eCFD theories, this section defines the xCFD rule to address the limitations caused by the limited set of discrete value constraints and describes the consistency violation detection and conflict elimination algorithm based on xCFDs.
Definition and properties of xCFDs and main data
CFD, as the basis of the subsequent consistency theory, belongs to the consistency theory system based on semantic rules. CFD enhances the semantics of the FD commonly used in relational schemas. For a relational schema R, a CFD is a pair (X → A, tp), where X → A is a standard FD on R and tp is a pattern tuple whose value on each attribute in X ∪ {A} is either a constant or an unnamed variable "_".
Data of a benchmark nature are usually referred to as the main data. As shown in Figure 1, the main data generally have high quality, a long data life cycle, a low data update frequency, and a small data volume. These attributes can be used to detect data errors and to directly guide the repair of data errors. We formally define main data in the dimension of consistency. Besides, we consider the relation

Types of IoT data and their characteristics.
As shown in the conversion example in Figure 2, for each FD of a data set, a corresponding CFD rule table containing only constant values can be generated. Correspondingly, the number of rule tuples in the rule table is close to the number of tuples in the data set. In addition to consistency, the main data also have integrity, timeliness, and accuracy. Here, the main data are integrated into the system of CFDs, and the compatibility of CFDs within xCFDs makes the conversion possible. The error tuples are then processed by the repair method of xCFDs.

Conversion example of the main data and CFD.
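The conversion above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the main data are held as a list of dictionaries and an FD is given as a left-hand attribute list plus a right-hand attribute, and it emits one constant-only CFD pattern tuple per distinct left-hand combination observed in the (high-quality) main data. All names are illustrative.

```python
# Hypothetical sketch: deriving a constant-only CFD rule table from main data.
def fd_to_cfd_rules(main_data, lhs_attrs, rhs_attr):
    """For FD lhs_attrs -> rhs_attr, emit one constant pattern tuple per
    distinct left-hand combination observed in the main data."""
    rules = []
    seen = set()
    for row in main_data:
        key = tuple(row[a] for a in lhs_attrs)
        if key not in seen:  # one rule tuple per distinct LHS value
            seen.add(key)
            rules.append({**{a: row[a] for a in lhs_attrs},
                          rhs_attr: row[rhs_attr]})
    return rules

main_data = [
    {"city": "Boston", "zip": "02115", "state": "MA"},
    {"city": "Boston", "zip": "02116", "state": "MA"},
]
rules = fd_to_cfd_rules(main_data, ["zip"], "state")
```

Consistent with the observation in the text, the rule table grows with the number of distinct tuples in the main data rather than staying fixed like a hand-written CFD.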
The modification of attributes in each tuple is considered in the repair proposals. When a unit group violation occurs, other irrelevant attribute values in the tuple do not affect the repair proposals. Under known rules, the proposed modification is to change the constant in the tuple against the attribute value to the constant selected in the rule. When multiple violations occur, the minimum modification method under FD constraints aggregates and groups the same
The concept of equivalence class is introduced to constrain the logical consistency of repair operations.
39
For relational schema

Process for merging equivalence classes.
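The equivalence-class merging process in Figure 3 is commonly realized with a union-find structure; the sketch below is an illustrative assumption (cells identified as tuple/attribute pairs), not the paper's exact data structure. Cells placed in one class must receive the same repair value, which keeps repair operations logically consistent.

```python
# Illustrative union-find sketch for merging equivalence classes of cells.
class EquivClasses:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the class representative of cell x, creating a
        singleton class on first sight; uses path halving."""
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def merge(self, a, b):
        """Merge the classes of cells a and b."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb  # the two classes now share one representative

eq = EquivClasses()
eq.merge(("t1", "city"), ("t2", "city"))  # t1.city and t2.city must agree
eq.merge(("t2", "city"), ("t3", "city"))  # transitively, so must t3.city
```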
The values in the rule table of CFDs can only be a specific constant or an undefined variable. CFDs cannot interfere when we express the
The formal definition of eCFD is as follows: for relational schema
When the semantics rules of
The definition of xCFDs is as follows: for relational schema
As shown in Table 1, an example of an xCFD rule is shown as follows:
Example of xCFDs rules.
The definition of xCFDs indicates that their semantic expression ability is stronger than that of eCFDs and CFDs, and for CFD and eCFD rules, corresponding expressions in xCFDs exist. To illustrate the conversion from CFD to xCFD rules, DRAV describes the conversion of CFDs to eCFDs and eCFDs to xCFDs in turn.
From these definitions and semantics, for any CFD rule
For only one rule tuple CFD

Conversion example of CFD and eCFD.
For eCFD

Conversion example of eCFD and xCFD.
Research on consistency based on xCFDs and main data
The data consistency violation detection method based on xCFDs is similar to the detection method based on eCFDs. The analogy in xCFDs is extended. For the rule tuple
To specify the specific matching method of the
The algorithm
Following the idea of consistency error detection under the eCFD system, we detect consistency violations of data tuples under xCFDs by matching the left-hand data against the rule tuple data. Given the similarity of xCFDs and eCFDs in form and semantics, their detection algorithms are the same and can be divided into two cases.
Consistency violation of unit group composition. As shown in Table 2, consistent with the detection method of eCFDs, for data set
Consistency violation caused by multiple groups. Similarly, for data set
SQL query for consistency violation of unit groups.
SQL: Structured Query Language.
SQL query for consistency violation of multiple groups.
SQL: Structured Query Language.
Here, we present a consistency violation detection algorithm
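A unit-group check of this kind can be sketched as follows. The rule representation (a left-hand constant pattern plus a right-hand allowed-value set, reflecting xCFDs' set constraints) is an illustrative assumption, and the function name is hypothetical:

```python
# Minimal sketch of unit-group violation detection under an xCFD-style rule.
def detect_unit_violations(rows, rule):
    """Return indices of tuples whose LHS matches the rule's constant
    pattern but whose RHS value falls outside the allowed set."""
    lhs, rhs_attr, allowed = rule["lhs"], rule["rhs_attr"], rule["allowed"]
    bad = []
    for i, row in enumerate(rows):
        if all(row.get(a) == v for a, v in lhs.items()):
            if row.get(rhs_attr) not in allowed:  # violates the set constraint
                bad.append(i)
    return bad

rows = [
    {"country": "US", "area_code": "617"},
    {"country": "US", "area_code": "999"},
    {"country": "UK", "area_code": "20"},
]
rule = {"lhs": {"country": "US"}, "rhs_attr": "area_code",
        "allowed": {"617", "212"}}
violations = detect_unit_violations(rows, rule)
```

When the allowed set contains a single constant, this degenerates to the classical CFD check, which is the compatibility property the text relies on.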
Next, we consider the issue of consistency violation repair. When only one rule exists in xCFDs and only one rule tuple is present in the rule table, a constant in the set of CFD rules can be used to provide a repair suggestion value for the attribute value violated in the wrong tuple in the case of unit group violation. In the case of multiple group violations, the attribute values that appear most frequently in each group can be used as the repair recommendation values.
39
In the rules of CFDs, constants are used only as repair suggestions because only constants are included in the rules, except for undefined variables. In xCFDs, because rules make up a set that may contain many quantities, calculating a specific constant is necessary. To be compatible with CFDs, when the number of elements in a set is 1, the semantics of CFDs is included, and the constants calculated are the only elements in the set. Therefore, the most frequently occurring amount of constraint attributes of all tuples that conform to the rule constraints can be regarded as the repair recommendation value. To this end, the algorithm
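The repair-value choice described above (the most frequent right-hand value among tuples conforming to the rule's left-hand side) can be sketched as follows; the data layout and function name are assumptions for illustration:

```python
# Hedged sketch of the repair-suggestion rule: take the mode of the
# constrained attribute over tuples matching the rule's LHS pattern.
from collections import Counter

def repair_suggestion(rows, lhs, rhs_attr):
    values = [r[rhs_attr] for r in rows
              if all(r.get(a) == v for a, v in lhs.items())]
    if not values:
        return None
    # most_common(1) yields the single highest-frequency value
    return Counter(values).most_common(1)[0][0]

rows = [
    {"zip": "02115", "state": "MA"},
    {"zip": "02115", "state": "MA"},
    {"zip": "02115", "state": "NY"},  # erroneous tuple to be repaired
]
suggestion = repair_suggestion(rows, {"zip": "02115"}, "state")
```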
The method of
Unit-group consistency violations can be detected and repaired in the mapping process. After the unit-group detection process, multigroup violations can be detected, and whether the data need to be grouped can be determined. In the mapping and reduction processes of the
When the amount is small, main data can be stored directly in Hadoop’s dedicated cache or memory. Thus, the join operation can be carried out in the mapping process. The
When using the reduction process to realize the connection, the main data and the data to be cleaned can be read and grouped according to the attribute value on the left side of the functional dependency. The procedure of
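The contrast between the two join strategies can be illustrated with a framework-free sketch (plain Python standing in for Hadoop; names and the dict-based "cache" are assumptions). In a map-side join, the small main-data table is broadcast to every mapper as an in-memory index keyed by the FD's left-hand attribute, so each record is checked and repaired locally without a shuffle:

```python
# Simplified MapJoin-style repair: main data small enough to cache in memory.
def map_join(records, main_index, lhs_attr, rhs_attr):
    repaired = []
    for rec in records:
        truth = main_index.get(rec[lhs_attr])   # lookup in cached main data
        if truth is not None and rec[rhs_attr] != truth:
            rec = {**rec, rhs_attr: truth}      # repair from main data
        repaired.append(rec)
    return repaired

main_index = {"02115": "MA"}                    # small, cacheable main data
records = [{"zip": "02115", "state": "XX"},     # inconsistent with main data
           {"zip": "99999", "state": "WA"}]     # no main-data entry: unchanged
out = map_join(records, main_index, "zip", "state")
```

A ReduceJoin would instead tag both inputs, shuffle them grouped by the left-hand attribute value, and perform the same comparison per group in the reducer, which is necessary when the main data are too large to cache.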
Conflict elimination
When a unit group violation occurs,
The specific attributes are selected, and repair suggestions are provided for
Neither
and
At least one of
An attribute in
Detection and repair of integrity violation
A specific detection and repair scheme for integrity violation data is provided in this section. The
Integrity violation detection
By the standard definition, the judgment condition of an integrity violation is that an attribute value is null (missing). Decision flag
We determine
Based on defining the
The partition method
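The mapping-phase integrity check reduces to flagging each tuple whose required attributes contain a null. The sketch below is an assumed, minimal form of that check (the flag and function names are illustrative):

```python
# Illustrative mapping-phase integrity check: a tuple is flagged when any
# required attribute is missing (None or empty string).
def flag_integrity(rows, required_attrs):
    flagged = []
    for row in rows:
        violated = any(row.get(a) in (None, "") for a in required_attrs)
        flagged.append((row, violated))  # (tuple, decision flag)
    return flagged

rows = [{"id": 1, "temp": 21.5},
        {"id": 2, "temp": None}]         # integrity violation: missing value
flags = flag_integrity(rows, ["temp"])
```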
Integrity violation repair
At present, integrity repair mainly focuses on filling missing values. The repair of integrity violation tuples involves the repair of missing values in
In pattern recognition and machine learning, the
The NN method uses the space vector model to classify cases, evaluating the possible class of an unknown case by calculating its distance from samples of known classes. Weighting the neighbors is useful for both classification and regression, with nearby neighbors weighted more heavily than distant ones. The disadvantage of
If the class with the most frequent occurrence is not unique, then the value of the class with the nearest point in the two types of points is used as the patching value. If the value can only be taken as an integer, then the result of the calculation
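The tie rule above can be sketched directly: among the k nearest neighbors, take the most frequent class; if several classes tie, fall back to the class of the single closest neighbor. Distances are assumed precomputed, and all names are illustrative:

```python
# Sketch of k-NN value filling with tie-breaking by the nearest neighbor.
from collections import Counter

def knn_fill(neighbors, k):
    """neighbors: list of (distance, class_value) pairs, precomputed."""
    nearest = sorted(neighbors)[:k]          # k smallest distances
    counts = Counter(v for _, v in nearest)
    top = counts.most_common()
    best, best_n = top[0]
    tied = [v for v, n in top if n == best_n]
    if len(tied) > 1:
        # tie: the class of the closest neighbor among the tied classes wins
        for _, v in nearest:
            if v in tied:
                return v
    return best

# "A" and "B" each occur twice, but the nearest point (distance 0.5) is "A".
filled = knn_fill([(0.5, "A"), (1.0, "B"), (2.0, "B"), (3.0, "A")], 4)
```

For numerical attributes the analogous step would average the neighbors' values and, as noted above, round the result when the attribute is integer-valued.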
For each element in a set,
In
The
Experimental study
This section demonstrates through experiments the improvement of data availability achieved by DRAV. We conducted three experiments: consistency violation detection and repair based on xCFDs, detection and repair based on main data, and integrity violation detection and repair based on mean-
Data sets and environment
Two IoT open data sets were used instead of actual sensor data. Data Set I (Hubway Data Visualization Challenge) has nine numerical attributes and six categorical attributes, totaling 1 million records. Data Set II (Gas Sensor Array Drift Data Set) consists of 16 numerical attributes, totaling 13,000 records.
All the experiments in this section were run on a distributed cluster of five physical nodes, each with an Intel E3 CPU running at 3.4 GHz and 16 GB of memory, running CentOS 7.3, MySQL 5.5, and Hadoop 2.7.
Detection and repair experiment of consistency violation based on xCFDs
In this experiment, the proposed method for consistency violations based on xCFDs is evaluated on Data Set I. The analysis focuses on violation detection precision, repair quantity, repair precision, and running time comparison. Based on the experimental data and the domain knowledge of the attributes, four constraint rules with different constraints were generated for CFDs, eCFDs, and xCFDs, covering most of the data set. Tuples in Data Set I were randomly selected and made to violate the four rules.
Violation detection precision analysis
In this section, the detection precision of the proposed xCFDs is analyzed. Since the detection results of consistency violations are deterministic, the respective detection precision results depend on the ratio of the consistency violations randomly generated for each rule. Owing to the limitation of expressive ability, the rule semantics of xCFDs cannot be expressed in CFDs or eCFDs, and the semantics of eCFDs cannot be expressed in CFDs. Therefore, CFDs cannot detect the constraints expressed by xCFDs. As shown in Figure 6, xCFDs achieve better detection precision than CFDs and eCFDs at the same violation rate.

Comparison of detection precision among CFDs, eCFDs, and xCFDs.
Violation repair quantity analysis
In this section, the repair situations under different error rates of the raw data are investigated. We examined the total number of CFD-based and xCFD-based repairs. As shown in Figure 7, as the error rate increases, the number of detected error tuples increases gradually, and the ratio of xCFD repairs to CFD repairs increases with the data scale. Combined with the previous experimental results, the enhanced expressive ability of xCFDs in constraint semantics helps describe data errors, and these rules, together with an appropriate repair algorithm, can be used for error detection and correction. This result indicates that xCFDs outperform classical CFDs.

Comparison of repair quantity between CFDs and xCFDs.
Violation repair precision analysis
In this section, the overall repair precision of CFDs and xCFDs is demonstrated. The error rate of each group is set as in the previous experiment. Figure 8 shows the error repair precision of CFDs and xCFDs under different error rates. Owing to the enhanced expressive ability of xCFDs, constraints that use sets are weaker than those that use constants, especially in repair operations. In the process of xCFD repair, the conflict between rules and data also increases, reducing the determinacy of the overall repair; thus, repair precision also decreases. However, the rate of decline is acceptable, and the impact on repair capacity is low.

Comparison of repair precision between CFDs and xCFDs.
Running time analysis
In this section, the influence of the proposed xCFDs on efficiency is considered through a running time comparison between stand-alone SQL and Hadoop. As shown in Figure 9, Hadoop has an apparent efficiency advantage over stand-alone SQL, and the time gap becomes increasingly significant as the data volume grows. Meanwhile, the time reduction of the Hadoop version gradually decreases, indicating that a complicated relationship exists among data volume, error rate, and parallelism. In summary, the proposed xCFDs provide effective parallel detection and repair of data consistency violations.

Comparison of xCFDs running time between stand-alone SQL and Hadoop.
Detection and repair experiment of consistency violation based on main data
In this experiment, the proposed method for consistency violations based on main data is evaluated on Data Set I. The analysis focuses on the running time comparison. In the previous sections, we mentioned that the relational model of main data is
Running time analysis
In this section, the influence of the proposed

Comparison of running time between
Detection and repair experiment of integrity violation
In this experiment, the proposed method for integrity violations is evaluated on Data Set II. The analysis focuses on the tuple distance and running time comparisons. From Data Set II, 5%–30% of the tuples in the original data were randomly selected, and one attribute value in each was set to be empty. The repair effects of mean-
To meet the experiment requirements, the frequency of each class of all categorical attributes was collected from the statistical data, and the distance between the classes of all attributes was calculated. The calculation results were saved for subsequent use. The data sets were partitioned, and the input was checked in the mapping process to detect and repair integrity violations further. For the results after mapping, the distances of the elements within each partition were calculated. If the mapping output were reduced directly, then all the lists would have to be examined in the reduction. Instead, the list in each partition only needs to maintain the minimum of the first
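The partition-local pruning described above is the standard combiner pattern: each partition keeps only its k smallest distances, so the reducer merges short candidate lists rather than every mapped pair. A framework-free sketch under that assumption (stdlib `heapq` standing in for the combine and reduce steps):

```python
# Map-Combine-Reduce sketch: per-partition top-k, then a global top-k merge.
import heapq

def combine_partition(distances, k):
    """Combiner: keep only the k smallest distances of one partition."""
    return heapq.nsmallest(k, distances)

def reduce_partitions(partials, k):
    """Reducer: merge the per-partition candidate lists into a global top-k."""
    merged = [d for part in partials for d in part]
    return heapq.nsmallest(k, merged)

p1 = combine_partition([4.0, 1.0, 9.0, 2.5], 2)   # partition 1 candidates
p2 = combine_partition([0.7, 8.0, 3.3], 2)        # partition 2 candidates
top = reduce_partitions([p1, p2], 2)              # global k nearest distances
```

This is sound because any distance among the global k smallest must also be among the k smallest of its own partition, so the combiner discards nothing that the reducer could need.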
Tuple distance analysis
In this section, the influence of the mean-

Comparison of tuple distances between fillings with mean-
Running time analysis
In this section, the influence of the

Comparison of running time between Map-Combine-Reduce and MapReduce.
Discussion
In the previous sections, a series of algorithms described in MapReduce form on Hadoop has been proposed. In research on data consistency with xCFDs,
In summary, compared with classical algorithms, the performance of the framework method is improved, but it still has room for improvement. For example, in the violation detection and repair algorithm based on xCFDs, the enhancement of expression ability is accompanied by the weakening of strict constraint ability, leading to an increase in the uncertainty of the repair results. Thus, the repair rate is lower than that of CFDs. In the mean-
DRAV still has room for improvement in terms of dealing with IoT data availability violations. Future work on DRAV should focus on the following aspects: (1) calculating domain-related maintenance costs by network algorithms, (2) enhancing the functional dependencies in the data sets by mining algorithms, and (3) promoting the neighbor calculation process by using the distance coefficient in the neighbor algorithm.
Conclusion
The improvement of data availability is crucial to the credibility of IoT. Our work focused on the consistency and integrity of data availability in IoT and proposed a DRAV framework that contains a series of processing algorithms. To address the deficiency of CFD in expressing data consistency, DRAV proposes an extension of CFD in semantics to enhance the expressive ability on rule constraints. Under the existing theory of error detection and correction, DRAV proposes a corresponding SQL query and standalone detection algorithm and a heuristic standalone repair scheme. Main data are formally defined by FD, which integrates high-quality data into a theoretical system of CFD and clarifies the guiding significance of high-quality data for error detection and repair. In distance measurement, a
Footnotes
Handling Editor: Ximeng Liu
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (Nos 61472108, 61601146, and 61732022) and the National Key Research and Development Program of China (Nos 2016QY03D0501 and 2017YFB0803300).
