A Multisource Retrospective Audit Method for Data Quality Optimization and Evaluation

Abstract

With the rapid development of information technology and the arrival of the big data era, data of many kinds are constantly emerging, and they are autonomous and heterogeneous. How to optimize data quality and evaluate the effect of the optimization has become a challenging problem. Firstly, a heterogeneous data integration model based on retrospective audit is proposed to locate the original data source and match the data. The model is divided into four modules: original heterogeneous data, data structure, data processing, and data retrospective audit. Secondly, in order to improve the integrated data quality, a retrospective audit model and associative audit rules are proposed to fix incomplete and incorrect data from multiple heterogeneous data sources. Finally, assessment criteria such as redundancy, sparsity, and accuracy are defined to evaluate the effect of the optimized data quality. Experimental results show that the quality of the integrated data is significantly higher than the quality of the original data.

1. Introduction

With the rapid development of Internet technology and the advent of the era of big data, global enterprise information has doubled every 1.5 years on average, and at present only 7% of the total information has been utilized [1]. How to effectively integrate distributed, heterogeneous, and autonomous data so as to realize data sharing is a challenging research topic. Enterprises obtain valuable information by analyzing these data effectively, and high-quality data help users make the right decisions. Many factors can affect the quality of data. Among them, those related to the original data include errors in data format, data inconsistency, and nonconformity with the business logic.

At present, one approach to heterogeneous data integration is the data warehouse model. It extracts data from one or more data sources, processes the data when necessary, and then stores the data in the target data warehouse. This model supports complex data conversion and offers good performance. It generally combines ETL (Extract, Transform, and Load) with a data warehouse. The ETL process includes data extraction, data transformation, and data loading: data from distributed and heterogeneous sources are extracted to a temporary middle layer, then cleaned, converted, and integrated, and finally loaded into the data warehouse or data mart.

Several methods have been proposed to improve on traditional data integration, such as methods based on XML middleware [2], on the conjoint method [3], and on ontologies [4]. For real-time data integration, [5] proposes reducing the integration time by using components for the data warehouse, and [6, 7] advocate using the ODS technique for the dynamic data loading process. Other methods, such as data dimension integration [8] and the integration of semi-structured data [9], were proposed to optimize the quality of data integration. Pellegrino [10] studied interactive visualization for data integration systems in detail. Many researchers have tried to optimize data quality in data integration from different perspectives [11–16].

However, the existing heterogeneous data integration process has the following disadvantages. Firstly, traditional heterogeneous data integration is based on the ETL process of data extraction, integration, cleaning, and loading. It cannot update data in real time, because a time interval must be set to specify how often the data are refreshed, so updates happen only periodically and passively. Secondly, the traditional ETL process does not propagate changes back to the original data. It only uses the original data; it provides neither quality assurance nor quality improvement for them and lacks a maintenance process for the original data.

In this paper, we propose a multisource retrospective audit method for data quality optimization and evaluation. Firstly, a real-time multisource heterogeneous data integration model is established to improve the quality of the original data by employing adapter, XML, and reverse cleaning technologies. The model is divided into four modules: original heterogeneous data, data structure, data processing, and data retrospective audit. Secondly, in order to improve the quality of integrated data, a retrospective audit model and relevant audit rules are proposed to fix incomplete and incorrect data from multiple heterogeneous data sources. Finally, assessment criteria such as redundancy, sparsity, and accuracy are defined to evaluate the effect of the optimized data quality.

The remainder of the paper is organized as follows. In Section 2, we introduce the heterogeneous data integration model based on traceable audit. In Section 3, the optimization process of traceable audit is described in detail. In Section 4, assessment criteria such as redundancy, sparsity, and accuracy are defined to evaluate the effect of the optimized data quality. In Section 5, we present the experiment process, the experimental results, and their analysis. Finally, we conclude the paper in Section 6.

2. Heterogeneous Data Integration Model Based on Traceable Audit

2.1. Data Lineage

In the view of the database update, there is a similar process called data lineage. In recent years, with the development of the network, data lineage has become a new field of research. Through data lineage tracing we can get the information about the source of data view. When the original data in the database are changed, the view of database can be updated through tracing the lineage of the data. Data lineage has attracted the attention of scholars in fields of Web Search and Mass Storage in recent years.

The reverse cleaning process is composed of a building process and a query process. For example, Table A is composed of Table B and Table C, and Table B is composed of Table D and Table E. We can use a node to describe a table and an edge to describe the relationship between tables. Then, a data bus can be generated according to the data lineage of A. When constructing the local dataset that needs to be repaired, we can record the exact data source according to this data flow.

The process of traceable audit optimization is in essence the process of finding the sources of data and then matching and modifying the data. We can follow the data convergence process to fix data in the local raw data tables or files that were integrated. For example, when we modify the first row of Table A, we can analyze the source of Table A according to the reverse trace of the convergence process: A→(B, C)→[(D, E), C]. We then quickly obtain a result like this: “The first row of Table A is actually derived from row 1 of Table F, row 5 of Table G, row 3 of Table E, and row 1 of Table C.” The reverse traceability process is shown on the right side of Figure 1.

Figure 1

Data bus and reverse traceability.
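The reverse trace of the example lineage A→(B, C)→[(D, E), C] can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary `LINEAGE` and the function `trace_sources` are hypothetical names, the lineage graph is assumed to be acyclic, and the sketch works at table level only (it does not track individual rows).

```python
from typing import Dict, List

# Hypothetical lineage graph taken from the example in the text:
# Table A is built from B and C; Table B is built from D and E.
LINEAGE: Dict[str, List[str]] = {
    "A": ["B", "C"],
    "B": ["D", "E"],
}


def trace_sources(table: str, lineage: Dict[str, List[str]]) -> List[str]:
    """Return the original tables that `table` is ultimately derived from."""
    parents = lineage.get(table)
    if not parents:
        # No recorded parents: the table is itself an original source.
        return [table]
    sources: List[str] = []
    for parent in parents:
        for src in trace_sources(parent, lineage):
            if src not in sources:  # keep first-seen order, drop repeats
                sources.append(src)
    return sources


print(trace_sources("A", LINEAGE))  # ['D', 'E', 'C']
```

Storing only the direct parents of each table keeps the data bus small; the full source list is recovered on demand by walking the edges backwards.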

2.2. Traceable Audit

Traceable audit means that audited data can be traced. When heterogeneous data are integrated, the source of each datum is marked. After the data have been processed according to the audit rules, the repair history is recorded in detail. If the data to be repaired correspond to several sources, then after the repair we can trace the record to learn which data source the repair was based on. Heterogeneous data sources also serve as the audit reference for repairing the local data. They come from outside the local data collection, and their structure is not known in advance.

2.3. Integrating Heterogeneous Data Structures

The heterogeneous data integration model based on traceable audit is divided into four modules: original heterogeneous data, data structure, data processing, and data feedback audit, as follows.

(1) Original data module: this module stores heterogeneous data from different data sources, such as structured MySQL and Oracle databases, HTML, XML, semistructured and unstructured text, and Excel.

(2) Data structure module: it converts heterogeneous data from different sources into a unified structured form and, according to the data source tree and relevant rules, defines the structure of the data that need repair, which is convenient for later data processing.

(3) Data processing module: it consists of two parts, a real-time data integration process and a data restoration process. It integrates multiple heterogeneous data sources and extracts, integrates, cleans, and preserves the original data.

(4) Data feedback audit module: it is divided into a data feedback process and an audit process. It is mainly used to query the data sources and to locate and correct errors in the local original data; once it is confirmed that raw data need to be fixed, the original data are updated.

Its structure is as in Figure 2.

Figure 2

Structure of heterogeneous data integration.

3. Optimization Process of Traceable Audit

3.1. Data Traceable Audit Process

Traceable audit aims to repair the original data in order to address problems of credibility, quality, and version information in distributed data sharing. The reverse feedback audit clearly reflects changes to the original data in the database, which helps to solve the problem of wrong data being updated in the process of heterogeneous data integration.

Feedback audit in this paper refers to the repair of database according to multiple heterogeneous data sources on the Internet, that is, to correct the local incomplete and incorrect data according to the high quality dataset integrated on the basis of certain rules. Figure 3 illustrates the feedback audit process in the form of a flowchart.

Figure 3

Data flow optimization traceability audit.

3.2. Integration Rules

In the feedback audit process, data integration rules affect the reverse audit quality immensely. Integration rules include defining the contents to be repaired, integration of heterogeneous data sources, and integration of local original data.

(1) Defining the Contents to Be Repaired. This means identifying the items to be repaired in the original data. For example, if we are to repair business information such as addresses and telephone numbers, we need to define the contents to be repaired beforehand so that only the relevant data are extracted from the different data sources, which reduces the computational complexity of the algorithm.

(2) Integration of Local Original Data. Local raw data integration means extracting the local data according to the rules and then forming structured data that can be used for the feedback audit.

The algorithm extracts the corresponding dataset from the database tables that need to be repaired, according to the definition of the repair content. At the same time, the source of each datum is recorded according to the data bus rules, so that the original data can still be traced after the retrospective repair. The data are structured as they are extracted.

(3) Heterogeneous Data Integration. Heterogeneous data integration extracts and integrates data stored in various forms outside the local data source, including XML, MySQL, and TXT. It comprises content selection rules and merging rules.

Content Selection Rules. On the basis of the defined repair contents, the corresponding contents are extracted from the different data sources and structured in a uniform XML storage. For MySQL, Oracle, TXT, Excel, and other storage structures, the algorithm extracts data from each source according to the definition of the repair content. For example, if the material to be repaired is the basic information of students, the adapter finds the corresponding records, fields, properties, and so forth according to the definition and extracts the basic information from the multiple data sources.

Merging Rules. Data from the various sources are extracted, structured, and then merged into one dataset. On the basis of the extraction, the algorithm merges the structured XML documents, such as Oracle-XML, MySQL-XML, Txt-XML, and Excel-XML, according to certain rules, such as labeling each record with its source and a time stamp.
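A minimal sketch of such a merging rule is given below, using Python's standard `xml.etree.ElementTree` module. The element names (`dataset`, `record`), the attribute names (`source`, `timestamp`), the function `merge_to_xml`, and the sample records are all assumptions made for illustration; the paper does not specify the XML schema. Field names are assumed to be valid XML tag names.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def merge_to_xml(extracted: Dict[str, List[Dict[str, str]]],
                 timestamp: str) -> ET.Element:
    """Merge per-source record lists into one XML tree.

    Each record is labeled with its source and an extraction time stamp,
    so that a later repair can be traced back to the source it used.
    """
    root = ET.Element("dataset")
    for source, records in extracted.items():
        for rec in records:
            node = ET.SubElement(root, "record",
                                 source=source, timestamp=timestamp)
            for field, value in rec.items():
                ET.SubElement(node, field).text = value
    return root


# Hypothetical records already extracted by per-source adapters.
extracted = {
    "MySQL": [{"name": "Cafe One", "phone": "555-0100"}],
    "Excel": [{"name": "Cafe One", "phone": "555-0199"}],
}
merged = merge_to_xml(extracted, "2015-01-01T00:00:00")
print(ET.tostring(merged, encoding="unicode"))
```

Keeping the source label on every record, rather than on the document as a whole, is what allows conflicting values for the same business to be compared later.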

3.3. Auditing Rules

3.3.1. Auditing Rules

In the heterogeneous data integration model, a pool of audit rules is built to store a series of auditing rules, such as the redundancy, sparsity, and accuracy rules defined in this paper.

(1) Redundancy Auditing Rules. Data redundancy refers to the number of occurrences of the same record. The redundancy audit deletes the repeated records in the original data by a certain algorithm. The main steps are as follows.

Step 1. Calculate the redundancy of the dataset, and mark the repeated records in the dataset.

Step 2. Back up and then delete the records marked as repeated.

(2) Sparsity Auditing Rules. Sparsity refers to the percentage of vacant fields in a dataset record. The sparsity audit aims to reduce the sparsity of the data so as to avoid wasting physical space. The main steps are as follows.

Step 1. Calculate the rate of empty fields in the dataset, and tag the empty fields.

Step 2. Match and fill the empty fields according to the integrated data.

(3) Accuracy Auditing Rules. Accuracy measures the difference between the local data and the real data. The smaller the difference is, the higher the accuracy is. The main steps are as follows.

Step 1. Use consolidated results to repair the local data, and then generate the repaired data.

Step 2. Analyze data accuracy by matching the original data with the repaired data and the data after manual revision.
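The redundancy and sparsity audit steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: repeated records are detected by exact equality rather than by the similarity measure of Section 4, empty fields are represented by `None` or an empty string, and local records are matched to the integrated data on a single key field (`key_field`), which is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[str]]


def audit_redundancy(records: List[Record]) -> Tuple[List[Record], List[Record]]:
    """Mark exact repeats (Step 1), then remove them and keep a backup (Step 2)."""
    seen = set()
    kept: List[Record] = []
    backup: List[Record] = []
    for rec in records:
        key = tuple(sorted(rec.items()))  # hashable fingerprint of the record
        if key in seen:
            backup.append(rec)   # repeated record: moved to the backup
        else:
            seen.add(key)
            kept.append(rec)
    return kept, backup


def audit_sparsity(records: List[Record], integrated: List[Record],
                   key_field: str) -> List[Record]:
    """Fill empty fields from the integrated data, matched on `key_field`."""
    reference = {r[key_field]: r for r in integrated}
    repaired = []
    for rec in records:
        ref = reference.get(rec[key_field], {})
        # Keep existing values; fill only fields that are empty.
        repaired.append({f: (v if v else ref.get(f)) for f, v in rec.items()})
    return repaired


local = [
    {"name": "Cafe One", "phone": None},
    {"name": "Cafe One", "phone": None},
    {"name": "Tea Bar", "phone": "555-0101"},
]
integrated = [{"name": "Cafe One", "phone": "555-0100"}]

kept, backup = audit_redundancy(local)
repaired = audit_sparsity(kept, integrated, "name")
print(repaired)
```

Running the redundancy audit first means each empty field is looked up only once per distinct record.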

3.3.2. Active Learning Principles

The audit rules in the pool follow certain principles of active learning, such as priority of time, priority of amount, priority of history, and priority of artificial rectification.

Priority of time means that, in the integration model of multiple data sources, the most recently updated source is used as the standard for repairing the local data. Priority of amount means that the majority is taken as the criterion; that is, when most of the data sources report the same value for the same datum, the local data are revised accordingly. Priority of history means that we determine which data source is relatively accurate by reading the historical records of repair and then take that source as the basis for future repairs. Artificial rectification refers to human participation in the process.
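Two of these principles, priority of time and priority of amount, can be illustrated with a short sketch. The `Observation` type, the function names, and the sample values are hypothetical; the sketch assumes that each source reports one value with an ISO 8601 update date (so that string order equals time order) and that the list of observations is not empty.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Observation(NamedTuple):
    source: str
    value: str
    updated: str  # ISO 8601 date; lexicographic order equals time order


def by_time(observations: List[Observation]) -> str:
    """Priority of time: take the value from the most recently updated source."""
    return max(observations, key=lambda o: o.updated).value


def by_amount(observations: List[Observation]) -> Optional[str]:
    """Priority of amount: take the value reported by a strict majority, if any."""
    counts = Counter(o.value for o in observations)
    value, votes = counts.most_common(1)[0]
    return value if votes > len(observations) / 2 else None


observations = [
    Observation("site_1", "555-0100", "2014-03-01"),
    Observation("site_2", "555-0100", "2014-05-01"),
    Observation("site_3", "555-0199", "2014-06-01"),
]
print(by_time(observations))    # 555-0199
print(by_amount(observations))  # 555-0100
```

The example is chosen so that the two principles disagree, which is the situation in which priority of history or artificial rectification would be needed to decide.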

4. Data Quality Assessment

4.1. Related Definitions

As far as the performance of the feedback audit process is concerned, relevant definitions for quantitative analysis are as follows.

Definition 1.

Record $R = (A_{1}, A_{2}, A_{3}, \dots, A_{n})$.

Definition 2.

Record set $Rs = (R_{1}, R_{2}, R_{3}, \dots, R_{m})$.

Definition 3.

Record-attribute matrix

\begin{matrix} RA = \begin{array}{c|cccc} & A_{1} & A_{2} & \dots & A_{n} \\ \hline R_{1} & 0 & a & \dots & j \\ R_{2} & b & 0 & \dots & g \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ R_{m} & e & 0 & \dots & k \end{array} . \end{matrix}

(1)

Herein, $R_{i}$ stands for a record, $A_{i}$ for an attribute of the record, and $RA$ for the record-attribute matrix; a field of a record is empty when the corresponding element value is 0.

4.2. Performance Parameters

Local data are repaired through feedback audit and then compared with the actual data which have been repaired and verified manually. The following parameters of performance are generated: redundancy, sparsity, and accuracy.

4.2.1. Redundancy

The redundancy indicator in this paper is computed from the similarity between the attribute values of records. Attribute values are either of numeric type or of character type (English characters, Chinese characters). The core theoretical foundation is the similarity between Chinese phrasal texts, which measures the similarity between Chinese attribute values. Following the Chinese phrase text similarity calculation method, we define the similarity of Chinese text phrases A and B as $\mathrm{similarity}\langle A, B\rangle$, whose value lies in $[0,1]$. Redundancy, namely, the similarity between records, is then obtained from the similarity of their attributes.

We define the similarity between records $R_{i}$ and $R_{j}$ as $\mathrm{similarity}\langle R_{i}, R_{j}\rangle$, and

\begin{matrix} \mathrm{similarity}\langle R_{i}, R_{j}\rangle = \sum_{k = 1}^{n} \mathrm{similarity}\langle A_{k}(R_{i}), A_{k}(R_{j})\rangle ; \end{matrix}

(2)

herein, $\mathrm{similarity}\langle A_{k}(R_{i}), A_{k}(R_{j})\rangle$ describes the similarity between $R_{i}$ and $R_{j}$ in field $A_{k}$, and its value falls in $[0,1]$. When the fields are of the same value, the similarity is 1; when the fields are of totally different values, the similarity is 0; hence, $\mathrm{similarity}\langle R_{i}, R_{j}\rangle \in [0, n]$. We consider records $R_{i}$ and $R_{j}$ to be the same record when $\mathrm{similarity}\langle R_{i}, R_{j}\rangle \in [n - 1, n]$.

Once the similarity has been worked out, the repetition rate between $R_{i}$ and $R_{j}$ can be defined as $D\langle R_{i}, R_{j}\rangle$, and

\begin{matrix} D\langle R_{i}, R_{j}\rangle = D\langle R_{j}, R_{i}\rangle = \begin{cases} 0 & (\mathrm{similarity}\langle R_{i}, R_{j}\rangle \notin [n - 1, n]) \\ 1 & (\mathrm{similarity}\langle R_{i}, R_{j}\rangle \in [n - 1, n]) . \end{cases} \end{matrix}

(3)

The repetition rate of record $R_{i}$ can likewise be defined as $DR_{i}$, with initial value 0; according to the description above, whenever $D\langle R_{i}, R_{j}\rangle = 1$ $(j = i + k, k = 1, 2, 3, \dots)$, $DR_{i}$ is increased by 1, that is, $DR_{i} = DR_{i} + 1$.

Then the redundancy of the record set is

\begin{matrix} PRD = \frac{\sum_{i = 1}^{M} DR_{i}}{M} . \end{matrix}

(4)

Here $M$ represents the total number of records ($m$ in Definition 2), and $\sum_{i = 1}^{M} DR_{i}$ represents the number of duplicate records.
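Equations (2) to (4) can be sketched in Python as follows. The paper's Chinese phrase text similarity is not specified in detail, so `field_similarity` below is a stand-in that returns 1.0 for equal values and 0.0 otherwise; the function names and the sample records are likewise assumptions. The sketch follows the definition of $DR_{i}$ literally, so every matching pair $(i, j)$ with $j > i$ is counted once.

```python
from typing import List, Sequence


def field_similarity(a: str, b: str) -> float:
    """Stand-in for the phrase similarity in [0, 1]: exact match only."""
    return 1.0 if a == b else 0.0


def record_similarity(r1: Sequence[str], r2: Sequence[str]) -> float:
    """Equation (2): sum of the field similarities over all n attributes."""
    return sum(field_similarity(a, b) for a, b in zip(r1, r2))


def redundancy(records: List[Sequence[str]]) -> float:
    """Equations (3) and (4): PRD = (sum of DR_i) / M."""
    m = len(records)
    n = len(records[0])
    dr = [0] * m
    for i in range(m):
        for j in range(i + 1, m):
            # D<R_i, R_j> = 1 when the similarity lies in [n - 1, n].
            if record_similarity(records[i], records[j]) >= n - 1:
                dr[i] += 1
    return sum(dr) / m


records = [
    ("Cafe One", "555-0100", "Main St"),
    ("Cafe One", "555-0100", "Main St"),
    ("Tea Bar", "555-0101", "Oak Ave"),
    ("Noodle Hut", "555-0102", "Elm Rd"),
]
print(redundancy(records))  # 0.25
```

Note that, with the threshold $[n - 1, n]$, two records that differ in exactly one field are still treated as the same record.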

4.2.2. Sparsity

(1) Sparsity of a single record:

\begin{matrix} {R_{i}}_{PND} = \frac{K_{R_{i}}}{N} . \end{matrix}

(5)

$K_{R_{i}}$ represents the number of elements whose value is 0 in record $R_{i}$, and $N$ represents the total number of attributes.

(2) Sparsity of the record set:

\begin{matrix} PND = \frac{\sum_{i = 1}^{M} K_{R_{i}}}{N \times M} . \end{matrix}

(6)

$N$ represents the total number of attributes and $M$ the total number of records, corresponding to the numbers of attributes and records in matrix $RA$.
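As a small worked example of Equations (5) and (6), the sketch below computes both sparsity values for a 3 × 3 record-attribute matrix in which the element 0 marks an empty field, as in Definition 3. The function names and the sample matrix are assumptions made for illustration.

```python
from typing import List, Sequence, Union

Cell = Union[int, str]


def record_sparsity(row: Sequence[Cell]) -> float:
    """Equation (5): share of empty (0) elements in one record."""
    return sum(1 for v in row if v == 0) / len(row)


def set_sparsity(matrix: List[Sequence[Cell]]) -> float:
    """Equation (6): share of empty (0) elements in the whole matrix."""
    m = len(matrix)       # number of records, M
    n = len(matrix[0])    # number of attributes, N
    zeros = sum(1 for row in matrix for v in row if v == 0)
    return zeros / (n * m)


ra = [
    [0, "a", "j"],
    ["b", 0, "g"],
    ["e", 0, "k"],
]
print(record_sparsity(ra[0]))  # one empty field out of three
print(set_sparsity(ra))        # three empty fields out of nine
```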

4.2.3. Accuracy

Accuracy of data refers to the similarity between corresponding fields, obtained by comparing the data before and after repair with the manually revised data. According to phrase text similarity and the matrix multiplication algorithm, let

\begin{matrix} A = (a_{ij}) = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix} \end{matrix}

(7)

be the $m \times n$ record-attribute matrix before repair,

\begin{matrix} B = (b_{ij}) = \begin{bmatrix} b_{11} & b_{12} & \dots & b_{1n} \\ b_{21} & b_{22} & \dots & b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ b_{m1} & b_{m2} & \dots & b_{mn} \end{bmatrix} \end{matrix}

(8)

be the $m \times n$ record-attribute matrix after repair, and

\begin{matrix} C = (c_{ij}) = \begin{bmatrix} c_{11} & c_{12} & \dots & c_{1n} \\ c_{21} & c_{22} & \dots & c_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ c_{m1} & c_{m2} & \dots & c_{mn} \end{bmatrix} \end{matrix}

(9)

be the $m \times n$ record-attribute matrix after manual verification.

Define $S$ as the similarity between two such matrices; it serves as the measure of accuracy of a record set, so the higher the similarity, the higher the accuracy. The matrix similarity and the resulting accuracy are computed as follows:

\begin{array}{l} S_{AC} = \mathrm{similarity}\langle A, C\rangle \\ = \mathrm{similarity}\langle a_{11}, c_{11}\rangle + \mathrm{similarity}\langle a_{12}, c_{12}\rangle \\ + \dots + \mathrm{similarity}\langle a_{mn}, c_{mn}\rangle \\ S_{BC} = \mathrm{similarity}\langle B, C\rangle \\ = \mathrm{similarity}\langle b_{11}, c_{11}\rangle + \mathrm{similarity}\langle b_{12}, c_{12}\rangle \\ + \dots + \mathrm{similarity}\langle b_{mn}, c_{mn}\rangle \\ PAD = \frac{S_{XC}}{N \times M}, \quad X \in \{A, B\} . \end{array}

(10)

Herein, $S_{AC}$ represents the accuracy of A relative to C, and $S_{BC}$ represents the accuracy of B relative to C. A higher value always means a higher accuracy.

According to Definition 3, let the data matrix before restoration be $RA_{old}$, the repaired data matrix be $RA_{new}$, and the manually restored matrix be $RA_{repair\_artificial}$.

Then, accuracy of the data before restoration is

\begin{array}{l} PAD_{old} = \frac{S_{RA_{old} / RA_{repair\_artificial}}}{N \times M} \\ = \frac{\mathrm{similarity}\langle RA_{old}, RA_{repair\_artificial}\rangle}{N \times M} . \end{array}

(11)

Accuracy of the data after restoration is

\begin{array}{l} PAD_{new} = \frac{S_{RA_{new} / RA_{repair\_artificial}}}{N \times M} \\ = \frac{\mathrm{similarity}\langle RA_{new}, RA_{repair\_artificial}\rangle}{N \times M} . \end{array}

(12)
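Equations (11) and (12) can be checked on a toy example. As before, `field_similarity` is an exact-match stand-in for the phrase similarity used in the paper, and the three 2 × 2 matrices are invented for illustration only; they are not taken from the experiment.

```python
from typing import List, Sequence, Union

Cell = Union[int, str]
Matrix = List[Sequence[Cell]]


def field_similarity(a: Cell, b: Cell) -> float:
    """Stand-in for the phrase similarity in [0, 1]: exact match only."""
    return 1.0 if a == b else 0.0


def matrix_similarity(x: Matrix, c: Matrix) -> float:
    """Equation (10): element-wise similarities summed over the matrix."""
    return sum(field_similarity(a, b)
               for row_x, row_c in zip(x, c)
               for a, b in zip(row_x, row_c))


def accuracy(x: Matrix, c: Matrix) -> float:
    """Equations (11) and (12): similarity to the manual matrix over N x M."""
    m, n = len(c), len(c[0])
    return matrix_similarity(x, c) / (n * m)


ra_old = [["Cafe One", 0], ["Tea Bar", "555-0101"]]              # before repair
ra_new = [["Cafe One", "555-0100"], ["Tea Bar", "555-0101"]]     # after repair
ra_manual = [["Cafe One", "555-0100"], ["Tea Bar", "555-0111"]]  # manual check

print(accuracy(ra_old, ra_manual))  # 0.5
print(accuracy(ra_new, ra_manual))  # 0.75
```

In this toy case the repair fills one empty field correctly, which raises the accuracy from 0.5 to 0.75, while the field that disagrees with the manual revision still counts against both matrices.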

5. Case Studies

5.1. Experimental Environment

This experiment uses a medium-sized local life service dataset (http://www.dongway.com.cn) and several other datasets in the same field to test the algorithm. The local data collection mainly contains information from the catering industry, such as businesses, cuisines, and customer behaviors. It covers more than 28000 businesses and nearly 15000 users. These datasets are representative and have a certain influence on the industry.

The experiment selects a medium-sized subset from these datasets, which mainly contains the basic information of the merchants, such as business name, phone number, address, latitude and longitude, main business projects, and other information.

5.2. Experiment Process

The experiment collects data from other sites and takes these heterogeneous data sources as a basis for the repair of local dataset. The experiment process is as follows.

Step 1. Select and format local dataset which needs to be fixed according to the data bus rules.

Step 2. According to the integration rules, gather from multiple data sources on the Internet the data objects that correspond to the local data.

Step 3. Integrate and format the dataset gathered in Step 2.

Step 4. Repair the local data according to the auditing rules; the important repair parameters of the dataset are redundancy, sparsity, and accuracy. Redundancy and sparsity are compared between the data before and after the repair, and accuracy is obtained by comparing the data before and after repair with the manually restored data.

5.3. Experimental Findings and Analysis

5.3.1. Redundancy Analysis

The experiment compared the number of redundant records in the original data and in the repaired data; the results are listed in Table 1.

Table 1

Redundant records before and after repair.

Records	Before repair	After repair
1000	11	7
2000	19	12
3000	39	27
4000	65	52
5000	98	78
6000	121	111
7000	175	141
8000	248	202
9000	307	280
10000	382	347

On the basis of the definition of the redundancy parameter in the previous section, the redundancy trend is shown in Figure 4.

Figure 4

Redundancy PRD trend.

In Figure 4, PRD_b represents the redundancy of the dataset before repair and PRD_a represents the redundancy of the dataset after repair. The figure shows that the redundancy of the dataset tends to become higher as the number of records increases. For every dataset size, the redundancy after the repair audit is lower than before, which indicates that the audit improves data quality.
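The PRD values can be recomputed from Table 1 with Equation (4). The sketch below assumes that the two count columns of Table 1 give the number of duplicate records $\sum DR_{i}$ before and after repair, and that the first column is used as the total number of records $M$ in both cases; these readings of the table are assumptions, not statements from the paper.

```python
# (records, duplicates before repair, duplicates after repair), from Table 1
TABLE_1 = [
    (1000, 11, 7),
    (2000, 19, 12),
    (3000, 39, 27),
    (4000, 65, 52),
    (5000, 98, 78),
    (6000, 121, 111),
    (7000, 175, 141),
    (8000, 248, 202),
    (9000, 307, 280),
    (10000, 382, 347),
]


def prd(duplicates: int, total: int) -> float:
    """Equation (4): redundancy as duplicates divided by total records."""
    return duplicates / total


rows = [(m, prd(before, m), prd(after, m)) for m, before, after in TABLE_1]
for m, prd_b, prd_a in rows:
    print(f"{m:>6}  PRD_b={prd_b:.4f}  PRD_a={prd_a:.4f}")
```

Under these assumptions, PRD_a is below PRD_b in every row; for 10000 records, for instance, the values are 382/10000 = 0.0382 before repair and 347/10000 = 0.0347 after repair.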

5.3.2. Sparsity Analysis

Data sparsity describes the lack of useful data in a dataset; it is a problem of estimating a sparse multidimensional vector. In our experiment, data sparsity is calculated from the transformed matrix: we count the number of 0 elements in the matrix to measure the sparsity of the dataset. The resulting sparsity trend is shown in Figure 5.

Figure 5

Sparsity PND trend.

In Figure 5, PND_b represents the sparsity of the original data, PND_a3 the sparsity of the dataset repaired with 3 heterogeneous data sources, and PND_a5 the sparsity of the dataset repaired with 5 heterogeneous data sources. The results show that the data sparsity decreases significantly after repair with the audit algorithm. Moreover, the data sparsity gradually decreases as the number of heterogeneous data sources increases.

5.3.3. Accuracy Analysis

The manually revised data are compared with the data before and after the algorithmic audit; the resulting accuracy trend is shown in Figure 6.

Figure 6

Accuracy PAD trend.

In Figure 6, PAD_b represents the accuracy of the original data, PAD_a3 the accuracy of the dataset repaired with 3 heterogeneous data sources, and PAD_a5 the accuracy of the dataset repaired with 5 heterogeneous data sources. The results show that the data accuracy improves significantly after repair with the audit algorithm. Moreover, the data accuracy gradually increases as the number of heterogeneous data sources increases.

6. Conclusions

How to optimize data quality and evaluate the effect is a challenging problem. In this paper, we proposed a multisource retrospective audit method for data quality optimization and evaluation. Firstly, in order to locate the original data source and match the data, we proposed a heterogeneous data integration model based on retrospective audit, which is divided into original heterogeneous data, data structure, data processing, and data retrospective audit. Secondly, a retrospective audit model and associative audit rules were proposed to improve the quality of data integrated from multiple heterogeneous data sources. Finally, we defined assessment criteria such as redundancy, sparsity, and accuracy to evaluate the effect of the optimized data quality. Experimental results show that the model works well and that the quality of the integrated data is significantly improved. In the future, this work can be extended to other ways of optimizing and evaluating data quality; in particular, further research is needed on a general framework for optimizing the parameters of the heterogeneous data integration model based on traceable audit.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

This paper is partly supported by the National Science Foundation of China (Grant nos. 61472132, 61370226, 61472131, and 61300218).

References

1. Mahmoud H. A. Schema clustering and retrieval for multi-domain pay-as-you-go data integration systems [dissertation]. Ontario, Canada: Waterloo University; 2010.

2. Haas L. M., Miller R. J., Niswonger H. M., Tork, Schwarz R. P. M., Wimmers E. L. Transforming heterogeneous data with database middleware: beyond integration. IEEE Data Engineering Bulletin. 1999;22(1):31–36.

3. Wei, Haohong, Skit. Fully associative heterogeneous data integration system research and design. Microcomputer Information. 2011;7:156–158.

4. Tao, Xue, Peng. Ontology-based birth defects related to medical knowledge management systems. Application Research of Computers. 2011;28(3):1007–1011.

5. Long X. Q., Dai M. H., H. Q. Implementation of real-time data warehouse. Computer Systems & Applications. 2010;19:178–183.

6. Han W. H., Jia, Yang S.-G. TB level mass data real-time loading technology research and implementation. Computer Research and Development. 2009;1:405–412.

7. Huang, Wan. Model analysis of data integration of enterprises and E-commerce based on ODS. International Federation for Information Processing. 2008;254:275–282.

8. Torlone. Two approaches to the integration of heterogeneous data warehouses. Distributed and Parallel Databases. 2008;23(1):69–97. doi: 10.1007/s10619-007-7022-z.

9. Feuerlicht, Pokorny, Richta. Integration of weakly heterogeneous semi-structured data. Information Systems Development. 2009;10:69–79.

10. Pellegrino D. A., Jr. Interactive visualization systems and data integration methods for supporting discovery in collections of scientific information [dissertation]. Philadelphia, Pa, USA: Drexel University; 2011.

11. Bikakis, Tsinaraki, Gioldasis, Stavrakantonakis, Christodoulakis. The XML and semantic web worlds: technologies, interoperability and integration: a survey of the state of the art. Semantic Hyper/Multimedia Adaptation. Vol. 418. Berlin, Germany: Springer; 2013. pp. 319–360. (Studies in Computational Intelligence). doi: 10.1007/978-3-642-28977-4_12.

12. Lin, Chen X.-W. Heterogeneous data integration by tree-augmented naïve Bayes for protein-protein interactions prediction. Proteomics. 2013;13(2):261–268. doi: 10.1002/pmic.201200326.

13. Jing H. C., Zhang, Meng, Yang. Research of data tree model in coal mine heterogeneous database integration. Applied Mechanics and Materials. 2013;263–266:312–315. doi: 10.4028/www.scientific.net/AMM.263-266.312.

14. Iskar, Zeller, Zhao X. M., van Noort, Bork. Drug discovery in the age of systems biology: the rise of computational approaches for data integration. Current Opinion in Biotechnology. 2012;23(4):609–616. doi: 10.1016/j.copbio.2011.11.010.

15. Bizer, Boncz, Brodie M. L., Erling. The meaningful use of big data: four perspectives - four challenges. ACM SIGMOD Record. 2011;40(4):56–60. doi: 10.1145/2094114.2094129.

16. Freitas, Curry, Oliveira J. G., O'Riain. Querying heterogeneous datasets on the linked data web: challenges, approaches, and trends. IEEE Internet Computing. 2012;16(1):24–33. doi: 10.1109/MIC.2011.141.