Abstract
Master data has emerged as one of the most potent instruments for guaranteeing adequate levels of data quality. The main contribution of this paper is a data quality model to guide repeatable and homogeneous evaluations of the level of data quality of master data repositories. This data quality model follows several international open standards: ISO/IEC 25012, ISO/IEC 25024, and ISO 8000-100, enabling compliance certification. A case study applying the data quality model to an organizational master data repository has been carried out to demonstrate its applicability.
Keywords
Introduction
With the growing need to share data to empower companies, organizations, and societies to be more competitive and sustainable (e.g. the European Data Strategy¹), data must be propagated within and outside organizations. To avoid the possible propagation of errors and to assure the achievement of the expected benefits, it is more necessary than ever to monitor the levels of quality of the data repositories by strategically enabling regular data quality evaluation cycles and, if feasible, granting data quality certifications to maximize the trustworthiness and usability of organizational data (Gualo
Implementation of master data initiatives creates a unified and internally and externally shareable vision of the critical business data elements and associated standard business rules (Fan
However, dealing with poor data quality in the master data repositories is still one of the biggest challenges in master data management (Benkherourou and Bourouis, 2022; Ibrahim
To systematically and rigorously conduct the corresponding data quality evaluation and certification projects, it is possible to use the international standards ISO/IEC 25012 (ISO/IEC, 2008), ISO/IEC 25024 (ISO/IEC, 2015), and ISO/IEC 25040 (ISO/IEC, 2011). In this sense, we posit that organizations can benefit from using the framework we developed based on these international standards, conveniently tailoring a data quality model for master data repositories. This repository type has specific features that differentiate it from transactional data repositories, making it unique and justifying a particular investigation on master data quality (see Section 2.1). In this paper, considering these differences and based on lessons learned in our previous experience, we propose a
The remainder of the paper is structured as follows: Section 2 briefly reviews concepts about master data management, data quality evaluation, and ISO 8000-100 series. Section 3 describes the data quality model for master data quality evaluation and certification. Section 4 introduces a case study, also showing some learned lessons. Section 5 identifies some threats to the validity of our proposal, and finally, Section 6 presents some conclusions.
Literature Review
Master Data and Master Data Management Foundations
According to ISO/IEC 25024 (ISO/IEC, 2015), master data is “
Master data differs from any other types of data in these four aspects (Hüner
Transactional data typically use master data as a reference. The implementation of master data management systems typically includes four main components (Haneem
Data dictionaries contain the semantic definition of terms for master data, providing the business and technical description of the master data through metadata and valid values (reference data) for master data attributes. Reference data can be defined as a list of acceptable values (e.g. a list of currency or nation codes such as those gathered in ISO 4217 or ISO 3166) or by using procedures to build these valid values (e.g. regular expressions to validate an email address). To ensure that master data attributes (especially identifiers) take the correct values from the set of valid values, syntactic and semantic codification should be enabled, as explained in ISO 8000-110 (ISO, 2009). This codification is required to identify a specific master entity within a dataset unambiguously (Berson and Dubov, 2011), when it is necessary to anonymize the information for personal protection purposes, or even when it is required to cipher information for security purposes (Piedrabuena
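The two ways of defining reference data mentioned above (a closed list of acceptable values, or a procedure that recognizes valid values) can be illustrated with a minimal sketch. The names `VALID_CURRENCIES`, `EMAIL_PATTERN`, `is_valid_currency`, and `is_valid_email` are hypothetical and do not come from the evaluated repository. The currency list is only an excerpt, and the email pattern is deliberately simplified.

```python
import re

# Hypothetical excerpt of reference data: a closed list of ISO 4217 codes.
VALID_CURRENCIES = {"EUR", "USD", "GBP", "JPY"}

# Simplified illustrative pattern; it is not a complete email validator.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+")


def is_valid_currency(code: str) -> bool:
    """Reference data as a list: the value must belong to the list."""
    return code in VALID_CURRENCIES


def is_valid_email(value: str) -> bool:
    """Reference data as a procedure: the value must match the pattern."""
    return EMAIL_PATTERN.fullmatch(value) is not None
```

In a real data dictionary, the list and the pattern would be stored as metadata of the corresponding master data attribute rather than hard-coded.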
However, master data is much more than technology (Haneem
Management of Quality of Master Data Repositories
Organizations must face many data quality concerns, like inconsistencies in data definitions, data formats, and values or lack of understanding of the semantics of the data definitions (Cleven and Wortmann, 2010). Master data management is instrumental in addressing these data quality problems and concerns (Dahlberg
As master data are still data (Allen and Cervo, 2015), it could be said that the body of knowledge on data quality management for regular (i.e. transactional) data is a good starting point for studying the quality of master data, but taking into consideration the unique features of master data shown in Section 2.1. However, even when several works have dealt explicitly with how master data management has helped organizations to solve data quality issues (Silvola
Master data quality evaluation is also closely associated with data quality dimensions or characteristics (Gualo
Data quality dimensions/characteristics for master data in the literature.
In addition to these works, given its benefits, we posit that the most convenient way to deal with the selection of the data quality characteristics is to use ISO/IEC 25012 (ISO/IEC, 2008) (see Table 2 – Type “I” stands for “
Based on these standards, we developed a complete framework – initially proposed by Merino (2017) and later revised and improved by Gualo
Data Quality characteristics in ISO/IEC 25012.
Part of the definition of data quality measurements involves the identification of business rules describing the fitness for use of a data record (for example, defining a list of valid values for attributes). Section 2.1 raised the importance of data dictionaries as typical containers of the set of valid values for the golden records in the master data repository. For instance, data dictionaries can help organizations in the banking or insurance domain to adopt the Global Legal Entity for Identifiers (GLEI²) to build and use specific identifiers for their partners, or ISO 21586 for their related products, or help a manufacturing company that has decided to adopt the Global Trade Item Number (GTIN) to codify the identifiers for its products (Hüner
Evaluating data quality properties requires identifying, validating, and grouping the business rules (Caballero
Overview of the ISO 8000-100 series.
This investigation is motivated by some organizations’ need to evaluate the quality of the master data repositories. Based on our experience in previous data quality evaluations and certification projects (Gualo
This revision generates a master data quality model agnostic to (1) the domain of application of the master data (e.g. manufacturing, healthcare) (Allen and Cervo, 2015); (2) the technology used to implement the master data repository, although relational technology is typically preferred (Berson and Dubov, 2011; Otto, 2015), and the technology used to build and deploy services to access the master data repository (Dreibelbis
Revisiting Data Quality Characteristics for Master Data
Since master data repositories are a particular type of data repository with specific business rules and a unique environment, we grounded our investigation on the works on the quality of regular data based on ISO/IEC 25012 and ISO/IEC 25024. In our investigation, we comprehensively analyse master data quality and consider the unique features that differentiate master data from other types of data (see Section 2.1) to provide interpretations of the effect of master data quality for business and IT. In the following paragraphs, we introduce some results of this analysis. The analysis starts by recalling the definition of each data quality characteristic according to ISO/IEC 25012, as presented in Table 2.
Business Rules for Master Data Inferred from ISO 8000-100 Series
This section presents the minimum set of business rules inferred from the study of the ISO 8000-100 series, to be used as a reference for master data quality evaluation and certification. To our knowledge, no other open standard (e.g. developed by ISO or IEEE) explicitly addresses the business rules to be met by master data. We found only one family of standards containing requirements for the exchange of characteristic data in master data (see Section 2.3) that can be considered as the source of mandatory business rules that master data repositories must comply with. Consequently, we consider only the normative clauses of the ISO 8000-100 series (mainly ISO 8000-110) to identify the mandatory business rules. For instance, Table 4 gathers one of the clauses extracted from ISO 8000-110, which we inspected to infer and state from sub-clauses a) and b) the following business rule: “
Example of clause extracted from ISO 8000-110.
Business rules inferred from ISO 8000-110.
It is essential to state that in data quality evaluation projects, these business rules are typically outside the development scope. So, as part of the evaluation, the discovered business rules must be mapped against the ones presented here by comparing the corresponding statements.
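The mapping step described above can be sketched as a simple coverage check: each discovered business rule is associated with one reference rule, and any reference rule left without a counterpart is reported. This is only an illustrative sketch. The identifiers `BR.01` to `BR.03` and the rule statements are hypothetical placeholders, not the rules actually inferred from ISO 8000-110.

```python
# Hypothetical identifiers standing in for the reference business rules.
REFERENCE_RULES = {"BR.01", "BR.02", "BR.03"}

# Hypothetical result of comparing statements: discovered rule -> reference rule.
discovered_to_reference = {
    "employee_id follows the documented identifier syntax": "BR.01",
    "every attribute has an entry in the data dictionary": "BR.02",
}


def unmatched_reference_rules(reference, mapping):
    """Return the reference rules that no discovered rule was mapped to."""
    return sorted(reference - set(mapping.values()))


missing = unmatched_reference_rules(REFERENCE_RULES, discovered_to_reference)
print(missing)
```

In practice, the comparison of statements is a manual, expert-driven task; a structure like this would only record its outcome and highlight gaps.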
Business rules inferred from ISO 8000-110 grouped by data quality characteristic.
The customization of the measurement of the properties of the master data quality model involves the development of the following components:
Define the
Obtaining the quality level of a DQ Characteristic from DQ properties measurements.
Measurement of ‘Master data understandability’ property adapted from ISO/IEC 25024.
Define the
Define the

Profile functions proposed to evaluate data quality properties characteristics.
Granting a certification is a formal statement by an independent third party that a set of characteristics, merits, or conditions of a fact or good is acceptably adequate. Certification requires an evaluation whose results are necessarily objective, absolute, repeatable, unambiguous, unbiased, and comparable without the need to explain the specific context in which they were obtained. The data quality evaluation framework can also be used to certify the level of quality of the master data repository. Following AENOR’s guidelines (as stated in Gualo
The certification process typically involves costs that include the evaluation and the management of the granting of the certification. Not all the data quality characteristics and properties introduced in Section 3.1 are susceptible to certification (some do not provide relevant information) or worthy of it (some change very quickly over time, or a cost/benefit analysis reveals that the benefits obtained are not sufficient). To optimize costs, we established a set of selection criteria to identify those that are susceptible to and worthy of certification. The criteria are:
C1. Eligible master data quality characteristics should provide relevant information on the data quality status of the master data repository so that they show
C2. Selected master data quality characteristics have the property(ies) that must rely on well-proven measurement methods to produce results
C3. Measurement methods for the chosen properties should rely on formal audit techniques to systematically limit or eliminate
C4. Since the intention is to develop a model that certifies the quality of master data concerning the requirements introduced in the ISO 8000-100 series of standards, following the philosophy introduced in this work, the established criterion is that
C5. We intend to develop a consistent, simple model with a minimal but sufficient number of data quality characteristics and properties. During the selection process, some eligible characteristics may be prioritized to the detriment of others. This prioritization is based on the
Table 8 shows a complete overview of the proposed certifiable subset of data quality characteristics for the evaluation of master data repositories once the criteria are applied.
Proposed characteristics for master data quality certification.
Once the certifiable data quality characteristics were selected, the next step was to identify the underlying data quality properties that can contribute notably to evaluating the data quality characteristics. For this second level of selection, we listed three criteria: (1) the measurement of the property must be done in an objective and repeatable way; (2) there must be specific business rules inferred from ISO 8000-100 that can be used as the basis for the measurement; and (3) the measurement of the property must be relevant and consistent to measure the data quality characteristics. Table 9 shows the data quality model with the list of certifiable data quality characteristics and the selected specific data quality properties.
Proposed data quality characteristics and properties for master data certification.
The primary purpose of the case study is to validate that the model is complete and minimal, that it contains all the data quality characteristics relevant to master data, and that it is usable in the real world. Thus, this section describes the application of the framework for master data evaluation to a real case study following the methodology based on ISO/IEC 25040, which is further described in Gualo
Description of the Data from the Evaluated Master Data Repository
The evaluated master data repository integrates several data sources of a software company (called the ‘Client’) to provide a 360° view of employees. In this sense, the master data repository is part of the organizational data lake, and the whole organization uses it through different departments with different goals. Unfortunately, we were not allowed to provide further information about the organization that owns the master data repository.
This master data repository and its data dictionary were implemented in a relational database management system (MySQL 5.7.6) to enable the SQL queries required to compute the data quality measures. The master data repository comprises 48 master data attributes with 59,387 records. Several triggers were implemented in the master data repository to track changes in the values of the master data records. Table 10 shows an excerpt of the most representative attributes, their descriptions, and an example of a valid value for each. An example of metadata for the attribute “
List of attributes, descriptions, and example values from the master data repository.
Example of metadata in data specification.
Excerpt of Data dictionary for the case study.
As part of
Since the evaluation of master data quality was intended to support a data quality certification, the business rules required for evaluating the selected data quality characteristics had to be mapped to the reference set of business rules identified in Table 6.
Matching between sets of mandatory and obtained business rules for the master data repository.
This action was done using the metadata and the provided study of the data model. For example, through the example metadata in Table 11, some business rules were identified: “
SQL script example to evaluate
In
During the execution of
In the execution of
SQL script example to evaluate
Results of the application of the measurement methods for the DQ properties.
These measurement values should then be analysed and discussed against the weaknesses and strengths previously identified for each data quality property, to confirm them and to identify critical aspects that can be improved. It is important to note that although the definition of the measurement method follows ISO/IEC 25024, its application requires customization in terms of the specific semantics of the master data entities under evaluation and the technological aspects of the system of which the evaluated data repository is part.
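To illustrate what such a customization can look like, the following sketch computes a ratio-style measure (records that satisfy a rule divided by records evaluated) with a SQL query. It is not the script used in the case study. It assumes Python's built-in `sqlite3` module in place of the MySQL instance, and a hypothetical `employee` table with an `email` column. The rule "the value is neither null nor empty" is likewise an assumption chosen for brevity.

```python
import sqlite3

# In-memory stand-in for the repository; table and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employee_id TEXT PRIMARY KEY, email TEXT)")
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [
        ("E001", "ana@example.org"),
        ("E002", None),
        ("E003", "luis@example.org"),
        ("E004", ""),
    ],
)


def filled_ratio(connection, table, column):
    """Ratio of records whose column is neither NULL nor empty."""
    # Identifiers cannot be bound as parameters: pass only trusted names.
    query = (
        f"SELECT SUM(CASE WHEN {column} IS NOT NULL AND {column} <> '' "
        f"THEN 1 ELSE 0 END), COUNT(*) FROM {table}"
    )
    satisfied, total = connection.execute(query).fetchone()
    return satisfied / total if total else 0.0


ratio = filled_ratio(conn, "employee", "email")
print(ratio)
```

The semantics of "satisfies the rule" is the part that must be tailored to each master data entity; the ratio itself stays the same.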

The value obtained for data quality properties during the evaluation of the master data repository.

Data quality level results for each data quality characteristic in the master data repository evaluation.
Finally,
In addition to the evaluation report, a comprehensive improvement report can be prepared and provided to the organization, detailing the weaknesses and strengths of each measured data quality property. This improvement report focuses on the properties that did not achieve an adequate level of quality and details the causes of those low levels so that the organization can take steps to improve them. In addition, the organization is provided with a set of scripts to identify the specific records that require improvement actions. Finally, to complete the evaluation, the access permissions to the master data repository and other assets needed for data quality evaluation are removed or revoked.
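A script of this kind could look like the following sketch, which lists the identifiers of the records whose value is missing from a list of reference data. It is an illustration under stated assumptions, not one of the delivered scripts. The `employee` and `country_ref` tables, their columns, and the sample rows are hypothetical, and `sqlite3` again stands in for the MySQL instance.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical reference data and master data tables.
conn.execute("CREATE TABLE country_ref (code TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE employee (employee_id TEXT PRIMARY KEY, country_code TEXT)"
)
conn.executemany("INSERT INTO country_ref VALUES (?)", [("ES",), ("PT",), ("FR",)])
conn.executemany(
    "INSERT INTO employee VALUES (?, ?)",
    [("E001", "ES"), ("E002", "XX"), ("E003", None), ("E004", "PT")],
)

# Records with no matching reference value, including NULL codes,
# are the ones that need an improvement action.
QUERY = """
    SELECT e.employee_id
    FROM employee AS e
    LEFT JOIN country_ref AS r ON e.country_code = r.code
    WHERE r.code IS NULL
    ORDER BY e.employee_id
"""

records_to_fix = [row[0] for row in conn.execute(QUERY)]
print(records_to_fix)
```

Returning record identifiers rather than an aggregate lets the organization route each offending record to the corresponding data steward.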
Explanations can be split into two parts to demonstrate the validity of the proposed work. On the one hand, the data quality evaluation framework has already been tested and validated (Gualo
In addition, after analysing the process of conducting the case study, we have reached a series of findings:
In the following subsections, the threats to the validity of this initial state of the framework are analysed based on the aspects identified (Runeson
Conclusions
Master data is one of the unique types of data in organizations. It holds the knowledge that the organization needs for its day-to-day operations, i.e. it is the reference for the transactional data. This capability of being referenced means that master data values are propagated through the organization as master data is used for other purposes, e.g. transactional purposes. Consequently, if the values stored in the master data repository do not have adequate levels of quality, then low-quality data are propagated through the entire organization and possibly externally.
In this paper, we presented a data quality model for evaluating the quality of master data repositories. This master data quality model can be used with the evaluation framework we developed, which has been successfully used to evaluate the quality of regular data repositories. Master data has specific features beyond those of regular data (i.e. transactional relational data) that make it necessary to investigate the particular concerns of master data quality. The tailoring of the data quality model involves the specific definition of the quality value range matrix and the corresponding profiling functions. As these elements are under the intellectual property protection of the owners of the data quality evaluation framework (they are exploiting the framework commercially), we have been allowed to show only examples of these elements.
Sometimes, granting a certification of the level of quality can improve the trustworthiness of a master data repository. Only some of the 15 data quality characteristics included in the data quality model are susceptible or worthy of certification. To limit the scope of certification and improve the usability of the results, we defined five criteria used to select data quality characteristics relevant to the business or the information systems if certified.
The results obtained either from evaluation or the certification process can be used to identify the most critical points or risks for a master data repository, enabling the guiding and optimization of the efforts in fixing errors and, consequently, making the master data an essential asset of the organization.
The most important conclusion we can draw is that the presented version of the data quality model is sufficiently complete and comprehensive to be used in master data quality evaluation and certification projects, is easily understandable, and, combined with the developed framework, is applicable in the real world.
Footnotes
European data strategy:
Global Legal Entity for Identifiers website:
eOTD | ECCMA OPEN TECHNICAL DICTIONARY website:
Acknowledgements
We would like to thank Natalia Sánchez for her impressive support in formatting the manuscript.
