Abstract
This study identifies essential components in the data collection process for public health information systems based on appraisal and synthesis of the reported factors affecting this process in the literature. Extant process assessment instruments and studies of public health data collection from electronic databases and the relevant institutional websites were reviewed and analyzed following a five-stage framework. Four dimensions covering 12 factors and 149 indicators were identified. The first dimension, data collection management, includes data collection system and quality assurance. The second dimension, data collector, is described by staffing pattern, skill or competence, communication and attitude toward data collection. The third, information system, is assessed by function and technology support, integration of different data collection systems, and device. The fourth dimension, data collection environment, comprises training, leadership, and funding. With empirical testing and contextual analysis, these essential components can be further used to develop a framework for measuring the quality of the data collection process for public health information systems.
Introduction
Public health information systems (PHIS), the government-recognized population-based data repositories, are essential for public health management and improvement. 1 PHIS provide nations with health-related data mainly required for monitoring, prevention, and control of diseases and other adverse health conditions. 2 Data in PHIS must be of sufficient quality to meet public health needs and worthy of data users’ trust.3–7 The World Health Organization (WHO) has introduced a generic Data Quality Assessment Framework (DQAF) developed by the International Monetary Fund. 4 The WHO has emphasized that data quality assessment should not only describe the quality status of data but also enable identification of the causes of data quality problems.4,8 The process of data collection is an essential element of data quality. It includes the generation, assembly, description, and maintenance of data, all of which should be of high quality.9,10 Although data quality problems originating from the data collection process have been found frequently, research on this topic remains underdeveloped.7,11
To date, the assessment of the quality of the data collection process in PHIS has been neither well considered nor routinely conducted. 9 Quality improvement efforts have focused on assessing the quality of data that have already been captured and stored.5,9,12 Data quality assessment is mainly focused on the identification and evaluation of the attributes of data quality, including accuracy, completeness, and timeliness of data.9,13,14 One reason for the lack of attention to the quality of the data collection process could be insufficient clarification of the essential components of data collection. For example, our review of PHIS data quality assessment showed that only 2 (5%) of 39 studies specified an explicit definition of the quality of data collection. 9 A variety of quality criteria for data collection were introduced, such as data accuracy, data integrity, minimum response burden for data provider practices, and the relevance, simplicity, and layout of the data collection tools.5,7,15,16
Data collection is a systematic data gathering process, 15 which includes a set of interrelated or interacting activities contributing to the process of transforming inputs into outputs. 17 Organizational, technical, and behavioral factors can affect the performance of the data collection process for the PHIS.7,11,13,18–20 They may “take the form of defects in organizational procedures, faulty logic, and reasoning, or human error that result in compromised performance.” 21 In total, 16 of the 39 implementation studies reviewed in our earlier study, instead of taking a comprehensive picture of the entire process, were centered around some data collection procedures, such as data recording and storage, and on quality control mechanisms. 9 The unsystematic knowledge about the key factors influencing the quality of the data collection process has impaired the effectiveness and efficiency of data-driven monitoring and performance evaluation mechanisms for public health programs.7,11,21,22
An interesting question is: what exactly are the components that ensure the quality of the data collection process? Researchers have conducted some exploration in this area. At the organizational level, structure, resources, procedures, support services, and culture in an organization can all influence the process quality of data collection.13,18–20 However, operational definitions or measurements for these factors have yet to be established. At the technical level, the design of electronic data collection forms and the integration of different information systems are important mechanisms. However, technological advancement alone does not always lead to high-quality data.7,11,13,21 At the individual behavioral level, a data collector’s motivation and competence to perform a task, though often scrutinized through the lens of data users, have not been clarified in the context of the data collection process. 13 The lack of a comprehensive understanding of the contribution of these factors leads to challenges in assessing the quality of that process.
Such challenges hence pose the fundamental research question of this study which aims to identify the essential components of quality in the data collection process for PHIS.
Methods
Literature search
We searched peer-reviewed full-text English literature in medical and informatics electronic databases, including Scopus, CINAHL, Medline, PubMed, and Web of Science. The publication dates were from 1 January 2001 (since the principles and practices of PHIS were defined in the discipline of public health informatics) 23 to 31 December 2016. Search terms included words or phrases relating to data collection, process assessment, measurement, data quality, public health, and health information systems. The symbol “*” was used to include the variations of a word. A total of 172 publications were retrieved.
To improve the literature coverage, we further conducted a manual search of the literature and identified papers referenced by the selected publications. Prominent public health institutional websites were also searched, such as those of the WHO and the Australian Institute of Health and Welfare (AIHW). The authors’ research databases on data quality assessment were also searched. Another 52 publications from these sources were included. Two of the authors independently assessed the study quality. Any discrepancies were resolved through discussion and an informal consensus process. Further details are shown in Supplemental Appendix Table 1.
Selection of publications
Articles were assessed based on inclusion and exclusion criteria. The inclusion criteria were articles contributing significantly in the domain of quality of the data collection process in the PHIS and research topics related to the quality of the data collection process. The article types included empirical studies, reviews, guidelines, and work reports.
The exclusion criteria were publications that did not mention factors or components of the quality of the data collection process; those focusing on data use or only measuring the data stored in the PHIS; and those lacking clear definition or without evidence-based information. Editorials, notes, and letters were also excluded.
The above screening activities led to a selection of 107 articles eligible for further study quality evaluation.
We used the Critical Appraisal Skills Programme (CASP) tools to assess the reliability and validity of each selected study.24,25 The CASP tools provide a set of checklists for evaluation of study quality including the context, subjects, study design, research methods, data collection, data analysis, and conclusions.
We also considered (1) whether the concepts of data quality or the quality of the data collection process matched our understanding of these; (2) whether the cause of poor data quality arising from the data collection process was analyzed; and (3) how the factors contributing to the quality of data collection or data quality were measured. The studies that did not provide adequate information and rigorous research methods were excluded.
Eventually, 45 publications were selected for review. The publication selection and evaluation process is illustrated in Figure 1.

Figure 1. Publication screening process.
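The counts reported in the screening steps can be tallied as a quick consistency check. The sketch below is illustrative only: it assumes that the 52 publications from other sources were pooled with the 172 database hits before screening, which the text does not state explicitly, and the derived exclusion counts are arithmetic differences rather than figures reported by the authors. All variable names are hypothetical.

```python
# Counts taken from the text; variable names are illustrative, not the authors'.
database_hits = 172              # retrieved from electronic databases
other_sources = 52               # manual search, institutional websites, authors' databases
eligible_after_screening = 107   # passed inclusion and exclusion criteria
included_in_review = 45          # passed study quality appraisal

# Assumption: both sources were pooled before screening.
total_identified = database_hits + other_sources

# Derived values (not reported in the text).
excluded_at_screening = total_identified - eligible_after_screening
excluded_at_appraisal = eligible_after_screening - included_in_review

print("identified:", total_identified)
print("excluded at screening:", excluded_at_screening)
print("excluded at appraisal:", excluded_at_appraisal)
```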
Data extraction and analysis
A five-stage framework was followed for qualitative data extraction, processing, and analysis: 26
Stage 1—familiarization with data. Each article was thoroughly read to identify quality issues, concepts, and themes related to the data collection process. Relevant data from the selected studies were extracted and entered into an Excel spreadsheet to facilitate critical evaluation of the results. A total of 453 pieces of relevant text were pre-selected and recorded.
Stage 2—identification of a thematic framework. A process of shortening the extracted text while still preserving the core was conducted for condensing the relevant data. A constant comparison and aggregation process led to the abstraction of 149 first-level codes as indicators relating to the quality of the data collection process. Further comparison, aggregation, abstraction, and classification of the indicators generated 16 factors that were related to the quality of the data collection process. These factors were further abstracted using an approach based on general systems theory 27 and advice from public health experts. A four-dimensional (4D) thematic framework was developed, including data collection management, data collector, information system, and the data collection environment.
Stage 3—indexing and validation of the thematic framework. The process of constant comparison, aggregation, and classification was iterated repetitively. Data were rearranged per the appropriate dimension of the thematic framework to which they were likely to belong. Attempts were made to avoid duplication and overlap in semantics and refinement of paraphrasing within the framework. This process led to the reduction of factors from 16 to 12.
Stage 4—charting. The 12 factors were arranged into the appropriate dimension of the thematic framework to which they related. A chart was prepared.
Stage 5—mapping and interpretation. This stage is a process to map the nature and range of the concepts and factors. The associations between the factors were identified to create the typology of the framework. Each indicator was interpreted as either a facilitator or a barrier according to its direction of influence, positive or negative, on the quality of the data collection process. Eventually, the theoretical saturation was reached, and all extracted data were placed into the categories already created (see Table 1).
Table 1. Data for the 45 included studies.
WHO: World Health Organization; USCDC: US Centers for Disease Control; EMR: electronic medical record.
Results
The results from the qualitative data processing and analysis provided material for the identification of the essential components of quality in the data collection process for PHIS, including four dimensions that covered 12 factors (see Figure 2). The first dimension, data collection management, includes data collection system and quality assurance. The second dimension, data collector, is described by staffing pattern, skill/competence, communication, and attitude toward data collection. The third, information system, is assessed by function and technology support, integration of different data collection systems, and devices. The fourth, data collection environment, comprises training, leadership, and funding. The 12 factors are characterized by 149 indicators with either positive or negative impacts on the quality of the data collection process (see Supplemental Appendix Table 2).
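For readers who wish to operationalize the framework, the four dimensions and 12 factors can be represented as a simple lookup structure. The following is a minimal sketch, not the authors' instrument: the dimension and factor names come from the text, whereas the `Indicator` class and the three example indicators are hypothetical paraphrases of statements in this article, not entries copied from Supplemental Appendix Table 2.

```python
from dataclasses import dataclass

# Dimension -> factors, as named in the Results section.
FRAMEWORK = {
    "data collection management": [
        "data collection system",
        "quality assurance",
    ],
    "data collector": [
        "staffing pattern",
        "skill or competence",
        "communication",
        "attitude toward data collection",
    ],
    "information system": [
        "function and technology support",
        "integration of different data collection systems",
        "devices",
    ],
    "data collection environment": [
        "training",
        "leadership",
        "funding",
    ],
}


def dimension_of(factor: str) -> str:
    """Return the dimension that a factor belongs to."""
    for dimension, factors in FRAMEWORK.items():
        if factor in factors:
            return dimension
    raise KeyError(f"unknown factor: {factor}")


@dataclass(frozen=True)
class Indicator:
    """Hypothetical record for one indicator and its direction of influence."""
    description: str
    factor: str
    direction: str  # "facilitator" or "barrier"


# Illustrative indicators paraphrased from the article text (assumed wording).
EXAMPLES = [
    Indicator("high turnover of data collectors", "staffing pattern", "barrier"),
    Indicator("poorly designed data collection form", "data collection system", "barrier"),
    Indicator("automatic logic check at data entry", "function and technology support", "facilitator"),
]

factor_count = sum(len(factors) for factors in FRAMEWORK.values())
barriers = [i.description for i in EXAMPLES if i.direction == "barrier"]
```

A structure of this kind would allow indicators observed in a given setting to be grouped by factor and dimension before any scoring scheme is chosen.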

Figure 2. The components of quality in the data collection process for public health information systems.
Accuracy, completeness, and timeliness are the most frequently mentioned parameters of quality for evaluating the performance of the data collection process. These three parameters appeared in 24, 16, and 7 studies, respectively. Fourteen studies did not define data quality specifically. Reliability, data use, quality of service, and system quality were also addressed.
Data collection management
From an organizational perspective, data collection management is an administrative process by which data are acquired, validated, stored, protected, and processed.7,11 Effective management requires the application of knowledge, skills, tools, and techniques to data collection activities to meet data quality requirements. The ultimate goal of data collection management is to fulfill every requirement from data users.11,16 This entails providing sufficient supervision to personnel and conducting systematic process audits to ensure data quality. Of the 45 included articles, 32 assessed the quality of data collection management9,14,28,29,31–38,40,42,44,47,48,51,52,54–58,60–64,66,69,70 (see Table 1). Details of facilitators and barriers for data collection management are shown in Supplemental Appendix Table 2.
For the preservation of data integrity, data collection management needs to detect errors that have occurred in the data collection process. 16 Errors may be produced intentionally (i.e. deliberate falsification) or unintentionally (i.e. systematic or random errors). 16 In public health, data collection management primarily focuses on the procedures of data collection, storage, quality control, and data presentation for users.5,6 They are often presented in a format of guidelines or a set of policies to direct the execution of programs and guide the practice of parties involved in the process.5,56 We identified two major factors for data collection management in the context of the PHIS. They are data collection system and quality assurance.
A data collection system primarily comprises two sub-components: data collection form and data collection practice. Data collection form is the core component of data collection instruments. A poorly designed data collection form may impair data accuracy; 69 therefore, data collection form needs to be standardized, well defined, and structured. As one of the major concerns, standardization in data collection form can be facilitated by a series of tactics. 59 The format of the data collection form needs to be simple, standardized, and complete.28,32,55 The layout and order of data items of a form need to accord with the workflow of data capture or reporting for easy data entry and retrieval.33,34
Data collection practice should be well guided, conducted, and documented. It includes guideline development, documentation, data backup and security, selection of data collection methods, and a trial of a new process before implementation.38,41,44,54 A complete record of the data collection process in line with the workflow of data collection is recommended. 47
Quality assurance for data collection should be in place before the collection begins, and it should be focused on quality control.16,58 The function of quality assurance is to ensure that each process of data collection is traceable, accurate, and timely, and has integrity. Four factors could be utilized to assess the adequacy of a system for quality assurance. These are quality audit, fundamental responsibility, mechanisms for addressing data quality challenges, and a feedback loop.14,35,37,47 Designating a unit or individuals to monitor data quality and prevent data collection mismanagement is recommended.44,47,49,53,55 A veteran health worker register could remind data collectors to correct inaccurate data items,44,62 and provide additional training, supervision, and incentives. 31 Holding regular meetings with medical or clinical staff and a data registrar is useful to address missing or inconsistent data.32,57
Data collector
A data collector collects or supplies data for the PHIS. Of the 45 articles, 23 assessed the performance of data collectors14,29–31,33,35,38–40,44,46,47,49,51,55,57,59,60,62,66,68–70 (see Table 1). Details of facilitators and barriers for data collector are shown in Supplemental Appendix Table 2.
A data collector is a stakeholder with whom the data user should build and nurture a relationship. Data collectors’ performance was mainly related to data accuracy.44,47,55,57,66,69,70 The association between data quality and certain characteristics of the data collector, such as level of responsibility, level of work engagement, and sector of employment, was statistically quantified. 29 Four types of factors could directly or indirectly influence data collectors’ performance: the staffing pattern, their data collection skills or competence, communication with clients, and their attitude toward data collection. For example, data collector shortage and high turnover could impair data quality. 66 A data collector needs to have sufficient capability to conduct data collection activities, for example, understanding contextual information and having basic knowledge of the data elements to be collected.55,57 Proficient data collection skills and good communication with clients are ideal for a data collector.14,44 Mistakes often originated from data collectors’ attempts to simplify data collection tasks.11,30,39,66 Provision of training is regarded by higher authorities and upper management as a useful approach to improving data collectors’ capabilities, including fundamental medical knowledge and routine data management skills.35,57,70,71
Information system
An information system is a combination of hardware, software, infrastructure, and trained personnel. 5 Of the 45 articles, 30 assessed the quality of information systems14,28,33–36,38,40–45,47,48,51–53,55,57,58,60–62,64–67,69,70 (see Table 1). A regular system of data quality checks may be more cost effective and reliable to ensure data quality. 51
Characteristics of information systems in PHIS are demonstrated by automatic functions and technology support provided to the users of the system and the integration of different data collection systems and devices. The facilitators and barriers for information systems are described in Supplemental Appendix Table 2.
The functions of information systems in PHIS are automatic data processing, usually via an electronic interface of data collection forms and prompts for data collectors about data collection activities. The systems may automatically check the logic of data, assess the comprehensiveness of required data items, and issue alerts for errors made during data entry.35,55 These functions serve as an online task reminder to help with task completion and prevent slippage. 14 Use of the “smart chart” technology can prevent a data collector from submitting a record with missing fields. In this manner, the functions of an automatic logic check and smart selection of data are integrated into the mandatory fields. Data errors were reported to be rare after the introduction of “smart charts.” 52 If an automatic workflow chart is available in the system, it could guide and standardize the data collection and reporting process. However, changes in the project procedures and system configuration over time may lead to a decline in data quality if deployed against established guidelines or specifications on data collection for PHIS.11,53
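The automatic checks described above (mandatory fields, a logic check, and alerts at data entry) can be sketched in a few lines. This is a hedged illustration of the general idea, not a description of the “smart chart” technology or of any reviewed system: the field names, the single logic rule, and the function names are all hypothetical.

```python
from datetime import date

# Hypothetical mandatory fields for an illustrative record.
REQUIRED_FIELDS = ("patient_id", "date_of_birth", "visit_date")


def check_record(record: dict) -> list:
    """Return a list of alert messages; an empty list means no problems found."""
    alerts = []

    # Completeness check: every mandatory field must be present and non-empty.
    for field in REQUIRED_FIELDS:
        if record.get(field) in (None, ""):
            alerts.append(f"missing required field: {field}")

    # Logic check (assumed rule): a visit cannot precede the date of birth.
    dob = record.get("date_of_birth")
    visit = record.get("visit_date")
    if isinstance(dob, date) and isinstance(visit, date) and visit < dob:
        alerts.append("visit_date is earlier than date_of_birth")

    return alerts


def can_submit(record: dict) -> bool:
    """Block submission whenever any alert is raised."""
    return not check_record(record)


complete = {
    "patient_id": "A001",
    "date_of_birth": date(1990, 5, 1),
    "visit_date": date(2016, 3, 15),
}
incomplete = {"patient_id": "A002", "date_of_birth": date(1985, 1, 1)}
illogical = {
    "patient_id": "A003",
    "date_of_birth": date(2010, 1, 1),
    "visit_date": date(2009, 1, 1),
}
```

In this sketch, `can_submit(complete)` allows the record through, whereas the other two records are held back with an alert that tells the data collector what to correct.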
Integration of different systems is important in the PHIS. Multiple systems and files may impair the quality of the data collection process if data come from various sources. 55 Therefore, centralizing data in a single source and using linked data systems are preferred.34,48,51,58,64 For example, the use of external data linkage and collaboration with other jurisdictions can facilitate the generation of a higher level data repository or data sharing platform.14,38
Devices are the hardware used to store or transmit data such as computers, printers, and other electronic equipment. These devices need to be adapted to the operational system, suitable for use in data collection, and free from computer crashes, viruses, and insecure methods for data backup and storage.40,51,58,65,67
Data collection environment
Data collection environment refers to the context for data collection. In a government context, the PHIS is directly responsible to legislative, regulatory, and policy directives. 1 Of the 45 articles, 30 assessed the quality of the data collection environment14,29,31–33,35–40,42,44,46,47,50–52,55–58,60–64,68–70 (see Table 1). Training, leadership, and funding support are the three main factors. Details of facilitators and barriers for data collection environment are shown in Supplemental Appendix Table 2.
Training is imparting information and providing instructions to help trainees attain a required level of knowledge and skill or improve their performance. Such training should be mandated by higher authorities and upper management instead of on a voluntary basis.57,71 Continuing education and training opportunities should be provided to all data collectors, including frontline health professionals, managers, and specialists. The training should be individualized, measurable, and may focus on communication skills for data collection, and criteria and procedure of health service provision.47,56,57,62,63
Attributes of good leadership include (1) strengthened coordination, cooperation, and communication among government agencies and between healthcare facilities and health professionals; (2) recognition of the importance of data to be collected; (3) provision of sufficient funding; and (4) allocation of full-time staff or specific staff to data collection.35,44,51–53,69 Examples of good leadership include the development of a less resource-intensive approach using strategies of decentralization to empower the management team in the field and establishing a multi-level supervision network that includes health departments and healthcare facilities.51,52 Supervisors can perform real-time field quality assurance and control activities.
Implementation of electronic systems, installation of local system infrastructure, and maintenance of a network across data collection facilities are sometimes costly; therefore, funding is critical for data collection in resource-constrained settings.38,51 Availability of funding can improve data quality.
Discussion
The quality of the data collection process is a key component of overall data quality in the PHIS.9,10 Conceptualization of the quality of the data collection process for PHIS is also requisite for reaching high data quality goals in public health. A recent WHO evaluation of data quality in country health information systems in a global context found that data management was the weakest component of system performance. 7 A lack of knowledge about the key factors influencing the quality of the data collection process for PHIS has hindered data quality improvement and thus has impeded the effectiveness of data-driven monitoring and performance evaluation for public health programs. Effective process assessment of data collection that focuses on how data are collected will help standardize the performance of public health programs by comparing “the specific actions taken, events occurring, and human interactions with accepted standards.” 72 Prior studies have explored some factors that may affect the quality of the data collection process. However, consensus on a comprehensive and systematic assessment of the process has not been reached. Identification of the essential components of the quality of data collection is needed to guide efforts in the development of a quality framework for PHIS.
The most commonly reported quality dimension for the data collection process is data collection management; half of the identified facilitators and barriers belong to this dimension. Key areas demanding an effort for improvement include the design of the data collection form, data collection practice, and data collection quality assurance. Standardization of public health data collection practice is a long-standing issue, together with the integration of different data sources and data collection systems in public health. These key study findings reflected a primary concern with the definitions and characteristics of data collection. A variety of definitions and different quality criteria of the data collection process may contribute to a wide range of factors that affect the quality of the process.5–7,15
The data collection process is recognized as a systematic process that consists of interrelated and interdependent parts. However, the different parts of the process and the interaction between those parts for PHIS have not been well articulated. For example, the quality of the data collection form can contribute to data completeness and standardization and thus the use and validity of data. However, the association between the quality of the data collection form and data quality has not been quantified. Over-emphasis on the procedures, methods, and quality control parts of data collection and simply automating data collection systems cannot solve all data quality problems. 21
Data collectors play an important role in the quality of the data collection process. Extant data quality assessment instruments have not paid sufficient attention to data collectors except for their training experience. 12 Gaps existed between actual and recommended practice even though guidelines were available to data collectors. 11 These gaps may arise from inefficient communication between data users and data collectors. Seamless translation of data users’ requirements for data quality into the quality of the data collection process is an effective strategy for collecting high-quality data. We suggest more contextual analysis with an emphasis on data quality criteria to meet data users’ needs.
Our study identified the data collection environment as one of the four essential components of the quality of the data collection process. Training, leadership, and funding are the building blocks of a friendly and supportive data collection environment, in addition to the other factors. These factors include whether the relationship with the data collectors “is of the utmost importance” in a data collection setting 5 and if the data collection process fits with an organization’s service model. 11 Barriers to health clients’ participation in health services such as poor communication, cultural safety, and a lack of transport to health facilities could also affect the volume of data available for collection. Adding the data collection environment to the essential components would better inform data quality assessment in troubleshooting the factors that affect the quality of the data collection process.
Limitations of existing measurement instruments and studies were also found. Information about data quality was not provided in a third of the studies (14 articles).9,33,38,40,46,48,51,54,58–60,64,67,68 The majority of studies used simple descriptive or qualitative data to analyze the relationship between the factors affecting the data collection process and data quality. As the identified components were distilled from qualitative analysis of the published literature, future empirical testing and practical implementation are needed.
Conclusion
Acceptable data quality in the PHIS cannot be achieved without a high-quality data collection process. The identification of the essential components that contribute to the quality of the data collection process is thus vital to ensure that data collection leads to high-quality data. After an extended literature review, this study has identified four dimensions of the quality of the data collection process for PHIS. They are data collection management, data collector, information system, and data collection environment. With empirical testing and contextual analysis, the essential components identified above can be used in future research and practice to develop a quality framework for measuring and improving the quality of the data collection process.
Supplemental Material
Supplemental material, Health_informatics_journal-Appendix_Table_2_framework for Identification of the essential components of quality in the data collection process for public health information systems by Hong Chen, Ping Yu, David Hailey and Tingru Cui in Health Informatics Journal
Footnotes
Author contributions
H.C. searched the literature, performed data extraction, conducted data analysis, and wrote the manuscript. Being a principal investigator, P.Y. led the study, conceptualized, and wrote the manuscript. D.H. advised the conceptualization and revised the manuscript. T.C. reviewed the extracted data and revised the manuscript. All authors read, reviewed, and approved the final manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This research has been conducted with the support of an Australian Government Research Training Program Scholarship.
Supplemental material
Supplemental material for this article is available online.
References
