Abstract

‘Data cleaning’ and ‘central monitoring’ have become intertwined to the potential detriment of trial conduct. They are practically and conceptually different. What is data cleaning, what is central monitoring and why does the difference matter?
Early clinical trials collected data on punch cards and then on paper. As computers became accessible, trialists began to enter data into a database towards the end of a trial and cleaned it before the analysis. As data started being entered centrally into computer databases on receipt of forms, trialists recognised that it was better to clean the data in real time. Many considered double data entry to reduce the amount of data cleaning.1 Now, with increasing use of electronic data capture to replace paper forms, staff at trial sites are entering data directly into databases and are prompted in real time with automated data checks. Further data cleaning is led centrally, often by trial managers and statisticians, and is achieved through checking against prescriptive or plausible ranges, by checking for logical sequences of events, and by checking that critical data (‘key variables’) are not missing. Van den Broeck and colleagues offer some advice on best practice for data cleaning.2
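As an illustration only, the three kinds of data cleaning check described above (plausible ranges, logical sequences of events and missing key variables) might be sketched as follows. The variable names, ranges and record layout are hypothetical assumptions; a real trial would take them from its protocol and data management plan.

```python
from datetime import date

# Hypothetical plausible ranges and key variables (assumptions for illustration).
PLAUSIBLE_RANGES = {"age_years": (18, 100), "systolic_bp_mmhg": (60, 250)}
KEY_VARIABLES = ["participant_id", "randomisation_date", "age_years"]


def clean_record(record):
    """Return a list of query messages for one participant record."""
    queries = []

    # 1. Critical data ('key variables') must not be missing.
    for name in KEY_VARIABLES:
        if record.get(name) is None:
            queries.append(f"{name}: missing key variable")

    # 2. Values that are present must fall within plausible ranges.
    for name, (low, high) in PLAUSIBLE_RANGES.items():
        value = record.get(name)
        if value is not None and not (low <= value <= high):
            queries.append(f"{name}: {value} outside plausible range {low}-{high}")

    # 3. Events must occur in a logical sequence (consent before randomisation).
    consent = record.get("consent_date")
    randomised = record.get("randomisation_date")
    if consent and randomised and randomised < consent:
        queries.append("randomisation_date: earlier than consent_date")

    return queries


example = {
    "participant_id": "P001",
    "consent_date": date(2021, 3, 10),
    "randomisation_date": date(2021, 3, 8),
    "age_years": 17,
    "systolic_bp_mmhg": None,
}
for query in clean_record(example):
    print(query)
```

Each check operates on a single participant's record, which is the level at which data cleaning queries are raised with site teams.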
Monitoring of trials began with 100% source data verification – double-checking that the data on case report forms matched the patient’s hospital notes – and process checking at on-site monitoring visits. This required many dedicated monitors combing through hospital notes. Trials with more modest budgets conducted source data verification on only a sample of participants or a subset of datapoints (critical data). Trialists began to conduct central reviews of the database and to contact sites or make an on-site monitoring visit if the central review showed an apparent need. Risk-based monitoring was enshrined in International Council for Harmonisation of Technical Requirements for Pharmaceuticals for Human Use (ICH) GCP E6(R2) in 2016 and amended in 2018,3 with all trials encouraged into this monitoring strategy.4–6 In risk-based monitoring, the monitoring activities are focussed on preventing or mitigating risks to data quality that are both important and likely. These must be risks to processes critical to human participant protection (rights, safety and wellbeing) or to trial integrity. Rather than monitoring broadly all aspects of the trial, monitoring is directed at these pre-defined risks to the trial and also at risks which become apparent during the trial. Risk-based monitoring often starts with central monitoring, which is monitoring performed in a location away from the investigator site, often at clinical trial unit or sponsor offices. It involves an evaluation of accumulating data (or lack thereof), performed in a timely manner and supported by appropriately qualified and trained persons.7 This central monitoring is followed by escalation to an on-site monitoring visit if concerns about a site warrant it. Some element of source data verification may be mandated, but often only for a small selection of data or participants.
Monitoring is applicable to all trials, with clinical trials of investigational medicinal products tending to have a higher risk and therefore requiring more extensive monitoring.
It is particularly the terms and processes of central monitoring and data cleaning that are confused. Table 1 defines data cleaning and central monitoring. As an example, a data cleaning activity might be sending out a list of queries for site teams to resolve, whereas a related central monitoring activity might be looking at query resolution rates across different sites and escalating if a certain percentage of queries have remained open for 6 months or more. Table 2 contrasts these terms.
Table 1. Definitions.
Table 2. Data cleaning and central monitoring similarities and differences.
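The central monitoring example given above (comparing query resolution across sites and escalating when too many queries stay open) could be sketched roughly as below. The 20% threshold, the 183-day approximation of 6 months and the tuple layout are assumptions made for illustration, not values taken from this commentary or any guideline.

```python
from collections import defaultdict
from datetime import date

# Hypothetical thresholds: escalate when more than 20% of a site's queries
# have been open for about 6 months (183 days) or more.
MAX_OPEN_DAYS = 183
ESCALATION_FRACTION = 0.20


def sites_to_escalate(queries, today):
    """queries: iterable of (site, date_raised, date_resolved or None)."""
    totals = defaultdict(int)
    overdue = defaultdict(int)
    for site, raised, resolved in queries:
        totals[site] += 1
        # A query is overdue if it is still open and was raised long ago.
        if resolved is None and (today - raised).days >= MAX_OPEN_DAYS:
            overdue[site] += 1
    # Compare each site's overdue fraction against the escalation threshold.
    return sorted(
        site for site in totals
        if overdue[site] / totals[site] > ESCALATION_FRACTION
    )


queries = [
    ("Site A", date(2021, 1, 5), None),
    ("Site A", date(2021, 2, 1), date(2021, 2, 20)),
    ("Site B", date(2021, 6, 1), None),
    ("Site B", date(2021, 1, 10), date(2021, 1, 30)),
    ("Site B", date(2021, 7, 1), date(2021, 7, 9)),
]
print(sites_to_escalate(queries, today=date(2021, 9, 1)))
```

Unlike a data cleaning check, this summary is computed per site rather than per participant, and its output is a prompt for escalation rather than a query to be resolved.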
Central monitoring may be split into many tasks which are completed across time in a rolling pattern, for example, serious adverse events in week 1, protocol deviations in week 2, case report form return rate in week 3, serious adverse events in week 4 and so on. Our term ‘repeat central monitoring’ refers to the re-running of the same central monitoring task(s).
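A minimal sketch of such a rolling pattern, using the three tasks from the example above, is shown here. The function names and the one-task-per-week cadence are illustrative assumptions; a real schedule would be set out in the trial's monitoring plan.

```python
# Task names follow the example in the text; the weekly cadence is illustrative.
ROLLING_TASKS = [
    "serious adverse events",
    "protocol deviations",
    "case report form return rate",
]


def task_for_week(week):
    """Return the central monitoring task scheduled for a 1-indexed week."""
    if week < 1:
        raise ValueError("week must be 1 or greater")
    return ROLLING_TASKS[(week - 1) % len(ROLLING_TASKS)]


def weeks_between_repeats():
    """Interval, in weeks, before the same task is re-run."""
    return len(ROLLING_TASKS)


for week in range(1, 5):
    print(week, task_for_week(week))
```

Under these assumptions, ‘repeat central monitoring’ of serious adverse events occurs in weeks 1 and 4, so the interval between repeats equals the number of tasks in the cycle.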
Without a clear understanding of data cleaning and central monitoring, the trial team and site staff may spend time and effort inappropriately or wastefully. If these activities are not separated, they can each occur at the wrong time: data cleaning too rarely and central monitoring too frequently. Data cleaning needs to happen often, for two reasons. First, it is easier to clarify, correct or locate missing or out-of-range datapoints when the query is raised soon after the data were collected. Second, central monitoring is most effective on cleaned data; otherwise, teams will focus on individual data errors rather than required process changes, or an incorrect process may be missed because of poor-quality data. Repeat central monitoring, by contrast, needs to happen periodically. Trial teams need the time and capacity to consider the central monitoring findings and take appropriate action, and action takes time. The interval between repeat central monitoring reviews needs to be long enough for site staff to have begun acting on the previous findings; the actions do not need to be complete, but some work needs to have been done. In most trials, daily central monitoring is not viable. Central monitoring should pick up real, systemic problems, not momentary blips. Just as interim analyses are done at planned times so as not to inflate the chance of a positive finding, central monitoring should not be repeated too often: repeated daily, for example, it will find problems that are not real, or that are transitory and do not require extra input, in all but fast-recruiting trials with a short-duration primary outcome. Resources are required for both data cleaning and central monitoring, and appreciating the benefit of each to the trial is part of resourcing them.
The quality of the trial will suffer if the differences between data cleaning and central monitoring are not well appreciated. If they are not separated, then either or both could be done inadequately. When they are considered as one, it can feel as though enough is being done. If they are not done separately, then it may be that a risk for a trial is not adequately covered. Central monitoring is often considered in a risk-based framework relating to the written risk assessment of the trial. Though data cleaning protects the integrity of the trial and may be based on risk (e.g. variables considered critical to trial completion may be cleaned more often), it is not framed in a risk-based way. Therefore, there is scope for a risk noted in central monitoring to be only partly covered by a data cleaning task, resulting in the risk not being adequately covered. Data cleaning is done on individual participant data, whereas central monitoring is carried out on all available (and missing) data at one site. Data cleaning is less effective when done at site level, and central monitoring may miss a risk if it is done per individual participant.
If the research community cannot be clear on language, it is difficult to discuss best practice or, importantly, define high-quality methodology projects to determine evidence-based improvements to approaches across trials.
In conclusion, it is important to correctly define data cleaning and central monitoring in order to communicate the conduct of a trial, to ensure adequate risk mitigation and to ensure that the data are appropriately corrected. This commentary starts this discussion.
Footnotes
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship and/or publication of this article: M.R.S. reports grants and non-financial support from Astellas, Janssen, Novartis and Sanofi; grants from Clovis; and personal fees from Lilly Oncology and Janssen, outside the submitted work. The remaining authors declare that there is no conflict of interest.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: M.L.M.’s salary was supported by HDRUK and all other salaries were supported by MRC grant (MC_UU_12023/24).
