Abstract
Background:
Participants of health research studies such as cancer screening trials usually have better health than the target population. Data-enabled recruitment strategies might be used to help minimise healthy volunteer effects on study power and improve equity.
Methods:
A computer algorithm was developed to help target trial invitations. It assumes participants are recruited from distinct sites (such as different physical locations or periods in time) that are served by clusters (such as general practitioners in England, or geographical areas), and the population may be split into defined groups (such as age and sex bands). The problem is to decide the number of people to invite from each group, such that all recruitment slots are filled, healthy volunteer effects are accounted for, and equity is achieved through representation in sufficient numbers of all major societal and ethnic groups. A linear programme was formulated for this problem.
Results:
The optimisation problem was solved dynamically for invitations to the NHS-Galleri trial (ISRCTN91431511). This multi-cancer screening trial aimed to recruit 140,000 participants from areas in England over 10 months. Public data sources were used for objective function weights and constraints. Invitations were sent by sampling according to lists generated by the algorithm. To help achieve equity, the algorithm tilts the invitation sampling distribution towards groups that are less likely to join. To mitigate healthy volunteer effects, it requires a minimum expected event rate of the primary outcome in the trial.
Conclusion:
Our invitation algorithm is a novel data-enabled approach to recruitment that is designed to address healthy volunteer effects and inequity in health research studies. It could be adapted for use in other trials or research studies.
Background
Participants in clinical trials are usually healthier than the target population. This so-called healthy volunteer effect has been observed in most cancer screening trials done to date. For example, in the Prostate, Lung, Colorectal and Ovarian (PLCO) cancer screening trial, participants in the control arm had less than half the mortality rate of the general population 1 ; a similar effect was seen in the European Randomised study of Screening for Prostate Cancer (ERSPC) 2 ; and in the lung-screening NELSON (Nederlands–Leuvens Longkanker Screenings Onderzoek) trial, mortality was lower in participants than in those who did not join. 3 Healthy volunteer effects have also been observed in cohort studies including the European Prospective Investigation into Cancer and Nutrition (EPIC) 4 and UK Biobank. 5 A parallel issue is that participants recruited to such research studies are usually much less diverse than the target population. For example, ethnic minorities were under-represented in the PLCO cancer screening trial, despite efforts 6 ; and those who joined UK Biobank were disproportionately from less-deprived areas. 5
It is important to try to address healthy volunteer effects and representation of the target population at the design stage of research studies for several reasons. First, unless healthy volunteer effects are accounted for, the study is likely to be underpowered. Second, a lack of representation threatens generalisability. Third, seeking to limit healthy volunteer effects and trying to ensure all groups of society are represented in adequate numbers is important for moral reasons. There is an imperative to reduce health inequalities in all areas, including representing those who are most likely to have ill health in research. 7
In this article, we outline a dynamic data-enabled method for inviting people to join a trial. It is designed to help address healthy volunteer effects and improve representation. The approach was developed for the NHS-Galleri trial (ISRCTN91431511). 8 This trial is being run to see how well a multi-cancer early detection test (Galleri® test) works in the National Health Service (NHS) in England. 9 The trial aim is to evaluate if the test (alongside standard screening) finds cancer earlier and thereby prevents stage III and IV cancers in people who do not have symptoms of cancer.
Clinical and demographic factors were monitored during recruitment to try to ensure that: (1) the participants at entry would be representative of the population of England aged 50–77 years; and (2) the incidence of advanced cancer in the control arm within 3 years of enrolment would be at least as great as the average among the population of England aged 50–77 years. By ‘representative’, we mean participants from all areas of deprivation and all major ethnic groups should be included in reasonable numbers. We do not mean that the proportion from each group should exactly mirror that of the population as a whole. Indeed, we would prefer to over-recruit from more deprived groups and ethnic minorities, because people in these groups are usually substantially under-represented in clinical trials and will have poorer health outcomes because of the social determinants of health. 7 In other words, the recruitment strategy aimed for equity rather than equality. We also note that if all major ethnic and deprivation groups are represented in the study sample then marginal measures may be calibrated to different populations through standardisation methods that differentially weight data from participants. Under-sampling uncommon groups will decrease the precision of standardised estimates much more than under-sampling common groups.
One recruitment strategy is to allow anyone eligible to be able to join. This has consistently been shown to suffer from healthy volunteer effects. Another approach is to require that participants receive an invitation before joining. This approach was used in the United Kingdom Collaborative Trial of Ovarian Cancer Screening (UKCTOCS). 10 Women were randomly invited from population registers. The trial invited 1,243,282 women to recruit 205,090 (uptake 16.5%). 11 Unfortunately, on average those who joined the study were less deprived than the wider population, and mortality in the trial was substantially less than the wider population. 10 The trial leaders had to extend the duration of screening and follow-up to achieve a sufficient number of events in the control arm for their primary analysis. 10
An alternative to random invitation is stratified sampling. This was used in the NHS-Galleri trial. The vast majority of participants were invited to attend a mobile clinical unit for blood sampling. Invitations were sent to patients registered with a General Practitioner (GP) located within a defined geographical area around the clinical unit or site, in accordance with the relevant permissions and approvals. A dynamic computer programme was used to decide which groups of people to invite through NHS DigiTrials, to ensure adequate representation in participants across demographic and clinical factors, enrich for advanced cancer in the control arm and account for likely healthy volunteer bias. In addition to the central approach, there were also targeted GP search invitations and targeted open enrolment of interested individuals who learned about the trial from specific recruitment efforts in selected communities. 12 Local media campaigns were coordinated with site openings. Public and patient involvement in the recruitment of participants included the design of participant information materials. Further work is ongoing, focussing on behavioural science relating to acceptability and informed decision-making when considering participation in screening using tests for multiple cancer types. 13
In the rest of this article, we report the algorithm that was developed and used for most of the invitations to NHS-Galleri and describe how its parameters were set. The algorithm is sufficiently generic that it might also be useful beyond this trial for other research studies.
Methods
Model
Our model requires patients to be recruited from different physical locations or periods in time, which we call sites. In NHS-Galleri, a site was a location where blood was donated in a mobile clinic. The sites are served by clusters of potential participants. In NHS-Galleri, these were patients registered at GPs; in other studies, they might be people resident in a geographical area. Each cluster may be further divided into defined groups, such as age-and-sex bands. Figure 1 illustrates that the cluster size (number of people registered at each GP) may vary overall, and by age and sex.

Figure 1. Schematic of the invitation model in an example where (for simplicity) the site is served by three GP practices (clusters). There is an age/sex distribution of people to potentially invite within each cluster. This is illustrated using a histogram, where the solid blocks represent the male population and the hatched blocks the female population, with increasing age from left to right and from lighter to darker shades. The problem is to determine the number to invite to attend appointments at the site from each age/sex band (group) and GP (cluster).
Our invitation model is dynamic because invitations are sent in sequential waves within each site. This enables feedback on uptake, which may be used to help plan subsequent waves of invitations. It also provides flexibility if the total capacity at a site changes. For example, a site may be forced to reduce the number of slots available due to logistical issues, or additional capacity may be made available.
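The feedback between waves can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions and is not the update rule used in NHS-Galleri: it assumes that uptake in a group is re-estimated by blending a prior guess with the bookings observed so far, and the function names and the `prior_weight` constant are hypothetical.

```python
import math


def update_uptake(prior_uptake, invited, booked, prior_weight=50.0):
    """Blend a prior uptake guess with bookings observed in earlier waves.

    Assumption (not from the article): the prior guess is treated as if it
    were worth `prior_weight` earlier invitations, so the estimate moves
    towards the observed rate as more invitations are sent.
    """
    if invited < 0 or booked < 0 or booked > invited:
        raise ValueError("need 0 <= booked <= invited")
    return (prior_weight * prior_uptake + booked) / (prior_weight + invited)


def invitations_needed(target_bookings, uptake):
    """Number of invitations expected to yield the target number of bookings."""
    if uptake <= 0:
        raise ValueError("uptake must be positive")
    return math.ceil(target_bookings / uptake)


# Hypothetical example: a prior guess of 10% uptake, then 20 bookings
# observed from 400 invitations in the first wave.
revised = update_uptake(0.10, invited=400, booked=20, prior_weight=100.0)
next_wave = invitations_needed(target_bookings=100, uptake=revised)
print(revised, next_wave)
```

In this sketch a lower-than-expected uptake in early waves increases the number of invitations planned for later waves, which is the kind of adjustment that sequential waves make possible.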
The invitation problem is to determine the number of people to invite from each group within each cluster serving a single site in each wave, so that the study sample is likely to be adequately powered to meet the trial objective; representative of the population in the sense described above; and all slots available for recruitment to the study are filled. We next describe a mathematical model for this problem. For ease of exposition, in the rest of the article, the model groups are referred to as age/sex groups, and clusters as GPs.
Optimisation problem
The optimisation problem is set up and solved separately for each site. For each site, there are
by solving for
1. The decision variable
2. No more than 100% of patients in an age-and-sex group may be invited through the waves
where
3. The expected number of people who book appointments
This constraint effectively controls the number of invitations sent given
4. The proportion of invitations sent to each GP in each wave is less than a chosen
This is used to avoid GPs being potentially overburdened with inquiries about the trial if, for example, everyone in their practice receives an invitation on the same day.
5. A minimum bound is achieved on the expected proportion of patients who book (of the total) from each age/sex group in wave
where
6. The expected proportion of men who book is
In practice, uptake rates often differ by age and sex, and one may need to invite more men to achieve parity in bookings by age/sex.
7. The expected number of events in those who book is greater than a bound:
where
A summary of all the parameters defined above is in Table 1. The mathematical formulation may be solved using standard methods, such as a simplex algorithm. 14
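As an illustrative sketch only, a linear programme with the seven components listed above could be written as follows. The notation here is hypothetical and need not match the symbols of the original formulation or of Table 1: $x_{ga}$ is the number invited in the current wave from age/sex group $a$ at GP $g$; $n_{ga}$ is the number registered; $y_{ga}$ is the number already invited in earlier waves; $c_{ga}$ is the cost weight; $u_{ga}$ is the expected uptake; $r_{ga}$ is the expected event rate; $B$ is the target number of bookings; $\pi$ is the maximum proportion of a GP list invited per wave; $m_a$ is the minimum proportion of bookings from group $a$; $M$ is the set of male groups and $\rho$ the target proportion of bookings from men; and $E$ is the minimum expected number of events. Whether each constraint is an equality or an inequality is also an assumption of this sketch.

```latex
\begin{align*}
\min_{x}\quad & \sum_{g}\sum_{a} c_{ga}\, x_{ga} \\
\text{subject to}\quad
& 0 \le x_{ga} \le n_{ga} - y_{ga} && \text{for all } g, a \\
& \sum_{g}\sum_{a} u_{ga}\, x_{ga} = B \\
& \sum_{a} x_{ga} \le \pi \sum_{a} n_{ga} && \text{for all } g \\
& \sum_{g} u_{ga}\, x_{ga} \ge m_{a}\, B && \text{for all } a \\
& \sum_{g}\sum_{a \in M} u_{ga}\, x_{ga} = \rho\, B \\
& \sum_{g}\sum_{a} r_{ga}\, u_{ga}\, x_{ga} \ge E
\end{align*}
```

Every expression is linear in $x_{ga}$, which is why standard linear programming solvers apply.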
Table 1. Definition of parameters in the linear programme, and how they were applied in NHS-Galleri.
GP: general practitioner; NHS: National Health Service.
Results
We next describe how the algorithm parameters were chosen for NHS-Galleri.
Algorithm parameters
Cost weights
The most important parameter is
The first term on the right-hand side gives a higher cost to invitations sent to patients with lower event rates. The second term on the right-hand side is used so that the cost of inviting any patient from the highest preference group of GPs is less than the cost of inviting any patient from GPs with a lower preference. Therefore, unless the constraints are broken, the optimal solution will be to invite everyone in the highest preference group of GPs before moving to the next preference group. Likewise, the cost of inviting a patient from the second preference group is less than that of any patient in the third, fourth or lower preference practices. The first term on the right-hand side of equation (1) means that within GPs of the same rank, invitations to patients with the highest advanced cancer rate
The preference ranking
Finally, we note that in NHS-Galleri invitation weighting of deprivation and ethnicity information was derived at the cluster (GP) level in our model; and age and sex were controlled at the group level in our model. This was due to constraints in how participants could be selected for an invitation. The choice between cluster and group factors in future studies will also be dictated by the level of stratification that is feasible.
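The ordering induced by cost weights of this kind can be illustrated with a deliberately simplified Python sketch. It assumes a hypothetical cost made of an event-rate term between 0 and 1 plus a preference-rank term in steps of 2, so that GP preference always dominates. It also assumes that the only constraints are a fixed number of invitations and an upper bound per group; under those assumptions the linear programme reduces to filling groups in order of increasing cost. The function names, field names, the step of 2 and the example figures are illustrative and are not taken from NHS-Galleri.

```python
def cost_weight(event_rate, max_event_rate, preference_rank):
    """Hypothetical cost: lower for higher event rates and preferred GPs."""
    rate_term = (max_event_rate - event_rate) / max_event_rate  # in [0, 1]
    rank_term = 2.0 * preference_rank  # step of 2 exceeds the range of rate_term
    return rate_term + rank_term


def greedy_invitations(candidates, total_invitations):
    """Fill a fixed number of invitations in order of increasing cost.

    This is optimal only for the simplified problem with a single budget
    constraint and per-group upper bounds; the full model has further
    constraints and needs a linear programming solver.
    """
    max_rate = max(c["event_rate"] for c in candidates)
    if max_rate <= 0:
        raise ValueError("at least one event rate must be positive")
    ordered = sorted(
        candidates,
        key=lambda c: cost_weight(c["event_rate"], max_rate, c["preference_rank"]),
    )
    plan = {}
    remaining = total_invitations
    for c in ordered:
        n = min(c["available"], remaining)
        plan[(c["gp"], c["group"])] = n
        remaining -= n
    if remaining > 0:
        raise ValueError("not enough people available to fill all invitations")
    return plan


# Hypothetical example: GP "A" has the highest preference (rank 0).
EXAMPLE_CANDIDATES = [
    {"gp": "A", "group": "70-77", "event_rate": 0.012, "available": 100, "preference_rank": 0},
    {"gp": "A", "group": "50-59", "event_rate": 0.004, "available": 150, "preference_rank": 0},
    {"gp": "B", "group": "70-77", "event_rate": 0.012, "available": 120, "preference_rank": 1},
    {"gp": "B", "group": "50-59", "event_rate": 0.004, "available": 200, "preference_rank": 1},
]
example_plan = greedy_invitations(EXAMPLE_CANDIDATES, total_invitations=300)
print(example_plan)
```

In the example, everyone at the preferred GP is invited before anyone at the second GP, and within each GP the group with the higher event rate is invited first.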
Event rates and uptake
We modelled expected advanced cancer incidence
For uptake, initially we had no data and set
where
Invitation process control
The first invitation process parameter is the target number to book in each wave
The second control parameter is the maximum proportion of a GP list that may be invited at each wave. This was arbitrarily set as
The third control parameter is the minimum number expected to book in each age-group during each wave
The final control parameter is the minimum expected number of events
Computer algorithm
In our implementation of the algorithm, the parameters in Table 1 were organised into four input CSV files (Table 3). The input files were generated using scripts written in the statistical computing software R. The linear programme was solved using a programme written in Python 3, using the cvxopt library. 14,16 The algorithm writes a CSV file with the number of people to invite for each wave by age, sex and GP (Table 3). A demonstration example is provided with open source code. 17
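Before invitations are sent, a plan such as the one written to the output CSV file could be checked against the quantities that the linear programme constrains. The following Python sketch is one way this might be done and is not part of the published code. The field names (`gp`, `group`, `list_size`, `remaining`, `invite`, `uptake`, `event_rate`) and the example figures are hypothetical.

```python
def summarise_plan(rows, max_gp_proportion):
    """Summarise one wave of a proposed invitation plan.

    Each row is a dict for one age/sex group at one GP with hypothetical keys:
    list_size (people registered), remaining (people not yet invited),
    invite (planned invitations), uptake and event_rate (expected rates).
    """
    expected_bookings = sum(r["invite"] * r["uptake"] for r in rows)
    expected_events = sum(r["invite"] * r["uptake"] * r["event_rate"] for r in rows)

    invited_by_gp = {}
    size_by_gp = {}
    for r in rows:
        invited_by_gp[r["gp"]] = invited_by_gp.get(r["gp"], 0) + r["invite"]
        size_by_gp[r["gp"]] = size_by_gp.get(r["gp"], 0) + r["list_size"]

    # GPs where this wave exceeds the allowed proportion of the GP list.
    gps_over_limit = sorted(
        gp for gp in invited_by_gp
        if invited_by_gp[gp] > max_gp_proportion * size_by_gp[gp]
    )
    # Groups where more people would be invited than remain to be invited.
    groups_over_invited = [
        (r["gp"], r["group"]) for r in rows if r["invite"] > r["remaining"]
    ]
    return {
        "expected_bookings": expected_bookings,
        "expected_events": expected_events,
        "gps_over_limit": gps_over_limit,
        "groups_over_invited": groups_over_invited,
    }


# Hypothetical example with two GPs.
EXAMPLE_ROWS = [
    {"gp": "A", "group": "M50-59", "list_size": 200, "remaining": 200,
     "invite": 40, "uptake": 0.10, "event_rate": 0.005},
    {"gp": "A", "group": "F50-59", "list_size": 200, "remaining": 200,
     "invite": 30, "uptake": 0.20, "event_rate": 0.004},
    {"gp": "B", "group": "M50-59", "list_size": 100, "remaining": 100,
     "invite": 60, "uptake": 0.10, "event_rate": 0.005},
]
print(summarise_plan(EXAMPLE_ROWS, max_gp_proportion=0.5))
```

A check of this kind is independent of the solver, so it could be applied to any plan regardless of how it was produced.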
Table 2. Public data sources used to help guide invitations in NHS-Galleri
GP: general practitioner; IMD: index of multiple deprivation; NHS: National Health Service; LSOA: Lower layer Super Output Area; NCRAS: National Cancer Registration and Analysis Service.
Table 3. Organisation of algorithm input and output from each wave of invitations
GP: general practitioner; NHS: National Health Service.
Conclusion
We have described a novel data-enabled algorithm to help overcome healthy volunteer effects and improve equity when recruiting to large trials or cohorts. In NHS-Galleri, the method was intended to tilt the invitation sampling distribution towards more deprived groups, and those with a higher expected event rate of the primary outcome in the trial. The approach is unlikely to eliminate all healthy volunteer effects. However, it tries to mitigate the impact of healthy volunteer bias by guarding against potential loss of power, as well as increasing representation in the trial from societal groups who are often not well represented.
The successful use of this algorithm at scale has been demonstrated by rapid recruitment to NHS-Galleri. Approximately 1.5 million people from the general population of England were invited and 140,000 of those were enrolled in under 11 months. 12 Our method might be used for other research studies. The most direct application would be in other screening trials run through NHS DigiTrials. So that other trial units can build on our methodology, demonstration code has been made available. 17
There are several considerations for future use of this methodology. The first consideration is the primary endpoint. In NHS-Galleri, the primary endpoint was advanced cancer incidence. There will be different considerations for other outcomes such as cancer-specific mortality. For example, in UKCTOCS, healthy volunteer effects had a greater impact on mortality than on cancer incidence. 10 One reason for this is the eligibility criteria. These precluded people with cancer from joining the trial, so that those who joined would not have the same cancer-specific mortality rates as the general population in the short to medium term. A second consideration is the choice of variables used to tilt the sample to a higher-risk group. In this example, age, sex and deprivation were the key variables, but a different approach might be needed depending on the trial endpoint. A third consideration is achieving adequate representation of the target population. Age, sex, deprivation and ethnicity are likely to remain important for equity considerations, but there might be other factors that are important to take into consideration. Finally, the choice of variables used in the model will depend on data availability. For example, if data on body mass index were available at a group or cluster level, then it could contribute to this data-driven approach.
Strengths of our method include that it uses the invitation process to adjust recruitment according to pre-determined factors, and a data-enabled strategy to address important problems related to equity and healthy volunteer effects that have affected many research studies. Data on the effectiveness of our strategy will be presented elsewhere.
A limitation of this approach is that the method is based on the site, cluster, group model, which may not translate to all settings. Another limitation is inclusion/exclusion criteria. The example had inclusive entry criteria, but if the trial needs to be more selective then the approach might be more difficult to apply. The methods also rely on several flows of data, which may be a practical impediment to implementation in other settings outside of NHS DigiTrials. One might also be concerned if the trial successfully over-recruits from target groups who may not usually take up cancer screening, and whether this could affect how health policymakers interpret the results of the trial. However, the goals of recruiting to a trial to evaluate efficacy are usually different from those when evaluating the effectiveness of a proven intervention. Subsequent larger-scale pilots and analyses are usually needed to evaluate and help plan implementation. 23
In conclusion, healthy volunteer effects and inadequate representation have been recognised as problems for many years 24 , but arguably little progress has been made in reducing their impact, even with judicious recruitment strategies. We hope that our data-driven stratified sampling methodology might be applied elsewhere to enable future studies to better represent their target population, improve equity, diversity and inclusion of trial participants, and account for healthy volunteer effects.
Footnotes
Acknowledgements
The work of Brentnall and Beare was supported by funding from GRAIL LLC awarded to The Cancer Research UK and King’s College London Cancer Prevention Trials Unit (CPTU) for NHS-Galleri. Sleeth and Ching are supported by Cancer Research UK (ref: C8162/A25356, awarded to Sasieni). Mathews is supported by Cancer Research UK (ref: C8162/A27047, awarded to Sasieni). We would like to thank all those who commented on the material. In particular, we are grateful to Rachel McMullan, Liz Holmes, Sara Hiom and Rebecca Smittenaar.
Declaration of conflicting interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Sasieni is a paid member of the GRAIL scientific advisory board, and statistician on the NHS-Galleri trial.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work of Brentnall and Beare was supported by funding from GRAIL LLC awarded to The Cancer Research UK and King’s College London Cancer Prevention Trials Unit (CPTU) for NHS-Galleri. Sleeth and Ching are supported by Cancer Research UK (ref: C8162/A25356, awarded to Sasieni). Mathews is supported by Cancer Research UK (ref: C8162/A27047, awarded to Sasieni).
Trial registration
Trial registration number: ISRCTN91431511.
