Abstract
Objectives
To summarize debate and research in the Swedish Two-County Trial of mammographic screening on key issues of trial design, endpoint evaluation, and overdiagnosis, and from these to infer promising directions for the future.
Methods
A cluster-randomized controlled trial of the offer of breast cancer screening in Sweden, with a single screen of the control group at the end of the screening phase, forms the setting for a historical review of investigations and debate on issues of design, analysis, and interpretation of results of the trial.
Results
There has been considerable commentary on the closure screen of the control group, ascertainment of cause of death, and cluster randomization. The issues raised were researched in detail and the main questions answered in publications between 1989 and 2003. Overdiagnosis issues still remain, but methods of estimation taking full account of lead time and of non-screening influences on incidence (taking place mainly before 2005) suggest that it is a minor phenomenon.
Conclusion
Despite resolution of issues relating to this trial in peer-reviewed publications dating from years, or even decades, ago, issues that already have been addressed continue to be raised. We suggest that it would be more profitable to concentrate efforts on current research issues in breast cancer diagnosis, treatment, and prevention.
Introduction
The Swedish Two-County Trial of mammographic screening for breast cancer was initiated in 1977, with 77,080 women aged 40–74 invited to screening (Active Study Population, i.e. ASP) and 55,985 women not invited (Passive Study Population, i.e. PSP). After approximately three rounds of screening the PSP was invited to screening for the first time, and the trial closed thereafter. 1 The first mortality results were published in 1985, showing a significant 31% reduction in breast cancer mortality associated with an invitation to screening. 2 In the years that followed, a number of methodological and clinical issues related to the Two-County Trial arose, and findings were investigated and scrutinized. Many of these issues arose as a result of challenges raised by sceptics, including concern about overdiagnosis, 3 trial design aspects, including cluster randomization and the closure screen of the control group, and allegations of bias in classification of cause of death.4–6 In this paper, we summarize the issues and concerns raised, then review the research from the Two-County Trial and elsewhere that has addressed and responded to these critiques. We conclude by summarizing the current state of knowledge, and giving some implications for future research and its conduct.
Evaluation issues
Trial design
Two major attributes of the Two-County Trial design have received critical attention: the cluster randomization, in which 45 geographic areas (rather than the 133,065 individual women) were randomized to invitation to screening or not; and the screen of the control group at the end of the screening phase of the trial.4–8 The major criticism was that the cluster randomization could have led to imbalances between ASP and PSP that could have biased the breast cancer mortality comparison.4,6 In fact, the Two-County investigators had reported a small age imbalance between the ASP and PSP in 1989. 9 In addition, concern was expressed in 1998 that the screening of the PSP at closure of the trial had caused inclusion of deaths from cancers in the PSP whose counterparts in the ASP would not have been included, 5 thus inflating deaths in the control (PSP) group, with the net result that mammography screening would appear more beneficial.
Cause of death ascertainment
The possibility that cause of death was differentially classified between the ASP and PSP has been raised.4,6–8,10 The grounds for these allegations (different numbers of deaths reported in different publications) have been shown to be spurious, due to (i) the inevitable differences between numbers of breast cancer deaths reported from the original trial endpoint committee classification of cause of death and those determined by the Swedish overview committee in a separate analysis of all the Swedish trials,11,12 and (ii) the also inevitable increase in numbers of deaths with time during the follow-up period. Naturally, successive publications with longer follow-up have larger numbers of deaths. 12 Despite these explanations, the allegations have persisted. 8 It also has been argued that the appropriate endpoint should be all-cause mortality, which is described as free from bias in cause of death ascertainment. The call for all-cause mortality ignores the fact that an intervention for a single cause of death cannot be expected to prevent death from all causes. It also either overlooks the sample size requirements to measure a statistically significant difference in all-cause deaths associated with a disease-specific intervention, or judges this not to be a problem. One commentator has argued that to evaluate breast screening using all-cause mortality, ‘a trial of 1.2 million women would suffice’, presumably (emphasis on “suffice”) regarding this as a readily achievable figure. 7
Overdiagnosis
Overdiagnosis, of both invasive disease and ductal carcinoma in situ (DCIS), is increasingly cited as a serious risk associated with mammography screening programmes.13,14 While these programmes aim to detect breast cancer early, in order to reduce mortality and morbidity from breast cancer, it is argued that they cause significant harm by finding cancers that do not need to be found. Overdiagnosis is usually defined as the detection of cancers, through screening, that would never have developed symptoms, and therefore never would have been diagnosed or been a threat to life during the patient’s lifetime, had she not participated in screening. It is argued that the diagnosis of non-progressive cancers results in unnecessary treatment and associated psychological effects that would not otherwise have been experienced in the absence of screening. Specifically in the Two-County Trial, Gøtzsche claimed an overdiagnosis rate of 33%. 7
Theoretically, overdiagnosis could occur in one of several ways: when the preclinical, screen-detected cancer is progressive but the person dies prematurely of another cause before the time at which symptoms would have occurred; if the growth rate of a truly progressive cancer is not rapid enough to give rise to symptoms during the person’s lifetime; if the cancer stops growing and becomes indolent for some reason; or due to regression of the cancer, as is claimed by some. 15 Of these four possibilities, it is challenging to measure the first three with confidence, and there is very little sound evidence of the fourth. 16
Overdiagnosed cancers are pathologically confirmed cases, that is, they exhibit all the features by which pathologists classify cancer. They are distinct from “false positives”, which are suspicious lesions that are found not to be malignant after further investigation. There are currently no markers that can distinguish overdiagnosed cases from those which would have become symptomatic in a woman’s lifetime. Hence it is not possible to directly observe overdiagnosis. In essence, overdiagnosis is an epidemiological concept related to the apparent excess in breast cancer incidence that occurs following screening.
Estimating the level of overdiagnosis is complex. Methods used include:
Comparison of cumulative incidence in long-term follow-up of populations exposed and unexposed to screening. This could be follow-up of the mammography screening trials in which the control group was not screened at closure of the trial or thereafter, or observational studies comparing breast cancer incidence with and without screening.17,18
Estimation from disease progression models allowing heterogeneity in lead time. Overdiagnosed cancers can be thought of as an extreme of length bias cases or as an extreme of lead time, in which the latter applies to those whose lead time extends beyond the future expected lifetime of the patient. 19
While it is impossible to know whether an individual cancer is overdiagnosed, overdiagnosis is, by definition, confined to screen-detected cancer, and it is expected to be more prevalent in DCIS and in small, localized invasive cancers. 20 A major issue of debate is the necessity to distinguish the effect of lead time from that of overdiagnosis, that is, to separate the excess incidence from cancers diagnosed earlier than they would otherwise have been diagnosed from the excess incidence due to cancers that would not have been diagnosed at all without screening.13,21 In estimation from observational studies it is also necessary to take account of trends in breast cancer incidence occurring independently of screening. 18 While it might be expected that these trends would be the same in a group exposed to screening as in one not exposed to screening, this is not guaranteed in a non-trial setting, and indeed, the changes in incidence over time are also confounded by lead time.
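The distinction between lead time and overdiagnosis can be illustrated with a deliberately simple toy model. The sketch below is not the estimation method used in any of the cited studies; it assumes a constant underlying incidence, a single fixed lead time for every cancer, and perfect screening sensitivity, and every number in it (rates, years, the function name `cumulative_cases`) is hypothetical. It shows that excess incidence due to lead time alone disappears in long-term cumulative incidence once screening stops, whereas excess due to overdiagnosis persists.

```python
from itertools import accumulate


def cumulative_cases(n_years, base_rate, screen_years=0, lead_time=0, overdx_rate=0.0):
    """Cumulative diagnoses per 1,000 women at the end of each year (toy model).

    Assumptions (all hypothetical): base_rate cancers per 1,000 women reach
    clinical presentation every year; while screening is active (years
    0..screen_years-1) each cancer is found lead_time years early; overdx_rate
    extra non-progressive cancers per 1,000 are found per screening year.
    """
    diagnoses = [0.0] * n_years
    # Loop over the year in which each cancer would present clinically.
    for clinical_year in range(n_years + lead_time):
        advanced_year = max(0, clinical_year - lead_time)
        # The cancer is screen-detected only if its advanced date falls
        # inside the screening period; otherwise it presents clinically.
        year = advanced_year if advanced_year < screen_years else clinical_year
        if year < n_years:
            diagnoses[year] += base_rate
    # Overdiagnosed cases occur only while screening is active.
    for year in range(min(screen_years, n_years)):
        diagnoses[year] += overdx_rate
    return list(accumulate(diagnoses))


# Hypothetical scenario: 20 years of follow-up, 7 years of screening,
# a 3-year lead time, and 2 cancers per 1,000 women per year.
unscreened = cumulative_cases(20, 2.0)
lead_time_only = cumulative_cases(20, 2.0, screen_years=7, lead_time=3)
with_overdiagnosis = cumulative_cases(20, 2.0, screen_years=7, lead_time=3, overdx_rate=0.1)

print("Excess at end of screening (lead time only):", lead_time_only[6] - unscreened[6])
print("Excess at 20 years (lead time only):", lead_time_only[-1] - unscreened[-1])
print("Excess at 20 years (with overdiagnosis):", with_overdiagnosis[-1] - unscreened[-1])
```

In this toy setting, an excess measured at the end of the screening phase reflects lead time, while only the excess that remains after a sufficiently long post-screening period reflects overdiagnosis. Real data additionally require adjustment for incidence trends and for heterogeneous lead times, which the model ignores.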
Research in response to evaluation issues
Trial design
In any randomized trial, difficult decisions arise in relation to design. In particular, in the design of population-scale screening trials, the practicability of different randomization strategies and the handling of lead time are important considerations. From early in the evolution of the Two-County Trial results, the issue of the potential influence of cluster randomization on outcomes has been taken seriously. It was observed in the first publication of the breast cancer mortality results that ‘the excess variation resulting from randomization being at the community rather than the individual level was negligible’. 2 However, it was the Two-County investigators themselves who first noted and published the age imbalance, 9 whereby the ASP was slightly older on average than the PSP. Concerns about cluster randomization are always more focused on the risk of biases that lead to type 1 errors, although in this instance, it was noted that if the age imbalance had any effect at all on the primary result, it would be expected to bias it
From 1992, the Two-County Trial investigators repeatedly published conservative estimates to take account of the cluster randomization, and the results have remained virtually unchanged, i.e. a significant reduction in breast cancer mortality in the ASP, of the order of 30%.1,22,23 Of particular interest is the analysis published in 2000 by Nixon et al., 22 which used Bayesian hierarchical modelling to fit variation between and within clusters. Dr AB Miller, who has frequently expressed scepticism about mammography screening, wrote in 2004 that he found this analysis, ‘particularly compelling in largely dealing with the cluster randomization issue’. 24
Perhaps more pertinently, in 2003, we noted that the important issue was whether the cluster randomization had introduced an imbalance in underlying rates of breast cancer mortality between ASP and PSP. 25 We found no evidence of this, but nonetheless reanalyzed the results, adjusting for breast cancer mortality in the 10 years prior to the trial, and found the same result, a substantial and significant reduction of approximately 30% in breast cancer mortality in the ASP. We also performed an analysis adjusting for prior breast cancer mortality in the randomization clusters, that is, adjusting for a potential difference between ASP and PSP clusters in baseline breast cancer mortality, and still observed a significant 27% reduction in mortality in the ASP. 25 Two summary points are important. First, cluster randomization is a legitimate methodological strategy, although more challenging than individual randomization, due to the requirement that the clusters produce similar groups for comparison. Second, the secondary analyses performed on these data do not support conjectures that the positive findings in the Two-County Trial are in doubt due to alleged vagaries in the randomization.
In relation to the issue of closure screening in the PSP, the initial trial results published in 1985 pertained almost entirely to the period
We recently published a paper reviewing methodological strategies in screening trial design, highlighting the degree to which all strategies with a fixed number of screening rounds and a post-screening follow-up period introduced different levels of bias against measuring the effect of screening. We found that a closure screen of the PSP gave the least biased result (albeit still conservative), and came closest to the approximately unbiased, but often impractical, design in which screening in the ASP and usual care for the PSP continue for a long period. 29
Cause of death ascertainment
In 1989, the first mortality update of the Swedish Two-County Study published the details of determination of cause of death. 9 Briefly, if there were clinical, histological, or autopsy evidence of distant metastases, and
In the same paper, we investigated the issue of a potential difference in classification of cause of death between the ASP and PSP. 9 If such a difference existed, it would be manifested as a significant difference between ASP and PSP in rates of death from other causes among breast cancer cases, as only those with breast cancer diagnosed can be classified as having died from the disease. For example, if there were a subjective tendency to ascribe deaths in breast cancer cases in the ASP as from other causes, there would be an excess rate of death from other causes in breast cancers in the ASP. No such difference was observed. The analysis was repeated with further follow-up in 1992, 1 and again, no significant difference was observed.
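The logic of this check can be sketched as a comparison of two death rates. The example below is an illustration only: the source does not state which test was used, so a rate ratio with a log-scale Poisson confidence interval is assumed here, and the function name `rate_ratio` and all counts and person-years are hypothetical rather than trial data.

```python
import math


def rate_ratio(deaths_a, person_years_a, deaths_b, person_years_b, z=1.96):
    """Rate ratio (group A vs group B) with an approximate 95% CI.

    Uses the usual large-sample standard error of the log rate ratio,
    sqrt(1/deaths_a + 1/deaths_b), which treats death counts as Poisson.
    """
    ratio = (deaths_a / person_years_a) / (deaths_b / person_years_b)
    se_log = math.sqrt(1.0 / deaths_a + 1.0 / deaths_b)
    lower = ratio * math.exp(-z * se_log)
    upper = ratio * math.exp(z * se_log)
    return ratio, lower, upper


# Hypothetical counts of deaths from causes other than breast cancer among
# breast cancer cases, with hypothetical person-years of follow-up.
ratio, lower, upper = rate_ratio(210, 14000.0, 150, 10000.0)
print(f"ASP vs PSP other-cause rate ratio: {ratio:.2f} ({lower:.2f} to {upper:.2f})")
```

If deaths in ASP breast cancer cases were being systematically reassigned to other causes, the ratio would be expected to lie above 1 with an interval excluding 1; an interval that includes 1, as in this hypothetical example, gives no indication of differential classification.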
The issue was revisited in 2002, when we noted again that only those with breast cancer can die from the disease, and used this fact to demonstrate a significant reduction in mortality among breast cancer cases in the ASP compared with the PSP without classifying cause of death. 30 Thus, the mortality benefit was established without reliance on cause of death classification. Indeed, we found a significant reduction in all-cause mortality among breast cancer cases of 13–19%, depending on the method of lead time adjustment. 30 It is also worth noting that the Swedish overview published an excess mortality analysis that similarly avoided classifying cause of death, and found, essentially, the same result as when explicitly using breast cancer death as the endpoint. 31
Finally, in collaboration with the Swedish Overview, the Two-County investigators compared the cause of death, as determined by the local endpoint committee, with cause of death as classified in the Swedish overview. 32 All disagreements were investigated to establish the cause of disagreement. The authors concluded: ‘The vast majority of these pertained to a disagreement in inclusion/exclusion and not to disagreement in determination of cause of death.’ For example, the original trial defined eligibility on the basis of year of birth, whereas the overview used exact age. The results vindicated the original trial endpoint committee’s classification of cause of death. Despite this, in their paper on the closure screen of the control group, Autier et al. again raised this issue, 26 with no reference to publications providing evidence that the concern is misplaced.11,30,31
There is a minority view that all-cause mortality is an appropriate endpoint for cancer screening trials.4,7 This view is mistaken, as can be seen from the following example. There is currently considerable interest in the possibility of ovarian cancer screening, and the ovarian cancer mortality results of the UK Trial have recently been published. 33 Let us consider the implications of all-cause mortality as an endpoint in evaluation of ovarian cancer screening. Ovarian cancer is responsible for approximately 4% of all deaths in a typical middle-aged female population. Suppose that the effect of the offer of ovarian cancer screening were to reduce ovarian cancer mortality by 20%, without affecting deaths from other causes. In a very large trial with 100,000 all-cause deaths expected in the control group, the expected number of deaths in the study group would be 99,200 (0.04 × 0.2 × 100,000 = 800 fewer ovarian cancer deaths). Thus the all-cause mortality relative risk would be 0.992 with a 95% confidence interval (CI) = 0.9834–1.0008. That is, even with 100,000 expected all cause deaths in each arm, the variation in mortality from other causes in this example completely swamps the beneficial effect of screening. A study with 300,000 all cause deaths expected in each arm would, arguably, be powered for this effect, but it means that to evaluate ovarian cancer screening with an all-cause mortality endpoint, a trial with 12 million women is needed, 6 million in each arm, and follow-up such that 5% in each arm die from any cause. This demonstrates that all-cause mortality is an impractical, inefficient, and unaffordable endpoint in a trial of an intervention which can only be expected to affect a single (minority) cause. Thus, the effect of ovarian cancer screening on all-cause mortality is essentially unverifiable. 
The answer is surely to have death from ovarian cancer, or the sequelae of screening or treatment for ovarian cancer, as the endpoint, and to adopt very rigorous cause of death determination policies, with a high rate of autopsy if necessary.
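The ovarian cancer arithmetic above can be reproduced with a few lines of code. The sketch below uses the figures given in the text (4% of deaths due to the target cancer, a 20% reduction in that cause, and 5% of each arm dying during follow-up). The interval method is an assumption: the text does not say how its confidence interval was derived, so a log-scale Poisson approximation is used here, which agrees with the quoted interval to within rounding. The function name `all_cause_relative_risk` is hypothetical.

```python
import math


def all_cause_relative_risk(control_deaths, cause_fraction, cause_reduction, z=1.96):
    """All-cause relative risk and approximate 95% CI for a single-cause intervention.

    control_deaths: expected all-cause deaths in the control arm
    cause_fraction: share of all deaths due to the target disease
    cause_reduction: proportional reduction in deaths from that disease
    """
    prevented = cause_fraction * cause_reduction * control_deaths
    study_deaths = control_deaths - prevented
    relative_risk = study_deaths / control_deaths
    # Assumed method: Poisson counts, standard error on the log scale.
    se_log = math.sqrt(1.0 / control_deaths + 1.0 / study_deaths)
    lower = math.exp(math.log(relative_risk) - z * se_log)
    upper = math.exp(math.log(relative_risk) + z * se_log)
    return study_deaths, relative_risk, lower, upper


for control_deaths in (100_000, 300_000):
    study_deaths, rr, lower, upper = all_cause_relative_risk(control_deaths, 0.04, 0.20)
    women_per_arm = round(control_deaths / 0.05)  # 5% of each arm die in follow-up
    print(f"{control_deaths:,} control deaths: RR = {rr:.3f}, "
          f"95% CI {lower:.4f} to {upper:.4f}, women per arm = {women_per_arm:,}")
```

With 100,000 expected deaths per arm the interval includes 1, and only at around 300,000 deaths per arm does the upper limit fall below 1, which corresponds to the 6 million women per arm cited in the text.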
The use of all-cause mortality is sometimes advocated on the grounds of objectivity. While human judgement is required in all areas of medicine and health, it might be observed that the use of all-cause mortality distrusts and discards so many things that we know. These include the fact that only those with a cancer can die of it, only those irradiated can have a radiation-induced disorder, and so on. The way to effective evaluation is to use our knowledge when designing and running trials, not to throw it away on the assumption that proper safeguards cannot be implemented and regularly reviewed.
It is worth noting that, if a statistically significant effect on all-cause mortality (in those with and without the relevant disease) were the universal criterion by which medical interventions were assessed, in all likelihood, there are very few preventive public health interventions that would satisfy this criterion.
Overdiagnosis
In 2003, we published estimates from the Two-County Trial, and from a number of service screening sources, of the division of DCIS into progressive and non-progressive disease. 34 Results indicated that the majority of DCIS cases would have progressed to invasive disease if left untreated.
In 2005, in collaboration with the Gothenburg breast screening trial, the Two-County Trialists published estimates of overdiagnosis using the observed data in the trials to estimate rates of diagnosis of non-progressive (i.e. overdiagnosed) cancers, adjusting for progression rates and screening sensitivity in non-overdiagnosed cancers. 19 The latter estimates separate overdiagnosis from additional incidence due to lead time, while ensuring that overdiagnosed cases are not included in the estimation of lead time. This work and its results addressed the overdiagnosis questions previously raised,3,7 but despite this, some authors have continued to assert these same critiques, i.e. that screening in the Two County Trial resulted in substantial overdiagnosis.13,35 We estimated that 3–4% of cancers diagnosed at first screen, and less than 1% at subsequent screens, were non-progressive. In terms of DCIS, an excess incidence of around one per thousand recruits in the ASP was balanced by a deficit of similar size in invasive cancers, strongly suggesting that diagnosis of DCIS is preventing subsequent diagnosis of invasive breast cancer. Results from both trials were very similar.
We revisited the issue in 2012 using 29-year incidence data from one county of the Trial. 36 At the end of the 29 years, no excess incidence was observed in the ASP over the PSP, despite screening having started some years earlier in the ASP, and the ASP having had around 100,000 more screening episodes than the PSP.
All the above suggest that overdiagnosis, and in particular non-progressive breast cancer, is a rare phenomenon. The UK Independent Review arrived at larger estimates from follow-up of the Canadian and Malmö trials, which had reported no screening of the control groups, and for which excess long-term incidence in the screened groups might be interpreted as overdiagnosis. 17 However, the UK review did not include the HIP trial, which also did not screen the control group, and which showed only a very small excess incidence at long-term follow-up. 37 In the Malmö trial, screening continued in the study group beyond the nominal screening age, and in those for whom screening stopped at age 70, the excess incidence was only 1%. 38
It should be acknowledged that much of the focus of overdiagnosis research and discussion in recent years has been on its presence in service screening with mammography, rather than in the trials.13–15,18,20,21 It should also be noted that, given the unobservable nature of overdiagnosis, there is still scope for further quantification with long-term follow-up of current screening programmes. The point that adjusting for lead time based on all cases may result in over-adjustment, and hence underestimation of overdiagnosis, is fair. 13 However, the fact remains that when this point is honoured in analysis, and when adjustments are made for lead time and underlying incidence trends independent of screening, the resulting estimates of overdiagnosis are modest, at worst.19,34,39,40
Discussion and implications for future research
This paper, which formally inaugurates the Swedish Two-County Research Group after many years of productive collaboration, concentrates on three issues which have been a particular focus for published concerns about the validity of the Two-County Trial, and on the trialists’ willingness to meet these concerns head on. We note in passing that, in addition to this work, the trialists have carried out a considerable programme of downstream research into tumour biology, progression, and natural history.41–49 Our research will continue in this direction, to further inform prevention, diagnosis, and management of breast cancer.
Some general points can be made in relation to the subjects dealt with above. First, the concerns about design and analysis issues in the Swedish Two-County Trial have been answered repeatedly in the past, and there seems little point in further argument, or at the very least, argument that does not describe in detail why published explanations are insufficient to rule out these concerns.1,9,40,24,25 Secondly, the arguments about cause of death determination seem similarly pointless, particularly in view of the observation above that all-cause mortality is essentially useless in evaluation of screening for most cancers, and indeed for most health interventions.
As a case study of reviving issues long after their resolution, perhaps the most egregious example is the recent paper by Autier et al., 26 which raises again issues relating to the design feature of the closure screen of the PSP, potential imbalances between the ASP and PSP, and ascertainment of cause of death, with little acknowledgement that the questions posed have been repeatedly answered in previous publications, and makes a number of assumptions which a brief study of the literature would show are unwarranted. Specifically:
The authors assert, ‘During post-intervention periods, because screening (or absence of screening) activities are similar in the screening and in the control group, cancer detection rates in the two groups are also similar.’ This argument forms the basis for the claim that the closure screen of the PSP has biased the results in favour of screening, but is incorrect because in the screening group a large number of cancers will have been screen detected in the intervention period, which otherwise would have been detected later, in the post-intervention period. Cancer detection rates in the screening group will thus be lower than the rates in the control group in the post-intervention period, at least in the early years. Breast cancer mortality will, therefore, also be lower. As noted above, lower incidence after screening has been observed in trials and service screening programmes,1,27,28 and we have shown that the closure screen of the PSP is slightly conservative. 29
Even if the argument of Autier et al. against the closure screen of the PSP were accepted, their adjusted estimates are wrong, as they subtract the deaths from cancers diagnosed at that screen of the control group, but not those from corresponding cancers diagnosed contemporaneously in the study group. Of greater concern, the reduction in mortality from breast cancers prior to the closure screen of the control group has been in the public domain since 1985, 2 and was 31%, almost identical to that observed including the deaths from the closure screen and their counterparts in the study group. 1
The argument that the smaller numbers of advanced cancers in the study group contemporaneously with the control group’s closure screen somehow invalidate the design and analysis including such a closure screen again displays a fundamental misunderstanding of the influence of screening on incidence and mortality rates. The control group screen is a prevalent screen, whereas at the end of the screening phase, the study group is in incident screen mode. The appropriate comparison is with the prevalent screen of the study group, which shows very similar numbers to the control group closure screen. 1 This too has been in the public domain for decades.
The paper asserts that there are significant imbalances in missing values between ASP and PSP. The authors claim “the histological grade of cancers found during the Two-County trial was unknown for 19% of patients in the control group vs. 10% in the screening group (
The paper again raises the issue of potential bias in cause of death, without reference to work investigating this issue in detail and showing that the concerns were unwarranted.9,25,30,31
The above illustrates how easy it is to overlook the fact that speculative concerns about specific research projects have been raised and answered in the past. We do acknowledge that there is scope for further research on overdiagnosis, but we qualify this with two observations. First, methodologically sound estimates of overdiagnosis which take account of lead time and underlying incidence, and which avoid some of the mistakes of the past, yield estimates that are modest. 18 Second, it is worth noting that scepticism about the benefits of screening seems to be accompanied by credulity about extreme estimates of the quantity of non-progressive or regressive disease,8,10 despite there being very little evidence of such cases. 11
In our view, the time has come to put aside arguing about the past, and concentrate on the important issues for breast cancer control today. These include development of early detection protocols which better serve the population with dense breast tissue, and a greater understanding of the risk factors and potential preventive actions for non-hormone dependent breast cancers. Also, now that breast cancer is typically detected at an early stage, there is a need to develop ‘light-touch’ patient communication, treatment, and management protocols for breast cancers of low risk to life. We should also work towards better biological and prognostic stratification of breast cancer, to improve prevention, early detection, and treatment of breast cancer in the future.
Footnotes
Swedish Two-County research group
Laszlo Tabar (University of Uppsala, Sweden); Tony Hsiu-Hsi Chen, Chen-Yang Hsu, Wendy Yi-Ying Wu (National Taiwan University, Taipei, Taiwan); Amy Ming-Fang Yen, Sam Li-Sheng Chen (Taipei Medical University, Taipei, Taiwan); Sherry Yueh-Hsia Chiu (Chang Gung University, Tao-Yuan, Taiwan); Jean Ching-Yuan Fann (Kainan University, Tao-Yuan, Taiwan); Kerri Beckmann (University of South Australia, Adelaide, Australia); Robert A Smith (American Cancer Society, Atlanta, GA); Stephen W Duffy (Queen Mary University of London, UK).
Acknowledgements
Thanks are due to the women who took part in the Two-County Trial and the personnel of the Departments of Mammography in the two counties who carried out more than 200,000 mammography examinations with great skill and dedication.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: the American Cancer Society through a gift from the Longaberger Company’s Horizon of Hope Campaign®.
