
Abstract

Objectives

Data on the performance of health boards, hospitals and medical specialists, etc., are being collected at various levels in the health-care system and are often presented as league tables. These tables ignore natural variation and/or confounders, and this introduces uncertainty about their interpretation. The purpose of this study was to devise and illustrate a method to expose the real difference between the ratings in league tables.

Methods

Two values per rating were added to the league tables: the best-case scenario and the worst-case scenario. True performance will lie somewhere between these two values. The method is illustrated using data from the Dutch breast cancer screening programme.

Results

By focusing on one performance indicator and one confounder, it was possible to show shifts in the rating order of breast cancer screening units and thus expose the uncertainty about the true performance of each screening unit.

Conclusions

The worst-case and best-case scenario ratings demonstrated the uncertainty within the ratings of a league table. League tables should therefore only be used with great caution and after providing the public with sufficient information.

Introduction

In recent years there has been an increasing supply of and demand for data on the quality of care. Canada and the USA operate performance-reporting systems, and the UK government has published outcome data on National Health Service (NHS) organizations. These data range from indicators that measure an aspect of the process of care (such as practitioner adherence to clinical guidelines) and outcomes of care at hospitals (such as birth rates at in vitro fertilization [IVF] clinics¹) to the mortality rates associated with individual surgeons.² Important reasons to publish such data are to stimulate action to improve the quality of care while controlling the cost, to promote public trust by letting the public see what the services are achieving, and to support the selection of a care provider either by the patients or by the purchasers of health care, such as Medicare or Medicaid in the USA.

There is great concern about the manner in which the data are being presented and interpreted. In particular, a large difference in rank may reflect only trivial (i.e. not clinically meaningful) differences in actual outcome.³

A major disadvantage of presenting ratings alone is that valuable information is lost. It is impossible to see whether the difference in performance between two institutions with similar ratings is large or meaningless. When data are presented in the form of means, rates, odds ratios, etc., it is common to see a confidence interval, which is rarely the case with league tables. Therefore, league tables should be accompanied by an indication of their precision if they are to be of any help to patients, care providers, purchasers of care, or the government.

League tables are often used when an outcome measure is expressed as a total score. For example, the magazine U.S. News used a total score to compile a league table of American cancer hospitals.⁴ Their total score was calculated from nine subscores as varied as reputation (in %), number of discharges and status as an NCI cancer centre (yes/no). A difference of, for example, five points on this scale is impossible to interpret, so the scores were put into league tables. Confounding variables were ignored. A method is needed to expose the uncertainty within the rating of each individual institution.

In essence, differences in performance are caused by three factors:

(1) Quality, i.e. technical expertise and equipment, organizational aspects, quality indicators for the care or service delivered;

(2) Confounders, such as patient case mix;

(3) Natural variation.

If natural variation and confounders are filtered out, the real differences in performance will become apparent, instead of the observed differences. It is necessary to adjust for the confounders and such an adjustment can have a heavy impact on the league table. Natural variation is present in all data. These fluctuations in results are caused by chance variability and they increase as the number of observations decreases. Therefore, smaller institutions are more likely than larger ones to appear to over-perform or under-perform, even when their true performance is no different.
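The dependence of chance variability on size can be made concrete with the binomial standard error. The sketch below is illustrative only: the true detection rate of 4.7 per 1000 and the function name are assumptions, and the two sample sizes are simply the total numbers of women screened at a small and a large unit in Table 1.

```python
import math

def chance_range(true_rate, n_women, z=1.96):
    """Approximate 95% range of the observed rate if only chance is at work."""
    se = math.sqrt(true_rate * (1 - true_rate) / n_women)
    return true_rate - z * se, true_rate + z * se

true_rate = 0.0047  # assumed for illustration: 4.7 detections per 1000 women

for n_women in (17_812, 171_179):  # a small and a large screening unit
    low, high = chance_range(true_rate, n_women)
    print(f"n = {n_women}: {1000 * low:.1f} to {1000 * high:.1f} per 1000")
```

With the same underlying rate, the smaller unit can plausibly report a much wider range of observed rates than the larger one.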

This article explains how to expose the magnitude of the influence of natural variation on the league table, after adjustment for confounding variables. The method introduces the best-case scenario and the worst-case scenario into the rating, and is illustrated using data from 18 central screening units in the Dutch breast cancer screening programme. First, the best-case scenario and the worst-case scenario ratings at each central screening unit were calculated, then the calculation was repeated while adjusting for the age of the women.

Marshall and Spiegelhalter⁵ used another method to show the variability of the rankings, based on constructing credible intervals for the rankings. The results of the two methods are compared.

Presentation of the best-case scenario, the worst-case scenario and the adjusted rating will illustrate clearly the magnitude of the influence of natural variation on the league table. Patients and insurance companies can understand that the true rating may be as good as the best-case scenario, while the central screening unit can view the worst-case scenario as a warning: if no action is taken, next year’s rating may drop to this level.

Patients and Methods

Patients

The Dutch screening programme for breast cancer was gradually implemented over the period 1988–1997. Its main characteristics are the centralized organization (including centralized technical and medical quality control and audit), the two-year interval between screening examinations and the eligible age of 50–74 years. Technical and medical quality control are the responsibility of the National Expert and Training Centre. Data collection, evaluation and annual reporting of the results are performed by the National Evaluation Team.⁶ In both 2004 and 2005, almost 1.1 million women aged 50–74 years were invited for a screening examination. Participation was 81% and breast cancer was detected in approximately 5‰ of the invited women.⁷ It has been estimated that the national breast cancer screening programme reduces the number of women who die from breast cancer by 700 per year.⁸ In 2005, breast cancer mortality in women aged 55–74 years was 26% lower than in 1986–1988, i.e. before the start of the programme.⁷

The Dutch screening programme has a central organization that oversees nine screening regions. Each region has 1–5 central screening units. Radiologists evaluate the screening radiographs. Site visits to the central screening units are performed once every three years by the National Reference Centre. They monitor the performance of the radiologists and radiographers and check the technical and positioning quality of the mammograms. In addition, they review data from two screening rounds and a selection of cancer cases.⁶ The site visit report contains data on performance indicators, such as number of women screened, age, referral rate, detection rate and tumour stage distributions. Some of these data were used as a model in the present study.

In our analysis, we drew data on detection rates and age from the 18 central screening units that were visited in the period 2003–2005. We selected only women who had been screened at least once before. The women were divided into two age groups: 49–69 years and 70–74 years.

The raw data from each central screening unit are presented in Table 1: the number of women screened in the two age groups, the percentage of women in the younger age group, the detection rates in the two age groups and the overall detection rates.

Table 1

Dutch breast cancer screening data per screening unit

Central screening unit	Women screened 49–69 years	Women screened 70–74 years	% Young	Detection rate 49–69 years (‰)	Detection rate 70–74 years (‰)	Overall detection rate (‰)
1	62752	9102	87.3	4.1	8.6	4.7
2	105088	11311	90.3	5.7	13.3	6.5
3	20779	3283	86.4	3.7	9.1	4.4
4	24426	5425	81.8	3.6	4.4	3.8
5	46500	6675	87.4	3.8	10.6	4.7
6	86347	10702	89.0	4.4	8.4	4.8
7	129327	41852	75.6	4.1	3.9	4.1
8	77154	15734	83.1	4.3	8.3	4.9
9	126093	16623	88.4	4.6	8.3	5.0
10	127685	23210	84.6	4.8	8.7	5.4
11	84847	12961	86.7	4.6	8.1	5.1
12	135264	35589	79.2	3.7	4.5	3.9
13	41022	7048	85.3	4.4	6.5	4.7
14	14475	3337	81.3	4.3	5.4	4.5
15	15810	3765	80.8	4.4	4.5	4.4
16	86744	14317	85.8	4.1	7.4	4.6
17	41467	7256	85.1	4.2	6.6	4.6
18	44046	6804	86.6	4.1	7.9	4.6
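The overall detection rate in the last column of Table 1 is the average of the two age-specific rates, weighted by the number of women screened in each age group. The short sketch below recomputes it from the rounded table values, so small rounding differences are to be expected; the function name is ours and purely illustrative. It also hints at why age can act as a confounder: the overall rate of a unit depends on its age mix as well as on its age-specific rates.

```python
def overall_rate(n_young, rate_young, n_old, rate_old):
    """Weighted average of two age-specific detection rates (per 1000 women)."""
    return (n_young * rate_young + n_old * rate_old) / (n_young + n_old)

# Screening units 7 and 1 from Table 1 (rates are the rounded published values).
print(round(overall_rate(129327, 4.1, 41852, 3.9), 2))
print(round(overall_rate(62752, 4.1, 9102, 8.6), 2))
```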

Methods

The first part of this section explains the general method when the outcome variable has not been adjusted for confounding variables. This method was explained previously by Lemmers et al.,⁹ but is repeated here for the convenience of the reader and because it makes the new part, how to adjust the outcome variable for confounding variables, easier to comprehend. Calculations are shown for the rating, the best-case scenario and the worst-case scenario for the rating. The second part illustrates how these values can be calculated after the outcome variable has been adjusted for confounders.

(A) When the outcome variable has not been adjusted

To calculate the rating, the best-case scenario and the worst-case scenario for the rating of a screening unit (a clinic or a medical specialist), the following is needed:

• Estimates of the performance of the screening unit compared with each of the other screening units (e.g. an odds ratio, or a difference between means);

• Reasonable estimates of the lower bound of the performance (e.g. the lower bound of a 95% confidence interval in the case of an odds ratio, or the lower bound of a 95% confidence interval in the case of a difference between means);

• Reasonable estimates of the upper bound of the performance (e.g. the upper bound of a 95% confidence interval in the case of an odds ratio, or the upper bound of a 95% confidence interval in the case of a difference between means).

If, for example, the upper bound of the 95% confidence interval of the odds ratio for detection at a particular screening unit versus another screening unit is larger than one, we interpret this as ‘When the first screening unit has the benefit of the doubt compared with the second screening unit, it is more likely that breast cancer will be detected in the first screening unit than in the second screening unit’. We then say that in the best-case scenario the performance of the first screening unit is better than that of the second screening unit.

Similarly, if the lower bound of the 95% confidence interval of the odds ratio for detection at a particular screening unit versus another screening unit is larger than one, we interpret this as ‘When the first screening unit has everything going against it compared with the second screening unit, it is still more likely that breast cancer will be detected at the first screening unit than at the second screening unit’. We then say that in the worst-case scenario the performance of the first screening unit is better than that of the second screening unit.

We performed a logistic regression analysis on the detection rate as outcome variable with one covariate, namely screening unit (as a class variable). For each screening unit, we obtained the odds ratios for detection of cancer at this unit versus detection at each of the other screening units, with corresponding 95% confidence intervals. As relatively few of the women received a screen-positive result (around 5‰), these odds ratios can be interpreted as relative risks.
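With screening unit as the only covariate, the odds ratio between two units obtained from the logistic regression equals the odds ratio of the corresponding 2 × 2 table, and the usual Wald interval can be computed directly. The sketch below does this for one pair of units. The case counts are not given in Table 1; they are approximated here as overall detection rate × women screened, rounded, so the numbers are an illustrative reconstruction and not the authors' data. The function name is hypothetical.

```python
import math

def odds_ratio_ci(cases_a, n_a, cases_b, n_b, z=1.96):
    """Odds ratio for detection at unit A versus unit B, with a Wald interval."""
    a, b = cases_a, n_a - cases_a          # unit A: detected, not detected
    c, d = cases_b, n_b - cases_b          # unit B: detected, not detected
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

# Approximate counts reconstructed from Table 1 (an assumption, see above):
# unit 2: 116399 women, about 6.5 per 1000 detected -> roughly 757 cases
# unit 4:  29851 women, about 3.8 per 1000 detected -> roughly 113 cases
estimate, lower, upper = odds_ratio_ci(757, 116399, 113, 29851)
print(f"OR = {estimate:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

Because the lower bound lies above one in this reconstruction, unit 2 would be regarded as performing better than unit 4 even in its worst-case scenario.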

Next, we calculated the unadjusted rating, the best-case scenario and the worst-case scenario for the rating of each screening unit. The unadjusted rating of the first screening unit was 1 + the number of screening units whose performance was better than that at screening unit 1. The best-case scenario for the rating of the first screening unit was 1 + the number of screening units whose performance was better than that at screening unit 1, even when screening unit 1 was given the benefit of the doubt. This was 1 + the number of screening units for which the 95% upper bound of the odds ratio for detection at screening unit 1 versus detection at this particular screening unit was smaller than 1. Similarly, the worst-case scenario for the rating of the first screening unit was 1 + the number of screening units whose performance was better than that at screening unit 1 when everything was going against screening unit 1. This was 1 + the number of screening units for which the 95% lower bound of the odds ratio for detection at screening unit 1 versus detection at this particular screening unit was smaller than 1.
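The counting rules above can be sketched in a few lines. The example uses four hypothetical units (A to D) with invented case counts, and computes the pairwise odds ratios from 2 × 2 tables with Wald intervals, which is assumed here to stand in for the logistic regression output; all names are illustrative.

```python
import math

def odds_ratio_ci(cases_a, n_a, cases_b, n_b, z=1.96):
    """Odds ratio for detection at unit A versus unit B, with a Wald interval."""
    a, b = cases_a, n_a - cases_a
    c, d = cases_b, n_b - cases_b
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or), math.exp(log_or - z * se), math.exp(log_or + z * se)

def scenario_ratings(units, z=1.96):
    """Return {unit: (rating, best_case, worst_case)} from pairwise odds ratios."""
    result = {}
    for name, (cases, n) in units.items():
        rating = best = worst = 1
        for other, (other_cases, other_n) in units.items():
            if other == name:
                continue
            estimate, lower, upper = odds_ratio_ci(cases, n, other_cases, other_n, z)
            rating += estimate < 1  # other unit is better on the point estimate
            best += upper < 1       # other unit is better even with the benefit of the doubt
            worst += lower < 1      # other unit is better when everything goes against this unit
        result[name] = (rating, best, worst)
    return result

# Hypothetical data: unit -> (cancers detected, women screened).
units = {"A": (650, 100000), "B": (500, 100000), "C": (52, 10000), "D": (300, 100000)}

for name, (rating, best, worst) in scenario_ratings(units).items():
    print(f"unit {name}: rating {rating}, best case {best}, worst case {worst}")
```

In this toy example the small unit C has the widest range, because its odds ratios against the other units are estimated least precisely.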

(B) When the outcome variable has been adjusted

When the outcome variable has been adjusted for one or more confounders, estimates of the differences in performance are used after adjustment for confounding covariates. A continuous outcome variable, such as length of hospitalization, can be adjusted using linear regression. A dichotomous outcome variable can be adjusted using logistic regression, as we show below using the same breast cancer screening data as before.

We performed a logistic regression analysis on the outcome variable breast cancer detection rate, with two covariates, namely age group and screening unit (as class variables). We obtained from every screening unit the odds ratio for detection of cancer at this unit versus detection at each other screening unit, adjusted for age, with corresponding 95% confidence intervals. Then, using the same method as in the previous section, we calculated the rating, the best-case scenario and the worst-case scenario of each screening unit, based on the odds ratios for breast cancer detection adjusted for age.
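To show what adjustment for age group does to a pairwise comparison, the sketch below pools the age-specific log odds ratios with inverse-variance weights (Woolf's method). This is a simpler stratified estimator, swapped in for illustration; it is not the logistic regression used in this study, although both target an age-adjusted odds ratio when the odds ratio is similar across age groups. The two units and their counts are invented so that the age-specific detection rates are identical and only the age mix differs.

```python
import math

def pooled_odds_ratio_ci(strata, z=1.96):
    """Inverse-variance (Woolf) pooled odds ratio over strata.

    Each stratum is (cases_a, n_a, cases_b, n_b) for unit A versus unit B.
    """
    weighted_sum = total_weight = 0.0
    for cases_a, n_a, cases_b, n_b in strata:
        a, b = cases_a, n_a - cases_a
        c, d = cases_b, n_b - cases_b
        log_or = math.log((a * d) / (b * c))
        weight = 1 / (1 / a + 1 / b + 1 / c + 1 / d)  # inverse of the variance
        weighted_sum += weight * log_or
        total_weight += weight
    pooled = weighted_sum / total_weight
    se = math.sqrt(1 / total_weight)
    return math.exp(pooled), math.exp(pooled - z * se), math.exp(pooled + z * se)

# Hypothetical units X and Y: 4 per 1000 detected among younger women and
# 8 per 1000 among older women at both units, but Y screens more older women.
by_age = [(360, 90000, 280, 70000),   # younger age group
          (80, 10000, 240, 30000)]    # older age group
crude = [(440, 100000, 520, 100000)]  # both age groups combined

print("crude:    OR %.2f (%.2f to %.2f)" % pooled_odds_ratio_ci(crude))
print("adjusted: OR %.2f (%.2f to %.2f)" % pooled_odds_ratio_ci(by_age))
```

The crude comparison suggests that unit Y detects more cancers, whereas the age-adjusted comparison shows no difference, which is the kind of shift that can change a unit's position in the league table.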

Results

Table 2 summarizes the results at each screening unit. Except for screening units 4, 7 and 12, all the screening units had a best-case scenario of 1, 2 or 3 before adjustment for age. After adjustment, the ratings of 10 out of the 18 central screening units deviated by a maximum of two positions. The screening units with different ratings after adjustment are shown in bold-type. Even after adjustment, except for screening units 4, 7 and 12, all the screening units had a best-case scenario rating of 1, 2 or 3.

Table 2

Dutch breast cancer screening unit league table, with rating, best-case scenario and worst-case scenario ratings, before and after adjustment for age

Screening unit	Unadjusted rating	Best-case scenario – worst-case scenario	Ratings adjusted for age*	Best-case scenario – worst-case scenario adjusted for age	Credible intervals (unadjusted)
2	1	1–1	1	1–1	1–1
10	2	2–9	2	2–11	2–5
11	3	2–15	4	2–15	2–11
9	4	2–15	3	2–15	2–10
8	5	2–15	6	2–15	2–12
6	6	2–15	5	2–15	3–13
13	7	2–17	9	2–15	3–15
5	8	3–17	7	2–15	3–15
1	9	3–15	8	3–15	4–15
18	10	3–17	10	3–15	3–16
16	11	3–17	11	3–15	5–15
17	12	3–17	12	3–16	4–16
14	13	2–18	14	2–18	2–18
15	14	2–18	15	2–18	3–18
3	15	3–18	13	2–18	3–18
7	16	8–18	16	13–18	13–18
12	17	13–18	17	13–18	14–18
4	18	8–18	18	12–18	12–18

Bold type indicates different ratings after adjustment

Figure 1 shows the adjusted rating and the range from the best-case scenario to the worst-case scenario of each screening unit. The lower endpoint of a vertical line represents the best-case scenario, while the upper endpoint indicates the worst-case scenario of the unadjusted rating. The adjusted rating is indicated with a horizontal line.

Figure 1

Graphic representation of the performance of 18 Dutch breast cancer screening units according to their detection rate adjusted for age

Comparison with the credible interval method

In this section, we compare the results of our method with the credible interval method. This method was used by Marshall and Spiegelhalter,⁵ who studied the reliability of league tables of IVF clinics. When the credible interval method is used, bootstrap samples are drawn and the resulting replications are analysed. The results of the comparison are shown in the sixth column of Table 2.

The credible intervals are in general somewhat narrower than the best-case worst-case intervals. To understand why this is the case, we consider an example in which all units have the same size. Suppose that the lower limit of the 95% confidence interval of the difference between the best scoring unit B and another unit U is just below zero. Then the best-case scenario of unit U is 1. If there are only two units, unit U is then expected to have ranking 1 in just over 2.5% of the replicates, so the upper limit of the credible interval for U is 1. However, when there are more units, this may reduce the proportion of replicates in which U has ranking 1 to less than 2.5%. As a consequence, the upper limit of the credible interval for the ranking of unit U will not be 1, but the upper limit of the best-case worst-case interval will be.
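The idea of an interval for a ranking can be sketched with a simple resampling scheme: draw a new detection count for every unit, rank the units, repeat many times and take the 2.5th and 97.5th percentiles of each unit's ranks. The sketch below is an assumption-laden simplification, not the procedure of Marshall and Spiegelhalter: it resamples counts from a normal approximation to the binomial distribution, and the units, counts and function name are hypothetical.

```python
import random

def rank_intervals(units, n_rep=2000, seed=1):
    """Return {unit: (2.5th, 97.5th percentile of rank)} over resampled data."""
    rng = random.Random(seed)
    names = list(units)
    ranks = {name: [] for name in names}
    for _ in range(n_rep):
        rates = {}
        for name in names:
            cases, n = units[name]
            p = cases / n
            sd = (n * p * (1 - p)) ** 0.5            # binomial standard deviation
            rates[name] = rng.gauss(cases, sd) / n   # resampled detection rate
        ordered = sorted(names, key=rates.get, reverse=True)
        for position, name in enumerate(ordered, start=1):
            ranks[name].append(position)
    intervals = {}
    for name in names:
        r = sorted(ranks[name])
        intervals[name] = (r[int(0.025 * n_rep)], r[int(0.975 * n_rep) - 1])
    return intervals

# Hypothetical data: unit -> (cancers detected, women screened).
units = {"A": (650, 100000), "B": (500, 100000), "C": (52, 10000), "D": (300, 100000)}

for name, (low, high) in rank_intervals(units).items():
    print(f"unit {name}: rank interval {low} to {high}")
```

Because each replicate ranks all units jointly, a unit reaches rank 1 only when it beats every other unit in that replicate, which is one way to see why these intervals can be narrower than the best-case worst-case intervals.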

Discussion

In a league table, it is impossible to perceive whether there is any real difference between the ratings. Differences in rating between e.g. two institutions, centres or units, might be caused by confounding variables and/or by natural variation. Adding the best-case and worst-case scenarios to the adjusted rating will show the magnitude of the influence of natural variation on the league table, while adjusting for confounders. This puts the values and positions in better perspective, because we can see whether there is any real difference. We saw that there were sometimes large differences in best-case scenarios and worst-case scenarios between the screening units in the league table, which meant that there was great uncertainty about the true position of each particular screening unit.

The best-case scenario rating of a screening unit indicates the position that a screening unit would take in the league table if it had everything going for it. The worst-case scenario rating indicates the position that a screening unit would take if it had everything going against it. Thus, the true rating of a screening unit can be expected to lie somewhere between the best-case scenario and the worst-case scenario.

Besides natural variation, another well-known cause of the differences in performance is confounding. In the example, adjustment was made for age, but several other factors can influence the results and differ between screening units. We only adjusted for age and not for other factors, because we wanted to give a clear illustration of our method. Extra adjustments would probably influence the ratings. Table 2 shows for example that there were differences between the league tables before and after adjustment: screening unit 3 climbed 2 places after adjustment.

The selection of data from the breast cancer screening programme over-simplified the comparison between screening units. Performance is not only determined by detection rates, but also by indicators, such as:

• the percentage of women in the screening population who are screened at the unit;

• the percentage of women who are referred for further tests and are diagnosed with breast cancer;

• the size and stage of the screen-detected breast malignancies: e.g. the performance of a screening unit that detects mostly small malignancies is better than that of a screening unit that detects mostly large malignancies when they have the same detection rate;

• the number of women diagnosed with interval cancer;

• and several other factors.

These performance indicators are known to vary between screening units. Typically, however, there is no consistent pattern in which one unit outperforms the others on all the indicators. In addition, the number of screen-positive women should be weighed against the number of women who are referred for further tests but do not have breast cancer (false-positive) and the number of women diagnosed with interval breast cancer (false-negative). A review study by Otten et al. on interval and screen-detected breast malignancies showed a delicate balance between recall, detection and false-positive rates.¹⁰

A reliable total score that gives a good reflection of the performance of a screening unit consists of a combination of several scores on each part of the chain of care, e.g. scores on the radiologist, the technical quality of the mammograms and the level of compliance with guidelines. This total score should be adjusted for some factors, such as the time period in which reporting took place. Interpretation of total scores (and their confidence intervals) is notoriously difficult: what does a difference of 5 points mean in practice? Thus, the only option is to resort to a league table anyway. If the best-case scenario and worst-case scenario are also presented, then the range of statistical dispersion will be easier to interpret.

There are several alternatives to our method. A more conservative or less conservative method can be chosen depending on the situation. If the goal is to expose even relatively small differences in the league table, even at the cost of some additional false alarms, the credible interval method⁵ should be used, or our method with, e.g., 90% confidence intervals. If the goal is to prevent premature action, our method could be used with Bonferroni correction to correct for the multiple comparisons that are being made, or with wide confidence intervals (e.g. 99%).
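How conservative each choice is can be read from the normal quantile that multiplies the standard error in a Wald interval. The sketch below computes that multiplier for 90%, 95% and 99% intervals and for a Bonferroni-corrected 95% interval. Correcting over all 153 pairs of the 18 units is one possible reading and is an assumption made here; correcting over the 17 comparisons per unit would be another. The function name is illustrative.

```python
from statistics import NormalDist

def z_multiplier(level, n_comparisons=1):
    """Two-sided normal quantile for a confidence level, Bonferroni-corrected."""
    alpha = (1 - level) / n_comparisons
    return NormalDist().inv_cdf(1 - alpha / 2)

n_units = 18
n_pairs = n_units * (n_units - 1) // 2  # number of pairwise comparisons

for label, z in [("90%", z_multiplier(0.90)),
                 ("95%", z_multiplier(0.95)),
                 ("99%", z_multiplier(0.99)),
                 ("95% with Bonferroni", z_multiplier(0.95, n_pairs))]:
    print(f"{label}: z = {z:.2f}")
```

A larger multiplier widens every pairwise interval, so fewer bounds fall on one side of 1 and the best-case worst-case ranges become wider.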

Confidence intervals can also be presented with the (adjusted) detection rate at each screening unit. This has the advantage that the differences in performance become visible, which provides more scope than ratings alone. However, there is sometimes more interest in the league table than in any differences. Therefore, it is necessary to develop methods that give an accurate portrayal of the spread and spacing of ratings. The results of these methods should be easy to interpret, so that non-statisticians also have access to adequate information.

Conclusions

The method illustrated here is easy to implement and the results are easy to interpret. By adding a best-case scenario and a worst-case scenario, the screening units were rated according to their performance, without doing them any injustice, because natural variation and confounders were incorporated into these additional ratings. If the differences are real, even after adjustment, our method distinguishes between screening units whose performance is significantly poorer (or better). If natural variation is the only cause of the observed differences, our method prevents any premature judgement and thus avoids unnecessary anxiety, stigma and guilt. These are important improvements, for patients, care providers and purchasers of health care.

References

1. The European IVF-monitoring programme (EIM) for the European Society of Human Reproduction and Embryology (ESHRE); Andersen AN, Gianaroli L, Felberbaum R, de Mouzon J, Nygren KG. Assisted reproductive technology in Europe, 2002. Results generated from European registers by ESHRE. Hum Reprod 2006; 21: 1680–97.

2. Hannan, Wu, Ryan. Do hospitals and surgeons with higher coronary artery bypass graft surgery volumes still have lower risk-adjusted mortality rates? Circulation 2003; 108: 795–801.

3. Botha, Silcocks, Bright, Redgrave. Breast and cervical cancer survival making sense of league tables. Public Health 2001; 115: 165–72.

4. US News. Best Hospitals 2007: Cancer http://www.usnews.com/usnews/health/best-hospitals/rankings/specihqcanc.htm (last accessed 3 November 2008)

5. Marshall, Spiegelhalter. Reliability of league tables of in vitro fertilization clinics: retrospective analysis of live birth rates. BMJ 1998; 316: 1701–5.

6. Holland, Rijken, Hendriks. The Dutch population-based mammography screening: 30-year experience. Breast Care 2007; 2: 12–8.

7. National Evaluation Team for Breast cancer screening. Interim report 2006. Main results of the breast cancer screening programme in the Netherlands. 2006

8. Verbeek ALM, Broeders MJM on behalf of the National Evaluation Team for Breast Cancer Screening and the National Expert and Training Centre for Breast Cancer Screening. Evaluation of The Netherlands Breast Cancer Screening Programme. Ann Oncol 2003; 14: 1203–5.

9. Lemmers, Kremer JAM, Borm. Incorporating natural variation into IVF clinic league tables. Hum Reprod 2007; 22: 1359–62.

10. Otten, Karsemeijer, Hendriks. Effect of recall rate on earlier screen detection of breast cancers based on the Dutch performance indicators. J Natl Cancer Inst 2005; 97: 748–54.

League Tables of Breast Cancer Screening Units: Worst-case and Best-case Scenario Ratings Helped in Exposing Real Differences between Performance Ratings
