Abstract
Centres of Excellence (CEs) are thought to provide better-quality services for their speciality than Generic Services (GS). However, clinical test theory suggests this apparent advantage may arise from differences in the prevalence of the speciality's conditions in the two services' referral populations, which affects their ability to detect diagnoses accurately even when their diagnostic sensitivities and specificities are similar. Furthermore, GS' insensitivity to rarer diagnoses is a necessary protection against serious overdiagnosis, even when their skills are equivalent to CEs'. Good GS can perform as well as CEs for disorders of 15% to 20% or greater prevalence in their referral populations, depending on the Minimal Clinically Important Difference (MCID) decided for their diagnoses' positive predictive values or degree of bias. CEs are necessary for rare disorders and have a role in determining MCIDs and the sensitivity and specificity of new measures. Sensitivity, specificity, positive & negative predictive values, and true diagnostic prevalence should be routine outcome measures.
Centres of excellence may be defined as …a program within a healthcare institution which is assembled to supply an exceptionally high concentration of expertise and related resources centered on a particular area of medicine, delivering associated care in a comprehensive, interdisciplinary fashion to afford the best patient outcomes possible.1
It is received wisdom that such centres are better than more general ones, and medical policy is usually biased towards their expansion via centralisation,2,3 despite the risk that remote areas may lose out.4 Centres of excellence (henceforth CEs) are essential to the "hub and spoke" model of service delivery, which includes medical training.5–7 However, the populations served by "hubs" and "spokes" may differ.8 While "spokes" can improve reach, especially using telemedicine,7,9 more in-depth training takes place in the "hub", with both "push" and "pull" factors orientating students hubwards10 as a source of excellence, so tailoring their skills to those most relevant to the hub. CEs appear expensive if the costs of teaching centres are treated as a proxy measure: in neurosurgery, costs in teaching hospitals are 21% greater than in non-teaching ones,11 and in orthopaedics 33% greater, or more than eight times the upper limit for cost-effective care once improvement in outcomes is considered.12

Empirical research on CEs is limited by a paucity of studies3 and by domain, as research has tended to concentrate on acute medicine and surgery.1,13,14 In the developing world, the picture regarding their use is confusing: depending on setting and role, substituting nurses for physicians, or the reverse, may be more effective.15 Two broad themes seem to be emerging. Promising results at the local level do not generate the same benefit nationally,2,16 and CEs' benefits are less detectable in more common conditions.17 Both trends resemble what has been called the "scale-up penalty", the tendency for interventions' effects to attenuate as they come to be delivered at scale. This is presumed to reflect loss of fidelity, reduced investment, and policy inefficiencies, echoing the claims made for CEs over Generic Services (GS). It is said to be prevalent in, for example, early-years interventions,18 but variation in research approaches has made it impossible to quantify an overall effect,19 and empirical studies have not always detected it.20
This literature suggests the development of CEs and their associated hub-and-spoke models should be conditional rather than universal, but there is currently no clinical basis for making these choices. However, with some assumptions, clinical test theory (discussed below) can be used to set choice criteria.
Centres of excellence as diagnostic decision-makers
In psychiatry, there is a long tradition of understanding diagnosis as a disguised indication for treatment.21 From this perspective, with optimal treatment pathways established, a centre of excellence will be no better than the diagnoses it confirms.
This is equivalent to saying that a diagnosis can be considered a test for a treatment: with the diagnosis obtained, optimal treatment can begin.
Determinants of effective testing in clinical test theory
Clinical Test Theory may be defined as the application of probability theory to clinical testing. It derives a set of parameters, listed below, that enable estimation of a test's utility in clinical practice (a computational sketch follows the list).

• Reliability is the likelihood of two measures of the same thing giving the same result, whether in the hands of different observers (inter-rater reliability) or at different times (repeat, or test-retest, reliability). It reflects the amount of random error associated with the measure. Because such error is inescapable, it sets a ceiling on how far a measure can be trusted: no measure can be more valid than it is reliable.
• Validity is the extent to which the measure reflects what it is supposed to. Where reliability captures random error, validity captures any bias a clinical test may show, i.e., systematic error. Adjusting for validity can therefore never compensate for unreliability. Bias can arise for many reasons, so there are many forms of validity, with various means of detection.
• Sensitivity is the likelihood of a measure detecting what it is supposed to detect (the true positive rate). Both reliability and validity contribute to sensitivity.
• Specificity is the likelihood of a measure correctly ruling out those without the condition (the true negative rate).
• Positive Predictive Value (PPV) is the likelihood of a positive result identifying a case correctly in a population. It depends on the measure's sensitivity, its specificity, and the prevalence of cases.
• Negative Predictive Value (NPV) is the likelihood of a negative result identifying a non-case (i.e., a case without the diagnosis of interest) correctly in a population. A clinically useful diagnosis should have high values of both PPV and NPV.
• Accuracy is the proportion of cases that have been correctly classified. Without bias, it should equal reliability, as the mean error rate should be constant and normally distributed.
• Bias is the tendency of the test to over- or under-identify cases, so it measures validity, as discussed above. An unbiased test's proportion of positive diagnoses should equal its population's prevalence.
• Minimal Clinically Important Difference (MCID) is the smallest difference in values of a measurement that identifies a clinically meaningful difference. It can never be below the limit set by a measure's reliability, and it includes any bias.
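The sketch below, in base R, computes each of these parameters from a hypothetical 2×2 table of diagnostic decisions; the counts are illustrative only, not data from the paper.

```r
# A minimal sketch (base R): the test-theory parameters above, computed
# from a hypothetical 2x2 confusion table of diagnostic decisions.
tp <- 40   # true positives  (diagnosis given, condition present)
fp <- 15   # false positives (diagnosis given, condition absent)
fn <- 10   # false negatives (diagnosis withheld, condition present)
tn <- 935  # true negatives  (diagnosis withheld, condition absent)
n  <- tp + fp + fn + tn

sensitivity <- tp / (tp + fn)        # P(positive | condition present)
specificity <- tn / (tn + fp)        # P(negative | condition absent)
ppv         <- tp / (tp + fp)        # P(condition present | positive)
npv         <- tn / (tn + fn)        # P(condition absent | negative)
accuracy    <- (tp + tn) / n         # proportion correctly classified
prevalence  <- (tp + fn) / n         # true proportion with the condition
bias        <- (tp + fp) / (tp + fn) # >1 over-identifies, <1 under-identifies

round(c(sens = sensitivity, spec = specificity, PPV = ppv,
        NPV = npv, acc = accuracy, prev = prevalence, bias = bias), 3)
```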
Estimating the difference between centres of excellence and generic services
CEs for any diagnosis will see more of it than services offering generic care. It follows that, for any diagnosis, there will be a difference in prevalence between its CEs and GS. From the perspective of clinical decision-making, what matters is the PPV and NPV of the diagnoses made, as these determine the clinical intervention implemented: we want services to apply diagnosis-related interventions to patients with the diagnosis and other interventions to those without it. The PPV and NPV may be estimated from diagnostic sensitivity, specificity and true prevalence as follows:

PPV = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence))

NPV = (specificity × (1 − prevalence)) / (specificity × (1 − prevalence) + (1 − sensitivity) × prevalence)
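A minimal sketch of the two formulas as an R helper; the function name and example inputs are illustrative, not taken from the paper.

```r
# PPV and NPV from sensitivity, specificity and true prevalence,
# per the formulas above.
predictive_values <- function(sens, spec, prev) {
  ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
  npv <- (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)
  c(PPV = ppv, NPV = npv)
}

# Example: the paper's upper-bound parameters at 10% referral prevalence.
round(predictive_values(sens = 0.8, spec = 0.95, prev = 0.10), 2)
#> PPV = 0.64, NPV = 0.98 (matching the 64% PPV quoted in the Results)
```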
It follows that even if CEs have similar levels of skill to GS (i.e., their diagnoses have similar sensitivities and specificities), they will seem to perform better because they will have a higher prevalence of their specialist conditions.
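To make the point concrete, here is a small follow-on using the predictive_values() helper sketched above; the 30% and 5% referral prevalences are hypothetical, not figures from the paper.

```r
# Same skill, different case-mix: identical sensitivity and specificity,
# but hypothetical referral prevalences of 30% (CE-like) and 5% (GS-like).
round(predictive_values(sens = 0.8, spec = 0.95, prev = 0.30), 2)  # PPV 0.87
round(predictive_values(sens = 0.8, spec = 0.95, prev = 0.05), 2)  # PPV 0.46
```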
An MCID related to prevalence, distinguishing CEs from GS, could then be used to determine whether developing a CE and its spokes is worthwhile, provided we assume that all factors other than prevalence are captured in services' diagnostic sensitivity and specificity. I return to this assumption below.
Estimation
As the use of clinical test theory is novel in this context, its parameters must be assumed a priori or derived from plausible findings in the literature.
Diagnostic sensitivity & specificity
Services diagnose, so these are properties of services. Assume, as suggested above, that they behave like share prices, i.e., summary statistics capturing all relevant service information. Six-month mortality data for hip fracture, a common condition, comparing major teaching and non-teaching hospitals suggests a 1% difference in performance between CEs and GS.12 As discussed, outcome differences are smaller for more common conditions, where prevalence differences between service types shrink. Assuming that treatment is broadly equivalent contingent upon diagnosis (discussed below), the 1% difference in outcomes is less than the 5% margin of uncertainty used in the estimations, so CEs and GS were not parameterised separately.
Studies of sensitivity and specificity show wide variation. The only study to examine the relationship between sensitivity, specificity and prevalence across diagnoses in an epidemiological sample22 reported strong correlations between true prevalence and sensitivity and specificity (.55 and −.85, respectively, for all billing physicians in Quebec; see Figure 3 below). The direction of the correlations suggests an expectancy effect: physicians consider rare diagnoses less frequently, avoiding false positives at the price of reduced sensitivity. Two sets of sensitivities and specificities were modelled. Plausible estimates of sensitivity and specificity were derived from population studies of common conditions, as CEs should concentrate rare conditions in their referral populations. A range of 0.5 to 0.8 for sensitivity, and 0.6 to 0.95 for specificity, was consistent with findings for common disorders in general medicine,22,23 surgery24 and psychiatry.25,26 Then the floor prevalence, sensitivity and specificity from reference 22 were used to estimate plausible effects of the postulated expectancy effect.
The changes in positive and negative predictive values, accuracy and bias at different sensitivities and specificities across prevalence were plotted using the "riskyr" package in R.27
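For readers wishing to reproduce such plots, the call below is a speculative reconstruction: riskyr does provide prevalence-curve plotting, but the exact argument names (what, uc, log_scale) are my assumptions about its API rather than details given in the paper.

```r
library(riskyr)  # the riskyr package cited in the text

# Hypothetical reconstruction of the Figure 1 style of plot:
plot_curve(prev = 0.10, sens = 0.80, spec = 0.95,
           what = c("PPV", "NPV", "acc", "ppod"),  # ppod = proportion positive
           uc = 0.05,          # +/-5% uncertainty band, as in the figures
           log_scale = TRUE)   # logarithmic prevalence axis for rare diagnoses
```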
Results
Figure 1 below reports PPVs, NPVs, accuracy and proportion of positive diagnoses on a logarithmic prevalence scale, allowing examination of these measures under conditions of rarity. For ease of interpretation, only the lower and upper bounds of the estimated sensitivity and specificity ranges are reported, together with 95% confidence bands.

Figure 1. Positive & negative predictive values, accuracy & proportion of positive diagnoses for rare diagnoses with 5% uncertainty.
The lower bounds of sensitivity (0.5) and specificity (0.6) performed, unsurprisingly, very badly, with enormous positive bias apparent from the very low PPV percentages, even in the presence of apparently high NPV. A hundred-fold increase in true prevalence, from 0.1% to 10%, was associated with only a 1% increase in the number of true positive cases and, because so many diagnoses were false positives (outnumbering true positives more than fourfold at 10% prevalence), with a decrease in overall accuracy.
At the upper bounds (sensitivity of 0.8 and specificity of 0.95), performance improved, but the impact of prevalence could still be seen, most obviously in the much wider uncertainty around PPV between 0.01% and 50%. While positive bias remained high at very low prevalence, it dropped from roughly 50-fold at 0.1% prevalence to around 5-fold at 1% and an excess of about a quarter at 10%, with decreasing uncertainty. Accuracy remained high, though the increasing proportion of positive but uncertain cases reduced it as the proportion of positive cases increased.
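These bias figures can be checked by direct computation. A base-R sketch, where "bias" is the ratio of the proportion of positive diagnoses to the true prevalence:

```r
# Positive bias at the upper bounds (sens = .80, spec = .95):
sens <- 0.80; spec <- 0.95
prev <- c(0.001, 0.01, 0.10)
prop_positive <- sens * prev + (1 - spec) * (1 - prev)  # all positive diagnoses
round(prop_positive / prev, 2)
#> 50.75  5.75  1.25  -- roughly 50-fold at 0.1%, 5-fold at 1%,
#> and about a quarter in excess at 10%, as stated above.
```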
Figure 2 below shows the same metrics across the full prevalence range. The very high positive biases for rare disorders are no longer visible owing to the change of scale from logarithmic to linear.

Figure 2. Positive & negative predictive values, accuracy & proportion of positive diagnoses for all diagnostic prevalences with 5% uncertainty.
For high sensitivity and specificity, the figures show variable but small degrees of bias across the full prevalence range, with excellent overall accuracy. Positive predictive detection rose rapidly with increasing prevalence, reaching 64% at 10% and 80% at 20%. With 5% uncertainty allowed, an upper bound PPV of nearly 100% was achieved above 5% prevalence.
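The prevalence at which the PPV crosses a given threshold can also be solved directly from the PPV formula given earlier; a sketch (the function name is mine):

```r
# Prevalence at which PPV reaches a target, solving
# PPV = sens*p / (sens*p + (1-spec)*(1-p)) for p:
ppv_threshold_prev <- function(sens, spec, target) {
  f <- 1 - spec
  (target * f) / (sens * (1 - target) + target * f)
}
ppv_threshold_prev(0.80, 0.95, target = 0.64)  # 0.10 -- PPV 64% at 10% prevalence
ppv_threshold_prev(0.80, 0.95, target = 0.80)  # 0.20 -- PPV 80% at 20% prevalence
```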
Figure 3 below explores the effect of physicians' expectations.

Figure 3. Association between condition prevalence, physician sensitivity and specificity (upper row); positive & negative predictive values, accuracy & proportion of positive diagnoses under maximally reduced diagnostic expectancy with 5% uncertainty (lower row).
The upper two charts report the association between prevalence, sensitivity and specificity for the diagnoses reported in reference 22. Overall, the shapes of the curves are complex, and the data are sparse and noisy. However, the increase in specificity and the reduction in sensitivity are both steep and approximately linear below 10% prevalence, which is where case over-identification starts to increase. The lower two charts report PPV, NPV, accuracy and proportion of positive diagnoses for sensitivity and specificity at the floor prevalence from reference 22, i.e., the prevalence of the rarest disorder reported, and therefore display the highest measured impact of the hypothesised expectancy effect. As can be seen, accuracy was maintained and bias minimised at a prevalence as low as 0.07%. However, the price paid was huge uncertainty about any diagnosis made: the 95% confidence intervals cover the entire probability range.
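The scale of that uncertainty is easy to reproduce: at 0.07% prevalence, even a small band around a near-perfect specificity sends the PPV from almost 0 to 1. The point values below are illustrative stand-ins, not the estimates from reference 22.

```r
# PPV uncertainty at the floor prevalence:
prev <- 0.0007                  # floor prevalence reported above
sens <- 0.30                    # hypothetical expectancy-depressed sensitivity
spec <- c(0.95, 0.9999, 1.00)   # band around a near-perfect specificity
ppv  <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))
round(ppv, 3)
#> 0.004 0.677 1.000 -- the interval spans essentially the whole range.
```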
Discussion
Recall that, in this context, prevalence refers to the diagnosis rate in a service's referral population. On the presumption that the sensitivity and specificity differences between CEs and GS are small, such graphs can be used to decide an appropriate prevalence threshold for a CE hub-and-spoke approach. The most clinically relevant MCID will be derived from the PPV, which determines the correct treatment for the diagnosis. For example, consider an absolute MCID of 20% for the PPV: at the upper bounds of sensitivity and specificity, the PPV reaches 80% at 20% prevalence, so referral populations at or above that prevalence fall within the margin.
For rare disorders, the picture is different. Even at the highest reasonable expectations for sensitivity and specificity, positive bias balloons below 20% prevalence: at 10% it has increased to around 25%, and by 1% prevalence to nearly 500%. In practice, the expectancy effect mitigates the risk of many false positives at low prevalence, effectively operating as a low-frequency filter for disorders too rare to detect accurately. Removal of this filter may account for disasters such as the excess diagnosis of anal abuse of young children in Cleveland,29 when anal examination was used proactively to detect a rare event without knowledge of its prevalence or of the sensitivity and specificity of the test. Applying a similar MCID of 20% to bias (i.e., an upper limit to overdiagnosis of 20%) suggests a minimum referral prevalence of around 15% for community management, should that MCID be an acceptable margin of error.
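The arithmetic behind a bias MCID can be sketched the same way. Solving bias = (sens·p + (1−spec)(1−p))/p ≤ 1 + MCID for p gives a minimum prevalence; at the point estimates this yields about 11%, and the ~15% quoted above presumably also absorbs the 5% uncertainty margin (my inference, not the paper's derivation).

```r
# Minimum referral prevalence keeping overdiagnosis within a bias MCID:
bias_threshold_prev <- function(sens, spec, mcid) {
  f <- 1 - spec
  f / (1 + mcid - sens + f)
}
bias_threshold_prev(0.80, 0.95, mcid = 0.20)  # ~0.111 at the point estimates
```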
Three arguments support the admittedly unproven assumption that diagnosis can substitute for outcome when comparing CEs and GS: evidence-based practice, training practices and supportive empirical evidence. First, using the best treatment for a condition gives the best chance of success, conditional on the correct diagnosis: knowing the diagnosis sets the best treatment, which determines the outcome. Second, as discussed in the introduction, best practice is taught by CEs at both undergraduate and postgraduate levels. Clinicians typically work in both settings during their careers, and both CEs and GS organise a range of meetings, attended by staff from both, where best practice is demonstrated and updated. It is hard to argue that clinicians in CEs will be better practitioners than those in GS without claiming a failure in medical education. Third, the introductory review found largely supportive findings. Hip fracture in over-65s is common in the referral population of orthopaedic clinics, and only minimal differences in outcome were found between CEs and GS.12 Similarly, the Triple P program described in reference 20 addresses common problems and can be applied transdiagnostically, so the diagnostic population would collectively have had a high prevalence. Under these conditions, no waning effects were detected between CEs and GS. The only divergence involved the postulated expectancy effect, which is, nonetheless, a predictable necessity arising from the high risk of false positives at low prevalence.
It is clear from the charts above that diagnostic prediction at the lower reasonable bounds of sensitivity and specificity is insufficient for good practice. However, as Figure 3 shows, these lower bounds, which summarise many studies, should be interpreted in the light of any prevalence associated with them, which is unknown. It seems unlikely that such low levels of sensitivity and specificity co-occur in clinical practice, as both CEs and GS perform much better than the lower bounds suggest; more likely, the lower bounds reflect the unmeasured consequence of the expectancy effect at different prevalence levels.
Routine estimation of these parameters would also enable bias quantification between referral strata (e.g., 8), assisting local detection and action, e.g., via appropriate referral templates.30
Finally, this paper has considered CEs and GS from the perspective of the direct delivery of patient care. CEs also add value through teaching and research. Furthermore, this analysis of value for money was considered from the perspective of the service provider. When considered from the perspective of the local communities they serve, hospitals are major contributors to their local economies,31 and these benefits are not captured in comparisons between types of service delivery. When deciding whether the higher costs of a CE are worthwhile, the value of such indirect contributions should also be included. However, patients should know that a well-delivered GS can provide care every bit as good as that from the equivalent CE, provided the prevalence of cases in the former's referral population is sufficiently high, and differences between CEs and GS that are not accounted for by differences in prevalence should not be tolerated. Excellence is not the sole prerogative of specialist teaching and research centres.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
