Abstract
Aim:
Generic patient-reported outcome measures (PROMs) allow comparison of health-related quality of life across populations and pathologies. For these comparisons to be valid, the PROM must be responsive; the score must change when the patient’s quality of life changes. This study aims to assess the responsiveness of the EQ-5D-three level (3L) in elective shoulder surgery.
Methods:
Pre- and post-operative EQ-5D-3L and Oxford Shoulder Scores (OSS) were prospectively collected across a range of 204 elective shoulder surgeries. Internal responsiveness was assessed through significance testing of mean change scores and standardized response means (SRMs). External responsiveness of the EQ-5D-3L was assessed against the minimal clinically important difference in OSS, using receiver operating characteristic curve and change score correlation.
Results:
Both EQ-5D-3L and OSS scores improved significantly over time (
Discussion:
The EQ-5D-3L is adequately internally responsive to change following elective shoulder surgery but is unable to differentiate patients demonstrating minimal clinically important change. The EQ-5D therefore only partially reflects patient experience.
Introduction
It is accepted that only by quantifying the patient’s perspective of their own health can we truly comment on the quality and effectiveness of healthcare interventions. To this end, academics and clinicians have invested considerable resources in developing robust and valid ways of collecting self-reported health outcome data, the culmination of which is the patient-reported outcome measure (PROM). 1 The use of PROMs is now embedded in the research framework through which health technology appraisal is undertaken, with the United Kingdom’s National Institute for Health and Care Excellence (NICE) requiring PROMs evidence as part of its deliberations. 2 Through this route, cost-effectiveness is measured in relation to the benefit in quality-adjusted life years (QALYs), using health-related quality of life (HRQoL) data derived from generic measures such as the EQ-5D.
The relationship between cost and QALYs is becoming increasingly relevant in modern healthcare. This is of particular importance in the heavily scrutinized area of elective orthopaedic surgery. It has been well reported that the volume of elective shoulder surgery has risen exponentially. 3,4 Evidence of effectiveness, or lack thereof, must therefore be valid, reliable and responsive. Generic instruments such as the EQ-5D and SF-36 have previously been shown to fulfil these metric properties in a wide range of conditions 1,5 ; however, in certain groups, these instruments miss aspects of health that are vitally important to patients. In these circumstances, condition-specific measures, such as the Oxford Shoulder Score (OSS), 6 are advocated alongside their generic counterparts. However, condition-specific measures do not allow comparison across patients with different conditions and cannot provide quality of life data for economic evaluation. 5 It is therefore in the interests of patients that the responsiveness of the generic PROM is adequately assessed. Only then can we be confident that the generic PROM accurately reflects the experience of the patient. 5
The EQ-5D-three-level (3L) questionnaire is currently employed by the National Health Service (NHS) England PROMs programme. 7,8 It was introduced in 1990 and comprises questions on the following five dimensions: mobility, self-care, usual activities, pain/discomfort and anxiety/depression. Each dimension has three levels: no problems, some problems and extreme problems. A visual analogue scale (VAS) is also provided within the questionnaire, which records the patient’s self-rated health on a vertical scale where the endpoints are labelled ‘Best imaginable health state’ and ‘Worst imaginable health state’. Though its sensitivity has previously been questioned 8 and a five-level version has been produced and is beginning to be utilized, 3,9 the 3L version remains highly prevalent in recent upper limb research 10,11 and is likely to remain part of the relevant evidence base for many years, perhaps decades. 12 Analysis of the NHS PROMs programme has found that the EQ-5D-3L is adequately responsive in total hip and knee arthroplasty. 1 The only assessment of shoulder-related responsiveness has been in proximal humeral fractures, 13 where it was recommended as a quality of life measure for that particular injury. In light of the increasing utilization and central importance of cost-effectiveness analysis, it is vital that the responsiveness of the EQ-5D-3L in elective shoulder surgery is assessed. The aim of this study is to evaluate the responsiveness of the generic PROM, the EQ-5D-3L, and of the condition-specific OSS in elective shoulder surgery.
Patients and methods
A prospective cohort study on patients undergoing shoulder surgery between January 2009 and January 2012 was undertaken. Patients undergoing surgery for instability were excluded as they were assessed with the Oxford Shoulder Instability Score. All patients completed the OSS and the EQ-5D-3L score preoperatively. Patients who underwent arthroscopic capsular release, arthroscopic subacromial decompression and arthroscopic rotator cuff repair (RCR) completed the same questionnaires 6 months post-operatively, while patients who had a total shoulder replacement (TSR) completed them at 1 year post-operatively. Questionnaires were checked for completion by one of the investigators to facilitate instrument completion without influencing responses.
The OSS 6 assesses the symptoms and function experienced by the patient during the preceding 4 weeks. It comprises 12 questions with five possible response options, scored from 0 to 4, with 4 representing the best response. Scores from individual questions are added to produce a final score ranging from 0 (
The EuroQol 15 EQ-5D-3L is an internationally validated general measure of HRQoL. The EQ-5D-3L index score is calculated using population-based preference weights and the score ranges from 0 to 1, where 1 represents
Statistical methods
Responsiveness is related to an instrument’s ability to capture clinically important changes over time. 16 For a rounded understanding of the utility of a generic health measure, this needs to be assessed in two forms, ‘internal’ and ‘external’ responsiveness. Internal responsiveness is the ability of the measure to change over a pre-specified time period. ‘External responsiveness’ reflects the extent to which change in a measure relates to a corresponding change in a reference measure of clinical or health status. 17
All statistical analysis was undertaken in STATA (StataCorp. 2015. Stata Statistical Software: Release 14. College Station, TX: StataCorp LP). Significance was set at
Internal responsiveness
The statistical significance of the observed change was assessed for both OSS and EQ-5D using a one-sample (two-sided)
Responsiveness was further assessed by measuring the standardized response mean (SRM). The SRM is the mean score change divided by the standard deviation (SD) of the score change between each time period. 18 The SRM is often used alongside the paired
The SRM is preferred over effect size in measuring responsiveness as it uses the SDs of the change scores as the denominator. 13 The SRM was interpreted using Cohen’s criteria where the SRM is regarded as large (>0.8), moderate (0.5–0.8) or small (<0.5). 19
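The SRM definition and Cohen’s thresholds described above can be sketched in a few lines of Python. This is a minimal illustration rather than the analysis code used in the study (which was run in STATA): the function names are hypothetical, the sample SD (n − 1 denominator) is an assumption, and the scores are invented for demonstration only.

```python
from statistics import mean, stdev


def standardized_response_mean(pre, post):
    """Return mean(change) / SD(change), where change = post - pre."""
    if len(pre) != len(post):
        raise ValueError("pre and post scores must be paired")
    changes = [after - before for before, after in zip(pre, post)]
    sd_change = stdev(changes)  # sample SD (n - 1); an assumption of this sketch
    if sd_change == 0:
        raise ValueError("SRM is undefined when all change scores are identical")
    return mean(changes) / sd_change


def classify_srm(srm):
    """Apply Cohen's criteria as stated in the text to the SRM magnitude."""
    magnitude = abs(srm)
    if magnitude > 0.8:
        return "large"
    if magnitude >= 0.5:
        return "moderate"
    return "small"


# Invented paired scores, not study data.
example_pre = [20, 25, 18, 30, 22]
example_post = [35, 33, 30, 41, 28]
example_srm = standardized_response_mean(example_pre, example_post)
print(round(example_srm, 2), classify_srm(example_srm))
```

Because the denominator is the SD of the change scores rather than the baseline SD, a consistent improvement across patients yields a large SRM even when the mean change is modest.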
External responsiveness
The external criterion for which the responsiveness of the EQ-5D was tested was the minimal clinically important difference (MCID) of the OSS. This is the smallest difference in a score which the patient perceives as being beneficial. 20 The MCID of the OSS has only recently been defined as >4.5 points for elective shoulder surgery. 3 This was derived from the distribution method of half a SD. 21 Anchor-based methodologies have been utilized in cross-culturally adapted Dutch OSS but not in UK populations. 22,23
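As a rough sketch of how a half-SD MCID and the resulting dichotomized criterion might be computed, consider the Python below. The text does not state which SD the distribution method was applied to; using the SD of baseline scores is an assumption of this sketch. The function names are hypothetical.

```python
from statistics import stdev


def half_sd_mcid(baseline_scores):
    """Distribution-based MCID estimate: half of one standard deviation.

    Assumption: the SD is taken over baseline (preoperative) scores.
    """
    return 0.5 * stdev(baseline_scores)


def achieved_mcid(pre, post, mcid=4.5):
    """Flag each patient whose change score (post - pre) exceeds the MCID."""
    return [(after - before) > mcid for before, after in zip(pre, post)]


# Invented values, not study data.
print(half_sd_mcid([10, 20, 30]))        # half of an SD of 10
print(achieved_mcid([20, 20], [24, 25]))  # changes of 4 and 5 against >4.5
```

The boolean flags returned by `achieved_mcid` are the kind of dichotomized external criterion against which a second instrument’s change scores can then be tested.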
Non-parametric receiver operating characteristic (ROC) curves were used to assess the sensitivity and the false positive rate (1-specificity) of the EQ-5D-3L and EQ-5D VAS against the dichotomized outcome of patients with an OSS change score >4.5 (MCID). The ROC curves demonstrate the ability of the change scores (post-operative score minus preoperative score) to discriminate between the patients defined by the external criterion. 13 The area under the curve (AUC) represents the diagnostic ability of an instrument, with a value of 0.5 denoting performance no better than random chance and a value of 1.0 indicating perfect predictive ability. 24
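The AUC of a non-parametric ROC curve has a simple interpretation that can be computed directly: it equals the probability that a randomly chosen improved patient has a higher change score than a randomly chosen non-improved patient, with ties counted as one half. The sketch below illustrates this under that standard equivalence; it is not the study’s STATA routine, the function name is hypothetical and the data are invented.

```python
def roc_auc(change_scores, improved):
    """Empirical AUC via pairwise comparison of improved vs non-improved.

    change_scores: change in the instrument under test (post - pre).
    improved: booleans, True where the external criterion (e.g. OSS
    change > MCID) was met.
    """
    positives = [s for s, flag in zip(change_scores, improved) if flag]
    negatives = [s for s, flag in zip(change_scores, improved) if not flag]
    if not positives or not negatives:
        raise ValueError("both improved and non-improved patients are required")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


# Invented example: perfectly separated change scores.
print(roc_auc([0.1, 0.2, 0.3, 0.4], [False, False, True, True]))
```

A value near 0.5 from this calculation means the change score ranks improved and non-improved patients no better than chance.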
Logistic regression was used to provide a relative estimate of the level of variance that is explained by the change scores. The odds ratio (OR) was derived with the external criterion as the dependent variable and the EQ-5D change scores as the independent variables.
Correlations between the change scores of the OSS, EQ-5D-3L and EQ-5D VAS were calculated using Spearman’s rank. Under the null hypothesis, a proportional change in all scores would not occur over time.
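For readers unfamiliar with the calculation, Spearman’s rank correlation is the Pearson correlation of the rank-transformed scores, with tied values given their average rank. The sketch below illustrates this; the function names are hypothetical and the data invented, and it is not the study’s analysis code.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    if len(x) != len(y):
        raise ValueError("x and y must be paired")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Invented change scores, not study data.
print(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40]))
```

Working on ranks makes the coefficient insensitive to the skewed, clustered distributions that index scores such as the EQ-5D often show.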
Preoperative and post-operative scores were evaluated for floor (scores reflecting the lowest level of functioning) and ceiling (scores reflecting the maximal level of functioning) effects. An instrument is considered to have a significant floor or ceiling effect if more than 15% of the scores are at the lowest or highest level of functioning. 25
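The 15% rule can be expressed as a short check. The sketch below is illustrative only: the function name is hypothetical, the scores are invented, and the 0 to 48 range used in the example is simply what 12 items scored 0 to 4 would sum to.

```python
def floor_ceiling(scores, minimum, maximum, threshold=0.15):
    """Proportion of scores at the scale limits, flagged against a threshold."""
    if not scores:
        raise ValueError("at least one score is required")
    n = len(scores)
    floor = sum(1 for s in scores if s == minimum) / n
    ceiling = sum(1 for s in scores if s == maximum) / n
    return {
        "floor": floor,
        "ceiling": ceiling,
        "floor_effect": floor > threshold,      # more than 15% at the minimum
        "ceiling_effect": ceiling > threshold,  # more than 15% at the maximum
    }


# Invented post-operative scores on a 0-48 scale, not study data.
example = floor_ceiling([48, 48, 40, 30, 12, 0, 35, 44, 48, 20], 0, 48)
print(example)
```

In this invented example three of ten scores sit at the maximum, so a ceiling effect is flagged, while the single score at the minimum does not exceed the threshold.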
Results
There were 204 patients (125 women, 79 men) who were eligible for inclusion in the study. The demographics of the surgical population studied and the respective subgroups are shown in Table 1. There was a significant improvement in patients’ function post-operatively as assessed by the OSS, the EQ-5D-3L and the EQ-5D VAS, with the exception of the EQ-5D VAS in the TSR group.
Table 1. Patient demographics and mean (SD) of pre- and post-operative and change score for the OSS, EQ-5D-3L and EQ-5D VAS.
VAS: visual analogue scale; SD: standard deviation; OSS: Oxford Shoulder Score; ACR: arthroscopic capsular release; ASD: arthroscopic subacromial decompression; RCR: rotator cuff repair; TSR: total shoulder replacement.
a Denotes statistically significant difference between pre- and post-op score.
The SRM for the OSS was significantly higher than the SRM for the EQ-5D-3L and EQ-5D VAS in all surgical groups except for those patients who underwent RCR (Table 2). In accordance with Cohen’s criteria, the OSS SRMs were large (>0.8) in all categories. The same was true of the EQ-5D-3L, though to a lesser magnitude. The EQ-5D VAS responsiveness was moderate for the capsular release and subacromial decompression groups but small for the RCR and TSR groups.
Table 2. SRM ± 95% CI for the OSS, EQ-5D-3L and EQ-5D VAS in surgical subgroups.
SRM: standardized response mean; CI: confidence interval; OSS: Oxford Shoulder Score; ACR: arthroscopic capsular release; ASD: arthroscopic subacromial decompression; RCR: rotator cuff repair; TSR: total shoulder replacement; VAS: visual analogue scale.
Weak correlations were noted between OSS and the EQ-5D-3L (
Table 3. EQ-5D change score correlation (Spearman rank) with OSS change score. a
AUC: area under the curve; CI: confidence interval; OR: odds ratio; MCID: minimal clinically important difference; OSS: Oxford Shoulder Score; VAS: visual analogue scale.
a AUC ± 95% CI and logistic regression OR ± 95% CI against MCID criteria of a change in OSS of >4.5 points.

Non-parametric ROC curves for EQ-5D-3L and EQ-5D VAS against the external criterion of a MCID of a change score in OSS of >4.5 points. ROC: receiver operating characteristic; MCID: minimal clinically important difference; OSS: Oxford Shoulder Score.
There were no floor effects observed in the OSS or the EQ-5D-3L scores preoperatively or post-operatively (Table 4). There were no ceiling effects with the OSS, but significant ceiling effects were observed both overall and within all subgroups for the EQ-5D-3L post-operative scores.
Table 4. Pre- and post-operative floor and ceiling effect of the OSS and EQ-5D in surgical subgroups.
OSS: Oxford Shoulder Score; ACR: arthroscopic capsular release; ASD: arthroscopic subacromial decompression; RCR: rotator cuff repair; TSR: total shoulder replacement.
Discussion
The routine use of PROMs is now an established component of health technology appraisal, monitoring and performance assessment. 1,26 The OSS is collected by the national joint arthroplasty registries of the United Kingdom and several other European countries. 27 In addition to the OSS, some national joint registries collect the EQ-5D-3L, allowing assessment against population-based norms and health economic analysis. In the United Kingdom, the EQ-5D has become the instrument of choice for many agencies including the National Institute for Health and Care Excellence (NICE). 2 The results of this study in elective shoulder procedures demonstrate that though the EQ-5D-3L has adequate internal responsiveness, its external responsiveness was poor and the change in scores correlates only weakly with the change in OSS. Furthermore, it is unable to discriminate between patients who did, or did not, demonstrate MCID in the OSS.
Significant improvement in both scores was noted following a variety of common elective shoulder procedures. However, when quantified through SRM, the OSS was found to be significantly more responsive. Though the SRM values for the EQ-5D-3L exceed Cohen’s benchmark of 0.8 for large effect, they were approximately half those shown by the OSS, and it is relevant to note that the EQ-5D VAS responded poorly, particularly in the arthroplasty group. To a certain extent, this is to be expected: a disease-specific outcome commonly outperforms generic measures, which forms the basis for recommendations that assessments should include both measures. 1 We would certainly concur with this assertion, particularly in light of previous attempts to reduce the patient burden by administration of generic-only questionnaires, an approach which has been found to be sufficient in lower limb surgery, where the inclusion of condition-specific scores did not provide additional information. 28,29
An accepted external criterion, the MCID, was employed to assess the external responsiveness of the EQ-5D. A difference between the pre- and post-operative OSS of >4.5 is considered to represent patient improvement. Against this criterion, the EQ-5D was not able to discriminate between improved and non-improved patients. Post hoc modelling of different theoretical MCID reference scores found the responsiveness to improve once the MCID was set at a >9 point change in OSS, with an AUC of 0.63. This represents a large change in OSS, and interestingly, using a distribution-based derivation of MCID (half a SD), 21 our own data set would place the MCID at >3. The EQ-5D change scores weakly correlated with OSS change scores and logistic regression demonstrated poor discriminative ability.
When the MCID was >4.5, the EQ-5D VAS demonstrated good discrimination between improved and unimproved patients. This was surprising in light of the very weak correlation between OSS and EQ-5D VAS change scores and is likely due to the small number of very poor EQ-5D VAS change scores in the non-improved patients. When the MCID was set at a theoretical >9 points, where a larger number of non-improved patients are included, the discriminatory value of the EQ-5D VAS significantly diminished, with an AUC of 0.52.
The presence of ceiling effects limits the use of an instrument due to clustering of scores at a maximum level of functioning. 30 In line with previous reports, 31,32 we found the OSS to be free of any significant floor or ceiling effects. In contrast, the EQ-5D-3L had a substantial ceiling effect, ranging from 16% in the capsular release group to as high as 57% in the subacromial decompression group. Our results agree with the findings of Slobogean et al., 33 who evaluated patients with proximal humerus fractures and reported a 30% ceiling effect with its use. The high ceiling rate may be in part due to a bimodal distribution of scores, but the content of the questionnaire also makes a difference. If an item is irrelevant to members of a population, then there is limited probability it will show improvement in a longitudinal study. 30 For example, although it is possible that patients with shoulder pathology may have had concomitant symptoms of anxiety/depression, most patients are unlikely to be affected in this domain; this item is therefore unlikely to shift after any treatment.
The authors accept that there are limitations to this study. The EQ-5D-3L was selected for this study, but we recognize that its previously reported high ceiling effect, bimodality and inadequacy of response options in capturing changes in milder health states drove the development of the EQ-5D-5L, published in 2011. 34 This has extended the response options on the five health dimensions from three to five. It may be that the responsiveness of this newer version is improved, and we would encourage this analysis to be undertaken. However, reference value sets for UK populations have only very recently been published, 8 and the three-option version continues to be employed in health technology assessment and population monitoring, with particular reference to the English NHS PROMs programme for hip and knee arthroplasty. 7 If this were extended to shoulder arthroplasty or elective shoulder surgery, it is highly relevant to note that the responsiveness of the three-option version may not represent patients adequately. We also recognize that the use of Cohen’s criteria with SRM data may lead to over- or underestimations of the magnitude of change over time. 35 Using the correction method advocated by Middel and van Sonderen, 35 which relates repeated measure correlations to SRM data, no classification changes occurred for the OSS or EQ-5D-3L groups. We also recognize that responsiveness is only one of the psychometric properties essential to the functioning of a PROM; 36 however, the appropriateness, validity, repeatability, acceptability and feasibility of these have previously been studied.
The EQ-5D-3L is one of the most widely used generic PROMs. 37 It is the instrument recommended by NICE 2 for health technology assessment, which includes assessment of effectiveness and cost-effectiveness. It is vital that any health economic assessment computed on the basis of the EQ-5D is a reliable measure of the health state it represents. 6 The EQ-5D-3L, though internally responsive in elective shoulder surgery, correlated poorly with the OSS and is unable to differentiate patients whose clinical condition has improved from those whose condition has not. Though the VAS component of the EQ-5D-3L might have utility in distinguishing patients, based on ROC analysis, the limited responsiveness to change demonstrated by a low SRM suggests that it too is inadequate. Though the five-option version may offer improved metric properties, it is worth noting that a 15-level 5-dimension score has been found to be less responsive than the EQ-5D-3L in some health states, 38 and further investigation is therefore warranted. The OSS itself was initially validated against the SF-36, where pre- and post-op correlation coefficients were greater 39 than those demonstrated here with the EQ-5D-3L, though the reported effect size of the SF-36 was broadly similar to that of the EQ-5D-3L. Further assessment of SF-36 or SF-6D responsiveness, as well as that of the EQ-5D-5L, in elective shoulder procedures is therefore required before any measure could be confidently recommended.
Conclusion
This is the first study to demonstrate that the EQ-5D-3L exhibits adequate internal responsiveness but poor external responsiveness in elective shoulder surgery. The EQ-5D-3L does not represent patient improvement and therefore may not provide adequate evidence on QALYs for economic evaluation in this patient population.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
