Abstract
Depression is now rightly being recognized as a significant public health problem, with studies in Australia [1], and elsewhere, drawing attention to its under-recognition and under-treatment. Mechanisms for screening for latent depression and other common mental disorders are potentially useful in association with appropriate treatment and monitoring programmes [2]. It is therefore of interest to read of the development of the 12-item SPHERE screening instrument [3], with claimed sensitivity of 93% and better performance than the 30-item General Health Questionnaire (GHQ) [4].
However, nowhere in the original report is the specificity recorded, even though a large percentage of patients screened positive – more than the expected prevalence rate of the common mental disorders. We therefore sought to re-examine the data.
Method
Using data presented in Box 5 of the original publication [3], 2 × 2 tables (screen ±, diagnosis yes/no) were constructed for a variety of comparisons. The DAG_Stat computer program [5] was used to calculate sensitivity, specificity, positive predictive power, negative predictive power and overall efficiency. These and other relevant terms are defined by Baldessarini et al. [6], and Clarke and McKenzie [7]. Calculations of exact confidence intervals for sensitivity and specificity are as described by McKenzie et al. [8].
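As an illustration of how these quantities follow from the four cells of a 2 × 2 table, a minimal Python sketch is given below. It is not the DAG_Stat program; the class, function and key names are hypothetical, and the counts in the example are invented purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Cells of a screen-by-diagnosis 2 x 2 table (hypothetical helper)."""
    tp: int  # screen positive, diagnosis yes
    fp: int  # screen positive, diagnosis no
    fn: int  # screen negative, diagnosis yes
    tn: int  # screen negative, diagnosis no

    @property
    def n(self) -> int:
        return self.tp + self.fp + self.fn + self.tn


def screening_characteristics(t: TwoByTwo) -> dict:
    """Return the standard screening parameters as proportions (0 to 1)."""
    return {
        "sensitivity": t.tp / (t.tp + t.fn),
        "specificity": t.tn / (t.tn + t.fp),
        "positive_predictive_power": t.tp / (t.tp + t.fp),
        "negative_predictive_power": t.tn / (t.tn + t.fn),
        "overall_efficiency": (t.tp + t.tn) / t.n,
        "screen_positive_rate": (t.tp + t.fp) / t.n,
        "prevalence": (t.tp + t.fn) / t.n,
    }


# Invented counts for illustration only; these are not the SPHERE data.
example = screening_characteristics(TwoByTwo(tp=40, fp=60, fn=10, tn=90))
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

Overall efficiency here is simply the proportion of all patients classified correctly by the screen, whether as cases or as non-cases.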
The SPHERE instrument has two components; one scale of 6 items derived from the GHQ which measures aspects of depression and anxiety (PSYCH-6), and another scale of 6 items which measures fatigue (SOMA-6). As a screening instrument these different components can be combined to require a positive result in both (PSYCH and SOMA) or either (PSYCH or SOMA). The former is a narrower criterion (termed Level 1), the latter a broader one (termed Level 2). We also examined the performance of PSYCH-6 alone. Psychiatric diagnoses were made using the CIDI-Auto [9], a computer-operated enquiry based on the Composite International Diagnostic Interview (CIDI) [10].
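The three screening rules can be sketched as follows. The cut-off scores for PSYCH-6 and SOMA-6 are not restated here, so this hypothetical sketch takes each scale's positive/negative result as given and shows only how the results are combined.

```python
from itertools import product


def level1(psych_positive: bool, soma_positive: bool) -> bool:
    """Narrower criterion: positive on both scales (PSYCH and SOMA)."""
    return psych_positive and soma_positive


def psych_alone(psych_positive: bool, soma_positive: bool) -> bool:
    """PSYCH-6 result on its own; the SOMA-6 result is ignored."""
    return psych_positive


def level2(psych_positive: bool, soma_positive: bool) -> bool:
    """Broader criterion: positive on either scale (PSYCH or SOMA)."""
    return psych_positive or soma_positive


# Enumerate every combination of scale results.
rows = [
    (psych, soma, level1(psych, soma), psych_alone(psych, soma), level2(psych, soma))
    for psych, soma in product([False, True], repeat=2)
]
for row in rows:
    print(row)

# Anyone positive at Level 1 is positive on PSYCH alone, and anyone
# positive on PSYCH alone is positive at Level 2.
nested = all((not l1 or pa) and (not pa or l2) for _, _, l1, pa, l2 in rows)
print("screens are nested:", nested)
```

Because the three rules are nested in this way, the screen positive rate can only stay the same or rise in moving from Level 1 to PSYCH alone to Level 2.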
Results
In the first analysis, the broadest screen (PSYCH or SOMA) was tested against the ‘total current’ diagnosis in the n = 164 sample. The results appear in Table 1.
Table 1. 2 × 2 table and derived screening characteristics for SPHERE (PSYCH or SOMA)
Eighty-three per cent of patients screened positive. Twenty-seven per cent of patients received a psychiatric diagnosis, most of these screening positive (sensitivity 93%). However, the likelihood of a person screening positive being a true case was only 30%. The majority of people screening positive were non-cases. This procedure was repeated using the narrower screen (PSYCH and SOMA; screen positive rate 32%), and also using PSYCH alone (screen positive rate 67%), against ‘any current diagnosis’ and against major depression.
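The relationship between the figures quoted for the first analysis can be checked with a short calculation. The sketch below, with hypothetical function and variable names, works only from the rounded percentages in the text (prevalence 27%, sensitivity 93%, screen positive rate 83%), so its outputs are approximations rather than values taken from the 2 × 2 table itself.

```python
def implied_from_rates(prevalence: float, sensitivity: float,
                       screen_positive_rate: float) -> tuple:
    """Back out positive predictive power and specificity from three rates.

    All arguments and results are proportions (0 to 1).
    """
    # Fraction of the whole sample that is a true positive.
    true_pos = prevalence * sensitivity
    # The remaining screen positives must be false positives.
    false_pos = screen_positive_rate - true_pos
    ppv = true_pos / screen_positive_rate
    specificity = 1 - false_pos / (1 - prevalence)
    return ppv, specificity


# Rounded percentages quoted in the text for PSYCH or SOMA, n = 164 sample.
ppv, specificity = implied_from_rates(
    prevalence=0.27, sensitivity=0.93, screen_positive_rate=0.83
)
print(f"implied positive predictive power: {ppv:.2f}")
print(f"implied specificity: {specificity:.2f}")
```

On these rounded inputs the implied positive predictive power is close to the 30% quoted, and the implied specificity is in the region of 20%.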
For ‘any diagnosis’ (see Table 2), specificity rose to 37% using the less broad PSYCH alone, and to 72% for the narrowest screen of PSYCH and SOMA; sensitivity dropped correspondingly to 78% and 47%, respectively. The overall efficiencies for the three levels of screening were therefore 40%, 48% and 65%. The parameters for the GHQ-30 are also shown in Table 2 and are very similar to those of the PSYCH-alone screen.
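Overall efficiency is a prevalence-weighted average of sensitivity and specificity, which is why it rises as the screen narrows in a sample where most patients are non-cases. A brief sketch, using the rounded figures quoted above and the 27% diagnosis rate of the n = 164 sample (the function and variable names are hypothetical):

```python
def overall_efficiency(prevalence: float, sensitivity: float,
                       specificity: float) -> float:
    """Proportion of all patients correctly classified by the screen."""
    return prevalence * sensitivity + (1 - prevalence) * specificity


PREVALENCE = 0.27  # 'any current diagnosis' rate in the n = 164 sample

psych_alone = overall_efficiency(PREVALENCE, sensitivity=0.78, specificity=0.37)
psych_and_soma = overall_efficiency(PREVALENCE, sensitivity=0.47, specificity=0.72)

print(f"PSYCH alone: {psych_alone:.0%}")
print(f"PSYCH and SOMA: {psych_and_soma:.0%}")
```

These reproduce the 48% and 65% reported for the two narrower screens. With non-cases making up 73% of the sample, gains in specificity carry more weight than the accompanying losses in sensitivity.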
Table 2. Comparison of Somatic and Psychological Health Report screening for ‘any current diagnosis’ (termed ‘total current’ in the original article)
For major depression (Table 3), the CIDI diagnosis rate was 9.8% – much lower than for ‘any diagnosis’, as expected. The screen positive rates were as before (83%, 67%, and 32%, respectively). Consequently, the positive predictive power and the specificity were also much reduced. The narrower screen (PSYCH and SOMA) had the greatest overall efficiency.
Table 3. Comparison of Somatic and Psychological Health Report screening for current major depression
Similar analyses were also carried out on the data from the n = 364 sample, likewise presented in Box 5 of the original article, where the prevalence of any mental disorder was 12.6%. Without going into full detail, we report that, for the different levels of screening, the sensitivity ranged from 84% to 54%, specificity from 48% to 78%, overall efficiency from 53% to 75%, positive predictive power from 19% to 27%, and the screen positive rate from 55% to 25%. Using the overall efficiency as a guide, these results are better than in the previous sample.
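The low positive predictive power in this sample follows directly from the low prevalence. The sketch below applies Bayes' rule to the rounded figures quoted above. It assumes that the first value of each range belongs to the broadest screen and the second to the narrowest, a pairing inferred from the ordering of the ranges. The function and variable names are hypothetical.

```python
def positive_predictive_power(prevalence: float, sensitivity: float,
                              specificity: float) -> float:
    """Probability that a screen-positive patient is a true case."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


PREVALENCE_364 = 0.126  # any mental disorder, n = 364 sample

# Assumed pairing: (84%, 48%) for the broadest screen, (54%, 78%) for the narrowest.
broadest = positive_predictive_power(PREVALENCE_364, 0.84, 0.48)
narrowest = positive_predictive_power(PREVALENCE_364, 0.54, 0.78)

print(f"broadest screen: {broadest:.2f}")
print(f"narrowest screen: {narrowest:.2f}")
```

The values obtained are within a percentage point or so of the reported 19% and 27%, the small difference being attributable to rounding of the inputs. Even for the narrowest screen, roughly three in four screen-positive patients are non-cases.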
Discussion
Screening instruments are potentially useful, although their usefulness in psychiatry is not fully proven, and they can be imprecise, if not clumsy, instruments [11]. Generalizing from the above data, it would appear that such instruments are better at screening for broader concepts than narrower ones. However, this may not be so. Using various versions of the GHQ, we have previously shown that the overall efficiency of a screening instrument increases as the definition of the disorder is narrowed (for instance, major depression vs any form of depression), as long as the threshold for caseness is raised [12].
With respect to the SPHERE, a number of issues arise from the data. The first is the usefulness of a screening instrument for which over 80% of the population screen positive, of whom only 30% have the disorder. Of course, this may have been an unusual sample, and in the larger samples reported in the original article [3] the screen positive rates are lower, between 45% and 50%. It may also be that the most appropriate threshold for this population has not been used and that different populations require different thresholds. These possibilities were not examined. Second, the results suggest that there is no advantage in including the fatigue component, at least when it is joined with an ‘or’, in which case it makes for a very broad screen. The inclusion of the fatigue component may have an advantage when combined with an ‘and’, in making the screen more specific, although this effect could probably be gained equally well by raising the threshold of the PSYCH scale.
Finally, the data here do not support the proposition that the SPHERE is more efficient than the GHQ-30. The parameters for the GHQ (used on the n = 164 sample only) were almost identical to those obtained when its derivative, the PSYCH scale, was used alone. This is not surprising. The 12-item version of the GHQ was also used in the Australian National Survey of Mental Health and Wellbeing [13], where it had a somewhat similar sensitivity of 75%, though greater specificity of 70% – and a higher sensitivity when scored to take into account chronicity of symptoms [14]. The SPHERE might have the advantage of flexibility whereby, with the fatigue scale added, combined with an ‘or’, it becomes more sensitive, and when combined with an ‘and’ it becomes more specific. However, this effect can be achieved by simply moving the threshold score and does not enhance the overall efficiency of the instrument.
We believe that screening instruments have their place. At present it is debatable whether the SPHERE is a significant advance on other screening instruments, and in its present form, with so many people screening positive, it would appear to be quite the opposite. The principles of screening are that it should be low cost, low risk and of likely benefit [2]. Labelling a significant number of people who are not depressed as ‘probably depressed’ might reasonably be considered a potential harm. We do not want to replace a situation of under-recognition with one of over-recognition, neither being of benefit to the patient.
