Abstract
The Global Burden of Disease study [1–3] found mental disorders to be the fourth leading cause of burden in the world, and the leading cause of disability. Depression alone was the single leading cause of global disability. This prominence of mental disorders has also been observed in subsequent country-level studies [4], [5] including Australia [6]. Such findings are a powerful advocacy tool, with obvious political [7], [8] and potentially economic [9] advantages for psychiatry. However, while these findings are frequently cited there has been virtually no examination of their validity. The current study provides an overview of the Global Burden of Disease study approach, and then presents the results from a validation study of the disability component of the method.
What was the Global Burden of Disease study?
The Global Burden of Disease study was both landmark and ambitious. The World Bank and World Health Organization commissioned Harvard University to estimate the death and disability caused by diseases and injuries for 233 countries (in the reference year 1990). The final estimates were published in 1996 [1], and of the over 400 conditions included, eight mental disorders were represented: psychotic episode, depressive episode, bipolar disorder, panic disorder, obsessive–compulsive disorder, posttraumatic stress disorder, alcohol dependence, and harmful drug use. The Global Burden of Disease study is currently being re-done for the reference year 2000, and thus an examination of the methods is timely.
A new epidemiological measure was developed to estimate the impact of death and disability on the human population: the disability-adjusted life year (DALY). DALYs lost were calculated as the sum of the number of healthy years of life lost to premature death and to living with a disabling condition. The mortality component was based on the number of people who died from each disease and, from the age at which they died, an estimate of the number of years of life they lost by dying prematurely (termed years of life lost, YLLs). This is a standard epidemiological measure, as mortality is traditionally how the population impact of disease is measured. The new element of this study was the inclusion of the population impact of non-fatal, but disabling, conditions.
The disability component (called years lived with disability, YLDs) was calculated as the product of the epidemiology (incidence, duration) and severity (disability weight) of a disorder [1]. It takes into account the number of people who suffer from a disorder, how long they will have that disorder, and how disabled they will be by that disorder. It is thus immediately obvious how such a metric would have far greater relevance for mental disorders than the traditional mortality statistics.
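As a compact summary of the two definitions above, the relationships can be written as follows. The symbols I (incidence), L (average duration in years) and DW (disability weight) are notation chosen here for illustration, and this simplified form leaves out any further adjustments the full study may apply.

```latex
\begin{align}
\mathrm{DALY} &= \mathrm{YLL} + \mathrm{YLD} \\
\mathrm{YLD}  &= I \times L \times DW
\end{align}
```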
The use of epidemiological estimates such as incidence and duration is not controversial; however, the method for estimating the severity of disability has been. The disability weight was obtained using a preference measurement technique. Preferences reflect the relative severity of a health condition on a hypothetical continuum between perfect health and death. Preferences, when expressed as a weighted value between 0 and 1, can be multiplied by the duration of a condition to give the ‘adjusted’ number of years lived with a disability (a detailed review of preferences is beyond the scope of this article, but interested readers are referred to Singh, Hawthorne and Vos [10] for an introduction). A number of preference measurement methods exist [10], with the Global Burden study choosing the person trade-off [11]. The person trade-off is described in more detail below, but its nature is best appreciated by reviewing the actual transcript of the task (see Fig. 1). The aim of the person trade-off is to decide how many people in a given health condition are equivalent in health value to a set number of healthy people (in this case, 1000). For example, in the first version of the task, a respondent may consider 2000 people with depression to represent the same amount of health as 1000 healthy people, as they consider people with depression to have only half the health of fully healthy people. This ratio of healthy to disabled provides the weighted preference value between 0 and 1, in this example equal to 0.5.
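The ratio in the example above can be sketched in a few lines of Python. This is an illustration only: the function name and its arguments are hypothetical, and the expressions the Global Burden of Disease protocol actually uses for each version of the task are given in [16] and are not reproduced here.

```python
def preference_from_equivalence(n_equivalent: int, n_healthy: int = 1000) -> float:
    """Ratio of healthy people to the equivalent number of people in the health state.

    Illustrative helper (not from the original protocol): it only encodes the
    'ratio of healthy to disabled' described in the text for the first version
    of the person trade-off.
    """
    if n_equivalent < n_healthy:
        raise ValueError("equivalence number should be at least the healthy group size")
    return n_healthy / n_equivalent


# The worked example from the text: 2000 people with depression judged
# equivalent in health value to 1000 healthy people.
example_value = preference_from_equivalence(2000)
print(example_value)
```

A larger equivalence number gives a smaller ratio, meaning each person in the health state is judged to hold a smaller share of full health.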

Is this approach to disability applicable to psychiatry?
The disability weights for the global study ranked psychotic and depressive episodes as equivalent in disability to quadriplegia and paraplegia, respectively. Much milder weights were provided for the other disorders, similar in disability to angina or a broken arm. There is substantial evidence that mental disorders are as disabling as many chronic physical conditions [12–14]; however, preference measurement techniques such as the person trade-off are not frequently used to estimate severity in psychiatry, and additional concerns about this methodology have been raised.
The validity of the person trade-off method is largely unknown, and it has been noted to have questionable reliability [15]. In the Global Burden of Disease study, the weights were obtained after considering brief symptom sketches only, with no standardized presentation of disability information (e.g. ‘Active psychosis: an individual with paranoid delusions, auditory hallucinations and disorganized speech’ [16, p.94]). In addition, the weights were provided by a small convenience group of experts [16] whose familiarity with the impact of mental disorders is not known. Finally, the method has been criticized for its complexity and potential discrimination against people with a disability [17–19], criticisms which are readily understandable when the nature of the task is revealed (see Fig. 1).
These concerns suggest a need to examine the validity of the global burden of disease disability weights and the method used to elicit them. The present study examined two aspects of the method: feasibility (Can the method be applied beyond interested experts?), and validity (Does the preference score reflect the severity of the disorder?). The validity of this approach for calculating burden from mental disorders is then discussed.
Method
Using preferences to calculate disability burden
Disability burden (YLDs) adjusts the time spent with a disorder by the relative severity of that disorder, using preference values as the measure of severity (called disability weights). Preferences are considered to express the relative value or desirability of a health state, and are often converted to a value on a scale from 0 (usually representing death) to 1 (usually representing perfect health). As YLDs measure life years lost rather than gained, the preference scale is reversed, such that the higher the preference value the less desirable the health state, and the more years of life that are ‘lost’ (0 represents perfect health and 1 represents death). For example, 10 years lived in a disability state assigned a preference value of 0.85, or highly disabling, would be equivalent to losing 8.5 years of healthy life. In comparison, if that health state was judged to be mild, such as a weight of 0.05, those 10 years would only be equivalent to losing 6 months of healthy life. The disability weight is thus an integral part of the disability burden calculation.
Given the large scale of the Global Burden of Disease study it was not possible to obtain person trade-off values for all conditions, and so a short-cut procedure was adopted. Person trade-off preferences were obtained for a set of 22 ‘indicator’ conditions (including depressive episode and active psychosis) [16]. These values were used to arbitrarily define seven disability severity categories, which provided the basis for defining disability weights for the large number of health conditions included in the study, by estimating the proportion of people with a disorder who would fall into each of the disability categories [16]. This protocol for the indicator conditions (but not the estimations) has been replicated by the Global Burden group, but only one separate replication [20] has been published. Conducted in the Netherlands, this replication study provided mental disorder weights, but only from another short-cut estimation procedure. While these short-cut procedures are necessary when valuing large numbers of disorders, they do not reflect the validity of the actual person trade-off approach.
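The arithmetic in the two examples above can be checked with a short sketch. The function name is hypothetical, and the calculation is deliberately limited to the simple product of duration and disability weight described in this paragraph.

```python
def healthy_years_lost(years_with_disorder: float, disability_weight: float) -> float:
    """Years of healthy life 'lost' for time lived at a given disability weight.

    Uses the reversed scale described in the text: 0 = perfect health, 1 = death.
    """
    if not 0.0 <= disability_weight <= 1.0:
        raise ValueError("disability weight must lie between 0 and 1")
    return years_with_disorder * disability_weight


# The two cases from the text: a highly disabling state and a mild one.
severe_loss = healthy_years_lost(10, 0.85)   # about 8.5 years
mild_loss = healthy_years_lost(10, 0.05)     # about 0.5 years
print(f"severe: {severe_loss:.1f} years, mild: {mild_loss * 12:.0f} months")
```

The same 10 years therefore contribute very differently to disability burden depending on the weight, which is why the method used to obtain the weight matters so much.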
Study participants
We recruited general practitioners with a particular knowledge of mental disorders from a group who had completed or were completing a postgraduate Masters programme in the recognition and management of mental disorders (a potential study pool of 56 participants). General practitioners are the professionals most likely to be consulted by people with mental disorders [21], and may be better equipped than specialists to place the disability of mental disorders in context with non-psychiatric disorders. None of the participants had any familiarity with either the concept or techniques of preference measurement, and thus they provide a test of the feasibility of the method among those with no direct experience of, or vested interest in, this type of health measurement. This group was contacted from April to July 1999, and 20 were enrolled in the study (further recruitment ceased based on the experiences of this group, discussed below). The average age of the participants was 45 years (range = 35–68) and 70% (n = 14) were female, reflecting the broader age and gender mix of practitioners who enrol in the programme. The group had an average of 18 years' experience in primary care (range = 5–42 years). Informed consent was obtained, and participants were paid a nominal fee ($50) as reimbursement for their time.
Study tasks
The Global Burden of Disease approach [16] included the person trade-off and a rank order task. Before implementing this protocol, we included another preference measurement method, the rating scale, as a validity comparison to the person trade-off. All of the study tasks involved an initial consideration of a disorder described in terms of symptoms and disability (see below), followed by a series of ratings to measure how severe that disorder was perceived to be.
Rating scale
The rating scale task consisted of a vertical line drawn on a page anchored with ‘best health state imaginable’ at 100 (the top of the line) and ‘worst health state imaginable’ at 0 (the bottom of the line). One-point increments were drawn on the scale, with each 10-point interval labelled (10, 20, 30 etc.). The task was simply to draw a line on the scale that corresponded to the rater's perceived severity of the disorder under consideration.
Person trade-off
The person trade-off (PTO) presented a hypothetical task in which the respondent was asked to act as a decision maker who must decide between two intervention options but, due to limited resources, can afford only one. In an attempt to increase validity, the global study included two versions of the PTO, providing two valuations per disorder (see Fig. 1). For both versions, the value of n is varied in a ‘back-and-forth’ procedure [10] until the number of people saved in the given health state is perceived as equivalent to saving the 1000 healthy (for example, the first value of n is 2000 and the person chooses to save the 1000 healthy, so n is increased to 4000 and the person still chooses the healthy, so n is increased to 6000 but now the person chooses the disabled, and so n is lowered to 5000, and this continues until the n disabled and the 1000 healthy are considered equivalent and the person cannot choose between them). For PTO1, the more disabling the health state is perceived to be, the higher the equivalence number, whereas the opposite is the case for PTO2. Both versions are supposedly two different ways of framing the same question (although this has been questioned [17]), so consistency of responses across the two is required (the expressions to translate the two values of n to preference values between 0 and 1 are mathematically equivalent, such that a value on PTO1 has a corresponding ‘consistent’ value on PTO2 [16, p.98]).
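The ‘back-and-forth’ search described above can be sketched as a simple bracketing procedure. This is an assumption-laden illustration, not the protocol itself: the starting value and step follow the example in the text, while the halving rule, the stopping tolerance, the function names and the simulated respondent are all hypothetical.

```python
from typing import Callable, List, Tuple

HEALTHY, DISABLED, INDIFFERENT = "healthy", "disabled", "indifferent"


def back_and_forth(choose: Callable[[int], str], start: int = 2000,
                   step: int = 2000, tolerance: int = 100,
                   max_n: int = 1_000_000) -> Tuple[int, List[int]]:
    """Return (equivalence number, list of n values asked)."""
    asked: List[int] = []
    low, n = 0, start            # low = largest n at which the healthy group was chosen
    # Phase 1: raise n until the respondent stops choosing the 1000 healthy.
    while True:
        asked.append(n)
        answer = choose(n)
        if answer == INDIFFERENT:
            return n, asked
        if answer == DISABLED:
            high = n
            break
        low = n
        n += step
        if n > max_n:
            raise ValueError("no equivalence point found below max_n")
    # Phase 2: move back and forth inside the bracket until it is narrow enough.
    while high - low > tolerance:
        n = (low + high) // 2
        asked.append(n)
        answer = choose(n)
        if answer == INDIFFERENT:
            return n, asked
        if answer == HEALTHY:
            low = n
        else:
            high = n
    return (low + high) // 2, asked


def simulated_respondent(indifference_point: int) -> Callable[[int], str]:
    """Hypothetical respondent with a fixed, perfectly consistent indifference point."""
    def choose(n: int) -> str:
        if n == indifference_point:
            return INDIFFERENT
        return HEALTHY if n < indifference_point else DISABLED
    return choose


equivalence, questions = back_and_forth(simulated_respondent(5000))
print(equivalence, questions)
```

With a respondent whose indifference point is 5000, the sketch asks about 2000, 4000, 6000 and then 5000, mirroring the sequence in the example. Real respondents are unlikely to be this consistent, which is part of the difficulty the task presents.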
Rank order
Due to the abstract nature of the PTO, a simple rank order of the severity of the disorders was used (from most disabling to least disabling), to help the participants think about the meaning of their PTO values. The rank order was conducted with the use of labelled flash cards.
Development of the mental disorder descriptions
The aim in developing new disorder descriptions was to present the ‘average’ case for 19 disorders, in terms of symptoms and disability. Type and severity of symptoms were based on endorsement of ICD-10 diagnostic criteria [22], using data from the Australian National Survey of Mental Health and Wellbeing [21], a household survey (N = 10 641) of mental disorders which used the Composite International Diagnostic Interview to obtain ICD-10 diagnoses. Level of disablement was described from self-reported functioning and wellbeing in the Survey, using the Medical Outcomes Study Short-form 12 (SF-12) [23]. This symptom and disability information was based on people who were currently symptomatic (1-month prevalence). An example of one of the disorder descriptions provided to raters is presented in Figure 2. For comparison with the Global Burden of Disease study, two physical disorders (blindness and paraplegia), using descriptions identical to those used in the Global Burden of Disease study, were included.
Figure 2. Example of the disorder descriptions provided to the study participants to make their judgements of severity. The shaded response category represents the average score for each item for those with a mild depressive episode. Data taken from the Australian National Survey of Mental Health and Wellbeing [21].
Procedure
The present study used a single group, non-experimental design. Participants were provided with a paper-and-pencil booklet containing the disorder descriptions, rating scales, forms for recording PTO responses and a brief demographic sheet. Each participant began with the rating scale task, working through the booklet individually. After an initial introduction to the PTO, each participant then individually completed PTO valuations for each disorder. This was followed by the rank order task, finishing with an open discussion of people's preference values, with the opportunity to revise PTO values in response to their rank order and others' responses. The original protocol [16] advises a lengthy 8-h session with 8–12 participants rating 22 disorders with the PTO. As this all-day format was not convenient for this group of clinicians, smaller groups (five groups of 3–6 participants) were run over 2 h. Due to the consequent time restrictions, each participant rated a core set of nine disorders followed by four additional disorders, such that each person rated 13 disorders. The core set included the two disorders as used in the global study (blindness and paraplegia) and one disorder from most of the major diagnostic categories (panic disorder and agoraphobia, alcohol dependence, psychotic episode, and depressive episodes). In this way, all disorders could be rated, albeit with a smaller number of ratings for some disorders.
Data analysis
Data were analysed using SPSS for Unix (version 6.1; SPSS, Chicago, IL, USA). As no gold standard exists for validating preferences, we examined the construct validity of the PTO method through its association with another preference measure (rating scale) and a simple rank order task. First, mean PTO and rating scale values were compared with paired t-tests (after reverse scoring the rating scale, such that higher values indicate worse disability on both measures). Tests were restricted to disorders with five or more pairs of ratings, with adjustments for multiple comparisons (producing a significance value of 0.003). Second, an approach described by Giesler et al. [24] was used, which compares the directly elicited rank order of health state descriptions to the rank order implied by the observed preference weights. An inconsistency score is calculated for each participant, expressing the number of PTO values that are inconsistent (out of order) with the card sort rank order as a proportion of the total number of possible inconsistencies. The score is therefore a value between 0 and 1, with lower values indicating higher agreement (and greater validity) between rank order and preference values.
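One way to compute such an inconsistency score is sketched below. It assumes that inconsistencies are counted over pairs of disorders and that tied weights are not counted as out of order; the exact counting rule in Giesler et al. [24] may differ. The function name, disorder labels and weights are hypothetical and are not data from the present study.

```python
from itertools import combinations
from typing import Dict, Sequence


def inconsistency_score(rank_order: Sequence[str], weights: Dict[str, float]) -> float:
    """Proportion of disorder pairs whose weights contradict the direct rank order.

    rank_order lists disorders from most to least disabling (the card sort);
    weights maps each disorder to a disability weight (higher = more disabling).
    Assumption: pairwise counting, with tied weights treated as consistent.
    """
    rated = [d for d in rank_order if d in weights]
    pairs = list(combinations(rated, 2))   # each pair is (ranked worse, ranked milder)
    if not pairs:
        raise ValueError("need at least two rated disorders")
    out_of_order = sum(1 for worse, milder in pairs
                       if weights[worse] < weights[milder])
    return out_of_order / len(pairs)


# Hypothetical participant: four disorders ranked A (worst) to D (mildest),
# but the weight given to C exceeds the weight given to B.
card_sort = ["A", "B", "C", "D"]
pto_weights = {"A": 0.8, "B": 0.5, "C": 0.6, "D": 0.2}
print(round(inconsistency_score(card_sort, pto_weights), 3))
```

In this made-up case one of the six possible pairs is out of order, so the score is one-sixth. A score of 0 would indicate that the weights reproduce the card sort exactly.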
Results
Feasibility: experience of the person trade-off method
This group of general practitioners reported real difficulty with the conceptual complexity of the task (particularly version 2, trading quantity for quality of life), and most were confronted by the trading of groups of people. One participant took such offence that they revoked consent for use of their PTO data, while permitting inclusion of their other data. The face validity was low, as many found it difficult to conceive of the task as a measure of severity.
Disability weights for mental disorders
Despite these difficulties, all participants eventually completed the study requirements (with the exception of the participant who revoked consent, and one individual who gave up on the PTO after rating only the first five disorders). Table 1 presents all PTO values, with the rating scale reported for those who also had completed a PTO for that disorder (as this serves as the validity comparison). Person trade-off disability weights ranged from lower values for mild depressive episode (0.09), obsessive–compulsive personality disorder (0.16) and obsessive–compulsive disorder (0.17), to very high disability weights for opioid dependence (0.91) and psychotic episode (0.82). In general most values represented a moderate level of disability.
Table 1. Disability weights for mental disorders from the present study, compared with the Global Burden of Disease study. Values are on a scale from 0 to 1, with lower values indicating more preferable health states (less disabling).
Validity of the person trade-off: comparison with the rating scale and rank order
As can be seen in Table 1, the weights from the PTO and rating scale were not significantly different for any disorder, with the exception of mild depression (PTO value milder than rating scale, t = 4.33, df = 18, p < 0.001). This non-significance should be interpreted with caution due to low numbers for some disorders and the variability in the data. The second validity analysis was a comparison with the rank order task, with inconsistency scores calculated for both the initial PTO judgements and the final values after completion of the rank order and group discussion. The average inconsistency score for the initial PTO values was 0.311 (SD = 0.137, range from 0 to 0.500). This means that, on average, 31% of disorders had an implied PTO rank order that was inconsistent with the direct rank order from the card sort. Many participants took the opportunity to revise their PTO values following comparison with their rank order and group discussion, with the final values significantly more valid (mean = 0.250, SD = 0.175, paired t-test t = 2.459, df = 18, p = 0.02). This second validity test is obviously confounded, because participants revised their values after seeing the validity comparator itself (the rank order), but it serves as a test of the respondents' understanding of what the PTO task is trying to measure. As inconsistency scores were reduced, respondents tended to revise their initial PTO values to be more consistent with their card sort rank order, implying the PTO was not as readily understood as the simple rank order.
Comparison with the Global Burden of Disease study
To place the weights from the present study in context, Table 1 also provides the PTO weights from the Global Burden of Disease study. Two sets of weights are provided: those based on the full PTO protocol that was used for the indicator conditions, and the weights from the short-cut estimation procedure (the actual weights used to calculate disability burden). These latter weights include an adjustment for the effectiveness of treatment (as judged by the expert group) in economically developed countries. Similarly, the weights from the present study were based on a consideration of severity in light of average treatment conditions. The two physical disorder descriptions included in the present study were identical to those used in the original study, and produced near-identical weights. In comparison, the mental disorder vignettes from the present study were based on the symptoms and disability of those currently symptomatic, and on the whole the resultant weights appeared to be substantially higher than those used in the global study.
Discussion
This study examined the disability adjustment method used in the Global Burden of Disease study, and is one of only two separate investigations of this method, and the only one specific to psychiatry. In terms of feasibility, the PTO protocol was found to be difficult to replicate with an educated group of clinicians who had no prior experience of preference measurement and no vested interest. In contrast, the rating scale and ranking exercise were found to be much more palatable to respondents. The conceptual difficulty with the PTO probably contributed to the only moderate level of validity. In the absence of a gold standard for preferences, the standard approach to investigating validity is a comparison with measures purported to be tapping the same construct. While this is constrained by the validity of the comparison measures, it does indicate whether there is consistency in measures that should be related. The validity analysis thus compared the PTO values with other measures of severity, to determine the consistency among them. Comparison with the rating scale showed few significant differences; however, comparison with the simple rank order demonstrated that people's PTO values were only moderately consistent (nearly 1 in 3 initial disability weights were inconsistent with their rank order from the card sort task). It is important to note that validity was significantly improved following an opportunity for group discussion; however, this is only achieved by lengthening an already burdensome task.
The mental disorder weights from the present study appeared higher than those used in the global study, particularly in comparison with the weights from the estimation procedure. It should be kept in mind that disability burden is a combination of both epidemiology and disability severity, so disability weights based on current (1-month) prevalence (as was done for the present study) would only be combined with 1-month prevalence figures, which are lower than the 1-year prevalence that was used in the global study. Furthermore, using descriptions of current cases is likely to have produced these higher weights: current cases report more severe disability than those who have remitted [25], and the SF-12 scores used in the vignettes for the present study had a moderate rank order correlation with the resultant disability weights (Spearman's rho = 0.672, p = 0.002). The more severe weights are therefore not unexpected.
Limitations of the study
The present study provided disability weights from a small group of general practitioners. The complexity of the PTO protocol was considered a barrier to other possible informants, such as patients or family members [18]. These difficulties also contributed to the low subject numbers in the present study; however, it is worth noting that the expert group in the Global Burden of Disease study was of a comparable size, and many of the ratings from the Dutch replication study were based on only six respondents [20]. A planned test–retest reliability analysis was abandoned due to low subject interest (n = 3). These difficulties pose a serious limitation to the method in general if it restricts both study sizes and key informants by virtue of its complexity. While the non-experimental design did not allow a controlled comparison of differences in stimuli and raters, the two disorder descriptions used exactly as in the original study (blindness and paraplegia) produced near-identical weights, suggesting our raters provide similar values given similar stimuli. Despite the limitations of the present study, it does provide the only investigation of the PTO protocol looking specifically at mental disorders, and is one of only a few validity studies.
Implications for the burden of disease methodology
Disability weights, in combination with epidemiology, determined the high disability burden of mental disorders observed in the burden of disease studies. The prevalence of mental disorders is not in dispute; it is the combination with preferences that raises concerns about the overall validity of this method. While the PTO is theoretically attractive, trading groups of people, even though entirely hypothetical, is difficult or disturbing for many participants [11], [26]. Ideally disability weights should incorporate the views of all stakeholders [18], [19] particularly where those decisions may ultimately impact on policy decisions for health care. Unless the PTO can be made more accessible and more relevant, a simplified method such as the rating scale [24] is probably preferable. Furthermore, as preferences are not commonly used in psychiatry [27] their role requires further investigation. In principle, preferences can complement descriptive health measures as they incorporate the importance of differences in health status [28]. In practice, preference measurement remains conceptually and methodologically controversial [29]. The validity of the burden of disease methodology for disability, at least in its current form, is thus dependent on the validity of preferences.
Conclusions
The Burden of Disease methodology was innovative in its attempt to incorporate disabling conditions in global health statistics, rather than relying solely on traditional mortality rates. The prominence of mental disorders observed in both the original [1] and subsequent [4–6] Burden of Disease studies has been an important advocacy tool for the mental health community, and the present study confirms the importance of the burden of mental disorders. However, continued interest in this approach should proceed with a greater attention to the methods that lie behind such powerful results. This study has identified one area of particular importance for further work, the disability adjustment method. In light of the paucity of validity studies of the approach used, and the only limited evidence of validity from the present study, this should be a focus for future efforts. Such studies will increase confidence in disability burden estimates, which can then be appropriately used to inform mental health policy.
Acknowledgements
Financial support for this study was provided by a project grant from the National Health and Medical Research Council. The authors thank the clinicians who participated in this study for their perseverance and good will.
