Abstract
There is a growing need in medical practice for evidence to demonstrate the effectiveness of an intervention. In psychiatric care, the imperative for evidence is no less important. The Health of the Nation Outcome Scales (HoNOS) is a clinician-completed set of scales designed to measure outcome in mental health [1] and is being advocated as a measure suitable for this purpose. It has been seen as an easily administered, reliable instrument providing objective evidence of the benefits of hospitalisation in psychiatric care [2, p. 199]. In Australia, the HoNOS has been recommended by a number of bodies as one of a small number of tools to be used in assessing outcome from psychiatric care [3–5].
Questions about the usefulness of the HoNOS have recently been raised. It was unrelated to length of stay and diagnosis across six private psychiatric hospitals [2]. Goldney et al. (1998) found this surprising in that ‘one might reasonably have anticipated that a more severe disturbance would have been associated with a greater period of hospitalisation’ [2, p. 204]. Thus, the validity of the HoNOS in determining service provision may be limited.
The minimum essential characteristics of any scale are reliability and validity [6]. The HoNOS, as a clinician-completed tool, requires both test–retest reliability (the same answer on the same patient by the same rater over two occasions) and interrater reliability (the same answer on the same patient by different raters). The determination of validity is an ongoing process to establish that a scale is measuring what it is supposed to measure.
During the original studies, the validity of the HoNOS was established by comparing it with the Role Functioning Scale (RFS) and the Brief Psychiatric Rating Scale (BPRS) [1]. Both these measures are clinician-rated instruments. The validity of a ‘concept’ (i.e. mental health) should not be dependent on the rater; that is, scores derived from the patient should be similar to the scores derived from the clinician if the concept is adequately measured.
To examine the validity of the HoNOS in relation to patient-derived measures of mental health, it is important to determine suitable instruments. Recently, a number of models of the relationships between various health outcome measures have been developed [7–10]. All models present a common theme and ordering of measures. At the most basic level are physiological and/or biological measures followed by measures of increasing subjectivity: first symptom, then health status and finally health-related quality of life (HRQOL) measures. The HoNOS is clearly not a HRQOL measure, but is not easily classified as a symptom or as a health status measure. The scales in the HoNOS are broader than individual symptoms, but not of the degree of subjectivity to be considered health status. Thus, the validity of the HoNOS is examined in relation to both a symptom measure, the Symptom Checklist 90 Revised (SCL90-R) [11] and a health status measure, Medical Outcome Study Short-Form 36 (SF-36) [12].
Within Australia, the HoNOS is being proposed as a standard outcome measure for all patients receiving mental health care. Thus, it is intended that every patient receiving in- or outpatient care will have the HoNOS completed at least twice. This is a substantial commitment of resources, at a substantial cost to providers of psychiatric care and produces a large set of data. Given the resource and decision-making implications that may flow from using the HoNOS in this manner, it is important that it be reliable and valid.
Aim of the study
The aims of this study were to examine the reliability and validity of the HoNOS in a private psychiatric hospital setting. This paper reports on three studies examining the psychometric properties of the HoNOS.
Method
Study one: reliability of the HoNOS
All subjects in study one were trained to the criterion set out in the original training [1].
Subjects
Subjects consisted of the current clinical staff of St John of God Hospitals in New South Wales, Australia. The majority of staff were nursing staff and the remainder were allied health staff (e.g. psychologists). Many staff had previously been trained in the HoNOS during a major diagnosis and costing study in Australia [13]. Other staff had been trained on the ward. There had been no formal training in the HoNOS in the intervening 2 years since the Mental Health Classification and Service Costs (MHCASC) project.
Design
The two objectives of this first study were to maximise the effectiveness of the training and to examine the reliability of the HoNOS. Given the first objective, the training proceeded in two stages. First, two pilot programs to evaluate and develop the training program were run. Second, training of clinical staff was implemented. After the first pilot study, both the content and structure of the training and the profiles used to examine the interrater reliability (IRR) were improved.
To control for familiarity with the profiles being a reason for improving IRR, the second pilot study was randomly split into two groups. The first completed the profiles before, and 1 week after, training. The second, the control group, completed the profiles twice a week apart before training [14]. Further, as the control group completed the profiles without intervention, the test–retest reliability can be examined. Staff in the control group would have had either no formal training or training 2–3 years earlier in the HoNOS. Thus, this is a limited examination of the test–retest reliability.
The first study consisted of training clinical staff using the refined training procedures and the refined patient profiles. The major objective during this study was to examine the IRR of the HoNOS after training. To control for the effect of the unique St John of God profiles, the last training group completed a second set of profiles used in field trials in Victoria, Australia.
Materials
The HoNOS consists of 12 scales. Each scale is rated on a five-point scale; from 0 (no problem) to 4 (severe/very severe). Each scale covers a broad range of problems rather than specific symptoms (e.g. problems resulting from overactive, aggressive, disruptive or agitated behaviour). The scales were constructed to be as independent as possible [1, p. 28]. The 12 scales have been grouped, based on clinical judgement, into four categories. These are the behaviour, impairment, symptom and social subcategories. In addition, there are two total scores, a 12- and a 10-item total. The 10-item total excludes the last two scales, often not completed on inpatients.
The profiles used in the St John of God Hospital studies were initially written by senior clinical staff, based on actual cases. The profiles covered a variety of different patient types, including depression, schizophrenia and posttraumatic stress disorder. They were then modified by other clinical and research staff to ensure that they contained sufficient information to complete a HoNOS. The profiles were designed to be naturalistic with a mixture of current information and past history, some having diagnostic information and others not. They were considered to reflect real patients, a view that staff confirmed in comments during training. The naturalistic design was adopted to emulate the real situation as closely as possible. Staff were told to assume that the lack of information indicated the absence of a problem. Thus, they were instructed to score a scale as 0 if, in their opinion, no information was available. This instruction was based on the limitation imposed by the use of profiles.
Study two: validity of the HoNOS: inpatients
Patients admitted to two units at St John of God Hospital Richmond complete an SCL90-R on admission and on discharge. Staff also complete the HoNOS on admission and on discharge from each unit. The first unit, St Joseph's, is an acute psychiatric unit principally treating patients with affective and adjustment disorders. The second unit, St Paul's, treats patients with drug and alcohol problems. On entering St Paul's, patients also complete a measure of dependence; thus, the validity of HoNOS scale three, ‘problem drinking or drug taking’, is also examined.
Study three: validity of the HoNOS: day patients
St John of God provides a range of psychiatric services on a day patient basis. Patients on entering and discharge from day services complete a battery of questionnaires including the SF-36. Staff also complete the HoNOS. Thus, the validity of the HoNOS is examined in relation to a well-accepted health status measure.
Analysis
All analysis was completed using
Study one
The intraclass correlation (ICC) was determined for each profile, for the HoNOS total scores and for the HoNOS scales [15]. The ICC was determined across all profiles providing a composite average. To examine the adequacy of the St John of God profiles, profiles used in a field trial study from the State of Victoria [16] were also used with one training group. These differed from the St John of God profiles in that they contained information directly related to the HoNOS scales and were thus less naturalistic than the St John of God profiles.
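The ICC calculation described above can be illustrated with a minimal sketch. The paper does not state which form of the ICC was used, so the one-way random-effects form, often written ICC(1,1), is assumed here; the function name `icc_oneway` and the example ratings are hypothetical and are not the study data.

```python
from statistics import mean


def icc_oneway(ratings):
    """One-way random-effects ICC(1,1) for a targets x raters table.

    ratings: list of rows, one row per rated target (e.g. one profile),
    each row holding the scores given by the k raters (no missing values).
    The choice of ICC(1,1) is an assumption; the source does not name the form.
    """
    n = len(ratings)       # number of targets
    k = len(ratings[0])    # number of raters per target
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 targets, each scored by the same k >= 2 raters")

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]

    # Between-target and within-target mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))

    denom = ms_between + (k - 1) * ms_within
    if denom == 0:
        raise ValueError("ratings have no variance; ICC is undefined")
    return (ms_between - ms_within) / denom


# Hypothetical example: 3 profiles, each given a total score by 4 raters.
example_ratings = [
    [12, 14, 11, 13],
    [20, 18, 22, 19],
    [7, 9, 6, 8],
]
example_icc = icc_oneway(example_ratings)
```

Under this form, the ICC approaches 1 when raters differ little relative to the spread between profiles, and it falls as within-profile disagreement grows.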
Study two and study three
The total and subcategory scores of the HoNOS were correlated with the scales of the SCL90-R and the SF-36. Specific scales of the HoNOS were correlated with similar scales from the SCL90-R, the SF-36 or with dependence measures. The total score correlations are calculated first to control for multiple testing [17]. Thus, if the total scores do not correlate significantly and substantially, there is no justification for examining subcategories and individual scale scores. These subsequent analyses will be completed for comprehensiveness only and, unless highly significant (based on Bonferroni adjustments [18]), will not be considered significant.
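The two ingredients of this strategy, a product-moment correlation and a Bonferroni-adjusted significance threshold for the follow-up tests, can be sketched as follows. Pearson's r is assumed because the results are reported as r; the function names, the example scores and the number of follow-up tests are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two lists of equal length with at least 3 pairs")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant variable has no defined correlation")
    return sxy / sqrt(sxx * syy)


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after a Bonferroni adjustment."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


# Hypothetical scores for five patients (not study data).
honos_total = [1, 2, 3, 4, 5]
patient_total = [2, 1, 4, 3, 5]
r_total = pearson_r(honos_total, patient_total)

# If, say, 10 follow-up scale-level correlations were run for completeness,
# each would be judged against this stricter threshold.
followup_threshold = bonferroni_alpha(0.05, 10)
```

The design choice is that the total-score correlation acts as a gate: only if it is significant and substantial do the scale-level tests carry inferential weight, and even then each is held to the adjusted threshold.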
Missing data
Missing data for the HoNOS were handled by mean substitution under the following rules: the 10-item total score was calculated only if there were two or fewer missing items from the first 10 items. The 12-item total was calculated only if there were three or fewer missing items. Each subscale was calculated if one or zero items were missing, except the impairment subscale where both items needed to be present. This is a very conservative approach in comparison with Boot et al. [19], and consistent with Goldney et al. [2]. Even so, it could be argued, given the ‘independence’ of the HoNOS scales, that any missing item should exclude that HoNOS from further analysis.
Missing data on the SF-36 were handled as per instruction [20] and missing data in the SCL90-R were handled as per instruction [11]. There were very little missing data on either of these questionnaires.
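The HoNOS missing-data rules above can be expressed as a short scoring sketch. Two points are assumptions rather than statements from the paper: the substituted value is taken to be the mean of the same rater's available items within the score being calculated, and the item-to-subscale grouping (items 1–3 behaviour, 4–5 impairment, 6–8 symptom, 9–12 social) follows the commonly published HoNOS structure. The names `SUBSCALES`, `prorated_sum` and `score_honos` are hypothetical.

```python
# Assumed item grouping (0-based indices); the paper names the four
# subcategories but does not list which items belong to each.
SUBSCALES = {
    "behaviour": [0, 1, 2],
    "impairment": [3, 4],
    "symptom": [5, 6, 7],
    "social": [8, 9, 10, 11],
}


def prorated_sum(values, max_missing):
    """Sum with mean substitution, or None if too many items are missing.

    Missing items are given as None. Each missing item is replaced by the
    mean of the available items (an assumption about 'mean substitution').
    """
    present = [v for v in values if v is not None]
    n_missing = len(values) - len(present)
    if not present or n_missing > max_missing:
        return None
    return sum(present) / len(present) * len(values)


def score_honos(items):
    """Return the 10-item total, 12-item total and subscale scores."""
    if len(items) != 12:
        raise ValueError("a HoNOS record has 12 scale ratings")
    scores = {
        "total10": prorated_sum(items[:10], max_missing=2),
        "total12": prorated_sum(items, max_missing=3),
    }
    for name, idx in SUBSCALES.items():
        # The impairment subscale requires both of its items to be present.
        allowed = 0 if name == "impairment" else 1
        scores[name] = prorated_sum([items[i] for i in idx], max_missing=allowed)
    return scores


# Hypothetical record with four missing ratings (two in the first 10 items).
example_record = [1, 2, None, 0, 1, 2, 3, None, 1, 0, None, None]
example_scores = score_honos(example_record)
```

In this example the 10-item total can still be calculated, while the 12-item total and the social subscale are left undefined because they exceed their missing-item limits.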
Results
Pilot studies: interrater and test–retest reliability
The first pilot study was based on 10 staff members. Staff for the second pilot study were randomly split into two groups of seven. The results of these two studies are presented in Table 1. For the control group the average test–retest correlation on each of the individual profiles, based on the total HoNOS, was 0.86 (range 0.71–0.96).
Key interrater reliability (IRR) scores from pilot studies
Study one: interrater reliability of the HoNOS
The major reliability study reported was conducted on the training of staff following the two pilot studies. Staff completed the eight profiles before and after the training. Table 2 provides the post training ICC on the St John of God profiles (n = 20), and the ICC on profiles used in a Victorian study (Trauer 1998) (n = 6). For comparison purposes, Table 2 contains the results from the original trials [1] and results from the recent field trial in Victoria [16].
Intraclass correlation (ICC) across profiles compared with other studies
Satisfactory IRRs (defined as ≥ 0.6).
Wing et al. 1996 [1, p. 65]: these are the results of the IRR trials conducted during the development of the HoNOS in Nottingham and Manchester. They were completed on patients. The ICCs were calculated between two staff members, a consultant psychiatrist and the nursing member most closely involved with the patient.
Trauer et al. 1999 [16].
The ICC for each of the eight profiles, post-training, ranged from 0.41 to 0.76 (mean = 0.60). On the Victorian profiles, the individual profile ICCs ranged from 0.54 to 0.84 (mean = 0.63).
The ICCs for the individual HoNOS scales, Table 1, ranged from 0.2 to 0.88 (mean = 0.48). Three HoNOS scales consistently had satisfactory ICCs (defined as ≥ 0.6). That is, across pilot study two, the main study and the group who also completed the Victorian profiles, problem drinking or drug-taking, hallucinations and depression, and aggression had satisfactory ICCs.
Study two: validity of the HoNOS in comparison with the Symptom Checklist 90 Revised
For patients with multiple admissions, the first admission in 1998 was used. There were 256 HoNOSs and SCL90-Rs completed out of 318 patients (response rate of 80.5%).
The global severity index of the SCL90-R was correlated with the 10-item total of the HoNOS. On admission r = 0.037 and on discharge r = 0.065. Given this non-significant result, no further analysis should have been conducted [17,18]. However, for completeness, specific scales of the HoNOS were correlated with relevant scales from the SCL90-R and with dependence measures obtained on admission. The HoNOS depressed mood scale correlated poorly with the SCL90-R depression scale (r = 0.15 and 0.17 on admission and discharge, respectively). When anxiety was identified as a specific problem, the correlation between the SCL90-R anxiety scale and the HoNOS measure of anxiety did not differ from 0. The correlation between the HoNOS drug and alcohol measure and patient-completed severity of dependence measures was negative (–0.13) when it should be positive.
To examine the effect of the HoNOS training (study one), the first half of the year was compared with the second half of the year. There were no substantial correlations between the total SCL90-R and the HoNOS in the first or second half of the year (for the first half, r = −0.01 on admission and, r = 0.12 on discharge; for the second half, r = 0.08 on admission and r = −0.02 on discharge).
The HoNOS was designed to measure change. To test this, the simple difference scores were calculated for the HoNOS and the SCL90-R. The correlation between the change scores was 0.04.
There was a significant change on the 10-item total HoNOS scores between admission (mean = 12.98) and discharge (mean = 5.40; paired t = 21.36, df = 253, p < 0.001). There was also a significant change on the SCL90-R Global Severity Index (GSI) from admission (mean = 1.80) to discharge (mean = 0.95; paired t = 16.15, df = 181, p < 0.001).
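The change analysis above rests on simple difference scores and a paired t-test. A minimal sketch of both follows, assuming the standard paired t statistic (mean difference divided by its standard error, with df = n − 1); the function names and the example scores are hypothetical and are not the study data.

```python
from math import sqrt
from statistics import mean, stdev


def change_scores(admission, discharge):
    """Simple difference scores: admission minus discharge for each patient."""
    if len(admission) != len(discharge):
        raise ValueError("admission and discharge lists must be the same length")
    return [a - d for a, d in zip(admission, discharge)]


def paired_t(admission, discharge):
    """Return (t, df) for a paired t-test on admission versus discharge scores."""
    diffs = change_scores(admission, discharge)
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least 2 paired observations")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / sqrt(n))
    return t, n - 1


# Hypothetical 10-item totals for four patients.
admission_scores = [10, 12, 14, 16]
discharge_scores = [5, 8, 9, 13]
t_value, df = paired_t(admission_scores, discharge_scores)
```

Because df = n − 1 in this test, the reported df of 253 and 181 correspond to 254 and 182 complete admission and discharge pairs, respectively. The same difference scores from two instruments can be correlated to compare change as seen by staff with change as reported by patients.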
Study three: comparison with the SF-36
The correlation between the 10-item total HoNOS and the SF-36 Mental Health Component Score (MCS) was non-significant on admission (r = −0.03, n = 290, ns) and on discharge (r = −0.10, n = 102, ns). There was a significant correlation with the Physical Health Component Score (PCS) on admission (r = −0.22, n = 290, p < 0.001) and on discharge (r = −0.33, n = 102, p = 0.001). The HoNOS and the SF-36 are scored in opposite directions; thus, a negative correlation is expected.
Discussion
Reliability and validity of an instrument are minimum essential qualities. Based on the studies presented in this paper, the HoNOS has limited reliability and there is no evidence for its validity in relation to patient-completed information.
The first use of the ICCs was to examine the effectiveness of the HoNOS training. The first pilot study indicated that standard training (training staff to criterion), based on materials from Wing et al. [1], MH-CASC [2] and Morris-Yates [21] failed to impact on the IRR of staff. Modification to the training indicated that reasonable improvements could be gained. Modifications introduced covered three components: adult learning principles, clinical information and a second session for further practice. The learning principles included making the training important and relevant to staff, and allowing staff the opportunity to express, mostly negative, attitudes about using the HoNOS, and answering questions about its use openly and honestly. The clinical content of the training, particularly concerning scales with low IRR, was increased. A second training session, 1 week after the first session, provided staff with a chance to consolidate prior learning through further practice and discussion. The improvements in IRR that resulted suggest that, without formal testing of the effectiveness of training, training may be of no value and may provide a false sense of achievement.
The second reason for the ICCs was to examine the IRR of the HoNOS. The IRRs presented in Table 2 from the St John of God studies are substantially below the other studies also reported in the table. The major differences are due to methodology. In the Wing et al. [1] studies, the ICC was calculated by comparing the scores on the HoNOS of one psychiatrist, who was involved in training other staff, with the nurse most closely involved in the patients' treatment. In the Victorian field trials [16], the IRR was calculated in a stable population of mentally ill patients, by comparing the scores of two staff members completing a HoNOS on the same person at different times. Thus, in both these studies, the IRRs were determined on actual patients, but only on two staff measuring each patient. In the St John of God studies, the IRR was determined by comparing the scores of multiple staff across multiple patients (profiles).
This is similar to the proposed situation of assessing all private patients receiving mental health care in Australia. To determine the reliability of the HoNOS under these circumstances requires a methodology similar to that employed in this study, that of multiple patients being rated by multiple staff. Thus, it is necessary to achieve IRR across multiple raters, not just between two raters.
When multiple staff are compared across individual patients the IRR, although reasonable, is still not of a level that would be required to consider the HoNOS reliable. Further, when individual scales are examined across multiple patients (profiles) only three or four scales could be considered reliable. The hallucinations, depression, problem drinking or drug-taking, and possibly the aggression scales could be considered reliable. However, the majority of scales are unreliable.
Examination of the relationship between the HoNOS, the SCL90-R and the SF-36, two well-established and valid measures, failed to provide substantial evidence for the validity of the HoNOS. Wing et al. compared the HoNOS with two clinician-completed instruments, the RFS and the BPRS in two hospitals [1]. The trial at Nottingham showed good evidence for the validity of the HoNOS. The evidence from the Manchester trial, however, was not as strong (pp. 66–70). In contrast to this, the St John of God study compared patient-completed measures, rather than clinician-completed measures, with the HoNOS. It has been argued, in relation to HRQOL, that the patient perspective is primary, and thus the ‘gold standard’ to be attained [22]. In addition, the stronger relationship with physical health than with mental health suggests that the discriminant validity of the HoNOS is poor [23]. Determining the validity of an instrument is an ongoing process, with no clear end. The HoNOS, in reference to self-completed comparable measures, appears invalid.
Wing et al. [24] claimed that the HoNOS was sensitive to change and that the change in HoNOS was related to clinician judgement of the amount of change in patients' mental health. The change scores on the HoNOS were unrelated to change scores on the SCL90-R. That is, staff perceptions of patient change are unrelated to change experienced by the patients. Thus, one of the more important psychometric characteristics, sensitivity to change [25] of the HoNOS, is unrelated to the experience of the patient.
There are a number of reasons why clinician-completed instruments may not match patient assessments. A substantial reason for differences between first and third person assessments is the difference in the information available to each assessor, with the clinician having limited information. A major review of third person assessments of HRQOL found that third persons were generally poor at assessing HRQOL when compared with the patient [26]. The differences in assessment, although generally lower, were not simply a change in scale (just lower or just higher), but were often unrelated to the patients' assessments. There were a number of factors related to the degree of relationship between third and first person assessment. Two of the factors are relevant to the HoNOS. First, the closer the relationship between the patient and the rater, the better the assessment. Second, the more specific and behavioural the questions, the closer the scores between first and third person. Thus, third person assessments are of limited validity, particularly when there is a limited relationship with the patient being rated and when the scales are broadly based. The HoNOS does not have specific behavioural questions, and often it is completed by a person with a limited knowledge of the patient.
St John of God has examined the reliability and the validity of the HoNOS in three different studies. Overall, the HoNOS demonstrated moderate reliability and seemed to measure change. However, it has limited validity compared with patient-completed measures of mental wellbeing, both at a point in time and in relation to the measurement of change. A series of studies reporting on the HoNOS have recently been published in the British Journal of Psychiatry [16,27–29]. The results of these recent studies are varied; however, they tend to support those of this study. Some HoNOS scales have limited reliability [16,27], there is poor agreement between different workers [28], and the HoNOS is of limited clinical value [29].
Given the above difficulties, it is recommended that the HoNOS not be implemented as a major outcome tool until its reliability and validity are clearly established.
Acknowledgements
The research on which this report is based was supported by St John of God Health Services in New South Wales, Australia. The support of Natalie Smith and Janet Devlin in various stages of design and data collection was invaluable. The support provided by Dr Tom Trauer (Department of Psychological Medicine, Monash University, Australia) has also been most valuable.
