Abstract
Health services aim to improve the functioning and reduce the symptomatology of their target group. Assessment of the effectiveness of any aspect of a service requires an appropriate outcome measure. Health funding bodies, clinicians and researchers all recognize that outcome measures are urgently required. Jenkins states this succinctly: ‘in order to evaluate our health care system, we need to be able to measure the baseline health of the population, and then to measure the impact of health care on that baseline’ [1, p.500]. This urgency is echoed by Birleson [2], who believes that services will increasingly be expected to deliver value for money.
Traditionally, health services have been measured by inputs, structures and processes, although the limitations of these measures alone have long been recognized [1, 3]. Many changes in health policy are aimed at improving children's mental health. Systems of care can be changed and this can affect patterns of care and expenditure [4]. Efforts to improve quality in health tend to focus on monitoring and altering processes without articulating the connection between those processes and outcomes. Modifying the structures and processes of mental health care may not produce better mental health. Without an ongoing measure of outcome, there is little way of knowing whether well-intentioned initiatives achieve actual improvement for children or adolescents. In fact, there is disturbing evidence that apparently beneficial structural initiatives (e.g. providing a continuum of care) may not clearly improve mental health [4].
Child and Adolescent Mental Health Services (CAMHS) exist to improve the mental health of young people with significant emotional, behavioural or psychological problems. Comparatively little is known about the best ways of addressing the mental health needs of children or about the effectiveness of current mental health treatments [4–6]. There are increasing calls for clinicians to use evidence-based treatments [7]. While this generally refers to those treatments supported by the empirical literature, Weiss extends the idea to argue that [8, p.943]: ‘Responsible use of a treatment in a clinic or practice group requires that the treatment be evaluated in that particular group of practitioners, where the treatment has been implemented not for research purposes but for the purpose of treating clients effectively. The objective of routine monitoring studies is not to establish the general effectiveness of a treatment but rather to determine, for a particular clinic or practice group, the parameters of the treatment's effectiveness' [my italics].
Bickman [9] suggests that reforms in mental health systems have generally moved from efficacy research to system change without examining the effectiveness of interventions with practising clinicians and real patients. Only localized outcome measurement allows investigation of the impact of literature-supported treatments with the real population of interest: those seen in your service or by you.
A comprehensive review in 1996 concluded that no outcome instruments were available for CAMHS that had psychometric rigour and were clinically viable [10]. There is a lack of generic outcome measures for children and adolescents [11]. More recently, Bickman et al. [12] in their review highlighted the ‘dearth of national or international literature concerning outcome measurement in child and adolescent mental health’ (p.29). A mixture of existing and new instruments is required [12], with many existing measures failing the criterion of utility, being too time-consuming or costly to use routinely [10].
In its transition to a learning organization informed by outcomes, Maroondah Hospital CAMHS required a brief, inexpensive, clinically useful instrument. A clinician-rated instrument recommended by Hunter et al. [10], the Health of the Nation Outcome Scales for Children and Adolescents (HoNOSCA) [13], was implemented.
The Health of the Nation Outcome Scales for Children and Adolescents has been examined in two studies in the UK. Gowers et al. [13] presented the field trial results and concluded that HoNOSCA had adequate reliability and was acceptable to a range of clinicians. Yates et al. [14] found that HoNOSCA could usefully describe the profiles of children referred to CAMHS.
This paper aims to describe aspects of the reliability and validity of HoNOSCA in an Australian CAMHS where HoNOSCA was implemented as a routine measure. The study first addresses interrater reliability with case vignettes and then examines properties of HoNOSCA with a patient sample. Child and Adolescent Mental Health Services treat a heterogeneous population, and this necessitates examining HoNOSCA with a sample heterogeneous in age, gender and diagnosis, rated by professionals heterogeneous in experience and professional background. Heterogeneity of the treated and treating populations is the sine qua non of public CAMHS.
Method and results
The Health of the Nation Outcome Scales for Children and Adolescents
The Health of the Nation Outcome Scales for Children and Adolescents was developed in the UK [13] as a brief mental health measure. Completed by clinicians, it comprises 15 scales of which the first 13 are used to compute the total score.
The 13 scales are disruptive/aggressive behaviours, overactivity/concentration, self-injury, substance misuse, scholastic/language skills, physical illness/disability, hallucinations/delusions, non-organic somatic symptoms, emotional, peer relationships, self-care, family relationships, and school attendance. The most severe occurrence for each scale in the preceding 2 weeks is scored on a 0–4 rating from ‘no problems’ through to ‘severe problems’.
When HoNOSCA is completed on two or more occasions, the difference between scores is a measure of change. This change may reflect treatment, maturational or other factors. Little is known about HoNOSCA, with the main published data based on the penultimate version (Version 5) with a UK sample [13]. This study presents data from Version 6 (HoNOSCA 98) [15].
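To make the scoring rule concrete, the sketch below sums the 13 contributing scales and computes a difference score between two administrations. It is illustrative only: the function and scale-label names are ours rather than part of the instrument, and the direction of the difference score (time 1 minus time 2, so that a positive value indicates fewer rated problems) is one convention among several.

```python
from typing import Mapping

# The 13 scales that contribute to the total score, as listed above.
HONOSCA_SCALES = [
    "disruptive/aggressive behaviours", "overactivity/concentration", "self-injury",
    "substance misuse", "scholastic/language skills", "physical illness/disability",
    "hallucinations/delusions", "non-organic somatic symptoms", "emotional",
    "peer relationships", "self-care", "family relationships", "school attendance",
]

def total_score(ratings: Mapping[str, int]) -> int:
    """Sum the 13 scale ratings; each scale is scored 0 (no problems) to 4 (severe)."""
    for scale in HONOSCA_SCALES:
        if not 0 <= ratings[scale] <= 4:
            raise ValueError(f"'{scale}' must be rated 0-4")
    return sum(ratings[scale] for scale in HONOSCA_SCALES)

def change_score(total_time1: int, total_time2: int) -> int:
    """Difference between two administrations, computed here as time 1 minus time 2,
    so a positive value indicates a reduction in rated problems."""
    return total_time1 - total_time2
```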
Setting
At the time of this study, multidisciplinary teams provided outpatient services to children aged from 0 to 18, from two main centres and outreach locations in a suburban and semirural area of Melbourne. The catchment area had an under-19 population of approximately 115 700 [16] and the area was predominantly English speaking. The annual referral rate was 1470 with 861 patients accepted for assessment and treatment.
Training
The introduction of any instrument should be closely accompanied by relevant training to enhance reliability [13]. One of the concerns expressed about generalizing the findings of randomized controlled trials to clinical settings is the amount and standardized nature of the training received [17]. Training in a clinical setting is rarely comparable with the intensity and resources available in research projects. It is important that training be similar to that which clinicians joining a service could expect to receive. The aim of this study was to examine the reliability of HoNOSCA under clinically realistic conditions.
Reliability study
Procedure
Twenty-four of the service's 30 clinicians received 1 hour's discussion on HoNOSCA and another hour on administration, implementation and logistics. The Health of the Nation Outcome Scales for Children and Adolescents was then completed for three clinical vignettes previously developed by senior staff unfamiliar with HoNOSCA. The vignettes focused on anxiety and depression, oppositional and concentration difficulties, and on suicidality and perceptual disturbances.
The remaining six clinicians received training later. As they did not complete the vignettes before discussing them with other staff, their vignette ratings were excluded from the reliability estimates.
Results: reliability analysis of vignette data
Intraclass correlations (ICC) were used to estimate interrater reliability using a two-way random effects model with interaction [18]. Intraclass correlations can be calculated in different ways [18, 19] and each model is listed in Table 1. The appropriate estimate is that relevant to the intended purpose to which the scores will be applied and the accompanying parameter assumptions.
Table 1. Intraclass correlation estimates of reliability
The ICC using a model based on the level of absolute agreement between different judges (i.e. the 24 clinicians) is shown in column 2 of Table 1; that is, whether judges are using the scale in the same direction and with the same anchor points. This ICC is critical if individual clinicians' scores are compared with each other. An absolute model considers the variance due to both vignettes and judges as relevant.
Table 1 (column 4) shows the ICC for a consistency model where absolute anchor points are irrelevant. The question answered here is how consistently each judge uses HoNOSCA and this ICC is relevant for comparisons within a clinician's caseload, or where aggregated percentage change is relevant.
The total score ICC is 0.72, suggesting HoNOSCA is being used with a good degree of consistency. However, if HoNOSCA were to be used with any reference to the absolute value of the total score, its reliability is 0.52 and hence moderate. Nine of the 13 scales have good or very good reliability. Peer relationships was an unreliable scale, and conclusions based solely on that scale would be unwarranted. The reliability descriptions are based on Landis and Koch (cited in [20]).
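For readers wishing to reproduce this kind of analysis, the sketch below computes single-rater ICCs for one scale from a vignettes-by-judges matrix of scores, using a standard two-way random-effects formulation based on the mean squares of the two-way layout. It is a minimal illustration rather than the exact procedure used in this study; the function name and the example matrix are hypothetical.

```python
import numpy as np

def icc_two_way_random(ratings: np.ndarray) -> tuple[float, float]:
    """Single-rater ICCs from an (n_targets x k_raters) matrix of scores.

    Rows are rating targets (e.g. vignettes), columns are judges.
    Returns (absolute_agreement, consistency) under a two-way random-effects model.
    """
    n, k = ratings.shape
    grand_mean = ratings.mean()
    row_means = ratings.mean(axis=1)   # per-vignette means
    col_means = ratings.mean(axis=0)   # per-judge means

    ss_rows = k * ((row_means - grand_mean) ** 2).sum()
    ss_cols = n * ((col_means - grand_mean) ** 2).sum()
    ss_error = ((ratings - grand_mean) ** 2).sum() - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Absolute agreement: variance between judges counts against reliability.
    icc_absolute = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    # Consistency: only the relative ordering of targets matters.
    icc_consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    return icc_absolute, icc_consistency

# Hypothetical scores for one scale: 3 vignettes (rows) rated by 4 judges (columns).
example = np.array([[2, 3, 2, 3],
                    [0, 1, 0, 1],
                    [4, 4, 3, 4]])
print(icc_two_way_random(example))
```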
Patient study
Procedure
The study included all new patients over a 7-month period and those who had entered the service in the preceding 3 months. For new patients, clinicians completed HoNOSCA at assessment and, where possible, at 3 and 6 months. For existing patients, HoNOSCA was completed at 3 and 6 months. A discharge HoNOSCA was completed if discharge occurred. When HoNOSCA was completed a second time, clinicians also completed a seven-point global rating of perceived patient change. The seven points were ‘much worse’, ‘worse’, ‘slightly worse’, ‘no change’, ‘slightly better’, ‘better’ and ‘much better’. Throughout this paper, time 1 refers to the first time HoNOSCA was completed for a patient, and time 2 refers to the second occasion for that patient (i.e. 3 months after the first rating, or at discharge if this occurred earlier than 3 months). Although feedback is considered important [21], little is known about the impact of feedback on outcome measures. Consequently, clinicians completed the time 2 rating without knowledge of the previous rating.
Assessment ratings
The gender distribution of patients for whom an assessment rating was obtained (59% male, 41% female) was similar to that of CAMHS during the same year (61% male, 39% female). Ages ranged from 3 to 20 years with a mean of 11.1 (CAMHS = 11.6); the modal age was 14, followed by 9 years. Forty-nine per cent of the sample were under 12 (CAMHS = 45%). Girls were concentrated in the adolescent group.
The most frequent diagnoses recorded by treating clinicians were: attention deficit disorders (9.5%), adjustment disorder (14.8%), conduct disorders (14.3%), anxiety disorders (11.9%), social/relationship problems (11.4%), mood disorders (11.0%), developmental disorders (9.5%), somatic (5.3%), eating disorders (3.3%) and personality problems (2.4%).
The Health of the Nation Outcome Scales for Children and Adolescents was completed by 30 clinicians. Psychologists completed the most (34%), followed by psychiatrists (20%), occupational therapists (16%), social workers (9%), psychiatric nurses (9%) and registrars/medical officers (9%).
Analysis of the patient sample focuses first on the 305 assessment ratings. The mean HoNOSCA total score was 13.11 (SD = 6.3).
Table 2. HoNOSCA mean (SD) scale and total scores by gender for ratings completed at assessment
Mean total scores were 13.56 (SD = 6.76), 11.11 (SD = 5.15) and 15.21 (SD = 6.66) for the under-5, 5–12 and over-12 age groups respectively (F = 15.77, df = 2,302, p < 0.001). The mean total scores are high enough to suggest that HoNOSCA can be used with all three age groups.
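As an illustration of this age-group comparison, the sketch below runs a one-way analysis of variance on simulated total scores drawn around the reported group means and standard deviations; the group sizes and the data themselves are invented for demonstration and will not reproduce the reported F value.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Simulated total scores around the reported group means/SDs; group sizes are arbitrary.
under_5 = rng.normal(13.56, 6.76, size=40)
five_to_12 = rng.normal(11.11, 5.15, size=110)
over_12 = rng.normal(15.21, 6.66, size=155)

f_stat, p_value = f_oneway(under_5, five_to_12, over_12)
print(f"F = {f_stat:.2f}, p = {p_value:.3g}")
```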
Multivariate analysis of variance of the 13 scales revealed significant differences by age group (F = 6.65, df = 26,482, p < 0.001). Only the disruptive and non-organic somatic scales did not differ. Adolescents scored highest on self-injury, substance use, hallucinations and school attendance. The under-5 group scored highest on self-care, scholastic and concentration/overactivity (Table 3). These patterns are consistent with clinical expectations.
Table 3. HoNOSCA mean (SD) scale and total scores by age group for ratings completed at assessment
Patient paired ratings
Responsiveness to change is important for evaluative instruments and should not be confused with reliability [22]. The average age of patients with two ratings (11.1 years) was the same as that of patients with only one assessment rating, and the proportion of males (56%) was similar. There was an average of 95 days between the paired ratings, with a range of 67 to 156 days.
A two-way ANOVA revealed a significant interaction between clinician perception of change and time of HoNOSCA administration (F = 15.26, df = 3,137, p < 0.001). The source of the difference lies in the time 2 HoNOSCA scores rather than in the time 1 scores.
A significant reduction in total score, from a mean of 13.04 (SD = 5.98) to 10.16 (SD = 5.92), was found across the paired time 1 and time 2 scores (F = 58.77, df = 1,158, p < 0.001). Interpretation of this is limited in that there is no independent measure of whether patients were ‘really’ changing. Within routine treatment, discharge is a broad indicator of positive change. A comparison of total scores between those discharged and those still in treatment revealed a non-significant trend (F = 3.23, df = 1,70, p = 0.077). For those patients discharged by time 2, total scores had decreased significantly from 12.17 (SD = 5.15) to 8.63 (SD = 5.03; F = 16.03, df = 1,29, p < 0.001). In percentage terms, this group improved by approximately 30%.
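The percentage figures quoted here and in the Discussion follow directly from the reported means; a minimal check, using a helper name of our own, is shown below.

```python
def percent_improvement(time1_mean: float, time2_mean: float) -> float:
    """Percentage reduction in mean HoNOSCA total score from time 1 to time 2."""
    return 100 * (time1_mean - time2_mean) / time1_mean

print(round(percent_improvement(13.04, 10.16)))  # all paired ratings: ~22%
print(round(percent_improvement(12.17, 8.63)))   # discharged by time 2: ~29%, i.e. approximately 30%
```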
Severity
Using all 617 ratings, the question of whether the total score is meaningful was investigated by examining the relationship between the pattern of scale scores and the total score. For example, it is theoretically possible to achieve a total of 12 from three scales scored 4, from 12 scales scored 1, and so on; this would make interpretation of the total score problematic. In practice, however, the number of high scores across the 13 scales was related to the total score, and there was a significant correlation between the total score and the number of high scale scores (Spearman's rho = −0.77, p < 0.001). This may suggest interpreting the HoNOSCA total score as an index of severity. A high total score is more likely to reflect a few scales having high scores rather than many scales having mild to moderate scores.
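The analysis described above can be sketched as follows. Note that the threshold used to count a scale score as ‘high’ (here 3 or above) is our assumption for illustration and is not specified in the text, so the sketch should not be expected to reproduce the reported coefficient exactly.

```python
import numpy as np
from scipy.stats import spearmanr

def severity_breadth_relation(ratings: np.ndarray, high_cutoff: int = 3):
    """Correlate each rating's total score with its count of 'high' scale scores.

    `ratings` is an (n_ratings x 13) array of scale scores (0-4). A scale counts
    as 'high' when scored at `high_cutoff` or above (an assumed threshold).
    Returns the Spearman correlation and its p-value.
    """
    totals = ratings.sum(axis=1)
    n_high = (ratings >= high_cutoff).sum(axis=1)
    return spearmanr(totals, n_high)

# Hypothetical data: 200 ratings on the 13 scales, for demonstration only.
rng = np.random.default_rng(1)
demo = rng.integers(0, 5, size=(200, 13))
print(severity_breadth_relation(demo))
```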
Discussion
Reliability estimates indicate the purposes for which the instrument can be used with differing levels of confidence, and the appropriate estimate depends on the intended use of the data [23]. Reliability of the HoNOSCA total score appears good under a consistency model and moderate under an absolute model. Using HoNOSCA, for example, to establish cut-off points, or to compare clinicians or services, involves absolute scores and hence moderate reliability. Absolute scores may be less relevant where, for example, individual clinicians focus on their own caseload to prioritize cases for review, reflect on strengths and weaknesses, and make their perceptions of symptoms explicit to facilitate collaboration with patients and colleagues. Clinicians can draw more reliable conclusions about relative severity and rate of change within their own caseload than should be drawn about different patients of different clinicians. Both absolute and consistency uses of HoNOSCA are legitimate.
Gowers et al. [13] reported intraclass correlations for three judges scoring 20 case presentations, ranging from not calculated (substance use), through 0.63 (family relationships), to 0.98 (school attendance). Estimates in this study, with the exception of hallucinations, are all lower. Judges in that study were present at the same case presentations and could question the presenter [Gowers S: personal communication, 1998]. Experienced presenters could ensure they presented information directly relevant to HoNOSCA and specified the 2-week time period. These factors may contribute to a generous estimate of reliability, as judges and presenters can mutually tailor the information presented to maximize agreement. The current study challenged reliability by providing clinicians with limited vignette information and no more training than was considered sustainable in a CAMHS. Reliability appears robust under these circumstances.
No instrument has an inherent level of reliability [24]. While the results of this study and Gowers et al. [13] contribute to a picture of HoNOSCA having acceptable reliability under different conditions with different populations, further investigation would be prudent.
This study explored some properties of HoNOSCA when used under normal clinical conditions. There was an understandable relationship with age. Adolescents had the highest scores on self-injury, substance use, hallucinations and school attendance while the preschool group showed the greatest difficulties with concentration. The size of the mean total scores suggests that improvement or deterioration could be assessed for each age group. Boys scored higher on scales relating to externalizing behaviours (e.g. disruptive, concentration/ overactivity) while girls scored highest on emotional symptoms, self-injury, hallucinations and substance use. Neither gender had a significantly higher total score suggesting that while the pattern varied, the total amount of problems was perceived to be similar. Change scores were related to global impressions of change and a strong trend of improvement was evident when the score at assessment was compared with the score at discharge. The total score appears to be a measure of severity. This is important given both the use of total scores for reporting results, and the difficulty the professions have in agreeing about a continuum of severity.
Gowers et al. [13] noted an average improvement of 38%, compared with 22% in the present study over a similar time frame. The presence of ongoing patients in the present study may have diminished the rated improvement. It is equally possible that more effective treatments are being delivered, or a different population is being treated, in the UK.
Although the results encourage further use and exploration of HoNOSCA, the present study has some limitations. First, no measure of concurrent validity was used, which leaves open the question of how real the reported change scores are. Second, while the reliability estimates are moderate to good, they are based on vignette material and different estimates may occur with real cases; the difficulty with real cases is ensuring judges have equivalent information. Third, diagnoses were made by clinicians in the course of their usual practice and the accuracy of these diagnoses is not independently known. Other studies are addressing these limitations.
A major impediment to data interpretation where routine outcome measurement is based on returns from patients is a low response rate. Some 40–60% of families starting therapy terminate it prematurely [25]. While consumer data are an important measure of outcome [3], clinicians can provide both a better return rate and data on those who discontinue.
Routine outcome measurement is relatively new and few services routinely assess outcome with all patients [3]. Clinicians, carers and patients have different domains of expertise, information, values and frameworks, and their reports will reflect these differences [12]. However, there is always a cost involved in collecting and analysing outcome data, and this cost can be substantial [26, 27]. The more informants and domains covered by an outcome system, the more expensive it is likely to be. Marks [25] estimated that even a simple outcome measurement system can consume 10% of a clinician's time. Not surprisingly, brevity has been proposed as a key feature of a feasible outcome measure [28]. Although the feasibility and clinical use of the scores are still under investigation, HoNOSCA appears to meet these requirements.
Outcome measures are a necessary component of continuous quality improvement (CQI). Continuous quality improvement has often focused on processes (e.g. reviews, procedures and practices) without demonstrating whether improved processes lead to better outcomes. Bickman et al. issue a salutary warning: ‘… without definable clinical procedures, outcome data alone are unlikely to enhance services’ [12, p.25]. However, it is equally true that without measured outcomes, the outcomes of defined procedures will remain unknown. Progress in this field is likely to require both a routine outcome infrastructure and defined procedures.
In conclusion, HoNOSCA has proved to be an instrument worthy of use and further investigation. It has assisted in creating a culture of outcome measurement and in demonstrating that routine outcome measurement in CAMHS is possible. The contextual importance of this study is that it initiated an outcome infrastructure in a clinical setting. Maroondah Hospital CAMHS has moved on to the next stage by introducing an outcome measure for carers and adolescents, the Strengths and Difficulties Questionnaire [29]. With such an infrastructure in place, it is possible to hypothesize and examine which practices do, and do not, lead to better mental health in the real clinical context [30]. This may generate answers in the setting where they are most keenly sought: answers for this setting, with these resources, with these clinicians and with these patients.
Acknowledgements
Thanks to the staff of Maroondah CAMHS for engaging in the organizational change inherent in this study. Peter Birleson, Helen Mildred, Jan Costin, Lisa Wong, Heather Willsher, Tom Trauer and Jenny Wilkins contributed to the systems, data and interpretation underlying this paper. The genesis of this work was supported by the Mental Health Branch of the Department of Human Services, Victoria.
