Abstract
The literature that is relevant to evaluation of treatment effectiveness is large, scattered and difficult to assemble for appraisal. This scoping review first develops a conceptual framework to help organize the field and, second, uses the framework to appraise early psychosis intervention (EPI) studies. Literature searches were used to identify representative study designs, which were then sorted according to evaluation approach. The groupings provided a conceptual framework upon which a map of the field could be drawn. Key words were cross-checked against definitions in dictionaries of scientific terms and the National Library of Medicine Medical Subject Headings (MeSH) browser. Using the final list of key words as search terms, the EPI evaluation literature was appraised. Experimental studies could be grouped into two classes: efficacy and effectiveness randomized controlled trials. Non-experimental studies could be subgrouped into at least four overlapping categories: clinical epidemiological studies; health service evaluations; quality assurance studies; and quasi-experimental assessments of treatment effects. Applying this framework to appraise EPI studies indicated promising evidence for the effectiveness of EPI irrespective of study design type, and a clearer picture of where future evaluation efforts should be focused. Reliance on clinical trials alone will restrict the type of information that can inform clinical practice. There is convergent evidence for the benefits of specialized EPI service functions across a range of study designs. Greater investment should be made in health services research and quality assurance approaches to evaluating EPI effectiveness, which will involve scaling up of study sizes and development of an EPI programme fidelity rating template. The degree of complexity of the evaluation field suggests that greater focus on research methodology in the training of Australasian psychiatrists is urgently needed.
Naturalistic study evidence supports the importance of intervention in the early stages of psychotic disorders. Early disease course is the strongest predictor of long-term outcome for psychotic disorders [1]. Remission of the first psychotic episode and avoidance of relapse during the first 2 years of treatment may reduce long-term disability associated with schizophrenia by up to 30% [2]. Conversely, the chances of full recovery decline with each relapse [3], while the likelihood of relapse increases over time [4]. Antipsychotic drugs reduce the risk of relapse in first-episode schizophrenia [5–7] and treatment discontinuation results in high relapse rates [4,8]. More patients respond well to medication in the first episode (approx. 80%) [9–13] compared to subsequent episodes (approx. 50%) [14]; initial response occurs at lower doses of medication [15,16]; and symptoms appear to improve to a greater extent in first-episode patients (e.g. >60%) [17], compared with multi-episode patients (e.g. <16%) [18]. First-episode patients also respond to psychosocial treatments in lower ‘doses’, as illustrated by motivational interventions for comorbid substance use in early psychosis, in which a 3 h intervention achieved substantial effects in most patients [19] but protracted intensive programmes in chronic schizophrenia have only a modest effect, and are applicable to only a minority of patients [20–22]. Findings that shorter duration of (initially) untreated psychosis (DUP) predicts better treatment outcomes [23,24] are also consistent with the potential effectiveness of early detection (ED) and intervention strategies.
The strength of this circumstantial evidence (i.e. not from formal evaluation studies) in support of early psychosis intervention (EPI) [25] might suggest that it would be easy to demonstrate the efficacy of EPI in clinical trials (CTs), but this has not been the case. Indeed, the Cochrane Review concluded that there was insufficient or no evidence from randomized controlled trials (RCTs) to support the benefits of ED or specialist EPI teams for first-episode psychosis [25]. That Cochrane Review is a striking illustration of how few high-quality RCTs of complex mental health interventions are carried out in general. It also raises the question of whether randomized trials are appropriate in the context of EPI, because individuals from the same population cannot be randomly exposed to an ED strategy, nor is it ethically appropriate to allocate half of a large group of patients with first-episode psychosis to substantially delayed treatment. If research evidence is to keep pace with increasing demands for better patient outcomes, observational evaluations of intervention effectiveness may have to substitute for, rather than complement, efficacy data from CTs.
There are numerous non-experimental approaches to evaluating treatment effectiveness, but the literature is scattered across methodological domains and is difficult to assemble and appraise. Subtle differences in study designs, even within the CT literature, create difficulty in evaluating the evidence, especially for complex interventions. Also, the sheer volume of observational studies being published, often with varying terminology, leads to problems in simply sorting articles by design or theoretical orientation in the absence of an evaluation framework. Existing evaluation frameworks or evidence hierarchies rank the level (based on study design) and quality (methods used to minimize bias) of evidence [26]. Their focus is on experimental methods as the gold standard [27], with little regard for how the evaluation field is theoretically organized or how observational studies are subclassified. In this review, the aim was to develop a conceptual framework, reflecting how the published literature is organized, that facilitates searching, organizing, and appraising intervention effectiveness studies across the entire spectrum of experimental and observational methods, and to apply this framework to an appraisal of the EPI literature.
Methods
Exploratory database searching (Medline Ovid; PsycINFO; Web of Science) of the early psychosis literature (using search terms: early psychosis OR early schizo∗ OR first-episode psychosis OR first-episode schizo∗) was carried out initially using general key words (evaluat∗ OR effective∗). These searches resulted in large numbers of publications reporting a mixture of study designs. Reference lists in these articles showed that the evaluation of EPI drew on a broader literature in mental health. When this broader mental health literature was searched (using key words: psychiat∗ OR psychol∗ OR mental), references listed in those articles identified additional relevant evaluation subspecialties in the general medical and programme evaluation literature. That is, exploratory searching resulted in several thousand reports of evaluations of intervention effectiveness that arose from fields with different theoretical orientations and used study designs that did not lend themselves readily to classification. Hence we decided to adopt the following scoping approach [28] to our synthesis of the literature.
Publications were sorted into groups according to study design or evaluation approach. After culling low-interest articles, two senior authors independently grouped publications according to methodology. The common design characteristic of studies in each group was consensually agreed upon and cross-checked for consistency with the US National Library of Medicine Medical Subject Headings (MeSH) browser (www.nlm.nih.gov/mesh) and against definitions in authoritative dictionaries and encyclopaedias [29–31]. The final list of checked terms describing evaluation approaches was then conceptually sorted into a logical hierarchy in order to determine the most parsimonious conceptual map of the intervention evaluation field, starting with the two broadest study design classes, experimental (CTs) and observational (non-randomized).
Two sorting principles emerged as the most parsimonious to chart the field of evaluation [32] and organize the conceptual map. First, the framework was ordered from left to right along an efficacy versus effectiveness evaluation spectrum. Efficacy refers to whether a treatment can work under ideal conditions. Effectiveness refers to whether a treatment works in routine clinical settings, and which service model delivers an efficacious treatment most effectively. That is, effectiveness encompasses service provider competence, patient adherence and disease coverage, in addition to efficacy of an intervention. Second, observational evaluation subfields were ordered left to right according to the primary focus or unit of analysis: population level, service level, process level (healthcare quality and clinical practice effectiveness), and patient level (which type of patient responds to what treatment programme). Neither sorting principle could be strictly applied because study design types did not always precisely fit either dimension. For instance, health services research and economic evaluation can be based on efficacy or effectiveness designs. Also, in ordering designs according to unit of analysis there was a degree of arbitrariness because all designs ultimately use patient-level data, and it is only the perspective of the researcher that determines whether the primary focus is at the level of the population, service, practice, or patient. When it was difficult to decide where a study design should be located in the evaluation field map, the reliability of the design was used as a secondary ordering principle. The greater the chance of confounding and bias, and the lesser the opportunity to assess their likely effect on results, the lower the design was ranked for reliability. More reliable designs were placed to the left, nearer the efficacy end of the map, and listed higher in the column hierarchies, creating a top-left (efficacy RCT) to bottom-right (case studies) gradient of robustness of study design.
The conceptual framework generated by this method is shown in Figure 1 and each component further detailed in Table 1. Comprehensive coverage of the evaluation literature drew heavily on manual searches of reference lists, and relied on referencing a large number of books and technical reports.

Figure 1. Treatment and service evaluation. CPG, clinical practice guidelines; RCT, randomized controlled trial; TQM, total quality management.
Table 1. General descriptions of the evaluation approaches used to assess intervention effectiveness, with notes of relevance to the mental health field. AR, action research; CPG, clinical practice guideline; DRG, diagnosis-related group; EPI, early psychosis intervention; HoNOS, Health of the Nation Outcome Scales; QA, quality assurance; QI, quality improvement; RCA, root cause analysis; RCT, randomized controlled trial; TAU, treatment as usual; TQM, total quality management; WHO, World Health Organization.
We became aware of some fields of literature (e.g. evidence mapping) [32,176,177] only by internet searches (e.g. Centre for Reviews and Dissemination, www.york.ac.uk/inst/crd; Global Evidence Mapping Initiative, www.evidencemap.org). The framework shows that treatment and service evaluation draws upon diverse theoretical traditions. CTs (randomized) are primarily rooted in the evidence-based medicine literature [134,178–180], while observational (non-randomized) designs appear to encompass: (i) population-based case-control and cohort studies based in clinical epidemiology and public health fields [181,182]; (ii) service model evaluation from health services research [183,184], programme evaluation [94,95,185], and economic evaluation [99,100,103]; (iii) quality assurance (QA) studies with their origins in the management [126,127] and audit literature [110]; and (iv) quasi-experimental assessments of treatment effects from the social sciences [64]. Each of the main evaluation fields contains a number of distinct subfields (Figure 1).
Using terms listed in the conceptual map one at a time, the EPI literature was searched again (limitations: English language; years, 1970–2008). The three databases (Web of Science, Medline and PsycINFO) produced different but overlapping listings. After grouping publications according to our conceptual framework, leading examples of evaluation methodologies (studies highly cited or using a design of high relevance) were selected for inclusion in the scoping review that appears in the following section. In accordance with scoping principles [186], breadth and comprehensiveness were considered more important than depth and study quality.
Results
Overview of evaluation approaches and an appraisal of the evidence for the effectiveness of EPI
This section summarizes the different intervention evaluation designs and illustrates their limitations and advantages by appraising representative studies from the EPI literature. Studies will be reviewed under the two broadest subheadings: experimental (randomized CTs) and observational (non-randomized designs).
Experimental (randomized controlled, or clinical, trials) study designs
Two types of CTs can be distinguished: efficacy and effectiveness RCTs (Table 1; Figure 1). In relation to EPI, there are a number of published efficacy RCTs of antipsychotic medication in early psychosis patients [187–191], and prodromal patients [192,193]. Taken together, these studies indicate that early psychosis patients (i) require lower doses of antipsychotic drugs and are more sensitive to both extrapyramidal [194,195] and metabolic [196] side-effects than multi-episode patients; and (ii) show, less consistently, a modest advantage in favour of the use of second-generation drugs in acute [188] and maintenance treatment [190,197]. There is little evidence of advantage in using clozapine as a first-line treatment in these patients [191], unless treatment resistance is evident [198].
Although the evaluation of a single treatment element, such as a new medication, lends itself well to efficacy RCT designs, the evaluation of multi-component complex interventions, such as EPI, is problematic, especially because clinicians and researchers may not have fully defined and developed the intervention [51]. The evaluation of complex interventions lends itself well to cluster RCT designs (Table 1) [55,56], in which randomization occurs at the level of the service and not the individual patient. There are two published EPI cluster RCTs. One randomized general practitioner (GP) practices (thereby clustering patients by practice) to GP education and access to an early assessment team or treatment as usual (TAU), and showed that ED procedures resulted in higher patient referral rates but did not reduce DUP overall [199]. The other cluster RCT also involves evaluating education of GPs in detection of first-episode psychosis [200], with final results yet to be reported. While economic evaluation and evaluating the delivery and organization of health services are most reliably carried out using RCT designs, the bulk of the literature on service systems and economic evaluation in mental health is non-experimental, and hence this subfield will be mainly covered under observational study designs.
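To make concrete why cluster randomization demands larger samples, the following minimal sketch applies the standard design-effect formula, 1 + (m − 1) × ICC, where m is the mean cluster size and ICC the intracluster correlation; all figures are illustrative and not drawn from the trials cited above.

```python
# Design effect for a cluster RCT; hypothetical figures only.
def design_effect(mean_cluster_size: float, icc: float) -> float:
    return 1 + (mean_cluster_size - 1) * icc

n_individual = 200      # patients needed under individual randomization
m, icc = 20, 0.05       # assumed GP-practice cluster size and ICC
deff = design_effect(m, icc)
print(f"Design effect = {deff:.2f}; "
      f"cluster trial needs ~{round(n_individual * deff)} patients")
# -> Design effect = 1.95; cluster trial needs ~390 patients
```

Even a modest intracluster correlation nearly doubles the required sample in this example, one reason cluster RCTs of complex interventions are costly and rare.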
In the first quantitative review of efficacy RCTs for EPI as a complex intervention, Marshall and Rathbone identified 65 candidate studies, 58 of which were excluded for methodological reasons [25]. Meta-analysis (Table 1) of the seven eligible studies was not possible because of insufficient comparability across studies of the types of EPI or control services being trialled. (A meta-analysis of EPI evaluation studies has more recently been published, but that review combined RCTs with non-randomized studies and hence its results are difficult to interpret [201].) An important point made by Marshall and Rathbone was that the complexity of EPI makes it likely that no two specialized teams will be identical, drawing attention to the urgent need for the development of a validated EPI programme fidelity rating template that can reliably measure the extent to which individual components of EPI are represented in programmes under evaluation. Although approaches to the implementation of EPI have been described [202], most evaluations of EPI have not applied the degree of programme development and description recommended for complex interventions [52]. The importance of precise programme specification is illustrated by comparing the findings of a Norwegian RCT of EPI [203] with those of the Danish intensive EPI programme (OPUS) study [204,205]. Although both studies named their EPI service model ‘integrated treatment’, very different outcomes were reported: one study found a large effect in favour of EPI at 2 year follow up [203], while the other found an inconclusive effect in favour of EPI at 2 year follow up using un-blinded assessment [204], and no apparent efficacy advantage for EPI at 5 year follow up using blinded assessments [205]. When programme descriptions for integrated treatment are compared across the two studies, differences are apparent, including programme duration (18 months in the Danish study and 2 years in the Norwegian study).
In contrast to efficacy RCTs, an effectiveness RCT aims to provide results that are more readily applicable to real-world treatment settings [206,207]. Their characteristics are listed in Table 1. Two types of effectiveness RCTs can be distinguished (Table 1): practical CTs and pragmatic CTs [208]. The Clinical Antipsychotic Trials of Intervention Effectiveness (CATIE) study is illustrative of the strengths and limitations of the practical CT [18]. CATIE's strengths were: recruitment of a very large sample (n=1493); its duration (18 months); blind assessment; and the use of a simple primary efficacy measure, all-cause treatment discontinuation. Its limitations included potential bias [209], failing to genuinely reflect real-world practice [210], a high dropout rate (the overall discontinuation rate was 74% within 18 months) and a high cost (in the order of AU$50m). Notably, CATIE found no effectiveness advantage for second-generation antipsychotics (SGAs) compared to first-generation antipsychotics (apart from clozapine), in contrast to meta-analysed efficacy RCT data from short-term studies showing a modest efficacy advantage for at least some SGAs [211,212]. Loss of efficacy in effectiveness trials is mainly accounted for by reduced patient adherence or clinician competence. In CATIE, patient adherence was not monitored stringently, and if the alarming rates of untreated physical comorbidity are any indication [213], there are questions to be raised about clinician competence as well. Thus, a single practical CT on the scale of CATIE does not provide unequivocal answers even about one element of treatment, such as antipsychotic drug treatment.
So far, the only practical CT involving early psychosis patients is the Comparison of Atypicals in First-Episode Psychosis (CAFÉ) study [214]. This double-blind multi-site RCT randomized 400 early psychosis patients to daily divided doses of olanzapine, quetiapine or risperidone and followed them for 12 months. Like CATIE, all-cause treatment discontinuation was the primary outcome measure. By 12 months, treatment discontinuation rates were approximately 70% and no drug differences were identified. That study highlights the higher rates of dissatisfaction with first-line antipsychotic drugs in early psychosis patients compared to CATIE, in which similar discontinuation rates took 18 months to occur. CAFÉ also confirmed appropriate maintenance dosing levels for early psychosis patients, for whom modal doses were 10 mg olanzapine, 500 mg quetiapine, and 2 mg risperidone (with only 11% of patients on risperidone reaching the maximum allowable dose of 4 mg). Again, extremely high discontinuation rates raise the question of whether CAFÉ truly mirrored routine practice. Given the burdensome secondary effectiveness assessments (e.g. Positive and Negative Syndrome Scale (PANSS) and Quality of Life ratings) and twice-daily dosing, these procedures may have contributed to the high discontinuation rates and blunted the ability of the primary effectiveness measure to distinguish different treatments.
The alternative type of effectiveness RCT is called the pragmatic CT, designed to be extremely simple and flexible (e.g. use un-blinded assessment), so that a very large number of patients can be recruited quickly [215,216]. Their characteristics are described in Table 1. The first application of the pragmatic CT design to EPI, the European First-episode Schizophrenia Trial (EUFEST) [17], became highly controversial [217]. Fifty centres located in 13 European countries and Israel participated in EUFEST, which recruited 498 first-episode schizophrenia spectrum patients out of 1047 patients assessed for eligibility. Patients were randomly allocated to haloperidol (1–4 mg daily), amisulpride (200–800 mg daily), olanzapine (5–20 mg daily) or quetiapine (200–750 mg daily). The primary effectiveness measure was all-cause discontinuation within the first 12 months of treatment. Because EUFEST was a pragmatic CT aiming to reflect routine clinical practice, treating clinicians were not blinded to treatment, and after training and calibration they carried out the effectiveness and safety assessments un-blinded, which included deciding whether and when the patient had discontinued treatment as well as rating the PANSS. These design features of the EUFEST pragmatic CT seemed unremarkable until the results of the study were known. At 12 months follow up EUFEST found that the percentage of patients who discontinued treatment for any cause was 40% for amisulpride, 33% for olanzapine, 53% for quetiapine, whereas for haloperidol it was a significantly higher 72% [17]. At 12 months follow up, the key secondary effectiveness measure (PANSS ratings) did not distinguish the different drug treatments.
Why did the EUFEST study demonstrate discontinuation rates for haloperidol comparable to CATIE yet find such low rates of discontinuation with SGAs? Some have attributed that study outcome to bias related to un-blinded assessments by clinicians who may have favoured use of SGAs in early psychosis patients [217], a criticism that the EUFEST group has rebutted (figure 4) [17]. Although this controversy illustrates an important limitation of the pragmatic CT design, perhaps the most remarkable finding from the EUFEST study of relevance to appraising the effectiveness of EPI was the extremely favourable symptom response to all first-line drug treatments: more than 60% reduction in total PANSS ratings at 12 months follow up.
Observational (non-randomized) study designs
The argument for shifting away from sole reliance on RCT designs in assessing the strength of evidence for intervention effectiveness assumes that non-experimental studies can achieve equivalent scientific rigour. It is therefore essential to appreciate the limitations of the different types of observational designs, the sources of bias and confounding, and the methods to minimize their effects. We now describe these issues in some detail for the non-specialist.
When experimental studies are not ethical or not feasible, the effects of treatments are examined in observational studies in which the researcher does not control treatment conditions and simply observes (measures) outcomes and their associations with treatment exposure. Without randomization to ensure that comparable groups are contrasted under competing treatments, observational study designs that directly assess treatment effects are prone to selection bias (pre-intervention patient differences across groups that affect outcome). This bias results from doctor- or patient-determined treatment group allocation. Although attempts can be made to control for known (observed and measured) confounding variables that cause overt bias (e.g. by pairwise matching or subgroup stratification of comparison groups, or by statistical covariance adjustments of differences in baseline patient characteristics), these procedures do not control for unknown (not observed and not measured) baseline confounders that could result in hidden bias.
Among the many types of bias and confounding that can affect observational treatment evaluation studies is the important concept of ‘confounding by indication’: bias associated with treatment allocation according to whether treatment is indicated. For example, the treatment of patients with chronic treatment resistance at baseline may include (have an indication for) intensive family intervention, while patients with a history of brief acute illness at baseline may not require intensive family intervention (because it is not indicated). If the two patient groups are compared, a spurious association between intensive family intervention and poorer patient outcome may be found at follow up. Subgroups of patients may be matched on a single obvious baseline confounder (called a covariate), such as pre-treatment disease chronicity. This approach becomes impractical when there are many covariates, as is the case with psychotic patient samples, and multivariate approaches to matching are then required. An example of multivariate matching uses ‘propensity scores’. The propensity (to be selected into one treatment group or the other) score [218,219] is estimated using logistic regression to combine into a single score all measured covariates that predict the binary category, treatment versus control group; the score is then used to statistically match treatment groups (after its effectiveness in matching groups is checked). Propensity scores do not control for unknown confounders and some argue that they add little to simpler multivariate approaches to matching [220,221], although they may be useful for stratifying a sample to explore dose-response relationships [222] or interactions between confounding covariates [221].
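As an illustration of the propensity-score approach, the following minimal sketch (synthetic data; hypothetical covariate names) estimates propensity scores by logistic regression and stratifies the sample into quintiles so that covariate balance can be checked before outcomes are compared.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Synthetic patient-level data; covariate names are hypothetical.
df = pd.DataFrame({
    "age": rng.normal(25, 5, n),
    "chronicity": rng.normal(0, 1, n),
    "baseline_symptoms": rng.normal(50, 10, n),
})
# Allocation depends on chronicity: confounding by indication.
df["treated"] = rng.binomial(1, 1 / (1 + np.exp(-df["chronicity"])))

covariates = ["age", "chronicity", "baseline_symptoms"]
# 1. Estimate each patient's propensity to receive treatment.
model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
df["propensity"] = model.predict_proba(df[covariates])[:, 1]
# 2. Stratify the sample into propensity-score quintiles.
df["stratum"] = pd.qcut(df["propensity"], q=5, labels=False)
# 3. Check balance: within each stratum, covariate means should be
#    similar across treatment groups before outcomes are compared.
print(df.groupby(["stratum", "treated"])[covariates].mean())
```

As the text notes, this balances groups only on measured covariates; unknown confounders remain uncontrolled.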
A more robust approach to addressing confounding by indication uses an ‘instrumental variable’, defined as a factor that strongly affects the likelihood of being exposed to treatment but does not affect the outcome of that treatment [223,224]. Randomization itself functions as a binary ‘instrumental variable’, determining who is and is not treated without influencing treatment outcome. For example, in an observational evaluation of specialist cardiology treatment, Stukel et al. created an instrumental variable by dividing the total patient sample equally into those living a long distance away from a specialist hospital, versus those living close to a specialist hospital [225]. The distance between a patient's residential address and a specialist hospital was shown to strongly affect the likelihood of receiving specialist treatment without directly affecting treatment outcome, thereby satisfying the definition for an instrumental variable [226]. This instrumental variable may also be applicable to mental health evaluation [227].
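A minimal sketch of a two-stage least-squares instrumental variable analysis follows, using synthetic data in the spirit of the distance example; the instrument affects treatment uptake but not outcome, so the second-stage slope recovers the treatment effect despite a hidden confounder.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 2000
hidden = rng.normal(size=n)        # unmeasured confounder
near = rng.binomial(1, 0.5, n)     # instrument: lives near specialist centre
# Treatment uptake depends on the instrument and the hidden confounder...
treated = rng.binomial(1, 1 / (1 + np.exp(-(1.5 * near - hidden))))
# ...but outcome depends only on treatment and the hidden confounder.
outcome = 2.0 * treated + hidden + rng.normal(size=n)

# Stage 1: predict treatment from the instrument alone.
stage1 = sm.OLS(treated, sm.add_constant(near)).fit()
# Stage 2: regress outcome on predicted treatment; the slope estimates
# the treatment effect purged of the hidden confounding.
stage2 = sm.OLS(outcome, sm.add_constant(stage1.fittedvalues)).fit()
print(stage2.params)   # slope close to the true effect of 2.0
# The naive comparison is biased because 'hidden' lowers treatment
# uptake but raises the outcome:
print(outcome[treated == 1].mean() - outcome[treated == 0].mean())
```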
Because most observational studies of treatment measure outcomes repeatedly over time in response to treatment exposure, three methodological problems are introduced. First, longitudinal interventional data are correlated (clustered) observations repeated through time on the same patient or in the same setting. Second, longitudinal data are often highly unbalanced, in the sense that an equal number of measurements tends not to be available for all subjects and/or measurements are not taken at fixed time points; such data cannot be analysed using simple multivariate regression techniques [35]. Third, longitudinal data are typically very prone to incompleteness due to subject dropout, missing data, or changes in assessment procedures over time. All three challenges to interpretability can be addressed using mixed-effects [34,35,63] or multilevel [228,229] approaches to analysis. These analytic approaches include maximum likelihood estimation in the presence of missing data, which is considered optimal [230], and flexible procedures for analysing variance-covariance structures in longitudinal data [35,63,175].
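For illustration, the sketch below fits a random-intercept mixed-effects model to synthetic, deliberately unbalanced longitudinal data (unequal visit counts and irregular assessment times per patient), the situation described above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
rows = []
for patient in range(100):
    level = rng.normal(0, 2)                  # patient-specific intercept
    n_visits = rng.integers(2, 8)             # unequal numbers of visits
    for month in rng.choice(np.arange(1, 25), size=n_visits, replace=False):
        rows.append({
            "patient": patient,
            "month": month,                   # irregular assessment times
            "symptoms": 50 + level - 0.8 * month + rng.normal(0, 3),
        })
df = pd.DataFrame(rows)

# A random intercept per patient handles the correlation of repeated
# measurements; unbalanced data pose no problem for ML estimation.
fit = smf.mixedlm("symptoms ~ month", df, groups=df["patient"]).fit()
print(fit.params["month"])   # estimated monthly symptom change (~ -0.8)
```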
When concern remains that treatment and control groups in an observational study are not comparable on unmeasured covariates prior to treatment, sensitivity analysis can be used to estimate the magnitude of hidden bias that would need to be present to explain the observed treatment effect. In sensitivity analysis, the deviation of the odds of receiving treatment from chance is calculated for subjects matched on a confounding covariate. A study is considered highly insensitive to bias if only a very large deviation from chance allocation to treatment group could explain the observed association between treatment and outcome. As well as determining the likelihood that treatment effects could be accounted for by confounding by unmeasured covariates [63,231], ideally sensitivity analysis should be used to examine the robustness of the assumptions underlying the modelling approach to managing missing data [35,232,233].
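One widely used form of sensitivity analysis for matched pairs is Rosenbaum's bounds on the sign test; the sketch below (hypothetical data) computes the upper bound on the p-value as the assumed hidden bias Gamma (the factor by which unmeasured confounding could distort the within-pair odds of treatment) increases.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(3)
diffs = rng.normal(0.5, 1.0, 80)    # treated-minus-control differences
d = diffs[diffs != 0]
S = len(d)                          # informative (non-zero) pairs
T = int((d > 0).sum())              # pairs favouring treatment

for gamma in (1.0, 1.5, 2.0, 3.0):
    p_hi = gamma / (1 + gamma)           # worst-case within-pair odds
    upper_p = binom.sf(T - 1, S, p_hi)   # bound on one-sided sign-test p
    print(f"Gamma = {gamma}: p-value upper bound = {upper_p:.4f}")
# A result is insensitive to hidden bias if the bound stays small
# even for large Gamma.
```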
Notwithstanding the merits of the aforementioned approaches to confounding in observational studies, it must be acknowledged that even the most carefully designed studies will have weaknesses, and conclusions should not be drawn from a single study. Replication is necessary. Despite these caveats, the results from observational studies have been largely consistent with the results of randomized trials [234,235]. We now review the four subfields of observational studies identified.
Clinical epidemiological designs
Clinical epidemiology applies the principles and methods of epidemiology to evaluate disease detection technology and clinical treatment in patients (Table 1). Non-interventional epidemiological studies of psychotic disorders are relevant to evaluating EPI. The World Health Organization Determinants of Outcome of Severe Mental Disorders (WHO DoSMeD) or Ten-Country Study was the first population-based study to provide reliable estimates for the 15–54 year age range of the incidence of narrowly defined schizophrenia (7–14 cases per 100 000 per annum) and broadly defined non-affective psychotic disorder (16–42 cases per 100 000 per annum) that could be used as benchmarks for evaluating case detection rates of EPI programmes [236]. Less reliable estimates of the incidence of affective psychosis (7.7–10.6 per 100 000 per annum [237]) and bipolar I disorder including ‘non-psychotic’ cases (10.8–20.8 per 100 000 per annum [238]) provide benchmarks for EPI programmes that recruit patients with affective psychosis. Recent prevalence studies highlight the large number of patients with diagnosable psychotic disorder who may never have received treatment. For example, when comprehensive case-finding procedures are used the reported prevalence rates for schizophrenia (10 per 1000) and non-affective psychosis (22.9 per 1000) [239] are more than double the rates generally reported in studies relying on case finding via health services contact only [240]. Taken together these epidemiological studies indicate that simply screening first service-presentations for non-affective and affective psychosis, without any additional case-finding procedures, should identify more than 30 incident cases per 100 000 population per annum [241]; comprehensive case-finding procedures could potentially double that figure.
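As a worked example of using these figures as benchmarks, the sketch below converts the cited incidence rates into the number of new cases an EPI programme serving a hypothetical catchment should expect to detect each year.

```python
catchment = 250_000   # hypothetical service catchment population

benchmarks_per_100k = {   # annual incidence ranges cited above
    "schizophrenia (narrow)": (7, 14),
    "non-affective psychosis (broad)": (16, 42),
    "affective psychosis": (7.7, 10.6),
}
for disorder, (low, high) in benchmarks_per_100k.items():
    lo = catchment / 100_000 * low
    hi = catchment / 100_000 * high
    print(f"{disorder}: {lo:.0f}-{hi:.0f} expected new cases per year")
# A programme detecting far fewer cases than the benchmark range is
# likely missing incident cases in its catchment.
```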
Cohort studies of first-presentation psychosis have also provided designers of EPI programmes with valuable information, including (i) >75% of early psychosis patients will present in the 15–30 year age range [236,237]; (ii) diagnosis is unreliable at first presentation [73] and may only be accurate with longitudinal assessment and review of all available sources of data [239]; and (iii) acute-onset transient psychosis (DSM schizophreniform disorder or brief psychosis) is particularly diagnostically unstable, with male subjects, especially those with premorbid dysfunction, tending to be re-diagnosed with schizophrenia [242] and female subjects re-diagnosed with bipolar disorder [243,244]. The British AESOP (Aetiology and Ethnicity of Schizophrenia and Other Psychoses) first-onset psychosis study provided information about the sort of diagnostic mix of patients that a well-established EPI programme is likely to recruit [237] and how to design ED strategies for ethnic minorities [245,246]. Other cohort studies such as the German ABC (Age, Beginning, Course) Schizophrenia Study demonstrated in schizophrenia that there is on average a 4 year history of functional and symptomatic decline prior to onset of first psychotic symptoms, while psychotic symptoms are apparent only for the 12 months prior to hospital admission and diagnosis [247,248].
Clinical epidemiological principles are especially relevant to ED strategies. ED of psychotic disorder, especially prior to onset of currently diagnosable illness, should be informed by the general medical literature concerning possibilities of spurious associations between earlier diagnosis and better outcomes [62,249], and the poor feasibility of accurate screening tests for low-prevalence disorders [250,251]. Even if improvements in ED strategies do not achieve reductions in DUP, there is evidence that they will lead to a greater proportion of patients being treated (disease coverage) because these strategies identify more untreated prevalent cases [252], an improvement in effectiveness in its own right. Nonetheless, the results of cohort studies describing the early course of schizophrenia [247,248] are consistent with evidence that much of the trajectory of morbidity is established prior to the onset of psychotic symptoms [253–256], and that to prevent this morbidity we must develop improved clinical services and assessment technology [257,258].
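The difficulty of screening for low-prevalence disorders can be made concrete with a positive predictive value calculation; the test characteristics in this sketch are hypothetical.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Probability that a positive screen is a true case."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# With annual psychosis incidence around 30 per 100,000, even a
# hypothetical screen with 90% sensitivity and 95% specificity yields:
print(f"PPV = {ppv(0.90, 0.95, 30 / 100_000):.3%}")   # ~0.5%
# That is, roughly 99.5% of positive screens would be false positives.
```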
Service evaluation studies
Health services research can be defined in a number of ways [259,260], but narrowly it is ‘the field of enquiry that examines the impact of the organization, financing, and management of health care services on the delivery, quality, cost, access to, and outcomes of health services’ (cited in [261]). In mental health, it had its roots in concerns about the failure to develop effective community-based systems of care [183,184]. The scope of service evaluation is broad, spanning the environment, structure and performance of health-care organizations through to the effectiveness of programmatic interventions [262]. Although some view all types of intervention evaluation as a form of service research, we have restricted this area to a subfield in which the evaluation focus is primarily on systems of health care, with health services or component programme models as the smallest unit of analysis. Further details about the three study types (health services research, programme evaluation, and economic evaluation) are contained in Table 1.
A major difficulty with service evaluation is defining the elements of the therapeutic programme and maintaining its fidelity. In mental health service evaluation, programme fidelity first emerged as a major issue in relation to assessing the effectiveness of assertive community treatment (ACT) [263–266]. Problems with programme fidelity were suggested when advantage for ACT over standard case management was not consistently found across studies with adequate power. Whether the form of case management evaluated by the UK 700 Group was genuinely a form of ACT was questioned [267]; the findings of the PRiSM (Psychiatric Research in Service Measurement) Psychosis Study were criticized because no attempt was made to ensure fidelity of the service model [268]. This literature highlighted the need for programme fidelity rating scales [48,117,120,269,270] that (i) assess the extent to which the ‘active ingredients’ of a service model or intervention are implemented; (ii) differentiate interventions with flawed logic and processes from evaluations with defective methodology [49]; and (iii) comprehensively capture programme components at service, practice, and patient levels [119,124]. This research also encouraged the development of service model fidelity scales based on programme template tools for evaluating programme content [271] and for assessing treatment strength [272,273], as well as structured approaches to assessing treatment content [84]. To date, fidelity scales have been developed for measuring service model adherence to ACT [274–276]; psychiatric rehabilitation [277]; integrated services for mental health and substance use comorbidity [274,278–280]; family psychoeducation [281]; and motivational interviewing [282]. Other examples of service model fidelity scales rate child and family programmes for adherence to specific treatment principles, either by staff interviews [283], by programme activity observation [284], or by assessment of observed clinical practice [285].
By contrast, it is notable that programme evaluation approaches have been little used in the EPI field. This is particularly the case in relation to the development of service model fidelity scales. We could not find a single published service model evaluation scale in the EPI literature, although the essential elements of EPI have been agreed to by consensus [286]. In their Cochrane Review cited previously, Marshall and Rathbone drew attention to the urgent need for identifying the treatment components that distinguish EPI from standard service delivery, and developing methodology to measure the implementation of these programme components in the future evaluation of EPI programmes irrespective of study design [25].
Interest in economic evaluation of EPI [287] was stimulated by evidence that patients receiving EPI required fewer inpatient days compared to TAU [204,205,288,289]. The first cost-effectiveness study, conducted at the Early Psychosis Prevention and Intervention Centre (EPPIC) in Melbourne, reported that the average cost per unit of symptomatic improvement for patients receiving EPI (EPPIC patients) was AU$16 964, while for patients receiving TAU (before EPPIC) it was AU$24 074 [290]. The Spanish PSICOST (a mental health service research group) study found that direct health-care costs for first-episode schizophrenia during the first 3 years of treatment were significantly reduced for patients receiving intensive community care versus those who did not, despite there being no differences in health outcomes [291]. Other cost-effectiveness studies in England and Sweden also found that total costs for patients receiving EPI were lower compared to alternative service models, mainly due to lower inpatient costs [292,293]. Medium-term cost savings were associated with EPI at 3 year follow up in a Canadian service [294]. A recent long-term follow up (approx. 8 years after first treatment) of patients who received EPI at EPPIC for the first 2 years of their treatment found that the EPI patients were symptomatically better at follow up compared to historic control patients who did not receive EPI, and that their treatment costs per year were lower (AU$3445 vs AU$9503) [295]. Although all of these economic evaluations of EPI were based on quasi-experimental designs, the large size of the cost differences, irrespective of the type of comparison group used or study country of origin, strongly suggests that EPI may cost less than TAU [296].
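Cost-effectiveness comparisons of this kind reduce to an incremental cost-effectiveness ratio (ICER); the sketch below uses hypothetical figures, not those of the cited studies.

```python
def icer(cost_new: float, cost_old: float,
         effect_new: float, effect_old: float) -> float:
    """Extra cost per extra unit of health outcome gained."""
    return (cost_new - cost_old) / (effect_new - effect_old)

# Hypothetical annual per-patient costs (AU$) and symptomatic
# improvement (arbitrary units) for EPI versus treatment as usual.
print(f"ICER = AU${icer(9000, 12000, 1.4, 1.0):,.0f} per unit gained")
# A negative ICER with a larger effect means the new service 'dominates':
# it is both cheaper and more effective, the pattern reported for EPI.
```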
Quality assurance studies
Health-care quality, ‘the degree to which health services for individuals and populations increase the likelihood of desired health outcomes and are consistent with current professional knowledge’, frequently falls short of evidence-based standards [297] despite the existence of effective ways to improve it [298,299]. These concerns particularly apply to mental health care [104,300,301]. QA is the ‘system of procedures, checks, audits, and corrective actions to ensure that all research, testing, monitoring, sampling, analysis, and other technical and reporting activities are of the highest achievable quality’ [29]. Quality improvement (QI) involves continual performance measurement, incrementally improving service delivery, and revising (upwards) standards in the pursuit of best practice [302]. Details concerning QA/QI, including its two main methodological forms, quality audit and practice audit, are outlined in Table 1.
Despite the formidable practical and methodological challenges associated with QA studies, they uniquely offer the opportunity to evaluate health care as it is routinely practised on real-world patient populations without the constraints imposed by individual patient and/or staff informed consent [303,304]. Multi-centre QA studies also permit examination of the influence of service context on treatment effectiveness, of particular relevance to EPI, which relies on the optimal combination of service function and clinical practice. Multilevel modelling approaches (mixed models, multilevel or growth curve modelling) can analyse data collected on ‘units’ hierarchically nested within other units: mental health services grouped (nested or clustered) within jurisdictions; clinicians grouped within services (or programmes); patients grouped under individual clinicians; and repeated observations measured through time clustered for each patient. Multilevel modelling efficiently corrects standard errors (underestimated by ordinary linear regression) related to data correlations due to clustering, and partitions variance independently at each level of measurement [31,228,229,305,306].
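To illustrate the variance partitioning described above, the sketch below fits a two-level random-intercept model to synthetic data (patients nested within services) and computes the intraclass correlation, the share of outcome variance attributable to the service level.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
rows = []
for service in range(30):
    service_effect = rng.normal(0, 1.0)       # between-service variation
    for _ in range(40):                       # patients within the service
        rows.append({"service": service,
                     "outcome": 10 + service_effect + rng.normal(0, 2.0)})
df = pd.DataFrame(rows)

# Null (intercept-only) model with a random intercept per service.
fit = smf.mixedlm("outcome ~ 1", df, groups=df["service"]).fit()
var_between = fit.cov_re.iloc[0, 0]           # service-level variance
var_within = fit.scale                        # patient-level variance
print(f"ICC = {var_between / (var_between + var_within):.2f}")  # ~0.20
# Ignoring this clustering would understate standard errors for any
# service-level comparison.
```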
Quality audit tends to use indicators broadly applicable across health-care fields (e.g. medication error rates). Quality (performance) indicators are ‘operationally-defined indirect measures of selected aspects of a system which gives some indication of how far it conforms to its intended purpose’ [307]. Practice audit tends to use indicators of evidence-based practice in relation to a specified health-care professional group for a defined patient population eligible to receive that practice (e.g. dose of antipsychotic medication in schizophrenia). Whether evaluation of these clinical processes is from the perspective of service performance and manager concerns (top-down) or from the perspective of evidence-based practice and practitioner concerns (bottom-up), the evaluation focus is on what happens to individual patients, the microsystem of clinician-patient interactions. Although patient-level data may be used and the individual patient is the unit of analysis, the primary focus of QA studies is on health-care or treatment processes, and such studies are mostly published in the quality rather than the clinical literature.
There are few examples of the application of quality audit-based designs in the evaluation of mental health-care effectiveness [308,309–312]. Benchmarking is an emerging mental health service activity in Australasia [125]. The balanced scorecard has been applied to generic services [313], and was used to show that introduction of a specialist EPI service increased the rate of treated psychosis [314,315] and reduced treatment default [315]. Consensus performance indicators for EPI programmes have been published [316]. Performance measurement has added to the evidence that EPI programmes increase service recruitment of first-episode psychosis [309,317,318], improve pathways to care [319,320], increase retention [318], and are associated with better health outcomes [321]. This literature indicates that simply instituting a service-wide flagging system for early psychosis patients increases recognition and intervention effectiveness [322]. Despite the potential for quality audit to evaluate programmes, however, its application to EPI has to date been limited.
Practice audit studies determine the conformity of clinical practice with clinical practice guidelines (CPG) (Table 1). CPG are defined as ‘systematically developed statements to assist practitioner and patient decisions about appropriate health care for specific circumstances’ [323]. They represent statements of best practice based on systematic review of published research studies on the outcomes of treatment and other health-care procedures. The first evaluation of the feasibility of measuring CPG adherence in psychotic disorders was the US Schizophrenia Patient Outcomes Research Team (PORT) studies, which found that 29% of patients received CPG-adherent doses of antipsychotic, 22% of eligible patients received vocational rehabilitation, and only 10% of available family members received psychoeducation [324]. A recent review of CPG adherence in mental health found that (i) a mere 1% of the total published literature on CPG pertained to mental health; (ii) in non-experimental studies the overall rate of adequate CPG adherence was low (27%); and (iii) in the rare study that actually examined whether the level of adherence predicted patient outcome, it did [325].
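A practice audit ultimately reports proportions of this kind; the sketch below computes an adherence rate with a 95% Wilson confidence interval from hypothetical counts of the same order as the PORT findings.

```python
from statsmodels.stats.proportion import proportion_confint

eligible = 120    # audited patients eligible for the process indicator
adherent = 34     # records meeting the CPG-adherent criterion
rate = adherent / eligible
low, high = proportion_confint(adherent, eligible, alpha=0.05,
                               method="wilson")
print(f"CPG adherence = {rate:.0%} (95% CI {low:.0%} to {high:.0%})")
# Tracked over repeated audit cycles, movement in this interval is the
# basic signal used for quality improvement.
```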
In 1998 the National Early Psychosis Project produced the Australian Clinical Practice Guidelines for Early Psychosis (EPPIC State-wide Services, Melbourne). This created an opportunity to evaluate CPG for early psychosis by operationalizing the Guidelines in terms of process indicators of CPG-adherence or clinical pathways. To date only one study in Australia using CPG-adherence indicators has been published [326]. There are two pilot studies on clinical pathways for early psychosis [327,328]. These reports show that designing CPG-adherence indicators for file audit is feasible, while successful application of clinical pathways for early psychosis is highly dependent upon implementation strategy and local service factors. A number of EPI clinical practice improvement networks have been established to monitor CPG-adherence and patient outcomes in England (National EDEN Project, a component of the NIHR Mental Health Research Network) [329]; in London (London Early Intervention Research Network: LEIRN) [330]; and in Australia (NHMRC Clinical Practice Network for Early Psychosis: CPIN-EP) [331]; but evaluation findings are yet to be published. Taken together, the literature indicates very limited application of QA study designs to the evaluation of the effectiveness of EPI and suggests considerable opportunity for further development in this field.
Quasi-experimental comparisons of treatment groups
Quasi-experimental design (Table 1) is one in which the researcher lacks control over the allocation or timing of intervention but nonetheless conducts the study as if it were an experiment, allocating subjects to groups [29]. As with CTs, the primary focus of quasi-experimental studies is whether patients have changed as a result of receiving treatment. That is, the individual patient is the unit of analysis. In this paper the quasi-experimental design is considered as a separate evaluative subfield because numerically it formed the bulk of the published literature in the mental health field and did not fit well into other sub-fields reviewed in the previous sections.
The strengths and weaknesses of the quasi-experimental design are well illustrated by the Schizophrenia Outpatient Health Outcomes (SOHO) study [332]. That 3 year prospective study of antipsychotic medication included >10 000 outpatients recruited in 10 European countries. SOHO provided useful post-marketing surveillance safety data in a large representative sample of patients with schizophrenia: it highlighted the high prevalence of depressive symptoms (approx. 80%) [332], amenorrhoea (approx. 30% of female subjects), and impotence and loss of libido (40–50% of male subjects) [333]. The SOHO study, in which clinicians could flexibly determine drug type and dose, found surprisingly high rates of treatment continuation (88.5% after 6 months), most patients staying on the same medication they started at baseline [332]. A total of 70% of first-episode patients with schizophrenia in SOHO responded very well to medication by 6 months and, notably, all patients with substance abuse at baseline ceased abuse after 6 months of drug treatment without specific psychosocial intervention [334], a replicated result [194]. SOHO, however, failed to achieve its primary aim of assessing the relative effectiveness of different antipsychotic drugs because of its non-randomized design, despite extensive use of sensitivity analysis [332].
There are hundreds of examples of the use of a non-equivalent comparison group in evaluation studies of EPI (including all of the economic evaluations and QA studies reviewed here), so only a few illustrative studies will be considered. In the first, McGorry et al. reported more rapid reduction of negative symptoms in patients attending the EPPIC programme, compared with a carefully matched group of patients who attended the same service prior to commencement of EPPIC [335]. The apparent advantage for EPI, however, could not be related to earlier intervention because DUP was similar in both the pre-EPPIC and EPPIC patient groups [335]. There are many subsequent evaluations using historical comparison groups in the EPI literature [336–338]. There are also numerous studies using parallel contemporaneous programmes in non-equivalent group comparisons [318,339,340]. In addition, general population [341] or representative age-matched clinical samples [318] have been used as non-equivalent reference groups. Bias and confounding cannot be excluded from these designs, as is the case with the multitude of simple pretest-post-test evaluations of EPI [194]. It is rarely feasible to apply the more sophisticated cross-over and time series study designs to evaluating EPI. Non-experimental case studies and qualitative research are also sparingly published in the EPI field, despite their potential for methodological rigour [342] and for identifying treatment processes [343].
The Early Treatment and Identification of Psychosis Study (TIPS) is the most outstanding example of quasi-experimental evaluation of EPI [337,338,344–348]. TIPS set out to determine whether ED can shorten DUP and, in turn, improve patient outcomes. The ED strategy had two major components: (i) an intensive public information and awareness campaign targeting communities, schools, families, and GPs; and (ii) a network of detection teams with low threshold for referral and easy access [344,346]. The ED strategy was introduced into Rogaland County, Norway (population 370 000). Detection rates, DUP, and health outcomes in that sector were compared to those in two ‘parallel in time’ health sectors without ED strategies, one in Ulleval, Norway (population 190 000) and another in Roskilde County, Denmark (population, 100 000). The populations in the three sectors had remarkably similar demographics and public health-care systems typical of Scandinavia generally. In addition to this parallel non-equivalent group comparison, a historical control group design was used in a subsector of Rogaland County in which a cohort of patients presenting in 1993–1994 (before ED) was compared with a cohort presenting in 1997–1998 when ED strategies were being used.
Compared to the pre-ED cohort, the ED cohort (i) was larger in number; (ii) had shorter median DUP (4.5 weeks vs 20 weeks); (iii) was younger; (iv) had lower baseline symptom ratings; and (v) less often required inpatient admission [338,344]. Also, the cohort with ED and standard treatment had fewer negative symptoms and better peer networks at 12 month follow up, even within the schizophrenia group [348]. Patient characteristics reverted towards those of the pre-ED cohort after ED strategies in Rogaland were withdrawn [337], providing additional evidence for a specific relationship between ED strategies and shorter DUP, and more favourable baseline patient characteristics [337]. The methodologically more sound TIPS parallel group comparison produced comparable results to the historical group comparison. Compared to the non-ED sectors (Ulleval and Roskilde), the early psychosis cohort in the ED sector (Rogaland) had significantly shorter median DUP (5 weeks vs 16 weeks), and better functioning and lower symptom levels at baseline [345]. At 12 month and 2 year follow up, patients with ED had better outcomes with significantly reduced levels of negative symptoms, despite treatment after detection being essentially the same (i.e. specialist EPI was not offered in any of the four health sectors) [340,347].
In summary, the TIPS study is exemplary in the EPI field of combining an epidemiological approach and quasi-experimental design, and using multiple control groups and appropriate statistical analysis to examine whether confounding could have accounted for the ED treatment effects. TIPS also described the feasibility and costs of ED, and the clinical assessment load generated by it [344,346]. Significantly, TIPS showed that a critical component of EPI, ED, can be very effective.
Discussion
Using a scoping strategy to map the evaluation field, we were able to describe a conceptual framework for intervention effectiveness research (Figure 1). Without this framework, it may not have been possible to sort and organize the hundreds of papers, books, and reports identified by database and Internet searches that were scattered across the management, social sciences, and clinical literature. Some caution is warranted regarding conceptual frameworks of the kind we have proposed herein because they may be prone to distortion by the theoretical perspective of their designers or may overlook certain subfields [30]. To minimize these concerns we used a saturation searching strategy (searching until no new papers were found) and sorted until all representative papers could be grouped into at least one subfield. We found that ordering evaluation on an efficacy-effectiveness spectrum was useful because it tended inter alia to rank study designs according to their susceptibility to bias [349]. Typical of global evidence mapping, this review generated a voluminous bibliography, which we have included in full for use as an educational tool. Our global mapping exercise identified the broad range of evaluation designs beyond efficacy RCTs, and the map made gaps in the evidence more readily apparent when appraising the EPI literature.
Scoping the effectiveness evaluation literature on EPI indicated a striking paucity of RCT data. Although the first Cochrane Review found CT evidence for EPI to be inconclusive [25], we detected promising trends. First, there were several well-constructed RCTs of specialist EPI showing short-term advantage [203,204,350], even when ED strategies were not used. It was also clear that without continuity of optimal care, the effects of short-term specialist EPI fade and disappear [205]. CTs of antipsychotic drugs consistently showed that positive symptoms were highly responsive in first-episode patients, although negative symptoms and cognitive deficits seemed little improved. Moreover, SGAs do not seem to prevent transition to psychosis in prodromal patients, suggesting that alternative neuroprotective treatments, such as omega 3 fatty acids, N-acetyl cysteine, and metabotropic glutamate receptor agonists, should be more intensively studied in prodromal patients [351–353].
Our appraisal found that the bulk of the EPI literature supporting EPI effectiveness consisted of observational studies. Although there are many sound non-interventional epidemiological studies to guide implementation of EPI programmes, most interventional studies of EPI rely on poorly controlled quasi-experimental designs. We conclude that the conduct of high-quality health services research and QA studies has been grossly under-resourced in spite of the critical importance of mental health systems research in improving service delivery [93]. This conclusion implies that large-scale multi-service studies are required, necessitating national-level investment in mental health services research capacity in Australia. We consider that the first urgent task in strengthening these fields of EPI evaluation would be the design of psychometrically sound programme fidelity rating scales.
Nevertheless, there are some important findings in the observational literature that are consistently reported irrespective of study design. First, specialist EPI service functions increase the recognition rates of first-presentation psychotic disorders [318], although they may not shorten DUP. It is likely that EPI functions draw into treatment more of the currently untreated cases of chronic psychosis in the community, estimated to be in excess of 50% of all cases [354,355]. Because disease coverage is an aspect of effectiveness, these data represent support for the effectiveness of specialist EPI. Second, irrespective of comparison group or country of origin, cost analysis studies of specialist EPI consistently showed that EPI costs less than TAU, mainly because of reduced hospital readmission rates. Consistent with this, a recently published meta-analysis demonstrated that specialist EPI programmes prevent relapse more effectively than TAU [356].
Although most quasi-experimental evaluations of EPI tend to use poorly controlled designs, our evidence-mapping procedures found examples of design features and statistical procedures that can be applied to this type of observational study to minimize the likelihood of biased results. These approaches were incorporated into the TIPS project, which produced compelling evidence that a combination of public awareness campaigns and provision of ED teams effectively engages psychotic patients at an earlier stage of their disease. In combination with state-of-the-art programme development and evaluation methodologies for implementing youth mental health campaigns [357], this area of evaluation could rapidly provide policy makers with the evidence base for funding preventative strategies [257,258,358]. Also, increased emphasis on audit and feedback in the evaluation of CPG implementation and EPI programme fidelity was identified in our review as an effective strategy for improving health-care quality and guiding service reform to support better clinical practice [109,301].
In this review and our proposed evaluation framework, we have assumed that existing evidence hierarchies [143,144] place undue weight on efficacy RCT designs relative to the diverse range of non-experimental methods. As others have proposed [359], we are concerned that existing evidence hierarchies may slow advances in fields such as mental health and public health nutrition, in which RCTs are uncommon due to feasibility or ethical constraints. We suggest that the mental health field not limit itself to an exclusive focus on RCT data, which tend to be ‘inconclusive and will augment the grey zones of practice’ [360], when there are reliable observational data supporting intervention effectiveness. Lowering the threshold for level of evidence needed before implementation of a service innovation, however, demands investment in routine quality evaluation after implementation [361], carried out by clinicians who are well-versed in research methodology [362].
In summary, our field mapping exercise was able to logically categorize the full range of evaluation approaches, and specify relevant methodological strengths and weaknesses. Using this conceptual framework we were able to comprehensively appraise the EPI effectiveness evaluation literature, collate findings that are consistent across study design types, and identify gaps in the literature that need urgent additional investment. We also concluded that the degree of complexity of the intervention evaluation literature suggests that greater focus on research methodology in the training of Australasian psychiatrists is urgently needed.