Abstract
Treatment guidelines should inform and assist the reader about how specific conditions might best be managed. Ideally, such recommendations should be underpinned by a meld of supportive evidence and clinical wisdom. In recent years, guideline development has commonly weighted the former component and prioritized ‘Level 1 evidence’. Such evidence is defined in the Royal Australian and New Zealand College of Psychiatrists' introductory paper to its clinical practice guidelines as data derived from a ‘systematic review of all relevant randomized controlled trials’ or from randomised controlled trials (RCTs) [1].
While supportive of the general principle of prioritizing RCTs, I have also argued [2] that – in application – current RCT procedures for assessing antidepressant strategies provide non-specific (and meaningless) information. This is a consequence of three intermeshing factors. First, treatments are generally tested in RCTs as if they had universal application (i.e. a nonspecific model). Second, the most common RCT diagnostic groups (e.g. ‘major’ depression, dysthymia, ‘severe’ depression, ‘moderate’ depression) effectively homogenize any intrinsic depressive disorders along a severity dimension, reducing the capacity for differential disorder–treatment responses to be identified. Third, formal and informal recruitment criteria favour subjects who have a propensity for rapid (or spontaneous) remission and are weighted against the inclusion of the depressive conditions observed in clinical practice.
In that paper [2] I considered meta-analyses and other reviews to draw two principal conclusions. One, that such large RCT databases effectively allowed a nonspecific (‘equipotency’) interpretation for contrasting antidepressant treatments. In essence, the aggregated RCTs identify all antidepressant drugs, all tested psychotherapies and a number of other less orthodox treatments (e.g. St John's wort) as having efficacy levels of 50–55%. This promotes an ‘all roads lead to Rome’ conclusion which is unsatisfactory, strains credulity and does not accord with clinical observation. Two, and a matter of common sense, that aggregated RCT analyses suggesting that the antidepressant drugs are only marginally superior to placebo are also at marked variance with clinical observation.
While RCTs in other disciplines may have stronger foundations, any guidelines that prioritize such nonspecific RCT data for managing depression risk building their house on sand. This is not unique to our region; it reflects, rather, limitations of the DSM and ICD dimensional models for depression, whose non-specific diagnostic groupings favour non-specific findings and generate non-specific treatment data. Such an approach regrettably underpins many educational programs. For instance, the current Concise guide to mood disorders published by the American Psychiatric Association [3] noted that recent DSM systems have defined ‘more homogeneous populations’ (along a severity dimension). The book goes on to observe that, while a test like the Dexamethasone Suppression Test has moderate diagnostic specificity in distinguishing melancholic from nonmelancholic depression, this is ‘not of great practical importance’ as ‘both types of depression are treated similarly’ (p. 92). Further, that mild and moderate expressions of major depression can be treated with ‘antidepressants or psychotherapy’ (p. 219) and that ‘all antidepressants currently available are equally effective’ (p. 220). Such conclusions neither inspire nor satisfy.
The above identifies a logical fallacy at the heart of guideline development – one that can either be ignored or addressed. If the evaluative studies largely fail to differentiate one treatment from any other, how can treatment guidelines use such data without generating non-specific recommendations? The aphorism ‘what's the use of running if you're on the wrong road?’ captures the dilemma. In a previous publication in this journal I questioned [2] how the CPG Team had addressed such problems in synthesizing and evaluating the literature, having – at that time – read only the ‘summary’ of the College guidelines for depression [4] published in March 2003. At face value, many of the recommendations were perplexing – and it was unclear whether the authors had weighted RCTs, expert opinion or their own views. Now that the definitive guidelines have been published, in June 2004 [5], theoretical limitations have become apparent. Additionally, there are problematic process and content components to the final guidelines.
There is much at stake in producing such guidelines. First, they have the imprimatur of the College. Second, to the extent that invalid recommendations influence treatment choices (by patients and consumers), the consequences for patients may not be trivial. If there are ‘truths’ that are more likely to be defined or approached by debate, that is a preferable process to tepid and passive resignation. Thus, this critique will identify a number of illustrative issues.
The blueprint
Definition of terms is a useful first base for inquiries, so how was ‘depression’ modelled? Initially, we are informed [5] that the guidelines are for ‘moderate’ and ‘severe’ depression. Shortly afterwards the review moves to define ‘clinical depression’, then prevalence data are provided for ‘depressive disorder’ and for ‘depression’ in association with medical illnesses, while several depressive subtypes (e.g. melancholia, psychotic depression, atypical depression) are also considered. While I applaud the Team for considering depressive subtypes, the ‘condition’ under review is such a moving target that coherence is compromised. The writers fail to integrate such disparate expressions of ‘depression’ in any model, so that the definitional variegation propagates confusion further downstream. Reductive statements appear naive. For example, guideline users are encouraged (p. 391) to use the Hamilton or the Center for Epidemiologic Studies Depression Scale (CES-D) to formally assess severity so as to ‘allow selection of evidence-based treatments’.
Selecting the building blocks
Next of concern is how the Team actually derived the evidence used to generate the recommendations. We are informed that RCTs were included but that, where ‘knowledge is sparse, lower orders of evidence have been used’ (p. 389). While evidence levels are detailed in the figures, assertions often appear in the text without clarification and rarely with conditional clauses, giving the statements an authority that does not always stand up when the source documents are examined against the inclusion criterion (p. 391) for the meta-analyses or the Table 1 definitions.
The literature-based evidence is presented and interpreted with a ‘pseudo-science’ flavour. In terms of quantitative evaluation, the Team compares the efficacy of a number of antidepressant strategies to placebo and to other active treatments in Tables 2 and 3, respectively. Such comparative analyses are only appropriate if there is a sufficiently large bank of studies – and in recent years we can observe an increasing rigour in including unpublished studies in such analyses to overcome the problems created by ‘file drawer’ negative studies. The Team appears sanguine about including single studies (e.g. reboxetine vs. placebo; tricyclic antidepressant (TCA) vs. moclobemide; nefazodone vs. selective serotonin re-uptake inhibitors (SSRIs)), while 80% of the comparisons involve five studies or fewer. By publishing some statistics (the NNT or Number Needed to Treat, and the ARR or Absolute Risk Reduction), quantification and a level of scientific rigour are suggested. However, the relatively few tabled studies make estimates of treatment effects unstable and simple conclusions problematic. The injunction by Cook et al. [6], in considering the relation between systematic reviews and practice guidelines (i.e. to undertake a critical appraisal of the original studies), does not appear to have been taken up.
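For readers weighing this criticism, it may help to recall how the two statistics relate (a standard epidemiological definition, not material drawn from the guidelines themselves):

```latex
\mathrm{ARR} = p_{\mathrm{active}} - p_{\mathrm{control}},
\qquad
\mathrm{NNT} = \frac{1}{\mathrm{ARR}}
```

Because the NNT is the reciprocal of a difference between two response proportions, any imprecision in those proportions – inevitable when a comparison rests on one to five trials – is magnified in the NNT, which is precisely why small banks of studies yield unstable estimates.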
In such quantitative analyses, only one significant difference is observed, with the Table 3 asterisk indicating that venlafaxine was superior to TCAs in four studies. The legend, however, indicates that the asterisk should actually have been attached to the venlafaxine versus SSRI comparator studies (an example of the quality control issues also evident in other parts of the guidelines). The more important point, however, is that reliance on small banks of studies is unlikely to deliver differentiation. As noted in my earlier paper [2], an analysis of 150 RCTs for major depression found a 54% improvement in those receiving an ‘old’ antidepressant and a 54% improvement in those receiving a new antidepressant, while an analysis of 102 RCTs found no difference between TCAs and SSRIs. The risk in the Team's analyses is concluding – as they do (p. 393) – that almost all antidepressants are equally effective, rather than conceding the strong possibility of methodological and analytic limitations to the database. Thus, to even apply a statistical test is to suggest the capacity to deliver an interpretable result – when the literature reviewed earlier suggests that would be an unwarranted expectation.
(Over) estimating the materials
Turning now to the Team's qualitative evaluation, interpretations and recommendations are more definitive than the literature base allows – and again risk error. For example, we are informed that Table 3 data (described as ‘more relevant and robust’) ‘show the effectiveness of cognitive behavior therapy (CBT) and interpersonal therapy (IPT)’ (p. 394), while the text states that there is no evidence ‘that dynamic psychotherapy is effective’ (p. 394). The risk here is that the guidelines will be interpreted as providing evidence that two manualized psychotherapies (CBT and IPT) are superior to other psychotherapies, but the Team neither provides such evidence nor appropriately reviews the substantive database. When considering CBT, they analyse only four placebo-controlled studies and five studies comparing CBT to antidepressants. In our review of CBT [7], which included one meta-analysis of 48 RCTs, we suggested that the question as to whether CBT had superior efficacy compared to other psychotherapies returned the Scottish verdict of ‘not proven’. Our interpretation is not unique. The Team's review appears to have missed many similar interpretations published in the last few years, including an Australian review and critique [8]. In a methodologically superior meta-analysis (in terms of addressing comparative psychotherapies), Wampold et al. [9] concluded that, when CBT was compared with other bona fide brief therapies (principally psychodynamic ones), results failed ‘to support the superiority of CT [cognitive therapy] for depression’ (p. 159). As Scott and Watkins [10] observed, it would appear ‘that the most important factors in day to day practice are not the specific model of brief therapy, but the competent delivery of the treatment’ (p. 6). Yet CBT and IPT are advocated as superior, following selective or limited abstraction of ‘the evidence’. The risk is of opinion dressed up as fact.
(Under) estimating the materials
In addition to advocacy for a treatment straying beyond the evidential base, there are examples of the opposing phenomenon – for psychotherapy (as noted) but even more evident in relation to St John's wort. The guidelines consider St John's wort in the section on ‘Interventions of unproven efficacy’, stating that while there were some early supportive trials, ‘more recent studies of better design’ found it not to be superior to placebo. The guidelines make no reference to an informative meta-analysis by Williams et al. [11] which compared St John's wort versus placebo (eight studies) or versus a TCA (six studies), the 14 trials together comprising 1417 adults. The responder rates (i.e. improvement of 50% or more) were 62% for the TCA, 61% for St John's wort and 38% for the placebo. That interpretive review noted that, as a consequence of such indicative but incomplete data, the National Institute of Mental Health was sponsoring a trial. That large-scale study is now published [12] and shows a virtually identical 8-week reduction in Hamilton scores for the SSRI, St John's wort and placebo groups. While the Team does examine (Table 6) a reasonably large number of studies, their quantification is difficult to interpret. Their calculated NNTs for St John's wort versus placebo, SSRIs and TCAs are comparable to their calculated NNTs for other antidepressant treatments (Table 2). In the text, however, they state that when a Hamilton score of 20 or more (unreferenced) is imposed, ‘St John's wort is not superior to placebo’. That conclusion risks being disingenuous if it is based on studies such as the Hypericum Depression Trial Study Group analysis [12] (p. 398), in which the active antidepressant also failed to differentiate from placebo.
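Applying the standard definitions (the ARR as the difference in responder rates, the NNT as its reciprocal) to the Williams et al. figures quoted above makes the comparison concrete:

```latex
\mathrm{ARR}_{\text{St John's wort vs placebo}} = 0.61 - 0.38 = 0.23
\;\Rightarrow\; \mathrm{NNT} \approx 4.3
```

```latex
\mathrm{ARR}_{\text{TCA vs placebo}} = 0.62 - 0.38 = 0.24
\;\Rightarrow\; \mathrm{NNT} \approx 4.2
```

On those responder rates, St John's wort and the TCA are essentially indistinguishable – which is why the subsequent null result of the NIMH-sponsored trial [12], in which the active antidepressant also failed to separate from placebo, complicates rather than settles the question.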
The Guidelines Team and I may hold a similar view of St John's wort (that it is not a particularly useful antidepressant for ‘clinical depression’), but it is not clear to me that this conclusion can be substantiated by reference to the large bank of RCTs and meta-analyses (Level I evidence).
Adjusting for variations
Guidelines should assist decision-making and have a contemporary relevance – even if they need to be updated at frequent intervals as the evidence base emerges or changes. These guidelines provide six first-line monotherapies for ‘severe uncomplicated’ depression: a TCA, venlafaxine, nefazodone, an SSRI, CBT or IPT. Six? All bases appear to be loaded and equipotency rules. But discussion with observant clinicians might well have narrowed the options. Nefazodone was clinically judged, in comparative terms, as a less effective antidepressant by many regional psychiatrists. Such a ‘signal’ appears to have been reflected in clinical prescribing in Australia, with the marketing company consequently dissolving its sales force in 2003 and removing the drug from the Australian market in April 2004. So, why include nefazodone? We are informed (Figure 2) that nefazodone met the Level III criterion (i.e. a non-RCT study), but was this enough for the Team to recommend it joining the Group of Six? Equally importantly, why recommend a drug in guidelines published in June 2004 when the drug had been withdrawn months before – and as foreshadowed by the company in the previous year?
Again, on the basis of the evidence, how did CBT and IPT achieve first-line status for ‘severe uncomplicated’ depression? The Team lists the Elkin NIMH study [13] as one of their five references for CBT and as one of their two references for IPT (Table 3). However, in that pivotal study, CBT, IPT and the TCA imipramine did not differ from ‘placebo plus clinical management’. When analyses were restricted to those with more severe depression, however, imipramine was distinctly superior to CBT and IPT. While a single study, it was a milestone in many ways, and is (as far as I can ascertain) the only study that has directly examined the issue in hand – the impact of ‘depression severity’ on the comparative efficacy of a TCA versus CBT and IPT. If the Guidelines Team had respected the published evidential base, then CBT and IPT would not have joined the TCAs as recommended treatments for ‘severe uncomplicated’ depression. As noted earlier, advocacy for CBT and IPT appears to have proceeded beyond the published evidence.
Auxiliary materials
Next, issues of clinical relevance and balance. The Team's consideration of augmentation strategies is problematic. There is currently very wide use of the atypical antipsychotics as augmenters of antidepressant drugs. Though an ‘off licence’ use, and while supportive trials are few [e.g. 14], this strategy is neither considered nor mentioned in the guidelines. By comparison, we are informed that ‘Pindolol… hastens response to treatment’ (p. 395). This encapsulates several problems, including ones of balance, relevance and topicality. Balance? Table 4 (‘Augmentation strategies for treating depression’) is weighted to pindolol (13 references, as against three for lithium and five for T3 augmentation). Relevance? The domain is antidepressant drug augmentation (i.e. the use of a second agent to enhance the antidepressant action of the primary agent), while pindolol attracted research attention for its possible capacity to advance the speed of action of antidepressants. Further, and as noted by the Team, it is not licensed as an antidepressant in Australia. Topicality? While pindolol excited interest at international meetings for a year or two, it is now hardly ever considered. Thus, its inclusion is dated, as further evidenced by the lack of post-2000 publications in the guidelines. A local publication [15] goes unreferenced. That 10-year review echoed the more general interpretation of most recent reviews in reporting that ‘large, placebo-controlled, double-blind studies have produced conflicting results’ and that, in the light of concerning drug–drug interactions, pindolol should be ‘considered an experimental approach’.
Conversely, transcranial magnetic stimulation (TMS) is overviewed in a 12-word sentence, where it is judged as having ‘scant evidence of benefit’ (p. 398). There is no consideration of the converse problem confounding RCTs of drug treatments – that subjects referred to TMS studies tend to be treatment-resistant rather than having a high likelihood of spontaneous remission – so that the literature has to be considered in some detail before concluding that this treatment is without benefit. And there is a supportive literature out there. A review by Gershon [16] examined 14 studies and concluded that ‘TMS shows promise as a novel antidepressant treatment’; Padberg and Moller [17] noted that the ‘majority of the (controlled) trials demonstrated significant antidepressant effects’; while Burt et al. [18] undertook a meta-analysis and concluded that TMS has antidepressant properties (‘this effect is fairly robust from a statistical viewpoint’) for major depression (p. 398). Similarly, in relation to the omega-3 fatty acids (O3FAs), the Team states that there ‘is no evidence that they improve depression’ (p. 398). No evidence? Were two positive studies published in 2002 in the Archives of General Psychiatry [19] and the American Journal of Psychiatry [20] overlooked or dismissed?
I am not arguing that a few studies in a domain (e.g. O3FAs), even if they are only ‘proof of concept’ ones, or multiple studies in other domains (e.g. TMS), necessarily allow definitive conclusions for guidelines. For each of these specific domains, however, it would have been more appropriate to refer to the literature and judge that there is some supportive evidence but that the treatments remain experimental or require further refined studies, rather than dismissing them on the respective bases of ‘scant evidence’ (TMS) and ‘no evidence’ (O3FAs) – a judgement that is wrong on both counts.
Conclusions
Guidelines are difficult to prepare and easy to criticise – philosophically, parochially and in terms of their potential regulatory and medico-legal applications. These guidelines, however, seem overly ambitious in their breadth, which may have compromised quality control. Greenhalgh [21] proposed a number of questions that can be asked to evaluate the quality of a systematic review, including: (i) Does the review address an important clinical question?; (ii) Was a thorough search done of the appropriate databases and were other potentially important sources explored?; (iii) Was the methodological quality assessed and the trials weighted accordingly?; and (iv) Have the numerical results been interpreted with common sense and due regard to the broader aspects of the problem? I suggest that the answer to the last three questions is ‘no’. Missing at points are logic, common sense and what Goodwin [22] has described as the ‘modern unifying approach to psychiatry – a synthesis of reliable knowledge and clinical judgement’. I have argued this judgement on the grounds that the Team has neither reconciled depression modelling and paradigm limitations, nor conceded the recognized limitations to clinical extrapolation [23], nor integrated the formal evidential base with the wisdom of clinical observation. The scientific base is, in many areas, problematic, with various statistical analyses (e.g. NNTs and ARRs) implying a higher level of rigour than is allowed by model constraints (e.g. the nonspecificity model) or by the relatively few studies (e.g. risking type II errors). Some data appear to have been abstracted from published studies in arbitrary ways, with illustrative errors of omission and commission noted here. Treatments are often favoured without an evidential base, while others with such a base are summarily dismissed.
The contemporary nature of the guidelines can also be challenged: many literature reviews are years out of date, while a ‘recommendation’ for an antidepressant medication removed from prescription highlights their compromised immediacy and utility. Such issues suggest key content and process limitations.
Guidelines – especially professional college-backed guidelines – carry an unstated imperative to ensure a high standard, as the stakes are high. If a professional body endorses a muddied and muddled set of recommendations, the body itself risks being harshly judged on its professional standards. Just as any building depends on both its foundations and its safety inspections, quality control questions arise here in regard to both. The Journal also needs to consider its role: did it publish these documents without independent review and, if so, why? Perhaps, as I noted earlier [2], because the ‘Decision rules appear so authoritative, precise and prescriptive (as) to be above challenge’. But force of argument should not distract us from examining the integral argument or logic (the building blocks), and from having independent reviewers address the quality control.
Acknowledgements
I thank Kerrie Eyers, Jo Crawford, Lucy Tully and Yvonne Foy. Preparation was supported by an NHMRC Program Grant (222708) and an Infrastructure Grant from the Centre for Mental Health, NSW Department of Health.
