‘There is no treatment they cannot make equal to placebo’ [1].
Randomized controlled trials – finding their level
Like religion, evidence-based medicine is all around, and when based on valid data, deserves its converts. Such an approach is increasingly being adopted in psychiatry, commonly to produce treatment guidelines. In introducing a series of papers describing the Australian and New Zealand Clinical Practice Guidelines, the authors [2] note that they were ‘systematically developed’, with the highest level of ‘evidence’ (i.e. ‘level I evidence’) involving ‘systematic review of all relevant controlled trials’. In this paper, I will argue that, at least for the mood disorders, randomized controlled trials (RCTs) are no longer producing meaningful clinical results. If such level I evidence is invalid, using RCTs to drive treatment guidelines – whether produced by the RANZCP or other organizations – risks a concatenation of inaccuracies. This review reflects a focus of our Black Dog Institute (an educational, research and clinical facility offering specialist expertise in mood disorders): to address issues of clinical importance and relevance in the management of the mood disorders, to weight observation and common sense as much as theory and science, and to be ‘practical’.
Despite RCTs of treatments for ‘major depression’ having generated the largest evidential database in psychiatry, their intrinsic limitations are not widely appreciated. Most RCTs are designed to determine whether a treatment is ‘efficacious’, safe and tolerated – information required by licensing authorities. Such efficacy data are (at best) of some potential use to clinicians, but quite inappropriate ‘evidence’ for shaping clinical guidelines when their validity is suspect.
The invalidity of RCTs is suggested by a single reality. In clinical practice, most antidepressant and mood stabilizing drugs appear useful and effective for a percentage of patients. If such patients were randomly assigned to receive such medication or a placebo, we would expect their superiority to placebo to be comfortably demonstrated. By contrast, such superiority is not evident within the RCT database. Only one of the two contrasting interpretations (i.e. antidepressant drugs are effective or ineffective) should be true but, as will be detailed, context allows alternative answers. The limitations and distortions that RCTs introduce into our evidence base will therefore be detailed, first by considering three individual studies and then by considering aggregated databases.
The first trial [3] reported an eight-week double-blind, randomized, placebo-controlled trial of Hypericum perforatum (St John's wort) and the selective serotonin reuptake inhibitor (SSRI) sertraline. The original protocol was designed within the National Institutes of Health; the study's scientific advisory board included representatives from the US Food and Drug Administration (FDA) and academic psychiatry, and the trial was undertaken by researchers from major universities and the National Institute of Mental Health. Subjects met DSM-IV criteria for major depression, had a minimum score of 20 on the 17-item Hamilton depression severity measure and a Global Assessment of Functioning score indicative of at least moderate symptom severity and/or moderate impairment – with such criteria also required after a one-week placebo run-in phase. The randomized sample was large (n = 340) with a minimum cell size of 111 subjects. The design was about ‘as good as it gets’ for an efficacy study. In terms of results, neither sertraline nor Hypericum was significantly different from placebo in reducing depression severity (observer-rated Hamilton, patient-rated Beck), reducing disability or in overall improvement. Hamilton depression severity scores reduced by 27%, 28% and 29%, respectively, for those receiving Hypericum, placebo and sertraline.
In the second, and even more recent, study [4], elderly outpatients with major depression were randomly assigned to receive the SSRI sertraline or placebo, with 747 subjects recruited to ensure the power needed to detect a two-point difference on the HAM-D (Hamilton Depression) Scale. At the end of the trial, the difference in HAM-D scores across the two groups was 0.8 points, which Carroll [5] interpreted as ‘clinically trivial’ and achieving ‘statistical significance by virtue of the gargantuan sample size’. Why clinically trivial? The number needed to treat (NNT) statistic quantifies the number of individuals who would need to receive the drug to obtain one response beyond that achieved by placebo. For the SSRI in this study, an NNT of 11 was calculated, compared with an NNT of 3 for earlier antidepressant trials, with Carroll observing that ‘there has been much dumbing down of expectations for antidepressant efficacy in recent years’, as well as Orwellian ‘newspeak’ in reporting clinical trials.
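For readers unfamiliar with the statistic, the arithmetic is worth making explicit: the NNT is the reciprocal of the absolute difference between drug and placebo response rates. The response rates below are illustrative assumptions chosen only to reproduce the two NNT values quoted above, not figures from the cited reports:

\[ \mathrm{NNT} = \frac{1}{p_{\mathrm{drug}} - p_{\mathrm{placebo}}}, \qquad \frac{1}{0.44 - 0.35} \approx 11 \quad \text{versus} \quad \frac{1}{0.65 - 0.32} \approx 3. \]

On such figures, the newer trial would need eleven patients treated with the SSRI to yield one response over placebo, where the earlier trials needed only three.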
A similar story can be put for RCTs of treatments for bipolar disorder. For example, Bowden and colleagues [6] failed to find any outcome differences for bipolar I outpatients treated with lithium, valproate or placebo. In this study, subjects had their psychotropic medications discontinued prior to being randomly allocated to receive drug or placebo. The primary outcome measure was time to a new mood episode, which did not differ significantly across the three groups. The median survival time without any mood episode (based on 4-week measurement intervals) was 40 weeks for valproate, 24 weeks for lithium and 28 weeks for placebo. The authors considered a number of explanations for the good outcome in the placebo group, including possible recruitment of those with ‘mild forms of bipolar disorder’, and observed that drug–placebo differences are known to be greater when those with more severe forms of illness are studied – an issue returned to shortly.
In case such examples are viewed as idiosyncratic, let us turn to aggregated data sets. First, presuming that RCT data presented to the FDA for licensing of antidepressant drugs represent high-level evidence, two recent analyses of such data are salient. In one, Kirsch and colleagues [7] analyzed efficacy data submitted to the FDA for approval of the six ‘most widely prescribed antidepressants’ (citalopram, fluoxetine, nefazodone, paroxetine, sertraline and venlafaxine) approved between 1987 and 1999. The data were derived from 47 randomized placebo-controlled trials. Mean improvement scores were not reported in nine trials, as the study reviewers stated they had not demonstrated any drug effect. For the remaining 38 trials, the mean drug–placebo difference was two points on the Hamilton measure, which allowed the authors to conclude that the effects of antidepressant drugs were ‘very small and of questionable clinical significance’. In the second paper, Khan and Brown [8] examined data submitted to the FDA from 52 pivotal placebo-controlled studies of antidepressants and concluded that in about half of the studies the response to antidepressants was indistinguishable from placebo.
In English law, decisions can be reached on the basis of how the evidence might be interpreted by the ‘man on the Clapham bus’. If such ‘evidence’ (both the overviews and the two recent large individual trials) on antidepressants were presented to the Clapham bus traveller, his interpretation would be that antidepressants are not distinctly superior to placebo therapies or, more worryingly, that they act as placebos. Professionals also draw such conclusions. Thus, Kirsch and colleagues [7] concluded from their review of FDA studies that the ‘pharmacological effects of antidepressants are clinically negligible’. Clinicians may view such conclusions as specious, but the impact on patients and the public is not trivial – and it has been added to by several Australian academics publicly interpreting the Kirsch [7] data as demonstrating that antidepressants are largely placebos. Patients who benefit from an antidepressant feel demeaned when they read in the media that antidepressants are either placebos or act like placebos. Such ‘evidence’ contributes to stigma in the community, perpetuating the view that people with depression are weak-willed and merely need a tonic if they are incapable of pulling up their own socks.
Building on the observations of Carroll [5], the layman might offer a second interpretation – that the newer antidepressants (especially the narrower-action ones) are less efficacious than the older broader-based ones. As noted shortly, this possibility may have some relevance to melancholic depression, but is unlikely to be a substantive contributor to the issue. Irrespective of interpretation, it is unlikely that the man on the Clapham bus would interpret the RCT evidence as putting a strong case for antidepressant drugs for managing depression – yet such evidence is weighted by those drawing up treatment guidelines.
There are additional limitations to RCTs, particularly their failure to demonstrate differences across differing treatments for ‘major depression’. For example, Williams and colleagues [9] analyzed 150 studies involving more than 160 000 subjects with major depression and established that improvement was evident in 54% of those receiving an ‘old’ antidepressant compared to 54% of those receiving a ‘new’ antidepressant. Anderson [10] compared tricyclic antidepressants (TCAs) as representative of the ‘old’ classes and SSRIs as representative of the ‘new’ classes, analyzing 102 randomized controlled studies, and found no difference in their efficacy (effect size = 0.03). Further, meta-analyses quantify similar response rates for drugs and most non-drug treatments for depression. Robinson and Rickels [11] reviewed approximately 60 psychotherapy studies and established only trivial superiority to pharmacotherapy after controlling for the allegiance of the researcher. A meta-analysis [12] of 28 randomized controlled psychotherapy trials found response rates of 50% for cognitive behaviour therapy, 52% for interpersonal psychotherapy and 55% for behaviour therapy. The similar efficacy is unlikely to reflect the limitations of meta-analysis, as it is consistent with many well-conducted individual studies. For instance, an NIMH study [13] found that a TCA, cognitive behaviour therapy, interpersonal psychotherapy and ‘placebo plus clinical management’ all produced similar outcomes. The failure to differentiate the antidepressant drug and the two manualized psychotherapies from ‘placebo plus clinical management’ in that study, together with our recent review of RCTs of cognitive behaviour therapy [14], might invite those on their way to Clapham to also conclude that treatments for depression act non-specifically (perhaps reflecting their putative credibility, logic and rationale, as well as the effects of the therapist) rather than having any specific mode of action.
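To put such figures on a common footing, the effect size quoted is Cohen's d – the difference in group means expressed in pooled standard deviation units:

\[ d = \frac{\bar{x}_{1} - \bar{x}_{2}}{s_{\mathrm{pooled}}} \]

On that metric, Anderson's TCA–SSRI difference of 0.03 is three-hundredths of a standard deviation – effectively zero. And if Hamilton scores have a standard deviation of roughly 8 (an assumption typical of such samples, not a figure from the cited analyses), the two-point drug–placebo difference in the Kirsch data would correspond to a d of about 0.25, conventionally a ‘small’ effect.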
As noted by Holmes [15], such non-differentiation also encourages the dodo bird verdict: ‘Everyone has won, and all must have prizes.’ Such an equipotency inference – that all tested treatments are of equivalent efficacy and not distinctly superior to placebo – contrasts markedly with clinicians' experience and observations. In addition, when we ask clinical psychiatrists (at educational meetings) to quantify the effectiveness of antidepressant classes, their ratings indicate marked gradients across the differing classes. This discordance between RCT-generated ‘efficacy’ data and clinical ‘effectiveness’ or real-world data provides some answers to the conundrum.
How RCT results are confounded
As detailed elsewhere [16–18], there are likely to be several reasons why RCTs are producing meaningless results. First, they are based on a nonsensical model for categorizing the depressive disorders. Rather than distinguishing separate disorders (ideally, phenotypically and aetiologically), the current model essentially views ‘depression’ as a single entity varying principally by severity. As noted, ‘major depression’ is the commonest RCT diagnostic category. In reality, ‘major depression’ is a pseudo-category, effectively homogenizing multiple expressions of depression, with each likely to be associated with quite variable response rates to differing interventions.
Second, the current paradigm is to view treatments (be they antidepressant drugs or psychotherapies) as having ‘universal’ application (i.e. efficacious across all expressions of disorder). It is completely understandable that a pharmaceutical company might want to position its antidepressant drug as a ‘universal treatment’ (as this offers a wide market) rather than as a ‘niche’ treatment. It is then useful to their case to have non-specific diagnostic categories such as ‘major depression’, because any specificity (challenging ‘universal’ application) emerging from disorder pattern–drug utility interactions will be obscured. Arguing by analogy, let us imagine that medicine operated to a model where leg oedema was defined as ‘major dropsy’ or ‘minor dropsy’. Imagine further that aetiology (say, a cardiac or renal cause) was ignored and two quite contrasting treatments (respectively addressing cardiac-induced and renal-induced determinants) were applied. For those who (by chance) received an appropriate treatment for their cardiac-induced oedema, we might expect a high rate of response, but that specificity effect would be diluted within the overall sample, and any treatment specificity across the overall sample would be minimized, if detectable at all. The current paradigm, viewing treatments as non-specific (or universal) and prioritizing a non-specific disorder (major depression) for efficacy studies, must logically be expected to generate non-specific (and nonsensical) results.
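The dilution arithmetic in this dropsy analogy is easily demonstrated. The following minimal simulation sketch (in Python; the subgroup proportion and response probabilities are illustrative assumptions, not figures from any cited trial) shows how a drug holding a 70 percentage-point advantage within its ‘true’ aetiological subgroup retains only a 7-point advantage once the pooled ‘major dropsy’ sample is analyzed:

```python
import random

random.seed(1)

def pooled_dropsy_trial(n=10_000, prop_cardiac=0.10):
    """Simulate a pooled 'major dropsy' RCT in which the drug helps
    only the cardiac-induced subgroup, while the analysis ignores aetiology."""
    drug, placebo = [], []
    for _ in range(n):
        cardiac = random.random() < prop_cardiac   # aetiology: recorded nowhere
        arm = random.choice(("drug", "placebo"))
        if cardiac:
            p = 0.80 if arm == "drug" else 0.10    # assumed: drug works here
        else:
            p = 0.50                               # assumed: placebo-level response
        (drug if arm == "drug" else placebo).append(random.random() < p)
    return sum(drug) / len(drug), sum(placebo) / len(placebo)

drug_rate, placebo_rate = pooled_dropsy_trial()
print(f"pooled drug response:    {drug_rate:.0%}")     # expected ~53%
print(f"pooled placebo response: {placebo_rate:.0%}")  # expected ~46%
# Within the cardiac subgroup the drug-placebo difference is 70 points
# (80% vs 10%); pooled, it shrinks to ~7 points (0.10 x 70).
```

Exactly the same arithmetic applies if ‘cardiac’ is replaced by ‘melancholic’ and the pooled category by ‘major depression’.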
Third, recruitment procedures have led to unrepresentative subjects being assessed in RCTs, particularly antidepressant drug studies. Subjects are extensively screened to ensure that they lack many comorbid conditions (e.g. anxiety and personality disorders, organic disorders, drug and alcohol disorders) or ‘problems’ such as suicidality. Those with the more ‘biological’ depressive disorders (e.g. melancholia, psychotic depression) are commonly excluded, whether formally or informally. Walsh and colleagues [19] noted that it has ‘become commonplace for investigators to advertise the availability of research trials to the public’. Inclusion and exclusion criteria have produced a subject profile that is remote from clinical practice, and so pristine as to invite a new definition of ‘cosmetic psychopharmacology’.
The impact of ‘type’ of disorder on differential non-specific and specific response rates is worthy of emphasis. Placebo response rates of less than 10% have been found in psychotic depression [20,21] and in the order of 5–20% for melancholic depression [22,23]. In the latter studies, the placebo response rates for non-melancholic disorders were in the 50–70% range. If we assume for the moment that 60–70% of melancholic and non-melancholic disorders respond to antidepressant drugs, then the probability of a ‘specific response’ would be very high for melancholic depression and minimal to nonexistent for non-melancholic depression.
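On those stated assumptions (midpoint figures taken from the quoted ranges, so illustrative rather than exact), the implied ‘specific’ drug effect is simply the drug response rate minus the placebo response rate:

\[ \text{melancholic: } 0.65 - 0.10 = 0.55 \;(\mathrm{NNT} \approx 2); \qquad \text{non-melancholic: } 0.65 - 0.60 = 0.05 \;(\mathrm{NNT} \approx 20). \]

A trial recruiting mainly non-melancholic subjects would therefore require a very large sample to separate drug from placebo, however effective the drug might be for a melancholic minority.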
The last point emphasizes the fourth factor eroding the value of clinical trials – the high non-specific ‘responsivity’ of trial subjects. It is natural for humans to develop a depressed mood and even depressive syndromes. For most individuals, such ‘normal’ mood states remit after hours or days, whether spontaneously or in response to support, good advice or improvement in a stressful situation. Patients with ‘clinical depression’ differ from those in the general community because they have a distinctly lower likelihood of experiencing a ‘spontaneous remission’. This can reflect biological factors (e.g. perturbed neurotransmitter functioning), psychological factors (e.g. personality styles such as ‘anxious worrying’) or social factors (e.g. living circumstances that promote ‘learned helplessness’). Subjects in clinical trials are likely to be closer to the general community than to clinical patients in terms of their ‘responsivity’. Existing clinical trial samples are increasingly over-represented by pristine subjects with non-melancholic depressive disorders who have brief episodes, unstable symptomatology and disorders of marginal severity [24], and who are thus disposed to ‘respond’, irrespective of the treatment arm. It is salutary to note one analysis [19], which examined reports of controlled antidepressant trials published between 1981 and 2000: the researchers established that the response rates to placebo and to medication had increased by about 7% per decade.
Fifth, clinical trials remain subject to bias despite the efforts of researchers. For example, incentives to researchers may foster ‘rater bias’, with investigators consciously or unconsciously inflating baseline depression rating scores to boost recruitment. In theory this phenomenon should be negated by placebo control, but it may result in the selection of subjects who are more ‘responsive’ to either active treatment or placebo (and to study protocols, so that most subjects do not start to improve until ‘the gun has been fired’), so reducing the chance of the antidepressant separating from placebo.
Thus, the analysis [8] suggesting that response to antidepressant and placebo was indistinguishable in half the FDA trials is not surprising. By minimizing the number of patients with melancholic depressive disorders in drug trials, ‘true’ responders are minimally represented. Consequently, extrapolation of findings to the management of melancholic depression, and possibly other ‘biological’ expressions of depression, is illogical.
Quo vadis?
If the ‘populations’ taking part in drug trials do not correspond to those seen in clinical practice, why should we have any confidence in the trial data? More importantly, what should we do? Clearly, there is a need to ensure that scientists and clinicians are aware of the limitations to RCTs – both theoretically and, increasingly, in practice. How to go forward? We argue for restoring weighting to clinical observation. This recommendation is immediately problematic, as clinical observation may be valid or invalid – for a whole range of reasons. Nevertheless, an argument can be built.
As background reading, I would encourage all researchers, clinicians and trainees to read Himmelhoch's [1] recent commentary considering the usefulness of clinical case studies. Himmelhoch despairs at the uselessness of the current ‘scientific methodologies’ (a ‘huge geyser of 6–12 week, placebo-controlled, double-blind “instrument” measured’ studies, with ‘ready made prefabricated, so-called methodologically sound research’ proving a ‘boon for both academic funding and academic careers’). He compares the dismal utility of a number of efficacy studies of mood stabilizers with a case study of 14 bipolar patients managed with lithium (and sophistication) for an average of two decades by two clinicians [25]. Himmelhoch notes how ‘Each of the authors’ 14 case studies is an anecdote, but none is anecdotal… anecdotes that break new ground because of the careful clinical observation behind them have far greater clinical sensitivity than can be produced by any clinical instrument’. The issue of ‘clinical sensitivity’ is central.
If we assume that an argument for clinical observation (and ‘clinical world’ studies) has been made, then the next question is how to proceed. We need to move beyond the current weighting to ‘efficacy’ studies and undertake real-world ‘effectiveness’ studies. As defined by Wells [26], efficacy studies examine ‘whether treatments improve outcomes under controlled conditions’, while effectiveness studies evaluate ‘effects of treatments on health outcomes approximating usual care’. Wells argued the need for hybrid designs, and there is much to support such an approach.
Many evince pessimism for such a task, often noting that there is no gold-standard laboratory test for depression or for any of its subtypes. There are, however, multiple options for proceeding despite such concerns, including contrasting ‘top down’ and ‘bottom up’ approaches. The first might involve giving a particular treatment to a large number of subjects and determining the characteristics of ‘true’ responders. The second might involve observing distinguishing ‘patterns’ to depressive conditions (akin to the ‘thick description’ technique used by anthropologists) and then identifying the utility of differing treatments for the differing depressive ‘patterns’. Evaluative studies should occur initially in routine clinical practice, perhaps involving clinical panel studies, which allow large databases to be assembled relatively easily. Progressively and iteratively (if the two approaches are linked), differential results emerge and allow refined hypotheses to be proposed. For example, ‘depression pattern X’ may appear particularly responsive to an SSRI, ‘depression pattern Y’ to a cognitive behavioural approach. Structured randomized controlled trials can then be undertaken for those within each ‘pattern’ (whether comparing an SSRI with a placebo or another therapy) to confirm or refute that hypothesis. Application and development of the model allows a matrix of disorder–treatment specificity to be progressively developed and extended, once again in line with the general medical research ‘model’.
There are data supporting the importance of conceding, and so testing, such a model for the depressive conditions. We have argued for a subtyping model that respects phenotypic pattern and aetiology, and have promulgated a hierarchical model and measurement strategy [27] for distinguishing between psychotic, melancholic and non-melancholic disorders. Application of that model does generate differential treatment efficacy data, and two examples can be noted. For psychotic depression, two meta-analyses [28,29] have quantified a response rate of approximately 25% to an antidepressant drug alone and 33% to an antipsychotic drug alone, as against 80% to their combination or to ECT. For melancholic depression, we have shown [30] that the likelihood of responding to an SSRI or to a TCA is roughly comparable for those younger than 40 years, but that responses to the differing classes diverge as age increases (i.e. 2:1 in favour of the TCA for those aged 40–60 years; 4:1 for those older than 60 years). Similarly, Joyce and colleagues [31] established that melancholic depressed patients older than 40 had a distinctly superior response to the tricyclic nortriptyline (67%) than to the SSRI fluoxetine (38%). This differential antidepressant class effect is interpreted as a reflection of the underlying aetiology, which is expressed in the phenotypic melancholic pattern. Thus, we [30] suggested that as those with melancholia age, they recruit more monoaminergic system perturbations (e.g. noradrenergic as well as serotonergic), evince greater psychomotor disturbance (‘surface signal’) and, as a consequence, are less likely to respond to a serotonergic drug alone. For non-melancholic disorders, we have proposed a ‘spectrum model’ [27], linking underlying personality with the surface phenotypic pattern. Here, for example, those with an internalizing anxious worrying personality style tend to show a phenotypic ‘anxious depression’ pattern, those with an externalizing anxiety style an ‘irritable depression’, and those with a volatile Cluster B personality style are more likely to develop a ‘hostile depression’ – with only those in the first two groups showing a superior response to SSRI medication, presumably as a consequence of aetiological anxiety-based ‘emotional dysregulation’ being muted or modulated. Whether valid explanations or not, such examples demonstrate how pattern analysis can provide aetiological and treatment-specificity information, and thus allow some understanding of possible determinants of differential antidepressant treatment responses.
As concluded previously [17], there is no dearth of clinical trial studies. In fact, quite the opposite. However, by recruiting and studying subjects who fail to correspond with real patients, by testing treatments as having universal application and by using ‘major depression’ as the diagnostic category, the derived data are no longer credible or meaningful. No one gains from a strategy that derives trivial to non-existent differences between active drugs and placebo and that, when data are aggregated and analyzed, fails to differentiate one treatment from another. Such information is specious, while common sense is compromised.
Let us return to the RANZCP guidelines for the treatment of depression [32]. Here, several (but not all) antidepressants and two psychotherapies (CBT and IPT) are recommended as first-line therapies for depression and rated as supported by level I evidence. Other recommendations are also colour-coded to reflect their underpinning level of evidence (from level II to level V). Decision rules appear so authoritative, precise and prescriptive as to be above challenge. But poor science is poor science. In describing the general process, the authors [2] state that one of the ‘quality features’ of the RANZCP guidelines is ‘systematic review’, involving comprehensive review of ‘randomized controlled trials of predefined quality’, summarized ‘through meta analysis’, which, together with other evidence and expert opinions, is ‘critical for the formulation of clinical recommendations but also to allow the evidence to “speak for itself”’. This could allow a process where whatever level I evidence is poured into the top of the funnel comes out the spout untrammelled and preserved, or one where the guideline teams might judge – as I do here – that the level I evidence is a nonsense, and quietly insert their views and those of other experts to produce guidelines. It would be of interest if the authors of the RANZCP depression guidelines [32] were to detail which option they selected.
Most pharmaceutical companies are aware of the dilemmas noted above, particularly the limitations to RCTs – both their constrained designs and their inability to determine the true effectiveness of drugs. However, if they wish to bring a drug to market, they have no choice but to proceed under the current house rules imposed by the licensing authorities. Concern is also emerging from other academics. For instance, Meltzer [33] recently reviewed a large number of recent study findings (e.g. a ‘remarkable’ announcement that a new NK1 inhibitor was no better than placebo as an antidepressant; an established atypical antipsychotic was no better than an old typical antipsychotic drug; a retrospective review indicating that lithium was three times more effective than sodium valproate in preventing suicide in bipolar disorder), as well as adverse media coverage in relation to neuropsychopharmacology. Meltzer suggested that ‘in aggregate, our field would appear to have a lot of problems in its scientific basis, ethics, and independence from the pharmaceutical industry’, and concluded that ‘much of the negative news represents the result of not taking all available information into account and a lack of understanding of the importance of specific features of the illnesses in question’.
The mole is recognized as the emblem of blindness, and it was a molehill that caused William III to fall from his horse. If we do not recognize the current limitations to the evidence base for managing mood disorders, we will continue to turn molehills into mountains – mountains of data but lacking intrinsic substance.
Acknowledgements
I thank Kerrie Eyers. This report is supported by an NHMRC Program Grant (222708) and an Infrastructure Grant from the Centre of Mental Health, NSW Department of Health.
