Abstract
The results of a randomized controlled trial (RCT) show what happens to patients who are treated in different ways. In the present study we address the issue of deciding whether trial results are ‘real’ – that is, applicable in clinical practice.
Clinical uncertainty
Evidence-based medicine has been described as the ‘conscientious, explicit and judicious use of the current best evidence in making decisions about the care of individual patients’ [1]. A hierarchy of evidence, ranging from meta-analysis of RCTs at the top to expert consensus at the bottom, has been agreed upon. From this perspective, all clinical treatments should preferably be based on data generated by RCTs and re-examined by meta-analysis. Much psychiatric practice is not based on this type of evidence; there are a number of reasons for this.
First, RCTs are unlikely to influence practice unless they address an area of perceived uncertainty. If there is a consensus that a treatment works then few can see the purpose of a trial. This is reasonable, provided that the consensus is based on good evidence [2]. For example, few would argue with the effects of penicillin on infection, or hesitate to use it if necessary, despite the fact that it has never been subjected to anything we would recognize as an RCT. No treatment in psychiatry has this level of certainty.
Second, RCTs are often driven by commercial interest, such as drug company-sponsored studies comparing two medications or an attempt by a small group to establish the efficacy of a treatment. This leads to a focus on discrete choices, such as treatment modality or intensity, rather than practical decisions, such as when to use physical restraint or to hospitalize a patient [3].
Third, psychiatry has a small database of RCTs. For many conditions (e.g., anorexia nervosa, somatoform disorder and most personality disorders) insufficient data are available to guide treatment. For other disorders, despite research, no consensus has been reached. Problems such as conduct disorder and repeated self-harm have been studied extensively but no clear conclusions about effective treatment have emerged.
Fourth, many clinicians are content to practise without reference to research that might refine or even change their treatments. In some cases this is ignorance, in others a reaction to a perceived threat to their autonomy. However, for many there are concerns that RCTs do not answer ‘real life’ questions in ‘real life’ clinical situations. This lack of ‘external validity’ is accepted by researchers as a significant problem.
Internal and external validity
The internal validity of a study is the extent to which the design and conduct of the trial eliminate bias. Efficacy studies emphasize the importance of being able to draw a causal inference between treatment provided and outcome [4]. Details of this were discussed in Part I. Internal validity in a properly conducted RCT is strong but in attempting to eliminate bias, trials are often set up in a way that limits generalizability [2]. Clinicians are interested in the effects of treatments delivered in clinical practice, outside rigorous trials. This has been called external validity or effectiveness (see [2, 3] for a general discussion).
Threats to internal validity
Random allocation of patients to treatments should eliminate both confounding and most sources of bias. Threats to internal validity are those that increase the possibility of bias. Psychiatric RCTs have five problems. First, the studies tend to have small numbers. Second, because nosology is subjective, heterogeneous samples are common. Third, many syndromes have a high spontaneous recovery rate. Fourth, studies incorporate a wide range of outcome measures. Fifth, problems arise in using RCTs to compare psychotherapies. These will be discussed in order.
Sample size
A typical psychiatric RCT has 20–100 subjects and even in so-called large studies numbers are modest. The influential NIMH Treatment of Depression Collaborative Research Program, for example, had 250 subjects at baseline [5]. In contrast, the First International Study of Infarct Survival Collaborative Group enrolled 16 027 subjects [6]. Small samples are more likely to have initial group differences (despite randomization) and differential dropout rates. Statistical techniques can be used to control for differences arising at assignment or from dropouts, but small samples have adequate power only to detect moderate to large treatment differences, and are unlikely to detect differences either in rare events, such as a suicide attempt, or in factors that have high variability [3].
Small samples also make statistical (hypothesis testing) errors more likely. There are two types. A type I error occurs when results are statistically significant but the null hypothesis of no difference between groups is in fact true. The second, a type II error, is the erroneous conclusion that the null hypothesis is true when the two groups do differ. It is important to minimize both, but many studies have not been designed with this in mind. Thase suggests that at least one-third of published RCTs testing the effectiveness of antidepressants have equivocal results, largely due to inadequate sample size. In a systematic review, Hotopf et al. [7] reported that nearly half of 122 RCTs comparing tricyclic antidepressants (TCAs) and selective serotonin re-uptake inhibitors (SSRIs) had less than an 80% chance of detecting a difference as great as that between a 60% and a 100% remission rate, although most concluded that there was no difference in efficacy between the treatments. Given such low statistical power, it is unlikely that they would have found a difference in outcome even if one existed.
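The effect of sample size on power can be made concrete with a quick calculation. The sketch below (Python) uses the standard normal approximation for comparing two response rates; the rates (50% vs 60%) and sample sizes are hypothetical, chosen only to illustrate how small trials fail to detect modest but clinically relevant differences.

```python
from math import sqrt, erf

def norm_cdf(x):
    # Standard normal cumulative distribution, via the error function
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def power_two_proportions(p1, p2, n_per_arm):
    """Approximate power of a two-sided test (alpha = 0.05) comparing
    two response rates, with n_per_arm subjects in each group."""
    z_alpha = 1.959964  # two-sided 5% critical value of the normal
    se = sqrt(p1 * (1 - p1) / n_per_arm + p2 * (1 - p2) / n_per_arm)
    return norm_cdf(abs(p1 - p2) / se - z_alpha)

# A 'typical' 50-per-arm psychiatric trial: power well below 80%
print(power_two_proportions(0.5, 0.6, 50))
# The same comparison with 500 per arm: adequately powered
print(power_two_proportions(0.5, 0.6, 500))
```

With 50 subjects per arm the chance of detecting a 10-percentage-point difference in response rates is under 20%; roughly 500 per arm are needed to reach the conventional 80% power.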
Heterogeneity
Diagnoses we assign to our patients are little more than theoretical constructs [8]. This creates opportunities for heterogeneity even among groups of patients with the same diagnosis [9]. To reduce this problem, inclusion/exclusion criteria are applied, or sample matching is attempted. However, we may have little idea of what to include or exclude, or which patient features are important to match. Randomization helps create two or more groups that are similar, but such groups may be different from other samples with the same diagnosis. One illustration is the substantial variation of response rates within both placebo and treatment groups. For example, when antidepressants were compared to placebos for patients with dysthymia, 13 studies produced response rates between 0% and 72% for the placebo group and between 18% and 78% for the treatment group [10]. These and similar results partially reflect poor design, differing definitions of outcome, or inadequate sample size, but may also arise from differences between studies in patient characteristics, including severity and comorbidity [10].
Placebo control
Some patients who ‘respond’ to a treatment are responding to the specific intervention, but others respond to non-specific aspects of care, or improve spontaneously with the natural course of illness. In an illness such as major depression, up to half of those who respond while taking a medication may improve due to placebo factors [11]. This placebo response rate in depression RCTs varies but is growing [12]. In other disorders (e.g. obsessive–compulsive disorder and schizophrenia), the placebo response is much smaller [13].
The use of a placebo arm in a RCT raises ethical, clinical and methodological issues. Ethically, it may be argued that placebo is only appropriate when no established effective treatment is available. On the other hand, the dangers of marketing ineffective drugs or psychotherapies that carry risks of side-effects may outweigh the modest risk associated with placebo treatment in randomized, placebo-controlled trials. A placebo group is essential to establish that a treatment is effective for any condition with a substantial and/or variable placebo response.
Placebo-controlled trials may be preferable methodologically but may also influence the trial. For example, individuals likely to participate in a placebo-controlled trial have less severe, less chronic and less disabling disease [14, 15]. Therefore, they may be the very people who are apt to respond to placebo [11].
Outcome measures
Differing definitions of what constitutes a response to treatment are a problem in psychiatric trials, where clear outcomes, such as death or highly reliable physiological changes, are rarely suitable. A clever study by Tedlow et al. [16] highlights this issue. They showed that the relationship between severity of depression and response to antidepressants depended almost entirely on the definition of response used. If response was defined as the Hamilton Depression Rating Scale (HDRS)-17 score at the end of the trial, then baseline depression severity related to poor outcome (r = 0.41, p < 0.0001). If response was defined as the change in HDRS-17 score from baseline to endpoint, there was no relationship (r = 0.07, p = 0.26). If response was defined as percentage reduction in HDRS-17 score from baseline, there was a positive effect: baseline depression severity predicted a moderately better response (r = 0.015, p = 0.02). These differences in definition may explain the variability in response rates discussed earlier.
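The point can be reproduced qualitatively with simulated data. The sketch below (Python; the data are simulated, not Tedlow's) assumes that each patient's percentage improvement is drawn independently of baseline severity. Under that assumption, the endpoint-score definition of response correlates with baseline HDRS-17 (more severe patients finish with higher scores), while the percentage-reduction definition shows essentially no relationship.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

random.seed(1)
# Simulated trial: baseline HDRS-17 scores, and a percentage improvement
# drawn independently of severity (an assumption, for illustration only)
baseline = [random.uniform(14, 30) for _ in range(5000)]
pct_improvement = [random.random() for _ in range(5000)]
endpoint = [b * (1 - u) for b, u in zip(baseline, pct_improvement)]

# 'Response' defined as endpoint score: baseline severity predicts outcome
print(pearson(baseline, endpoint))
# 'Response' defined as percentage reduction: no relationship with baseline
print(pearson(baseline, pct_improvement))
```

The same simulated patients thus do or do not show a severity–outcome relationship depending purely on which definition of response is applied.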
A further problem is that most RCTs are of short duration (e.g. weeks) whereas clinicians are interested in longer-term outcome. Many disorders are recurrent or chronic, but trials usually last between 6 and 12 weeks and can only address short-term response (an improvement of a predetermined magnitude, e.g. a 50% reduction in symptom score) or remission (when a patient is considered asymptomatic, e.g. an HDRS score < 8). They do not deal with the clinical outcome of recovery (a sustained remission). This would not be a significant problem if short-term response were strongly predictive of recovery. However, the relationship between initial response and recovery is not consistent [17, 18]. In addition, the ability of a treatment to produce an initial response and its sustained efficacy may differ. For example, a meta-analysis of cognitive behaviour therapy for depression has shown that while half the patients who complete treatment experience initial improvement, sustained efficacy over 12–24 months occurs in only 25–30% [19].
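These distinctions can be stated precisely. The helper below is a hypothetical sketch: the cut-offs mirror the conventions quoted above (response as a 50% reduction in symptom score, remission as an HDRS score < 8), and the 'sustained' window for recovery is an illustrative assumption, since trials rarely follow patients long enough to apply one.

```python
def classify_outcome(baseline, weekly_scores, remission_cutoff=8,
                     response_fraction=0.5, sustained_weeks=2):
    """Classify a patient's trial outcome from serial HDRS-17 scores.

    Cut-offs follow the conventions quoted in the text; the
    sustained_weeks window for 'recovery' is an illustrative assumption.
    """
    final = weekly_scores[-1]
    # Response: predetermined magnitude of improvement (50% reduction)
    response = final <= baseline * (1 - response_fraction)
    # Remission: considered asymptomatic at endpoint (HDRS < 8)
    remission = final < remission_cutoff
    # Recovery: remission sustained over the final observations
    recovery = (len(weekly_scores) >= sustained_weeks and
                all(s < remission_cutoff
                    for s in weekly_scores[-sustained_weeks:]))
    return {"response": response, "remission": remission,
            "recovery": recovery}
```

A patient whose score falls from 24 to 10 has ‘responded’ but is neither remitted nor recovered; the three outcome labels diverge for the same series of scores.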
Randomized controlled trials of psychotherapy
Randomized controlled trials have particular problems when comparing psychotherapies, of which the most serious is that the interventions do not represent separate treatments and therefore violate the premise of controlled trials. A study by Ablon and Jones [4] generated prototypes of ideal regimes of interpersonal psychotherapy (IPT) and cognitive behaviour therapy (CBT), which were then used by judges to determine the extent to which psychotherapies transcribed from the NIMH Treatment of Depression Collaborative Research Program conformed to the prototype. They concluded that it may not be possible to control sufficiently how treatments are conducted, leading to poor external validity of psychotherapy RCTs. Similar arguments have generated much debate, and it may be prudent to approach with caution studies claiming that one psychotherapy is superior to another [20, 21]. Studies comparing psychotherapy to no treatment, or psychotherapy to a medication, do not necessarily share this problem.
Threats to external validity
External validity is defined as the extent to which the results of a trial produce a correct basis for generalization to other circumstances. There are several threats to such validity.
Sample
Randomized controlled trials often recruit different subjects from those who present to clinical services, because of convenience sampling, advertising and highly specialized services. Exclusion criteria, usually set to create a more homogeneous sample, further reduce the comparability of the sample to those in a clinical service. To combat this, Westen and Morrison [19] have recently made three recommendations. First, patients with co-occurring disorders should not be excluded when these comorbidities are common or affect treatment response and prognosis. For example, in major depression comorbid Axis I and Axis II conditions are present in most patients and may impact on response to treatment.
Second, patients with subsyndromal conditions should be included, because in practice many patients presenting for treatment have such pathology (up to 50% [22]). If these patients differ from syndromal patients then our estimates of efficacy may be inaccurate. It is also possible that some of these patients may respond differently. For example, patients with atypical depression respond preferentially to phenelzine [23] and possibly to SSRIs compared with TCAs. Excluding patients who have HDRS scores of less than 18, as most RCTs do, will exclude many patients with subsyndromal depression (because the HDRS does not score many atypical symptoms), thereby obscuring this important finding [Joyce PR et al. submitted for publication].
Third, studies need to state the proportion of patients who were screened, recruited and completed the RCT. Reasons for reduced numbers need to be stated: was it comorbidity, failure to meet entry criteria, or inability to tolerate treatment or side-effects? This would help clinicians to compare meaningfully the RCT sample with their own practice.
Treatment
Randomized controlled trials have limited usefulness if treatments are not feasible or their relationship to ‘usual care’ is unknown [3]. In psychotherapy, for example, RCTs typically rely on manual-based protocols, and it is these manualized therapies whose efficacy is supported for some disorders such as depression and anxiety. However, the relationship of these therapies to general practice is unclear, since in reality most clinicians do not have the level of training and supervision provided to research therapists. They may also introduce different modes of therapy if the situation warrants it, something that RCTs claim to avoid. Therefore, studies are needed to evaluate the effectiveness of currently practised ‘eclectic’ psychotherapies against the manualized therapies usually evaluated in RCTs.
Studies of psychotropic medications face other problems. A particular one is that adherence to medication may be higher in RCTs. Patients may be more willing to continue to take medication while they are in a study because they are being monitored and the trial is for a limited time.
Analysis of treatment success
An important question is how to define treatment success. Is this a clinically significant improvement (e.g. reduction in symptoms by 50%) or remission (absence of significant symptoms)? Which measure is more important in practice is influenced by the type of patient and the treatment setting. For example, in an inpatient with schizophrenia a reduction in psychotic symptoms by 30% might be considered worthwhile, while the complete absence of attacks in a patient with panic disorder, or of purging in a patient with bulimia, might be seen as the appropriate measure of treatment success in an outpatient clinic. Treatment for depression may produce a reduction in symptoms, but if the average patient continues to suffer substantial albeit diminished symptoms, then to conclude that it is effective may not be justified, especially since residual symptoms predict relapse [19]. Furthermore, relating symptomatic improvement to functional gains is rarely addressed [24].
Discussion
The first modern clinical trial was the Medical Research Council trial of streptomycin led by Bradford Hill in 1948 [25]. For the first time, inclusion and exclusion criteria were used, patients were randomly allocated to treatment or placebo and endpoints clearly defined (in this case death or improvement in chest X-rays). The fact that streptomycin was clearly effective led to a change in practice and public demand that the drug be more freely available.
It took many years for the medical profession to accept that treatment should be based on RCTs. It has taken much less time for doctors to realize that such trials also have limitations [26]. The results of a trial describe what has happened to the patients included. Statistical computations give us an idea of the probability that the same results would be obtained in another group of patients with the same inclusion/exclusion criteria. The utility of this probability depends on the strength of the treatment effect, the similarity of the patients to our own, and how similar the treatment used is to what we can provide. We do not advocate the strict application of evidence-based medicine in a mechanical way. Patients need individually tailored treatment. Randomized controlled trials are the best way to tell us what treatments are effective but not necessarily which patients should receive them and when they should do so.
Randomized controlled trials remain superior to any other strategy yet designed to test which treatments should and should not be used. A RCT is like democracy: problematic, but with intrinsic advantages. To describe it as the ‘gold standard’ is an exaggeration, yet it remains the most robust approach to examining treatment effectiveness. If RCTs are to help in deciding what interventions to use, they must address relevant questions and be carried out, as far as possible, under the conditions in which the treatments are likely to be used. The efficacy of a treatment is no more than a measure of its potential under standardized and optimal conditions [27]. We need large trials in real clinical settings, with few exclusion criteria, simple outcome measures and long-term follow-up.
