Abstract
This is the first of two articles that aim to introduce readers to randomized controlled trials in psychiatry. Part 1 covers basic methodology and critical appraisal; Part 2 examines in more detail the clinical relevance of trials and methodological issues that may detract from their usefulness.
The evaluation of clinical trials has been the subject of much discussion in recent years and a general scheme, now widely used, has been developed [1]. Proformas for the evaluation of reports of clinical trials are available and readily downloaded from the web (http://minerva.minervation.com/cebm/documents/worksheets.pdf). We will follow a similar approach, modified from Guyatt [1]. In doing so we have paid attention to issues that are complex and contentious (see Table 1 for our modified scheme). In discussing research methods it is easier to refer to a particular example. We have therefore selected a clinical question and will discuss the methodology that might be used to answer this question. We will also discuss a study that has attempted to address this question [2]; we encourage readers to obtain a copy and to assess it in conjunction with this article. Further examples of structured critical appraisal of RCTs can be found in Warner [3] and Lawrie [4].
Scheme for evaluation of articles (modified from Guyatt et al. 1993)
Why a randomized controlled trial?
As the name suggests, randomized controlled trials (RCTs) have two important features designed to provide objective evidence regarding clinical practice. These key features are randomization and the use of a control treatment.
1 While a treatment may be associated with improvement in a patient's condition, this may be due to many factors including spontaneous remission. Without comparing the ‘investigational’ treatment with a ‘control’ treatment we cannot assess its real efficacy.
2 Unless a comparison between an investigational treatment and control treatment is made in a group of patients truly ‘randomized’ to one treatment or the other, the biases of clinicians are likely to dictate which treatments are received by which patients. More severely depressed patients might, for instance, be given tricyclic antidepressants (TCAs) and less severely depressed, specific serotonin reuptake inhibitors (SSRIs). This may bias the results of a comparison between these treatments.
Without such a design it is difficult to draw conclusions from any trial. However, there are many other features of the design of RCTs that may affect their validity and these are addressed below.
Definition of the ‘problem’
The crux of the clinical trial methodology is to define the question. In everyday practice many questions arise for each patient. Many have never been addressed in trials and those that have, have sometimes been addressed in a way that does not help to guide practice. This problem will be discussed further in Part 2. We will take as an example treatment of severe depression in hospital.
Scenario
A 50-year-old man is admitted with an episode of major depression with a Montgomery Asberg Depression Rating Scale [5] (MADRS) score of 35. There are DSM-IV melancholic but no psychotic features. He has had two previous episodes of major depression, which were not as severe and did not require hospitalization. You can elicit no history suggestive of bipolar disorder. His daughter's wedding is in 3 weeks and he is desperate to be well enough to attend.
A literature search reveals a study of rapidly escalating venlafaxine in the treatment of inpatient depression [2]. You decide to evaluate this report using the headings below.
What are the treatments assessed and are treatment details adequately defined?
Investigational treatment
The choice of investigational treatment is usually driven by commercial interest, by a clinical idea that a particular treatment may have benefits or by a perceived need to determine which option is the best among treatments which, on the basis of existing evidence, appear equivalent. It is important that the treatment is well defined so that clinicians can be confident they are administering the same treatment if the results of the trial persuade them that it is appropriate to do so. This is clearly a major issue in psychotherapy trials (see Part 2). Equally, however, in drug trials, the dosing schedule requires careful consideration and should be adequately specified and reported at the end of the trial.
Comparator treatment
Once an investigational treatment is decided on, the next step is to examine the suitability of the comparator. Its choice is one of the most difficult and most criticised aspects of trial design.
New agents are usually compared with placebo (a pharmacologically inactive compound). This is because, particularly in conditions such as major depressive disorder, there is a high placebo response rate that varies considerably depending on chronicity, severity and subtype of depression (see Schatzberg [6] and Quitkin [7] for an in-depth discussion). While it has been suggested that a placebo ‘run in’ period can help to control for this, evidence is inconsistent [8].
In some situations however, it is argued that placebo treatment is ethically inappropriate (see Miller [9] for an interesting review of ethical issues of placebo controlled trials). In this case, the best comparison would be between a proposed treatment and a currently accepted optimal treatment. In our scenario, what would be the most appropriate comparator? It could be argued that placebo treatment for severe hospitalized depression is unethical and that the information which would be useful in developing practice is whether a proposed treatment is better than current usual treatment. Perhaps the trial should compare venlafaxine with the currently accepted optimal treatment of inpatient depression. A recent meta-analysis in hospitalized depression suggested that amitriptyline had a significant advantage over SSRIs [10]. This might then be a suitable agent to use as a comparator. However, it could be argued that because of a relatively high side-effect burden amitriptyline may not be the most commonly used antidepressant in the treatment of inpatient depression and may not, therefore, be as clinically relevant as a comparison with an SSRI. The study of Benkert et al. [2] uses rapidly escalating imipramine as a comparator. While few psychiatrists would argue against imipramine as a comparator, the exact schedule of treatment does need to be examined more closely.
Detailed evaluation of the treatments delivered
The next issue is the way in which the treatments were delivered. For instance, in drug trials a dosage schedule must be specified. This can be left up to the clinician or predetermined by the investigators. Pre-set schedules give a clearer indication of the exact treatment used. Clinician titration is probably closer to clinical practice and may reduce dropouts since sensitive patients will be maintained at lower doses and dosage increased more slowly. In this situation, average dose received should be reported.
The dose of an active comparator can become the subject of debate. Benkert et al. [2] compared rapid titration to high dose venlafaxine with rapid titration to high dose imipramine. While, as discussed, imipramine is probably a reasonable comparator, rapid titration is rarely used and there is little evidence for its use. Therefore, it could be argued that a more usual protocol would be a more meaningful comparison. This study also illustrates the importance of considering the details of the treatments in appraising trials. Although upward titration of dose was roughly equivalent in both groups, the venlafaxine dose was reduced after two weeks while the imipramine dose remained high. This issue may have influenced the results as we will discuss later.
A further debate on the importance of comparator selection and dosing can be found in Healy [11] and McKenna [12], who debate the issue of an appropriate comparison treatment in trials of clozapine in treatment-resistant schizophrenia.
The exact details of treatment and comparator are also important in trials of psychotherapeutic interventions. Since different types of psychotherapy may give rise to different results, it is important to define the exact form of therapy to be given, to ensure adherence of therapists to that therapy and to maintain its quality. In reporting the trial, it is important to describe the therapy in such a way that it can be replicated if it proves to be clinically useful.
Apart from the proposed treatments (investigational vs comparator) it is important to anticipate the possibility that additional, less specific treatments will be needed during a trial and that these will vary between groups. An obvious example is that in trials for mania, if a particular agent is not effective, certain ‘rescue’ medications (e.g. benzodiazepines) may be used. This may occur to a greater extent in one group compared with another and could obscure a difference in outcome. This may therefore need to be considered as an actual treatment outcome. The use of additional treatments should also be predetermined and included in the protocol so that, for instance, different centres or investigators do not use different ‘rescue medications’, making this a difficult factor to analyse.
Randomization: was assignment of patients to each treatment truly random?
The most important feature of the RCT is random allocation of patients. Practically, randomization is done by allocating each patient a number from a predetermined, computer-generated code. This then determines which treatment the patient will receive. There must be no way in which a clinician can influence which number is assigned, and if the trial is truly double-blind, the clinician should not know which numbers correspond to which treatment. It is otherwise possible that clinicians who have a pre-existing belief about a treatment may attempt to obtain that treatment for particular patients. A RCT report should contain an unequivocal statement regarding randomization, as does the study by Benkert et al. [2].
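As a concrete illustration, the predetermined allocation code described above might be generated along the following lines. This is a minimal sketch only: the function name, treatment labels and block size are our own, and block randomization (shuffling within blocks of four) is one common way of keeping the arms balanced in size.

```python
import random

def allocation_sequence(n_patients, treatments=("venlafaxine", "imipramine"), seed=None):
    """Generate the full allocation code before recruitment starts.
    Block randomization (blocks of 4) keeps the arms balanced in size."""
    rng = random.Random(seed)
    sequence = []
    while len(sequence) < n_patients:
        block = list(treatments) * 2  # two slots per arm per block
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_patients]

# Patient k simply receives sequence[k]; in a double-blind trial the
# clinicians never see this code, only coded medication packs.
sequence = allocation_sequence(8, seed=1)
```

Because each block contains the same number of slots per arm, neither clinicians nor patients can influence or predict an individual assignment, yet group sizes cannot drift far apart.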
Were the treatment groups comparable?
It is critical that any differences in outcome between groups can be ascribed only to the treatment effect and not to the confounding effect of prognostic variables that may differ between groups. Confounders may be described as variables that distort the relationship between treatment and outcome. Large randomized studies will, in most cases, produce by chance treatment groups that are comparable in all important respects. In smaller studies, it is possible that an important prognostic variable will differ between groups. There are two ways of overcoming this problem. First, a technique called stratification can be used. In this system, important prognostic groups (e.g. males and females) are randomized separately, ensuring that the same numbers of each are allocated to each treatment. It is also common to stratify by study centre so that each has equivalent numbers of patients in each group. The second method is to use statistical techniques, such as analysis of covariance, which ‘correct’ for any differences in prognostic variables between groups before comparing outcomes. Reports should contain a table (Table 2 in Benkert et al. [2]) with summaries of key prognostic variables for each treatment group. It is important not only to examine this for variables that differ between groups (these may be brought to the reader's attention by significance values in the table) but also to consider whether other variables should have been included. For instance, in Benkert et al. [2], it would be useful to know whether patients suffering from depression with psychotic features were represented more in one group than another.
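Stratified randomization can be sketched in a few lines: each stratum draws from its own shuffled block of treatments, so the arms stay balanced within every stratum. The patient identifiers, strata and treatment labels below are purely illustrative.

```python
import random

def stratified_allocation(patients, treatments=("A", "B"), seed=0):
    """Randomize each stratum separately: every stratum consumes its own
    shuffled block of treatments, keeping arms balanced within strata.
    `patients` is a list of (patient_id, stratum) pairs."""
    rng = random.Random(seed)
    blocks = {}       # stratum -> remaining slots in its current block
    assignment = {}
    for patient_id, stratum in patients:
        if not blocks.get(stratum):      # start a fresh block for this stratum
            block = list(treatments)
            rng.shuffle(block)
            blocks[stratum] = block
        assignment[patient_id] = blocks[stratum].pop()
    return assignment

patients = [(1, "male"), (2, "female"), (3, "male"), (4, "female")]
assignment = stratified_allocation(patients)
```

With a block size equal to the number of arms, every pair of patients within a stratum is split evenly between the treatments, which is exactly the balance stratification is meant to guarantee.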
Summary of evaluation of Benkert et al. 1996
Was the study double-blind?
Double-blind means that neither patients, clinicians nor raters of important variables know which treatment group the patients are in. It is not clear that all studies described as double-blind are truly so. There is an ethical obligation and a clinical necessity to explain to patients likely side-effects and while this does not necessarily mean that patients will know which side-effects apply to which medication, they can look this up. Likewise, by inquiring about side-effects, researchers may guess what treatment a patient is taking. One partial solution is to use ‘blind’ raters to assess outcome. It is probably also useful to assess the degree to which the study was double-blind simply by asking patients and researchers to guess the assignment of each patient.
In appraising a RCT it is useful to consider whether the trial is likely to have been blind and what, if any, the implications of researchers and patients knowing their assignment might have been. One might expect the ‘placebo’ effect of a new and therefore promising treatment to be greater than that of an older treatment being used as a comparator or of placebo itself (i.e. an expectation effect). It is likely that patients and clinicians could have distinguished between the side-effects of venlafaxine and imipramine and that the Benkert et al. [2] trial was not in fact truly double-blind. A possible effect is to have made dropout greater among patients who perceived themselves to be on an ‘old’ treatment and an increased placebo effect in those knowing themselves to be on the ‘new’ treatment. In fact, significantly more patients did drop out of the imipramine group as a result of ‘patient request’. This could bias results. While it is not specified, it appears that clinicians and not independent ‘blind’ raters performed ratings. It is also possible that raters could be biased by their expectations of the efficacy of either treatment.
The primary outcome measure: is it defined and is it clinically relevant?
For any trial, a predetermined primary outcome measure should be chosen. Many published trials have included several measures and report positive effects on only one of these. However, including multiple measures makes it likely that at least one of them will show a difference by chance.
It is not an easy task to decide on a single outcome measure. In many trials of antidepressant treatment it has been total score or change from baseline on the Hamilton Depression Rating Scale (HAMD) [13] at 6 weeks. In our scenario, where speed of response is important, one option is to define primary outcome as the score, on a predetermined scale (e.g. HAMD), at a specified point in the treatment course (e.g. 3 weeks). Another option is to use clinically relevant, objectively defined outcome measures. In trials of cancer treatment, for example, death or survival is clinically relevant and easily measured. It is more difficult in psychiatry to find such measures. However, researchers are increasingly advocating the use of simple objective outcomes to make trials large enough (and therefore simple enough to be carried out at many sites) to show relatively small differences. This is the philosophy of the important BALANCE trial of prophylaxis in bipolar affective disorder [14] underway in the UK, in which a primary measure is admission to hospital.
Benkert et al. [2] cite four primary outcome variables (times to response and to sustained response on the HAMD and MADRS scales). A sustained response was defined as one that occurred by week 2 and persisted to the end of the study and included at least 39 days of double-blind therapy. This is not a commonly used measure. It is possible that reducing the dose of venlafaxine but not imipramine increased dropouts at this stage in the imipramine group. This may have contributed to the greater ‘sustained response’ with venlafaxine. Although the difference was not statistically significant (25% vs 38%), there were more dropouts in the imipramine group.
Other outcome measures: are all clinically important outcomes assessed and reported?
Since all treatments potentially have adverse effects, it is important to anticipate, measure and report them accurately. In pharmacological trials, side-effects are important and a treatment that increases the speed of recovery from depression but gives rise to enduring side-effects may not be deemed useful. This might apply to ECT that may lead to a rapid recovery for an inpatient with depression. If we were to conduct a trial of high dose venlafaxine versus ECT, cognitive side-effects should obviously be measured.
Were all patients who entered the trial properly accounted for and analysed in the groups to which they were randomized?
Patients are generally assessed on a range of inclusion and exclusion criteria and, if suitable, asked to consent to involvement in a trial. Further assessments may take place at which patients may still be excluded. At a predetermined point, patients are then randomized and deemed to have entered the trial. Results should be analysed by ‘intention to treat’ analysis. In this method, the last appropriate assessment is included in the analysis even if the patient drops out, a procedure known as last observation carried forward (LOCF). Treatment may result in a number of discontinuations secondary to side-effects or perceived lack of effect. If only patients completing treatment were analysed, this would give an inflated estimate of clinical efficacy. Analysis using intention to treat includes the observation taken before dropout, which in most cases will show little improvement, reducing apparent efficacy. In evaluating a trial it is important to check that all randomized patients are included in the final analysis. In the Benkert et al. [2] study, 167 patients were randomized and at all points data from all of them are reported and analysed by the LOCF method.
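The LOCF principle can be illustrated in a few lines. The scores below are invented HAMD values, purely for illustration: patient 2 drops out after week 2, so their week-2 score is carried forward as the endpoint.

```python
def locf_endpoint(visits):
    """Endpoint score for one patient under last observation carried
    forward: the score at the last visit actually attended.
    `visits` maps week number -> rating-scale score."""
    return visits[max(visits)]

# Illustrative HAMD scores (lower = better):
trial = {
    1: {0: 28, 2: 20, 6: 9},   # completer: week-6 score is the endpoint
    2: {0: 30, 2: 27},         # dropout: week-2 score carried forward
}
endpoints = {pid: locf_endpoint(visits) for pid, visits in trial.items()}
```

Note how the dropout contributes a largely unimproved score to the analysis, which is exactly why completer-only analyses inflate apparent efficacy relative to intention to treat.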
Is the analysis valid and what are the results?
The simplest form of analysis is a comparison of outcome, as measured by the predetermined primary measure, between the treatment groups, carried out after a predetermined number of patients have completed the study. If the measure is continuous (e.g. percentage reduction from baseline in HAMD score after 6 weeks) the statistical test carried out may be an independent t-test. If the data do not fit a normal distribution curve then non-parametric statistics are used. If the primary measure is binary (e.g. response versus non-response according to a predetermined definition such as 50% reduction in HAMD at 6 weeks) a simple χ² test can be used to determine if different numbers of patients in each group meet this criterion.
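For the binary case, the 2 × 2 χ² test reduces to a short calculation. The responder counts below are invented, and the df = 1 tail probability uses the fact that a χ² variate with one degree of freedom is a squared standard normal (hence the complementary error function).

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table of responders:
                 respond   no respond
        drug A      a          b
        drug B      c          d
    Returns (chi-square statistic, two-sided p for df = 1)."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p = math.erfc(math.sqrt(chi2 / 2))  # df = 1: chi2 is a squared normal
    return chi2, p

# Illustrative: 30/50 responders on drug A vs 18/50 on drug B
chi2, p = chi2_2x2(30, 20, 18, 32)
```

Here the response rates (60% vs 36%) differ enough that p falls below 0.05, so the null hypothesis of equal response rates would be rejected.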
Since published analyses can be complicated and misleading, it is useful to discuss several other concepts at this point.
Analysis of variance
Despite the ideal, trials are often imperfectly described by comparison of a single variable at the ‘end’ of treatment. Such an analysis may incompletely summarize the results, missing key features. Several factors may make a more complicated analysis of variance appropriate. First, where measures have been made at several points and speed of response can be examined, a repeated measures analysis of variance may be used. In its simplest form this allows examination of treatment effect, the role of time and the interaction between treatment and time. A significant effect of time means that the outcome variable changes over time. A significant interaction between treatment and time suggests that there is a difference between the speed of response to the two treatments. This can then be explored by examining response at different points. Making multiple comparisons at different points without first finding an interaction between treatment and time is invalid. Second, there may be a differential effect of treatment in certain subgroups; this subgrouping can be entered into an analysis of variance. This might arise for instance in a study of treatment for depression, where an analysis may include both treatment and melancholia as factors. An interaction between treatment and melancholia would suggest that the relative effects of the treatments differ in the subgroups. A significant effect of treatment would suggest an advantage for one treatment without a differential subgroup effect. An example of such an analysis is seen in Swann [15] where a difference in response to lithium and sodium valproate is seen between subgroups of manic patients with and without depressive symptoms.
One advantage of this type of analysis is that it allows more complex data to be analysed in a single analysis. This avoids the problem of multiple statistical comparisons that give an increasing potential for false positives with each analysis. We should be suspicious of studies that use t-tests to compare outcome at multiple points or several subgroups, especially where there is no prior justification. This may apply, for instance, where baseline severity is deemed important and patients are split into groups of different severity and the effect of treatment assessed in each group. More valid would be analysis of covariance with severity as a covariate.
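The scale of the multiple-comparisons problem is easy to quantify: with k independent tests each at α = 0.05, the chance of at least one false positive is 1 − (1 − 0.05)^k. A one-line sketch:

```python
def familywise_error(alpha, k):
    """Probability of at least one false positive across k independent
    tests, each performed at significance level alpha."""
    return 1 - (1 - alpha) ** k

# Six separate tests at p < 0.05, as in the Benkert et al. analysis:
risk = familywise_error(0.05, 6)  # about 0.26
```

So even if the treatments were truly equivalent, a battery of six tests would produce at least one ‘significant’ result roughly a quarter of the time, which is why a single pre-specified analysis is preferable.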
In Benkert et al. [2] ‘times to response and to sustained response on the MADRS and HAMD scales’ are the primary outcome variables. However, these variables only apply to responders. Therefore, this is not an intention to treat analysis. A further analysis is given which is intention-to-treat and examines response at 2 and 6 weeks and ‘sustained response’ on both MADRS and HAMD. Six separate χ² tests are carried out, only one of which shows a significant difference. In our opinion a repeated measures analysis of variance with treatment and time as factors and percentage change in HAMD from baseline as the variable would have been a better way to deal with the issue of differential time to response.
Study size
One of the most important features of planning any research is a power calculation. This is a statistical strategy to determine numbers needed in each group to have a predetermined chance (usually 80%) of detecting a clinically significant difference in outcome (usually at p < 0.05), if that difference actually exists. The first step is to decide what a clinically significant difference would be and to determine from previous studies the variation (standard deviation) in the outcomes. A calculation then allows determination of the number of subjects needed in each group.
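For a continuous outcome compared between two groups, the standard normal-approximation formula gives the number per arm as n = 2((z₁₋α/₂ + z₁₋β)·σ/Δ)². The sketch below implements this; the worked figures (a 4-point difference in HAMD score, standard deviation of 8) are purely illustrative.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Patients needed per arm to detect a mean difference `delta`,
    given outcome standard deviation `sd`, two-sided alpha and the
    desired power (normal approximation, two-sample comparison of means)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# Illustrative: detect a 4-point HAMD difference with sd = 8
n = n_per_group(delta=4, sd=8)  # 63 per group
```

The formula makes the trade-offs explicit: halving the clinically significant difference Δ quadruples the required sample, which is why deciding what difference matters clinically is the first step.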
In evaluating trials it is useful to determine whether the study size was predetermined with a power calculation. In the extreme case, researchers could monitor and analyse the results throughout the study and report the results at the point at which the difference became significant. This would be misleading. An interim analysis is permissible so that if treatments unexpectedly differ before all intended patients are recruited, further patients need not be exposed to what is an ineffective treatment. This analysis should be done independently since knowledge of the analysis may give rise to confounding expectation effects.
The Benkert et al. [2] study stopped early ‘because of observed differences in outcomes between different study centres’. However, after data analysis, no ‘centreby-treatment interactions were noted’. Thus, although the investigators perceived a problematic difference in outcome between centres (but do not state what this was), there was no significant difference between centres in terms of relative response to the treatments. It is questionable whether the study should have been stopped at this point.
How large was the treatment effect?
Number needed to treat
A useful measure of effectiveness is the ‘number needed to treat’ (NNT) (see Chatellier [16] for further discussion), defined as the number of patients who would need to be treated with the investigational drug rather than the comparator in order for one additional patient to derive a significant benefit (usually a predefined response). Individual clinicians may, based on this number, make an informed decision regarding clinical practice. If, for instance, an antidepressant was better than its comparator with a NNT of 25, this would imply that of 25 patients treated with drug A rather than drug B, one would respond who would not have done had they all been treated with drug B. If costs and side-effects were equal most clinicians would prescribe drug A. If, however, drug A was worse in terms of side-effects with a NNT of 10 for significant adverse effects (of 10 patients treated with drug A rather than drug B one would develop significant adverse effects which they would not have done if treated with drug B) most clinicians would use drug B. Number needed to treat and how to interpret it is discussed further in Thompson [17] and Anderson [18].
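The arithmetic behind the NNT is simply the reciprocal of the absolute difference in response (or adverse-event) rates. The rates below are invented to reproduce the NNT of 25 used in the example above.

```python
def number_needed_to_treat(rate_a, rate_b):
    """NNT = 1 / absolute difference in event rates between two arms."""
    return 1 / (rate_a - rate_b)

# Illustrative: 40% respond on drug A vs 36% on drug B -> NNT of 25
nnt = number_needed_to_treat(0.40, 0.36)
```

The same calculation applied to adverse-event rates gives the ‘number needed to harm’, allowing benefit and harm to be weighed on the same scale.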
Where effects are assessed on a rating scale, this can be recalculated as a categorical variable by defining a level of improvement on that scale and calculating the percentage of patients who reach this threshold. A good example of such a calculation is given by Warner [3].
How precise is the estimate of treatment effect?
Confidence intervals give an estimate of the precision with which the NNT has been measured. The figures quoted are usually 95% confidence intervals (95% CI), the range of values which, statistically, the authors can be 95% sure contains the true value. For instance, the NNT may be expressed as NNT = 7, 95% CI = 4–80. These figures are calculated from the figures for ‘sustained response’ on the HAMD for venlafaxine and imipramine in the study of Benkert et al. [2]. They suggest that if seven patients were treated with the rapidly escalating schedule of venlafaxine rather than imipramine, we would expect one extra patient to respond according to their definition. The 95% CI indicates that, to gain one extra responder, anywhere between four and 80 patients might need to be treated with venlafaxine rather than imipramine. A larger sample would give rise to a narrower confidence interval (i.e. a more precise estimate of the true treatment effect).
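One common way such an interval is obtained is to compute a Wald confidence interval for the risk difference and take reciprocals of its bounds. The sketch below illustrates this; the responder counts are invented for the example and are not taken from the trial, and the inversion is only valid when the risk-difference interval excludes zero.

```python
import math

def nnt_with_ci(resp_a, n_a, resp_b, n_b, z=1.96):
    """NNT and its 95% CI from responder counts in two arms, via a
    Wald interval on the risk difference. Returns (NNT, lower, upper).
    Only meaningful when the risk-difference interval excludes zero."""
    p_a, p_b = resp_a / n_a, resp_b / n_b
    diff = p_a - p_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    lo, hi = diff - z * se, diff + z * se
    # Reciprocals swap: the upper risk-difference bound gives the lower NNT
    return 1 / diff, 1 / hi, 1 / lo

# Illustrative: 48/80 responders on the new drug vs 28/80 on the comparator
nnt, lower, upper = nnt_with_ci(48, 80, 28, 80)
```

Note the asymmetry: a symmetric interval on the risk difference becomes a markedly asymmetric interval on the NNT, which is exactly the pattern seen in the 4–80 range quoted above.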
Inclusion and exclusion criteria: does the trial apply to my patient?
Every RCT investigates the comparative effects of two or more treatments on a patient group or population, defined by inclusion and exclusion criteria.
There are two extremes in defining groups. At one extreme a tightly defined homogenous group (few inclusions and many exclusions) may be recruited and significant results reported. However, the results apply to very few patients seen in practice. It is important that those reading the results realize that they may not apply generally because of the narrow definition of the group studied. At the other end of the spectrum are trials with few exclusion criteria but the problem that significant effects of treatment in subgroups may be masked by a lack of effect in other groups. In a large enough trial, this may be overcome by subgroup analyses – in effect splitting the trial into several individual trials in different groups of patients. In reading a reported RCT it is important to examine the inclusion/exclusion criteria to determine whether the results can be applied to one's own clinical practice.
The group in Benkert et al. [2] fits our clinical scenario well: the inclusion criteria specify a group (depressed, melancholic inpatients) similar to our patient. Among the exclusions were patients with ‘serious comorbid disease’ (although ‘serious’ is not specified), ‘those undergoing formal psychotherapy’ and ‘patients with a history of drug or alcohol dependence within 2 years’. Since our patient meets none of these exclusion criteria, the results may still apply to him.
Are the likely treatment benefits worth the potential harms and costs?
Having evaluated evidence presented in a RCT clinicians must decide whether the results should influence practice. In the example presented, although there is little evidence of harm, neither is there convincing evidence of benefit and the treatment (rapid titration of venlafaxine) was not evaluated against a current standard treatment. We have not changed our clinical practice as a result of this trial.
Websites
The Centre for Evidence Based Medicine in Oxford has a useful website and vast number of links to other sites dealing with evidence-based medicine and critical appraisal. URL: http://minerva.minervation.com/cebm/
The Centre for Evidence Based Mental Health also has useful critical appraisal forms for a variety of studies. URL: http://www.cebmh.com/
The Journal of the American Medical Association Evidence Based Practice Users' Guides are available at URL: http://www.cche.net/che/home.asp
