Abstract
As a research community, we have failed to show that drugs that show substantial efficacy in animal models of cerebral ischemia can also improve outcome in human stroke. Accumulating evidence suggests this may be due, at least in part, to problems in the design, conduct, and reporting of animal experiments that create a systematic bias, resulting in the overstatement of neuroprotective efficacy. Here, we set out a series of measures to reduce bias in the design, conduct, and reporting of animal experiments modeling human stroke.
Nearly 10 years after participants in the first Stroke Therapy Academic Industry Roundtable (STAIR) established guidelines intended to support the translation of neuroprotective efficacy from bench to bedside (Altman et al, 2001), there is still no clinically effective neuroprotective drug for stroke. One interpretation of this observation is that the measures outlined in STAIR I have failed to deliver the promised improvements in drug development. However, a dispassionate analysis of data presented over the last 10 years suggests that the ‘STAIR hypothesis’ (that improvements in animal experimental design will lead to improvements in translational efficiency) has yet to be adequately tested. Adhering to standards for the conduct and reporting of experiments that reduce the confounding effects of bias and ensure adequate statistical power, as outlined below, will increase the confidence with which we can assess new data and maximize our chances of developing effective therapies.
The original STAIR proposal was that by paying due attention to experimental bias and to the breadth of physiologic variables known to influence stroke outcome in patients, and by testing therapies in a range of model systems that might more faithfully reproduce the key facets of stroke pathophysiology, we would be able to translate what appeared to be clear evidence of neuroprotective efficacy in animals to the more heterogeneous circumstances of human stroke. Although we believe strongly that failure to adequately consider variables such as age, comorbidity, physiologic status, and timing of drug administration contributes to the disparity between the results of animal models and clinical trials, these variables have been reviewed elsewhere (Altman et al, 2001; Bath et al, 1998) and are not the subject of this article.
Analyses of data supporting the efficacy of various neuroprotective strategies (Begg et al, 1996; Crossley et al, 2008; Dirnagl, 2006) have revealed that although many researchers adhere closely to the ethos of these guidelines, as a community we do not. A simple checklist derived from the STAIR guidelines, used to summarize the range of data available for 1,026 candidate therapies (Crossley et al, 2008), revealed that only a few therapies came close to meeting the STAIR guidelines. A higher score against this checklist was accompanied by a marked reduction in effect size, and this latter trend could be seen clearly even within the data for individual drugs (Grotta, 1995). Moreover, studies that reported measures to avoid bias, such as random allocation to treatment group, masked induction of ischemia, or masked assessment of outcome (Macleod et al, 2005, 2008), gave markedly lower estimates of efficacy. Despite this, there has been some evidence of improvement in study quality, and the performance of animal stroke studies is substantially better than that for most other models of neurologic disease (Dirnagl, 2006). And yet, the majority of investigators still do not report whether they took measures to avoid bias.
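The stratified comparison described above, in which effect sizes from studies that did and did not report a given safeguard are pooled separately, can be sketched with fixed-effect inverse-variance weighting. This is a simplification (analyses of this kind often use random-effects models), and the function name `pooled_by_stratum`, the stratum labels, and the numbers in the usage example are hypothetical, not data from the cited reviews.

```python
import math
from collections import defaultdict


def pooled_by_stratum(studies):
    """Fixed-effect, inverse-variance pooled effect for each stratum.

    studies: iterable of (stratum, effect, standard_error) tuples.
    Returns {stratum: (pooled_effect, pooled_standard_error)}.
    """
    totals = defaultdict(lambda: [0.0, 0.0])  # stratum -> [sum(w * effect), sum(w)]
    for stratum, effect, se in studies:
        w = 1.0 / se ** 2  # more precise studies receive more weight
        totals[stratum][0] += w * effect
        totals[stratum][1] += w
    return {
        stratum: (sum_we / sum_w, math.sqrt(1.0 / sum_w))
        for stratum, (sum_we, sum_w) in totals.items()
    }


# Hypothetical, made-up data: (stratum, % improvement in outcome, standard error)
studies = [
    ("randomization reported", 20.0, 5.0),
    ("randomization reported", 30.0, 5.0),
    ("randomization not reported", 40.0, 5.0),
    ("randomization not reported", 50.0, 5.0),
]
for stratum, (effect, se) in pooled_by_stratum(studies).items():
    print(f"{stratum}: {effect:.1f}% (SE {se:.1f})")
```

Comparing the pooled estimates across strata is what reveals whether studies that report a safeguard tend to give lower estimates of efficacy.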
Systematic reviews and meta-analyses of data from animal stroke studies suggest that these studies may be substantially distorted by experimental bias. Taken together, publications supporting the efficacy of NXY-059 include randomized data with allocation concealment and masked outcome assessment, but most individual publications do not report these measures. Analyses of those data suggest that at least half of the reported 44% improvement in outcome could be attributed to experimental bias, specifically a failure to randomize the allocation to experimental group, a failure to conceal treatment group allocation from the surgeon or a failure to blind the assessment of outcome (Macleod
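The three safeguards named above (random allocation to group, concealment of allocation from the surgeon, and masked outcome assessment) can be supported by generating the allocation schedule before the experiment begins. The sketch below assumes a two-arm study with equal group sizes and a third party who prepares coded syringes and holds the key; the function name `make_allocation`, the syringe-code format, and the animal identifiers are hypothetical and not taken from any protocol described here.

```python
import random


def make_allocation(animal_ids, groups=("vehicle", "drug"), seed=None):
    """Randomize animals to equal-sized groups and return (schedule, key).

    schedule: animal_id -> syringe code (given to surgeon and assessor)
    key:      syringe code -> treatment group (held by a third party)
    """
    rng = random.Random(seed)
    n_groups = len(groups)
    if len(animal_ids) % n_groups != 0:
        raise ValueError("number of animals must be a multiple of the number of groups")
    # Balanced assignment: each group appears equally often, in random order.
    assignments = list(groups) * (len(animal_ids) // n_groups)
    rng.shuffle(assignments)
    # One unique, uninformative code per animal, so labels reveal nothing.
    codes = [f"S{c}" for c in rng.sample(range(100000, 1000000), len(animal_ids))]
    schedule = dict(zip(animal_ids, codes))
    key = dict(zip(codes, assignments))
    return schedule, key


# Usage: 12 hypothetical rats; the key is sealed until outcomes are recorded.
schedule, key = make_allocation([f"rat{i:02d}" for i in range(1, 13)], seed=2009)
for animal, code in schedule.items():
    print(animal, code)  # the surgeon sees only this list
```

Recording the seed and keeping the key separate from the people who induce ischemia and score outcome lets each of these steps be described explicitly in a paper's Methods section.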
A related issue is the number of animals used in experiments. The probability of detecting a difference of a given size between groups is related to the number of animals in each group, the size of the difference and the variability in the outcome measure used. However, only 3% of studies identified in systematic reviews reported using a sample size calculation (Dirnagl, 2006). Importantly, if sample size calculations are based on falsely large estimates of effect size, studies will not be powered to detect real differences between treatment and control groups. Indeed,
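The dependence of power on group size, effect size, and variability can be made concrete with the standard normal-approximation formula for comparing two means, n per group = 2((z₁₋α/2 + z₁₋β)·σ/δ)². The sketch below is illustrative only: a t-based calculation gives slightly larger numbers, the function names are hypothetical, and the effect sizes in the usage lines are made-up values expressed in standard-deviation units. It also shows the problem noted above: planning around an inflated effect estimate leaves a study badly underpowered for the true, smaller effect.

```python
import math
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Animals per group for a two-sided, two-sample comparison of means
    (normal approximation)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_beta = _Z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def achieved_power(delta, sd, n, alpha=0.05):
    """Approximate power of a two-sided test with n animals per group,
    ignoring the negligible chance of rejecting in the wrong direction."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    return _Z.cdf(abs(delta) / sd * math.sqrt(n / 2) - z_alpha)


# Hypothetical: the literature suggests an effect of 1 SD, so the study is
# planned with 16 animals per group.
n = n_per_group(delta=1.0, sd=1.0)
# If the true effect is only 0.5 SD, power falls to roughly 0.29 ...
print(n, round(achieved_power(delta=0.5, sd=1.0, n=n), 2))
# ... whereas detecting 0.5 SD with 80% power needs 63 animals per group.
print(n_per_group(delta=0.5, sd=1.0))
```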
These problems are not unique to the preclinical study of stroke. Clinical stroke trials have had problems with inadequate sample size (O'Collins et al, 2006) and have also failed to report whether they took measures to avoid bias (Plint et al, 2006). Indeed, Cochrane's observation that ‘when humans have to make observations there is always the possibility of bias’ (Sena et al, 2007) was a linchpin of the CONSORT (
On the basis of the available evidence it would now seem reasonable to suggest that preclinical testing in animal models of stroke, and indeed other models of disease, should adopt similar standards to ensure that decision making is based on high quality unbiased data (Dirnagl, 2006; Weaver et al, 2004). Adoption of such standards would have the added benefit of reducing wasteful usage of financial and animal resources.
In general, studies should only be considered for publication if their ‘Methods’ section includes a description of how they have addressed the standards below, or if authors make a cogent argument for why these standards are not relevant to their work. For these components of a paper, citation of methods described in previous publications is not considered sufficient. These requirements should not preclude publication of important observational, pilot or hypothesis-generating data, but the conclusions of such studies should reflect their preliminary nature.
We consider these measures to be of central importance to Good Laboratory Practice in the modeling of cerebral ischemia. Many groups already perform experiments to these high standards, and we hope that they will now report this in full and that others will follow their lead. Finally, we do not consider these requirements to represent a final or complete list of the measures necessary to avoid bias. Future additions may be required as further evidence emerges and the experience of authors and reviewers evolves.
