Abstract
Additional material for this article is available from the James Lind Library website [www.jameslindlibrary.org], where it was previously published.
From Gambling to Astronomy
It was not until the 17th century, when the French mathematician Blaise Pascal developed mathematical ways of dealing with the games of chance used for gambling, that a science for dealing quantitatively with varying observations started to emerge. Whereas in games of chance these mathematical approaches allowed one to determine the value of possible gambles, it turned out they also allowed one to determine the best way to compare and combine observations made by different astronomers.
In the 1700s, the strong and clear distinction made today between observations within a given study and summarized results from different studies did not yet exist. These ideas were tackled in the 18th and 19th centuries by astronomers and mathematicians such as Gauss and Laplace 1 and presented in a textbook published by George Biddell Airy, 2 the British Astronomer Royal. But it was only in the 20th century that statisticians addressed similar questions for the combination of clinical trial results. Summarizing results from different studies eventually became the formalized technique we refer to today as meta-analysis.
Karl Pearson and Typhoid Inoculation
The British statistician Karl Pearson was familiar with Airy's textbook and appears to have been the first to apply methods to combine observations from different clinical studies. He was asked to analyse data comparing infection and mortality among soldiers who had volunteered for inoculation against typhoid fever in various places across the British Empire with that of other soldiers who had not volunteered. 3
Pearson first re-grouped the study observations into larger groups, noting simply that he considered some groups too small. His reasoning here is not clear, though it might simply have been based on expediency, given the practical difficulty of carrying out many small analyses. This preliminary re-grouping of various studies into ‘one study’ would be considered an invalid technique today, although a re-analysis comparing the original studies with the collapsed studies used by Pearson shows that the collapsing had no practical consequence.
Pearson decided to look at the association of inoculation with infection separately from the association of inoculation with mortality. The observed study outcomes were presented in ‘two by two’ tables in his Appendix B. He presented the results of his analyses in a table in which each study was assigned its own line showing its measure of effect, together with a measure of the within-study uncertainty. The last line gives a pooled estimate of the effect (his ‘meta-analysis’), albeit without an estimate of the pooled uncertainty associated with this estimate.
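To make the idea of a per-study 'measure of effect, together with a measure of the within-study uncertainty' concrete, the following minimal sketch computes one line of such a table from a two by two table. It uses the log odds ratio and its large-sample standard error, a modern choice assumed here for illustration; it is not Pearson's own measure. The function name `log_odds_ratio` and all counts are hypothetical.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Return (log odds ratio, standard error) for a two by two table.

    a: inoculated, infected        b: inoculated, not infected
    c: not inoculated, infected    d: not inoculated, not infected

    Assumes every cell is greater than zero.
    """
    log_or = math.log((a * d) / (b * c))
    # Large-sample (Woolf) standard error of the log odds ratio.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return log_or, se


# Hypothetical counts, invented for illustration only.
tables = {
    "Study A": (10, 190, 30, 170),
    "Study B": (5, 95, 12, 88),
}

for name, cells in tables.items():
    estimate, se = log_odds_ratio(*cells)
    print(f"{name}: log odds ratio = {estimate:.2f}, standard error = {se:.2f}")
```

A negative log odds ratio in this layout corresponds to fewer infections among the inoculated than among the uninoculated.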
By the standards of the time (using two probable errors rather than two standard errors as the criterion) all but two studies analysed by Pearson showed statistically significant associations of inoculation with infection and death from typhoid; but he was struck by the irregularity of the associations. Seeking some explanation for these varying effects, he considered the possibility that the soldiers who had volunteered for inoculation against typhoid might have been at lower initial risk of developing the disease. He noted that these uncertainties might be resolved by further scrutiny of the results in hand but, significantly, proposed ‘an experimental inquiry’:
‘Assuming that the inoculation is not more than a temporary inconvenience, it would seem to be possible to call for volunteers . . . [and] only to inoculate every second volunteer . . . with a view to ascertaining whether any inoculation is likely to prove useful . . . In other words, the ‘experiment’ might demonstrate that this first step to a reasonably effective prevention was not a false one.’
Karl Pearson appears to have been the first to analyse clinical trial results using meta-analysis. He was especially thorough about questioning the consistency of individual trial results and equally keen to discover clues from this for better future research.
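The 'two probable errors' criterion mentioned above can be compared with the 'two standard errors' criterion numerically. The sketch below assumes a normally distributed estimate, in which case one probable error is the distance from the mean that encloses the central 50% of the distribution, roughly 0.6745 standard errors. The names `PROBABLE_ERROR` and `two_sided_tail` are illustrative only.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# One probable error in standard-error units: half of a normal
# distribution lies within this distance of the mean.
PROBABLE_ERROR = std_normal.inv_cdf(0.75)


def two_sided_tail(z):
    """Probability that a standard normal deviate is larger than |z| in size."""
    return 2 * (1 - std_normal.cdf(abs(z)))


p_two_pe = two_sided_tail(2 * PROBABLE_ERROR)  # two probable errors
p_two_se = two_sided_tail(2.0)                 # two standard errors

print(f"one probable error  = {PROBABLE_ERROR:.4f} standard errors")
print(f"two probable errors : two-sided tail probability = {p_two_pe:.3f}")
print(f"two standard errors : two-sided tail probability = {p_two_se:.3f}")
```

Under this assumption, two probable errors amount to about 1.35 standard errors, so the older criterion was considerably more lenient than the two-standard-error rule.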
The Fertile Field of Agricultural Statistics
Like Pearson, the British statistician Ronald Fisher had studied statistics from Airy's textbook, and was comfortable addressing the combination of different study results. During the 1920s and 1930s, Fisher worked at the Agricultural Research Station in Rothamsted. In his 1935 textbook, he gives an example of the appropriate analysis of multiple studies in agriculture, identifying the probable and real concern that fertilizer effects will vary by year and location. 4 There were numerous references to and discussions of the analysis of multiple studies in the last book that Fisher wrote, 5 in which he encouraged scientists to summarize their research in such a way as to make the comparison and combination of estimates almost automatic, and the same as if all the data were available. Fisher's influence on meta-analysis is hard to exaggerate. For instance, one of the earliest publications warning about preferential publication of studies based on statistical significance acknowledged Fisher as the person responsible for stimulating the research. 6
One of Fisher's colleagues, William Cochran, extended Fisher's approach and provided a formal random effects framework for it more in line with the earlier approach by Airy. 7 Cochran, together with Frank Yates (another colleague of Fisher's), soon afterwards applied this in practice to agricultural data. 8 Cochran continued to work on methods for the analysis of multiple studies throughout his career. Indeed, the last sentence in his last paper commented on the difficulties in dealing with study effects that vary over time and location. 9
Cochran also applied the method in medical research in an assessment of the effects of vagotomy (a surgical operation for duodenal ulcers), which was reported in an influential book entitled Costs, Risks and Benefits of Surgery. 10 Like Karl Pearson before him, 3 Cochran commented on the need for data from controlled trials:
‘We could have come across a number of comparisons that were well done but not randomized—the type sometimes called observational studies. . . . I would have been interested in including the observational studies so as to learn whether they agreed with the randomized studies and if not, why not? But the medical members of our team had been too well brought up by statisticians, and refused to look at anything but randomized experiments.’
Meta-Analysis and Fair Tests of Social, Educational and Medical Interventions
By the middle of the 20th century, the sheer volume of research reports forced researchers to consider how to develop and apply methods to synthesize the results produced. In 1940, for example, quantitative synthesis was used in an analysis of the results of 60 years’ research by psychologists on extrasensory perception. 11 Finding themselves swamped with studies and in need of methods to make sense of the barrage of findings, 12 other American social scientists and statisticians began to develop and apply methods for quantitative synthesis of the results of separate but similar studies.13,14 In 1976, one of them, Gene Glass, coined the term ‘meta-analysis’ to refer to ‘the statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings.’ 15 Articles and textbooks about meta-analysis followed soon after.16–21
Application of meta-analysis by medical researchers began a few years later.10,22–24 Particularly influential was the first randomized trial conducted by Peter Elwood, Archie Cochrane and their colleagues to assess whether aspirin reduced recurrences of heart attack. 25 The results were suggestive of a beneficial effect but were not statistically convincing; therefore, as additional trials were reported, Elwood and Cochrane assembled and synthesized their results using meta-analysis. 26 This left little doubt that aspirin could reduce the risk of recurrence, and the results were published in 1980 in an anonymous Lancet editorial, 27 which had actually been written by the British medical statistician Richard Peto. Based on earlier work,28,29 Peto and his colleagues went on to provide a detailed example (using randomized trials of beta-blockade following heart attack) to encourage clinicians to review randomized trials systematically, and to combine estimates of the effects of treatments considered to be the same, based on informed clinical judgment. 30 When treatment effects varied among studies, Peto argued for testing and estimating the (fixed) weighted average of the varying treatment effects. 31 He and his colleagues therefore rejected the Airy/Cochran tradition of considering the variation of treatment effect as being like a random variable. The latter approach was promoted to medical researchers by DerSimonian and Laird, 32 who also provided simple approximate formulas for Cochran's formal random effects model.
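The contrast between the fixed weighted average and the random effects view can be sketched in a few lines. The code below is a minimal illustration under stated assumptions: the fixed approach is represented by a generic inverse-variance weighted average (Peto's own calculations are not reproduced here), and the random effects estimate uses the method-of-moments formulas commonly attributed to DerSimonian and Laird. The function names and all numbers are hypothetical.

```python
def fixed_effect(estimates, variances):
    """Inverse-variance weighted average and the variance of that average."""
    weights = [1 / v for v in variances]
    pooled = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    return pooled, 1 / sum(weights)


def dersimonian_laird(estimates, variances):
    """Random effects pooled estimate, its variance, and tau^2.

    tau^2 is the method-of-moments estimate of the between-study variance,
    truncated at zero.
    """
    k = len(estimates)
    if k < 2:
        raise ValueError("at least two studies are needed")
    weights = [1 / v for v in variances]
    pooled_fixed, _ = fixed_effect(estimates, variances)
    # Heterogeneity statistic: weighted squared deviations from the fixed average.
    q = sum(w * (y - pooled_fixed) ** 2 for w, y in zip(weights, estimates))
    c = sum(weights) - sum(w * w for w in weights) / sum(weights)
    tau2 = max(0.0, (q - (k - 1)) / c)
    # Each study's weight now also reflects the between-study variance.
    re_weights = [1 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, estimates)) / sum(re_weights)
    return pooled, 1 / sum(re_weights), tau2


# Hypothetical study estimates (e.g. log odds ratios) and their variances.
estimates = [-0.40, -0.10, -0.65, 0.05]
variances = [0.04, 0.02, 0.09, 0.03]

fe, fe_var = fixed_effect(estimates, variances)
re, re_var, tau2 = dersimonian_laird(estimates, variances)
print(f"fixed:  {fe:.3f} (variance {fe_var:.4f})")
print(f"random: {re:.3f} (variance {re_var:.4f}, tau^2 {tau2:.4f})")
```

When the estimated between-study variance is zero the two approaches coincide; otherwise the random effects weights are more nearly equal across studies and the pooled variance is larger.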
As had happened in the social sciences a few years earlier, these developments in clinical research led to expository papers,33–36 special journal issues 37 and books38–40 directed at clinical researchers and clinicians. These publications tended to emphasize the importance of assessing the quality of the studies being considered for meta-analysis to a greater extent than the early work in social sciences had done. 38 They also emphasized the importance of the overall scientific process (or epidemiology) involved.35,36
The importance of using systematic approaches to reducing bias in reviews of a body of evidence began to be distinguished as an issue separate from meta-analysis.41,42 This emphasis was manifested most explicitly in the late 1980s by the creation of global trialists’ groups to conduct collaborative ‘overviews’ (meta-analyses based on individual patient data from their respective studies),43,44 as well as international collaboration to prepare meta-analyses of all the randomized trials in some medical fields. 45
By the early 1990s, terminology was becoming confusing, and Chalmers and Altman 40 suggested that the term ‘meta-analysis’ should be restricted to the process of statistical synthesis considered in this commentary. This convention has now been adopted in some quarters. For example, the second edition of the BMJ publication Systematic Reviews is subtitled Meta-analysis in Context, 46 and the 4th edition of Last's Dictionary of Epidemiology 47 gives definitions as follows:
‘Systematic Review: The application of strategies that limit bias in the assembly, critical appraisal, and synthesis of all relevant studies on a specific topic. Meta-analysis may be, but is not necessarily, used as part of this process.’
‘Meta-Analysis: The statistical synthesis of the data from separate but similar, i.e. comparable studies, leading to a quantitative summary of the pooled results.’
Just as debates seem likely to continue about the statistical methods used for meta-analysis, so also will debates continue about terminology. What is certain, however, is that we will continue to have to deal quantitatively with varying study results.
Footnotes
Competing interests: None declared.
