Introduction
As described in Part 1, Richard Cabot’s 1937 Cambridge-Somerville Youth Study on the impact of a social intervention of ‘directed friendship’ on youth delinquency represented the first trial in the social or behavioural sciences to use alternate or random allocation after matching study participants into pairs. Cabot himself, as an esteemed physician, straddled the worlds of both the social sciences and medicine. Part 1 described the study and placed it within the historical context of the social sciences. We next turn to the historical context of medicine and public health.
The advent of matching
Within medicine, from the turn of the 20th century onward and despite the abundance of uncontrolled studies of interventions, some researchers began to consider how to compare like with like in clinical experiments. Many such researchers, often investigating the prophylaxis or treatment of infectious diseases, employed ‘alternate allocation’ studies, in which patient A would receive one intervention, patient B an alternative or nothing, and so forth (other versions entailed alternating the treatment plan every other day or, in anticipation of cluster randomisation, assigning alternate hospital wards to treatment versus control groups). Dozens of such studies were conducted during the first half of the 20th century.1 By the 1930s and 1940s, concerns that researchers believed they could ‘improve on’ allocation schedules based on alternation or random allocation, for example by preferentially steering the sickest patients to the novel treatment, led Austin Bradford Hill to conceal allocation schedules from those entering patients as participants in controlled trials.2,3 Concealed allocation schedules in the 1948 Medical Research Council study of streptomycin for pulmonary tuberculosis contributed to its subsequent iconic status in the history of treatment evaluation.4
But there were other methods offered for ensuring fair medical treatment comparisons. One important technique was that of ‘matching’, or attempting to ensure, a priori rather than solely in post-hoc analysis, equivalent representation of seemingly relevant characteristics among treated and untreated groups. To some extent, such notions extend to James Lind’s own assertion that the cases of the sailors in his experiment comparing different treatments for scurvy ‘were as similar as I could have them’. 5 By the early 20th century, some trialists demanded increasing attention to ensuring such matched characteristics.
One articulation of intentional matching as a method per se in the evaluation of therapeutics appears in a 1912 paper by Harry Lee Barnes on the treatment of tuberculosis with tuberculin, although this was a retrospective analysis.6 Superintendent of the State Sanatorium in Rhode Island, Barnes conducted a retrospective analysis of 150 patients treated at the sanatorium between 1907 and 1912. As he stated, comparisons should be drawn between two classes of patients, those who take the treatment and those who do not, and these parallels should be made from cases that are as similar in prognosis as possible. For this study, an attempt was made to match each one of the 150 patients taking tuberculin against another patient of the same classification, according to the National Association, and also anatomically according to Turban, and likewise to match only cases having similar records of bacilli in the sputum, temperature, pulse, respiration, general condition, weight, race and year of discharge. While not perfect it should be much superior to slipshod methods of stating results of treatment and if widely adopted it would help to weed out more rapidly worthless methods of treatment in pulmonary tuberculosis. If applied to mooted questions like the ‘value of climate,’ it would eventually solve them, as the fruitless war of theories and opinions would eventually be displaced by evidence.
We see emphasis on the need for equivalence around key factors in prospective studies in such prominent locations as Major Greenwood’s and Udny Yule’s World War I-era paper on ‘The Statistics of Anti-Typhoid and Anti-Cholera Inoculation, and the Interpretation of Such Statistics in General’ 7 and in the subsequent American Public Health Association ‘Working Program against Influenza’. As John Eyler has pointed out, in the wake of the mass of uncontrolled (or poorly controlled, even to contemporary judges) influenza vaccination studies during the influenza pandemic, the ‘Working Program’ authors stated as one of their key characteristics of a valid study comparing vaccinated to unvaccinated individuals that ‘the relative susceptibilities of the two groups should be equal, as measured by age and sex distribution’, as well as exposure history.8,9 There is no mention of alternate allocation (let alone random allocation) in the ‘Working Program’. Indeed, it did not otherwise specify how such groups were to be rendered equivalent with respect to such factors.
More formal attention to a priori matching using key characteristics appeared in several prominent prospective studies throughout the 1920s. In Harriette Chick et al.’s investigation of the influence of diet and sunlight on rickets in institutionalised children in postwar Vienna, the children on admission were placed in two groups upon Diets I and II, care being taken that the infants in each group should be as similar as possible in age, general condition, and development, and that they should remain under identical conditions of general management and hygiene during their stay in the hospital.10,11
In Elmer McCollum’s controlled study of supplementary milk in 84 institutionalised children in Baltimore divided into treatment and control groups, ‘every effort was made … so that any child in one group was comparable in age, size and condition to a child in the other group’.12,13 Likewise, in Harold Corry Mann’s study of milk supplementation among institutionalised children outside London, the division of children into active or control groups took account of age, as well as a combined rating score of height and weight.13,14
In a study of vaccines for preventing common colds at the University of Manchester, researchers took 144 volunteers ‘and divided them into two equal groups by sorting the cards [filled out by the volunteers] first according to the sex of the volunteers, and then according to the dates on which the last cold was recorded’. 15 As they continued, emphasising the characteristics they were seemingly able to control versus those they were not able to: ‘Thus the two groups were approximately alike with regard to sex-distribution and with regard to the period which had elapsed since the last cold, in all other respects the distribution was random’. In none of the four studies was there any mention of how participants were allocated to active versus control groups.
Matching and stratification in alternate allocation studies
By contrast, among those focusing on alternate allocation as a means of ensuring the comparison of like with like, matching prior to allocation could be described as an impractical luxury, especially when researchers felt that with strict alternation and a large enough sample size, important characteristics would distribute sufficiently evenly among active versus control groups. Patients with pneumonia at Harlem Hospital were given polyvalent antiserum (treating multiple pneumococcal serotypes), and ‘because of the importance of treating patients at the earliest moment it was impracticable to alternate [the patients by pneumococcal serotype], since often at least twelve hours would have been lost before this was determined’.16 William Park, Jesse Bullowa and Milton Rosenblüth, conducting the study, ‘believed that with a sufficiently large series the distribution of case by type would be equalised between the treated and the untreated group’, and indeed this proved to be the case. Similarly, when the British Medical Research Council began its own study of anti-pneumococcal antiserum a few years later, they clearly enunciated exclusion criteria (e.g. no patients with advanced heart disease, no patients under the age of 20 or over the age of 60) to avoid confounding factors, but their plan still left altogether unregulated the chance scatter of distribution of patients with severe or mild pneumonia into either the serum or control groups, and also of those admitted for treatment early or relatively late in the progress of the disease.17
To some extent, such divisions between the use of a priori stratified studies and alternate allocation may be considered to represent the practical differences between planned, slowly enrolling studies of chronic conditions or preventive measures, and interventions in acute illnesses like pneumonia. But certain researchers did take pains to carefully stratify patients into subgroups for comparison before alternate allocation took place. In the early 1920s, Nicholas Kopeloff and George Kirby, at the New York State Psychiatric Institute on Ward’s Island, investigated the impact of the elimination of focal infections (dental, tonsillar or cervical) on psychiatric illness.19,20 As they noted, ‘because of the difficulties of interpretation inherent in an investigation of this nature, it seemed desirable to reduce the study as nearly as possible to the terms of an experiment’. They chose alternate allocation as their primary means for ensuring equivalence between treated and untreated patients, but also noted that ‘an attempt was made to place in the two different groups, patients comparable as to sex, age, duration of psychosis, diagnosis, prognosis, and infective conditions’. It is unclear how exactly they attempted to operationalise this methodological foreshadowing of ‘minimization’ 21 or reconcile it with alternate allocation.
A decade later, Massachusetts General Hospital’s Donald King, studying the inhalation of carbon dioxide to prevent postoperative pulmonary complications, was far more explicit in describing his attempt to stratify patients prior to alternate allocation. He began by noting that since the sex of the patient and the type of abdominal operation play so important a part, the patients were divided according to sex and then grouped according to the type of abdominal operation. Every other patient, in the subgroups of each sex, was treated.22 This alternation gave, for instance, a group of men who had had operations on the stomach and who had had hyperventilation induced, to compare with an equal number of men who had had operations on the stomach but who had not had hyperventilation induced. … Thus, statistics were available for male and female cases, treated and untreated, in the different groups of abdominal operations and hernia repair.
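The procedure King describes, sorting patients into subgroups by sex and type of operation and then treating every other patient within each subgroup, can be illustrated with a brief modern sketch. This is only an illustration under stated assumptions: the function name, the record fields, and the choice that the first patient in each subgroup is the treated one are ours, not King's.

```python
from collections import defaultdict


def stratified_alternation(patients):
    """Treat every other patient within each (sex, operation) subgroup.

    `patients` is an iterable of (patient_id, sex, operation) tuples in order
    of enrolment. The field names are hypothetical, and starting each subgroup
    with a treated patient is an assumption made for illustration.
    """
    seen = defaultdict(int)  # number of patients already enrolled per subgroup
    allocation = {}
    for patient_id, sex, operation in patients:
        stratum = (sex, operation)
        # Alternate within the subgroup: 1st treated, 2nd control, 3rd treated...
        allocation[patient_id] = "treated" if seen[stratum] % 2 == 0 else "control"
        seen[stratum] += 1
    return allocation


example = [
    ("p1", "M", "stomach"),
    ("p2", "F", "hernia"),
    ("p3", "M", "stomach"),
    ("p4", "M", "hernia"),
    ("p5", "F", "hernia"),
]
result = stratified_alternation(example)
# p1 and p3 share a subgroup, so p1 is treated and p3 is a control;
# p2 and p5 likewise; p4 is the first (treated) patient in its own subgroup.
```

Because the alternation restarts in every subgroup, treated and untreated patients are balanced within each combination of sex and operation, which is the comparison King reports.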
Random allocation within matched pairs
Such general tensions between matching and alternate allocation would be paralleled among those who first broached the mixed application of matching and random allocation within medicine and public health, bringing us still closer to Cabot’s study. By the 1920s, Ronald Fisher had advocated random allocation among agricultural plots in ‘The Arrangement of Field Experiments’.23 Ian Hacking has noted that, in contrast, ‘a majority of traditionalists believed that “matched” or “balanced” arrangements were less subject to error, more instructive, and in general entitled one to draw firmer inferences’,24 and that William Sealy Gosset (a traditionalist who published under the pseudonym ‘Student’, and originator of ‘Student’s t-test’) eventually favoured ‘balanced randomization’ as a happy compromise.
This played out in a fascinating way in 1930 and 1931 with respect to a ‘nutritional experiment on a very large scale’ that followed upon the milk studies described above.13,25 In Lanarkshire, Scotland, 20,000 students from 67 schools were studied in the spring of 1930 to assess the effects of milk supplementation on growth. In any given school, for the most part, ‘the teachers selected the two classes of pupils, those getting milk and those acting as “controls”, in two different ways. In certain cases they selected them by ballot and in others on an alphabetical system’. 26 However, ‘in any particular school where there was any group to which these methods had given an undue proportion of well-fed or ill-nourished children, others were substituted in order to attain a more level selection’. In other words, a rough form of ‘matching’ was added to the process to ensure the comparison of like with like. The study seemed to favour the inclusion of milk; but most important to our inquiry, by 1931, it had led Gosset to produce a methodological deconstruction of the study.
For Gosset, foreshadowing the concerns of those who revealed well-intentioned cheating with unconcealed allocation schedules, ‘unconscious selection’ (later in the paper referred to as ‘unconscious bias’), seemingly manifested in the attempt at matching, could lead to the production of unequal comparison groups (as seemed to have been the case in the Lanarkshire study). Especially focusing on a sub-question of the study concerning the relative utility of raw versus pasteurised milk, Gosset noted that the studied students ‘were not random samples from the same population; they were selected samples from populations which may have been different, … [and] I would be very chary of drawing any conclusions from these small biased differences’. As he gently lamented, ‘this experiment, in spite of all the good work which was put into it, just lacked the essential condition of randomness which would have enabled us to prove the fact’. Instead, Gosset proposed that if the experiment were to be repeated ‘on the same spectacular scale’, then: The ‘controls’ and ‘feeders’ should be chosen by the teachers in pairs of the same age group and sex, and as similar in height, weight and especially physical condition (i.e. well or ill nourished) as possible, and divided into ‘controls’ and ‘feeders’ by tossing a coin for each pair.
That same year witnessed the publication by J Burns Amberson et al. of ‘A Clinical Trial of Sanocrysin in Pulmonary Tuberculosis’.27 Twenty-four patients ‘free from serious complications’ participated in the study: On the basis of clinical, X-ray and laboratory findings the 24 patients were divided into two approximately comparable groups of 12 each. The cases were individually matched, one with another, in making this division. Obviously, the matching could not be precise, but it was as close as possible, each patient having previously been studied independently by two of us.
Amberson et al.’s study did not uncover any beneficial effects of sanocrysin; indeed the drug was shown to have nasty side effects. Joseph Gabriel has demonstrated the origins of the trial at the intersection of mutual public health service and pharmaceutical industry (Parke Davis) interest in an objective assessment of the drug, with the trial entailing blinding of patients to prevent a ‘psychic influence’ on healing. 28 More germane to our line of inquiry, the origins of the single coin toss to determine the allocation of the two groups of patients are less apparent from the archival record. George McCoy, who had played a large role in the American Public Health Association vaccine protocols that emphasised matching (as mentioned above), also supported this therapeutic trial through his role as the director of the national Hygienic Laboratory. While the expressed need for a controlled study (even in discussions of the animal studies that preceded the human study) is evident throughout the record, and while the ‘plan’ for the trial initially called for 100 treatment and 50 control patients, there is no formal mention in the plan of either matching (beyond the intent to choose groups of patients ‘on the basis of pulmonary lesions that are as nearly as possible comparable as regards extent and character of disease’) or random allocation. 29 Clearly, however, by the late 1920s and early 1930s, certain trialists were extending beyond would-be matched controls to the addition of random allocation as an additional mechanism to ensure fair comparisons between treatment and control groups.
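Gosset's proposal for a repeated Lanarkshire experiment, in which children are first matched in pairs and a coin is then tossed for each pair, can be sketched in modern terms as follows. This is a minimal illustration and not a reconstruction of any historical procedure: the function name, the `rng` parameter and the pair labels are hypothetical, and the matching itself (on age, sex, height, weight and physical condition) is assumed to have been done already.

```python
import random


def allocate_matched_pairs(pairs, rng=None):
    """Split each matched pair into one 'feeder' and one 'control' by coin toss.

    `pairs` is a list of (child_a, child_b) tuples that are assumed to be
    matched beforehand. `rng` is an optional random.Random instance, included
    here only so that an allocation can be reproduced.
    """
    if rng is None:
        rng = random.Random()
    feeders, controls = [], []
    for child_a, child_b in pairs:
        # One independent coin toss per pair decides which child receives milk.
        if rng.random() < 0.5:
            feeders.append(child_a)
            controls.append(child_b)
        else:
            feeders.append(child_b)
            controls.append(child_a)
    return feeders, controls


pairs = [("a1", "a2"), ("b1", "b2"), ("c1", "c2")]
feeders, controls = allocate_matched_pairs(pairs, random.Random(1931))
```

The coin toss removes the teachers' discretion over which member of a pair receives milk, the step at which Gosset suspected ‘unconscious selection’, while the prior pairing keeps the two groups alike on the matched characteristics.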
Conclusions
Despite our extensive searching within the Cabot and Sheldon and Eleanor Glueck papers at Harvard (the Gluecks had a close relationship with Cabot and were influential in Cabot’s development of the Cambridge-Somerville Youth Study), it is unclear whether Cabot was aware of Kopeloff and Kirby’s trial, King’s study, Gosset’s dissection of the Lanarkshire study, Amberson et al.’s tuberculosis trial or Austin Bradford Hill’s discussion of the ‘Principles of Medical Statistics’ in The Lancet in 1937. On the one hand, Cabot conducted research on tuberculosis in his early medical career and wrote about the disease and its treatment in his medical textbooks, including later editions published after 1931.30–32 On the other hand, we have not found reference to any of these studies, either in his medical publications or in his personal notes or correspondence.
In tracing the history of treatment evaluation and the conduct of fair comparisons, it would seem that there is more of a direct line from the advent of alternate allocation, through concerns over its improper implementation, to the advent of randomised clinical trials, than there is from Amberson, Cabot or even Gosset to Bradford Hill.2 In this reading, the mixed matching plus randomisation proposals and studies of the 1920s and 1930s seem to be a relative dead end, albeit one reflecting increasing concern to provide objective assessment of novel interventions in the interwar years. These developments ensured that like would be compared with like, that unmeasured variables would be distributed without bias among comparison groups, and that, by concealing the allocation schedule, the allocation system itself could not be cheated.
The combination of matching with random allocation in prospective clinical trials would continue to be deployed in both the social sciences and medicine throughout the 20th and into the 21st centuries, followed by evolving debate over its advantages and limitations.33–36 The design itself serves as a cornerstone of the evolving articulation of stratification, matching, randomisation, and similar innovations for ensuring that fair comparisons are made in trials. Key to this history has been Richard Cabot’s Cambridge-Somerville Youth Study, the first large-scale matched-randomised trial and one of the earliest randomised trials of a social intervention.37
Prior research pointed to the study’s design as representing a natural carry-over from Cabot’s background in clinical and research medicine, as well as the design appealing to him on the grounds that it would be even more rigorous than alternation or simple random allocation.38 We found additional support for the influence of the latter, with Cabot viewing matching as insufficient on its own to achieve equivalence between treatment and control groups. Equally compelling was Cabot’s added concern about ‘achieving adequate experimental controls’ in evaluating an intervention that focused on social behaviour. The implication is that the social world, compared to the physical world, was less known to experimentalists. This is to take nothing away from Cabot’s staunch advocacy for social workers and his repeated call for them to evaluate their interventions using rigorous methods. Famously, in his presidential address to the National Conference of Social Work, he made clear his desire for an age of rigorous evaluation in social work, ‘the much-to-be-desired epoch when we shall control our results by comparison with a parallel series of cases in which we did nothing’.39
More broadly, our research has situated both Cabot and his study in the midst of the social sciences and medicine and public health as they have wrestled with the uses of a priori stratification, matching, alternate allocation, and random allocation, and attempted to compare like with like in the 20th and 21st centuries. Matching – whether as an independent form of ensuring seemingly unbiased comparisons or as an a priori component of alternate allocation or random allocation – has so far received insufficient attention (as have related notions of stratification and exclusion criteria). Our research is an attempt to address this historical gap. Additionally, we have tried to place the history of the social sciences, medicine and public health in direct conversation with one another. The boundaries among such disciplines are indeed indistinct and dynamic. For example, the 1916 study cited in Part 1 on the role of air quality in education was overseen by the New York State Commission on Ventilation, with noted public health pioneer Charles-Edward Amory Winslow among the Commission’s listed members (Thorndike and Ruger 1916). 40 ‘Cross-ventilation’ among the social sciences, medicine and public health themselves has persisted to this day. Richard Cabot likely served as only the most prominent of individuals who straddled – or at least engaged with – multiple disciplines. We hope historians will follow suit and that these articles will stimulate further attention in this direction.
Footnotes
Declarations
Acknowledgements
We are especially grateful to the reference staff at Harvard University Archives, Harvard Law School Library’s Historical and Special Collections and the Center for the History of Medicine at the Countway Library. We thank Joseph Gabriel for generously making available more than 500 images of pages from the sanocrysin files held at the National Archives. Finally, we wish to thank Iain Chalmers for suggesting the central questions that guided our research, as well as for his sage advice and insightful comments throughout the development of these articles.
Provenance
Invited article from the James Lind Library.
