Analyzing and interpreting “imperfect” Big Data in the 1600s

Abstract

One of the characteristics of Big Data is that it often involves “imperfect” information. This paper examines the work of John Graunt (1620–1674) in the tabulation of diseases in London and the development of a life table using the “imperfect data” contained in London’s Bills of Mortality in the 1600s. London’s Bills of Mortality were Big Data for the 1600s, as they included information collected over time, the depth and accuracy of which improved gradually. The main shortcoming of the data available at the time was its nonuniform upkeep and the lack of depth of variables included at its outset. Due to these characteristics, it provides a perfect model for the examination of imperfect Big Data, as it has been analyzed, criticized, and interpreted repeatedly since the 1600s.

Keywords

Big Data, John Graunt, “imperfect” Big Data, life expectancy, London’s Bills of Mortality, observations

Introduction

This paper examines John Graunt’s (1620–1674) work in Natural and Political Observations Made Upon the Bills of Mortality (Observations) (1662), where the author, a tradesman and haberdasher, analyzed and interpreted “imperfect” Big Data to tabulate diseases and calculate life expectancy. Because the data contained in London’s Bills of Mortality (The Bills) were imperfect in many senses, which we will examine in this paper, they were underutilized until they passed under Graunt’s scrutiny. Graunt’s willingness to analyze The Bills indicated that he wanted to build upon an imperfect Big Data set initiated by others for research he considered important for the state. The data set, initiated in 1538 as parish records of births, marriages, and burials, became an object of intense interest throughout Europe when Graunt undertook the analysis of The Bills and went on to publish his analysis in Observations.

This paper also contrasts Graunt’s imperfect Big Data with the imperfect Big Data available today pertaining to human survival, life expectancy, and predictors of life expectancy. Our goal in examining Graunt’s work is three-fold: to show (1) how data collection impacts the final product; (2) how working with imperfect data can still produce important information; and (3) how imperfect data can be improved on in an existing or future data set. Graunt’s willingness to work with an imperfect data set that originated in parish birth and burial records began an approach that continues today in Big Data databases. Studying how Big Data were handled in the past can improve our current practices in handling Big Data in contemporary society.

Graunt’s Big Data of the 1600s

The Worshipful Company of Parish Clerks of London (Wrigley and Schofield, 1989) oversaw the collection of the data for The Bills. The members of this association were wealthy, highly influential people in the City of London. Lord Thomas Cromwell issued the decree that the parson, vicar, or curate of every parish should keep an exact register of all christenings, weddings, and burials, entering the data into a book. All Episcopal churches and those in the City of London abided by this decree. The parish clerks not only helped establish these parish registers, but also made them public property by issuing weekly sheets, known as London’s Bills of Mortality, which could be purchased for two shillings. The publication of The Bills began in 1592, occasioned by a time of great mortality from the plague. Although they were subsequently discontinued, they became available again in 1603 (Anonymous, 1842).

In the Preface of Observations, Graunt noted that he considered The Bills as “neglected papers,” and he “proceeded farther to consider what benefit the knowledge of the same would bring to the world.” As this data set did not contain age at the time of death as a variable of interest, it was imperfect and thus limited Graunt’s construction of a life table. Cleland (1836) noted that The Bills lacked depth until the first quarter of the 1600s. In 1629, the “number of deaths by various diseases and casualties” began to be included in The Bills. Nearly a century later, in 1728, the “age of the person at time of death” was introduced, but gender distinctions were not attended to with much care.

Birch (1759) described The Bills as a book of “births and burials.” In 1629, the number of deaths, classified by different diseases and casualties, was first added into The Bills, along with the distinction between the sexes (Anonymous, 1842). Graunt did not separate the sexes in his work, but combined them in the total number of deaths by disease or casualty. In 1728, The Bills started to include the ages at which the deaths took place (Anonymous, 1842). In his work, Graunt constructed tables listing causes of death and the number of deaths within each category. Graunt noted that the identification of the cause of death in each case was work undertaken by the Searchers, ancient matrons sworn to their office who, upon hearing the tolling of a bell signaling that someone had died, would go immediately to view the corpse and assess the cause of death.

Graunt’s tabulations of the causes of death from the records included various conditions, such as ague and fever, consumption, convulsions, “distracted,” drowned, executed, found dead in the street, “frighted,” “griping in the guts,” “hang’d and made away themselves,” “kill’d by several accidents,” “murdered and shot,” “overlaid and starved,” plague, “planet,” “rising of the lights,” smallpox, spleen, spotted fever, “stopping of the stomach,” and “surfeit,” along with a sum of the number of inhabitants who had died in each death category. These tabulations, along with Graunt’s commentary, formed the substance of his book, Observations, with two editions published in 1662, third and fourth editions published in 1665, and a fifth in 1676, two years after his death (Heyde and Seneta, 2001).

Graunt (1662) dedicated an entire chapter in Observations to examining the possible errors in the Searchers’ assessment of causes of death. First, he acknowledged that the numbers of individuals assigned the category of dying of plague were “not sufficiently deduced from the mere report of the Searchers.” He also noted that some of the Searchers’ reports may have been “ignorant and careless,” and he had to consider whether any credit should be given to their “distinguishments.” Yet, he also acknowledged that many of the categories of death were “but matter of sense,” and that whether a child was abortive or stillborn, or whether an aged person (above 60 years old) without any curious determinant could be categorized as dying “purely of age,” were decisions that the Searchers could make on their own. With respect to consumption, Graunt noted that “whether the dead corpse were very lean, and worn away, it matters not to many of our purposes, whether the disease was exactly the same as physicians define it in their books.” Similarly, in the case of a 75-year-old man who “died of a cough” (of which had he been free, he might have possibly lived to 90), Graunt esteemed it “little error (as to many of our purposes) if this person be, in the Table of Casualties, reckoned among the Aged, and not placed under the title of Cough.”

However, in the matter of Infants, Graunt reported he “would like to know clearly, what the Searchers meant by Infants, as whether children that cannot speak, as the word Infans seems to signify, or children under two or three years old.” In addition, for Graunt, “if one died suddenly, the matter is not great, whether it be reported in The Bills, Suddenly, Apoplexie, or Planet-strucken, &c.” Finally, Graunt concluded that “in many of these cases, the Searchers are able to report the opinion of the physician who was with the patient.” It is unclear what the term “many” meant in this context. The importance of Graunt’s comments on possible errors of assessment of the causes of death is that he told his readers explicitly how he evaluated these errors and where he stood on each point of error in his assessment of The Bills.

Graunt neither discussed nor mentioned the specific criteria the Searchers used to identify diseases. However, regarding the plague, Graunt did note that “ …many times other pestilential diseases, as purple-fever, smallpox, &c. do fore-run the plague a year, two, or three … ” Thomas Birch (1759), Secretary of the Royal Society, edited Collection of Yearly Bills of Mortality from 1657 to 1758 Inclusive. The preface to this collection shows the type of classification (nosology) that expands upon Graunt’s point about diseases that “fore-run the plague.” In Birch’s preface, the following points were noted:

Before the plague begins, there sometimes dies not one in a week of spotted fever, and never at most above four. But in the fifth week of the plague there die twelve, and afterwards the number increases as the plague increases, so that there frequently die above a hundred, and one week one hundred and ninety. This fever decreases with the plague. There is reason therefore to suspect, that this fever was the same from the beginning, as the true plague very often passed under the name of the spotted fever.

Reasons are also given for why there may be problems with classifying disease as spotted fever versus plague: “This might be done willfully by some, who were unwilling to own, the plague was in their houses; which is worth the attention of the magistrate.” Finally, this preface pointed to some of the descriptors used to classify the plague in Graunt’s day: “ …purples only appearing in several, who had the plague, without any buboes or carbuncles.”

Within Observations, Graunt constructed a life table. A life table is a table of average ages at death, or of probabilities of death, used in calculating life insurance premiums (Franklin, 2013). While Graunt has been heralded as the developer of the concept of vital statistics, Ivo Schneider (2000) recognized that Graunt’s data set in The Bills did not contain enough information to construct a life table. Schneider argued that Graunt’s own discussions of The Bills were hypothetical, rather than the result of empirical observations, because The Bills did not contain any information on the “age” of the deceased. According to Schneider, Graunt constructed his life table “on the assumption that the number of living at age 6, 16, 26, … , 76 form a falling geometrical series.” Thus, as Schneider argued, “Graunt’s [life] table is purely fictitious.” As noted, the ages at which the deaths took place were not inserted in The Bills until 1728 (Anonymous, 1842). Nonetheless, Graunt’s book was influential not only in England but also in France and Holland as the public and state became interested in life annuities (Maddison, 2006).
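
The assumption Schneider describes can be illustrated with a minimal sketch. It builds a table of survivors at ages 6, 16, 26, … , 76 in which each entry is a fixed fraction of the one before it. The starting count of 64 survivors at age 6 and the ratio of 5/8 are illustrative assumptions chosen for this example, not figures taken from The Bills or from this paper, and the function name is hypothetical.

```python
def geometric_survivors(first_age, first_count, ratio, step, last_age):
    """Return {age: survivors}, each count being `ratio` times the previous one.

    This encodes only the 'falling geometrical series' assumption;
    no observed ages at death are used anywhere.
    """
    table = {}
    count = float(first_count)
    age = first_age
    while age <= last_age:
        table[age] = count
        count *= ratio   # the same fraction survives each interval
        age += step
    return table


# Illustrative assumptions only: 64 survivors at age 6, ratio 5/8 per decade.
table = geometric_survivors(first_age=6, first_count=64, ratio=0.625,
                            step=10, last_age=76)

for age, survivors in table.items():
    print(age, round(survivors, 1))
```

Because every entry follows from the first count and the ratio, the table contains no more information than those two assumed numbers, which is the sense in which such a table is hypothetical rather than empirical.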

One of the earliest uses of Graunt’s tables was discussed in The Railway News (Anonymous, 1865), where it was reported that when the plague was spreading over Europe, inhabitants of London used the tabulated data to decide when to leave London (based on the high death rate reported in The Bills) and when to return (based on the low number of plague deaths reported). Others have recognized the influence of Britain’s use of parish houses for data collection. Louis XIV of France began development of the parish house concept in France, and Laplace used France’s parish house data to calculate the population of France in 1786. Finally, Newman (1956) noted that Charles II of England was so favorably disposed toward Graunt’s Observations that he proposed Graunt as an original member of the newly incorporated Royal Society. To forestall any objections from parties opposed to admitting “a shopkeeper” to the Royal Society, according to Newman, Charles II charged that if the Royal Society found any other such tradesmen, they should admit them all without further ado.

Imperfect data in the 1600s

Lumpkin (2002) credited Graunt with “a careful and logical interpretation of ‘imperfect’ data.” Similarly, Franklin (2013) recognized Graunt for “printing yearly tables of the data,” noting that “The tabular form and the reduction of data to summary figures are both substantial advances in allowing data to be understood.” Franklin gave Graunt further credit for acknowledging the overall lack of knowledge in Graunt’s day of the size of the population of London (which was rapidly increasing during the 1603–1660 period and was affected by temporary emigration in plague years). Franklin particularly acknowledged Graunt’s use of three methods for estimating the size of London, which Graunt noted to be in surprisingly close agreement: “Christenings, burials, and multiplying the area of a map of London by estimates of the number of houses per unit area and the number of people per house … ”

Graunt’s initial acquaintance with The Bills was as a reader of these reports. Graunt analyzed the existing Big Data set contained in The Bills for three purposes: (1) extracting the information the data set contained for use in society; (2) learning how to make improvements to that existing data set; and (3) specifying the “imperfections” of that data as clearly as possible, both to improve the existing data set and to influence the design and collection methods of future data sets of the same genre. Graunt recognized that, once collected, the data could be misanalyzed and misinterpreted, and incorrect conclusions drawn from it. But this recognition did not deter Graunt from analyzing The Bills to see what knowledge could be derived from them.

Specifically, Graunt recognized how imperfections in the design of the preexisting data set (parish records) and imperfections in identifying data during the data collection process impacted the final products derived from the data set. In terms of the design of the preexisting data set, age at time of death was not included as a variable in The Bills until 1728. This imperfection precluded the development of an empirically based life table, so Graunt’s life table was built on the assumption that people died in a manner that could be captured mathematically in terms of a geometric series. In terms of identifying data during the collection process, Graunt saw that the data concerning cause of death obtained by the Searchers (who, when possible, conferred with the individual’s doctor) could be incorrect. Thus, the attribution of cause of death was limited by how well the Searcher understood the doctor and by how well the doctor understood the cause of the person’s death.

In summary, Graunt noted at least three types of “imperfect data”: (1) failure to recognize and include certain types of data in designing a data set, for example, age at death and cause of death in parish records; (2) errors of data description, for example, errors of cause of death made by the Searchers; and (3) inclusion of inappropriate data because of the characteristics of the study population, for example, the lack of stability of London’s population. In addition, Graunt identified “imperfections” in the data he confronted in The Bills: (1) the population from which the data were derived (the data Graunt used were derived from a city with an increasing population, i.e. London); (2) the set of variables included in the initial data set (the data Graunt used were derived from parish records of births and burials); and (3) the set of individuals charged with collecting the data (the data Graunt used were collected by the Searchers, who were eager to do their task well but had lapses due to lack of training).

Imperfect Big Data today

Today’s investigators continue the work started by Graunt: identifying the sources of imperfect data to create more perfect data collections in the future. Today, numerous issues affect Big Data, including inappropriate data selected in the study design (Clark, 2013); data contamination and missing records (Pearson, 2005); and data misclassification, miscompilation of data, incorrect observations about the data being collected, and misrecording of data (Dharmalingam and Amalraj, 2013; Lumpkin, 2002; Smets, 1999; Vilhuber, 2008; Wheeler, 2006). Barranco et al. (2008: 438) defined imperfect data as “data that are uncertain, imprecise, vague, or inapplicable.” Pearson (2005: 33) identified five types of “data imperfections”: “noise or observation error, gross errors, simple missing data, coded missing data, and disguised missing data.” Thus, today the term imperfect data includes errors of formation of study hypothesis (e.g. selection of variables to be included in the study); errors of formulation of study design (e.g. errors of selecting what is to count as a variable); errors of selection and training of data collectors; errors of data collection (e.g. errors of data inclusion, data exclusion); errors of database recording (e.g. errors in deciding who is responsible for recording data in the database); errors of data storage (e.g. errors in deciding the format in which the data will be stored and where the database is kept); and errors of data analysis and interpretation. The imperfection of data also increases when data are pooled from multiple sources, as the problem then becomes matching imperfect data from multiple databases.
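
Two of the categories Pearson names, simple missing data and coded missing data, can be distinguished mechanically, as the following minimal sketch suggests. The sentinel codes, field names, and records are hypothetical and chosen only for illustration; any real data set would define its own codes. Here an empty or absent entry is treated as simple missing data, and a placeholder code standing in for a value is treated as coded missing data.

```python
# Hypothetical sentinel codes; a real data set would document its own.
CODED_MISSING = {"NA", "N/A", "-999", "unknown"}


def classify_value(raw):
    """Label one raw field value as simple missing, coded missing, or present."""
    if raw is None or raw.strip() == "":
        return "simple missing"      # nothing was recorded at all
    if raw.strip() in CODED_MISSING:
        return "coded missing"       # a placeholder code was recorded
    return "present"


def summarize(records, field):
    """Count how many records fall into each category for one field."""
    counts = {"simple missing": 0, "coded missing": 0, "present": 0}
    for rec in records:
        counts[classify_value(rec.get(field))] += 1
    return counts


# Illustrative records only, loosely modelled on a burial register.
records = [
    {"cause": "consumption", "age": "61"},
    {"cause": "plague", "age": ""},
    {"cause": "fever", "age": "-999"},
    {"cause": "aged"},               # the age field is absent entirely
]

print(summarize(records, "age"))
```

A check of this kind cannot find disguised missing data, where a plausible-looking value stands in for an unknown one; that requires knowledge of how the data were collected.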

Discussion

The evolution of data sets based on birth and death registries beyond Graunt’s unique first effort can be traced to Edmond Halley, astronomer, demographer, mathematician, physicist, financial economist, and actuary. Caspar Neumann, a minister with mathematical skills, developed a data set of the city of Breslau, Silesia. During this time, the city of Breslau (now Wrocław, Poland), unlike London, had a stable population with little migration in and out of the city. Neumann maintained the data set himself to ensure its integrity. Halley was given this rather unique data set and used it to develop life tables. Halley’s life tables were considered better than Graunt’s in two respects (Ciecka, 2008). First, Graunt’s life table based on The Bills was compromised because he had to make assumptions about the ages of the individuals who died, as “age at time of death” was not a variable in the data set he initially had at his disposal. Second, Graunt’s data set was based on the population of the city of London, which was growing in as yet unquantified ways during the period of data collection. In contrast, Neumann’s data set, based on the city of Breslau, included “age at the time of death” because Neumann developed and maintained the data set. It is thus essential to understand that in research based on data sets, even those data sets developed prospectively as a new data collection, there is no guarantee that all needed variables will be recognized and identified before the new data collection begins. Consequently, the problem of incomplete data can never be avoided. One can say that the first data collection was based on study hypothesis-1 and, once a new variable is recognized during study-2, a new study hypothesis can be tested through the collection of a new data set with this new variable added. This is the way Big Data-based research science is conducted.

When one reflects on Big Data in the 1600s and 1700s, it is evident that the individuals engaged in developing large data sets did so for a number of reasons. While Graunt was interested in helping the state better understand itself and its resources, Neumann’s work was aimed at refuting local legend about the moon and deaths in the town. Neumann recognized that he had to make sure that the variables needed to answer study hypotheses were included in the initial study design and that the data were collected correctly so that others developing additional study hypotheses could use the data set to answer new research questions. In addition, by the 1700s, it was recognized that expert mathematicians, like Daniel Bernoulli, could be attracted to certain projects that needed data. This was the case when Paris lacked its own data to answer the question of whether variolation could control the spread of smallpox in the city. Bernoulli used the Breslau data in the smallpox model he developed for Paris. However, the question that remains to be answered is whether Breslau data, with its stationary population, could yield useful data for Paris, which was more affected by migration and was, in that respect, similar to the London of Graunt’s day. Clean data are useful to researchers aiming to answer a data-based question only to the extent that the population from which the data were derived is similar to the population under investigation. In other words, variables need to be carefully selected for the study hypothesis to be tested, yet there is no guarantee prior to any study that the right variables are being selected.

Imperfect data are presently used to address a variety of research questions, including evaluating the impact of genotype errors (Cook et al., 2014); evaluating the yearly estimates of the number of strokes in a country given the declining hospitalizations for stroke in that country, where “estimating the number of strokes in a country can be highly variable depending on the recency of the data, the type of data available, and the methods used” (Cadilhac et al., 2014); and designing and validating epidemiologic surveillance in uncounted populations (Byass et al., 2011). The search for high quality data today is costly and requires human study volunteers willing to assume the risks of taking part in a study. In early studies involving newly developed drugs and devices, the risks are often high and the volunteers are usually offered no compensation in the event of a severe adverse outcome. Research studies involving human study participants from the outset (particularly with newly developed drugs and devices) are not designed to help the study volunteers, but rather they aim to produce knowledge that would benefit future generations (Mazur, 2007).

Conclusions and future challenges in imperfect data

While Graunt was keenly aware of the problems of the data collection as he was working with The Bills, they did not deter him from tabulating and using the data. Today, there is growing recognition of the need for more comprehensive and reliable Big Data in health and medicine for societal decision making in public health and medical care. Today’s efforts in societal decision making are required not only for better estimates of life expectancy but also for the development of life tables for the insurance industry. Today’s databases need new data types to develop a wide range of prediction models: predictors of natural and unnatural mortality (Fok et al., 2014); predictors of long-term (10-year) mortality (Plakht et al., 2014); risk factors that affect life expectancy (Wattmo et al., 2014); predictors of long-term cognitive outcomes in young adults (de Bruijn et al., 2014); mortality inequality in populations with equal life expectancy (Auger et al., 2014); and identifying factors that would enable populations to live longer with less disease and disability (Hoeymans et al., 2014).

A key difference between imperfect data in the days of Graunt and the present involves today’s keen ethical recognition of the rights of human study volunteers and the limitations of what can be asked of research volunteers in terms of data acquisition. Rights are accorded citizens and patients in informed consent in clinical care and in research on human study volunteers. In the future, these rights of citizens and patients may be expanded further to include more assurances of confidentiality and privacy related to their personal data not only in clinical care and human subjects research, but also in other societal uses of personal data, requiring informed consent before deciding whether to allow use of their personal data for any societal purpose. Such expansions in rights to confidentiality, privacy and informed consent may result in an expanded number of refusals to allow use of personal data and more missing data based on these personal refusals, leading to more instances of imperfect data in future data collections and databases.

Future research in medicine and public health will continue to deal with imperfect data, as the challenges of increasing costs, increasing need to protect participants’ rights, and increasing need for study sponsors and researchers to abide by stringent safety protocols cannot be minimized for the sake of making data “more perfect”.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Anonymous (1842) Mortality, human. The Encyclopaedia Britannica, or Dictionary of Arts, Sciences, and General Literature 15(2): 513.

Anonymous (1865) The story of life insurance: Chapter 1. John Graunt, citizen and haberdasher. The Railway News and Joint Stock Journal 439–441: 441.

Auger N, Feuillet P, Martel S, et al. (2014) Mortality inequality in populations with equal life expectancy: Arriaga's decomposition method in SAS, Stata, and Excel. Annals of Epidemiology 24(8): 575–580.e1.

Barranco, Campaña, Medina (2008) Towards a fuzzy object-relational database model. In: Galindo (ed.) Handbook of Research on Fuzzy Information Processing in Databases, Hershey, PA: IGI Publishing.

Birch T (ed) (1759) Collection of Yearly Bills of Mortality, from 1657 to 1758 Inclusive, Together with Several Other Bills of an Earlier Date. London: Printed for A. Millar of the Strand, p. 11, 1655.

Byass, Sankoh, Tollman (2011) Lessons from history for designing and validating epidemiological surveillance in uncounted populations. PLoS One 6(8): e22897.

Cadilhac, Vos, Thrift (2014) Estimating the annual number of strokes and the issue of imperfect data: An example from Australia. International Journal of Stroke 9(1): 19–22.

Ciecka (2008) Edmond Halley’s life table and its uses. Journal of Legal Economics 15(1): 65–74.

Clark (2013) Sample design using imperfect design data. Journal of Survey Statistics and Methodology 1(1): 6–23.

Cleland (1836) A historical account of bills of mortality, and the probability of human life, in Glasgow and other large towns. In: Edward (ed.) A Historical Account of Bills of Mortality, and the Probability of Human Life, in Glasgow and Other Large Towns, Glasgow: University Press.

Cook, Benitez (2014) Evaluating the impact of genotype errors on rare variant tests of association. Frontiers in Genetics 5: 62.

de Bruijn, Synhaeve, van Rijsbergen (2014) Long-term cognitive outcome of ischaemic stroke in young adults. Cerebrovascular Disease 37(5): 376–381.

Dharmalingam, Amalraj (2013) Supervised learning in imperfect information game. International Journal of Advanced Research in Computer Science 4(1): 195.

Fok, Stewart, Hayes (2014) Predictors of natural and unnatural mortality among patients with personality disorder: Evidence from a large UK case register. PLoS One 9(7): e100979.

Franklin (2013) Probable opinion. In: Anstey (ed.) The Oxford Handbook of British Philosophy in the Seventeenth Century, New York, NY: Oxford University Press, pp. 366–369.

Graunt J (1662) Natural and Political Observations Mentioned in a Following Index and Made Upon the Bills of Mortality. London: Printed by Tho: Roycroft, for John Martyn, James Allestry, and Tho: Dicas, at the Sign of the Bell in St. Paul's Church-yard, 1662, pp. 11–15.

Heyde CC and Seneta E (eds) (2001) Statisticians of the Centuries. New York, NY: Springer Verlag, pp. 15–16.

Hoeymans, Harbers, Hilderink (2014) [Living longer, with more disease and less disability; trends in public health 2000–2030]. Ned Tijdschr Geneeskd 158(0): A7819.

Laplace PS (1786) “Sur les naissances, les mariages et les morts at Paris, from 1771 to 1784 & in the whole extent of France, during the years 1781 & 1782.” Mémoires de l'Académie Royale des Sciences 1783, pp. 693–702. Available at: http://cerebro.xu.edu/math/Sources/Laplace/naissances.pdf (accessed 25 September 2015).

Lumpkin (2002) History and significance of information systems and public health. In: O’Carrol (ed.) Public Health Informatics and Information Systems, New York, NY: Springer Science and Business Media, p. 24.

Maddison (2006) The World Economy, Paris, France: Development Centre of the Organisation for Economic Co-operation and Development, p. 397.

Mazur (2007) Evaluating the Science and Ethics of Research on Humans: A Guide for IRB Members, Baltimore, MD: Johns Hopkins University Press.

Newman (1956) Commentary on an ingenious army captain and on a generous and many-sided man. In: Newman (ed.) The World of Mathematics, vol. 3, Mineola, NY: Dover Publications, pp. 1416–1419.

Pearson (2005) Mining Imperfect Data: Dealing with Contamination and Missing Records, Philadelphia, PA: Society for Industrial and Applied Mathematics.

Plakht Y, Shiyovich A and Gilutz H (2014) Predictors of long-term (10-year) mortality postmyocardial infarction: Age-related differences. Soroka Acute Myocardial Infarction (SAMI) Project. Journal of Cardiology 65(3): 216–223.

Schneider (2000) The mathematization of chance in the middle of the 17th century. In: Grosholz, Breger (eds) The Growth of Mathematical Knowledge, Boston, MA: Kluwer Academic Publishers, pp. 63–64.

Smets P (1999) Imperfect Information: Imprecision – Uncertainty. IRIDIA, Université Libre de Bruxelles. Available at: http://sites.poli.usp.br/d/pmr5406/Download/papers/Imperfect_Data.pdf (accessed 10 July 2014).

Vilhuber L (2008) Adjusting imperfect data: Overview and case studies. In: The Structure of Wages: An International Comparison. National Bureau of Economic Research, Inc., Chicago: University of Chicago Press, pp. 59–80. Available at: http://EconPapers.repec.org/RePEc:nbr:nberch:2366 (accessed 10 July 2014).

Wattmo, Londos, Minthon (2014) Risk factors that affect life expectancy in Alzheimer's disease: A 15-year follow-up. Dementia and Geriatric Cognitive Disorders 38(5–6): 286–299.

Wheeler (2006) EMP (Evaluating the Measurement Process) III: Using Imperfect Data, Knoxville, TN: Statistical Process Control Press.

Wrigley EA and Schofield RS (1989) The Population History of England 1541–1871: A Reconstruction. New York: Cambridge University Press, p. 77.