Alternative Sources and Machine Learning for Official Statistics: A Review

Abstract

In 2023, Wiley published a book entitled “Advances in Business Statistics, Methods and Data Collection.” The 856 pages in the book provide an extensive overview of the currently available methods, new developments, and challenges when producing high-quality establishment statistics (Snijkers et al. 2023). The book is divided into seven sections, each of which discusses a major theme. In this review, we focus on section 5, which deals with “Topics in the Use of New Data Sources and New Technologies” and is composed of five chapters. Words that best describe this section are Data Science, Machine Learning, and Big Data.

The field of data science is young and dynamic and is still being developed. Presenting the advantages of data science-based approaches to statisticians poses an additional challenge. This challenge is particularly evident in the chapters dedicated to machine learning (ML; Chapters 23–25) which, albeit rich in insight, may prove somewhat daunting for statisticians to digest fully. Despite the authors’ undisputed expertise in ML, we believe that a more methodological way of describing ML would enhance its understanding by statisticians (Puts and Daas 2021). AI, and with it ML, is a field that is mainly built on heuristics. This makes it, purely from a methodological or mathematical point of view, disputable. For instance, on page 582, probabilities are calculated by renormalizing, which can be done under certain assumptions, but can be questionable here. This is, however, how such a heuristic approach works. Furthermore, the terms “methods” and “methodology” seem to be used somewhat loosely, which contributes to potential confusion. Consequently, we anticipate that the book’s primary audience, consisting largely of statisticians, may encounter difficulty in grasping the overarching objectives presented within this part of section 5.
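As a generic illustration of what renormalizing probabilities usually means (this is not the book's calculation on page 582, which is not reproduced here), the sketch below restricts a distribution to a subset of classes and rescales it to sum to one. The function name `renormalize` and the toy numbers are assumptions made for this example only. The result can be read as a conditional probability only if the original probabilities are trustworthy and the true class is known to lie in the retained subset; without those assumptions the rescaled numbers are just a heuristic.

```python
def renormalize(probs, keep):
    """Restrict `probs` (class -> probability) to `keep` and rescale to sum to 1.

    Hypothetical helper for illustration; not taken from the reviewed book.
    """
    total = sum(probs[c] for c in keep)
    if total <= 0:
        raise ValueError("retained classes carry no probability mass")
    return {c: probs[c] / total for c in keep}


# Toy distribution over three classes (made-up numbers).
full = {"A": 0.5, "B": 0.3, "C": 0.2}

# Drop class "C" and rescale what is left.
restricted = renormalize(full, ["A", "B"])
print(restricted)
```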

Therefore, this review will start with a short introduction to ML and ML methodology from a statistical perspective. Here, the focus is on creating common ground between the world of data science and the world of statistics. After that, we will discuss the ML chapters (23, 24, and 25), followed by the chapters on the use of secondary data (i.e., Chapters 22 and 26).

ML encompasses various techniques, often referred to as “a methodology” within the realm of data science. However, most statisticians, as well as most scientists, would disagree with the usage of this term in this way. In social sciences and econometrics, “methodology,” “methods,” and “techniques” carry specific and distinct meanings, which we advocate adhering to when talking about ML from an (official) statistical point of view. This becomes most apparent when we look at the term “methods.” We have observed that the term “methods” is incorrectly used in many applications of ML. It is often used to merely describe the chosen environment in which a study has been performed. So, when summarizing the ML algorithms and hyperparameters used, maybe with some kind of rationale, many data scientists assume that this suffices to describe the “method” used. Such a description, however, falls short when viewed from an official statistical perspective (as will become clear later on).

So let’s start with the basics. Techniques, and how we combine them, are primarily determined by the “how” question. It is this question that is at the core of the techniques themselves: “how do we go about performing certain steps?” In addition to the algorithmic description, a technique specifies the (pre- and post-) conditions under which it can be applied. Methods, especially in the context of official statistics, refer to the so-called “why” and “what” questions. For instance, in survey methodology, a technique could refer to “how a random sample is taken,” whereas stratified sampling is a method that refers to a procedure for dividing the population into several subgroups. In terms of these definitions, a technique in ML is logistic regression or k-fold cross validation, whereas a procedure to select features during the training of a model, for instance with logistic regression, is considered a method. From this, we can define a method as:

A method is a systematic procedure of techniques for accomplishing a certain goal.

Most of the time these methods are established. A technique is defined as:

A technique is a way of carrying out a specific task.
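The two definitions can be made concrete with a minimal sketch. It assumes a toy dataset and uses a nearest-centroid classifier in place of logistic regression so that it runs with the standard library only. All function names are hypothetical. The k-fold split and the classifier are techniques; the forward feature selection that combines them toward a goal is a method.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Technique: split the indices 0..n-1 into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def centroid_accuracy(X, y, train, test, features):
    """Technique: fit a nearest-centroid classifier on `train`, score it on `test`."""
    centroids = {}
    for label in set(y[i] for i in train):
        rows = [X[i] for i in train if y[i] == label]
        centroids[label] = [mean([r[f] for r in rows]) for f in features]

    def predict(row):
        # Assign the class whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda c: sum((row[f] - m) ** 2 for f, m in zip(features, centroids[c])),
        )

    return mean([1.0 if predict(X[i]) == y[i] else 0.0 for i in test])


def cv_score(X, y, features, k=5):
    """Average accuracy over k folds for a given feature subset."""
    folds = k_fold_indices(len(X), k)
    scores = []
    for j, test in enumerate(folds):
        train = [i for m, fold in enumerate(folds) if m != j for i in fold]
        scores.append(centroid_accuracy(X, y, train, test, features))
    return mean(scores)


def forward_selection(X, y, k=5, min_gain=0.01):
    """Method: a systematic procedure of techniques with a goal
    (a small feature set that predicts well under cross validation)."""
    selected, best = [], 0.0
    remaining = list(range(len(X[0])))
    while remaining:
        score, f = max((cv_score(X, y, selected + [f], k), f) for f in remaining)
        if score - best < min_gain:
            break  # stopping rule: no worthwhile improvement
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected, best


# Toy data (assumption): feature 0 separates the classes, feature 1 is noise.
rng = random.Random(42)
y = [i % 2 for i in range(100)]
X = [[label * 4.0 + rng.gauss(0, 0.5), rng.gauss(0, 1.0)] for label in y]

selected, score = forward_selection(X, y)
print(selected, score)
```

Listing the classifier and the value of `k` would describe only the techniques. The method is the procedure in `forward_selection`, including its stopping rule and the reason for it.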

Most of the time techniques are described as algorithms. Let’s look at an analogy in a totally different context: preparing a meal. When one prepares a meal this may, for instance, involve cutting vegetables, boiling eggs, and grilling steaks. Recipes can be considered methods; for example, they describe the procedure that needs to be followed to obtain a component of a meal and, as such, they will include various techniques. A method will also take into account the context in which one prepares a meal, as well as any additional circumstances that may arise (e.g., a guest may be vegan or allergic to certain ingredients) when preparing a meal. The overall question “What is the best way to use a specific technique with a specific set of ingredients under certain circumstances to prepare a specific meal?” is described by a methodology. A methodology is a system of methods. When it comes to ML, we state that this field of science is lacking a methodology. The fact that it is sometimes stated that “data science (read ML) is more an art than a science” indicates this. Data science, and consequently also ML, should evolve from an art to a real science. This is actually quite common in science, as described by Knuth (1974). Knuth starts his paper with a quote made by the editorial board of ACM in 1959: “if computer programming is to become an important part of computer research and development, a transition of programming from an art to a disciplined science must be effected.” In our era, we can claim the same for data science, and more specifically, ML. The ML chapters of the section should be read in the light of this observation. When reading the chapters, we observed how the authors are searching for the right approach, and they seem to be successful in their quest.

So what is ML? ML can best be described as the application of a set of heuristics to a dataset (commonly referred to as a training set), via a procedure, in such a way that a model is created that enables the prediction of future outcomes, as well as possible, on new data. To use an analogy: the algorithm creates a viewing lens on the data to predict future outcomes. The procedure used is a method. ML can be applied in several ways, and this is described in Chapters 23, 24, and 25. These chapters provide a good overview of the possibility of using ML in official statistics, and it is nice to see that they have been written by the forerunners of its use.
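The description above can be sketched in a few lines. The example assumes synthetic one-dimensional data and a hand-written logistic regression fitted by gradient descent. The names `fit_logistic` and `accuracy`, the learning rate, and the number of epochs are illustrative choices. The model is created from the training set only and then judged on data it has not seen.

```python
import math
import random


def fit_logistic(xs, ys, lr=0.1, epochs=500):
    """Fit p(y=1|x) = sigmoid(w*x + b) by plain gradient descent on the log-loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(model, xs, ys):
    """Share of cases where the predicted class equals the observed class."""
    w, b = model
    preds = [1 if w * x + b > 0 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)


# Synthetic data (assumption): two classes with means -1.5 and +1.5.
rng = random.Random(0)
labels = [i % 2 for i in range(200)]
values = [rng.gauss(1.5 if lab else -1.5, 1.0) for lab in labels]

# The first 150 cases form the training set; the rest play the role of new data.
train_x, train_y = values[:150], labels[:150]
new_x, new_y = values[150:], labels[150:]

model = fit_logistic(train_x, train_y)
train_acc = accuracy(model, train_x, train_y)
new_acc = accuracy(model, new_x, new_y)
print(train_acc, new_acc)
```

Only `new_acc` says something about how the model behaves on unseen data, and then only as far as the new data resemble the data the model will meet in production.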

Chapter 23 gives a nice introduction to ML and an overview of how it is used in official statistics, both in general and specifically at Destatis. The examples describe two different areas of application: increasing analysis capabilities, and editing and imputation. In the first example on analysis capabilities, the author shows nicely how ML was used to find people who were affected when the minimum wage was introduced in Germany. The author describes how the model was constructed and validated at a very general level. With respect to editing and imputation, the author describes two studies. One study is on the editing and imputation in the new digital earnings survey. Implicitly, the author asks a very important question in this chapter: How can we assure the quality of the imputations in order to replace more traditional imputation methods?

Chapters 24, 25, and 26 give overviews of applications of ML work in the area of official statistics, one at the US Bureau of Labor Statistics and two at the US Census Bureau. Whereas the first two chapters focus on machine learning, the last focuses on new data sources. In the work performed at the US Bureau of Labor Statistics, an application of automated coding is discussed that makes use of natural language processing techniques in the Survey of Occupational Injuries and Illnesses (SOII), an annual survey performed among US establishments. The focus is on developing a procedure that assures that an ML model is developed that performs well on new, unseen data. We describe this as the external validity of the model. The same struggle can be observed in the two applications of the US Census Bureau described in Chapters 25 and 26. Here, assuring that the model developed is externally valid is at the core of the work. This illustrates the definite need for rigorous procedures in the area of ML to assure that the best, and most reliable, results are obtained. The most challenging question that the authors in each of the three chapters try to answer is “how to validate the findings obtained from new data sources?” This is indeed challenging as a direct comparison with other findings is not possible. One last remark on Chapter 25: several times it refers to Chapter 23, but it seems it is actually referring to Chapter 24.

Since we have not yet discussed the introductory Chapter 22, we do so here. This chapter discusses the “foundation” that enables the (successful) use of secondary data, for example, organic, Big, and administrative data, for statistics production. A model is introduced, called BUilding Blocks for enabling MICro data access (abbreviated as BUBMIC), that enables users to determine the benefit of a new data source. A major part of the model is based on what is called the “five safes” approach. It is a pity that this term is explained much later in the chapter; for the curious reader to whom this term is unfamiliar, the “five safes” are safe projects, safe people, safe settings, safe data, and safe output. The highlight of the chapter is Section 22.4, in which the model is applied to a real-world example. Here, everything discussed before becomes apparent.

Overall, the chapters were engaging and evoked a sense of fondness, reminding us of our pioneering work on big data in our office. Considerable progress has been made since then. For us, it was a pleasant read. While the content may pose a challenge for statisticians, it offers a rewarding read for (i) those embarking on their careers as young data scientists in official statistics and for (ii) statisticians starting their journey as data scientists.

ORCID iD: Marco Puts

References

Knuth, D. E. 1974. “Computer Programming as an Art.” Communications of the ACM 17: 667–73. DOI: https://doi.org/10.1145/361604.361612.

Puts and Daas. 2021. “Machine Learning from the Perspective of Official Statistics.” The Survey Statistician 84: 12–7. http://isi-iass.org/home/wp-content/uploads/Survey_Statistician_2021_July_N84_02.pdf

Snijkers, Bavdaz, Bender, Jones, MacFeely, Sakshaug, Thompson, and van Delden. 2023. Advances in Business Statistics, Methods and Data Collection. Hoboken, NJ: Wiley. DOI: https://doi.org/10.1002/9781119672333.