Near the end of my June 2013 President's letter to the members of the International Association of Survey Statisticians, I described what I saw then as the "shape" of the science that survey statistics was evolving toward. I am still comfortable with that description: "Target populations are more dynamic, much less clearly defined (think networks) and much harder to measure. Furthermore, resources that can be targeted at a particular sample unit are much smaller. As a consequence, sampling design is moving away from its traditional emphasis on how selection is implemented to how samples from quite different sources, and of quite varying provenance, can be quickly integrated. Sampling inference will have to adapt to this new data collection paradigm, with the importance of sampling error much diminished, and a real need to come to grips with how basic ideas like uncertainty should be characterized in the resulting confusing mix of non-response errors, linking errors, measurement errors and model specification errors."
I am grateful to the editors of the Journal of Official Statistics for giving me the opportunity to expand here on some of the research implications arising from this change. It is well known that realized sample sizes are dropping in many parts of the world. The reasons for this vary but one consequence has been an explosion of interest in research related to small sample inference. But there has been another important change, reflecting the way survey data are assembled. More precisely, with Big Data now well and truly embedded in our social, political, economic and environmental infrastructure, there has been an accompanying increase in data sets that are not obtained via direct enumeration but are constructed by a process of integration of samples, registers, and other data sources, each containing a subset of the survey variables of interest. As with small sample inference, dealing with this relatively new data collection paradigm has inevitably required a model-based approach. We can no longer use the classical sampling ideas of the past century as our inferential framework, and a more complicated conceptual scaffolding is required. This is clear from the collection of papers contained in Zhang and Chambers (2019).
The clue to what this scaffolding entails is in the last sentence of the quote above. Coverage issues due to non-response in contributing sample data, as well as non-collection due to administrative procedures, need consideration. Measurement issues associated with the disparate data sources need resolution, since variables purporting to measure the same thing in different sources can in fact be measuring quite different things. Entity resolution errors arising in the integration process need to be modeled and included in the sample space of potential outcomes. And, of course, the data models used for inference need to be specified so that they bear some resemblance to reality. That is, they are fit for purpose.
It is easy to be overwhelmed by the first two of these issues, coverage and measurement error, representing as they do two Jumbos in the collection of Basu-grade circus elephants that now inhabit the survey inference living room. And I make no claim to having insight into how to ship these two colossi out of the room. Instead, I will in this short note focus on the single Sambo-sized elephant corresponding to the third issue, entity resolution, which has proved amenable to analysis over the last fifteen years. My discussion will focus on what I consider to be some of the challenges of eventually moving this elephant out of the inferential living room, or at least substantially downsizing it. It will therefore not be complete, or even sufficient. For example, I ignore the strong Bayesian developments in this area; see Goldstein et al. (2012) and Briscolini et al. (2018). But it may help focus necessary research.
Entity resolution occurs when one decides whether a population record obtained via data integration constitutes a valid record, that is, whether the values of the variables defining the record are the values of an existing unit in the target population. Note that entity de-duplication, which ensures that distinct records sourced via integration correspond to distinct entities in the target population, is also often viewed as part of this resolution process, though data fusion, where potential entities are created rather than real entities recovered, is not. Record linkage, or just linkage, is the procedure that identifies those records in separate data sources that are records for the same entity; that is, it leads to entity resolution. However, that does not mean that it leads to correct resolution, or that there is no duplication, or that linkage is complete, that is, that all records in the target population are represented among the linked records.
A very useful model for linkage is to assume that the same p variables are defined for the N entities making up the target population and the M entities making up the integrated data set. Observe in passing that this also assumes away the second of the Jumbo-sized issues referenced earlier. Let the records in both sources be grouped into Q blocks on the basis of these variables, so that links are only made between records in the same block, and let $N_q$ denote the number of records in block $q$. Provided linkage within a block is complete and one-to-one, the vector $\mathbf{y}_q^*$ of linked values of a variable of interest in block $q$ can then be written as $\mathbf{y}_q^* = \mathbf{A}_q \mathbf{y}_q$, where $\mathbf{y}_q$ contains the correctly ordered values for the $N_q$ entities in the block and $\mathbf{A}_q$ is an unknown $N_q \times N_q$ permutation matrix representing the outcome of the linkage process.
There is no claim that the linkage model described in the preceding paragraph is comprehensive. But, like any model, it is useful when its assumptions are at least approximately valid, allowing us to explore methods for inference using linked data that explicitly incorporate uncertainty due to the linkage process. This is because one can now use knowledge about the linkage process to model the distribution of the permutation matrices $\mathbf{A}_q$, and through them the relationship between the linked data values and the values that would have been observed under perfect linkage.
An early model (Chambers 2009) for $\mathbf{A}_q$ is the exchangeable linkage errors (ELE) model, under which any record in block $q$ is correctly linked with probability $\lambda_q$ and is incorrectly linked to any one of the other $N_q - 1$ records in the block with equal probability $(1 - \lambda_q)/(N_q - 1)$.
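To make this concrete (and assuming, as is usual, that linkage is non-informative, so that $\mathbf{A}_q$ can be treated as independent of the values being linked), the ELE model implies an expected linkage matrix of exchangeable form,
$$
\mathbf{E}_q = E(\mathbf{A}_q) = \left(\lambda_q - \frac{1-\lambda_q}{N_q-1}\right)\mathbf{I}_{N_q} + \frac{1-\lambda_q}{N_q-1}\,\mathbf{1}_{N_q}\mathbf{1}_{N_q}^{\top},
$$
so that if a linear data model $E(\mathbf{y}_q \mid \mathbf{X}_q) = \mathbf{X}_q\boldsymbol{\beta}$ holds for the correctly ordered data, and the covariates in $\mathbf{X}_q$ are not themselves affected by linkage error, then $E(\mathbf{y}_q^{*} \mid \mathbf{X}_q) = \mathbf{E}_q\mathbf{X}_q\boldsymbol{\beta}$. It is this identity that underlies the bias corrections referred to later in this note.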
The ELE model for linkage should be adequate if the block size $N_q$ is not large, since by construction all records in the block are similar in terms of the values of their matching variables. When blocks are large, however, it may be necessary to allow heterogeneity in the linkage error probabilities, for example by making them appropriately parameterized functions of the record matching weights used in linking. Taken to the extreme, each row of $\mathbf{E}_q = E(\mathbf{A}_q)$ could then be allowed its own set of linkage error probabilities, so that the probability of a correct link varies from record to record within the block.
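The following minimal sketch (in Python, with entirely hypothetical block size, regression parameters and value of $\lambda_q$) illustrates the homogeneous ELE case: it simulates mislinked responses in a single block and compares a naive least squares fit, which is attenuated, with a corrected fit based on $\mathbf{E}_q$. The mislinking mechanism used here is a simplification that reproduces the row-wise ELE expectation rather than an exact one-to-one permutation.

```python
import numpy as np

rng = np.random.default_rng(2025)

# Hypothetical block: N_q records, one covariate, linear model y = beta * x + e.
N_q, beta_true, sigma, lam_q = 200, 2.0, 1.0, 0.85
x = rng.normal(size=N_q)
y = beta_true * x + rng.normal(scale=sigma, size=N_q)

# Simplified ELE-style mislinking: each record is correctly linked with
# probability lam_q; otherwise its linked response is drawn uniformly from
# the other records in the block (this matches the row-wise ELE expectation,
# but relaxes the one-to-one permutation constraint for simplicity).
linked = np.arange(N_q)
for i in np.where(rng.random(N_q) >= lam_q)[0]:
    j = rng.integers(N_q - 1)
    linked[i] = j + (j >= i)          # uniform over all indices except i
y_star = y[linked]

# ELE expectation matrix E_q = E(A_q): lam_q on the diagonal,
# (1 - lam_q)/(N_q - 1) off the diagonal.
off = (1.0 - lam_q) / (N_q - 1)
E_q = (lam_q - off) * np.eye(N_q) + off * np.ones((N_q, N_q))

# Naive OLS ignores linkage error and is attenuated towards zero;
# the corrected estimator solves X'(y* - E_q X b) = 0.
X = x.reshape(-1, 1)
b_naive = np.linalg.solve(X.T @ X, X.T @ y_star)
b_corr = np.linalg.solve(X.T @ E_q @ X, X.T @ y_star)
print(f"naive: {b_naive.item():.3f}  corrected: {b_corr.item():.3f}  truth: {beta_true}")
```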
Using an ELE model (or a heterogeneous extension) to characterize linkage requires information on the parameters of this model. Typically, the data model of interest is a conditional one, that is, it specifies the distribution of a response variable given a set of covariates, with the response and the covariates often sourced from different data sets. The linkage error probabilities $\lambda_q$ are then nuisance parameters that must be supplied from outside the analysis, for example from the diagnostics produced by the linking operation or from an audit sample of links that have been clerically checked.
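For example, if an audit sample of links in block $q$ is clerically reviewed, a simple moment-based estimate of $\lambda_q$, together with a rough standard error, might be computed as follows (the audit counts here are hypothetical):

```python
import numpy as np

# Hypothetical audit for block q: n_audit links are clerically reviewed,
# and n_correct of them are confirmed to be correct links.
n_audit, n_correct = 150, 128

lam_hat = n_correct / n_audit                             # estimate of lambda_q
se_hat = np.sqrt(lam_hat * (1.0 - lam_hat) / n_audit)     # binomial standard error
print(f"lambda_q estimate: {lam_hat:.3f} (s.e. {se_hat:.3f})")
```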
An alternative stream of research, in the statistics and optimization literature, has focused on an approach with weaker information requirements, but more restrictive assumptions. This approach does not directly model the source of the linkage errors. Instead, the linked records in a block are treated as a mixture of correctly linked records, for which the data model of interest holds, and incorrectly linked records, for which the response is effectively unrelated to the covariates, with the proportion $\gamma_q$ of correct links estimated as part of the model fit.
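One common way of formalizing this approach (the details differ across the literature, so this should be read as a sketch rather than a canonical specification) is to write the density of a linked response in block $q$ as a two-component mixture,
$$
f(y_i^{*} \mid \mathbf{x}_i) = \gamma_q\, f_{\theta}(y_i^{*} \mid \mathbf{x}_i) + (1-\gamma_q)\, g(y_i^{*}),
$$
where $f_{\theta}$ is the data model of interest, $g$ is the distribution of a mislinked response (often taken to be the marginal distribution of the response within the block), and $\gamma_q$ is the proportion of correct links. The mixture can be fitted directly to the linked data, for example via an EM-type algorithm, which is why its information requirements are weaker; the price is the stronger assumption that a mislinked response carries no information about the covariates of the record to which it is attached.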
More research is required on choosing between the ELE model and the mixture model in any given situation. Since $\lambda_q$ and $\gamma_q$ should be closely related, it may also be that one can use the mixture model to "boot up" the ELE model by substituting the resulting estimate of $\gamma_q$ for $\lambda_q$. Another area where further research is necessary is when linkage is informative. This is very similar to dealing with informative sampling and informative non-response, so progress can be expected to be slow. However, there is one common situation where this type of linkage occurs, and this is when a sample data set taken from one population register is then linked to another population register. In this case it is expected that the links formed by this process will not be the same as the links that would be formed if the two population registers were first linked and the same sample then taken from the linked register, which is a requirement for sampling and linkage to be mutually non-informative. Determining an appropriate model for when "sample then link" and "link then sample" lead to different outcomes remains an outstanding problem.
Given the current interest in small sample inference, it is not surprising that much of the recent research on analysis using integrated data has focused on small sample inference using these data. Here we can distinguish two broad streams of research. The first assumes that linked unit record data are available, and that area identifiers and model covariates do not suffer from linkage errors. In this case there is a growing body of evidence that using models for linkage error (e.g., the ELE model or the mixture model) can effectively adjust for much of the attenuation bias caused by linkage error. See Salvati et al. (2021), Chambers et al. (2022) and Slawski et al. (2025). But it could also be that the area identifiers, rather than the model variables, are subject to linkage error. As far as I know, little research has been carried out for this situation. Perhaps more important, however, is the impact of linkage on the other broad stream of small sample research, where the integrated unit record data are not publicly released, but instead estimates based on small samples from these data are released. Linkage error (or entity ambiguity more generally) is a unit or record level phenomenon. But in this case we do not have record level integrated data, only summary statistics based on them. This is a problem that is only just starting to get the attention it deserves.
One type of research that is notable for its absence as far as analysis of integrated data is concerned is the development of user-friendly software that can fit standard models (e.g., linear or generalized linear mixed models) to these data. This seems a binding constraint on the usefulness of any of the methods for analyzing integrated data that have appeared in the literature. It is hard to see users with little or no methodological interest in adjusting for entity ambiguity issues taking the time to modify their analyses without access to appropriate tools. Given that the use of survey weights to correct for selection effects is ubiquitous in analysis, I have been asked whether a similar "weighting" approach would work with linked data. Unfortunately, I do not see how attaching a single fixed set of weights to an integrated data set can provide a general correction for the attenuation bias induced by linkage error, since this bias depends on the block specification as well as on the linkage model specification. However, this does not mean that, for a particular choice of such specifications, one could not compute a set of weights that recovers the results of an analysis that explicitly allows for the corresponding linkage errors. But one still needs to develop the software that allows a secondary analyst to compute these weights. See Chipperfield (2019).
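To illustrate the difficulty with a single fixed set of weights, consider again the linear, homogeneous ELE setting sketched earlier (this is an illustration only, not a general result). A linkage-error-adjusted fit solves the estimating equation
$$
\sum_q \mathbf{X}_q^{\top}\bigl(\mathbf{y}_q^{*} - \mathbf{E}_q \mathbf{X}_q \boldsymbol{\beta}\bigr) = \mathbf{0},
\qquad\text{so that}\qquad
\hat{\boldsymbol{\beta}} = \Bigl(\sum\nolimits_q \mathbf{X}_q^{\top}\mathbf{E}_q\mathbf{X}_q\Bigr)^{-1}\sum\nolimits_q \mathbf{X}_q^{\top}\mathbf{y}_q^{*}.
$$
Any set of unit-level weights that reproduces this fit must embed the $\mathbf{E}_q$, and so will change whenever the block specification or the assumed linkage error model changes, and indeed whenever the model being fitted changes. This is why such weights can only be constructed for a particular analysis, and why software support for computing them is needed.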
Finally, it would be remiss of me not to point out that many of the issues that arise when dealing with entity ambiguity caused by linkage error exist because there is typically little or no linkage error-related paradata provided with public use integrated data sets. I find this strange, since holding back this information is not dissimilar to releasing public use sample survey data with no information about how the sample was selected. Unlike primary analysts, users of integrated data are overwhelmingly secondary analysts who are in no position to identify the uncertainty introduced by the linkage processes used to create these data. Even providing "official" audit-based estimates of the parameters $\lambda_q$ or $\gamma_q$ would be beneficial, since such users would then at least be able to check how modifying their analyses to include the impact of linkage error changes their conclusions. Hopefully this state of affairs will change in the future.
