Near the end of my June 2013 President's letter to the members of the International Association of Survey Statisticians, I described what I saw then as the "shape" of the science that survey statistics was evolving toward. I am still comfortable with that description: "Target populations are more dynamic, much less clearly defined (think networks) and much harder to measure. Furthermore, resources that can be targeted at a particular sample unit are much smaller. As a consequence, sampling design is moving away from its traditional emphasis on how selection is implemented to how samples from quite different sources, and of quite varying provenance, can be quickly integrated. Sampling inference will have to adapt to this new data collection paradigm, with the importance of sampling error much diminished, and a real need to come to grips with how basic ideas like uncertainty should be characterized in the resulting confusing mix of non-response errors, linking errors, measurement errors and model specification errors."
I am grateful to the editors of the Journal of Official Statistics for giving me the opportunity to expand here on some of the research implications arising from this change. It is well known that realized sample sizes are dropping in many parts of the world. The reasons for this vary but one consequence has been an explosion of interest in research related to small sample inference. But there has been another important change, reflecting the way survey data are assembled. More precisely, with Big Data now well and truly embedded in our social, political, economic and environmental infrastructure, there has been an accompanying increase in data sets that are not obtained via direct enumeration but are constructed by a process of integration of samples, registers, and other data sources, each containing a subset of the survey variables of interest. As with small sample inference, dealing with this relatively new data collection paradigm has inevitably required a model-based approach. We can no longer use the classical sampling ideas of the past century as our inferential framework, and a more complicated conceptual scaffolding is required. This is clear from the collection of papers contained in Zhang and Chambers (2019).
The clue to what this scaffolding entails is in the last sentence of the quote above. Coverage issues due to non-response in contributing sample data, as well as non-collection due to administrative procedures, need consideration. Measurement issues associated with the disparate data sources need resolution, since variables purporting to measure the same thing in different sources can in fact be measuring quite different things. Entity resolution errors arising in the integration process need to be modeled and included in the sample space of potential outcomes. And, of course, the data models used for inference need to be specified so that they bear some resemblance to reality. That is, they are fit for purpose.
It is easy to be overwhelmed by the first two of these issues, coverage and measurement error, representing as they do two Jumbos in the collection of Basu-grade circus elephants that now inhabit the survey inference living room. And I make no claim to having insight into how to ship these two colossi out of the room. Instead, I will in this short note focus on the single Sambo-sized elephant corresponding to the third issue, entity resolution, which has proved amenable to analysis over the last fifteen years. My discussion will focus on what I consider to be some of the challenges of eventually moving this elephant out of the inferential living room, or at least substantially downsizing it. It will therefore not be complete, or even sufficient. For example, I ignore the strong Bayesian developments in this area; see Goldstein et al. (2012) and Briscolini et al. (2018). But it may help focus necessary research.
Entity resolution occurs when one decides whether a population record obtained via data integration constitutes a valid record, that is, whether the values of the variables defining the record are the values of an existing unit in the target population. Note that entity de-duplication, which ensures that distinct records sourced via integration correspond to distinct entities in the target population, is also often viewed as part of this resolution process, though data fusion, where potential entities are created rather than real entities recovered, is not. Record linkage, or just linkage, is the procedure that identifies those records in separate data sources that are records for the same entity; that is, it leads to entity resolution. However, that does not mean that it leads to correct resolution, or that there is no duplication, or that linkage is complete, that is, that all records in the target population are represented among the linked records.
A very useful model for linkage is to assume that the same p variables are defined for the N entities making up the target population and the M entities making up the integrated data set. Observe in passing that this also assumes away the second of the Jumbo-sized issues referenced earlier. Let the records in both sources be grouped into Q blocks on the basis of these variables, so that links are only made between records in the same block, and let $N_q$ denote the number of records in block $q$. Provided linkage within a block is complete and one-to-one, the vector $\mathbf{y}_q^*$ of linked values of a variable of interest in block $q$ can then be written as $\mathbf{y}_q^* = \mathbf{A}_q \mathbf{y}_q$, where $\mathbf{y}_q$ contains the correctly ordered values for the $N_q$ entities in the block and $\mathbf{A}_q$ is an unknown $N_q \times N_q$ permutation matrix representing the outcome of the linkage process.
There is no claim that the linkage model described in the preceding paragraph is comprehensive. But, like any model, it is useful when its assumptions are at least approximately valid, allowing us to explore methods for inference using linked data that explicitly incorporate uncertainty due to the linkage process. This is because one can now use knowledge about the linkage process to model the distribution of the permutation matrices $\mathbf{A}_q$, and through them the relationship between the linked data values and the values that would have been observed under perfect linkage.
An early model (Chambers 2009) for $\mathbf{A}_q$ is the exchangeable linkage errors (ELE) model, under which any record in block $q$ is correctly linked with probability $\lambda_q$ and is incorrectly linked to any one of the other $N_q - 1$ records in the block with equal probability $(1 - \lambda_q)/(N_q - 1)$.
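To make this concrete (and assuming, as is usual, that linkage is non-informative, so that $\mathbf{A}_q$ can be treated as independent of the values being linked), the ELE model implies an expected linkage matrix of exchangeable form,
$$
\mathbf{E}_q = E(\mathbf{A}_q) = \left(\lambda_q - \frac{1-\lambda_q}{N_q-1}\right)\mathbf{I}_{N_q} + \frac{1-\lambda_q}{N_q-1}\,\mathbf{1}_{N_q}\mathbf{1}_{N_q}^{\top},
$$
so that if a linear data model $E(\mathbf{y}_q \mid \mathbf{X}_q) = \mathbf{X}_q\boldsymbol{\beta}$ holds for the correctly ordered data, and the covariates in $\mathbf{X}_q$ are not themselves affected by linkage error, then $E(\mathbf{y}_q^{*} \mid \mathbf{X}_q) = \mathbf{E}_q\mathbf{X}_q\boldsymbol{\beta}$. It is this identity that underlies the bias corrections referred to later in this note.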
The ELE model for linkage should be adequate if the block size $N_q$ is not large, since by construction all records in the block are similar in terms of the values of their matching variables. When blocks are large, however, it may be necessary to allow heterogeneity in the linkage error probabilities, for example by making them appropriately parameterized functions of the record matching weights used in linking. Taken to the extreme, each row of $\mathbf{E}_q = E(\mathbf{A}_q)$ could then be allowed its own set of linkage error probabilities, so that the probability of a correct link varies from record to record within the block.
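The following minimal sketch (in Python, with entirely hypothetical block size, regression parameters and value of $\lambda_q$) illustrates the homogeneous ELE case: it simulates mislinked responses in a single block and compares a naive least squares fit, which is attenuated, with a corrected fit based on $\mathbf{E}_q$. The mislinking mechanism used here is a simplification that reproduces the row-wise ELE expectation rather than an exact one-to-one permutation.

```python
import numpy as np

rng = np.random.default_rng(2025)

# Hypothetical block: N_q records, one covariate, linear model y = beta * x + e.
N_q, beta_true, sigma, lam_q = 200, 2.0, 1.0, 0.85
x = rng.normal(size=N_q)
y = beta_true * x + rng.normal(scale=sigma, size=N_q)

# Simplified ELE-style mislinking: each record is correctly linked with
# probability lam_q; otherwise its linked response is drawn uniformly from
# the other records in the block (this matches the row-wise ELE expectation,
# but relaxes the one-to-one permutation constraint for simplicity).
linked = np.arange(N_q)
for i in np.where(rng.random(N_q) >= lam_q)[0]:
    j = rng.integers(N_q - 1)
    linked[i] = j + (j >= i)          # uniform over all indices except i
y_star = y[linked]

# ELE expectation matrix E_q = E(A_q): lam_q on the diagonal,
# (1 - lam_q)/(N_q - 1) off the diagonal.
off = (1.0 - lam_q) / (N_q - 1)
E_q = (lam_q - off) * np.eye(N_q) + off * np.ones((N_q, N_q))

# Naive OLS ignores linkage error and is attenuated towards zero;
# the corrected estimator solves X'(y* - E_q X b) = 0.
X = x.reshape(-1, 1)
b_naive = np.linalg.solve(X.T @ X, X.T @ y_star)
b_corr = np.linalg.solve(X.T @ E_q @ X, X.T @ y_star)
print(f"naive: {b_naive.item():.3f}  corrected: {b_corr.item():.3f}  truth: {beta_true}")
```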
Using an ELE model (or a heterogeneous extension) to characterize linkage requires information on the parameters of this model. Typically, the data model of interest is a conditional one, that is, it specifies the distribution of a response variable given a set of covariates, with the response and the covariates often sourced from different data sets. The linkage error probabilities $\lambda_q$ are then nuisance parameters that must be supplied from outside the analysis, for example from the diagnostics produced by the linking operation or from an audit sample of links that have been clerically checked.
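For example, if an audit sample of links in block $q$ is clerically reviewed, a simple moment-based estimate of $\lambda_q$, together with a rough standard error, might be computed as follows (the audit counts here are hypothetical):

```python
import numpy as np

# Hypothetical audit for block q: n_audit links are clerically reviewed,
# and n_correct of them are confirmed to be correct links.
n_audit, n_correct = 150, 128

lam_hat = n_correct / n_audit                             # estimate of lambda_q
se_hat = np.sqrt(lam_hat * (1.0 - lam_hat) / n_audit)     # binomial standard error
print(f"lambda_q estimate: {lam_hat:.3f} (s.e. {se_hat:.3f})")
```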
An alternative stream of research, in the statistics and optimization literature, has focused on an approach with weaker information requirements, but more restrictive assumptions. This approach does not directly model the source of the linkage errors. Instead, the linked records in a block are treated as a mixture of correctly linked records, for which the data model of interest holds, and incorrectly linked records, for which the response is effectively unrelated to the covariates, with the proportion $\gamma_q$ of correct links estimated as part of the model fit.
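One common way of formalizing this approach (the details differ across the literature, so this should be read as a sketch rather than a canonical specification) is to write the density of a linked response in block $q$ as a two-component mixture,
$$
f(y_i^{*} \mid \mathbf{x}_i) = \gamma_q\, f_{\theta}(y_i^{*} \mid \mathbf{x}_i) + (1-\gamma_q)\, g(y_i^{*}),
$$
where $f_{\theta}$ is the data model of interest, $g$ is the distribution of a mislinked response (often taken to be the marginal distribution of the response within the block), and $\gamma_q$ is the proportion of correct links. The mixture can be fitted directly to the linked data, for example via an EM-type algorithm, which is why its information requirements are weaker; the price is the stronger assumption that a mislinked response carries no information about the covariates of the record to which it is attached.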
More research is required on choosing between the ELE model and the mixture model in any given situation. Since $\lambda_q$ and $\gamma_q$ should be closely related, it may also be that one can use the mixture model to "boot up" the ELE model by substituting the resulting estimate of $\gamma_q$ for $\lambda_q$. Another area where further research is necessary is when linkage is informative. This is very similar to dealing with informative sampling and informative non-response, so progress can be expected to be slow. However, there is one common situation where this type of linkage occurs, and this is when a sample data set taken from one population register is then linked to another population register. In this case it is expected that the links formed by this process will not be the same as the links that would be formed if the two population registers were first linked and the same sample then taken from the linked register, which is a requirement for sampling and linkage to be mutually non-informative. Determining an appropriate model for when "sample then link" and "link then sample" lead to different outcomes remains an outstanding problem.
Given the current interest in small sample inference, it is not surprising that much of the recent research on analysis using integrated data has focused on small sample inference using these data. Here we can distinguish two broad streams of research. The first assumes that linked unit record data are available, and that area identifiers and model covariates do not suffer from linkage errors. In this case there is a growing body of evidence that using models for linkage error (e.g., the ELE model or the mixture model) can effectively adjust for much of the attenuation bias caused by linkage error. See Salvati et al. (2021), Chambers et al. (2022) and Slawski et al. (2025). But it could also be that the area identifiers, rather than the model variables, are subject to linkage error. As far as I know, little research has been carried out for this situation. Perhaps more important, however, is the impact of linkage on the other broad stream of small sample research, where the integrated unit record data are not publicly released, but instead estimates based on small samples from these data are released. Linkage error (or entity ambiguity more generally) is a unit or record level phenomenon. But in this case we do not have record level integrated data, only summary statistics based on them. This is a problem that is only just starting to get the attention it deserves.
One type of research that is notable for its absence as far as analysis of integrated data is concerned is the development of user-friendly software that can fit standard models (e.g., linear or generalized linear mixed models) to these data. This seems a binding constraint on the usefulness of any of the methods for analyzing integrated data that have appeared in the literature. It is hard to see users with little or no methodological interest in adjusting for entity ambiguity issues taking the time to modify their analyses without access to appropriate tools. Given that the use of survey weights to correct for selection effects is ubiquitous in analysis, I have been asked whether a similar "weighting" approach would work with linked data. Unfortunately, I do not see how attaching a single fixed set of weights to an integrated data set can provide a general correction for the attenuation bias induced by linkage error, since this bias depends on the block specification as well as on the linkage model specification. However, this does not mean that, for a particular choice of such specifications, one could not compute a set of weights that recovers the results of an analysis that explicitly allows for the corresponding linkage errors. But one still needs to develop the software that allows a secondary analyst to compute these weights. See Chipperfield (2019).
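To illustrate the difficulty with a single fixed set of weights, consider again the linear, homogeneous ELE setting sketched earlier (this is an illustration only, not a general result). A linkage-error-adjusted fit solves the estimating equation
$$
\sum_q \mathbf{X}_q^{\top}\bigl(\mathbf{y}_q^{*} - \mathbf{E}_q \mathbf{X}_q \boldsymbol{\beta}\bigr) = \mathbf{0},
\qquad\text{so that}\qquad
\hat{\boldsymbol{\beta}} = \Bigl(\sum\nolimits_q \mathbf{X}_q^{\top}\mathbf{E}_q\mathbf{X}_q\Bigr)^{-1}\sum\nolimits_q \mathbf{X}_q^{\top}\mathbf{y}_q^{*}.
$$
Any set of unit-level weights that reproduces this fit must embed the $\mathbf{E}_q$, and so will change whenever the block specification or the assumed linkage error model changes, and indeed whenever the model being fitted changes. This is why such weights can only be constructed for a particular analysis, and why software support for computing them is needed.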
Finally, it would be remiss of me not to point out that many of the issues that arise when dealing with entity ambiguity caused by linkage error exist because there is typically little or no linkage error-related paradata provided with public use integrated data sets. I find this strange, since holding back this information is not dissimilar to releasing public use sample survey data with no information about how the sample was selected. Unlike primary analysts, users of integrated data are overwhelmingly secondary analysts who are in no position to identify the uncertainty introduced by the linkage processes used to create these data. Even providing "official" audit-based estimates of the parameters $\lambda_q$ or $\gamma_q$ would be beneficial, since such users would then at least be able to check how modifying their analyses to include the impact of linkage error changes their conclusions. Hopefully this state of affairs will change in the future.
