Automated linkage of patient records from disparate sources

Abstract

We introduce an automated method of record linkage that has two key features: automated selection of the match field interactions to include in the model for estimation, and automated threshold determination for classifying record pairs as matches or non-matches. We applied our method to two real-world examples. The first example demonstrated results consistent with our earlier work: when data quality is adequate and the discriminating power of the match fields is high, matching algorithms exhibit similar performance. The second example demonstrated that, in the face of low data quality, our method yields a lower false positive rate and a higher positive predictive value than the Fellegi-Sunter model. Simulation studies suggest that, compared with the Fellegi-Sunter model, our method exhibits better overall performance, as indicated by a higher area under the curve, and yields less biased estimates of both the match prevalence and the m- and u-probabilities over a range of data scenarios, especially when the match prevalence is extreme. Computationally, our method is as efficient as the Fellegi-Sunter model. We recommend this method in situations where an unsupervised linking algorithm is needed.
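For readers unfamiliar with the comparison baseline, the following is a minimal sketch of the standard Fellegi-Sunter model fitted by the EM algorithm under conditional independence of the match fields. It is not the method proposed in this article, which additionally selects field interactions and determines the classification threshold automatically. The function names (`simulate_pairs`, `fit_fellegi_sunter`, `match_weight`), the starting values, and the simulated parameter values are all hypothetical and chosen only for illustration.

```python
import math
import random
from collections import Counter


def simulate_pairs(n, p, m, u, seed=0):
    """Simulate binary agreement vectors (hypothetical data, illustration only).

    p is the match prevalence; m[j] and u[j] are the probabilities that
    field j agrees among true matches and true non-matches, respectively.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        probs = m if rng.random() < p else u
        pairs.append(tuple(int(rng.random() < q) for q in probs))
    return pairs


def fit_fellegi_sunter(pairs, n_iter=300, p=0.1, m0=0.9, u0=0.1):
    """EM estimates of (p, m, u) assuming conditionally independent fields."""
    counts = Counter(pairs)          # work on unique agreement patterns
    n = sum(counts.values())
    k = len(pairs[0])
    m, u = [m0] * k, [u0] * k

    def clip(x):
        # keep probabilities strictly inside (0, 1)
        return min(max(x, 1e-6), 1 - 1e-6)

    for _ in range(n_iter):
        # E-step: posterior probability of a match for each pattern
        post = {}
        for g in counts:
            lm, lu = p, 1.0 - p
            for j, a in enumerate(g):
                lm *= m[j] if a else 1.0 - m[j]
                lu *= u[j] if a else 1.0 - u[j]
            post[g] = lm / (lm + lu)
        # M-step: update prevalence, then m- and u-probabilities
        p = clip(sum(c * post[g] for g, c in counts.items()) / n)
        for j in range(k):
            am = sum(c * post[g] for g, c in counts.items() if g[j])
            au = sum(c * (1 - post[g]) for g, c in counts.items() if g[j])
            m[j] = clip(am / (p * n))
            u[j] = clip(au / ((1 - p) * n))
    return p, m, u


def match_weight(g, m, u):
    """Fellegi-Sunter log2 likelihood-ratio weight of an agreement pattern."""
    return sum(
        math.log2(m[j] / u[j]) if a else math.log2((1 - m[j]) / (1 - u[j]))
        for j, a in enumerate(g)
    )


# Assumed "true" values for the simulation; not taken from the article.
true_p = 0.1
true_m = [0.95, 0.90, 0.85, 0.90]
true_u = [0.05, 0.10, 0.20, 0.02]

pairs = simulate_pairs(20000, true_p, true_m, true_u)
p_hat, m_hat, u_hat = fit_fellegi_sunter(pairs)
print("estimated match prevalence:", round(p_hat, 3))
print("weight, all fields agree:", round(match_weight((1, 1, 1, 1), m_hat, u_hat), 2))
```

In this baseline, a record pair is declared a match when its weight exceeds a threshold that the analyst usually has to choose by hand, and any dependence between fields is ignored. Those are the two gaps the abstract describes the proposed method as addressing.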

Keywords

Diagnostic tests; Fellegi-Sunter model; latent class model; log-linear model; patient matching; record linkage

