LLM-assisted record linkage: A framework for official statistics

Abstract

National statistical offices (NSOs) increasingly rely on record linkage to combine census data, administrative sources, and survey responses. However, conventional string-similarity methods often struggle with free-text fields. To address this limitation, this paper systematically benchmarks modern open-source large language models (LLMs) against classic string-based comparators for record linkage. Building on these findings, it introduces a hybrid approach that retains well-established probabilistic frameworks but adds an LLM-based classifier for ambiguous record pairs. A Bayesian update combines the LLM's output with the prior match probability, with the aim of reducing the burden of manual clerical review. The experiments show that selectively deploying open-source LLMs on the most uncertain pairs, and refining those decisions through Bayesian updating, can significantly reduce manual effort. Because NSOs must ensure transparency, explainability, and adherence to official statistical standards, the paper addresses these concerns alongside its evaluation of the potential of LLMs for record linkage. Practical considerations, including secure on-premises deployment, computational cost, human-in-the-loop review, and calibration, are discussed to support responsible adoption in official statistics.
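The hybrid step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual update formula, thresholds, or model interface, so everything below is an assumption made for illustration: the LLM is treated as a binary classifier with hypothetical sensitivity and specificity values, the prior is taken to be the match probability from a conventional probabilistic linkage model, and the function names `triage`, `bayes_update`, and `resolve`, as well as the 0.2 and 0.8 cut-offs, are invented here.

```python
from typing import Callable


def triage(prob: float, lower: float = 0.2, upper: float = 0.8) -> str:
    """Classify a pair by its match probability (cut-offs are assumptions)."""
    if prob >= upper:
        return "link"
    if prob <= lower:
        return "non-link"
    return "uncertain"


def bayes_update(prior: float, llm_says_match: bool,
                 sensitivity: float = 0.9, specificity: float = 0.9) -> float:
    """Posterior match probability after a binary LLM verdict.

    sensitivity = P(LLM says match | true match)
    specificity = P(LLM says non-match | true non-match)
    Both values are hypothetical and would need to be estimated
    on labelled pairs (calibration).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    if llm_says_match:
        like_match, like_nonmatch = sensitivity, 1.0 - specificity
    else:
        like_match, like_nonmatch = 1.0 - sensitivity, specificity
    numerator = like_match * prior
    return numerator / (numerator + like_nonmatch * (1.0 - prior))


def resolve(prior: float, ask_llm: Callable[[], bool]) -> tuple[str, float]:
    """Call the LLM only for uncertain pairs, then re-triage the posterior."""
    decision = triage(prior)
    if decision != "uncertain":
        return decision, prior  # clear cases never reach the LLM
    posterior = bayes_update(prior, ask_llm())
    # Pairs that remain uncertain would still go to clerical review.
    return triage(posterior), posterior


if __name__ == "__main__":
    # A stand-in for a real model call; always answers "match".
    print(resolve(0.5, lambda: True))
```

In this sketch, a pair is sent to the LLM only when its prior falls between the two cut-offs, which mirrors the selective deployment the abstract describes. Pairs whose posterior stays in the uncertain band remain candidates for human review.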

Keywords

uncertainty quantification; data privacy; large language model; national statistical offices; record linkage

