A comparative analysis of graph-based and partition-based approximate nearest neighbor search for large-scale entity resolution

Abstract

Entity Resolution (ER), the process of identifying and linking records that refer to the same real-world entity, has been fundamentally reshaped by the adoption of high-dimensional vector embeddings. Once records are embedded, ER becomes a large-scale Approximate Nearest Neighbor Search (ANNS) problem, which makes the choice of ANNS architecture a critical determinant of system performance. This paper provides a deep architectural comparison and a novel, large-scale empirical evaluation of the two dominant ANNS paradigms: graph-based methods (HNSW, DiskANN) and partition-based methods (Faiss-IVF+PQ, ScaNN). We introduce a new semi-synthetic benchmark tailored to the ER task, consisting of two datasets of one million vectors each with known ground truth. On this benchmark we measure not only total query time but also separate blocking and matching times, alongside the canonical ER quality metrics: precision, recall, and F1-score. Our findings show that partition-based methods, particularly ScaNN, offer superior performance in high-throughput, moderate-recall scenarios, while graph-based methods such as HNSW and DiskANN are unequivocally superior for applications demanding the highest levels of matching quality. The analysis is application-centric and culminates in a set of actionable recommendations for practitioners designing modern data integration and retrieval systems.
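To make the evaluation protocol concrete, the following is a minimal sketch, not the benchmark used in this paper, of how embedding-based ER separates into a blocking stage (nearest-neighbor candidate generation) and a matching stage (pair acceptance), with each stage timed on its own and the result scored by precision, recall, and F1. Every name here (`block`, `match`, `make_toy_data`, the values of `k` and `threshold`) is a hypothetical choice for illustration, and exhaustive cosine search stands in for an ANNS index such as HNSW or IVF+PQ.

```python
import math
import random
import time


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def block(queries, corpus, k):
    """Blocking: keep the k most similar corpus ids for each query.

    Assumption: exhaustive search replaces the ANNS index here.
    """
    candidates = {}
    for qid, q in queries.items():
        scored = sorted(((cosine(q, v), cid) for cid, v in corpus.items()),
                        reverse=True)
        candidates[qid] = scored[:k]
    return candidates


def match(candidates, threshold):
    """Matching: accept candidate pairs whose similarity clears the threshold."""
    return {(qid, cid)
            for qid, scored in candidates.items()
            for sim, cid in scored
            if sim >= threshold}


def precision_recall_f1(predicted, truth):
    """Pairwise ER quality metrics over sets of (query_id, corpus_id) pairs."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1


def make_toy_data(n=200, dim=16, noise=0.05, seed=0):
    """Toy data: each query is a noisy copy of one corpus vector."""
    rng = random.Random(seed)
    corpus = {i: [rng.gauss(0, 1) for _ in range(dim)] for i in range(n)}
    queries = {i: [x + rng.gauss(0, noise) for x in v]
               for i, v in corpus.items()}
    truth = {(i, i) for i in corpus}
    return queries, corpus, truth


if __name__ == "__main__":
    queries, corpus, truth = make_toy_data()

    t0 = time.perf_counter()
    candidates = block(queries, corpus, k=5)
    t1 = time.perf_counter()
    pairs = match(candidates, threshold=0.9)
    t2 = time.perf_counter()

    p, r, f1 = precision_recall_f1(pairs, truth)
    print(f"blocking {t1 - t0:.3f}s, matching {t2 - t1:.3f}s")
    print(f"precision {p:.3f}, recall {r:.3f}, F1 {f1:.3f}")
```

In a setup like this, swapping `block` for an approximate index changes blocking time and the recall ceiling, since a true match that never enters the candidate list cannot be recovered by the matching stage.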

Keywords

ANNS; Entity Resolution; Faiss; ScaNN; IVF; PQ

