Abstract
The discipline of Entity Resolution (ER), the process of identifying and linking records that refer to the same real-world entity, has been fundamentally reshaped by the adoption of high-dimensional vector embeddings. This transformation reframes ER as a large-scale Approximate Nearest Neighbor Search (ANNS) problem, making the choice of ANNS architecture a critical determinant of system performance. This paper provides a deep architectural comparison and a novel, large-scale empirical evaluation of the two dominant ANNS paradigms: graph-based methods (HNSW, DiskANN) and partition-based methods (Faiss-IVF+PQ, Scann). We introduce a new semi-synthetic benchmark tailored to the ER task, consisting of two one-million-vector datasets with a known ground truth. On this benchmark, we conduct a comprehensive evaluation, measuring not only total query time but also disaggregated blocking and matching times, alongside canonical ER quality metrics: precision, recall, and F1-score. Our findings reveal that partition-based methods, particularly Scann, offer superior performance in high-throughput, moderate-recall scenarios, while graph-based methods like HNSW and DiskANN are unequivocally superior for applications demanding the highest levels of matching quality. This work provides a nuanced, application-centric analysis that culminates in a set of actionable recommendations for practitioners designing modern data integration and retrieval systems.
Get full access to this article
View all access options for this article.
