Fast Algorithms for Computing Jaro Similarity

Abstract

The Jaro and Jaro–Winkler similarity measures are fundamental tools for character-based string comparison, widely used in record linkage, entity resolution, and natural language processing. Their ability to capture typographical and transpositional errors has made them popular, but traditional implementations are computationally expensive, especially on large datasets. Previously, we proposed a Jaro similarity algorithm that reduces the time complexity from quadratic to linear in the string length. That linear-time algorithm computes the Jaro similarity between two strings significantly faster when the strings are sufficiently long. In this article, we introduce enhanced algorithms for computing both Jaro and Jaro–Winkler similarity that improve the runtime, including on shorter strings. Furthermore, we propose techniques that drastically reduce the computing time when a set of strings is repeatedly compared among themselves, making the algorithms particularly well suited to large-scale record linkage tasks.
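For orientation, the quadratic-time baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of the commonly cited textbook definitions of Jaro and Jaro–Winkler similarity, not the linear-time or enhanced algorithms proposed in the article. The function names, the scaling factor p = 0.1, and the prefix cap of 4 are assumptions chosen here because they are the conventional defaults.

```python
def jaro(s1: str, s2: str) -> float:
    """Baseline Jaro similarity; worst case O(len(s1) * len(s2))."""
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if n1 == 0 or n2 == 0:
        return 0.0

    # Characters match only if equal and within this distance of each other.
    window = max(max(n1, n2) // 2 - 1, 0)
    matched1 = [False] * n1
    matched2 = [False] * n2
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(n2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Transpositions: half the number of matched characters that are out of order.
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(a != b for a, b in zip(seq1, seq2)) // 2

    return (m / n1 + m / n2 + (m - t) / m) / 3.0


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix.

    p and max_prefix are assumed conventional defaults, not values from the article.
    """
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The nested loop over the match window is the source of the quadratic cost on long strings, which is the cost the article's algorithms aim to reduce.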
