Abstract
The Jaro and Jaro–Winkler similarity measures are fundamental tools for character-based string comparison, widely used in applications such as record linkage, entity resolution, and natural language processing. Although their ability to capture typographical and transpositional errors has made them popular, traditional implementations are computationally expensive, especially when applied to large datasets. We previously proposed a Jaro similarity algorithm that reduces the time complexity from quadratic to linear; it computes the Jaro similarity between two strings significantly faster than traditional implementations when the strings are sufficiently long. In this article, we introduce enhanced algorithms for computing both Jaro and Jaro–Winkler similarity that improve the runtime, including for shorter strings. Furthermore, we propose techniques that drastically reduce the computing time when the strings in a set are repeatedly compared with one another, making the algorithms particularly well suited to large-scale record linkage tasks.
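For orientation, the two measures can be sketched in their classical, quadratic-time form. This is a minimal illustration of the textbook definitions only, not the linear-time or enhanced algorithms described in this article. The function names `jaro` and `jaro_winkler` are illustrative, and the prefix scaling factor `p = 0.1` and prefix cap of 4 are the conventional defaults, assumed here.

```python
def jaro(s: str, t: str) -> float:
    """Classical Jaro similarity (quadratic-time baseline, textbook definition)."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0

    # Characters match only if equal and at most `window` positions apart.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_matched = [False] * len(s)
    t_matched = [False] * len(t)
    matches = 0

    # Greedy left-to-right matching; the inner scan is the quadratic part.
    for i, ch in enumerate(s):
        lo = max(0, i - window)
        hi = min(len(t), i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and t[j] == ch:
                s_matched[i] = True
                t_matched[j] = True
                matches += 1
                break

    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len(s)):
        if s_matched[i]:
            while not t_matched[k]:
                k += 1
            if s[i] != t[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2

    return (
        matches / len(s)
        + matches / len(t)
        + (matches - transpositions) / matches
    ) / 3.0


def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro-Winkler: boosts the Jaro score for a shared prefix (assumed defaults)."""
    sim = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return sim + prefix * p * (1.0 - sim)


if __name__ == "__main__":
    print(round(jaro("MARTHA", "MARHTA"), 4))
    print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

For the standard example pair MARTHA and MARHTA, all six characters match with one transposition, so the Jaro similarity is 17/18 (about 0.944); the shared three-character prefix raises the Jaro–Winkler similarity to about 0.961.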
