Sage Journals: Discover world-class research

Abstract

Over the last years, time-efficient approaches for the discovery of links between knowledge bases have been regarded as a key requirement towards implementing the idea of a Data Web. Thus, efficient and effective measures for comparing the labels of resources are central to facilitate the discovery of links between datasets on the Web of Data as well as their integration and fusion. We present a novel time-efficient implementation of filters that allow for the efficient execution of bounded Jaro-Winkler measures. We evaluate our approach on several datasets derived from DBpedia 3.9 and LinkedGeoData and containing up to $10^{6}$ strings and show that it scales linearly with the size of the data for large thresholds. Moreover, we also show that our approach can be easily implemented in parallel. We also evaluate our approach against SILK and show that we outperform it even on small datasets.

Keywords

Link discovery string similarity measure

1. Introduction

The Linked Open Data Cloud (LOD Cloud) has developed to a compendium of more than 2000 datasets over the last few years.1

¹
See http://stats.lod2.eu for an overview of the current state of the Cloud. Last access: July 11th, 2014.

For example, data sets pertaining to more than 14 million persons have already been made available on the Linked Data Web.2

Data collected from http://stats.lod2.eu. Last access: July 11th, 2014.

While this number is impressive on its own, it is well known that the population of the planet has surpassed 7 billion people. Hence, the Web of Data contains information on less than 1% of the overall population of the planet (counting both the living and the dead). The output of open-government movements,3

See for example http://data.gov.uk/.

scientific conferences,4

⁴

See for example http://data.semanticweb.org/.

health data5

⁵

http://aksw.org/Projects/GHO.

and similar endeavours yet promises to make massive amounts of data pertaining to persons available in the near future. Dealing with this upcoming increase of the number of resources on the Web of data requires providing means to integrate these datasets with the aim to facilitate statistical analysis, data mining, personalization, etc. However, while the number of datasets on the Linked Data Web grows drastically, the number of links between datasets still stagnates.6

⁶

http://linklion.org.

Addressing this lack of links requires solving two main problems: the quadratic time complexity of link discovery (efficiency) and the automatic support of the detection of link specifications (effectiveness). In this paper, we address the efficiency of the execution of bounded Jaro-Winkler measures,7

⁷

We use bounded measures in the same sense as [18], i.e., to mean that we are only interested in pairs of strings whose similarity is greater than or equal to a given lower bound.

which are known to be effective when comparing person names [11]. To this end, we derive equations that allow discarding a large number of computations while executing bounded Jaro-Winkler comparisons with high thresholds.

The contributions of this paper are as follows:

We derive length- and range-based filters that allow reducing the number of strings t that are compared with a string s.

We present a character-based filter that allows detecting whether two strings s and t share enough resemblance to be similar according to the Jaro-Winkler measure.

We evaluate our approach w.r.t. to its runtime and its scalability with several threshold settings and dataset sizes.

The rest of this paper is structured as follows: In Section 2, we present the problem we tackled as well as the formal notation necessary to understand this work. In the subsequent Section 3, we present the three approaches we developed to reduce the runtime of bounded Jaro-Winkler computations. We then evaluate our approach in Section 4. Related work is presented in Section 5, where we focus on approaches that aim to improve the time-efficiency of link discovery. We conclude in Section 6. The approach presented herein is now an integral part of LIMES.8

⁸

http://limes.sf.net.

This paper is an extended version of [6]. We improved the scalability of the approach presented in the previous version of the paper by using tries. In addition, we provide a parallel version of the approach for additional scalability. Finally, an extended evaluation of the algorithm is presented.

2. Preliminaries

In the following, we present some of the symbols and terms used within this work.

2.1. Link discovery

In this work, we use link discovery as a hypernym for deduplication, record linkage, entity resolution and similar terms used across literature. The formal specification of link discovery adopted herein is tantamount to the definition proposed in [17]: Given a set S of source resources, a set T of target resources and a relation R, our goal is to find the set $M \subseteq S \times T$ of pairs $(s, t)$ such that $R (s, t)$ . If R is owl:sameAs, then we are faced with a deduplication task. Given that the explicit computation of M is usually a very complex endeavour, M is most commonly approximated by a set $M^{'} = {(s, t, δ (s, t)) \in S \times T \times R^{+} : δ (s, t) ⩾ θ}$ , where δ is a (potentially complex) similarity function and $θ \in [0, 1]$ is a similarity threshold. Given that this problem is in $O (n^{2})$ , using naïve algorithms to compare large S and T is most commonly impracticable. Thus, time-efficient approaches for the computation of bounded measures have been developed over the last years for measures such as the Levenshtein distance, Minkowski distances, trigrams and many more [16].

In this paper, we thus study the following problem: Given a threshold $θ \in [0, 1]$ and two sets of strings S and T, compute the set $M^{'} = {(s, t, δ (s, t)) \in S \times T \times R^{+} : δ (s, t) ⩾ θ}$ . Two categories of approaches can be considered to improve the runtime of measures: Lossy approaches return a subset $M^{″}$ of $M^{'}$ which can be calculated efficiently but for which there are no guarantees that $M^{″} = M^{'}$ . Lossless approaches on the other hand ensure that their result set $M^{″}$ is exactly the same as $M^{'}$ . In this paper, we present a lossless approach. To the best of our knowledge, only one other link discovery framework implements a lossless approach that has been designed to exploit the bound defined by the threshold θ to ensure a more efficient computation of the Jaro-Winkler distance, i.e., the SILK framework with the approach MultiBlock [10]. We thus compare our approach with SILK 2.6.0 in the evaluation section of this paper.

2.2. The Jaro-Winkler similarity

Let $Σ^{*}$ be the set of all possible strings over Σ. The Jaro measure $d_{j} : Σ^{*} \times Σ^{*} \to [0, 1]$ is a string similarity measure approach which was developed originally for name comparison in the U.S. Census. This measure takes into account the number of character matches m and the ratio of their transpositions t: $\begin{matrix} (1) & d_{j} = \{\begin{matrix} 0 & if m = 0 \\ \frac{1}{3} (\frac{m}{| s |} + \frac{m}{| t |} + \frac{m - τ}{m}) & otherwise \end{matrix} \end{matrix}$ Here two characters are considered to be a match if and only if (1) they are the same and (2) they are at most at a distance $w = ⌊ \frac{max (| s |, | t |)}{2} ⌋$ from each other. For example, for $s = spears$ and $t = pears$ , the second s of s matches the s of t while the first s of s does not match the s of t. The number of transpositions τ is half of the number of matching characters which are in different orders across the input strings s and t. For example, consider strings house and huose. All characters match, however, the order of the letters o and u is different. Thus, we have $τ = \frac{2}{2} = 1$ .

The Jaro-Winkler measure [30] is an extension of the Jaro distance. This extension is based on Winkler’s observation that typing errors occur most commonly in the middle or at the end of a word, but very rarely in the beginning. Hence, it is legitimate to put more emphasis on matching prefixes if the Jaro distance exceeds a certain “boost threshold” $b_{t}$ , originally set to 0.7 by Winkler in his publication [30]. $\begin{matrix} (2) & d_{w} = \{\begin{matrix} d_{j} & if d_{j} < b_{t} \\ d_{j} + (ℓ p (1 - d_{j})) & otherwise \end{matrix} \end{matrix}$ Here, ℓ denotes the length of the common prefix and p is a weighting factor. Winkler uses $p = 0.1$ and $ℓ ⩽ 4$ . Note that $ℓ p$ must not be greater than 1.

3. Improving the runtime of bounded Jaro-Winkler

The main principle behind reducing the runtime of the computation of measures is to improve their reduction ratio. Here, we use a sequence of filters that allow discarding similarity computations while being sure that they would have led to a similarity score which would have been less than our threshold θ. To this end, we regard the problem as that of finding filters that return an upper bound estimate $θ_{e} (s, t) ⩾ d_{w} (s, t)$ for some properties of the input strings that can be computed in constant time. For a given threshold θ, if $θ_{e} (s, t) ⩽ θ$ , we can discard the computation of the Jaro-Winkler score for $(s, t)$ . We denote the upper bound estimate of a concrete filter by $θ_{e, x}$ , where x will be the first letter of the filters name.

3.1. Length-based filters

In the following, we denoted the length of a string s with $| s |$ . Our first filter is based on the insight that large length differences are a guarantee for poor similarity. For example, the strings a and alpha cannot have a Jaro-Winkler similarity of 1 by virtue of their length difference. We can formalize this idea as follows: Let s and t be strings with respective lengths $| s |$ and $| t |$ . Without loss of generality, we will assume that $| s | ⩽ | t |$ . Moreover, let m be the number of matches across s and t. Because $m ⩽ | s |$ , we can substitute m with $| s |$ and gain the following upper bound estimation for $d_{j} (s, t)$ : $\begin{array}{l} d_{j} & = \frac{1}{3} (\frac{m}{| s |} + \frac{m}{| t |} + \frac{m - τ}{m}) \\ (3) & ⩽ \frac{1}{3} (1 + \frac{| s |}{| t |} + \frac{| s | - τ}{| s |}) \end{array}$ Now the lower bound for the number τ of transpositions is 0. Thus, we obtain the following equation. $\begin{matrix} (4) & d_{j} ⩽ \frac{1}{3} (1 + \frac{| s |}{| t |} + 1) ⩽ \frac{2}{3} + \frac{| s |}{3 | t |} \end{matrix}$

The application of this approximation on Winkler’s extension is trivial: $\begin{array}{l} d_{w} & = d_{j} + ℓ \cdot p \cdot (1 - d_{j}) \\ (5) & ⩽ \frac{2}{3} + \frac{| s |}{3 | t |} + ℓ \cdot p \cdot (\frac{1}{3} - \frac{| s |}{3 | t |}) = θ_{e, l} \end{array}$

Consider the pair $s = bike$ and $t = bicycle$ and a threshold $θ = 0.9$ . Applying the estimation for Jaro we get $d_{j} ⩽ \frac{2}{3} - \frac{4}{3 \cdot 7} = 0.857$ . This exceeds the boost threshold, so we use Eq. (5) to compute $θ_{e, l} (s, t) = 0.885$ . Given that $θ_{e, l} (s, t) < θ$ , we do not need to compute the real value of $d_{w} (s, t)$ as we know that this pair will not belong to the desired output.

By using this approach we can decide in $O (1)$ 9

⁹
In most programming languages, especially Java (which we used for our implementation), the length of string is stored in a variable and can thus be accessed in constant time.

if a given pairs score is greater than a given threshold, which saves us the much more expensive score computation for a big number of pairs, provided that the input strings sufficiently vary in length.

3.2. Filtering ranges by length

The approach described above can be reversed to limit the number of pairs that we are going to be iterating over. To this end, we can construct an $index : N \to Σ^{*}$ which maps string lengths $l \in N$ to all strings s with $| s | = l$ . With the help of this index, we can now determine the set of strings t that should be compared with the subset $S (l)$ of S that only contains strings of length l. We go about using this insight by computing the upper and lower bound for the length of a string t that should be compared with a string s. This is basically equivalent to asking what is the minimum length difference $| | s | - | t | |$ so that $θ ⩾ θ_{e, l} (s, t)$ is satisfied. We transpose Eq. (5) to the following for our lower bound: $\begin{matrix} (6) & | s | ⩾ ⌊ 3 | t | \frac{θ - ℓ p}{1 - ℓ p} - 2 | t | ⌋ = ρ_{\min} \end{matrix}$ Analogously, we can derive the following upper bound: $\begin{matrix} (7) & | s | ⩽ ⌈ \frac{| t |}{3 \frac{θ - ℓ p}{1 - ℓ p} - 2} ⌉ = ρ_{\max} \end{matrix}$

For example, consider a list of strings S with equally distributed string lengths $(4, 7, 11, 18)$ and $θ = 0.9$ . Using Eqs (6) and (7) we obtain Table 1. Taking into account the last column of the table, we will save a total of $\frac{3}{8}$ comparisons. Note that this bound is not effective if the threshold is not greater than $ϕ = \frac{2 + ℓ p}{3}$ . This is simply due to the bounds $ρ_{\min}$ and $ρ_{\max}$ being negative below this particular value of ϕ.

Table 1
Bounds for distinct string lengths ( $θ = 0.9$ )

$| s |$ $ρ_{\min}$ $ρ_{\max}$ sizes in range

4 2 8 $(4, 7)$

7 3 14 $(4, 7, 11)$

11 5 22 $(7, 11, 18)$

18 9 36 $(11, 18)$

$\| s \|$	$ρ_{\min}$	$ρ_{\max}$	sizes in range
4	2	8	$(4, 7)$
7	3	14	$(4, 7, 11)$
11	5	22	$(7, 11, 18)$
18	9	36	$(11, 18)$

3.3. Filtering by character frequency

An even more fine-grained approach can be chosen to filter out computations. Let $e : Σ \times Σ^{*} \to N$ be the function which returns the number of occurrences of a given character in a string. For the strings s and t, the number of maximum possible matches $m_{\max}$ can be expressed as $\begin{matrix} (8) & m_{\max} = \sum_{c \in s} min (e (s, c), e (t, c)) ⩾ m \end{matrix}$ Consequently, we can now substitute m for $m_{\max}$ in the Jaro distance computation: $\begin{array}{l} d_{j} (s, t) & = \frac{1}{3} (\frac{m_{\max}}{| s |} + \frac{m_{\max}}{| t |} + \frac{m_{\max} - τ}{m_{\max}}) \\ (9) & ⩽ \frac{1}{3} (\frac{m_{\max}}{| s |} + \frac{m_{\max}}{| t |} + 1) = θ_{e, f} \end{array}$ We can thus derive that $d_{j} (s, t) ⩾ θ$ iff $\begin{matrix} (10) & m_{\max} ⩾ \frac{(3 θ - 1) | s | | t |}{| s | + | t |} . \end{matrix}$

For instance, let $s = astronaut$ , $t = astrochimp$ . The retrieval of $m_{\max}$ is shown in Table 2.

Table 2
Calculation of $m_{\max}$

c $e (s, c)$ $e (t, c)$ $min (e (s, c), e (t, c))$ $m_{\max}$

a 2 1 1 1

c 0 1 0 1

h 0 1 0 1

i 0 1 0 1

m 0 1 0 1

n 1 0 0 1

o 1 1 1 2

p 0 1 0 2

r 1 1 1 3

s 1 1 1 4

t 2 1 1 5

u 1 0 0 5

c	$e (s, c)$	$e (t, c)$	$min (e (s, c), e (t, c))$	$m_{\max}$
a	2	1	1	1
c	0	1	0	1
h	0	1	0	1
i	0	1	0	1
m	0	1	0	1
n	1	0	0	1
o	1	1	1	2
p	0	1	0	2
r	1	1	1	3
s	1	1	1	4
t	2	1	1	5
u	1	0	0	5

3.3.1. Naïve implementation

The naïve implementation of the character-based filter consists of checking if Eq. (10) holds true. The e function for each string is thereby represented using a map. As shown by our evaluation, the character-based filter leads to a significant reduction of the number of comparisons (see Fig. 6) by more than 2 orders of magnitude. However, the runtime improvement achieved using this implementation is not substantial. This is simply due to the lookup into maps being constant in time complexity but still a large amount of time. Instead of regarding strings as monolithic entities, we thus extended our implementation to be more fine-grained and used a trie as explained in the subsequent section.

3.3.2. Implementation with tries

To overcome the need to perform character index lookups for every pair of strings we use a trie-based pruning technique. We thus dub this filter the trie filter from now on. Let $D_{l} \in {S, T}$ be the dataset that contains the longest string. Moreover, let $D_{s}$ be the other dataset (we set $D_{l} = S$ and $D_{s} = T$ if both datasets contain longest strings of same length).

We define a function $σ : Σ^{+} \to Σ^{+}$ , which maps a word onto an ordered permutation of itself according to any consistent total ordering of letters in Σ. For example, we get $σ (hello) = ehllo$ for the ordering derived from the UTF-8 index of characters. We begin by partitioning the set $D_{l}$ . The partition $P_{σ (s_{i})}$ with the signature $σ (s_{i})$ is the set of all strings $s_{j}$ from $D_{l}$ with $σ (s_{i}) = σ (s_{j})$ . In a second step, we add all elements $t_{k} \in D_{s}$ to a trie $T$ . More precisely, we add all $t_{k}$ to sets $S_{η}$ associated with nodes η at path $σ (t_{k})$ . We assume that each node of $T$ is labelled with the letter that must be added to its father to create it. The core of our approach now consists of detecting all $t_{k}$ that must be compared with all elements of a particular partition $P_{σ (s_{i})}$ . To this end, we begin by computing the minimal number of matching characters $m_{\min}$ from the signature $σ (s_{i})$ that any relevant $t_{k}$ must have to be a potential match. The total amount of possible matches $m_{\max}$ must be greater than or equal to $m_{\min}$ for having a chance of finding any $t_{k}$ such that $d_{w} (s_{i}, t_{k}) ⩾ θ$ .

From the Jaro-Winkler version of the upper bound estimation given in Eq. (9) we obtain the following equation: $\begin{array}{l} m_{\max} & ⩾ m_{\min} (s, t) \\ = ⌈ (θ - ℓ p - \frac{1 - ℓ p}{3}) \\ (11) & \cdot \frac{3 | s | | t |}{(| s | + | t |) (1 - ℓ p)} ⌉ \end{array}$

Given this value, we traverse $T$ depth-first in pre-order for every $P_{u}$ with signature u (i.e., for all strings $s_{i}$ with signature u) as follows: We begin by setting the number $m_{u}$ of matches for u to 0 and the depth counter $d_{u}$ to 0. Note that $d_{u}$ and $m_{u}$ do have a local scope w.r.t. a path on $T$ . That is, we virtually run our approach independently on each path. When at a node η of the trie, we check whether the number of matches between the sequence of letters that leads to η and the signature u abides by $m_{\max} ⩾ m_{\min}$ . Less formally, we simply check whether the children of the given nodes can at all be potential matches for the signature u. If the node η does not fulfill this criterion, we terminate the exploration of the trie and go on with the next node in the set of nodes to be explored, effectively cutting off the subtrie below node η. Else, we make use of the total ordering underlying the signature u. To this end, we compare the letter of u with the index $d_{u}$ , from here on called $u_{d_{u}}$ with the label of η. If the label of η is superior to that chosen letter $u_{d_{u}}$ , then no children of η can match $u_{d_{u}}$ either. However, subsequent letters of u may match the label of η or its children. We therefore increment $d_{u}$ and put η back into the set of nodes to be explored, effectively revisiting η in the next iteration step. If the label of η is inferior to $u_{d_{u}}$ , it means that we can still expect some node in the subtrie of η to match $u_{d_{u}}$ , therefore we just add all children of η to the set of nodes to be explored. Finally, if the label of η matches $u_{d_{u}}$ exactly, we increment both $m_{u}$ and $d_{u}$ , add all children of η to the set of nodes to be explored and if $S_{η}$ is not empty, we add $(P_{u}, S_{η})$ to the set of potential match candidates. The detailed implementation logic of the tree traversal is given in Algorithm 1.

Consider the following example for a better understanding of how the algorithm works. The threshold parameter θ is set to 0.92. Datasets S and T are given in Table 3. We deliberately selected string of size 5 so that we can present the character filtering in a clear fashion. As both datasets contain the longest string, we set $D_{l} = S$ and $D_{s} = T$ . The only string in $D_{l}$ is nines, hence we have exactly one Partition $P_{σ (nines)} = P_{einns} = {nines}$ . We now construct $T$ from strings in $D_{s}$ , gaining the trie shown in Fig. 1. The nodes η, whose corresponding sets $S_{η}$ are not empty, are highlighted in blue color. Table 4 gives an overview of all $S_{η}$ .

Table 3
Datasets S and T from the example

S T

{nines} {niece, niche, nices, spice, since}

S	T
{nines}	{niece, niche, nices, spice, since}

Fig. 1.

Example Trie $T$ .

Table 4

Nodes and their corresponding sets from the example

η	$S_{η}$
s ₀	{spice}
s ₁	{nices, since}
n ₀	{niche}
n ₁	{niece}

Algorithm 1

Trie filter

We select the first and only Partition $P_{einns}$ and set $u = einns$ , $d_{u} = 0$ and $m_{u} = 0$ . Now we begin the trie exploration, therefore starting at the root (labeled with the empty word). In the next step we need to check whether $m_{\max} ⩾ m_{\min}$ holds true. To this end, we need to compute $m_{\min}$ and $m_{\max}$ . First, we compute $m_{\min}$ using Eq. (11) with arguments u and v, where v is the string we gain by chaining all labels of nodes on the shortest path from the root to a leaf, getting $m_{\min} = 4$ . Second, we determine the maximal number of possible matches in this iteration using Eq. (12). $\begin{matrix} (12) & m_{\max} (u, d_{u}, η) = m + min (| u | - d_{u}, height (η) + 1) \end{matrix}$

The height of a node η herein is defined as the number of edges on the longest downward path between that node and a leaf. Therefore we have $m_{\max} = min (5, 6) = 6$ . As $m_{\max} = 6 ⩾ 4 = m_{\min}$ holds true, we continue by comparing the label of η to $u_{d_{u}}$ . With η being the root of the trie, we have $label (η) = ϵ < e = u_{0}$ , hence we just add all children of η to the set of nodes to be explored.

In the next iteration step, we now have $m_{\max} = min (5, 5) = 5$ , which is still greater than $m_{\min} = 4$ . Again, we have $label (η) = c < e = u_{0}$ and end up adding all children of node c to the set of nodes to be explored.

In iteration step 3 $m_{\max} = 4 ⩾ 4 = m_{\min}$ still holds true and this time we indeed have a match, as $label (η) = e = u_{0}$ . Therefore, we now increment both m and $d_{u}$ and add all children of node e₀ to the set of nodes to be explored.

In iteration step 4 we select $η = i_{0}$ and still $m_{\max} = 4 ⩾ 4 = m_{\min}$ . We have a match for $label (η) = i = u_{1}$ . We increment both m and $d_{u}$ and add all children of node i₀ to the set of nodes to be explored.

Fig. 2.

Flowchart of parallel trie filter stack.

In iteration step 5 we select $η = p$ and still $m_{\max} = 4 ⩾ 4 = m_{\min}$ . We have $label (η) = p > n = u_{2}$ . We increment $d_{u}$ and revisit this node in the next step.

In iteration step 6 we revisit $η = p$ and still $m_{\max} = 4 ⩾ 4 = m_{\min}$ . Again, it is $label (η) = p > n = u_{3}$ . We increment $d_{u}$ and revisit this node once again in the next step.

In iteration step 7 we revisit $η = p$ and this time $m_{\max} = 3 ⩾ 4 = m_{\min}$ , hence we continue with the next node.

In iteration step 8 we select $η = n_{0}$ and still $m_{\max} = 4 ⩾ 4 = m_{\min}$ . We have a match for $label (η) = n = u_{2}$ . Therefore, we now increment both m and $d_{u}$ and add all children of node n₀ to the set of nodes to be explored.

In iteration step 9 we select $η = s_{1}$ and still $m_{\max} = 4 ⩾ 4 = m_{\min}$ . We have $label (η) = s > n = u_{3}$ . We increment $d_{u}$ and revisit this node in the next step.

In iteration step 10 we revisit $η = s_{1}$ and still $m_{\max} = 4 ⩾ 4 = m_{\min}$ . We have a match for $label (η) = s = u_{4}$ . As $m_{u} = 4 = m_{\min}$ , we can now add ${P_{u}, S_{s_{1}}}$ to the set of match candidates.

In iteration step 11 we select $η = h$ and still $m_{\max} = 4 ⩾ 4 = m_{\min}$ . However, $label (η) = h < i = u_{1}$ . All children of node h are therefore added to the set of nodes to be explored.

In iteration step 12 we select $η = i_{1}$ and due to $m_{\max} = 3 < 4 = m_{\min}$ , we abort the exploration of this branch and continue with the next step.

Iterations 13 and 14 come about analogously to steps 11 and 12, leading to a stop of exploration at node i₂.

The set of match candidates that is forwarded to the actual Jaro-Winkler distance computation now contains ${P_{u}, S_{s_{1}}} = {{nines}, {nices, since}}$ .

Note that using the range filter prior to the trie filter is a key requirement towards its efficiency, that is, the more equally distributed the string lengths are, the better we get in terms of reduction ratio and runtime improvement. The worst case scenario is a big dataset with very large strings over a little alphabet and little to none variation in string lengths.

Fig. 3.

Runtimes on sample of DBpedia rdf:labels with size 1000.

3.4. Parallel implementation

While our results suggest that the approach presented above scales well on one processor, we wanted to measure how well the partitioning induced by the upper and lower length bounds given in Eqs (6) and (7) can be used to implement our approach in parallel. Let $T_{i, j} = {t \in T : i ⩽ | t | ⩽ j}$ be set of strings whose length is larger or equal to i and less or equal to j. Moreover, let $S_{k} = {s \in S : k = | s |}$ be the subset of S which contains strings of length k. Finally, let $L (S)$ be the set of distinct string lengths of elements of S. We distribute a subset $S \times T$ into sets $M_{x}^{‴} = S_{x} \times T_{ρ_{\min} (x), ρ_{\max} (x)} \forall x \in L (S)$ . Note that due to the bounds in Eqs (6) and (7), we are sure that all elements of T which abide by the similarity condition $δ (s, t) ⩾ θ for s \in S_{x}$ can be found in the subset $T_{ρ_{\min} (x), ρ_{\max} (x)}$ . Moreover, not all elements of $S \times T$ are in the set $M_{x}^{‴}$ . In particular, pairs of strings that do not abide by the range restrictions are not considered as they are known not to be matches.

Now given the sets $M_{x}^{‴}$ , we can parallelize our implementation by simply using a thread-pool-based approach. We initialize a user-given number of threads and assign each thread one of the sets $M_{x}^{‴}$ randomly. Once a thread has completed its computations, it is simply assigned a further set $M_{x}^{‴}$ . Note that we do not implement any load balancing as we were primarily interested in how well the range filter allows partitioning data. Figure 2 gives an overview of our parallel pipeline.

4. Evaluation

4.1. Experimental setup

The aim of our evaluation was to study how well our approach performs on real data. We chose DBpedia 3.9 as a source of data for our experiments as it contains data pertaining to 1.1 million persons and thus allows for both fine-grained evaluations and scalability evaluations. As a second data source we chose LinkedGeoData to evaluate how well we perform on strings that have no relation to person names. We chose these datasets because (1) they have been widely used in experiments pertaining to link discovery and (2) the distributions of string sizes in these datasets are significantly different (see Figs 7 and 8). All experiments where deduplication experiments, i.e., $S = T$ . We considered the list of all rdfs:label in DBpedia in our runtime evaluation and added all rdfs:labels of the Places dataset from LinkedGeoData for scalability experiments. We also computed the number of actual Jaro-Winkler calculations carried out for 1000 strings from DBpedia. All runtime and scalability experiments were performed on a 2.5 GHz Intel Core i5 machine with 16 GB RAM running OS X 10.9.3. The speedup of the parallel trie-based filter stack was measured on a Microsoft Azure VM instance with 16 CPUs and 112 GB RAM.

For the sake of legibility, we used the following names and acronyms for filter combination or lack thereof:

naïve (n): No use of any filters (equivalent to brute-force).

range (r): Uses only the range filter as described in Section 3.2.

length (l): Uses only the length filter as described in Section 3.1.

frequency (f): Uses only the naïve implementation of the character frequency filter as described in Section 3.3.

trie (t): Uses only the trie filter implementation of the character frequency filter as described in Section 3.3.2.

range and length (r+l): Uses the range filter to partition the datasets, the length filter is then applied to each pair of partitions.

range and frequency (r+f): Sets up the frequency filter globally in a first step. Then uses the range filter to partition the datasets and applies the frequency filter to each pair of partitions.

range and trie (r+t): Uses the range filter to partition the datasets, the trie filter is then applied to each pair of partitions.

range, length and frequency (r+l+f): Sets up the frequency filter globally in a first step. Then uses the range filter to partition the datasets and applies the length filter to each pair of partitions. The frequency filter is then only applied to those pairs of strings for which the length filter does not apply.

Fig. 4.

Runtimes on sample of DBpedia rdf:labels with size $10^{4}$ .

Note that the trie filter implicitly contains a length filter, which is the reason why we do not evaluate the combination r+l+t separately.

4.2. Runtime evaluation

In our first series of experiments, we evaluated the runtime of all filter combinations against the naïve approach on a small dataset containing 1000 labels from DBpedia. The results of our evaluation are shown in Fig. 3. This evaluation suggests that all filter setups except those containing t outperform the naïve approach. Moreover, the combination of all filters leads to the best overall runtime in most cases on this small sample. Overall, the results on this dataset already shows that we outperform the naïve approach by more than an order of magnitude when $θ > 0.9$ . Interestingly, the break-even point for f and t is reached when $θ > 0.99$ on this small dataset. This is clearly due to the overhead necessary to create the trie overshadowing the runtime advantage engendered by using the trie to search for matches.

The runtimes on a larger sample of size $10^{4}$ show an even better improvement (see Fig. 4). This suggests that the relative improvement of our approach improves with the size of the problem. The most interesting result comes from filter setup r+t. It is slower than the naïve approach on low θ but after an break-even point around $0.89 ⩽ θ ⩽ 0.91$ it outperforms the second best setup r+l by an order of magnitude.

4.3. Scalability evaluation

The aim of the scalability evaluation was to measure how well our approach scales. In our first set of experiments, we looked at the growth of the runtime of our approach on datasets of growing sizes (see Figs 9 and 10). Our results show very clearly that r+t is the best filter combination for datasets of large sizes. This result holds on both DBpedia and LinkedGeoData. r+t is thus the default implementation of the Jaro-Winkler measure in LIMES. In addition, our results suggest that our approach grows linearly with the number of labels contained in S and T. This is one of our most important results as it makes clear that we can employ r+t on large datasets and expect acceptable runtimes. Moreover, the behavior of the runtime of our default setup can be easily predicted for large datasets, which is of importance when asking users to wait for the results of the computation.

Fig. 5.

Runtimes on sample of DBpedia rdf:labels with size $10^{5}$ , scaling threshold.

Fig. 6.

Actual Jaro-Winkler computations, measured on sample of DBpedia rdf:labels with size 1000.

Fig. 7.

String length distribution of $10^{6}$ DBpedia rdf:labels.

Fig. 8.

String length distribution of $10^{6}$ LinkedGeoData rdf:labels.

Fig. 9.

Runtimes for growing input list sizes (DBPedia).

Fig. 10.

Runtimes for growing input list sizes (LinkedGeoData).

The second series of scalability experiments looked at the runtime behaviour of our approach on a large dataset with $10^{5}$ labels (see Fig. 5). Our results suggest that the runtime of our approach falls superlinearly with an increase of the threshold θ. This behaviour suggest that our approach is especially useful on clean datasets, where high thresholds can be used for link discovery.

In the third series of experiments we looked at the speedup we gain by parallelizing r+t on the DBpedia dataset with input sizes of $10^{5}$ and $10^{6}$ (see Figs 11 and 12). Here, our results show that the current implementation scales up in a satisfactory manner on up to 4 processors. Running the parallel implementation on 8 and 12 processors does not bring about any considerable increase in speedup. The reason for this behavior is simply that we did not implement any load balancing. Hence, there is commonly one thread that is assigned a single large $M_{x}^{‴}$ , leading to all other threads having completed their tasks but having to wait for this particular thread to terminate. Adding load balancing to the approach as well as splitting large $M_{x}^{‴}$ into smaller chunks should lead to an improvement of the scalability of our parallel implementation. These extensions will be implemented in future work.

4.4. Comparison with existing approaches

We compared our approach with SILK2.6.0. To this end, we retrieved all rdfs:label of instances of subclasses of Person. We only compared with SILK on small datasets (i.e., on classes with small numbers of instances) as the results on these small datasets already showed that we outperform SILK consistently.10

¹⁰
We ran SILK with -Dthreads = 1 for the sake of fairness.

Our results are shown in Table 5. They suggest that the absolute difference in runtime grows with the size of the datasets. Thus, we did not consider testing larger datasets against SILK as in the best case, we were already 4.7 times faster than SILK (Architect dataset,

θ = 0.95

Table 5

Runtimes (in seconds) of our approach (OA) and SILK 2.6.0

DBpedia Class	Size	OA(0.8)	OA(0.9)	OA(0.95)	SILK(0.8)	SILK(0.9)	SILK(0.95)
Actors	9509	15.07	10.13	6.38	27	25	25
Architect	3544	5.58	5.48	2.32	11	11	11
Criminal	5291	11.54	7.77	4.52	18	18	18

Fig. 11.

Speedup of parallel algorithm (DBPedia, $10^{5}$ ).

Fig. 12.

Speedup of parallel algorithm (DBPedia, $10^{6}$ ).

5. Related work

The work presented herein is related to record linkage, deduplication, link discovery and the efficient computation of Hausdorff distances. An extensive amount of literature has been published by the database community on record linkage (see [7,12] for surveys). With regard to time complexity, time-efficient deduplication algorithms such as PPJoin+ [32], EDJoin [31], PassJoin [13] and TrieJoin [29] were developed over the last years. Several of these were then integrated into the hybrid link discovery framework LIMES [17]. Moreover, dedicated time-efficient approaches were developed for LD. For example, RDF-AI [27] implements a five-step approach that comprises the preprocessing, matching, fusion, interlink and post-processing of data sets. [19] presents an approach based on the Cauchy-Schwarz that allows discarding a large number of unnecessary computations. The approaches HYPPO [15] and $HR 3$ [16] rely on space tiling in spaces with measures that can be split into independent measures across the dimensions of the problem at hand. Especially, $HR 3$ was shown to be the first approach that can achieve a relative reduction ratio $r^{'}$ less or equal to any given relative reduction ratio $r > 1$ . Standard blocking approaches were implemented in the first versions of SILK and later replaced with MultiBlock [10], a lossless multi-dimensional blocking technique. KnoFuss [22] also implements blocking techniques to achieve acceptable runtimes. Further approaches can be found in [4,8,23,24,26,28].

In addition to addressing the runtime of link discovery, several machine-learning approaches have been developed to learn link specifications (also called linkage rules) for link discovery. For example, machine-learning frameworks such as FEBRL [2] and MARLIN [1] rely on models such as Support Vector Machines [3] and decision trees [25] to detect classifiers for record linkage. RAVEN [20] relies on active learning to detect linear or Boolean classifiers. The EAGLE approach [21] combines active learning and genetic programming to detect link specifications. KnoFuss [22] goes a step further and presents an unsupervised approach based on genetic programming for finding accurate link specifications. Other record deduplication approaches based on active learning and genetic programming are presented in [5,9].

6. Conclusion and future work

In this paper, we present a novel approach for the efficient execution of bounded Jaro-Winkler computations. Our approach is based on three filters which allow discarding a large number of comparisons. We showed that our approach scales well with the amount of data it is faced with. Moreover, we showed that our approach can make effective use of large thresholds by reducing the total runtime of the approach considerably. We also compared our approach with the state-of-the-art framework SILK 2.6.0 and showed that we outperform it on all datasets. In future work, we will test whether our approach improves the accuracy of specification detection algorithms such as EAGLE. Moreover, we will focus on improving the quality of matches. To this end we will split input strings into tokens and use a hybrid approach as proposed by Monge and Elkan [14], which adds to complexity of the algorithm, hence allowing for further runtime improvements. Furthermore, we will extend the parallel implementation with load balancing.

References

Bilenko and

R.J.

Mooney, Adaptive duplicate detection using learnable string similarity measures, in: Proc. of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 24–27, 2003,

Getoor,

T.E.

Senator,

P.M.

Domingos and

Faloutsos, eds, ACM, 2003, pp. 39–48. doi:10.1145/956750.956759.

Christen, Febrl-: An open source data cleaning, deduplication and record linkage system with a graphical user interface, in: Proc. of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas Nevada, USA, August 24–27, 2008,

Li,

Liu and

Sarawagi, eds, ACM, 2008, pp. 1065–1068. doi:10.1145/1401890.1402020.

Cristianini and

Ricci, Support vector machines, in: Encyclopedia of Algorithms,

M.-Y.

Kao, ed., Springer, 2008. doi:10.1007/978-0-387-30162-4_415.

Cudré-Mauroux,

Haghani,

Jost,

Aberer and

de Meer, Idmesh: Graph-based disambiguation of linked data, in: Proc. of the 18th International Conference on World Wide Web, WWW 2009, Madrid, Spain, April 20–24, 2009,

Quemada,

León,

Y.S.

Maarek and

Nejdl, eds, ACM, 2009, pp. 591–600. doi:10.1145/1526709.1526789.

de Freitas,

G.L.

Pappa,

A.S.

da Silva,

M.A.

Gonçalves,

E.S.

de Moura,

Veloso,

A.H.F.

Laender and

M.G.

de Carvalho, Active learning genetic programming for record deduplication, in: Proc. of the IEEE Congress on Evolutionary Computation, CEC 2010, Barcelona, Spain, 18–23 July 2010, IEEE, 2010, pp. 1–8. doi:10.1109/CEC.2010.5586104.

Dreßler and

A.-C.

Ngonga Ngomo, Time-efficient execution of bounded jaro-winkler distances, in: Proc. of the 9th International Workshop on Ontology Matching Collocated with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Trentino, Italy, October 20, 2014,

Shvaiko,

Euzenat,

Mao,

Jiménez-Ruiz,

Li and

A.-C.

Ngonga Ngomo, eds, CEUR Workshop Proceedings, Vol. 1317, CEUR-WS.org, 2014, pp. 37–48.

A.K.

Elmagarmid,

P.G.

Ipeirotis and

V.S.

Verykios, Duplicate record detection: A survey, IEEE Trans. Knowl. Data Eng.19(1) (2007), 1–16. doi:10.1109/TKDE.2007.250581.

Euzenat,

Ferrara,

W.R.

van Hage,

Hollink,

Meilicke,

Nikolov,

Ritze,

Scharffe,

Shvaiko,

Stuckenschmidt,

Sváb-Zamazal and

C.T.

dos Santos, Results of the ontology alignment evaluation initiative 2011, in: Proc. of the 6th International Workshop on Ontology Matching, Bonn, Germany, October 24, 2011,

Shvaiko,

Euzenat,

Heath,

Quix,

Mao and

I.F.

Cruz, eds, CEUR Workshop Proceedings, Vol. 814, CEUR-WS.org, 2011.

Isele and

Bizer, Learning expressive linkage rules using genetic programming, PVLDB5(11) (2012), 1638–1649.

10.

Isele,

Jentzsch and

Bizer, Efficient multidimensional blocking for link discovery without losing recall, in: Proc. of the 14th International Workshop on the Web and Databases 2011, WebDB 2011, Athens, Greece, June 12, 2011,

Marian and

Vassalos, eds, 2011.

11.

M.A.

Jaro, Advances in record-linkage methodology as applied to matching the 1985 census of Tampa, Florida, Journal of the American Statistical Association84(406) (1989), 414–420.

12.

Köpcke and

Rahm, Frameworks for entity matching: A comparison, Data Knowl. Eng.69(2) (2010), 197–210. doi:10.1016/j.datak.2009.10.003.

13.

Li,

Deng,

Wang and

Feng, PASS-JOIN: A partition-based method for similarity joins, PVLDB5(3) (2011), 253–264.

14.

A.E.

Monge and

Elkan, An efficient domain-independent algorithm for detecting approximately duplicate database records, in: DMKD, 1997.

15.

A.-C.

Ngonga Ngomo, A time-efficient hybrid approach to link discovery, in: Proc. of the 6th International Workshop on Ontology Matching, Bonn, Germany, October 24, 2011,

Shvaiko,

Euzenat,

Heath,

Quix,

Mao and

I.F.

Cruz, eds, CEUR Workshop Proceedings, Vol. 814, CEUR-WS.org, 2011.

16.

A.-C.

Ngonga Ngomo, Link discovery with guaranteed reduction ratio in affine spaces with Minkowski measures, in: Proc. of the Semantic Web – ISWC 2012 – 11th International Semantic Web Conference, Part I, Boston, MA, USA, November 11–15, 2012,

Cudré-Mauroux,

Heflin,

Sirin,

Tudorache,

Euzenat,

Hauswirth,

J.X.

Parreira,

Hendler,

Schreiber,

Bernstein and

Blomqvist, eds, Lecture Notes in Computer Science, Vol. 7649, Springer, 2012, pp. 378–393. doi:10.1007/978-3-642-35176-1_24.

17.

A.-C.

Ngonga Ngomo, On link discovery using a hybrid approach, J. Data Semantics1(4) (2012), 203–217. doi:10.1007/s13740-012-0012-y.

18.

A.-C.

Ngonga Ngomo, ORCHID – reduction-ratio-optimal computation of geo-spatial distances for link discovery, in: Proc. of the Semantic Web – ISWC 2013 – 12th International Semantic Web Conference, Part I, Sydney, NSW, Australia, October 21–25, 2013,

Alani,

Kagal,

Fokoue,

P.T.

Groth,

Biemann,

J.X.

Parreira,

Aroyo,

N.F.

Noy,

Welty and

Janowicz, eds, Lecture Notes in Computer Science, Vol. 8218, Springer, 2013, pp. 395–410. doi:10.1007/978-3-642-41335-3_25.

19.

A.-C.

Ngonga Ngomo and

Auer, LIMES – a time-efficient approach for large-scale link discovery on the web of data, in: IJCAI 2011, Proc. of the 22nd International Joint Conference on Artificial Intelligence, Barcelona, Catalonia, Spain, July 16–22, 2011,

Walsh, ed., IJCAI/AAAI, 2011, pp. 2312–2317.

20.

Ngonga Ngomo,

Lehmann,

Auer and

Höffner, RAVEN – active learning of link specifications, in: Proc. of the 6th International Workshop on Ontology Matching, Bonn, Germany, October 24, 2011,

Shvaiko,

Euzenat,

Heath,

Quix,

Mao and

I.F.

Cruz, eds, CEUR Workshop Proceedings, Vol. 814, CEUR-WS.org, 2011.

21.

A.-C.

Ngonga Ngomo and

Lyko, EAGLE: Efficient active learning of link specifications using genetic programming, in: Proc. of the Semantic Web: Research and Applications – 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27–31, 2012,

Simperl,

Cimiano,

Polleres,

Ó.

Corcho and

Presutti, eds, Lecture Notes in Computer Science Vol. 7295, Springer, 2012, pp. 149–163. doi:10.1007/978-3-642-30284-8_17.

22.

Nikolov,

d’Aquin and

Motta, Unsupervised learning of link discovery configuration, in: Proc. of the Semantic Web: Research and Applications – 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27–31, 2012,

Simperl,

Cimiano,

Polleres,

Ó.

Corcho and

Presutti, eds, Lecture Notes in Computer Science Vol. 7295, Springer, 2012, pp. 119–133. doi:10.1007/978-3-642-30284-8_15.

23.

Nikolov,

V.S.

Uren,

Motta and

A.N.

De Roeck, Overcoming schema heterogeneity between linked semantic repositories to improve coreference resolution, in: Proc. of the Semantic Web, Fourth Asian Conference, ASWC 2009, Shanghai, China, December 6–9, 2009

Gómez-Pérez,

Yu and

Ding, eds, Lecture Notes in Computer Science, Vol. 5926, Springer, 2009, pp. 332–346. doi:10.1007/978-3-642-10871-6_23.

24.

Papadakis,

Ioannou,

Niederée,

Palpanas and

Nejdl, Eliminating the redundancy in blocking-based entity resolution methods, in: Proc. of the 2011 Joint International Conference on Digital Libraries, JCDL 2011, Ottawa, ON, Canada, June 13–17, 2011,

Newton,

M.J.

Wright and

L.N.

Cassel, eds, ACM, 2011, pp. 85–94. doi:10.1145/1998076.1998093.

25.

S.R.

Safavian and

D.A.

Landgrebe, A survey of decision tree classifier methodology, IEEE Transactions on Systems, Man, and Cybernetics21(3) (1991), 660–674. doi:10.1109/21.97458.

26.

Saïs,

Pernelle and

M.-C.

Rousset, Combining a logical and a numerical method for data reconciliation, J. Data Semantics12 (2009), 66–94. doi:10.1007/978-3-642-00685-2_3.

27.

Scharffe,

Liu and

Zhou, Rdf-ai: An architecture for rdf datasets matching, fusion and interlink, in: Proc. IJCAI 2009 Workshop on Identity, Reference, and Knowledge Representation (IR-KR), Pasadena, CA, US, 2009.

28.

Song and

Heflin, Automatically generating data linkages using a domain-independent candidate selection approach, in: Proc. of the Semantic Web – ISWC 2011 – 10th International Semantic Web Conference, Part I, Bonn, Germany, October 23–27, 2011,

Aroyo,

Welty,

Alani,

Taylor,

Bernstein,

Kagal,

N.F.

Noy and

Blomqvist, eds, Lecture Notes in Computer Science, Vol. 7031, Springer, 2011, pp. 649–664. doi:10.1007/978-3-642-25073-6_41.

29.

Wang,

Li and

Feng, Trie-join: Efficient trie-based string similarity joins with edit-distance constraints, PVLDB3(1) (2010), 1219–1230.

30.

W.E.

Winkler, String comparator metrics and enhanced decision rules in the fellegi-sunter model of record linkage, in: Proc. of the Section on Survey Research, 1990, pp. 354–359.

31.

Xiao,

Wang and

Lin, Ed-join: An efficient algorithm for similarity joins with edit distance constraints, PVLDB1(1) (2008), 933–944.

32.

Xiao,

Wang,

Lin and

J.X.

Yu, Efficient similarity joins for near duplicate detection, in: Proc. of the 17th International Conference on World Wide Web, WWW 2008, Beijing, China, April 21–25, 2008,

Huai,

Chen,

H.-W.

Hon,

Liu,

W.-Y.

Ma,

Tomkins and

Zhang, eds, ACM, 2008, pp. 131–140. doi:10.1145/1367497.1367516.

On the efficient execution of bounded Jaro-Winkler distances

Abstract

Keywords

1. Introduction

1 See http://stats.lod2.eu for an overview of the current state of the Cloud. Last access: July 11th, 2014.

2.1. Link discovery

2.2. The Jaro-Winkler similarity

3. Improving the runtime of bounded Jaro-Winkler

3.1. Length-based filters

9 In most programming languages, especially Java (which we used for our implementation), the length of string is stored in a variable and can thus be accessed in constant time.

Table 1 Bounds for distinct string lengths ( θ = 0.9 ) | s | ρ min ρ max sizes in range 4 2 8 ( 4 , 7 ) 7 3 14 ( 4 , 7 , 11 ) 11 5 22 ( 7 , 11 , 18 ) 18 9 36 ( 11 , 18 )

Table 2 Calculation of m max c e ( s , c ) e ( t , c ) min ( e ( s , c ) , e ( t , c ) ) m max a 2 1 1 1 c 0 1 0 1 h 0 1 0 1 i 0 1 0 1 m 0 1 0 1 n 1 0 0 1 o 1 1 1 2 p 0 1 0 2 r 1 1 1 3 s 1 1 1 4 t 2 1 1 5 u 1 0 0 5

3.3.2. Implementation with tries

Table 3 Datasets S and T from the example S T {nines} {niece, niche, nices, spice, since}

4. Evaluation

4.1. Experimental setup

4.3. Scalability evaluation

10 We ran SILK with -Dthreads = 1 for the sake of fairness.

6. Conclusion and future work

References

¹
See http://stats.lod2.eu for an overview of the current state of the Cloud. Last access: July 11th, 2014.

⁹
In most programming languages, especially Java (which we used for our implementation), the length of string is stored in a variable and can thus be accessed in constant time.

Table 1
Bounds for distinct string lengths ( $θ = 0.9$ )

$| s |$ $ρ_{\min}$ $ρ_{\max}$ sizes in range

4 2 8 $(4, 7)$

7 3 14 $(4, 7, 11)$

11 5 22 $(7, 11, 18)$

18 9 36 $(11, 18)$

Table 2
Calculation of $m_{\max}$

c $e (s, c)$ $e (t, c)$ $min (e (s, c), e (t, c))$ $m_{\max}$

a 2 1 1 1

c 0 1 0 1

h 0 1 0 1

i 0 1 0 1

m 0 1 0 1

n 1 0 0 1

o 1 1 1 2

p 0 1 0 2

r 1 1 1 3

s 1 1 1 4

t 2 1 1 5

u 1 0 0 5

Table 3
Datasets S and T from the example

S T

{nines} {niece, niche, nices, spice, since}

¹⁰
We ran SILK with -Dthreads = 1 for the sake of fairness.