
Abstract

Large amounts of geo-spatial information have been made available with the growth of the Web of Data. While discovering links between resources on the Web of Data has been shown to be a demanding task, discovering links between geo-spatial resources proves to be even more challenging. This is partly due to the resources being described by means of vector geometry. In particular, discrepancies in granularity and error measurements across data sets render the selection of appropriate distance measures for geo-spatial resources difficult. In this paper, we survey existing literature for point-set measures that can be used to measure the similarity of vector geometries. We then present and evaluate the ten measures that we derived from the literature. We evaluate these measures with respect to their time-efficiency and their robustness against discrepancies in measurement and in granularity. To this end, we use samples of real data sets of different granularity as input for our evaluation framework. The results obtained on three different data sets suggest that most distance approaches can be made to scale. Moreover, while some distance measures are significantly slower than others, distance measures based on means, surjections and sums of minimal distances are robust against the different types of discrepancies.

Keywords

Link discovery, geographic distances

1. Introduction

The Web of Data has grown significantly over the last years. In particular, very large data sets pertaining to different domains such as bio-medicine (e.g., LinkedTCGA with now 20+ billion triples [40]) and geo-locations (e.g., LinkedGeoData with approximately 30 billion triples [4]) have been made available. Implementing the fourth Linked Data principle (i.e., the creation of links between these knowledge bases and other knowledge bases) for these knowledge bases has been shown to be a difficult problem in previous works [5]. Most of the existing solutions (see [29] for an overview) address this problem by using a complex similarity or distance function to compare instances from two (not necessarily distinct) knowledge bases. The result of the function is then compared to a threshold. The result of the comparison is finally used to suggest the existence of a link between instances.

While previous works have compared a large number of measures with respect to how well they perform in the link discovery task [12], measures for linking geo-spatial resources have received little attention. Previous works have nevertheless shown that domain-specific measures and algorithms are required to tackle the problem of geo-spatial link discovery [31]. For example, 20,354 pairs of cities in DBpedia 2014 share exactly the same label. For villages in LinkedGeoData 2014, this number grows to 3,946,750. Consequently, finding links between geo-spatial resources requires devising means to distinguish them using their geo-spatial location. On the Web of Data, the geo-spatial location of resources is most commonly described using either points or, more generally, vector geometry. Thus, using geo-spatial information to improve link discovery requires means to measure distances between such vector geometry data.

Examples of vector geometry descriptions for the country of Malta are shown in Fig. 1. As displayed in the examples, two types of discrepancies occur when one compares the vector descriptions of the same real-world entity (e.g., Malta) in different data sets. First, the different vector descriptions of a given real-world entity often comprise different points across different data sets. For example, Malta’s vector description in DBpedia contains the point with latitude 35.89 and longitude 14.46. In LinkedGeoData, the same country is described by the point with latitude 35.9 and longitude 14.5. We dub the discrepancy in latitude and longitude for points in the vector description measurement discrepancy. The second type of discrepancy that occurs in the vector description of geo-spatial resources across different data sets is the discrepancy in granularity. For example, Malta is described by one polygon in DBpedia, two polygons in NUTS and a single point in LinkedGeoData.

Fig. 1.

Vector description of the country of Malta. The blue polygons show the vector geometry for Malta in the NUTS data set, the red polygon shows the same for DBpedia, while the black point shows the location of the same real-world entity according to LinkedGeoData.

Analysing the behaviour of different measures with respect to these two types of discrepancies is of central importance for identifying the measures that should be used for geo-spatial link discovery. In this paper, we address this research gap by first surveying existing measures that can be used for comparing point sets. We then compare these measures in a series of experiments on samples extracted from three real data sets with the aim of answering the questions introduced in Section 3.

Note that throughout the paper, we model complex representations of geo-spatial objects as point sets. While more complex representations can be chosen, comparing all corresponding measures would go beyond the scope of this paper. In addition, we are only concerned with atomic measures and do not consider combinations of measures. Approaches that allow combining measures can be found in [29].

The remainder of this paper is structured as follows: Section 2 introduces the basic assumptions and notation used throughout the rest of the paper. Section 3 introduces our systematic survey methodology. Then, in Section 4, we give a detailed description of each of the point set distance functions, as well as their mathematical formulation and different implementations. Thereafter, in Section 5, we present our evaluation of both scalability and robustness. Finally, we conclude the paper with a brief overview of related work (Section 6), as well as a conclusion and future work (Section 7). All measures and algorithms presented herein were integrated into the LIMES framework.¹

¹ http://limes.sf.net

2. Preliminaries and notation

We assume the link discovery (LD) problem as being formulated in a way akin to [31]: Given two sets S and T of resources as well as a predicate p, compute the set $M = \{(s, t) \in S \times T : ⟨ s, p, t ⟩ \text{ holds}\}$ , where $⟨ s, p, t ⟩$ is the RDF triple with the subject s, the predicate p and the object t. Computing M directly is commonly a non-trivial task. State-of-the-art link discovery systems thus most commonly aim to compute an approximation $M^{'}$ of M with $M^{'} = \{(s, t) \in S \times T : δ (s, t) ⩽ θ\}$ , where δ is a complex distance function and θ is a distance threshold. δ most commonly consists of a combination of atomic measures which can be used to compare property values of the resources s and t. For example, the edit distance is an atomic measure that can be used to compare the labels of two resources.
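The threshold-based approximation $M^{'}$ can be illustrated with a minimal sketch. The function name `approximate_links` and the exhaustive comparison of all pairs are assumptions made for illustration only; they are not the procedure used by any particular link discovery framework.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

S = TypeVar("S")
T = TypeVar("T")


def approximate_links(
    sources: Iterable[S],
    targets: Iterable[T],
    distance: Callable[[S, T], float],
    theta: float,
) -> List[Tuple[S, T]]:
    """Return M' = {(s, t) : distance(s, t) <= theta} by comparing all pairs."""
    target_list = list(targets)  # materialise once so it can be iterated repeatedly
    return [(s, t) for s in sources for t in target_list if distance(s, t) <= theta]
```

Any distance function can be plugged in, for instance an edit distance on labels or one of the point set measures of Section 4.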

In addition to bearing properties similar to those borne by other types of resources (label, country, etc.), geo-spatial resources are commonly described by means of vector geometry.²

² Most commonly encoded in the WKT format, see http://www.opengeospatial.org/standards/sfa.

Each vector description can be modelled as a set of points. We will write $s = (s_{1}, \dots, s_{n})$ to denote that the vector description of the resource s comprises the points $s_{1}, \dots, s_{n}$ . A point $s_{i}$ on the surface of the planet is fully described by two values: its latitude $lat (s_{i}) = φ_{i}$ and its longitude $lon (s_{i}) = λ_{i}$ . We will denote points $s_{i}$ as pairs $(φ_{i}, λ_{i})$ . Then, the distance between two points $s_{1}$ and $s_{2}$ can be computed by using the orthodromic distance

$\begin{matrix} (1) & δ (s_{1}, s_{2}) = R \cos^{- 1} (\sin (φ_{1}) \sin (φ_{2}) + \cos (φ_{1}) \cos (φ_{2}) \cos (λ_{2} - λ_{1})), \end{matrix}$

where $R = 6371 km$ is the planet’s radius.³

³ Here, we assume the planet to be a perfect sphere.
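Equation (1) translates directly into code. The following is a minimal sketch, assuming that points are given as (latitude, longitude) pairs in degrees; the function name and the clamping step are illustrative choices and are not taken from the LIMES implementation.

```python
import math

EARTH_RADIUS_KM = 6371.0  # R in Eq. (1); the planet is assumed to be a perfect sphere


def orthodromic_distance(p, q, radius=EARTH_RADIUS_KM):
    """Great-circle distance of Eq. (1) between two (latitude, longitude) pairs in degrees."""
    phi1, lam1 = math.radians(p[0]), math.radians(p[1])
    phi2, lam2 = math.radians(q[0]), math.radians(q[1])
    cos_angle = (math.sin(phi1) * math.sin(phi2)
                 + math.cos(phi1) * math.cos(phi2) * math.cos(lam2 - lam1))
    # Rounding errors can push the value slightly outside the domain of acos.
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return radius * math.acos(cos_angle)
```

The spherical law of cosines used here loses precision for points that are very close to each other; this sketch keeps the form of Eq. (1) rather than switching to a numerically more stable variant.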

Alternatively, the distance between two points $s_{1}$ and $s_{2}$ can be computed based on the great elliptic curve distance [10]. Note that this distance is recommended in previous works (e.g., [13]) as it is more accurate than the orthodromic distance. However, given that our evaluations (see Table 2) showed that the distance error of the orthodromic distance did not affect the LD results and that the orthodromic distance has a lower time complexity than the great elliptic curve distance, we rely on the orthodromic distance throughout the explanations in this paper.

Computing the distance between sets of points is an even more difficult endeavor. Over the last years, several measures have been developed to achieve this task. Most of these approaches regard vector descriptions as ordered sets of points. In the following sections, we present such measures and evaluate their robustness against different types of discrepancies.

3. Systematic survey methodology

We carried out a systematic study of the literature on distance measures for point sets according to the approach presented in [24,28]. In the following, we present our survey approach in more detail.

3.1. Research question formulation

We began by defining research questions that guided our search for measures. These questions were as follows:

$Q_{1}$: Which of the existing measures is the most time-efficient measure?

$Q_{2}$: Which measure generates mappings with a high precision, recall, or F-measure?

$Q_{3}$: How well do the measures perform when the data sets have different granularities?

$Q_{4}$: How sensitive are the measures to measurement discrepancies?

$Q_{5}$: How robust are the measures when both types of discrepancy occur?

3.2. Eligibility criteria

To direct our search process towards answering our research questions, we created two lists of inclusion/exclusion criteria for papers. Papers had to abide by all inclusion criteria and by none of the exclusion criteria to be part of our survey:

Inclusion Criteria

Work published in English between 2003 and 2013.

Studies on link discovery based on geographic terms.

Algorithms for finding the distance between point sets.

Techniques for improving the performance of well-known point set distance algorithms.

Exclusion Criteria

Work that was not peer-reviewed or published.

Work that was published as a poster abstract.

Distance functions that focused on finding distances only between convex point sets.

3.3. Search strategy

Based on the research questions and the eligibility criteria, we defined a set of the most relevant keywords. They were as follows: Linked Data, link discovery, record linkage, polygon, point set, distance, metric, geographic, spatial, non-convex. We used those keywords as follows:

Linked Data AND (Link discovery OR record linkage) AND (geographic OR spatial)

Non-convex AND (polygon OR point set) AND (distance OR metric)

A keyword search was applied in the following list of search engines, digital libraries, journals, conferences and their respective workshops:

Search engines and digital libraries:

Google Scholar⁴

ACM Digital Library⁵

Springer Link⁶

Science Direct⁷

ISI Web of Science⁸

⁴ http://scholar.google.com/

⁵ http://dl.acm.org/

⁶ http://link.springer.com/

⁷ http://www.sciencedirect.com/

⁸ http://portal.isiknowledge.com/

Journals:

Semantic Web Journal (SWJ)⁹

Journal of Web Semantics (JWS)¹⁰

Journal of Data and Knowledge Engineering (JDWE)¹¹

⁹ http://www.semantic-web-journal.net/

¹⁰ http://www.websemanticsjournal.org/

¹¹ http://www.journals.elsevier.com/data-and-knowledge-engineering/

3.4. Search methodology phases

In order to conduct our systematic literature review, we applied a six-phase search methodology:

Phase 1: Apply the keywords to the search engines using the time frame from 2003 to 2013.

Phase 2: Scan article titles based on the inclusion/exclusion criteria.

Phase 3: Import the output from phase 2 into reference management software to remove duplicates. Here, we used Mendeley¹² as it is free and has functionality for deduplication.

¹² http://www.mendeley.com/

Phase 4: Review abstracts according to the inclusion/exclusion criteria.

Phase 5: Read through the papers, looking for approaches that fit the inclusion criteria, and exclude papers that fit the exclusion criteria. Also, retrieve and analyze related papers from the references.

Phase 6: Implement the point set distance functions found in phase 5.

Table 1 provides details about the number of articles retrieved in each of the first five search phases. Note that in the sixth phase we only implemented distance functions found in the articles resulting from phase 5.

Table 1

Number of retrieved articles during each of the search methodology phases

Search engines	Phase 1	Phase 2	Phase 3	Phase 4	Phase 5
Google Scholar	9860	21	19	10	4
ACM Digital Library	3677	16	16	5	3
Springer Link	5101	22	21	11	8
Science Direct	1055	21	18	10	4
ISI Web of Science	176	15	14	4	2
SWJ	0	0	0	0	0
JWS	0	0	0	0	0
JDWE	0	0	0	0	0

4. Distance measures for point sets

In the following, we present each of the distance measures derived from our systematic survey and exemplify it by using the DBpedia and NUTS descriptions of Malta presented in Fig. 1. The input for the distance measures consists of two point sets $s = (s_{1}, \dots, s_{n})$ and $t = (t_{1}, \dots, t_{m})$ , where n and m denote the number of distinct points in the descriptions of s and t, respectively. W.l.o.g., we assume $n ⩾ m$ .

4.1. Mean distance function

The mean distance is one of the most efficient distance measures for point sets [16]. First, a mean point is computed for each point set. Then, the distance between the two means is computed by using the orthodromic distance. Formally: $\begin{matrix} (2) & D_{mean} (s, t) = δ (\frac{\sum_{s_{i} \in s} s_{i}}{n}, \frac{\sum_{t_{j} \in t} t_{j}}{m}) . \end{matrix}$ $D_{mean}$ can be computed in $O (n)$ . For our example, the mean of the DBpedia description of Malta is the point (14.48, 35.89). The mean of the NUTS description is the point (14.33, 35.97). Thus, $D_{mean}$ returns $18.46 km$ as the distance between the two mean points.
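Equation (2) can be sketched in a few lines. As an assumption made for brevity, the point-to-point distance δ is passed in as the parameter `delta` and defaults to the planar Euclidean distance `math.dist`; for geographic data one would pass an implementation of the orthodromic distance of Eq. (1) instead.

```python
import math


def mean_distance(s, t, delta=math.dist):
    """D_mean of Eq. (2): distance between the coordinate-wise means of s and t."""
    mean_s = tuple(sum(coords) / len(s) for coords in zip(*s))
    mean_t = tuple(sum(coords) / len(t) for coords in zip(*t))
    return delta(mean_s, mean_t)
```

Each point is visited exactly once, which reflects the linear complexity stated above.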

4.2. Max distance function

The idea behind this measure is to compute the overall maximal distance between points $s_{i} \in s$ and $t_{j} \in t$ . Formally, the maximum distance is defined as: $\begin{matrix} (3) & D_{max} (s, t) = max_{s_{i} \in s, t_{j} \in t} δ (s_{i}, t_{j}) . \end{matrix}$ For our example, $D_{max}$ returns $38.59 km$ as the distance between the points $d_{3}$ and $n_{6}$ . Due to its construction, this distance is particularly sensitive to outliers. While the naive implementation of Max is in $O (n^{2})$ , [9] introduced an efficient implementation that achieves a complexity of $O (n log n)$ .

4.3. Min distance function

The main idea of Min is akin to that of Max: it computes the overall minimal distance between points $s_{i} \in s$ and $t_{j} \in t$ . It is formally defined as $\begin{matrix} (4) & D_{min} (s, t) = min_{s_{i} \in s, t_{j} \in t} δ (s_{i}, t_{j}) . \end{matrix}$ Going back to our example, $D_{min}$ returns $7.82 km$ as the distance between the points $d_{2}$ and $n_{5}$ . Like $D_{max}$ , $D_{min}$ can be implemented to achieve a complexity of $O (n log n)$ [26,47].

4.4. Average distance function

To compute the average point set distance, the orthodromic distances between all source-target point pairs are summed up and divided by the number of source-target point pairs: $\begin{matrix} (5) & D_{avg} (s, t) = \frac{1}{n m} \sum_{s_{i} \in s, t_{j} \in t} δ (s_{i}, t_{j}) . \end{matrix}$ For our example, $D_{avg}$ returns $22 km$ . A naive implementation of the average distance is in $O (n^{2})$ .
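The max, min and average measures of Eqs. (3) to (5) share the same naive quadratic structure, which the following sketch makes explicit. The parameter `delta` is an assumption of this sketch: it stands for the point-to-point distance δ and defaults to the planar Euclidean distance. The more efficient implementations cited above are not reproduced here.

```python
import math
from itertools import product


def max_distance(s, t, delta=math.dist):
    """D_max of Eq. (3): largest distance over all source-target point pairs."""
    return max(delta(a, b) for a, b in product(s, t))


def min_distance(s, t, delta=math.dist):
    """D_min of Eq. (4): smallest distance over all source-target point pairs."""
    return min(delta(a, b) for a, b in product(s, t))


def average_distance(s, t, delta=math.dist):
    """D_avg of Eq. (5): mean distance over all n * m source-target point pairs."""
    return sum(delta(a, b) for a, b in product(s, t)) / (len(s) * len(t))
```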

4.5. Sum of minimums distance function

This distance function was first proposed by [33] and is computed as follows: First, the closest point $t_{j}$ to each point $s_{i}$ is determined, i.e., the point $t_{j} = arg {min}_{t_{k} \in t} δ (s_{i}, t_{k})$ , and the distances to these closest points are summed up. The same operation is carried out with source and target reversed. Finally, the average of the two sums is the distance value. Formally, the sum of minimums distance is defined as: $\begin{matrix} (6) & D_{som} (s, t) = \frac{1}{2} \left( \sum_{s_{i} \in s} min_{t_{j} \in t} δ (s_{i}, t_{j}) + \sum_{t_{i} \in t} min_{s_{j} \in s} δ (t_{i}, s_{j}) \right) . \end{matrix}$

Going back to our example, the sum of the minimum distances from each of the DBpedia points describing Malta to the NUTS points is $37.27 km$ , and the corresponding sum from NUTS to DBpedia is $178.58 km$ . Consequently, $D_{som}$ returns $107.92 km$ as the average of the two values. The sum of minimums has the same complexity as $D_{min}$ .
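A naive sketch of Eq. (6) is given below. It assumes a point-to-point distance `delta` (here defaulting to the planar Euclidean distance) and uses a quadratic nearest-neighbour search rather than an optimised one.

```python
import math


def sum_of_minimums(s, t, delta=math.dist):
    """D_som of Eq. (6): average of the two directed sums of nearest-point distances."""
    s_to_t = sum(min(delta(a, b) for b in t) for a in s)
    t_to_s = sum(min(delta(b, a) for a in s) for b in t)
    return 0.5 * (s_to_t + t_to_s)
```

The two directed sums generally differ when the point sets have different sizes, as the Malta example shows.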

4.6. Surjection distance function

The surjection distance function introduced by [37] defines the distance between two point sets as the minimum, over all surjections from the larger set to the smaller one, of the sum of the distances between the mapped point pairs. Formally, the surjection distance is defined as: $\begin{matrix} (7) & D_{s} (s, t) = min_{η} \sum_{(e_{1}, e_{2}) \in η} δ (e_{1}, e_{2}), \end{matrix}$ where η ranges over the surjections from the larger of the point sets s and t to the smaller. In our example, the minimising surjection is $η = \{(n_{1}, d_{4}), (n_{2}, d_{1}), (n_{3}, d_{2}), (n_{4}, d_{3}), (n_{5}, d_{4}), (n_{6}, d_{1}), (n_{7}, d_{1}), (n_{8}, d_{1}), (n_{9}, d_{1})\}$ . Then, $D_{s}$ returns $184.74 km$ as the sum of the orthodromic distances between the point pairs included in η. A main drawback of the surjection distance is that it can be biased, i.e., it may put more weight on some points than on others. For instance, in our example, η maps five different points to $d_{1}$ but only one point to $d_{2}$ .
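For very small point sets, Eq. (7) can be evaluated by brute force. The sketch below enumerates every mapping from the larger set to the smaller one and keeps the surjective ones; it is exponential in the size of the larger set and is meant only to illustrate the definition, not as a practical algorithm. The parameter `delta` is an assumed, symmetric point-to-point distance.

```python
import math
from itertools import product


def surjection_distance(s, t, delta=math.dist):
    """D_s of Eq. (7) by exhaustive enumeration of all surjections (tiny inputs only)."""
    large, small = (s, t) if len(s) >= len(t) else (t, s)
    best = math.inf
    # assignment[i] is the index of the point in `small` that large[i] is mapped to.
    for assignment in product(range(len(small)), repeat=len(large)):
        if len(set(assignment)) < len(small):
            continue  # some point of the smaller set is not hit: not a surjection
        cost = sum(delta(large[i], small[j]) for i, j in enumerate(assignment))
        best = min(best, cost)
    return best
```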

4.7. Fair surjection distance function

In order to fix the bias of the surjection distance function, [37] introduces an extension of the surjection function which is dubbed fair surjection. A surjection $η^{'}$ between the sets s and t is said to be fair if it maps the elements of the larger set as evenly as possible to the elements of the smaller set. The fair surjection distance is defined formally as: $\begin{matrix} (8) & D_{fs} (s, t) = min_{η^{'}} \sum_{(e_{1}, e_{2}) \in η^{'}} δ (e_{1}, e_{2}), \end{matrix}$ where $η^{'}$ ranges over the fair surjections from the larger of the sets s and t to the smaller. For our example, $η^{'} = \{(n_{1}, d_{1}), (n_{2}, d_{2}), (n_{3}, d_{3}), (n_{4}, d_{4}), (n_{5}, d_{1}), (n_{6}, d_{2}), (n_{7}, d_{3}), (n_{8}, d_{4}), (n_{9}, d_{1})\}$ . Then, $D_{fs}$ returns $137.42 km$ as the sum of the orthodromic distances between the point pairs included in $η^{'}$ .
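The fairness constraint can be added to a brute-force enumeration as follows. This sketch interprets "as evenly as possible" as the requirement that the numbers of points mapped to any two elements of the smaller set differ by at most one; this reading is an assumption of the sketch. As before, `delta` is an assumed symmetric point-to-point distance and the enumeration is only feasible for tiny inputs.

```python
import math
from collections import Counter
from itertools import product


def fair_surjection_distance(s, t, delta=math.dist):
    """D_fs of Eq. (8) by exhaustive enumeration of fair surjections (tiny inputs only)."""
    large, small = (s, t) if len(s) >= len(t) else (t, s)
    best = math.inf
    for assignment in product(range(len(small)), repeat=len(large)):
        counts = Counter(assignment)
        if len(counts) < len(small):
            continue  # not a surjection
        if max(counts.values()) - min(counts.values()) > 1:
            continue  # preimage sizes differ by more than one: not fair
        cost = sum(delta(large[i], small[j]) for i, j in enumerate(assignment))
        best = min(best, cost)
    return best
```

In the usage below, the unconstrained surjection would map three points to the first target, whereas the fair variant is forced into a two-two split and therefore returns a larger value.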

4.8. Link distance function

The link distance introduced by [17] defines the distance between two point sets s and t by means of a linking, i.e., a relation $R \subseteq s \times t$ satisfying the following two conditions:

For all $s_{i} \in s$ there exists $t_{j} \in t$ such that $(s_{i}, t_{j}) \in R$ .

For all $t_{j} \in t$ there exists $s_{i} \in s$ such that $(s_{i}, t_{j}) \in R$ .

Formally, the minimum link distance between two point sets s and t is defined by

$\begin{matrix} (9) & D_{l} (s, t) = min_{R} \sum_{(s_{i}, t_{j}) \in R} δ (s_{i}, t_{j}), \end{matrix}$

where the minimum is computed over all linkings R between s and t, i.e., over all relations satisfying the two conditions above. For our example, the small granularity of the Malta descriptions in the data sets at hand leads to $D_{l}$ having the same result as $D_{fs}$ . See [17] for a complexity analysis of the surjection, fair surjection and link distance functions.
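The definition of Eq. (9) can be checked on toy inputs with an exhaustive search over all relations that satisfy the two conditions. The sketch below is exponential in n·m and serves only to illustrate the definition; `delta` is an assumed point-to-point distance defaulting to the planar Euclidean distance.

```python
import math
from itertools import combinations, product


def link_distance(s, t, delta=math.dist):
    """D_l of Eq. (9) by exhaustive enumeration of all linkings (tiny inputs only)."""
    all_s, all_t = set(range(len(s))), set(range(len(t)))
    pairs = list(product(all_s, all_t))
    best = math.inf
    # A linking must touch every point, so it needs at least max(n, m) pairs.
    for size in range(max(len(s), len(t)), len(pairs) + 1):
        for relation in combinations(pairs, size):
            covers_s = {i for i, _ in relation} == all_s
            covers_t = {j for _, j in relation} == all_t
            if covers_s and covers_t:
                cost = sum(delta(s[i], t[j]) for i, j in relation)
                best = min(best, cost)
    return best
```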

4.9. Hausdorff distance function

The Hausdorff distance is a measure of the maximum of the minimum distances between two sets of points. It is one of the most commonly used approaches for determining the similarity between point sets [22]. Formally, the Hausdorff distance is defined as $\begin{matrix} (10) & D_{h} (s, t) = max_{s_{i} \in s} {min_{t_{j} \in t} {δ (s_{i}, t_{j})}} . \end{matrix}$ Going back to our example, the algorithm first finds the orthodromic distance between each of the points of the DBpedia description and its nearest point in the NUTS description, which yields the point pairs $(d_{1}, n_{5})$ , $(d_{2}, n_{5})$ , $(d_{3}, n_{4})$ , and $(d_{4}, n_{4})$ . Then, $D_{h}$ is the maximum of these distances, which is attained by the pair $(d_{4}, n_{4})$ and equals $34.21 km$ . [31] introduces two efficient approaches for computing the bound Hausdorff distance.
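Equation (10) corresponds to the following naive quadratic sketch, in which `delta` is an assumed point-to-point distance defaulting to the planar Euclidean distance. Note that the formula as written is directed from s to t, so swapping the arguments can change the result.

```python
import math


def hausdorff_distance(s, t, delta=math.dist):
    """D_h of Eq. (10): the largest distance from a point of s to its nearest point of t."""
    return max(min(delta(a, b) for b in t) for a in s)
```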

Fig. 2.

Fréchet vs other distance approaches.

4.10. Fréchet distance function

Most of the distance measures presented before share a considerable disadvantage. Consider the two curves shown in Fig. 2. Any point on one of the curves has a nearby point on the other curve. Therefore, many of the measures presented so far (incl. Hausdorff, min, sum of mins) return a low distance. However, these curves are intuitively quite dissimilar: While they are close on a point-wise basis, they are not so close if we try to map the curves continuously to each other. A distance measure that captures this intuition is the Fréchet distance [19].

The basic idea behind the Fréchet distance is encapsulated in the following example:¹³

¹³ Adapted from [1].

Imagine two Formula One racing cars. The first car, A, hurtles over a curve formulated by a first point set. The second car, B, does the same over a curve formulated by the second point set. Both cars may vary in velocity but they do not move backwards over their curves. Then the Fréchet distance between the point sets is the minimum length of a non-stretchable cable that would be attached to both cars and would not break during the race.

In order to derive a formal definition of the Fréchet distance, we first define a curve as a continuous mapping $f : [a, b] \to V$ with $a, b \in \mathbb{R}$ and $a < b$ , where V denotes an arbitrary vector space. A polygonal curve is a curve $P : [0, n] \to V$ with $n \in \mathbb{N}$ , such that for all $i \in \{0, 1, \dots, n - 1\}$ the restriction of P to $[i, i + 1]$ is affine, i.e., $P (i + λ) = (1 - λ) P (i) + λ P (i + 1)$ for all $λ \in [0, 1]$ . n is called the length of P. Then, the Fréchet distance is formally defined as: $\begin{matrix} (11) & D_{f} (s, t) = inf_{\begin{array}{c} α : [0, 1] \to [s_{1}, s_{n}] \\ β : [0, 1] \to [t_{1}, t_{m}] \end{array}} sup_{τ \in [0, 1]} δ (f (α (τ)), g (β (τ))), \end{matrix}$ where $f : [s_{1}, s_{n}] \to V$ and $g : [t_{1}, t_{m}] \to V$ . α and β range only over continuous and increasing functions with $α (0) = s_{1}$ , $α (1) = s_{n}$ , $β (0) = t_{1}$ and $β (1) = t_{m}$ . Computing the Fréchet distance for our example returns $34.62 km$ . See [1] for a complexity analysis of the Fréchet distance.
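The continuous Fréchet distance of Eq. (11) is more involved to implement. As an illustration of the underlying idea, the sketch below computes the discrete Fréchet distance, a different and simpler variant that only couples the vertices of the two polygonal curves and that bounds the continuous distance from above. It is a swapped-in technique chosen for brevity and is not claimed to be the implementation evaluated in this paper; `delta` is an assumed point-to-point distance.

```python
import math


def discrete_frechet_distance(s, t, delta=math.dist):
    """Discrete Frechet distance between the ordered point sequences s and t."""
    n, m = len(s), len(t)
    # coupling[i][j]: smallest cable length needed to traverse s[0..i] and t[0..j].
    coupling = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            d = delta(s[i], t[j])
            if i == 0 and j == 0:
                coupling[i][j] = d
            elif i == 0:
                coupling[i][j] = max(coupling[0][j - 1], d)
            elif j == 0:
                coupling[i][j] = max(coupling[i - 1][0], d)
            else:
                cheapest_predecessor = min(coupling[i - 1][j],
                                           coupling[i - 1][j - 1],
                                           coupling[i][j - 1])
                coupling[i][j] = max(cheapest_predecessor, d)
    return coupling[n - 1][m - 1]
```

Unlike the measures of the previous sections, the result depends on the order of the points: traversing the same two points in opposite directions yields a non-zero distance even though the point sets coincide.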

Overall, the distance measures presented above return widely differing values, ranging from $7.82 km$ to $184.74 km$ , even on our small example. In the following, we evaluate how well these measures can be used for link discovery.

5. Evaluation

The goal of our evaluation was to answer the five questions mentioned in Section 3.1. To this end, we devised four series of experiments. First, we evaluated the use of different point-to-point geographical distance formulas together with the point set distances introduced in Section 4. Next, we evaluated the scalability of the ten measures with growing data set sizes. Then, we measured the robustness of these measures against measurement and granularity discrepancies as well as combinations of both. Finally, we measured the scalability of the measures when combined with the Orchid algorithm.

5.1. Experimental setup

In this section, we describe the experimental setup used throughout our experiments. We focus on datasets of geo-spatial regions for all experiments as they were the major motivation behind this study. Other experiments with varying object densities (e.g., buildings) go beyond the scope of this paper.

5.1.1. Datasets

We used three publicly available data sets for our experiments. The first data set, NUTS,¹⁴ was used as the core data set for our scalability experiments. We chose this data set because it contains fine-granular descriptions of 1461 geo-spatial resources located in Europe. For example, Norway is described by 1981 points. The second data set, DBpedia,¹⁵ contains all the 731,922 entries from DBpedia that possess geometry entries. We chose DBpedia because it is commonly used in the Semantic Web community. Finally, the third data set, LinkedGeoData, contains all 3,836,119 geo-spatial objects from http://linkgeodata.org that are instances of the class Way.¹⁶ Further details on the data sets can be found in [31].

¹⁴ Version 0.91 available at http://nuts.geovocab.org/data/ is used in this work.

¹⁵ We used version 3.8 as available at http://dbpedia.org/Datasets.

¹⁶ We used the RelevantWays data set (version of April 26th, 2011) of LinkedGeoData as available at http://linkedgeodata.org/Datasets.

5.1.2. Benchmark

To the best of our knowledge, there is no gold standard benchmark geographic data set that can be used to evaluate the robustness of geo-spatial distance measures. We thus adapted the benchmark generation approach proposed by [18] to geo-spatial distance measures. In order to generate our benchmark data sets, we implemented two modifiers dubbed granularity and measurement error. The implemented geo-spatial modifiers are analogous to the data set generation algorithms from the field of cartographic generalisation [25]. The granularity modifier implements the most commonly used simplification operator [27], while the measurement error modifier is akin to the displacement operator [32]. Both modifiers take a point set s and a threshold as input and return a point set $s^{'}$ . Note that neither modifier preserves the topological consistency of the input geometries. However, this is not necessary when dealing with point set measures given that the order of points in a set does not matter per definition of a set. Hence, the two modifiers implemented in our evaluation scenario may generate topologically invalid geometries with self-intersecting polygons, overlapping polygons and/or sliver polygons. Altering our modifiers to preserve the topological consistency of the generated geometries remains future work for a more specific paper about the automatic generation of geographic benchmarks.

The granularity modifier $M_{g}$ regards the threshold $γ \in [0, 1]$ as the probability that a point of s will be in the output point set $s^{'}$ . To ensure that an empty point set is never generated, the modifier always includes the first point of s into $s^{'}$ . For all other points $s_{i} \in s$ , a random number r between 0 and 1 is generated. If $r ⩽ γ$ , then $s_{i}$ is added to $s^{'}$ . Else, $s_{i}$ is discarded.
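The granularity modifier described above can be sketched as follows. The function name and the optional `rng` parameter, which allows reproducible runs, are assumptions of this sketch.

```python
import random


def granularity_modifier(s, gamma, rng=None):
    """M_g: keep the first point of s, and every other point with probability gamma."""
    rng = random.Random() if rng is None else rng
    modified = [s[0]]  # the first point is always kept so the output is never empty
    for point in s[1:]:
        if rng.random() <= gamma:
            modified.append(point)
    return modified
```

With γ = 1 the point set is returned unchanged, and the expected size of the output shrinks as γ decreases.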

The measurement error modifier $M_{e}$ emulates measurement errors across data sets. To this end, it alters the latitude and longitude of each point $s_{i} \in s$ by at most the threshold μ. Consequently, the new coordinates of a point $s_{i}^{'}$ are located within a square of side length $2 μ$ with $s_{i}$ at the center. We used a sample of 200 points from each data set for our discrepancy experiments.
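A minimal sketch of the measurement error modifier is shown below. The text only states that each coordinate is altered by at most μ; drawing the offsets from a uniform distribution is an assumption of this sketch, as are the function name and the optional `rng` parameter.

```python
import random


def measurement_error_modifier(s, mu, rng=None):
    """M_e: shift latitude and longitude of every point by an offset in [-mu, mu]."""
    rng = random.Random() if rng is None else rng
    # Assumption: offsets are uniformly distributed; the source does not specify this.
    return [(lat + rng.uniform(-mu, mu), lon + rng.uniform(-mu, mu)) for lat, lon in s]
```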

To measure how well each of the distance measures performed w.r.t. the modifiers, we first created a reference mapping $M = \{(s, s) : s \in S\}$ when given a set of input resources S. Then, we applied the modifier to all the elements of S to generate a target data set T. We then measured the distance between each of the point sets in the set T and the resources in S. For each element of S, we stored the closest point set $t \in T$ in a mapping $M^{'}$ . We then computed the precision, recall and F-measure achieved within the experiment by comparing the pairs in $M^{'}$ with those in M.
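The evaluation procedure can be sketched as follows. The sketch assumes that `targets[i]` is the modified version of `sources[i]`, so that the reference mapping consists of the index pairs (i, i). Keeping all targets that are tied for the smallest distance is a further assumption of this sketch; the text does not state how ties are handled.

```python
import math


def evaluate_measure(sources, targets, set_distance):
    """Return (precision, recall, F-measure) of the nearest-target mapping M'."""
    reference = {(i, i) for i in range(len(sources))}
    predicted = set()
    for i, s in enumerate(sources):
        distances = [set_distance(s, t) for t in targets]
        closest = min(distances)
        # Assumption: every target tied for the smallest distance is kept.
        predicted.update((i, j) for j, d in enumerate(distances) if d == closest)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure
```

Any of the point set measures of Section 4 can be passed as `set_distance`.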

Table 2
Comparison of the orthodromic and great elliptic distances using 200 randomly selected resources from each data set, where precision (P), recall (R), F-measure (F) and run time (T) are presented. Note that all run times are in milliseconds

Data set	Point set measure	P (orthodromic)	R (orthodromic)	F (orthodromic)	T in ms (orthodromic)	P (elliptic)	R (elliptic)	F (elliptic)	T in ms (elliptic)
NUTS	Min	0.19	1.00	0.32	1806	0.19	1.00	0.32	7506
NUTS	Max	0.85	0.85	0.85	1696	0.85	0.85	0.85	7448
NUTS	Average	0.90	0.90	0.90	1676	0.90	0.90	0.90	7468
NUTS	Sum of Min	1.00	1.00	1.00	3421	1.00	1.00	1.00	15,035
NUTS	Link	1.00	1.00	1.00	2357	1.00	1.00	1.00	8878
NUTS	Surjection	1.00	1.00	1.00	2066	1.00	1.00	1.00	8666
NUTS	Fair Surjection	1.00	1.00	1.00	2253	1.00	1.00	1.00	8879
NUTS	Hausdorff	0.96	1.00	0.98	1719	0.96	1.00	0.98	7524
NUTS	Mean	1.00	1.00	1.00	185	1.00	1.00	1.00	250
NUTS	Frechet	1.00	1.00	1.00	1311	1.00	1.00	1.00	3652
DBpedia	Min	1.00	1.00	1.00	122	1.00	1.00	1.00	108
DBpedia	Max	1.00	1.00	1.00	64	1.00	1.00	1.00	102
DBpedia	Average	1.00	1.00	1.00	46	1.00	1.00	1.00	100
DBpedia	Sum of Min	1.00	1.00	1.00	46	1.00	1.00	1.00	159
DBpedia	Link	1.00	1.00	1.00	146	1.00	1.00	1.00	140
DBpedia	Surjection	1.00	1.00	1.00	124	1.00	1.00	1.00	246
DBpedia	Fair Surjection	1.00	1.00	1.00	107	1.00	1.00	1.00	153
DBpedia	Hausdorff	1.00	1.00	1.00	40	1.00	1.00	1.00	87
DBpedia	Mean	1.00	1.00	1.00	84	1.00	1.00	1.00	77
DBpedia	Frechet	1.00	1.00	1.00	110	1.00	1.00	1.00	286
LinkedGeoData	Min	1.00	1.00	1.00	1175	1.00	1.00	1.00	4554
LinkedGeoData	Max	1.00	1.00	1.00	1113	1.00	1.00	1.00	4483
LinkedGeoData	Average	1.00	1.00	1.00	1079	1.00	1.00	1.00	4480
LinkedGeoData	Sum of Min	1.00	1.00	1.00	2180	1.00	1.00	1.00	8999
LinkedGeoData	Link	1.00	1.00	1.00	1552	1.00	1.00	1.00	5603
LinkedGeoData	Surjection	1.00	1.00	1.00	1397	1.00	1.00	1.00	5406
LinkedGeoData	Fair Surjection	1.00	1.00	1.00	1472	1.00	1.00	1.00	5491
LinkedGeoData	Hausdorff	1.00	1.00	1.00	1107	1.00	1.00	1.00	4510
LinkedGeoData	Mean	1.00	1.00	1.00	101	1.00	1.00	1.00	244
LinkedGeoData	Frechet	1.00	1.00	1.00	1201	1.00	1.00	1.00	4493

5.1.3. Hardware

All experiments were carried out on a server running OpenJDK 64-Bit Server 1.6.0_27 on Ubuntu 14.04.2 LTS. The processors were 64-core AuthenticAMD clocked at 2.3 GHz. Unless stated otherwise, each experiment was assigned 8 GB of memory and was run 5 times.

5.2. Point-to-point geographic distance evaluation

To evaluate the effect of the basic point-to-point geographic distance $δ (s_{i}, t_{j})$ in the aforementioned point set distance functions (Section 4), we carried out two sets of experiments. In the first set of experiments, we used the orthodromic distance (see Equation 1) as the basic point-to-point distance function $δ (s_{i}, t_{j})$ , while the great elliptic curve distance [10] was used to compute $δ (s_{i}, t_{j})$ in the second set of experiments. As input, we used a sample of 200 randomly picked resources from each of the three data sets NUTS, DBpedia, and LinkedGeoData. We did not apply any modifiers in these two sets of experiments as we aimed to evaluate how the measures perform on real data. In each of the two sets of experiments, we measured the precision, recall, F-measure and run time for each of the 10 point set distance functions.

Fig. 3.

Scalability evaluation on the NUTS data set.

The results (see Table 2) show that the orthodromic and the great elliptic curve distances achieved the same precision, recall and F-measure when applied to the same resources. Moreover, the great elliptic curve distance was on average 3.9 times slower than the orthodromic distance. Given that the great elliptic curve distance is known to be more accurate than the orthodromic distance [13], these observations emphasise that (1) the distance error of the orthodromic distance did not affect the link discovery results and that (2) the orthodromic distance is considerably cheaper to compute than the great elliptic curve distance. Therefore, we rely on the orthodromic distance throughout the rest of the experiments in this paper.
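To make the cheaper of the two point-to-point distances concrete, the following is a minimal Python sketch of an orthodromic (great-circle) distance. It uses the haversine formulation and a mean Earth radius of 6371 km; both are assumptions made for illustration, and Equation 1 may be written in a different but equivalent form. The function name `orthodromic_km` is hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, not taken from the paper


def orthodromic_km(lat1, lon1, lat2, lon2, radius_km=EARTH_RADIUS_KM):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine term; only a handful of trigonometric calls per point pair.
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 for antipodal points.
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


# A quarter of the equator: a central angle of 90 degrees.
quarter_equator = orthodromic_km(0.0, 0.0, 0.0, 90.0)
```

Because the sphere model needs no iteration and no ellipsoid parameters, each call is cheap, which matters when the function is evaluated for every pair of points in two point sets.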

It is important to notice that the setting mentioned above could also be used with planar spaces. To this end, one would only have to replace the orthodromic distance with the Euclidean distance. One could also consider computing the point-to-point geographic distance $δ (s_{i}, t_{j})$ by using the Euclidean distance computed on 2D projected geometries. While this approach would work well for small regions of space, we did not consider it in our evaluation because projections distort distances more drastically when they are applied to large areas, and the data sets chosen for evaluating the surveyed distance measures have a large geographic extent. For selecting the appropriate projection type based on use case and scale, please refer to the INSPIRE Directive recommendations.¹⁷

¹⁷
http://inspire.ec.europa.eu/documents/Data_Specifications/INSPIRE_DataSpecification_RS_v3.2.pdf

5.3. Scalability evaluation

To quantify how well the measures scale, we measured their runtime on fragments of growing size of each of the input data sets. This experiment emulates a naive deduplication on data sets of various sizes. The results achieved on NUTS are shown in Fig. 3. We chose to show NUTS because it is the smallest and most fine-granular of our data sets. Thus, the runtimes measured here represent an upper bound for the runtime behaviour of the different approaches. $D_{mean}$ is clearly the most time-efficient approach. This was to be expected as its algorithmic complexity is linear. While most of the other measures are similar in their efficiency, the Fréchet distance stands out as the slowest to run. Overall, it is at least two orders of magnitude slower than the other measures. These results give a clear answer to question $Q_{1}$, which pertains to the time-efficiency of the measures at hand: $D_{mean}$ is the fastest.
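The difference in complexity can be illustrated with a short Python sketch. It assumes that the mean measure is the point-to-point distance between the coordinate-wise mean points of the two sets, and it uses the planar Euclidean distance (`math.dist`) as a stand-in for the orthodromic distance; both are simplifications for illustration, and the function names are hypothetical.

```python
import math


def centroid(points):
    """Coordinate-wise mean of a non-empty list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def mean_distance(s, t, delta=math.dist):
    """One pass over each point set, then a single point-to-point distance."""
    return delta(centroid(s), centroid(t))


def min_distance(s, t, delta=math.dist):
    """Compares every pair of points, i.e. len(s) * len(t) distance calls."""
    return min(delta(p, q) for p in s for q in t)


s = [(0.0, 0.0), (2.0, 0.0)]
t = [(1.0, 3.0), (1.0, 5.0)]
print(mean_distance(s, t))  # 4.0
```

The mean-based function touches each point once, whereas a pairwise measure such as min grows with the product of the two set sizes, which is consistent with the runtime gap reported above.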

Fig. 4.

Comparison of different point set distance measures against granularity discrepancies.

5.4. Robustness evaluation

We carried out three types of evaluations to measure the robustness of the measures at hand. First, we measured their robustness against discrepancies in granularity. Then, we measured their robustness against measurement discrepancies. Finally, we combined discrepancies in measurement and granularity and evaluated all our measures against these. We chose to show only a portion of our results for the sake of space. All results can be found at http://limes.sf.net.

5.4.1. Robustness against discrepancies in granularity

We measured the effect of changes in granularity on the measures at hand by using the five granularity thresholds 1, $\frac{1}{2}$, $\frac{1}{3}$, $\frac{1}{4}$ and $\frac{1}{5}$. Note that the threshold of 1 means that the data set was not altered. This setting allows us to answer $Q_{2}$, which pertains to the measures that are most adequate for deduplication. On NUTS (see Fig. 4(a)), our results suggest that $D_{min}$ is the least robust of the measures w.r.t. the F-measure. In addition to being the least time-efficient measure, Fréchet is also not robust against changes in granularity. The best performing measure w.r.t. its F-measure is the sum of minimums, followed closely by the surjection and mean measures. On the DBpedia and LinkedGeoData data sets, all measures apart from the Fréchet distance perform in a similar fashion (see Fig. 4(b)). This is, however, simply due to the sample of the data set containing point sets that were located far apart from each other. Thus, the answer to question $Q_{3}$ on the effect of discrepancies in granularity is that the sum of mins is the least sensitive to changes in granularity. Note, however, that sum of mins is closely followed by the mean measure.

The answer to $Q_{2}$ can be derived from the evaluation with the granularity threshold set to 1. Here, mean, fair surjection, surjection, sum of mins and link perform best. Of these, mean should be preferred because it is the most time-efficient.

5.4.2. Robustness against measurement discrepancies

The results of the evaluation of the robustness of the measures at hand against discrepancies in measurement are shown in Fig. 5. Interestingly, the results differ across the different data sets. On the NUTS data, where the regions are described with high granularity, five of the measures (mean, fair surjection, link, sum of mins and surjection) perform well. On LinkedGeoData, the number of points per resource is considerably smaller. Moreover, the resources are partly far from each other. Here, the Hausdorff distance is the poorest while max and mean perform comparably well. Finally, on the DBpedia data set, all measures apart from Fréchet are comparable. Our results thus suggest that the answer to $Q_{4}$ is as follows: The mean distance is the distance of choice when computing links between geo-spatial data sets which contain measurement errors, especially if the resources described have a high geographical density or the difference in granularity is significant.

Fig. 5.

Comparison of different point set distance measures against measurement discrepancies.

Fig. 6.

Comparison of different point set distance measures against granularity and measurement discrepancies.

5.4.3. Overall robustness

We emulated the differences across various real geographic data sets by combining the granularity and the measurement modifiers. Given a data set S, we generated a modified data set $S^{'}$ using the granularity modifier. The modified data set was used as input for a measurement modifier, which generated our final data set T. The results of our experiments are shown in Fig. 6. Again, the results vary across the different data sets. While mean performs well on NUTS (Fig. 6(a)) and LinkedGeoData, it is surjection that outperforms all the other measures on DBpedia (Fig. 6). This surprising result is due to the measurement errors having only a small effect on our DBpedia sample. Thus, after applying the granularity modifier, the surjection value is rarely affected.
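The chaining of the two modifiers (S to $S^{'}$ to T) can be sketched as follows. The two modifier functions below are hypothetical stand-ins: the granularity modifier keeps every k-th point for a threshold of 1/k, and the measurement modifier adds uniform noise to each coordinate. The actual modifiers used in the experiments are defined in the experimental setup and may differ; only the order of composition is taken from the text.

```python
import random


def granularity_modifier(points, gamma):
    """Hypothetical: keep about a fraction `gamma` of the points (every k-th point)."""
    step = max(1, round(1 / gamma))
    return points[::step]


def measurement_modifier(points, max_error, rng):
    """Hypothetical: shift each coordinate by uniform noise in [-max_error, max_error]."""
    return [(x + rng.uniform(-max_error, max_error),
             y + rng.uniform(-max_error, max_error)) for x, y in points]


def make_target(dataset, gamma, max_error, seed=0):
    """S -> S' (granularity modifier) -> T (measurement modifier), per resource."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    return {uri: measurement_modifier(granularity_modifier(pts, gamma), max_error, rng)
            for uri, pts in dataset.items()}


# Toy data set with one resource described by ten points.
S = {"ex:a": [(float(i), float(i)) for i in range(10)]}
T = make_target(S, gamma=1 / 5, max_error=0.01)
```

In such a setup, link discovery would then be run between S and T, and a link is counted as correct when a resource in S is matched to its own modified copy in T.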

Fig. 7.

Scalability evaluation with Orchid.

Overall, our results suggest the following answer to $Q_{5}$: In most cases, using the mean distance leads to high F-measures. Moreover, mean presents the advantage of being an order of magnitude faster than the other approaches. Still, the surjection measure should also be considered when comparing different data sets as it can significantly outperform the mean measure.

5.5. Scalability with Orchid

We aimed to determine how far the runtime of measures such as mean, surjection and sum of mins can be reduced so as to ensure that these measures can be used on large data sets. We thus combined these measures with the Orchid approach presented in [31]. The idea behind Orchid is to improve the runtime of algorithms for computing geo-spatial distance measures by adopting an approach akin to divide-and-conquer. Orchid assumes that it is given a distance measure (not necessarily a metric) m that abides by $m (s, t) ⩽ θ \to \forall s_{i} \in s \exists t_{j} \in t : δ (s_{i}, t_{j}) ⩽ θ$. Not every measure considered herein satisfies this condition; min and mean, for example, do not. However, dedicated extensions of Orchid can be developed for these measures. Overall, Orchid begins by partitioning the surface of the planet. The points in a given partition are then only compared with points in partitions that abide by the distance threshold underlying the computation.
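The condition that Orchid places on a measure can be checked on a small example. The Python sketch below contrasts a directed Hausdorff distance, which satisfies the condition by construction, with the min measure, which does not. It uses the planar Euclidean distance for $δ$ and hypothetical function names; it illustrates the condition only and is not part of Orchid.

```python
import math


def directed_hausdorff(s, t, delta=math.dist):
    """Largest distance from any point of s to its nearest point in t."""
    return max(min(delta(p, q) for q in t) for p in s)


def min_measure(s, t, delta=math.dist):
    """Smallest distance over all pairs of points."""
    return min(delta(p, q) for p in s for q in t)


def every_point_has_neighbour(s, t, theta, delta=math.dist):
    """Right-hand side of the condition: each s_i has some t_j within theta."""
    return all(any(delta(p, q) <= theta for q in t) for p in s)


s = [(0.0, 0.0), (10.0, 0.0)]
t = [(0.0, 1.0)]
theta = 2.0

print(min_measure(s, t) <= theta)               # True
print(every_point_has_neighbour(s, t, theta))   # False: (10, 0) has no close point in t
print(directed_hausdorff(s, t) <= theta)        # False, so the implication is not violated
```

For min, the premise holds while the conclusion fails, which is exactly why a partition-based pruning that relies on the condition could miss valid pairs for this measure unless it is adapted.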

We used the default settings of the implementation provided in the LIMES framework and a distance threshold of ${0.02}^{\circ}$ (2.2 km). Figure 7(a) shows the runtime results achieved on the same data sets as Fig. 3. Clearly, the runtimes of the approaches can be decreased by up to an order of magnitude. Orchid thus allows most measures (i.e., all apart from Fréchet) to scale in a manner comparable to that of the mean measure. As a result, the measures can now be used on the whole of the data sets at hand. For example, all distance measures apart from the Fréchet distance require less than five minutes to run on the whole of the DBpedia data set (see Fig. 7(b)).
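The correspondence between the angular threshold and the stated 2.2 km, together with the general idea of restricting comparisons to nearby partitions, can be sketched in Python as follows. The sketch assumes a spherical Earth with a radius of 6371 km and a naive square grid in latitude and longitude; it ignores that cells narrow towards the poles and is not Orchid's actual partitioning scheme. All function names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def degrees_to_km(angle_deg, radius_km=EARTH_RADIUS_KM):
    """Arc length on a sphere for a central angle given in degrees."""
    return math.radians(angle_deg) * radius_km


def grid_cell(lat, lon, cell_deg):
    """Index of the square lat/lon cell of width `cell_deg` containing a point."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def neighbouring_cells(cell):
    """The cell itself plus its eight direct neighbours."""
    i, j = cell
    return [(i + di, j + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]


threshold_km = degrees_to_km(0.02)          # roughly 2.2 km
cell = grid_cell(51.345, 12.375, 0.02)      # cell of an arbitrary example point
candidates = neighbouring_cells(cell)       # only these cells would be compared
```

With a cell width equal to the threshold, a point only needs to be compared with points in its own and the adjacent cells, which is the source of the reduction in the number of point-to-point distance computations.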

Overall, we can conclude that all measures apart from the Fréchet distance are amenable to being used for link discovery. While mean performs best overall, surjection-based and minimum-based measures are good candidates to use if mean returns unsatisfactory results. The Fréchet distance, on the other hand, seems inadequate for link discovery. This may, however, be due to the point set approach chosen in this paper. An analysis of the Fréchet distance on the description of resources as polygons remains future work. Note that the high Fréchet distances computed when minor discrepancies between representations of geo-spatial objects occurred can be of importance when carrying out other tasks such as analyzing the quality of RDF datasets.

5.6. Experiment on real datasets

We were interested in knowing whether the mean function performs well on real data. Validating link discovery results on geo-spatial data is difficult due to the lack of reference data sets. We thus measured the increase in precision and recall achieved by using geo-spatial information by sampling 100 links from the results of real link discovery tasks and evaluating these links manually. The links were evaluated by the authors who reached an agreement of 100%.

In the first experiment, we computed links between cities in DBpedia and LinkedGeoData by comparing solely their labels by means of an exact match string similarity. No geo-spatial similarity metric was used, leading to cities being linked if they have exactly the same name. Overall only 74% of the links in our sample were correct. The remaining 26% differed in country or even continent. We can assume that a recall of 1 would be achieved by using this approach as a particular city will most probably have the same name across different geo-spatial data sets. Thus, in the best case, linking geo-spatial resources in DBpedia to LinkedGeoData would only lead to an F-measure of 0.85.
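The F-measure quoted above follows directly from the observed precision of 0.74 and the assumed recall of 1, as the following short Python check shows. The function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


print(round(f_measure(0.74, 1.0), 2))  # 0.85
```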

In our second experiment, we extended the specification described above by linking two cities if their names were exact matches (as in the first experiment) and the mean distance function between their geometry representations returned a value under 100 km. In our sample, we achieved a perfect accuracy and thus an F-measure of 1. While this experiment is small, it clearly demonstrates the importance of using geo-spatial information for linking geo-spatial resources. Moreover, it suggests that the mean distance is indeed reliable on real data. More experiments still need to be carried out to ensure that the empirical results we obtained in this experiment are not a mere artifact of the data. We will achieve this goal by creating a benchmark for geo-spatial link discovery in future work.
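A link specification of this kind can be sketched in a few lines of Python. The sketch assumes that each resource is a dictionary with a `label` and a `geometry` given as a list of (latitude, longitude) pairs, that the mean point is the plain average of the coordinates (which breaks down for geometries crossing the antimeridian), and that the point-to-point distance is a haversine great-circle distance on a sphere of radius 6371 km. The record layout, the function names and the sample coordinates are illustrative and are not taken from the LIMES specification used in the experiment.

```python
import math


def mean_point(points):
    """Plain average of the (lat, lon) pairs of a geometry."""
    n = len(points)
    return (sum(lat for lat, _ in points) / n, sum(lon for _, lon in points) / n)


def great_circle_km(p, q, radius_km=6371.0):
    """Haversine distance between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(p[0]), math.radians(q[0])
    dphi = phi2 - phi1
    dlam = math.radians(q[1] - p[1])
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(min(1.0, math.sqrt(a)))


def should_link(a, b, max_km=100.0):
    """Exact label match AND mean distance below the threshold."""
    return (a["label"] == b["label"]
            and great_circle_km(mean_point(a["geometry"]),
                                mean_point(b["geometry"])) < max_km)


# Illustrative records: two cities sharing a name on different continents.
paris_fr = {"label": "Paris", "geometry": [(48.8566, 2.3522)]}
paris_tx = {"label": "Paris", "geometry": [(33.6609, -95.5555)]}
paris_other = {"label": "Paris", "geometry": [(48.86, 2.35), (48.85, 2.36)]}

print(should_link(paris_fr, paris_tx))     # False
print(should_link(paris_fr, paris_other))  # True
```

The label-only specification of the first experiment would have linked all three records, which is the kind of error that accounted for the incorrect links in that sample.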

6. Related work

This paper is related to distance measures for point sets and link discovery. Several reviews on distance measures for point sets have been published. For example, [17] reviews some of the distance functions proposed in the literature and presents efficient algorithms for the computation of these measures. Also, [3] presents a parallel implementation of some distance functions between convex and non-convex (possibly intersecting) polygons.

Ramon et al. [39] introduce a metric computable in polynomial time for measuring the similarity between sets of points, while [45] presents an approach to compute the similarity between multiple polylines and a polygon using dynamic programming. Barequet et al. [6] show how to compute the respective nearest- and furthest-site Voronoi diagrams of point sites in the plane, while [7] provides near-optimal deterministic time algorithms to compute the corresponding nearest- and furthest-site Voronoi diagrams of point sites.

Hausdorff distances are commonly used in fields such as object modelling, computer vision and object tracking. [2] focuses on the Hausdorff distance and presents an approach for its efficient computation between convex polygons. While the approach is quasi-linear in the number of nodes of the polygons, it cannot deal with non-convex polygons as commonly found in geographic data. [46] presents a similar approach that allows approximating Hausdorff distances within a certain error bound, while [8] presents an exact approach. [36] proposes an approach to compute Hausdorff distances between trajectories using R-trees within an $L_{2}$ -space.

The Fréchet distance is mainly used for piecewise curve similarity detection, for example in handwriting recognition. For example, [1] introduces an algorithm for computing the Fréchet distance between two polygonal curves, while [11] presents a polynomial-time algorithm to compute the homotopic Fréchet distance between two given polygonal curves in the plane avoiding a given set of polygonal obstacles. [15] provides an approximation of the Fréchet distance for realistic curves in near linear time. Dealing with non-flat surfaces, [14] presents three different methods to adapt the original Fréchet distance to non-flat surfaces.

There are a number of techniques presented in the literature that, if applied in combination with the presented distance approaches, can achieve better performance. In order to limit the number of polygons to be compared in deduplication problems, [23] proposed a dissimilarity function for clustering geospatial polygons. The kinematics-based method proposed in [41] approximates large polygons using fewer points and thus requires less execution time for distance measurement. Another algorithm, presented in [38], models non-convex polygons as the union of a set of convex components and constructs a hierarchical bounding representation based on spheres. [21] shows an approach for the comparison of 3D models represented as triangular meshes. The approach is based on a subdivision sampling algorithm that makes use of octrees to approximate distances. Orchid [31] was designed especially for the Hausdorff distance but can be extended to deal with other measures.

The problem of time-efficient LD has been addressed by several frameworks such as Limes [30], Silk [48], KnoFuss [34] and Zhishi.links [35]. These frameworks incorporate declarative approaches towards LD. Both Silk and KnoFuss implement blocking techniques to identify links between knowledge bases in order to reduce the number of unnecessary comparisons between resources. Limes reduces the time-complexity of the LD process by combining techniques such as PPJoin+ [49] and refinement operators [43]. A review comprising further LD approaches can be found in [29]. Up until now, only Silk and Limes support point set, temporal and topological distances. Silk supports such distances by incorporating the work of Smeros et al. [44], while the Limes implementation of these distances is based on the work presented in [20,31,42].

7. Conclusion and future work

In this paper, we presented an evaluation of point set distance measures for link discovery on geo-spatial resources. We evaluated these distances on samples from three different data sets. Our results suggest that while different measures perform best on the data sets we used, the mean distance measure is the most time-efficient and overall best measure to use for link discovery. We also showed that all measures apart from the Fréchet distance can scale even on large data sets when combined with an approach such as Orchid. While working on this paper, we realized the need for a full-fledged benchmark for geo-spatial link discovery. In future work, we will devise such a benchmark and make it available to the community. All the measures presented in this paper were integrated in the LIMES framework available at http://limes.sf.net. In future work, we will extend this framework with dedicated versions of Orchid for the different measures presented herein. Moreover, we will aim to devise means to detect the best measure for any given geo-spatial data set.

Footnotes

Acknowledgements

This work has been supported by H2020 projects SLIPO (GA no. 731581) and HOBBIT (GA no. 688227) as well as the DFG project LinkingLOD (project no. NG 105/3-2), the Eurostars project SAGE (GA no. E!10882) and the BMWI Project GEISER (project no. 01MD16014E).

References

1. Alt and Godau, Computing the Fréchet distance between two polygonal curves, International Journal of Computational Geometry & Applications 5(01n02) (1995), 75–91. doi:10.1142/S0218195995000064.

2. M.J. Atallah, A linear time algorithm for the Hausdorff distance between convex polygons, Technical report 83-442, Purdue University, Department of Computer Science, 1983. http://docs.lib.purdue.edu/cstech/363/.

3. M.J. Atallah, C.C. Ribeiro and Lifschitz, Computing some distance functions between polygons, Pattern Recognition 24(8) (1991), 775–781. doi:10.1016/0031-3203(91)90045-7.

4. Auer, Lehmann and Hellmann, LinkedGeoData: Adding a spatial dimension to the web of data, in: The Semantic Web – ISWC 2009, 8th International Semantic Web Conference, ISWC 2009, Chantilly, VA, USA, October 25–29, 2009. Proceedings, Bernstein, D.R. Karger, Heath, Feigenbaum, Maynard, Motta and Thirunarayan, eds, Lecture Notes in Computer Science, Vol. 5823, Springer, 2009, pp. 731–746. doi:10.1007/978-3-642-04930-9_46.

5. Auer, Lehmann and Ngonga Ngomo, Introduction to linked data and its lifecycle on the web, in: Reasoning Web. Semantic Technologies for the Web of Data – 7th International Summer School 2011, Galway, Ireland, August 23–27, 2011, Tutorial Lectures, Polleres, d’Amato, Arenas, Handschuh, Kroner, Ossowski and P.F. Patel-Schneider, eds, Lecture Notes in Computer Science, Vol. 6848, Springer, 2011, pp. 1–75. doi:10.1007/978-3-642-23032-5_1.

6. Barequet, Dickerson and M.T. Goodrich, Voronoi diagrams for polygon-offset distance functions, in: Algorithms and Data Structures, 5th International Workshop, WADS ’97, Halifax, Nova Scotia, Canada, August 6–8, 1997, Proceedings, F.K.H.A. Dehne, Rau-Chaplin, Sack and Tamassia, eds, Lecture Notes in Computer Science, Vol. 1272, Springer, 1997, pp. 200–209. doi:10.1007/3-540-63307-3_60.

7. Barequet, Dickerson and M.T. Goodrich, Voronoi diagrams for convex polygon-offset distance functions, Discrete & Computational Geometry 25(2) (2001), 271–291. doi:10.1007/s004540010081.

8. Bartoň, Hanniel, Elber and Kim, Precise Hausdorff distance computation between polygonal meshes, Computer Aided Geometric Design 27(8) (2010), 580–591. doi:10.1016/j.cagd.2010.04.004.

9. B.K. Bhattacharya and G.T. Toussaint, Efficient algorithms for computing the maximum distance between two finite planar sets, Journal of Algorithms 4(2) (1983), 121–136. doi:10.1016/0196-6774(83)90040-8.

10. B.R. Bowring, The direct and inverse solutions for the great elliptic line on the reference ellipsoid, Bulletin Géodésique 58(1) (1984), 101–108. doi:10.1007/BF02521760.

11. E.W. Chambers, É.C. de Verdière, Erickson, Lazard, Lazarus and Thite, Homotopic Fréchet distance between curves or, walking your dog in the woods in polynomial time, Computational Geometry 43(3) (2010), 295–311. doi:10.1016/j.comgeo.2009.02.008.

12. Cheatham and Hitzler, String similarity metrics for ontology alignment, in: The Semantic Web – ISWC 2013 – 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21–25, 2013, Proceedings, Part II, Alani, Kagal, Fokoue, P.T. Groth, Biemann, J.X. Parreira, Aroyo, N.F. Noy, Welty and Janowicz, eds, Lecture Notes in Computer Science, Vol. 8219, Springer, 2013, pp. 294–309. doi:10.1007/978-3-642-41338-4_19.

13. Chrisman and J.-F. Girres, First, do no harm: Eliminating systematic error in analytical results of GIS applications, in: 8th International Symposium on Spatial Data Quality, Wu, Guilbert and Shi, eds, International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Vol. XL-2/W1, International Society for Photogrammetry and Remote Sensing, 2013, pp. 35–40. doi:10.5194/isprsarchives-XL-2-W1-35-2013.

14. A.F. Cook, Driemel, Har-Peled, Sherette and Wenk, Computing the Fréchet distance between folded polygons, in: Algorithms and Data Structures – 12th International Symposium, WADS 2011, New York, NY, USA, August 15–17, 2011. Proceedings, Dehne, Iacono and Sack, eds, Lecture Notes in Computer Science, Vol. 6844, Springer, 2011, pp. 267–278. doi:10.1007/978-3-642-22300-6_23.

15. Driemel, Har-Peled and Wenk, Approximating the Fréchet distance for realistic curves in near linear time, Discrete & Computational Geometry 48(1) (2012), 94–127. doi:10.1007/s00454-012-9402-z.

16. R.O. Duda, P.E. Hart and D.G. Stork, Pattern Classification, 2nd edn, Wiley, 2001. http://www.worldcat.org/oclc/41347061.

17. Eiter and Mannila, Distance measures for point sets and their computation, Acta Informatica 34(2) (1997), 109–133. doi:10.1007/s002360050075.

18. Ferrara, Montanelli, Noessner and Stuckenschmidt, Benchmarking matching applications on the semantic web, in: The Semanic Web: Research and Applications – 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29–June 2, 2011, Proceedings, Part II, Antoniou, Grobelnik, E.P.B. Simperl, Parsia, Plexousakis, De Leenheer and J.Z. Pan, eds, Lecture Notes in Computer Science, Vol. 6644, Springer, 2011, pp. 108–122. doi:10.1007/978-3-642-21064-8_8.

19. M.M. Fréchet, Sur quelques points du calcul fonctionnel, Rendiconti del Circolo Matematico di Palermo 22(1) (1906), 1–72. doi:10.1007/BF03018603.

20. Georgala, M.A. Sherif and Ngonga Ngomo, An efficient approach for the generation of Allen relations, in: ECAI 2016 – 22nd European Conference on Artificial Intelligence, 29 August–2 September 2016, The Hague, The Netherlands – Including Prestigious Applications of Artificial Intelligence (PAIS 2016), G.A. Kaminka, Fox, Bouquet, Hüllermeier, Dignum, Dignum and van Harmelen, eds, Frontiers in Artificial Intelligence and Applications, Vol. 285, IOS Press, 2016, pp. 948–956. doi:10.3233/978-1-61499-672-9-948.

21. Guthe, Borodin and Klein, Fast and accurate Hausdorff distance calculation between meshes, Journal of WSCG 13(2) (2005), 41–48. http://wscg.zcu.cz/wscg2005/Papers_2005/Journal/!WSCG2005_Journal_Final.pdf.

22. D.P. Huttenlocher, Kedem and J.M. Kleinberg, On dynamic Voronoi diagrams and the minimum Hausdorff distance for point sets under Euclidean motion in the plane, in: Proceedings of the Eighth Annual Symposium on Computational Geometry, Berlin, Germany, June 10–12, 1992, Avis, ed., ACM, 1992, pp. 110–119. doi:10.1145/142675.142700.

23. Joshi, Samal and Soh, A dissimilarity function for clustering geospatial polygons, in: 17th ACM SIGSPATIAL International Symposium on Advances in Geographic Information Systems, ACM-GIS 2009, November 4–6, 2009, Seattle, Washington, USA, Proceedings, Agrawal, W.G. Aref, Lu, M.F. Mokbel, Scheuermann, Shahabi and Wolfson, eds, ACM, 2009, pp. 384–387. doi:10.1145/1653771.1653825.

24. Kitchenham, Procedures for performing systematic reviews, Technical report, Joint Technical Report Keele University Technical Report TR/SE-0401 and NICTA Technical Report 0400011T.1, 2004.

25. W.A. Mackaness, Ruas and L.T. Sarjakoski (eds), Generalisation of Geographic Information: Cartographic Modelling and Applications, International Cartographic Association, Elsevier, 2011. https://www.elsevier.com/books/generalisation-of-geographic-information/mackaness/978-0-08-045374-3.

26. McKenna and G.T. Toussaint, Finding the minimum vertex distance between two disjoint convex polygons in linear time, Computers & Mathematics with Applications 11(12) (1985), 1227–1242. doi:10.1016/0898-1221(85)90109-9.

27. R.B. McMaster, Automated line generalization, Cartographica: The International Journal for Geographic Information and Geovisualization 24(2) (1987), 74–111. doi:10.3138/3535-7609-781G-4L20.

28. Moher, Liberati, Tetzlaff, D.G. Altman and The PRISMA Group, Preferred reporting items for systematic reviews and meta-analyses: The PRISMA statement, PLoS medicine 6(7) (2009). doi:10.1371/journal.pmed.1000097.

29. Nentwig, Hartung, Ngonga Ngomo and Rahm, A survey of current link discovery frameworks, Semantic Web 8(3) (2017), 419–436. doi:10.3233/SW-150210.

30. Ngonga Ngomo, On link discovery using a hybrid approach, Journal on Data Semantics 1(4) (2012), 203–217. doi:10.1007/s13740-012-0012-y.

31. Ngonga Ngomo, ORCHID – reduction-ratio-optimal computation of geo-spatial distances for link discovery, in: The Semantic Web – ISWC 2013 – 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21–25, 2013, Proceedings, Part I, Alani, Kagal, Fokoue, P.T. Groth, Biemann, J.X. Parreira, Aroyo, N.F. Noy, Welty and Janowicz, eds, Lecture Notes in Computer Science, Vol. 8218, Springer, 2013, pp. 395–410. doi:10.1007/978-3-642-41335-3_25.

32. B.G. Nickerson and Freeman, Development of a rule-based system for automatic map generalization, in: Proceedings, Second International Symposium on Spatial Data Handling: July 5–10, 1986, Seattle, Washington, U.S.A., International Geographical Union, Commission on Geographical Data Sensing and Processing, 1986, pp. 537–556.

33. Niiniluoto, Truthlikeness, Synthese Library, Springer, 1987. doi:10.1007/978-94-009-3739-0.

34. Nikolov, d’Aquin and Motta, Unsupervised learning of link discovery configuration, in: The Semantic Web: Research and Applications – 9th Extended Semantic Web Conference, ESWC 2012, Heraklion, Crete, Greece, May 27–31, 2012. Proceedings, Simperl, Cimiano, Polleres, Ó. Corcho and Presutti, eds, Lecture Notes in Computer Science, Vol. 7295, Springer, 2012, pp. 119–133. doi:10.1007/978-3-642-30284-8_15.

35. Niu, Rong, Zhang and Wang, Zhishi.links results for OAEI 2011, in: Proceedings of the 6th International Workshop on Ontology Matching, Bonn, Germany, October 24, 2011, Shvaiko, Euzenat, Heath, Quix, Mao and I.F. Cruz, eds, CEUR Workshop Proceedings, Vol. 814, CEUR-WS.org, 2011. http://ceur-ws.org/Vol-814/oaei11_paper16.pdf.

36. Nutanong, E.H. Jacox and Samet, An incremental Hausdorff distance calculation algorithm, Proceedings of the VLDB Endowment 4(8) (2011), 506–517. http://www.vldb.org/pvldb/vol4/p506-nutanong.pdf. doi:10.14778/2002974.2002978.

37. Oddie, Verisimilitude and distance in logical space, The logic and epistemology of scientific change, Acta Philosophica Fennica 30(2–4) (1978), 227–242.

38. Quinlan, Efficient distance computation between non-convex objects, in: Proceedings of the 1994 International Conference on Robotics and Automation, San Diego, CA, USA, May 1994, IEEE Computer Society, 1994, pp. 3324–3329. doi:10.1109/ROBOT.1994.351059.

39. Ramon and Bruynooghe, A polynomial time computable metric between point sets, Acta Informatica 37(10) (2001), 765–780. doi:10.1007/PL00013304.

40. Saleem, S.S. Padmanabhuni, Ngonga Ngomo, J.S. Almeida, Decker and H.F. Deus, Linked Cancer Genome Atlas database, in: i-SEMANTICS 2013 – 9th International Conference on Semantic Systems, ISEM’13, Graz, Austria, September 4–6, 2013, Sabou, Blomqvist, Di Noia, Sack and Pellegrini, eds, ACM, 2013, pp. 129–134. doi:10.1145/2506182.2506200.

41. Saykol, Gülesir, Güdükbay and Ö. Ulusoy, KiMPA: A kinematics-based method for polygon approximation, in: Advances in Information Systems, Second International Conference, ADVIS 2002, Izmir, Turkey, October 23–25, 2002, Proceedings, T.M. Yakhno, ed., Lecture Notes in Computer Science, Vol. 2457, Springer, 2002, pp. 186–194. doi:10.1007/3-540-36077-8_18.

42. M.A. Sherif, Dreßler, Smeros and Ngonga Ngomo, Radon – rapid discovery of topological relations, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4–9, 2017, San Francisco, California, USA, S.P. Singh and Markovitch, eds, AAAI Press, 2017, pp. 175–181. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14199.

43. M.A. Sherif, Ngonga Ngomo and Lehmann, Wombat – a generalization approach for automatic link discovery, in: The Semantic Web – 14th International Conference, ESWC 2017, Portorož, Slovenia, May 28–June 1, 2017, Proceedings, Part I, Blomqvist, Maynard, Gangemi, Hoekstra, Hitzler and Hartig, eds, Lecture Notes in Computer Science, Vol. 10249, 2017, pp. 103–119. doi:10.1007/978-3-319-58068-5_7.

44. Smeros and Koubarakis, Discovering spatial and temporal links among RDF data, in: Proceedings of the Workshop on Linked Data on the Web, LDOW 2016, Co-Located with 25th International World Wide Web Conference (WWW 2016), Auer, Berners-Lee, Bizer and Heath, eds, CEUR Workshop Proceedings, Vol. 1593, CEUR-WS.org, 2016. http://ceur-ws.org/Vol-1593/article-06.pdf.

45. Tanase, R.C. Veltkamp and H.J. Haverkort, Multiple polyline to polygon matching, in: Algorithms and Computation, 16th International Symposium, ISAAC 2005, Sanya, Hainan, China, December 19–21, 2005, Proceedings, Deng and Du, eds, Lecture Notes in Computer Science, Vol. 3827, Springer, 2005, pp. 60–70. doi:10.1007/11602613_8.

46. Tang, Lee and Y.J. Kim, Interactive Hausdorff distance computation for general polygonal models, ACM Transactions on Graphics 28(3) (2009), 74:1–74:9. doi:10.1145/1531326.1531380.

47. G.T. Toussaint and B.K. Bhattacharya, Optimal algorithms for computing the minimum distance between two finite planar sets, Pattern Recognition Letters 2(2) (1983), 79–82. doi:10.1016/0167-8655(83)90041-7.

48. Volz, Bizer, Gaedke and Kobilarov, Discovering and maintaining links on the web of data, in: The Semantic Web – ISWC 2009, 8th International Semantic Web Conference, ISWC 2009, Chantilly, VA, USA, October 25–29, 2009. Proceedings, Bernstein, D.R. Karger, Heath, Feigenbaum, Maynard, Motta and Thirunarayan, eds, Lecture Notes in Computer Science, Vol. 5823, Springer, 2009, pp. 650–665. doi:10.1007/978-3-642-04930-9_41.

49. Xiao, Wang, Lin and J.X. Yu, Efficient similarity joins for near duplicate detection, in: Proceedings of the 17th International Conference on World Wide Web, WWW 2008, Beijing, China, April 21–25, 2008, Huai, Chen, Hon, Liu, Ma, Tomkins and Zhang, eds, ACM, 2008, pp. 131–140. doi:10.1145/1367497.1367516.

A systematic survey of point set distance measures for link discovery

²
Most commonly encoded in the WKT format, see http://www.opengeospatial.org/standards/sfa.

⁴
http://scholar.google.com/

¹²
http://www.mendeley.com/

¹³
Adapted from [1].

¹⁴
Version 0.91 available at http://nuts.geovocab.org/data/ is used in this work.