Abstract
This paper presents an experiment by the Palestinian Central Bureau of Statistics (PCBS) [1] in linking administrative data records. We focus on matching data sources from ministries, municipalities and other partners against the PCBS Establishments Census 2012. Several matching algorithms and tools were used. We started with Fuzzy Lookup [2], an add-in for Excel developed by Microsoft Research that performs fuzzy matching of textual data in Microsoft Excel. The tool uses the Jaccard index, a statistical measure of similarity between sample sets, together with the Levenshtein distance, which counts the single-character edits needed to turn one string into another. To compare the data and find matching records, we also used Duke, see Lars [3], an existing, flexible deduplication (or entity resolution, or record linkage) engine written in Java. Using the Duke engine, we wrote our own matching algorithm and comparators to increase both the number of matches and the matching accuracy. We also wrote data-cleaning functions for the matching variables (Commercial Name, Owner Name and Telephone) to standardize each variable and improve the results. The matching algorithms used in the experiment include Hamming distance, e.g. Mohammad [4], Levenshtein distance, Mark [5], Jaccard similarity, e.g. Suphakit et al. [6], exact match and multiple match.
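The three distance and similarity measures named above can be illustrated with a minimal Python sketch. It is not the PCBS or Duke implementation (Duke is written in Java); the function names are hypothetical, and computing Jaccard similarity over whitespace-separated tokens is an assumption made for illustration.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `a` into `b` (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the intersection of the two token sets divided by the size
    of their union. Tokenizing on whitespace is an assumption."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)
```

Hamming distance is defined only for strings of equal length, which makes it better suited to fixed-width values such as telephone numbers, whereas Levenshtein distance and Jaccard similarity tolerate the differences in length and word order that are common in names.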
The results showed that the number of matches rose significantly after the identification variables were cleaned. Matching rates also improved when moving from matching on telephone numbers alone to matching on Telephone, Commercial Name and Owner Name.
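The effect of cleaning and of adding name variables can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the functions written for the experiment: the cleaning rules, the field names (`commercial_name`, `owner_name`, `telephone`), the 0.5 threshold and the sample records are all hypothetical.

```python
import re


def clean_phone(raw: str) -> str:
    """Keep digits only and drop leading zeros (an assumed convention)."""
    return re.sub(r"\D", "", raw).lstrip("0")


def clean_name(raw: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", raw.lower())
    return " ".join(text.split())


def token_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over whitespace-separated tokens."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a or not tokens_b:
        return 0.0  # a missing name gives no evidence of a match
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def match_on_phone(rec_a: dict, rec_b: dict) -> bool:
    """Match only when both cleaned telephone numbers exist and are equal."""
    phone_a = clean_phone(rec_a["telephone"])
    phone_b = clean_phone(rec_b["telephone"])
    return bool(phone_a) and phone_a == phone_b


def match_on_all(rec_a: dict, rec_b: dict, threshold: float = 0.5) -> bool:
    """Match on telephone, or on both names being similar enough.
    The OR rule and the threshold are assumptions for illustration."""
    if match_on_phone(rec_a, rec_b):
        return True
    commercial = token_jaccard(clean_name(rec_a["commercial_name"]),
                               clean_name(rec_b["commercial_name"]))
    owner = token_jaccard(clean_name(rec_a["owner_name"]),
                          clean_name(rec_b["owner_name"]))
    return commercial >= threshold and owner >= threshold


# Made-up records: the administrative record has no telephone number.
census = {"commercial_name": "Al-Quds Bakery", "owner_name": "Ahmad Saleh",
          "telephone": "02-295 1234"}
admin = {"commercial_name": "al quds  bakery.", "owner_name": "AHMAD SALEH",
         "telephone": ""}
```

In this made-up pair, `match_on_phone(census, admin)` fails because one telephone number is missing, while `match_on_all(census, admin)` succeeds once the names are standardized. This is the kind of case in which adding the name variables can raise the matching rate.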
