Abstract
Data deduplication is the process of discovering multiple representations of the same entity in an information system. Blocking is a standard technique for avoiding exhaustive pairwise record comparisons in data deduplication. Standard blocking (SB) aims to place potential duplicate records in the same block on the basis of a blocking key. Detailed comparisons are then made only among the records residing in the same block. Selecting a blocking key is a tedious process that involves an exponential number of alternatives, and the outcome of SB varies considerably when the blocking key changes. To address this, we propose a robust blocking technique called Locality Sensitive Blocking (LSB) that does not require the selection of a blocking key. The experimental results show an increase of up to 0.448 in F-score compared with SB. Furthermore, LSB is found to be more robust to changes in blocking parameters and to data noise.
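To make the dependence of SB on the blocking key concrete, the following is a minimal Python sketch of standard blocking only; it is not the LSB technique proposed in the article. The toy records and the two key functions (`surname_prefix_key`, `city_key`) are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from itertools import combinations


def standard_blocking(records, blocking_key):
    """Group record ids by the value of the blocking key."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Collect id pairs that share a block; only these are compared in detail."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical toy data: records 1, 2 and 4 describe the same person.
records = {
    1: {"name": "Jon Smith", "city": "Leeds"},
    2: {"name": "John Smith", "city": "Leeds"},
    3: {"name": "Jane Doe", "city": "York"},
    4: {"name": "Smith, John", "city": "Leeds"},
}


def surname_prefix_key(rec):
    # Assumed key: first three letters of the last name token.
    return rec["name"].split()[-1][:3].lower()


def city_key(rec):
    # Assumed key: the lower-cased city.
    return rec["city"].lower()


pairs_by_surname = candidate_pairs(standard_blocking(records, surname_prefix_key))
pairs_by_city = candidate_pairs(standard_blocking(records, city_key))
```

In this toy setup the surname-based key misses record 4 because its name is written in a different order, while the city-based key places all three duplicates in one block. This is the kind of key sensitivity the abstract attributes to SB.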
