A framework for distributed nearest neighbor classification using Hadoop

Abstract

Within data mining and machine learning, K-Nearest Neighbor is a classic algorithm that simply yet elegantly classifies data based on its similarity to other data. Although accuracy is expected to increase as more data are provided, large data sets are difficult to process serially. It is therefore preferable to perform these tasks in parallel or distributed mode. In this paper, we propose a framework for distributed nearest neighbor classification. A custom K-Nearest Neighbor algorithm was developed using Hadoop, an environment for developing and deploying applications in parallel on a cluster. The algorithm was implemented on a cluster and then tested for accuracy and execution time. The accuracy was observed to depend on the provided k-value and on the data set, as expected for the K-Nearest Neighbor process. The execution time was found to increase logarithmically as the file size, and thus the amount of data the algorithm must parse, increased exponentially.
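To make the distributed classification idea concrete, the following is a minimal sketch in plain Python, not the Hadoop implementation evaluated in this paper. It assumes a simple decomposition in which each data split reports its k closest training points to a query (a map-like step) and a single merge step keeps the global k closest and takes a majority vote (a reduce-like step). The function names, the Euclidean distance, and the toy data are illustrative assumptions.

```python
import heapq
import math
from collections import Counter


def map_partition(partition, query, k):
    """Map-like step: k nearest (distance, label) pairs within one data split."""
    scored = ((math.dist(features, query), label) for features, label in partition)
    return heapq.nsmallest(k, scored)


def reduce_candidates(candidate_lists, k):
    """Reduce-like step: merge per-split candidates, keep the global k nearest, vote."""
    merged = heapq.nsmallest(k, (pair for lst in candidate_lists for pair in lst))
    votes = Counter(label for _, label in merged)
    return votes.most_common(1)[0][0]


def classify(partitions, query, k):
    """Run the map-like step on every split, then the reduce-like step once."""
    candidates = [map_partition(p, query, k) for p in partitions]
    return reduce_candidates(candidates, k)


# Hypothetical toy data: two splits of (features, label) records.
partitions = [
    [((1.0, 1.0), "low"), ((1.2, 0.8), "low"), ((5.0, 5.0), "high")],
    [((0.9, 1.1), "low"), ((5.2, 4.9), "high"), ((4.8, 5.1), "high")],
]
prediction = classify(partitions, (1.1, 1.0), k=3)
```

Because each split forwards at most k candidates, the merge step handles a number of records proportional to the number of splits times k rather than to the full data set, which is one reason this decomposition suits a cluster.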

Keywords

Data mining, distributed data mining, classification, K-Nearest Neighbor, Hadoop

