
Abstract

Advances in both hardware and software have allowed big data to emerge recently; however, classifying such data is extremely slow, particularly with the K-nearest neighbors (KNN) classifier. In this article, we propose a new approach that builds a binary search tree (BST), which the KNN classifier later uses to speed up big data classification. The approach finds the furthest pair of points (the diameter) in a data set and uses this pair to sort the examples of the training data set into a BST. At each node of the BST, the local furthest pair is found, and the examples located at that node are further sorted by their distances to these two local furthest points. To classify a test example, the BST is searched down to a leaf, and the examples found in that leaf are used to classify the test example with the KNN classifier. Experimental results on some well-known machine learning data sets show the efficiency of the proposed method in terms of speed and accuracy compared with the state-of-the-art methods reviewed. With some optimization, the proposed method has great potential for big data classification and can be generalized to other applications, particularly when classification speed is the main concern.
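The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions: the furthest pair is only approximated with a simple two-sweep heuristic (the abstract does not say how the pair is computed), distances are Euclidean, splitting stops at a hypothetical `leaf_size` threshold, and the names `approx_furthest_pair`, `build`, and `classify` are invented for illustration.

```python
import math
from collections import Counter


def approx_furthest_pair(points):
    """Two-sweep heuristic (an assumption, not the paper's method):
    take the point furthest from an arbitrary start, then the point
    furthest from that one."""
    p1 = max(points, key=lambda p: math.dist(points[0], p))
    p2 = max(points, key=lambda p: math.dist(p1, p))
    return p1, p2


def build(examples, leaf_size=8):
    """Sort (vector, label) examples into a binary tree of nested dicts."""
    if len(examples) <= leaf_size:
        return {"leaf": examples}
    p1, p2 = approx_furthest_pair([x for x, _ in examples])
    # Each example goes to the side of the pivot it is closer to.
    left = [e for e in examples if math.dist(e[0], p1) <= math.dist(e[0], p2)]
    right = [e for e in examples if math.dist(e[0], p1) > math.dist(e[0], p2)]
    if not left or not right:
        # Degenerate case (e.g. all vectors identical): stop splitting.
        return {"leaf": examples}
    return {
        "p1": p1,
        "p2": p2,
        "left": build(left, leaf_size),
        "right": build(right, leaf_size),
    }


def classify(tree, x, k=3):
    """Descend to a leaf, then run a majority-vote KNN on that leaf only."""
    node = tree
    while "leaf" not in node:
        closer_to_p1 = math.dist(x, node["p1"]) <= math.dist(x, node["p2"])
        node = node["left"] if closer_to_p1 else node["right"]
    nearest = sorted(node["leaf"], key=lambda e: math.dist(x, e[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Toy data, invented for illustration only.
train = [
    ((0.0, 0.0), "a"), ((0.1, 0.2), "a"), ((0.2, 0.1), "a"),
    ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.9, 5.2), "b"),
]
tree = build(train, leaf_size=3)
print(classify(tree, (0.05, 0.1)))
print(classify(tree, (5.0, 5.1)))
```

On this toy data the first query lands in the leaf holding the "a" examples and the second in the leaf holding the "b" examples. The speed-up comes from computing KNN distances only within one leaf; the trade-off is that true nearest neighbors lying in a different leaf are missed, so the result is approximate.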



Furthest-Pair-Based Binary Search Tree for Speeding Big Data Classification Using K-Nearest Neighbors
