A case based method to predict optimal k value for k-NN algorithm

Abstract

The k-nearest-neighbor (k-NN) classifier is a widely used algorithm. In practice, k is usually chosen by cross-validation. We propose a new method for neighborhood size selection based on the profile of a data set, on the premise that the distribution of a data set and its intrinsic characteristics are the fundamental factors in the choice of k. A local complexity is computed for each example, and a complexity profile, intended to capture the inner structure of the data set, is constructed by sorting these local complexity values. A feature vector is then built by combining the local complexity profile with some statistical features of the data set. A historical meta-data set is constructed using the feature vector as attributes and the optimum k value of each data set, determined by ten-fold cross-validation, as the label. A prediction model is trained on the historical meta-data set and used to predict the optimum k value for a new data set. Extensive experiments are conducted to verify the proposed method. The results show that the local complexity features reflect the inner structure of a data set, which can help find the optimum k for k-NN across different domains.
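The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the local complexity, so the sketch assumes one plausible form (one minus the average, over k = 1..max_k, of the fraction of the k nearest neighbours that share the example's label). The statistical features in `meta_features`, the candidate values of k, and the 1-NN meta-model in `predict_k` are likewise hypothetical stand-ins, and all function names are invented for illustration. Each data set is assumed to contain at least two examples.

```python
import math
import random
from collections import Counter

def dist(a, b):
    """Euclidean distance between two numeric vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def neighbors(X, i, pool):
    """Indices in `pool` (excluding i), sorted by distance to X[i]."""
    return sorted((j for j in pool if j != i), key=lambda j: dist(X[i], X[j]))

def local_complexity(X, y, i, max_k=5):
    """Assumed definition: 1 - mean over k=1..max_k of the fraction of the
    k nearest neighbours that share example i's label."""
    order = neighbors(X, i, range(len(X)))[:max_k]
    same, total = 0, 0.0
    for k, j in enumerate(order, start=1):
        same += y[j] == y[i]
        total += same / k
    return 1.0 - total / len(order)

def complexity_profile(X, y, max_k=5, n_points=5):
    """Sort the local complexities and sample them at evenly spaced positions,
    so data sets of different sizes yield vectors of the same length."""
    values = sorted(local_complexity(X, y, i, max_k) for i in range(len(X)))
    last = len(values) - 1
    return [values[round(q * last / (n_points - 1))] for q in range(n_points)]

def meta_features(X, y):
    """Profile plus a few simple data set statistics (hypothetical choice)."""
    return complexity_profile(X, y) + [math.log(len(X)), len(X[0]), len(set(y))]

def knn_predict(X, y, train_idx, i, k):
    """Majority vote among the k training examples nearest to X[i]."""
    votes = Counter(y[j] for j in neighbors(X, i, train_idx)[:k])
    return votes.most_common(1)[0][0]

def best_k_by_cv(X, y, ks=(1, 3, 5, 7), folds=10, seed=0):
    """Meta-data label: the candidate k with the highest k-fold CV accuracy
    (ties go to the earliest candidate in `ks`)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    parts = [idx[f::folds] for f in range(folds)]

    def accuracy(k):
        hits = 0
        for part in parts:
            held = set(part)
            train = [j for j in idx if j not in held]
            hits += sum(knn_predict(X, y, train, i, k) == y[i] for i in part)
        return hits / len(X)

    return max(ks, key=accuracy)

def predict_k(meta_X, meta_k, new_features):
    """Stand-in meta-model: reuse the optimum k of the most similar past data set."""
    nearest = min(range(len(meta_X)), key=lambda j: dist(meta_X[j], new_features))
    return meta_k[nearest]
```

In use, one would call `meta_features` and `best_k_by_cv` on each historical data set to build the meta-data set, then pass the `meta_features` of a new data set to `predict_k`. A stronger learner could replace the 1-NN lookup; the abstract does not say which model the authors trained.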

Keywords

k-NN classifier, data sets, local complexity profile, optimum k

