Using the concept of instance typicality in instance-based learning environments involving nominal attributes

Abstract

Instance-Based Learning (IBL) is a machine learning research area focused on supervised algorithms that use the training set itself as the expression of the learned concept. Training instances are usually described by a vector of attribute values and an associated class. In an instance-based algorithm, generalization takes place during the classification phase, when a class must be assigned to a new instance of unknown class. The attributes that describe instances are usually either discrete or continuous, depending on the values they represent. Nominal attributes are a subtype of the discrete type: they usually represent categories, and their possible values have no order. This paper proposes and investigates an alternative strategy for dealing with nominal attributes during the classification phase of the well-known instance-based algorithm NN (Nearest Neighbor). The strategy is based on the concept of instance typicality, which can serve as a tiebreaker when the new instance to be classified is equidistant from more than one nearest neighbor. Experiments comparing the proposed strategy with the default random tie-breaking strategy of conventional NN show that a typicality-based strategy can be a convenient choice for improving accuracy when the instances are described, at least in part, by nominal attributes.

Keywords

Instance-based learning, nearest neighbor, nominal attributes, instance typicality
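
The tie-breaking idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the distance function or the typicality formula, so the sketch assumes the simple overlap distance for nominal attributes and defines typicality as the ratio of an instance's mean similarity to same-class instances over its mean similarity to other-class instances. All function names and the toy data are hypothetical.

```python
import random


def overlap_distance(a, b):
    """Assumed distance: number of nominal attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def typicality(idx, X, y):
    """Assumed typicality: mean intra-class similarity / mean inter-class similarity.

    Similarity is taken as 1 - (overlap distance / number of attributes).
    """
    n_attrs = len(X[idx])
    intra, inter = [], []
    for j, (xj, yj) in enumerate(zip(X, y)):
        if j == idx:
            continue
        sim = 1.0 - overlap_distance(X[idx], xj) / n_attrs
        (intra if yj == y[idx] else inter).append(sim)
    mean_intra = sum(intra) / len(intra) if intra else 0.0
    mean_inter = sum(inter) / len(inter) if inter else 0.0
    return mean_intra / mean_inter if mean_inter > 0 else float("inf")


def nn_classify(query, X, y, tie_break="typicality", rng=None):
    """1-NN classification; ties among equidistant neighbors are broken
    either at random (conventional NN) or by highest typicality."""
    dists = [overlap_distance(query, x) for x in X]
    best = min(dists)
    tied = [i for i, d in enumerate(dists) if d == best]
    if len(tied) == 1:
        return y[tied[0]]
    if tie_break == "random":
        return y[(rng or random).choice(tied)]
    return y[max(tied, key=lambda i: typicality(i, X, y))]


# Hypothetical toy training set with three nominal attributes.
X = [
    ("red", "round", "small"),
    ("red", "round", "large"),
    ("red", "oval", "small"),
    ("green", "long", "large"),
    ("green", "long", "small"),
    ("red", "long", "small"),
]
y = ["apple", "apple", "apple", "banana", "banana", "banana"]

# This query is at overlap distance 1 from three training instances
# (indices 1, 3 and 5), so the tiebreaker decides the class.
query = ("red", "long", "large")
print(nn_classify(query, X, y, tie_break="typicality"))
```

With nominal attributes the overlap distance takes only a few integer values, so ties between nearest neighbors are common; that is the situation in which the choice of tiebreaker matters. Under the assumptions above, the random strategy may return either class for this query, while the typicality-based strategy always selects the tied neighbor that is most representative of its own class.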

