Abstract
With the development of information technology, data sets often contain a very large number of observations. Symbolic data analysis deals with new units, which are concepts underlying the given database or found by clustering. In this way, the size of the data set to be processed can be reduced by transforming the initial classical variables into so-called symbolic variables. In symbolic data analysis, the values of the variables can be, among others, intervals. The algebraic structure of these variables requires us to adapt existing criteria in order to study them. In this paper, we propose an extension of the Kolmogorov-Smirnov binary splitting criterion to interval data. This criterion is used as a test selection metric for decision tree induction. It requires the values taken by the explanatory variables to be ordered, so we consider several possible orderings of these interval values. We present some results using pure assignment in order to examine the quality and the precision of this criterion, and we compare it with some classical criteria (Gini and entropy) in the case of pure assignment. An application to the case where the variable to be explained is a correlation is presented. We end the paper with a probabilistic method of assignment using the Kolmogorov-Smirnov criterion.
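As a rough illustration of the idea, the sketch below computes a Kolmogorov-Smirnov-style binary split on interval-valued observations for a two-class problem. It is a minimal sketch under stated assumptions, not the procedure of the paper: the intervals are ordered by their midpoint (only one of the possible orderings), and the names `Interval`, `midpoint` and `ks_best_split` are hypothetical. The split is placed where the absolute difference between the two class-wise empirical distribution functions is largest.

```python
from typing import Callable, Hashable, Optional, Sequence, Tuple

Interval = Tuple[float, float]  # (lower bound, upper bound)


def midpoint(iv: Interval) -> float:
    """Ordering key (an assumption): the centre of the interval."""
    return (iv[0] + iv[1]) / 2.0


def ks_best_split(
    intervals: Sequence[Interval],
    labels: Sequence[Hashable],
    key: Callable[[Interval], float] = midpoint,
) -> Tuple[float, Optional[Interval]]:
    """Return (largest KS distance, interval at which the cut is placed).

    Observations whose key is <= the key of the returned interval go to
    the left branch, the others to the right branch.
    """
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    a, b = classes
    n_a = sum(1 for lab in labels if lab == a)
    n_b = sum(1 for lab in labels if lab == b)

    # Order the observations by the chosen key on the intervals.
    order = sorted(range(len(intervals)), key=lambda i: key(intervals[i]))

    seen_a = seen_b = 0
    best: Tuple[float, Optional[Interval]] = (0.0, None)
    for pos, i in enumerate(order):
        if labels[i] == a:
            seen_a += 1
        else:
            seen_b += 1
        # Skip positions inside a run of tied keys: a cut is only
        # meaningful between two distinct key values.
        nxt = order[pos + 1] if pos + 1 < len(order) else None
        if nxt is not None and key(intervals[nxt]) == key(intervals[i]):
            continue
        # Distance between the two class-wise empirical CDFs at this cut.
        dist = abs(seen_a / n_a - seen_b / n_b)
        if dist > best[0]:
            best = (dist, intervals[i])
    return best


intervals = [(1, 2), (2, 3), (3, 5), (6, 8), (7, 9), (8, 10)]
labels = ["x", "x", "x", "y", "y", "y"]
print(ks_best_split(intervals, labels))
```

Passing a different `key`, for example `lambda iv: iv[0]` for the lower bound or `lambda iv: iv[1]` for the upper bound, changes the ordering of the intervals and may therefore change the selected cut.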
