Abstract
Classifier accuracy is critical and can be improved by enlarging the training data set. In experimental studies, however, surveying additional cases can be very costly, so keeping the sample size to a minimum is essential. Moreover, beyond a certain point very large data sets may contain little additional information, and further computational effort does not improve accuracy. Stopping at the optimal iteration uses the minimum number of observations, potentially saving both computational time and sampling costs.
For this reason, a sequential method of training classifiers can be useful. This paper proposes a sequential method that seeks to sample the minimum number of observations necessary to train a classifier that estimates the feasible minimum rate of misclassification, the Bayes error.
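As a rough illustration of the sequential idea (not the paper's SAS/IML implementation), the sketch below draws small batches of observations, retrains a classifier after each batch, and stops once the estimated misclassification rate plateaus. The batch size, stopping tolerance, synthetic Gaussian data, and the simple nearest-centroid classifier are all illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_batch(n):
    # Two overlapping Gaussian classes, so the Bayes error is nonzero.
    y = rng.integers(0, 2, n)
    x = rng.normal(loc=y * 1.5, scale=1.0, size=(n,))
    return x, y

# A fixed holdout set to estimate the misclassification rate.
x_test, y_test = make_batch(2000)

def error_rate(mu0, mu1):
    # Nearest-centroid rule: assign each point to the closer class mean.
    pred = (np.abs(x_test - mu1) < np.abs(x_test - mu0)).astype(int)
    return float(np.mean(pred != y_test))

xs = np.empty(0)
ys = np.empty(0, dtype=int)
prev_err = 1.0
for step in range(50):            # cap on the number of sequential steps
    xb, yb = make_batch(25)       # sample a small batch of new observations
    xs = np.concatenate([xs, xb])
    ys = np.concatenate([ys, yb])
    mu0, mu1 = xs[ys == 0].mean(), xs[ys == 1].mean()
    err = error_rate(mu0, mu1)
    if abs(prev_err - err) < 0.005:   # error has plateaued: stop sampling
        break
    prev_err = err

n_used = len(ys)
print(n_used, round(err, 3))
```

Under this kind of rule, sampling stops as soon as extra observations no longer change the estimated error, which is the behavior the abstract describes: the minimum number of observations is used while the error estimate approaches its feasible minimum.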
Using SAS/IML