Predicting wine types with different classification techniques

Abstract

In the modern world, wine has become part and parcel of life and culture. With improvements in production techniques, winemaking has become both a form of art and a branch of science. Italian wine is very popular because of its variation in taste, and the taste of a wine depends on the type of cultivar. This paper attempts to classify the cultivars on the basis of the chemical constituents recorded in the wine data. To accomplish this task, we used four classification techniques: linear discriminant analysis (LDA), multinomial logistic regression (MLR), random forest (RF) and support vector machine (SVM). We analyzed these techniques in the absence of outliers and in the presence of different rates of outliers. In both cases, bootstrapping was used because the data set is small. We used accuracy, sensitivity and specificity as the criteria for measuring the performance of the classification techniques. In the absence of outliers, LDA gives the maximum classification accuracy, sensitivity and specificity. As the percentage of outliers increases, the performance of RF tends to become better than that of LDA. In general, we suggest LDA when data of this type are obtained in the absence of outliers, and RF in the presence of outliers.
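The three evaluation criteria named above can be illustrated with a small sketch. It assumes a one-vs-rest, macro-averaged definition of sensitivity and specificity for a three-class problem and a simple bootstrap over labelled predictions; the abstract does not state the exact definitions or resampling scheme used, so these are assumptions. The function names and the toy labels are hypothetical and are not taken from the paper or its data.

```python
import random

def confusion_matrix(y_true, y_pred, labels):
    """Build a count matrix: rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

def one_vs_rest_metrics(matrix):
    """Overall accuracy plus macro-averaged sensitivity and specificity.

    Assumption: each class is treated in turn as "positive" against the rest,
    and the per-class values are averaged with equal weight.
    """
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(k))
    sens, spec = [], []
    for i in range(k):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(matrix[r][i] for r in range(k)) - tp
        tn = total - tp - fn - fp
        # Guard against classes that are absent from a bootstrap resample.
        sens.append(tp / (tp + fn) if tp + fn else 0.0)
        spec.append(tn / (tn + fp) if tn + fp else 0.0)
    return {
        "accuracy": correct / total,
        "sensitivity": sum(sens) / k,
        "specificity": sum(spec) / k,
    }

def bootstrap_metrics(y_true, y_pred, labels, n_resamples=200, seed=0):
    """Average the metrics over resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    sums = {"accuracy": 0.0, "sensitivity": 0.0, "specificity": 0.0}
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        matrix = confusion_matrix(
            [y_true[i] for i in idx], [y_pred[i] for i in idx], labels
        )
        for name, value in one_vs_rest_metrics(matrix).items():
            sums[name] += value
    return {name: value / n_resamples for name, value in sums.items()}

# Hypothetical toy labels for three cultivars (not the wine data).
labels = [1, 2, 3]
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_pred = [1, 1, 2, 2, 2, 2, 3, 3, 1]

point_estimate = one_vs_rest_metrics(confusion_matrix(y_true, y_pred, labels))
bootstrap_estimate = bootstrap_metrics(y_true, y_pred, labels)
```

In this sketch the bootstrap only resamples existing predictions to stabilise the metric estimates. A fuller comparison would refit each classifier on every resample, which is one common way to handle a small data set.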

Keywords

Bootstrapping, classification techniques, outlier, wine
