Abstract
Variable selection is crucial for improving both interpretability and forecasting accuracy. To this end, it is important to choose a dimension reduction technique suited to the specific characteristics of the data being processed. In this paper, the problem of variable selection for linear and nonlinear regression is investigated in depth, and the curse of dimensionality is also addressed. An extensive comparative study is performed between Support Vector Regression (SVR) and Random Forests (RF), first for variable importance assessment and then for variable selection. The contribution of this work is twofold: it offers experimental insights into the efficiency of variable ranking and selection based on SVR and on RF, and it provides a benchmark study that helps researchers choose the appropriate method for their data. Experiments were carried out on simulated and real-world datasets. The results show that the SVR score ∂Gα is recommended for variable ranking in linear situations, whereas the RF score is preferable in nonlinear cases. Moreover, we found that RF models are more efficient for selecting variables, especially when combined with an external score of importance.
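To make the comparison concrete, the sketch below ranks variables with a Random Forest importance score and with a model-agnostic substitute for an SVR-based score. This is a minimal illustration using scikit-learn, not the paper's exact protocol: the ∂Gα score is replaced here by permutation importance, since SVR exposes no built-in importance measure, and the dataset, sample sizes, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch: ranking variables with Random Forest impurity importance
# vs. an SVR-based (permutation) importance score. Not the paper's protocol.
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

# Friedman #1 benchmark: variables 0-4 are informative, 5-9 are pure noise.
X, y = make_friedman1(n_samples=300, n_features=10, random_state=0)

# Random Forest score: mean decrease in impurity, built into the model.
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
rf_rank = np.argsort(rf.feature_importances_)[::-1]

# SVR score: permutation importance, a model-agnostic stand-in for the
# gradient-based ∂Gα score studied in the paper.
svr = SVR(kernel="rbf", C=10.0).fit(X, y)
svr_imp = permutation_importance(svr, X, y, n_repeats=10, random_state=0)
svr_rank = np.argsort(svr_imp.importances_mean)[::-1]

print("RF ranking :", rf_rank)
print("SVR ranking:", svr_rank)
```

In a nonlinear setting such as this one, the RF ranking typically separates the informative variables from the noise variables cleanly, which is consistent with the abstract's recommendation of the RF score for nonlinear cases.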
