Facing the full model selection problem in high volume datasets employing intelligent proxy models

Abstract

Full model selection is a technique for improving the accuracy of machine learning algorithms by searching, for each dataset, for the most appropriate combination of feature selection, data preparation, a learning algorithm, and its hyper-parameter settings. This paradigm has been widely studied on datasets of moderate size but poorly explored on high-volume datasets. One of the main reasons is the large search space and the high number of fitness evaluations of candidate models that the search requires. To overcome this obstacle, the use of proxy models, or surrogate functions, has been proposed in the literature. In this work, we propose using the full model selection paradigm itself to construct proxy models. These proxy models were employed to assist the search for models in high-volume datasets, both to reduce the number of fitness evaluations and to guide the search. The results show performance without significant differences from the complete search algorithm while using only one third of the expensive fitness evaluations.
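The general idea of proxy-assisted search can be illustrated with a minimal sketch. It is not the method of this work: the proxy here is a simple 1-nearest-neighbour predictor rather than a proxy built through full model selection, and `expensive_fitness`, `proxy_predict`, `surrogate_assisted_search`, and all parameter values are hypothetical names and settings chosen for illustration. The sketch only shows how a cheap proxy can pre-screen candidates so that roughly one in three reaches the expensive evaluation.

```python
import random


def expensive_fitness(x):
    # Hypothetical stand-in for training and validating a candidate model
    # on a large dataset; the optimum is at (0.3, 0.7) with fitness 0.
    return -((x[0] - 0.3) ** 2) - ((x[1] - 0.7) ** 2)


def proxy_predict(x, archive):
    # Assumed proxy: 1-nearest-neighbour lookup over already evaluated
    # candidates. It returns the true fitness of the closest archived point.
    nearest = min(
        archive,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x)),
    )
    return nearest[1]


def surrogate_assisted_search(generations=20, pool_size=9, screen_ratio=3, seed=0):
    rng = random.Random(seed)

    # Seed the archive with a few expensive evaluations.
    archive = []
    for _ in range(3):
        x = (rng.random(), rng.random())
        archive.append((x, expensive_fitness(x)))
    expensive_calls = len(archive)
    proposed = 0

    for _ in range(generations):
        best_x, _ = max(archive, key=lambda item: item[1])

        # Propose candidates by perturbing the best point found so far.
        pool = [
            tuple(min(1.0, max(0.0, v + rng.gauss(0.0, 0.1))) for v in best_x)
            for _ in range(pool_size)
        ]
        proposed += len(pool)

        # Rank the pool with the cheap proxy; only the top fraction
        # (one in `screen_ratio`) gets the expensive evaluation.
        pool.sort(key=lambda c: proxy_predict(c, archive), reverse=True)
        for candidate in pool[: max(1, pool_size // screen_ratio)]:
            archive.append((candidate, expensive_fitness(candidate)))
            expensive_calls += 1

    best = max(archive, key=lambda item: item[1])
    return best, expensive_calls, proposed


if __name__ == "__main__":
    (best_x, best_f), calls, proposed = surrogate_assisted_search()
    print("best candidate:", best_x, "fitness:", best_f)
    print("expensive evaluations:", calls, "of", proposed, "proposed candidates")
```

With the default settings, 180 candidates are proposed and 60 of them are evaluated expensively after the three seed evaluations. In a real setting the proxy's quality determines how much accuracy is lost by skipping the remaining candidates.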

Keywords

Proxy models, model selection, big datasets
