Abstract
Auto-PLD is an algorithmic framework for the automated design of machine learning pipelines in whole-process data analysis scenarios. First, a machine learning pipeline with five stages is defined, which supports processing continuous and discrete features separately. Then, the automated pipeline design problem is decomposed into two subproblems, structure search and hyperparameter optimization, and an algorithm combining reinforcement learning and Bayesian optimization is proposed to optimize the two subproblems alternately. Finally, to improve the efficiency of automated pipeline design, two parallelized pipeline construction methods are proposed. Experimental results show that Auto-PLD outperforms auto-sklearn on most datasets. Moreover, as the number of computing nodes increases, the parallelized Auto-PLD further improves pipeline construction performance. The reported results at 1, 4, and 8 hours, respectively, are as follows. Auto-PLD-random: 19, 22, and 23; 27, 26, and 25. Automated Pipeline Design with Q-learning (Auto-PLD-Q): 18, 20, and 26; 23, 27, and 27. Auto-PLD-DeepC: 21, 23, and 27; 22, 26, and 26. Auto-PLD-PG: 19, 21, and 27; 26, 26, and 25. Auto-LLE is an automated machine learning framework for lifelong learning scenarios. For classification tasks subject to concept drift and data imbalance, an algorithm based on weighted ensemble learning with adaptive models is proposed. Concepts are divided into “long-term concepts” and “short-term concepts,” and the two types are handled separately using incremental learners and an adaptive weight update method, so that concept drift is captured automatically and model prediction performance improves. Based on Auto-PLD and Auto-LLE, we design and implement a system that supports both automated pipeline design and automated lifelong learning.
In terms of system design, an easy-to-use high-level programming interface and pluggable module integration make the system accessible and scalable. In terms of task types, common data analysis tasks such as classification, regression, and clustering are supported.
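The alternating scheme described for Auto-PLD, in which an outer structure search and an inner hyperparameter optimization take turns, can be illustrated with a toy sketch. This is not the paper's algorithm: an epsilon-greedy bandit stands in for the reinforcement learning component, plain random search stands in for Bayesian optimization, and the names `BASE`, `OPT`, `evaluate`, `tune`, and `alternate_search`, as well as the three "structures" and their scores, are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy search space: each key stands in for one pipeline structure.
# BASE is the best score a structure can reach; OPT is the hyperparameter
# value at which it reaches that score. Both are made up for illustration.
BASE = {"A": 0.6, "B": 0.8, "C": 0.7}
OPT = {"A": 0.2, "B": 0.5, "C": 0.9}


def evaluate(structure, hp):
    """Toy stand-in for the validation score of a configured pipeline."""
    return BASE[structure] - (hp - OPT[structure]) ** 2


def tune(structure, rng, budget=20):
    """Inner step: random search over one hyperparameter in [0, 1].

    Random search is a stand-in here for Bayesian optimization.
    """
    candidates = [rng.random() for _ in range(budget)]
    best_hp = max(candidates, key=lambda hp: evaluate(structure, hp))
    return best_hp, evaluate(structure, best_hp)


def alternate_search(rounds=30, epsilon=0.2, seed=0):
    """Outer step: epsilon-greedy choice of structure, then inner tuning.

    The bandit is a stand-in here for the reinforcement learning agent.
    """
    rng = random.Random(seed)
    names = list(BASE)
    value = {s: 0.0 for s in names}  # running mean reward per structure
    count = {s: 0 for s in names}
    best = (None, None, float("-inf"))
    for t in range(rounds):
        if t < len(names):
            s = names[t]                   # try every structure once
        elif rng.random() < epsilon:
            s = rng.choice(names)          # explore
        else:
            s = max(names, key=value.get)  # exploit the best estimate
        hp, reward = tune(s, rng)
        count[s] += 1
        value[s] += (reward - value[s]) / count[s]
        if reward > best[2]:
            best = (s, hp, reward)
    return best
```

Calling `alternate_search()` returns a `(structure, hyperparameter, score)` triple. The point of the sketch is the control flow: the reward fed back to the structure search is the score obtained after tuning, so each subproblem informs the other.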
