Parallel and fault-tolerant k-means clustering based on the actor model

Abstract

The k-means algorithm is a well-known unsupervised machine learning tool that splits a given dataset into a fixed number of clusters via an iterative refinement approach. Running such an algorithm on today's datasets, which are characterized by their high dimensionality and huge size, requires fault-tolerance mechanisms to mitigate the impact of possible failures. In this paper, we propose an actor-based implementation of the k-means algorithm. The algorithm was made fault-tolerant by periodically saving the centroids to stable storage during failure-free execution and restarting from the last saved centroids upon a failure. This was implemented in two different ways: optimistic checkpointing (blocking) and pessimistic checkpointing (non-blocking). The actor-based k-means algorithm was evaluated on a machine with eight cores. The experiments showed that the proposed algorithm scales very well as the number of workers increases, and can be up to about 2x faster than a Java-thread-based implementation of the k-means algorithm. The results also showed that the optimistic algorithm outperformed the pessimistic one, particularly in the presence of competing I/O operations. Several failures were forced to occur during execution to evaluate the performance of the fault-tolerant implementations. The experiments showed that the average amount of lost work ranged from 3% to 6%.
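The checkpoint-and-restart idea described in the abstract can be illustrated with a minimal, single-process Python sketch. It is not the actor-based implementation evaluated in the paper: the function names (`save_checkpoint`, `load_checkpoint`, `kmeans`), the JSON file format, and the `checkpoint_every` parameter are all assumptions made for illustration, and the checkpoint write here is a plain synchronous call.

```python
import json
import os
import random
import tempfile


def save_checkpoint(path, iteration, centroids):
    """Write the centroids to a temp file, then atomically rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump({"iteration": iteration, "centroids": centroids}, f)
        f.flush()
        os.fsync(f.fileno())  # push the data to stable storage before renaming
    os.replace(tmp, path)     # a crash leaves either the old or the new file


def load_checkpoint(path):
    """Return (iteration, centroids) from the last checkpoint, or None."""
    if not os.path.exists(path):
        return None
    with open(path) as f:
        state = json.load(f)
    return state["iteration"], state["centroids"]


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def kmeans_step(points, centroids):
    """One refinement step: assign points, then recompute each centroid."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in points:
        j = nearest(p, centroids)
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    # An empty cluster keeps its previous centroid.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centroids[j])
        for j in range(k)
    ]


def kmeans(points, k, path, max_iter=100, checkpoint_every=5, seed=0):
    """Run k-means, resuming from the checkpoint at `path` if one exists."""
    restored = load_checkpoint(path)
    if restored:
        start, centroids = restored
    else:
        start = 0
        centroids = [list(p) for p in random.Random(seed).sample(points, k)]
    for it in range(start, max_iter):
        new = kmeans_step(points, centroids)
        if new == centroids:  # converged
            break
        centroids = new
        if (it + 1) % checkpoint_every == 0:
            save_checkpoint(path, it + 1, centroids)
    return centroids
```

In this sketch, a restart after a failure repeats at most `checkpoint_every - 1` iterations, so the checkpoint interval trades I/O overhead during failure-free execution against the amount of work lost on a failure.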

Keywords

Parallel k-means, actor model, checkpointing

