A generic approach to scheduling and checkpointing workflows

Abstract

This work studies scheduling and checkpointing strategies for executing scientific workflows on failure-prone large-scale platforms. To the best of our knowledge, it is the first to target fail-stop errors for arbitrary workflows. Most previous work addresses soft errors, which corrupt the task being executed by a processor but, unlike fail-stop errors, do not cause the entire memory of that processor to be lost. We revisit classical mapping heuristics such as Heterogeneous Earliest Finish Time and MinMin and complement them with several checkpointing strategies. The objective is to derive an efficient trade-off between checkpointing every task (CkptAll), which is overkill when failures are rare, and checkpointing no task (CkptNone), which induces dramatic re-execution overhead even when only a few failures strike during execution. Contrary to previous work, our approach applies to arbitrary workflows, not just special classes of dependence graphs such as minimal series-parallel graphs. Extensive experiments report significant gains over both CkptAll and CkptNone for a wide variety of workflows.
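The trade-off between CkptAll and CkptNone can be illustrated on a deliberately simplified case. The sketch below is not the model or the heuristics of this work: it assumes a linear chain of identical tasks on a single processor, exponentially distributed fail-stop failures, and the classical closed form for the expected time to complete a segment of work followed by a checkpoint. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import math


def expected_segment_time(work, ckpt, recovery, downtime, rate):
    """Expected time to complete `work` followed by a checkpoint of cost `ckpt`.

    Assumptions (illustrative, not taken from the article): failures are
    exponentially distributed with parameter `rate`, each failure costs a
    `downtime` and then a `recovery` before the segment restarts, and
    failures may strike during recovery but not during downtime.
    """
    return (math.exp(rate * recovery)
            * (1.0 / rate + downtime)
            * math.expm1(rate * (work + ckpt)))


def ckpt_all(n_tasks, work, ckpt, recovery, downtime, rate):
    """CkptAll on a chain: every task is its own checkpointed segment."""
    return n_tasks * expected_segment_time(work, ckpt, recovery, downtime, rate)


def ckpt_none(n_tasks, work, downtime, rate):
    """CkptNone on a chain: one segment, any failure restarts from the first task."""
    return expected_segment_time(n_tasks * work, 0.0, 0.0, downtime, rate)


if __name__ == "__main__":
    # Hypothetical chain: 100 tasks of 100 s, 10 s checkpoint and recovery, 60 s downtime.
    for rate in (1e-6, 1e-3):
        all_time = ckpt_all(100, 100.0, 10.0, 10.0, 60.0, rate)
        none_time = ckpt_none(100, 100.0, 60.0, rate)
        print(f"rate={rate:g}  CkptAll={all_time:.1f}  CkptNone={none_time:.1f}")
```

Under these assumptions, CkptNone is cheaper at the lower failure rate because the checkpoints are almost pure overhead, while CkptAll is far cheaper at the higher rate because each failure loses at most one task. Intermediate strategies aim to sit between these two extremes.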

Keywords

Workflow, checkpoint, fail-stop error, resilience

