Fault-Tolerant Scheduling of Fine-Grained Tasks in Grid Environments

Abstract

Divide-and-conquer is a well-suited programming paradigm for parallel Grid applications. Our Satin system efficiently schedules the fine-grained tasks of a divide-and-conquer application across multiple clusters in a grid. To accommodate long-running applications, we present a fault-tolerance mechanism for Satin that has negligible overhead during normal execution, while minimizing the amount of redundant work done after a crash of one or more nodes. We study the impact of our fault-tolerance mechanism on application efficiency, both on the Dutch DAS-2 system and using the European testbed of the EC-funded project GridLab.

Keywords

fault-tolerance, divide-and-conquer, grid computing, Java
