Energy and performance characteristics of different parallel implementations of scientific applications on multicore systems

Abstract

Energy consumption is a major concern with high-performance multicore systems. In this paper, we explore the energy consumption and performance (execution time) characteristics of different parallel implementations of scientific applications. In particular, the experiments focus on message-passing interface (MPI)-only versus hybrid MPI/OpenMP implementations of the NAS (NASA Advanced Supercomputing) BT (Block Tridiagonal) benchmark (strong scaling), a Lattice Boltzmann application (strong scaling), and the Gyrokinetic Toroidal Code (GTC; weak scaling), as well as on central processing unit (CPU) frequency scaling. Experiments were conducted on a system instrumented to obtain power information; this system consists of eight nodes with four cores per node. The results indicate that, for 16 or fewer cores, whether the MPI-only or the hybrid implementation performs best depends on the application. For 32 cores, the results were consistent: the hybrid implementation resulted in lower execution time and energy consumption. With CPU frequency scaling, the best case for energy saving was not the best case for execution time.

Keywords

energy consumption, frequency scaling, hybrid MPI/OpenMP, MPI, multicore system, performance characteristics, scientific applications

