
Abstract

In heterogeneous systems, the number of GPUs per chip keeps increasing to provide the computational capability needed to solve scientific problems at a nanoscopic scale. However, the low utilization of individual GPUs undermines the case for investing more money in expensive accelerators. Although related work develops optimizations to improve application performance, none of it studies how these optimizations affect hardware resource usage or average GPU utilization. This paper takes a data-driven approach to addressing this gap by (1) characterizing how hardware resource usage affects device utilization, execution time, or both; (2) presenting a multi-objective metric that identifies important application-device interactions that can be optimized to improve device utilization and application performance jointly; (3) studying the hardware resource usage behavior of several optimizations for a benchmark application; and (4) identifying optimization opportunities for several scientific proxy applications based on their hardware resource usage behavior. We further demonstrate the applicability of our methodology by applying the identified optimizations to a proxy application, which improves execution time, device utilization, and power consumption by up to 29.6%, 5.3%, and 26.5%, respectively.

Keywords

performance characterization; performance optimization; multi-objective performance optimization metric; machine learning; hardware resource usage



Data-driven analysis to understand GPU hardware resource usage of optimizations
