
Abstract

This article is devoted to graphics processing unit (GPU) kernel optimization and performance analysis of three tensor-product operations arising in finite element methods. We provide a mathematical background to these operations and implementation details. Achieving close to peak performance for these operators requires extensive optimization because of the operators’ properties: low arithmetic intensity, tiered structure, and the need to store intermediate results during the kernel execution. We give a guided overview of optimization strategies and we present a performance model that allows us to compare the efficacy of these optimizations against an empirically calibrated roofline.
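The roofline comparison mentioned above can be illustrated with a minimal sketch. The function names, the peak and bandwidth figures, and the example kernel below are all hypothetical placeholders chosen for illustration; they are not the article's calibrated values or its performance model. The sketch only shows why an operation with low arithmetic intensity ends up limited by memory bandwidth rather than by peak compute throughput.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of global-memory traffic."""
    return flops / bytes_moved


def roofline_bound(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Attainable FLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_flops, intensity * bandwidth)


# Illustrative placeholder numbers, NOT measurements from the article.
PEAK_FLOPS = 4.0e12   # hypothetical double-precision peak, FLOP/s
BANDWIDTH = 5.0e11    # hypothetical sustained memory bandwidth, bytes/s

# Hypothetical kernel: 2 FLOPs for every 8-byte double moved.
ai = arithmetic_intensity(flops=2.0, bytes_moved=8.0)
bound = roofline_bound(ai, PEAK_FLOPS, BANDWIDTH)

# The kernel is memory bound when its attainable rate sits below the compute roof.
memory_bound = bound < PEAK_FLOPS
```

Under these assumed numbers the memory roof, not the compute roof, sets the limit. An empirically calibrated roofline would replace the two constants with rates measured on the target device.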

Keywords

Finite element method, elliptic problem, hexahedral elements, matrix–vector product, GPU, tensor operations, NVIDIA Tesla P100



Acceleration of tensor-product operations for high-order finite element methods
