Mixed precision LU factorization on GPU tensor cores: reducing data movement and memory footprint

Abstract

Modern GPUs equipped with mixed precision tensor core units have great potential to accelerate dense linear algebra operations such as LU factorization. However, state-of-the-art mixed half/single precision LU factorization algorithms all require the matrix to be stored in single precision, which makes data movement and storage expensive. The reason is that simply switching the storage precision from single to half causes a significant loss of accuracy, forfeiting the accuracy benefits of tensor core technology. In this article, we propose a new factorization algorithm that stores the matrix in half precision without incurring any significant loss of accuracy. Our approach is based on a left-looking scheme that employs single precision buffers of controlled size, together with a mixed precision doubly partitioned algorithm that exploits tensor cores in the panel factorizations. Our numerical results show that, compared with the state of the art, the proposed approach achieves similar accuracy with only half the data movement and memory footprint, and is hence potentially much faster: it achieves up to 2× and 3.5× speedups on V100 and A100 GPUs, respectively.
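To make the storage idea concrete, the following is a minimal NumPy sketch of a blocked left-looking LU factorization that keeps the matrix in half precision and uses a single precision buffer for one panel at a time. It is an illustration under stated assumptions, not the algorithm of the article: it omits pivoting (so it is only meaningful for matrices such as diagonally dominant ones), it emulates tensor-core-style products on the CPU by rounding inputs to half precision and multiplying in single precision, and it replaces the doubly partitioned panel factorization with a plain unblocked loop. The names `lu_left_looking_fp16`, `relative_residual`, and the block size `b` are hypothetical.

```python
import numpy as np


def lu_left_looking_fp16(A, b=16):
    """Blocked left-looking LU without pivoting (illustrative sketch).

    The factors overwrite a float16 copy of A; only one n-by-b panel
    is ever held in float32.
    """
    n = A.shape[0]
    LU = A.astype(np.float16)  # half precision storage of the whole matrix
    for j0 in range(0, n, b):
        j1 = min(j0 + b, n)
        # Single precision buffer for the current panel only.
        P = LU[:, j0:j1].astype(np.float32)
        # Left-looking update: apply every previously factored block column.
        for i0 in range(0, j0, b):
            i1 = i0 + b
            Lii = np.tril(LU[i0:i1, i0:i1].astype(np.float32), -1)
            Lii += np.eye(b, dtype=np.float32)  # unit lower triangular block
            P[i0:i1] = np.linalg.solve(Lii, P[i0:i1])  # block of U
            # Emulated mixed precision product: half precision inputs,
            # single precision arithmetic (an assumption of this sketch).
            U16 = P[i0:i1].astype(np.float16).astype(np.float32)
            P[i1:] -= LU[i1:, i0:i1].astype(np.float32) @ U16
        # Unblocked factorization of the panel in single precision.
        for c in range(j0, j1):
            lc = c - j0
            P[c + 1:, lc] /= P[c, lc]
            P[c + 1:, lc + 1:] -= np.outer(P[c + 1:, lc], P[c, lc + 1:])
        # Each entry is rounded to half precision once, when written back.
        LU[:, j0:j1] = P.astype(np.float16)
    return LU


def relative_residual(A, LU):
    """Frobenius-norm residual ||A - L U|| / ||A||, evaluated in float64."""
    F = LU.astype(np.float64)
    L = np.tril(F, -1) + np.eye(A.shape[0])
    U = np.triu(F)
    return np.linalg.norm(A - L @ U) / np.linalg.norm(A)
```

In this sketch the updates to a panel accumulate in the single precision buffer and each entry is rounded to half precision only once, when the finished panel is written back. A right-looking variant with half precision storage would instead round the trailing matrix after every block step. Calling `relative_residual(A, lu_left_looking_fp16(A))` on a diagonally dominant test matrix gives a quick way to inspect the accuracy of the stored factors.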

Keywords

Numerical linear algebra, mixed precision algorithms, high-performance computing, LU factorization, tensor cores, NVIDIA GPU, rounding error analysis
