Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit

Abstract

This study was aimed at simultaneously achieving sufficient accuracy and high performance for general matrix multiplications. Recent architectures, such as NVIDIA GPUs, feature high-performance units designed for low-precision matrix multiplications in machine learning models, and next-generation architectures are expected to follow the same design principle. The key to achieving superior performance is to fully leverage such architectures. The Ozaki scheme, a highly accurate matrix multiplication algorithm using error-free transformations, enables higher-precision matrix multiplication to be performed through multiple lower-precision matrix multiplications and higher-precision matrix additions. Ootomo et al. implemented the Ozaki scheme on high-performance matrix multiplication units with the aim of achieving both sufficient accuracy and high performance. This paper proposes alternative approaches to improving performance by reducing the numbers of lower-precision matrix multiplications and higher-precision matrix additions. Numerical experiments demonstrate the accuracy of the results and conduct performance benchmarks of the proposed approaches. These approaches are expected to yield more efficient results in next-generation architectures.

Keywords

Matrix multiplication Fixed-point arithmetic Floating-point arithmetic Tensor Cores Error-free transformation

1. Introduction

The field of machine learning, including AI, is evolving daily, and the scale and complexity of machine learning models are continually increasing. Recent architectures are designed to process these models rapidly and with high energy efficiency. For machine learning models, matrix multiplications using low-precision floating-point systems and integers are essential. Therefore, recent architectures are equipped with high-performance low-precision floating-point and integer matrix multiplication units, and high-performance mixed-precision matrix multiplication units that leverage these capabilities. A prime example of this is the NVIDIA Tensor Cores (see NVIDIA Corporation (2024)). Table 1 shows the specifications of the NVIDIA GPUs equipped with Tensor Core technology. Note that the specification of FP16 TC on RTX 4090 (165 TFLOPS) is for FP16 input and FP32 output, the specifications for the H100 and GH200 are the same as those for the H200, and the specifications for the B200 in the table is based on NVIDIA’s Technical Brief. In the future, numerical computation algorithms that leverage the performance of cutting-edge architectures will be essential. This study focused on researching high-performance matrix multiplication algorithms that maximize the potential of the latest architectures.

Table 1.

Specifications in TFLOPS/TOPS of NVIDIA GPUs for dense data NVIDIA Corporation (2024).

	Ampere	ADA	Hopper	Blackwell
	A100 SXM4	RTX 4090	H200 SXM5	B200
	GA100	AD102-300
FP64	9.7	1.29	34	40
FP64 TC	19.5	–	67	40
FP32	19.5	82.6	67	80
TF32 TC	156	82.6	494	1100
BF16 TC	312	165	989	2250
FP16 TC	312	165	989	2250
INT8 TC	624	661	1979	4500
FP8 TC	–	661	1979	4500
FP6 TC	–	–	–	4500
INT4 TC	–	1321	–	–
FP4 TC	–	–	–	9000

A highly accurate matrix multiplication scheme via the error-free transformation of matrix products was proposed by Ozaki et al. (2012). The scheme is called the Ozaki scheme and it converts a matrix product into a sum of multiple matrix products. It is also possible to convert a matrix product into a sum of lower-precision matrix products to take full advantage of the immense computational power of recent architectures. In order to compute D ← AB for higher-precision matrices $A \in R^{m \times n}$ and $B \in R^{n \times p}$ , the Ozaki scheme using lower/mixed-precision, provided by Mukunoki et al. (2020), is constructed as the following four steps:

(i) Extract the lower-precision matrices A₁, A₂, …, A_k from the higher-precision matrix A, where k is specified by the user, shifting A_i to prevent overflow and underflow at lower-precision arithmetic.

(ii) Extract k lower-precision matrices B_i from B in the similar way as A_i.

(iii) Compute A_iB_j for i + j ≤ k + 1 using lower/mixed-precision arithmetic.

(iv) Reverse (shift) A_iB_j and accumulate the results into D using higher-precision arithmetic.

When emulating the GEMM routine, the following fifth step is also performed:

(v) Compute C ← αD + βC for scalars α, β and a matrix C.

Ootomo et al. (2024) implemented the Ozaki scheme using the INT8 Tensor Cores and evaluated the performance for emulating DGEMM, an FP64 matrix multiplication routine. The Ozaki scheme implemented by Ootomo is named the “Ozaki Scheme on Integer Matrix Multiplication Unit” (ozIMMU), and the code is available from the GitHub repository by Ootomo (2024). Figure 1 shows the accuracy of ozIMMU. Herein, ϕ specifies the tendency of the difficulty in terms of the accuracy of matrix multiplications. At least 7 or 8 slices are required to obtain sufficiently accurate results, even for well-conditioned matrix multiplications. As ϕ increases, more slices are required to obtain sufficient accuracy. Ootomo’s implementation for splitting matrices offers bit masking. Thus, the extracted matrices A_i, B_i may not be optimal and the splitting method can be improved (see Section 3.1 for details). Improvement of the splitting method should contribute to improving the accuracy of results.

Figure 1.

Accuracy of ozIMMU. Matrix A has entries a_ij≔(U_ij − 0.5) ⋅ exp(ϕ ⋅ N_ij), where U_ij ∈ (0, 1) are uniformly distributed and N_ij are drawn from standard normal distribution for 1 ≤ i, j ≤ m and m = n = p. Matrix B is composed similarly.

Figures 2 and 3 show the time breakdown of ozIMMU for double-precision matrices $A \in R^{m \times n}$ and $B \in R^{n \times p}$ , defined by IEEE Computer Society (2019), on GeForce RTX 4090 and GH200 Grace Hopper Superchip, respectively. Note that “split A”, “split B”, “Gemm in INT8-TC”, “accumulation in FP64”, and “copy” correspond to the steps (i), (ii), (iii), (iv), and (v) in the Ozaki scheme for emulating the GEMM routine, respectively. It can be seen that the computing time for the accumulation of matrix products in FP64 is not negligible even though the computation cost is $O (m p)$ operations. This is because the performance of the INT8 Tensor Core is nearly 512 times and 60 times higher than that of FP64 on the RTX 4090 and GH200, respectively. Because this ratio increases significantly on the B200 and future architectures, accelerating the accumulation process is critical for the Ozaki scheme.

Figure 2.

Time breakdown of ozIMMU on NVIDIA GeForce RTX 4090.

Figure 3.

Time breakdown of ozIMMU on NVIDIA GH200 Grace Hopper Superchip.

In this paper, we propose new implementation methods for accelerating the accumulation of the Ozaki scheme using the INT8 Tensor Core. In addition, an alternative splitting method is also applied to improve the accuracy of results. For the number of slices, the splitting method contributes to obtaining more accurate results than those of Ootomo’s ozIMMU. The remainder of this paper is organized as follows: Section 2 overviews previous studies about the Ozaki scheme using an integer matrix multiplication unit; Section 3 presents the proposed method for accelerating the accumulation and optimizing the splitting method; Section 4 shows numerical results to illustrate the efficiency of the proposed methods; Section 5 provides a rounding error analysis of the Ozaki scheme with the proposed methods; and Section 6 presents final remarks.

2. Previous study

In this section, we briefly summarize previous studies. Let u be the relative error unit, e.g., u = 2⁻⁵³ for double-precision floating-point numbers defined in IEEE 754. Define $F$ as a set of binary floating-point numbers with u. Let O_m,n be an m × n zero matrix. Suppose that $A \in F^{m \times n}$ and $B \in F^{n \times p}$ . Ozaki et al. (2012) proposed an error-free transformation of a matrix product AB. For a user-specified constant $k \in N$ , the Ozaki scheme transforms each of A and B into k + 1 matrices

\begin{array}{l} A ≕ A_{1} + A_{2} + \dots + A_{k} + V_{k}, \\ B ≕ B_{1} + B_{2} + \dots + B_{k} + W_{k}, \end{array}

where

A_{i} \in F^{m \times n}

B_{i} \in F^{n \times p}

V_{i} ≔ A - \sum_{j = 1}^{i} A_{j}, and W_{i} ≔ B - \sum_{j = 1}^{i} B_{j},

(1)

such that

V_{i} \in F^{m \times n}

and

W_{i} \in F^{n \times p}

for i ≤ k, and

| {(A_{s})}_{i j} | \geq | {(A_{s + 1})}_{i j} |

holds if

{(A_{s})}_{i j} \neq 0

for s = 1, …, k − 2, and B_s satisfies the corresponding relation. If the technique provided by Minamihata et al. (2016) is used, then

| {(A_{s})}_{i j} | > | {(A_{s + 1})}_{i j} |

holds when

{(A_{s})}_{i j} \neq 0

. Note that

{(A_{s})}_{i j}

and

{(B_{s})}_{i j}

are obtained depending on

\max_{1 \leq j \leq n} | {(V_{s - 1})}_{i j} |

and

\max_{1 \leq i \leq m} | {(W_{s - 1})}_{i j} |

, respectively.

Let

F_{N}

be a set of N-bit binary floating-point numbers. For the Ozaki scheme using the FP16 Tensor Core with accumulation in FP32 and outputs in FP32, the mixed-precision splitting method with shifting to prevent overflow and underflow was utilized by Mukunoki et al. (2020). For

A \in F_{64}^{m \times n}

and

B \in F_{64}^{n \times p}

, the splitting method produces

\begin{align} A ≕ d i a g (μ^{' (1)}) A_{1}^{'} + \dots + d i a g (μ^{' (k)}) A_{k}^{'} + V_{k}, \end{align}

(2)

\begin{align} B ≕ B_{1}^{'} d i a g (ν^{' (1)}) + \dots + B_{k}^{'} d i a g (ν^{' (k)}) + W_{k} \end{align}

(3)

with double-precision shift values

μ^{' (i)} \in F_{64}^{m}

ν^{' (i)} \in F_{64}^{p}

and half-precision matrices

A_{i}^{'} \in F_{16}^{m \times n}

B_{i}^{'} \in F_{16}^{n \times p}

for i = 1, …, k. Algorithm 1 represents the splitting method for obtaining (2) using Minamihata’s technique. Equation (3) can be obtained by transposing the results of Algorithm 1 executed for B^T. Algorithm 1 can be described as a loop that is executed until s = k for simplicity; however, the loop terminates when A = O_m,n in practice. Note that the binary logarithm is used at the 3rd line in Algorithm 1; however, the calculation using the binary logarithm occasionally returns erroneous results. Therefore, it is better to use a calculation method without the binary logarithm, such as a bitwise operation or a technique leveraging rounding error in floating-point arithmetic as follows:

μ_{i}^{' (s)} \leftarrow u^{- 1} α_{i} + (1 - u^{- 1}) α_{i},

where α_i≔max_j|a_ij|. This technique was developed by Rump and provided by Ozaki et al. (2013).

Algorithm 1 Mixed-precision splitting method from Mukunoki et al. (2020) for Ozaki scheme between FP64 and FP16 using floating-point arithmetic in roundto-nearest-even mode with Minamihata’s technique from Minamihata et al. (2016)
Input: $A \in F_{64}^{m \times n}$ , $k \in N$ Output: $A_{s}^{'} \in F_{16}^{m \times n}$ , $μ^{' (s)} \in F_{64}^{m}$ , s = 1, …, k 1: β ←⌈(29 − log₂n)/2⌉
2: for s = 1, …, k do
3: $μ_{i}^{' (s)} \leftarrow 2^{⌈ \log_{2} \max_{j} \| a_{i j} \| ⌉}$ for i = 1, …, m
4: $σ_{i} \leftarrow 0.75 \cdot μ_{i}^{' (s)} \cdot 2^{β}$ for i = 1, …, m
5: ${(A_{s})}_{i j} \leftarrow (a_{i j} + σ_{i}) - σ_{i} \forall i, j$ {Extract in round-to-nearest-even mode}
6: ${(A_{s}^{'})}_{i j} \leftarrow F P 16 ({(μ_{i}^{' (s)})}^{- 1} \cdot {(A_{s})}_{i j}) \forall i, j$ {Convert to FP16}
7: $a_{i j} \leftarrow a_{i j} - {(A_{s})}_{i j} \forall i, j$
8: end for

After splitting matrices as in (2) and (3),

\begin{aligned} A B = \sum_{s + t \leq k + 1} d i a g (μ^{' (s)}) A_{s}^{'} B_{t}^{'} d i a g (ν^{' (t)}) \\ + \sum_{s + t > k + 1} d i a g (μ^{' (s)}) A_{s}^{'} B_{t}^{'} d i a g (ν^{' (t)}) \\ + \sum_{s = 1}^{k} d i a g (μ^{' (s)}) A_{s}^{'} W_{k} \\ + \sum_{s = 1}^{k} V_{k} B_{s}^{'} d i a g (ν^{' (s)}) \\ + V_{k} W_{k} \end{aligned}

holds and there is no rounding error in

A_{s}^{'} B_{t}^{'}

for 1 ≤ s, t ≤ k on the FP16 Tensor Core. In the method implemented by Mukunoki et al. (2020), the approximation of AB was computed only using matrix multiplications with the FP16 Tensor Core, as

A B \approx \sum_{s + t \leq k + 1} d i a g (μ^{' (s)}) A_{s}^{'} B_{t}^{'} d i a g (ν^{' (t)})

shown in Algorithm 2. This method was referred to as the “Fast Mode” defined by Mukunoki et al. (2020). For larger k,

| {(W_{k})}_{i j} |

and

| {(V_{k})}_{i j} |

are smaller and the approximation is more accurate.

Algorithm 2 Mixed-precision matrix multiplication method for Ozaki scheme using FP16 Tensor Core in fast mode from Mukunoki et al. (2020)
Input: $A_{s}^{'} \in F_{16}^{m \times n}$ , $μ^{' (s)} \in F_{64}^{m}$ , $B_{s}^{'} \in F_{16}^{n \times p}$ , $ν^{' (s)} \in F_{64}^{n}$ , s = 1, …, k Output: $C \in F_{64}^{m \times p}$
1: C ← O_m,p { $C^{'} \in F_{32}^{m \times p}$ }
2: for g = 2, …, k + 1 do
3: for s = 1, …, g − 1 do
4: $C^{'} \leftarrow A_{s}^{'} B_{g - s}^{'}$ {Compute using GEMM on FP16 Tensor Core}
5: C ← C + diag(μ′^(s))FP64(C′)diag(ν′^(t)) {Compute in FP64}
6: end for
7: end for

Let

I_{N}

be a set of N-bit signed integers. It holds that −2^N−1 ≤ i ≤ 2^N−1 − 1 for all

i \in I_{N}

. Remember that no error occurs in integer arithmetic, barring overflow. For the Ozaki scheme using the INT8 Tensor Core with accumulation in INT32, the mixed-precision splitting method via bit masking with shifting is offered in Ootomo et al. (2024) and provided by Ootomo (2024). Algorithm 3 shows the splitting method. Let

β ≔ \min (7, ⌊ \frac{31 - \log_{2} n}{2} ⌋) .

(4)

Assume that β ≥ 1, i.e., n ≤ 2²⁹. That splitting method produces

\begin{align} A ≕ d i a g (μ^{''}) (2^{- β + 1} A_{1}^{''} + \dots + 2^{- k β + 1} A_{k}^{''}) + V_{k}, \end{align}

(5)

\begin{align} B ≕ (2^{- β + 1} B_{1}^{''} + \dots + 2^{- k β + 1} B_{k}^{''}) d i a g (ν^{''}) + W_{k} \end{align}

(6)

with double-precision shift values

μ^{''} \in F_{64}^{m}

ν^{''} \in F^{p}

and 8-bit integer matrices

A_{i}^{''} \in I_{8}^{m \times n}

B_{i}^{''} \in I_{8}^{n \times p}

for i = 1, …, k.

Algorithm 3 Mixed-precision splitting method for Ozaki scheme via bit masking from Ootomo et al. (2024). This algorithm is specialized for the Ozaki scheme using the INT8 Tensor Core.
Input: $A \in F_{64}^{m \times n}$ , $k \in N$ Output: $A_{s}^{''} \in I_{8}^{m \times n}$ , $μ^{''} \in F_{64}^{m}$ , s = 1, …, k
1: β ← min(7, ⌊(31 − log₂n)/2⌋)
2: $μ_{i}^{''} \leftarrow 2^{⌊ \log_{2} \max_{j} \| a_{i j} \| ⌋}$ for i = 1, …, m
3: for s = 1, …, k do
4: Extract sth β bits of mantissa of a_ij ∀i, j via bit masking and hold those as INT8 variable ${(A_{s}^{''})}_{i j}$
5: end for

After splitting matrices as in (5) and (6),

\begin{array}{l} A B & = \sum_{s + t \leq k + 1} d i a g (μ^{''}) 2^{- β s + 1} 2^{- β t + 1} A_{s}^{''} B_{t}^{''} d i a g (ν^{''}) \\ + \sum_{s + t > k + 1} d i a g (μ^{''}) 2^{- β s + 1} 2^{- β t + 1} A_{s}^{''} B_{t}^{''} d i a g (ν^{''}) \\ + \sum_{s = 1}^{k} d i a g (μ^{''}) 2^{- β s + 1} A_{s}^{''} W_{k} \\ + \sum_{s = 1}^{k} V_{k} 2^{- β s + 1} B_{s}^{''} d i a g (ν^{''}) \\ + V_{k} W_{k} \end{array}

holds and there is no error in

A_{s}^{''} B_{t}^{''}

for 1 ≤ s, t ≤ k on the INT8 Tensor Core barring overflow. In the method implemented by Ootomo et al. (2024), the approximation of AB is computed using only matrix multiplications with the INT8 Tensor Core, as

A B \approx \sum_{s + t \leq k + 1} d i a g (μ^{''}) 2^{- β s + 1} 2^{- β t + 1} A_{s}^{''} B_{t}^{''} d i a g (ν^{''}),

(7)

shown in Algorithm 4. For larger k,

| {(W_{k})}_{i j} |

and

| {(V_{k})}_{i j} |

are smaller and the approximation is more accurate.

Algorithm 4 Mixed-precision matrix multiplication method for Ozaki scheme using INT8 Tensor Core from Ootomo et al. (2024)
Input: $A_{s}^{''} \in I_{8}^{m \times n}$ , $μ^{'' (s)} \in F_{64}^{m}$ , $B_{s}^{''} \in I_{8}^{n \times p}$ , $ν^{'' (s)} \in F_{64}^{p}$ , s = 1, …, k Output: $C \in F_{64}^{m \times p}$
1: β ← min(7, ⌊(31 − log₂n)/2⌋)
2: C ← O_m,p { $C^{''} \in I_{32}^{m \times p}$ }
3: for g = 2, …, k + 1 do
4: for s = 1, …, g − 1 do
5: $C^{''} \leftarrow A_{s}^{''} B_{g - s}^{''}$ {Compute using GEMM on INT8 Tensor Core}
6: C ← C + diag(μ″)2^−βs+12^{−β(g−s)+1}FP64(C″)diag(ν″) {Compute in FP64}
7: end for
8: end for

Figure 4 represents images of Algorithms 2 and 4. In Figure 4, $A_{i} ≔ d i a g (μ^{' (i)}) A_{i}^{'}$ and $B_{i} ≔ B_{i}^{'} d i a g (ν^{' (i)})$ for Algorithms 1 and 2, and $A_{i} ≔ d i a g (μ^{''}) 2^{- i β + 1} A_{i}^{''}$ and $B_{i} ≔ 2^{- i β + 1} B_{i}^{''} d i a g (ν^{''})$ for Algorithms 3 and 4. The exponent of A_iB_j is larger for smaller indices of i, j, so those terms have stronger effects on the accuracy of the final result.

Figure 4.

Images of matrix multiplications in Ozaki scheme for k = four.

3. Proposed methods

3.1. Splitting with rounding to nearest

We begin by explaining our motivation with a simple example. Algorithm 3 described in Section 2 used bitmask for the splitting. We assume that each number keeps three bits. Let a floating-point number be x≔377, represented by (101011111)₂ in binary format. The bitmask strategy divides x into three numbers x = x₁ + x₂ + x₃:

\{\begin{cases} x_{1} = {(\underset{̲}{101} 000000)}_{2}, \\ x_{2} = {(000 \underset{̲}{011} 000)}_{2}, \\ x_{3} = {(000000 \underset{̲}{111})}_{2} . \end{cases}

(8)

In decimal format,

x_{1} = 256 + 0 + 64, x_{2} = 0 + 16 + 8, x_{3} = 4 + 2 + 1 .

Here x₁ is not the nearest number to x.

On the other hand, let x₁ and x₂ be the nearest number to x and x − x₁, respectively. Then,

\{\begin{cases} x_{1} = {(\underset{̲}{101} 000000)}_{2}, \\ x_{2} = {(000 \underset{̲}{100} 000)}_{2}, \\ x_{3} = - {(00000000 \underset{̲}{1})}_{2} . \end{cases}

(9)

In decimal format,

x_{1} = 256 + 0 + 64, x_{2} = 32, x_{3} = - 1 .

There are two drawbacks in the bitmask strategy. One drawback is that even if the leading bit is zero, we keep it, so that the number of significant figures decreases as in x₂ in (8). The other drawback is that the truncation error can be reduced. For example, if we set two slices, there is a difference in the truncation error |x₃| in (8) and (9). It is expected that the nearest strategy makes the truncation error small. Summarizing, even if we obtain a computed result with k slices using the rounding to nearest strategy, the accuracy may be comparable to that with k + 1 slices using the bitmask strategy described in Section 2 as the previous study. In this case, we must carefully monitor the following via numerical examples:

• the increase in the splitting cost because the cost for the bitmask is low;

• the reduction in the number of matrix multiplications; and

• the reduction in the term for the accumulation.

Note that in the worst case, the results of the splitting with bitmask and the nearest strategy are the same. It can be explained using the following example:

x ≔ {(100010001)}_{2} .

For the bit mask strategy, we have

\{\begin{cases} x_{1} = {(\underset{̲}{100} 000000)}_{2}, \\ x_{2} = {(000 \underset{̲}{010} 000)}_{2}, \\ x_{3} = {(000000 \underset{̲}{001})}_{2} . \end{cases}

For the rounding to the nearest strategy, we obtain

\{\begin{cases} x_{1} = {(\underset{̲}{100} 000000)}_{2}, \\ x_{2} = {(0000 \underset{̲}{100} 00)}_{2}, \\ x_{3} = {(00000000 \underset{̲}{1})}_{2} . \end{cases}

The following is an algorithm for the splitting with the round to nearest strategy.

Algorithm 5 Proposed method for mixed-precision splitting for Ozaki scheme between FP64 and INT8 using floatingpoint arithmetic in round-to-nearest-even mode with Minamihata’s technique from Minamihata et al. (2016)
Input: $A \in F_{64}^{m \times n}$ , $k \in N$ Output: $A_{s}^{'} \in I_{8}^{m \times n}$ , $μ^{' (s)} \in F_{64}^{m}$ , s = 1, …, k
1: β ← min(7, ⌊(31 − log₂n)/2⌋)
2: for s = 1, …, k do
3: $μ_{i}^{' (s)} \leftarrow 2^{⌈ \log_{2} \max_{j} \| a_{i j} \| ⌉} \cdot 2^{1 - β}$ for i = 1, …, m
4: $σ_{i} = 0.75 \cdot 2^{53} \cdot μ_{i}^{' (s)}$ for i = 1, …, m
5: ${(A_{s})}_{i j} = (a_{i j} + σ_{i}) - σ_{i}$ for i = 1, …, m {Extract in round-to-nearest-even mode}
6: ${(A_{s}^{'})}_{i j} \leftarrow I N T 8 ({(μ_{i}^{' (s)})}^{- 1} \cdot {(A_{s})}_{i j}) \forall i, j$ {Convert to INT8}
7: $a_{i j} \leftarrow a_{i j} - {(A_{s})}_{i j} \forall i, j$
8: end for

3.2. Group-wise error-free accumulation

Next, we propose a method for accelerating the accumulation in FP64 for ozIMMU. Algorithm 4 requires the computation of a sum of k(k + 1)/2 FP64 matrices at the 6th line. The accumulation accounts for a large ratio of the total computation time of ozIMMU as shown in Figures 2 and 3. For this challenge, we propose a method for accelerating ozIMMU by reducing the number of additions in FP64.

Recall that Algorithm 4 performs

\begin{aligned} C \leftarrow \\ \sum_{g = 2}^{k + 1} \sum_{s = 1}^{g - 1} d i a g (μ^{''}) 2^{- β s + 1} 2^{- β (g - s) + 1} A_{s}^{''} B_{g - s}^{''} d i a g (ν^{''}) . \end{aligned}

(10)

The inner sum of (10) can be expressed as follows:

\begin{aligned} \sum_{s = 1}^{g - 1} d i a g (μ^{''}) 2^{- β s + 1} 2^{- β (g - s) + 1} A_{s}^{''} B_{g - s}^{''} d i a g (ν^{''}) \\ = 2^{- β g + 2} d i a g (μ^{''}) (\sum_{s = 1}^{g - 1} A_{s}^{''} B_{g - s}^{''}) d i a g (ν^{''}) \end{aligned}

(11)

If no overflow occurs in

A_{s}^{''} B_{g - s}^{''} + A_{s + 1}^{''} B_{g - s - 1}^{''}

(1 ≤ s ≤ g − 2) in INT32, the summation can be performed on the accumulator of GEMM on the INT8 Tensor Core. Therefore, applying the above transformation reduces the numbers of slow conversions from INT32 to FP64, scalings, and slow summations in FP64. Let

r \in N

r ≔ \max (1, 2^{31 - 2 β - ⌈ \log_{2} n ⌉}) .

(12)

Then, the summation

\sum_{s = 1}^{\min (r, g - 1)} A_{s}^{''} B_{g - s}^{''}

can be computed without error using GEMM on the INT8 Tensor Core. The summation of r instances of

A_{s}^{''} B_{g - s}^{''}

for s = 1, 2, …, r can be computed without error; however, the summation of all instances of

A_{s}^{''} B_{g - s}^{''}

for s = 1, 2, …, g − 1 cannot always be computed without error. The proof validating error-free summation is provided in Section 5. Finally, we present Algorithm 6, which has a reduced number of summands in FP64 accumulation.

Algorithm 6 Proposed method for groupwise error-free accumulation for Ozaki scheme using INT8 Tensor Core
Input: $A_{s}^{''} \in I_{8}^{m \times n}$ , $μ^{''} \in F_{64}^{m}$ , $B_{s}^{''} \in I_{8}^{n \times p}$ , $ν^{'' (s)} \in F_{64}^{p}$ , s = 1, …, k Output: $C \in F_{64}^{m \times p}$
1: β ← min(7, ⌊(31 − log₂n)/2⌋)
2: $r \leftarrow \max (1, 2^{31 - 2 β - ⌈ \log_{2} n ⌉})$ {#addends for error-free accum.}
3: C ← O_m,p
4: for g = 2, …, k + 1 do
5: q ← 0
6: C″ ← O_m,p { $C^{''} \in I_{32}^{m \times p}$ }
7: for s = 1, …, g − 1 do
8: q ← q + 1
9: $C^{''} \leftarrow C^{''} + A_{s}^{''} B_{g - s}^{''}$ {Compute using GEMM on INT8 Tensor Core}
10: if q = r or s = g − 1 then
11: C ← C + 2^−βg+2diag(μ″)FP64(C″)diag(ν″) {Compute in FP64}
12: q ← 0
13: C″ ← O
14: end if
15: end for
16: end for

Assuming r ≥ k for simplification, all

A_{s}^{''} B_{g - s}^{''}

for s = 1, 2, …, g − 1 can be accumulated without overflow for g = 3, …, k + 1. Algorithm 7 shows the Ozaki scheme with groupwise error-free accumulation for r ≥ k.

Algorithm 7 Simple version of proposed method for groupwise error-free accumulation for Ozaki scheme using INT8 Tensor Core for r ≥ k
Input: $A_{s}^{''} \in I_{8}^{m \times n}$ , $μ^{''} \in F_{64}^{m}$ , $B_{s}^{''} \in I_{8}^{n \times p}$ , $ν^{'' (s)} \in F_{64}^{p}$ , s = 1, …, k Output: $C \in F_{64}^{m \times p}$
1: β ← min(7, ⌊(31 − log₂n)/2⌋) = 7
2: C ← O_m,p
3: for g = 2, …, k + 1 do
4: C″ ← O_m,p { $C^{''} \in I_{32}^{m \times p}$ }
5: for s = 1, …, g − 1 do
6: $C^{''} \leftarrow C^{''} + A_{s}^{''} B_{g - s}^{''}$ {Compute using GEMM on INT8 Tensor Core}
7: end for
8: C ← C + diag(μ″)2^−βg+2FP64(C″)diag(ν″) {Compute in FP64}
9: end for

3.3. Combination of proposals

Next, we accelerate ozIMMU by reducing the number of summands in FP64 accumulation and improving the splitting method. For this purpose, we combine the methods proposed in Sections 3.1 and 3.2. In order to use group-wise error-free accumulation as in Algorithm 6, A and B need to be expressed as in (5) and (6). Thus, we determine

\begin{array}{l} μ_{i}^{''} \leftarrow 2^{⌈ \log_{2} \max_{j} | a_{i j} | ⌉} \cdot 2^{1 - β}, \\ σ_{i} \leftarrow 0.75 \cdot 2^{53 - β (s - 1)} \cdot μ_{i}^{''} \end{array}

as the 3rd and 4th lines in Algorithm 5. Then, A_s are extracted using rounding to nearest floating-point arithmetic as

{(A_{s})}_{i j} \leftarrow (a_{i j} + 2^{β (s - 1)} σ_{i}) - 2^{β (s - 1)} σ_{i}

for the constant 2^β(s−1)σ_i with a common ratio of 2^β. Finally, we obtain Algorithm 8. The shift values for A_s are

μ_{i}^{'' (s)} \cdot 2^{β (s - 1)}

; thus,

μ_{i}^{'' (s)}

are determined before the “for” statement and the shift values for A_s for s ≥ 1 are calculated by shifting

μ_{i}^{'' (s)}

by a constant ratio 2^β. Hence, the algorithm finds the maximum absolute values max_j|a_ij| for i = 1, …, m only once before the “for” statement. On the other hand, the shift values

μ_{i}^{' (s)}

are determined inside the “for” statement in Algorithm 5. Therefore, the algorithm finds the maximum absolute values max_j|a_ij| for i = 1, …, m at most k times.

Algorithm 8 Another proposed method for mixed-precision splitting for Ozaki scheme between FP64 and INT8 using floating-point arithmetic in round-to-nearest-even mode with Minamihata’s technique from Minamihata et al. (2016). The results can be expressed as in (5) for error-free group-wise accumulation.
Input: $A \in F_{64}^{m \times n}$ , $k \in N$ Output: $A_{s}^{''} \in I_{8}^{m \times n}$ , $μ^{''} \in F_{64}^{m}$ , s = 1, …, k
1: β ← min(7, ⌊(31 − log₂n)/2⌋)
2: $μ_{i}^{''} \leftarrow 2^{⌈ \log_{2} \max_{j} \| a_{i j} \| ⌉} \cdot 2^{1 - β}$ for i = 1, …, m
3: for s = 1, …, k do
4: $σ_{i} \leftarrow 0.75 \cdot 2^{53 - β (s - 1)} \cdot μ_{i}^{''}$ for j = 1, …, m
5: ${(A_{s})}_{i j} \leftarrow (a_{i j} + σ_{i}) - σ_{i} \forall i, j$ {Extract in round-to-nearest-even mode}
6: ${(A_{s}^{''})}_{i j} \leftarrow I N T 8 ({μ_{i}^{''}}^{- 1} \cdot 2^{β (s - 1)} \cdot {(A_{s})}_{i j}) \forall i, j$ {Convert to INT8}
7: $a_{i j} \leftarrow a_{i j} - {(A_{s})}_{i j} \forall i, j$
8: end for

4. Numerical examples

We have implemented the proposed methods in the GitHub repository by Uchino (2024), replacing specific code in Ootomo’s ozIMMU library with our code. All numerical experiments were conducted on NVIDIA GH200 Grace Hopper Superchip and NVIDIA GeForce RTX 4090 with the GNU C++ Compiler 11.4.1 and NVIDIA CUDA Toolkit 12.5.82. The tested methods will be denoted as follows:

• ozIMMU-k: Ootomo’s implementation with k slices

• ozIMMU_RN-k: Proposed method in Section 3.1 with k slices

• ozIMMU_EF-k: Proposed method in Section 3.2 with k slices

• ozIMMU_H-k: Proposed method in Section 3.3 with k slices.

4.1. Accuracy

Figure 5 shows the accuracies of ozIMMU-k, ozIMMU_RN-k, ozIMMU_EF-k, and ozIMMU_H-k for k = 7, …, 11. We set $A \in F_{64}^{n \times n}$ as a_ij≔(U_ij − 0.5) ⋅ exp(ϕ ⋅ N_ij), where U_ij ∈ (0, 1) are uniformly distributed random numbers and N_ij are drawn from the standard normal distribution, for 1 ≤ i, j ≤ n. The constant ϕ specifies the tendency of the difficulty in terms of matrix multiplication accuracy. A matrix $B \in F_{64}^{n \times n}$ is set similarly to A.

Figure 5.

Comparison of accuracy between ozIMMU and proposed methods. Matrix A has entries a_ij≔(U_ij − 0.5) ⋅ exp(ϕ ⋅ N_ij), where U_ij ∈ (0, 1) are uniformly distributed random numbers and N_ij are drawn from the standard normal distribution, for 1 ≤ i, j ≤ m and m = n = p. Matrix B has a similar composition.

The accuracy deteriorated for n > 2¹⁷ because the maximum number of significant bits β = min(7, ⌊(31 − log₂n)/2⌋) of the elements of the split matrices is less than 7. The accuracies of ozIMMU and ozIMMU_EF are comparable, as are those of, ozIMMU_RN and ozIMMU_H. ozIMMU_RN and ozIMMU_H, which use a splitting method via rounding to nearest floating-point arithmetic, are more accurate than ozIMMU and ozIMMU_EF, which use a splitting method via bit masking. Occasionally, to obtain results comparable with that of FP64, ozIMMU_RN-k, ozIMMU_H-k, ozIMMU-(k + 1), and ozIMMU_EF-(k + 1) are required; i.e., ozIMMU_RN and ozIMMU_H require fewer splits than ozIMMU and ozIMMU_EF. In particular, for ϕ = 2, ozIMMU_RN-9 and ozIMMU_H-9 produce comparable results to FP64; however, the accuracies of the results of ozIMMU-9 and ozIMMU_EF-9 are worse than that of FP64. In such cases, k = 10 is required for ozIMMU and ozIMMU_EF to attain more accurate results than FP64.

4.2. Time breakdown

Figures 6–8 and Figures 9–11 show time breakdowns of the proposed methods on RTX 4090 and GH200, respectively. Note that “split A”, “split B”, “Gemm in INT8-TC”, “accumulation in FP64”, and “copy” correspond to the steps (i), (ii), (iii), (iv), and (v) in the Ozaki scheme for emulating the GEMM routine as shown in Section 1.

Figure 6.

Time breakdown of ozIMMU_RN on NVIDIA GeForce RTX 4090.

Figure 7.

Time breakdown of ozIMMU_EF on NVIDIA GeForce RTX 4090.

Figure 8.

Time breakdown of ozIMMU_H on NVIDIA GeForce RTX 4090.

Figure 9.

Time breakdown of ozIMMU_RN on NVIDIA GH200 Grace Hopper Superchip.

Figure 10.

Time breakdown of ozIMMU_EF on NVIDIA GH200 Grace Hopper Superchip.

Figure 11.

Time breakdown of ozIMMU_H on NVIDIA GH200 Grace Hopper Superchip.

The execution time of FP64 accumulation in ozIMMU_RN is not negligible, because the number of additions in FP64 is not reduced. The execution times of FP64 accumulation in ozIMMU_EF and ozIMMU_H are less than those in ozIMMU and ozIMMU_RN. From Figures 6 –8, the computation time of splitting via rounding to nearest floating-point arithmetic is not so fast because FP64 is much slower than the lower-precision arithmetic on RTX 4090. From Figures 10 and 11, the ratios of the computation time of splitting in ozIMMU_EF and ozIMMU_H are comparable on GH200.

4.3. Performance

Figures 12 and 13 show throughput in TFLOPS and ratio to ozIMMU. A smaller number of slices corresponds to better performance because the total computation cost is less.

Figure 12.

Throughput comparison (left) and speedups to ozIMMU (right) on NVIDIA GeForce RTX 4090.

Figure 13.

Throughput comparison (left) and speedups to ozIMMU (right) on NVIDIA GH200 Grace Hopper Superchip.

On RTX 4090, all methods are faster than or comparable to matrix multiplication in FP64 for n ≥ 1024 because FP64 is much slower than the lower-precision arithmetic. ozIMMU_EF and ozIMMU_H are faster than ozIMMU almost everywhere. In particular, ozIMMU_EF-12 and ozIMMU_H-12 are respectively 1.6 and 1.4 times faster than ozIMMU for n = 4096. ozIMMU_RN is slower than ozIMMU. For k = 3, ozIMMU, ozIMMU_RN, ozIMMU_EF, and ozIMMU_H are 38.8, 33.8, 44.4, and 40.0 times faster than matrix multiplication in FP64 for n = 16384, respectively. For k = 12, ozIMMU, ozIMMU_RN, ozIMMU_EF, and ozIMMU_H are 1.4, 1.0, 1.9, and 1.6 times faster than matrix multiplication in FP64 for n = 1024, respectively. In addition, ozIMMU, ozIMMU_RN, ozIMMU_EF, and ozIMMU_H are 9.4, 8.5, 12.0, and 10.9 times faster than FP64 for k = 7 and n = 16384, and 7.4, 6.8, 9.5, and 8.6 times faster than FP64 for k = 8 and n = 16384.

On GH200, the methods with small k are faster than matrix multiplication in FP64 for n ≥ 4096; however, the methods are much slower than FP64 otherwise. ozIMMU_EF and ozIMMU_H are faster than ozIMMU. In particular, ozIMMU_EF-11 and ozIMMU_H-11 are respectively 1.5 and 1.6 times faster than ozIMMU for n = 2048. ozIMMU_RN is not as slow on GH200 as it is on RTX 4090. For k = 3, ozIMMU, ozIMMU_RN, ozIMMU_EF, and ozIMMU_H are 2.7, 2.2, 3.0, and 3.0 times faster than matrix multiplication in FP64 for n = 16384, respectively. ozIMMU-5, ozIMMU_RN-5, ozIMMU_EF-6, and ozIMMU_H-6 have comparable computation times to matrix multiplication in FP64 for n = 16384; however, ozIMMU-k, ozIMMU_RN-k, ozIMMU_EF-(k + 1), and ozIMMU_H-(k + 1) are slower than FP64 for n < 16384 or k > 5.

4.4. Performance vs. Accuracy

Figures 14 and 15 show throughput in a TFLOPS vs. Accuracy plot, illustrating relationships between performance and accuracy for the Ozaki scheme and for matrix multiplication in FP64 for n = 4096 and ϕ = 0. The numbers inside the symbols are the numbers of splices for the Ozaki scheme.

Figure 14.

Relationship between performance and accuracy for n = 4096 and ϕ = 0 on NVIDIA GeForce RTX 4090.

Figure 15.

Relationship between performance and accuracy for n = 4096 and ϕ = 0 on NVIDIA GH200 Grace Hopper Superchip.

On RTX 4090, all methods are better than matrix multiplication in FP64 with respect to both performance and accuracy for k ≥ 8. ozIMMU_H and ozIMMU_EF produce results with comparable performances that are more accurate than those of ozIMMU. In particular, ozIMMU_H-8 and ozIMMU_EF-9 produce results with comparable performances that are more accurate than those of ozIMMU-7. On GH200 as well, ozIMMU_H and ozIMMU_EF produce results with comparable performances that are more accurate than those of ozIMMU. In particular, ozIMMU_H-7 and ozIMMU_EF-7 produce results with comparable performances that are more accurate than those of ozIMMU-6. These results also indicate that ozIMMU_H and ozIMMU_EF can be computed with less memory consumption than ozIMMU.

5. Rounding error analysis

In this section, we give a rounding analysis for a fixed number of slices. We have described Algorithms 3 and 4 (ozIMMU), and proposed Algorithms 5 and 6, Algorithms 3 and 6, and Algorithms 8 and 6. Even if we used the rounding to nearest strategy as in Algorithms 5 and 8, their error bounds would be the same as that of the bitmask strategy used in Algorithm 5. Therefore, we focus on Algorithms 3 and 4 (ozIMMU) and Algorithms 3 and 6. Below, absolute value notation applied to a matrix means the matrix from applying absolute value element-wise. We assume that neither overflow nor underflow occurs. For $A \in F^{m \times n}$ and $B \in F^{n \times p}$ , the following deterministic error bound is given by Jeannerod and Rump (2013):

| A B - f l (A B) | \leq n u | A | | B | .

The following alternative probabilistic error bound comes from Higham and Mary (2019) (which has the details on the assumptions and probabilities):

| A B - f l (A B) | ≲ \sqrt{n} u | A | | B | .

In this section, our aim is to derive an error bound on the computed result. Let

T \in F^{m \times p}

be a computed result using k slices, such as

A \approx A_{1} + A_{2} + \dots + A_{k}, B \approx B_{1} + B_{2} + \dots + B_{k} .

Note that for i = 1, 2, …, k,

A_{i} ≔ d i a g (μ^{' (i)}) A_{i}^{'}, B_{i} ≔ B_{i}^{'} d i a g (ν^{' (i)})

for Algorithms 1, 2, and 5, and

A_{i} ≔ d i a g (μ^{''}) 2^{- i β + 1} A_{i}^{''}, B_{i} ≔ 2^{- i β + 1} B_{i}^{''} d i a g (ν^{''})

for Algorithms 3, 4, 6, 7, and 8. Let T_k be an approximation of AB with k slices. Our aim is to obtain the upper bound on |AB − T| as follows:

\begin{aligned} | A B - T_{k} | \\ \leq |A B - \sum_{i = 1}^{k} \sum_{j = 1}^{k - i + 1} A_{i} B_{j}| + |\sum_{i = 1}^{k} \sum_{j = 1}^{k - i + 1} A_{i} B_{j} - T_{k}| . \end{aligned}

(13)

Note that we can obtain the exact product A_iB_j in the above by using GEMM in the INT8 Tensor Core and scaling by powers of two. The first term of the bound in (13) indicates the truncation error, and the second term shows an error arising in the accumulation process. As matrix scaling does not affect rounding errors, a discussion on it is omitted.

In the following subsections, we use an upper bound provided by Jeannerod and Rump (2013) for a sum:

|\sum_{i = 1}^{n} p - f l (\sum_{i = 1}^{n} p)| \leq (n - 1) u \sum_{i = 1}^{n} | p_{i} |, p \in F^{n} .

(14)

Let

u f p (c)

for

c \in R

be defined as

u f p (c) ≔ \{\begin{cases} 0 & if c = 0, \\ 2^{⌊ \log_{2} | c | ⌋} & otherwise . \end{cases}

The following condition is used to check whether a real number is a floating-point number. For

c \in R

, if |c| is smaller than the maximum floating-point number and c is an integral multiple of the minimum positive floating-point number, then it holds that

c \in 2 u \cdot u f p (c) Z \Rightarrow c \in F .

(15)

5.1. Error bound for Algorithms 3 and 4 (ozIMMU)

Let matrices $V_{k} \in F^{m \times n}$ and $W_{k} \in F^{n \times p}$ be defined as in (1) where these show the truncation errors of A and B for the case of k slices, respectively. Notation e indicates the vector $e = {(1,1, \dots, 1)}^{T} \in F^{n}$ . Define two vectors $g \in F^{m}$ and $h \in F^{p}$ related to row-wise maximums in A and column-wise maximums in B in the sense of the unit in the first place as

g_{i} ≔ u f p (\max_{j} | a_{i j} |), f_{j} ≔ u f p (\max_{i} | b_{i j} |) .

Then, from Figure 16 and (1), we have

| V_{i} | \leq 2^{- β i + 1} g e^{T}, | W_{i} | \leq 2^{- β i + 1} e f^{T},

(16)

and matrices |A_i| and |B| are bounded as

| A_{i} | \leq 2^{- β (i - 1) + 1} g e^{T}, | B | \leq 2 e f^{T} .

(17)

Because

A B = (A_{1} + \dots + A_{k} + V_{k}) (B_{1} + \dots + B_{k} + W_{k}),

we have

A B - \sum_{i = 1}^{k} \sum_{j = 1}^{k - i + 1} A_{i} B_{j} = \sum_{i = 1}^{k} A_{i} W_{k - i + 1} + V_{k} B .

From the last equation, (16), and (17), the truncation error is bounded as

\begin{align} |A B - \sum_{i = 1}^{k} \sum_{j = 1}^{k - i + 1} A_{i} B_{j}| \\ \leq \sum_{i = 1}^{k} |A_{i}| |W_{k - i + 1}| + |V_{k}| |B| \\ \leq \sum_{i = 1}^{k} 2^{- β (i - 1) + 1} g e^{T} \cdot 2^{- β (k - i + 1) + 1} e f^{T} \\ + 2 \cdot 2^{- β k + 1} g e^{T} \cdot e f^{T} \\ = 4 n \sum_{i = 1}^{k} 2^{- β k} g f^{T} + 4 n \cdot 2^{- β k} g f^{T} \\ = 4 (k + 1) n 2^{- β k} g f^{T} . \end{align}

(18)

Figure 16.

An image of (24).

If |g_i − |a_ij‖ for 1 ≤ j ≤ n and |f_j − |b_ij‖ for 1 ≤ i ≤ n are very small, we can assume that

4 n g f^{T} \approx | A | | B | .

(19)

In this case, we have

|A B - \sum_{i = 1}^{k} \sum_{j = 1}^{k - i + 1} A_{i} B_{j}| ≲ (k + 1) 2^{- β k} | A ‖ B | .

(20)

We next focus on an error in the accumulation, the second term in (13). If A_i and B_i are obtained by Algorithm 3,

| A | = \sum_{i = 1}^{k} | A_{i} |, | B | = \sum_{i = 1}^{k} | B_{i} |

(21)

are satisfied. Because we compute the sum of k(k + 1)/2 floating-point matrices, the following bound is immediately obtained from (14) and (21):

\begin{align} |\sum_{i = 1}^{k} \sum_{j = 1}^{k - i + 1} A_{i} B_{j} - T_{k}| \\ \leq (\frac{1}{2} k (k + 1) - 1) u \sum_{i = 1}^{k} \sum_{j = 1}^{k - i + 1} | A_{i} B_{j} | \\ \leq (\frac{1}{2} k (k + 1) - 1) u | A | | B | . \end{align}

(22)

We will show that k′ (≤k) exists such that

\sum_{s = 1}^{k^{'}} \sum_{t = 1}^{k^{'} - s + 1} A_{s} B_{t} \in F^{m \times n} .

(23)

Then, the bound (22) is improved. From Figure 16, we have

{(A_{s})}_{i j} \in 2^{- s β + 1} g_{i} Z, {(B_{t})}_{i j} \in 2^{- t β + 1} f_{j} Z,

(24)

and these derive

{(A_{s})}_{i ℓ} {(B_{t})}_{ℓ j} \in 2^{- β (s + t) + 2} g_{i} f_{j} Z .

(25)

Figure 16 shows an image of (24). It follows from this that

{(\sum_{s = 1}^{k^{'}} \sum_{t = 1}^{k^{'} - s + 1} A_{s} B_{t})}_{i j} \in 2^{- β (k^{'} + 1) + 2} g_{i} f_{j} Z .

(26)

From Figure 16, the elements of the matrices are bounded as

| A_{s} |_{i j} \leq 2^{- β (s - 1) + 1} g_{i}, | B_{t} |_{i j} \leq 2^{- β (t - 1) + 1} f_{j},

so that we have

| A_{s} B_{t} |_{i j} \leq n 2^{- β (s + t - 2) + 2} g_{i} f_{j} .

(27)

Therefore, from (27), we have the following bound on a dot product:

\begin{array}{l} {|\sum_{s = 1}^{k^{'}} \sum_{t = 1}^{k^{'} - s + 1} A_{s} B_{t}|}_{i j} \\ = {|\sum_{s = 2}^{k^{'} + 1} \sum_{i_{2} + j_{2} = s} A_{i_{2}} B_{j_{2}}|}_{i j} \\ \leq {(\sum_{s = 2}^{k^{'} + 1} \sum_{i_{2} + j_{2} = s} | A_{i_{2}} | | B_{j_{2}} |)}_{i j} \\ \leq \sum_{s = 2}^{k^{'} + 1} \sum_{i_{2} + j_{2} = s} n 2^{- (i_{2} + j_{2} - 2) β + 2} g_{i} f_{j} \\ = \sum_{s = 2}^{k^{'} + 1} \sum_{i_{2} + j_{2} = s} n 2^{- (s - 2) β + 2} g_{i} f_{j} \\ = \sum_{s = 2}^{k^{'} + 1} (s - 1) n 2^{- (s - 2) β + 2} g_{i} f_{j} . \end{array}

If β ≥ 3,

\sum_{s = 3}^{k^{'} + 1} (s - 1) n 2^{- (s - 2) β + 2} g_{i} f_{j} \leq n 2^{2} g_{i} f_{j}

Finally, for β ≥ 3, we have

{|\sum_{s = 1}^{k^{'}} \sum_{t = 1}^{k^{'} - s + 1} A_{s} B_{t}|}_{i j} \leq 8 n \cdot g_{i} f_{j}

(28)

From (26), (28), and (15), if

2 u \cdot u f p (8 n) \leq 2^{- β (k^{'} + 1) + 2}

(29)

is satisfied, then (23) holds.

Let $k_{\max}^{'}$ be the maximum k′ satisfying (29). No rounding error occurs for $i = 2,3, \dots, k_{\max}^{'}$ . Therefore, we have

\begin{aligned} |\sum_{k = 1}^{k} \sum_{j = 1}^{k - i + 1} A_{i} B_{j} - T_{k}| \\ \leq (\frac{1}{2} k (k + 1) - \frac{1}{2} k_{\max}^{'} (k_{\max}^{'} + 1) - 1) u | A | | B | . \end{aligned}

(30)

Summarizing, from (18) and (30), we have

\begin{aligned} | A B - T_{k} | \\ \leq 4 (k + 1) n 2^{- β k} g f^{T} \\ + (\frac{1}{2} k (k + 1) - \frac{1}{2} k_{\max}^{'} (k_{\max}^{'} + 1) - 1) u | A | | B | . \end{aligned}

If (19) is satisfied, then we have

\begin{aligned} | A B - T_{k} | \\ ≲ (k + 1) 2^{- β k} | A ‖ B | \\ + (\frac{1}{2} k (k + 1) - \frac{1}{2} k_{\max}^{'} (k_{\max}^{'} + 1) - 1) u | A | | B | . \end{aligned}

5.2. Error bound for Algorithms 3 and 6

We focus on

\sum_{i_{2} + j_{2} = s} A_{i_{2}}^{''} B_{j_{2}}^{''}, A_{i_{2}}^{''} \in I_{β + 1}^{m \times n}, B_{j_{2}}^{''} \in I_{β + 1}^{n \times p} .

(31)

We consider r such that the following holds

\sum_{\begin{array}{c} i_{2} + j_{2} = s \\ s \leq \min (r, k + 1) \end{array}} A_{i_{2}}^{''} B_{j_{2}}^{''} \in I_{32}^{m \times p} .

(32)

Let E_m,n be the matrix

I_{2}^{m \times n}

whose elements are all ones. From the definitions in (31),

| A_{i_{2}}^{''} | \leq (2^{β} - 1) E_{m, n}, | B_{i_{2}}^{''} | \leq (2^{β} - 1) E_{n, p},

and we have

\begin{array}{l} |\sum_{i_{2} + j_{2} = s} A_{i_{2}}^{''} B_{j_{2}}^{''}| & \leq \sum_{i_{2} + j_{2} = s} | A_{i_{2}}^{''} | | B_{j_{2}}^{''} | \\ \leq (s - 1) n {(2^{β} - 1)}^{2} E_{m, p} . \end{array}

Since the largest number in

I_{32}

is 2³¹ − 1, if s ≤ r for

r ≔ \max (1, 2^{31 - 2 β - ⌈ \log_{2} n ⌉})

in (12), we have

\begin{array}{l} (s - 1) n {(2^{β} - 1)}^{2} & \leq (r - 1) n {(2^{β} - 1)}^{2} \\ \leq (r - 1) \cdot 2^{⌈ \log_{2} n ⌉} \cdot 2^{2 β} \\ = r \cdot 2^{⌈ \log_{2} n ⌉} \cdot 2^{2 β} - 2^{⌈ \log_{2} n ⌉} \cdot 2^{2 β} \\ \leq 2^{31} - 1, \end{array}

which implies that (32) holds. Note that

(s - 1) n {(2^{β} - 1)}^{2} < 2^{31} - 1

for β ≥ 1. This is why constant r was set as in (12).

If we define the number of the terms, w, for the accumulation as

w ≔ ⌈ \frac{k}{r} ⌉ (k - \frac{r}{2} ⌊\frac{k - 1}{r}⌋),

we have

| A B - T | \leq 4 (k + 1) n 2^{- β k} g f^{T} + (w - 1) u | A | | B | .

If (19) is satisfied, then we have

| A B - T | ≲ (k + 1) 2^{- β k} | A ‖ B | + (w - 1) u | A | | B | .

6. Conclusion

We proposed three implementation methods for accelerating the Ozaki scheme using the INT8 Tensor Core. In the original implementation, ozIMMU, provided by Ootomo et al., the ratio of the accumulation in FP64 is not negligible, as it accounts for approximately 40–50 % of the total computation time. The proposed methods ozIMMU_EF and ozIMMU_H reduce the computation time ratio of the accumulation to approximately 10–20 % and achieve a 1.2- to 1.6-fold speedup. With future architectures expected to achieve high speed in lower-precision matrix multiplication, the computation time ratio of the accumulation in FP64 is expected to increase. Thus, these proposed methods contribute to leveraging the performance of future architectures more effectively than ozIMMU. The proposed methods ozIMMU_RN and ozIMMU_H offer alternative splitting methods using floating-point arithmetic in round-to-nearest-even mode and produce more accurate results than ozIMMU for the same number of slices.

Footnotes

Acknowledgements

We appreciate the reviewers and Dr Massimiliano Fasi from the University of Leeds for the kind comments on our paper. We thank Dr Hiroyuki Ootomo from NVIDIA for his helpful comments on the implementation of ozIMMU.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was partially supported by JSPS Grant-in-Aid for JSPS Fellows No. 22KJ2741, JSPS Grant-in-Aid for Research Activity Start-up No. 24K23874, and JSPS KAKENHI Grant No. 23K28100.

ORCID iD

Yuki Uchino

Author biographies

Yuki Uchino is a postdoctoral researcher at RIKEN R-CCS. He received his PhD in engineering from Shibaura Institute of Technology in 2024. His research interests include reliable computing, numerical linear algebra, and highly accurate algorithms. He won the student poster presentation award at the 38th JSST Annual International Conference on Simulation Technology (JSST 2019) and the student presentation awards at JSST 2021 and International Workshop on Reliable Computing and Computer-Assisted Proofs (ReCAP 2022).

Katsuhisa Ozaki is a full professor in the Department of Mathematical Sciences at Shibaura Institute of Technology. He received his PhD in engineering from Waseda University in 2007. He was an Assistant Professor (2007-2008) and a Visiting Lecturer (2008-2009) at Waseda University. At Shibaura Institute of Technology, he has served as an Assistant Professor (2010-2013) and an Associate Professor (2013-2019), and has currently been a Professor since 2019. His research interests include reliable computing, particularly addressing rounding error problems in finite-precision arithmetic. He mainly focuses on numerical linear algebra and develops fast and accurate algorithms.

Toshiyuki Imamura is a team leader of Large-scale Parallel Numerical Computing Technology Team at RIKEN R-CCS, and is responsible for developing numerical libraries on Fugaku. He received his Diploma and Doctorate in Applied Systems and Sciences from Kyoto University in 1993 and 2000. He was a Researcher at CCSE, JAERI (1996-2003), a visiting scientist at HLRS (2002), and an associate professor at the University of Electro-Communications (2003-2012). His research interests include HPC, auto-tuning technology, and parallel eigenvalue computation. His research group won the HPL-MpX ranking (2020-2021) and was nominated as the Gordon Bell Prize finalist in SC05, SC06, and SC20.

References

Higham

Mary

(2019) A new approach to probabilistic rounding error analysis. SIAM Journal on Scientific Computing 41(5): A2815–A2835.

IEEE Computer Society (2019) IEEE standard for floating-point arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008). NJ, USA: IEEE.

Jeannerod

Rump

(2013) Improved error bounds for inner products in floating-point arithmetic. SIAM Journal on Matrix Analysis and Applications 34(2): 338–344.

Minamihata

Ozaki

Ogita

, et al. (2016) Improved extraction scheme for accurate floating-point summation. In: The 35th JSST Annual Conference International Conference on Simulation Technology, Kyoto, Japan.

Mukunoki

Ozaki

Ogita

, et al. (2020) Dgemm using tensor cores, and its accurate and reproducible versions. In: Sadayappan

Chamberlain

Juckeland

, et al. (eds) High Performance Computing. Cham: Springer International Publishing, 230–248.

NVIDIA Corporation (2024) NVIDIA tensor cores. URL: https://www.nvidia.com/en-us/data-center/tensor-cores/.

Ootomo

(2024) ozIMMU - DGEMM on Int8 tensor Core. URL: https://github.com/enp1s0/ozIMMU.

Ootomo

Ozaki

Yokota

(2024) Dgemm on integer matrix multiplication unit. The International Journal of High Performance Computing Applications, in Press. DOI: 10.1177/10943420241239588.

Ozaki

Ogita

Oishi

, et al. (2012) Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications. Numerical Algorithms 59(1): 95–118.

10.

Ozaki

Ogita

Oishi

, et al. (2013) Generalization of error-free transformation for matrix multiplication and its application. Nonlinear Theory and Its Applications, IEICE 4(1): 2–11.

11.

Uchino

(2024) Accelerator for ozIMMU. R-CCS github repositry. URL: https://github.com/RIKEN-RCCS/accelerator_for_ozIMMU.