Sage Journals: Discover world-class research

Abstract

Transformers, particularly large language models (LLMs), are revolutionizing applications in natural language processing and computer vision but at a high cost in memory, energy, and computational resources. Quantization has emerged as an effective compression method to alleviate these demands, reducing the bitwidth of model data and arithmetic precision to enable efficient inference on resource-constrained devices. This paper focuses on optimizing inference with transformer encoders on low-power general-purpose CPUs, as those often found in edge devices. Our key contributions include exposing the critical role of linear layers within transformer encoders on CPUs with a limited number of cores; developing mixed integer precision matrix multiplication on ARM and RISC-V CPUs; and evaluating performance impact and energy savings of quantized inference. In summary, this work highlights the advantages of applying quantization to transformer encoders on current single-core and multi-core low power CPUs, offering insights for efficient LLM deployment on edge platforms.

Keywords

Deep learning transformer encoders inference quantization matrix multiplication multicore CPUs

1. Introduction

Large language models (LLMs) are driving dramatic advancements across a wide range of scientific and industrial applications, impacting many tasks related to natural language processing (NLP) as well as computer vision. Unfortunately, the underlying transformer technology Vaswani et al. (2017) significantly increases the memory usage, energy consumption, and computational requirements for both training and inference Zhao et al. (2023); Shankar and Reuther (2022), showcasing the critical balance between innovation and the rising resource requirements.

In response to this scenario, quantization is a compression technique designed to shrink the cost of deep learning (DL) models in general and LLMs in particular, enabling inference on resource-constrained platforms such as smartphones and edge devices Hubara et al. (2017). To achieve this goal, quantization reduces the bitwidth of the model’s data, specifically weights and/or activations, as well as the precision of the arithmetic, primarily multiplications and additions.

A transformer can function as an encoder only (e.g., BERT), a decoder only (e.g., GPT-2) or both, depending on the target application. In this paper, we focus on transformer encoders due to their key role in text classification, question answering, information retrieval, speech recognition, document classification and language translation, among others. Furthermore, while most efforts on optimizing LLMs are directed toward training these models on graphics processing units (GPUs) and large-scale facilities, our paper targets transformer encoder inference on a variety of low power general-purpose processors, or CPUs, as those that are common on edge scenarios. In doing so, we make the following contributions:

• We expose the critical role of the linear layers on the overall performance of the encoder block when executed on a low-power CPU. While some other components are also important when the encoder block is run on a many-core CPU, we argue that, at least for CPUs equipped with a small number of cores, their contribution varies between negligible and low. Moreover, the cost of these other components can be significantly reduced by employing inexpensive yet accurate approximations (in the case of the Softmax transform) Kim et al. (2021); and floating point instances of GEMM tuned for small, non-square operands Yang et al. (2021); Heinecke et al. (2016); Frison et al. (2018); Van Zee and Van de Geijn (2015).

• We evaluate the impact of exploiting mixed precision kernels, with integer arithmetic, for the linear layers of the encoder block on state-of-the-art CPUs with ARM and RISC-V architectures. For all these platforms, we develop and tune our own version of the general matrix multiplication (GEMM), initially following GotoBLAS2 Goto and Van de Geijn (2008a), adapted to operate with mixed integer precision.

• For systems equipped with power counters, specifically the ARM-based designs in the NVIDIA Jetson boards, we also report the energy benefits that can be obtained from quantization.

Overall, our work provides a comprehensive analysis of the performance and energy gains that can be obtained from applying quantization to transformer encoders on a representative number of single-core and multi-core low-power CPUs.

The rest of the paper is structured as follows. In Section Related Work we offer a brief review of related papers. In Section Characterization of the Transformer Encoder we conduct an initial analysis of the theoretical and experimental costs of a transformer encoder on an ARM-based CPU. In Section GEMM in linear algebra libraries for scientific computing we revisit the conventional design of GEMM, operating with floating point operands, for multicore processors with single instruction-multiple data (SIMD) units, a multi-layered memory hierarchy, and multiple cores. In Section Quantized GEMM we describe the changes that are needed to accommodate mixed integer precision (MIP) arithmetic into the standard formulation of GEMM, and evaluate the practical impact. Finally, in Section Concluding remarks we close the paper with a summary and a list of conclusions.

2. Related work

Much of the research on optimizing transformer encoders has focused on training these models using hardware accelerators. In contrast, the efforts discussed in this section specifically relate to the optimization of transformers on CPUs. This scenario is common in embedded systems, where the use of accelerators may be impractical due to integration challenges, or constraints related to cost, silicon area, or power consumption.

In Dice and Kogan (2021), the authors identify three main inefficiencies in executing oneDNN on Intel Skylake processors: matrix multiplications with transposed operands, dispatcher overhead, and suboptimal partitioning of matrix operands. The work in Kim et al. (2023) explores encoder inference, examining aspects such as model-to-hardware mapping and operation scheduling. The authors of Chitty-Venkata et al. (2023) survey various high-level software optimization techniques, including pruning, knowledge distillation, and quantization, alongside low-level hardware optimizations on application-specific circuits. The study in Jiang et al. (2023) focuses on optimizing encoder inference on ARM many-core CPUs with multiple NUMA (Non-Uniform Memory Access) domains, proposing the skipping of packing procedures for small matrix operands and the use of a specialized micro-kernel for such cases at the core level. In Martínez et al. (2024), we investigate the optimization of encoders on two low power CPUs, with ARM and RISC-V architecture. Lastly, in Wu et al. (2024) the authors present a library optimized for a wide range of ARM processors: from edge devices to HPC-grade CPUs.

All the previous works consider only the standard 32-bit floating point data/arithmetic and, therefore, miss the challenges and opportunities that quantized inference brings in combination with transformer encoders. We close this review by noting that our contributions are orthogonal and/or complementary to other model-related optimizations such as pruning, distillation, etc.

3. Characterization of the transformer encoder

In this section, we expose the anatomy of the transformer encoder, detailing the arithmetic cost of its main components together with an experimental analysis of the model’s inference efficiency.

3.1. Anatomy of the transformer encoder

A transformer encoder typically consists of an input embedding layer, multiple intermediate encoder blocks (ranging from a couple to several dozens), and an output classification layer. As the initial and final components contribute minimally to the global computational cost, we next focus on the intermediate blocks.

Internally, each encoder block is composed of a multi-head attention (MHA) module followed by a feed-forward network (FFN) module. For a batch of b input sequences, each containing l tokens (e.g., b sentences of l words each) and with a token represented as a vector of d embeddings, the main operations within the encoder block are depicted in Figure 1. The MHA module there comprises h attention heads, and the hidden dimension in the FFN module is denoted as f, which is typically set to f = 4d. Table 1 presents the parameters for four instances of the BERT family of models Devlin et al. (2019).

Figure 1.

Architecture of the MHA (left) and FFN (right) modules in an encoder block with h heads and operating on b input sequences. The symbols “⊗” and “⨁” respectively denote matrix multiplication and summation. The notation M1–M10, F11–F14 correspond to the operations in Table 2. Note that the multiplications in M5 and M7 as well as the Softmax function in M6 are batched, with one of these operations per head and input sequence.

Table 1.

Dimensions of BERT models.

Parameter	Small	Base	Large
#Layers	4	12	24
d	512	768	1024
h	8	12	16
f	2048	3072	4096

Consider the GEMMC + = A B, where C is of size m × n, A is m × k, and B is k × n. The MHA module requires a total of (4 + 2hb) matrix multiplications. Among these, M1, M2, M3, M9 respectively involve the “weight” matrices W_Q, W_K, W_V, W_O, and share the same dimensions: (m, n, k) = (d, lb, d). The multiplications in M5 present the dimensions (m, n, k) = (l, l, d/h), while those in M7 present the dimensions (m, n, k) = (d/h, l, l). These two latter operations can be interpreted as two batched matrix multiplications (BGEMM), each consisting of hb independent multiplications. The FFN module contributes two additional matrix multiplications: F11 with weight matrix W₁ and dimensions (m, n, k) = (f, lb, d); and F13 with weight matrix W₂ and (m, n, k) = (d, lb, f).

Table 2 summarizes the main operations in the encoder block, specifying the dimensions of the GEMM kernels. From a theoretical point of view, the cost of all kernels scales linearly with the number of input sequences b; the cost of the GEMM operations in M1–M3, M9, F11, and F13 scales linearly with the sequence length l; and, for M5 and M7, atically with l.

Table 2.

Operations in the MHA and FNN modules.

			m	n	k
MHA	M1-M3	(Q, K, V) = (W_Q, W_K, W_V) ⋅ E_I	d	lb	d
	M4	Split (Q, K, V) →
		${(Q^{i, j}, K^{i, j}, V^{i, j})}_{i = 1 : h}^{j = 1 : b}$
		for j = 1 : b
		for i = 1 : h
	M5	${\bar{E}}_{1}^{i, j} = ({(K^{i, j})}^{T} \cdot Q^{i, j}) / \sqrt{d_{k}}$	l	l	d/h
	M6	$E_{1}^{i, j} =$ Softmax $({\bar{E}}_{1}^{i, j})$
	M7	$E_{2}^{i, j} = V^{i, j} \cdot E_{1}^{i, j}$	d/h	l	l
	M8	Concatenate ${(E_{2}^{i, j})}_{i = 1 : h}^{j = 1 : b} \to E_{2}$
	M9	${\bar{A}}_{O} = W_{O} \cdot E_{2}$	d	lb	d
	M10	A_O = Lnorm $({\bar{A}}_{O} + E_{I})$
FFN	F11	${\bar{E}}_{3} = W_{1} \cdot A_{O}$	f	lb	d
	F12	E₃ = GELU $({\bar{E}}_{3})$
	F13	${\bar{E}}_{O} = W_{2} \cdot E_{3}$	d	lb	f
	F14	E_O = Lnorm $({\bar{E}}_{O} + A_{O})$

3.2. Inference with FP32 data/arithmetic

We next characterize the inference costs of a transformer encoder, using the large instance of BERT (BERT _L ) as a representative LLM. We note that the small, base and large variants of BERT share the same dimension parameters of many other models, such as T5 Small, ROBERTA, DISTILBERT, ALBERT Base, ELECTRA Base, DEBERTa Base, FNET Base, LONGFORMER Base, BIGBIRD Base, CONVBERT Base, VIT Base, ROBERTA Large, ALBERT Large, ELECTRA Large, DEBERTA Large, LONGFORMER Large, BIGBIRD Large, CONVBERT-Large, VIT Large, DEBERTA xLarge, etc. Therefore, we can expect that the performance differences due to quantization reported in this work for BERT small/base/large will carry over to these other real-world models.

For this particular experiment, the model weights and input/output activations are stored as IEEE single precision floating point (FP32) numbers, and all arithmetic is performed in FP32. For brevity, we only employ the NVIDIA Jetson AGX Xavier board, a system that is equipped with an 8-core NVIDIA Carmel processor and 32 GB of LPDDR4x (low power DDR4x) RAM.

3.2.1. Experimental setup

Table 3 offers a few key details of the NVIDIA Jetson AGX Xavier as well as the other boards tested in our experimental study. Given our focus on inference with CPUs, in our experiments we do not utilize the GPU that is present in some of these boards and, in particular, on the NVIDIA Jetson AGX Xavier.

Table 3.

Target systems evaluated in the experimental study.

Board	Processor	Freq	#Cores	Instruction set	RAM
Board	Processor	(GHz)	#Cores	Architecture (ISA)	(GB)
NVIDIA Jetson AGX Xavier	NVIDIA Carmel	2.26	8	ARM NEON + ARMv8.2-A	32
NVIDIA Jetson AGX Orin	ARM Cortex-A78AE	2.20	12	ARM NEON + ARMv8.2-A	64
NVIDIA Jetson Nano	ARM Cortex-A57	1.43	4	ARM NEON + ARMv8.0-A	4
Raspeberry Pi	ARM Cortex-A72	1.80	4	ARM NEON + ARMv8.0-A	8
AMD Kria KV260	ARM Cortex-A53	1.30	4	ARM NEON + ARVv8.0-A	4
Sipeed LicheeRv	XuanTie C906	1.00	1	RISC-V + RVV 0.7.1	0.5
Sipeed LicheePi 4a	XuanTie C910	1.85	4	RISC-V + RVV 0.7.1	4
Banana Pi F3	K1	1.60	8	RISC-V + RVV 1.0	4
Kendryte K230	XuanTie C908	1.60	1	RISC-V + RVV 1.0	0.5

In all our experiments in this paper, the processor frequency is fixed; we execute the transformer using all cores of the target platform, with one thread per core; and the execution of each test is repeated for at least 5 seconds, with the reported results corresponding to average values.

The NVIDIA Jetson platforms offer several INA3221 on-board power sensors, accessible via I²C, that can be monitored as sysfs file system nodes Instruments (2016). These sensors measure power, voltage, and current rails. Table 4 shows which counters were used for the energy characterization.

Table 4.

Measured CPU-related power rails for each system platform.

Board	Measured power rails	Description
NVIDIA Jetson AGX Xavier	+ CPU	CPU power rail
NVIDIA Jetson AGX Xavier	+ VDDRQ	DDR memory power rail
NVIDIA Jetson AGX Orin	+ VDD_CPU_CV	CPU and CV combined power rail
NVIDIA Jetson AGX Orin	+ VDDQ_VDD2_1V8AO	DDR and 1V8AO combined power rail
NVIDIA Jetson Nano	+ POM_5V_IN	System 5V power rail
NVIDIA Jetson Nano	− POM_5V_GPU	GPU power rail (subtracted from POM_5V_IN)

The peak performance for the NVIDIA Carmel processor with FP32 arithmetic, when operating all eight cores at 2.3 GHz, is 2.3 GHz × 16 FP32 floating point operations (flops) per cycle and core × 8 cores = 294.4 FP32 GFLOPS (billions of flops per second).

3.2.2. Observations

Figure 2 reports the time, energy consumption and distribution of time for the execution of a single encoder block of BERT _L on the NVIDIA Carmel processor. The global costs for BERT _L can be obtained by multiplying those values by the number of layers (24), while the distribution of time does not vary. We note here that our focus is on edge-oriented scenarios, where latency is critical. Therefore our interest is in small values of b. However, in some cases we also experiment with large b, typical of LLM servers, to illustrate the effect of increasing this parameter. In this particular case, we offer results for several combinations of b ∈ {1, 4, 32, 128} and l ∈ {128, 384}.

Figure 2.

Characterization of the execution time (and energy consumption) for an encoder block of BERT _L , using FP32 data/arithmetic and all eight cores in the NVIDIA Carmel processor. The numbers on top of each bar display the total execution time and energy consumption (left and right, respectively in seconds and Joules) per encoder block.

The results in Figure 2 offer the following observations:

• In general, about 75%–80% of the time is due to the linear layers (GEMM) while the remaining 20%–25% comes from the attention mechanism, consisting of BGEMM and Softmax. The contribution of other components is negligible.

• The theoretical arithmetic costs in Table 2 estimate that the execution time should be linear on b. In practice, this is the case only for large b: From 1.80 s for b = 32 to 7.68 s for b = 128. For the practical cases of interest, with small b, the increase is sublinear. This shows the benefits, when b is small, of aggregating multiple samples (i.e., sentences) into a single batch, to be processed concurrently. When applicable, such strategy (1) reduces the cost of the memory accesses; and (2) increases the workload, helping to expose more parallelism.

• In theory, the cost for GEMM grows linearly with l while, for BGEMM, this growth is quadratic on l. This explains the rapid increase in the figure of the bars for M5 + M6 and M7 from l = 128 to l = 384. For example, with b fixed to 4, that increase of l yields an augmentation of the cost from 10.7% to 16.2% for M5 + M6; and from 5.6% to 7.8% for M7.

• The difference between the bars for M5 + M6 and those for M7 is due to the inclusion of Softmax in the former. We note that there exist a variety of schemes to reduce the cost of this computation by approximating the exponential with a low-degree polynomial Kim et al. (2021).

• Considering the execution time, depending on the value of l, b the cost varies between 0.40 ms/token and 0.15 ms/token, while the total energy consumption varies between 1.13 mJ/token and 2.31 mJ/token.

• Our parallel implementation of the transformer encoder evaluated in this experiment attains around 76% of the peak for GEMM but only about 21% for BGEMM. The parallelization details and some optimizations performed on the encoder block are discussed later in the paper.

In addition, we can also comment on the results that can be expected on other platforms:

• When moving to a platform with a larger number of cores (e.g., the ARM Cortex-A78AE in the NVIDIA Jetson AGX Orin), the contribution of BGEMM will be more prominent due to the lower degree of parallelism of this kernel. In contrast, on all other platforms, which feature with a smaller number of cores, BGEMM will become more important.

• In case that adapting the linear layers to operate with MIP results in a decrease of the cost for this type of operations, BGEMM will again increase its contribution.

4. GEMM in linear algebra libraries for scientific computing

In this section we revisit the conventional, high performance implementation of GEMM for multicore CPUs, exposing the impact of a critical component, known as the micro-kernel, on its practical efficiency.

4.1. Optimizing GEMM on multicore CPUs

High-performance instances of GEMM in linear algebra libraries for scientific computing follow the design of GotoBLAS2 Goto and Van de Geijn (2008b) to target CPUs with SIMD units, deep memory hierarchies, and multiple cores. The reference algorithm for GotoBLAS2, in Figure 3 (left), implements GEMM as five nested loops (labeled as L1 to L5); two packing routines that copy certain blocks of A, B into the buffers A_c, B_c, respectively of dimensions m_c × k_c, k_c × n_c (lines 4,7 in the bottom-right figure); and a micro-kernel with an additional loop (labeled as L6 in the top-right algorithm in the same figure).

Figure 3.

Reference algorithm for GEMM in GotoBLAS2. Left: blocked algorithm; Top-right: micro-kernel; Bottom-Right: Packing of input operands.

Optimizing GEMM for a current multicore CPU involves three critical tasks:

• Design a micro-kernel that exploits the SIMD arithmetic units of the target architecture.

• Select the loop strides m_c, n_c, k_c for the reference algorithm in order to ensure an efficient use of the layered memory hierarchy Goto and Van de Geijn (2008b); Low et al. (2016).

• Distribute the iteration space of the loops to expose enough parallelism for all the processor cores; achieve a balanced workload distribution; and favor an efficient use of the shared/private cache levels Van Zee and Van de Geijn (2016); Smith et al. (2014).

In the next subsection we focus on the first task, that is, the efficient implementation of the micro-kernel, as this component needs to be carefully adapted to transform the standard, floating point-oriented GEMM employed in scientific computing into a specialized version that can be leveraged for quantized inference with encoders on low power CPUs. At this point, we note that the implementation of GEMM in GotoBLAS2, AMD AOCL, IBM ESSL, BLIS and OpenBLAS, among others, only operate on floating point matrices.

4.2. High performance micro-kernel for GEMM

The generic micro-kernel in Figure 3 (top-right) updates an m_r × n_r micro-tile of C with the product of an m_r × k_c micro-panel of A and a k_c × n_r micro-panel of B: say C_r + = A_r B_r. For a micro-kernel of fixed dimensions m_r × n_r, provided the loop strides m_c, n_c, k_c are chosen with care, during the execution of the micro-kernel A_r resides in the L2 cache and B_r in the L1 cache Goto and Van de Geijn (2008a); Van Zee and Van de Geijn (2015). Prior to the micro-kernel loop L6, the elements of C_r are loaded (in principle, from the main memory) into the vector registers and, after the update, the contents of the micro-tile are stored back into the main memory.

Attaining high performance with a practical realization of the micro-kernel requires taking into consideration the following aspects:

• Cast all the arithmetic (for the update of C_r) as well as most/all the data movements (for loading A_r, B_r and loading/storing C_r) as vector instructions that efficiently utilize the SIMD units. Figure 4 illustrates this for a simple m_r × n_r = 4 × 4 micro-kernel that employs ARM NEON v8.2 vector intrinsics. All the elements of A_r, B_r, C_r are transferred between memory and the processor vector registers using the following two vector intrinsics:

float32x4_t vldlq_f32(const float32_t*ptr)

to load four FP32 numbers (128 bits) from memory into a vector register; and

void vstlq_f32(const float32_t*ptr,

float32x4_t v0)

for the storing counterpart. The micro-tile C_r is updated with the vector fused multiply-add (FMA):

float32x4_t vfmaq_laneq_f32(float32x4_t v0,

float32x4_t v1,

float32x4_t v2,

const int lane)

that returns a vector of four FP32 numbers (vr), computed as

vr[i] = v0[i] + v1[i]*v2[lane], i=0,1,2,3.

Figure 4.

AXPY-oriented micro-kernel for ARM NEON that operates with a 4 × 4 micro-tile C_r. All input matrices and arithmetic is FP32.

In the following, we will refer to this type of design as an AXPY ( α times x plus y )-oriented micro-kernel.

• Increase the arithmetic intensity (AI) of the micro-kernel by setting m_r ≈ n_r as large as possible without incurring into register spilling. Specifically, we note that the micro-kernel performs 2m_rn_rk_c arithmetic operations while loading/storing m_rn_r elements from main memory for C_r; loading m_rk_c elements from L2 cache for A_r; and loading k_cn_r elements from L1 cache for B_r. In consequence, the arithmetic intensity in arithmetic operations per element is AI = 2m_rn_rk_c/(2m_rn_r + m_rk_c + k_cn_r). For example, for large k_c and FP32, AI = 1.0, 1.6 and 2.4 floating point operations (or flops)/byte for m_r × n_r = 4 × 4, 16 × 4 and 12 × 8, respectively. A 12 × 12 micro-kernel offers = 3.0 flops/byte; however it requires 36 vector register for storing C_r only. Therefore, it incurs into register spilling during its execution for all ARM and RISC-V platforms considered in this work.

• Avoid write-after-write (WAW) hazards due to the updates of C_r in consecutive iterations of the micro-kernel loop L6. This is attained by choosing large values for m_r, n_r.

• In some cases encode the micro-kernel directly in assembly instead of vector intrinsics in order to avoid compiler decisions leading to suboptimal performance. For example, this is the case of the NVIDIA Jetson AGX Orin when targeting FP16 arithmetic.

• There are some additional advanced optimizations that can contribute to further increase the performance of the micro-kernel. Concretely, some micro-kernels benefit from including instructions for software prefetching, modifying the code to exploit software pipelining and/or loop unrolling, etc. Dowd and Severance (1998).

Figure 5 shows the implications of the micro-kernel dimensions m_r × n_r on the practical performance of GEMM, using the cache-aware roofline model Ilic et al. (2014). Remember that, by choosing the appropriate values for m_c, n_c, k_c, because of the loop ordering in the reference GEMM algorithm, the micro-kernel fetches the data from different memory levels: L1 cache for B_r, L2 cache for A_r, and main memory for C_r. Therefore, the cost of the data transfers is a mixture of the cache/memory bandwidths. The plot shows that the micro-kernels with the highest AI (m_r × n_r = 8 × 12 and 12 × 8) basically match the FP32 peak performance of the NVIDIA Carmel core, and the same occurs for the 20 × 4 and 16 × 4 instances. We can conclude that these micro-kernels are compute-bound on this CPU. In contrast, the smaller micro-kernels, particularly 4 × 4 and 8 × 4, are cache-bound. We note that, in some cases two micro-kernels with the same AI, such as 20 × 4 and 4 × 20, show visible differences in the GFLOPS rate. This is due to the distinct type of accesses to the elements of A_r and B_r loaded in the processor vector registers with the vector instructions; see, for example, lines 31–34 in Figure 4. Each vector intrinsic vfmaq_laneq_f32() there accesses the full vector register ar but a single element (lane) of the respective vector register br. The translation of this type of intrinsic/access into assembly results in a sequence of instructions that provide higher performance for those micro-kernels that maximize the number of elements accessed from ar compared with br.

Figure 5.

Cache-aware roofline for GEMM on a single core of the NVIDIA Carmel processor, for different micro-kernels. The points correspond to the dimensions m_r × n_r while the lines specify the FP32 theoretical peak performance of this processor core (36.16 GFLOPS for FP32 data/arithmetic) and the experimental bandwidth of the different memory levels: L1, L2 and L3 caches, and Main Memory.

4.3. From ARM to RISC-V GEMM micro-kernels

Like many other ISA, RISC-V also offers a set of intrinsic functions to ease the use of vector operations without recurring to low-level, assembly code. We next revisit the simple 4 × 4 example in Figure 4 for ARM NEON to discuss its adaptation to RISC-V.

The operation of loading the elements of A_r and those of loading/storing the entries of C_r are done via the vector intrinsics:

vfloat32m1_t __riscv_vle32_v_f32m1(

const float32_t *ptr, size_t vl)

void __riscv_vse32_v_f32m1(

float32_t *ptr, vfloat32m1_t v0,

size_t vl)

The first one loads vl values starting at address ptr to the destination register. The second one stores the vl elements in v0 to the memory address starting at ptr. The parameter vl (for vector length) specifies the number of elements that fit into a vector register, and its value is determined by the micro-architecture and the data type. For example, in a processor with 128-bit vector registers, for FP32 vl = 128/32 = 4.

Loading B_r is done element by element, via:

vfloat32m1_t __riscv_vfmv_v_f_f32m1(

const float32_t s, size_t vl)

The intrinsic broadcasts (i.e., replicates) the scalar s into the first vl entries of the destination vector register.

Finally, the update is done via the RISC-V counterpart of ARM’s vfmaq_laneq_f32:

vfloat32m1_t __riscv_vfmacc_vv_f32m1(

vfloat32m1_t v0, vfloat32m1_t v1,

vfloat32m1_t v2, size_t vl)

which performs an element-wise vector fused-multiply-add combining the three input vector registers v0, v1, v2, to produce the output vector register vr.

5. Quantized GEMM

As mentioned earlier, most high performance linear algebra libraries are oriented toward solving scientific computing applications and, therefore, only implement GEMM for the standard FP32 and FP64 matrix operands. Given the relevant contribution of GEMM to the general cost of the encoder, in this section we discuss how to develop and optimize quantized realizations of this computational kernel that exploit the hardware underlying the target architecture.

5.1. Quantization and GEMM

We recognize that there exist many strategies to apply quantization to GEMM: Uniform/non-uniform, symmetric/unsymmetric scaling, dynamic/static, weight-activation/weight-only, etc. Furthermore, the activation and weight matrices can be quantized using different strategies or even distinct precisions. For brevity, we restrict the study to uniform, symmetric, static, weigth-activation quantization.

Symmetric quantization maps a real (i.e., floating point) value x ∈ [α, β] to an integer $x^{q} = Q (x, s) = ⌊x / s⌉ \in [α^{q}, β^{q}]$ , with the scaling factor s = (β − α)/(β^q − α^q) where, in the two’s complement representation of a signed integer using b bits, [α^q, β^q] = [−2^b−1, + 2^b−1 − 1].

Consider next the GEMMC = AB, where C, A, and B are m × n, m × k, and k × n floating point matrices, respectively; and let us denote the quantized input matrices by $A^{q} = Q (A, s), B^{q} = Q (B, s)$ . The result of the GEMM can be then approximated as C^r = s²A^qB^q ≈ C, where the bulk of the computation lies in the GEMMA^qB^q, which only requires integer arithmetic. In contrast, the scaling with s², involving floating point arithmetic, only contributes a minor cost.

Following the general trend for transformers, we apply quantization only to the linear layers (i.e., GEMM operations) in the encoder block but not to the BGEMM kernels in the attention mechanism Dettmers et al. (2024). Furthermore, we will consider a matrix multiplication that receives two quantized input matrices A^q, B^q, consisting of 8-bit signed integer numbers (INT8), and produces a result C^q = A^qB^q with 32-bit signed integer numbers (INT32). (For example, for the GEMMM1 in the MHA module of the encoder block, A^q, B^q correspond to the weight and activations matrices W_q, E_I.) The accumulation into a result with higher bitwidth is necessary to avoid overflow. In this sense, using INT32 for C^q is sufficient to avoid this problem for LLMs with less than 6.7 B parameters Dettmers et al. (2024). LLMs with more parameters can apply specialized quantization schemes that basically reduce the cost of the quantized GEMM to one operating with INT8 numbers Dettmers et al. (2024); Xiao et al. (2023). In the following, we will refer to this implementation of GEMM as Q_INT8_INT32.

5.2. Reference implementations

For reference, in the subsequent experiments we include the following optimized implementations of GEMM:

• FP32: Our 32-bit floating point, AXPY-oriented version of GEMM that mimics the ideas underlying the high performance implementation of matrix multiplication in GotoBLAS2 (see Section GEMM in linear algebra libraries for scientific computing), with the cache configuration parameters (CCPs) m_c, n_c, k_c, the micro-kernel, and the parallel loop tailored specifically for each platform. In a previous work Martínez et al. (2024) we demonstrated that, for the NVIDIA Jetson boards, our FP32 implementation of GEMM was competitive with or even outperformed those in BLIS, OpenBLAS and ARM PL.

• FP16: An analogous AXPY-oriented GEMM that operates with 16-bit floating point (FP16) input matrices A, B and produces an FP16 result C, using FP16 arithmetic. In order to obtain this version, we replaced the vector intrinsics for FP32 with the equivalent ones that operate with FP16 data. For example, in the ARM version in Figure 4 we substitute vldlq_f32(), vstlq_f32(), vfmaq_laneq_f32() with vldlq_f16(), vstlq_f16(), vfmaq_laneq_f16(). In practice, reducing the precision allows us to employ larger micro-kernels, because the 128-bit vector registers of the NVIDIA Carmel can now hold eight FP16 elements instead of four FP32 numbers. Furthermore, the CCPs m_c, n_c, k_c have to be adjusted to take into account the dimensions of the FP16 buffers A_c, B_c that reside in the different layers of the memory hierarchy.

• FQ_INT8_FP32: This variant also follows the general AXPY-oriented design of the GotoBLAS2 algorithm for GEMM. However, it receives two input matrices A^q, B^q with INT8 numbers and, when packing the corresponding blocks of these matrices into the buffers A_c, B_c (see lines 4,7 of the reference algorithm in Figure 3, left), transforms their entries from INT8 to FP32. All the subsequent operations are then conducted in FP32 arithmetic and the result is an FP32 matrix. We refer to this option as “fake” quantization (FQ) because the input operands are stored as INT8 numbers but the arithmetic is FP32. This variant presents several advantages: (1) It can leverage a standard, optimized micro-kernel for FP32 data; (2) the input matrices are stored as INT8 numbers, reducing the memory requirements with respect to a plain FP32 implementation; and (3) it diminishes the pressure on the memory bandwidth, as part of the transfers involve only INT8 numbers.

Figure 6 graphically displays the differences between our alternative schemes:

• Floating point: FP32 (left) for the standard floating point solution. The same scheme applies to FP16.

• Fake quantization: FQ_INT8_FP32 (middle) transforms the INT8 matrices to FP32 data before the standard FP GEMM. We note that in our actual implementation this type casting is performed on-the-fly, within the GEMM, when packing the data for the input operands into the buffers A_c, B_c.

• Quantization: Q_INT8_INT32 (left) computes the matrix product using a MIP GEMM (MPGEMM).

Figure 6.

Differences between operand data types/arithmetic for alternative schemes, illustrated via the operations F11 and F12 of an encoder block.

5.3. High level overview

In addition to the tiling loops (L1–L5), the previous discussion of the conventional, high performance realization of GEMM on a current multicore CPU identified three pivotal elements: the two packing routines, which copy and re-arrange certain block elements of A, B into the corresponding buffers A_c, B_c; and the micro-kernel, which is responsible for all the arithmetic. In a SIMD processor, the arithmetic is vectorized using the appropriate vector instructions for the target architecture. For floating point data, this leads to an AXPY-based micro-kernel that performs successive updates of a micro-tile C_r via vector FMA instructions.

In the following subsections we describe how to adapt this formulation of GEMM depending on the support for MIP vector instructions in the target architecture. In particular, we cluster our low-power CPUs into the following four groups:

(1) RVV 0.7.1 (XuanTie C906, C910): These platforms provide no support for MIP arithmetic so we rely on fake quantization by transforming the INT8 data in A, B to INT32 precision as part of the packing routines. The arithmetic is then performed in INT32 precision, using a slightly modified variant of the conventional FP32 micro-kernel that decomposes each update round of the micro-tile C_r (one per iteration of loop L6) into a collection of vector FMA instructions.

The remaining three groups in the list maintain the input matrices A, B as INT8 numbers, pack them into buffers A_c, B_c with INT8 entries and, when necessary, transform these data into extended precision inside the micro-kernel, prior to the arithmetic.

(2) ARMV8.2-A and RVV 1.0 (NVIDIA Carmel, XuanTie C908, K1): This option loads the INT8 elements from A_c, B_c into vector registers to inmediately cast them as INT16 numbers using the same vector registers. A special vector FMA that combines the INT16 elements of these input vector registers is then used to update the contents of the INT32 micro-tile C_r.

(3) ARMV8.2-A with fast MIP DOT products (ARM Cortex-A78AE): This architecture provides a highly efficient support for MIP but enforces us to replace the vector FMAs with a vector DOT product. In addition, this change asks for a reformulation of the packing routines that enables the use of vector instructions in order to load of the elements of A_c, B_c into vector registers as well as a direct utilization of the MIP vector DOT products.

(4) ARMv8.0-A (ARM Cortex-A57, A72, A53): This architecture involves a sophisticated MIP micro-kernel that combines vector multiplications of INT8 input elements producing INT16 results; vector FMAs of INT16 elements with the result accumulated into an INT32 output vector; and vector reductions.

In the following the provide a detailed description of the vector intrinsics utilized to implement the micro-kernels for each target architecture. A reader that is only interested in the performance impact may want to skip these details and refer directly to the performance results in Figures 7, 8 and 9, Table 5, and the accompanying comments.

Table 5.

Performance (in tokens per second) for the different architectures using three instances of the BERT family and two values of the sequence length l. The batch size is fixed in all cases to b = 4. The results in the top table are absolute values while in the bottom table they are normalized (i.e., divided) by the respective processor frequency and number of cores.

5.4. Optimization of MIP GEMM for ARM NEON v8-A

The ARMv8-A architecture has several variants and extensions that cater to different application domains, ranging from general-purpose computing to specialized workloads arising from real-time processing and security. The ARMv8.0-A, released in 2011, was the initial initial version of the ARMv8-A architecture. Among its main features, it provided support for AArch64 (64-bit execution mode) and AArch32 (32-bit execution mode); introduced NEON and FPU (Floating Point Unit) as optional features; and supported Advanced SIMD (NEON) instructions. In 2016, ARM released ARMv8.2-A, providing FP16 support in the NEON and FPU units; and offering the optional scalable Vector Extension (SVE), for 128-bit to 2048-bit, which targeted high-performance computing (HPC) applications.

In the remainder of this subsection, we describe three types of micro-kernels specifically tailored for the ARM-based designs in the NVIDIA Jetson AGX Xavier, NVIDIA Jetson AGX Orin, NVIDIA Jetson Nano, Raspberry Pi, and AMD Kria KV260. The first two implement the ARMv8.2-A ISA while the remaining three adhere to ARMv8.0-A.1) NVIDIA Carmel (NVIDIA Jetson AGX Xavier). For this ARMv8.2-A processor, we designed a MIP micro-kernel that follows the basic principles of the AXPY-oriented FP32 realization in Figure 4. To achieve this, we replaced the vfmaq_laneq_f32 there with the MIP vector intrinsic:

int32x4_t vmlal_lane_s16(int32x4_t v0,

int16x4_t v1,

int16x4_t v2,

const int lane)

This operation receives an input vector with four INT32 numbers (v0) and two input vectors with four INT16 numbers each (v1,v2), returning an output vector of four INT32 numbers (vr) computed as follows:

vr[i] = v0[i] + v1[i] * v2[lane], i=0,1,2,3.

Using this vector intrinsic in an AXPY-oriented micro-kernel implies a prior transformation of the data in the micro-panels A_r, B_r, from INT8 to INT16. In our case, we can attain this via the vector intrinsic:

int16x8_t vmovl_s8(int8x8_t v0)

This operation receives an input vector consisting of eight INT8 numbers (v0) and returns an output vector with the same amount of INT16 numbers. Prior to the MIP (INT16+INT32) arithmetic, we next unfold the eight elements in each INT16 vector into two vectors with four INT16 numbers each, using the vector intrinsics:

int16x4_t vget_low_s16 (int16x8_t v0)

int16x4_t vget_high_s16(int16x8_t v0)

The MIP, AXPY-oriented micro-kernel for the NVIDIA Carmel that results is illustrated in Figure 7 for m_r × n_r = 8 × 8. Lines 33–38 load the INT8 elements of A_r, B_r, expanding them into INT16 values together with the unfolding. The arithmetic update of C_r is in lines 41–48.

Figure 7.

AXPY-oriented micro-kernel that operates with a 8 × 8 micro-tile C_r. Matrices A, B contain INT8 numbers, C contains INT32 values, and the arithmetic is MIP INT16-INT32.

Let us now discuss the theoretical advantages of reducing the arithmetic precision of the micro-kernel. Consider, for example, the case with m_r × n_r = 12 × 8. On the one hand, for FP32, this micro-kernel yields an AI = 2.4 FP32 flops/byte so that, by moving all data types and arithmetic to FP16, this ratio is doubled to AI = 4.8 FP16 flops/byte. However, the peak performance per core for the NVIDIA Carmel (at 2.26 GHz) is also doubled: From 36.18 FP32 GFLOPS to 72.36 FP16 GFLOPS. On the other hand, reducing the data type bitwidth allows to design larger non-spilling micro-kernels, because a vector register can now contain twice as many FP16 numbers compared to FP32. As a consequence, the number of micro-kernels that remain in the compute-bound part of the cache-aware roofline model becomes larger. For example, a micro-kernel of dimensions m_r × n_r = 24 × 8 that operates with FP16 requires 29 vector registers (the same as its 12 × 8 FP32 counterpart), but yields an AI = 6 FP16 flops/byte (instead of 2.4 FP32 flops/byte for FP32).

Consider next the MIP micro-kernel, which transfers all data for A_r, B_r as INT8 numbers, further reducing the bitwidth of the data. In this case we can double the m_r dimension of the micro-kernel, without exhausting the capacity of the vector register file, to design for example an 48 × 8 micro-kernel, with an AI = 13.71 INT32 operations/byte. The number of MIP micro-kernels that remain compute-bound is thus very large. Unfortunately, the NVIDIA Carmel processor features a deficient hardware support for integer arithmetic. In particular, this processor can issue only one INT32 arithmetic instruction per cycle and core compared to two FP32 arithmetic instructions per cycle and core. As a result, the peak arithmetic rate for INT32 is only 18.08 INT32 OPS. The previous discussion exposes that we should not expect any significant performance gain when using the MIP GEMM on the NVIDIA Carmel CPU.

The practical implications of the different hardware support provided by the NVIDIA Carmel processor for floating-point version(s) versus integer arithmetic are reported for the encoder block in the top row of plots in Figure 8. These results lead to the following observations:

• The clear winner is the FP16 version of GEMM, both in terms of performance, with speed-ups over the FP32 option that vary between 1.24× and 1.71×, and energy, with gains in the range of 1.25× and 1.82 × .

• The quantized encoder block, configured to use the MIP micro-kernel Q_INT8_INT32, offers the poorest results, conformal with the inferior integer arithmetic throughput of the platform CPU.

• Finally, the reduction of the data transfer costs intrinsic to the approach that exploits fake quantization (FQ_INT8_IN32) does not yield any advantage. This can be attributed to the strong compute-bound nature of the operation on this platform.

We close this analysis with two additional comments: (1) While our focus is on latency-oriented scenarios, corresponding to those experiments with small values of b, we also include result for a LLM “server” case (b = 128). The results for that problem instance illustrate the remaining margin for improvement in case we could work with larger batches. (2) We remind that only the linear layers (GEMM) can potentially benefit from the acceleration. Other components of the encoder block, such as BGEMM and Softmax, perform all operations in standard FP32 arithmetic and, therefore, will not take any advantage from the new micro-kernels.2) ARM Cortex-A78AE (NVIDIA Jetson AGX Orin). Compared to the NVIDIA Carmel, the ARM Cortex-A78AE also implements the ARMv8.2-A ISA, but offers a much more efficient hardware support for integer arithmetic. In particular, the floating point peak performance of the ARM Cortex-A78AE, at 2.2 GHz, is 35.20 FP32 GFLOPS and 70.40 FP16 GFLOPS. Compared with that, the use of the MIP vector intrinsic vmlal_lane_s16() offers the same performance for INT32 than for FP32, that is, 35.20 INT32 OPS. This means that we can expect similar performance with the MIP AXPY-oriented micro-kernel than with the FP32 counterpart and half of the efficiency we could attain with an FP16 micro-kernel.

Figure 8.

Execution time (left) and energy consumption (right) for an encoder block of BERT _L , using different data types/arithmetic and all CPU cores in the NVIDIA Carmel (top), ARM Cortex-A78AE (middle), and ARM Cortex-A57 (bottom). The numbers on top of the FP32 bars display the total execution time and energy consumption. The numbers on top of the remaining bars display the gain/loss of the corresponding result with respect to FP32.

Nevertheless, the key advantage of the ARM Cortex-A78AE lies in that it offers a very efficient vector intrinsic for the DOT product, namely:

int32x4_t vdotq_laneq_s32(int32x4_t v0,

int8x16_t v1,

int8x16_t v2,

const int lane)

Specifically, given two input vectors v1,v2, each comprising 16 INT8 numbers, plus a third input vector v0 with 4 INT32 numbers, this intrinsic computes and delivers a peak performance of 140.80 INT32 OPS. That is 4× higher than that of FP32 and 2× higher than FP16!

vr[i] = v0[i] + v1[lane_*4+0] _* v2[i_*4+0]

+ v1[lane_*4+1] _* v2[i_*4+1]

+ v1[lane_*4+2] _* v2[i_*4+2]

+ v1[lane_*4+3] _* v2[i_*4+3],

i = 0,1,2,3.

Unfortunately, the performance advantage of the DOT product vector intrinsic vdotq_laneq_s32() does not come for free. In particular, the use of this operation requires a complete re-design of the micro-kernel as well as the packing routines that layout the micro-panels A_r, B_r to enable an efficient access to their elements from the micro-kernel. The new GEMM scheme thus departs from the conventional realization of GEMM proposed by GotoBLAS2 and adopted by many of its successors. For brevity, we omit the details of the data layout and the DOT-oriented micro-kernel code for ARM architectures.

The performance and energy consumption of the different implementations of GEMM, based on the distinct precision micro-kernels, is displayed in the middle row of plots in Figure 8. The trend shown in those plots is conformal with the theoretical superior performance of FP16 over FP32, and MIP over the previous two, though a bit less optimistic than the theory indicated: 2× for FP16 and 4× for the quantized version. In particular, the advantage of FP16 over FP32 varies between 1.24× and 1.79× in performance and 1.56× and 2.06× in energy. For the quantized encoder block, the advantage over FP32 is between 1.59× and 2.33× for time and 1.56× and 2.09× for energy. Here it is relevant to remind again that only the linear layers are quantized and, doing so, increases the contribution to cost of other layers which are not: BGEMM and Softmax.3) ARM Cortex-A57, A72 and A53 (NVIDIA Jetson Nano, Raspberry Pi and AMD Kria KV260, respectively). These three platforms implement ARMv8.0-A, asking for a different approach to obtain an efficient MIP micro-kernel. In particular, inside the micro-kernel loop L6 we employ a combination of three vector intrinsics:

int16x8_t vmull_s8(int8x8_t v0, int8x8_t v1)

int16x8_t vmlal_s8(int16x8_t v0, int8x8_t v1,

int8x8_t v2)

int32x4_t vpadalq_s16(int32x4_t v0, int16x8_t v1)

The first intrinsic takes two input vectors (v0, v1) with eight INT8 numbers each, and performs their element-wise multiplication, leaving the result on a destination vector (namely, vr) with eight INT16 numbers:

vr[i] = v0[i] _* v1[i], i=0,1,2,3.

The second intrinsic takes an input vector (v0) with eight INT16 numbers and two input vectors (v1, v2) with eight INT8 numbers each, and computes:

vr[i] = v0[i] + v1[i] _* v2[i], i=0,1,...,7.

The last intrinsic receives two input vectors (v0, v1), with four INT32 numbers the first one and eight INT16 numbers the second and produces an output (vr) consisting of four INT32 numbers as follows:

vr[i] = v0[i] + v1[2_*i] + v1[2_*i+1], i=0,1,2,3.

Finally, after the execution of the loop inside the micro-kernel, the update of the micro-tile C_r requires an additional reduction, via the vector intrinsic:

int32x4_t vpaddq_s32(int32x4_t v0, int32x4_t v1)

This operation receives two input vectors (v0, v1), with four INT32 numbers each, and computes an output (vr) of the same type as follows:

vr[0] = v0[0] + v0[1], vr[1] = v0[2] + v0[3],

vr[2] = v1[0] + v1[1], vr[3] = v1[2] + v1[3].

As in the case of the MIP micro-kernel for the ARM Cortex-A78AE, the use of this specialized vector intrinsics requires a complete rewrite of the micro-kernel and packing routines. For simplicity, we omit the details here and focus on the impact of this type of precision on the performance and energy consumption next.

The two plots in the bottom row in Figure 8 compare the performance and energy consumption of different realizations of GEMM on the ARM Cortex-A57. The top row of plots in Figure 9 shows the performance results from the same experiment in the two remaining ARM-based platforms (ARM Cortex-A72 and A53). Note that we could not obtain energy measurements as these processors are not equipped with power sensors. In addition, none of these three ARMv8.0-A platforms offers hardware support for FP16, so we did not develop a micro-kernel/GEMM for that variant. The results show a similar behavior in all three platforms, with the fake quantization option not really offering any gain with respect to FP32, and the quantized version yielding some visible performance gains, which vary between 1.10× and 2.18 × . Lastly, for the ARM A57, the energy advantage is quite conformal with the performance gain.

Figure 9.

Execution time for an encoder block of BERT _L , using different data types/arithmetic and all CPU cores in the ARM Cortex-A72 (top left), ARM Cortex-A53 (top right) XuanTie C910 (middle left), XuanTie C906 (middle right), K1 (bottom left), and XuanTie C908 (bottom right). The numbers on top of the FP32 bars display the total execution time. The numbers on top of the remaining bars display the gain/loss of the corresponding result with respect to FP32.

5.5. Optimization of MIP GEMM for RVV

Despite all RISC-V architectures targeted in this work support vector integer arithmetic, those that implement RVV 0.7.1 (Xuantie C906 and C010) do not offer vector instructions that combine integer operands of different data types (i.e., mixed arithmetic). Conversely, the RVV 1.0 CPUs (K1 and Xuantie C908) do support this possibility.1) XuanTie C906 and C910 (Sipeed LicheePiRv and Sipeed LicheePi 4a). These two platforms implement the RVV 0.7.1 specification, so it is not possible to mix different integer data types in a single arithmetic vector instruction. To bypass this problem, we built two alternative variants of GEMM with fake quantization, where the original data for A, B is stored as INT8 and transformed to either INT32 or FP32 as part of the packing into the buffers A_c, B_c. In contrast, the vector instructions employed for the data loading/storing and micro-tile update all operate with either INT32 or FP32 data.

The two plots in the middle row of Figure 9 report the execution time of the encoder block with the linear layers built upon the three variants of GEMM developed for the RVV 0.7.1 platforms. The results on these two architectures are rather similar for all three variants, except FQ_INT8_FP32 for small batch size on the Xuantie C910, which shows a certain advantage. In principle FQ_INT8_INT32 shares the same reduction in data transfers as the one that operates with FP32. However, it is interesting to observe that this does not result in similar gains in the execution time.2) XuanTie C908 and K1 (Kendryte K230 and Banana Pi F3). The micro-kernel for these RVV 1.0 platforms leverages the vector intrinsics described next. First, loading the INT8 elements from A, B data is done via:

vint8mf4_t __riscv_vle8_v_i8mf4 (

const int8_t _*base, size_t vl)

which loads eight INT8 elements into the vector register.

After the data is loaded, we transform (or sign extend) these elements into INT16 numbers using:

vint16mf2_t __riscv_vwadd_vx_i16mf2 (

vint8mf4_t v0, int8_t s,

size_t vl);

This intrinsic expands the data in the vector register v0 to INT16 and adds the INT8 scalar in s to each element of v0. In this case, s is set to zero. Although there exist other type casting intrinsics, we found this one is rather efficient.

The update of the micro-kernel is next done as follows:

vint32m1_t __riscv_vwmacc_vv_i32m1 (

vint32m1_t v0, vint16mf2_t v1,

vint16mf2_t v2, size_t vl);

This instruction transforms the entries of v1 and v2 to INT32 numbers, and computes a FMA with the v0 register.

The two plots in the bottom row of Figure 9 report the execution time of the encoder block built upon the three variants of GEMM developed for the RVV 1.0 platforms. These two last CPUs show a similar behavior, with FQ_INT8_FP32 slightly outperforming the initial FP32 implementation, and FQ_INT8_INT32 being the most efficient.

5.6. Global comparison

To close the experiments, we compare the performance of all the CPUs included in our study. For this purpose we select a latency-oriented scenario with b = 4 and the best implementation of the transformer block in each platform. Furthermore, we offer two sets of results, corresponding to absolute performance in tokens per second; and values normalized by the respective processor frequency and cores, to isolate the effect of the differences in these two factors.

Table 5 reports the results from this final experiment. In terms of absolute performance (top part of the table), as could be expected, the ARM Cortex-A78AE is the clear winner. This processor benefits from its superior number of cores (12) and high frequency (2.20 GHz) compared to the rest of competitors. Only the NVIDIA Carmel is close in cores (8) and slightly superior in frequency (2.26 GHz). However, the processor in the NVIDIA Jetson AGX Orin still outperforms that in the NVIDIA Jetson AGX Xavier due to the efficient hardware support for MIP arithmetic, which allows to leverage a quantized version of GEMM that clearly outperforms the FP16-based implementation that is the best option in the latter. Overall, the differences between the two most powerful ARM-based processors roughly vary between a factor of 2× and 3 × . The performance gap between the remaining CPUs and the ARM Cortext-A78AE are much larger, with a factor of 10× or larger. Among the RISC-V platforms, the K1 CPU is the best option, with a performance at the level of the ARM Cortex-A57 and ARM Cortex-A72. To give a broader perspective to these numbers, it would be good to factor in the energy consumption into the equation. Unfortunately, only the CPUs in the NVIDIA Jetson boards offer this possibility. While we could have relied on external power units to measure the energy consumption of the board, this introduces considerable “noise” due to other components in the board.

Looking at the normalized values next, we can observe that the two most powerful ARM-based CPUs still stand out, but the differences between the rest are much narrower for the largest problems.

We finally analyze the impact on precision of mixed INT8–INT32 dynamic quantization, applied to all Linear layers. For this purpose, we leveraged torch 2.2.2 for quantization. Table 6 reports the accuracy, F1, MCC, precision, and recall of the pre-trained and fine-tuned models evaluated using the SST-2 dataset (version Stanford Sentiment Treebank v2) Socher et al. (2013). The results demonstrate that quantization affects precision in different degrees in comparison with the full precision counterpart (FP32). In most cases, the accuracy degradation due to quantization is minimal, with DEBERTA Base being the most notable exception, where precision suffers a more significant decline compared with the results obtained when using full precision.

Table 6.

Model accuracy comparison on SST-2.

Model	Type	Accuracy	F1	MCC	Precision	Recall
BERT small	FP32	0.8860	0.8860	0.7711	0.8860	0.8860
	INT8_INT32	0.8880	0.8881	0.7757	0.8884	0.8880
BERT base	FP32	0.9340	0.9340	0.8675	0.9340	0.9340
	INT8_INT32	0.9220	0.9219	0.8434	0.9221	0.9220
DISTILBERT	FP32	0.9120	0.9119	0.8236	0.9125	0.9120
	INT8_INT32	0.9040	0.9038	0.8075	0.9044	0.9040
ROBERTA base	FP32	0.9380	0.9380	0.8756	0.9381	0.9380
	INT8_INT32	0.9460	0.9460	0.8917	0.9461	0.9460
DEBERTA base	FP32	0.9440	0.9440	0.8876	0.9440	0.9440
	INT8_INT32	0.9160	0.9159	0.8314	0.9161	0.9160
BERT large	FP32	0.9440	0.9440	0.8877	0.9441	0.9440
	INT8_INT32	0.9320	0.9319	0.8636	0.9322	0.9320
DEBERTA large	FP32	0.9520	0.9520	0.9037	0.9520	0.9520
	INT8_INT32	0.9360	0.9359	0.8719	0.9365	0.9360

6. Concluding remarks

This paper focuses on optimizing transformer encoders, which play a vital role in tasks like text classification, speech recognition, and language translation. Unlike most LLM optimization efforts that target training on GPUs, this work emphasizes inference on low-power, general-purpose CPUs. The key contributions of our work include:

(1) Performance Analysis: We identify the significant impact of linear layers on transformer encoder performance when executed on low-power CPUs.

(2) Mixed-Precision Kernels: We develop and tune integer arithmetic-based mixed-precision kernels for linear layers on ARM and RISC-V CPUs, tailored to improve efficiency.

(3) Energy Impact: We report the energy savings achievable via quantization on CPUs with power counters, particularly on ARM-based CPUs in the NVIDIA Jetson boards.

In summary, our study delivers a detailed evaluation of performance and energy gains from quantization, targeting transformer encoder inference on both single-core and multi-core low-power CPUs.

In addition, the paper offers the following list of conclusions:

(1) Significance of Linear Layers: Linear layers dominate the computational workload in transformer encoders on low-power CPUs. Optimizing these layers is critical for enhancing performance.

(2) Role of Quantization: Quantization effectively reduces the computational and energy demands of transformer encoder inference on resource-constrained CPUs, in principle with minimal trade-offs in accuracy.

(3) Mixed-Precision Success: Custom mixed-precision implementations of general matrix multiplication significantly boost efficiency on ARM and RISC-V architectures.

(4) Energy Efficiency Gains: Quantization offers measurable energy savings, as demonstrated on platforms with accessible power counters, exposing its suitability for edge computing scenarios.

(5) General Applicability: The methodologies and findings provide actionable insights for deploying transformer-based NLP applications on low-power CPUs, promoting resource-efficient DL adoption.

Footnotes

Acknowledgements

This work was supported by the research projects PID2023-146569NB-C2 and PID2020-113656RB-C22 of MCIN/AEI/10.13039/501100011033 and ERDF/UE. H. Martínez is a POSTDOC_21_00025 fellow supported by Junta de Andalucía. S. Catalán is supported by the grant RYC2021-033973-I, funded by MCIN/AEI/10.13039/501100011033 and the “NextGenerationEU”/PRTR, and UJI-2023-04, funded by Universitat Jaume I. A. Castelló is supported by the CIAPOS/2023/431 grant, and Fondo Social Europeo Plus 2021-2027 from the Generalitat Valenciana.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: PID2023-146569NB-C2 and PID2020-113656RB-C22 of MCIN/AEI/10.13039/501100011033 and ERDF/UE, POSTDOC_21_00025 of Junta de Andalucía, RYC2021-033973-I of MCIN/AEI/10.13039/501100011033 and the “NextGenerationEU”/PRTR, UJI-2023-04 of Universitat Jaume I, and CIAPOS/2023/431 of Fondo Social Europeo Plus 2021-2027 from the Generalitat Valenciana.

ORCID iDs

Héctor Martínez

Sandra Catalán

Adrián Castelló

Enrique S Quintana-Ortí

Héctor Martínez received his bachelor and PhD degrees in Computer Science from Universitat Jaume I in 2010 and 2020, respectively. He is a postdoctoral researcher under the Junta de Andalucía program at Universidad de Córdoba and his research interests include high-performance computing, IoT, deep neural networks, and programming models.

Sandra Catalán received her Bachelor’s, M. Sc., and PhD degrees in Computer Science from Universitat Jaume I in 2012, 2013, and 2018, respectively. She is a postdoctoral researcher under the Ramón y Cajal program at Universitat Jaume I. Her research interests include high-performance computing, linear algebra, deep learning, and programming models.

Adrián Castelló received his Bachelor’s, M. Sc., and PhD degrees in Computer Science from Universitat Jaume I in 2011, 2013, and 2018, respectively. He is a Postdoc researcher under the APOSTD program of the Generalitat Valenciana at the Universitat Politècnica de València. His research interests include high-performance computing, code auto-generation, deep neural networks, and programming models.

Enrique S. Quintana-Ortí received bachelor and PhD degrees in Computer Science from Universitat Politècnica de València (UPV), Spain, in 1992 and 1996, respectively. After more than 20 years at Universitat Jaume I, he is currently Professor in Computer Architecture at UPV. His current research interests include parallel programming, linear algebra, energy consumption, transprecision computing and deep learning as well as advanced architectures and hardware accelerators.

References

Chitty-Venkata

Mittal

Emani

, et al. (2023) A survey of techniques for optimizing transformer inference. Journal of Systems Architecture 144: 102990.

Dettmers

Lewis

Belkada

, et al (2024) LLM.int8(): 8-bit matrix multiplication for transformers at scale. In: Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22, Red Hook, NY, USA, 28 Nov - 9 Dec 2022 . Curran Associates Inc. ISBN 9781713871088.

Devlin

Chang

Lee

, et al. (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: Proc. 2019 Conf. North American Chapter Assoc, June 3, 2019 - June 5, Minneapolis, MN, USA. Computational Linguistics: Human Language Techn, 4171–4186.

Dice

Kogan

(2021) Optimizing Inference Performance of Transformers on CPUs. https://arxiv.org/abs/2102.06621

Dowd

Severance

(1998) High Performance Computing. 2nd edition. Sebastopol, California: O’Reilly.

Frison

Kouzoupis

Sartor

, et al. (2018) BLASFEO: basic linear algebra subroutines for embedded optimization. ACM Transactions on Mathematical Software 44(4): 1–30. DOI: 10.1145/3210754.

Goto

van de Geijn

(2008a) Anatomy of a high-performance matrix multiplication. ACM Transactions on Mathematical Software 34(3): 12:1–12:25.

Goto

van de Geijn

(2008b) Anatomy of high-performance matrix multiplication. ACM Transactions on Mathematical Software 34(3): 12:1–12:25.

Heinecke

Henry

Hutchinson

, et al. (2016) LIBXSMM: accelerating small matrix multiplications by runtime code generation. In: Proc. Int. Conference for High Performance Computing, Networking, Storage and Analysis, SC ’16, November 13-18, Salt Lake City, Utah, USA. IEEE Press. ISBN 9781467388153.

10.

Hubara

Courbariaux

Soudry

, et al. (2017) Quantized neural networks: training neural networks with low precision weights and activations. Journal of Machine Learning Research 18(1): 6869–6898.

11.

Ilic

Pratas

Sousa

(2014) Cache-aware roofline model: upgrading the loft. IEEE Computer Architecture Letters 13(1): 21–24. DOI: 10.1109/L-CA.2013.6.

12.

Instruments T (2016) INA3221 Triple-Channel, High-Side Measurement, Shunt and Bus Voltage Monitor with I2C- and SMBUS-Compatible Interface. https://www.ti.com/product/INA3221#tech-docs

13.

Jiang

Huang

, et al. (2023) Full-stack optimizing transformer inference on ARM many-core CPU. IEEE Transactions on Parallel and Distributed Systems 34(7): 2221–2235. DOI: 10.1109/TPDS.2023.3280805.

14.

Kim

Gholami

Yao

, et al. (2021) I-BERT: Integer-Only BERT Quantization. https://arxiv.org/abs/2101.01321

15.

Kim

Hooper

Wattanawong

, et al. (2023) Full Stack Optimization of Transformer Inference: A Survey. https://arxiv.org/abs/2302.14017

16.

Low

Igual

Smith

, et al. (2016) Analytical modeling is enough for high-performance BLIS. ACM Transactions on Mathematical Software 43(2): 12:1–12:18.

17.

Martínez

Igual

Rodríguez-Sánchez

, et al. (2024) Inference with transformer encoders on ARM and RISC-V multicore processors. In: Euro-Par 2024: Parallel Processing, August 26-30, Madrid, Spain. Springer, 377–392. ISBN 978-3-031-69766-1.

18.

Shankar

Reuther

(2022) Trends in energy estimates for computing in AI/machine learning accelerators, supercomputers, and compute-intensive applications. In: 2022 IEEE High Performance Extreme Computing Conference (HPEC), September 19 - 23, 1–8.

19.

Smith

van de Geijn

Smelyanskiy

, et al. (2014) Anatomy of high-performance many-threaded matrix multiplication. In: Proc. IEEE 28th Int. Parallel and Distributed Processing Symp., IPDPS’14, May 19-23, Phoenix, Arizona, USA. 1049–1059.

20.

Socher

Perelygin

, et al. (2013) Recursive deep models for semantic compositionality over a sentiment treebank. In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, October 18–21, 2013. Association for Computational Linguistics, 1631–1642. https://www.aclweb.org/anthology/D13-1170

21.

Van Zee

van de Geijn

(2015) BLIS: a framework for rapidly instantiating BLAS functionality. ACM Transactions on Mathematical Software 41(3): 14:1–14:33.

22.

Van Zee

van de Geijn

(2016) The BLIS framework: experiments in portability. ACM Transactions on Mathematical Software 42(2): 12:1–12:19.

23.

Vaswani

Shazeer

Parmar

, et al. (2017) Attention is all you need. Advances in Neural Information Processing Systems 30: 5998–6008.

24.

Meng

Zhu

, et al. (2024) autoGEMM: pushing the limits of irregular matrix multiplication on Arm architectures. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC ’24, November 17 - 22, Atlanta, Georgia. IEEE Press. ISBN 9798350352917. DOI: 10.1109/SC41406.2024.00027.

25.

Xiao

Lin

Seznec

, et al. (2023) Smoothquant: accurate and efficient post-training quantization for large language models. In: Proceedings of the 40th International Conference on Machine Learning, ICML’23, July 23-29, Honolulu, HI, USA. JMLR.org.

26.

Yang

Fang

Dong

, et al (2021) LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’21, New York, NY, USA, November 14–19, 2021. Association for Computing Machinery. DOI: 10.1145/3458817.3476217. ISBN 9781450384421.

27.

Zhao

Zhou

, et al. (2023) A Survey of Large Language Models. https://arxiv.org/abs/2303.18223

Characterization of quantized inference with transformer encoders on low power CPUs

Abstract

Keywords

1. Introduction

2. Related work

3. Characterization of the transformer encoder

3.1. Anatomy of the transformer encoder

3.2. Inference with FP32 data/arithmetic

3.2.1. Experimental setup

3.2.2. Observations

4. GEMM in linear algebra libraries for scientific computing

4.1. Optimizing GEMM on multicore CPUs

4.2. High performance micro-kernel for GEMM

4.3. From ARM to RISC-V GEMM micro-kernels

5. Quantized GEMM

5.1. Quantization and GEMM

5.2. Reference implementations

5.3. High level overview

5.4. Optimization of MIP GEMM for ARM NEON v8-A

5.5. Optimization of MIP GEMM for RVV

5.6. Global comparison

6. Concluding remarks

Footnotes

Acknowledgements

Declaration of conflicting interests

Funding

ORCID iDs

References