
Abstract

This paper provides a simple method for efficient performance evaluation of floating-point (FP) formats, addressing the challenge of implementing DNN (deep neural network)-based sensor nodes and edge measurement devices. Since resource constraints are imposed in such scenarios, the 32-bit FP format, standardly used for DNN implementation, is unsuitable, and the alternative is found in lower-resolution FP formats, which lead to some performance degradation. Hence, an efficient mechanism for performance evaluation of different FP formats is needed to examine the effect of reduced resolution. Existing methods utilize the analogy between FP formats and piecewise uniform quantizers (PUQs), using the signal-to-quantization noise ratio (SQNR) of the PUQ to express the performance of FP formats. However, the high complexity of SQNR calculations, involving a sum with many terms (e.g. 254 terms for FP formats with an 8-bit exponent), poses a significant challenge. This paper’s main contribution is the significant simplification of the SQNR expression for Gaussian-distributed data, reducing the number of sum terms from 254 to just 5 with minimal accuracy loss, allowing for simple and efficient performance evaluation of FP formats. Major findings include an in-depth analysis of the probability distribution across PUQ segments, a closed-form expression for identifying the highest probability segment, and an evaluation of the SQNR approximation’s accuracy. These findings provide a foundational basis for implementing intelligent DNN-based measurement systems, with applications extending to computing, signal processing, and other fields utilizing FP formats.

Keywords

Digital representation of measurement data; intelligent measurement systems; statistical modeling of measurement data; floating-point format; piecewise uniform quantization

Introduction

Due to the dominance of digital measurement systems, analyzing formats for digital representation of measurement data has become a crucial research task. The two main types of digital formats are fixed-point (FxP) and floating-point (FP) formats (Dinčić et al., 2023). FxP is simpler to implement, while FP offers a much wider variance range of consistent representation quality, making it preferable for numerous applications, including computing, signal processing (Moroz and Samotyy, 2019), and sensor data representation (Chen et al., 2024). The most commonly used is the 32-bit floating-point format (FP32) (Perić et al., 2021), providing high-quality digital representation over a very wide range of data variance, although it is complex to implement. The use of the FP32 format is especially relevant for the implementation of deep neural networks (DNNs), being used by default for representing DNN parameters (weights, activations, etc.) and input data. DNNs are increasingly used for processing of measurement signals, such as electrocardiogram (ECG) (Degachi and Ouni, 2025), electroencephalogram (EEG) (Li et al., 2024), and vibration signals (Syu and Lee, 2024), for the implementation of measurement systems (Chen and Chen, 2022; Chen et al., 2023; Liang and Huang, 2023; Liu and Yang, 2024; Lou et al., 2024), control systems (Bey and Chemachema, 2024; Xi et al., 2024), soft sensors (Luo et al., 2024; Sun and Ge, 2021), sensor networks (Kim et al., 2021; Wu et al., 2023) and Internet of Things (IoT) systems (Mohapatra et al., 2024; Mukhopadhyay et al., 2021), as well as for sensor linearization (Anandanatarajan et al., 2023), sensor drift compensation (Chaudhuri et al., 2021), and sensor fusion (Balemans et al., 2023). 
Initially, DNN-based measurement and control systems were implemented such that DNNs ran on powerful servers and clouds (Alamri et al., 2013), with measurement data transmitted from sensors via the Internet, which caused latency and security issues (Sahni et al., 2022). Recently, local implementation of DNN algorithms on sensor nodes and edge measurement devices, as close as possible to the source of data, has become a key research focus. It enables the realization of intelligent and automated measurement and control systems that are capable of real-time data processing and decision-making with enhanced security and reliability, and that are suitable for various applications (Al Koutayni et al., 2023; Fanariotis et al., 2023; Lai and Chang, 2023; Ronco et al., 2022; Yadav et al., 2024; Zhou et al., 2021).

However, the main challenge for DNN-based sensor nodes and edge measurement devices is their limited hardware resources (processing power, memory capacity, and especially available energy in battery-powered sensor nodes and edge devices), which makes implementation of the FP32 format very difficult due to its complexity (Syed et al., 2021). Consequently, considering lower resolution FP formats (e.g. 24-bit FP24 (Junaid et al., 2022), 16-bit bfloat16 (Agrawal et al., 2019), 8-bit FP8 (Wang et al., 2018), or some other) for implementation of intelligent sensor nodes and measurement devices becomes a topical research direction, with the aim of reducing complexity. However, lower resolution reduces performance by decreasing the quality of digital representation and narrowing the variance range, risking inadequate quality and variance range for specific applications. Thus, a critical research gap exists in developing efficient methods to evaluate the performance of various FP formats in terms of representation quality and variance range. Addressing this gap is essential for selecting the optimal FP format for each specific application and input data set, enabling resource-efficient implementation of intelligent DNN-based sensor nodes and measurement systems.

A significant advance in evaluating FP format performance was achieved in Perić et al. (2021), where an analogy between the FP format and the piecewise uniform quantizer (PUQ) was established. Quantization error is expressed by an objective measure called distortion. It is common to define distortion using the L₂ norm as the mean squared quantization error, although an alternative approach defines distortion via the L₁ norm as the mean absolute quantization error. Signal-to-quantization noise ratio (SQNR) represents another widely used objective performance measure of the quantizer. The typical approach, which will be used in this paper, calculates SQNR using the L₂ norm, as the logarithmic ratio (expressed in dB) of data variance (representing the mean squared value of the data) to distortion calculated as the mean squared quantization error. However, some studies use an alternative method to calculate SQNR with the L₁ norm, as the logarithmic ratio of the mean absolute value of the data to distortion calculated as the mean absolute quantization error (Marco and Neuhoff, 2006). In both cases, a higher quantization error results in greater distortion, leading to a lower SQNR.

The established analogy between the FP format and the PUQ allows us to express FP format performance through an objective measure such as the SQNR of the PUQ. However, the SQNR expression is very complex, containing the sum of a large number of terms (e.g. 254 for bfloat16, FP24, and FP32 formats with an 8-bit exponent (Perić and Dinčić, 2023; Perić et al., 2021)), each corresponding to one PUQ segment. In addition, the need to repeat the SQNR calculation for various variances within a wide variance range (over 1500 dB for formats like bfloat16, FP24, and FP32) further increases complexity, making performance evaluation of FP formats very impractical. This paper adopts the idea of using the SQNR of the equivalent PUQ as a performance metric for FP formats, while at the same time addressing the problem of high computational complexity of the exact SQNR expression and offering a simple and efficient solution for SQNR calculation, thereby enabling effective performance evaluation of FP formats. Considering that the quality of FP representation of DNN parameters, expressed by SQNR, greatly affects the prediction accuracy of DNNs, this paper will offer solutions for simplifying DNN implementations in resource-constrained environments without compromising DNN accuracy, as a significant contribution to the development of intelligent DNN-based sensor nodes and measurement systems.

It is known that the performance of the FP format depends on the probability density function (PDF) of the measurement data (Perić et al., 2021). This paper focuses on the Gaussian PDF, which is effective for modeling stochastic measurement data (Dincic et al., 2013, 2014, 2016, 2023). Moreover, the fact that non-Gaussian signals can be transformed into Gaussian ones by appropriate filters (Dinčić et al., 2023) broadens the applicability of the results of this paper.

The main novelty of this paper is a new approach to SQNR calculation, based on the analysis of the probability distribution of PUQ segments and the impact that each individual PUQ segment has on SQNR. As an important result of this new approach, it has been shown that, for any given variance, only a few segments have probabilities significantly different from zero, affecting the SQNR, while all other segments have negligible probabilities with negligible impact on SQNR. This led to the key insight that SQNR calculation can be drastically simplified by focusing just on the segment with the highest probability and a few adjacent segments. As the most probable segment of the PUQ varies with changing data variance, a significant contribution of this paper is the derivation of a simple closed-form formula for determining the segment with the highest probability, valid for any variance value. In addition, this paper explores different numbers of segments to the left and right of the most probable segment to be included in the SQNR expression, aiming to obtain a tight approximation of the original SQNR. A crucial finding achieved in this way is a very simple approximate expression for calculating SQNR involving only five terms in the sum, which drastically reduces the complexity of SQNR calculation compared to the original SQNR expression. In addition, this paper examines how different components of distortion (representing the mean squared quantization error) affect SQNR across a wide variance range, revealing that some components have negligible impact in certain variance ranges. This further simplifies the SQNR expression, as another significant finding of this paper.

This paper focuses on evaluating the performance of FP formats with an 8-bit exponent (FP32, FP24, and bfloat16) due to the particularly high complexity of calculating their exact SQNR values. It provides numerical values for the exact SQNR across these formats, obtained using the complex formula involving a 254-term sum, as well as approximate SQNR values calculated by the simplified expression with only a five-term sum derived in this paper. Calculations were conducted over a broad range of variance values. Results demonstrate that the simplified SQNR expression developed in this paper deviates from the exact SQNR formula by at most 0.024 dB, so the substantial reduction in computational complexity achieved by using the approximate SQNR expression does not compromise calculation accuracy. Numerical calculations show that all three considered FP formats maintain a nearly constant SQNR across an extensive variance range from approximately −750 dB to beyond 750 dB. Within this range, the SQNR values are approximately 151.9 dB for FP32, 103.75 dB for FP24, and 55.6 dB for bfloat16, indicating that reducing the resolution by 8 bits decreases SQNR by roughly 48.15 dB. On the other hand, the reduction in resolution lowers implementation complexity. Hence, the most suitable FP format should be determined for each specific application by balancing implementation complexity with performance. MATLAB simulations with 10 million randomly generated Gaussian numbers, as well as an experiment performed with weights of a trained MLP (Multilayer Perceptron) neural network, confirm the theoretical results.

To summarize, given the impact of SQNR in FP representation of DNN parameters on DNN accuracy, alongside the substantial effect of FP formats on DNN implementation complexity, it becomes essential to assess FP formats in the context of DNN deployment. This is particularly relevant when considering the potential application of DNNs in resource-constrained measurement systems, which could become more intelligent and automated through DNN integration. This paper provides some fundamental results related to simplified and efficient calculation of FP format performance across various resolutions and a wide range of variance, facilitating the use of FP formats with lower resolution than FP32, thereby reducing implementation complexity and energy consumption while accelerating data processing. These findings have substantial practical applications for assessing the performance of various FP formats and selecting the most suitable format for each specific case. This is particularly valuable for developing intelligent DNN-based measurement systems with limited hardware resources and constrained available energy (such as battery-powered systems), as well as real-time measurement systems. One potential application is in vibration-based predictive maintenance measurement systems in industry, where high-frequency signal sampling generates large amounts of data, requiring efficient and low-complexity solutions. Furthermore, the results of this paper can be applied across many other fields, including computing and signal processing, due to the broad applicability of FP formats.

This paper is organized as follows. Section “Structure and performance evaluation of the FP format” explains the FP format structure and its analogy with the PUQ, from which the expression for SQNR calculation is derived. The section “Efficient performance evaluation of FP formats”, as the main part of this paper, analyzes the impact of quantizer segments on SQNR, derives the closed-form expression for determining the most probable segment, proposes simple approximate SQNR expressions, and examines the influence of various distortion components on SQNR for different variances, further simplifying the SQNR expression. This section also presents numerical results based on the theoretical analysis, as well as simulation results provided to confirm the achieved numerical results. The next section, “Experimental results,” validates the correctness of the developed theory by applying the proposed approach to the data set consisting of weights of a trained MLP neural network. The conclusion and list of references are given at the end of this paper.

Structure and performance evaluation of the FP format

A real number x is represented in an R-bit FP format as (Perić et al., 2021)

x = (s a_{e - 1} . . . a_{1} a_{0} b_{m - 1} . . . b_{1} b_{0})_{2}

(1)

where one bit “s” encodes the sign of the number x, e bits $(a_{e - 1} . . . a_{1} a_{0})$ encode the exponent, and m bits $(b_{m - 1} . . . b_{1} b_{0})$ encode the significand, whereby $R = e + m + 1$ . The FP format is symmetric around 0, with each positive number having a negative counterpart. FP numbers represented by equation (1) can be categorized into two groups: normal and subnormal. Normal FP numbers are calculated as

x = (- 1)^{s} 2^{E^{*}} (1 + \frac{M}{2^{m}})

(2)

where $E = \sum_{i = 0}^{e - 1} a_{i} 2^{i}$ is the exponent, $E^{*} = E - bias$ denotes the biased exponent with bias being a predefined parameter, and $M = \sum_{i = 1}^{m} b_{m - i} 2^{m - i}$ . The exponent E can take values from 0 to $2^{e} - 1$ , but the values 0 (all zeros) and $2^{e} - 1$ (all ones) are reserved, leaving $2^{e} - 2$ values (from 1 to $2^{e} - 2$ ) for encoding numbers. This results in $2^{e} - 2$ values of the biased exponent $E^{*}$ (from its minimum value $E_{\min}^{*} = 1 - bias$ to its maximum value $E_{\max}^{*} = 2^{e} - 2 - bias$ ) usable for representing numbers. The parameter M can take $2^{m}$ values from 0 to $2^{m} - 1$ . The smallest positive normal FP number, with the minimum values of $E^{*}$ and M ( $E^{*} = E_{\min}^{*}$ and $M = 0$ ), according to equation (2), is $x_{\min} = 2^{E_{\min}^{*}}$ , while the largest positive normal FP number obtained for the maximum values of $E^{*}$ and M ( $E^{*} = E_{\max}^{*}$ and $M = 2^{m} - 1$ ) is $x_{\max} = 2^{E_{\max}^{*}} (1 + \frac{2^{m} - 1}{2^{m}}) = 2^{E_{\max}^{*}} (2 - \frac{1}{2^{m}}) \approx 2^{E_{\max}^{*} + 1}$ . Thus, normal FP numbers defined by equation (2) are placed in the range $(- x_{\max}, - x_{\min}] \cup [x_{\min}, x_{\max})$ . For example, for FP32, FP24, and bfloat16 formats with e = 8, we have $E_{\min}^{*}$ = −126, $E_{\max}^{*}$ = 127, $x_{\min} = 2^{- 126}$ and $x_{\max} \approx 2^{128}$ .
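The parameters above can be tabulated with a few lines of Python. This is only an illustrative sketch: the class name `FPFormat` is hypothetical, and the bias $2^{e - 1} - 1$ is an assumption (the text only calls the bias a predefined parameter), chosen because it reproduces $E_{\min}^{*}$ = −126 and $E_{\max}^{*}$ = 127 for e = 8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FPFormat:
    """Parameters of an R-bit FP format with e exponent and m significand bits."""
    name: str
    e: int
    m: int

    @property
    def R(self) -> int:
        return self.e + self.m + 1            # R = e + m + 1

    @property
    def bias(self) -> int:
        return 2 ** (self.e - 1) - 1          # assumption: IEEE-style bias

    @property
    def E_min(self) -> int:
        return 1 - self.bias                  # smallest biased exponent E*

    @property
    def E_max(self) -> int:
        return 2 ** self.e - 2 - self.bias    # largest biased exponent E*

    @property
    def x_min(self) -> float:
        return 2.0 ** self.E_min              # smallest positive normal number

    @property
    def x_max(self) -> float:
        # largest normal number, 2^E_max * (2 - 2^-m), slightly below 2^(E_max + 1)
        return 2.0 ** self.E_max * (2.0 - 2.0 ** -self.m)


FP32 = FPFormat("FP32", e=8, m=23)
FP24 = FPFormat("FP24", e=8, m=15)
BF16 = FPFormat("bfloat16", e=8, m=7)

for fmt in (FP32, FP24, BF16):
    print(fmt.name, fmt.R, fmt.E_min, fmt.E_max, fmt.x_min, fmt.x_max)
```

The significand widths m = 23, 15, and 7 follow from $R = e + m + 1$ with e = 8.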

Very small FP numbers in the interval ( $- x_{\min}$ , $x_{\min}$ ) around 0 are called subnormal FP numbers, calculated as

x = (- 1)^{s} \cdot \frac{M}{2^{m}} \cdot 2^{1 - bias} = (- 1)^{s} \cdot \frac{M}{2^{m}} \cdot 2^{E_{\min}^{*}}

(3)

There are a total of $2^{m}$ positive subnormal FP numbers (one for each value of M), and the same number of negative ones.

Due to symmetry around 0, we can focus solely on positive FP numbers. Let us start with the positive subnormal numbers in the range (0, $x_{\min}$ ). The distance between two consecutive subnormal numbers

Δ_{sub} = \frac{M + 1}{2^{m}} \cdot 2^{E_{\min}^{*}} - \frac{M}{2^{m}} \cdot 2^{E_{\min}^{*}} = 2^{E_{\min}^{*} - m}

(4)

is constant, making the subnormal FP numbers equidistant. Therefore, the structure of subnormal FP numbers is the same as the structure of a uniform quantizer with a quantization step size $∆_{sub}$ . Next, let us examine the positive normal FP numbers in the range $[x_{\min}, x_{\max})$ , defined as $x = 2^{E^{*}} (1 + \frac{M}{2^{m}})$ . Each value of the biased exponent $E^{*}$ ( $E_{\min}^{*} \leq E^{*} \leq E_{\max}^{*}$ ) defines a group of $2^{m}$ numbers belonging to the region $S_{E^{*}} = [2^{E^{*}}, 2^{E^{*} + 1})$ , each corresponding to one value of M. The distance between two consecutive numbers within one group (for one value of $E^{*}$ )

∆_{E^{*}} = 2^{E^{*}} (1 + \frac{M + 1}{2^{m}}) - 2^{E^{*}} (1 + \frac{M}{2^{m}}) = 2^{E^{*} - m}

(5)

is constant, but varies between different groups, making the numbers within one group equidistant. Therefore, each group of $2^{m}$ equidistant normal FP numbers for a given value of $E^{*}$ has the structure of a uniform quantizer with a quantization step size $∆_{E^{*}}$ . In total, there are $2^{e} - 2$ groups of $2^{m}$ equidistant positive normal FP numbers. This pattern extends symmetrically to negative FP numbers.

It is evident that the structure of the FP format is equivalent to a symmetric PUQ with a support region $(- x_{\max}, x_{\max})$ . The positive part of the PUQ consists of $2^{e} - 1$ segments in a way that each segment undergoes uniform quantization with $2^{m}$ quantization levels. The first segment is the subnormal segment in the range (0, $x_{\min}$ ) with a quantization step size $∆_{sub}$ . Also, there are $2^{e} - 2$ segments $S_{E^{*}} = [2^{E^{*}}, 2^{E^{*} + 1})$ , $E_{\min}^{*} \leq E^{*} \leq E_{\max}^{*}$ , representing normal FP numbers, whereby the segment $S_{E^{*}}$ corresponds to the quantization step size $∆_{E^{*}}$ . This PUQ, analogous to the structure of the FP format, is called the Floating-Point Quantizer (FPQ). This analogy between the FP format and the piecewise uniform FPQ is crucial, allowing us to evaluate performance of the FP format through an objective measure such as the SQNR of the FPQ.
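The piecewise uniform structure can be checked directly against FP32 bit patterns. The sketch below is illustrative only: `fpq_step` and `next_float32` are hypothetical helper names, the bias $2^{e - 1} - 1$ is an assumption, and the comparison relies on Python's standard `struct` module to step from one FP32 value to the next.

```python
import math
import struct


def fpq_step(x, e=8, m=23):
    """Step size of the FPQ segment containing |x| (equations (4) and (5))."""
    bias = 2 ** (e - 1) - 1                  # assumption: IEEE-style bias
    E_min, E_max = 1 - bias, 2 ** e - 2 - bias
    ax = abs(x)
    if ax < 2.0 ** E_min:                    # subnormal segment (0, x_min)
        return 2.0 ** (E_min - m)
    E = math.frexp(ax)[1] - 1                # floor(log2 |x|), exact for floats
    if E > E_max:
        raise ValueError("x lies outside the support region")
    return 2.0 ** (E - m)                    # normal segment S_E*


def next_float32(x):
    """Next FP32 value above a positive FP32 value x (bit pattern plus one)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits + 1))[0]


# The spacing of neighbouring FP32 numbers equals the segment step size.
for x in (1.0, 3.0, 2.0 ** -100, 2.0 ** -130):
    print(x, fpq_step(x), next_float32(x) - x)
```

The last test value, $2^{- 130}$, lies in the subnormal segment, where the spacing is $2^{E_{\min}^{*} - m} = 2^{- 149}$ for FP32.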

The SQNR of the quantizer is defined as (Perić et al., 2021)

SQNR (σ) = 10 \cdot \log_{10} \frac{σ^{2}}{D (σ)}

(6)

where $σ^{2}$ is the data variance, and $D (σ)$ is the distortion representing the mean square error made during quantization. The distortion $D (σ)$ of the piecewise uniform FPQ is defined as

D (σ) = D_{sub} (σ) + D_{n} (σ) + D_{ov} (σ)

(7)

where $D_{sub} (σ)$ represents the distortion in the range of subnormal numbers, $D_{n} (σ)$ represents the distortion in the range of normal numbers, and $D_{ov} (σ)$ represents the overload distortion that occurs when quantizing data outside the quantizer support region $(- x_{\max}, x_{\max})$ . Since $D_{sub} (σ)$ represents the granular distortion of the uniform quantizer equivalent to subnormal numbers with the quantization step $∆_{sub}$ , it is defined as

D_{sub} (σ) = 2 \frac{∆_{sub}^{2}}{12} P_{sub} (σ)

(8)

Here, $P_{sub} (σ)$ denotes the probability of the positive subnormal segment $(0, x_{\min})$ , given by $P_{sub} (σ) = \int_{0}^{x_{\min}} p (x, σ) dx$ , where $p (x, σ)$ is the PDF of the input data.

Distortion $D_{n} (σ)$ is expressed through the sum of $2^{e} - 2$ terms as

D_{n} (σ) = 2 \sum_{E^{*} = E_{\min}^{*}}^{E_{\max}^{*}} D_{E^{*}} (σ)

(9)

where each term $D_{E^{*}} (σ)$ corresponds to a value of $E^{*}$ ranging from $E_{\min}^{*}$ to $E_{\max}^{*}$ , representing the distortion of the uniform quantizer within the segment $S_{E^{*}} = [2^{E^{*}}, 2^{E^{*} + 1})$ with the step size $Δ_{E^{*}}$ . Consequently, $D_{E^{*}} (σ)$ is defined as

D_{E^{*}} (σ) = \frac{∆_{E^{*}}^{2}}{12} P_{E^{*}} (σ)

(10)

where $P_{E^{*}} (σ) = \int_{2^{E^{*}}}^{2^{(E^{*} + 1)}} p (x, σ) dx$ represents the probability of the segment $S_{E^{*}}$ .

Finally, the overload distortion is defined as

D_{o v} (σ) = 2 \int_{x_{\max}}^{+ \infty} {(x - x_{\max})}^{2} p (x, σ) d x

(11)

Multiplying by 2 in expressions (8), (9), and (11) includes the distortion in the negative part of the quantizer.

This paper considers the Gaussian PDF, defined as (Dinčić et al., 2023)

p (x, σ) = \frac{1}{\sqrt{2 π} σ} \exp (- \frac{x^{2}}{2 σ^{2}})

(12)

which has been successfully used for statistical modeling of measurement data. For $p (x, σ)$ defined by equation (12), we obtain the following expressions for $P_{sub} (σ)$ , $P_{E^{*}} (σ),$ and $D_{ov} (σ)$

P_{sub} (σ) = \frac{1}{2} \erf (\frac{2^{E_{\min}^{*} - 1 / 2}}{σ})

(13)

P_{E^{*}} (σ) = \frac{1}{2} (\erf (\frac{2^{E^{*} + 1 / 2}}{σ}) - \erf (\frac{2^{E^{*} - 1 / 2}}{σ}))

(14)

D_{o v} (σ) = - x_{\max} σ \sqrt{\frac{2}{π}} \exp (- \frac{x_{\max}^{2}}{2 σ^{2}}) + (x_{\max}^{2} + σ^{2}) erfc (\frac{x_{\max}}{σ \sqrt{2}})

(15)
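Expressions (13) to (15) map directly onto the error functions in Python's `math` module. The following sketch is illustrative (function names are hypothetical, and the constants assume e = 8 with $E_{\min}^{*}$ = −126 and $E_{\max}^{*}$ = 127); it also checks that the segment probabilities sum to one when the overload probability is negligible.

```python
import math

E_MIN, E_MAX = -126, 127     # formats with an 8-bit exponent


def p_sub(sigma):
    """Probability of the positive subnormal segment, equation (13)."""
    return 0.5 * math.erf(2.0 ** (E_MIN - 0.5) / sigma)


def p_seg(E, sigma):
    """Probability of the positive normal segment S_E*, equation (14)."""
    return 0.5 * (math.erf(2.0 ** (E + 0.5) / sigma)
                  - math.erf(2.0 ** (E - 0.5) / sigma))


def d_ov(sigma, x_max):
    """Overload distortion for the Gaussian PDF, equation (15)."""
    return (-x_max * sigma * math.sqrt(2.0 / math.pi)
            * math.exp(-x_max ** 2 / (2.0 * sigma ** 2))
            + (x_max ** 2 + sigma ** 2)
            * math.erfc(x_max / (sigma * math.sqrt(2.0))))


sigma = 1.0
total = 2.0 * (p_sub(sigma) + sum(p_seg(E, sigma) for E in range(E_MIN, E_MAX + 1)))
print(total)                 # probability mass inside the support region
print(p_seg(-1, sigma))      # probability of the segment [0.5, 1)
```

As a sanity check of equation (15), setting $x_{\max} = 0$ treats every sample as overload, and the expression reduces to $σ^{2}$.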

In scenarios involving a wide range of data variance, such as the case in FP formats, it is common to express variance in the logarithmic domain as $α [dB] = 10 \log_{10} (σ^{2} / σ_{ref}^{2})$ , where $σ_{ref}^{2}$ is a reference variance. If we assume $σ_{ref}^{2}$ = 1 without loss of generality, the definition of variance in the logarithmic domain becomes $α [dB] = 10 \log_{10} σ^{2}$ , leading to the relationship

σ = 10^{α / 20}

(16)

Taking into account expressions from (6) to (16), we obtain the final expression for SQNR of the FPQ

\begin{array}{l} SQNR (α) = - 10 \log_{10} [\frac{2^{- 2 (R - e - 1)}}{12 \cdot 10^{α / 10}} (2^{2 E_{\min}^{*}} \erf (\frac{2^{E_{\min}^{*} - 1 / 2}}{10^{α / 20}}) + \sum_{E^{*} = E_{\min}^{*}}^{E_{\max}^{*}} 2^{2 E^{*}} (\erf (\frac{2^{E^{*} + 1 / 2}}{10^{α / 20}}) - \erf (\frac{2^{E^{*} - 1 / 2}}{10^{α / 20}}))) \\ - \frac{x_{\max}}{10^{α / 20}} \sqrt{\frac{2}{π}} \exp (- \frac{x_{\max}^{2}}{2 \cdot 10^{α / 10}}) + (\frac{x_{\max}^{2}}{10^{α / 10}} + 1) erfc (\frac{x_{\max}}{10^{α / 20} \cdot \sqrt{2}})] \end{array}

(17)

Expression (17) is crucial as it enables us to evaluate the performance of FPQ, and consequently the FP format, for any variance α of measurement data. However, expression (17) presents a challenge due to its extensive number of terms, particularly notable in FP formats with e = 8, like bfloat16, FP24, and FP32, with 254 terms in the sum. This complexity significantly increases the computational burden of calculating SQNR. As additional difficulty, SQNR computation needs to be repeated for various values of α over a very wide range, exceeding 1500 dB for bfloat16, FP24, and FP32. Therefore, a key research objective is to simplify expression (17), aiming for a faster and more efficient method to assess FP format performance with minimal computational overhead, while maintaining accuracy. This paper focuses on addressing this critical issue.
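To make the cost of expression (17) concrete, the sketch below evaluates it term by term and cross-checks the FP32 case with a small Monte Carlo experiment. It is an illustration under stated assumptions, not the authors' MATLAB code: the function names are hypothetical, the bias $2^{e - 1} - 1$ is assumed, $x_{\max}$ is taken as $2^{E_{\max}^{*}} (2 - 2^{- m})$, and rounding to FP32 is emulated with Python's `struct` module using 200,000 samples rather than 10 million.

```python
import math
import random
import struct


def sqnr_exact(alpha_db, m, e=8):
    """Exact SQNR in dB from expression (17); m = R - e - 1."""
    bias = 2 ** (e - 1) - 1                    # assumption: IEEE-style bias
    E_min, E_max = 1 - bias, 2 ** e - 2 - bias
    s = 10.0 ** (alpha_db / 20.0)              # sigma from equation (16)
    x_max = 2.0 ** E_max * (2.0 - 2.0 ** -m)   # largest normal number

    # Granular part: subnormal term plus one term per normal segment.
    g = 4.0 ** E_min * math.erf(2.0 ** (E_min - 0.5) / s)
    for E in range(E_min, E_max + 1):          # 254 terms for e = 8
        g += 4.0 ** E * (math.erf(2.0 ** (E + 0.5) / s)
                         - math.erf(2.0 ** (E - 0.5) / s))
    g *= 4.0 ** -m / (12.0 * s * s)

    # Overload part: equation (15) divided by sigma^2.
    r = x_max / s
    ov = (-r * math.sqrt(2.0 / math.pi) * math.exp(-0.5 * r * r)
          + (r * r + 1.0) * math.erfc(r / math.sqrt(2.0)))
    return -10.0 * math.log10(g + ov)


def sqnr_float32_mc(alpha_db, n=200_000, seed=1):
    """Empirical SQNR of Gaussian samples rounded to FP32."""
    rng = random.Random(seed)
    s = 10.0 ** (alpha_db / 20.0)
    sig = err = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, s)
        q = struct.unpack("<f", struct.pack("<f", x))[0]   # nearest FP32 value
        sig += x * x
        err += (x - q) ** 2
    return 10.0 * math.log10(sig / err)


for name, m in (("FP32", 23), ("FP24", 15), ("bfloat16", 7)):
    print(name, round(sqnr_exact(0.0, m), 3))
print("FP32 Monte Carlo:", round(sqnr_float32_mc(0.0), 3))
```

At α = 0 dB the computed values should land close to the SQNR levels reported in the Introduction (about 151.9 dB, 103.75 dB, and 55.6 dB), and the Monte Carlo estimate should agree with the FP32 value to within sampling noise.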

Efficient performance evaluation of FP formats

Figure 1 shows the probabilities $P_{E^{*}}$ of segments $S_{E^{*}}$ , ( $E_{\min}^{*} \leq E^{*} \leq E_{\max}^{*}$ ) of the 32-bit FPQ, analogous to the FP32 format (see equation (14)), for various variance values α. We observe that for a given α, only a few FPQ segments exhibit non-negligible probabilities, while the others approach zero. This pattern holds across different FP formats and is not limited to FP32. An important conclusion follows: only a small number of FPQ segments impact SQNR. Thus, the summation in equation (17) can be simplified by including far fewer terms without sacrificing the accuracy of SQNR calculation. The rest of this paper is dedicated to implementing this idea.

Figure 1.

Probabilities of 32-bit FPQ segments for different values of the variance α.

Let $E_{x}^{*}$ denote the index of the FPQ segment with the highest probability for a given variance α, that is, $P_{E_{x}^{*}} (α) = \max_{E_{\min}^{*} \leq E^{*} \leq E_{\max}^{*}} P_{E^{*}} (α)$ . As illustrated in Figure 1, various α values yield different values of $E_{x}^{*}$ . It is evident that the segment with the highest probability $S_{E_{x}^{*}}$ , along with several adjacent segments, has the greatest impact on SQNR. Hence, the primary objective of this paper is to efficiently determine the value of $E_{x}^{*}$ for any value of variance α. One way to find $E_{x}^{*}$ is through exhaustive numerical computation, calculating the probabilities of all segments $S_{E^{*}}$ , $(E_{\min}^{*} \leq E^{*} \leq E_{\max}^{*})$ , and identifying the segment with the highest probability; this procedure should be repeated for each value of α and the resulting values of $E_{x}^{*}$ should be stored in memory. However, due to the large number of segments and the wide variance range (e.g. 254 segments and a variance range over 1500 dB wide for FP32, FP24, and bfloat16 formats), this method is computationally intensive and memory-demanding, rendering it impractical. Consequently, this paper proposes an alternative approach presented below, aiming to derive a closed-form expression for simple calculation of $E_{x}^{*}$ .

By differentiating the expression for $P_{E^{*}} (α)$ , obtained by putting equation (16) into equation (14), with respect to $E^{*}$ and solving the equation $\frac{d P_{E^{*}} (α)}{d E^{*}} = 0$ , as explained in the Appendix, we obtain

E_{x}^{*} = \frac{\ln (\frac{2}{3} \ln 2)}{2 \ln 2} + \frac{\ln 10}{20 \cdot \ln 2} α = - 0.557 + \frac{α}{6.0206}

(18)
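The stationarity condition behind expression (18) can be sketched in a few lines. This is a reconstruction from equations (14) and (16), treating $E^{*}$ as a continuous variable and introducing the auxiliary symbol $u$ (not used elsewhere in the paper); the full derivation is the one given in the Appendix.

```latex
\begin{aligned}
u &= \frac{2^{E^{*} - 1/2}}{\sigma}, \qquad
P_{E^{*}} = \tfrac{1}{2}\bigl(\operatorname{erf}(2u) - \operatorname{erf}(u)\bigr), \qquad
\frac{du}{dE^{*}} = u \ln 2, \\
\frac{dP_{E^{*}}}{dE^{*}}
  &= \frac{\ln 2}{\sqrt{\pi}}\, u \bigl(2 e^{-4u^{2}} - e^{-u^{2}}\bigr) = 0
  \;\Longrightarrow\; e^{3u^{2}} = 2
  \;\Longrightarrow\; u^{2} = \frac{\ln 2}{3}, \\
\frac{2^{2E_{x}^{*} - 1}}{\sigma^{2}} &= \frac{\ln 2}{3}
  \;\Longrightarrow\;
  E_{x}^{*} = \frac{\ln\bigl(\tfrac{2}{3}\ln 2\bigr)}{2\ln 2} + \log_{2}\sigma
            = \frac{\ln\bigl(\tfrac{2}{3}\ln 2\bigr)}{2\ln 2} + \frac{\ln 10}{20\ln 2}\,\alpha .
\end{aligned}
```

The last step uses $\sigma = 10^{\alpha/20}$ from equation (16), so that $\log_{2}\sigma = \alpha \ln 10 / (20 \ln 2)$.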

Note that expression (18) provides the value of $E_{x}^{*}$ for each specific value of variance α. However, $E_{x}^{*}$ obtained by equation (18) is a real number, but we need an integer value. Therefore, we initially round $E_{x}^{*}$ obtained by equation (18) to the nearest integer as

E_{x}^{*} = round (- 0.557 + \frac{α}{6.0206})

(19)

which gives accurate $E_{x}^{*}$ values for most values of α, but for some values of α results in an error of 1, yielding the index of a neighboring segment instead of the segment with the highest probability. To solve this problem, we introduce a constant C as a correction factor, so that equation (19) becomes

E_{x}^{*} = round (- 0.557 + C + \frac{α}{6.0206})

(20)

Analysis carried out over a very wide range of variance α (from −800 dB to 800 dB) shows that C = −0.048 provides accurate values for $E_{x}^{*}$ . Thus, the final expression for calculating $E_{x}^{*}$ for a specific value of variance α is

E_{x}^{*} = round (- 0.605 + \frac{α}{6.0206})

(21)

By introducing a small correction factor C, the rounding is finely tuned, ensuring accurate $E_{x}^{*}$ values. The simple calculation of $E_{x}^{*}$ for each value of variance α using the closed-form expression (21) represents a significant contribution of this paper.
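The closed-form expression (21) can be compared against the exhaustive search it replaces. The sketch below is illustrative: the function names are hypothetical, the constants assume e = 8, and `floor(x + 0.5)` is used as the rounding rule because Python's built-in `round` rounds halves to even. Exactly at a switching point between two segments, the two methods may differ by one index.

```python
import math

E_MIN, E_MAX = -126, 127     # formats with an 8-bit exponent


def ex_closed_form(alpha_db):
    """Index of the most probable segment from expression (21)."""
    return math.floor(-0.605 + alpha_db / 6.0206 + 0.5)


def ex_brute_force(alpha_db):
    """Index of the most probable segment by evaluating all 254 segments."""
    s = 10.0 ** (alpha_db / 20.0)

    def prob(E):             # equation (14) without the constant factor 1/2
        return math.erf(2.0 ** (E + 0.5) / s) - math.erf(2.0 ** (E - 0.5) / s)

    return max(range(E_MIN, E_MAX + 1), key=prob)


grid = range(-700, 701)      # variances for which E_x* stays inside [E_MIN, E_MAX]
diffs = [ex_closed_form(a) - ex_brute_force(a) for a in grid]
print("alpha = 0 dB:", ex_closed_form(0.0), ex_brute_force(0.0))
print("grid points where the two methods disagree:", sum(d != 0 for d in diffs))
```

For variances far outside the grid used here, expression (21) returns indices below $E_{\min}^{*}$ or above $E_{\max}^{*}$, so a caller would need to clamp the result to the valid range.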

As mentioned, only the distortion in the segment $S_{E_{x}^{*}}$ with the highest probability and in a few adjacent segments significantly impacts SQNR. This allows us to drastically reduce the number of terms in the sum of SQNR expression (17). Let ${SQNR}^{'}$ denote the approximate SQNR value, calculated using the following simplified formula

\begin{array}{l} {SQNR}^{'} (α) = - 10 \log_{10} \\ [\frac{2^{- 2 (R - e - 1)}}{12 \cdot 10^{α / 10}} (2^{2 E_{\min}^{*}} \erf (\frac{2^{E_{\min}^{*} - 1 / 2}}{10^{α / 20}}) + \sum_{E^{*} = E_{down}^{*}}^{E_{up}^{*}} 2^{2 E^{*}} (\erf (\frac{2^{E^{*} + 1 / 2}}{10^{α / 20}}) - \erf (\frac{2^{E^{*} - 1 / 2}}{10^{α / 20}}))) \\ - \frac{x_{\max}}{10^{α / 20}} \sqrt{\frac{2}{π}} \exp (- \frac{x_{\max}^{2}}{2 \cdot 10^{α / 10}}) + (\frac{x_{\max}^{2}}{10^{α / 10}} + 1) erfc (\frac{x_{\max}}{10^{α / 20} \cdot \sqrt{2}})], \end{array}

(22)

where

E_{down}^{*} = \max \{E_{x}^{*} - k_{1}, E_{\min}^{*}\}

(23)

and

E_{up}^{*} = \min \{E_{x}^{*} + k_{2}, E_{\max}^{*}\}

(24)

indicate the minimum and maximum values of the summation index. In addition to the segment $S_{E_{x}^{*}}$ itself, expression (22) takes into account only $k_{1}$ segments to its left and only $k_{2}$ segments to its right. Definitions (23) and (24) for $E_{down}^{*}$ and $E_{up}^{*}$ allow the summation in equation (22) to start from $E_{\min}^{*}$ if $E_{x}^{*} - k_{1} < E_{\min}^{*}$ and to end at $E_{\max}^{*}$ if $E_{x}^{*} + k_{2} > E_{\max}^{*}$ . Now, we need to determine the values of parameters $k_{1}$ and $k_{2}$ . Let $Δ {SQNR}_{\max} = \max | SQNR (α) - {SQNR}^{'} (α) |$ represent the maximum difference between the exact SQNR value calculated by equation (17) and the approximate value ${SQNR}^{'}$ calculated by equation (22), across the whole variance range of interest. Table 1 provides values of $Δ {SQNR}_{\max}$ for different combinations of $k_{1}$ and $k_{2}$ within the variance range from −800 dB to 800 dB, covering bfloat16, FP24, and FP32 formats. It is found that the same values of $Δ {SQNR}_{\max}$ are obtained for all three mentioned formats.

Table 1.

Maximal error $Δ {SQNR}_{\max}$ of SQNR calculation for different values of $k_{1}$ and $k_{2}$ .

$k_{1}$	2	1	3	2	2
$k_{2}$	2	2	2	1	3
$Δ {SQNR}_{\max}$ [dB]	0.024	0.087	0.022	2.251	0.011

From Table 1, we see that the combination ( $k_{1}$ = 2, $k_{2}$ = 1) results in an unacceptably large SQNR calculation error of 2.251 dB, so we will discard it. The remaining four parameter combinations are excellent solutions, yielding very small SQNR calculation errors. Note that the accuracy of approximation (22) for ( $k_{1}$ = 3, $k_{2}$ = 2) is better by only 0.002 dB compared to the case ( $k_{1}$ = 2, $k_{2}$ = 2) but requires an additional term in the SQNR expression, so we will discard it as well. The choice among the remaining three combinations depends on the SQNR calculation accuracy required by the specific application. The combination ( $k_{1}$ = 1, $k_{2}$ = 2) results in four terms in the SQNR sum in equation (22) but has the highest error, while ( $k_{1}$ = 2, $k_{2}$ = 3) produces six terms but the smallest error. This paper opts for ( $k_{1}$ = 2, $k_{2}$ = 2), which results in five terms in equation (22) and a maximal calculation error of just 0.024 dB. The one additional term in the SQNR sum in equation (22) produced by ( $k_{1}$ = 2, $k_{2}$ = 2) compared to ( $k_{1}$ = 1, $k_{2}$ = 2) is justified by a reasonable SQNR error reduction of 0.063 dB; conversely, the extra term produced by ( $k_{1}$ = 2, $k_{2}$ = 3) compared to ( $k_{1}$ = 2, $k_{2}$ = 2) is not justified, as the accuracy of approximation (22) improves by only 0.013 dB. The proposed method reduces the number of terms in the SQNR expression from a large number (254 for bfloat16, FP24, and FP32 formats) to just 5, significantly decreasing the complexity of SQNR calculation while keeping the error at a negligible 0.024 dB, which represents a significant achievement of this paper.
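The comparison between expressions (17) and (22) can be reproduced with a short script. The sketch below is illustrative, not the authors' implementation: the function names are hypothetical, the bias $2^{e - 1} - 1$ and $x_{\max} = 2^{E_{\max}^{*}} (2 - 2^{- m})$ are assumptions, and the scan covers an integer grid from −700 dB to 700 dB rather than the full range used for Table 1.

```python
import math


def sqnr(alpha_db, m, k1=None, k2=None, e=8):
    """SQNR in dB: expression (22) if k1 and k2 are given, else expression (17)."""
    bias = 2 ** (e - 1) - 1                        # assumption: IEEE-style bias
    E_min, E_max = 1 - bias, 2 ** e - 2 - bias
    s = 10.0 ** (alpha_db / 20.0)                  # sigma from equation (16)
    lo, hi = E_min, E_max                          # all segments by default
    if k1 is not None and k2 is not None:
        E_x = math.floor(-0.605 + alpha_db / 6.0206 + 0.5)     # expression (21)
        lo, hi = max(E_x - k1, E_min), min(E_x + k2, E_max)    # (23) and (24)

    g = 4.0 ** E_min * math.erf(2.0 ** (E_min - 0.5) / s)      # subnormal term
    for E in range(lo, hi + 1):                                # normal segments
        g += 4.0 ** E * (math.erf(2.0 ** (E + 0.5) / s)
                         - math.erf(2.0 ** (E - 0.5) / s))
    g *= 4.0 ** -m / (12.0 * s * s)

    r = (2.0 ** E_max * (2.0 - 2.0 ** -m)) / s                 # x_max / sigma
    ov = (-r * math.sqrt(2.0 / math.pi) * math.exp(-0.5 * r * r)
          + (r * r + 1.0) * math.erfc(r / math.sqrt(2.0)))     # overload term
    return -10.0 * math.log10(g + ov)


def max_error(m, k1, k2, alphas):
    """Largest gap between the exact and the approximate SQNR on a grid."""
    return max(abs(sqnr(a, m) - sqnr(a, m, k1, k2)) for a in alphas)


alphas = range(-700, 701)
err_22 = max_error(23, 2, 2, alphas)   # five-term sum
err_21 = max_error(23, 2, 1, alphas)   # four-term sum missing a right neighbour
print("k1 = 2, k2 = 2:", err_22)
print("k1 = 2, k2 = 1:", err_21)
```

Because the grid is coarser than a continuous scan, the printed maxima can only be lower bounds on the entries of Table 1; they should nevertheless show the same pattern, with the five-term sum staying within a few hundredths of a dB and the ( $k_{1}$ = 2, $k_{2}$ = 1) case off by more than 1 dB.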

Expression (22) for SQNR can be further simplified. Specifically, $D_{sub}$ is significant only for extremely small variances and negligible otherwise, while $D_{ov}$ is significant only for extremely large variances and negligible otherwise. Let us define the variance thresholds $α_{1}$ , $α_{2}$ , $α_{3}$ , and $α_{4}$ , which delimit the following five variance ranges:

Range I, with $α \leq α_{1}$ , is the range of extremely small variances where $D_{sub}$ dominates while $D_{n}$ and $D_{ov}$ are negligible; consequently, in expression (22), only the term originating from $D_{sub}$ remains, allowing expression (22) to be simplified as follows

{SQNR}_{I}^{'} (α) = - 10 \log_{10} [\frac{2^{2 (E_{\min}^{*} - R + e + 1)}}{12 \cdot 10^{α / 10}} \erf (\frac{2^{E_{\min}^{*} - 1 / 2}}{10^{α / 20}})], α \leq α_{1}

(25)

Range II, with $α_{1} < α \leq α_{2}$ , is the range of small variances where both $D_{sub}$ and $D_{n}$ have an influence, while $D_{ov}$ is negligible; hence, terms derived from $D_{sub}$ and $D_{n}$ remain in expression (22)

\begin{matrix} {SQNR}_{II}^{'} (α) = - 10 \log_{10} [\frac{2^{- 2 (R - e - 1)}}{12 \cdot 10^{α / 10}} (2^{2 E_{\min}^{*}} \erf (\frac{2^{E_{\min}^{*} - 1 / 2}}{10^{α / 20}}) + \sum_{E^{*} = E_{x}^{*} - 2}^{E_{x}^{*} + 2} 2^{2 E^{*}} (\erf (\frac{2^{E^{*} + 1 / 2}}{10^{α / 20}}) - \erf (\frac{2^{E^{*} - 1 / 2}}{10^{α / 20}})))], \\ α_{1} < α \leq α_{2} \end{matrix}

(26)

Range III, with $α_{2} < α \leq α_{3}$ , is the very wide range of variances where only $D_{n}$ dominates, while $D_{sub}$ and $D_{ov}$ are negligible; therefore, only the term originating from $D_{n}$ remains in equation (22), reducing it as follows

{SQNR}_{III}^{'} (α) = - 10 \log_{10} [\sum_{E^{*} = E_{x}^{*} - 2}^{E_{x}^{*} + 2} \frac{2^{2 (E^{*} - R + e + 1)}}{12 \cdot 10^{α / 10}} (\erf (\frac{2^{E^{*} + 1 / 2}}{10^{α / 20}}) - \erf (\frac{2^{E^{*} - 1 / 2}}{10^{α / 20}}))], α_{2} < α \leq α_{3}

(27)

Range IV, with $α_{3} < α \leq α_{4}$ , is the range of large variances where both $D_{n}$ and $D_{ov}$ have an influence, while $D_{sub}$ is negligible; hence, terms derived from $D_{n}$ and $D_{ov}$ remain in equation (22), simplifying it as follows

{SQNR}_{IV}^{'} (α) = - 10 \log_{10} [\sum_{E^{*} = E_{x}^{*} - 2}^{E_{x}^{*} + 2} \frac{2^{2 (E^{*} - R + e + 1)}}{12 \cdot 10^{α / 10}} (\erf (\frac{2^{E^{*} + 1 / 2}}{10^{α / 20}}) - \erf (\frac{2^{E^{*} - 1 / 2}}{10^{α / 20}})) - \frac{x_{\max}}{10^{α / 20}} \sqrt{\frac{2}{π}} \exp (- \frac{x_{\max}^{2}}{2 \cdot 10^{α / 10}}) + (\frac{x_{\max}^{2}}{10^{α / 10}} + 1) \operatorname{erfc} (\frac{x_{\max}}{10^{α / 20} \cdot \sqrt{2}})], α_{3} < α \leq α_{4}

(28)

Range V, with $α > α_{4}$ , is the range of very large variances where only $D_{ov}$ dominates, while $D_{sub}$ and $D_{n}$ are negligible; consequently, only the term originating from $D_{ov}$ remains in equation (22), allowing the following simplification of equation (22)

{SQNR}_{V}^{'} (α) = - 10 \log_{10} [- \frac{x_{\max}}{10^{α / 20}} \sqrt{\frac{2}{π}} \exp (- \frac{x_{\max}^{2}}{2 \cdot 10^{α / 10}}) + (\frac{x_{\max}^{2}}{10^{α / 10}} + 1) \operatorname{erfc} (\frac{x_{\max}}{10^{α / 20} \cdot \sqrt{2}})], α > α_{4}

(29)

Hence, the approximate expression (22) for SQNR can be written in a simplified way as

SQNR' (α) = \begin{cases} {SQNR}_{I}^{'} (α), & α \leq α_{1} \\ {SQNR}_{II}^{'} (α), & α_{1} < α \leq α_{2} \\ {SQNR}_{III}^{'} (α), & α_{2} < α \leq α_{3} \\ {SQNR}_{IV}^{'} (α), & α_{3} < α \leq α_{4} \\ {SQNR}_{V}^{'} (α), & α > α_{4} \end{cases}

(30)
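For the widest range (Range III), expression (27) reduces to a five-term sum that is easy to evaluate numerically. The following Python sketch illustrates this under stated assumptions: $E_{x}^{*}$ is taken as −0.557 + α/6.0206 rounded to the nearest integer, the exponent limits are assumed to be −126 and 127 for e = 8, and the function names are hypothetical. It is not the authors' MATLAB code.

```python
import math


def highest_probability_segment(alpha_db):
    """Closed-form segment index, rounded to the nearest integer (assumption)."""
    return math.floor(-0.557 + alpha_db / 6.0206 + 0.5)


def sqnr_range_iii(alpha_db, R, e=8, k1=2, k2=2, e_min=-126, e_max=127):
    """Approximate SQNR in dB from the truncated granular sum of expression (27)."""
    sigma = 10.0 ** (alpha_db / 20.0)       # standard deviation from alpha in dB
    e_x = highest_probability_segment(alpha_db)
    e_down = max(e_x - k1, e_min)
    e_up = min(e_x + k2, e_max)
    total = 0.0
    for E in range(e_down, e_up + 1):
        # Probability weight of segment E (difference of two erf terms).
        p = math.erf(2.0 ** (E + 0.5) / sigma) - math.erf(2.0 ** (E - 0.5) / sigma)
        # Normalized step-size term 2^(2(E - R + e + 1)) / (12 sigma^2).
        step = (2.0 ** (E - R + e + 1) / sigma) ** 2 / 12.0
        total += step * p
    return -10.0 * math.log10(total)
```

With R = 16, 24, and 32 (bfloat16, FP24, and FP32, all with e = 8), the sketch evaluates to approximately 55.60 dB, 103.76 dB, and 151.93 dB at α = 0 dB.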

Table 2 provides the empirically obtained values for $α_{1}$ , $α_{2}$ , $α_{3}$ , and $α_{4}$ for the bfloat16, FP24, and FP32 formats. It can be observed that across the widest range of variances $α_{2} < α \leq α_{3}$ only $D_{n}$ , easily calculated as the sum of just five terms, has an influence on SQNR. It should be noted that the analysis of the impact of different distortion components ( $D_{sub}$ , $D_{n}$ , and $D_{ov}$ ) on SQNR for various variance values α, along with the resulting simplifications in expression (30), is another key outcome of this paper.

Table 2.

Values of variance thresholds $α_{1}$ , $α_{2}$ , $α_{3}$ , and $α_{4}$ for different FP formats.

	$α_{1}$ [dB]	$α_{2}$ [dB]	$α_{3}$ [dB]	$α_{4}$ [dB]
bfloat16	−770	−735	756	761
FP24	−770	−735	753	757
FP32	−770	−735	752	754
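Selecting the branch of expression (30) amounts to comparing α with the four thresholds of Table 2. A minimal Python sketch of that lookup is given below; the dictionary and function names are hypothetical, and the threshold values are copied from Table 2.

```python
# Thresholds (alpha_1, alpha_2, alpha_3, alpha_4) in dB, copied from Table 2.
THRESHOLDS_DB = {
    "bfloat16": (-770, -735, 756, 761),
    "FP24": (-770, -735, 753, 757),
    "FP32": (-770, -735, 752, 754),
}


def variance_range(alpha_db, fmt):
    """Return the label (I to V) of the branch of expression (30) that applies."""
    a1, a2, a3, a4 = THRESHOLDS_DB[fmt]
    if alpha_db <= a1:
        return "I"    # only D_sub matters
    if alpha_db <= a2:
        return "II"   # D_sub and D_n
    if alpha_db <= a3:
        return "III"  # only D_n (five-term sum)
    if alpha_db <= a4:
        return "IV"   # D_n and D_ov
    return "V"        # only D_ov
```

For example, α = 758 dB falls in Range IV for bfloat16 but in Range V for FP32, because the upper thresholds differ between the formats.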

Let us now comment on the accuracy of approximation (30). When expression (30) is applied, the values of $Δ SQNR_{\max}$ remain the same as in Table 1, indicating that the simplifications introduced in expression (30) do not increase the SQNR calculation error.

Table 3 provides the exact SQNR values calculated by equation (17), as well as the approximate values $SQNR'$ calculated by equation (30), for different values of α and the three FP formats considered: bfloat16, FP24, and FP32. The values of SQNR and $SQNR'$ are very close, which shows that the approximate expression (30) significantly reduces the computational complexity of calculating SQNR compared to the exact expression (17) without a meaningful loss of accuracy. To verify the theory, we simulated each case in Table 3 in MATLAB by generating 10 million Gaussian random numbers with the given variance. After the generated data are processed by the FPQ corresponding to the given FP format, the SQNR values (denoted ${SQNR}_{sim}$ ) are calculated. The simulated SQNR values closely match the theoretical ones, supporting the accuracy of the theory.

Table 3.

Values of SQNR, $SQNR'$ , and ${SQNR}_{sim}$ for different variance values and different FP formats.

α [dB]		−750	−500	−250	0	250	500	750
bfloat16	$SQNR$ [dB]	55.344	55.594	55.617	55.589	55.621	55.586	55.623
	$SQNR'$ [dB]	55.344	55.600	55.622	55.599	55.625	55.600	55.627
	${SQNR}_{sim}$ [dB]	55.344	55.592	55.613	55.588	55.625	55.588	55.620
FP24	$SQNR$ [dB]	103.508	103.758	103.782	103.754	103.785	103.751	103.788
	$SQNR'$ [dB]	103.508	103.765	103.787	103.764	103.790	103.766	103.792
	${SQNR}_{sim}$ [dB]	103.508	103.761	103.778	103.754	103.786	103.751	103.792
FP32	$SQNR$ [dB]	151.673	151.923	151.946	151.919	151.950	151.916	151.952
	$SQNR'$ [dB]	151.673	151.930	151.952	151.929	151.955	151.930	151.957
	${SQNR}_{sim}$ [dB]	151.676	151.923	151.943	151.918	151.951	151.914	151.947
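The simulation procedure behind the ${SQNR}_{sim}$ rows can be illustrated with a short Python sketch. It is not the MATLAB code used in the paper: the quantizer is assumed to round each sample to the nearest value with m mantissa bits, subnormal numbers and overflow are ignored (so it is only meaningful well inside Range III), far fewer samples are used to keep pure Python fast, and all names are hypothetical.

```python
import math
import random


def quantize_fp(x, m):
    """Round x to the nearest value with m mantissa bits (no subnormals/overflow)."""
    if x == 0.0:
        return 0.0
    mant, exp = math.frexp(x)    # x = mant * 2**exp with 0.5 <= |mant| < 1
    scale = 2.0 ** (m + 1)       # leading bit plus m fractional mantissa bits
    return math.ldexp(round(mant * scale) / scale, exp)


def simulate_sqnr(alpha_db, m, n=200_000, seed=0):
    """Monte Carlo estimate of SQNR in dB for zero-mean Gaussian data."""
    rng = random.Random(seed)    # fixed seed for a repeatable estimate
    sigma = 10.0 ** (alpha_db / 20.0)
    signal_power = 0.0
    noise_power = 0.0
    for _ in range(n):
        x = rng.gauss(0.0, sigma)
        err = x - quantize_fp(x, m)
        signal_power += x * x
        noise_power += err * err
    return 10.0 * math.log10(signal_power / noise_power)
```

For bfloat16, m = R − e − 1 = 7, and `simulate_sqnr(0.0, 7)` should land close to the bfloat16 entries of Table 3 at α = 0 dB, up to statistical fluctuation.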

Using SQNR as a key performance measure, we can effectively compare the three FP formats under consideration: FP32, FP24, and bfloat16. Based on the results presented in Table 3, we can conclude that all three formats maintain a nearly constant SQNR across an extensive variance range from approximately −750 dB to beyond 750 dB. This similarity of variance ranges originates from the identical value of the parameter e (e = 8) for all three formats, which predominantly determines the width of the variance range. Within this range, the SQNR values are approximately 151.9 dB for FP32, 103.75 dB for FP24, and 55.6 dB for bfloat16. Notably, reducing the FP format resolution by 8 bits decreases SQNR by roughly 48.15 dB. However, this reduction in resolution also reduces implementation complexity. In practical applications, balancing implementation complexity against performance is essential.
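The step of roughly 48 dB per 8 bits can be read directly from expression (27): the word length $R$ enters only through the factor $2^{- 2 R}$ , so for a fixed $e$ a change of $Δ R$ bits shifts the SQNR by

Δ SQNR = 10 \log_{10} 2^{2 Δ R} = 20 Δ R \log_{10} 2 \approx 6.02 \cdot Δ R \text{ dB}

For $Δ R$ = 8 this gives about 48.16 dB, in line with the differences between the rows of Table 3 (for example, 151.919 − 103.754 = 48.165 dB at α = 0 dB).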

The results of this paper impact the implementation of intelligent measurement systems by simplifying the performance evaluation of FP formats, thereby making it easier to select the most suitable FP format for each specific application. Considering that implementation complexity is a key concern for resource-constrained intelligent measurement systems, the authors recommend selecting the FP format with the lowest complexity that still meets the application’s performance requirements.

Figure 2 shows the flowchart of the applied methodology for calculating $SQNR'$ , taking into account the conditions for variance α.

Figure 2.

The flowchart of the applied methodology for calculating $SQNR'$ .

Finally, some limitations of the applied approach will be outlined. First, it is not recommended to use the described method for parameter values of e ≤ 3 and m ≤ 3, which occur in FP formats with very low resolutions. For e ≤ 3, the number of terms within the sum in the exact SQNR expression is not large, eliminating the need for simplifications. For m ≤ 3, the number of uniform quantization levels in each segment of the PUQ is less than 8, meaning the accuracy of formulas (8) and (10) for distortion, which were derived using asymptotic analysis (i.e. under the assumption of a sufficiently large number of quantization levels), may be insufficient. In addition, the presented method assumes that the Gaussian PDF has a zero mean. If this is not the case, the method can still be applied, but it requires subtracting the mean of the data set from each data point to center the data around zero. The method is then applied to the adjusted data set, with the mean added back at the final stage.

Experimental results

The theoretical SQNR results are further validated by experimental findings, obtained by applying the FP32, FP24, and bfloat16 formats to real data. These data comprise weights from a trained MLP neural network (Kruse et al., 2022) used for classification of images from the MNIST data set (Baldominos et al., 2019). The MLP architecture consists of input, hidden, and output layers with 784, 128, and 10 nodes, respectively. Hyperparameters were set as follows: regularization rate = 0.01, learning rate = 0.0005, and mini-batch size = 128. After 20 epochs, the MLP achieved accuracy scores of 0.9705 on the training set and 0.9686 on the test set. We used the weights between the input and hidden layers, amounting to M = 784 × 128 = 100,352 values. Figure 3 shows the histogram of the MLP weights, indicating that they closely follow a zero-mean Gaussian distribution.

Figure 3.

Histogram of MLP weights.

Let the MLP weights be denoted as $w_{i}$ , $i = 1, \dots, M$ . The variance of these weights is $σ_{0}^{2} = \frac{1}{M} \sum_{i = 1}^{M} w_{i}^{2}$ , giving the logarithmic variance value of $α_{0} = 10 \cdot \log_{10} σ_{0}^{2}$ = −19.725 dB. Let ${SQNR}_{\exp}$ and $SQN {R'}_{\exp}$ denote the experimentally obtained SQNR values according to the exact formula (17) and approximate formula (30), respectively. Table 4 presents the theoretically calculated exact value SQNR and approximate value $SQNR'$ for the variance $α_{0}$ , alongside the experimentally obtained values ${SQNR}_{\exp}$ and $SQN {R'}_{\exp}$ . The excellent agreement between theoretical and experimental results confirms the accuracy of the theoretical model.

Table 4.

Values of SQNR, $SQNR'$ , ${SQNR}_{\exp}$ , and $SQN {R'}_{\exp}$ for the variance $α_{0}$ and different FP formats.

	Theory		Experiment
	$SQNR$ [dB]	$SQNR'$ [dB]	${SQNR}_{\exp}$ [dB]	$SQN {R'}_{\exp}$ [dB]
bfloat16	55.619	55.622	55.615	55.597
FP24	103.784	103.787	103.764	103.748
FP32	151.949	151.952	151.939	151.925

Our next goal is to experimentally verify the accuracy of the theoretical results over a wide variance range. By scaling the values of the MLP weights with an appropriate constant c, we can generate a data set of real data $z_{i} = c w_{i}$ , $i = 1, \dots, M$ , with any desired variance $α = 10 \cdot \log_{10} σ^{2}$ , where $σ^{2} = \frac{1}{M} \sum_{i = 1}^{M} c^{2} w_{i}^{2} = c^{2} σ_{0}^{2}$ . Consequently, $α = 10 \cdot \log_{10} c^{2} σ_{0}^{2} = 20 \cdot \log_{10} c + 10 \cdot \log_{10} σ_{0}^{2} = 20 \cdot \log_{10} c + α_{0}$ , allowing us to obtain $c = 10^{(α - α_{0}) / 20}$ . Table 5 presents the experimentally obtained values of ${SQNR}_{\exp}$ and $SQN {R'}_{\exp}$ for different values of variance α. A comparison between Tables 3 and 5 shows excellent alignment between theoretical and experimental results across all considered values of α, thereby validating the theoretical analysis over a wide range of variances.

Table 5.

Values of ${SQNR}_{\exp}$ and $SQN {R'}_{\exp}$ for different variance values and different FP formats.

α [dB]		−750	−500	−250	0	250	500	750
bfloat16	${SQNR}_{\exp}$ [dB]	55.345	55.583	55.636	55.575	55.621	55.585	55.605
	$SQN {R'}_{\exp}$ [dB]	55.345	55.604	55.603	55.562	55.592	55.604	55.584
FP24	${SQNR}_{\exp}$ [dB]	103.521	103.720	103.801	103.809	103.824	103.737	103.822
	$SQN {R'}_{\exp}$ [dB]	103.521	103.748	103.769	103.795	103.796	103.751	103.800
FP32	${SQNR}_{\exp}$ [dB]	151.657	151.902	151.973	151.911	151.958	151.889	151.954
	$SQN {R'}_{\exp}$ [dB]	151.657	151.919	151.941	151.897	151.929	151.920	151.928
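The rescaling used to generate data with a prescribed variance, $c = 10^{(α - α_{0}) / 20}$ , can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, and any list of numbers can stand in for the MLP weights, which are not reproduced here.

```python
import math


def log_variance_db(data):
    """Logarithmic variance 10*log10(mean of squares), as defined for alpha_0."""
    return 10.0 * math.log10(sum(v * v for v in data) / len(data))


def scale_to_variance(data, alpha_target_db):
    """Scale the data by c = 10**((alpha - alpha_0) / 20) to reach the target variance."""
    alpha_0 = log_variance_db(data)
    c = 10.0 ** ((alpha_target_db - alpha_0) / 20.0)
    return [c * v for v in data], c
```

Applying `scale_to_variance(weights, 250.0)` to a list `weights` returns the scaled data, whose logarithmic variance equals 250 dB up to floating-point rounding, together with the constant c that was used.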

Conclusion

This paper aims to develop a simple and efficient method for evaluating the performance of FP formats for Gaussian-distributed data. Starting from the analogy between FP formats and a PUQ, this paper identifies the high complexity in the SQNR expression of PUQ, which involves summing a large number of terms (e.g. 254 for bfloat16, FP24, and FP32 formats), and offers several important contributions to address this problem.

One key contribution is demonstrating that only a small number of PUQ segments significantly impact SQNR. This insight leads to the idea that the SQNR expression can be simplified by retaining only a few terms related to the segment with the highest probability and its adjacent segments. Since the segment with the highest probability depends on the variance, this paper provides a simple closed-form expression for determining this segment for any variance value. Another important result is an approximate SQNR expression with just five terms, significantly simplifying the original SQNR expression with 254 terms while maintaining accuracy with a maximum error of only 0.024 dB. The SQNR expression is further simplified by determining the ranges of variance in which certain distortion components are negligible. The theoretical results are confirmed by MATLAB simulations and an experiment conducted with the weights of a trained MLP neural network. Although the numerical results are provided for bfloat16, FP24, and FP32 formats with an 8-bit exponent due to their pronounced complexity, the findings can be applied to other FP formats, offering general significance.

As a final result, this paper presents a simple yet efficient method for calculating FP format performance, particularly important for implementing resource-constrained intelligent DNN-based sensor nodes and edge measurement devices where selecting the most suitable FP format for each specific application is crucial. The findings are also applicable in other fields, such as computing and signal processing.

Footnotes

Appendix

Given that $\frac{d (\erf (x))}{dx} = \frac{2}{\sqrt{π}} \exp (- x^{2})$ , it is found that $\frac{d}{d E^{*}} (\erf (\frac{2^{E^{*} + 1 / 2}}{σ})) = \frac{2^{E^{*} + 1 / 2}}{σ} \frac{2 \ln 2}{\sqrt{π}} \exp (- \frac{2^{2 E^{*} + 1}}{σ^{2}})$ and $\frac{d}{d E^{*}} (\erf (\frac{2^{E^{*} - 1 / 2}}{σ})) = \frac{2^{E^{*} - 1 / 2}}{σ} \frac{2 \ln 2}{\sqrt{π}} \exp (- \frac{2^{2 E^{*} - 1}}{σ^{2}})$ . Based on this, for $P_{E^{*}} (σ) \equiv P_{E^{*}} (σ, E^{*})$ defined with equation (14), we have

\frac{\partial P_{E^{*}} (σ, E^{*})}{\partial E^{*}} = \frac{2^{E^{*}} \ln 2}{σ \sqrt{2 π}} \exp (- \frac{2^{2 E^{*}}}{2 σ^{2}}) \cdot [2 {(\exp (- \frac{2^{2 E^{*}}}{2 σ^{2}}))}^{3} - 1]

(31)

For $E^{*} = E_{x}^{*}$ , this derivative should equal zero, that is, $\frac{\partial P_{E^{*}} (σ, E^{*})}{\partial E^{*}} |_{E^{*} = E_{x}^{*}} = 0$ . From equation (31), it is obtained that

2 {(\exp (- \frac{2^{2 E_{x}^{*}}}{2 σ^{2}}))}^{3} - 1 = 0

(32)

This condition implies $\exp (- \frac{2^{2 E_{x}^{*}}}{2 σ^{2}}) = (1 / 2)^{1 / 3}$ . By taking the logarithm of this expression, it is obtained that $2^{2 E_{x}^{*}} = \frac{2}{3} σ^{2} \ln 2$ . Applying the logarithmic function to this expression as well, it follows that

E_{x}^{*} = \frac{\ln (\frac{2}{3} σ^{2} \ln 2)}{2 \ln 2} = \frac{\ln (\frac{2}{3} \ln 2)}{2 \ln 2} + \frac{\ln (σ^{2})}{2 \ln 2} .

(33)

The first term in equation (33) is a constant. Given that, according to equation (16), $σ^{2} = 10^{α / 10}$ , it is finally obtained

E_{x}^{*} = \frac{\ln (\frac{2}{3} \ln 2)}{2 \ln 2} + \frac{\ln 10}{20 \cdot \ln 2} α = - 0.557 + \frac{α}{6.0206}

(34)

confirming the validity of expression (18).
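The constants in equation (34) and the location of the maximum can be checked numerically. The Python sketch below assumes that $P_{E^{*}}$ is proportional to the difference of the two erf terms differentiated above (a constant factor does not move the maximum); the function names are hypothetical.

```python
import math

# Constant term and slope denominator of equation (34).
CONST = math.log((2.0 / 3.0) * math.log(2.0)) / (2.0 * math.log(2.0))
SLOPE_DEN = 20.0 * math.log(2.0) / math.log(10.0)


def p_segment(E, sigma):
    """Segment probability up to a constant factor, with E treated as continuous."""
    return math.erf(2.0 ** (E + 0.5) / sigma) - math.erf(2.0 ** (E - 0.5) / sigma)


def ex_continuous(sigma):
    """Stationary point from equation (33)."""
    return math.log((2.0 / 3.0) * sigma ** 2 * math.log(2.0)) / (2.0 * math.log(2.0))
```

Evaluating `CONST` and `SLOPE_DEN` reproduces −0.557 and 6.0206 to the quoted precision, and `p_segment` is larger at `ex_continuous(sigma)` than at nearby values of E, consistent with a maximum.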

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Ministry of Science, Technological Development and Innovation of the Republic of Serbia (grant no. 451-03-65/2024-03/200102) as well as by the European Union’s Horizon 2023 research and innovation program through the AIDA4Edge Twinning project (grant ID 101160293).

ORCID iD

Milan R Dinčić

Data availability statement

Data sharing is not applicable to this article as no data sets were generated or analyzed during the current study.

References

Agrawal, Mueller, Fleischer, et al. (2019) DLFloat: A 16-b floating point format designed for deep learning training and inference. In: 2019 IEEE 26th symposium on computer arithmetic (ARITH), Kyoto, Japan, 10–12 June, pp. 92–95. Piscataway, NJ: IEEE.

Al Koutayni, Reis, Stricker (2023) DeepEdgeSoC: End-to-end deep learning framework for edge IoT devices. Internet of Things 21: 100665.

Alamri, Ansari, Hassan, et al. (2013) A survey on sensor-cloud: Architecture, applications, and approaches. International Journal of Distributed Sensor Networks 9: 917923.

Anandanatarajan, Mangalanathan, Gandhi (2023) Deep neural network-based linearization and cold junction compensation of thermocouple. IEEE Transactions on Instrumentation and Measurement 72: 2500609.

Baldominos, Saez, Isasi (2019) A survey of handwritten character recognition with MNIST and EMNIST. Applied Sciences 9: 3169.

Balemans, Hooft, Reiter, et al. (2023) R2L-SLAM: Sensor fusion-driven SLAM using mmWave Radar, LiDAR and deep neural networks. In: 2023 IEEE sensors, Vienna, Austria, 29 October–1 November, pp. 1–4. Piscataway, NJ: IEEE.

Bey, Chemachema (2024) Control-error-based output-feedback adaptive decentralized neural network controller for interconnected uncertain strict-feedback nonlinear systems with input saturation. Transactions of the Institute of Measurement and Control 46: 1529–1541.

Chaudhuri, Zhang, et al. (2021) An attention-based deep sequential GRU model for sensor drift compensation. IEEE Sensors Journal 21: 7908–7917.

Chen, S-J (2022) Analysis and application of deep learning in novel MEMS capacitive pressure sensor. In: 2022 IEEE 5th international conference on knowledge innovation and invention (ICKII), Hualien, Taiwan, 22–24 July, pp. 45–47. IEEE.

Chen, Liu, Meng, et al. (2024) AFC: An adaptive lossless floating-point compression algorithm in time series database. Information Sciences 654: 119847.

Chen, Wang, X-B, Yang, Z-X (2023) Semi-supervised self-correcting graph neural network for intelligent fault diagnosis of rotating machinery. IEEE Transactions on Instrumentation and Measurement 72: 3536611.

Degachi, Ouni (2025) An attention-augmented bidirectional LSTM-based encoder–decoder architecture for electrocardiogram heartbeat classification. Transactions of the Institute of Measurement and Control 47: 506–515.

Dincic, Peric, Denic (2014) Linearization of the product polar quantizer for A/D conversion of measurement signals. Transactions of the Institute of Measurement and Control 36: 853–864.

Dincic, Peric, Denic (2016) Uniform polar quantizer with three-stage hierarchical variable-length coding for measurement signals with Gaussian distribution. Measurement 88: 214–222.

Dinčić, Perić, Denić, et al. (2023) Optimization of the fixed-point representation of measurement data for intelligent measurement systems. Measurement 217: 113037.

Dincic, Peric, Petkovic, et al. (2013) Design of product polar quantizers for A/D conversion of measurement signals with Gaussian distribution. Measurement 46: 2441–2446.

Fanariotis, Orphanoudakis, Kotrotsios, et al. (2023) Power efficient machine learning models deployment on edge IoT devices. Sensors 23: 1595.

Junaid, Arslan, Lee, et al. (2022) Optimal architecture of floating-point arithmetic for neural network training processor. Sensors 22: 1230.

Kim, Vecchietti, Choi, et al. (2021) Machine learning for advanced wireless sensor networks: A review. IEEE Sensors Journal 21: 12379–12397.

Kruse, Mostaghim, Borgelt, et al. (2022) Multi-layer perceptrons. In: Computational Intelligence. Berlin: Springer, pp. 53–124.

Lai, M-H, Chang, K-S (2023) AI sensor applications in edge computing. IEEE Nanotechnology Magazine 17: 23–28.

Luo, et al. (2024) Automatic searching of lightweight and high-performing CNN architectures for EEG-based driving fatigue detection. IEEE Transactions on Instrumentation and Measurement 73: 2519211.

Liang, Huang (2023) Survey on deep learning-based 3D object detection in autonomous driving. Transactions of the Institute of Measurement and Control 45: 761–776.

Liu, Yang (2024) Intelligent fault diagnosis based on improved convolutional neural network for small sample and imbalanced bearing data. Transactions of the Institute of Measurement and Control 46: 3203–3214.

Lou, Atoui (2024) Recent deep learning models for diagnosis and health monitoring: A review of research works and future challenges. Transactions of the Institute of Measurement and Control 46: 2833–2870.

Luo, Lei, Wang (2024) Gaussian mixture model sample selection strategy–based active semi-supervised soft sensor for industrial processes. Transactions of the Institute of Measurement and Control 46: 962–972.

Marco, Neuhoff (2006) Low rate scalar quantization for Gaussian sources and absolute error. In: 2006 IEEE international symposium on information theory, Seattle, WA, 9–14 July, pp. 2551–2553. Piscataway, NJ: IEEE.

Mohapatra, Aggarwal, Tripathy (2024) Automated recognition of hand gestures from multichannel EMG sensor data using time–frequency domain deep learning for IoT applications. IEEE Sensors Letters 8: 1–4.

Moroz, Samotyy (2019) Efficient floating-point division for digital signal processing application. IEEE Signal Processing Magazine 36: 159–163.

Mukhopadhyay, Tyagi, SKS, Suryadevara, et al. (2021) Artificial intelligence-based sensors for next generation IoT applications: A review. IEEE Sensors Journal 21: 24920–24932.

Perić, Dinčić (2023) Optimization of the 24-bit fixed-point format for the Laplacian source. Mathematics 11: 568.

Perić, Savić, Dinčić, et al. (2021) Floating point and fixed point 32-bits quantizers for quantization of weights of neural networks. In: 2021 IEEE 12th international symposium on advanced topics in electrical engineering (ATEE), Bucharest, Romania, 25–27 March, pp. 1–4. Piscataway, NJ: IEEE.

Ronco, Schulthess, Zehnder, et al. (2022) Machine learning in-sensors: Computation-enabled intelligent sensors for next generation of IoT. In: 2022 IEEE sensors, Dallas, TX, 30 October–2 November, pp. 1–4. Piscataway, NJ: IEEE.

Sahni, Cao, Yang, et al. (2022) Distributed resource scheduling in edge computing: Problems, solutions, and opportunities. Computer Networks 219: 109430.

Sun (2021) A survey on deep learning for data-driven soft sensors. IEEE Transactions on Industrial Informatics 17: 5853–5866.

Syed, Ulbricht, Piotrowski, et al. (2021) Fault resilience analysis of quantized deep neural networks. In: 2021 IEEE 32nd international conference on microelectronics (MIEL), Niš, Serbia, 12–14 September, pp. 275–279. Piscataway, NJ: IEEE.

Syu, Lee (2024) One-dimensional binary convolutional neural network accelerator design for bearing fault diagnosis. IEEE Sensors Journal 24: 3649–3658.

Wang, Choi, Brand, et al. (2018) Training deep neural networks with 8-bit floating point numbers. In: 2018 32nd conference on neural information processing systems (NIPS 2018), Montréal, QC, Canada, 3 December, pp. 7686–7695. Red Hook, NY: Curran Associates.

Lin, Zhang, et al. (2023) Detecting inaccurate sensors on a large-scale sensor network using centralized and localized graph neural networks. IEEE Sensors Journal 23: 16446–16455.

R-D, T-N, Xiao, et al. (2024) Design and implementation of an adaptive neural network observer–based backstepping sliding mode controller for robot manipulators. Transactions of the Institute of Measurement and Control 46: 1093–1104.

Yadav, Ngai, ECH, Gupta, et al. (2024) Machine learning on edge in sensor systems. In: 2024 IEEE 3rd workshop on machine learning on edge in sensor systems (SenSys-ML), Hong Kong, China, 13–16 May, pp. 1–2. Piscataway, NJ: IEEE.

Zhou, Xiang, et al. (2021) Self-organizing probability neural network-based intelligent non-intrusive load monitoring with applications to low-cost residential measuring devices. Transactions of the Institute of Measurement and Control 43: 635–645.

Simplified performance evaluation of floating-point formats for implementing intelligent measurement systems
