Sage Journals: Discover world-class research

Abstract

Image compression is crucial for reducing storage and transmission costs, particularly in applications involving high-resolution and complex imagery. Traditional compression methods, such as JPEG, PNG, and newer lossless formats like JPEG XL and WebP, often suffer from suboptimal compression ratios (CRs) and image quality when handling modern high-definition content. To overcome these limitations, this paper proposes a novel deep learning-based lossless image compression method. The symmetrical transformer (STF) model is introduced, integrating transformer blocks in both the downsampling encoder and upsampling decoder to enhance the capture of local and global features. The model also includes a multivariate mixture distribution channel conditioning (MMCC) entropy model, which improves pixel dependency predictions by modeling complex relationships within image channels. Additionally, an automated searching of optimal kernel shapes (SOKS) is employed to dynamically configure kernel sizes, optimizing the convolutional layers for different image regions. The system also applies stripe-wise pruning (SWP), which selectively prunes unimportant features during compression, reducing computational complexity and memory usage without compromising image quality. Extensive evaluations on standard datasets, including Kodak, CLIC, and DIV2K, demonstrate the effectiveness of the proposed model. Specifically, the approach achieves significant compression efficiency, with bits-per-dimension (BPD) values of 3.06 on Kodak, 3.91 on CLIC, and 3.63 on DIV2K, outperforming existing methods such as iVPF (3.20 BPD on Kodak), GOLLIC (3.15 BPD on Kodak), and MSPSM (3.12 BPD on Kodak). In addition to compression efficiency, the proposed method excels in inference speed, with encoding times as low as 7.80 ms per sample on Kodak, significantly faster than competing methods. These results demonstrate a substantial improvement in compression rates and image reconstruction quality, highlighting the model's potential for real-world applications, including medical imaging, remote sensing, and real-time streaming, offering a significant advancement in lossless image compression.

Keywords

lossless image compression, MMCC, SOKS, entropy coding, transformer encoder and decoder

1 Introduction

Image compression refers to reducing the amount of data needed to store a digital image. There are two primary approaches: lossless and lossy compression. Lossless compression reduces the image's size without any loss of information, making it suitable for medical imaging, scientific data, and other cases where preserving the original data is crucial. Lossy compression, on the other hand, sacrifices some image detail to achieve greater compression, making it more versatile for various practical purposes (Wang et al., 2023). Traditional image compression approaches, including the free lossless image format (FLIF) (Sneyers & Wuille, 2016), JPEG2000 (Rabbani & Joshi, 2002), and WebP (WebP Image Format), use manually designed encoding and decoding techniques to capture spatial correlations among pixels in an image. These methods have been effective in their own right.

In recent years, deep learning has advanced significantly, leading to learning-based image compression techniques that outperform traditional methods. Both traditional and learning-based lossless compression methods share the goal of approximating the true distribution of image data. Flow-based models, such as invertible volume preserving flow (iVPF), offer an advantage by enabling exact likelihood optimization through bijective mappings (Zhang et al., 2021). For example, iVPF introduced the modular affine transformation (MAT) algorithm, which achieves precise bijective mapping without numerical errors.

Other learning-based lossless compression methods, like L3C, have also demonstrated exceptional performance. L3C utilizes a fully parallel hierarchical probabilistic model and surpasses traditional compression formats like WebP, PNG, and JPEG2000. Similarly, RC utilizes BPG for lossy reconstruction and employs the RC (residual compressor) network for lossless compression (Mentzer et al., 2019, 2020). In a related approach, an end-to-end lossless image compression framework was proposed, building on prior work in lossy image compression (Lee et al., 2019). This framework, as detailed in Cheng et al. (2020), employs an autoregressive model for latent variables to enhance performance. It's worth noting that autoregressive models, while powerful, can be computationally intensive.

Neural coding, especially in the context of neural image compression (NIC), has gained significant attention from both the research community and industry. This emerging field has produced promising solutions for end-to-end image compression, outperforming traditional methods in terms of coding efficiency. NIC relies on autoencoders (AEs) to perform a non-linear transformation of the input signal into a compact representation (Ghorbel et al., 2023). This AE-based system consists of three main components: a transformation stage, quantization, and entropy coding. These components can be trained end-to-end with the aim of minimizing distortion among the source image and its reconstructed version while minimizing the data rate needed to transmit the latent representation bitstream.

This article introduces an optimal kernel transformer approach for learned lossless image compression. It combines architectural ideas drawn from autoregressive models, SOKS, recurrent neural network (RNN) entropy coding, and window-based attention transformers. The key contributions of this paper are as follows.

To introduce channel-conditional (CC) models that efficiently capture pixel redundancies in image compression, addressing the time-consuming nature of autoregressive models.

To enhance entropy modeling through the multivariate mixture distribution channel conditioning (MMCC) model, surpassing the performance of traditional Spatial Gaussian mixture (SGM)-based entropy models.

To adopt a sequence-to-sequence RNN architecture for both compression and decompression, leveraging binary RNNs for entropy coding and binarization to improve compression ratios (CRs).

To integrate shifted window-based self-attention modules, inspired by vision transformer (ViT) and Swin transformer architectures, for improved correlation capture among spatially adjacent elements in CNN and Transformer models.

To implement the searching of optimal kernel shapes (SOKS) methodology, which automates the search for optimal kernel shapes, and to enhance network efficiency through stripe-wise pruning (SWP), reducing storage and computational demands.

In this article, the remaining sections are organized as follows: Section 2 presents the literature review, providing an overview of relevant prior work. Section 3 delves into the background of the research, offering contextual information for understanding the study. Section 4 outlines the proposed methodology, detailing the approach employed in the research. The experimental setup is elaborated upon in Section 5, describing how the experiments were conducted. Section 6 presents the results and discussion, analyzing the findings and their implications. Finally, the conclusion, summarizing the key findings and insights, is presented in Section 7.

2 Literature Survey

Image processing plays a major role in many areas, and a variety of techniques have been applied to its processing steps in recent years. Examples include thresholding for segmentation with the African vultures optimization algorithm (AVOA) (Gharehchopogh & Ibrikci, 2024), classification of images for acute lymphoblastic leukemia diagnosis using a convolutional neural network (CNN) (Özbay et al., 2023), detection of COVID-19 using the interactive autodidactic school (IAS) algorithm (Gharehchopogh & Khargoush, 2023), and feature selection on biomedical data for COVID-19 using discrete artificial gorilla troop optimization (DAGTO) (Piri et al., 2022). Optimization algorithms have also been applied to network-based image processing, including Harris Hawks optimization (HHO) for community detection (Gharehchopogh, 2023), the whale optimization algorithm (WOA) (Shen et al., 2023), the slime mould algorithm (SMA) (Gharehchopogh et al., 2023), and dynamic HHO (Gharehchopogh et al., 2023). These studies indicate that image compression plays a major role in preparing image data for further processing. Accordingly, existing methods for image compression are reviewed below.

To compress images losslessly within a short time, recent approaches encode the image either as a whole or as subimages. Image priors play a major role in compressed image reconstruction. Because Discrete Cosine Transform (DCT) coefficients are quantized independently, block transform coded images generally suffer from irritating artifacts at low bit rates. To overcome this problem, Mu et al. (2020) devised a graph-based non-convex low-rank regularization model as a surrogate for the patch matrix. This approach achieved highly accurate reconstruction in both perceptual and objective quality. CNN-based image compression approaches have been developed to maintain the accuracy of decoded images. However, optimizing for PSNR degrades image quality at low bit rates. To overcome this issue, Kudo et al. (2019) developed a regularization method targeting subjective image quality. This approach helps build compression models that suppress structural changes between the original and the compressed image. For high-resolution image compression, the entropy model is a major factor in performance. However, spatio-channel dependencies in the latents and context adaptivity had remained largely unexplored. The adaptive nature of the transformer encouraged Koyuncu et al. (2022) to introduce a context transformer (Contextformer), a transformer-based context model that generalizes the de facto standard attention mechanism to spatio-channel attention. This method achieved better rate-distortion performance but did not close the gap to real-time operation. For better image security, compressive sensing (CS) has been used to perform compression and encryption jointly. Such approaches often have a low CR and may introduce errors during compression. To address this drawback, Song et al. (2019) designed a mechanism that maintains the CR by performing joint image compression and encryption with entropy coding and CS. This method achieved better reconstruction performance in both compression and encryption.

In generic video compression, it is hard to perform interpolation between the encoding and decoding phases. Many codecs cannot be parallelized to maintain coding speed, and it is hard to generalize across datasets covering a wide range of video types. To overcome these problems, Liu et al. (2020) formulated a video compression approach called conditional entropy coding. This approach models the correlation between the codes of successive frames and also performs internal learning of each frame code during inference. Variational autoencoders (VAEs) are widely applied in learned image compression. Depending on the underlying algorithm, certain approaches apply only to lossless compression and achieve low efficiency when compressing many images concurrently, and some become ineffective when compressing a single image. Considering these issues, Flamich et al. (2020) presented relative entropy coding, which encodes latent representations for image compression. For a single image, this approach can encode the latent representation directly with a codelength close to the relative entropy. Its observed drawback was compression speed. For lossless image compression, prediction-based approaches are widely used because of their simplicity and because they guarantee exact recovery of the data. Some images contain valuable spectral data that the human eye cannot detect and that must be preserved with high accuracy; any information loss may destroy their scientific value. To address these problems, Chang et al. (2019) developed adaptive prediction, context modeling, and entropy coding for lossless image compression. The entropy coding model removes statistical redundancy.

The existing methods for image compression, both traditional and learning-based, exhibit various limitations that hinder their effectiveness in achieving optimal compression efficiency while maintaining high-quality image reconstruction. Traditional methods like JPEG2000 and WebP rely on manually designed encoding techniques, which often fail to fully exploit spatial correlations among pixels, resulting in artifacts and quality degradation, especially at low bit rates. Learning-based approaches, while showing promise, suffer from computational inefficiency and over parameterization, leading to excessive computational demands and redundancy in parameters. Additionally, current CNN-based methods struggle to preserve fine details and non-repetitive textures, impacting reconstruction quality. The proposed Optimal Kernel Transformer approach addresses these limitations by integrating innovative strategies such as CC models, MMCC entropy modeling, and RNN-based compression to efficiently capture pixel redundancies and improve CRs without sacrificing image quality. By leveraging advanced transformer techniques and automating the search for optimal kernel shapes, this model enhances representation power while optimizing computational efficiency, providing a robust solution for high-performance lossless image compression. Table 1 presents the comparative table for the related work on lossless image compression.

Table 1.
Comparative Table for Lossless Image Compression.

Authors	Methodology	Dataset	Performance	Limitation
Mu et al. (2020)	Graph Laplacian Regularization	Classic5 and LIVE1	Achieves more accurate reconstruction in both objective and perceptual quality.	A mathematical proof was not considered in this approach.
Kudo et al. (2019)	Mutual Information Maximizing Regularization	CelebA	As the number of iterations grows, all PSNR curves increase monotonically, exhibiting good convergence.	Some parts of the image were still not clearly enhanced.
Koyuncu et al. (2022)	Contextformer	Kodak, CLIC2020, Tecnick	Spatial-first coding was observed to provide a marginal gain at low target bitrates.	Not applicable to real-time applications.
Song et al. (2019)	Entropy coding and CS	Real time	The experimental results show that the proposed JCE-HCS can effectively resist statistical analysis attacks.	Failed to recover the plain image after noise and cropping attacks.
Liu et al. (2020)	Conditional entropy coding	Kinetics	Performs better in numerous video settings, especially at higher bitrates and lower frame rates.	High computation time.
Flamich et al. (2020)	Relative Entropy Coding	CIFAR-10, ImageNet32, and Kodak	For a single image, encodes the latent representation directly with a codelength close to the relative entropy.	Compression speed.
Chang et al. (2019)	Adaptive Prediction, Context Modeling, and Entropy Coding	Real time	Removes statistical redundancy.	Care must be taken when storing information in compressed form for long periods, and backwards compatibility of decoders must be maintained, as data may otherwise be irrevocably lost, leading to what has been termed the Digital Dark Ages.

3 Background of the Proposed Model

The MMCC model and an autoregressive image model are applied to compress the raw image. Together they minimize the code length and thereby improve compression performance.

3.1 Formulation of Lossless Image Compression

The lossless image compression framework consists of two elements: the main path and the hyper path. The main path is expressed as

\begin{aligned} v & = h_{f} (u; ϕ) \end{aligned}

(1)

\begin{aligned} \hat{v} & = T (v) \end{aligned}

(2)

\begin{aligned} (γ_{u}, δ_{u}, Π_{u}) & = h_{p} (\hat{v}; Θ) \end{aligned}

(3)

where $u$ denotes the original image, $v$ the latent representation before quantization, and $\hat{v}$ the quantized latent representation. The trainable parameters of the encoder $h_{f}$ and the decoder $h_{p}$ are denoted by $ϕ$ and $Θ$, respectively. The latent representation $v$ is created by feeding the original image $u$ into the encoder $h_{f}$. Before the latent representation can be encoded, it must be quantized by the operator $T$, whose output is $\hat{v}$.

The hyperprior path consists of a hyper encoder $a_{f}$ and a hyper decoder $a_{p}$. It is expressed as

\begin{aligned} w & = a_{f} (v; ϕ_{a}) \end{aligned}

(4)

\begin{aligned} \hat{w} & = T (w) \end{aligned}

(5)

\begin{aligned} ({\tilde{γ}}_{v}, {\tilde{δ}}_{v}, {\tilde{Π}}_{v}) & = a_{p} (\hat{w}; Θ_{a}) \end{aligned}

(6)

where $w$ denotes the hyperprior representation before quantization and $\hat{w}$ the quantized hyperprior representation. The hyper decoder produces the parameters ${\tilde{γ}}_{v}$, ${\tilde{δ}}_{v}$, and ${\tilde{Π}}_{v}$, which are later used as inputs to the MMCC approach alongside the partitioned $v$.
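As a rough illustration of Equations (1)-(6), the sketch below traces the data flow through the main path and the hyper path. The single matrix products standing in for $h_{f}$, $h_{p}$, $a_{f}$, and $a_{p}$, the array sizes, and the use of rounding for the quantizer $T$ are all assumptions made for illustration; the learned networks of the actual model are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x):
    """Assumed stand-in for the quantization operator T: round to nearest integer."""
    return np.round(x)

def linear(x, weight):
    """Toy replacement for a learned transform: a single matrix product."""
    return x @ weight

def three_params(x, weight):
    """Split one toy decoder output into three equally sized parameter arrays."""
    return np.split(linear(x, weight), 3, axis=-1)

# Hypothetical sizes: 16 pixel values, 8 latent values, 4 hyper-latent values.
n_u, n_v, n_w = 16, 8, 4
phi = rng.normal(size=(n_u, n_v))          # encoder parameters (phi)
theta = rng.normal(size=(n_v, 3 * n_u))    # decoder parameters (Theta)
phi_a = rng.normal(size=(n_v, n_w))        # hyper encoder parameters (phi_a)
theta_a = rng.normal(size=(n_w, 3 * n_v))  # hyper decoder parameters (Theta_a)

u = rng.integers(0, 256, size=n_u).astype(float)  # toy "image"

# Main path, Equations (1)-(3).
v = linear(u, phi)
v_hat = quantize(v)
gamma_u, delta_u, pi_u = three_params(v_hat, theta)

# Hyper path, Equations (4)-(6).
w = linear(v, phi_a)
w_hat = quantize(w)
gamma_v, delta_v, pi_v = three_params(w_hat, theta_a)
```

The point of the sketch is only the ordering: the hyper path consumes the unquantized latent $v$, and its decoder yields one parameter triple per latent value.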

3.2 MMCC

The MMCC approach plays a vital role in estimating the distribution parameters of the features. Its procedure is as follows. The inputs to MMCC are the parameters produced by the hyper decoder and the quantized latent $v$. The latent $v$ is divided equally along the channel dimension into $L$ slices, each containing $Z \times D \times G / L$ values.

The slices in MMCC are processed with sequential dependence during encoding and decoding. The first slice $v_{0}$ is encoded based on the hyperprior alone. The second slice $v_{1}$ is then encoded and decoded based on the first slice and the hyperprior, and so on. This procedure is expressed as

\begin{aligned} v & = v_{0}, v_{1}, \dots, v_{L - 1} \end{aligned}

(7)

\begin{aligned} (γ_{v_{X}}, δ_{v_{X}}, Π_{v_{X}}) & = MMCC ({\tilde{γ}}_{v}, {\tilde{δ}}_{v}, {\tilde{Π}}_{v}, {\tilde{v}}_{< X}) \end{aligned}

(8)

\begin{aligned} {\tilde{v}}_{X} & = LRP (v_{X}) \end{aligned}

(9)

\begin{aligned} \tilde{v} & = {\tilde{v}}_{0}, {\tilde{v}}_{1}, \dots, {\tilde{v}}_{L - 1} \end{aligned}

(10)

\begin{aligned} γ_{v} & = concat (γ_{v_{X}}) \end{aligned}

(11)

\begin{aligned} δ_{v} & = concat (δ_{v_{X}}) \end{aligned}

(12)

\begin{aligned} Π_{v} & = concat (Π_{v_{X}}) \end{aligned}

(13)

where $v$ denotes the latents, $v_{0}, v_{1}, \dots, v_{L - 1}$ are the individual slices of $v$, and ${\tilde{v}}_{X}$ is the output of latent residual prediction (LRP). Furthermore, ${\tilde{v}}_{< X}$ denotes the slices before the $X^{th}$ index, which have been modified by the quantization error. The slices of $γ_{v}, δ_{v}, Π_{v}$ are $γ_{v_{X}}, δ_{v_{X}}, Π_{v_{X}}$, respectively. The Gaussian mixture distribution parameters $γ_{v}, δ_{v}, Π_{v}$ are used to encode and decode the latent $v$. This procedure is similar to an autoregressive context model.
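The slice-by-slice dependence of Equations (7)-(13) can be sketched as a simple loop. In the sketch below, `toy_mmcc` and `toy_lrp` are hypothetical placeholders for the learned MMCC network and the LRP module, and slicing the hyper parameters per channel group is an assumption made so that the shapes line up; only the control flow is meant to mirror the text.

```python
import numpy as np

def toy_mmcc(hyper_x, previous):
    """Hypothetical stand-in for the MMCC network of Equation (8)."""
    gamma_h, delta_h, pi_h = hyper_x
    # Context: mean of the already processed slices (none for the first slice).
    context = np.mean(previous, axis=0) if previous else 0.0
    return gamma_h + context, delta_h, pi_h

def toy_lrp(v_x):
    """Hypothetical stand-in for latent residual prediction, Equation (9)."""
    return v_x

def channel_conditional_params(v, hyper, num_slices):
    """Process the slices of v in order and concatenate their parameters."""
    slices = np.split(v, num_slices, axis=0)                # Eq. (7)
    hyper_slices = [np.split(h, num_slices, axis=0) for h in hyper]
    processed, outputs = [], []
    for x, v_x in enumerate(slices):
        hyper_x = tuple(h[x] for h in hyper_slices)
        outputs.append(toy_mmcc(hyper_x, processed))        # Eq. (8)
        processed.append(toy_lrp(v_x))                      # Eqs. (9)-(10)
    # Eqs. (11)-(13): concatenate the per-slice parameters along the channels.
    gamma, delta, pi = (np.concatenate(p, axis=0) for p in zip(*outputs))
    return gamma, delta, pi

v = np.arange(8 * 2 * 2, dtype=float).reshape(8, 2, 2)  # 8 channels, 2x2 spatial
hyper = (np.zeros_like(v), np.ones_like(v), np.full_like(v, 0.5))
gamma, delta, pi = channel_conditional_params(v, hyper, num_slices=4)
```

With this toy context function, the parameters of the first slice depend on the hyperprior alone, while each later slice also sees every slice processed before it.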

4 Proposed Model

Figure 1 illustrates the architectural diagram of the proposed optimal kernel transformer approach for image compression, consisting of two primary components: the major path and the hyper path. In the major path, the compression process begins with the original image being divided into patches. These patches are then processed through transformer-based encoders that employ a window attention mechanism. This mechanism downsamples the feature resolution while increasing the feature channels, enabling the model to effectively capture and represent the image details. Following this, the transformer decoder reverses the process by splitting the patches into layers and de-embedding them to reconstruct the image from the compressed representation. Throughout this process, a conditional probabilistic model is integrated, utilizing a binary RNN to manage the compression and decompression tasks, ensuring efficient data encoding and retrieval.

Figure 1.

Architectural diagram of the proposed optimal kernel transformer approach for image compression.

The hyper path employs the MMCC framework to analyze feature parameters based on sequential dependencies, which enhances the overall model performance. The MMCC operates in parallel with the major path to refine the compression process. Regularization is performed using the SOKS method, which includes three types of regularization: sparse regularization, direction-wise regularization, and group-wise regularization. These regularization techniques help in optimizing the network by reducing computational complexity and improving efficiency. Finally, the SWP method is applied to prune irregular shapes, ensuring the kernels achieve optimal shapes for better compression performance. This combination of advanced techniques in both the major and hyper paths ensures a high level of compression efficiency and image quality preservation.

4.1 Transformer-Based Encoder and Decoder

The block diagram of the transformer block is shown in Figure 2. In the transformer block, local attention helps arrange bits sequentially and maintain rate-distortion (RD) performance. The benefit of this block is that it attends to spatially neighboring patches while gradually enlarging the receptive field, at acceptable computational complexity. Layer normalization (LN) is the normalization generally applied in transformers and is used by default in this block. One issue with LN is that it may destroy the Gaussian distribution of the network's components, because it rescales the responses of linear filters with the same factor over all spatial locations in order to keep the network within a tolerable operating range. Nevertheless, LN is necessary for rescaling the response range when computing the attention map. Compared with the original CNN approach, the MLP achieves a better outcome (Sneyers & Wuille, 2016). Window-based multi-head self-attention (MHSA) with regular windows is denoted W-MSA, and MHSA with shifted windows is denoted SW-MSA (Liu et al., 2021).

Figure 2.

Block diagram of the transformer block.
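To make the difference between W-MSA and SW-MSA concrete, the sketch below shows only how the two variants group positions into windows; the attention computation itself is omitted. The channel-last layout, the window size of 4, and the half-window cyclic shift follow the common Swin transformer convention and are assumptions here, not details given in this article.

```python
import numpy as np

def window_partition(feature, window):
    """Split an (H, W, C) feature map into non-overlapping window x window tiles."""
    h, w, c = feature.shape
    tiles = feature.reshape(h // window, window, w // window, window, c)
    # Result: (number of windows, positions per window, channels).
    return tiles.transpose(0, 2, 1, 3, 4).reshape(-1, window * window, c)

def shifted_window_partition(feature, window):
    """Cyclically shift by half a window before partitioning (SW-MSA style)."""
    shift = window // 2
    shifted = np.roll(feature, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window)

feature = np.arange(8 * 8).reshape(8, 8, 1)     # toy 8x8 map, one channel
regular = window_partition(feature, 4)          # windows used by W-MSA
shifted = shifted_window_partition(feature, 4)  # windows used by SW-MSA
```

Because the shifted windows straddle the borders of the regular ones, alternating the two lets information pass between neighboring windows across successive blocks.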

4.1.1 Transformer-Based Encoder

The encoder divides the original image $U \in A^{3 \times D \times C}$ into patches of size $J$. A linear embedding layer is applied to the raw patches to create a feature map $o_{s} \in A^{B \times (D / J) \times (C / J)}$ with $B$ channels. The feature map $o_{s}$ is then rearranged into a sequence $o_{r} \in A^{R \times B}$, where $R = D C / J^{2}$ is the number of patches. The sequence $o_{r}$ is the input to the transformer blocks and patch merging layers. Following the layout of the Swin transformer (Liu et al., 2021), the attention mask within each window is computed in advance for the feed-forward pass. The patch merging layers then downsample the feature resolution and double the number of feature channels (Zou et al., 2022).
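The patch partition and embedding step can be illustrated with a few array reshapes. The sketch assumes that a patch size of $J$ yields $D / J$ by $C / J$ patches, which is consistent with $R = D C / J^{2}$; the random embedding matrix `E` and the toy sizes are hypothetical.

```python
import numpy as np

def patch_embed(image, patch, weight):
    """Map a (3, D, C) image to a sequence of R = D*C / patch**2 tokens."""
    ch, d, c = image.shape
    # Cut the image into non-overlapping patch x patch blocks.
    blocks = image.reshape(ch, d // patch, patch, c // patch, patch)
    # One row per patch, holding its flattened (channel, row, column) values.
    blocks = blocks.transpose(1, 3, 0, 2, 4).reshape(-1, ch * patch * patch)
    # Linear embedding of each flattened patch into B channels.
    return blocks @ weight

rng = np.random.default_rng(0)
D, C, J, B = 8, 12, 4, 5          # toy height, width, patch size, channels
U = rng.random((3, D, C))         # toy RGB image
E = rng.random((3 * J * J, B))    # hypothetical embedding weights
o_r = patch_embed(U, J, E)        # sequence of shape (R, B) with R = 6
```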

4.1.2 Transformer-Based Decoder

A symmetrical decoder is built from de-embedding layers, patch splitting layers, and transformer blocks. The patch splitting layers upsample the feature resolution and halve the number of feature channels, and the de-embedding layer maps the feature map back to the reconstructed image $\hat{U}$ (Zou et al., 2022).
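A minimal sketch of how a patch merging layer and a patch splitting layer can mirror each other is given below. The 2x2 grouping, the channel-last layout, and the projection sizes follow the usual Swin-style construction and are assumptions; the weights are random placeholders rather than trained parameters.

```python
import numpy as np

def patch_merge(x, weight):
    """(H, W, C) -> (H/2, W/2, 2C): gather 2x2 neighbours, then project 4C -> 2C."""
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(h // 2, w // 2, 4 * c) @ weight

def patch_split(x, weight):
    """(H, W, C) -> (2H, 2W, C/2): project C -> 2C, then spread channels over 2x2."""
    h, w, c = x.shape
    x = (x @ weight).reshape(h, w, 2, 2, c // 2).transpose(0, 2, 1, 3, 4)
    return x.reshape(2 * h, 2 * w, c // 2)

rng = np.random.default_rng(0)
features = rng.random((4, 6, 8))                       # toy (H, W, C) map
merged = patch_merge(features, rng.random((32, 16)))   # encoder side
restored = patch_split(merged, rng.random((16, 32)))   # decoder side
```

The shapes go from (4, 6, 8) to (2, 3, 16) and back to (4, 6, 8), which is the symmetry the encoder and decoder rely on; the values are of course not recovered by random weights.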

4.2 Entropy Coding

The entropy of the codes created during inference is not maximal, because the network is not explicitly designed to maximize the entropy of its codes, and the model does not necessarily exploit redundancy over a large spatial range. Adding a further entropy coding stage can improve the CR, as is common in standard image compression codecs. Here, the image encoder is taken as given and is used only as a binary code generator. The structure of the binary RNN entropy coding is shown in Figure 3.

Figure 3.

Structural representation of binary RNN entropy coding.

The lossless entropy coding approaches considered here are fully convolutional; they process the binary codes in progressive order and, for a given encoding iteration, in raster-scan order. Every image encoder architecture creates binary codes of the form $e (x, y, q)$ with size $E \times X \times H$, in which $E$ represents the height of the image, $X$ the width of the image, and $H$ is $j \times$ the number of iterations. Following a standard lossless encoding framework, a conditional probabilistic model of the current binary code $e (x, y, q)$ is combined with an arithmetic coder that performs the actual compression. More formally, given a context $O (x, y, q)$ that depends only on previous bits in stream order, the model estimates $R (e (x, y, q) | O (x, y, q))$, so that the expected ideal encoded length of $e (x, y, q)$ is the cross entropy between $R (e | O)$ and $\hat{R} (e | O)$. The small penalty incurred by using a practical arithmetic coder, which needs a quantized version of $\hat{R} (e | O)$, is not considered (Toderici et al., 2017).

4.2.1 Single Iteration Entropy Coder

For a single layer, the entropy coder builds on the PixelRNN architecture and uses a similar architecture for compressing the binary codes. Here, the estimate of the conditional code probability for line $x$ depends directly on a few neighboring codes and indirectly on the previously decoded binary codes through a line of states $Q$ of size $1 \times X \times K$, which captures both short-term and long-term dependencies. The state line summarizes all previous lines. In practice, $K = 64$ is used. The probabilities are estimated, and the state is updated line by line, using a $1 \times 3$ LSTM convolution.

The end-to-end probability estimation has three steps. First, an initial convolution of size $7 \times 7$ is applied to enlarge the receptive field of the LSTM. The probability estimate of the code $e (x, y)$ is thereby influenced by the set of codes $e (M, N)$ within the receptive field. To avoid dependencies on future codes, this initial convolution is a masked convolution. In the second step, the line LSTM takes the output $W_{0}$ of the initial convolution as input and processes one scan line at a time. Because the hidden states of the LSTM are generated from the earlier scan lines, the line LSTM captures both long-term and short-term dependencies. Finally, two $1 \times 1$ convolutions are added to increase the capacity of the network to memorize additional binary code patterns. Since the goal is to predict binary codes, the parameter of the Bernoulli distribution can be estimated directly by the final convolution.
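The role of the masked initial convolution can be illustrated with a small sketch. The mask layout below (all rows above the current line plus the positions to its left, with the current position excluded) is the usual raster-scan causal mask and is an assumption about the exact form used; the kernel values are random placeholders.

```python
import numpy as np

def causal_mask(size=7):
    """Mask for a size x size kernel that hides the current and all future codes."""
    mask = np.zeros((size, size))
    centre = size // 2
    mask[:centre, :] = 1.0        # rows above the current line
    mask[centre, :centre] = 1.0   # codes to the left on the current line
    return mask

def masked_conv2d(codes, kernel):
    """Zero-padded 'same' 2-D correlation using the causally masked kernel."""
    size = kernel.shape[0]
    pad = size // 2
    padded = np.pad(codes, pad)
    masked = kernel * causal_mask(size)
    out = np.zeros(codes.shape, dtype=float)
    for row in range(codes.shape[0]):
        for col in range(codes.shape[1]):
            patch = padded[row:row + size, col:col + size]
            out[row, col] = np.sum(patch * masked)
    return out
```

With this mask, the response at a position is unaffected by the code at that position or by any code later in raster-scan order, so a decoder can reproduce the same context features as the encoder.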

The goal is to minimize the number of bits used after entropy coding, which leads naturally to a cross-entropy loss. For binary codes in $\{0, 1\}$, the cross-entropy loss can be formulated as

\text{Cross-entropy loss} = \sum_{x, y, q} \left[ - e \log_{2} (\hat{R} (e | O)) - (1 - e) \log_{2} (1 - \hat{R} (e | O)) \right]

(14)
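Equation (14) can be evaluated directly to obtain the ideal code length in bits. The sketch below assumes that $\hat{R} (e | O)$ is the predicted probability that a code bit equals 1; the function name and the clipping constant are illustrative choices.

```python
import numpy as np

def code_length_bits(codes, prob_one):
    """Total ideal code length in bits of binary codes under predicted P(e = 1)."""
    codes = np.asarray(codes, dtype=float)
    # Clip to avoid log2(0) when a prediction is exactly 0 or 1.
    prob_one = np.clip(np.asarray(prob_one, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(-codes * np.log2(prob_one)
                        - (1 - codes) * np.log2(1 - prob_one)))

codes = [1, 0, 1, 1]
uniform = code_length_bits(codes, [0.5] * 4)             # uninformed model
sharper = code_length_bits(codes, [0.9, 0.1, 0.9, 0.9])  # confident, correct model
```

An uninformed model spends one bit per code, while a confident and correct model spends fewer, which is why lowering this loss lowers the size of the entropy-coded stream.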

4.2.2 Progressive Entropy Coding

To handle multiple iterations, the single-iteration entropy coder can simply be replicated, so that every iteration has its own line LSTM. However, this architecture does not capture the redundancy between iterations. The data passed to the line LSTM of a given iteration can therefore be augmented with information from the earlier layers: the line LSTM receives not only $W_{0}$, as in the single-iteration case, but also $W_{1}$, which is estimated from the earlier iterations by a recurrent network. Computing $W_{1}$ does not need any masked convolution (Toderici et al., 2017).

4.3 SOKS

SOKS is a framework that searches for optimal kernel shapes. It has two phases: a searching phase and a retraining phase.

4.3.1 Framework for SOKS

To identify the significant positions in the convolution kernels, coefficient matrices constrained by various regularization terms are introduced. Let $Y \in S^{d \times g \times z}$ denote the input tensor containing $d$ channels of size $g \times z$. After convolution, the output is

Z = Y * z

(15)

where $Z \in S^{k \times g^{'} \times z^{'}}$ is the output, $k$ is the number of output channels, $g^{'} \times z^{'}$ is the size of the output, $*$ is the convolution operator, and $z \in S^{d \times n \times n \times k}$ holds the convolution weights. The bias is omitted for convenience.

To learn the optimal kernel shapes, the filter weights $z$ are multiplied element-wise by a coefficient matrix $G$ before convolution, so that the output $Z$ becomes

Z = Y * (G ⊙ z)

(16)

where $⊙$ denotes the element-wise product. Note that one may either use a different coefficient matrix for every output channel or let all output channels share one matrix. Hence, the $k$ filters are divided equally into $c$ $(1 \leq c \leq k)$ sets, and the same coefficient matrix is applied within a set. If $c = 1$, all $k$ filters share the same coefficient matrix and learn the same kernel shape. If $c = k$, the $k$ filters are independent of one another and each learns its own kernel shape. If $1 < c < k$, every $k / c$ filters share one coefficient matrix and learn one kernel shape. Thus, the size of $G$ is $c \times n \times n$, and each 2-dimensional (2-D) matrix in $G$ is denoted $G_{l} \in S^{n \times n}$, where $l = 1, \dots, c$.
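The grouping of filters into $c$ sets that share a coefficient matrix, as in Equation (16), amounts to a broadcasted element-wise product. The sketch below assumes that consecutive filters form a set (filter $j$ belongs to set $\lfloor j / (k / c) \rfloor$); that assignment and the function name are illustrative choices.

```python
import numpy as np

def apply_coefficients(weights, coeffs):
    """Scale weights of shape (d, n, n, k) by group-shared (c, n, n) coefficients."""
    d, n, _, k = weights.shape
    c = coeffs.shape[0]
    assert k % c == 0, "the k filters must split evenly into c sets"
    # Expand to one n x n matrix per filter: consecutive filters share a matrix.
    per_filter = np.repeat(coeffs, k // c, axis=0)         # shape (k, n, n)
    # Move the filter axis last and broadcast over the d input channels.
    return weights * per_filter.transpose(1, 2, 0)[None]   # shape (d, n, n, k)

cross = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=float)
coeffs = np.stack([cross, np.ones((3, 3))])   # c = 2 coefficient matrices
weights = np.ones((2, 3, 3, 4))               # d = 2, n = 3, k = 4
masked = apply_coefficients(weights, coeffs)  # G ⊙ z
```

Here the first two filters take a cross-shaped kernel and the last two keep the full 3x3 square, which corresponds to the case $1 < c < k$.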

For a CNN with I convolution layers to be compressed, all 2-D coefficient matrices are gathered to obtain

$$F = \{G_{1}^{1}, \dots, G_{c}^{1}, \dots, G_{l}^{i}, \dots, G_{1}^{I}, \dots, G_{c}^{I}\} \qquad (17)$$

where $G_{l}^{i}$ denotes the coefficient matrix of the $l^{th}$ filter set in the $i^{th}$ convolution layer.

Every element of F is initialized to 1, and several regularization constraints are imposed during training. After a number of training iterations, the network parameters, including F, converge. F is then used to identify the important kernel positions, which are retained to form the optimal kernel shapes. Once the optimal kernel shapes are obtained, stripe-wise pruning is applied to every convolution layer. To reach higher accuracy, the compressed model is trained from scratch in the retraining phase (Liu et al., 2022).

4.3.2 Regularization Reflected on Coefficient Matrices

For image classification, the traditional loss function can be written as

$$L_{0} = L_{cls} + α L_{2} \qquad (18)$$

where $L_{cls}$ denotes the classification loss, $L_{2}$ denotes the $\ell_{2}$ regularization applied to counter overfitting, and $α$ controls the penalty level.

To search for the optimal kernel shapes automatically, additional regularization constraints $L_{t}$ are imposed on the coefficient matrices F, and the loss function for the whole training is

$$L = L_{0} + L_{t} \qquad (19)$$

Here, every coefficient matrix in F acts as an indicator of the important kernel positions and is trained to be sparse so that the model can be compressed. Moreover, recent studies show that kernel parameters at different positions contribute differently. Further difficulties must also be taken into account, in particular the pixel shift problem. Accordingly, $L_{t}$ is constructed as

$$L_{t} = α_{1} L_{sparse} + \frac{α_{2}}{2} L_{dir} + \frac{α_{3}}{2} L_{group} \qquad (20)$$

where $L_{sparse}$ is the sparse regularization term, $L_{dir}$ is the direction-wise regularization term, $L_{group}$ is the group-wise regularization term, and $α_{1}, α_{2}, α_{3}$ balance the strengths of these terms.
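As a small worked example of how Eqs. (18) to (20) combine, the following sketch evaluates the total loss from already-computed scalar terms. The default values of $α_{1}$, $α_{2}$, and $α_{3}$ are the ones reported with Table 5; the value of $α$ and all loss-term values in the example call are hypothetical.

```python
def total_loss(l_cls, l2, l_sparse, l_dir, l_group,
               alpha=1e-4, alpha1=0.001, alpha2=0.007, alpha3=0.007):
    """Combine scalar loss terms following Eqs. (18)-(20).

    alpha is a hypothetical weight for the l2 term; alpha1..alpha3 default
    to the values reported with Table 5.
    """
    l0 = l_cls + alpha * l2                              # Eq. (18)
    lt = (alpha1 * l_sparse
          + (alpha2 / 2.0) * l_dir
          + (alpha3 / 2.0) * l_group)                    # Eq. (20)
    return l0 + lt                                       # Eq. (19)


# Hypothetical term values, chosen only to make the arithmetic easy to follow:
# 1.0 + 0.01 + 0.01 + 0.007 + 0.014
loss = total_loss(l_cls=1.0, l2=100.0, l_sparse=10.0, l_dir=2.0, l_group=4.0)
```

Note the factor of one half on the direction-wise and group-wise terms, which is easy to drop when transcribing Eq. (20).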

4.3.2.1 Sparse regularization

The partial derivative of the sparse regularization $L_{sparse}$ with respect to $b_{lm}$ is

$$\frac{\partial L_{sparse}}{\partial b_{lm}} = n_{m} \cdot sgn(b_{lm}) \qquad (21)$$

where $sgn(\cdot)$ is the sign function. $sgn(b_{lm})$ returns −1, 0, or +1 depending on whether $b_{lm}$ is negative, zero, or positive, respectively. Hence, for non-zero $b_{lm}$, a gradient descent step pushes $b_{lm}$ toward zero, promoting sparsity. The coefficient $n_{m}$ scales the regularization effect, reflecting the regularization strength or weight associated with position m.
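A minimal sketch of Eq. (21) for a single 3 × 3 coefficient matrix follows. It assumes that $n_{m}$ takes one value per kernel region (corner, edge, center), using the 4-2-1 setting as an example, and that positions m = 1, ..., 9 are numbered in row-major order; both are illustrative assumptions rather than details confirmed by the text.

```python
# Assumed region weights n_m for the 4-2-1 setting (corner-edge-center).
REGION_WEIGHT = {"corner": 4.0, "edge": 2.0, "center": 1.0}


def region(m):
    """Region of position m (1..9, assumed row-major) in a 3 x 3 kernel."""
    if m == 5:
        return "center"
    return "corner" if m in (1, 3, 7, 9) else "edge"


def sgn(x):
    """Sign function: -1, 0, or +1."""
    return (x > 0) - (x < 0)


def sparse_grad(b_l):
    """Eq. (21) applied to a flat list of the nine coefficients b_l1..b_l9."""
    return [REGION_WEIGHT[region(m)] * sgn(b) for m, b in enumerate(b_l, start=1)]


def descent_step(b_l, lr):
    """One plain gradient-descent step on the sparse term only."""
    return [b - lr * g for b, g in zip(b_l, sparse_grad(b_l))]


# Starting from all ones, corners shrink fastest and the center slowest.
updated = descent_step([1.0] * 9, lr=0.1)
```

Under these assumptions, a larger region weight drives the corresponding positions toward zero faster, which is how an uneven setting such as 4-2-1 favors keeping the kernel center.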

4.3.2.2 Direction wise regularization

Based on the chain rule, the partial derivative of the direction-wise regularization $L_{dir}$ with respect to $b_{lm}$ becomes

$$\frac{\partial L_{dir}}{\partial b_{lm}} = \frac{\partial L_{dir}}{\partial {\bar{G}}_{m}} \cdot \frac{\partial {\bar{G}}_{m}}{\partial b_{lm}} \qquad (22)$$

The chain rule decomposes the derivative into two parts. First, $\partial L_{dir} / \partial {\bar{G}}_{m}$ represents how the regularization term changes with respect to an intermediate variable ${\bar{G}}_{m}$, which is a function of the model parameters and encapsulates the directional information. Second, $\partial {\bar{G}}_{m} / \partial b_{lm}$ represents how this intermediate variable changes with respect to the specific parameter $b_{lm}$. The first part assesses the sensitivity of the regularization to directional changes, while the second traces these directional changes back to the individual model parameters. Breaking the gradient into these two parts keeps the computation systematic and interpretable, which matters for implementing direction-wise regularization in the model.

4.3.2.3 Group wise regularization

Based on the chain rule, the partial derivative of the group-wise regularization $L_{group}$ with respect to $b_{lm}$ is expressed as

$$\frac{\partial L_{group}}{\partial b_{lm}} = \begin{cases} \frac{\partial L_{group}}{\partial {\bar{G}}_{l}^{corner}} \cdot \frac{\partial {\bar{G}}_{l}^{corner}}{\partial b_{lm}}, & \text{if } m \in P_{corner} \\ \frac{\partial L_{group}}{\partial {\bar{G}}_{l}^{edge}} \cdot \frac{\partial {\bar{G}}_{l}^{edge}}{\partial b_{lm}}, & \text{if } m \in P_{edge} \\ \frac{\partial L_{group}}{\partial {\bar{G}}_{l}^{center}} \cdot \frac{\partial {\bar{G}}_{l}^{center}}{\partial b_{lm}}, & \text{if } m \in P_{center} \end{cases} \qquad (23)$$

This expression uses the chain rule to decompose the gradient computation into region-specific components. Here, $L_{group}$ is the group-wise regularization term, which imposes regularization according to the kernel region in which a parameter lies, and $P_{corner}$, $P_{edge}$, and $P_{center}$ are the sets of corner, edge, and center positions. The parameters $b_{lm}$ are influenced differently depending on whether they belong to the corner, edge, or center region.

4.3.3 Pruning for Optimal Kernel Shapes

Pruning is performed with the SWP model. In SWP, the kernel stripes corresponding to unimportant positions are removed, and the calculation order of the convolution is modified so that the pruned network can be inferred efficiently.

Every coefficient matrix $G_{l}^{i}$ in F is trained to be sparse, with many parameters close to zero. To reach the desired compression rate, a binary search algorithm is used to find a suitable pruning threshold.

Because the magnitudes of the elements can differ considerably from one coefficient matrix to another, the threshold is applied relative to the largest element of each matrix. The binary search yields a threshold $β^{*}$ for pruning insignificant kernel positions. For a given threshold $β$, the maximum absolute value of all parameters $b_{lm}$ in $G_{l}^{i}$ is computed and denoted $| b |_{max}$, where $l$ $(l = 1, \dots, c)$ is the kernel group index and $m$ $(m = 1, \dots, 9)$ is the kernel position index. Position m is then pruned if $| b_{lm} | < β | b |_{max}$.

Once the optimal kernel shapes are obtained, the unimportant kernel parameters are removed by SWP, with the desired compression rate reached through the binary search. The pruned model is then trained from scratch to obtain better results. Training and inference with the irregular convolution kernels are carried out by transforming the input data (Liu et al., 2022).
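The thresholding rule and the binary search can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: each coefficient matrix is a flat list of nine values, the target is expressed as the fraction of pruned positions, and the names `prune_mask`, `pruned_fraction`, and `search_threshold` are hypothetical.

```python
def prune_mask(G_l, beta):
    """Keep position m unless |b_lm| < beta * |b|_max within this matrix."""
    b_max = max(abs(b) for b in G_l)
    return [abs(b) >= beta * b_max for b in G_l]


def pruned_fraction(F, beta):
    """Fraction of kernel positions removed over all coefficient matrices in F."""
    masks = [prune_mask(G_l, beta) for G_l in F]
    total = sum(len(mask) for mask in masks)
    kept = sum(sum(mask) for mask in masks)
    return (total - kept) / total


def search_threshold(F, target, tol=1e-3, max_iter=50):
    """Binary search for beta in [0, 1]; the pruned fraction never decreases with beta."""
    lo, hi = 0.0, 1.0
    for _ in range(max_iter):
        beta = (lo + hi) / 2.0
        frac = pruned_fraction(F, beta)
        if abs(frac - target) <= tol:
            return beta
        if frac < target:
            lo = beta      # too little pruned: raise the threshold
        else:
            hi = beta      # too much pruned: lower the threshold
    return (lo + hi) / 2.0


# Two hypothetical coefficient matrices with very different scales.
F = [
    [0.05, 0.9, 0.05, 0.8, 1.0, 0.7, 0.02, 0.6, 0.01],
    [0.1, 0.2, 0.1, 0.2, 2.0, 0.2, 0.1, 0.2, 0.1],
]
beta_star = search_threshold(F, target=8 / 18)
```

Because the pruned fraction is a step function of $β$, an exact target may be unreachable; the sketch then returns the last midpoint after `max_iter` halvings. Normalizing by $| b |_{max}$ per matrix is what lets a single $β$ act sensibly on matrices of different scales.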

5 Experimental Setup

The proposed Optimal Kernel Transformer approach was implemented in Python, and the experiments use three different datasets. The approach is compared with existing models on the same metrics.

5.1 Dataset Description

The proposed Optimal Kernel Transformer approach for lossless image compression is evaluated on three image datasets and compared with existing methods. The datasets are the Kodak dataset, the challenge on learned image compression (CLIC) professional validation dataset, and the DIV2K dataset.

The Kodak dataset contains 24 uncompressed color images with a resolution of $768 \times 512$. The images are lossless, true-color images (24 bits per pixel, also known as "full color"), and many sites use them as a standard test suite for compression testing. The images were originally made available in the Sun Raster format via FTP, so they could not be previewed before downloading. Since their release, however, the lossless PNG format has been incorporated into all the major browsers. Since PNG supports 24-bit lossless color (which GIF and JPEG do not), it is now possible to offer browser-friendly access to the images.

The CLIC professional validation dataset contains 41 high-quality, high-resolution images taken by professionals with DSLR cameras. Most of the images are 2K resolution, although a few are as small as $512 \times 384$. These images contain a mix of the professional and mobile datasets used to train and benchmark rate-distortion performance. The dataset contains both RGB and grayscale images, which may require special handling if a grayscale image is loaded as a 1-channel tensor where a 3-channel tensor is expected.

DIV2K is a widely used high-resolution image dataset, divided into 800 training images and 100 validation images. All 800 training images are used for training. For evaluation (encoding/decoding), both the original DIV2K validation dataset and a randomly cropped version of it (denoted DIV2K (crop)) are used. The crop size is set to 512 × 512 for a fair comparison with L3C.
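One common way to handle the mixed channel counts is to replicate the single gray channel three times. The sketch below assumes this strategy and a channel-first nested-list layout; the paper does not state how grayscale images were actually handled, and the function name is hypothetical.

```python
def ensure_three_channels(image):
    """Return a 3-channel image from a channel-first nested list (C x H x W).

    A 1-channel (grayscale) input is replicated across three channels; this is
    an assumed strategy, not one stated in the source.
    """
    if len(image) == 3:
        return image
    if len(image) == 1:
        gray = image[0]
        # Copy rows so the three channels do not alias one another.
        return [[row[:] for row in gray] for _ in range(3)]
    raise ValueError("expected 1 or 3 channels, got %d" % len(image))


rgb_like = ensure_three_channels([[[10, 20], [30, 40]]])
```

Replication keeps the pixel values unchanged, so the image can pass through a 3-channel model, but it triples the number of sub-pixels and therefore affects per-dimension rate figures.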
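The random cropping used for DIV2K (crop) can be sketched as below. The nested-list image layout, the function name, and the explicit random generator are illustrative assumptions; the source only specifies the 512 × 512 crop size.

```python
import random


def random_crop(image, size, rng):
    """Return a size x size crop at a uniformly random position.

    image is an H x W nested list of pixels; rng is a random.Random instance,
    so a fixed seed makes the crop reproducible.
    """
    height, width = len(image), len(image[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the crop size")
    top = rng.randint(0, height - size)    # randint is inclusive at both ends
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


# Tiny stand-in image whose pixels record their own (row, column) position.
image = [[(r, c) for c in range(6)] for r in range(5)]
crop = random_crop(image, 4, random.Random(0))
```

For the setting in the text, `size` would be 512. Passing a seeded generator is one way to keep the cropped validation set identical across the compared methods.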

5.2 Training Details

The model architecture is derived from Cheng et al. (2020) and implemented within the CompressAI platform (Bégaint et al., 2020). In particular, the output channel configuration of $h_{p}$ is structured as 3 × 3 × K, where K is the number of components selected for the GMM; K is fixed at 3 throughout training. For training, approximately 40,000 images were sourced from the ImageNet dataset and resized to 256 × 256 pixels before being fed into the network in random order. This dataset was selected for its scale and diversity, even though it is not specifically curated for lossless compression tasks. The model comprises a total of 709 million parameters. The number of channels is set to N = 192 for the main path and M = 320 for the hyper path, with the number of slices set to 10. The model was optimized with the Adam optimizer and a batch size of 8. The learning rate was initially set to 1 × 10⁻⁴ and was reduced to 1 × 10⁻⁵ after 100 epochs for finer adjustments. To ensure stability and convergence, the model was trained for approximately 400 epochs. All experiments were conducted on a machine equipped with an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory.
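The learning-rate schedule described above amounts to a single step change. A minimal sketch, assuming 0-based epoch indices and a switch exactly at epoch index 100 (the text does not say whether the drop applies to the 100th or 101st epoch):

```python
def learning_rate(epoch, base_lr=1e-4, reduced_lr=1e-5, drop_epoch=100):
    """Step schedule: base_lr for the first drop_epoch epochs, reduced_lr afterwards."""
    return base_lr if epoch < drop_epoch else reduced_lr


# Learning rate for each of the roughly 400 training epochs.
schedule = [learning_rate(epoch) for epoch in range(400)]
```

Under this reading, 100 epochs run at 1 × 10⁻⁴ and the remaining 300 at 1 × 10⁻⁵.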

Table 2.

Image Compression Performance Comparison.

		Compression Performance (BPD) ↓	Inference Time (ms/sample) ↓		
Dataset	Method	Dataset	Single	Encode	Decode
Kodak	iVPF (Zhang et al., 2021)	3.20	9.50	12.65	12.65
	GOLLIC (Lan et al., 2022)	3.15	8.70	10.25	10.25
	MSPSM (Zhang et al., 2020)	3.12	8.30	9.45	9.45
	SR (Cao et al., 2020)	3.25	10.20	15.20	15.20
	LLICTI (Kamisli, 2023)	3.10	8.50	9.80	9.80
	Proposed	3.06	7.80	8.75	8.75
CLIC	iVPF (Zhang et al., 2021)	3.98	12.60	17.80	17.80
	GOLLIC (Lan et al., 2022)	3.95	12.30	16.75	16.75
	MSPSM (Zhang et al., 2020)	3.88	11.80	15.90	15.90
	SR (Cao et al., 2020)	4.10	13.50	19.20	19.20
	LLICTI (Kamisli, 2023)	3.85	12.10	16.70	16.70
	Proposed	3.91	12.00	16.50	16.50
DIV2K	iVPF (Zhang et al., 2021)	3.68	10.80	14.45	14.45
	GOLLIC (Lan et al., 2022)	3.70	11.00	14.75	14.75
	MSPSM (Zhang et al., 2020)	3.70	11.10	15.20	15.20
	SR (Cao et al., 2020)	3.85	12.30	17.30	17.30
	LLICTI (Kamisli, 2023)	3.65	10.90	14.30	14.30
	Proposed	3.63	10.70	14.10	14.10

6 Results and Discussion

The performance of the proposed optimal kernel transformer model is evaluated against several existing models including iVPF (Zhang et al., 2021), GOLLIC (Lan et al., 2022), multi-scale progressive statistical model (MSPSM) (Zhang et al., 2020), super resolution (SR) (Cao et al., 2020), and learned lossless image compression through interpolation (LLICTI) (Kamisli, 2023). Each model is assessed across three datasets: Kodak dataset, CLIC professional validation dataset, and DIV2K dataset. The evaluation focuses on metrics such as compression performance (bits-per-dimension [BPD]) and inference time (ms/sample), providing insights into the efficacy of the proposed approach compared to established methods in lossless image compression.

6.1 Experimental Outcomes

Table 2 compares the proposed optimal kernel transformer model with five existing methods, namely iVPF (Zhang et al., 2021), GOLLIC (Lan et al., 2022), MSPSM (Zhang et al., 2020), SR (Cao et al., 2020), and LLICTI (Kamisli, 2023), in terms of compression performance (BPD) and inference time (ms/sample) on three datasets: Kodak, the CLIC professional validation dataset, and DIV2K. In terms of compression performance, the proposed method achieves the lowest BPD on Kodak and DIV2K. For instance, on Kodak it achieves a BPD of 3.06, outperforming LLICTI (Kamisli, 2023), which has a BPD of 3.10, and on DIV2K it achieves 3.63 against 3.65 for LLICTI. This indicates that the proposed method can more effectively reduce the amount of data required to represent an image without losing any information. On CLIC, its BPD of 3.91 is lower than those of iVPF, GOLLIC, and SR but higher than those of LLICTI (3.85) and MSPSM (3.88). The proposed method also records the lowest inference times on Kodak and DIV2K, while on CLIC only MSPSM is faster. Overall, Table 2 indicates that the proposed method offers a favorable balance between compression effectiveness and processing time among the evaluated methods.
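To make the BPD figures concrete, the sketch below uses the usual definition of bits per dimension (total code length in bits divided by height × width × channels) and assumes 8-bit source channels, which matches the 24 bits per pixel of the Kodak images. The function names are hypothetical, and the paper itself does not spell out the formula.

```python
def bits_per_dimension(total_bits, height, width, channels=3):
    """Total code length divided by the number of sub-pixel values."""
    return total_bits / (height * width * channels)


def compression_ratio(bpd, source_bit_depth=8):
    """Raw size over compressed size, assuming source_bit_depth bits per channel."""
    return source_bit_depth / bpd


def relative_saving(bpd_new, bpd_ref):
    """Fraction of bits saved by bpd_new relative to bpd_ref."""
    return (bpd_ref - bpd_new) / bpd_ref


# Kodak figures from Table 2: proposed 3.06 BPD, LLICTI 3.10 BPD.
cr_proposed = compression_ratio(3.06)          # about 2.61
saving_vs_llicti = relative_saving(3.06, 3.10)  # about 1.3% fewer bits
```

Read this way, a gap of 0.04 BPD on Kodak corresponds to roughly 1.3% fewer bits per image, which helps put the differences among the methods in Table 2 in proportion.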

Table 3 shows how varying the regularization coefficient vector k influences key performance metrics such as the number of parameters, FLOPs, latency, and accuracy of the model. The k values in this context refer to the regularization coefficients applied to different regions of the convolutional kernel, with k in the format l-m-n, where, l is the coefficient applied to the corners of the filter, m is applied to the edges, and n is applied to the center.

Table 3.

Impact of k on the Model Performance.

k Value	Parameters (M)	FLOPs (M)	Latency (ms)	Accuracy (%)
1-1-1	6.13	156.9	1.647	93.46
3-2-1	6.15	158.4	1.650	93.68
4-2-1	6.09	157.5	1.640	93.96
6-3-1	6.21	159.0	1.653	94.12
9-3-1	6.24	160.5	1.655	94.24
10-5-1	6.27	161.2	1.660	94.30

The proposed approach tunes these coefficients so that sparsity is imposed unevenly across the regions of the filter kernel, enhancing the model's ability to capture important features. The baseline configuration of k = 1-1-1 (equal sparsity across all regions) yields an accuracy of 93.46%. As the corner and edge coefficients grow relative to the center coefficient, accuracy increases progressively. The configuration of k = 4-2-1, in which the corners receive the largest coefficient and the center the smallest, improves accuracy to 93.96%. Larger ratios (6-3-1, 9-3-1, and 10-5-1) yield further gains, with the best accuracy of 94.30% achieved with k = 10-5-1. This result suggests that applying different regularization strengths to different parts of the convolutional kernel captures spatial features more effectively, leading to better classification performance. The cost is modest: the number of parameters, FLOPs, and latency generally rise slightly with larger k values (k = 4-2-1 is an exception, with fewer parameters and lower latency than the baseline), reaching 6.27 M parameters, 161.2 M FLOPs, and 1.660 ms at k = 10-5-1. This addresses the problem of uniform sparsity across the kernel, which might overlook the significance of different spatial regions in the filter.

Table 4 illustrates the impact of the number of filter groups (c) on model performance in terms of parameters, FLOPs, latency, and accuracy. The filter groups partition the convolutional filters into sets, which influences how efficiently the model extracts features, so adjusting c directly affects the trade-off between computational cost and model performance. Setting c to 2 achieves the best overall performance, with 94.80% accuracy. This configuration has the fewest parameters (6.06 M) and lower FLOPs (134.5 M) than c = 1, which requires 152.3 M FLOPs and reaches a lower accuracy (94.15%). The latency is nearly constant across all configurations (from 1.640 ms to 1.649 ms), indicating that the choice of c mainly affects computational cost and accuracy rather than inference speed. Increasing c to 4 further reduces the FLOPs (115.7 M) but at the expense of a slight accuracy drop (94.42%). This suggests diminishing returns as c increases beyond a certain point, where additional filter groups no longer provide meaningful gains in feature extraction. The proposed approach therefore uses c = 2, which balances accuracy and computational efficiency: it lowers the FLOPs relative to c = 1 while giving the highest accuracy of the three settings, although c = 4 has the lowest FLOPs. This makes c = 2 a reasonable choice for resource-constrained environments or large-scale applications that require both high accuracy and low computational cost.

Table 4.

Impact of the Number of Filter Groups c on the Model Performance.

c	Params (M)	FLOPs (M)	Latency (ms)	Accuracy (%)
1	6.18	152.3	1.645	94.15
2	6.06	134.5	1.640	94.80
4	6.16	115.7	1.649	94.42

Table 5.

Performance Comparison of Different Regularization Schemes.

Network	Method	Params	FLOPs	Latency	Accuracy	Params↓	FLOPs↓	Latency↓	Accuracy↓
VGG-16	Baseline	13.0M	305.2M	2.132ms	94.64%	-	-	-	-
	$L_{sparse}^{1 - 1 - 1}$	6.06M	145.1M	1.535ms	94.87%	53.38%	52.46%	28.01%	-0.24%
	$L_{sparse}^{4 - 2 - 1}$	5.11M	134.4M	1.531ms	94.97%	60.69%	55.96%	28.16%	-0.15%
	$L_{sparse}^{4 - 2 - 1} + L_{dir}$	6.20M	141.3M	1.533ms	95.04%	52.31%	53.69%	28.10%	-0.42%
	$L_{sparse}^{4 - 2 - 1} + L_{dir} + L_{group}$	6.08M	144.3M	1.512ms	95.57%	53.23%	52.71%	29.09%	−0.99%
ResNet-20	Baseline	0.259M	39.9M	2.516ms	93.11%	-	-	-	-
	$L_{sparse}^{1 - 1 - 1}$	0.106M	11.3M	2.265ms	91.69%	59.08%	71.68%	9.97%	1.52%
	$L_{sparse}^{4 - 2 - 1}$	0.107M	10.4M	2.232ms	91.77%	58.68%	73.93%	11.29%	1.44%
	$L_{sparse}^{4 - 2 - 1} + L_{dir}$	0.108M	11.5M	2.207ms	91.89%	58.30%	71.18%	12.29%	1.22%
	$L_{sparse}^{4 - 2 - 1} + L_{dir} + L_{group}$	0.109M	11.7M	3.963ms	92.48%	57.93%	70.68%	11.55%	0.68%
ResNet-34	Baseline	0.374M	57.9M	3.258ms	93.47%	-	-	-	-
	$L_{sparse}^{1 - 1 - 1}$	0.141M	19.3M	3.264ms	91.83%	62.29%	66.68%	−0.18%	1.75%
	$L_{sparse}^{4 - 2 - 1}$	0.129M	20.7M	3.312ms	91.95%	65.51%	64.25%	−1.66%	1.63%
	$L_{sparse}^{4 - 2 - 1} + L_{dir}$	0.157M	20.3M	3.164ms	91.46%	58.02%	64.96%	2.89%	2.15%
	$L_{sparse}^{4 - 2 - 1} + L_{dir} + L_{group}$	0.130M	24.8M	3.032ms	92.76%	65.26%	57.17%	6.93%	0.76%
ResNet-56	Baseline	0.762M	105.7M	6.032ms	94.17%	-	-	-	-
	$L_{sparse}^{1 - 1 - 1}$	0.241M	35.3M	5.465ms	92.68%	68.37%	66.58%	9.38%	1.58%
	$L_{sparse}^{4 - 2 - 1}$	0.359M	46.4M	5.473ms	92.99%	52.91%	56.12%	9.27%	1.26%
	$L_{sparse}^{4 - 2 - 1} + L_{dir}$	0.357M	41.8M	5.332ms	93.11%	53.19%	60.45%	11.61%	1.13%
	$L_{sparse}^{4 - 2 - 1} + L_{dir} + L_{group}$	0.358M	44.9M	4.921ms	93.54%	53.05%	57.54%	18.45%	0.67%

Table 6.

Implementation Outcome for Different Dataset.

The experiments are conducted on VGG-16 and ResNets, and the results are summarized in Table 5. In all cases, the same parameters are used, namely c = 2, $α_{1}$ = 0.001, $α_{2}$ = 0.007, and $α_{3}$ = 0.007, where the $α$ values are the scalar coefficients that set the relative weights of the regularization terms $L_{sparse}$, $L_{dir}$, and $L_{group}$. The CR is assessed by the total number of parameters and floating-point operations (FLOPs), while the model's performance is indicated by its accuracy on the test set. Compared with the baseline model, the proposed method achieves substantial compression and actual acceleration while maintaining an acceptable level of accuracy. The regularization terms benefit model performance: as they are added, the accuracy of VGG-16 improves from 94.87% to 94.97%, 95.04%, and ultimately 95.57%. Notably, the $L_{group}$ regularization not only improves accuracy but also helps reduce inference latency by balancing the computational load across different filters. For instance, on VGG-16 the $L_{sparse}^{4 - 2 - 1} + L_{dir} + L_{group}$ scheme has lower latency than the $L_{sparse}^{4 - 2 - 1} + L_{dir}$ scheme (1.512 ms against 1.533 ms) despite slightly higher FLOPs (144.3 M against 141.3 M).

Table 5 contains two accuracy columns. The first gives the classification accuracy of each model after a specific regularization scheme is applied, that is, its absolute performance. The second, labeled "Accuracy↓," gives the change in accuracy relative to the baseline model, which shows how each scheme affects accuracy. By presenting both absolute values and changes relative to the baseline, the table supports an assessment of the trade-offs between model complexity, computational efficiency, and accuracy.
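The Params↓ and FLOPs↓ columns of Table 5 are consistent with a simple relative reduction against the baseline. The sketch below reproduces the VGG-16 $L_{sparse}^{1 - 1 - 1}$ row; the function name is hypothetical, and the formula is inferred from the tabulated values rather than stated in the text.

```python
def reduction_percent(baseline, value):
    """Relative reduction with respect to the baseline, in percent."""
    return 100.0 * (baseline - value) / baseline


# VGG-16 baseline: 13.0 M parameters, 305.2 M FLOPs.
# L_sparse^{1-1-1}: 6.06 M parameters, 145.1 M FLOPs.
params_drop = round(reduction_percent(13.0, 6.06), 2)
flops_drop = round(reduction_percent(305.2, 145.1), 2)
```

These evaluate to 53.38 and 52.46, matching the 53.38% and 52.46% listed for that row.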

Figure 4.

Performance comparison for computational efficiency and scalability.

Table 7.

Ablation Study on Various Model Performances.

Model	Kodak (bpsp)	CLIC (bpsp)	DIV2K (bpsp)
Proposed MMCC + 7 × 7 autoregressive model of x	2.51	2.31	2.62
Lee et al. (2019) + MMCC + 7 × 7 autoregressive model of x	2.72	2.52	2.83
Lee et al. (2019)	3.17	3.12	3.28

Table 6 presents the implementation results for different datasets compared across different image formats. To better understand the effect of each regularizer, Grad-CAM (Selvaraju et al., 2017) is applied to visualize the discriminative regions of a ResNet-20 obtained by the search. One image is chosen from every category in the test set of the given datasets. In the $L_{sparse}^{1 - 1 - 1}$ case, background noise disturbs the outcome, and the model fails to attend to the discriminative regions of the objects. In contrast, the $L_{sparse}^{4 - 2 - 1}$ case pays more attention to the objects, which benefits from the prior emphasis on the central parameters of the convolution kernel. However, there is an offset between the original objects and the activation regions: too much attention is given to the lower parts of the objects. This issue is mitigated by adding $L_{dir}$ to $L_{sparse}^{4 - 2 - 1}$, denoted $L_{sparse}^{4 - 2 - 1} + L_{dir}$, although some objects are still not clearly recognized. Satisfactory activation results are obtained in the $L_{sparse}^{4 - 2 - 1} + L_{dir} + L_{group}$ case, which maintains the information flow of the different filter groups.

Figure 4 presents the performance analysis in terms of the computational efficiency and scalability of the proposed model, showing how the Optimal Kernel Transformer performs in image compression compared with other models. Computational efficiency refers to the ability to compress and decompress images with little computational overhead; the proposed approach requires fewer resources, such as processing time, than the existing approaches. Scalability refers to the ability of the model to maintain its performance as the image resolution or the dataset size increases; the figure shows how the Optimal Kernel Transformer scales when handling large amounts of data without a significant drop in performance. The analysis thus highlights that the proposed model achieves high performance in terms of both scalability and computational efficiency.

Table 7 compares several compression models, focusing on the impact of integrating an additional autoregressive model designed for the variable x. The evaluation metric is bits per sub-pixel (bpsp) (Luo et al., 2023), which measures the average number of bits required to encode each sub-pixel, that is, each color-channel value of a pixel, in the compressed images. Lower bpsp values indicate more efficient compression. The table shows that augmenting the compression model with an extra autoregressive model for x enhances compression efficiency: adding MMCC and the autoregressive model to the baseline of Lee et al. (2019) lowers the rate on Kodak from 3.17 to 2.72 bpsp. The proposed model, integrating MMCC and the additional autoregressive component, outperforms both the baseline model and the enhanced baseline model across all evaluated datasets, reaching 2.51, 2.31, and 2.62 bpsp on Kodak, CLIC, and DIV2K, respectively.

On analyzing the results it is found that the proposed Optimal Kernel Transformer approach for learned lossless image compression offers several strengths that distinguish it from existing methods. By introducing CC models, MMCC entropy modeling, and RNN-based compression, it efficiently captures pixel redundancies and achieves superior CRs without compromising image quality. The integration of shifted window-based self-attention modules inspired by ViT and Swin Transformer architectures enhances correlation capture among spatially adjacent elements, improving overall compression performance. Additionally, the methodology automates the search for optimal kernel shapes using the SOKS framework, further optimizing network efficiency through SWP to reduce storage and computational demands. These innovations collectively contribute to the model's ability to achieve state-of-the-art compression efficiency while maintaining high-quality image reconstruction, positioning it as a promising solution for real-world image compression applications.

7 Conclusions and Future Works

This article introduces a novel approach to lossless image compression using the Optimal Kernel Transformer, addressing key challenges in traditional methods while utilizing deep learning and neural coding techniques for improved performance. The approach includes the use of CC models to mitigate the time-intensive nature of autoregressive models, the introduction of the MMCC framework to better model complex data distributions, and the application of a sequence-to-sequence RNN model for entropy coding. A symmetrical transformer architecture is employed to optimize the downsampling and upsampling processes, while the searching of optimal kernel shapes (SOKS) and stripe-wise pruning (SWP) further enhance network efficiency and reduce computational overhead. Compared with iVPF, GOLLIC, MSPSM, SR, and LLICTI, the proposed model obtains BPD values of 3.06, 3.91, and 3.63 on the Kodak, CLIC, and DIV2K datasets, respectively, which are the lowest values on Kodak and DIV2K. It also achieves faster inference (7.80 ms for encoding on Kodak) than the competing models, making it a strong choice for lossless image compression in terms of both compression performance and speed. With an accuracy of 94.80% at c = 2, the pruned network balances precision and computational efficiency. Future work will explore optimizing the Optimal Kernel Transformer for efficient video compression applications.

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Bégaint

Racapé

Feltman

Pushparaja

(2020). Compressai: a pytorch library and evaluation platform for end-to-end compression research. arXiv preprint arXiv:2011.03029.

Cao

C. Y.

Krähenbühl

(2020). Lossless image compression through super-resolution. arXiv preprint arXiv:2004.02872,

Chang

J. M.

Ding

J. J.

Lin

H. S.

(2019). Adaptive prediction, context modeling, and entropy coding methods for CALIC lossless image compression. 2019 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS), Bangkok, Thailand, 349–352.

Cheng

Sun

Takeuchi

Katto

(2020). Learned lossless image compression with a hyperprior and discretized Gaussian mixture likelihoods. ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2158–2162.

The CLIC dataset was available at https://www.compression.cc/ (accessed on October 2023)

The DIV2K dataset was available at https://www.kaggle.com/datasets/joe1995/div2k-dataset (accessed on October 2023)

Flamich

Havasi

Hernández-Lobato

J. M.

(2020). Compressing images by encoding their latent representations with relative entropy coding. NeurIPS Proceedings of the Advances in Neural Information Processing Systems (NeurIPS 2020), 33, 16131–16141.

Gharehchopogh

F. S.

(2023). An improved harris hawks optimization algorithm with multi-strategy for community detection in social network. Journal of Bionic Engineering, 20(3), 1175–1197. https://doi.org/10.1007/s42235-022-00303-z

Gharehchopogh

F. S.

Abdollahzadeh

Barshandeh

Arasteh

(2023). A multi-objective mutation-based dynamic harris hawks optimization for botnet detection in IoT. Internet of Things, 24(December 2023), 100952. https://doi.org/10.1016/j.iot.2023.100952 .

10.

Gharehchopogh

F. S.

Ibrikci

(2024). An improved African vultures optimization algorithm using different fitness functions for multi-level thresholding image segmentation. Multimedia Tools and Applications, 83(6), 16929–16975. https://doi.org/10.1007/s11042-023-16300-1

11.

Gharehchopogh

F. S.

Khargoush

A. A.

(2023). A chaotic-based interactive autodidactic school algorithm for data clustering problems and its application on COVID-19 disease detection. Symmetry, 15(4), 894. https://doi.org/10.3390/sym15040894

Gharehchopogh, F. S., Ucan, Ibrikci, Arasteh, & Isik (2023). Slime mould algorithm: A comprehensive survey of its variants and applications. Archives of Computational Methods in Engineering, 30(4), 2683–2723. https://doi.org/10.1007/s11831-023-09883-3

Ghorbel, Hamidouche, & Morin (2023). AICT: An adaptive image compression transformer. 2023 IEEE International Conference on Image Processing (ICIP), pp. 126–130.

Kamisli (2023). Learned lossless image compression through interpolation with low complexity. IEEE Transactions on Circuits and Systems for Video Technology, 33(12), 7832–7841. https://doi.org/10.1109/TCSVT.2023.3273578

Koyuncu, A. B., Gao, Boev, Gaikov, Alshina, & Steinbach (2022). Contextformer: A transformer with spatio-channel attention for context modeling in learned image compression. European Conference on Computer Vision, pp. 447–463.

Kudo, Orihashi, Tanida, & Shimizu (2019). GAN-based image compression using mutual information maximizing regularization. 2019 Picture Coding Symposium (PCS), Ningbo, China, pp. 1–5.

Lan, Qin, Sun, Xiang, & Sun (2022). GOLLIC: Learning global context beyond patches for lossless high-resolution image compression. arXiv preprint arXiv:2210.03301.

Lee, Cho, Jeong, Kwon, Kim, H. Y., & Choi, J. S. (2019). Extended end-to-end optimized image compression method based on a context-adaptive entropy model. CVPR Workshops, p. 0.

Liu, Lin, Cao, Wei, Zhang, Lin, & Guo (2021). Swin transformer: Hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022.

Liu, Wang, W. C., Shah, Dhawan, & Urtasun (2020). Conditional entropy coding for efficient video compression. European Conference on Computer Vision, pp. 453–468.

Liu, & Zhang (2022). SOKS: Automatic searching of the optimal kernel shapes for stripe-wise network pruning. IEEE Transactions on Neural Networks and Learning Systems, 34(12), 9912–9924.

Luo, Dai, Zou, & Xiong (2023). Learned lossless compression for JPEG via frequency-domain prediction. https://doi.org/10.48550/arXiv.2303.02666

Mentzer, Agustsson, Tschannen, Timofte, & Gool, L. V. (2019). Practical full resolution learned lossless image compression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10629–10638.

Mentzer, Gool, L. V., & Tschannen (2020). Learning better lossless compression using lossy compression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6638–6647.

Xiong, Fan, Liu, & Gao (2020). Graph-based non-convex low-rank regularization for image compression artifact reduction. IEEE Transactions on Image Processing, 29, 5374–5385. https://doi.org/10.1109/TIP.2020.2975931

Özbay, F. A., & Gharehchopogh, F. S. (2023). RETRACTED ARTICLE: Peripheral blood smear images classification for acute lymphoblastic leukemia diagnosis with an improved convolutional neural network. Journal of Bionic Engineering, 1–1. https://doi.org/10.1007/s42235-023-00441-y

Piri, Mohapatra, Acharya, Gharehchopogh, F. S., Gerogiannis, V. C., Kanavos, & Manika (2022). Feature selection using artificial gorilla troop optimization for biomedical data: A case analysis with COVID-19 data. Mathematics, 10(15), 2742. https://doi.org/10.3390/math10152742

Rabbani, & Joshi (2002). An overview of the JPEG 2000 still image compression standard. Signal Processing: Image Communication, 17(1), 3–48. https://doi.org/10.1016/S0923-5965(01)00024-8

Selvaraju, R. R., Cogswell, Das, Vedantam, Parikh, & Batra (2017). Grad-CAM: Visual explanations from deep networks via gradient-based localization. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, pp. 618–626.

Shen, Zhang, Gharehchopogh, F. S., & Mirjalili (2023). An improved whale optimization algorithm based on multi-population evolution for global optimization and engineering design problems. Expert Systems with Applications, 215, 119269. https://doi.org/10.1016/j.eswa.2022.119269

Sneyers, & Wuille (2016). FLIF: Free lossless image format based on MANIAC compression. 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, pp. 66–70.

Song, Zhu, Zhang, Guo, & Yang (2019). Joint image compression–encryption scheme using entropy coding and compressive sensing. Nonlinear Dynamics, 95(3), 2235–2261. https://doi.org/10.1007/s11071-018-4689-9

The Kodak dataset was available at https://www.kaggle.com/datasets/sherylmehta/kodak-dataset/code (accessed on October 2023)

Toderici, Vincent, Johnston, Jin Hwang, Minnen, Shor, & Covell (2017). Full resolution image compression with recurrent neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5306–5314.

Wang, Liu, Sun, & Katto (2023). Learned lossless image compression with combined channel-conditioning models and autoregressive modules. IEEE Access, 11, 73462–73469. https://doi.org/10.1109/ACCESS.2023.3291591

WebP Image Format. [Online]. Available: https://developers.google.com/speed/webp/ (Accessed: Jul. 16, 2023)

Zhang, Cricri, Tavakoli, H. R., Zou, Aksu, & Hannuksela, M. M. (2020). Lossless image compression using a multi-scale progressive statistical model. Proceedings of the Asian Conference on Computer Vision, Kyoto, Japan.

Zhang, & Kang (2021). iVPF: Numerical invertible volume preserving flow for efficient lossless compression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 620–629.

Zou, Song, & Zhang (2022). The devil is in the details: Window-based attention for image compression. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 17492–17501.

		Compression Performance (BPD) ↓		Inference Time (ms/sample) ↓
Dataset	Method	Dataset	Single	Encode	Decode
Kodak	iVPF (Zhang et al., 2021)	3.20	9.50	12.65	12.65
	GOLLIC (Lan et al., 2022)	3.15	8.70	10.25	10.25
	MSPSM (Zhang et al., 2020)	3.12	8.30	9.45	9.45
	SR (Cao et al., 2020)	3.25	10.20	15.20	15.20
	LLICTI (Kamisli, 2023)	3.10	8.50	9.80	9.80
	Proposed	3.06	7.80	8.75	8.75
CLIC	iVPF (Zhang et al., 2021)	3.98	12.60	17.80	17.80
	GOLLIC (Lan et al., 2022)	3.95	12.30	16.75	16.75
	MSPSM (Zhang et al., 2020)	3.88	11.80	15.90	15.90
	SR (Cao et al., 2020)	4.10	13.50	19.20	19.20
	LLICTI (Kamisli, 2023)	3.85	12.10	16.70	16.70
	Proposed	3.91	12.00	16.50	16.50
DIV2K	iVPF (Zhang et al., 2021)	3.68	10.80	14.45	14.45
	GOLLIC (Lan et al., 2022)	3.70	11.00	14.75	14.75
	MSPSM (Zhang et al., 2020)	3.70	11.10	15.20	15.20
	SR (Cao et al., 2020)	3.85	12.30	17.30	17.30
	LLICTI (Kamisli, 2023)	3.65	10.90	14.30	14.30
	Proposed	3.63	10.70	14.10	14.10
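
As a reading aid for the BPD column, the following minimal sketch computes how many bits the proposed model saves relative to each baseline, using only the figures copied from the table above. The helper names (`relative_saving`, `implied_ratio`, `summarize`) are hypothetical, and the implied compression ratio assumes 8 bits per colour sample in the uncompressed image, which the table itself does not state.

```python
# BPD values copied from the Kodak, CLIC, and DIV2K rows of the table above.
BPD = {
    "Kodak": {"iVPF": 3.20, "GOLLIC": 3.15, "MSPSM": 3.12,
              "SR": 3.25, "LLICTI": 3.10, "Proposed": 3.06},
    "CLIC": {"iVPF": 3.98, "GOLLIC": 3.95, "MSPSM": 3.88,
             "SR": 4.10, "LLICTI": 3.85, "Proposed": 3.91},
    "DIV2K": {"iVPF": 3.68, "GOLLIC": 3.70, "MSPSM": 3.70,
              "SR": 3.85, "LLICTI": 3.65, "Proposed": 3.63},
}

RAW_BITS = 8  # assumption: 8 bits per colour sample before compression


def relative_saving(baseline_bpd, proposed_bpd):
    """Percentage of bits saved by the proposed model relative to a baseline."""
    return 100.0 * (baseline_bpd - proposed_bpd) / baseline_bpd


def implied_ratio(bpd, raw_bits=RAW_BITS):
    """Compression ratio implied by a BPD value for raw_bits-per-sample input."""
    return raw_bits / bpd


def summarize(table):
    """Return {dataset: {method: saving_percent}} for every baseline method."""
    out = {}
    for dataset, rows in table.items():
        proposed = rows["Proposed"]
        out[dataset] = {
            method: relative_saving(value, proposed)
            for method, value in rows.items()
            if method != "Proposed"
        }
    return out


if __name__ == "__main__":
    for dataset, savings in summarize(BPD).items():
        ratio = implied_ratio(BPD[dataset]["Proposed"])
        print(f"{dataset}: implied compression ratio {ratio:.3f}")
        for method, pct in savings.items():
            print(f"  vs {method}: {pct:+.2f}%")
```

A positive percentage means the proposed row reports a lower BPD than the baseline; a negative one means the baseline row is lower.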

Learned Lossless Image Compression Based on Optimal Kernel Transformer Approach




Table 3. Impact of k on the Model Performance.

k Value	Parameters (M)	FLOPs (M)	Latency (ms)	Accuracy (%)
1-1-1	6.13	156.9	1.647	93.46
3-2-1	6.15	158.4	1.650	93.68
4-2-1	6.09	157.5	1.640	93.96
6-3-1	6.21	159.0	1.653	94.12
9-3-1	6.24	160.5	1.655	94.24
10-5-1	6.27	161.2	1.660	94.30