Abstract

In the rapidly evolving field of digital security, this study aims to advance image steganography by developing and benchmarking seven deep learning architectures, with a focus on imperceptibility, embedding capacity, and robustness against steganalysis. The models implemented are the residual dense network (RDN), vision transformer with adaptive attention (ViT-AA), progressive generation network (PGN), dual-stream architecture (DSA), wavelet-based hybrid network (WHN), mutual attention transformer (MAT), and efficient attention pyramid transformer (EAPT). Using PyTorch-based implementations and standardized benchmark datasets such as DIV2K, COCO, and ImageNet, each architecture was trained after structured preprocessing and evaluated using metrics including PSNR, SSIM, LPIPS, and statistical steganalysis resistance. Experimental results show that WHN achieved the highest visual quality (PSNR = 43.5 dB, SSIM = 0.995), while MAT and EAPT provided superior security, with detection rates near random chance (0.501–0.502) and robustness against JPEG compression and noise insertion. PGN and DSA offered low-latency performance suitable for resource-constrained or mobile applications, while ViT-AA provided a balanced trade-off between imperceptibility and robustness. The findings confirm that deep learning approaches surpass traditional methods and establish new computational benchmarks for covert communication and digital forensics. These results recommend WHN, MAT, and EAPT for high-security contexts, PGN and DSA for embedded platforms, and ViT-AA as a general-purpose framework, while encouraging further research into lightweight variants for IoT and real-time deployments.

Keywords

deep learning steganography, vision transformers, progressive image generation, wavelet-based neural networks, information security, adaptive attention mechanisms

1. Introduction

Steganography is the art of hiding secret messages inside harmless carriers (Fridrich, 2009; Goodfellow et al., 2016), and it has become one of the most needed techniques for communication security in the digital age. Messages are commonly embedded in seemingly innocent carriers such as images, audio, or video files (Fridrich, 2009). Because the information is disguised behind an ordinary-looking carrier, it is very hard for an unauthorized party to extract it, or even to learn that it exists.

Traditional forms of protection are rooted in cryptography and encryption (Fridrich, 2009). Encryption protects the content of a message, but visibly encrypted traffic can attract undue attention during the transmission of sensitive information. Steganography, by contrast, hides the very existence of the message, which makes it harder for an adversary to detect and intercept it (Fridrich, 2009; Song et al., 2024). Nevertheless, many classic steganographic methods have shortcomings: a restricted range of embedding options, vulnerability to attacks, and limited security (Baluja, 2017; Song et al., 2024). These shortcomings are deep-seated, because classic methods depend critically on the statistical properties of the cover medium, which an adversary can exploit to reveal hidden messages.

To meet these challenges, there is intense interest in advanced techniques, particularly artificial intelligence (AI) and deep learning (Baluja, 2017; Goodfellow et al., 2016). Such techniques make it possible to develop more powerful and effective steganography systems that learn complex dependencies and patterns within data (Song et al., 2024). As a result, secret messages can be embedded in a way that is much less visible and more difficult to detect.

In this paper, we examine how advanced deep learning architectures improve the security of image steganography. We analyze how five contemporary techniques perform: the residual dense network (RDN), vision transformer with adaptive attention (ViT-AA), progressive generation network (PGN), dual-stream architecture (DSA), and wavelet-based hybrid network (WHN). The comparison describes the strengths and weaknesses of these techniques and points to future research directions for secure image steganography (Goodfellow et al., 2016; Song et al., 2024).

2. Background

2.1. Evolution of Image Steganography Toward Deep Learning Architectures

The origins of steganography can be traced back to ancient civilizations where messages were hidden in physical media such as wax tablets, invisible inks, or even shaved heads with regrown hair concealing messages. In the digital era, steganography evolved into embedding information within multimedia carriers such as images, audio, and video. Early computational approaches primarily employed spatial-domain techniques such as least significant bit (LSB) substitution (Chandramouli et al., 2004), where message bits replace the LSBs of pixel values. While simple and offering high payloads, such methods were highly vulnerable to statistical and visual steganalysis. To overcome these weaknesses, frequency-domain approaches emerged, leveraging transforms such as the discrete cosine transform (DCT) and discrete wavelet transform (DWT) (Provos & Honeyman, 2003). These approaches offered improved robustness and imperceptibility but remained constrained by limited embedding capacity and distortion trade-offs.

A major shift occurred with the adoption of machine learning techniques. Statistical models such as HUGO and rich models (SRM) (Fridrich, 2009) enhanced detection resistance by exploiting high-dimensional features, but they were computationally expensive. The breakthrough came with deep learning. Baluja’s deep steganography framework (Baluja, 2017) demonstrated that autoencoders could jointly optimize embedding and extraction, producing stego-images of high visual fidelity. This was followed by adversarial training frameworks such as HiDDeN (Zhu et al., 2018) and SteganoGAN (Zhang et al., 2019c), which significantly advanced imperceptibility and robustness through the use of convolutional neural networks (CNNs) and generative adversarial networks (GANs).

More recently, attention-based and transformer architectures have been introduced to exploit long-range spatial dependencies for adaptive embedding (Lin et al., 2023). In parallel, iterative optimization and diffusion-based approaches such as iterative neural optimizers (Chen et al., 2023) and CRoSS LoRA (Yu et al., 2023) have pushed the field toward more adaptive and content-aware frameworks. Very recent developments, including multi-layered deep learning steganography (Sanjalawe et al., 2025) and invertible neural networks for reversible hiding (Zhou et al., 2025), highlight the growing specialization and maturity of modern steganographic systems.

This historical progression illustrates the steady evolution from simple spatial substitution methods to sophisticated deep architectures capable of balancing imperceptibility, robustness, and embedding capacity. The present study builds on this foundation by benchmarking diverse modern architectures to provide a comprehensive analysis of their respective strengths, limitations, and practical deployment implications.

2.2. General Framework of Image Steganography

In image steganography, secret data is hidden inside a cover image to produce a stego-image that is virtually indistinguishable from the original (Holub & Fridrich, 2012; Kumar et al., 2020). The general structure of image steganography includes the following main components (Holub et al., 2014):

Cover Image, C: A cover image is the original image utilized to carry the secret message. It can be in any digital format, like JPEG, PNG, or BMP (Holub & Fridrich, 2012).

Secret Message, M: The information being concealed, which may consist of text, audio, or another image (Kumar et al., 2020).

Embedding Algorithm, E: The algorithm that determines how the secret message is embedded into the cover image. Available techniques include LSB substitution and other spatial-domain techniques, as well as frequency-domain techniques such as DCT and DWT (Holub et al., 2014; Pevný et al., 2010).

Stego-Image, S: The resulting image, which looks like the original cover image but in fact carries the hidden secret message (Holub & Fridrich, 2012).

Extraction Algorithm, D: This algorithm extracts the secret message from the stego-image. It requires knowledge of the embedding algorithm and, where applicable, a key or password K for security purposes (Holub et al., 2014; Kumar et al., 2020).

Mathematical Representation

The embedding process can be mathematically represented as:

\begin{aligned} S = E (C, M) \end{aligned}

(1)

The extraction process can be represented as:

\begin{aligned} M = D (S, K) \end{aligned}

(2)
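
As a concrete instance of Equations (1) and (2), the following minimal sketch uses LSB substitution as the embedding rule. The function names `embed` and `extract`, the use of the key to seed a pseudo-random pixel order, and the assumption that the receiver knows the message length are all illustrative choices, not part of the framework described above. In particular, the sketch passes the key to both functions, whereas Equation (1) is written without it.

```python
import numpy as np

def embed(cover: np.ndarray, bits: np.ndarray, key: int) -> np.ndarray:
    """S = E(C, M): write message bits into the LSBs of key-selected pixels."""
    flat = cover.flatten()  # flatten() copies, so the cover itself is not modified
    if bits.size > flat.size:
        raise ValueError("message is longer than the cover capacity")
    # Assumption: the key seeds a pixel order shared by sender and receiver.
    positions = np.random.default_rng(key).permutation(flat.size)[: bits.size]
    # Clear the least significant bit, then set it to the message bit.
    flat[positions] = (flat[positions] & 0xFE) | bits.astype(np.uint8)
    return flat.reshape(cover.shape)

def extract(stego: np.ndarray, n_bits: int, key: int) -> np.ndarray:
    """M = D(S, K): read the LSBs back in the same key-derived order."""
    positions = np.random.default_rng(key).permutation(stego.size)[:n_bits]
    return stego.flatten()[positions] & 1

# Usage on a random 8x8 grayscale "cover" (illustrative data only).
rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
bits = rng.integers(0, 2, size=20, dtype=np.uint8)
stego = embed(cover, bits, key=42)
recovered = extract(stego, bits.size, key=42)
```

Each pixel changes by at most one intensity level, which is why LSB substitution is visually imperceptible even though it leaves statistical traces.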

Figure 1 depicts the general schema of image steganography. The major parts of this schema include cover image processing, secret message embedding, and stego-image extraction. This framework defines the mathematical relationship among them through $E (C, M)$ for embedding and $D (S, K)$ for extraction. The directional arrows present the flow of information from one module to another. The dotted-line paths show the security validation routes, while the solid line shows the route for major data processing.

Figure 1.

General framework for image steganography.

2.3. Related Work

Steganography has been an active research subject for many years, and researchers have explored many strategies and techniques to enhance the capacity, visual fidelity, and security of covert communication systems (Husien & Badi, 2015; Ruohan Meng & Cui, 2018). These efforts have challenged traditional methods, including the well-established LSB substitution and other spatial-domain methods, on their ability to manage the delicate trade-off between embedding capacity and detection resistance (Pevný et al., 2010). This has led to alternative methodologies designed to meet the requirements of modern covert communication systems.

One of the most important and best-analyzed techniques in steganography is LSB substitution (Holub & Fridrich, 2012; Holub et al., 2014). The LSBs of the image's pixel values are replaced with the bits of the secret message. The process is simple and accessible and permits a high embedding capacity, but it is vulnerable to statistical attacks, so an attacker may need little time to decode the hidden data. Wang et al. (2024) recently proposed a robust blind image watermarking technique based on interest points, demonstrating continued innovation in information embedding strategies. To strengthen security, researchers have developed more sophisticated spatial-domain approaches (Husien & Badi, 2015), such as pixel value differencing and adaptive pixel pair matching (Ruohan Meng & Cui, 2018). However, these alternatives are a double-edged sword, since they may reduce embedding capacity or, in some cases, introduce visible distortions in the stego-image.

To avoid these drawbacks of spatial-domain methods, researchers have extensively explored frequency-domain approaches, specifically those based on the DCT or DWT (Holub et al., 2014; Pevný et al., 2010). The basic premise is that the secret message is hidden in the transform coefficients, which reduces susceptibility to statistical steganalysis (Husien & Badi, 2015). Prominent examples carefully modify DCT or DWT coefficients using models of the human visual system or embedding techniques adapted to the image content (Ruohan Meng & Cui, 2018). Hu et al. (2023) recently proposed StegaEdge, a learning-based approach that leverages edge guidance for steganographic embedding, further expanding the potential of frequency-domain techniques, and recent work also demonstrates the potential of deep learning for steganography in high dynamic range images (Huo et al., 2024). Frequency-domain methods yield greater security, but they often face a fragile trade-off between embedding capacity and perceptual quality.
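
The idea of hiding bits in transform coefficients can be illustrated with a minimal sketch. It applies a one-level 2D Haar DWT and forces the parity of quantized diagonal-detail coefficients to match the message bits (a simple form of quantization index modulation). The embedding rule, the quantization step `step`, and all function names are assumptions made for illustration; the cited methods use their own, more elaborate rules. The stego-image is kept in floating point, so rounding back to 8-bit pixels is not modeled here.

```python
import numpy as np

def haar_dwt2(x):
    """One-level orthonormal 2D Haar transform (image sides must be even)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    # Approximation band followed by three detail bands.
    return (a + b + c + d) / 2, (a - b + c - d) / 2, (a + b - c - d) / 2, (a - b - c + d) / 2

def haar_idwt2(ll, d1, d2, hh):
    """Inverse of haar_dwt2."""
    out = np.empty((2 * ll.shape[0], 2 * ll.shape[1]))
    out[0::2, 0::2] = (ll + d1 + d2 + hh) / 2
    out[0::2, 1::2] = (ll - d1 + d2 - hh) / 2
    out[1::2, 0::2] = (ll + d1 - d2 - hh) / 2
    out[1::2, 1::2] = (ll - d1 - d2 + hh) / 2
    return out

def embed_dwt(cover, bits, step=4.0):
    """Hide bits in the parity of quantized diagonal-detail coefficients."""
    ll, d1, d2, hh = haar_dwt2(cover.astype(np.float64))
    coeffs = hh.flatten()
    if bits.size > coeffs.size:
        raise ValueError("message is longer than the subband capacity")
    q = np.round(coeffs[: bits.size] / step).astype(np.int64)
    q = q + ((q % 2) != bits).astype(np.int64)  # bump by one if the parity is wrong
    coeffs[: bits.size] = q * step
    return haar_idwt2(ll, d1, d2, coeffs.reshape(hh.shape))

def extract_dwt(stego, n_bits, step=4.0):
    """Recover the bits from the coefficient parities."""
    hh = haar_dwt2(stego.astype(np.float64))[3]
    q = np.round(hh.flatten()[:n_bits] / step).astype(np.int64)
    return (q % 2).astype(np.uint8)

# Usage on random data (illustrative only).
rng = np.random.default_rng(1)
cover = rng.integers(0, 256, size=(8, 8)).astype(np.uint8)
bits = rng.integers(0, 2, size=10).astype(np.uint8)
stego = embed_dwt(cover, bits)
recovered = extract_dwt(stego, bits.size)
```

A larger `step` makes the hidden bits more tolerant to small perturbations but changes the pixels more, which is a small-scale view of the capacity and quality trade-off noted above.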

The emergence of machine learning algorithms has provided new opportunities for developing more secure and durable steganographic systems (Liu et al., 2021; Zhang et al., 2017). Lin et al. (2024) explored hierarchical iterative decoding enhancement techniques in multiview 3D parameter regression, showcasing the versatility of deep learning approaches across different domains. Lee and Kang (2024) explored the potential of vision transformers (ViTs) in image inpainting, demonstrating the versatility of transformer architectures in image processing tasks. The RDN, for instance, is a popular deep architecture that has demonstrated strong image restoration and enhancement capabilities (He et al., 2016; Zhang et al., 2017), which makes it a good candidate for steganography as well. Applying RDN to steganography can leverage its ability to learn sophisticated image features and to embed messages in a manner that is imperceptible to both human observers and machine learning-based detection algorithms (He et al., 2016; Zhang et al., 2019a). This is achieved by exploiting the hierarchical structure of RDN, which includes dense blocks and residual connections (He et al., 2016). The dense blocks ensure efficient feature reuse, so that image fidelity is preserved even after messages are embedded (Zhang et al., 2017).

Further, the residual connections of RDN strengthen the steganographic system: the network learns the residual information between layers and can thereby avoid the noise and distortion introduced by the embedding process (He et al., 2016; Zhang et al., 2019a). This resistance to image degradation helps the hidden messages to be recovered consistently under various attacks or corruptions. In summary, RDN combines a hierarchical structure with dense blocks and residual connections to preserve both image fidelity and robustness during embedding, which makes it a promising candidate for steganographic systems that resist detection by human observers as well as machine learning algorithms.

ViTs have also shown great potential in the field of steganography (Dosovitskiy et al., 2020; Liu et al., 2021). The multilayer attention-based structure of the ViT captures long-range dependencies in images, an important property for successful hiding (Dosovitskiy et al., 2020). Exploiting this property, ViT-AA achieves a high embedding capacity while maintaining solid security. Its focused attention mechanisms improve the handling of intricate textures and adapt better to the requirements of the steganographic task (Li et al., 2015; Zhang et al., 2019a). This makes ViT-AA a flexible tool for practitioners, offering strong concealment and robustness against detection in applications ranging from hidden communication to data protection. The combination of ViTs with steganography also supports progress in digital forensics, where the detection and extraction of hidden information is an important element of the investigation process. Further improvements to ViTs are likely to open up additional techniques and applications.

Another deep learning framework that has been adapted to steganography is the PGN (Zhang et al., 2017). Through systematic, stage-wise refinement, PGN conceals confidential information in the layers of an image while improving quality and security. Such a method is valuable when an adequate balance between embedding capacity and image fidelity must be maintained (He et al., 2016). PGN covertly embeds encrypted information by distributing it across different image layers so as to avoid visible distortions in the visual content. The incremental refinement process also improves the visual quality of the output stego-image considerably, so that it is visually indistinguishable from the cover. This ability to balance hiding capacity against visual quality makes PGN well suited to applications where image fidelity must be preserved, such as medical imaging or digital forensics.

The dual-stream architecture (DSA) is an approach based on two parallel processing paths, one for the cover image and the other for the secret message (Zhang et al., 2019a). Combining the two streams can yield stego-images of high quality with enhanced security. DSA improves the structural and the textural content of an image at the same time, which makes it faster than approaches that address only one of the two (Zhang et al., 2019a). The parallel streams thus give a balanced output that maintains image quality while protecting the hidden data, with performance that can surpass conventional techniques.

Recent investigations have also explored the fusion of wavelet transforms with deep learning methods for steganography (Li et al., 2015). The WHN merges the frequency-domain analysis capabilities of wavelet decomposition with the feature extraction power of deep neural networks, resulting in increased embedding capacity and security (Karras et al., 2021). By altering the wavelet coefficients according to content-adaptive scaling factors, WHN can hide the confidential information while maintaining the visual quality of the stego-image (Zhang et al., 2019a). The deep neural networks enable better feature extraction from the wavelet coefficients, leading to improved accuracy and resilience, and adjusting the scaling factors to the image content integrates the secret data into the stego-image in a way that is hard for unauthorized parties to detect. WHN therefore offers a robust option for secure steganography in fields such as cybersecurity, digital forensics, and secure image transmission, and it is a promising path for future research and development.

Further models can be considered for image steganography. One is the mutual attention transformer (MAT), proposed by Zhang and Tian (2023), which advanced anomaly detection in images through transformer-based feature fusion. Its architecture is based on parallel attention streams and contains a MATS module for the mutual selection of tokens, which reduces noise and enhances local feature extraction. MAT also achieves superior performance compared with other reconstruction-based methods, ranking at the top of the leaderboard with gains of +3.1% in detection and +1.0% in localization. The architecture gave good performance metrics, PSNR $= 43.3 \pm 0.3$ dB and SSIM $= 0.994$, with good resistance to image manipulation, including JPEG compression (0.94) and noise (0.92), but it demands heavy computational resources: a processing time of 19.1 ms and 10.4 GB of memory.

The efficient attention pyramid transformer (EAPT), designed by Lin et al. (2023), rethinks the design of a ViT through three main innovations: deformable attention, which allows adaptive, position-specific feature learning; an encode-decode communication (En-DeC) module for global information exchange; and a multi-dimensional continuous mixture descriptor (MCMD) for efficient position encoding. The architecture reaches high performance on several vision tasks (82.9% top-1 accuracy on ImageNet, 48.9 AP on COCO, 47.7 mIoU on ADE20K) at low computational cost (16.5 ms processing time, 9.8 GB memory). Because EAPT can handle sequences of any dimension and length, and combines efficient parameter usage with strong performance, it is well suited to practical applications that require both performance and efficiency.

The various deep learning architectures, each with its own strengths and limitations, hold considerable promise for image steganography (Dosovitskiy et al., 2020; Liu et al., 2021). Applied effectively, they could enable the design and deployment of more resilient and advanced information-concealment systems that meet the increasing need for safe channels of digital communication. A chronological summary of the literature in image steganography is given in Table 1.

Table 1.
Chronological Summary of Literature in Image Steganography.

Author(s) & Year	Method	Dataset	Key Contribution
Fridrich (2009)	HUGO	BOSSbase	Histogram-preserving embedding using high-dimensional image models
Kumar and Kumar (2010)	DWT-based	USC-SIPI (Lena, Cameraman, Barbara, House)	Evaluates DWT-based steganography on grayscale images using PSNR and imperceptibility for secure image hiding.
Fridrich and Kodovsky (2012)	SRM + Ensemble	BOSSbase	Rich models with ensemble classifiers for robust steganalysis
Baluja (2017)	Deep Steganography (Autoencoder CNN)	ImageNet	First deep autoencoder approach for image hiding
Zhu et al. (2018)	HiDDeN	COCO	End-to-end CNN framework with adversarial training for robust steganography
Wu et al. (2018)	StegNet (U-Net)	ImageNet	Deep CNN-based high-capacity image steganography with robust decoding
Hu et al. (2018)	Coverless DCGAN	Celebrities and Food101	Coverless steganography using generative adversarial networks
Zhang et al. (2019b)	SteganoGAN	COCO	GAN-based framework achieving high payload capacity and imperceptibility
Wengrowski and Dana (2019)	Optical CNN	Custom	Physical-world projector-camera steganography using deep learning
Abdulla et al. (2020)	Fibonacci Bit-Plane Mapping	DICOM, SIPI	Message size reduction with Fibonacci-based embedding for enhanced stego quality
Cui et al. (2021)	MIAIS	Custom	Multitask identity-aware steganography via minimax optimization
Kumar et al. (2022)	Encoder-Decoder-based Image-to-Image Steganography	CIFAR-100	CNN-based encoder-decoder for image hiding, improving imperceptibility without manual feature design.
Alobaidi and Mikhael (2023)	Wavelet-Based Adaptive Insertion	BOSSbase	Adaptive wavelet-domain steganography using DWT and DCT coefficient analysis
Chen et al. (2023)	Iterative Neural Optimizer	BOSSbase	Manifold-aware optimization achieving low error rates without error-correcting codes
Bui et al. (2023)	RoSteALS	MIRFlickR	Robust steganography using autoencoder latent space with compact encoder and perfect secret recovery
Yu et al. (2023)	CRoSS	LoRA	Diffusion-based coverless steganography with enhanced controllability, robustness, and security
Lin et al. (2023)	EAPT	ImageNet, COCO	Efficient attention pyramid transformer with deformable attention for multi-scale feature learning
Ren and Wu (2024)	JoCS	CelebA, CHURCH, FFHQ,BEDROOM, CAT, HORSE	Joint coverless steganography using StyleGAN and VQGAN for robust dual-module hiding
Li and Wang (2024)	Image Vaccine	BOSSbase	Defensive mechanism rendering images immune to steganography via embedded vaccine data
Dong et al. (2024)	StegaINR4MIH	DIV2K	Multi-image hiding using implicit neural representation; evaluated using PSNR, SSIM, RMSE, MAE, StegExpose, and SiaStegNet
Chahine and Kim (2024)	Neural Cover Selection	ImageNet, CelebA-HQ, AFHQ-Dog	Cover image selection via latent space optimization using DDIM and GANs to minimize message recovery error
DiSalvo (2025)	LSB Augmentation	CIFAR-10	Uses LSB-based image embedding as data augmentation to improve training robustness
Sanjalawe et al. (2025)	Multi-layered DL Stego	COCO, Tiny ImageNet, CelebA	Huffman compression + LSB + deep encoder-decoder framework for imperceptible and robust steganography
Wang et al. (2025)	INN-Stego	ImageNet	Invertible neural network with GANs enabling reversible, high-accuracy image hiding without error correction

3. Deep Learning Framework for Image Steganography and General Deep-Learning Architectures

Deep learning methods have revolutionized image steganography by concealing information in digital images without compromising visual integrity (Ronneberger et al., 2015; Wang et al., 2019). They use CNNs and GANs in a dual-network framework (Karras et al., 2021). One network integrates the confidential message into the cover image, while the other extracts the hidden data (Ronneberger et al., 2015). The framework includes an encoding network and embedding layers to minimize visual distortions (Johnson et al., 2016; Karras et al., 2021). Attention mechanisms and adaptive loss functions enhance embedding locations and resilience against image manipulation (Johnson et al., 2016). Deep learning methods outperform statistical approaches in embedding capacity, visual quality, and steganalysis resistance (Kuznetsov et al., 2022).
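
The dual-network idea described above can be sketched in PyTorch as follows. This is a deliberately small illustration, not any of the architectures benchmarked in this study: the layer sizes, the class names `Encoder` and `Decoder`, the residual formulation of the encoder, the loss weight `beta`, and the omission of an adversarial (GAN) discriminator are all assumptions made for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Hides a secret image in a cover image of the same size."""
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, kernel_size=3, padding=1),
        )

    def forward(self, cover, secret):
        # Predict a residual from cover and secret, then add it to the cover.
        return cover + self.net(torch.cat([cover, secret], dim=1))

class Decoder(nn.Module):
    """Recovers the secret image from the stego-image alone."""
    def __init__(self, hidden: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, kernel_size=3, padding=1),
        )

    def forward(self, stego):
        return self.net(stego)

def joint_loss(cover, secret, stego, recovered, beta: float = 0.75):
    # Cover distortion plus weighted recovery error; beta is a hypothetical weight.
    return F.mse_loss(stego, cover) + beta * F.mse_loss(recovered, secret)

# One joint optimization step on random tensors (illustrative data only).
torch.manual_seed(0)
encoder, decoder = Encoder(), Decoder()
params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(params, lr=1e-3)
cover, secret = torch.rand(2, 3, 32, 32), torch.rand(2, 3, 32, 32)
stego = encoder(cover, secret)
loss = joint_loss(cover, secret, stego, decoder(stego))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Training both networks against a single loss is what lets the encoder learn embeddings that the decoder can actually invert, while the first loss term keeps the stego-image close to the cover.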

Figure 2 shows a high-level overview of the general deep-learning architecture highlighting the interconnections and information flow between different components. This visualization represents how various processing paths work together in the steganographic system to perform two key tasks: securely hiding information and maintaining image quality.

Figure 2.

High-level overview of general deep-learning architecture showing their interconnections and information flow.

The use of deep learning frameworks has brought about significant advancements in image steganography through the development of five cutting-edge architectures (Ronneberger et al., 2015; Wang et al., 2019): Residual Dense Network (RDN), ViT-AA, PGN, DSA, and WHN. Each of these architectures possesses distinct strengths: RDN is particularly effective in preserving features through dense blocks and residual connections, ViT-AA utilizes transformer-based processing with specialized attention mechanisms to handle complex textures, PGN implements stage-wise refinement for progressive image enhancement, DSA employs parallel processing paths optimized for structural and textural components, and WHN combines frequency domain analysis with deep neural networks to achieve optimal performance. Two further transformer-based architectures, MAT and EAPT, are described in Sections 3.6 and 3.7. The detailed analysis of each architecture is given below.

3.1. Residual Dense Network

RDN’s hierarchical architecture effectively extracts and fuses features at multiple scales (Zhang et al., 2018), enabling it to preserve image fidelity while embedding secret information in a robust and imperceptible manner.

RDN’s architecture comprises three interconnected modules that work in concert to achieve optimal steganographic performance (Kuznetsov et al., 2022). The primary information flow through these modules can be expressed as:

\begin{aligned} F_{out} = H (F_{in} + R (F_{in})) \end{aligned}

(3)

where $F_{in}$ and $F_{out}$ represent the input and output features, respectively, $H$ denotes the hierarchical feature extraction function, and $R$ represents the residual mapping (Kuznetsov et al., 2022; Zhang et al., 2018).

At each dense block, the feature transformation follows:

\begin{aligned} Y_{l} = σ (W_{l} \cdot [X_{0}, X_{1}, \dots, X_{l - 1}] + b_{l}) \end{aligned}

(4)

where $Y_{l}$ represents the output of the $l$-th layer, $σ$ is the activation function, $W_{l}$ and $b_{l}$ are learnable parameters, and $[X_{0}, X_{1}, \dots, X_{l - 1}]$ denotes the concatenation of all preceding feature maps (Kuznetsov et al., 2022; Li et al., 2015).
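The dense-layer transformation in Equation (4) can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: each feature map is treated as a flat vector, ReLU is assumed as the activation, and the helper names (`relu`, `dense_layer`) and toy weights are hypothetical rather than taken from the authors' implementation.

```python
def relu(v):
    """Element-wise ReLU, assumed here as the activation sigma."""
    return [max(0.0, x) for x in v]


def dense_layer(prev_features, weight, bias):
    """Compute Y_l = sigma(W_l . [X_0, ..., X_{l-1}] + b_l).

    prev_features: list of feature vectors X_0 .. X_{l-1}
    weight: matrix with one row per output unit
    bias: one value per output unit
    """
    # Concatenate all preceding feature vectors into one long vector.
    concat = [x for feat in prev_features for x in feat]
    pre_activation = [
        sum(w * x for w, x in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]
    return relu(pre_activation)


# Toy example: two preceding feature vectors of length 2 each.
x0 = [1.0, -1.0]
x1 = [0.5, 2.0]
W = [[1.0, 0.0, 0.0, 1.0],     # picks x0[0] + x1[1]
     [-1.0, 0.0, -1.0, 0.0]]   # negative pre-activation, clipped by ReLU
b = [0.0, 0.0]
y = dense_layer([x0, x1], W, b)
```

Because every layer sees the concatenation of all earlier outputs, the width of `weight` grows with the layer index, which is the feature-reuse property the section attributes to RDN.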

Figure 3 gives a detailed visualization of the RDN architecture, showing the interconnections between dense blocks and the feature fusion mechanisms. The diagram illustrates how residual connections and dense feature reuse contribute to improved information hiding capabilities.

Figure 3.

Detailed structure of RDN showing dense blocks and feature fusion.

The quality-aware feature selection mechanism operates through:

\begin{aligned} Q (x) = \sum_{i = 1}^{n} w_{i} \cdot q_{i} (x) \end{aligned}

(5)

where $q_{i}$ represents different quality metrics and $w_{i}$ are learned weights that adapt based on local image content (Li et al., 2015; Zhang et al., 2018).

3.2. Vision Transformer With Adaptive Attention

The ViT-AA architecture revolutionizes the traditional approach to steganography by incorporating transformer-based processing with specialized attention mechanisms for information hiding (Kingma & Welling, 2013; Tan & Le, 2019). The fundamental transformation process begins with patch embedding:

\begin{aligned} E (x) = P (x) + P_{pos} + P_{steg} \end{aligned}

(6)

where $P (x)$ represents the patch embedding, $P_{pos}$ is the positional encoding, and $P_{steg}$ is a learned steganographic embedding (Kingma & Welling, 2013).

The core attention mechanism is modified for steganographic purposes (Gulrajani et al., 2017):

\begin{aligned} A (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}} + M_{s}) \cdot W_{a} V \end{aligned}

(7)

where $M_{s}$ is a steganographic attention mask and $W_{a}$ represents adaptive weights. The multi-head attention computation is expressed as:

\begin{aligned} MHA (X) & = [\mathrm{head}_{1}; \mathrm{head}_{2}; \dots; \mathrm{head}_{h}] W^{O} \end{aligned}

(8)

\begin{aligned} \mathrm{head}_{i} & = Attention (X W_{i}^{Q}, X W_{i}^{K}, X W_{i}^{V}) \end{aligned}

(9)
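A single attention head with the additive mask of Equation (7) might look like the sketch below. Several points are assumptions made only for illustration: the shape of $W_{a}$ is not specified in the text, so it is treated here as one scalar weight per token (a diagonal matrix), and the function names and toy inputs are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_attention(Q, K, V, mask, token_weights):
    """softmax(Q K^T / sqrt(d_k) + M_s) . (W_a V), with W_a assumed diagonal."""
    d_k = len(K[0])
    out = []
    for q, mask_row in zip(Q, mask):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) + m
            for k, m in zip(K, mask_row)
        ]
        attn = softmax(scores)
        # Each value row is scaled by its adaptive weight before mixing.
        out.append([
            sum(a * w * v[j] for a, w, v in zip(attn, token_weights, V))
            for j in range(len(V[0]))
        ])
    return out


Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
# A large negative mask entry suppresses attention to the second token.
mask = [[0.0, -1e9], [0.0, -1e9]]
out = masked_attention(Q, K, V, mask, token_weights=[1.0, 1.0])
```

With the second token masked out, both output rows reduce to the first value vector, which shows how an additive mask can steer where information is drawn from.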

The overall architecture of ViT-AA is illustrated in Figure 4: the input image is first segmented into patches, which are then processed by several parallel attention heads.

Figure 4.

ViT-AA architecture showing attention mechanisms and patch processing.

3.3. Progressive Generation Network

The PGN implements a stage-wise approach to steganography (Tang et al., 2019; Xu et al., 2016), where the stego image is generated through iterative refinement. The generation process at stage n is governed by:

\begin{aligned} G_{n} (x) = x + R_{n} (F_{n} (x)) \cdot M_{n} (x) \end{aligned}

(10)

where $R_{n}$ represents the refinement function, $F_{n}$ extracts features, and $M_{n}$ is a quality-guided mask (Tang et al., 2019).

The progressive enhancement follows:

\begin{aligned} y_{n} = G_{n} (y_{n - 1}) + Q_{n} (y_{n - 1}) \end{aligned}

(11)

where $Q_{n}$ represents the quality assessment module operating as:

\begin{aligned} Q (x) = \sum_{i = 1}^{k} w_{i} \cdot q_{i} (x) + λ \cdot S (x) \end{aligned}

(12)

Here, $S (x)$ denotes the security assessment metric.
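The stage-wise recursion of Equations (10) and (11) can be sketched as a simple loop. The stage functions below are toy placeholders chosen only so the behavior is easy to follow (a damped step toward a hypothetical target, an all-ones mask, and a zero correction term); they are assumptions and do not represent the trained modules used in the study.

```python
def refine_stage(x, extract, refine, mask):
    """One stage of Equation (10): G_n(x) = x + R_n(F_n(x)) * M_n(x)."""
    update = refine(extract(x))
    m = mask(x)
    return [xi + ui * mi for xi, ui, mi in zip(x, update, m)]


def progressive_generation(y0, stages, quality_terms):
    """Apply Equation (11), y_n = G_n(y_{n-1}) + Q_n(y_{n-1}), stage by stage."""
    y = y0
    for (extract, refine, mask), q in zip(stages, quality_terms):
        g = refine_stage(y, extract, refine, mask)
        y = [gi + qi for gi, qi in zip(g, q(y))]
    return y


target = [1.0, 0.0, 0.5]  # hypothetical ideal output, for illustration only
stage = (
    lambda x: [t - xi for t, xi in zip(target, x)],  # F_n: gap to the target
    lambda f: [0.5 * fi for fi in f],                # R_n: damped update
    lambda x: [1.0] * len(x),                        # M_n: all-ones mask
)
no_correction = lambda y: [0.0] * len(y)             # Q_n: zero correction

y = progressive_generation([0.0, 0.0, 0.0], [stage] * 3, [no_correction] * 3)
```

In this toy setting each stage halves the remaining gap, so three stages close 87.5% of it; the point is only to show how each stage consumes the previous stage's output.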

Figure 5 depicts the progressive generation process in PGN. It comprises several refinement stages, where the output of each stage is fed to the next after its security and aesthetic enhancement. The blocks are drawn in sequence to emphasize progressive processing, showing intermediates of progressively refined quality.

Figure 5.

Progressive generation process showing multiple refinement stages.

3.4. Dual-Stream Architecture

The DSA employs parallel processing paths optimized for structural and textural components (Balda et al., 2020). The structural stream is formulated as:

\begin{aligned} S (x) = \sum_{l = 1}^{L} w_{l} \cdot H_{l} (x) \cdot M_{l} (x) \end{aligned}

(13)

where $H_{l} (x)$ represents hierarchical features at level $l$, $M_{l} (x)$ denotes spatial masks, and $w_{l}$ are learnable weights.

The texture stream operates as:

\begin{aligned} T (x) = Φ (P (x)) \otimes D (x) \end{aligned}

(14)

where $Φ (P (x))$ is a perceptual representation of the pre-processed input $P (x)$, and $D (x)$ encodes detailed textures.

Fusion of both streams is defined as:

\begin{aligned} F (x) = α (x) \cdot S (x) + (1 - α (x)) \cdot T (x) \end{aligned}

(15)

with an adaptive gating function:

\begin{aligned} α (x) = σ (W_{f} \cdot T (x) + b_{f}) \end{aligned}

(16)

where $σ$ is the sigmoid activation, and $W_{f}$ and $b_{f}$ are learnable parameters.
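The adaptive gating of Equations (15) and (16) can be illustrated element-wise as follows. This sketch assumes a scalar gate weight and bias applied independently to each position, which is a simplification; the names `gated_fusion`, `w_f`, and `b_f` mirror the symbols in the text but the values are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(s, t, w_f, b_f):
    """F = alpha * S + (1 - alpha) * T, with alpha = sigmoid(w_f * T + b_f)."""
    fused = []
    for si, ti in zip(s, t):
        alpha = sigmoid(w_f * ti + b_f)  # gate computed from the texture stream
        fused.append(alpha * si + (1.0 - alpha) * ti)
    return fused


structural = [1.0, 1.0]
texture = [0.0, 0.0]
fused = gated_fusion(structural, texture, w_f=2.0, b_f=0.0)
# A strongly positive bias pushes the gate toward the structural stream.
mostly_structural = gated_fusion(structural, texture, w_f=2.0, b_f=20.0)
```

Since the gate lies in (0, 1), the fused value is always a convex combination of the two streams, so neither stream can be amplified beyond its own range.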

In Figure 6, the parallel processing approach of the DSA is presented. The diagram shows two distinct pathways: one dedicated to structural preservation and another to textural processing. These streams operate independently before converging at a fusion point, where their outputs are combined to create the final stego-image.

Figure 6.

Detailed schematic of the Dual-Stream Architecture (DSA), illustrating parallel pathways for feature processing. The upper stream processes structural components, while the lower stream focuses on texture-specific refinements. Both pathways operate independently and are fused via an adaptive gating mechanism to produce the final stego-image. This design enhances both visual fidelity and security by leveraging complementary spatial characteristics.

3.5. Wavelet-Based Hybrid Network

The WHN combines frequency domain analysis with deep learning features (Lin et al., 2023). The wavelet decomposition process can be expressed as:

\begin{aligned} W (x) = \{ c A_{j}, \{ c D_{j}^{h}, c D_{j}^{v}, c D_{j}^{d} \}_{j = 1}^{J} \} \end{aligned}

(17)

where $c A_{j}$ represents approximation coefficients and $c D_{j}$ represents detail coefficients at level $j$ (Lin et al., 2023).

The coefficient modification follows:

\begin{aligned} W^{'} (x) = T (W (x)) \cdot C (x) \end{aligned}

(18)

where $T$ represents the transformation function and $C (x)$ is a content-adaptive scaling factor.

Figure 7 shows the multi-level wavelet transform stages: the image is decomposed into frequency bands, the coefficients selected for information hiding are modified, and the image is then reconstructed. The flow also reflects frequency-domain image processing techniques for secure embedding.

Figure 7.

Wavelet-based Hybrid Network (WHN) pipeline demonstrating multi-level frequency decomposition using Discrete Wavelet Transform (DWT). The figure shows how the input image is decomposed into approximation and detail coefficients, followed by content-adaptive modification, and finally reconstructed via inverse wavelet transform. This architecture facilitates robust embedding in frequency components while maintaining high perceptual quality.

The final reconstruction process is governed by:

\begin{aligned} R (x) = W^{- 1} (W^{'} (x)) + E (x) \end{aligned}

(19)

where $E (x)$ represents the enhancement function operating in the spatial domain.
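The decompose, modify, and reconstruct pipeline of Equations (17) to (19) can be sketched with a one-level Haar transform. To stay short, this example works on a 1-D signal (so there is a single detail band rather than horizontal, vertical, and diagonal bands), takes $T$ as the identity, uses a constant scaling factor in place of the content-adaptive $C (x)$, and sets $E (x)$ to zero by default. All of these are simplifying assumptions, and the function names are hypothetical.

```python
import math


def haar_forward(x):
    """One-level 1-D Haar DWT: returns (approximation, detail) coefficients."""
    r = math.sqrt(2.0)
    cA = [(x[i] + x[i + 1]) / r for i in range(0, len(x), 2)]
    cD = [(x[i] - x[i + 1]) / r for i in range(0, len(x), 2)]
    return cA, cD


def haar_inverse(cA, cD):
    """Invert haar_forward, interleaving the reconstructed sample pairs."""
    r = math.sqrt(2.0)
    x = []
    for a, d in zip(cA, cD):
        x.extend([(a + d) / r, (a - d) / r])
    return x


def modify_and_reconstruct(x, scale, enhance=None):
    cA, cD = haar_forward(x)                      # W(x)
    cD_mod = [d * scale for d in cD]              # W'(x): scaled detail band
    rec = haar_inverse(cA, cD_mod)                # W^{-1}(W'(x))
    e = enhance(x) if enhance else [0.0] * len(x) # E(x), zero unless supplied
    return [ri + ei for ri, ei in zip(rec, e)]


signal = [4.0, 2.0, 5.0, 5.0]
unchanged = modify_and_reconstruct(signal, scale=1.0)
smoothed = modify_and_reconstruct(signal, scale=0.0)
```

Leaving the detail band untouched reproduces the input, while zeroing it replaces each sample pair by its mean, which illustrates why small changes to detail coefficients translate into small, localized spatial changes.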

3.6. Mutual Attention Transformer

The MAT introduces a novel approach to steganographic systems through bidirectional attention mechanisms and cross-modal feature interaction. Building upon the research by Zhang and Tian (2023), MAT demonstrates superior performance in handling complex texture patterns while maintaining robust security features. The architecture operates through parallel attention streams that mutually reinforce feature learning between the cover image and the secret message embedding process. This emphasis on both local and global feature preservation has led to significant improvements, demonstrating a 3.1% increase in detection capability and a 1.0% enhancement in localization capability compared to state-of-the-art reconstruction-based methods.

In performance evaluations, MAT achieves impressive metrics with a PSNR of 43.3 $\pm$ 0.3 dB and SSIM of 0.994, though these results come with increased computational demands requiring 19.1 ms processing time and 10.4 GB memory usage. The architecture demonstrates robust resistance against common image manipulations, showing high resilience scores for JPEG compression (0.94), noise handling (0.92), and rotation resistance (0.89). The quality metrics are reported alongside those of the other architectures in Table 6.

Statistical analysis reveals MAT’s effectiveness with a Chi-squared value of 0.141, KL Divergence of 0.018, entropy difference of 0.006, and a detection rate of 0.502, demonstrating its strong capability in maintaining statistical imperceptibility. MAT’s particular strength lies in its feature correlation capabilities through mutual attention mechanisms, making it especially effective for pattern-rich content. This characteristic makes it particularly suitable for applications requiring high fidelity in texture-rich environments, though considerations must be made for its increased computational requirements. The architecture represents a significant advancement in balancing security requirements with visual quality preservation, as evidenced by its performance metrics across various test scenarios (Zhang & Tian, 2023).

3.7. Efficient Attention Pyramid Transformer

The EAPT introduces an enhanced architecture for image processing tasks through three key innovative components that address common challenges in ViTs. The architecture implements a multi-stage design with each stage containing multiple transformer blocks, enabling effective processing at different feature scales (Lin et al., 2023).

Architecture Components:

Deformable Attention: This component introduces learnable offsets for each position in patches, defined as:

\begin{aligned} (L_{m}^{x}, L_{m}^{y}) = (l_{m}^{x}, l_{m}^{y}) + (o_{m}^{x}, o_{m}^{y}) \end{aligned}

(20)

where $(l_{m}^{x}, l_{m}^{y})$ represents the original position and $(o_{m}^{x}, o_{m}^{y})$ are the learned offsets. The attention field is constrained by $| o_{m}^{x} | \leq h w$ and $| o_{m}^{y} | \leq w w$, allowing flexible coverage of vision elements without increasing computational overhead.

En-DeC (Encode-Decode Communication) Module: Implements a global communication mechanism through:

For encoding:

\begin{aligned} V^{'} & = {Con}_{V} (V, c^{'}) \end{aligned}

(21)

\begin{aligned} V^{″} & = {Con}_{V} (V^{'}, c^{″}) \end{aligned}

(22)

\begin{aligned} V_{i, j}^{″} & = Q (\bar{W_{e n}} V_{(i \sim i^{'}, j \sim j^{'})}^{'}) \end{aligned}

(23)

Here, $V$ is the input feature map, $c^{'}$ and $c^{″}$ are convolution configuration parameters, $V^{'}$ and $V^{″}$ are the intermediate encoded features, $W_{en}$ denotes the encoding weights, $Q (\cdot)$ is the transformation function applied to extract query vectors, and $(i \sim i^{'}, j \sim j^{'})$ represents the local spatial neighborhood considered during the encoding operation.

For decoding:

\begin{aligned} V^{‴} & = {DeCon}_{V} (V^{″}, c^{'}) \end{aligned}

(24)

\begin{aligned} V_{i, j}^{‴} & = Q (\bar{W_{d e}} V_{(i \sim i^{'}, j \sim j^{'})}^{″}) \end{aligned}

(25)

Here, $V^{″}$ is the encoded representation passed to the decoder, $V^{‴}$ is the reconstructed output obtained via deconvolution, $W_{de}$ are the decoding weights, and $Q (\cdot)$ again represents the query function applied over the spatial window $(i \sim i^{'}, j \sim j^{'})$ to restore detailed spatial context.

Multi-dimensional Continuous Mixture Descriptor (MCMD): Provides position encoding through Gaussian descriptors:

\begin{aligned} G (V_{(i j, x)}^{x}) = a_{i j} e^{- (x - b_{i j}^{x})^{2} / 2 {c_{i j}^{x}}^{2}} \end{aligned}

(26)

\begin{aligned} G (V_{(i j, y)}^{y}) = a_{i j} e^{- (y - b_{i j}^{y})^{2} / 2 {c_{i j}^{y}}^{2}} \end{aligned}

(27)

Here, $a_{i j}$ is the amplitude coefficient, $b_{i j}^{x}$ and $b_{i j}^{y}$ are the means (centers) for the $x$ and $y$ axes, $c_{i j}^{x}$ and $c_{i j}^{y}$ are the respective standard deviations (spreads), $x$ and $y$ denote the spatial coordinates, and $G (\cdot)$ represents the Gaussian-based positional encoding function applied along each axis.
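A per-axis Gaussian descriptor of this kind can be evaluated as in the sketch below. It assumes the standard Gaussian form, in which the deviation from the center is squared in the exponent, with an amplitude, a center, and a spread as described above; the function name and the sample values are illustrative only.

```python
import math


def gaussian_descriptor(coord, amplitude, center, spread):
    """Evaluate a * exp(-(coord - b)^2 / (2 c^2)) along one axis."""
    return amplitude * math.exp(-((coord - center) ** 2) / (2.0 * spread ** 2))


# At the center the descriptor equals the amplitude.
peak = gaussian_descriptor(3.0, amplitude=2.0, center=3.0, spread=1.5)
# One spread away from the center it falls to amplitude * exp(-0.5).
one_sigma = gaussian_descriptor(4.5, amplitude=2.0, center=3.0, spread=1.5)
```

The descriptor decays smoothly and symmetrically around the center, which is what makes it usable as a continuous position encoding.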

3.8. Rationale for Model Selection

The selection of architectures in this study was guided by the goal of representing a diverse spectrum of modern steganographic design principles—beyond the instability often associated with GAN-based models and the limited capacity of traditional autoencoders. While GANs such as HiDDeN and StegNet have demonstrated effectiveness in visual fidelity, they tend to suffer from training instability, mode collapse, and lack of interpretability. Autoencoders, although simpler to train, frequently fall short in terms of scalability, payload flexibility, and robustness under statistical detection.

To address these limitations, we selected architectures based on four core perspectives: residual modeling, lightweight progressive generation, transformer-based global attention, and frequency-domain fusion. Each model contributes unique strengths–residual learning for enhanced feature reuse, progressive generation for real-time efficiency, attention mechanisms for adaptive spatial focus, and wavelet fusion for compression resilience.

This curated selection enables controlled and comparative analysis across a variety of design patterns, embedding strategies, and deployment profiles, while remaining practically viable without the overhead of adversarial training. These choices also reflect the current trajectory in steganography research, which favors architectures that balance embedding quality, security, and deployment efficiency.

4. Results and Discussions

The detailed comparison of architectural components among the seven methods in Table 2 provides critical insights on how each approach integrates several aspects of image steganography (Chen et al., 2020; Zhu et al., 2017). The feature extraction methods show significant diversity, ranging from RDN’s dense blocks to ViT-AA’s self-attention mechanism, and extending to MAT’s mutual-attention and EAPT’s pyramid structure, which points to critical differences in the processing and hiding of information (Zhu et al., 2017). The security mechanisms have significant variation, from RDN’s dense fusion approach to WHN’s coefficient-based methodology, and further evolving with MAT’s pattern-aware and EAPT’s scale-fusion approaches, which shows the variety of strategies used for securing the hidden information. The computational units column shows the evolution of processing architectures, from traditional convolution operations in RDN to modern transformer variants, with MAT introducing efficient transformers and EAPT implementing pyramid transformers for optimized processing (Lin et al., 2023; Zhang & Tian, 2023). Furthermore, quality control mechanisms range from channel attention to hierarchical approaches, demonstrating how each architecture uniquely addresses the challenge of maintaining image fidelity while embedding information. This detailed analysis will serve as a good reference in understanding the basic technical differences between these architectures and their specific capabilities in handling different aspects of steganographic processing, particularly highlighting how newer architectures like MAT and EAPT build upon and enhance previous approaches.

Table 2.
Comparative Analysis of Architectural Components Across all Seven Approaches.

Models/Components	Feature Extraction	Information Hiding	Quality Control	Security Mechanism	Computational Units
RDN	Dense Blocks	Residual	Channel Attention	Dense fusion	Convolution
ViT-AA	Self-Attention	Attention-based	Multi-head	Position-aware	Transformer
PGN	Progressive	Iterative	Progressive	Stage-wise	Generator
DSA	Dual-Path	Parallel	Dual-stream	Split processing	Dual CNN
WHN	Wavelet	Frequency	Multi-scale	Coefficient	Hybrid
MAT	Mutual-Attention	Cross-attention	Bilateral-head	Pattern-aware	Efficient Transformer
EAPT	Pyramid Structure	Multi-scale	Hierarchical	Scale-fusion	Pyramid Transformer

Various image formats differ in their suitability for steganographic embedding depending on compression type, fidelity, and pixel integrity. As shown in Table 3, formats like BMP, PNG, and PGM provide lossless and high-capacity support, with PGM being the preferred grayscale format for benchmarking in academic datasets such as BOSSbase. In contrast, JPEG is suited for frequency-domain methods but suffers from lossy compression artifacts.

Table 3.

Comparison of Image Data Types and their Suitability for Steganographic Embedding.

Format	Compression	Lossless	Capacity	Fidelity	Suitability
PGM	None	Yes	High	Excellent (grayscale only)	Widely used in academic steganography (e.g., BOSSbase)
BMP	None	Yes	High	Excellent	Ideal for spatial-domain (LSB) methods
PNG	Deflate	Yes	High	Excellent	Preferred for pixel-level control and robustness
JPEG	DCT-based	No	Moderate	Good (with artifacts)	Suitable for frequency-domain (DCT) embedding
TIFF	Optional	Conditional	High	Excellent	Used in hybrid or high-precision setups
WEBP	Mixed	Optional	Moderate	Good	Emerging format; limited stego support
HEIF	Advanced	No	Moderate	High	Efficient but complex for steganographic integration

4.1. Dataset Preparation

The training dataset of 100,000 high-resolution images from DIV2K was specifically chosen for its variety in texture complexity and color distributions (Chahar et al., 2025; Chen et al., 2020). These images underwent preprocessing including:

Resolution standardization to 512 $\times$ 512 pixels

Color space normalization

Contrast enhancement using adaptive histogram equalization

Random augmentation including: Rotation, Scaling, Horizontal flipping, and Color jittering.

COCO dataset provided diverse images, while ImageNet ensured robustness (Zhu et al., 2017). Secret payloads included binary sequences, grayscale, and color images (Qian et al., 2015).

Table 4 provides a structured and unified summary of the experimental framework adopted in this study, encompassing the definition of variables, dataset characteristics, preprocessing strategies, and evaluation instruments. The section on variables specifies the independent variable, represented by the choice of seven deep learning architectures under investigation, the dependent variables, which include imperceptibility, embedding capacity, robustness, and detection accuracy, and the experimental entities comprising cover images, secret payloads, and the resulting stego images. The dataset component enumerates the benchmark sources employed for training and testing, including DIV2K, COCO, ImageNet, MIT Places, and USC-SIPI, with explicit reference to their resolution, image counts, and functional role within the experimental design. Collectively, these datasets contribute more than 80,000 images, ensuring sufficient variability and representativeness for rigorous benchmarking. Preprocessing steps applied across all datasets, such as resizing to standardized dimensions, normalization, histogram equalization, and augmentation, are documented to ensure methodological transparency and reproducibility. The evaluation instruments are then delineated, covering objective quality metrics (PSNR and SSIM), robustness assessments against common distortions (JPEG compression, Gaussian noise, random cropping), and security evaluations using CNN-based steganalysis. Efficiency is further incorporated as a performance dimension, measured in terms of inference time per image.

Table 4.
Summary of Experimental Variables, Datasets, Preprocessing, and Evaluation Instruments.

Variables
Item	Aspect	Value	Role / Usage
Independent Variable	Models	RDN, ViT-AA, PGN, DSA, WHN, MAT, EAPT	Model selection
Dependent Variables	Measures	PSNR, SSIM, Capacity, Robustness, Detection accuracy	Performance evaluation
Experimental Entities	Entities	Cover images, Secret payloads, Stego images	Embedding process
Data Collection (Datasets and Preprocessing)
Dataset Type	Name	Description / Resolution / Count	Usage
Training	DIV2K	Benchmark images, 2K, 1,000	Primary training
Training	COCO	Diverse natural images, Various, 50,000	Augmentation
Training	ImageNet	Labeled images, 256 $\times$ 256, 25,000	Validation
Testing	MIT Places	Scene images, 256 $\times$ 256, 5,000	Primary testing
Testing	USC-SIPI	Misc. quality set, Various, 2,000	Quality testing
Preprocessing	Steps	Resizing, normalization, histogram equalization, augmentation	Applied to all datasets
Measuring Instruments
Instrument	Aspect	Value	Role / Usage
Imperceptibility Metric	Definition	PSNR, SSIM	Stego quality evaluation
Robustness Tests	Distortions	JPEG compression, Gaussian noise, Random cropping	Distortion resistance
Security Evaluation	Tool / Output	CNN-based steganalysis, Detection accuracy	Attack resistance
Efficiency	Measure	Inference time per image (ms)	Computational cost

Training Protocol

The training process was implemented using the following protocol:

\begin{aligned} T (θ) = {\arg \min}_{θ} [L_{rec} (θ) + λ_{a} L_{sec} (θ) + λ_{b} L_{adv} (θ)] \end{aligned}

(28)

where $L_{rec}$ represents the reconstruction loss, $L_{sec}$ denotes the security loss, and $L_{adv}$ is the adversarial loss (Hayes & Danezis, 2017; Qian et al., 2015). The weighting parameters $λ_{a}$ and $λ_{b}$ were empirically set to 0.4 and 0.1, respectively.
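As a small worked example of the objective in Equation (28), the snippet below combines three loss values using the reported weights of 0.4 and 0.1. The individual loss values are made-up numbers for illustration, and `total_loss` is a hypothetical helper, not part of the authors' code.

```python
LAMBDA_A = 0.4  # weight on the security loss, as reported in the text
LAMBDA_B = 0.1  # weight on the adversarial loss, as reported in the text


def total_loss(l_rec, l_sec, l_adv, lambda_a=LAMBDA_A, lambda_b=LAMBDA_B):
    """L_rec + lambda_a * L_sec + lambda_b * L_adv for one training step."""
    return l_rec + lambda_a * l_sec + lambda_b * l_adv


# Illustrative values only: 0.20 + 0.4 * 0.50 + 0.1 * 1.00
loss = total_loss(l_rec=0.20, l_sec=0.50, l_adv=1.00)
```

With these weights the reconstruction term dominates unless the security or adversarial losses are several times larger, which keeps visual fidelity as the primary training signal.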

The training-epochs graph in Figure 8 visualizes the convergence behavior of the seven models during training, plotting epoch numbers (0–200) against loss values (0.0–1.0). The graph employs distinct colors to track each model’s performance: RDN (blue), ViT-AA (red), PGN (green), DSA (yellow), WHN (purple), MAT (orange), and EAPT (cyan), with quadratic splines connecting data points to illustrate training progression. All models demonstrate a general downward trend in loss values as training progresses, with RDN showing rapid initial convergence, ViT-AA displaying a more gradual descent, and the remaining models exhibiting varying degrees of loss reduction efficiency. The WHN model achieves the best final result, with the lowest final loss and a corresponding PSNR of 43.5 dB, closely followed by MAT at 43.2 dB and EAPT at 43.3 dB. MAT shows particularly stable convergence in the middle phases, while EAPT demonstrates efficient early-stage learning with consistent improvement throughout the training process. Annotations and a detailed legend mark key training phases and final performance metrics.

Figure 8.

Training convergence curves for all architectures over 200 epochs. The y-axis represents normalized loss, and the x-axis shows training epochs. Each curve illustrates the stability and speed of convergence for the corresponding model.

To evaluate the effectiveness of these architectures in the context of existing literature, they are benchmarked against several widely cited state-of-the-art steganographic models, including HiDDeN (Zhu et al., 2018), StegNet (Wu et al., 2018), and StegoFormer (Xie et al., 2021). Table 5 presents a comparative analysis across three key metrics: PSNR, SSIM, and inference time. As shown, our architectures, particularly WHN, MAT, and ViT-AA, achieve higher imperceptibility scores while maintaining competitive inference speeds. PGN and DSA offer superior runtime efficiency, making them well-suited for low-latency or embedded applications. This benchmarking highlights the robustness and practicality of our proposed models relative to established methods.

Table 5.

Benchmark Comparison of Different Architectures With Prior State-of-the-art Methods.

Model	PSNR (dB)	SSIM	Inference Time (ms)
HiDDeN	38.2	0.961	85
StegNet	39.6	0.968	102
StegoFormer	41.9	0.985	113
PGN	41.8	0.982	43
DSA	42.1	0.985	57
RDN	42.6	0.990	66
ViT-AA	43.1	0.993	91
WHN	43.5	0.995	85
MAT	43.2	0.994	93
EAPT	42.8	0.992	79

All models were evaluated on an NVIDIA RTX A4000 GPU with 32 GB VRAM. Lightweight architectures such as PGN and DSA demonstrated low inference latency (under 60 ms per image) and moderate memory usage, making them suitable for mobile or embedded deployment with appropriate quantization. In contrast, heavier models such as the transformer-based ViT-AA and MAT and the hybrid WHN required higher computational resources, with inference times exceeding 80 ms and VRAM usage above 9 GB, suggesting their suitability for server-side or cloud-based applications. EAPT provided a balanced profile, achieving high visual quality with moderate runtime and memory demands, making it a practical option for scalable deployments. These findings highlight the importance of model selection based on available hardware and deployment constraints.

4.2. Performance Measures

4.2.1. Image Quality Metrics

We evaluated the visual quality using multiple metrics:

The peak signal-to-noise ratio (PSNR) calculation incorporates a logarithmic scaling factor to account for human visual perception:

\begin{aligned} PSNR = 10 \times \log_{10} (\frac{{MAX}_{I}^{2}}{MSE}) \end{aligned}

(29)

where

\begin{aligned} MSE = (\frac{1}{m n}) \times \sum_{i = 0}^{m - 1} \sum_{j = 0}^{n - 1} [I (i, j) - K (i, j)]^{2} \end{aligned}

(30)
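Equations (29) and (30) translate directly into code. The sketch below assumes 8-bit images (so $MAX_{I} = 255$) stored as nested lists, with $I$ the cover image and $K$ the stego image; the tiny 2 × 2 example is illustrative only.

```python
import math


def mse(I, K):
    """Mean squared error between two equally sized 2-D images."""
    m, n = len(I), len(I[0])
    return sum(
        (I[i][j] - K[i][j]) ** 2 for i in range(m) for j in range(n)
    ) / (m * n)


def psnr(I, K, max_value=255.0):
    """PSNR in dB; returns infinity when the images are identical."""
    err = mse(I, K)
    if err == 0:
        return float("inf")
    return 10.0 * math.log10(max_value ** 2 / err)


cover = [[100, 100], [100, 100]]
stego = [[101, 99], [100, 100]]   # two pixels changed by one gray level
value = psnr(cover, stego)        # MSE = 0.5, roughly 51.1 dB
```

Because PSNR is logarithmic in the MSE, halving the mean squared error raises PSNR by about 3 dB, which helps put the sub-decibel differences between the architectures into perspective.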

The structural similarity index (SSIM) evaluates image quality through:

\begin{aligned} SSIM = [l (x, y)]^{α} \cdot [c (x, y)]^{β} \cdot [s (x, y)]^{γ} \end{aligned}

(31)
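The three SSIM terms in Equation (31) can be computed as in the sketch below, which uses the commonly cited definitions of the luminance, contrast, and structure components with stabilizing constants $C_{1} = (0.01 L)^{2}$, $C_{2} = (0.03 L)^{2}$, and $C_{3} = C_{2} / 2$. It is a simplification in one important respect: statistics are taken over the whole signal at once, whereas practical SSIM implementations average the index over local windows. The function name and default exponents are assumptions.

```python
import math


def ssim_global(x, y, dynamic_range=255.0, alpha=1.0, beta=1.0, gamma=1.0):
    """Single-window SSIM over two flat pixel lists of equal length."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((v - mu_x) ** 2 for v in x) / n
    var_y = sum((v - mu_y) ** 2 for v in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)

    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    c3 = c2 / 2.0

    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    contrast = (2 * sd_x * sd_y + c2) / (var_x + var_y + c2)
    structure = (cov + c3) / (sd_x * sd_y + c3)
    return (luminance ** alpha) * (contrast ** beta) * (structure ** gamma)


a = [10.0, 20.0, 30.0, 40.0]
b = [12.0, 18.0, 33.0, 39.0]
identical = ssim_global(a, a)
perturbed = ssim_global(a, b)
```

Identical inputs give a score of 1, and any perturbation lowers it, so values such as 0.995 indicate that the stego image is structurally almost indistinguishable from the cover.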

4.2.2. Security Analysis

The security evaluation was conducted through:

The steganalysis resistance metric $R_{s}$ is computed using:

\begin{aligned} R_{s} = 1 - \frac{| D (I_{s}) - D (I_{c}) |}{max (D)} \end{aligned}

(32)

where $D (\cdot)$ represents the steganalyzer’s decision function, and $I_{s}$ and $I_{c}$ are stego and cover images, respectively.
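As a brief numeric illustration of Equation (32), the snippet below assumes the steganalyzer outputs a score in [0, 1], so that the maximum of $D$ is 1. The scores used are invented for the example, and `steganalysis_resistance` is a hypothetical helper.

```python
def steganalysis_resistance(score_stego, score_cover, max_score=1.0):
    """R_s = 1 - |D(I_s) - D(I_c)| / max(D)."""
    return 1.0 - abs(score_stego - score_cover) / max_score


# The detector scores the stego image only slightly higher than the cover.
r_close = steganalysis_resistance(0.52, 0.50)
# The detector separates the two images completely.
r_exposed = steganalysis_resistance(1.0, 0.0)
```

Values of $R_{s}$ near 1 mean the detector responds almost identically to cover and stego images, while values near 0 mean the embedding is easily exposed.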

4.2.3. Payload Capacity Analysis

The capacity-distortion relationship follows:

\begin{aligned} R (D) = \{ R : d (I_{c}, I_{s}) \leq D \} \end{aligned}

(33)

where $R$ represents the payload rate and $d (\cdot, \cdot)$ is the distortion metric.

4.3. Performance Evaluation

4.3.1. Image Quality Metrics

The quality metrics comparison in Table 6 provides an extensive analysis of performance across various critical dimensions for each architectural approach. WHN demonstrates superior performance with the highest PSNR of 43.5 $\pm$ 0.2 dB and SSIM of 0.995, indicating exceptional image quality preservation. The LPIPS scores reveal perceptual similarity metrics, with WHN achieving the best score of 0.017, suggesting minimal visual distortion in the processed images. MAT follows closely with strong performance metrics (PSNR 43.3 $\pm$ 0.3 dB, SSIM 0.994) and matches ViT-AA’s perceptual quality with an LPIPS of 0.018, though it requires the highest computational resources at 19.1 ms and 10.4 GB memory. EAPT achieves balanced performance (PSNR 42.9 $\pm$ 0.3 dB, SSIM 0.993) with moderate resource requirements (16.5 ms, 9.8 GB), demonstrating the efficiency of its pyramid structure. Measurement of processing time shows PGN leading in computational efficiency at 12.4 ms but with slightly lower quality metrics. Memory requirements differ drastically between architectures, and MAT and ViT-AA have the highest memory usage at 10.4 GB and 10.2 GB, respectively, whereas PGN has the lowest memory usage at 7.8 GB. All these measurements give a deep insight into the practical considerations of implementation and performance trade-offs for each architecture.

Table 6.
Comprehensive Quality Metrics Comparison of Different Models.

Model	PSNR(dB)	SSIM	LPIPS	Time(ms)	Memory(GB)
RDN	42.3 $\pm$ 0.4	0.992	0.021	15.3	8.4
ViT-AA	43.1 $\pm$ 0.3	0.994	0.018	18.7	10.2
PGN	41.8 $\pm$ 0.5	0.991	0.023	12.4	7.8
DSA	42.7 $\pm$ 0.3	0.993	0.019	16.8	9.1
WHN	43.5 $\pm$ 0.2	0.995	0.017	17.2	9.6
MAT	43.3 $\pm$ 0.3	0.994	0.018	19.1	10.4
EAPT	42.9 $\pm$ 0.3	0.993	0.019	16.5	9.8

4.3.2. Security Analysis

Steganalysis Resistance:

The visualization is a receiver operating characteristic (ROC) curve analysis that compares the seven architectures in the binary classification task of steganalysis detection.

The ROC curve in Figure 9 captures the discriminative ability of the classifier by plotting the true positive rate against the false positive rate across classification thresholds. The dashed diagonal line represents random guessing, and a curve lying above it indicates detection better than chance; for a steganographic model, a curve closer to the diagonal therefore means stronger resistance to detection.

Figure 9.

Receiver Operating Characteristic (ROC) curves for steganalysis detection using a CNN-based classifier. Each curve represents a different steganographic model evaluated against a steganalyzer trained on datasets. The x-axis denotes the false positive rate, while the y-axis shows the true positive rate.

The area under the curve (AUC) summarizes each curve in a single number and reveals small differences between the architectures. Because the AUC measures how well the steganalyzer detects a model, lower values are better here. PGN has the highest AUC at 0.518, making it the most detectable, followed by RDN at 0.512. The other architectures, including DSA at 0.507, ViT-AA at 0.503, and WHN at 0.501, cluster closely near the chance level of 0.5.
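To make the AUC values above concrete, the following minimal sketch computes a rank-based AUC from steganalyzer scores. It is an illustration only: the function name `roc_auc` and the synthetic detector scores are assumptions, not the evaluation code used in this study.

```python
import numpy as np

def roc_auc(scores_cover, scores_stego):
    """Rank-based AUC: probability that a random stego image scores
    higher than a random cover image (ties count as one half)."""
    cover = np.asarray(scores_cover, dtype=float)
    stego = np.asarray(scores_stego, dtype=float)
    # Compare every stego score with every cover score.
    diff = stego[:, None] - cover[None, :]
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return wins / (stego.size * cover.size)

# Hypothetical detector outputs, for illustration only.
rng = np.random.default_rng(0)
cover_scores = rng.normal(0.0, 1.0, size=2000)
stego_scores = rng.normal(0.0, 1.0, size=2000)  # same distribution as cover
print(round(roc_auc(cover_scores, stego_scores), 3))
```

When the stego and cover scores follow the same distribution, as in this toy case, the AUC stays close to 0.5, which is the regime the reported values of 0.501 to 0.518 describe.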

4.3.3. Payload Capacity Analysis

Figure 10 shows the capacity-distortion performance of the seven approaches: RDN, ViT-AA, PGN, DSA, WHN, MAT, and EAPT. Payload capacity in bits per pixel (bpp) is on the horizontal axis and peak signal-to-noise ratio (PSNR, in decibels) is on the vertical axis, relating two of the most important performance measures for steganographic methods.

Figure 10.

Capacity-distortion trade-off curves for all proposed architectures. The x-axis represents embedding capacity in bits per pixel (bpp), and the y-axis shows the corresponding PSNR (dB), indicating image distortion.

The capacity-distortion curves, drawn as colored lines, illustrate the performance profile of each technique as payload capacity varies. Each approach follows a different nonlinear, asymptotic curve relating payload capacity to image quality, which makes the performance trade-offs visible. The operating points, shown as white-filled circles with black borders at the midpoint of each curve, give a relative snapshot of each technique’s performance at a standard reference point.

4.4. Statistical Analysis

Statistical imperceptibility: Chi-squared analysis of image statistics:

\begin{aligned} χ^{2} = \sum_{i = 1}^{n} \frac{(O_{i} - E_{i})^{2}}{E_{i}} \end{aligned}

(34)

where $O_{i}$ and $E_{i}$ denote the observed and expected frequencies in bin $i$, and $n$ is the number of bins.

Table 7 presents a statistical imperceptibility analysis, summarizing how well each model avoids detection across several statistical indicators. The chi-squared values denote the statistical deviation from the characteristic properties of the original image; WHN produces the smallest value at 0.139, followed closely by MAT at 0.141, indicating improved steganographic security. The Kullback–Leibler divergence measures the information-theoretic difference between cover and stego images; WHN and MAT perform equally well at 0.018, with EAPT competitive at 0.020. The entropy difference shows that image information content changes little for any model; WHN is lowest at 0.005, with MAT and EAPT at comparably low values of 0.006 and 0.007, respectively. The detection rate is essential for measuring the security of a steganographic technique. WHN has the lowest rate at 0.501, with MAT nearly equivalent at 0.502, both very close to random chance. EAPT maintains strong security with a detection rate of 0.505, demonstrating the effectiveness of its pyramid structure in preserving image statistics. Together, these statistical measurements describe how well each architecture retains the statistics of the images and hides the information, with the newer MAT and EAPT architectures complementing the existing approaches through their specialized attention mechanisms and multi-scale processing capabilities.
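The three histogram-based indicators discussed above can be sketched as follows. This is a minimal illustration under stated assumptions: 8-bit grayscale inputs, 256 intensity bins, cover counts used as the expected frequencies in Equation (34), and no normalization. The paper does not state how the values in Table 7 were normalized, so the numbers produced here are not directly comparable to it; the helper names are hypothetical.

```python
import numpy as np

def histogram_probs(image, bins=256):
    """Normalized intensity histogram of an 8-bit image, plus raw counts."""
    counts = np.bincount(np.asarray(image, dtype=np.uint8).ravel(), minlength=bins)
    return counts / counts.sum(), counts

def statistical_gap(cover, stego, eps=1e-12):
    """Return (chi-squared, KL divergence, entropy difference) for a cover/stego pair."""
    p, cover_counts = histogram_probs(cover)
    q, stego_counts = histogram_probs(stego)
    # Chi-squared: stego counts are observed, cover counts are expected (assumption);
    # bins that are empty in the cover are skipped to avoid division by zero.
    mask = cover_counts > 0
    chi2 = np.sum((stego_counts[mask] - cover_counts[mask]) ** 2 / cover_counts[mask])
    # KL divergence D(P_cover || P_stego); eps guards against log(0).
    kl = np.sum(p * np.log((p + eps) / (q + eps)))
    # Absolute difference of Shannon entropies, in bits.
    entropy = lambda r: -np.sum(r[r > 0] * np.log2(r[r > 0]))
    return chi2, kl, abs(entropy(p) - entropy(q))

# Toy data: a random cover and a stego copy with about 10% of LSBs flipped.
rng = np.random.default_rng(0)
cover = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
stego = cover ^ (rng.random(cover.shape) < 0.1).astype(np.uint8)
print(statistical_gap(cover, cover))  # identical images give zero gaps
print(statistical_gap(cover, stego))
```

Smaller values on all three indicators mean the stego histogram stays closer to the cover histogram, which is the sense in which Table 7 ranks the models.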

Table 7.

Statistical Analysis Results for Different Models.

Model	Chi-squared	KL Divergence	Entropy Diff	Detection Rate
RDN	0.152	0.023	0.008	0.512
ViT-AA	0.143	0.019	0.006	0.503
PGN	0.168	0.025	0.009	0.518
DSA	0.147	0.021	0.007	0.507
WHN	0.139	0.018	0.005	0.501
MAT	0.141	0.018	0.006	0.502
EAPT	0.145	0.020	0.007	0.505

4.5. Comparative Analysis

4.5.1. Architecture-Specific Performance

The architectural analysis matrix in Table 8 summarizes the strengths, limitations, and best use cases of each model, and can serve as a decision-making aid. RDN preserves fine detail well but is memory intensive and therefore needs resource optimization. ViT-AA handles complex textures very well but requires long training periods, which can put deployment schedules at risk. PGN offers high processing speed and progressive refinement, making it well suited to real-time applications, although it yields lower PSNR values than the other architectures. DSA provides balanced performance with excellent structure preservation, although its complex training requires careful tuning. WHN is the most comprehensive option for security-sensitive applications, combining the best overall quality with frequency awareness, at the price of higher computational cost. MAT demonstrates superior feature correlation through its mutual attention mechanisms, making it ideal for pattern-rich content, though it demands significant memory resources and longer inference times that need to be considered in deployment planning. EAPT brings efficiency through its pyramid structure, offering excellent scalability for large datasets, but its performance can vary depending on input image resolution, requiring careful consideration of the target application’s image characteristics. Stakeholders can therefore choose an architecture according to their requirements and resource constraints.

Table 8.
Architecture-Specific Strengths and Limitations for Different Models.

Architecture	Primary Strengths	Key Limitations	Best Use Case
RDN	High detail preservation, Efficient feature reuse	Memory intensive	High-quality images
ViT-AA	Best texture handling, Global context	Longer training time	Complex textures
PGN	Fast processing, Progressive refinement	Lower PSNR	Real-time apps
DSA	Balanced performance, Structure preservation	Complex training	General purpose
WHN	Best overall quality, Frequency awareness	Computationally heavy	Security-critical
MAT	Enhanced feature correlation, Adaptive attention	High memory usage, Extended inference time	Pattern-rich content
EAPT	Multi-scale efficiency, Resource optimization	Scale-dependent performance	Large-scale datasets

4.5.2. Computational Efficiency

The computational requirements scale as:

\begin{aligned} T (n) = O (f (n, d)) \end{aligned}

(35)

where $n$ is the image dimension and $d$ is the embedding depth.

The bar graph in Figure 11 compares five architectures across three resource metrics: training time, inference time, and memory usage. Each architecture is represented by a color-coded cluster of grouped bars, allowing easy comparison of computational efficiency.

Figure 11.

Comparison of computational efficiency across different architectures. The x-axis shows the average inference time per image (in Milliseconds), while the y-axis indicates GPU memory usage (in GB).

Three layers of resource consumption are shown for each architecture: training time as the primary bar, inference time as a mid-opacity bar, and memory usage as the lowest bar. Resource demands vary considerably across architectures. Quantitative annotations complement the graphical representation with the actual measurements: training times range from 24 to 36 hours, inference times from 12.4 to 18.7 milliseconds, and memory requirements from 7.8 to 10.2 gigabytes.

Efficiency annotations highlight key observations, identifying PGN as the most resource-efficient architecture and ViT-AA as the most resource-intensive. A distinct color for each architecture makes the performance characteristics easy to tell apart, so the figure serves as a compact tool for comparing the resource demands of the architectures.

4.6. Robustness Analysis

Robustness against common image manipulations:

\begin{aligned} R_{s} = 1 - \frac{| M (I_{s}) - I_{s} |}{| I_{s} |} \end{aligned}

(36)

where $M(\cdot)$ represents various image manipulations.
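
As a hedged illustration of Equation (36), the sketch below evaluates $R_{s}$ for a single manipulation. The Euclidean norm, the $[0, 1]$ pixel range, and the helper names `robustness_score` and `add_gaussian_noise` are assumptions made for this example; the exact norm and test pipeline behind Table 9 are not specified in the text.

```python
import numpy as np

def robustness_score(stego, manipulate):
    """R_s = 1 - ||M(I_s) - I_s|| / ||I_s||, using the Euclidean norm (assumed)."""
    stego = np.asarray(stego, dtype=float)
    changed = np.asarray(manipulate(stego), dtype=float)
    return 1.0 - np.linalg.norm(changed - stego) / np.linalg.norm(stego)

def add_gaussian_noise(image, sigma=0.1, seed=0):
    """Additive Gaussian noise on a [0, 1] image, clipped back into range."""
    rng = np.random.default_rng(seed)
    return np.clip(image + rng.normal(0.0, sigma, image.shape), 0.0, 1.0)

# Hypothetical stego image with values in [0, 1].
stego = np.random.default_rng(1).random((64, 64, 3))
print(robustness_score(stego, lambda x: x))         # identity manipulation
print(robustness_score(stego, add_gaussian_noise))  # sigma = 0.1
```

An unchanged image gives a score of exactly 1, and stronger manipulations push the score lower, which matches the direction of the values reported in Table 9.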

The robustness analysis in Table 9 evaluates each architecture under common image manipulations and therefore provides crucial insight into real-world applicability. WHN is resilient across all categories of transformation, maintaining scores of at least 0.89 in each, with notably strong resistance to JPEG compression (0.95) and scaling (0.93). MAT demonstrates strong resilience comparable to WHN, particularly in noise handling (0.92) and JPEG compression (0.94), leveraging its mutual attention mechanisms to maintain structural integrity across transformations. ViT-AA handles noise well, scoring 0.91 under this difficult condition. EAPT shows consistent performance across transformations due to its pyramid structure, with particularly strong results in JPEG compression (0.93) and scaling (0.91), though it is slightly less robust than MAT in noise scenarios (0.90). PGN is efficient but vulnerable to geometric transformations, notably rotation, where its score drops to 0.84. The tests range from simple operations such as JPEG compression to more complex geometric transformations, giving stakeholders a clear overview of which architecture is reliable in different real scenarios. The results are especially important for applications in which resistance to image manipulation is an essential requirement, with MAT and EAPT offering additional robust options for specific use cases where attention mechanisms or multi-scale processing are crucial.

Table 9.

Robustness Test Results for Different Models.

Operation	RDN	ViT-AA	PGN	DSA	WHN	MAT	EAPT
JPEG (Q $=$ 75)	0.92	0.94	0.89	0.93	0.95	0.94	0.93
Noise ($σ$ $=$ 0.1)	0.88	0.91	0.87	0.90	0.92	0.92	0.90
Rotation (5°)	0.85	0.88	0.84	0.87	0.89	0.89	0.87
Scaling (0.5)	0.90	0.92	0.88	0.91	0.93	0.92	0.91
Cropping (10%)	0.87	0.89	0.86	0.88	0.91	0.90	0.89

4.7. Ablation Studies

To validate the effectiveness of each component, we conducted comprehensive ablation studies:

\begin{aligned} E (c) = \frac{P (c)}{P (\bar{c})} \end{aligned}

(37)

where $E(c)$ represents the effectiveness of component $c$.

Figure 12 presents a comprehensive ablation study that quantifies the individual contributions of five key architectural components in our proposed steganographic system. The analysis evaluates three critical metrics for each component: PSNR impact (measured in dB), security enhancement (measured in percentage), and computational overhead. The Dense Blocks provide a balanced improvement with +1.2 dB PSNR and 15% computational cost, while the Attention mechanism shows strong security gains of +8% with moderate PSNR improvement. The Progressive Generation component achieves the highest PSNR boost of +1.5 dB with minimal computational impact, whereas the Dual Stream architecture offers consistent improvements across all metrics. Most notably, the Wavelet Transform component demonstrates the highest security enhancement of +9% but incurs the largest computational cost of +25%, highlighting the fundamental trade-offs between performance and resource requirements in steganographic systems.

Figure 12.

Ablation study results illustrating the contribution of key architectural components to overall performance. Each bar represents a model variant with specific modules (e.g., attention, wavelet fusion, progressive decoding) enabled or removed. These results highlight the impact of each module on visual quality and security.

The component contribution analysis in Table 10 details how each architectural element affects overall system performance. The wavelet transform component yields the largest security improvement, 9%, but at a cost of around 25% in computational overhead. The progressive generation component has the most favorable efficiency profile, providing a 1.5 dB PSNR improvement for a 10% increase in computational requirements. Attention mechanisms show a good performance-to-cost ratio, increasing security by 8% while increasing computational cost by 20%. Dense blocks provide a 1.2 dB PSNR gain with a moderate 15% increase in computational cost. The mutual attention transformer (MAT) demonstrates robust performance with a 1.4-dB PSNR improvement and 8% security enhancement, though requiring a significant 22% increase in computational resources. The efficient attention pyramid transformer (EAPT) achieves impressive results with a 1.6-dB PSNR gain and 7% security improvement while maintaining reasonable computational efficiency at 16% overhead. This analysis helps system architects choose components that satisfy their particular requirements for security, quality, and computational cost, and underscores the need to balance performance-enhancing capabilities against resource utilization when designing steganographic systems.

Table 10.

Component-Wise Contribution Analysis.

Component	Impact on PSNR	Impact on Security	Computational Cost
Dense Blocks	$+$ 1.2 dB	$+$ 5%	$+$ 15%
Attention	$+$ 0.8 dB	$+$ 8%	$+$ 20%
Progressive Gen	$+$ 1.5 dB	$+$ 6%	$+$ 10%
Dual Stream	$+$ 1.1 dB	$+$ 7%	$+$ 18%
Wavelet Trans	$+$ 1.3 dB	$+$ 9%	$+$ 25%
MAT	$+$ 1.4 dB	$+$ 8%	$+$ 22%
EAPT	$+$ 1.6 dB	$+$ 7%	$+$ 16%
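
The cost-benefit comparisons drawn from Table 10 can be checked with a short calculation. The sketch below divides each reported gain by the reported computational cost. This gain-per-cost ratio is an illustrative measure introduced here, not the effectiveness $E(c)$ of Equation (37), whose underlying performance values are not listed in the table.

```python
# Values copied from Table 10: (PSNR gain in dB, security gain in %, cost in %).
components = {
    "Dense Blocks":    (1.2, 5, 15),
    "Attention":       (0.8, 8, 20),
    "Progressive Gen": (1.5, 6, 10),
    "Dual Stream":     (1.1, 7, 18),
    "Wavelet Trans":   (1.3, 9, 25),
    "MAT":             (1.4, 8, 22),
    "EAPT":            (1.6, 7, 16),
}

def gain_per_cost(table):
    """Return {component: (dB per % cost, security % per % cost)}."""
    return {name: (psnr / cost, sec / cost)
            for name, (psnr, sec, cost) in table.items()}

ratios = gain_per_cost(components)
# List components from the best to the worst PSNR gain per unit of cost.
for name, (psnr_ratio, sec_ratio) in sorted(ratios.items(), key=lambda kv: -kv[1][0]):
    print(f"{name:16s} {psnr_ratio:.3f} dB per %   {sec_ratio:.3f} security % per %")
```

Under this simple ratio, the progressive generation component ranks first on both PSNR and security gain per unit of cost, which agrees with the efficiency profile described for it.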

5. Discussion

The comprehensive analysis of various deep learning architectures spans multiple dimensions of steganographic performance, including embedding capacity, visual quality, security against steganalysis, and computational efficiency. Compared with earlier CNN-based frameworks such as HiDDeN (Zhu et al., 2018) and StegNet (Wu et al., 2018), the proposed architectures demonstrate substantial improvements in both imperceptibility and robustness. The WHN consistently outperforms existing approaches, achieving a PSNR of 43.5 dB and SSIM of 0.995, while the ViT-AA excels in handling complex textures, yielding a 15% improvement in embedding efficiency. These findings are consistent with recent transformer-based efforts such as StegoFormer (Xie et al., 2021), which reported similar gains in visual fidelity at the expense of computational demand. In contrast, the PGN achieves superior computational efficiency, reducing inference time by nearly 30% compared with transformer-based methods, though with a modest trade-off in fidelity. Importantly, all architectures maintained robustness against CNN-based steganalyzers, with detection rates approaching random chance (0.501–0.518), reinforcing the effectiveness of the proposed designs.

5.1. Architectural Insights

The comparative study of our seven architectures highlights clear trade-offs between convolutional and transformer-based approaches. CNN-based models such as RDN and PGN achieve faster convergence and lower resource demand due to their localized feature extraction, making them suitable for lightweight applications. In contrast, transformer-driven models such as MAT and WHN leverage attention mechanisms to capture long-range spatial dependencies, thereby preserving texture continuity and improving imperceptibility. These findings are consistent with recent literature (Xie et al., 2021), which emphasizes that attention modules enhance embedding fidelity but require greater computational resources. Such differences underline that model selection is inherently application-driven, where either efficiency or imperceptibility must be prioritized.

5.1.1. Feature Representation Efficiency

The effectiveness of feature representation can be quantified through the information retention ratio:

\begin{aligned} η = \frac{I (F_{out}; X)}{I (F_{in}; X)} \end{aligned}

(38)

where $I(\cdot; \cdot)$ represents mutual information between feature spaces.

The graph in Figure 13 shows the convergence behavior of five models during training in a 2D coordinate system, where the x-axis spans epochs 0 to 200 and the y-axis spans loss values from 0.0 to 1.0. Each model is drawn in its own color, with quadratic splines connecting the data points. The loss decreases steadily for all models, indicating that predictions improve as training progresses. Notably, RDN converges steeply at the very beginning of training, whereas ViT-AA shows a slightly slower and smoother decline.

Figure 13.

Feature representation efficiency comparison across architectures using the mutual information ratio. Higher values indicate better retention of input information through the encoding process.

These observations align with prior studies, where convolutional steganography frameworks such as StegNet (Wu et al., 2018) reported rapid training but limited scalability to complex image textures. The smoother convergence of ViT-AA and WHN supports the view that attention-based encoders not only enhance imperceptibility but also improve training stability, an advantage increasingly highlighted in transformer-driven image analysis tasks (Xie et al., 2021). This suggests that embedding performance benefits are not only quantitative but also procedural, reducing the risk of unstable optimization.

The figure also includes several features that aid interpretation: each curve is labeled with the model name and its final value in dB. WHN attains the highest of these values at 43.5 dB. Annotations such as “Initial Training Phase” and “Stabilization Phase” mark the main stages of training. Combined with the clear depiction of each model’s convergence pattern, these elements make the graph a useful resource for comparing model performance and understanding how training progresses differently across architectures.

5.1.2. Computational Complexity Vs. Performance

The relationship between computational resources and steganographic performance follows:

\begin{aligned} P (r) = α \cdot \log (r) + β \end{aligned}

(39)

where $P(r)$ represents performance metrics and $r$ denotes computational resources.

The resource utilization analysis in Table 11 details the computational requirements of each architecture, which is critical for planning and deployment. ViT-AA makes the highest demand on resources, with 52.3G FLOPs and 36 hours of training, a cost that accompanies its strong quality metrics. MAT follows closely with 51.4G FLOPs and 34 hours of training time, demonstrating similarly high resource demands but marginally better efficiency than ViT-AA. PGN has the lightest compute and memory footprint, at 38.9G FLOPs with a training time of 28 hours, and is therefore well suited to low-resource environments. EAPT maintains a balanced resource footprint with 46.5G FLOPs and 29 hours of training time, positioning it as an efficient alternative for moderate computing environments. The memory requirement tracks the parameter count closely across all architectures and ranges between 7.8 GB and 10.2 GB, which has important implications for deployment planning. WHN keeps a middle-of-the-road resource profile with 49.8G FLOPs and a training time of 32 hours, thereby offering a compromise between performance and resource intensity. These metrics allow an organization to select an architecture based on its available computational resources and required performance, and they support capacity planning across deployment scenarios.

Table 11.

Resource Utilization Analysis.

Architecture	FLOPs(G)	Parameters(M)	Memory(GB)	Training Time(h)
RDN	45.6	12.3	8.4	24
ViT-AA	52.3	15.7	10.2	36
PGN	38.9	11.2	7.8	28
DSA	47.2	13.5	9.1	30
WHN	49.8	14.2	9.6	32
MAT	51.4	14.8	9.8	34
EAPT	46.5	13.1	8.9	29
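
As a rough check of the logarithmic relationship in Equation (39), the sketch below fits a log-linear model to the FLOPs in Table 11 and the PSNR values in Table 6. It is illustrative only: using FLOPs as the resource measure $r$, PSNR as the performance measure $P(r)$, and a single natural logarithm are all assumptions made for this example, and seven data points cannot confirm the functional form.

```python
import numpy as np

# Order: RDN, ViT-AA, PGN, DSA, WHN, MAT, EAPT.
flops_g = np.array([45.6, 52.3, 38.9, 47.2, 49.8, 51.4, 46.5])  # Table 11
psnr_db = np.array([42.3, 43.1, 41.8, 42.7, 43.5, 43.3, 42.9])  # Table 6

def fit_log_model(resources, performance):
    """Least-squares fit of performance = alpha * ln(resources) + beta."""
    alpha, beta = np.polyfit(np.log(resources), performance, deg=1)
    return alpha, beta

alpha, beta = fit_log_model(flops_g, psnr_db)
predicted = alpha * np.log(flops_g) + beta
max_residual = np.max(np.abs(psnr_db - predicted))
print(f"alpha={alpha:.2f}, beta={beta:.2f}, max residual={max_residual:.2f} dB")
```

A positive fitted slope indicates that PSNR rises with compute in these data, while the residuals show how far individual architectures depart from the simple trend.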

The trade-off between computational demand and performance observed in our study mirrors similar findings in earlier works (Zhang et al., 2019a), where improvements in visual fidelity and robustness often correlated with higher FLOPs and parameter counts. These results highlight the persistent challenge in designing architectures that balance imperceptibility with deployability, reinforcing the importance of hybrid models that can integrate the efficiency of CNNs with the adaptability of transformers.

5.2. Security-Performance Trade-offs

The security-performance relationship can be modeled as:

\begin{aligned} S (p) = S_{max} \cdot (1 - e^{- λ p}) \end{aligned}

(40)

where $S(p)$ represents security level at performance $p$, and $λ$ is a system-dependent parameter.

The visualization in Figure 14 analyzes the performance trade-off among RDN, ViT-AA, PGN, WHN, MAT, and EAPT by mapping image quality (PSNR) against security (detection rate). The relationship between detection performance and image quality is nonlinear. The Pareto frontier reflects a broader trend in the field: imperceptibility gains are often coupled with greater computational cost, while efficient architectures typically expose more detectable artifacts. Earlier approaches struggled to balance these competing demands, frequently prioritizing either imperceptibility or robustness in isolation. The current results extend that trajectory by showing that advanced attention-driven designs can simultaneously deliver high fidelity and strong resistance to detection, albeit at higher computational expense. This comparative perspective suggests that the evolution of steganographic architectures is moving toward hybrid or adaptive frameworks capable of narrowing the gap between visual quality, security, and efficiency.

Figure 14.

Security-performance trade-off curves for selected architectures. The x-axis denotes PSNR (dB), representing visual quality, while the y-axis shows detection accuracy of a CNN-based steganalyzer.

Quantitative performance insights outline subtle differences: WHN shows the highest quality measure at 43.5 dB and the safest performance with a 0.501 detection rate. ViT-AA is the best-balanced architecture, showing a 43.1 dB PSNR and optimal trade-off positioning. PGN has the lowest quality at 41.8 dB and the lowest security with a 0.518 detection rate. The differences observed here are consistent with patterns reported in earlier generations of steganographic frameworks. CNN-based pipelines such as early residual or progressive designs were known to offer speed advantages but left detectable statistical footprints that could be exploited by modern detectors. In contrast, more recent architectures integrating attention or multi-scale encoding strategies have shown improved concealment by distributing embedding perturbations across both spatial and frequency domains. Our findings confirm this shift: lightweight architectures such as PGN demonstrate efficiency at the expense of resilience, whereas transformer-based models such as WHN and ViT-AA achieve lower detectability by leveraging global context modeling. The visualization strategically segments the performance regions into a “High Performance Region” and a “High Security Risk Region,” which offers contextual interpretation over and above the raw numbers. Operating points for each architecture are annotated with PSNR and detection rate values, making it easy to quickly compare them.

Color-coded curves with an extensive legend and performance notes transform multi-dimensional performance data into an accessible, insights-driven visualization. This facilitates easy interpretation of the pervasive trade-offs between image quality, detection capabilities, and security in differing machine learning architectures by researchers and practitioners.
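
The Pareto frontier mentioned in this section can be made explicit from the reported operating points. The sketch below is an illustration rather than the procedure used to draw Figure 14: it takes PSNR and inference time from Table 6 and detection rate from Table 7, and adding inference time as a third objective is an assumption made here so that efficiency enters the comparison.

```python
# (PSNR in dB, detection rate, inference time in ms) from Tables 6 and 7.
models = {
    "RDN":    (42.3, 0.512, 15.3),
    "ViT-AA": (43.1, 0.503, 18.7),
    "PGN":    (41.8, 0.518, 12.4),
    "DSA":    (42.7, 0.507, 16.8),
    "WHN":    (43.5, 0.501, 17.2),
    "MAT":    (43.3, 0.502, 19.1),
    "EAPT":   (42.9, 0.505, 16.5),
}

def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on one. Higher PSNR is better; lower detection rate and lower
    inference time are better."""
    at_least = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
    strictly = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
    return at_least and strictly

def pareto_front(table):
    """Names of models that no other model dominates."""
    return [name for name, point in table.items()
            if not any(dominates(other, point) for other in table.values())]

print(pareto_front(models))
```

On quality and detection rate alone, WHN dominates every other model. Once inference time is included, RDN, PGN, WHN, and EAPT all remain on the frontier, which reflects the efficiency trade-off described above.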

5.3. Limitations and Challenges

Through extensive testing and real-world evaluation, several important challenges were identified that merit discussion. The computational overhead of advanced architectures, particularly ViT-AA and WHN, presents practical deployment considerations. Memory consumption was observed to scale quadratically with image resolution in transformer-based components, which may constrain their use in highly resource-limited environments. This is consistent with long-standing characteristics of deep learning-based steganography, where greater modeling capacity often entails higher memory and computational costs. While transformer-based designs extend representational power and improve imperceptibility, their scaling behavior highlights an intrinsic trade-off in current architectural paradigms that future research must continue to address.
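
The quadratic scaling noted above follows from the size of the self-attention map, which grows with the square of the number of tokens. The following sketch estimates that memory for a single layer. The patch size, head count, and 32-bit storage are illustrative assumptions, not the configuration of the models evaluated here.

```python
def attention_map_bytes(height, width, patch=16, heads=8, bytes_per_value=4):
    """Memory for one layer's self-attention maps on a single image.
    patch, heads, and float32 storage are illustrative assumptions."""
    tokens = (height // patch) * (width // patch)
    # Each head stores a tokens-by-tokens attention matrix.
    return heads * tokens * tokens * bytes_per_value

for side in (256, 512, 1024):
    mib = attention_map_bytes(side, side) / 2**20
    print(f"{side}x{side}: {mib:.1f} MiB")
```

Doubling the pixel count doubles the number of tokens and quadruples the attention memory, so doubling both image sides multiplies it by sixteen.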

The findings also indicate that the relationship between embedding capacity and image fidelity follows a steeper degradation curve than anticipated. Although the models achieve state-of-the-art PSNR and SSIM metrics, preserving this quality at very high embedding rates remains a challenge. This reflects a persistent characteristic across generations of models: as payload size increases, embedding perturbations become more correlated with cover statistics, making fidelity preservation more difficult. Attention-based mechanisms mitigate this effect by distributing perturbations more evenly, but the capacity–imperceptibility trade-off remains an open research problem. Adaptive rate control and statistically informed embedding strategies are promising avenues for improvement. PGN, while efficient, occasionally produces artifacts in highly textured images, which motivates the development of adaptive embedding regulation methods.

Training stability also emerged as an area requiring careful consideration, particularly in balancing adversarial components. Attention mechanisms showed sensitivity to batch size and initialization, while dual-stream designs required careful tuning to maintain stream balance. Similar observations have been reported in earlier adversarial frameworks, suggesting that these behaviors reflect a broader challenge in stabilizing generator–discriminator dynamics in steganographic networks. The heightened sensitivity of attention modules amplifies this effect, making it a priority for further study. Robust training strategies such as curriculum learning, progressive scheduling, or stabilized loss formulations may contribute to improved reproducibility across architectures and datasets.

Taken together, these observations suggest that computational scalability, payload–fidelity balance, and training stability remain active research frontiers. Addressing them will require architectural innovation supported by standardized benchmarking protocols and community-driven evaluation practices to ensure reproducibility and comparability across approaches.

Finally, while image steganography offers clear benefits for secure communication, copyright protection, and data integrity, it also has a dual-use character that necessitates responsible deployment. To mitigate risks of misuse, the promotion of transparent practices, the development of advanced steganalysis tools, and alignment with ethical and legal frameworks are essential. The authors advocate for dual-use guidelines to ensure that steganographic technologies are applied responsibly, maximizing their societal benefit while minimizing the potential for abuse.

5.4. Practical Implications and Recommendations

The evaluation in this study shows that different architectures are suited to distinct deployment scenarios. High-capacity models such as WHN and MAT, which deliver superior imperceptibility and robustness, are well aligned with security-critical applications including military communication, forensic watermarking, medical image protection, and the preservation of sensitive archives. Lightweight models such as PGN and DSA, with faster inference and lower computational demand, are more suitable for real-time and resource-constrained environments such as mobile messaging, embedded devices, and IoT-based systems. EAPT provides a balanced profile, making it a practical option for enterprise-grade cloud deployments where both scalability and cost-efficiency are important.

Beyond technical performance, these architectures address wider societal and industrial needs. They support digital privacy, safeguard data integrity, and enable secure communication even under restrictive or censored conditions. In regulated domains such as healthcare, finance, and education, they can facilitate secure documentation exchange while meeting confidentiality and compliance requirements. The benchmarks presented in this study therefore provide practitioners with a reliable reference for selecting architectures suited to their operational contexts. At the same time, the positive impact of these systems must be weighed against their risks: while they enhance confidentiality and reduce detectability, they may also be misused by adversarial actors to conceal harmful content or evade forensic monitoring. This dual-use nature underscores the need for responsible deployment, supported by robust steganalysis methods and appropriate policy safeguards.

Deployment cost is another decisive consideration. Lightweight CNN-based models such as PGN and DSA can be implemented on commodity GPUs or mobile NPUs at an estimated cost of less than $0.20 per 1,000 images in cloud environments. Transformer-based models such as WHN and MAT require high-performance accelerators such as RTX 3080 or A100 GPUs, with operational costs ranging from $0.50 to $1.00 per 1,000 images depending on configuration. EAPT offers an intermediate solution, providing high-quality results with moderate computational requirements, and is therefore well suited for enterprise-scale deployments.

Taken together, these findings establish a clear recommendation framework. PGN and DSA are effective in cost-sensitive or real-time scenarios, while WHN and MAT are more appropriate for institutions requiring maximum imperceptibility and robustness despite their greater computational demands. EAPT offers a pragmatic compromise for enterprise and cloud deployments, balancing performance with affordability. The adoption of steganographic architectures should therefore be regarded as a context-dependent decision that carefully balances imperceptibility, robustness, efficiency, and cost.
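The recommendation framework can be written down as a simple lookup, sketched below. The scenario labels and the function `recommend` are hypothetical names chosen for illustration. Only the groupings of architectures come from the text, and a real selection would weigh the measured metrics rather than a fixed table.

```python
from typing import Dict, List

# Hypothetical scenario labels; the architecture groupings follow the
# recommendations stated in the text.
RECOMMENDATIONS: Dict[str, List[str]] = {
    "cost_sensitive": ["PGN", "DSA"],
    "real_time": ["PGN", "DSA"],
    "max_imperceptibility_and_robustness": ["WHN", "MAT"],
    "enterprise_cloud": ["EAPT"],
}


def recommend(scenario: str) -> List[str]:
    """Return the architectures suggested for a deployment scenario."""
    if scenario not in RECOMMENDATIONS:
        known = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(f"unknown scenario {scenario!r}; expected one of: {known}")
    # Return a copy so callers cannot mutate the shared table.
    return list(RECOMMENDATIONS[scenario])


if __name__ == "__main__":
    for name in RECOMMENDATIONS:
        print(f"{name}: {', '.join(recommend(name))}")
```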

6. Conclusion

This article presented a comprehensive evaluation of seven deep learning architectures for image steganography, each offering distinct advantages in visual quality, security robustness, computational efficiency, and deployment feasibility. RDN preserved fine details through dense blocks and residual connections (PSNR: 42.3 dB, SSIM: 0.992), while ViT-AA leveraged transformer-based attention mechanisms to handle complex textures effectively (PSNR: 43.1 dB, SSIM: 0.994). PGN demonstrated real-time applicability with the lowest processing time (12.4 ms) and minimal memory usage (7.8 GB). DSA balanced structural and textural information. WHN outperformed all other models in imperceptibility and resilience (PSNR: 43.5 dB, SSIM: 0.995, detection rate: 0.501), achieving 95% robustness against JPEG compression and 92% against noise. MAT exhibited excellent pattern preservation with strong statistical imperceptibility (PSNR: 43.3 dB, SSIM: 0.994, KL divergence: 0.018), while EAPT offered a scalable and resource-efficient architecture with favorable quality metrics (PSNR: 42.9 dB, SSIM: 0.993) and moderate memory usage (9.8 GB).

Statistical evaluations across all models confirmed their ability to maintain cover image characteristics, with WHN, MAT, and EAPT consistently ranking high in imperceptibility metrics. These findings establish updated benchmarks in deep learning-based steganography and highlight critical trade-offs between security, quality, and efficiency. The diversity of model characteristics provides guidance for selecting suitable architectures based on application demands, whether for high-security use cases or resource-constrained environments.

Future work will focus on optimizing these architectures for mobile and edge deployments, enhancing adversarial robustness against modern steganalysis techniques, and extending the framework to multimodal domains such as video and audio steganography. Furthermore, incorporating explainability and ethical safeguards will be essential to ensure responsible use in digital security and forensic applications.

Footnotes

ORCID iD

Narendra Kumar Chahar

Ethics Approval and Informed Consent

This study does not involve human participants or animals. Hence, ethics approval and informed consent are not applicable.

Authorship Declaration

All authors have significantly contributed to the research and preparation of this manuscript. The final submitted version has been reviewed and approved by all listed authors.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Models/Components	Feature Extraction	Information Hiding	Quality Control	Security Mechanism	Computational Units
RDN	Dense Blocks	Residual	Channel Attention	Dense fusion	Convolution
ViT-AA	Self-Attention	Attention-based	Multi-head	Position-aware	Transformer
PGN	Progressive	Iterative	Progressive	Stage-wise	Generator
DSA	Dual-Path	Parallel	Dual-stream	Split processing	Dual CNN
WHN	Wavelet	Frequency	Multi-scale	Coefficient	Hybrid
MAT	Mutual-Attention	Cross-attention	Bilateral-head	Pattern-aware	Efficient Transformer
EAPT	Pyramid Structure	Multi-scale	Hierarchical	Scale-fusion	Pyramid Transformer

Variables
Item	Aspect	Value	Role / Usage
Independent Variable	Models	RDN, ViT-AA, PGN, DSA, WHN, MAT, EAPT	Model selection
Dependent Variables	Measures	PSNR, SSIM, Capacity, Robustness, Detection accuracy	Performance evaluation
Experimental Entities	Entities	Cover images, Secret payloads, Stego images	Embedding process

Data Collection (Datasets and Preprocessing)
Dataset Type	Name	Description / Resolution / Count	Usage
Training	DIV2K	Benchmark images, 2K, 1,000	Primary training
Training	COCO	Diverse natural images, Various, 50,000	Augmentation
Training	ImageNet	Labeled images, 256 $\times$ 256, 25,000	Validation
Testing	MIT Places	Scene images, 256 $\times$ 256, 5,000	Primary testing
Testing	USC-SIPI	Misc. quality set, Various, 2,000	Quality testing
Preprocessing	Steps	Resizing, normalization, histogram equalization, augmentation	Applied to all datasets

Measuring Instruments
Instrument	Aspect	Value	Role / Usage
Imperceptibility Metric	Definition	PSNR, SSIM	Stego quality evaluation
Robustness Tests	Distortions	JPEG compression, Gaussian noise, Random cropping	Distortion resistance
Security Evaluation	Tool / Output	CNN-based steganalysis, Detection accuracy	Attack resistance
Efficiency	Measure	Inference time per image (ms)	Computational cost
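
As an illustration of two of the instruments listed above, the sketch below computes PSNR between a cover and a stego image and applies a Gaussian-noise distortion of the kind used in the robustness tests. It is a standard-library approximation operating on flat lists of 8-bit pixel values, not the study's PyTorch pipeline. The function names, the `sigma` parameter, and the fixed seed are assumptions made for the example.

```python
import math
import random
from typing import List, Sequence


def psnr(cover: Sequence[float], stego: Sequence[float], max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(cover) != len(stego) or len(cover) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((c - s) ** 2 for c, s in zip(cover, stego)) / len(cover)
    if mse == 0:
        return math.inf  # identical images: no distortion
    return 10.0 * math.log10(max_value ** 2 / mse)


def add_gaussian_noise(pixels: Sequence[float], sigma: float, seed: int = 0) -> List[float]:
    """Add zero-mean Gaussian noise and clip to the 8-bit range [0, 255]."""
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    return [min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in pixels]


if __name__ == "__main__":
    cover = [128.0] * 1024                 # hypothetical flat gray cover image
    stego = [p + 1.0 for p in cover]       # every pixel shifted by one level
    print(f"PSNR(cover, stego) = {psnr(cover, stego):.2f} dB")
    noisy = add_gaussian_noise(stego, sigma=5.0)
    print(f"PSNR(cover, noisy stego) = {psnr(cover, noisy):.2f} dB")
```

A uniform one-level offset gives a mean squared error of 1, so the first value is 20 log10(255), roughly 48.13 dB. The PSNR values above 42 dB reported in this study correspond to average distortions of this order on 8-bit images.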

Deep Learning-Empowered Image Steganography: Architectural Innovations and Performance Benchmarking
