Abstract
In the field of unpaired image-to-image (I2I) translation, enforcing consistency through similarity constraints on features from corresponding stages of the encoder and decoder has led to state-of-the-art performance. However, this simple similarity constraint, typically based on mean-squared error, is limited in terms of feature alignment, spatial information utilization, and global context modeling, resulting in suboptimal detail preservation and global structure in the generated images. To address these issues, we propose RFA-GAN, an I2I translation framework that integrates noise injection and attention mechanisms. To enhance the generator’s ability to preserve fine details, we design a channel shuffle dual attention module. Although generative adversarial networks have achieved remarkable progress in generating high-quality samples, noise and distortion still impair translation performance; to mitigate this, we inject Gaussian noise into the input images during training, which improves translation quality under noisy conditions and reduces the model’s sensitivity to noise. Furthermore, to compensate for the insufficient capture of local details and the lack of global style information, we design a focal-frequency residual block. Extensive experiments on multiple benchmark datasets demonstrate that our method significantly improves the quality and detail preservation of generated images while enhancing model robustness. These results suggest that our approach provides an efficient and stable solution for unpaired I2I translation.
Introduction
Image-to-image (I2I) translation aims to transform an image from a source domain to a target domain while preserving its structural content and semantic integrity. This paradigm has been widely applied in diverse computer vision tasks, including style transfer (Zhang et al., 2023), image dehazing (Dong et al., 2020), image restoration (Liang et al., 2021), and semantic segmentation (Yu et al., 2017). I2I tasks can be broadly categorized into information-symmetric and information-asymmetric scenarios. In symmetric translation (e.g., horse-to-zebra or seasonal changes), input and output images share similar structures, differing primarily in style or texture. In contrast, asymmetric translation requires the model to hallucinate missing structures and synthesize fine-grained details under large domain gaps—such as generating photorealistic scenes from semantic layouts, reconstructing realistic images from artistic paintings, or translating edge maps into natural images—posing significantly greater challenges in feature representation, cross-domain alignment, and generative fidelity.
Despite notable progress, existing methods still struggle in highly asymmetric settings. Convolutional networks, limited by local receptive fields, often produce blurry textures and structural artifacts due to their inability to model long-range dependencies. While transformer-based models improve global context modeling, they suffer from high computational cost and insufficient local detail refinement (Qin et al., 2023). Moreover, most approaches operate solely in the spatial domain, underutilizing the rich structural priors embedded in the frequency domain, which hinders high-frequency detail reconstruction and structural coherence. Although generative adversarial network (GAN)- and diffusion-based frameworks have demonstrated strong generative capabilities, they remain sensitive to input noise, domain shifts, and training instabilities, limiting their robustness and generalization in real-world applications (Du et al., 2023; Zhang et al., 2023).
To address these limitations, we propose RFA-GAN, a novel unpaired I2I translation framework that integrates noise injection and attention mechanisms to enhance detail preservation and structural reasoning. Our key insight is that effective feature modeling in asymmetric translation requires both multi-scale spatial-channel interaction and frequency-aware contextual aggregation. To this end, we design the channel shuffle dual attention module (CSDAM), which jointly enhances spatial and channel features through a dual-attention mechanism. By integrating a channel shuffle operation, CSDAM promotes cross-channel information exchange, reduces feature redundancy, and strengthens inter-channel interactions, enabling the generator to capture fine-grained textures and deliver richer, more discriminative representations.
Furthermore, to balance local detail generation with global style consistency—a common challenge in tasks such as cat-to-dog translation—we introduce the focal-frequency residual block (FFRB). FFRB combines a frequency-domain channel attention mechanism (Qin et al., 2021) with the multi-scale focal modulation strategy from FocalNets (Yang et al., 2022). This design allows the model to adaptively emphasize informative frequency components, suppress noise and redundancy, and aggregate multi-scale contextual features in a hierarchical manner. Additionally, Gaussian noise is injected during training to simulate real-world data variability, improving model robustness and mitigating overfitting. Together, these components enable RFA-GAN to achieve superior performance in preserving structural integrity and generating photorealistic details under challenging asymmetric conditions.
Our contributions are summarized as follows: We propose CSDAM, which enhances fine-grained texture and structural feature capture across scales by combining channel shuffling with dual attention mechanisms. We inject Gaussian noise at the input stage to improve model robustness and cross-domain generalization. We design an FFRB that strengthens global frequency-domain representation and performs multi-scale local modulation, leading to better preservation of fine details and improved visual fidelity compared with traditional residual blocks. Finally, we propose the RFA-GAN framework and conduct comprehensive experiments on several widely used unpaired I2I translation datasets; quantitative and qualitative results demonstrate that our method outperforms existing approaches in terms of visual quality and detail preservation of the generated images.
Related Works
I2I Translation
GANs (Goodfellow et al., 2020; Isola et al., 2017; Karras et al., 2019) have become a foundational paradigm in the field of image generation, enabling the synthesis of highly realistic images across a variety of tasks. Among these, I2I translation has emerged as a prominent application, where the objective is to learn a mapping between two visual domains that converts an image from a source domain to a target domain while preserving its semantic content and spatial structure. I2I translation methods can be broadly categorized into supervised and unsupervised learning frameworks. In the supervised setting, Pix2Pix (Isola et al., 2017) pioneered the first conditional GAN-based I2I model, which combines an adversarial loss with an L1 reconstruction loss and relies on a large corpus of paired training data. Although highly effective when paired samples are available, this approach is limited in scalability due to the impracticality and high cost of collecting paired data in real-world applications.
To overcome this limitation, unsupervised I2I translation approaches have gained significant attention. Notably, CycleGAN (Zhu et al., 2017) and DiscoGAN (Kim et al., 2017) introduced the concept of cycle-consistency, employing two generators and two discriminators to perform bidirectional domain mapping. The cycle-consistency loss ensures that an image translated from domain A to domain B and then mapped back to domain A closely reconstructs the original image, even without paired supervision. This framework significantly expands the applicability of I2I models by enabling training with unpaired datasets. However, the cycle-consistency constraint enforces strict pixel-level alignment, which may hinder the model’s ability to capture high-level semantic transformations, often leading to suboptimal results in more complex translation scenarios. Building upon this, the UNIT framework (Liu et al., 2017) proposed a shared latent space assumption, suggesting that semantically aligned images from different domains can be projected into a common latent representation. This approach facilitates cross-domain alignment in the feature space, allowing for more flexible translation. Similarly, GcGAN (Fu et al., 2019) introduced a geometric consistency loss that preserves spatial relationships between the source and translated images, improving structural coherence during domain translation. In parallel, AttnGAN (Chen et al., 2018) incorporated attention mechanisms into the generation process by requiring the model to learn foreground masks, thereby focusing translation only on relevant object regions. Although this technique enhances localization and translation accuracy, the explicit estimation of masks introduces considerable computational overhead, making it less efficient for high-resolution or large-scale applications.
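For concreteness, the cycle-consistency objective used by these methods can be written as follows (the standard CycleGAN formulation; the notation here is ours), where G: A → B and F: B → A denote the two generators:

```latex
\mathcal{L}_{\mathrm{cyc}}(G, F) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(A)}\big[\lVert F(G(x)) - x \rVert_{1}\big]
  + \mathbb{E}_{y \sim p_{\mathrm{data}}(B)}\big[\lVert G(F(y)) - y \rVert_{1}\big]
```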
More recently, motivated by the remarkable success of transformers in natural language processing and vision tasks, several works have explored their integration into I2I translation pipelines. For instance, ITTR (Zheng et al., 2022) and InstaFormer (Kim et al., 2022) embed transformer architectures into the generator design to capture long-range dependencies and global context. These models leverage self-attention mechanisms to enhance semantic alignment between source and target domains, achieving state-of-the-art performance in terms of translation quality and generalization.
Collectively, these advances have significantly pushed the boundaries of I2I translation. However, challenges such as preserving fine-grained texture details, ensuring structural consistency, and improving computational efficiency remain open problems, motivating the continued exploration of novel architectures and training strategies in the field.
Contrastive Learning
Contrastive learning is based on discriminative principles, aiming to learn feature embeddings by pulling together positive pairs and pushing apart negative pairs in the embedding space. The first method to introduce contrastive learning into I2I translation was CUT (Park et al., 2020). In the following years, researchers focused on improving CUT. QS-Attn (Hu et al., 2022) changed the negative sample selection strategy of CUT by dynamically selecting relevant anchors as positive and negative samples through the computation of Query, Key, and Value matrices. Spectral Normalization and Dual Contrastive Regularization (Zhao et al., 2025) further improved upon QS-Attn by introducing spectral normalization and dual contrastive regularization; however, its semantic contrastive loss relies on features extracted by VGG16, which may limit model generalization and increase computational overhead. MoNCE (Zhan et al., 2022) proposed adaptively reweighting negative samples according to their similarity to the anchor, promoting contrastive learning with more informative negatives. Unlike methods such as SimCLR (Chen et al., 2020) that rely on a large number of negative samples, BYOL (Grill et al., 2020) achieves strong performance without the need for negative samples. Inspired by BYOL, EnCo (Cai et al., 2024) introduces a multistage contrastive loss that enforces similarity constraints in the latent space between patch-level features at corresponding stages of the generator’s encoder and decoder to ensure content consistency. We incorporate this multistage contrastive loss into our framework to enhance content preservation in generated images. However, the feature constraints in EnCo still exhibit limitations in retaining fine-grained details and maintaining global structural consistency.
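To make this kind of constraint concrete, the following is a minimal sketch of a multistage feature-consistency loss in this spirit; the projection heads (`projectors`), the cosine-similarity form, and the stop-gradient on encoder features are our assumptions modeled on BYOL, not EnCo's exact formulation.

```python
import torch.nn.functional as F

def multistage_consistency_loss(enc_feats, dec_feats, projectors):
    """Similarity constraint between encoder/decoder features at
    corresponding generator stages (BYOL-style, positives only)."""
    loss = 0.0
    for f_enc, f_dec, proj in zip(enc_feats, dec_feats, projectors):
        # Encoder features act as fixed targets (stop-gradient): an
        # assumption borrowed from BYOL-style objectives.
        z_enc = F.normalize(proj(f_enc.detach()).flatten(2), dim=1)
        z_dec = F.normalize(proj(f_dec).flatten(2), dim=1)
        # 1 - cosine similarity, averaged over all spatial positions.
        loss = loss + (1.0 - (z_enc * z_dec).sum(dim=1)).mean()
    return loss / len(enc_feats)
```

In practice, `enc_feats` and `dec_feats` would be collected from matching stages of the generator, and each projector could be a 1×1 convolution; all three names are hypothetical.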
Noise Injection
At the theoretical level, previous studies have explored unconditional GANs, demonstrating that introducing noise during training can effectively improve learning stability and mitigate model overfitting. Recently, this strategy has shown promising experimental results in enhancing system robustness and adversarial resistance (Cohen et al., 2019; Lee et al., 2019). However, although this approach has proven effective for classification, it is not directly applicable to I2I translation, as the two tasks differ fundamentally: classification outputs discrete labels, whereas I2I models must synthesize entire images—a task that is considerably more demanding.
In the context of I2I translation, Chrysos et al. (2020) proposed a robust conditional GAN (RoCGAN) that adopts a dual-path generator structure with a shared decoder. RoCGAN demonstrated stable and consistent outputs even under noisy and adversarial conditions in tasks such as face super-resolution and image inpainting.
Nevertheless, the high computational demand and extended training time introduced by the dual-path architecture raise concerns about its scalability to larger generation models, especially in applications requiring high computational efficiency. Moreover, although Jia et al. (2021) and Wang et al. (2021) explored noise injection strategies within GAN frameworks, their experiments were conducted under supervised I2I settings (i.e., using paired training samples), leaving the effectiveness of such techniques in unsupervised I2I scenarios uncertain.
Recent advances in diffusion probabilistic models (Dhariwal & Nichol, 2021; Song et al., 2021) have also attempted to introduce controlled Gaussian noise during the diffusion process, achieving positive results in both supervised (Batzolis et al., 2021) and unsupervised (Su et al., 2022) I2I tasks. Inspired by these findings, we inject Gaussian noise into our GAN-based I2I translation framework, and extensive experiments demonstrate that this strategy significantly enhances model robustness while simultaneously improving the quality of image translation.
Attention Mechanism
The human visual system is capable of rapidly identifying salient regions within complex scenes, which has inspired the incorporation of attention mechanisms in the field of computer vision (Zhao et al., 2025). Essentially, an attention mechanism is a strategy that adaptively adjusts weights based on input features (Bello et al., 2019), allowing the model to focus more on critical regions while suppressing redundant information. Since the introduction of channel attention by SENet (Hu et al., 2018), subsequent studies such as convolutional block attention module (CBAM; Woo et al., 2018) have further extended attention modeling into the spatial dimension, enabling joint channel–spatial attention representations. Numerous follow-up works have explored diverse attention module designs and fusion strategies, and their extended applications in generative models.
In the domain of image generation, the integration of attention mechanisms with GANs has also yielded notable results. The self-attention GAN (Zhang et al., 2019) was the first to introduce self-attention to capture long-range dependencies within images, thereby enhancing the global consistency and structural coherence of generated images. Subsequent approaches embedded attention modules into various components of GANs. For instance, CBAM-GAN (Ma et al., 2019) incorporated the CBAM module to enhance fine-grained feature modeling, while MA-GAN (Jia et al., 2022) combined multi-scale convolutions with channel attention to adaptively adjust residual scales for optimized feature representation.
In the context of I2I translation, an increasing body of research demonstrates that attention mechanisms not only improve semantic consistency and detail preservation, but also enhance the robustness of cross-domain feature modeling. SelectionGAN, proposed by Tang et al. (2019), integrates multi-channel attention and a cascaded semantic guidance module to improve the quality of cross-view image translation. Moreover, it introduces a result selection mechanism to refine the final output. Tang et al. (2019) later proposed AGGAN, which constructs a dual-branch architecture for source and target domains. It leverages attention modules to enhance the model’s focus on salient regions, thereby achieving more accurate content transfer.
U-GAT-IT, developed by Kim et al. (2019), combines attention mechanisms with adaptive normalization strategies, improving the disentanglement of style and structure information. This method demonstrates strong generative performance in unsupervised image translation tasks. RABIT, proposed by Zhan et al. (2022), introduces a dual-level feature alignment structure that aligns low-level details and high-level semantics separately. It also incorporates attention-guided exemplar learning, making it suitable for high-resolution scenarios where both style preservation and semantic consistency are crucial.
To address the limitation of conventional convolutions in capturing long-range dependencies, MixerGAN (Cazenavette & De Guevara, 2021) employs a multilayer perceptron (MLP)-Mixer architecture to facilitate global interaction between image patches, improving contextual modeling while balancing efficiency and performance. Building on this idea, CWT-GAN (Lai et al., 2021) introduces a cross-model weight transfer mechanism that dynamically transfers discriminative features learned by the discriminator to the generator. By integrating a residual attention mechanism and class activation maps, this approach significantly enhances translation quality between domains with large structural differences.
It is also worth noting that spatial attention mechanisms, in addition to channel attention, have proven effective in image translation. By modeling spatial relationships among features, spatial attention automatically isolates key regions within an image and has been successfully integrated into various GAN architectures. For example, the bidirectional attention GAN proposed by Yang et al. (2021) effectively models spatial dependencies in traffic flow images, while self-attention modules have also been used to detect latent structural defects in input data (Ali & Cha, 2022).
In summary, due to their strong transferability and representational power, attention mechanisms have become an indispensable component in I2I translation tasks. Ongoing research continues to explore novel attention designs and integration strategies, further advancing the quality and reliability of image translation models.
Method
Overall Architecture
Given an input image from the source domain, the generator translates it into the target domain, while an identity mapping loss encourages the network to reconstruct inputs that already belong to the target domain. The overall pipeline, including the CSDAM, the FFRB, and Gaussian noise injection, is illustrated in Figure 1.

Figure 1. Overall Framework of Our Proposed RFA-GAN: A Cat Image Is First Reconstructed as a Cat Image Through the Identity Mapping Loss, While It Is Also Translated Into a Dog Image by the Generator. During the Generation Process, We Incorporate the Channel Shuffle Dual-Attention Module, the Focal-Frequency Residual Block, and Gaussian Noise Injection to Enhance the Quality and Robustness of Image Translation.
To enhance the feature representation capability of the generator in I2I translation tasks, we propose a novel CSDAM, as illustrated in Figure 2. This module is designed to simultaneously capture dependencies across channels and spatial regions, enabling adaptive and dynamic modulation of diverse feature responses. CSDAM consists of three main components: a channel attention submodule, a channel shuffle operation, and a spatial attention submodule. In the channel attention submodule, given an input feature map, the feature dimensions are first permuted and the reordered features are passed through an MLP to model inter-channel dependencies.

Figure 2. Detailed Architecture of the Proposed CSDAM: The Module First Applies Channel Attention by Permuting and Reordering Features Through an MLP, Followed by a Channel Shuffle Operation to Enhance Inter-Channel Interaction. Subsequently, Spatial Attention Is Employed Using Convolutional Layers to Capture Spatial Dependencies. The Combination of Channel and Spatial Attention Enables CSDAM to Adaptively Emphasize Informative Features and Suppress Irrelevant Ones, Thereby Generating Refined Output Features. Note. CSDAM = channel shuffle dual attention module; MLP = multilayer perceptron.
Here, a sigmoid activation normalizes the MLP output into channel attention weights, which are applied to the input features through element-wise multiplication.
After the channel attention operation, to facilitate cross-group information flow and further alleviate potential redundancy in convolutional layers, we introduce a channel shuffle mechanism. The enhanced feature map is first divided into eight groups, with each group containing (c/8) channels. A transpose operation is then applied to shuffle the channel order across different groups. After shuffling, the feature map is reshaped back to its original dimensions. This operation effectively encourages semantic feature mixing across channels, thereby enhancing the model’s capacity to represent complex feature interactions; in effect, the channel shuffle reduces to a reshape-transpose-reshape procedure.
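As a concrete illustration, a minimal PyTorch sketch of this reshape-transpose-reshape procedure with the eight groups described above:

```python
import torch

def channel_shuffle(x: torch.Tensor, groups: int = 8) -> torch.Tensor:
    """Reshape-transpose-reshape channel shuffle, as described above."""
    b, c, h, w = x.shape
    assert c % groups == 0, "channels must divide evenly into groups"
    x = x.view(b, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # interleave channels across groups
    return x.view(b, c, h, w)                 # restore the original layout
```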
After the channel information is remixed, a spatial attention submodule is designed to further enhance the model’s ability to capture local spatial semantic information. Specifically, two convolutional layers aggregate contextual information across spatial positions to produce a spatial attention map, which is applied to the shuffled features through element-wise multiplication.
To highlight the distinctiveness of CSDAM, we compare it with three representative attention or feature enhancement modules—SENet (Hu et al., 2018), CBAM (Woo et al., 2018), and ShuffleNet (Zhang et al., 2018)—and analyze their applicability to I2I translation tasks. SENet generates channel attention weights via global average pooling (GAP), effectively enhancing channel discriminability. However, this approach compresses spatial dimensions, discarding fine-grained spatial details that are crucial for tasks requiring local fidelity (e.g., fur textures in cat-to-dog translation). Moreover, the absence of interaction between channel and spatial attention can lead to misalignment between salient channels and their corresponding spatial regions, undermining local consistency. CBAM models both channel and spatial attention but adopts a sequential fusion strategy, which limits its ability to capture joint channel–spatial dependencies. ShuffleNet reduces computational costs through group convolution and employs channel shuffling to promote information exchange between groups. Yet, its fixed, input-agnostic shuffling strategy lacks adaptability, making it less effective for dynamically enhancing semantic regions in high-fidelity generation tasks.
In contrast, CSDAM employs a dual-attention mechanism with adaptive channel shuffling to jointly and dynamically enhance channel and spatial representations. The channel attention branch leverages dimension permutation and MLP-based modeling to capture complex inter-channel dependencies, while the adaptive shuffling strategy, guided by learned attention weights, facilitates content-aware feature fusion and redundancy suppression. The spatial attention branch utilizes a lightweight convolutional structure to model spatial dependencies, so that salient regions are emphasized in concert with the recalibrated channels rather than in isolation.
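Putting the three components together, the following sketch illustrates one plausible CSDAM forward pass (reusing the channel_shuffle helper above); the MLP reduction ratio, the 7×7 kernels, and the normalization choice are our assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

class CSDAMSketch(nn.Module):
    """Illustrative CSDAM data flow: channel attention via permute + MLP,
    then channel shuffle, then convolutional spatial attention."""

    def __init__(self, channels: int, reduction: int = 4, groups: int = 8):
        super().__init__()
        self.groups = groups
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 7, padding=3),
            nn.InstanceNorm2d(channels // reduction),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 7, padding=3),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel attention: permute to (B, H, W, C) so the MLP mixes
        # information across channels at every spatial position.
        attn = self.channel_mlp(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)
        x = x * torch.sigmoid(attn)
        x = channel_shuffle(x, self.groups)     # cross-group feature mixing
        x = x * torch.sigmoid(self.spatial(x))  # spatial attention map
        return x
```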
To enhance the robustness of the I2I translation model against real-world noise interference, we introduce isotropic Gaussian noise into the input images during the training phase. Unlike conventional approaches that address noise through denoising modules applied only during inference or as a separate post-processing step, our method incorporates noise directly into the learning process. This allows the model to be exposed to and adaptively learn from noisy inputs throughout training, thereby encouraging it to capture more resilient and generalizable feature representations. As a result, the model maintains high-fidelity translation performance even under challenging, noise-contaminated conditions, which is particularly important for practical applications where input images are often degraded by various forms of environmental noise. Specifically, let x denote a clean input image; the network is trained on the perturbed input x + n, where the noise n is sampled from an isotropic Gaussian distribution N(0, σ²I).
The noise injection operation is applied only during training, ensuring computational efficiency and real-time performance during inference. In our experiments, the noise variance σ² was selected according to the ablation reported in Table 3.
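A minimal sketch of this training-only perturbation follows, assuming images normalized to [-1, 1]; the clamping range and the σ value are assumptions, with the variance chosen in the paper via the ablation in Table 3.

```python
import torch

def inject_gaussian_noise(x: torch.Tensor, sigma: float,
                          training: bool = True) -> torch.Tensor:
    """Add isotropic Gaussian noise to input images during training only."""
    if not training:
        return x                                # inference path unchanged
    # Sample n ~ N(0, sigma^2 I) and keep inputs in a valid range (assumption).
    return (x + sigma * torch.randn_like(x)).clamp(-1.0, 1.0)
```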
In this study, we propose a novel residual module called FFRB, as shown in Figure 3, which is designed to enhance the performance of image translation tasks such as cat-to-dog translation. Building upon the traditional residual block structure, FFRB adopts a two-step feature enhancement strategy.

Figure 3. Diagram of the FFRB: The Module Adopts FCA to Emphasize Informative Frequency Components While Suppressing Redundancy, and Integrates FocalNets’ Focal Modulation to Aggregate Multi-Scale Context, with Residual Connections Preserving Spatial Details and Stabilizing Training. Note. FFRB = focal-frequency residual block; FCA = frequency channel attention.
In the first stage of our model, we employ a combination of a convolutional submodule and a frequency channel attention (FCA) module to enhance feature extraction and channel-wise information modeling. Initially, the input image passes through a two-layer convolutional block for preliminary feature extraction. The first convolutional layer, coupled with normalization and ReLU activation, is designed to capture low-level features such as edges and textures. The second convolutional layer further enhances the semantic representations. To enrich the channel representation and leverage frequency-domain cues critical in image translation tasks, we incorporate the FCA module, proposed in frequency channel attention network (FcaNet), immediately after the second convolutional layer. While conventional channel attention mechanisms typically rely on GAP to compress features into scalars, GAP only retains the lowest frequency components, leading to loss of valuable high-frequency information. This limitation often results in blurred textures or missing details in image translation tasks. The FCA module addresses this issue by modeling channel attention in the frequency domain using the discrete cosine transform (DCT). It splits the input channels into several groups and applies DCT to extract different frequency components across groups, thereby capturing richer and more diverse channel features. This multi-spectral approach allows the model to better preserve fine-grained details, textures, and structures in the translated images. By integrating FCA (Qin et al., 2021), our model gains enhanced capability to distinguish and emphasize informative channels, leading to improved visual quality and reduced perceptual distortion.
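To make the mechanism concrete, the following is a minimal PyTorch sketch of DCT-based multi-spectral channel attention in the spirit of FcaNet; the specific frequency pairs, the reduction ratio, and the fixed spatial size at construction time are our assumptions, not settings from the paper.

```python
import math
import torch
import torch.nn as nn

def dct_filter(u: int, v: int, h: int, w: int) -> torch.Tensor:
    """2D DCT basis for frequency indices (u, v) on an h x w grid."""
    ys = torch.cos(math.pi * u * (torch.arange(h) + 0.5) / h)
    xs = torch.cos(math.pi * v * (torch.arange(w) + 0.5) / w)
    return ys[:, None] * xs[None, :]  # (h, w)

class FCASketch(nn.Module):
    """Multi-spectral channel attention: channel groups are pooled with
    different DCT bases instead of plain GAP, so higher-frequency cues
    survive the pooling step."""

    def __init__(self, channels: int, h: int, w: int,
                 freqs=((0, 0), (0, 1), (1, 0), (1, 1)), reduction: int = 16):
        super().__init__()
        assert channels % len(freqs) == 0
        basis = torch.stack([dct_filter(u, v, h, w) for u, v in freqs])
        self.register_buffer("basis", basis)    # (G, h, w), fixed filters
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape                    # h, w must match construction
        g = self.basis.shape[0]
        xg = x.view(b, g, c // g, h, w)
        # Frequency pooling: DCT-weighted spatial sum per channel group.
        pooled = (xg * self.basis[None, :, None]).sum(dim=(-1, -2)).view(b, c)
        weights = self.fc(pooled)               # per-channel attention weights
        return x * weights.view(b, c, 1, 1)
```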
After frequency modulation, we integrate the FocalNets module into the proposed residual block, FFRB, to enable multi-scale local context modeling. Specifically, FocalNets (Yang et al., 2022) employs a hierarchical structure of depthwise separable convolutions to encode features across varying receptive fields, thereby capturing contextual information ranging from local to global scales. On top of this, a dynamic gating mechanism is introduced to adaptively assign content-aware weights to each contextual level based on the query location. These multi-level contexts are then effectively fused into the current feature representation through an element-wise modulation process. The incorporation of this module enhances the model’s ability to represent fine-grained local structures while preserving sensitivity to long-range dependencies. This is particularly advantageous for image-to-image translation tasks, where precise rendering of textures, edges, and other spatial details is crucial. The overall operation of the FFRB can be summarized as y = x + Focal(FCA(ConvBlock(x))), where x denotes the input feature map and y the output of the block.
The component ConvBlock(·) denotes the initial two-layer convolutional block, FCA(·) the frequency channel attention module, and Focal(·) the focal modulation operation; the residual term x preserves the original features and stabilizes training.
Finally, FFRB fuses the features processed by FocalNets with the original input features through a residual connection, leading to richer and more expressive feature representations. This two-stage feature enhancement design takes full advantage of the complementary strengths of global frequency information and multi-scale local detail, which not only preserves the expressiveness of the original features but also significantly improves the capture of fine-grained information. Experimental results demonstrate that the FFRB module, while maintaining high computational efficiency, significantly improves the visual quality and detail fidelity of generated images, providing stronger feature support for image translation tasks.
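As a structural illustration of this design, the sketch below wires the pieces together in the order described (ConvBlock, FCA, focal modulation, residual sum); it reuses the FCASketch module above, the two-level gated depthwise convolutions are a simplified stand-in for FocalNets' focal modulation rather than the original implementation, and InstanceNorm is an assumption (the paper says only "normalization").

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FFRBSketch(nn.Module):
    """Structural sketch of the FFRB data flow:
    y = x + Focal(FCA(ConvBlock(x)))."""

    def __init__(self, channels: int, h: int, w: int):
        super().__init__()
        self.conv_block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels),
        )
        self.fca = FCASketch(channels, h, w)
        # Hierarchical depthwise convolutions give growing receptive fields.
        self.ctx1 = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)
        self.ctx2 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.gates = nn.Conv2d(channels, 2, kernel_size=1)  # per-level weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.fca(self.conv_block(x))
        c1 = F.gelu(self.ctx1(y))                 # local context
        c2 = F.gelu(self.ctx2(c1))                # broader context
        g = torch.softmax(self.gates(y), dim=1)   # content-aware gating
        modulator = g[:, 0:1] * c1 + g[:, 1:2] * c2
        return x + y * modulator                  # residual connection
```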
Experiment Setup
Comparison With Other Methods
Table 1 presents the quantitative comparison between our proposed RFA-GAN and several baseline methods on the Cat→Dog, Cityscapes, and Van Gogh→Photo datasets.
Table 1. Quantitative Comparison With Other Baseline Methods.
Note. FID = Fréchet inception distance; mAP = mean average precision; pixAcc = pixel-wise accuracy; classAcc = class accuracy. Bolded values indicate the best results achieved in our experiments.
As shown in Figure 4, we present the image translation results of our method on the Cat→Dog dataset, alongside the outputs of all baseline methods.

Figure 4. Visual Results Comparison with All Baselines on the Cat→Dog Dataset.
Compared with other methods, our model achieves better performance in terms of completeness, clarity, and realism of the generated results. Specifically, it produces more accurate structural reconstructions in key facial regions such as the ears and mouth, with clearer contours and more natural shapes. This effectively avoids common issues observed in other methods, such as structural distortion or missing parts. Furthermore, our model excels in restoring fine-grained fur textures, resulting in synthesized images that are more visually faithful to real animal appearances and significantly enhancing perceptual image quality.
As illustrated in Figure 5, we present the visual results of our proposed method on the Cityscapes dataset, along with comparisons against several representative baseline models. The results clearly demonstrate that our model, guided by semantic labels, can generate high-quality images containing various typical urban scene elements such as cars, buildings, and trees. Moreover, it achieves superior performance in both detail fidelity and overall image sharpness compared to existing approaches.

Figure 5. Visual Results Comparison with All Baselines on the Cityscapes Dataset: Red Boxes Highlight Key Regions Such as Vehicles and Trees, to Facilitate Comparison of Different Methods in Terms of Edge Sharpness, Detail Realism, and Semantic Consistency. Our Method Produces Sharper and Structurally More Coherent Details While Maintaining Accurate Semantic Alignment with the Input.
In particular, our method exhibits stronger generation capability for complex structural elements, such as vehicles. While other methods often suffer from issues such as structural fragmentation, blurred edges, or missing textures, our model produces cars with clearer contours, more complete shapes, and richer texture details. These characteristics result in a perceptual quality that is closer to that of real-world images. This indicates that our method strikes a more effective balance between structural preservation and style translation.
Figure 6 presents the visual comparison results on the Van Gogh→Photo style transfer task.

Figure 6. Visual Results Comparison with All Baselines on the Van Gogh→Photo Dataset.
Compared to existing approaches, our method excels at capturing fine-grained, style-relevant features—such as shading, color tone, and spatial consistency—that are crucial for high-quality style transfer. In particular, the generated images from our model exhibit clearer textures, more realistic lighting, and well-preserved scene structures, whereas other methods often produce artifacts, overly smoothed regions, or distorted compositions. These results highlight the effectiveness of our model in achieving a better balance between structural fidelity and style adaptation, thereby validating its robustness and superiority in cross-domain image translation tasks.
Through comparative experiments, our method outperforms all baseline models. In the proposed RFA-GAN framework, the generator integrates the CSDAM and FFRB modules, and Gaussian noise is injected into the input images during training. To evaluate the impact of the number of channel shuffle groups in the CSDAM module on model performance, we conducted an ablation study on the Cat→Dog dataset, with results reported in Table 2.

Figure 7. Qualitative Ablation Results: The Leftmost Column Shows the Input Images, While the Remaining Columns Present the Translated Outputs from Models A to G.
Table 2. Ablation Experiment on the Number of Channel Shuffle Groups in the Cat→Dog Task.
Note. FID = Fréchet inception distance.
Table 3. FID Scores Under Varying Gaussian Noise Variances During Training on the Cat→Dog Dataset.
Note. FID = Fréchet inception distance.
Table 4. Quantitative Results for Ablation Study.
Note. CSDAM = channel shuffle dual attention module; FFRB = focal-frequency residual block; FID = Fréchet inception distance. Bolded values indicate the best results achieved in our experiments.
As shown in the ablation results in Table 2, when the number of channel shuffle groups is set to 8, the model achieves the lowest FID score, indicating the best image generation quality.
As shown in Table 4, Model A introduces the CSDAM module into the generator; Model B adds Gaussian noise to the image before feeding it into the generator; and Model C replaces the traditional ResNet residual blocks with the proposed FFRB module. Compared with EnCo, all three models (A, B, and C) demonstrate improved performance, validating the effectiveness of each design component. Model D, which combines the CSDAM module with noise injection, further enhances performance relative to Models A and B, indicating that CSDAM not only strengthens feature representation but also improves robustness against noise. Model E outperforms both Models B and C, showing superior adaptability in capturing local details while resisting noise or distortion. Model F integrates both CSDAM and FFRB, surpassing the individual performance of Models A and C, which confirms the complementary effects of the two modules in balancing local detail restoration and global style consistency. Model G integrates all components and outperforms all comparative models, demonstrating the overall superiority of the proposed approach.
To clarify the independent contributions of each subcomponent within CSDAM, we conducted a more fine-grained ablation study. As illustrated in Table 5, removing any subcomponent from CSDAM consistently leads to performance degradation, confirming its necessity. In particular, eliminating the channel attention branch results in the most significant decline, underscoring its critical role in selective feature modeling for preserving content fidelity and style consistency. Similarly, the removal of spatial attention causes a considerable drop, highlighting its importance in capturing fine-grained structural details. The exclusion of the channel shuffle operation also leads to moderate degradation, suggesting that cross-channel interaction is beneficial for reducing redundancy and enhancing feature expressiveness. Overall, the full CSDAM configuration consistently achieves the best performance across all evaluation metrics, further validating that the synergy among its subcomponents is essential to the effectiveness of the module.
Table 5. Ablation Study on CSDAM Subcomponents.
Note. CSDAM = channel shuffle dual attention module; FID = Fréchet inception distance.
To validate the effectiveness of the proposed CSDAM module, we conducted comparative experiments by replacing CSDAM in the encoder and decoder of the RFA-GAN framework with representative attention mechanisms, including CBAM (Woo et al., 2018) and efficient multi-scale attention (EMA; Ouyang et al., 2023). The quantitative results, summarized in Table 6, demonstrate that CSDAM outperforms these alternatives under the same experimental settings, confirming its effectiveness in enhancing feature representation for I2I translation tasks.
Table 6. Quantitative Results for Ablations Compared With Other Attention Mechanisms. +CBAM and +EMA Indicate Replacing CSDAM in Our Generator With the Corresponding Attention Modules.
Note. CSDAM = channel shuffle dual attention module; FID = Fréchet inception distance; EMA = efficient multi-scale attention; CBAM = convolutional block attention module.
Additionally, we visualized the attention maps produced by CSDAM, CBAM, and EMA for both cat and dog images to provide a more comprehensive comparison. As illustrated in Figure 8, our CSDAM module exhibits a sharper and more semantically aligned focus on discriminative regions, which facilitates better preservation of structural consistency and style details in the I2I translation process.

Figure 8. Visualization Comparison of Attention Maps Produced by Different Attention Mechanisms (CSDAM, EMA, and CBAM) on Cat and Dog Images: (A) Attention Maps for Cats; (B) Attention Maps for Dogs. Note. CSDAM = channel shuffle dual attention module; EMA = efficient multi-scale attention; CBAM = convolutional block attention module.
However, despite CSDAM’s strong performance in most scenarios, we also observed its potential limitations. In extreme cases of highly complex image translation tasks (e.g., Van Gogh→Photo scenes with dense, highly textured content), translation quality can degrade; representative failure cases are shown in Figure 9.

Figure 9. Extreme Cases in Highly Complex Image Translation Tasks.
In the FFRB, we integrate FcaNet and the focal modulation module (FocalNets). Ablation studies on the Cat→Dog task, summarized in Table 7, examine the contribution of each of these key components.
Table 7. Ablation Study on the Key Components of FFRB.
Note. FFRB = focal-frequency residual block; FID = Fréchet inception distance.
To further evaluate the practical effectiveness of our model in unpaired I2I translation tasks, particularly in terms of human visual perception, we conducted a user study focused on subjective visual quality assessment. This experiment aimed to assess the perceptual realism and visual quality of the generated images based on feedback from human observers.
A total of 30 volunteers with basic image discrimination capabilities were recruited for this study. All participants were capable of evaluating image quality from a perceptual standpoint. To ensure the representativeness and fairness of the evaluation, we selected three representative datasets and randomly sampled 20 images from each dataset. These images were translated using our proposed model (RFA-GAN) as well as several mainstream baseline models. The generated images corresponding to each original input were presented to the participants in a randomized order to prevent any model-related bias during the evaluation.
Participants were instructed to rank the translated results of each image set based on criteria such as image clarity, detail preservation, style consistency, and overall visual realism. The ranking was performed from “most consistent with real-world perception” to “least consistent.”
As shown in Figure 10, the results indicate that our proposed RFA-GAN consistently received higher user ratings across most samples. Specifically, RFA-GAN was ranked first in 65% of the image sets evaluated by participants, significantly outperforming other baseline methods. These findings demonstrate the clear advantage of our model in terms of perceived visual quality in unpaired image translation tasks.

Figure 10. User Study Results: We Aggregated and Computed the Proportional Rankings of User Preferences Across Models, and Visualized the Results to Assess Performance. The Horizontal Axis (X-Axis) Represents the Ranking Percentage (%), While the Vertical Axis (Y-Axis) Lists the Evaluated Models.
Although RFA-GAN achieves competitive performance in various I2I translation tasks, its computational complexity remains a limitation. The integration of dual attention and frequency-aware modules increases model overhead, potentially restricting deployment in resource-constrained or real-time applications.
For future work, we plan to investigate lightweight architectures and model compression techniques—such as neural pruning and knowledge distillation—to improve inference efficiency. Furthermore, extending RFA-GAN to high-resolution video translation and specialized domains, such as medical image synthesis, represents a promising direction for broader impact.
Conclusion
This paper proposes a novel I2I translation framework, RFA-GAN, which integrates noise injection and attention mechanisms to effectively alleviate the problem of detail loss during the translation process. To enhance the model’s ability to capture fine-grained features, we design the CSDAM. A novel FFRB is proposed to overcome the limitations of traditional residual blocks in preserving texture and detail features, achieving enhanced reconstruction fidelity for target domain images. Furthermore, we incorporate Gaussian noise at the input stage to improve the model’s robustness and generalization capability. Extensive experiments on multiple mainstream datasets demonstrate that the proposed method significantly outperforms existing state-of-the-art methods in terms of image quality and detail preservation. Ablation studies further validate the effectiveness of each module design, and comparisons with various mainstream attention mechanisms highlight the superior performance of CSDAM. We hope this research provides new insights and valuable references for unpaired I2I translation tasks.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Scientific Research Project of Jilin Provincial Education Department (grant no. JJKH20250520KJ), the Natural Science Foundation of Jilin Province (grant no. 20230101179JC), and the National Natural Science Foundation of China (grant no. 61702051).
Declaration of competing interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
