Abstract
Arbitrary style transfer is attracting increasing attention due to its wide application potential. Existing approaches either directly fuse deep style features with deep content features or adaptively normalize content features to achieve global statistical matching. Although these approaches show some success, they frequently produce artifacts and messy textures, which primarily stems from insufficient exploration of the semantic distribution of style image features and an ineffective capture of long-range dependencies. This paper presents a Dual-Domain Style Transfer Network that incorporates Adaptive Normalization with Style Semantics Awareness and Global Style Texture Enhancement. The former extracts more style semantic information through a self-attention mechanism and adaptive normalization to reduce artifacts, while the latter enhances global stylistic information in the frequency domain to suppress cluttered textures. On the MS-COCO and Wikiart datasets, compared to other state-of-the-art methods, our Learned Perceptual Image Patch Similarity, Structural Similarity Index, and content loss metrics achieved the best scores of 0.616, 0.467, and 2.31, respectively, while the style loss metric achieved the second-best score of 3.08.
Introduction
The goal of arbitrary style transfer is to re-render the content of a source image with the visual elements of a style image. Traditional style transfer methods generate stylized images through stroke-based rendering, image analogy, and texture synthesis.1,2 Because they rely on low-level features, these methods frequently struggle to capture image structure effectively. Recently, Gatys et al.3,4 innovatively accomplished this task by extracting high-level semantic features with pretrained convolutional neural networks (CNNs). To improve efficiency, Johnson et al. 7 used a feedforward network to generate stylized images directly. Subsequent studies have shown excellent performance in terms of efficiency,5–7 quality,8,9 generalization,10–14 diversity,15,16 and user controllability,17–19 but they still suffer from artifacts and texture clutter.
Arbitrary style transfer methods can be categorized into two main types: nonattention-based and attention-based methods. Typical representatives of the former11,13,20,21 transform content features to match the mean and variance of style features globally, without considering local details, which makes it difficult to balance global and local style patterns effectively; as a result, they may introduce artifacts and messy style textures into the content target. Typical examples of the latter8,22,23 exploit the semantic correspondence between local regions of images and integrate style features locally into content features. Attention-based methods have proved effective in style transfer by generating more local style details. Unfortunately, while improving performance, these methods fail to address the problem of artifacts and messy textures.
Images are composed of content elements and style elements. Existing methods cannot effectively distinguish the style elements of the style image from its content elements. Most of them use an attention mechanism to directly fuse deep style features with deep content features, while ignoring the style semantic features of the style image, such as lines, color distributions, and texture patterns. This can cause evident artifacts in the stylized images; for example, when a portrait is rendered in a pencil-sketch style, the eyes, a content element of the style image, are transferred into the result (as shown in the first row of Figure 1). In addition, the generation of messy textures can be attributed to the fact that spatial-domain convolution has only a local receptive field and lacks the ability to capture long-range dependencies. 24 This limitation prevents the network from capturing global style texture patterns with periodic features, which in turn results in messy textures in the generated images.

Comparison with other state-of-the-art (SOTA) methods. The first and second columns show the content and style images. The remaining columns show the results generated by our method and the other SOTA methods.
Inspired by the above analysis, we propose a Dual-Domain Style Transfer Network (DDSTNet), which consists of Adaptive Normalization with Style Semantics Awareness (ANSSA) and Global Style Texture Enhancement (GSTE). The ANSSA module learns the style semantic features of the style image through a fusion of attention mechanisms and normalization operations and effectively transfers these style features to the content image, thereby reducing artifacts. Meanwhile, the GSTE module converts the learned feature representation to the frequency domain to capture the global style information of the image, which effectively reduces cluttered textures in the generated image. The main contributions of our DDSTNet can be summarized as follows:
(1) We propose a novel ANSSA module designed for style transfer. This module effectively reduces artifacts in the synthesized images by incorporating the style semantic features from the style image, and it appropriately normalizes the content features when calculating the attention scores.
(2) We introduce a frequency domain operation module, GSTE. This module improves the ability to capture long-range dependencies and has a natural advantage in capturing periodic textures in the style image, effectively suppressing the generation of cluttered textures.
(3) Through extensive experiments and comparisons with state-of-the-art (SOTA) methods, we fully validate the effectiveness and superiority of our proposed approach.
Related works
Arbitrary style transfer
In recent work, Gatys et al.3,4 made significant advancements in image style transfer. They employ the Gram matrix to represent style features and iteratively optimize a joint content and style loss in the feature space of a pretrained deep neural network to achieve remarkable stylization effects. Moreover, Johnson et al. 7 generated stylized images directly through a feedforward network trained with perceptual losses, thereby achieving real-time style transfer. Related research5,6 further optimized feedforward style transfer architectures, enhancing the effect of texture synthesis. However, each learned model adapts to only a specific style, and retraining for new styles is time-consuming. Therefore, research has focused on arbitrary style transfer, in which a single model generates a stylized image from any content image and any style image.
To achieve arbitrary style transfer, Huang et al. 13 aligned the mean and variance of the content image features with the characteristics of the style image. Jing et al. 14 innovatively encoded the style image as learnable convolution parameters and extended AdaIN by dynamic instance normalization. Li et al. 11 achieved a direct match of content feature statistics with style image statistics in the deep feature space with feature transformation, specifically the whitening and coloring processes. In addition, Li et al. 12 used covariance to perform a linear transformation, aligning the second-order statistics between the fused features and the style features. Wu et al. 25 used contrastive learning and covariance transformation for style transfer. An et al. 26 effectively avoided content leakage with the strategy of reversible neural flow, while optimizing the stylization method. However, these strategies fail to achieve an ideal balance between the global and the local, resulting in a loss of detailed information in the stylized image.
With the wide application of attention mechanisms, strategies based on this mechanism are widely used in style transfer. Park et al. 8 proposed a novel method, SANet, which matches the style features that are most semantically similar to the content features, thereby effectively fusing global and local style features. Deng et al. 27 further fused coattention and self-attention to incorporate style patterns into content features. Chen et al. 28 used SANet as the base network and introduced contrastive learning into style transfer for the first time, proposing a novel internal-external style transfer strategy. Liu et al. 29 combined the deep and shallow features of an image to propose an adaptive attention normalization module. This module normalizes the content features so that the local statistics of the content features match the statistics of the weighted style features. Luo et al. 22 proposed a progressive attention manifold alignment technique, which dynamically reorganizes the style features based on the spatial distribution of the content features. Deng et al. 30 used a Transformer model instead of a traditional CNN to extract long-range dependencies. Wang et al. 31 proposed an aesthetic enhancement strategy to enhance the stylization effect through adversarial learning. Wen et al. 32 proposed a reversible residual network and an unbiased linear transform to preserve pixel and feature affinity. Chung et al. 33 proposed a method for adapting pretrained large-scale diffusion models for style transfer in a training-free way. Although attention-based methods have achieved promising results, these methods fail to fully utilize the attention mechanism to mine the semantic information between style features.
Operations in the spatial domain directly manipulate the pixels and local regions of an image, enabling precise control over details; this enhances local consistency and prevents overly abstract or unnatural effects during style transfer. Consequently, most methods perform style transfer in the spatial domain and often overlook the potential advantages of frequency domain processing. Style transfer in the frequency domain not only effectively separates low-frequency and high-frequency components, enhancing the model's ability to capture long-range dependencies and allowing precise control over the fusion of style and details, but also has a natural advantage in handling periodic texture patterns, helping to suppress messy textures. Compared with existing methods, we focus on the style semantic features and content semantic features of style images, enhancing the model's ability to distinguish content from style information in the spatial domain; in the frequency domain, we emphasize global style consistency and enhance the model's ability to learn complex textures.
Frequency analysis in deep learning
Frequency domain information is widely used in computer vision tasks due to its large receptive field and its separation of high and low frequencies. For example, in deraining, Fu et al. 34 separated the low-frequency part of the image from the high-frequency part and focused on high-frequency information during training, using prior image domain knowledge to encourage the model to learn the structure of rain. Cao et al. 35 applied frequency domain information to the image harmonization task, aiming to improve the quality of the generated images. Meanwhile, Li et al. 36 and Kwon et al. 37 used the frequency domain to separate content and style in their style transfer research in order to generate high-quality stylized images. In this study, we introduce frequency domain information to enhance the network's ability to capture long-range dependencies and to recognize textures and patterns with periodicity.
Methods
Overall architecture
Figure 2(a) presents an overview of our network. Given a content image and a style image, two pretrained encoders extract their multilevel features; the ANSSA modules fuse the style semantics into the content features, the GSTE modules enhance the global style textures in the frequency domain, and the decoder generates the stylized image, with a discriminator providing adversarial supervision.

(a) Overview of our proposed Dual-Domain Style Transfer Network (DDSTNet) consists mainly of two pretrained encoders, two Adaptive Normalization with Style Semantics Awareness (ANSSA) modules, two Global Style Texture Enhancement (GSTE) modules, a decoder, and a discriminator. The content loss (
Adaptive normalization with style semantics awareness
Feature transformation modules play a central role in synthesizing content and style features. The innovative AdaIN 13 focuses on the overall style distribution and adjusts the content features to match the global distribution of the style features. SANet, on the other hand, analyzes local style patterns, extracts an attention map from the style and content features, and then uses this attention map to modulate the style features and fuse the modulated attention output with the content features. AdaAttN 29 achieves an adaptive per-pixel transfer of feature distributions by considering both low-level and high-level features and incorporating an attention mechanism. Although AdaAttN performs well in reconciling local and global stylization, it fails to fully exploit the semantic features of the style image. Inspired by SANet and AdaAttN, we propose ANSSA, as shown in Figure 3. This module consists of Style Semantics Awareness and Adaptive Normalization. The former applies self-attention operations to the style image to enhance style patterns and extract richer semantic information. The latter utilizes both low-level and high-level features of the image to perform adaptive normalization on the content image, enabling an adaptive transfer of feature distributions at each pixel.

(a) The details of style semantics awareness. (b) The details of adaptive normalization.
Style semantics awareness
Self-attention mechanisms in style transfer can capture global feature relationships, enhance style semantic information, and preserve more detailed information. To mine the semantic information of the style image, we compute the style semantics awareness feature map in the deep layers and enhance the style pattern. Inspired by prior work, 31 we note that the inner product between channels of the vectorized features can represent the global style, so channel attention can capture the global style well. The enhanced style features are then locally integrated according to the semantic space of the style features. As shown in Figure 3(a), the transformation and vectorized features of the style feature
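To make the channel-wise operation above concrete, the following is a minimal sketch of channel self-attention over vectorized style features; the learned projections and the exact layout of ANSSA are omitted, and all names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def channel_self_attention(style_feat):
    """Sketch: enhance a style feature map with channel-wise self-attention."""
    b, c, h, w = style_feat.shape
    f = style_feat.reshape(b, c, -1)                   # vectorize: B x C x (H*W)
    affinity = torch.bmm(f, f.transpose(1, 2))         # B x C x C inner products between channels
    attn = F.softmax(affinity, dim=-1)                 # channel attention weights
    enhanced = torch.bmm(attn, f).reshape(b, c, h, w)  # re-weight the vectorized features
    return enhanced + style_feat                       # residual enhancement of the style pattern
```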
Adaptive normalization
AdaIN normalizes the features of the content image and then adjusts them with the mean and standard deviation of the style image, injecting the style image's statistical properties (such as color and texture) into the content image to achieve style transfer. Similarly, the attention output can be viewed as treating each target style feature point as a weighted distribution over all style feature points. Applying the attention score matrix A to the style features
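For reference, a minimal sketch of the standard AdaIN operation described above is given below. This is the textbook formulation, not the paper's full adaptive normalization step, which additionally uses attention-weighted statistics.

```python
import torch

def adain(content_feat, style_feat, eps=1e-5):
    """Normalize content features, then rescale/shift them with style statistics."""
    b, c = content_feat.shape[:2]
    c_flat = content_feat.reshape(b, c, -1)
    s_flat = style_feat.reshape(b, c, -1)
    c_mean = c_flat.mean(dim=2, keepdim=True)
    c_std = c_flat.std(dim=2, keepdim=True) + eps
    s_mean = s_flat.mean(dim=2, keepdim=True)
    s_std = s_flat.std(dim=2, keepdim=True)
    normalized = (c_flat - c_mean) / c_std              # whiten the content statistics
    return (normalized * s_std + s_mean).reshape(content_feat.shape)  # inject style mean/std
```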
Global style texture enhancement
To enhance the network's ability to capture long-range dependencies and more effectively capture the periodic textures and patterns present in the style image, we introduce the GSTE module, as shown in Figure 4. Specifically, we use the fast Fourier transform (FFT) to convert the stylized feature representation from the spatial domain to the frequency domain. To accelerate training, improve network stability, and prevent the loss of original structural information, multiple residual connections are introduced in this module. In the residual block, the input is added to the output feature map through residual connections, while the learned frequency domain features effectively help adjust the final stylized result. Finally, the frequency domain features are converted back to the spatial domain by an inverse FFT (IFFT), as follows:

The structure of Global Style Texture Enhancement (GSTE) module.
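A minimal sketch of a frequency-domain block of this kind is shown below, assuming a PyTorch implementation; the channel sizes, layer counts, and exact layout are illustrative rather than the paper's design.

```python
import torch
import torch.nn as nn

class FrequencyEnhanceBlock(nn.Module):
    """Sketch of a GSTE-style block: FFT -> learnable frequency filtering -> IFFT,
    with residual connections in both domains."""
    def __init__(self, channels=512):
        super().__init__()
        # 1x1 convolutions applied to the concatenated real/imaginary parts.
        self.freq_conv = nn.Sequential(
            nn.Conv2d(channels * 2, channels * 2, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels * 2, channels * 2, kernel_size=1),
        )

    def forward(self, x):
        _, _, h, w = x.shape
        freq = torch.fft.rfft2(x, norm="ortho")          # spatial domain -> frequency domain
        f = torch.cat([freq.real, freq.imag], dim=1)
        f = self.freq_conv(f) + f                        # residual connection in the frequency domain
        real, imag = torch.chunk(f, 2, dim=1)
        out = torch.fft.irfft2(torch.complex(real, imag), s=(h, w), norm="ortho")  # back to spatial domain
        return out + x                                   # residual connection with the original input
```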
Loss function
Our overall loss function is defined as a weighted sum of content loss (
Following previous work, 27 the global style loss (
Similar to SANet, to ensure the preservation of the content structure in the generated results, the content loss (
In addition, inspired by the Generative Adversarial Network (GAN), 40 we adopt an adversarial loss, which effectively drives the data distribution of the stylized image
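Putting the pieces together, a plausible form of the overall objective consistent with the losses named above (content, global style, identity, and adversarial) is sketched below; the weighting symbols λ are our own notation and are not taken from the paper.

```latex
\mathcal{L}_{\mathrm{total}} =
  \lambda_{c}\,\mathcal{L}_{c}
  + \lambda_{gs}\,\mathcal{L}_{gs}
  + \lambda_{id}\,\mathcal{L}_{id}
  + \lambda_{adv}\,\mathcal{L}_{adv}
```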
Experimental results
Implementation details
During training, MS-COCO 41 and Wikiart 42 are used as the content and style datasets, respectively. These two datasets together contain about 120,000 training images. The pretrained VGG-19 network is used as the encoder, with weights fixed during training. The Adam optimizer 43 is used for optimization. The batch size is set to 4, the learning rate to 0.0001, and a total of 160,000 iterations are performed. During training, each image is rescaled so that its smaller dimension is 512 pixels while maintaining the aspect ratio, and then randomly cropped to 256 × 256 pixels. In the experiment,
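The preprocessing described above can be sketched as follows; torchvision is an assumption here, since the paper does not name its data pipeline.

```python
import torchvision.transforms as T

# Training-time preprocessing: resize the smaller side to 512 px, then random-crop 256 x 256.
train_transform = T.Compose([
    T.Resize(512),       # rescale so the smaller side is 512 px, preserving aspect ratio
    T.RandomCrop(256),   # random 256 x 256 crop
    T.ToTensor(),        # convert a PIL image to a float tensor in [0, 1]
])
```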
Comparisons
Qualitative comparison
As shown in Figure 1, our method is qualitatively compared with eight SOTA style transfer methods, including StyTR, 30 AesUST, 31 AdaAttN, 29 IECAST, 28 SANet, 8 MAST, 27 CapVST, 32 and StyleID. 33 SANet adopts an attention mechanism to fuse deep style features with content features, but often produces obvious artifacts and repetitive texture patterns (e.g., rows 1, 2, and 4 in the SANet column). IECAST improves the style effect by introducing a contrastive loss function, but artifacts still exist (e.g., rows 1, 4, and 6 in the IECAST column). StyTR replaces the CNN with a Transformer in order to extract long-range dependencies from the input image. However, due to a lack of in-depth exploration of the style image, its results contain artifacts and retain some of the source content color (e.g., rows 1, 3, and 6 in the StyTR column). AesUST uses a GAN and a novel two-step training method to learn the aesthetic features of the style image, but artifacts and cluttered textures still exist (e.g., rows 1, 2, and 4 in the AesUST column). MAST uses a novel disentangled loss function to extract style and content information from images, but the generated results still contain artifacts (e.g., rows 1, 2, and 4 in the MAST column). AdaAttN normalizes content features by considering shallow features, but its ability to capture long-range dependencies is insufficient, resulting in generated images with messy textures (e.g., rows 1, 2, and 4 in the AdaAttN column). CapVST uses a reversible residual network and an unbiased linear transform with a matting Laplacian training loss, but its results exhibit disorganized colors and textures, accompanied by black artifacts (e.g., rows 1, 2, and 7 in the CapVST column). StyleID adapts pretrained large-scale diffusion models for style transfer in a training-free way, but it shows limited capability in transferring texture features effectively (e.g., rows 2, 4, and 6 in the StyleID column). The above methods show clear limitations in handling complex textures, producing either chaotic, disorganized patterns or overly subtle texture effects; this is primarily because they operate solely in the spatial domain and neglect the advantages of frequency domain processing. As shown in the third column of Figure 1, our model minimizes the generation of artifacts and messy textures by utilizing the ANSSA and GSTE modules to generate high-quality images.
Furthermore, to conduct a more detailed quantitative comparison, we randomly select 20 content images and 20 style images, generating 10 sets, each containing 400 images. We use the Learned Perceptual Image Patch Similarity (LPIPS) 44 and the mean style loss as evaluation criteria to measure the similarity between the generated image and the style image. Similarly, we also employ the Structural Similarity Index (SSIM) 45 and the mean content loss as evaluation metrics to assess the similarity between the generated image and the content image. As shown in Figure 5, the content loss achieved its best performance at a minimum value of 2.23, the style loss reached its best performance at a minimum value of 3.19, LPIPS recorded its best score at a minimum value of 0.675, and SSIM reached its best score at a maximum value of 0.475. The overall curve is smooth and stable. Overall, our model demonstrates significant advantages compared to other methods.

Line charts of four different metrics.
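For completeness, a hedged sketch of how the two similarity metrics described above could be computed is shown below; the specific libraries (the lpips package and scikit-image's SSIM) are assumptions rather than the paper's stated tooling.

```python
import lpips
from skimage.metrics import structural_similarity as ssim

# LPIPS measures the perceptual distance between the stylized result and the style image;
# SSIM measures structural similarity between the stylized result and the content image.
lpips_fn = lpips.LPIPS(net="vgg")

def evaluate_pair(stylized, style_img, content_gray, stylized_gray):
    # `stylized` and `style_img` are NCHW tensors scaled to [-1, 1];
    # `content_gray` and `stylized_gray` are grayscale numpy arrays in [0, 1].
    lpips_score = lpips_fn(stylized, style_img).item()
    ssim_score = ssim(content_gray, stylized_gray, data_range=1.0)
    return lpips_score, ssim_score
```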
As shown in Figure 6, our training loss decreases rapidly and gradually stabilizes, though noticeable oscillations are observed during the process. As the training progresses, the magnitude of oscillations significantly diminishes, eventually stabilizing after 160,000 iterations.

Overall loss curve.
Quantitative comparison
As shown in Table 1, we achieve the best LPIPS of 0.616, the best SSIM of 0.467, and the best content loss of 2.31, along with the second-best style loss of 3.08.
Quantitative comparison.
The best results are bolded, and the second-best results are underlined.
Ablation study
Qualitative comparison
As shown in Figure 7, without the ANSSA module, the generated image deviates from the source content and exhibits artifacts (4th column, 2nd row). This observation demonstrates the key role of the ANSSA module in maintaining the content structure of the generated image and removing artifacts. On the other hand, when operating solely in the spatial domain without the GSTE module, the generated image suffers from a noticeable loss of texture details (as shown in column 5, 1st row). This highlights the critical role of the GSTE module in enhancing the overall stylization effect, improving texture richness, and capturing finer style features. In addition, without the adversarial loss, the generated images tend to retain the colors of the source content image while producing incongruent texture effects (3rd column), suggesting that this loss plays a crucial role in generating more harmonious and realistic results. Without the identity loss, the generated image exhibits severe artifacts and cluttered colors, suggesting that the identity loss enriches the stylistic patterns while maintaining the content structure.

Ablation studies of the adversarial loss function, the identity loss function, Adaptive Normalization with Style Semantics Awareness (ANSSA), and Global Style Texture Enhancement (GSTE) module on image style transfer.
Similarly, we evaluate various metrics using 10 sets of images, totaling 4000 images. As shown in Figure 8, the full model achieves the best performance across all metrics.

Line charts of evaluation metrics from ablation experiments.
Quantitative comparison
As shown in Table 2, when the
Quantitative comparison of different modules and loss ablation experiments.
The best results are bolded, and the second-best results are underlined.
Loss weight analysis
The identity loss and content loss have been shown to enrich style patterns while preserving the content structure. 8 In this section, we show the influence of the adversarial loss and global style loss. Figure 9(a) shows the results obtained by fixing

(a) Results obtained by fixing
Multilevel feature embedding
Figure 10 shows the stylized outputs obtained from different feature layers. When only

Multi-level feature embedding. By embedding features at multiple levels, we can enrich the local and global patterns for the stylized images.

Style interpolation with two different styles. The number below the figure represents
Multistyle transfer
To demonstrate the flexibility of our proposed DDSTNet model, following ref. 46, we designed experiments involving style interpolation and multistyle image stitching. For style interpolation, we combine the features of multiple style images at different ratios and use the result to modulate the content features before decoding.
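A minimal sketch of this interpolation scheme is shown below; `transfer_module`, `style_feats`, and `weights` are illustrative names standing in for the ANSSA-based fusion block and the per-style mixing ratios, and the feature-level averaging is an assumption about how the combination is realized.

```python
def interpolate_styles(content_feat, style_feats, weights, transfer_module):
    """Sketch: mix several style feature maps at given ratios, then modulate
    the content features with the mixed style before decoding."""
    total = float(sum(weights))
    mixed_style = sum((w / total) * f for w, f in zip(weights, style_feats))
    return transfer_module(content_feat, mixed_style)
```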

Results of multistyle stitching image transfer.

Qualitative comparison among different methods of video style transfer. The first row shows the results of different methods or settings. The second row shows the heat maps of differences between consecutive frames.
Video style transfer
For video stylization, we compare our method with the SOTA methods SANet, MAST, AdaAttN, and CCPL. We randomly selected a video segment (50 frames, 12 FPS) from the MPI Sintel dataset 47 for style transfer. As shown in Figure 13, SANet, MAST, and AdaAttN exhibit weak stylization performance, with a significant loss of content details and poor frame-to-frame consistency. Although CCPL achieves strong frame consistency, its stylization effect is weak and artifacts are present. In contrast, our method achieves ideal stylization results while maintaining frame-to-frame consistency.
Conclusion
This paper presents a novel DDSTNet that integrates a self-attention mechanism with frequency domain analysis to generate higher-quality stylized images. Specifically, we design the ANSSA module to capture style image features and effectively fuse them with the source image's content features through adaptive normalization. Furthermore, the GSTE module is introduced to process the feature maps in the frequency domain and enhance the overall quality of the stylized image. Compared to other SOTA methods, our approach demonstrates significant advantages in reducing artifacts and cluttered textures, while the generated stylized images exhibit notable improvements in content richness and visual coherence. A series of ablation experiments validates the effectiveness of the proposed components. In future work, we intend to further extend the model to video style transfer and image translation tasks to explore its potential across a broader range of application scenarios.
Acknowledgments
This work is fully supported by the Frontier Exploration Projects of Longmen Laboratory (NO. LMQYTSKT034); Key Research and Development and Promotion of Special (Science and Technology) Project of Henan Province, China (No. 252102210158, 232100210153); Key Scientific Research Project of Higher Education Institutions in Henan Province, China (No. 24B520010).
Author contributions
Changyang Hu: Responsible for mathematical reasoning, writing experimental code, and managing the training process. Shibao Sun and Pengcheng Zhao: Contributed to mathematical derivations, model development, and auditing of experiments. Yifan Zhao, Jianfeng Liu, and Xiaoli Song: Focused on drafting and revising the language of the paper, as well as formatting and layout.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is fully supported by the Frontier Exploration Projects of Longmen Laboratory (NO. LMQYTSKT034); Key Research and Development and Promotion of Special (Science and Technology) Project of Henan Province, China (No. 252102210158, 232100210153); and Key Scientific Research Project of Higher Education Institutions in Henan Province, China (No. 24B520010).
