Abstract
Because text-to-image synthesis with a non-stacked network structure models only the global features of the image and the semantic features of the text, the generated images are prone to problems such as semantic inconsistency, loss of detail features, and incomplete main content. To address these problems, this paper proposes a text-to-image generation method, the cascaded structure and joint attention generative adversarial network (CSGAN), which combines a cascaded structure with joint attention. After encoding the text, the method processes it with a conditional enhancement module, which strengthens the expressive ability of the text features. To make the local regions of the image better fit the text features, a joint attention module is designed to address the problem that the image generation process cannot fully reflect the local details of the text content. By using affine transformations to map visual features and a cascaded structure to fuse the features of different modules, the integrity of the overall image content is effectively preserved and the semantic consistency between text and image is improved. Experimental results on the CUB dataset show that, compared with the current mainstream non-stacked network model DF-GAN, the inception score of the CSGAN model is improved by about 4.01% and the Fréchet inception distance is reduced by about 13.36%. These quantitative indicators and the visualization results demonstrate the effectiveness of the CSGAN model.
Keywords
Introduction
Text-to-image synthesis is a challenging cross-modal task in the fields of computer vision and natural language processing. It has many beneficial applications in art painting (Lei et al., 2021), image editing (Liu et al., 2021), and computer-aided design (Shi et al., 2021), and has therefore received great attention in the research field. Current text-to-image synthesis methods based on generative adversarial networks (GANs) mainly fall into two types: stacked network structures and non-stacked network structures. The stacked network structure is usually a multi-stage refinement framework, which first generates a low-resolution initial image from the text description, then gradually refines the image generated in the previous stage in each subsequent stage, and finally produces a high-resolution image (Qiao et al., 2019; Zhang et al., 2017, 2018, 2021). The stacked network structure solves the problem of low resolution in generated images. However, the stacked structure uses multiple generators, which interfere with one another and ultimately affect the quality of image generation. In addition, the generation results depend heavily on the quality of the initial images: if the images generated in the initial stage are of poor quality, it is challenging to generate high-quality images in subsequent stages. To solve these problems, DF-GAN (Tao et al., 2022) replaced the stacked backbone with a single-level backbone and combined the hinge loss (Zhang et al., 2019) and residual network (He et al., 2016) techniques to directly generate high-resolution images. However, DF-GAN relies only on sentence-level information and ignores the guidance of word-level information, which reduces the consistency between generated images and text descriptions. Since the quality of the generated image is directly affected by the quality of the text features, each word has a non-negligible impact on the details of the image, especially in short texts. If the local feature extraction of the text is incomplete, the detail quality of the resulting image will be poor.
To solve these problems, this paper proposes the cascaded structure and joint attention generative adversarial network (CSGAN), a text-to-image method built on DF-GAN that combines a cascaded structure with joint attention. Different from previous work (Tao et al., 2022; Xu et al., 2018), which directly encodes the text and embeds it to obtain the feature representation, this paper conditionally enhances the encoded text features to generate more conditional variables for the generator, so as to improve the generalization ability of the network. Second, this paper designs a joint attention module combining the convolutional block attention module (CBAM) (Woo et al., 2018) and the bottleneck attention module (BAM) (Park et al., 2018) to improve the ability to focus on text nuances. In addition, this paper designs a cascade structure to deeply integrate the information of each up-sampling module, which solves the problem of incomplete content in the generated image. In summary, the main contributions of this paper are as follows:
1. In the feature extraction module, a conditional enhancement module (CEM) is added to generate more enhanced data when there are only a small number of text-image data pairs, making the semantic space continuous and improving robustness to small perturbations in the semantic space, so as to achieve more accurate image generation.
2. A joint attention module is designed to focus on the details of the text from both the spatial and channel aspects, which helps the network better learn the details in the text vector, improves network performance, and ensures that the generated images reflect the details of the text.
3. A cascaded module is designed, which gives the low-dimensional and high-dimensional semantic features in the generator a closer connection. It can effectively pass the information of the current layer downward while accepting the information from the previous layer, which effectively ensures the integrity of the overall image content and improves the semantic consistency between text and image.
Related Work
Text-to-Image Generation Based on GANs
Reed et al. (2016) first used GANs to generate images with a resolution of 64 × 64 pixels.
Text-Image Fusion
Xu et al. (2018) proposed AttnGAN, which repeatedly uses the attention mechanism to focus on the word-level information of the text during the image generation process. Qiao et al. (2019) proposed MirrorGAN, which uses a global-local attention module and a mirroring method to generate high-quality images. Zhu et al. (2019) proposed DM-GAN to generate high-quality images, aiming to mitigate the influence of the initially generated images on subsequent refinement. Yin et al. (2019) proposed word-level and sentence-level conditioned batch normalization in SD-GAN to enhance visual-semantic embedding in the feature maps of the generator network.
Deep Attentional Multi-modal Similarity Model Loss
The deep attentional multi-modal similarity model (DAMSM) (Wu et al., 2022) loss is a loss function used in image generation and vision-language tasks. The model maps image sub-regions and the words in the text into the same semantic space to evaluate text-image similarity at the word level. The image encoder uses an intermediate layer to learn the local feature matrix
First, we calculate the similarity between image subregions and words:
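A standard formulation of this word-region similarity, following the DAMSM used in AttnGAN (the notation below is assumed rather than quoted from this paper), is:

$$
s = e^{\top} v, \qquad
\bar{s}_{i,j} = \frac{\exp(s_{i,j})}{\sum_{k=1}^{T} \exp(s_{k,j})},
$$

where $e \in \mathbb{R}^{D \times T}$ stacks the $T$ word features, $v \in \mathbb{R}^{D \times N}$ stacks the $N$ sub-region features from the image encoder, $s_{i,j}$ is the dot-product similarity between the $i$-th word and the $j$-th sub-region, and $\bar{s}_{i,j}$ is the same similarity normalized over words. Word-attended region-context vectors and a cosine-similarity matching score are then computed from $\bar{s}$ to form the DAMSM loss.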
The overall framework of the CSGAN network proposed in this paper is shown in Figure 1. The CSGAN network is mainly composed of a text feature extraction module, a generator, and a discriminator. The feature extraction module takes text features and random noise as input, the generator takes the original noise and the text features of the text feature extraction module as input, and the discriminator takes the image generated by the generator and the text features extracted by the feature extraction module as input.

The network structure of the cascaded structure and joint attention generative adversarial network (CSGAN). It has one generator-discriminator pair. The generator consists of a feature extraction module, six CSBlock modules, three cascade modules, and a channel dimension reduction module.
The feature extraction module consists of a bidirectional long short-term memory (BiLSTM) network (Schuster & Paliwal, 1997) and a CEM. The BiLSTM extracts the text features, and the CEM is introduced to give them a stronger representation capability.
Bidirectional Long Short-term Memory
BiLSTM consists of two independent long short-term memory (LSTM) networks, which process the forward and backward text information in parallel, so as to mine the contextual information in the text sequence more deeply. Its network structure is shown in Figure 2. Given the input text information

The network structure of bidirectional long short-term memory network (BiLSTM). It can not only obtain the previous information, but also the future information at a certain point in the sequence.
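As a minimal sketch, a bidirectional text encoder of this kind can be written in PyTorch as below; the embedding and hidden dimensions, and the choice of concatenating the final forward and backward hidden states as the sentence feature, are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Minimal BiLSTM text encoder sketch (hypothetical dimensions)."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True: forward and backward LSTMs read the sequence in parallel
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, tokens):
        # tokens: (batch, seq_len) word indices
        emb = self.embedding(tokens)
        word_feats, (h_n, _) = self.lstm(emb)           # word_feats: (batch, seq_len, 2*hidden_dim)
        # concatenate the final forward and backward hidden states as the sentence feature
        sent_feat = torch.cat([h_n[0], h_n[1]], dim=1)  # (batch, 2*hidden_dim)
        return word_feats, sent_feat

# usage
encoder = TextEncoder(vocab_size=5000)
words, sentence = encoder(torch.randint(0, 5000, (4, 18)))
```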
Text features are crucial for image generation, and the same sentence may correspond to objects with different postures and appearances. To obtain better text features, the CEM is introduced to generate more conditional variables for the generator and to enhance the robustness of the network, thereby alleviating overfitting when the number of image-text pairs is limited and the text conditions are fixed. At the same time, the CEM also addresses the problem that the text vector obtained from the text encoder has a high dimensionality, which can make the text semantic space sparse and cause the main body of the generated image to lack necessary features.
Its network structure is shown in Figure 3. Given the text description T, the text encoder BiLSTM generates a sentence feature vector

The network structure of conditional enhancement module (CEM). It can increase the representation ability of text coding and enhance the ability to capture important text content.
where
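Conditioning enhancement of this kind (first introduced as conditioning augmentation in StackGAN, which the CEM is assumed to follow here) typically samples the enhanced condition vector with the reparameterization trick:

$$
\hat{c} = \mu(\varphi_t) + \sigma(\varphi_t) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
$$

where $\varphi_t$ is the sentence feature produced by the BiLSTM encoder, $\mu(\cdot)$ and $\sigma(\cdot)$ are learned fully connected mappings, and $\odot$ denotes element-wise multiplication. A KL-divergence term between $\mathcal{N}(\mu(\varphi_t), \Sigma(\varphi_t))$ and the standard normal distribution is added to the generator loss (the KL loss mentioned in the ablation study) to keep the conditioning space smooth and continuous.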
The generator mainly includes a feature expansion module, six CSBlock modules, three cascade modules, and a channel dimension reduction module. The feature expansion module expands the input noise into features, the CSBlock modules deeply merge text features and noise features step by step to further enhance the text features, the cascade modules strengthen the connections between the CSBlock modules, and the channel dimension reduction module reduces the channel dimension of the feature map to obtain images with a resolution of 256 × 256.
CSBlock
The CSBlock consists of an up-sampling module, multiple Affine layers, multiple activation layers, and two joint attention modules (FABlocks), as shown in Figure 4. The feature vector dimensions of each CSBlock are different, as shown in Table 1. The main function of the up-sampling module is to increase the resolution of the feature map. Through the cooperation of multiple Affine layers and ReLU functions, the visual features based on the natural language description can be better mapped to the image. By designing two FABlocks, the network can better integrate spatial attention and channel attention, which not only enhances text features more deeply and effectively but also improves the robustness of the network, enabling it to adapt more effectively to different types of input noise. The CSBlock receives text features multiple times to enhance and retain the details of the text features, so that the generated image better integrates the text features and fits the text content.
The network structure of CSBlock. Affine layers and FABlocks enable better extraction of text content information.
CSBlock Feature Vector Dimensions.
Affine layer
The network structure of the Affine layer. Through channel-wise scaling and shifting, the generator can capture the semantic information in the text description and synthesize realistic images that match the given text description.
FABlock
In order to better enhance text features and obtain more feature details, a FABlock is designed, and its network structure is shown in Figure 6. This module consists of convolutional layers, CBAM, and BAM, which enables the network to better focus on the fine-grained information in text features. The CBAM connects the channel attention module and the spatial attention module, applying the weight information to the input features step by step so that the network can better focus on text features. To further attend to the details of the text, the BAM module is added, and the channel outputs of CBAM and BAM are concatenated, so that the network can extract more complete features from complex text information. Because of the combination of the two attention modules, the network can attend to both the overall content of the feature map and its details, so the primary content of the generated image is more complete and its details are clearer.
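As an illustrative sketch (not the authors' released code), the two mechanisms just described could look roughly as follows in PyTorch: a text-conditioned affine layer that predicts channel-wise scale and shift from the sentence vector, and a joint attention block that concatenates a CBAM-style and a BAM-style branch along the channel dimension before a fusing convolution. All layer sizes and the attention internals are assumptions.

```python
import torch
import torch.nn as nn

class TextAffine(nn.Module):
    """Channel-wise scale and shift predicted from the text condition (sketch)."""
    def __init__(self, cond_dim, num_ch):
        super().__init__()
        self.gamma = nn.Sequential(nn.Linear(cond_dim, num_ch), nn.ReLU(), nn.Linear(num_ch, num_ch))
        self.beta = nn.Sequential(nn.Linear(cond_dim, num_ch), nn.ReLU(), nn.Linear(num_ch, num_ch))

    def forward(self, x, cond):
        g = self.gamma(cond)[:, :, None, None]   # (B, C, 1, 1)
        b = self.beta(cond)[:, :, None, None]
        return g * x + b                          # channel-wise scaling and shifting


class FABlock(nn.Module):
    """Joint attention: channel-concatenate simplified CBAM and BAM branches (sketch)."""
    def __init__(self, num_ch, reduction=8):
        super().__init__()
        # CBAM-style branch: channel attention followed by spatial attention
        self.cbam_channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(num_ch, num_ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(num_ch // reduction, num_ch, 1), nn.Sigmoid())
        self.cbam_spatial = nn.Sequential(nn.Conv2d(1, 1, 7, padding=3), nn.Sigmoid())
        # BAM-style branch: channel and spatial attention combined in parallel
        self.bam_channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(num_ch, num_ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(num_ch // reduction, num_ch, 1))
        self.bam_spatial = nn.Sequential(
            nn.Conv2d(num_ch, num_ch // reduction, 1), nn.ReLU(),
            nn.Conv2d(num_ch // reduction, 1, 3, padding=1))
        self.fuse = nn.Conv2d(2 * num_ch, num_ch, 3, padding=1)  # merge the concatenated branches

    def forward(self, x):
        # CBAM-style: sequential channel gating then spatial gating
        xc = x * self.cbam_channel(x)
        xc = xc * self.cbam_spatial(xc.mean(dim=1, keepdim=True))
        # BAM-style: additive channel + spatial map passed through a sigmoid gate
        xb = x * torch.sigmoid(self.bam_channel(x) + self.bam_spatial(x))
        return self.fuse(torch.cat([xc, xb], dim=1))              # channel concatenation

# usage
affine, attn = TextAffine(cond_dim=256, num_ch=64), FABlock(num_ch=64)
y = attn(affine(torch.randn(2, 64, 32, 32), torch.randn(2, 256)))
```

This is only a sketch; the actual CBAM and BAM use richer pooling and dilated convolutions than shown here.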



The network structure of FABlock. It effectively combines the advantages of CBAM and BAM, and deeply extracts the content of text information. CBAM = convolutional block attention module; BAM = bottleneck attention module.
In the past, the connections between the modules in text-to-image generation networks were not close, making it difficult to ensure the accuracy of the spatial position of the main body of the generated image. Therefore, three cascade modules are designed, which fuse the output features of the first and third CSBlocks, the second and fourth CSBlocks, and the third and fifth CSBlocks step by step, establishing close connections between multiple modules. Each cascade module is mainly composed of up-sampling and convolutional layers, and its structure is shown in Figure 7. In the

The network structure of cascade module. It effectively connects low-dimensional features and high-dimensional features, and better retains important information in the process of text information transmission.
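A minimal sketch of one cascade module is given below, under the assumptions that each CSBlock doubles the spatial resolution (so the earlier feature map is four times smaller) and that fusion is done by concatenation followed by a convolution; these details are illustrative only.

```python
import torch
import torch.nn as nn

class CascadeModule(nn.Module):
    """Fuse an earlier CSBlock output into a later one (sketch; fusion details assumed)."""
    def __init__(self, early_ch, late_ch):
        super().__init__()
        # the earlier feature map is two CSBlocks behind, hence 4x smaller spatially
        self.up = nn.Upsample(scale_factor=4, mode="nearest")
        self.conv = nn.Conv2d(early_ch + late_ch, late_ch, kernel_size=3, padding=1)

    def forward(self, early_feat, late_feat):
        early_feat = self.up(early_feat)                    # match spatial resolution
        fused = torch.cat([early_feat, late_feat], dim=1)   # combine low- and high-level features
        return self.conv(fused)                             # reduce back to the later block's channels
```

In the generator, the three such modules would then fuse the outputs of CSBlocks 1 and 3, 2 and 4, and 3 and 5, as described above.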
The loss function used in the generator of the baseline network is shown in Formula (13), where
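For reference, the DF-GAN baseline uses a hinge-style adversarial generator loss of roughly the following form (the notation is assumed here, not copied from Formula (13)):

$$
L_{G}^{adv} = -\,\mathbb{E}_{z \sim p_z}\big[D\big(G(z, \hat{c}),\, e\big)\big],
$$

where $z$ is the input noise, $\hat{c}$ is the (enhanced) text condition fed to the generator $G$, and $e$ is the sentence feature fed to the discriminator $D$. In this paper, the DAMSM loss and the KL loss from the CEM are added on top of such an adversarial term.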
A channel dimension-raising module, six DownBlock modules, and two convolution modules make up the discriminator. The channel dimension-raising module converts the image output by the generator into features, the six DownBlock modules extract features from the feature map step by step, and the obtained features are fused with the text features output by the CEM. Each DownBlock module has different feature vector dimensions, as shown in Table 2. The fused features undergo two convolution operations to calculate the loss function, which optimizes the generator and leads to high-quality images.
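A plausible sketch of a single DownBlock, assuming a residual down-sampling block in the style of the DF-GAN discriminator (the exact layer configuration is an assumption), is:

```python
import torch.nn as nn

class DownBlock(nn.Module):
    """Residual block that halves the spatial resolution (sketch)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 4, stride=2, padding=1),   # downsample by 2
            nn.LeakyReLU(0.2),
            nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
            nn.LeakyReLU(0.2),
        )
        # shortcut path matched to the main path's resolution and channel count
        self.skip = nn.Sequential(
            nn.AvgPool2d(2),
            nn.Conv2d(in_ch, out_ch, 1),
        )

    def forward(self, x):
        return self.main(x) + self.skip(x)
```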
DownBlock Feature Vector Dimension.
The discriminator is trained by computing the adversarial loss. The loss function of the discriminator is shown in Formula (15), where
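For reference, the DF-GAN baseline trains the discriminator with a hinge loss of roughly the following form, using real images with matched text $e$, real images with mismatched text $\hat{e}$, and generated images (notation assumed, not copied from Formula (15)):

$$
L_{D} = \mathbb{E}_{x \sim p_{data}}\big[\max(0,\, 1 - D(x, e))\big]
      + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z}\big[\max(0,\, 1 + D(G(z, \hat{c}), e))\big]
      + \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{data}}\big[\max(0,\, 1 + D(x, \hat{e}))\big],
$$

to which DF-GAN additionally applies a matching-aware gradient penalty on real matched image-text pairs.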
Data Sets and Parameter Settings
The CUB dataset (Wah et al., 2011) contains 200 bird species and 11,788 images, each with 10 language descriptions. A total of 8,855 images from 150 bird species are used as the training set, and 2,933 images from 50 bird species are used as the testing set.
Adam was used to optimize the network parameters,
Evaluating Indicator
In this paper, inception score (IS) (Salimans et al., 2016) and Fréchet inception distance (FID) (Heusel et al., 2017) are selected to evaluate network performance. The IS index evaluates the effectiveness of CSGAN generated images through clarity and diversity. The larger the IS value, the clearer and more diverse the generated image. FID is another evaluation index with more universal applicability, which represents performance by calculating the distance between the real sample and the generated sample in the feature space. The smaller the FID value, the closer the samples are and the better the model performance.
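For reference, the two metrics are commonly defined as follows (these are the standard definitions, not formulas specific to this paper):

$$
\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\, D_{KL}\big(p(y \mid x)\,\big\|\,p(y)\big)\Big), \qquad
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2} + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),
$$

where $p(y \mid x)$ is the Inception-v3 class posterior for a generated image $x$, $p(y)$ is its marginal over generated images, and $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of Inception features of real and generated images, respectively.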
In addition, contrastive language-image pretraining (CLIP) is used to measure image-text alignment. The CLIP model encodes images and texts with separate feature extractors to obtain image features and text features, calculates their cosine similarity, and uses it to measure the matching degree between images and texts. The higher the CLIP mean, the higher the similarity between text and image.
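A minimal sketch of how such a CLIP-based alignment score can be computed, here using the Hugging Face transformers implementation of CLIP (the checkpoint name and pre/post-processing are assumptions; the paper does not state which implementation it used), is:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
    txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat * txt_feat).sum().item()

# the CLIP mean reported in Table 4 would then be the average of clip_score
# over all generated image-caption pairs
```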
Quantitative Evaluation
This paper compares the proposed method with many cutting-edge methods, including AttnGAN, MirrorGAN, DM-GAN, ITSC-GAN, and DF-GAN. Compared with these models, the CSGAN proposed in this paper achieves the highest IS and the lowest FID. DF-GAN has already been compared with several state-of-the-art methods such as AttnGAN, MirrorGAN, and DM-GAN, and its generated results and quantitative indicators are significantly better than theirs. To ensure the accuracy of the comparative experiment and strictly follow the single-variable principle, this paper retests DF-GAN under exactly the same hardware and software configuration, and the corresponding experimental data are reported as DF-GAN*.
As shown in Table 3, compared with DF-GAN*, the IS index of CSGAN proposed in this paper on the CUB dataset increased from 4.73
Data Comparison of Each Model.
FID = Fréchet inception distance; CSGAN = cascaded structure and joint attention generative adversarial network; IS = inception score.
At the same time, this paper selects a certain amount of text and the corresponding images generated by each model to evaluate image-text alignment with the CLIP model. The experimental results are shown in Table 4. It is worth noting that although AttnGAN has the highest CLIP mean, its generated images have obvious defects, such as incomplete main content and unclear reconstruction of details, as shown in Figures 8 to 10 (in Section 4.4). In contrast, the CLIP mean of CSGAN is 0.2472, slightly lower than that of AttnGAN, but CSGAN shows a clear advantage in image generation quality: the images it generates have more complete main content and clearer details. In addition, compared with DF-GAN (CLIP mean 0.2461) and DM-GAN (CLIP mean 0.2377), CSGAN is competitive and demonstrates strong image-text alignment ability.

Compared to AttnGAN, DM-GAN, and DF-GAN, cascaded structure and joint attention generative adversarial network (CSGAN) has a more complete main content and clearer details.

When the models are tested with semantically consistent text, the model in this article performs best.

When different noise is paired with the same text content, the images generated by the cascaded structure and joint attention generative adversarial network (CSGAN) are better.
CLIP Mean for Each Model.
CLIP = contrastive language-image pretraining; CSGAN = cascaded structure and joint attention generative adversarial network.
In order to evaluate the performance of CSGAN more intuitively, this paper visually compares the images generated by the model with those generated by the AttnGAN, DM-GAN, and DF-GAN* models. The comparison results are shown in Figure 8. From the figure, it can be seen that the main body in the first and second columns of AttnGAN is unclear, and important information such as the bird's beak in the third column is not reflected in the image. The boundaries of the images in the first and second columns of DM-GAN are unclear, so the main body blends into the scene and the image quality is low. For DF-GAN*, the wings and claws in the first column are generated inaccurately, and in the second column the bird's beak blends into the background and cannot be recognized. In contrast, the images generated by CSGAN have more complete main content and clearer detail features.
In the dataset, there are different language expressions for the same scene. In order to analyze the quality of images generated by the model from different text descriptions of the same scene, this paper conducted corresponding experiments, and the results are shown in Figure 9. In the images generated by AttnGAN, the feather color of the bird's belly in the third column is inconsistent and cannot accurately reflect the semantic information of "brown belly." The main content in the images generated by DM-GAN is incomplete, especially in the third column, where the bird's legs are not reflected. There is a significant difference in the size of the main body in the images generated by DF-GAN*. Relatively speaking, CSGAN ensures semantic consistency while preserving the completeness of the main content, proving the reliability of CSGAN.
In the image generation process, for the same text, if the initial input noise is different, the generated image content may deviate. To this end, an experimental analysis of the impact of different types of noise on model performance was carried out, and the results are shown in Figure 10. With the same text but different noise, AttnGAN shows significant head generation errors in the first and third columns and an image reconstruction error in the bird's legs in the second column. The first and third columns of DM-GAN also show reconstruction errors for the head of the main body, while the second column shows incomplete reconstruction of the head details and legs. In DF-GAN*, the reconstruction of details such as the beak and legs of the main body is not clear. Compared with AttnGAN, DM-GAN, and DF-GAN*, CSGAN performs better in reconstructing details of the main body such as the bird's beak and eyes.
Ablation Experiment
Based on the baseline network, this paper completed three optimization tasks. First, in order to extract word-level features in the text information more accurately, the CEM, the DAMSM loss function, and the KL loss function were introduced. Second, in order to address the problems of incomplete main content and weakened semantic consistency, a cascade structure is designed to strengthen the internal connections between modules and improve the semantic consistency between text and image. Third, in order to continuously enhance text features during feature generation and ensure the integrity of the text detail features, the FABlock is designed to make the details of the generated image clearer. In order to demonstrate the impact of each proposed module on network performance, the following ablation experiment was performed on the CUB dataset. The results are shown in Table 5.
As can be seen from Table 5, adding the CEM and the DAMSM loss function to the basic model improves the IS by about 2.74% and reduces the FID by about 1.25%; on this basis, adding only the cascade structure improves the IS by about 2.95% and reduces the FID by about 3.45%; on this basis, adding only the FABlock improves the IS by about 3.17% and reduces the FID by about 6.13%; finally, the full CSGAN improves the IS by about 4.01% and reduces the FID by about 13.36%. Analyzing the experimental results above, we can see that the cascade structure effectively strengthens the connections between the modules in the generator, makes the feature space of the generated images more consistent with that of the real images, and improves the semantic consistency between text and image. The FABlock deeply enhances the text detail features in the feature map while extracting richer semantic features and improving the details of the final generated image, which proves the innovation and effectiveness of the proposed model.
Ablation Experiment.
FID = Fréchet inception distance; DAMSM = deep attentional multi-modal similarity model; CEM = conditional enhancement module; IS = inception score.
Conclusion
This paper proposes CSGAN, a text-to-image model that combines a cascaded structure and joint attention. In the feature extraction module, the CEM is added to enrich the text features and strengthen the integration of text features and image features. The cascade structure strengthens the connections between modules in the generator, improves the integrity of the main content of the generated image, and makes the image generated by the non-stacked structure better fit the text content. The joint attention module makes image details such as edges and textures more realistic. The model will be further optimized in the future, for example, by using a better text encoder to obtain text features with higher representation strength, and by further refining the loss function to improve performance while adding as little extra network structure as possible.
Footnotes
Acknowledgements
This work was supported by the Natural Science Foundation of Jilin Province, No. 20230101179JC, China.
Funding
The authors received the following financial support for the research, authorship and/or publication of this article. This work was supported by the Natural Science Foundation of Jilin Province, No. 20230101179JC, China.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
