Abstract
Image colorization is one of the core problems in computer vision and has attracted significant attention in recent years. Colorization improves the ability of the human eye to recognize grayscale images and understand scenes, particularly in low-light-level (LLL) images. However, current colorization methods still face issues such as semantic confusion, color bleeding, and loss of image details. To address these issues, a bi-stream feature extraction and multiscale attention generative adversarial network (BM-GAN) is proposed. The bi-stream feature extraction block combines global and local features extracted by two parallel encoders, improving the ability of the network to extract deep features from images. The multiscale attention block enhances key features related to the colorization target across the channel and spatial dimensions, resulting in higher-quality color images. The proposed method is evaluated on the ImageNet and Summer2winter validation sets and on LLL images. Experimental results show that BM-GAN reduces the perceptual evaluation metrics learned perceptual image patch similarity (LPIPS) and Fréchet inception distance (FID) by 5.2% and 7.5%, respectively.
Introduction
Image colorization is the process of converting grayscale images into color images by assigning appropriate colors to each pixel (Maurya and Chand, 2022; Chen et al., 2023; Viswanathan et al., 2023). This process aims to restore or enhance the color information of an image, making it easier for individuals to comprehend and analyze. For grayscale images, especially low-light-level (LLL) images, the lack of color limits their visual effect and the ability of the human eye to extract image details. Colorization not only improves image quality but also improves the ability to recognize objects and scene details. It is essential in a range of applications, including LLL image colorization (Kong et al., 2020), restoration of old photos and movies (Chen et al., 2018), and medical image processing (Nida et al., 2016; Khan et al., 2017). In addition, colorized LLL images are used in a wide range of aerospace, military, and civilian applications. However, current image colorization methods face key issues, such as semantic confusion (failing to determine the appropriate color for each object in the image), color bleeding (colors spilling over object boundaries), edge ambiguity, and object intervention.
Traditional colorization methods can be divided into two main categories: user-guided colorization and example-based colorization. User-guided colorization requires users to manually add colors to grayscale images, which is labor-intensive and can lead to edge diffusion issues. Example-based colorization instead selects a color image similar to the grayscale image and transfers its colors. Although such methods reduce user involvement, they rely heavily on suitable reference images. Both categories are often inefficient and highly dependent on manual intervention.
In recent years, the rapid advancement and extensive adoption of deep learning have significantly impacted the field of image colorization (Pastor-Pellicer et al., 2019; Tu et al., 2017; Tang et al., 2022). These methods learn color patterns from large datasets of ground truth color images and apply them to grayscale images. The initial deep learning methods for image colorization were primarily based on convolutional neural networks (CNNs). These methods take grayscale images as input and, through training, learn the color distribution and characteristics; by accurately extracting semantic information from the image, they are able to generate images that are both realistic and natural. As deep learning technology has progressed, more sophisticated methods have been developed. A notable example is the generative adversarial network (GAN), which enhances colorization through adversarial training. Several GAN-based methods, including DualGAN (Yi et al., 2017), CycleGAN (Zhu et al., 2017), CUT (Park et al., 2020), InstaGAN (Mo et al., 2018), and Pix2pix (Isola et al., 2017), are specifically tailored to address various challenges in image colorization. These techniques leverage deep learning and adversarial training to generate high-quality colorized images. Despite enhancing image colorization, these methods have notable limitations. The training of GANs is intricate and can be unstable. For example, mode collapse forces the generator to fixate on a single color scheme, which restricts its ability to create varied color outputs. Overemphasis on the adversarial loss can result in color bleeding or loss of texture in the colorized images. This distortion can cause considerable differences between the colorized image and the ground truth color image, and it may hinder the network from understanding the semantic content of the image, leading to semantic confusion or color mismatch. To address these issues, a bi-stream feature extraction and multiscale attention GAN (BM-GAN) is proposed. The complete architecture of the BM-GAN network is presented in Figure 1. Compared to current methods, the following contributions are presented:
1. A BM-GAN grayscale image colorization network is designed, which consists of two parallel encoders to enhance the colorization effect. The network effectively combines global and local features while maintaining texture details. It enhances the ability of the network to extract detailed, scene-specific, and semantic features, and effectively mitigates color bleeding.
2. A multiscale attention block (MSAB) is designed. It effectively integrates channel attention with spatial attention to improve the accurate identification of various targets within the image. During the decoding phase, introducing an MSAB at the skip connections allows the network to selectively fuse features from the encoder and decoder. The combination of information from the shallow encoding layers focuses on key features associated with colorized objects in specific channels and spatial regions. The colorization quality of the network is improved by reducing semantic confusion and loss of image details between the colorized targets.
3. Several sets of LLL images in different scenarios are captured by the multipixel photon counter (MPPC) experimental platform in an LLL environment. The proposed colorization method, when applied to these LLL images, significantly enhances the visualization of LLL images in practical applications and facilitates better image information acquisition by the human eye.

Overall architecture of the proposed image colorization network. The generator is composed of two parallel encoders that jointly learn the global and local features of the image, with multiscale attention blocks introduced at the skip connections.
User-Guided Colorization
User-guided colorization adds colors to grayscale images through user scribbles or user-specified colors. This method allows users to intuitively intervene in the colorization process and customize it according to their creativity and preferences. Levin et al. (2004) utilized the Markov random field method to distribute colors by leveraging the similarity between adjacent pixels with similar intensities. Huang et al. (2005) proposed an adaptive edge detection algorithm aimed at preventing edge haloing in colored regions. The algorithm employs a high-threshold Sobel filter for initial edge detection, followed by color propagation to resolve color bleeding issues, particularly in cartoon images. Yatziv and Sapiro (2006) proposed a fast colorization technique based on luminance-weighted chrominance blending and weighted geodesic distance. Luan et al. (2007) studied a colorization approach that leverages texture similarity. The technique integrates texture matching with image segmentation to achieve more efficient color transfer. Sýkora et al. (2009) introduced a versatile tool for coloring hand-drawn comics that uses a graph cut-based approach to optimize colorization. This method is suitable for a wide range of artistic styles. Xu et al. (2013) utilized an appearance similarity method between user-defined pixels and other pixels, which is both time-efficient and memory-efficient. Paul et al. (2016) proposed a three-dimensional steerable pyramid method to deal with occlusions. These methods require users to have substantial expertise and involve significant manual effort, particularly when the input image contains complex content.
Example-Based Colorization
Example-based colorization method involves transferring color data from a reference image to a grayscale image. The reference image is typically a color image that shares a similar scene with the grayscale image. Techniques proposed by Reinhard et al. (2001), Tai et al. (2005), and Wu et al. (2013) are frequently used. Welsh et al. (2002) proposed a method to transfer color information between images by matching the luminance values of pixels in the color space. Charpiat et al. (2008) proposed a global optimization algorithm that handles multimodality by predicting the probability of possible colors at each pixel. He et al. (2018) utilized a colorization network to extract features that represent the statistical color distribution of a reference image. This method yields superior colorization results compared to pixel-level color matching. Li et al. (2019) developed a colorization method that first matches the reference image locally and then fuses it using a global optimization technique. The colorized image closely matches the reference image in terms of color, while also reducing the need for user guidance.
Fully Automatic Colorization
Recently, deep learning has gradually been applied to image colorization with the development of deep neural networks and datasets. By learning features from both the source and target domains, a color-mapping relationship is formed. Cheng et al. (2015) first employed CNNs to extract features from grayscale images. The image is colored based on features extracted from various patches and their respective regions within the image. Zhang et al. (2016) regarded image colorization as a self-supervised representation learning task and made progress in the domain of automatic colorization. They developed a method to address the multimodal challenges of colorization and investigated the diversity of possible color outcomes. Iizuka et al. (2016) pretrained a network with class labels for classification tasks on the ImageNet dataset. They extracted global semantic features that were later combined with intermediate features for color prediction. Su et al. (2020) proposed a hybrid learning method that combines features from an instance branch and a full-image branch, achieving fully automatic, instance-aware colorization of multiple objects.
GANs
In 2014, Goodfellow et al. (2020) proposed the GAN for unsupervised data generation; it has since shown strong performance in computer vision tasks. The GAN framework consists of a generator and a discriminator. The generator produces synthetic data from random noise inputs, while the discriminator's task is to differentiate between real and synthetic data. Adversarial training allows these two networks to continuously enhance their performance through mutual competition. The distinct architecture and training approach of GANs have driven their success in super-resolution reconstruction (Wang et al., 2018), image generation, image conversion, and style transfer (Andreini et al., 2020). GANs have also been applied to tackle the challenge of image colorization. Yoo et al. (2019) introduced an enhanced memory network and leveraged GANs to achieve colorization from small sample sets. Isola et al. (2017) improved upon GANs by enhancing both the objective function and the network architecture; additionally, they introduced constraints within the loss function to handle a variety of image transformation tasks. Vitoria et al. (2020) utilized semantic class distribution information in their ChromaGAN approach to direct image colorization. Zhao et al. (2020) proposed a composite colorization network that simultaneously predicts colorization and saliency maps, effectively reducing semantic confusion and color bleeding.
Attention Mechanism
In recent years, attention mechanisms (Niu et al., 2021) have been widely applied in various deep learning tasks, such as image recognition, image generation, and 3D vision. The attention mechanism evaluates the importance of different features through weighted analysis to highlight relevant information and ignore irrelevant data. Attention models primarily include the spatial attention model, the channel attention model, and the fusion attention model. The spatial attention model selectively captures spatial dependencies between positions to evaluate the importance of each location. The channel attention model amplifies or diminishes channels according to the significance of each channel. The fusion attention model integrates both the spatial and channel attention mechanisms.
LLL Images
In LLL environments, conventional visible-light cameras do not work properly. The MPPC (Li et al., 2021; Aull et al., 2014; Verghese et al., 2007), in contrast, relies on avalanche-effect imaging and can accurately capture details in very faint environments without underexposure, giving it a great advantage in capturing LLL images. However, the LLL images captured by an MPPC are usually grayscale images lacking color information and therefore missing many important details; this limits their visualization effect in practical applications and the degree to which the human eye can extract image information. Colorization of LLL images is therefore of great practical significance, especially in the military field, biomedicine, night-time automatic driving, and ocean exploration (Deng et al., 2023). Colorization can convert the original low-contrast LLL images into colorful images that conform to, or approximate, human visual perception and improve the ability to discriminate targets.
Method
BM-GAN Architecture
The proposed BM-GAN architecture, shown in Figures 2 and 3, is divided into a generator and a discriminator. The generator network employs an encoder–decoder structure. It is composed of three main blocks: the bi-stream feature extraction block (BFEB), the main colorization network (MCN), and the MSAB. In BM-GAN, the generator takes a grayscale image as input and predicts the corresponding colorized image.

Architecture of the proposed BM-GAN generator. From left to right, the network consists of a BFEB, an MCN, and an MSAB. It receives grayscale images as input and predicts corresponding colorized images. Note. BM-GAN = bi-stream feature extraction and multiscale attention generative adversarial network; BFEB = bi-stream feature extraction block; MCN = main colorization network; MSAB = multiscale attention block.

Architecture of the proposed BM-GAN discriminator. Note. BM-GAN = bi-stream feature extraction and multiscale attention generative adversarial network.
The discriminator adopts a convolutional classification structure (Figure 3) that distinguishes the generated colorized images from the ground truth color images.
A BFEB is proposed to integrate global and local features to minimize semantic confusion and color bleeding. As illustrated in Figure 2, the MCN is based on the U-Net structure (Ronneberger et al., 2015). The skip connections between the encoder and decoder directly map feature details from the downsampling layers to the upsampling layers. This structure effectively prevents gradient vanishing and speeds up network convergence.
Specifically, the BFEB consists of two parallel encoders (E1 and E2) in BM-GAN. E1 extracts global features from the input image using the first 16 layers of the VGG19 network. E2, as part of the colorization network, extracts local features through the first five layers of the MCN. The extracted global features are merged with the local features to guide the image colorization, as sketched below.
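A minimal PyTorch sketch of this bi-stream design, assuming illustrative channel widths and that the two streams reach the same spatial resolution before fusion; the paper's exact layer configuration and fusion operator are not reproduced here:

```python
import torch
import torch.nn as nn
from torchvision import models

class BFEB(nn.Module):
    """Bi-stream feature extraction sketch: a global VGG19 stream (E1)
    fused with a local stream (E2) from the main colorization network."""
    def __init__(self):
        super().__init__()
        # E1: first 16 layers of VGG19 as a frozen global feature extractor
        # (outputs 256 channels at 1/4 resolution; ImageNet input
        # normalization is omitted for brevity).
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1)
        self.e1 = vgg.features[:16]
        for p in self.e1.parameters():
            p.requires_grad = False
        # E2: first downsampling stages of the MCN encoder
        # (hypothetical widths; each stride-2 conv halves the resolution).
        def down(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 4, stride=2, padding=1),
                nn.InstanceNorm2d(cout),
                nn.LeakyReLU(0.2, inplace=True))
        self.e2 = nn.Sequential(down(1, 64), down(64, 128))
        # Fusion layer: concatenate both streams, mix with a 1x1 conv.
        self.fuse = nn.Conv2d(256 + 128, 256, kernel_size=1)

    def forward(self, gray):                   # gray: (B, 1, H, W)
        g = self.e1(gray.repeat(1, 3, 1, 1))   # VGG19 expects 3 channels
        l = self.e2(gray)                      # local stream, also 1/4 res
        return self.fuse(torch.cat([g, l], dim=1))
```

Freezing E1 (an assumption of this sketch) keeps the pretrained global semantics stable while E2 trains end to end with the rest of the generator.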
The schematic diagram of the fusion layer is shown in the red dashed box in Figure 2. Owing to its small receptive field, the U-Net network primarily captures local features that highlight image details. By incorporating global features from the pretrained VGG19 network, the colorization network acquires richer image information. This integration allows the network to assign colors more accurately to objects in the image and effectively alleviates color bleeding.
Multiscale Attention Block
To achieve high-quality grayscale image colorization, MSABs are integrated into the network. These blocks enable the network to emphasize channel and spatial information across varying scales of each feature map. The structure is shown in Figure 4. A spatial attention structure is incorporated into the generator; it assigns importance to different spatial positions, focusing on the most relevant areas of the image. Channel attention enables the network to automatically learn the dependencies of each feature channel. By introducing the MSAB, the ability of the network to learn feature channel interdependencies and spatial correlations is strengthened.

Architecture of the multiscale attention block.
Let the input feature map be denoted as $F \in \mathbb{R}^{C \times H \times W}$. In the channel attention mechanism, the spatial dimensions of $F$ are squeezed by global pooling, and the resulting channel weights rescale $F$ to produce the channel-refined feature map $F_c$. In the spatial attention mechanism, the feature map is pooled along the channel dimension, and the resulting spatial weight map rescales $F$ to produce the spatially refined feature map $F_s$. The final feature map is obtained by the dot product of the feature maps from the channel and spatial dimensions. This adaptive attention block can be formulated as follows:

$$F_{out} = F_c \odot F_s$$

where $\odot$ denotes element-wise multiplication.
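A minimal PyTorch sketch of this block, assuming a CBAM-style construction; the 7×7 spatial kernel and the reduction ratio `r` are assumptions, and the paper's exact multiscale configuration is not reproduced:

```python
import torch
import torch.nn as nn

class MSAB(nn.Module):
    """Multiscale attention sketch: channel and spatial attention applied
    in parallel, fused by element-wise product as in the equation above."""
    def __init__(self, channels, r=16):
        super().__init__()
        # Channel attention: squeeze spatial dims, then weight each channel.
        self.ca = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, channels, 1), nn.Sigmoid())
        # Spatial attention: weight each position from channel-pooled maps.
        self.sa = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, f):                              # f: (B, C, H, W)
        f_c = f * self.ca(f)                           # channel-refined F_c
        pooled = torch.cat([f.mean(dim=1, keepdim=True),
                            f.amax(dim=1, keepdim=True)], dim=1)
        f_s = f * self.sa(pooled)                      # spatially refined F_s
        return f_c * f_s                               # F_out = F_c (.) F_s
```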
Loss Function
The loss function is an important reference for adjusting the weights in the network. To achieve accurate colorization, the loss function consists of two parts: a generator loss and a discriminator loss. The loss functions of a traditional GAN are given in equations (5) and (6):
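In standard form, writing $x$ for the input grayscale image, $y$ for the ground truth color image, $G$ for the generator, and $D$ for the discriminator, these objectives can be written as:

$$\mathcal{L}_{G}^{adv} = \mathbb{E}_{x}\big[\log\big(1 - D(G(x))\big)\big] \quad (5)$$

$$\mathcal{L}_{D} = -\mathbb{E}_{y}\big[\log D(y)\big] - \mathbb{E}_{x}\big[\log\big(1 - D(G(x))\big)\big] \quad (6)$$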
To enable the network model to generate high-resolution colorized images that closely resemble the ground truth images, a pixel-level reconstruction loss between the generated image and the ground truth image is added as equation (7):
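In Pix2pix-style colorization networks, this reconstruction term is typically the L1 distance between the prediction and the ground truth; a standard form is:

$$\mathcal{L}_{L1} = \mathbb{E}_{x,y}\big[\lVert y - G(x) \rVert_{1}\big] \quad (7)$$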
Therefore, to enhance the colorization effect of the network, a perceptual loss function is introduced. It helps align the colorized image more closely with the ground truth image in terms of visual appearance. In contrast to traditional pixel loss, perceptual loss relies on features extracted from a pretrained VGG19 network as the loss metric. It measures the perceptual similarity of an image by comparing the differences in these extracted features. The perceptual loss function is given in equation (8):
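Writing $\phi_j$ for the feature map of the $j$-th selected VGG19 layer, with $C_j \times H_j \times W_j$ elements, a standard form of the perceptual loss is:

$$\mathcal{L}_{perc} = \mathbb{E}_{x,y}\bigg[\sum_{j}\frac{1}{C_j H_j W_j}\big\lVert \phi_j(G(x)) - \phi_j(y)\big\rVert_2^2\bigg] \quad (8)$$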
The total loss function of the generator includes equations (5), (7), and (8), and the expression for the new generator loss function is equation (9):
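With the terms defined above and weighting coefficients $\lambda_1$ and $\lambda_2$ (hyperparameters whose values are an assumption of this sketch), the combined generator objective can be written as:

$$\mathcal{L}_{G} = \mathcal{L}_{G}^{adv} + \lambda_{1}\,\mathcal{L}_{L1} + \lambda_{2}\,\mathcal{L}_{perc} \quad (9)$$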
To enhance the convergence speed and stability of the network model, a gradient penalty (GP) term (Arjovsky et al., 2017) is used to constrain the discriminator's gradient. It performs linear interpolation between real and generated samples and computes the norm of the discriminator's gradient at these points. This norm is then compared to the constant 1 (the 1-Lipschitz constraint) and added as a penalty term to the loss function of the discriminator. The new discriminator loss function can be expressed as equation (10):
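Writing $\hat{y} = \epsilon y + (1-\epsilon)\,G(x)$ with $\epsilon \sim U[0,1]$ for the interpolated samples and $\lambda_{gp}$ for the penalty weight, the penalized discriminator loss takes the standard gradient-penalty form:

$$\mathcal{L}_{D}' = \mathcal{L}_{D} + \lambda_{gp}\,\mathbb{E}_{\hat{y}}\Big[\big(\lVert \nabla_{\hat{y}} D(\hat{y}) \rVert_{2} - 1\big)^{2}\Big] \quad (10)$$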
Experiments
In this section, the Summer2winter and ImageNet datasets are first introduced and the experimental parameter settings are explained. Then, comparative experiments are conducted to analyze the results obtained on each dataset. Finally, ablation studies are carried out on the loss functions and blocks to validate the effectiveness of the proposed method.
Datasets
Summer2winter
The Summer2winter dataset from CycleGAN was selected for the experiments (Shu et al., 2019). The dataset contains high-resolution color RGB images, with 1231 images in the training set and 309 in the test set.
ImageNet
To evaluate the colorization quality, 2000 images were randomly selected from the ImageNet dataset (Russakovsky et al., 2015) as the test set. The high-resolution images in the ImageNet dataset are particularly suitable for image colorization tasks.
Implementation Details
In the network architecture of BM-GAN, dropout is set to 0.5, and instance normalization is used as the normalization operation. During the training phase, the Adam optimizer is used with an initial learning rate of 0.0002. The batch size is 8, and the network is trained for 200 epochs.
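These settings correspond to the following PyTorch setup (a sketch; the Adam betas and the stand-in modules are assumptions):

```python
import torch.nn as nn
from torch import optim

# Stand-ins for the BM-GAN generator and discriminator, only so the
# optimizer setup below is runnable; they are not the real networks.
generator = nn.Conv2d(1, 2, 3, padding=1)
discriminator = nn.Conv2d(3, 1, 3, padding=1)

# Stated hyperparameters: Adam with lr=0.0002, batch size 8, 200 epochs.
# betas=(0.5, 0.999) is an assumed, common GAN choice.
lr, batch_size, epochs = 2e-4, 8, 200
g_opt = optim.Adam(generator.parameters(), lr=lr, betas=(0.5, 0.999))
d_opt = optim.Adam(discriminator.parameters(), lr=lr, betas=(0.5, 0.999))
```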
The network was implemented in Python 3.7 using the PyTorch framework. The discriminator and generator were trained alternately until the BM-GAN converged. The experiment was carried out on a Windows 11 operating system, equipped with an Intel (R) Core (TM) i5-13500X CPU @ 2.50 GHz, and an NVIDIA GeForce RTX 4060 graphics card. A GPU-based Python deep learning environment was set up for the training and experimentation.
Evaluation Metrics
To objectively reflect the performance of each colorization method on different datasets, peak signal-to-noise ratio (PSNR), structural similarity (SSIM), learned perceptual image patch similarity (LPIPS), Fréchet inception distance (FID), and information entropy (IE; Hore and Ziou, 2010; Wang et al., 2004; Tsai et al., 2008; Kumar et al., 2021) are chosen as evaluation indicators to quantitatively analyze the colorization results. PSNR is a commonly used objective measure for image evaluation; a larger PSNR value indicates smaller image distortion. The metric measures the difference between the generated image and the corresponding real image and is frequently applied in quantitative assessments of image colorization, defogging, denoising, and related tasks. Given a clean image $I$ and a noisy image $K$, both of size $m \times n$, the mean squared error (MSE) and PSNR are defined as

$$MSE = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\big[I(i,j) - K(i,j)\big]^2,$$

$$PSNR = 10\,\log_{10}\!\left(\frac{MAX_I^2}{MSE}\right),$$

where $MAX_I$ is the maximum possible pixel value (255 for 8-bit images).
SSIM evaluates the quality of generated images based on brightness, contrast, and structural composition. The larger the value, the more similar the image structures; if the images are identical, the SSIM value equals 1. The standard formula for SSIM is

$$SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)},$$

where $\mu_x$ and $\mu_y$ are the mean intensities, $\sigma_x^2$ and $\sigma_y^2$ the variances, $\sigma_{xy}$ the covariance, and $c_1$, $c_2$ small constants that stabilize the division.
LPIPS is a learning-based metric for perceptual similarity that matches human perception better than traditional methods. The lower the LPIPS value, the higher the degree of similarity. In its standard form,

$$LPIPS(x, y) = \sum_{l}\frac{1}{H_l W_l}\sum_{h,w}\big\lVert w_l \odot \big(\hat{f}^{\,l}_{hw}(x) - \hat{f}^{\,l}_{hw}(y)\big)\big\rVert_2^2,$$

where $\hat{f}^{\,l}_{hw}$ are unit-normalized deep features at layer $l$ and position $(h, w)$, and $w_l$ are learned channel weights.
FID measures the quality of colorization by evaluating the difference between the feature distributions of the generated and real images; it reflects the similarity between the two. The standard formula for FID is

$$FID = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big),$$

where $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the mean and covariance of the Inception features of the real and generated images.
IE represents the average amount of information contained in an image and is a key indicator to measure the richness of image information. The greater the IE, the richer the image information and the better the image quality.
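For an 8-bit grayscale image whose normalized histogram gives gray-level probabilities $p_i$, IE is the Shannon entropy:

$$IE = -\sum_{i=0}^{255} p_i \log_2 p_i$$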
Comparison Experiments
To verify the effectiveness and superiority of the proposed algorithm, it was compared to existing state-of-the-art methods (Iizuka et al., 2016; Yi et al., 2017; Larsson et al., 2016; Antic, 2019; Su et al., 2020; Lei and Chen, 2019; Vitoria et al., 2020; Kumar et al., 2021, 2023). All experiments were carried out in the same software and hardware environment. Each method was trained for 200 epochs on the Summer2winter and ImageNet datasets. The results are shown in Figures 5 and 6.

Comparison of colorization results of different methods on the Summer2winter dataset (the first and last columns represent input and ground truth images).

Comparison of colorization results of different methods on the ImageNet dataset (the first and last columns represent input and ground truth images).
Figure 5 illustrates the qualitative results of the proposed BM-GAN and other methods on the Summer2winter dataset. For example, in line 4, Zhang et al. and Larsson et al. produce grass colorization that does not match the ground truth image and exhibits color bleeding, while the methods of DeOldify and Su et al. yield images with low color saturation. The proposed method prevents color bleeding. In line 2, the other methods miss color details, ignoring the grass in the lower left corner; the colors of the mountains and sky also differ from the real image. The proposed method not only restores the color details of the grass but also keeps the overall color scheme closer to the real image. From this comparison, it is evident that the proposed method alleviates color bleeding and detail loss to a certain extent while preserving more color information.
Figure 6 illustrates the qualitative results of the proposed BM-GAN and other methods on the ImageNet dataset. For example, in line 3, the images produced by Zhang et al. and Vitoria et al. are closer to the real image, but the former suffers from serious color bleeding; the method of Vitoria et al. mitigates the color bleeding but does not assign the correct color to the walls. The proposed method ensures correct colorization without color bleeding. In lines 4, 5, and 7, the images obtained by Lei et al. are discontinuous and present an unnatural colorization effect. Although the method of Su et al. performs better than those of Iizuka et al. and Zhang et al., it still does not fully match the real image. In contrast, the proposed method more accurately restores the colors of the puppy, grass, and buildings, achieving the highest similarity to the real image. In line 6, the images produced by Lei et al. and Vitoria et al. show obvious color bleeding. BM-GAN effectively captures detailed information and reduces the error distance between the generated and real images.
The results of the image colorization methods are quantitatively evaluated using the evaluation indicators introduced in the Evaluation Metrics section, as shown in Table 1. On the Summer2winter dataset, the proposed method reaches a PSNR of 24.571, an SSIM of 0.928, an LPIPS of 0.086, and an FID of 30.22; on the ImageNet dataset, it reaches a PSNR of 24.553, an SSIM of 0.935, an LPIPS of 0.123, and an FID of 33.89. The PSNR of the proposed method is not the highest, which is expected given the weak correlation between PSNR and the human visual system. SSIM was introduced to evaluate structural similarity and to address the limitation that PSNR does not fully capture the alignment between image quality and human visual perception (Wang et al., 2018). The SSIM index of BM-GAN is significantly better than that of the other methods, so the generated images are closer to the real images. The proposed method also has the lowest LPIPS, and its lower FID value indicates that its results are closer to the real images.
Comparison of Colorization Results of Different Methods on the Summer2winter and ImageNet Datasets.
On both datasets, BM-GAN ranks first in the SSIM metric, which means it most accurately preserves the perceptual structure of the reconstructed image; the proposed method generates results consistent with the real image. The IE value of BM-GAN is also higher than that of the other methods, indicating that the color distribution of the generated images is more diverse; a low IE means that the color distribution is more concentrated and lacks diversity.
These quantitative indicators show that, for the same test image, BM-GAN not only produces authentic colors but also natural transitions between them. Compared with the other methods, the color images generated by BM-GAN achieve a better overall effect.
The validity of the proposed method is further verified by comparing it with the method of Su et al. on the ImageNet test set (10k). The experimental results are shown in Figure 7. BM-GAN still performs better on this test set because the images it produces have clear boundaries and rich textures. The colorization results of BM-GAN are visually closer to the ground truth images, especially in regions containing sky and grass. In addition, the quantitative results are shown in Table 2. The proposed method not only outperforms Su et al. in PSNR, SSIM, and LPIPS, but also reaches an FID of 36.72.

Comparison of colorization results with Su et al. on the ImageNet test set (10k) (the first and last columns represent input and ground truth color images).
Comparison of Colorization Results of Different Methods on the ImageNet ctest 10k.
Quantitative metrics are intended to assess the quality of a method objectively. In image colorization, however, the goal is not to reproduce fixed colors but to produce colors that are as realistic as possible. A method is considered effective when its colorization results offer rich color information that enhances the expressive quality of the image.
Therefore, to establish an image evaluation metric based on user judgment, a number of color images were randomly selected from those generated by DeOldify and BM-GAN. Fifty participants were invited, and seven pairs of images produced by the two methods were randomly selected for the experiment. Without being told which method generated which image, each participant viewed the two colorizations of the same grayscale image and was given 5 s to choose the one they considered more natural and visually harmonious. Participants were never informed of the model used to generate the images.
The results of the user-based image quality assessment are presented in Figure 8. The subjective willingness scores from users show that the colorization effect generated by BM-GAN is more highly recognized compared to that of DeOldify et al. This indicates that the colorization achieved by the proposed method is superior and more consistent with the human visual experience.

Comparison of user evaluations of different method colorization experiments.
Ablation Experiments
To further confirm the effectiveness of each block, experiments were conducted with various network structures to evaluate how each block influences the overall performance of the method.
Ablations of Different Blocks
Ablation experiments were conducted on the BFEB, the loss function, and the MSAB.
Figures 9 and 10 show the results of the ablation experiments on both datasets, demonstrating the clear benefits of the proposed method. The PSNR, SSIM, and LPIPS values of the test results on the Summer2winter and ImageNet datasets are shown in Tables 3 and 4. Compared to the baseline network, the network with each block added shows improvements across all metrics. Each added block significantly reduces miscoloring and color bleeding, and the colorization effect is more realistic and natural, contributing to greater overall color harmony. Removing the BFEB leads to inaccurate colorization with visible color bleeding, which considerably degrades the PSNR, LPIPS, and SSIM values; this demonstrates that the BFEB enhances the colorization effect. The MSAB fuses channel and spatial information, which helps the network better preserve the detailed features of the image, and removing it significantly degrades the performance of the colorization network. Moreover, the proposed loss function allows the network to capture high-level semantic information, ensuring that the colorization remains consistent with the real image.

Comparison of results for different network architectures on the Summer2winter dataset.

Comparison of results for different network architectures on the ImageNet dataset.
Comparison of Ablation Experiments on the Summer2winter Dataset.
Comparison of Ablation Experiments on the ImageNet Dataset.
The impact of each loss function is also examined, and the comparison results are shown in Figure 11 and Table 5. The figure shows the colorization effect of the different loss functions and the corresponding LPIPS maps. The color in each map corresponds to the LPIPS value, which reflects the perceived similarity between the generated image and the real image: a color closer to dark purple indicates a lower LPIPS value and a smaller difference between the generated and real images, whereas a color closer to yellow signifies a higher LPIPS value and a greater difference. Visualizing LPIPS provides an intuitive way to identify significant differences in features or structures between the generated and real images and makes it easier to observe the perceived similarity across the entire image.

Comparison of results for different loss functions on the Summer2winter and ImageNet datasets.
Comparison of Ablation Experiments With Different Loss Functions on the Summer2winter and ImageNet Datasets.
Compared to the baseline network, the addition of the perceptual loss enhances the alignment of the generated image with the real image in terms of color, texture, and structure.
LLL Image Colorization
To verify the practical value of the proposed BM-GAN, it was used to colorize LLL images. Based on the MPPC LLL experiment platform, several sets of LLL images were captured in an LLL environment. The LLL experiment platform mainly consists of the MPPC detector, a two-dimensional guide, a lens hood, computers, optical fibers, and connecting cables; its schematic block diagram is shown in Figure 12. First, a small amount of light reflected from the target object is refracted through an optical lens and converges at the rear of the lens to form a two-dimensional image plane. Next, the two-dimensional guide directs the MPPC detector to scan the image plane point by point and capture the photon count at each point. Finally, the captured data are transmitted to the computer via optical fiber and restored to an LLL image by image restoration.

Unlike traditional paired techniques, LLL image colorization takes low signal-to-noise ratio images as input. Therefore, utilizing an attention mechanism to improve the adaptability of the network to low signal-to-noise ratios and uneven brightness is essential for LLL image colorization. The color accuracy of the generated images may not fully match that of paired techniques, but they offer valuable information under extreme conditions. The core idea of LLL image colorization is to apply convolution and multiple downsampling steps to the input grayscale image to obtain its high-level features. These features are used to infer the color information of the a and b channels, which are combined with the L channel to generate the final image. Figure 13 shows the difference between the conventional paired technique and the LLL image colorization technique.
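A minimal sketch of this Lab-space step, assuming a trained generator `net` that maps a normalized L channel to normalized ab channels (the interface and normalization ranges are assumptions):

```python
import numpy as np
import torch
from skimage import color

def colorize_lll(net, gray_u8):
    """Colorize a (H, W) uint8 LLL image via the Lab pipeline above."""
    L = gray_u8.astype(np.float32) / 255.0 * 100.0        # L channel in [0, 100]
    x = torch.from_numpy(L / 50.0 - 1.0)[None, None]      # normalize to [-1, 1]
    with torch.no_grad():
        ab = net(x)[0].numpy() * 128.0                    # predicted ab channels
    lab = np.stack([L, ab[0], ab[1]], axis=-1)            # combine with L channel
    return (color.lab2rgb(lab) * 255).astype(np.uint8)    # final RGB image
```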

MPPC platform. Note. MPPC = multipixel photon counter.

The difference between the conventional paired technique and the LLL image colorization technique. Note. LLL = low-light-level.
Colorization experiments are performed on the LLL images and compared to existing state-of-the-art methods. Ideally, to achieve the best LLL colorization effect, the network should be trained on LLL images paired with corresponding color images. However, due to experimental constraints, grayscale images and their corresponding color images from the dataset were adopted as the training set. Although an LLL image is also a grayscale image, it differs somewhat from the grayscale images in the dataset. If a neural network model trained on dataset grayscale images can still successfully colorize LLL images, this further validates the generalization ability of BM-GAN. The LLL image colorization results are shown in Figure 14.

Comparison of colorization results of different algorithms on LLL images. Note. LLL = low-light-level.
The colorized LLL image provides a more visually comfortable experience and makes the contained information easier to discern. In addition, the colorization results show that BM-GAN is well-adapted and of high practical value.
To further validate the performance of BM-GAN, colorization was performed on SAR images from the LS-SSDD-v1.0 dataset (Xu et al., 2022a, 2022b, 2022c; Zhang et al., 2021; Zhang et al., 2020). Some of the results are shown in Figure 15. Although SAR images are widely used in remote sensing for their powerful ground imaging capabilities, their visual analysis and interpretation are often constrained because they are typically presented in grayscale. The features and color details emphasized in SAR images differ from those in the training set used in this study. Nonetheless, BM-GAN produces more realistic colors, which reflects its strong fitting ability and consistent performance across different image datasets. The conversion between SAR and optical images offers observers detailed structural and scene information that enhances SAR image visualization.

Comparison of colorization results of different algorithms on SAR images. Note. SAR = synthetic aperture radar.
Conclusion
This paper proposes an innovative colorization method for grayscale images called BM-GAN. The network enhances its feature extraction capabilities by leveraging a BFEB to integrate both global and local features. Additionally, an MSAB is employed to highlight features related to colorized objects in both the channel and spatial domains. The block improves the ability of the network to perceive texture and color. A composite loss function is constructed for training, which reduces the gap between generated and real images. BM-GAN is applied to LLL and SAR images to improve human target recognition and scene interpretation. Experiments demonstrate that BM-GAN effectively addresses issues such as semantic confusion, color bleeding, inaccurate colorization, and loss of image details. Nevertheless, certain limitations exist: in images with blurred edges, the colorization results may deviate from the real image.
Author Contributions
Xiaoning Gao contributed to conceptualization, methodology, software, writing, reviewing, and editing. Liju Yin contributed to visualization, investigation, and supervision. Yulin Deng contributed to data curation, software, and validation. Feng Wang contributed to writing—original draft preparation. Yiming Qin and Meng Zhang contributed to software.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Natural Science Foundation of Shandong Province, China (ZR2020MF127).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
