Sage Journals: Discover world-class research

Abstract

The degradation of images captured in hazy weather can severely affect practical applications. However, most existing learning-based methods ignore the varied haze distribution in an image, resulting in incomplete dehazing of some areas. Also, the presence of haze can blur the textures and details, which will heavily impact the subsequent image processing. In this paper, we propose a transformer-based framework for dehazing tasks called HITFormer. Firstly, we introduce a texture recovery and enhance module as a preprocess to strengthen details. Then, we propose an adaptive haze intensity prediction subnet to predict the haze intensity of different areas. Lastly, we use a semantic-based luminance and chrominance adjustment module to fuse the feature maps in YUV color space and form a transformation coefficient to get a recovery image. The extensive experiments demonstrate that our HITFormer achieves state-of-the-art performance on several image dehazing datasets.

Keywords

image dehazing haze intensity sensing texture enhance transformer

1. Introduction

In recent years, image dehazing has become increasingly important due to its critical role in various applications such as autonomous driving, surveillance systems, and remote sensing. Hazy weather conditions, characterized by the presence of a large number of tiny suspended particles in the air (dust, water droplets, smoke, etc.), significantly degrade image quality by absorbing or scattering light. This degradation manifests as reduced contrast and blurred details, which in turn adversely affects the performance of subsequent advanced vision tasks such as image classification, target tracking, and detection. Despite numerous advancements, current dehazing methods often struggle with two significant challenges: the accurate restoration of fine details and the handling of heterogeneous haze distributions in real-world scenes.

Traditional image dehazing methods, such as those based on handcrafted priors (Fattal, 2014; He et al., 2011; Ju et al., 2021; Zhu et al., 2015), estimate the parameters of the hazy image formation model to recover haze-free images. However, these methods often struggle with regions that do not satisfy the assumed priors, leading to incorrect parameter estimation and unwanted artifacts. More recently, deep learning approaches (Guo et al., 2022; Liu et al., 2023; Song et al., 2023; Ye et al., 2022) have leveraged the robust feature representation capabilities of convolutional neural networks (CNNs) to enhance dehazing performance. Early CNN-based methods (Cai et al., 2016; Zhang & Patel, 2018) improved parameter estimation accuracy, while current trends focus on end-to-end mappings between hazy and haze-free images, achieving superior results (Dong et al., 2020; Liu et al., 2019; Wu et al., 2021). Additionally, the advent of vision transformers (ViTs; Liu et al., 2021; Wang et al., 2021; Dosovitskiy et al., 2021) has provided a compelling alternative to CNNs, showcasing powerful modeling capabilities in a variety of computer vision tasks. It is important to acknowledge, however, that while ViTs excel in many areas, CNNs still outperform them in certain specific tasks due to their established architecture and specialized design. Moreover, significant challenges remain unaddressed by current methods, including the degradation of edge and detail information and the varied distribution of haze in complex scenes, which are often overlooked.

To the best of our knowledge, there are two reasons that limit the effectiveness of current dehazing methods. First, due to the presence of haze, the edge and detail information of hazy images is usually degraded, leading to a loss of information in the recovered haze-free images. Second, the distribution of haze can vary significantly in real-world scenes, but most existing dehazing networks only extract semantic features directly related to the dehazing task, ignoring the local variations in haze intensity. This results in suboptimal dehazing performance in dense haze regions. Additionally, most methods process hazy images in the red–green–blue (RGB) color space, whereas we propose to extract features and process hazy images in the YUV domain to fully utilize brightness and energy information, enhancing the dehazing process.

In order to solve the problem of loss of detail in hazy images and incomplete image dehazing due to uneven distribution of haze, we propose a novel transformer-based framework for dehazing tasks called HITFormer. Our contributions are summarized as follows:

We propose the texture recovery and enhance (TRE) module as a preprocessing step, which enhances details that are blurred by haze.

We design the adaptive haze intensity prediction (AHIP) subnet to predict the haze intensity for each image patch, allowing the model to focus on areas with higher haze concentrations, thus improving overall dehazing performance.

We transform the RGB input hazy image to the YUV color space, enabling the model to extract luminance and chrominance features at different stages and use a constant transformation coefficient to recover the haze-free image.

The rest of this paper is organized as follows: Section 2 discusses the related work. Section 3 presents the proposed method. Then, Section 4 reports and analyzes the relevant experimental results. Section 5 presents a conclusion and discusses the limitations and future work of this paper.

2. Related Works

2.1. Prior-Based Methods

Prior-based methods are based on the hazy image formation model. The most widely used model is the atmospheric scattering model (ASM) McCartney et al. (1977). As shown in Figure 1, the light intensity obtained by the imaging device under hazy conditions consists of two parts. The first part is the direct attenuation of the reflected light energy caused by the suspended particles in the atmosphere, which can cause a decrease in image brightness and contrast. The second part is airlight, which describes the scattered atmospheric light (airlight) that reaches the imaging device and participates in the formation of a hazy image, leading to the details of the image being blurred.

Figure 1.

Atmospheric scattering model.

The ASM can be expressed mathematically as follows, which includes the two parts mentioned above:

E (d, λ) = E_{0} (λ) e^{- β (λ) d} + E_{\infty} (λ) (1 - e^{- β (λ) d})

(1)

where

E_{0} (λ)

is the light intensity before attenuation,

λ

is the wavelength of light,

β (λ)

is the scattering coefficient of particles in the atmosphere,

d

is the distance between the imaging device and the target, and

E_{\infty} (λ)

is the light intensity in the infinite distance, which is usually a constant.

Let $I (x) = E (d, λ)$ , $J (x) = E_{0} (λ)$ , $A = E_{\infty} (λ),$ and $t (x) = e^{- β (λ) d}$ and equation (1) can be rewritten as a more general form of ASM:

I (x) = J (x) t (x) + A [1 - t (x)]

(2)

where

x

is the coordinate of a pixel in an image,

I (x)

is the hazy image,

J (x)

is the scene radiance that represent the haze-free image, and

t (x)

is the transmission.

A

is the atmospheric light at infinity.

In equation (2), there is only one known quantity $I (x)$ ; thus, solving for one of the unknowns is a pathological problem. Prior-based methods use handcraft priors to estimate $t (x)$ and $A$ and then derive the haze-free image $J (x)$ from equation (2). This type of method has been developed rapidly in recent years. Zhu et al. (2015) proposed color attenuation prior, which constructs a linear model and uses luminance and saturation to estimate scene depth. Berman et al. (2016) proposed a nonlocal prior, using haze lines to recover the distance map and the haze-free image. Ju et al. (2020) proposed gamma correction prior and used it to process the input hazy image.

The most well-known prior knowledge for image dehazing is dark channel prior (DCP) proposed by He et al. (2011). After counting a large number of outdoor haze-free images, they found that some pixels always have at least one color channel with very low values. Using this prior information achieved an excellent dehazing effect. Many modified methods are based on DCP for improvement (Hsieh et al., 2018; Pei & Lee, 2012; Zhang et al., 2018). Although these methods are effective, the prior assumption does not always hold in real complex scenarios. They tend to output inaccurate parameter estimation when the regions of the image do not satisfy the priors, which can lead to unsatisfactory dehazing effects.

2.2. Deep Learning-Based Methods

As deep learning becomes increasingly popular, scholars have tried to use CNN to predict parameters of ASM, which can output more accurate results compared to prior-based methods. Cai et al. (2016) proposed an end-to-end dehazing network, DehazeNet, which estimates the transmission map and finally uses ASM to get a haze-free image. Ren et al. (2016) proposed MSCNN, which uses coarse-scale and fine-scale networks to estimate the transmission map, avoiding the loss of image details. Li et al. (2017) integrated transmission $t (x)$ and atmospheric light $A$ into one variable and used lightweight CNNs, AOD-Net, to restore the haze-free image. The above methods obtain haze-free images according to ASM, which is essentially an extension of the prior-based method. To avoid inaccurate estimation of transmission maps and atmospheric light, some CNN-based methods do not rely on parameter estimation and directly recover hazy images to haze-free images. For example, Liu et al. (2019) proposed an end-to-end trainable CNN, GridDehazeNet, which implements an attention-based multiscale estimation and does not rely on ASM. Dong et al. (2020) proposed MSBDN with dense feature fusion based on the U-Net architecture. Qin et al. (2020) proposed FFA-Net, which introduces a new feature attention module combining channel attention with pixel attention, which extends the representation capability of CNNs.

2.3. Vision Transformer

As transformer continues to demonstrate powerful modeling capabilities in natural language processing, more and more researchers apply them to vision tasks. The ViT (Dosovitskiy et al., 2021) converts images into a sequence of image patches and feds them into the transformer, which has an excellent effect on image classification tasks. However, it is difficult to apply ViT directly to some downstream tasks due to its large computational resource requirements and weak inductive bias. Many modified frameworks based on ViT have been proposed. For example, pyramid ViT (Wang et al., 2021) introduced a pyramid structure to output multiple different levels of feature, which enabled the processing of high-resolution images more efficiently and granted access to various downstream tasks. Chen et al. (2021) proposed an image processing transformer for low-level vision tasks, which applied a structure with multiple heads and tails for different image processing tasks. Swin transformer (Liu et al., 2021) provided a more general backbone for various computer vision tasks. It constructed hierarchical feature maps that enable the model to handle images of different scales. Also, it proposed shifted window self-attention, which introduced cross-connections between windows to improve the performance of the model and reduce computational complexity. A great deal of work has demonstrated that the swin transformer has excellent performance on different visual tasks and outperforms most CNN-based methods.

3. Method

The overall architecture of the HITFormer is presented in Figure 2. Firstly, we convert an input RGB hazy image to YUV color space and enhance its local details via a TRE module. Then, the processed image is split into nonoverlapping patches, and several swin transformer blocks are applied to them to extract global features. In this process, we introduce an AHIP subnet to predict the haze intensity of each patch. At last, the haze-free image is obtained through a semantic-based luminance and chrominance adjustment (SLCA) module.

Figure 2.

The overall architecture of the HITFormer. Our model is a modified swin transformer, and each of the four stages is illustrated with the dashed box containing a swin transformer block. Our method incorporates three important components: TRE module, AHIP subnet, and SLCA module. $I (x)$ is the RGB input hazy image, $I_{YUV} (x)$ is the YUV input hazy image, ${\hat{J}}_{Y U V} (x)$ is the YUV haze-free image obtained by the model, and $\hat{J} (x)$ is the RGB haze-free image restored from ${\hat{J}}_{Y U V} (x)$ . Note. TRE = texture receovery and enhance; AHIP = adaptive haze intensity prediction; SLCA = semantic-based luminance and chrominance adjustment; RGB = red–blue–green.

3.1. TRE Module

The degradation of images met up with the camera in hazy weather can be various, such as contrast reduction, color shift, and color distortion. Many features and details of the image information are covered or blurred, which sets great limits on the subsequent processing of the image. Therefore, in our HITFormer, we introduce a TRE module, which can effectively restore and enhance details. As seen in Figure 3.

T_{0} = I_{YUV} - F (I_{YUV})

(3)

In order to get detailed information about the image, we need to separate them from semantics. As shown in equation (3), we first utilize a smoothing filter

F

(average pooling) to remove most of the texture details and retain the semantics. Then, we subtract the result by pixel from the YUV image to get the texture and details of the image. As

I_{YUV}

and

F (I_{YUV})

have similar magnitude, the pixel intensity of the resulting image

T_{0}

is uncomparable to that of the original YUV image

I_{YUV}

, which can lead to some bias on detail information. So, we need to rescale the pixel intensity to the same magnitude as the original image. Thus, the features from

I_{YUV}

and

T_{0}

can be well preserved and fused. An optional method is to rescale each pixel individually, which can adaptively enhance the details and textures of the image, but this will put a heavy burden on storage and computation, so we choose to use one scalar

S

for every pixel in the image

T_{0}

S = \frac{Mean (I_{YUV})}{Mean (T_{0})}

(4)

Then we multiply

S

by the previous resulting

T_{0}

to get the rescaled texture image:

T_{1} = S \otimes T_{0}

(5)

Figure 3.

Illustration of TRE module. Note. TRE = texture recovery and enhance.

As shown in Figure 3, subtract and scale operations are included in the rescale module.

In order to preserve the integrity of the original image $I$ and further enhance its textures and details, we concatenate $T_{1}$ and $I$ to create a fused representation that combines the enhanced textures from $T_{1}$ with the overall contextual information of $I$ . This concatenated input undergoes a Conv module designed to maximize the integration of surrounding spatial features. A series of operations are included in this module: the $3 \times 3$ convolution operation extracts local spatial patterns and features from the concatenated image, facilitating a more nuanced understanding of texture variations and detailed structures. Subsequently, the max pooling operation aggregates the most relevant features from neighboring regions, enhancing robustness and reducing sensitivity to small spatial variations. Finally, the $1 \times 1$ convolution refines the feature representation by adjusting channel-wise interactions, effectively reducing computational complexity while preserving essential information for accurate texture recovery and enhancement.

3.2. AHIP Subnet

In hazy weather conditions, images often exhibit varying degrees of haze intensity across different regions, which poses a significant challenge for effective image dehazing. To address this variability, we introduce the AHIP subnet within the HITFormer framework. The primary objective of the AHIP subnet is to predict the haze intensity of each patch within an image, enabling targeted enhancement of heavily hazed regions to improve overall image clarity and detail preservation.

Methods based on the DCP (He et al., 2011), although effective in certain cases, do not always hold true. These methods rely on the lowest brightness value in the image to estimate haze, but in some complex scenarios, this assumption might be inaccurate. In our approach, we still utilize the DCP to estimate haze intensity, but we improve this method by combining it with gradient information, thus enhancing its accuracy, as shown in Figure 4.

Figure 4.

Illustration of AHIP subnet. Note. AHIP=adaptive haze intensity prediction.

According to He et al. (2011), given an arbitrary image $J$ , the dark channel $J^{d a r k}$ is given by:

J^{dark} (x) = min_{y \in ω (x)} min_{C \in {R, G, B}} [J^{C} (y)]

(6)

where

J^{C}

is one of the three color channels of

J

and

ω (x)

is a patch centered at

x

Specifically, to accurately predict the haze intensity, we incorporate both gradient variation and a physical prior into the pseudo ground truth haze intensity for each patch:

d_{P_{i}} = \frac{d_{P_{i}}^{grad} + α d_{P_{i}}^{dark}}{1 + α}

(7)

where

d_{P_{i}}^{grad}

represents the gradient difference within the same patch between the input hazy image

I

and the haze-free image

J

, calculated as:

d_{P_{i}}^{grad} = \sum_{(x, y) \in P_{i}^{I}} G (x, y) - \sum_{(x, y) \in P_{i}^{J}} G (x, y)

(8)

The gradient

G (x, y)

characterizes the image’s texture variations, which is crucial for identifying regions with dense haze. To calculate the gradient

G (x, y)

, we employ the Sobel operator, which applies convolution operations with horizontal and vertical filters to detect edges and texture variations in the image. The Sobel operator is defined as follows:

G_{x} = I * [\begin{matrix} - 1 & 0 & 1 \\ - 2 & 0 & 2 \\ - 1 & 0 & 1 \end{matrix}] and G_{y} = I * [\begin{matrix} - 1 & - 2 & - 1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{matrix}]

(9)

where

G_{x}, G_{y}

are the horizontal and vertical gradients, respectively, and

*

denotes the convolution operation. The overall gradient magnitude

G (x, y)

is then computed as:

G (x, y) = \sqrt{G_{x}^{2} + G_{y}^{2}}

(10)

The Sobel operator effectively captures the gradient variations, which are crucial for identifying regions with dense haze.

Concurrently, our $d_{P_{i}}^{dark}$ utilizes the DCP, $J^{dark}$ to estimate haze intensity:

d_{P_{i}}^{dark} = \frac{1}{N} \sum_{(x, y) \in P_{i}^{I}} J^{dar k^{'}} (x, y)

(11)

where

J^{dar k^{'}}

is the normalized dark channel of

J

, and

N

is the number of pixels in patch

P_{i}

. This approach leverages image statistics to infer haze intensity variations across different patches.

In our approach, we divide the range of haze intensity values into $c$ intervals, each representing a specific range of haze levels. We use a $c$ -dimensional one-hot vector $D_{P_{i}}$ to represent the haze intensity of each patch:

D_{P_{i}} = Γ (d_{P_{i}})

(12)

where

Γ (\cdot)

transform the value of

d_{P_{i}}

into a one-hot vector

D_{P_{i}}

. For a patch

P_{i}

with haze intensity

d_{P_{i}}

, normalized to

[0, 1]

, we categorize

d_{P_{i}}

into one of

c

intervals:

[0, 1 / c), [1 / c, 2 / c), \dots, [c - 1 / c, 1]

. If

d_{P_{i}}

falls in the

j

th interval, the one-hot vector

D_{P_{i}}

has the

j

th element set to 1 and others to 0. The employment of one-hot encoding allows the model to treat haze intensity as a categorical variable, simplifying learning.

The AHIP subnet employs a fully connected layer with ReLU activation and another fully connected layer to predict the distribution of haze intensity for each patch, ${\hat{D}}_{P_{i}}$ . The prediction is evaluated using a cross-entropy loss function:

L_{HI} = - \frac{1}{N_{P}} \sum_{i = 1}^{N_{P}} \sum_{j = 1}^{c} D_{i j} \log {\hat{D}}_{i j}

(13)

where

N_{P}

is the number of patches in the image and

c

is the number of intervals representing haze intensity levels.

This loss guides the AHIP subnet to accurately predict and prioritize regions with higher haze intensity during training. Additionally, the overall loss function for HITFormer includes a peak signal-to-noise ratio (PSNR) loss $L_{PSNR}$ , which is a commonly used loss function to measure image restoration quality:

L_{PSNR} = - PSNR (HITFormer (I), J)

(14)

where

J

is the ground truth haze-free image corresponding to input hazy image

I

. The PSNR is also used as one of our evaluation metrics, and its calculation is given later in equations (17) and (18).

Overall, the combined loss for HITFormer is formulated as follows:

L_{all} = L_{PSNR} + γ L_{HI}

(15)

Here,

γ

serves as a weighting factor to balance the importance of PSNR restoration and haze intensity prediction losses. During early training stages, leveraging

d_{P_{i}}^{dark}

aids in faster convergence by providing prior knowledge. However, in complex scenarios,

d_{P_{i}}^{dark}

is gradually phased out to prevent overreliance on simplistic priors and encourage robust performance across diverse conditions.

By integrating the AHIP subnet with HITFormer, our approach enhances the model’s ability to handle varying haze intensities. This leads to improved dehazing results by prioritizing regions most affected by haze.

3.3. SLCA Module

In ASM (as shown in equation (1)), the value of transmission $t (x)$ ranges from $(0, 1)$ , and a larger value indicates more energy is gained in the formation of a hazy image. We can notice that the formation of a hazy image is highly associated with the transmission and attenuation of energy. Since the energy of an image can be measured by its brightness, we can convert the input RGB image to other color space to extract brightness components. In YUV color space, the Y component describes the brightness of the color (the luminance), while the U and V components describe the color difference (the chrominance). Besides, YUV color space and RGB color space are linearly transformed, so they can be effectively used for color image processing. Therefore, we first convert an input RGB hazy image to YUV color space to decouple energy components, imitating ASM to model energy and fully utilizing the prior knowledge to improve model training results.

As shown in Figure 5, the previous swin transformer blocks build hierarchical feature maps with different resolutions ( $M_{1} \sim M_{4}$ ). In order to integrate the features of different stages, firstly, we utilize a Trans Conv module to up-sample feature map $M_{1}$ , $M_{2}$ , and $M_{3}$ to the same size as $M_{4}$ ( $(H / 4) \times (H / 4) \times C$ ). Different numbers of transposed convolution operations (with the stride of 2) are used in this module, and the results are denoted as $M_{1}^{^{'}}$ , $M_{2}^{^{'}}$ and $M_{3}^{^{'}}$ , respectively. Secondly, we utilize a cross-scale fusion module, which concatenates $M_{1}^{^{'}}$ , $M_{2}^{^{'}}$ , $M_{3}^{^{'}},$ and $M_{4}$ , and use a $1 \times 1$ convolution, a ReLU activation function, and a $3 \times 3$ convolution to fuse the extract features. Finally, we use a flattened layer and a fully connected layer to compress the shape of the features of all patches to the size of $N_{P} \times 9$ . We denote it as $F$ , which is a patch-wise transformation coefficient containing the fusions of low-level and high-level features of the dehazing task.

Figure 5.

Illustration of SLCA module. Note. SLCA =semantic-based luminance and chrominance adjustment.

We reshape $F$ ( $N_{P} \times 9$ ) to the size of $N_{P} \times 3 \times 3$ (denoted as $F_{i 1} \sim F_{i 9}, i \in [1, N_{P}]$ ), and use it to adjust the original input image in YUV space, we multiply $F_{i}$ by each component of $I_{YUV}$ .

{\hat{J}}_{YUV} = [\begin{matrix} F_{i 1} & F_{i 2} & F_{i 3} \\ F_{i 4} & F_{i 5} & F_{i 6} \\ F_{i 7} & F_{i 8} & F_{i 9} \end{matrix}] \times [\begin{matrix} I_{Y} \\ I_{U} \\ I_{V} \end{matrix}]

(16)

At last, we convert the resulting image

{\hat{J}}_{YUV}

to RGB color space to recover haze-free image

\hat{J}

4. Experiments

4.1. Experimental Settings

4.1.1. Datasets

RESIDE (Li et al., 2019) is a widely used benchmark dataset for image dehazing. It consists of five subsets, of which the three most commonly used are indoor training set (ITS), outdoor training set (OTS), and synthetic objective testing set (SOTS). In addition, we also use NH-HAZE (Ancuti et al., 2020) and Dense-Haze (Ancuti et al., 2019) datasets for real-world image dehazing.

Our model is trained on ITS and OTS subsets of RESIDE dataset, Dense-Haze dataset, and NH-HAZE dataset, and is tested on SOTS. The OTS contains 2,061 real outdoor haze-free images from Beijing and 72,135 synthesized hazy images, that is, one haze-free image corresponds to 35 hazy images with different haze intensities. The ITS contains 1,399 indoor haze-free images and 13,990 synthesized indoor hazy images. NH-HAZE contains 55 pairs of outdoor real hazy and corresponding haze-free images. Dense-Haze contains 33 pairs of outdoor real hazy and corresponding haze-free images. Figure 6 shows example images of the datasets.

Figure 6.

Examples of the datasets. (a) and (b) are from RESIDE, (c) is from NH-HAZE, and (d) is from dense-haze. Hazy images and corresponding haze-free images are in the top and bottom rows, respectively.

All the above datasets are officially available. Thus, we can ensure that each hazy image in our training set corresponds to a haze-free image (i.e. the ground truth).

4.1.2. Implementation Details

Our framework is implemented using PyTorch 1.13.0 with an NVIDIA RTX 3090 GPU (24 GB). The model is trained for $2 \times 10^{5}$ steps on the datasets mentioned above. In the first 2,000 steps of training, we keep $d_{P_{i}}^{dark}$ in equation (7) to calculate the pseudo ground truth of haze intensity of each patch, then we drop $d_{P_{i}}^{dark}$ for the rest of the training steps to exclude the limitation of prior information. We use Adam as our optimizer. The initial learning rate is $1 \times 10^{- 4}$ , and decreased to $1 \times 10^{- 6}$ using the cosine annealing strategy (He et al., 2019). We randomly crop the input images to the size of $256 \times 256$ and augment them by randomly 90, 180, and 270 degrees of rotation, horizontal flip, and vertical flip.

4.1.3. Evaluation Metrics

PSNR and structural similarity index measure (SSIM) (Wang et al., 2004) are used to objectively evaluate the results on the datasets. For subjective comparison, we use the mean opinion score (MOS). The three metrics are discussed briefly as follows.

(1)
PSNR

PSNR assesses image quality by comparing the mean square error (MSE) between two images in decibels. Given a $H \times W$ reference image $f$ and a test image $g$ , the MSE is computed as follows:
$MSE (f, g) = \frac{1}{H \times W} \sum_{i = 0}^{H - 1} \sum_{j = 0}^{W - 1} (f_{i j} - g_{i j})^{2}$
(17)
where $f_{i j}$ is the pixel value of image $f$ at point $(i, j)$ , $g_{i j}$ is the pixel value of image $g$ at point $(i, j)$ . Then, the PSNR can be computed as:
$PSNR (f, g) = 10 \log_{10} (\frac{{MAX}_{g}^{2}}{MSE (f, g)})$
(18)
where ${MAX}_{g}$ is the maximum pixel value in image $g$ . In our method, the higher PSNR value indicates a better quality of the haze-free image obtained by the dehazing algorithm.
(2)
SSIM

SSIM is a quality evaluation metric that measures the similarity of two images. A larger value of SSIM indicates that the recovered haze-free image retains more structural information and is of better quality. Its calculation involves the comparison of luminance, contrast, and structure of two images. They are computed as follows:
$\begin{aligned} l (f, g) & = \frac{2 μ_{f} μ_{g} + c_{1}}{μ_{f}^{2} + μ_{g}^{2} + c_{1}} \end{aligned}$
(19)

$\begin{aligned} c (f, g) & = \frac{2 σ_{f} σ_{g} + c_{2}}{σ_{f}^{2} + σ_{g}^{2} + c_{2}} \end{aligned}$
(20)

$\begin{aligned} s (f, g) & = \frac{σ_{f g} + c_{3}}{σ_{f} σ_{g} + c_{3}} \end{aligned}$
(21)
where $l (f, g)$ is the luminance similarity, which measures the closeness of the two images’ mean luminance( $μ_{f}$ and $μ_{g}$ ). $c (f, g)$ is the contrast similarity, which measures the closeness of the contrast between the two images $f$ and $g$ . $σ_{f}$ and $σ_{g}$ are the standard deviation of image $f$ and image $g$ , respectively. $s (f, g)$ is the structure similarity that measures the correlation coefficient between the two images. $σ_{f g}$ is the covariance between I and J. $c_{1}$ , $c_{2}$ , and $c_{3}$ are constants, used to avoid a null denominator.

Combining $l (f, g)$ , $c (f, g)$ and $s (f, g)$ , the SSIM can be computed as follows:
$SSIM (f, g) = l (f, g) c (f, g) s (f, g)$
(22)
(3)
MOS

In our experimental evaluation, we used the MOS to subjectively assess the results on the datasets. We collected the opinions of 50 evaluators from various industries, who rated the quality of the result images according to their subjective experience and gave them a score from 1 to 5. We used 20 randomly selected images from each testing set’s result images to get MOS scores, where a higher MOS score indicates a better image quality.

To ensure diverse perspectives, evaluators were selected from fields such as computer vision, photography, medical imaging, automotive, and academia. These evaluators were identified through professional networks, industry contacts, and academic collaborations.

Meanwhile, in order to avoid bias, evaluators were not informed about the specific methods behind the images they were assessing. Images were presented in a randomized order, and the evaluation criteria were standardized, covering aspects such as clarity, contrast, and color accuracy. Each evaluator conducted their assessments independently, ensuring no influence from other evaluators.

4.2. Ablation Study

We perform ablation studies on the SOTS-outdoor dataset to demonstrate the effectiveness of HITFormer. We first utilize swin transformer (Liu et al., 2021) as our baseline for the dehazing task, and then we add different components of our model to the baseline. Thus, we conduct the ablation experiments as follows: (a) baseline, (b) adding TRE module to baseline, (c) adding AHIP subnet to baseline, (d) adding AHIP and SLCA to baseline, (e) adding TRE and AHIP to baseline, and (f) our model.

As shown in Table 1, by adding the TRE module, the performance of the model is improved by 2.71 dB in PSNR. By adding the AHIP subnet, there is a significant increase of 3.38 dB in PSNR. This result indicates that our TRE module and AHIP subnet are significant components to improve the dehazing effect. After adding the SLCA module to baseline + TRE + AHIP, the full model performance improved by 0.42 dB in PSNR. This is because the SLCA starts from an energy-based ASM model (McCartney et al., 1977), which is related to haze intensity (the higher the intensity, the faster the energy decay). Besides, the AHIP subnet also gives the model the ability to perceive the haze intensity, which is slightly overlapping with SLCA in terms of functionality, but the different starting points and modeling process of AHIP and SLCA give the final model a stronger ability to further enhance the dehazing effect.

Table 1.
Ablation Study on Different Modules and Configurations of HITFormer on the SOTS-Outdoor Dataset. The Bold Numbers Indicate the Best Results.

Setting Baseline (swin transformer) TRE module AHIP subnet SLCA module PSNR SSIM

a ✓ $\times$ $\times$ $\times$ 27.08 0.931

b ✓ ✓ $\times$ $\times$ 29.79 0.942

c ✓ $\times$ ✓ $\times$ 30.46 0.950

d ✓ $\times$ ✓ ✓ 33.02 0.976

e ✓ ✓ ✓ $\times$ 34.94 0.983

f ✓ ✓ ✓ ✓ 35.36 0.989

Setting	Baseline (swin transformer)	TRE module	AHIP subnet	SLCA module	PSNR	SSIM
a	✓	$\times$	$\times$	$\times$	27.08	0.931
b	✓	✓	$\times$	$\times$	29.79	0.942
c	✓	$\times$	✓	$\times$	30.46	0.950
d	✓	$\times$	✓	✓	33.02	0.976
e	✓	✓	✓	$\times$	34.94	0.983
f	✓	✓	✓	✓	35.36	0.989

Note. SOTS = synthetic objective testing set; TRE = texture receovery and enhance; AHIP = adaptive haze intensity prediction; SLCA = sematic-based luminance and chrominance adjustment; PSNR = peak signal-to-noise ratio; SSIM = structural similarity index measure.

4.3. Comparisons With the State-of-the-Art Methods

4.3.1. Quantitative Comparison

We first compare the quantitative results (PSNR and SSIM scores) of our HITFormer with several state-of-the-art (SOTA) methods, including DCP (He et al., 2011), DehazeNet (Cai et al., 2016), AOD-Net (Li et al., 2017), GridDehazeNet (Liu et al., 2019), MSBDN (Dong et al., 2020), FFA-Net (Qin et al., 2020), and DehazeFormer-B (Song et al., 2023). We conduct comparisons on SOTS (Li et al., 2019), NH-Haze (Ancuti et al., 2020), and Dense-Haze (Ancuti et al., 2019) datasets. As shown in Tables 2 and 3, the HITFormer achieves the highest PSNR and SSIM scores compared to other methods on SOTS-indoor, SOTS-outdoor, and NH-HAZE datasets. On the Dense-Haze dataset, the SSIM score of the HITFormer is the highest compared to the other methods, while the PSNR score is just 0.04 dB lower than the SOTA method. Table 4 shows that our method outperforms other methods in terms of MOS on all testing sets. The results demonstrate the effectiveness and advantages of our method.

Table 2.
Quantitative Evaluations With the State-of-the-Art Methods on SOTS-Indoor, SOTS-Outdoor Datasets (PSNR (dB) and SSIM). The Bold Numbers Indicate the Best Results.

SOTS-Indoor SOTS-Outdoor

Methods PSNR SSIM PSNR SSIM

DCP 16.62 0.818 19.13 0.815

DehazeNet 19.82 0.821 24.75 0.927

AOD-Net 20.51 0.816 24.14 0.920

GridDehazeNet 32.16 0.984 30.86 0.982

MSBDN 33.67 0.985 33.48 0.982

FFA-Net 36.39 0.989 33.57 0.984

DehazeFormer-B 37.84 0.994 34.95 0.984

HITFormer (ours) 40.27 0.994 35.36 0.989

	SOTS-Indoor	SOTS-Outdoor
DCP	16.62	0.818	19.13	0.815
DehazeNet	19.82	0.821	24.75	0.927
AOD-Net	20.51	0.816	24.14	0.920
GridDehazeNet	32.16	0.984	30.86	0.982
MSBDN	33.67	0.985	33.48	0.982
FFA-Net	36.39	0.989	33.57	0.984
DehazeFormer-B	37.84	0.994	34.95	0.984
HITFormer (ours)	40.27	0.994	35.36	0.989

Note. SOTS = synthetic objective testing set; PSNR = peak signal-to-noise ratio; SSIM = structural similarity index measure.

Table 3.

Quantitative Evaluations with the State-of-the-Art Methods on NH-HAZE and Dense-Haze Datasets (PSNR (dB) and SSIM). The Bold Numbers Indicate the Best Results.

	NH-HAZE		Dense-Haze
Methods	PSNR	SSIM	PSNR	SSIM
DCP	10.57	0.520	11.01	0.416
DehazeNet	12.25	0.451	9.48	0.438
AOD-Net	15.40	0.570	12.82	0.468
GridDehazeNet	15.25	0.539	13.73	0.457
MSBDN	19.23	0.706	15.37	0.486
FFA-Net	19.87	0.692	14.39	0.452
DehazeFormer-B	18.63	0.674	16.59	0.541
HITFormer (ours)	21.31	0.724	16.55	0.593

Note. PSNR = peak signal-to-noise ratio; SSIM = structural similarity index measure.

Table 4.

Quantitative Evaluations with the State-of-the-Art Methods on SOTS-Indoor, SOTS-Outdoor, NH-HAZE, and Dense-Haze Datasets (MOS). The Bold Numbers Indicate the Best Results.

Methods	SOTS-Indoor	SOTS-Outdoor	NH-HAZE	Dense-Haze
DCP	4.22	4.14	3.18	2.76
DehazeNet	4.08	3.98	3.26	2.82
AOD-Net	4.10	4.16	4.32	3.18
GridDehazeNet	4.38	4.20	4.40	4.32
MSBDN	4.46	4.44	4.52	4.46
FFA-Net	4.50	4.46	4.48	4.28
DehazeFormer-B	4.52	4.50	4.54	4.38
HITFormer(ours)	4.52	4.56	4.60	4.46

Note. SOTS = synthetic objective testing set; MOS = mean opinion score.

4.3.2. Visual Comparison

In order to further demonstrate the effectiveness of our model, we compare our visual results with several selected effective methods on real-world hazy images obtained from the internet. The visual comparisons are presented in Figure 7. It can be observed obviously that DCP and AOD-Net can effectively remove haze, but the color of the images recovered by DCP is distorted, and the overall images recovered by both methods are too dark. The visual results of GridDehazeNet have a good dehazing effect in nonsky regions, such as scenery and buildings, but there are artifacts in sky regions. In the recovery images of FFA-Net and Dehazeformer-B, the haze is not completely removed in all the cases, and the texture and details are not fully restored. In contrast, our HITFormer shows great performance in haze removal and texture enhancement, which indicates the superiority of our method.

Figure 7.

Visual comparisons on real-world hazy images.

5. Conclusion

In this paper, we proposed a transformer-based single-image dehazing framework called HITFormer. To summarize, we propose a TRE module to better strengthen the blurred detail information. In particular, a patch-wise relative haze intensity prediction subnet is designed to estimate the degree of haze intensity of each patch, which enables the model to focus on patches with dense haze. Besides, our model extracts luminance and chromatic-related features in YUV color space, which makes better use of the domain knowledge from the hazy image formation model and improves the dehazing performance. Extensive comparisons demonstrate that our HITFormer achieves superior performance on several datasets.

Limitations: The method proposed in this paper mainly focuses on daytime outdoor image dehazing. However, the scenarios in real life can be more complex. Following the main idea of this work, we will further study more physical priors in different scenarios, such as nighttime or rainy days, and guide the model to learn features that are more suitable for real-world scenes.

Footnotes

ORCID iDs

Tianli Zhang

Hao Zeng

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Ancuti

C. O.

Ancuti

Sbert

Timofte

(2019). Dense-Haze: A benchmark for image dehazing with dense-haze and haze-free images. In 2019 IEEE international conference on image processing (ICIP) (pp. 1014–1018). IEEE. https://doi.org/10.1109/ICIP.2019.8803046

Ancuti

C. O.

Ancuti

Timofte

(2020). NH-HAZE: An image dehazing benchmark with non-homogeneous hazy and haze-free images. In 2020 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW) (pp. 1798–1805). IEEE. https://doi.org/10.1109/CVPRW50498.2020.00230

Berman

Treibitz

Avidan

(2016). Non-local image dehazing. In 2016 IEEE conference on computer vision and pattern recognition (CVPR) (pp. 1674–1682). IEEE. https://doi.org/10.1109/CVPR.2016.185

Cai

Jia

Qing

Tao

(2016). DehazeNet: An end-to-end system for single image haze removal. IEEE Transactions on Image Processing, 25, 5187–5198. https://doi.org/10.1109/TIP.2016.2598681

Chen

Wang

Guo

Deng

Liu

Gao

(2021).Pre-Trained Image Processing Transformer. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 12294–12305). IEEE. https://doi.org/10.1109/CVPR46437.2021.01212

Dong

Pan

Xiang

Zhang

Wang

Yang

M.-H.

(2020). Multi-scale boosted dehazing network with dense feature fusion. In 2020 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 2154–2164). IEEE. https://doi.org/10.1109/CVPR42600.2020.00223

Dosovitskiy

Beyer

Kolesnikov

Weissenborn

Zhai

Unterthiner

Dehghani

Minderer

Heigold

Gelly

Uszkoreit

Houlsby

. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In International conference on learning representations. https://openreview.net/forum?id=YicbFdNTTy

Fattal

(2014). Dehazing using color-lines. ACM Transactions on Graphics (TOG), 34, 1–14. doi: https://doi.org/10.1145/2651362.

Guo

Yan

Anwar

Cong

Ren

(2022). Image dehazing transformer with transmission-aware 3D position embedding. In 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 5802–5810). IEEE. https://doi.org/10.1109/CVPR52688.2022.00572

10.

Sun

Tang

(2011). Single image haze removal using dark channel prior. In IEEE transactions on pattern analysis and machine intelligence, 33, 2341–2353. https://doi.org/10.1109/TPAMI.2010.168

11.

Zhang

Xie

(2019). Bag of tricks for image classification with convolutional neural networks. In 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 558–567). IEEE. https://doi.org/10.1109/CVPR.2019.00065

12.

Hsieh

C.-H.

Chen

J.-Y.

Zhao

(2018). A modified DCP based dehazing algorithm. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC) (pp. 1779–1784). IEEE. https://doi.org/10.1109/SMC.2018.00307

13.

Ding

Guo

C. A.

Ren

Tao

(2021). IDRLP: Image dehazing using region line prior. IEEE Transactions on Image Processing, 30, 9043–9057. https://doi.org/10.1109/TIP.2021.3122088

14.

Ding

Guo

Y. J.

Zhang

(2020). IDGCP: Image dehazing based on gamma correction prior. IEEE Transactions on Image Processing, 29, 3104–3118. https://doi.org/10.1109/TIP.2019.2957852

15.

Peng

Wang

Feng

(2017). AOD-Net: All-in-one dehazing network. In 2017 IEEE international conference on computer vision (ICCV), (pp. 4780–4788). IEEE. https://doi.org/10.1109/ICCV.2017.511

16.

Ren

Tao

Feng

Zeng

Wang

(2019). Benchmarking single-image dehazing and beyond. IEEE Transactions on Image Processing, 28, 492–505. https://doi.org/10.1109/TIP.2018.2867951

17.

Liu

Lin

Cao

Wei

Zhang

Lin

Guo

(2021). Swin transformer: Hierarchical vision transformer using shifted windows. In 2021 IEEE/CVF international conference on computer vision (ICCV) (pp. 9992–10002). IEEE. https://doi.org/10.1109/ICCV48922.2021.00986

18.

Liu

Chen

(2023). A data-centric solution to nonHomogeneous dehazing via vision transformer. In 2023 IEEE/CVF conference on computer vision and pattern recognition workshops (CVPRW) (pp. 1406–1415). IEEE. https://doi.org/10.1109/CVPRW59228.2023.00145

19.

Liu

Shi

Chen

(2019). GridDehazeNet: Attention-based multi-scale network for image dehazing. In 2019 IEEE/CVF international conference on computer vision (ICCV) (pp. 7313–7322). IEEE. https://doi.org/10.1109/ICCV.2019.00741

20.

McCartney

E. J.

Hall

Freeman

(1977). Optics of the atmosphere: Scattering by molecules and particles. Physics Today, 30, 76–77. https://doi.org/10.1063/1.3037551

21.

Pei

S.-C.

Lee

T.-Y.

(2012). Nighttime haze removal using color transfer pre-processing and Dark Channel Prior. In 2012 19th IEEE international conference on image processing (pp. 957–960). IEEE. https://doi.org/10.1109/ICIP.2012.6467020

22.

Qin

Wang

Bai

Xie

Jia

(2020). FFA-Net: Feature fusion attention network for single image dehazing. Proceedings of the AAAI Conference on Artificial Intelligence, 34(7), 11908–11915. https://doi.org/10.1609/aaai.v34i07.6865

23.

Ren

Liu

Zhang

Pan

Cao

Yang

M.-H.

(2016). Single image dehazing via multi-scale convolutional neural networks. In Computer vision - 14th European conference, ECCV 2016, proceedings (pp. 154–169), Springer Verlag. https://doi.org/10.1007/978-3-319-46475-6_10

24.

Song

Qian

(2023). Vision transformers for single image dehazing. IEEE Transactions on Image Processing, 32, 1927–1941. https://doi.org/10.1109/TIP.2023.3256763

25.

Wang

Bovik

A. C.

Sheikh

H. R.

Simoncelli

E. P.

(2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing, 13, 600–612. https://doi.org/10.1109/TIP.2003.819861

26.

Wang

Xie

Fan

D.-P.

Song

Liang

Luo

Shao

(2021). Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. In 2021 IEEE/CVF international conference on computer vision (ICCV) (pp. 548–558). IEEE. https://doi.org/10.1109/ICCV48922.2021.00061

27.

Lin

Zhou

Qiao

Zhang

Xie

(2021). Contrastive learning for compact single image dehazing. In 2021 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 10546–10555). IEEE. https://doi.org/10.1109/CVPR46437.2021.01041

28.

Zhang

Jiang

Chen

Liu

Chen

(2022). Perceiving and modeling density for image dehazing. In Computer vision - 17th European conference, ECCV 2022, proceedings (pp. 130–145). Springer Verlag, https://doi.org/10.1007/978-3-031-19800-7_8

29.

Zhang

Patel

V. M.

(2018). Densely connected pyramid dehazing network. In 2018 IEEE/CVF conference on computer vision and pattern recognition (CVPR) (pp. 3194–3203). IEEE. https://doi.org/10.1109/CVPR.2018.00337

30.

Zhang

Wang

Yang

Zhang

Song

(2018). Image dehazing based on dark channel prior and brightness enhancement for agricultural remote sensing images from consumer-grade cameras. Computers and Electronics in Agriculture, 151, 196–206. https://doi.org/10.1016/j.compag.2018.06.010

31.

Zhu

Mai

Shao

(2015). A fast single image haze removal algorithm using color attenuation prior. IEEE Transactions on Image Processing, 24, 3522–3533. https://doi.org/10.1109/TIP.2015.2446191

A Haze Intensity Sensing and Texture Enhanced Transformer for Image Dehazing

Abstract

Keywords

1. Introduction

2. Related Works

2.1. Prior-Based Methods

2.3. Vision Transformer

3. Method

4.1. Experimental Settings

4.1.1. Datasets

4.1.3. Evaluation Metrics

4.3.1. Quantitative Comparison

Footnotes

ORCID iDs

Funding

Declaration of Conflicting Interests

References