Abstract
Recent advances in deep learning-based super-resolution (SR) techniques for remote sensing images (RSIs) have shown significant promise. However, these performance improvements often come at a high computational cost, which limits their practical application. To address this issue, this paper proposes a dual-branch SR model (DBSR) that enhances both model performance and efficiency through primary and auxiliary branches. The primary branch integrates the advantages of channel recalibration, a separable swin transformer (SST), and a spatial refinement module to achieve fine-grained feature extraction. The SST serves as the core of the primary branch, employing hierarchical window attention calculations to facilitate lightweight and effective multiscale feature representation. Meanwhile, the auxiliary branch enhances shallow features through a global information enhancement module, mitigating the artifacts that direct upsampling of shallow features would otherwise introduce into the SR results. Comparative and ablation experiments conducted on four RSI datasets and five SR benchmark datasets demonstrate that our DBSR method effectively balances the number of parameters with performance, showcasing its potential for application in RSI processing.
Introduction
High-resolution remote sensing images (RSIs) are crucial for monitoring the environment, urban development, and changes caused by natural or human activities (Albiston, 2005). The spatial resolution of these images directly impacts the accuracy of analyses. However, obtaining high-resolution (HR) images is challenging due to limitations imposed by imaging equipment, observation distance, and viewing angles (Han et al., 2021). While enhancing the precision of charge-coupled device sensors is a direct method for acquiring HR images, it incurs significant costs. Super-resolution (SR) offers a cost-effective alternative by reconstructing HR images from low-resolution (LR) ones using efficient algorithms. Therefore, SR technology is an effective and economical approach for improving spatial resolution in remote sensing.
In recent years, deep learning techniques—particularly those utilizing convolutional neural networks (CNNs) and transformers—have emerged as the leading technologies for achieving HR RSIs (Wang et al., 2022; Huang & Liu, 2023). The introduction of CNNs into SR technology marked a paradigm shift (Wang et al., 2022). Initially, the focus was on simple network architectures, but development has progressed to complex networks capable of capturing intricate spatial details. Early research concentrated on single-scale CNN feature extraction (Dong et al., 2015; Song et al., 2021), but subsequent studies emphasized the importance of integrating multiscale features (Dong et al., 2016; Purohit et al., 2020) for processing remote sensing data. This led to the development of multiscale feature extraction methods (Ma et al., 2021; Huan et al., 2021) and efficient attention mechanisms (Huang et al., 2021; Wang et al., 2022), significantly improving the performance of CNN-based RSI SR. Additionally, building on features extracted from LR images, researchers have developed SR methods that use implicit functions to map these features to spatial coordinates, predicting pixel values at any given location and enabling arbitrary scaling factors. For instance, the local implicit image function (LIIF; Chen et al., 2021a) achieves continuous image SR by taking coordinates and nearby 2D features as inputs to predict the corresponding red–green–blue (RGB) values. However, due to its reliance on local features, LIIF may struggle with maintaining global consistency and handling complex structures. To address these limitations, FunSR (Chen et al., 2023) introduced a continuous SR framework that enhances global semantic consistency by facilitating contextual interactions within the implicit function space for continuous image representation.
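To make the implicit-function formulation concrete, the following is a minimal sketch of a LIIF-style decoder: an MLP maps a local feature vector and a relative 2D coordinate to an RGB value, so pixel values can be queried at arbitrary positions. This is our illustration of the idea, not the authors' implementation; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ImplicitQueryMLP(nn.Module):
    """LIIF-style decoder sketch: predict RGB from a nearby LR feature
    vector plus the query point's relative coordinate (illustrative only)."""
    def __init__(self, feat_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feat, rel_coord):
        # feat: (N, feat_dim) nearest LR features; rel_coord: (N, 2) offsets
        return self.mlp(torch.cat([feat, rel_coord], dim=-1))

# Querying a denser coordinate grid than the LR feature map yields an
# arbitrary-scale output, which is the core of continuous-SR methods.
```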
While CNNs have been successful in addressing image SR tasks, they are constrained by the local receptive fields of convolutional kernels, which makes it difficult to capture the self-similarity present in RSIs (Liang et al., 2021). In contrast, transformer models, with their superior global information processing capabilities, have demonstrated distinct advantages in SR (Lei et al., 2024). The image processing transformer (IPT; Chen et al., 2021b) serves as an image restoration backbone network based on the standard transformer architecture, achieving multitask degraded image restoration. However, its impressive performance relies on 115.5 million parameters. Lei et al. (2021) introduced a transformer-enhanced network (TransENet) for SR in RSIs, leveraging the transformer’s feature extraction capabilities at different stages for multiscale fusion. Nonetheless, transformer-based SR models typically segment the input image into fixed-size patches, leading to boundary artifacts. Additionally, the extensive self-attention operations result in a considerable number of parameters, as evidenced by TransENet’s 37.3 million parameters. Consequently, these issues limit the applicability of transformer-based models in remote sensing.
To address the large parameter sizes in transformer-based SR models, we propose an innovative lightweight dual-branch SR (DBSR) model specifically designed for RSIs. Compared to existing lightweight SR networks, DBSR utilizes a carefully designed dual-branch structure that effectively combines the proposed separable swin transformer (SST) and global information enhancement module (GIEM) to achieve a deep fusion of local and global information. This design not only significantly enhances performance but also reduces computational burden. Unlike traditional lightweight models that directly upsample shallow features and add deep features, DBSR introduces an auxiliary branch that performs deep modeling of shallow features. This approach reduces the interference from redundant information during shallow feature upsampling, thereby improving reconstruction quality. Specifically, the main branch of DBSR consists of the channel recalibration (CRC), SST, and spatial refinement module (SRM). The CRC module dynamically adjusts channel feature responses to minimize channel redundancy. The SST module employs a hierarchical window attention design for parallel processing, effectively extracting multiscale window features while reducing the parameter count by 25% compared to the standard swin transformer (Liu et al., 2021). The SRM module further enhances pixel-level features. The auxiliary branch, utilizing the lightweight GIEM, refines shallow features, thereby strengthening the global feature modeling of the main branch. Finally, the interaction feature fusion (IFF) module effectively integrates and cross-weights features from both branches, emphasizing important features while suppressing redundancies. As shown in Figure 1, the dual-branch structure allows the DBSR model to achieve a parameter count of only 245K, delivering superior performance while maintaining a lightweight design.

Performance comparison with other lightweight methods on the Urban100 dataset.
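To make the dual-branch dataflow concrete, the sketch below wires up the pipeline described above in PyTorch. The HCTF, GIEM, and IFF internals are placeholders (identity blocks and a 1 × 1 convolution), so only the branch topology, multilevel feature concatenation, and pixel-shuffle reconstruction follow the paper's description; all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class DBSRSkeleton(nn.Module):
    """Dataflow sketch of the dual-branch design; module internals are
    placeholders, only the wiring reflects the paper's description."""
    def __init__(self, dim=32, n_blocks=3, scale=4):
        super().__init__()
        self.shallow = nn.Conv2d(3, dim, 3, padding=1)   # shallow features
        # placeholders for HCTF blocks (CRC -> SST -> SRM in the paper)
        self.hctfs = nn.ModuleList(nn.Identity() for _ in range(n_blocks))
        self.giem = nn.Identity()                        # auxiliary-branch placeholder
        # IFF placeholder: fuse concatenated multilevel + auxiliary features
        self.iff = nn.Conv2d(dim * (n_blocks + 1), dim, 1)
        self.recon = nn.Sequential(                      # pixel-shuffle head
            nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        s = self.shallow(x)
        feats, f = [], s
        for blk in self.hctfs:       # main branch: layer-by-layer refinement
            f = blk(f)
            feats.append(f)
        g = self.giem(s)             # auxiliary branch: enhance shallow features
        fused = self.iff(torch.cat(feats + [g], dim=1))
        return self.recon(fused)

sr = DBSRSkeleton()(torch.randn(1, 3, 48, 48))  # -> (1, 3, 192, 192)
```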
Related Work
Constructing deep neural networks to directly learn the mapping from LR to HR images has become a mainstream approach in the field of SR. Rapid advancements in SR methods based on CNNs continue to push the boundaries of reconstruction quality. With the widespread adoption of transformers in visual tasks, transformer-based SR methods have opened new avenues for research in this area. Therefore, this section reviews the progress in SR technology from the perspectives of both CNN-based and transformer-based approaches.
CNN-Based SR
As CNN technology has evolved, it has demonstrated excellent performance in image SR tasks. Existing CNN-based SR methods can primarily be categorized into several types: residual dense connections, multiscale methods, attention mechanisms, and lightweight model architectures.
Residual learning has enabled the construction of deeper network structures, effectively mitigating the degradation issues associated with deep networks. For instance, very deep super-resolution (VDSR; Kim et al., 2016) incorporates residual learning to build an SR network with 20 layers and a larger receptive field, enhancing the network’s reconstruction capabilities. Residual blocks have since become fundamental components of SR network architectures. However, simply increasing the number of layers does not efficiently transfer features across layers. To address this limitation, the residual dense backprojection network (Pan et al., 2019) combines residual learning with dense connections to fully exploit interlayer feature transfer and fusion, significantly enhancing reconstruction accuracy. Despite this progress in feature utilization, residual dense connection methods remain limited in fully capturing multiscale features to enhance SR results.
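A minimal sketch of the global residual learning idea popularized by VDSR, in which the network predicts only the high-frequency residual that is added back to the interpolated input (illustrative, not the original implementation):

```python
import torch.nn as nn

class ResidualSR(nn.Module):
    """VDSR-style sketch: a deep conv stack predicts the residual, and the
    skip connection eases the training of the deeper network."""
    def __init__(self, depth=20, dim=64):
        super().__init__()
        layers = [nn.Conv2d(3, dim, 3, padding=1), nn.ReLU(inplace=True)]
        for _ in range(depth - 2):
            layers += [nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True)]
        layers += [nn.Conv2d(dim, 3, 3, padding=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, x):        # x: bicubic-upsampled LR image
        return x + self.body(x)  # global residual connection
```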
Multiscale features are crucial in SR reconstruction. To better utilize multiscale feature information, Lu et al. (2019) proposed a multiscale residual neural network that significantly improves the preservation of high-frequency details by integrating multiscale residual information. Additionally, HSENet (Lei & Shi, 2021) leverages both single-scale and cross-scale self-similarity to enhance the model’s feature processing capabilities further. By extracting and integrating features across different scales, this model fully utilizes the hierarchical structure and detailed information of images, significantly improving CNN performance in SR tasks. However, multiscale methods may introduce redundant information when integrating details from different scales, potentially leading to detail blurring. Moreover, these methods may not effectively focus on key feature areas, limiting further improvements in reconstruction quality.
The attention mechanism aims to guide the network in focusing on key features. Cheng et al. (2018) introduced a single-image SR method based on recursive squeeze-and-excitation networks, achieving remarkable reconstruction results. However, CNN convolutional operations typically struggle to capture contextual information beyond the local receptive field, even though information from distant areas may be highly relevant to the reconstruction objective. To address this issue, the cross-scale nonlocal neural network (Mei et al., 2020) introduced a cross-scale nonlocal attention module that uncovers long-distance dependencies between LR features and HR patches, significantly enhancing reconstruction quality. However, complex attention designs often incur higher computational and storage costs, making them less suitable for deployment in resource-constrained real-world applications.
Consequently, limitations in computational resources and storage space have spurred the development of lightweight SR models. Hui et al. (2019) introduced an information multidistillation network (IMDN), combining contrast-aware attention with cascaded information multidistillation blocks. This approach reduces model complexity while preserving essential information. Despite IMDN’s success in achieving high-fidelity reconstruction, there remains room for improvement. To this end, the deep residual feature distillation network (Mardieva et al., 2024) optimized IMDN by using depthwise separable convolutions and multikernel depthwise separable convolutions, resulting in higher-quality reconstruction. Additionally, FeNet (Wang et al., 2022) introduced the lightweight lattice block (LLB) module, which utilizes channel attention (CA) mechanisms for information exchange between upper and lower branches, enhancing channel feature expression capabilities while reducing model parameters. However, the nested networks introduced in the LLB module are prone to generating redundant features during feature combination. Although lightweight models strive to enhance performance while reducing complexity, there is still room for improvement in balancing model efficiency and reconstruction quality. Moreover, many lightweight architectures follow a direct upscaling approach that processes shallow features and outputs SR results using a cascaded structure, which can introduce noise and cause distortions in the SR output. To address these challenges, we designed an auxiliary branch focused on the refined processing of shallow features. This design significantly reduces the misleading effects of directly upscaling shallow features on the final results while effectively enhancing the model’s reconstruction accuracy and visual quality.
Transformer-Based SR
Transformer-based models have significantly enhanced single-image SR (SISR) tasks by effectively capturing both global and local dependencies. Researchers have broadened the applicability of transformers in low-level vision tasks through innovations ranging from large-scale pretrained models to efficient hybrid architectures. Currently, transformer-based SR approaches can be categorized into three primary types: those based on standard vision transformers, window-based approaches, and hybrid architectures that combine CNNs and transformers.
Standard vision transformers excel at capturing global information from images. For example, Chen et al. (2021b) introduced an IPT pretrained on a large dataset, incorporating contrastive learning tailored to various image processing tasks. After fine-tuning, the pretrained model can be effectively applied to specific tasks. However, IPT relies on extensive datasets and has a substantial parameter count (over 115.5 M), which significantly limits its applicability. TransENet (Lei et al., 2021) enhances multiscale information interaction through a multilevel enhancement structure based on the standard transformer, effectively addressing the challenge of perceiving key image content among similar pixels. Although standard transformer-based SR models achieve high-quality SR results, their redundant self-attention mechanisms result in an excessive number of parameters, limiting practical applications.
To address the issue of standard vision transformers losing boundary information when processing image blocks independently, SwinIR (Liang et al., 2021) partitions the image into windows, performs self-attention within each window, and enhances interwindow information exchange through sliding window operations. This approach effectively reduces computational complexity and captures local dependencies, thereby improving the quality of SISR tasks. However, SwinIR lacks direct interwindow interaction, resulting in a limited field of view across windows. To overcome this limitation, cross aggregation transformer (Chen et al., 2022) transforms square windows into rectangular ones and employs axial shift operations to expand the attention field without increasing complexity, achieving efficient cross-aggregation between windows. Additionally, to further improve window-based transformers, hybrid attention transformer (Chen et al., 2023) introduces an overlapping cross-attention module that enhances the interaction between window features, activating more pixel attention areas and improving the model’s ability to capture details.
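The window partitioning at the heart of these designs can be summarized in a few lines. The sketch below shows the basic (non-shifted) partition; the shifted variant is typically realized by rolling the feature map by half a window (e.g., with torch.roll) before partitioning, which lets information cross window borders.

```python
import torch

def window_partition(x, ws):
    """Split a (B, H, W, C) feature map into non-overlapping ws x ws windows;
    self-attention is then computed independently inside each window."""
    B, H, W, C = x.shape                      # H and W must be divisible by ws
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

tokens = window_partition(torch.randn(1, 16, 16, 32), ws=8)  # -> (4, 64, 32)
```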
The strength of CNNs lies in their ability to effectively extract local features through convolutional layers, which is crucial for understanding details such as textures, edges, and shapes within images. Conversely, transformers offer the advantage of a self-attention mechanism, which captures long-range dependencies or global information, aiding the model in comprehending complex relationships between different parts of an image. Combining these two technologies enables the model to possess both sensitivity to local visual details and a deep understanding of the global context. For instance, Restormer (Zamir et al., 2022) embeds CNNs within the transformer framework to perform multiscale local–global learning. This approach is not only suitable for processing large images but also effectively captures interactions among distant pixels, achieving a good balance between performance and efficiency. Similarly, the dual transformer residual network (Sui et al., 2023) combines the strengths of CNNs and transformers to learn hierarchical features through global feature fusion, harmonizing global and local information. EHNet (Zhang et al., 2024) also utilizes CNNs and swin transformers within a U-Net-like architecture to capture multiscale features effectively. However, these hybrid architectures can lead to increased complexity in model structure, along with a rise in parameter and computational demands, making deployment in resource-constrained environments challenging. In response to this issue, this research adopts a hybrid architecture of CNNs and transformers, constructing a DBSR model that effectively integrates the advantages of both. This model also introduces an SST based on the swin transformer, successfully reducing the parameter count by 25% while significantly enhancing SR accuracy. This innovative architecture demonstrates considerable potential for achieving high efficiency and performance in complex image-processing tasks.
Network Architecture
In this section, we first provide an overview of the overall structure of the proposed DBSR. Following this, we detail the HCTF feature extraction module in the main branch and the GIEM and IFF components in the auxiliary branch.
Framework View
We propose a dual-branch lightweight remote sensing SR network. As shown in Figure 2, the proposed DBSR model consists of three parts: shallow feature extraction, deep feature extraction, and reconstruction modules. To enrich the representation of image details, the input image first passes through the shallow feature extraction stage, and the resulting shallow features are fed to both the main and auxiliary branches.

Architecture of the proposed dual-branch super-resolution (DBSR).
In the feature extraction part, the main branch refines features layer by layer through a cascade of HCTFs, while the auxiliary branch enhances shallow features using the GIEM. Each HCTF takes the output of its predecessor as input, so the deep features are refined progressively; we use three HCTFs in the final configuration (see the ablation in Table 6).
After obtaining the refined features from the main and auxiliary branches, we concatenate the features from different levels of the main branch with the features from the auxiliary branch and fuse them using IFF.
Finally, in the reconstruction part, the fused features are upsampled by a pixel-shuffle operation to generate the SR image.
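For illustration, a minimal sub-pixel (pixel-shuffle) reconstruction head, assuming a ×4 scale and 32 feature channels (illustrative values):

```python
import torch
import torch.nn as nn

# A convolution expands the channels by scale**2, then PixelShuffle
# rearranges them into a scale-times larger spatial grid.
scale, dim = 4, 32
recon = nn.Sequential(nn.Conv2d(dim, 3 * scale ** 2, 3, padding=1),
                      nn.PixelShuffle(scale))
hr = recon(torch.randn(1, dim, 48, 48))  # -> (1, 3, 192, 192)
```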
To fully leverage the advantages of both CNNs and transformers, we propose a feature extraction module, HCTF, which combines CRC, SST, and SRM. Each component is introduced in detail below.
Channel Recalibration (CRC)
In the CRC stage, as illustrated in Figure 3, the input shallow features are recalibrated along the channel dimension: each channel's response is dynamically reweighted so that redundant channels are suppressed and informative ones are emphasized.

Structure of channel recalibration (CRC).
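The paper states that CRC dynamically reweights channel responses; a plausible squeeze-and-excitation-style instantiation of such recalibration is sketched below (our assumption, not the authors' exact design).

```python
import torch.nn as nn

class ChannelRecalibration(nn.Module):
    """Plausible CRC sketch (assumption: an SE-style gate computed from
    global channel statistics)."""
    def __init__(self, dim, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                 # global channel statistics
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(dim // reduction, dim, 1), nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)  # suppress redundant channels, keep useful ones
```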

Network structure of the separable swin transformer: (a) channel separation process and (b) structure of multihead self-attention (MSA).
Inspired by the window attention mechanism of the swin transformer, we designed an SST. As shown in Figure 4, the input features are first separated into groups along the channel dimension, and multihead self-attention is then computed in parallel within windows of different sizes, yielding multiscale window features at a reduced parameter cost.
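The following sketch illustrates the channel-separation-plus-parallel-window-attention idea behind the SST; the window sizes (4 and 8), channel split, and head count are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class SeparableWindowAttention(nn.Module):
    """SST-style sketch: channels are separated into groups, and window MSA
    runs in parallel with a different window size per group (sizes assumed)."""
    def __init__(self, dim=32, window_sizes=(4, 8), heads=2):
        super().__init__()
        assert dim % len(window_sizes) == 0
        self.ws, self.group = window_sizes, dim // len(window_sizes)
        assert self.group % heads == 0
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(self.group, heads, batch_first=True)
            for _ in window_sizes)

    def forward(self, x):  # x: (B, C, H, W); H, W divisible by each window size
        B, C, H, W = x.shape
        outs = []
        for i, (ws, attn) in enumerate(zip(self.ws, self.attn)):
            g = x[:, i * self.group:(i + 1) * self.group]   # channel separation
            t = (g.reshape(B, self.group, H // ws, ws, W // ws, ws)
                  .permute(0, 2, 4, 3, 5, 1)                # ws*ws tokens per window
                  .reshape(-1, ws * ws, self.group))
            t, _ = attn(t, t, t)                            # window MSA
            t = (t.reshape(B, H // ws, W // ws, ws, ws, self.group)
                  .permute(0, 5, 1, 3, 2, 4)
                  .reshape(B, self.group, H, W))
            outs.append(t)
        return torch.cat(outs, dim=1)                       # (B, C, H, W)

y = SeparableWindowAttention()(torch.randn(1, 32, 16, 16))  # (1, 32, 16, 16)
```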
To further refine the SST features, we constructed an SRM to enhance the expression of key features on a pixel-by-pixel basis. As shown in Figure 5, the SST output features are reweighted pixel by pixel so that key responses are strengthened.

Structure of spatial refinement module (SRM).
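A plausible instantiation of such per-pixel refinement is a learned spatial gate, as sketched below (our assumption; the paper specifies only pixel-wise enhancement of key features).

```python
import torch.nn as nn

class SpatialRefinement(nn.Module):
    """Plausible SRM sketch (assumption: a single-channel spatial gate)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(dim, 1, 3, padding=1), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)  # per-pixel weighting of the SST output
```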
In the auxiliary branch, we designed a GIEM to enhance primary features through global context modeling, supplementing the main branch with global information. As shown in Figure 6, the input primary features are first optimized by two convolutional layers and then modulated by the modeled global context.

Structure of global information enhancement module (GIEM).
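A sketch of the GIEM idea under assumptions: two convolutions refine the shallow features, and a lightweight global-context branch (global pooling plus a 1 × 1 convolution) modulates them with image-level statistics. The exact layer configuration is our illustration, not the paper's.

```python
import torch.nn as nn

class GlobalInfoEnhancement(nn.Module):
    """GIEM sketch (layer choices assumed): conv refinement followed by a
    global-context modulation with a residual skip."""
    def __init__(self, dim):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, padding=1),
        )
        self.context = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(dim, dim, 1), nn.Sigmoid()
        )

    def forward(self, s):               # s: shallow features
        f = self.refine(s)
        return f * self.context(f) + s  # global modulation plus skip path
```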

Structure of interaction feature fusion (IFF).
To effectively integrate the main and auxiliary branches, we constructed an IFF module to highlight important features and suppress secondary information. As shown in Figure 7, the input main-branch and auxiliary-branch features are cross-weighted against each other and then fused, so that important features are emphasized and redundant ones are suppressed.
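A sketch of cross-weighted fusion consistent with this description (the gating forms and fusion layer are our assumptions):

```python
import torch
import torch.nn as nn

class InteractionFeatureFusion(nn.Module):
    """IFF sketch (assumed design): each branch is gated by a map computed
    from the other branch, then a 1x1 convolution fuses the concatenation."""
    def __init__(self, dim):
        super().__init__()
        self.gate_m = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.gate_a = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, fm, fa):         # main-branch and auxiliary features
        fm2 = fm * self.gate_a(fa)     # cross-weighting: aux gates main
        fa2 = fa * self.gate_m(fm)     # main gates aux
        return self.fuse(torch.cat([fm2, fa2], dim=1))
```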
Datasets and Metrics
Following the experimental setups of previous studies (Hui et al., 2019; Wang et al., 2022), we optimized the DBSR model using 800 training images from the DIV2K dataset (Timofte et al., 2017). The model’s reconstruction performance was evaluated using two publicly available RSI datasets (RS-T1 and RS-T2; Wang et al., 2022) and real image data (images from the GaoFen-2 and Beijing-2 satellites). Additionally, to comprehensively validate the model’s performance, we tested it on five SR benchmark datasets of natural images: Set5 (Bevilacqua et al., 2012), Set14 (Zeyde et al., 2012), BSD100 (Martin et al., 2001), Urban100 (Huang et al., 2015), and Manga109 (Matsui et al., 2017). The SR results were evaluated on the Y channel in the YCbCr color space using the following metrics: average peak signal-to-noise ratio (PSNR), structural similarity index (SSIM), learned perceptual image patch similarity (LPIPS), and no-reference image quality evaluator (NIQE). To evaluate the computational complexity of the network, we considered the number of model parameters and multi-adds (M-Adds) operations.
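For reference, a minimal PSNR-on-Y computation matching the stated protocol; the BT.601 luma transform used here is the common choice in SR evaluation, and the exact evaluation code is not given in the paper.

```python
import numpy as np

def psnr_y(sr, hr):
    """PSNR on the Y channel of YCbCr (sketch). Inputs: uint8 RGB arrays
    of shape (H, W, 3)."""
    def to_y(img):
        img = img.astype(np.float64)
        return 16 + (65.481 * img[..., 0] + 128.553 * img[..., 1]
                     + 24.966 * img[..., 2]) / 255.0
    mse = np.mean((to_y(sr) - to_y(hr)) ** 2)
    return 10 * np.log10(255.0 ** 2 / mse)
```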
Implementation Details
To obtain LR training images, we applied bicubic interpolation to downscale the HR images. To maximize the effectiveness of the training data, we employed data augmentation techniques such as random rotations.
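A minimal sketch of this data preparation, assuming square training patches; the horizontal flip is a common companion augmentation and an assumption on our part, as are all paths and parameter values.

```python
import random
from PIL import Image

def make_lr(hr_path, scale=4):
    """Bicubic degradation sketch: crop the HR image to a size divisible by
    the scale, then downscale to produce the LR training counterpart."""
    hr = Image.open(hr_path).convert("RGB")
    w, h = hr.size
    hr = hr.crop((0, 0, w - w % scale, h - h % scale))
    lr = hr.resize((hr.width // scale, hr.height // scale), Image.BICUBIC)
    return lr, hr

def augment(lr, hr):
    """Paired augmentation sketch: identical random rotation (and flip)
    for the LR/HR pair; rotation is lossless for square patches."""
    k = random.choice([0, 90, 180, 270])
    lr, hr = lr.rotate(k), hr.rotate(k)
    if random.random() < 0.5:
        lr = lr.transpose(Image.FLIP_LEFT_RIGHT)
        hr = hr.transpose(Image.FLIP_LEFT_RIGHT)
    return lr, hr
```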
Results on RSI Datasets
To validate the effectiveness of DBSR for remote sensing image super-resolution, we compared it with several lightweight models: SRCNN (Dong et al., 2014), VDSR (Kim et al., 2016), LGCNet (Lei et al., 2017), LapSRN (Lai et al., 2017), IDN (Hui et al., 2018), LESRCNN (Tian et al., 2020), CTNet (Wang et al., 2021), FeNet (Wang et al., 2022), and SAFMN (Sun et al., 2023). All models were trained on the same DIV2K training set to ensure fairness.
Table 1 presents a quantitative comparison of these models, and Figure 8 illustrates the SR results of image patches from two RSI datasets. The results indicate that DBSR outperformed all other models across all scaling factors on the RSI datasets. Notably, compared to the state-of-the-art lightweight remote sensing SR model FeNet, DBSR reduced the number of parameters by 30% and improved PSNR by 0.25 to 0.35 dB. Additionally, DBSR displayed clearer contours and better visual perception than other methods.

Visual comparison of DBSR with other SR methods on RSI datasets.
Quantitative Results on RSI Datasets.
Note. RSI = remote sensing image; DBSR = dual-branch super-resolution; PSNR = peak signal-to-noise ratio; SSIM = structural similarity index; HR = high-resolution; M-adds = multi-adds.
The best results are in bold.
The M-adds are computed on a 1280 × 720 HR image.
*Indicates results obtained by testing retrained weights.
To comprehensively assess the robustness of the DBSR model under varying conditions, we introduced the RSSCN7 (Zou et al., 2015) dataset and adjusted brightness factors to 0.5, 0.6, and 0.7 using Pillow's ImageEnhance.Brightness module. This produced three low-light scene datasets, named RSSCN7-0.5, RSSCN7-0.6, and RSSCN7-0.7, respectively. From each dataset, we randomly selected three images and used LGCNet, CTNet, FeNet, and SAFMN as comparative models. After upscaling the images by a factor of three, we conducted a comparative analysis; visual comparisons for the three datasets are presented in Figure 9. DBSR exhibited superior performance in reconstructing details and textures, providing significantly better visual quality than other methods. Additionally, DBSR achieved lower values for the LPIPS and NIQE metrics, indicating that its super-resolved image quality is closer to that of real images.
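The low-light datasets can be reproduced with a few lines of Pillow; the brightness factors below are those stated in the text, while the directory layout and file extension are illustrative assumptions.

```python
from pathlib import Path
from PIL import Image, ImageEnhance

def build_lowlight_set(src_dir, dst_dir, factor):
    """Darken every image in src_dir by the given brightness factor
    (factor < 1 darkens) and save the result to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for p in Path(src_dir).glob("*.jpg"):
        img = Image.open(p).convert("RGB")
        ImageEnhance.Brightness(img).enhance(factor).save(dst / p.name)

for f in (0.5, 0.6, 0.7):   # factors used in the paper
    build_lowlight_set("RSSCN7", f"RSSCN7-{f}", f)
```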

Visual comparison of DBSR with other SR methods on RSSCN7-0.5, RSSCN7-0.6, and RSSCN7-0.7 (×3 upscaling).
To thoroughly evaluate the potential application of the DBSR model in real-world projects, we collected 1576 landslide RSIs and used them to assess SR as a preprocessing step for landslide object detection (Figure 10).

Visual comparison of dual-branch super-resolution (DBSR) with other super-resolution methods in landslide object detection.
DBSR also exhibits excellent reconstruction performance across various remote sensing datasets and practical applications. Due to factors such as long observation distances and complex imaging equipment, RSIs often suffer from low spatial resolution, which can hinder tasks such as urban planning and disaster management. HR RSIs, by contrast, contain more details that can improve the efficiency of downstream applications. To address this issue, the DBSR model can be utilized for real-time SR reconstruction of LR images received during satellite data transmission and processing, thereby enhancing spatial resolution and detail. This real-time processing can be implemented on satellite edge-computing units or high-performance computing platforms at ground stations. Through these approaches, DBSR can provide robust support for the real-time analysis and application of remote sensing data.
We performed further comparisons of DBSR with the mentioned state-of-the-art SR methods. Quantitative results are shown in Table 2. For three scaling factors, DBSR achieves the best results across all test datasets. Moreover, compared to SAFMN, DBSR has better performance for a similar parameter count, especially on the Urban100 dataset, where PSNR improves by over 0.25 dB for each scaling factor.
Quantitative Results on SR Benchmark Datasets.
Note. SR = super-resolution; DBSR = dual-branch SR; HR = high-resolution; M-adds = multi-adds.
The best results are in bold.
The M-adds are computed on a 1280 × 720 HR image.
*Indicates results obtained by testing retrained weights.
To evaluate perceptual quality, we present three SR results of the models on the Urban100 dataset in Figure 11. The DBSR model produced excellent image clarity and texture detail, outperforming the other models. The success of DBSR is attributed to the enhancement of shallow features by the auxiliary branch: although bilinear interpolation does not alter the original pixel values, directly interpolating between pixels often results in blurring and artifacts. For instance, “img098” in Figure 11 shows prominent artifacts between steel frames when upsampled using bilinear interpolation. In the comparison models, adding the network output to the bilinearly interpolated shallow features introduces erroneous details, leading to display errors such as showing only three steel frames where there should be four. In contrast, DBSR, through its dual-branch design and especially the fine processing of shallow features by the GIEM, successfully recovers accurate images, avoiding the disturbances caused by directly upsampling shallow features.

Visual comparison of DBSR with other SR methods on the Urban100 dataset.
The lightweight SR model is designed to achieve efficient image processing through reduced computational complexity and faster inference speeds. To comprehensively evaluate the operational efficiency of our proposed method, we compared it with four representative lightweight SR methods, including IDN, CTNet, and FeNet. Experiments were conducted on two datasets, RS-T1 and DIV2K, from each of which we extracted 100 images at different resolutions for the efficiency tests.
The comparative results of model inference efficiency are displayed in Table 3. Compared to the most advanced methods, our DBSR model demonstrates significant advantages across multiple key metrics. With its carefully designed lightweight dual-branch structure, DBSR reduces GPU memory consumption by approximately 8%–10% compared to FeNet and outperforms FeNet in runtime on both GPU and CPU. Additionally, although DBSR’s runtime is comparable to that of IDN, its Multi-Adds are reduced by more than 30% on both datasets. These results indicate that DBSR achieves a favorable balance among inference speed, model complexity, and reconstruction performance. Taken together, DBSR not only enhances SR reconstruction performance but also surpasses existing state-of-the-art methods in inference efficiency and resource utilization, making it particularly well suited to resource-constrained real-world applications.
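For reference, a minimal sketch of how such GPU runtime and memory statistics can be gathered; the measurement protocol details (warm-up, image count, averaging) are assumptions, not the paper's exact procedure.

```python
import time
import torch

@torch.no_grad()
def benchmark(model, images, device="cuda"):
    """Average GPU runtime (ms) and peak memory (MB) over a list of LR
    tensors; requires a CUDA device."""
    model = model.to(device).eval()
    torch.cuda.reset_peak_memory_stats(device)
    model(images[0].to(device))          # warm-up to exclude one-time costs
    torch.cuda.synchronize(device)
    t0 = time.perf_counter()
    for x in images:
        model(x.to(device))
    torch.cuda.synchronize(device)       # wait for all kernels before timing
    avg_ms = (time.perf_counter() - t0) / len(images) * 1e3
    peak_mb = torch.cuda.max_memory_allocated(device) / 2 ** 20
    return avg_ms, peak_mb
```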
Quantifying the Model’s Lightweight Performance Across Different Resolutions.
Note. M-Adds = multi-adds; GPU = graphics processing unit; Avg = average; CPU = central processing unit; DBSR = dual-branch super-resolution.
The best results are in bold.
The statistics are computed for an LR input of 64 × 64 pixels.
Ablation Studies
In this section, we conduct a series of ablation experiments. First, we independently verify the effectiveness of the DBSR module and its auxiliary branch design. Next, we analyze the impact of the number of layers in the HCTF module and the number of separation layers in the SST on model performance. Finally, we validate the feature refinement capability of HCTF through feature map visualization.
Validation of DBSR Module Design
To validate the effectiveness of the modules designed in the DBSR framework, we used DBSR as a baseline and sequentially removed the CRC, SST, SRM, and GIEM. These modules were replaced with CA, spatial attention (SA), and swin transformer layer (STL) to construct nine variant models. These models were evaluated on the RS-T1 and RS-T2 datasets. Table 4 displays the quantitative results of these nine variants.
Ablation Experiments of DBSR Module Design on RSI Datasets.
Note. DBSR = dual-branch super-resolution; RSI = remote sensing image; CRC = channel recalibration; CA = channel attention; SST = separable swin transformer; STL = swin transformer layer; SRM = spatial refinement module; SA = spatial attention; GIEM = global information enhancement module; M-Adds = multi-adds; PSNR = peak signal-to-noise ratio; SSIM = structural similarity index.
The best results are in bold.
In Variant Model 1, the combination of CRC, STL, SRM, and GIEM demonstrated the best PSNR and SSIM performance, particularly on the RS-T1 dataset with a PSNR of 29.90 dB and an SSIM of 0.7717, and on the RS-T2 dataset, achieving a PSNR of 27.71 dB and an SSIM of 0.7752. Compared to other model variants, this combination showed significant improvements in reconstruction performance while maintaining a reasonable balance between parameter count (261K) and computational cost (1.3G M-Adds). This indicates that the joint functionality of the CRC, STL, SRM, and GIEM effectively enhances image reconstruction precision, thereby delivering superior performance in SR tasks. In Variants 3 and 8, the removal of the SST module resulted in a noticeable decline in PSNR and SSIM. Variant 3 saw a decrease of 0.32 dB on RS-T1 and 0.4 dB on RS-T2, while Variant 8 experienced a PSNR drop of 0.3 dB on RS-T1, even with the STL replacement and a 25% increase in parameter count. This confirms the importance of the multiscale window parallel computation designed in the SST module, which facilitates multiscale window information interaction and enhances reconstruction quality and efficiency.
Furthermore, in Variant Models 7, 9, and 10, replacing the CRC and SRM with CA and SA did not achieve performance equivalent to the original DBSR modules, showing a decrease in PSNR and SSIM on both the RS-T1 and RS-T2 datasets. This further underscores the critical role of DBSR module design in SR tasks, demonstrating that DBSR’s uniquely tailored modules are better suited to the demands of SR detail and multiscale feature requirements compared to conventional attention mechanisms. Overall, the combination of CRC, SST, SRM, and GIEM in the DBSR model exhibits significant superiority in balancing parameter volume, computational expense, and reconstruction quality, with these uniquely designed modules playing a crucial role in enhancing SR performance.
To validate the performance of the auxiliary branch design, we used the HCTF as the baseline model. This baseline model follows previous methods by directly adding the bilinearly interpolated shallow features to the SR result. We then replaced this with the GIEM and IFF modules for testing on the RSI datasets. As shown in Table 5, the lightweight intervention of the auxiliary branch significantly improves model performance. This improvement is mainly due to the precise processing of shallow features by the designed GIEM and the further refinement of primary and auxiliary features through the interactive feature fusion of the IFF module. The results in Figure 11 further confirm that our designed auxiliary branch significantly outperforms the traditional method of directly upsampling shallow features.
Ablation Experiments of Auxiliary Branch Design on RSI Datasets.
Note. RSI = remote sensing image; GIEM = global information enhancement module; IFF = interaction feature fusion; Params = parameters; M-Adds = multi-adds; PSNR = peak signal-to-noise ratio; SSIM = structural similarity index.
The best results are in bold.
To balance the parameters and performance of the HCTF model, we considered different layer settings. Data in Table 6 show that increasing the number of layers improves SR performance. Notably, even a single-layer model outperforms FeNet on the Set5 and Set14 datasets. To maintain a lightweight model, we set the number of HCTF layers to three.
Ablation Analysis of the Number of HCTF Modules on Set5 and Set14 Datasets.
Note. Params = parameters; M-Adds = multi-adds; PSNR = peak signal-to-noise ratio; SSIM = structural similarity index.
The best results are in bold.
To choose the optimal number of SST separation layers, we conducted comparative experiments with the number of separation layers set to 1, 2, 3, and 4, where one layer corresponds to the standard swin transformer. As shown in Table 7, the best performance was achieved with four layers. However, the larger windows involved in the four-layer setting increase the computational cost, so we weighed this gain against efficiency when selecting the final configuration.
Ablation Analysis of the Number of SST Separation Layers on Set5 and Set14 Datasets.
Note. Params = parameters; M-Adds = multi-adds; PSNR = peak signal-to-noise ratio; SSIM = structural similarity index.
The best results are in bold.
To visually demonstrate the effectiveness of the proposed HCTF in mitigating detail loss during cascaded CNN feature extraction, we performed a feature visualization analysis of HCTF during DBSR inference. As shown in Figure 12, the features become progressively more refined as the HCTF cascade deepens.

Average feature maps of HCTF blocks. HCTF1, HCTF2, and HCTF3 represent the intermediate features in the three-layer cascaded HCTF structure.
Conclusion
This study introduces a lightweight DBSR method that effectively combines the advantages of CNNs and the SST. The dual-branch architecture maintains a total parameter count of just 245K. The main branch utilizes CRC to dynamically adjust channel feature responses, thereby reducing channel redundancy. Additionally, we developed an SST that reduces the number of parameters while facilitating the parallel extraction of multiscale window features. The SRM further refines the features at the pixel level. The auxiliary branch enhances the global context of shallow features through the GIEM. Finally, the interaction feature fusion (IFF) module integrates the main and auxiliary branches, creating high-quality feature representations. Comparative experimental results indicate that the DBSR network performs exceptionally well on four RSI datasets and five SR benchmark datasets. Furthermore, extensive ablation experiments confirm the effectiveness of each module, further demonstrating the superior capabilities of the DBSR model.
In the future, we intend to employ distributed computing and sparse representation techniques to handle datasets with higher resolutions and larger scales, thereby enhancing the scalability of lightweight SR models. Additionally, we plan to incorporate generative adversarial networks to improve the model’s robustness when addressing complex RSI transformations. We will also explore cross-modal learning to enable generalization across different sensor types, expanding the model’s adaptability to remote sensing applications involving multiple sensors.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Chongqing Municipal Education Commission through the Scientific and Technological Research Program (grant no. KJQN202304015).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
