Sage Journals: Discover world-class research

Abstract

Single image super-resolution (SR) aims to reconstruct high-resolution (HR) images from their low-resolution (LR) counterparts. Although recent transformer-based SR methods have achieved impressive performance, their substantial computational complexity and memory requirements severely restrict practical deployment. To address these challenges, we propose the gated multi-scale interaction network (GMIN), a lightweight convolutional neural network architecture that effectively integrates transformer design principles. GMIN introduces the gated multi-scale interaction module, which comprises a spatially adaptive mixing layer (SML) and an enhanced spatial gated feed-forward network (EGSFN). The SML dynamically filters less informative features and aggregates multi-scale spatial information through innovative gating mechanisms, while EGSFN employs large-kernel convolutions with gating operations to capture rich spatial dependencies, significantly enhancing feature representation capabilities. Comprehensive experimental results demonstrate that GMIN achieves an exceptional balance between SR quality and computational efficiency, outperforming the transformer-based ESRT by 0.14 dB in peak signal-to-noise ratio while utilizing fewer parameters and requiring 77% fewer floating-point operations per second. These findings establish GMIN as a novel and practical solution for lightweight image SR, making a significant contribution to the development of efficient SR models suitable for resource-constrained environments.

Keywords

convolutional neural network image super-resolution transformer-based architectures gated multi-scale interaction

1. Introduction

Image super-resolution (SR) aims to reconstruct high-resolution (HR) images from low-resolution (LR) observations using learning-based approaches (Dong et al., 2016; Kim et al., 2016; Lim et al., 2017; Zhang et al., 2018). Due to the inherent resolution discrepancy between LR inputs and HR outputs, SR models are required to infer missing high-frequency spatial information, which inevitably introduces uncertainty. Moreover, the image degradation process is generally noninvertible, such that a single LR image may correspond to multiple plausible HR reconstructions. These characteristics make single image SR a highly ill-posed inverse problem (Zhu et al., 2022).

Current SR methods confront two primary challenges. The first pertains to computational complexity. Transformer-based SR models (Fang et al., 2022; Liang et al., 2021; Wang et al., 2023; Zhang et al., 2022; Zhao et al., 2020) employ multi-head self-attention (MHSA) to capture global contextual information, achieving superior performance compared to traditional convolutional neural networks (CNNs). However, the computational complexity of MHSA grows quadratically with the spatial size $S = H \times W$ , that is, $O (S^{2})$ (Li et al., 2022b; Wang et al., 2022b; Zhou et al., 2022; Zhu et al., 2024). As image resolution increases (e.g., to 4K), this quadratic complexity leads to a substantial increase in computational cost, memory consumption, and inference latency, which severely constrains the applicability of transformer-based SR models in real-world scenarios. As illustrated in Figure 1(b), transformer-based approaches generally incur significantly higher computational overhead than lightweight CNN-based models under comparable reconstruction quality.

Figure 1.

Motivation for GMIN: Addressing the dual challenges of texture fidelity and computational efficiency in image super-resolution. We illustrate (a) the visual superiority of GMIN in avoiding common artifacts and (b) its quantitative advantage in achieving high performance with reduced computational resources. (a) Visual quality: Comparison of texture recovery. Conventional CNNs often suffer from over-smoothing, while transformers may introduce ringing artifacts. In contrast, GMIN produces sharper and cleaner details. (b) Efficiency analysis: Tradeoff between model size (parameters), computational cost (FLOPs), and reconstruction quality (PSNR). GMIN achieves a superior balance, offering high performance with significantly lower computational overhead. Note. GMIN = gated multi-scale interaction network; CNN = convolutional neural network; FLOPs = floating-point operations per second; PSNR = peak signal-to-noise ratio.

The second challenge involves texture reconstruction quality. Under large upscaling factors (e.g., $\times 4$ ), existing methods exhibit significant deficiencies. CNN-based methods tend to generate overly smooth results due to limited receptive field coverage. Conversely, transformer-based methods, although capable of producing sharper edges, often introduce ringing artifacts or checkerboard noise, as shown in Figure 1(a). These effects stem from two main causes: (1) the lack of effective multi-scale dynamic feature selection mechanisms, which limits the flexible integration of local texture details and global context; and (2) the tendency of global attention mechanisms to amplify high-frequency components nonselectively, resulting in the simultaneous enhancement of useful texture signals and undesirable noise.

To systematically address these challenges, we draw inspiration from MetaFormer’s insight that “architecture matters more than attention” (Yu et al., 2022) and recent advances in efficient feature interaction (Li et al., 2022a; Qian et al., 2024; Rao et al., 2022; Xie et al., 2021). We establish core design principles: (1) replace computationally expensive global attention with efficient feature interaction mechanisms while preserving transformer-like architectural advantages; and (2) achieve balanced modeling of local texture and global context through dynamic multi-scale interaction.

Based on these principles, this paper proposes the gated multi-scale interaction network (GMIN). The core component is the multi-level GMIM. Specifically, we design a spatially adaptive mixing layer (SML) that filters redundant features through spatial adaptive units (SAUs) and utilizes gated multi-scale interaction blocks (GMIBs) to extract local and global features in parallel. To further enhance spatial modeling without the cost of MHSA, we introduce an efficient gated spatial feed-forward network (EGSFN) that employs large-kernel convolutions and gating branches. Additionally, GMIN adopts a lightweight design by removing layer normalization and redundant skip connections to optimize inference efficiency.

The main contributions of this paper include:

We introduce an efficient GMIN tailored for addressing image SR reconstruction challenges. In comparison to existing advanced methods, our approach significantly enhances network performance while striking a superior balance between lightweight design and overall performance (as shown in Figure 2).

We design an innovative GMIB module that effectively integrates local texture information with global contextual information through gated multi-scale branch architecture and adaptive feature fusion mechanisms, significantly suppressing ringing artifacts while successfully recovering fine texture details (as shown in Figure 1(a)).

We propose an efficient EGSFN module that enhances the network’s spatial modeling capabilities through large-kernel convolution operations and pixel-level gating mechanisms, while fully preserving the inherent advantages of feed-forward networks (FFNs) in channel nonlinear transformations.

Figure 2.

Performance-complexity tradeoff comparison of state-of-the-art SR methods on the Set14 dataset for $\times 4$ SR. Circle size represents model FLOPs. Note. SR = super-resolution; FLOPs = floating-point operations per second.

2. Related Work

2.1. CNN-Based Lightweight SR Model

In recent years, there has been considerable interest in lightweight SR models due to their smaller model size and reduced computational resource demands. To achieve lightweight SR models, researchers have explored various methodologies. Ahn et al. (2018) introduced an efficient neural network called cascading residual network (CARN), employing a cascading network structure and grouped convolutional operations. Compared to the then state-of-the-art methods, the network demonstrated significant reductions in parameters and computational effort, while maintaining comparable performance. Hui et al. (2019) proposed a lightweight information multi-distillation network (IMDN) by constructing cascaded information multi-distillation blocks. This network gradually extracts hierarchical features and utilizes the information distillation mechanism during training. However, the network’s parameter count is relatively high, and the efficiency of channel information distillation needs improvement. Addressing these issues, Liu et al. (2020) redesigned the architecture of IMDN and introduced a method named residual feature distillation network (RFDN). This network utilizes feature distillation connections instead of the information distillation mechanism, enhancing SR performance without introducing additional parameters.

More recently, researchers have focused on optimizing architectural efficiency further. For instance, Hao et al. (2024) proposed the lightweight blueprint residual network (LBRN), which leverages blueprint separable convolutions to minimize redundancy. Similarly, Gendy et al. (2024) introduced EConvMixN, a network based on extended convolution mixers that effectively balances local and global feature processing.

Despite the notable progress achieved by these lightweight SR networks, they all rely on CNN structures, capable only of extracting local features and facing challenges in learning global information. This limitation hinders the recovery of global texture details in images. Moreover, as network depth increases, these methods demand more computational resources and memory consumption, posing challenges for deployment on embedded terminals such as mobile devices.

2.2. Transformer-Based Lightweight SR Model

The successful adoption of transformer architecture in natural language processing has spurred significant interest in its application to computer vision. In contrast to traditional CNNs, the transformer model employs a self-attention mechanism, enabling it to capture long-distance dependencies between sequence elements, making it adept at processing sequence data. Dosovitskiy et al. (2020) introduced the visual transformer for image recognition, directly taking sequences of chunked image blocks as inputs, and pre-training it on a large dataset, achieving performance comparable to CNN-based methods. Touvron et al. (2021) combined transformer with distillation methods, proposing an efficient image transformer (DeiT) suitable for training on medium-sized datasets with enhanced robustness. Liang et al. (2021) pioneered the application of the transformer to hyper segmentation by introducing the Swin transformer network. This network utilizes multiple Swin transformer layers for local attention and cross-window interaction, incorporating a convolutional layer for feature enhancement. Through the synergistic integration of transformer and CNN, SwinIR surpasses other state-of-the-art SR methods.

Recent efforts further specialize transformer designs for lightweight SR. Zhang et al. (2022) presented an efficient long-range attention network (ELAN) for image SR, employing shift convolution and grouped multi-scale self-attention modules to leverage remote image dependencies, yielding superior results with much lower complexity than existing transformer-based models. To reduce the quadratic complexity of self-attention, Shi et al. (2023) introduced the efficient striped window transformer (ESWT), which utilizes a striped window mechanism to capture long-range dependencies with reduced computational overhead. Wang et al. (2023) introduced omni self-attention, which concurrently models pixel-per-pixel interactions in both spatial and channel dimensions, exploiting the potential of existing transformer-based models. Interactions are considered in both spatial and channel dimensions, capturing correlations between space and channel. The interaction between local propagation and global scale is facilitated using full-scale aggregation groups. Experiments demonstrate that the Omni-SR architecture achieves a peak signal-to-noise ratio (PSNR) of 26.95 dB at upscaling factor $\times 4$ on the Urban100 dataset using only 792K parameters.

3. Method

3.1. Network Architecture

Our GMIN builds upon MAN (Wang et al., 2022b) to effectively reconstruct HR images. As illustrated in Figure 3, GMIN comprises four essential components: shallow feature extraction, cascaded GMIMs, multi-stage feature fusion (MSFF), and image reconstruction.

Figure 3.

Overview of our GMIN. (a) The architecture of GMIN and (b) GMIM. Note. GMIN = gated multi-scale interaction network; GMIM = gated multi-scale interaction module.

The network first extracts shallow features from the LR input $I^{LR} \in R^{H \times W \times C}$ using a single $3 \times 3$ convolution:

X_{0} = f_{c 3 \times 3} (I^{LR}),

(1)

where

f_{c 3 \times 3} (\cdot)

denotes the convolution operation with 56 output channels.

These features then flow through $N$ consecutive GMIMs for hierarchical feature extraction:

X_{i} = F_{i}^{GMIM} (X_{i - 1}), i = 1, 2, \dots, N,

(2)

where

F_{i}^{GMIM} (\cdot)

represents the

i

-th GMIM’s transformation and

X_{i}

its output.

To maximize information utilization across different feature levels, we employ an MSFF module that integrates all GMIM outputs. The MSFF applies a $1 \times 1$ convolution for channel reduction, followed by GELU activation and a $3 \times 3$ convolution for feature refinement. A large-kernel attention (LKA) mechanism further enhances the spatial context:

\begin{aligned} X_{f} & = LKA (H_{msff} (Concat (X_{1}, X_{2}, \dots, X_{N}))), \\ H_{msff} & = f_{c 3 \times 3} \circ f_{GELU} \circ f_{c 1 \times 1}, \end{aligned}

(3)

where

Concat (\cdot)

represents channel-wise concatenation and

H_{msff} (\cdot)

the MSFF processing.

The reconstruction module combines the refined features with the shallow features via a residual connection, followed by pixel-shuffle upsampling:

I^{SR} = X_{ps\_up} (X_{0} + X_{f}),

(4)

where

X_{ps\_up} (\cdot)

consists of a

3 \times 3

convolution and pixel-shuffle operation.

We train the network using $L_{1}$ loss for its stability and effectiveness in SR tasks:

L_{1} (Θ) = {‖ I^{HR} - f_{GMIN} (I^{LR}) ‖}_{1},

(5)

where

Θ

represents the network parameters.

3.2. Gated Multi-Scale Interaction Module (GMIM)

Recent transformer-based models have shown remarkable potential in image SR. Yu et al. (2022) proposed MetaFormer, a general architecture abstracted from a transformer without specifying a particular token mixer. Their research demonstrated that the overall architectural framework contributes more significantly to model performance than the specific token mixing mechanism.

Following this insight, we designed our GMIM based on the MetaFormer architecture (Figure 4), achieving superior results compared to traditional CNN-based approaches. By intentionally omitting LayerNorm and skip connections for computational efficiency, our GMIM consists of two primary components: the spatially adaptive mixer layer (SML) for spatial information encoding and the efficient gated spatial feed-forward network (EGSFN) for channel information processing.

Figure 4.

Architectures of our GMIM. (a) SML, (b) SAU, and (c) GMIB. Our SML primarily consists of two components: SAU and GMIB. SAU is designed to dynamically exclude less critical feature information to enhance performance in the SR tasks. GMIB employs gated units for the effective aggregation of multi-scale contextual features. Note. GMIM = gated multi-scale interaction module; SML = spatially adaptive mixing layer; SAU = spatial adaptive unit; SR = super-resolution; GMIB = gated multi-scale interaction block.

3.2.1. Spatially Adaptive Mixer Layer (SML)

The MHSA mechanism is central to transformer architectures, dynamically generating weights to mix spatial tokens. However, its quadratic complexity significantly limits transformer applicability in low-level vision tasks. To address this limitation, we propose the SML with a lightweight spatially adaptive awareness unit (SAU). This design efficiently filters redundant low-frequency information while capturing critical high-frequency details, substantially reducing computational complexity. Furthermore, SML achieves effective multi-scale feature aggregation through the synergistic combination of gating units and depthwise convolution (DWConv).

The SML consists of two cascaded components: SAU and GMIB:

K = GMIB (SAU (X)),

(6)

where

SAU (\cdot)

dynamically filters features and

GMIB (\cdot)

performs multi-scale feature interaction through gate

G_{ξ} (\cdot)

and context branch

B_{ϑ} (\cdot)

Spatially adaptive Aware Unit ( SAU. Natural images exhibit inherent spatial redundancy due to their localized structures. To efficiently capture essential spatial information for SR, we introduce the SAU, which dynamically suppresses less critical feature information. As shown in Figure 4(b), SAU processes both local texture features within each patch (via ${Conv}_{1 \times 1} (\cdot)$ ) and global shape features across the entire patch region (via $GAP (\cdot)$ ). This selective attention mechanism enables the network to focus on features most relevant for enhancing image detail and clarity, allowing more efficient utilization of multi-level features:

\begin{aligned} F_{1}^{SAU} & = f_{c 1 \times 1} (F_{in}^{SAU}), \\ F_{out}^{SAU} & = f_{GELU} (F_{1}^{SAU} + W_{k} ⊙ (F_{1}^{SAU} - GAP (F_{1}^{SAU}))), \end{aligned}

(7)

where

GAP (\cdot)

denotes global average pooling and the scaling factor

W_{k} \in R^{(C \times 1)}

is initialized to zero.

GMIB. The GMIB effectively integrates contextual information through a dual-branch architecture with gated units, as illustrated in Figure 4(c). The upper branch first extracts local features using a $5 \times 5$ DWConv, then processes these features through two parallel subbranches to capture both local and global contextual information. One subbranch employs cascaded 1D depthwise convolutions for local feature extraction, while the other uses cascaded 1D depthwise dilated convolutions to capture global context. Notably, we implement cross-branch interaction to integrate multi-scale contextual information before concatenating the features from both sub-branches. A $1 \times 1$ convolution followed by SiLU activation then refines these concatenated features.

The lower branch serves as a gating mechanism that captures pixel-wise activation states using a $1 \times 1$ convolution and SiLU activation to generate a gating map. This map modulates the upper branch output through elementwise multiplication. The complete GMIB process is formulated as:

\begin{aligned} F_{low}^{GMIB} & = f_{c 1 \times 1} (F_{in}^{GMIB}), \\ F_{up,1}^{GMIB} & = f_{dw 5 \times 5} (F_{in}^{GMIB}), \\ F_{up,2}^{GMIB} & = f_{dw 7 \times 1} (F_{up,1}^{GMIB}) + f_{dw 9 \times 1 \_r d} (F_{up,1}^{GMIB}), \\ F_{up,3U}^{GMIB} & = f_{dw 1 \times 7} (F_{up,2}^{GMIB}), \\ F_{up,3L}^{GMIB} & = f_{dw 1 \times 9} (F_{up,2}^{GMIB}), \\ F_{up}^{GMIB} & = f_{c 1 \times 1} (Concat (F_{up,3U}^{GMIB}, F_{up,3L}^{GMIB})), \\ F_{out}^{GMIB} & = \underset{G_{ξ} (\cdot)}{\underset{⏟}{f_{SiLU} (F_{low}^{GMIB})}} ⊙ \underset{B_{ϑ} (\cdot)}{\underset{⏟}{f_{SiLU} (F_{up}^{GMIB})}}, \end{aligned}

(8)

where

f_{dw n \times m}

represents an

n \times m

depthwise convolution,

f_{dw n \times m \_r d}

denotes an

n \times m

depthwise dilated convolution with dilation rate

d

Concat (\cdot)

performs feature concatenation, and

⊙

indicates elementwise multiplication.

3.2.2. Enhanced Gated Spatial Feed-Forward Network (EGSFN)

While conventional FFNs effectively model channel-wise feature relationships, they often neglect spatial information and introduce redundancy through channel expansion. To address these limitations, we propose the EGSFN, as shown in Figure 5.

Figure 5.

Architecture of the efficient gated spatial feed-forward network (EGSFN). The network combines depthwise convolutions for local structure learning with a channel-split gating mechanism that selectively filters information, enhancing spatial feature capture while maintaining computational efficiency.

EGSFN first applies a $1 \times 1$ convolution to expand feature channels by a ratio $r$ , followed by a $5 \times 5$ depthwise convolution that efficiently captures local structural information. The resulting features are then split into two equal parts along the channel dimension. One part undergoes GELU activation to identify important activation states, while the other preserves the original feature information. These complementary representations are combined through elementwise multiplication, creating a gating mechanism that selectively emphasizes relevant spatial information. Finally, a $1 \times 1$ convolution restores the original channel dimensions. The EGSFN process is formulated as:

\begin{aligned} F_{1}^{EGSFN} & = f_{dw 5 \times 5} (f_{c 1 \times 1} (F_{in}^{EGSFN})), \\ F_{2,d}^{EGSFN}, F_{2,u}^{EGSFN} & = Split (F_{1}^{EGSFN}), \\ F_{out}^{EGSFN} & = f_{c 1 \times 1} (f_{gelu} (F_{2,d}^{EGSFN}) ⊙ F_{2,u}^{EGSFN}) . \end{aligned}

(9)

3.2.3. Large-Kernel Attention (LKA)

Research by Guo et al. (2023) demonstrates that LKA modules effectively expand the receptive field for image restoration tasks, significantly enhancing network representation capabilities. In our GMIN architecture, we incorporate LKA at the end of the deep feature extraction backbone to capture long-range dependencies before reconstruction. The LKA module is formulated as:

\begin{aligned} F_{1}^{LKA} & = f_{GELU} (f_{c 1 \times 1} (F_{in}^{LKA})), \\ F_{2}^{LKA} & = f_{dw 7 \times 7} (F_{1}^{LKA}), \\ F_{3}^{LKA} & = f_{dw 9 \times 9 \_r d} (F_{2}^{LKA}), \\ F_{4}^{LKA} & = f_{c 1 \times 1} (F_{3}^{LKA}), \\ F_{out}^{LKA} & = f_{c 1 \times 1} (F_{1}^{LKA} ⊙ F_{4}^{LKA}), \end{aligned}

(10)

where

f_{dw n \times m}

represents an

n \times m

depthwise convolution,

f_{dw n \times m \_r d}

denotes an

n \times m

depthwise dilated convolution with dilation rate

d

, and

⊙

indicates elementwise multiplication.

4. Experiments

4.1. Experimental Settings

4.1.1. Datasets and Evaluation Metrics

We train GMIN following the standard procedure on DIV2K (Agustsson & Timofte, 2017) and Flickr2K (Lim et al., 2017) datasets. In addition, we conducted testing using five commonly employed benchmark datasets, namely Set5 (Bevilacqua et al., 2012), Set14 (Zeyde et al., 2010), BSD100 (Martin et al., 2001), Urban100 (Huang et al., 2015), and Manga109 (Matsui et al., 2017). To measure the quality of restoration, we convert the SR image to the YCbCr color space and compute the PSNR and structural similarity (SSIM) metrics on the luminance channel. PSNR represents the ratio of the maximum possible power of the signal to the destructive noise power that affects its representation accuracy, and is commonly expressed in logarithmic decibels (dB) units. The value range of SSIM is $[0, 1]$ . The larger the value, the smaller the difference between the output image and the undistorted image, that is, the better the image quality.

4.1.2. Training Details

In line with prior research, we used the original HR training images to generate corresponding LR image pairs through bicubic downsampling (BI). To augment the training dataset, we applied random horizontal flips and rotations of $90 \circ$ , $180 \circ$ , and $270 \circ$ . The training images were then cropped into $48 \times 48$ -pixel patches, with each batch consisting of 64 randomly selected patches.

For optimization, we employed the Adan optimizer with $β_{1} = 0.98$ , $β_{2} = 0.92$ , $β_{3} = 0.99$ , and $ϵ = 10^{- 7}$ . The initial learning rate was set to $2.5 \times 10^{- 3}$ and halved after $6 \times 10^{5}$ and $8 \times 10^{5}$ iterations, resulting in a total of $1 \times 10^{6}$ training iterations. The loss function used was L1, and the number of channels was set to $C = 56$ . Our model was trained and tested on an NVIDIA RTX 4090 GPU, with implementation and deployment carried out using the PyTorch framework.

4.2. Comparison With Lightweight SR Methods

To evaluate the effectiveness of our GSIN, we conducted a comprehensive assessment that included both quantitative objective comparisons and subjective visual evaluations. Our goal was to benchmark GSIN against several leading lightweight SR methods. The methods included in this analysis are Bicubic, SRCNN (Dong et al., 2016), VDSR (Kim et al., 2016), CARN (Ahn et al., 2018), MAFSSRN (Muqeet et al., 2020), SMMR (Wang et al., 2021), IDN (Hui et al., 2018), IMDN (Hui et al., 2019), PAN (Zhao et al., 2020), LatticeNet (Luo et al., 2020), RFDN-L (Liu et al., 2020), Cross-SRN (Liu et al., 2021), FDIWN-M (Gao et al., 2022a), RFLN (Kong et al., 2022), HNCT (Fang et al., 2022), BSRN (Li et al., 2022b), EConvMixN (Gendy et al., 2024), and LBRN (Hao et al., 2024). We evaluated upscaling factors of $\times 2$ , $\times 3$ , and $\times 4$ , as detailed in Table 1 and visually represented in Figure 6.

Table 1.
Quantitative Comparison (Average PSNR/SSIM) of State-of-the-art Lightweight SR Methods for $\times 2$ , $\times 3$ , and $\times 4$ Upscaling.

Figure 6.

Visual comparison of state-of-the-art lightweight SR methods for upscaling factor $\times 4$ . GMIN excels in restoring sharper and more accurate details. Note. SR = super-resolution; GMIN = gated multi-scale interaction network.

4.2.1. Quantitative Objective Comparisons

Table 1 compares GMIN with state-of-the-art methods including RFLN (Kong et al., 2022), BSRN (Li et al., 2022b), and HNCT (Fang et al., 2022). For $\times 2$ upscaling, GMIN improves on Set14 by 0.13 dB in PSNR and 0.0039 in SSIM over RFLN. For $\times 3$ upscaling, GMIN outperforms BSRN across multiple datasets while using fewer parameters and less computation. At $\times 4$ upscaling, GMIN gains 0.04 dB PSNR and 0.0011 SSIM on Set5 compared to HNCT, with 10% fewer FLOPs. Compared to recent methods such as EConvMixN and LBRN, GMIN remains competitive at $\times 2$ and $\times 3$ scales. EConvMixN slightly outperforms GMIN on Set5 at $\times 4$ , but GMIN uses far fewer FLOPs. LBRN performs similarly overall, but GMIN beats it on Set14 and Urban100 for $\times 2$ and $\times 3$ upscaling with fewer parameters and lower computational cost.

4.2.2. Subjective Visual Effect Assessments

Visual comparisons between our approach and several state-of-the-art methods are presented in Figure 6. The experimental results demonstrate the enhanced reconstruction quality achieved by GMIN, particularly at the $\times 4$ scale across Urban100, BSD100, and Manga109 datasets. A detailed examination of specific test images, including “img-012” from Urban100, “78004” from BSD100, and “MukoukizuNoChonbo” from Manga109, confirms GMIN’s superior performance. Compared to existing methods, GMIN consistently preserves intricate details and textures present in the original HR images. The proposed method demonstrates particular effectiveness in recovering fine textural elements while accurately maintaining texture orientation and structural integrity of complex patterns such as meshes. These visual comparisons provide compelling evidence of GMIN’s capability to generate high-fidelity SR images with enhanced perceptual quality.

4.3. Comparison With Transformer-Based SR Methods

We developed an extended version of our architecture, GMIN-L, by stacking nine GMIMs with a channel width of 64. This transformer-style CNN variant was evaluated against leading transformer-based methods, including SwinIR (Liang et al., 2021), ESRT (Lu et al., 2022), LBNet (Gao et al., 2022b), ELAN-light (Zhang et al., 2022), and ESWT (Shi et al., 2023).

4.3.1. Quantitative Objective Comparisons

Table 2 presents a comprehensive comparison, demonstrating that GMIN-L achieves competitive or superior performance while requiring significantly fewer parameters and computational resources. At a $\times 2$ scale factor, GMIN-L consistently outperforms other transformer-based methods across multiple datasets. Notably, it exceeds the second-best approach by approximately 0.3 dB in PSNR on Urban100 and Manga109 datasets, and by 0.03 dB on Set5 and BSD100 datasets. This performance advantage is achieved while maintaining substantially lower computational complexity compared to existing transformer-based architectures.

Table 2.
Quantitative Comparison (Average PSNR/SSIM) With Other Advanced Transformer-Based SR Methods.

4.3.2. Subjective Visual Effect Assessments

Figure 7 presents visual comparisons at $\times 4$ scale factor across the BSD100 and Urban100 datasets.

Figure 7.

Visual comparison with other transformer-based super-resolution (SR) methods for upscaling factor $\times 4$ .

For the BSD100:253027 sample, competing methods including HNCT (Fang et al., 2022), FDIWN (Gao et al., 2022a), and BSRN (Li et al., 2022b) incorrectly reconstructed the building line orientations. By contrast, GMIN-L produced reconstructions that closely preserve the geometric structures present in the original HR images.

Similarly, for Urban100:img033, alternative approaches failed to accurately capture the ground line patterns, exhibiting noticeable blurring and geometric distortion. GMIN-L, however, generated reconstructions with substantially higher fidelity and minimal artifacts, resulting in superior visual quality.

These examples across diverse architectural and structural patterns demonstrate GMIN-L’s enhanced capability to preserve critical high-frequency details and geometric structures during SR reconstruction.

4.4. Ablation Study

4.4.1. Effect of SAU

To evaluate the effectiveness of the SAU, we conducted experiments with two comparative configurations within the SML framework. The first configuration completely removed the SAU component, denoted as “w/o SAU.” The second retained only a $1 \times 1$ convolution and GELU activation within the SAU structure, labeled as “w/Conv-1.”

As illustrated in Figure 8, incorporating SAU leads to measurable performance improvements. We further analyzed the computational complexity of SAU, with results presented in Table 3. While removing SAU (“w/o SAU”) slightly reduces model parameters and computational complexity (FLOPs), it simultaneously causes significant degradation in reconstruction quality as measured by PSNR and SSIM metrics. Similarly, when comparing with the simplified “w/Conv-1” configuration, which maintains nearly identical parameter count and computational overhead, we observe performance decreases of 0.01 dB in PSNR and 0.004 in SSIM. These experimental results demonstrate that SAU maintains a favorable balance between computational efficiency and performance enhancement, providing substantial quality improvements while introducing minimal additional computational burden.

Figure 8.

Ablation study about the effectiveness of SAU. The study was performed on the BSD100 and Urban100 datasets at $\times 3$ and $\times 4$ image SR. The best results are highlighted. Note. SAU = spatial adaptive unit; SR = super-resolution.

Table 3.

Impact of SAU Presence in the SML Module for $\times 4$ SR.

Methods	PSNR $↑$	SSIM $↑$	#Params [K]	#FLOPs [G]	Ave. Time (ms)
w/o SAU	28.69	0.7840	288	16.5	10.22
w/ Conv-1	28.73	0.7846	304	17.4	11.48
w/ SAU	28.74	0.7850	304	17.4	11.60

Note. The bold values indicate the best performance. “w/o SAU” means the SAU modules are removed from the SML. “w/ Conv-1” means we replace SAU in SML with a simple $1 \times 1$ convolutional layer and a GELU activation. All results are evaluated on Set14 for $\times 4$ SR. SAU = spatial adaptive unit; SML = spatially adaptive mixing layer; PSNR = peak signal-to-noise ratio; SSIM = structural similarity; SR = super-resolution; FLOPs = floating-point operations per second.

4.4.2. Effect of GMIB

The GMIB serves as a critical component of GMIM, offering both reduced computational demands and strong performance characteristics. To quantitatively evaluate its effectiveness, we conducted comparative experiments replacing GMIB with established feature extraction modules from other lightweight SISR architectures, specifically ESDB (excluding ESA and CCA; Li et al., 2022b) and MLKA (Wang et al., 2022b).

As shown in Table 4, GMIB demonstrates superior efficiency with lower parameter counts and computational complexity (FLOPs) compared to alternative approaches, while simultaneously achieving higher reconstruction quality. This performance advantage stems from GMIB’s effective multi-scale interaction mechanism, which captures both local and global image features. The significant performance-to-complexity ratio makes GMIB particularly valuable for developing efficient, lightweight SR models, confirming its effectiveness as a feature extraction module for image SR tasks.

Table 4.
Performance Comparisons of GMIB and Other Basic Units on Benchmark Datasets for $\times 4$ SR.

4.4.3. Effect of EGSFN

To assess the effectiveness of EGSFN, we conducted an ablation study with results presented in Table 5. We compared our proposed EGSFN against three alternative architectures: the original feed-forward network (FFN; Dosovitskiy et al., 2020), gated linear unit FFN (GLU-FFN; Chen et al., 2022), and convolutional FFN (Conv-FFN; Wang et al., 2022a).

Table 5.
Performance Comparisons of EGSFN, FFN (Dosovitskiy et al., 2020), GLU-FFN (Chen et al., 2022), and Conv-FFN (Wang et al., 2022a) on Benchmark Datasets for $\times 4$ SR.

The experimental results demonstrate that EGSFN outperforms FFN while maintaining a comparable parameter count. Compared to GLU-FFN, EGSFN incorporates DWConv, which enhances local perception capabilities and consequently improves overall performance, despite a modest increase in parameters and FLOPs. When compared to Conv-FFN, EGSFN not only achieves superior performance but also exhibits a more efficient parameter utilization due to the integration of the split operation.

4.5. Complexity Analysis

To accurately assess the complexity of our GMIN, we compared the inference speed of several lightweight SR methods on the BSD100 dataset ( $\times 4$ ). The results are presented in Table 6. Our analysis demonstrates that GMIN outperforms other advanced efficient SR methods such as EFDN (Wang, 2022), IMDN (Hui et al., 2019), PAN (Zhao et al., 2020), RFDN (Liu et al., 2020), HNCT (Fang et al., 2022), and BSRN (Li et al., 2022b), exhibiting superior performance with the second fastest inference times.

Table 6.
Quantitative Tradeoff Comparison Between Model Performance and Complexity of Image SR on BSD100 Dataset ( $\times 4$ ).

With comparable FLOPs, the proposed GMIN achieves a PSNR value of 27.67 dB, surpassing HNCT (Fang et al., 2022) by 0.04 dB and BSRN (Li et al., 2022b) by 0.02 dB, while requiring fewer parameters and shorter inference times. Even when compared to lightweight algorithms such as EFDN (Wang, 2022), which have fewer parameters and shorter inference times, GMIN maintains significantly higher performance.

Furthermore, when compared to LKDN (Xie et al., 2023), the current state-of-the-art lightweight SR method, GMIN, exhibits only a marginal 0.44% reduction in performance, while offering substantial efficiency gains: a 4.97% reduction in parameters, a 4.92% decrease in FLOPs, and a 20.54% reduction in inference time. This experimental evidence underscores the exceptional efficiency of GMIN when considering the balance between parameters, performance, and inference time.

4.6. Real-World Images SR

To evaluate the generalization capability of our GMIN and its SR performance in real-world applications, we conducted SR experiments at a scale factor of $\times 4$ using a real-world dataset without corresponding ground truth. We compared the SR results of GMIN against Bicubic interpolation, RFLN (Kong et al., 2022), HNCT (Fang et al., 2022), and BSRN (Li et al., 2022b). The qualitative results are illustrated through local zoomed-in comparisons in Figure 9.

Figure 9.

Visual comparison with the state-of-the-art methods for $\times 4$ real-world image super-resolution tasks.

As demonstrated in Figure 9, GMIN effectively enhances both the resolution and visual quality of real-world images. Compared to other methods, our proposed model produces images with more distinct textures, fewer artifacts, and greater detail preservation, resulting in an overall superior visual quality. These results highlight GMIN’s exceptional generalization capability and effectiveness for real-world SR applications, even when trained without access to ground truth HR images.

4.7. Result of Local Attribution Map (LAM)

LAM is an attribution analysis method for SR results proposed by Gu et al. (Gu & Dong, 2021). This technique highlights pixels that have the most significant impact on SR outcomes. For a given local patch, a larger attribution area in the LAM indicates that the SR network extracts and utilizes information from a wider range of pixels.

We employed LAM to compare GMIN with RFDN (Liu et al., 2020), HNCT (Fang et al., 2022), and BSRN (Li et al., 2022b), as illustrated in Figure 10. Our analysis reveals that GMIN achieves a higher distribution intensity value, demonstrating its superior capability to effectively utilize information from a broader range of pixels in the input LR image. This enhanced information utilization directly contributes to improved SR performance. Furthermore, these results confirm that our proposed GMIN possesses a larger effective receptive field, which is a critical factor for achieving superior SR results.

Figure 10.

The LAMs are visualized for RFDN (Liu et al., 2020), HNCT (Fang et al., 2022), BSRN (Li et al., 2022b), and the proposed GMIN. A larger highlighted area indicates that it captures more pixels associated with the input patch. Note. LAM = local attribution map; GMIN = gated multi-scale interaction network.

5. Limitations and Future Work

Despite GMIN's efficiency, its static convolution kernels lack the content-dependent dynamic adaptability of Transformer-based MHSA, potentially leading to suboptimal performance on images with extremely complex non-local patterns. Additionally, the domain gap between synthetic training and real-world degradations remains a challenge for robust deployment. In future research, we will investigate model compression techniques, such as integer quantization and structural reparameterization, to optimize deployment on resource-constrained edge devices. Furthermore, we intend to extend GMIN's efficient feature interaction to video SR by incorporating temporal alignment modules to exploit multi-frame redundancy.

6. Conclusion

In this paper, we introduce GMIN, a lightweight CNN based on the transformer framework, specifically designed to enable efficient image SR. To enhance feature extraction and integration, we propose the GMIB, which incorporates gated units and a dual-branch structure to simultaneously capture and interact with local and global features. Multi-scale feature representations are first generated through feature concatenation and a $1 \times 1$ convolution layer, and are subsequently refined via a gating mechanism. In parallel, spatial weights are produced and applied to the multi-scale features through the Hadamard product, effectively promoting the aggregation of contextual information across multiple scales. Building upon this design, we further develop the EGSFN, which leverages large-kernel convolutions and gating operations to capture spatial dependencies more effectively, enhancing the model’s overall representational capacity. Extensive experiments on several public benchmark datasets demonstrate that GMIN consistently outperforms existing state-of-the-art lightweight SR methods, achieving superior reconstruction accuracy while maintaining lower network complexity. These results confirm that GMIN strikes an excellent balance between performance and efficiency, offering a promising solution for real-world SR applications.

Footnotes

ORCID iDs

Gang Ke

Sio-Long Lo

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Agustsson

Timofte

(2017). NTIRE 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 126–135). IEEE.

Ahn

Kang

Sohn

K. A.

(2018). Fast, accurate, and lightweight super-resolution with cascading residual network. In Proceedings of the European conference on computer vision (ECCV) (pp. 252–268). Springer.

Bevilacqua

Roumy

Guillemot

Alberi-Morel

M. L.

(2012). Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the British Machine Vision Conference (BMVC).

Chen

Chu

Zhang

Sun

(2022). Simple baselines for image restoration. In European conference on computer vision (pp. 17–33). Springer.

Dong

Loy

C. C.

Tang

(2016). Accelerating the super-resolution convolutional neural network. In European conference on computer vision (pp. 391–407). Springer.

Dosovitskiy

Beyer

Kolesnikov

Weissenborn

Zhai

Unterthiner

Dehghani

Minderer

Heigold

Gelly

Uszkoreit

Houlsby

(2020). An image is worth

16 \times 16

words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.

Fang

Lin

Chen

Zeng

(2022). A hybrid network of CNN and transformer for lightweight image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1103–1112). IEEE.

Gao

(2022a). Feature distillation interaction weighting network for lightweight image super-resolution. In Proceedings of the AAAI conference on artificial intelligence (Vol. 36, pp. 661–669). AAAI Press.

Gao

Wang

Zeng

(2022b). Lightweight bimodal network for single-image super-resolution via symmetric CNN and recursive transformer. arXiv preprint arXiv:2204.13286.

10.

Gendy

Sabor

(2024). Lightweight image super-resolution network based on extended convolution mixer. Engineering Applications of Artificial Intelligence, 133, 108069.

11.

Dong

(2021). Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 9199–9208). IEEE.

12.

Guo

M. H.

C. Z.

Liu

Z. N.

Cheng

M. M.

S. M.

(2023). Visual attention network. Computational Visual Media, 9(4), 733–752.

13.

Hao

Liang

(2024). Lightweight blueprint residual network for single image super-resolution. Expert Systems with Applications, 250, 123954.

14.

Huang

J. B.

Singh

Ahuja

(2015). Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5197–5206). IEEE.

15.

Hui

Gao

Yang

Wang

(2019). Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM international conference on multimedia (pp. 2024–2032). ACM.

16.

Hui

Wang

Gao

(2018). Fast and accurate single image super-resolution via information distillation network. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 723–731). IEEE.

17.

Kim

Lee

J. K.

Lee

K. M.

(2016). Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1646–1654). IEEE.

18.

Kong

Liu

Bai

Chen

(2022). Residual local feature network for efficient super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 766–776). IEEE.

19.

Wang

Liu

Tan

Lin

Chen

Zheng

S. Z.

(2022a). Efficient multi-order gated aggregation network. arXiv preprint arXiv:2211.03295.

20.

Liu

Chen

Cai

Qiao

Dong

(2022b). Blueprint separable residual network for efficient image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 833–843). IEEE.

21.

Liang

Cao

Sun

Zhang

Van Gool

Timofte

(2021). Swinir: Image restoration using Swin Transformer. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 1833–1844). IEEE.

22.

Lim

Son

Kim

Nah

Mu Lee

(2017). Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops (pp. 136–144). IEEE.

23.

Liu

Tang

(2020). Residual feature distillation network for lightweight image super-resolution. In European conference on computer vision (pp. 41–55). Springer.

24.

Liu

Jia

Fan

Wang

Gao

(2021). Cross-SRN: Structure-preserving super-resolution network with cross convolution. IEEE Transactions on Circuits and Systems for Video Technology 32(4), 1850–1863.

25.

Liu

Huang

Zhang

Zeng

(2022). Transformer for single image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 457–466). IEEE.

26.

Luo

Xie

Zhang

(2020). LatticeNet: Towards lightweight image super-resolution with lattice block. In Computer vision–ECCV 2020: 16th European conference, Glasgow, UK, August 23–28, 2020, proceedings, part XXII 16 (pp. 272–289). Springer.

27.

Martin

Fowlkes

Tal

Malik

(2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings eighth IEEE international conference on computer vision, ICCV 2001 (Vol. 2, pp. 416–423). IEEE.

28.

Matsui

Ito

Aramaki

Fujimoto

Ogawa

Yamasaki

Aizawa

(2017). Sketch-based manga retrieval using manga109 dataset. Multimedia Tools and Applications, 76(20), 21811–21838.

29.

Muqeet

Hwang

Yang

Kang

Kim

Bae

S. H.

(2020). Multi-attention based ultra lightweight image super-resolution. In European conference on computer vision (pp. 103–118). Springer.

30.

Qian

Chen

Wen

Xie

Feng

D. D.

Kim

Sheng

(2024). Deep contour attention learning for scleral deformation from OCT images. The Visual Computer 41, 1155–1170.

31.

Rao

Zhao

Tang

Zhou

Lim

S. N.

(2022). Hornet: Efficient high-order spatial interactions with recursive gated convolutions. Advances in Neural Information Processing Systems, 35, 10353–10366.

32.

Shi

Liu

Zhang

Zhu

Zheng

Weng

(2023). Image super-resolution using efficient striped window transformer. arXiv preprint arXiv:2301.09869.

33.

Touvron

Cord

Douze

Massa

Sablayrolles

Jégou

(2021). Training data-efficient image transformers & distillation through attention. In International conference on machine learning (pp. 10347–10357). PMLR.

34.

Wang

Chen

Liu

(2023). Omni aggregation networks for lightweight image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 22378–22387). IEEE.

35.

Wang

Dong

Wang

Ying

Lin

Guo

(2021). Exploring sparsity in image super-resolution for efficient inference. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 4917–4926). IEEE.

36.

Wang

Xie

Fan

D. P.

Song

Liang

Luo

Shao

(2022a). Pvt v2: Improved baselines with pyramid vision transformer. Computational Visual Media, 8(3), 415–424.

37.

Wang

(2022). Edge-enhanced feature distillation network for efficient super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops (pp. 777–785). IEEE.

38.

Wang

Liu

(2022b). Multi-scale attention network for single image super-resolution. arXiv preprint arXiv:2209.14145.

39.

Xie

Zhang

Meng

Zhang

Zhao

(2023). Large kernel distillation network for efficient single image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 1283–1292). IEEE.

40.

Xie

Zhang

Sheng

Chen

C. P.

(2021). BaGFN: Broad attentive graph fusion network for high-order feature interactions. IEEE Transactions on Neural Networks and Learning Systems, 34(8), 4499–4513.

41.

Luo

Zhou

Wang

Feng

Yan

(2022). Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 10819–10829). IEEE.

42.

Zeyde

Elad

Protter

(2010). On single image scale-up using sparse-representations. In International conference on curves and surfaces (pp. 711–730). Springer.

43.

Zhang

Zeng

Guo

Zhang

(2022). Efficient long-range attention network for image super-resolution. In European conference on computer vision (pp. 649–667). Springer.

44.

Zhang

Wang

Zhong

(2018). Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV) (pp. 286–301). Springer.

45.

Zhao

Kong

Qiao

Dong

(2020). Efficient image super-resolution using pixel attention. In European conference on computer vision (pp. 56–72). Springer.

46.

Zhou

Cai

Liu

Chen

Qiao

Dong

(2022). Efficient image super-resolution using vast-receptive-field attention. In European conference on computer vision (pp. 256–272). Springer.

47.

Zhu

Zhang

Fei

Cai

Xie

Sheng

Yang

(2022). FFFN: Frame-by-frame feedback fusion network for video super-resolution. IEEE Transactions on Multimedia, 25, 6821–6835.

48.

Zhu

Yao

Zhang

Zhu

You

Yang

Zhang

Zhao

Zeng

(2024). TMSDNet: Transformer with multi-scale dense network for single and multi-view 3D reconstruction. Computer Animation and Virtual Worlds, 35(1), e2201.

Gated Multi-Scale Interaction Networks for Efficient Single Image Super-Resolution

Abstract

Keywords

1. Introduction

2.1. CNN-Based Lightweight SR Model

2.2. Transformer-Based Lightweight SR Model

3. Method

3.1. Network Architecture

4.1. Experimental Settings

4.1.1. Datasets and Evaluation Metrics

4.1.2. Training Details

4.2. Comparison With Lightweight SR Methods

Table 1. Quantitative Comparison (Average PSNR/SSIM) of State-of-the-art Lightweight SR Methods for × 2 , × 3 , and × 4 Upscaling.

4.2.2. Subjective Visual Effect Assessments

4.3. Comparison With Transformer-Based SR Methods

4.3.1. Quantitative Objective Comparisons

Table 2. Quantitative Comparison (Average PSNR/SSIM) With Other Advanced Transformer-Based SR Methods.

4.4.1. Effect of SAU

Table 4. Performance Comparisons of GMIB and Other Basic Units on Benchmark Datasets for × 4 SR.

Table 5. Performance Comparisons of EGSFN, FFN (Dosovitskiy et al., 2020), GLU-FFN (Chen et al., 2022), and Conv-FFN (Wang et al., 2022a) on Benchmark Datasets for × 4 SR.

Table 6. Quantitative Tradeoff Comparison Between Model Performance and Complexity of Image SR on BSD100 Dataset ( × 4 ).

6. Conclusion

Footnotes

ORCID iDs

Funding

Conflicting Interests

References

Table 1.
Quantitative Comparison (Average PSNR/SSIM) of State-of-the-art Lightweight SR Methods for $\times 2$ , $\times 3$ , and $\times 4$ Upscaling.

Table 2.
Quantitative Comparison (Average PSNR/SSIM) With Other Advanced Transformer-Based SR Methods.

Table 4.
Performance Comparisons of GMIB and Other Basic Units on Benchmark Datasets for $\times 4$ SR.

Table 5.
Performance Comparisons of EGSFN, FFN (Dosovitskiy et al., 2020), GLU-FFN (Chen et al., 2022), and Conv-FFN (Wang et al., 2022a) on Benchmark Datasets for $\times 4$ SR.

Table 6.
Quantitative Tradeoff Comparison Between Model Performance and Complexity of Image SR on BSD100 Dataset ( $\times 4$ ).