Sage Journals: Discover world-class research

Abstract

In the past decade, convolutional neural networks (CNNs) and transformers have achieved wide application in semantic segmentation tasks. Although CNNs with transformer models greatly improve performance, the global context modeling remains inadequate. Recently, Mamba achieved great potential in vision tasks, showing its advantages in modeling long-range dependency. In this article, we propose a lightweight efﬁcient CNN-Mamba network for semantic segmentation, dubbed as ECMNet. ECMNet combines CNN with Mamba skillfully in a capsule-based framework to address their complementary weaknesses. Speciﬁcally, we design an enhanced dual-attention block for a lightweight bottleneck. In order to improve the representation ability of the feature, we devise a multi-scale attention unit to integrate multi-scale feature aggregation, spatial aggregation, and channel aggregation. Moreover, a Mamba enhanced feature fusion module merges diverse level feature, signiﬁcantly enhancing segmented accuracy. Extensive experiments on two representative datasets demonstrate that the proposed model excels in accuracy and efﬁciency balance, achieving 70.6% mIoU on Cityscapes and 73.6% mIoU on CamVid test datasets, with 0.87 M parameters and 8.27G FLOPs on a single RTX 3090 GPU platform. Source code will be available at https://github.com/feixiangdu/ECMNet.

Keywords

Semantic segmentation lightweight convolutional neural network Mamba feature fusion

Introduction

Semantic segmentation aims to assign a label to each pixel in a given image, which is widely applied in autonomous driving,¹ remote sensing,^2–5 agriculture,⁶ and medical image,⁷ and more. Early semantic segmentation primarily relied on convolutional neural networks (CNNs), employing techniques like large convolutional kernels,⁸ dilated convolutions,⁹ and feature pyramids¹⁰ to extend receptive ﬁelds. However, these CNN-based approaches remained limited in capturing long-range dependencies. The advent of transformers¹¹ enabled more effective global context modeling in subsequent segmentation methods. Learning global context dependencies is essential for extracting global semantic features, particularly in intensive tasks like semantic segmentation. The rise of visual transformer (ViT)¹² has injected a new paradigm for semantic segmentation. Segmentation transformer (SETR)¹³ slices images into sequences for the ﬁrst time and captures global context feature through a self-attentive mechanism, outperforming traditional CNN models on complex scene datasets such as Cityscapes. Meanwhile, SegFormer¹⁴ further optimized the architectural design by proposing a hierarchical transformer encoder with a lightweight multi-layer perceptron decoder to achieve multi-scale feature fusion. However, the square-level computational complexity of the transformer limits its application to high-resolution images with insufﬁcient sensitivity to local details.

To tackle the limitation of the above single model and extract ﬁne spatial details, some models treated semantic segmentation tasks by integrating CNN with a transformer. For instance, HResFormer,¹⁵ PFormer,¹⁶ and DMFC-UFormer¹⁷ have achieved satisfactory results in the ﬁeld of medical image segmentation. However, the self-attention mechanism in CNN-transformer methods still poses challenges in terms of speed and memory usage when dealing with long-range visual dependencies, especially when processing high-resolution images.

Unlike previous transformers, Mamba¹⁸ shows great potential for high-resolution images by efﬁcient sequence modeling with linear complexity. Vision Mamba¹⁹ have recently demonstrated remarkable success in various computer vision tasks. For example, in the ﬁeld of three-dimensional (3D) medical imaging, SegMamba²⁰ achieves real-time inference on the colorectal cancer dataset CRC-500, with a speedup of 30% compared to 3D UNet. In addition, CM-UNet²¹ introduces a Mamba decoder into a CNN encoder to bridge local and global features through a channel-space attention mechanism, achieving higher mIoU on the ISPRS Vaihingen dataset.

To accommodate limited computational resources and mobile device applications, lightweight semantic segmentation models receive higher attention. For example, LEDNet²² employed channel split-and-shufﬂe operations within residual blocks, signiﬁcantly lowering computational complexity. While CFPNet²³ designed a channel-wise feature pyramid (CFP) module to signiﬁcantly reduces model parameters and model scale by extracting various-level feature maps and contextual feature information jointly. LETNet²⁴ used a lightweight dilated bottleneck (LDB) module and feature enhancement (FE) module for enhanced efﬁciency and accuracy with reduced model complexity.

Motivated by the success of Mamba and lightweight approaches in semantic segmentation tasks, we propose ECMNet, an efﬁcient CNN-Mamba hybrid network for lightweight semantic segmentation, optimized for minimizing model size and computational requirements. As depicted in Figure 1, the proposed ECMNet achieves an excellent balance between the accuracy, inference speed of the model, and model parameters. The main contributions of our article are fourfold:

We ﬁrstly propose a novel lightweight efﬁcient CNN-Mamba network (ECMNet) for semantic segmentation. ECMNet utilizes a U-shape encoder-decoder structure as a backbone and regards the feature fusion module (FFM) as a capsule network to capture global context information. Specifically, FFM introduces a two-dimensional-selective-scan (SS2D) block, a variant of Mamba, to learn long-range dependencies.

We design a lightweight enhanced dual-attention block (EDAB) to extract multi-dimensional semantic information. EDAB consists of dual-direction attention (DDA), channel attention (CA), and various convolution modules, realizing less model parameters and computational quantities.

We develop a multi-scale attention unit (MSAU) to improve the representations ability of feature, which further reﬁnes the local details and global contextual information.

ECMNet achieved 70.6% mIoU on the Cityscapes dataset on a single RTX 3090 GPU with only 0.87 M of parameters, realizing the better trade-off between performance and parameters. Meanwhile, our proposed method achieved 73.6% of the highest performance on the CamVid dataset, which demonstrates the effectiveness and generalization of our proposed ECMNet.

Figure 1.

Accuracy and model parameters comparison of efﬁcient convolutional neural network-Mamba network (ECMNet) and other lightweight models on the CamVid dataset. A larger circle denotes a faster inference speed.

Related work

Semantic segmentation methods based on CNN and transformer

Due to the efﬁcient local feature representation capabilities of CNNs, semantic segmentation has also advanced tremendously. Following the revolutionary CNN, FCN, and U-Net, many new architectures have been reﬁned on this basic principle. However, CNN-based methods face issues with the trade-off between image resolution and a limited receptive ﬁeld. To address these challenges, DeepLab and PSPNet build atrous spatial pyramid pooling using atrous convolutions in a parallel way, which better utilizes various-level contextual feature information.

Additionally, researchers have drawn much interest in the self-attention mechanism because of the advantages in modeling feature dependencies. ECANet²⁵ introduced a local cross-channel interaction mechanism that operates without dimensionality reduction and an adaptive selection method for determining optimal kernel sizes in one-dimensional convolutions. In addition, DANet²⁶ employed ResNet as its backbone architecture, integrating parallel attention modules in both spatial and channel dimensions. This design effectively captures long-range feature dependencies, enhancing segmentation performance.

However, all these methods build an enormous computational challenge for the machine. So the lightweight semantic segmentation networks were proposed. For example, ICNet²⁷ captured high-level semantic information and low-level spatial details by utilizing multi-scale images as input. BiseNet²⁸ and BiseNet-v2²⁹ introduced a two-path architecture, which is responsible for providing detailed information supplement and extracting deep semantic information, respectively. Furthermore, in order to enhance the feature expression and reduce the computation, the point-wise attention was designed. ESPNet³⁰ and ESPNet-v2³¹ fused decomposed convolution into point-wise convolution, which greatly reduces the number of parameters and computation. In addition, LEDNet²² introduced an attention pyramid network in its decoder, effectively reducing overall model complexity while maintaining performance. To alleviate the limitation of the single model and extract ﬁne spatial details, some methods combined CNN with transformer came into being. For instance, LETNet²⁴ incorporated two key components: a lightweight dilated bottleneck module and an enhanced feature reﬁnement module combining CNN-transformer capsules. This architecture can capture long-range feature dependencies for better segmentation results. HAFormer³² integrates CNN-based hierarchical feature learning and transformer-based global context modeling to further capture global representation. While existing methods still have room for enhancement in global feature representation and complexity.

Semantic segmentation methods based on Mamba

Mamba¹⁸ has achieved great success because of sequence modeling with linear complexity. Meanwhile, Vision Mamba¹⁹ have recently proved once again the possibilities in the ﬁeld of computer vision tasks, especially in semantic segmentation. For example, CM-UNet²¹ built the core segmentation decoder by employing channel and spatial attention as the gate activation condition of the vanilla Mamba, which enhances the feature interaction and global-local information fusion. RS³Mamba³³ utilized VSS blocks¹⁸ to achieve better segmentation in remote sensing by constructing an auxiliary branch to enhance a convolution-based main branch. Sigma³⁴ introduced two novel modules, a Siamese encoder and a Mamba-based fusion mechanism, to achieving global receptive ﬁelds with linear complexity.

Although the above methods have achieved good results, the current Mamba-based segmentation methods applied to remote sensing without considering the model size, which results in large model parameters. In addition, a one-dimensional sequence input of Mamba in the image domain disrupted local structural relationships with global context information. On the other hand, the absence of ﬁne-grained local features results in imprecise segmentation; CNN architectures effectively compensate by preserving spatial details through local feature extraction. In order to better handle the semantic segmentation task and reduce model parameters, we tend to explore a novel lightweight method integrating CNN with Mamba.

Proposed method

Overall network architecture

As shown in Figure 2, the overall network architecture of our proposed ECMNet consists of four components: a CNN encoder improved with EDABs, a CNN decoder with a subtle difference from the encoder, an efﬁcient Mamba-based feature fusion module, and three long skip connections enhanced with a multi-scale attention unit. Speciﬁcally, the CNN-based encoder-decoder architecture extracts localized features for detailed spatial representation. The Mamba-based FFM can capture complex spatial information and long-range feature dependencies by a state space model (SSM) to optimize global feature representations and computational complexity. The three long-distance skip connections generate more high-quality segmentation by focusing on low-level spatial information and high-level semantic information, respectively. The above mentioned elaborated modules make it more efﬁcient for ECMNet to fully integrate local and global feature information.

Figure 2.

The overall network architecture of efﬁcient convolutional neural network-Mamba network (ECMNet).

Enhanced dual-attention block

As shown in Figure 3, the structure of EDAB is inspired by the idea of a multi-head attention mechanism. The module is designed to focus different level feature information and keep network parameters as few as possible. Firstly, the input feature passes through a bottleneck structure that utilizes a 1 × 1 convolution to reduce the number of channels to half, signiﬁcantly reducing the computational complexity and the number of parameters. Obviously, this will sacriﬁce a part of the accuracy, but it will be more beneﬁcial to introduce 3 × 1 convolution and 1 × 3 convolution more than make up for the loss at this point. Meanwhile, the two decomposed convolutions not only obtain a wider respective ﬁeld for capturing a larger range of contextual feature information but also consider the model parameters and calculation complexity. The core of EDAB lies in its two-branch path, which captures local and global feature information, respectively. Decompose the convolution in one branch with CA processes local and short-distance feature information, complemented by atrous convolution in the parallel branch with DDA for global feature integration. Then, the channel contains most of the feature information, and the spatial feature information is key to enhancing performance and suppressing noise interference. DDA precisely captures the bidirectional spatial correlations within feature maps by modeling pixel dependencies in both height and width dimensions, enabling reﬁned representation and enhancement of spatial details. In the DDA module, Query, Key, and Value are the core components for implementing bidirectional spatial attention modeling. For the height direction (Key_h, Quer_h) and width direction (Key_v, Quer_v) branches, Key captures reference information about features, while Query initiates queries for key features. After multiplying these components and applying softmax to generate attention weights, an einsum operation is performed with the Value layer containing the features awaiting augmentation. This process models pixel dependencies in both the height and width dimensions. This complements CA, collectively enhancing the ability of EDAB to express multidimensional features and providing more discriminative feature support for subsequent visual tasks. Therefore, the two branches utilize CA and DDA, which aim to build different attention matrices to learn multidimensional feature information and improve feature expression. Finally, the outputs from both designed paths and intermediate features are integrated and processed by a 1 × 1 point-wise convolution to restore the original channel dimensionality. A channel shufﬂe strategy is applied at the end of EDBA to establish inter-channel correlations and overcome information fragmentation. The detailed operation is shown as follows:

F_{u p_b r a n c h} = C o n v_{1 \times 3} (C o n v_{3 \times 1} (C o n v_{1 \times 1} (x)))

(1)

F_{m i d_b r a n c h_1} = C o n v_{C A} (C o n v_{1 \times 3, D} (C o n v_{3 \times 1, D} (F_{u p_b r a n c h})))

(2)

F_{m i d_b r a n c h_2} = C o n v_{D D A} (C o n v_{1 \times 3, D, R} (C o n v_{3 \times 1, D, R} (F_{u p_b r a n c h})))

(3)

Y_{1} = C o n v_{C S} (f_{1 \times 1} (F_{u p_b r a n c h} + F_{m i d_b r a n c h_1} + F_{m i d_b r a n c h_2}) + x)

(4)where x denotes the input of the EDAB, Y₁ denotes the output feature map of the EDAB, and Conv_k _× _k (·) is a normal convolution operation. Among the sufﬁx, D denotes the depth-wise convolution, R is the atrous rate of atrous convolution, CA represents the channel attention, DDA represents the dual-direction attention, and CS denotes the shufﬂe operation of the channel.

Figure 3.

The overall architecture of the proposed enhanced dual-attention block (EDAB).

Multi-scale attention unit

On the one hand, lower layers preserve ﬁne spatial details with limited semantics; on the other hand, higher layers offer strong semantic representation at lower spatial resolution. Therefore, it is an efﬁcient strategy to combine the low-level rich spatial information and high-level rich semantic information for semantic segmentation tasks. Inspired by U-Net, we use the same resolution connections to integrate the high-level feature maps and low-level feature maps. In order to better process the three long connections, we design an MSAU to enhance the ability of feature representation. As shown in Figure 4, MSAU is carried out from two branches: one is the multi-scale spatial aggregation, the other is the channel aggregation.

Figure 4.

The architecture of our proposed multi-scale attention unit (MASU).

In the multi-scale spatial aggregation, the input feature map is utilized with a 1 × 1 convolution to convert from C channel to C/2 channel. In order to reduce the amount of parameters and computation while retaining the ability of multi-scale feature extraction, the next feature map goes through different sizes of depth-separable convolution, such as 3 × 3, 5 × 5, and 7 × 7. Meanwhile, the outputs of different scale convolutions obtain multi-scale feature information, enhancing the multi-scale perception capability of the model. Then, the multi-scale fused feature map compresses the height dimension to 1 by adaptive average pooling, and generates a spatial attention map by 7 × 7 depth separable convolution, 1 × 1 convolution, and sigmoid activation function. At the same time, by multiplying with the multi-scale fused feature map, the processed feature highlights the key spatial regions and suppresses the irrelevant information. At last, the channel of model is converted from C/2 back to C by using 1 × 1 convolution, and the attention map reﬂects the importance of the different locations of the feature map. For channel aggregation, the input feature map uses average pooling and maximum pooling to obtain average channel features and maximum channel features, respectively, which captures channel statistics from different angles. The MSAU multiplies the spatial and channel aggregation results and adds them with the original input feature maps to obtain the output feature maps.

This design allows the MSAU module to fuse the low-level spatial information with the high-level semantic information more effectively and further enhance the ability of feature expression. The detailed operation can be deﬁned as follows:

\begin{aligned} X_{1} = & C o n v_{(3 \times 3)} (C o n v_{(1 \times 1)} (x)) + C o n v_{(5 \times 5)} (C o n v_{(1 \times 1)} (x)) \\ + C o n v_{(7 \times 7)} (C o n v_{(1 \times 1)} (x)) \end{aligned}

(5)

X_{2} = C o n v_{(1 \times 1)} (X_{1} \otimes S i g m o i d (C o n v_{(7 \times 7)} (P o o l (X_{1}))))

(6)

X_{3} = C o n v_{(1 \times 1)} (R e L U (C o n v_{(1 \times 1)} (A v g P o o l (C o n v_{(3 \times 3)} (x)))))

(7)

X_{4} = C o n v_{(1 \times 1)} (R e L U (C o n v_{(1 \times 1)} (M a x P o o l (C o n v_{(3 \times 3)} (x)))))

(8)

Y_{2} = x + (X_{2} \otimes (X_{3} + X_{4}))

(9)where x denotes the input of the MSAU and Y₂ represents the output feature map of the MSAU. Among the formulas, Conv_k _× _k (·) is a normal convolution operation, ⊗ denotes element-by-element multiplication, Pool(·) denotes the adaptive average pooling, AvgPool(·) is average pooling, MaxPool(·) is maximum pooling, ReLU(·) is rectiﬁed linear unit, and Sigmoid(·) is the sigmoid activation function.

Feature fusion module

Motivated by the effectiveness of Mamba in linear-complexity sequence modeling, we design an FFM by introduce SS2D block for better capturing global representations with less network parameters and computational quantities. SS2D block serves as a powerful alternative to the self-attention mechanism in vision transformers, designed to capture long-range spatial dependencies with linear computational complexity. The SS2D block is built upon the framework of linear SSM, which model a dynamical system's evolution. The block utilizes a decomposed 2D scanning strategy involving horizontal scanning and vertical scanning. The feature map is scanned independently for each row by horizontal scanning. Two separate SSMs are typically used: one from left-to-right and another from right-to-left. This captures long-range dependencies within each row. The feature map is scanned independently for each column by vertical scanning, both top-to-bottom and bottom-to-top. This captures long-range dependencies within each column. The outputs from the two scanning directions are fused to integrate the holistic 2D spatial context. A simple yet effective fusion method is element-wise addition, followed by a non-linear activation. The detailed operation can be deﬁned as follows:

Z_{h o r i z o n t a l} = S S M_{l e f t \to r i g h t} (Z) + S S M_{r i g h t \to l e f t} (Z)

(10)

Z_{v e r t i c a l} = S S M_{t o p \to b o t t o m} (Z) + S S M_{b o t t o m \to t o p} (Z)

(11)

Z_{S S 2 D} = S i L U (Z_{h o r i z o n t a l} + Z_{v e r t i c a l})

(12)where Z ∈ R^H ^× ^W ^× ^C denotes the 2D feature map, SSM denotes the state space model. SiLU is a famous non-linear activation.

As shown in Figure 5, the FFM enriches the feature diversity by integrating different-scale feature information from the multi-level MSAU and the encoder through the concatenation operation. Then, the SS2D block further extracts and fuses the features through a series of linear transformations and 2D convolution operations, which employs a selective scanning mechanism to enhance the feature representation ability. Finally, a feed-forward network (FFN) performs a non-linear transformation to adjust the weight distribution of features, highlighting the key features and suppressing the redundant information, so as to improve performance in handling complex tasks. The designed FFM can effectively fuse multi-scale features and capture both local detail information and overall semantic features, great improving the performance of the model in semantic segmentation tasks. The complete operation is shown as follows:

X_{F F N} = F F N (S S 2 D (C o n c a t (x_{e n c o d e r}, x_{M S A U 1}, x_{M S A U 2})))

(13)

Y_{3} = X_{F F N} + x_{e n c o d e r}

(14)where x_encoder, x_MSAU₁, and x_MSAU₂ denote the output of the encoder and MSAU, respectively, and Y₃ denotes the output feature map of the FFM. Among the formulas, Concat(·) is a normal concatenation operation. SS2D(·) is the 2D-selective-scan block, and FFN(·) is the feed-forward network.

Figure 5.

The architecture of our proposed feature fusion module (FFM).

Experiments

Datasets

Cityscapes. This dataset is composed of high-quality 5000 images, annotated at the pixel level. The images are primarily scenes of driving within urban settings, captured across 50 different cities with a resolution of 2,048Œ1,024. The dataset was divided into training sets (2975 images), validation sets (500 images), and test sets (1525 images).

CamVid. The CamVid dataset, developed by the University of Cambridge, contains urban road scene images captured from a driving perspective (960Œ720 resolution). Its 700 + annotated samples support supervised learning, featuring 11 representative object classes that effectively capture urban road elements. This diversity in objects and well-annotated classes makes it particularly suitable for our segmentation accuracy research.

Implementation details

Our proposed ECMNet, implemented in PyTorch, was trained using an NVIDIA RTX 3090 GPU. We employ random initialization and full training from scratch, extending the maximum epoch count to 1000. In training the ECMNet model, different parameter conﬁgurations are used for the Cityscapes and the CamVid datasets. The parameter conﬁgurations for the Cityscapes dataset included a batch size of 6, the use of a cross-entropy loss function, a stochastic gradient descent optimizer with a momentum of 0.9, a weight decay of 1 × 10^–4, an initial learning rate of 0.045, and the use of a polynomial learning rate strategy. The parameter conﬁguration for the CamVid dataset included a batch size of 8, the same cross-entropy loss function, the Adam optimizer, a momentum of 0.9, a weight decay of 0.0002, an initial learning rate of 1 × 10^–3, and a polynomial learning rate strategy as well. These parameters are set to adapt to the characteristics of different datasets with a view to obtaining the best training results.

Ablation studies

We design a series of ablation experiments to validate the effectiveness of each module in our proposed model. As shown in Figure 6, the baseline model used for comparison is structured as a simple U-shape type, including the standard encoder and decoder. The encoder and decoder consist of multiple lightweight EDABs, which are modeled to achieve an average mIoU of 69.92% on the CamVid validation set.

Figure 6.

The simple structure of the baseline model.

In the long connection ablation experiments (A group), the effect of gradually adding Line 1, Line 2, and Line 3 is investigated. The observed 0.61% enhancement after adding Line 1 substantiates that shallow information effectively aids semantic feature information reconstruction. Meanwhile, with three long skip connections, the model achieves a 1.29% mIoU enhancement. These results further demonstrate the signiﬁcance of long-range skip connections for the semantic segmentation task. In the MSAU ablation experiments (B group), the MSAU module is added gradually to the long connection. A comparison between B1 and A1 reveals that adding the MSAU module to long connections only adds 9.43 K parameters, but improves the performance by 0.92% of mIoU. In the last ablation experiments (C group), the introduction of the FFM improves the performance of the model by 1.11% of mIoU. Finally, as the ﬁnalized architecture (C3), our proposed ECM-Net improves performance by 3.7% mIoU compared to the baseline model. All the above experiments shown in Table 1 fully validate the efﬁcacy of our proposed modules and strategies.

Table 1.

Extensive ablation study for the proposed ECMNet on the CamVid dataset.

Architecture	Method							Parameter (K)↓	FLOPs (G)↓	mIoU (%)↑
	Long connection			MSAU			FFM
	1	2	3	1	2	3
Baseline	–	—	–	–	–	–	–	775.57	7.56	69.92
A1	✓	—	–	–	–	–	–	777.93	7.57	70.530^.61↑
A2	✓	✓	–	–	–	–	–	796.41	7.64	70.921^.00↑
A3	✓	✓	✓	–	–	–	–	805.67	7.79	71.211^.29↑
B1	–	–	–	✓	–	–	–	787.34	7.57	71.451^.53↑
B2	–	–	–	✓	✓	–	–	805.82	7.67	72.652^.73↑
B3	–	–	–	✓	✓	✓	–	815.08	7.90	73.223^.30↑
C1	–	–	–	–	–	–	✓	827.80	7.80	70.750^.83↑
C2	✓	✓	✓	–	–	–	✓	863.93	8.06	71.031^.11↑
C3 (ours)	–	–	–	✓	✓	✓	✓	871.11	8.27	73.623^.70↑

A, B, and C denote the long connection, the feature enhancement, and the feature fusion, respectively. No. of A, B, and C denote the stack of the same or different modules.

To further verify the efﬁciency and necessity of the proposed MSAU and FFM, we selected natural images featuring various scenes from the Cityscapes dataset, including cars, trafﬁc signs, pedestrian streets, and pavements, to comprehensively investigate the impact of each module. So we visualized the layer-CAM (layer-wise class activation mapping) of results to illustrate the attention areas of different modules. As shown in Figure 7, it demonstrated our proposed FFM focuses on global details, achieving better segmentation boundaries. Meanwhile, our proposed MSAU reﬁned local details feature by fully utilizing the attention mechanism. According to these layer-CAM, our proposed ECMNet both captured the global information and obtained more attention on detailed information, which result in achieving excellent performance.

Figure 7.

The layer-wise class activation mapping (layer-CAM) visualization of different modules on Cityscapes.

Comparisons with state-of-the-art (SOTA) methods

In this section, we compare SOTA semantic segmentation methods in recent years on the Cityscapes and CamVid datasets to verify that our approach achieves a better balance between performance and parameters. Our evaluation is based on three key metrics: model parameters, ﬂoating-point operations (FLOPs), and mIoU.

Evaluation Results on Cityscapes Dataset.

As shown in Table 2, the model with a larger number of parameters and computation obviously achieves excellent segmentation results. However, the computational complexity of the model is high, and its operation speed is slow, which is unsuitable for real-time intelligent embedded devices. In contrast, lightweight models, such as NDNet,³⁵ CGNet,³⁶ CFPNet,²³ LEDNet,²² and LETNet²⁴ are computationally efﬁcient, but lack overall performance, especially in accuracy. Obviously, the LBN-AA³⁷ achieved the highest mIoU with 6.2 M model parameters, which are far more than our proposed approach. Meanwhile, the ESPNet³⁰ utilized the least parameters to realize 60.3% mIoU, which is signiﬁcantly lower than the performance of our method. Our proposed ECMNet, with 0.87 M parameters, achieved a higher 70.6% mIoU. In Figure 8, we also show the visualization outcomes of these methods on the Cityscapes. Our proposed ECMNet achieves excellent results in natural image segmentation compared to CNN-based and transformer-based approaches, as shown in Figure 8. Meanwhile, our method obtains better detail features compared to other methods. Both the LETNet and the LEDNet exhibit missed small objects and achieve incorrect segmentation boundary comparison with the ground truth. For the proposed ECMNet, it can better segment objects and generate the improvement result closing to the ground truth, such as trafﬁc signs. Our proposed ECMNet can get better segmentation results with less model parameters, which beneﬁts from well-design structure and the utilization of the Mamba. These results fully demonstrate that our proposed model can achieve an excellent balance between model parameters and performance.

Figure 8.

Visualization outcomes on the Cityscapes dataset. From top to bottom: original input images, ground truths, predictions of ECMNet, LETNet,²⁴ LEDNet,²² ERFNet,⁴¹ ESPNet,³⁰ and ENet.³⁹ Note that ﬁve examples are shown. Starting with the third line, each line has rectangular dashed boxes representing the segmentation results where a distinction exists.

Table 2.

Performance comparison of our proposed ECMNet and other representative methods on the Cityscapes dataset.

Method	Year	Resolution (width × height)	Backbone	Parameter (M)↓	FLOPs (G)↓	GPU	Speed (FPS)↑	mIoU (%)↑
SegNet³⁸	2017	640 × 360	VGG-16	29.50	286.0	*	17	57.0
ENet³⁹	2016	512 × 1024	No	0.36	3.8	Titan X	135	58.3
ESPNet³⁰	2018	512 × 1024	ESPNet	0.36	–	Titan XP	113	60.3
NDNet³⁵	2021	512 × 1024	No	0.50	3.5	Titan X	120	61.1
CGNet³⁶	2021	360 × 640	No	0.49	6.0	V100	–	64.8
ADSCNet⁴⁰	2020	512 × 1024	No	–	–	1080 Ti	77	67.5
ERFNet⁴¹	2017	512 × 1024	No	2.10	–	Titan X	42	68.0
BiseNet-v1²⁸	2018	768 × 1536	Xception	5.80	14.8	Titan XP	106	68.4
ICNet²⁷	2018	1024 × 1024	PSPNet-50	26.50	28.3	Titan X	30	69.5
DABNet⁴²	2019	1024 × 2048	No	0.76	42.4	1080 Ti	28	70.1
CFPNet²³	2021	1045 × 2048	No	0.55	–	2080 Ti	30	70.1
FPENet⁴³	2019	512 × 1024	No	0.40	12.8	Titan V	55	70.1
LEDNet²²	2019	512 × 1024	No	0.94	–	1080 Ti	40	70.6
DFANet⁴⁴	2019	1024 × 1024	Xception	7.80	3.4	Titan X	100	71.3
STDC1-50⁴⁵	2021	512 × 1024	–	8.40	–	1080 Ti	87	71.9
SegFormer¹⁴	2021	512 × 1024	MiT-B0	3.80	17.7	V100	48	71.9
MSCFNet⁴⁶	2022	512 × 1024	No	1.15	17.1	Titan XP	50	71.9
FPANet⁴⁷	2022	512 × 1024	–	14.10	–	Titan V	–	72.0
MLFNet⁴⁸	2023	512 × 1024	ResNet-34	13.03	15.5	Titan XP	72	72.1
BiseNet-v2²⁹	2021	512 × 1024	Xception	3.40	21.2	1080 Ti	156	72.6
MGSeg⁴⁹	2021	1024 × 1024	ShufﬂeNet-v2	4.50	16.2	Titan XP	101	72.7
PCNet⁵⁰	2022	1024 × 2048	Scratch	3.40	21.2	2080 Ti	156	72.6
LETNet²⁴	2023	512 × 1024	No	0.95	13.6	3090 Ti	150	72.8
SCTNet-S⁵¹	2024	512 × 1024	No	4.6	451.2	3090 Ti	160.3	72.8
HSB-Net⁵²	2021	512 × 1024	ResNet-34	12.10	–	2080 Ti	124	73.1
LBN-AA³⁷	2021	448 × 896	No	6.20	49.5	Titan X	51	73.6
ECMNet (ours)	–	1024 × 1024	No	0.87	8.27	3090	43	70.6

The gray box denotes the best value of the current metric.

Evaluation results on CamVid dataset

As shown in Table 3, to further verify the effectiveness and generalization capacity of our proposed ECMNet, we conducted comparative experiments with our method and other lightweight models on the CamVid dataset. Obviously, the MGSeg⁴⁹ just achieved the 72.7% mIoU with 13.3 M model parameters, which is lower performance and larger model parameters compared to our proposed method. Therefore, our method has achieved the best accuracy by only using 0.87 M parameters. Compared to Cityscapes, the higher overall performance on the CamVid dataset is due to our designed modules and strategies, which better capture feature of small-sized datasets. Per-class results are detailed in Table 4, further demonstrating the advantages of our proposed ECMNet.

Table 3.

Performance comparison of our proposed ECMNet and other representative methods on the CamVid dataset.

Method	Year	Resolution	Backbone	Parameter (M)↓	GPU	Speed (FPS)↓	mIoU (%)↑
ENet³⁹	2016	360 × 480	No	0.36	Titan X	68	51.3
SegNet³⁸	2017	360 × 480	VGG-16	29.45	Titan X	87	55.6
NDNet³⁵	2021	360 × 480	No	0.50	Titan X	78	57.2
DFANet⁴⁴	2019	360 × 480	Xception	7.80	Titan X	120	64.7
DABNet⁴²	2019	360 × 480	No	0.76	1080 Ti	136	66.4
FDDWNet⁵³	2020	360 × 480	No	0.80	2080 Ti	79	66.9
ICNet²⁷	2018	720 × 960	PSPNet-50	26.50	Titan X	28	67.1
FBSNet⁵⁴	2023	360 × 480	No	0.62	2080 Ti	120	68.9
MSCFNet⁴⁶	2022	360 × 480	No	1.15	Titan XP	110	69.3
LETNet²⁴	2023	360 × 480	No	0.95	3090 Ti	21	70.5
HAFormer³²	2024	360 × 480	No	0.60	2080 Ti	118	71.1
MGSeg⁴⁹	2021	736 × 736	ResNet-18	13.3	Titan XP	127	72.7
ECMNet (ours)	—	360 × 480	No	0.87	3090	55	73.6

The gray box denotes the best value of the current metric.

Table 4.

Performance comparison of our proposed ECMNet and the state-of-the-art lightweight methods about per-class results on the CamVid dataset.

Method	Roa	Sid	Bui	Wal	Fen	Pol	TLi	TSi	Veg	Ter	Sky	Ped	Rid	Car	Tru	Bus	Tra	Mot	Bic	mIoU (%)
Enet³⁹	96.3	74.2	75.0	32.2	33.2	43.4	34.1	44.0	88.6	61.4	90.6	65.5	38.4	90.6	36.9	50.5	48.1	38.8	55.4	58.3
ESPNet³⁰	97.0	77.5	76.2	35.0	36.1	45.0	35.6	46.3	90.8	63.2	92.6	67.0	40.9	92.3	38.1	52.5	50.1	41.8	57.2	60.3
CGNet³⁶	95.5	78.7	88.1	40.0	43.0	54.1	59.8	63.9	89.6	67.6	92.9	74.9	54.9	90.2	44.1	59.5	25.2	47.3	60.2	64.8
ESPNet-v2³¹	97.3	78.6	88.8	43.5	42.1	49.3	52.6	60.0	90.5	66.8	93.3	72.9	53.1	91.8	53.0	65.9	53.2	44.2	59.9	66.2
ERFNet⁴¹	97.7	81.0	89.8	42.5	48.0	56.3	59.8	65.3	91.4	68.2	94.2	76.8	57.1	92.8	50.8	60.1	51.8	47.3	61.7	68.0
DABNet⁴²	96.8	78.5	90.9	45.4	50.2	59.1	65.2	70.8	92.5	68.2	94.6	80.5	58.5	92.7	52.7	67.2	50.9	50.4	65.7	70.0
CFPNet²³	97.8	81.4	90.5	46.4	50.6	56.4	61.5	67.7	92.1	68.9	94.3	80.4	60.7	93.9	51.4	68.0	50.8	51.2	67.7	70.1
LEDNet²²	98.1	79.5	91.6	47.7	49.9	62.8	61.3	72.8	92.6	61.2	94.9	76.2	53.7	90.9	64.4	64.0	52.7	44.4	71.6	70.6
ECMNet (ours)	97.1	80.8	90.9	44.2	53.4	60.8	61.9	72.4	91.7	60.4	93.9	75.2	52.7	92.0	65.3	76.5	66.5	37.8	69.1	70.6

The bold fronts indicate the best experiment results. The gray box denotes the best mIoU of the current class. Roa, Sid, Bui, Wal, Fen, Pol, TLi, TSi, Veg, Ter, Sky, Ped, Rid, Car, Tru, Mot, and Bic represent Road, Sidewalk, Building, Wall, Fence, Pole, Trafﬁc Light, Trafﬁc Sign, Vegetation, Terrain, Sky, Pedestrian, Rider, Car, Truck, Motorcycle, and Bicycle, respectively.

Conclusion

In this study, we proposed a lightweight semantic segmentation network that combines Mamba and CNNs. We fused the local feature extraction capability of CNNs with the long-range dependencies of Mamba to model. Speciﬁcally, we introduced an FFM as a capsule-based framework in the middle of the model, which can better capture global feature information. Additionally, an EDAB module designed in the CNN learned more local feature information while ensuring simplicity and lightweight. Meanwhile, in order to compensate for the local feature information lost by CNNs, multi-scale long connections are utilized in the model. Moreover, we design an MSAU for cross-layer connections, effectively boosting discriminative features and attenuating noise. Extensive experimental results demonstrate that our proposed model achieves an excellent balance between model scale and performance.

Footnotes

Acknowledgements

The authors would like to acknowledge the assistance of the Engineering Technology Research Center of Optoelectronic Technology Appliance, AnHui Province.

ORCID iDs

Feixiang Du

Shengkun Wu

Xiang Wang

Aoxue Ding

Zhongliang Wang

Joel CM Than

Author contributions

Feixiang Du: conceptualization, writing–original draft, and writing–review and editing. Shengkun Wu: writing–original draft, validation, and visualization. Xiang Wang: validation and writing–original draft. Aoxue Ding: data curation and writing–original draft. Zhongliang Wang: investigation, writing–review and editing, and supervision. Joel CM Than: writing–review and editing and investigation.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was partially supported by the Natural Science Research Project of Anhui Educational Committee (2024AH051852, 2023AH040232) and College Students Innovative Entrepreneurial Training Plan Program (S202410383017). Furthermore, this publication has also been partially supported by the Teacher Secondment Training Project of Tongling University (2025GZDLSJ17).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The data that support the ﬁndings of this study are available from the corresponding author upon reasonable request.

References

Sanchez

Deschaud

Goulette

. Cola: coarse-label multi-source lidar semantic segmentation for autonomous driving. IEEE Trans Robot 2025; 41: 1742–1754.

Jing

Zhang

, et al. Hypergraph BiFormer for semantic segmentation of high-resolution remote sensing images. IEEE Trans Geosci Remote Sens 2025; 63: 1–15.

Yue

Wang

Pan

, et al. Less is more: a lightweight deep learning network for remote sensing imagery segmentation. IEEE Trans Geosci Remote Sens 2025; 63: 1–13.

Deng

Liang

Qin

, et al. DMA-Net: dynamic morphology-aware segmentation network for remote sensing images. Remote Sens (Basel) 2025; 17: 2354.

Hanyu

Yamazaki

Tran

, et al. Aerialformer: multi-resolution transformer for aerial image segmentation. Remote Sens (Basel) 2024; 16: 2930.

Luo

Yang

Yuan

, et al. Semantic segmentation of agricultural images: a survey. Inf Process Agric 2024; 11: 172–186.

Deng

Qin

. MINTFormer: multi-scale information aggregation with CSWin vision transformer for medical image segmentation. Appl Sci 2025; 15: 8626.

Peng

Zhang

, et al. Large kernel matters–improve semantic segmentation by global convolutional network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4353–4361.

Chen

Papandreou

Kokkinos

, et al. Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans Pattern Anal Mach Intell 2017; 40: 834–848.

10.

Zhao

Shi

, et al. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2881–2890.

11.

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. Adv Neural Inf Process Syst 2017; 30.

12.

Dosovitskiy

Beyer

Kolesnikov

, et al. An image is worth 16 × 16 words: transformers for image recognition at scale. arXiv preprint arXiv:201011929. 2020.

13.

Zheng

Zhao

, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6881–6890.

14.

Xie

Wang

, et al. Segformer: simple and efﬁcient design for semantic segmentation with transformers. Adv Neural Inf Process Syst 2021; 34: 12077–12090.

15.

Ren

. HResFormer: hybrid residual transformer for volumetric medical image segmentation. IEEE Trans Neural Netw Learn Syst 2025; 36: 10558–10566.

16.

Gao

Zhang

Wei

, et al. PFormer: an efﬁcient CNN-transformer hybrid network with content-driven P-attention for 3D medical image segmentation. Biomed Signal Process Control 2025; 101: 107154.

17.

Garbaz

Oukdach

Charﬁ

, et al. DMFC-UFormer: depthwise multi-scale factorized convolution transformer-based UNet for medical image segmentation. Biomed Signal Process Control 2025; 101: 107200.

18.

Dao

. Mamba: linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:231200752. 2023.

19.

Liu

Tian

Zhao

, et al. Vmamba: visual state space model. Adv Neural Inf Process Syst 2024; 37: 103031–103063.

20.

Xing

Yang

, et al. Segmamba: long-range sequential modeling mamba for 3d medical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2024, pp. 578–588. Springer.

21.

Liu

Dan

, et al. CM-UNet: hybrid CNN-Mamba UNet for remote sensing image semantic segmentation. arXiv preprint arXiv:240510530. 2024.

22.

Wang

Zhou

Liu

, et al. Lednet: a lightweight encoder-decoder network for real-time semantic segmentation. In: 2019 IEEE International Conference on Image Processing (ICIP), 2019, pp. 1860–1864. IEEE.

23.

Ding

Jiang

, et al. CFPNet: improving lightweight ToF depth completion via cross-zone feature propagation. arXiv preprint arXiv:241104480. 2024.

24.

Gao

, et al. Lightweight real-time semantic segmentation network with efﬁcient transformer and CNN. IEEE Trans Intell Transp Syst 2023; 24: 15897–15906.

25.

Wang

Zhu

, et al. ECA-Net: efﬁcient channel attention for deep convolutional neural networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11534–11542.

26.

Liu

Tian

, et al. Dual attention network for scene segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3146–3154.

27.

Zhao

Shen

, et al. Icnet for real-time semantic segmentation on high-resolution images. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 405–420.

28.

Zhao

Tian

. Real-time semantic segmentation network based on improved BiSeNet V1. In: Proceedings of the 2022 11th International Conference on Networks, Communication and Computing, 2022, pp. 38–44.

29.

Gao

Wang

, et al. Bisenet v2: bilateral network with guided aggregation for real-time semantic segmentation. Int J Comput Vision 2021; 129: 3051–3068.

30.

Mehta

Rastegari

Caspi

, et al. Espnet: efﬁcient spatial pyramid of dilated convolutions for semantic segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 552–568.

31.

Lin

Hao

Liu

, et al. Lightweight deep learning methods for panoramic dental X-ray image segmentation. Neural Comput Appl 2023; 35: 8295–8306.

32.

Jia

, et al. HAFormer: unleashing the power of hierarchy-aware features for lightweight semantic segmentation. IEEE Trans Image Process 2024; 33: 4202–4214.

33.

Zhang

Pun

. Rs 3 mamba: visual state space model for remote sensing image semantic segmentation. IEEE Geosci Remote Sens Lett 2024; 21: 1–5.

34.

Wan

Zhang

Wang

, et al. Sigma: Siamese mamba network for multi-modal semantic segmentation. In: 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 1734–1744. IEEE.

35.

Yang

, et al. NDNet: narrow while deep network for real-time semantic segmentation. IEEE Trans Intell Transp Syst 2020; 22: 5508–5519.

36.

Tang

Zhang

, et al. Cgnet: a light-weight context guided network for semantic segmentation. IEEE Trans Image Process 2020; 30: 1169–1179.

37.

Dong

Yan

Shen

, et al. Real-time high-performance semantic image segmentation of urban street scenes. IEEE Trans Intell Transp Syst 2020; 22: 3258–3274.

38.

Badrinarayanan

Kendall

Cipolla

. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell 2017; 39: 2481–2495.

39.

Paszke

Chaurasia

Kim

, et al. Enet: a deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:160602147. 2016.

40.

Wang

Xiong

Wang

, et al. ADSCNet: asymmetric depthwise separable convolution for semantic segmentation in real-time. Appl Intell 2020; 50: 1045–1056.

41.

Romera

Alvarez

Bergasa

, et al. Erfnet: efﬁcient residual factorized convnet for real-time semantic segmentation. IEEE Trans Intell Transp Syst 2017; 19: 263–272.

42.

Yun

Kim

, et al. Dabnet: depth-wise asymmetric bottleneck for real-time semantic segmentation. arXiv preprint arXiv:190711357. 2019.

43.

Liu

Yin

. Feature pyramid encoding network for real-time semantic segmentation. arXiv preprint arXiv:190908599. 2019.

44.

Xiong

Fan

, et al. Dfanet: deep feature aggregation for real-time semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 9522–9531.

45.

Fan

Lai

Huang

, et al. Rethinking bisenet for real-time semantic segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9716–9725.

46.

Gao

, et al. MSCFNet: a lightweight network with multi-scale context fusion for real-time semantic segmentation. IEEE Trans Intell Transp Syst 2021; 23: 25489–25499.

47.

Jiang

Huang

, et al. FPANet: feature pyramid aggregation network for real-time semantic segmentation. Appl Intell 2022; 52: 3319–3336.

48.

Fan

Wang

Chu

, et al. MLFNet: multi-level fusion network for real-time semantic segmentation of autonomous driving. IEEE Trans Intell Vehicles 2022; 8: 756–767.

49.

Liang

, et al. Mgseg: multiple granularity-based real-time semantic segmentation network. IEEE Trans Image Process 2021; 30: 7200–7214.

50.

Sun

Chen

, et al. Parallel complement network for real-time semantic segmentation of road scenes. IEEE Trans Intell Transp Syst 2021; 23: 4432–4444.

51.

, et al. Sctnet: single-branch CNN with transformer semantic information for real-time segmentation. In: Proceedings of the AAAI Conference on Artiﬁcial Intelligence, vol. 38, 2024, pp. 6378–6386.

52.

Zhang

. Hierarchical semantic broadcasting network for real-time semantic segmentation. IEEE Signal Process Lett 2021; 29: 309–313.

53.

Liu

Zhou

Qiang

, et al. FDDWNet: a lightweight convolutional neural network for real-time semantic segmentation. In: ICASSP 2020–2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 2373–2377. IEEE.

54.

Gao

, et al. FBSNet: a fast bilateral symmetrical network for real-time semantic segmentation. IEEE Trans Multimed 2022; 25: 3273–3283.

ECMNet: Lightweight semantic segmentation with efficient CNN-Mamba network

Abstract

Keywords

Introduction

Related work

Semantic segmentation methods based on CNN and transformer

Semantic segmentation methods based on Mamba

Proposed method

Overall network architecture

Enhanced dual-attention block

Multi-scale attention unit

Feature fusion module

Experiments

Datasets

Implementation details

Ablation studies

Comparisons with state-of-the-art (SOTA) methods

Evaluation Results on Cityscapes Dataset.

Evaluation results on CamVid dataset

Conclusion

Footnotes

Acknowledgements

ORCID iDs

Author contributions

Funding

Declaration of conflicting interests

Data availability statement

References