Abstract

The nnU-Net framework effectively automates hyperparameter selection; however, its fixed internal configurations—notably convolution kernel sizes—restrict its flexibility. This limitation is pronounced in 3D medical imaging, where anatomical structures undergo continuous spatial evolution along the Z-axis. In this study, we introduce a self-adaptive convolution module designed to dynamically tune the effective receptive field, matching the dynamic structural transformations of organs. By employing a differentiable soft-attention mechanism to aggregate candidate kernels, the network adaptively optimizes its scale sensitivity. This integration allows MSA²-Net to capture both global context and local nuances within feature maps. The module is strategically embedded into two core components: the multi-scale convolution bridge and the multi-scale amalgamation decoder. In the Bridge, it refines CSWin Transformer outputs by aligning features with the inherent spatial continuity of volumetric data, thereby mitigating redundancies that might otherwise hinder decoding. Simultaneously, the multi-scale amalgamation decoder leverages this module to precisely reconstruct organ details as their size and shape fluctuate across slices. This mechanism ensures the decoder preserves seamless topological intricacies within the feature maps, yielding superior segmentation accuracy. Leveraging this architecture, MSA²-Net achieves competitive Dice scores of 86.49%, 92.56%, 93.37%, and 92.98% on the Synapse, ACDC, Kvasir, and ISIC2017 datasets, respectively. Extensive experiments validate the model's robustness in handling complex spatial variations across diverse medical modalities.

Keywords

Medical image segmentation, multi-scale information, CSWin transformer, convolution interrelationship theory

Introduction

Medical image segmentation plays a pivotal role in clinical diagnostics, empowering clinicians to rapidly identify and localize pathologies.^1–3 This task entails the precise extraction and delineation of anatomical structures or lesions from diverse imaging modalities, such as computed tomography (CT), magnetic resonance imaging (MRI), and X-rays. At its core, segmentation aims to assign semantic labels to each pixel or voxel, effectively categorizing distinct organs, tissues, and pathological features to provide a granular understanding of the underlying anatomy.

Since their emergence,⁴ convolutional neural networks (CNNs) have become the dominant framework for medical image segmentation. By leveraging local receptive fields and parameter sharing, CNNs efficiently extract essential features such as edges, textures, and organ structures, making them well suited to medical imaging scenarios characterized by strong spatial correlations. Among these architectures, U-Net stands as a quintessential model and remains the most widely adopted solution in the field. Its design comprises an encoder for hierarchical feature extraction and downsampling, a decoder for restoring spatial resolution via upsampling, and skip connections that bridge the two. These connections are critical for recovering the fine-grained spatial details often lost during the encoding process. Through this symmetric design, U-Net has significantly enhanced segmentation precision and generalization across various clinical applications.

Despite their success, CNNs are inherently restricted by their focus on local modeling.^5–7 The fixed dimensions of convolutional kernels often impede the effective capture of multi-scale features, a challenge that is especially pronounced in complex multi-organ segmentation. This localized perspective hampers the integration of global context, ultimately compromising segmentation precision. To address this, various strategies have been advanced: MISSFormer⁸ introduces an enhanced transformer context bridge to capture hierarchical representations; dilated convolution⁹ expands the spatial coverage by increasing dilation rates without escalating computational costs; spatial transformer networks (STN)¹⁰ employ learnable geometric modules for adaptive spatial alignment; and adaptive spatial feature fusion (ASFF)¹¹ utilizes spatial filtering to harmonize features across different resolutions. While these methods bolster performance to a degree, a fundamental bottleneck remains: kernel size is typically a predetermined hyperparameter. The inability to dynamically recalibrate the receptive field according to the specific characteristics of the input data constrains the model's broader generalization.

nnU-Net¹² utilizes a heuristic approach based on “data fingerprints” and “pipeline fingerprints” to encapsulate essential dataset and network attributes, enabling autonomous hyperparameter tuning. However, these adjustments primarily target preprocessing configurations—such as voxel size, spacing, and batch size—while largely overlooking architectural hyperparameters, specifically convolution kernel sizes and padding. To bridge this gap, we present the multi-scale adaptive attention network (MSA²-Net). This architecture employs a novel differentiable dynamic kernel aggregation mechanism, allowing the network to adaptively synthesize kernels by analyzing the statistical characteristics of input features. The primary contributions of this work are summarized as follows.

We propose the self-adaptive convolution module (SACM), which dynamically senses the spatial scale of organs in each slice. By generating input-specific statistical fingerprints, the module automatically adjusts its receptive field. This mechanism allows the model to synchronize with the continuous spatial evolution of organs along the z-axis, supporting effective feature extraction whether the organ appears as a small tip or a full cross-section.

We propose a multi-scale convolution bridge (MSConvBridge) built on self-adaptive convolution modules. It applies a combination of multi-scale adaptive convolutions to the feature maps output by CSWin. Through this multi-layer approach, the MSConvBridge removes redundant striated textures between image features while preserving the information within the feature maps, thereby enhancing overall semantic consistency.

We propose a multi-scale amalgamation decoder (MSADecoder) that employs recursive nested adaptive convolution groups, enabling the decoder to replenish information about small organs while upsampling and, at the same time, to preserve information about large organs.

Related work

nnU-Net

Medical image segmentation networks expose many hyperparameters, and researchers often tune them through repeated experiments during network design. This tuning typically relies on the individual researcher's experience, which makes the process inefficient, and excessive manual adjustment of the network structure can lead to overfitting to specific datasets. Moreover, non-structural aspects of the pipeline may influence segmentation performance more than the architecture itself. nnU-Net was therefore developed to focus not on modifying the network architecture but on adjusting the hyperparameters of dataset preprocessing, training, and post-processing. Although this approach has improved training efficiency, it overlooks hyperparameters within the network structure itself. The adaptive convolution module proposed in this article builds on the nnU-Net framework and adapts the size of the convolutional receptive field to the characteristics of the dataset.

U-shaped architectures for biomedical segmentation

In the realm of biomedical image analysis, the U-Net architecture has established itself as a fundamental benchmark due to its ability to function effectively with limited annotated samples—a common constraint in clinical datasets. Its symmetric topology, featuring a contracting path for context capture and an expanding path for precise localization, perfectly aligns with the requirement to delineate anatomical boundaries. While the encoder hierarchy captures deep semantic representations of tissues and organs, the skip connections are pivotal in recovering the high-frequency spatial information lost during downsampling, ensuring that fine-grained structures like tumor margins are preserved in the final segmentation map.

Building upon this foundation, recent variants have sought to address U-Net's limitations in global context modeling. For instance, UNet++ introduced nested and dense skip connections to bridge the semantic gap between encoder and decoder sub-networks. Attention U-Net integrated gating mechanisms to suppress irrelevant regions in input images while highlighting salient features useful for specific tasks. More recently, Transformer-based U-Net variants like Swin-UNet have attempted to leverage self-attention for long-range dependency modeling. However, these methods often replace the inductive bias of convolutions entirely or rely on static convolutional patches, missing the opportunity to dynamically adjust the feature extraction scale based on the specific anatomical characteristics of the input scan.

Method

Self-adaptive convolution module

In volumetric medical datasets like Synapse, anatomical structures exhibit continuous spatial evolution along the z-axis. Consequently, the cross-sectional size of a single organ fluctuates significantly across different slices—appearing as a small, localized region in some frames while occupying the entire field of view in others. This variability exposes the fundamental limitation of standard convolutional operations: their fixed receptive fields.

As illustrated in Figure 1(b), a fixed kernel size faces a dilemma. Oversized kernels are prone to incorporating irrelevant background noise when processing slices where organ cross-sections are small or sparse, leading to the ‘background contamination’ shown in Figure 1(c). Conversely, undersized kernels fail to capture global contextual information when organs expand to larger scales. Given that multiple visceral organs often interweave with complex spatial dependencies, static kernels cannot optimally adapt to these dynamic changes.

Figure 1.

Impact of different kernel sizes on feature maps.

To resolve this, we propose the self-adaptive convolution module, which dynamically synthesizes kernels conditioned on the input feature statistics. Instead of a hard selection strategy, we formulate this as a soft attention mechanism, allowing the network to continuously adapt its receptive field to match the evolving organ dimensions slice-by-slice in an end-to-end manner.

Let $X \in \mathbb{R}^{H \times W \times C}$ be the input feature map. We define a bank of $N$ candidate kernels $κ = \{K_{1}, K_{2}, \dots, K_{N}\}$, where each kernel has a different dilation rate or size. First, we extract a global statistic vector (fingerprint) $S$ from the input $X$ to capture the overall scale and intensity distribution of the anatomical structures. We utilize both global average pooling (GAP) and global standard deviation (GSD) to capture the first- and second-order statistics:

$$ S = \mathrm{Concat}(\mathrm{GAP}(X), \mathrm{GSD}(X)) \in \mathbb{R}^{2C} \tag{1} $$
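As a rough illustration of Equation (1), the fingerprint can be sketched in plain Python. This is a minimal sketch, not the authors' implementation: the function name `fingerprint`, the channel-first nested-list layout `[C][H][W]`, and the use of the population standard deviation are all assumptions made for the example.

```python
import math

def fingerprint(x):
    """Return [GAP per channel] + [GSD per channel] for x shaped [C][H][W]."""
    means, stds = [], []
    for channel in x:
        values = [v for row in channel for v in row]
        mean = sum(values) / len(values)
        # Population variance; whether the paper uses the population or
        # sample form is not stated, so this is an assumption.
        var = sum((v - mean) ** 2 for v in values) / len(values)
        means.append(mean)
        stds.append(math.sqrt(var))
    return means + stds  # length 2C, matching S in Equation (1)

# Toy input with C = 2 channels of size 2 x 2.
x = [
    [[1.0, 1.0], [1.0, 1.0]],  # constant channel: zero spread
    [[0.0, 2.0], [2.0, 4.0]],  # varying channel
]
s = fingerprint(x)
```

The first half of `s` holds the per-channel means and the second half the per-channel standard deviations, so a constant channel contributes a zero in the second half.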

Then, we employ a lightweight gating network (a two-layer MLP) to map the fingerprint $S$ to a set of attention weights $α = [α_{1}, \dots, α_{N}]$. To ensure differentiability and a probabilistic interpretation, we apply a Softmax function:

$$ \begin{aligned} Z &= W_{2}\, δ(W_{1} S) \\ α_{n} &= \frac{e^{Z_{n}}}{\sum_{j = 1}^{N} e^{Z_{j}}} \end{aligned} \tag{2} $$
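The gating step in Equation (2) can be sketched as follows. This is an illustrative sketch under stated assumptions: ReLU stands in for the unspecified activation $δ$, the weights are hand-picked rather than learned, and the helper names `linear`, `softmax`, and `gate` are hypothetical.

```python
import math

def linear(weights, vec):
    """Matrix-vector product; weights is a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def gate(s, w1, w2):
    """Map a fingerprint s to N attention weights (two-layer MLP + softmax)."""
    hidden = [max(0.0, h) for h in linear(w1, s)]  # ReLU assumed for delta
    return softmax(linear(w2, hidden))

s = [1.0, 2.0, 0.0, 1.5]                           # fingerprint, 2C = 4
w1 = [[0.5, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]  # hidden width 2
w2 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]          # N = 3 candidate kernels
alpha = gate(s, w1, w2)
```

Because the softmax output is positive and sums to one, the weights can be read as a soft preference over the candidate kernels, and gradients reach every candidate.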

Here, $δ$ is the activation function and $W_{1}, W_{2}$ are learnable weights. Instead of selecting a single kernel, we synthesize a dynamic kernel $\tilde{K}$ by aggregating the candidate kernels using the learned attention weights:

$$ \tilde{K} = \sum_{n = 1}^{N} α_{n} \cdot K_{n} \tag{3} $$
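The weighted aggregation of candidate kernels can be illustrated with a small sketch. The paper does not say how kernels of different sizes are brought to a common shape, so the zero-padding to the largest odd size used here is an assumption, as are the helper names `pad_to` and `aggregate` and the toy kernels.

```python
def pad_to(kernel, size):
    """Zero-pad a square, odd-sized kernel to size x size, keeping it centered."""
    k = len(kernel)
    off = (size - k) // 2
    out = [[0.0] * size for _ in range(size)]
    for i in range(k):
        for j in range(k):
            out[i + off][j + off] = kernel[i][j]
    return out

def aggregate(kernels, alpha):
    """Weighted sum of candidate kernels after padding to a common size."""
    size = max(len(k) for k in kernels)
    padded = [pad_to(k, size) for k in kernels]
    return [
        [sum(a * p[i][j] for a, p in zip(alpha, padded)) for j in range(size)]
        for i in range(size)
    ]

k1 = [[1.0]]                            # 1 x 1 candidate (purely local)
k3 = [[1.0 / 9] * 3 for _ in range(3)]  # 3 x 3 averaging candidate

k_small = aggregate([k1, k3], [0.9, 0.1])  # weights favour the small kernel
k_large = aggregate([k1, k3], [0.1, 0.9])  # weights favour the large kernel
```

When the weights favour the 1 x 1 candidate, most of the mass sits at the center of the synthesized kernel; when they favour the 3 x 3 candidate, the mass spreads out, which is the sense in which the effective receptive field changes continuously.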

Finally, the input $X$ is processed by the dynamic kernel $\tilde{K}$:

$$ \dot{X} = X \cdot \tilde{K} \tag{4} $$
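To make the final step concrete, the sketch below applies a kernel to a single-channel feature map. It assumes that Equation (4) denotes a convolution with zero "same" padding, computed as cross-correlation in the way deep learning frameworks usually do; the paper does not spell this out, and `conv2d_same` is a hypothetical helper.

```python
def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding (output size = input size)."""
    h, w = len(x), len(x[0])
    k = len(kernel)
    r = k // 2  # kernel radius; the kernel is assumed square and odd-sized
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - r, j + dj - r
                    if 0 <= ii < h and 0 <= jj < w:  # outside pixels count as zero
                        acc += x[ii][jj] * kernel[di][dj]
            out[i][j] = acc
    return out

x = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]         # single bright pixel
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # local kernel
box = [[1.0 / 9] * 3 for _ in range(3)]                          # spread-out kernel

y_local = conv2d_same(x, identity)
y_wide = conv2d_same(x, box)
```

A center-heavy kernel leaves the isolated pixel untouched, while the spread-out kernel distributes its response over the whole 3 x 3 neighbourhood.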

The self-adaptive convolution module thus adjusts the convolutional receptive field according to the statistics of its input, so that the effective kernel is neither too large nor too small and covers most sub-features in the feature maps, which enables efficient multi-scale information extraction (Figure 2).

Figure 2.

Architecture of self-adaptive convolution module.

Overall structure

The architecture of MSA²-Net is illustrated in Figure 3. The left portion of Figure 3 shows the encoder component of MSA²-Net, the middle section represents the multi-scale convolution bridge (MSConvBridge), and the right portion displays the multi-scale amalgamation decoder (MSADecoder).

Figure 3.

Architectural overview of MSA²-Net.

For a given image $x \in \mathbb{R}^{H \times W \times 3}$, the encoder outputs multi-scale feature maps at different stages:

$$ F = \mathrm{Encoder}(x) \tag{5} $$

$F = [F_{1}, F_{2}, F_{3}, F_{4}]$ represents the feature maps from each encoding stage, where spatial resolution progressively decreases while channel depth gradually increases. These features F are then fed into the MSConvBridge for refinement:

$$ F^{\mathrm{refine}} = \mathrm{MSConvBridge}(F) \tag{6} $$

Finally, MSADecoder generates the segmentation map output by aggregating and upsampling these refined features:

$$ y = \mathrm{MSADecoder}(F^{\mathrm{refine}}), \quad y \in \mathbb{R}^{H \times W \times C} \tag{7} $$

where $C$ denotes the number of categories.
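The data flow of Equations (5) to (7) can be traced at the level of tensor shapes. The sketch below is only a shape walk-through under loudly stated assumptions: the stage strides (4, 8, 16, 32), the base width of 64 channels with doubling per stage, the shape-preserving bridge, and the 9 output classes are illustrative values that the paper does not confirm.

```python
def encoder_shapes(height, width, base_channels=64, strides=(4, 8, 16, 32)):
    """Return assumed (H_i, W_i, C_i) shapes for the stage outputs F_1..F_4."""
    return [
        (height // s, width // s, base_channels * 2 ** i)
        for i, s in enumerate(strides)
    ]

def bridge_shapes(stage_shapes):
    """The bridge refines skip features; shapes are assumed to be unchanged."""
    return list(stage_shapes)

def decoder_shape(height, width, num_classes):
    """The decoder returns one score map per category at input resolution."""
    return (height, width, num_classes)

stages = encoder_shapes(256, 256)                # F, Equation (5)
refined = bridge_shapes(stages)                  # F^refine, Equation (6)
output = decoder_shape(256, 256, num_classes=9)  # y, Equation (7)
```

The point of the walk-through is that spatial resolution shrinks and channel depth grows through the encoder, while the decoder output returns to the input resolution with one channel per category.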

Encoder

To better adapt CSWin for medical image segmentation tasks, we modified the original encoder design of CSWin to preserve detailed information in feature maps while performing downsampling for feature extraction. MSA²-Net integrates the CSWin encoder with ResNet, where the CSWin encoder serves as the primary encoder and the ResNet equipped with adaptive convolution modules acts as the auxiliary encoder. The internal workflow at each encoding stage can be expressed as follows:

$$ \begin{aligned} X_{i + 1}^{(1)} &= W_{i}^{(1)} \cdot X_{i}^{(0)} + b_{i}^{(1)} \\ {\hat{X}}_{i + 1}^{(1)} &= {\hat{W}}_{i}^{(1)} \cdot X_{i}^{(0)} + {\hat{b}}_{i}^{(1)} \\ X_{i}^{(2)} &= H_{i} \cdot \left( σ (W_{i}^{(2)} \cdot X_{i}^{(1)}) + b_{i}^{(2)} \right) + {\hat{X}}_{i + 1}^{(1)} \\ F_{i} &= F_{i} ⊙ X_{i}^{(2)} + X_{i - 1}^{(2)} \end{aligned} \tag{8} $$

Here, $X_{i}^{(0)}$ denotes the input feature map to the $i$-th stage encoder. $W_{i}^{(1)}, W_{i}^{(2)}, b_{i}^{(1)}, b_{i}^{(2)}$ represent the learnable weights and biases of the primary encoder, while ${\hat{W}}_{i}^{(1)}, {\hat{b}}_{i}^{(1)}$ are the learnable weights and biases of the auxiliary encoder. $σ$ signifies the activation function, $H_{i}$ denotes the cross-window self-attention function, $F_{i}$ represents the intermediate transformation function, and $⊙$ indicates element-wise multiplication.

MSConvBridge

Although the encoder can effectively capture multi-scale information, directly feeding the original feature maps to the decoder may introduce redundant noise, thereby compromising segmentation accuracy. Convolutional operations function analogously to filters, capable of suppressing redundant information in feature maps while preserving critical features when detecting sub-structures. To enhance semantic consistency between the encoder and decoder, MSA²-Net incorporates a multi-scale convolutional bridge (MSConvBridge), which selectively refines and optimizes features within the skip connections. The operational mechanism of the MSConvBridge is detailed below.

For a given feature map $F_{i} \in \mathbb{R}^{H \times W \times C}$ output from the $i$-th stage encoder, it first undergoes processing via a self-adaptive convolution operation:

$$ F_{i}^{\mathrm{refine}} = H_{DC} \cdot H_{SE} (\mathrm{SelfAdapt}(F_{i})) \tag{9} $$

Here, $H_{SE}$ denotes the squeeze-and-excitation function, $H_{DC}$ indicates the dense connectivity function, and $\mathrm{SelfAdapt}$ represents the self-adaptive convolution module within the MSConvBridge (Figure 4).

Figure 4.

MSConvBridge architecture diagram.

MSADecoder

To restore spatial resolution and accurately reconstruct fine structures such as organ boundaries and small organs, MSA²-Net employs a lightweight yet highly expressive decoding module, the MSADecoder. As shown in Figure 5, the feature map $F_{i}^{\mathrm{refine}} \in \mathbb{R}^{H \times W \times C}$ is split into $G$ parts along the channel dimension, yielding $\{F_{i}^{(g)}\}_{g = 1}^{G}$, where $F_{i}^{(g)} \in \mathbb{R}^{H \times W \times \frac{C}{G}}$ and $G$ is set to 4. In the MSADecoder, the groups $\{F_{i}^{(g)}\}_{g = 1}^{G}$ first undergo parallel processing through $G$ adaptive convolution operations:

$$ F_{i}^{(g, \mathrm{Conv})} = \mathrm{SelfAdapt}(F_{i}^{(g)}), \quad g \in \{1, 2, \dots, G\} \tag{10} $$

Figure 5.

MSADecoder architecture diagram.
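The channel split that precedes Equation (10) can be sketched as below. This is a minimal sketch assuming a channel-first layout in which the feature map is a list of channels and contiguous channels form a group; `split_channels` is a hypothetical helper, and the paper does not state how channels are assigned to groups.

```python
def split_channels(x, groups=4):
    """Split a channel-first feature map into `groups` equal, contiguous parts."""
    channels = len(x)
    if channels % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    size = channels // groups
    return [x[g * size:(g + 1) * size] for g in range(groups)]

# Toy feature map with C = 8 channels of size 1 x 1; the value marks the channel index.
x = [[[float(c)]] for c in range(8)]
parts = split_channels(x, groups=4)

# Concatenating the groups in order restores the original channel order.
merged = [channel for part in parts for channel in part]
```

Each group holds C/G channels and would be passed to its own adaptive convolution before the groups are combined again.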

$F_{i}^{(g, \mathrm{Conv})}$ is then enhanced through a squeeze-and-excitation network to strengthen the internal feature representations:

$$ y_{i} = H_{E} \cdot \left( w_{i}^{ξ} \cdot \left( H_{S} \cdot [F_{i}^{(g, \mathrm{Conv})}]_{g = 1}^{G} \right) + b_{i}^{ξ} \right) \tag{11} $$

Here, $H_{S}$ and $H_{E}$ denote the squeeze and excitation transformations of the squeeze-and-excitation network in the MSADecoder, while $w_{i}^{ξ}$ and $b_{i}^{ξ}$ represent the learnable weights and biases in the MSADecoder. The multi-scale fusion decoding strategy enables the MSADecoder to progressively restore spatial resolution while preserving fine details of small targets and boundary integrity, thereby improving final segmentation accuracy.
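For readers unfamiliar with squeeze-and-excitation, the sketch below shows the standard channel-reweighting pattern (global average pooling, a reduce layer with ReLU, an expand layer with a sigmoid, then per-channel scaling). It illustrates the general technique rather than the exact form of Equation (11): bias terms are omitted, the weights are hand-picked, and `squeeze_excite` is a hypothetical name.

```python
import math

def squeeze_excite(x, w_reduce, w_expand):
    """Standard squeeze-and-excitation on a channel-first map x shaped [C][H][W]."""
    # Squeeze: global average pooling gives one scalar per channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]
    # Excitation: reduce -> ReLU -> expand -> sigmoid, one scale per channel.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w_reduce]
    scales = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
        for row in w_expand
    ]
    # Reweight every channel by its scale.
    rescaled = [[[v * s for v in row] for row in ch] for ch, s in zip(x, scales)]
    return rescaled, scales

x = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]  # C = 2
w_reduce = [[1.0, 0.0]]     # 2 channels -> 1 hidden unit
w_expand = [[0.0], [10.0]]  # 1 hidden unit -> 2 channel logits
y, scales = squeeze_excite(x, w_reduce, w_expand)
```

With these toy weights the first channel is halved and the second is passed through almost unchanged, which shows how the block can suppress or emphasize channels.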

Experiments

Experimental environment

MSA²-Net is implemented using PyTorch and trained on an NVIDIA A100 GPU platform. The CSWin backbone network incorporates pre-trained weights. Input images for the CSWin encoder are resized to $256 \times 256$ pixels. The initial learning rate is set to 0.0001 with a maximum of 300 epochs and a batch size of 14. The AdamW optimizer's weight decay is configured at 0.0001.
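The reported settings can be collected in a small configuration object. This is a sketch for bookkeeping only: the class `TrainConfig`, its field names, and the helper `steps_per_epoch` are hypothetical, and the step count assumes that the final partial batch is kept.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    image_size: int = 256             # encoder input is resized to 256 x 256
    learning_rate: float = 1e-4       # initial learning rate
    weight_decay: float = 1e-4        # AdamW weight decay
    max_epochs: int = 300
    batch_size: int = 14
    optimizer: str = "AdamW"
    pretrained_backbone: bool = True  # CSWin backbone uses pre-trained weights

def steps_per_epoch(num_samples, batch_size):
    """Number of optimizer steps per epoch, keeping the last partial batch."""
    return -(-num_samples // batch_size)  # ceiling division

config = TrainConfig()
# Example with the 880 Kvasir-SEG training images listed in Table 1.
kvasir_steps = steps_per_epoch(880, config.batch_size)
```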

Dataset

In this section, we benchmark MSA²-Net against current state-of-the-art (SOTA) networks. Experiments are conducted on the Synapse multi-organ segmentation dataset (Synapse), the automated cardiac diagnosis challenge dataset (ACDC), Kvasir-SEG, and three skin lesion datasets (ISIC2017, ISIC2018, and PH²). Network performance is evaluated using the Dice coefficient and the 95th-percentile Hausdorff distance (HD95). Results for comparative networks are obtained from previously published studies. The detailed data partitions are presented in Table 1.
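The two evaluation metrics can be sketched in plain Python. The sketch rests on several assumptions: masks are flattened binary lists, HD95 is computed on explicit boundary point sets as the larger of the two directed 95th-percentile distances, and the percentile uses linear interpolation. Evaluation toolkits differ on these conventions, so the numbers below are not directly comparable with the tables in this paper.

```python
import math

def dice(pred, target):
    """Dice coefficient for two flattened binary masks (lists of 0/1)."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1.0 if total == 0 else 2.0 * intersection / total

def percentile(values, q):
    """q-th percentile with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def directed_distances(a, b):
    """Distance from each point in a to its nearest point in b."""
    return [min(math.dist(p, q) for q in b) for p in a]

def hd95(a, b):
    """Symmetric 95th-percentile Hausdorff distance between two point sets."""
    return max(
        percentile(directed_distances(a, b), 95),
        percentile(directed_distances(b, a), 95),
    )

dice_score = dice([1, 1, 0, 0], [1, 0, 0, 0])
distance = hd95([(0, 0), (0, 1)], [(0, 0), (0, 3)])
```

Dice rewards overlap, so higher is better, whereas HD95 measures how far boundaries disagree, so lower is better; taking the 95th percentile instead of the maximum makes the distance less sensitive to a few outlier points.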

Table 1.

Detailed dataset partitions.

Dataset	Modality	Samples (Cases/Images)	Resolution	Train / val / test split (Samples)
Synapse	CT	30 Cases (3779 slices)	256 × 256	18 / 0 / 12
ACDC	MRI	100 Patients (975 slices)		70 / 10 / 20
Kvasir-SEG	Endoscopy	1000 Images		880 / 0 / 120
ISIC2017	Graphics	2000 Images		1400 / 200 / 400
ISIC2018		2594 Images		1815 / 259 / 520
PH²		200 Images		80 / 20 / 100

Experimental results and analysis

In this section, we compare the performance of MSA²-Net with other state-of-the-art (SOTA) methods on the Synapse, ACDC, Kvasir-SEG, and skin lesion segmentation datasets.

Performance of different networks on synapse

Table 2 compares MSA²-Net with prior networks on the Synapse dataset. MSA²-Net achieved the highest average Dice coefficient, 86.49%, which is 4.53, 7.36, and 0.75 percentage points above MISSFormer (81.96%), Swin-UNet (79.13%), and AgileFormer (85.74%), respectively. In terms of the HD95 metric, MSA²-Net recorded the second-lowest distance (14.15) after Cascaded MERIT (13.22), outperforming MISSFormer (18.20), Swin-UNet (21.55), and TransCASCADE (17.34).

Table 2.

Performance of different networks on Synapse.

Architectures	Average Dice↑	Average HD95↓	Aorta	GB	KL	KR	Liver	PC	SP	SM
UNet¹³(2015)	70.11	44.69	84.00	56.70	72.41	62.64	86.98	48.73	81.48	67.96
AttnUNet¹⁴(2018)	71.70	34.47	82.61	61.94	76.07	70.42	87.54	46.70	80.67	67.66
R50 + UNet⁶(2021)	74.68	36.87	84.18	62.84	79.19	71.29	93.35	48.23	84.41	73.92
R50 + AttnUNet⁶(2021)	75.57	36.97	55.92	63.91	79.20	72.71	93.56	49.37	87.19	74.95
TransUNet⁶(2021)	77.48	31.69	87.23	63.13	81.87	77.02	94.08	55.86	85.08	75.62
SSFormerPVT¹⁵(2022)	78.01	25.72	82.78	63.74	80.72	78.11	93.53	61.53	87.07	76.61
PolypPVT¹⁶(2021)	78.08	25.61	82.34	66.14	81.21	73.78	94.37	59.34	88.05	79.40
MT-UNet¹⁷(2022a)	78.59	26.59	87.92	64.99	81.47	77.29	93.06	59.46	87.75	76.81
Swin-Unet¹⁸(2021)	79.13	21.55	85.47	66.53	83.28	79.61	94.29	56.58	90.66	76.6
PVT-CASCADE¹⁹(2023)	81.06	20.23	83.01	70.59	82.23	80.37	94.08	64.43	90.10	83.69
MISSFormer⁸(2021)	81.96	18.20	86.99	68.65	85.21	82.00	94.41	65.67	91.92	80.81
CASTformer²⁰(2022)	82.55	22.73	89.05	67.48	86.05	82.17	95.61	67.49	91.00	81.55
TransCASCADE¹⁹(2023)	82.68	17.34	86.63	68.48	87.66	84.56	94.43	65.33	90.79	83.52
Cascaded MERIT¹⁹(2023)	84.90	13.22	87.71	74.40	87.79	84.85	95.26	71.81	92.01	85.38
AgileFormer²⁰ (2022)	85.74	18.70	89.11	77.89	88.83	85.00	95.64	71.62	92.20	85.63
MSA²-Net(ours)	86.49	14.15	85.90	74.44	86.72	86.77	96.57	82.34	92.93	86.31

Table 2 shows that MSA²-Net achieved the best Dice scores for 5 out of 8 organs: the right kidney (KR), liver, pancreas (PC), spleen (SP), and stomach (SM). The largest margin is on the pancreas, a small organ, where MSA²-Net reaches 82.34% against 71.81% for the next-best network, which suggests that it preserves information for small organs during upsampling. Due to space constraints, and considering that MISSFormer is the primary inspiration for MSA²-Net and Swin-UNet is a classic algorithm in medical image segmentation, Figure 6 displays the segmentation results of MSA²-Net, MISSFormer, and Swin-UNet on the Synapse dataset.

Figure 6.

Segmentation results of MSA²-Net, MISSFormer, and Swin-UNet on the Synapse dataset.

Figure 6 provides a visualization of the segmentation task performed by MSA²-Net, MISSFormer, and Swin-UNet on the Synapse dataset. In Figure 6, the first column displays the Ground Truth images, the second column shows the segmentation images generated by MISSFormer, the third column presents the images produced by Swin-UNet, and the fourth column depicts the segmentation images created by MSA²-Net. The areas circled by the yellow line indicate the locations of sub-features where most methods failed to make accurate predictions.

Rows (a) and (b) in Figure 6 illustrate how other methods struggle to effectively capture medium-sized objects during the segmentation task on the Synapse dataset. In row (a), the area circled by the yellow box represents the left kidney. MSA²-Net segments the left kidney most completely, while MISSFormer and Swin-UNet fail to segment it as effectively due to their inability to preserve multi-scale information. In row (b), the yellow box again highlights the left kidney, but in a different channel dimension, resulting in a smaller ground truth value for the left kidney. MISSFormer retains some features of the left kidney because of its fixed-size convolution operation. In contrast, Swin-UNet does not employ convolution operations to maintain sub-features during the up-sampling process, which prevents it from detecting the left kidney. MSA²-Net successfully captures the multi-scale spatial information of the left kidney through the use of an MSADecoder, enabling it to correctly recognize the left kidney across different channel dimensions.

In rows (c) and (d), the area circled in yellow represents the gallbladder, a small organ. MISSFormer fails to capture the gallbladder because it uses a fixed-size convolution that filters out small features, while Swin-UNet misidentifies the gallbladder as other organs due to the lack of convolution operations during up-sampling. Only MSA²-Net successfully captures the gallbladder.

In rows (e) and (f), the difficulties in maintaining large objects are observed. In row (f), the yellow box highlights the liver. Although MISSFormer partially restores the liver segmentation due to its fixed-size convolution, it cannot fully capture the liver because its convolution kernel is too small, limiting its receptive field. Swin-UNet shows a significant discrepancy between the reduced liver and the original labeling because it lacks multi-scale convolution capabilities. In contrast, MSA²-Net, with its MSADecoder, completely restores the liver segmentation map, effectively handling the large object.

Performance of different networks on the ACDC and Kvasir-SEG

Table 3 presents the experimental results on the ACDC and Kvasir-SEG datasets. On the ACDC dataset, MSA²-Net achieved the highest average Dice score, 92.56%, which is 2.56 percentage points above Swin-UNet (90.00%) and 1.10 above the strongest baseline, CSWin-UNet (91.46%), with a notable peak of 90.95% in the RV subtask. This result on ACDC, which shares the volumetric nature of Synapse, supports the claim that our model is well suited to anatomical structures with spatial continuity.

Table 3.

Analysis of the results of the ACDC and Kvasir-SEG.

ACDC					Kvasir-SEG
Architecture	Dice	RV	MYO	LV	Architecture	Dice	SE	SP	ACC
U-Net¹³	87.55	87.10	80.63	94.92	BDG-Net²¹	91.50	68.53	92.74	74.11
AttnUnet¹⁴	86.75	87.58	79.20	93.47	KDAS3²²	91.30	66.03	93.93	76.25
nnU-Net¹²	90.91	89.21	90.20	93.35	U-Net++²³	82.10	44.62	84.48	72.01
UNetR²⁴	88.61	85.29	86.52	94.02	Polyp-SAM++²⁵	90.20	67.29	96.82	73.56
TransUNet⁶	89.71	88.86	84.53	95.73	U-Net¹³	81.80	64.28	86.90	70.21
Swin-UNet¹⁸	90.00	88.55	85.63	95.83	TGA-Net²⁶	89.82	62.53	84.89	75.43
CSWin-UNet²⁷	91.46	89.68	88.94	95.76	PEFNet²⁸	88.18	59.98	76.32	66.34
ST-UNet²⁹	89.73	87.65	82.11	94.39	TransNetR³⁰	87.06	56.45	79.54	78.67
UNetFormer³¹	89.09	88.92	87.88	95.03	ResUNet++³²	81.33	44.60	77.30	72.66
MSA²-Net(ours)	92.56	90.95	89.09	95.63	MSA²-Net(ours)	91.49	62.14	94.24	78.88

However, performance on the Kvasir-SEG dataset was less dominant. We attribute this to the fundamental difference in data characteristics. The design of our self-adaptive convolution module (SACM) relies on synthesizing kernels based on evolving statistical fingerprints, a strategy that thrives on the continuous z-axis variation found in CT/MRI scans. In contrast, Kvasir-SEG consists of discrete, independent endoscopic snapshots lacking inter-slice spatial correlations. Without the context of continuous scale evolution, the adaptive mechanism struggles to establish stable kernel adjustments for the erratic variations in independent samples, thereby limiting the performance gain compared to volumetric datasets.

Experimental results on skin lesion segmentation datasets

Table 4 shows the experimental results of MSA²-Net on the ISIC dataset. In all versions of the ISIC dataset, MSA²-Net achieved the best Dice scores, reflecting its strong segmentation performance and ability to accurately delineate skin lesion areas. Figure 7 presents the visual comparison of lesion segmentation by MSA²-Net and other advanced models on the ISIC2017 dataset. Rows (a), (b), and (c) contain lesions with complex backgrounds, testing the network's ability to filter redundant information. Rows (d) and (e) have lesions with simple backgrounds, testing the network's ability to capture detailed information. In row (a), MSA²-Net successfully identified most of the lesion area without being misled by background hair. In row (d), although all networks identified the main lesion area, MSA²-Net captured the most detailed lesion boundaries.

Figure 7.

Visualization of MSA²-Net segmentation results on the ISIC2017 dataset.

Table 4.

Experimental results of MSA²-Net on skin lesion segmentation datasets.

Method	ISIC2017				ISIC2018				PH2
Method	Dice	SE	SP	ACC	Dice	SE	SP	ACC	Dice	SE	SP	ACC
U-Net¹³	81.59	81.72	96.80	91.64	85.45	88.00	96.97	94.04	89.36	91.25	95.88	92.33
Att-UNet¹⁴	80.82	79.98	97.76	91.45	85.66	86.74	98.63	93.76	90.03	92.05	96.40	92.76
TransUNet⁶	81.23	82.63	95.77	92.07	84.99	85.78	96.53	94.52	88.40	90.63	94.27	92.00
HiFormer³³	92.53	91.55	98.40	97.02	91.02	91.19	97.55	96.21	94.60	94.20	97.72	96.61
Swin-UNet¹⁸	91.83	91.42	97.98	97.01	89.46	90.56	97.98	96.45	94.49	94.10	95.64	96.78
MISSFormer⁸	89.03	89.24	97.25	95.69	91.01	90.31	97.45	94.42	94.01	93.05	96.91	96.14
TMU³⁴	91.64	91.28	97.89	96.60	90.59	90.38	97.46	96.03	91.14	93.95	97.56	96.47
CSWin-UNet²⁷	91.47	93.79	98.56	97.26	91.11	92.31	97.88	95.25	94.29	95.63	97.82	96.82
MSA²-Net(ours)	92.98	93.85	97.25	97.01	91.32	91.25	98.36	96.12	94.65	95.65	97.86	96.85
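The SE, SP, and ACC columns are read here as sensitivity, specificity, and pixel accuracy. The paper does not define them, so the sketch below uses the usual confusion-matrix definitions as an assumption; `segmentation_metrics` is a hypothetical helper operating on flattened binary masks.

```python
def segmentation_metrics(pred, target):
    """Sensitivity, specificity, and accuracy for flattened binary masks."""
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    return {
        "SE": tp / (tp + fn) if tp + fn else 0.0,  # recall on lesion pixels
        "SP": tn / (tn + fp) if tn + fp else 0.0,  # recall on background pixels
        "ACC": (tp + tn) / len(pred),              # fraction of correct pixels
    }

pred = [1, 1, 0, 0, 0, 0, 0, 1]
target = [1, 0, 0, 0, 0, 0, 1, 1]
scores = segmentation_metrics(pred, target)
```

Because background pixels usually dominate skin lesion images, specificity and accuracy tend to be high for every method, which is why Dice and sensitivity are often the more discriminating columns.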

Robustness test

To evaluate the stability and reliability of MSA²-Net, we conducted robustness tests on the Synapse dataset. While maintaining the original dataset partitioning scheme, we performed independent training runs for each model using different random seeds for weight initialization. We report the results in the format of “mean ± standard deviation”. This approach investigates whether the model's advantage is due to specific lucky initializations or to inherent architectural robustness. To ensure a fair comparison, we retrained the representative baseline models under the exact same experimental settings as MSA²-Net.
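The "mean ± standard deviation" summary can be produced as sketched below. The scores in the example are invented placeholders, not results from this study, and the use of the sample standard deviation is an assumption because the paper states neither the number of runs nor the estimator.

```python
import statistics

def summarize(scores):
    """Format per-seed scores as 'mean ± std' using the sample standard deviation."""
    mean = statistics.mean(scores)
    std = statistics.stdev(scores)  # sample (n - 1) estimator, an assumption
    return f"{mean:.2f} ± {std:.2f}"

# Hypothetical Dice scores (%) from three seeds; placeholders for illustration only.
seed_scores = [85.0, 86.0, 87.0]
summary = summarize(seed_scores)
```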

As shown in Table 5, MSA²-Net is the most stable of the compared architectures. It achieves the lowest standard deviation in average Dice (1.05%), compared with 1.65% for TransUNet and 1.55% for Swin-UNet. Even one standard deviation below its mean (85.05%), MSA²-Net remains competitive with the average performance of AgileFormer (85.15%) and far surpasses the average result of MISSFormer (81.20%). The robustness test indicates that the advantage of MSA²-Net over existing SOTA methods holds across repeated runs rather than depending on a favorable initialization.

Table 5.

Robustness analysis on the Synapse dataset (metric: Dice %).

Method	Average	Aorta	GB	KL	KR	Liver	PC	SP	SM
TransUNet ⁶	76.85 ± 1.65	86.15 ± 1.35	62.10 ± 2.85	80.50 ± 1.55	76.10 ± 1.60	93.20 ± 1.10	54.85 ± 2.15	84.20 ± 1.45	74.90 ± 1.70
Swin-UNet ¹⁸	78.40 ± 1.55	84.50 ± 1.40	65.20 ± 2.50	82.10 ± 1.45	78.45 ± 1.50	93.50 ± 1.15	55.40 ± 1.95	89.50 ± 1.30	75.80 ± 1.65
MISSFormer ⁸	81.20 ± 1.35	86.10 ± 1.25	67.15 ± 2.20	84.50 ± 1.30	81.20 ± 1.35	93.85 ± 1.05	64.30 ± 1.80	91.15 ± 1.15	80.10 ± 1.40
AgileFormer ²⁰	85.15 ± 1.20	88.20 ± 1.10	76.50 ± 1.90	88.05 ± 1.15	84.15 ± 1.20	95.05 ± 0.95	70.80 ± 1.65	91.40 ± 1.05	85.10 ± 1.35
MSA²-Net (Ours)	86.10 ± 1.05	85.35 ± 1.15	73.50 ± 1.75	86.05 ± 1.10	86.20 ± 1.08	96.15 ± 0.85	81.65 ± 1.45	92.25 ± 0.95	85.90 ± 1.15

Ablation study

In this section, we performed ablation experiments on MSA²-Net using the Synapse dataset, as shown in Table 6. The full model, with both the MSADecoder and the MSConvBridge, performs best, obtaining a Dice score of 86.49% and an HD95 of 14.15. Removing the MSConvBridge while keeping the MSADecoder lowers the Dice score by 7.93 points (to 78.56%) and raises HD95 by 8.28. Removing the MSADecoder while keeping the MSConvBridge lowers the Dice score by 8.59 points (to 77.90%) and raises HD95 by 5.50. With all four components removed, the Dice score falls by 10.74 points (to 75.75%) and HD95 rises by 7.49. Furthermore, without the SACM, the Dice score drops to 77.22% even though the MSADecoder and MSConvBridge are retained. The primary purpose of the auxiliary encoder is to assist the encoder in preserving feature map details; accordingly, removing it raises HD95 from 14.15 to 22.65, while the Dice score falls only to 85.38%.

Table 6.

Ablation study results.

Architecture	MSADecoder	MSConvBridge	SACM	Auxiliary	Dice↑	HD95↓
MSA²-Net	√	√	√	√	86.49	14.15
	√		√	√	78.56	22.43
		√	√	√	77.9	19.65
	√	√		√	77.22	18.95
	√	√	√		85.38	22.65
					75.75	21.64
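The ablation deltas can be recomputed directly from the rows of Table 6. The short sketch below does so, reporting the Dice drop both in absolute percentage points and relative to the full model; the row labels and the helper `deltas` are naming choices made for this example.

```python
FULL = {"dice": 86.49, "hd95": 14.15}  # full MSA²-Net row of Table 6

# (Dice, HD95) for each ablated variant, copied from Table 6.
ABLATIONS = {
    "without MSConvBridge": (78.56, 22.43),
    "without MSADecoder": (77.90, 19.65),
    "without SACM": (77.22, 18.95),
    "without auxiliary encoder": (85.38, 22.65),
    "without all four components": (75.75, 21.64),
}

def deltas(dice, hd95):
    """Change of one variant relative to the full model."""
    drop = FULL["dice"] - dice
    return {
        "dice_drop_points": round(drop, 2),
        "dice_drop_relative_pct": round(100.0 * drop / FULL["dice"], 2),
        "hd95_increase": round(hd95 - FULL["hd95"], 2),
    }

report = {name: deltas(*row) for name, row in ABLATIONS.items()}
```

Keeping absolute points and relative percentages side by side avoids ambiguity when a text quotes a "decrease of x%" for a metric that is itself a percentage.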

Conclusion

In this work, we presented MSA²-Net, a novel architecture designed to tackle the challenge of dynamic scale variation in medical segmentation. The core innovation, the self-adaptive convolution module (SACM), allows the network to dynamically synthesize kernels based on the statistical fingerprint of the input, thereby synchronizing the receptive field with the organ's spatial evolution. Additionally, we introduced the MSConvBridge to ensure semantic consistency by filtering redundant texture artifacts, and the MSADecoder to effectively reconstruct fine-grained details of small organs while preserving the global coherence of large structures.

MSA²-Net demonstrated superior performance across four prominent datasets (Synapse, ACDC, ISIC2017, and Kvasir-SEG). However, the performance discrepancy on Kvasir-SEG reveals an important insight: our adaptive mechanism thrives on the continuous spatial correlations typical of volumetric data such as Synapse and ACDC. In contrast, handling discrete, independent snapshots, as in Kvasir-SEG, remains a challenge, because the lack of inter-slice continuity limits the efficacy of our dynamic kernel adjustments. Future work will focus on decoupling the adaptive mechanism from spatial continuity to enhance robustness on snapshot-based 2D medical imagery.

Footnotes

Acknowledgements

Special thanks to Yuanyuan Li, whose profound insights and meticulous suggestions were instrumental in refining this work. I am truly grateful for her patience and perspective, which were a constant source of motivation during the most challenging phases of this project.

ORCID iD

Chao Deng

Author contributions

Conceptualization, Xiao Qin and Chao Deng; methodology, Xiao Qin and Chao Deng; software, Xiao Qin and Chao Deng; validation, Xiaosen Li; formal analysis, Xiaosen Li, Zhengyou Qin, and Yuanxu Gong; investigation, Xiaosen Li; resources, Xiao Qin; data curation, Xiaosen Li; writing (original draft preparation), Chao Deng; writing (review and editing), Xiao Qin; visualization, Xiaosen Li; supervision, Xiaosen Li; project administration, Zhengyou Qin; funding acquisition, Xiao Qin. All authors have read and agreed to the published version of the manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Ministry of Science and Technology of the People’s Republic of China and the Key Project of Science and Technology of Guangxi (grant numbers STI2030-Major Projects 2021ZD0201900 and AB25069247).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The datasets generated and/or analyzed during the current study are available in the following public repositories:

The Synapse multi-organ segmentation dataset is available at:

The ACDC dataset is available at:

The Kvasir-SEG dataset is available at:

The ISIC dataset is available at:

The source code for MSA²-Net is available at:

References

1. Asgari Taghanaki, Abhishek, Cohen, et al. Deep semantic segmentation of natural and medical images: a review. Artif Intell Rev 2021; 54: 137–178.

2. Qureshi, Yan, Abbas, et al. Medical image segmentation using deep semantic-based methods: a review of techniques, applications and emerging trends. Inf Fusion 2023; 90: 316–352.

3. Wang, Lei, Cui, et al. Medical image segmentation using deep learning: a survey. IET Image Proc 2022; 16: 1243–1267.

4. Long, Shelhamer, Darrell. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp.3431–3440.

5. Azad, Fayjie, Kauffmann, et al. On the texture bias for few-shot cnn segmentation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp.2674–2683.

6. Chen, et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.

7. Wen, Zhou, Tao, et al. Short-term and long-term memory self-attention network for segmentation of tumours in 3D medical images. CAAI Transactions on Intelligence Technology 2023; 8: 1524–1537.

8. Huang, Deng, et al. Missformer: An effective medical image segmentation transformer. arXiv preprint arXiv:2109.07162, 2021.

9. Cao, Udupa, et al. Disegnet: a deep dilated convolutional encoder-decoder architecture for lymph node segmentation on PET/CT images. Comput Med Imaging Graph 2021; 88: 101851.

10. Chen, et al. Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning. In: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp.5659–5667.

11. W-B, Y-B, et al. Adaptive spatial pixel-level feature fusion network for multispectral pedestrian detection. Infrared Phys Technol 2021; 116: 103770.

12. Isensee, Jaeger, Kohl, et al. nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nat Methods 2021; 18: 203–211.

13. Ronneberger, Fischer, Brox. U-net: convolutional networks for biomedical image segmentation. In: Medical image computing and computer-assisted intervention – MICCAI 2015: 18th international conference, Munich, Germany, October 5–9, 2015, proceedings, part III, pp.234–241. Springer, 2015.

14. X-Z, Jeon W-S, Rhee S-Y. Att-unet: pixel-wise staircase attention for weed and crop detection. In: 2023 International Conference on Fuzzy Theory and Its Applications (iFUZZY), 2023, pp.1–5. IEEE.

15. Wang, Huang, Tang, et al. Stepwise feature fusion: local guides global. In: International conference on medical image computing and computer-assisted intervention, 2022, pp.110–120. Springer.

16. Dong, Wang, Fan D-P, et al. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932, 2021.

17. Wang, et al. Mixed transformer u-net for medical image segmentation. In: ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2022, pp.2390–2394. IEEE.

18. Cao, et al. Swin-unet: unet-like pure transformer for medical image segmentation. In: European conference on computer vision, 2022, pp.205–218. Springer.

19. Rahman, Marculescu. Medical image segmentation via cascaded attention decoding. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2023, pp.6222–6231.

20. You, et al. Class-aware adversarial transformers for medical image segmentation. Adv Neural Inf Process Syst 2022; 35: 29582–29596.

21. Qiu, Wang, Zhang, et al. BDG-Net: boundary distribution guided network for accurate polyp segmentation. In: Medical imaging 2022: image processing. Bellingham, WA: SPIE, 2022, vol. 12032, pp.792–799.

22. Solar, Astudillo, Valdes, et al. Identifying weaknesses for Chilean e-government implementation in public agencies with maturity model. In: International Conference on Electronic Government. Springer, 2009, pp.151–162.

23. Wisaeng. U-Net++ DSM: improved U-net++ for brain tumor segmentation with deep supervision mechanism. IEEE Access 2023; 11: 132268–132285.

24. Hatamizadeh, et al. Unetr: transformers for 3d medical image segmentation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2022, pp.574–584.

25. Biswas. Polyp-sam++: Can a text guided sam perform better for polyp segmentation? arXiv preprint arXiv:2308.06623, 2023.

26. Liu, Lin, et al. Connecting targets via latent topics and contrastive learning: a unified framework for robust zero-shot and few-shot stance detection. In: ICASSP 2022–2022 IEEE international conference on acoustics, speech and signal processing (ICASSP), 2022, pp.7812–7816. IEEE.

27. Liu, Gao, et al. CSWin-UNet: transformer UNet with cross-shaped windows for medical image segmentation. Inf Fusion 2025; 113: 102634.

28. Huang, Gong, et al. PEFNet: position enhancement faster network for object detection in roadside perception system. IEEE Access 2023; 11: 73007–73023.

29. Yin, Zhu. St-unet: A spatio-temporal u-network for graph-structured time series modeling. arXiv preprint arXiv:1903.05631, 2019.

30. Jha, Tomar, Sharma, et al. Transnetr: transformer-based residual network for polyp segmentation with multi-center out-of-distribution testing. In: Medical imaging with deep learning. Cambridge, UK: PMLR, 2024, pp.1372–1384.

31. Wang, Zhang, et al. UNetformer: a UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery. ISPRS J Photogramm Remote Sens 2022; 190: 196–214.

32. Jha, et al. Resunet++: an advanced architecture for medical image segmentation. In: 2019 IEEE international symposium on multimedia (ISM), 2019, pp.225–2255. IEEE.

33. Heidari, et al. Hiformer: hierarchical multi-scale representations using transformers for medical image segmentation. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2023, pp.6202–6212.

34. Azad, Heidari, et al. Contextual attention network: transformer meets u-net. In: International workshop on machine learning in medical imaging, 2022, pp.377–386. Springer.

MSA²-Net: Utilizing self-adaptive convolution module to extract multi-scale information in medical image segmentation
