Abstract
Accurate segmentation of low-contrast images plays a crucial role in computer-aided diagnosis and treatment, particularly for early lesion detection and clinical decision support. To address the limitations of existing approaches in boundary localisation and multi-scale context modelling, we propose a lightweight and efficient hybrid segmentation framework, referred to as SwinFuseNet. The proposed architecture combines the strengths of detection-based and Transformer-based models. Specifically, the Global Pyramid Attention Backbone Network integrates a shifted window-based Transformer mechanism to enhance the global representation of blurred lesions in low-contrast images. In the feature aggregation stage, two dedicated modules, Dynamic Zoom Fusion and Dynamic Spatial Interaction Fusion, are introduced to adaptively integrate information from multiple layers, effectively refining local boundary representations and fine-grained structural features. Additionally, a lightweight attention subnetwork is employed to highlight salient regions while suppressing background noise, thereby improving overall segmentation precision. Experiments conducted on five publicly available low-contrast image segmentation datasets (ISIC 2018, PH², LUNA16, Kvasir-SEG, and Brisc2025) demonstrate that the proposed method significantly outperforms existing models, including variants of U-Net and a recent detection-based segmentation framework. On the ISIC 2018 dataset, the proposed network achieves a Dice coefficient of 0.9873 and an Intersection-over-Union score of 0.9566, representing improvements of 4.48% and 6.53%, respectively, over the current state-of-the-art, which corresponds to reducing the remaining gap to a perfect score by approximately 78% and 60%, respectively, relative to the best alternative algorithm. These results confirm the effectiveness and practical relevance of the proposed method in the domain of low-contrast medical image segmentation.
Introduction
Low-contrast image segmentation plays a vital role in computer-aided diagnosis and treatment, 1–4 particularly for the early detection of lesions (less than 1,000 pixels²) and for supporting clinical decision-making. However, several challenges persist, including blurred lesion boundaries, variability in morphology, and inconsistent illumination. 5 These issues are particularly critical in medical imaging, where precise segmentation directly affects diagnostic accuracy 6 and therapeutic planning.
Deep learning-based segmentation architectures have made notable progress in recent years. For instance, U-Net 7 employs a symmetric encoder–decoder design with skip connections to bridge semantic gaps 8 between layers, while UNet++ 9 extends this concept with nested dense skip paths to enhance feature reuse and spatial detail. 10 Attention-based variants, such as Attention U-Net, 7 introduce spatial and channel-wise attention to highlight lesions and suppress irrelevant regions.
Multi-scale context modelling is another key technique for refining boundaries and detecting small lesions. PSPNet 11 applies pyramid pooling for scale-aware feature fusion, and the DeepLab series12–14 utilises atrous convolutions and probabilistic graphical models to capture fine structural details. More recently, Transformer-based architectures have demonstrated strong capabilities in modelling long-range dependencies. The Swin Transformer, 15 in particular, introduces a hierarchical, shifted-window mechanism that balances efficiency and global contextual modelling.
Despite these advancements, existing methods often struggle to balance global context with fine-grained detail, and many face limitations in fusing features across multiple depths or in dynamically adapting to diverse lesion characteristics. Conventional fusion strategies often apply static weighting, which may not generalise well across imaging scenarios. 16 Furthermore, global attention modules, though powerful, tend to introduce substantial computational overhead.
To address these limitations, we propose SwinFuseNet, a lightweight hybrid segmentation framework. Its Global Pyramid Attention Backbone Network (GPABN) integrates a shifted-window Transformer mechanism with multi-scale pooling and attention to strengthen the global representation of blurred lesions in low-contrast images.
In the network’s fusion stage, two novel modules, Dynamic Zoom Fusion (DZF) and Dynamic Spatial Interaction Fusion (DSIF), are introduced to adaptively integrate multi-resolution features. These modules use gated and multiplicative interactions to improve boundary localisation and fine-grained 20,21 feature representation. Finally, a lightweight attention subnetwork 22,23 is employed to emphasise lesion regions and suppress background interference, yielding improved segmentation accuracy on low-contrast medical images.
Related work
Attention mechanisms in low-contrast image segmentation
Attention mechanisms have become fundamental in enhancing model sensitivity to salient regions in low-contrast image segmentation. Channel attention approaches such as SENet 24 and ECA-Net 25 model inter-channel dependencies to refine feature representation, while spatial attention modules such as CBAM 26 and Coordinate Attention 27 highlight relevant spatial locations while suppressing background noise.
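To make the channel-attention idea concrete, the following is a minimal SE-style block in PyTorch, in the spirit of SENet; the reduction ratio and layer sizes are illustrative defaults rather than values from the cited works.

```python
# Minimal sketch of squeeze-and-excitation (SE) channel attention.
# Hyperparameters (e.g. reduction=16) are illustrative assumptions.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)           # squeeze: global spatial average
        self.fc = nn.Sequential(                       # excitation: per-channel gate
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                                   # reweight channels

x = torch.randn(2, 64, 32, 32)
print(SEBlock(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```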
Advanced variants including Attention U-Net, 28 CA-Net, 29 and DANet 30 integrate attention mechanisms into skip connections or decoding layers. These designs significantly improve segmentation accuracy, particularly in cases involving blurry or poorly defined lesion boundaries, which are common in low-contrast imaging scenarios.
Multi-scale feature fusion
Low-contrast medical images 31 often present large variability in object size and anatomical structure. Effective multi-scale feature fusion is thus essential for accurate segmentation. Fully Convolutional Networks (FCN) 32 introduced end-to-end pixel-wise prediction, and Feature Pyramid Networks (FPN) 33 proposed a top-down hierarchical fusion scheme. Similarly, DeepLab models12–14 and PSPNet 11 employ pyramid pooling and dilated convolutions to aggregate contextual information across scales.
Other approaches, such as Multi-ResUNet, 34 leverage parallel multi-resolution paths to enhance detail retention, while DMFNet 35 implements a two-stage fusion strategy to improve cross-layer interactions. These models are particularly beneficial for low-contrast conditions, where fine-grained structure may be difficult to discern.
Transformer-based segmentation methods
Transformer architectures have recently been introduced into medical image segmentation to capture long-range dependencies beyond the receptive field of conventional CNNs. TransUNet 36 integrates a Vision Transformer (ViT) encoder with a U-Net-style decoder to jointly model global and local features. Swin-Unet 37 introduces a hierarchical Transformer with shifted window attention, which improves efficiency while preserving global context.
Other models, such as TransFuse, 38 use a dual-branch structure to integrate convolutional and Transformer-based features via multi-scale fusion. 39 These Transformer-based frameworks have shown substantial promise for low-contrast image segmentation, where long-range interactions help resolve ambiguous and diffuse anatomical boundaries. 40
Dynamic interaction fusion techniques
Static feature fusion strategies often fail to generalise across diverse imaging scenarios. To address this, dynamic interaction fusion methods adapt feature processing based on input content.41–43 For example, Dynamic Filter Networks 44 generate convolutional kernels conditioned on the input, while CondConv 45 learns a set of expert filters that are adaptively combined. DyReLU 46 proposes input-adaptive activation functions for improved representation learning. Such techniques offer finer control over feature modulation, which is particularly advantageous for segmenting structures in low-contrast medical images where intensity variations are subtle and spatial boundaries are ill-defined.
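To illustrate input-conditioned filtering, the sketch below mixes a small bank of expert kernels using sample-dependent routing weights, loosely in the spirit of CondConv and Dynamic Filter Networks; the routing head, expert count, and initialisation are our own illustrative choices, not the cited implementations.

```python
# Illustrative dynamic convolution: per-sample routing over K expert kernels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, k: int = 3, num_experts: int = 4):
        super().__init__()
        self.experts = nn.Parameter(torch.randn(num_experts, out_ch, in_ch, k, k) * 0.02)
        self.route = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(in_ch, num_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        alpha = torch.softmax(self.route(x), dim=1)              # (B, K) routing weights
        # Mix expert kernels per sample, then apply a grouped conv so each
        # sample is filtered with its own mixed kernel.
        weight = torch.einsum('bk,koiuv->boiuv', alpha, self.experts)
        weight = weight.reshape(b * weight.shape[1], c, self.k, self.k)
        out = F.conv2d(x.reshape(1, b * c, h, w), weight, padding=self.k // 2, groups=b)
        return out.reshape(b, -1, h, w)

x = torch.randn(2, 32, 64, 64)
print(DynamicConv2d(32, 64)(x).shape)  # torch.Size([2, 64, 64, 64])
```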
Methodology
Global pyramid attention backbone network (GPABN)
We introduce the Global Pyramid Attention Backbone Network (GPABN), designed to capture both fine-grained local details and global contextual semantics in low-contrast image segmentation. By integrating hierarchical Transformer encoding with multi-scale pooling and attention fusion, GPABN enhances representational capacity while maintaining computational efficiency.
As illustrated in Figure 1(a), GPABN comprises five sequential stages that progressively encode features from low-level details to high-level semantics and aggregate multi-scale contextual information for robust representation. Figure 1 shows: (a) the Global Pyramid Attention Backbone Network (GPABN), which integrates hierarchical Swin Transformer stages with multi-scale pooling and attention to capture both local and global features; (b) two successive Swin Transformer blocks, alternating between window-based multi-head self-attention (W-MSA) and shifted-window multi-head self-attention (SW-MSA) for efficient global–local dependency modelling; and (c) the C2PSA module, a cascaded spatial attention mechanism that progressively enhances focus on lesion boundaries while suppressing background noise. Spatial down-sampling is applied between stages via patch merging, enlarging the receptive field while preserving essential structure. In this formulation,
Here,
The outputs of both branches are concatenated and fused with a

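The GPABN equations are not reproduced above, so as an illustration only, the following minimal sketch (our own assumption, not the paper’s implementation) shows a PSPNet-style multi-scale pooling stage followed by a lightweight channel-attention gate, mirroring the "multi-scale pooling and attention fusion" described for the aggregation stage. Module and parameter names are hypothetical.

```python
# Illustrative multi-scale pooling + attention fusion for one backbone stage.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidAttentionFusion(nn.Module):
    """Pool at several scales, fuse, and gate channels (sketch only)."""
    def __init__(self, channels: int, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(channels, channels // len(bins), 1, bias=False),
                          nn.ReLU(inplace=True))
            for b in bins
        ])
        fused_ch = channels + (channels // len(bins)) * len(bins)
        self.project = nn.Conv2d(fused_ch, channels, 1, bias=False)
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1),
                                  nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        feats = [x] + [F.interpolate(s(x), size=(h, w), mode='bilinear',
                                     align_corners=False) for s in self.stages]
        y = self.project(torch.cat(feats, dim=1))
        return y * self.gate(y)          # attention-reweighted multi-scale features

x = torch.randn(1, 96, 56, 56)
print(PyramidAttentionFusion(96)(x).shape)  # torch.Size([1, 96, 56, 56])
```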
We propose an innovative SwinFuseNeckNetwork (SFNN). As illustrated in Figure 2, the core innovation of SFNN lies in the design and deep integration of the Dynamic Zoom Fusion (DZF) module and the Dynamic Spatial Interaction Fusion (DSIF) module. This architecture significantly enhances feature extraction and multi-scale feature fusion in medical image segmentation tasks by seamlessly combining multi-scale aggregation, content-aware fusion, and efficient refinement. Figure 2 illustrates the architecture of SwinFuseNet: (a) DZF Module: the Dynamic Zoom Fusion (DZF) module adjusts the feature maps through multiple convolutional operations, followed by smoothing and pooling operations, and then fuses the results to enhance low-contrast image feature extraction. (b) DSIF Module: the Dynamic Spatial Interaction Fusion (DSIF) module applies spatial and channel attention mechanisms to refine feature maps, utilizing identity maps for feature fusion and downsampling for efficiency. (c) Fusion Module: the Fusion module splits and processes feature maps, utilizing average pooling and convolution for feature enhancement, followed by a final sigmoid function to adaptively fuse features across different scales. (d) Identity Map: A 1
These are summed into
We then compute a channel-wise descriptor via global mean pooling with uniform weighting and a
The gated output is:
To emphasize important spatial regions, we extract spatial descriptors from the fine and coarse branches via separate
The final output of the DZF module is:

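Because the DZF equations are elided above, the following is a minimal sketch, under our own assumptions, of how the described operations could be composed in PyTorch: a fine branch and a pooled, smoothed coarse branch are summed, gated by a pooled channel descriptor, and then gated spatially by descriptors from both branches. Class and parameter names, kernel sizes, and the choice of pooling are illustrative, not the authors’ implementation.

```python
# Hypothetical sketch of the Dynamic Zoom Fusion (DZF) idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DZFSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.fine = nn.Conv2d(channels, channels, 3, padding=1)
        self.coarse = nn.Conv2d(channels, channels, 3, padding=1)
        self.channel_gate = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.spatial_fine = nn.Conv2d(channels, 1, 7, padding=3)
        self.spatial_coarse = nn.Conv2d(channels, 1, 7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.fine(x)                                        # fine-detail branch
        c = F.interpolate(self.coarse(F.avg_pool2d(x, 2)),      # zoomed-out, smoothed branch
                          size=x.shape[-2:], mode='bilinear', align_corners=False)
        s = f + c                                               # sum of both branches
        s = s * self.channel_gate(F.adaptive_avg_pool2d(s, 1))  # channel-wise gating
        spatial = torch.sigmoid(self.spatial_fine(f) + self.spatial_coarse(c))
        return s * spatial                                      # spatially gated output

x = torch.randn(2, 64, 40, 40)
print(DZFSketch(64)(x).shape)  # torch.Size([2, 64, 40, 40])
```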
Building on
A
The final output of this branch is formed by concatenating the processed map with the original identity input:
Channel attention is then applied. A global average pooling (GAP) followed by a
These weights are applied to the concatenated feature map to form a gated representation:
Spatial descriptors are extracted from
The final output of the DSIF module is obtained by spatially gating
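Analogously, the DSIF operations can be sketched as follows under our own assumptions: a processed branch is concatenated with an identity map, channel attention (GAP followed by a 1×1 convolution and sigmoid) gates the concatenation, and a spatial descriptor gates the result. Layer choices and names are hypothetical and intended only to make the data flow concrete.

```python
# Hypothetical sketch of the Dynamic Spatial Interaction Fusion (DSIF) idea.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSIFSketch(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.process = nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                     nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.identity = nn.Conv2d(channels, channels, 1)         # identity-style 1x1 map
        self.channel_attn = nn.Sequential(nn.Conv2d(2 * channels, 2 * channels, 1),
                                          nn.Sigmoid())
        self.spatial_attn = nn.Conv2d(2 * channels, 1, 7, padding=3)
        self.out = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        cat = torch.cat([self.process(x), self.identity(x)], dim=1)      # branch + identity
        gated = cat * self.channel_attn(F.adaptive_avg_pool2d(cat, 1))   # channel gating
        gated = gated * torch.sigmoid(self.spatial_attn(gated))          # spatial gating
        return self.out(gated)

x = torch.randn(2, 64, 40, 40)
print(DSIFSketch(64)(x).shape)  # torch.Size([2, 64, 40, 40])
```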
Datasets
Metrics
We assess our model using four metrics: Dice Score, Jaccard Coefficient, Precision, and Recall. Except for model complexity, all metrics are derived from the confusion-matrix entries: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN).
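For reference, a standard way of computing these four metrics from the confusion-matrix counts of a binary mask is sketched below; this reflects the usual definitions rather than the paper’s exact code.

```python
# Dice, Jaccard (IoU), Precision, and Recall from TP/FP/FN counts.
import numpy as np

def segmentation_metrics(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7):
    pred, target = pred.astype(bool), target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    dice = 2 * tp / (2 * tp + fp + fn + eps)
    jaccard = tp / (tp + fp + fn + eps)        # Intersection-over-Union
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {"Dice": dice, "Jaccard": jaccard, "Precision": precision, "Recall": recall}

print(segmentation_metrics(np.ones((4, 4)), np.eye(4)))
```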
In this study, we utilised the SwinFuseNeckNetwork, a compact segmentation architecture, configured with a depth multiplier of 0.50, a width multiplier of 0.25, and a maximum channel capacity of 1024. The entire training pipeline was executed on a single NVIDIA RTX 4090 GPU, leveraging the PyTorch framework with Python 3.8 and CUDA version 12.2. The model was trained from the ground up, without any reliance on pre-initialised weights. Input images were resized to
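For clarity, the reported setup can be summarised as a configuration dictionary. Only the values stated above are included; anything not reported (optimizer, learning rate, batch size, the resized input resolution, and so on) is deliberately omitted rather than guessed.

```python
# Hedged summary of the reported training setup; values are taken from the text.
config = {
    "model": {
        "name": "SwinFuseNet",
        "depth_multiple": 0.50,    # depth multiplier
        "width_multiple": 0.25,    # width multiplier
        "max_channels": 1024,      # maximum channel capacity
        "pretrained": False,       # trained from scratch, no pre-initialised weights
    },
    "environment": {
        "framework": "PyTorch",
        "python": "3.8",
        "cuda": "12.2",
        "gpu": "NVIDIA RTX 4090",
    },
}
```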
Quantitative results
Tables 1 to 8 present a comprehensive comparison between the proposed
Quantitative evaluation of various segmentation models on the ISIC 2018 dataset. Boldface indicates the best performance for each metric.
Experimental results of SwinFuseNet with different random seeds on the ISIC2018 dataset.
Mean, standard deviation, and 95% confidence intervals of key metrics under different random seeds.
Quantitative evaluation of various segmentation models on the PH² dataset. Boldface indicates the best performance for each metric.
Quantitative evaluation of various segmentation models on the LUNA16 dataset. Boldface indicates the best performance for each metric.
Quantitative evaluation of various segmentation models on the Kvasir-SEG dataset. Boldface indicates the best performance for each metric.
Quantitative evaluation of various segmentation models on the Brisc2025 dataset. Boldface indicates the best performance for each metric.
Comparison of inference time between SwinFuseNet and the baseline model YOLOv11 across multiple datasets. Inference latency is reported in milliseconds (ms).
On the
Tables 2 and 3 demonstrate the stability and robustness of SwinFuseNet under different random seeds. The results show consistently high performance across multiple runs, with small standard deviations and narrow 95% confidence intervals. This indicates that the improvements achieved by SwinFuseNet are not due to random fluctuations, but reflect reliable and reproducible performance in low-contrast image segmentation.
As shown in Table 4,
Performance on the
On the
On the
To further examine the practicality of SwinFuseNet for deployment, we evaluate not only accuracy but also runtime efficiency. The corresponding inference latency results are summarized in Table 8 and discussed in detail in the following subsection.
Table 8 shows that incorporating Transformer modules into the YOLOv11 backbone results in only a negligible increase in inference latency—about 0.18 ms on average across the five benchmark datasets. Despite this small overhead, SwinFuseNet remains much more efficient than fully Transformer-based models, which often suffer from deployment bottlenecks due to heavy computational demands.
This efficiency stems from its hybrid design: Transformer layers are selectively integrated into the backbone for capturing long-range dependencies, while the neck (SFNN) and prediction head largely preserve lightweight CNN-based structures. Such a design allows SwinFuseNet to benefit from global context modeling while maintaining compactness, with just 2.7M parameters compared to YOLOv11-n’s 2.8M, thereby keeping both model size and inference time under control.
Across ISIC2018, PH², LUNA16, Kvasir-SEG, and Brisc2025, SwinFuseNet demonstrates consistent performance improvements over YOLOv11-n, with average gains of +2.24% in Dice, +3.67% in Jaccard, +0.90% in Precision, and +3.56% in Recall. These results indicate that the proposed method not only improves lesion boundary delineation and recall sensitivity but also sustains computational efficiency. Overall, SwinFuseNet provides a balanced trade-off between accuracy, parameter compactness, and inference latency, making it a suitable candidate for practical deployment in resource-constrained medical imaging scenarios.
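As an illustration of how such per-image latencies are typically obtained, the sketch below times repeated forward passes with CUDA events; the model handle, input resolution, and iteration counts are placeholders, not the authors’ benchmarking protocol.

```python
# Minimal sketch of GPU inference-latency measurement (milliseconds per image).
import torch

@torch.no_grad()
def measure_latency_ms(model, input_shape=(1, 3, 640, 640), warmup=20, iters=200):
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                       # warm up kernels / cuDNN autotuning
        model(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        model(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters        # average ms per forward pass
```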
Qualitative results
Figure 3 presents a visual comparison between SwinFuseNet and YOLOv11 on the ISIC2018, PH², LUNA16, Kvasir-SEG and Brisc2025 datasets. By integrating Transformer architectures with YOLO, SwinFuseNet enhances segmentation performance by effectively combining multi-scale feature fusion with fine-grained feature extraction through its Global Pyramid Attention Backbone Network and SwinFuseNeckNetwork. As shown in the figure, SwinFuseNet demonstrates superior accuracy in segmenting targets across various scales compared to YOLOv11.

Qualitative comparison of SwinFuseNet with YOLOv11 on the ISIC2018, PH², LUNA16, Kvasir-SEG and Brisc2025 datasets.
In particular, we emphasize that many images in ISIC2018 and PH² exhibit extremely subtle grayscale differences between lesions and surrounding tissue, with boundaries that are irregular, fuzzy, or partially missing. Conventional U-Net-like models often fail in these cases because they rely on static skip connections and fixed fusion strategies, which are insufficient to capture fine transitions or adaptively emphasize ambiguous regions. As a result, the predictions tend to either oversmooth boundaries or miss small protrusions along lesion edges.
SwinFuseNet addresses these challenges through the
Although the visual differences are subtle, these refinements are clinically meaningful. In dermatology (ISIC2018,
We performed ablation studies on the ISIC2018 and LUNA16 datasets to evaluate the contribution of each module in SwinFuseNet. The results are summarized in Tables 9 and 10:
Ablation study of the main components of SwinFuseNet on the ISIC2018 dataset, using YOLOv11-n as the baseline (baseline performance is reported in Table 1).
Ablation study of the main components of SwinFuseNet on the LUNA16 dataset, using YOLOv11-n as the baseline (baseline performance is reported in Table 5).
Averaged over ISIC2018 and LUNA16, adding the Global Pyramid Attention Backbone Network (GPABN) alone improved Dice from 92.94% to 94.31% (+1.37%), Jaccard from 86.84% to 88.58% (+1.74%), Precision from 94.39% to 95.16% (+0.77%), and Recall from 91.07% to 92.42% (+1.35%). These results confirm that GPABN effectively enhances hierarchical context modeling, capturing both global semantics and fine-grained local structures to boost segmentation performance.
Impact of the fusion modules (DZF and DSIF)
Averaged across the two datasets, adding DZF alone improved Dice from 92.94% to 93.67% (+0.73%), Jaccard from 86.84% to 88.01% (+1.17%), Precision from 94.39% to 94.83% (+0.44%), and Recall from 91.07% to 93.11% (+2.04%). Similarly, adding DSIF alone improved Dice to 94.01% (+1.07%), Jaccard to 88.34% (+1.50%), Precision to 95.02% (+0.63%), and Recall to 92.71% (+1.64%). These improvements demonstrate that DZF and DSIF enhance feature representation by adaptively integrating multi-scale information and reinforcing local spatial dependencies.
Combined impact of GPABN and fusion modules
When modules were combined, the improvements became more pronounced. GPABN+DZF raised Dice from 92.94% to 95.15% (+2.21%), Jaccard from 86.84% to 89.82% (+2.98%), Precision from 94.39% to 95.29% (+0.90%), and Recall from 91.07% to 94.29% (+3.22%). GPABN+DSIF further improved Dice to 95.55% (+2.61%), Jaccard to 90.38% (+3.54%), Precision to 95.39% (+1.00%), and Recall to 94.83% (+3.76%). Combining DZF and DSIF without GPABN achieved Dice 94.73% (+1.79%), Jaccard 89.66% (+2.82%), Precision 95.14% (+0.75%), and Recall 93.77% (+2.70%). Finally, integrating all three modules yielded the best overall results: Dice 96.64% (+3.70%), Jaccard 92.66% (+5.82%), Precision 95.70% (+1.31%), and Recall 96.65% (+5.58%). These results highlight the complementary strengths of GPABN’s hierarchical attention and the adaptive feature interactions of DZF and DSIF, validating the overall design of SwinFuseNet for robust low-contrast image segmentation.
In summary, the effective integration of GPABN and fusion modules greatly enhances low-contrast image segmentation, validating the design choices for multi-scale feature fusion and fine-grained feature extraction in SwinFuseNet.
Conclusion
This paper has introduced
The proposed backbone,
Extensive experiments on five benchmark datasets (ISIC 2018, PH², LUNA16, Kvasir-SEG, and Brisc2025) demonstrate that SwinFuseNet consistently outperforms state-of-the-art baselines, including U-Net variants and YOLOv11, across multiple evaluation metrics. On the ISIC 2018 dataset, it achieves Dice and IoU scores of 98.73% and 95.66%, marking relative improvements of 4.48% and 6.53% over leading competitors. These results confirm the model’s effectiveness and practical applicability in real-world low-contrast segmentation tasks.
SwinFuseNet not only advances the performance frontier for low-contrast image segmentation, but also establishes a versatile architectural blueprint for future research in lightweight, accurate, and interpretable medical image analysis systems.
In future work, we aim to further enhance SwinFuseNet’s generalisability by incorporating domain adaptation techniques and evaluating its performance across diverse imaging modalities beyond dermatology and radiology. Additionally, we plan to investigate approaches for uncertainty quantification (e.g., Bayesian deep learning or Monte Carlo dropout) and explainability (e.g., attention heatmaps or Grad-CAM) to improve the model’s clinical interpretability and trustworthiness. We will also examine the challenges of real-time deployment, focusing on the trade-offs between segmentation accuracy, inference speed, and hardware constraints, particularly on edge devices in resource-constrained settings. By improving segmentation accuracy, SwinFuseNet has the potential to enhance diagnostic precision and support personalized treatment planning in clinical settings.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (62376127, 61876089, 61876185, 61902281, 61403206), the Natural Science Foundation of Jiangsu Province (BK20141005), the Natural Science Foundation of the Jiangsu Higher Education Institutions of China (14KJB520025), Jiangsu Distinguished Professor Programme.
Funding
The author(s) received no financial support for the research, authorship and/or publication of this article.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
