Abstract

Introduction

Accurate delineation of the high-risk clinical target volume (HR-CTV) and organs at risk (OARs) is critical for cervical cancer brachytherapy. However, treatment planning is time-consuming, and prolonged waiting can lead to organ displacement and patient discomfort. Additionally, the steep dose gradients around the HR-CTV amplify the dosimetric consequences of segmentation errors in the HR-CTV and OARs. Therefore, achieving rapid and precise delineation of HR-CTV and OARs remains challenging. This study proposes a novel network model, MDA-TransUnet, for fast segmentation of HR-CTV and OARs in cervical cancer.

Methods

We applied MDA-TransUnet, a CNN-Transformer hybrid model, to segment the bladder, colon, rectum, small bowel, and HR-CTV on cervical cancer CT images. CT images from 122 cervical cancer brachytherapy patients at three clinical centers were used, with 80 cases allocated to training, 22 to testing, and 20 to external validation. Segmentation accuracy was quantified using the Dice Similarity Coefficient (DSC), 95% Hausdorff Distance (HD95), and Average Surface Distance (ASD). Dosimetric differences were analyzed via paired t-tests.

Results

Compared to other methods, MDA-TransUnet achieved superior segmentation performance on the test dataset. The DSCs for the bladder, colon, rectum, small bowel, and HR-CTV were 94.54%, 79.27%, 79.27%, 88.90%, and 82.35%, respectively. Paired t-tests on five dosimetric metrics (D5cc, D2cc, D0.1cc, D90%, and Dmean) showed no significant differences. For OARs, the average difference in D2cc was less than 12%. For HR-CTV, the average difference in Dmean was less than 8%, and D90% was less than 11%.

Conclusion

This work demonstrates the superiority of MDA-TransUnet in segmenting OARs and HR-CTV for cervical cancer brachytherapy, with robust performance across multi-center datasets.

Keywords

deep learning, high-dose-rate brachytherapy, auto-segmentation, cervical cancer, image-guided

Introduction

Cervical cancer ranks as the fourth most common malignant tumor among women worldwide, with over 600,000 new cases annually, approximately 90% of which occur in developing countries with limited healthcare resources.¹ Its high incidence and mortality rates pose a significant threat to women's health. Precision radiotherapy, particularly combining external beam radiation therapy (EBRT) and brachytherapy (BT), is a cornerstone in treating locally advanced cervical cancer.² However, the unique anatomical location of cervical tumors—where high-risk clinical target volumes (HR-CTV) are closely adjacent to organs at risk (OARs) such as the bladder and rectum—necessitates a delicate balance between tumor eradication and normal tissue sparing during treatment planning.

Brachytherapy (BT), which involves placing radioactive sources directly into or near the tumor target, exploits the rapid dose fall-off with distance to deliver high local tumor doses while minimizing damage to OARs.³ In cervical cancer treatment, BT has proven to significantly improve local control rates and patient survival. However, BT planning heavily relies on manual delineation of targets and OARs by physicians, leading to subjectivity, inefficiency,⁴ and prolonged workflows. On average, radiation oncologists require 32 min to delineate HR-CTV and OARs for gynecological malignancies.⁵ The demand for rapid yet precise planning creates high-pressure workflows prone to human error, while extended procedures exacerbate patient discomfort.

In recent years, with the growing adoption of deep learning, various neural network architectures, primarily convolutional neural networks (CNNs), have been developed for segmentation in cervical cancer high-dose-rate brachytherapy (HDR-BT).⁶⁻⁹ The classic CNN-based segmentation model, U-Net,¹⁰ and its variants employ symmetric encoder-decoder networks to automatically extract multi-level features through CNNs, significantly improving segmentation efficiency and accuracy.¹¹⁻¹⁴ For example, Li et al¹⁵ applied the adaptive deep CNN framework nnU-Net to segment the bladder, rectum, and HR-CTV on cervical cancer CT images. The nnU-Net method integrates three architectures (2D U-Net, 3D U-Net, and 3D cascade U-Net) to adaptively select the optimal architecture for each task. Zhang et al¹⁶ developed a DSD U-Net model for automated delineation of the bladder, rectum, sigmoid colon, small bowel, and HR-CTV, achieving high accuracy as evaluated by the Dice similarity coefficient (DSC). Chang et al¹⁷ proposed a hybrid network combining 3D U-Net and long short-term memory (LSTM) for HR-CTV and OAR segmentation, demonstrating superior performance over 2D U-Net.

Despite CNNs’ strengths in capturing local spatial features, they face limitations in modeling long-range dependencies between pixels.¹⁸ Transformer architectures, enhanced by self-attention mechanisms, overcome this limitation by enabling global context modeling, showing promising potential in medical image segmentation tasks.¹⁹⁻²³ To synergize the advantages of CNNs (local perception) and Transformers (global reasoning), hybrid CNN-Transformer architectures have emerged as a cutting-edge direction in medical image segmentation, exemplified by TransUNet.²⁴ Gu et al²⁵ pioneered the integration of Transformer self-attention with CNN frameworks for segmenting the bladder, rectum, and colon, demonstrating significant effectiveness. However, their work focused solely on OAR segmentation in cervical cancer brachytherapy and did not extend the CNN-Transformer framework to HR-CTV segmentation.

Although deep learning-based automatic segmentation methods show promise in cervical cancer brachytherapy, existing studies still have limitations. First, most research relies on single-center datasets, which makes it difficult to validate model generalizability and robustness in multi-center scenarios and leaves challenges such as variations in imaging protocols and inconsistent contouring criteria across centers unaddressed.²⁶ Second, existing CNN-Transformer hybrid architectures primarily focus on OAR segmentation and have not yet been applied to automatic segmentation of the HR-CTV. The HR-CTV is characterized by indistinct boundaries, variable morphology, and close proximity to OARs, making its accurate segmentation crucial for treatment planning.²⁷ Therefore, a segmentation method is needed that achieves precise segmentation of both HR-CTV and OARs while maintaining stable performance on heterogeneous multi-center data.

To address these challenges, this study proposes, for the first time, a CNN-Transformer hybrid network named MDA-TransUnet for segmenting both OARs and HR-CTV in cervical cancer brachytherapy. To validate the model's robustness across centers, this study integrated 122 cervical cancer brachytherapy patients’ CT images across three clinical centers for training and testing, providing robust support for evaluating the model's performance in heterogeneous scenarios. This design significantly differs from previous single-center studies and aligns more closely with clinical practical needs. Furthermore, we introduced a Multi-scale Adaptive Spatial Attention Gate (MASAG) and a Deformable Convolutional Attention Module (DCAM) into the CNN-Transformer framework to further enhance adaptability to organ deformation and multi-center variations. Compared to manual contouring, MDA-TransUnet achieved an average segmentation time of approximately 1.2 min per patient for both HR-CTV and OARs in cervical cancer CT images, representing a 26-fold speed increase. This holds promise for significantly reducing patient waiting time post-applicator insertion in clinical practice, alleviating patient discomfort, and mitigating the risk of organ displacement associated with prolonged waiting periods. MDA-TransUnet thus offers a novel technical solution for cervical cancer brachytherapy image segmentation.
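The 26-fold figure appears to follow from the 32 min manual delineation time cited in the Introduction⁵ and the 1.2 min automatic segmentation time. Assuming that this is the comparison intended, the arithmetic is:

```latex
\frac{t_{\mathrm{manual}}}{t_{\mathrm{auto}}} = \frac{32\ \mathrm{min}}{1.2\ \mathrm{min}} \approx 26.7
```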

Materials and Methods

Dataset

A total of 122 patients were enrolled in this Institutional Review Board (IRB)-approved retrospective study, including 52 patients from The Third Affiliated Hospital of Nanjing Medical University, 50 patients from The Affiliated Tumor Hospital of Nantong University, and 20 patients from The Affiliated Huaian NO.1 People's Hospital of Nanjing Medical University. Data from the 102 patients at the first two centers (The Third Affiliated Hospital of Nanjing Medical University and The Affiliated Tumor Hospital of Nantong University) were used for training and testing, with 80 cases allocated for training and 22 cases for testing. Data from the 20 patients at The Affiliated Huaian NO.1 People's Hospital of Nanjing Medical University served as an external validation set to further evaluate the model's generalization performance. This study was approved by the Medical Ethics Committee of The Third Affiliated Hospital of Nanjing Medical University (#2024KY213-01), the Medical Ethics Committee of The Affiliated Tumor Hospital of Nantong University (#2020-031), and the Medical Ethics Committee of The Affiliated Huaian NO.1 People's Hospital of Nanjing Medical University (#IIT2024101). Informed consent was waived for all participants with the approval of the Medical Ethics Committees. CT images were reconstructed using the Philips Brilliance Big Bore CT scanner (Philips Healthcare, Best, the Netherlands) with a matrix size of 512 × 512 and a slice thickness of 3 mm. All patients were treated using tandem and ovoid applicators (T + O). The HR-CTV, bladder, colon, rectum, and small bowel were manually contoured by experienced radiation oncologists (with over ten years of clinical expertise) using Monaco 5.40.01 (Elekta, Stockholm, Sweden). The radiation oncologists contoured the HR-CTV on CT images according to ICRU Report 89, with reference to MRI images obtained prior to the first brachytherapy session.²⁸

Data Preprocessing

The 3D-Slicer software and RT structure data were used to generate binary masks for HR-CTV and OARs for each patient. All binary masks were then label-encoded into a single map with class values ranging from 0 to 5. To mitigate overfitting, we applied data augmentation that was strictly synchronized between each image and its label map: with 50% probability, a random rotation by k × 90° (k ∈ {0, 1, 2, 3}) followed by a random axial flip (horizontal or vertical); with 25% probability, a random rotation between −20° and 20°; and with 25% probability, the identity transformation.
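The sampling scheme described above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names (`sample_op`, `apply_rot90_flip`) are hypothetical, picking a single random flip axis is an assumption about what "horizontal/vertical" means, and the small-angle branch only samples the angle because applying it needs an interpolating rotation that is not shown here.

```python
import random

def rot90(grid, k):
    """Rotate a 2D list counter-clockwise by k * 90 degrees."""
    for _ in range(k % 4):
        grid = [list(row) for row in zip(*grid)][::-1]
    return grid

def flip(grid, axis):
    """Axis 0 reverses the row order (vertical flip); axis 1 reverses each row (horizontal flip)."""
    return grid[::-1] if axis == 0 else [row[::-1] for row in grid]

def sample_op(rng):
    """Draw one augmentation with the 50% / 25% / 25% split described in the text."""
    u = rng.random()
    if u < 0.50:
        # Assumption: one flip axis is drawn at random after the rotation.
        return {"kind": "rot90_flip", "k": rng.randrange(4), "axis": rng.randrange(2)}
    if u < 0.75:
        return {"kind": "rotate", "angle": rng.uniform(-20.0, 20.0)}
    return {"kind": "identity"}

def apply_rot90_flip(image, mask, op):
    """Apply the same rotation and flip to image and mask so they stay aligned."""
    image_out = flip(rot90(image, op["k"]), op["axis"])
    mask_out = flip(rot90(mask, op["k"]), op["axis"])
    return image_out, mask_out

# Empirical check of the branch probabilities with a fixed seed.
rng = random.Random(0)
counts = {"rot90_flip": 0, "rotate": 0, "identity": 0}
for _ in range(10000):
    counts[sample_op(rng)["kind"]] += 1
```

Drawing the operation once and applying it to both arrays is what keeps the image and its label map synchronized; sampling separately for each would misalign them.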

Network Architecture

In this section, we introduce our proposed MDA-TransUnet. Given the outstanding performance of TransUnet in medical image segmentation, we adopt TransUnet as the baseline model. For the encoder, we retain TransUnet's hybrid CNN-Transformer design: the CNN serves as a feature extractor to generate input feature maps, while the Transformer captures global and spatial relationships between features. The decoder consists of three key components: (1) UpConv blocks for feature upsampling, (2) the Multi-Scale Adaptive Spatial Attention Gate (MASAG) to enhance feature representation, and (3) the Deformable Convolutional Attention Module (DCAM) to robustly refine feature maps, as shown in Figure 1.

Figure 1.

MDA-TransUnet Network Architecture Diagram. (a) Overall Framework of MDA-TransUnet, (b) Multi-Scale Adaptive Spatial Attention Gate (MASAG), (c) Multi-Scale Feature Fusion (MSF), (d) Deformable Convolutional Attention Module (DCAM), (e) Spatial Attention (SA), (f) Deformable Convolutional Block (DCB), (g) UpConv.

Multi-Scale Adaptive Spatial Attention Gate (MASAG)

We employ the Multi-Scale Adaptive Spatial Attention Gate (MASAG) module to enhance feature representation. MASAG aims to effectively integrate multi-scale information and guide the aggregation of spatial features, thereby improving overall segmentation performance.²⁹ Through a four-stage collaborative process—Multi-scale Fusion (MSF), Spatial Selection (SS), Spatial Interaction and Cross-Modulation (SICM), and Recalibration (RC)—MASAG progressively optimizes feature representations, addressing limitations of traditional methods in cross-scale information fusion and spatial weight allocation. The MASAG framework comprises the following stages:

Multi-Scale Feature Fusion (MSF)

The encoder feature maps (X) and decoder feature maps (Y) are fused through local and global context extraction branches:

Local Context Extraction: Utilizes Depthwise Separable Convolution and Dilated Convolution to extract local details from encoder features (X), as defined in Equation (1):

F_{\mathrm{local}} = \mathrm{Conv}_{1 \times 1}(\mathrm{DW\text{-}D}(\mathrm{DW}(X)))

(1)

where DW denotes Depthwise Separable Convolution, composed of depthwise convolution and pointwise convolution. This architecture reduces computational costs and parameter counts while effectively extracting local features. DW-D represents Dilated Depthwise Separable Convolution, which incorporates dilation rates into the convolution kernel. This significantly expands the receptive field without increasing parameters, enabling the capture of spatial dependencies among local features across broader contexts.
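The parameter saving and the receptive-field growth mentioned above follow from standard formulas, sketched below. The kernel size, channel counts, and dilation rate in the example are hypothetical values chosen for illustration (the text does not state them), and biases are ignored.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise (k x k per channel) plus pointwise (1 x 1) convolution."""
    return k * k * c_in + c_in * c_out

def effective_kernel(k, dilation):
    """Spatial extent covered by a k x k kernel at the given dilation rate."""
    return k + (k - 1) * (dilation - 1)

# Hypothetical example: 3 x 3 kernel, 64 -> 64 channels, dilation 2.
std = standard_conv_params(3, 64, 64)   # 36864
sep = separable_conv_params(3, 64, 64)  # 4672
span = effective_kernel(3, 2)           # 5
```

For these example values the separable form uses roughly one eighth of the weights, and dilation 2 stretches a 3 × 3 kernel over a 5 × 5 window with no extra parameters.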

Global Context Extraction: Applies global average pooling and max pooling to the decoder feature maps (Y) to capture broad contextual information, as shown in Equation (2):

F_{\mathrm{global}} = \mathrm{Conv}_{1 \times 1}([C_{\mathrm{Avg}}(Y), C_{\mathrm{Max}}(Y)])

(2)

where $C_{\mathrm{Avg}}$ and $C_{\mathrm{Max}}$ denote global average pooling and max pooling, respectively. Global Average Pooling obtains the overall mean response of the feature maps, reflecting the global feature distribution. Global Max Pooling captures the most salient features within the feature maps. Their combined output provides complementary global semantic information that is lacking in the decoder features.

Feature Fusion: The local features extracted by the encoder are added to the global features provided by the decoder, generating the fused feature map U that simultaneously incorporates both local details and global information, as shown in Equation (3):

U = F_{\mathrm{local}} + F_{\mathrm{global}}

(3)

Spatial Selection (SS): The spatial selection module dynamically assigns spatial weights to highlight critical anatomical regions and suppress noise:

Generates a two-channel weight map, as shown in Equation (4):

\mathrm{SW}_{i} = \mathrm{Softmax}(\mathrm{Conv}_{1 \times 1}(U)), \quad \forall i \in \{1, 2\}

(4)

where $\mathrm{Softmax}(\cdot)$ represents the Softmax activation function, ensuring that the two weights sum to 1 at each spatial location. A dedicated weight is thus computed at each pixel location for the encoder features and for the decoder features, indicating which of the two holds greater importance at that position.

Applies spatial weighting to encoder features (X) and decoder features (Y) separately, as shown in Equation (5):

X' = \mathrm{SW}_{1} \otimes X + X, \qquad Y' = \mathrm{SW}_{2} \otimes Y + Y

(5)

where $\otimes$ denotes element-wise multiplication, and $\mathrm{SW}_{1} \otimes X$ and $\mathrm{SW}_{2} \otimes Y$ represent the spatially weighted versions of encoder features X and decoder features Y, respectively. The residual connection preserves original feature information while mitigating potential information loss during the weighting process.

Spatial Interaction and Cross-Modulation (SICM)

This module facilitates feature interaction across spatial locations and enables cross-channel modulation of multi-scale information:

Encoder features are modulated by decoder global context, while decoder features are modulated by encoder local details, addressing the misalignment of low-level and high-level features in traditional U-Net skip connections, as shown in Equation (6):

X'' = X' \otimes \mathrm{Sigmoid}(Y'), \qquad Y'' = Y' \otimes \mathrm{Sigmoid}(X')

(6)

where $\mathrm{Sigmoid}(\cdot)$ is the Sigmoid activation function. The global contextual information from $Y'$ augments $X'$ to yield $X''$, while the detailed information from $X'$ complements $Y'$ to produce $Y''$. This bidirectional enhancement ensures both $X''$ and $Y''$ incorporate integrated local and global information.

Final fused features are integrated via element-wise multiplication, as shown in Equation (7):

U' = X'' \otimes Y''

(7)

The mutually modulated $X''$ and $Y''$ are fused via element-wise multiplication to generate the integrated feature $U'$. This multiplicative fusion more effectively highlights regions deemed significant by both feature maps.
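To make Equations (4) to (7) concrete, the sketch below traces them for a single pixel with scalar features. It is a simplified illustration under stated assumptions: the two logits that the 1 × 1 convolution would produce from U are supplied by hand, and the function name `masag_pixel` is hypothetical.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def softmax2(a, b):
    """Two-way softmax, shifted by the maximum for numerical stability."""
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    return ea / (ea + eb), eb / (ea + eb)

def masag_pixel(x, y, logit_x, logit_y):
    """Trace one pixel through spatial selection and cross-modulation."""
    sw1, sw2 = softmax2(logit_x, logit_y)   # Eq. (4): weights sum to 1
    x1 = sw1 * x + x                        # Eq. (5): weighted encoder feature + residual
    y1 = sw2 * y + y                        # Eq. (5): weighted decoder feature + residual
    x2 = x1 * sigmoid(y1)                   # Eq. (6): encoder gated by decoder
    y2 = y1 * sigmoid(x1)                   # Eq. (6): decoder gated by encoder
    u = x2 * y2                             # Eq. (7): multiplicative fusion
    return sw1, sw2, u

# Hand-picked inputs: equal logits give equal weights of 0.5.
sw1, sw2, u = masag_pixel(x=2.0, y=2.0, logit_x=0.0, logit_y=0.0)
```

With equal logits both features are scaled to 3.0 by Equation (5), and the fused value is 9 · Sigmoid(3)², which is large only because both inputs respond strongly at this pixel.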

Recalibration (RC)

Finally, the recalibration module adjusts feature map responses to emphasize meaningful information, as shown in Equation (8):

X_{\mathrm{out}} = \mathrm{Conv}_{1 \times 1}(\mathrm{Sigmoid}(\mathrm{Conv}_{1 \times 1}(U')) \otimes X)

(8)

This step refines the fused feature map through pointwise convolution, which is then activated by a sigmoid function to generate the attention map. Finally, the initial input X is recalibrated by performing element-wise multiplication with the attention map, followed by further processing through another pointwise convolution. The recalibrated feature map ($X_{\mathrm{out}}$) serves as the skip connection output for decoder feature fusion.

Deformable Convolutional Attention Module (DCAM)

We employ the Deformable Convolutional Attention Module (DCAM) to enhance the network's adaptability to irregular shapes, size variations, and geometric deformations in medical imaging, challenges particularly prominent in cervical cancer brachytherapy segmentation tasks. By integrating the context-aware feature refinement of the Spatial Attention (SA) mechanism and the dynamic geometric adaptation of the Deformable Convolution Block (DCB), DCAM overcomes the limitations of conventional convolutional operations in modeling anatomical diversity. The DCAM module combines SA and DCB, as defined in Equation (9):

\mathrm{DCAM}(x) = \mathrm{DCB}(\mathrm{SA}(x))

(9)

where x is the input tensor and $\mathrm{DCAM}(\cdot)$ denotes the Deformable Convolutional Attention Module. The DCAM module first applies a Spatial Attention mechanism to the input features, focusing on critical spatial regions. The output of SA is then processed by a Deformable Convolution Block, which leverages the dynamic sampling capability of deformable convolutions to adapt to organ geometric deformations.

Spatial Attention (SA)

The Spatial Attention mechanism identifies and amplifies critical spatial regions in feature maps, emphasizing high-signal areas (eg, HR-CTV boundaries) while suppressing low-contrast or irrelevant regions and noise interference, as formulated in Equation (10):

\mathrm{SA}(x) = \mathrm{Sigmoid}(\mathrm{Conv}([C_{\mathrm{Avg}}(x), C_{\mathrm{Max}}(x)])) \otimes x

(10)

where $\mathrm{Conv}(\cdot)$ denotes a 7 × 7 convolutional layer with padding of 3. The large 7 × 7 kernel effectively captures broader contextual information and enhances spatial perception capabilities. This process first applies both average pooling and max pooling along the channel dimension to the input feature x, yielding two distinct feature maps. These feature maps are concatenated and processed through the large-kernel $\mathrm{Conv}(\cdot)$ to generate a Spatial Attention Map. Finally, a sigmoid activation function normalizes the attention weights to the [0,1] range. The normalized map is then multiplied element-wise with the original input x to achieve spatial re-weighting: regions with weights approaching 1 are enhanced while those approaching 0 are suppressed.
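Equation (10) can be traced on a tiny tensor with the pure-Python sketch below. It is an illustration only: the weights are hand-set rather than learned, the convolution bias is omitted (an assumption), and the nested-list layout `[C][H][W]` and the name `spatial_attention` are choices made for this example.

```python
import math

def spatial_attention(x, weight, pad=3):
    """x: [C][H][W] nested lists; weight: [2][k][k] kernel over the (avg, max) maps."""
    C, H, W = len(x), len(x[0]), len(x[0][0])
    k = len(weight[0])
    # Channel-wise average and max pooling give two H x W maps.
    avg = [[sum(x[c][i][j] for c in range(C)) / C for j in range(W)] for i in range(H)]
    mx = [[max(x[c][i][j] for c in range(C)) for j in range(W)] for i in range(H)]
    pooled = [avg, mx]
    out = [[[0.0] * W for _ in range(H)] for _ in range(C)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            # Zero-padded k x k convolution over the two pooled maps.
            for m in range(2):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < H and 0 <= jj < W:
                            s += weight[m][di][dj] * pooled[m][ii][jj]
            attn = 1.0 / (1.0 + math.exp(-s))   # sigmoid -> weight in [0, 1]
            for c in range(C):
                out[c][i][j] = attn * x[c][i][j]
    return out

x = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
zero_w = [[[0.0] * 7 for _ in range(7)] for _ in range(2)]
halved = spatial_attention(x, zero_w)  # sigmoid(0) = 0.5 everywhere
```

With all-zero weights every attention value is Sigmoid(0) = 0.5, so the output is the input halved; learned weights instead push the map toward 1 over salient regions and toward 0 elsewhere.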

Deformable Convolution Block (DCB)

We introduce deformable convolution to further refine features generated by SA and enhance the network's adaptability to geometric deformations—crucial for addressing anatomical heterogeneity in multi-center datasets. Unlike the fixed grid sampling in traditional convolution, deformable convolution dynamically adjusts kernel sampling positions by learning offset parameters, enabling the kernel to adaptively “warp” and align with irregular anatomical contours (eg, HR-CTV or colon boundaries), as detailed in Equation (11):

\mathrm{DCB}(x) = \mathrm{ReLU}(\mathrm{BN}(\mathrm{DConv}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{DConv}(\mathrm{Conv}_{1 \times 1}(x)))))))

(11)

where $\mathrm{ReLU}(\cdot)$ is the ReLU activation layer, $\mathrm{BN}(\cdot)$ denotes batch normalization, and $\mathrm{DConv}(\cdot)$ represents deformable convolution.
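The idea behind the deformable sampling used in $\mathrm{DConv}(\cdot)$ can be illustrated in one dimension. The sketch below is a deliberate simplification of the 2D operation used in the network: the offsets are supplied by hand instead of being predicted by a learned layer, sampling uses linear rather than bilinear interpolation, and the names `sample_linear` and `deform_conv1d` are hypothetical.

```python
import math

def sample_linear(signal, pos):
    """Linearly interpolate signal at a fractional position; values outside are zero."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return signal[i] if 0 <= i < len(signal) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)

def deform_conv1d(signal, weights, offsets):
    """offsets[p][k] shifts where kernel tap k samples for output position p."""
    radius = len(weights) // 2
    out = []
    for p in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            # Regular grid position (p + k - radius) plus the per-tap offset.
            acc += w * sample_linear(signal, p + (k - radius) + offsets[p][k])
        out.append(acc)
    return out

signal = [1.0, 2.0, 3.0, 4.0]
weights = [0.0, 1.0, 0.0]                  # identity kernel, for readability
no_shift = [[0.0] * 3 for _ in signal]
half_shift = [[0.5] * 3 for _ in signal]
regular = deform_conv1d(signal, weights, no_shift)    # same as a fixed-grid convolution
shifted = deform_conv1d(signal, weights, half_shift)  # samples half a step to the right
```

Zero offsets reproduce an ordinary fixed-grid convolution, while non-zero offsets let each tap read from an interpolated location, which is what allows the kernel to follow irregular contours.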

Upsampling (UpConv)

The UpConv layer progressively upsamples the features of the current layer to align the dimensions with the subsequent skip connection. Each UpConv layer consists of an upsampling operation $\mathrm{UP}(\cdot)$ with a scale factor of 2, a 3 × 3 convolution $\mathrm{Conv}(\cdot)$, a batch normalization $\mathrm{BN}(\cdot)$, and a ReLU activation layer, as formulated in Equation (12):

\mathrm{UpConv}(x) = \mathrm{ReLU}(\mathrm{BN}(\mathrm{Conv}(\mathrm{UP}(x))))

(12)

Loss Function

Our method employs a multi-scale supervision strategy and a hybrid loss function to optimize model performance and mitigate class imbalance. The model generates predictions at four decoder levels (p1, p2, p3, p4), with each level contributing to the loss calculation. This approach reduces reliance on single-scale features by integrating multi-scale contextual information. The loss at each level is a weighted combination of Cross-Entropy Loss and Dice Loss, as defined in Equation (13):

L_{\mathrm{layer}} = 0.3 L_{\mathrm{CE}} + 0.7 L_{\mathrm{Dice}}

(13)

where Cross-Entropy Loss ($L_{\mathrm{CE}}$) and Dice Loss ($L_{\mathrm{Dice}}$) are formulated as follows:

L_{\mathrm{CE}} = - \sum_{i = 1}^{N} y_{i} \log (p_{i})

(14)

L_{\mathrm{Dice}} = 1 - \frac{2 \sum_{i = 1}^{N} p_{i} y_{i} + \epsilon}{\sum_{i = 1}^{N} p_{i} + \sum_{i = 1}^{N} y_{i} + \epsilon}

(15)

Here, $y_{i}$ denotes the ground truth label, $p_{i}$ represents the predicted probability, N is the number of classes, and $\epsilon$ is a smoothing factor to prevent division by zero.

The final loss is the weighted sum of losses from all four decoder levels, with equal weights assigned ($\alpha = \beta = \gamma = \delta = 1$), as shown in Equation (16):

L_{\mathrm{total}} = \alpha L_{p1} + \beta L_{p2} + \gamma L_{p3} + \delta L_{p4}

(16)
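Equations (13) to (16) can be written out directly, as in the sketch below. It is a minimal illustration on flat probability lists rather than image tensors; the clamping constant, the value of the smoothing factor, and the example predictions are assumptions, and each decoder output is assumed to have been resized to the label resolution before the loss is computed.

```python
import math

def cross_entropy(p, y, eps=1e-7):
    """Eq. (14): -sum_i y_i * log(p_i), with p_i clamped to avoid log(0)."""
    return -sum(yi * math.log(max(pi, eps)) for pi, yi in zip(p, y))

def dice_loss(p, y, eps=1e-5):
    """Eq. (15): one minus the smoothed Dice overlap."""
    inter = sum(pi * yi for pi, yi in zip(p, y))
    return 1.0 - (2.0 * inter + eps) / (sum(p) + sum(y) + eps)

def layer_loss(p, y):
    """Eq. (13): 0.3 * cross-entropy + 0.7 * Dice."""
    return 0.3 * cross_entropy(p, y) + 0.7 * dice_loss(p, y)

def total_loss(preds, y, weights=(1.0, 1.0, 1.0, 1.0)):
    """Eq. (16): weighted sum over the four decoder levels p1..p4."""
    return sum(w * layer_loss(p, y) for w, p in zip(weights, preds))

# Hypothetical two-entry example: the target is the first entry.
y = [1.0, 0.0]
preds = [[0.5, 0.5], [0.6, 0.4], [0.8, 0.2], [0.9, 0.1]]
loss_first = layer_loss(preds[0], y)
loss_all = total_loss(preds, y)
```

The Dice term depends on overlap rather than on the count of correctly labeled background entries, which is why weighting it more heavily helps with class imbalance.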

Evaluation Metrics

Both geometric and dosimetric methods were employed for quantitative analysis. Geometric performance was assessed using the Dice similarity coefficient (DSC), the Hausdorff distance (HD), and the average surface distance (ASD), defined as:

\mathrm{DSC} = \frac{2 | X \cap Y |}{| X | + | Y |}

(17)

\mathrm{HD} = \max (h (A, B), h (B, A)), \qquad h (A, B) = \max_{b \in B} \min_{a \in A} \lVert a - b \rVert

(18)

\mathrm{ASD} (X, Y) = \mathrm{mean} (\{B_{\mathrm{pred}}, B_{\mathrm{gt}}\}), \qquad B_{\mathrm{pred}} = \{ \min_{p_{2} \in A_{\mathrm{gt}}} \lVert p_{1} - p_{2} \rVert : p_{1} \in A_{\mathrm{pred}} \}

(19)

Here, $A_{\mathrm{pred}}$ and $A_{\mathrm{gt}}$ denote the surface points of the predicted and ground truth contours, and $B_{\mathrm{gt}}$ is defined analogously to $B_{\mathrm{pred}}$ with the two roles exchanged. HD95 (95% Hausdorff Distance) takes the 95th percentile of the distances in Equation (18) instead of the maximum. Smaller HD95 and ASD (Average Surface Distance) indicate better shape agreement between segmentation results and ground truth contours, while a larger DSC (Dice Similarity Coefficient) reflects higher spatial overlap with the ground truth.
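The three geometric metrics can be computed on small point sets as sketched below. This is an illustration under stated assumptions: segmentations are given as sets of integer voxel coordinates, the same points double as surface points in the toy example, and HD95 is taken as the larger of the two directed 95th percentiles using the nearest-rank rule (percentile conventions differ between toolkits).

```python
import math

def dsc(x, y):
    """Eq. (17) on sets of voxel coordinates."""
    return 2.0 * len(x & y) / (len(x) + len(y))

def closest_distances(src, dst):
    """Distance from every point in src to its nearest point in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]

def percentile_nearest_rank(values, q):
    """q-th percentile by the nearest-rank rule."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]

def hd95(a, b):
    """Larger of the two directed 95th-percentile surface distances."""
    return max(percentile_nearest_rank(closest_distances(a, b), 95),
               percentile_nearest_rank(closest_distances(b, a), 95))

def asd(a_pred, a_gt):
    """Mean of the closest distances pooled over both directions, as in Eq. (19)."""
    d = closest_distances(a_pred, a_gt) + closest_distances(a_gt, a_pred)
    return sum(d) / len(d)

# Toy example: two 2-voxel contours offset by one voxel.
pred = {(0, 0), (0, 1)}
gt = {(0, 1), (0, 2)}
scores = (dsc(pred, gt), hd95(pred, gt), asd(pred, gt))
```

For this toy pair the contours share one of two voxels each, so the DSC is 0.5, the worst-case surface distance is 1 voxel, and the average surface distance is 0.5 voxel.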

For dosimetric comparisons, dose-volume indices (DVIs) were utilized. For HR-CTV, we focused on Dmean (mean dose to the target) and D90% (minimum dose covering 90% of the target volume). For OARs, we evaluated D5cc, D2cc, and D0.1cc, where DXcc denotes the minimum dose to the hottest X cubic centimeters (cc) of the organ. Paired sample t-tests were performed to compare dosimetric differences, with p < 0.05 indicating statistical significance. All statistical analyses were conducted using Python 3.8.
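The dose-volume indices defined above can be read off a sorted list of voxel doses, as in the sketch below. It is a simplified illustration: the function names, the dose values, and the 1 cc voxel volume are hypothetical (clinical voxels are far smaller), and no interpolation between voxels is performed.

```python
import math

def d_xcc(doses, voxel_volume_cc, x_cc):
    """Minimum dose within the hottest x_cc of an organ."""
    ordered = sorted(doses, reverse=True)
    n = math.ceil(x_cc / voxel_volume_cc - 1e-9)  # voxels needed to reach x_cc
    n = min(max(n, 1), len(ordered))              # clamp if the organ is smaller
    return ordered[n - 1]

def d_percent(doses, percent):
    """Minimum dose covering the given percentage of the target volume."""
    ordered = sorted(doses, reverse=True)
    n = math.ceil(percent / 100.0 * len(ordered) - 1e-9)
    n = min(max(n, 1), len(ordered))
    return ordered[n - 1]

def d_mean(doses):
    """Mean dose to the structure."""
    return sum(doses) / len(doses)

# Hypothetical voxel doses (arbitrary units), each voxel taken as 1 cc.
oar_doses = [10.0, 8.0, 6.0, 4.0]
target_doses = [10.0, 9.0, 8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]
d2cc = d_xcc(oar_doses, voxel_volume_cc=1.0, x_cc=2.0)
d90 = d_percent(target_doses, 90)
dmean = d_mean(target_doses)
```

In the example the hottest 2 cc of the organ are the 10.0 and 8.0 voxels, so D2cc is 8.0; nine of the ten target voxels receive at least 2.0, so D90% is 2.0.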

Experiments and Results

Experimental Details

MDA-TransUnet was implemented in PyCharm on a computer equipped with an Intel® Core™ i7-10700 CPU and an NVIDIA GeForce RTX 3090 GPU. For fair comparison with other methods, all networks were trained under identical configurations. The model was optimized using the AdamW optimizer, which is better suited for complex multi-scale feature learning and improves convergence stability. The learning rate was adjusted via a cosine decay schedule, starting with an initial value of 0.0001 and a minimum of 1e-7. The batch size was set to 24, and training proceeded for 150 epochs.
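A cosine decay schedule with the stated endpoints can be sketched as follows. The formula is the common cosine annealing form; whether the authors stepped the schedule per epoch or per iteration, and whether any warm-up was used, is not stated, so the per-epoch stepping here is an assumption.

```python
import math

def cosine_lr(epoch, total_epochs=150, lr_init=1e-4, lr_min=1e-7):
    """Cosine decay from lr_init at epoch 0 to lr_min at total_epochs."""
    progress = min(max(epoch / total_epochs, 0.0), 1.0)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))

# Learning rate at the start, midpoint, and end of training.
schedule = [cosine_lr(e) for e in (0, 75, 150)]
```

The rate falls slowly at first, fastest around the midpoint, and flattens out near the minimum, which keeps late updates small.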

Geometric Metric Analysis

As shown in Table 1, our proposed MDA-TransUnet achieves the highest Dice Similarity Coefficient (DSC) across all five target regions compared to other methods. The mean DSCs for the bladder, colon, rectum, small bowel, and HR-CTV are 94.54%, 79.27%, 79.27%, 88.90%, and 82.35%, respectively. Notably, for the small bowel (88.90%) and HR-CTV (82.35%), our method outperforms the second-best approaches (Trans-CASCADE: 87.55% for small bowel; EMCAD: 81.49% for HR-CTV). Regarding boundary precision, our method achieves the lowest HD95 (95% Hausdorff Distance) for the bladder, small bowel, and HR-CTV, as well as the lowest HD95 averaged over all five structures, demonstrating enhanced boundary control; for the colon and rectum, EMCAD and UNet, respectively, yield lower HD95 values. In terms of Average Surface Distance (ASD), MDA-TransUnet delivers the best ASD values for all structures except the colon, with the bladder and HR-CTV showing an 8.5% reduction in ASD versus the second-best method. Compared to the classic TransUnet, our method improves DSC by 5.95 (bladder), 14.70 (colon), 9.05 (rectum), 11.67 (small bowel), and 4.93 (HR-CTV) percentage points. These results indicate that the Multi-Scale Adaptive Spatial Attention Gate (MASAG) and Deformable Convolutional Attention Module (DCAM) effectively mitigate limitations of skip connections while enhancing the network's ability to model local details and geometric deformations. Furthermore, Figures 2 and 3 visually compare segmentation results, showing that contours generated by our method exhibit superior agreement with the radiation oncologist-delineated ground truth in shape, volume, and spatial localization.

Figure 2.

Visual Comparison of our Method with Other Methods, Where red Indicates the Ground Truth and Green Represents AI-Based Segmentation Results. Subfigures Correspond to: (a) TransUnet, (b) DLKA-net, (c) SelfRag-Unet, (d) Unet, (e) Trans-CASCADE, (f) EMCAD, and (g) our Method.

Figure 3.

Visual Comparison of our Method Versus Other Approaches for HR-CTV Segmentation in Sagittal and Coronal Planes. Blue Contours Denote Ground Truth, While red Contours Represent AI-Generated Segmentations. Left Panels: Coronal Views; Right Panels: Sagittal Views. (a) TransUnet, (b) DLKA-Net, (c) SelfRag-UNet, (d) UNet, (e) Trans-CASCADE, (f) EMCAD, and (g) our Method.

Table 1.

Quantitative Comparison Between our Proposed Method and Other Methods, Where “our Study” Denotes our Approach.

Metrics	TransUnet²⁴	DLKA-net³⁰	SelfRag-Unet³¹	Unet¹⁰	Trans-CASCADE³²	EMCAD³³	Our Study
Bladder
DSC	88.59 ± 2.79	92.50 ± 3.22	91.93 ± 2.56	91.98 ± 2.37	94.11 ± 1.97	94.20 ± 1.60	94.54 ± 1.54
HD95	3.38 ± 1.03	4.08 ± 4.85	2.56 ± 0.83	2.49 ± 0.79	1.90 ± 0.71	1.99 ± 0.69	1.80 ± 0.87
ASD	1.23 ± 0.40	1.38 ± 0.99	1.03 ± 0.47	0.91 ± 0.30	0.61 ± 0.27	0.59 ± 0.19	0.54 ± 0.21
Colon
DSC	64.57 ± 7.73	70.25 ± 11.11	67.70 ± 10.01	68.46 ± 10.41	77.77 ± 8.00	78.46 ± 8.13	79.27 ± 8.05
HD95	19.08 ± 15.18	14.49 ± 16.83	15.92 ± 13.25	17.57 ± 16.83	11.86 ± 16.37	11.28 ± 14.80	11.66 ± 14.50
ASD	4.99 ± 3.35	4.00 ± 4.59	4.38 ± 3.75	4.86 ± 4.51	2.99 ± 4.61	2.45 ± 2.01	2.74 ± 4.09
Rectum
DSC	70.22 ± 11.24	73.57 ± 7.00	74.76 ± 7.57	75.92 ± 7.41	78.80 ± 6.32	78.41 ± 5.96	79.27 ± 6.79
HD95	6.83 ± 3.03	6.86 ± 3.72	6.39 ± 3.07	5.70 ± 2.67	5.79 ± 3.62	6.02 ± 4.12	6.28 ± 4.24
ASD	1.93 ± 0.64	2.41 ± 1.37	1.91 ± 1.10	1.60 ± 0.74	1.45 ± 0.86	1.56 ± 0.75	1.35 ± 0.77
Small intestine
DSC	77.23 ± 5.09	83.46 ± 4.93	80.64 ± 6.15	81.13 ± 5.98	87.55 ± 4.20	87.17 ± 3.86	88.90 ± 3.64
HD95	5.76 ± 1.73	4.34 ± 1.51	6.00 ± 3.27	5.58 ± 2.74	3.16 ± 1.52	3.40 ± 2.13	2.97 ± 1.44
ASD	1.70 ± 0.46	1.49 ± 0.52	1.89 ± 0.89	1.72 ± 0.79	0.92 ± 0.38	0.96 ± 0.41	0.87 ± 0.48
HR-CTV
DSC	77.42 ± 5.18	77.71 ± 4.76	79.69 ± 4.32	79.15 ± 4.47	80.72 ± 4.02	81.49 ± 4.34	82.35 ± 4.07
HD95	5.06 ± 1.53	4.60 ± 1.60	4.43 ± 1.84	4.53 ± 1.89	4.05 ± 1.56	3.99 ± 1.70	3.77 ± 1.58
ASD	1.94 ± 0.64	1.82 ± 0.54	1.57 ± 0.44	1.64 ± 0.50	1.52 ± 0.52	1.41 ± 0.56	1.29 ± 0.40

Ablation Study

Ablation studies were conducted on the dataset to evaluate the effectiveness of different components in our proposed network. We incrementally removed key modules (MASAG and DCAM) and compared the results. As clearly demonstrated in Table 2, incorporating both MASAG and DCAM modules significantly enhances performance, particularly for HR-CTV segmentation, mitigating performance degradation caused by inter-center variations in contouring protocols and imaging parameters. The MASAG module improves feature representation quality and enhances cross-scale information fusion accuracy, thereby providing more reliable and focused feature inputs to DCAM. The deformable convolution operations in DCAM enable precise geometric adaptation at critical regions emphasized by MASAG (eg, organ boundaries). The resulting deformation-adapted features subsequently deliver more anatomically realistic information to MASAG during later decoding stages. This cascaded processing creates powerful synergies, proving particularly effective when segmenting challenging structures such as morphologically variable colons and boundary-ambiguous HR-CTV. The introduction of MASAG and DCAM collectively strengthens the model's capacity to capture anatomical details, accommodate multi-center heterogeneity, and handle geometric variations, achieving superior performance across all segmentation regions.

Table 2.

Ablation Study (DSC, %).

Components		Segmentation Targets
MASAG	DCAM	Bladder	Colon	Rectum	Small Intestine	HR-CTV
NO	NO	93.37	76.84	77.97	86.38	79.85
YES	NO	93.96	77.39	78.25	87.10	80.94
NO	YES	94.39	78.54	78.47	87.90	80.92
YES	YES	94.54	79.27	79.27	88.90	82.35

Dosimetric Analysis

To evaluate dosimetric accuracy, we compared dose-volume metrics between manual and deep learning-based methods using paired sample t-tests, as summarized in Table 3. No statistically significant differences were observed across all dosimetric parameters. For OARs, the average differences in D2cc and D0.1cc were less than 12% and 15%, respectively. For HR-CTV, the average differences in Dmean and D90% were below 8% and 11%, respectively. Figure 4 illustrates dose distributions for a representative case, comparing our automated segmentation method with manual delineation.
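The paired comparison reported here rests on the paired t statistic, which can be computed as sketched below. The dose values are hypothetical and chosen only to make the arithmetic easy to follow; the sketch returns the statistic and degrees of freedom, and obtaining the p-value additionally requires the Student t distribution (for example via `scipy.stats.ttest_rel`, which is not necessarily the tool the authors used).

```python
import math
import statistics

def paired_t(a, b):
    """Paired-sample t statistic and degrees of freedom for matched measurements."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)          # sample standard deviation (n - 1)
    return mean_d / (sd_d / math.sqrt(n)), n - 1

# Hypothetical per-patient values of one dose metric (arbitrary units).
manual = [2.0, 4.0, 6.0]
auto = [1.0, 2.0, 3.0]
t_stat, dof = paired_t(manual, auto)
```

Because the test works on per-patient differences, it asks whether the mean difference between manual and automatic contours is distinguishable from zero, not whether the two sets of values look alike overall.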

Figure 4.

Delineation of HR-CTV and OARs on CT Slices and Corresponding Dose Distribution Results. Blue Regions Represent the Standard Manual Contours; Green Lines Denote the Contours Segmented by our Method, and Colored Lines Indicate Dose Distributions.

Table 3.

Differences in Dosimetric Parameters Between Manual Methods and our Method Within the Original Clinical Brachytherapy (BT) Plans, Along with Paired t-Test Results.

Structure	Dosimetric Parameters	Differences	P
Bladder	D_5cc	0.1951 ± 0.1607	0.388
	D_2cc	0.2826 ± 0.2277	0.435
	D_0.1cc	0.6186 ± 0.5103	0.505
Colon	D_2cc	0.3173 ± 0.4511	0.614
	D_0.1cc	0.4752 ± 0.6740	0.507
Rectum	D_5cc	0.0975 ± 0.0756	0.962
	D_2cc	0.1160 ± 0.0695	0.158
	D_0.1cc	0.3238 ± 0.2612	0.234
Small intestine	D_2cc	0.5215 ± 0.6269	0.704
	D_0.1cc	1.1644 ± 1.4556	0.571
HR-CTV	D_90%	0.6367 ± 0.5573	0.081
	D_mean	0.9659 ± 0.8036	0.470

External Validation

To evaluate the model's generalization capability on unseen data from another medical institution, this study additionally incorporated CT images from 20 cervical cancer brachytherapy patients at The Affiliated Huaian NO.1 People's Hospital of Nanjing Medical University for independent testing (exclusively for external validation, not involved in training). As demonstrated in Table 4, MDA-TransUnet maintained the best performance in the external validation set, achieving the highest DSC and the lowest HD95 values across all five target regions. Improvements over the second-best methods were most pronounced for the more challenging OARs (colon, rectum, and small bowel). Although yielding second-best ASD values for the bladder and colon, it achieved the best average ASD across all target regions. These results suggest that the synergistic interaction between MASAG and DCAM effectively mitigates impacts caused by inter-center scanner variations and annotation discrepancies, supporting MDA-TransUnet's superior generalization capability.

Table 4.

Performance Comparison among Different Methods in External Validation, with “our Study” Denoting our Proposed Method.

Metrics	TransUnet²⁴	DLKA-net³⁰	SelfRag-Unet³¹	Unet¹⁰	Cascade³²	EMCAD³³	Our Study
Bladder
DSC	86.08 ± 4.60	88.66 ± 4.67	87.27 ± 4.35	87.19 ± 4.83	91.43 ± 3.36	92.40 ± 3.07	92.52 ± 2.92
HD95	2.92 ± 0.71	3.34 ± 2.61	4.22 ± 5.83	2.88 ± 0.88	1.68 ± 0.49	1.58 ± 0.57	1.52 ± 0.51
ASD	1.10 ± 0.25	1.31 ± 1.45	1.30 ± 1.00	1.16 ± 0.39	0.59 ± 0.18	0.50 ± 0.14	0.51 ± 0.18
Colon
DSC	58.58 ± 11.54	64.90 ± 11.07	62.90 ± 9.78	61.61 ± 10.46	73.75 ± 8.03	73.87 ± 7.34	75.58 ± 7.98
HD95	15.31 ± 8.79	12.81 ± 6.35	14.09 ± 7.37	14.71 ± 8.57	9.67 ± 5.78	11.59 ± 6.46	8.86 ± 5.08
ASD	3.70 ± 2.48	3.05 ± 2.09	3.81 ± 2.04	3.83 ± 2.33	1.73 ± 1.05	2.17 ± 0.99	1.98 ± 1.06
Rectum
DSC	68.86 ± 7.64	75.37 ± 8.40	74.59 ± 9.52	73.73 ± 10.18	78.59 ± 7.51	77.80 ± 7.20	79.62 ± 8.96
HD95	5.94 ± 1.96	5.48 ± 4.16	4.41 ± 1.69	4.13 ± 1.80	4.66 ± 3.39	5.17 ± 3.19	3.97 ± 3.54
ASD	2.17 ± 0.58	2.20 ± 1.21	1.49 ± 0.75	1.68 ± 1.19	1.33 ± 0.88	1.42 ± 1.01	1.32 ± 1.09
Small intestine
DSC	76.11 ± 3.92	81.53 ± 3.46	78.49 ± 4.96	78.12 ± 4.67	86.17 ± 2.58	86.01 ± 2.49	87.76 ± 2.48
HD95	5.05 ± 1.24	3.82 ± 1.46	4.85 ± 1.46	5.46 ± 1.55	2.91 ± 0.86	2.93 ± 0.99	2.67 ± 0.78
ASD	1.58 ± 0.55	1.23 ± 0.35	1.75 ± 0.42	1.97 ± 0.54	0.90 ± 0.27	0.92 ± 0.20	0.84 ± 0.25
HR-CTV
DSC	73.09 ± 6.96	72.99 ± 7.09	76.30 ± 5.06	75.60 ± 5.51	76.76 ± 5.65	78.26 ± 4.70	78.88 ± 4.53
HD95	5.68 ± 3.40	4.91 ± 1.96	4.92 ± 3.67	4.87 ± 3.25	4.67 ± 2.66	4.64 ± 2.13	4.33 ± 2.25
ASD	2.12 ± 1.15	1.88 ± 0.71	1.62 ± 0.89	1.63 ± 0.76	1.66 ± 0.86	1.42 ± 0.45	1.40 ± 0.60

Model Parameters

As shown in Table 5, MDA-TransUnet has the highest parameter count (110.94 M) and the second-highest GPU memory consumption (17500 MiB, after DLKA-net) among the compared models, which is primarily attributable to the added computational complexity of the MASAG and DCAM modules together with the inherent requirements of the TransUnet encoder architecture. However, this computational overhead delivers substantial performance gains, particularly on multi-center data, where the model shows superior generalization capability. Although Unet and SelfRag-Unet have the smallest parameter and GPU memory footprints, their segmentation accuracy lags well behind that of MDA-TransUnet. In terms of inference time, MDA-TransUnet (1.18 min) is on par with the similarly sized TransUnet (1.18 min) and DLKA-net (1.14 min) and is faster than the smaller EMCAD (1.23 min). Crucially, the 26-fold acceleration in segmentation reduces patient waiting time, mitigates the risk of organ displacement, and alleviates patient discomfort, which is of substantial clinical significance.

Table 5.

Model Parameters Comparison.

Model	Params (M)	GPU memory (MiB)	Inference time (min)
TransUnet	105.27	11726	1.18
DLKA-net	101.64	18636	1.14
SelfRag-Unet	17.26	9034	1.42
Unet	17.26	8740	1.38
Cascade	108.31	13846	1.00
EMCAD	26.76	10102	1.23
Our Study	110.94	17500	1.18

Discussion

The combination of external beam radiation therapy and high-dose-rate brachytherapy represents the standard of care for gynecologic cancer treatment, where HDR-BT has proven indispensable and strongly correlates with improved survival rates. Compared to conventional EBRT, brachytherapy's defining feature is its ability to deliver higher radiation doses to tumor regions near the radiation source while effectively sparing normal organs due to rapid dose fall-off. However, brachytherapy faces unique challenges. In HDR-BT, treatment planning must be completed rapidly after applicator insertion, often under time constraints that may introduce human errors. In recent years, deep learning-based methods have emerged as promising solutions to automate workflows, reduce patient wait times, and enhance comfort.

The purpose of this study was to investigate deep learning-based automatic segmentation methods for cervical cancer brachytherapy CT images. The proposed MDA-TransUnet achieved a higher Dice Similarity Coefficient (DSC) than the other methods for all target regions (Table 1). It performed best on the bladder, largely because of the bladder's relatively regular anatomical structure and high contrast, which facilitate the extraction of clear boundary features. Compared to the bladder, the rectum, colon, and small bowel present greater segmentation challenges due to their anatomical and imaging characteristics. While the rectum maintains a relatively stable position, its low contrast on CT scans makes boundary delineation difficult. The colon is particularly challenging because of its low contrast against surrounding adipose and soft tissues on CT images, combined with high deformability in both shape and position. The small bowel's position is substantially influenced by respiratory motion, peristalsis, and bladder/rectal filling status, while its appearance on CT images often blends with other soft tissues, complicating segmentation. Despite these challenges, MDA-TransUnet achieved the highest DSC scores for all three organs: 79.27% for colon, 79.27% for rectum, and 88.90% for small bowel. Guo et al²⁵ first combined a CNN with a Transformer in MFFUNet for segmenting brachytherapy OARs (bladder, colon, rectum), reporting DSCs of 92.65%, 61.86%, and 66.55%, respectively. By comparison, MDA-TransUnet achieved considerably higher DSCs for the colon and rectum. In future work, we will explore more effective contrast enhancement techniques to further improve segmentation performance for both the colon and rectum. Additionally, the HR-CTV segmentation achieved a DSC of 82.35%, outperforming the other methods despite variations in delineation protocols across centers.

Ablation studies (Table 2) demonstrate that incorporating MASAG and DCAM modules significantly enhances performance across all target regions. Particularly for structurally complex areas like colon, small bowel, and HR-CTV, DSC improved by an average of 2.48%. This validates the modules’ effectiveness in enhancing anatomical detail capture while mitigating cross-center variability, ensuring robust performance in multi-center scenarios.

Beyond geometric metrics, dosimetric evaluation remains crucial for automated segmentation.³⁴ Paired t-tests (Table 3) confirmed no statistically significant differences between our automated method and manual delineation across five dosimetric parameters: D5cc, D2cc, D0.1cc, Dmean, and D90%. Notably, despite poorer geometric accuracy in colon segmentation, dosimetric agreement with manual methods remained high, consistent with the findings of Wang et al.³⁵ Dosimetric parameters such as D2cc (the minimum dose received by the most irradiated 2 cm³ of a structure) depend primarily on segmentation accuracy in high-dose regions (typically near applicators or targets). Sufficient overlap between the automatic segmentation and the ground truth in these critical areas can yield comparable dosimetric outcomes, even when contour discrepancies exist in low-dose regions. The DSC, by contrast, measures global volume overlap: shape variations of geometrically complex structures in low-dose, non-critical regions may reduce the DSC without materially affecting dosimetry. Consequently, both geometric and dosimetric metrics are essential for a comprehensive evaluation of automatic segmentation methods. Nevertheless, continued optimization of geometric accuracy (particularly in high-dose regions) remains important, as it could further reduce dosimetric differences while enhancing clinicians’ trust in automated results and overall model reliability.
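The decoupling of DSC from D2cc can be illustrated with a toy example. The sketch below assumes a hypothetical one-dimensional row of voxels whose dose falls off with distance from the source; the dose model, the voxel volume, and the function names `dice` and `d_volume_cc` are illustrative assumptions and do not reproduce the study's planning system. For simplicity the volume is rounded up to whole voxels, whereas clinical dose-volume histograms typically interpolate.

```python
import math


def dice(a, b):
    """Dice similarity coefficient between two sets of voxel indices."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def d_volume_cc(doses_gy, voxel_volume_cc, volume_cc=2.0):
    """Minimum dose received by the most irradiated `volume_cc` of a structure."""
    # Number of whole voxels needed to cover the requested volume
    n_voxels = math.ceil(volume_cc / voxel_volume_cc - 1e-9)
    if n_voxels > len(doses_gy):
        raise ValueError("structure is smaller than the requested volume")
    hottest = sorted(doses_gy, reverse=True)[:n_voxels]
    return hottest[-1]  # lowest dose within the hottest sub-volume


# Hypothetical dose map: voxel 0 is closest to the source, dose drops with index
VOXEL_VOLUME_CC = 0.5
dose_map = {i: 10.0 / (1 + i) for i in range(20)}

# The two contours agree near the source but differ in the low-dose region
manual = set(range(0, 12))
auto = set(range(0, 8)) | set(range(12, 16))

dsc = dice(manual, auto)
d2cc_manual = d_volume_cc([dose_map[i] for i in manual], VOXEL_VOLUME_CC)
d2cc_auto = d_volume_cc([dose_map[i] for i in auto], VOXEL_VOLUME_CC)
print(f"DSC = {dsc:.3f}, D2cc manual = {d2cc_manual} Gy, D2cc auto = {d2cc_auto} Gy")
```

In this toy case the contours overlap in only two thirds of their voxels, yet both contain the same four hottest voxels, so D2cc is identical for the two contours. This is the mechanism described above, not a quantitative model of the reported results.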

To further assess generalization capability, CT images of 20 cervical cancer brachytherapy patients from The Affiliated Huaian NO.1 People's Hospital of Nanjing Medical University were tested independently (external validation). As shown in Table 4, MDA-TransUnet achieved the highest DSC and the lowest HD95 for all five targets (bladder, colon, rectum, small bowel, and HR-CTV) on this unseen dataset. For the challenging colon segmentation, it exceeded the second-best DSC by 1.71 percentage points (75.58% vs 73.87% for EMCAD) and reduced HD95 by 0.81 mm relative to the second-best value (8.86 mm vs 9.67 mm for Cascade). These results demonstrate robust adaptability to multi-center heterogeneity (eg, varying CT protocols and contouring practices). Although its ASD was second-best for the bladder and colon, the model achieved the best average ASD across all targets, supporting the accuracy of its boundary segmentation. Compared to other models, MDA-TransUnet exhibits superior performance stability and generalization capability on independent cross-center data. Our approach, which incorporates data augmentation, multi-center training, and robustness-enhancing modules (MASAG and DCAM), effectively mitigates inter-center annotation variations. Future efforts should establish refined contouring guidelines and cross-institutional consistency reviews to standardize the ground truth.
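The margins over the runner-up can be recomputed directly from the colon rows of Table 4. The short sketch below copies the mean values from that table; the helper name `margin_over_runner_up` is hypothetical and serves only to make the arithmetic explicit.

```python
# Mean colon results copied from Table 4 (external validation)
colon_dsc = {
    "TransUnet": 58.58, "DLKA-net": 64.90, "SelfRag-Unet": 62.90, "Unet": 61.61,
    "Cascade": 73.75, "EMCAD": 73.87, "Our Study": 75.58,
}
colon_hd95 = {
    "TransUnet": 15.31, "DLKA-net": 12.81, "SelfRag-Unet": 14.09, "Unet": 14.71,
    "Cascade": 9.67, "EMCAD": 11.59, "Our Study": 8.86,
}


def margin_over_runner_up(scores, ours="Our Study", higher_is_better=True):
    """Return (best competing method, improvement of `ours` over it)."""
    others = {name: value for name, value in scores.items() if name != ours}
    pick = max if higher_is_better else min
    runner_up = pick(others, key=others.get)
    gap = scores[ours] - others[runner_up]
    # Report the improvement as a positive number for both metric directions
    return runner_up, gap if higher_is_better else -gap


dsc_rival, dsc_gain = margin_over_runner_up(colon_dsc, higher_is_better=True)
hd_rival, hd_gain = margin_over_runner_up(colon_hd95, higher_is_better=False)
print(f"DSC: +{dsc_gain:.2f} points vs {dsc_rival}")   # DSC: +1.71 points vs EMCAD
print(f"HD95: -{hd_gain:.2f} mm vs {hd_rival}")        # HD95: -0.81 mm vs Cascade
```

Note that the runner-up differs by metric (EMCAD for DSC, Cascade for HD95), so the two margins are measured against different competing methods.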

This study has several limitations. First, obtaining fully annotated multi-center cervical cancer brachytherapy CT datasets remains challenging, and region-of-interest delineation varies between centers. Second, the dataset is still relatively small. Although multi-center data and independent external validation were employed, the small sample size may introduce bias and constrain the model's learning capacity. Finally, while MDA-TransUnet achieves excellent segmentation performance, its high parameter count and computational complexity may hinder clinical deployment in resource-constrained settings. Future work will (1) expand the dataset with additional multi-center cases, (2) develop lightweight model compression strategies to improve computational efficiency, and (3) optimize clinical applicability.

Conclusions

This study proposes MDA-TransUnet, a deep learning method leveraging a CNN-Transformer hybrid architecture to automate the segmentation of HR-CTV and OARs in planning CT images for cervical cancer brachytherapy. Evaluation results demonstrate that our method outperforms other state-of-the-art (SOTA) approaches and exhibits no statistically significant differences from manual delineation across five dosimetric metrics. The proposed automated segmentation framework facilitates the automation of cervical cancer brachytherapy workflows, enhances treatment consistency, reduces post-applicator-insertion waiting times, and alleviates patient discomfort.

Footnotes

Abbreviations

Acknowledgements

Not applicable.

ORCID iDs

Dezheng Cao

Heng Zhang

Qianjia Huang

Xinye Ni

Ethics Approval and Consent to Participate

This study was approved by the Medical Ethics Committee of Changzhou No.2 People's Hospital Affiliated to Nanjing Medical University (Approval number: 2024KY213-01), the Medical Ethics Committee of Tumor Hospital Affiliated to Nantong University (Approval number: 2020-031) and the Third Affiliated Hospital of Nanjing Medical University (Approval number: ITT2024101). The requirement for informed consent was waived for all participants with the approval of the Medical Ethics Committees. The research was conducted in accordance with the principles embodied in the Declaration of Helsinki and local statutory requirements. All authors have reviewed research data.

Consent for Publication

Not applicable.

Author Contributions

CDZ conceived the study, designed the experiments, analyzed the data and wrote the manuscript. Data were collected by JJH, HJH, YB, STT, KY. NXY edited the manuscript. LC, QCJ, and XK supervised the data analysis. NXY reviewed literature, contributed to the manuscript and acquired the financial support for the project. ZH, SLT, HQJ, and CDZ performed the statistical analyses. All the authors accessed the study data and reviewed and approved the final manuscript.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is supported by the National Natural Science Foundation of China (No. 62371243), Jiangsu Provincial Medical Key Discipline Cultivation Unit of Oncology Therapeutics (Radiotherapy) (No. JSDW202237), Changzhou Social Development Program (Nos. CE20235063 and CJ20244020), and Postgraduate Research & Practice Innovation Program of Jiangsu Province (No. JX13614239).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability

The datasets generated and/or analyzed during the current study are not publicly available due to protection of individual patient privacy and the use of an in-house software but are available from the corresponding author on reasonable request.

References

1. Sung, Ferlay, Siegel, et al. Global cancer statistics 2020: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2021;71(3):209-249.

2. Chino, Annunziata, Beriwal, et al. Radiation therapy for cervical cancer: Executive summary of an ASTRO clinical practice guideline. Pract Radiat Oncol. 2020;10(4):220-234.

3. Han, Milosevic, Fyles, et al. Trends in the utilization of brachytherapy in cervical cancer in the United States. Int J Radiat Oncol Biol Phys. 2013;87(1):111-119.

4. Riegel, Antone, Zhang, et al. Deformable image registration and interobserver variation in contour propagation for radiation therapy planning. J Appl Clin Med Phys. 2016;17(3):347-357.

5. Fujimoto, von Eyben, Usoz, et al. Improving brachytherapy efficiency with dedicated dosimetrist planners. Brachytherapy. 2019;18(1):103-107.

6. Boldrini, Bibault, Masciocchi, et al. Deep learning: A review for the radiation oncologist. Front Oncol. 2019;9:977.

7. Meyer, Noblet, Mazzara, et al. Survey on deep learning for radiotherapy. Comput Biol Med. 2018;98:126-146.

8. Sahiner, Pezeshk, Hadjiiski, et al. Deep learning in medical imaging and radiation therapy. Med Phys. 2019;46(1):e1-e36.

9. Shi, Chen, et al. Artificial intelligence in high-dose-rate brachytherapy treatment planning for cervical cancer: A review. Front Oncol. 2025;15:1507592.

10. Ronneberger, Fischer, Brox. U-net: convolutional networks for biomedical image segmentation. Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer International Publishing, 2015: p. 234-241.

11. Zhou, Rahman Siddiquee, Tajbakhsh, et al. Unet++: a nested u-net architecture for medical image segmentation. In Deep learning in medical image analysis and multimodal learning for clinical decision support: 4th international workshop, DLMIA 2018, and 8th international workshop, ML-CDS 2018, held in conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, proceedings 4. Springer International Publishing, 2018: p. 3-11.

12. Huang, Lin, Tong, et al. Unet 3+: a full-scale connected unet for medical image segmentation. In ICASSP 2020-2020 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2020: p. 1055-1059.

13. Lou, Guan, Loew. DC-UNet: rethinking the U-net architecture with dual channel efficient CNN for medical image segmentation. In Medical Imaging 2021: Image Processing. SPIE, 2021, 11596: p. 758-768.

14. Isensee, Jaeger, Kohl SAA, et al. nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat Methods. 2021;18(2):203-211.

15. Zhu, Zhang, et al. A deep learning-based self-adapting ensemble method for segmentation in gynecological brachytherapy. Radiat Oncol. 2022;17(1):152.

16. Zhang, Yang, Jiang, et al. Automatic segmentation and applicator reconstruction for CT-based brachytherapy of cervical cancer using 3D convolutional neural networks. J Appl Clin Med Phys. 2020;21(10):158-169.

17. Chang, Lin, Wang, et al. Image segmentation in 3D brachytherapy using convolutional LSTM. J Med Biol Eng. 2021;41(5):636-651.

18. Cao, Wang, Chen, et al. Swin-unet: unet-like pure transformer for medical image segmentation. In European Conference on Computer Vision. Springer Nature Switzerland, 2022: p. 205-218.

19. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

20. Dong, Wang, Fan, et al. Polyp-pvt: Polyp segmentation with pyramid vision transformers. arXiv preprint arXiv:2108.06932, 2021.

21. Wang, Huang, Tang, et al. Stepwise feature fusion: local guides global. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer Nature Switzerland, 2022: p. 110-120.

22. Wang, Cao, Wang, et al. Uctransnet: Rethinking the skip connections in u-net from a channel-wise perspective with transformer. Proc AAAI Conf Artif Intell. 2022;36(3):2441-2449.

23. Zhang, Liu. Transfuse: fusing transformers and cnns for medical image segmentation. In Medical image computing and computer assisted intervention–MICCAI 2021: 24th international conference, Strasbourg, France, September 27–October 1, 2021, proceedings, Part I 24. Springer International Publishing, 2021: p. 14-24.

24. Chen, et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021.

25. Guo, Zhang, et al. MFFUNet: A hybrid model with cross-attention-guided multi-feature fusion for automated segmentation of organs at risk in cervical cancer brachytherapy. Comput Med Imaging Graph. 2025;124:102571.

26. Zhang, Zhao, Gay, et al. Weaving attention U-net: A novel hybrid CNN and attention-based method for organs-at-risk segmentation in head and neck CT images. Med Phys. 2021;48(11):7052-7062.

27. Liu, Guan, et al. Development and validation of a deep learning algorithm for auto-delineation of clinical target volume and organs at risk in cervical cancer radiotherapy. Radiother Oncol. 2020;153:172-179.

28. Swamidas, Mahantshetty. ICRU report 89: prescribing, recording, and reporting brachytherapy for cancer of the cervix. 2017.

29. Kolahi, Chaharsooghi, Khatibi, et al. MSA²Net: Multi-scale Adaptive Attention-guided Network for Medical Image Segmentation. arXiv preprint arXiv:2407.21640, 2024.

30. Azad, Niggemeier, Hüttemann, et al. Beyond self-attention: deformable large kernel attention for medical image segmentation. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2024: p. 1287-1297.

31. Zhu, Chen, Qiu, et al. Selfreg-unet: self-regularized unet for medical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer Nature Switzerland, 2024: p. 601-611.

32. Rahman, Marculescu. Medical image segmentation via cascaded attention decoding. In Proceedings of the IEEE/CVF winter conference on applications of computer vision. 2023: p. 6222-6231.

33. Rahman, Munir, Marculescu. Emcad: efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: p. 11769-11779.

34. Yoganathan, Paul, Paloor, et al. Automatic segmentation of magnetic resonance images for high-dose-rate cervical cancer brachytherapy using deep learning. Med Phys. 2022;49(3):1571-1584.

35. Wang, Chen, et al. Evaluation of auto-segmentation for brachytherapy of postoperative cervical cancer using deep learning-based workflow. Phys Med Biol. 2023;68(5):055012.

MDA-TransUNet: A Deep Learning-Based Automatic Segmentation Method for Cervical Cancer Brachytherapy
