Sage Journals: Discover world-class research

Abstract

Introduction: Currently, the incidence of liver cancer is on the rise annually. Precise identification of liver tumors is crucial for clinicians to strategize the treatment and combat liver cancer. Thus far, liver tumor contours have been derived through labor-intensive and subjective manual labeling. Computers have gained widespread application in the realm of liver tumor segmentation. Nonetheless, liver tumor segmentation remains a formidable challenge owing to the diverse range of volumes, shapes, and image intensities encountered. Methods: In this article, we introduce an innovative solution called the attention connect network (AC-Net) designed for automated liver tumor segmentation. Building upon the U-shaped network architecture, our approach incorporates 2 critical attention modules: the axial attention module (AAM) and the vision transformer module (VTM), which replace conventional skip-connections to seamlessly integrate spatial features. The AAM facilitates feature fusion by computing axial attention across feature maps, while the VTM operates on the lowest resolution feature maps, employing multihead self-attention, and reshaping the output into a feature map for subsequent concatenation. Furthermore, we employ a specialized loss function tailored to our approach. Our methodology begins with pretraining AC-Net using the LiTS2017 dataset and subsequently fine-tunes it using computed tomography (CT) and magnetic resonance imaging (MRI) data sourced from Hubei Cancer Hospital. Results: The performance metrics for AC-Net on CT data are as follows: dice similarity coefficient (DSC) of 0.90, Jaccard coefficient (JC) of 0.82, recall of 0.92, average symmetric surface distance (ASSD) of 4.59, Hausdorff distance (HD) of 11.96, and precision of 0.89. For AC-Net on MRI data, the metrics are DSC of 0.80, JC of 0.70, recall of 0.82, ASSD of 7.58, HD of 30.26, and precision of 0.84. Conclusion: The comparative experiments highlight that AC-Net exhibits exceptional tumor recognition accuracy when tested on the Hubei Cancer Hospital dataset, demonstrating highly competitive performance for practical clinical applications. Furthermore, the ablation experiments provide conclusive evidence of the efficacy of each module proposed in this article. For those interested, the code for this research article can be accessed at the following GitHub repository: https://github.com/killian-zero/py_tumor-segmentation.git.

Keywords

liver tumor segmentation deep learning transformers convolutional neural network fine-tuning

Introduction

In recent times, liver cancer has emerged as a significant health threat, contributing to a substantial number of annual fatalities.¹ In clinical practice, 3 primary diagnostic modalities for liver cancer detection are routinely employed: computed tomography (CT) images, ultrasonography (ultrasound) images, and magnetic resonance imaging (MRI). The delineation of liver tumor regions typically falls under the purview of manual segmentation by radiologists. However, in contrast to liver segmentation, the task of liver tumor segmentation (LiTS) is widely recognized as more formidable. There are 2 primary reasons for this increased difficulty. Firstly, liver tumors often exhibit irregular shapes and indistinct boundaries, making precise delineation challenging. Secondly, liver tumors can manifest in random spatial locations within the liver, further complicating the segmentation task. Consequently, manual segmentation not only consumes significant time but also heavily relies on the expertise of the radiologist for achieving segmentation accuracy. As a result, the automation of LiTS assumes paramount importance in the realm of clinical practice.

The indistinct contours of certain tumors pose significant challenges in achieving accurate detection and segmentation. To address these difficulties, a plethora of deep learning-based methodologies have emerged in the field of medical image segmentation. Over the past years, U-Net,² for instance, is a well-regarded network architecture within medical image segmentation, characterized by its encoder–decoder structure. To enhance segmentation performance, U-Net incorporates skip connections, enabling the utilization of multiscale information features. U-Net++³ built upon the U-Net framework by enhancing skip connections to amalgamate information from various spatial scales. Besides, with the development of based of ultrasound transducer⁴, deep learning is also employed to handle ultrasound localization microscopy⁵. Deep learning has been widely used in the field of medical images, such as COVID-19⁶, etc^7–9. These approaches leverage dilated convolutions with pooling operations^10–13 to capture rich semantic features at different levels. However, they often fall short in describing the spatial and channel relationships between image pixels, a critical aspect of medical image segmentation. Moreover, CNNs are primarily proficient at extracting local information, often struggling with capturing global context. Enter MA-Net¹⁴ and RA-UNet,¹⁵ which introduce an attention mechanism¹⁶ to capture spatial and channel-level attention feature maps. While the attention mechanism yields promising results, it does come at the cost of increased computational demands, necessitating substantial computing resources.

To address the above challenges, we introduce an innovative network architecture designed for LiTS, known as the attention connect network (AC-Net), as illustrated in Figure 1. AC-Net incorporates a self-attention mechanism, specifically employing 2 modules grounded in self-attention principles to capture feature maps dependent on spatial and channel relationships. These modules consist of the axial attention module (AAM) and the vision transformer module (VTM). The AAM merges features extracted from both the encoder and decoder by computing axial attention,¹⁷ while the VTM leverages the vision transformer (ViT)¹⁸ to amalgamate high-level semantic features derived from the encoder–decoder. Within this algorithm, spatial features are effectively integrated through skip connections operating at varying spatial scales. Thanks to the integration of these modules, our proposed network structure excels in handling local multimodal data. The primary contributions of this paper can be summarized as follows:

We introduce a novel network architecture, AC-Net, equipped with an attention mechanism tailored for LiTS.

We employ a pretraining and fine-tuning approach to adapt the network to the multimodal liver data from Hubei Cancer Hospital, encompassing both CT and MRI modalities.

Figure 1.

The overview of attention connect network (AC-Net).

Related Works

CNN and Attention Mechanism

In recent years, there has been an emergence of methods rooted in CNNs aimed at addressing a wide array of tasks. To make the best of the potential of feature maps, Attention U-Net¹⁹ has introduced the use of soft attention gates for fusing spatial information derived from low-level features. Additionally, to enhance segmentation accuracy while minimizing memory usage²⁰ has replaced the conventional skip connections of U-Net with alternative blocks. Moreover, Wu et al²¹ have employed a combination of separable convolutional blocks and U-Net architectures, enabling the capture of context feature channel correlations and higher-level feature information.

Recently, attention models have gained widespread adoption in addressing various deep learning tasks, spanning domains such as natural language processing and image detection. The inspiration behind the attention mechanism stems from human attention patterns. When humans read or listen, they tend to focus on keywords or elements. The attention mechanism can be viewed as a means of learning distinct weights assigned to all pixels in an image or all words in a sentence. A comprehensive conceptualization of the attention mechanism was formulated by Vaswani et al,²² while Xu et al²³ introduced several variations of attention mechanisms, encompassing hard attention, soft attention, global attention, and local attention.

Numerous research endeavors have converged on integrating CNNs and the attention mechanism to tackle medical image processing. For instance, TransUNet²⁴ embraces a hybrid architecture that combines CNN and transformer components, effectively utilizing detailed, high-resolution spatial information from CNN characteristics alongside the global context encoded by the attention mechanism. In a similar stem, Henry et al²⁵ introduced an innovative approach by incorporating an attention gate module within skip connections, enhancing the U-shaped encoder–decoder architecture. Meanwhile, TransBTS²⁶ represents yet another CNN–transformer network tailored for three-dimensional biomedical image segmentation. Diverging from these methodologies, our approach leverages the attention mechanism to seamlessly fuse feature maps at the same resolution within both the encoder and decoder components.

Transformers

The transformer architecture, initially introduced by Vaswani et al,²² achieved remarkable success in the realm of natural language processing (NLP). Subsequently, various researchers have endeavored to adapt transformers for application in the field of computer vision (CV). Notably, Carion et al²⁷ pioneered the use of transformers for object detection, leveraging their ability to extract global context information, resulting in improved accuracy. A significant departure from the traditional U-Net architecture, the ViT employs the encoder using a transformer-based approach, also yielding enhanced performance. Meanwhile, the segmentation transformer²⁸ integrates the transformer as an encoder within the conventional encoder-decoder framework. Furthermore, data-efficient image transformers²⁹ combine the transformer with knowledge distillation techniques to create a more efficient model capable of improved predictions while employing significantly fewer parameters. Liu et al³⁰ introduced the Swin transformer, a novel approach that applies the transformer to various scales and incorporates a shifted window mechanism to limit the scope of attention calculations, thus reducing computational demands. Lastly, Valanarasu et al³¹ extended the application of the transformer to the domain of medical image segmentation, particularly focusing on the axial attention mechanism proposed by Wang et al.¹⁷

Training Strategy

In the realm of deep learning, the abundance of data holds paramount importance. Utilizing datasets with a wealth of examples can significantly enhance the robustness of neural networks. Initially, neural networks relied on direct training, as outlined by Ronneberger et al,² aiming to accumulate sufficient data. However, it became evident that not all types of data were readily available in ample quantities. Subsequently, the training landscape evolved with the emergence of pretraining and fine-tuning methods, Yanai and Yoshiyuki³² offering the promise of improved network performance. Few studies continue to employ direct training, primarily due to data scarcity, as direct training with insufficient data tends to exacerbate the risk of overfitting. More recently, Contrastive learning has gained rapid prominence as a training strategy that operates without supervision, eliminating the need for labels. Wu et al³³ pioneered the application of contrastive learning, sparking a surge of related research efforts, including works such as Ye et al,³⁴ Oord et al,³⁵ and Tian et al.³⁶

Materials and Method

Datasets and Preprocessing

We initially trained the AC-Net using the LiTS challenge dataset.³⁷ This dataset comprises 131 training and 70 test CT scan images, providing ground truth annotations for liver and tumor contours. Subsequently, we undertake fine-tuning using data sourced from Hubei Cancer Hospital, encompassing 2 modalities: CT and MRI. Specifically, the CT dataset comprises 473 CT images, while the MRI dataset comprises 256 images of the T1 modality. It's worth noting that the LiTS dataset is formatted in NII, while the local dataset is in DICOM format. To facilitate smoother training of the AC-Net, we uniformly convert all data formats to GZ and standardize the data size to 512 × 512 pixels.

Attention Connect Network

The schematic overview of our proposed AC-Net for LiTS is visually represented in Figure 1. Initially, CNN is employed for the customary task of feature extraction. Subsequently, feature maps are amalgamated through different modules operating at varying scales. AC-Net incorporates 2 crucial modules: the AAM and the VTM. AAM leverages an axial attention mechanism to merge features of matching dimensions, thereby maximizing the utilization of spatial features extracted by CNN. On the other hand, VTM processes feature maps with the lowest resolution, employing a methodology akin to ViT. Within this network architecture, predicated on the U-Net framework, feature fusion takes place at 2 distinct sizes, 128 × 128 and 32 × 32, facilitated by attention connections. This affords the capability to fuse features across a more extensive receptive field. Additionally, the feature maps undergo up-sampling when subjected to feature fusion. The incorporation of the axial attention mechanism serves to curtail computational demands, preventing the excessive utilization of computational resources. Further elaboration on the AAM and VTM modules will be provided subsequently.

Axial Attention Module

Self-attention²² is the basis of axial attention. Given an input $X \in R^{w \times h \times c}$ with height h, weight w, and channels c, the output at the position f = (i, j) is calculated by

y_{f} = \sum_{s \in S} softma x_{f} (q_{f} k_{s}^{T}) v_{s}

(1)where S denotes the whole space of the pixel of the input, and queries

q_{f} = W_{Q} x_{f}

, keys

k_{f} = W_{K} x_{f}

, and values

v_{f} = W_{V} x_{f}

are linear projections of input

x_{f} \forall_{f} \in S

W_{Q}

W_{K}

, and

W_{V}

could be learned in the period of training.

softma x_{f}

is a softmax function to compute all positions. Nonetheless, it's essential to acknowledge that self-attention tends to consume considerably more computational resources compared to CNN when applied to the same object. Furthermore, self-attention neglects crucial positional information, which holds significant importance, particularly in the context of image segmentation.

Building upon the foundation of self-attention, the concept of axial attention was introduced by Wang et al¹⁷ as a solution to address the previously outlined challenges. More specifically, an axial-attention layer was initially defined along the width axis of an input, resembling a straightforward one-dimensional position-sensitive self-attention mechanism. A similar definition was then applied to the height axis. In essence, axial attention can be succinctly described as follows:

y_{f} = \sum_{s \in S_{1 \times m (f)}} softma x_{f} (q_{f} k_{s}^{T} + q_{f} r_{s - f}^{q} + k_{s} r_{s - f}^{k}) (v_{s} + r_{s - f}^{v})

(2)where

S_{1 \times m (f)}

is the local 1 × m region centered around location f = (i, j).

r_{s - f}^{q}

r_{s - f}^{k}

, and

r_{s - f}^{v}

are the relative encoding for queries, keys, and values and they all are learnable.

After axial attention, Yanai and Yoshiyuki³² proposed a modified block to control the position encoding. It can be written as

y_{f} = \sum_{s \in S_{1 \times m (f)}} softma x_{f} (q_{f} k_{s}^{T} + G_{Q} q_{f} r_{s - f}^{q} + G_{K} k_{s} r_{s - f}^{k}) (G_{V 1} v_{s} + G_{V 2} r_{s - f}^{v})

(3)where

G_{Q}

G_{K}

G_{V 1}

, and

G_{V 2}

are learnable parameters. Additionally, in accordance with the strategy elucidated by Yanai and Yoshiyuki,³² which adopts a local-global training approach, we have introduced modifications by excluding the computation of local attention. The structure of the AAM is visually depicted in Figure 2. Diverging from this framework, we leverage axial attention not for encoding the input image but rather to fuse feature maps. This can be concisely described as follows:

Y = C a t (A A (F M 1), F M 2)

(4)where Y is output, Cat denotes the concatenation, AA denotes the calculating gated axial attention, FM1 and FM2 denote the feature maps with the same scale in the encoder and decoder, respectively.

Figure 2.

The structure of axial attention module (AAM). BlockE denotes encoder block and BlockD denotes decoder block.

Vision Transformer Module

Since the inception of the attention mechanism, its application has proliferated notably in the realm of NLP. In the context of image processing, Dosovitskiy et al¹⁸ introduced a novel approach of segmenting an image into discrete patches, organizing them in a specific sequence, treating these sequenced patches as a sentence, and subsequently processing them using techniques akin to those employed in the NLP domain. This approach can be succinctly articulated as follows:

Z_{0} = [X_{c l a s s}; X_{p}^{1} E; X_{P}^{2} E; \dots; X_{P}^{N} E] + E_{p o s}

(5)

Z_{l}^{'} = M S A (L N (Z_{l - 1})) + Z_{l - 1}, l \in [1, L]

(6)

Z_{l} = M L P (L N (Z_{l}^{'})) + Z_{l}^{'}, l \in [1, L]

(7)

Y = L N (Z_{l}^{0})

(8)where X represents an input image, X_class signifies the output class, PE corresponds to the patch embedding, E_pos represents the position encoding, MSA represents the position encoding, MLP denotes the multilayer perceptron, and LN signifies the layernorm.

Differing from prior approaches, we employ ViT at the lowest resolution level within the network architecture to amalgamate the high-level semantic features of two-dimensional liver tumors, as extracted by the CNN. This can be formally expressed as follows:

Z_{0} = M S A {L N [P E (X)]} + P E (X)

(9)

Z_{1} = F C N {L N (Z_{0})} + Z_{0}

(10)

Z = R (Z_{1})

(11)

Here, FCN stands for the fully connected network, and “R” signifies the operation of reshaping the output into a feature map. It's worth noting that, unlike, Dosovitskiy et al,¹⁸ our patch embedding does not include an output class. To ensure compatibility for feature fusion with those extracted by the CNN, the shape of the VTM's output should match the feature map in the decoder. Therefore, we perform a reshaping operation on the output after the VTM computation. The structure of the VTM is shown in Figure 3.

Loss Function

To optimize the AC-Net, we employ a specialized loss function comprised of several components. The first function is binary cross-entropy with logits. It applies sigmoid activation to the output of the neural network and computes the cross-entropy. This function is formally defined as follows:

L o s s 1 (y, \overset{\land}{y}) = m e a n {w e i g h t * (\overset{\land}{y} * \ln [s (y)] + (1 - \overset{\land}{y}) * \ln [1 - s (y)])}

(12)where y denotes the prediction of the neural network and

\overset{\land}{y}

denotes the ground truth. s denotes the sigmoid function. mean denotes the calculation of the average and weight denotes the weight of the cross-entropy result.

Figure 3.

The structure of the vision transformer module (VTM).

The second function is based on the cross-entropy loss function³⁸ which is defined as

C E (y, \overset{\land}{y}) = - \sum_{i = 1}^{n} y \log (\overset{\land}{y})

(13)The second function is defined as

L o s s 2 = m e a n [(1 - e^{C E (y, \overset{\land}{y})}) * C E (y, \overset{\land}{y})]

(14)The total loss function is defined as

L_{t o t a l} = 0.5 * (L o s s 1 + L o s s 2) - \frac{d i c e}{b s} + 1

(15)where

L_{t o t a l}

is the loss function of AC-Net,

L o s s 1

and

L o s s 2

are shown in Equations 12 and 14, bs denotes the batch size of training, and the calculation about dice is shown in Equation 16.

Experiments and Results

Evaluation Metrics

To evaluate the performance of our AC-Net, we use the following metrics, including the dice similarity coefficient (DSC), Jaccard coefficient (JC), recall, average symmetric surface distance (ASSD), Hausdorff distance (HD), and precision.

The DSC score function is defined as

DSC = \frac{2 | A \cap B |}{| A | + | B |}

(16)where A denotes the prediction and B denotes the ground truth.

The JC score function is defined as:

JC = \frac{T P}{T P + F P + F N}

(17)The recall function is defined as:

Recall = \frac{T P}{T P + F N}

(18)The precision function is defined as:

Precision = \frac{T P}{T P + F P}

(19)where TP represents the total number of correct predictions as the target, FP represents the total number of incorrect predictions as the target, and FN represents the total number of incorrect predictions as background.

The HD represents the maximum distance between the predicted segmentation region boundary and the real tumor region boundary, and the smaller the value, the smaller the segmentation error and the better the quality of liver tumor boundary segmentation. This function is defined as

d_{H} (X, Y) = max {d_{X Y}, d_{Y X}} = max {max_{x \in X} min_{y \in Y} d (x, y), max_{y \in Y} min_{x \in X} d (x, y)}

(20)where X and Y are 2 nonempty sets in one space.

ASSD measures the distance between the surfaces of the prediction and the ground truth. It is defined as

ASSD (X, Y) = \frac{1}{| N (X) | + | N (Y) |} (\sum_{V_{x} \in N (X)} d (V_{x}, N (Y)) + \sum_{V_{y} \in N (Y)} d (V_{y}, N (X)))

(21)where N(X) denotes the set of surface voxels of X and N(Y) denotes the set of surface voxels of Y.

d (V, N (Y))

denotes the shortest distance of a voxel V to N(Y).

Pretraining and Fine-Tuning

Our experimental setup was executed on a Huawei cloud server equipped with an 8-core CPU and 64 GB of memory, complemented by an NVIDIA P100 GPU. The computations were conducted using PyTorch 1.10 on an Ubuntu 18.04 operating system.

Research indicates that transformers necessitate a considerably larger volume of data compared to CNNs. Unfortunately, acquiring sufficient clinical data from hospitals poses a substantial challenge, impeding the training of transformers. To address the issue of data scarcity, we have devised a strategy involving pretraining and fine-tuning. In this approach, we initially train the network using open-source datasets and save the parameters. Subsequently, we employ local data to fine-tune the network with a limited number of epochs after loading these saved parameters.

Our experiments have demonstrated that this training strategy yields favorable outcomes while demanding fewer computational resources. This efficacy stems from the fact that neural networks can rapidly acquire crucial target features from open-source datasets, a more efficient approach compared to training an entirely new network from scratch.

To begin, we initiated the pretraining process utilizing the LiTS dataset. This dataset was partitioned into 3 subsets, namely, a training set, a validation set, and a test set, with a distribution ratio of 7:2:1. During this phase, the batch size, number of epochs, learning rate, and momentum for Adagrad were configured at 4, 60, 0.01, and 0.9, respectively. Subsequently, fine-tuning was conducted using the local dataset. Similar to the previous phase, the local dataset was segmented into training, validation, and test sets, following a distribution ratio of 6:2:2. In this stage, the batch size and number of epochs were set to 4 and 40, respectively, while retaining the same learning rate and momentum as before.

Results

Following the training strategy elucidated earlier, we proceeded to fine-tune the neural network using local CT and MRI data. Our liver tumor segmentation results achieve a DSC of 0.90, JC of 0.82, recall of 0.92, precision of 0.89, HD of 11.96, ASSD of 4.59 on local CT data and a DSC of 0.80, JC of 0.70, recall of 0.82, precision of 0.84, HD of 30.26, ASSD of 7.58 on local MRI data.

To assess the performance of the proposed method, we conducted a comparative analysis with other methods in the same dataset and environment. This comparison includes the following approaches: (1) U-Net, based on a classical encoder–decoder architecture, exhibits normal performance in segmentation. However, due to its limited utilization of spatial information, the segmentation of tumor edges may appear fuzzy. (2) U-Net++, which utilizes skip links to integrate spatial information from different scales, offers an improvement in spatial information utilization but still falls short of full optimization. (3) SegNet,³⁹ employing a convolutional encoder-decoder architecture that relies on feature mapping instead of upsampling, follows a similar decoder approach. (4) PSPNet,¹³ which first uses CNN to extract feature maps and then employs a pyramid pooling module for result prediction. This method convolves feature maps at multiple resolutions and concatenates them, albeit without computing individual information within each feature map. (5) MedT³¹ improves the axial attention mechanism, adds a gating unit, limits the training range of parameters, and extracts features from the feature map in blocks, achieving good results in medical images.

Figure 4 visually presents a comparison among various methods applied to local CT data. It's evident that, regardless of tumor size, the results achieved by AC-Net appear more intuitively accurate. Table 1 provides a detailed comparison of evaluation metrics for each method on local CT data, further underscoring the superior performance of AC-Net. Figure 5 illustrates the outcomes of each network on local MRI data, while Table 2 furnishes specific evaluation metrics for each network on local MRI data. These results also affirm the commendable performance of AC-Net on MRI data. In AC-Net, axial attention is applied to a feature map with a resolution of 128 × 128, marking the initial utilization of spatial information. Subsequently, an attention calculation is carried out on a feature map with a resolution of 32 × 32, constituting the second instance of leveraging spatial information. This distinctive approach empowers AC-Net to excel in the segmentation of cancer edges when compared to other models.

Figure 4.

The comparison between AC-Net and other methods. (a) Image. (b) to (f) The segmentation results of U-Net, U-Net++, SegNet, PSPNet, and MedT, and (g) and (h) the results of AC-Net and ground truth.

Figure 5.

The comparison between AC-Net and other methods. (a) Image. (b) to (f) The segmentation results of UNet, UNet++, SegNet, PSPNet, and MedT, and (g) and (h) the results of AC-Net and ground truth.

Table 1.

The Evaluation Metrics of Every Method on CT Data.

	DSC	JC	Recall	ASSD	HD	Precision
U-Net	0.86	0.78	0.85	6.62	19.40	0.87
U-Net++	0.87	0.79	0.85	6.11	17.96	0.88
SegNet	0.71	0.60	0.73	14.61	53.41	0.75
PSPNet	0.86	0.79	0.84	6.46	17.46	0.87
MedT	0.88	0.80	0.86	5.58	13.52	0.88
Ours	0.90	0.82	0.92	4.59	11.96	0.89

Abbreviations: DSC, dice similarity coefficient; JC, Jaccard coefficient; ASSD, average symmetric surface distance; HD, Hausdorff distance; PSPNet, pyramid scene parsing network; CT, computed tomography.

Table 2.

The Evaluation Metrics of Every Method on MRI Data.

	DSC	JC	Recall	ASSD	HD	Precision
U-Net	0.44	0.31	0.61	23.68	115.20	0.47
U-Net++	0.47	0.34	0.62	21.00	97.03	0.50
SegNet	0.67	0.54	0.66	17.66	84.07	0.69
PSPNet	0.79	0.69	0.77	8.24	31.54	0.81
MedT	0.74	0.61	0.71	12.68	35.76	0.80
Ours	0.80	0.70	0.82	7.58	30.26	0.84

Tables 1 and 2 show the comparison between AC-Net and other methods on the local CT data and MRI data. The disparities in lesion contrasts between CT and MRI data result in varying levels of accuracy between the 2 modalities. Within the encoder, AC-Net employs axial attention calculations and attention calculations separately for feature maps with resolutions of 128 × 128 and 32 × 32, effectively leveraging spatial information to achieve higher accuracy compared to other models. The segmentation result of AC-Net on local CT data is slightly better than other networks. However, the result of AC-Net on local MRI data is better than other networks.

Ablation

To further verify the importance of each module, 4 experiments with different settings were performed, as described in Table 3. The entire ablation experiment is divided into 5 groups: The first group involves the removal of AAM. The second group involves the removal of VTM. The third group utilizes a different loss function (BCELogits loss). The fourth group employs a different number of attention heads (utilizing 16 attention heads in VTM, noting that the standard number of attention heads is 8). The fifth group represents the standard AC-Net configuration. The results and evaluation metrics are shown as follows, Figure 6 and Table 4 show the results of local CT data and Figure 7 and Table 5 show the results of local MRI data.

Figure 6.

The results of the ablation experiments. (a) Image. (b) to (e) The segmentation results of setting1, setting2, setting3, and setting4, and (g) ground truth.

Figure 7.

The results of the ablation experiments. (a) Image. (b) to (e) The segmentation results of setting1, setting2, setting3, and setting4, and (g) ground truth.

Table 3.

The Ablation Experiments.

Model set	Network
setting1	Without VTM
setting2	Without AAM
setting3	With BCELogits loss function
setting4	With 16 heads of VTM

Abbreviations: VTM, vision transformer module; AAM, axial attention module.

Table 4.

The Evaluation Metrics of Ablation Experiments on CT Data.

	DSC	JC	Recall	ASSD	HD	Precision
Without VTM	0.85	0.77	0.91	7.11	19.03	0.83
Without AAM	0.72	0.60	0.80	30.80	108.65	0.66
With BCELogits	0.84	0.76	0.87	8.00	21.38	0.82
With 16 heads	0.86	0.78	0.89	6.40	18.05	0.86
Ours	0.90	0.82	0.92	4.59	11.96	0.89

Table 5.

The Evaluation Metrics of Ablation Experiments on MRI Data.

	DSC	JC	Recall	ASSD	HD	Precision
Without VTM	0.81	0.70	0.78	9.20	30.52	0.81
Without AAM	0.56	0.41	0.60	13.72	158.90	0.58
With BCELogits	0.77	0.68	0.79	9.89	40.28	0.81
With 16heads	0.79	0.69	0.80	8.20	35.84	0.82
Ours	0.80	0.70	0.82	7.58	30.26	0.84

Through the above ablation experiments, it can be observed that AAM, VTM, and supervision could improve the accuracy of LiTS.

Discussion

In this research endeavor, we have introduced a pioneering attention model, AC-Net, designed for the prediction of tumors in both CT and MRI scans, with the aim of providing invaluable assistance to clinicians in their clinical practice. In contrast to the conventional CNN–transformer architecture, AC-Net adopts the transformer as a feature fusion module rather than an encoder, incorporating the axial attention mechanism and ViT. Our approach encompasses 2 crucial components: Firstly, feature maps are fused by computing axial attention across extended distances. This strategic integration leverages global information comprehensively, thereby enhancing the precision of LiTS. Simultaneously, the application of the axial attention mechanism contributes to a reduction in computational demands, allowing for efficient resource utilization. Secondly, the VTM is employed to process high-level semantic features extracted by the CNN. Attention calculations are performed in accordance with the principles of ViT, and the associated weights are preserved. These high-level semantic features contain global information, and the attention computation effectively harnesses this global information, mitigating issues arising from the limited receptive field of CNNs. With the integration of these modules, our network attains commendable results, achieving a DSC of 0.90, JC of 0.82, recall of 0.92, precision of 0.89, HD of 11.96, and ASSD of 4.59 on local CT data. Similarly, on local MRI data, our network delivers a DSC of 0.80, JC of 0.70, recall of 0.82, precision of 0.84, HD of 30.26, and ASSD of 7.58.

In comparison to other models for medical image segmentation, our network, AC-Net, exhibits commendable performance in the segmentation of liver tumors when applied to local datasets. Ablation experiments further underscore the effectiveness of our approach, highlighting the significant enhancement in LiTS achieved through the incorporation of the axial attention mechanism, vision transformer as feature fusion modules, and an optimized training strategy.

Our AC-Net has enhanced feature extraction and learning capabilities, it can achieve a high score in LiTS while using less computational resources compared to other networks with attention mechanisms. Our experiments exclusively utilized MRI and CT images, without exploring other modalities such as CBCT, ultrasound, or photoacoustic imaging. This implies that the universality of the AC-Net algorithm has not been sufficiently verified. The datasets we used were comprised of clinically standard data, without considering artifacts, noise, respiratory motion, and other sources of interference in imaging. Thus, we cannot effectively demonstrate the robustness of the AC-Net algorithm. Our discussion primarily revolved around segmentation accuracy and did not take into account factors like radiation dosage, postoperative cancer metastasis, patient adaptability, and other clinical considerations. Consequently, the algorithm's potential for clinical intervention in radiotherapy remains unexplored. On the other hand, the functionality of AC-Net is limited in scope. It is designed solely for the detection of malignant liver tumors and is not equipped to identify liver abscesses or liver hemangiomas. In our future work, we aspire to develop more advanced algorithms to further advance the field of AI-driven radiotherapy.

Conclusion

In this article, we introduced AC-Net as a novel approach for the automated segmentation of liver tumors within CT and MRI images. Diverging from established LiTS networks, our methodology involves the utilization of CNN for spatial feature extraction, followed by the incorporation of AAM and VTM to fuse these spatial features across varying scales. Empirical investigations conducted on local datasets substantiate that this fusion of spatial features yields substantial enhancements in tumor detection capabilities.

Footnotes

Acknowledgments

Thanks for the clinical data provided by Hubei Cancer Hospital.

Author Contributions

JS and SL examined the experiment and wrote this article. YD provided help with the data analysis. XX revised this article. WW and BZ provided the research platform.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Ethics Approval

Our study was approved by the Institutional Reviewer Board of Hubei Cancer Hospital (No. LLHBCH2023YN-057).

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Health Commission of Hubei Province scientific research project, National Natural Science Foundation of China, the Shenzhen Basic Science Research, the Natural Science Foundation of Hubei Province (grant number WJ2021M192, 12075095, U22A20259, JCYJ20200109110006136, 2022CFB938).

ORCID iDs

Jiakang Shao

Wei Wei

References

Forner

Jordi

. Biomarkers for early diagnosis of hepatocellular carcinoma. Lancet Oncol. 2012;13(8):750-751. doi: 10.1016/s1470-2045(12)70271-1.

Ronneberger

Philipp

Thomas

. U-net: convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing, 2015. doi: 10.1007/978-3-319-24574-4_28.

Zhou

Rahman

Siddiquee M M

Tajbakhsh

. Unet++: A nested u-net architecture for medical image segmentation. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support: 4th International Workshop, DLMIA 2018, and 8th International Workshop, ML-CDS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 20, 2018, Proceedings 4. Springer International Publishing, 2018. doi: 10.1007/978-3-030-00889-5_1.

Zhang

, et al. Li J, Ma Y, Zhang T, et al. Recent advancements in ultrasound transducer: From material strategies to biomedical applications. BME Frontiers. 2022;2022:9764501. doi: 10.34133/2022/9764501

Luan

Lei

, et al. Deep learning for fast super-resolution ultrasound microvessel imaging. Physics in Medicine & Biology. 2023. doi: 10.1088/1361-6560/ad0a5a

Zhao

Bell

M A L

. A review of deep learning applications in lung ultrasound imaging of COVID-19 patients. BME frontiers. 2022;2022:9780173. doi: 10.34133/2022/9780173

Cao

Pan

, et al. A deep learning approach for detecting colorectal cancer via Raman spectra. BME Frontiers. 2022;2022:9872028. doi: 10.34133/2022/9872028

Shen

Zhang

, et al. A low-cost high-performance data augmentation for deep learning-based skin lesion classification. BME Frontiers. 2022;2022:9765307. doi: 10.34133/2022/9765307

Luan

Wei

Ding

, et al. PCG-net: feature adaptive deep learning for automated head and neck organs-at-risk segmentation. Frontiers in Oncology. 2023;13:1177788. doi: 10.3389/fonc.2023.1177788

10.

Cheng

, et al. Ce-net: context encoder network for 2d medical image segmentation. IEEE Trans Med Imaging. 2019;38(10):2281-2292. doi: 10.1109/tmi.2019.2903562.

11.

Song

Meng

Rodriguez-Paton

, et al. U-next: A novel convolution neural network with an aggregation u-net architecture for gallstone segmentation in CT images. IEEE Access. 2019;7:166823-166832. doi: 10.1019/access.2019.2953934.

12.

Fang

Yan

. Multi-organ segmentation over partially labeled datasets with multi-scale feature abstraction. IEEE Trans Med Imaging. 2020;39(11):3619-3629. doi: 10.1109/tmi.2020.3001036.

13.

Zhao

Shi

, et al. Pyramid scene parsing network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

14.

Fan

Wang

, et al. Ma-net: a multi-scale attention network for liver and tumor segmentation. IEEE Access. 2020;8:179656-179665. doi: 10.1109/access.2020.3025372.

15.

Jin

Meng

Sun

, et al. RA-UNet: A hybrid deep attention-aware network to extract liver and tumor in CT scans. Front Bioeng Biotechnol. 2020;8:1471. doi: 10.3389/fbioe.2020,605132.

16.

Niu

Zhong

. A review on the attention mechanism of deep learning. Neurocomputing. 2021;452:48-62. doi: 10.1016/j.neucom.2021.03.091.

17.

Wang

Zhu

Green

, et al. Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020. Proceedings, Part IV. Cham: Springer International Publishing, 2020. doi: 10.1007/978-3-030-58548-8_7.

18.

Dosovitskiy

Beyer

Kolesnikov

, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. doi: 10.48550/arxiv.2010.11929.

19.

Oktay

Schlemper

Folgoc

L L

, et al. Attention u-net: Learning where to look for the pancreas. arXiv preprint arXiv:1804.03999, 2018. doi: 10.1048550/arxiv.1804.03999.

20.

Liu

, et al. Spatial feature fusion convolutional network for liver and liver tumor segmentation from CT images. Med Phys. 2021;48(1):264-272. doi: 10.1002/mp.14585.

21.

Chen

, et al. FAT-Net: Feature adaptive transformers for automated skin lesion segmentation. Med Image Anal. 2022;76:102327. doi: 10.1016/j.media.2021.102327.

22.

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30:5998–6008.

23.

Kiros

, et al. Show, attend and tell: Neural image caption generation with visual attention. In: International Conference on Machine Learning. PMLR, 2015.

24.

Chen

, et al. Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306, 2021. doi: 10.48550/arxiv.2102.04306.

25.

Henry

Carre

Lerousseau

, et al. Brain tumor segmentation with self-ensembled, deeply-supervised 3D U-net neural networks: a BraTS 2020 challenge solution. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries: 6th International Workshop, BrainLes 2020, Held in Conjunction with MICCAI 2020, Lima, Peru, October 4, 2020, Revised Selected Papers, Part I 6. Springer International Publishing, 2021. doi: 10.1007/978-3-030-72084-1_30.

26.

Wang

Chen

Ding

, et al. Transbts: Multimodal brain tumor segmentation using transformer. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer International Publishing, 2021. doi: 10.1007/978-3-030-87193-2_11.

27.

Carion

Massa

Synnaeve

, et al. End-to-end object detection with transformers. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part I 16. Springer International Publishing, 2020. doi: 10.1007/978-3-030-58452-8_13.

28.

Zheng

Zhao

, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.

29.

Touvron

Cord

Douze

, et al. Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning. PMLR, 2021.

30.

Liu

Lin

Cao

, et al. Swin transformer: Hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.

31.

Valanarasu

Jeya M J

Oza

Hacihaliloglu

, et al. Medical transformer: Gated axial-attention for medical image segmentation. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, September 27–October 1, 2021, Proceedings, Part I 24. Springer International Publishing, 2021. doi: 10.1007/978-3-030-87192-2_4.

32.

Yanai

Yoshiyuki

. Food image recognition using deep convolutional network with pre-training and fine-tuning. In 2015 IEEE International Conference on Multimedia & Expo Workshops (ICMEW). IEEE, 2015. doi: 10.1109/icmew.2015.7169816.

33.

Xiong

S X

, et al. Unsupervised feature learning via non-parametric instance discrimination. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.

34.

Zhang

Yuan

P C

, et al. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019.

35.

Oord

Aaron VD

Yazhe

Oriol

. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. doi: 10.48550/arxiv.1807.03748.

36.

Tian

Dilip

Phillip

. Contrastive multiview coding. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16. Springer International Publishing, 2020. doi: 10.1007/978-3-030-58621-8_45.

37.

Bilic

Christ

H B

, et al. The liver tumor segmentation benchmark (lits). Med Image Anal. 2023;84:10268. doi: 10.1016/j.media.2022.102680.

38.

De B

P-T

Kroese

D P

Mannor

, et al. A tutorial on the cross-entropy method. Ann Oper Res. 2005;134:19-67. doi: 10.1007/s10479-005-5724-z.

39.

Badrinarayanan

Alex

Roberto

. Segnet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans Pattern Anal Mach Intell. 2017;39(12):2481-2495. doi: 10.1109/tpami.2016.2644615.

Attention Connect Network for Liver Tumor Segmentation from CT and MRI Images

Abstract

Keywords

Introduction

Related Works

CNN and Attention Mechanism

Transformers

Training Strategy

Materials and Method

Datasets and Preprocessing

Attention Connect Network

Axial Attention Module

Vision Transformer Module

Loss Function

Experiments and Results

Evaluation Metrics

Pretraining and Fine-Tuning

Results

Ablation

Discussion

Conclusion

Footnotes

Acknowledgments

Author Contributions

Declaration of Conflicting Interests

Ethics Approval

Funding

ORCID iDs

References