
Abstract

Objective

Tooth segmentation of dental panoramic X-ray images is a critical preliminary step for the accurate assessment of orthodontic and restorative treatments. However, these images suffer from poorly defined interdental boundaries and low root-to-alveolar bone contrast, which pose significant challenges to tooth segmentation. In this article, we propose a tooth segmentation method based on multi-feature coordinate position learning.

Methods

For data augmentation, the input images are randomly flipped horizontally and vertically. The method extracts multi-scale tooth features with the designed residual omni-dimensional dynamic convolution, while the designed two-stream coordinate attention module further complements the tooth boundary features. Finally, the two sets of features are fused to enhance both local detail and global contextual information, which enriches and optimizes the feature information.

Results

Two publicly available adult dental datasets, Archive and Dataset and Code, were used in the study. The experimental results on the two datasets were 87.96% and 92.04% for IoU, 97.79% and 97.32% for ACC, and 86.42% and 95.64% for Dice, respectively.

Conclusion

The experimental results show that the proposed network can be used to assist doctors in quickly viewing tooth positions, and we also validate the effectiveness of the proposed two modules in fusing features.

Keywords

Tooth segmentation, deep learning, multi-feature learning, coordinate attention

Introduction

Among the many dental imaging modalities, dental panoramic X-ray images are an efficient and cost-effective means of imaging.^1,2 The main advantage of this technology is that it provides integrated visualization of the entire oral structure, including the teeth, jaws, and surrounding soft tissues. This information enables physicians to conduct more comprehensive and detailed diagnoses, leading to precise lesion localization and resection extent planning. Compared to other more limited imaging methods, panoramic X-ray imaging holds irreplaceable value in diagnosing a broad spectrum of oral diseases: periodontal disease, dental infections, and maxillofacial tumors.⁴⁶ In addition, the technique typically requires a relatively low radiation dose and cost, enhancing its feasibility and availability for clinical applications. The accuracy of image analysis is crucial in the diagnostic and therapeutic processes of dentistry. Manually analyzing panoramic X-ray images is a complex and time-consuming process heavily dependent on the physician's a priori knowledge, even for experienced physicians.³ Therefore, accurate segmentation of dental panoramic X-ray radiographic images is necessary to help junior practitioners lacking clinical experience to improve the accuracy and efficiency of their diagnosis in the preoperative period, and it plays a significant role in the clinical diagnosis and treatment within dentistry.

Dental panoramic X-ray images contain complete information about the teeth and jaws, and the complete and precise segmentation of tooth morphology from them is an important prerequisite for intelligently assisted treatment. Deep learning-based segmentation currently dominates the field and is the subject of extensive research and refinement. Since the pioneering work of U-Net attracted wide attention in image segmentation, more accurate and efficient U-Net-based methods have been explored using multilayer perceptrons (MLPs), convolutional neural networks, and attention mechanisms, leading to schemes such as UNet++, Attention U-Net, GT U-Net, and UNeXt. Existing methods face two challenges when segmenting dental panoramic X-ray images. First, owing to the characteristics of the images themselves, the boundary contours of the teeth are difficult to distinguish and the pixel values of the teeth are close to those of the surrounding area, resulting in low overall contrast. Second, the network structures of these algorithms are not sufficiently able to extract target objects of both small and large sizes, and receptive field information is partially lost during feature propagation, which gives the segmented tooth edges a noticeably jagged appearance.

Based on the above analysis, to better adapt and learn dental panoramic X-ray image data, this article proposes a dental panoramic X-ray image segmentation method based on multi-feature coordinate position learning. To ensure that the algorithm can extract detail features and regional features, we design a residual omni-dimensional convolution module, which constructs the main feature extraction branch and the auxiliary feature complementary branch, and effectively learns both large-size regional feature information and small-size detailed feature information. In addition, to alleviate the differences between the feature encoding network and the feature decoding network and at the same time to obtain more accurate tooth region location and tooth contour information, we designed a two-stream coordinate attention module, and this module adds a new maximum pooling processing stream based on the average pooling processing stream, which accurately locates the location of the tooth target region, and at the same time, learns the positional feature information of the tooth edge contour very well.

Literature review

In recent years, tooth segmentation tasks on panoramic X-ray images have attracted increasing attention, and tooth segmentation tasks can be used to solve difficult dental problems. Teeth segmentation methods can be broadly categorized into two research lines: traditional methods based on original image features and deep learning-based methods driven by deep image features.

Traditional segmentation methods use digital image processing techniques to perform tooth segmentation according to characteristics inherent in the dental panoramic radiograph (e.g. shape and gray value). Commonly used traditional methods include watershed-based,^4,5 threshold-based,^6–8 clustering-based,^9,10 boundary-based,^11,12 and region-based segmentation algorithms.¹³ Watershed-based algorithms simulate topographic water flow and fill local minima to divide the image. Their results depend strongly on the number of gray levels and the choice of threshold, and they are computationally expensive and prone to over-segmentation. Threshold-based algorithms separate the target from the background by analyzing the gray-level variation of the image and setting a threshold; they suit images with large gray-level differences but are limited when the differences are small. Clustering-based algorithms divide the pixels of the image into clusters according to certain rules, so that pixels in the same cluster have similar features and pixels in different clusters have different features. However, this approach is sensitive to noise, performs poorly on complex backgrounds, and its stability depends on the choice of initial values. Boundary-based algorithms look for pixels whose gray values differ markedly from those of their neighbors and link these boundary pixels into a line that forms the object contour, which completes the segmentation. Region growing is a commonly used region-based algorithm: starting from initial seed pixels, it gradually merges the surrounding regions according to set rules until the segmentation is complete. However, this method is sensitive to noise, depends on the initial seed selection, and is easily affected by image quality. In summary, early conventional methods have too many limitations. Therefore, automated methods are important for improving the efficiency and accuracy of tooth segmentation.
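The seed dependence of region growing described above can be illustrated with a minimal sketch. The function name `region_grow`, the 4-connected neighborhood, the fixed gray-value `tolerance` rule, and the toy image are all assumptions made for illustration; they are not taken from the cited works.

```python
from collections import deque


def region_grow(image, seed, tolerance):
    """Grow a region from `seed`, adding 4-connected pixels whose gray value
    differs from the seed's gray value by at most `tolerance`."""
    rows, cols = len(image), len(image[0])
    seed_value = image[seed[0]][seed[1]]
    mask = [[0] * cols for _ in range(rows)]
    mask[seed[0]][seed[1]] = 1
    queue = deque([seed])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and not mask[nr][nc]:
                if abs(image[nr][nc] - seed_value) <= tolerance:
                    mask[nr][nc] = 1
                    queue.append((nr, nc))
    return mask


# Toy gray-level image: a bright 2 x 2 "tooth" on a dark background.
image = [
    [10, 12, 11, 90],
    [11, 200, 210, 95],
    [12, 205, 198, 92],
    [10, 11, 13, 91],
]
tooth_mask = region_grow(image, seed=(1, 1), tolerance=20)
background_mask = region_grow(image, seed=(0, 0), tolerance=5)
```

With the seed placed inside the bright block, only that block is returned; moving the seed to the corner returns part of the background instead, which is the seed sensitivity noted in the text.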

In the field of tooth segmentation, compared with traditional segmentation methods, deep learning-based segmentation methods have stronger tooth feature modeling capabilities and can achieve better segmentation results, while eliminating the need for complex and tedious implementation rule definitions. Therefore, deep learning-based segmentation methods are currently the mainstream segmentation methods, which have been widely studied and improved. However, dental panoramic X-ray images are medical image data, which have special features such as unclear boundaries and low contrast compared with general image data,⁴⁵ so general image segmentation methods are not suitable for medical image data. After fully convolutional network (FCN),¹⁴ Ronneberger et al.¹⁵ proposed U-Net, whose network structure is similar to the U-shape, and mainly consists of encoder, decoder, and skip connection. The encoder is mainly responsible for the extraction of feature information in the image, the decoder is mainly responsible for restoring the image size to the original size, and the skip connection is mainly responsible for ensuring that more low-level feature information in the image is not lost. Although U-Net achieves good segmentation results, its feature extraction capability is still to be improved due to the backward design of the codec convolutional structure. In general, U-Net has become an important research method in the field of medical image segmentation, and its concise and powerful structural model become a highly influential design idea; some subsequent research teams designed mechanisms such as residual connection, dense connection, multiscale, attention, and so on to better improve the performance of the segmentation algorithm. Therefore, exploring U-Net and related networks is an important reference value for the research of this topic.

The residual connection mechanism originates from ResNet,¹⁶ the core of which is an additional data path that skips over several layers of the network and effectively alleviates the loss of gradient information caused by increasing network depth. Inspired by the residual and dense connection mechanisms, Zhou et al.¹⁷ proposed the UNet++ segmentation network. UNet++ builds on the U-shaped network structure but, instead of plain skip connections, establishes multiple sub-networks in the intermediate part, so that the image features extracted by the encoder are processed by the intermediate sub-networks before being handed over to the corresponding decoder. This design connects low-level detail information with high-level abstract features and reduces the large semantic gap between the encoding and decoding stages; however, the fusion of information between the upper and lower layers of UNet++ is insufficient, so the segmentation results are still not fine enough. Liu et al.¹⁸ proposed Res-Unet, which introduced the residual unit structure of ResNet into the encoding and decoding structure, increased the number of layers in the network model, and achieved good segmentation results. Rao et al.¹⁹ constructed LeFUNet based on U-Net by adopting dense connections and combining improved squeeze-and-excitation modules to improve the accuracy of X-ray image segmentation.

The multi-scale mechanism samples the image at different levels, with each level providing an image of a different scale and resolution. Each scale contains distinct, easy-to-learn semantic features, which provide rich decision-making information for image segmentation.²⁰ Building on this mechanism, the authors of CE-Net²¹ argued that enlarging the receptive field helps improve the final segmentation accuracy. They therefore incorporated pyramid pooling and atrous convolution into a U-shaped network composed mainly of ResNet34 models, and proposed the residual-based multi-kernel pooling (RMP) module and the dense atrous convolution (DAC) module based on dense connectivity. The DAC module captures more semantic feature information by embedding multi-scale atrous convolution into multiple cascade branches; the RMP module combines several pooling operations of different sizes to capture sufficient background information from the semantic features produced by the DAC module. Through these two modules, CE-Net captures high-level feature information and reduces the loss of spatial information in the image, yielding better segmentation results. Wang et al.²² proposed ARMS Net and designed an adaptive multi-scale feature extraction module (AMFEM), which allows the receptive field to be adjusted dynamically with the feature map size. Wang et al.²³ borrowed the idea of pyramid pooling, designed a multi-scale feature extraction block, and proposed the multi-path connected network (MCNet), which obtains larger and richer receptive field information. Chen et al.²⁴ proposed a multiscale position-aware network, designed a position-aware module for locating the target pixels, and reduced the feature information gap between the multiscale feature branches with an aggregation module to complete the X-ray image segmentation.

In the recent years of research, researchers widely apply the attention mechanism in image tasks,²⁵ garnering accolades from many scholars. Inspired by the attention mechanism, Oktay et al.²⁶ proposed the Attention U-Net, which introduces an attention module based on the U-Net structure. They place this attention module before splicing the encoding path features and the decoding path features at each layer, and it adjusts the feature information extracted from the encoding path, reduces the importance of the irrelevant feature regions in the image, and provides enhanced highlighting of key feature areas in an image. Li et al.²⁷ proposed GT U-Net, which replaces the encoding and decoding structure with a group transformer with better performance,²⁸ and also reduces the computation of the network model by grouping and bottlenecking structure to finally complete the X-ray image segmentation. Kaya et al.²⁹ combined U-Net and OctConv³⁰ to propose an X-ray image segmentation method with lower memory overhead, and the improved U-Net also achieved better segmentation accuracy. Sheng et al.³¹ used SWin-Unet³² as the network model used in the segmentation algorithm, and the experimental results verified the advantages and future potential of SWin-Unet for X-ray image segmentation. Zhao et al.³³ proposed TSASNet, which divides the X-ray image segmentation task into two phases, with the first phase using the global and local attention modules to coarsely localize the target region, and the second phase using the fully convolutional network to complete the fine segmentation. UNeXt³⁴ proposed the tokenized multilayer perceptron (tokenized MLP) and introduced it into the lower two layers of the U-shaped network structure, replacing the original convolutional part. The larger size of the downsampling amplitude and the use of fewer convolutional layers make UNeXt lightweight, but these operations lose the receptive field information that is useful for the network model.

In recent years, vision transformers³⁵ have rapidly become one of the most promising vision-based models as a powerful visual task-aware model. However, the performance of existing transformers in medical image segmentation tasks is still unsatisfactory. Polyper³⁶ assumes that the localization of the segmented region is accurate; however, in practical applications, false positives or false negatives are often present in medical images, which makes the Polyper perform poorly in dealing with these cases and affects the reliability and accuracy of the segmentation. Swin transformer³⁷ introduces frequent shifting or reshaping operations, which improve the representativeness of the model, but also introduce significant computational latency, limiting its application to real-time or efficient segmentation tasks. The SegFormer³⁸ approach combines a hierarchical transformer encoder with no positional coding and a lightweight decoder, and while it can theoretically improve segmentation performance, even the lightest model may still be too heavy to be practical for some edge devices with limited computational resources.

In summary, since the advent of U-Net, deep learning-based methods have demonstrated excellent segmentation performance. However, deficiencies remain in the design of the network structure: the ability to extract features of tooth objects of variable size needs to be enhanced; the unreasonable use of special convolutions such as atrous convolution leads to serious jaggedness at the edges of the tooth segmentation; and the receptive field information of the network model becomes chaotic, which ultimately limits improvements in tooth segmentation accuracy. Therefore, how to optimize and improve the feature extraction, propagation, and recovery parts of the segmentation network emerges as a key issue in dental panoramic X-ray image segmentation (see Table 1).

Table 1.

Summary of the literature.

Study	Dataset	Images	Methods	Accuracy (%)
16	Capillaries (private)	20,664	Resnet + Unet	91.72
17	Tooth (public)	1612	Unet + se	97.93
20	Chromosome (public)	13,434	Unet + amac + assp	99.99
21	Skin lesion (public)	2000	Unet + mfe	94.5
22	Tooth (public)	1500	Resnet + am + lpm	97.3
25	Tooth root (private)	248	Unet + transformer + fd loss	93.8
29	Tooth (private)	100	Vit + Unet	88.52
31	Tooth (public)	1500	Unet + lstm	96.94
32	Skin lesion (public)	2594	Unet + tok-mlp	90.41
19	Optic disc (public)	2019	Resnet + Unet + rmp + dac	95.45

Model and methodology

This study is an applied research aimed at improving the tooth segmentation technique for dental panoramic X-ray images by using a multi-feature coordinate position learning approach to improve the accuracy and efficiency of segmentation. This study was initiated in April 2023 and concluded in May 2024. The entire research was conducted at Xi'an University of Science and Technology, Xi'an, Shaanxi Province, China.

General architecture of the network model

The overall network architecture of the proposed method, as shown in Figure 1, comprises a residual omni-dimensional convolution module (ROCM), two-stream coordinate attention (TSCA) module, downsampling layer, upsampling layer, and rectification output layer. The feature encoding network, positioned on the left side of the overall network architecture, actively learns tooth and background features layer by layer from images, taking pre-processed dental panoramic X-ray images as its input. The feature decoding network, located on the right side of the architecture, is responsible for progressively restoring image features layer by layer to yield the final segmentation results.

Figure 1.

General network architecture of the proposed method.

First, the designed ROCM is introduced into each layer of the feature encoding network and the feature decoding network. It constructs two branches with different scopes of action and learns large-size regional features and small-size detailed features at the same time, which gives better feature learning and more comprehensive feature recovery. Next, the designed TSCA module is introduced between the feature encoding network and the feature decoding network. It focuses on the target object itself and on the junction between the target object and the background, learns the relationship between these two closely related kinds of features, and then supplies the original feature information of the feature encoding network to the feature decoding network through skip connections. It is worth noting that the output features of the previous layer of the feature decoding network are resized by the upsampling layer and then spliced together with the output of the TSCA module. In addition, the feature encoding network uses the downsampling layer to reduce the size of the feature map, which is achieved by a maximum pooling operation; similarly, the feature decoding network gradually expands the feature map to the input image size through the upsampling layer, which is achieved by bilinear interpolation. Finally, the rectification output layer, which consists of a convolution with a 1 × 1 kernel and a stride of 1 followed by a Sigmoid function, converts the output of the preceding network into the segmentation result for the image.
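As a rough illustration of two of the layers named above, the following sketch implements a maximum pooling downsampling step and a 1 × 1 convolution followed by a Sigmoid. The 2 × 2 pooling window, the weights and bias, and the 0.5 threshold used to binarize the probabilities are assumptions for illustration only; the text does not specify them, and bilinear upsampling is left out.

```python
import math


def max_pool_2x2(feature_map):
    """Downsample a 2-D feature map by keeping the maximum of each 2x2 block."""
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [max(feature_map[r][c], feature_map[r][c + 1],
             feature_map[r + 1][c], feature_map[r + 1][c + 1])
         for c in range(0, cols - 1, 2)]
        for r in range(0, rows - 1, 2)
    ]


def output_layer(channels, weights, bias):
    """1x1 convolution across channels (stride 1) followed by a Sigmoid.
    channels: list of 2-D maps, weights: one scalar per channel."""
    rows, cols = len(channels[0]), len(channels[0][0])
    probabilities = []
    for r in range(rows):
        row = []
        for c in range(cols):
            logit = bias + sum(w * ch[r][c] for w, ch in zip(weights, channels))
            row.append(1.0 / (1.0 + math.exp(-logit)))
        probabilities.append(row)
    return probabilities


features = [
    [1, 2, 0, 0],
    [3, 4, 0, 1],
    [5, 0, 2, 2],
    [0, 6, 2, 9],
]
pooled = max_pool_2x2(features)                      # 4x4 -> 2x2
second_channel = [[0, 0], [0, 0]]                    # hypothetical extra channel
probabilities = output_layer([pooled, second_channel], weights=[1.0, 1.0], bias=-5.0)
mask = [[1 if p >= 0.5 else 0 for p in row] for row in probabilities]  # assumed threshold
```

Each pooling step halves the height and width, so the decoder needs one upsampling step per pooling step to return to the input size.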

Residual omni-dimensional convolution module

Common convolution is very widely used. It has a static convolution kernel whose trained weight parameters are applied to all input samples; that is, the weights of a common convolution kernel are independent of the input. Dynamic convolution is composed differently: it can be regarded as combining multiple convolution kernels according to a rule that usually depends on the input sample. Dynamic convolution can be expressed by equation (1):

y = (α_{w1} W_{1} + \dots + α_{wn} W_{n}) * x

(1)

where x denotes the input feature, y denotes the output feature, $W_{i}$ denotes the i-th convolution kernel, $α_{wi}$ denotes the attention score acting on $W_{i}$, and $*$ denotes the convolution operation. In summary, there are two key components of dynamic convolution: one is the set of convolution kernels $\{W_{1}, \dots, W_{n}\}$, and the other is the attention function used to compute the attention scores $\{α_{w1}, \dots, α_{wn}\}$.

Previous work on dynamic convolution mainly comprises conditionally parameterized convolutions (CondConv)³⁹ and “dynamic convolution” (DyConv).⁴⁰ Both are instances of equation (1), and their attention function implementations are similar, except that DyConv uses a Softmax function while CondConv uses a Sigmoid function. Although CondConv and DyConv improve on common convolution through linear combinations of kernels, they still leave some aspects unconsidered. Specifically, for n given convolution kernels, the corresponding kernel space has four dimensions: the number of input feature channels $c_{in}$, the number of output feature channels $c_{out}$, the spatial size of the convolution kernels $k \times k$, and the number of convolution kernels n. Since the attention functions in both CondConv and DyConv compute only a single attention score $α_{wi}$ per kernel, every weight within a kernel $W_{i}$ is scaled by the same score. In other words, CondConv and DyConv do not take into account the input channel, output channel, and spatial dimensions of the kernel space, so they do not fully exploit its structure, and there is still room for improvement.
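Equation (1) and the Softmax versus Sigmoid difference can be made concrete with a small single-channel sketch. The attention function used here (global average pooling followed by one scalar weight per kernel, `fc_weights`) is a simplified stand-in chosen for illustration, not the implementation used in CondConv or DyConv, and all numbers are made up.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    return [1.0 / (1.0 + math.exp(-v)) for v in logits]


def aggregate_kernels(kernels, scores):
    """Bracketed term of equation (1): sum_i scores[i] * kernels[i]."""
    k = len(kernels[0])
    return [[sum(s * w[r][c] for s, w in zip(scores, kernels)) for c in range(k)]
            for r in range(k)]


def conv2d_valid(x, kernel):
    """Single-channel cross-correlation without padding."""
    k = len(kernel)
    rows, cols = len(x) - k + 1, len(x[0]) - k + 1
    return [[sum(kernel[i][j] * x[r + i][c + j] for i in range(k) for j in range(k))
             for c in range(cols)] for r in range(rows)]


def dynamic_conv(x, kernels, fc_weights, normalise):
    # Hypothetical attention function: global average pooling, then one
    # scalar weight per kernel, then Softmax or Sigmoid normalisation.
    gap = sum(sum(row) for row in x) / (len(x) * len(x[0]))
    scores = normalise([w * gap for w in fc_weights])
    return conv2d_valid(x, aggregate_kernels(kernels, scores)), scores


x = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
kernels = [[[1, 0], [0, 0]], [[0, 0], [0, 1]]]
y_soft, soft_scores = dynamic_conv(x, kernels, [2.0, 1.0], softmax)   # DyConv-style
y_sig, sig_scores = dynamic_conv(x, kernels, [2.0, 1.0], sigmoid)     # CondConv-style
```

Softmax scores compete and sum to one, whereas Sigmoid scores are independent of each other. In both cases a single scalar multiplies the whole kernel, which is the limitation discussed above.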

In view of this, Li et al.¹ proposed omni-dimensional dynamic convolution (ODConv), a new dynamic convolution which, unlike CondConv and DyConv, simultaneously takes into account the number of input feature channels, the number of output feature channels, the spatial size of the convolution kernel, and the number of convolution kernels. Extending equation (1), ODConv can be expressed by equation (2):

y = (α_{w1} ⊙ α_{f1} ⊙ α_{c1} ⊙ α_{s1} ⊙ W_{1} + \dots + α_{wn} ⊙ α_{fn} ⊙ α_{cn} ⊙ α_{sn} ⊙ W_{n}) * x

(2)

where x denotes the input feature, y denotes the output feature, $α_{wi}$ denotes the attention score acting on $W_{i}$, $α_{ci}$ denotes the attention score associated with the number of input feature channels, $α_{fi}$ denotes the attention score associated with the number of output feature channels, $α_{si}$ denotes the attention score associated with the spatial dimensionality of the convolution kernel, and ⊙ denotes multiplication along the corresponding dimension of the kernel space. ODConv introduces a multidimensional attention mechanism into its structure and, following a parallel design strategy, learns along four different dimensions of the convolution kernel space to obtain four independent attention scores.

Figure 2 depicts the structure of ODConv. First, x is compressed by global average pooling (GAP) into a feature vector of length $c_{in}$, which is then passed to four branches. Three of the branches consist of a fully connected layer followed by the ReLU function and a Sigmoid function, and the remaining branch consists of a fully connected layer followed by the ReLU function and a Softmax function. The feature vector of length $c_{in}$ is processed by these four branches: the fully connected layers output feature tensors of sizes $k \times k$, $c_{in} \times 1$, $c_{out} \times 1$, and $n \times 1$ from left to right, and the Sigmoid or Softmax function then converts them into four different attention scores, denoted $α_{si}$, $α_{ci}$, $α_{fi}$, and $α_{wi}$ from left to right.

How the attention scores $α_{si}$, $α_{ci}$, $α_{fi}$, and $α_{wi}$ act on the convolution kernel $W_{i}$ is explored next. As shown in Figure 3, $α_{si}$ assigns a different attention score to each weight parameter at the $k \times k$ spatial positions of $W_{i}$; as shown in Figure 4, $α_{ci}$ assigns a different attention score to each of the $c_{in}$ input feature channels of $W_{i}$; as shown in Figure 5, $α_{fi}$ assigns a different attention score to each of the $c_{out}$ output feature channels of $W_{i}$; and as shown in Figure 6, treating each of the n convolution kernels as a complete block, $α_{wi}$ assigns a different attention score to each block. Compared with an ordinary static convolution kernel, $α_{si}$, $α_{ci}$, $α_{fi}$, and $α_{wi}$ complement the relationship between the convolution kernel and its properties from several perspectives, which significantly enhances the performance of the convolution kernel and allows it to extract more contextual feature information.
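The way the four attention scores scale a kernel in equation (2) can be sketched as follows. A kernel is stored as a nested list indexed `[c_out][c_in][k][k]`; the attention values are hand-picked placeholders rather than outputs of the GAP and fully connected branches, and the helper names are hypothetical.

```python
def modulate_kernel(W, a_s, a_c, a_f, a_w):
    """Scale one kernel W[c_out][c_in][k][k] by its four attention scores:
    a_s[k][k] (spatial), a_c[c_in] (input channel), a_f[c_out] (output
    channel), and the scalar a_w (whole kernel)."""
    return [[[[a_w * a_f[o] * a_c[i] * a_s[r][c] * W[o][i][r][c]
               for c in range(len(W[o][i][r]))]
              for r in range(len(W[o][i]))]
             for i in range(len(W[o]))]
            for o in range(len(W))]


def add_nested(a, b):
    """Element-wise sum of two equally shaped nested lists."""
    if isinstance(a, list):
        return [add_nested(x, y) for x, y in zip(a, b)]
    return a + b


def odconv_kernel(kernels, attentions):
    """Bracketed term of equation (2): the sum of the modulated kernels."""
    total = None
    for W, (a_s, a_c, a_f, a_w) in zip(kernels, attentions):
        modulated = modulate_kernel(W, a_s, a_c, a_f, a_w)
        total = modulated if total is None else add_nested(total, modulated)
    return total


# n = 2 kernels with c_out = 2, c_in = 1, k = 2 (illustrative values only).
W1 = [[[[1, 1], [1, 1]]], [[[1, 1], [1, 1]]]]
W2 = [[[[2, 2], [2, 2]]], [[[2, 2], [2, 2]]]]
attentions = [
    ([[1, 0], [0, 1]], [1.0], [1.0, 0.5], 0.75),   # scores for W1
    ([[1, 1], [1, 1]], [0.5], [1.0, 1.0], 0.25),   # scores for W2
]
combined = odconv_kernel([W1, W2], attentions)
```

The combined kernel would then be convolved with x as in ordinary convolution. Setting every entry of `a_s`, `a_c`, and `a_f` to 1 reduces the sketch to the single-score aggregation of equation (1).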

Figure 2.

Structure of the omni-dimensional convolutional ODConv.

Figure 3.

Positional multiplication operations along the spatial dimension.

Figure 4.

Channel multiplication operations along the input channel dimension.

Figure 5.

Channel multiplication operation along the output channel dimension.

Figure 6.

Kernel-by-kernel multiplication operation along the kernel dimension of the convolutional kernel space.

The feature extraction module of a U-shaped network is closely tied to the final segmentation quality, and a good feature extraction module can obtain richer image feature information. In the U-shaped network this module is composed mainly of ordinary convolution blocks, which cannot effectively adapt to dental panoramic X-ray images with complex features. To address this problem and further enhance the feature learning ability of the network model, this article designs the ROCM. As shown in Figure 7, the ROCM has two branches. The first consists of a twice-repeated sequence of an ODConv with a 3 × 3 kernel, a batch normalization layer, and the ReLU activation function. This branch is the backbone feature extraction flow of the ROCM and can obtain feature information over a larger region, or even global feature information of the whole image. The other branch consists of an ODConv with a 1 × 1 kernel, a batch normalization layer, and the ReLU activation function. This branch is the auxiliary feature complementary flow of the ROCM; it focuses on learning small-size detailed features, which helps to further improve the segmentation of tooth edges. So that the ROCM learns features of variable sizes simultaneously, the outputs of the two branches are summed to obtain the final fused, complementary output features.
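The two-branch layout can be sketched on a single-channel image. This is a heavily simplified stand-in: ordinary static kernels replace ODConv, batch normalization is omitted, and the kernels are arbitrary, so it shows only the data flow (two stacked 3 × 3 convolutions, one 1 × 1 convolution, and a sum), not the proposed module itself.

```python
def conv_same(x, kernel):
    """Single-channel cross-correlation with zero padding (output size = input size)."""
    k = len(kernel)
    pad = k // 2
    rows, cols = len(x), len(x[0])

    def at(r, c):
        return x[r][c] if 0 <= r < rows and 0 <= c < cols else 0.0

    return [[sum(kernel[i][j] * at(r + i - pad, c + j - pad)
                 for i in range(k) for j in range(k))
             for c in range(cols)] for r in range(rows)]


def relu(x):
    return [[max(0.0, v) for v in row] for row in x]


def rocm_sketch(x, k3_first, k3_second, k1):
    # Main branch: (3x3 conv -> ReLU) applied twice; static kernels stand in for ODConv.
    main = relu(conv_same(relu(conv_same(x, k3_first)), k3_second))
    # Auxiliary branch: 1x1 conv -> ReLU, which looks at one pixel at a time.
    aux = relu(conv_same(x, k1))
    # Fuse the two branches by element-wise summation.
    return [[m + a for m, a in zip(main_row, aux_row)]
            for main_row, aux_row in zip(main, aux)]


impulse = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
box = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
fused = rocm_sketch(impulse, box, box, k1=[[2.0]])
```

Two stacked 3 × 3 convolutions give each output pixel a 5 × 5 receptive field, so the main branch spreads the single bright pixel over the whole 3 × 3 image, while the 1 × 1 branch contributes only at the pixel itself.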

Figure 7.

Network structure of the residual omni-dimensional convolution module.

Two-stream coordinate attention module

The skip connection plays an important role in the U-shaped network: it ensures that more low-level feature information of the image is preserved during feature propagation, which helps to recover detailed feature information in the image. However, because of the large feature differences between the contraction path and the expansion path at the same layer, directly splicing the features of the two sides still loses part of the intermediate information, and the channel, coordinate, position, direction, and other information in the image becomes confused. An effective remedy is to embed into the network a module that processes spatial, channel, and other image information and to combine it with the original skip connection, which compensates for the limitations of the plain connection and further improves the feature transfer process. Such modules mainly include SENet² and CBAM.⁴¹ However, SENet focuses only on capturing the channel information of the image and neglects its spatial location information. CBAM processes spatial and channel information at the same time, but it loses the ability to learn relationships between distant pixels. To address these design defects, Hou et al.⁴² proposed an advanced and lightweight network module named coordinate attention (CA). CA combines spatial information such as position and orientation with channel information more efficiently and captures long-range pixel dependencies on the basis of multi-channel information interactions, which ultimately improves the performance of the network module.

The network structure of coordinate attention is shown in Figure 8. Let x denote the input feature map of size C × H × W, where C is the number of channels and H and W are the height and width of the input feature map, respectively. First, the ordinary two-dimensional average pooling is split into two one-dimensional average poolings with different kernel sizes (one pooling kernel of size H × 1 is used as horizontal average pooling; the other, of size 1 × W, is used as vertical average pooling), and the pooling operation is carried out on each channel of the input feature map along the horizontal and vertical directions, respectively. The operation can be described by equations (3) and (4):

Z_{avg,c}^{h}(h) = \frac{1}{W} \sum_{0 \leq i < W} x_{c}(h, i)

(3)

Z_{avg,c}^{w}(w) = \frac{1}{H} \sum_{0 \leq j < H} x_{c}(j, w)

(4)

where $Z_{avg,c}^{h}(h)$ is the output feature of channel c at height h and $Z_{avg,c}^{w}(w)$ is the output feature of channel c at width w. The output feature sizes after horizontal average pooling (X Avg Pool) and vertical average pooling (Y Avg Pool) are $C \times H \times 1$ and $C \times 1 \times W$, respectively. Immediately after that, the results of the two equations are spliced together and processed by a convolution $F_{1 \times 1}$ with a kernel size of 1 × 1 and a stride of 1, as shown in equation (5):

f_{avg} = δ(β(F_{1 \times 1}([Z_{avg}^{h}, Z_{avg}^{w}])))

(5)

where $β$ denotes the batch normalization layer, $δ$ denotes the ReLU6 activation function, $[Z_{avg}^{h}, Z_{avg}^{w}]$ denotes the operation of splicing the two, $f_{avg}$ is the output feature after splicing and convolution, of size $C / r \times 1 \times (W + H)$, and r is the shrinkage factor used to control the size of the network module. Then, as shown in equations (6) and (7), $f_{avg}$ is split into two tensors $f_{avg}^{h}$ and $f_{avg}^{w}$ with sizes $C / r \times H$ and $C / r \times W$, respectively. For these two tensors, two convolutions $F_{h}$ and $F_{w}$ with a kernel size of 1 × 1 and a stride of 1 are defined, and $σ$ denotes the Sigmoid function. $F_{h}$ and $σ$ are used to process $f_{avg}^{h}$ to obtain the weight score $g_{avg}^{h}$ of size $C \times H \times 1$, and the weight score $g_{avg}^{w}$ of size $C \times 1 \times W$ is obtained similarly. Finally, as shown in equation (8), $g_{avg}^{h}$ and $g_{avg}^{w}$ are expanded and multiplied element by element with the input feature map x to obtain the output $y_{avg}$ of the coordinate attention network structure.

\begin{matrix} \begin{matrix} g {avg}^{h} = σ (F_{h} (f {avg}^{h})) \end{matrix} \end{matrix}

(6)

\begin{matrix} \begin{matrix} g {avg}^{w} = \end{matrix} σ (F_{w} (f {avg}^{w})) \end{matrix}

(7)

\begin{matrix} \begin{matrix} yavg = x * gav g^{h} * gav g^{w} \end{matrix} \end{matrix}

(8)

Figure 8.

Network structure of coordinate attention.
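The pooling and reweighting steps of equations (3), (4), and (8) can be illustrated on a single channel. The following is a minimal pure-Python sketch rather than the implementation used in this article: the function names are hypothetical, and the learned layers of equations (5) to (7) (the shared 1 × 1 convolution, batch normalization, ReLU6, and the convolutions F_h and F_w) are replaced by an identity mapping, so the gates here are simply the Sigmoid of the pooled values.

```python
import math


def directional_avg_pool(x):
    """Directional average pooling of one channel x (an H x W nested list).

    z_h[h] averages row h over the width (eq. 3);
    z_w[w] averages column w over the height (eq. 4).
    """
    height, width = len(x), len(x[0])
    z_h = [sum(row) / width for row in x]
    z_w = [sum(x[j][w] for j in range(height)) / height for w in range(width)]
    return z_h, z_w


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def coordinate_reweight(x, g_h, g_w):
    """Eq. (8) for one channel: y[h][w] = x[h][w] * g_h[h] * g_w[w]."""
    return [[x[h][w] * g_h[h] * g_w[w] for w in range(len(x[0]))]
            for h in range(len(x))]


# Toy channel with H = 2 and W = 3.
x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
z_h, z_w = directional_avg_pool(x)  # lengths H and W

# ASSUMPTION: identity stand-in for the learned layers of eqs. (5)-(7).
g_h = [sigmoid(v) for v in z_h]
g_w = [sigmoid(v) for v in z_w]
y_avg = coordinate_reweight(x, g_h, g_w)  # same H x W shape as x
```

Each output pixel is scaled by one gate for its row and one gate for its column, which is how the module ties a channel response to a coordinate position.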

Coordinate attention uses two kinds of average pooling, in the horizontal and vertical directions. After pooling, convolution, and the subsequent operations, the target of interest in the input features is abstracted into the final weight scores, through which the coordinate position of the target object can be restored more accurately. However, the junction between the target object and the background in dental panoramic X-ray images is ambiguous, and the average pooling in coordinate attention takes into account only the tooth object itself, whereas the feature differences at the junction and in the surrounding area also help locate the tooth object and refine its contour. Therefore, in this article, we design the two-stream coordinate attention (TSCA) module, which retains the position information of the tooth object and adds a new network branch built on maximum pooling to capture the feature information at the junction between the tooth object and the background.

The network structure of TSCA is shown in Figure 9. The left side of the figure shows the branch built on average pooling, whose output is denoted as $y_{avg}$. The right side shows the branch built on maximum pooling, where the output feature sizes of horizontal maximum pooling (X Max Pool) and vertical maximum pooling (Y Max Pool) are the same as those of average pooling. The subsequent structure likewise contains splicing, convolution, normalization, activation, and weight score generation, and the output of the right branch is denoted as $y_{max}$. Because the TSCA is located between the feature encoding and decoding networks, this article embeds a skip connection into TSCA to pass the original features of the feature encoding network to the feature decoding network, and the splicing operation takes place at the output of TSCA, as shown in equation (9):

$$y = [x, y_{avg}, y_{max}] \tag{9}$$

where the input features of TSCA are denoted as x, the output features are denoted as y, and the size of y is $3C \times H \times W$.

Figure 9.

Network structure of the two-stream coordinate attention module.
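The two ingredients that TSCA adds to coordinate attention, directional maximum pooling and the three-way splicing of equation (9), can be sketched as follows. This is a hedged illustration in pure Python with hypothetical function names. The branch outputs are stood in for by copies of the input, because the learned layers of each branch are not reproduced here.

```python
def directional_max_pool(x):
    """Directional max pooling of one channel x (an H x W nested list).

    Returns one maximum per row (length H) and one per column (length W),
    matching the output sizes of the average-pooling branch.
    """
    z_h = [max(row) for row in x]
    z_w = [max(col) for col in zip(*x)]
    return z_h, z_w


def tsca_concat(x, y_avg, y_max):
    """Eq. (9): channel-wise splicing [x, y_avg, y_max]."""
    return x + y_avg + y_max  # each argument is a list of channels


# Toy input with C = 2 channels of size H x W = 2 x 3.
x = [
    [[0.1, 0.9, 0.2], [0.3, 0.4, 0.8]],
    [[0.5, 0.5, 0.5], [0.0, 1.0, 0.0]],
]
pooled = [directional_max_pool(channel) for channel in x]

# ASSUMPTION: placeholders for the outputs of the two attention branches.
y_avg = [[row[:] for row in channel] for channel in x]
y_max = [[row[:] for row in channel] for channel in x]

y = tsca_concat(x, y_avg, y_max)  # 3C channels, each still H x W
```

Because the output carries 3C channels, the decoder layer that consumes it has to accept three times the encoder channel count at that level.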

Experimental analysis

Experimental setup and data preprocessing

Dataset

We conducted experiments on two datasets, Archive⁴³ and Dataset and Code,⁴⁴ to demonstrate the effectiveness of our method. Archive is a public dataset containing 116 dental panoramic X-ray images with real segmentation labels, each 3104 × 1200 pixels; because the dataset is public, it does not require a specific ethical approval number. Dataset and Code contains 1500 dental panoramic X-ray images with real segmentation labels, each 1991 × 1127 pixels. The use of this dataset has been approved by the National Research Ethics Committee (CONEP) and the Research Ethics Committee (CEP) under report number 646.050, with an approval date of May 13, 2014. The images are much wider than they are tall, and using the original dental panoramic X-ray images directly as network inputs raises two difficulties. First, training on very large images is challenging for devices with limited hardware resources and is prone to video memory and RAM overflow. Second, input images with unequal height and width sometimes have to be sliced beforehand and fed into the network sequentially, but slicing destroys some of the important pixel relationships in the original images. For these reasons, all images in both datasets are uniformly resized to 512 × 512. Archive is randomly divided into 60 training images and 56 test images, and Dataset and Code is randomly divided into 1350 training images and 150 test images. During training, the input data are randomly flipped horizontally, vertically, or horizontally and then vertically.
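The random split and flip augmentation described above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions, not the article's data pipeline: the function names are hypothetical, the four flip modes are drawn with equal probability (the article does not give the probabilities or say whether an unflipped mode is included), and the same flip is applied to the image and its label mask so that the two stay aligned.

```python
import random


def hflip(img):
    """Mirror an H x W nested list left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror an H x W nested list top to bottom."""
    return img[::-1]


def random_flip(image, mask, rng):
    """Apply one randomly chosen flip to the image and its mask together."""
    # ASSUMPTION: uniform choice over four modes, including "none".
    mode = rng.choice(["none", "h", "v", "hv"])
    if mode in ("h", "hv"):
        image, mask = hflip(image), hflip(mask)
    if mode in ("v", "hv"):
        image, mask = vflip(image), vflip(mask)
    return image, mask, mode


def random_split(items, n_train, rng):
    """Shuffle the items and cut them into a training and a test list."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


rng = random.Random(0)  # fixed seed so the split is reproducible
train_ids, test_ids = random_split(range(116), 60, rng)  # Archive: 60 / 56
```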

Evaluation of indicators

To evaluate the effectiveness of the proposed algorithm, four commonly used evaluation metrics, namely Intersection over Union (IoU), Precision, Accuracy, and Dice Coefficient (Dice) are used as the basis for the analysis. The mathematical definitions of these four indicators are shown below:

$$\mathrm{IoU} = \frac{TP}{TP + FP + FN} \tag{10}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{11}$$

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{12}$$

$$\mathrm{Dice} = \frac{2TP}{2TP + FP + FN} \tag{13}$$

where TP (true positives) counts pixels for which both the prediction and the label are tooth; FP (false positives) counts pixels predicted as tooth whose label is background; TN (true negatives) counts pixels for which both the prediction and the label are background; and FN (false negatives) counts pixels predicted as background whose label is tooth. IoU is the ratio of the intersection to the union of the predicted and real segmentation results, as shown in equation (10). Precision is the proportion of pixels predicted as tooth that are actually tooth, as shown in equation (11). Accuracy is the ratio of the number of correctly predicted pixels to the total number of pixels, as shown in equation (12). Dice measures the similarity between the predicted and real segmentation results, as shown in equation (13).
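Equations (10) to (13) translate directly into code. The sketch below is a small pure-Python illustration with hypothetical function names. It assumes binary masks flattened to sequences of 0 (background) and 1 (tooth) and does not guard against empty denominators.

```python
def confusion_counts(pred, label):
    """Count TP, FP, TN, FN for flat binary masks (1 = tooth, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, label) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, label) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, label) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, label) if p == 0 and t == 1)
    return tp, fp, tn, fn


def segmentation_metrics(tp, fp, tn, fn):
    """Evaluate equations (10)-(13) from the four confusion counts."""
    return {
        "iou": tp / (tp + fp + fn),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "dice": 2 * tp / (2 * tp + fp + fn),
    }


# Toy example with eight pixels.
pred = [1, 1, 1, 0, 0, 0, 1, 0]
label = [1, 1, 0, 0, 0, 1, 1, 0]
counts = confusion_counts(pred, label)   # TP=3, FP=1, TN=3, FN=1
metrics = segmentation_metrics(*counts)  # IoU 0.6, the other three 0.75
```

On a single mask pair the two overlap measures are linked by Dice = 2·IoU / (1 + IoU), which the toy values satisfy (2 × 0.6 / 1.6 = 0.75).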

The hardware environment for the experiments is as follows: the CPU is an AMD Ryzen 5 3600, the GPU is an NVIDIA GeForce RTX 3060 Ti, and the RAM size is 32 GB. The software environment is Windows 10 (64-bit), Python 3.7, PyTorch 1.7.0+cu110, and CUDA 11.5. The network training settings are as follows: the loss function is BCEWithLogitsLoss, the number of training rounds is 200, the optimizer is Adam, the batch size is 1, and the initial learning rate is 0.001. The change in the loss value on the training set is shown in Figure 10; by round 180, the training loss curve has largely converged.

Figure 10.

Segmentation results of different methods on Archive dataset.
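For readers unfamiliar with the loss named in the training settings, the sketch below shows what a binary cross-entropy on raw logits computes for each pixel. It is a pure-Python illustration with hypothetical function names, written in the usual numerically stable form; it is not the PyTorch implementation itself. The dictionary simply collects the settings stated in the text under assumed key names.

```python
import math


def bce_with_logits(logit, target):
    """Binary cross-entropy on a raw logit, in a numerically stable form.

    Mathematically equal to -[t*log(s) + (1-t)*log(1-s)] with s = sigmoid(logit),
    but avoids overflow for logits of large magnitude.
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def mean_bce_with_logits(logits, targets):
    """Average the per-pixel loss over a flat list of pixels."""
    return sum(bce_with_logits(z, t) for z, t in zip(logits, targets)) / len(logits)


# Settings as stated in the text; the key names are hypothetical.
train_config = {
    "loss": "BCEWithLogitsLoss",
    "optimizer": "Adam",
    "rounds": 200,
    "batch_size": 1,
    "initial_learning_rate": 0.001,
}

# An uninformative logit of 0 costs log(2) regardless of the label.
loss_at_zero = bce_with_logits(0.0, 1.0)
```

Because the Sigmoid is folded into the loss, the network's last layer outputs raw logits, and a threshold on the Sigmoid output is applied only when a binary mask is needed.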

Comparative analysis of related algorithms

Since few deep learning-based tooth segmentation methods have publicly available source code, this article compares the proposed algorithm with U-Net, UNet++, Attention U-Net, CE-Net, GT U-Net, and UNeXt, which are representative segmentation algorithms of recent years. Table 2 compares the segmentation accuracy of the different algorithms on Archive. Compared with U-Net and UNet++, the proposed algorithm achieves the best IoU, Accuracy, and Dice, improving on the better of the two by 1.74, 0.31, and 1.59 percentage points, respectively; its Precision (97.34%) is second only to that of U-Net (97.79%). Compared with Attention U-Net, CE-Net, GT U-Net, and UNeXt, the proposed algorithm improves IoU, Precision, Accuracy, and Dice by 0.70, 0.06, 0.13, and 0.47 percentage points, respectively. The experimental results show that the proposed algorithm achieves better overall segmentation performance and distinguishes the tooth region from the background region more effectively.

Table 2.

Comparison of segmentation accuracy of different algorithms on dental panoramic X-ray images. The bolded values in the table represent the best performance in the corresponding evaluation metrics.

Method	IoU (%)	Precision (%)	Accuracy (%)	Dice (%)
U-Net¹⁵	86.17	97.79	97.48	84.07
UNet++¹⁷	86.22	97.15	97.47	84.83
Attention U-Net²⁶	87.05	97.28	97.62	85.56
CE-Net²¹	87.26	97.20	97.66	85.95
GT U-Net²⁷	85.78	96.08	97.36	85.14
UNeXt³⁴	86.30	95.55	97.45	85.91
Ours	87.96	97.34	97.79	86.42
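The improvement margins quoted for Archive can be re-derived from the rows of Table 2. The snippet below is a small arithmetic check in pure Python; the function and variable names are hypothetical, and each margin is taken against the best baseline in the group for that metric, in percentage points.

```python
# Rows of Table 2: (IoU, Precision, Accuracy, Dice) in percent.
table2 = {
    "U-Net": (86.17, 97.79, 97.48, 84.07),
    "UNet++": (86.22, 97.15, 97.47, 84.83),
    "Attention U-Net": (87.05, 97.28, 97.62, 85.56),
    "CE-Net": (87.26, 97.20, 97.66, 85.95),
    "GT U-Net": (85.78, 96.08, 97.36, 85.14),
    "UNeXt": (86.30, 95.55, 97.45, 85.91),
    "Ours": (87.96, 97.34, 97.79, 86.42),
}


def margin_over_best(ours, baselines):
    """Per-metric gap between our score and the best baseline score."""
    return tuple(
        round(score - max(row[i] for row in baselines), 2)
        for i, score in enumerate(ours)
    )


ours = table2["Ours"]
vs_unet_family = margin_over_best(ours, [table2["U-Net"], table2["UNet++"]])
vs_others = margin_over_best(
    ours,
    [table2[name] for name in ("Attention U-Net", "CE-Net", "GT U-Net", "UNeXt")],
)
# vs_unet_family -> (1.74, -0.45, 0.31, 1.59)
# vs_others      -> (0.7, 0.06, 0.13, 0.47)
```

The negative Precision entry in the first tuple reflects U-Net's higher Precision (97.79% versus 97.34%).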

To show the segmentation quality of the different algorithms more intuitively, their results are visualized on several test-set images: Figure 11 shows the overall segmentation results, and Figure 12 shows the detailed segmentation results. The figures show that U-Net suffers from large semantic differences between features of the same layer and from limited feature extraction capability, which causes breaks when segmenting complete teeth. UNet++ produces conspicuous isolated segmentation errors, mainly because of insufficient multi-scale information extraction. Attention U-Net and CE-Net show obvious jagged edges along the tooth boundaries, together with some over-segmentation. GT U-Net shows some tooth adhesion and splitting, accompanied by isolated segmentation errors. UNeXt produces more curved tooth shapes with edges that are not smooth enough, because the network model loses part of the important receptive field. Compared with the other algorithms, the results of the proposed algorithm are close to the real labels: the tooth edges are smooth, jaggedness is clearly reduced, and over-segmentation and isolated segmentation errors are better controlled.

Figure 11.

Visualization of the results of the detail segmentation for different algorithms.

Figure 12.

Segmentation results of different methods on Dataset and Code dataset.

In fact, teeth imaged by different capture devices differ in appearance. To address the differences in tooth images and in annotation quality across devices, we additionally compare against Polyper and Swin-T, which are segmentation algorithms combined with the vision transformer, on Dataset and Code. Table 3 compares the segmentation accuracy of the different algorithms on Dataset and Code. Compared with the best results of the other algorithms, the proposed algorithm achieves the best IoU, Accuracy, and Dice, improving them by 0.25, 0.10, and 0.73 percentage points, respectively, and it also achieves a competitive Precision.

Table 3.

Comparison of segmentation accuracy of different algorithms on dental panoramic X-ray images. The bolded values in the table represent the best performance in the corresponding evaluation metrics.

Method	IoU (%)	Precision (%)	Accuracy (%)	Dice (%)
U-Net¹⁵	85.73	93.45	90.89	92.10
Polyper³⁶	87.10	90.66	95.94	92.96
Swin-T³⁷	90.47	94.57	95.26	94.63
UNet++¹⁷	91.48	94.91	97.09	94.48
Attention U-Net²⁶	91.79	93.05	97.22	94.91
Ours	92.04	94.59	97.32	95.64

The segmentation results of the different algorithms are visualized on several test-set images (adult teeth, implants, children's teeth, missing teeth, etc.): Figure 13 shows the overall segmentation results, and the detailed segmentation results are also visualized. The figures show that U-Net has obvious shortcomings in the tooth segmentation task, especially when processing complex structures, where it tends to lose important information. UNet++ is prone to errors in fine detail because of insufficient multi-scale information extraction, and the errors are most pronounced against complex backgrounds and near boundaries. The results of Attention U-Net show jagged tooth edges and partial over-segmentation. In the Polyper network, the added design complexity and parameter count of the potential boundary extraction and boundary-sensitive refinement modules lead to high computational overhead and insufficient generalization during training and inference, which in turn affects the segmentation accuracy. Although the vision transformer used in Swin-T can in theory capture global information better, in practice fine edges are still processed inadequately and parts of some regions are missed, owing to insufficient training data. Compared with the other algorithms, the proposed algorithm shows clear advantages: it preserves the boundaries and the complete structure of each tooth, with clear root boundaries and no obvious adhesion between the roots.

Figure 13.

Boundary detail comparison of different methods.

Analysis of ablation experiments

To validate the effectiveness of the proposed modules, this study designed the following ablation experiments on Archive. These experiments explore the specific impact of each component on overall performance by removing or modifying key components of the model one at a time. In this way, we clarify the function and importance of each module and confirm the rationality of the model design.

Effectiveness of the residual omni-dimensional convolution module

To evaluate the effectiveness of the proposed residual omni-dimensional convolution module (ROCM), the following ablation experiment is performed: in the feature extraction and recovery parts of the network structure, the segmentation results obtained with the original convolution block are compared with those obtained with the ROCM. The original convolution block consists of two consecutive ordinary convolutional layers, each with a kernel size of 3 × 3 and a stride of 1, and each convolution is immediately followed by batch normalization and an activation function layer. Table 4 shows that the designed ROCM achieves better tooth segmentation results than the original convolution block, improving IoU, Precision, Accuracy, and Dice by 0.74, 0.16, 0.14, and 0.84 percentage points, respectively. The analysis shows that the original convolution block struggles to handle the complex and fuzzy feature information in dental panoramic X-ray images, while the ROCM constructs a main feature extraction stream and an auxiliary feature supplementation stream, so that both small-scale detailed features and large-scale regional features are fully learned, which is more conducive to good tooth segmentation results.

Table 4.

Effectiveness of the residual omni-dimensional convolution module.

Framework	IoU (%)	Precision (%)	Accuracy (%)	Dice (%)
Raw convolutional blocks	87.22	97.18	97.65	85.58
ROCM	87.96	97.34	97.79	86.42

Convolutional kernel size analysis in trunk feature extraction streams

With the auxiliary feature supplementation stream in the ROCM kept unchanged, the convolution kernel size in the backbone feature extraction stream is varied while the feature map size after convolution is kept the same. Three backbone feature extraction streams are designed, with convolution kernel sizes of (a) 5 × 5, (b) 7 × 7, and (c) 3 × 3. Table 5 shows the experimental results for the different kernel sizes: the best results are obtained with a 3 × 3 kernel, and increasing the kernel size beyond this degrades the results. The analysis shows that, thanks to the feature map size reduction brought by the downsampling layers, two consecutive 3 × 3 convolutions are enough to gradually extract larger-scale features and even full-image features. Although a larger kernel provides a larger receptive field, the features learned at each layer become less distinctive, which is ultimately detrimental to the accuracy of the tooth segmentation results.

Table 5.

Convolutional kernel size analysis in the backbone feature extraction stream.

Convolutional kernel size	IoU (%)	Precision (%)	Accuracy (%)	Dice (%)
5 × 5	87.11	96.82	97.63	85.48
7 × 7	86.46	96.25	97.49	85.47
3 × 3	87.96	97.34	97.79	86.42
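The receptive-field argument above can be made concrete with the standard recurrence for stacked convolution and pooling layers. The sketch below is a pure-Python illustration with hypothetical function names. The 2 × 2, stride-2 downsampling layer used in the last example is an assumption, since the downsampling operator is not specified in this section.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) jumps
        jump *= stride             # striding spreads later taps further apart
    return rf


def same_padding(kernel):
    """Padding that keeps the feature map size for an odd kernel at stride 1."""
    return (kernel - 1) // 2


rf_two_3x3 = receptive_field([(3, 1), (3, 1)])  # 5
rf_two_5x5 = receptive_field([(5, 1), (5, 1)])  # 9
rf_two_7x7 = receptive_field([(7, 1), (7, 1)])  # 13

# ASSUMPTION: a 2 x 2, stride-2 downsampling layer between two encoder levels.
rf_two_levels = receptive_field([(3, 1), (3, 1), (2, 2), (3, 1), (3, 1)])  # 14

paddings = {k: same_padding(k) for k in (3, 5, 7)}  # {3: 1, 5: 2, 7: 3}
```

Under that assumption, a single downsampling step already lets two further 3 × 3 convolutions cover 14 input pixels, more than two 7 × 7 convolutions cover at one level, which is consistent with the observation that small kernels combined with downsampling reach large-scale features.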

Validity of the two-stream coordinate attention module

To verify the effectiveness of the proposed TSCA module, we varied its structure and designed a series of comparative experiments: (a) without TSCA, (b) TSCA without the average pooling branch, (c) TSCA without the maximum pooling branch, and (d) TSCA with all branches. The experimental results are shown in Table 6. The TSCA containing all branches achieves the best results and, compared with not using the TSCA, improves IoU, Precision, Accuracy, and Dice by 0.63, 0.44, 0.12, and 0.44 percentage points, respectively. The analysis shows that, between the feature encoding network and the feature decoding network, the original skip connection ensures that low-level feature information is not lost during feature propagation but cannot supply more useful intermediate feature information to the feature decoding network, so the segmentation results still have considerable room for improvement. The TSCA without the average pooling branch focuses only on the edge contour of the tooth, and the TSCA without the maximum pooling branch attends only to the feature information of the tooth target region; the two branches need to complement each other for the network model to perform best.

Table 6.

Validity of the two-stream coordinate attention module.

Framework	IoU (%)	Precision (%)	Accuracy (%)	Dice (%)
Without TSCA	87.33	96.90	97.67	85.98
TSCA without average pooling branches	87.70	96.85	97.73	86.33
TSCA without maximum pooling branches	87.81	96.88	97.76	86.25
Full branched TSCA	87.96	97.34	97.79	86.42

Analysis of input image size

To investigate the effect of the input size of dental panoramic X-ray images on the segmentation results, we trained the network with three input sizes, 128 × 128, 256 × 256, and 512 × 512, and kept the input size in the testing phase the same as in the training phase. As shown in Table 7, the best results are achieved with an input size of 512 × 512. Compared with 128 × 128 in particular, IoU, Precision, Accuracy, and Dice improve by 5.38, 7.17, 1.21, and 0.42 percentage points, respectively. This indicates that as the input size increases, hierarchical downsampling yields feature maps with a rich range of sizes, and the convolution at each layer can extract more distinctive feature information. When the input image is very small, hierarchical downsampling leaves feature maps that are unfavorable for learning, and it is difficult to obtain good tooth segmentation results.

Table 7.

Analysis of input image size.

Input image size	IoU (%)	Precision (%)	Accuracy (%)	Dice (%)
128 × 128	82.58	90.17	96.58	86.00
256 × 256	85.23	92.68	97.17	86.29
512 × 512	87.96	97.34	97.79	86.42
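The effect of hierarchical downsampling on the three input sizes can be seen by listing the feature map side length at each level. The snippet below is a simple arithmetic sketch in pure Python; the function name is hypothetical, and the choice of four stages that each halve the resolution is an assumption made for illustration, not a figure taken from this section.

```python
def pyramid_sizes(input_size, num_downsamples=4):
    """Side length of the feature map at each encoder level.

    ASSUMPTION: every downsampling stage halves the resolution and there
    are four such stages.
    """
    sizes = [input_size]
    for _ in range(num_downsamples):
        sizes.append(sizes[-1] // 2)
    return sizes


sizes_by_input = {s: pyramid_sizes(s) for s in (128, 256, 512)}
# 128 -> [128, 64, 32, 16, 8]
# 256 -> [256, 128, 64, 32, 16]
# 512 -> [512, 256, 128, 64, 32]
```

Under this assumption the deepest level of a 128 × 128 input is only 8 × 8, so a whole panoramic radiograph has to be described by 64 spatial positions there, whereas a 512 × 512 input still retains 32 × 32 positions.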

Analysis of different upsampling methods

Upsampling is performed before the convolution operation of the feature decoding network, and the upsampling output together with the output of the TSCA module forms the new input features of each layer of the feature decoding network. To investigate the impact of the upsampling method on the segmentation results, four methods are compared: (a) transposed convolution, (b) nearest-neighbor interpolation, (c) linear interpolation, and (d) bilinear interpolation. The experimental results are shown in Table 8. Bilinear interpolation can interpolate at non-integer pixel positions, whereas transposed convolution can only interpolate at integer pixel positions, and bilinear interpolation computes each target pixel from the values of the four surrounding pixels, which gives more accurate results than nearest-neighbor interpolation and linear interpolation. Accordingly, the best results were achieved when bilinear interpolation was chosen as the upsampling method.

Table 8.

Analysis of different upsampling methods.

Upsampling method	IoU (%)	Precision (%)	Accuracy (%)	Dice (%)
Transposed convolution	86.63	95.47	97.51	85.95
Nearest-neighbor interpolation	87.71	97.17	97.74	86.11
Linear interpolation	87.81	97.03	97.76	86.18
Bilinear interpolation	87.96	97.34	97.79	86.42
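The difference between nearest-neighbor and bilinear upsampling can be seen on a 2 × 2 example. The sketch below is a pure-Python illustration with hypothetical function names. It assumes a half-pixel-centre mapping from output to input coordinates and clamps samples at the border; other coordinate conventions give slightly different values.

```python
import math


def nearest_sample(img, y, x):
    """Value of the input pixel closest to the fractional position (y, x)."""
    h, w = len(img), len(img[0])
    yi = min(max(int(math.floor(y + 0.5)), 0), h - 1)
    xi = min(max(int(math.floor(x + 0.5)), 0), w - 1)
    return img[yi][xi]


def bilinear_sample(img, y, x):
    """Weighted average of the four pixels surrounding position (y, x)."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)  # clamp to the image border
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
    bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def upsample2x(img, sampler):
    """Double the resolution; ASSUMPTION: half-pixel-centre coordinate mapping."""
    h, w = len(img), len(img[0])
    return [[sampler(img, (i + 0.5) / 2 - 0.5, (j + 0.5) / 2 - 0.5)
             for j in range(2 * w)] for i in range(2 * h)]


img = [[0.0, 10.0],
       [20.0, 30.0]]
up_nearest = upsample2x(img, nearest_sample)    # blocks of repeated values
up_bilinear = upsample2x(img, bilinear_sample)  # smooth ramp between pixels
```

Nearest-neighbor upsampling repeats each input value in a 2 × 2 block, which produces stair-stepped edges, while bilinear upsampling blends neighbouring values and yields a smoother transition across tooth boundaries.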

Conclusion

This article presents a dental panoramic X-ray image segmentation method based on multi-feature coordinate position learning and implements a deep learning-based tooth segmentation network model. Two multi-feature coordinate position learning modules are designed to enable the neural network to extract more discriminative multi-scale feature information at different coordinate positions. The ROCM performs multi-feature extraction layer by layer, acquiring both larger-scale regional feature information and the small-scale detailed features that determine the course of the tooth contours. To reduce the information difference between the feature encoding network and the feature decoding network, the two-stream coordinate attention (TSCA) module is used to capture global information. Experimental results show that the proposed method achieves excellent results in terms of accuracy and efficiency of tooth segmentation, and the segmented tooth morphology in the visualization results is close to the real annotations. However, the ROCM and the TSCA module increase the number of parameters of the network model. This increases the computational requirements and runtime of the model, which may limit its usefulness, especially in resource-constrained environments. Our future research will therefore focus on making the model lightweight to improve its efficiency. In addition, annotating dental data is time-consuming and costly, which limits the feasibility of large-scale data processing. Semi-supervised segmentation methods are therefore particularly important, because they can reduce the dependence on labeled data while maintaining segmentation performance by using incompletely annotated dental image data. We plan to further investigate techniques such as self-training and pseudo-labeling to improve the accuracy of tooth segmentation under the semi-supervised learning framework.

Footnotes

Acknowledgments

Not applicable.

Contributorship

Conceptualization, T.M. and Z.D.; methodology, T.M. and Z.D.; software, Z.D.; validation, Z.D.; formal analysis, Z.D. and Y.Y. (Yizhou Yang); resources, Z.D. and Y.Y. (Yizhou Yang); data curation, Z.D.; writing—original draft preparation, Z.D.; writing—review and editing, J.Y.; visualization, Z.D.; supervision, T.M., J.Y.; funding acquisition, T.M., J.Y. All authors have read and agreed to the published version of the article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Ethical approval

Not applicable.

Funding

This work was supported by the Shaanxi Natural Science Fundamental Research Program Project (No. 2022JM-508), the Youth Innovation Team of Shaanxi Universities and in part by the National Natural Science Foundation of China (Grant No. 62101432).

Guarantor

Z.D. (Zhenrui Dang).

Informed consent statement

Not applicable.

Institutional review board statement

Not applicable.

ORCID iD

Zhenrui Dang

References

Zhou

Yao

. Omni-dimensional dynamic convolution. In: International conference on learning representations, 2022, pp. 1–20.

Wang

Zhu

, et al. ECA-Net: efficient channel attention for deep convolutional neural networks. In: IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11531–11539.

Zhao

Gao

, et al. TSASNet: tooth segmentation on dental panoramic X-ray images by two-stage attention segmentation network. Knowl Based Syst 2020; 206: 106338.

Hiyah

Harsono

Sigit

. Comparison study of Gaussian and histogram equalization filter on dental radiograph segmentation for labelling dental radiograph. In: International conference on knowledge creation and intelligent computing, 2016, pp. 253–258.

Sun

, et al. Watershed algorithm based on morphology for dental X-ray images segmentation. In: International conference on signal processing, 2012, pp. 877–880.

Tikhe

S V

Naik

A M

Bhide

S D

, et al. Algorithm to identify enamel caries and interproximal caries using dental digital radiographs. In: International conference on advanced computing, 2016, pp. 225–228

Mao

Wang

, et al. Grabcut algorithm for dental X-ray images based on full threshold segmentation. IET Image Proc 2018; 12: 2330–2335.

Said

Nassar

DEM

Fahmy

, et al. Teeth segmentation in digitized dental X-ray films using mathematical morphology. IEEE Trans Inf Forensics Secur 2006; 1: 178–189.

Alsmadi

. A hybrid fuzzy C-means and neutrosophic for jaw lesions segmentation. Ain Shams Eng J 2018; 9: 697–706.

10.

Tuan

. A cooperative semi-supervised fuzzy clustering framework for dental X-ray image segmentation. Expert Syst Appl 2016; 46: 380–393.

11.

Trivedi

Modi

. Dental contour extraction using ISEF algorithm for human identification. In: International conference on electronics computer technology, 2011, pp. 6–10.

12.

Niroshika

UAA

Meegama

RGN

Fernando

TGI

. Active contour model to extract boundaries of teeth in dental X-ray images. In: International conference on computer science & education, 2013, pp. 396–401.

13.

Modi

Desai

. A simple and novel algorithm for automatic selection of ROI for dental radiograph segmentation. In: Canadian conference on electrical and computer engineering, 2011, pp.000504–000507.

14.

Long

Shelhamer

Darrell

Fully convolutional networks for semantic segmentation. In: IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.

15.

Ronneberger

Fischer

Brox

. U-Net: convolutional networks for biomedical image segmentation. Med Image Comput Comput Assist Interv 2015: 234–241.

16.

Zhang

Ren

, et al. Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

17.

Zhou

Rahman Siddiquee

Tajbakhsh

, et al. UNet++: a nested U-net architecture for medical image segmentation. Deep Learn Med Image Anal Multim Learn Clin Decis Support 2018: 3–11.

18.

Liu

Zhou

, et al. Segmenting nailfold capillaries using an improved U-net network. Microvasc Res 2020; 130: 1–9.

19.

Rao

Nartey

O T

Zeng

, et al. LeFUNet: uNet with learnable feature connections for teeth identification and segmentation in dental panoramic X-ray images. In: IEEE international conference on bioinformatics and biomedicine. IEEE 2022, pp. 2110–2118.

20.

Zhao

Shi

, et al. Pyramid scene parsing network. In: IEEE conference on computer vision and pattern recognition, 2017, pp. 2881–2890.

21.

Cheng

, et al. CE-Net: context encoder network for 2D medical image segmentation. IEEE Trans Med Imag 2019; 38: 2281–2292.

22.

Wang

Liu

, et al. ARMS Net: overlapping chromosome segmentation based on adaptive receptive field multi-scale network. Biomed Signal Process Control 2021; 68: 1–9.

23.

Wang

Lyu

. Multi-path connected network for medical image segmentation. J Vis Commun Image Represent 2020; 71: 1–11.

24.

Chen

Zhao

Liu

, et al. MSLPNet: multi-scale location perception network for dental panoramic X-ray image segmentation. Neural Comput Appl 2021; 33: 10277–10291.

25.

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. In: International conference on neural information processing systems, 2017, pp. 6000–6010.

26.

Oktay

Schlemper

Folgoc

, et al. Attention U-Net: learning where to look for the pancreas. arXiv preprint arXiv: 1804.03999, 2018.

27.

Wang

, et al. GT U-Net: a U-net like group transformer network for tooth root segmentation. Mach Learn Med Imag 2021: 386–395.

28.

Heo

Yun

Han

, et al. Rethinking spatial dimensions of vision transformers. In: IEEE/CVF international conference on computer vision, 2021, pp. 11936–11945.

29.

Kaya

Akar

. Dental X-ray image segmentation using octave convolution neural network. In: Signal processing and communications applications conference (SIU). IEEE, 2020, pp. 1–4.

30.

Chen

Fan

, et al. Drop an octave: reducing spatial redundancy in convolutional neural networks with octave convolution. In: IEEE/CVF international conference on computer vision, 2019, pp. 3435–3444.

31. Sheng, Wang, Huang, et al. Transformer-based deep learning network for tooth segmentation on panoramic radiographs. J Syst Sci Complex 2023; 36: 257–272.

32. Cao, Wang, Chen, et al. Swin-Unet: Unet-like pure transformer for medical image segmentation. In: European conference on computer vision, 2023, pp. 205–218.

33. Zhao, Gao, et al. TSASNet: tooth segmentation on dental panoramic X-ray images by two-stage attention segmentation network. Knowl Based Syst 2020; 206: 1–10.

34. Valanarasu JMJ, Patel. UNeXt: MLP-based rapid medical image segmentation network. Med Image Comput Comput Assist Interven 2022: 23–33.

35. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16 × 16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

36. Shao, Zhang, Hou. Polyper: boundary sensitive polyp segmentation. In: Proceedings of the AAAI conference on artificial intelligence, 2024, vol. 38, no. 5, pp. 4731–4739.

37. Liu, Lin, Cao, et al. Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 10012–10022.

38. Xie, Wang, et al. SegFormer: simple and efficient design for semantic segmentation with transformers. Adv Neural Inf Process Syst 2021; 34: 12077–12090.

39. Yang, Bender, Le QV, et al. CondConv: conditionally parameterized convolutions for efficient inference. In: International conference on neural information processing systems, 2019, pp. 1307–1318.

40. Chen, Dai, Liu, et al. Dynamic convolution: attention over convolution kernels. In: IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11030–11039.

41. Jaderberg, Simonyan, Zisserman. Spatial transformer networks. In: International conference on neural information processing systems, 2015, pp. 2017–2025.

42. Hou, Zhou, Feng. Coordinate attention for efficient mobile network design. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 13713–13722.

43. Abdi, Kasaei. Panoramic dental X-rays with segmented mandibles. Mendeley Data 2020: 2.

44. Silva, Oliveira, Pithon. Automatic segmenting teeth in X-ray images: trends, a novel data set, benchmarking and future perspectives. Expert Syst Appl 2018; 107: 15–31.

45. Lemke, Parkinson, Marsh. Design of a head-support device for a novel head-only MRI scanner. Advanced Design Research 2023; 1: 21–37.

46. Zhou, Yang. Dental lesion segmentation using an improved ICNet network with attention. Micromachines 2022; 13: 1920.

Dental panoramic X-ray image segmentation for multi-feature coordinate position learning
