Abstract
Background
Traditional Chinese medicine (TCM) tongue diagnosis, through comprehensive observation of the tongue’s diverse characteristics, provides insight into the state of the body’s viscera as well as the levels of Qi and blood. Automatic tongue image recognition methods could support TCM practitioners by providing auxiliary diagnostic suggestions. However, most learning-based methods address only a narrow scope of the tongue’s attributes and fail to fully exploit the information contained in tongue images.
Objective
To classify multifaceted tongue characteristics and fully utilize the latent correlation between the tongue segmentation and classification tasks, we proposed a multi-task joint learning network for simultaneous tongue body segmentation and multi-label classification, named SSC-Net.
Methods
Firstly, the shared feature encoder extracts features for both segmentation and classification tasks, where the segmentation result is utilized to mask redundant features that may impede classification accuracy. Subsequently, the ROI extraction module locates and extracts the tongue body region, and the feature fusion module combines tongue body features from bottom to top. Finally, a fine-grained classification module is employed for multi-label classification on multiple tongue characteristics.
Results
To evaluate the performance of SSC-Net, we collected a tongue image dataset, BUCM, and conducted extensive experiments on it. The experimental results show that, when segmenting and classifying simultaneously, the proposed method achieved a DSC of 0.9943 for the segmentation task, and 92.02% mAP and an overall F1-score of 0.851 for the classification task.
Conclusion
The proposed method can effectively classify multiple tongue characteristics with the support of the multi-task learning strategy and the integration of a fine-grained classification module. Code is available here.
Keywords
Introduction
The tongue plays a crucial role in inspection diagnosis in traditional Chinese medicine (TCM) and serves as an important basis for syndrome differentiation and treatment. 1
According to TCM theory, the appearance of the tongue is closely related to the fluctuations of Qi and blood within the body’s viscera. By observing the spirit, color, shape, and dynamics of the tongue, practitioners can understand the changes in physiological and pathological functions of the human body. 2
However, traditional tongue diagnosis faces the following challenges: (a)
In recent years, the rapid development of computer vision and related technologies has led to their widespread application in medical image analysis. Deep learning has been applied to tongue diagnosis, including the classification of tongue colors 3,4 and shapes, 5 recognition of teeth marks 6 and cracks on the tongue body, and diagnosis of diseases from tongue appearance, such as diabetes mellitus 7 and gastrointestinal diseases. 8 Learning-based tongue recognition methods typically consist of two parts: tongue segmentation and tongue classification. Tongue segmentation involves extracting the tongue part from the captured images to avoid interference from redundant information such as the face and lips, which is a prerequisite for subsequent tongue classification. Tongue classification involves classifying various characteristics of the segmented tongue body and coating, such as color, shape, texture, and so on, and the results can be used to assist clinicians in diagnosis.
Among these characteristics, we observed that the identification of the tongue body and coating colors is usually more straightforward due to their distinct visual differences in images. In contrast, the recognition of teeth marks and cracks is more complex, as they occupy a smaller proportion of the image, and their identification accuracy is influenced by image quality and lighting conditions. Each image in Figure 1 showcases these different characteristics, with Figure 1(a) to 1(e) all displaying characteristics like tongue body color, tongue coating color, and tongue coating thickness, and Figure 1(d) and 1(e) also showing teeth marks or cracks. Therefore, we regard multiple characteristics classification of tongue images as a multi-label classification task. Table 1 provides detailed explanations of the classification of characteristics in Figure 1 and their corresponding clinical significance in TCM.

Representative images corresponding to the tongue characteristics. Here, (a) to (e) all display tongue body color, tongue coating color, and tongue coating thickness, while (d) and (e) also display teeth marks or cracks.
The tongue characteristics classified in this work and their corresponding clinical significance in TCM.
Despite progress in learning-based methods, existing approaches still face limitations. Single-task methods perform segmentation and classification sequentially, which neglects the intrinsic correlation between tongue body segmentation and tongue classification. Multi-task methods execute both tasks simultaneously, but most existing multi-task methods inadequately address the comprehensive identification of tongue characteristics, overlooking their multi-label and fine-grained nature.
This work proposed SSC-Net, an end-to-end multi-task joint learning network specifically architected for concurrent tongue body segmentation and multi-label tongue classification. The network introduces a multi-label fine-grained classification method, enabling simultaneous identification of five pathologically significant characteristics: (a) tongue body color, (b) tongue coating color, (c) tongue coating thickness, (d) cracks, and (e) teeth marks.
Related works
Single-task methods for tongue segmentation or classification
The single-task tongue image segmentation or classification method involves only a single, isolated task of either segmentation or classification, with no connection between the two. Tongue image segmentation is essentially a semantic segmentation task, where masks are generated to remove non-tongue regions while retaining the tongue portion so that the classification results are not affected. Accurate tongue segmentation results, especially those with smooth edges and few cavities, benefit the classification of teeth marks and tongue coating. Zhou et al. 9 proposed TongueNet for tongue body segmentation, similar to Mask R-CNN, 10 which locates and segments the tongue body through feature extraction, region proposal, and prediction, achieving 0.9796 DSC and 0.9774 mIoU on the BioHit 11 dataset. Lin et al. 12 proposed DeepTongue based on ResNet 13 and DeepMask to achieve fast segmentation with 0.9458 mIoU on their custom dataset. Huang et al. 14 combined a residual soft connection module and a salient image fusion module with U-Net 15 for fast tongue body segmentation on mobile devices. Jiang et al. 16 incorporated the Mamba attention mechanism and multi-stage feature fusion, enhancing the accuracy and efficiency of U-Net in complex environments.
Single-task tongue image classification methods typically utilize classifiers to classify feature vectors extracted from input images. Li et al. 17 first generated suspicious areas with R-CNN, then extracted feature vectors with VGG16, 18 and finally identified tooth-marked tongues with an SVM. Tang et al. 19 used a coarse-to-fine network to simultaneously detect tongue regions and key points, followed by the fine-grained classification network DCN 20 to identify tooth marks. Ni et al. 3 combined CapsNet and residual blocks for a five-class classification of tongue colors, achieving 0.845 accuracy. However, these classification methods focus on one or a few categories of tongue characteristics and fail to fully explore the information contained in tongue images. We argue that tongue classification should comprehensively consider the multiple labels of tongue images to assist TCM tongue diagnosis more effectively.
Multi-task methods for tongue segmentation and classification
Multi-task learning, which uses knowledge from multiple tasks to assist each individual task, 21 has achieved notable success in medical image analysis, such as COVID-19 diagnosis, 22 skin disease classification, 23 and tumor identification. 24 Multi-task learning methods for tongue image segmentation and classification convey information between the two tasks by using the predictions of the former as input for the latter. Multi-task methods can be divided into two categories: non-end-to-end trainable and end-to-end trainable. A considerable portion of methods for tongue segmentation and classification is not end-to-end trainable, that is, two models are trained separately for segmentation and classification. Li et al. 25 first used facial landmark recognition and U-Net for tongue body segmentation, then trained multiple ResNet models to classify tongue characteristics. However, non-end-to-end methods repeatedly extract features for segmentation and classification, resulting in high computational costs and failing to fully utilize the potential information shared between the tasks.
In contrast, end-to-end methods can share latent information between different tasks. Xu et al. 26 proposed the first end-to-end trainable multi-task network that unifies tongue image segmentation and classification. They employed U-Net and DFL 27 for tongue body segmentation and fine-grained classification of tongue coating; however, their method was limited to classifying the color and thickness of the tongue coating. Qiu et al. 28 performed classification on tongue coating and sublingual vein after segmenting the tongue surface and underside, based on MobileNetV2, 29 achieving feature extraction of tongue images on mobile devices. Shi et al. 30 proposed a multi-task joint learning model, Ammonia-Net, based on U-Net and ShuffleNetV2, applying the segmentation results of tooth marks to aid in the grade classification of tooth-marked tongues.
In summary, tongue image recognition faces the following challenges:
(a) TCM synthesizes multi-dimensional characteristics of the tongue to support diagnosis, yet current methods for tongue image recognition predominantly focus on isolated classification of single characteristic types (e.g. tongue color or coating thickness), neglecting the multi-label nature of tongue manifestations. (b) Tongue image recognition methods typically consist of tongue body segmentation and tongue classification; how to use a multi-task network to exploit the information shared between them remains challenging. (c) The global features of tongue images show minor differences, whereas the local features exhibit relatively significant differences, making tongue image classification a natural fine-grained classification task.
This work proposed a multi-task joint learning network for tongue image recognition called SSC-Net, which can perform tongue body segmentation and multi-label classification simultaneously. The main contributions are as follows:
(a) We performed multi-label and fine-grained tongue image classification on the key tongue characteristics closely related to diseases, including tongue body color, tongue coating color, coating thickness, cracks, and teeth marks. (b) We proposed a multi-task joint learning network for tongue image recognition named SSC-Net, designed to perform tongue image segmentation and multi-label classification simultaneously; SSC-Net effectively connects the latent information between the segmentation and classification tasks. (c) A tongue image dataset, BUCM, consisting of 1500 images was constructed, with each image label verified by TCM experts to ensure data accuracy and reliability.
Material and method
BUCM dataset
Data collection
Given the scarcity of public datasets for tongue image classification, we collected 1571 tongue images from 774 participants during various treatment stages. The images were captured with a mobile device in an open clinical environment, each with a resolution of
Label annotation
The segmentation labels of the BUCM dataset were annotated as binary images, where white represents the tongue area and black represents the background. These annotations serve as segmentation ground truths (GTs). The classification labels were assigned by three professional TCM physicians for five critical tongue characteristics. These labels, detailed in Table 1, are categorized as follows: C1 (tongue body color): light-red, pale, or red; C2 (tongue coating color): white or yellow; C3 (tongue coating thickness): thin or thick; C4 (cracks): with (w/) or without (w/o); and C5 (teeth marks): with (w/) or without (w/o). The labeling process strictly followed Chinese National Standard GB/T 40665.1-2021, employing a consensus-driven workflow: one physician performed initial classification, which was then independently verified by the other two experts. Finally, 71 images with annotation disagreements were excluded from the original collection of 1571 images, yielding a final dataset of 1500 consistently annotated samples.
Data augmentation
The BUCM dataset was randomly split into training and testing sets in an 8:2 ratio. To address inherent imbalances in categories C1 and C2, we implemented targeted offline augmentation in the training set, expanding its size to 3000 samples. Critically, no augmentation was applied to the testing set, preserving its original data distribution and ensuring unbiased evaluation in realistic scenarios. The following transformations were applied: (a) random vertical flipping, (b) cropping the width and height by a random number of pixels between 0 and 10, (c) affine translation along the
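A minimal sketch of how such offline augmentations might be applied jointly to a tongue image and its segmentation mask is given below, assuming PIL inputs; parameter ranges beyond those stated above are illustrative assumptions rather than the exact settings used in this work.

```python
import random
from torchvision.transforms import functional as TF

def augment_pair(image, mask):
    """Apply the listed augmentations jointly to a PIL image and its mask so
    that the segmentation ground truth stays aligned with the image."""
    # (a) random vertical flipping
    if random.random() < 0.5:
        image, mask = TF.vflip(image), TF.vflip(mask)
    # (b) crop width and height by a random number of pixels in [0, 10]
    dx, dy = random.randint(0, 10), random.randint(0, 10)
    w, h = image.size
    image = TF.crop(image, top=dy, left=dx, height=h - dy, width=w - dx)
    mask = TF.crop(mask, top=dy, left=dx, height=h - dy, width=w - dx)
    # (c) small random affine translation (the fraction of image size is assumed)
    tx, ty = int(random.uniform(-0.05, 0.05) * w), int(random.uniform(-0.05, 0.05) * h)
    image = TF.affine(image, angle=0.0, translate=[tx, ty], scale=1.0, shear=[0.0])
    mask = TF.affine(mask, angle=0.0, translate=[tx, ty], scale=1.0, shear=[0.0])
    return image, mask
```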
Per-category sample sizes of training (original and augmented) and testing sets in BUCM dataset.
BioHit dataset
Another dataset used in this work is BioHit. 11 BioHit is a public tongue segmentation dataset, containing 300 standardized clinical tongue images captured using a structured imaging device. The image resolution is 768 × 576 pixels. In this work, all 300 images were utilized as a testing set to validate the generalization capability of the proposed method.
Proposed method
This work proposed SSC-Net, an end-to-end multi-task network that combines tongue image segmentation and classification. As shown in Figure 2, SSC-Net consists of (a) a shared feature encoder, (b) a segmentation module, (c) an ROI extraction module, (d) a feature fusion module, and (e) a classification module. Firstly, the shared feature encoder extracts multi-level deep features of the image, and the segmentation module predicts the segmentation result from these multi-level features. The ROI extraction module then locates and extracts the tongue body region based on the predicted segmentation result, and the feature fusion module combines multi-level features for classification. Finally, the classification module performs multi-label and fine-grained classification of tongue body color, tongue coating color, coating thickness, and the presence of teeth marks and cracks. In the following subsections, we first introduce these five modules and then the joint loss function of SSC-Net.

The framework of our proposed SSC-Net. The main components include: (a) shared feature encoder: extracting multi-level features from input images; (b) segmentation module: predicting tongue segmentation results; (c) ROI extraction module: locating, extracting, and aligning tongue regions based on predicted segmentation results; (d) feature fusion module: fusing multi-level features from bottom to top for classification; (e) classification module: performing multi-label classification with multi-head CSRA; (f) SE-ResNeXt bottleneck: a ResNeXt bottleneck integrated with Squeeze-and-Excitation.
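To make the data flow between these components concrete, the following schematic PyTorch sketch shows how the five modules could be composed; the concrete sub-modules are assumed to be defined elsewhere, and this is not the authors' exact implementation.

```python
import torch.nn as nn

class SSCNetSketch(nn.Module):
    """Schematic composition of the five SSC-Net components described above."""
    def __init__(self, encoder, seg_decoder, roi_extractor, fusion, cls_head):
        super().__init__()
        self.encoder = encoder              # (a) shared feature encoder
        self.seg_decoder = seg_decoder      # (b) segmentation module
        self.roi_extractor = roi_extractor  # (c) ROI extraction module
        self.fusion = fusion                # (d) feature fusion module
        self.cls_head = cls_head            # (e) fine-grained classification module

    def forward(self, image):
        feats = self.encoder(image)                                # multi-level features
        seg_logits = self.seg_decoder(feats)                       # tongue mask prediction
        roi_feats = self.roi_extractor(image, feats, seg_logits)   # mask and crop tongue region
        fused = self.fusion(roi_feats)                             # bottom-to-top feature fusion
        cls_logits = self.cls_head(fused)                          # multi-label logits
        return seg_logits, cls_logits
```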
Shared feature encoder
The feature extraction module is designed to extract multi-scale features ranging from lower-level detailed features to higher-level semantic features, as shown in Figure 2(a), comprising five stages. When the input image
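As a reference for the SE-ResNeXt bottleneck named in Figure 2(f), a minimal sketch is shown below; the cardinality and reduction ratio are illustrative assumptions rather than the exact configuration used in this work.

```python
import torch.nn as nn

class SEResNeXtBottleneck(nn.Module):
    """A grouped (ResNeXt-style) bottleneck wrapped with a Squeeze-and-Excitation
    gate and a residual connection, as sketched in Figure 2(f)."""
    def __init__(self, channels: int, cardinality: int = 32, reduction: int = 16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, groups=cardinality, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        # Squeeze-and-Excitation: channel-wise gating from globally pooled statistics
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.body(x)
        out = out * self.se(out)   # recalibrate channel responses
        return self.relu(out + x)  # residual connection
```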
Tongue segmentation module
In the tongue segmentation module, shown in Figure 2(b), bilinear interpolation is used to upsample the input features
ROI extraction module
The ROI extraction module, depicted in Figure 2(c), is designed to locate and extract the tongue region while obscuring the redundant facial information surrounding it in the input image. The predicted segmentation results are used to mask pixels in non-tongue regions. This module can be described in three steps:
(a) Mask non-tongue pixels: multiply the input image by the predicted segmentation mask.
(b) Locate the tongue region. As shown in Figure 3, firstly, decompose
(c) Extract and resize the tongue region. ROI Align 10 is applied on

The visualization of tongue region localization.
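A minimal sketch of the three steps above, assuming PyTorch tensors and using torchvision's ROI Align, is given below; the binarization threshold and output size are illustrative assumptions.

```python
import torch
from torchvision.ops import roi_align

def extract_tongue_roi(image, mask_logits, out_size=(224, 224)):
    """Mask non-tongue pixels, locate the tongue bounding box from the predicted
    mask, and extract an aligned, fixed-size tongue region.
    image: (B, 3, H, W), mask_logits: (B, 1, H, W)."""
    mask = (torch.sigmoid(mask_logits) > 0.5).float()  # binarize the predicted mask
    masked = image * mask                              # step (a): suppress non-tongue pixels

    boxes = []
    for b in range(mask.size(0)):
        ys, xs = torch.nonzero(mask[b, 0], as_tuple=True)  # step (b): locate tongue pixels
        if ys.numel() == 0:                                # fall back to the full image
            h, w = mask.shape[-2:]
            boxes.append(torch.tensor([b, 0, 0, w - 1, h - 1], dtype=torch.float))
        else:
            boxes.append(torch.tensor([float(b), xs.min().item(), ys.min().item(),
                                       xs.max().item(), ys.max().item()], dtype=torch.float))
    boxes = torch.stack(boxes).to(image.device)

    # step (c): crop and resize the located region with ROI Align
    return roi_align(masked, boxes, output_size=out_size, aligned=True)
```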
Feature fusion module
Among some joint segmentation and classification methods for medical image analysis, 33 classification networks typically feed high-level features directly into classifiers for prediction. However, this paradigm fails to effectively integrate low-level structure features. Although high-level features contain rich semantic information, continuous downsampling operations degrade structural details, rendering classification based on high-level features insensitive to local pathological characteristics.
The feature fusion module is designed to combine the complementary strengths of high-level semantic information and low-level structural features, thereby boosting classification performance. As illustrated in Figure 2(d), it fuses features

The feature fusion module.
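One possible fusion scheme, shown as a PyTorch sketch, projects each level to a common channel width and progressively merges lower-level maps into the higher-level map; channel sizes and the pooling choice are placeholders, not the actual configuration.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusionSketch(nn.Module):
    """Bottom-to-top fusion of multi-level encoder features for classification."""
    def __init__(self, in_channels=(64, 128, 256, 512), out_channels=512):
        super().__init__()
        # 1x1 convolutions project every level to a common channel width
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels]
        )
        self.smooth = nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats: list of (B, C_i, H_i, W_i) maps ordered from low level (large map)
        # to high level (small map)
        fused = self.lateral[0](feats[0])
        for lateral, feat in zip(self.lateral[1:], feats[1:]):
            fused = F.adaptive_max_pool2d(fused, feat.shape[-2:])  # match the next level's size
            fused = fused + lateral(feat)                          # merge structural and semantic cues
        return self.smooth(fused)
```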
Fine-grained classification module
Tongue image classification is a naturally fine-grained image recognition task. Methods for fine-grained classification can roughly be divided into region localization methods and feature encoding methods. 34 Among fine-grained classification methods for tongue image recognition, Tang et al. 19 utilized the destruction and construction network (DCN 20 ) to extract discriminative information from the detected tongue region by shuffling and recovering the local regions of the input image. Xu et al. 26 employed DFL 27 to enhance the intermediate feature representation capability of CNNs by introducing a bank of discriminative filters.
In this work, we introduce multi-head category-specific residual attention (CSRA 35 ) to perform fine-grained and multi-label tongue classification. As shown in Figure 5, CSRA combines the category-agnostic average pooling feature with a category-specific spatial pooling feature to obtain the category-specific residual attention features used for multi-label classification.

The architecture of CSRA. We extend it to a six-head CSRA as the fine-grained classification module.
Specifically, input feature
Finally, all these category-specific feature vectors are sent to the classifier to obtain the final logits
We extend a six-head CSRA to avoid tuning the temperature parameter
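A minimal PyTorch sketch of a multi-head CSRA classifier head following Zhu and Wu's formulation is given below; the 1×1 convolution scorer, the residual weight lambda, and the per-head temperature values are illustrative assumptions rather than the exact settings of this work.

```python
import torch
import torch.nn as nn

class CSRAHead(nn.Module):
    """One CSRA head: category-agnostic average pooling plus temperature-controlled,
    category-specific spatial pooling over per-class score maps."""
    def __init__(self, in_channels, num_classes, temperature, lam=0.1):
        super().__init__()
        self.score = nn.Conv2d(in_channels, num_classes, kernel_size=1, bias=False)
        self.T, self.lam = temperature, lam

    def forward(self, x):
        s = self.score(x).flatten(2)            # (B, K, HW) per-class score maps
        base = s.mean(dim=2)                    # category-agnostic average pooling
        if self.T == float("inf"):
            att = s.max(dim=2)[0]               # max pooling as the T -> infinity limit
        else:
            att = (torch.softmax(self.T * s, dim=2) * s).sum(dim=2)  # spatial attention pooling
        return base + self.lam * att            # residual combination, (B, K) logits

class MultiHeadCSRA(nn.Module):
    """Six-head CSRA: each head uses a different temperature so that no single
    temperature has to be tuned; head logits are aggregated by summation."""
    def __init__(self, in_channels, num_classes,
                 temperatures=(1, 2, 3, 4, 5, float("inf")), lam=0.1):
        super().__init__()
        self.heads = nn.ModuleList(
            [CSRAHead(in_channels, num_classes, t, lam) for t in temperatures]
        )

    def forward(self, x):
        return sum(head(x) for head in self.heads)
```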
Loss function
Given a training dataset
For the segmentation task, we combine the Dice loss with the binary cross-entropy loss for supervision.
We use the binary cross-entropy loss for the multi-label classification task,
We define the final joint loss function as a weighted combination of segmentation loss and classification loss to optimize both tasks simultaneously.
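One plausible reconstruction of these losses in standard notation is shown below, where p_i and g_i denote the predicted and ground-truth mask values of pixel i, y_c and ŷ_c the ground-truth label and predicted probability of class c, and the weights λ_seg and λ_cls are assumed symbols rather than the exact values used in this work.

```latex
\mathcal{L}_{seg} =
  \underbrace{1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}}_{\text{Dice loss}}
  \;-\; \frac{1}{N}\sum_{i=1}^{N}\bigl[g_i \log p_i + (1-g_i)\log(1-p_i)\bigr]

\mathcal{L}_{cls} = -\frac{1}{C}\sum_{c=1}^{C}\bigl[y_c \log \hat{y}_c + (1-y_c)\log(1-\hat{y}_c)\bigr]

\mathcal{L}_{joint} = \lambda_{seg}\,\mathcal{L}_{seg} + \lambda_{cls}\,\mathcal{L}_{cls}
```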
Experimental results
Implementation details
The multi-task learning model was trained for 100 epochs on an NVIDIA 2060 Super GPU, iteratively optimized by the root mean square propagation (RMSProp) algorithm, with a learning rate of 1e-6 and a batch size of 2. The hyperparameters
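A training-loop sketch matching the reported settings (100 epochs, RMSProp with a learning rate of 1e-6, batch size 2 configured on the data loader) might look as follows; the model, data loader, and joint loss are assumed to be defined as above.

```python
import torch

def train(model, train_loader, joint_loss, epochs=100, lr=1e-6):
    """Optimize the multi-task model with RMSProp using the joint loss."""
    optimizer = torch.optim.RMSprop(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for images, masks, labels in train_loader:  # DataLoader built with batch_size=2
            seg_logits, cls_logits = model(images)
            loss = joint_loss(seg_logits, masks, cls_logits, labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```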
Evaluation metrics
Segmentation metrics
To evaluate segmentation performance, we employ the Dice Similarity Coefficient (DSC) and the Intersection over Union (IoU) as metrics.
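For reference, with P the predicted tongue mask and G the ground-truth mask, the standard definitions are:

```latex
\mathrm{DSC}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}, \qquad
\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}
```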
Classification metrics
The classification performance is mainly evaluated with the average precision (AP) for each category and the mean average precision (mAP) over all categories. The mAP is calculated by finding the AP for each category and then averaging over all categories.
We also compute overall precision (OP), recall (OR), F1-score (OF1), and per-category precision (CP), recall (CR), F1-score (CF1) as follows:
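These follow the standard multi-label definitions, where N_i^c, N_i^p, and N_i^g denote, for label i, the numbers of correctly predicted, predicted, and ground-truth positive images, and C is the number of labels:

```latex
\mathrm{OP} = \frac{\sum_i N_i^{c}}{\sum_i N_i^{p}}, \quad
\mathrm{OR} = \frac{\sum_i N_i^{c}}{\sum_i N_i^{g}}, \quad
\mathrm{OF1} = \frac{2 \cdot \mathrm{OP} \cdot \mathrm{OR}}{\mathrm{OP} + \mathrm{OR}}

\mathrm{CP} = \frac{1}{C}\sum_i \frac{N_i^{c}}{N_i^{p}}, \quad
\mathrm{CR} = \frac{1}{C}\sum_i \frac{N_i^{c}}{N_i^{g}}, \quad
\mathrm{CF1} = \frac{2 \cdot \mathrm{CP} \cdot \mathrm{CR}}{\mathrm{CP} + \mathrm{CR}}
```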
Tongue segmentation results
In the tongue segmentation experiments, we first trained the segmentation branch of SSC-Net on our BUCM dataset. Then, we compared its segmentation performance with existing segmentation methods on the testing sets of BUCM and BioHit. To rigorously evaluate cross-domain generalization, we performed zero-shot transfer of the BUCM-trained models to BioHit without fine-tuning. The compared methods are FCN-8s, 37 U-Net, 15 SegNet, 38 U-Net++, 39 DeepLabv3+, 40 and SegNeXt. 41
Segmentation results and analysis on BUCM
Table 3 shows the quantitative segmentation results on the BUCM dataset. Our SSC-Net achieves state-of-the-art performance with a DSC of 0.9963 and IoU of 0.9929, demonstrating high performance in tongue body segmentation. Specifically, SSC-Net surpasses the second-best method (SegNeXt) by 0.15% in DSC (0.9963 vs. 0.9948) and 0.35% in IoU (0.9929 vs. 0.9894). Notably, compared to the classical U-Net architecture, SSC-Net attains more substantial improvements of 0.94% in DSC and 1.42% in IoU. This performance gap highlights the effectiveness of our proposed feature extraction module in capturing discriminative details.
Comparisons of segmentation results of our method and other methods on the BUCM and BioHit dataset.
Segmentation results and analysis on BioHit
The right half of Table 3 reveals SSC-Net’s superior generalizability on the cross-device BioHit dataset, achieving 0.9719 DSC and 0.9471 IoU. Compared to the suboptimal method SegNeXt (0.9436 DSC and 0.8953 IoU), our method demonstrates improvements of 2.83% in DSC and 5.18% in IoU. These results validate that our method possesses good generalizability and robustness when applied to different capture devices.
Visualization and analysis
Figure 6 displays representative prediction results from the compared segmentation methods and the corresponding GTs on the BUCM and BioHit datasets. Despite the varying tongue appearance and complex backgrounds, SSC-Net achieved more stable performance with few holes or redundancies on both datasets. In contrast, U-Net, SegNet, and U-Net++ are susceptible to the complex surroundings. The visual results support the potential of SSC-Net for accurate and reliable segmentation.

The visualization of segmentation results on (a) BUCM and (b) BioHit.
Tongue classification results
In this subsection, we report extensive experimental results and comparisons on single-task methods and multi-task methods to demonstrate the effectiveness of the proposed method. All these experiments were performed on the BUCM dataset.
Single-task classification results and analysis
For ease of comparison with the proposed multi-task method, before the single-task classification experiment, we pre-processed the test images by inputting the segmentation predictions of SSC-Net into the ROI extraction module. The compared classification methods are AlexNet, 42 MobileNetV2, 29 VGG16, 18 ResNet50, 13 DenseNet121, 43 and ConvNeXt-T. 44 The mAP for overall categories, along with OP, OR, OF1, and CP, CR, CF1 of the single-task (STL) methods, are shown in Table 4. The average precision (AP) for all classification categories is presented in Table 5.
Comparisons of segmentation results and classification results on BUCM, where “STL” means single-task learning methods and “MTL” means multi-task learning methods.
Comparisons of AP (mean ± standard deviation in %) for all classification categories on BUCM.
The three columns within C1 represent three classes in category C1: light-red tongue, pale tongue, and red tongue.
The experimental results indicate that, among the single-task classification methods, ConvNeXt achieved the highest 90.16 mAP and 0.823 OF1. In terms of AP, these single-task methods excelled at classifying readily discernible characteristics such as C1 and C2. For example, the highest-performing DenseNet121 achieved an AP for light-red tongues that is only 0.2% lower than that of SSC-Net. However, the comparative results reveal that traditional methods show notable limitations on fine-grained characteristics such as C4 and C5. For instance, the least effective AlexNet exhibits 40.5% and 30.9% lower APs than SSC-Net on these two categories, respectively. These results validate SSC-Net’s significant advantages in fine-grained characteristic classification.
Multi-task classification results and analysis
In the multi-task experiment, both branches of SSC-Net were trained to evaluate the performance of simultaneous segmentation and classification. To compare with other multi-task methods, the fine-grained classification module was replaced with either a fully connected layer (FC) or DFL. 27 The multi-task (MTL) experiment results are shown in Tables 4 and 5.
The classification results indicate that, among the multi-task methods, SSC-Net outperformed Ours+FC and Ours+DFL in both the segmentation and classification tasks. In the classification task, SSC-Net achieved 0.851 OF1 and 0.830 CF1, improvements of 3.5% and 3.9% over the second-ranked multi-task method Ours+DFL, and of 3.9% and 2.2% over the single-task method DenseNet121. In terms of average precision, SSC-Net achieved the best mAP of 92.02% among all methods, outperforming both Ours+DFL and DenseNet121 by 2.8%. Moreover, the proposed method stands out with the highest AP in most categories, particularly excelling in C4 and C5. Specifically, it outperforms DenseNet121 and Ours+DFL by 1.6% and 2.7% for C4, and by 7% and 2.2% for C5, respectively. These improvements can be attributed to CSRA’s capability to extract category-specific features, enabling precise classification of fine-grained characteristics.
When SSC-Net conducted both segmentation and classification simultaneously, despite a slight 0.2% decrease in DSC compared to performing segmentation alone, our multi-task segmentation performance remains at the forefront compared to other single-task segmentation methods in Table 3, regardless of the combined classification module.
From the results, our multi-task approach achieves performance improvements compared to both single-task and multi-task classification methods. These improvements in classification performance can be attributed to the multi-task learning strategy and the integration of a fine-grained classification module. The multi-task learning method effectively employs segmentation results to mitigate interference from non-tongue parts on classification features, while the fine-grained classification module improves the average precision across various categories by combining average pooling features with category-specific spatial pooling features.
Visualization and analysis
Grad-CAM 45 is a visualization technique that highlights discriminative regions in an image that are highly relevant to the target category. We calculate activation scores for each category from the last convolutional layer before each model’s classifier and overlay them onto the test images. Figure 7 shows the category-specific activation maps generated by the proposed method and the compared methods on some representative images.
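A minimal hook-based Grad-CAM sketch for one class of a multi-label model is shown below; it assumes target_layer is the last convolutional layer before the classifier and that the model returns a (segmentation, classification) output pair.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, class_idx):
    """Compute a Grad-CAM heatmap for the given class index; image is (1, 3, H, W)."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    model.eval()
    outputs = model(image)
    cls_logits = outputs[1] if isinstance(outputs, tuple) else outputs
    model.zero_grad()
    cls_logits[0, class_idx].backward()          # gradient of the target class logit

    h1.remove(); h2.remove()
    acts, grads = activations[0], gradients[0]   # both (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)           # GAP of gradients -> channel weights
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))  # weighted activation map
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
    return cam.squeeze()                                      # (H, W) heatmap
```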

The visualization of each category activation maps by proposed method and compared methods on some representative images. The activation maps were resized to the same size as that of the test images.
In Figure 7, blue areas indicate less attention and red areas more attention. The visualization shows that the compared methods, that is MobileNetV2, ResNet50, and DenseNet121, focus on regions corresponding to the tongue body and tongue coating to varying extents. As mentioned in “Introduction” section, these regions exhibit noticeable visual distinctions in the images, thereby facilitating their identification. Notably, our approach not only succeeds in pinpointing the regions corresponding to the tongue body and tongue coating but, with the support of CSRA, also accurately and comprehensively identifies cracks and tiny teeth marks. The congruence between the visualization and the quantitative classification results indicates the model’s focus on discriminative regions, significantly increasing the interpretability of the proposed method.
Ablation study
To validate the effectiveness of the proposed method, we conducted ablation studies on SSC-Net by removing the feature fusion module (FFM) or the ROI extraction module (REM), and by replacing the fine-grained classification module (FCM) with fully connected layers. The ablation experiments were conducted on our BUCM dataset. The results of the ablation study are shown in Table 6.
Experimental results of ablation study on BUCM dataset.
Discussion
Previous studies on tongue image recognition mostly use repetitive feature extractors to derive deep features for segmentation and classification. While this approach may facilitate model optimization, it essentially obstructs the flow of potential information between the tasks and adds unnecessary computational burden. In contrast, the proposed multi-task network not only explicitly links the segmentation and classification tasks but also utilizes a shared-parameter encoder to extract tongue body features, leveraging the latent information of both tasks. Despite the potential challenge of optimizing multi-task models during training, we conducted extensive experiments to eliminate randomness and reported both means and standard deviations. The results in Table 4 confirm that our multi-task model indeed improves classification performance. Meanwhile, Table 4 indicates a balance between overall recall and precision, yet per-category recall is noticeably lower than per-category precision. This discrepancy may be attributed to the smaller number of true positives in imbalanced categories, such as red tongues, making the model more susceptible to false positives within those categories.
Clinical applicability
Leveraging the medical internet of things, SSC-Net could provide diagnostic decision support for professionals and health self-assessment for non-specialists. In outpatient clinics, TCM practitioners capture tongue images via mobile devices and obtain segmentation maps with classification reports, enabling real-time diagnostic assistance. In remote areas, community health workers conduct preliminary screenings using SSC-Net-powered mobile devices to analyze tongue images, facilitating expedited expert teleconsultations without requiring in-person specialist involvement. During public health crises (e.g. COVID-19), post-rehabilitation patients can perform contactless tongue image capture and evaluate treatment efficacy through weekly comparisons of tongue characteristics. Overall, this approach proves valuable for hospital triage, remote consultations, and post-rehabilitation services.
Limitations and future work
There are three limitations that need to be investigated further in the future. First, the BUCM dataset collected in this work is limited in size and contains imbalanced class distributions, particularly for rare tongue types such as dark-red and blue-purple colors. This limitation forced us to simplify the tongue color classification (currently covering only light-red, pale, and red types) instead of using the complete five-category scheme in TCM. Second, the proposed method focuses only on tongue images without considering pulse information, which is equally important in real-world TCM diagnosis. Third, while the proposed framework achieves high accuracy, we did not analyze its computational cost for clinical deployment scenarios. The current model’s hardware requirements may limit its practicality in resource-constrained settings, such as mobile TCM diagnostic devices.
In the future, first, multicenter collaborations will be implemented to expand the scale of the dataset, enhancing diversity and class balance with particular emphasis on underrepresented categories. Second, building on the proposed tongue image recognition framework, we will integrate multimodal physiological data (e.g. pulse waveforms) to achieve comprehensive syndrome differentiation that better aligns with clinical TCM requirements. We believe that these enhancements could establish SSC-Net as a foundational component in standardized TCM diagnostic workflows.
Conclusion
This work proposed a multi-task joint learning network, SSC-Net, for tongue image recognition, which performs both tongue body segmentation and multi-label classification. A fine-grained classification module is applied for multi-label tongue image classification. To the best of our knowledge, this is the first work to simultaneously classify as many characteristics as tongue body color, coating color, coating thickness, cracks, and teeth marks with a multi-task network. SSC-Net explicitly connects latent information between the segmentation and classification tasks, systematically strengthening the relationship between tongue segmentation and classification. Experimental results show that SSC-Net achieved 92.02% mAP for tongue classification, guided by tongue body segmentation results with a DSC of 0.9943. We hope the proposed method can provide clinical professionals with auxiliary diagnostic information and help non-professionals stay aware of their physical health status. It is of great significance for promoting the standardization and objectification of TCM, as well as improving the level of health services that TCM offers to the public.
Acknowledgements
The authors express their gratitude to all the institutions and individuals that have provided support for this work.
Ethical considerations
This study was approved by the Ethics Committee of Northeastern University of China (approval no. NEU-EC-2024B049S).
Author contributions
The authors confirm contribution to the article as follows: Xiaopeng Sha did conceptualization, writing—review and editing, and supervision; Zheng Guan did methodology, investigation, and writing—original draft; Ying Wang and Jinglu Han did data curation and formal analysis; Yi Wang did investigation; Zhaojun Chen did writing—review and editing and supervision.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Natural Science Foundation of China (No.62104034), the Natural Science Foundation of Hebei Province (No.F2024501044), and Central Guidance Local Science and Technology Development Project (No.246Z2002G).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Guarantor
XS.
