Abstract
The rapid development of artificial intelligence technology has gradually extended from the general field to all walks of life, and intelligent tongue diagnosis is the product of a connection between this new discipline and a traditional one. We reviewed the deep learning and machine learning methods applied to tongue image analysis over the last 5 years, focusing on tongue image calibration, detection, segmentation, and the classification of diseases, syndromes, and symptoms/signs. We introduced the technical evolutions and emerging technologies applied in tongue image analysis; as we have noticed, attention mechanisms, multiscale features, and prior knowledge have been applied successfully, and we emphasized the value of combining deep learning with traditional methods. Based on the basic essence of tongue diagnosis in traditional Chinese medicine, we also pointed out two major problems in this field: data set construction and the low reliability of performance evaluation. Finally, we presented a perspective on the future of intelligent tongue diagnosis; we believe that self-supervised methods, multimodal information fusion, and the study of tongue pathology will have great research significance.
Introduction
Tongue diagnosis is an important means of obtaining disease information in traditional medicine, especially in traditional Chinese medicine (TCM). TCM physicians rely heavily on it to assess the health state of the whole body and then prescribe herbs to restore the body's balance. Because of its abundant superficial vascular tissue, the tongue can transmit many useful signals in real time,1–3 which may carry important medical information. Nowadays, tongue diagnosis can even be used for disease prediction or disease stage estimation.4–6 In this respect, as the only noninvasive way to observe an internal organ in vivo, its diagnostic value has long been underestimated by modern medicine. Studying the classification or quantification of tongue images will not only help promote the modernization of TCM but also help to reveal the pathological basis behind the tongue organ.
Rather than being used only to collect clinical information on the tongue organ for the diagnosis of tongue diseases, tongue diagnosis is often used to perceive the health state of the whole body for syndrome differentiation (SD) in TCM. As a traditional diagnostic method, tongue diagnosis refers to the physician collecting disease information by “watching” the tongue symptoms of coating color, tongue color, and tongue texture. However, tongue symptoms are neither objective, because they are vulnerable to the bias of human observation, nor easy to quantify. Meanwhile, a syndrome (or Zheng 7 ) is not always a physical state that can be perceived clearly, and the relationship between tongue features and the syndrome is a kind of uncertain knowledge. To extract objective features and fit this uncertain correlation, many studies have used artificial intelligence (AI) technologies to extract tongue features automatically or to model the uncertain relationships.
Up to now, the field of intelligent tongue diagnosis has gained new features and shown several traceable technological trends. Attention mechanisms help to improve the recognition and segmentation of tongue images and are receiving increasing attention and application.8–10 Because tongue images contain abundant health information, extracting multiscale features is helpful for diagnosis or classification based on tongue images.11,12 Although deep learning methods can extract deeper or more abstract features, they have shown better results when combined with traditional methods, and traditional methods are increasingly valued as a way to compensate for some of the shortcomings of deep learning. For example, with the help of a convolutional neural network (CNN) and a latent Dirichlet allocation (LDA) model, Hu et al. 13 modeled a relatively robust correlation between Chinese herbal prescriptions and tongue images. Based on tongue image features extracted by ResNet50, an Extreme Gradient Boosting (XGBT) model optimized by the genetic algorithm (GA) was able to predict prediabetes and diabetes. 14 The multiple-instance support vector machine (SVM) method, and the combination of the deep learning model VGG-16 with the SVM algorithm, have shown greatly improved performance in the classification or diagnosis of tongue images.15,16 Prior knowledge, such as morphological operations17–19 or knowledge from TCM experts,15,20 is also a useful aid when incorporated into a deep learning model.
This paper reports the progress of the methods and technologies used to promote the automation and objectification of tongue diagnosis. Many related reviews have been published,21–25 but none has provided a global perspective. References 21,24 mainly focused on the pathological meaning of the tongue, references 22,23 focused on feature engineering, and reference 25 depicted the history of tongue diagnosis in China. This paper focuses on the intelligent tongue image analysis methods used for calibration, segmentation, and the classification of symptoms, syndrome types, and diseases, and reviews the related studies indexed in the Web of Science and IEEE, mainly those published in the last 5 years.
Tongue diagnosis in TCM (background)
Tongue diagnosis is a healthy status diagnosis method that the Chinese have followed for thousands of years and has made outstanding contributions to the history of fighting against diseases. 25 It is used to diagnose syndrome types or diseases by observing the color, texture, mobility, humidity, and some position information of the tongue, but mainly used for SD. In summary, it has the following 12 characteristics: tongue color, tongue shape, tongue motion, tongue coating color, tongue coating thickness, tongue coating quantity, tongue coating texture, tongue coating position, crack, tooth mark, the quantity of saliva and red dots, and the sublingual vein, as shown in Figure 1.

Figure 1. The tongue features that TCM doctors care about. In TCM theory, these features are considered to be related to the body's health status. For example, a chubby tongue may indicate a “wet” body status, as may a lubricant tongue; a thin tongue may indicate a “dry, hot” status; an askew tongue may suggest an “insane” brain status, and a flaccid tongue may imply a “neurasthenia” brain status; coating features of the tongue are closely related to the function of the stomach; a considerable number of red dots implies a “fiery” status of the upper body; and a purple sublingual vein is related to a “blood stasis” body status.
TCM theory believes that the whole is connected with the part, different internal organs’ function states may show different features on the tongue, as shown in reference,2,26 and these different internal organs’ function states can be understood as “syndrome.” The empirical mapping between tongue and syndrome is established through the summary and induction of various manifestations of the above characteristics, which corresponds to the various syndrome types of the internal environment of the human body. The prescription of traditional Chinese herbs based on these empirically defined syndrome types can often obtain curative effects that modern science cannot explain.
In the past, because of the estrangement between tradition and modernity, few people cared about the evidence to support tongue diagnosis, so its diagnostic value was ignored by modern medicine for a long time. With the rapid development of society and the continuous collision between tradition and modern science, more and more tongue diagnostic researchers full of curiosity have carried out active exploration. Many studies have shown that the external manifestations of the tongue are significantly related to some diseases and health indicators in the body3,27–30 and even COVID-19.4,31–33 Meanwhile, the movement of the tongue also relies on the function of the brain34,35 and can be used for identity verification. 36 Thus, tongue diagnosis has its inherent rationality, suggesting that TCM tongue diagnosis is expected to obtain evidence-based support. As the tongue is the only noninvasive visible internal organ in the human body, its capillaries are more superficial, which allows us to directly access internal information simply. All these indicate that the tongue has great diagnostic value, which needs to be paid more attention to and explored by modern medicine.
However, the visual features used in tongue diagnosis cannot be objectively quantified, which poses a great challenge to clinical research and pathological study. In addition, tongue diagnosis is mainly used for SD in TCM, but the concept of a syndrome is vaguely defined: it is an abstract description of the overall health or functional state of the body, and TCM syndrome types cannot be quantified either, which increases the difficulty of tongue diagnosis research. With the development of AI technology, models that can simulate human decision-making and have powerful feature extraction ability have become favored. Many researchers expect AI to solve the problems of objectively quantifying tongue images and assisting decision-making.
Tongue image data preprocessing
Before a tongue image is analyzed, three important tasks must be completed: tongue image correction, tongue image recognition, and tongue image segmentation. Tongue image acquisition is affected by differences in acquisition devices and external light sources, which cause color distortion and nonstandard image formats, and such distortion harms the doctor's diagnosis. In addition, the collected tongue image may be unqualified; for example, the camera may fail to capture the tongue, or the patient may not fully extend the tongue. Jiang et al. 37 summarized several kinds of unqualified tongue images and, more importantly, trained a ResNet-152 model to identify qualified tongue images. Meanwhile, a multitask deep learning model 38 was introduced to assess image quality by classifying tongue images as high quality or unqualified. Tongue image recognition quickly identifies whether a complete tongue image has been captured in the auxiliary scene. Tongue image segmentation separates the tongue body from the background so that the image contains only the tongue body, eliminating redundant information.
Tongue image calibration
To address color distortion in tongue image acquisition, some researchers use unified acquisition equipment in a closed acquisition environment, which eliminates the interference of different equipment and light sources with the clinical information carried by the tongue image. Tongue image correction mainly aims to correct the color and brightness of tongue images acquired in an open environment, such as images collected outdoors or with mobile phones. The correction methods can be divided into three categories: deep learning algorithms,39,40 polynomial regression algorithms,41–44 and support vector regression (SVR) methods.45,46
Methods based on polynomial regression are the most commonly used because of their low computational complexity and short training time, which is crucial for online applications. Wang and Zhang 43 employed the classical polynomial regression algorithm together with a new colorchecker designed for the tongue color space to improve calibration accuracy. Polynomial regression always needs a colorchecker as the reference for generating the color matrix, 42 and a colorchecker built specifically for tongue images improves the accuracy. Furthermore, Sui et al. 41 proposed a root polynomial regression method that achieved a more stable and even better correction effect than traditional polynomial regression under different illumination conditions. Zhuo et al. 40 were the first to introduce deep learning to the tongue image correction problem; they built a simulated annealing (SA)–GA–backpropagation (BP) neural network based on a partial color gamut similar to those of the tongue body, tongue coating, and skin. They further 44 proposed a correction method based on kernel partial least squares regression (K-PLSR), which obtained better color correction performance than classical polynomial-based and SVR-based methods under different lighting conditions. Lu et al. 39 proposed a two-phase deep color correction network (TDCCN) to establish the tongue color mapping model under a standard lighting condition, and they further proposed flexible color-adjusting options to bridge the differences between standard lighting conditions and the environments that doctors are familiar with. Zhang et al. 46 used the SVR method to study a color rendition chart designed specifically for tongue image calibration and showed that 24 was the optimal number of color patches.
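The polynomial regression idea described above can be illustrated with a minimal sketch. It assumes a second-order polynomial expansion, colors scaled to [0, 1], and simulated colorchecker patches; the function names, the simulated color cast, and the patch values are hypothetical and are not taken from any of the cited studies.

```python
import numpy as np

def poly_features(rgb):
    """Expand N x 3 RGB rows into second-order polynomial terms (N x 10)."""
    rgb = np.asarray(rgb, dtype=float)
    r, g, b = rgb[:, 0], rgb[:, 1], rgb[:, 2]
    ones = np.ones_like(r)
    return np.stack([ones, r, g, b, r * g, r * b, g * b, r**2, g**2, b**2], axis=1)

def fit_color_correction(measured, reference):
    """Least-squares fit of a 10 x 3 matrix mapping measured to reference colors."""
    coeffs, *_ = np.linalg.lstsq(
        poly_features(measured), np.asarray(reference, dtype=float), rcond=None
    )
    return coeffs

def apply_color_correction(image, coeffs):
    """Apply the fitted mapping to an H x W x 3 image with values in [0, 1]."""
    h, w, _ = image.shape
    corrected = poly_features(image.reshape(-1, 3)) @ coeffs
    return np.clip(corrected, 0.0, 1.0).reshape(h, w, 3)

# Toy data (hypothetical): 24 reference patches and the same patches as seen
# under a simulated color cast with a small constant offset.
rng = np.random.default_rng(0)
reference_patches = rng.uniform(0.05, 0.95, size=(24, 3))
cast = np.array([[0.90, 0.05, 0.00],
                 [0.02, 0.80, 0.03],
                 [0.00, 0.04, 0.70]])
measured_patches = reference_patches @ cast + 0.02

coeffs = fit_color_correction(measured_patches, reference_patches)
residual = np.abs(poly_features(measured_patches) @ coeffs - reference_patches).max()

# Correct a small image made of the measured patch colors.
toy_image = measured_patches.reshape(4, 6, 3)
corrected_image = apply_color_correction(toy_image, coeffs)
```

Because the simulated distortion is affine, the fit is essentially exact here; real illumination changes are not, which is one reason root polynomial and kernel-based variants have been explored.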
Tongue image detection and segmentation
Tongue image detection and segmentation is an important and necessary step in the preprocessing steps of evaluating the health status based on tongue image. Tongue image segmentation excluding other redundant information is very important for subsequent downstream tasks. Since tongue image segmentation needs to detect the tongue image first, tongue image detection and tongue image segmentation have the same feature extraction process, often completed simultaneously in many studies, and they are discussed in this section.
Tongue image detection and segmentation methods are divided into two categories according to how features are extracted: traditional methods and deep learning–based methods. Traditional methods include color thresholding,47,48 edge detection, 49 the active contour model (ACM),17,50 and region growing and merging. 51 These methods all extract image features with manually set rules or prior knowledge and then use the features for classification; among them, ACM is the most widely used. For example, Guo et al. 50 first applied a two-stage K-means clustering method based on the extracted initial tone boundary and then applied the ACM to segment the tongue image. Study 47 performed image thresholding in the hue saturation intensity (HSI) color space followed by morphological operations to obtain an initial tongue region. A gray projection technique was then used to determine the upper bound of the tongue body root and refine the initial region. Wu and Zhang 49 fused a region-based method with an edge-based method. They extracted the region of interest (ROI) and then merged adjacent regions using a histogram-based color similarity criterion, which made the results less sensitive to cracks and fissures on the tongue. They then adopted a fast marching method to obtain a closed curve based on edge features. The contour obtained by the region-based approach was used as a mask during the (edge-based) fast marching process to make the final contour more robust. Liu et al. 17 proposed a patch-driven segmentation method in which each patch in the testing image is sparsely represented by patches in spatially varying dictionaries constructed from local patches of the training images. The derived sparse coefficients are then used to estimate the tongue probability. Finally, the hard segmentation is obtained by applying the maximum a posteriori (MAP) rule to the tongue probability map and is further polished with morphological operations.
However, traditional methods are based on simple pixel values or low-level features of tongue images (such as color, edge, brightness, and texture) and cannot extract high-order features. Furthermore, designing these handcrafted features is extremely time consuming and tedious. Therefore, with the breakthrough of AI technology, most research has turned to deep learning methods in recent years.
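As an illustration of the thresholding-plus-morphology pipeline described above, the following is a minimal sketch. The saturation and intensity thresholds, the "red greater than green" rule, and the 3 x 3 structuring element are assumptions chosen for a toy image; they are not the settings of study 47.

```python
import numpy as np

def saturation_intensity(image):
    """Return the S and I channels of HSI for an H x W x 3 float image in [0, 1]."""
    intensity = image.mean(axis=2)
    minimum = image.min(axis=2)
    saturation = 1.0 - minimum / np.maximum(intensity, 1e-8)
    return saturation, intensity

def neighborhood_stack(mask):
    """Stack the 3 x 3 neighborhood of every pixel (padding with background)."""
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    h, w = mask.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(3) for dx in range(3)])

def erode(mask):
    """Binary erosion: keep a pixel only if its whole 3 x 3 neighborhood is set."""
    return neighborhood_stack(mask).all(axis=0)

def dilate(mask):
    """Binary dilation: set a pixel if any pixel in its 3 x 3 neighborhood is set."""
    return neighborhood_stack(mask).any(axis=0)

def initial_tongue_region(image, sat_min=0.25, int_min=0.2):
    """Threshold saturated, reddish pixels, then remove speckles by opening."""
    saturation, intensity = saturation_intensity(image)
    reddish = image[..., 0] > image[..., 1]
    mask = (saturation > sat_min) & (intensity > int_min) & reddish
    return dilate(erode(mask))  # morphological opening

# Toy image (hypothetical): gray background, a reddish block, one noisy pixel.
image = np.full((20, 20, 3), 0.6)
image[5:15, 5:15] = (0.8, 0.3, 0.3)
image[1, 1] = (0.8, 0.3, 0.3)
mask = initial_tongue_region(image)
```

In this toy case the opening removes the isolated pixel and keeps the block; on real images the thresholds would need tuning, which is exactly the kind of manual rule setting that limits traditional methods.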
In deep learning methods, the model parameters are updated by backpropagation on the observed data to model the relationship between the input data and the labels. In this process, the characteristics of the data can be extracted automatically. The fully convolutional network (FCN) was the pioneering achievement in applying deep learning to image segmentation and achieved extraordinary performance at that time. 52 It extends earlier image-level classification networks to pixel-level classification, thereby achieving image segmentation. Wang et al. 53 applied FCN to tongue image segmentation and achieved better results than traditional methods. Huang et al. 54 improved the FCN by designing a receptive field block module (including a multibranch convolutional block and a shortcut connection) that could extract higher-level or global features. Subsequently, Ronneberger et al. 55 developed the U-Net model based on the FCN model. Because of its excellent performance, it has been widely used in computer vision. Li et al. 56 applied this model to tongue crack segmentation and improved its encoder to extract relatively more abstract high-level semantic features. Similarly, Peng et al. 57 improved the U-Net framework and designed a lightweight model, P-Net, with a structure shaped like the letter “P,” which is suitable for remote tongue image segmentation. Zhou et al. 18 added a morphological layer to U-Net to refine the obvious morphological errors in U-Net segmentation. Other important models are Faster R-CNN and Mask R-CNN. In reference 58, tongue images were first segmented by Mask R-CNN, and Faster R-CNN was then used to detect and classify tongue features such as cracks, tooth marks, spots, and rotten coating. In addition, Yuan et al. 59 designed a cascaded CNN model for tongue image segmentation on mobile and embedded devices, and its prediction speed is significantly higher than that of other deep learning methods. Zhou et al. 60 borrowed the idea of the generative adversarial network (GAN): a generator produces a segmented image, and a discriminator determines whether the generated segmentation is real, which reduces the dependence on annotated data sets.
Combining some advantages of traditional methods with deep learning methods also helps to improve model performance. Zhou et al. 18 designed a morphological processing layer based on a morphological inductive bias, including specifically designed filters that refine any morphologically incorrect coarse mask image, as shown in Figure 2. Similarly, study 19 designed a tongue assessment filter to filter out segmented tongue images predicted by U-Net that have wrong contours. These handcrafted features were based on the morphology of the tongue. Gao et al. 62 used geometric features of the tongue based on the level set method derived from ACM, combined them with a CNN, and proposed a level set model with symmetry and edge constraints. However, Yuan and Liao 61 argued that the tongue body and tongue coating can be segmented by clustering according to differences between color blocks in the Lab color space; they proposed a K-means segmentation method based on the Lab color space and reported that it outperformed the deep learning methods FCN, U-Net, and Deeplab-v3.

Figure 2. The morphological layer for processing incorrect predictions. The morphological layer is used to reconstruct a morphologically incorrect prediction or to filter it out.
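One simple way to express a shape prior of this kind is a convexity check on the predicted mask. The sketch below is only an illustration of the general idea of filtering out morphologically implausible masks: it uses solidity (mask area divided by convex hull area) with an assumed threshold of 0.9, and it is not the morphological layer of Zhou et al. 18 or the assessment filter of study 19.

```python
def cross(o, a, b):
    """Z component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in order around the hull."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for i, (x1, y1) in enumerate(vertices):
        x2, y2 = vertices[(i + 1) % len(vertices)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

def solidity(mask):
    """Foreground area divided by convex hull area (close to 1 for convex shapes)."""
    corners, pixels = [], 0
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if value:
                pixels += 1
                # Use pixel corners so the hull encloses whole pixels.
                corners += [(x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1)]
    if pixels == 0:
        return 0.0
    return pixels / polygon_area(convex_hull(corners))

def passes_shape_filter(mask, min_solidity=0.9):
    """Accept a predicted mask only if it is sufficiently convex (assumed threshold)."""
    return solidity(mask) >= min_solidity

# Toy masks (hypothetical): a filled square and an L-shaped region.
square_mask = [[1] * 10 for _ in range(10)]
l_shaped_mask = [[1 if (x < 5 or y >= 5) else 0 for x in range(10)] for y in range(10)]
```

A rejected mask could then be discarded or sent back for refinement; which of the two is appropriate depends on the downstream task.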
Researchers have creatively combined or proposed many ideas for model building and training to make models more robust and better targeted at tongue image segmentation. Multitask learning usually shares a feature extraction backbone to improve generalization and reduce the cost of manual labeling. Xu et al. 66 used the U-Net framework as a common feature extraction module to segment the tongue and classify the tongue coating in a multitask manner. Zhou et al. 63 designed two different loss functions based on multitask learning to anchor the tongue localization and segmentation tasks, and decoupled the two tasks. Furthermore, Tang et al. 11 proposed cascaded CNNs with multitask learning to predict the tongue region, tongue landmarks, and tooth-marked tongue. Multitask learning often requires a special loss function. Cai et al. 67 changed the loss function for tongue segmentation and proposed a function that decreases the intraclass distance and increases the interclass distance. To steer the model toward specific features, Li et al. 56 introduced a global convolution network module to extract relatively abstract high-level semantic features, while Huang et al. 54 constructed a receptive field block based on the receptive field theory that the region closer to the center of retinotopic maps is more important than others in distinguishing objects, which makes the model deal better with the blurred edge of the tongue body. Similarly, Peng et al. 57 applied an attention module to strengthen attention to the boundary and suppress useless information.
In addition, some studies report special solutions for other tongue segmentation scenarios. Tang et al. 11 reported a tooth-mark recognition solution based on the segmented tongue, and Li et al. 56 refined the U-Net model specifically to segment cracks on the tongue, as did reference 57. The sublingual vein has also been studied and extracted.10,68 Qiu et al. 10 added a cross-channel attention module to assign more weight to the target area, which improved the performance of their lightweight tongue diagnosis system on mobile devices, and another paper 68 proposed a two-stage segmentation method: a fully convolutional network without downsampling, which effectively reduces the loss of spatial feature information, followed by another fully convolutional network with properly chosen dilated convolutions to avoid the gridding issue. For real-time diagnosis, popular models are generally too large; Li et al. 64 proposed a lightweight architecture for real-time tongue image segmentation, and they further proposed the P-Net model, 57 refined from U-Net, for real-time tongue crack extraction. The data sets available to supervise tongue segmentation models are relatively small, which may cause generalization issues; Li et al. 19 applied an iterative learning method that repeatedly trains the network on good samples selected by specific filters. Others54,56 used pretraining to alleviate the generalization problem. Some studies hold that tongue features must be extracted at the global image level for segmentation, so dilated convolution blocks were introduced. Zhou et al. 63 used a context-aware dilated residual block to preserve information without adding extra parameters or computation; Peng et al. 57 used dilated convolution for dense feature extraction and field-of-view enlargement; and Tang et al. 12 proposed a hybrid cascade dilated convolution to extract multiscale features.
The division of the tongue into regions is also important for assessing health status, and Wu et al. 69 reported that regional alignment of tongue images improves the accuracy of disease diagnosis. Using medical ultrasound images to visualize and characterize human tongue shape and motion in real time would greatly help the study of healthy or impaired speech production; thus, the team of Hamed Mozaffari has proposed several real-time tongue contour segmentation methods for ultrasound images,70–72 and they have further 73 used augmented reality built on the extracted tongue movement in ultrasound data to provide real-time visual feedback that improves the training of language learners. Some researchers 74 have also used magnetic resonance images of the tongue to perform tongue segmentation.
To sum up, unlike in the general field, most tongue image recognition and segmentation scenes in tongue diagnosis only need to recognize and segment one tongue instance, and the tongue body often occupies a large share of the pixels in the image; as can be seen in references 12,63, dilated convolution blocks, or feature extraction at a coarser or more global level, can be useful. In addition, because the tongue has a relatively stable shape, that is, a roughly elliptical convex shape, morphological filtering is also suitable for tongue image segmentation, as shown in Figure 2. Since the edge of the tongue body is the key to tongue image segmentation, it helps to focus the main attention of the algorithm on the edge area. 57 The cross-channel attention module promotes information exchange between feature maps and assigns weights among them, which may explain why it can alleviate the problem of difficult tongue image samples. 10 Most studies draw on modules, frameworks, or methods that have been used successfully in the general field. Although these can improve accuracy when applied to tongue analysis, tongue image detection and segmentation remains a relatively difficult task, as stated in reference 63 and shown in Figure 3. The tongue in images varies greatly in shape and color, and its edge pixels are often difficult to distinguish from the lips or skin. Meanwhile, few public data sets can be used for unified evaluation, and most of the proposed model codes are not published, so the declared effects cannot be verified; as shown in Table 1, some of the studies did not report how the data set was constructed or the quality of its labels. Open scenes often show more complex variation than fixed scenes, so data set quality deserves more attention.
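The idea of a loss that decreases the intraclass distance while increasing the interclass distance can be sketched generically as follows. This is an assumed centroid-based formulation with a hypothetical weighting parameter, shown only to make the principle concrete; it is not the loss function proposed by Cai et al. 67.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def compactness_separation_loss(features, labels, weight=1.0):
    """Mean intraclass distance minus weighted mean interclass centroid distance."""
    groups = {}
    for feature, label in zip(features, labels):
        groups.setdefault(label, []).append(feature)
    centroids = {label: centroid(members) for label, members in groups.items()}

    # Intraclass term: how far each sample lies from its own class centroid.
    intra = sum(squared_distance(f, centroids[l])
                for f, l in zip(features, labels)) / len(features)

    # Interclass term: how far apart the class centroids are from each other.
    keys = sorted(centroids)
    pairs = [(a, b) for i, a in enumerate(keys) for b in keys[i + 1:]]
    inter = 0.0
    if pairs:
        inter = sum(squared_distance(centroids[a], centroids[b])
                    for a, b in pairs) / len(pairs)
    return intra - weight * inter

# Toy features (hypothetical): the same four points under two labelings.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
separated_loss = compactness_separation_loss(points, [0, 0, 1, 1])
mixed_loss = compactness_separation_loss(points, [0, 1, 0, 1])
```

A labeling in which each class forms a tight, well-separated cluster gives a lower value than one in which the classes are interleaved, which is the behavior such a loss is meant to reward during training.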
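The effect of dilation on the field of view, and the gridding issue mentioned above, can be illustrated with a small one-dimensional sketch. The kernel size of 3 and the dilation rates (2, 2, 2) and (1, 2, 5) are illustrative assumptions, not the configurations used in the cited studies.

```python
def dilated_conv1d(signal, kernel, dilation):
    """Valid-mode 1-D dilated convolution (cross-correlation, as in CNN layers)."""
    span = (len(kernel) - 1) * dilation
    return [
        sum(k * signal[i + j * dilation] for j, k in enumerate(kernel))
        for i in range(len(signal) - span)
    ]

def receptive_field(kernel_size, dilations):
    """Width of the input window seen by one output of stacked dilated layers."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

def covered_offsets(kernel_size, dilations):
    """Input offsets, relative to one output position, that the stack actually reads."""
    half = kernel_size // 2
    offsets = {0}
    for d in dilations:
        offsets = {o + j * d for o in offsets for j in range(-half, half + 1)}
    return offsets

# Same parameter count per layer, but a wider view as the dilation grows.
example_output = dilated_conv1d([1, 2, 3, 4, 5, 6], [1, 1, 1], dilation=2)

# Repeating one rate leaves holes in the window (gridding); mixed rates fill it.
uniform = covered_offsets(3, [2, 2, 2])
hybrid = covered_offsets(3, [1, 2, 5])
uniform_coverage = len(uniform) / receptive_field(3, [2, 2, 2])
hybrid_coverage = len(hybrid) / receptive_field(3, [1, 2, 5])
```

With the uniform rates only the even offsets are read, so about half of the 13-pixel window is ignored, whereas the mixed rates read every position of their 17-pixel window. This is one way to see why mixing rates is a common remedy for gridding.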

Figure 3. The variation in tongue appearance. The variation of tongue appearance in tongue images increases the difficulty of tongue segmentation, and the variation in shape and tongue extension may reduce the extracted features and weaken the performance of downstream tasks.
Table 1. Summary of machine learning modeling approaches for tongue detection and segmentation. The open scene is more complex than the fixed scene and requires more in terms of data quality and model performance.
https://github.com/BioHit/TongeImageDataset. SVM: support vector machine; HOG: histogram of oriented gradients; HSV: hue-saturation-value; CNN: convolutional neural network; FCN: full convolution network.
Tongue image for disease diagnosis
The tongue appearance is believed to be sensitive to some diseases,21,23 especially diabetes, whose association with the tongue has been the most studied11,75 in recent years. Zhang et al. 76 used the SVM method to diagnose diabetes based on tongue color values and tongue texture, features that were extracted by the division-merging method and the chrominance-threshold method. Selvarani and Suresh 77 further proposed an SVM classifier with multiple kernels, named the kernel ensemble classification method, to distinguish people with diabetes from healthy people based on tongue color distribution and texture. Fan et al. 78 showed that random forest (RF) diagnosed diabetes better than SVM; they combined texture features and four TCM tongue features of constitution color, coating color, cracks, plumpness, and slenderness as input. Deepa and Banerjee 79 also applied the SVM method as the classifier and used the particle swarm optimization (PSO) technique to tune its parameters and enhance its performance; this time, deep features of the tongue (such as color, texture, coating, tooth marks, and red spots) were extracted with the CNN DenseNet framework. Mathew and Sathyalakshmi 80 also developed an optimization-driven hybrid deep learning method for diabetes detection based on tongue images. Their proposed ExpACVO optimization algorithm, which combines anticorona virus optimization with the exponential weighted moving average, was used to train a Deep Q-Network classifier and achieved improved performance. Li et al. 14 fused prior knowledge of tongue images with deep features extracted by ResNet50 to diagnose diabetes and prediabetes with the XGBT algorithm, and they optimized the parameters of the XGBT model with GA, which further improved its performance. Zhang et al. 81 also proposed a fusion method of color features from tongue images for diabetes diagnosis (Table 2). They used a novel clustering-based color descriptor to represent and fuse the RGB, HSV, and Lab color spaces of tongue images, and K-nearest neighbor (KNN), SVM, minimum squared error (MSE), lasso, and ridge regression were used as classifiers and evaluation methods. Another deep feature of the tongue image used for diabetes diagnosis was extracted with ResNet50, 82 a popular CNN base framework. In contrast, Vijayalakshmi et al. 83 used a CNN block as the classifier, which outperformed SVM when classifying on three quantitative tongue features (geometry, color, and texture) measured with MATLAB. They argued that a person with diabetes would have a gray coating at the center of the tongue. Srividhya and Muthukumaravel 84 applied a self-organizing map (SOM) Kohonen classifier to classify diabetes or nondiabetes based on quantified features of tongue color and gist.
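The general pattern of fusing color features from several color spaces and feeding them to a simple classifier can be sketched as follows. This assumes a much simpler descriptor than the clustering-based one of Zhang et al. 81: mean RGB concatenated with a circular encoding of mean hue, mean saturation, and mean value. The pixel values and class names are hypothetical toy data, not clinical measurements.

```python
import colorsys
import math

def color_features(pixels):
    """Concatenate mean RGB with HSV statistics for a list of (r, g, b) in [0, 1]."""
    n = len(pixels)
    mean_rgb = [sum(p[c] for p in pixels) / n for c in range(3)]
    hsv = [colorsys.rgb_to_hsv(*p) for p in pixels]
    # Encode hue on the unit circle so that hues near 0 and near 1 stay close.
    mean_cos = sum(math.cos(2 * math.pi * h) for h, _, _ in hsv) / n
    mean_sin = sum(math.sin(2 * math.pi * h) for h, _, _ in hsv) / n
    mean_s = sum(s for _, s, _ in hsv) / n
    mean_v = sum(v for _, _, v in hsv) / n
    return mean_rgb + [mean_cos, mean_sin, mean_s, mean_v]

def knn_predict(train_features, train_labels, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    order = sorted(range(len(train_features)),
                   key=lambda i: math.dist(train_features[i], query))
    votes = [train_labels[i] for i in order[:k]]
    return max(set(votes), key=votes.count)

# Toy training set (hypothetical): each sample is a small patch of pixels.
training = [
    ([(0.80, 0.30, 0.30)] * 4, "class_a"),
    ([(0.78, 0.32, 0.31)] * 4, "class_a"),
    ([(0.82, 0.28, 0.33)] * 4, "class_a"),
    ([(0.60, 0.50, 0.52)] * 4, "class_b"),
    ([(0.62, 0.52, 0.50)] * 4, "class_b"),
    ([(0.58, 0.49, 0.51)] * 4, "class_b"),
]
train_features = [color_features(pixels) for pixels, _ in training]
train_labels = [label for _, label in training]

prediction_a = knn_predict(train_features, train_labels,
                           color_features([(0.79, 0.31, 0.30)] * 4))
prediction_b = knn_predict(train_features, train_labels,
                           color_features([(0.61, 0.51, 0.50)] * 4))
```

In practice the features would be computed on the segmented tongue region only, and the feature scales would usually be normalized before distances are compared.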
The second most studied is stomach disease; Gholami et al. 85 compared the accuracy of different CNN frameworks in diagnosing stomach cancer based on tongue color and its lint features, and the best model eventually comes to the DenseNet. Wu et al. 5 studied the tongue diagnosis indices for gastroesophageal reflux disease (GERD), and they used an automatic tongue diagnosis system (ATDS) to extract the tongue indices and found that the saliva amount (p = .009) and thickness of the tongue's fur (p = .036), especially that in the spleen–stomach area (%) (p = .029), were significantly greater in patients with GERD. Handcraft features of tongue images from physicians were weighted and filtered by the XGBT algorithm in research, 86 and the EfficientNet network was used to classify gastric cancer based on the selected features. Meng et al. 87 used tongue image high-level features extracted from a CNN framework to diagnose gastritis and found that the deep learning framework extracted more suitable features than histogram of oriented gradients (HOG), local binary pattern (LBP), and scale-invariant feature transform (SIFT), which extract the handcrafted low-level features. And they also argued that LIBLINEAR SVM classifier could handle the imbalanced data well. They 88 further introduced a high dispersal and local response normalization operation to the CNN framework to reduce redundancy and a multiscale features analysis to avoid its sensitivity to tongue deformation. Ma et al. 89 used the logistic regression (LR) model to integrate deep learning features of tongue images and canonical risk factors to screen patients with gastric precancerous lesions (PLGC), and the result showed 10.3% higher than that of the model only including canonical risk factors, which has demonstrated the value of tongue image characteristics in PLGC screening and risk prediction.
It is worth noting that the tongue was reported to have a strong association with COVID-19. The study said that 4 the tongue image has excellent discriminative ability for screening COVID-19 cases when using deep learning framework. And patients with mild and moderate COVID-19 commonly would hold a light red tongue and white coating. In contrast, more severe patients had a purple tongue and yellow coating, 32 highlighting that the fatty coating is a significant feature of COVID-19. Liang et al. 33 reported a cured case report of COVID-19 on its tongue diagnosis index and its Chinese medicine formula treatment which was also tweaked by the tongue features. They argued that the tongue color, fur thickness, and fur color were closely related to the progression of COVID-19. Wang et al. 31 had demonstrated that a convolutional network with a transfer learning method could construct a robust classifier of COVID-19 based on tongue features.
Besides, there is other creative or meaningful research on tongue diagnosis. Noguchi et al. 90 used the principal component scores of tongue color, together with gender and age, to diagnose Sjogren’s syndrome with the machine learning methods SVM, RF, LR, a bagging model of three SVMs, and a stacking model of them. They found that the SVM trained on principal component scores of tongue color, sex, and age showed the best accuracy, achieved significantly higher values than other classifiers and feature combinations, and reached a level comparable to machine learning models trained using the Saxon test. Another multifeature study, from Zhang et al., 91 used a low-rank representation model to build a multiview completion method that fills in the missing view information of facial, sublingual vein, and tongue images for diagnosing fatty liver disease. They used KNN, LDA, RF, least squares regression (LSR), or sparse representation classifier (SRC) as the classifier, all of which showed better diagnostic results with the proposed approach. Jiang et al. 58 applied a Faster R-CNN model to recognize the regions of cracks, tooth marks, stasis spots, greasy coating, peeled coating, and rotten coating on the tongue and then used the split-and-merge algorithm and a color threshold method on these regions to extract color features and area values. Coupled with some classifiers, these features improved the accuracy of diagnosing nonalcoholic fatty liver disease (NAFLD). Huang et al. 6 studied different variables on tongue images between patients with acute ischemic stroke and healthy participants and found, through multiple LR analysis, that pale tongue color, bluish tongue color, ecchymoses, and tongue deviation angle were associated with significantly increased odds ratios for acute ischemic stroke. Ning et al. 92 employed a specially designed algorithm, balanced evolutionary semi-stacking (BESS), to simultaneously enhance balanced bagging and cotraining procedures when detecting diabetes mellitus, chronic kidney disease (CKD), breast cancer, and chronic gastritis, and studied the disease classification performance of different machine learning methods on tongue images. The data and classifier diversity generated in this way were fully considered to create multiple metafeatures for the stacking ensemble of BESS, and their results showed that “SVM + LGBM (LightGBM)” achieved the best performance. Devi and Anita 93 used tongue images to diagnose thyroid disease and ulcers based on a semisupervised algorithm. Mansour et al. 94 reported using Internet of Things technology and a ResNet50 backbone for the remote diagnosis of 12 diseases such as CKD, nephritis, verrucous gastritis, nephritis syndrome, chronic cerebral circulation insufficiency, and coronary heart disease. Furthermore, Thanikachalam et al. 95 used a SqueezeNet model to extract tongue image features for classifying the same 12 diseases.
Together with other reviews, we have found that diagnostic models of diabetes and stomach diseases have been studied most in tongue diagnosis research. Tongue images of diabetes patients show macroscopic changes, as internal changes are easily perceived on the tongue, which has no skin covering. Both the stomach and the tongue belong to the digestive tract, which may explain their frequent association. 96 Traditional feature engineering methods (handcrafted or manual features) and deep learning methods are both helpful in extracting disease-related features from tongue images, and combining the two can perform better. 97 Although the appearance of the tongue is highly related to some diseases, it still cannot replace existing diagnostic methods in terms of accuracy and timeliness, and it shows no advantage that would allow it to replace existing examinations. By default in TCM, the tongue is used more for SD than for disease differentiation. In TCM, features from the tongue always need to be combined with other symptoms in diagnosis, and diagnosing diseases by the tongue alone does not comply with the holistic concept of TCM theory. Therefore, a disease diagnosis model based on the tongue alone may not only fail to achieve the expected effect but also face few clinical application scenarios.
Tongue image for syndrome differentiation
Syndrome (or Zheng 7 ) differentiation plays a crucial role in TCM; the prescription of compound Chinese herbs, acupuncture, massage, and so on all depend mainly on it. But the syndrome is not easy to distinguish, which poses a great challenge to physicians. Compared with other methods of diagnosing the syndrome, the health signals expressed by the tongue are more concrete and easier to perceive, so tongue diagnosis plays a more important role in SD.24,98,99
The team of Guihua Wen has done a lot of work in this field.8,13,20,100 Body constitution type is another kind of syndrome that occurs only in the healthy or subhealthy population; because its symptoms are inconspicuous, classifying it from the tongue is a great challenge. Even so, this team used a so-called complexity perception classification method 100 to classify constitution types from tongue images, handling hard and easy samples separately, to cope with varying environmental conditions and the uneven distribution of tongue images. A data set of 22,482 tongue images was collected from a local hospital and labeled with body constitution types by TCM doctors. Faster R-CNN was used to detect the tongue, and a modified VGG-16 served as the feature extractor. They applied LR to judge whether a sample was hard or easy: a sample that is easily classified is treated as an easy sample, otherwise as a hard one. The hard samples were then used to train a difficult body constitution type classification model, whereas the easy samples trained an easy model. They found that the hard samples have a denser distribution than the easy ones; their method does improve the classification accuracy, though its highest recorded value is only 61%. Furthermore, in the following year, Wen et al. 20 proposed a grouping-attributes zero-shot learning method based on prior knowledge of TCM to solve the imbalanced constitution class problem. This time, they expanded their tongue image data set to 46,753 images, and 15 attributes of the tongue were proposed as a 15D semantic vector to represent the tongue image based on the knowledge of tongue diagnosis, as shown in Table 3. Combined with this prior knowledge, their method alleviated the uneven distribution problem and could classify constitution types that had never been seen before. 
In addition, they 8 studied the recognition of disease location from tongue images: a stochastic region pooling method was proposed to focus more on detailed regional features, and an inner-imaging channel-wise attention mechanism was also proposed to enhance the robustness of modeling relationships between CNN components. These methods have shown great adaptability in automatic tongue disease location prediction. Moreover, they studied13,101 an automatic Chinese herbal prescription system based on a CNN feature extractor and an auxiliary therapy topic loss mechanism using tongue images, which generated prescriptions that relatively matched the real ones. Li et al. 102 proposed a complete diabetic tongue image classification method based on self-supervised features from a vector quantized variational autoencoder (VQ-VAE) and a K-means clustering classifier; the classification result showed that the self-supervised method could also extract features of TCM health signals from diabetic tongue images and separate different syndromes well without human annotation.
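The attribute-based zero-shot idea described above (representing each constitution type by a semantic attribute vector) can be illustrated with a minimal sketch. It is not the implementation of Wen et al.: it assumes that some upstream model already predicts attribute scores for a tongue image, it uses only four hypothetical attributes instead of the 15 in Table 3, and the class names, attribute meanings, and values are invented for illustration. A class is chosen by cosine similarity to its attribute prototype, so a class with no training images can still be predicted.

```python
import math

# Hypothetical attribute order: [pale body, tooth marks, yellow greasy coating, thin white coating].
# Prototype values are illustrative assumptions, not taken from Table 3.
PROTOTYPES = {
    "balanced": [0.0, 0.0, 0.0, 1.0],
    "qi_deficiency": [1.0, 1.0, 0.0, 0.0],
    "damp_heat": [0.0, 0.0, 1.0, 0.0],  # imagine this class had no training images
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def classify_by_attributes(predicted_attributes, prototypes=PROTOTYPES):
    """Return the class whose attribute prototype is most similar to the predicted attributes."""
    return max(prototypes, key=lambda name: cosine(predicted_attributes, prototypes[name]))
```

For example, `classify_by_attributes([0.1, 0.2, 0.9, 0.1])` selects the prototype dominated by the third attribute, even though no image of that class would be needed to define it.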
Sometimes combining with other health signals improves the classification accuracy of syndromes. Yuan and Liao 61 combined tongue Lab color values with a questionnaire-based consultation to diagnose constitution. They used a table of relationships between tongue Lab color values and constitutions, together with sentence similarity, as the features for classification. Huang et al. 107 also integrated tongue image features with indices of acoustic sound and pulse signal to imitate “integrating four diagnoses into one reference” in TCM clinical practice, which means cross-referencing the four diagnosis methods. They built an equation based on linear regression to model the association between these features and body constitution. The RGB and HSB color values of the tongue body were measured with Photoshop software, tongue coating features were represented by the modified Winkel tongue coating indices, and the length and width of the sublingual vein were also included. In another study, 108 the blood pressure feature was fused with the tongue feature and performed better than the tongue feature alone. Shi et al. 106 compared the effects of different feature combinations (tongue image, radial pulse wave, and body symptom) on classification accuracy, with the following result: tongue and pulse < symptom < symptom and tongue and pulse. They also compared different classification methods for classifying Qi deficiency and Yin deficiency in non-small cell lung cancer patients and found that the neural network performed better than RF, SVM, and LR. A similar result was obtained in reference 109: the multilayer neural network was more adequate than SVM and AdaBoost for modeling complex relationships between tongue color features and Zheng (syndrome)/coating classes. 
They used a feature vector that combined pixel values of the tongue image from different color spaces to represent tongue features and found that the tongue color features were more suitable for discriminating Zheng classes than the Western medicine groups (superficial vs. atrophic, Helicobacter pylori positive vs. negative).
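A feature vector that combines pixel values from several color spaces, as described above, can be sketched as follows. This is only an illustration under stated assumptions: the color spaces used in reference 109 are not specified here, so RGB, HSV, and YIQ from Python's standard `colorsys` module stand in for them, and the function name `color_feature_vector` is hypothetical. The input is assumed to be the pixels of an already segmented tongue region.

```python
import colorsys


def color_feature_vector(pixels):
    """Average RGB, HSV, and YIQ values over a list of (r, g, b) pixels in 0..255.

    Returns a 9-element list: [R, G, B, H, S, V, Y, I, Q], each averaged over pixels.
    Note: hue is circular, so a plain mean of H is only a rough summary.
    """
    if not pixels:
        raise ValueError("at least one pixel is required")
    sums = [0.0] * 9
    for r, g, b in pixels:
        rf, gf, bf = r / 255.0, g / 255.0, b / 255.0
        h, s, v = colorsys.rgb_to_hsv(rf, gf, bf)
        y, i, q = colorsys.rgb_to_yiq(rf, gf, bf)
        for k, value in enumerate((rf, gf, bf, h, s, v, y, i, q)):
            sums[k] += value
    return [total / len(pixels) for total in sums]
```

Such a vector could then be passed to any of the classifiers compared above (a neural network, SVM, or AdaBoost); the choice of summary statistic (mean here) is a design assumption, and histograms or per-channel variances would be equally plausible.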
As we all know, the syndrome in TCM is relatively abstract and vague; therefore, it is crucial to ensure high-quality training data when making the machine automatically differentiate syndromes. However, as shown in Table 2, only some researchers have reported the details of how the syndrome type was labeled when building the training data set, which decreases the reliability of their research. What is more, even fewer studies discussed the consensus on syndrome labels, restricting the generalization of their machine learning models. Nevertheless, the self-supervised method may have the potential to quantify TCM symptoms or syndromes. 102 There are large gains in accuracy when the features of tongue images are combined with other symptoms in SD,106,107 so it is strongly recommended to introduce other symptoms when using tongue images to classify syndrome types.
Summary of machine learning modeling approaches for tongue image classification.
VGG: visual geometry group; ResNet: residual net; ViT: vision transformer; VQ-VAE: vector quantized variational autoencoder; TDAS: Tongue Image Diagnostic Analysis System; KNN: k-nearest neighbor; MSE: minimum squared error; ACVO: anti corona virus optimization; EWMA: exponential weighted moving average; TCM: traditional Chinese medicine; FFT: fast Fourier transform; Ada: AdaBoost; LGBM: light gradient boosting machine (LightGBM); XGBT: extreme gradient boosting; GA: genetic algorithm; RF: random forest; LR: logistic regression; CT: computed tomography; MRI: magnetic resonance imaging.
Prior knowledge of tongue diagnosis for constitution type according to traditional Chinese medicine (TCM) experts. 20 These 15 attributes were used to represent tongue image samples in a 15D semantic vector.
Tongue image for the symptom (sign) differentiation
Symptom/sign is essential for SD and sometimes for disease diagnosis. Classifying the symptoms/signs of tongues helps to explore AI's potential to quantify the changes in the tongue and the mechanism behind its symptoms/signs. However, the symptoms on the tongue are very difficult to automatically identify or quantify, which has become a core challenge in SD. In recent years, many studies in this area have paid much attention to using deep learning methods to automatically classify or identify tongue symptoms, like tooth-marked tongue,11,15,103–105,110,111 tongue coating,10,16,31,42,112,113 coating color,66,113,114 tongue color,22,113,115 cracked tongue,9,110,111,116,117 sublingual vein, 10 and fungiform papillae on tongue. 118
Using the tongue image to discriminate symptoms differs slightly from using it for SD. Whereas classifying a syndrome is an image-level classification task, symptom classification is based on local features, which often differ only slightly from the surrounding area on the tongue. This is especially true of the tooth-marked area. To solve this problem, Li et al. 15 proposed a multiple-instance SVM method. They first annotated the tooth-marked regions and, based on their convex hull features, generated all suspected tooth-marked regions through a color threshold method; they then classified the tooth-marked tongue from these suspected regions and annotated regions using multiple-instance SVM. In this way, they not only extracted features more specifically but also effectively combined the advantages of traditional and deep learning methods. Similarly, they used this method to classify rotten greasy coating of the tongue 16 and cracked tongue 116 based on deep features, outperforming other state-of-the-art methods. Another tooth-marked classification model, based on image-level annotation, was proposed by Zhou et al., 105 and it could also locate the tooth-marked area with a weakly supervised method. For coating features, Wang et al. 31 proposed the GreasyCoatNet framework, which could robustly classify three levels of greasy coating, indicating the potential to quantify greasy tongue coating. Similarly, Zhuang et al. 104 deployed an intelligent detector using a refined ResNet34 to discriminate three levels of coating thickness, which performed better than VGG-16. The cross-channel attention module combined with the MobileNet V2 network 10 achieved competitive accuracy when classifying coating and sublingual vein thickness with a lightweight model. Ni et al. 115 combined a capsule network and residual blocks to build a lightweight CapsNet model, which achieved competitive performance when classifying tongue color.
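The multiple-instance idea mentioned above can be made concrete with a small sketch. It shows only the standard multiple-instance assumption (a bag, here one tongue image, is positive if at least one of its instances, here suspected regions, is positive) and is not the method of Li et al.: the per-region scores are assumed to come from some already trained scorer such as an SVM decision function, and the function names and threshold are hypothetical.

```python
def bag_score(instance_scores):
    """Score of a bag under the standard MIL assumption: the maximum instance score."""
    if not instance_scores:
        return float("-inf")
    return max(instance_scores)


def classify_tongue(instance_scores, threshold=0.0):
    """Return (is_positive, index of the most suspicious region or None).

    instance_scores: one score per suspected region of a single tongue image,
    assumed to be signed decision values where larger means more tooth-mark-like.
    """
    if not instance_scores:
        return False, None
    best = max(range(len(instance_scores)), key=instance_scores.__getitem__)
    return instance_scores[best] > threshold, best
```

Because the decision rests on the single highest-scoring region, the same index also gives a rough localization of the feature, which is one reason multiple-instance formulations suit small local targets.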
High-quality labeled symptom data are hard to obtain; thus, effective training samples are often lacking. Transfer learning is commonly applied to transfer general-domain knowledge and alleviate the shortage of training data. Song et al. 111 used a model pretrained on ImageNet to classify tooth-marked, cracked, and thick coating, which obtained a result comparable to those of ResNet50 and Inception_v3. Zhang et al. 113 reported using transfer learning to differentiate tongue body color, coating color, and coating thickness for remote tongue diagnosis. Multitask learning can solve the classification and localization tasks together; Weng et al. 110 proposed a weakly supervised method that first performs a coarse classification of tooth-marked and cracked tongues and then uses a detection branch to locate the features. For the problem of imbalanced data, Cao et al. 119 created new samples with a linear interpolation method, which not only retained the characteristics of the original data but also avoided overfitting. The attention mechanism maximized the use of data in improving classification accuracy on cracked tongues. 9
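Creating new minority-class samples by linear interpolation, as mentioned above for imbalanced data, can be sketched as follows. This is a simplified illustration rather than the procedure of Cao et al.: it interpolates between randomly chosen pairs of minority samples (techniques such as SMOTE instead interpolate toward nearest neighbors), and the function name and parameters are assumptions.

```python
import random


def interpolate_samples(minority, n_new, rng=None):
    """Create n_new synthetic feature vectors on segments between pairs of minority samples.

    minority: list of equal-length numeric feature vectors from the under-represented class.
    rng: optional random.Random instance, so that results can be reproduced.
    """
    rng = rng or random.Random(0)
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(minority, 2)  # two distinct samples
        t = rng.random()                # interpolation weight in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic
```

Each synthetic point lies on the segment between two real samples, so it stays within the range spanned by the original data rather than being an exact copy of any one sample.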
Handcrafted features are also useful for differentiating tongue symptoms; based on fractal theory, Zhang et al. 112 used fractal spectra as a differentiating feature and achieved high accuracy in the detection and classification of greasy and thin/thick tongue coatings. Statistical features of Lab color values 119 could be used to classify unhealthy tongue images, and the XGBoost classifier proved more accurate than KNN, SVM, and RF. CIE L* a* b* values were applied to quantify the tongue color and classify its color type. 120 Another statistical feature, the wide line on the tongue image, is of great help in classifying cracked tongue. 117 Features from too many irrelevant areas may negatively impact discrimination tasks. Wang et al. 103 found that using the segmented tongue image gave a 0.97% higher classification accuracy than the raw tongue image (no segmentation, with irrelevant facial portions and background surrounding the tongue) when classifying tooth-marked tongue.
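To show what quantifying tongue color in CIE L* a* b* involves, the sketch below converts one 8-bit sRGB pixel to L* a* b*. It assumes the camera output is standard sRGB and uses the D65 reference white; the reviewed studies may have used calibrated devices or other illuminants, so this is an illustration of the color space rather than of any particular study's pipeline. The function name is hypothetical.

```python
def srgb_to_lab(r, g, b):
    """Convert an 8-bit sRGB color to CIE L*a*b* (D65 white point)."""

    def to_linear(c):
        # Undo the sRGB transfer curve.
        c = c / 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    rl, gl, bl = to_linear(r), to_linear(g), to_linear(b)

    # Linear sRGB to CIE XYZ (D65).
    x = 0.4124564 * rl + 0.3575761 * gl + 0.1804375 * bl
    y = 0.2126729 * rl + 0.7151522 * gl + 0.0721750 * bl
    z = 0.0193339 * rl + 0.1191920 * gl + 0.9503041 * bl

    xn, yn, zn = 0.95047, 1.0, 1.08883  # D65 reference white
    delta = 6.0 / 29.0

    def f(t):
        # Cube root with a linear segment near zero, as in the CIE definition.
        return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3 * delta ** 2) + 4.0 / 29.0

    fx, fy, fz = f(x / xn), f(y / yn), f(z / zn)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)
```

In this space, L* describes lightness, a* the green to red axis, and b* the blue to yellow axis, which is why a pale tongue and a deep red tongue separate mainly along L* and a*.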
The severity of a symptom/sign is closely related to the accuracy of SD and determines the dosage of the herbs. Using deep learning to classify the tongue's appearance has demonstrated the possibility of automatically classifying the severity of body symptoms/signs. Even low-level features or dense predictions on small targets can be handled well by the multiple-instance SVM method,15,16,116 in contrast to tongue color or coating color classification tasks, which are based on image-level features. Irrelevant features or background pixels may reduce the discrimination accuracy. 103 Quantifying the severity of symptoms/signs by machine is another challenge that may also be resolved through deep learning. 31
Discussion and future directions
In this review, we have reported studies using AI methods in the tongue diagnosis area in recent years. As a traditional medical issue, diagnosis from tongue images differs from that of other medical images, which focuses only on the size of space-occupying lesions; tongue diagnosis needs to summarize 12 feature types, as depicted in Figure 1. The automatic processing of tongue images therefore differs in many ways from that of other medical images: the machine needs to handle 12 different characteristics and a volatile tongue appearance, as shown in Figure 3, which poses a greater challenge than other medical images and has resulted in later modernization. 22 Even tongue image segmentation did not perform ideally for a long time, let alone automatic diagnosis based on tongue images. Through this review, we have seen this challenge being alleviated rapidly only since deep learning methods and different kinds of CNN frameworks came into wide use.
Nevertheless, we have observed that most of the research was trapped in two major methodological problems: data set construction and low reliability of performance evaluation. Since there is no recognized authoritative data set, most researchers can only build their own data sets; moreover, their claimed model performance was based on their own validation data sets, and no public channel is provided to confirm it. First of all, few studies reported the details of data collection, annotation, and annotation consistency when creating data sets, so the quality of the data used for model training is questionable. As shown in Tables 1 and 2, only Shi et al. 106 report the collection details, such as the collection time and the body status at the time of collection; Zhuang et al. 104 even report that the distance between the eyes and the display screen is 45 cm and that the retention period of each tongue image is 20 s. Even less literature reports the details of how labeling quality is ensured. Because the color of the tongue is easily affected by diet, tongue images may be collected with deviations if the collection environment is not standardized. Likewise, when labeling tongue images, if it cannot be confirmed that the labeling results are based on a broad consensus, the generalization performance of the model will be affected. Moreover, the definitions of symptoms and syndromes are not clear enough, and their classification lacks recognized standards, which further reduces the generalization performance. As for the label setting, many binary classification tasks define their negative label simply as “non-” or the absence of a tag, such as COVID-19 patients vs. non-COVID-19 patients, 4 tooth-marked vs. nontooth-marked tongue, 103 and diabetes tongue vs. nondiabetes tongue, 76 and such negative labels span a large range of cases that is hard to cover in reality. 
It is hard to believe that the model discriminates the right class when the negative labels are only partially covered. Secondly, the model performance claimed in these studies was mostly evaluated on the authors' own data sets, and only a small part of the tongue segmentation studies63,64 used the BioHit public data set for verification. In addition, no reference disclosed its code and model weight files or provided relevant channels for the public to verify its performance. This not only makes it impossible to compare the studies horizontally but also harms their credibility.
Besides, we also observed some problems in research design, which can be roughly classified into two major categories: misunderstanding the application scenario of tongue diagnosis in TCM, and SD based on a single feature or single diagnosis method. First, although the tongue image has great advantages in the diagnosis and classification of diabetes and gastric diseases and may reflect the progress of COVID-19, its characteristics are more suitable for the classification of syndromes than of diseases. 109 Disease diagnosis requires stricter classification boundaries, whereas the boundaries of syndromes are relatively unclear; at present, the optical signal of the tongue cannot be quantified to provide disease diagnostic indicators or specific substances with clear boundaries. Using tongue images to diagnose disease may therefore have a high recall but a low accuracy. Second, there are many tongue image feature types, as shown in Figure 1, and they are more likely to be judged comprehensively as the probability of a certain type of health state.10,121 The health state abstracted from the whole body system is related to the concept of “syndrome.” TCM theory holds that the human body is an interrelated system in which the running state and abnormal symptoms of different body parts are interrelated and mutually affected,1,122 showing a certain distribution law, and syndrome types can be treated as the clusters summarized from this distribution. Therefore, to better distinguish syndromes, not only the features of the tongue image but also the features from other diagnostic methods need to be integrated.106–108 This is also the idea of “integrating four diagnoses into one reference” emphasized by TCM, that is, perceiving the overall health state (or syndrome type) from multiple local abnormalities. Treatment should be based on the overall state rather than focusing on a specific part, such as an organ, signaling pathway, or protein target.
Even though intelligent tongue diagnosis faces many difficulties, there are still many solutions with potential value. The self-supervised model 102 obtains the feature embedding of tongue images through an autoencoder. It can preliminarily distinguish syndrome types without subjective human labels, which shows the potential of self-supervised learning for the quantitative representation of TCM symptoms/signs, and it is expected to bypass the deviation caused by empirical labels. It has also been shown that some traditional feature engineering methods can help improve deep learning performance, for example, repair or filtering of the morphological layer,18,19 reasonable use of the prior knowledge of experts,15,20 the multiple-instance SVM method15,16,21,116 for small targets like the cracked feature or tooth-marked feature, and the channel attention mechanism8–10 to better locate the tongue region while suppressing useless parts. Some methods are particularly suitable for intelligent tongue diagnosis. Features such as tooth marks, cracks, fur color, and tongue color often need to be used together when making clinical diagnoses, and the attention mechanism8,57 and multiscale feature extraction12,88 are conducive to integrating different features on tongue images. In addition, multitask learning,66,110 data augmentation,54,119 and transfer learning54,56,111 are effective ways to alleviate the problems of low data set quality and small data volume.
Based on the above discussion, the following aspects may have greater research value in the future. The first is using the self-supervised learning method to embed tongue image features as input, as shown in the literature, 102 and using the prescription drugs and their doses as output, so as to bypass the complex and fuzzy intermediate process of SD and build an end-to-end clinical decision support system (CDSS) model, as in earlier studies.13,101 Treating the process of SD as a black box in the middle and directly modeling the relationship between symptoms/signs and doctors’ prescriptions can not only make effective use of the advantages of deep learning but also avoid error propagation in SD. The second is the correlation between the optical features of the tongue and other diagnostic features. In the long-term clinical practice of TCM, abnormal manifestations of various body parts often occur in association, showing a more robust joint distribution with different body statuses or syndromes than with diseases. The appearance of the tongue body is also related to the overall health status. Effective integration, or multimodal information fusion, of features from various body parts or health signals to obtain an embedding of the overall health status is expected to deepen the understanding of the syndrome and reveal its essence. The third is the pathological basis behind the changes in tongue appearance: quantifying the symptoms in tongue images and exploring their relationship with genomics, proteomics, microbial populations, etc. to investigate the essence behind various changes in the tongue image.
Conclusion
This article provided a comprehensive overview of relevant work in intelligent tongue diagnosis over the past 5 years, including the progress of tasks and algorithm models, covering tongue image calibration, detection, segmentation, and classification of diseases, syndromes, and symptoms, as well as existing modeling approaches: manual feature methods, traditional feature engineering methods, and deep learning methods. In particular, we outlined the future potential of these approaches toward intelligent tongue diagnosis and the remaining limitations that may hinder widespread clinical deployment. In the past, there was insufficient understanding of the value of traditional medicine and insufficient promotion of its “renaissance.” This review shows that, because tongue diagnosis is noninvasive and the tongue has a rich vascular network and rich microecology, intelligent tongue diagnosis has enormous value for the future diagnosis and treatment of diseases, and this value needs to be further explored. We hope that this review provides an intuitive understanding of the field and increases awareness of the common challenges that call for future contributions.
Footnotes
Contributorship
All authors have made a substantial contribution to the development, drafting, and revising of this manuscript.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Central Finance Improvement Project of the State Key Laboratory of Traditional Chinese Medicine (Central Finance. CS (2021) No. 151); National Natural Science Foundation of China, No. 81574038; China Postdoctoral Science Foundation, No. 2022M722210, and Shenzhen Basic Discipline Layout Project (JCYLL20220818101806014).
Informed consent
The images used in this paper were collected with informed consent, and as a review, the remaining content does not require informed consent.
Guarantor
QL
