Abstract

Background

The incidence of cervical lymph node metastasis (CLNM) in thyroid cancer (TC) is high. Accurate preoperative diagnosis of CLNM is critical to reduce unnecessary lymph node dissection and complications for TC patients. Ultrasound (US)-based artificial intelligence (AI) systems show promise for CLNM prediction, but their diagnostic performance requires systematic evaluation.

Methods

A comprehensive search of four electronic databases (Web of Science, Embase, PubMed, and Cochrane Library) was conducted from inception to 30 December 2023. The random-effects model was chosen to calculate the pooled diagnostic indicators. Sensitivity analysis and heterogeneity test were conducted.

Results

Among 19 included studies, the pooled sensitivity, specificity, and area under the curve (AUC) of the AI systems were 0.76 (95% confidence interval (CI): 0.71–0.80), 0.78 (95% CI: 0.74–0.82), and 0.84 (95% CI: 0.15–0.99), respectively. In clinically node-negative (cN0) patients, the sensitivity, specificity, and AUC were 0.73 (95% CI: 0.68–0.77), 0.81 (95% CI: 0.76–0.85), and 0.83 (95% CI: 0.14–0.99), respectively. For central CLNM, the sensitivity, specificity, and AUC were 0.73 (95% CI: 0.69–0.77), 0.77 (95% CI: 0.72–0.81), and 0.81 (95% CI: 0.14–0.99), respectively. Multi-center studies yielded higher sensitivity (0.79 vs. 0.75, p < 0.01) and specificity (0.79 vs. 0.78, p < 0.01) than single-center studies. Deep learning (DL) yielded higher sensitivity (0.79 vs. 0.74, p < 0.01) and specificity (0.83 vs. 0.75, p < 0.01) than classic machine learning. Studies published in or after 2022 yielded higher sensitivity (0.77 vs. 0.74, p < 0.01) than those published before 2022. Studies from China had lower specificity than studies from other countries (0.78 vs. 0.80, p = 0.01). Models incorporating multimodal features had higher specificity than unimodal US models (0.79 vs. 0.75, p < 0.01).

Conclusion

US-based AI systems exhibit favorable predictive value for CLNM in TC, particularly with DL and multimodal designs, potentially reducing overtreatment. Prospective validation is needed prior to clinical adoption.

Keywords

Artificial intelligence, thyroid cancer, lymph node metastasis, ultrasound, systematic review, meta-analysis

Introduction

Thyroid cancer (TC) is the most common endocrine neoplasm, accounting for approximately 90% of all endocrine tumors.¹ Over the past 30 years, the incidence of TC has grown by approximately 300%, making it the fastest growing malignancy worldwide.¹ Even though TC is generally indolent, cervical lymph node metastasis (CLNM) occurs in approximately 30–80% of cases due to the rich lymphatic drainage of the thyroid.² CLNM serves not only as a crucial indicator of TC extent, prognosis, and surgical management but also as an independent risk factor for tumor recurrence and reduced disease-free survival.^3,4 Many TC patients, regardless of CLNM status, have undergone prophylactic lymph node dissection (LND) to prevent potential CLNM. However, this practice has often led to widespread overtreatment.⁵ Notably, there is evidence that prophylactic LND does not improve the long-term prognosis of TC. Instead, it is linked to an increased risk of surgical complications, such as lymphatic leakage, recurrent laryngeal nerve injury, and hypoparathyroidism.^6–8 With growing awareness of the negative influence of TC overdiagnosis, thyroid lobectomy alone with active surveillance has been recommended as initial management for low-risk TC.⁹ Thus, precise evaluation of cervical lymph node (LN) status is of great value for selecting clinical treatment options.

As the preferred modality for screening confirmed or suspected CLNM, preoperative ultrasound (US) has high specificity but is of limited value because of its low sensitivity, especially in assessing central LNs (sensitivity only 20–31%).^10,11 Although new techniques such as elastosonography and contrast-enhanced US have been shown to be superior to conventional US, the results of US evaluation remain highly dependent on the operator's experience and procedural factors.^12,13 Consequently, the physician's visual inspection of US images cannot offer enough information to support treatment decisions for patients. Thus, it is crucial to develop an effective and user-friendly diagnostic method for predicting CLNM.

Artificial intelligence (AI) is a research field that applies computational systems to simulate human cognitive processes. In recent decades, AI has become a hot topic in the medical field, especially in medical image recognition and diagnosis.¹⁴ US-based AI diagnostic systems have been proposed for CLNM prediction in TC patients by transforming US images into quantifiable data, yielding an area under the curve (AUC) of up to 0.953.¹⁵ Common malignant indicators of TC include calcification, a taller-than-wide shape, irregular margins, hypoechogenicity, and extrathyroidal extension.¹⁶ Key features suggestive of malignant LNs include a short-to-long axis ratio >0.5, loss of the echogenic hilum, microcalcifications, cystic degeneration, and abnormal vascular patterns (e.g. peripheral or mixed flow signals).¹⁷ AI systems can integrate these features and analyze the relationships between these high-throughput imaging biomarkers and LN metastasis status. Although various clinically applicable models have been constructed, their diagnostic performance varies widely because of differences in image segmentation and processing methods, algorithm software, and sample size. As a result, the predictive value of these models remains controversial. In this study, based on published data, we aimed to conduct a systematic review and meta-analysis to evaluate the accuracy of AI in predicting CLNM. The findings will offer data to support the development of US-based AI diagnostic systems and lay the theoretical foundation for large-scale CLNM screening programs.

Methods

Design

This systematic review was registered at PROSPERO (CRD42023448933) (Supplemental File 1). It is reported in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.¹⁸

Search strategy

A comprehensive online search of the Web of Science, Embase, PubMed, and Cochrane Library databases was conducted for all potentially relevant literature from inception to 30 December 2023. The keywords and relevant text words are listed in Supplemental File 2. The reference lists of identified studies were manually searched for other relevant studies.

Eligibility criteria

The inclusion criteria were: (a) The reference standard for the diagnosis of CLNM in TC patients was pathological examination (fine-needle aspiration or surgical pathology); (b) application of AI algorithms involving US image analysis to predict CLNM; (c) sensitivity and specificity could be calculated from the available data; (d) articles published in English. The exclusion criteria were: (a) Repeated publications; (b) non-English articles; (c) letters, comments, reviews and case reports; (d) studies with incomplete or inaccessible data to construct a 2 × 2 contingency table. Full-text articles were checked by two independent reviewers to determine whether they met the inclusion criteria. In case of a disagreement between the evaluators, a third investigator was consulted to resolve the issue.

Data extraction and quality assessment

Two investigators independently extracted and cross-checked the data. Discrepancies were resolved by consensus through group discussion. The extracted information included the following: (a) study characteristics: author, year of publication, sample size, LN compartment, and study cohort; (b) algorithm characteristics: type of AI algorithm, feature selection, and ultrasonic target tissue; (c) diagnostic accuracy results: fourfold (2 × 2) table data comprising true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN). The methodological quality and risk of bias of all included studies were evaluated using the Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2) tool, as recommended by the Cochrane handbook.¹⁹ This tool evaluated studies through two primary domains (bias risk and applicability) across 14 specific criteria. Each criterion was evaluated in turn; if any criterion was rated as high risk or high concern, the study was judged to be at high risk of bias.
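The step from an extracted fourfold table to per-study diagnostic indicators can be sketched as follows. This is a minimal illustration using the standard definitions, not the authors' software workflow; the class name, function name, and example counts are hypothetical and are not taken from any included study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Fourfold (2 x 2) table for one AI model judged against pathology."""
    tp: int  # CLNM on pathology, predicted positive
    fn: int  # CLNM on pathology, predicted negative
    fp: int  # no CLNM on pathology, predicted positive
    tn: int  # no CLNM on pathology, predicted negative


def diagnostic_indicators(t: Confusion) -> dict:
    """Return sensitivity, specificity, PLR, NLR and DOR for one table.

    Assumes no cell is zero; otherwise a continuity correction
    (for example adding 0.5 to every cell) would be needed first.
    """
    sens = t.tp / (t.tp + t.fn)
    spec = t.tn / (t.tn + t.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "PLR": sens / (1 - spec),
        "NLR": (1 - sens) / spec,
        "DOR": (t.tp * t.tn) / (t.fp * t.fn),
    }


# Hypothetical counts, for illustration only.
example = Confusion(tp=80, fn=20, fp=30, tn=70)
result = diagnostic_indicators(example)
```

With these made-up counts the sensitivity is 0.80 and the specificity is 0.70; the DOR equals the PLR divided by the NLR by construction.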

Statistical analysis

All data were synthesized and analyzed with STATA 14.0 and Meta-Disc 1.4 software. Review Manager 5.3 was used to draw risk of bias graphs. The threshold effect was assessed using the Spearman correlation coefficient. The Q test and I² statistic were used to assess heterogeneity among studies; I² > 50% was considered to indicate significant heterogeneity. The random-effects model was chosen to calculate the following indicators of diagnostic capacity: the pooled sensitivity, specificity, diagnostic odds ratio (DOR), positive likelihood ratio (PLR), negative likelihood ratio (NLR), and AUC. Meta-regression and subgroup analysis were conducted to investigate potential sources of heterogeneity. The stability of the model was evaluated through sensitivity analysis. Publication bias was assessed using Deeks’ funnel plot. A p value < 0.05 was considered statistically significant.
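To make the heterogeneity statistics and random-effects pooling concrete, the sketch below pools a proportion (such as sensitivity) on the logit scale using the DerSimonian-Laird estimator and reports Cochran's Q and I². It is a simplified univariate illustration under stated assumptions: the estimators used internally by STATA and Meta-Disc may differ, and the function names and the 0.5 continuity correction are choices made here, not details reported by the authors.

```python
import math


def logit_and_variance(events: int, total: int):
    """Logit of a proportion and its approximate variance (0.5 added to each cell)."""
    a = events + 0.5
    b = total - events + 0.5
    return math.log(a / b), 1 / a + 1 / b


def pool_random_effects(studies):
    """DerSimonian-Laird pooling of proportions from (events, total) pairs.

    Returns (pooled proportion, Cochran Q, I-squared in percent).
    """
    effects, variances = zip(*(logit_and_variance(e, n) for e, n in studies))
    w = [1 / v for v in variances]  # inverse-variance (fixed-effect) weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    df = len(studies) - 1
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0  # between-study variance
    w_re = [1 / (v + tau2) for v in variances]  # random-effects weights
    pooled_logit = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    return 1 / (1 + math.exp(-pooled_logit)), q, i2


# Hypothetical (true positives, patients with CLNM) pairs, for illustration only.
pooled, q_stat, i_squared = pool_random_effects([(90, 100), (50, 100), (10, 100)])
substantial = i_squared > 50  # the cut-off used in this review
```

The toy input is deliberately very heterogeneous, so I² exceeds the 50% cut-off; identical studies would instead give Q = 0 and I² = 0.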

Results

Literature search

The initial literature search retrieved 152 articles. After removing duplicates (n = 54) and excluding irrelevant articles (n = 48) by reading titles and abstracts, 50 potentially eligible articles remained for full-text assessment. Upon detailed assessment, 31 were excluded for the following reasons: irrelevant research topics (n = 13) and insufficient reported data (n = 18). Ultimately, 19 studies fulfilled all inclusion criteria and were incorporated into the diagnostic meta-analysis.^15,20–38 The detailed literature screening process is displayed in Figure 1.

Figure 1.

Flow diagram of the preferred reporting items for systematic reviews and meta-analyses (PRISMA).

Characteristics and data extraction

All included studies were published between 2018 and 2023 and comprised 8094 patients with CLNM and 8876 patients without CLNM. Supplemental Table 1 presents the detailed characteristics of the 19 studies. Among these, 5 were multi-center studies^{22,29,30,33,37} and 14 were single-center studies.^{15,20,21,23–28,31,32,34–36} In total, 84% (16/19) of the studies were from China,^{20,22–28,30–37} with the remainder from Iran (2/19)^15,29 and Korea (1/19).²¹ Four studies^28,31,34,35 evaluated the predictive value of US-based AI diagnostic systems for CLNM in patients with clinically node-negative (cN0) TC, while the remaining studies did not report detailed staging information. Considering the development history of AI, the algorithms applied in the eligible studies can be categorized into two groups: deep learning (DL) and classic machine learning (ML). Any learning algorithm that does not employ neural networks, such as decision trees, gradient boosting, and support vector machines, is classified as classic ML. Because feature extraction differed among the studies developing AI diagnostic systems, we established the following classification: studies using only US imaging features were classified as unimodal, while those incorporating additional data (e.g. clinical features, pathological features, or their combinations) were classified as multimodal. Regarding US imaging modality, in addition to conventional US, several studies incorporated advanced US techniques: one study used superb microvascular imaging, elastography, and contrast-enhanced US,³² and three studies^33,35,38 used elastography. During data extraction, we found that eight studies^{21–24,29,33–35} compared the diagnostic performance of multiple AI algorithms. Additionally, one study²⁵ developed two different AI models specifically for predicting central and lateral LN metastases, respectively.
For studies evaluating multiple algorithms, the performance results of each AI algorithm set were extracted and analyzed independently. When pooling the diagnostic performance of US-based AI models, only the test set or validation set data were selected to generate the cross-tabulation.

Risk of bias within studies

The result of the quality assessment was depicted in Supplemental Figure 1. In the domain of patient selection, one study was considered to be at unclear risk of bias because the sampling procedure was insufficiently described.²¹ Six studies were judged to be at unclear risk because they did not specify the time interval between US and pathology.^{21,24,29,30,32,33} Nevertheless, the quality of the studies was generally satisfactory.

Results of syntheses

Pooled diagnostic accuracy

Cochran Q and I² tests indicated substantial heterogeneity in both sensitivity (I² = 95.02%, p < 0.001) and specificity (I² = 92.03%, p < 0.001). The pooled sensitivity, specificity, DOR, PLR and NLR of AI for diagnosing CLNM were 0.76 (95% confidence interval (CI): 0.71–0.80), 0.78 (95% CI: 0.74–0.82), 11 (95% CI: 8–16), 3.5 (95% CI: 2.9–4.2), and 0.31 (95% CI: 0.26–0.37), respectively, with a corresponding AUC of 0.84 (95% CI: 0.15–0.99) (see Figure 2).
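As a rough consistency check, the pooled likelihood ratios and DOR can be related to the pooled sensitivity and specificity through the standard definitions. Because each pooled value comes from a random-effects model rather than from a single fourfold table, these identities are expected to hold only approximately:

```latex
\begin{aligned}
\mathrm{PLR} &= \frac{\text{sensitivity}}{1-\text{specificity}} = \frac{0.76}{1-0.78} \approx 3.5,\\
\mathrm{NLR} &= \frac{1-\text{sensitivity}}{\text{specificity}} = \frac{1-0.76}{0.78} \approx 0.31,\\
\mathrm{DOR} &= \frac{\mathrm{PLR}}{\mathrm{NLR}} \approx \frac{3.5}{0.31} \approx 11.
\end{aligned}
```

The reported point estimates (3.5, 0.31, and 11) agree with these values after rounding.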

Figure 2.

Diagnostic performance of US-based AI diagnostic system for CLNM prediction. (a) SROC curve analysis showing pooled AUC with 95% CI; (b) Forest plots demonstrating a high degree of heterogeneity in sensitivity and specificity across included studies. CLNM: cervical lymph node metastasis; SROC: summary receiver operating characteristic curve; AUC: area under the curve; CI: confidence interval.

Meta-regression and subgroup analysis

Spearman's correlation coefficient for the threshold effect was −0.110 (p = 0.488), indicating that the heterogeneity was not caused by a threshold effect. To further investigate the sources of heterogeneity, we performed a meta-regression. We divided the possible sources of heterogeneity into five categories, as follows: (1) research design; (2) year; (3) country; (4) type of AI; (5) modeling feature. All these factors were identified as significant predictors of heterogeneity, as shown in Table 1 and Supplemental Figure 2. Compared with AI models developed using classic ML, both sensitivity (0.79 vs. 0.74; p < 0.01) and specificity (0.83 vs. 0.75; p < 0.01) of DL were significantly higher. Studies from China showed significantly lower specificity than those from other countries (0.78 vs. 0.80; p = 0.01), with no statistically significant difference in sensitivity (p = 0.31). In addition, multi-center studies demonstrated superior diagnostic performance to single-center studies, with significantly higher sensitivity (0.79 vs. 0.75; p < 0.01) and specificity (0.79 vs. 0.78; p < 0.01). Studies^26–37 published in or after 2022 demonstrated a higher sensitivity (0.77 vs. 0.74; p < 0.01) but a lower specificity (0.77 vs. 0.79; p < 0.01) compared with those published before 2022.^15,20–25 Furthermore, models incorporating multimodal features for AI model construction showed a significantly higher specificity than those incorporating unimodal features (0.79 vs. 0.75; p < 0.01), with no statistically significant difference in sensitivity (p = 0.06).

Table 1.

Meta-regression analysis of potential sources of heterogeneity in the diagnostic test accuracy meta-analysis.

Parameter	Number of data	Sensitivity estimates (95% CI)	P value	Specificity estimates (95% CI)	P value
Type of AI
DL	14	0.79 (0.72–0.85)	<0.01	0.83 (0.79–0.88)	<0.01
Classic ML	28	0.74 (0.69–0.79)	<0.01	0.75 (0.71–0.80)	<0.01
Country
China	36	0.72 (0.68–0.76)	0.31	0.78 (0.74–0.82)	0.01
Other country	6	0.91 (0.86–0.95)	0.31	0.80 (0.70–0.89)	0.01
Year
<2022	18	0.74 (0.68–0.81)	<0.01	0.79 (0.74–0.85)	<0.01
≥2022	24	0.77 (0.72–0.82)	<0.01	0.77 (0.73–0.82)	<0.01
Research design
Multi-center	10	0.79 (0.72–0.87)	<0.01	0.79 (0.72–0.86)	<0.01
Single-center	32	0.75 (0.70–0.79)	<0.01	0.78 (0.74–0.82)	<0.01
Feature extraction
Multimodal	33	0.72 (0.68–0.76)	0.06	0.79 (0.75–0.83)	<0.01
Ultrasound	9	0.87 (0.82–0.92)	0.06	0.75 (0.66–0.83)	<0.01

AI: artificial intelligence; DL: deep learning; ML: machine learning; CI: confidence interval.

There were 4 studies^28,31,34,35 that assessed the performance of the US-based AI diagnostic systems for predicting CLNM in patients with cN0. The pooled sensitivity and specificity were 0.73 (95% CI: 0.68–0.77) and 0.81 (95% CI: 0.76–0.85), respectively, with a corresponding AUC of 0.83 (95% CI: 0.14–0.99) (see Figure 3). In terms of LN compartment selection, 9 studies^{20,23,25,26,30–32,34,35} evaluated the diagnostic efficacy of the AI system only for the central compartment, but not the lateral compartment or all compartments. The pooled sensitivity and specificity were 0.73 (95% CI: 0.69–0.77) and 0.77 (95% CI: 0.72–0.81), respectively, and the AUC was 0.81 (95% CI: 0.14–0.99) (see Figure 4).

Figure 3.

Diagnostic performance of US-based AI diagnostic system for CLNM prediction in TC patients with cN0. (a) SROC curve analysis showing pooled AUC with 95% CI; (b) forest plots demonstrating a high degree of heterogeneity in sensitivity and specificity across included studies. CLNM: cervical lymph node metastasis; TC: thyroid cancer; SROC: summary receiver operating characteristic curve; AUC: area under the curve; CI: confidence interval; cN0: clinically node-negative.

Figure 4.

Diagnostic performance of US-based AI diagnostic system for central CLNM prediction. (a) SROC curve analysis showing pooled AUC with 95% CI; (b) forest plots demonstrating a high degree of heterogeneity in sensitivity and specificity across included studies. US: ultrasound; AI: artificial intelligence; CLNM: cervical lymph node metastasis; SROC: summary receiver operating characteristic curve; AUC: area under the curve; CI: confidence interval.

Sensitivity analysis

The results of sensitivity analysis were shown in Figure 5. The goodness of fit and bivariate normality tests showed that the data points were distributed evenly on both sides of the reference line, indicating stable observations. According to the influence analysis, 7 sets of data (No.1, No.4, No.10, No.11, No.14, No.15, and No.18) may lead to overestimation in the pooled results. Additionally, outlier detection identified 5 sets of data (No.1, No.4, No.10, No.11, and No.14) as exceeding acceptable ranges. After removing these data, the sensitivity, specificity and AUC showed slight reductions (0.74, 0.76, and 0.82, respectively). Therefore, the sensitivity analysis results indicated that our meta-analysis was robust.

Figure 5.

Sensitivity analysis results of the included studies. (a) Goodness of fit; (b) bivariate normality; (c) influence analysis; (d) outlier detection.

Publication bias

The Deeks’ funnel plot asymmetry test indicated potential publication bias (p < 0.01) (see Supplemental Figure 3).

Clinical diagnostic value

The Fagan plot and likelihood ratio scattergram were presented in Figure 6. Assuming a 50% prevalence of CLNM, the Fagan plot showed a posterior probability of 78% for positive assay results and 24% for negative results. The clinical diagnostic value of AI was illustrated by the likelihood ratio scattergram. When the PLR was >10 and the NLR was <0.1, the diagnostic accuracy was high. However, the included studies were distributed across all four quadrants, indicating an overall limited diagnostic capacity.
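The Fagan plot is a graphical form of Bayes' theorem on the odds scale: pre-test odds multiplied by the likelihood ratio give post-test odds. A minimal sketch of that calculation, using the pooled PLR (3.5) and NLR (0.31) reported in this meta-analysis and the assumed 50% prevalence, is shown below; the function name is hypothetical.

```python
def post_test_probability(pre_test: float, likelihood_ratio: float) -> float:
    """Bayes update on the odds scale, as read off a Fagan nomogram."""
    pre_odds = pre_test / (1 - pre_test)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Pooled likelihood ratios reported in this meta-analysis.
PLR, NLR = 3.5, 0.31
positive = post_test_probability(0.50, PLR)
negative = post_test_probability(0.50, NLR)
print(f"{positive:.0%} after a positive result, {negative:.0%} after a negative result")
# prints "78% after a positive result, 24% after a negative result"
```

Changing the first argument shows how the post-test probabilities shift in settings where the prevalence of CLNM differs from 50%.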

Figure 6.

US-based AI for CLNM risk assessment. (a) Fagan nomogram: 50% pre-test probability converts to 78% (LR+ 3) or 24% (LR− 0.31) post-test probability; (b) LR classification matrix with evidence thresholds (strong: LR+ > 10). US: ultrasound; AI: artificial intelligence; CLNM: cervical lymph node metastasis; LR: likelihood ratio.

Discussion

Although TC is usually an indolent tumor, CLNM often occurs at an early stage. Prophylactic LND has been proposed due to the association between CLNM and both increased local recurrence risk and reduced survival rates.^3,4 Many TC patients received LND, resulting in widespread overtreatment. In fact, the benefit of LND in prevention is still highly controversial. Therefore, it is more reasonable to perform LND only in patients who have been diagnosed as CLNM-positive before surgery. Despite its limitations, US assessment of thyroid and LNs has been deemed the most common method in various clinical settings. The US-based AI diagnostic system combines the advantages of widespread US availability with the objectivity of computer-readable images, demonstrating high potential for clinical application in predicting CLNM in TC patients.

In total, 19 studies with 16,970 cases were included in this study. The pooled sensitivity and specificity were 0.76 and 0.78, respectively. The AUC was 0.84. The sensitivity of conventional US in predicting CLNM was reported to be as low as 0.33, with a corresponding AUC of 0.69.³⁶ Compared to conventional US, the US-based AI diagnostic system increased the AUC by 0.15 (from 0.69 to 0.84). It also markedly improved sensitivity, indicating that this system can more accurately predict CLNM and aid in clinical diagnosis. Previous research has shown that AI implementation enhances overall diagnostic accuracy and sensitivity for radiologists of all experience levels.³⁹ It reduces the performance gap between junior and senior radiologists while ensuring consistency among experts. Consistent with our findings, AI proves to be an effective tool for improving diagnostic efficacy in US, especially in training junior practitioners. A previous meta-analysis revealed that the pooled sensitivity and specificity of US combined with computed tomography (CT) were 0.73 and 0.80, respectively.⁴⁰ It can be concluded that the diagnostic performance of the US-based AI diagnostic system is comparable to that of combined US and CT diagnosis.

Some studies have applied CT-based radiomics models to diagnose CLNM in TC patients, with an AUC of up to 0.904.^41,42 However, whether the diagnostic performance of CT-based AI systems is superior to that of US-based AI systems still needs to be further explored because of the insufficient sample size, limited dataset diversity, and lack of external validation. In fact, small-diameter tumors are difficult to accurately recognize and segment on CT images.²⁶ Furthermore, there are some concerns about the potential impact of iodinated contrast agents on subsequent radioactive iodine therapy, as well as radiation exposure during contrast-enhanced CT examinations. Thus far, whether to routinely perform CT examination in TC patients is controversial. Although MRI-based radiomics models also show good diagnostic value in predicting CLNM status preoperatively,⁴³ the clinical application of this technology is limited by the lack of standardization in MRI protocols, including magnetic field strength, different sequences, and several parameters (e.g. repetition time and echo time). Therefore, the US-based AI diagnostic system not only achieves high diagnostic accuracy, but also represents the most suitable AI technology for clinical implementation. This could provide a more holistic perspective on the diagnostic capabilities of AI in TC.

Our findings are consistent with a recent meta-analysis,⁴⁴ which likewise confirmed the high diagnostic accuracy and clinical reliability of the US-based AI diagnostic system. Importantly, our study extends previous work by systematically investigating subgroup differences and sources of heterogeneity. Given the apparent heterogeneity among studies, we performed a meta-regression analysis of five factors. All five factors were associated with heterogeneity. Compared with single-center studies, the US-based AI diagnostic system showed better diagnostic performance in multi-center studies. The reason might be the increased number of images. These images contain substantial data in diverse formats, ensuring sufficient training material. The utilization of multi-institutional imaging data significantly expanded dataset variety, which led to simultaneous improvements in both model generalization capacity and result reliability.⁴⁵ Due to the lack of adequate samples and diverse imaging sources, single-center studies were more likely to introduce selection bias than multi-institutional studies. Studies employing DL models or published in or after 2022 demonstrated significantly higher sensitivity. A possible reason was the considerable advances in AI technology. In classical ML methods, the relevant domain experts would first set most of the applicable features to reduce the complexity of the data and highlight patterns. Some hidden relationships may be lost in this manual step, because manually designed features rely on the a priori knowledge of the expert, who may not be able to anticipate complex, non-intuitive relationships in the data.
Interestingly, the data-dependent nature of DL algorithms means that model performance continues to improve with increasing training data volume, whereas the performance of classical ML algorithms tends to plateau.⁴⁶ Meanwhile, the majority of well-performing DL models were built on baseline architectures. The diagnostic performance of DL models could further improve through innovative modifications of training strategies and algorithmic architectures.^14,44 As increasingly comprehensive features are incorporated into AI models, their predictions should move closer to pathological results. This may be one of the reasons why unimodal US-based AI models exhibited lower specificity than multimodal feature-integrated models. Finally, the observed lower specificity of studies from China compared to other countries may reflect underlying cultural, systemic, or healthcare-related biases requiring further investigation.

Although the number of clinically LN-positive cases is increasing with the widespread use of US and more detailed pathological examination of surgical specimens, 30%–65% of cN0 patients with papillary thyroid carcinoma (PTC) are still found to have CLNM after surgery.^10,47 In fact, accurate preoperative assessment of CLNM is more important in TC patients with cN0 disease than in those with clinical metastases, as it can effectively guide the choice of surgical approach and determine the extent of LND. Our analysis showed that the US-based AI diagnostic system had high sensitivity and significantly improved the preoperative detection rate of CLNM. Moreover, it also demonstrated high specificity in cN0 TC patients, suggesting that this system may be a good predictor of patients who will not ultimately develop CLNM, potentially sparing them from prophylactic LND and its associated complications. Notably, the central compartment is not only the most frequent region for CLNM in patients with TC, but also the most challenging compartment to evaluate using conventional US due to anatomical interference from the thyroid and trachea. To investigate the diagnostic performance of the US-based AI diagnostic system for central LN metastasis, we analyzed the pooled datasets. The results showed that the diagnostic sensitivity and specificity remained high, partially compensating for the limitations of conventional US in evaluating central LNs.

Although US-based AI diagnostic systems exhibited promising results for CLNM prediction, they have not yet gained broad acceptance and application in clinical practice. Key limitations requiring resolution prior to clinical implementation include: (a) Existing training datasets are often insufficient in size, especially for rare histological subtypes such as follicular carcinoma, potentially introducing selection bias and influencing model performance. (b) The lack of uniform US image acquisition protocols, combined with variability in institutional operational procedures and equipment specifications (e.g. transducer frequencies, gain settings), significantly limits the generalizability of current models across diverse clinical settings. (c) The reliance on static images neglects dynamic features such as hemodynamics and tissue elasticity. Therefore, future studies should expand sample size and diversity, minimize data bias through standardized protocols, and optimize models by incorporating dynamic data.

It is worth mentioning that the persistent “black box” dilemma in AI systems, characterized by limited interpretability of decision-making processes, remains a critical research challenge. At the same time, unresolved issues concerning data security protocols and ambiguous accountability frameworks in intelligent healthcare technologies continue to hinder clinical adoption. Currently, AI is more appropriately utilized as an auxiliary tool rather than an independent diagnostic method, and its results must be reviewed by certified physicians.

Our study has some limitations. First, the heterogeneity among the studies included was significant, and several factors may have been overlooked beyond those analyzed in the discussion section. Second, the generalizability of these findings may be restricted by the fact that most studies were performed in China. There is a need for future studies from other countries to verify whether these findings are consistent with those of other populations. Third, the publication bias in our study was significant. In total, 18 studies were excluded before meta-analysis due to incomplete data, contributing to a potential publication bias. Fourth, in this meta-analysis, most included studies used a retrospective design, and potential bias in the patient selection could not be fully eliminated.

Conclusion

In conclusion, this meta-analysis demonstrated that US-based AI diagnostic systems performed well in predicting CLNM in TC patients, potentially reducing unnecessary LND and associated complications. The systems also maintained high diagnostic accuracy both in cN0 patients and for central LN evaluation. Multi-center design, DL algorithms, and multimodal feature extraction were associated with improved diagnostic performance. Before clinical implementation, further prospective studies need to standardize reporting protocols, train models on sufficient and diverse datasets, and demonstrate cross-population validity.

Supplemental Material

sj-docx-1-sci-10.1177_00368504251346906 - Supplemental material for Diagnostic performance of the ultrasound -based artificial intelligence diagnostic system in predicting cervical lymph node metastasis in patients with thyroid cancer: A systematic review and meta-analysis

Supplemental material, sj-docx-1-sci-10.1177_00368504251346906 for Diagnostic performance of the ultrasound -based artificial intelligence diagnostic system in predicting cervical lymph node metastasis in patients with thyroid cancer: A systematic review and meta-analysis by Xueyao Tang, Hong Zhou, Ying Liu, Shan Gao and Yang Zhou in Science Progress

Three further supplemental files accompany the same article in Science Progress: sj-docx-2-sci-10.1177_00368504251346906, sj-docx-3-sci-10.1177_00368504251346906, and sj-docx-4-sci-10.1177_00368504251346906.

Footnotes

Acknowledgements

The authors would like to thank the researchers and study participants for their contributions.

ORCID iD

Xueyao Tang

Authors’ contributions

Xueyao Tang conceptualized the research and wrote the manuscript. Shan Gao conducted the literature screening, extracted the features of eligible studies, and evaluated the quality of the literature. Ying Liu and Hong Zhou participated in data analysis and interpretation. Yang Zhou revised and reviewed the manuscript. All authors approved the final version.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by (1) The Third People's Hospital of Chengdu Clinical Research Program, grant number CSY-YN-01-2023-050, (2) Regional Innovation Cooperation Project of Sichuan Province (2024YFHZ0078), and (3) The Third People's Hospital of Chengdu Scientific Research Project (CSY-YN-01-2023-004, 2023PI03).

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Trial registration

Systematic review registration: PROSPERO identifier CRD42023448933.

Supplemental material

Supplemental material for this article is available online.

