Abstract
Background:
An accurate understanding of why current ultrasound (US) risk stratification systems (RSSs) perform differently is essential for developing a unified RSS. We aimed to compare the diagnostic performance of five RSSs according to nodule size, using the standardized US lexicon proposed by international expert consensus.
Methods:
From March 2017 to February 2024, 3774 thyroid nodules (>1 cm) with final diagnoses (3053 benign and 721 malignant) were retrospectively analyzed. The US features of nodules were assessed during real-time examinations, and the nodules were retrospectively classified according to the US criteria of each RSS using the standardized lexicon. We compared the distribution, malignancy risk, and inter-system agreement of nodule categories, along with the distribution of malignant tumors across these categories, among the American Thyroid Association (ATA) system and the European (EU-), Korean (K-), American College of Radiology (ACR-), and Chinese (C-) Thyroid Imaging Reporting and Data Systems (TIRADSs). Diagnostic performance based on biopsy criteria was compared among the five RSSs according to nodule size (small, ≤ 2 cm; large, > 2 cm).
Results:
Significant differences were observed in the distribution of classified nodules, malignant tumor distribution across categories, and malignancy risk of most categories (all p < 0.001), with widely varying inter-system agreement (κ = 0.05–0.85). The ATA system and EU-TIRADS demonstrated higher sensitivity and unnecessary biopsy rate (UBR) in both small and large nodule groups, whereas ACR TI-RADS showed lower sensitivity and UBR across both sizes. K-TIRADS exhibited the lowest sensitivity and UBR for small nodules (all p < 0.001) but showed high sensitivity and UBR for large nodules. C-TIRADS showed a similarly low sensitivity but exhibited a higher UBR (all p < 0.001) in both size groups compared with ACR TI-RADS.
Conclusions:
The five RSSs differed considerably in nodule classification, malignancy risk across categories, and diagnostic performance according to nodule size. The differences in diagnostic performance stem primarily from variations in biopsy size thresholds and US criteria for small nodules, and from disparate US criteria for large nodules in which biopsy is not indicated. Optimizing risk stratification and biopsy thresholds, particularly for large nodules, is required to establish a unified TIRADS.
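
As an illustration of the size-stratified comparison described in the Methods, the following is a minimal sketch of how sensitivity and UBR could be computed for one RSS in the small (≤ 2 cm) and large (> 2 cm) groups. It is not the authors' analysis code. The `Nodule` record, its `biopsy_indicated` field, and the function names are hypothetical, and the UBR definition used here (benign nodules with biopsy indicated divided by all nodules in the group) is an assumption, because the abstract does not state the formula and definitions vary between studies.

```python
from dataclasses import dataclass
from math import nan


@dataclass
class Nodule:
    size_cm: float          # maximal nodule diameter in cm
    malignant: bool         # final diagnosis
    biopsy_indicated: bool  # hypothetical: outcome of applying one RSS's biopsy criteria


def performance(nodules):
    """Return (sensitivity, UBR) for a list of nodules under one RSS."""
    n_malignant = sum(n.malignant for n in nodules)
    # True positives: malignant nodules for which biopsy is indicated.
    tp = sum(n.malignant and n.biopsy_indicated for n in nodules)
    # "Unnecessary" biopsies: benign nodules for which biopsy is indicated.
    fp = sum((not n.malignant) and n.biopsy_indicated for n in nodules)
    sensitivity = tp / n_malignant if n_malignant else nan
    # Assumed definition: unnecessary biopsies / all nodules in the group.
    ubr = fp / len(nodules) if nodules else nan
    return sensitivity, ubr


def performance_by_size(nodules, cutoff_cm=2.0):
    """Split at the size cutoff from the abstract (small <= 2 cm, large > 2 cm)."""
    small = [n for n in nodules if n.size_cm <= cutoff_cm]
    large = [n for n in nodules if n.size_cm > cutoff_cm]
    return {"small": performance(small), "large": performance(large)}


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    toy = [
        Nodule(1.5, True, True),
        Nodule(1.2, True, False),
        Nodule(1.8, False, True),
        Nodule(3.0, False, True),
        Nodule(2.5, True, True),
    ]
    for group, (sens, ubr) in performance_by_size(toy).items():
        print(f"{group}: sensitivity={sens:.2f}, UBR={ubr:.2f}")
```

Running the same function once per RSS, with `biopsy_indicated` recomputed under each system's criteria, would give the per-system, per-size figures that the Results compare.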
