Sage Journals: Discover world-class research

Abstract

Thyroid cancer is one of the common types of cancer worldwide, and Ultrasound (US) imaging is a modality normally used for thyroid cancer diagnostics. The American College of Radiology Thyroid Imaging Reporting and Data System (ACR TIRADS) has been widely adopted to identify and classify US image characteristics for thyroid nodules. This paper presents novel methods for detecting the characteristic descriptors derived from TIRADS. Our methods return descriptions of the nodule margin irregularity, margin smoothness, calcification as well as shape and echogenicity using conventional computer vision and deep learning techniques. We evaluate our methods using datasets of 471 US images of thyroid nodules acquired from US machines of different makes and labeled by multiple radiologists. The proposed methods achieved overall accuracies of 88.00%, 93.18%, and 89.13% in classifying nodule calcification, margin irregularity, and margin smoothness respectively. Further tests with limited data also show a promising overall accuracy of 90.60% for echogenicity and 100.00% for nodule shape. This study provides an automated annotation of thyroid nodule characteristics from 2D ultrasound images. The experimental results showed promising performance of our methods for thyroid nodule analysis. The automatic detection of correct characteristics not only offers supporting evidence for diagnosis, but also generates patient reports rapidly, thereby decreasing the workload of radiologists and enhancing productivity.

Keywords

thyroid cancer, ultrasonography, TIRADS, nodule characteristics, machine learning, computer-aided diagnosis

Introduction

Thyroid cancer is one of the most lethal cancers globally.¹ The incidence rate in women is three times higher than that in men; in 2018 alone, one in 20 women diagnosed with cancer had thyroid cancer.¹ Different imaging modalities have been used for diagnosis. US imaging has the advantages of being non-invasive, radiation-free and low-cost. However, recognizing thyroid nodules and detecting cancer characteristics from US images are challenging because of the demanding skills required in image acquisition and the low image quality caused by speckle noise and artifacts. To tackle these issues and maintain consistency in clinical settings, doctors often use standard guidelines in describing thyroid nodules. The original TIRADS principles were first proposed by Horvat et al.² They were later standardized by Kwak et al.³ as the first reporting scheme for classifying thyroid nodules into risk levels of malignancy using US nodule characteristic descriptors. The most recent ACR TIRADS further standardizes the descriptors into five categories.⁴ Although radiologists have been using the guidelines to report thyroid nodules under different conditions, accurate diagnosis based on TIRADS remains challenging because of inter- and intra-observer variability.

Several studies have been conducted to analyze US image characteristics of thyroid nodules, but most of them extracted such characteristics for classifying a nodule as benign or malignant rather than accurately detecting and evaluating the characteristics for report generation.⁵⁻⁷ In this paper, we present a comprehensive translation of the US TIRADS characteristics, aiming at an automated process of describing clinical findings in thyroid nodules. The proposed methods provide an effective, efficient, deterministic, and consistent annotation of the TIRADS terms to reduce subjectivity and increase precision in nodule examination and reporting. In particular, the paper is intended to make the following key contributions: (1) a new method for nodule irregularity detection by utilizing convexity, ellipticity, lobulation, and angulation features; (2) a new method for smoothness detection using texture features and super-pixels; (3) an optimized CNN classification model (CaNet) for calcification by using Bayesian Optimization; and (4) a new method that combines CaNet and super-pixels for more accurate calcification classification.

The remainder of the paper is organized as follows. The paper first reports on the key TIRADS US characteristics and reviews existing methods for nodule characteristics analysis in US images. The paper then presents the proposed methods for detecting margin irregularity, margin smoothness and calcification. This is followed by an evaluation of the proposed methods through experiments on datasets collected from clinics. The paper further discusses possible ways of optimizing parameters of the proposed methods, conducts a brief ablation study on the extracted features for some of the methods and outlines ideas for nodule shape and echogenicity, before summarizing the main findings and concluding the paper.

Background and Related Work

The ACR TIRADS scheme characterizes a thyroid nodule from five aspects: composition, echogenicity, shape, margin, and echogenic foci.⁴ Each aspect contains a set of descriptive terms with associated points. Based on the observation, the associated points are added to a total score and then mapped to one of five ordinal bands (from TR1 to TR5). A benign nodule within the TR2 category can exhibit a regular oval shape, an anteroposterior transverse ratio (AP/T ratio) that is wider than tall, a smooth margin, anechoic properties, and the absence of calcification. On the other hand, a malignant nodule falling under the TR5 category may show a lobulated shape with an irregular margin, an AP/T ratio that is taller than wide, hypo-echogenicity, and the presence of micro-calcification. The borderline TR4 band is further divided into 4a, 4b and 4c sub-bands. It is also the band where most inter-observer variability occurs, and hence where accurately detected nodule characteristics are most needed for correct decisions. Besides the ACR TIRADS, other guidelines also exist,⁸⁻¹⁰ all of which indicate similar nodule characteristics of suspected malignancy.¹¹ Therefore, in practice, hospitals normally use common categories of thyroid nodule characteristics across the different guidelines. It must be said that a TIRADS score is still an observer’s subjective judgment and may lead to different diagnostic outcomes. Rigorous definitions of the terms and reviews of the guidelines may help reduce, but not eliminate, such variability. An automated computer-based solution for categorizing thyroid nodules may help reduce such non-deterministic outcomes. Using the detected characteristics as evidence will enhance the comprehensibility of the final diagnostic decisions and help gain trust from the medical profession.

Several studies that automatically quantify features based on the standardized TIRADS categories for classifying thyroid nodules have been reported,¹²,¹³ but work on extracting correct TIRADS features from ultrasound images for annotation purposes remains limited. Zulfanahri et al.⁵ analyzed and classified the margin irregularity of thyroid nodules into regular or irregular classes using rectangularity, convexity, and tortuosity features with an SVM classifier. The study reported an accuracy of 91.52% (91.80% sensitivity and 91.35% specificity) over a set of 165 images. Wang et al.¹² automated the extraction of four thyroid nodule characteristics: composition using average image brightness, echogenicity pattern using relative brightness, calcification with a top-hat morphological filter, and boundary regularity with acutance. The study used a semi-supervised fuzzy C-means ensemble (SS-FCME) model to classify the thyroid nodules into a TIRADS score band with 70.77% overall accuracy. Nugroho et al.⁷ trained an SVM to determine the margin of a thyroid nodule using compactness, convexity, circularity, dispersion, aspect ratio, rectangularity, solidity and tortuosity as features, with an accuracy of 92.30% on 144 test images. In a later study, Nugroho et al.⁶ further added the orientation feature and achieved an accuracy of 98.00%, but only on 51 test images. Although these two studies reported promising performance, the number of features extracted may be excessive for the problem. Zhuang et al.¹⁴ used cystic growth rate and the variance of the grayscale distribution for composition, compactness for margin irregularity, and the aspect ratio for nodule shape. A deep learning algorithm was used to classify calcification based on the Region of Interest (ROI) image, but the paper lacks detailed explanations. For margin smoothness, the method first locates a 10-pixel ribbon around the nodule boundary (inside and outside regions) via morphological dilation and erosion. The average grayscale difference (or mean separability), derived from the number of pixels and the mean and variance of intensity in each region, was used to quantify margin smoothness. Weights were assigned to the derived quantities and feature scores, which were then accumulated into the total TIRADS score. All malignant cases and 94.87% of the benign cases were classified into the correct TIRADS score bands.

Materials and Methods

Dataset Collection and Annotation for Nodule Characteristics Analysis

Thyroid cancer has various subtypes. Malignant tumors have more diversity in their cellular and molecular structures than benign tumors. Therefore, including a larger number of malignant cases in a dataset is important to ensure sufficient representation of the diverse subtypes for developing accurate models. Hence we purposely included more malignant cases of various pathologies in this study. During image acquisition, one radiologist with more than 15 years of experience manually cropped every nodule in each original image by identifying coordinate points on the nodule boundary. The delineated boundaries were further validated by a second radiologist with a similar amount of experience. Images on whose nodule boundaries the two radiologists disagreed were excluded from the final data collection. The verified nodule boundaries form polygons for the ROI. Figure 1 shows two examples from the dataset.

Figure 1.

Example US images with labeled ROI (red dots on nodule boundaries): (a) isoechoic/hyperechoic, wider-than-tall, clear, regular and no calcification and (b) isoechoic/hyperechoic, wider-than-tall, not clear, irregular, micro calcification.

Our dataset was labeled by three radiologists with 10, 15, and 30 years of experience respectively. For each ROI identified, US image descriptors of the nodule were assigned by at least two radiologists following the ACR TI-RADS guideline, at least one of whom had 15 or more years of experience. When labeling the margin and echogenic foci, the radiologists made modifications to align with current clinical practice and reduce observer variability. In particular, the classification of margin is refined into two subcategories: margin irregularity (irregular or regular) and margin smoothness (not clear or clear). The echogenic foci category is simplified to “no calcification,” “microcalcification” or “macrocalcification.” Nodules containing both micro- and macrocalcifications are classified as microcalcification, as this indicates a higher risk of malignancy. In the end, a dataset of 471 thyroid ultrasound images from two local hospitals in Shanghai, China was obtained and labeled. The dataset contains 140 benign cases and 331 malignant cases, where the pathology result of each image was confirmed by a fine needle aspiration (FNA) test. All personal details of the patients were excluded. The collected dataset was randomly divided into three equal-size patches (157 each) for training, validation and testing respectively (to be further explained in the next section). Both agreed and disagreed labels were recorded without additional bias so that multi-observer studies could be performed in the later experiments.

Nodule Characteristics Detection Methods

Although some literature suggests that transfer learning may be possible between thyroid and breast tumor analysis because both are superficial organs,¹⁵ it may be rather difficult to adapt models from other organs directly to the characteristics analysis of thyroid nodules, as the characteristics are defined differently. Some characteristics such as calcification can be difficult to analyze for breast tumors due to the limitations of ultrasonography.¹⁶ Nevertheless, some characteristics such as margins do share similarities when used for classifying cancer malignancy,¹⁷ but they can still vary as the tumors grow in different media with possibly different cell types. Therefore, we propose a set of novel methods for analyzing thyroid nodule characteristics, drawing on insights from the existing literature.

Margin irregularity

Margin irregularity is a characteristic that studies the geometric shape of the nodule margin. Figure 2 shows an example of a nodule with an irregular margin. Based on the nodule boundary coordinates, the algorithm derives the convex variance, elliptic variance, lobulation, and angulation from the ROI margin; each of them captures a unique feature for measuring margin irregularity and contributes to a final decision.

Figure 2.

Illustration of margin irregularity detection: (a) nodule ROI with coordinates (red) and (b) an irregular nodule with lobulation (magenta) and angular (yellow) regions.

Irregularity Measure Extraction: Margin irregularity is first analyzed by lobularity. Given a set of concave regions $C = \{c_{1}, c_{2}, \dots, c_{n}\}$, each region $c \in C$ is classified using the rule $f_{L}$ in equation (1):

$$f_{L}(c) := \begin{cases} \text{Lobular}, & \text{if } A_{C} \geq t_{A} \land \min(w_{C}, h_{C}) \geq t_{l} \\ \text{Not Lobular}, & \text{else} \end{cases} \tag{1}$$

where $A_{C}$ denotes the ratio of the area of the concave region to that of the entire nodule, $w_{C}$ and $h_{C}$ denote the width and height of the concave region, $t_{A}$ and $t_{l}$ are the two thresholds defined for classification and determined empirically as 0.015 and 5 respectively.
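The rule in equation (1) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `ConcaveRegion` container and the function names are hypothetical, the width and height are assumed to be measured in pixels, and how the concave regions themselves are extracted from the margin is not covered here.

```python
from dataclasses import dataclass

T_A = 0.015  # area-ratio threshold t_A reported in the text
T_L = 5      # width/height threshold t_l reported in the text (pixels assumed)


@dataclass
class ConcaveRegion:
    """Hypothetical container for one concave region of the nodule margin."""
    area: float    # area of the concave region
    width: float   # w_C
    height: float  # h_C


def is_lobular(region: ConcaveRegion, nodule_area: float) -> bool:
    """Equation (1): large enough relative area AND large enough extent."""
    area_ratio = region.area / nodule_area  # A_C
    return area_ratio >= T_A and min(region.width, region.height) >= T_L


def count_lobular(regions, nodule_area: float) -> int:
    """Number of concave regions flagged as lobular, i.e. f_L(C)."""
    return sum(is_lobular(r, nodule_area) for r in regions)
```

For a nodule of area 10000, a concave region of area 200 with extent 12 by 8 passes both tests, whereas the same region with a height of 4 fails the extent test.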

Angulation is also an important factor for margin irregularity. Angulation analysis focuses on the extensions around the margin. Therefore, the algorithm examines the spikiness, roughness and distortions of the margin by using sets of three consecutive coordinates, $\{p_{1}, p_{2}, p_{3}\}, \{p_{2}, p_{3}, p_{4}\}, \dots, \{p_{|P|-1}, p_{|P|}, p_{|P|+1}\}$, from the total set of ROI margin coordinates $P$, where $p_{0} = p_{|P|}$ and $p_{|P|+1} = p_{1}$. The algorithm measures the curvature $\kappa$ at the coordinates $\{p_{n-1}, p_{n}, p_{n+1}\}$. A large value of $\kappa$ indicates a sharp change on the margin at the given coordinates.⁷ The algorithm also calculates the angle $\theta$ formed by $p_{n-1} \to p_{n} \to p_{n+1}$, where a large angle indicates a slow change and a small angle indicates a sharp change at the given coordinates. The $\kappa$ and $\theta$ values are then combined to estimate the angulation using the rule $f_{A}$ in equation (2):

$$f_{A}(p_{n-1}, p_{n}, p_{n+1}) := \begin{cases} \text{Angular}, & \text{if } \kappa \geq t_{\kappa} \land \theta < t_{\theta} \\ \text{Not Angular}, & \text{else} \end{cases} \tag{2}$$

where $t_{κ}$ and $t_{θ}$ are two thresholds used for classification and determined empirically as 0.1 and $90^{\circ}$ respectively.
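As a rough sketch of equation (2), the snippet below evaluates each consecutive triple of a closed margin polygon. The text does not state how the curvature is computed, so the Menger curvature (four times the triangle area divided by the product of the three side lengths) is used here purely as an assumption; the function names are hypothetical as well.

```python
import math

T_KAPPA = 0.1   # curvature threshold from the text
T_THETA = 90.0  # angle threshold in degrees from the text


def angle_deg(p_prev, p, p_next):
    """Angle at p (degrees) between the segments p->p_prev and p->p_next."""
    ax, ay = p_prev[0] - p[0], p_prev[1] - p[1]
    bx, by = p_next[0] - p[0], p_next[1] - p[1]
    norm = math.hypot(ax, ay) * math.hypot(bx, by)
    if norm == 0:
        return 180.0  # repeated point: treat as flat
    cos_t = max(-1.0, min(1.0, (ax * bx + ay * by) / norm))
    return math.degrees(math.acos(cos_t))


def menger_curvature(p_prev, p, p_next):
    """Assumed curvature estimate: 4 * triangle area / (a * b * c)."""
    a = math.dist(p_prev, p)
    b = math.dist(p, p_next)
    c = math.dist(p_prev, p_next)
    if a * b * c == 0:
        return 0.0
    cross = ((p[0] - p_prev[0]) * (p_next[1] - p_prev[1])
             - (p[1] - p_prev[1]) * (p_next[0] - p_prev[0]))
    return 2.0 * abs(cross) / (a * b * c)  # |cross| is twice the area


def is_angular(p_prev, p, p_next):
    """Equation (2): high curvature AND a sharp (small) angle."""
    return (menger_curvature(p_prev, p, p_next) >= T_KAPPA
            and angle_deg(p_prev, p, p_next) < T_THETA)


def count_angular(points):
    """f_A(P): angular triples on a closed polygon (indices wrap around)."""
    n = len(points)
    return sum(is_angular(points[k - 1], points[k], points[(k + 1) % n])
               for k in range(n))
```

On a thin diamond such as `[(0, 0), (1, 5), (2, 0), (1, -5)]`, the two sharp tips are flagged and the two blunt corners are not. Note that this curvature estimate depends on the spacing of the margin coordinates, so the threshold of 0.1 is only meaningful for the sampling used by the authors.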

Irregularity Classification: With the four irregularity measures determined, the margin irregularity of a nodule is classified using the rule in equation (3):

$$\begin{cases} \text{Regular}, & \text{if } \sigma_{c}^{2} \geq t_{C} \lor \sigma_{e}^{2} \geq t_{E} \land f_{L}(C) + f_{A}(P) = 0 \\ \text{Irregular}, & \text{else} \end{cases} \tag{3}$$

where

$$f_{L}(C) = \sum_{j=1}^{|C|} \mathbf{1}_{\{\text{Lobular}\}}[f_{L}(c_{j})], \qquad f_{A}(P) = \sum_{k=1}^{|P|} \mathbf{1}_{\{\text{Angular}\}}[f_{A}(p_{k-1}, p_{k}, p_{k+1})]$$

and $t_{C}$ and $t_{E}$ are two thresholds for classification, which are determined experimentally as 0.9 and 0.97 respectively.
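A minimal sketch of the decision rule in equation (3) follows. The two global measures are taken as inputs because their computation is not spelled out in this section. Equation (3) is printed without parentheses, so the default below follows the usual precedence in which "and" binds tighter than "or"; the `group_global_first` switch is a hypothetical addition that shows the other possible reading, and the text does not settle which one the authors intended.

```python
T_C = 0.9   # threshold t_C on the convexity measure, from the text
T_E = 0.97  # threshold t_E on the ellipticity measure, from the text


def classify_irregularity(convex_var, elliptic_var, n_lobular, n_angular,
                          group_global_first=False):
    """Apply the rule of equation (3).

    convex_var, elliptic_var: the two global measures (computed elsewhere).
    n_lobular, n_angular: the counts f_L(C) and f_A(P) of local findings.
    group_global_first: hypothetical switch for the reading (A or B) and C;
        the default is the precedence-based reading A or (B and C).
    """
    convex_ok = convex_var >= T_C
    elliptic_ok = elliptic_var >= T_E
    no_local_findings = (n_lobular + n_angular) == 0
    if group_global_first:
        regular = (convex_ok or elliptic_ok) and no_local_findings
    else:
        regular = convex_ok or (elliptic_ok and no_local_findings)
    return "Regular" if regular else "Irregular"
```

The two readings differ only for nodules whose convexity measure passes the threshold while lobular or angular findings are present, for example `classify_irregularity(0.95, 0.5, 0, 2)`.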

Our irregularity method and the methods developed in previous studies⁵,⁷,¹²,¹⁴ all measure the global irregularity of the nodule, but our method uses convexity and ellipticity variance, providing a more robust and accurate assessment of nodule irregularity without using an excessive number of features. Furthermore, our method has a new feature extraction step that incorporates measures of local margin irregularity, namely lobulation and angulation. This feature extraction step improves the sensitivity to margin irregularity, demonstrating the effectiveness of our method.

Margin smoothness

Margin smoothness represents the clarity of the nodule margin, which is reflected by the intensity contrast. The higher the contrast between the regions inside the nodule boundary and the regions outside, the clearer the margin is. For better results, we first pre-process the US images to suppress the speckle noise and enhance the images. An adaptive median filter⁸ is applied first to reduce the noise. This is followed by bilateral filtering with Gaussian kernels⁹ to enhance edges. The pre-processed image is then masked by the ROI delineated by the radiologists to determine inside and outside ribbons around the nodule margin with the assistance of morphological operations (see Figure 3(a)). These ribbons are further divided into $|R|$ equal regions of $\frac{360^{\circ}}{|R|}$ each, where $|R|$ can be determined heuristically depending on the trade-off between precision and computation cost. In this study, $|R|$ is set to 36 (see Figure 3(b)), allowing more precise estimations of local margin smoothness.
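The division of the ribbons into $|R| = 36$ equal angular regions can be illustrated as follows. This sketch assumes that the regions are sectors measured around a reference point such as the nodule centroid, which the text does not state explicitly; the function names are hypothetical.

```python
import math
from collections import defaultdict

N_SECTORS = 36  # |R| used in the study, i.e. 10 degrees per region


def sector_index(pixel, centre, n_sectors=N_SECTORS):
    """Index (0 .. n_sectors - 1) of the angular sector containing pixel."""
    dx = pixel[0] - centre[0]
    dy = pixel[1] - centre[1]
    angle = math.degrees(math.atan2(dy, dx)) % 360.0
    # The final modulo guards against an angle that rounds up to 360.0.
    return int(angle // (360.0 / n_sectors)) % n_sectors


def group_by_sector(pixels, centre, n_sectors=N_SECTORS):
    """Group ribbon pixels by the sector they fall into."""
    sectors = defaultdict(list)
    for pixel in pixels:
        sectors[sector_index(pixel, centre, n_sectors)].append(pixel)
    return dict(sectors)
```

Applying `group_by_sector` separately to the inner and outer ribbon pixels yields the paired regions whose intensities are compared in the next step.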

Figure 3.

Illustration of margin smoothness detection: (a) nodule ROI with outer (blue)/original (red)/inner (green) ribbons, (b) global detection results (distinct (green)/indistinct (red)), (c) local detection results (distinct (green)/indistinct (red)), and (d) final smoothness prediction (distinct (green)/indistinct (red)).

Smoothness Measure Extraction: To measure margin smoothness, we first represent each region of the inner and outer ribbons with a two-dimensional vector composed of the average and variance of the pixel intensities within the region. The difference between each pair of inner and outer regions is measured using the Euclidean distance, where a larger difference implies a clearer margin. Because of its statistical nature, this measure can only represent the general smoothness over a whole region, which can be a drawback especially when analyzing small lesions. To overcome this drawback, we derive another measure that examines intensity profiles across the inner and outer ribbons at $2^{\circ}$ intervals within each region, focusing on changes in fine details. Since such an intensity profile can be sensitive to noise, we further process the ultrasound image into superpixels using the SLIC algorithm.¹⁰ Each profile is considered distinct if the difference between the highest and lowest superpixel readings is greater than $t_{\lambda}$ (defined empirically as 20); otherwise, it is considered indistinct (see Figure 3(c)). The region outputs and intensity profile outputs per region are used to derive the final distinctiveness for that region (see Figure 3(d)).
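The two measures described above can be sketched as follows. The code assumes that pixel intensities and superpixel readings are already available as plain lists of numbers; the function names are hypothetical, and the population variance is an assumption because the text does not say which variance estimator was used. The cut-off that turns the region distance into a distinct or indistinct decision is not given in the text, so only the distance itself is computed.

```python
import math
from statistics import fmean, pvariance

T_LAMBDA = 20  # intensity-profile threshold t_lambda from the text


def region_feature(intensities):
    """Two-dimensional descriptor of a region: (mean, variance)."""
    return (fmean(intensities), pvariance(intensities))


def region_contrast(inner_intensities, outer_intensities):
    """Euclidean distance between the inner and outer region descriptors."""
    return math.dist(region_feature(inner_intensities),
                     region_feature(outer_intensities))


def profile_is_distinct(superpixel_profile, t_lambda=T_LAMBDA):
    """A profile is distinct if its superpixel readings span more than t_lambda."""
    return max(superpixel_profile) - min(superpixel_profile) > t_lambda
```

For example, a uniformly dark inner region of intensity 10 against a uniformly bright outer region of intensity 50 gives a contrast of 40, and a profile whose readings range from 30 to 60 counts as distinct.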

Smoothness Classification: The algorithm fuses the decision for the region analysis ( $R D$ ), Figure 3(b), and the decision for the signal analysis of intensity profiles ( $S D$ ), Figure 3(c), in classifying the margin smoothness for each region. Each region’s smoothness, $S M_{r}$ , is classified as clear or not clear using equation (4).

$$SM_{r} = \begin{cases} \text{Clear}, & \text{if } RD_{r} = \text{Distinct} \land SD_{r} \geq t_{p} \\ \text{Not Clear}, & \text{else} \end{cases} \tag{4}$$

where $r$ is a region index between 1 and $|R|$, $SD_{r} = \frac{\sum_{i=1}^{|S|} IPD_{i}}{|S|}$, $IPD$ (Intensity Profile Decision) is the decision for each intensity profile within region $r$, $|S|$ is the number of intensity profiles in region $r$, and $t_{p}$ is the intensity profile classification threshold, determined experimentally as 80%. The overall margin smoothness of a nodule is classified using equation (5).

$$\text{Smoothness},\ SM = \begin{cases} \text{Clear}, & \text{if } \frac{\sum_{r=1}^{|R|} SM_{r}}{|R|} \geq t_{S} \\ \text{Not Clear}, & \text{else} \end{cases} \tag{5}$$

where $t_{S}$ is the classification threshold, determined experimentally as 75%.
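Equations (4) and (5) can be combined into a short sketch. It assumes that each intensity profile decision is encoded as a Boolean (distinct or not) and that a region counts as 1 in the sum of equation (5) when it is clear and 0 otherwise; this encoding and the function names are illustrative rather than taken from the text.

```python
T_P = 0.80  # per-region threshold t_p on the share of distinct profiles
T_S = 0.75  # nodule-level threshold t_S on the share of clear regions


def region_is_clear(region_distinct, profile_decisions):
    """Equation (4): region decision RD_r AND enough distinct profiles."""
    if not profile_decisions:
        return False  # no profile evidence: treated as not clear (assumption)
    sd = sum(profile_decisions) / len(profile_decisions)  # SD_r
    return bool(region_distinct) and sd >= T_P


def margin_smoothness(regions):
    """Equation (5): regions is a list of (RD_r, [IPD_1, ..., IPD_|S|])."""
    n_clear = sum(region_is_clear(rd, ipd) for rd, ipd in regions)
    return "Clear" if n_clear / len(regions) >= T_S else "Not clear"
```

With four regions of which three are clear, the clear share is exactly 75% and the margin is labeled clear; with two of four, it is labeled not clear.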

Both our margin smoothness prediction method and the approach proposed by Zhuang et al.¹⁴ use a ribbon (inner and outer) around the nodule boundary and the mean difference in intensity to predict the margin smoothness. However, our method includes additional features in the form of local texture descriptors that capture local intensity variation. These additional features have been shown to improve the performance and robustness of the margin smoothness prediction.

Calcification

Calcification in a US image is defined as a small, bright fleck reflecting calcium growth on or inside the nodule. Identifying calcifications is known to be challenging due to their varying size, shape and brightness. Certain benign features, such as colloids, can easily be confused with calcifications in US images. To meet this challenge, we develop a two-stage process in which possible candidates for calcification are first detected and then classified into different classes.

Candidate Detection: For detecting candidates, we adopt the algorithm in Ren et al.¹⁸ that uses a superpixel-based weak detector to propose calcification candidates based on brightness and variance features. Although the algorithm identifies calcification candidates well, it produces many false positive candidates. To overcome this limitation, we propose the following deep learning solution.

Calcification Identification: A Deep Convolutional Neural Network (DCNN) model, known as CaNet, is designed and optimized to validate whether the candidates are actual calcifications. Automatic architecture search often involves training one neural network to optimize the architecture of another neural network. The proposed method performs two consecutive tasks: first, searching for an optimal CNN architecture and hyperparameters using Bayesian Optimization tailored for calcification US images; second, training the CNN model to classify calcification images. For both tasks, fivefold stratified cross-validation was applied, with one fold used for architecture optimization and all five folds for modeling and evaluating the optimal architecture.

The initial backbone architecture of CaNet consists of an Input Layer (IL), Convolution Block (CB), Max-Pooling Layer (PL) with stride 2 × 2, Global Average Pooling Layer (GAP), Fully Connected Layer (FCL), Softmax Layer (SL), and Classification Output Layer (COL) for two classes (calcification or none-calcification). The CB consists of three layers in the following order: a 3 × 3 Convolutional Layer (CL) with stride 1, a Batch Normalization Layer (BNL), and a ReLU Layer (RL). The IL is set to the size of 32 × 32 × 1 to accommodate the small size of the calcification ROI proposed by the weak detector. The hyperparameters were set as follows: initial learning rate $10^{-4}$; optimizer stochastic gradient descent with momentum; number of epochs 4000; and batch size 128. With the architecture defined, the number of CBs, the structure of the CB (i.e., the number of filters in the CL) and the type of the PL were determined by the Bayesian Optimization (BO) algorithm.¹⁹ The objective function is defined as the classification error rate. A surrogate model is constructed using a Gaussian Process model, and expected improvement is used as the acquisition function. Thirty iterations were performed to search for the optimal parameters. To reduce the likelihood of model overfitting, we also used the BO algorithm to search for the optimal L2 regularization value between $10^{-10}$ and $10^{-2}$.

CaNet architecture (Figure 4) and model were optimized and trained on a specifically collected dataset of 405 images, where the locations of all calcifications are pinpointed by a radiologist with 15 years of experience. Calcifications in the training set were augmented using mirroring and Singular Value Decomposition (SVD) method¹⁵ with three compression ratios (25%, 35% and 45%), finally resulting in 888 calcification candidates and 1723 none-calcification candidates at a ratio of roughly 1:2. The first fold was used to find the optimal CaNet network and Figure 4 shows the details of the optimal architecture with the optimized L2 value of 5.1540e-4 . For the second task, CaNet achieved 81.5% overall accuracy over fivefold cross-validation, 89.1% specificity (no calcification) and 80% sensitivity (calcification). CaNet model with the highest combined sensitivity and specificity was selected and used in the later stages of identifying micro and macro calcification.

Figure 4.

Optimal CNN architecture for calcification classification.

Micro/Macro Calcification Identification: Using the CaNet model, we obtain a set of confirmed candidates. However, many confirmed candidates are very small because of the nature of the weak detector used. These small candidates represent a macrocalcification only partially rather than in its entirety. Therefore, a region growing method is applied to restore the candidates to their appropriate sizes and shapes. In particular, the region growing method uses an iterative flood fill algorithm that compares the mean brightness of the grown region with that of its eight neighbors, using the pixel with the highest brightness within the candidate as the seed and expanding the region by including new neighbors until they differ significantly from the calculated mean. This naïve region growing method may easily suffer from contrast variations in ultrasound images, occasionally resulting in region overgrowth. Therefore, the growing is counterbalanced using Speeded-Up Robust Features (SURF).²⁰ In our implementation, we have limited the growing region to the areas of the 10 strongest SURF descriptors detected, which not only prevents overgrowing of calcification regions but also helps reduce false-positive calcifications. After each detected candidate has been restored to its appropriate size and shape, we extract several features for discriminating micro- and macrocalcifications. The candidate size, which can be measured by the pixel area $S_{C}$, is an obvious descriptor for distinguishing micro- from macrocalcifications. Some macrocalcifications appear in a line or pseudo-linear shape, which can be captured using circularity $o_{C}$. Finally, macrocalcifications often cast acoustic shadows below them. To capture such shadows, we crop the areas immediately above and below the candidate and use the difference between the average brightness of the two areas as the shadow feature $\Delta_{C}$.
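The region growing step can be illustrated with a small breadth-first flood fill over eight-connected neighbors. This is a simplified sketch rather than the authors' implementation: the `tolerance` value is a hypothetical stand-in for "differ significantly from the mean", and the optional `allowed` set is only a placeholder for the SURF-based constraint, which is not implemented here.

```python
from collections import deque


def brightest_pixel(image, candidate):
    """Seed selection: the brightest pixel within the candidate region."""
    return max(candidate, key=lambda rc: image[rc[0]][rc[1]])


def grow_region(image, seed, tolerance=15.0, allowed=None):
    """Grow a region from seed over 8-connected neighbours.

    image: list of rows of grey levels; seed: (row, col).
    tolerance: hypothetical limit on |neighbour - current region mean|.
    allowed: optional set of (row, col) positions that growth may enter.
    """
    rows, cols = len(image), len(image[0])
    region = {seed}
    total = float(image[seed[0]][seed[1]])  # running sum for the region mean
    frontier = deque([seed])
    while frontier:
        r, c = frontier.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr == 0 and dc == 0) or not (0 <= nr < rows and 0 <= nc < cols):
                    continue
                if (nr, nc) in region:
                    continue
                if allowed is not None and (nr, nc) not in allowed:
                    continue
                mean = total / len(region)
                if abs(image[nr][nc] - mean) <= tolerance:
                    region.add((nr, nc))
                    total += image[nr][nc]
                    frontier.append((nr, nc))
    return region
```

Because the mean is updated as pixels are added, the result can depend on the visiting order; passing an `allowed` set shows how an external constraint keeps the region from overgrowing.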

Finally, the grown candidates are classified into micro or macro calcification using the rule in equation (6):

$$\begin{cases} \text{Macro}, & \text{if } S_{C} > t_{A1} \\ \text{Macro}, & \text{if } S_{C} > t_{A2} \land o_{C} > t_{o} \\ \text{Macro}, & \text{if } \Delta_{C} > t_{\Delta} \\ \text{Micro}, & \text{else} \end{cases} \tag{6}$$

where $t_{A1}$, $t_{A2}$, $t_{o}$ and $t_{\Delta}$ are four thresholds for identifying macrocalcifications, which are empirically set to 200, 95, 0.78, and 50 respectively. Figure 5 demonstrates the effect of each stage of the proposed calcification detection method when analyzing a thyroid nodule with both micro- and macrocalcifications.
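The rule in equation (6) translates directly into a cascade of threshold tests. In the sketch below, the three features are taken as precomputed inputs and the function names are hypothetical. The nodule-level helper applies the labeling convention stated for the dataset (a nodule with both types is reported as microcalcification); using that convention on the algorithm's output is an assumption.

```python
T_A1 = 200     # area threshold t_A1 (pixels)
T_A2 = 95      # area threshold t_A2 (pixels), used together with circularity
T_O = 0.78     # circularity threshold t_o
T_DELTA = 50   # shadow-feature threshold t_Delta


def classify_candidate(area, circularity, shadow_diff):
    """Equation (6): label one grown candidate as 'Macro' or 'Micro'."""
    if area > T_A1:
        return "Macro"
    if area > T_A2 and circularity > T_O:
        return "Macro"
    if shadow_diff > T_DELTA:
        return "Macro"
    return "Micro"


def nodule_label(candidate_labels):
    """Nodule-level label; micro takes priority when both types occur."""
    if not candidate_labels:
        return "No calcification"
    if "Micro" in candidate_labels:
        return "Microcalcification"
    return "Macrocalcification"
```

For example, a candidate of 120 pixels with circularity 0.8 is labeled macro by the second rule, while a 40-pixel candidate without a shadow is labeled micro.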

Figure 5.

Illustration of calcification detection: (a) ROI image, (b) calcification candidates proposed by the weak detector, (c) calcification candidates validated by the CaNet, (d) calcification candidates after growing, and (e) classification outcome; red: microcalcification, yellow: macro calcification.

Compared with the existing methods in the literature, our proposed method strikes a balance among detecting the two types of calcification, false alarms and missed cases through the three-stage detection process. As shown by the examples in Figure 6, the morphology-based method¹⁶ is oversensitive, suffering severely from false positive detections, whereas the superpixel-based method¹⁸ tends to be undersensitive, failing to detect calcifications in some images.

Figure 6.

Comparison between different calcification detection methods: (a) ROI image with calcifications pinpointed by experienced radiologist, (b) top-hat based method¹⁶, (c) superpixel based method¹⁸, and (d) our proposed method; red: microcalcification, yellow: macrocalcification.

Experiment Results

This section presents the experiment results for evaluating the effectiveness of the various methods proposed in the Materials and Methods section. All experiments were conducted on an Intel Xeon workstation with a 2.90 GHz CPU, 16 GB RAM and an NVIDIA RTX A2000 GPU, running the 64-bit version of MATLAB R2020b. With the three patches of images, we used the training patch as the main reference for developing the proposed methods. We then used the validation patch to validate the robustness of the proposed methods and fine-tune the algorithm parameters and the relevant thresholds. For characteristics that involve training classification models, such as calcification, we merged the training and validation patches in a 10-fold cross-validation process for model training, evaluation and selection in order to fully utilize the data.

Our experiment consists of two tests. In Test 1, we evaluated the proposed methods against the labels given by one radiologist with 15 years of experience. In Test 2, we evaluated the methods against the labels given by the first radiologist and then confirmed and agreed by another radiologist with similar years of experience. The test results are presented in Table 1.

Table 1.

Performance Summary of the Proposed Methods in Tests 1 and 2.

Nodule characteristic descriptor	Descriptor subtype	Test 1: No. of cases	Test 1: Accuracy (%)	Test 2: No. of agreed cases	Test 2: Accuracy (%)
Margin irregularity (2 classes)	Irregular	96	93.8	89	94.4
	Regular	61	88.5	43	90.7
	Overall	157	91.7	132	93.2
Margin smoothness (2 classes)	Not clear	139	89.9	125	90.4
	Clear	18	61.1	13	77.2
	Overall	157	86.6	138	89.1
Calcification (2 | 3 classes)	No calcification	83	94.0	74	98.7
	Calcification (2-class)	74	83.8	66	89.4
	Micro (3-class)	61	70.5	42	71.4
	Macro (3-class)	13	69.2	9	77.8
	Overall (2-class)	157	89.2	140	94.3
	Overall (3-class)	157	82.8	125	88.0

The table shows our algorithms achieved overall accuracy well above 80% for all three nodule characteristics. The methods perform better on the agreed cases by multiple radiologists than cases labeled by a single radiologist. In general, the algorithms perform better on characteristics that are clearly defined than those where there is more room for different interpretations. For instance, high levels of accuracy are achieved for Margin Irregularity whereas the algorithms’ performances on margin smoothness and calcification are relatively lower.
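As a worked check of how the overall figures in Table 1 relate to the subtype figures, the overall accuracy is the case-weighted average of the subtype accuracies. The short sketch below recomputes three of the overall values from the rounded subtype values in the table; the function name is illustrative, and small discrepancies can arise because the inputs are already rounded.

```python
def overall_accuracy(subtypes):
    """Case-weighted mean accuracy from (number of cases, accuracy %) pairs."""
    total_cases = sum(n for n, _ in subtypes)
    return sum(n * acc for n, acc in subtypes) / total_cases


# (cases, accuracy %) pairs taken from Table 1
irregularity_test1 = [(96, 93.8), (61, 88.5)]
smoothness_test1 = [(139, 89.9), (18, 61.1)]
calcification_2class_test1 = [(83, 94.0), (74, 83.8)]

print(round(overall_accuracy(irregularity_test1), 1))          # 91.7
print(round(overall_accuracy(smoothness_test1), 1))            # 86.6
print(round(overall_accuracy(calcification_2class_test1), 1))  # 89.2
```

The same weighting explains why the small "Clear" subtype (18 cases) has little influence on the overall margin smoothness accuracy despite its lower subtype accuracy.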

At the subtype level, the algorithm performances vary substantially due to uneven distributions of the subtypes, particularly for those characteristics with a greater degree of subjectivity. A difficulty in algorithm development is deciding which radiologist’s labels should serve as the ground truth. The difficulty is more severe when the number of cases of a subtype is small. Margin characteristics are also known for their subjective nature: we found that radiologists agree more on irregular (89 of 96, 92.71%) and unclear cases (125 of 139, 89.93%) than on regular (43 of 61, 70.49%) and clear cases (13 of 18, 72.22%). This is because boundaries of malignant nodules tend to have more distinctive appearances than those of benign nodules, and hence radiologists often differ when classifying boundaries of “benign-looking” nodules. Despite these subjective factors, the proposed methods still achieved good performance on the margin characteristics.

The test results on calcification show that our calcification method achieved good overall accuracy, but better performance on none-calcification than calcification at subtype level. This performance bias is understandable as none-calcification cases appear more frequently in clinics. The results also show that inter-observer variability is quite substantial for calcification; the radiologists agree more in calcification and none-calcification (140 of 157, 89.17%) than in micro and macro calcifications (51 of 74, 68.92%). It is worth noting that radiologists often use the measurement scale marked on the side of US image as an aid when classifying micro and macrocalcifications whereas the algorithms have not made such a reference.

We have also compared our proposed CNN model against other CNN models adapted for calcification detection. In particular, we tuned two powerful CNNs, VGGNet-19 and ResNet-101, using a transfer learning approach for calcification image classification. The architectures of VGGNet-19 and ResNet-101 were adapted by replacing the last fully connected layer and the softmax layer of each network and fine-tuning them; the new fully connected layer has two outputs (calcification, no calcification). For a systematic and fair comparison, we set the training parameters for both models as follows: 20 epochs, initial learning rate = 0.0001, and mini-batch size = 4. The other parameters were left at the default values of each network. The results in Table 2 show that our proposed CaNet model achieved better specificity. Although CaNet has lower sensitivity than the other two models, it achieved less biased results when classifying micro and macro calcifications. It is also worth noting that both transfer learning models require an input size of 224 × 224 × 3, which does not fit most of the calcification candidates because of their small sizes. To resolve the issue, the candidates were resized using bicubic interpolation. However, up-sampling can blur the inputs and make the trained model prone to overfitting. We believe that this explains why our proposed model achieved better and less biased results, as it is better suited to the small input size.

Table 2.

Comparison of Using Different CNN Models for Calcification Detection.

Calcification labels		VGGNet-19 (%)		ResNet-101 (%)		CaNet (%)
No calcification		83.1		80.7		94.0
Calcification	Micro	85.1	75.4	85.1	70.5	83.8	70.5
Calcification	Macro	85.1	53.8	85.1	61.5	83.8	69.2
Overall		84.1	77.7	82.8	75.2	89.2	82.8
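As a consistency check on Table 2, the overall accuracies can be recovered from the per-class accuracies, reading the no-calcification accuracy as specificity and the calcification accuracy as sensitivity. The sketch below assumes a split of 83 no-calcification and 74 calcification images among the 157 test images; this split is inferred from the agreement counts quoted earlier and is not stated alongside the table. All function and variable names are illustrative.

```python
# Assumed class counts (inferred, not stated next to Table 2):
# 157 test images, 74 with calcification, hence 83 without.
N_NO_CALC, N_CALC = 83, 74

# Binary-level accuracies from Table 2, in percent:
# (no-calcification accuracy = specificity, calcification accuracy = sensitivity)
table2 = {
    "VGGNet-19": (83.1, 85.1),
    "ResNet-101": (80.7, 85.1),
    "CaNet": (94.0, 83.8),
}


def overall_accuracy(specificity, sensitivity):
    """Count-weighted overall accuracy (%) from the two per-class accuracies."""
    # Convert each percentage back to a whole number of correctly classified images.
    correct = round(specificity / 100 * N_NO_CALC) + round(sensitivity / 100 * N_CALC)
    return round(100 * correct / (N_NO_CALC + N_CALC), 1)


overall = {name: overall_accuracy(*acc) for name, acc in table2.items()}
print(overall)
```

Under these assumed counts the sketch gives 84.1, 82.8, and 89.2, matching the binary-level "Overall" row of Table 2. It also shows why CaNet leads overall despite its lower sensitivity: its gain on the larger no-calcification class outweighs the small loss on the calcification class.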

Discussions

Our tests in the experiment section have shown promising results for our proposed methods. In this section, we discuss several issues concerning the optimization of our models and algorithms, including threshold optimization, other TIRADS characteristics, and the construction of robust models.

Threshold Tuning

The proposed methods for margin and calcification use several thresholds. To determine the best configurations empirically, we conducted a gradient descent-based search on the validation set. The search aims to maximize the overall validation accuracy while maintaining a balance between the sub-class accuracies. Figures 7 to 9 illustrate how threshold setting may affect modeling performance.

Figure 7.

Performance of different thresholds used in margin irregularity analysis.

Figure 8.

Performance of different thresholds used in margin smoothness analysis.

Figure 9.

Performance of different thresholds used in calcification analysis.
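The tuning objective described in this section (maximize overall validation accuracy while keeping the sub-class accuracies balanced) can be illustrated with a minimal sketch. Note that it swaps in an exhaustive search over candidate values for a single threshold instead of the gradient descent-based search used in the paper, and that the balance penalty, its weight, and the toy data are assumptions made only for illustration.

```python
def class_accuracies(scores, labels, threshold):
    """Return (overall, positive-class, negative-class) accuracy for one threshold."""
    preds = [s >= threshold for s in scores]
    pos_hits = [p for p, y in zip(preds, labels) if y]          # True where a positive is caught
    neg_hits = [not p for p, y in zip(preds, labels) if not y]  # True where a negative is rejected
    acc_pos = sum(pos_hits) / len(pos_hits)
    acc_neg = sum(neg_hits) / len(neg_hits)
    overall = (sum(pos_hits) + sum(neg_hits)) / len(labels)
    return overall, acc_pos, acc_neg


def tune_threshold(scores, labels, candidates, balance_weight=0.5):
    """Pick the candidate that maximizes overall accuracy minus a sub-class gap penalty."""
    def objective(t):
        overall, acc_pos, acc_neg = class_accuracies(scores, labels, t)
        # Hypothetical balance term: penalize a large gap between sub-class accuracies.
        return overall - balance_weight * abs(acc_pos - acc_neg)
    return max(candidates, key=objective)


# Toy validation data (made up): one feature score and one binary label per image.
val_scores = [0.05, 0.10, 0.12, 0.30, 0.35, 0.40, 0.45, 0.50]
val_labels = [False, False, False, False, True, True, True, True]
best = tune_threshold(val_scores, val_labels, candidates=[0.1, 0.2, 0.33, 0.42])
print(best)
```

On this toy data the search selects 0.33, the only candidate that separates the two classes perfectly. With several interacting thresholds, the same objective could be evaluated over a grid of combinations or handed to an iterative optimizer.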

We also found that the size of a nodule in an image may affect radiologists’ decisions when classifying margin characteristics. For example, lobulations in large nodules may appear less severe than the same kind of lobulations in smaller nodules. Therefore, we further adjusted some thresholds according to the size of the nodule for better robustness. In particular, for margin irregularity, we set $t_{A} = 0.01$, $t_{l} = 2$, and $t_{κ} = 0.2$ for nodules with a minimum resolution of less than 50 pixels, and $t_{κ} = 0.15$ for nodules with a minimum resolution between 50 and 100 pixels. In contrast, margin smoothness is less affected by nodule size because the characteristic relates to textures around the margin. However, most features used in the proposed methods expect larger nodules for the extracted information to be statistically reliable, and in reality some nodules are very small. Therefore, instead of using universally defined thresholds, we set $| R | = 12$, $t_{λ} = 15$, and $t_{S} = 80$ for nodules with a minimum resolution of less than 100 pixels, and $| R | = 24$ for nodules with a minimum resolution between 100 and 250 pixels.
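The size-dependent settings above can be summarized as a small lookup. This is a sketch only: the function names and dictionary keys are hypothetical, the handling of the band edges (whether 50, 100, and 250 pixels are inclusive) is an assumption, and an empty result simply means that the universally defined thresholds apply, whose values are not repeated here.

```python
def margin_irregularity_overrides(min_resolution_px):
    """Size-dependent threshold overrides for margin irregularity."""
    if min_resolution_px < 50:
        return {"t_A": 0.01, "t_l": 2, "t_kappa": 0.2}
    if min_resolution_px <= 100:  # assumed inclusive upper edge
        # Only t_kappa is stated for this band; other thresholds keep their defaults.
        return {"t_kappa": 0.15}
    return {}  # larger nodules: keep the universally defined thresholds


def margin_smoothness_overrides(min_resolution_px):
    """Size-dependent threshold overrides for margin smoothness."""
    if min_resolution_px < 100:
        return {"R_size": 12, "t_lambda": 15, "t_S": 80}  # R_size stands for |R|
    if min_resolution_px <= 250:  # assumed inclusive upper edge
        return {"R_size": 24}
    return {}


print(margin_irregularity_overrides(40))
print(margin_smoothness_overrides(180))
```

Returning only the overrides keeps the size-specific exceptions separate from the default configuration, so the defaults can be tuned without touching the small-nodule rules.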

As presented in Table 3, we found that the proposed methods tend to perform better for malignant nodules than for benign ones. The performance bias may be partially caused by the unbalanced training dataset and could be reduced by enrolling or augmenting more benign cases. Accuracy for calcification on benign nodules is also relatively poor. It is worth noting that calcification is often associated with malignancy and rarely appears in benign nodules. Also, fibrosis in benign tumors can easily be confused with microcalcifications. Further research is needed to better understand such rare and confusing cases and to develop better solutions. Additionally, the experimental results show that the margin irregularity measure performs well and is robust across different nodule sizes because our method considers both global and local margin irregularities. Margin smoothness also performs well, but slightly better on small nodules. We believe the performance deterioration for large nodules is due to the excessive space covered by each region, indicating that it may be appropriate to increase the value of $| R |$ when analyzing large nodules.

Table 3.

Performance Summary of Test 1 based on Cancer Type and Nodule Size.

Test 1	Benign (%)	Malignant (%)	Size < 150 pxl (%)	Size ≥ 150 pxl (%)
Margin irregularity	86.0	93.9	92.2	91.0
Margin smoothness	81.4	88.6	88.9	83.6
Calcification	69.8	87.7	83.3	82.1

Shape and Echogenicity Analysis

Besides margin and calcification, the TIRADS guidelines also define other US characteristics such as shape and echogenicity. Shape describes the orientation of nodule growth. To provide a complete automated solution, we proposed a simple shape classification algorithm consisting of three steps. First, a polygon shape (or bounding box) is constructed from the set of coordinates on the ROI boundary. An exhaustive search is then conducted horizontally and vertically within the polygon to locate the maximum width $w_{max}$ and the maximum height $h_{max}$. The nodule is then classified as “taller-than-wide” if $h_{max} > w_{max}$; otherwise, it is classified as “wider-than-tall.” Using the test set images, this simple algorithm achieved a high accuracy of 98% on cases labeled by a single radiologist and 100% on the cases agreed by two radiologists.
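A minimal sketch of the shape rule follows. It assumes the ROI polygon has already been rasterized into a binary mask (rows of 0/1 values), measures width and height as the extent between the first and last mask pixel in each row and column, and assumes the decision rule is a strict comparison of maximum height against maximum width, so ties count as wider-than-tall. The function names are illustrative.

```python
def max_extents(mask):
    """Return (w_max, h_max) of a binary mask given as a list of rows of 0/1 values."""
    def extent(line):
        idx = [i for i, v in enumerate(line) if v]
        return idx[-1] - idx[0] + 1 if idx else 0

    w_max = max(extent(row) for row in mask)        # horizontal scan, row by row
    h_max = max(extent(col) for col in zip(*mask))  # vertical scan, column by column
    return w_max, h_max


def classify_shape(mask):
    """Label a nodule mask as taller-than-wide or wider-than-tall."""
    w_max, h_max = max_extents(mask)
    return "taller-than-wide" if h_max > w_max else "wider-than-tall"


tall_mask = [[0, 1, 1, 0] for _ in range(5)]     # 2 pixels wide, 5 pixels tall
wide_mask = [[1, 1, 1, 1, 1], [0, 1, 1, 1, 0]]   # 5 pixels wide, 2 pixels tall
print(classify_shape(tall_mask), classify_shape(wide_mask))
```

Because the scan covers every row and column of the mask, the result depends only on the delineated contour and not on any texture information.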

Another important US characteristic is echogenicity, which is reflected by the pixel intensity values in the nodule region of the image. We proposed a simple algorithm to identify the echogenicity type by comparing the intranodular intensity with that of the surrounding areas. The algorithm first divides the areas around the nodule into sub-bands (or small regions) and computes the mean and variance of their intensities. Sub-bands that are overly dark, overly bright, or inconsistent are considered non-gland areas and are excluded from consideration. The median of the remaining ones is then used as the isoechoic reference, and the other echogenicity types are determined accordingly using a set of dependent thresholds. Finally, we compare the percentages of the echogenicity types present in the nodule and choose the most dominant one as the final echogenicity class. Figure 10 shows some example echogenicity types within a nodule.

Figure 10.

Illustration of echogenicity detection: (a) original ROI, (b) valid reference regions detected, marked in red, and (c) echogenicity classification result (purple: very-hypoechoic; yellow: hypoechoic; pink: isoechoic).
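The echogenicity steps can be sketched roughly as follows. Every numeric limit in this sketch is a hypothetical placeholder, since the actual thresholds are not given in this section; expressing the class boundaries as ratios of the isoechoic reference is one possible reading of "dependent thresholds", and all names are illustrative.

```python
from collections import Counter
from statistics import mean, median, pvariance

# Hypothetical placeholder limits, not the values used in the paper.
DARK, BRIGHT, MAX_VAR = 20, 200, 900      # sub-band rejection limits (8-bit intensities)
VERY_HYPO, HYPO, HYPER = 0.4, 0.8, 1.2    # class boundaries as ratios of the reference


def isoechoic_reference(sub_bands):
    """Median of the mean intensities of sub-bands that look like gland tissue."""
    valid = [mean(b) for b in sub_bands
             if DARK < mean(b) < BRIGHT and pvariance(b) <= MAX_VAR]
    return median(valid)  # raises StatisticsError if no sub-band is valid


def pixel_class(intensity, reference):
    """Assign one intranodular pixel to an echogenicity type."""
    ratio = intensity / reference
    if ratio < VERY_HYPO:
        return "very-hypoechoic"
    if ratio < HYPO:
        return "hypoechoic"
    if ratio <= HYPER:
        return "isoechoic"
    return "hyperechoic"


def classify_echogenicity(nodule_pixels, sub_bands):
    """Return the most frequent echogenicity type among the nodule pixels."""
    reference = isoechoic_reference(sub_bands)
    counts = Counter(pixel_class(p, reference) for p in nodule_pixels)
    return counts.most_common(1)[0][0]


# Toy data (made up): five surrounding sub-bands and five intranodular pixels.
bands = [[100, 102, 98], [5, 6, 4], [110, 108, 112], [250, 251, 249], [90, 95, 85]]
print(classify_echogenicity([50, 55, 60, 30, 100], bands))
```

In the toy data the very dark and very bright sub-bands are rejected, the remaining sub-band means (100, 110, 90) give a reference of 100, and three of the five nodule pixels fall in the hypoechoic range.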

The test results show that our proposed algorithm achieved an overall accuracy of 87.7% against one radiologist’s labels, and a higher overall accuracy of 90.6% on the cases where two radiologists agreed. However, because some rare subtypes such as hyperechoic and very-hypoechoic are extremely difficult to obtain in clinical practice, the test set in the current data collection was too small. Further evaluations are needed to test the reliability of our proposed method for echogenicity.

Ablation Study

We used 100 randomly selected images to conduct two small-scale ablation analyses, on margin smoothness and margin irregularity, to evaluate the contribution of individual features and their combinations to the overall performance of the methods. The test results are shown in Tables 4 and 5.

Table 4.

Ablation Study for Margin Smoothness.

Labels for smoothness	Region analysis only (%)	Signal analysis only (%)	Region & signal features (%)
Clear	37.5	62.5	75.0
Not clear	97.6	82.1	86.9
Overall	88.0	79.0	85.0

Table 5.

Ablation Study for Margin Irregularity.

Irregularity labels	Stage one		Stage one and stage two
Irregularity labels	Convex variance (%)	Elliptic variance (%)	Conv. var + local (%)	Ellip. var. + local (%)	All features (%)
Irregular	15.7	96.1	88.2	96.1	92.2
Regular	100.0	49.0	89.8	46.9	89.8
Overall	57	73	89	72	91

The margin-irregularity analysis revealed that, of the two stage one features, elliptic variance (73%) contributes more to the prediction than convex variance (57%). The results also indicate that each feature is biased toward a different subclass. However, once the local features (lobulation and angulation) are added, convex variance performs 17 percentage points better than elliptic variance with local features (89% versus 72%). Combining all features improved the performance further, by 2 percentage points over the best-performing partial configuration (91% versus 89%). Therefore, the features complement each other and enhance the method’s performance.

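The margin-smoothness results showed that signal analysis contributed more than region analysis to a balanced classification: it identified clear margins far better (62.5% versus 37.5%), although region analysis alone had the higher overall accuracy (88.0% versus 79.0%) because it favored the dominant not-clear class. The analysis also demonstrated that combining region and signal features improved the model’s robustness compared to the individual features (75.0% for clear and 86.9% for not clear), suggesting that the two features complement each other.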

To analyze the contribution of each step of our proposed three-stage calcification detection method, we performed an ablation study using the 157 images from Test 1. The results showed that the weak detector, CaNet, and the SURF filter had a 16% impact on average on the overall accuracy of identifying calcifications (compare the first three columns of Table 6 with the 89.2% obtained using all components). The weak detector mostly affected the detection of calcifications, whereas CaNet and the SURF filter mostly reduced false detections. The region-growing method did not contribute much to identifying calcifications but significantly improved the classification of macro calcifications (see Table 6).

Table 6.

Ablation Study for Calcification.

Calcification labels		Weak detector^a (%)		CaNet (%)		SURF filter (%)		Region growing (%)		All (%)
No calcification		89.2		51.8		53.0		94.0		94.0
Calcification	Micro	51.4	44.3	97.3	93.4	97.3	93.4	82.4	75.4	83.8	70.5
Calcification	Macro	51.4	0.0	97.3	23.1	97.3	23.1	82.4	15.4	83.8	69.2
Overall		71.3	65.0	73.3	65.6	73.9	66.2	88.5	80.3	89.2	82.8

^a The weak detector was replaced by the top-hat detector for the ablation study.

Margin Smoothness Sensitivity Analysis to the Precision of the Region of Interest (ROI)

Although we are unable to apply other margin smoothness methods to our dataset because of the differing objectives of the studies, as mentioned in Section 2, we have analyzed the sensitivity of our method to a delineated ROI that does not precisely align with the lesion boundary. We purposely introduced various degrees of misalignment by applying random shifts to the initial ROI. The process begins by determining the ribbon width for each lesion (refer to the Margin Smoothness section for the ribbon’s definition) with a defined shift range of ±20%. For each lesion, a random shift value is then chosen from the predetermined range. This process is repeated 10 times on the 100 randomly selected images, and the performances are presented in Table 7.

Table 7.

Sensitivity Analysis of the Margin Smoothness Method to the Precision of the Delineated ROI, That Is, the Performance When the Region of Interest Does Not Precisely Align With the Boundary of the Lesion.

Iteration	0	1	2	3	4	5	6	7	8	9	10
Clear (%)	75.00	56.25	56.25	56.25	68.75	56.25	56.25	68.75	50.00	62.50	68.75
Not clear (%)	86.90	94.05	90.48	89.29	94.05	90.48	90.48	92.86	91.67	95.24	89.29
Overall (%)	85.00	88.00	85.00	84.00	90.00	85.00	85.00	89.00	85.00	90.00	86.00

Table 7 presents the performance of the ROI precision analysis over 10 iterations. Iteration 0 shows the performance with the original delineated ROI, while iterations 1 to 10 show the performance after applying the random shifts to the ROI. The overall performance indicates that the method is not overly sensitive to variations in ROI precision; that is, the performance is relatively stable even with an imprecise ROI, with a standard deviation of 2.25. This conclusion is consistent with the “Not Clear” performance, which also has a low standard deviation of 2.50. However, the “Clear” performance shows high variability, with a standard deviation of 7.82. This variability is attributed to the smaller sample size of “Clear” cases, where any misclassification significantly affects the performance.
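The quoted standard deviations can be checked directly from Table 7. The text does not say which estimator was used; one reading that reproduces the figures is the sample standard deviation (n − 1 denominator) taken over all 11 columns, iteration 0 included, which is what the sketch below assumes.

```python
from statistics import stdev

# Accuracies (%) from Table 7, iterations 0 to 10.
table7 = {
    "Clear": [75.00, 56.25, 56.25, 56.25, 68.75, 56.25,
              56.25, 68.75, 50.00, 62.50, 68.75],
    "Not clear": [86.90, 94.05, 90.48, 89.29, 94.05, 90.48,
                  90.48, 92.86, 91.67, 95.24, 89.29],
    "Overall": [85.00, 88.00, 85.00, 84.00, 90.00, 85.00,
                85.00, 89.00, 85.00, 90.00, 86.00],
}

# Sample standard deviation per row, rounded to two decimals.
spread = {label: round(stdev(values), 2) for label, values in table7.items()}
print(spread)
```

Under this assumption the sketch returns 7.82 for "Clear", 2.50 for "Not clear", and 2.25 for "Overall", in line with the values reported above.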

Conclusion

In this paper, we presented several methods for detecting US image characteristics of thyroid nodules, namely margin irregularity, margin smoothness, and calcification. The proposed methods for margin classification exploit new geometrical and texture features effectively. Our novel three-stage approach for calcification identification utilizes super-pixels and a convolutional neural network optimized for this purpose. Finally, a simple method for nodule shape and an initial algorithm for echogenicity using the thyroid gland as the main reference have been described. Our methods have shown good performance in identifying the US image characteristics of thyroid nodules, with overall accuracies from 82.8% to 98.1% when tested on US images collected from two hospitals and labeled by multiple experienced radiologists. Encouraged by the results, we will continue improving our algorithms for thyroid characteristics analysis and expand our work to estimating TIRADS scores for the nodule and identifying the level of malignancy. Furthermore, we plan to adapt the methods to identify characteristics of other kinds of lesions, such as breast lesions and lymphoma. Finally, we will compare the accuracies of our methods when the nodule contours are extracted by automatic segmentation.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is sponsored by TenD.AI Medical Technology.

ORCID iD

Alaa AlZoubi

References

1. Bray, Ferlay, Soerjomataram, Siegel, Torre, Jemal. Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2018;68(6):394-424.

2. Horvath, Majlis, Rossi, Franco, Niedmann, Castro, et al. An ultrasonogram reporting system for thyroid nodules stratifying cancer risk for clinical management. J Clin Endocrinol Metab. 2009;94(5):1748-51.

3. Kwak, Han, Yoon, Moon, Son, Park, et al. Thyroid imaging reporting and data system for US features of nodules: a step in establishing better stratification of cancer risk. Radiology. 2011;260(3):892-9.

4. Tessler, Middleton, Grant, Hoang, Berland, Teefey, et al. ACR thyroid imaging, reporting and data system (TI-RADS): white paper of the ACR TI-RADS committee. J Am Col Radiol. 2017;14(5):587-95.

5. Zulfanahri, NHA, Nugroho, Frannita, Ardiyanto. Classification of thyroid ultrasound images based on shape features analysis. In: 10th Biomedical Engineering International Conference (BMEiCON), 2017, pp. 1-5.

6. Nugroho, Frannita, Hutami, AHT. Thyroid nodules stratification based on orientation characteristics using machine learning approach. In: 3rd International Conference on Computer and Informatics Engineering (IC2IE), 2020, pp. 52-57.

7. Nugroho, Frannita, Nugroho, Zulfanahri, Choridah. Classification of thyroid nodules based on analysis of margin characteristic. In: 2017 International Conference on Computer, Control, Informatics and its Applications (IC3INA), 2017, pp. 47-51.

8. Baek, JH. Korean thyroid imaging reporting and data system: current status, challenges, and future perspectives. Korean J Radiol. 2021;22(9):1569-78.

9. Smith, Botz. European thyroid association TIRADS. In: Reference article, Radiopaedia.org, 2021, https://doi.org/10.53347/rID-68341

10. Zhou, Yin, Wei, Zhang, Song, Luo, et al. Superficial organ and vascular ultrasound group of the society of ultrasound in medicine of the Chinese medical association. Chin Artif Intell Thyroid. 2020;70(2):256-79.

11. Zhou, Guo, Huang, Chen, et al. Explore the diagnostic efficiency of Chinese thyroid imaging reporting and data systems by comparing with the other four systems (ACR TI-RADS, Kwak-TIRADS, KSThR-TIRADS, and EU-TIRADS): A single-center study. Front Endocrinol. 2021;12:763897.

12. Wang, Yang, Peng, Chen. A thyroid nodule classification method based on TI-RADS. In: Proc. SPIE 10420, Ninth International Conference on Digital Image Processing (ICDIP 2017), 2017, 1042041.

13. Chen, Tai, Wang, et al. Quantitative analysis of echogenicity for patients with thyroid nodules. Sci Rep. 2016;6:35632.

14. Zhuang, Hua, Chen, Lin, JL. A novel TIRADS of US classification. Biomed Eng Online. 2018;17:82.

15. Zhu, AlZoubi, Jassim, Jiang, Zhang, Wang, et al. A generic deep learning framework to classify thyroid and breast lesions in ultrasound images. J Ultrason. 2021;110:106300.

16. Dong, Gao, Wang. A top-hat based calcifications detection method in mammograms. J Image Graph. 2006;11(12):1839-43.

17. Shan, Alam, Garra, Zhang, Ahmed. Computer-aided diagnosis for breast ultrasound using computerized BI-RADS features and machine learning methods. Ultrasound Med Biol. 2016;42(4):980-8.

18. Ren, Liu, Tong, Cao. Calcification segmentation based on a different scales superpixels saliency detection algorithm. J Med Biol. 2020;46(12):3404-12.

19. Radhakrishnan, AlZoubi. Vehicle Pair Activity Classification using QTC and Long Short Term Memory Neural Network. In: Proceedings of the 17th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, 2022, pp. 236-247.

20. Bay, Ess, Tuytelaars, Van Gool. Speeded-up robust features (SURF). J Comput Vis Underst. 2008;110(3):346-59.

Automatic Detection of Thyroid Nodule Characteristics From 2D Ultrasound Images
