Abstract
Introduction
Thyroid cancer is a common malignant tumor, and early diagnosis and timely treatment are crucial to improve patient prognosis. With the increasing use of enhanced CT scans, a new opportunity for early thyroid cancer screening has emerged. However, existing CT-based models face challenges due to limited datasets, small sample sizes, and high noise.
Methods
To address these challenges, we collected enhanced CT scan image data from 240 patients in Guangdong and Xinjiang, China, and established a CT dataset for early thyroid cancer screening. We propose a deep learning model, the DVT model, which combines transformer DNN and transfer learning techniques to integrate time series data and address small sample sizes and high noise.
Results
The experimental results show that the DVT model achieves a prediction accuracy of 0.96, AUROC of 0.97, specificity of 1, and sensitivity of 0.94. These results indicate that the DVT model is a highly effective tool for early thyroid cancer screening.
Conclusion
The DVT model has the potential to assist clinicians in identifying potential thyroid cancer patients and reducing patient expenses. Our study provides a new approach to thyroid cancer screening using enhanced CT scans and demonstrates the effectiveness of deep learning techniques in addressing the challenges associated with CT-based models.
Abbreviation
In this article, all abbreviations and their full forms are listed in Table 1.
Table 1. Abbreviations.
Introduction
Thyroid cancer is a malignant tumor that can occur at any age, and its incidence is rising worldwide. Early diagnosis and timely treatment are important for improving patient prognosis.1-3 Through standardized thyroid nodule management strategies, clinicians can more effectively identify the nature of thyroid nodules and raise the early diagnosis rate of thyroid cancer, thereby improving treatment outcomes and patients' quality of life.4,5 Researchers have proposed many thyroid cancer detection models based on computed tomography (CT) image data. For example, Arepalli et al used a convolutional neural network to analyze, detect, and classify thyroid cancer in thyroid CT images. In 2021, Zhao Hongbo et al analyzed the CT images of 880 patients with thyroid nodules, selected 5 convolutional neural networks, and generated an ensemble model to distinguish benign from malignant thyroid nodules.6 In 2023, Wang Chujun et al collected 676 thyroid ultrasound images from 338 thyroid cancer patients and used ResNet-18 as the base network to build a deep learning model that predicts cervical lymph node metastasis.7
However, most patients do not undergo a dedicated thyroid scan in the course of routine care, so such models are difficult to use directly for early screening of thyroid cancer. Since the COVID-19 pandemic began in 2019, enhanced chest CT scanning has gradually become a routine examination for ordinary patients in Chinese hospitals.8,9 We have observed that many hospitals scan part or most of the thyroid area when performing a chest scan, which suggests a new way to build early thyroid screening on routine examination data. To help doctors detect thyroid cancer early, we propose a CT image examination model that combines routine examination with artificial intelligence.10-12 When ordinary patients undergo routine tests in hospital, the model can perform an initial analysis of enhanced CT scans that contain the thyroid to determine whether the thyroid is normal.13,14 Once the model identifies a potential cancer risk, the patient is promptly referred to a specialist for further evaluation and testing. This integrated approach is expected to improve the efficiency and accuracy of thyroid cancer detection.
At present, early screening of thyroid cancer based on enhanced CT data faces four major challenges. First, there is a lack of available CT datasets that include both the thoracic and thyroid areas. Second, the available data suffer from a small sample size and high noise in the thyroid area. Because the data come from routine tests, each patient's CT scan images include both the thyroid gland and the chest cavity, and the thyroid region accounts for only a small proportion. As a result, few thyroid-region images are available and they may contain more noise, which makes it difficult to build an effective feature extraction model for the thyroid region. Third, thyroid region features must be extracted and integrated across the time series.5,15-17 Thyroid image data from enhanced CT are typical continuous time series data in which adjacent images are strongly correlated, so the model must both extract features from single thyroid images and integrate the time series features. Fourth, medical imaging commonly suffers from scarce samples, particularly in specific tasks such as thyroid cancer screening, where insufficient data can lead to inadequate model training and thereby harm generalization and predictive accuracy. To address these problems, we first collected enhanced CT scan image data from 240 patients at Zhuhai People's Hospital, Guangdong, China, and the First People's Hospital of Kashi, Xinjiang, China, and established a CT dataset for early screening of thyroid cancer. We then propose a deep learning model for early thyroid cancer screening based on the transformer and transfer learning (the DVT model), as shown in Figure 1. The DVT model consists of two main parts. The first is a model that separates the thyroid region from the thoracic region. We build it on the vision transformer (ViT), perform the first transfer learning step from the ImageNet dataset, and retain the learned parameters.
The second is the early screening model for thyroid cancer. We splice two adjacent thyroid images together and feed the result into the ViT to integrate the time series data. At this stage, the ViT model uses the homologous data from the thyroid and thoracic region separation model as an auxiliary domain for secondary transfer learning. This approach has two advantages. First, stitching adjacent thyroid images together better integrates the features of adjacent areas and reduces the computation required by the transformer model.18-23 Second, because the task objectives of the thyroid separation model and the thyroid cancer early screening model are similar, the model can adapt to the thyroid cancer recognition task more quickly, which mitigates the problems of small samples and noise in the dataset.24-26 Finally, we design a transformer-DNN model for the classification output.27-32

Figure 1. The framework of the DVT model (T is thyroid, C is chest).
We conducted three independent experiments: the Guangdong dataset experiment, the Xinjiang dataset experiment, and the ablation experiment. The experimental results are reported in the Results section. They show that the DVT model is superior to the reference models, and the ablation experiment verifies the effectiveness of the proposed secondary transfer learning.
Dataset and Model
Dataset Construction Process
First, the research team collected combined chest and thyroid CT scans of 120 patients at Zhuhai People's Hospital, Guangdong, China; all detailed information was de-identified. The dataset consisted of 60 normal subjects and 60 thyroid cancer subjects, each with images of the thoracic region and no fewer than 20 images of the thyroid region. Second, the dataset from the First People's Hospital of Kashi, Xinjiang, China was de-identified in the same way. Like the Zhuhai People's Hospital dataset, it covered 120 patients (60 normal subjects and 60 thyroid cancer subjects), each with images of the thoracic region and no fewer than 20 images of the thyroid region. Table 2 describes the device collection parameters. Table 3 summarizes the two batches of patient data.
Table 2. Device Collection Parameters.
Table 3. Summary of Two Batches of Patient Data.
Note: The data is presented in a summarized form without including any personal identification information.
Model
Thyroid Region Separation Model
In this paper, the vision transformer (ViT) is used to separate thyroid regions. ViT was proposed by Google in 2020 in the paper An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale,18 published at ICLR 2021, which applied the Transformer structure to image classification in computer vision for the first time. The traditional Transformer uses an encoder-decoder framework, whereas ViT uses only the encoder. Because the input of a standard Transformer is a one-dimensional sequence, the image must first be converted into sequence data. ViT slices an image into non-overlapping patches of fixed size and flattens each patch into a one-dimensional vector. A linear transformation layer then maps each flattened patch to a fixed-length vector called a patch embedding. Since the final output of a classification task is a label, the ViT authors add a CLS token at the beginning of the input sequence to the Transformer encoder. The paper shows that, compared with the best convolutional neural network structures of the time, ViT achieves satisfactory results and requires fewer computing resources to train.
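The patch pipeline described above (slice, flatten, linear projection, CLS token) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `patchify` and `embed`, the toy 4 × 4 image, and the hand-picked projection weights are hypothetical and are not taken from the DVT implementation. Position embeddings and the encoder itself are omitted.

```python
def patchify(image, patch):
    """Split a 2-D image (list of rows) into non-overlapping patch x patch tiles.

    Each tile is flattened row by row into a 1-D list.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(flat)
    return patches


def embed(patches, weight, cls_token):
    """Project each flattened patch with `weight` (patch_dim x embed_dim)
    and prepend the CLS token to the resulting token sequence."""
    columns = list(zip(*weight))  # one tuple per embedding dimension
    tokens = [[sum(x * w for x, w in zip(p, col)) for col in columns]
              for p in patches]
    return [list(cls_token)] + tokens


# Toy 4 x 4 "image" with pixel values 0..15 and 2 x 2 patches (assumed sizes).
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = patchify(image, patch=2)

# Hypothetical fixed projection from patch_dim = 4 to embed_dim = 2.
weight = [[1, 0], [0, 1], [1, 0], [0, 1]]
tokens = embed(patches, weight, cls_token=[0, 0])
```

In a trained ViT the projection weights and the CLS token are learned parameters; they are fixed here only so that the shapes are easy to follow (4 patches plus 1 CLS token gives 5 tokens).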
First, we assume that each patient contains n consecutive thyroid images, the size of each CT image input represented as
Thyroid Region Feature Fusion and Classification Model
Each patient has multiple consecutive thyroid images. To exploit the spatial relationship between adjacent images, extract features more effectively, and reduce the complexity of the model, we splice every two adjacent images horizontally into one image, which serves as the input to the thyroid cancer recognition module. For example, the 20 images in Section 3.1 are merged into 10 images. We then resize each spliced image, that is, we obtain formula (8) from formula (7), which is convenient for transfer learning.
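As a minimal sketch of the splicing step, the following plain-Python function pairs slices (1, 2), (3, 4), and so on, and joins each pair side by side. The name `splice_adjacent` and the choice to drop an unpaired final slice are assumptions for illustration; the resizing step is not shown.

```python
def splice_adjacent(slices):
    """Join consecutive slices pairwise, left to right.

    `slices` is a list of 2-D images (lists of rows). Slices 0 and 1 form the
    first output image, slices 2 and 3 the second, and so on. If the number of
    slices is odd, the last slice is dropped (an assumption of this sketch).
    """
    spliced = []
    for i in range(0, len(slices) - 1, 2):
        left, right = slices[i], slices[i + 1]
        if len(left) != len(right):
            raise ValueError("paired slices must have the same height")
        spliced.append([row_l + row_r for row_l, row_r in zip(left, right)])
    return spliced


# 20 toy slices of size 2 x 2; slice k is filled with the value k.
slices = [[[k, k], [k, k]] for k in range(20)]
merged = splice_adjacent(slices)
```

With 20 input slices this yields 10 images, each twice as wide as the original, which halves the number of images the transformer has to process.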
Secondary Transfer Learning
First, our pre-trained model is trained on ImageNet1000. We then reuse the trained network structure and connection parameters,33 input the patient's original CT image set into the model, and train it to distinguish the thyroid from the chest cavity, as shown in formula (4). This is the first transfer learning step. Similarly, we reuse the model obtained from this second round of training and input the patient's thyroid image dataset into it to distinguish diseased thyroid from normal thyroid, that is, formula (10). This is the secondary transfer,34,35 as shown in Figure 2.

Figure 2. Structure of the secondary transfer learning of the DVT model.
In this way, the model is trained progressively. After the first transfer, the model can better extract the characteristics of the thyroid. In the secondary transfer, training efficiency improves and recognition performance is better.
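The staged reuse of parameters can be illustrated with a toy, framework-free sketch. Everything here is hypothetical: the model is a plain dictionary, `fine_tune` is a stand-in that merely nudges the weights rather than training on data, and replacing the classification head at each stage is an assumption of this sketch rather than a detail confirmed by the paper. It shows only the order in which weights are carried forward.

```python
import copy
import random


def new_head(num_classes, embed_dim, seed):
    """Freshly initialised classification head (num_classes x embed_dim)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(embed_dim)]
            for _ in range(num_classes)]


def transfer(source, num_classes, seed=0):
    """Start a new task from `source`: copy its backbone, attach a new head."""
    embed_dim = len(source["head"][0])
    return {
        "backbone": copy.deepcopy(source["backbone"]),
        "head": new_head(num_classes, embed_dim, seed),
    }


def fine_tune(model, step=0.01):
    """Placeholder for training on the new task (no real learning happens)."""
    model["backbone"] = [[w + step for w in row] for row in model["backbone"]]
    return model


embed_dim = 4
# Stage 0: stand-in for a backbone pre-trained on ImageNet (1000 classes).
pretrained = {
    "backbone": [[0.1] * embed_dim for _ in range(embed_dim)],
    "head": new_head(1000, embed_dim, seed=1),
}
# Stage 1 (first transfer): thyroid region versus chest region.
stage1 = fine_tune(transfer(pretrained, num_classes=2))
# Stage 2 (secondary transfer): diseased thyroid versus normal thyroid.
stage2 = fine_tune(transfer(stage1, num_classes=2, seed=2))
```

The point of the sketch is that stage 2 starts from the stage 1 backbone rather than from the ImageNet weights, so features already adapted to CT images are reused for the cancer classification task.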
Statistical Analyses
In this study, a variety of indicators were used to evaluate the performance of the model comprehensively, including Accuracy, Recall, F1-score, Area Under the Curve (AUC) and Specificity. The evaluation indicators are explained in detail as follows:
Accuracy is one of the most commonly used evaluation indexes for classification models. It represents the proportion of samples that the model predicts correctly out of the total number of samples. It is suitable when the class distribution is fairly balanced but can be misleading when the classes are imbalanced. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, the calculation formula is as follows:
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Recall, also known as Sensitivity, measures the ability of a model to identify positive samples. It is a key indicator in scenarios where missed positive samples must be minimized, such as disease screening. With TP and FN as defined above, the calculation formula is as follows:
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
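The threshold-based indicators listed in this section can all be computed from the four confusion-matrix counts. The sketch below uses the standard definitions; the function name `screening_metrics`, the label convention (1 = thyroid cancer, 0 = normal), and the toy labels are assumptions for illustration and are not the study's data.

```python
def screening_metrics(y_true, y_pred):
    """Binary classification metrics, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    precision = ratio(tp, tp + fp)
    sensitivity = ratio(tp, tp + fn)  # also called recall
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "f1": ratio(2 * precision * sensitivity, precision + sensitivity),
    }


# Toy example: 4 cancer cases and 4 normal cases, one error in each group.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = screening_metrics(y_true, y_pred)
```

In this toy example TP = TN = 3 and FP = FN = 1, so every metric equals 0.75.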
To verify the performance of the DVT model, we first calculate the accuracy and AUROC and plot the ROC curve to compare the models. We also calculate precision, sensitivity, specificity, and F1-score to account for possible sample imbalance. The experiment consists of three parts. In the first part, we build a model on the data of 120 patients from Zhuhai People's Hospital and compare it with AlexNet, DenseNet-121, ResNet, GoogLeNet, VGG16, and CNN-LSTM.36-39 The purpose of this part is to show that the DVT model performs better than the reference models. In the second part, to demonstrate the universality of the DVT model, we use 120 cases from a hospital in Xinjiang for verification. In the third part, we conduct an ablation experiment, that is, we compare the results under the same conditions without transfer learning.
Through multi-dimensional evaluation indicators, this study comprehensively evaluates the performance of the classification model to ensure the reliability and scientific validity of the results. Together, these evaluation indexes provide a solid foundation for the optimization and practical application of the model.
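AUROC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition. The function name `auroc` and the toy scores are illustrative assumptions, and the pairwise loop is chosen for clarity rather than speed.

```python
def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    Counts the fraction of (positive, negative) pairs in which the positive
    case has the higher score; tied pairs contribute 0.5.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: two normal cases followed by two cancer cases.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
area = auroc(y_true, scores)
```

Here three of the four positive-negative pairs are ranked correctly, so the result is 0.75.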
Results
Comparative Experiment
The experimental results show that the DVT model outperforms the six classical AI methods mentioned above on the main metrics. As seen from Table 4 and Figure 3, the DVT model has the best accuracy, AUC, and sensitivity, with values of 0.88, 0.92, and 0.91, respectively. The ROC curve of the DVT model also shows that it is better than the other models in the experimental group. The specificity and sensitivity of the DVT model reach 0.84 and 0.91, respectively. These results show that the predictive performance of the DVT model is strong, which has important implications for thyroid prediction models. The accuracy, AUROC, specificity, sensitivity, precision and F1-score were 0.88, 0.92, 0.84, 0.91 and 0.87, respectively. In terms of precision, the AlexNet, DenseNet-121, and VGG16 models reach 0.71, 0.85, and 0.76, respectively. VGG16 has the best specificity, reaching 1, which indicates that it produced no false positives on this dataset. The Convolutional Neural Network - Long Short-Term Memory (CNN-LSTM) model performs well on the F1-score, reaching 0.79, which indicates that it strikes a balance between precision and recall. In summary, the DVT model achieves satisfactory results in thyroid cancer screening and outperforms the reference models on accuracy, AUC, and sensitivity.

Figure 3. The results of chest CT dataset of Zhuhai People's Hospital: (A) F1-score and precision result of each model, (B) sensitivity and specificity result of each model, (C) accuracy result of each model, (D) AUC-ROC result of each model, (E) ROC curve of each model.
Table 4. The Running Results of the First People's Hospital of Kashi Dataset in Different AI Model Groups.
Universality Experiment
To demonstrate the universality of the DVT model, we also used mixed chest and thyroid CT images of 120 patients from a hospital in Xinjiang for verification. The results are shown in Table 5 and Figure 4. The DVT model performs best in accuracy and AUC (0.88 and 0.91, respectively), which means that it has high accuracy and good classification ability in overall prediction. In terms of precision, both the DenseNet-121 and DVT models perform well, with 0.84 and 0.81, respectively, so most cases that these two models predict as positive are truly positive. VGG16 stands out in specificity, reaching a value of 1, which means that it produced no false positives on this dataset. The CNN-LSTM model has the best sensitivity, reaching 0.94, which indicates that it has a strong ability to recognize positive cases and rarely misses them. Overall, each model has its strengths and weaknesses, and the best choice depends on the data characteristics and task objectives. On this dataset, the DVT model performs well on multiple metrics and may be an excellent choice.

Figure 4. The results of chest CT dataset of a hospital in Xinjiang: (A) F1-score and precision result of each model, (B) sensitivity and specificity result of each model, (C) accuracy result of each model, (D) AUROC result of each model, (E) ROC curve of each model, (F) comparison of data from different hospitals (DVT_xj is results of the First People's Hospital of Kashi, DVT_zhu is results of the Zhuhai People's Hospital).
Table 5. The Running Results of the First People's Hospital of Kashi Dataset in Different AI Model Groups.
Ablation Experiment
To verify the effectiveness of this method, we conducted an ablation experiment40,41 in which the model was trained under the same conditions but without transfer learning. Table 6 and Figure 5 show that, without transfer learning, the overall results are significantly worse, and in particular the prediction of malignant patients is significantly worse. In addition, the model more often biases its predictions toward 1, which indicates a possible overfitting problem.

Figure 5. The results of chest CT dataset of Zhuhai People's Hospital: (A) comparison of model performance before and after ablation study (DVT_xr_zhu is results of DVT model ablation experiment, DVT_zhu is results of DVT model experiment), (B) ROC curve of DVT model ablation experiment.
Table 6. The Results of Data Set of Zhuhai People's Hospital Before and After Ablation Experiment (DVT_zhu: Results Before the Ablation Experiment, DVT_xr_zhu: Results After the Ablation Experiment).
Discussion
First, we collected data from Chinese tertiary hospitals in different regions and built a multi-center dataset to address the data challenges. Second, the data have a small usable thyroid area, a small sample size, and high noise. To address this, we propose a secondary transfer learning method, which uses homologous tasks for repeated transfer learning so that the model can quickly adapt to the small-sample, high-noise thyroid cancer prediction task. The results suggest that chest CT containing the thyroid gland can be used for early screening of thyroid cancer. Finally, continuous thyroid CT scan data form a time series. We use the transformer framework for feature fusion and splice adjacent images together for feature calculation, which reduces the computational load of the transformer and better captures the features of adjacent images. The experimental results on the Guangdong and Xinjiang datasets both show that the DVT model performs better than the benchmark models. The ablation experiment shows that transfer learning effectively improves the learning of the model. We have reason to believe that the DVT model can help clinicians identify potential patients in the early screening of thyroid cancer, intervene in advance, and reduce patient mortality.
Although the experiment has yielded satisfactory results, there are still four main limitations in this study. First, the sample size is relatively small, and the data from 240 patients may not be sufficient to train a model with adequate robustness. Second, the training speed remains relatively slow. Compared to traditional convolutional neural networks (CNNs), transformer models (Transformers) significantly increase computational load, which limits their efficiency in practical applications. Third, the experiment was conducted only on enhanced chest CT images with relatively high image quality and has not been validated on standard chest CT images. Finally, this study did not perform more detailed classification of thyroid cancer, such as the analysis of potential metastatic lymph nodes. Future work will involve collecting additional data and exploring the combination of traditional convolutional neural networks with efficient transformer algorithms to optimize both the model's predictive accuracy and training speed. Furthermore, through image enhancement techniques, we plan to extend the application of the model (DVT) to standard chest CT images, aiming to achieve similar high accuracy on these images. At the same time, the research will continue to focus on data collection and model optimization to enable the model to accurately classify additional categories, such as potential metastatic lymph nodes.
Conclusions
This study evaluated the performance of the DVT model and compared it with six classical artificial intelligence models on thyroid cancer prediction tasks. Based on the results for accuracy, AUC, sensitivity, specificity, precision, and F1-score, the DVT model outperformed the other models on key metrics, including accuracy (0.88), AUC (0.92), sensitivity (0.91), and precision (0.88). The ROC curve further supported the DVT model's predictive capability, indicating its greater robustness and reliability compared to other models. We believe that the DVT model can effectively assist clinicians in identifying potential thyroid cancer patients during early screening, enabling timely interventions and reducing patient mortality rates.
Footnotes
Acknowledgments
This work is funded by Macao Polytechnic University under grant no. RP/FCA-04/2022 and submission control code fca.3ca7.92b0.8, by the Guangdong Provincial Department of Education youth innovative talent project (No. 2023KQNCX155), and by the Postdoctoral training project of Zunyi Medical University (No. 2023F-ZH-019).
Author Contributions
Conceptualization, N.H. and R.M.; Methodology, N.H., R.M., and D.W.C.; Software, M.R. and N.H.; Validation, L.C. and Y.S.Y.; Formal Analysis, Y.P.W.; Survey, D.W.C.; Resources, G.R.F., D.W.C., and B.Y.; Data Management, L.C.; Writing - Original manuscript preparation, N.H. and R.M.; Writing - Reviewing and editing, N.H. and R.M.; Visualization, R.M. and D.W.C.; Supervision, Y.P.W. and B.Y.; Project Management, R.M. All authors have read and agreed to the published version of the manuscript.
Declaration of Competing Interest
All authors declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical Compliance
Research experiments in this article involving animals or humans were approved by the Ethical Committee of Zhuhai People's Hospital (approval number 2024.109) and the Ethical Committee of the First People's Hospital of Kashgar Region (approval number 2024.93).
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the Industry-University-Research cooperation and basic and applied basic research project cooperation, Zhuhai (grant number 2220004002437); the Guangdong Provincial Department of Education youth innovative talent project (No. 2023KQNCX155); and the Postdoctoral training project of Zunyi Medical University (No. 2023F-ZH-019). This work is also supported by the Science and Technology Development Fund of Macao (0021/2022/AGJ).
