Sage Journals: Discover world-class research

Abstract

Background

With the emergence of artificial intelligence in medical imaging, large language models such as chat generative pre-trained transformer (ChatGPT)-4o have drawn much attention for their potential in diagnostic support. However, their performance in nuclear medicine applications still remains underexplored. In this study, we aimed to evaluate the Taiwan Food and Drug Administration (TFDA)-approved bone scintigraphy (BS platform) and ChatGPT-4o capability to interpret BS images for the detection and localization of bone metastases.

Methods

A total of 52 BS images were analyzed with three interpretation methods: board-certified physicians, ChatGPT-4o multimodal image analysis, and the BS platform. The performance of the interpretations was evaluated with both binary classification and lesion localization of nine predefined anatomical regions. These results were compared to the report of board-certified nuclear medicine physicians, which served as the gold standard in this study.

Results

In binary classification, ChatGPT-4o achieved an accuracy of 84.6%, similar to the performance of the BS platform's accuracy of 82.7%. However, ChatGPT-4o showed lower performance in lesion localization. Its regional precision was 32.5%, and sensitivity was 13.3%, compared to the BS platform's precision of 80.3% and sensitivity of 64.9%.

Conclusion

ChatGPT-4o showed preliminary potential for detecting bone metastases and assisting in structured report drafting, but its limited lesion-localization performance restricts clinical applicability. The BS platform, developed specifically for bone scintigraphy, demonstrated more consistent regional accuracy in this dataset. These results represent an early proof-of-concept comparison, suggesting feasibility for reporting support rather than clinical deployment. Larger, multi-center studies and domain-specific training will be needed to clarify large language models’ future role in nuclear medicine.

Keywords

Artificial intelligence large language model ChatGPT bone metastases nuclear medicine

Introduction

With the rapid advancement of artificial intelligence (AI) technologies, their potential applications in the medical field have become increasingly prominent, particularly in the area of diagnostic imaging. In the domain of nuclear medicine, modalities such as positron emission tomography/computed tomography, bone scintigraphy (BS), and dopamine transporter single-photon emission computed tomography (TRODAT SPECT) have gradually incorporated AI models to improve diagnostic accuracy and interpretation efficiency.

To validate the potential of AI in medicine, numerous studies have focused on applying deep learning techniques to medical image analysis. For instance, a study by Zhao et al.,¹ published in Scientific Reports in 2020, developed a deep neural network-based AI model for diagnosing cancer-related bone metastases using 12,222 technetium-99m methylene diphosphonate (Tc-99m MDP) BS images. The model demonstrated excellent diagnostic performance, achieving area under the curve (AUC) values of 0.988, 0.955, and 0.957 for breast cancer, prostate cancer, and lung cancer, respectively. Furthermore, when compared with three experienced nuclear medicine physicians, the AI model outperformed them in both diagnostic accuracy and sensitivity, while also significantly reducing interpretation time. These findings underscore the potential of such models to enhance both the efficiency and accuracy of BS interpretation.

Internationally, AI technologies have been widely applied in medical image interpretation, yielding remarkable results. Taiwan has also actively advanced in this field, developing numerous innovative applications. One such example is the “Computer-assisted detection platform for bone scintigraphy,” which has been progressively implemented in clinical practice. Certified by the Taiwan Food and Drug Administration (TFDA), the BS platform assists in the interpretation of BS images, particularly for the detection and localization of bone metastases, and exemplifies Taiwan's innovation in the field of smart healthcare.

In recent years, with the rapid advancement of GPU computing power, large language models (LLMs) such as OpenAI's chat generative pre-trained transformer (ChatGPT) have achieved revolutionary progress in the field of natural language processing and have demonstrated substantial potential in the medical domain. Notably, they have garnered significant attention for applications in medical record analysis, clinical decision support, and medical imaging assistance. In the realm of medical imaging, LLMs have been applied to support diagnostic interpretation and automated report generation. For instance, RaDialog² is a large vision-language model that employs a vision transformer architecture to extract imaging features from chest X-rays and integrates structured pathological information such as disease annotations and anatomical features. Through parameter-efficient fine-tuning, the model is capable of generating clinically accurate reports and has demonstrated strong performance in interactive tasks. The application of LLMs in healthcare continues to expand, showing great promise from clinical diagnostic support to medical education. With continued technological development and deeper integration into healthcare systems, LLMs are expected to play an increasingly vital role in improving the quality, efficiency, and outcomes of medical services and education.

The transformer architecture was first introduced by Vaswani et al.³ in their 2017 paper “Attention Is All You Need,” marking a significant breakthrough in the field of natural language processing. As shown in Figure 1, this architecture consists of an encoder and a decoder, each equipped with self-attention mechanisms and feed-forward networks. Building on this foundation, several variants have emerged, the most representative being bidirectional encoder representations from transformers⁴ and GPT (generative pre-trained transformer).⁵ Currently, many researchers are dedicated to exploring the application of transformers in medical image analysis. For example, a study by Selivanov et al.,⁶ published in September 2022 on arXiv, utilized the Show-Attend-Tell and GPT-3 models to generate image descriptions for chest X-ray images. The results indicated that these models could efficiently and accurately generate detailed nuclear medicine reports. This study confirmed the potential of such models in improving the efficiency and accuracy of chest X-ray interpretation. Another study, published by Tomar et al.⁷ in June 2022 on arXiv, introduced the TransResU-Net model. This model combines transformer, ResNet-50, and dilated convolution techniques with the aim of improving the accuracy and efficiency of polyp segmentation in colonoscopy images. The test results demonstrated that the model could correctly detect over 88% of polyps, indicating that TransResU-Net holds substantial potential for application in clinical real-time polyp detection systems. In summary, these two studies have demonstrated the significant potential of the transformer architecture in medical image interpretation. The transformer architecture has demonstrated remarkable potential across a wide range of tasks. The emergence of ChatGPT, which is a representative model based on the transformer architecture, has further highlighted this potential. Next, we first provide a brief overview of ChatGPT and subsequently investigate whether it can be effectively applied to the interpretation of medical images.

Figure 1.

Transformer architecture. The model consists of an encoder and a decoder, each composed of multiple stacked layers. Each encoder layer includes a multi-head self-attention mechanism and a position-wise feed-forward neural network, enabling global context modeling and non-linear transformation at each input position. The decoder layers share the same structure and additionally incorporate an encoder–decoder attention mechanism to align the generated output with the encoder's representations. Positional encoding is added to the input embedding to preserve sequence order information.

Therefore, this study aims to evaluate the feasibility and performance of the LLM ChatGPT-4o in interpreting BS images in nuclear medicine, and to compare its performance with that of the TFDA-approved “Computer-assisted detection platform for bone scintigraphy.”

We focus on the detection and localization of bone metastases as the primary evaluation criteria. Using BS image datasets, we designed clinical simulation scenarios in which experienced nuclear medicine physicians performed manual interpretations. These results were then compared with the interpretations generated by ChatGPT-4o and the BS platform.

The main objectives of this study are to address the following questions:

Does ChatGPT-4o demonstrate preliminary capability in interpreting BS images?

How does its performance compare to that of the BS platform already in clinical use?

Can ChatGPT-4o assist medical students or resident physicians in the preliminary interpretation of BS images, thereby streamlining clinical workflows and enhancing report consistency?

Through this investigation, we aim to explore the potential value and application prospects of LLMs in the emerging field of nuclear medicine image-assisted diagnosis.

Method

Patient selection and characteristics

BS examinations were retrospectively identified from our institutional imaging archive. A non-consecutive, purposive sampling strategy was used to assemble a clinically diverse set of cases that reflected common diagnostic scenarios and interpretive challenges. Examinations were eligible if they included a complete whole-body planar BS, an available finalized clinical report, and sufficient image quality for interpretation. Studies lacking a finalized report, with incomplete image data, or with severe artifacts were excluded. The overall case flow is summarized in Figure 2.

Figure 2.

Flow of BS examinations through the retrospective diagnostic accuracy study comparing ChatGPT-4o and the BS platform against the clinical reference standard. All 52 non-consecutively selected examinations were included in the final analysis.

A total of 52 clinically representative BS cases were purposively selected from our institutional imaging archive using a non-consecutive sampling approach. Case inclusion was determined through multidisciplinary consensus by nuclear medicine physicians and oncologists, with the explicit goal of assembling a diverse dataset encompassing a broad range of lesion patterns. This collection aimed to reflect common diagnostic scenarios, potential interpretive pitfalls, and challenges routinely encountered in clinical practice.

The most prevalent primary malignancies were breast cancer (n = 26, 50.0%), prostate cancer (n = 14, 26.9%), and lung cancer (n = 6, 11.5%). Less frequently represented cancers included colorectal carcinoma (n = 2, 3.8%), renal cell carcinoma (n = 1, 1.9%), and hepatocellular carcinoma (n = 1, 1.9%), and no cancer record (n = 2, 3.8%) were included to simulate real-world scenarios that may result in false-positive interpretations. A summary of disease types and case distribution is provided in Table 1.

Table 1.

Demographic and clinical characteristics of patients.

Characteristic	Patients (n)	Percentage (%)
Sex
Male	23	44.2
Female	29	55.8
Age (years), mean	60.75
Type of primary malignancy
Breast cancer	26	50
Prostate cancer	14	26.9
Lung cancer	6	11.5
Colorectal cancer	2	3.8
Hepatocellular carcinoma	1	1.9
Renal cell carcinoma	1	1.9
No cancer record	2	3.8

This study was designed as a retrospective, single-center diagnostic accuracy study conducted at a tertiary medical center. ChatGPT-4o and the BS platform were evaluated against a clinical reference standard. The reference standard was defined as the finalized clinical BS reports issued at the time of imaging by board-certified nuclear medicine physicians, using institutional reporting criteria. In routine practice, these physicians had access to relevant clinical information and prior imaging, and the same information was available when the reference standard reports used in this study were created.

Chat GPT (ChatGPT)

ChatGPT⁸ is a LLM developed by OpenAI, built upon the GPT framework. GPT is based on the transformer architecture and is optimized through pre-training and fine-tuning. Subsequently, OpenAI developed the GPT model series, including GPT-3 and GPT-4. These models have been extensively integrated into ChatGPT, and their effectiveness has been demonstrated in real-world applications.

To provide a more concrete illustration of GPT-4's applications in medical image analysis, several relevant studies are highlighted below. For instance, a study by Aydin and Karaarslan,⁹ published on arXiv in January 2025, explored the application of GPT-4 in medical image interpretation. The findings indicated that while ChatGPT is not yet capable of fully accurate analysis of chest X-rays, it can assist physicians in interpretation, demonstrating the potential of such models in augmenting diagnostic workflows. Additionally, a study by Wang et al.,¹⁰ published on arXiv in February 2024, evaluated the role of GPT-4 in thyroid ultrasound diagnosis and treatment recommendations, incorporating the Chain of Thought method to enhance interpretability. The results suggested that GPT-4 exhibits potential in providing diagnostic and therapeutic recommendations, highlighting its utility in assisting thyroid disease diagnosis and treatment decision-making while underscoring the importance of interpretability.

In summary, GPT-4 has demonstrated potential in the field of medical image analysis. However, its application in BS interpretation remains unclear. BS plays a crucial role in the diagnosis and monitoring of skeletal diseases. Theoretically, GPT-4's multimodal capabilities could enable the integration of imaging data with clinical information, thereby assisting physicians in making more accurate assessments. Consequently, applying GPT-4 to BS analysis presents a promising direction for future research, with the potential to improve diagnostic accuracy in skeletal disease assessment.

Computer-assisted detection platform for BS

In Taiwan, nuclear medicine plays a crucial role in disease diagnosis and treatment using radiopharmaceuticals. Specialists utilize advanced instruments and techniques to administer trace radioactive substances, enabling functional assessment, lesion detection, and therapeutic planning across various medical fields. Whole-body BS, commonly performed using SPECT, are widely used to assess skeletal metabolic activity with agents such as Tc-99m MDP. Early detection of bone metastases, a common complication of cancer, is vital for patient prognosis. Traditionally, interpretation of these scans relies on physicians visually identifying abnormal tracer uptake patterns to assess potential metastases.

However, manual interpretation of BS can be influenced by physician experience and fatigue, potentially limiting the detection of subtle or multifocal lesions. To enhance diagnostic efficiency and accuracy, this study employs a computer-aided detection platform for BS,¹¹ which has been approved by the TFDA under license number 007971.¹² The system utilizes AI algorithms to analyze whole-body Tc-99m MDP BS images, assisting nuclear medicine physicians or trained personnel in evaluating tracer uptake and identifying suspected bone metastases. The platform offers two key features: the confidence of suspected bone metastasis, indicating the probability of metastatic involvement, and the regions of suspected bone metastasis, highlighting potentially affected areas. These features contribute to more accurate and efficient clinical decision-making, underscoring the system's value in nuclear medicine practice.

According to the software documentation, the platform achieves both a probability and an accuracy of >0.8 in predicting the confidence and identifying the regions of suspected bone metastasis, respectively. To further validate its diagnostic performance, this study conducts an empirical analysis. Additionally, to evaluate the potential of LLMs in BS image analysis, ChatGPT-4o is employed, and its performance in detecting suspected bone metastasis is compared with that of the BS platform.

The comparator system used in this study is the “EFAI” computer-assisted detection platform for BS, which holds a regulatory license from the TDFA (TFDA No. 007971). The system is designed to analyze whole-body Tc-99m MDP BS in DICOM 3.0 format, requiring paired anterior and posterior planar views with a resolution of 1024 × 256.

From a technical perspective, the platform employs a closed deep learning architecture that integrates two independent algorithms to compute inference results. It is indicated for use in adult cancer patients (aged ≥20 years) for the detection of suspected bone metastases. The system explicitly excludes cases of primary bone cancer, lymphoma, and hematopoietic malignancies, as well as images classified as “superscans” or those containing significant artifacts. The model outputs a binary classification (presence or absence of suspected metastasis) and a confidence score, and localizes lesions across nine anatomical regions (skull, spine, ribs/clavicles/scapulae, humeri, pelvis, and femora). According to the manufacturer's regulatory labeling, the system was validated on a dataset of 668 BS images, achieving a patient-level AUC of 0.978, an accuracy of 0.921, and a specificity of 0.893. For regional localization, it demonstrated an AUC of 0.912 and a sensitivity of 0.858.

Analytical framework

This study was conducted in two distinct phases. In Phase I, a binary classification analysis was performed to determine the presence or absence of bone metastases, establishing nuclear medicine physicians’ clinical interpretations as the reference standard. Model outputs from ChatGPT and the BS platform were classified as positive for bone metastasis when the estimated probability exceeded 50%. While this standard cutoff facilitates direct comparison, we acknowledge that the results are subject to potential threshold effects, where shifts in the decision boundary could alter the balance between sensitivity and specificity. Accordingly, diagnostic performance was systematically assessed by calculating accuracy, sensitivity, and specificity.

Phase II focused on lesion localization across predefined skeletal regions. The skeleton was segmented into nine anatomical zones: skull, cervical spine, thoracic spine/sternum, lumbar spine, sacrum, ribs/clavicles/scapulae, pelvis, femora, and humeri. Region-level annotation of metastases was manually performed by board-certified nuclear medicine physicians based on planar scintigraphy. The BS platform generated automated regional assessments, while ChatGPT's free-text outputs were retrospectively mapped to corresponding anatomical regions by experienced clinical evaluators. Region-wise classification performance was analyzed in terms of accuracy, PPV, and sensitivity to evaluate inter-method consistency.

Standardized prompting and image processing for LLM-based reporting

We retrospectively analyzed 52 whole-body planar BS examinations (Tc-99m MDP) collected from a single nuclear medicine department. All original DICOM images were converted to 8-bit grayscale JPEG files using a standardized preprocessing pipeline specifically applying available window settings and normalizing pixel values to a 0–255 range to ensure consistency in data format and dynamic range. Inclusion was strictly limited to readable cases with available physician reports as reference standards. To guide the LLM in generating structured nuclear medicine reports, we designed a set of standardized input prompts. Crucially, these prompts consisted solely of the de-identified images without any ancillary clinical data. Each prompt emphasized a two-part report structure (findings and impression) using standard nuclear medicine terminology, and a “Meta Ratio” (0%–100%) was utilized to guide the assertiveness of the wording according to a predefined tone policy. ChatGPT-4o was accessed via its official graphical user interface to simulate real-world end-user interactions; no APIs or automated scripts were used (the Supplemental Material).

The following prompt was used for image interpretation tasks:

Prompt for image analysis.

Interpret this bone scan.

Identify the affected skeletal regions.

Generate a structured report including only the findings and impression sections.

Use standard nuclear medicine terminology and apply the output tone rules defined in the project instructions.

To assist the language model in adjusting tone under uncertain or ambiguous imaging conditions, we introduced a probability-based indicator called the Meta Ratio (range: 0%–100%) (Table 2). This value reflects the estimated likelihood that a lesion represents bone metastasis and serves as a guide for selecting an appropriate level of certainty in the generated report. The framework provides the model with a structured cue to modulate diagnostic certainty in response to clinical ambiguity.

Table 2.

The recommended tone for each Meta Ratio.

Meta ratio (%)	Suggested reporting tone
0%–25%	Conservative or neutral
26%–50%	Standard imaging-based interpretation
51%–75%	Moderately assertive (if supported by image data)
76%–100%	Assertive or moderately assertive

TP: true positive; FP: false positive; TN: true negative; FN: false negative; ChatGPT: chat generative pre-trained transformer; BS: bone scintigraphy. Accuracy, specificity, and sensitivity are reported for both the ChatGPT-4o-based model and the conventional BS platform.

To ensure analytical independence and avoid carry-over from ChatGPT-4o conversational memory, each of the 52 cases was analyzed in a separate conversation session. After completing an interpretation, the session was closed, and a new conversation was initiated for the next case. This procedure ensured that no contextual information from prior cases could influence subsequent outputs; each interpretation was based solely on the uploaded images and the standardized prompt (the Supplemental Material).

The workflow for image interpretation and output generation

The image interpretation and evaluation workflow was organized into two phases: result processing and outputs (Figure 3). BS images from 52 patients were analyzed via three distinct pathways: nuclear medicine physicians, the ChatGPT-4o model, and the BS platform. During the result processing phase, all three methods independently reviewed and interpreted the original images.

Figure 3.

Overview of the study workflow for image interpretation and output generation. BS images from 52 patients were analyzed via three independent pathways: physician interpretation, ChatGPT-4o model analysis, and automated BS platform assessment. In the result processing stage, each method directly interpreted the medical images. The ChatGPT-4o model additionally generated meta ratios and textual reports, which were reviewed by clinicians. In the outputs stage, all methods contributed to binary classification and regional lesion localization analyses.

The ChatGPT-4o model produced metastasis-probability estimates and free-text reports, which were subsequently validated by experienced nuclear medicine physicians to ensure clinical accuracy. In the outputs phase, each pathway contributed to two primary analytic tasks: (1) binary classification for the detection of bone metastasis, and (2) lesion localization across nine predefined anatomical regions. These results were used to evaluate diagnostic performance and inter-method consistency between the three approaches.

Blinding

The clinical reference standard reports were generated in routine practice before this study was conceived, and the reporting nuclear medicine physicians had no access to the outputs of ChatGPT-4o or the BS platform. For the present analysis, both index tests were applied retrospectively to exported BS images and did not have access to each other's outputs or to the original clinical reports.

For the BS platform, region-level labels were taken directly from the system's automated region-wise outputs without reference to ChatGPT-4o. For ChatGPT-4o, because the model does not natively output predefined nine-region labels, region-level localization was inferred by nuclear medicine physicians based on its narrative reports. Thus, clinicians assigning region-wise labels for ChatGPT-4o were not fully blinded to that index test's outputs, which may introduce some degree of interpretation bias.

Statistical analysis

Diagnostic performance of ChatGPT-4o and the BS platform was summarized using accuracy, sensitivity, specificity, and positive predictive value (PPV), together with 95% confidence intervals (95% CIs). Binary per-examination outcomes for the presence or absence of bone metastasis were defined with the clinical reference standard as the comparator. Because both index tests were evaluated on the same set of BS examinations, paired comparisons of binary outcomes were performed using McNemar's test. At the regional level, the performance for the nine predefined skeletal regions was also summarized with sensitivity, PPV, and accuracy, and reported with 95% CIs. Given the strong imbalance between positive and negative regions, sensitivity and PPV were considered the primary regional performance metrics, and accuracy was interpreted with caution. A two-sided p value <0.05 was considered statistically significant.

Ethics statement

This study was approved by the Institutional Review Board of China Medical University Hospital (DMR99-IRB-293-(CR-13)). All data were fully de-identified prior to analysis to ensure compliance with data privacy regulations and ethical research standards.

Result

The table summarizes the diagnostic performance of ChatGPT-4o and the BS platform for detecting and localizing bone metastases. The evaluation is presented in two parts: overall binary classification performance and regional classification performance for bone metastasis.

Binary classification for bone metastasis detection

Binary classification results for the detection of bone metastasis are presented in Table 3. The ChatGPT-4o-based model identified 21 true positives (TP) and 23 true negatives (TN), with four false positives (FP) and four false negatives (FN), resulting in an overall accuracy of 0.846, a sensitivity of 0.840, and a specificity of 0.852. The BS platform detected 22 true positives and 21 true negatives, with six false positives and three false negatives. The resulting accuracy was 0.827, with a sensitivity of 0.880 and a specificity of 0.778.

Table 3.

Overall binary classification performance.

Bone metastasis	TP	FP	TN	FN	Accuracy (95% CI)	Specificity (95% CI)	Sensitivity (95% CI)
ChatGPT-4o	21	4	23	4	0.846 (0.725–0.920)	0.852 (0.675–0.941)	0.840 (0.653–0.936)
BS platform	22	6	21	3	0.827 (0.703–0.906)	0.778 (0.592–0.894)	0.880 (0.700–0.958)

TP: true positive; FP: false positive; TN: true negative; FN: false negative; ChatGPT: chat generative pre-trained transformer; BS: bone scintigraphy; 95% CI: 95% confidence interval. Accuracy, specificity, and sensitivity are reported for both the ChatGPT-4o-based model and the conventional BS platform.

ChatGPT-4o demonstrated an accuracy of 0.846 with a 95% CI ranging from 0.725 to 0.920. Its sensitivity was 0.840 with a CI of 0.653–0.936, while specificity stood at 0.852 with a range of 0.675–0.941. In comparison, the BS platform achieved an accuracy of 0.827, where the 95% CI spanned from 0.703 to 0.906. The platform showed a sensitivity of 0.880 with a range of 0.700–0.958 and a specificity of 0.778 with a CI extending from 0.592 to 0.894.

Nine anatomical regions for bone metastasis detection

Regional classification results for bone metastasis detection are summarized in Table 4. The ChatGPT-4o-based system yielded no true positives in five anatomical regions: skull, cervical vertebrae, humeri, lumbar vertebrae, and femora, with corresponding sensitivities and PPV of 0.000. Among the regions with detections, the system achieved its highest sensitivity in the sacrum (0.500; 95% CI: 0.268–0.732) and a PPV of 0.467. For the ribs/clavicles/scapulae, the system recorded a sensitivity of 0.154 (95% CI: 0.043–0.422) but notably achieved a PPV of 1.000. In other regions, such as the thoracic vertebrae/sternum and pelvis, sensitivities were strictly limited to 0.273, with PPVs of 0.857 and 0.600, respectively. The average regional accuracy, PPV, and sensitivity across all anatomical sites were 0.769, 0.325, and 0.133, respectively.

Table 4.

Regional classification performance for bone metastasis.

Location	TP	FP	TN	FN	Accuracy (95% CI)	PPV (95% CI)	Sensitivity (95% CI)
ChatGPT-4o
Skull	0	0	41	11	0.788 (0.660–0.878)	0	0.000 (0.000–0.259)
Cervical vertebrae	0	1	38	13	0.731 (0.597–0.832)	0.000 (0.000–0.793)	0.000 (0.000–0.228)
Thoracic vertebrae/sternum	6	1	29	16	0.673 (0.538–0.785)	0.857 (0.487–0.974)	0.273 (0.132–0.482)
Ribs/clavicles/scapulae	2	0	39	11	0.788 (0.660–0.878)	1.000 (0.342–1.000)	0.154 (0.043–0.422)
Humeri	0	0	45	7	0.865 (0.747–0.933)	0	0.000 (0.000–0.354)
Lumbar vertebrae	0	0	39	13	0.750 (0.618–0.848)	0	0.000 (0.000–0.228)
Sacrum	7	8	30	7	0.712 (0.577–0.817)	0.467 (0.248–0.699)	0.500 (0.268–0.732)
Pelvis	3	2	39	8	0.808 (0.681–0.892)	0.600 (0.231–0.882)	0.273 (0.097–0.566)
Femora	0	0	42	10	0.808 (0.681–0.892)	0	0.000 (0.000–0.278)
Average					0.769	0.325	0.133
BS platform
Skull	3	0	41	8	0.846 (0.725–0.920)	1.000 (0.439–1.000)	0.273 (0.097–0.566)
Cervical vertebrae	3	1	38	10	0.788 (0.660–0.878)	0.750 (0.301–0.954)	0.231 (0.082–0.503)
Thoracic vertebrae/sternum	20	2	28	2	0.923 (0.818–0.970)	0.909 (0.722–0.975)	0.909 (0.722–0.975)
Ribs/clavicles/scapulae	12	5	34	1	0.885 (0.770–0.946)	0.706 (0.469–0.867)	0.923 (0.667–0.986)
Humeri	6	3	42	1	0.923 (0.818–0.970)	0.667 (0.354–0.879)	0.857 (0.487–0.974)
Lumbar vertebrae	13	5	34	0	0.904 (0.794–0.958)	0.722 (0.491–0.875)	1.000 (0.772–1.000)
Sacrum	6	2	36	8	0.808 (0.681–0.892)	0.750 (0.409–0.929)	0.429 (0.214–0.674)
Pelvis	8	1	40	3	0.923 (0.818–0.970)	0.889 (0.565–0.980)	0.727 (0.434–0.903)
Femora	5	1	41	5	0.885 (0.770–0.946)	0.833 (0.436–0.970)	0.500 (0.237–0.763)
Average					0.876	0.803	0.650

TP: true positive; FP: false positive; FN: false negative; TN: true negative; ChatGPT: chat generative pre-trained transformer; BS: bone scintigraphy; 95% CI: 95% confidence interval. Values are reported for each region for both the ChatGPT-4o-based model and the BS platform.

In contrast, the bone platform demonstrated true positive detection across all regions. The system achieved its highest sensitivity in the lumbar vertebrae (1.000; 95% CI: 0.772–1.000), followed by the ribs/clavicles/scapulae (0.923; 95% CI: 0.667–0.986) and thoracic vertebrae/sternum (0.909; 95% CI: 0.722–0.975). Corresponding PPVs for these high-performing regions were 0.722, 0.706, and 0.909, respectively. The humeri and pelvis also showed respectable sensitivities of 0.857 and 0.727. Conversely, other regions exhibited moderate to low sensitivities, ranging from 0.231 in the cervical vertebrae to 0.500 in the femora. The average accuracy, PPV, and sensitivity across all evaluated regions for the platform were 0.876, 0.803, and 0.650, respectively.

Paired comparison using McNemar's test

The statistical comparisons of diagnostic performance between the two systems are summarized in Table 5. For overall per-examination metastasis detection, ChatGPT-4o and the BS platform showed similar diagnostic performance.

Table 5.

McNemar test performance of bone scan-positive metastatic regions.

Location	a	b	c	d	(b + c)	χ² statistic	p-value
Skull	41	0	3	8	3	0.0000	0.2500
Cervical vertebrae	37	1	4	10	5	1.0000	0.3750
Thoracic vertebrae/sternum	33	2	15	2	17	2.0000	0.0023
Ribs/clavicles/scapulae	36	5	10	1	15	5.0000	0.3018
Humeri	42	3	6	1	9	3.0000	0.5078
Lumbar vertebrae	34	5	13	0	18	5.0000	0.0963
Sacrum	32	5	10	5	15	5.0000	0.3018
Pelvis	40	2	8	2	10	2.0000	0.1094
Femora	41	1	5	5	6	1.0000	0.2188
Positive metastatic regions	13	5	63	33	68	47.7794	0.0000

a: both correct (ChatGPT-4o and BS platform agree with the reference standard); b: ChatGPT-4o correct, BS platform incorrect; c: ChatGPT-4o incorrect, BS platform correct; d: both incorrect (neither agrees with the reference standard). McNemar's test p values are calculated from the discordant pairs (b and c). ChatGPT: chat generative pre-trained transformer; BS: bone scintigraphy.

However, when the analysis was restricted to positive metastatic regions to evaluate lesion-level localization, paired comparison revealed a marked difference in performance. The BS platform was substantially more likely than ChatGPT-4o to correctly identify metastatic regions, and McNemar's test showed highly significant imbalances in discordant pairs, with p values approaching zero in lesion-level analyses (exact McNemar p = .0000000000048). These findings support that the main performance gap between the two systems lies in regional lesion localization rather than in case-level metastasis detection.

Discussion

In this study, we evaluated the performance of image interpretation for both ChatGPT-4o multimodal image analysis tool and the BS platform, together with board-certified nuclear medicine physicians’ reports as the reference standard.

In binary classification tasks, ChatGPT-4o demonstrated good performance with an accuracy of 84.6%, which is closely paralleling to the BS platform's 82.7%. While the BS platform showed slightly higher sensitivity of 88.0% from ChatGPT-4o 84.0%, ChatGPT-4o achieved fewer false positives, resulting in higher specificity (85.1% vs. 77.8%). With both ChatGPT-4o and BS platform have sensitivity go up to over 80%, they demonstrate feasible ability of detecting possible bone metastasis in the clinical condition.

However, a significant difference was observed when it comes to regional lesion localization; ChatGPT-4o demonstrates remarkably lower sensitivity of 13.3%. Even though the average regional accuracy is 77.0%, but the regional accuracy is based on the fact that most of the findings were negative. When it comes to positive results, the PPV goes down to 32.5%. While the BS platform achieved high accuracy (87.6%) and acceptable sensitivity (64.9%) across anatomical regions, in which domain ChatGPT-4o is noted with markedly lower sensitivity (13.3%). We can say that the ChatGPT-4o model struggled to localize lesions in multiple regions such as the skull, humeri, and lumbar spine. These findings highlight a critical limitation in ChatGPT-4o current approach to interpreting localized image features and anatomical mapping. Therefore, while ChatGPT-4o shows potential in generating preliminary impressions and structured report drafts, it is not yet suitable for detailed diagnostic tasks involving region-specific interpretation without further training and anatomical adaptation.

The suboptimal lesion localization performance of ChatGPT-4o observed in this study can be attributed to two primary factors. First, a fundamental difference in training paradigms exists: unlike the specialist platform, which was supervised on a curated dataset of annotated Tc-99m MDP scans, ChatGPT-4o relies on general vision-language pretraining. Consequently, it likely lacks the domain-specific priors necessary to accurately map pathological uptake to complex skeletal anatomy. Second, the input format imposes technical limitations. The necessity of converting high-dynamic-range DICOM images into standard 8-bit JPEG format for LLM processing introduces inherent information loss. This reduction in dynamic range and resolution likely obscures subtle textural details and low-contrast lesions, thereby hindering the model's precision in localization tasks compared to systems processing native DICOM data.

These differences underscore the distinct strengths of the two systems. ChatGPT-4o operates as a generalized LLM whose performance depends heavily on structured prompting and text-based reasoning. Means it is not designed for being clinically used image interpretation tool from the beginning. This disadvantage is even more prominent when it comes to regional lesion localization, with its low PPV (0.325) and low sensitivity (0.133), it is still not suitable for clinical use as a standalone tool.

Nevertheless, ChatGPT-4o competitive performance in binary classification maybe implies that with further development including the integration of vision transformers or with more training specifically on BS datasets, maybe it still has the potential to become a valuable adjunct to existing diagnostic platforms. And ChatGPT-4o structured reporting output remains a valuable feature. It can assist students or trainees in drafting organized nuclear medicine reports.

However, relying on ChatGPT-4o for primary diagnostic tasks involving detailed anatomical correlation remains premature. In a study by Brin et al.,¹³ GPT-4's multimodal capabilities were assessed in radiological image interpretation. The results indicated that GPT-4 cannot yet be trusted as a standalone diagnostic tool, primarily due to a high diagnostic hallucination rate exceeding 40%. Similarly, a study by Öztürk et al.,¹⁴ which evaluated ChatGPT-4o ability to interpret trauma X-rays, found that its diagnostic performance was significantly inferior to that of experienced emergency medicine and orthopedic specialists. Although neither study was conducted in the field of nuclear medicine, both highlight consistent limitations of ChatGPT-4o in clinical imaging tasks.

On the other hand, the BS platform is trained to be a dedicated medical imaging tool on BS data with structured anatomical mapping capabilities. Most notably, the BS platform outperformed ChatGPT-4o in lesion localization. It yielded an average regional sensitivity of 0.649 and a PPV of 0.803, indicating a high capacity to correctly identify and localize metastatic lesions in predefined nine anatomical regions. This consistent regional performance is likely attributable to the platform's specific training on BS data and its integration of anatomical templates, which is a clear advantage over ChatGPT-4o's more generalized architecture.

All in all, the BS platform was specifically developed, validated, and TFDA-approved for clinical use in Taiwan. Unlike general-purpose LLMs, it benefits from domain-specific training and structured anatomical mapping, features that extend its utility beyond mere diagnosis. Its structured, region-specific analysis serves as an effective “AI assistance” for residents, helping them to systematically evaluate skeletal regions and recognize subtle metastatic patterns through a visual feedback loop. Furthermore, in low-resource settings where experienced nuclear medicine specialists are scarce, the platform's expert-level reliability ensures diagnostic consistency, making it a safer and more practical solution for bridging the expertise gap. In contrast, ChatGPT-4o's interpretative process remains largely opaque, and given its poor performance in lesion localization, any interpretation it generates must be thoroughly reviewed by qualified clinicians before clinical use.

Limitations

This study has several limitations that warrant consideration. First, the sample size was limited to 52 patients from a single institution. Although clinically diverse, this dataset may not fully represent the broader population and could introduce potential sampling bias related to demographic composition, cancer type distribution, or imaging parameters. Because of this limited sample size, subgroup or fairness analyses could not be meaningfully performed, restricting its generalizability across institutions, populations, and imaging systems.

Second, ChatGPT-4o was evaluated in its original configuration, meaning we did not perform any additional training specifically for domain recognition or on task on nuclear medicine datasets. This approach was chosen to assess the model's native capability in a realistic, out-of-the-box usage scenario. No few-shot prompting strategies were tested. This setup may underestimate the model's true potential. Therefore, it is reasonable to expect that with targeted training such as supervised fine-tuning on labeled BS data or integration with anatomical segmentation tools on ChatGPT's performance, particularly in lesion localization, could significantly improve.

Third, while our study evaluated ChatGPT-4o ability to identify lesion locations, the model did not explicitly output anatomical labels or structured regions. Instead, it produced narrative descriptions, from which we manually inferred the likely regions being referenced manually. This may introduce further subjectivity when direct comparability with the BS platform, which produces structured region-by-region results. While efforts were made to standardize the process with experienced clinicians, subjectivity in mapping still cannot be fully eliminated in this study. For future work, more direct output alignment, such as prompting ChatGPT-4o to list affected regions explicitly, can be used to reduce interpretation bias and improve reproducibility.

Fourth, while this study included ChatGPT-4o as a representative LLM, other domain-specific vision-language models, such as Med-PaLM or else may offer superior medical image understanding due to their training paradigms. Comparing general purpose with medical-tuned models is a worthwhile direction for future iterations of this work.

Lastly, as this study was conducted using data from a single institution, the generalizability of the findings may be limited across different populations, scanner vendors, and acquisition protocols.

Future work

Future studies using larger, multi-institutional datasets with more balanced demographic representation are warranted to evaluate and mitigate potential bias. Nevertheless, the BS platform's region-based visualization and ChatGPT's probability-based Meta Ratio provide partial interpretability, enhancing transparency in model reasoning despite these limitations.

Conclusion

ChatGPT-4o showed potential for detecting bone metastases and could assist in report drafting through the structured generation of findings and impressions. However, its limited lesion localization performance currently restricts its suitability for clinical application.

This study highlights the current gap between general-purpose multimodal LLMs and domain-specific medical AI systems. ChatGPT-4o demonstrated feasibility in recognizing bone metastases and generating structured reports but lacked the domain adaptation, annotated training, and anatomical priors required for precise lesion localization. In contrast, the BS platform, trained exclusively on labeled Tc-99m MDP BS, performed better in region-level detection. These findings underscore that, until LLMs mature to a level of reliable clinical competency, AI tools trained on domain-specific medical datasets remain more appropriate for clinical use.

Key points

Question: Can ChatGPT-4o accurately detect and localize bone metastases on BS images compared to a TFDA-approved AI platform and board-certified nuclear medicine physicians?

Pertinent findings: In this comparative study of 52 bone scans, ChatGPT-4o achieved similar accuracy (84.6%) to a TFDA-approved BS platform (82.7%) in binary classification of bone metastases. However, ChatGPT-4o demonstrated poorer lesion localization performance, with a precision of 32.5% and sensitivity of 13.3%, compared to the platform's 80.3% precision and 64.9% sensitivity.

Implications for patient care: While ChatGPT-4o shows potential for assisting in diagnosis and report generation, its current limitations in precise lesion localization make it unsuitable for clinical use in nuclear medicine. Further model refinement and domain-specific training could enable future integration into patient care workflows.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076261421075 - Supplemental material for AI-assisted interpretation of bone scans: Performance comparison between ChatGPT-4o and a TFDA-approved bone scintigraphy platform in AI-driven nuclear imaging interpretation

Supplemental material, sj-docx-1-dhj-10.1177_20552076261421075 for AI-assisted interpretation of bone scans: Performance comparison between ChatGPT-4o and a TFDA-approved bone scintigraphy platform in AI-driven nuclear imaging interpretation by Yuan-Yu Lee, Chiung-Wei Liao, Wei-Jen Chen, Yi-Jin Chen, Pei-Chun Yeh and Yu-Chieh Kuo, Pei-Hsuan Lin, Pak-Ki Chan, Chia-Hung Kao in DIGITAL HEALTH

Footnotes

Acknowledgements

This study was supported in part by China Medical University Hospital (DMR-115-075, DMR-115-076).

ORCID iD

Chia-Hung Kao

Ethical approval and consent to participate

This study was approved by a local institutional review board [DMR99-IRB-293-(CR-14)].

Author contributions

These authors’ individual contributions were as follows:

- Yuan-Yu Lee, Chiung-Wei Liao, Wei-Jen Chen, Yi-Jin Chen, Pei-Chun Yeh, Yu-Chieh Kuo, Pei-Hsuan Lin, Pak-Ki Chan, and Chia-Hung Kao: conceptualization:

- Yuan-Yu Lee, Chiung-Wei Liao, Wei-Jen Chen, and Chia-Hung Kao: methodology

- Pei-Chun Yeh, Yu-Chieh Kuo, and Pak-Ki Chan: software

- Yuan-Yu Lee, Chiung-Wei Liao, Wei-Jen Chen, and Chia-Hung Kao: validation

- Yuan-Yu Lee, Yi-Jin Chen, and Pei-Chun Yeh: formal analysis

- Yuan-Yu Lee and Chia-Hung Kao: investigation

- Yuan-Yu Lee and Chia-Hung Kao: resources

- Yi-Jin Chen and Pei-Chun Yeh: data curation

- Yuan-Yu Lee, Yi-Jin Chen, Yu-Chieh Kuo, and Pak-Ki Chan: writing–original draft preparation

- Yuan-Yu Lee, Chiung-Wei Liao, Wei-Jen Chen, Yi-Jin Chen, Pei-Chun Yeh, Yu-Chieh Kuo, Pei-Hsuan Lin, Pak-Ki Chan, and Chia-Hung Kao: writing–review and editing

- Chia-Hung Kao: visualization

- Chia-Hung Kao: supervision

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Guarantor

The scientific guarantor of this publication is Chia-Hung Kao.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

All available data are presented in the text of the paper.

Supplemental material

Supplemental material for this article is available online.

References

Zhao

Jiang

, et al. Deep neural network based artificial intelligence assisted diagnosis of bone scintigraphy for cancer bone metastasis. Sci Rep 2020; 10: 17046.

Pellegrini

Özsoy

Busam

, et al. RaDialog: A large vision-language model for radiology report generation and conversational assistance. 2023, arXiv:2311.18681.

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. 31st Conference on Neural Information Processing Systems (NIPS 2017). Long Beach, CA, USA, 2017.

Devlin

Chang

Lee

, et al. Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT 2019, pp.4171–4186. Minneapolis, Minnesota.

Radford

Narasimhan

Salimans

, et al. Improving language understanding by generative pre-training. 2018. https://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford-Narasimhan/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035

Selivanov

Rogov

Chesakov

, et al. Medical image captioning via generative pretrained transformers. Sci Rep 2023; 13: 4171.

Tomar

Shergill

Rieders

, et al. TransResU-Net: Transformer based ResU-net for real-time colonoscopy polyp segmentation. 2022, arXiv:2206.08985.

OpenAI, Achiam

Adler

, et al. GPT-4 technical report. 2023, arXiv:2303.08774.

Aydin

Karaarslan

OpenAI ChatGPT interprets radiological images: GPT-4 as a medical doctor for a fast check-up. 2025, arXiv:2501.06269.

10.

Wang

Zhang

Traverso

, et al. Assessing the role of GPT-4 in thyroid ultrasound diagnosis and treatment recommendations: enhancing interpretability with a chain of thought approach. Quant Imaging Med Surg 2024; 14: 1602–1615.

11.

Liao

Hsieh

Lai

, et al. Artificial intelligence of object detection in skeletal scintigraphy for automatic detection and annotation of bone metastases. Diagnostics 2023; 13: 685.

12.

Taiwan Food and Drug Administration. Medical device product inquiry. License 007971. Published online. Accessed May 26, 2025. https://lmspiq.fda.gov.tw/web/MDPIQ/MDPIQ1000Result?licBaseId=D9123AB1-7492-47D4-B6DB-EB85480F146B.

13.

Brin

Sorin

Barash

, et al. Assessing GPT-4 multimodal performance in radiological image analysis. Eur Radiol. 2025;35:1959–1965.

14.

Öztürk

Günay

Ateş

, et al. Can GPT-4o accurately diagnose trauma X-rays? A comparative study with expert evaluations. J Emerg Med. 2025;73:71–79.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.58 MB