Sage Journals: Discover world-class research

Abstract

Background

In recent years, potential applications of ChatGPT in medication-related practices have drawn great attention for its intuitive user interfaces, chatbot, and powerful analytical capabilities. However, whether ChatGPT can be broadly applied in clinical practice remains controversial. Early screening, monitoring, and timely treatment are crucial for improving outcomes of interstitial lung disease (ILD) in systemic autoimmune rheumatic diseases (SARDs) due to its high morbidity and mortality rate. This study aimed to evaluate the reliability, repeatability, and confidence of ChatGPT models (GPT-4, GPT-4o mini, and GPT-4o) in delivering guideline-based recommendations for the screening, monitoring, and treatment of ILD in SARD patients.

Methods

Questions derived from the ACR/CHEST guideline for ILD patients with SARDs were used to benchmark three versions of ChatGPT (GPT-4, GPT-4o mini, and GPT-4o) across three separate attempts. The responses were recorded, and the reliability, repeatability, and confidence were analyzed with the recommendations from the guideline.

Results

GPT-4 demonstrated significant variability in reliability across the three attempts (P = .007). In contrast, the other versions showed no significant differences. GPT-4 and GPT-4o mini exhibited substantial interrater agreement (Kendall's W = 0.747 and 0.765, respectively), whereas GPT-4o demonstrated almost perfect interrater agreement (Kendall's W = 0.816). All three versions showed statistically significant differences in high confidence ratings (confidence score of ≥ 8 on the 1–10 scale) across the three attempts (P < .01). Given the higher consistency of GPT-4o and GPT-4o mini, a further comparison was conducted between them on the third attempt. No significant difference was observed in accuracy percentages across the third attempt between GPT-4o and GPT-4o mini (P = .597). Similarly, interrater agreement across the three attempts was not significantly different for both GPT-4o and GPT-4o mini (P = .152). Furthermore, the overconfidence percentage (confidence score of ≥8 assigned to incorrect answers) was 100% (22 of 22) for GPT-4o and 22.7% (10 of 44) for GPT-4o mini, respectively (P < .01).

Conclusions

GPT-4o mini and GPT-4o demonstrated stable reliability across all three attempts, whereas GPT-4 did not. The repeatability of GPT-4o tended to perform better than GPT-4o mini, although this difference was not statistically significant. Additionally, GPT-4o exhibited a higher tendency toward overconfidence compared to GPT-4o mini. Overall, the GPT-4o models performed most effectively in managing SARD-ILD but may exhibit overconfidence in certain scenarios.

Keywords

Systemic autoimmune rheumatic diseases ChatGPT interstitial lung disease

Introduction

In recent years, artificial intelligence (AI) has been widely applied in the medical field, encompassing technologies such as image processing, computer vision, artificial neural networks, machine learning, convolutional neural networks, and deep learning.¹ These technologies have facilitated advancements in disease severity assessment,^2,3 diagnosis,⁴ treatment,^5,6 pharmacological research⁷ and other areas. The emergence of large language models (LLMs), such as ChatGPT, has further accelerated this trend due to their intuitive user interfaces and chatbot functionalities.⁸ Some researchers have explored the potential applications of these models in medication-related practices, including data mining of free-text CT reports on lung cancer, assisting Japanese medical licensing examinations, and supporting neurosurgery written board examinations.^9–11 However, whether ChatGPT can be widely applied in clinical practice remains controversial.

Systemic autoimmune rheumatic diseases (SARDs) refer to a broad spectrum of clinical conditions caused by autoimmune dysregulation and inflammatory responses. These diseases result in damage to joints, connective tissues, skin, blood elements, and other target organs. Historically, SARDs have also been referred to as connective tissue diseases (CTDs) or collagen vascular diseases. The prevalence of interstitial lung disease (ILD) varies across different SARD subtypes: 63% in systemic sclerosis (SSc),¹² 58% in rheumatoid arthritis (RA),¹³ 19.9%–86% in inflammatory myositis (IIM),¹⁴ 16% in Sjögren's syndrome (SS),¹⁵ and 52% in mixed connective tissue disease (MCTD).¹⁶ Patients with CTD-ILD showing decreased FVC and/or DL_CO and fibrotic signs on high-resolution computed tomography (HRCT). These patients have a worse prognosis than those with CTD alone.¹⁷ Among patients diagnosed with RA, the 10-year mortality rate was significantly higher in those with ILD compared to those without ILD, at 60.1% versus 34.5%.¹⁸ Given the high prevalence and impact of ILD in SARDs, early screening, monitoring, and timely treatment are crucial for improving outcomes.

This study aims to evaluate the reliability, repeatability, and confidence of ChatGPT models (GPT-4, GPT-4o mini, and GPT-4o) in delivering guideline-based recommendations^19,20 for the screening, monitoring, and treatment of ILD in SARD patients.

Materials and methods

Three versions of ChatGPT: ChatGPT-4o (version release since May 2024), ChatGPT-4o mini (version release since July 2024), and ChatGPT-4 (version release since November 2023) were selected for investigation. Questions derived from the 2023 American College of Rheumatology (ACR)/American College of Chest Physicians (CHEST) guideline for ILD patients with SARDs were used to benchmark three versions of ChatGPT across three separate attempts. The responses were recorded, and the reliability, repeatability, and confidence were analyzed. The search flow diagram illustrating our study is presented in Figure 1.

Figure 1.

The flowchart of this study consists of four sequential sections: “Question Selection and Classification,” “ChatGPT Prompting,” “Data Collection (ChatGPT Responses),” and “Statistical Analysis.”

Questions selection and classification

A total of 103 questions were extracted from the 2023 ACR/CHEST Guideline^19,20 for ILD in patients with SARDs by two pulmonologists (CJX and XQ), and all questions were independently validated by pulmonary specialists (RYZ and YHW). Disagreements between the above investigators were settled by arbitration of the principal investigator (Minjie Lin and Yingwei Zhang). These questions were divided into two subgroups: “Screening and Monitoring” and “Treatment.” In this study, the reliability, repeatability, and confidence of all questions were evaluated across the three attempts. The questions and corresponding answers generated by ChatGPT are presented in Supplemental Material 1.

ChatGPT prompting

We used three versions of ChatGPT (GPT-4, GPT-4o mini, and GPT-4o) across three separate attempts at three distinct time points (15 August 2024, 17 August 2024, and 18 August 2024, respectively), and the responses were recorded. Prior to inputting our questions, we provided prompts to guide the models to act as specialists and determine which information should be fed back. The specific prompts and questions are provided in Supplemental Material 2, and a sample is illustrated in Figure 2.

Figure 2.

We selected a sample question and conducted three inquiries in three GPT models, comparing the results with the recommendations provided in the guideline.

Statistical analysis

In this study, reliability was defined as accuracy over time; accuracy was the alignment rate between the provided answers and the guideline recommendations. Repeatability is defined as agreement over time, and confidence is a rating from 1 to 10. A confidence score ≥ 8 on the 1–10 scale was defined as “high confidence.” Overconfidence was defined as a confidence score ≥8 assigned to incorrect answers.

To assess the reliability of accuracy, overall performance across all three attempts for each model was compared using the Cochran Q test. To evaluate repeatability, inter-rater agreement among answers was analyzed using the Kendall W coefficient of concordance (KW). The KW ranges from 0 to 1, with values of 0.00–0.20 indicating slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and > 0.80 almost perfect agreement.²¹ The reliability, repeatability, and confidence between GPT-4o mini and GPT-4o were compared using the McNemar test. All statistical analyses were conducted using statistical software (SPSS, Version 19.0). P-value < .05 indicated statistical significance.

Results

Summary of the comparative analysis of GPT-4, GPT-4o mini, and GPT-4o is shown in Table 1 and Figures 3 and 4.

Figure 3.

(a–c) Summary of results for GPT-4, GPT-4o mini, and GPT-4o (attempts 1–3 are shown from left to right). (d–f) For the subgroup of screening and monitoring (questions 1–21), a summary of results for GPT-4, GPT-4o mini, and GPT-4o (attempts 1–3 are shown from left to right). (h–i) For the subgroup of treatment (questions 22–103）, summary of results for GPT-4, GPT-4o mini, and GPT-4o (attempts 1–3 are shown from left to right).

Figure 4.

The accuracy, repeatability, and overconfidence percentages between GPT-4o and GPT-4o mini were compared on the third attempt.

Table 1.

Summary of the comparative analysis of GPT-4, GPT-4o mini, and GPT-4o.

Statistical variable	All questions (1–103)	Subgroup of screening and monitoring (questions 1–21)	Subgroup of treatment (questions 22–103)
Accuracy of GPT-4 (the first time)	86.4% (89 of 103)	90.5% (19 of 21)	85.4% (70 of 82)
Accuracy of GPT-4 (the second time)	75.7% (78 of 103)	81.0% (17 of 21)	74.4% (61 of 82)
Accuracy of GPT-4 (the third time)	75.7% (78 of 103)	85.7% (18 of 21)	78.0% (64 of 82)
P value of GPT-4 accuracy	.007	.778	.19
Accuracy of GPT-4o mini (the first time)	77.7% (80 of 103)	85.7% (18 of 21)	75.6% (62 of 82)
Accuracy of GPT-4o mini (the second time)	70.9% (73 of 103)	71.4% (15 of 21)	70.7% (58 of 82)
Accuracy of GPT-4o mini (the third time)	74.8% (77 of 103)	85.7% (18 of 21)	72.0% (59 of 82)
P value of GPT-4o mini accuracy	.172	.111	.486
Accuracy of GPT-4o (the first time)	82.5% (85 of 103)	90.5% (19 of 21)	80.5% (66 of 82)
Accuracy of GPT-4o (the second time)	83.5% (86 of 103)	90.5% (19 of 21)	81.7% (67 of 82)
Accuracy of GPT-4o (the third time)	78.6% (81 of 103)	90.5% (19 of 21)	75.6% (62 of 82)
P value of GPT-4o accuracy	.199	1	.199
KW of GPT-4	0.747	0.75	0.746
KW of GPT-4o mini	0.765	0.818	0.762
KW of GPT-4o	0.816	1	0.788
The percent of high confidence GPT-4 (the first time)	67.0% (69 of 103)	95.2% (20 of 21)	59.8% (49 of 82)
The percent of high confidence GPT-4 (the second time)	35.9% (37 of 103)	76.2% (16 of 21)	25.6% (21 of 82)
The percent of high confidence GPT-4 (the third time)	58.3% (60 of 103)	100% (21 of 21)	47.6% (39 of 82)
P value of GPT-4 high confidence percentage	< .01	.053	< .01
The percent of high confidence GPT-4o mini (the first time)	84.5% (87 of 103)	95.2% (20 of 21)	81.7% (67 of 82)
The percent of high confidence GPT-4o mini (the second time)	89.3% (92 of 103)	95.2% (20 of 21)	87.8% (72 of 82)
The percent of high confidence GPT-4o mini (the third time)	42.7% (44 of 103)	52.4% (11 of 21)	40.2% (33 of 82)
P value of GPT-4o mini high confidence percentage	< .01	< .01	< .01
The percent of high confidence GPT-4o (the first time)	63.1% (65 of 103)	90.5% (19 of 21)	56.1% (46 of 82)
The percent of high confidence GPT-4o (the second time)	86.4% (89 of 103)	95.2% (20 of 21)	84.1% (69 of 82)
The percent of high confidence GPT-4o (the third time)	100.0% (103 of 103)	100.0% (21 of 21)	100.0% (82 of 82)
P value of GPT-4o high confidence percentage	< .01	.778	< .01

KW: Kendall-W coefficient of concordance.

Reliability

GPT-4 demonstrated unstable reliability across the three attempts, with accuracy rates of 86.4%, 75.7%, and 79.6%, respectively (P = .007). In contrast, GPT-4o mini showed no significant difference, with accuracies of 77.7%, 70.9%, and 74.8%, respectively (P = .172). Similarly, GPT-4o exhibited consistent performance, with accuracies of 82.5%, 83.5%, and 78.6%, respectively (P = .199).

Repeatability

GPT-4 and GPT-4o mini demonstrated substantial interrater agreement, with KW of 0.747 and 0.765, respectively. GPT-4o achieved almost perfect interrater agreement (KW = 0.816). The percentage of repeatable answers across the three attempts was 87.3% for GPT-4o, 81.6% for GPT-4o mini, and 79.6% for GPT-4.

Confidence

All three versions exhibited statistically significant differences in “high confidence” ratings across the three attempts. For GPT-4, high confidence ratings were 67.0%, 35.9%, and 58.3%, respectively, (P < .01). For GPT-4o mini, high confidence ratings were 84.5%, 89.3%, and 42.7%, respectively, (P < .01). For GPT-4o, high confidence ratings were 63.1%, 86.4%, and 100.0%, respectively, (P < .01).

Comparison of GPT-4o and GPT-4o mini

Given that GPT-4 exhibited variability in accuracy across the three attempts, whereas GPT-4o and GPT-4o mini were more reliable, we compared their reliability, repeatability, and overconfidence percentages. No significant difference was observed in accuracy percentages between GPT-4o and GPT-4o mini across the three attempts (P = .597). Similarly, interrater agreement rate did not differ significantly for either version (P = .152). However, the overconfidence percentage differed significantly: GPT-4o exhibited 100% overconfidence (22 of 22), while GPT-4o mini exhibited only 22.7% overconfidence (10 of 44) (P < .01).

Subgroup screening and monitoring (questions 1–21)

For the subgroup screening and monitoring, GPT-4 demonstrated accuracies of 90.5%, 81.0%, and 85.7% across the three attempts (P = .778). GPT-4o mini showed accuracies of 85.7%, 71.4%, and 85.7%, respectively, (P = .111). GPT-4o maintained consistent accuracy at 90.5% across all three attempts (P = 1). GPT-4 exhibited substantial interrater agreement (KW = 0.750), while GPT-4o mini and GPT-4o achieved almost perfect interrater agreement (KW = 0.818 and KW = 1, respectively). According to high confidence ratings across the three attempts: GPT-4 had ratings of 95.2%, 76.2%, and 100% (P = .053); GPT-4o mini had ratings of 95.2%, 95.2%, and 52.4% (P < .01); GPT-4o had ratings of 90.5%, 95.2%, and 100.0%, respectively (P = .778).

Subgroup treatment (questions 22–103)

For the treatment subgroup, GPT-4 demonstrated accuracies of 85.4%, 74.4%, and 78.0% across the three attempts (P = .19). GPT-4o mini showed accuracies of 75.6%, 70.7%, and 72.0%, respectively (P = .486). GPT-4o exhibited accuracies of 80.5%, 81.7%, and 75.6%, respectively (P = .199). All three versions demonstrated substantial interrater agreement, with KW of 0.746, 0.762, and 0.788 for GPT-4, GPT-4o mini, and GPT-4o, respectively. High confidence ratings varied significantly across the three attempts: GPT-4 had ratings of 59.8%, 25.6%, and 47.6% (P < .01); GPT-4o mini had ratings of 81.7%, 87.8%, and 40.2% (P < .01); and GPT-4o had ratings of 56.1%, 84.1%, and 100.0%, respectively (P < .01).

Discussion

This study evaluated the reliability, repeatability, and confidence of ChatGPT, including GPT-4, GPT-4o mini, and GPT-4o. Overall, GPT-4o mini and GPT-4o demonstrated higher reliability than GPT-4. GPT-4o achieved almost perfect interrater agreement (KW = 0.816), while GPT-4o mini showed substantial interrater agreement (KW = 0.765), although no statistically significant difference was observed. Additionally, GPT-4o tended to exhibit greater overconfidence compared to GPT-4o mini. Subgroup analysis revealed that all versions of ChatGPT, including GPT-4, GPT-4o mini, and GPT-4o, remained repeatability in both screening/monitoring and treatment, with substantial interrater agreement (KW ≥ 0.7) at least. Furthermore, all versions demonstrated inconsistent performance in confidence across treatment-related tasks, while within the subgroup of screening and monitoring tasks, only the GPT-4o mini version exhibited instability.

SARDs encompass conditions such as RA, SSc, IIM, MCTD, and SjD,²² which are most likely to develop into SARD-ILD, which remains underdiagnosed and undertreated.²³ Therefore, guidelines developed using a multidisciplinary approach represent the first authoritative resource for managing SARD-ILD.^19,20 Based on guideline recommendations, using ChatGPT to answer questions and comparing results to guidelines can effectively evaluate its reliability, repeatability, and confidence. In prior studies, multiple-choice examinations^24–26 or questionnaires^11,27,28 were commonly used to test ChatGPT, covering topics such as radiology board-style exams, retina-related questions, and gastroesophageal reflux disease. In our study, we generated a questionnaire based on the latest guideline associated with SARD-ILD,^19,20 modifying each question according to guideline suggestions.

LLM-based AI applications (LLMAs), particularly ChatGPT, have previously demonstrated their utility in medical examination performance and clinical management. However, the reliability of LLMAs varies across studies. Some researchers have shown that the latest LLMAs achieve high accuracy rates. For example, in text-based questions from six topics of undergraduate endodontic education examinations, the accuracy rate of ChatGPT-4o was 92.8% which was significantly higher than that of ChatGPT-4 (81.7%, P < .001).²⁹ In evaluating 250 French orthopedic and trauma surgery exam questions (2020–2024), ChatGPT-4o scored similarly to residents over the past 5 years (74.8% vs. 70.8%, P = .32).³⁰ Despite these findings, some studies suggest that LLMAs can serve as useful assistants in medical management and education, but cannot replace conventional teaching methods at present. For example, in restorative dentistry and endodontics student assessments, ChatGPT-4o had an accuracy of only 72%.³¹ In diagnosing oral potentially malignant lesions, GPT-4o had the accuracy of 27/42, which was even higher than GPT-4.0 (20/42), GPT-3.5 (18/42), and Gemini (15/42).³² In our study, the accuracy rates of three attempts across the three versions of ChatGPT exceeded 70%, indicating that GPT-4, GPT-4o mini, and GPT-4o possess the ability to assist in managing SARD-ILD but cannot replace clinicians’ conventional practices. After three attempts, GPT-4o and GPT-4o mini demonstrated more reliable performance than GPT-4, with no significant difference between GPT-4o and GPT-4o mini (P = .597), although GPT-4o showed higher accuracy rates in each test, consistent with previous studies. Therefore, more advanced LLMAs tools are needed to further improve accuracy in medical management and education.

Repeatability is an essential variable in LLMAs in medical practice. Prior studies have shown unsatisfactory repeatability. For example, in a radiology board-style examination study, 61.3% (92 of 150) and 76.7% (115 of 150) of answer choices were consistent across three attempts for GPT-3.5 and GPT-4.²⁴ In ophthalmic multimodal image analysis, the overall repeatability of GPT-4V was 63.3% (38/60).³³ In our study, GPT-4 and GPT-4o mini demonstrated substantial interrater agreement (KW = 0.747 and 0.765, respectively), while GPT-4o achieved almost perfect interrater agreement (KW = 0.816). Data on repeatability percentages showed that 87.3% (90 of 103), 81.6% (84 of 103), and 79.6% (82 of 103) of answer choices were consistent across three attempts for GPT-4o, GPT-4o mini, and GPT-4, respectively, suggesting that all three versions of ChatGPT exhibit good repeatability in SARD-ILD management.

The variability of confidence can predict the probability of accuracy. Previous research indicated that ChatGPT tends to be overconfident. For example, in a study involving a medicine-related multiple-choice examination, GPT-3.5 and GPT-4 rated “high confidence” for most initial responses at 100% (150 of 150) and 94.0% (141 of 150), respectively, while their overconfidence percentages were 100% (59 of 59) and 77% (27 of 35), respectively (P = .89).²⁴ In our study, GPT-4, GPT-4o mini, and GPT-4o were unstable across three attempts for responses. In addition, the overconfidence percent of GPT-4o was 100% (22 of 22) and 22.7% (10 of 44) for GPT-4o mini (P < .01).It should be noted with caution that ChatGPT can answer any question with bold conviction, even when it is incorrect.³⁴ Moreover, the application has been shown to provide fake references in the past.³⁵ In the current models, there is no shown to provid process, which is a significant source of AI hallucinations and biases in the training data. It is necessary to understand that its textual expression may contain implicit or explicit biases. Therefore, ChatGPT may have potential beneficial and ethical uses, which lie in fully leveraging its advantages in structure and grammar while avoiding its shortcomings through human supervision and review.³⁶ Furthermore, in the application of GPT-4o in the medical field, its overconfident nature may bring certain clinical risks. This risk mainly lies in the fact that when answering complex or uncertain medical questions, the model may exhibit an overly high level of confidence, thereby misleading doctors to make incorrect judgments or decisions. This overconfidence sometimes stems from the model's conclusions based on past data rather than reasoning from specific evidence. Due to its smaller scale, GPT-4o-mini may be more cautious or conservative in its reasoning, thus showing a lower level of confidence compared to GPT-4o. Different versions of ChatGPT may have differences in their capabilities due to variations in model size and complexity, training and optimization processes, task handling capabilities, and generation strategies. Therefore, when applying such AI technologies to clinical decision-making support, extra caution is needed, and a comprehensive judgment should be made by integrating professional medical knowledge and practical experience. Reducing unintended hallucinations and improving the accuracy of judgment confidence should be addressed in future developments.

Advantages and limitations

To the best of our knowledge, this is the first study to comprehensively evaluate the reliability, repeatability, and confidence of GPT-4, GPT-4o mini, and GPT-4o in the context of screening, monitoring, and treatment for SARD-ILD. Furthermore, all questions were posed three times at different time points, which enhance the robustness of our conclusions. However, our study has several limitations that should be addressed. First, none of the GPT versions are infallible, despite our prompts request direct and clear responses, a small proportion of the responses may contain ambiguous meanings, which could potentially lead to errors in judgment. Second, while all questions were derived from established guidelines, real-world clinical scenarios may involve more complex situations that were not covered in this study, so our conclusions need external validation and an evaluation of generalizability in future research. Finally, although the evaluators were single-blinded, potential biases could still have been introduced into their assessments, as they are coauthors of the paper and might have had an inherent expectation for significant differences between the groups.

Conclusion

GPT-4o mini and GPT-4o demonstrated stable reliability across all three attempts, whereas GPT-4 did not. The repeatability of GPT-4o tended to be better than that of GPT-4o mini, although this difference was not statistically significant. Additionally, GPT-4o exhibited a higher tendency toward overconfidence compared to GPT-4o mini. Overall, GPT-4o models performed most effectively in managing SARD-ILD but may exhibit overconfidence in certain scenarios.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076251384233 - Supplemental material for Evaluation of reliability, repeatability, and confidence of ChatGPT for screening, monitoring, and treatment of interstitial lung disease in patients with systemic autoimmune rheumatic diseases

Supplemental material, sj-docx-1-dhj-10.1177_20552076251384233 for Evaluation of reliability, repeatability, and confidence of ChatGPT for screening, monitoring, and treatment of interstitial lung disease in patients with systemic autoimmune rheumatic diseases by Minjie Lin, Chuanjun Xu, Xiang Qu, Biyun Xu, Yanghong Wang, Ruyi Zou and Yingwei Zhang in DIGITAL HEALTH

Supplemental Material

sj-docx-2-dhj-10.1177_20552076251384233 - Supplemental material for Evaluation of reliability, repeatability, and confidence of ChatGPT for screening, monitoring, and treatment of interstitial lung disease in patients with systemic autoimmune rheumatic diseases

Supplemental material, sj-docx-2-dhj-10.1177_20552076251384233 for Evaluation of reliability, repeatability, and confidence of ChatGPT for screening, monitoring, and treatment of interstitial lung disease in patients with systemic autoimmune rheumatic diseases by Minjie Lin, Chuanjun Xu, Xiang Qu, Biyun Xu, Yanghong Wang, Ruyi Zou and Yingwei Zhang in DIGITAL HEALTH

Footnotes

Acknowledgements

None.

ORCID iD

Minjie Lin

Ethical approval

Not applicable.

Consent to participate

Not applicable.

Consent to publication

Not applicable.

Authors’ contributions

YWZ and MJL contributed to the conception, methodology design, and writing of the manuscript. RYZ, XQ, BYX, CJX, and YHW took responsibilities for data abstraction and accuracy of data analysis. All authors are guarantors of this work.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the Medical Scientific Research Foundation of Jiangsu Commission of Health, grant number Z2024043, Talent lifting project of Nanjing Second Hospital, grant number RCMS24003, and National Natural Science Foundation of China, grant number 82270076.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

All data generated or analyses in this study are included in this article. Further enquiries can be directed to the corresponding author.

Supplemental material

Supplemental material for this article is available online.

Appendix

References

Mintz

Brodie

. Introduction to artificial intelligence in medicine. Minim Invasiv Ther 2019; 28: 73–81.

Chrzan

Wizner

Sydor

, et al. Artificial intelligence guided HRCT assessment predicts the severity of COVID-19 pneumonia based on clinical parameters. Bmc Infect Dis 2023; 23: 314.

Huang

, et al. Artificial intelligence-based psoriasis severity assessment: real-world study and application. J Med Internet Res 2023; 25: e44932.

Parmar

Surico

Singh

, et al. Artificial intelligence (AI) for early diagnosis of retinal diseases. Medicina (Kaunas) 2024; 60: 527.

Duwe

Mercier

Wiesmann

, et al. Challenges and perspectives in use of artificial intelligence to support treatment recommendations in clinical oncology. Cancer Med 2024; 13: e7398.

Celik

Kilinc

. Assessment of ChatGPT's adherence to EULAR diagnostic criteria and therapeutic protocols for rheumatoid arthritis at two distinct time points, 14 days apart, utilizing binary and multiple-choice inquiries. Clin Rheumatol 2025; 44: 2233–2239.

Mayer

Kringel

Lotsch

. Artificial intelligence and machine learning in clinical pharmacological research. Expert Rev Clin Pharmacol 2024; 17: 79–91.

Lee

Bubeck

Petro

. Benefits, limits, and risks of GPT-4 as an AI Chatbot for medicine. N Engl J Med 2023; 388: 1233–1239.

Fink

Bischoff

Fink

, et al. Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer. Radiology 2023; 308: e231362.

10.

Takagi

Watari

Erabi

, et al. Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. Jmir Med Educ 2023; 9: e48002.

11.

Ali

Tang

Connolly

, et al. Performance of ChatGPT and GPT-4 on neurosurgery written board examinations. Neurosurgery 2023; 93: 1353–1365.

12.

Suliman

Dobrota

Huscher

, et al. Brief report: pulmonary function tests: high rate of false-negative results in the early detection and screening of scleroderma-related interstitial lung disease. Arthritis Rheumatol 2015; 67: 3256–3261.

13.

Gabbay

Tarala

Will

, et al. Interstitial lung disease in recent onset rheumatoid arthritis. Am J Respir Crit Care Med 1997; 156: 528–535.

14.

Morisset

Johnson

Rich

, et al. Management of myositis-related interstitial lung disease. Chest 2016; 150: 1118–1128.

15.

Ramos-Casals

Brito-Zeron

Seror

, et al. Characterization of systemic disease in primary Sjogren's syndrome: EULAR-SS task force recommendations for articular, cutaneous, pulmonary and renal involvements. Rheumatology (Oxford) 2015; 54: 2230–2238.

16.

Gunnarsson

Aalokken

Molberg

, et al. Prevalence and severity of interstitial lung disease in mixed connective tissue disease: a nationwide, cross-sectional study. Ann Rheum Dis 2012; 71: 1966–1972.

17.

Storrer

Muller

Pessoa

, et al. Connective tissue disease-associated interstitial lung disease. J Bras Pneumol 2024; 50: e20230132.

18.

Hyldgaard

Hilberg

Pedersen

, et al. A population-based cohort study of rheumatoid arthritis-associated interstitial lung disease: comorbidity and mortality. Ann Rheum Dis 2017; 76: 1700–1706.

19.

Johnson

Bernstein

Bolster

, et al. 2023 American College of Rheumatology (ACR)/American College of Chest Physicians (CHEST) guideline for the screening and monitoring of interstitial lung disease in people with systemic autoimmune rheumatic diseases. Arthritis Rheumatol 2024; 76: 1201–1213.

20.

Johnson

Bernstein

Bolster

, et al. 2023 American College of Rheumatology (ACR)/American College of Chest Physicians (CHEST) guideline for the treatment of interstitial lung disease in people with systemic autoimmune rheumatic diseases. Arthritis Rheumatol 2024; 76: 1182–1200.

21.

de Reuver

Gevers

, et al. Poor agreement among expert witnesses in bile duct injury malpractice litigation: an expert panel survey. Ann Surg 2008; 248: 815–820.

22.

Guthridge

Wagner

James

. The promise of precision medicine in rheumatology. Nat Med 2022; 28: 1363–1371.

23.

Luppi

Sebastiani

. Guiding ILD management in systemic autoimmune rheumatic diseases. Nat Rev Rheumatol 2024; 20: 669–670.

24.

Krishna

Bhambra

Bleakney

, et al. Evaluation of reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 on a radiology board-style examination. Radiology 2024; 311: e232715.

25.

Bhayana

Krishna

Bleakney

, et al. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology 2023; 307: e230582.

26.

Bhayana

Bleakney

Krishna

. GPT-4 in radiology: improvements in advanced reasoning. Radiology 2023; 307: e230987.

27.

Tailor

Dalvin

Chen

, et al. A comparative study of responses to retina questions from either experts, expert-edited large language models, or expert-edited large language models alone. Ophthalmol Sci 2024; 4: 100485.

28.

Huo

Calabrese

Sylla

, et al. The performance of artificial intelligence large language model-linked chatbots in surgical decision-making for gastroesophageal reflux disease. Surg Endosc 2024; 38: 2320–2330.

29.

Arili

Turan

Canakci

. Evaluation of the performance of ChatGPT-4 and ChatGPT-4o as a learning tool in endodontics. Int Endod J 2025.

30.

Maraqa

Samargandi

Poichotte

, et al. Comparing performances of French orthopaedic surgery residents with the artificial intelligence ChatGPT-4/4o in the French diploma exams of orthopaedic and trauma surgery. Orthop Traumatol Surg Res 2024: 104080.

31.

Kunzle

Paris

. Performance of large language artificial intelligence models on solving restorative dentistry and endodontics student assessments. Clin Oral Investig 2024; 28: 575.

32.

Pradhan

. Accuracy of ChatGPT 3.5, 4.0, 4o and Gemini in diagnosing oral potentially malignant lesions based on clinical case reports and image recognition. Med Oral Patol Oral Cir Bucal 2025; 30: e224–e231.

33.

Chen

Zhao

, et al. Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis. Br J Ophthalmol 2024; 108: 1384–1389.

34.

Lee

Frieske

, et al. Survey of hallucination in natural language generation. Acm Comput Surv 2023; 55: 38.

35.

Gravel

D Amours-Gravel

Osmanlliu

. Learning to fake it: limited responses and fabricated references provided by ChatGPT for medical questions. Mayo Clinic Proceedings. Digital Health 2023; 1: 226–234.

36.

Cheng

Calhoun

Reedy

. Artificial intelligence-assisted academic writing: recommendations for ethical use. Adv Simul (Lond) 2025; 10: 22.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.15 MB

0.02 MB

Evaluation of reliability,repeatability,and confidence of ChatGPT for screening,monitoring,and treatment of interstitial lung disease in patients with systemic autoimmune rheumatic diseases

Abstract

Background

Methods

Results

Conclusions

Keywords

Introduction

Materials and methods

Questions selection and classification

ChatGPT prompting

Statistical analysis

Results

Reliability

Repeatability

Confidence

Comparison of GPT-4o and GPT-4o mini

Subgroup screening and monitoring (questions 1–21)

Subgroup treatment (questions 22–103)

Discussion

Advantages and limitations

Conclusion

Supplemental Material

sj-docx-1-dhj-10.1177_20552076251384233 - Supplemental material for Evaluation of reliability, repeatability, and confidence of ChatGPT for screening, monitoring, and treatment of interstitial lung disease in patients with systemic autoimmune rheumatic diseases

Supplemental Material

sj-docx-2-dhj-10.1177_20552076251384233 - Supplemental material for Evaluation of reliability, repeatability, and confidence of ChatGPT for screening, monitoring, and treatment of interstitial lung disease in patients with systemic autoimmune rheumatic diseases

Footnotes

Acknowledgements

ORCID iD

Ethical approval

Consent to participate

Consent to publication

Authors’ contributions

Funding

Declaration of conflicting interests

Data availability statement

Supplemental material

Appendix

References

Supplementary Material