Abstract
Background
Artificial intelligence (AI), particularly GPT models such as GPT-4o (omnimodal), is increasingly being integrated into healthcare to provide diagnostic and treatment recommendations. However, the accuracy and clinical applicability of such AI systems remain unclear.
Objective
This study aimed to evaluate the accuracy and completeness of GPT-4o in comparison to resident physicians and senior infectious disease specialists in diagnosing and managing bacterial, fungal, and viral infections.
Methods
A comparative study was conducted involving GPT-4o, three resident physicians, and three senior infectious disease experts. Participants answered 75 questions—comprising true/false, open-ended, and clinical case-based scenarios—developed according to international guidelines and clinical practice. Accuracy and completeness were assessed via blinded expert review using Likert scales. Statistical analysis included Chi-square, Fisher's exact, and Kruskal–Wallis tests.
Results
In true/false questions, GPT-4o showed accuracy (87.5%) comparable to specialists (90.3%) and exceeded residents (77.8%). Specialists outperformed GPT-4o in accuracy on both open-ended (p = .008) and clinical case-based questions.
Conclusions
GPT-4o shows promise as a tool for providing comprehensive responses in infectious disease management, although specialists still outperform it in accuracy. Continuous human oversight is recommended to mitigate potential inaccuracies in clinical decision-making. These findings suggest that while GPT-4o may be considered a valuable supplementary tool for medical advice, it should not replace expert consultation in complex clinical decision-making.
Keywords
Introduction
The emergence of artificial intelligence (AI) has profoundly impacted medicine, offering the potential to deliver medical advice more conveniently and efficiently.1 With the digitalization of medicine and the integration of software applications with health big data, the use of AI in healthcare has expanded significantly.2 AI applications have been demonstrated across medical specialties ranging from orthodontics and trauma evaluation to neurological diagnostics.3–8
OpenAI's ChatGPT is a pioneering AI in natural language processing, capable of rapidly interpreting vast amounts of medical literature and patient data. The recent release of GPT-4o (omnimodal) enhances these capabilities, allowing it to process and generate text, images, video, and audio. It holds significant potential for providing theoretical explanations and treatment advice in medicine.9 Approximately 80% of patients reportedly use online resources to address health-related concerns. Ayers et al. found that 78.6% of respondents preferred chatbot responses, citing higher quality and more empathetic answers.10 However, whether AI can professionally deliver scientifically valid advice remains unclear, and the accuracy of its responses to medical queries is largely unknown. Therefore, assessing chatbots like GPT-4o is crucial for understanding the accuracy of the content that patients and healthcare providers might encounter.11,12
Recent literature has highlighted differences between AI systems like ChatGPT and human experts, particularly in the domain of complex medical decision-making. A study by Massey et al. indicated that ChatGPT-3.5 and GPT-4 performed below the level of orthopedic residents in orthopedic evaluations.13 In a prospective study, Maillard et al. assessed the quality and safety of ChatGPT-4's management recommendations for patients with positive blood cultures, concluding that relying on GPT-4 without expert consultation is risky, especially in severe cases.14 In another instance, two of ChatGPT's diagnostic and therapeutic recommendations for brain abscess conflicted with the guidelines of the European Society of Clinical Microbiology and Infectious Diseases (ESCMID), potentially endangering patient safety.15 Wilhelm et al. evaluated the performance of four prominent large language models (LLMs) in clinical specialties such as ophthalmology, orthopedics, and dermatology, revealing variations in quality and reliability, with different models exhibiting distinct error patterns, such as diagnostic confusion and ambiguous advice.16 These studies suggest that ChatGPT may not accurately address complex clinical scenarios, particularly in specialties like microbiology and antimicrobial therapy, where the interpretation of dynamic and complex clinical situations is critical. The advent of ChatGPT has introduced new opportunities and challenges in infectious disease management, particularly regarding antimicrobial recommendations.17 Various authors have explored ChatGPT's application in infectious diseases, sparking discussions on whether chatbot AI might eventually replace traditional roles in disease management.18,19
There is an urgent need to evaluate chatbot AI's performance in addressing complex medical issues, along with the potential implications and risks of ChatGPT's growing use in healthcare. Infectious diseases, as a medical specialty, involve constantly evolving conditions that require consideration of complex treatment guidelines, patient backgrounds, pathogens, infection sites, and potential complications. Our study aims to assess GPT-4o's ability to provide diagnostic insights and treatment recommendations for bacterial, fungal, and viral infections, evaluating its performance in handling complex clinical questions according to international guidelines and comparing it with resident physicians and expert groups.
Methods
Study design and participants
This comparative study aimed to evaluate the accuracy and completeness of responses provided by GPT-4o, resident physicians, and senior clinical microbiology experts on questions related to bacterial, fungal, and viral infections. The study included three groups of participants: three senior clinical microbiology experts, three resident physicians, and the GPT-4o model.
Inclusion and exclusion criteria
Resident physicians were eligible if they had completed at least 1 year of clinical training in infectious diseases or clinical microbiology. Senior experts were included if they had a minimum of 5 years of experience in infectious diseases or clinical microbiology and held a senior academic or clinical appointment. Participants who were involved in question design, pilot testing, or scoring were excluded to avoid bias. GPT-4o was tested using a consistent version throughout the study period; no exclusion criteria applied to the AI model.
A total of 75 questions were developed by a panel of infectious disease experts, covering three major infection types: bacterial, fungal, and viral. For each infection type, the assessment included 8 true/false questions, 8 open-ended questions, and 9 clinical case-based open-ended questions, resulting in 25 questions per category and 75 in total. The true/false and open-ended questions were formulated in accordance with domestic and international guidelines, including those from the Infectious Diseases Society of America (IDSA), the European Organization for Research and Treatment of Cancer and the Mycoses Study Group Education and Research Consortium (EORTC/MSGERC), the American Thoracic Society/Infectious Diseases Society of America (ATS/IDSA), the American Society for Microbiology (ASM), and the European Confederation of Medical Mycology (ECMM). The clinical case-based open-ended questions were derived from real-world clinical scenarios constructed by the expert panel based on their practical experience. All cases were de-identified and did not involve any protected patient information.
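For reference, this question-bank composition can be summarized programmatically. The short Python sketch below is our own illustration, not part of the study materials; it simply reproduces the per-category and overall counts described above.

```python
# Illustrative blueprint of the question bank described above;
# the structure and counts come from the text, the names are ours.
QUESTION_BLUEPRINT = {
    infection: {"true_false": 8, "open_ended": 8, "clinical_case": 9}
    for infection in ("bacterial", "fungal", "viral")
}

per_category = {k: sum(v.values()) for k, v in QUESTION_BLUEPRINT.items()}
total = sum(per_category.values())
assert all(n == 25 for n in per_category.values()) and total == 75
```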
Although the assessment was not based on a previously validated instrument, the test items were developed through expert consensus and aligned with real-world clinical decision-making. To improve clarity and face validity, the full set of questions underwent a pilot test involving three independent infectious disease clinicians. Feedback from the pilot test was used to refine ambiguous wording and improve content coverage.
Data collection
Participants were required to answer 75 questions independently. The true/false and clinical case-based open-ended questions were designed to evaluate the participants’ accuracy in diagnosing and recommending treatments for specific infections. The open-ended questions required more detailed responses, expecting participants to demonstrate their understanding and provide comprehensive explanations based on clinical guidelines and knowledge. Residents and specialists were permitted to use any necessary resources, except for ChatGPT and similar AI tools, to answer the questions. For GPT-4o, queries were manually entered, and responses were directly collected from the interface. To simulate expert-level clinical reasoning in response to case-based scenarios, a standardized role-based prompt was used prior to each clinical case input. Specifically, the following instruction was included before each case:
“If you were a specialist in microbiology and infectious diseases, with expertise in the diagnosis and treatment of infectious conditions, I would like you to provide professional recommendations based on the following clinical case.” This prompt was applied consistently to ensure contextual alignment across all cases.
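Although all queries in this study were entered manually through the ChatGPT interface, the same standardized role-based prompt could in principle be applied programmatically. The sketch below is a hypothetical illustration only: the OpenAI client usage and the "gpt-4o" model identifier are assumptions for demonstration, not part of the study protocol.

```python
# Hypothetical reproduction of the study's standardized role-based prompt
# via the OpenAI API (the study itself used manual entry in the interface).
from openai import OpenAI

ROLE_PROMPT = (
    "If you were a specialist in microbiology and infectious diseases, "
    "with expertise in the diagnosis and treatment of infectious conditions, "
    "I would like you to provide professional recommendations based on the "
    "following clinical case."
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def ask_clinical_case(case_text: str) -> str:
    """Prepend the standardized prompt to a de-identified case and return the reply."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model identifier
        messages=[{"role": "user", "content": f"{ROLE_PROMPT}\n\n{case_text}"}],
    )
    return response.choices[0].message.content
```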
The full list of true/false and open-ended questions is available in Supplementary Material 1, and the clinical case questions are presented in Supplementary Material 2.
Blind evaluation
The answers to the open-ended questions and the clinical case questions underwent a blind review by an expert evaluation panel. This panel used a Likert scale to assess the accuracy and completeness of the responses, comparing them against established international guidelines and relevant clinical knowledge.
Accuracy was evaluated using a six-point Likert scale, where (1) represents a completely incorrect response; (2) indicates that incorrect elements outweigh correct elements; (3) suggests a balance between correct and incorrect elements; (4) indicates that correct elements outweigh incorrect elements; (5) is used for responses that are nearly entirely correct; and (6) represents a completely correct response.
For assessing completeness, a three-point Likert scale was used: (1) represents an incomplete answer that only addresses certain aspects of the question and is missing significant parts; (2) provides an adequate answer that covers all necessary aspects of the question; and (3) represents a comprehensive answer that covers all aspects of the question and includes additional information or context beyond what was expected.
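For score recording, the two rubrics can be captured as simple lookup tables. The sketch below is our own encoding of the scales described above, useful for validating entered scores; it is not the authors' instrument.

```python
# Our own encoding of the two Likert rubrics described above.
ACCURACY_SCALE = {
    1: "completely incorrect",
    2: "incorrect elements outweigh correct elements",
    3: "correct and incorrect elements balanced",
    4: "correct elements outweigh incorrect elements",
    5: "nearly entirely correct",
    6: "completely correct",
}

COMPLETENESS_SCALE = {
    1: "incomplete: misses significant parts of the question",
    2: "adequate: covers all necessary aspects",
    3: "comprehensive: covers all aspects plus additional context",
}


def validate_scores(accuracy: int, completeness: int) -> None:
    """Raise if a recorded score pair falls outside the rubric ranges."""
    if accuracy not in ACCURACY_SCALE:
        raise ValueError(f"accuracy must be 1-6, got {accuracy}")
    if completeness not in COMPLETENESS_SCALE:
        raise ValueError(f"completeness must be 1-3, got {completeness}")
```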
Statistical analysis
Statistical methods were used to compare the performance of the different groups (senior specialists, GPT-4o, and residents). Accuracy was described using absolute numbers and percentages. The Chi-square test was employed to assess differences between groups, with Fisher's exact test applied when more than 20% of expected cell counts were less than five. The accuracy and completeness scores for the open-ended questions were described using the median and interquartile range. Normality was assessed with the Kolmogorov–Smirnov test, and the Kruskal–Wallis test was applied to evaluate differences in accuracy and completeness scores between groups. A two-sided p value < 0.05 was considered statistically significant.
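As an illustration of this analysis pipeline, the following Python sketch applies the named tests with SciPy to placeholder data; all numbers below are invented for demonstration, and the real scores are provided in the supplementary materials.

```python
# Sketch of the statistical comparisons described above, on placeholder data.
import numpy as np
from scipy import stats

# Contingency table of correct/wrong true-false answers
# (rows: GPT-4o, specialists, residents; columns: correct, wrong).
observed = np.array([[21, 3], [22, 2], [19, 5]])
chi2, p, dof, expected = stats.chi2_contingency(observed)

if (expected < 5).mean() > 0.20:
    # Fisher's exact test when >20% of expected counts fall below five.
    # SciPy's fisher_exact handles 2x2 tables, so pairwise comparisons
    # (e.g., GPT-4o vs specialists) are shown here.
    odds_ratio, fisher_p = stats.fisher_exact(observed[:2])

# Accuracy scores per group for open-ended questions (placeholder values).
gpt4o = [4, 5, 5, 6, 3, 5, 4, 5]
specialists = [6, 6, 5, 6, 6, 5, 6, 5]
residents = [4, 4, 5, 3, 5, 4, 4, 3]

# Normality check: Kolmogorov-Smirnov against a fitted normal distribution.
ks_stat, ks_p = stats.kstest(
    gpt4o, "norm", args=(np.mean(gpt4o), np.std(gpt4o, ddof=1))
)

# Non-parametric comparison of score distributions across the three groups.
h_stat, kw_p = stats.kruskal(gpt4o, specialists, residents)
print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {kw_p:.3f}")
```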
Results
True or false questions
In the comparison of accuracy in answering true or false questions among GPT-4o, specialists, and residents, no significant differences were found among the three groups (Figure 1).

Figure 1. Percentage of correct and wrong answers for the true or false questions among GPT-4o, specialists, and residents.
Notably, there were no significant differences in the accuracy of responses across the different types of infections: bacterial, fungal, and viral (Supplementary Material 3 - Table S1). However, specialists achieved a higher percentage of correct answers for bacterial infections (100%) compared to GPT-4o (75.0%) and residents (70.8%) (Figure 2).

Figure 2. Percentage of correct and wrong answers for the true or false questions about bacterial, fungal, and viral infections in the different groups.
Open-ended questions
When evaluating the accuracy and completeness of responses to open-ended questions across different groups, the observed differences in accuracy and completeness scores were statistically significant (Table 1). Specialists delivered more 6-point answers than both residents and GPT-4o, while GPT-4o excelled in completeness, achieving the highest percentage of 3-point scores.
Table 1. Accuracy and completeness for open-ended infection questions among GPT-4o, specialists, and residents. Results are presented as n (%).
Further comparisons between the three groups showed that specialists outperformed GPT-4o in accuracy (p = .008) (Figure 3).

Figure 3. Comparison of accuracy and completeness in open-ended questions among GPT-4o, specialists, and residents.
Clinical cases
Significant differences were observed among GPT-4o, specialists, and residents regarding both accuracy (χ² = 54.968) and completeness (Table 2).
Table 2. Accuracy and completeness for clinical case open-ended questions among GPT-4o, specialists, and residents. Results are presented as n (%).
Violin plots (Figure 4) further illustrate these group differences. For accuracy, specialists consistently achieved significantly higher scores than both GPT-4o and residents.

Figure 4. Comparison of accuracy and completeness in clinical case open-ended questions among GPT-4o, specialists, and residents.
In terms of completeness, specialists again significantly outperformed residents.
Discussion
ChatGPT has garnered significant global attention due to its ability to rapidly respond to queries and provide accurate information on a wide range of topics. It enhances knowledge, offers explanations, provides critical insights, facilitates conversational dialogs, and assists with various tasks.20
This study evaluated the performance of GPT-4o in comparison to residents and senior clinical microbiology experts in addressing bacterial, fungal, and viral infections, highlighting both the strengths and limitations of this advanced tool. The results indicate that GPT-4o performed well in answering true/false questions, consistent with findings from previous research.21
In open-ended questions, GPT-4o provided more comprehensive responses, often offering additional background information or explanations beyond what was expected. However, specialists outperformed GPT-4o in terms of accuracy, particularly in cases requiring detailed clinical knowledge. This breadth stems from ChatGPT's ability to process and generate language based on its training on vast amounts of data; notably, it does not equate to true understanding, as GPT-4o often provides broader answers than a question demands. This highlights a potential limitation in clinical decision-making, where precision and adherence to guidelines are paramount. Dyckhoff-Shen et al. emphasized the risks associated with AI-generated content that deviates from established medical guidelines, particularly in complex cases.15 Our study supports this concern: the additional information provided by GPT-4o, while comprehensive, may not always align with best practices and could introduce elements that are not clinically relevant, thereby complicating the decision-making process.
In the clinical case-based open-ended questions, specialists consistently outperformed both GPT-4o and residents in terms of accuracy and completeness. GPT-4o demonstrated better performance than residents in completeness, suggesting its potential to assist clinical reasoning for junior clinicians. However, GPT-4o frequently recommended higher-tier or broad-spectrum antibiotics, even when narrower-spectrum options would have sufficed. This highlights the importance of professional supervision in antibiotic selection to ensure appropriate and judicious use of antimicrobial therapy.
Additionally, ChatGPT does not disclose the sources of its information, which presents a risk for users who are unable to verify the reliability of the information provided. Yang et al. compared the responses of ChatGPT and Bard on treatment options for hip and knee osteoarthritis with the recommendations from the American Academy of Orthopaedic Surgeons (AAOS) Evidence-Based Clinical Practice Guidelines. The findings revealed that ChatGPT and Bard encouraged the use of non-recommended therapies in 30% and 60% of their responses, respectively. Notably, none of ChatGPT's responses cited any reference sources.22
Therefore, it is crucial to ensure that ChatGPT's training data is accurate, comprehensive, and unbiased. This can be achieved by using diverse, high-quality data sources and implementing regular model updates. Moreover, encouraging human experts to verify and confirm AI-generated information will help mitigate inaccuracies and biases.23,24
However, the proficiency demonstrated by GPT-4o in providing comprehensive medical advice highlights its potential as a supportive tool for medical students and junior doctors. Empirical research has shown that this model can instantly provide relevant information on a wide range of medical conditions and treatment options, making it a valuable resource for junior doctors seeking guidance in clinical decision-making. A study by Kung et al. further supports this potential, showing that ChatGPT was able to pass the USMLE®, indicating its capability to provide medical information that could assist in clinical decision-making.25
On the one hand, it is important to note that ChatGPT was not specifically developed for use in medical research. As a result, it lacks the necessary depth in medical knowledge, particularly in the mechanisms of diseases and treatments. However, its unexpectedly strong performance in providing foundational support for research and clinical practice suggests significant potential if developed in collaboration with the medical field. This indicates that, with proper collaboration, AI could bring revolutionary changes to medicine and healthcare. For instance, future efforts should focus on training AI models to have a comprehensive understanding of biological and medical sciences, thereby enhancing their accuracy and depth in analysis, diagnosis, and treatment planning. Moreover, appropriate ethical guidelines, limitations, and authorizations must be established to prevent unauthorized use and protect individual privacy. Additionally, further studies should be conducted in clinical settings using improved models to monitor their progress and effectiveness. On the other hand, misuse of and over-reliance on AI systems are also significant concerns that must be addressed. Healthcare professionals may become overly dependent on AI for making medical decisions, potentially overlooking the limitations and possible errors of the technology.26 ChatGPT often supplements its responses to critical questions by advising consultation with medical experts and emphasizing that the most up-to-date knowledge and guidelines should be used in clinical decision-making. This approach acknowledges the limitations of its recommendations and the potential dangers of relying solely on AI, particularly in complex clinical consultations, which could put patients at risk.
Despite the valuable insights gained from this study, several limitations should be acknowledged, which also provide directions for future research. First, this study evaluated only a single LLM, GPT-4o. Recent research has demonstrated significant variability in clinical performance across different LLMs.27 As such, our findings may not be generalizable to other models, including domain-specific or open-source LLMs. Future studies should compare multiple models under similar clinical evaluation frameworks. Second, while we applied a standardized role-based prompt to frame each clinical case, we did not implement advanced prompt engineering techniques. Prior studies have shown that prompt design can significantly affect LLM output quality and relevance.28 Thus, the absence of systematic prompt optimization may have impacted GPT-4o's performance and limits reproducibility. Third, the sample size was limited to three senior experts and three resident physicians, which may not fully represent the broader spectrum of expertise within the medical community. Future studies should include a larger and more diverse sample of participants to ensure generalizability. Fourth, the study focused on bacterial, fungal, and viral infections. While these are common and significant, expanding the scope to include other types of infections or medical conditions would provide a more comprehensive evaluation of GPT-4o's capabilities. Fifth, the accuracy and completeness of responses were assessed using a Likert scale, which, while rigorous, is still subjective to some extent. Future research could incorporate additional objective measures, such as time to correct diagnosis, to further validate the findings. Sixth, GPT-4o's responses were evaluated based on static queries and did not account for dynamic, real-time clinical interactions. Future studies should explore the model's performance in more interactive, real-world settings to better assess its practical utility in clinical practice. Seventh, the current study only included resident physicians and senior experts as human comparators. However, medical students and early-stage learners may benefit even more from LLM support, given their relative lack of clinical experience. Integrating such participants in future studies may help explore the educational value of LLMs at different levels of training. At the same time, the risk of uncritical acceptance of AI-generated responses may be higher among less experienced learners, highlighting the need for careful supervision and contextualization.
We anticipate that the integration of AI into medicine is inevitable, whether providers choose to embrace this revolution or not, particularly as LLMs are advancing at an astonishing pace. GPT-4o should be viewed as a supplementary tool in clinical practice, not as a replacement for the clinical judgment of healthcare professionals. Infectious disease specialists possess the essential expertise, training, and experience, and in clinical practice, the focus must always remain on accuracy and adherence to guidelines—areas where human expertise currently surpasses AI models.
Conclusions
In conclusion, while GPT-4o demonstrates significant potential as a supportive tool in infectious disease diagnostics and management, particularly in providing rapid and more comprehensive contextual information, it cannot yet replace the expertise of trained healthcare professionals. The model's limitations in accuracy underscore the necessity of maintaining human oversight in medical decision-making. Future research should focus on enhancing the precision and reliability of AI models by integrating a deeper understanding of life sciences and clinical guidelines into their training. Additionally, as AI continues to evolve and integrate into healthcare, ethical considerations, including the development of guidelines for AI use and the protection of patient privacy, will be crucial. Ongoing collaboration between AI developers and healthcare professionals is essential to ensure that these technologies are used effectively and safely to enhance, rather than replace, human judgment in healthcare.
Supplemental Material
sj-xlsx-1-dhj-10.1177_20552076251355797 - Supplemental material for Evaluating GPT-4o in infectious disease diagnostics and management: A comparative study with residents and specialists on accuracy, completeness, and clinical support potential by Lili Zhan, Xiumin Dang, Zhenghua Xie, Chaoying Zeng, Weixing Wu, Xiaoyu Zhang, Li Zhang and Xinjian Cai in DIGITAL HEALTH.
sj-docx-2-dhj-10.1177_20552076251355797 - Supplemental material for the same article.
sj-docx-3-dhj-10.1177_20552076251355797 - Supplemental material for the same article.
Acknowledgements
We would like to express our sincere gratitude to all those who contributed to the success of this study.
Ethical considerations
This study utilized de-identified clinical case materials that contained no personally identifiable patient information. The research protocol was reviewed by the Chinese Academy of Medical Sciences and Peking Union Medical College Shenzhen Hospital Ethics Committee, which determined that formal ethical approval was not required and granted an exemption. No patient consent was necessary for this study. The use of GPT-4o, an AI-based LLM, complied with OpenAI's terms of use, and all outputs were reviewed solely for research purposes.
Author contributions
LLZ conceptualized and designed the study, developed the methodology, oversaw data collection, contributed to the analysis and interpretation of the data, and drafted significant portions of the manuscript. XJC participated in the study design, conducted data analysis, contributed to result interpretation, and assisted in writing and revising the manuscript. XMD, ZHX, and CYZ assisted with data collection and contributed to the writing and revision of the manuscript. LZ provided expertise in infectious disease management, contributed to the development of the clinical scenarios, and reviewed the manuscript for important intellectual content. WXW and XYZ coordinated the study, facilitated communication within the research team, and contributed to the final editing and approval of the manuscript. All authors have read and approved the final version of the manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Sanming Project of Medicine in Shenzhen Municipality (grant number SZSM202311002).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
Owing to space constraints, not all data are provided within the article; these data can be shared upon request with any qualified researcher. The raw data supporting the conclusions of this article will be made available by the corresponding author upon reasonable request.
Supplemental material
Supplemental material for this article is available online.
References
