Sage Journals: Discover world-class research

Abstract

Background and objective

As the use of artificial intelligence (AI) in medicine expands, applications of GPT (generative pretrained transformer) assimilate into the world of medical education. Our objective was to rigorously evaluate the capacity of the GPT algorithm to enhance its methodology for generating differential diagnoses. We hypothesize that the success of GPT would contribute significantly to the advancement of pedagogical strategies in medical education.

Methods

ChatGPT-4o was provided with three common clinical scenarios and was asked to give three lists of four differential diagnoses. Through iterative feedback and targeted instruction, we systematically documented the GPT responses as they demonstrated progressive improvements in the accuracy of differential diagnoses. The study includes four discussions, each succeeded by feedback and assessment of ChatGPT's responses. The study took place in the education authority of the Chaim Sheba Medical Center, Israels’ largest hospital, during a 1-month period.

Results

For all four clinical scenarios, GPT performance was significantly improved right after the initial human feedback, with higher notable advancement and implementation of feedback in the third and fourth discussions. GPT effectively assimilated the technical recommendations, resulting in differential diagnoses that achieved complete concordance with the intended diagnoses (100% accuracy).

Conclusion

GPT-4o demonstrates a robust capacity for learning and operating appropriate methodologies within the clinical reasoning process essential for accurate differential diagnosis. Our findings will guide future directives that will be taught to human medical students.

Keywords

GPT artificial intelligence medical education differential diagnosis clinical reasoning Chat-GPT

Introduction

The importance of differential diagnosis in clinical practice

Establishment of a proper differential diagnosis (DD) list as part of physicians’ process of taking care of their patients, stands as a fundamental component in the art of clinical medicine.¹ Inaccuracy of the DD list might miss essential, life endangering diagnoses and at the same time might include diagnoses that are futile, necessitate expensive resources and endanger the patients with unneeded measures of diagnosis and empirical therapies.² This foundational competency remains critically important, despite the exponential expansion of medical knowledge and diagnostic technologies over the past decade. This is further emphasized by the sad fact that misdiagnosis remains a significant public health issue,³ with diagnostic errors still contributing to as far as 10% of deaths and 6% to 17% of hospital adverse events.⁴

The realm of artificial intelligence (AI) has opened new avenues in all medical disciplines, with an emphasis on competencies of medical diagnostics, particularly through Large Language processing Models (LLMs) like GPTs. GPT-4o, developed by OpenAI, has demonstrated remarkable capabilities in understanding and processing medical language, potentially revolutionizing medical education and clinical decision support systems.^4–6

Challenges confronting medical educators in the task of teaching the art of differential diagnosis

In the twenty-first century, where medical knowledge and technology are continuously evolving, medical educators have to face new challenges in the task of teaching the art of differential diagnosis to their students. Clinical tutors are in need of persistently acquiring and updating their body of knowledge to keep in pace with updated medical advancements.⁷ Assimilating novel information and effectively transmitting it to medical students constitutes a pivotal aspect of contemporary medical education. Continuous inquiry and investigation of different educational and teaching techniques and methodologies is another fundamental need.⁸

There is a continuous need to cultivate the students’ clinical thinking in a manner that will allow them to create proper implantation of their knowledge in the process of generating the desired list of possible diagnoses.⁹ Finally yet importantly is the mission of educating students how to subjectively assess the clinical presentation and reach the most precise list of appropriate diagnoses.¹⁰

The enormous “price” of wrong clinical reasoning would be the consequences of diagnostic error—in the worst-case scenario—the patient's death.¹¹

Development of teaching methodologies is a tough mission. Medical educators do not have a good field of experimental methodologies. Once you taught the students, in most cases, they proceed to the next clinical discipline. At present, there is a conspicuous absence of a “medical reasoning simulator” that would allow clinical educators to empirically evaluate and refine their instructional methodologies for imparting clinical reasoning skills.

Artificial intelligence and GPTs potential in the realm of medical education

Passing the United States Medical Licensing Exam (USMLE) was, doubtfully, one significant credentials of ChatGPT.¹² Gaining instant, endless access to the whole world-wide-web medical, professional resources, along with advanced capabilities for free-language-processing, makes AI and GPTs a potential valuable tool in the realm of medical education.¹³ Nevertheless, success in the USMLE is based, to a great extent, on knowledge-based answering rather than clinical reasoning.¹⁴

Customizing and refining ChatGPT's algorithm in alignment with relevant curricula may substantially assist clinical educators in developing medical educational content, facilitating the resolution of clinical reasoning cases, elucidating complex pathological processes, and fostering an interactive learning environment with constructive feedback for students. Furthermore, in the setting of bedside-teaching, AI applications could assist the students in the learning process of customizing therapeutic/rehabilitation plans based on patient history, physical examination and further, workup data.¹⁵

The aim of the current study

In the current study we aimed at endowing GPT with guidance of attaining a better ability of clinical reasoning in order to properly execute the task of establishing correct lists of differential diagnosis for common clinical scenarios. We posit that success in this endeavor would not only inform the development of future GPT-based diagnostic algorithms but also yield valuable insights for optimizing clinical teaching methodologies for medical students. We hypothesize that endowing GPT with the right tactics will prove these to be worthy for teaching human subjects.

Methods

In the current study we used the OpenAI application designated as ChatGPT-4o, the latest, most advanced version of this generative LLM algorithm, which was available worldwide during the study period. Figure 1 depicts the methodology/process we executed in the process of teaching the LLM the basic art of establishing a differential diagnosis.

Figure 1.

The methodology of the current study.

Statistical methods

This was a qualitative, proof-of-concept study in which the LLM was instructed by prompting and a further conversation evolved. Our results are based on verbal-content analysis of the LLM performance.

LLM performance results

We presented the LLM with the first clinical case presentation and asked for a list of four diagnoses, stating the explicit request for a “differential diagnosis.” After it generated its answer we gave it the first feedback. The feedback content had only one educational item: “please calculate, for each diagnosis, the multiplication of its prevalence by its extent of endangering the patient's life.” Than LLM generated a new list of DD (2nd attempt) and the recommendation was re-iterated. Afterwards, a second, different clinical case presentation was presented to the LLM and once again, a list of DD was asked. In this case, only one feedback was given, once again relating to the same methodology of listing differential diagnoses. Afterwords, a third clinical case was presented, and a new list of DD was asked. This time, no feedback was given to LLM. All cases, as presented in the results section were common clinical cases.

Results

The resultant conversations, based on the described methodology are presented in Figures 2 to 4 with the initial case presentations, the preliminary differential diagnosis given by ChatGPT and the following “prompts” and “answers by GPT” demonstrating the stepwise advancement of the algorithm. Figure 2 presents a case of “chest pain and fever,” while Figure 3 represents second attempt of giving differential diagnosis to the same case that was presented in Figure 2 following our feedback. Figure 4 delineates the case of “headache and fever,” whereas Figure 5 outlines the case of rash and fever.

Figure 2.

Conversation with generative pretrained transformer (GPT) regarding a case of “chest pain and fever.”

Figure 3.

Asking, for the second time, differential diagnosis (DD) for “chest pain and fever.”

Figure 4.

Conversation with generative pretrained transformer (GPT) regarding the case of “headache and fever.”

Figure 5.

Conversation with generative pretrained transformer (GPT) regarding the case of “rash and fever.”

Discussion

This study presents solid evidence of ChatGPT's exceptional capacity to shape its clinical output through iterative feedback, adeptly simulating a learning curve typical of medical trainees.

This investigation demonstrated the potential of the AI algorithm as a useful instrument for healthcare education and decision assistance by carefully examining its ability to learn and adapt in a medical setting, as exemplified by previous researchers.^16,17

Initially, the model supplied differential diagnoses that lack clear prioritization rationale and were mostly based on probabilistic connections. Nonetheless, following differentiated instruction with feedback sessions, with respect to the methodology of “Prevalence × Relative Risk,” the AI algorithm illustrated high-speed “in-context learning,” significantly restructuring its diagnostic hierarchy, prioritizing crucial variables, such as severity, prevalence, and clinical urgency. This transition from a generic list to a risk-stratified differential diagnosis highlights the prospect of LLMs to move beyond data retrieval toward replicating structured clinical reasoning.

ChatGPT demonstrated a significant capacity to apply learned ideas as the study moved on to the third and fourth conversation. The steady improvement in performance demonstrates ChatGPT's ability to absorb and operationalize essential clinical reasoning abilities, successfully utilizing them in other real-world scenarios.¹⁸

This thinking process reflects the principles underlying contemporary human feedback reinforcement learning models, exemplified by the methodology presented by Wu et al.¹⁹ Similar to how their system employs cyclic feedback loops to maximalize personalized learning tutorials. The results indicate that focused human guidance can function as an effective reinforcement strategy for refining the clinical reasoning capabilities of LLMs.¹⁹

A crucial investigation of the model's effectiveness shows that although the GPT possesses extensive encyclopedic information, exhibited by its success in exams such as the USMLE, it lacks the innate “clinical instinct” to rank life-threatening scenarios over common ones. Our findings imply that the model's preliminary “shallow” reasoning can be fortified by applying a cognitive scheme, effectively engaged a form of “Chain-of-Thought” provoking targeted instructions methodology. This diminished the possibility that a diagnosis would be prioritized based purely on its semantic likelihood in the training data by requiring the model to assess two different characteristics for every possible diagnosis before prioritizing them.

ChatGPT's demonstrating a learning curve and flexibility in this medical-educational context have wide ramifications. This parallels the educational framework supplied to medical students, revealing that AI, like human students, claim methodological restrictions to implement information safely.

According to our results, ChatGPT and other AI systems could potentially become effective teaching aids in medical education programs. Such technologies may give medical students and trainees a secure and regulated setting to improve their diagnostic and clinical reasoning skills by simulating clinical circumstances and giving instant feedback.

Medical students can utilize the ChatGPT to improve their clinical thinking process by its potential ability to give them triage for their differential diagnosis including detailed profound assessment for each diagnosis, focus their clinical thinking process while giving them instructions how to improve it, marking red flags and high value patterns. Another option that may be useful is asking ChatGPT to direct them through a conversation to the most relevant differential diagnosis.

Also, ChatGPT capacity to create medical clinical cases can be useful by medical educators for creating simulation scenarios according to current curriculum while adding appropriate distractions, differentials, and guide map of the correct clinical thinking approach. The tutors can also make comparison between the ChatGPT answers and the students, which can help them to recognize and address the exact and/or methodological mistakes made by students, such as misdirection of logic thinking or lack of knowledge.

Additionally, the findings suggest that ChatGPT may serve as a supplement to clinical decision-making systems. The AI system's capacity to quickly evaluate enormous volumes of medical data and produce prioritized differential diagnoses is not meant to replace human medical competence, but it may be a useful aid for medical personnel, especially in complicated or urgent cases.

Prompt understanding of the model's trajectory can be obtained by contrasting our findings with those of earlier researches. Whereas Katz et al.¹² stated that GPT performs remarkably well on standardized knowledge-based exams, others like Shikino et al.⁴ have underlined the model's restrictions in addressing uncommon as well complicated clinical presentations when unsupported.

Our research providing a structural framework, that bridges the gap by illustrating that the reasoning weakness, is frequently caused by inadequate prompting commands in lieu of a long-term algorithmic deficit. This reinforces the premise that LLMs should be considered not as autonomous diagnosticians, but as modified learners that match the academic demands of human medical trainees.

Limitations

Although the result of our study is promising, it is essential to acknowledge certain limitations. First, the study included four distinct clinical cases, resulting in a limited sample size. Although these constitute typical complaints, they do not encompass the whole scope of medical intricacy, especially in cases involving rare diseases or numerous comorbidities where the “Prevalence × Risk” heuristic is not adequate.

Second, the methods employed to “teach” the model utilized a streamlined mathematical approach, specifically multiplying prevalence by risk. Diagnostic reasoning in clinical practice is nonlinear and frequently depends on pattern recognition and clinical intuition, which cannot consistently simplified to a multiplicative factor. As a consequence, the model's performance in these controlled settings may not entirely be generalized to the unpredictable nature of clinical practice at the bedside.

Third, the evaluation of the AI's advancements was carried out by the authors. While the enhancement of the differential diagnosis list was clear, the absence of a blinded, independent panel of specialists to assess the replies presents a risk of evaluation bias.

Finally, the version in use (GPT-4o) may be updated periodically by OpenAI, reflecting the evolving nature of LLM-based research. As a result, the exact answers documented may differ in subsequent versions of the model, which introduces uncertainty regarding the long-term replicability of certain prompts unless ongoing recalibration is performed.

Conclusion

Our research demonstrates that ChatGPT-4o functions as an adaptive learner, which has the capability to obtain clinical reasoning skills when provided with targeted feedback, representing a progression beyond basic knowledge retrieval. By incorporating the “Prevalence × Risk” methodology, the model effectively advanced from generating generic lists to providing risk-stratified differential diagnoses. These results underscore the capacity of large language models to function as advanced, interactive simulators and as auxiliary decision-support tools in medical curricula programs. Integration of adaptable AI into medical education may substantially improve the development of diagnostic reasoning skills within secure, simulated settings.

Footnotes

ORCID iD

G. Segal

Ethical approval

No human beings were tested during this study and there was no need for ethical approval.

Contributorship

CY, HR, SG, and SR contributed equally to the initial research question and design, to the preliminary drafting and the writing of the final manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Maude

. Differential diagnosis: the key to reducing diagnosis error, measuring diagnosis and a mechanism to reduce healthcare costs. Diagnosis (Berl) 2014; 1: 107–109.

Sci-Hub. The key role of differential diagnosis in diagnosis. Diagnosis 2017; 4: 239–240. doi: 10.1515/dx-2017-0005

Hautz

Kämmer

Hautz

, et al. Diagnostic error increases mortality and length of hospital stay in patients presenting through the emergency room. Scand J Trauma Resusc Emerg Med 2019; 27: 1–12.

Shikino

Shimizu

Otsuka

, et al. Evaluation of ChatGPT-generated differential diagnosis for common diseases with atypical presentation: descriptive research. JMIR Med Educ 2024; 10: e58758.

Harada

Katsukura

Kawamura

, et al. Effects of a differential diagnosis list of artificial intelligence on differential diagnoses by physicians: an exploratory analysis of data from a randomized controlled study. Int J Environ Res Public Health 2021; 18: 1–8.

Ortiz-Posadas

Martinez-Trinidad

Ruiz-Shulcloper

. A new approach to differential diagnosis of diseases. Int J Biomed Comput 1996; 40: 179–185.

Densen

. Challenges and opportunities facing medical education. Trans Am Clin Climatol Assoc 2011; 122: 48.

Ang

BWG

Soh

, et al. Methods to improve diagnostic reasoning in undergraduate medical education in the clinical setting: a systematic review. J Gen Intern Med 2021; 36: 2745–2754.

Eva

. What every teacher needs to know about clinical reasoning. Med Educ 2005; 39: 98–106.

10.

Sci-Hub. Teaching differential diagnosis to nurse practitioner students. J Nurse Pract 2018; 14: e207–e212.

11.

Sci-Hub. The practice of medicine: understanding diagnostic error. J Nurse Pract 2020; 16: 582–585.

12.

Katz

Cohen

Shachar

, et al. GPT versus resident physicians—a benchmark based on official board scores. NEJM AI 2024; 1:1–8.

13.

Heng

JJY

Teo

Tan

. The impact of chat generative pre-trained transformer (ChatGPT) on medical education. Postgrad Med J 2023; 99: 1125–1127.

14.

Yaneva

Baldwin

Jurich

, et al. Examining ChatGPT performance on USMLE sample items and implications for assessment. Acad Med 2023; 99: 192–197.

15.

Baig

Hobson

GholamHosseini

, et al. Generative AI in improving personalized patient care plans: opportunities and barriers towards its wider adoption. Appl Sci 2024; 14: 10899.

16.

Chew

Ngiam

. Artificial intelligence tool development: what clinicians need to know? BMC Med 2025; 23: 1–19.

17.

Davenport

Kalakota

. The potential for artificial intelligence in healthcare. Future Healthc J 2019; 6: 94–98.

18.

Wong

Fayngersh

Traba

, et al. Using ChatGPT in the development of clinical reasoning cases: a qualitative study. Cureus 2024; 16: e61438.

19.

Wang

Zhang

, et al. A tutorial-generating method for autonomous online learning. IEEE Trans Learn Technol 2024; 17: 1532–1540.

Generative pretrained transformer (GPT) algorithm can be taught to improve the task of differential diagnosis according to simple principles of clinical reasoning. The case of GPT-4o and a lesson to apply in medical education

Abstract

Background and objective

Methods

Results

Conclusion

Keywords

Introduction

The importance of differential diagnosis in clinical practice

Challenges confronting medical educators in the task of teaching the art of differential diagnosis

Artificial intelligence and GPTs potential in the realm of medical education

The aim of the current study

Methods

Statistical methods

LLM performance results

Results

Discussion

Limitations

Conclusion

Footnotes

ORCID iD

Ethical approval

Contributorship

Funding

Declaration of conflicting interests

References