Abstract
Introduction:
The integration of large language models (LLMs) into medical education represents a significant paradigm shift, offering transformative potential in how medical knowledge is accessed and assimilated. However, these models have not yet been systematically trained or validated on complex subspecialty medical examinations. This study evaluates the performance of seven major LLMs in radiation oncology.
Materials and Methods:
The 2021 American College of Radiology (ACR) Radiation Oncology In-Training Examination (TXIT) was used to evaluate the performance of seven LLMs: OpenAI's GPT-3.5-turbo, GPT-4, and GPT-4-turbo; Meta's Llama-2 models (7 billion, 13 billion, and 70 billion parameters); and Google's PaLM-2-text-bison. The ACR provided publicly available national scoring for this examination. The examination comprised 300 questions across four major domains: clinical, biology, physics, and statistics. The examination was processed through each LLM via its application programming interface (API). LLM-generated answers were analyzed by domain and compared with radiation oncology trainee performance. The total cost of token inputs and outputs was aggregated and analyzed.
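The per-domain scoring and token-cost aggregation described above can be sketched as follows. This is an illustrative outline only: the sample records, answer keys, and per-token prices are hypothetical (real API pricing varies by model and over time), and only the four TXIT domain names come from the study.

```python
# Hypothetical sketch of the evaluation pipeline: tally LLM answers against
# the exam key by domain, and aggregate input/output token costs.
from collections import defaultdict

# Each record: (domain, model_answer, correct_answer, input_tokens, output_tokens).
# Values below are made-up examples, not actual TXIT questions or results.
records = [
    ("clinical",   "A", "A", 120, 4),
    ("clinical",   "B", "C", 110, 3),
    ("physics",    "D", "D",  95, 2),
    ("statistics", "C", "C",  80, 2),
]

# Illustrative per-1k-token prices in USD (assumption, not actual API pricing).
PRICE_IN_PER_1K = 0.01
PRICE_OUT_PER_1K = 0.03

def score_by_domain(records):
    """Return {domain: fraction of answers matching the key}."""
    hits, totals = defaultdict(int), defaultdict(int)
    for domain, answer, key, _, _ in records:
        totals[domain] += 1
        hits[domain] += (answer == key)
    return {d: hits[d] / totals[d] for d in totals}

def total_cost(records):
    """Aggregate token cost in USD across all questions."""
    tokens_in = sum(r[3] for r in records)
    tokens_out = sum(r[4] for r in records)
    return (tokens_in / 1000) * PRICE_IN_PER_1K + (tokens_out / 1000) * PRICE_OUT_PER_1K

print(score_by_domain(records))  # e.g. clinical 0.5, physics 1.0, statistics 1.0
print(f"${total_cost(records):.5f}")
```

In the study itself, this kind of bookkeeping was presumably run per model and then pooled to yield the domain-level accuracies and the $2.63 aggregate cost reported below.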
Results:
LLMs showed varied performance, with OpenAI's GPT-4-turbo leading at 74.2% correct answers and all three Llama-2 models underperforming (26.2%–43.3% correct). LLMs generally excelled in the statistics domain (93.0%–100%) but were less effective in the clinical domain (37.0%–68.0%), with the exception of GPT-4-turbo, which performed comparably (68.0%) with upper-level radiation oncology trainees (PGY4–5, 64.1%–68.3%) and outperformed lower-level trainees (PGY2–3, 51.6%–61.6%). Notably, GPT-4-turbo demonstrated a 7.0% improvement in the clinical domain over its predecessor, GPT-4. LLMs scored lowest in the gastrointestinal, genitourinary, and gynecologic disease sites and highest in bone and soft tissue, central nervous system, and head and neck. The overall cost of LLM inputs and outputs was modest at $2.63 across all seven models.
Conclusion:
GPT-4-turbo demonstrates clinical accuracy comparable to that of upper-level trainees and superior to that of lower-level trainees. Score discrepancies across disease site domains may be due to data availability, the complexity of the medical conditions, and the quality and quantity of training data sets. Future research will need to evaluate the performance of models fine-tuned on clinical oncology data. This study also underscores the need for rigorous validation of LLM-generated information against established medical literature and expert consensus, necessitating expert oversight in their application in medical education and practice.
