Abstract
Introduction
Large language models (LLMs) hold immense potential to serve as clinical decision-support tools for Earth-independent medical operations. However, incorrect information generated by these models may mislead, or even harm, care delivered in this setting.
Method
To better understand this risk, this work tested two publicly available LLMs, ChatGPT-4 and Google Gemini Advanced (1.0 Ultra), as well as a custom Retrieval-Augmented Generation (RAG) LLM, on factual knowledge and clinical reasoning drawn from published aerospace medicine material. We also evaluated how consistently the two public LLMs answered self-generated board-style questions.
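The Method names a custom Retrieval-Augmented Generation (RAG) LLM without describing its construction. The following is a minimal sketch of the general RAG pattern, retrieving the reference passages most relevant to a question and prepending them to the model prompt. The corpus, the question, and the `call_llm` stub are illustrative assumptions, not the authors' implementation.

```python
# Minimal RAG sketch: retrieve relevant passages, then prompt an LLM
# with them as context. Placeholder corpus and LLM call; not the
# study's actual pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical knowledge base of aerospace-medicine reference passages.
corpus = [
    "Spaceflight-associated neuro-ocular syndrome involves optic disc edema.",
    "Cabin decompression requires rapid donning of oxygen masks.",
    "Bone density loss in microgravity is mitigated by resistive exercise.",
]

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k corpus passages most similar to the question (TF-IDF)."""
    vectorizer = TfidfVectorizer()
    matrix = vectorizer.fit_transform(corpus + [question])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    top = scores.argsort()[::-1][:k]
    return [corpus[i] for i in top]

def call_llm(prompt: str) -> str:
    """Placeholder for an LLM API call (e.g., a chat-completion endpoint)."""
    return f"[model response to prompt of {len(prompt)} characters]"

def answer(question: str) -> str:
    # Ground the prompt in retrieved passages to reduce unsupported answers.
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return call_llm(prompt)

print(answer("How is bone loss countered in microgravity?"))
```

In practice the retriever would index a curated aerospace medicine reference library (dense embeddings are a common alternative to TF-IDF), with the aim of reducing the fabricated answers the Introduction warns about.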
Results
When queried with 857 free-response questions from
Conclusion
Given the anticipated rapid pace of development, including advances in model training, data quality, and fine-tuning methods, LLMs hold considerable promise for autonomous medical operations in spaceflight.
