Evaluating the Performance of Large Language Models on Palliative Care Test Questions: A Mixed Methods Study

Abstract

Background:

Little is known about large language model (LLM) performance on palliative care (PC)-related knowledge-based tasks. We evaluated two LLMs in answering PC-related test questions and explaining their answer choice rationale.

Methods:

LLMs were prompted to answer 25 randomly selected questions from the Fast Facts Quiz and provide their answer choice rationale. Three PC educators ranked and rated LLM-generated answer choice explanations versus the test’s answer key explanations. Linear fixed-effect models evaluated reviewer ranking, and ordinal logistic regression evaluated reviewer ratings of quality, suitability, accuracy, relevance, and comprehensiveness.

Results:

Both LLMs answered 96% (24 of 25) of the selected questions correctly. Reviewers rated LLM-generated explanations more highly than Fast Facts Quiz explanations. Five themes emerged from reviewer comments: perceived inaccuracies, clarity of writing, educational value, linguistic style, and miscellaneous.

Conclusions:

LLMs demonstrated high answer choice accuracy and generated preferable answer explanations when compared to the Fast Facts Quiz answer key.

Keywords

chatbot; large language models; LLMs; mixed methods; palliative care
