Abstract
Background
Large language models (LLMs) are entering clinical workflows, yet their alignment with medical-ethical principles is unclear.
Objective
To evaluate how LLMs reconcile deontological and utilitarian reasoning in medically adapted “Trolley dilemma” scenarios.
Methods
We tested 14 LLMs, including GPT-o1-preview and DeepSeek-R1-Distill-Llama, using five medically adapted Trolley dilemmas sourced from philosophical and real-world scenarios. Each model generated 10 independent answers per case (overall 700 responses). Responses were forced to “yes” (utilitarian) or “no” (deontological). The primary outcome was the proportion of utilitarian choices. χ2 tests compared models.
Results
LLM responses varied. Some models, such as GPT-4 and Gemma, selected the utilitarian option in 20% of cases, while others, like Qwen-2-7B, reached 80% (p < .001). GPT-o1-preview and DeepSeek-R1-Distill-Llama selected utilitarian responses in 44% and 38% of cases, respectively. Some models selected the consequence-maximizing option in vignettes where doing so required overriding consent or endorsing intentional harm within the scenario framing. Seven of the fourteen models tested recommended impermissible medical actions, including nonconsensual limb amputation, killing a healthy person for organ harvesting, and transferring a patient to hospice against their wishes, in 44 of 700 outputs (6.3%).
Conclusions
In this constrained forced-choice stress test, models occasionally endorsed boundary-violating actions within the vignette framing. These results do not generalize to routine clinical decision-making, but motivate further benchmarked evaluation of refusal behavior and boundary constraints in ethically sensitive medical contexts. We must carefully define how these models influence clinical decisions, establish robust ethical guidelines for their use, and ensure that fundamental ethical norms are not compromised.
Introduction
Medical ethics addresses healthcare dilemmas by balancing professional duties, moral principles, and the public interest. Established principles in Western biomedical ethics guide clinical decision-making. However, they sometimes intersect with quality-of-life considerations, cultural beliefs, and moral philosophy.1–3
Two main approaches in moral philosophy guide such decisions: deontology and utilitarianism.4 Deontology judges actions based on duty and rules. Utilitarianism evaluates actions by their consequences.5 While both approaches shape healthcare ethics, utilitarian reasoning is infrequently applied to individual patient care.6 In clinical practice, principlism (autonomy, nonmaleficence, beneficence, and justice) typically structures bedside reasoning, whereas consequence-maximizing, population-level tradeoffs are more characteristic of policy and triage decisions.
The rapid rise of generative AI presents opportunities and challenges in healthcare.7 As large language models (LLMs) become part of healthcare workflows, they may influence decision-making in unforeseen ways.8,9 This raises concerns about alignment with established ethics.7 Their behavior in morally complex scenarios can be erratic. Understanding how LLMs navigate tradeoffs is key before deployment, and it is a first step toward developing effective mitigation strategies.10 To ensure safe clinical practice, we must guard against overreliance on algorithmic suggestions, particularly in ethically sensitive situations.11
The aim of this study was to assess how LLMs respond to medical versions of the "Trolley" dilemma (details on the Trolley dilemma are provided in the Supplemental Material).
Related work
Prior work has evaluated moral reasoning in language models using established benchmark suites (e.g. ETHICS and Moral Stories), which test normative judgments and behavior under moral constraints but are not healthcare-specific.12–14 In medicine, recent studies have directly evaluated LLMs on ethically sensitive healthcare decisions. Large-scale analyses show that LLM recommendations can vary with patient sociodemographic attributes even when clinical details are held constant, raising concerns for equity.15,16 Other studies report blind spots and inconsistent handling of constraints in medical ethics prompts, including failures in consent- and harm-related reasoning.17 Benchmark efforts specific to healthcare ethics include TRIAGE, which evaluates LLMs on mass-casualty triage dilemmas and tests sensitivity to ethical framing and adversarial prompting.18 Additional medical ethics benchmarks evaluate LLMs on ethics knowledge and scenario-based application, including explicit violation-type cases (MedEthicsQA; MedEthicEval).19,20 In parallel, several evaluations using bioethics question banks and structured ethics instruments show that LLM performance on ethics items is often lower than on clinical knowledge questions and can vary across repeated trials.21–24 In contrast to these broader evaluations, this research letter reports a directly controlled experiment using medically adapted trolley-style vignettes.
Methods
We designed prompts based on five medical versions of the “Trolley” dilemma from the literature. The cases included hypothetical scenarios by Peter Unger and Judith Jarvis Thomson, and more realistic scenarios like mass vaccination25–27 (Supplemental eFigure S1, eTable S1). Each scenario contrasted deontological and utilitarian approaches. Prompt development was grounded in established healthcare-related formulations of the dilemma from prior ethics literature. Two authors drafted the initial versions, which were then reviewed and approved by all coauthors to ensure conceptual accuracy, neutrality, and clinical relevance. The full prompts are provided in the Supplemental Material.
We required a binary answer (“yes” or “no”) to standardize outputs across models and enable direct comparison under identical prompts. This constraint simplifies ethical deliberation by precluding conditional responses, expressions of uncertainty, and multiprinciple balancing. Results therefore reflect behavior under a constrained probe, not full clinical ethics reasoning. We use “utilitarian” and “deontological” as descriptive labels for the two forced-choice options in each vignette; this terminology does not imply that one approach is ethically “correct,” and moral permissibility is framework- and culture-dependent.
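For illustration, a forced-choice instruction of this kind can be expressed as a simple template. The sketch below is a hypothetical paraphrase in Python, not the exact study wording, which is given in Supplemental eTable S3.

```python
# Hypothetical forced-choice template (illustrative paraphrase only; the
# exact study prompts appear in Supplemental eTable S3).
FORCED_CHOICE_TEMPLATE = (
    "{vignette}\n\n"
    "Answer with a single word, 'yes' or 'no', and nothing else. "
    "'Yes' means you would take the action described; "
    "'no' means you would refrain from it."
)

def build_prompt(vignette: str) -> str:
    """Insert one medically adapted trolley-style vignette into the template."""
    return FORCED_CHOICE_TEMPLATE.format(vignette=vignette)
```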
We tested 14 base LLMs (Supplemental eTable S2), spanning major developer families (OpenAI, Google, Meta, Alibaba (Qwen), and Microsoft), parameter scales (∼7B to 70B+), and the access types current at the time of testing. The prompt was created by two of the coauthors (V.S. and E.K.), revised and approved by all others, and required a definitive yes or no response; Supplemental eTable S3 details the exact prompts. Each model was tested ten times per dilemma, generating 50 responses per model and 700 model calls in total. Every prompt was submitted in a new session with cleared context, so the ten responses per dilemma were independent. All experiments used each platform's default sampling parameters, which are not guaranteed to be equivalent across providers and may influence response distributions; we therefore interpret comparisons as illustrative of out-of-the-box behavior rather than definitive rankings under standardized decoding.
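A minimal sketch of this evaluation loop is shown below; query_model is a hypothetical stand-in for each provider's API and is not part of the study code.

```python
from collections import Counter

N_RUNS = 10  # independent calls per model-dilemma pair

def query_model(model: str, prompt: str) -> str:
    """Hypothetical wrapper around a provider API: each call opens a fresh
    session (cleared context) and uses the platform's default sampling."""
    raise NotImplementedError

def evaluate(models: list[str], dilemmas: dict[str, str]) -> dict:
    """Collect N_RUNS forced yes/no answers for every model-dilemma pair."""
    counts = {}
    for model in models:
        for name, prompt in dilemmas.items():
            answers = [
                "yes" if query_model(model, prompt).strip().lower().startswith("yes") else "no"
                for _ in range(N_RUNS)
            ]
            counts[(model, name)] = Counter(answers)  # e.g. Counter({'no': 7, 'yes': 3})
    return counts
```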
We aggregated the proportions of “yes” (utilitarian) and “no” (deontological) responses and report binomial 95% confidence intervals on model-level proportions as a summary of run-to-run variability across repeated independent calls. Between-model differences were assessed with an omnibus χ2 test (14 models, two response categories), and Cramér's V is reported as an effect size.
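A minimal sketch of this analysis in Python is shown below, assuming the per-model yes/no counts have already been tallied; the table values are placeholders rather than study data, and the normal-approximation interval is shown as one common choice of binomial confidence interval.

```python
import numpy as np
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportion_confint

# Placeholder 14 x 2 table of (utilitarian, deontological) counts per model;
# each row sums to 50 responses (10 runs x 5 dilemmas). Not the study data.
table = np.array([[22, 28]] * 14)

chi2, p, dof, _ = chi2_contingency(table)  # omnibus test, dof = 13
n = table.sum()                            # 700 total responses
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Binomial 95% CI on each model-level utilitarian proportion.
cis = [proportion_confint(count=yes, nobs=yes + no, alpha=0.05, method="normal")
       for yes, no in table]
```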
Results
Across the five ethical dilemmas, the proportion of responses supporting a utilitarian approach varied widely. GPT-4-8k and the Gemma models were the least utilitarian, selecting the utilitarian approach in 20% of cases, while Qwen-2-7B reached 80%. Between-model differences were significant (χ2(13) = 78.38, p = 2.22 × 10⁻¹¹), with an omnibus effect size of Cramér's V = 0.335. DeepSeek-R1-Distill-Llama-70B and GPT-o1-preview had utilitarian response rates of 19/50 (38.0%; 95% CI 24.6%–51.4%) and 22/50 (44.0%; 95% CI 30.2%–57.8%), respectively. Table 1 details the overall proportions per model.
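As a consistency check, the reported effect size follows directly from the omnibus statistic: for a 14 × 2 table, Cramér's V = √(χ2 / (N × (k − 1))) = √(78.38 / (700 × 1)) ≈ 0.335, where N = 700 total responses and k = 2 response categories.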
Table 1. Overall proportion of utilitarian versus deontological responses across LLMs. LLM: large language model.
Boundary-violating responses, defined as endorsements of actions that conflict with commonly taught clinical ethics and legal constraints within the vignette framing (nonconsensual amputation, organ harvesting, or transfer to hospice against a patient's wishes), were generated in 44 of 700 outputs (6.3%). These outputs were produced by seven models, most often Qwen-2-7B (20/50, 40%), followed by Llama-3.1-8B (13/50, 26%), GPT-o1-preview (4/50, 8%), GPT-3.5-turbo-16k (3/50, 6%), Llama-3.1-70B (2/50, 4%), Qwen-2-72B (1/50, 2%), and Gemini-1.0-pro (1/50, 2%). The foot-amputation (n = 23, 3.3%) and ICU-allocation (n = 19, 2.7%) dilemmas accounted for most of these outputs, with few in the organ-harvesting scenario (n = 2, 0.3%) (Supplemental eTable S4). Aggregating across models and runs, utilitarian endorsements occurred in 44/420 (10.5%) responses to autonomy/consent dilemmas (foot, organ, ICU) versus 215/280 (76.8%) responses to population-level/end-of-life scenarios (vaccination, morphine).
Results per model and dilemma are shown in Figure 1, and the full results across all runs are detailed in Supplemental eTable S4. Intramodel consistency ranged from 84% to 100% (mean = 94.4%). Detailed examples of GPT-o1-preview and DeepSeek's reasoning can be found in the Supplemental Material.

Figure 1. Proportions of utilitarian responses for larger (a) and smaller (b) LLMs.
Discussion
Our results raise concerns about how LLMs respond to medically framed ethical dilemmas. Some models inconsistently favored utilitarian choices even when this required endorsing actions that conflict with widely recognized constraints (consent and the avoidance of intentional harm) within the vignette framing. Because the scenarios are simplified and outputs were constrained to binary choices, these findings should be interpreted as a stress test of model behavior under this prompting format rather than as evidence of real-world clinical ethics deliberation.
Some LLMs repeatedly endorsed giving morphine in doses that would cause death to a terminally ill suffering patient. This raises the question: can this be considered to align with the doctrine of double effect or an endorsement of euthanasia? The nature of AI's “intent” remains undefined, and ethically is not the same as human intent. It is unclear whether the doctrine of double effect can apply to AI-driven decisions.
Whether these boundary-violating endorsements reflect stochastic variability, prompt sensitivity, or systematic tendencies is less important than the fact that they occur under this constrained evaluation, because any such occurrence warrants caution in ethically sensitive uses. At the same time, the distinction matters for mitigation: systematic tendencies may be reduced through targeted alignment and safety tuning, whereas higher stochasticity or prompt sensitivity motivates robustness testing across prompts/decoding settings and stronger refusal safeguards.
Trolley-style scenarios are, by design, hypothetical and simplified. Their value lies in probing the limits of ethical alignment under controlled conditions. While the findings cannot be directly generalized to bedside care, they highlight areas where current models remain fragile and underscore the importance of careful, context-specific deployment and continuous monitoring in clinical environments. Utilitarian reasoning may be appropriate in population-level ethics but is generally not applicable to bedside decisions, which are grounded in professional duties to the individual patient.
We will need to better understand how LLMs handle medical ethics as they evolve. Bias in training data may reinforce majority viewpoints, disadvantaging minorities.15,28 Given the diversity of human ethical perspectives, and the lack of universal consensus, it is uncertain who will define the policies for LLM-based decisions. Although LLMs operate through input–output generation, their behavior reflects the ethical norms embedded in the human-generated data and alignment procedures on which they are trained.
Variability in model responses may partly reflect the cultural composition of training data, which can encode divergent ethical norms. For instance, models trained predominantly on text from collectivist cultures may prioritize communal benefit, whereas those exposed mainly to individualist sources may emphasize autonomy. Differences in responses to the vaccination scenario, for example, may partly reflect regional and cultural variation in training data and public attitudes toward collective health measures. These cultural priors could potentially influence how models resolve moral tradeoffs. Recognizing and monitoring such influences is essential for context-specific deployment and for developing transparent documentation of model provenance and alignment practices.
Limitations of this study include a focus on a narrow set of dilemmas and simplified prompts. Furthermore, the study primarily focused on deontological and utilitarian frameworks and did not capture the full spectrum of ethical reasoning that LLMs might encounter. By design, we limited the analysis to a deontological-versus-aggregate-benefit contrast to enable standardized, comparable measurements within a brief format. Constraining outputs to binary “yes/no” responses simplifies rich moral reasoning and may obscure justificatory nuance; we chose this design to standardize comparisons across models, but real clinical ethics requires deliberation that cannot be captured by dichotomous choices. Medical decision-making typically involves multiple ethical principles, including autonomy, nonmaleficence, beneficence, and justice, so our deontological–utilitarian contrast represents a simplified operationalization that allows a controlled comparison of model tendencies rather than an exhaustive ethical analysis. Moreover, ethical decision-making is highly context dependent, influenced by cultural, legal, and individual factors that cannot be captured in a few examples with simple prompts.
Additionally, the models’ responses may vary with different prompts or conditions, and our findings do not generalize to all possible interactions. Finally, the dilemmas used in this study are hypothetical and do not reflect the full complexity of real-world clinical decision-making. However, for LLMs, all decisions are essentially input–output processes: when similar inputs recur in real-world scenarios (e.g. a terminally ill patient in pain), the model could generate the same outputs. Although no human participants or data were involved, studying unethical recommendations carries risks of normalizing impermissible actions and dual-use misuse. We mitigated these risks by using hypothetical scenarios and by presenting the findings primarily as a stress test of alignment and safety guardrails, not as clinical guidance.
Conclusion
In this vignette-based stress test of medically adapted trolley dilemmas, LLMs showed meaningful variation in how they resolved forced ethical tradeoffs, and several models occasionally endorsed actions that conflict with core clinical constraints within the scenario framing (such as consent and avoidance of intentional harm). Because the scenarios are simplified and responses were restricted to binary choices, these findings should be interpreted as evidence of brittleness under constrained probing rather than as a model of bedside ethical deliberation. Nonetheless, the results support routine boundary/refusal testing, reporting of prompt and sampling sensitivity, and clear role limits with human accountability when LLMs are used in ethically sensitive clinical contexts.
Footnotes
Ethical approval
Ethical approval was not required as the study involved no human participants, patient data, or clinical intervention. All scenarios were hypothetical.
Contributorship
Vera Sorin and Eyal Klang conceived the study and drafted the manuscript; Vera Sorin, Eyal Klang, and Benjamin Glicksberg conducted the statistical analysis. All authors contributed to the interpretation of the results, revised the manuscript critically for intellectual content, and approved the final version.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data sharing
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.
Guarantor
Vera Sorin and Eyal Klang are the guarantors of this work and affirm that the manuscript is an accurate and transparent account of the study.
Supplemental material
Supplemental material for this article is available online.
References