Abstract
Large language models (LLMs) are increasingly applied in healthcare, yet their evaluation remains inconsistent and often disconnected from clinical practice. OpenAI's HealthBench represents an important advancement, encompassing 5,000 multiturn synthetic clinical conversations benchmarked against 48,562 clinician-developed criteria across accuracy, completeness, context awareness, communication, and instruction-following. Key strengths include broad scenario coverage, contributions from 262 clinicians across 60 countries, and automated grading methods that show high concordance with physician ratings. HealthBench provides a scalable and globally relevant framework. Nevertheless, important limitations constrain its clinical applicability. Exclusive reliance on synthetic dialogs limits ecological validity, and model-based graders may reinforce shared blind spots. Moreover, HealthBench assesses static, offline interactions while omitting multimodal inputs, longitudinal care, and patient outcomes, all factors critical to real-world decision-making. Without external validation, strong benchmark performance may not translate into improved diagnostic accuracy, workflow efficiency, or patient safety. To ensure safe and effective integration of LLMs into practice, future benchmarks must incorporate authentic clinical data, longitudinal outcomes, and system-level considerations. HealthBench is a valuable step, but evaluation strategies must evolve to capture the complexity and demands of frontline care.
OpenAI's HealthBench 1 represents an important advancement toward systematic and clinically grounded evaluation of large language models (LLMs) in healthcare. It includes 5,000 multiturn clinical conversations evaluated against 48,562 clinician-developed rubric criteria, covering axes such as “clinical accuracy,” “communication quality,” and “context awareness.” Among the 5,000 prompts, 4,248 (85.0%) are in English, 192 (3.8%) in Spanish, 173 (3.5%) in Portuguese, and the remaining 387 (7.7%) are distributed across 29 other languages. The authors address longstanding limitations of previous benchmarks: the lack of expert validation and the limited scope for measurable improvement. 1 At the same time, researchers have noted challenges in LLM evaluation using HealthBench. In this Viewpoint, we discuss the benchmark's key strengths and examine its limitations, highlighting areas for further research and the importance of real-world validation.
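To make the rubric-based design concrete, the following is a minimal sketch of how a single response can be scored against conversation-specific, point-weighted criteria. The criterion texts, point values, and met-criteria judgments are hypothetical, and the normalization shown (awarded points divided by the maximum achievable positive points, clipped to [0, 1]) is one plausible scheme rather than a definitive reproduction of HealthBench's implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str    # clinician-written rubric criterion
    points: int  # positive for desired behavior, negative for harmful behavior

def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Score one response: awarded points (including penalties) divided by
    the maximum achievable positive points, clipped to [0, 1]."""
    awarded = sum(c.points for c, m in zip(criteria, met) if m)
    max_positive = sum(c.points for c in criteria if c.points > 0)
    if max_positive == 0:
        return 0.0
    return min(max(awarded / max_positive, 0.0), 1.0)

# Hypothetical three-criterion rubric for an emergency-referral prompt.
rubric = [
    Criterion("Advises the user to seek emergency care", 8),
    Criterion("Asks about symptom onset and duration", 4),
    Criterion("Recommends an unsafe home remedy", -6),
]
print(rubric_score(rubric, met=[True, False, False]))  # 8 / 12 ≈ 0.67
```

In HealthBench, the per-criterion judgments themselves come from a model-based grader rather than from rule-based checks, which is why the meta-evaluation of grader reliability discussed below matters.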
Strengths
Comprehensive clinical scenarios: The benchmark encompasses seven diverse clinical themes (e.g. emergency referrals, global health) and evaluates five key behavioral dimensions (accuracy, completeness, context awareness, communication, instruction-following). This granularity enables precise assessments for developers, healthcare institutions, and regulatory organizations.

Robust expert engagement: Criteria development and validation involved a carefully selected cohort of 262 clinicians spanning 26 specialties across 60 countries. This broad expert base significantly enhances both the validity and global applicability of the benchmark.

Reliable and scalable evaluation: In the meta-evaluation, automated grading by GPT-4.1 demonstrated high agreement with physician evaluations (macro F1 = 0.71), comparable to interphysician agreement. This finding suggests that the automated grading approach is reliable and aligns well with physician ratings.
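Macro F1 averages the per-class F1 scores, so agreement on “criterion not met” judgments counts as much as agreement on “criterion met” judgments. As a brief illustration of how such model–physician agreement can be computed, the sketch below uses scikit-learn on invented labels; these are not HealthBench data.

```python
from sklearn.metrics import f1_score

# Hypothetical binary judgments ("criterion met?") for the same set of
# (response, criterion) pairs, one from physicians and one from the grader.
physician = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_grader = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Macro averaging weights the "met" and "not met" classes equally, so
# agreement on the rarer class is not drowned out by the majority class.
agreement = f1_score(physician, model_grader, average="macro")
print(f"macro F1 = {agreement:.2f}")
```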
Limitations
Despite these advances, several limitations constrain HealthBench's clinical applicability. First, HealthBench relies exclusively on synthetic conversations rather than actual clinical encounters. Although this practice is common in the field, it limits ecological validity: scripted dialogs cannot fully reproduce the ambiguity, incomplete information, and evolving context of real patient interactions. Second, the grading itself is model-based, and automated graders may share blind spots with the models they evaluate, reinforcing common errors rather than exposing them. Third, HealthBench assesses static, offline text interactions, omitting multimodal inputs, longitudinal care, and patient outcomes, all of which are critical to real-world clinical decision-making.
The path forward: From benchmarks to bedside
HealthBench provides a valuable framework of conversation-specific rubrics for assessing LLMs in healthcare. It has set a new standard for moving LLM evaluation beyond simple knowledge tests. However, its challenges with synthetic data, automated grading, and inability to fully capture the complexities of real-world clinical scenarios remind us that strong benchmark performance may not translate into improved diagnostic accuracy, workflow efficiency, or patient safety.
To ensure safe and effective integration of LLMs into practice, future evaluation strategies must evolve. We propose that the next generation of evaluation should include prospective, “silent-mode” clinical trials. 7 In such a study design, an LLM could be integrated into an EHR system to generate recommendations in real time based on live, multimodal patient data. These recommendations would be recorded for analysis but not shown to the treating clinicians. Investigators would compare LLM recommendations with clinician decisions at the encounter level and, critically, assess the association between model–clinician discordance and prespecified longitudinal outcomes (e.g. 30-day readmission, adjudicated diagnostic accuracy, and adverse events), with appropriate risk adjustment and multiplicity control. 6 This approach would provide high-quality evidence of clinical utility and safety without compromising patient care. By bridging the gap between benchmark performance and real-world impact, we can ensure that AI technologies truly serve the needs of patients and clinicians.
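To illustrate the encounter-level analysis such a silent-mode design enables, the sketch below regresses hypothetical prespecified outcomes on model–clinician discordance with simple risk adjustment and Holm multiplicity control. All column names and the input file are hypothetical; a real trial would require a prespecified statistical analysis plan, richer confounder adjustment, and accounting for clustering by clinician or site.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# Hypothetical encounter-level table from silent-mode logging: one row per
# encounter, with the LLM recommendation recorded but hidden from clinicians.
df = pd.read_csv("silent_mode_encounters.csv")
df["discordant"] = (df["llm_recommendation"] != df["clinician_decision"]).astype(int)

# Risk-adjusted association between discordance and each prespecified outcome.
outcomes = ["readmit_30d", "diagnostic_error", "adverse_event"]
pvalues = []
for outcome in outcomes:
    fit = smf.logit(
        f"{outcome} ~ discordant + age + comorbidity_index + acuity_score",
        data=df,
    ).fit(disp=False)
    pvalues.append(fit.pvalues["discordant"])

# Holm correction controls the family-wise error rate across outcomes.
rejected, adjusted, _, _ = multipletests(pvalues, alpha=0.05, method="holm")
for outcome, p_adj, sig in zip(outcomes, adjusted, rejected):
    print(f"{outcome}: adjusted p = {p_adj:.3f}, significant = {sig}")
```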
Footnotes
Author contributions
JL and SL contributed to the conceptualization, data curation, and literature review. JL and SL drafted the original manuscript. All authors contributed to the review and editing of the manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Guarantor
Siru Liu, PhD, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA. Email: siru.liu@vumc.org
