The Evolution of Health Care AI Benchmarking
Artificial Intelligence (AI) foundation models have demonstrated impressive performance on medical knowledge tests in recent years, with developers proudly announcing that their systems had “passed” standardized medical licensing exams or even “outperformed” physicians on them. Headlines touted AI systems achieving scores of 90% or higher on the United States Medical Licensing Examination (USMLE) and similar assessments. However, these multiple-choice evaluations presented a fundamentally misleading picture of AI readiness for health care applications. As we previously noted in our analysis of AI/ML growth in medicine, a significant gap remains between theoretical capabilities demonstrated in controlled environments and practical deployment in clinical settings.
These early benchmarks—predominantly structured as multiple-choice exams or narrow clinical questions—failed to capture how physicians actually practice medicine. Real-world medical practice involves nuanced conversations, contextual decision-making, appropriate hedging in the face of uncertainty, and patient-specific considerations that extend far beyond selecting the correct answer from a predefined list. Until recently, this gap between benchmark performance and clinical reality went largely unexamined.
HealthBench—an open-source benchmark developed by OpenAI and designed to be meaningful, trustworthy, and unsaturated—represents a significant advance in addressing this disconnect. Unlike previous evaluation standards, HealthBench measures model performance across realistic health care conversations, providing a comprehensive assessment of capabilities and safety guardrails that better aligns with how physicians actually practice medicine.
The Purpose of Rigorous Benchmarking
Robust benchmarking serves several critical purposes in health care AI development. It sets shared standards for the AI research community to incentivize progress toward models that deliver real-world benefits. It provides objective evidence of model capabilities and limitations to health care professionals and institutions that may employ such models. It helps identify potential risks before deployment in patient care settings. It establishes baselines for regulatory review and compliance, and perhaps most importantly, it evaluates models against authentic clinical reasoning rather than simply measuring pattern recognition or information retrieval. As AI systems become increasingly integrated into health care workflows, these benchmarks become essential tools for ensuring that innovation advances alongside trustworthiness, with evaluations of safety and reliability that reflect the complexity of real clinical practice.
HealthBench: A Comprehensive Evaluation Framework
HealthBench consists of 5,000 multi-turn conversations between a model and either an individual user or a health care professional. Responses are evaluated against conversation-specific rubrics written by physicians, comprising 48,562 unique criteria across seven themes: emergency referrals, context-seeking, global health, health data tasks, expertise-tailored communication, responding under uncertainty, and response depth. This multidimensional approach allows for nuanced evaluation along five behavioral axes: accuracy, completeness, context awareness, communication quality, and instruction following.
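To make the rubric mechanics concrete, the sketch below shows one way a rubric-based grade could be computed. The class names, point values, and the normalize-and-clip scoring rule are illustrative assumptions for exposition, not OpenAI's published implementation: each physician-written criterion is assumed to carry a signed point value, an example's score is the sum of points for criteria judged to be met, normalized by the total positive points available, and the benchmark score is the average across examples.

```python
from dataclasses import dataclass


@dataclass
class RubricCriterion:
    """One physician-written criterion, e.g., 'recommends emergency referral'."""
    description: str
    points: float  # positive for desired behaviors, negative for harmful ones
    axis: str      # e.g., "accuracy", "completeness", "context awareness"


@dataclass
class GradedExample:
    """A single conversation graded against its conversation-specific rubric."""
    theme: str                       # e.g., "emergency referrals"
    criteria: list[RubricCriterion]  # the rubric for this conversation
    met: list[bool]                  # grader's met/unmet judgment per criterion

    def score(self) -> float:
        """Earned points over total achievable positive points, clipped to [0, 1]."""
        earned = sum(c.points for c, hit in zip(self.criteria, self.met) if hit)
        possible = sum(c.points for c in self.criteria if c.points > 0)
        if possible == 0:
            return 0.0
        return max(0.0, min(1.0, earned / possible))


def benchmark_score(examples: list[GradedExample]) -> float:
    """Average per-example score across the benchmark."""
    return sum(e.score() for e in examples) / len(examples)


# Hypothetical two-criterion rubric for an emergency-referral conversation
example = GradedExample(
    theme="emergency referrals",
    criteria=[
        RubricCriterion("Advises calling emergency services", points=8, axis="accuracy"),
        RubricCriterion("Gives a definitive diagnosis without caveats", points=-5,
                        axis="communication quality"),
    ],
    met=[True, False],
)
print(f"Example score: {example.score():.2f}")  # 1.00: positive criterion met, penalty avoided
```

In HealthBench itself, the met/unmet judgment for each criterion is produced by a model-based grader whose decisions were validated against physician judgment, which is why that step appears here only as a boolean input.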
By focusing on conversational dynamics and open-ended responses, HealthBench challenges AI systems in ways that mirror actual clinical encounters rather than artificial testing environments—revealing substantial gaps even in frontier models and providing meaningful differentiation between systems that might have scored similarly on traditional multiple-choice assessments.
Physician-Validated Methodology
Developed in collaboration with 262 physicians across 26 specialties with practice experience in 60 countries, HealthBench grounds its assessment in real clinical expertise. These physicians contributed to defining evaluation criteria, writing rubrics, and validating model grading against human judgment. This physician-led approach aimed to develop benchmarks that reflect real-world clinical considerations and maintain a high standard of medical accuracy.
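One simple way to picture that validation step is an agreement check between the model-based grader and physician graders on the same rubric criteria. The helper below is a hypothetical sketch of such a check; the study's actual validation may rely on more refined statistics than a raw agreement rate.

```python
def grader_agreement(model_judgments: list[bool], physician_judgments: list[bool]) -> float:
    """Fraction of rubric criteria on which a model-based grader and a physician
    grader reach the same met/unmet judgment for the same response."""
    if len(model_judgments) != len(physician_judgments):
        raise ValueError("Judgment lists must cover the same criteria")
    matches = sum(m == p for m, p in zip(model_judgments, physician_judgments))
    return matches / len(model_judgments)


# Hypothetical judgments for five criteria from one graded conversation
model_grader = [True, False, True, True, False]
physician = [True, False, True, False, False]
print(f"Agreement: {grader_agreement(model_grader, physician):.0%}")  # 80%
```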
Notably, when physicians were asked to write responses to HealthBench conversations without AI assistance, their performance was weaker than that of the most advanced models, though physicians could improve responses from older models. This suggests that HealthBench’s evaluation approach captures dimensions of performance that go beyond memorized knowledge and may better reflect the nuances of human interactions, communication, and reasoning required in clinical practice.