As we noted in our previous blog post, HealthBench, an open-source benchmark developed by OpenAI, measures model performance across realistic health care conversations, providing a comprehensive assessment of both capabilities and safety guardrails that better align with the way physicians actually practice medicine.
In this post, we discuss the legal and regulatory questions HealthBench addresses, the tool’s practical applications within the health care industry, and its significance in shaping the future of artificial intelligence (AI) in medicine.
Legal and Regulatory Implications
Practice of Medicine Concerns
HealthBench’s sophisticated evaluation of AI models in health care contexts raises important questions about the unlicensed practice of medicine and corporate practice of medicine doctrines.
By measuring how models respond to clinical scenarios—particularly in areas like emergency referrals and clinical decision-making—HealthBench provides valuable metrics for assessing when an AI system might cross the line from providing general health information to engaging in activities that could constitute medical practice.
Unlike multiple-choice tests, which primarily assess factual knowledge, HealthBench evaluates models on the types of interactions that might trigger regulatory scrutiny—providing personalized clinical advice, making what could be interpreted as medical recommendations, or suggesting diagnoses that could merit higher reimbursement. This distinction is crucial as state medical boards and regulators develop frameworks for AI oversight that consider functional capabilities rather than simple knowledge recall. As we highlighted in Epstein Becker Green’s overview of Telemental Health Laws, the regulatory landscape for digital health varies significantly across jurisdictions, requiring careful navigation of state-specific requirements governing the corporate practice of medicine and the provision of telehealth services.
The benchmark’s ability to distinguish between appropriate responses to health care professionals versus general users also helps clarify when models are operating as clinical decision support tools versus direct-to-consumer health resources, a distinction increasingly important for compliance with state-specific corporate practice of medicine restrictions.
The benchmark can also help reveal whether a model is biased toward diagnoses carrying higher risk scores or reimbursement levels, which informs fraud and abuse considerations. It is equally important on the payer side: as AI is increasingly used for utilization management and prior authorization, medical-necessity determinations for an item or service must consider the individual’s clinical profile rather than rely solely on the output of a clinical decision-making tool.
EU AI Act and High-Risk Classification
Under the EU AI Act, AI systems intended for use as safety components in medical devices or for providing medical information used in clinical decision-making are classified as “high-risk.” HealthBench’s comprehensive evaluation framework—particularly its assessment of emergency referrals, accuracy, and safety—provides metrics directly relevant to demonstrating compliance with the risk management, technical documentation, and human oversight requirements imposed on high-risk AI systems under the Act.
The benchmark’s measurement of context awareness and the ability to recognize when uncertainty is present aligns with the Act’s requirements that high-risk AI systems appropriately consider limitations in their design and communicate these effectively to users. These capabilities cannot be meaningfully assessed through multiple-choice tests but are captured in HealthBench’s conversational evaluation approach.
Addressing Bias and Fairness
HealthBench’s global health theme evaluates whether models can adapt responses across varied health care contexts, resource settings, and regional disease patterns. This assessment helps identify potential biases in model responses that might disadvantage users from underrepresented regions or health care systems.
Traditional medical knowledge tests have often reflected Western medical education and practice patterns, potentially obscuring biases in how AI systems approach global health questions. HealthBench’s development with physicians from 60 countries who collectively speak 49 languages provides a foundation for assessing model performance across diverse populations, though further work on explicit bias evaluation remains an area for continued development.
Health Care Industry Implementation Considerations
Clinical Workflow Integration
HealthBench evaluates whether models can safely and accurately complete structured health data tasks—such as drafting medical documentation or enhancing clinical decision-making. These metrics help health care institutions assess how effectively AI models might integrate into existing clinical workflows and identify potential friction points before deployment. As the FDA continues to refine its approach to digital health technologies, the ability to demonstrate robust performance on realistic clinical tasks becomes increasingly important for regulatory clearance and market adoption.
Unlike knowledge-based examinations, HealthBench measures capabilities that directly translate to potential clinical applications, providing more actionable insights for implementation planning.
Patient-Provider Communication
The expertise-tailored communication theme measures whether models can distinguish between health care professionals and general users, tailoring communication appropriately. This assessment is crucial for deployment decisions, as models that cannot effectively modulate their responses based on user expertise may create confusion or misunderstanding in clinical settings.
Traditional benchmarks provide little insight into these communication capabilities, which represent core skills for physicians but are rarely captured in standardized testing environments. As ambient listening AI generates growing interest across the health care ecosystem, this benchmark can help determine whether the information captured represents an accurate clinical picture or a biased one. This is particularly relevant for complex clinical profiles, where poor clinical decision-making could result in medical malpractice, and for clinical profiles where improper coding could drive claims submissions that yield higher reimbursement. Conversely, the benchmark can help practitioners understand whether their ambient listening tools streamline their practice and permit them to be more efficient.
Risk Management and Liability
HealthBench’s evaluation of model reliability—particularly the “worst-at-k” metric, which measures how quickly worst-case performance deteriorates as the number of samples grows—provides essential metrics for health care risk management. The benchmark reveals that even frontier models like o3, while achieving an overall score of 60%, see their worst-at-16 score reduced by a third, indicating significant reliability gaps when handling edge cases.
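To make the metric concrete, the logic behind worst-at-k can be sketched in a few lines of Python. This is an illustrative simplification, not OpenAI’s evaluation code: the case data and function name are hypothetical, and real HealthBench scoring uses rubric-based grading of sampled model responses.

```python
import random
from statistics import mean

def worst_at_k(scores_per_case, k, trials=500, seed=0):
    """Estimate worst-at-k: for each case, draw k sampled response
    scores (with replacement) and keep the worst one, then average
    those minima across cases and repeated trials."""
    rng = random.Random(seed)
    trial_means = []
    for _ in range(trials):
        worsts = [min(rng.choices(case_scores, k=k))
                  for case_scores in scores_per_case]
        trial_means.append(mean(worsts))
    return mean(trial_means)

# Hypothetical per-case score samples (fraction of rubric points earned).
cases = [
    [0.9, 0.7, 0.8, 0.3],   # mostly strong, with an occasional poor answer
    [0.6, 0.6, 0.5, 0.6],   # consistently mediocre
    [1.0, 0.2, 0.9, 0.8],   # strong on average, but one bad outlier
]

print(round(worst_at_k(cases, k=1), 2))   # close to the plain average score
print(round(worst_at_k(cases, k=16), 2))  # dominated by each case's worst answer
```

The point the sketch illustrates is why aggregate averages can mask risk: as k grows, the metric converges toward each case’s worst response, so a model that is usually right but occasionally dangerous scores far lower at k=16 than its headline average suggests.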
This type of risk assessment is impossible with conventional multiple-choice evaluations, which typically report only aggregate scores without revealing the frequency or severity of concerning responses—a critical oversight when considering clinical deployment.
Future Directions and Limitations
While HealthBench represents a significant advancement in health care AI evaluation, it focuses primarily on conversation-based interactions rather than specific clinical workflows that may utilize multiple model responses. The benchmark also does not directly measure health outcomes, which ultimately depend on implementation factors beyond model performance alone.
Real-world studies measuring both quality of model responses and outcomes in terms of human health, time savings, cost efficiency, and user satisfaction will be important complements to benchmark evaluations like HealthBench.
Conclusion
HealthBench establishes a new standard for evaluating AI systems in health care, one that emphasizes real-world applicability, physician validation, and multidimensional assessment. By moving beyond the artificial constraints of multiple-choice tests toward evaluations that mirror authentic clinical practice, HealthBench provides a more meaningful and rigorous assessment of AI capabilities in health care contexts.
As health care organizations, technology developers, and regulators navigate the rapidly evolving landscape of health care AI, benchmarks like HealthBench provide important frameworks for ensuring that innovation advances alongside safety, quality, and ethical deployment.
By grounding progress in physician expertise and realistic scenarios, HealthBench offers a valuable tool for assessing both performance and safety as health care AI continues to evolve—ultimately supporting the responsible development of AI systems that can meaningfully improve human health.