The Senate's recent 99-1 vote to strip a 10-year moratorium on state regulation of AI says something about the impact of AI, but also about its challenges.
A new MIT study, presented at the ACM Conference on Fairness, Accountability and Transparency, demonstrates that large language models (LLMs) used in healthcare can be surprisingly “brittle.” As discussed in Medical Economics, the researchers evaluated more than 6,700 clinical scenarios and introduced nine types of minor stylistic variations, such as typos, extra spaces, informal tone, slang, dramatic language, and even the removal of gender markers. Across all variants, these small changes altered AI treatment recommendations in clinically significant ways, including a 7-9% increase in the AI advising patients to self-manage rather than seek care, even when the medical content remained identical.
“These models are often trained and tested on medical exam questions but then used in tasks that are pretty far from that, like evaluating the severity of a clinical case. There is still so much about LLMs that we don’t know,” said Abinitha Gourabathina, lead author of the study.
Notably, these misinterpretations disproportionately impacted women and other vulnerable groups—even when gender cues were stripped from the input. In contrast, according to the study’s findings, human clinicians remained unaffected by such stylistic changes in deciding whether care was needed.
Why This Matters For Healthcare and Beyond
Healthcare Providers
- Patient safety risk: If a patient’s informal language or typos unintentionally trigger incorrect triage outcomes, that patient may be more likely to self-manage when in-person care is actually warranted, and serious conditions may go unflagged.
- Health equity concerns: The disproportionately poorer outcomes tied to messages from women and other vulnerable groups show how sensitivity to style, rather than substance, can amplify existing bias.
Professional Services
- Faulty legal, compliance, or regulatory advice: The same kind of “brittleness” that the MIT study suggests could compromise patient care may also produce inconsistent, inaccurate, or incomplete legal and compliance recommendations.
Human Resources
- Compliance and bias risks: Many LLM use cases in the employment arena likewise rely on prompts written by employees, and bias is already a significant concern in that setting. Without sufficient training, auditing, and governance, these tools may fall prey to the same limitations observed in the MIT study.
Governance, Testing & Accountability
Healthcare surely is not the only industry grappling with the new challenges LLMs present. Establishing a governance framework is critical, and organizations might consider the following measures as part of that framework:
- Pre-deployment auditing: LLMs should undergo rigorous testing across demographic subgroups and linguistic variants, covering not just idealized text but also informal, error-ridden inputs.
- Prompt perturbation testing: Simulate typos, tone shifts, missing or added markers, and even swapped gender pronouns to check for output stability (a brief sketch of such a test follows this list).
- Human validation oversight: At minimum, have subject matter experts (SMEs) review AI outputs, especially where high-stakes decision-making is involved.
- Developer scrutiny and certification: Before adoption and implementation, organizations deploying LLMs should understand and, to the extent necessary and appropriate, assess the efforts their developers have made to address these and other issues.
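The prompt perturbation testing item above lends itself to automation. What follows is a minimal sketch, in Python, of what such a harness might look like. The BASE_PROMPT text, the perturbation functions, and the query_model stub are all illustrative assumptions rather than part of the MIT study or any particular vendor's API; in practice, query_model would be replaced with a call to the organization's actual deployed model, and the scenarios would come from a much larger, demographically varied test set.

```python
import random
import string

# Illustrative base scenario; in a real audit this would come from the
# organization's own library of clinical or business test cases.
BASE_PROMPT = (
    "Patient reports chest tightness and shortness of breath for two days. "
    "Should the patient self-manage or seek in-person care?"
)

def add_typos(text: str, rate: float = 0.03) -> str:
    """Randomly replace a small fraction of letters to mimic typing errors."""
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and random.random() < rate:
            chars[i] = random.choice(string.ascii_lowercase)
    return "".join(chars)

def add_extra_spaces(text: str) -> str:
    """Insert stray double spaces between words."""
    return "  ".join(text.split(" "))

def make_informal(text: str) -> str:
    """Crudely shift the message toward informal, slangy phrasing."""
    return (text.replace("Patient reports", "hey so i've been having")
                .replace("Should the patient", "should i"))

PERTURBATIONS = {
    "baseline": lambda t: t,
    "typos": add_typos,
    "extra_spaces": add_extra_spaces,
    "informal_tone": make_informal,
}

def query_model(prompt: str) -> str:
    """Stand-in for the deployed LLM under test; replace with the real
    vendor client call. This toy keyword stub exists only so the sketch
    runs end to end."""
    return "seek care" if "chest" in prompt.lower() else "self-manage"

def run_perturbation_audit() -> None:
    """Compare each perturbed variant's recommendation against the baseline
    and flag any changes for human review."""
    baseline = query_model(BASE_PROMPT).strip().lower()
    for name, perturb in PERTURBATIONS.items():
        answer = query_model(perturb(BASE_PROMPT)).strip().lower()
        flipped = answer != baseline
        print(f"{name:15s} recommendation={answer!r} changed_from_baseline={flipped}")

if __name__ == "__main__":
    run_perturbation_audit()
```

Runs that flip the recommendation relative to the baseline can then be logged and routed to subject matter experts, tying the perturbation testing back to the human validation oversight measure above.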
The MIT research reveals that small changes in input, like a typo or a slangy phrase, can meaningfully skew outputs in potentially high-stakes contexts. For users of LLMs, including healthcare providers, lawyers, and HR professionals, that “brittleness” may heighten risks to safety, accuracy, fairness, and compliance. Strong governance is needed. And with the moratorium on state regulation of AI removed from the Big Beautiful Bill, should its removal hold, organizations are likely to see more attention given to governance, risk, and compliance requirements as legislation develops.