
What You Should Know:
– Abridge, a clinical documentation platform, has released a new research paper titled “The Science of Confabulation Elimination”.
– The white paper details the company’s approach to classifying, detecting, and eliminating hallucinations—or unsupported claims—in AI-generated clinical notes before they are sent to clinicians for review. This initiative aims to set a new industry standard for transparency and trust in AI technology.
Critical Issue of AI Hallucinations
The rapid adoption of AI in healthcare, with Abridge’s platform now deployed in more than 150 health systems, has demonstrated clear benefits such as time savings and reduced clinician burnout. However, with this accelerated adoption comes the responsibility to ensure safety, accuracy, and quality. Abridge’s research paper addresses the critical issue of AI hallucinations, noting that errors in clinical documentation long predate AI: a 2020 study found that 21% of patients with access to their notes perceived a mistake, and 42% of those perceived mistakes were serious.
Abridge’s approach to quality involves a clear framework for categorizing unsupported claims. The company’s guidelines assess claims along two axes: “Support” and “Severity” (a simplified sketch follows the list below).
- Support: This axis evaluates whether a claim is explicitly supported by the conversation transcript, contradicted by it, or somewhere in between. For instance, a “directly supported” statement precisely matches the transcript with no meaningful deviation. In contrast, an “unmentioned” claim is not substantiated by the transcript and cannot be inferred from context.
- Severity: This axis assesses the potential impact of an unsupported claim. A claim is deemed to have “major severity” if it would likely have a negative impact on clinical care or has a non-trivial chance of leading to substantial harm. An example would be an entirely fabricated diagnosis. “Minimal severity” claims have little to no impact on clinical care, such as a minor change in wording that doesn’t affect decision-making.
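To make the two-axis grading concrete, here is a minimal sketch of how such a scheme could be represented in code. The label names and structure are illustrative assumptions based on the descriptions above, not Abridge’s published guideline schema.

```python
from dataclasses import dataclass
from enum import Enum


class Support(Enum):
    """Hypothetical labels for the Support axis."""
    DIRECTLY_SUPPORTED = "directly supported"  # precisely matches the transcript
    INFERABLE = "inferable from context"       # an in-between case
    UNMENTIONED = "unmentioned"                # not substantiated or inferable
    CONTRADICTED = "contradicted"              # conflicts with the transcript


class Severity(Enum):
    """Hypothetical labels for the Severity axis."""
    MINIMAL = "minimal"  # little to no impact on clinical care
    MAJOR = "major"      # likely negative impact or non-trivial chance of harm


@dataclass
class ClaimAssessment:
    claim_text: str
    support: Support
    severity: Severity


# Example: an entirely fabricated diagnosis would be graded as
# unmentioned on the Support axis and major on the Severity axis.
fabricated_diagnosis = ClaimAssessment(
    claim_text="Patient was diagnosed with atrial fibrillation.",
    support=Support.UNMENTIONED,
    severity=Severity.MAJOR,
)
```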
Purpose-Built Guardrails and Accuracy Benchmarks
Abridge has developed a system with “purpose-built guardrails” to ensure factual accuracy. The system has two main components: a proprietary AI model that detects unsupported claims in a draft note, and an automated system that corrects those claims. The detection model, built in-house, was trained on a curated dataset of more than 50,000 examples spanning open-source data and domain-specific clinical scenarios.
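The white paper does not publish the interfaces of these components, but a detect-then-correct loop of this kind can be sketched roughly as follows. The function names and the multi-pass structure are assumptions for illustration, with stand-ins for the two proprietary components.

```python
from typing import List


def detect_unsupported_claims(draft_note: str, transcript: str) -> List[str]:
    """Stand-in for the detection model: returns claims in the draft
    that are not supported by the conversation transcript."""
    raise NotImplementedError


def correct_claim(draft_note: str, claim: str, transcript: str) -> str:
    """Stand-in for the automated correction step: rewrites or removes a
    flagged claim so the note stays consistent with the transcript."""
    raise NotImplementedError


def apply_guardrails(draft_note: str, transcript: str, max_passes: int = 3) -> str:
    """Minimal detect-then-correct loop: re-check the note after each pass
    so corrections themselves cannot introduce new unsupported claims."""
    note = draft_note
    for _ in range(max_passes):
        flagged = detect_unsupported_claims(note, transcript)
        if not flagged:
            break
        for claim in flagged:
            note = correct_claim(note, claim, transcript)
    return note  # still reviewed by a clinician before filing in the EHR
```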
On an internal benchmark dataset of over 10,000 clinical encounters, Abridge’s solution significantly outperformed a general-purpose AI model. The Abridge system caught 97% of “confabulations,” while GPT-4o caught only 82%. In other words, the off-the-shelf model missed 18% of errors versus 3% for the Abridge system, six times as many.
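The six-fold figure follows from comparing miss rates rather than catch rates:

```python
# Share of confabulations each system fails to catch.
abridge_miss = 1 - 0.97  # 3% missed
gpt4o_miss = 1 - 0.82    # 18% missed

print(gpt4o_miss / abridge_miss)  # ≈ 6.0
```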
Despite this progress, Abridge stresses that clinician review of AI-generated notes remains essential before they are filed in the electronic health record (EHR). The platform includes features like Linked Evidence, which allows clinicians to verify each AI-generated summary by tracing it back to the original conversation transcript. The combination of these AI guardrails and human review is designed to ensure the factual accuracy of notes entered into the EHR.
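For readers curious what statement-level traceability can look like in data terms, here is a minimal, hypothetical sketch of summary statements linked back to transcript spans. The names and structure are assumptions for illustration, not the actual Linked Evidence implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TranscriptSpan:
    start_index: int  # position of the supporting utterance in the transcript
    end_index: int
    text: str


@dataclass
class NoteStatement:
    text: str
    evidence: List[TranscriptSpan]  # excerpts a clinician can click through


def unsupported_statements(note: List[NoteStatement]) -> List[NoteStatement]:
    """Statements with no linked transcript evidence are the ones a reviewer
    should scrutinize most closely before the note is filed in the EHR."""
    return [s for s in note if not s.evidence]
```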