Large Language Models Fall Short in Medical Accuracy Compared to Medical Professionals, Study Reveals

by Fred Pennic 07/22/2024

What You Should Know: 

– Kahun, a company specializing in evidence-based clinical AI, has released a new study comparing the medical capabilities of popular large language models (LLMs) to human experts. 

– The findings reveal the limitations of current LLMs in providing reliable information for clinical decision-making.

The Study: Comparing LLMs to Medical Professionals

  • LLMs Tested: OpenAI’s GPT-4 and Anthropic’s Claude3-Opus
  • Evaluation Method:
    • 105,000 evidence-based medical questions and answers (Q&As) were developed by Kahun based on real-world physician queries.
    • Q&As covered various medical disciplines and were categorized into numerical (e.g., disease prevalence) and semantic (e.g., differentiating dementia subtypes).
    • Six medical professionals answered a subset of Q&As for comparison.
  • Key Findings:
    • Both LLMs performed better on semantic questions (around 68% accuracy) than numerical questions (around 64% accuracy). Claude3 showed slight superiority in numerical accuracy.
    • LLM outputs varied significantly for the same prompt, raising concerns about reliability.
    • Medical professionals achieved significantly higher accuracy (82.3%) than the LLMs (Claude3: 64.3%, GPT-4: 55.8%) on identical questions.
    • LLMs showed a questionable ability to admit uncertainty, even though an “I don’t know” option was offered.
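
To make the two headline measures concrete, here is a minimal sketch of how accuracy against a reference answer and consistency across repeated runs of the same prompt could be computed. It is an illustration only, not the study's actual scoring method: the function names, the toy questions, and the rule that “I don’t know” counts as not correct are all assumptions made for this example.

```python
from collections import Counter


def accuracy(answers, gold):
    """Fraction of questions whose answer matches the reference answer.

    Assumption for this sketch: an "I don't know" response simply counts
    as not correct. The study may have scored such responses differently.
    """
    correct = sum(1 for question, answer in answers.items() if answer == gold[question])
    return correct / len(answers)


def consistency(runs):
    """Share of repeated runs that agree with the most common answer to one prompt."""
    most_common_count = Counter(runs).most_common(1)[0][1]
    return most_common_count / len(runs)


# Hypothetical toy data, invented for illustration (not from the study).
gold = {"q1": "A", "q2": "B", "q3": "C", "q4": "A"}
model_answers = {"q1": "A", "q2": "B", "q3": "D", "q4": "I don't know"}
repeated_runs = ["A", "A", "B", "A", "C"]  # same prompt asked five times

print(accuracy(model_answers, gold))  # 0.5
print(consistency(repeated_runs))     # 0.6
```

A consistency score well below 1.0 for a single prompt is the kind of output variability the findings describe: the same question yields different answers from one run to the next.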

Concerns and Implications for Clinical Use

The study highlights the limitations of current LLMs in a clinical setting due to:

  • Inaccurate medical information: Both LLMs provided incorrect answers for a significant portion of the questions, raising concerns about patient safety.
  • Lack of domain-specific knowledge: LLMs are trained on massive datasets that may not include high-quality medical sources.
  • Unreliable output: Variability in responses for the same prompt undermines the trustworthiness of LLM outputs.

Physician Concerns Confirmed

This study aligns with physician concerns about using generative AI models in clinical practice. Physicians emphasize the need for models trained on reliable medical sources and the importance of understanding the limitations of current technology.

While LLMs show promise, further development is needed to ensure their accuracy and reliability in clinical settings. In the meantime, solutions like Kahun’s offer a more secure and trustworthy path for AI integration into healthcare.

“While it was interesting to note that Claude3 was superior to GPT-4, our research showcases that general-use LLMs still don’t measure up to medical professionals in interpreting and analyzing medical questions that a physician encounters daily. However, these results don’t mean that LLMs can’t be used for clinical questions. In order for generative AI to be able to live up to its potential in performing such tasks, these models must incorporate verified and domain-specific sources in their data,” says Michal Tzuchman Katz, MD, CEO and Co-Founder of Kahun. “We’re excited to continue contributing to the advancement of AI in healthcare with our research and through offering a solution that provides the transparency and evidence essential to support physicians in making medical decisions.”

The full preprint draft of the study can be found here: https://arxiv.org/abs/2406.03855 

Tagged With: Artificial Intelligence, Large Language Model (LLM)

Copyright © 2026. HIT Consultant Media. All Rights Reserved.