What does it mean when AI “beats” physicians in solving challenging case studies?
September 4, 2025
Summary: Large language models are improving quickly, but they aren’t likely to displace experienced clinicians anytime soon.
The New England Journal of Medicine publishes a challenging case study each week that most physicians have trouble solving. Each “Case Records of the Massachusetts General Hospital” installment provides an initial clinical vignette; a visiting expert then develops hypotheses that are supported (or not) by laboratory or other tests performed on the patient. The experts usually solve the case, but most doctors who read “Case Records” cannot. Many compare these case studies to the clinical dilemmas on the television series “House” from the 2000s.
Microsoft reported this summer that it used multiple artificial intelligence large language models (LLMs) working together to solve these challenging cases, achieving an 80% success rate. Here’s an explanation from STAT News of how the Microsoft researchers got multiple AI models to collaborate to simulate medical sleuthing:
The researchers divided the diagnostic agent into multiple personas, similar to a panel of doctors working together. One was in charge of coming up with hypotheses; one was in charge of coming up with tests that would differentiate between those hypotheses; one played devil’s advocate, finding ways to falsify the current hypothesis. The researchers also outfitted one agent persona with a list of prices for various labs and tests so it could ensure the process was cost-efficient.
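For readers curious what such a “panel of personas” might look like in code, here is a minimal sketch of the idea. To be clear, this is my illustration, not Microsoft’s implementation: the persona prompts, the ask and run_test callables, and the “FINAL DIAGNOSIS” stopping convention are all assumptions made for the example.

```python
# A hypothetical sketch of a multi-persona diagnostic loop, loosely based on
# the description quoted above. Everything here is illustrative, not
# Microsoft's actual code.
from typing import Callable

# ask(system_prompt, context) -> model reply; inject any real LLM client here.
Ask = Callable[[str, str], str]

PERSONAS = {
    "hypothesizer": "Propose the most likely diagnoses for the case so far. "
                    "If confident, begin your reply with 'FINAL DIAGNOSIS:'.",
    "challenger": "Play devil's advocate: argue against the leading diagnosis.",
    "test_chooser": "Name one test that best discriminates among the hypotheses.",
    "cost_auditor": "Given a price list, reply 'approve' only if the proposed "
                    "test is worth its cost; otherwise suggest a cheaper one.",
}

def diagnose(vignette: str, ask: Ask, run_test: Callable[[str], str],
             max_rounds: int = 10) -> str:
    """Loop the persona panel until one commits to a final diagnosis."""
    case_log = vignette
    hypotheses = ""
    for _ in range(max_rounds):
        hypotheses = ask(PERSONAS["hypothesizer"], case_log)
        if hypotheses.startswith("FINAL DIAGNOSIS:"):
            return hypotheses                      # the panel has committed
        critique = ask(PERSONAS["challenger"], case_log + "\n" + hypotheses)
        test = ask(PERSONAS["test_chooser"],
                   case_log + "\n" + hypotheses + "\n" + critique)
        verdict = ask(PERSONAS["cost_auditor"], test)
        if "approve" in verdict.lower():           # cost check passed
            case_log += f"\nTest: {test}\nResult: {run_test(test)}"
    return hypotheses                              # best guess when rounds run out
```

The design point the quote emphasizes is division of labor: no single model both proposes a diagnosis and polices it, and the cost auditor can veto a test before it is ever “ordered.”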
The researchers found that the optimized ensemble of LLMs ordered tests that cost, on average, 20% less than those requested by physicians. An earlier model had requested tests that were more expensive than the physicians’ orders.
Critics aren’t ready to conclude that even multiple AI LLMs working together are approaching the diagnostic effectiveness of expert physicians. The Microsoft researchers gave the 320 case studies to 21 human physicians, who achieved only a 20% success rate, but those physicians were barred from consulting outside sources. That restriction is unrealistic, since most physicians query resources like UpToDate or textbooks for difficult cases. Further, LLMs might be less able than human providers to interpret the histories of patients, who are often “unreliable narrators.”
One way or another, using multiple AI LLMs with different personas to solve real-world problems is a genuine advance, and I’m confident that AI will play a growing role in improving diagnostic effectiveness.
Implications for employers:
We can expect that increasing use of artificial intelligence will improve diagnostic accuracy over time.
We don’t know what impact the deployment of AI will have on the cost of care. Presumably, models can be optimized to be cost-effective and to avoid ordering low-value tests. On the other hand, providers are already using AI to increase their revenue per visit.
Thanks for reading! Hope you’ll subscribe to this newsletter, and please hit the "like" button. Please also recommend this newsletter to friends and colleagues - it's free. You can find previous posts at this link.
Views expressed in Employer Coverage are purely my own.
Image created with ChatGPT. My prompts to put the physician and anthropomorphized computer in the opposite corners were unsuccessful.