The AI doctor is not ready to see you now: Stress tests reveal flaws


Conceptual illustration: Benchmark scores suggest steady model improvement. Stress tests uncover hidden vulnerabilities—newer models may be equally or more brittle despite higher scores. Credit: arXiv (2025). DOI: 10.48550/arxiv.2509.18234

Robust performance under uncertainty, valid reasoning grounded in evidence, and alignment with real clinical need are prerequisites for trust in any health care setting.

Microsoft Research, Health & Life Sciences reports that top-scoring multimodal medical AI systems show brittle behavior under stress tests, including correct guesses without images, answer flips after minor prompt tweaks, and fabricated reasoning that inflates perceptions of readiness.

AI-based medical evaluations face a credibility and feasibility gap rooted in benchmarks that reward pattern matching over medical understanding. While the hope is that such systems will broaden access and lower the cost of care, accuracy in diagnostic evaluation is critical to making this possible.

Previous evaluations have allowed models to associate co-occurring symptoms with diagnoses without interpreting visual or clinical evidence. Systems that appear competent can fail when faced with uncertainty, incomplete data, or shifts in input structure. Each new benchmark cycle produces higher scores, but those scores can conceal fragilities that would be unacceptable in a clinical setting.

In the study, “The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks,” posted on the preprint server arXiv, researchers designed a suite of stress tests to expose shortcut learning and to assess robustness, reasoning fidelity, and modality dependence across widely used medical benchmarks.

Six flagship models were evaluated across six multimodal medical benchmarks, with analyses spanning filtered JAMA items (1,141), filtered NEJM items (743), a clinician-curated NEJM subset requiring visual input (175 items), and a visual-substitution set drawn from NEJM cases (40 items).

Model evaluation covered hundreds of benchmark items drawn from diagnostic and reasoning datasets in a tiered stress-testing protocol that probed modality sensitivity, shortcut dependence, and reasoning fidelity. Image inputs were removed from multimodal questions to quantify text-only accuracy relative to image+text performance.
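To make the ablation concrete, here is a minimal sketch of how such a text-only versus image+text comparison might be scored. The `ask_model` function and the item fields are hypothetical stand-ins, not the authors' actual evaluation harness.

```python
# Minimal sketch of a modality-ablation comparison (hypothetical
# harness; `ask_model` and the item fields are illustrative only).

def accuracy(items, ask_model, use_image=True):
    """Score multiple-choice items with or without the image attached."""
    correct = 0
    for item in items:
        image = item["image"] if use_image else None  # withhold the image
        prediction = ask_model(item["question"], item["options"], image)
        correct += prediction == item["answer"]
    return correct / len(items)

# full = accuracy(items, ask_model, use_image=True)      # image+text
# text_only = accuracy(items, ask_model, use_image=False)
# A small gap between the two suggests the text alone carries shortcut cues.
```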

A clinician-curated NEJM subset requiring visual input enabled tests of modality necessity: with images withheld, performance could be compared against the 20% random-guessing baseline (chance level for five-option questions).

Format manipulations disrupted surface cues. Answer options were randomly reordered without altering content. Distractors were progressively replaced with irrelevant choices from the same dataset, with a variant that substituted a single option with the token “Unknown.” Visual substitution trials replaced original images with distractor-aligned alternatives while preserving question text and options.
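The following sketch illustrates what these perturbations might look like in code. The item structure and helper names are assumptions for illustration, not the paper's implementation.

```python
import random

rng = random.Random(0)  # fixed seed for reproducible perturbations

def shuffle_options(item):
    """Randomly reorder answer options without altering their content."""
    options = list(item["options"])
    rng.shuffle(options)
    return {**item, "options": options}

def replace_distractors(item, pool, k):
    """Swap k incorrect options for irrelevant choices from the same dataset."""
    wrong = [o for o in item["options"] if o != item["answer"]]
    options = [item["answer"]] + wrong[k:] + rng.sample(pool, k)
    rng.shuffle(options)
    return {**item, "options": options}

def unknown_variant(item):
    """Replace a single distractor with the token 'Unknown'."""
    wrong = [o for o in item["options"] if o != item["answer"]]
    options = [item["answer"]] + wrong[1:] + ["Unknown"]
    rng.shuffle(options)
    return {**item, "options": options}
```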

Across image+text benchmarks, removing visual input produced marked accuracy drops on NEJM and smaller shifts on JAMA. On NEJM, GPT-5 fell from 80.89% to 67.56%, Gemini-2.5 Pro from 79.95% to 65.01%, OpenAI-o3 from 80.89% to 67.03%, OpenAI-o4-mini from 75.91% to 66.49%, and GPT-4o from 66.90% to 37.28%.

On the JAMA benchmark, shifts were modest: GPT-5 slipped from 86.59% to 82.91% and OpenAI-o3 from 84.75% to 82.65%.

On items that clinicians labeled as requiring visual input, text-only performance stayed above the 20% random baseline for most models. On the 175-item NEJM subset, GPT-5 scored 37.7%, Gemini-2.5 Pro 37.1%, and OpenAI-o3 37.7%, while GPT-4o scored just 3.4%, largely because it refused to answer when the image was missing.

Under format perturbations, random reordering of answer options reduced text-only accuracy while leaving image+text runs stable or slightly higher. GPT-5 shifted from 37.71% to 32.00% text-only and from 66.28% to 70.85% with image+text. OpenAI-o3 shifted from 37.71% to 31.42% text-only and from 61.71% to 64.00% with image+text.

Under distractor replacement, text-only accuracy declined toward chance as more options were substituted, while image+text accuracy rose. With four distractors replaced (4R), GPT-5 fell from 37.71% to 20.00% text-only but rose from 66.28% to 90.86% with image+text. A single “Unknown” distractor increased text-only accuracy for several models, including GPT-5, which rose from 37.71% to 42.86%.

Under counterfactual visual substitutions that aligned images with distractor answers, accuracy collapsed: GPT-5 dropped from 83.33% to 51.67%, Gemini-2.5 Pro from 80.83% to 47.50%, and OpenAI-o3 from 76.67% to 52.50%. GPT-4o was the lone exception, improving from 36.67% to 41.67%.

Chain-of-thought prompting generally reduced accuracy on VQA-RAD and NEJM, with small gains for o4-mini. Audits documented correct answers paired with incorrect logic, hallucinated visual details, and stepwise image descriptions that did not guide the final decision.

Authors caution that medical benchmark scores do not directly reflect clinical readiness and that high leaderboard results can mask brittle behavior, shortcut use, and fabricated reasoning.

They recommend that medical AI evaluation include systematic stress testing, benchmark documentation detailing reasoning and visual demands, and reporting of robustness metrics alongside accuracy. Only through such practices, they argue, can progress in multimodal health AI be aligned with clinical trust and safety.
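As one illustration of what reporting robustness alongside accuracy could look like, here is a minimal sketch pairing a headline accuracy with its drop under a single stress test. The metric's exact form is an assumption, not the authors' specification.

```python
def robustness_report(acc_clean, acc_perturbed):
    """Pair a headline accuracy with its drop under a stress test."""
    return {
        "accuracy": acc_clean,
        "perturbed_accuracy": acc_perturbed,
        "drop_points": round(100 * (acc_clean - acc_perturbed), 2),
    }

# Using the paper's visual-substitution figures for GPT-5:
print(robustness_report(0.8333, 0.5167))
# {'accuracy': 0.8333, 'perturbed_accuracy': 0.5167, 'drop_points': 31.66}
```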

Written by Justin Jackson, edited by Sadie Harley, and fact-checked and reviewed by Robert Egan.

More information:
Yu Gu et al, The Illusion of Readiness: Stress Testing Large Frontier Models on Multimodal Medical Benchmarks, arXiv (2025). DOI: 10.48550/arxiv.2509.18234

© 2025 Science X Network

Citation:
The AI doctor is not ready to see you now: Stress tests reveal flaws (2025, October 7)
retrieved 7 October 2025
from https://medicalxpress.com/news/2025-10-ai-doctor-ready-stress-reveal.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without written permission. The content is provided for information purposes only.
