A study published April 30 in the journal Science found that AI was more accurate than doctors in diagnosing cases in the ER.

Within hours of the study’s publication, headlines highlighting the story ricocheted across social media, cable news, and the inboxes of hospital administrators. OpenAI’s o1 model, the coverage incorrectly proclaimed, had outperformed emergency physicians at diagnosing patients at triage.

For example, the headline published on the National Public Radio website read: “In real-world test, an AI model did better than doctors at diagnosing patients.”

Many ER physicians took issue with how the findings were characterized by the media. As an emergency physician, I too read the study. To me, what this study actually means is quite interesting but also nuanced.

One of the study’s authors has also since offered some insightful clarification on the study.

Here’s The Study And What It Actually Found

The experiment presented OpenAI’s o1 and 4o models with the electronic medical records of 76 real patients who had come through the Beth Israel Deaconess emergency department and were admitted to the hospital.

Two internal medicine attending physicians reviewed the same cases. Then two separate internal medicine physicians, blinded to whether the diagnosis came from a human or an AI, evaluated the results.

OpenAI’s o1 model identified the exact or closely related diagnosis in 67% of triage cases, compared to 55% and 50% for the two physicians. AI’s advantage was largest at the first touchpoint, initial triage, where the least information is available. The researchers were careful to note that the AI was given the same raw, unprocessed electronic health record data available at the time of each diagnostic decision.

Yet the headlines largely missed that the emergency department experiment was just one of six in the paper. The other five drew on more established benchmarks used to evaluate AI diagnostic systems.

Across all six experiments, the results were impressive. But none should be mistaken for proof that AI is ready to diagnose patients independently. Since publication, ER physicians have raised concerns about the emergency medicine experiment in particular.

First, the doctors in the study weren’t ER doctors. They were internal medicine doctors, who have different training and focus. In addition, the primary goal of emergency medicine is not always to land on the precise diagnosis. It’s to rule out life threats, manage uncertainty and move patients safely through a high-volume, high-stakes environment.

Spend a shift in a busy ER and you will quickly understand why a text-based diagnostic exercise, however well designed, doesn’t capture how real-life emergency medicine works. In the study, the AI read notes. It did not see the patient who appeared ill (or not) in ways that might change the differential diagnosis. It didn’t see the subtle neurological exam finding or notice that the patient’s story shifted between triage and the exam room.

The AI was not practicing emergency medicine. It was offering a written opinion based on selected information.

A Study Author Responds To Critics

One of the paper’s own authors, an emergency physician himself, sees it differently. Dr. Adrian Haimovich, an assistant professor of emergency medicine at Harvard Medical School and an attending physician at Beth Israel Deaconess Medical Center, has responded to the critics with his own framing.

“Even the toughest cases published in medical journals are now regularly solved by LLMs,” he wrote. “When a patient is admitted to the hospital, they will typically be seen and stabilized by ER doctors who then pass the patient to the internal medicine doctors for the hospital stay. This experiment compares how well LLMs and internal medicine doctors do at guessing the diagnosis of patients admitted to the hospital using only the information that was available in the ER.”

Indeed, ERs are messy, real-world clinical environments where reasoning under pressure matters most. He went on to explain, “We restricted the data to the ER because it reflects when the diagnosis is most uncertain and so represents the toughest challenge.”

To Haimovich, the study wasn’t meant to be a head-to-head contest between doctors and machines. The primary finding in his view is that OpenAI’s o1, one of the first true “reasoning” models, can actually perform clinical reasoning across domains.

How We Should Interpret The Study’s Findings

In my view, the study results are quite important. This is why the editors of Science, one of the most prestigious peer-reviewed journals, chose to publish it.

The most important finding is not the comparative accuracy but the fact that AI performed so well on messy, real-world, unprocessed clinical data. Prior comparisons of doctors to AI have relied on polished case presentations that bear little resemblance to actual emergency care.

The fact that o1 held its own with all the uncertainty is a meaningful signal. Another important consideration: the study data at this point are old, by AI standards. New models have since eclipsed o1, so whatever benchmark o1 set in these experiments, the ceiling has since moved.

The study’s authors were also cautious about what they thought the next step should be: prospective trials. Not deployment. Not replacement of physicians.

How AI Could (Eventually) Play A Role In Real-Life Diagnoses

At this point in mid-2026, the debate over whether AI will play a role in clinical diagnosis is settled. It absolutely will.

Today, ER doctors and other specialists use AI to get second opinions on real cases. In some cases, the AI’s insights prove quite helpful. Given that, the more consequential questions surround governance, accountability and integration.

There is currently no formal accountability framework for AI-generated diagnoses. If a patient is harmed based on an AI recommendation that a physician acted on, or failed to act on, who is responsible? The physician who acted incorrectly? The hospital that purchased the software? The vendor that created the AI model?

These are the questions that will determine whether AI diagnostic tools get adopted thoughtfully, imposed recklessly or at some point shut down entirely because healthcare as a field is so risk-averse. When an incorrect AI diagnosis proves demonstrably lethal to a patient, the system could overreact and hit the kill switch.

Is It An All-Hands-On-Deck Moment?

Haimovich frames the current moment correctly: it’s an all-hands-on-deck moment in emergency medicine. The question now isn’t whether models are capable. They are. It’s how to make them work in ways that help physicians care for patients and improve the physician experience.

The research pipeline being assembled around this study reflects the kinds of questions that matter. Can AI systems help reduce medical errors? How accurately can AI navigate disposition decisions? Can AI help double-check that subtle diagnostic findings aren’t missed, or help read an equivocal electrocardiogram to decide whether an urgent heart catheterization is needed?

Groups are actively working within specialty organizations like the American College of Emergency Physicians and the Society for Academic Emergency Medicine to address these questions.

Ultimately, we should interpret the headlines on AI beating doctors skeptically but take the underlying science seriously. So is AI better than ER doctors at diagnosis? The study didn’t ask that question, but it did signal where this technology is heading.
