
AI vs doctors: Harvard study highlights ER diagnosis gap

A Harvard-linked study finds certain AI models matched or exceeded physicians in ER triage accuracy, underscoring the need for real-world trials.

AI is moving from tech demos into the clinical conversation, and a new Harvard-linked study is putting its diagnostic abilities under the microscope.

Misryoum reports that researchers published the work in Science, evaluating how large language models perform across multiple medical scenarios, including real emergency room cases. The central takeaway concerns AI emergency diagnosis: the study suggests at least one model could match or outperform physician baselines at early decision points.

In the emergency room experiment, 76 patients were assessed at Beth Israel Deaconess Medical Center. Two attending physicians provided diagnoses, and those outputs were compared with diagnoses generated by OpenAI models labeled in the study as o1 and 4o. Importantly, separate attending physicians reviewed the results without knowing whether each diagnosis came from a human or from the AI.

This context matters because emergency triage is often where critical information is most limited and time pressure is highest. If performance gains appear early, it raises questions about whether AI could function as a decision-support layer, even as clinicians remain responsible for final calls.

Misryoum notes that the researchers highlighted how the model inputs reflected the same text available to clinicians through electronic medical records at each diagnostic step, without additional preprocessing. In the initial triage moments, where urgency is greatest and the dataset is smallest, the o1 model delivered an “exact or very close” diagnosis in 67% of cases. By comparison, one physician reached that level 55% of the time, and the other 50%.

The authors also underlined that the results do not amount to AI taking over emergency decisions. Instead, they argue that the findings point to an urgent need for prospective trials that test these systems under real-world care conditions, with appropriate safeguards and evaluation methods.

While the study emphasizes performance on text-based information, the researchers cautioned that models may be more limited when reasoning involves non-text inputs such as medical images or other data types. Misryoum also points out that this boundary is crucial for translating results from controlled comparisons into day-to-day hospital workflows.

At the heart of the debate is accountability. Even if accuracy improves, patients and regulators will need clarity on how responsibility is handled when AI tools influence diagnosis and treatment choices. In this context, Misryoum expects the next phase of research and policy discussions to focus as much on governance as on metrics.