Hiltzik: When AI makes medical mistakes

As almost everybody knows, the AI gold rush is upon us. And in few fields is it happening as fast and furiously as in healthcare.
That points to an important corollary: Beware.
Artificial intelligence technology has helped radiologists identify anomalies in images that human users have missed. It has some evident benefits in relieving doctors of the back-office routines that consume hours better spent treating patients, such as filing insurance claims and scheduling appointments.
But it has also been accused of providing erroneous information to surgeons during operations that placed their patients at grave risk of injury, and fomenting panic among users who take its offhand responses as serious diagnoses.
The commercial direct-to-consumer applications being promoted by AI firms, such as OpenAI’s ChatGPT Health and Anthropic’s Claude for Healthcare, both of which were introduced in January, raise special concerns among medical professionals. That’s because they’ve been pitched to users who may not appreciate their tendency to output erroneous information and offer inappropriate advice.
“Eventually, a lot of this stuff is going to be great, but we’re not there yet,” says Eric Topol, a cardiologist associated with Scripps Research Institute in La Jolla.
“The fact that they’re putting these out without enough anchoring in safety and quality and consistency concerns me,” Topol says. “They need much tighter testing. The problem I have is that these efforts are largely stemming from commercial interests — there’s furious competition to be the first to come out with an app for patients, even if it’s not quite ready yet.”
That was the experience reported by Washington Post technology columnist Geoffrey A. Fowler, who provided ChatGPT with 10 years of health data compiled by his Apple Watch — and received a warning about his cardiac health so dire that it sent him to his cardiologist, who told him he was in the bloom of health.
Fowler also sought out Topol, who reviewed the data and found the chatbot’s warning to be “baseless.” Anthropic’s chatbot also provided Fowler with a health grade that Topol deemed dubious. I asked OpenAI and Anthropic to comment on this and other critiques of their consumer apps as released prematurely, but neither replied.
Topol, who has written extensively about advanced technology in medicine, is nothing like an AI skeptic. He calls himself an AI optimist, citing numerous studies showing that artificial intelligence can help doctors treat patients more effectively and even improve their bedside manner.
But he cautions that “healthcare can’t tolerate significant errors. We have to minimize the errors, the hallucinations, the confabulations, the BS and the sycophancy” that AI technology commonly displays.
In medicine, as in many other fields, AI looks to have been oversold as a labor-saving technology. According to a study published earlier this month in the Lancet, the British medical journal, AI-equipped stethoscopes provided to about 100 British medical groups identified some (but not all) indications of heart failure better than conventional stethoscopes. But 40% of the groups abandoned the new devices during the 12-month period of the study.
The main complaint was the “additional workflow burden” experienced by the users — an indication that whatever the virtues of the new technology, they didn’t outweigh the time and effort needed to use them.
Other studies have found that AI can augment physicians’ skills — when the doctors have learned to trust their AI tools and when they’re used in relatively uncomplicated, even generic, conditions.
The most notable benefits have been found in radiology; according to a Dutch study published last year, radiologists using AI to help interpret breast X-rays did as well in finding cancers as two radiologists working together. That suggested that judicious use of AI could free up time for one of the two radiologists. But in this case as in others, the AI helper didn’t do consistently well.
“AI misses some breast cancers that are recalled by human assessment,” a study author said, “but detects a similar number of breast cancers otherwise missed by the interpreting radiologists.”
AI’s incursion into healthcare even has become something of a cultural touchstone: In HBO’s up-to-the-minute emergency room series “The Pitt,” beleaguered ER doctors discover that an AI app pushed on them as a time-saving charting tool has “hallucinated” a history of appendicitis for a patient, endangering the patient’s treatment.
“Generative AI is not perfect,” the app’s sponsor responds. “We still need to proofread every chart it creates” — thus acknowledging, accurately, that AI can increase, not relieve, users’ workloads.
A future in which robots perform surgical operations or make accurate diagnoses remains the stuff of science fiction. In medicine, as elsewhere, AI technology has been shown to be useful to take over automatable tasks from humans, but not in situations requiring human ingenuity or creativity — or precision. And attempts to use AI-related algorithms to make healthcare judgments have been challenged in court.
In a class-action lawsuit filed in Minnesota federal court in 2023, five Medicare patients and survivors of three others allege that UnitedHealth Group, the nation’s largest medical insurer, relied on an AI algorithm to deny coverage for their care, “overriding their treating physicians’ determinations as to medically necessary care based on an AI model” with a 90% error rate.
The case is pending. In its defense, UnitedHealth has asserted that decisions on whether to approve or deny coverage remain entirely in the hands of physicians and other clinical professionals the company employs, and their decisions on coverage and care comply with Medicare standards.
The AI algorithm cited by the plaintiffs, UnitedHealth says, is not used “to deny care to members or to make adverse medical necessity coverage determinations,” but rather to help physicians and patients “anticipate and plan for future care needs.” The company didn’t address the plaintiffs’ assertion about the algorithm’s error rate.
“We shouldn’t be complacent about accepting errors” from AI tools, Topol told me. But it’s proper to wonder whether that message has been absorbed by promoters of AI health applications.
Disclaimers warning that AI responses “are not professionally vetted or a substitute for medical advice” have all but disappeared from AI platforms, according to a survey by researchers at Stanford and UC Berkeley.
The issue becomes more urgent as the language of chatbots becomes more sophisticated and fluent, inspiring unwarranted confidence in their conclusions, the researchers cautioned. “Users may misinterpret AI-generated content as expert guidance,” they wrote, “potentially resulting in delayed treatment, inappropriate self-care, or misplaced trust in non-validated information.”
Typically, state laws require that medical diagnoses and clinical decisions proceed from physical examinations by licensed doctors and after a full workup of a patient’s medical and family history. They don’t necessarily rule out doctors’ use of AI to help them develop diagnoses or treatment plans, but the doctors must remain in control.
The Food and Drug Administration exempts medical devices from government licensing if they’re “intended generally for patient education, and … not intended for use in the diagnosis of disease or other conditions.” That may cover AI bots if they’re not issuing diagnoses.
But that may not help users who have willingly uploaded their medical histories and test results to AI bots, unaware of concerns, including whether their information will be kept private or used against them in insurance decisions. Gaps in their uploaded data may affect the advice they receive from bots. And because the bots know nothing except the content they’ve been fed, their healthcare outputs may reflect cultural biases in the basic data, such as ethnic disparities in disease incidence and treatment.
“If there’s a mistake with all your data, you could get into a pretty severe anxiety attack,” Topol says. “Patients should verify, not just trust” what they’ve heard from a bot.
Topol warns that the negative effect of misleading AI information may fall not only on patients but also on the AI field itself. “The public doesn’t really differentiate between individual bots,” he told me. “All we need are some horror stories” about misdiagnoses or dangerous advice, “and that whole area is tarred.”
In his view, that would limit the promise of technologies that could improve the effectiveness of medical practice in many ways. The remedy is for AI applications to be subjected to the same clinical standards applied to “a drug, a device, a diagnostic. We can’t lower the threshold because it’s something new, or different, with some broad appeal.”


