Preparing cancer patients for difficult decisions is an oncologist’s job. They don’t always remember to do it, however. The University of Pennsylvania Health System employs an artificially intelligent algorithm that predicts the chances of death and prompts doctors to talk about a patient’s treatment and end-of-life preferences.

But it’s far from a set-it-and-forget-it tool. A routine tech checkup revealed that the algorithm decayed during the COVID-19 pandemic, getting seven percentage points worse at predicting who would die, according to a 2022 study.

There were likely real-life impacts. Ravi Parikh, an Emory University oncologist who was the study’s lead author, told KFF Health News the tool failed hundreds of times to prompt doctors to initiate that critical discussion — possibly heading off unnecessary chemotherapy — with patients who needed it.

He believes several algorithms designed to enhance medical care weakened during the pandemic, not just the one at Penn Medicine. “Many institutions are not routinely monitoring the performance” of their products, Parikh said.
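For a sense of what that routine monitoring can look like, here is a minimal sketch in Python, assuming a monthly batch of risk scores matched to 180-day outcomes. The column names, outcome window, and five-point alert threshold are illustrative assumptions, not details of Penn Medicine’s system.

```python
# Minimal monitoring sketch: recompute a mortality model's discrimination each
# month and flag decay like the drop described in the 2022 study.
# All column names and thresholds here are illustrative assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

ALERT_DROP = 0.05  # flag months where AUROC falls 5+ points below the validation baseline

def monthly_auroc(scored: pd.DataFrame) -> pd.Series:
    """AUROC per calendar month; expects 'scored_at', 'died_180d', and 'risk_score' columns."""
    by_month = scored.groupby(scored["scored_at"].dt.to_period("M"))
    return by_month.apply(
        lambda g: roc_auc_score(g["died_180d"], g["risk_score"])
        if g["died_180d"].nunique() == 2 else float("nan")
    )

def decayed_months(scored: pd.DataFrame, baseline_auroc: float) -> pd.Series:
    """Return the months where discrimination dropped past the alert threshold."""
    auroc = monthly_auroc(scored)
    return auroc[auroc < baseline_auroc - ALERT_DROP]
```

A check like this only catches decay after the fact; someone still has to own the dashboard, investigate the alerts, and decide whether to retrain or retire the model.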

Algorithm glitches are one facet of a dilemma that computer scientists and doctors have long acknowledged but that is starting to puzzle hospital executives and researchers: Artificial intelligence systems require consistent monitoring and staffing to implement and maintain.

You need people and more machines to ensure the new tools don’t fail.

“Everybody thinks that AI will help us with our access and capacity and improve care and so on,” said Nigam Shah, chief data scientist at Stanford Health Care. “All of that is nice and good, but if it increases the cost of care by 20%, is that viable?”

Government officials worry hospitals lack the resources to put these technologies through their paces. “I have looked far and wide,” FDA Commissioner Robert Califf said at a recent agency panel on AI. “I do not believe there’s a single health system in the United States capable of validating an AI algorithm put into place in a clinical care system.”

AI is already widespread in health care. Algorithms predict patients’ risk of death or deterioration, suggest diagnoses or triage patients, record and summarize visits to save doctors’ work and approve insurance claims.

If tech evangelists are correct, the technology will become ubiquitous — and profitable. The investment firm Bessemer Venture Partners has identified 20 health-focused AI startups on track to make $10 million in revenue annually. The FDA has approved nearly a thousand artificially intelligent products.

Evaluating whether these products work is challenging. It is even trickier to assess whether or not they continue to work — or have developed the software equivalent of a blown gasket or leaky engine.

A recent study at Yale Medicine evaluated six “early warning systems,” which alert clinicians when patients are likely to deteriorate rapidly. Dana Edelson, a doctor at the University of Chicago and co-founder of a company that provided one algorithm for the study, said a supercomputer ran the data for several days. The process was fruitful, showing considerable differences in performance among the six products.

It’s difficult for hospitals and providers to select the best algorithms for their needs. The average doctor doesn’t have a supercomputer, and there is no Consumer Reports for AI.

“We have no standards,” said Jesse Ehrenfeld, immediate past president of the American Medical Association. “There is nothing I can point you to today that is a standard around how you evaluate, monitor, look at the performance of a model of an algorithm, AI-enabled or not, when it’s deployed.”

Perhaps the most common AI product in doctors’ offices is ambient documentation, a tech-enabled assistant that listens to and summarizes patient visits. Last year, investors at Rock Health tracked $353 million flowing into these documentation companies. But, Ehrenfeld said, “There is no standard right now for comparing the output of these tools.”

And that’s a problem when even minor errors can be devastating. A team at Stanford University tried using large language models—the technology underlying popular AI tools like ChatGPT—to summarize patients’ medical histories. They compared the results with what a physician would write.

“Even in the best case, the models had a 35% error rate,” said Stanford’s Shah. In medicine, “when you’re writing a summary, and you forget one word, like ‘fever’ — I mean, that’s a problem, right?”
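The kind of omission Shah describes can be checked mechanically, at least in crude form. Below is a deliberately simple sketch that compares a model-written summary against a clinician’s note and lists critical terms that were dropped; the term list and word-level matching are illustrative assumptions, and real evaluations, including Stanford’s, lean on clinician review rather than string matching.

```python
# Crude omission check: which critical terms appear in the clinician's note
# but not in the model's summary? The term list is an illustrative assumption.
CRITICAL_TERMS = {"fever", "sepsis", "allergy", "anticoagulant", "metastatic"}

def missing_critical_terms(reference_note: str, model_summary: str) -> set[str]:
    ref_words = set(reference_note.lower().split())
    summary_words = set(model_summary.lower().split())
    return {term for term in CRITICAL_TERMS if term in ref_words and term not in summary_words}

print(missing_critical_terms(
    "Admitted with fever and cough; started on antibiotics.",
    "Admitted with cough; started on antibiotics.",
))  # -> {'fever'}
```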

Sometimes, the reasons algorithms fail are logical. For example, changes to underlying data can erode effectiveness, like when hospitals switch lab providers.

Sometimes, however, the pitfalls yawn open for no apparent reason.

Sandy Aronson, a tech executive at Mass General Brigham’s personalized medicine program in Boston, said that when his team tested one application meant to help genetic counselors locate relevant literature about DNA variants, the product suffered from “non-determinism”—that is, it gave different results when asked the same question multiple times in a short period.
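Non-determinism, at least, is cheap to measure. A minimal sketch: ask the application the same question repeatedly in a short window and count how many distinct answers come back. The `ask_model` callable below is a hypothetical stand-in for whatever client the product actually exposes, not a real API.

```python
# Repeat the same query and tally distinct responses; more than one key in the
# Counter means the tool is giving inconsistent answers. `ask_model` is a
# hypothetical stand-in for the application's query interface.
from collections import Counter
from typing import Callable

def consistency_report(ask_model: Callable[[str], str], question: str, trials: int = 10) -> Counter:
    """Count distinct responses to one question over several trials."""
    return Counter(ask_model(question) for _ in range(trials))
```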

Aronson is excited about the potential for large language models to summarize knowledge for overburdened genetic counselors, but “the technology needs to improve.”

What do institutions do if metrics and standards are sparse and errors can crop up for strange reasons? Invest lots of resources. At Stanford, Shah said, it took eight to 10 months and 115 man-hours just to audit two models for fairness and reliability.
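As a rough sketch of one small piece of that audit work, assuming a labeled dataset with hypothetical “label,” “score,” and demographic columns, a fairness check might compute the same performance metric for each subgroup and report the gap; the months Shah describes go into assembling such data, choosing the metrics, and interpreting the results.

```python
# One slice of a fairness audit: compare a model's AUROC across demographic
# subgroups. Column names and the choice of metric are illustrative assumptions.
import pandas as pd
from sklearn.metrics import roc_auc_score

def subgroup_auroc(df: pd.DataFrame, group_col: str) -> pd.Series:
    """AUROC per subgroup; expects 'label' and 'score' columns."""
    return df.groupby(group_col).apply(
        lambda g: roc_auc_score(g["label"], g["score"])
        if g["label"].nunique() == 2 else float("nan")
    )

def max_subgroup_gap(df: pd.DataFrame, group_col: str) -> float:
    """Largest AUROC difference between any two subgroups."""
    per_group = subgroup_auroc(df, group_col).dropna()
    return float(per_group.max() - per_group.min())
```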

Experts interviewed by KFF Health News floated the idea of artificial intelligence monitoring artificial intelligence, with some (human) data whiz monitoring both. All acknowledged that would require organizations to spend even more money — a tough ask given the realities of hospital budgets and the limited supply of AI tech specialists.

“It’s great to have a vision where we’re melting icebergs in order to have a model monitoring their model,” Shah said. “But is that really what I wanted? How many more people are we going to need?”

SOURCE: Story by Darius Tahir | California Healthline