AI for medicine is overhyped
Should we update our timelines to AGI as a result?
AI for medicine has a lot of promise, but it's also really overhyped right now. This post explains why and asks if we should update our timelines to existentially dangerous AI as a result (tldr: personally I do not).
Back in 2016 deep learning pioneer Geoffrey Hinton famously said "people should stop training radiologists now - it's just completely obvious within 5 years deep learning is going to do better than radiologists. It might be 10 years, but we've got plenty of radiologists already."
Hinton went on to make similar statements in numerous interviews, including one in The New Yorker in 2017. It’s now been 4.5 years since Hinton's prediction, and while there are lots of startups and hype about AI for radiology, in terms of real world impact not much has happened. There's also a severe shortage of radiologists, but that's beside the point.
Don’t get me wrong, Hinton’s case that deep learning could automate much of radiology remains very strong.1 However, what is achievable in principle isn’t always easy to implement in practice. Data variability, data scarcity, and the intrinsic complexity of human biology all remain huge barriers.
I’ve spent three years researching AI for radiology applications, including work with one of the foremost experts in the field, Dr. Ronald Summers, at the National Institutes of Health Clinical Center. When I discuss my work with people outside the field, I invariably find that they are under the impression that AI is already being heavily used in hospitals. This is not surprising given the hype around AI. The reality is that AI is not yet guiding clinical decision making, and my impression is that there are only around 3-4 AI applications in widespread use in radiology. These applications are mainly used for relatively simple image analysis tasks like detecting hemorrhages in MRI, and hospitals are mainly interested in using AI for triage purposes, not disease detection and diagnosis.2
High profile failures
“During my brief stint at the innovation arm of the University of Pittsburgh Medical Center, it was not uncommon to see companies pitching AI-powered solutions claiming to provide 99.9% accuracy. In reality, when tested on the internal hospital dataset, they almost always fell short by a large margin.” - Sandeep Konam, 2022.
As an insider, I hear about AI systems exhibiting decreased performance after real world deployment fairly often. For obvious business reasons, these failings are mostly kept under wraps. There have been a number of high profile public failures in the past few years, however. One of these was when Google’s Verily Health Sciences conducted field trials of their system for detecting diabetic retinopathy in Thailand. As researchers described in an academic paper, the system performed poorly due to poor lighting conditions and lower resolution images. 21% of the images the technicians tried to input were rejected by the model as unsuitable. For the remaining images, the authors do not disclose accuracy metrics, but they did say that performance was markedly reduced. The system also often took a long time to run since images had to be uploaded to the cloud, which reduced the number of people the clinic could process each day.
Skin cancer detection with a smartphone is one of the most promising areas for AI to make an impact. However, every skin cancer detection system being tested today suffers from bias when it comes to non-white skin. A recent study quantified this for three commercial systems. None of the systems did better than dermatologists, and they all exhibited significant drops in performance between light and dark skin. For two of the systems the drops in sensitivity were around 50% across two sets of tasks (0.41 → 0.12, 0.45 → 0.25, 0.69 → 0.23, 0.71 → 0.31). The third model actually exhibited worse sensitivity for lighter skin, but it also bombed completely at the operating point the vendor said to use, achieving sensitivities < 0.10 across the board.
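To make "around 50%" concrete, here is a quick back-of-the-envelope calculation of the relative drop for each of the four sensitivity pairs quoted above. I am assuming the drops are meant as relative decreases (drop divided by the light-skin sensitivity), which is my reading rather than something the study states. Under that reading the drops range from roughly 44% to 71%.

```python
# Sensitivity pairs (lighter skin, darker skin) exactly as quoted in the text.
pairs = [(0.41, 0.12), (0.45, 0.25), (0.69, 0.23), (0.71, 0.31)]


def relative_drop(before: float, after: float) -> float:
    """Fractional decrease going from `before` to `after`."""
    return (before - after) / before


drops = [relative_drop(before, after) for before, after in pairs]
mean_drop = sum(drops) / len(drops)

for (before, after), drop in zip(pairs, drops):
    print(f"{before:.2f} -> {after:.2f}: {drop:.0%} relative drop")
print(f"mean relative drop: {mean_drop:.0%}")
```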
Detecting breast cancer in mammograms is probably the most researched application of computers in medical imaging, with a legacy of work going back decades (the first FDA approved computer-aided detection system for mammography came out in 1998). A number of computer aided detection (“CAD”) software packages for mammography were rushed to market in the mid 2010s, and the numerous failings of those systems are documented in a 2018 article. 16% of breast cancers are missed by radiologists, which makes this the perfect application for AI, but despite intense concerted efforts stretching over 20+ years, true radiologist level performance has still not been reached. One of the most recent reviews, published in September 2021, found that 34/36 (94%) of AI systems were “less accurate than a single radiologist, and all were less accurate than consensus of two or more radiologists."3 Still, I am optimistic AI will break through here soon, with massive benefits to patients.
Epic’s sepsis model was implemented in hundreds of hospitals to monitor patients and send an alert if they were at high risk for sepsis. The model uses a combination of real-time emergency room monitoring data (heart rate, blood pressure, etc.), demographic information, and information from the patient’s medical records. Over 60 features are used in total. An external validation found very poor performance for the model (AUC 0.63 vs the advertised AUCs of 0.73 and 0.83). Out of 2,552 patients with sepsis, it identified only 33% of them and raised a lot of false alarms in the process. STAT News reporter Casey Ross and graduate student Adam Yala carried out a forensic-style investigation of Epic’s sepsis prediction approach to illuminate how distributional shifts can send machine learning models reeling. While they did not have access to the precise model Epic used (which is proprietary), they trained a similar model on the same features. They found that changes in ICD-10 coding standards likely contributed to a drop in the performance of Epic’s model over time, and that spurious correlations in the model’s training data also likely played a role.
Finally, we have IBM’s Watson Health, a failure so great that rumors of a “new AI winter” are starting to float around. Building off the success of their Watson system for playing Jeopardy!, about 10 years ago IBM launched Watson Health to revolutionize healthcare with AI. It started with a high profile partnership with Memorial Sloan Kettering to train an AI on EHR data to make treatment recommendations. IBM CEO Ginni Rometty called it “our moonshot”. At its peak, Watson Health employed 7,000 people. You may remember seeing ads breathlessly hyping the system a few years ago. Well, earlier this year IBM sold off all of Watson Health “for parts” for about $1 billion. The acquisitions that IBM undertook to build Watson Health cost $5 billion on their own, so this constitutes a massive loss for IBM. IBM executives must have determined that the entire enterprise had no chance of becoming profitable anytime soon and decided they needed to stop the bleeding. A fairly good exposé of what went wrong can be found on Slate.
It’s worth noting that all of AI’s failings in radiology, both public and private, have lowered many doctors’ trust in AI, and that lost trust may take a long time to regain.
Most failings of AI in radiology can be traced to the fact that deep learning models are not robust to distributional shift. The appearance of medical images varies between scanner models and with the settings that technicians use, called image acquisition protocols. Imaging protocols are not well standardized. MRI in particular has a lot of different protocols (pulse sequences), leading to a lot of variability in image appearance. Deep learning models have a bias towards looking at textures, which can vary in medical images due to motion blur, the resolution employed, and (for CT) X-ray tube current and reconstruction kernel. It has been shown that noise patterns can throw off a deep learning model in weird ways that would never fool a human - for instance, models trained to be robust to “salt and pepper” noise are not robust to white noise and vice-versa (humans are not easily misled by either type of noise). On top of all this, new types of scanners with new image appearance and capabilities are constantly being deployed - right now dual energy CT and low field portable MRI are both well on their way to widespread clinical adoption. Other technologies such as layer fMRI and photon counting CT are likely to be in clinical use by the end of the decade.
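For readers who haven't run into the two noise types mentioned above, here is a minimal sketch of how they differ. It is pure Python on a toy grayscale image with pixel values in [0, 1]. The function names, the toy image, and the choice of additive uniform noise as the "white noise" are my own assumptions for illustration, not the setup of any particular study.

```python
import random


def salt_and_pepper(image, fraction, rng):
    """Set a random `fraction` of pixels to pure black (0.0) or pure white (1.0)."""
    noisy = [row[:] for row in image]
    for row in noisy:
        for x in range(len(row)):
            if rng.random() < fraction:
                row[x] = rng.choice((0.0, 1.0))
    return noisy


def uniform_white_noise(image, width, rng):
    """Add independent noise drawn from [-width, width] to every pixel, clipped to [0, 1]."""
    return [
        [min(1.0, max(0.0, p + rng.uniform(-width, width))) for p in row]
        for row in image
    ]


def changed_fraction(original, noisy):
    """Fraction of pixels whose value differs between the two images."""
    total = sum(len(row) for row in original)
    changed = sum(
        a != b for row_a, row_b in zip(original, noisy) for a, b in zip(row_a, row_b)
    )
    return changed / total


rng = random.Random(0)                      # fixed seed so the run is repeatable
image = [[0.5] * 32 for _ in range(32)]     # toy image: uniform mid-gray
sp = salt_and_pepper(image, fraction=0.05, rng=rng)
wn = uniform_white_noise(image, width=0.1, rng=rng)

print(f"salt and pepper changed {changed_fraction(image, sp):.1%} of pixels")
print(f"white noise changed {changed_fraction(image, wn):.1%} of pixels")
```

Salt and pepper noise corrupts a few pixels severely, while white noise nudges nearly every pixel slightly. A human sees the same picture either way, but the pixel statistics are very different.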
Radiologists have a model of human anatomy in their head which allows them to easily figure out how to interpret scans taken with different scanning parameters or X-ray images taken from different angles. It appears deep learning models lack such a model. Unlike today’s AI, radiologists can adapt to new scanning technologies relatively easily.
The only surefire way to make deep learning systems more robust right now is to increase the size and diversity of the data they are trained on. Unfortunately, this is rather hard in medical imaging because labeling requires medical expertise. Combining datasets across institutions poses a lot of logistical challenges.
Deployment is hard.
Hospitals are not set up to use AI. The most common radiology viewing software programs cannot display AI results. Most hospitals don’t have GPU machines, and sending images to the cloud can be tricky due to HIPAA regulations and security requirements. EHR and imaging data live on separate systems, so integrating the two is difficult. Despite pushes for standardization, every hospital uses different coding systems and series description conventions for medical images. Serious problems like image corruption or gross mis-labeling are relatively rare (maybe 0.25%), but when they do occur, AI software typically does not come with out-of-distribution or anomaly detection to alert the user that the result will very likely be invalid.
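To give a flavor of what even a crude anomaly check might look like, here is a minimal sketch that flags an image if it is blank or if its mean intensity is far from what was seen in a set of known-good images. The function names, the use of mean intensity as the only statistic, and the z-score threshold of 4 are all hypothetical choices of mine. A real out-of-distribution detector would need to be far more sophisticated.

```python
from statistics import mean, pstdev


def fit_reference(known_good_images):
    """Summarize per-image mean intensity over a set of known-good images.

    Each image is a flat list of pixel intensities.
    """
    per_image_means = [mean(img) for img in known_good_images]
    return mean(per_image_means), pstdev(per_image_means)


def is_suspect(image, ref_mean, ref_std, z_limit=4.0):
    """Return True if the image is constant or its mean is far from the reference."""
    if pstdev(image) == 0.0:
        # A perfectly constant image is almost certainly blank or corrupt.
        return True
    if ref_std == 0.0:
        # Degenerate reference set: anything that differs at all is flagged.
        return mean(image) != ref_mean
    z_score = abs(mean(image) - ref_mean) / ref_std
    return z_score > z_limit


# Toy data standing in for real scans (assumption: intensities scaled to [0, 1]).
known_good = [[0.4, 0.5, 0.6], [0.45, 0.5, 0.55], [0.3, 0.5, 0.8], [0.5, 0.6, 0.7]]
ref_mean, ref_std = fit_reference(known_good)

print(is_suspect([0.45, 0.5, 0.6], ref_mean, ref_std))    # typical image
print(is_suspect([0.0, 0.0, 0.0], ref_mean, ref_std))     # blank image
print(is_suspect([0.95, 0.99, 1.0], ref_mean, ref_std))   # saturated image
```

The point is not that this particular check is good, but that a system which refuses to produce a result on obviously broken input is cheap to build compared to the cost of a silently invalid output.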
Routing the correct images to the right AI models is also an under-appreciated challenge. AI models today are only designed to work on very specific types of images, for instance frontal chest X-ray, diffusion weighted brain MRI, or abdominal CT with contrast. Most models also only work within a limited range of scanner settings. Due to the sloppiness of overworked technicians, image DICOM metadata is a very imperfect guide to understanding what is in an image and how it was acquired, making image routing a complex and daunting task. [Until now there hasn’t been much pressure to record such metadata in an accurate, standardized format, so it generally isn’t.]
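Here is a minimal sketch of what metadata-based routing looks like and why it is fragile. Modality, BodyPartExamined, and ViewPosition are standard DICOM attribute keywords, but the routing table, the model names, and the plain-dict representation of a header are hypothetical, made up for illustration. A real router would also need to check things like contrast use and acquisition parameters.

```python
from typing import Optional

# Hypothetical routing table: each rule lists the values a header must have
# for an image to be sent to the named (made-up) model.
ROUTES = [
    {
        "model": "chest_xray_frontal",
        "Modality": {"CR", "DX"},
        "BodyPartExamined": {"CHEST"},
        "ViewPosition": {"PA", "AP"},
    },
    {
        "model": "abdominal_ct",
        "Modality": {"CT"},
        "BodyPartExamined": {"ABDOMEN"},
    },
]


def normalize(value) -> str:
    """Uppercase and strip a header value; missing values become an empty string."""
    return str(value or "").strip().upper()


def route(metadata: dict) -> Optional[str]:
    """Return the first model whose rule matches the header, or None if none match."""
    for rule in ROUTES:
        required = {tag: allowed for tag, allowed in rule.items() if tag != "model"}
        if all(normalize(metadata.get(tag)) in allowed for tag, allowed in required.items()):
            return rule["model"]
    return None  # no confident match: better to skip the AI than guess


print(route({"Modality": "dx ", "BodyPartExamined": "Chest", "ViewPosition": "PA"}))
print(route({"Modality": "CT", "BodyPartExamined": ""}))  # body part left blank
```

The second call shows the problem: a perfectly good abdominal CT with a blank body part field never reaches the model, and a mislabeled one would be sent to the wrong model.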
As Rodney Brooks has been saying, deployment is hard and slow. The situation is similar to how our transportation infrastructure will have to be overhauled to support electric cars. In the same way, hospital software and IT infrastructure need a complete overhaul to support AI. Most hospitals don’t have a lot of money sitting around for such an overhaul, so the return on such an investment needs to be crystal clear.
Should we update our AI timelines as a result of all this?
(note: I realize this section is a bit of a non-sequitur, but I have many Effective Altruists among my readers and other people who are interested in existential risk from AI.)
A question for effective altruists and other people worried about existential risk from AI is whether we should update our predictions as to how far away existentially dangerous AI is in light of all this. I have now witnessed first-hand the big divide between the hype and what’s really going on in two fields: AI for molecular design / drug discovery and AI for medical imaging. However, I haven’t changed my expectations of how long it will be until existentially dangerous AI arrives as a result of this. The reason is that I believe an additional “revolution” is needed beyond deep learning to get to existentially dangerous AI in the next few decades. I suspect this revolution will come from neuroscience and in particular from understanding how cortical columns work. It’s sort of a fool’s errand to try to predict how long until the requisite scientific advances happen, but given the number of people working in the field and recent rapid progress in connectomics, 10-20 years until this advance seems entirely plausible to me. I also believe Moore’s law (in terms of usable FLOPS per dollar) will continue for at least a few more decades. So when the revolution that leads to existentially dangerous AI comes, I have a pretty high credence (50%) that there will be a massive “overhang” of usable hardware and a fast take-off (i.e., within a few years).
Automating much of radiology is very different than automating all of radiology. Weird anomalies and unexpected situations abound in medicine. As with driverless cars, a knowledgeable human in the loop will be needed for a long time. It’s hard for me to imagine scenarios under which AI could wholesale replace everything radiologists do in the next 20 years just using today’s deep learning. Of course it is technically possible, but given the amount of work needed to train a system to do one narrow thing at the human level right now, it’s hard to imagine it happening. Foundation models for medical imaging could help, but will be hard to create. Radiologists can identify hundreds of different types of diseases across many image modalities (MRI, CT, chest X-ray, other X-ray, mammography, ultrasound, PET, SPECT) and also have a detailed knowledge of what variations of anatomy are normal vs anomalous. Instead, barring a major AI breakthrough, what is likely to happen is that radiologists will work with an AI copilot that consists of a panel of specialized models that each do one narrow thing. The data from that AI panel will help the radiologist do their job better by catching things that radiologists frequently miss and will also make radiology more quantitative by providing measurements like volumes and diameters of lesions, volume of visceral fat, volume of plaque, etc. Eventually, reading a scan will become faster with AI taking on a lot of the work, freeing up time for today’s overworked radiologists to interact with patients more. Eric Topol lays out a vision along these lines, which I find quite plausible, in his book Deep Medicine.
In a recent review of 118 AI/ML systems for medical imaging approved by the FDA between 2008-2021, only 9.3% were explicitly for disease detection and diagnosis (CADx / CADe). The largest category of application was image processing (50%) followed by triage (23%).
However, in the case of mammography reading in particular, several studies have shown that using AI systems as a “virtual second reader” results in better accuracy than using a radiologist as the second reader. Two other studies showed that an AI combined with a radiologist could enhance the radiologist’s performance. Mammography reading is one of the most studied areas and is, in my estimation, one of the most likely detection tasks where AI will first find profitable application in developed countries.