AI for COVID-19 diagnosis - a case study in bad incentives
Everyone please stop.
On February 26, 2020, the Radiological Society of North America (RSNA) put out a press release proclaiming “CT Provides Best Diagnosis for COVID-19”. The release was quickly picked up by news outlets such as Science Daily, EurekAlert, and many medical news sites.
The lead in the RSNA’s press release states:
In a study of more than 1,000 patients published in the journal Radiology, chest CT outperformed lab testing in the diagnosis of 2019 novel coronavirus disease (COVID-19). The researchers concluded that CT should be used as the primary screening tool for COVID-19.
They then go on to note:
...recent research found that the sensitivity of CT for COVID-19 infection was 98% compared to RT-PCR sensitivity of 71%.
When I saw this I was immediately skeptical. How could CT provide a better test result than PCR if most COVID-19 cases are asymptomatic? Do people with asymptomatic COVID-19 have abnormalities in their lungs?
The radiologist Luke Oakden-Rayner wrote a blog post on March 23rd last year entitled “CT scanning is just awful for diagnosing Covid-19” which explains why all of those “CT beats PCR” headlines were dead wrong. The reason is simple — selection bias. The referenced study on CT scanning did not study random people chosen off the street. Instead, they studied people who were referred for a CT scan at a hospital. So, the people studied were a subset of people with COVID-19 who 1. had symptoms bad enough they went to the hospital and 2. once in the hospital had a condition bad enough that it called for being referred to get a CT scan (which confers significant radiation exposure).
The fraction of people who get COVID-19 and are asymptomatic is very unclear, with estimates ranging from 40% to 80% (see ref and ref). Even among those with symptomatic cases, not all go to a hospital. Oakden-Rayner estimates that 80% of people with COVID-19 were not represented in the study. So generalizing the study to a general population, as the RSNA did in their press release, is dead wrong. It’s possible, of course, that asymptomatic patients have lung tissue abnormalities that would show up on a CT. Oakden-Rayner points to a study of patients from the Diamond Princess cruise ship, which found that 54% of asymptomatic patients and 79% of symptomatic patients had abnormal CT scans. Oakden-Rayner estimates that in the worst case the 97% sensitivity for CT scanning reported in the study could drop below 50% if applied to a general population. Oakden-Rayner also points out that the study was not blinded, among other methodological issues (see his post for full details).
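To make the selection-bias argument concrete, here is a rough sketch of how sensitivity changes with the case mix. It is not Oakden-Rayner's calculation. It assumes, purely for illustration, that the Diamond Princess abnormal-CT rates (54% and 79%) can stand in for CT sensitivity in asymptomatic and symptomatic infections, and that the asymptomatic fraction lies somewhere in the 40% to 80% range quoted above. The function name and structure are my own.

```python
def population_sensitivity(frac_asymptomatic, sens_asymptomatic, sens_symptomatic):
    """Case-mix-weighted sensitivity over asymptomatic and symptomatic infections."""
    if not 0.0 <= frac_asymptomatic <= 1.0:
        raise ValueError("frac_asymptomatic must be between 0 and 1")
    return (frac_asymptomatic * sens_asymptomatic
            + (1.0 - frac_asymptomatic) * sens_symptomatic)


# Assumption: abnormal-CT rates from the Diamond Princess study are used
# as a stand-in for sensitivity in each group.
SENS_ASYMPTOMATIC = 0.54
SENS_SYMPTOMATIC = 0.79

if __name__ == "__main__":
    # Asymptomatic fractions spanning the 40%-80% range of estimates.
    for frac in (0.4, 0.6, 0.8):
        sens = population_sensitivity(frac, SENS_ASYMPTOMATIC, SENS_SYMPTOMATIC)
        print(f"{frac:.0%} asymptomatic -> population sensitivity ~ {sens:.0%}")
```

Under these assumptions alone the population-level figure lands at roughly 59% to 69%, already far below the 97% measured in a hospital-referred cohort. Getting below 50% requires further pessimistic assumptions that this sketch does not model (for example, that an abnormal scan is not always read as COVID-19).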
As Dr. Oakden-Rayner points out, the American College of Radiology, the Royal College of Radiologists, the Royal Australian and New Zealand College of Radiologists, and the Canadian Association of Radiologists all say that CT should not be used to diagnose COVID-19. There are good reasons for this in addition to the low sensitivity: undergoing a CT scan involves a non-negligible amount of radiation. Moreover, you may have heard that CT scans are expensive! The actual scan takes ~10 seconds, but factoring in all the prep work it takes 10-30 minutes. Most hospital scanners run at full capacity, so if scanners were co-opted to screen for COVID-19 that would inevitably require forgoing scans for other indications.
The deluge of “AI for COVID” papers
Many labs working on AI for medical imaging dropped everything they were doing to work full time on applying AI to COVID-19 diagnosis in CT and X-ray images. Dozens of nearly identical papers were published in March 2020. On March 30th, the first public dataset of CT scans was released. By April, there was already a review article summarizing dozens of studies, many from China. Press releases proclaimed things like “AI-Fueled Chest X-ray Can Provide Near-Perfect COVID-19 Identification”. By December 2020, there were at least 500 works published on AI for analyzing COVID-19 in CT or X-ray images. Many were in obscure radiology journals but some made it into major journals like Nature Communications (Impact Factor 12, acceptance rate 40%), Transactions on Medical Imaging (IF 11), and Radiology (IF 8, acceptance rate 9%).
The value propositions of such work were always highly questionable. Firstly, as discussed above, CT has no role to play in diagnosis. A similar story holds for chest X-ray. Yet the majority of AI papers just did binary COVID / no-COVID classification. Some AI papers did segmentation of lung lesions. But what is the marginal value of knowing the exact volume of lesions? A minority of papers (at most 20%) looked at prognosis prediction (mortality, ICU referral, length of hospital stay, etc.), which in theory could help with triage and inform treatment decisions. However, my boss Dr. Ronald Summers notes:
Hazard ratios of on the order of 2 to 3, as found in the article by Mushtaq et al, are generally insufficient for clinical decision making. While it is possible that prediction of an adverse outcome could lead to more aggressive treatment, it could also lead to unnecessary costs and adverse effects.
To have added value beyond what a doctor or radiologist already knows, AI systems should also take into account clinical history and demographic information in addition to just looking at the scans. But the vast majority of AI systems that were developed didn’t do that. No study, to my knowledge, actually ran an RCT looking at patient care outcomes with and without AI (that is extremely rare and not even required by the FDA for approval of an AI system, by the way). Instead, the studies all tested their AI systems on small curated test sets.
Why did people stop everything they were doing to apply AI to CT and X-ray images, even if it meant moving resources away from higher impact and/or more novel work? Due to the urgency of the pandemic, any paper with “COVID-19” in the title was fast-tracked by journals. Furthermore, a lot of research with “COVID-19” in the title received press releases from research institutions for obvious PR reasons. So the chance at an easy publication and a sexy press release was too good to pass up, even if it meant taking time away from more valuable projects. #BadIncentives
How poor quality were these studies?
A paper from 15 March 2021 by Roberts et al. published in Nature Machine Intelligence identified 415 papers on AI for COVID-19 applied to CT or chest X-ray images. They then analyzed a subset of 320 papers which present unique AI systems published in English. They evaluated all 320 papers using the CLAIM criteria (Checklist for Artificial Intelligence in Medical Imaging), which was developed in response to rampant problems with reproducibility, bias, and the selection of test metrics in the world of applied AI. The full CLAIM list contains 42 criteria, which are mandatory for publication in the journal Radiology: Artificial Intelligence. Roberts et al. chose the following 8 as “mandatory” for their study:
Data sources. The data sources must be clearly identified…
Data pre-processing steps. …we require that the paper details the pre-processing steps in sufficient detail to reproduce….
How data were assigned to partitions; specify proportions. For the training to be reproducible, we expect not only the proportions (or number) of images included within each of the training, validation and holdout cohorts but also the number of images with the outcome.
Level at which partitions are disjoint (e.g. image, study, patient, institution). If a paper has only one image for each patient all obtained from the same center, then we can safely assume these are disjoint at patient level. In the instance that it is clear that there are multiple images for some (or all) patients in the dataset, we expect detail to be given for how the authors mitigated against images appearing in the different partitions for the same patients.
Detailed description of model, including inputs, outputs, all intermediate layers and connections. The construction of the architecture must be reproducible to allow the training to be replicated….
Details of training approach, including data augmentation, hyperparameters, number of models trained. The method for training the model must be discussed in enough detail to allow reproduction..
Method of selecting the final model. If the authors consider a model that is not trained for a fixed number of epochs, we require that the authors detail how the final model was selected…
Metrics of model performance. The metrics used to assess the model performance must be commonly used metrics or defined clearly within the paper.
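The criterion about the level at which partitions are disjoint is the easiest one to get wrong in practice, so here is a minimal sketch of a patient-level split. It is an illustration only, not code from Roberts et al. or from any of the reviewed papers. The function name, the (patient_id, image_id) record format, and the 20% holdout default are all assumptions of mine.

```python
import random


def split_by_patient(records, holdout_fraction=0.2, seed=0):
    """Split (patient_id, image_id) records so no patient is in both partitions.

    Patients, not images, are shuffled and assigned, so every image from a
    given patient ends up on the same side of the split.
    """
    # Sort the unique patient IDs first so the shuffle is reproducible.
    patient_ids = sorted({patient_id for patient_id, _ in records})
    random.Random(seed).shuffle(patient_ids)

    # Hold out at least one patient (with a single patient, train is empty).
    n_holdout = max(1, round(len(patient_ids) * holdout_fraction))
    holdout_ids = set(patient_ids[:n_holdout])

    train = [r for r in records if r[0] not in holdout_ids]
    holdout = [r for r in records if r[0] in holdout_ids]
    return train, holdout


if __name__ == "__main__":
    # Hypothetical toy data: 10 patients with 3 images each.
    records = [(f"patient_{p}", f"img_{p}_{i}") for p in range(10) for i in range(3)]
    train, holdout = split_by_patient(records)
    print(len(train), "training images,", len(holdout), "holdout images")
```

Splitting at the image level instead would let near-identical slices from one patient land in both training and holdout sets, which inflates reported performance. Libraries such as scikit-learn provide group-aware splitters (for example GroupShuffleSplit) that serve the same purpose.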
Only 62/320 papers (19%) satisfied all 8. 37 were deep learning papers, 23 were traditional machine learning papers, and 2 were hybrid. Among deep learning papers, 51% missed 3 or more mandatory criteria and 23% missed 2 criteria. 50% of the 37 deep learning papers also missed 6 or more non-mandatory CLAIM criteria. Among deep learning papers, the most frequently missed mandatory criteria were:
How the final model was selected (61%)
The method of pre-processing of the images (58%)
The details of the training approach (49%)
They then analyzed the 62 surviving papers for bias issues using the PROBAST (Prediction Model Risk of Bias) assessment tool. They found that 55/62 (89%) of papers had a high risk of bias in at least one domain. Similar results were found in a much earlier review on AI for COVID-19 diagnosis published in April 2020, which has almost 1000 citations. Its authors concluded:
..these models are all at high or unclear risk of bias, mainly because of model overfitting, inappropriate model evaluation (eg, calibration ignored), use of inappropriate data sources and unclear reporting. Therefore, their performance estimates are probably optimistic and not representative for the target population. The COVID PRECISE group does not recommend any of the current prediction models to be used in practice.
Ironically, at least with regard to the research on AI for COVID-19 diagnosis from CT / X-ray, the fact that most papers were very low quality is a good thing, as it means fewer research resources were wasted on something that never had a chance of being useful in the first place.