Purpose: Since the outbreak of the COVID-19 pandemic, worldwide efforts have focused on applying artificial intelligence (AI) technologies to medical data from COVID-19–positive patients to classify various aspects of the disease, with promising results. However, concerns have been raised over the generalizability of these models, given heterogeneous factors across datasets. This study examines the severity of this problem by evaluating deep learning (DL) classification models trained to identify COVID-19–positive patients on 3D CT datasets from different countries.
Methods: We collected one internal dataset at UT Southwestern (UTSW) (337 patients) and three external datasets from three different countries: 1) CC-CCII Dataset (China), 2) COVID-CTset (Iran), and 3) MosMedData (Russia). We divided all the data into two classes: 1) COVID-19–positive and 2) COVID-19–negative patients. We split the data by patient into 72% training, 8% validation, and 20% testing. We trained nine identical DL-based classification models using various dataset combinations.
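As an illustration of the patient-level split described above, the following is a minimal sketch and not the code used in the study. The function name `split_by_patient`, the fixed seed, and the toy patient IDs are assumptions made for the example; the point is only that every scan from a given patient lands in exactly one partition.

```python
import random


def split_by_patient(patient_ids, train=0.72, val=0.08, seed=0):
    """Assign each unique patient to exactly one of train/val/test.

    Hypothetical helper: the 72/8/20 proportions come from the text,
    everything else (name, seed, rounding) is an assumption.
    """
    unique_ids = sorted(set(patient_ids))  # sort for reproducibility
    rng = random.Random(seed)
    rng.shuffle(unique_ids)

    n = len(unique_ids)
    n_train = round(n * train)
    n_val = round(n * val)
    return {
        "train": set(unique_ids[:n_train]),
        "val": set(unique_ids[n_train:n_train + n_val]),
        "test": set(unique_ids[n_train + n_val:]),  # remainder, about 20%
    }


# Toy data: 100 hypothetical patients with two scans each.
scans = [(f"p{i:03d}", f"scan_{i}_{j}") for i in range(100) for j in range(2)]
splits = split_by_patient([pid for pid, _ in scans])

# Scans follow their patient, so no patient appears in two partitions.
train_scans = [s for pid, s in scans if pid in splits["train"]]
```

Splitting on patient IDs rather than on individual scans avoids leaking near-identical images of the same patient between training and testing.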
Results: The models trained on a single dataset achieved accuracy/area under the receiver operating characteristic curve (AUC) values of 0.87/0.826 (UTSW), 0.97/0.988 (CC-CCII), and 0.86/0.873 (COVID-CTset) when evaluated on their own dataset. The models trained on multiple datasets and evaluated on a test set from one of the datasets used for training performed better. However, the performance dropped close to an AUC of 0.5 for all models when evaluated on an unseen dataset. Including MosMedData, which contained only positive labels, in the training did not improve performance on the other datasets.
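To make the two reported metrics concrete, the sketch below computes accuracy and AUC in plain Python. It is an illustration under assumptions, not the study's evaluation code: the function names, the 0.5 decision threshold, and the toy labels and scores are all hypothetical. AUC is computed here as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of cases whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        # Undefined when only one class is present in the labels.
        raise ValueError("AUC needs both positive and negative cases")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy example (not study data): perfectly separated scores.
labels = [1, 1, 0, 0]
separated = [0.9, 0.8, 0.2, 0.1]
# A model that gives every case the same score carries no ranking information.
uninformative = [0.7, 0.7, 0.7, 0.7]

auc_separated = auc(labels, separated)          # 1.0
auc_uninformative = auc(labels, uninformative)  # 0.5
```

An AUC of 0.5 therefore corresponds to ranking positives and negatives no better than chance, which is the level the models approached on unseen datasets.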
Conclusion: The models could identify COVID-19–positive patients when the testing data came from the same dataset as the training data. However, we observed poor performance when evaluating the models on an unseen dataset. Multiple factors likely contribute to these results, including patient demographics, pre-existing clinical conditions, and differences in image acquisition or reconstruction, all of which cause a data shift among study cohorts.