Exhibit Hall | Forum 2
Purpose: Data-driven deep learning (DL)-based auto-segmentation models have been widely investigated to improve clinical efficiency. Training and testing data distributions must be consistent during model development and initial deployment to ensure reasonable model performance. However, after deployment, factors such as newly introduced imaging platforms or protocols may shift the data distribution and cause model performance to decay. This work investigates the longitudinal robustness of DL models and dissects potential influencing factors to facilitate their clinical deployment.
Methods: In this study, we used four clinically relevant organs-at-risk (OARs) associated with prostate cancer as the testbed: femoral heads, bladder, rectum, and prostate. We first collected and curated data from 907 patients treated between 2006 and 2021, then split them into 120/29 patients for training/validation (2006-2011) and 758 patients for testing (2012-2021). We built a U-Net-type DL model to predict the OAR contours and used the Dice similarity coefficient (DSC) to quantify contour quality. We proposed to use the moving-average curve of DSC in the longitudinal direction to visualize the trend of the model's performance, and, based on this curve, the min/max performance gap to quantify the robustness of the model's performance. Since physicians' contouring style might be a significant influencing factor, we further dissected its impact on the model's performance.
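The longitudinal analysis described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the window size, the synthetic DSC scores, and the function names are assumptions for demonstration only.

```python
import numpy as np

def moving_average_dsc(dsc_values, window=50):
    """Smooth per-patient DSC scores ordered by treatment date.

    dsc_values: sequence of DSC scores sorted longitudinally.
    window: moving-average window size (assumed value; not from the abstract).
    """
    dsc = np.asarray(dsc_values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(dsc, kernel, mode="valid")

def min_max_gap(dsc_values, window=50):
    """Min/max gap of the smoothed longitudinal DSC curve,
    used here as a scalar robustness measure."""
    curve = moving_average_dsc(dsc_values, window)
    return float(curve.max() - curve.min())

# Illustrative example on synthetic DSC scores with a slight downward drift,
# mimicking performance decay over 758 test patients (data are fabricated).
rng = np.random.default_rng(0)
dsc = 0.85 - 0.0001 * np.arange(758) + rng.normal(0.0, 0.02, 758)
gap = min_max_gap(dsc, window=50)
```

A larger gap indicates stronger oscillation or drift of the smoothed performance curve, which is the sense in which the abstract's per-organ gap values (e.g., 0.063 for prostate) are reported.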
Results: We observed strong oscillation of model performance with an overall decreasing trend in the longitudinal direction. The min/max performance gaps were 0.063 (prostate), 0.056 (femoral heads), 0.032 (rectum), and 0.010 (bladder). The model produced higher-quality contours shortly after deployment.
Conclusion: This initial analysis shows that DL model performance may decay over time after clinical implementation, suggesting that continuous model QA and updating may be required for robust clinical deployment.