Session: Machine Intelligence for Treatment Planning and Segmentation

Evaluation of Commercial AI Segmentation Software

J Roper*, T Wang, Y Lei, S Dresser, B Ghavidel, L Qiu, J Zhou, O Kayode, K Luca, J Bradley, T Liu, X Yang, Emory University, Atlanta, GA

Presentations

WE-G-BRC-4 (Wednesday, 7/13/2022) 2:45 PM - 3:45 PM [Eastern Time (GMT-4)]

Ballroom C

Purpose: Artificial intelligence (AI) is becoming important in clinical practice, especially for healthy tissue segmentation, where deep learning (DL) outperforms atlas-based and model-based autocontouring methods. While several commercial DL segmentation solutions are now available, it is difficult to determine which vendor option may be best for a particular clinic. This study evaluates commercial DL software functionality and contour accuracy across numerous disease sites.

Methods: DL segmentation software packages developed by Limbus AI, Radformation, and Siemens were used to segment over 400 test cases spanning disease sites in the head and neck, thorax, abdomen, and pelvis. DL contours were compared with physician contours by computing the Dice similarity coefficient (DSC). Further, vendor results were benchmarked against those from in-house DL models trained using institutional datasets. Software features were evaluated in terms of supported organs, imaging modalities, batch processing, and model customization.
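For readers unfamiliar with the metric, the DSC between two contours A and B is 2|A∩B| / (|A| + |B|), ranging from 0 (no overlap) to 1 (identical). The following is a minimal sketch of that definition, not the code used in this study. It assumes each contour has already been rasterized to a set of voxel index tuples; the function name `dice_similarity`, the toy masks, and the convention of returning 1.0 for two empty contours are illustrative assumptions.

```python
def dice_similarity(voxels_a, voxels_b):
    """Dice similarity coefficient between two contours.

    Each contour is given as an iterable of voxel index tuples,
    e.g. (z, y, x). This representation is an assumption for the sketch.
    """
    a = set(voxels_a)
    b = set(voxels_b)
    total = len(a) + len(b)
    if total == 0:
        # Assumed convention: two empty contours agree perfectly.
        return 1.0
    return 2.0 * len(a & b) / total


# Toy single-slice example (hypothetical data, not from the study).
physician = {(0, y, x) for y in (1, 2) for x in (1, 2)}   # 4 voxels
dl_model = {(0, y, x) for y in (1, 2) for x in (1, 2, 3)}  # 6 voxels, 4 shared

print(dice_similarity(physician, dl_model))  # 2*4 / (4+6) = 0.8
```

Because the denominator is the sum of the two volumes, a fixed boundary disagreement costs more DSC on a small structure than on a large one.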

Results: At the time of this evaluation, the number of supported structures ranged from >50 to >80 among the vendors. Segmentation was performed exclusively on CT images, though one vendor supports MRI brain segmentation. Two vendors offer batch processing, which in this evaluation reduced the human time for segmentation from approximately 200 hours to 4 hours. None of these vendors support in-house DL plugins or training with institutional datasets. The average DSC scores ranged from 0.336±0.176 (chiasm) to 0.969±0.006 (brain). The largest difference in DSC scores among vendors was observed for the larynx, where the best score was 0.665±0.116 and the worst was 0.465±0.082. The average absolute DSC difference among vendors was 0.04. Overall, vendor software produced DSC scores that were lower than those of our in-house models by 0.08 on average.

Conclusion: Contour accuracy and software features were evaluated across three commercial DL segmentation solutions. The results can help inform clinics that are considering DL for autocontouring.

Keywords

CT, Segmentation, Treatment Planning

Taxonomy

IM/TH- image Segmentation: CT
