
Session: Multi-Disciplinary General ePoster Viewing

Assessing Machine Learning Generated Auto-Segmentation

S Gupta1*, L Dong2, R McBeth2, S Philbrook3, (1) Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, (2) University of Pennsylvania, Philadelphia, PA, (3) Abington Memorial Hospital, Philadelphia, PA


PO-GePV-M-11 (Sunday, 7/10/2022)   [Eastern Time (GMT-4)]

ePoster Forums

Purpose: Contouring is a fundamental step of the treatment planning process that involves significant resources. Contouring will likely become more automated using auto-segmentation algorithms generated by artificial intelligence (AI), particularly machine learning (ML). Many such algorithms exist, but there remains a need to quantitatively evaluate the performance of these algorithms before clinical implementation and to improve the quality of these algorithms based on relevant outcomes.

Methods: A tool was written in C# to run within the Eclipse treatment planning system. The tool compared ML-generated contours with ground-truth contours taken from clinical plans. Two standard metrics were used to assess the geometric agreement of the segmentations: the volumetric Dice similarity coefficient (DSC) and the average Hausdorff distance. ML-generated contours were obtained from multiple commercial vendors' auto-segmentation packages. Structures in the head and neck region were of particular interest because contouring in this region is inherently difficult: critical structures lie close together, the work is labor-intensive, and inter-observer variability is high.
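The two metrics named in the Methods can be illustrated with a minimal sketch. This is not the authors' C# tool and uses no Eclipse API; it is a plain Python illustration under stated assumptions. Contours are assumed to be given as sets of integer voxel indices, distances are in voxel units, and the average Hausdorff distance is taken as the mean of the two directed average nearest-point distances, which is one of several definitions in use (the abstract does not say which was applied). The function names `dice_coefficient` and `average_hausdorff` are hypothetical.

```python
import math


def dice_coefficient(a, b):
    """Volumetric Dice similarity coefficient of two voxel sets: 2|A∩B| / (|A|+|B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        # Assumed convention: two empty structures are in perfect agreement.
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def _mean_min_distance(src, dst):
    """Directed average distance: mean over src of the distance to the nearest dst point."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def average_hausdorff(a, b):
    """Symmetric average Hausdorff distance (assumed definition: mean of both directions)."""
    a, b = list(a), list(b)
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return 0.5 * (_mean_min_distance(a, b) + _mean_min_distance(b, a))


# Toy example: a 2x2x1 reference block and the same block shifted by one voxel in x.
reference = {(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)}
auto = {(1, 0, 0), (2, 0, 0), (1, 1, 0), (2, 1, 0)}

dsc = dice_coefficient(reference, auto)    # 2 shared voxels out of 4 + 4
ahd = average_hausdorff(reference, auto)   # half of the voxels are 1 voxel off
print(f"DSC = {dsc:.2f}, average Hausdorff = {ahd:.2f} voxels")
```

The brute-force nearest-point search is quadratic in the number of points, which is acceptable for a toy example but would likely need a spatial index or a distance transform for full clinical structures. Converting the distance to millimetres would also require the voxel spacing, which is omitted here.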

Results: Qualitatively, the auto-segmentations were of comparable quality to the clinical contours. The different algorithms performed reasonably well by both metrics, with almost all DSC values in close agreement (less than 10% difference between algorithms). Hausdorff distances showed greater variation between algorithms, although the performance of each algorithm also varied even for the same structures.

Conclusion: Further analysis of the results is needed. Preliminary results for head and neck cases show DSC values that are inconsistent with the established trend of higher scores for structures with greater volume. Otherwise, overall performance was generally comparable to values reported in the literature.

