Session: Multi-Disciplinary BLUE RIBBON

Statistical Evaluation of a Commercial Deep-Learning-Based Automatic Contouring Software

N Bice*, B Patel, P Milien, A Mccarthy, P Cheng, J Rembish, J Teruel, D Barbee, NYU Langone Health System, New York, NY


TU-J430-BReP-F2-3 (Tuesday, 7/12/2022) 4:30 PM - 5:30 PM [Eastern Time (GMT-4)]

Exhibit Hall | Forum 2

Purpose: As artificial intelligence (AI) becomes integrated into radiotherapy workflows, accurate characterization of AI performance becomes increasingly important. We present a simple, systematic method for evaluating a commercial automatic contouring solution. We assess its overall performance and report detectable biases.

Methods: A team of three certified medical dosimetrists (41 years of combined experience) generated 540 contours for 81 patients using Radformation's AutoContour™ software. Twenty-seven unique structures within the head and neck, thorax, and pelvis were considered. The quality of the generated contours was evaluated against RTOG standards based on the degree of editing required for clinical use. A scale from 1 to 5 was used, with "1" indicating zero edits required, "3" indicating no time saved by using the software after editing, and "5" indicating an entirely unusable output. Scores were tabulated for each contour and stratified by evaluator, CT slice thickness, CT protocol (free breathing, 4D average, deep inspiration breath hold), and patient orientation (prone, supine). Spearman correlation, Mann-Whitney U, and Kruskal-Wallis non-parametric statistical tests were used to assess the dependence of contour quality on these factors.
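For illustration, the two-group comparison named above (for example, prone versus supine scores) could be sketched as follows. This is a minimal standard-library sketch of a two-sided Mann-Whitney U test using the tie-corrected normal approximation, not the authors' analysis code. The function names `average_ranks` and `mann_whitney_u` and the example scores are hypothetical, and the abstract does not state which software was used.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected,
    no continuity correction). Returns (U for sample x, p-value).
    Assumes the pooled scores are not all identical (variance would be 0)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = list(x) + list(y)
    ranks = average_ranks(pooled)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    # Tie correction: sum of (t^3 - t) over groups of tied values.
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())

    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value


# Made-up 1-to-5 quality scores, for demonstration only (not study data).
prone_scores = [2, 3, 2, 3, 2, 1, 3, 2]
supine_scores = [1, 2, 2, 1, 2, 3, 1, 2]
u_stat, p = mann_whitney_u(prone_scores, supine_scores)
print(f"U = {u_stat}, two-sided p = {p:.4f}")
```

Because a 1-to-5 ordinal scale produces many tied scores, the tie correction matters here. In practice a vetted library routine such as `scipy.stats.mannwhitneyu` would normally be preferred over a hand-written version.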

Results: The scores for all contours are positively skewed with a median score of 2 and mean score of 2.09, suggesting that, on average, contours with near-clinical acceptability are generated. With a family-wise error rate less than 5% (Holm-Bonferroni), there is sufficient evidence to suggest a dependence of contour quality upon (1) patient orientation (p=0.0157) and (2) contour evaluator (p=10⁻²³). No statistically significant dependence was observed for CT slice thickness or protocol.
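The Holm-Bonferroni step-down procedure mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions. The function name `holm_bonferroni` is hypothetical, and the family is assumed to consist of four tests (one per stratification factor). The abstract reports p-values only for orientation and evaluator, so the two non-significant p-values below are placeholders, not reported results.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans aligned with p_values: True where the null
    hypothesis is rejected while controlling the family-wise error rate
    at alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared
        # against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # every larger p-value is also retained
    return reject


factors = ["orientation", "evaluator", "slice thickness", "protocol"]
# First two values are from the abstract; the last two are PLACEHOLDERS.
p_values = [0.0157, 1e-23, 0.30, 0.45]
for name, rejected in zip(factors, holm_bonferroni(p_values)):
    print(f"{name}: {'significant' if rejected else 'not significant'}")
```

Under these assumptions, the orientation p-value of 0.0157 is compared against 0.05/3 ≈ 0.0167 rather than 0.05, which is why it remains significant after correction.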

Conclusion: We present a method for acceptance testing of an out-of-the-box automatic contouring system. A statistically significant difference is observed between the score distributions of the prone (mean score 2.23) and supine (mean score 1.99) patient orientations. Additionally, we highlight significant inter-observer variability in contour quality assessment.

