Ballroom B
Purpose: Clinical integration of commercial artificial intelligence (AI) in radiation oncology requires structured evaluation. We evaluated the quantitative and qualitative performance of pretrained AI auto-contouring across a network of clinical practice sites.
Methods: 121 patient planning computed tomography scans (helical, 4D, breath-hold) were curated from four clinics. Regions of interest (ROIs) generated by MIM Contour ProtégéAI (MIM) and RayStation Deep Learning Segmentation (RS) spanned five anatomic sites: abdomen (n=36), pelvis (n=15), head-and-neck (H&N, n=20), thorax (n=40), and brain (n=10). Quantitative agreement between AI-generated and clinical contours was measured by Dice similarity coefficient (DSC) and maximum distance-to-agreement (maxDTA). Qualitative assessment was performed by multiple expert raters, who independently scored blinded AI-contours on an ordinal scale: 1 (acceptable), 2 (acceptable with minor revisions), 3 (major revisions), 4 (unacceptable).
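The two quantitative metrics named above can be illustrated with a minimal sketch. This is not the study's implementation: it assumes each contour is available as a set of voxel coordinates already scaled to millimetres, and it assumes maxDTA is computed as the symmetric maximum nearest-point distance (the Hausdorff distance) between two surface point sets. The function names `dice` and `max_dta` and the toy contours are hypothetical.

```python
import math

def dice(a, b):
    """Dice similarity coefficient between two sets of voxel coordinates."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention assumed here: two empty contours agree fully
    return 2.0 * len(a & b) / (len(a) + len(b))

def max_dta(surface_a, surface_b):
    """Symmetric maximum distance-to-agreement between two surface point sets.

    Assumption: maxDTA is taken as the Hausdorff distance, i.e. the largest
    nearest-neighbour distance in either direction. Brute force, O(N*M).
    """
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(surface_a, surface_b), directed(surface_b, surface_a))

# Toy 1D example: the AI contour is shifted by one voxel (1 mm) along x.
clinical = {(0, 0, 0), (1, 0, 0), (2, 0, 0), (3, 0, 0)}
ai = {(1, 0, 0), (2, 0, 0), (3, 0, 0), (4, 0, 0)}

print(dice(clinical, ai))     # 0.75
print(max_dta(clinical, ai))  # 1.0
```

In this toy case a one-voxel shift lowers DSC to 0.75 while maxDTA stays at 1 mm, which shows why the two metrics are reported together: DSC is sensitive to volume overlap, whereas maxDTA captures the single worst boundary deviation.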
Results: 527 unique ROIs were evaluated. MIM/RS-contours had high quantitative agreement in 34.0/36.2% of cases (DSC>0.9), performing well in pelvis (median DSC=0.91/0.92) and thorax (DSC=0.90/0.92). MIM/RS-contours had low quantitative agreement in 9.1/9.9% of cases (DSC<0.5), performing worse in brain (DSC=0.65/0.75) and H&N (DSC=0.75/0.79). Qualitatively, MIM/RS-contours were acceptable (rated 1-2) in 70.7/74.6% of ROIs (2906 ratings), higher for abdomen (MIM: 79.2%) and thorax (RS: 90.2%), and lower for H&N (29.0/35.6%). Performance was sensitive to individual ROIs (Friedman p<0.001), regardless of AI model or site, driven by inaccuracies near superior-inferior organ boundaries. DSC and maxDTA of RS-contours improved significantly (Wilcoxon signed-rank p<0.039) in 29.4/26.5% of ROIs compared to 11.8/8.8% for MIM-contours. Inter-rater agreement was fair to moderate (Kappa=0.26-0.49), strongest for abdomen ratings and weakest for pelvis ratings. Differences in quantitative performance between patient subsets across clinics (Kruskal-Wallis p<0.048) did not translate to differences in qualitative assessments, suggesting variable clinic-contour definitions.
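As an illustration of the qualitative summaries reported above, the sketch below computes the fraction of ratings deemed acceptable (scores 1-2) and a pairwise agreement statistic. The abstract does not state which kappa variant was used; Cohen's unweighted kappa for two raters is assumed here purely for illustration. The function names and the rating lists are hypothetical and are not study data.

```python
from collections import Counter

def acceptable_fraction(ratings):
    """Fraction of ratings scored 1 (acceptable) or 2 (minor revisions)."""
    return sum(r <= 2 for r in ratings) / len(ratings)

def cohen_kappa(r1, r2):
    """Unweighted Cohen's kappa for two raters scoring the same items.

    Assumption: the abstract's 'Kappa' could be a different variant
    (e.g. weighted or multi-rater); this is the simplest two-rater form.
    Undefined (division by zero) when expected agreement equals 1.
    """
    n = len(r1)
    observed = sum(a == b for a, b in zip(r1, r2)) / n
    c1, c2 = Counter(r1), Counter(r2)
    # Chance agreement from each rater's marginal category frequencies.
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical scores from two raters on the same eight contours.
rater_a = [1, 2, 2, 3, 4, 1, 2, 3]
rater_b = [1, 2, 3, 3, 4, 2, 2, 4]

print(acceptable_fraction(rater_a))             # 0.625
print(round(cohen_kappa(rater_a, rater_b), 3))  # 0.489
```

Here the raters agree on five of eight contours (62.5%), but after correcting for chance agreement kappa is about 0.49, which is why raw percent agreement alone tends to overstate rater consistency.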
Conclusion: Structured evaluation of commercial AI auto-contouring was implemented with complementary quantitative and qualitative analytics. Concordance of these analytics identified anatomic regions where AI-contouring can augment clinical workflows and regions that warrant AI-contouring improvement. This framework can support future evaluation and clinical commissioning of AI software.