Purpose: To compare autosegmentation methods on low-resolution T2-weighted on-board setup MRIs from a 1.5T MR-linac for off-line reconstruction of delivered dose.
Methods: Seven organs at risk (OARs) (parotids, submandibular glands, mandible, spinal cord, brainstem) were each contoured by 7 observers in 43 images. Ground truth (GT) contours were generated using STAPLE. Twenty autosegmentation methods were evaluated in ADMIRE: 1-9) atlas-based autosegmentation using a population atlas library (PAL) of 5/10/15 patients with STAPLE, patch fusion (PF), or random forest (RF) for label fusion; 10-19) autosegmentation using images from a patient’s 1-4 prior fractions (individualized patient prior (IPP)) with STAPLE/PF/RF label fusion; 20) deep learning (DL) (3D ResUNet trained on 43 GT structure sets plus 45 structure sets contoured by one observer). Autosegmentation methods were evaluated on 5 images against GT using Dice similarity coefficient, mean surface distance, Hausdorff distance, and Jaccard index. Execution time was measured for each method. Inter-observer variability was quantified using pairwise comparison. For each metric and OAR, performance was compared to inter-observer variability using Dunn’s test with control. Methods were compared pairwise using the Steel-Dwass test for each metric pooled across OARs. Further dosimetric analysis was performed on contours from DL (fastest), IPP_RF_4fractions (best performance), and PAL_STAPLE_5atlases (worst performance): delivered doses from clinical plans were recalculated on setup images with each structure set, and percent differences in Dmean and Dmax were measured between GT and each method.
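As an illustration of the overlap metrics and the dosimetric comparison named above, the following is a minimal sketch and not the implementation used in this work. It assumes contours are available as binary voxel masks on a common grid; the function names (`dice`, `jaccard`, `percent_difference`) and the sign convention of the percent difference (method minus GT, relative to GT) are hypothetical choices made for the example. The surface-based metrics (mean surface distance, Hausdorff distance) are omitted for brevity.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / total

def jaccard(mask_a, mask_b):
    """Jaccard index between two binary masks: |A∩B| / |A∪B|."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return np.logical_and(a, b).sum() / union

def percent_difference(method_value, gt_value):
    """Percent difference of a dose metric (e.g. Dmean or Dmax) relative to GT.

    Assumed convention: positive when the method's value exceeds the GT value.
    """
    return 100.0 * (method_value - gt_value) / gt_value

# Toy example: an 8-voxel GT cube and a 12-voxel autosegmentation that contains it.
gt = np.zeros((4, 4, 4), dtype=bool)
gt[1:3, 1:3, 1:3] = True
auto = np.zeros((4, 4, 4), dtype=bool)
auto[1:3, 1:3, 1:4] = True

print("Dice:", dice(gt, auto))
print("Jaccard:", jaccard(gt, auto))
print("Dmean difference (%):", percent_difference(21.0, 20.0))
```

The two overlap metrics are monotonically related (J = D / (2 - D)), so they rank methods identically for a single structure; the dose values in the last line are made-up numbers used only to exercise the function.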
Results: DL and IPP methods performed best; all significantly outperformed inter-observer variability, with no significant difference between these methods in pairwise comparison. Most PAL methods were not significantly different from inter-observer variability or from each other. DL was the fastest and PAL methods were the slowest. Execution time increased with the number of prior fractions/atlases. Dosimetric differences were minimal (median <6%) for DL and IPP_RF_4fractions but greater for PAL_STAPLE_5atlases (median <16%).
Conclusion: Autosegmentation using DL or IPP is superior to PAL considering both geometric and dosimetric criteria.
Funding Support, Disclosures, and Conflict of Interest: This work is directly supported by the NIH/NIDCR (Grant number 1F31DE029093, PI McDonald; grant number 1R01DE028290, PI Fuller) and a Dr. John J. Kopchick Fellowship. Dr. Fuller reports industry research support from Elekta. Ms. McDonald and Dr. Fuller report travel funds and speaking honoraria from Elekta.