Purpose: To establish a similarity metric to automatically flag clinically unacceptable contours in the female pelvis.
Methods: Six female pelvic structures (UteroCervix, CTVn, para-aortic lymph nodes (PAN), bladder, rectum, and kidneys) were contoured both manually and automatically on 87 CT image sets from 4 hospitals. Clinically acceptable and unacceptable contours were drawn manually, and two auto-contours per structure were generated with two independently developed deep learning-based auto-contouring systems. The clinical acceptability of one of the two auto-contours (the “primary contour”) was confirmed by radiation oncologists; the second auto-contour was used for QA purposes. Eleven similarity metrics (Dice similarity coefficient (DSC), Hausdorff distance, 95% Hausdorff distance, mean surface distance, and surface DSC with incremental thicknesses of 1-10 mm) were calculated between the primary and QA contours. The metrics were tested individually and in various combinations as inputs for a support vector machine, which used them to determine the optimal decision boundary between clinically acceptable and unacceptable contours. Four kernels (linear, radial basis function (RBF), sigmoid, and polynomial) were tested when multiple metrics were used as inputs.
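As an illustration of the two overlap metrics named above, the following is a minimal sketch that computes a volumetric DSC and a simplified surface DSC between a primary contour and a QA contour. It is not the implementation used in this work: it assumes 2D masks stored as sets of (row, column) pixel coordinates, an isotropic pixel spacing, a 4-neighbour definition of the contour boundary, and a brute-force distance search. The function names and the toy squares are hypothetical.

```python
from math import hypot


def dice(a, b):
    """Volumetric DSC between two masks given as sets of (row, col) pixels."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def boundary(mask):
    """Pixels of the mask that have at least one 4-neighbour outside it."""
    return {
        (r, c)
        for (r, c) in mask
        if any(n not in mask
               for n in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)))
    }


def surface_dice(a, b, tol_mm, spacing_mm=1.0):
    """Fraction of boundary pixels of each mask lying within tol_mm of the
    other mask's boundary (simplified surface DSC, brute-force search)."""
    sa, sb = boundary(a), boundary(b)
    if not sa and not sb:
        return 1.0

    def near(p, other):
        return any(
            hypot(p[0] - q[0], p[1] - q[1]) * spacing_mm <= tol_mm
            for q in other
        )

    hits = sum(near(p, sb) for p in sa) + sum(near(q, sa) for q in sb)
    return hits / (len(sa) + len(sb))


def square(r0, c0, size):
    """Toy contour: a filled square with its top-left corner at (r0, c0)."""
    return {(r, c) for r in range(r0, r0 + size) for c in range(c0, c0 + size)}


# Hypothetical example: the QA contour is the primary contour shifted 2 pixels.
primary = square(0, 0, 10)
qa = square(0, 2, 10)

print("DSC:", dice(primary, qa))
for tol in (1.0, 2.0, 3.0):
    print(f"surface DSC at {tol} mm:", round(surface_dice(primary, qa, tol), 3))
```

In this toy case the DSC is 0.8, while the surface DSC rises from about 0.56 at a 1 mm thickness to 1.0 at 2 mm, which shows how the thickness sets the size of boundary disagreement that the metric tolerates.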
Results: The highest accuracies were achieved using surface DSC with a thickness of 1, 2, or 3 mm: 0.91/0.90/0.89/0.92/0.95/0.97 for UteroCervix/CTVn/PAN/bladder/rectum/kidneys, respectively. When multiple metrics were used as inputs for the support vector machine, the average accuracies across structures were 0.90/0.88/0.70/0.88 for the linear/RBF/sigmoid/polynomial kernels, respectively, making the linear kernel the most accurate. However, using multiple metrics did not significantly improve accuracy compared with using a single surface DSC as the input.
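With a single metric as the input, a linear decision boundary reduces to one cut-point on that metric. The sketch below illustrates that idea with an exhaustive threshold search that maximizes accuracy. It is a simplified stand-in, not the support vector machine used in this work: a linear support vector machine also yields a single cut-point for a scalar input, but chooses it by margin maximization rather than by direct accuracy search. The function name, scores, and labels are made up for illustration and are not study data.

```python
def best_threshold(scores, labels):
    """Return (threshold, accuracy) for the rule
    'score >= threshold means clinically acceptable',
    choosing the first candidate threshold with the highest accuracy."""
    values = sorted(set(scores))
    # Candidates: below the minimum, midpoints between neighbours, above the maximum.
    candidates = (
        [values[0] - 1e-9]
        + [(lo + hi) / 2 for lo, hi in zip(values, values[1:])]
        + [values[-1] + 1e-9]
    )
    best = None
    for t in candidates:
        correct = sum((s >= t) == bool(y) for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if best is None or acc > best[1]:
            best = (t, acc)
    return best


# Hypothetical surface DSC values (1 = acceptable, 0 = unacceptable).
scores = [0.40, 0.62, 0.70, 0.85, 0.88, 0.91, 0.95, 0.97]
labels = [0, 0, 0, 1, 0, 1, 1, 1]

threshold, accuracy = best_threshold(scores, labels)
print(f"threshold = {threshold:.3f}, accuracy = {accuracy:.3f}")
```

For these made-up values the search selects a threshold of 0.775 with an accuracy of 0.875; one unacceptable contour with a high score is misclassified, which is the kind of overlap that keeps accuracy below 1.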
Conclusion: Clinically acceptable contours were distinguished from clinically unacceptable contours with accuracies of 0.89 to 0.97 for the relevant structures in the female pelvis; the most accurate similarity metric was surface DSC with a thickness of 1, 2, or 3 mm.
Funding Support, Disclosures, and Conflict of Interest: This work was supported by National Institutes of Health/National Cancer Institute grants UH2 CA202665, UH3 CA202665, and P30 CA016672 (Clinical Trials Support Resource) and partially funded by Varian. Hester Burger does consulting work for Varian Medical Affairs.