Purpose: Deep learning (DL) has shown great potential for a variety of contouring tasks, but safe clinical implementation of artificial intelligence (AI)-generated contours remains a challenge. We introduce a novel implementation framework that uses dual, independent AI algorithms to identify low-quality contours for quality assurance (QA).
Methods: Two fully independent DL models were trained to segment the heart. Model_1 (U-Net architecture) was trained on CTs with segmentations by cardiologists (n=858). Model_2 (U-Net with ResNet encoders) was trained on radiation oncology planning CTs and segmentations from lung cancer patients (n=700). Both models were used to segment the heart in 2867 breast cancer patients. The outputs of the two models were then compared geometrically to ground truth and to each other to define action levels for dissimilarity (low-quality segmentations). The DL models and action levels were implemented in a QA tool that prospectively screens all planning CTs by deploying both models and either (1) comparing each model's output in AI-only contouring workflows or (2) checking the final human-edited contours in an AI+human collaboration workflow (the AI-Human-AI sandwich).
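The geometric comparison between the two models' outputs rests on the Dice similarity coefficient over binary segmentation masks. A minimal NumPy sketch (the function names, the QA wrapper, and the empty-mask convention are illustrative assumptions, not details from the study):

```python
import numpy as np

def dice(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Dice similarity coefficient between two binary segmentation masks."""
    a = mask_a.astype(bool)
    b = mask_b.astype(bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # assumption: two empty masks count as perfect agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Hypothetical QA check: flag a case for manual secondary review when the
# two models disagree beyond the action level reported in the Results.
ACTION_LEVEL = 0.85

def needs_review(mask_model1: np.ndarray, mask_model2: np.ndarray) -> bool:
    return dice(mask_model1, mask_model2) < ACTION_LEVEL
```

In an AI-only workflow both masks come from the two independent models; in the AI-Human-AI sandwich, the clinician-edited contour is compared against the second model's output in the same way.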
Results: Compared with ground truth, the median Dice was 0.90 (IQR=0.05) for Model_1 and 0.91 (IQR=0.04) for Model_2. Sixty-one cases (2.7%) showed a Dice below 0.75 (our threshold for dissimilarity). Agreement between the two DL models was high (median Dice: 0.94, IQR=0.02). Using the ROC curve, we identified an appropriate action level to trigger manual secondary review at a Dice of 0.85 between the two AI models (accuracy: 0.974; sensitivity: 0.820; specificity: 0.979). In a subset of prospectively collected patients with clinician-edited Model_1 contours deployed in the clinic (n=20), QA with Model_2 detected no low-quality contours.
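The ROC-based selection of the action level can be sketched as a threshold sweep over inter-model agreement, maximizing Youden's J (sensitivity + specificity - 1) for detecting cases whose Dice against ground truth falls below 0.75. All data below are synthetic and the names are illustrative; this is not the study's actual analysis code:

```python
import numpy as np

# Synthetic stand-in cohort: mostly acceptable cases, a few low-quality ones.
rng = np.random.default_rng(42)
good = rng.normal(0.92, 0.03, 480)           # synthetic "acceptable" Dice vs ground truth
bad = rng.normal(0.65, 0.08, 20)             # synthetic "low-quality" Dice vs ground truth
dice_vs_gt = np.clip(np.concatenate([good, bad]), 0.0, 1.0)
# Inter-model agreement is assumed to track quality, plus some noise.
inter_model = np.clip(dice_vs_gt + rng.normal(0.0, 0.02, 500), 0.0, 1.0)

is_low_quality = dice_vs_gt < 0.75           # positive class the QA tool must catch

def youden_j(threshold: float) -> float:
    """Youden's J for flagging cases whose inter-model Dice falls below threshold."""
    flagged = inter_model < threshold
    sensitivity = flagged[is_low_quality].mean()
    specificity = (~flagged[~is_low_quality]).mean()
    return sensitivity + specificity - 1.0

# Sweep candidate Dice thresholds and keep the one maximizing J.
thresholds = np.linspace(0.5, 0.99, 50)
action_level = thresholds[np.argmax([youden_j(t) for t in thresholds])]
```

On real data the chosen cut-off, and the accuracy/sensitivity/specificity it yields, would be read off the same sweep; the study reports a resulting action level of Dice = 0.85.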
Conclusion: We successfully implemented a dual-AI approach for auto-segmentation quality control in the clinic.