Purpose: To evaluate the performance of a single convolutional neural network for multi-organ segmentation, trained on a combination of two publicly available datasets of CT images.
Methods: A total of 88 CT images with segmentations of the liver, left kidney, spleen, gallbladder, esophagus, stomach, pancreas, and duodenum, and 83 with segmentations of the liver, bladder, and left and right kidneys, are prepared from two different publicly available datasets. Five segmentations from the first set and eight from the second are excluded from training and used for validation. A DenseVNet is trained on each dataset, and inference is performed on the other dataset to generate segmentations of the organs missing from its labels. A combined dataset of 121 segmentations (including validation) is then used to train a single network to segment all 10 abdominopelvic organs. Accuracy is evaluated with Dice similarity coefficients, and a t-test is used to identify significant differences in accuracy.
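As an illustration of the accuracy metric named in the Methods, the following is a minimal sketch of a per-organ Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|), computed on flattened integer label maps. The function name `dice_per_label`, the label numbering, and the toy data are assumptions made for this example; they are not taken from the study, which would operate on full 3D volumes, typically with an array library.

```python
import math


def dice_per_label(pred, truth, labels):
    """Return {label: Dice coefficient} for two flattened label maps.

    pred and truth are equal-length sequences of integer organ labels
    (hypothetical encoding: 0 = background, 1..N = organs).
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same number of voxels")
    scores = {}
    for label in labels:
        n_pred = n_truth = n_both = 0
        for p, t in zip(pred, truth):
            in_pred = p == label
            in_truth = t == label
            n_pred += in_pred
            n_truth += in_truth
            n_both += in_pred and in_truth
        denom = n_pred + n_truth
        # Dice is undefined when the organ is absent from both maps.
        scores[label] = 2.0 * n_both / denom if denom else math.nan
    return scores


# Toy 3x3 slice, flattened row by row (illustrative values only).
truth = [0, 1, 1,
         0, 2, 2,
         0, 0, 2]
pred = [0, 1, 0,
        0, 2, 2,
        0, 2, 2]
scores = dice_per_label(pred, truth, labels=[1, 2, 3])
```

Per-image scores of this kind, collected for two networks on the same validation images, are the sort of samples a t-test would then compare.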
Results: For the liver and left kidney, differences in the average Dice coefficients between segmentations are largely explained by differences in the ground-truth segmentations provided by different observers. For organs whose segmentations are inferred for part of the combined dataset, average Dice scores on the validation set tend to decrease when the network is trained on the combined dataset. However, for the spleen, esophagus, and stomach the difference is negligible, as indicated by large p-values (>0.5) from a t-test on the Dice coefficients. For the other organs, the p-values range from 0.005 to 0.16.
Conclusion: Combining two datasets can improve the robustness of a multi-organ segmentation network by providing a more varied set of images and segmentations produced by more human observers. For more easily segmented organs, the accuracy of a multi-organ network may be preserved even when these organs are inferred for part of the dataset. For smaller, more difficult organs, several dedicated networks may be necessary to achieve the best results.