Session: Multi-Disciplinary General ePoster Viewing

Machine Learning-Generated Radiomic and Clinical Data for Lung Cancer Survival Prediction

L Tu1,2*, HHF Choi2, H Clark1,3, S AM Lloyd1,2, (1) University of British Columbia, Vancouver, BC, CA, (2) BC Cancer Agency, Vancouver, BC, CA, (3) BC Cancer Agency, Surrey, BC, CA


PO-GePV-M-55 (Sunday, 7/10/2022)   [Eastern Time (GMT-4)]

ePoster Forums

Purpose: To evaluate the viability of a synthetic radiomic feature dataset including clinical factors for developing machine learning models that predict cancer survival.

Methods: The computed tomography images and clinical information of 132 real patients with non-small cell lung carcinoma were used in this study. Sixty-one radiomic features were extracted from manually segmented tumours using the radiomics software Imaging Biomarker Explorer. Synthetic datasets of 132 examples, each containing radiomic and clinical features, were generated using CTGAN, a machine learning-based tabular data synthesizer. Each dataset was split 80%/20% into training and testing sets, respectively. The distributions of features in the synthetic and real datasets were compared to assess the quality of the synthetic data. A random forest (RF) model predicting 2-year survival was trained on the synthetic training set and evaluated on the real testing set. As a baseline, RFs were also trained and tested exclusively on real data or exclusively on synthetic data. To assess the effect of synthetic dataset size on predictive performance, RFs were trained on additional CTGAN-generated datasets containing up to 100,000 examples and then tested on the real testing set.
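The evaluation protocol described in the Methods can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' code: the feature table is random placeholder data, `toy_synthesizer` is a hypothetical stand-in for CTGAN (it only resamples rows and adds noise), the synthesizer is fit on the real training split, and the forest settings and sweep sizes are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder for the real table: 132 patients, 61 features, and a binary
# 2-year survival label. These values are random, not study data.
n_patients, n_features = 132, 61
X_real = rng.normal(size=(n_patients, n_features))
y_real = (X_real[:, 0] + rng.normal(size=n_patients) > 0).astype(int)

# 80%/20% split into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_real, y_real, test_size=0.2, stratify=y_real, random_state=0
)


def toy_synthesizer(X, y, n_samples, rng, noise=0.1):
    """Hypothetical stand-in for CTGAN: resample rows and add Gaussian jitter."""
    idx = rng.integers(0, len(X), size=n_samples)
    jitter = rng.normal(scale=noise, size=(n_samples, X.shape[1]))
    return X[idx] + jitter, y[idx]


def fit_and_score(X_fit, y_fit, X_eval, y_eval):
    """Train a random forest and return balanced accuracy on the evaluation set."""
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X_fit, y_fit)
    return balanced_accuracy_score(y_eval, model.predict(X_eval))


# Baseline (real -> real) versus train-on-synthetic, test-on-real.
X_syn, y_syn = toy_synthesizer(X_train, y_train, len(X_train), rng)
scores = {
    "real_to_real": fit_and_score(X_train, y_train, X_test, y_test),
    "synthetic_to_real": fit_and_score(X_syn, y_syn, X_test, y_test),
}

# Sweep over synthetic training-set sizes, always testing on real data.
size_sweep = {
    n: fit_and_score(*toy_synthesizer(X_train, y_train, n, rng), X_test, y_test)
    for n in (100, 1000, 5000)
}
```

Because the data and the synthesizer here are placeholders, the resulting scores say nothing about the study; the sketch only shows how the real testing set stays fixed while the training source and size vary.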

Results: CTGAN generated features with a distribution similar to that of the real features, yielding a Kolmogorov–Smirnov statistic of 0.859. The RF trained on synthetic data and tested on real data achieved a balanced accuracy of 0.701. RFs trained and tested exclusively on real data or exclusively on synthetic data yielded balanced accuracies of 0.576 and 0.615, respectively. A non-monotonic relationship between model performance and synthetic dataset size was observed.
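For readers unfamiliar with the two metrics reported here, a minimal pure-Python sketch of their standard definitions follows. The abstract does not state how the reported Kolmogorov–Smirnov value was computed or aggregated across features, so the function below shows only the textbook two-sample statistic; the function names and toy inputs are illustrative assumptions.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The supremum of the gap is attained at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall (sensitivity and specificity for binary labels)."""
    recalls = []
    for label in sorted(set(y_true)):
        preds = [p for t, p in zip(y_true, y_pred) if t == label]
        recalls.append(sum(p == label for p in preds) / len(preds))
    return sum(recalls) / len(recalls)


# Toy inputs, not study data.
real_feature = [1.0, 2.0, 3.0, 4.0]
synthetic_feature = [1.5, 2.5, 3.5, 4.5]
ks = ks_statistic(real_feature, synthetic_feature)

y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
bal_acc = balanced_accuracy(y_true, y_pred)
```

Balanced accuracy averages the recall of each class, so it is not inflated by predicting the majority outcome when survivors and non-survivors are unevenly represented.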

Conclusion: CTGAN generated radiomic and clinical information that closely mimics the real dataset, showing promise as a data augmentation approach for radiomics. However, the variable performance of RFs across synthetic dataset sizes suggests there may be an optimal size for the synthetic dataset. This trade-off between the size and quality of the dataset should be carefully considered when creating predictive models.




IM- Dataset Analysis/Biomathematics: Machine learning
