Purpose: Radiomic features have been integrated into machine learning models to predict clinical outcomes (e.g., survival) in patients with lung cancer. However, effective feature and model selection, and its impact on prediction performance, has yet to be fully examined. We conducted a comparative study evaluating multiple machine learning and feature selection techniques to address this issue.
Methods: A total of 250 CT images, acquired on a GE VCT scanner for treatment planning, were retrospectively retrieved. Radiomic features were extracted from the lung nodules using IBEX and used to predict 2-year survival. Redundant features were removed using a Spearman correlation threshold of ≥0.95. Seven commonly used predictive models and five feature selection methods were evaluated, with the area under the receiver operating characteristic curve (AUC) as the performance metric. The feature selection methods were Analysis of Variance (ANOVA), Least Absolute Shrinkage and Selection Operator (LASSO), Mutual Information (MI), Maximum Relevance and Minimum Redundancy (mRMR), and Relief. The models were Support Vector Classifier (SVC), Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), Logistic Regression (LR), Gradient Boosting (GB), and K-Nearest Neighbors (KNN). The impact of the number of selected features on prediction performance was also investigated.
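The correlation filtering step can be illustrated with a minimal sketch. It is not the study's implementation: the greedy keep-first rule, the use of the absolute correlation, and the feature names and values are all assumptions made for illustration. In practice a library routine such as `scipy.stats.spearmanr` could replace the hand-written correlation.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def filter_redundant(features, threshold=0.95):
    """Greedy filter (assumed rule): keep a feature only if its absolute
    Spearman correlation with every already-kept feature is below threshold."""
    kept = []
    for name, values in features.items():
        if all(abs(spearman(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical toy data: one value per patient for each feature.
toy_features = {
    "volume": [1.0, 2.0, 3.0, 4.0, 5.0],
    "volume_cubed": [1.0, 8.0, 27.0, 64.0, 125.0],  # monotone in volume
    "texture": [2.0, 5.0, 1.0, 4.0, 3.0],
}
kept_features = filter_redundant(toy_features)
```

Because Spearman correlation is rank-based, the monotone transform `volume_cubed` is treated as fully redundant with `volume`, while `texture` is retained.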
Results: The cohort comprised 126 patients with <2-year survival and 124 patients with ≥2-year survival. A total of 1419 features were extracted per patient, of which 220 remained after correlation filtering. Prediction performance depended strongly on the combination of selected features and prediction model; the highest performance was obtained with 5 features selected by mRMR and modeled with RF (AUC=0.707). RF yielded the highest AUC in 13 of 25 tests with varying numbers of features.
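For readers less familiar with the metric, the AUC is the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case. The sketch below computes it by pairwise comparison. The label encoding (1 for one survival class, 0 for the other) and the scores are hypothetical and are not data from this study; a library routine such as `sklearn.metrics.roc_auc_score` would normally be used instead.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs in which the
    positive case has the higher score; tied pairs count as one half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and model scores for six patients.
toy_labels = [1, 1, 0, 0, 1, 0]
toy_scores = [0.9, 0.4, 0.35, 0.8, 0.7, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 7 of 9 pairs ranked correctly
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.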
Conclusion: Radiomic feature selection and machine learning methods are interdependent. The selected features depend strongly on the feature selection method. The feature/model combination and the number of selected features should both be considered to improve predictive performance.
IM/TH- Image Analysis (Single Modality or Multi-Modality): Imaging biomarkers and radiomics