
Session: Data Science Robustness, Performance, and Data Harmonization

Curation of a Large Public Head and Neck Dataset for Machine Learning (RADCURE) Using An Automated Data Mining Platform

M Welch1, T Patel4, M Kazmierski3, J Marsilla3, S Huang1,2, S Kim3, K Rey-McIntyre1, B O'Sullivan1,2, J Waldron1,2, N Becker5, S Bratman1,2, A Hope1,2, B Haibe-Kains3, T Tadic1,2*, (1) Princess Margaret Cancer Centre, University Health Network, Toronto, ON, CA, (2) Department of Radiation Oncology, University of Toronto, Toronto, ON, CA, (3) Department of Medical Biophysics, University of Toronto, ON, CA, (4) Techna Institute, University Health Network, Toronto, ON, CA, (5) BCCA - Kelowna, Kelowna, BC, CA


SU-H430-IePD-F6-3 (Sunday, 7/10/2022) 4:30 PM - 5:00 PM [Eastern Time (GMT-4)]

Exhibit Hall | Forum 6

Purpose: Machine learning in radiotherapy (RT) is challenged by insufficient dataset sizes and limited clinical outcomes, often leading to inconclusive results and poor generalizability. The purpose of our work is to generate the largest openly available head and neck cancer (HNC) dataset (called RADCURE) that captures the heterogeneity and complexity of this disease, providing an unparalleled opportunity for discovery and innovation.

Methods: An in-house data mining platform was used to extract and translate RT data from our treatment planning system (TPS) and oncology information system (OIS). All HNC patients treated at our institution with IMRT between 2005 and 2017 were included (n=4130). Inputs to the extraction included patient identifiers, treatment dates, and dose-fractionations from our HNC anthology. The mining platform uses a series of data comparisons to select the TPS files corresponding to delivered plans in the OIS. DICOM images were extracted from the TPS, and RTSTRUCTs were translated from TPS-specific files. Inclusion criteria for the dataset were the ability to identify a single unique RT plan and a non-empty gross tumor volume.
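The two inclusion criteria can be illustrated with a minimal Python sketch. This is not the authors' platform: the record layout, the field names (`candidate_plans`, `gtv_voxel_count`, and so on), and the choice to match plans on dose, fraction count, and start date are all assumptions made for illustration, since the abstract says only that "a series of data comparisons" is used.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class CandidatePlan:
    """A plan found in the TPS (hypothetical fields)."""
    plan_id: str
    dose_gy: float
    fractions: int
    start_date: str  # ISO date string, e.g. "2010-03-15"


@dataclass
class PatientRecord:
    """Delivered-treatment data from the OIS plus TPS candidates (hypothetical)."""
    patient_id: str
    delivered_dose_gy: float
    delivered_fractions: int
    treatment_start: str
    gtv_voxel_count: int  # assumed proxy for "non-empty gross tumor volume"
    candidate_plans: List[CandidatePlan] = field(default_factory=list)


def matching_plans(record: PatientRecord) -> List[CandidatePlan]:
    """Return TPS plans whose dose, fractions, and start date match the delivered plan."""
    return [
        plan
        for plan in record.candidate_plans
        if math.isclose(plan.dose_gy, record.delivered_dose_gy, abs_tol=1e-6)
        and plan.fractions == record.delivered_fractions
        and plan.start_date == record.treatment_start
    ]


def meets_inclusion_criteria(record: PatientRecord) -> bool:
    """Exactly one matching plan and a non-empty GTV are both required."""
    return len(matching_plans(record)) == 1 and record.gtv_voxel_count > 0
```

Under these assumptions, a patient with two indistinguishable candidate plans, or with an empty gross tumor volume, would be excluded, which is one way to read the reduction from 4130 screened patients to the final curated cohort.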

Results: We successfully curated a final dataset of 2745 patients, including 350 GB of planning images, structure sets, and 14 clinical variables such as age, sex, smoking status, disease site, and local/regional/distant failure. RTSTRUCTs include all organs-at-risk and targets, with names following a standardized nomenclature. The dataset population has a median age of 63 years and is 80% male. Oropharyngeal cancer accounted for 50% of patients, with larynx, nasopharynx, and hypopharynx cancers comprising 25%, 12%, and 5%, respectively. Median follow-up was 5 years, with 60% of patients alive at last follow-up.

Conclusion: Large public datasets are critical for enabling machine learning research. We successfully developed a data mining platform to curate the largest single-institution dataset of RT data and outcomes, providing a unique contribution to cancer imaging research. Future plans include adding RT plans and dose to expand the dataset reach and usability.


Computer Vision, Image Analysis, Software


IM/TH- Informatics: Informatics in Therapy (general)
