Levenshtein Distance-Based Feature Representation for Structure Name Standardization in Radiotherapy

Y Hu*, L Santanam, Memorial Sloan Kettering Cancer Center, New York, NY

Presentations

SU-H330-IePD-F9-2 (Sunday, 7/10/2022) 3:30 PM - 4:00 PM [Eastern Time (GMT-4)]

Exhibit Hall | Forum 9

Purpose: To develop a robust text-only feature representation of non-standard structure names for efficient automatic standardization of structure names based on the TG-263 nomenclature.

Methods: We selected 16 and 33 common standard names in the abdomen and head-neck groups, respectively, from the TG-263 nomenclature. We augmented each standard name by 1) swapping words, 2) removing fixed prefixes, e.g. ‘Glnd_’, 3) removing characters from the name except for the first and last ones, and 4) appending a number from 0 to 5 at the end, resulting in a total of 1804 and 15565 augmented non-standard names for the abdomen and head-neck sites, respectively. We calculated the Levenshtein distance from a non-standard name to each of the standard names in the same site to create a 16- or 33-dimensional feature vector representing the non-standard name. The augmented feature dataset was split 80-20 for training and testing three site-specific classifiers: a support vector classifier (SVC), a random forest (RF), and a multi-layer perceptron network with three 256-node layers (MLP). We also tested the MLP on 22 clinical head-neck cases.
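The feature construction described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `levenshtein` and `feature_vector` are hypothetical, the list `STANDARD_NAMES` is a small illustrative subset of TG-263-style names rather than the 16 abdomen names used in the study, and the comparison is assumed to be case-sensitive with unit edit costs, which the abstract does not specify.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, and substitution."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if ch_a == ch_b else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def feature_vector(name: str, standard_names: Sequence[str]) -> List[int]:
    """One distance per standard name, in the order the standard names are given."""
    return [levenshtein(name, standard) for standard in standard_names]


# Illustrative subset only (assumption); the study used 16 abdomen and 33 head-neck names.
STANDARD_NAMES = ["Kidney_L", "Kidney_R", "Liver", "Stomach", "SpinalCord"]

if __name__ == "__main__":
    # A non-standard name with one character missing.
    print(feature_vector("Kidny_L", STANDARD_NAMES))
```

Each non-standard name maps to a vector whose length equals the number of standard names for the site, so the same function would yield 16- or 33-dimensional inputs for the site-specific classifiers. A standard name itself maps to a vector with a zero in its own position.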

Results: The t-distributed stochastic neighbor embedding (t-SNE) plot showed well-separated clusters of the training samples, demonstrating the effectiveness of the feature representation. The classification accuracies of the SVC, RF, and MLP were 0.95, 1.0, and 1.0, respectively, for the augmented abdomen dataset and 0.98, 0.99, and 0.98 for the augmented head-neck dataset. The accuracy of the MLP on the clinical cases was 0.94.

Conclusion: Using the proposed Levenshtein distance-based feature representation, machine learning and neural network classifiers can achieve high accuracy in classifying non-standard structure names. The text-only feature requires no image information and is derived directly from the relationship between the non-standard name and all standard structure names in the same disease site. The calculation is computationally efficient, which facilitates online or offline automatic structure name standardization in radiotherapy.

Funding Support, Disclosures, and Conflict of Interest: This research was funded in part through the NIH/NCI Cancer Center Support Grant P30 CA008748.

Keywords

Classifier Design, Feature Extraction, Structure Analysis

Taxonomy

IM/TH- Informatics: Informatics in Therapy (general)
