Purpose: This study aims to develop a natural language processing (NLP)-based linguistic system to detect, annotate, and extract radiation therapy (RT)-related information in clinical trial protocols to assist treatment planning and plan checking.
Methods: A total of 18 Institutional Review Board (IRB) protocols were used, including 17 RT protocols and one non-RT protocol for comparison. The selected protocols cover multiple treatment sites, including brain, breast, head and neck, lung, pancreas, and prostate. A database containing 170 entities in 8 categories of radiotherapy was created for named entity recognition (NER). A pretrained transformer-based language model from the Python library spaCy (a RoBERTa-based NLP pipeline) with NER, part-of-speech tagging, and lemmatization was used to identify and annotate entities in the protocols. The first phase of NER used all 170 entities for frequency counting. In the second phase, a subset of high-frequency entities (58 of 170) was selected to localize the RT section(s). Correlation rules were used to remove identified entities that do not pertain to RT, such as “dose” in a drug prescription. The third phase used subgroups of entities for category-based annotation. Part-of-speech tagging was used to separate functional elements such as verbs, adverbs, and nouns for extraction of dosimetric criteria.
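The first two phases described above (dictionary-based frequency counting, then a correlation rule that discards non-RT uses of a term) can be illustrated with a minimal sketch. This is not the system used in the study: it replaces the spaCy transformer pipeline with plain regular-expression tokenization so that it runs without external dependencies, and the entity dictionary, the drug-context word list, the window size, and the function names are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: entity term -> category.
# (The study used 170 entities in 8 categories; these four are illustrative.)
ENTITY_DICT = {
    "dose": "prescription",
    "fraction": "prescription",
    "ptv": "target",
    "imrt": "technique",
}

# Hypothetical context words suggesting a drug prescription rather than RT.
DRUG_CONTEXT = {"mg", "drug", "tablet", "oral", "infusion"}

TOKEN_RE = re.compile(r"[a-z0-9]+")


def normalize(token):
    """Crude plural stripping, a stand-in for real lemmatization."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token


def find_entities(text, window=3):
    """Return (token_index, term, category) for each RT entity mention.

    A simple correlation rule drops "dose" when a drug-context word
    appears within `window` tokens on either side.
    """
    tokens = [normalize(t) for t in TOKEN_RE.findall(text.lower())]
    hits = []
    for i, tok in enumerate(tokens):
        if tok not in ENTITY_DICT:
            continue
        context = tokens[max(0, i - window): i + window + 1]
        if tok == "dose" and DRUG_CONTEXT.intersection(context):
            continue  # likely a drug dose, not an RT dose
        hits.append((i, tok, ENTITY_DICT[tok]))
    return hits


def count_entities(text):
    """Phase-one style frequency count of the retained entity mentions."""
    return Counter(term for _, term, _ in find_entities(text))
```

For example, in the text "The PTV receives a dose of 60 Gy in 30 fractions using IMRT. The oral drug dose is 75 mg daily.", the first "dose" is counted and the second is filtered out by the drug-context rule.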
Results: In 16 of the 17 RT protocols tested, the system successfully performed the following functions: (1) localization of and navigation to the RT section(s); (2) labelling of the identified entities of a chosen category; and (3) detection of special requirements such as multiple arms and motion management. An RT-relevance indicator was defined to detect non-RT protocols based on the absolute number of entity mentions.
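As a rough illustration of how section localization and a mention-based relevance indicator might work, the sketch below scores each section by its count of RT terms and flags a protocol as RT-related when the total count reaches a threshold. The term list, the threshold value, and the function names are assumptions made for this example; the abstract does not specify how the indicator is computed beyond its use of absolute mentions, and this sketch omits the correlation rules described in the Methods.

```python
import re

# Hypothetical high-frequency RT terms (the study used a 58-entity subset).
RT_TERMS = {"radiotherapy", "dose", "gy", "fraction", "ptv", "imrt"}
TOKEN_RE = re.compile(r"[a-z0-9]+")


def mention_count(text):
    """Absolute number of RT-term mentions in a piece of text."""
    return sum(1 for tok in TOKEN_RE.findall(text.lower()) if tok in RT_TERMS)


def locate_rt_section(sections):
    """Return the title of the section with the most RT mentions, or None.

    `sections` maps section titles to section text.
    """
    counts = {title: mention_count(body) for title, body in sections.items()}
    best = max(counts, key=counts.get, default=None)
    if best is None or counts[best] == 0:
        return None
    return best


def is_rt_protocol(sections, min_mentions=3):
    """Assumed indicator: total mentions at or above a chosen threshold."""
    total = sum(mention_count(body) for body in sections.values())
    return total >= min_mentions
```

A protocol whose sections contain no RT terms yields `None` from `locate_rt_section` and `False` from `is_rt_protocol`; in practice the threshold would need to be tuned on real protocols.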
Conclusion: An NLP-based system was developed and tested on 18 protocols. Clinical application of this system will assist the protocol review process.