Automatic Support System for Tumor Coding in Pathology Reports in Spanish
29 Pages Posted: 10 Dec 2021
Abstract
Pathology reports provide valuable information for cancer registries to understand, plan and implement strategies to mitigate the impact of cancer. However, coding key information from unstructured reports is a time-consuming manual process carried out by experts. Here we report an automatic deep learning-based system that recognizes tumor morphology and topography mentions in free text and suggests codes from the International Classification of Diseases for Oncology (ICD-O) in Spanish. To build the system, we combined an in-house annotated corpus of tumor morphology and topography mentions with the CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition) corpus, an open-source dataset annotated with tumor morphology mentions. To create a Named Entity Recognition (NER) model, we applied transfer learning from state-of-the-art pre-trained language models. The mentions found with this model were subsequently coded using a search engine tailored to the ICD-O codes. Our NER models obtained F1 scores of 0.86 and 0.90 for tumor morphology and topography, respectively. Overall, our automatic coding system achieved an accuracy at five suggestions of 0.72 and 0.65 for tumor morphology and topography, respectively. Our results demonstrate the feasibility of implementing NLP tools in the routine of a cancer center to extract and code valuable information from pathology reports.
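As an illustration of the "accuracy at five suggestions" figure reported above, the following is a minimal sketch of how such a metric is commonly computed: a mention counts as correct if its gold code appears among the top k ranked suggestions. The function name `accuracy_at_k`, its signature, and the toy inputs are assumptions made for this example; they are not taken from the authors' implementation or data.

```python
from typing import Sequence


def accuracy_at_k(
    ranked_suggestions: Sequence[Sequence[str]],
    gold_codes: Sequence[str],
    k: int = 5,
) -> float:
    """Fraction of mentions whose gold code is among the top-k suggestions.

    Hypothetical helper for illustration only; not the authors' code.
    """
    if len(ranked_suggestions) != len(gold_codes):
        raise ValueError("one suggestion list is required per gold code")
    if not gold_codes:
        return 0.0
    # A mention is a hit when its gold code is within the first k suggestions.
    hits = sum(
        1
        for suggestions, gold in zip(ranked_suggestions, gold_codes)
        if gold in suggestions[:k]
    )
    return hits / len(gold_codes)


# Toy inputs (illustrative only): ranked code suggestions for three mentions.
suggestions = [
    ["8140/3", "8070/3"],  # gold code ranked second
    ["8500/3", "8140/3"],  # gold code not suggested
    ["8070/3"],            # gold code ranked first
]
gold = ["8070/3", "8010/3", "8070/3"]

score_at_5 = accuracy_at_k(suggestions, gold, k=5)
score_at_1 = accuracy_at_k(suggestions, gold, k=1)
```

With these toy inputs, two of the three mentions are covered within five suggestions and only one is covered by the top suggestion alone, which shows why a suggestion list can be useful to a human coder even when the first-ranked code is wrong.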
Keywords: Natural Language Processing, Cancer, Electronic Health Records, Data Mining, Data Warehousing