Automatic Support System for Tumor Coding in Pathology Reports in Spanish

29 Pages Posted: 10 Dec 2021

Fabián Villena

University of Chile - Faculty of Physical and Mathematical Sciences - Center for Mathematical Modeling - CNRS IRL 2807

Pablo Báez

University of Chile - Faculty of Medicine - Center of Medical Informatics and Telemedicine

Sergio Peñafiel

University of Chile

Matías Rojas

University of Chile

Inti Paredes

Instituto Oncológico Fundación Arturo López Pérez

Jocelyn Dunstan

University of Chile - Faculty of Physical and Mathematical Sciences - Initiative for Data & Artificial Intelligence and Center for Mathematical Modeling - CNRS IRL 2807

Abstract

Pathology reports provide valuable information for cancer registries to understand, plan, and implement strategies to mitigate the impact of cancer. However, coding key information from these unstructured reports is a time-consuming manual process performed by experts. Here we report an automatic deep learning-based system that recognizes tumor morphology and topography mentions in free text and suggests codes from the International Classification of Diseases for Oncology (ICD-O) in Spanish. We built the system by combining an in-house corpus annotated with tumor morphology and topography mentions with the CANTEMIST (CANcer TExt Mining Shared Task – tumor named entity recognition) corpus, an open-source dataset annotated with tumor morphology mentions. To create a Named Entity Recognition (NER) model, we applied transfer learning from state-of-the-art pre-trained language models. The mentions found by this model were subsequently coded using a search engine tailored to the ICD-O codes. Our NER models obtained F1 scores of 0.86 and 0.90 for tumor morphology and topography, respectively. The complete automatic coding system achieved an accuracy at five suggestions of 0.72 and 0.65 for tumor morphology and topography, respectively. Our results demonstrate the feasibility of implementing NLP tools in the routine of a cancer center to extract and code valuable information from pathology reports.
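
The "accuracy at five suggestions" figure can be read as a top-k accuracy. The sketch below is one plausible reading, not the authors' evaluation code: it assumes a mention counts as correctly coded when its gold ICD-O code appears among the first k codes the system suggests. The function name `accuracy_at_k` and the toy data are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Sequence


def accuracy_at_k(gold_codes: Sequence[str],
                  ranked_suggestions: Sequence[Sequence[str]],
                  k: int = 5) -> float:
    """Fraction of mentions whose gold code is among the top-k suggestions.

    Assumed definition (not confirmed by the source): a mention is a hit
    if its gold code appears anywhere in the first k ranked suggestions.
    """
    if len(gold_codes) != len(ranked_suggestions):
        raise ValueError("one suggestion list is required per gold code")
    if not gold_codes:
        return 0.0
    hits = sum(
        1
        for gold, suggestions in zip(gold_codes, ranked_suggestions)
        if gold in suggestions[:k]
    )
    return hits / len(gold_codes)


# Toy data for illustration only (hypothetical, not from the paper).
gold = ["8140/3", "8070/3", "8500/3", "8140/3"]
suggested = [
    ["8140/3", "8140/2"],            # gold code at rank 1: hit
    ["8071/3", "8072/3", "8070/3"],  # gold code at rank 3: hit for k >= 3
    ["8140/3", "8010/3"],            # gold code absent: miss
    [],                              # no suggestions returned: miss
]

print(accuracy_at_k(gold, suggested, k=5))  # 0.5
print(accuracy_at_k(gold, suggested, k=1))  # 0.25
```

Under this reading, a larger k can only keep or raise the score, which is why the metric suits a support tool where a human coder picks from a short list of suggestions.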

Keywords: Natural Language Processing, Cancer, Electronic Health Records, Data Mining, Data Warehousing

Suggested Citation

Villena, Fabián and Báez, Pablo and Peñafiel, Sergio and Rojas, Matías and Paredes, Inti and Dunstan, Jocelyn, Automatic Support System for Tumor Coding in Pathology Reports in Spanish. Available at SSRN: https://ssrn.com/abstract=3982259 or http://dx.doi.org/10.2139/ssrn.3982259

Fabián Villena

University of Chile - Faculty of Physical and Mathematical Sciences - Center for Mathematical Modeling - CNRS IRL 2807

Av. Blanco Encalada 2120
Santiago
Chile

Pablo Báez

University of Chile - Faculty of Medicine - Center of Medical Informatics and Telemedicine

Av. Independencia 1027
Santiago, 7520421
Chile

Sergio Peñafiel

University of Chile

Pío Nono Nº1, Providencia
Santiago, R. Metropolitana 7520421
Chile

Matías Rojas

University of Chile

Pío Nono Nº1, Providencia
Santiago, R. Metropolitana 7520421
Chile

Inti Paredes

Instituto Oncológico Fundación Arturo López Pérez

Chile

Jocelyn Dunstan (Contact Author)

University of Chile - Faculty of Physical and Mathematical Sciences - Initiative for Data & Artificial Intelligence and Center for Mathematical Modeling - CNRS IRL 2807

Av. Blanco Encalada 2120
Santiago
Chile
