Annotating and Indexing Scientific Articles with Rare Diseases

16 Pages Posted: 18 Aug 2023

Hosein Azarbonyad

Elsevier

Zubair Afzal

Elsevier

Rik Iping

Erasmus University Rotterdam (EUR) - Erasmus Medical Center (MC)

Max Dumoulin

Elsevier

Ilse Nederveen

Erasmus University Rotterdam (EUR) - Erasmus Medical Center (MC)

Jiangtao Yu

Elsevier

Georgios Tsatsaronis

Elsevier

Date Written: August 15, 2023

Abstract

In Europe, 30 million people live with a rare (or orphan) disease, defined as a disease that affects fewer than 1 in 2,000 people. Rare disease patients are entitled to the best possible health care, so it is imperative to organize the relevant clinical care and scientific literature efficiently. The European Commission and the member states have established a policy based on European Reference Networks (ERNs), each specializing in a range of diseases, to address common challenges and to support patient care and research for rare diseases. However, important questions, such as identifying the key research initiatives for each rare disease, require in-depth bibliometric and scientometric analysis, which in turn depends on efficient annotation and indexing of the relevant scientific literature.

The primary challenge is to identify, automatically and efficiently, the scientific articles and guidelines that deal with a particular rare disease or diseases. In this work, we present a novel methodology to annotate and index any scientific text with taxonomical concepts that describe rare diseases from the Orphanet taxonomy. There are several technical challenges: first, no sufficiently large labeled dataset exists for training supervised models to index articles; second, the Orphanet taxonomy, like any taxonomy, might be incomplete in certain areas, and its granularity might not be uniform across all of its parts; third, despite the great advances in Natural Language Processing (NLP) and Information Retrieval, polysemy and synonymy in the surface forms of rare disease names in text may still hinder the applicability of any annotation engine.

In this study we discuss how we use TERMite, a state-of-the-art annotation engine, in combination with advanced NLP and text mining techniques, to address some of these challenges. The core of our methodology relies on using our TERMite text analysis engine to create a vocabulary based on Orphanet. This vocabulary is then used to query a large database of scientific publications (Scopus). The resulting datasets, one per rare disease, become the basis for bibliometric analyses that use the wealth of metadata and reference linking that Scopus provides. We present the results of such an analysis and highlight some directions for future research that may address the open challenges even more efficiently. To the best of our knowledge, this is the first research work to address systematically, efficiently, and at scale the problem of organizing and indexing the scientific literature across the rare disease landscape.
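
The vocabulary-to-query step described above can be illustrated with a minimal sketch. It is not the authors' pipeline and does not use TERMite or any Scopus API: the vocabulary below is a hypothetical hand-written stand-in with a placeholder concept identifier, and the helper names (normalize, build_query, annotate) are assumptions made for illustration. The query string is written in the style of a Scopus advanced-search field restriction; the abstract does not state the exact query form used.

```python
import re
from typing import Dict, List

# Hypothetical vocabulary: placeholder concept ID -> synonyms.
# A real vocabulary would be derived from Orphanet; this one is made up.
VOCABULARY: Dict[str, List[str]] = {
    "D001": ["Cystic Fibrosis", "mucoviscidosis"],
}


def normalize(term: str) -> str:
    """Lowercase a string and collapse runs of whitespace."""
    return " ".join(term.lower().split())


def build_query(synonyms: List[str]) -> str:
    """Join the distinct normalized synonyms into one OR query."""
    seen: List[str] = []
    for term in synonyms:
        norm = normalize(term)
        if norm and norm not in seen:
            seen.append(norm)
    if not seen:
        raise ValueError("at least one non-empty synonym is required")
    quoted = " OR ".join(f'"{t}"' for t in seen)
    # Assumed field restriction: search titles, abstracts and keywords.
    return f"TITLE-ABS-KEY({quoted})"


def annotate(text: str, vocabulary: Dict[str, List[str]]) -> List[str]:
    """Return the sorted IDs of concepts with a synonym found in the text."""
    haystack = normalize(text)
    hits: List[str] = []
    for concept_id, synonyms in vocabulary.items():
        for term in synonyms:
            # Whole-word match, so a synonym inside a longer word is ignored.
            pattern = r"\b" + re.escape(normalize(term)) + r"\b"
            if re.search(pattern, haystack):
                hits.append(concept_id)
                break
    return sorted(hits)


if __name__ == "__main__":
    print(build_query(VOCABULARY["D001"]))
    print(annotate("A trial in mucoviscidosis patients.", VOCABULARY))
```

Grouping synonyms under one concept identifier addresses synonymy only. Plain string matching of this kind does nothing about polysemy, which is one of the open challenges the abstract names.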

Note:

Funding Information: None.

Conflict of Interests: The authors declare that they have no competing interests.

Keywords: Annotation, Rare Diseases, Scientometrics, Bibliographic Databases, Natural Language Processing, Health Sciences, Research Applications

Suggested Citation

Azarbonyad, Hosein and Afzal, Zubair and Iping, Rik and Dumoulin, Max and Nederveen, Ilse and Yu, Jiangtao and Tsatsaronis, Georgios, Annotating and Indexing Scientific Articles with Rare Diseases (August 15, 2023). Available at SSRN: https://ssrn.com/abstract=4541165 or http://dx.doi.org/10.2139/ssrn.4541165
