Linking and Disambiguating Entities Across Heterogeneous RDF Graphs

14 Pages. Posted: 15 Jan 2019. Publication Status: Accepted

Manel Achichi

University of Montpellier - Laboratory of Informatics, Robotics and Microelectronics (LIRMM)

Zohra Bellahsene

University of Montpellier - Laboratory of Informatics, Robotics and Microelectronics (LIRMM)

Mohamed Ben Ellefi

Aix-Marseille University - Laboratory of Computer Science and Systems (LIS)

Konstantin Todorov

University of Montpellier - Laboratory of Informatics, Robotics and Microelectronics (LIRMM)

Abstract

Establishing identity links across RDF datasets is a central and challenging task on the way to realising the Data Web project. It is well known that data supplied by different sources can be highly heterogeneous: two entities referring to the same real-world object are often described, structured and valued differently, or in a complementary fashion. In this paper, we explore the origins and the multiplicity of data heterogeneity problems and propose a novel classification that makes it possible to isolate challenges and to position our own and future work. Many state-of-the-art data linking approaches rely on sets of discriminative properties, provided by the user or by specialised tools, which, without knowledge of the nature of the data, cannot automatically account for a large number of structural heterogeneities. In addition, similarity measures and thresholds need to be selected and tuned manually or learned by specialised algorithms. We propose a solution that covers a substantial number of heterogeneities while attempting to reduce the user configuration effort. It is based on: (i) property filtering, or automatic data cleaning of “problematic” attributes; (ii) instance profiling, which represents each resource by a sub-graph considered relevant for the comparison task; and (iii) instance vector representation, which allows resources to be compared. To reduce the false positive rate, we apply (iv) a post-processing step based on hierarchical clustering and key ranking techniques, which aims to disambiguate highly similar, though not identical, instances. This pipeline is implemented in Legato, a data linking tool shown to outperform or perform as well as state-of-the-art tools on highly heterogeneous and diverse benchmark datasets while keeping the user configuration effort low.
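To make steps (i) to (iii) more concrete, the following is a minimal sketch in Python and not Legato's actual implementation. It assumes each resource is already available as a mapping from property names to literal values, treats property filtering as a caller-supplied set of ignored properties, flattens each resource into a bag of tokens as a stand-in for the sub-graph profile, and compares TF-IDF vectors with cosine similarity. The function names, the `threshold` parameter and the sample data are hypothetical, and the post-processing step (iv) is not covered.

```python
import math
import re
from collections import Counter


def profile(resource, ignored_properties=frozenset()):
    """Flatten the literal values of a resource into a bag of lowercase tokens.

    `ignored_properties` stands in for property filtering (an assumption:
    here the caller supplies the "problematic" properties explicitly).
    """
    tokens = []
    for prop, values in resource.items():
        if prop in ignored_properties:
            continue
        for value in values:
            tokens.extend(re.findall(r"\w+", value.lower()))
    return Counter(tokens)


def tfidf_vectors(profiles):
    """Weight each token by term frequency times a smoothed inverse document frequency."""
    n = len(profiles)
    doc_freq = Counter()
    for bag in profiles.values():
        doc_freq.update(bag.keys())
    return {
        uri: {t: tf * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0) for t, tf in bag.items()}
        for uri, bag in profiles.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def link(source, target, threshold=0.5, ignored_properties=frozenset()):
    """Link each source resource to its most similar target, if similar enough.

    Assumes source and target URIs are distinct strings.
    """
    resources = {**source, **target}
    profiles = {uri: profile(r, ignored_properties) for uri, r in resources.items()}
    vectors = tfidf_vectors(profiles)
    links = []
    for s in source:
        best = max(target, key=lambda t: cosine(vectors[s], vectors[t]))
        score = cosine(vectors[s], vectors[best])
        if score >= threshold:
            links.append((s, best, score))
    return links


# Hypothetical sample data: the same composer described with different properties.
source = {"a:1": {"name": ["Wolfgang Amadeus Mozart"], "note": ["record 17"]}}
target = {
    "b:1": {"label": ["Mozart, Wolfgang Amadeus"]},
    "b:2": {"label": ["Ludwig van Beethoven"]},
}
links = link(source, target, threshold=0.5, ignored_properties=frozenset({"note"}))
print(links)
```

Because the profile ignores which property a value came from, the two descriptions of the same composer can be matched even though one uses `name` and the other `label`, which is the kind of structural heterogeneity the abstract refers to.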

Keywords: RDF Data Linking, Knowledge Graphs, Linked Open Data, Data Heterogeneities

Suggested Citation

Achichi, Manel and Bellahsene, Zohra and Ellefi, Mohamed Ben and Todorov, Konstantin, Linking and Disambiguating Entities Across Heterogeneous RDF Graphs (January 14, 2019). Available at SSRN: https://ssrn.com/abstract=3315546 or http://dx.doi.org/10.2139/ssrn.3315546

Manel Achichi (Contact Author)

University of Montpellier - Laboratory of Informatics, Robotics and Microelectronics (LIRMM)

163 rue Auguste Broussonnet
Montpellier
France

Zohra Bellahsene

University of Montpellier - Laboratory of Informatics, Robotics and Microelectronics (LIRMM)

163 rue Auguste Broussonnet
Montpellier
France

Mohamed Ben Ellefi

Aix-Marseille University - Laboratory of Computer Science and Systems (LIS)

3 Avenue Robert Schuman
Aix-en-Provence, 13628
France

Konstantin Todorov

University of Montpellier - Laboratory of Informatics, Robotics and Microelectronics (LIRMM)

163 rue Auguste Broussonnet
Montpellier
France
