Linking and Disambiguating Entities Across Heterogeneous RDF Graphs
14 Pages. Posted: 15 Jan 2019. Publication Status: Accepted
Abstract
Establishing identity links across RDF datasets is a central and challenging task in realising the Data Web project. Data supplied by different sources can be highly heterogeneous: two entities referring to the same real-world object are often described, structured and valued differently, or in a complementary fashion. In this paper, we explore the origins and the multiplicity of data heterogeneity problems and propose a novel classification that makes it possible to isolate the challenges and to position our own and future work. Many state-of-the-art data linking approaches rely on sets of discriminative properties, provided by the user or by specialised tools, which, without knowledge of the nature of the data, cannot automatically account for a large number of structural heterogeneities. In addition, similarity measures and thresholds need to be selected and tuned manually or learned by specialised algorithms. We propose a solution that covers a large number of heterogeneities while attempting to reduce the user configuration effort. It is based on: (i) Property filtering, or automatic data cleaning of "problematic" attributes; (ii) Instance profiling, which represents each resource by a sub-graph considered relevant for the comparison task; and (iii) Instance vector representation, which allows resources to be compared. To reduce the false positive rate, we apply a (iv) Post-processing step based on hierarchical clustering and key ranking techniques, aiming to disambiguate highly similar, though not identical, instances. This pipeline is implemented in Legato, a data linking tool shown to outperform or perform as well as state-of-the-art tools on highly heterogeneous and diverse benchmark datasets while keeping the user configuration effort low.
Keywords: RDF Data Linking, Knowledge Graphs, Linked Open Data, Data Heterogeneities
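To make steps (i) to (iii) of the abstract more concrete, the following is a minimal illustrative sketch and not Legato's actual implementation. Several things here are assumptions made only for illustration: the toy resources and their URIs, the function names (`filter_properties`, `profile`, `cosine`, `link`), the coverage-based filtering rule, the flat bag-of-words profile (standing in for a sub-graph), and the fixed similarity threshold. The post-processing step (iv) is not covered.

```python
import math
import re
from collections import Counter

def filter_properties(resources, min_coverage=0.75):
    """Step (i), assumed rule: drop properties used by too few resources."""
    counts = Counter(p for props in resources.values() for p in props)
    n = len(resources)
    keep = {p for p, c in counts.items() if c / n >= min_coverage}
    return {uri: {p: v for p, v in props.items() if p in keep}
            for uri, props in resources.items()}

def profile(props):
    """Steps (ii)-(iii), simplified: flatten literals into a token-count vector."""
    text = " ".join(v for values in props.values() for v in values)
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def link(source, target, threshold=0.5):
    """Pair each source resource with its most similar target above the threshold."""
    src = {u: profile(p) for u, p in filter_properties(source).items()}
    tgt = {u: profile(p) for u, p in filter_properties(target).items()}
    links = []
    for s, s_vec in src.items():
        best = max(tgt, key=lambda t: cosine(s_vec, tgt[t]), default=None)
        if best is not None and cosine(s_vec, tgt[best]) >= threshold:
            links.append((s, best))
    return links

# Hypothetical toy datasets: the same entities described with different
# property names, value orderings, and one rarely used property ("note").
source = {
    "a:1": {"name": ["Wolfgang Amadeus Mozart"], "born": ["Salzburg 1756"]},
    "a:2": {"name": ["Ludwig van Beethoven"], "born": ["Bonn 1770"],
            "note": ["imported 2019"]},
}
target = {
    "b:1": {"label": ["Beethoven, Ludwig van"], "birth": ["1770 Bonn"]},
    "b:2": {"label": ["Mozart, Wolfgang Amadeus"], "birth": ["1756 Salzburg"]},
}

links = link(source, target)
print(links)
```

Because the profiles ignore property names and compare only the tokens of the literals, the sketch does not require the two datasets to share a schema, which is the kind of structural heterogeneity the abstract refers to. A real pipeline would also need the disambiguation step (iv) to separate highly similar but distinct instances.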