
On Revealing Shared Conceptualization Among Open Datasets

16 Pages. Posted: 29 Jan 2021. Publication Status: Accepted.

Miloš Bogdanović

University of Niš - Computer Science Department

Natasa Veljkovic

Faculty of Electronic Engineering, University of Nis

Milena Frtunic Gligorijevic

Faculty of Electronic Engineering, University of Nis

Darko Puflovic

Faculty of Electronic Engineering, University of Nis

Leonid Stoimenov

University of Niš - Computer Science Department

Abstract

Openness and transparency initiatives are not only milestones of scientific progress but have also influenced many fields of organization and industry. Under this influence, a variety of government institutions worldwide have published a large number of datasets through open data portals. Government data covers diverse subjects, and the scale of available data grows every year. Published data is expected to be both accessible and discoverable, and portals rely on the metadata accompanying datasets for these purposes. However, part of the metadata is often missing, which reduces users' ability to obtain the desired information, and the problem grows with the scale of published datasets. The approach we describe in this paper aims to reduce this problem by implementing knowledge structures and algorithms capable of proposing the best-matching category for an uncategorized dataset. Our aim is twofold: to enrich dataset metadata by suggesting an appropriate category, and to increase the dataset's visibility and discoverability. Our approach relies on information about open datasets provided by users, namely the dataset descriptions contained within dataset tags. Since dataset tags exhibit low consistency due to their origin, we present a method of optimizing their usage by means of semantic similarity measures based on natural language processing mechanisms. Optimization is performed in terms of reducing the number of distinct tag values used for dataset description. Once optimized, dataset tags are used to reveal the shared conceptualization originating from their usage by means of Formal Concept Analysis. We demonstrate the advantage of our proposal by comparing concept lattices generated using Formal Concept Analysis before and after the optimization process, and we use the generated structure as a knowledge base to categorize uncategorized open datasets. Finally, we present a categorization mechanism based on the generated knowledge base that takes advantage of semantic similarity measures to propose a category suitable for an uncategorized dataset.
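
The pipeline outlined in the abstract (reduce distinct tag values, build a concept lattice, suggest a category) can be illustrated with a toy sketch. This is not the authors' implementation: the datasets, tags, categories, threshold, and all function names are hypothetical, and plain string similarity from Python's `difflib` stands in for the NLP-based semantic similarity measures the paper refers to. The category vote is likewise only one simple way such a knowledge base might be queried.

```python
from collections import Counter
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in measure: character-level string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def merge_tags(tags, threshold=0.8):
    """Map every tag to a representative so near-duplicate tags collapse."""
    representatives, mapping = [], {}
    for tag in sorted(tags):
        for rep in representatives:
            if similarity(tag, rep) >= threshold:
                mapping[tag] = rep
                break
        else:  # no sufficiently similar representative found
            representatives.append(tag)
            mapping[tag] = tag
    return mapping

def formal_concepts(context):
    """Enumerate all formal concepts (extent, intent) of a dataset -> tags context."""
    all_tags = frozenset().union(*context.values())
    # Intents are exactly the intersections of object tag sets (plus all tags).
    intents = {all_tags}
    for tags in context.values():
        intents |= {frozenset(tags) & intent for intent in intents}
    return [
        (frozenset(d for d, tags in context.items() if intent <= tags), intent)
        for intent in intents
    ]

def suggest_category(new_tags, categories, concepts):
    """Vote among datasets in the most specific concept covered by new_tags."""
    new_tags = frozenset(new_tags)
    candidates = [(e, i) for e, i in concepts if e and i and i <= new_tags]
    if not candidates:
        return None
    extent, _ = max(candidates, key=lambda c: len(c[1]))
    return Counter(categories[d] for d in extent).most_common(1)[0][0]

# Hypothetical categorized datasets with inconsistent, user-provided tags.
raw_context = {
    "air-quality-2020": {"environment", "air", "pollution"},
    "river-levels": {"environment", "water"},
    "city-budget": {"finance", "budget"},
    "budget-report": {"finances", "budgets", "report"},
}
categories = {
    "air-quality-2020": "Environment",
    "river-levels": "Environment",
    "city-budget": "Finance",
    "budget-report": "Finance",
}

# Step 1: reduce the number of distinct tag values.
mapping = merge_tags(set().union(*raw_context.values()))
context = {d: {mapping[t] for t in tags} for d, tags in raw_context.items()}

# Step 2: reveal shared concepts with Formal Concept Analysis.
concepts = formal_concepts(context)

# Step 3: propose a category for an uncategorized dataset.
new_tags = {mapping.get(t, t) for t in {"budgets", "finance", "tax"}}
suggestion = suggest_category(new_tags, categories, concepts)
print(suggestion)
```

In this toy context, merging near-duplicate tags lets the two budget datasets share a concept with intent `{finance, budget}`, which they would not share under their raw tags; that shared concept is what the category suggestion then draws on.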

Keywords: open data, formal concept analysis, semantic similarity, categorization, natural language processing

Suggested Citation

Bogdanović, Miloš and Veljkovic, Natasa and Frtunic Gligorijevic, Milena and Puflovic, Darko and Stoimenov, Leonid, On Revealing Shared Conceptualization Among Open Datasets. Available at SSRN: https://ssrn.com/abstract=3770603 or http://dx.doi.org/10.2139/ssrn.3770603

Miloš Bogdanović (Contact Author)

University of Niš - Computer Science Department

Aleksandra Medvedeva 14
Niš, 18000
Serbia

Natasa Veljkovic

Faculty of Electronic Engineering, University of Nis

Milena Frtunic Gligorijevic

Faculty of Electronic Engineering, University of Nis

Darko Puflovic

Faculty of Electronic Engineering, University of Nis

Leonid Stoimenov

University of Niš - Computer Science Department

Aleksandra Medvedeva 14
Niš, 18000
Serbia
