
Enriching Integrated Statistical Open City Data by Combining Equational Knowledge and Missing Value Imputation

30 Pages. Posted: 2 Jul 2018. Publication Status: Accepted


Stefan Bischof

Siemens AG - Österreich

Andreas Harth

Karlsruhe Institute of Technology - Institute of Applied Informatics and Formal Description Methods (AIFB)

Benedikt Kämpgen

FZI Research Center for Information Technology

Axel Polleres

University of Galway - Digital Enterprise Research Institute (DERI)

Patrik Schneider

Siemens AG - Österreich

Abstract

Several institutions collect statistical data about cities, regions, and countries for various purposes. Yet, while access to recent, high-quality data of this kind is crucial for decision makers and a means of achieving transparency towards the public, such data collections all too often remain isolated and not reusable, let alone comparable or properly integrated. In this paper we present the Open City Data Pipeline, a focused attempt to collect, integrate, and enrich statistical data collected at city level worldwide, and to republish the resulting dataset in a reusable manner as Linked Data. The main features of the Open City Data Pipeline are: (i) we integrate and cleanse data from several sources in a modular, extensible, and always up-to-date fashion; (ii) we use both Machine Learning techniques and reasoning over equational background knowledge to enrich the data by imputing missing values; and (iii) we assess the estimated accuracy of such imputations per indicator. Additionally, (iv) we make the integrated and enriched data, including links to external data sources such as DBpedia, available both in a web browser interface and as machine-readable Linked Data, using standard vocabularies such as QB and PROV. Apart from contributing to the growing collection of data available as Linked Data, our enrichment process for missing values also contributes a novel methodology for combining rule-based inference over equational knowledge with inferences obtained from statistical Machine Learning approaches. While most existing work on inference in Linked Data has focused on ontological reasoning in RDFS and OWL, we believe that these complementary methods, and particularly their combination, could also be fruitfully applied in many other domains for integrating Statistical Linked Data, independently of our concrete use case of integrating city data.
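As a rough illustration of feature (ii), the following minimal Python sketch fills a missing indicator first from an equation and only then from a statistical estimate. Everything in it is an assumption made for illustration and is not taken from the paper: the equation (population density = population / area), the field names, the function names `derive_density` and `impute_density`, and the use of a plain mean as a stand-in for the Machine Learning imputation.

```python
from statistics import mean


def derive_density(record):
    """Hypothetical equational rule: density = population / area_km2."""
    population = record.get("population")
    area = record.get("area_km2")
    if population is not None and area:  # skip missing or zero area
        return population / area
    return None


def impute_density(records):
    """Fill missing 'population_density' values.

    Order of preference: observed value, value derived from the equation,
    then a statistical fallback (here simply the mean of all known values,
    a placeholder for a real imputation model).
    """
    filled = []
    for original in records:
        record = dict(original)  # do not mutate the caller's data
        record.setdefault("population_density", None)
        if record["population_density"] is not None:
            record["source"] = "observed"
        else:
            derived = derive_density(record)
            if derived is not None:
                record["population_density"] = derived
                record["source"] = "equation"
            else:
                record["source"] = "missing"
        filled.append(record)

    known = [r["population_density"] for r in filled
             if r["population_density"] is not None]
    fallback = mean(known) if known else None

    for record in filled:
        if record["source"] == "missing" and fallback is not None:
            record["population_density"] = fallback
            record["source"] = "statistical"
    return filled


cities = [
    {"city": "A", "population": 1000, "area_km2": 10, "population_density": None},
    {"city": "B", "population_density": 50.0},
    {"city": "C", "population": None, "area_km2": 5},
]
for row in impute_density(cities):
    print(row["city"], row["population_density"], row["source"])
```

Recording a `source` label per value loosely mirrors the idea of keeping provenance for enriched data, so that derived and estimated values can be told apart from observed ones.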

Keywords: Open Data, Linked Data, Data Cleaning, Data Integration

Suggested Citation

Bischof, Stefan and Harth, Andreas and Kämpgen, Benedikt and Polleres, Axel and Schneider, Patrik, Enriching Integrated Statistical Open City Data by Combining Equational Knowledge and Missing Value Imputation (January 2018). Available at SSRN: https://ssrn.com/abstract=3199313 or http://dx.doi.org/10.2139/ssrn.3199313

Stefan Bischof (Contact Author)

Siemens AG - Österreich

Siemensstrasse 90
Vienna
Austria

Andreas Harth

Karlsruhe Institute of Technology - Institute of Applied Informatics and Formal Description Methods (AIFB)

Kaiserstraße 12
Karlsruhe, Baden Württemberg 76131
Germany

Benedikt Kämpgen

FZI Research Center for Information Technology

Germany

Axel Polleres

University of Galway - Digital Enterprise Research Institute (DERI)

University Road
Galway, Co. Galway
Ireland

Patrik Schneider

Siemens AG - Österreich

Siemensstrasse 90
Vienna
Austria

