Integrating Open Government Data with Stratosphere for More Transparency

17 Pages Posted: 3 Jul 2018 Publication Status: Accepted

Arvid Heise

University of Potsdam - Hasso Plattner Institute (HPI)

Felix Naumann

University of Potsdam - Hasso Plattner Institute (HPI)

Abstract

Governments are increasingly publishing their data to enable organizations and citizens to browse and analyze it. However, the heterogeneity of this Open Government Data hinders meaningful search, analysis, and integration and thus limits the desired transparency. In this article, we present the newly developed data integration operators of the Stratosphere parallel data analysis framework, which are designed to overcome this heterogeneity. With declaratively specified queries, we demonstrate the integration of well-known government data sources and other large open data sets at the technical, structural, and semantic levels. Furthermore, we publish the integrated data on the Web in a form that enables users to discover relationships between persons, government agencies, funds, and companies. The evaluation shows that linking person entities across different data sets achieves a precision of 98.3% and a recall of 95.2%. Moreover, the integration of large data sets scales well on up to eight machines.

Keywords: data integration, data cleansing, schema mapping, record linkage, data fusion, parallel query processing, map-reduce

Suggested Citation

Heise, Arvid and Naumann, Felix, Integrating Open Government Data with Stratosphere for More Transparency (2012). Available at SSRN: https://ssrn.com/abstract=3198963 or http://dx.doi.org/10.2139/ssrn.3198963

Arvid Heise (Contact Author)

University of Potsdam - Hasso Plattner Institute (HPI)

Prof.-Dr.-Helmert-Str. 2-3,
Potsdam
Germany

Felix Naumann

University of Potsdam - Hasso Plattner Institute (HPI)

Prof.-Dr.-Helmert-Str. 2-3,
Potsdam
Germany
