Identification, Data Combination and the Risk of Disclosure

57 Pages. Posted: 5 Nov 2014. Last revised: 2 Sep 2017.

Tatiana Komarova

Department of Economics, University of Manchester

Denis Nekipelov

University of Virginia

Evgeny Yakovlev

New Economic School; Sciences Po - Department of Economics; IZA

Date Written: December 21, 2016

Abstract

The data needed for econometric inference are often not contained in a single source. In this paper we analyze the problem of parametric inference from combined individual-level data when the data combination is based on personal and demographic identifiers such as name, age, or address. Our main question is whether the econometric model is identified from the combined data when the data do not contain exact individual identifiers and no parametric assumptions are imposed on the joint distribution of the information that is common across the combined dataset. We demonstrate which conditions on the observable marginal distributions of the data in the individual datasets can and cannot guarantee identification of the parameters of interest. We also note that the data combination procedure is essential in a semiparametric setting such as ours. Because the (non-parametric) data combination procedure can only be defined in finite samples, we introduce a new notion of identification based on the concept of limits of statistical experiments. Our results apply to settings where the individual data used for inference are sensitive and their combination may substantially increase the sensitivity of the data or de-anonymize previously anonymized information. We demonstrate that point identification of an econometric model from combined data is incompatible with restrictions on the risk of individual disclosure. If the data combination procedure guarantees a bound on the risk of individual disclosure, then the information available from the combined dataset allows one to identify the parameter of interest only partially, and the size of the identification region is inversely related to the upper bound guarantee for the disclosure risk.
This result is new in the context of data combination: we note that the quality of the links that must be used in the combined data to ensure point identification may be much higher than the average link quality in the entire dataset, and thus point inference requires the use of the most sensitive subset of the data. Our results provide important insights into the ongoing discourse on the empirical analysis of merged administrative records, as well as into discussions of the disclosive nature of policies implemented by data-driven companies (such as Internet service companies and medical companies that use individual patient records for policy decisions).
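
To make the idea of link quality concrete, the following is a minimal illustrative sketch, not the authors' procedure. It assumes two toy datasets that share only a noisy name field, scores candidate links with a generic string similarity from Python's standard library, and keeps a link only when its score meets a threshold. All names, records, and thresholds (`similarity`, `link_records`, `left`, `right`, 0.5, 0.99) are hypothetical and chosen for illustration only.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Generic string similarity in [0, 1]; a stand-in for a link-quality score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link_records(left, right, threshold):
    """For each left record, keep its best-scoring right record
    only if the score is at least `threshold`."""
    links = []
    for left_name, x in left:
        best = max(right, key=lambda r: similarity(left_name, r[0]))
        score = similarity(left_name, best[0])
        if score >= threshold:
            links.append((left_name, best[0], x, best[1], score))
    return links


# Hypothetical toy data: (name, variable) pairs held in two separate sources.
left = [("Anna Smith", 1.0), ("John Doe", 2.0), ("Maria Garcia", 3.0)]
right = [("Ana Smith", 10.0), ("Jon Do", 20.0), ("Maria Garcia", 30.0)]

loose = link_records(left, right, threshold=0.5)    # tolerates noisy matches
strict = link_records(left, right, threshold=0.99)  # near-exact matches only

print(len(loose), len(strict))
```

In this toy setting the strict threshold retains only the records whose identifiers match almost exactly. These are the records that can be attributed to an individual with the most confidence, which is the sense in which the highest-quality links form the most sensitive subset of the data.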

Keywords: Data protection, model identification, data combination, disclosure risk

JEL Classification: C35, C14, C25, C13

Suggested Citation

Komarova, Tatiana and Nekipelov, Denis and Yakovlev, Evgeny, Identification, Data Combination and the Risk of Disclosure (December 21, 2016). Available at SSRN: https://ssrn.com/abstract=2518493 or http://dx.doi.org/10.2139/ssrn.2518493

Tatiana Komarova (Contact Author)

Department of Economics, University of Manchester

Arthur Lewis Building
Oxford Road
Manchester, M13 9PL
United Kingdom

Denis Nekipelov

University of Virginia

1400 University Ave
Charlottesville, VA 22903
United States

Evgeny Yakovlev

New Economic School

Skolkovskoe shosse 45
Moscow, 121343
Russia

Sciences Po - Department of Economics

28, rue des Saints-Pères
Paris, Paris 75007
France

IZA

P.O. Box 7240
Bonn, D-53072
Germany
