Distribution-Preserving Statistical Disclosure Limitation

38 Pages. Posted: 20 Sep 2006

Simon D. Woodcock

Simon Fraser University; Institute for the Study of Labor (IZA)

Gary Benedetto

United States Census Bureau

Date Written: September 2007

Abstract

One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential data replaced by multiply-imputed synthetic values. When imputing confidential values, a misspecified model can invalidate inferences, because the distribution of synthetic data is determined by the model used to generate them. We present a practical method to generate synthetic values when the imputer has only limited information about the true data generating process. We combine a simple imputation model (such as regression) with a series of density-based transformations to preserve the distribution of the confidential data, up to sampling error, on specified subdomains. We demonstrate through simulation and a large-scale application that our approach preserves important statistical properties of the confidential data, including higher moments, with low disclosure risk.
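
The idea of pairing a simple imputation model with a density-based transformation can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes a single predictor, a normal-error linear regression as the imputation model, Gaussian kernel density estimates with a normal-reference bandwidth, and a single subdomain. All function names (`fit_line`, `kde_cdf`, `kde_quantile`, `preserve_distribution`, `demo`) and the simulated data are hypothetical. Each synthetic draw is mapped to its estimated quantile in the synthetic distribution, then to the value at the same quantile of the smoothed confidential distribution.

```python
import math
import random


def fit_line(x, y):
    """OLS fit of y = a + b*x; stands in for the 'simple imputation model'."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, math.sqrt(rss / (n - 2))


def bandwidth(values):
    """Normal-reference rule-of-thumb bandwidth (an assumption of this sketch)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 1.06 * sd * n ** (-0.2)


def kde_cdf(y, values, h):
    """Gaussian-kernel estimate of P(Y <= y)."""
    scale = h * math.sqrt(2.0)
    return sum(0.5 * (1.0 + math.erf((y - v) / scale)) for v in values) / len(values)


def kde_quantile(u, values, h, iters=50):
    """Invert kde_cdf by bisection over a bracket wider than the data range."""
    lo, hi = min(values) - 10.0 * h, max(values) + 10.0 * h
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kde_cdf(mid, values, h) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def preserve_distribution(synthetic, confidential):
    """Map synthetic values onto the smoothed confidential distribution."""
    hs, hc = bandwidth(synthetic), bandwidth(confidential)
    adjusted = []
    for y in synthetic:
        u = kde_cdf(y, synthetic, hs)  # quantile of y among the synthetic draws
        adjusted.append(kde_quantile(u, confidential, hc))
    return adjusted


def skewness(values):
    """Sample skewness, used here as a simple 'higher moment' check."""
    n = len(values)
    m = sum(values) / n
    m2 = sum((v - m) ** 2 for v in values) / n
    m3 = sum((v - m) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def demo(n=100, seed=0):
    """Simulated example: skewed confidential data, misspecified normal model."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    conf = [math.exp(0.5 * xi + rng.gauss(0.0, 0.5)) for xi in x]
    a, b, sigma = fit_line(x, conf)
    synth = [a + b * xi + rng.gauss(0.0, sigma) for xi in x]
    return conf, synth, preserve_distribution(synth, conf)


if __name__ == "__main__":
    conf, synth, adjusted = demo()
    for name, vals in (("confidential", conf), ("synthetic", synth), ("adjusted", adjusted)):
        print(name, round(skewness(vals), 3))
```

To respect subdomains in this sketch, one would call `preserve_distribution` separately on the records in each subdomain. Because the mapping is monotone, the rank order of the regression draws is kept, while the kernel smoothing means the adjusted values are not simply a reshuffling of the confidential values.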

Keywords: statistical disclosure limitation, confidentiality, privacy, multiple imputation, partially synthetic data

JEL Classification: C1, C4, C5

Suggested Citation

Woodcock, Simon D. and Benedetto, Gary, Distribution-Preserving Statistical Disclosure Limitation (September 2007). Available at SSRN: https://ssrn.com/abstract=931535 or http://dx.doi.org/10.2139/ssrn.931535

Simon D. Woodcock (Contact Author)

Simon Fraser University

Dept. of Economics
8888 University Drive
Burnaby, British Columbia V5A 1S6
Canada

HOME PAGE: http://www.sfu.ca/~swoodcoc

Institute for the Study of Labor (IZA)

P.O. Box 7240
Bonn, D-53072
Germany

Gary Benedetto

United States Census Bureau

4600 Silver Hill Road
Washington, DC 20233
United States