Distribution-Preserving Statistical Disclosure Limitation
38 Pages
Posted: 20 Sep 2006
Date Written: September 2007
Abstract
One approach to limiting disclosure risk in public-use microdata is to release multiply-imputed, partially synthetic data sets. These are data on actual respondents, but with confidential values replaced by multiply-imputed synthetic values. When imputing confidential values, a misspecified model can invalidate inferences, because the distribution of the synthetic data is determined by the model used to generate them. We present a practical method to generate synthetic values when the imputer has only limited information about the true data-generating process. We combine a simple imputation model (such as regression) with a series of density-based transformations to preserve the distribution of the confidential data, up to sampling error, on specified subdomains. We demonstrate through simulation and a large-scale application that our approach preserves important statistical properties of the confidential data, including higher moments, with low disclosure risk.
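To make the idea of pairing a simple imputation model with a distribution-preserving transformation concrete, here is a minimal Python sketch. It is not the paper's procedure. It assumes an ordinary least-squares imputation model and substitutes an empirical quantile mapping for the density-based transformations the abstract mentions. It also holds the regression parameters at their point estimates rather than drawing them, so it is a simplification of proper multiple imputation. All function names (`fit_ols`, `draw_regression_imputes`, `match_distribution`, `synthesize`) and the `groups` argument standing in for the "specified subdomains" are hypothetical.

```python
import numpy as np


def fit_ols(X, y):
    """Return least-squares coefficients and the residual standard deviation."""
    A = np.column_stack([np.ones(len(y)), X])  # add an intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sigma = resid.std(ddof=A.shape[1])
    return beta, sigma


def draw_regression_imputes(X, beta, sigma, rng):
    """Draw one value per record from the fitted normal regression model."""
    A = np.column_stack([np.ones(A_rows(X)), X])
    return A @ beta + rng.normal(0.0, sigma, size=A.shape[0])


def A_rows(X):
    """Number of records in X, whether X is 1-D or 2-D."""
    return np.asarray(X).shape[0]


def match_distribution(draws, confidential):
    """Map each draw to the confidential-data value at the same empirical quantile.

    Assumption: this rank-based mapping replaces the paper's density-based
    transformations. It keeps the ordering implied by the regression draws but
    takes the marginal shape from the confidential values.
    """
    ranks = np.argsort(np.argsort(draws))        # 0 .. n-1
    probs = (ranks + 0.5) / len(draws)           # mid-rank plotting positions
    return np.quantile(confidential, probs)


def synthesize(X, y, groups, n_implicates=3, seed=0):
    """Produce several synthetic copies of y, matched to y within each group."""
    rng = np.random.default_rng(seed)
    beta, sigma = fit_ols(X, y)
    implicates = []
    for _ in range(n_implicates):
        draws = draw_regression_imputes(X, beta, sigma, rng)
        synth = np.empty_like(draws)
        for g in np.unique(groups):
            mask = groups == g
            synth[mask] = match_distribution(draws[mask], y[mask])
        implicates.append(synth)
    return implicates
```

Calling `synthesize(x, y, groups)` on a skewed `y` returns a list of arrays whose within-group marginal distributions follow the confidential values rather than the normal errors of the regression. This is the kind of robustness to a misspecified imputation model that the abstract describes, although this sketch makes no claim about disclosure risk.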
Keywords: statistical disclosure limitation, confidentiality, privacy, multiple imputation, partially synthetic data
JEL Classification: C1, C4, C5