Privacy and Synthetic Datasets

39 Pages. Posted: 17 Oct 2018

Steven M. Bellovin

Columbia University - Department of Computer Science

Preetam K. Dutta

Columbia University - Department of Computer Science

Nathan Reitinger

Columbia University - Department of Computer Science

Date Written: August 20, 2018

Abstract

Sharing is a virtue, instilled in us from childhood. Unfortunately, when it comes to big data (i.e., databases possessing the potential to usher in a whole new world of scientific progress), the legal landscape prefers a hoggish motif. The historic approach to the resulting database–privacy problem has been anonymization, a subtractive technique incurring not only poor privacy results, but also lackluster utility. In anonymization’s stead, differential privacy arose; it provides better, near-perfect privacy, but is nonetheless subtractive in terms of utility. Today, another solution is coming to the fore: synthetic data. Using the magic of machine learning, synthetic data offers a generative, additive approach: the creation of almost-but-not-quite replica data. In fact, as we recommend, synthetic data may be combined with differential privacy to achieve a best-of-both-worlds scenario. After unpacking the technical nuances of synthetic data, we analyze its legal implications, finding both over- and under-inclusive applications. Privacy statutes either overweigh or downplay the potential for synthetic data to leak secrets, inviting ambiguity. We conclude that synthetic data is a valid, privacy-conscious alternative to raw data, but is not a cure-all for every situation. In the end, progress in computer science must be met with proper policy in order to move the area of useful data dissemination forward.

Keywords: privacy, machine learning, synthetic data, HIPAA, FERPA

Suggested Citation

Bellovin, Steven M. and Dutta, Preetam K. and Reitinger, Nathan, Privacy and Synthetic Datasets (August 20, 2018). Stanford Technology Law Review, Forthcoming. Available at SSRN: https://ssrn.com/abstract=3255766 or http://dx.doi.org/10.2139/ssrn.3255766

Steven M. Bellovin (Contact Author)

Columbia University - Department of Computer Science

New York, NY 10027
United States

Preetam K. Dutta

Columbia University - Department of Computer Science

116th and Broadway
New York, NY 10027
United States

Nathan Reitinger

Columbia University - Department of Computer Science

New York, NY 10027
United States
