Privacy Preserving Data Fusion

53 Pages Posted: 31 May 2023


Longxiu Tian

University of North Carolina at Chapel Hill

Dana Turjeman

Arison School of Business, Reichman University

Samuel Levy

Carnegie Mellon University

Date Written: June 02, 2024


Data fusion combines multiple datasets to make inferences that are more accurate, generalizable, and useful than those made with any single dataset alone. However, data fusion poses privacy hazards due to the risk of revealing user identities, even when the datasets are anonymous. We propose a privacy-preserving data fusion (PPDF) methodology intended to preserve user-level anonymity while allowing for a robust and expressive data fusion process. PPDF is based on variational autoencoders and normalizing flows, which together enable an expressive, nonparametric, Bayesian, generative framework estimated in adherence to differential privacy, the state-of-the-art theory for privacy preservation. PPDF does not require the same users to appear across datasets when learning the joint data-generating process, and it explicitly accounts for missingness in each dataset to correct for sample selection. In addition, PPDF allows downstream inferences to be made on the fused data without the analyst having to specify a discriminative model or likelihood a priori, and it explicitly allows for estimation of the reidentification risk of individuals present in the data, helping data holders accurately balance privacy and accuracy. To elucidate the mechanism of PPDF, we formally derive an analytical bound on the unique privacy risks posed by any data fusion exercise. Empirically, we fuse a large-scale anonymous customer satisfaction survey with a detailed customer relationship management (CRM) database from a leading U.S. telecom carrier. The resulting fusion yields the joint distribution between survey outcomes and CRM engagement metrics at the customer level. We highlight PPDF's downstream applicability via a churn-prevention campaign in which targeting accuracy is enhanced with fused outcomes from the survey while privacy is assured. We also provide simulation exercises comparing PPDF against other models, both private and non-private, and discuss the inherent trade-off between privacy and accuracy in data fusion.
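The differential-privacy guarantee invoked above is typically obtained by injecting calibrated noise into released quantities or into model training. As an illustrative sketch only (not the paper's PPDF estimator, which relies on privately trained variational autoencoders and normalizing flows), the classical Gaussian mechanism shows the basic building block: noise scaled to a query's L2 sensitivity yields an (epsilon, delta)-DP release. The function name and example values below are hypothetical.

```python
import math
import numpy as np

def gaussian_mechanism(value, sensitivity, epsilon, delta, rng=None):
    """Release `value` with (epsilon, delta)-differential privacy by adding
    Gaussian noise calibrated to the query's L2 sensitivity.
    Uses the classical analytic calibration, valid for epsilon < 1:
        sigma = sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon
    """
    rng = rng if rng is not None else np.random.default_rng()
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: privately release the mean of a bounded feature in [0, 1].
data = np.array([0.2, 0.9, 0.4, 0.7])
sensitivity = 1.0 / len(data)  # one user changes the mean by at most 1/n
private_mean = gaussian_mechanism(data.mean(), sensitivity,
                                  epsilon=0.5, delta=1e-5)
```

Smaller epsilon (stronger privacy) forces larger sigma and hence noisier releases, which is the privacy-accuracy trade-off the paper analyzes in the data fusion setting.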

Keywords: differential privacy, data fusion, variational autoencoders, generative modeling, selection bias

Suggested Citation

Tian, Longxiu and Turjeman, Dana and Levy, Samuel, Privacy Preserving Data Fusion (June 02, 2024). Available at SSRN.

Longxiu Tian (Contact Author)

University of North Carolina at Chapel Hill ( email )

Chapel Hill, NC 27599
United States

Dana Turjeman

Arison School of Business, Reichman University ( email )


Samuel Levy

Carnegie Mellon University ( email )

Pittsburgh, PA 15213-3890
United States


