Privacy Preserving Data Fusion

49 Pages Posted: 31 May 2023

Longxiu Tian

University of North Carolina at Chapel Hill

Dana Turjeman

Arison School of Business, Reichman University

Samuel Levy

Carnegie Mellon University

Date Written: February 14, 2023


Data fusion combines multiple datasets to make inferences that are more accurate, generalizable, and useful than those made with any single dataset alone. However, data fusion poses a privacy hazard due to the risk of revealing user identities. We propose a privacy preserving data fusion (PPDF) methodology intended to preserve user-level anonymity while allowing for a robust and expressive data fusion process. PPDF is based on variational autoencoders and normalizing flows, which together enable a highly expressive, nonparametric, Bayesian, generative modeling framework, estimated in adherence to differential privacy, the state-of-the-art theory for privacy preservation. PPDF does not require the same users to appear across datasets when learning the joint data generating process, and it explicitly accounts for missingness in each dataset to correct for sample selection. Moreover, PPDF is model-agnostic: it allows downstream inferences to be made on the fused data without the analyst needing to specify a discriminative model or likelihood a priori. We undertake a series of simulations to showcase the quality of our proposed methodology. Then, we fuse a large-scale customer satisfaction survey with the customer relationship management (CRM) database of a leading U.S. telecom carrier. The resulting fusion yields the joint distribution between survey satisfaction outcomes and CRM engagement metrics at the customer level, including the likelihood of leaving the company’s services. Highlighting the importance of correcting selection bias, we illustrate the divergence between the observed survey responses and the imputed distribution over the customer base. Managerially, we find a negative, nonlinear relationship between satisfaction and future account termination across the telecom carrier’s customers, which can aid in segmentation, targeting, and proactive churn management. Overall, PPDF will substantially reduce the risk of compromising privacy and anonymity when fusing different datasets.
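
The abstract states that PPDF is estimated in adherence to differential privacy but does not say how this is enforced. One common way to train a generative model under differential privacy is a DP-SGD-style update: clip each user's gradient to a fixed L2 norm, sum the clipped gradients, and add Gaussian noise scaled to that norm. The sketch below illustrates only that generic step and is not taken from the paper; the function name and the parameters `clip_norm` and `noise_multiplier` are assumptions made for illustration.

```python
import math
import random


def dp_noisy_mean_gradient(per_example_grads, clip_norm, noise_multiplier, rng):
    """Return a privatized average gradient (generic DP-SGD-style step).

    Hypothetical helper for illustration; not the authors' estimator.
    per_example_grads: list of equal-length lists, one gradient per user.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        # Shrink gradients whose L2 norm exceeds clip_norm; leave others as-is.
        scale = min(1.0, clip_norm / max(norm, 1e-12))
        for j, g in enumerate(grad):
            total[j] += g * scale
    # Gaussian noise with standard deviation proportional to the clipping bound.
    sigma = noise_multiplier * clip_norm
    n = len(per_example_grads)
    return [(t + rng.gauss(0.0, sigma)) / n for t in total]


rng = random.Random(0)
# Toy per-user gradients standing in for those of a generative model.
grads = [[rng.gauss(0.0, 5.0) for _ in range(4)] for _ in range(256)]
private_grad = dp_noisy_mean_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single user can move the update, and the noise masks what remains of that contribution. The resulting privacy guarantee depends on the noise multiplier, the sampling rate, and the number of training steps, none of which this sketch accounts for.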

Keywords: differential privacy, data fusion, variational autoencoders, generative modeling, selection bias.

Suggested Citation

Tian, Longxiu and Turjeman, Dana and Levy, Samuel, Privacy Preserving Data Fusion (February 14, 2023). Available at SSRN: or

Longxiu Tian (Contact Author)

University of North Carolina at Chapel Hill

Chapel Hill, NC 27599
United States

Dana Turjeman

Arison School of Business, Reichman University


Samuel Levy

Carnegie Mellon University

Pittsburgh, PA 15213-3890
United States
