Implications of Data Anonymization on the Statistical Evidence of Disparity
Xu, H., and Zhang, N. (2021). Implications of Data Anonymization on the Statistical Evidence of Disparity, Management Science (accepted).
39 Pages. Posted: 5 Sep 2020. Last revised: 8 Oct 2021.
Date Written: July 28, 2020
Research on and practical development of data anonymization techniques have proliferated in recent years. Although the privacy literature has questioned the efficacy of data anonymization at protecting individuals against harms associated with re-identification, this paper raises a different set of questions: whether anonymization techniques themselves can mask statistical disparities and thus conceal evidence of disparate impact that is potentially discriminatory. If so, the choice to use data anonymization to protect privacy, and the specific technique employed, may pick winners and losers. Examining the implications of these choices for the potentially disparate impact of privacy protection on underprivileged sub-populations is thus a critically important policy question.
The paper begins with an interdisciplinary overview of two common mechanisms of data anonymization and two prevalent types of statistical evidence for disparity. The two common data-anonymization mechanisms are data removal (e.g., k-anonymity), which aims to remove the part of a dataset that could potentially identify an individual; and noise insertion (e.g., differential privacy), which inserts into a dataset carefully designed noise that blocks the identification of individuals yet allows the accurate recovery of certain summary statistics. The two commonly accepted types of statistical evidence for disparity are disparity through separation (e.g., the "two or three standard deviations" rule for a prima facie case of discrimination), which is grounded in the idea of detecting the separation between the outcome distributions of different sub-populations; and disparity through variation (e.g., the "more likely than not" rule in toxic tort cases), which concentrates on the magnitude of the difference between the mean outcomes of different sub-populations.
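The two mechanisms described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: `suppress_small_cells` is a hypothetical suppression-only route to k-anonymity, `dp_count` is the textbook Laplace mechanism for a counting query, and the toy records, field names, and parameter values are all assumptions made for illustration.

```python
import random
from collections import Counter


def suppress_small_cells(records, quasi_ids, k):
    """Data removal: drop every record whose quasi-identifier combination
    is shared by fewer than k records (suppression-only k-anonymity)."""
    def key(record):
        return tuple(record[q] for q in quasi_ids)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponential draws with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(true_count, epsilon, rng):
    """Noise insertion: Laplace mechanism for a counting query.
    A count changes by at most 1 when one person is added or removed,
    so the noise scale is 1 / epsilon."""
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy records: zip code and age band act as quasi-identifiers.
records = [
    {"zip": "75001", "age": "30-39", "readmitted": 1},
    {"zip": "75001", "age": "30-39", "readmitted": 0},
    {"zip": "75001", "age": "30-39", "readmitted": 0},
    {"zip": "75002", "age": "60-69", "readmitted": 1},
]

# The lone 75002 record is unique on its quasi-identifiers, so it is removed
# and its group disappears from any later comparison.
kept = suppress_small_cells(records, ["zip", "age"], k=2)

# The noisy count keeps every record but perturbs the released statistic.
rng = random.Random(0)
noisy = dp_count(sum(r["readmitted"] for r in records), epsilon=1.0, rng=rng)
```

In this toy case, removal deletes an entire small group, while noise insertion leaves group membership intact and perturbs only the released number. That contrast is the intuition behind why the two mechanisms might affect disparity evidence differently.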
We develop a conceptual foundation and a mathematical formalism demonstrating that the two data anonymization mechanisms have distinctive impacts on the identifiability of disparity, and that these impacts also vary with how disparity is operationalized statistically. Specifically, under the regime of disparity through separation, data removal tends to produce more false positives (i.e., detecting a disparity when none exists) than false negatives (i.e., failing to detect an existing disparity), while noise insertion rarely produces any false positives at all. Under the regime of disparity through variation, by contrast, noise insertion does produce false positives (which are as likely as false negatives), while the likelihood that data removal produces false positives and false negatives depends on the underlying data distribution.
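The two evidentiary regimes and the two error types can be made concrete with a small Python sketch. The operationalizations below are assumptions, not the paper's formalism: separation is read as a pooled two-proportion z statistic compared against a threshold of 2, variation is read as a rate ratio compared against 2 (one common reading of "more likely than not" in the toxic tort setting), and a disparity is treated as "existing" when the raw data flag it. All counts are hypothetical.

```python
import math


def z_statistic(pos_a, n_a, pos_b, n_b):
    """Two-proportion z statistic with a pooled standard error."""
    p_a, p_b = pos_a / n_a, pos_b / n_b
    pooled = (pos_a + pos_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


def separation_flag(pos_a, n_a, pos_b, n_b, threshold=2.0):
    """Disparity through separation: |z| exceeds the threshold
    (2.0 here; the rule is often stated as two or three)."""
    return abs(z_statistic(pos_a, n_a, pos_b, n_b)) > threshold


def variation_flag(pos_a, n_a, pos_b, n_b, ratio=2.0):
    """Disparity through variation: group A's rate exceeds `ratio`
    times group B's rate (assumes group B's rate is non-zero)."""
    return (pos_a / n_a) / (pos_b / n_b) > ratio


def error_type(flag_raw, flag_anonymized):
    """Compare the verdict on anonymized data with the verdict on raw data."""
    if flag_anonymized and not flag_raw:
        return "false positive"
    if flag_raw and not flag_anonymized:
        return "false negative"
    return "agree"


# Hypothetical counts: (positives in A, size of A, positives in B, size of B).
raw = (30, 100, 20, 100)
anonymized = (30, 90, 10, 80)  # e.g. after some records were removed

separation_result = error_type(separation_flag(*raw), separation_flag(*anonymized))
variation_result = error_type(variation_flag(*raw), variation_flag(*anonymized))
```

With these made-up counts, neither rule flags the raw data but both flag the anonymized data, so both results are classified as false positives. The sketch only shows how such errors would be tallied; it says nothing about how often each mechanism produces them.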
We empirically validated our findings with an inpatient dataset from one of the five most populous states in the U.S. We examined four data-anonymization techniques (two in the data-removal category and two in the noise-insertion category), ranging from the current rules used by the State of Texas to anonymize its state-wide inpatient discharge dataset to state-of-the-art differential privacy algorithms for regression analysis. After presenting the empirical results, which confirmed our conceptual and mathematical findings, we conclude the paper by discussing the business and policy implications of these findings, highlighting the need for firms and policy makers to balance the protection of privacy against the recognition and rectification of disparate impact.
In sum, our paper identifies an important knowledge gap in both the technology and law fields: whether data anonymization technologies themselves can mask statistical disparities and thus conceal evidence of disparate impact that is potentially discriminatory. The emergence of privacy laws (e.g., GDPR) gives primacy to answering this question, because if such disparate impacts do exist, legislators and regulators would essentially be picking winners and losers by requiring or incentivizing the use of data anonymization techniques. This paper tackles this timely yet complex challenge, which is especially pressing given the current public discourse in the U.S. about racial discrimination and the worldwide trend of prioritizing the protection of consumer privacy in legislation and regulation.
Keywords: privacy, data anonymization, discrimination, statistical disparity
JEL Classification: J71, K31, K13