Data Anonymisation, Outlier Detection and Fighting Overfitting with Restricted Boltzmann Machines
27 Pages Posted: 24 Feb 2020
Date Written: January 27, 2020
We propose a novel approach to the anonymisation of datasets through non-parametric learning of the underlying multivariate distribution of dataset features and generation of new synthetic samples from the learned distribution. The main objective is to ensure that classifiers and regressors trained on synthetic datasets perform as well as (or better than) the same classifiers and regressors trained on the original data. The ability to generate an unlimited number of synthetic data samples from the learned distribution can be a remedy against overfitting when dealing with small original datasets. When the synthetic data generator is trained as an autoencoder with an information-compressing bottleneck structure, we can also expect to see a reduced number of outliers in the generated datasets, thus further improving the generalization capabilities of the classifiers trained on synthetic data. We achieve these objectives with the help of the Restricted Boltzmann Machine, a special type of generative neural network that possesses all the required properties of a powerful data anonymiser.
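The learn-then-sample idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes binary features, a plain Bernoulli RBM trained with one-step contrastive divergence (CD-1), and a toy dataset invented here for illustration. The class name `BernoulliRBM`, its methods and all hyperparameters are hypothetical choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class BernoulliRBM:
    """Hypothetical minimal binary RBM trained with CD-1 (illustration only)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b = np.zeros(n_visible)  # visible biases
        self.c = np.zeros(n_hidden)   # hidden biases

    def sample_hidden(self, v):
        # Hidden activation probabilities and a binary sample given visibles.
        p = sigmoid(v @ self.W + self.c)
        return p, (self.rng.random(p.shape) < p).astype(float)

    def sample_visible(self, h):
        # Visible activation probabilities and a binary sample given hiddens.
        p = sigmoid(h @ self.W.T + self.b)
        return p, (self.rng.random(p.shape) < p).astype(float)

    def fit(self, data, epochs=200, lr=0.1):
        n = data.shape[0]
        for _ in range(epochs):
            ph0, h0 = self.sample_hidden(data)   # positive phase
            _, v1 = self.sample_visible(h0)      # one Gibbs step
            ph1, _ = self.sample_hidden(v1)      # negative phase
            self.W += lr * (data.T @ ph0 - v1.T @ ph1) / n
            self.b += lr * (data - v1).mean(axis=0)
            self.c += lr * (ph0 - ph1).mean(axis=0)
        return self

    def generate(self, n_samples, n_gibbs=100):
        # Start from random visibles and run a Gibbs chain to draw samples.
        v = (self.rng.random((n_samples, self.b.size)) < 0.5).astype(float)
        for _ in range(n_gibbs):
            _, h = self.sample_hidden(v)
            _, v = self.sample_visible(h)
        return v

# Toy "original" dataset (assumed for illustration): 4 correlated binary features.
toy_rng = np.random.default_rng(1)
x = toy_rng.random((200, 1)) < 0.5
original = np.hstack([x, x, ~x, x]).astype(float)

rbm = BernoulliRBM(n_visible=4, n_hidden=2).fit(original)
synthetic = rbm.generate(n_samples=500)  # more rows than the original data
```

Here `synthetic` contains freshly sampled rows rather than copies of the original records, and its size is chosen independently of the original dataset. The small hidden layer (2 units for 4 features) plays the role of the bottleneck mentioned in the abstract. Real-valued features would need a different visible-unit model or a binary encoding, which this sketch does not cover.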
Keywords: Restricted Boltzmann Machine, non-parametric sampling, synthetic data generation, data anonymisation, detection of outliers, reduction of overfitting
JEL Classification: C63, G17