DataGene: A Framework for Dataset Similarity

6 Pages Posted: 29 Jun 2020 Last revised: 22 Oct 2020

Derek Snow

The Alan Turing Institute

Date Written: June 5, 2020


DataGene is developed to identify dataset similarity between real and synthetic datasets, as well as between train, test, and validation datasets. Many modelling and software development tasks require datasets that share similar characteristics. This has traditionally been assessed with visualizations; DataGene seeks to replace these visual methods with a range of novel quantitative methods. Please see the GitHub repository to inspect and install the Python code.
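
To make the idea of a quantitative similarity measure concrete, the following is a minimal sketch and not DataGene's own API or method. It assumes each dataset is a plain mapping from column name to a list of numbers, and it swaps in a well-known statistic, the two-sample Kolmogorov-Smirnov distance, as a stand-in for whatever metrics the package actually provides. The function names `ks_statistic` and `column_similarity` and the toy data are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of a and b (0 = identical, 1 = fully separated)."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def column_similarity(real, synthetic):
    """Map each shared column to 1 - KS statistic (1.0 = same empirical distribution)."""
    return {
        name: 1.0 - ks_statistic(real[name], synthetic[name])
        for name in real
        if name in synthetic
    }


# Hypothetical toy data: "price" matches exactly, "volume" does not overlap at all.
real = {"price": [1.0, 2.0, 3.0, 4.0], "volume": [10, 20, 30, 40]}
synthetic = {"price": [1.0, 2.0, 3.0, 4.0], "volume": [110, 120, 130, 140]}

scores = column_similarity(real, synthetic)
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

A per-column score like this gives a single number where a side-by-side histogram would otherwise be inspected by eye. It ignores relationships between columns and any time ordering, so it is only a starting point for the kind of comparison the abstract describes.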

Keywords: Distance, Time Series, Similarity, Metrics, Python, Data, Machine Learning, Data Science

JEL Classification: C

Suggested Citation

Snow, Derek, DataGene: A Framework for Dataset Similarity (June 5, 2020). Available at SSRN: or

Derek Snow (Contact Author)

The Alan Turing Institute

British Library, 96 Euston Rd
London, NW1 2DB
United Kingdom

