DataGene: A Framework for Dataset Similarity

6 Pages. Posted: 29 Jun 2020. Last revised: 22 Oct 2020.

Derek Snow

The Alan Turing Institute; New York University (NYU) - Finance and Risk Engineering Department; University of Auckland

Date Written: June 5, 2020


DataGene is developed to identify dataset similarity between real and synthetic datasets, as well as among train, test, and validation datasets. Many modelling and software development tasks require datasets to share similar characteristics. This has traditionally been assessed with visualizations; DataGene seeks to replace these visual methods with a range of novel quantitative methods. Please see the GitHub repository to inspect and install the Python code.
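
To make the idea of a quantitative similarity check concrete, the following is a minimal sketch and not DataGene's own API or method. It assumes two numeric arrays with matching columns and scores each column by one minus the two-sample Kolmogorov-Smirnov statistic, a generic stand-in for the measures the paper describes. The function names ks_statistic and column_similarity are hypothetical and chosen only for illustration.

```python
import numpy as np


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap
    between the empirical CDFs of a and b (0 = identical, 1 = disjoint)."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    # Evaluate both empirical CDFs at every observed value.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / a.size
    cdf_b = np.searchsorted(b, grid, side="right") / b.size
    return float(np.max(np.abs(cdf_a - cdf_b)))


def column_similarity(real, synthetic):
    """Hypothetical helper (not part of DataGene): per-column score
    of 1 - KS statistic, so 1.0 means matching empirical distributions."""
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    if real.ndim != 2 or synthetic.ndim != 2:
        raise ValueError("both inputs must be 2-D (rows x columns)")
    if real.shape[1] != synthetic.shape[1]:
        raise ValueError("inputs must have the same number of columns")
    return [
        1.0 - ks_statistic(real[:, j], synthetic[:, j])
        for j in range(real.shape[1])
    ]


# Illustrative data only: column 0 matches, column 1 does not overlap.
real = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
synthetic = np.array([[0.0, 10.0], [1.0, 11.0], [2.0, 12.0]])
scores = column_similarity(real, synthetic)
```

A score near 1.0 suggests the two columns are distributed alike, while a score near 0.0 flags a column whose real and synthetic values barely overlap. A per-column marginal test like this ignores dependence between columns and across time, so it is at best a first screen.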

Keywords: Distance, Time Series, Similarity, Metrics, Python, Data, Machine Learning, Data Science

JEL Classification: C

Suggested Citation

Snow, Derek, DataGene: A Framework for Dataset Similarity (June 5, 2020). Available at SSRN.

Derek Snow (Contact Author)

The Alan Turing Institute

British Library, 96 Euston Rd
London, NW1 2DB
United Kingdom

New York University (NYU) - Finance and Risk Engineering Department

6 Metrotech Center
New York, NY 11201
United States

University of Auckland

Private Bag 92019
Auckland Mail Centre
Auckland, 1142
New Zealand
