The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability

Forthcoming 19 Ohio St. Tech. L.J. (2023)

79 Pages. Posted: 11 Oct 2022

Mehtab Khan

Yale Law School; Harvard University - Berkman Klein Center for Internet & Society

Alex Hanna

Distributed AI Research Institute

Date Written: September 13, 2022

Abstract

Datasets used to train and build AI technologies have drawn increasing attention from the computer science and social science research communities, but less from legal scholarship. Both Large-Scale Language Datasets (LSLDs) and Large-Scale Computer Vision Datasets (LSCVDs) have been at the forefront of these discussions, owing to recent controversies over the use of facial recognition technologies and to debate over the use of publicly available text to train massive models that generate human-like text. Many of these datasets serve as “benchmarks” for developing models used in both academic and industry research, while others are used solely for training models. The process of developing LSLDs and LSCVDs is complex and contextual, involving dozens of decisions about what kinds of data to collect, label, and train a model on, as well as how to make the data available to other researchers. However, little attention has been paid to mapping and consolidating the legal issues that arise at different stages of this process: when the data is being collected, after the data is used to build and evaluate models and applications, and when that data is distributed more widely.

In this article, we offer four main contributions. First, we describe what kinds of objects these datasets are, how many different kinds exist, what modalities they encompass, and why they are important. Second, we clarify the stages of dataset development, a process that has thus far been subsumed within broader discussions about bias and discrimination, and identify the subjects who may be susceptible to harms at each point of development. Third, we provide a matrix of the stages and the subjects of dataset development, which traces the connections between them. Fourth, we use this analysis to identify some basic legal issues that arise at the various stages, in order to foster a better understanding of the dilemmas and tensions at each stage. We situate our discussion within current debates and proposals related to algorithmic accountability. This paper fills an essential gap in understanding the complicated landscape of legal issues connected to datasets and the gigantic AI models trained on them.

Keywords: Artificial Intelligence, Data Governance, Privacy, Copyright

Suggested Citation

Khan, Mehtab and Hanna, Alex, The Subjects and Stages of AI Dataset Development: A Framework for Dataset Accountability (September 13, 2022). Forthcoming 19 Ohio St. Tech. L.J. (2023), Available at SSRN: https://ssrn.com/abstract=4217148 or http://dx.doi.org/10.2139/ssrn.4217148

Mehtab Khan (Contact Author)

Yale Law School

127 Wall Street
New Haven, CT 06510
United States

Harvard University - Berkman Klein Center for Internet & Society

Harvard Law School
23 Everett, 2nd Floor
Cambridge, MA 02138
United States
