Discit Ergo Est: Training Data Provenance and Fair Use

Robert Mahari and Shayne Longpre, Discit ergo est: Training Data Provenance And Fair Use, Dynamics of Generative AI (ed. Thibault Schrepel & Volker Stocker), Network Law Review, Winter 2023.

14 Pages, Posted: 23 May 2024

Robert Mahari

Harvard Law School; Massachusetts Institute of Technology (MIT) - Human Dynamics Group

Shayne Longpre

Apple

Date Written: 2023

Abstract

Language models are fundamentally shaped by their training data, which includes massive unstructured pretraining corpora scraped from the web and smaller curated datasets created specifically for AI training. While pretraining data has received significant attention, curated datasets have been responsible for many recent breakthroughs in generative AI. Building on the Data Provenance Initiative, a large-scale audit of 1,800 curated text datasets, we discuss how curated data has enabled these advancements, explore how the fair use doctrine applies to pretraining and curated datasets, and highlight the importance of data provenance for both copyright and responsible AI practices. Because curated data was created for the sole purpose of training AI models, we argue that its use for this purpose should not generally be treated as fair use. We discuss how the identity of the dataset creator, the use of third-party data, and the involvement of large language models in dataset creation affect this analysis. Finally, we propose that enforceable licenses for curated datasets can incentivize transparency and responsible AI practices by requiring model developers to track data sources and by protecting dataset creators who openly share their work.

Keywords: Data Provenance, Fair Use, Copyright, Large Language Models, generative AI, AI Regulation, Computational Law, Regulation by Design

JEL Classification: K29, K49, O32, O38, K24, O34

Suggested Citation

Mahari, Robert and Longpre, Shayne, Discit Ergo Est: Training Data Provenance and Fair Use (2023). Robert Mahari and Shayne Longpre, Discit ergo est: Training Data Provenance And Fair Use, Dynamics of Generative AI (ed. Thibault Schrepel & Volker Stocker), Network Law Review, Winter 2023, Available at SSRN: https://ssrn.com/abstract=4795277 or http://dx.doi.org/10.2139/ssrn.4795277

Robert Mahari (Contact Author)

Harvard Law School

1563 Massachusetts Avenue
Cambridge, MA 02138
United States

Massachusetts Institute of Technology (MIT) - Human Dynamics Group

77 Mass. Ave
E14/E15
Cambridge, MA 02139-4307
United States

Shayne Longpre

Apple

1 Infinite Loop
Cupertino, CA 95014
United States
