Discit Ergo Est: Training Data Provenance and Fair Use

Mahari, Robert; Longpre, Shayne

doi:10.2139/ssrn.4795277

Robert Mahari and Shayne Longpre, Discit ergo est: Training Data Provenance And Fair Use, Dynamics of Generative AI (ed. Thibault Schrepel & Volker Stocker), Network Law Review, Winter 2023.

14 Pages Posted: 23 May 2024

Robert Mahari

Harvard Law School; Massachusetts Institute of Technology (MIT) - Human Dynamics Group

Language models are fundamentally shaped by their training data, which includes massive unstructured pretraining corpora scraped from the web and smaller curated datasets created specifically for AI training. While pretraining data has received significant attention, curated datasets have been responsible for many recent breakthroughs in generative AI. Building on the Data Provenance Initiative, a massive audit of 1,800 curated text datasets, we discuss how curated data has enabled these advancements, explore the application of the fair use doctrine to pretraining and curated datasets, and highlight the importance of data provenance for both copyright and responsible AI practices. Because curated data was created for the sole purpose of training AI models, we argue that its use for this purpose should not generally be treated as fair use. We discuss how the identity of the dataset creator, the use of third-party data, and the involvement of large language models in dataset creation affect this analysis. Finally, we propose that enforceable licenses for curated datasets can incentivize transparency and responsible AI practices by requiring model developers to track data sources and by protecting dataset creators who openly share their work.

Keywords: Data Provenance, Fair Use, Copyright, Large Language Models, generative AI, AI Regulation, Computational Law, Regulation by Design

JEL Classification: K29, K49, O32, O38, K24, O34

Suggested Citation:

Mahari, Robert and Longpre, Shayne, Discit Ergo Est: Training Data Provenance and Fair Use (2023). Robert Mahari and Shayne Longpre, Discit ergo est: Training Data Provenance And Fair Use, Dynamics of Generative AI (ed. Thibault Schrepel & Volker Stocker), Network Law Review, Winter 2023. Available at SSRN: https://ssrn.com/abstract=4795277 or http://dx.doi.org/10.2139/ssrn.4795277