Discit Ergo Est: Training Data Provenance and Fair Use
Robert Mahari and Shayne Longpre, Discit ergo est: Training Data Provenance and Fair Use, Dynamics of Generative AI (ed. Thibault Schrepel & Volker Stocker), Network Law Review, Winter 2023.
14 Pages. Posted: 23 May 2024
Date Written: 2023
Abstract
Language models are fundamentally shaped by their training data, which includes massive unstructured pretraining corpora scraped from the web and smaller curated datasets created specifically for AI training. While pretraining data has received significant attention, curated datasets have been responsible for many recent breakthroughs in generative AI. Building on the Data Provenance Initiative, a massive audit of 1,800 curated text datasets, we discuss how curated data has enabled these advancements, explore the application of the fair use doctrine to pretraining and curated datasets, and highlight the importance of data provenance for both copyright and responsible AI practices. Because curated data was created for the sole purpose of training AI models, we argue that its use for this purpose should not generally be treated as fair use. We discuss how this analysis is affected by the identity of the dataset creator, the use of third-party data, and the involvement of large language models in dataset creation. Finally, we propose that enforceable licenses for curated datasets can incentivize transparency and responsible AI practices by requiring model developers to track data sources and by protecting dataset creators who openly share their work.
Keywords: Data Provenance, Fair Use, Copyright, Large Language Models, generative AI, AI Regulation, Computational Law, Regulation by Design
JEL Classification: K29, K49, O32, O38, K24, O34