The Predominant Use of High-Authority Commercial Web Publisher Content to Train Leading LLMs
39 Pages. Posted: 7 Nov 2024. Last revised: 22 Nov 2024.
Date Written: November 04, 2024
Abstract
Large language models (LLMs) predict text from preceding text. Developers train LLMs by ingesting text data. LLMs’ language faculty has emerged from the sheer scale of their training corpora, a scale that developers achieve by copying and ingesting vast quantities of text from the web.
But not all web text is equally valuable to LLM developers.
In this paper, we review published details of training data from major LLM company research teams and analyze the datasets. Our work shows that key LLM training datasets are disproportionately composed of high-quality content owned by commercial publishers of news and media websites. Major LLM companies have quantifiably prioritized this content in training the most important LLMs over the short history of the technology.
As LLMs have evolved from pure research projects to some of the most valuable IP assets on earth, LLM companies have ceased publishing training details, and publishers have brought litigation against them. Courts and policymakers are grappling with questions of IP rights and technological progress. We attempt to illuminate the richest available sources of information on LLM companies’ use of web publisher content to inform this vital public conversation.
Keywords: artificial intelligence, large language models, ai, llms, training data, web publishers, common crawl, c4, webtext, webtext2