The Predominant Use of High-Authority Commercial Web Publisher Content to Train Leading LLMs

39 Pages. Posted: 7 Nov 2024. Last revised: 22 Nov 2024

Date Written: November 04, 2024

Abstract

Large language models predict text from preceding text. Developers train LLMs by ingesting text data. LLMs’ language faculty has emerged from the vast scale of their training corpora. LLM developers achieve that scale by copying and ingesting vast quantities of text from the web.

But not all web text is equally valuable to LLM developers.

In this paper, we review published details of training data from major LLM company research teams and analyze the datasets. Our work shows that key LLM training datasets are disproportionately composed of high-quality content owned by commercial publishers of news and media websites. Major LLM companies have quantifiably prioritized this content in training the most important LLMs over the short history of the technology.

As LLMs have evolved from pure research projects to some of the most valuable IP assets on earth, LLM companies have ceased publishing training details, and publishers have brought litigation against them. Courts and policymakers are grappling with questions of IP rights and technological progress. We attempt to illuminate the richest available sources of information on LLM companies’ use of web publisher content to inform this vital public conversation.

Keywords: artificial intelligence, large language models, ai, llms, training data, web publishers, common crawl, c4, webtext, webtext2

Suggested Citation

Wukoson, George and Fortuna, Joey, The Predominant Use of High-Authority Commercial Web Publisher Content to Train Leading LLMs (November 04, 2024). Available at SSRN: https://ssrn.com/abstract=5009668 or http://dx.doi.org/10.2139/ssrn.5009668

George Wukoson (Contact Author)

Ziff Davis

360 Park Avenue South
17th Floor
New York, NY 10010
United States

Joey Fortuna

Ziff Davis

360 Park Avenue South
17th Floor
New York, NY 10010
United States

