The Predominant Use of High-Authority Commercial Web Publisher Content to Train Leading LLMs

Wukoson, George; Fortuna, Joey

doi:10.2139/ssrn.5009668

39 pages. Posted: 7 Nov 2024. Last revised: 22 Nov 2024.

George Wukoson, Ziff Davis

Joey Fortuna, Ziff Davis

Date Written: November 04, 2024

Abstract

Large language models predict text from preceding text. Developers train LLMs by ingesting text data. LLMs’ language faculty has emerged from the vast scale of their training corpora. LLM developers achieve that scale by copying and ingesting vast quantities of text from the web.

But not all web text is equally valuable to LLM developers.

In this paper, we review published details of training data from major LLM company research teams and analyze the datasets. Our work shows that key LLM training datasets are disproportionately composed of high-quality content owned by commercial publishers of news and media websites. Major LLM companies have quantifiably prioritized this content in training the most important LLMs over the short history of the technology.

As LLMs have evolved from pure research projects to some of the most valuable IP assets on earth, LLM companies have ceased publishing training details, and publishers have brought litigation against them. Courts and policymakers are grappling with questions of IP rights and technological progress. We attempt to illuminate the richest available sources of information on LLM companies’ use of web publisher content to inform this vital public conversation.

Keywords: artificial intelligence, large language models, ai, llms, training data, web publishers, common crawl, c4, webtext, webtext2

Suggested Citation

Wukoson, George and Fortuna, Joey, The Predominant Use of High-Authority Commercial Web Publisher Content to Train Leading LLMs (November 04, 2024). Available at SSRN: https://ssrn.com/abstract=5009668 or http://dx.doi.org/10.2139/ssrn.5009668