Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training

36 Pages. Posted: 5 Dec 2024

John Burden

University of Cambridge - Leverhulme Centre for the Future of Intelligence

Maurice Chiodo

University of Cambridge

Henning Grosse Ruse-Khan

University of Cambridge Fellow, King's College Cambridge; University of Cambridge

Lisa Markschies

Humboldt University of Berlin; Weizenbaum Institute Berlin

Dennis Müller

RWTH Aachen University

Seán Ó hÉigeartaigh

University of Cambridge - Leverhulme Centre for the Future of Intelligence

Rupprecht Podszun

Heinrich Heine University Dusseldorf - Faculty of Law

Herbert Zech

Weizenbaum Institute Berlin; Humboldt University of Berlin - Faculty of Law

Date Written: December 02, 2024

Abstract

The increasing use of LLMs and other generative AI is adding vast quantities of AI-generated content to the online data environment, thereby contaminating it: future AIs continue to be trained on data from the internet, and thus partly on their own (collective) output. This feedback loop can have detrimental effects, to the point where newly trained AI models can collapse and become useless. Removing AI-generated content from data sets is already difficult, and may well become impossible as that content further improves, leading to a risk of a perpetually contaminated information sphere. As future AIs will continue to require large uncontaminated data sets, they may increasingly need to rely on data collected prior to the proliferation of generative AI, that is, prior to the end of 2022. Alongside existing barriers to entry such as compute resources, electricity, and human talent, this new barrier may become insurmountable, thereby leaving newcomers in the generative AI market behind and allowing existing players to further entrench their dominant positions. In this paper, we consider a range of existing legal regimes (broadly aligned with those of the EU as the most proactive regulator of AI, data and platforms) that might be useful in addressing questions about access to uncontaminated data and other essential inputs for AI training. Covering AI regulation, data (access) governance regimes, EU and domestic competition law as well as gatekeeper rules for digital markets, we show that currently there is, unsurprisingly, no tailor-made solution in existing legal regimes. However, various elements of those laws reinforce the underlying idea that access to an essential resource which is not only increasingly scarce, but at significant risk of 'extinction', may be an important public policy concern that the law should address, ideally before it is too late. We conclude with a range of considerations on how access to uncontaminated data and other essential inputs could be afforded in a fair and equitable manner for all, while aiming to minimise the risks of harmful uses resulting from such access.

Keywords: model collapse, data contamination, artificial intelligence, AI governance, competition law, antitrust law, generative AI, large language models, reinforcement learning from human feedback

Suggested Citation

Burden, John and Chiodo, Maurice and Grosse Ruse-Khan, Henning and Markschies, Lisa and Müller, Dennis and Ó hÉigeartaigh, Seán and Podszun, Rupprecht and Zech, Herbert, Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training (December 02, 2024). University of Cambridge Faculty of Law Research Paper No. 35/2024, Available at SSRN: https://ssrn.com/abstract=5045155 or http://dx.doi.org/10.2139/ssrn.5045155

John Burden (Contact Author)

University of Cambridge - Leverhulme Centre for the Future of Intelligence

United Kingdom

Maurice Chiodo

University of Cambridge

Trinity Ln
Cambridge, CB2 1TN
United Kingdom

Henning Grosse Ruse-Khan

University of Cambridge Fellow, King's College Cambridge

King's Parade
Cambridge, CB2 1ST
United Kingdom

University of Cambridge

Trinity Ln
Cambridge, CB2 1TN
United Kingdom

Lisa Markschies

Humboldt University of Berlin

Unter den Linden 6
Berlin, 10099
Germany

Weizenbaum Institute Berlin

Germany

Dennis Müller

RWTH Aachen University

Seán Ó hÉigeartaigh

University of Cambridge - Leverhulme Centre for the Future of Intelligence

Rupprecht Podszun

Heinrich Heine University Dusseldorf - Faculty of Law

Universitätsstr. 1
Düsseldorf, D-40225
Germany

Herbert Zech

Weizenbaum Institute Berlin

Hardenbergstr. 32
Berlin, 10623
Germany

HOME PAGE: https://www.weizenbaum-institut.de/

Humboldt University of Berlin - Faculty of Law

Unter den Linden 6
Berlin, 10099
Germany

HOME PAGE: https://www.rewi.hu-berlin.de/de/lf/ls/zch
