Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training
36 Pages
Posted: 5 Dec 2024
Date Written: December 02, 2024
Abstract
The increasing use of LLMs and other generative AI is adding vast quantities of AI-generated content to the online data environment, thereby contaminating it: future AIs continue to be trained on data from the internet, and thus partly on their own (collective) output. However, this feedback loop can have detrimental effects, to the point where newly trained AI models can collapse and become useless. Removing AI-generated content from data sets is already difficult, and may well become impossible as generative AI further improves, leading to a risk of a perpetually contaminated information sphere. As future AIs will continue to require large uncontaminated data sets, they may increasingly need to rely on data collected prior to the proliferation of generative AI, that is, prior to the end of 2022. Alongside existing barriers to entry such as compute resources, electricity, and human talent, this new barrier may become insurmountable, thereby leaving newcomers in the generative AI market behind and allowing existing players to further entrench their dominant positions. In this paper, we consider a range of existing legal regimes (broadly aligned with those of the EU as the most proactive regulator of AI, data and platforms) that might be useful in addressing questions about access to uncontaminated data and other essential inputs for AI training. Covering AI regulation, data (access) governance regimes, EU and domestic competition law as well as gatekeeper rules for digital markets, we show that currently there is, unsurprisingly, no tailor-made solution in existing legal regimes. However, various elements of those laws reinforce the underlying idea that access to an essential resource which is not only increasingly scarce, but at significant risk of 'extinction', may be an important public policy concern that the law should address, ideally before it is too late.
We conclude with a range of considerations on how access to uncontaminated data and other essential inputs could be afforded in a fair and equitable manner for all, while aiming to minimise the risks of harmful uses resulting from such access.
Keywords: model collapse, data contamination, artificial intelligence, AI governance, competition law, antitrust law, generative AI, large language models, reinforcement learning from human feedback