Possibilities of Source Documentation and Disclosure for Generative AI Systems
9 Pages. Posted: 31 Mar 2025. Last revised: 4 Mar 2025
Date Written: February 28, 2025
Abstract
Training generative AI models requires large amounts of data, a significant portion of which is obtained by scraping the web. In addition, AI systems sometimes access web sources during operation to answer specific queries. This has led to a broad debate about copyright and usage rights. Undoubtedly, rights are affected here. Regardless of the extent to which legal claims exist, the question arises whether and how they can be asserted. A basic prerequisite is sufficiently detailed source documentation, together with an adequate means for rights holders to obtain information about the sources. Is this technically possible and feasible with reasonable effort? The short answer is: Yes, it is technically possible and in many cases, especially for web sources, trivial to document sources and make them available for disclosure. This paper describes in detail what pragmatic solutions could look like.
Keywords: Generative AI, Source Documentation, Source Disclosure, Data Provenance