Forget Me Not: Memorisation in Generative Sequence Models Trained on Open Source Licensed Code

37 Pages · Posted: 8 Mar 2024


Ivo Emanuilov

KU Leuven - Centre for IT & IP Law (CiTiP)

Thomas Margoni

Centre for IT & IP Law (CiTiP), Faculty of Law - KU Leuven

There are 2 versions of this paper

Abstract

Generative sequence models, such as GPT-3/4, Stable Diffusion and DALL·E, are increasingly used to produce artefacts traditionally associated with human ingenuity, such as text, images, audio, video and code. Despite their impressive ability to generalise to unseen data, these models are prone to memorising fragments of their training data. In some extreme cases, these ‘memories’ may contain verbatim, and potentially infringing, reproductions of works protected by copyright. In this paper, we focus on one specific example, namely program source code.

The ongoing litigation against Microsoft’s GitHub Copilot service shows that these concerns are far from theoretical. GitHub Copilot is a commercial service designed to support software development workflows. It generates code based on a program specification provided by a programmer in natural language. The service relies on the generative model Codex, which has been trained on public open source code repositories hosted on GitHub and fine-tuned for code generation. In the words of its creators, this model has been trained on ‘billions of lines of public code’, that is, computer programs arguably covered by copyright law and distributed under an open source licence.

Copilot has been shown capable of reproducing, on occasion, verbatim fragments of what is allegedly its training dataset, without appropriate attribution or notice. These reproductions have included not only functional code but also original, expressive code plausibly protected by copyright. While open source software is, by definition, distributed with its source, many open source licences follow a direct licensing model in which attribution, notice and licence-notice preservation requirements must be observed if downstream recipients are not to be found in breach of the licence.

This controversy has sparked heated debate in both the deep learning and legal communities as to the legality of developing and using such models under copyright law. In this paper, we explore the implications of memorisation for copyright infringement under EU law and propose a set of solutions that may help alleviate these concerns.

Keywords: generative AI, memorisation, copyright, open source, text and data mining, Artificial Intelligence, Machine Learning

Suggested Citation

Emanuilov, Ivo and Margoni, Thomas, Forget Me Not: Memorisation in Generative Sequence Models Trained on Open Source Licensed Code. Available at SSRN: https://ssrn.com/abstract=4753124 or http://dx.doi.org/10.2139/ssrn.4753124

Ivo Emanuilov (Contact Author)

KU Leuven - Centre for IT & IP Law (CiTiP) ( email )

Sint-Michielsstraat 6 box 3443
Leuven, 3000
Belgium
+359896704185 (Phone)

Thomas Margoni

Centre for IT & IP Law (CiTiP), Faculty of Law - KU Leuven ( email )

Brussels
Belgium
