A Taxonomy of Training Data: Disentangling the Mismatched Rights, Remedies, and Rationales for Restricting Machine Learning

Artificial Intelligence and Intellectual Property (Reto Hilty, Jyh-An Lee, Kung-Chung Liu, eds.), Oxford University Press, Forthcoming

36 Pages Posted: 8 Oct 2020

Benjamin Sobel

Harvard University - Berkman Klein Center for Internet & Society

Date Written: August 19, 2020

Abstract

This chapter addresses a crucial problem in artificial intelligence: many applications of machine learning depend on unauthorized uses of copyrighted data. Scholars and lawmakers often articulate this problem as a deficiency in copyright’s exceptions and limitations, reasoning that legal uncertainties surrounding today’s AI stem from the lack of a clear exception or limitation, and that such an exception or limitation could resolve the current predicament. In fact, the current predicament is a product of two systemic features of the copyright regime (the absence of formalities and the low threshold of copyrightable originality) combined with a technological environment that turns routine activities into acts of authorship. Equilibrating the economy for human expression in the AI age requires a solution that focuses not only on exceptions to existing copyrights, but also on the aforementioned doctrinal features that determine the ownership and scope of copyright entitlements at their inception.

The chapter taxonomizes different applications of machine learning according to the qualities of their training data. Four categories emerge: (1) public-domain training data, (2) licensed training data, (3) market-encroaching uses of copyrighted training data, and (4) non-market-encroaching uses of copyrighted training data. Copyright can only regulate market-encroaching uses of data, but these uses represent a narrow subset of AI applications and exclude many of the most socially harmful uses of copyrighted materials. Moreover, paradoxically, copyright’s property-style remedies are ill-suited to addressing market-encroaching uses, and are in fact much more appropriate remedies for the categories of worrisome AI that fall outside copyright’s normative mandate.

Finally, this chapter discusses a variety of remedies to the “AI problems” it identifies, with an emphasis on facilitating market-encroaching uses while affording human creators due compensation. It concludes that the exception for Text and Data Mining in the European Union’s Directive on Copyright in the Digital Single Market represents a positive development precisely because the exception addresses some structural causes of the training data problem that this chapter identifies. The TDM provision styles itself as an exception, but it may in fact be better understood as a formality: it requires rights holders to take positive action to exercise a right to exclude their materials from training datasets. Thus, the TDM exception addresses a root cause of the AI dilemma rather than trying to patch up the copyright regime post hoc. The chapter concludes that the next step for an equitable AI framework will be to transition towards rules that not only clarify that non-market-encroaching uses do not infringe copyright, but also facilitate remunerated uses of copyrighted works for market-encroaching purposes.

Keywords: artificial intelligence, intellectual property, ai, copyright, privacy, fair use, machine learning, training data

JEL Classification: K11, K20, O33, O34

Suggested Citation

Sobel, Benjamin, A Taxonomy of Training Data: Disentangling the Mismatched Rights, Remedies, and Rationales for Restricting Machine Learning (August 19, 2020). Artificial Intelligence and Intellectual Property (Reto Hilty, Jyh-An Lee, Kung-Chung Liu, eds.), Oxford University Press, Forthcoming, Available at SSRN: https://ssrn.com/abstract=3677548 or http://dx.doi.org/10.2139/ssrn.3677548

Benjamin Sobel (Contact Author)

Harvard University - Berkman Klein Center for Internet & Society

Harvard Law School
23 Everett, 2nd Floor
Cambridge, MA 02138
United States
