The New Legal Landscape for Text Mining and Machine Learning

Journal of the Copyright Society of the USA, Vol. 66 p.291 (2019)

74 Pages Posted: 26 Feb 2019 Last revised: 29 Feb 2020

Matthew Sag

Emory University School of Law

Date Written: February 27, 2020

Abstract

Individually and collectively, copyrighted works have the potential to generate information that goes far beyond what their individual authors expressed or intended. Various methods of computational and statistical analysis of text, usually referred to as text data mining (“TDM”) or just text mining, can unlock that information. However, because almost every use of TDM involves making copies of the text to be mined, the legality of that copying has become a fraught issue in copyright law in the United States and around the world. One of the most fundamental questions for copyright law in the Internet age is whether the protection of the author’s original expression should stand as an obstacle to the generation of insights about that expression. How this question is answered will have a profound influence on the future of research across the sciences and the humanities, and on the development of the next generation of information technology: machine learning and artificial intelligence.

This Article consolidates a theory of copyright law that I have advanced in a series of articles and amicus briefs over the past decade. It explains why applying copyright’s fundamental principles in the context of new technologies necessarily implies that copying expressive works for non-expressive purposes should not be counted as infringement and must be recognized as fair use. The Article shows how that theory was adopted and applied in the recent high-profile test cases, Authors Guild v. HathiTrust and Authors Guild v. Google, and takes stock of the legal context for TDM research in the United States in the aftermath of those decisions.

The Article makes important contributions to copyright theory, but it also integrates that theory with a practical assessment of various interrelated legal issues that text mining researchers and their supporting institutions must confront if they are to realize the full potential of these technologies. These issues include the enforceability of website terms of service, the effect of laws prohibiting computer hacking and the circumvention of technological protection measures (i.e., encryption and other digital locks), and cross-border copyright issues.

Keywords: Copyright, Fair Use, Terms of Use, Computer Hacking, Digital Rights Management, Computer Fraud and Abuse Act, Digital Single Market Directive, Text Mining, Text Data Mining, Digital Humanities, Machine Learning, Artificial Intelligence, Internet Search, Reverse Engineering, Plagiarism Detection

JEL Classification: K00, C88

Suggested Citation

Sag, Matthew, The New Legal Landscape for Text Mining and Machine Learning (February 27, 2020). Journal of the Copyright Society of the USA, Vol. 66 p.291 (2019), Available at SSRN: https://ssrn.com/abstract=3331606 or http://dx.doi.org/10.2139/ssrn.3331606

Matthew Sag (Contact Author)

Emory University School of Law

1301 Clifton Road
Atlanta, GA 30322
United States
