Explaining Documents' Classifications

Posted: 17 Mar 2011


David Martens

University of Antwerp

Foster Provost

New York University

Date Written: March 2011

Abstract

This is a design-science paper about methods for explaining data-driven classifications of text documents. Document classification has widespread applications, such as with web pages for advertising, emails for legal discovery, blog entries for sentiment analysis, and many more. Document data are characterized by very high dimensionality, often with tens of thousands to millions of variables (words). Many applications require human understanding of the reasons for classification decisions: by managers, client-facing employees, and the technical team. Unfortunately, due to the high dimensionality, understanding the decisions made by the document classifiers is very difficult. Previous approaches to gain insight into black-box models do not deal well with high-dimensional data. Our main theoretical contribution is to define a new sort of explanation, tailored to the business needs of document classification and able to cope with the associated technical constraints. Specifically, an explanation is defined as a set of words (terms, more generally) such that removing all words within this set from the document changes the predicted class from the class of interest. We present an algorithm to find such explanations, as well as a framework to assess such an algorithm's performance. We demonstrate the value of the new approach with a case study from a real-world document classification task: classifying web pages as containing adult content, with the goal of allowing advertisers to choose not to have their ads appear there. We present a further empirical demonstration on news-story topic classification using the 20 Newsgroups benchmark dataset. The results show the explanations to be concise and document-specific, and to provide insight into the exact reasons for the classification decisions, into the workings of the classification models, and into the business application itself. We also illustrate how explaining documents' classifications can help to improve data quality and model performance.
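
The definition of an explanation given in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm, which the abstract does not describe. It is a brute-force search over small word sets, run against a hypothetical toy linear classifier. The word weights, the bias, the sample document, and the names `predict` and `find_explanation` are all assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical toy linear model: these weights are invented for illustration.
WEIGHTS = {"poker": 2.0, "casino": 1.5, "jackpot": 1.0, "news": -0.5, "weather": -1.0}
BIAS = -1.0


def predict(words):
    """Return 1 (class of interest) if the summed weights of the distinct words exceed zero."""
    score = BIAS + sum(WEIGHTS.get(w, 0.0) for w in set(words))
    return 1 if score > 0 else 0


def find_explanation(words, predict_fn, max_size=3):
    """Return a smallest set of words whose removal changes the prediction, or None.

    Brute force: try every word set of size 1, then 2, and so on up to max_size.
    This is only feasible for short documents and small max_size.
    """
    original = predict_fn(words)
    vocab = sorted(set(words))
    for size in range(1, max_size + 1):
        for candidate in combinations(vocab, size):
            remaining = [w for w in words if w not in candidate]
            if predict_fn(remaining) != original:
                return set(candidate)
    return None


# Assumed sample document, classified as the class of interest by the toy model.
document = ["poker", "casino", "news", "jackpot", "today"]
explanation = find_explanation(document, predict)
print(explanation)
```

With these assumed weights, no single word removal changes the class, so the search moves on to pairs and returns the first pair that does. Exhaustive search grows combinatorially with vocabulary size, which is why a practical method for documents with thousands of distinct words would need a guided search rather than this enumeration.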

Suggested Citation

Martens, David and Provost, Foster, Explaining Documents' Classifications (March 2011). NYU Stern School of Business, University of Antwerp, 2011. Available at SSRN: https://ssrn.com/abstract=1788703

David Martens

University of Antwerp

Prinsstraat 13
Antwerp, Antwerp 2000
Belgium

Foster Provost

New York University

44 West Fourth Street
New York, NY 10012
United States
