Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers

9 Pages
Posted: 9 Oct 2008

Victor Sheng

affiliation not provided to SSRN

Foster Provost

New York University (NYU) - Department of Information, Operations, and Management Sciences

Panagiotis G. Ipeirotis

New York University - Leonard N. Stern School of Business

Date Written: March 2008

Abstract

This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it is often possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity, and show several main results. (i) Repeated labeling can improve label quality and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give a considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
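
Result (i), that repeated labeling helps "but not always", can be illustrated with a small calculation. The abstract does not say how multiple labels are combined, so the sketch below makes its own assumptions: binary labels, an odd number n of independent labelers who are each correct with the same probability p, and a simple majority vote. The function name `majority_vote_quality` is hypothetical and used only for this illustration.

```python
from math import comb


def majority_vote_quality(p: float, n: int) -> float:
    """Probability that a majority vote over n labels is correct.

    Assumptions (not taken from the abstract): binary labels, n odd,
    labelers independent, each correct with the same probability p.
    """
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie in [0, 1]")
    # The vote is correct when more than half of the n labels are correct,
    # so sum the binomial probabilities for k = n//2 + 1, ..., n.
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


if __name__ == "__main__":
    for p in (0.4, 0.5, 0.7):
        row = ", ".join(
            f"n={n}: {majority_vote_quality(p, n):.3f}" for n in (1, 3, 5, 11)
        )
        print(f"p={p}: {row}")
```

Under these assumptions, extra labels raise the quality of the combined label when p is above 0.5, leave it unchanged at p = 0.5, and lower it when p is below 0.5, which is one simple way to see why repeated labeling does not always help.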

Suggested Citation

Sheng, Victor and Provost, Foster and Ipeirotis, Panagiotis G., Get Another Label? Improving Data Quality and Data Mining Using Multiple, Noisy Labelers (March 2008). NYU Working Paper No. 2451/25882, Available at SSRN: https://ssrn.com/abstract=1281348

Victor Sheng (Contact Author)

affiliation not provided to SSRN

Foster Provost

New York University (NYU) - Department of Information, Operations, and Management Sciences

44 West Fourth Street
New York, NY 10012
United States

Panagiotis G. Ipeirotis

New York University - Leonard N. Stern School of Business

44 West Fourth Street
Ste 8-84
New York, NY 10012
United States
+1-212-998-0803 (Phone)

HOME PAGE: http://www.stern.nyu.edu/~panos
