Tree Induction Vs. Logistic Regression: a Learning-Curve Analysis

51 Pages Posted: 13 Oct 2008

Claudia Perlich

IBM Corporation - Thomas J. Watson Research Center

Foster Provost

New York University

Jeffrey S. Simonoff

New York University (NYU) - Leonard N. Stern School of Business; New York University (NYU) - Department of Information, Operations, and Management Sciences

Date Written: December 2001

Abstract

Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of signal-to-noise ratio.
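
To make the idea of crossing learning curves concrete, the following minimal Python sketch finds the first training-set size at which the ranking of two learners flips. It is an illustration only, not the authors' procedure: the function name `find_crossover` and all accuracy values are hypothetical and are not results from the paper.

```python
from typing import Optional, Sequence


def find_crossover(
    sizes: Sequence[int],
    curve_a: Sequence[float],
    curve_b: Sequence[float],
) -> Optional[int]:
    """Return the first training-set size at which the better learner changes.

    `curve_a` and `curve_b` hold one performance value (e.g. accuracy)
    per entry of `sizes`, which is assumed to be sorted in increasing order.
    Returns None if one learner is never overtaken by the other.
    """
    if not (len(sizes) == len(curve_a) == len(curve_b)):
        raise ValueError("sizes and both curves must have the same length")

    first_sign = 0  # sign of (a - b) at the first size where they differ
    for n, a, b in zip(sizes, curve_a, curve_b):
        diff = a - b
        sign = (diff > 0) - (diff < 0)
        if sign == 0:
            continue  # tie at this size: no evidence either way
        if first_sign == 0:
            first_sign = sign
        elif sign != first_sign:
            return n  # ranking flipped: the curves crossed before this size
    return None


# Hypothetical accuracies, made up for illustration (NOT taken from the paper).
sizes = [100, 1000, 10000, 100000]
logistic_acc = [0.70, 0.74, 0.75, 0.75]
tree_acc = [0.62, 0.71, 0.77, 0.80]

crossover = find_crossover(sizes, logistic_acc, tree_acc)
print(crossover)
```

With these made-up numbers, comparing the two learners only at 1,000 training examples or only at 100,000 would give opposite conclusions, which is why the abstract argues that superiority on a domain should be judged from the whole curve.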

Suggested Citation

Perlich, Claudia and Provost, Foster and Simonoff, Jeffrey S., Tree Induction Vs. Logistic Regression: a Learning-Curve Analysis (December 2001). Information Systems Working Papers Series, Vol. , pp. -, 2001. Available at SSRN: https://ssrn.com/abstract=1283003

Claudia Perlich (Contact Author)

IBM Corporation - Thomas J. Watson Research Center

Route 134
Kitchawan Road
Yorktown Heights, NY 10598
United States

Foster Provost

New York University

44 West Fourth Street
New York, NY 10012
United States

Jeffrey S. Simonoff

New York University (NYU) - Leonard N. Stern School of Business

44 West 4th Street
Suite 9-160
New York, NY 10012
United States

New York University (NYU) - Department of Information, Operations, and Management Sciences

44 West Fourth Street
New York, NY 10012
United States
