Tree Induction Vs. Logistic Regression: a Learning-Curve Analysis
51 Pages. Posted: 13 Oct 2008
Date Written: December 2001
Tree induction and logistic regression are two standard, off-the-shelf methods for building models for classification. We present a large-scale experimental comparison of logistic regression and tree induction, assessing classification accuracy and the quality of rankings based on class-membership probabilities. We use a learning-curve analysis to examine the relationship of these measures to the size of the training set. The results of the study show several remarkable things. (1) Contrary to prior observations, logistic regression does not generally outperform tree induction. (2) More specifically, and not surprisingly, logistic regression is better for smaller training sets and tree induction for larger data sets. Importantly, this often holds for training sets drawn from the same domain (i.e., the learning curves cross), so conclusions about induction-algorithm superiority on a given domain must be based on an analysis of the learning curves. (3) Contrary to conventional wisdom, tree induction is effective at producing probability-based rankings, although apparently comparatively less so for a given training-set size than at making classifications. Finally, (4) the domains on which tree induction and logistic regression are ultimately preferable can be characterized surprisingly well by a simple measure of signal-to-noise ratio.
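As a rough illustration of the kind of learning-curve comparison the abstract describes, the following sketch trains both model families on increasingly large samples and scores each on a fixed holdout set, using accuracy for classification and AUC for ranking quality. It is not the authors' experimental setup: scikit-learn's CART-style `DecisionTreeClassifier` stands in for whatever tree learner the paper used, and the function name `learning_curves`, the synthetic data, the sample sizes, and all hyperparameters are assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def learning_curves(X, y, train_sizes, n_repeats=5, seed=0):
    """Return {model_name: {train_size: (mean_accuracy, mean_auc)}}.

    Hypothetical helper: every model is scored on the same holdout set,
    and each training size is averaged over n_repeats stratified samples.
    """
    # Hold out a fixed test set so all points on the curves are comparable.
    X_pool, X_test, y_pool, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed
    )
    # Assumed stand-ins for the two model families being compared.
    make_model = {
        "logistic_regression": lambda: LogisticRegression(max_iter=1000),
        "tree": lambda: DecisionTreeClassifier(min_samples_leaf=5, random_state=seed),
    }
    results = {name: {} for name in make_model}
    for size in train_sizes:
        scores = {name: [] for name in make_model}
        for rep in range(n_repeats):
            # Stratified sample of `size` training examples from the pool;
            # `size` must be smaller than the pool.
            X_train, _, y_train, _ = train_test_split(
                X_pool, y_pool, train_size=size, stratify=y_pool,
                random_state=seed + rep,
            )
            for name, factory in make_model.items():
                model = factory().fit(X_train, y_train)
                acc = accuracy_score(y_test, model.predict(X_test))
                # AUC measures how well class-membership probabilities rank examples.
                auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
                scores[name].append((acc, auc))
        for name in make_model:
            results[name][size] = tuple(np.mean(scores[name], axis=0))
    return results


if __name__ == "__main__":
    # Synthetic binary problem, used only so the sketch is self-contained.
    X, y = make_classification(
        n_samples=4000, n_features=20, n_informative=5, random_state=0
    )
    curves = learning_curves(X, y, train_sizes=[50, 200, 1000, 2500])
    for name, curve in curves.items():
        for size, (acc, auc) in sorted(curve.items()):
            print(f"{name:20s} n={size:5d} accuracy={acc:.3f} auc={auc:.3f}")
```

Plotting each model's accuracy and AUC against training-set size shows whether the two curves cross on a given data set. A single synthetic problem says nothing about which method is preferable in general.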