50 Pages. Posted: 18 Jun 2012. Last revised: 11 Jul 2013
Date Written: July 10, 2013
When managers and researchers encounter a dataset, they typically ask two key questions: (1) which model (from a candidate set) should I use? and (2) if I use a particular model, when is it likely to work well for my business goal? This research addresses those two questions and provides a rule, in the form of a decision tree, that lets data analysts predict the "winning model" for longitudinal incidence data before fitting any of the candidates. We characterize datasets based on managerially relevant (and easy-to-compute) summary statistics, and we use classification techniques from machine learning to provide a decision tree that recommends when to use which model. By doing the "legwork" of obtaining this decision tree for model selection, we provide a time-saving tool to analysts. We illustrate this method for a common marketing problem (i.e., forecasting repeat purchasing incidence for a cohort of new customers) and demonstrate the method's ability to discriminate among an integrated family of models comprising a hidden Markov model (HMM) and its constrained variants. We find that dataset characteristics are a strong guide to the choice of the most appropriate model, and we observe that some model features (e.g., the "back-and-forth" migration between latent states) are more important to accommodate than others (e.g., the inclusion of an "off" state with no activity). We also demonstrate the method's broad potential by providing a general "recipe" for researchers to replicate this kind of model classification task in other managerial contexts (outside of repeat purchasing incidence data and the HMM framework).
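To make the idea concrete, the following is a minimal sketch of a classification tree that maps dataset-level summary statistics to a recommended model. It is not the authors' implementation: the abstract does not say which tree algorithm or which statistics were used, so the Gini-based splitting rule, the two feature names (`repeat_rate`, `trend`), the labels (`"full_hmm"`, `"constrained"`), and every number in the toy training set are assumptions made up purely for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return the (feature_index, threshold) that most reduces weighted
    Gini impurity, or None if no split improves on the parent node."""
    best, best_score = None, gini(labels)
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        # Candidate thresholds are midpoints between adjacent distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a tree; internal nodes are tuples, leaves are label strings."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return (
        j,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    )

def predict(tree, row):
    """Walk from the root to a leaf and return the recommended model label."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree

# Hypothetical training set: one row per dataset, columns are the
# made-up summary statistics [repeat_rate, trend]; the label is the
# model assumed to have "won" on that dataset.
rows = [
    [0.10, 0.5], [0.15, 0.9], [0.20, 0.2],
    [0.60, 0.4], [0.70, 0.8], [0.80, 0.1],
]
labels = ["constrained"] * 3 + ["full_hmm"] * 3

tree = build_tree(rows, labels)
recommendation = predict(tree, [0.75, 0.3])  # summary statistics of a new dataset
```

In this toy setup the tree only needs the summary statistics of a new dataset to produce a recommendation, so none of the candidate models has to be fit first. In practice the labels would come from fitting all candidates on many datasets once and recording which performed best.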
Keywords: data science, business intelligence, model selection, machine learning, classification tree, posterior predictive model checking, hidden Markov models, hierarchical Bayesian methods, random forests, forecasting
JEL Classification: C11, C15, C22, C23, C51, C52, C53, M31
Suggested Citation:
Schwartz, Eric M. and Bradlow, Eric and Fader, Peter, Model Selection Using Database Characteristics: Developing a Classification Tree for Longitudinal Incidence Data (July 10, 2013). Available at SSRN: https://ssrn.com/abstract=2085767 or http://dx.doi.org/10.2139/ssrn.2085767