Cassandra's Twin: What Does the Data Predict?
31 Pages
Posted: 7 Sep 2016
Date Written: September 5, 2016
Abstract
This paper examines what types of insight can be extracted from a given dataset. The analysis is intended to encourage those collecting data to engineer their collection processes with data usability in mind, and to caution researchers against trying to squeeze more information out of a dataset than it actually warrants. The ideas are illustrated through a concrete example: using the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) dataset to identify risk factors associated with hypertension. A complete end-to-end predictive modeling process is described in detail, from data conversion and pre-processing to model building and interpretation of results. Python code to process the data and Matlab code to build, train, and evaluate the models are included in full and are sufficiently general to be used for analyses beyond hypertension.
Keywords: BRFSS, machine learning, predictive model, logistic regression, decision tree, hypertension, data survey
JEL Classification: D83, C42, C80