Cassandra's Twin: What Does the Data Predict?

31 Pages. Posted: 7 Sep 2016

Date Written: September 5, 2016


This paper examines what types of insight can be extracted from a given dataset. The analysis is intended to encourage those collecting data to engineer their collection processes with data usability in mind, and to caution researchers against trying to squeeze more information out of data than the data actually warrants. The ideas are illustrated through a concrete example: using the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) dataset to identify risk factors associated with hypertension. A complete end-to-end predictive modeling process is described in detail, from data conversion and pre-processing to model building and interpretation of results. Python code to process the data and Matlab code to build, train, and evaluate the models are included in full, and both are sufficiently general to be used for analyses beyond hypertension.

Keywords: BRFSS, machine learning, predictive model, logistic regression, decision tree, hypertension, data survey

JEL Classification: D83, C42, C80

Suggested Citation

Boier, Ioana, Cassandra's Twin: What Does the Data Predict? (September 5, 2016). Available at SSRN.
