Big Data's Dirty Secret

26 Pages. Posted: 11 Jul 2018

Harvey J. Stein

Bloomberg L.P.; Columbia University - Department of Mathematics

Yan Zhang

Bloomberg LP

Date Written: June 29, 2018


Amid the avalanche of articles on big data and machine learning, the phrase "after cleaning the data" appears often. Here we focus on the work hidden behind this phrase. We analyze the types of dirty data found in financial time series, the problems dirty data causes, and the performance of data cleaning algorithms. We also extend the multichannel singular spectrum analysis (MSSA) hole-filling algorithm of Kondrashov and Ghil to improve its performance on credit default swap (CDS) spread data, and we combine it with clustering techniques from data science to detect bad data.
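
To make the idea of SSA-based hole filling concrete, here is a minimal sketch in Python. It is a simplified, single-channel illustration of iterative gap filling in the spirit of Kondrashov and Ghil, not the extended multichannel algorithm developed in the paper. The function names `ssa_reconstruct` and `ssa_fill`, and the parameters `window`, `rank`, `max_iter` and `tol`, are assumptions introduced for illustration only. Missing points are marked with NaN, initialized to zero after centering, and repeatedly replaced by a low-rank reconstruction until the filled values stop changing.

```python
import numpy as np


def ssa_reconstruct(x, window, rank):
    """Rank-`rank` SSA reconstruction of a 1-D series (hypothetical helper)."""
    n = len(x)
    k = n - window + 1
    # Trajectory (Hankel) matrix: column j holds x[j : j + window].
    traj = np.column_stack([x[j:j + window] for j in range(k)])
    u, s, vt = np.linalg.svd(traj, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]
    # Diagonal averaging maps the low-rank matrix back to a series.
    recon = np.zeros(n)
    counts = np.zeros(n)
    for j in range(k):
        recon[j:j + window] += approx[:, j]
        counts[j:j + window] += 1
    return recon / counts


def ssa_fill(series, window, rank, max_iter=200, tol=1e-8):
    """Fill NaN gaps by iterating SSA reconstruction (illustrative sketch)."""
    x = np.asarray(series, dtype=float).copy()
    missing = np.isnan(x)
    if not missing.any():
        return x
    # Center on the mean of the observed points, then start gaps at zero.
    mean = x[~missing].mean()
    x -= mean
    x[missing] = 0.0
    for _ in range(max_iter):
        recon = ssa_reconstruct(x, window, rank)
        change = np.max(np.abs(recon[missing] - x[missing]))
        # Only the missing points are updated; observed data stay fixed.
        x[missing] = recon[missing]
        if change < tol:
            break
    return x + mean


if __name__ == "__main__":
    t = np.arange(200)
    clean = np.sin(2 * np.pi * t / 25)
    gappy = clean.copy()
    gappy[60:65] = np.nan  # punch a five-point hole
    filled = ssa_fill(gappy, window=50, rank=2)
    print(np.max(np.abs(filled - clean)))
```

In this sketch the window length and the number of retained components are fixed by hand. A practical implementation would need a principled way to choose them, and real CDS spread data are far less regular than the sine wave used here.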

Keywords: Data cleaning, big data, machine learning, SSA, MSSA, PCA, Data science, outlier detection, anomaly detection

JEL Classification: G32, C10, C45, C55, C51, G12

Suggested Citation

Stein, Harvey J. and Zhang, Yan, Big Data's Dirty Secret (June 29, 2018).

Harvey J. Stein (Contact Author)

Bloomberg L.P.

731 Lexington Avenue
New York, NY 10022
United States
212 617 3059 (Phone)

Columbia University - Department of Mathematics

New York, NY
United States

Yan Zhang

Bloomberg LP

731 Lexington Ave
New York, NY 10022
United States
