Big Data's Dirty Secret
26 Pages Posted: 11 Jul 2018
Date Written: June 29, 2018
Amid the avalanche of articles on big data and machine learning, the phrase "after cleaning the data" appears often. Here we focus on the work hidden behind this phrase. We analyze the types of dirty data found in financial time series, the problems that dirty data causes, and the performance of data cleaning algorithms. We also extend the multichannel singular spectrum analysis (MSSA) hole-filling algorithm of Kondrashov and Ghil to improve its performance on credit default swap (CDS) spread data, and we combine it with clustering techniques from data science to detect bad data.
Keywords: data cleaning, big data, machine learning, SSA, MSSA, PCA, data science, outlier detection, anomaly detection
JEL Classification: G32, C10, C45, C55, C51, G12