Imbalanced Data Issues in Machine Learning Classifiers: A Case Study

20 pages. Posted: 13 Jan 2023

Mingxing Gong

University of Delaware - Department of Finance

Date Written: March 21, 2022

Abstract

Machine learning classifiers are widely used in financial applications. Many of the underlying classification problems involve imbalanced data, which requires special care. In practice, many model developers and validators fail to account for the imbalance during model development and validation. Resampling is a common technique for addressing imbalanced data when building traditional logistic regression models. However, there has been no specific discussion of the resampling ratio used to rebalance the data, or of how imbalance affects different kinds of machine learning classifiers, especially the more advanced ones. This paper outlines the special characteristics of the classifiers, compares different methods for dealing with imbalanced data, and provides best practices in model development, evaluation and validation to avoid common pitfalls. The methods discussed apply to general machine learning classifiers in any application with imbalanced data. Through a case study in credit card fraud detection, this paper calls practitioners’ attention to the imbalanced data problems in that field, where class imbalance is often handled incorrectly and has received little theoretical discussion.
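As a rough illustration of why imbalance needs care, the sketch below uses synthetic labels (not data from the paper) to show that a classifier which never flags fraud still scores high accuracy, and how random undersampling changes the class ratio. The function names, the 1% fraud rate and the 4:1 resampling ratio are all hypothetical choices made for this example; the paper's own methods and ratios may differ.

```python
import random


def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the minority class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def undersample_majority(labels, ratio, seed=0):
    """Return sorted row indices that keep every minority row (label 1) and a
    random subset of majority rows (label 0), so that majority:minority is at
    most `ratio`. The ratio is an illustrative parameter, not a recommendation.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep = min(len(majority), int(ratio * len(minority)))
    return sorted(minority + rng.sample(majority, keep))


# Synthetic labels: 10 fraud cases among 1,000 transactions (assumed 1% rate).
labels = [1] * 10 + [0] * 990

# A trivial "classifier" that predicts every transaction is legitimate.
always_legit = [0] * len(labels)
tp, fp, fn, tn = confusion_counts(labels, always_legit)
accuracy = (tp + tn) / len(labels)  # high, despite catching no fraud
recall = tp / (tp + fn)             # share of fraud cases actually caught

# Rebalance to an assumed 4:1 majority-to-minority ratio.
kept = undersample_majority(labels, ratio=4)
resampled = [labels[i] for i in kept]
fraud_share = sum(resampled) / len(resampled)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
print(f"resampled size={len(resampled)} fraud share={fraud_share:.2f}")
```

The gap between accuracy and recall here is the reason measures other than accuracy matter for imbalanced problems; the resampling ratio itself is a tuning choice that this sketch does not try to optimize.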

Keywords: machine learning, imbalanced data, fraud risk, performance measures, cost sensitive learning

Suggested Citation

Gong, Mingxing, Imbalanced Data Issues in Machine Learning Classifiers: A Case Study (March 21, 2022). Journal of Operational Risk, Vol. 17, No. 4, 2022, Forthcoming, Available at SSRN: https://ssrn.com/abstract=4321693

Mingxing Gong (Contact Author)

University of Delaware - Department of Finance

Alfred Lerner College of Business and Economics
Newark, DE 19716
United States
