Detecting Financial Misconduct Using NLP and Machine Learning: Evidence from Japan
34 Pages. Posted: 16 Apr 2025. Last revised: 17 Apr 2025
Date Written: August 25, 2024
Abstract
This study develops a novel model for detecting financial fraud from textual data in the annual securities reports of Japanese listed companies from 2010 to 2019. The analysis covers Management's Discussion and Analysis (MD&A) as well as broader textual disclosures, including corporate policies and strategies, risk factors, and governance practices. Natural language processing (NLP) techniques were used to create a series of linguistic variables. These variables, together with financial data, were used to build a Weighted Random Forest (WRF) model that achieved an AUC of 0.907. Fraudulent companies identified in this study tend to show: (1) a negative tone, greater complexity, and fewer ratio-based expressions in the MD&A section; (2) a positive tone and frequent references to third parties in risk information; and (3) higher readability but fewer named entities in governance disclosures. Overall, the results indicate that textual data offers an effective approach to predicting financial fraud and could contribute to corporate fraud prevention.
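The two quantitative ideas in the abstract, class weighting for a rare fraud label and the AUC metric, can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the weighting scheme, so the inverse-frequency rule below is an assumption, and the labels and scores are made-up toy values. The function names `balanced_class_weights` and `roc_auc` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * n_c).

    Assumption: an inverse-frequency rule; the paper's actual
    weighting scheme is not given in the abstract.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * n_c) for c, n_c in counts.items()}


def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive case
    receives a higher score than a randomly chosen negative case
    (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0
        for p in pos
        for q in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: 1 = fraud (rare), 0 = non-fraud. Scores stand in for the
# fraud probabilities a fitted classifier might output.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.15, 0.30, 0.60, 0.25, 0.10, 0.70, 0.40]

weights = balanced_class_weights(labels)  # {0: 0.625, 1: 2.5}
auc = roc_auc(labels, scores)             # 15 of 16 pairs ranked correctly

print(weights)
print(auc)  # 0.9375
```

In this toy case the rare fraud class gets four times the weight of the majority class, and an AUC of 0.9375 means that 15 of the 16 fraud/non-fraud pairs are ranked in the right order. The reported AUC of 0.907 has the same pairwise-ranking interpretation.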
Keywords: Financial fraud, natural language processing, machine learning, annual securities reports, MD&A, corporate governance, risk analysis