Using Python for Text Analysis in Accounting Research

Vic Anand, Khrystyna Bochkay, Roman Chychyla and Andrew Leone (2020), "Using Python for Text Analysis in Accounting Research", Foundations and Trends® in Accounting: Vol. 14: No. 3–4, pp 128-359. http://dx.doi.org/10.1561/1400000062

University of Miami Business School Research Paper No. 3576098

223 Pages. Posted: 7 Jun 2020. Last revised: 5 Dec 2020.

Vic Anand

University of Illinois at Urbana-Champaign - Department of Accountancy

Khrystyna Bochkay

University of Miami - School of Business Administration

Roman Chychyla

University of Miami - School of Business Administration

Andrew J. Leone

Northwestern University; University of Miami

Date Written: September 23, 2020

Abstract

The prominence of textual data in accounting research has increased dramatically. To assist researchers in understanding and using textual data, this monograph defines and describes common measures of textual data and then demonstrates the collection and processing of textual data using the Python programming language. The monograph is replete with sample code that replicates textual analysis tasks from recent research papers.

In the first part of the monograph, we provide guidance on getting started in Python. We first describe Anaconda, a distribution of Python that provides the requisite libraries for textual analysis, and its installation. We then introduce the Jupyter notebook, a programming environment that improves research workflows and promotes replicable research. Next, we teach the basics of Python programming and demonstrate the basics of working with tabular data in the Pandas package.

The second part of the monograph focuses on specific textual analysis methods and techniques commonly used in accounting research. We first introduce regular expressions, a sophisticated language for finding patterns in text. We then show how to use regular expressions to extract specific parts from text. Next, we introduce the idea of transforming text data (unstructured data) into numerical measures representing variables of interest (structured data). Specifically, we introduce dictionary-based methods of 1) measuring document sentiment, 2) computing text complexity, 3) identifying forward-looking sentences and risk disclosures, 4) collecting informative numbers in text, and 5) computing the similarity of different pieces of text. For each of these tasks, we cite relevant papers and provide code snippets to implement the relevant metrics from these papers.
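As an illustration of the kind of measures described above, the following minimal sketch tokenizes text with a regular expression, computes a dictionary-based tone score, and computes the cosine similarity of two texts from raw word counts. It is not the monograph's code: the word lists, the token pattern, and the function names `tokenize`, `tone`, and `cosine_similarity` are assumptions made for this example, and published studies rely on established dictionaries rather than toy lists.

```python
import math
import re
from collections import Counter

# Hypothetical toy word lists, for illustration only.
# Published research uses established sentiment dictionaries instead.
POSITIVE = {"growth", "improved", "strong"}
NEGATIVE = {"loss", "decline", "impairment"}

# Assumed token pattern: runs of letters, optionally with one apostrophe part.
TOKEN_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return TOKEN_RE.findall(text.lower())


def tone(text):
    """Return (positive count - negative count) / total word count."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)


def cosine_similarity(text_a, text_b):
    """Return the cosine similarity of two texts using raw term counts."""
    a = Counter(tokenize(text_a))
    b = Counter(tokenize(text_b))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(tone("Strong growth offset the loss"))
    print(cosine_similarity("revenue improved", "revenue declined"))
```

Scaling the tone score by document length keeps long and short documents comparable. The similarity function uses unweighted counts for simplicity; a weighting scheme such as TF-IDF could be substituted without changing the overall structure.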

Finally, the third part of the monograph focuses on automating the collection of textual data. We introduce web scraping and provide code for downloading filings from EDGAR.
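To make the data-collection step concrete, here is a minimal sketch of selecting filings of one form type from an EDGAR quarterly index. It is not the monograph's code. It assumes the index is the pipe-delimited `master.idx` file with the columns CIK, Company Name, Form Type, Date Filed, and Filename, and that filing paths are relative to `https://www.sec.gov/Archives/`. The helper names `index_url` and `parse_master_index` and the sample rows are hypothetical. The sketch parses a small in-memory sample rather than downloading anything; an actual download would also need to follow the SEC's access rules, such as sending an identifying User-Agent header.

```python
# Assumed base URL for EDGAR archive paths.
EDGAR_ARCHIVES = "https://www.sec.gov/Archives/"


def index_url(year, quarter):
    """Build the URL of a quarterly master index (assumed path layout)."""
    return f"{EDGAR_ARCHIVES}edgar/full-index/{year}/QTR{quarter}/master.idx"


def parse_master_index(text, form_type="10-K"):
    """Return the filings of the requested form type from index text."""
    filings = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Skip preamble, separator lines, and the column header row.
        if len(parts) != 5 or parts[0] == "CIK":
            continue
        cik, company, form, date_filed, path = parts
        if form == form_type:
            filings.append({
                "cik": cik,
                "company": company,
                "date": date_filed,
                "url": EDGAR_ARCHIVES + path,
            })
    return filings


# Hypothetical sample rows, invented for illustration (not real filings).
SAMPLE_INDEX = """\
CIK|Company Name|Form Type|Date Filed|Filename
--------------------------------------------------------------------------------
1111111|EXAMPLE HOLDINGS INC|10-K|2020-02-14|edgar/data/1111111/0000000000-20-000001.txt
2222222|SAMPLE CORP|8-K|2020-02-18|edgar/data/2222222/0000000000-20-000002.txt
"""

if __name__ == "__main__":
    print(index_url(2020, 1))
    for filing in parse_master_index(SAMPLE_INDEX):
        print(filing["company"], filing["url"])
```

Separating URL construction and parsing from the download itself makes each piece easy to test offline before any requests are sent to the server.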

Keywords: Text analysis, data collection, Python, natural language processing

JEL Classification: B4, C8, M41

Suggested Citation

Anand, Vic and Bochkay, Khrystyna and Chychyla, Roman and Leone, Andrew J., Using Python for Text Analysis in Accounting Research (September 23, 2020). Vic Anand, Khrystyna Bochkay, Roman Chychyla and Andrew Leone (2020), "Using Python for Text Analysis in Accounting Research", Foundations and Trends® in Accounting: Vol. 14: No. 3–4, pp 128-359. http://dx.doi.org/10.1561/1400000062, University of Miami Business School Research Paper No. 3576098, Available at SSRN: https://ssrn.com/abstract=3576098

Vic Anand

University of Illinois at Urbana-Champaign - Department of Accountancy

1206 South Sixth Street
Champaign, IL 61820
United States

Khrystyna Bochkay

University of Miami - School of Business Administration

Coral Gables, FL 33146-6531
United States

Roman Chychyla

University of Miami - School of Business Administration

Coral Gables, FL 33146-6531
United States
305-284-2324 (Phone)

Andrew J. Leone (Contact Author)

Northwestern University

2001 Sheridan Road
Evanston, IL 60208
United States

University of Miami

School of Business
Coral Gables, FL 33146
United States
305-284-3101 (Phone)

HOME PAGE: http://sbaleone.bus.miami.edu
