Retrieving, Classifying and Analysing Narrative Commentary in Unstructured (Glossy) Annual Reports Published as PDF Files

Accounting and Business Research, Forthcoming

50 Pages. Posted: 1 Jul 2016. Last revised: 9 Jul 2019

Mahmoud El-Haj

Lancaster University - School of Computing and Communications; Lancaster University

Paulo Alves

Católica Porto Business School, Portugal; Lancaster University - ICRA

Paul Rayson

Lancaster University

Martin Walker

University of Manchester - Manchester Business School

Steven Young

Lancaster University - Department of Accounting and Finance

Date Written: May 1, 2019

Abstract

We provide a methodological contribution by developing, describing and evaluating a method for automatically retrieving and analysing text from digital PDF annual report files published by firms listed on the London Stock Exchange (LSE). The retrieval method retains information on document structure, enabling clear delineation between narrative and financial statement components of reports, and between individual sections within the narratives component. Retrieval accuracy exceeds 95% for manual validations using a random sample of 586 reports. Large-sample statistical validations using a comprehensive sample of reports published by non-financial LSE firms confirm that report length, narrative tone and (to a lesser degree) readability vary predictably with economic and regulatory factors. We demonstrate how the method is adaptable to non-English language documents and different regulatory regimes using a case study of Portuguese reports. We use the procedure to construct new research resources including corpora for commonly occurring annual report sections and a dataset of text properties for over 26,000 U.K. annual reports.
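The three text properties named in the abstract (report length, narrative tone and readability) can be illustrated with a minimal sketch. It does not reproduce the authors' tool: the word lists, the net-tone formula, the vowel-group syllable heuristic and the choice of the Gunning Fog index as the readability measure are all assumptions made for illustration, and every function name is hypothetical.

```python
import re

# Hypothetical mini word lists, for illustration only (not the paper's lexicons).
POSITIVE = {"growth", "strong", "improved", "profit"}
NEGATIVE = {"loss", "decline", "weak", "impairment"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def split_sentences(text):
    """Split on runs of sentence-ending punctuation (a crude assumption)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def count_syllables(word):
    """Approximate syllables as the number of vowel groups, at least one."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def text_properties(text):
    """Return length (word count), net tone and Gunning Fog index for a section."""
    words = tokenize(text)
    sentences = split_sentences(text)
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Net tone in [-1, 1]; defined as 0 when no list words occur.
    tone = (pos - neg) / (pos + neg) if (pos + neg) else 0.0
    if words and sentences:
        complex_words = sum(count_syllables(w) >= 3 for w in words)
        fog = 0.4 * (len(words) / len(sentences)
                     + 100 * complex_words / len(words))
    else:
        fog = 0.0
    return {"length": len(words), "tone": tone, "fog": fog}


if __name__ == "__main__":
    sample = "Revenue growth was strong. The loss narrowed."
    print(text_properties(sample))
```

Measures of this kind only become meaningful once the narrative sections have been separated from the financial statements, which is the step the retrieval method described above is designed to provide.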

Keywords: Annual Reports, Textual Analysis, Text Retrieval, Document Structure, Corpus Linguistics

JEL Classification: M40, M41

Suggested Citation

El-Haj, Mahmoud and Alves, Paulo and Rayson, Paul and Walker, Martin and Young, Steven, Retrieving, Classifying and Analysing Narrative Commentary in Unstructured (Glossy) Annual Reports Published as PDF Files (May 1, 2019). Accounting and Business Research, Forthcoming. Available at SSRN: https://ssrn.com/abstract=2803275 or http://dx.doi.org/10.2139/ssrn.2803275

Mahmoud El-Haj

Lancaster University - School of Computing and Communications

InfoLab21
Bailrigg
Lancaster, LA1 4WA
United Kingdom
+44 1524-510348 (Phone)

HOME PAGE: http://www.lancaster.ac.uk/staff/elhaj/

Lancaster University

InfoLab21, South Drive
Lancaster University
Lancaster, LA1 4WA
United Kingdom
+44 1524-510348 (Phone)

HOME PAGE: http://www.lancaster.ac.uk/staff/elhaj

Paulo Alves

Católica Porto Business School, Portugal

R. de Diogo Botelho, 1327
Porto, Porto 4169-005 P
Portugal

Lancaster University - ICRA

Lancaster, Lancashire LA1 4YX
United Kingdom

Paul Rayson

Lancaster University

School of Computing and Communications
Lancaster LA1 4YX
United Kingdom

Martin Walker

University of Manchester - Manchester Business School

Booth Street West
Manchester, M15 6PB
United Kingdom

Steven Young (Contact Author)

Lancaster University - Department of Accounting and Finance

The Management School
Lancaster LA1 4YX
United Kingdom
+44 1524-594242 (Phone)
+44 1524-847321 (Fax)
