Retrieving, Classifying and Analysing Narrative Commentary in Unstructured (Glossy) Annual Reports Published as PDF Files

79 Pages. Posted: 1 Jul 2016. Last revised: 23 Mar 2018

Mahmoud El-Haj

Lancaster University - School of Computing and Communications; Lancaster University

Paulo Alves

Católica Porto Business School, Portugal; Lancaster University - ICRA

Paul Rayson

Lancaster University

Martin Walker

University of Manchester - Manchester Business School

Steven Young

Lancaster University - Department of Accounting and Finance

Date Written: October 1, 2017

Abstract

We develop, describe and evaluate a method for automatically retrieving and analysing text from digital PDF annual report files published by firms listed on the London Stock Exchange (LSE). The retrieval method retains information on document structure, enabling clear delineation between narrative and financial statement components of reports, and between individual sections within the narratives component. Retrieval accuracy exceeds 95% for manual validations using a random sample of 586 reports. Large-sample statistical validations using a comprehensive sample of reports published by non-financial LSE firms confirm that report length, narrative tone and (to a lesser degree) readability vary predictably with economic and regulatory factors. We demonstrate how the method is adaptable to non-English language documents and different regulatory regimes using a case study of Portuguese reports. We use the procedure to construct new research resources including a dataset of document properties and structure for over 19,500 U.K. annual reports, and various corpora comprising over 137 million words in aggregate.

Keywords: Annual Reports, Textual Analysis, Text Retrieval, Document Structure, Corpus Linguistics

JEL Classification: M40, M41

Suggested Citation

El-Haj, Mahmoud and Alves, Paulo and Rayson, Paul and Walker, Martin and Young, Steven, Retrieving, Classifying and Analysing Narrative Commentary in Unstructured (Glossy) Annual Reports Published as PDF Files (October 1, 2017). Available at SSRN: https://ssrn.com/abstract=2803275 or http://dx.doi.org/10.2139/ssrn.2803275

Mahmoud El-Haj

Lancaster University - School of Computing and Communications

InfoLab21
Bailrigg
Lancaster, LA1 4WA
United Kingdom
+44 1524-510348 (Phone)

HOME PAGE: http://www.lancaster.ac.uk/staff/elhaj/

Lancaster University

InfoLab21, South Drive
Lancaster University
Lancaster, LA1 4WA
United Kingdom
1524510348 (Phone)

HOME PAGE: http://www.lancaster.ac.uk/staff/elhaj

Paulo Alves

Católica Porto Business School, Portugal

R. de Diogo Botelho, 1327
Porto, Porto 4169-005 P
Portugal

Lancaster University - ICRA

Lancaster, Lancashire LA1 4YX
United Kingdom

Paul Rayson

Lancaster University

School of Computing and Communications
Lancaster LA1 4YX
United Kingdom

Martin Walker

University of Manchester - Manchester Business School

Booth Street West
Manchester, M15 6PB
United Kingdom

Steven Young (Contact Author)

Lancaster University - Department of Accounting and Finance

The Management School
Lancaster LA1 4YX
United Kingdom
+441 5245-94242 (Phone)
+441 5248-47321 (Fax)
