Retrieving, Classifying and Analysing Narrative Commentary in Unstructured (Glossy) Annual Reports Published as PDF Files
79 Pages. Posted: 1 Jul 2016. Last revised: 23 Mar 2018
Date Written: October 1, 2017
We develop, describe and evaluate a method for automatically retrieving and analysing text from digital PDF annual report files published by firms listed on the London Stock Exchange (LSE). The retrieval method retains information on document structure, enabling clear delineation between narrative and financial statement components of reports, and between individual sections within the narratives component. Retrieval accuracy exceeds 95% for manual validations using a random sample of 586 reports. Large-sample statistical validations using a comprehensive sample of reports published by non-financial LSE firms confirm that report length, narrative tone and (to a lesser degree) readability vary predictably with economic and regulatory factors. We demonstrate how the method is adaptable to non-English language documents and different regulatory regimes using a case study of Portuguese reports. We use the procedure to construct new research resources including a dataset of document properties and structure for over 19,500 U.K. annual reports, and various corpora comprising over 137 million words in aggregate.
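As a rough illustration of the document properties named in the abstract (report length, narrative tone and readability), the following minimal sketch computes one plausible version of each from a plain-text string. The abstract does not specify which measures the authors use, so everything here is an assumption: the function names, the tiny `POSITIVE` and `NEGATIVE` word lists, the net-tone ratio, the vowel-group syllable heuristic, and the choice of the Gunning Fog index as the readability score are illustrative only, not the authors' procedure.

```python
import re

# Hypothetical, deliberately tiny word lists for illustration only.
# A real study would use an established sentiment dictionary.
POSITIVE = {"growth", "strong", "improved", "success"}
NEGATIVE = {"loss", "decline", "weak", "impairment"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_syllables(word):
    """Crude syllable estimate: the number of vowel groups, at least 1."""
    return max(1, len(re.findall(r"[aeiouy]+", word)))


def text_properties(text):
    """Return word count, net tone and an approximate Fog index."""
    words = tokenize(text)
    # Treat runs of ., ! or ? as sentence boundaries (an approximation).
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)

    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    # Assumed tone measure: (positive - negative) / (positive + negative).
    tone = (pos - neg) / (pos + neg) if (pos + neg) else 0.0

    # Gunning Fog: 0.4 * (words per sentence + percentage of complex words),
    # where "complex" here means three or more estimated syllables.
    complex_words = sum(count_syllables(w) >= 3 for w in words)
    if n_words and sentences:
        fog = 0.4 * (n_words / len(sentences) + 100 * complex_words / n_words)
    else:
        fog = 0.0

    return {"length": n_words, "tone": tone, "fog": fog}


if __name__ == "__main__":
    sample = "Revenue growth was strong. The decline was small."
    print(text_properties(sample))
```

In practice such measures would be computed per report section once the narrative text has been retrieved, which is why retaining document structure during retrieval matters for the downstream analysis.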
Keywords: Annual Reports, Textual Analysis, Text Retrieval, Document Structure, Corpus Linguistics
JEL Classification: M40, M41