Heterogeneous Narrative Content in Annual Reports Published as PDF Files: Extraction, Classification and Incremental Predictive Ability

51 Pages Posted: 1 Jul 2016 Last revised: 4 Jul 2016

Paulo Alves

Católica Porto Business School, Portugal; Lancaster University - ICRA

Mahmoud El-Haj

Lancaster University - School of Computing and Communications; Lancaster University

Paul Rayson

Lancaster University

Martin Walker

University of Manchester - Manchester Business School

Steven Young

Lancaster University - Department of Accounting and Finance

Date Written: July 1, 2016

Abstract

We develop, describe and evaluate a web-based software tool for batch extraction and analysis of digital PDF annual report files. The retrieval method retains information on document structure, thereby enabling clear delineation between the narrative and financial statement components of reports, and between individual sections within the narrative component. Retrieval accuracy exceeds 95% in manual validations and large-sample tests confirm that extracted content varies predictably with economic and regulatory factors. We apply the tool to a comprehensive sample of reports published by U.K. non-financial firms between 2003 and 2014, and examine the incremental predictive power for future earnings of different performance sections from the same report. While performance-related commentaries prepared by management and the independent board chair are individually predictive for future earnings, only chairman-authored content is incrementally informative when considered jointly. Further, management-authored content has lower independent predictive ability when insiders are more optimistic than the board chair. Results support the view that the predictive power of narratives varies with authors’ reporting incentives and that exaggerated optimism in management commentary reflects obfuscation.

Keywords: Annual Reports, Textual Analysis, Text Extraction, Predictive Ability

JEL Classification: M40, M41

Suggested Citation

Alves, Paulo and El-Haj, Mahmoud and Rayson, Paul and Walker, Martin and Young, Steven, Heterogeneous Narrative Content in Annual Reports Published as PDF Files: Extraction, Classification and Incremental Predictive Ability (July 1, 2016). Available at SSRN: https://ssrn.com/abstract=2803275 or http://dx.doi.org/10.2139/ssrn.2803275

Paulo Alves

Católica Porto Business School, Portugal

R. de Diogo Botelho, 1327
Porto, Porto 4169-005 P
Portugal

Lancaster University - ICRA

Lancaster, Lancashire LA1 4YX
United Kingdom

Mahmoud El-Haj

Lancaster University - School of Computing and Communications

InfoLab21
Bailrigg
Lancaster, LA1 4WA
United Kingdom
+44 1524-510348 (Phone)

HOME PAGE: http://www.lancaster.ac.uk/staff/elhaj/

Lancaster University

InfoLab21, South Drive
Lancaster University
Lancaster, LA1 4WA
United Kingdom
1524510348 (Phone)

HOME PAGE: http://www.lancaster.ac.uk/staff/elhaj

Paul Rayson

Lancaster University

School of Computing and Communications
Lancaster LA1 4YX
United Kingdom

Martin Walker

University of Manchester - Manchester Business School

Booth Street West
Manchester, M15 6PB
United Kingdom

Steven Young (Contact Author)

Lancaster University - Department of Accounting and Finance

The Management School
Lancaster LA1 4YX
United Kingdom
+44 1524-594242 (Phone)
+44 1524-847321 (Fax)
