Analyzing the Competition: Abstractive Summarization at Thomson Reuters Labs

Posted: 28 Jan 2021

Date Written: November 4, 2020

Abstract

The manual task of writing a summary for a 30- to 100-page court document can be compared against a computer-generated summary to learn how often each is accepted for publication. Acceptance takes both accuracy and grammar into account. The initial acceptance rates were 74% for the computer approach and 88% for the manual human approach.

The first computer approach started with 100M annotated documents, and the watershed moment came when the team adopted OpenNMT. Initially, a human manually highlighted the sentences (text) to be summarized. This step alone reduced the time per document from 30 minutes to 3 minutes.

Next, they combined TF-IDF weights with word embeddings to get a weighted embedding for each sentence of the court documents. They chose the highest-scoring sentences (the distinguishing ones) using this weighted bag-of-words (BOW) model. Finally, they introduced a language-quality score by leveraging BERTScore.
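
The sentence-selection step can be illustrated with a minimal sketch. The abstract does not say how the sentences were scored, so the rule used here (cosine similarity between each TF-IDF-weighted sentence embedding and the document centroid) is an assumption, as are the smoothed IDF formula, the toy two-dimensional word vectors, and every function name.

```python
import math
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase a sentence and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def tfidf_weights(sentences):
    """Return one {token: tf-idf} dict per sentence (each sentence acts as a document)."""
    tokenized = [tokenize(s) for s in sentences]
    n = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    weights = []
    for toks in tokenized:
        counts = Counter(toks)
        # Term frequency times a smoothed inverse document frequency.
        weights.append({
            tok: (c / len(toks)) * (math.log((1 + n) / (1 + doc_freq[tok])) + 1.0)
            for tok, c in counts.items()
        })
    return weights


def weighted_embedding(weights, vectors, dim):
    """TF-IDF-weighted average of the word vectors found in `vectors`."""
    total = [0.0] * dim
    weight_sum = 0.0
    for tok, w in weights.items():
        vec = vectors.get(tok)
        if vec is None:  # out-of-vocabulary tokens are skipped
            continue
        weight_sum += w
        for i in range(dim):
            total[i] += w * vec[i]
    return [x / weight_sum for x in total] if weight_sum else total


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_sentences(sentences, vectors, dim, k):
    """Pick the k sentences closest to the document centroid, kept in original order."""
    embeddings = [weighted_embedding(w, vectors, dim) for w in tfidf_weights(sentences)]
    centroid = [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]
    scores = [cosine(e, centroid) for e in embeddings]
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(ranked)]


# Toy 2-d vectors (hypothetical); a real system would load trained embeddings.
toy_vectors = {
    "court": [1.0, 0.0], "granted": [1.0, 0.0], "motion": [1.0, 0.0],
    "denied": [1.0, 0.0], "appeal": [1.0, 0.0],
    "lunch": [0.0, 1.0], "served": [0.0, 1.0], "noon": [0.0, 1.0],
}
sentences = [
    "The court granted the motion.",
    "The court denied the appeal.",
    "Lunch was served at noon.",
]
print(top_sentences(sentences, toy_vectors, dim=2, k=2))
```

With these toy vectors the two court sentences are selected and the off-topic one is dropped. A different scoring rule, such as ranking by total TF-IDF mass, would fit the description equally well.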

Lastly, they redirected the human reviewers toward the abstracts most likely to need review. This was accomplished with a straightforward binary classifier trained on whether past abstracts had or had not needed editing.
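
As a rough illustration of this review-prioritization step, the sketch below trains a small logistic-regression classifier on past edit outcomes and sorts new abstracts by their predicted probability of needing an edit. The abstract names neither the model nor its features, so logistic regression, the two features (a language score and a length ratio), the toy history, and all function names are assumptions made only for the example.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    dim = len(features[0])
    weights = [0.0] * dim
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


def edit_probability(model, x):
    """Predicted probability that an abstract with features x needs editing."""
    weights, bias = model
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def review_queue(model, candidates):
    """Sort (doc_id, features) pairs so the likeliest-to-need-editing come first."""
    return sorted(candidates, key=lambda c: edit_probability(model, c[1]), reverse=True)


# Hypothetical history: [language_score, length_ratio]; label 1 = needed editing.
history_x = [[0.95, 0.10], [0.90, 0.20], [0.88, 0.15],
             [0.60, 0.40], [0.55, 0.50], [0.50, 0.45]]
history_y = [0, 0, 0, 1, 1, 1]
model = train_logistic(history_x, history_y)

new_abstracts = [("doc-a", [0.93, 0.12]), ("doc-b", [0.52, 0.48])]
for doc_id, feats in review_queue(model, new_abstracts):
    print(doc_id, round(edit_probability(model, feats), 3))
```

Reviewers would then work from the top of the queue, so their time goes first to the abstracts the classifier considers riskiest.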

Suggested Citation

Corkum, Matt and Pal, Sujit, Analyzing the Competition: Abstractive Summarization at Thomson Reuters Labs (November 4, 2020). Proceedings of the 4th Annual RELX Search Summit, Available at SSRN: https://ssrn.com/abstract=3774847
