Semantic Parsing: Semantic Units Extraction and Boosting Algorithms for Search

Posted: 30 Jan 2020


Lloyd Zhang

LexisNexis

Tingting Ma

LexisNexis

Samuel Wang

LexisNexis

Murphy Shao

LexisNexis

Wulong Wang

LexisNexis

Willis Wu

LexisNexis

Hongliang Ge

LexisNexis

Date Written: January 29, 2020

Abstract

This presentation is about semantic parsing on Lexis Advance, a global legal research solution for attorneys at law firms and corporations, and for all legal practitioners who want to search for and locate accurate information as efficiently as possible. Customers are accustomed to searching for actual things, not just matching strings of text. Before we launched a semantic parsing solution in Lexis Advance, queries were not segmented into legal-domain semantic units, which resulted in low parsing accuracy, irrelevant results, and negative customer impact. In this quality phrase extraction initiative, an entire extraction and validation system was built to provide a massive number of high-accuracy legal-domain semantic units. To improve search relevance, semantic parsing is combined with quality phrase extraction, search engine boosting algorithms, and a simulation system. The entire system applies multiple techniques over millions of documents, including entropy, mutual information, a supervised classifier, and phrase boosting algorithms. Phrase candidates are first extracted via unsupervised techniques: entropy and mutual information. A machine learning classifier based on the LightGBM framework was then built to further validate the quality of the phrases. The AutoPhrase technique was applied to phrases of three or more words. Finally, term-proximity scoring against the generated phrases is applied to boost the documents returned by the search engine. The simulation system validates the phrases' impact on search relevance. As a result, 960k phrases were extracted from LexisNexis AU cases and queries with over 90% sampling accuracy. With semantic parsing, a 6.7% hDCG(5) improvement on all queries and a 10.76% hDCG(5) improvement on impacted queries were achieved.
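The unsupervised candidate-extraction step can be illustrated with a minimal sketch. The abstract does not give the authors' formulas, thresholds, or tokenization, so everything here is an assumption: the function name score_bigrams is hypothetical, mutual information is taken as pointwise mutual information (PMI) over adjacent word pairs, and entropy is taken as the entropy of the words seen immediately to the left and right of a candidate. A candidate with high PMI and high boundary entropy tends to behave like a self-contained phrase.

```python
import math
from collections import Counter, defaultdict


def score_bigrams(sentences):
    """Score every adjacent word pair in a list of tokenized sentences.

    Returns {bigram: (pmi, left_entropy, right_entropy)}.
    Illustrative only: not the authors' implementation.
    """
    unigrams = Counter()
    bigrams = Counter()
    left_ctx = defaultdict(Counter)   # words seen just before each bigram
    right_ctx = defaultdict(Counter)  # words seen just after each bigram

    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            bg = (tokens[i], tokens[i + 1])
            bigrams[bg] += 1
            # Sentence boundaries are treated as special context tokens.
            left = tokens[i - 1] if i > 0 else "<s>"
            right = tokens[i + 2] if i + 2 < len(tokens) else "</s>"
            left_ctx[bg][left] += 1
            right_ctx[bg][right] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    def entropy(counter):
        # Shannon entropy (bits) of the context distribution.
        total = sum(counter.values())
        return -sum((c / total) * math.log2(c / total) for c in counter.values()) + 0.0

    scores = {}
    for bg, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[bg] = (pmi, entropy(left_ctx[bg]), entropy(right_ctx[bg]))
    return scores


if __name__ == "__main__":
    corpus = [
        ["breach", "of", "contract", "claim"],
        ["breach", "of", "contract", "damages"],
        ["the", "contract", "was", "void"],
    ]
    for bigram, (pmi, h_left, h_right) in score_bigrams(corpus).items():
        print(bigram, round(pmi, 3), round(h_left, 3), round(h_right, 3))
```

In a pipeline like the one described, scores of this kind would presumably serve as filters or as features for the downstream classifier; the abstract does not say which.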

Keywords: Semantic Parsing, Entropy, Mutual Information, AutoPhrase, Search Relevance, Simulation Program, Discounted Cumulative Gain
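For readers unfamiliar with the relevance metric, the following sketch shows how a discounted cumulative gain score at rank 5 and a relative improvement might be computed. The abstract does not define hDCG(5), so this uses the standard DCG formulation with a log2 rank discount as a stand-in; the function names and the graded relevance labels are hypothetical.

```python
import math


def dcg_at_k(relevances, k=5):
    """Standard DCG: sum of rel_i / log2(i + 1) over the top k ranks (1-based)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def relative_improvement(baseline, treated):
    """Fractional change of the treated score over the baseline score."""
    return (treated - baseline) / baseline


if __name__ == "__main__":
    # Hypothetical graded relevance labels for one query's top results.
    before = [1, 0, 2, 0, 1]
    after = [2, 1, 1, 0, 0]
    gain = relative_improvement(dcg_at_k(before), dcg_at_k(after))
    print(f"relative DCG@5 change: {gain:.2%}")
```

A simulation system of the kind described would presumably average such scores over a query set, once with and once without phrase boosting, and report the difference.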

Suggested Citation

Zhang, Lloyd and Ma, Tingting and Wang, Samuel and Shao, Murphy and Wang, Wulong and Wu, Willis and Ge, Hongliang, Semantic Parsing: Semantic Units Extraction and Boosting Algorithms for Search (January 29, 2020). Proceedings of the 3rd Annual RELX Search Summit, Available at SSRN: https://ssrn.com/abstract=3527365

Lloyd Zhang (Contact Author)

LexisNexis

P. O. Box 933
Dayton, OH 45401
United States

