Semantic Parsing: Semantic Units Extraction and Boosting Algorithms for Search
Posted: 30 Jan 2020
Date Written: January 29, 2020
This presentation is about semantic parsing on Lexis Advance, a global legal research solution for attorneys at law firms and corporations, and for all legal practitioners who want to search for and locate accurate information as efficiently as possible. Customers are accustomed to searching for actual things, not just matching strings of text. Before we launched a semantic parsing solution in Lexis Advance, queries were not segmented into legal-domain semantic units, which resulted in low parsing accuracy, irrelevant results, and negative customer impact. In this quality phrase extraction initiative, an entire extraction and validation system has been built to provide a large volume of high-accuracy legal-domain semantic units. To improve search relevance, semantic parsing is combined with quality phrase extraction, search engine boosting algorithms, and a simulation system. The entire system applies multiple machine learning techniques over millions of documents, including entropy, mutual information, a supervised classifier, and phrase boosting algorithms. Phrase candidates are first extracted via unsupervised techniques: entropy and mutual information. A machine learning classifier based on the LightGBM framework was then built to further validate the quality of the phrases. The AutoPhrase technique was applied to phrases of three or more words. Finally, term-proximity scoring against the generated phrases is applied to boost the documents returned by the search engine. The simulation system validates the phrases' impact on search relevance. As a result, 960k phrases were extracted from LexisNexis AU cases and queries, with over 90% accuracy on sampled phrases. With semantic parsing, a 6.7% hDCG(5) improvement on all queries and a 10.76% hDCG(5) improvement on impacted queries have been achieved.
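The abstract names entropy and mutual information as the unsupervised signals for candidate extraction but does not give formulas. The following is a minimal sketch of one common formulation, assuming pointwise mutual information (PMI) between the two words of a bigram and the entropy of the words that appear immediately to its left and right. The function names `score_bigrams` and `entropy`, the `min_count` parameter, and the restriction to two-word candidates are illustrative assumptions, not details taken from the Lexis Advance system.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def score_bigrams(sentences, min_count=2):
    """Score adjacent word pairs as phrase candidates.

    `sentences` is an iterable of token lists. Returns a dict mapping each
    bigram seen at least `min_count` times to a tuple
    (pmi, left_entropy, right_entropy). All names here are hypothetical.
    """
    unigrams = Counter()
    bigrams = Counter()
    left_ctx = defaultdict(Counter)   # words seen just before the bigram
    right_ctx = defaultdict(Counter)  # words seen just after the bigram

    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            bg = (tokens[i], tokens[i + 1])
            bigrams[bg] += 1
            if i > 0:
                left_ctx[bg][tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right_ctx[bg][tokens[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for bg, count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        # PMI: how much more often the pair co-occurs than chance predicts.
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[bg] = (pmi, entropy(left_ctx[bg]), entropy(right_ctx[bg]))
    return scores
```

In a formulation like this, a bigram with high PMI and high entropy on both sides is a plausible phrase candidate: its words co-occur more than chance, and it appears in varied contexts rather than as a fragment of a longer fixed expression. Any thresholds would need tuning on real data, and in the pipeline described above such candidates are still passed to a supervised classifier for validation.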
Keywords: Semantic Parsing, Entropy, Mutual Information, AutoPhrase, Search Relevance, Simulation Program, Discounted Cumulative Gain