Token Classification Tasks Model Comparison and Performance Improvement Strategy with Natural Language Query

Posted: 16 Feb 2023

Date Written: September 21, 2023

Abstract

Automatically identifying the key entities in a user’s query and attaching those concepts to the appropriate search fields saves the user time when composing queries and supports future suggestion of entity synonyms. Since the release of the BERT family of models, fine-tuning pretrained language models has been highly successful. Our solution for extracting key entities from user queries and converting them to syntax queries leverages token classification: the keywords in a natural language query (NLQ) are assigned to classes and then translated into a syntax query. User queries have two distinguishing features: (1) the grammar may be incorrect, and (2) the letter case may be incorrect. To address both features, we compared three solutions on the CoNLL-2003 dataset: LUKE (Yamada et al., 2020), a BERT-based model at the fine-tuning stage, and BERT+CRF. The F1 score, the harmonic mean of precision and recall, takes both into account and is high only when both are high. The BERT-based model at the fine-tuning stage was selected, with an F1 score of 0.96. With this high-performance language model, we solved two problems: extracting the key entity portion of a query, and improving the accuracy of a particular class with a dictionary. Key entity concept extraction reached an F1 score of 0.97; part-of-speech (POS) tags for nouns were used when tagging the data. The concept extraction technique can also be applied to phrase mining, synonym suggestion, and knowledge graph generation. For performance improvement, a well-defined dictionary is a valuable data resource. spaCy skweak, weak labels, and the LUKE strategy were all tested in this study. Weak labels improved the F1 score for the organization class from 0.73 to 0.89.
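The step from token classes to a syntax query can be illustrated with a minimal sketch. It assumes BIO-style tags (as used in CoNLL-2003) and a hypothetical mapping from entity class to search field; the field names, the `field("phrase")` clause format, and the function names are illustrative assumptions, not the syntax or implementation used in the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical mapping from entity class to search field (assumption, not from the paper).
FIELD_BY_CLASS: Dict[str, str] = {
    "PER": "person",
    "ORG": "organization",
    "LOC": "location",
}


def group_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_class, phrase) spans.

    An I- tag that does not continue a span of the same class is ignored.
    """
    spans: List[Tuple[str, List[str]]] = []
    current_class = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new span.
            current_class = tag[2:]
            spans.append((current_class, [token]))
        elif tag.startswith("I-") and current_class == tag[2:]:
            # An I- tag extends the open span of the same class.
            spans[-1][1].append(token)
        else:
            # "O" or an inconsistent I- tag closes any open span.
            current_class = None
    return [(cls, " ".join(words)) for cls, words in spans]


def to_fielded_query(tokens: List[str], tags: List[str]) -> str:
    """Translate predicted token classes into a fielded query string."""
    clauses = [
        f'{FIELD_BY_CLASS[cls]}("{phrase}")'
        for cls, phrase in group_entities(tokens, tags)
        if cls in FIELD_BY_CLASS  # classes without a mapped field are skipped
    ]
    return " AND ".join(clauses)


# Lowercase input mirrors the case-incorrect queries described in the abstract.
tokens = ["cases", "about", "apple", "in", "new", "york"]
tags = ["O", "O", "B-ORG", "O", "B-LOC", "I-LOC"]
print(to_fielded_query(tokens, tags))
```

In this sketch the tags would come from the token classification model; here they are written by hand so the conversion step can be read in isolation.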

Keywords: Search Algorithms

Suggested Citation

Ma, Tingting and Zhou, Aoru and Rhodes, David and Wu, Nicholas and Dixit, Anusha, Token Classification Tasks Model Comparison and Performance Improvement Strategy with Natural Language Query (September 21, 2023). Proceedings of the 6th Annual RELX Search Summit, Available at SSRN: https://ssrn.com/abstract=4332192

Tingting Ma (Contact Author)

LexisNexis

P. O. Box 933
Dayton, OH 45401
United States

Aoru Zhou

LexisNexis

David Rhodes

LexisNexis

Nicholas Wu

LexisNexis

Anusha Dixit

LexisNexis
