Scalable Classification Pipeline with Domain Adaptation and Active Learning

Posted: 8 Feb 2024

Date Written: September 21, 2023

Abstract

Multi-label text classification is a common task in various products at Elsevier, including ScienceDirect and Compendex. However, several recurring issues arise from a machine learning perspective, concerning the model, the data, and the evaluation. First, classification models typically serve as the default choice for this task. Nonetheless, these models struggle to scale to a large number of labels, as the increasing size of the feature space causes the number of parameters to explode quickly. Second, when constructing a new classification pipeline, labeled data is often unavailable, and the available data can be imbalanced. Moreover, the taxonomy data, from which the labels originate, undergoes yearly updates. Consequently, the training data, the test data, and the model all require regular updates. Third, the evaluation process is time-consuming. Evaluations are typically performed offline using a test set, which requires subject matter experts (SMEs) to spend significant time labeling samples. These issues have direct consequences for the business, leading to prolonged BAU (business as usual) time, limited innovation, increased effort for the sales team, and dissatisfied clients.

In this work, we aim to replace the existing Compendex classification pipeline with a new solution that addresses the aforementioned issues. First, we introduce a label ranking model to replace the traditional BERT-based classification model. This new model comprises a Bi-Encoder model and a Cross-Encoder model. The Bi-Encoder model offers benefits such as high recall and low computational cost, while the Cross-Encoder model enhances precision by re-ranking the top documents (e.g., the top 1000). Second, we propose an active learning-based pipeline for model updates and data collection. For new labels without labeled data, we initially train an unsupervised Bi-Encoder model to serve as the starting model. This model not only provides reasonable performance but also identifies potentially positive samples for annotation. Human annotators are then brought into the annotation loop to label the training data. Finally, we present a ChatGPT-assisted method for constructing the test set. We generate prompts for documents that require annotation and use ChatGPT to obtain label answers along with confidence scores and explanations. Subsequently, SMEs manually verify these answers.
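The two-stage ranking idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the functions `embed` and `cross_score` are hypothetical toy stand-ins (a bag-of-words vector and a token-overlap score) for a trained Bi-Encoder and Cross-Encoder, and the names `retrieve`, `rank_labels`, `top_k`, and `top_n` are assumptions made for illustration. The sketch ranks candidate labels for one document, which is one plausible reading of the label ranking setup.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a Bi-Encoder: a bag-of-words count vector (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(document, labels, label_vecs, top_k):
    """Stage 1 (cheap, recall-oriented): compare the document vector with
    label vectors that were embedded once, ahead of time."""
    doc_vec = embed(document)
    scored = [(cosine(doc_vec, label_vecs[label]), label) for label in labels]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [label for _, label in scored[:top_k]]


def cross_score(document, label):
    """Toy stand-in for a Cross-Encoder (assumption): scores the
    (document, label) pair jointly instead of comparing two vectors."""
    doc_text = document.lower()
    doc_tokens = set(doc_text.split())
    label_tokens = label.lower().split()
    coverage = sum(t in doc_tokens for t in label_tokens) / len(label_tokens)
    phrase_bonus = 0.5 if label.lower() in doc_text else 0.0
    return coverage + phrase_bonus


def rank_labels(document, labels, top_k=1000, top_n=10):
    """Retrieve top_k candidates with stage 1, then re-rank only those
    candidates with the more expensive stage 2 and keep the best top_n."""
    label_vecs = {label: embed(label) for label in labels}
    candidates = retrieve(document, labels, label_vecs, top_k)
    reranked = sorted(candidates, key=lambda l: cross_score(document, l), reverse=True)
    return reranked[:top_n]


labels = ["neural networks", "heat transfer", "fluid dynamics", "network security"]
document = "we train neural networks for text classification"
predicted = rank_labels(document, labels, top_k=2, top_n=1)
```

The point of the split is cost: the first stage touches every label but only needs a vector comparison, while the pairwise scorer runs on the shortlist alone. In a real system both stand-ins would be replaced by trained models.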

We evaluate our pipeline based on model performance, training cost, and manual annotation cost. The predicted concepts of our pipeline exhibit greater correctness and specificity compared to the production baseline. Additionally, the training cost is reduced by 10%, while SMEs’ annotation effort is reduced by 75%. The introduction of CI/CD (continuous integration/continuous deployment) allows for multiple releases within a single year, facilitating increased efficiency and productivity.

Keywords: Multi-label classification, label ranking model, active learning, evaluation

Suggested Citation

Li, Dan and Zhu, Zi Long and Afzal, Zubair and Yadav, Vikrant and van de Loo, Janneke, Scalable Classification Pipeline with Domain Adaptation and Active Learning (September 21, 2023). Proceedings of the 7th Annual RELX Search Summit, Available at SSRN: https://ssrn.com/abstract=4716509

Zi Long Zhu

Elsevier

Radarweg 29
Amsterdam, 1043 NX
Netherlands

Zubair Afzal

Elsevier

Radarweg 29
Amsterdam, 1043 NX
Netherlands

Vikrant Yadav

Elsevier

Janneke Van de Loo

Elsevier

Radarweg 29
Amsterdam, 1043 NX
Netherlands
