Deep String Matching For Duplicate Detection
27 Pages Posted: 18 May 2021 Last revised: 7 Jun 2021
Date Written: May 16, 2021
Abstract
We consider the problem of duplicate detection in the case where dealing with typographical errors, toponym matching, and datatype dependency are all combined into a single task. We express this task as a string matching problem and resolve it by estimating a conditional probability via an encoder-decoder model, whereby the strings are first encoded with a Deep Recurrent Network into context vectors which are then concatenated and used as inputs for a Deep Classifier Network.
We explore the effects that different architectures have on the string matching problem when applied to duplicate detection. Finally, we test the models on numerous datasets of varying size, with some more focused on one of the datatype issues than others. We show that deep hierarchical networks perform best in tasks where temporal order matters.
Keywords: Duplicate Detection, Natural Language Inference, String Matching, LSTM, GRU, Deep Networks
Suggested Citation: Suggested Citation