Deep String Matching For Duplicate Detection

27 Pages. Posted: 18 May 2021. Last revised: 7 Jun 2021.


Alexandre Bloch

University of Edinburgh - School of Mathematics

Daniel Alexandre Bloch

Université Paris VI Pierre et Marie Curie

Date Written: May 16, 2021

Abstract

We consider duplicate detection in the setting where handling typographical errors, toponym matching, and datatype dependency are combined into a single task. We cast this task as a string matching problem and solve it by estimating a conditional probability with an encoder-decoder model: the strings are first encoded by a Deep Recurrent Network into context vectors, which are then concatenated and used as inputs to a Deep Classifier Network.

We explore how different architectures affect string matching when applied to duplicate detection. We then test the models on numerous datasets of varying size, some of which emphasize one of the datatype issues more than the others. We show that deep hierarchical networks perform best in tasks where temporal order matters.
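
The pipeline outlined in the abstract (encode each string with a recurrent network, concatenate the two context vectors, classify the pair) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a plain tanh recurrent cell in place of an LSTM or GRU, a single logistic unit in place of a Deep Classifier Network, and random untrained weights, so the returned probability is meaningless until the parameters are fitted. All names (`HIDDEN`, `ALPHABET`, `encode`, `match_probability`) are hypothetical.

```python
import math
import random

HIDDEN = 8  # size of each context vector (illustrative choice)
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789 .,-"  # toy character set

rng = random.Random(0)  # fixed seed so the untrained weights are reproducible


def rand_matrix(rows, cols):
    """Return a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


# Encoder parameters: one input column per character plus a recurrent matrix.
W_in = rand_matrix(HIDDEN, len(ALPHABET))
W_rec = rand_matrix(HIDDEN, HIDDEN)

# Classifier parameters: one logistic unit over the concatenated contexts.
w_out = [rng.uniform(-0.5, 0.5) for _ in range(2 * HIDDEN)]
b_out = 0.0


def encode(text):
    """Run a plain tanh RNN over the characters; return the final hidden state."""
    h = [0.0] * HIDDEN
    for ch in text.lower():
        idx = ALPHABET.find(ch)
        if idx < 0:
            continue  # skip characters outside the toy alphabet
        # One-hot input: only column idx of W_in contributes to the update.
        h = [
            math.tanh(W_in[i][idx] + sum(W_rec[i][j] * h[j] for j in range(HIDDEN)))
            for i in range(HIDDEN)
        ]
    return h


def match_probability(a, b):
    """Estimate P(duplicate | a, b) from the concatenated context vectors."""
    context = encode(a) + encode(b)  # concatenation of the two encodings
    logit = b_out + sum(w * c for w, c in zip(w_out, context))
    return 1.0 / (1.0 + math.exp(-logit))


if __name__ == "__main__":
    print(match_probability("Edinburgh", "Edinbrugh"))
```

Because the hidden state is updated character by character, the encoding depends on character order, which is the property a recurrent encoder contributes to this kind of matching task. Swapping the tanh cell for an LSTM or GRU and the logistic unit for a multi-layer classifier would bring the sketch closer to the architectures named in the keywords.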

Keywords: Duplicate Detection, Natural Language Inference, String Matching, LSTM, GRU, Deep Networks

Suggested Citation

Bloch, Alexandre and Bloch, Daniel Alexandre, Deep String Matching For Duplicate Detection (May 16, 2021). Available at SSRN: https://ssrn.com/abstract=3847416 or http://dx.doi.org/10.2139/ssrn.3847416

Alexandre Bloch (Contact Author)

University of Edinburgh - School of Mathematics

James Clerk Maxwell Building
Peter Guthrie Tait Rd
Edinburgh, EH9 3FD
United Kingdom

Daniel Alexandre Bloch

Université Paris VI Pierre et Marie Curie

175 Rue du Chevaleret
Paris, 75013
France

