Similarity Filtering with Multibit Trees for Record Linkage

German Record Linkage Center, Working Paper Series, No. WP-GRLC-2013-01

20 Pages. Posted: 30 Mar 2020

Tobias Bachteler

University of Duisburg-Essen

Jörg Reiher

University of Duisburg-Essen

Rainer Schnell

University of Duisburg-Essen

Date Written: March 15, 2013

Abstract

Record linkage is the process of identifying pairs of records that refer to the same real-world object within or across data files. In its basic form, each record pair is compared with a similarity function and then classified as a presumed match or non-match. However, if every possible record pair has to be compared, the number of comparisons grows with the product of the file sizes and leads to infeasible running times for large data files. In such situations, blocking or indexing methods are required to reduce the comparison space. In this paper we propose a new blocking technique, Q-gram Fingerprinting, that efficiently filters record pairs according to an approximation of a q-gram similarity function. The method first transforms data records into bit vectors (the fingerprints) and then uses a Multibit Tree to filter pairs of fingerprints according to a user-defined similarity threshold. We examined the effect of different parameter choices for Q-gram Fingerprinting, tested its scalability, and compared it with several alternative methods using simulated person data. In the comparison study, the proposed method showed promising results.
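The two steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the padding character, the parameters (`n_bits`, `n_hashes`, `q`), the SHA-256 based hashing, and the use of the Tanimoto coefficient are all assumptions made for illustration. In place of a Multibit Tree, the sketch scans all pairs and prunes them with a simple bit-count bound, so it shows what the filter computes, not how a tree would speed it up.

```python
import hashlib


def qgrams(text, q=2):
    """Return the set of padded q-grams of a string (padding with '_' is an assumption)."""
    padded = "_" * (q - 1) + text.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def fingerprint(text, n_bits=256, n_hashes=2, q=2):
    """Map a record to a bit vector, stored as a Python int, Bloom-filter style."""
    bits = 0
    for gram in qgrams(text, q):
        for seed in range(n_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % n_bits)
    return bits


def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient of two bit vectors."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def candidate_pairs(file_a, file_b, threshold=0.8):
    """Return record pairs whose fingerprints reach the similarity threshold.

    Stand-in for the tree search: a linear scan with a cardinality bound.
    Since Tanimoto(a, b) <= min(|a|, |b|) / max(|a|, |b|), pairs whose bit
    counts differ too much can be skipped without comparing the vectors.
    """
    indexed_b = []
    for rec_b in file_b:
        fp_b = fingerprint(rec_b)
        indexed_b.append((fp_b, popcount(fp_b), rec_b))

    pairs = []
    for rec_a in file_a:
        fp_a = fingerprint(rec_a)
        count_a = popcount(fp_a)
        for fp_b, count_b, rec_b in indexed_b:
            if min(count_a, count_b) < threshold * max(count_a, count_b):
                continue  # the bound rules this pair out
            if tanimoto(fp_a, fp_b) >= threshold:
                pairs.append((rec_a, rec_b))
    return pairs
```

Only the pairs returned by `candidate_pairs` would then be passed to the detailed comparison and classification step. The bit-count bound never discards a pair that meets the threshold, so the result is the same as an exhaustive scan over all fingerprint pairs.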

Keywords: Blocking, Bloom-Filters, Indexing, Multibit Trees, Q-Grams, Record Linkage

Suggested Citation

Bachteler, Tobias and Reiher, Jörg and Schnell, Rainer, Similarity Filtering with Multibit Trees for Record Linkage (March 15, 2013). German Record Linkage Center, Working Paper Series, No. WP-GRLC-2013-01, Available at SSRN: https://ssrn.com/abstract=3530899 or http://dx.doi.org/10.2139/ssrn.3530899

Author address (Tobias Bachteler, Jörg Reiher, Rainer Schnell)

University of Duisburg-Essen
Lotharstrasse 1
Duisburg, 47048
Germany

Contact author: Rainer Schnell
