Similarity Filtering with Multibit Trees for Record Linkage
German Record Linkage Center, Working Paper Series, No. WP-GRLC-2013-01
20 Pages. Posted: 30 Mar 2020
Date Written: March 15, 2013
Abstract
Record linkage is the process of identifying pairs of records that refer to the same real-world object within or across data files. Each record pair is compared using a similarity function and then classified as a presumed match or non-match. However, if every possible record pair has to be compared, the number of comparisons leads to infeasible running times for large data files. In such situations, blocking or indexing methods are required to reduce the comparison space. In this paper, we propose a new blocking technique (Q-gram Fingerprinting) that efficiently filters record pairs according to an approximation of a q-gram similarity function. The new method first transforms data records into bit vectors, the fingerprints, and then filters pairs of fingerprints using a Multibit Tree according to a user-defined similarity threshold. We examined the effect of different parameter choices for Q-gram Fingerprinting, tested its scalability, and performed a comparison study against several alternative methods using simulated person data. The comparison study showed promising results for the proposed method.
Keywords: Blocking, Bloom-Filters, Indexing, Multibit Trees, Q-Grams, Record Linkage
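The fingerprint-and-filter idea described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names, the SHA-256-based hashing, the parameters `n_bits`, `n_hashes`, and `q`, and the use of the Tanimoto coefficient as the similarity measure are all assumptions made for illustration. No Multibit Tree is built here. A simple bit-count bound stands in for it: the similarity of two fingerprints can never exceed the ratio of the smaller to the larger number of set bits, so pairs that fail this bound are skipped without a full comparison.

```python
import hashlib
from itertools import combinations


def qgrams(text, q=2):
    """Return the set of padded q-grams of a string (padding char is an assumption)."""
    padded = "_" * (q - 1) + text.lower() + "_" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}


def fingerprint(text, n_bits=64, n_hashes=2, q=2):
    """Hash each q-gram into a bit vector, stored as a Python int (illustrative scheme)."""
    bits = 0
    for gram in qgrams(text, q):
        for seed in range(n_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).digest()
            bits |= 1 << (int.from_bytes(digest[:8], "big") % n_bits)
    return bits


def popcount(x):
    """Number of set bits in a non-negative int."""
    return bin(x).count("1")


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two bit vectors."""
    union = popcount(a | b)
    return popcount(a & b) / union if union else 1.0


def candidate_pairs(records, threshold=0.8, **kwargs):
    """Return (record, record, similarity) for pairs at or above the threshold.

    Pairs are first screened with the bound
    tanimoto(a, b) <= min(|a|, |b|) / max(|a|, |b|),
    a stand-in for the tree-based filter named in the abstract.
    """
    prints = [(fingerprint(r, **kwargs), r) for r in records]
    result = []
    for (fa, ra), (fb, rb) in combinations(prints, 2):
        lo, hi = sorted((popcount(fa), popcount(fb)))
        if hi and lo / hi < threshold:
            continue  # bound proves the pair cannot reach the threshold
        sim = tanimoto(fa, fb)
        if sim >= threshold:
            result.append((ra, rb, sim))
    return result


if __name__ == "__main__":
    names = ["Peter Mueller", "Peter Muller", "Anna Schmidt", "Li"]
    for left, right, sim in candidate_pairs(names, threshold=0.6):
        print(f"{left!r} ~ {right!r}: {sim:.2f}")
```

Because the bound only ever discards pairs that cannot reach the threshold, the screened result equals what an exhaustive comparison of all fingerprint pairs would return. The saving comes from skipping the bitwise comparison for pairs whose bit counts differ too much.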