Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records

40 Pages. Posted: 8 Aug 2018

Ted Enamorado

Princeton University, Department of Politics

Benjamin Fifield

Princeton University, Department of Politics

Kosuke Imai

Princeton University - Center for Statistics and Machine Learning

Date Written: May 13, 2018

Abstract

Since most social science research relies upon multiple data sources, merging data sets is an essential part of researchers' workflow. Unfortunately, a unique identifier that unambiguously links records is often unavailable, and data sets may contain missing and inaccurate information. These problems are especially severe when merging large-scale administrative records. We develop a faster and more scalable algorithm to implement a canonical probabilistic model of record linkage that has many advantages over deterministic methods frequently used by social scientists. The proposed methodology efficiently handles millions of observations while accounting for missing data and measurement error, incorporating auxiliary information, and adjusting for uncertainty about merging in post-merge analyses. We conduct comprehensive simulation studies to evaluate the performance of our algorithm in realistic scenarios. We also apply our methodology to merging campaign contribution records, survey data, and nationwide voter files. We provide open-source software for implementing the proposed methodology.
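
To make the idea of a probabilistic record linkage model concrete, the following is a minimal sketch, not the authors' software. It assumes a Fellegi-Sunter style setup in which each compared field agrees or disagrees independently given the true match status, and it computes the posterior probability that a pair of records is a match via Bayes' rule. The function name `match_probability`, the parameter values, and the choice to skip missing fields are illustrative assumptions; in practice the agreement probabilities would be estimated from the data rather than supplied by hand.

```python
def match_probability(agreements, m, u, prior):
    """Posterior probability that a record pair is a true match.

    agreements: per-field comparison results, each True (agree),
                False (disagree), or None (missing in either record).
    m[k]:  assumed P(field k agrees | pair is a match)
    u[k]:  assumed P(field k agrees | pair is a non-match)
    prior: assumed P(pair is a match) before comparing any field
    """
    like_match = 1.0
    like_non = 1.0
    for a, mk, uk in zip(agreements, m, u):
        if a is None:
            # Assumption: a missing comparison carries no evidence,
            # so it is skipped rather than treated as a disagreement.
            continue
        like_match *= mk if a else 1.0 - mk
        like_non *= uk if a else 1.0 - uk
    numerator = prior * like_match
    return numerator / (numerator + (1.0 - prior) * like_non)


# Hypothetical values for three fields (e.g. first name, last name, birth year).
m = [0.95, 0.90, 0.85]
u = [0.01, 0.05, 0.10]
prior = 0.001

p_all_agree = match_probability([True, True, True], m, u, prior)
p_one_missing = match_probability([True, None, True], m, u, prior)
p_all_disagree = match_probability([False, False, False], m, u, prior)
```

Unlike a deterministic rule that requires exact agreement on every field, this kind of posterior probability degrades gradually when a field is missing or disagrees, and it can be carried into later analyses as a measure of uncertainty about each merge.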

Suggested Citation

Enamorado, Ted and Fifield, Benjamin and Imai, Kosuke, Using a Probabilistic Model to Assist Merging of Large-Scale Administrative Records (May 13, 2018). Available at SSRN: https://ssrn.com/abstract=3214172 or http://dx.doi.org/10.2139/ssrn.3214172

Ted Enamorado (Contact Author)

Princeton University, Department of Politics

Princeton, NJ
United States

Benjamin Fifield

Princeton University, Department of Politics

Princeton, NJ
United States

Kosuke Imai

Princeton University - Center for Statistics and Machine Learning

Princeton, NJ
United States
