Automated Linking of Historical Data

69 Pages. Posted: 13 May 2019

Ran Abramitzky

Stanford University - Department of Economics

Leah Platt Boustan

Princeton University

Katherine Eriksson

University of California, Davis

James Feigenbaum

Boston University - Department of Economics; National Bureau of Economic Research (NBER)

Santiago Pérez

University of California, Davis

Date Written: May 2019

Abstract

The recent digitization of complete count census data is an extraordinary opportunity for social scientists to create large longitudinal datasets by linking individuals from one census to another or from other sources to the census. We evaluate different automated methods for record linkage, performing a series of comparisons across methods and against hand linking. We have three main findings that lead us to conclude that automated methods perform well. First, a number of automated methods generate very low (less than 5%) false positive rates. The automated methods trace out a frontier illustrating the tradeoff between the false positive rate and the (true) match rate. Relative to more conservative automated algorithms, humans tend to link more observations but at a cost of higher rates of false positives. Second, when human linkers and algorithms have the same amount of information, there is relatively little disagreement between them. Third, across a number of plausible analyses, coefficient estimates and parameters of interest are very similar when using linked samples based on each of the different automated methods. We provide code and Stata commands to implement the various automated methods.
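The tradeoff described in the abstract can be illustrated with a small sketch. This is not the authors' code or their Stata commands; the function name, the record IDs, and the exact definitions used here (match rate as the share of records linked, false positive rate as the share of links made that are wrong) are assumptions chosen for illustration.

```python
from typing import Dict, Optional


def linkage_rates(links: Dict[str, Optional[str]], truth: Dict[str, str]) -> Dict[str, float]:
    """Summarize one linking method against a hypothetical ground truth.

    links: record ID in the first source -> linked ID in the second source,
           or None when the method declines to link the record.
    truth: record ID in the first source -> correct ID in the second source.
    """
    n = len(links)
    # Keep only the records the method actually linked.
    made = {a: b for a, b in links.items() if b is not None}
    correct = sum(1 for a, b in made.items() if truth.get(a) == b)
    wrong = len(made) - correct
    return {
        # Assumed definition: share of all records that received any link.
        "match_rate": len(made) / n if n else 0.0,
        # Assumed definition: share of all records linked correctly.
        "true_match_rate": correct / n if n else 0.0,
        # Assumed definition: share of links made that are incorrect.
        "false_positive_rate": wrong / len(made) if made else 0.0,
    }


# Made-up toy data: four records with known correct links.
truth = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b4"}
# A conservative method links only the records it is sure about.
conservative = {"a1": "b1", "a2": "b2", "a3": None, "a4": None}
# A permissive method links every record and gets one wrong.
permissive = {"a1": "b1", "a2": "b2", "a3": "b3", "a4": "b9"}

conservative_rates = linkage_rates(conservative, truth)
permissive_rates = linkage_rates(permissive, truth)
print(conservative_rates)
print(permissive_rates)
```

In this toy data the permissive method links more records, but one of its four links is wrong, while the conservative method links half as many with no errors. That is the shape of the frontier the abstract describes, though the numbers themselves are invented.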

Institutional subscribers to the NBER working paper series and residents of developing countries may download this paper without additional charge at www.nber.org.

Suggested Citation

Abramitzky, Ran and Boustan, Leah Platt and Eriksson, Katherine and Feigenbaum, James and Pérez, Santiago, Automated Linking of Historical Data (May 2019). NBER Working Paper No. w25825. Available at SSRN: https://ssrn.com/abstract=3387181

Ran Abramitzky (Contact Author)

Stanford University - Department of Economics

Stanford, CA 94305
United States

Leah Platt Boustan

Princeton University

22 Chambers Street
Princeton, NJ 08544-0708
United States

Katherine Eriksson

University of California, Davis

James Feigenbaum

Boston University - Department of Economics

270 Bay State Road
Boston, MA 02215
United States

National Bureau of Economic Research (NBER)

1050 Massachusetts Avenue
Cambridge, MA 02138
United States

Santiago Pérez

University of California, Davis

One Shields Avenue
Apt 153
Davis, CA 95616
United States
