Linking Individuals Across Historical Sources: A Fully Automated Approach

40 Pages Posted: 21 Feb 2018

Ran Abramitzky

Stanford University - Department of Economics

Roy Mill

Stanford University

Santiago Pérez

University of California, Davis

Date Written: February 2018

Abstract

Linking individuals across historical datasets relies on information such as name and age that is both non-unique and prone to enumeration and transcription errors. These errors make it impossible to find the correct match with certainty. In the first part of the paper, we propose a fully automated probabilistic method for linking historical datasets that enables researchers to create samples at the frontier of minimizing type I errors (false positives) and type II errors (false negatives). The first step guides researchers in choosing which variables to use for linking. The second step uses the Expectation-Maximization (EM) algorithm, a standard tool in statistics, to compute the probability that each pair of records corresponds to the same individual. The third step suggests how to use these estimated probabilities to choose which records to use in the analysis. In the second part of the paper, we apply the method to link historical population censuses in the US and Norway, and use these samples to estimate measures of intergenerational occupational mobility. The estimates from our method are remarkably similar to those from the IPUMS method, which relies on hand linking to create a training sample. We provide R code and a Stata command that implement this method.
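
The second and third steps can be illustrated with a minimal sketch. This is not the authors' R or Stata implementation. It assumes a simplified setting in which each candidate pair of records is summarized by 0/1 agreement indicators (for example, name agrees, birth year agrees), fields are conditionally independent given match status, and a link is kept only when the best candidate is confident and the runner-up is weak. The function names `em_match_probabilities` and `select_links`, the starting values, and the thresholds `min_best` and `max_second` are all hypothetical choices made for illustration.

```python
from math import prod


def pattern_likelihood(row, probs):
    """Likelihood of a 0/1 agreement pattern under independent Bernoulli fields."""
    return prod(q if g else 1.0 - q for g, q in zip(row, probs))


def em_match_probabilities(gamma, n_iter=200, tol=1e-9):
    """Fit a two-class mixture (match / non-match) by EM.

    gamma: list of rows, one per candidate pair, each a list of 0/1
    agreement indicators. Returns (w, share, m, u), where w[i] is the
    estimated probability that pair i is a true match.
    Starting values below are assumptions, not taken from the paper.
    """
    k = len(gamma[0])
    share, m, u = 0.1, [0.9] * k, [0.1] * k

    def clip(x):
        # Keep parameters strictly inside (0, 1) to avoid division by zero.
        return min(max(x, 1e-6), 1.0 - 1e-6)

    w = []
    for _ in range(n_iter):
        # E-step: posterior match probability for every pair.
        w = []
        for row in gamma:
            lm = share * pattern_likelihood(row, m)
            lu = (1.0 - share) * pattern_likelihood(row, u)
            w.append(lm / (lm + lu))
        # M-step: re-estimate the match share and per-field agreement rates.
        total_m = sum(w)
        total_u = len(w) - total_m
        new_share = clip(total_m / len(w))
        m = [clip(sum(wi * row[j] for wi, row in zip(w, gamma)) / total_m)
             for j in range(k)]
        u = [clip(sum((1.0 - wi) * row[j] for wi, row in zip(w, gamma)) / total_u)
             for j in range(k)]
        converged = abs(new_share - share) < tol
        share = new_share
        if converged:
            break
    return w, share, m, u


def select_links(pairs, probs, min_best=0.9, max_second=0.5):
    """Keep a record's best candidate only if it is confident and unambiguous.

    pairs: list of (id_a, id_b) candidate pairs; probs: matching probabilities.
    The thresholds are illustrative. Uniqueness on the id_b side is not enforced.
    """
    candidates = {}
    for (a, b), p in zip(pairs, probs):
        candidates.setdefault(a, []).append((p, b))
    links = {}
    for a, cands in candidates.items():
        cands.sort(key=lambda c: c[0], reverse=True)
        best_p, best_b = cands[0]
        second_p = cands[1][0] if len(cands) > 1 else 0.0
        if best_p >= min_best and second_p <= max_second:
            links[a] = best_b
    return links
```

Raising `min_best` or lowering `max_second` would be expected to reduce false positives at the cost of linking fewer records, which is one way to picture the trade-off between type I and type II errors described above.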

Suggested Citation

Abramitzky, Ran and Mill, Roy and Pérez, Santiago, Linking Individuals Across Historical Sources: A Fully Automated Approach (February 2018). NBER Working Paper No. w24324. Available at SSRN: https://ssrn.com/abstract=3127065

Ran Abramitzky (Contact Author)

Stanford University - Department of Economics

Stanford, CA 94305
United States

Roy Mill

Stanford University

Stanford, CA 94305
United States

Santiago Pérez

University of California, Davis

One Shields Avenue
Apt 153
Davis, CA 95616
United States
