Methods for the Estimation of Large Scale Bayesian Models for Record Linkage Under One-to-One Matching

Mcveigh, Brendan

doi:10.1184/R1/12370976.v1

bmcveigh_phd_stat_2020.pdf (2.32 MB)

Methods for the Estimation of Large Scale Bayesian Models for Record Linkage Under One-to-One Matching

thesis

posted on 2020-05-29, 21:34 authored by Brendan McveighBrendan Mcveigh

Probabilistic record linkage (PRL) is the process of identifying pairs of records from two ?files or datasets

that correspond to the same underlying entity. In the absence of an error-free unique identi?fier, links between records must be estimated and are inherently uncertain. Properly characterizing this uncertainty is a prerequisite for correctly performing any statistical estimation or inferential task with the resulting linked data. In nearly all real-world record linkage problems a lack of training data, which must be speci?fic to the records contained in both datasets, presents an additional challenge and necessitates the use of unsupervised methods. Bayesian methods provide a powerful mechanism for quantifying uncertainty in the estimated link structure via the posterior distribution. Furthermore, they allow for the inclusion of additional information, such as structural constraints on the link structure, via an appropriate prior distribution. Incorporating

such information into the estimation can signi?ficantly improve performance, particularly for unsupervised

methods. The application of Bayesian methods to record linkage problems has thus far been limited by a lack of methods that can practically be applied to sets of records containing more than a few thousand entries. In this dissertation we make several methodological advances that allow Bayesian methods to be applied successfully to sets of records that are orders of magnitude larger than is possible with existing methods. We ?first reexamine the standard two-step method for resolving a set of similarity weights between records into a link structure consistent with one-to-one matching. We develop a joint procedure that uses a modi?fied

optimization problem to induce sparsity, signi?ficantly reducing the computational complexity. This allows for

the efficient calculation of a maximum a posteriori (MAP) estimate of the link structure, under an appropriate prior distribution over the link structure. We next address the more general question of estimating Bayesian models for record linkage via a MCMC sampler. Due to the discrete nature of the link structure, standard MCMC samplers converge extremely slowly and are impractical for almost all real problems. We develop a data-driven blocking scheme that separates the link structure in a large number of small disjoint regions within which a MCMC sampler can mix efficiently. Finally, we provide a method for transforming existing prior distributions over the link structure that are informative only on the number of matches, to distributions

that are also informative over the expected types of matches. Importantly the transformation maintains both

one-to-one matching constraints and invariance to the ordering of the records. We demonstrate the scalability of our contributions by linking two historical voter registrations that contain hundreds of thousands of records. Both fi?les where generated by applying OCR to the original paper

voter registries and, therefore, contain numerous missing or incorrect entries. Despite lacking a high quality blocking key, our approach allows a posterior distribution to be estimated on a single machine in a matter of hours. We further demonstrate, using a set of hand-labeled record pairs, that a fully Bayesian estimate signi?ficantly outperforms other methods, both linking more record pairs and achieving a lower false match rate.

History

Date

2020-05-17

Degree Type

Dissertation

Department

Statistics

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Jared Murray Rebecca Nugent

Usage metrics

Keywords

record linkage Bayesian mcmc assignment problem auction algorithm

Licence

CC BY 4.0

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Methods for the Estimation of Large Scale Bayesian Models for Record Linkage Under One-to-One Matching

History

Date

Degree Type

Department

Degree Name

Advisor(s)

Usage metrics

Categories

Keywords

Licence

Exports