Carnegie Mellon University
Browse
bmcveigh_phd_stat_2020.pdf (2.32 MB)

Methods for the Estimation of Large Scale Bayesian Models for Record Linkage Under One-to-One Matching

Download (2.32 MB)
thesis
posted on 2020-05-29, 21:34 authored by Brendan McveighBrendan Mcveigh
Probabilistic record linkage (PRL) is the process of identifying pairs of records from two ?files or datasets
that correspond to the same underlying entity. In the absence of an error-free unique identi?fier, links between records must be estimated and are inherently uncertain. Properly characterizing this uncertainty is a prerequisite for correctly performing any statistical estimation or inferential task with the resulting linked data. In nearly all real-world record linkage problems a lack of training data, which must be speci?fic to the records contained in both datasets, presents an additional challenge and necessitates the use of unsupervised methods. Bayesian methods provide a powerful mechanism for quantifying uncertainty in the estimated link structure via the posterior distribution. Furthermore, they allow for the inclusion of additional information, such as structural constraints on the link structure, via an appropriate prior distribution. Incorporating
such information into the estimation can signi?ficantly improve performance, particularly for unsupervised
methods. The application of Bayesian methods to record linkage problems has thus far been limited by a lack of methods that can practically be applied to sets of records containing more than a few thousand entries. In this dissertation we make several methodological advances that allow Bayesian methods to be applied successfully to sets of records that are orders of magnitude larger than is possible with existing methods. We ?first reexamine the standard two-step method for resolving a set of similarity weights between records into a link structure consistent with one-to-one matching. We develop a joint procedure that uses a modi?fied
optimization problem to induce sparsity, signi?ficantly reducing the computational complexity. This allows for
the efficient calculation of a maximum a posteriori (MAP) estimate of the link structure, under an appropriate prior distribution over the link structure. We next address the more general question of estimating Bayesian models for record linkage via a MCMC sampler. Due to the discrete nature of the link structure, standard MCMC samplers converge extremely slowly and are impractical for almost all real problems. We develop a data-driven blocking scheme that separates the link structure in a large number of small disjoint regions within which a MCMC sampler can mix efficiently. Finally, we provide a method for transforming existing prior distributions over the link structure that are informative only on the number of matches, to distributions
that are also informative over the expected types of matches. Importantly the transformation maintains both
one-to-one matching constraints and invariance to the ordering of the records. We demonstrate the scalability of our contributions by linking two historical voter registrations that contain hundreds of thousands of records. Both fi?les where generated by applying OCR to the original paper
voter registries and, therefore, contain numerous missing or incorrect entries. Despite lacking a high quality blocking key, our approach allows a posterior distribution to be estimated on a single machine in a matter of hours. We further demonstrate, using a set of hand-labeled record pairs, that a fully Bayesian estimate signi?ficantly outperforms other methods, both linking more record pairs and achieving a lower false match rate.

History

Date

2020-05-17

Degree Type

  • Dissertation

Department

  • Statistics

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Jared Murray Rebecca Nugent

Usage metrics

    Licence

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC