Carnegie Mellon University

Linkage of Early 1900s Irish Census Records: Exploring the Impact of Household Structure and Crowdsourced Labels

thesis
posted on 2021-04-23, 14:56 authored by Kayla Frisoli
Record linkage is the process of identifying records corresponding to unique entities across data sets. Linking individuals in historical data allows researchers to better characterize topics like population mobility, the impact of local and national events, and generational changes. Historians in Ireland are currently interested in linking the recently released 1901 and 1911 census record databases. As with many (historical) record linkage applications, there are challenges arising from the digitization of hand-written records, high frequencies of common names, and human mobility. Traditional methods struggle with these issues, and it is often acknowledged that specific sub-populations (e.g., women who change their names, individuals who move between census dates) are linked with lower accuracy. Additionally, these methods often consider only pairwise record comparisons without incorporating household or relationship information across records. Furthermore, development and assessment of supervised record linkage methodology often rely on labeled data sets with unknown label quality. To help address these challenges, we designed a record linkage interface to study the impact of the human labeling process on the full record linkage pipeline. Via this interface, workers link records not only at the individual level but also at the household and within-household levels, matching 1901 Ireland census records to their (potential) 1911 counterparts. In addition, we collect multiple instances of each label to assess label uncertainty. Our work capitalizes on this label collection process as well as known historical changes and the data's household structure. We find evidence that models incorporating this information better link hard-to-match populations. Beyond linking the actual records and households, we collect information about how the labeler interacts with the interface (e.g., time spent, click patterns), providing rich information across labeler populations. Our approach was iteratively adapted to balance worker engagement, label quality, and monetary expenses. We find differences in downstream record linkage model performance based on changes in label generation and argue that it is critical to pay attention to these changes when labeling records or building models with pre-existing data. Data about the crowdsourced individual and household matches, the human labelers (from both CMU and Amazon MTurk), and the overall labeling process will be made publicly available. We hope this data and our resulting insights prompt new areas of research within and beyond the record linkage community.

History

Date

2020-10-23

Degree Type

  • Dissertation

Department

  • Statistics

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Rebecca Nugent
