Carnegie Mellon University

Multidocument Text Classification over Heterogeneous Data Sources

thesis
posted on 2025-06-02, 19:45, authored by James Route

This work introduces a class of decision problems modeled on real-world applications in which a human expert selects a course of action while drawing on a set of disparate information sources. Our primary contributions target decision makers who are looking to refine existing decision tasks and enable automatic processing. We construct three datasets to reflect these types of decision problems: one dataset uses federal hiring records containing details on applicants to public trust positions, another documents trademark registration applicants and outcomes, and the third consists of resolutions introduced in the US House or Senate for deliberation. Each dataset comprises documents from multiple distinct sources that contain a mix of structured and unstructured content, as well as time-series data that evolve over the course of the decision process. The problems represented by these datasets are of great practical importance to government and industry, but there has not been a systematic study that examines how to approach them.

We demonstrate that these tasks can be modeled as text classification. A typical text classification approach involves concatenating all data sources into a single document for model input, but we show that this approach has limited effectiveness on complex datasets and that state-of-the-art models may fail to learn the training objective. We explore an ensemble approach that leverages the unique properties of these datasets, demonstrating that multisource ensemble models outperform state-of-the-art single-source baselines. Furthermore, the ensemble approach can be implemented under real-world resource constraints. Because we expect that decision makers who are considering automation will be concerned with the fairness of any solution, we outline a basic first analysis comprising a series of tests, the results of which can feed into deeper fairness investigations informed by domain expertise. We also consider the application of large language models (LLMs) within the practical resource constraints of our classification experiments, and find that these models contribute to the explainability of the dataset but do not improve classification accuracy.

We close with a discussion of how to adapt our findings into a repeatable framework that can be applied to other practical decision scenarios. We also explore areas of future experimentation, with a focus on solutions that may arise from relaxing resource limitations and on improving methodologies through greater access to domain expertise.
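To make the contrast between the two architectures concrete, the sketch below (Python with scikit-learn) shows a concatenation baseline next to a soft-voting ensemble with one classifier per source. This is an illustration of the general pattern, not the models evaluated in the thesis; the source names and the bag-of-words classifiers are assumptions for the example.

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical source names; each thesis dataset defines its own sources.
SOURCES = ["application_form", "case_notes", "status_history"]

def concat_baseline(train_docs, train_labels):
    """Concatenate every source into one document and fit a single classifier."""
    texts = [" ".join(doc[s] for s in SOURCES) for doc in train_docs]
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(texts, train_labels)
    return model

def source_ensemble(train_docs, train_labels):
    """Fit one classifier per source; predictions are combined at inference."""
    members = {}
    for s in SOURCES:
        clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        clf.fit([doc[s] for doc in train_docs], train_labels)
        members[s] = clf
    return members

def ensemble_predict(members, docs):
    """Soft vote: average per-source class probabilities, return labels."""
    probs = np.mean(
        [members[s].predict_proba([doc[s] for doc in docs]) for s in SOURCES],
        axis=0,
    )
    classes = members[SOURCES[0]].classes_
    return classes[probs.argmax(axis=1)]

The structural point is the same one the abstract makes: the ensemble keeps each source's signal separate until the combination step, whereas concatenation can dilute a short but informative source inside a much longer document.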
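As one illustration of the kind of first-pass check such a fairness analysis might include (a generic test for this sketch, not necessarily one of the tests developed in the thesis), the snippet below computes a demographic parity gap: the spread in positive-prediction rates across groups. The column names and toy data are hypothetical, and a real analysis would add significance testing and domain review.

import pandas as pd

def demographic_parity_gap(df, group_col, pred_col):
    """Largest gap in positive-prediction rate between any two groups."""
    rates = df.groupby(group_col)[pred_col].mean()
    return float(rates.max() - rates.min())

# Toy example: predicted approvals broken out by a hypothetical group field.
toy = pd.DataFrame({
    "group":             ["A", "A", "B", "B", "B", "C"],
    "predicted_approve": [1,   0,   1,   1,   0,   0],
})
print(demographic_parity_gap(toy, "group", "predicted_approve"))  # ~0.67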

History

Date

2025-05-01

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Eric Nyberg
