Carnegie Mellon University
BeamSeg: a Joint Model for Multi-Document Segmentation and Topic Identification

thesis
posted on 2025-04-10, 20:03 authored by Pedro José dos Reis Mota

The work in this thesis is motivated by the problem of navigating the content of a collection of related documents, which is cumbersome if only a list of documents is given. Automatically structuring the content of a dataset by identifying topically cohesive segments and linking segments that describe the same topic addresses this issue. Previous work deals with this problem by using a multi-document joint model for segmentation and topic identification at the dataset level, a perspective we also take. This multi-document approach to segmentation contrasts with approaches that segment documents individually. The advantage of a multi-document model is that segmentation can leverage repeated descriptions of the same topic across different documents. We continue this line of work by hypothesizing that vocabulary relationships between different segments can be used to obtain more accurate segmentation and topic identification. We also hypothesize that documents that share the same modality (video transcripts, PowerPoint, etc.) have similar characteristics that could be modeled to obtain better performance in these tasks. To study these hypotheses, we propose BeamSeg, a joint model for multi-document segmentation and topic identification in which segments are assumed to have vocabulary usage relationships. BeamSeg implements segmentation and topic identification in an unsupervised Bayesian setting by drawing segments with the same topic from the same multinomial language model. Contrary to previous work, we assume that language models are not independent, since vocabulary changes across consecutive segments are expected to be smooth rather than abrupt. We achieve this by putting a dynamic Dirichlet prior over the language models that takes into account data contributions from other topics.
Additionally, we encode in BeamSeg that documents of the same modality have similar segment length characteristics, and thus each modality has its own segment length prior. To better understand the performance advantages of the proposed joint model approach, we compare BeamSeg to a pipeline approach (performing segmentation and topic identification sequentially). In this context, we extend two single-document models to the multi-document case and propose a graph-community detection approach to topic identification. To test our hypotheses, we carry out a data collection task, as datasets from previous work have few documents with short segments, leaving little room to observe vocabulary relationships. The evaluation on the collected dataset shows that BeamSeg obtains the best results, affording practical improvements in both segmentation and topic identification and corroborating our hypotheses.

History

Date

2019-07-05

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

  • Maria Luísa Torres Ribeiro Marques da Silva Coheur
  • Maxine Eskenazi
