Carnegie Mellon University

Towards Grounded Multimodal Enterprise Document Understanding

thesis
posted on 2025-05-20, 21:05 authored by Armineh Nourbakhsh

Document-grounded workflows drive operational efficiency in many enterprise domains. In Finance, onboarding and offboarding of clients, monitoring of client activities, assignment of risk, assessment of credit, and other integral functions are dependent on processing documents across a wide variety of categories and formats, including business filings, financial reports, tax forms, legal contracts, invoices, payment records, and other disclosures. The document understanding tasks associated with these processes encompass several multimodal reasoning challenges, including spatial, visual, and quantitative reasoning.

Against this backdrop, investment in AI-augmented workflows has grown rapidly over the past decade [59, 60]. In highly regulated industries such as Finance, such workflows are expected to comply with requirements related to performance and robustness, including the maintenance of comprehensive data lineage. This means that an Information Extraction model is required to provide datapoints that are fully traceable back to the context from which they were extracted. Groundedness has major implications for downstream applications, as it can improve explainability, expose the provenance of the output, and enhance human-AI interaction.

In recent years, multimodal (large) language models have emerged as a promising approach to document understanding. While these models have demonstrated better overall performance across several tasks, their decoder-based, generative architecture leaves them open to poor groundedness (if not hallucinations), and makes it difficult to localize their outputs. This has led to challenges related to groundedness (or lack thereof) in tasks such as Key Information Extraction and extractive Visual Question Answering, which has in turn complicated the adoption of such models in production pipelines, especially when the requirements for reliability and explainability outweigh performance.

This work addresses the challenge of groundedness in multimodal enterprise document understanding in the context of two prominent reasoning tasks, namely, quantitative reasoning and spatio-visual reasoning. We demonstrate how we can enhance the performance, robustness, and generalizability of models by improving their grounding within the input. In quantitative reasoning, we show how grounding the model in numerical language can enhance compositional generalization, a key challenge in robustness and out-of-distribution (OOD) performance. We further demonstrate how spatio-visual reasoning can be grounded in the layout and structure of a document, leading to more efficient and robust multimodal models.

Concretely, we introduce three new methods to the field of grounded multimodal enterprise document understanding: 1) A new mechanism to attend to fine-grained components of the input that express arithmetic operations, hence improving compositional generalization in quantitative reasoning tasks. 2) A metric-learning strategy that is grounded in counterfactually-associated samples, and leads to more robust and generalizable quantitative reasoning models. 3) A topological representation of documents that enhances performance on several multimodal tasks by grounding textual reasoning within the spatial layout of each page. We tie these methods together by proposing an evaluation strategy that accounts for fine-grained spatial and contextual grounding in a visual question-answering task. Using Visual Question Answering as an umbrella task, we demonstrate how our evaluation framework can expose shortcomings in spatio-visual and quantitative reasoning, especially when compared against human performance.

History

Date

2025-04-01

Degree Type

  • Dissertation

Thesis Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Carolyn Rosé, Sameena Shah