Carnegie Mellon University

Understanding Language Referring To The Visual World

thesis
posted on 2023-01-06, 20:23 authored by Volkan Cirik

Artificial Intelligence (AI) technologies affect many facets of our daily lives. AI systems help us manage our shopping lists, type emails faster, and answer our curious search queries. However, current AI systems have limited agency in the world – i.e., they lack the embodied sensory experience of the world that is often referred to as embodied AI. Hopefully, in the near future, embodied AI systems will allow autonomous vehicles to mobilize the visually impaired in our communities, enable robots to provide company for our elderly, and help virtual agents cooperatively teach our children complex concepts in mixed reality settings. The imminent manifestation of embodied AI in the physical world necessitates research that models the interaction between natural language and physical referents. This technical challenge has been dubbed ‘language grounding’ and is the central focus of this thesis.

In studying language grounding, we identify and define three core challenges, and then describe novel methods, analyses, and experiments that attempt to address each in turn. First, we address spatial grounding with the goal of linking language mentions of objects with their spatial locations in the world. We study this problem in the context of a fully-observable representation of the world. Second, we study the problem of sequential grounding, where observations of the world are partial (e.g., restricted to a limited field-of-view) and unfold over time as a result of the actions of the system. Partial observation makes language grounding more challenging, increasing the difficulty of accurate interpretation – e.g., an utterance may refer to something not currently in the view of the system. Third, we tackle the problem of imbuing agents with the same types of prior knowledge that humans assume and rely upon to disambiguate linguistic utterances when communicating. Human speakers tend to vastly underspecify spatial information when communicating with others, as they omit many details that they expect the listener to know already. This poses a technical challenge for language grounding: how do our agents leverage general and situated prior knowledge of the world? In this thesis, we present contributions in the form of methods, resources, and tools to strengthen our understanding of these technical challenges and to make progress towards their eventual solutions.

History

Date

2022-03-03

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Louis-Philippe Morency, Taylor Berg-Kirkpatrick
