Visual Dialog: Towards Communicative Visual Agents

Kottur, Satwik

doi:10.1184/R1/8204363.v1

Kottur_cmu_0041E_10399.pdf (20.78 MB)

thesis

posted on 2019-05-30, 21:06 authored by Satwik Kottur

Recent years have seen significant advancements in artificial intelligence (AI). Still, we are far from intelligent agents that can visually perceive their surroundings, reason, and interact with humans in natural language, thereby becoming an integral part of our lives. As a step towards such a grand goal, this thesis proposes Visual Dialog, a task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task to serve as a general test of machine intelligence, while being grounded enough in vision to allow objective evaluation of responses. We collect VisDial, a large dataset of human dialogs on real images, to benchmark progress. To tackle several challenges in Visual Dialog, we build neural network-based models, called CorefNMN, that reason about the inputs explicitly by performing visual coreference resolution. To demonstrate the effectiveness of such models, we test them on the VisDial dataset and on a large diagnostic dataset, CLEVR-Dialog, which we synthetically generate with fully annotated dialog states. By breaking down the performance of these models according to history dependency, coreference distance, etc., we show that our models quantitatively outperform other approaches and are qualitatively more interpretable, grounded, and consistent, all of which are desirable for an AI system. We then apply Visual Dialog to visually grounded, goal-driven dialog without the need for additional supervision. Goal-driven dialog agents specialize in a downstream task (or goal), making them deployable and interesting to study. We also train these agents from scratch to solely maximize their performance (goal rewards) and study the emergent language. Our findings show that while most agent-invented languages are effective (i.e., achieve near-perfect task rewards), they are decidedly not interpretable or compositional. All our datasets and code are publicly available to encourage future research.

History

Date

2019-05-08

Degree Type

Dissertation

Department

Electrical and Computer Engineering

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

José Moura

Keywords

artificial intelligence computer vision image understanding natural language dialog visual dialog

Licence

In Copyright
