Visual Dialog: Towards Communicative Visual Agents
Recent years have seen significant advances in artificial intelligence (AI). Still, we are far from intelligent agents that can visually perceive their surroundings, reason, and interact with humans in natural language, and thereby become an integral part of our lives. As a step towards this goal, this thesis proposes Visual Dialog, a task that requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in the image, infer context from the history, and answer the question accurately. Visual Dialog is disentangled enough from any specific downstream task to serve as a general test of machine intelligence, while being grounded enough in vision to allow objective evaluation of responses. We collect VisDial, a large dataset of human dialogs on real images, to benchmark progress. To tackle several challenges in Visual Dialog, we build neural network-based models, called CorefNMN, that reason about the inputs explicitly by performing visual coreference resolution. To demonstrate the effectiveness of such models, we test them on the VisDial dataset and on CLEVR-Dialog, a large diagnostic dataset that we synthetically generate with fully annotated dialog states. By breaking down the performance of these models according to history dependency, coreference distance, etc., we show that our models quantitatively outperform other approaches and are qualitatively more interpretable, grounded, and consistent, all of which are desirable for an AI system. We then apply Visual Dialog to visually-grounded, goal-driven dialog without the need for additional supervision. Goal-driven dialog agents specialize in a downstream task (or goal), making them deployable and interesting to study. We also train these agents from scratch solely to maximize their performance (goal rewards) and study the emergent language. Our findings show that while most agent-invented languages are effective (i.e., they achieve near-perfect task rewards), they are decidedly not interpretable or compositional. All our datasets and code are publicly available to encourage future research.
History
Date
2019-05-08
Degree Type
- Dissertation
Department
- Electrical and Computer Engineering
Degree Name
- Doctor of Philosophy (PhD)