
Towards Knowledge-capable AI: Agents that See, Speak, Act and Know

posted on 21.04.2022, 19:57 by Kenneth Marino

The field of artificial intelligence has been interested in knowledge since its early days, using carefully crafted rules and curated knowledge from humans to build effective expert systems. Since then, many fields, such as computer vision and natural language processing, have been dominated by large-scale end-to-end learning using large datasets. This has often left knowledge as an afterthought for many important problems. However, as our performance on marquee challenges and datasets such as the ImageNet Challenge [294] saturates and the field becomes more concerned with problems such as large-category recognition and full embodied AI (agents that require understanding of multiple modalities), knowledge will become even more important. In this thesis, we argue that to achieve the goal of clever robots, or embodied AI, we need to deal with all three modalities of vision, language and action. We further argue that knowledge is the critical piece in connecting these modalities. In our contributions, we present different slices of these cross-modal settings to build an understanding of how knowledge can be used to join these modalities. First, we look at how to incorporate knowledge into neural network architectures in vision problems. Next, we examine how we can combine the modalities of vision and language. We introduce a benchmark for vision and language that requires models with the capability to bring in and reason about knowledge about the world. Then we develop a method on that dataset which combines two types of knowledge: knowledge graphs and implicit knowledge from large language models. We then examine the action modality, first showing that the knowledge inherent in language models can be used to solve a highly complex, semantic crafting task. Then, we apply knowledge in the robotics setting of task-oriented grasping and see how we can use knowledge to allow agents to perform tasks on never-before-seen object categories and new tasks. Finally, we start to move in the opposite direction and look at how knowledge can be created. We show how an action policy can be used by agents to build up their own knowledge of the world.




Degree Type



Machine Learning

Degree Name

  • Doctor of Philosophy (PhD)


Abhinav Gupta