End-to-End Multimodal Learning for Situated Dialogue Systems
Virtual assistants have become an essential part of many people’s lives today. These dialogue systems perform services in response to users’ voice commands, such as controlling devices, searching for information, or carrying out conversational tasks such as booking events or giving navigation instructions. However, today’s dialogue systems face two challenges: (1) they are implemented as a pipeline of multiple independently optimized modules, which do not necessarily provide the best performance when integrated, and (2) they are limited to unimodal input, i.e., speech input from the user. The modularized system design induces a disconnect between each module’s performance and the quality of the overall dialogue system, and it also makes it difficult to update the entire system for a new task, as every module needs to be changed. While the multimodal context contains rich information about the users and their surrounding environments, many dialogue systems in today’s virtual assistants interact with users using only their language input via a speech interface. Because these dialogue systems use only speech input, they are unable to provide services that require understanding the user or environmental context, for example conversing with a user about their physical surroundings.
In this thesis, we mitigate the limitations of prior dialogue systems in two ways: (1) we propose an end-to-end model that fuses the separate components of a standard spoken dialogue system and (2) we leverage multimodal contextual cues from the user and their physical surroundings. We introduce end-to-end learning for scalable dialogue state tracking, where the model directly predicts dialogue states from natural language input and can handle unseen slot values. We enhance our speech recognition system with multimodal input, using the target speaker’s mouth movements and a learned speaker embedding to improve robustness in noisy cocktail party environments. Finally, we apply end-to-end and multimodal learning to two situated dialogue tasks: vision-grounded instruction following and video question answering. The situated dialogue model directly takes as input the multimodal language and visual context from the user and the environment, and outputs system actions or natural language responses. Compared to prior methods, our proposed situated dialogue systems showed improved speech recognition accuracy, dialogue state tracking accuracy, task success rate, and response generation quality.
Department: Electrical and Computer Engineering
- Doctor of Philosophy (PhD)