Multimodal Learning from Videos: Exploring Models and Task Complexities
Human learning is inherently multimodal. We watch, listen, read, and communicate to learn from and understand our surroundings. Advances in machine learning fields related to these human activities, such as speech recognition and computer vision, make it possible to computationally model this inherently multimodal form of learning. Multimodal video understanding, as a machine learning task, comes close to this way of learning.
This thesis proposes to break down the complex task of video understanding into a series of relatively simpler tasks of increasing complexity. We start with the monotonic task of speech recognition and introduce an end-to-end audio-visual speech recognition model. The second task, speech translation, is more complex: it must handle re-ordered output sequences in addition to recognizing speech. For speech translation, we introduce a multimodal fusion model that learns, in a semi-supervised way, to leverage the multiple views that multimodal data provides. We then progress to multimodal video summarization and question answering, abstract-level understanding tasks that further involve information compression and restructuring. Finally, we extend this work to multimodal self-rationalization, which not only performs abstract-level learning but also provides an explanation of the achieved video understanding. For these four main tasks, we present a series of multimodal fusion models designed around the nature and complexity of each task and the modalities involved, and we compare and contrast the models on commonly used video and language understanding datasets.
History
Date
- 2022-05-08
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)