Posted on 2024-01-19, 21:23. Authored by Dheeraj Pai, Deigant Yadava, Vinay Nair, João Monteiro
Multihop and Multimodal Question Answering (MMQA) is the cognitive process of retrieving and combining relevant information from diverse knowledge sources (e.g., textual, visual, and audio) to answer a given question.
In this work, we propose a model for MMQA on the WebQA dataset. Our model is based on the transformer architecture, in which we jointly align and learn both the questions and the parts of images that are relevant to answering the question.
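To make the idea of aligning question tokens with image regions concrete, the following is a minimal sketch of scaled dot-product cross-attention, the basic alignment operation in transformer architectures. It is an illustration under stated assumptions, not the model proposed here: the function names `softmax` and `cross_attention`, the toy two-dimensional vectors, and the omission of learned query/key/value projections are all choices made for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(question_tokens, image_regions):
    """Attend from each question token over all image regions.

    Both inputs are lists of equal-length float vectors (hypothetical
    embeddings). A real transformer would first apply learned linear
    projections; they are omitted here for brevity.
    Returns (weights, attended): one weight row and one attended
    vector per question token.
    """
    dim = len(image_regions[0])
    scale = math.sqrt(dim)
    weights, attended = [], []
    for q in question_tokens:
        # Similarity of this token to every region, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, r)) / scale for r in image_regions]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of region vectors: the image content aligned to q.
        attended.append(
            [sum(w * r[d] for w, r in zip(row, image_regions)) for d in range(dim)]
        )
    return weights, attended


# Toy inputs: two question tokens, three image regions.
question_tokens = [[1.0, 0.0], [0.0, 1.0]]
image_regions = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
weights, attended = cross_attention(question_tokens, image_regions)
```

In this toy setup, each question token places its largest attention weight on the image region whose vector points in the same direction, which is the sense in which attention "aligns" the two modalities.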