Multihop Multimodal QA using Joint Attentive Training and Hierarchical Attentive Vision-Language Transformers
In this paper we address the challenges of Multihop Multimodal Question Answering (MMQA). Through an analysis of existing MMQA approaches, we identify the alignment between multimodal data and reasoning as the bottleneck in MMQA systems. We hypothesize that jointly learning to predict the relevant patches of an image along with predicting the answer can prevent models from over-fitting by forcing them to learn relationships between image regions and the question, thereby improving the reasoning process. We provide the details and analysis of three proposed approaches, and we show that explicitly learning the alignment between images and text allows multimodal models to focus on properties such as “color” and “shape” in images for VQA tasks. Our code can be found here: https://github.com/dheerajmpai/blockchainproject.
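To make the joint-training hypothesis concrete, the following is a minimal sketch (not the paper's exact implementation) of a combined objective that supervises both answer prediction and patch relevance; the tensor shapes and the weighting factor `lambda_patch` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def joint_loss(answer_logits,      # (batch, num_answers): answer scores
               patch_logits,       # (batch, num_patches): patch relevance scores
               answer_labels,      # (batch,): gold answer indices
               patch_labels,       # (batch, num_patches): binary relevance targets
               lambda_patch=0.5):  # assumed trade-off weight between the two terms
    # Standard answer-classification loss
    answer_loss = F.cross_entropy(answer_logits, answer_labels)
    # Patch-relevance loss: each image patch is scored as relevant or not to the question
    patch_loss = F.binary_cross_entropy_with_logits(patch_logits, patch_labels.float())
    # Joint objective: answering and patch selection are optimized together,
    # tying answer prediction to image-question alignment
    return answer_loss + lambda_patch * patch_loss
```

In this kind of formulation, the patch-relevance term acts as an auxiliary alignment signal, so the model cannot minimize the answer loss without also attending to the image regions tied to the question.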