Computer Vision News - April 2018
Dual Recurrent Attention Units for Visual Question Answering

Research by Assaf Spanier

Every month, Computer Vision News reviews a research paper from our field. This month we have chosen to review Dual Recurrent Attention Units for Visual Question Answering. We are indebted to the authors, Ahmed Osman and Wojciech Samek, for allowing us to use images from the paper to illustrate this review. Their work is here.

Introduction:

In recent years, deep learning with convolutional neural networks (CNNs) and recurrent neural networks (RNNs) has produced several breakthroughs and impressive results in numerous computer vision and natural language processing (NLP) tasks. However, Visual Question Answering (VQA) tasks, which require a model to achieve a comprehensive understanding of both the image and the question posed in natural language, remain a challenge. VQA tasks are inherently multimodal, requiring the model to integrate visual and textual representations.

Early VQA models used global image features and had difficulty giving accurate answers to questions that referenced local elements of the input (e.g., a specific area or object in an image). In an attempt to overcome this inherent drawback of global features, attention mechanisms were introduced into a variety of more recent VQA models, but these did not bring the anticipated performance improvement.

To improve attention performance and overall VQA results, the authors propose a new VQA network architecture that includes both image and textual Recurrent Attention Unit elements. Like other VQA networks, this model has two branches: one encoding the textual question and the other encoding the image. The following figure is a schematic overview of the network architecture:
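To make the two-branch idea concrete, here is a minimal, hypothetical PyTorch sketch of a generic attention-based VQA model. All names and dimensions (TwoBranchVQA, hidden_dim, num_answers, the 196-region grid, etc.) are illustrative assumptions of ours; this is not the authors' DRAU implementation, which places recurrent attention units in both branches.

```python
import torch
import torch.nn as nn

class TwoBranchVQA(nn.Module):
    """Generic two-branch VQA sketch: one branch encodes the question,
    the other encodes the image; an attention step fuses the two."""

    def __init__(self, vocab_size=10000, embed_dim=300,
                 hidden_dim=512, img_feat_dim=2048, num_answers=3000):
        super().__init__()
        # Textual branch: embed the question tokens, encode with an RNN.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.question_rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # Visual branch: project per-region CNN features (e.g. a 14x14 grid
        # flattened to 196 regions) into the same hidden space.
        self.img_proj = nn.Linear(img_feat_dim, hidden_dim)
        # Attention over image regions, conditioned on the question.
        self.att_score = nn.Linear(hidden_dim * 2, 1)
        # Classifier over a fixed answer vocabulary.
        self.classifier = nn.Linear(hidden_dim * 2, num_answers)

    def forward(self, question_tokens, img_feats):
        # question_tokens: (batch, seq_len) token ids
        # img_feats: (batch, num_regions, img_feat_dim) CNN features
        _, (h, _) = self.question_rnn(self.embed(question_tokens))
        q = h[-1]                                   # (batch, hidden_dim)
        v = torch.tanh(self.img_proj(img_feats))    # (batch, regions, hidden_dim)
        # Score each region against the question, then softmax-normalize.
        q_tiled = q.unsqueeze(1).expand_as(v)
        scores = self.att_score(torch.cat([v, q_tiled], dim=-1))
        att = torch.softmax(scores, dim=1)          # (batch, regions, 1)
        attended_v = (att * v).sum(dim=1)           # (batch, hidden_dim)
        # Fuse the two modalities and predict an answer.
        return self.classifier(torch.cat([q, attended_v], dim=-1))

# Example: a batch of 2 questions (length 8) over 196 image regions.
model = TwoBranchVQA()
logits = model(torch.randint(0, 10000, (2, 8)),
               torch.randn(2, 196, 2048))
print(logits.shape)  # torch.Size([2, 3000])
```

The sketch mirrors the usual design choice in this family of models: the question vector conditions a soft attention over image regions, so the answer classifier sees a question-aware summary of the image rather than a single global feature vector, which is exactly the weakness of the early global-feature models described above.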