Computer Vision News - April 2018
Given an image and a question, we create the input representations v and q. Generally speaking, these features are combined by 1×1 convolutions into two separate branches. The innovation proposed by the authors is the inclusion of Recurrent Attention Units in both branches: a recurrent visual attention unit (RVAU) on the image branch and a recurrent textual attention unit (RTAU) on the question branch. The branches are integrated using a fusion operation and fed to the final classifier. This network structure enriches both the visual and textual VQA features, improves attention, and enables the model to learn relations between local visual and local textual features.

The proposed model outperforms the top performer on the VQA-1.0 benchmark and achieves results comparable to those of the state-of-the-art contenders on the VQA-2.0 benchmark. Moreover, it does so as a single model without ensembles, while the top performers on VQA-2.0 were all ensembles of 20 or more models. Furthermore, the authors' Recurrent Attention Unit is modular and can easily be substituted for the existing attention units of other networks: the authors ran several tests of other leading VQA models incorporating their Recurrent Attention Unit, which showed improved accuracy.

Method:
Let's start with a close look at the Recurrent Attention Unit (RAU). The RAU has two branches: the Attention branch and the Input branch. As can be seen in the figure below, the Attention branch gets the input, passes it through 1×1 convolution and PReLU layers, then an LSTM and another 1×1 convolution and PReLU, and finally a Softmax layer. This branch produces an attention map: a map of weights telling the model which parts of the input to give more weight to and which parts less.
The weight map is multiplied by the Input features (the second branch).
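
To make the data flow through the RAU concrete, here is a minimal NumPy sketch of the two branches described above. It is an illustration under stated assumptions, not the authors' implementation: the function names, the parameter dictionary, the layer sizes, the fixed PReLU slope, the choice to run the LSTM over the input locations as a sequence, and the single-channel attention map are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def prelu(x, slope=0.25):
    # PReLU: identity for positive values, a slope for negative ones.
    # The slope is learned in practice; it is fixed here for illustration.
    return np.where(x > 0, x, slope * x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())  # subtract the max for numerical stability
    return e / e.sum()

def lstm(seq, W, U, b):
    # Plain LSTM over a sequence of shape (n, d_in); returns all hidden states (n, h).
    hidden = U.shape[0]
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    outputs = []
    for x in seq:
        gates = x @ W + h @ U + b
        i, f, o, g = np.split(gates, 4)  # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs.append(h)
    return np.stack(outputs)

def init_params(channels, hidden, rng):
    # Hypothetical random parameters; a real model would learn these.
    return {
        "conv1": rng.normal(scale=0.1, size=(channels, hidden)),
        "W": rng.normal(scale=0.1, size=(hidden, 4 * hidden)),
        "U": rng.normal(scale=0.1, size=(hidden, 4 * hidden)),
        "b": np.zeros(4 * hidden),
        "conv2": rng.normal(scale=0.1, size=(hidden, 1)),
    }

def recurrent_attention_unit(features, params):
    # features: (n_locations, channels), e.g. flattened image regions
    # or question tokens (an assumption about the input layout).
    # Attention branch: 1x1 conv -> PReLU -> LSTM -> 1x1 conv -> PReLU -> Softmax.
    # A 1x1 convolution applies the same linear map at every location,
    # so it is written as a matrix product here.
    x = prelu(features @ params["conv1"])
    x = lstm(x, params["W"], params["U"], params["b"])
    scores = prelu(x @ params["conv2"])[:, 0]
    attention = softmax(scores)  # one weight per location, summing to 1
    # Input branch: the attention map re-weights the input features.
    attended = attention[:, None] * features
    return attention, attended

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 8))  # 6 locations, 8 channels (toy sizes)
attention, attended = recurrent_attention_unit(features, init_params(8, 4, rng))
print(attention.shape, attended.shape)
```

Under these assumptions the same function could serve as either the RVAU or the RTAU, with only the meaning of a "location" changing (an image region in one case, a question token in the other). How the weighted features are pooled before fusion is not covered in the text above, so the sketch returns them unpooled.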