Computer Vision News - April 2018

For further VQA-2.0 test-dev results for additional DRAU variations, see the original article.

RVAU in other models: To further test the effectiveness of their RVAU module, the authors ran two experiments in which RVAU replaced the attention units of other leading VQA models: MCB and MUTAN. RVAU lifted MCB's accuracy above its original performance, with the most significant improvement in the numerical question category, in line with the authors' hypothesis that the RVAU module is best suited to sequential-reasoning tasks. A measure of improvement can be seen with the MUTAN model as well.

DRAU versus the state-of-the-art: On the VQA-2.0 test-standard split, the authors' DRAU model achieved the best accuracy among models that do not use an ensemble. Although DRAU arrived in 8th place, all models that outperformed it used ensembles; the top performer used an ensemble of 30 models (a minimal ensembling sketch appears below). Using FRCNN features boosted DRAU's performance to 66.85%, outperforming some of the top seven ensemble models. The top team reported that the best single model in their ensemble using FRCNN features achieved 65.67% on the test-standard split, which the authors' best single DRAU model with FRCNN features outperforms.

DRAU versus MCB: Let's look more in depth at some qualitative results that highlight the effect of RVAU's recurrent layers compared to the MCB model. The advantages of recurrent attention units stand out especially in tasks requiring ongoing data processing, such as sequential or relational reasoning or any multi-step task, where the recurrent layers' strength at holding relevant information across an input sequence comes to the fore. These advantages can be seen in the results on a subcategory of the VQA-2.0 questions, presented in the figure below, which compares some DRAU and MCB results qualitatively. For each image: on the left are the original image, the question, and the ground-truth answer; next is RVAU's second attention map (above) -- RVAU's first attention map is a preliminary one that separates background and target objects into separate attention maps -- compared to MCB's first attention map (below), the only attention map MCB provides; and on the right is RTAU's textual attention map.
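To make the recurrent-attention idea concrete, here is a minimal sketch of an RVAU-style visual attention unit in PyTorch. All names, layer sizes, and the fusion scheme (simple concatenation of visual and question features) are illustrative assumptions, not the authors' exact architecture. What the sketch does preserve is the design choice discussed above: an LSTM scans the fused features across spatial locations, so the attention assigned to one region can depend on what was seen at earlier regions, and using two glimpses yields the two attention maps mentioned in the figure description.

```python
# A minimal sketch of a recurrent visual attention unit in the spirit of RVAU.
# Layer sizes, names, and the concatenation-based fusion are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentVisualAttention(nn.Module):
    def __init__(self, img_dim=2048, q_dim=1024, hidden=512, n_glimpses=2):
        super().__init__()
        self.reduce = nn.Conv2d(img_dim, hidden, kernel_size=1)  # 1x1 conv reduction
        self.q_proj = nn.Linear(q_dim, hidden)
        # The LSTM scans the fused features location by location, letting
        # earlier regions inform the attention placed on later ones.
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.att = nn.Linear(hidden, n_glimpses)  # one attention score per glimpse

    def forward(self, img_feats, q_feat):
        # img_feats: (B, img_dim, H, W); q_feat: (B, q_dim)
        B, _, H, W = img_feats.shape
        v = self.reduce(img_feats).flatten(2).transpose(1, 2)  # (B, HW, hidden)
        q = self.q_proj(q_feat).unsqueeze(1).expand(-1, H * W, -1)
        fused = torch.cat([v, q], dim=-1)                      # (B, HW, 2*hidden)
        h, _ = self.lstm(fused)                                # (B, HW, hidden)
        maps = F.softmax(self.att(h), dim=1)                   # (B, HW, n_glimpses)
        # Weighted sum of visual features, one attended vector per glimpse.
        attended = torch.einsum('bng,bnd->bgd', maps, v)       # (B, n_glimpses, hidden)
        return attended.flatten(1), maps.transpose(1, 2).reshape(B, -1, H, W)
```

A module like this could, in principle, be swapped in for the attention stage of a host model such as MCB or MUTAN, which is the spirit of the drop-in experiments described above; the exact wiring would depend on the host implementation.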
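As for the ensembles that topped the leaderboard, here is a minimal sketch of the standard recipe, assuming a set of independently trained VQA models sharing a common (image, question) -> logits interface; the interface and names are hypothetical:

```python
# A minimal sketch of model ensembling for VQA: average the answer
# distributions of several independently trained models. The model list
# and call signature are assumptions, not a specific team's code.
import torch

@torch.no_grad()
def ensemble_predict(models, image, question):
    probs = None
    for m in models:
        m.eval()
        p = torch.softmax(m(image, question), dim=-1)  # per-model answer distribution
        probs = p if probs is None else probs + p
    return (probs / len(models)).argmax(dim=-1)        # most probable answer id
```

Averaging the per-model distributions is the simplest scheme; weighted averaging and majority voting over the individual argmax answers are common variants.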
