Computer Vision News - April 2018

This RAU unit is embedded in the network, which is described next. The model consists of five main parts that we'll look at in more detail now (summarized in the figure at the beginning of the next page):

1. Image representation
• Images are resized to 448 × 448, and features are extracted from the last layer of a pre-trained 152-layer ResNet before the final pooling layer (res5c), with size 2048 × 14 × 14 (see the first sketch after this list).
• The network was evaluated with two different image feature representations: 1) ResNet features and 2) Faster R-CNN (FRCNN) features with 36 proposals per image, as suggested by Anderson et al.

2. Question representation
• The question is tokenized and encoded using an embedding layer followed by a tanh activation.
• GloVe vectors are also extracted and concatenated with the embedding layer's output.
• The concatenated vector is fed to a two-layer unidirectional LSTM, and the model uses all of its hidden states, not only the last one (see the second sketch after this list).

3. 1 × 1 convolution + PReLU
• This layer in fact performs a kind of transfer learning, since the CNN was pre-trained on a different image set than either VQA-1.0 or VQA-2.0.
• The image and question representations have different sizes; this layer projects them to a common representation size so that the results can be concatenated (see the third sketch after this list).

4. RAU
• The central innovation of the authors' model, detailed above.

5. Fusion
• The image (visual) and question (textual) branches are merged using a fusion operation.
• A multi-class classifier over the top 3000 most frequent answers processes the result of the fusion operation; it consists of a single softmax layer trained with a cross-entropy loss (see the last sketch after this list).
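To make part 1 concrete, here is a minimal sketch of the image feature extraction, assuming torchvision's ResNet-152, whose layer4 output corresponds to res5c; the exact preprocessing used in the paper may differ:

import torch
import torch.nn as nn
import torchvision.models as models

# ResNet-152 with its classification head removed: drop the final
# average-pooling and fully connected layers, keeping everything up to res5c.
resnet = models.resnet152(pretrained=True)
extractor = nn.Sequential(*list(resnet.children())[:-2])
extractor.eval()

with torch.no_grad():
    img = torch.randn(1, 3, 448, 448)   # stand-in for a resized 448 x 448 image
    feats = extractor(img)              # shape: (1, 2048, 14, 14), as in the article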
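For part 2, a minimal sketch of the question branch follows; the vocabulary size, embedding dimension, GloVe dimension, and LSTM hidden size below are hypothetical placeholders, not values from the paper:

import torch
import torch.nn as nn

vocab_size, emb_dim, glove_dim, hidden = 10000, 300, 300, 1024  # assumed sizes

embedding = nn.Embedding(vocab_size, emb_dim)
lstm = nn.LSTM(emb_dim + glove_dim, hidden, num_layers=2, batch_first=True)

tokens = torch.randint(0, vocab_size, (1, 14))   # a tokenized question
glove = torch.randn(1, 14, glove_dim)            # pre-extracted GloVe vectors
learned = torch.tanh(embedding(tokens))          # embedding layer + tanh activation
q_input = torch.cat([learned, glove], dim=-1)    # concatenate learned and GloVe embeddings
all_hidden, _ = lstm(q_input)                    # (1, 14, hidden): all hidden states are kept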
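For part 3, the 1 × 1 convolution acts as a per-location linear projection of the 2048-channel image features; the common representation size used below is a hypothetical placeholder:

import torch
import torch.nn as nn

common = 1200  # assumed common representation size
project = nn.Sequential(nn.Conv2d(2048, common, kernel_size=1), nn.PReLU())

img_feats = torch.randn(1, 2048, 14, 14)  # ResNet features from part 1
img_common = project(img_feats)           # (1, common, 14, 14), matching the question side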
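Finally, for part 5, a sketch of the classifier over the 3000 most frequent answers; since the article does not specify the fusion op at this point, element-wise multiplication is assumed purely for illustration, and PyTorch's nn.CrossEntropyLoss applies the softmax internally:

import torch
import torch.nn as nn

common, num_answers = 1200, 3000
classifier = nn.Linear(common, num_answers)
criterion = nn.CrossEntropyLoss()  # softmax + cross-entropy in one op

v = torch.randn(8, common)         # visual branch, reduced to a vector
q = torch.randn(8, common)         # question branch
fused = v * q                      # assumed fusion op; the paper's may differ
logits = classifier(fused)
target = torch.randint(0, num_answers, (8,))  # indices into the 3000 answers
loss = criterion(logits, target)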
