Computer Vision News - January 2018
2. The text question features: the question is split into words on spaces and punctuation. Only the first 14 words of each question are used (almost nothing is lost, since only 0.25% of the questions in the dataset are longer). Each word is mapped to a 300-dimensional vector, learned along with the other parameters during training. The resulting sequence of word embeddings, of size 14 × 300, is passed through a Gated Recurrent Unit (GRU). The question embedding is the 512-dimensional internal state of the recurrent unit (a sketch of these modules is given at the end of this section).

3. Here too, as in the first model, the final step fuses the features extracted from the text and the image:

# question encoding
emb = self.wembed(question)
enc, hid = self.gru(emb.permute(1, 0, 2))
qenc = enc[-1]

# element-wise multiplication of question and image features
q = self._gated_tanh(qenc, self.gt_W_question, self.gt_W_prime_question)
v = self._gated_tanh(v_head, self.gt_W_img, self.gt_W_prime_img)
h = torch.mul(q, v)

# output classifier
s_head = self.clf_w(self._gated_tanh(h, self.gt_W_clf, self.gt_W_prime_clf))

(The _gated_tanh helper used here is sketched at the end of this section.)

In all cases, the CNN is pre-trained and held fixed during the training of the VQA model. The image features can therefore be extracted from the input images once, as a preprocessing step, for efficiency (see the extraction sketch at the end of this section).

Results: to get a taste of the results, let's compare the two models' performance on the VQA v2 test-standard set (you can find more results and analysis in the links below).

Code and installation instructions:
1. The first model: https://github.com/anantzoid/VQA-Keras-Visual-Question-Answering
2. The second model: https://github.com/markdtw/vqa-winner-cvprw-2017

VQA v2 test-std   All     Yes/No   Number   Others
Antol et al.      54.22   73.46    35.18    41.83
Teney et al.      70.34   86.60    48.64    61.15
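For concreteness, here is a minimal sketch of the question-encoding modules behind the wembed and gru calls in the snippet above, using the dimensions given in the text (300-d word vectors, 512-d GRU state, 14-word questions). The vocabulary size, padding handling and batch size are assumptions for illustration only.

import torch
import torch.nn as nn

# Question-encoding modules (dimensions from the text; vocab_size is hypothetical).
vocab_size = 13000                                     # hypothetical vocabulary size
wembed = nn.Embedding(vocab_size, 300, padding_idx=0)  # learned 300-d word vectors
gru = nn.GRU(input_size=300, hidden_size=512)          # 512-d question embedding

# question: tensor of word indices, shape (batch, 14); questions are
# truncated/padded to 14 words as described above.
question = torch.randint(1, vocab_size, (32, 14))
emb = wembed(question)                  # (batch, 14, 300)
enc, hid = gru(emb.permute(1, 0, 2))    # GRU expects (seq_len, batch, features)
qenc = enc[-1]                          # final state: (batch, 512)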
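The fusion code relies on a _gated_tanh helper that is not shown in the snippet. Below is a minimal stand-alone sketch, assuming the gated tanh non-linearity described by the authors of the second model, y = tanh(Wx) * sigmoid(W'x); the layer shapes and the 512-d fusion size are assumptions.

import torch
import torch.nn as nn

def gated_tanh(x, W, W_prime):
    # Gated tanh activation: tanh(W x) gated element-wise by sigmoid(W' x).
    # W and W_prime are assumed to be nn.Linear layers mapping the input
    # into a common 512-d fusion space.
    y_tilde = torch.tanh(W(x))        # candidate features
    g = torch.sigmoid(W_prime(x))     # gate in [0, 1]
    return y_tilde * g                # element-wise gating

# Hypothetical usage mirroring the snippet above:
gt_W_question = nn.Linear(512, 512)
gt_W_prime_question = nn.Linear(512, 512)
qenc = torch.randn(32, 512)
q = gated_tanh(qenc, gt_W_question, gt_W_prime_question)   # (batch, 512)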
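Finally, since the CNN is held fixed, its features can be computed once and cached before training. The sketch below illustrates the idea with a pre-trained torchvision ResNet as a stand-in backbone; the two models discussed here use their own feature extractors, so treat the specific network and file names as placeholders.

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Offline feature extraction with a fixed, pre-trained CNN (stand-in backbone).
cnn = models.resnet152(pretrained=True)
cnn = torch.nn.Sequential(*list(cnn.children())[:-1])  # drop the classifier head
cnn.eval()                                             # held fixed: no training

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

with torch.no_grad():
    img = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
    feat = cnn(img).flatten(1)           # (1, 2048) image feature vector
    torch.save(feat, "example_feat.pt")  # cache to disk for training the VQA model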