Computer Vision News - March 2018

Evaluation and Results: The benchmark used in this paper contains 460,121 objects in 24,024 different images, for a total of 76,081 sequences (an image can contain more than one sequence, as seen in the right-hand image below). Each object is labeled as one of 972 visually very similar classes (as seen in the left-hand image below). Sequence lengths vary from 2 to 32 and are typically between 4 and 12. For evaluation, 80% of the data was used for training and 20% for testing.

The authors compared their method to other methods that use the same image features (extracted by AlexNet); the methods differ only in how they integrate the contextual model with the local CNN. The methods used as a basis of comparison were:

● Unary: the baseline model, i.e., the original CNN without any contextual information.

● Pairwise Statistics: a CRF model whose unary potentials come from the CNN and whose pairwise potentials are modeled using pairwise label statistics.

● BiLSTM: a bidirectional LSTM computes the posterior distribution of the current object's label from the AlexNet features, modeling the context in both directions along the sequence.

● Mixture of Statistics CRFs: this model clusters the input sequences into a mixture of k Markov models using the Expectation-Maximization (EM) algorithm. During training, the input sequences are split into k groups and the pairwise statistics are learned separately for each group. At test time, the most probable Markov model is selected for each sequence, and the corresponding pairwise-statistics CRF model is used.

● Log-linear CRF: this method learns the log-linear parameters of the linear-chain CRF.

● Class-embedding CRF: the authors' model.
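To give a concrete sense of how the pairwise-statistics CRF baseline works: the CNN's per-object (unary) log-scores are combined with log pairwise statistics between adjacent labels, and the most likely label sequence is found with Viterbi decoding. Below is a minimal NumPy sketch; the function name, array shapes, and toy numbers are illustrative assumptions, not code from the paper.

```python
import numpy as np

def viterbi_decode(unary, pairwise):
    """Most likely label sequence under a linear-chain CRF.

    unary:    (T, K) array of log-scores from the local CNN,
              one row per object in the sequence, K classes.
    pairwise: (K, K) array of log pairwise statistics, where
              pairwise[i, j] scores label j following label i.
    """
    T, K = unary.shape
    score = unary[0].copy()              # best log-score ending in each label
    back = np.zeros((T, K), dtype=int)   # backpointers for recovery
    for t in range(1, T):
        # cand[i, j] = best path ending in label i, extended with label j
        cand = score[:, None] + pairwise + unary[t][None, :]
        back[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)
    # Backtrack from the best final label.
    labels = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]
```

For example, with a strong unary vote for class 0 at the first position and pairwise statistics that favor repeating the previous label, the decoder keeps the whole sequence in class 0 even when later unary scores are ambiguous. The other baselines differ only in where these potentials come from (raw co-occurrence counts, per-cluster counts, or learned log-linear weights).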
