Computer Vision News - December 2018

5. Method: Model Structures & Energies

Let’s set out the paper’s model explicitly. The total energy of the model is defined as follows:

$$E(\mathbf{x}) = \sum_{i \in \mathcal{V}} E_u(x_i) + \sum_{(i,j) \in \mathcal{N}_T} E_t(x_i, x_j) + \sum_{c \in \mathcal{S}} E_s(\mathbf{x}_c)$$

● $\mathbf{x}$ is the set of discrete random variables defined over all pixels in the video sequence
● $E_u(x_i)$ is the unary energy, equal to the negative log likelihood of the label of each random variable
○ $\mathcal{V}$ is the set of pixels in the video
● $E_t(x_i, x_j) = \theta_t w_{ij} (x_i - x_j)^2$ is the temporal energy
○ $w_{ij}$ is the weight of the temporal connection between pixels $i$ and $j$
○ $\mathcal{N}_T$ is the set of temporal connections between pixels; it defines a semi-dense optical flow by specifying which neighboring pixels are connected to each other
● $E_s(\mathbf{x}_c)$ is the spatial energy, a higher-order term defined on all the variables of a clique
○ $\mathbf{x}_c$ is the set of variables in clique $c$
○ $\mathcal{S}$ is the set of spatial cliques; here, all pixels of a frame form one spatial clique
● $\theta_u$, $\theta_t$ and $\theta_s$ are the weights that balance the unary, temporal and spatial energy terms

Exact MAP inference is NP-hard in general. The higher-order energy in this model makes the inference problem even harder: it is intractable even with efficient approximate algorithms such as mean-field. Intuitively, this is because the algorithm needs to evaluate the total energy of the MRF for every frame in the video, requiring a CNN pass for each.

Inference

To make the problem tractable, the authors decouple the temporal energy $E_t(x_i, x_j)$ and the spatial energy $E_s(\mathbf{x}_c)$ by introducing an auxiliary variable $\mathbf{y}$, and minimize the following approximation of Eq. (3) instead:

$$\hat{E}(\mathbf{x}, \mathbf{y}) = \sum_{i \in \mathcal{V}} E_u(x_i) + \sum_{(i,j) \in \mathcal{N}_T} E_t(x_i, x_j) + \frac{\beta}{2} \lVert \mathbf{x} - \mathbf{y} \rVert_2^2 + \sum_{c \in \mathcal{S}} E_s(\mathbf{y}_c)$$

The quadratic term, weighted by $\beta$, ties $\mathbf{y}$ to $\mathbf{x}$, so the spatial energy is evaluated on $\mathbf{y}$ while the unary and temporal energies are evaluated on $\mathbf{x}$.
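To make the decoupled objective concrete, here is a minimal NumPy sketch that evaluates an energy of this shape: unary terms and temporal terms on the labels x, a quadratic coupling between x and the auxiliary variable y, and a spatial term on y. It is an illustration under stated assumptions, not the authors' implementation. The names `approx_energy`, `theta_t`, `beta` and `spatial_energy` are hypothetical, the unary balance weight is assumed to be already folded into the `unary` array, and `spatial_energy` is a placeholder callable standing in for the CNN-based higher-order term.

```python
import numpy as np

def approx_energy(x, y, unary, edges, w, cliques, spatial_energy,
                  theta_t=1.0, beta=1.0):
    """Evaluate a decoupled energy E_hat(x, y) on integer label vectors.

    x, y     : (P,) labels of all P pixels in the video (y is auxiliary)
    unary    : (P, L) unary energies, i.e. negative log likelihoods
               (assumed to already include the unary balance weight)
    edges    : (M, 2) index pairs (i, j) of temporal connections
    w        : (M,) weights of those temporal connections
    cliques  : list of index arrays, one per frame (one spatial clique each)
    spatial_energy : callable mapping the labels of one clique to a float;
                     hypothetical stand-in for the CNN-based spatial term
    """
    x = np.asarray(x)
    y = np.asarray(y)

    # Unary term: pick the energy of the chosen label for every pixel.
    e_unary = unary[np.arange(x.size), x].sum()

    # Temporal term: theta_t * w_ij * (x_i - x_j)^2 over all connections.
    i, j = edges[:, 0], edges[:, 1]
    diff = (x[i] - x[j]).astype(float)
    e_temporal = theta_t * np.sum(w * diff ** 2)

    # Coupling term: (beta / 2) * ||x - y||_2^2 keeps y close to x.
    e_coupling = 0.5 * beta * np.sum((x - y).astype(float) ** 2)

    # Spatial term: evaluated on the auxiliary labels, one clique per frame.
    e_spatial = sum(spatial_energy(y[c]) for c in cliques)

    return e_unary + e_temporal + e_coupling + e_spatial


# Toy video: 2 frames of 2 pixels each, binary labels.
toy_unary = -np.log(np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.4, 0.6]]))
toy_edges = np.array([[0, 2], [1, 3]])   # pixel k in frame 0 <-> pixel k in frame 1
toy_w = np.array([1.0, 0.5])
toy_cliques = [np.array([0, 1]), np.array([2, 3])]

if __name__ == "__main__":
    x = np.array([0, 1, 0, 1])
    y = np.array([0, 1, 0, 0])
    # The spatial energy is a dummy here; a real model would run a CNN per frame.
    print(approx_energy(x, y, toy_unary, toy_edges, toy_w, toy_cliques,
                        lambda y_c: 0.0))
```

Because x and y only interact through the quadratic coupling, the terms that depend on x (unary, temporal, coupling) can be evaluated without touching the spatial term, and the spatial term only needs y. That separation is what the auxiliary variable buys.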