Computer Vision News - February 2021

RAFT (best paper ECCV)

recent deep learning methods aim at directly predicting optical flow between a pair of frames. This paper, together with a few others, integrates these two aspects through a strategy described as “learning to optimize”, in which the network uses a large number of update blocks to emulate the steps of a first-order optimization algorithm. A strong point of RAFT lies in its use of a single high-resolution flow field, updated through a large number of lightweight operators.

The model architecture is divided into:

- an encoder section that extracts 1) per-pixel features from the two paired frames, and 2) context information from just the first frame;
- a correlation layer that computes visual similarity between pixels by taking the dot product of all pairs of feature vectors, which yields a 4D W × H × W × H correlation volume. The last two dimensions of the 4D volume are then pooled at multiple scales to construct a set of multi-scale volumes, which retain high-resolution information on both large and small displacements;
- a recurrent update operator that mimics the steps of traditional optimization algorithms. It computes updates of the flow field by estimating the descent direction from features extracted from the correlation volumes. This component contains the GRU block, a gated activation unit in which the fully connected layers are replaced with convolutions. The hidden state output by the GRU block is passed through 2 convolutional layers, which return the predicted flow update; this prediction then needs to be upsampled to full resolution.

TRAINING

RAFT is implemented using the PyTorch library. The full architecture, including feature and context encoder (with slightly different

Figure 1: RAFT main components
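The correlation layer described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python lists instead of PyTorch tensors to stay dependency-free, and the function names, the toy feature values, the choice of average pooling and the kernel sizes (1, 2, 4) are all assumptions made for illustration.

```python
def correlation_volume(feat1, feat2):
    """Dot product of every feature vector in feat1 with every one in feat2.

    feat1, feat2: H x W grids (nested lists) of D-dimensional feature vectors.
    Returns corr with corr[i][j][k][l] = <feat1[i][j], feat2[k][l]>.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return [[[[dot(u, v) for v in row2] for row2 in feat2]
             for u in row1] for row1 in feat1]


def avg_pool_2d(grid, k):
    """Average-pool an H x W grid with a k x k window and stride k."""
    h, w = len(grid) // k, len(grid[0]) // k
    return [[sum(grid[i * k + di][j * k + dj]
                 for di in range(k) for dj in range(k)) / (k * k)
             for j in range(w)] for i in range(h)]


def correlation_pyramid(corr, kernels=(1, 2, 4)):
    """Pool only the last two dimensions; the first two stay at full resolution."""
    return [[[avg_pool_2d(plane, k) for plane in row] for row in corr]
            for k in kernels]


# Toy 4 x 4 feature maps with 2-dimensional features (made-up values).
feat1 = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
feat2 = [[[1.0, float(i + j)] for j in range(4)] for i in range(4)]

corr = correlation_volume(feat1, feat2)
pyramid = correlation_pyramid(corr)

# Shape of each pyramid level: the first two dimensions never shrink.
shapes = [(len(lvl), len(lvl[0]), len(lvl[0][0]), len(lvl[0][0][0]))
          for lvl in pyramid]
print(shapes)  # [(4, 4, 4, 4), (4, 4, 2, 2), (4, 4, 1, 1)]
```

Because only the last two dimensions are pooled, every pixel of the first frame keeps its own entry at each level, while the coarser levels summarise its similarity to progressively larger regions of the second frame. With a tensor library, the nested loops would typically collapse into a single matrix product over flattened feature maps.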