Computer Vision News - August 2019

Learning the Depths

The paper proposes a network that performs regression to predict a depth map for a given image of interest. The input to the network comprises four components: the input image, an input depth map, an input confidence map, and a binary human mask.

Training is supervised, using an estimated depth map as ground truth. To compute this depth map, camera poses are first extracted with an existing SLAM method. Using these camera poses, raw depth maps are then estimated with the COLMAP software (considered a state-of-the-art multi-view stereo (MVS) pipeline). A heuristic technique is also applied to filter out erroneous depths and unsuitable clips. The output of this process serves as the ground truth depth map.

The input depth map is an initial estimate computed from motion parallax. To give the network geometric information, the authors use the motion parallax between two frames. This provides an initial depth estimate for the static regions of the scene (assuming that the humans are the dynamic part). Given two frames, they estimate the optical flow between them using FlowNet2.0. Then, using the relative camera poses, they compute the initial depth map from the estimated flow field, using a representation called Plane-Plus-Parallax (P+P).

The next input to the network is the confidence map. In video clips with motion blur, shadows, or low lighting, the optical flow may be noisy. For each pixel in the non-human region, the confidence map gives a score between 0 and 1 that measures how well the flow field complies with the epipolar constraint between the two views. This allows the network to rely more on input depth values with high confidence, which improves its performance.

The figure below visualizes the different components of the input:
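To make the epipolar check behind the confidence map more concrete, here is a minimal per-pixel sketch in plain Python. It is an illustration under assumptions, not the authors' formula: the function name `epipolar_confidence`, the threshold `tau`, the quadratic falloff from distance to score, and the choice to return 0 for human pixels are all hypothetical. The only idea taken from the text is that a flow vector is trusted more when the displaced pixel lies close to its epipolar line, which is defined here by a fundamental matrix `F` derived from the relative camera pose.

```python
import math


def epipolar_confidence(pixel, flow, F, is_human, tau=2.0):
    """Score in [0, 1] for how well one flow vector obeys the epipolar constraint.

    pixel    : (x, y) location in the first frame
    flow     : (dx, dy) optical flow at that pixel, in pixels
    F        : 3x3 fundamental matrix (nested lists) with x2^T F x1 = 0
    is_human : True if the pixel lies inside the human mask
    tau      : assumed distance (in pixels) at which the score reaches 0
    """
    if is_human:
        # Assumption: no score is defined for human pixels, so return 0.
        return 0.0

    x, y = pixel
    x2, y2 = x + flow[0], y + flow[1]  # where the flow sends the pixel

    # Epipolar line in the second frame: (a, b, c) = F @ (x, y, 1).
    a = F[0][0] * x + F[0][1] * y + F[0][2]
    b = F[1][0] * x + F[1][1] * y + F[1][2]
    c = F[2][0] * x + F[2][1] * y + F[2][2]

    norm = math.hypot(a, b)
    if norm == 0.0:
        return 0.0  # degenerate line, no reliable constraint

    # Point-to-line distance between the displaced pixel and its epipolar line.
    dist = abs(a * x2 + b * y2 + c) / norm

    # Assumed mapping from distance to score: 1 on the line, 0 beyond tau.
    return max(0.0, 1.0 - (dist / tau) ** 2)


# Example: a camera translating along x (identity intrinsics), so epipolar
# lines are horizontal and any vertical flow component violates the constraint.
F_sideways = [[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]]
consistent = epipolar_confidence((2.0, 1.0), (3.0, 0.0), F_sideways, False)
noisy = epipolar_confidence((2.0, 1.0), (3.0, 1.0), F_sideways, False)
```

In a real pipeline this would be evaluated for every pixel at once with array operations. The per-pixel form is used here only to keep the geometry readable.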