Computer Vision News - August 2019
The network is trained with a linear combination of three losses:
(a) MSE loss - the log-space depth difference between two pixels in the prediction and the same two pixels in the ground truth, averaged over all pairs of valid pixels;
(b) multi-scale gradient loss - the L1 difference between the predicted and ground-truth log-depth derivatives (in both image directions), computed across multiple scales;
(c) multi-scale edge-aware smoothness loss - a penalty on the L1 norm of the log-depth derivatives, weighted by the first- and second-order image derivatives.
Note that all three losses operate in log space to handle the scale ambiguity inherent in depth estimation.

The authors evaluated their method on three different datasets and demonstrated several depth-based visual effect applications. For quantitative evaluation, they report the scale-invariant RMSE (si-RMSE), which is the square root of the first loss component (described in (a)). The evaluation is broken down over five image regions: si-full measures the error over all pairs of pixels; si-env over pairs in the non-human regions; si-hum over pairs in which at least one pixel lies in the human region; si-intra within the human region; and si-inter over pairs with one pixel in the human region and one in the non-human region.

Below you can see the quantitative results presented in the paper. The upper part of the table lists the competing methods, while the lower part lists variants of the paper's method, each using a different set of inputs (image; image + depth; image + depth + confidence; etc.). It can be seen that even variant I - the paper's single-image depth model, without the additional inputs - outperforms all the other methods.