Computer Vision News - June 2022

CVPR 2022 Poster

… considered yet an unsolved challenge.

In all experiments, the agent is evaluated via three primary metrics: 1) Success, where an episode is considered successful if the agent issues the stop command within 0.36m (2× agent radius) of the goal; 2) Success weighted by (inverse normalized) Path Length (SPL), where success is weighted by the efficiency of the agent's path, calculated against the geodesic distance (shortest path); and 3) SoftSPL, where the binary success is replaced by progress towards the goal. A sketch of all three metrics is given below.

The authors' approach is built on the following components: 1) a CNN-based visual odometry module, referred to as VO, that, given two consecutive observations (O_t-1, O_t), predicts the change in the agent's pose between t−1 and t and then updates the goal coordinates g_t with respect to the current pose; and 2) an RNN-based RL navigation policy module, which is given the estimated goal position g_t and the current observation O_t, and predicts the next action a_t. Below you can observe an example of the combined VO+Navigation approach on the validation dataset, with performance SPL = 0.63, Success = 1, SoftSPL = 0.62.

The navigation policy consists of a two-layer Long Short-Term Memory (LSTM) and a half-width ResNet50 encoder. To evaluate the two components separately and understand the impact of localization on navigation, the policy was trained assuming perfect odometry (hence given the ground-truth location) and, only later, the VO module was used to estimate the localization as a drop-in replacement, without fine-tuning. With ground-truth localization, the agent achieves 99.8% Success and 80% SPL on the Gibson-val PointNav-v2 dataset, showing that visual odometry is the limiting factor in a map-less approach to realistic point-goal navigation, while noisy observations and actuations can be overcome easily.

The VO module is made of a ResNet encoder followed by a compression block and two fully connected (FC) layers, where BatchNorm is replaced with GroupNorm and the compression block consists of 3×3 Conv2d + GroupNorm + ReLU. It is trained on a static dataset D = {(O_t-1, O_t, a_t-1, Δpose)} and decoupled from the navigation policy. Ablation experiments added several components to the basic network and analyzed:

• The effect of action embedding, incorporating knowledge of the action taken between two consecutive observations as an additional input. This is shown to improve performance, because the network receives more context and can learn more accurate egomotion for each action type (see the last sketch below).
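To make the metrics concrete, here is a minimal Python sketch of the three quantities as they are conventionally defined for PointNav evaluation; the variable names (dist_to_goal, geodesic_dist, path_len) are illustrative and not taken from the paper.

```python
# Conventional PointNav metric definitions (a sketch, not the authors' code).

def success(dist_to_goal: float, threshold: float = 0.36) -> float:
    """1.0 if the agent issued 'stop' within the success radius, else 0.0."""
    return 1.0 if dist_to_goal <= threshold else 0.0

def spl(succeeded: float, geodesic_dist: float, path_len: float) -> float:
    """Success weighted by (inverse normalized) Path Length: the shortest-path
    distance over the max of itself and the path the agent actually took."""
    return succeeded * geodesic_dist / max(geodesic_dist, path_len)

def soft_spl(dist_to_goal: float, geodesic_dist: float, path_len: float) -> float:
    """SPL with binary success replaced by (non-negative) progress to the goal."""
    progress = max(0.0, 1.0 - dist_to_goal / geodesic_dist)
    return progress * geodesic_dist / max(geodesic_dist, path_len)
```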
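The goal-update step performed with the VO output can be illustrated with a short 2D sketch. The (dx, dz, dyaw) egomotion parameterization and the sign conventions below are assumptions; the article only states that the predicted pose change is used to re-express the goal with respect to the current pose.

```python
import math

def update_goal(goal_xz, delta_pose):
    """Re-express the goal in the agent's new frame after one step.

    goal_xz:    (x, z) goal position in the agent's frame at t-1.
    delta_pose: (dx, dz, dyaw) VO-predicted egomotion from t-1 to t
                (assumed parameterization; frame conventions vary).
    """
    gx, gz = goal_xz
    dx, dz, dyaw = delta_pose
    # Shift the origin to the agent's new position...
    tx, tz = gx - dx, gz - dz
    # ...then undo the agent's rotation.
    c, s = math.cos(-dyaw), math.sin(-dyaw)
    return (c * tx - s * tz, s * tx + c * tz)
```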
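A hedged PyTorch sketch of the VO architecture described above: a ResNet encoder with GroupNorm in place of BatchNorm, a 3×3 Conv2d + GroupNorm + ReLU compression block, and two FC layers regressing the pose change. The backbone depth (resnet18), channel counts, and input layout (two stacked RGB-D frames) are assumptions for illustration, not details confirmed by the article.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

def group_norm(num_channels: int) -> nn.GroupNorm:
    # GroupNorm replaces BatchNorm throughout, as the article describes.
    return nn.GroupNorm(num_groups=32, num_channels=num_channels)

class VOModule(nn.Module):
    def __init__(self, in_channels: int = 8, pose_dim: int = 3):
        super().__init__()
        # Assumed input: two consecutive RGB-D frames stacked -> 8 channels.
        backbone = resnet18(norm_layer=group_norm)
        backbone.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7,
                                   stride=2, padding=3, bias=False)
        # Keep only the convolutional trunk (drop avgpool and fc).
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])
        # Compression block: 3x3 Conv2d + GroupNorm + ReLU.
        self.compression = nn.Sequential(
            nn.Conv2d(512, 128, kernel_size=3, padding=1),
            group_norm(128),
            nn.ReLU(inplace=True),
        )
        # Two FC layers; LazyLinear avoids hard-coding the flattened size.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512),
            nn.ReLU(inplace=True),
            nn.Linear(512, pose_dim),  # regresses the delta-pose
        )

    def forward(self, obs_prev: torch.Tensor, obs_curr: torch.Tensor):
        x = torch.cat([obs_prev, obs_curr], dim=1)
        return self.head(self.compression(self.encoder(x)))
```

Swapping BatchNorm for GroupNorm is a common choice when batches are small or correlated, which is plausibly the motivation here, though the article does not say so explicitly.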
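Finally, a sketch of the action-embedding ablation, reusing the trunk from the VOModule sketch above. The embedding size, the four-action space (stop/forward/turn-left/turn-right, standard for PointNav), and the concatenation-based fusion are all assumptions; the article only states that the previous action is given as an additional input.

```python
class VOWithActionEmbedding(nn.Module):
    """Fuses an embedding of the previous action with the visual features
    before the FC head, so the regressor gets a per-action motion prior."""

    def __init__(self, vo: VOModule, num_actions: int = 4,
                 embed_dim: int = 16, pose_dim: int = 3):
        super().__init__()
        self.encoder, self.compression = vo.encoder, vo.compression
        self.action_embed = nn.Embedding(num_actions, embed_dim)
        self.head = nn.Sequential(
            nn.LazyLinear(512),
            nn.ReLU(inplace=True),
            nn.Linear(512, pose_dim),
        )

    def forward(self, obs_prev, obs_curr, action_prev):
        # action_prev: LongTensor of shape (B,) with discrete action indices.
        x = torch.cat([obs_prev, obs_curr], dim=1)
        feats = torch.flatten(self.compression(self.encoder(x)), 1)
        act = self.action_embed(action_prev)            # (B, embed_dim)
        return self.head(torch.cat([feats, act], dim=1))
```

Conditioning on the action gives the network an informative prior (e.g., a turn action implies mostly rotation and little translation), which is consistent with the improvement the ablation reports.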
