performance over the baseline. The task hinges on localizing objects in 3D space based on pictures illustrating the objects to find. “Building on the best existing method, our solution improves the camera pose estimation, the performance of the VQ2D detection network, and the view backprojection and multi-view aggregation,” he tells us. “After these steps, we do depth estimation for the objects we’re looking for and then aggregate them together to get the final prediction.”

Egocentric videos are inherently dynamic, with freely changing viewpoints and motion blur. Ego4D proposed performing camera pose estimation by relocalizing egocentric video frames against a Matterport scan. However, the noisy nature of Matterport scans leads to low accuracy and poor performance when matching the two.

“We identified this problem and proposed running structure from motion inside the egocentric videos to construct the correct correspondences between the frames for a complete 3D map,” Jinjie reveals. “This insight has improved the performance greatly. Since we choose to run the structure from motion just for the egocentric video, we can construct a
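To make the structure-from-motion idea concrete, here is a minimal sketch of reconstructing camera poses from the egocentric frames themselves, using COLMAP’s Python bindings (pycolmap), rather than relocalizing each frame against a Matterport scan. The file paths are placeholders, and the exhaustive matcher is an assumption chosen for brevity; video frames are often matched sequentially instead. This illustrates the general technique, not the team’s exact code.

```python
import os
import pycolmap

# Placeholder paths (assumptions, not the team's actual layout).
database = "vq3d.db"
frames_dir = "egocentric_frames/"  # frames extracted from the video
out_dir = "sfm_out/"
os.makedirs(out_dir, exist_ok=True)

# Run SfM over the video frames alone to build a self-consistent map.
pycolmap.extract_features(database, frames_dir)
pycolmap.match_exhaustive(database)  # sequential matching may suit video better
maps = pycolmap.incremental_mapping(database, frames_dir, out_dir)

# Per-frame camera poses now live in the reconstruction's images.
print(maps[0].summary())
```

Because all correspondences come from the video itself, the recovered poses avoid the noise introduced by matching against an external scan.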
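The backprojection and multi-view aggregation step mentioned earlier can likewise be sketched in a few lines. The version below reduces each VQ2D detection to its box centre and takes a median over per-frame backprojections; the function names, the single-point simplification, and the median-based aggregation are illustrative assumptions rather than the published method.

```python
import numpy as np

def backproject_detection(uv, depth, K, cam_from_world):
    """Lift a 2D detection (pixel uv plus an estimated depth) into world space.

    uv             : (2,) pixel centre of the VQ2D detection box
    depth          : scalar depth estimate for the object (metres)
    K              : (3, 3) camera intrinsics
    cam_from_world : (4, 4) camera pose from SfM (world -> camera)
    """
    # Unproject the pixel into a 3D point in the camera frame.
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    p_cam = depth * ray
    # Move the point into world coordinates by inverting the pose.
    world_from_cam = np.linalg.inv(cam_from_world)
    return world_from_cam[:3, :3] @ p_cam + world_from_cam[:3, 3]

def aggregate_views(points):
    """Multi-view aggregation: a robust centre over per-frame backprojections."""
    return np.median(np.stack(points), axis=0)
```

Aggregating over many frames in this way damps the effect of any single noisy pose or depth estimate on the final 3D prediction.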