big 3D map containing more frames, giving us more camera poses. With more camera poses, we can localize more objects and get more accurate 3D object localization."

Another challenge Jinjie encountered involved integrating the VQ2D task, which is closely related to VQ3D. VQ2D seeks to localize objects specified by query images within egocentric videos. Past approaches simply chained the two tasks, feeding VQ2D results directly into VQ3D, but this naive combination limits performance. VQ2D outputs a tracking result for the query object, and prior methods typically take the last frame of that track; in that frame, however, the object is often already outside the image, which makes depth estimation unreliable, hurts 2D bounding box accuracy, and ultimately hinders 3D localization.

To address this, Jinjie proposed a novel strategy. "We input the egocentric video into the 2D detection network," he explains. "The detection network computes the similarity between the proposals and the query object to give a similarity score. For each frame, this score tells you how similar the top prediction is to the object you want to find. After we get that query score, we propose to use the peaks of those similarity scores as the positive candidate proposals. We believe those peaks show the appearance of the target object. We extract those peaks and get a 2D bounding box for each candidate. Since we already have the 3D camera poses from the earlier egocentric structure from motion, we can backproject those 2D proposals into the 3D world to get 3D predictions. Finally, we use the confidence score to do a weighted average over those candidates, aggregating them into a single final prediction: the object's location in the 3D environment."
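The aggregation pipeline Jinjie describes can be sketched end to end in a few lines. The snippet below is a minimal illustration, not his implementation: it assumes hypothetical inputs (per-frame similarity scores and 2D boxes from the detector, depth maps, camera-to-world poses from structure from motion, and pinhole intrinsics K), and it uses scipy.signal.find_peaks as a stand-in for the peak selection step.

```python
import numpy as np
from scipy.signal import find_peaks

def localize_query_object(scores, boxes, depths, poses, K, min_score=0.5):
    """Aggregate per-frame detections into a single 3D prediction.

    scores : (T,)       similarity score of the top proposal per frame
    boxes  : (T, 4)     matching 2D boxes as (x1, y1, x2, y2)
    depths : (T, H, W)  per-frame depth maps
    poses  : (T, 4, 4)  camera-to-world matrices from egocentric SfM
    K      : (3, 3)     pinhole camera intrinsics
    """
    # Treat peaks in the similarity curve as positive candidates:
    # frames where the query object is believed to appear.
    peaks, _ = find_peaks(scores, height=min_score)

    points, weights = [], []
    for t in peaks:
        x1, y1, x2, y2 = boxes[t]
        u, v = (x1 + x2) / 2.0, (y1 + y2) / 2.0  # box centre in pixels
        z = depths[t, int(v), int(u)]            # depth at the box centre
        # Backproject the pixel into the camera frame...
        p_cam = z * np.linalg.inv(K) @ np.array([u, v, 1.0])
        # ...then into the world frame using the SfM camera pose.
        p_world = (poses[t] @ np.append(p_cam, 1.0))[:3]
        points.append(p_world)
        weights.append(scores[t])

    if not points:
        return None  # no confident sighting of the query object

    # Confidence-weighted average over all candidate 3D predictions.
    return np.average(points, axis=0, weights=weights)
```

Weighting by the similarity scores lets the most confident sightings dominate the final estimate while averaging away noise from individual frames.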