the goal specification can be generalized using large language models (LLMs) and visual language models (VLMs), allowing goals to be specified through coordinates, goal images, or language input, and targeting a variety of embodiments.

Structured representations were also advocated by Martial Hebert, Dean of the School of Computer Science at Carnegie Mellon University, in particular for perception as a means of structuring video input. He showed how self-supervised learning can be leveraged to learn with minimal supervision by exploiting temporal and multi-view consistencies. This is enabled through slot attention on the one hand and, on the other, extensions of implicit representations (NeRFs) and multi-task learning.

Martial Hebert (Carnegie Mellon University) on structured representations for video understanding and robotics

Cordelia Schmid, research director at Inria and researcher at Google, also argued in favor of structured representations. She presented representations for end-to-end trained agents, used for perception, mapping, and decision making, with applications covering navigation as well as manipulation. As an example, she presented neural implicit maps: differentiable latent representations that respect projective 3D geometry and are trained with imitation learning. A CLIP encoder ensures that the features are tied to semantic content and can be used with language queries.

Cordelia Schmid (Inria, Google) on structured representations for perception and decision making.

3rd AI for Robotics Workshop
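To make the language-query idea concrete, here is a minimal illustrative sketch (not the speakers' actual code): a grid of hypothetical CLIP-aligned map features is compared against a CLIP text embedding to produce a relevance heatmap. The map features are a random placeholder standing in for the output of a trained mapping model; only the CLIP text encoder call reflects a real library API.

```python
# Illustrative sketch only: querying CLIP-aligned map features with language.
# "map_features" is a random placeholder for features a trained mapping model
# would produce; in a real system they would come from the learned map.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

H, W = 32, 32                                   # hypothetical map resolution
D = model.config.projection_dim                 # CLIP embedding dimension
map_features = torch.randn(H * W, D)            # placeholder CLIP-aligned map cells

# Encode a language query with the CLIP text tower.
inputs = processor(text=["a doorway"], return_tensors="pt", padding=True)
with torch.no_grad():
    text_emb = model.get_text_features(**inputs)           # shape (1, D)

# Cosine similarity between every map cell and the query gives a relevance heatmap
# that a downstream policy could use to pick a navigation or manipulation target.
sim = F.cosine_similarity(map_features, text_emb, dim=-1)  # shape (H*W,)
heatmap = sim.view(H, W)
print("most relevant cell:", divmod(int(heatmap.argmax()), W))
```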