Computer Vision News - June 2024

Learning Interactive Real-World Simulators

“…physically interacting with the laptop screen; it’s generated in the video by the user giving this language instruction.”

There are many applications for such a realistic simulator. While game or film production immediately comes to mind, augmented reality is another possibility, where users can interact with an imagined world by issuing commands. This research primarily focuses on embodied AI, leveraging these simulated experiences to control robots.

“We start with an image of a robot facing a table with some objects on the table, and then give a language instruction to the simulator conditioned on this first frame,” Sherry describes. “We say something like ‘Grasp the banana’ or ‘Open the drawer and put the fruits in the drawer.’ The simulator will then generate a video of the robot executing this action – moving its arm closer to the object, picking it up, and putting it into the drawer.”

The model can simulate this because it has been trained on videos of other robots or humans performing tasks, enabling it to interpolate and predict actions based on what it has learned.

Translating the generated videos into real-world robot actions involves training an inverse dynamics model. That model takes the video of the robot executing the task and predicts the low-level control actions between consecutive frames, such as joint movements, required to perform the task. “After we have this inverse dynamics model, we convert the generated video into low-level robot controls and execute these on the actual robot,” Sherry reveals. “In the paper, we’ve shown situations where the simulated execution looks similar to the real execution of the robot.”
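To make the two-stage pipeline concrete, here is a minimal sketch of the first stage: asking a video model for a rollout conditioned on one first frame and one language instruction. The `DummyVideoModel` class and its `sample` signature are hypothetical placeholders for illustration, not the actual interface of the model described in the paper.

```python
# Sketch of stage one: a text- and first-frame-conditioned video rollout.
# DummyVideoModel is a hypothetical stand-in for the trained generator.

import numpy as np

class DummyVideoModel:
    """Placeholder for a video generator conditioned on a first frame
    and a language instruction."""
    def sample(self, first_frame: np.ndarray, instruction: str,
               num_frames: int = 16) -> np.ndarray:
        # A trained model would predict future frames showing the robot
        # executing the instruction; here we repeat the conditioning
        # frame so the sketch stays runnable.
        return np.repeat(first_frame[None], num_frames, axis=0)

simulator = DummyVideoModel()
first_frame = np.zeros((256, 256, 3), dtype=np.uint8)  # robot facing the table
video = simulator.sample(first_frame,
                         "Open the drawer and put the fruits in the drawer")
print(video.shape)  # (16, 256, 256, 3): one generated rollout
```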
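The second stage is the inverse dynamics model. Below is a minimal sketch of how such a model could be structured and applied pair by pair over consecutive frames, assuming 7-dimensional joint-delta actions and 64x64 RGB frames; the actual architecture and action space used in the paper may differ.

```python
# Sketch of stage two: an inverse dynamics model that regresses the
# low-level action taking the robot from frame t to frame t+1.

import torch
import torch.nn as nn

class InverseDynamics(nn.Module):
    def __init__(self, action_dim: int = 7):
        super().__init__()
        # Two RGB frames stacked along the channel axis -> 6 input channels.
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, action_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

def video_to_actions(model: InverseDynamics, video: torch.Tensor) -> torch.Tensor:
    """Convert a generated video (T, 3, H, W) into T-1 low-level actions
    by running the inverse dynamics model on each consecutive frame pair."""
    pairs = zip(video[:-1], video[1:])
    return torch.stack([model(a.unsqueeze(0), b.unsqueeze(0)).squeeze(0)
                        for a, b in pairs])

# Usage on a dummy 16-frame rollout: the resulting (15, 7) action sequence
# is what would be executed on the actual robot.
idm = InverseDynamics()
rollout = torch.rand(16, 3, 64, 64)
actions = video_to_actions(idm, rollout)
print(actions.shape)  # (15, 7)
```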
