Computer Vision News

continuous values into a text description through tokenization,” Sherry explains. “Having a unified language to describe actions is difficult. Once we have that, it’s just about combining datasets and merging information in a single model.”

Another technical challenge was ensuring the model maintained consistency and memory over time. For example, if the model places an apple in a drawer and needs to open it later, it must remember the initial action. To address this, it uses history conditioning, aggregating past frames and conditioning on that to generate future video segments. However, the model is currently limited to a fixed number of past frames. “If something happened days ago, how can I ensure the model remembers that?” she ponders. “This is not addressed in this work, but there are other works in Google Gemini or long-context learning where people can fit millions of tokens into these large models. These will be considered for future work to empower generative simulators to incorporate history.”

The field of generative models is motivated by computer vision. Classical tasks like segmentation, tracking, and depth estimation play a role in this work, but it connects them to embodied AI, taking a broader, end-to-end approach to simulating the effects of executing actions. Rather than those intermediate tasks, it focuses on image-to-video prediction with some control in the middle, blending robotics, computer vision, control, and reinforcement learning.

Learning Interactive Real-World Simulators
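For readers who want a concrete picture, the two ideas discussed above (turning continuous action values into text tokens, and conditioning on a fixed number of past frames) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the uniform binning scheme, the value range, and the names `tokenize_action` and `FrameHistory` are all hypothetical.

```python
from collections import deque


def tokenize_action(values, low=-1.0, high=1.0, bins=256):
    """Map continuous values to integer bins and join them as a text string.

    Assumption: uniform binning over [low, high]; the article does not
    specify the actual tokenization scheme.
    """
    tokens = []
    for v in values:
        clipped = min(max(v, low), high)  # keep the value inside the range
        # Scale to [0, bins - 1] and round to the nearest bin index.
        index = round((clipped - low) / (high - low) * (bins - 1))
        tokens.append(str(index))
    return " ".join(tokens)


class FrameHistory:
    """Hypothetical buffer that keeps only the most recent frames."""

    def __init__(self, max_frames):
        # A deque with maxlen silently drops the oldest frame when full.
        self.frames = deque(maxlen=max_frames)

    def add(self, frame):
        self.frames.append(frame)

    def context(self):
        # Frames a model would be conditioned on, oldest first.
        return list(self.frames)


history = FrameHistory(max_frames=3)
for frame_id in range(5):
    history.add(frame_id)

action_text = tokenize_action([-1.0, 0.0, 1.0], bins=11)
recent_frames = history.context()
```

The fixed `max_frames` window makes the limitation mentioned in the interview visible: once the buffer is full, anything older than the last few frames is simply no longer available as context.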