Computer Vision News

Sherry says the text-to-video generation community has been focused on entertainment applications, favoring videos of cute animals in unusual situations over real-world scenarios. This trend followed text-to-image generation, where the focus was on generating pictures that did not exist. “I’m not saying people went down the wrong route,” she adds. “I’m saying we were being narrow by only thinking about creative media. When we think about videos, one of the most interesting applications is modeling the real world, because physics is hard. Interaction with soft objects, fluid dynamics, and cloud movement is hard to model using mathematical equations. Learning a generative model of videos using a data-driven approach, with millions of video clips to learn this dynamics model, is a more natural fit for these large models.”

Reflecting on why her work was chosen as an Outstanding Paper, Sherry points to the significance of treating video generation as a simulator of the real world. This shift in perspective opens new avenues for generating robot and human videos. “This is the novelty of the idea,” she reiterates. “What does it mean if we have a perfect simulator? The work demonstrates a few examples. You can use it to train agents. You can use it to generate additional experiences to train video-captioning models. There’s great potential for what people can use it for in the future.”

In February, an OpenAI blog introduced Sora, its text-to-video AI model capable of generating realistic videos from text instructions. Mirroring Sherry’s research, which had been completed months before, it discussed the concept of video generation models as world simulators. “We put out this idea from a scientific perspective much earlier,” she points out.
“People from academia and industry are thinking about applications of video generation along similar lines at different times, but I don’t see us

ICLR Outstanding Paper
RkJQdWJsaXNoZXIy NTc3NzU=