CVPR Daily - Friday

own footage to gather the necessary data, taking a camera and tripod to different parks to capture thousands of videos. “The hardest part was that we spent a lot of time working on it, but that’s the key ingredient that made our method work,” Zhengqi recalls. “If you don’t have data, you can’t train your model to get good results.”

While other works use optical flow to predict the motion of each pixel, this work trains a latent diffusion model that learns to iteratively denoise features, starting from Gaussian noise, to predict motion maps rather than traditional RGB images. Motion maps are closer to coefficients of motion. The model uses them to render the video from the input picture, which is very different from other works that directly predict video frames from images or text.

“That’s something quite interesting,” Zhengqi notes. “We’re working from more of a vision than a machine learning perspective. I think that’s why people like it in the computer vision community.”

Outside of writing award-candidate papers, Zhengqi’s work at Google mainly focuses on research but has some practical applications, including assisting product teams with video processing. He also advises several PhD student interns. “We work together on interesting research projects to achieve very good outcomes,” he reveals. “That’s our daily goal as research scientists at Google DeepMind!”

To learn more about Zhengqi’s work, visit Orals 6B: Image & Video Synthesis (Summit Flex Hall AB) from 13:00 to 14:30 [Oral 2] and Poster Session 6 & Exhibit Hall (Arch 4A-E) from 17:15 to 18:45 [Poster 117].
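To make the motion-map idea more concrete, here is a minimal sketch, not the authors’ actual code, of how a diffusion-style sampler might start from Gaussian noise, iteratively denoise a motion-coefficient map conditioned on a single image, and then warp that image into a frame. The network, noise schedule, channel counts, and warping step are all illustrative assumptions rather than the paper’s implementation.

```python
# Illustrative sketch only: a toy denoising loop that predicts a motion map
# (displacement coefficients) instead of RGB, then warps the input image.
# The network, schedule, and shapes are placeholders, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMotionDenoiser(nn.Module):
    """Hypothetical stand-in for the trained motion-prediction network."""
    def __init__(self, channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, noisy_motion, image, t):
        # Condition the noise prediction on the input image.
        return self.net(torch.cat([noisy_motion, image], dim=1))

@torch.no_grad()
def sample_motion_map(model, image, steps=50, channels=4):
    """Iteratively denoise from Gaussian noise to a motion-coefficient map."""
    b, _, h, w = image.shape
    x = torch.randn(b, channels, h, w)            # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((b,), i, dtype=torch.float32)
        eps = model(x, image, t)                  # predicted noise
        alpha = 1.0 - 0.02 * (i + 1) / steps      # toy schedule
        x = (x - (1 - alpha) * eps) / alpha**0.5  # simplified update rule
    return x                                      # motion coefficients, not RGB

def render_frame(image, motion, scale=0.05):
    """Warp the single input image with the first two motion channels (dx, dy)."""
    b, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack([xs, ys], dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    flow = motion[:, :2].permute(0, 2, 3, 1) * scale
    return F.grid_sample(image, base + flow, align_corners=True)

# Usage: animate one picture by sampling a motion map and warping the image.
image = torch.rand(1, 3, 64, 64)
model = ToyMotionDenoiser()
motion = sample_motion_map(model, image)
frame = render_frame(image, motion)
print(frame.shape)  # torch.Size([1, 3, 64, 64])
```

The point of the sketch is the separation the interview describes: the generative model outputs coefficients of motion, and a separate rendering step turns the still picture plus that motion into video frames, rather than predicting the frames directly.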
