CVPR Daily - Friday

Seattle 2024, Friday. A publication by RSIP Vision for the Computer Vision and Pattern Recognition conference.

Generative Image Dynamics: full review of this Best Paper Award at CVPR 2024! Read about this brilliant work!

Editorial … with General Chairs

Good morning, CVPR! Welcome to the final CVPR Daily for 2024 by RSIP Vision, the publisher of Computer Vision News.

What an incredible week it’s been! From engaging workshops and tutorials to groundbreaking talks and posters, and not forgetting the lively socials, we’d like to offer a big congratulations to everybody involved and a huge thank you to the entire organizing committee. It’s been thrilling to see so many of you here in Seattle! We hope you’ve enjoyed catching up with old friends and making new ones as much as we all have.

With record registrations and 26% more paper submissions than last year, CVPR is growing fast, reflecting a burgeoning interest in our community. As CVPR scales up, so does the infrastructure needed to support it. Thanks to the Computer Vision Foundation (CVF) and key team members (shout out as ever to Nicole Finn!), logistical issues have been effectively managed, and operations are more streamlined, modernized, and professional, which makes life easier for us General Chairs.

However, this growth is challenging for the Program Chairs, who face an increased workload due to the sheer number of papers. We hear you in the community when you tell us you want the review process to be the best it can be. We urgently need more reviewers to ensure the quality of reviews matches the high standards we strive for, and clear expectations must be set around active participation. We’re looking at providing feedback to reviewers and recognizing those who excel. The next leadership team will continue to address these crucial issues.

Looking ahead to next year, we wonder what the key trends will be at CVPR 2025. We’d love to see people thinking more about responsible computer vision innovations. Views on what’s responsible may vary, but there’s a collective desire for work that pushes the boundaries of technology while prioritizing ethical considerations, sustainability, and societal impact.

As CVPR 2024 draws to a close, we remind everyone to stay connected from this meeting to the next by reading Computer Vision News. We love this magazine! It’s a glue for the community, keeping us all informed and engaged every month. Subscribe for free today and tell all your friends!

Enjoy reading our final round-up of the week. Have a wonderful day and see you in Nashville for CVPR 2025!

General Chairs:
Octavia Camps (Northeastern University)
Rita Cucchiara (University of Modena and Reggio Emilia)
Walter Scheirer (University of Notre Dame)
Ramin Zabih (Cornell University)
Sudeep Sarkar (University of South Florida)

… with Ralph Anzarouth (CVPR Daily and Computer Vision News)

CVPR Daily
Publisher: RSIP Vision
Editor: Ralph Anzarouth
Copyright: RSIP Vision. All rights reserved. Unauthorized reproduction is strictly forbidden.
Our editorial choices are fully independent from IEEE, CVPR and the conference organizers.

Best Paper Award Winner
Generative Image Dynamics

Zhengqi Li is a research scientist at Google DeepMind, working on computer vision, computer graphics, and AI. His paper on Generative Image Dynamics has not only been selected as a highlight paper this year but is also in the running for a best paper award. He is here to tell us more about it before his oral presentation this afternoon.

NOTE: this article was written before the announcement of the award winners, which explains why it refers to a candidate rather than a winning paper. Once again, we placed our bets on the right horse! Congratulations to Zhengqi and team for the brilliant win, and to the other winning paper too!

Imagine looking at a picture of a beautiful rose and visualizing how it sways in the wind or responds to your touch. This innovative work aims to do just that by automatically animating single images without user annotations. It proposes to solve the problem by modeling what it calls image-space motion priors to generate a video in a highly efficient and consistent manner.

“By using these representations, we’re able to simulate the dynamics of the underlying thing, like flowers, trees, clothing, or candles moving in the wind,” Zhengqi tells us. “Then, we can do real-time interactive simulation. You can use your mouse to drag the flower, and it will respond automatically based on the physics of our world.”

The applications of this technology are already promising. Currently, it can model small motions, similar to a cinemagraph, where the background is typically static but the object is moving. A potential application for this would be dynamic backgrounds for virtual meetings, providing a more engaging and visually appealing alternative to static or blurred backgrounds but without excessive motion that could be distracting.

“Moving to model larger motion, like human motion or cats and dogs running away, is an interesting future research direction,” Zhengqi points out. “We’re working on that to see if we can use a better and more flexible motion representation to model those generic motions to get better video generation or simulation results.”

Most current and prior mainstream approaches in video modeling use a deep neural network or diffusion model to directly predict large volumes of pixels representing video frames, which is computationally intensive and expensive. In contrast, this work predicts the underlying motion, which lies on a lower-dimensional manifold, and uses a small number of bases to represent a very long motion trajectory. “You can use a very small number of coefficients to represent very long videos,” Zhengqi explains. “This allows us to use this representation to produce a more consistent result more efficiently. I think that’s the main difference compared with other video generation methods you might see.”

The novelty of this approach has not gone unnoticed, with the work being picked as a top-rated paper at this year’s CVPR, given a coveted oral presentation slot, and recognized as one of only 24 papers in line for a best paper award. If we were placing bets on the winners, this work, with its stellar team of authors, would be our hot tip. What does Zhengqi believe are the magic ingredients that have afforded it such honors?

“There are a few thousand papers on video generation dynamics, and they all have similar ideas,” he responds. “They predict the raw pixel, and we’re going in a completely different direction, predicting the underlying motion. That’s something the research community appreciates because it’s unique. I guess they believe this might be an interesting future research direction for people to explore because, for generative AI, people are more focused on how you can scale those big models trained on 10 billion data while we’re trying to use a different representation that we can train more efficiently to get even better results. That’s a completely different angle, and the award community might like those very different, unique, special angles.”

However, the road to this point was not without its challenges. Collecting sufficient data to train the model was a significant hurdle the team had to overcome. They searched the Internet and internal Google video resources and even captured their own footage, taking a camera and tripod to different parks to record thousands of videos. “The hardest part was we spent a lot of time working on it, but that’s the key ingredient that made our method work,” Zhengqi recalls. “If you don’t have data, you can’t train your model to get good results!”

While other works use optical flow to predict the motion of each pixel, this work trains a latent diffusion model, which learns to iteratively denoise features starting from Gaussian noise to predict motion maps rather than traditional RGB images. Motion maps are more like coefficients of motion. The model uses these to render the video from the input picture, which is very different from other works that directly predict video frames from images or text. “That’s something quite interesting,” Zhengqi notes. “We’re working from more of a vision than a machine learning perspective. I think that’s why people like it in computer vision communities.”

Outside of writing award-candidate papers, Zhengqi’s work at Google mainly focuses on research but has some practical applications, including assisting product teams with video processing. He also advises several PhD student interns. “We work together on interesting research projects to achieve very good outcomes,” he reveals. “That’s our daily goal as research scientists at Google DeepMind!”

To learn more about Zhengqi’s work, visit Orals 6B: Image & Video Synthesis (Summit Flex Hall AB) from 13:00 to 14:30 [Oral 2] and Poster Session 6 & Exhibit Hall (Arch 4A-E) from 17:15 to 18:45 [Poster 117].
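The low-dimensional motion representation Zhengqi describes can be pictured with a short sketch. This is not the paper’s actual code; it is a minimal illustration, assuming a Fourier-style basis, of how a handful of per-pixel coefficients can expand into a long, smooth displacement trajectory that a renderer could then use to warp the single input image, frame by frame:

```python
import numpy as np

def reconstruct_trajectory(coeffs, num_frames):
    """Expand K per-pixel basis coefficients into a motion trajectory of
    length num_frames. coeffs: complex array of shape (H, W, K, 2) holding
    K low-frequency terms for the x and y displacement of every pixel."""
    H, W, K, _ = coeffs.shape
    t = np.arange(num_frames) / num_frames                    # normalized time
    freqs = np.arange(K)                                      # low-frequency basis
    # basis[k, t] = exp(2*pi*i*f_k*t): each trajectory is a short Fourier series
    basis = np.exp(2j * np.pi * freqs[:, None] * t[None, :])  # (K, T)
    # sum over the K coefficients -> dense displacement field for every frame
    traj = np.einsum('hwkc,kt->hwtc', coeffs, basis).real     # (H, W, T, 2)
    return traj

# Example: 16 coefficients per pixel expand into a 150-frame motion field.
coeffs = np.random.randn(64, 64, 16, 2) + 1j * np.random.randn(64, 64, 16, 2)
motion = reconstruct_trajectory(coeffs * 0.01, num_frames=150)
print(motion.shape)  # (64, 64, 150, 2)
```

The point of the sketch is the compression Zhengqi highlights: a few coefficients per pixel stand in for hundreds of frames of motion, which is far cheaper than predicting the raw pixels of every frame.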

Highlight Presentation
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Aniruddha Kembhavi (top left) is a Senior Director at the Allen Institute for AI (AI2), leading the Perceptual Reasoning and Interaction Research (PRIOR) team, where Christopher Clark (center) and Jiasen Lu (top right) are Research Scientists, Sangho Lee (bottom left) is a Postdoctoral Researcher, and Zichen “Charles” Zhang (bottom right) is a Predoctoral Young Investigator. Before their poster session this afternoon, they speak to us about their highlight paper proposing Unified-IO 2, a versatile autoregressive multimodal model.

Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating images, text, audio, and action. It can handle multiple input and output modalities and incorporates a wide range of tasks from vision research. Unlike traditional models with specialized components for different tasks, it uses a single encoder-decoder transformer model to handle all tasks, with a unified loss function and pretraining objective.

“It’s a super broad model,” Christopher tells us. “It can take many different modalities as input and output. It can do image, text, audio, and video as input and can generate text, image, and audio output. Within those modalities, we basically threw in every task we could think of that vision researchers have been interested in. It’s a super, super broad model. I think it’s one of the most broadly capable models that exists today.”

While language models can perform many tasks and input and output all kinds of structured language, handling diverse inputs and outputs in computer vision is more challenging. “When it comes to computer vision, it’s a mess,” Aniruddha says bluntly. “Sometimes, you have to input an image. Sometimes, you have to output a bounding box. Sometimes, you have to output a continuous vector like a depth map. Inputs and outputs in computer vision are very heterogeneous. That’s why, for the last 10 years, people have been building models that can do one or two things.”
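The unifying idea, turning heterogeneous vision inputs and outputs into one token sequence for a single encoder-decoder transformer, can be sketched roughly as below. The tokenizer choices, vocabulary ranges, and helper names here are illustrative assumptions, not the model’s actual configuration:

```python
from dataclasses import dataclass
from typing import List

# Illustrative vocabulary layout: one shared token space for every modality.
# The real Unified-IO 2 tokenizers and ranges differ; these numbers are made up.
TEXT_TOKENS   = range(0, 32_000)        # subword pieces
IMAGE_TOKENS  = range(32_000, 48_384)   # discrete codes from an image tokenizer
AUDIO_TOKENS  = range(48_384, 56_576)   # discrete codes from an audio tokenizer
ACTION_TOKENS = range(56_576, 57_600)   # discretized action bins

@dataclass
class Example:
    prompt_tokens: List[int]   # encoder input: any mix of modalities
    target_tokens: List[int]   # decoder output: any mix of modalities

def build_example(text_ids, image_codes, target_ids):
    """Concatenate modality streams into one sequence so a single
    encoder-decoder transformer can train on them with one loss."""
    prompt = list(text_ids) + [IMAGE_TOKENS.start + c for c in image_codes]
    return Example(prompt_tokens=prompt, target_tokens=list(target_ids))

# One cross-entropy objective over the shared vocabulary can then cover every
# task: captioning, detection (boxes as tokens), audio or image generation, actions.
ex = build_example(text_ids=[17, 512, 9], image_codes=[3, 44, 101], target_ids=[88, 2])
print(len(ex.prompt_tokens), len(ex.target_tokens))
```

Once bounding boxes, depth maps, audio, and actions are all expressed as tokens in this shared space, the “mess” Aniruddha describes reduces to ordinary sequence-to-sequence training.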

Unified-IO 2 builds on the foundations laid by its predecessor, Unified-IO, aiming to create a model that can truly input and output anything. Training such a comprehensive model, especially with limited resources, has been incredibly tough. The team’s first major challenge was collecting the pretraining and instruction tuning data. The second was training a multimodal model from scratch rather than adapting existing unimodal models.

“We tried a few months of tricks to stabilize the model and make it train better,” Jiasen recalls. “We figured out a few key recipes that were used by later papers and shown to be very effective, even in other things like image generation. We’re training on a relatively large scale with 7B models and over 1 trillion data. More than 230 tasks were involved in training these giant models.”

The development of Unified-IO 2 has been a collaborative effort involving the four first authors: Jiasen, Christopher, Sangho, and Charles. Aniruddha is keen to ensure they get the recognition they deserve for the feat they have pulled off. “This project is a Herculean effort by these four people,” he points out. “Usually, people will take a large language model, then put a vision backbone, and then finetune that on some computer vision tasks. In this model, the language model is also trained from scratch. Think of large companies with hundreds of researchers trying to train a language model. Contrast that with this paper, which has four first authors trying to train a model that does everything. These four gentlemen have toiled night and day for many, many months. I can testify to that.”

Everything about Unified-IO 2 is open source. If you visit the team’s poster today, you can feel safe knowing they are willing to share every aspect of the project. “We’ve released all the data, the training recipes, the challenges, especially in stabilizing the model training, and all the evaluation pipelines,” Sangho confirms. “If you come to our poster booth, we’ll be very happy to share all the recipes and know-how for training this special kind of multimodal foundation model.”

During the evaluation stage, the team discovered that Unified-IO 2 could perform well on tasks they had not initially targeted, such as video tracking and some embodied tasks. They will showcase these surprising results with iPad demonstrations at their poster session. “We’ve tested the model multiple times, but maybe only with a few modalities and target tasks,” Charles reveals. “It’s a surprise that the model is so good at other tasks we’ve not focused on before. There are lots of interesting behaviors of the models and some very cool visualizations showing that the model can follow some novel instructions.”

The paradigm behind Unified-IO 2, where all modalities are integrated into a single transformer without relying on external unimodal models, is a promising direction for future AI research. “It’s in contention with other ways of training generalist models, and people are still exploring and building on that,” Christopher adds. “I think Unified-IO 2, in particular, has a lot of modalities and tasks and really pushes that way of building models to an extreme.”

To learn more about the team’s work, visit Poster Session 6 & Exhibit Hall (Arch 4A-E) from 17:15 to 18:45 [Poster 222].

Double-DIP: Don’t miss the BEST OF CVPR 2024 in Computer Vision News of July. Subscribe for free and get it in your mailbox! Click here.

Highlight Presentation
The Devil is in the Fine-Grained Details: Evaluating Open-Vocabulary Object Detectors for Fine-Grained Understanding

Lorenzo Bianchi is a PhD student at the University of Pisa and CNR-ISTI, supervised by Giuseppe Amato, Fabio Carrara, Nicola Messina, and Fabrizio Falchi. He is working on multimodal deep learning, focusing on image-text interaction in deep learning models. Before his poster session this morning, Lorenzo talks to us about his highlight paper on open-vocabulary object detection.

Unlike traditional object detection, where the objects are predefined during training, open-vocabulary models can recognize objects described by natural language sentences defined at inference time. “I find these open-vocabulary models very interesting because they offer flexibility,” Lorenzo begins. “They allow end users who may only be interested in recognizing a specific set of objects to use these models without training.”

However, despite their promise, current open-vocabulary models struggle with recognizing fine-grained properties of objects, such as colors or materials. Lorenzo tells us he was surprised by this, considering recent advancements in generative AI. “It’s quite outstanding that we struggle with discerning fine-grained properties in object detection, which we might assume is an easier task than image generation,” he points out. “We searched the scientific literature on this topic but found nothing that accurately described this problem. The problem was that classical open-vocabulary object detection benchmarks do not come with attributes in their text entries.”

To address this gap, Lorenzo set out to create a benchmark with fine-grained natural language captions for detection. Starting from a detection dataset with structured descriptions of object parts and attributes, he prompted a large language model (LLM) to generate natural language descriptions of the objects. “It was really exciting because at the time we developed this benchmark, it was the early days when we could finally use an open-source LLM locally on our machine,” he recalls. “It was quite cool for us!”

The generated captions, which he called positive captions, were paired with negative captions, in which other attributes were deliberately misplaced inside the sentence. This combination of positive and negative captions was used as the input vocabulary for the detectors. The models were tested on their ability to correctly localize objects based on these complex descriptions, and they had to identify the correct captions to show that they could match the right attributes to the objects.

Creating this benchmark was a challenge. Lorenzo had to engage in extensive prompt engineering to ensure the accuracy of the LLM outputs. “We had to find the correct prompt for generating the benchmark and reducing the LLM hallucinations because sometimes they can fail,” he explains. “Since the benchmark needs to be very accurate, we also had to manually revise the outputs and discard some generated captions, which were errors or imprecise.”
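A rough sketch of the evaluation idea described above: each object gets one LLM-generated positive caption plus negatives with swapped attributes, and a detector is scored on whether the positive caption outscores every negative for the ground-truth box. The class and function names here are hypothetical, not the benchmark’s actual API:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FineGrainedSample:
    image_id: str
    box: List[float]            # ground-truth box [x1, y1, x2, y2]
    positive: str               # LLM-generated caption with correct attributes
    negatives: List[str]        # same object, attributes deliberately swapped

def caption_accuracy(samples: List[FineGrainedSample], score_fn) -> float:
    """score_fn(image_id, box, caption) -> float is any open-vocabulary
    detector's 'this box matches this caption' score (hypothetical hook).
    A sample counts as correct if the positive caption beats every negative."""
    correct = 0
    for s in samples:
        pos = score_fn(s.image_id, s.box, s.positive)
        if all(pos > score_fn(s.image_id, s.box, n) for n in s.negatives):
            correct += 1
    return correct / max(len(samples), 1)

# Toy usage with a dummy scorer (a stand-in for a real detector).
samples = [FineGrainedSample("img_001", [10, 10, 80, 120],
                             positive="a red wooden chair",
                             negatives=["a blue wooden chair", "a red plastic chair"])]
print(caption_accuracy(samples, lambda i, b, c: -len(c)))
```

The design choice is that the negatives differ only in a single attribute, so a model can only score well by actually reading colors, materials, and similar fine-grained properties rather than the object category alone.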

The next steps for this work are understanding why current models fail to recognize fine-grained attributes and exploring ways to improve them. Lorenzo points out that the paper is mainly an analysis intended to encourage further research in the field. He says that researchers have recently shown a lot of interest in open-vocabulary models, and he hopes to be part of collaborative efforts moving forward.

This year is Lorenzo’s first CVPR and first paper, which makes being selected as a highlight even more special. The ‘highlight’ tag has only been given to around 12% of this year’s accepted papers, based on their high quality and potential impact. “It’s a dream for me,” he smiles. “It’s an honor to be a highlight paper. We touched on a very interesting topic, which is quite hot. We look at open-vocabulary models from a different perspective, which is not the usual one we see with classical benchmarks. I think there was an interest in looking at the situation from a different perspective and looking at directions where we can make a lot of difference in the next few years.”

For those intrigued by this emerging field, Lorenzo’s poster offers an opportunity to delve deeper into the challenges. “We have an interesting task that currently has no solution,” he adds. “If you come to my poster, we can discuss possible solutions. Maybe we can help each other find something that will greatly advance the research in this area.”

To learn more about Lorenzo’s work, visit Poster Session 5 & Exhibit Hall (Arch 4A-E) from 10:30 to 12:00 [Poster 285].

Workshop Poster
OccFeat: Self-supervised Occupancy Feature Prediction for Pre-training BEV Segmentation Networks

UKRAINE CORNER
Russian Invasion of Ukraine: CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.

Sophia Sirko-Galouchenko is a first-year PhD student at Sorbonne University in Paris and Valeo.ai. Her paper on bird’s-eye-view perception in autonomous driving was presented by colleagues on Monday during a poster session at the Workshop on Autonomous Driving. She speaks to us about the work.

Camera-only Bird’s-Eye-View (BEV) networks are gaining significant traction in the field of autonomous driving perception. They can transform six images providing a 360-degree view surrounding a vehicle into a comprehensive, top-down view similar to a bird’s perspective. This work focuses on the camera-only BEV semantic segmentation task, enabling the network to segment various elements, primarily vehicles, within this view.

“Our work is a pre-training of such networks,” Sophia explains. “We have come up with two auxiliary tasks. One is a geometric task, and the second is to learn semantic information. The paper’s novelty comes from combining these two pre-trainings, giving the network, at the same time, 3D geometry information and semantic information through distillation.”

Sophia proposes OccFeat, a self-supervised pre-training method for camera-only BEV segmentation networks. It pre-trains the BEV network via occupancy prediction and feature distillation tasks. “The pre-training task we use is asking the model to predict a 3D volume from images,” she continues. “This volume encodes occupancy information, whether or not a 3D voxel is occupied, and predicts features in the occupied voxels that come from a pre-trained image model.”

Occupancy prediction provides a 3D geometric understanding of the scene, but the geometry learned is class-agnostic. To address this, Sophia integrates semantic information into the model in 3D space through distillation from a self-supervised pre-trained image foundation model, DINOv2. Models pre-trained with OccFeat show improved BEV semantic segmentation performance, especially in low-data scenarios.

Figure: Overview of OccFeat’s self-supervised BEV pretraining approach. OccFeat attaches an auxiliary pretraining head on top of the BEV network. This head “unsplats” the BEV features to a 3D feature volume and predicts with it (a) the 3D occupancy of the scene (occupancy reconstruction loss) and (b) high-level self-supervised image features characterizing the occupied voxels (occupancy-guided distillation loss).
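The two auxiliary objectives described above can be summarized in a short sketch. The tensor shapes, head modules, and loss weights are assumptions for illustration, not the actual OccFeat implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def occfeat_pretraining_loss(bev_feats, occ_gt, dino_feats, unsplat_head,
                             occ_head, feat_head, w_occ=1.0, w_distill=1.0):
    """Sketch of OccFeat-style pretraining losses.
    bev_feats:  (B, C, H, W) features from the BEV network being pre-trained
    occ_gt:     (B, Z, H, W) binary target occupancy of the scene
    dino_feats: (B, D, Z, H, W) self-supervised image features lifted to 3D
    The three heads are small assumed modules; their design is illustrative."""
    vol = unsplat_head(bev_feats)                  # (B, C', Z, H, W) 3D feature volume
    occ_logits = occ_head(vol).squeeze(1)          # (B, Z, H, W) occupancy prediction
    occ_loss = F.binary_cross_entropy_with_logits(occ_logits, occ_gt.float())

    pred_feats = feat_head(vol)                    # (B, D, Z, H, W) predicted features
    # Distill DINOv2-style features only where the scene is actually occupied.
    mask = occ_gt.unsqueeze(1).float()             # (B, 1, Z, H, W)
    distill_loss = (F.mse_loss(pred_feats, dino_feats, reduction="none") * mask).sum() \
                   / mask.sum().clamp(min=1.0)

    return w_occ * occ_loss + w_distill * distill_loss

# Toy usage with stand-in conv heads (dimensions are arbitrary).
B, C, H, W, Z, D = 2, 64, 50, 50, 8, 32
unsplat = nn.Sequential(nn.Conv2d(C, 32 * Z, 1), nn.Unflatten(1, (32, Z)))
occ, feat = nn.Conv3d(32, 1, 1), nn.Conv3d(32, D, 1)
loss = occfeat_pretraining_loss(torch.randn(B, C, H, W),
                                (torch.rand(B, Z, H, W) > 0.5),
                                torch.randn(B, D, Z, H, W),
                                unsplat, occ, feat)
print(loss.item())
```

The occupancy term supplies the class-agnostic geometry, while the masked distillation term injects the semantic information from the frozen image foundation model, matching the combination Sophia describes.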

Although she has no plans to extend this work herself, Sophia is optimistic when asked if it opens new avenues for research. “I think applying this pre-training with more data could be interesting,” she tells us. “Like bigger datasets without any annotations. Then afterwards, finetune the same network with, for example, the nuScenes dataset that we use with annotations for segmentation.”

Reflecting on the challenges faced during the project, Sophia shares that it was her first paper and her first research work in machine learning. “It was a challenge in itself leading a research project,” she reveals. “Also, training so many models and testing so many things, that was a challenge.” She notes that the solutions came through persistence and trying new things, as well as the invaluable support of her peers. “I had a very good team of advisors,” she smiles. “I was an intern then, but my fellow PhD students were also helping.”

On the broader question of the current state of autonomous driving, which has seen significant technological progress in recent years but continues to face challenges, Sophia acknowledges that there is still a long road ahead. “There’s still a lot of work to do, but I think it’s going in the right direction,” she remarks. “I couldn’t predict a timeline in the future, but we’re not there yet.”

Figure: Performance comparison in the low-data regime (1% annotated data of nuScenes).

Figure: Visualisation of predicted 3D features, using a 3-dimensional PCA mapped to RGB channels. The features contain semantic information, e.g., cars in cyan. Using the same PCA mapping on a different scene (right), we show that semantic features are consistent across scenes. Correlation maps of the student’s predicted 3D features and features selected on a car (left) and on a road (right).

Did you read Computer Vision News of June? Read it here.

My first CVPR! Undergraduates at CVPR! Haran Raajesh, a third-year CS undergraduate from IIIT Hyderabad, presents his poster on Identity-Aware Movie Captioning. He plans to apply for PhDs this upcoming fall, so do get in touch with him.
