WACV 2025 Daily - Sunday

WACV 2025 – Winter Conference on Applications of Computer Vision – Sunday

Hanoona's Picks

Hanoona Abdul Rasheed is a PhD candidate in Computer Vision at MBZUAI. Her research focuses on integrating vision and language for multimodal learning, emphasizing data-centric approaches and model generalization. This means developing models that interact across multiple modalities (text, images, videos, and regions), enhancing interactivity, improving generalization across applications by enabling a single model to handle diverse tasks, and extending language coverage beyond commonly supported languages like English and Chinese to underrepresented ones.

Hanoona's picks of the day for today, Sunday:

4.3.1 Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with …
4.4.5 ReEdit: Multimodal Exemplar-Based Image Editing
3.32 TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm

And visit also Hanoona's poster (her oral presentation will be on Monday):
3.96 PALO: A Polyglot Large Multimodal Model for 5B People

Keynote today – Phillip Isola

Don't miss Phillip Isola's Keynote! Visual content can be conveyed in many ways. It can be photographed and captured in an array of pixels, or instead it can be described through text rich in imagery. Computer vision has traditionally only dealt with the former format, leaving language processing as the domain of other fields. In this talk at 9:00am, Phillip will reconsider this choice: should computer vision also deal with language as a fundamental visual format?

Oral Presentation

VideoGameBunny: Towards vision assistants for video games

Mohammad Reza Taesiri is currently a postdoc at the University of Alberta, where he works on large vision-language models under the supervision of Cor-Paul Bezemer. Mohammad is also the first author of a lovely paper that was accepted at WACV 2025 as an oral.

This work is about creating vision-language models for video games. These days, we have chatbots to which you can send an image and start a conversation about it, asking specific questions: What is the content of the image? How many people are in the image? These models are closed source, but there are some open-source alternatives, the most famous being LLaVA. However, these models are not very good with video game content. If you send a video game screenshot and then ask a question about the game or the game world, they usually struggle to answer. The goal with VideoGameBunny was to create a model that is more familiar with video game context and can understand and answer questions about video game content better than other models.

Let's observe this image taken from the paper, with a video game screenshot and a simple question: are there any visible glitches or errors in the game environment? If you ask LLaVA, it says that yes, an additional download progress bar seems to be stuck. If you look closely at the top, there is indeed a progress bar for an additional download. Mohammad Reza's model is not confused by the progress bar and answers the question correctly, because it understands what is going on.

What was the biggest challenge in creating such a powerful model? When it comes to creating a new model, two things are very important. The first is the data we need to collect, and we need to be very, very careful with its quality.

The other thing is the computational power required to train these models. "For this project specifically," explains Mohammad Reza, "the training data was hardest to get. Basically, there was no dataset for video game content before this work. We wanted to change that. We created the biggest dataset of video game screenshots, conversations, and question answering regarding video game content!"

Let's say you are playing a game and you want a digital companion: a chatbot inside the game that looks at exactly the screen you are looking at and gives you some hints. Or maybe you run into a challenge when crafting new items inside the game, like in Minecraft: how should I combine different items to create a specific new item? These models can help us build those in-game vision assistants. This is also very useful for video game testing and debugging. "For example," says Mohammad Reza, "if you want to make sure that there are no glitches in the video game, you need these models to detect those artifacts and glitches inside the game. This model is not specifically designed for that task, but it is a base model: you can start from it, fine-tune it for different tasks, and create those models."

The thing Mohammad Reza is most proud of is that it is a tiny model: relatively small, yet it performs similarly to or better than models that are very, very large.

What would the author do if he had a magic wand to add one more feature to the model? The answer is definitely better quality data. "I never say no to better quality data," confides Mohammad Reza. "And if I were to redo this project from scratch, I would spend more time creating higher quality data."

What would be an ideal direction for continuing this work? There is one immediate direction, which is extending this model to video. Currently, it only works with a single image, but the authors want to expand this capability to video: for example, sending a video and asking a question about it. This is one of the things they are trying to do at the moment, generating a video dataset and creating a video model. Another thing, which is a little bit harder, is to have a component that allows the model to control the video game. Currently, it only prints out text, but imagine it could print the actions that you could play in the game. "There are some works in robotics," he shares, "called vision-language-action models (VLAs): a combination of vision, language, and action."

"That part is very interesting because if we create such a dataset that combines image, text, and the action space inside the game, you can essentially play the game. Let the model interact with the game and play!" However, it is always hard to get the data: for this particular case, you need to ask people to play the game and record their keyboard and mouse inputs. "For video game companies, that's easy," declares Mohammad Reza. "For academic individuals, it's a little bit hard to have the budget to collect that data. But it's totally possible."

Mohammad Reza is working on more capable vision-language models for video games, but he cannot discuss the details yet. We will keep our curiosity for the next time…

To learn more about VideoGameBunny, visit Poster Session 3 today (Sunday) from 11:15 to 13:00 and Oral Session 7.1: Computer Vision Applications II tomorrow (Monday) from 10:15 to 11:15.

WACV Daily Editor: Ralph Anzarouth
Publisher & Copyright: Computer Vision News
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, WACV and the conference organizers.

Double-DIP: Don't miss the BEST OF WACV 2025 in Computer Vision News of March. Subscribe for free and get it in your mailbox! Click here

Oral Presentation

My3DGen: A Scalable Personalized 3D Generative Model

Luchao Qi is a third-year PhD student at the University of North Carolina at Chapel Hill (UNC) under the supervision of Roni Sengupta. His paper on democratizing 3D generative AI for personalized human face modeling has been accepted as an oral. Although he cannot attend WACV in person, we speak to him following his group's poster yesterday and before their oral later today.

In this paper, Luchao explores innovative ways to make 3D facial generation more personalized, accessible, and scalable. His model, My3DGen, uses generative AI to reconstruct a complete 3D model of a person's face from a limited number of images. This has several applications, including virtual communication, augmented reality, and content creation. Imagine being on a Zoom call, facing the camera, and AI generating how your face would appear from different perspectives, as if captured by multiple cameras.

Prior work on 3D human face modeling uses global models that are not always effective for individual users. "You can find a lot of prior work on the 3D human face, but this pre-trained global model is not perfect for each person," Luchao tells us. "Sometimes you find artifacts. Sometimes, the pretrained model can't generate my face, or I'll be concerned about data privacy. Am I going to upload my personal data to the server? I just want to store my personal data myself, on my own phone. That's the motivation for our work. We want to personalize a pre-trained 3D human face model for personal use."

One of the biggest challenges in personalizing pre-trained generative models is their size.

Large AI models require substantial storage space and computational power, making them impractical for individual users to maintain. Rather than requiring massive storage, My3DGen reduces the number of trainable parameters for each user to make the solution more scalable, allowing personalization without needing people to store the whole foundation model.

Another challenge is the availability of personal data. Many users may not have a large dataset of their own facial images, making it difficult for traditional generative AI models to work effectively. "I'm not a selfie guy!" Luchao confesses. "I don't take many photos of myself, so data limitation is sometimes a problem. We try to make our solution work for users who have limited selfies."

To do this, My3DGen employs Generative Adversarial Networks (GANs). GANs have two components: a generator and a discriminator. Unlike prior 3D human face modeling work, My3DGen discards the discriminator component and focuses only on fine-tuning the generator. "Technically, if you fine-tune or personalize these two components together with limited data, it leads to overfitting and mode collapse," he explains. "The generator is not able to have the generative power anymore. It's overfitted to, let's say, 10 images or even one image. What we do is discard the discriminator and then fine-tune the generator using the selfies themselves. In this way, we're trying to maintain both the generative power of the generator and make it personalized. Then the GAN will be able to learn all the knowledge you want it to learn."
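To make the discriminator-free personalization idea more concrete, here is a minimal, hypothetical PyTorch sketch. It is not the authors' implementation: the tiny generator below merely stands in for a real pretrained 3D-aware face generator, the latents are dummy stand-ins for per-image codes (e.g. from GAN inversion), and the choice of which layer to unfreeze is purely illustrative of training only a small per-user parameter set.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for a pretrained generator: latent code -> 64x64 RGB image.
generator = nn.Sequential(
    nn.Linear(128, 8 * 16 * 16),
    nn.Unflatten(1, (8, 16, 16)),
    nn.Upsample(scale_factor=4),
    nn.Conv2d(8, 3, kernel_size=3, padding=1),
)
for p in generator.parameters():
    p.requires_grad = False                      # freeze the "foundation" model

# Personalize only the last layer; adapter/low-rank updates would shrink
# per-user storage even further.
personal_params = list(generator[-1].parameters())
for p in personal_params:
    p.requires_grad = True

optimizer = torch.optim.Adam(personal_params, lr=1e-3)

selfies = torch.rand(10, 3, 64, 64)              # a user's handful of photos (dummy data)
latents = torch.randn(10, 128)                   # assumed per-image latent codes

for _ in range(100):                             # discriminator-free fine-tuning
    reconstruction = generator(latents)
    loss = F.l1_loss(reconstruction, selfies)    # pure reconstruction loss, no adversarial term
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The point of the sketch is the training recipe from the quote above: no discriminator, a reconstruction objective on the user's own selfies, and only a small slice of generator weights stored per user.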

While many research papers focus on theoretical advancements, this work provides a solution that could already be adopted by industry leaders such as Meta, Snapchat, and other companies with large user bases. "If you're trying to store one million users' models, that requires a huge storage cost," Luchao points out. "It's going to be data-hungry. The computational cost is massive. We're trying to create solutions that give each person or company a usable way. That's something to be proud of."

The work has already paved the way for one extension, also accepted at WACV this year, focusing on continual learning of personalized generative face models. This allows users to update their AI-generated faces as they take new photos without incurring high retraining costs. "We have a lot of follow-up work trying to explore how to make such solutions feasible, scalable, and able to work over time," Luchao adds. "I really believe this direction of decentralizing and democratizing AI solutions for each user is pretty promising. Especially how users can interact with generative AI solutions. That's an interesting direction for future researchers and industrial applications."

To learn more about My3DGen, visit Oral Session 5.1: 3D Computer Vision V today (Sunday) from 14:00 to 15:00.

Yesterday's Keynote

Hannah Kerner: "The volume of data we are talking about is orders of magnitude larger than the largest language dataset. Essentially, every day we get 100 terabytes or more of data from the satellites…!"

Poster Presentation

ColFigPhotoAttnNet: Reliable Fingerphoto Presentation Attack Detection Leveraging Window-Attention on Color Spaces

Emanuela Marasco is an assistant professor in the Department of Information Science and Technology at George Mason University, working on biometric computer vision security. She is also affiliated with Computer Science. Emanuela is an author of a paper accepted as a poster at WACV 2025. I first interviewed her almost seven years ago at CVPR 2018 in Salt Lake City as a Woman in Computer Vision.

The team that co-authored this paper all met at WACV 2024. They decided to start a project together after some discussions at the poster session. They then kept regular meetings and, Emanuela claims, success came from group work.

The focus is on securing mobile devices, specifically smartphones. The lab works on secure smartphone unlocking and, since the pandemic, also cares about hygiene and contactless fingerprinting. In the past, there were attacks on the fingerprint sensors embedded in these devices, so finger photo technology became a very good alternative to traditional fingerprint sensing. It uses just the common RGB cameras embedded in smartphones to capture the finger photo image, and there are very good algorithms for very accurate matching.

Talking of security, some of these technologies are vulnerable to spoof attacks, specifically display attacks, where an attacker displays a picture of someone's finger and can actually deceive the authentication system, endangering all the sensitive information we keep in our phones, like online banking data and so on. Of course, we want our wallet and the unlocking mechanism to be very trustworthy.

The team knew from previous research that these algorithms are not robust to capture bias, mainly due to the evolving nature of the hardware (especially camera characteristics and capture conditions influenced by the environment) and because they are training-based (or data-driven). What can we do to enhance trust? "We want to solve and minimize the capture bias specifically," Emanuela explains. "There are differences in how things appear from the camera. This was definitely the biggest challenge that we had."

"The capture challenge: the camera used to acquire the finger, the instrument used to display it on the other side, the distance, which varies, and then, if it's a printout, which type of display camera is there. Also lighting conditions, background, the texture: these are the challenges!"

What is the main novelty in this work? "Again, the idea was coming from the previous work," says Emanuela, "but in this one we have been customizing the architecture so that it is window-attention based."

"Each attention mechanism focuses on a specific color model. We use deep learning to focus on different color models so that each one captures the information of interest in a better, more efficient way. So we have a pipeline of MobileNets, because we know MobileNets are more efficient for mobile architectures."

In terms of technical motivation, before going deeper into the learning part: YCbCr separates the color from the luminance. This way, we can capture differences in texture that were not captured in RGB. So the color spaces complement each other; it's like a fusion. When we talk about an algorithm that must be embedded in a mobile device, we want to consider all the variations related to mobile capture, and this is challenging on its own.

"For the principle that I mentioned," Emanuela explains, "we are retraining the network on the converted images in different color spaces. In the WACV paper, we do not retrain: the models are still fine-tuned in the architecture that I mentioned. In the step ahead, the work that we are doing right now, we retrain the network from scratch: we transform ImageNet and then we retrain from scratch."

The novelty is also the window-attention mechanism that focuses on specific color spaces. When we integrate different sources of information, we want to make sure there is diversity, so there is no redundancy. And we know how successful the attention mechanism has been in computer vision. "We are very grateful to all the inventors in computer vision," exclaims Emanuela, "because we are using all the great work they have done!"
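For readers who want a concrete picture of the multi-color-space idea Emanuela describes, here is a minimal, hypothetical PyTorch sketch. It is not the authors' ColFigPhotoAttnNet: it omits the window-attention modules and simply runs one MobileNetV3 branch per color representation (RGB, YCbCr, and HSV are assumed here for illustration), fusing the features for a bona fide vs. attack decision; kornia and torchvision are assumed to be available.

```python
import torch
import torch.nn as nn
import kornia.color as KC
from torchvision.models import mobilenet_v3_small

class ColorSpaceSpoofSketch(nn.Module):
    """Toy detector: one lightweight branch per color space, fused at the end."""
    def __init__(self):
        super().__init__()
        # three MobileNetV3-Small feature extractors: RGB, YCbCr, HSV
        self.branches = nn.ModuleList(
            [mobilenet_v3_small(weights=None).features for _ in range(3)]
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(3 * 576, 2)   # 576 = MobileNetV3-Small feature channels

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        # the same finger photo, seen through complementary color models
        views = [rgb, KC.rgb_to_ycbcr(rgb), KC.rgb_to_hsv(rgb)]
        feats = [self.pool(b(v)).flatten(1) for b, v in zip(self.branches, views)]
        return self.head(torch.cat(feats, dim=1))  # logits: [bona fide, attack]

photos = torch.rand(4, 3, 224, 224)          # batch of finger photos in [0, 1]
logits = ColorSpaceSpoofSketch()(photos)     # shape: (4, 2)
```

The design choice mirrored here is the one from the interview: each branch sees the same image in a different color space, so texture cues that are weak in RGB (e.g. luminance-separated detail in YCbCr) can still reach the fused decision.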

Of one thing Emanuela is justly proud: the additional perspective that they give to the representation of spoofness, because it's very promising! To have a better understanding of this fascinating work, talk with Emanuela at Poster Session 3 today (Sunday) from 11:15 to 13:00.

UKRAINE CORNER

Russian Invasion of Ukraine: Our sister conference CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.

My First WACV

Denis Rozumny is a Research Scientist at Meta Reality Labs in Zürich. He has an accepted poster at his first WACV conference. His work is about multi-modal full human body tracking for XR devices. The method is trained in a self-supervised way on real data with learned depth point cloud registration.

Mindtech: Peter McGuinness (left) and Chris Longstaff (right), introducing Mindtech Chameleon, the world's fastest way to create computer vision training data - and you can start for zero cost!

Posters

Top: Bridging the gap between theory and practice, Katharina Prasse (supervised by Margret Keuper) is the first to use Minimum Cost Multicut in conjunction with foundation models to cluster images into visual frames.

Bottom: Gasser Elazab, PhD candidate at Volkswagen and TU Berlin, presenting state-of-the-art scene reconstruction from vehicle dash cameras.

Congrats, Doctor Stefano!

Stefano Gasperini obtained his PhD just a few weeks ago. Under the supervision of Federico Tombari, he worked with BMW and the CAMP team at TUM, tackling reliability challenges in scene understanding for autonomous driving. Now a PostDoc advising PhD students, Stefano has also co-founded VisualAIs Labs GmbH with Shun-Cheng Wu. Their technology creates high-fidelity 3D renderings of e-commerce products from just smartphone images, making interactive 3D visualizations effortless. Congrats, Doctor Stefano!

Imagine stepping into your car after a long night out—whether celebrating or finishing a CVPR submission—and letting it drive you home safely. Autonomous driving promises such a future, but making it reliable under all conditions poses significant challenges, from nighttime driving to unknown objects. While autonomous rides already exist in select areas, their widespread adoption is still on the horizon. Among the many obstacles, reliable scene understanding is a critical one, as the vehicles must perceive their surroundings accurately in all scenarios, including adverse weather, poor lighting, and entirely unseen objects.

While much of the scientific community focuses on incremental benchmark improvements, Stefano's PhD tackled fundamental reliability gaps that are often ignored. Take monocular depth estimation at night, for example. State-of-the-art self-supervised approaches fail in the darkness due to training assumptions that completely break with low light, reflections, and sensor noise, leading to extremely poor outputs. At ICCV 2023, Stefano and his colleagues introduced md4all, a method that leverages strong daytime models to improve nighttime depth estimation. Instead of retraining different models for every condition, md4all generates synthetic nighttime images corresponding to the available well-lit ones and trains the model using supervision only from the original, well-lit data.

The key is computing the losses only on the corresponding well-lit images, regardless of the condition fed as input. This enables a single model to work reliably across diverse conditions without changes at inference time. The same concept also proved effective in rainy conditions and in fully-supervised settings. As shown in the Figure, the results are striking. Code, models, and generated images are available here.

Additionally, Stefano's work includes contributions on an end-to-end method for panoptic segmentation, domain generalization techniques using uncertainty estimation and plausible adversarial augmentations, extreme generalization to unseen objects of completely unknown categories, and depth prediction for dynamic objects with weak radar supervision. From safer autonomous driving to seamless 3D product visualization, Stefano continues pushing the boundaries of computer vision for real-world impact.

Figure 1. The md4all framework: a frozen daytime depth model estimates depth on daytime samples and provides guidance to another model fed with a mix of daytime and nighttime inputs. Inference is done with a single, simple model.
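As a rough illustration of the md4all training recipe described above (the distillation flavor suggested by the figure caption), here is a short, hypothetical Python sketch of a single training step; `student`, `teacher`, `day_img`, `translated_img`, and `loss_fn` are placeholders for this sketch, not names from the released code.

```python
import random
import torch

def md4all_style_step(student, teacher, day_img, translated_img, loss_fn):
    """One training step: mixed easy/hard inputs, supervision always from daytime."""
    # randomly feed either the original daytime image or its translated
    # (night/rain) counterpart to the student
    inp = translated_img if random.random() < 0.5 else day_img
    pred = student(inp)

    with torch.no_grad():            # the daytime teacher stays frozen
        target = teacher(day_img)    # guidance computed on the well-lit image only

    # the loss never looks at the adverse-condition pixels directly,
    # so the student learns to behave the same in the dark as in daylight
    return loss_fn(pred, target)
```

Because the supervision signal always comes from the well-lit image, the same student model can be deployed at inference time on day, night, or rain inputs without any switching logic.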
