WACV 2025 Daily - Monday

Winter Conference on Applications of Computer Vision (WACV) 2025 - Monday

Giulia's Picks of the Day

Giulia Rizzoli is a recent PhD graduate from the University of Padova, where she specialized in Scene Understanding, focusing on domain adaptation and continual learning for semantic segmentation. At the venue, together with her colleague Matteo Caligiuri, she is presenting work on Autonomous Driving, enabling multi-device understanding in a federated manner. Her research aims to develop a system that operates universally, independent of viewpoint, geographical location, and weather conditions. Giulia's future research explores open-vocabulary applications, moving beyond predefined class sets to enable more flexible and adaptive AI systems.

Giulia's picks for today, Monday:
- 7.4.virtual Language-Guided Instance-Aware Domain-Adaptive Panoptic Segmentation
- 8.1.2 S3PT: Scene Semantics and Structure Guided Clustering to Boost …
- 8.2.2 Sigma: Siamese Mamba Network for Multi-Modal Semantic Segmentation
- 5.26 Exploiting VLM Localizability and Semantics for Open Vocabulary Action Detection
- 5.virtual Exploring Scalability of Self-Training for Open-Vocabulary Temporal Action …

Let us also tell you about Giulia's own oral today:
- 7.4.2 When Cars meet Drones: Hyperbolic Federated Learning for Source-Free …

Editorial

Dear all,

I'll make this very short. It was a pleasure preparing this WACV Daily once again for you! Thank you for reading, and thank you to IEEE for trusting me once again. Please keep in touch after WACV!

We will publish a BEST OF WACV celebration in the Computer Vision News magazine of March 2025. Subscribe for free here!

Enjoy the reading and have a fantastic taco Monday in the Grand Canyon State of Arizona ☺

Ralph Anzarouth
Editor, Computer Vision News

Ralph's photo above was taken in peaceful, lovely and brave Odessa, Ukraine.

WACV Daily Editor: Ralph Anzarouth
Publisher & Copyright: Computer Vision News
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, WACV and the conference organizers.

Oral Presentation
Composed Image Retrieval for Training-Free Domain Conversion

Nikos Efthymiadis is a PhD candidate in the Visual Recognition Group (VRG) at the Czech Technical University in Prague (CTU). He is the first author of not one but two accepted papers at WACV this year! After a successful poster session yesterday, Nikos is here to tell us more about his work on domain conversion before his two oral presentations later today.

Traditional image retrieval systems rely primarily on either image-to-image or text-to-image search. In contrast, the composed image retrieval task enables users to search an image database using both an image and a text query, significantly improving search precision and usability. Domain conversion, a subcategory of composed image retrieval, allows users to search for images that resemble a given image but exist in a different domain, such as a painting, sketch, or origami representation.

“If you have more modalities as a user to search, you can search for more specific things,” Nikos points out. “Your expressive abilities increase so that you can have better searches. In domain conversion, the text query defines the domain. You search with one image and want to find images that look like it but in different domains. This is useful if you would like to create cross-domain datasets in an automated way and you want to search for many, many images.”

Domain conversion was previously treated as a class-level task, where searches focus on broad categories. A simple text-to-image search is enough if the class and domain are known. However, this work introduces instance-level domain conversion in one of its four datasets, which is helpful in several ways. First, for a large dataset composed only of photographs, it enables the retrieval of equivalent datasets in other domains, such as sketches or paintings. Second, when the class is unknown, users can search using an image and a specified domain to find relevant results. Finally, instance-level retrieval helps search for specific instances that are difficult to describe with words alone.

One of the main challenges Nikos faced in this work was the gap between the image and text modalities in the CLIP space. Although CLIP is trained to align images and text, the two modalities remain relatively separate, making it difficult to merge them effectively for composed image retrieval.

He devised an innovative approach to solve this. “We decided to express the images as a set of words and then combine them in the word modality with textual domains,” he explains. “We found this was an intuitive way of tackling the problem. Specifically, for the instance-level dataset, we had to increase the number of words to describe an instance.”

Retrieval is a popular computer vision task, with image-to-image search dating back to the early days of the field. Composed image retrieval represents a newer and rapidly developing area. “Specifically, it's zero-shot composed image retrieval,” Nikos clarifies. “This is a newer task, but it's founded on a very traditional computer vision task!”

One of the most significant contributions of this research is the creation of a comprehensive testbed for domain conversion. Before this work, there was a single widely used dataset for this task, ImageNet-R. “From ImageNet-R, people were using only the photograph domain as queries, and the rest of the domains as positives and database,” Nikos tells us. “We didn't think this was enough benchmarking for this task, so we made a testbed of four datasets.”
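To make the image-as-words idea above more concrete, here is a minimal, hypothetical sketch, not Nikos's actual method or code: an image embedding is mapped to its nearest vocabulary words in a shared image-text space, those pseudo-words are composed with a domain text embedding, and the database is ranked by similarity to the composed query. The toy vocabulary, the random placeholder embeddings, and the simple averaging used for composition are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # embedding dimension of a CLIP-like joint image-text space

def normalize(x):
    # L2-normalize along the last axis so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Placeholder embeddings standing in for CLIP outputs (illustrative only).
vocab = ["fox", "water", "bridge", "bicycle", "mountain"]
vocab_emb = normalize(rng.normal(size=(len(vocab), DIM)))   # word embeddings
query_img_emb = normalize(rng.normal(size=DIM))             # embedding of the query image
domain_emb = normalize(rng.normal(size=DIM))                # e.g. text embedding of "a sketch of"
database_emb = normalize(rng.normal(size=(1000, DIM)))      # embeddings of database images

# 1) Express the query image as its k nearest vocabulary words.
k = 3
word_scores = vocab_emb @ query_img_emb
top_words = np.argsort(-word_scores)[:k]
print("pseudo-caption words:", [vocab[i] for i in top_words])

# 2) Compose the pseudo-words with the target domain in the text modality
#    (a plain weighted average here; the actual composition is more involved).
composed = normalize(0.5 * domain_emb + 0.5 * normalize(vocab_emb[top_words].mean(axis=0)))

# 3) Retrieve: rank database images by cosine similarity to the composed query.
ranking = np.argsort(-(database_emb @ composed))
print("top-5 retrieved database indices:", ranking[:5].tolist())
```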

By establishing this testbed, the research opens new doors for further exploration in domain conversion and composed image retrieval and provides a more robust benchmarking framework for future research. “We didn't think the word domain should be limited to style, so we re-purposed the NICO++ dataset for this task,” Nikos adds. “Here, NICO's definition of domain is context. Now, we can have photographs that are classes. For example, a fox near water or a fox photographed in dim light. These are actually domains. There are domain shifts in these subsets of the dataset. We also find this very interesting.”

To learn more about Nikos's work, visit Oral Session 8.2: Vision and Language I from 14:00 to 15:00. Remember that Nikos presents another oral a few minutes later at session 8.4. Follow him if you want to find out about both his oral presentations!

Poster Presentation
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

Matthew Gwilliam is a fifth-year PhD student at the University of Maryland. He is also the first author of a paper that was accepted at WACV as a poster. For his next step, Matt has already accepted a full-time research scientist position in industry.

The most interesting idea in this work is using large language models to take videos that have very long text captions and to reliably and controllably rewrite those captions into different words with the same meaning. It is interesting to take a five-sentence, very detailed, nitty-gritty, play-by-play description and condense it down to a five-word summary. The goal is to make sure that existing video language models understand short descriptions just as well as they understand very long ones.

Matt did this work during an internship with SRI International, and the idea arose organically from conversations among him, his mentor Michael Cogswell, and the team's manager and last author on the paper, Ajay Divakaran. The most challenging part was designing and executing the user study itself, where they ended up using 15 people to check parts of the data to make sure that the LLM output meant the same thing as the original text annotations. Setting that up, organizing it, and tabulating it in a way that would convince readers that the data was trustworthy was probably, for Matt, the hardest part of the execution of the paper.

In terms of scientific challenges, understanding why the video language models struggled with the new synthetic data was a little challenging. There were some cases where a short caption would still seem fairly easy for a human to match to its video, but the video language models would make very strange decisions. It was not easy to figure out a way to, first off, identify the easy cases compared to the hard ones, to identify when a short caption loses information versus when it doesn't, and to identify when it has lost the identifying information versus when it is still unique. Those are some key challenges, on the organizational side but also on the technical side, and some of those are still open questions. The paper has the merit of providing preliminary answers as to why the models might still struggle, but some of that remains open.

To tackle this, the team first looked at shortening in terms of losing words, but that didn't really matter; they tried to look at when information was lost; they counted unique nouns; and they looked at the overlap in meaning between the vector embeddings of the different captions. “We looked at how,” Matt explains, “when we went from long to short, how those embeddings moved with respect to each other, to become closer, which indicated more ambiguity or not.”

“And then we would have humans go in once we flagged some and just spot check and ask: to a human reading these captions, does it still look unique?”

Some of the more interesting vision work in this paper was really “vision language”, since much of the effort was on the language side. The team could take this synthetic data and then train the video language model itself on it to achieve better performance. Because of the way the model was designed, the existing losses of the video language models could be kept, and they would simply sample from the synthetic text data to improve the alignment between the videos and the new text. Another interesting finding is that this also improved retrieval on the non-synthetic original paragraph text.

Figure 1. In real-world text-to-video retrieval, users could use diverse queries. Standard long video datasets use only paragraph-style captions (“Existing”, “Full paragraph”), which does not allow for training or evaluation on a representative set of long video descriptions. Practical applications also require the ability to handle complex, short, and partial descriptions of a long video. In this work, we introduce an approach to generate, evaluate, and train on such diverse video description data.

This makes it possible to improve the way these models reason about videos and about the types of captions they were already familiar with. With respect to ensembles, you could take a model zero-shot and use the different synthetic captions as different attempts to query the same video, instead of just having one paragraph caption. You could use the model, generate lots of captions, use all of those as queries, and take the median result as the retrieved video. That was often a more reliable way to find the video in question than relying on the original caption alone. In Matt's words, the term for this is query expansion to get better multimodal alignment.

Matt is most proud that the team was able to put out work that hopefully contributes to a paradigm change in how we look at and treat long video, especially in paying attention to unintentional biases that creep in through the way current data collection processes treat the relationship between video and text. “One thing that my work points to a little bit,” Matt concludes, “is just a level of controllability and reliability that I think is just very important for folks to understand as we use more and more synthetic data in training. I craft my prompts very carefully. I use humans to verify that those prompts are effective in preventing hallucinations. And I think that those are very, very important things for folks to keep in mind as we use more and more synthetic data, both for training and evaluation.”

Talk with Matt at Poster Session 4 today (Monday) from 11:15 to 13:00.

Figure 2. We can train with our synthetic data to boost performance. For contrastive finetuning for retrieval with video-caption pairs, we propose mixing our 10k text captions with ground truth captions. We compute standard contrastive loss, but each caption is sampled randomly from the 10k captions for a given video, according to a mixing ratio, η. This sort of finetuning boosts performance both on 10k text evaluation data, as well as on the original evaluation data.
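As a rough illustration of the caption mixing described in Figure 2, here is a hedged sketch, not the authors' code: each video's training caption is drawn from its pool of synthetic rewrites with probability given by a mixing ratio η, and from the ground-truth paragraph otherwise, before computing a standard symmetric contrastive loss. The encoder stand-ins, dimensions, and temperature are illustrative assumptions.

```python
import random
import torch
import torch.nn.functional as F

def sample_captions(gt_captions, synthetic_pools, eta=0.5):
    """For each video, pick a synthetic caption with probability eta,
    otherwise keep the ground-truth paragraph caption."""
    return [random.choice(pool) if random.random() < eta else gt
            for gt, pool in zip(gt_captions, synthetic_pools)]

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Standard symmetric InfoNCE loss over a batch of video-text pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.T / temperature
    targets = torch.arange(len(video_emb))
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

# Toy usage with placeholder embeddings (stand-ins for the video-language model).
gt_captions = ["full paragraph description of video 0", "full paragraph description of video 1"]
synthetic_pools = [["short summary 0", "partial description 0"],
                   ["short summary 1", "partial description 1"]]
captions = sample_captions(gt_captions, synthetic_pools, eta=0.5)
video_emb = torch.randn(2, 256)   # placeholder video embeddings
text_emb = torch.randn(2, 256)    # placeholder text embeddings for `captions`
print(captions)
print("loss:", contrastive_loss(video_emb, text_emb).item())
```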

Poster Presentation
Rubric-Constrained Figure Skating Scoring

The team that co-authored this paper all met at WACV 2024. They decided to start a project together after some discussions at the poster session. From then on, they kept regular meetings and, Emanuela claims, success came from group work.

Arushi Rai is a PhD student at the University of Pittsburgh, under the supervision of Adriana Kovashka. She is also the first author of a paper that was accepted as a poster at WACV 2025: Rubric-Constrained Figure Skating Scoring.

Arushi noted that many methods for scoring figure skating performances focus primarily on the difficulty of the performance rather than its actual technical quality. Though this is a very small detail discussed in Arushi's paper, it is nonetheless a big discovery, because it shows that previous methods were mostly solving a recognition task rather than the quality assessment task, which is harder than recognition.

The figure skating score sheet breakdown includes a base value score for each element performed in the routine and a grade of execution score for each performed element. The grade of execution ranges from -3 to +3 for performances before 2018 and from -5 to +5 after 2018.

However, the base value can be around 11 or 12, which is a much larger value. When you have a regression task, the goal of the model is to capture this variance, and in this task most of the variance comes from the difficulty, which is why models end up focusing on the difficulty. Thus, the first contribution, a minor thing inside the paper but quite a big thing outside of it, is to focus on actually scoring quality by disentangling the difficulty of the performance from its quality. This is actually not straightforward to do, because if you scrape the base values, i.e. the difficulty, from the score sheets, they can reveal information that would only be available after the performance has taken place; sometimes they leak quality information. What should be done instead? “Instead, I went back to the base values that have been established by the ISU, the International Skating Union,” Arushi explains. “And then I can start focusing on the problem that I actually wanted to focus on: increasing the interpretability of these models for the end user and athlete.”

What is giving a high score in the performance? What is giving a low score? That became Arushi's next goal. She started thinking about how, compared to sports like basketball, where the success or failure of an action is pretty straightforward, figure skating is very difficult for the lay person: it seems like a very subjective scoring process. However, one of the original papers that established this task actually reported that judges have 96% agreement on these scores, despite the apparent subjectivity. How is that possible? “That comes from the use of rubrics,” Arushi reveals. “I was thinking, is there a way for me to utilize the rubrics that judges use to score these performances? Can this actually help the model? Can this help with interpretability? It's very minute things that lead to the quality score!”

What Arushi did was to use the rubric information from the International Skating Union via what she calls the ‘Rubric-Constrained Scoring Head’. The next problem is that the rubric applies to technical element scoring, which is one big chunk of the figure skating score. It is applied to each element performed in the routine (a jump, spin, step sequence, or choreographic sequence), rather than to the transitions in between, where the skater is skating across the rink and dancing to the music. Those transitions are what make it a whole performance, and that holistic score is another aspect of the judging. “But that's not what I wanted to focus on,” Arushi objects. “I wanted to focus on something that was less subjective, like the technical score. We don't have these element segmentations. How can we use the rubric information?”

“I proposed using a module that I call an element transformer, but it's an encoder-decoder transformer. The key feature of the decoder in this element transformer is that there are these learnable queries, these learnable element queries. These are the same length as the number of elements that there are in a short performance, so seven: seven element queries.”

This allowed Arushi to take in a variable-length performance, which might be a whole bunch of clips, around 126 clips in the video. The output of this encoder-decoder transformer is a fixed set of element embeddings. In that way, we get element-wise embeddings, which can later be used with the rubric scoring.

The next piece is the regularization applied to make sure that the element queries in the transformer decoder focus on the elements in the video. “There are two observations that are used here,” Arushi adds. “One is that element segments do not overlap with other segments in the same video, and the other is that each element segment is a continuous sequence of clips. That means the queries should focus on different parts of the video from one another for the same video, and each query should also focus on a continuous sequence of clips when cross-attention is performed in the transformer decoder. This is the implicit segmentation regularization!”

Does this research open new directions, or could it be made more powerful by further studies? Arushi thinks so: this work has been limited by the lack of annotations, so the rubric adherence could probably be improved by using annotations of the rubric breakdown for each element of each performance.

Arushi will tell you more about this computer vision work when you visit her poster, today (Monday) at Poster Session 5, between 16:15 and 18:00.
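For readers who want a more concrete picture of the element transformer Arushi describes, here is a hedged PyTorch sketch, not her released code: a transformer decoder with seven learnable element queries turns a variable-length sequence of clip features into a fixed set of element embeddings, and a simple penalty on the cross-attention maps discourages different queries from attending to the same clips. The dimensions, layer counts, and the exact form of the regularizer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ElementTransformer(nn.Module):
    """Variable-length clip features -> fixed set of element embeddings."""
    def __init__(self, dim=256, n_elements=7, n_heads=4, n_layers=2):
        super().__init__()
        # Learnable element queries, one per element in a short performance.
        self.element_queries = nn.Parameter(torch.randn(n_elements, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

    def forward(self, clips):                     # clips: (B, T, dim), T varies per video
        memory = self.encoder(clips)
        queries = self.element_queries.unsqueeze(0).expand(clips.size(0), -1, -1)
        return self.decoder(queries, memory)      # (B, n_elements, dim)

def overlap_penalty(attn):
    """Illustrative regularizer: element queries of the same video should attend
    to different clips, so penalize pairwise overlap of their attention maps.
    attn: (B, n_elements, T) cross-attention weights, each row summing to 1."""
    gram = attn @ attn.transpose(1, 2)            # (B, n_elements, n_elements)
    off_diag = gram - torch.diag_embed(torch.diagonal(gram, dim1=1, dim2=2))
    return off_diag.abs().mean()

# Usage: a performance of 126 clips reduced to 7 element embeddings.
model = ElementTransformer()
clip_features = torch.randn(1, 126, 256)          # placeholder clip features
element_embeddings = model(clip_features)
print(element_embeddings.shape)                   # torch.Size([1, 7, 256])
```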

Yesterday's Keynote

Phillip Isola (photos by Lillie Elliot Photography)

Women in Computer Vision

Rita Pucci is a postdoc at Naturalis Biodiversity Center, a natural history museum in Leiden, the Netherlands. She is also the first author of a paper that was accepted as a poster at WACV 2025: CE-VAE: Capsule Enhanced Variational AutoEncoder for Underwater Image Enhancement.

Rita, in a few weeks you will be an assistant professor. Congratulations in advance!
Thank you! I hope they don't change their mind until then.

I hope not.
I already signed the contract.

Okay, so there is a contract. What is your work about?
My work is mainly focused on computer vision. I always want to use computer vision for animals, for biodiversity, to understand if I can help biologists, together with computer vision, to better understand nature - because I love nature very much. That was my compromise to work in biology without being a biologist.

Read 160 FASCINATING interviews with Women in Computer Vision

What do you do concretely in this direction?
At the moment, I work on species recognition in images. I collaborate with Naturalis Biodiversity Center, and they have a mobile phone app which you can use to take pictures. That is the first step: as a reward, users get to know the name of the species. So we work on computer vision models that are good with common species and with rare species as well, to give the user information about the species. For us it is important to get images from all around Europe. By giving you the name of the species automatically, the system can see the distribution, and we can understand if some species are declining or some species are moving, and also whether we can relate this to climate change.

So you are not going to change the reality - you're just taking a photograph of what is there?
My first rule is "don't touch anymore!"

So how does it help?
For now, we are using only images to help biologists understand - first of all - the distribution of species: it is really important to understand what is alive now. That means that in 10 years, if we do the same distribution study, we can understand if something has disappeared in the meantime. Then it could be too late! You know, extinction is also a normal process in nature. What is important is to understand if it's related to us, if it's our fault or if it's natural. We can also track the movement of species thanks to these systems, because when you do it with humans it takes ages. We have more than one million images to analyze to understand the species, and having a machine learning model that can do that for us saves a lot of time.

Tell us about the computer vision in your work.

In this work I'm using a different architecture, the state of the art. In particular, I'm really interested in hybrid architectures, architectures based on both convolutional neural networks and transformers, and in understanding whether you can get a better identification of rare species when you use this type of architecture. I'm also collaborating with different universities to see if there are emerging technologies that can be applied. In particular, we are now trying to understand if the space in which the model is defined plays a role: if you use Euclidean space, or hyperbolic space, or other spaces, does it really help or not? I don't have the answer yet, but we are working on that.

What is the most challenging part of your work, the one that gives you the biggest headache?
Since it's highly interdisciplinary, I think the most difficult part is to understand what is important and what is not, because I'm a computer scientist without any degree in biology, so my point of view is always on the architecture, on the maths, on the computer vision side. I want to see: what can I do more? How can I improve? How can I change, and what is possible? But I never think about whether it's important or not, whether it's useful. On the other side, I have to talk with biologists. They don't care about my architecture. They care about the result, because they want it to answer questions, and I have to understand which type of answer, which type of result they need. That is really, really difficult every time.

Let's rebound to your paper, which unfortunately you're not presenting in person. What is this paper about, and what would we have heard from you if you had come to present?
My paper presents a new computer vision architecture that uses capsule layers to identify particular details in the images. That is the main idea: to understand if capsule layers can play a role in

getting details in images. In particular, we are applying this type of model to enhancement, to get better quality images. Since I'm really into nature, I wanted to understand if it was possible to use it with underwater images. Underwater images are really fantastic, but usually they are really noisy. The colors are not there; everything is blue or green. It depends on the depth, and that is really terrible when you are on a mission to understand what is underwater, which type of species, what is in front of you. The method consists in taking images from below the surface, extracting the features, extracting the main components of the images, and trying to understand the role of these features with capsule layers, in order to know whether the features together belong to an object or play a role in the scene. And then reconstructing the images just from the features: reconstructing the images means learning to take into consideration the features that are used to rebuild the information of the object and the colors while removing the noise. The idea is to have the first part, the extraction, and the second part, the reconstruction, completely isolated. They collaborate, but the extraction doesn't need the second part. That means it can also be used as an image compressor, embedded in the underwater robot, and then, when the campaign is finished, we can collect just the features of the images and reconstruct all the images.

I know scientists who have been working for many years on underwater images: Derya Akkaynak in Eilat and Tali Treibitz in Haifa.
Yeah, I have seen lots of talks from them. They are working more on the physics that is behind the fact that the color changes, and they use the location a lot, because if you have all this information about the location, the depth, the moment of the day, you can really use physics to remove the water and you have beautiful color. On the other hand, when you don't have all this information but just the image, you need to guess what is inside. So we are training the model to, let's say, guess - but we hope that it's learning how to reconstruct.
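As a rough illustration of the two-stage design Rita describes, here is a hedged sketch with placeholder layers, not the actual CE-VAE capsule architecture: the feature-extraction stage can run on the underwater robot and store compact features, while the reconstruction stage rebuilds enhanced images offline from those features alone. All layer choices and sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Extractor(nn.Module):
    """On-device stage: compress a frame into a compact feature vector."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 256 -> 128
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 128 -> 64
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

class Reconstructor(nn.Module):
    """Offline stage: rebuild an enhanced image from stored features only."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 64 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 8 -> 16
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(), # 32 -> 64
        )

    def forward(self, z):
        return self.net(self.fc(z).view(-1, 64, 8, 8))

# Stage 1 (on the robot): store only the compact features for each captured frame.
extractor = Extractor()
frame = torch.rand(1, 3, 256, 256)        # placeholder underwater frame
features = extractor(frame)               # (1, 128), far smaller than the raw image

# Stage 2 (after the campaign): reconstruct enhanced images from stored features.
reconstructor = Reconstructor()
enhanced = reconstructor(features)        # (1, 3, 64, 64) in this toy sketch
print(features.shape, enhanced.shape)
```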

Let's talk about your future job. You are going to have the responsibility of building the people who will make science progress. How do you approach your upcoming tasks?
I'm studying a lot. I'm studying the subject that I have to bring, because it's a new subject for me. I'm doing computer vision now, but in the future position I will work on multimodal representations, to model entire ecosystems in nature. At the same time, I'm starting to learn how to teach people. I'm reading books about teaching and about how to deal with people, and I'm also starting to get some experience: I asked my current supervisor to let me supervise students, to understand how to interact with students.

It can't go wrong. You are prepared!
I don't know. We'll see. We'll see.

What our readers do not know yet is that, like me, you are Italian. Actually, we could make this interview only moving our hands and we would understand each other perfectly. Do you see yourself going back to Italy to work sometime?
Oh, that's a difficult question.

I know. That's why I ask.
I can see myself collaborating with a university in Italy. I'm still collaborating with Udine University, my previous university. I don't know if in the future I would like to go back to Italy. But for sure, I like collaborating with Italian universities.

You're from Tuscany, from Livorno. It's a beautiful place and we recommend people to visit.
Yes, come to Livorno!

Your final message?
Oh, well, my message is: don't be scared of AI. It can also be used in a really good way, like for biodiversity, for helping us understand this world, not only for destroying it.

Rita, if you are the voice of AI today, we are not scared at all!

To find out about Rita's WACV 2025 contribution, visit Oral Session 9.4: Visual Recognition IV today (Monday) from 15:15 to 16:15 and Poster Session 5 today from 16:15 to 18:00.

UKRAINE CORNER

Russian Invasion of Ukraine

Our sister conference CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.

Double-DIP

Don't miss the BEST OF WACV 2025 in Computer Vision News of March. Subscribe for free and get it in your mailbox! Click here

Posters

Top: Yu-Yun Tseng (Everley Tseng) is advised by Danna Gurari at the University of Colorado Boulder. Her oral paper is about a private content localization benchmark. Call her Everley!

Bottom: Claudio Rosito Jung is a professor at the Federal University of Rio Grande do Sul in Brazil. Here he is presenting his work that introduces a dataset with oriented cell annotations, a benchmark of detectors, and biological applications.

Lovely WACV photos by Lillie Elliot Photography (www.LillieElliot.com)
