A publication by Computer Vision and Pattern Recognition. Thursday, Seattle 2024. CVPR Awards, Highlights, Challenges, Workshops, Previews, Women in Computer Vision, expressly reviewed for CVPR 2024!
CVPR Daily, Thursday

Alessandro's picks of the day, for today, Thursday 20 (posters and one oral award candidate):
[4-400] Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation
[4-406] Detours for Navigating Instructional Videos
[4-210] Hyperbolic Learning with Synthetic Captions for Open-World Detection
[4-268] Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping
[4-396] Test-Time Zero-Shot Temporal Action Localization

I am Alessandro Flaborea. I recently completed my PhD, successfully defending my thesis, "Anomaly Detection Across Multiple Domains". It represents years of hard work on Video Anomaly Detection, Procedural Learning, and Hyperbolic Neural Networks as a member of the Perception and Intelligence Lab (PINlab) under the supervision of Fabio Galasso. Currently, I am collaborating with ItalAI on the technological transfer of my research into a product while continuing to work with scientists at PINlab to advance the state of the art.

Highlight: Alessandro forgot to tell you that he is also presenting his poster today, in the afternoon session: PREGO: Online Mistake Detection in PRocedural EGOcentric Videos [Poster 4-374]. He is also ready to take the next steps in his career. He's a catch! Grab him before it's too late!
CVPR Daily
Publisher: RSIP Vision. Copyright: RSIP Vision. Editor: Ralph Anzarouth.
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, CVPR and the conference organizers.

SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency (Award Candidate)

Paul Roetzer (left) is a PhD student under the supervision of Florian Bernard (right), an Associate Professor at the University of Bonn and the Head of the Learning and Optimisation for Visual Computing Group. Before their oral presentation this afternoon, they speak to us about their highlight paper on 3D shape matching, which has also been chosen as a Best Paper Award candidate.

The problem of 3D shape matching involves identifying correspondences between the surfaces of 3D objects, a task with applications in medical imaging, graphics, and computer vision. This work's main novelty is that it accounts for geometric consistency, a property often neglected in previous 3D shape matching methods due to its complexity. Geometric consistency ensures that when matching the surface of one shape to another, neighboring elements are matched consistently,
preserving neighborhood relations. "Imagine two organs, like the liver, heart, or lungs, and you match them from different people," Florian explains. "You take the shapes and want to train a statistical shape model. If you didn't have this geometric consistency, the deformation from one to the other would lead to self-intersecting sections, which aren't anatomically plausible." Many existing approaches to 3D shape matching do not enforce geometric consistency as a hard constraint but optimize it as a soft objective, often framed as graph matching or quadratic assignment problems. "This is a problem class well known to be NP-hard, making it extremely challenging to solve for large instances in practice," Florian tells us. "We find a different representation that makes the problem easier to solve." Florian and Paul propose a novel path-based formalism, representing one of the 3D shapes (the source shape) as a long, self-intersecting curve (the 'SpiderCurve') that traces the 3D shape's surface. This alternative discretization reduces the 3D shape matching problem to finding the shortest path in the product graph of the SpiderCurve and the target 3D shape. "This switch of the discretization is what makes our paper novel," Paul points out. "We think differently about the problem, turning a very complicated task into a simpler one." This formalism leads to an integer linear programming problem, which the team demonstrates can be efficiently solved to global optimality. The result is competitive with recent state-of-the-art shape matching methods and guarantees geometric consistency. "For the first time, we can find geometrically consistent shape matchings while also finding global optima in practice," Florian reveals. "Within the framework of our optimization formulation, in all the instances that we've evaluated, we know that we have the best possible solution among all potential solutions!"
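The product-graph idea can be illustrated with a toy sketch. This is not the authors' method (SpiderMatch solves an integer linear program with additional geometric-consistency constraints); it only shows the core intuition of matching a curve to a surface as a shortest path through the product of curve steps and target vertices. The function name, cost inputs, and graph construction here are illustrative assumptions.

```python
import heapq

def product_graph_shortest_path(curve_costs, target_adj):
    """Toy sketch: match each step of a source curve (length M) to a
    vertex of a target shape (N vertices) via a shortest path through
    the M x N product graph.

    curve_costs[i][v]: cost of matching curve step i to target vertex v
    target_adj[v]:     neighbors of vertex v on the target surface
                       (staying on v is also allowed)."""
    M, N = len(curve_costs), len(curve_costs[0])
    dist = {(0, v): curve_costs[0][v] for v in range(N)}
    prev = {}
    heap = [(d, s) for s, d in dist.items()]
    heapq.heapify(heap)
    while heap:
        d, (i, v) = heapq.heappop(heap)
        if d > dist.get((i, v), float("inf")):
            continue  # stale queue entry
        if i == M - 1:
            # reconstruct the matching: curve step -> target vertex
            path = [(i, v)]
            while path[-1] in prev:
                path.append(prev[path[-1]])
            return [v for _, v in reversed(path)], d
        # advance one curve step while moving to a neighboring target
        # vertex (or staying put), so neighboring curve steps map to
        # neighboring target vertices: the geometric-consistency intuition
        for w in list(target_adj[v]) + [v]:
            nd = d + curve_costs[i + 1][w]
            if nd < dist.get((i + 1, w), float("inf")):
                dist[(i + 1, w)] = nd
                prev[(i + 1, w)] = (i, v)
                heapq.heappush(heap, (nd, (i + 1, w)))
    return None, float("inf")
```

On a tiny instance with three curve steps and two target vertices, the path simply hops between the two vertices wherever the matching cost is lowest, while the adjacency constraint keeps consecutive steps on neighboring vertices.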
3D shape matching is just one of a class of matching problems that are fundamental to computer vision. Could devising an innovative new approach to solving such a fundamental problem be part of the reason the paper has been chosen as a candidate for a Best Paper Award? "We have conceptually a pretty simple idea," Florian responds. "Instead of representing a 3D shape using triangles as discretization, we simply discretize the 3D shape using a one-dimensional curve that traces the surface while visiting all the vertices.
By looking at a different representation of the 3D surface, we can build on well-established frameworks for globally optimal matching problems that lead to geometric consistency. I think the secret is the simplicity and the fact that it's very fast in practice." Away from writing top-rated papers, Florian works at the intersection of machine learning and mathematical optimization in visual computing. Meanwhile, Paul explores solutions to 3D shape matching problems with optimization methods. Looking ahead, Florian acknowledges an unresolved challenge: "The most critical open problem is whether an algorithm exists to solve this problem in polynomial time," he ponders. "What we have is fast in practice, but the worst-case time is still exponential. The next step would be to investigate if it's possible to come up with a similar formalism that could lead to a polynomial-time algorithm that is provably fast." To learn more about Paul and Florian's work, visit Orals 4B: 3D Vision (Summit Flex Hall AB) from 13:00 to 14:30 [Oral 2] and Poster Session 4 & Exhibit Hall (Arch 4A-E) from 17:15 to 18:45 [Poster 185].
Don't miss the BEST OF CVPR 2024 in Computer Vision News of July. Subscribe for free and get it in your mailbox! Click here
NeuRAD: Neural Rendering for Autonomous Driving (Highlight Presentation)

Adam Tonderski (left), Carl Lindström (center) and Georg Hess (right) are industrial PhD students at Zenseact and Lund University (Adam) and Zenseact and Chalmers University of Technology (Georg and Carl). Their work, recognized by organizers as a highlight paper, proposes a neural rendering method tailored to dynamic automotive scenes. They speak to us ahead of their poster session this afternoon.

In this paper, the team proposes NeuRAD, or Neural Rendering for Autonomous Driving, which aims to reconstruct scenes with NeRF methods, rendering new views and changing where the actors are in the
scene. Unlike traditional NeRF methods, which often require long training times and dense semantic supervision and lack generalizability, NeuRAD is designed to be robust and efficient for dynamic autonomous driving (AD) data. "The special part is that we emphasize the issues we have in our AD data," Georg tells us. "Compared to your normal NeRF, our scenes are huge. We drive for hundreds of meters, so we have to handle these multiple scales. And we drive quite fast." The team builds on the latest advancements in neural rendering technology and adds some special tricks for AD, including very accurate sensor modeling and clever speedup techniques that make it easy to accelerate on the GPU.
NeuRAD integrates a variety of state-of-the-art methods across different NeRF techniques. "Previously, people have been focused very much on the camera-only setting and not that much on automotive data," Carl points out. "The techniques focused on automotive data have not included a full sensor setup with 360-degree cameras and lidar. We focus on trying to capture the full sensor setup commonly used in AD." However, the road to developing NeuRAD has been challenging, involving navigating a wide range of existing ideas and methods. The team wanted to bring out the best of all the different methods without creating a convoluted system. "We had to cut a clear, narrow path that makes a clean method while including some of the most clever advances out there and adding our own flavor on top," Adam tells us. "Navigating that jungle of ideas was difficult." Computer vision is at the heart of this method, particularly in its advanced scene representation. The team aims to render vision data, including images and lidar point clouds, by learning a 3D representation of the actual world from collected data. Constructing that 3D representation is a big challenge but critical, as it allows for efficient rendering of the visual information necessary for AD systems. NeuRAD's practical applications are vast. "One that's very useful for us here at Zenseact is the ability to collect data from driving in a normal
traffic scene and modify that data into safety-critical scenarios or something we're more interested in," Carl reveals. Additionally, Adam points out the method's usefulness in simulating different sensor setups: "Our company is an AD company, so we care about this real-world application. Maybe we want to try different sensors or lidars. Our method can simulate how that lidar would look on our old collected data. We can virtually try out different sensor configurations and see what works best for us." NeuRAD can be applied to multiple datasets out of the box and has demonstrated state-of-the-art performance on five popular AD datasets. The team has made it open source and released it on GitHub. They would welcome people to contribute, expand the work with more features, and improve the NeRF renderings of automotive data in general. To learn more about the team's work, visit Poster Session 4 & Exhibit Hall (Arch 4A-E) from 17:15 to 18:45 [Poster 28].
Learned Trajectory Embedding for Subspace Clustering (Poster Presentation, UKRAINE CORNER)

Yaroslava Lochman is a PhD student at Chalmers University of Technology in Gothenburg, Sweden. Before her poster session this afternoon, she speaks to us about her paper, which explores the problem of motion segmentation.

This work examines the scenario where multiple independent motions are present in a scene. Yaroslava proposes a method for simultaneously grouping trajectories based on these motions and estimating the corresponding motion model for each group. This approach is particularly important for dynamic scene understanding, with applications ranging from autonomous driving to various other scenarios where distinguishing between multiple moving objects is essential. The motivation behind this work evolved naturally from Yaroslava's PhD studies, which delved into
dynamic scene understanding and structure from motion for deformable or non-rigid objects. "This is just one instance of this huge field that is not yet solved 100%," she points out. "With the tools available and the ideas in mind, it just seemed like a good fit at that moment!" The proposed method uses a neural network consisting of a feature extractor (encoder) and a subspace estimator (decoder). The feature extractor resembles PointNet, processing point trajectories independently and using 1D convolutional layers due to the temporal nature of the data. The subspace estimator includes regular MLPs that transform the features and incorporates a parametric model for the subspace basis functions. "We combine this together to output the subspace models that correspond to the point trajectories," Yaroslava explains. "We train the whole network to reconstruct trajectories as close to the original trajectories as possible, but we also want it to be very good at clustering, so we incorporate a corresponding clustering loss, which is an InfoNCE loss."

"We observed that a sufficiently long trajectory can uniquely identify its corresponding motion model, which motivated our choice of the feature extractor construction. In theory, it is connected to the fact that motion models are low-dimensional subspaces in a high-dimensional trajectory space; therefore, different models intersect only at zero."

One of the main difficulties she faced was the limited availability of data. Training a network that generalizes well to diverse scenarios is a challenge. To overcome this, she employed augmentations and parametric models to inject domain
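The observation that motion models are low-dimensional subspaces of a high-dimensional trajectory space can be illustrated with a small classical sketch, not Yaroslava's learned network: fit a subspace basis to sample trajectories of one motion via SVD, then assign a new trajectory to whichever candidate subspace reconstructs it with the smallest residual. Function names, the toy dimensions, and the SVD-based fitting are illustrative assumptions.

```python
import numpy as np

def fit_motion_basis(trajs, dim):
    """Fit a dim-dimensional subspace (through the origin) to sample
    trajectories of one motion via SVD; returns a (dim, D) basis."""
    _, _, vt = np.linalg.svd(np.asarray(trajs, dtype=float),
                             full_matrices=False)
    return vt[:dim]

def assign_to_motion(traj, bases):
    """Assign one point trajectory to the motion model whose subspace
    reconstructs it with the smallest residual.

    traj:  (D,) stacked coordinates of one trajectory
    bases: list of (dim, D) orthonormal subspace bases."""
    # residual = distance between the trajectory and its projection
    resid = [np.linalg.norm(traj - b.T @ (b @ traj)) for b in bases]
    return int(np.argmin(resid))
```

Because distinct motion subspaces intersect only at zero, a sufficiently long (high-dimensional) trajectory lies near exactly one of them, so the residual test identifies its motion unambiguously; the learned encoder/decoder in the paper can be viewed as a trainable, occlusion-tolerant version of this assignment.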
knowledge and ensure exposure to diverse data. The availability of unlabeled data on a much larger scale than labeled data presents an opportunity to adopt a weakly or semi-supervised framework in future iterations of this work. A notable aspect of this research is its exclusive focus on point trajectories without relying on visual data. This focus aims to push the limits of what can be achieved with geometric data alone. The approach's effectiveness was validated through extensive experiments, which showed state-of-the-art performance for trajectory-based motion segmentation on full sequences and competitive results on occluded sequences. Consequently, a challenge this work has not fully addressed is the high rate of occlusions common in real-world scenarios. "We have an algorithm for data completion, and for major corruptions, it's converging, but it's not working as fast and as nicely," Yaroslava reveals. "There are different ways to address that, and we are looking into using global context information." Away from this paper, Yaroslava's regular work covers similar ground but on a broader scale. "I'm looking into various techniques in geometry optimization and machine learning to solve 3D vision problems," she tells us. "I'm looking into different ways to combine them meaningfully to take the best of all worlds!" To learn more about Yaroslava's work, visit Poster Session 4 & Exhibit Hall (Arch 4A-E) from 17:15 to 18:45 [Poster 433].
UKRAINE CORNER: Russian Invasion of Ukraine

CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.
Women in Computer Vision

Katerina Fragkiadaki is an associate professor in the Machine Learning Department at Carnegie Mellon University and one of the Program Chairs at ICLR 2024 in Vienna, where this interview was conducted. [Photo: Katerina with graduated student Adam Harley]

What can you tell us about your work, Katerina? I work in computer vision, robotics, and machine learning, mostly on problems of how we relate vision to language and action for decision making and for understanding images and videos.

Did you choose that field, or did the field choose you? When I first started working in this field, I worked on image understanding. It was pure computer vision, but as the field has progressed, other aspects are also converging. Vision is converging with language and action. We treat the AI problem as general embodied intelligence as opposed to just understanding images, trying to understand language, or predicting actions. Seeing all three together makes more sense.
Am I right if I say that the more problems you solve in computer vision, the more it opens new problems to solve? Well, right now, we're in a very interesting stage where a lot of problems that can be supervised from images and text are making tremendous progress. For example, object detection, object labeling, and basic referential expression understanding. This is improving. Now, the final perception that you need for robot action and decision making, yes, this requires more work.

Is there anything you did not expect when you chose this field but discovered on the way? Yeah, there are tons of things I haven't expected. I didn't expect neural networks would take over. I wouldn't have expected we'd be working on GPUs. After that, I definitely didn't expect the tremendous language models generating such natural-looking and feeling language. Definitely, I didn't expect the generative image models that generate these beautiful images.

Are those four things good surprises for you or bad ones? This is the tremendous progress that the field makes that we can't anticipate. I think this is good. It's good!

Is this a good moment to work in AI? Did you fall into the right generation? I think I fell into the right generation in the sense that I was in the field when the field was really not done, nothing was working, so then you are able to experience the whole revolution. If I was starting now, I think I would be more reluctant to join.

What would you do instead? I would go to open problems like green energy, renewable energy, chemistry, and maybe biology, and I'm sure machine learning has a lot of things to give to those fields. I think I would do that right now.

It is not by chance that one of the outstanding papers here at ICLR is a biomedical paper about protein discovery. Yes, I think the machine learning applications in the sciences are very exciting.

Might that be a direction that you will invest in in the future?
Yeah, it’s absolutely true that some models absolutely generalize. Generative models have made a tremendous impact not only for generating images, videos, or text but also molecules, and they facilitate search in the molecule space and the reward functions that
The first time I spoke in my magazine about your work was in 2016 at CVPR in Vegas. How has the field, and how has Katerina, changed in eight years? The field changed, and we need to keep up and keep track of what's going on. From my scientific understanding, absolutely, a lot of things changed, not only within eight years but from semester to semester. I was a postdoc then, and now I'm a professor. I have my group. I care about my students. I code much less than when I was a postdoc and worry much more about keeping track of where multiple fields are. Back then, I was on my own, so I was working on my own little topic. Now, I need to keep track of all the topics of all my students. This is very different. [Photo: Katerina with colleague Igor Gilitschenski]
Is this what you wanted? Yes.

I am sure your students have learned a lot from you; is there anything that you have learned from them? Yes, I've learned tremendous things from the students. Two things that are very important that they have taught me are, one, not to have a mode collapse on my ideas. Not to think that one thing is great and nothing else is worth it. To be way more open and multimodal and give a shot to multiple directions at the same time. The second thing that they taught me is that it's not the research success that matters; how they feel and their emotions also matter. It's not research success at any cost. The emotional journey and being happy every day actually matter.

Are they different from how you were when you were a PhD student? Every student is different. They have diverse personalities that can be very different from mine. I think this is great because we teach each other different things.

If someone is doing their PhD now and would like to have a career like yours, what advice would you give them? Very good question. I think number one is to find something that you like. This can be very difficult. I remember it was very agonizing when I was trying to pick my advisor because I had not selected the topic of computer vision. This is a big deal. Once you select, hard work is very important. Good collaborators are very important. Taking feedback well is also important. Really seek feedback that is truthful, as opposed to feedback that will make you feel good. Try to anticipate changes in the future, despite that this is very difficult. Overall, research is challenging because it requires complete devotion, I believe. There are many extremely smart people, extremely trained people, working many hours on things they're very passionate about, and the truth is that research is a competitive field. We often say it's collaborative, but that's not really true.
If I submit a paper, you can't submit the same paper, so it's extremely competitive. I think trying to keep your emotional stability and your integrity in this highly competitive field is extremely important. That's why you need to find your niche, or you need to really push. Your work ethic really matters.

You suggest being devoted. Are there ups and downs where a researcher would say, 'Today, I don't feel very devoted'?
Well, the truth is that very successful people in the field are truly obsessed with what they're doing. I think this is true. I don't think they say, 'Today, I don't care.' They always have it in the back of their minds. Breaks are very important. I have talked to people, and there are multiple modes of work. There are some people who are really devoted, like machines, every day, and if you give them more time, they will be even more productive. There are other people who, if you give them more time, wouldn't be more productive because their creativity is not very structured.

Will both succeed in the same way? Yes, they can both generate very high-quality research ideas.

[Photo: Katerina with me, when we took the interview at ICLR 2024]

Only a few minutes ago, I saw Yann LeCun go to the Meta booth. I feel that he is one of those people who is always inspired and motivated to do more research to succeed. Is he a good example? Yes, I think he's a fantastic example of a devoted person. Now, just to be clear, being devoted is good for your research success; it doesn't necessarily mean it's good for your family success as well, because you have a finite time per day, 24 hours, and you need to have a family/work balance if your family is to be happy. Some people do not need that. Some people don't, which is good for their research success. One thing that I find inspiring, for example, is I
heard that Geoff Hinton would come in the morning and talk to his daughter and say, "I think I finally know how the brain works." I think this can be something very bonding, and I have also thought about really sharing your passion with your kids, if you have kids, as opposed to trying to cut them out of it because it is your passion. It's the same with your friends. You want to share your passion. It's good to share it with your kids.

Do you bring any of your Greek legacy to your career? I think one of the Greek legacies is that even in very stressful times, we crack many jokes.

Would you tell me one of these jokes now? Oh, these jokes are just work-related. They're very context-relevant to what we're discussing on the Slack channel with the students. [she laughs]

Okay, so there is no one joke that relates to what you and I are doing now? Oh, I see. Well, I can't come up with a good one now! [we both laugh]

And we only want good ones! As Takeo Kanade says, a good joke is like good research. It also releases stress if you can joke around.

We are at ICLR, but I might publish this interview at CVPR. What brought you to be a Program Chair at ICLR, and what can vision people take away from here? ICLR is my favorite conference because it's exactly on the topic of deep learning for language, vision, and action. It has all these topics together, so it is very diverse. In that respect, maybe CVPR is more specialized.

What can CVPR take from ICLR? For sure, something that it took already is the OpenReview interface, which I think is absolutely incredible, and we should definitely keep it. I hope other vision conferences also use it. Maybe the rebuttal discussion period. I don't remember if we had this at CVPR. I'm not sure if there was a back and forth between reviewers and authors. The topics are different between ICLR and CVPR, but the way they're run is quite similar because they use OpenReview. I think this is important.

You are involved in many other things.
I know that you have been involved recently in diversity and inclusion at CVPR, for instance. What are the things that you will
be passionate about over the next few years? Right now, I'm passionate about robotics and showing that vision can be useful for action and embodied intelligence, as opposed to just passive image understanding. I think this is very interesting. Another thing that I find interesting is interfaces where the user walks around in their home, and there is an assistant that doesn't need to be embodied, that doesn't have a body, but it can see what the user sees, hear what they hear, and know about their routine, and the artificial assistant tries to provide help, reminding the user what to do, answering questions, and so on.

Would it not be embarrassing to have a machine knowing everything about your house? No. Specifically, I would love to have a machine tell me where my phone is because I constantly forget it.

You could put a chip on your phone, and it will tell you. True, but then I need to put a chip on anything that I'm searching for. Not only this, but what's in the fridge? Where is your kid? A general assistant would be nice. Another thing I find interesting is assisting people with dementia. They ask tons of questions per minute. Human assistants can get tired. We can't replace human assistants right now, but at least they can have an artificial assistant that doesn't get tired and constantly replies to them and makes them feel at ease.

If I tell you now that in your very last years, you will befriend a machine to answer your questions, I'm not sure that you would want that. Yes, but the question is, what's the alternative? The alternative is you being alone without anybody replying to you. Beyond total loneliness, it's better with a machine.

Do you have a last message for the community? I think the community is very lucky that it's in the right field at the time it's exploding. It's extremely exciting. We shouldn't be egocentric about it; we should give to other people, as opposed to keeping it for ourselves.
Did you read Computer Vision News of June? Read it here
Women in Computer Vision Workshop

by Vanessa Staderini

As every year, the Women in Computer Vision workshop took place at CVPR24 and was an amazing chance to connect people and get inspired by fellow female scientists. The workshop included keynote talks, a lively panel discussion, and a valuable mentoring dinner. We had great speakers with different backgrounds and experiences: Cornelia Fermüller and Shekoofeh Azizi talked about how AI and Computer Vision can make music education and medical assistance more accessible. Elisa Ricci shared cool tips on how to use language to understand videos without training. Boyi Li introduced her work on task planning and replanning with Language Models.
Guoying Zhao explored the complex world of emotion recognition and how cultural differences play a role. Lastly, Kate Saenko shared her inspiring journey as a woman in science. This is the strength of the Women in Computer Vision Workshop: you can learn about different areas of Computer Vision and broaden your interests! We wrapped up the day with a great reminder: find your passion, be brave, speak up, and support each other. Work hard but also enjoy the ride! [Photo, from left to right: Cornelia Fermüller, Boyi Li, Shek Azizi, Elisa Ricci, Guoying Zhao and Kate Saenko] A big shoutout to the brilliant organizers: Estefania Talavera, Vanessa Staderini, Deblina Bhattacharjee, Mengwei Ren, Asra Aslam, Himangi Mittal, Sachini Herath, Ziqi Huang and Azade Farshad.