Computer Vision News - July 2024

A publication by RSIP Vision - July 2024

Generative Image Dynamics: full review of the Best Paper Award winner, and many more. Read about brilliant work!

CVPR Best Paper Award Winner
Generative Image Dynamics

Zhengqi Li is a research scientist at Google DeepMind, working on computer vision, computer graphics, and AI. His paper on Generative Image Dynamics has not only been selected as a highlight paper this year but is also in the running for a Best Paper Award. He is here to tell us more about it before his oral and poster presentations.

NOTE: this article was written before the announcement of the award winners, which explains why it keeps mentioning a candidate rather than a winning paper. Once again, we placed our bets on the right horse! Congratulations to Zhengqi and team for the brilliant win! And to the other winning paper too!

Imagine looking at a picture of a beautiful rose and visualizing how it sways in the wind or responds to your touch. This innovative work aims to do just that by automatically animating single images without user annotations. It proposes to solve the problem by modeling what it calls image-space motion priors to generate a video in a highly efficient and consistent manner.

“By using these representations, we’re able to simulate the dynamics of the underlying thing, like flowers, trees, clothing, or candles moving in the wind,” Zhengqi tells us. “Then, we can do real-time interactive simulation. You can use your mouse to drag the flower, and it will respond automatically based on the physics of our world.”

The applications of this technology are already promising. Currently, it can model small motions, similar to a technique called cinemagraph, where the background is typically static, but the object is moving.

A potential application for this would be dynamic backgrounds for virtual meetings, providing a more engaging and visually appealing alternative to static or blurred backgrounds, but without excessive motion that could be distracting.

“Moving to model larger motion, like human motion or cats and dogs running away, is an interesting future research direction,” Zhengqi points out. “We’re working on that to see if we can use a better and more flexible motion representation to model those generic motions to get better video generation or simulation results.”

Most current and prior mainstream approaches to video modeling use a deep neural network or diffusion model to directly predict large volumes of pixels representing video frames, which is computationally intensive and expensive. In contrast, this work predicts the underlying motion, which lies on a lower-dimensional manifold, and uses a small number of bases to represent a very long motion trajectory.
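The paper specifies its frequency-space motion representation in full detail; as a rough, hypothetical sketch of the general idea Zhengqi describes, here is how a handful of Fourier coefficients can encode a long per-pixel displacement trajectory. All numbers and names below are illustrative, not taken from the paper:

```python
import numpy as np

# Toy illustration: reconstruct a long motion trajectory for one pixel
# from a handful of Fourier coefficients -- the general idea behind
# frequency-space motion representations (not the authors' actual code).

T = 240                      # number of video frames to synthesize
K = 4                        # number of frequency bases kept per pixel
freqs = np.arange(1, K + 1)  # illustrative low frequencies

# Hypothetical per-pixel coefficients a model might predict: one complex
# coefficient per frequency for the x-displacement.
coeffs = np.array([0.8 + 0.2j, 0.3 - 0.1j, 0.05 + 0.05j, 0.02j])

t = np.arange(T) / T
# Sum of K complex sinusoids -> a smooth, periodic displacement signal
traj_x = np.real(sum(c * np.exp(2j * np.pi * f * t)
                     for c, f in zip(coeffs, freqs)))

print(traj_x.shape)  # (240,) -- 240 frames encoded by just 4 coefficients
```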

“You can use a very small number of coefficients to represent very long videos,” Zhengqi explains. “This allows us to use this representation to produce a more consistent result more efficiently. I think that’s the main difference compared with other video generation methods you might see.”

The novelty of this approach has not gone unnoticed, with the work being picked as a top-rated paper at this year’s CVPR, given a coveted oral presentation slot, and recognized as one of only 24 papers in line for a Best Paper Award. If we were placing bets on the winners, this work, with its stellar team of authors, would be our hot tip. What does Zhengqi believe are the magic ingredients that have afforded it such honors?

“There are a few thousand papers on video generation dynamics, and they all have similar ideas,” he responds. “They predict the raw pixel, and we’re going in a completely different direction, predicting the underlying motion. That’s something the research community appreciates because it’s unique. I guess they believe this might be an interesting future research direction for people to explore because, for generative AI, people are more focused on how you can scale those big models trained on 10 billion data, while we’re trying to use a different representation that we can train more efficiently to get even better results. That’s a completely different angle, and the award community might like those very different, unique, special angles.”

However, the road to this point was not without its challenges. Collecting sufficient data to train the model was a significant hurdle the team had to overcome.

They searched the Internet and internal Google video resources and even captured their own footage to gather the necessary data, taking a camera and tripod to different parks to capture thousands of videos. “The hardest part was we spent a lot of time working on it, but that’s the key ingredient that made our method work,” Zhengqi recalls. “If you don’t have data, you can’t train your model to get good results.”

While other works use optical flow to predict the motion of each pixel, this work trains a latent diffusion model, which learns to iteratively denoise features, starting from Gaussian noise, to predict motion maps rather than traditional RGB images. Motion maps are more like coefficients of motion. The model uses these to render the video from the input picture, which is very different from other works that directly predict the video frames from images or text. “That’s something quite interesting,” Zhengqi notes. “We’re working from more of a vision than a machine learning perspective. I think that’s why people like it in computer vision communities.”

Outside of writing award-candidate papers, Zhengqi’s work at Google mainly focuses on research but has some practical applications, including assisting product teams with video processing. He also advises several PhD student interns. “We work together on interesting research projects to achieve very good outcomes,” he reveals. “That’s our daily goal as research scientists at Google DeepMind!”
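To make the contrast with RGB-frame diffusion concrete, here is a schematic, DDPM-style sampling loop of the kind the article describes: starting from Gaussian noise and iteratively denoising into a motion map. The denoiser, noise schedule, and tensor shapes are placeholders, not the paper's actual model:

```python
import torch

# Schematic DDPM-style sampler: start from Gaussian noise and iteratively
# denoise into a motion map whose channels hold motion coefficients rather
# than RGB. The denoiser, schedule, and shapes are illustrative placeholders.

def sample_motion_map(denoiser, shape=(1, 8, 64, 64), steps=50):
    betas = torch.linspace(1e-4, 0.02, steps)      # toy noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)                         # pure Gaussian noise
    for t in reversed(range(steps)):
        eps = denoiser(x, t)                       # network predicts the noise
        # Reverse-step posterior mean (DDPM); noise term re-added below
        x = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
            / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # motion coefficients, later used to render the video frames
```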

Best Paper Award Candidate
Objects as Volumes: A Stochastic Geometry View of Opaque Solids

Bailey Miller is a PhD student at Carnegie Mellon. His novel paper, which breaks new ground in volume rendering, has been selected from thousands of accepted papers as a conference highlight and as a candidate for a coveted Best Paper Award. Bailey spoke to us before his oral presentation at CVPR 2024.

For several decades, volume rendering techniques have been a popular class of methods for simulating light transport in translucent media such as clouds, smoke, and tissue, with various applications in graphics and physics. In the past five years, there has been a shift towards using these methods to model more familiar, everyday objects, such as solid, opaque items. “Our paper is about figuring out why these volume rendering methods, originally developed for clouds, can work on things like a Lego truck,” Bailey begins. “We’ve developed a stochastic geometric theory that explains the connection between these two different models.”

The initial challenge Bailey faced was understanding the foundational principles of volume rendering, which have been obscured, or had a black box put around them, by the numerous successful yet complex methods developed in recent years. “Revisiting its roots, you see that in classic volume rendering, scenes are modeled as a collection of microparticles,” he reveals. “Once we could understand it in this very principled manner, we could start to develop ideas and approaches for considering volume rendering on stochastic opaque solid objects.”

Did he solve the problem? “Part of it,” he tells us. “I think we’ve opened some new doors.”

“We show how you can develop these rendering algorithms for a very limited set of new stochastic geometry. There’s a lot of work to be done in extending these methods to even more extensive types of geometry and scenes.”

One of the primary applications of this work is in surface reconstruction. Essentially, this involves taking a collection of images and trying to understand the geometry of the world that gave rise to them. The connection becomes clearer through light transport. “Light bounces around the world, and depending on how it interacts with the geometry, it gives different images,” Bailey explains. “By introducing a new way of modeling the geometry in scenes and considering how light interacts with that geometry, we can improve the surface reconstruction algorithms that have been using volume rendering over the past several years to get all sorts of performance improvements.”
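For readers unfamiliar with the classic model Bailey is revisiting, here is a minimal numerical sketch of volume rendering along a single ray: a microparticle density gives an extinction coefficient, and transmittance follows Beer-Lambert attenuation. The density field below is a made-up example, not the paper's model:

```python
import numpy as np

# Minimal sketch of classic volume rendering along one ray: transmittance
# follows Beer-Lambert attenuation through a density field sigma(t).

def sigma(t):
    # Toy extinction coefficient: a dense "object" between t=2 and t=3
    return 5.0 if 2.0 < t < 3.0 else 0.05

ts = np.linspace(0.0, 4.0, 400)
dt = ts[1] - ts[0]
dens = np.array([sigma(t) for t in ts])

# T(t) = exp(-integral_0^t sigma) -- probability the ray reaches depth t
transmittance = np.exp(-np.cumsum(dens) * dt)
# Probability the ray terminates in each interval (the rendering weights)
weights = transmittance * (1.0 - np.exp(-dens * dt))

print(weights.sum())  # close to 1 for an opaque medium: the ray is absorbed
```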

The implications of this research extend to various fields, particularly settings such as robotics or autonomous driving, where there is a benefit in having a notion of uncertainty in solid objects and the trustworthiness of algorithmic results. In these scenarios, agents might leverage images or video to build a probabilistic model of the surrounding world, which can significantly enhance reliability, safety, and efficiency.

Being chosen as an award candidate for the first time is a high enough achievement, but it is even more special for Bailey, as this is his first CVPR. He puts the positive reception down to how timely the work is, given recent advancements in surface reconstruction and novel view synthesis algorithms based on volume rendering, which have worked well but for unclear reasons. “By providing this perspective and saying, ‘Here’s an explanation for why all these volume rendering methods work so well,’ we can understand why we can model the world in this particular way and why we’ve had success with it,” he points out. “We can understand the current state of the art but also develop a perspective on these methods that allows us to continue improving. I love thinking about how we model the world and how the underlying assumptions we make in our models impact the algorithms and methods we ultimately develop.”

Reflecting on the growing emphasis on the role of explainability in research, Bailey says it is a trade-off: “You need to push forward and figure out what works in practice, and then you need to step back and ask, ‘Why do these methods work so well?’ It’s a constant process moving back and forth between the two.”

In addition to his work on stochastic geometry, Bailey is exploring Monte Carlo PDE solving, which involves adapting Monte Carlo methods that work really well for light transport and simulating light to other types of physics, such as heat transfer, acoustics, and wave equations. “I don’t think this has been as present in the vision community yet, but it’s been starting to gain some attention in graphics,” he tells us. “I think, eventually, these algorithms will be of interest in the computer vision community because we’re seeing the development of all sorts of new imaging modalities, or renewed interest in modalities like thermal imaging. Good ways to simulate those should help vision researchers and practitioners develop algorithms that work with physics beyond just light.”

Looking ahead, Bailey is excited about the potential to develop this work further, including extending the stochastic geometric approach to a broader range of stochastic models and probabilistic assumptions about the world or scenes. The core idea of stochastic geometry also has applications beyond light transport algorithms, which opens a range of possibilities for future research. Could we be sensing the first hints of next year’s award paper? “I’d love that, but I’m happy with the one this year for now!” he laughs. “I feel very fortunate to have had our paper selected. I hope everyone who reads it enjoys it and takes something away from this stochastic geometry perspective.”
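As a taste of the Monte Carlo PDE solvers Bailey mentions above, here is a minimal walk-on-spheres estimator for a Laplace problem on the unit disk, a textbook example of transplanting light-transport-style Monte Carlo to other physics. The domain and boundary data are illustrative, not from the paper:

```python
import numpy as np

# Minimal walk-on-spheres estimator for a Laplace (harmonic) problem:
# estimate u(x) with Dirichlet boundary data g on the unit disk.

def g(p):                      # boundary values (illustrative choice)
    return p[0] ** 2 - p[1] ** 2

def dist_to_boundary(p):       # distance from p to the unit circle
    return 1.0 - np.linalg.norm(p)

def walk_on_spheres(x, eps=1e-3, rng=np.random.default_rng(0)):
    p = np.array(x, dtype=float)
    while (d := dist_to_boundary(p)) > eps:
        theta = rng.uniform(0, 2 * np.pi)   # jump to a uniform point on the
        p = p + d * np.array([np.cos(theta), np.sin(theta)])  # largest sphere
    return g(p / np.linalg.norm(p))         # project to boundary, read g

estimate = np.mean([walk_on_spheres([0.3, 0.2]) for _ in range(5000)])
print(estimate)  # x^2 - y^2 is harmonic, so u(0.3, 0.2) = 0.05 exactly
```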

AI Evangelists at Work

Paula Ramos (on my left) and Raymond Lo (on my right) are AI Evangelists at Intel. We catch up with them at CVPR 2024 to learn more about Intel’s innovative AI-powered solutions, their authoritative engagement at the conference, and the community’s enthusiastic response.

Raymond and Paula’s CVPR journey began three years ago with a vision to bridge the gap between academia and industry regarding AI. “It was a dream that Paula and I had,” Raymond, the global lead of the Intel AI Evangelist team, recalls. “When I was a researcher and had an idea, people always said, ‘10 years later, it will come to the market.’ It was discouraging to me. I was like, no, I want this to be in the market tonight!”

That vision has materialized with Intel making significant strides in AI, notably through OpenVINO, its open-source toolkit for AI inference, allowing developers to optimize and deploy models efficiently across various platforms. These days, AI innovations can progress from research into users’ hands within a year. “The new AI trends are moving super fast,” Paula tells us.

“If we have a new model, we want to deploy it at the edge, in the cloud, or on client devices. We’re trying to bridge the gap between those new AI trends and how developers can deploy on their own laptops, so they don’t need to create or use new infrastructure.”

Intel Labs researchers presented 24 papers at CVPR this year, including six in the main conference, and co-organized three workshops. Intel had a booth and a tutorial. AI Research Engineer Samet Akcay was a keynote speaker at the Anomaly Detection workshop, while Paula was a keynote speaker at the Agriculture-Vision workshop. Intel also sponsored the AI Summit, a networking meetup, and the Visual Anomaly and Novelty Detection 2024 Challenge.

“When we started at CVPR, the demos were rough because there were only a few of us trying to make it work,” Raymond remembers. “A success story I can tell is that the community came together, with hundreds of people working on the same thing. Hundreds. That’s the difference we see. People appreciate that we come back and are better at repeating the effort!”

This community spirit is reflected at the booth and the annual party Intel hosts at CVPR, which has proved to be a popular way to connect people. Creating a fun and engaging environment ensures that people remember the company from a human perspective, not just as a technology provider. “My perspective is that I’m here to show you the best of everything we have, but we can’t forget that we’re walking in the footsteps of all these giants here at CVPR, all these talents,” Raymond recognizes. “The whole purpose of the party is to put the right people together at the right time. Of course, it has to be fun!”

He points out that Intel’s commitment to the community goes beyond superficial engagement. It brings hardware and software engineers to CVPR for hands-on support and expertise. “We care by showing up,” he attests. “In our booth, we have the people that do the actual development. Two engineers with PhDs in optimization will show you how we optimize the hardware. It’s not just telling you to read the documentation; the engineers are here to talk to you if you want extra help. We’re here for you!”

Raymond says they regularly add people on LinkedIn, give them demos, and, this year, even gave away discrete Arc GPUs, AI PC dev kits, and AI PCs featuring the Intel Core Ultra, which combines NPU, CPU, and GPU capabilities. “It’s a very powerful machine in the sense that, 10 years ago, if I saw it, I’d be like, ‘Oh my God!’” he exclaims. “For students, this is actually their laptop in the coming years. We’re shipping hundreds of millions of these.”

He stresses to students the importance of making their innovative solutions accessible and repeatable. “Great, you have your solution, but have you thought about giving it to your friends and your friends’ friends?” he asks. “Think about the layer of people that can benefit from your work. That’s why I say these kinds of laptops are not just laptops; they’re tools for expanding your ideas.”

With tools like OpenVINO, people can run high-performance AI models on modest hardware, making advanced capabilities available without significant financial investment. “Today, people came to my booth to say, ‘Hey, Ray, what are you running?’” he tells us. “I’m running a large language model, Llama 3, eight billion parameters, that never runs properly on their machines. I use OpenVINO to compress it, and now it fits into the RAM, and I can fit it into the GPU of this little machine.” That is how the 10-year route-to-market cycle that Raymond talked about earlier gets shorter. “Your everyday machine becomes your AI engine,” he confirms.

Intel is developing user-friendly APIs and extensive tutorials to help computer vision professionals benefit from these advancements. Its OpenVINO Notebooks repository has literally hundreds of tutorials! It has Edge AI Reference Kits designed for specific use cases across industries like manufacturing and retail. It creates content users can follow step by step to optimize, quantize, and apply these new AI trends in their infrastructure. “We don’t just give you cookies,” Raymond adds. “You get the recipes for making your own cookies!”

Feedback from the field has been overwhelmingly positive. “It really warms my heart when people come back to tell me: I tried it, I used it, I made an impact,” he recounts. “One person told me, ‘I can make a business from it. I can make an impact on millions of people using this product.’ He became a top scientist from Stanford and told me, ‘I could accelerate some of this model as well.’ That’s heartwarming because we don’t just show people something, and then they forget, and it never gets used anywhere else; they take it to the next level!”
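In code, the compress-and-deploy flow Raymond describes looks roughly like this, using OpenVINO's public API together with NNCF for weight compression. The model path and device string are placeholders, and the details vary by model and toolkit version:

```python
import openvino as ov
import nncf  # weight-compression library commonly paired with OpenVINO

core = ov.Core()

# Read a model already converted to OpenVINO IR (placeholder path);
# an 8B-parameter LLM in FP16 would not fit most laptop GPUs as-is.
model = core.read_model("llama-3-8b-instruct/openvino_model.xml")

# Compress weights (e.g. to 8-bit) so the model fits in RAM/VRAM.
model = nncf.compress_weights(model)

# Compile for the device at hand: "GPU", "CPU", or "NPU" on an AI PC.
compiled = core.compile_model(model, "GPU")
```

Higher-level wrappers, such as the optimum-intel integration with Hugging Face, package these same steps behind a one-line model load.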

Intel’s commitment to the community extends beyond CVPR, with regular virtual events called DevCon webinars and its big conference, Intel Innovation, which takes place September 24-25 in San Jose, California. Despite such broad engagement, there is still a common misconception that Intel is solely a hardware company. “People ask: ‘You’re a hardware company; why are you doing AI?’” Paula reveals. “We’re doing AI because we want to enable AI in the hardware, but we’re also working beyond that!”

From the viewpoint of their booth, where the team has been busy teaching people how to use AI with Intel, they have noticed that awareness of Intel’s AI solutions has grown significantly this year. Many more people are familiar with OpenVINO, and generative AI is an exciting new trend. “There’s an awareness that all our solutions are better right now,” Paula notes. “It will keep getting better year by year.”

Paula was featured last year as Woman in Computer Vision.

Intel has a unique capability to create a comprehensive environment for AI development. With millions of devices everywhere, it can write software that millions of people can start benefiting from overnight. Its end-to-end capabilities in designing, manufacturing, shipping, and inferencing AI solutions place it in a position of significant responsibility, where it must ensure ethical AI practices and develop explainable AI solutions. Work like Ilke Demir’s on deepfake detection is important, as are its trust and security services.

As our interview concludes, Raymond offers a powerful message to the community, reflecting on the rapid advancements in the field. “Opportunities come every year,” he says. “When I was a researcher, everyone said 10 years. With this new generation, my advice is: the opportunity is today! Grab it and stay strong with the CVPR community! The community defines this field, and the CVPR community is the leader. We should lead and then make your dreams come true!”

Best Paper Award Candidate
SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency

Paul Roetzer (left) is a PhD student under the supervision of Florian Bernard (right), an Associate Professor at the University of Bonn and the Head of the Learning and Optimisation for Visual Computing Group. Before their oral presentation this afternoon, they speak to us about their highlight paper on 3D shape matching, which has also been chosen as a Best Paper Award candidate.

The problem of 3D shape matching involves identifying correspondences between the surfaces of 3D objects, a task with applications in medical imaging, graphics, and computer vision. This work’s main novelty is that it accounts for geometric consistency, a property often neglected in previous 3D shape matching methods due to its complexity. Geometric consistency ensures that when matching the surface of one shape to another, neighboring elements are matched consistently, preserving neighborhood relations.

Computer Vision News
Publisher: RSIP Vision
Copyright: RSIP Vision
Editor: Ralph Anzarouth
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, CVPR and all conference organizers.

“Imagine two organs, like the liver, heart, or lungs, and you match them from different people,” Florian explains. “You take the shapes and want to train a statistical shape model. If you didn’t have this geometric consistency, the deformation from one to the other would lead to self-intersecting sections, which aren’t anatomically plausible.”

Many existing approaches to 3D shape matching do not enforce geometric consistency as a hard constraint but optimize it as a soft objective, often framed as graph matching or quadratic assignment problems. “This is a problem class well known to be NP-hard, making it extremely challenging to solve for large instances in practice,” Florian tells us. “We find a different representation that makes the problem easier to solve.”

Florian and Paul propose a novel path-based formalism, representing one of the 3D shapes (the source shape) as a long, self-intersecting curve (the ‘SpiderCurve’) that traces the 3D shape’s surface. This alternative discretization reduces the 3D shape matching problem to finding a shortest path in the product graph of the SpiderCurve and the target 3D shape. “This switch of the discretization is what makes our paper novel,” Paul points out. “We think differently about a problem, turning a very complicated task into a simpler one.”

This formalism leads to an integer linear programming problem, which the team demonstrates can be efficiently solved to global optimality. The result is competitive with recent state-of-the-art shape matching methods and guarantees geometric consistency. “For the first time, we can find geometrically consistent shape matchings while also finding global optima in practice,” Florian reveals. “Within the framework of our optimization formulation, in all the instances that we’ve evaluated, we know that we have the best possible solution among all potential solutions!”
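The paper's exact formulation is an integer linear program with geometric-consistency constraints; as a stripped-down illustration of the "shortest path in a product graph" idea alone, consider this hypothetical sketch, where the cost function, toy mesh, and state space are invented for the example:

```python
import heapq

# Simplified sketch of matching-as-shortest-path: build the product graph of
# a source curve (a sequence of nodes) and a target mesh (adjacency lists),
# then run Dijkstra. The paper's ILP constraints for geometric consistency
# are omitted here -- this shows only the product-graph traversal idea.

def match_curve_to_mesh(curve_len, mesh_adj, cost):
    # States are (curve index, mesh vertex); each step advances along the
    # curve while staying at, or moving to a neighbor of, the mesh vertex.
    heap = [(cost(0, v), 0, v) for v in mesh_adj]
    heapq.heapify(heap)
    settled = {}
    while heap:
        d, i, v = heapq.heappop(heap)
        if (i, v) in settled:
            continue
        settled[(i, v)] = d
        if i == curve_len - 1:
            return d          # cheapest full traversal of the curve
        for w in [v] + mesh_adj[v]:
            if (i + 1, w) not in settled:
                heapq.heappush(heap, (d + cost(i + 1, w), i + 1, w))
    return None

# Toy usage: 4-node curve, triangle mesh, unit cost everywhere -> prints 4.0
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
print(match_curve_to_mesh(4, adj, lambda i, v: 1.0))
```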

3D shape matching is just one of a class of matching problems that are fundamental to computer vision. Could devising an innovative new approach to solving such a fundamental problem be part of the reason the paper has been chosen as a candidate for a Best Paper Award? “We have conceptually a pretty simple idea,” Florian responds. “Instead of representing a 3D shape using triangles as discretization, we simply discretize the 3D shape using a one-dimensional curve that traces the surface while visiting all the vertices.”

“By looking at a different representation of the 3D surface, we can build on well-established frameworks for globally optimal matching problems that lead to geometric consistency. I think the secret is the simplicity and the fact that it’s very fast in practice.”

Away from writing top-rated papers, Florian works at the intersection of machine learning and mathematical optimization in visual computing. Meanwhile, Paul explores solutions to 3D shape matching problems with optimization methods. Looking ahead, Florian acknowledges an unresolved challenge: “The most critical open problem is whether an algorithm exists to solve this problem in polynomial time,” he ponders. “What we have is fast in practice, but the worst-case time is still exponential. The next step would be to investigate if it’s possible to come up with a similar formalism that could lead to a polynomial time algorithm that is provably fast.”

The BEST OF CVPR 2024 continues on the next page with another exceptional Oral Paper!

Highlight Presentation
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

Aniruddha Kembhavi (top left) is a Senior Director at the Allen Institute for AI (AI2), leading the Perceptual Reasoning and Interaction Research (PRIOR) team, where Christopher Clark (center) and Jiasen Lu (top right) are Research Scientists, Sangho Lee (bottom left) is a Postdoctoral Researcher, and Zichen “Charles” Zhang (bottom right) is a Predoctoral Young Investigator. They spoke to us about their highlight paper proposing Unified-IO 2, a versatile autoregressive multimodal model.

Unified-IO 2 is the first autoregressive multimodal model capable of understanding and generating images, text, audio, and action. It can handle multiple input and output modalities and incorporates a wide range of tasks from vision research. Unlike traditional models with specialized components for different tasks, it uses a single encoder-decoder transformer model to handle all tasks, with a unified loss function and pretraining objective.

“It’s a super broad model,” Christopher tells us. “It can take many different modalities as input and output. It can do image, text, audio, and video as input and can generate text, image, and audio output. Within those modalities, we basically threw in every task we could think of that vision researchers have been interested in. It’s a super, super broad model. I think it’s one of the most broadly capable models that exists today.”

While language models can perform many tasks and input and output all kinds of structured language, handling diverse inputs and outputs in computer vision is more challenging. “When it comes to computer vision, it’s a mess,” Aniruddha says bluntly. “Sometimes, you have to input an image. Sometimes, you have to output a bounding box. Sometimes, you have to output a continuous vector like a depth map. Inputs and outputs in computer vision are very heterogeneous. That’s why, for the last 10 years, people have been building models that can do one or two things.”
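A schematic of the single-model, single-loss idea Christopher and Aniruddha describe: if every modality is first mapped into one shared discrete token vocabulary, heterogeneous tasks all reduce to next-token prediction under one cross-entropy objective. Everything here (vocabulary size, tokenizer stub, model interface) is an illustrative placeholder, not Unified-IO 2's actual code:

```python
import torch
import torch.nn.functional as F

VOCAB = 50_000  # one vocabulary spanning text, image, audio, action tokens

def tokenize(example: dict) -> tuple[torch.Tensor, torch.Tensor]:
    # In the real model, images/audio pass through learned tokenizers
    # (e.g. VQ-style codebooks); here we assume token IDs already exist.
    return example["input_tokens"], example["target_tokens"]

def training_step(model, example: dict) -> torch.Tensor:
    src, tgt = tokenize(example)            # (batch, S), (batch, T)
    logits = model(src, tgt[:, :-1])        # teacher-forced decoding
    return F.cross_entropy(                 # one loss for every task
        logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1)
    )
```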

Unified-IO 2 builds on the foundations laid by its predecessor, Unified-IO, aiming to create a model that can truly input and output anything. Training such a comprehensive model, especially with limited resources, has been incredibly tough. The team’s first major challenge was collecting the pretraining and instruction tuning data. The second was training a multimodal model from scratch rather than adapting existing unimodal models. “We tried a few months of tricks to stabilize the model and make it train better,” Jiasen recalls. “We figured out a few key recipes that were used by later papers and shown to be very effective, even in other things like image generation. We’re training on a relatively large scale, with 7B models and over 1 trillion data. More than 230 tasks were involved in training these giant models.”

The development of Unified-IO 2 has been a collaborative effort involving the four first authors: Jiasen, Christopher, Sangho, and Charles. Aniruddha is keen to ensure they get the recognition they deserve for the feat they have pulled off. “This project is a Herculean effort by these four people,” he points out. “Usually, people will take a large language model, then put a vision backbone on it, and then finetune that on some computer vision tasks. In this model, the language model is also trained from scratch. Think of large companies with hundreds of researchers trying to train a language model. Contrast that with this paper, which has four first authors trying to train a model that does everything. These four gentlemen have toiled night and day for many, many months. I can testify to that.”

Everything about Unified-IO 2 is open source. If you visit the team’s poster today, you can feel safe knowing they are willing to share every aspect of the project. “We’ve released all the data, the training recipes, the challenges, especially in stabilizing the model training, and all the evaluation pipelines,” Sangho confirms. “If you come to our poster booth, we’ll be very happy to share all the recipes and know-how for training this special kind of multimodal foundation model.”

During the evaluation stage, the team discovered that Unified-IO 2 could perform well in tasks they had not initially targeted, such as video tracking and some embodied tasks. They will showcase these surprising results with iPad demonstrations at their poster session. “We’ve tested the model multiple times, but maybe only with a few modalities and target tasks,” Charles reveals. “It’s a surprise that the model is so good at other tasks we’ve not focused on before. There are lots of interesting behaviors of the models and some very cool visualizations showing that the model can follow some novel instructions.”

The paradigm behind Unified-IO 2, where all modalities are integrated into a single transformer without relying on external unimodal models, is a promising direction for future AI research. “It’s in contention with other ways of training generalist models, and people are still exploring and building on that,” Christopher adds. “I think Unified-IO 2, in particular, has a lot of modalities and tasks and really pushes that way of building models to an extreme.”

UKRAINE CORNER

Russian Invasion of Ukraine
CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.

Denys Rozumnyi (top left) has won the Structured Semantic 3D Reconstruction Challenge, which was held as part of the Urban Scene Modeling workshop here at CVPR 2024. Denys is open to work offers. He’s a great catch! Grab him before it’s too late.

Yaroslava Lochman (top right) is a PhD student at Chalmers University of Technology in Gothenburg, Sweden. She presented her paper, which explores the problem of motion segmentation, during her poster session at CVPR 2024.

Sophia Sirko-Galouchenko (here on the left) is a first-year PhD student at Sorbonne University in Paris and Valeo.ai. Her paper on bird’s-eye-view perception in autonomous driving was presented by colleagues during a poster session at the CVPR Workshop on Autonomous Driving.

Read 150 FASCINATING interviews with Women in Science

Ivana Balažević

Ivana Balažević is a Research Scientist at Google DeepMind. She spoke to us at CVPR, right after her talk and panel at the Prompting in Vision workshop.

Where does Balažević come from?
Croatia.

I am not very far away. I am Italian.
Ah, okay, so we’re neighbors! [she laughs]

What is your work about?
Well, various different things. I have mainly, in the past couple of years, worked on multimodal image and video understanding. As of recently, I moved into Gemini, working more on language. But, yeah, Gemini, super secret, can’t talk about it – you know how it is!

Is the convergence of multimodalities really happening? Text with vision, video, audio, all these things together?
I think it is, especially in the past couple of years. I finished my PhD in 2021, and there were just these small models doing various different tasks. Everyone was working on their little model in their PhD or in whichever company, and now suddenly, everything is converging into one big model, which can do these various things, which is exciting but also maybe a bit scary. I don’t know. Mainly exciting, I would say.

Why exciting, and why scary?
That’s a very good question! In my mind, exciting because it unlocks a whole world of possibilities for what we can possibly do with these models in some possibly distant future. I don’t know because I didn’t think we’d be where we are now, but we would maybe be able to learn from these models or learn something new that we don’t know. These models might be able to make some sort of inferences, like combining various modalities to teach us things. How amazing would it be if we had a model that would be able to read scientific papers and come up with a new paper that is actually correct and that teaches us something new? Or a model that takes all our knowledge about medicine and biology and chemistry and finds a cure for cancer or something like that? Some of these things would be really, really amazing. Scary because, well, as with anything, people can abuse these sorts of models.

And you cannot control the person who wants to abuse.
Exactly. Any tool in human history – well, not any, but a lot of them – can be used for good and for bad things. You have a kitchen knife that you can cut vegetables with or something…

Or cut the neighbor!
Yeah, exactly! [both laugh] It’s the same with the technology nowadays.

Do you know what Nobel invented? Dynamite!
Yeah, Nobel, exactly. There we go! This is a very good example.

Would you agree with Yann LeCun, whom I have interviewed twice? He says that today’s AI is no smarter than a home cat.
Than a cat? Well, probably not at this point, but we will see how much time will pass until they are smarter than a cat, so we’ll see! [she laughs]

Is it exactly what you wanted, to be at the intersection of all these nice things at the moment they are going to intersect?
Yeah, I think it’s a nice place to be. It’s sometimes a bit uncomfortable because things are moving really, really fast. As I said, during my PhD time, everything felt more chill, whereas nowadays, there’s a lot going on, but it’s also very fun.

Is this why you do research, to have fun in something innovative?
I mean, partially. There are various different reasons why I do research. I think I’m bored quite easily, so I like to do things that are constantly new, so that your day-to-day isn’t just repeating the same things over and over again.

Also, because of what I just mentioned, all these amazing things that these tools we’re developing can potentially help us unlock. But also to have fun.

I am sure you are not afraid you are going to be bored by AI in the coming years.
No, I don’t think so. The opposite, actually.

In your short career until now, what in AI particularly made you say, ‘wow’?
That’s a very good question. I remember we had this Flamingo model in DeepMind. Before it came out, we could play around with it internally a bit, and I remember uploading a picture of my dad’s cat and asking it various questions about the picture, and it could answer everything. I was like, how is this possible? That was my first moment like, wow, these things actually work! Because I was pretty much a sceptic before that about this kind of increase-the-model-size, increase-the-data-size approach. I wasn’t thinking it was actually going to work. I mean, there’s probably a limit. I still think there’s a limit, but we’re still pushing this frontier, it would seem.

What is the next ‘wow’ you are going to say?
If we have actual embodied agents. If we can integrate all of these current assistants we have and put them into a robot. This is the point where I would actually find it scary, as some sort of actual sci-fi moment, but it would also be like, wow.

What would be your dream for the next 10 years?
Personally, I would like to move into the more application side of things, where you can do something good with these models. The model development is very, very interesting as well, but this has kind of been a change in the past couple of years for me, where we’ve moved on from models that don’t really work to models that do work, and now it’s unlocked this whole world of possibilities where you can actually use it to solve real-world problems.

The spectrum of things that you can do is infinite. If I only think about climate change, the number of issues that AI can help on is infinite.

Exactly.

Will you pick one of those?
Good question. I think we have so many open problems in our world, from various political problems to climate change. I think climate change is a good one, maybe something medical. I find some of the companies that do research into longevity interesting – not necessarily extending the length of human life, but improving the quality of life at a later stage – where I think there’s a lot of ML used. These are the things that I find interesting.

You are preparing for retirement.
Yeah, exactly. I want to retire in a nice way! [both laugh] Still be able to go on hikes or something.

Scuba diving?
Yeah, perfect. Scuba dive at 90 or something like that! [Ivana laughs]

We have spoken about the future; we did not speak much about the past. Why did you leave Croatia?
Well, that was a very long time ago now. It was 12 years ago. Yeah, that’s when I left Croatia. I wanted to see what there is outside Croatia. I wanted to experience life outside Croatia, but I didn’t think I was going to stay. I initially thought it was going to be a couple of years, but then, here we are.

What did you do for this couple of years?
I did a Master’s in Berlin at first. Then, I was in California for a year doing an internship at a startup, and then I started doing my PhD in Edinburgh, Scotland.

So, it was not the plan?
No, not initially. It was kind of a let’s-go-and-see sort of thing.

Is this the way you decide things?
When I was young, for sure. Now, sometimes!

Are you going to use the same criteria for things happening in the future?
I mean, not so much. Now, I tend to overthink things a bit more than when I was 21.

What is the thing that you did until now that you are the most satisfied with?

I’m going to say something that’s unrelated to my career. I ran a marathon a month and a half ago! Yeah, it was hard, but also, it made me feel like, oh, I can do this.

Wow, I’m so jealous. I registered for three marathons. I was never able to start one. I always got injured.
No way.

It’s very annoying.
I’m still a bit injured now. My feet hurt, but other than that, I enjoyed it a lot. It was one of those things because I often doubt myself, and I’m like, this is too hard, I cannot do this, I cannot do that, and it’s one of those things where you’re like, I can actually do this.

Four hours?
3:57.

Tell me one thing that our readers can learn from your marathon.
I think it’s this thing about not giving up. Knowing that you can do more than you think you can, because I feel like it’s one of those things where you have to believe that you can do it. But yeah, as I said, I’m not that kind of person.

Some people will say that life is already competitive enough.
True, but I think it makes you more robust to everyday life challenges because it teaches you, yeah, I can do this. I can overcome this thing that I wasn’t able to do.

I am sure that finishing a marathon makes you very, very high.
Yeah, it does. I was laughing and crying at the same time. I was in this weird state. [Ivana laughs]

How many CVPRs have you attended?
This is my first one. I have never been to CVPR before. I normally go to ML conferences.

How are you finding it?
Yeah, I think it’s nice. It’s great. I don’t know how big it is, but from what I’ve heard, it’s bigger than NeurIPS.

11,500 this year, including online.
Yeah, I’m excited for it. Let’s see.

What did you learn from the workshop?
That there are a lot of different opinions and ways people do prompting. I don’t know if that’s a good thing or a bad thing. In language, prompting is a well-defined thing, whereas here, it’s very much an open research problem, open for discussion. It’s possibly a good thing. There might be good things coming out of it.

Is there a lifelong dream or goal you hope to achieve before you retire?

I want to work on a particular problem that matters right now. I’m now doing research for the sake of research because it’s interesting. I haven’t decided which of these important problems we talked about is that one problem yet, but I want to do something like that and focus and try to help achieve that.

Tell us one thing about you that we do not know.
Again, that’s a very difficult question. I do not know much! I don’t think I like to talk about myself that much! [she laughs]

Oh, so you had a great idea of accepting my offer for an interview.
Yeah, to be honest, I was not sure what I was going to say!

Now, it’s too late!
I know. People have convinced me otherwise.

Who convinced you?
Well, my manager and my boyfriend. They were both like, “No, you should do it. It’s a good thing. It inspires people!” I’m like: “I don’t know if I like doing these self-promotion things.”

I often discover the most surprising things about people I do not know.
Yeah, well, there we go. That’s one thing you didn’t know about me! [both laugh]

Do you have a final message for the community?
Do big things that matter, because I think you can achieve it. There are so many smart people in this field, and there are many, many interesting problems to be solved. I think if we put our minds together, we can improve this world that we live in.

Read 150 FASCINATING interviews with Women in Computer Vision!

AI4Space Workshop Poster

Roberto Del Prete, a PhD student from the University of Napoli who is set to complete his PhD in November, is proud to present at his first CVPR conference. His poster, presented at the AI4Space workshop, discusses a novel approach for autonomous lunar landing leveraging visual information. The workshop highlighted space capabilities that draw from and/or overlap significantly with vision and learning research, outlined the unique difficulties presented by space applications to vision and learning, and discussed recent advances towards overcoming those obstacles.

Workshop Presentation
Test-Time Adaptation with SaLIP: A Cascade of SAM and CLIP for Zero-shot Medical Image Segmentation

Sidra Aleem is a final-year PhD researcher at ML-Labs, Dublin City University, focusing on domain adaptation for biomedical imaging using foundation models. Following her oral presentation on Monday at the CVPR 2024 Workshop on Domain Adaptation, Explainability, and Fairness in AI for Medical Image Analysis (DEFAI-MIA), she speaks to us about her paper on test-time adaptation with foundation models.

Medical image segmentation, critical for clinicians in diagnosis and prognosis, is the focus of Sidra’s innovative work. She proposes a novel cascade of two foundation models, Meta’s Segment Anything Model (SAM) and OpenAI’s CLIP, leveraging their unique capabilities to enhance zero-shot organ segmentation accuracy in medical imaging. “These foundation models have completely revolutionized the world around us,” she tells us. “While they’ve been predominant in natural imaging, their effective application has yet to be explored in medical image segmentation.”

Her approach involves using SAM to generate all the different region proposals from medical images. She then employs CLIP, a multimodal model designed to process text and images, to identify the specific organ for segmentation. While CLIP has already been extensively tested on natural images, Sidra has successfully adapted it to the unique challenges of medical imaging. “As we know, one of the widely used applications of CLIP is image retrieval,” she explains. “My objective was to utilize CLIP to get the region of interest from all these region proposals. For the text part, in medical imaging, we need domain knowledge. To mitigate that issue, I generated text prompts using ChatGPT.”

Regarding lung segmentation, Sidra used ChatGPT to generate 20 attributes to describe the lungs in a chest X-ray. These prompts were then fed into CLIP’s text encoder, which calculated the similarity between all the SAM-generated region proposals and the text prompts and retrieved the relevant mask from the pool of SAM-generated masks. Finally, SAM was prompted with the retrieved region of interest to segment the lung.

The SaLIP model has been tested on medical imaging datasets encompassing MRI scans, ultrasound, and X-ray images, and diverse segmentation tasks, including brain, lung, and fetal head. “The lung segmentation was more challenging, as there were two regions of interest – the left and right lung,” she recalls. “As reported in the paper, the performance was really good!”

A significant contribution of this work is that it employs both the ‘segment everything’ and ‘promptable’ modes of SAM, connected through CLIP. This dual-mode approach allows for more precise and varied applications in clinical settings. Previous work adapting SAM to medical imaging has used it to segment everything in the image. However, clinicians want to focus on specific regions of interest, which vary depending on the clinical need. “First, I utilized the segment everything mode of SAM to create region proposals for every region in the image,” Sidra explains. “Then, I used the promptable mode, where we use specific prompts to segment a specific region. To connect both these modes, I used CLIP as a bridge between them. To the best of my knowledge, none of the work in medical imaging has utilized both modes of SAM. The literature mainly focuses on finetuning either the segment everything or promptable mode.”
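Sidra's cascade maps onto the public SAM and Hugging Face CLIP APIs roughly as follows. This is a hedged sketch, not the paper's released code: the checkpoint path, prompts, and crop helper are placeholders of our own:

```python
import numpy as np
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry
from transformers import CLIPModel, CLIPProcessor

# Rough sketch of the SaLIP cascade: SAM "segment everything" -> CLIP
# retrieval -> promptable SAM. Paths and prompts below are placeholders.

image = np.array(Image.open("chest_xray.png").convert("RGB"))

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
proposals = SamAutomaticMaskGenerator(sam).generate(image)    # all regions

def crop(img, bbox):                  # hypothetical helper: cut out a box
    x, y, w, h = [int(v) for v in bbox]
    return Image.fromarray(img[y:y + h, x:x + w])

prompts = [                           # stand-ins for the ChatGPT attributes
    "two dark air-filled regions in a chest x-ray",
    "the left and right lungs on a radiograph",
]
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

crops = [crop(image, m["bbox"]) for m in proposals]
inputs = proc(text=prompts, images=crops, return_tensors="pt", padding=True)
sims = clip(**inputs).logits_per_image   # (num_crops, num_prompts) scores
best = proposals[int(sims.max(dim=1).values.argmax())]

# best["bbox"] would then be fed to SAM's promptable mode (SamPredictor)
# to produce the final organ mask.
```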

Another major contribution is that this unified framework is fully adapted at test time for zero-shot organ segmentation, meaning no training is involved. Traditional methods of adapting SAM to medical imaging involve finetuning or transfer learning, which require substantial computational resources and large datasets. A common challenge in medical imaging is the scarcity of available data. A training-free approach bypasses the need for annotated data or human experts and resolves privacy concerns and resource limitations in medical imaging. Harnessing the capabilities of LLMs also eliminates the need for domain expertise in prompt engineering.

As our interview draws to a close, Sidra shares some highlights from her recent experience at the International Symposium on Biomedical Imaging (ISBI) in Athens. “I’ve been to a lot of conferences, but there were a few things I experienced for the first time at ISBI,” she says. “The highlight for me was the lunch with leaders, where you sit at a table with a specific leader and speak one-on-one about your career. As I’m nearing the completion of my PhD, I need professional guidance from someone in this field.”

Sidra’s chosen leader was Amir Amini, a distinguished professor at the University of Louisville with a wealth of experience in biomedical imaging and electrical engineering. “His list of achievements is very long,” she points out. “It was a very useful activity to sit in front of such an accomplished person and talk about your personal problems.”

While she was able to attend ISBI, Sidra expressed her frustration at not being able to come to Seattle for CVPR due to visa issues, having been looking forward to presenting her work in person. Nevertheless, she remains optimistic when we turn to talk of the future. “As I’m in the last year of my PhD and my last semester is about to start, I’m looking for positions,” she reveals. “I hope that by sharing this, people might reach out!” One thing is clear: Sidra is poised to make a significant contribution to her field. For those seeking a dedicated and insightful researcher, she is an exceptional candidate ready to embark on the next stage of her career!

Did you enjoy this BEST OF CVPR 2024? 37 pages about CVPR in Seattle, just 10 days ago! Will you miss your friends and the brilliant tech? Do you want to feel like you’re at CVPR every month? Subscribe for free to Computer Vision News! With us, it’s CVPR every month!

Congrats, Doctor Philipp!

Philipp Klumpp completed his PhD just a few weeks ago. He worked with the team for speech processing and understanding of the Pattern Recognition Lab at FAU Erlangen-Nürnberg. Under the supervision of Elmar Nöth, his research focused on the automated analysis of pathological speech and language using modern ML techniques. Philipp is now working as a Data Scientist for DATEV.

Speech and language technology has become ubiquitous and significantly more powerful in recent years. These advancements came with highly complex models which demanded enormous amounts of training data. In the domain of pathological speech, such amounts are unheard of. For a robust modern speech recognizer, we are talking about many thousands of hours of transcribed audio samples from at least some 10,000 different speakers. Good luck trying to adopt this for pathological speech!

In his thesis, Philipp proposed a solution to this problem which may appear counterintuitive at first: no pathological data was used during the optimization of any of the large models. Instead, his algorithms rely exclusively on off-the-shelf speech recognition datasets collected from healthy speakers. But how could such a model help to analyze pathological speech? This is where phonetics comes into play: the science of speech production, transmission, and perception. It helps explain how and why the speech of patients with a particular medical condition deviates from the healthy reference. Not only could the presented approach solve the problem of data scarcity in the medical domain, but it also yields very explainable outputs which are much easier for a clinical expert to understand and interpret. For example, it was possible to show
