Computer Vision News - April 2025

Full review of 4 award winners!
Enthusiasm is common, endurance is rare!

WACV Best Paper Award - Algorithms
RayGauss: Volumetric Gaussian-Based Ray Casting for Photorealistic Novel View Synthesis

The most prominent recent method in novel view synthesis is 3D Gaussian Splatting, which represents scenes with 3D elliptical basis functions (Gaussian kernels). The first question is how to represent the scene properly, from a physical viewpoint, using such elliptical basis functions. Previous algorithms that use elliptical basis functions are not very precise: 3D Gaussian Splatting, for instance, produces rendering artefacts because it relies on an approximate rendering algorithm. Hugo's approach does not exhibit this type of artefact. How was he able to take a different approach? The idea is to use a less approximated volume rendering equation, and it is precisely this less approximated equation that gets rid of the artefacts.

Hugo Blanc is a third-year PhD student in the CAOR Laboratory at Mines Paris in France, under the supervision of Jean-Emmanuel Deschaud and Alexis Paljic. His paper won the Best Paper Award (Algorithms) at WACV 2025 in Tucson, AZ. It is interesting to note that Hugo won the award with his first paper, at his first ever conference. Kudos Hugo!
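For readers who want the gist of what "less approximated" means here, the classical emission-absorption volume rendering integral along a ray r(t) = o + t·d is reproduced below. This is the standard physical model from the volume rendering literature, not necessarily the exact formulation used in the RayGauss paper: T is the transmittance, σ the density and c the emitted radiance contributed by the primitives.

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t),\mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```

Splatting-based renderers approximate this integral with sorted, alpha-composited projections of the primitives; casting rays through the primitives evaluates it more directly, which is where the reduced approximation (and the absence of the associated artefacts) comes from.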

This does not come without its challenges. The biggest challenge for the team was clearly the implementation, because efficiently calculating the intersections between rays and the irregularly distributed primitives in the scene is no easy task. How did Hugo solve it? "I solved this challenge by using a bounding volume hierarchy (BVH)," Hugo explains. "It is an acceleration structure that allows us to efficiently query ray-primitive intersections, particularly in the case where these primitives are irregularly distributed." Without this acceleration structure, rendering would have been much slower and he would not have achieved such rendering times. Is this what makes the Best Paper of the conference? Hugo has thought about this and believes that the main quality of the paper is that it is precise and rigorous. The physical scene representation is precise; each term and its role are defined: primitives are defined to emit and absorb light, and from this representation the global representation of the whole scene is deduced. Being rigorous about both the representation and the implementation is really valuable! Thank you Nicolas Lissarrague (LARSH-Devisu - UPHF) and Alexis Heloir (LAMIH UMR CNRS 8201 - UPHF / INSA Hauts-de-France) for providing this dataset.
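To give a feel for why such an acceleration structure matters, here is a minimal, self-contained Python sketch of a BVH over axis-aligned bounding boxes with a ray query. It is generic illustration code under simple assumptions (median splits, a slab test), not the GPU/OptiX implementation used in RayGauss.

```python
import numpy as np

# Minimal BVH over axis-aligned bounding boxes (AABBs), for illustration only.
# It shows why a hierarchy turns "test the ray against every primitive" into a
# much smaller set of candidate intersections.

class BVHNode:
    def __init__(self, lo, hi, left=None, right=None, indices=None):
        self.lo, self.hi = lo, hi          # corners of the node's bounding box
        self.left, self.right = left, right
        self.indices = indices             # primitive indices (leaf nodes only)

def build_bvh(centers, half_sizes, indices=None, leaf_size=4):
    """Recursively split primitives at the median along the widest axis of their bounds."""
    if indices is None:
        indices = np.arange(len(centers))
    lo = (centers[indices] - half_sizes[indices]).min(axis=0)
    hi = (centers[indices] + half_sizes[indices]).max(axis=0)
    if len(indices) <= leaf_size:
        return BVHNode(lo, hi, indices=indices)
    axis = np.argmax(hi - lo)
    order = indices[np.argsort(centers[indices, axis])]
    mid = len(order) // 2
    return BVHNode(lo, hi,
                   left=build_bvh(centers, half_sizes, order[:mid], leaf_size),
                   right=build_bvh(centers, half_sizes, order[mid:], leaf_size))

def ray_hits_box(origin, inv_dir, lo, hi):
    """Standard slab test: does the ray intersect the AABB?"""
    t1 = (lo - origin) * inv_dir
    t2 = (hi - origin) * inv_dir
    t_near = np.minimum(t1, t2).max()
    t_far = np.maximum(t1, t2).min()
    return t_far >= max(t_near, 0.0)

def query(node, origin, inv_dir, hits):
    """Collect indices of primitives whose boxes the ray enters."""
    if not ray_hits_box(origin, inv_dir, node.lo, node.hi):
        return
    if node.indices is not None:
        hits.extend(node.indices.tolist())
        return
    query(node.left, origin, inv_dir, hits)
    query(node.right, origin, inv_dir, hits)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    centers = rng.uniform(-1, 1, size=(1000, 3))   # irregularly placed primitives
    half_sizes = np.full((1000, 3), 0.02)          # their (isotropic) extents
    root = build_bvh(centers, half_sizes)
    origin = np.array([-2.0, 0.0, 0.0])
    direction = np.array([1.0, 0.0, 0.0])
    inv_dir = 1.0 / np.where(direction == 0, 1e-12, direction)
    hits = []
    query(root, origin, inv_dir, hits)
    print(f"candidate primitives along the ray: {len(hits)} of {len(centers)}")
```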

Hugo expects this work to open new directions because it uses OptiX, an NVIDIA library often used for classic rendering that had not really been used in this field before this paper. Is this Hugo's first WACV? Apparently yes, and even his first conference! Certainly not the first paper of his career. "No, it's my first paper! Really!" Hugo confesses. We are curious to know what Hugo learned from winning an award that he will reproduce next time to win another. "I think you have to do everything properly, starting from the paper presentation," Hugo reveals. "Not only the content: the presentation of the paper is important. The poster, the video... there is something that I could improve: my English. It isn't as strong as I'd like it to be, and I plan to work on that. That being said, I think you have to be very rigorous and also have an innovative idea. For instance, this idea of ray casting irregularly distributed primitives was quite new, because there are not many algorithms that did this type of thing before!"

Hugo is justly proud of this paper and thinks that the work offers new directions for research. Giving some hints about future work, Hugo discloses that, since the method works in 3D space, the rendering equation itself can be improved. Here, the classical one was used, which means that only emitting and absorbing properties are considered. But in reality, there is much more than that. For example, the scene primitives can scatter, which means that light is not only emitted and absorbed but also reflected by elements in the scene. With this 3D approach, it is quite straightforward to go to the next step and build a more realistic algorithm that takes into account every phenomenon of light. Another source of pride comes from the research being both theoretically elegant, with no concessions, and backed by a robust representation: "Elegant representation coupled with efficient implementation!" is Hugo's final message. The CAOR lab is located in Paris, and its NPM3D team has about 5-6 members. They work on robotics in general; Hugo and his team in particular work on point cloud representation: any task you can think of doing with point clouds is part of their work in the lab.

Best Student Paper Award
GeoDiffuser: Geometry-Based Image Editing with Diffusion Models

Rahul Sajnani is a third-year PhD student at Brown University. His work won the Best Student Paper award at WACV 2025.

GeoDiffuser is about geometry-based image editing with diffusion models: more specifically, how to inject geometry into image editing without any model retraining or fine-tuning. This zero-shot, optimization-based method does not train models; it is an optimization strategy that allows you to rotate or translate objects in your image, remove an object, and remove any distractors. The method is a test-time optimization strategy that views these image editing operations as geometric transformations, which can be directly incorporated into the attention layers of diffusion models. "What we do," explains Rahul, "is we devise specific loss functions that come from within the attention blocks of this diffusion model and we update the inputs of the model. We leave the model untouched, but we just update the input to the model so that it tries to perform the edit that you wish to perform."

How did the idea come about? There are seminal works in this field that first did this for text-based editing. The difference is that these seminal works, which include prompt-to-prompt, null-text inversion and so on, perform editing to change visual appearance (for example, turning a sunny scene into a snowy scene), but they don't translate objects or change their position. There are some other works that do this, but they do it only in 2D and don't inject geometry, or they don't remove objects well enough. "Learning from the community," points out Rahul, "and then manipulating attention features as well as applying loss functions to improve on top of these works helped us get to GeoDiffuser, and that's what has helped and shaped our research direction!" The two main contributions of this work are the attention sharing mechanism and the optimization of image latents. Let's see them in detail. The first specific contribution of GeoDiffuser is that it uses a depth prior to get geometry and to move objects around during editing. This prior can be injected into any text-to-image model: you don't need a depth-trained model or an in-painting model to do this. It's very generic, and the team have actually shown that the same edits are possible using Stable Diffusion 1.4 to 2.1 and more. In other words, the first contribution is that the approach is generic and can be applied to the attention blocks within these models. The second contribution is that they have devised loss functions specifically to remove items from the image, say a cup from the top of a table.

Ideally, if we remove the cup, its shadow and other elements should be removed as well. It is helpful to do this in the attention space and to update the image latents accordingly, so that the change is actually reflected in the environment as well. These findings did not come without major challenges! The team's earlier direction was not geometry-based image editing: they actually wanted to do novel view synthesis for regular everyday objects. That means that if you have seen a photo of a cup from the front and you want to see the cup from the back, how would the cup look? Rahul wanted to do it without training, because datasets exist for objects, but if you want to do it for scenes, like seeing the same door from a different viewpoint, you want something that can be done without training a large model. When they followed this first direction, they faced issues: these models cannot give you a precise novel view; they could do this only for small changes in angle, up to about 45 degrees. This is what brought the team to merge some attributes of novel view synthesis and geometry with image editing. That's how GeoDiffuser came into existence. On top of that, Rahul added the loss optimizations that allow objects to be removed, moved and more.
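As a rough feel for what such test-time optimization can look like (warping an attention map with the desired geometric edit and nudging the image latents toward it), here is a toy, self-contained PyTorch sketch. It is not GeoDiffuser's code: the "attention" is faked from the latent so the snippet runs standalone, and all names and the loss are illustrative only.

```python
import torch
import torch.nn.functional as F

# Toy illustration of test-time latent optimization guided by a warped attention map.
# A real method works inside the attention blocks of a diffusion model; here the
# "attention map" is a simple function of the latent so the snippet runs standalone.

def fake_attention(latent):
    """Stand-in for a cross-attention map: a softmax over spatial positions."""
    scores = latent.mean(dim=1, keepdim=True)                    # (B, 1, H, W)
    return torch.softmax(scores.flatten(2), dim=-1).view_as(scores)

def translate_map(attn, dx, dy):
    """Apply a 2D translation to the attention map via grid_sample (the 'edit')."""
    b, _, h, w = attn.shape
    theta = torch.tensor([[[1.0, 0.0, -2.0 * dx / w],
                           [0.0, 1.0, -2.0 * dy / h]]], dtype=attn.dtype).repeat(b, 1, 1)
    grid = F.affine_grid(theta, attn.shape, align_corners=False)
    return F.grid_sample(attn, grid, align_corners=False)

latent = torch.randn(1, 4, 64, 64, requires_grad=True)           # "image latent" to optimize
with torch.no_grad():
    target_attn = translate_map(fake_attention(latent), dx=10, dy=0)  # desired edited layout

optimizer = torch.optim.Adam([latent], lr=0.05)
for step in range(50):
    optimizer.zero_grad()
    attn = fake_attention(latent)
    loss = F.mse_loss(attn, target_attn)        # pull the attention toward the edited layout
    loss.backward()
    optimizer.step()
print(f"final guidance loss: {loss.item():.6f}")
```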

Why exactly did the jury want this paper to be the winner, out of hundreds of student papers? Did Rahul think about this? "I feel one main strength is that our paper is very general," is Rahul's reply, "so it can be applied to a wider audience. Any model that has attention blocks can use some ideas from this work to update or tweak the outputs. Another strength is that visually our results were really pleasing! Finally, in our supplement we have a fine analysis of why, if I remove an object, the shadows should be removed too. We have more analysis of how other works do it and where they fail, and even where we fail as well!" The most immediate future direction for this work is to translate this effort from image models to video. "We were seeing how video diffusion models process these latents," concludes Rahul, "and how we can maybe edit them. The concern is that video diffusion models compress their latents very much, so there is a disconnect in explainability between how these models are compressing and which regions are important for what types of edits. Our next work is going deeper into this and also looking into scene generation and 3D reconstruction - a fusion of these two!"

Best Student Paper Honorable Mention
Cross-domain and Cross-dimension Learning for Image-to-Graph Transformers

Alexander Berger is a PhD student with Johannes Paetzold (at the right of the poster) and Daniel Rueckert at Technical University Munich. Alex presented an excellent paper at WACV 2025, which earned an honorable mention for the Best Student Paper award.

What is image-to-graph transformation? It means taking an image as input and extracting from it a graph that represents a physical structure inside the image. There are many downstream applications for these structural graphs. For example, in the domain of remote sensing images, we can extract road networks as graphs, which can then be used for navigation and other purposes; in medical imaging, a graph can represent vascular structures, for example of the retina or of the brain, and then be used for brain analysis or for the diagnosis of diseases. The field of image-to-graph extraction is relatively small, so there is not much research in it. The merit of this work is that it tries to tackle the data scarcity problem, since it is very hard to annotate graph labels. Was this the biggest challenge of this work? There was an even bigger one. The work wants to transfer knowledge between two domains, that is, to utilize data in one domain to train the network in another domain. The biggest challenge there is the large domain gap, because we are dealing with satellite images on the one hand and medical scans on the other, and there is a large difference between those images.

How did Alex work to close this gap? He lists a set of three contributions. The first one is domain adversarial learning, a domain adaptation framework that works not only on image-level features (which is common, for example in object detection) but also on graph-level features; for this, the output of the transformer network is viewed as a kind of abstract graph representation, which is aligned between the two domains using domain adversarial learning. The second contribution tackles the challenge of different image dimensionalities. The source domain has 2D images, namely satellite images, but when we want to extract the graph from 3D medical scans, we need to bridge this dimensionality gap by introducing a very simple projection function. "This projection function," explains Alex, "works by just taking a 2D image, which is a satellite image, and centering it inside of a 3D zero-initialized volume, and then we are randomly rotating the whole volume. In that way we have some kind of synthetic 3D volume that is then automatically aligned with our first contribution, namely the domain adaptation framework." (A rough sketch of this projection idea follows below.) The third contribution, which the authors call the regularized edge sampling loss, tackles the challenge that the edge distributions are very different between the two domains.

For example, street images usually show very regular structures: one node, a straight line, the next node and then a bifurcation point. It depends on the country; this study used US cities as the source domain, so there usually is a very grid-like layout. What could be added to this project? "We use this projection function for bridging the dimensionality gap," explains Alex. "But it would be very interesting if we had a dimensionality-agnostic network, i.e. a network that can inherently learn in both 2D and 3D. That would also be beneficial here." We had the chance to ask Johannes, the supervisor, what he is particularly proud of in this work. "What is incredible," Johannes tells us, "is that it is Alexander's master thesis project. Initially he came to our lab as a master student; he is now pursuing a PhD. Just the fact that his first project, his initial master thesis project, became a paper, and even an award-winning paper, is an incredible achievement! From a scientific perspective, I find this task of not only detecting objects but also identifying relationships between things in an image, and solving this with a single neural network, very interesting!" Back in the day, Johannes and colleagues introduced the Relationformer framework, which was a first step that suffered from data scarcity problems. Taking it to the next step with transfer learning and domain adaptation is therefore very logical to him. What he would find even more interesting is to go beyond pure physical structural graphs: for example, detecting cars as an additional object and thereby embedding more heterogeneous object representations. Similarly, in the medical domain, automatically predicting additional properties: not only the structural graph, but maybe the radius or even the flow of a blood vessel.
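A minimal NumPy/SciPy sketch of the projection function Alex describes (centre the 2D satellite image in a zero-initialized 3D volume, then rotate the whole volume by a random angle) might look like the following. Axis conventions, interpolation and parameter values are illustrative assumptions, not the paper's actual code.

```python
import numpy as np
from scipy.ndimage import rotate

# Illustrative 2D-to-3D projection: embed a 2D satellite image as the central
# slice of an otherwise empty (zero-initialized) volume, then rotate the whole
# volume by a random angle to obtain a synthetic "3D" training sample.

def project_2d_to_3d(image_2d, depth=64, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    h, w = image_2d.shape
    volume = np.zeros((depth, h, w), dtype=image_2d.dtype)
    volume[depth // 2] = image_2d                       # centre the 2D image in the volume
    angle = rng.uniform(0.0, 360.0)                     # random rotation angle in degrees
    planes = [(0, 1), (0, 2), (1, 2)]                   # possible rotation planes
    axes = planes[rng.integers(len(planes))]            # pick one at random
    return rotate(volume, angle, axes=axes, reshape=False, order=1, mode="constant")

if __name__ == "__main__":
    fake_satellite = np.random.rand(128, 128).astype(np.float32)
    vol = project_2d_to_3d(fake_satellite)
    print(vol.shape, float(vol.max()))
```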

Alex gave some thought to the reasons for his award. "One reason for it, I think," Alex begins, "is that it's a paper with a relatively clear contribution. We have a very clear problem, an application-side problem that we want to solve, and for this we propose a set of contributions. The connection between the application, the problem and the solution is relatively clear in this paper. Another thing that I like about the paper, and that I can imagine the jury that decided on this award liked as well, is that the solutions we are proposing are relatively simple: no crazy new architecture, no ultra-complex neural network. We are proposing three contributions that are relatively straightforward, but achieve empirically very good results." The secret of this paper? "Multiple iterations!" reveals Alex, "as we refined it in multiple steps. If I look now at the first version of the paper and compare it to the final version that won the award here, I see a very large difference!"

CV4Smalls Workshop Best Paper
Iris Recognition for Infants

Rasel Ahmed Bhuiyan is a fourth-year PhD student at the University of Notre Dame, under the supervision of Adam Czajka. Rasel won the Best Paper Award at the CV4Smalls Workshop at WACV 2025. His research focuses on two extreme cases of iris recognition: iris recognition for infants and post-mortem iris recognition.

It is estimated that 20,000 babies worldwide are swapped annually due to misidentification. In addition, 66% of infant abductions occur in hospitals. These statistics are very alarming, and this is why newborn safety is a global concern that calls for reliable and accurate identification methods for newborns. Given this extremely urgent motivation, how come it has not been done until now? Machine learning and machine vision have mainly worked on face recognition and fingerprints. These previous works cannot provide a solution for newborn identification, because biometrics change rapidly as the baby grows; they are therefore not reliable for this case. A previous longitudinal study on iris recognition found that the iris can be used to recognize an individual only from about two years of age onward. The conclusion from studies conducted to date was that iris recognition did not work effectively for babies aged two years and younger.

Why is this so? Iris patterns are indeed very stable, very unique and last over time, even shortly (a few weeks) after death. So why can we not do the same before two years of age? To investigate the hypothesis that iris recognition is viable for infants, Rasel and the team collected data from babies aged 4-12 weeks and designed a custom 4-megapixel CMOS image sensor operating in the near infrared, which they used to capture all the data. However, applying state-of-the-art iris recognition models (designed originally for adults) did not provide an accurate identification method. "We saw that the infant iris has a brighter pupil than the regular iris," reveals Rasel. "They also have a slightly larger pupil size, and due to that, existing iris image processing methods cannot segment infant iris images accurately." But can you perhaps teach them? Rasel decided to do infant-specific pre-processing and to build a segmentation model that is also infant-specific. To do that, he used iris images collected from adults, performed infant-specific augmentations, like adjusting the pupil brightness and size, and then trained the model on such preprocessed data (a rough sketch of this kind of augmentation follows at the end of this article). The resulting model can accurately segment not only infant iris images; it also detects other deformations in iris images and can be applied, for instance, to the segmentation of images captured post-mortem or from diseased eyes. The architecture of the model was a nested U-Net with dilated convolution layers and an attention mechanism. Is the model so accurate that you could claim that a child has been swapped, with all the radical consequences upon this child's life? "This model actually did what we designed it for," declares Rasel. "With this model you can identify the baby so that you can prevent the swapping!" The method, like any other iris recognition algorithm, can distinguish identical twins, and can assist hospital staff in rapid and accurate identification of kids.

The existence of such methods should also be a serious deterrent for criminals abducting babies from hospitals. Did Rasel and the team give any thought to why the work won the Best Paper award? He thinks the paper impressed because "we actually collected the data, and designed a complete system with a sensor, image segmentation model and encoding routines, which as a whole perform pretty well." They are offering the segmentation model as an open-source solution, together with a synthetic dataset, so that other researchers can work on this topic. Adam Czajka, Rasel's advisor, told us that he was particularly proud of this work because of its interdisciplinary character and the need to navigate completely unknown research terrain. Adam is also happy that, despite the fact that previous studies suggested rather low performance of iris recognition applied to newborns, this did not discourage Rasel, whose motivation and persistence allowed for an impactful work, appreciated by the community. What's next for Rasel? The plan is to use data from babies of 4-6 weeks and gather additional data from the same babies after one year and after two years, so that a longitudinal study might prove iris recognition's stability over time for accurate identification.

Rasel receiving his award from Sarah Ostadabbas, together with supervisor Adam Czajka
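The infant-specific augmentation Rasel mentions (brightening and slightly enlarging the pupil in adult NIR iris images before training the segmentation model) could be approximated along the lines below. This is a rough sketch that assumes a binary pupil mask is already available; the parameter values are illustrative, not those of the paper.

```python
import numpy as np
from scipy.ndimage import binary_dilation

# Rough sketch of an "infant-style" augmentation for adult NIR iris images:
# brighten the pupil region and enlarge it slightly, mimicking the brighter,
# larger pupils observed in infant irises. Assumes a binary pupil mask is given.

def infantify(iris_image, pupil_mask, brighten=60, dilate_px=6):
    image = iris_image.astype(np.float32)
    # 1) enlarge the pupil: dilate the mask by a few pixels
    enlarged = binary_dilation(pupil_mask, iterations=dilate_px)
    # 2) fill the newly added ring with the original pupil's mean intensity
    ring = enlarged & ~pupil_mask
    image[ring] = image[pupil_mask].mean()
    # 3) brighten the (enlarged) pupil, clipping to the valid intensity range
    image[enlarged] = np.clip(image[enlarged] + brighten, 0, 255)
    return image.astype(np.uint8), enlarged

if __name__ == "__main__":
    img = (np.random.rand(240, 320) * 255).astype(np.uint8)     # stand-in NIR iris image
    yy, xx = np.mgrid[:240, :320]
    mask = (yy - 120) ** 2 + (xx - 160) ** 2 < 30 ** 2          # stand-in circular pupil mask
    aug, new_mask = infantify(img, mask)
    print(aug.shape, bool(new_mask.sum() > mask.sum()))
```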

UKRAINE CORNER
Russian Invasion of Ukraine

WACV's sister conference CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war. In these photos, the whole WACV Daily and WACV reception teams, with Denis Rozumny, support Ukraine and the Ukrainian people.

WACV Poster Presentation
DreamBlend: Advancing Personalized Fine-tuning of Text-to-Image Diffusion Models

Shwetha Ram is an Applied Scientist at Amazon, working on Rufus, Amazon's Conversational Shopping Assistant. She is also the first author of a wonderful paper that was accepted as a poster at WACV 2025.

The motivation for the work came from noticing that the underfit checkpoints offer more prompt fidelity and diversity, while the overfit checkpoints offer subject fidelity. If a way could be found to combine the two and get the benefits of both, that would be awesome. Starting from this, the team tried different approaches using masks and so on, but these have their own challenges, like blending artifacts at the mask boundaries. Over time, they started doing a more thorough analysis, observed a phenomenon of attention collapse, and realized that the solution to correct for it might be cross-attention guidance using the attention maps. That's when they decided to try this regularization using the cross-attention maps, which actually gave good results. The paper tries to address a fundamental trade-off between prompt fidelity, subject fidelity and diversity that occurs when fine-tuning for text-to-image personalization. This is a very fundamental problem that everybody who fine-tunes these pre-trained text-to-image models for personalization will face, as confirmed by talks with other paper authors at WACV. Not knowing which training step or checkpoint to pick while doing the fine-tuning: this is the fundamental trade-off. Shwetha's paper is a step towards addressing and improving it.

Shwetha found the desired solution when she figured out that it was this catastrophic attention collapse that leads to the problem, and she had the idea that using this guidance would help. The image editing community has published quite a few works that use similar cross-attention manipulation for different image editing techniques, so she took some inspiration from there. We are curious to ask Shwetha whether this solution is specific to this work or whether it also opens new directions for her or for others to follow. "Right now, this is focused on this particular challenge," Shwetha tells us, "but I wouldn't say that it is completely solved: there are still things that can be done to improve it, and there are of course similar trade-offs in other fine-tuning problems."

DreamBlend merges the prompt fidelity and diversity of underfit checkpoints with the subject fidelity of overfit checkpoints during image generation. Early checkpoints have higher prompt fidelity and diversity but lower subject fidelity, while later checkpoints have higher subject fidelity but lower prompt fidelity and diversity. Prompt: a backpack* on a cobblestone street.
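To make the checkpoint-blending idea concrete, here is a toy PyTorch sketch that mixes the predictions of an "underfit" and an "overfit" checkpoint at each denoising step, weighted by a cross-attention-style subject map. It is purely illustrative: no real diffusion model, sampler or DreamBlend code is involved, and every function name is made up.

```python
import torch

# Toy sketch: combine the noise predictions of an "underfit" and an "overfit"
# fine-tuned checkpoint at each denoising step, guided by how much each spatial
# location is believed to belong to the subject. A real diffusion sampler, the
# actual checkpoints and DreamBlend's guidance terms are not shown.

def fake_unet(latent, weights_scale):
    """Stand-in for a U-Net noise prediction from one checkpoint."""
    return torch.tanh(weights_scale * latent)

def fake_subject_attention(latent):
    """Stand-in for the cross-attention map of the subject token (values in [0, 1])."""
    return torch.sigmoid(latent.mean(dim=1, keepdim=True))

latent = torch.randn(1, 4, 64, 64)
for t in range(50, 0, -1):                              # crude "denoising" loop
    eps_under = fake_unet(latent, weights_scale=0.5)    # underfit: better prompt fidelity
    eps_over = fake_unet(latent, weights_scale=1.5)     # overfit: better subject fidelity
    subject = fake_subject_attention(latent)            # where the subject is believed to be
    # Use the overfit prediction on the subject region, the underfit one elsewhere.
    eps = subject * eps_over + (1.0 - subject) * eps_under
    latent = latent - 0.02 * eps                        # toy update step
print("finished toy blended sampling:", tuple(latent.shape))
```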

To the best of her knowledge, this is the first work that tries to address the trade-offs between prompt fidelity, subject fidelity and diversity in text-to-image personalization by combining the benefits of early and late checkpoints during image generation. "We identified the phenomenon of catastrophic attention collapse," Shwetha explains, "and we also proposed a method to mitigate it using this cross-attention guidance. This has improved results upon existing state-of-the-art fine-tuning techniques for text-to-image personalization, which is the novelty of this work!" This work is a step towards improving the quality of text-to-image personalization, and it can be hoped that it opens the doors for more people to think of more ways to address this problem. In the meantime, can Shwetha explain the problem of text-to-image personalization, in case some readers don't know? "Okay!" Shwetha accepts the challenge. "Basically, if you take a pre-trained model like Stable Diffusion and ask it to generate a photo of Barack Obama in outer space, it can probably do that because it knows who Barack Obama is. On the other hand, if I have a less famous concept, like myself or my dog or my teddy bear, and I say generate a photo of my teddy bear with a blue house in the background, it doesn't know what my teddy bear is. Then the problem is how you teach these concepts to a pre-trained model, and one of the successful ways to do that is by fine-tuning this pre-trained text-to-image model with this new concept. There is this concept called my teddy bear, and once it learns that, I can at inference time use this fine-tuned model to generate images like a photo of my teddy bear with a blue house in the background, and so on."

However, while doing this, there is a trade-off. In the early fine-tuning steps, you have good prompt fidelity and diversity, which come from the world knowledge of the pre-trained model, but you don't have subject fidelity because the fine-tuning is still early and the subject has not been learned yet. In the later steps of fine-tuning, you start to see overfitting and catastrophic forgetting. Because of this, you start losing the prompt fidelity and diversity, but now you have good subject fidelity because you have already overfitted. Shwetha's method therefore tries to get the best of both: it is designed to achieve prompt fidelity and diversity as well as subject fidelity by combining the benefits of both the early and late fine-tuning checkpoints during inference. What's next? Shwetha is working on a few different things at the moment, and hopefully she will share them with the community soon!

Across various subjects and prompts, DreamBlend successfully preserves the layout of the reference underfit image as well as the identity of the input subject.
DreamBlend applied on different backbones, different fine-tuning techniques, real image editing.

Read 160 FASCINATING interviews with Women in Science

Women in Computer Vision
Anna Hilsmann

Anna Hilsmann is the head of the Vision and Imaging Technologies department at HHI. Within this department she also heads one of the four research groups, the Computer Vision and Graphics research group. Anna is co-author of a paper accepted as an oral at WACV 2025.

Anna, what is your work about?
I am more in a research management position right now. But the research focus of my group is everything around computer vision and graphics, so the modern word would be visual computing. We are not doing classical computer graphics, but rather using the information that we get from computer vision algorithms in order to also synthesize new data and new 3D representations of the world, for example. The research group is quite large; there are currently 13 PhD students, and the topics we are working on are quite diverse. At Fraunhofer, we are mostly funded by research projects with different applications. The applications range from multimedia and agriculture to industrial computer vision and construction. Currently, we also have some security projects and medical imaging.

So apparently everybody needs you. Of all the fascinating fields that you have mentioned, which one is your favorite?
Actually, I'm more focused on the methods. Our goal is to develop computer vision methods that can then be applied in different applications. What I like most is when we develop a method and can also apply it to another field. One example is a PhD student who is working on hyperspectral imaging, mainly in the medical field for the differentiation of tissues in intraoperative settings. We gained a lot of experience there, and we are currently also applying hyperspectral imaging with a different focus in the agricultural field, because you can apply it there for the detection of plant diseases and so on.

We will have health and we will have food, thanks to you.

Or at least less pesticides and so on…
Let me add that one of our main focuses is also 3D reconstruction, or modeling the 3D world around us. That is a field that can be applied to very different applications.

Are you familiar with Fraunhofer's concept? It's a research institution. It's not a university, but a research institution, so it's somewhere between industry and research. We have some professorships too! We have people who are also working at the university, and we work a lot with students doing their master's thesis, a student job, a bachelor's thesis and so on. I'm also a lecturer at Humboldt University. We have a lot of connections with the universities here; many PhD students are also affiliated with them, and we're doing applied research. It's different for the different Fraunhofer Institutes: some of them are more research oriented and some are more application oriented; we are more research oriented. Still, this means that we don't do that much basic research, but we always have some kind of application behind it and always want to have some big research project where we work with different partners, in order to create some solutions for applications.

Don't you feel that you put a lot of pressure on yourself in those very first years of your career?
No! Because I started as a researcher at HHI, and it's actually quite usual that people who start at a Fraunhofer Institute are also pursuing their PhD. I think it was a little bit easier for me because Peter Eisert, my group leader at the time, was also a professor at the university. So I was working on research projects, but the theoretical work that I was doing in those projects I could put directly into my PhD. This is actually the goal that I also have with my PhD students: that they have their work in their research projects and can use the same theoretical work for their PhD projects. Peter is now co-heading the research group with me, and also the department, so we work very closely together.

Of the first years of your career, which is the one thing that you are most proud of?
I actually studied electrical engineering with medical applications, and I did my diploma in that field. That was the first time I did computer vision. I came from the medical application side and wanted to do a research-oriented diploma thesis. That was the first research project I worked on completely by myself, and it led to my first publication and the first conference I went to. This was probably the turning point, because I then decided to also do a PhD in computer vision and to work in applied research.

Did any of what you worked on then make it to the market, the operating room or the clinics?
I don't know exactly, but I guess some parts of it, or at least they were developed further. I did my diploma thesis at Philips Medical and they have a big research department; I know there were several people working on that topic. It was about estimating the deformation of the lung during the breathing cycle in order to plan radiation treatment for tumors.

Don't you feel that in your job part of your effort is wasted because it does not get to the real world? Or are you lucky enough that most of what you do becomes a real thing?
Of course, there are often works that never come into application, but they are still projects where we learn and gain something for other research projects that are then put into applications.

Tell me the most fantastic thing that you have learned until now.
One thing is never to give up, and to continue. This is the basic thing about research: you try different things, and these days it's harder than ten years ago, because everything is faster. I have so many students here who are disappointed because they have some idea, and then two weeks later it is published by someone else. The best advice that I can give them is: don't take this as a sign to stop with that idea. It's exactly the other way around. If someone else had a similar idea and was able to publish something, then it's obviously a good idea. So my advice would be to keep going, because even if someone else had a similar idea and published something about it, it doesn't mean that you have to stop there; it means that this is a path worth going on. This is maybe the biggest advice that I give all the students, because when they start, they think they have to create the big new thing. It's very difficult to create the one thing no one else has thought about these days. We have competitors like Meta, Google, and so on; as a smaller research institution, we cannot compete with them.

Was there a moment when you said: I feel like stopping, but I know that I have to go on?
Yeah, sure. I think that many people, especially women, have this feeling: oh, I'm not good enough and everyone else is better than me. They are all doing the big whole thing, and I am working on this little small problem here and I don't have any success. I think that many people have that. I also had that, and still have it. But I know that this is something where you can talk to yourself in a much nicer way.

To us, you are good enough and you're even much better than that. We strongly support you as very, very good.

We spoke a lot about the past. Tell us a couple of things about your future. Where are you going?
I never planned my future, but I'm currently very happy with this position, and I like working with the PhD students a lot. I'm sometimes a bit sad that I'm not that much into the research work. I very much enjoy the research discussions with the PhD students, even though time is very limited, and I am also doing many other things like project proposal writing and so on, which is necessary to get the research funded. But I enjoy the research discussions much more. My goals for the future are more about myself: to put more priority on the research again and try to get more time for that.

I have maybe one last question for you. It's about the day your career will end. What would be the fantastic thing that will make you say: I did it! It was worth doing! What would have to happen between here and the end of your career to make you say: that's exactly what I wanted to accomplish…
What makes me really happy and proud is to enable people to do research, to teach them how to develop and how to be confident about their research, and to generate the new generation of researchers.

Is empowerment what you mean?
Oh, thank you! In Germany, the number of female researchers in this field is very, very limited, and I still really have problems finding female researchers. What would be really great is if, 20 years from now, we had like 50 percent female researchers at this institute. That would be really great!

Read 160 FASCINATING interviews with Women in Computer Vision!

Computer Vision News
Publisher: C.V. News
Copyright: C.V. News
Editor: Ralph Anzarouth
All rights reserved. Unauthorized reproduction is strictly forbidden.
Our editorial choices are fully independent from IEEE, WACV, CVPR and all conference organizers.

MONAI - Medical Open Network for AI
with Stephen Aylward - NVIDIA

Stephen Aylward is a Global Alliance Manager for Developer Relations at NVIDIA Corporation and a longtime friend of Computer Vision News magazine.

Stephen, what are you working on?
I feel the transition to NVIDIA is a wonderful continuation of my career in the open source field. In my previous job at Kitware I served as chair of the MONAI advisory board, but that was something I was volunteering for and doing in my own time. Now it's wonderful to be able to continue that work as an NVIDIA employee, continuing to support the MONAI open source community and engaging with it in order to ensure that it has a wonderful future.

Tell us about MONAI.
MONAI stands for the Medical Open Network for AI. It was envisioned in 2019 at a MICCAI conference. A group of people from NVIDIA got together with a group of people from King's College London and they began to survey the community. At the time there were multiple toolkits out there for doing deep learning in medical imaging. Via the survey, they realized that the community was willing to really come together and collaborate on a common platform. I've always looked at open source developers as a limited resource: the number of people willing to contribute to open source is finite and should be valued highly. Instead of having them work on disparate tools, NVIDIA and King's College London brought them all together to create this common toolkit. And today we have MONAI.

It's open source under an Apache 2.0 license that allows it to be used for both academic and commercial purposes. It builds on top of PyTorch and it's really an amazing foundation for research as well as product development: not just for training AI systems, but also for AI-assisted annotation and labeling, as well as for deploying those AI systems in the field, in clinical environments.

How wide is the adoption of this tool today?
We track multiple metrics to identify its adoption. One is the number of downloads, and we're at 3.3 million downloads for MONAI since the initial release in 2020. It's amazing! We also have an arXiv paper that we ask people to cite, and it has over 600 citations associated with it. But what really warms my heart, the metric that I like, is when I walk around conferences such as MICCAI and I see MONAI being mentioned during presentations. Or a student comes up to me and says: "Hey, you're Stephen Aylward from MONAI. I love MONAI! It's had such a wonderful impact on my PhD dissertation!" Or: "It's had a wonderful impact on my master's thesis!"

What is the secret of MONAI? Why do you think people like it so much?
I think the credit goes to Jorge Cardoso at King's College London and Prerna Dogra at NVIDIA, because they began by involving the community, as they realized that there had to be community involvement from the start.

We have bylaws, which make it clear that it's always going to be open source and freely available for everyone. The bylaws also specify how people can contribute to MONAI, and we now have over 211 contributors from around the world. Then we have an advisory board that specifies how new contributions are considered and how big new changes in the toolkit come about. This advisory board breaks into working groups: on federated learning, on AI-assisted annotation, on applications in ophthalmology, pathology and so forth. These working groups really contain the best and brightest, and all the people who want to contribute to MONAI from around the world participate in them and help shape its future. So, that's the community, and that is what sets MONAI apart.

What are you preparing for 2025?
There are two parts to that. One is that it really is the community that is going to decide what's next. The deep learning field in medical imaging is evolving so rapidly; I look forward to finding out what the next exciting thing is, and the community is going to provide it. We've recently added generative AI capabilities into MONAI that are outstanding. Foundation models are coming out in both pathology and 3D medical imaging. Vision-language models (VLMs) are coming out in MONAI. All of the trends within the medical imaging field are captured by MONAI and contributed by the community. What's going to happen, the first and coolest thing, is going to be determined by the community. I don't know what it is either! I look forward to seeing it…

So you will be surprised at the same time that we will be surprised.
Exactly! And that's one of the wonderful things about the field. Then, the second thing that is really important, in my opinion, is that MONAI Deploy was built upon the Holoscan SDK, another open source Apache-licensed toolkit, which allows multiple AI models to be specified in a workflow: you can take the output of one AI model and feed it into multiple other AI models and so forth, in a state-machine approach. So I think we're also going to see research and development evolve from exploring the capabilities of a single AI model to really looking at a collection of AI models all contributing to a final clinical solution. We've already begun to lay the groundwork for supporting that via the Holoscan SDK. I look forward to seeing MONAI and the Holoscan SDK continue to work together, not just in MONAI Deploy, but really in research and development from the start.

Now, how does a developer in medical imaging get on this train that is already running?
Become active in the community. You can do this by, first off, visiting the MONAI website at monai.io. What you will see is that the community has contributed literally hundreds of tutorials that show everything from segmentation on CT images to processing of endoscopy images for surgical tool recognition. All of these different workflows, AI models, training paradigms and so forth are typically available as Jupyter Notebook tutorials within the MONAI tutorials GitHub repo, but also as videos. One of the MONAI working groups is actually producing coursework for a graduate-level course that you can work through by yourself, or you could take this coursework and use it to teach a class at university level. That is something else coming out of King's College London as one of the working groups.

Any success stories to share?
Well, instead of sharing one, I'm going to explain four. In 2023, MONAI won four of the medical imaging Grand Challenges at MICCAI. And it was all done using Auto3DSeg, one of these AutoML techniques.

By specifying simply two YAML description files, you specify where all your data is located in one file, and then the other file, in really just 5-10 lines, gives a description of the problem you are trying to solve: the fact that it is MR data versus CT data, a segmentation task versus a classification task. So a really high-level description, plus a description of the data, and the Auto3DSeg AutoML technique would go out, evaluate three different neural networks, do five-fold cross-validation on them, look at your data, look at the available neural network model options, train the systems up, and then pick the best ensemble from the networks it evaluated as the final solution for your system. With these two simple configuration files, all this is going on in the background. (A rough sketch of what such files might look like appears at the end of this interview.) Yet that combination of material in the background resulted in winning entries in four MICCAI grand challenges last year. For me, that was a huge success. That was AI enabling AI, and I think we'll see more and more of that.

Can any of this be exported outside of the medical environment?
That makes perfect sense, and that is one of the wonderful things about the community aspect of MONAI. We already see people using MONAI outside of the medical field for traditional computer vision tasks. So yes, very much so. In my old job, one of the early adopters of MONAI was our computer vision teams. As you start looking at point clouds and, in particular, other volumetric data that exists within these other fields, MONAI has some wonderful techniques for dealing with volumetric data: 3D convolution networks, sliding-window approaches for massive images, pathology images and so forth, which really help solve problems as images get larger and other fields evolve. MONAI is going to be a great go-to tool for them!

Can you help out the autonomous vehicle guys? It's taking some time for them to get their show going.
Yeah, and it goes both ways. I mean, we've learned so much from them and there are a lot of positive synergies. We're starting to see the medical field being represented at CVPR and traditional computer vision conferences, because they're recognizing that we have something to contribute back.

As a commercial company, you're investing a lot of resources here. Where will the money come back from?

You're right, it's a company that, like any other company, has a duty to its shareholders. However, NVIDIA has never focused only on near-term revenue. Jensen, the CEO, has this wonderful outlook of going after really hard problems and trying to solve those problems that no one else can solve. Medical imaging is just full of them. So for NVIDIA, it is the challenge of coming up with solutions that deal with massive pathology images, or with the real-time volumetric ultrasound data that is just around the corner for intraoperative procedure guidance. These are hugely important, impactful healthcare challenges that we seek to solve. That is what motivates the company and that's what motivates its contributions to MONAI. It isn't about turning a dollar in the near term; it really is about making the world a better place via medical AI.

Tell me about 2025. Will there be a lot of online evangelism around MONAI, or will it be more in person, like at conferences?
All of the above. We're already planning NeurIPS workshops. We're planning a presence at RSNA that will be repeated year after year. We now have the exciting NVIDIA GTC event, featuring MONAI and the Holoscan SDK coming together. I really see 2025 as an exciting year for MONAI. We look forward to riding that wave of medical AI becoming pervasive.

If MONAI were a horse, would you bet your hard-gained money on it?
I'm betting my career on it. Literally! With NVIDIA's backing behind MONAI, we know that it is a safe bet; it's going to be around for a long time. We now see FDA-approved products being based on MONAI. We really see it being adopted not just by NVIDIA, but by other commercial entities, including Google for Google Health and AWS for Healthcare. All of those systems make MONAI available for processing medical images within them.

It's nice to see that giant corporations are actually so much in favor of open source! In the latest interview he gave me, Yann LeCun at Meta was very vocal about their support of open source. Now you tell me about NVIDIA as well and other big companies. Are all giants in favor of open source?
No matter how big these businesses are, the open source community is bigger. Open source developers might be a limited resource, but there are 211 contributors to MONAI. It's hard for any one business to contribute 211 developers to one product, one focused thing such as MONAI. And these are really the best and brightest around the world. These businesses realize that participating in the communities, rather than competing with them, is in their best interest. It's really in everyone's best interest!
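For readers curious about the "two simple configuration files" Stephen mentions, here is a hedged Python sketch of what they might contain for Auto3DSeg. The field names and the AutoRunner entry point are assumptions recalled from the MONAI tutorials and may not match the current API exactly; check monai.io before relying on them.

```python
import json
import yaml

# Hedged sketch of the two files Stephen describes for Auto3DSeg. The exact keys
# ("modality", "datalist", "dataroot", ...) and the AutoRunner entry point are
# assumptions based on the MONAI tutorials; verify against monai.io.

datalist = {
    "training": [
        {"image": "imagesTr/case_001.nii.gz", "label": "labelsTr/case_001.nii.gz"},
        {"image": "imagesTr/case_002.nii.gz", "label": "labelsTr/case_002.nii.gz"},
    ],
    "testing": [{"image": "imagesTs/case_101.nii.gz"}],
}
with open("datalist.json", "w") as f:
    json.dump(datalist, f, indent=2)          # file 1: where all your data is located

task = {
    "name": "spleen_segmentation",            # high-level description of the problem
    "task": "segmentation",                   # e.g. segmentation vs classification
    "modality": "CT",                         # e.g. CT vs MR
    "datalist": "datalist.json",
    "dataroot": "/data/spleen",
}
with open("task.yaml", "w") as f:
    yaml.safe_dump(task, f)                   # file 2: 5-10 lines describing the task

# Assumed entry point (commented out so the sketch runs without MONAI installed):
# from monai.apps.auto3dseg import AutoRunner
# AutoRunner(work_dir="./auto3dseg_work", input="task.yaml").run()
```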

Congrats, Doctor Esther!

Esther Smeets completed her PhD at Radboud University in Nijmegen, The Netherlands. She focused on building bridges between biology, medical imaging analyses and AI in pancreatic cancer. Congrats, doctor Esther!

Pancreatic ductal adenocarcinoma (PDAC) is one of the deadliest types of cancer and its biology is still poorly understood. The primary goal of Esther's thesis was to predict biological subtypes of PDAC by using machine learning and computer vision approaches to analyse PET scans and digital pathology images of immunohistochemically stained slides. The secondary goal was to delve into innovative treatments for PDAC patients.

Building PET-radiomic signatures from a biological rationale
Differences in energy consumption in cancer correlate with poor prognosis and drive therapy resistance. [18F]FDG-PET visualises glucose consumption in cancer lesions, and Esther used these PET scans to investigate different glucose consumption subtypes in PDAC patients. Over the last decade, a rapidly growing body of literature has reported a prognostic or predictive value of "radiomics" features extracted from PET scans. However, the lack of a clear biological underpinning of these features has been regarded as an obstacle to clinical translation. To address this, Esther developed an image analysis pipeline that builds PET-radiomic signatures from biological features derived from immunohistochemistry (see figure). She stained histology slides of PDAC tissue with an MCT4 marker, a molecule involved in glucose consumption, and used these slides to visualize the distribution of MCT4. After scanning the slides, Esther extracted the intensity signal of MCT4-positive staining and investigated the distribution of this molecule across the whole slide via density maps. She then mirrored her PET-based image analysis on the pathology data, extracting "pathomics" features using the same texture descriptors.
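As a loose illustration of "extracting pathomics features using the same texture descriptors", the short Python sketch below computes a few grey-level co-occurrence (GLCM) statistics with scikit-image on a stand-in PET slice and a stand-in MCT4 density map. The descriptor choice, quantization and parameters are illustrative assumptions (scikit-image >= 0.19 naming), not Esther's actual pipeline.

```python
import numpy as np
from skimage.feature import graycomatrix, graycoprops

# Toy "radiomics/pathomics" texture features: grey-level co-occurrence matrix
# (GLCM) statistics computed identically on a PET slice and on a density map of
# MCT4-positive staining. Inputs here are random arrays standing in for real data.

def glcm_features(image_2d, levels=32):
    """Quantize to `levels` grey levels and return a few standard GLCM features."""
    q = np.digitize(image_2d, np.linspace(image_2d.min(), image_2d.max(), levels)) - 1
    q = q.astype(np.uint8)
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    return {prop: float(graycoprops(glcm, prop).mean())
            for prop in ("contrast", "homogeneity", "correlation", "energy")}

pet_slice = np.random.rand(128, 128)            # stand-in for a PET tumour slice
mct4_density = np.random.rand(128, 128)         # stand-in for an MCT4 density map
print("radiomics:", glcm_features(pet_slice))
print("pathomics:", glcm_features(mct4_density))
```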
