CVPR Daily – Sunday, Nashville
Computer Vision and Pattern Recognition 2025
Meet the scientist behind the science!
Max's Picks of the Day for today, Sunday 15:

Orals:
[5B-4] TopoCellGen: Generating Histopathology Cell Topology with a Diffusion Model
[6C-1] Seeing Far and Clearly: Mitigating Hallucinations in MLLMs with Attention Causal Decoding

Posters:
[5-168] Align3R: Aligned Monocular Depth Estimation for Dynamic Videos [Highlight]
[6-370] ImagineFSL: Self-Supervised Pretraining Matters on Imagined Base Set for VLM-based Few-shot Learning
[6-250] Are Images Indistinguishable to Humans also Indistinguishable to Classifiers?

“My research centers on enhancing the trustworthiness of AI models in medicine, particularly as they are deployed in real-world clinical settings where they face domain shifts, such as variations in imaging devices or acquisition protocols. To address this, I focus on improving out-of-distribution detection and domain generalization, with the aim of strengthening the resilience, reliability, and robustness of medical AI systems across diverse environments.”

Max Gutbrod recently started his PhD at the newly established doctoral center at OTH Regensburg, supervised by Christoph Palm. Max forgot to tell you that he's also presenting his poster today during morning Poster Session 5: OpenMIBOOD: Open Medical Imaging Benchmarks for Out-Of-Distribution Detection [Poster 465].
Editorial

CVPR Daily
Publisher: RSIP Vision
Copyright: RSIP Vision
Editor: Ralph Anzarouth
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, CVPR and the conference organizers.

Good Morning Nashville!

We've come to the last day of CVPR 2025. It has been an immense pleasure to prepare the CVPR Daily magazine for you for the 10th consecutive year. I am immensely grateful to the CVPR community and to the organizers for this privilege! Let's stay in touch after CVPR too! Subscribe for free to Computer Vision News here! Enjoy reading this CVPR Daily, and see you soon at WACV 2026 in Tucson, Arizona!

Ralph Anzarouth
Editor, Computer Vision News

Ralph's photo above was taken in peaceful, lovely and brave Odessa, Ukraine.
Oral & Award Candidate

Zero-Shot Monocular Scene Flow Estimation in the Wild

Yiqing Liang is a PhD student in Computer Science at Brown University. Her recent paper on scene flow estimation, developed during a summer internship at NVIDIA Research, has been accepted for an oral presentation at CVPR 2025 and nominated for a coveted Best Paper award. Ahead of her presentation this morning, Yiqing tells us more about her fascinating work.

In this paper, Yiqing introduces a novel generalizable foundation model for estimating both geometry and motion in dynamic scenes using just a pair of image frames as input. This task, known as scene flow estimation, has long been a challenge in computer vision and is crucial for applications such as robotics, augmented reality, and autonomous driving, where understanding 3D motion is essential.

Yiqing likens the task to a first-person video game: “Your head is always in the center,” she explains. “You see the wall move, people walk around, and objects change shape. Our model can predict the geometry and motion of all of it!”

The timing of this work was key. Monocular scene flow was proposed about five years ago, but it hit a wall: there was not enough compute, data, or pretrained weights to make it work. Now, that has all changed. “We benefited from advancements in 3D over the last year,” she reveals. “People found that if you scale up training for 3D geometry prediction, you can get feed-forward methods that predict 3D geometry from 2D information. We go one step further than that, and ask: can we also add motion?”

The answer, it turns out, is yes, but the biggest challenge was not the model architecture – it was the data. “I'm probably giving an answer that my fellow researchers working in the field are very familiar with!” she laughs. “The coding part is fairly easy, but having enough data to formulate the problem properly takes more time!”

Yiqing eventually curated a massive dataset of over 1 million annotated training samples, spanning indoor scenes, outdoor scenarios with vehicles and pedestrians, animated films, and even simulated environments with chaotic motion. Much of the data was derived from existing RGB-D video datasets, which combine color, depth, and camera parameters. By carefully converting and filtering them to reduce edge noise, she was able to reconstruct scene flow annotations at scale.
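To make that curation step concrete, here is a minimal sketch of how scene flow pseudo-annotations can be derived from an RGB-D video with known intrinsics. It is not Yiqing's actual pipeline: we assume dense 2D correspondences (e.g., from an off-the-shelf optical flow model), a known relative camera pose, and an arbitrary depth-gradient threshold for the edge-noise filtering; all names are illustrative.

```python
import numpy as np

def backproject(depth, K):
    """Lift a depth map (H, W) to camera-frame 3D points (H, W, 3)."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

def scene_flow_from_rgbd(depth0, depth1, flow01, K, T_1to0):
    """Per-pixel 3D displacement between two RGB-D frames, in frame 0's camera.

    flow01: (H, W, 2) dense 2D correspondences from frame 0 to frame 1.
    T_1to0: (4, 4) relative pose mapping frame-1 camera coordinates to frame 0.
    """
    H, W = depth0.shape
    p0 = backproject(depth0, K)
    # Follow each pixel of frame 0 to its (rounded) location in frame 1.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    u1 = np.clip(np.round(u + flow01[..., 0]).astype(int), 0, W - 1)
    v1 = np.clip(np.round(v + flow01[..., 1]).astype(int), 0, H - 1)
    p1 = backproject(depth1, K)[v1, u1]
    # Express frame-1 points in frame 0's coordinates, then take the difference.
    p1 = p1 @ T_1to0[:3, :3].T + T_1to0[:3, 3]
    flow3d = p1 - p0
    # Filter edge noise: discard pixels near strong depth discontinuities
    # (the 0.1 threshold is an arbitrary illustration).
    gy, gx = np.gradient(depth0)
    valid = (depth0 > 0) & (np.hypot(gx, gy) < 0.1)
    return flow3d, valid
```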
Even with an innovative new model in hand and a Best Paper nomination on the table, Yiqing remains grounded. “All of us are very honored to be an award candidate, but I think it's not only because of the quality of the work, but it's also because of luck,” she says modestly. “There are many, many works that are very, very high quality.” What made this one stand out, she suggests, is its perspective: “We're looking at a classic problem through a modern lens. I've worked on popular methods like NeRF and Gaussian splatting before, and the main bottleneck is the inference-time learning – you wait minutes, hours, even days. Classical methods don't have that problem. Now, classical methods are generalizable, so we try to marry the two trends together to create a new possibility.”

Looking ahead, Yiqing sees several promising directions for future work, which she hopes the community will take forward. First, there is the potential to scale the dataset even further, not just in terms of size, but also in terms of diversity. Incorporating noisy real-world data would be particularly valuable. “One million sounds big,” she remarks, “but it's still small compared to what's used for diffusion models.” Next is extending the model's capabilities beyond geometry and scene flow. “We're interested in predicting other modalities, like camera motion, to decompose scene motion into different fractions for more applications,” she tells us. The method could also be extended to long-term tracking. “Right now, we work with image pairs, but what if we had more pairs? What if we had a longer time horizon between the pairs?” She is also excited about potential applications in robotics: “People have been trying to use particle systems in robotics because they found
that it's a more abstract version of the information compared to the raw camera. For example, RGB-D point clouds. It's possible to abstract our output like that.”

Looking at the bigger picture, Yiqing is curious about how this research could intersect with multimodal large language models. “For LLMs, the multimodal side still has a lot to explore,” she points out. “People are interested in how to encode visual information more efficiently, and how to let it interact more with textual information.”

More than anything, what excites Yiqing most is this model's generalizability. “It's really cool how general it is!” she says with a smile. “We've tested it on out-of-domain datasets – real-world, high-motion scenes – and it still works!”

To learn more about Yiqing's work, visit Oral Session 5C: Visual and Spatial Computing (Davidson Ballroom) this morning from 09:00 to 10:15 [Oral 4] and Poster Session 5 (ExHall D) from 10:30 to 12:30 [Poster 165].
Workshop

Event-based Vision Workshop at CVPR 2025: A Growing Hub for Neuromorphic Innovation

by Cornelia Fermuller and Guillermo Gallego

The Fifth Workshop on Event-based Vision, held on June 12, 2025, at CVPR, reaffirmed its role as a central forum for the rapidly expanding community working at the intersection of sensing hardware, computer vision, and intelligent systems. The workshop has become a cornerstone event for researchers advancing event-based and neuromorphic vision, a field that continues to gain momentum. This year's program featured an open call for papers, live demonstrations, poster presentations, and multiple international competitions designed to benchmark progress and promote collaboration.
Academic highlights included Davide Scaramuzza (University of Zurich) presenting new developments in structure-from-motion and SLAM with event sensors in robotics; Christopher Metzler (University of Maryland) showcasing innovative uses of event sensors in computational photography; and Priyadarshini Panda (Yale University) discussing the integration of event data with spiking neural networks for efficient hardware and software systems.

Industry contributions stood out as particularly impactful. Kynan Eng of SynSense (Switzerland) outlined current barriers to mainstream adoption of event-based vision and emphasized its strong potential in 3D motion applications for robotics. Davide Migliore, representing the newly launched event-vision startup Tempo Sense, engaged the audience in an interactive discussion using live polling to gather perspectives on current challenges, key milestones, and promising directions for future research and applications.
Congrats, Doctor Joshua!

Joshua Durso-Finley defended his PhD thesis in March. He worked as part of the Probabilistic Vision Group (PVG) at McGill University in Montreal. There, he developed methods for estimating treatment response and finding subgroups of responders for patients with multiple sclerosis using multimodal MRI. Under the supervision of Tal Arbel (McGill University and Mila) and with clinical collaborator D.L. Arnold, he developed and validated a treatment effect model that was able to find subgroups of responders to treatment, even for treatments that did not have a significant effect at the group level when all patients are considered. After completing his PhD, Joshua began working as a machine learning engineer at Google.

Demand for precision medicine has increased as access to data and computational power has become dramatically more widely available. The influx of information and methodological improvements has ignited the desire for practical precision medicine models that can be used in the clinic to improve patient quality of life. Currently, estimates of personalized treatment effect are based on low-dimensional features, e.g., age. Building the next stage of precision medicine requires using a patient's unique high-dimensional information, such as magnetic resonance imaging sequences, X-ray images, ultrasound images, or genetic data. The PVG has been researching the use of multi-modal MRI to improve treatment assignment for patients with multiple sclerosis (MS). MS affects millions of people worldwide and is characterized by the appearance of lesions in the brain and spinal cord. The size and number of these
lesions vary from person to person, so that each person's experience with MS is unique. A doctor choosing the right treatment to mitigate MS symptoms would account for a patient's current disease state, the potential for disease worsening, a drug's side effects, and the drug's patient-specific efficacy. Making a personalized treatment decision (which may also change from visit to visit) requires an inordinate amount of work. However, an AI model that has learned outcomes for different treatments from the patient's unique MRI data across the entirety of the disease could recommend the optimal treatment instantly.

In his research, Joshua combined causal machine learning estimators of treatment effect with modern probabilistic deep learning methods to improve the treatment recommendation capabilities of AI models. Using data harmonized from many clinical trials, the designed models accurately predict distributions of outcomes for all potential treatments. The probabilistic aspect of the model accounts for the natural variance in the disease, allowing clinicians to trust the model. In sum, his work demonstrates the importance of using unique patient features for treatment recommendation and provides guidelines for bringing these models to the clinic. This work was funded by a Mila-MSR grant in collaboration with Nick Pawlowski, MSR Cambridge.

A patient's brain evolution over time, with lesions highlighted in red.
A high-level overview of the model. A patient's unique data and low-dimensional patient descriptors are used to produce estimates for treatment outcomes and the corresponding treatment effects.
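As a concrete illustration of the kind of model described above – a shared encoder with one probabilistic outcome head per treatment arm – here is a minimal sketch. It is not Joshua's actual implementation: the architecture, names, and the assumption that patient MRI data has already been encoded into a feature vector are all ours.

```python
import torch
import torch.nn as nn

class MultiTreatmentOutcomeModel(nn.Module):
    """Shared encoder with one probabilistic head per treatment arm.

    Each head predicts a Gaussian over a future outcome, so the model
    returns a distribution per treatment rather than a point value.
    """

    def __init__(self, in_dim, n_treatments, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One (mean, log-variance) head per treatment.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden, 2) for _ in range(n_treatments)]
        )

    def forward(self, x):
        z = self.encoder(x)
        outs = [head(z) for head in self.heads]                  # each (B, 2)
        means = torch.stack([o[..., 0] for o in outs], dim=-1)   # (B, T)
        log_vars = torch.stack([o[..., 1] for o in outs], dim=-1)
        return means, log_vars

# A point estimate of the individual treatment effect of arm k versus
# a reference arm 0 is then the difference of predicted means:
#   means, log_vars = model(patient_features)
#   ite_k = means[:, k] - means[:, 0]
```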
Women in Computer Vision

Read 160 FASCINATING interviews with Women in Computer Vision!

Shalini DeMello is Director of Research for AI-Mediated Reality and Interaction at NVIDIA.

Tell us about your work.
I'm part of the larger NVIDIA research group, and our mission is to conduct fundamental research in AI, specifically at the intersection of AI and graphics. We do fundamental research which can then have a business impact by creating innovative technology that can be incorporated into NVIDIA's products. Within the larger NVIDIA research ecosystem, the mission of my team is to reimagine, using AI, how humans will interact in the future, whether with other humans or with machines – be those other AIs, robots, and so on: both human-to-human interaction and human-to-computer interaction, reimagined with AI. That is the charter of my team.

How large is the team?
I have seven researchers reporting to me, and I have two more joining this year. Including me, that's ten of us.

Did you already find the two new lucky guys?
Yes, yes.

Software is just as important as hardware for NVIDIA. Is that right?
Oh, absolutely. Software, and algorithms in particular. And that's exactly the charter of my team, which is to design new algorithms.

How did you find yourself there?
That's an interesting question. During my PhD, I worked on 3D face recognition. This was back in 2008. Computer vision was always what I was interested in. My first job after I finished my PhD was more as an imaging scientist, where I was helping to design algorithms for image signal processing pipelines. But honestly, I always loved computer vision. That's what I enjoyed. I worked at Texas Instruments for the first two years after I finished my PhD as an imaging scientist. And then I had an opportunity to get back into computer vision, joining NVIDIA in 2011 in a computer vision role. And I've never looked back.

All this in America, right?
Yes, this was all in America. I want to add a very unique and interesting point to the story: Texas Instruments had this digital signal processing chip, the DSP, at the time. And they had created a computer vision library that ran efficiently on the DSP. And it was a very sad situation, because the DSP had such low compute that you had to strip out most of the goodies in your computer vision algorithms to even make them work at all on those processors. The reason I joined NVIDIA in 2011 was because, as a computer vision researcher, I wanted to work with a company that had a lot of compute. I didn't know that GPUs would become so important, but I just knew that I didn't want my hands to be tied with compute wherever I went. And that was my biggest motivation for joining NVIDIA: NVIDIA has GPUs with a lot of power and a lot of compute, and that would give me the freedom to innovate in the algorithmic space and be able to run hard, compute-heavy algorithms. That was my motivation. And I guess the bet worked.

So today I learned that for Texas Instruments, too, software was as important as hardware.
Right. Because at the end of the day, what is the hardware for? For running algorithms and software on it, right? It's a perfect marriage. So the way we
look at it at NVIDIA, which is also a hardware company, is that we have the providers and the receivers. The providers are the hardware teams and the receivers are the software and algorithm teams.

That's funny. I grew up in the generation for which Texas Instruments was the synonym of compute. So it's funny to hear from you that you left the company because of compute. Did you leave academia with no regrets?
Oh, I have absolutely no regrets about leaving academia. And I think I made absolutely the right choice for myself. I like to build things. I've always liked to see my innovations in real, tangible form and in people's hands. That's the biggest joy. I really don't like to have just theoretical papers: I like to see my ideas in action. I remember the time when I created an algorithm for face detection and it was incorporated into an NVIDIA tablet. And my little four-year-old daughter picked it up, and her face was detected, and I could tell her mommy did this.

You have no regrets about Texas Instruments and no regrets about academia. No regrets about India too?
Not really.

So you are 100 percent looking forward.
I have lived more of my life in the U.S. at this point than I've lived in India. All phases helped me to get to where I am. I think they all enriched me with experiences and helped me learn better what was the right thing for me. I think they only add to my experience.

In a career there are ups and downs. How do you manage the downs?
I think everything, even the not-so-positive experiences, are learning experiences, and they're opportunities for growth. Obviously, there is a phase of going through the pain and the shock. But after you've processed that, I think the best way to look at them is to reflect on how
it made you feel and why the negative experience made you feel the way that it did. And then, is there anything that can be done differently for the future? I think that's the way I look at it. And also interesting: is that situation in your circle of control? There can be some negative situations that you simply cannot influence. But there can be other situations that you can influence, maybe by changing your behavior or some part of you. And then you just need to have an honest conversation and be like, well, that's the reality. And this is what I learned. And this is how I will modify at least my behavior. Either you can influence the situation, or, if you cannot, you can influence the way you react to it. Either way, there is growth.

Tell me about the people that you manage. Are you sometimes surprised, wondering whether you would have been able to do what they do in their place?
Yeah, definitely. Many, many times I feel so enriched by my team. And, you know, they're very, very skilled at what they do and they're very fast. There was one moment when we were doing a demo for GTC. GTC is NVIDIA's flagship conference, the GPU Technology Conference. And a lot of folks in my team come from a graphics background. And people in the graphics world are very good at system-level things, like building demos of their research ideas. And there was one researcher in my team who basically built the whole thing behind the scenes from scratch to finish and was much more efficient than ten engineers doing it. I was completely blown away by how much work had gone on behind the scenes. And there was just this one person doing everything.

Shalini, we spoke a lot about the past and somehow also about the present. Where do we go from here?
I'll probably frame this question in the context of AI, maybe in general computer vision and where the field is going. It's a very interesting tipping point we're at, where I feel like it's very exciting in some sense that AI has gotten to the point that it's mainstream in people's world and psyche now. I think for a really long time, AI researchers like us
were sort of on the fringes of engineering. But now we've really had a societal impact in a very profound way. It's very exciting to be part of this and to watch how it's unfolding. Where do we go from here? It's unclear to me, but some things that I think are pretty much imminent on the horizon in terms of AI are 3D world models. I think we are on the cusp of creating hyper-realistic 3D world models. We can create very hyper-realistic videos as of today. The possibilities of what we can do with those things are very profound. The only simulation that we could do so far was in computer graphics, but a lot of the primitives in computer graphics were hand-designed and heuristic. With AI, we can learn much more nuanced heuristics and motion and things like that. And once we can model the 3D world and how things actually move in it really realistically, I think it opens up a lot of possibilities for automation in streams beyond just chatbots and LLMs. It opens up possibilities in robotics. It opens up possibilities in all other kinds of autonomous systems. That seems to be what is probably on the horizon next. And I think a lot of knowledge is likely to be very much tokenized. It's already starting: we don't search web pages anymore. We just ask LLMs for an answer, and they answer in tokens. I think a lot of the world's knowledge is going to be compressed in AI models and tokenized in some sense. Yeah, that's my two cents on where we're going.

Will we get over the token system to something more real than just predicting the next word?
Yeah, that's a great question. Tokens are great, but they're still not the most efficient in terms of data storage and memory and so on. I think we will have to, eventually. And if we want to have longer context and longer memories, which is a fundamental limitation of the token architecture, the transformer architecture, we will have to. And I think that, combined with the reasoning work that's happening right now – you know, better memory and better reasoning – those are probably the ways to go towards the future.

Don't you think that we are going to leave a few billion people behind?
I don't think so! I think human beings will adapt. I think it's a point in time, in history, where there's a lot of change happening. But if you look back, there was the Industrial Revolution. People were doing a lot of manual work before that. When machines came, they switched to doing different things. And I think of AI as just a different kind of machine. I think we will adapt. Our children will adapt. I still don't think AI will rule in any way. We create AI as humans, and I think we will just use it as another tool. And the way we go about life will just change.

Since this interview will be published during CVPR, what is your final word for the community?
My two cents would be: just enjoy what you're doing! Be passionate about knowledge, about learning. Be curious! Don't get overwhelmed by the plethora of information. I think the most important nuggets and the most important breakthroughs are few, and the rest of it is mostly incremental stuff. It's more important to know the most important breakthroughs. Really, we are at a point in time and history where we have such an opportunity to make a change. We should all be really, really excited about what we have contributed to the world, and what opportunity we have to contribute further!
Highlight Presentation

Helvipad: A Real-World Dataset for Omnidirectional Stereo Depth Estimation

Mehdi Zayene is now a senior data scientist at Effixis, a consulting company in Switzerland. He is also the first author of a paper that was accepted as a poster and selected as a highlight at CVPR 2025.

This project started more than three years ago, when Mehdi was still in his bachelor's studies at EPFL, the École Polytechnique Fédérale de Lausanne in Switzerland. There, a professor – Alexandre Alahi – had the idea of implementing a new method for depth estimation with stereo 360-degree cameras, something that didn't really exist at the time. 360-degree cameras provide a complete field of view and rich geometric information, yet omnidirectional imaging remains underexplored due to the lack of real-world datasets. The biggest challenge was not building the deep learning model itself, but collecting the dataset. The team had to build a physical rig – a robot with two cameras and a LiDAR – that they had to synchronize in time.
Another challenge was the calibration: “For each data point from the LiDAR, we need to know which pixel it corresponds to,” Mehdi explains. “We had to come up with our own algorithm, though inspired by the community, where we had to transform the LiDAR points to camera coordinates first. Second, we had to convert to spherical coordinates, then project to equirectangular coordinates for the cameras. And then we still had so many errors that we minimized the projection error using BFGS optimization.”
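To make the pipeline Mehdi describes concrete, here is a minimal sketch of a LiDAR-to-equirectangular calibration with extrinsics refined by BFGS. It is not the Helvipad authors' actual code: the function names, the Euler-angle parameterization, and the image dimensions are all illustrative assumptions, and we use SciPy's generic BFGS optimizer.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.transform import Rotation

def lidar_to_equirect(points_lidar, R, t, width, height):
    """Project LiDAR points (N, 3) into equirectangular pixel coordinates."""
    # Step 1: rigid transform from the LiDAR frame to the camera frame.
    pts = points_lidar @ R.T + t
    # Step 2: Cartesian -> spherical (azimuth and elevation angles).
    azimuth = np.arctan2(pts[:, 0], pts[:, 2])                      # [-pi, pi]
    elevation = np.arcsin(pts[:, 1] / np.linalg.norm(pts, axis=1))  # [-pi/2, pi/2]
    # Step 3: spherical -> equirectangular pixel grid.
    u = (azimuth / (2 * np.pi) + 0.5) * width
    v = (elevation / np.pi + 0.5) * height
    return np.stack([u, v], axis=1)

def reprojection_cost(params, points_lidar, observed_uv, width, height):
    """Mean squared pixel error of the projection for given extrinsics.

    params: 3 Euler angles (radians) followed by a 3-vector translation.
    """
    R = Rotation.from_euler("xyz", params[:3]).as_matrix()
    projected = lidar_to_equirect(points_lidar, R, params[3:], width, height)
    return np.mean(np.sum((projected - observed_uv) ** 2, axis=1))

# Refine an initial extrinsics guess against matched LiDAR point / pixel
# pairs (points_lidar, observed_uv), e.g. for a 1920x960 panorama:
#   result = minimize(reprojection_cost, x0=np.zeros(6),
#                     args=(points_lidar, observed_uv, 1920, 960),
#                     method="BFGS")
```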
“I absolutely didn't know a thing when I first started,” Mehdi admits candidly. “When I first started, I hadn't even coded a single line of Python. I learned absolutely everything on the fly, from coding to physics to robotics and AI and deep learning.”

When they started, the project was naturally aimed at autonomous driving, the main domain of expertise of their lab. “Thanks to this kind of research, you can now actually detect depth with a single camera, 360 degrees,” Mehdi reveals. “That would reduce costs by a lot!”

To learn more about this paper, visit Poster Session 6 (ExHall D) today from 16:00 to 18:00 [Poster 79]. Congratulate Mehdi on his work being selected as a highlight paper.
MASt3R-SLAM (full-page photo)
Posters

From Switzerland to Tennessee, Pierre Vuillecard shared his latest research on robust 3D gaze estimation. A PhD student at EPFL and Idiap, Pierre attracted a curious crowd eager to learn about a novel approach that generalizes well across both images and videos. It is a key step toward building gaze-aware systems that actually work outside the lab!

Nicole Damblon began her PhD at ETH Zurich two weeks ago under the joint supervision of Marc Pollefeys and Davide Scaramuzza. She presented today – together with Dániel Baráth – the results of her semester project at ETH on improving global Structure-from-Motion by learning to discard erroneous edges from the pose graph used for reconstruction.
Full house for Angjoo Kanazawa…

Iro Armeni told us: “SLAM struggling in dynamic environments? We've been there – that's why we built WildGS-SLAM. After chatting with attendees about using it on construction sites, I'm seriously tempted to grab a hard hat and put it to the test.” First authors of that paper are Jianhao Zheng and Zihan Zhu.
Pablo Ruiz Ponce had the chance to present his work MixerMDM: Learnable Composition of Human Motion Diffusion Models to Michael Black. Pablo told us: “Michael is a researcher whose work I've long admired and learned from. It was both exciting and overwhelming to see him engage with my poster and ask questions. Thankfully, he was incredibly approachable and easy to talk to, so my initial nervousness quickly disappeared. Michael was particularly interested in our approach to synthetically increasing the diversity of generated human interactions, and we had an insightful discussion about the challenges of modeling realistic contacts between humans and the limitations of the data currently available.”
Double-DIP

Don't miss the BEST OF CVPR 2025 in Computer Vision News of July. Subscribe for free and get it in your mailbox! Click here
UKRAINE CORNER

Russian Invasion of Ukraine
CVPR condemns in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine and engaging in war against the Ukrainian people. We express our solidarity and support for the people of Ukraine and for all those who have been adversely affected by this war.

Orest Kupyn is a third-year PhD student in the Visual Geometry Group (VGG) at the University of Oxford.
At the Mecharithm Lab

On my way to Nashville and CVPR, I paid a long-overdue visit to the awesome Madi Babaiasl at her robotics lab at Saint Louis University. Here you can see me with different input modalities, like an EEG headset and eye-tracking glasses, whose signals can be translated into robotic actions.