ICCV Daily 2025 - Tuesday

Inside: Exclusive reviews of two Best Paper candidates!

Xin’s picks of the day (Tuesday)

Xin Qiao is an assistant professor at Xi’an Jiaotong University, China. His research focuses on 3D imaging, computational imaging, and deep learning, with particular interest in making depth sensing more accurate and interpretable for real-world applications such as under-display cameras and multimodal perception in complex environments.

Oral:
2B-3 Knowledge Distillation for Learned Image Compression

Posters:
2-53 PHATNet: A Physics-guided Haze Transfer Network for Domain-adaptive …
2-94 FiffDepth: Feed-forward Transformation of Diffusion-Based Generators …
2-101 DEPTHOR: Depth Enhancement from a Practical Light-Weight dToF Sensor …
2-112 CO2-Net: A Physics-Informed Spatio-Temporal Model for Global Surface …

“Hi! My work at ICCV 2025, ‘Learnable Fractional Reaction-Diffusion Dynamics for Under-Display ToF Imaging and Beyond,’ presents a hybrid framework that integrates physical modeling with deep learning. We encode a time-fractional reaction-diffusion equation into the network to capture long-term dynamics and introduce an efficient continuous convolution operator for interpretable depth restoration. The framework achieves state-of-the-art performance on multiple ToF and RGB-D benchmarks. Unfortunately, I’m unable to attend ICCV 2025 due to visa issues, but I’m thrilled that our work will still be presented today [Poster 2-99]. Please stop by to learn more about fractional dynamics and physics-guided depth restoration, and feel free to drop me an email if you’d like to discuss. I wish everyone an inspiring and sunny week in Hawaii!”
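For readers who want the math behind Xin’s summary: a generic time-fractional reaction-diffusion equation (shown here in a textbook form with a Caputo derivative of order α; the paper’s exact formulation may differ) reads:

```latex
% Generic time-fractional reaction-diffusion equation (illustrative textbook
% form; the paper's exact formulation may differ).
% u: the evolving field, D: diffusion coefficient, R: reaction term.
\[
  \frac{\partial^{\alpha} u}{\partial t^{\alpha}} = D\,\Delta u + R(u),
  \qquad 0 < \alpha < 1,
\]
% where the Caputo time-fractional derivative weights the whole history of u,
% which is what gives the dynamics their long-term memory:
\[
  \frac{\partial^{\alpha} u}{\partial t^{\alpha}}(t)
  = \frac{1}{\Gamma(1-\alpha)} \int_{0}^{t} (t-s)^{-\alpha}\,
    \frac{\partial u}{\partial s}(s)\, ds .
\]
```

Because the kernel (t − s)^(−α) never fully vanishes, every past state influences the present one – the “long-term dynamics” the summary refers to.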

Editorial

Aloha ICCV! Chee-hoo! Welcome to Honolulu! We have reviewed for you some of the awesomest papers presented today at ICCV 2025, including two candidates for the Best Paper award. And choke of other great stuff for you to enjoy. I asked General Chair Hilde Kühne about her thoughts, as ICCV’s Main Program kicks off:

“One Hawaiian proverb says, ‘ʻAʻohe hana nui ke alu ʻia’ – No task is too big when done together. I can’t think of a better way to capture the essence of research, because at its heart, it’s about a community of people uniting to take on great challenges. My hope is that ICCV 2025 becomes a place where people connect, inspire one another, and strengthen our shared effort to tackle the biggest challenges of our time.”

Turn the page and have a great Hawaiian day!

Ralph Anzarouth
Editor, Computer Vision News

Ralph’s photo above was taken in peaceful, lovely and brave Odessa, Ukraine.

ICCV Daily Editor: Ralph Anzarouth
Publisher & Copyright: Computer Vision News
All rights reserved. Unauthorized reproduction is strictly forbidden. Our editorial choices are fully independent from IEEE, ICCV and the conference organizers.

Back on Track: Bundle Adjustment for Dynamic Scene Reconstruction

Oral & Award Candidate

Weirong Chen is a second-year PhD student at the Technical University of Munich, supervised by Daniel Cremers and co-supervised by Andrea Vedaldi from the University of Oxford. His paper uses modern learning-based techniques to address the problem of dynamic scene reconstruction from videos. In addition to being accepted for a coveted oral slot at ICCV 2025, it has been shortlisted as a candidate for a Best Paper Award. Ahead of his oral and poster presentations today, Weirong tells us more about his work.

The challenge at the center of Weirong’s research is that real-world scenes rarely stand still. Traditional structure-from-motion and SLAM systems, which rely on bundle adjustment, assume a static world, and the classical epipolar geometry that underpins them depends on scene points remaining fixed. Once humans, cars, or animals enter the frame, the mathematics breaks down.

“Originally, people were more focused on static scenes,” Weirong says. “There are very well-established methods for bundle adjustment that try to estimate the camera pose and recover 3D geometry from multi-view images or RGB video. However, the world we’re living in today is a 4D dynamic world.”

The difficulty comes from the coupling of two distinct types of motion: one caused by the camera’s movement, and the other by the motion of objects within the scene. In casual handheld video, both happen simultaneously. Researchers have previously attempted to sidestep the issue by either masking out dynamic regions or modeling moving objects separately, but both approaches have limitations. “The challenge is that there’s no direct way to constrain how the dynamic object moves in 3D easily,” he points out. “Therefore, it’s hard to reconstruct them!”

His insight was to separate – or decouple – the two motions from the 2D perspective. The resulting framework, BA-Track, introduces the paper’s key contribution: a motion-decoupled point tracker that solves the correspondence problem in dynamic video. The model relies on a dual-network design. “We use learning-based techniques – a transformer-based network, with one part predicting the total motion and the other predicting the dynamic parts,” Weirong explains. “When we combine them, we use total motion minus the dynamic offsets, which gives us the camera-induced component. That’s exactly what we need for the bundle adjustment.”
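To make the decoupling concrete, here is a minimal sketch in Python; the shapes and function names are our own illustration, not the authors’ code:

```python
import numpy as np

def decouple_tracks(total_motion: np.ndarray, dynamic_offset: np.ndarray) -> np.ndarray:
    """Recover camera-induced 2D tracks from a tracker's two predictions.

    total_motion:   (N, T, 2) observed trajectories of N points over T frames
    dynamic_offset: (N, T, 2) predicted displacement caused by object motion

    The difference behaves as if every point were static ('pseudo-static'),
    so a classic bundle-adjustment solver can consume it directly.
    """
    return total_motion - dynamic_offset

# Toy usage: 100 tracked points over 8 frames.
total = np.random.randn(100, 8, 2)
dynamic = np.random.randn(100, 8, 2)
pseudo_static = decouple_tracks(total, dynamic)  # input for bundle adjustment
```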

Previous methods primarily focused on point tracking for total motion, but the use of an additional network enables motion decoupling. The model is trained on synthetic datasets of dynamic scenes using a deep-learning approach. By decoupling these signals, points on moving objects effectively become ‘pseudo-static’, allowing the classic bundle-adjustment framework to solve the camera pose and geometry.

For Weirong, the appeal of this work lies in its everyday potential. “Imagine you have an iPhone and you want to shoot some daily activities for your family – playing tennis or going hiking,” he suggests. “Our method can take any casually shot video and recover the dynamic scenes.”

Despite its success, Weirong points out that BA-Track is not a finished solution. Motion decoupling depends on the quality of the learned point tracker, which in turn relies on the volume and quality of the synthetic training data. Resource constraints mean that the model has not yet been trained on the full diversity of real-world scenes.

Nevertheless, the system’s ability to demonstrate dynamic bundle adjustment working at all marks a milestone. This breakthrough might go some way toward explaining why Back on Track is a Best Paper award candidate this year. When Weirong first heard the news, he was surprised and honored: “Personally, this was far beyond my expectations when I wrote the paper!” The recognition, he believes, reflects the enduring value of classical computer-vision ideas when paired with modern learning techniques. “Luckily, we found this point tracker through motion decoupling to bridge from bundle adjustment to the dynamic scenes,” he adds. “I just hope that it can bring some new insights to the community.”

To learn more about Weirong’s work, visit Oral Session 2A: View Synthesis and Scene (Exhibit Hall III) this afternoon from 13:30 to 14:45 [Oral 4] and Poster Session 2 (Exhibit Hall I) from 15:00 to 17:00 [Poster 75].

FixTalk: Taming Identity Leakage for High-Quality Talking Head Generation in Extreme Cases

Oral & Award Candidate

Shuai Tan is a PhD student at Shanghai Jiao Tong University, under the supervision of Ye Pan. His work, shortlisted for a Best Paper award, proposes a novel approach to generating realistic talking heads, addressing some longstanding visual flaws that affect existing methods. Shuai shares with us his thoughts behind the work and its unexpected origins.

When Shuai began exploring talking head generation – the task of synthesizing realistic facial animations from audio or video input – he was struck by two persistent problems. “Existing methods have made considerable progress,” he says, “but the issues of identity leakage and rendering artifacts persist. Therefore, this paper primarily focuses on addressing those.”

Identity leakage occurs when visual features from the driving video spill into the generated image, causing the face to lose resemblance to its intended subject. Shuai set out to understand why this happens. “We conducted two exploration experiments on existing frameworks, aiming to identify the cause behind the issue,” he explains. “Interestingly, we not only found the feature that leads to identity leakage, but also discovered that under specific conditions, taming identity leakage can actually help eliminate rendering artifacts.”

That realization became the foundation for FixTalk – a method that mitigates the negative impact of identity leakage while exploiting its positive role in fixing rendering artifacts. To achieve this, the team introduced two lightweight modules: an Enhanced Motion Indicator (EMI) and an Enhanced Detail Indicator (EDI). EMI decouples identity information from motion features to prevent identity leakage, while EDI reuses certain leaked identity information to fill in missing visual details – a combination that achieves superior performance compared to state-of-the-art methods.
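As a rough structural sketch of how two such modules could fit together (a hypothetical illustration in PyTorch; the module names follow the paper, but every layer and dimension here is our own assumption, not FixTalk’s actual architecture):

```python
import torch
import torch.nn as nn

class FixTalkSketch(nn.Module):
    """Hypothetical sketch of the EMI/EDI idea: one branch suppresses identity
    leakage in the motion features, the other recycles leaked identity
    features as detail cues. Layers and sizes are illustrative only."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.emi = nn.Linear(dim, dim)   # Enhanced Motion Indicator: identity-free motion
        self.edi = nn.Linear(dim, dim)   # Enhanced Detail Indicator: reuse leaked identity
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, driving_feat: torch.Tensor, leaked_identity_feat: torch.Tensor):
        motion = self.emi(driving_feat)           # motion with identity decoupled
        detail = self.edi(leaked_identity_feat)   # leaked identity turned into detail cues
        return self.fuse(torch.cat([motion, detail], dim=-1))

# Toy usage with random per-frame features.
model = FixTalkSketch()
fused = model(torch.randn(1, 256), torch.randn(1, 256))  # (1, 256)
```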

Talking head generation has wide applications in entertainment, gaming, filmmaking, and digital communication. But as Shuai notes, its success depends on realism: “If the identity leakage and the rendering artifacts exist in our method, the performance is very poor, so people can easily know this is a fake. We need to fix the issues – make it more real!”

It feels like a touch of serendipity that Shuai is sitting here today as a Best Paper candidate, given that FixTalk’s story began this time last year at ECCV 2024 with a question that caught him off guard. “After I finished my presentation, a reviewer asked, ‘How did you solve the identity leakage problem in this paper?’” he recalls. “But that paper didn’t focus on it, so I couldn’t give a perfect answer. I felt a little guilty about that.”

For Shuai, the question lingered, and on the plane back to China, he started thinking more about it. “Suddenly, an idea popped into my head,” he says, lighting up. “I thought, I need to figure out why people are asking about identity leakage, how it’s caused, and how to solve it!” That curiosity grew into a new research direction and, ultimately, into FixTalk. “At that conference, I couldn’t answer that person’s question,” he says, smiling. “But afterwards, I could think in depth, so I made this paper!”

When asked what he thinks made the work stand out among more than 11,000 ICCV submissions, Shuai believes it is its potential to inspire others. “I think it’s very important, for every paper, that we need to learn something from it,” he responds. “If your paper is just a normal paper and has nothing inspirational, people will not like it.” That belief – that research should inspire others – makes the recognition all the more rewarding for him. “I was very happy and surprised to have this honour because I never expected it,” he adds modestly. “But since the ICCV program committee has given me this opportunity, I’ll definitely cherish it and keep working hard in this direction.”

After nearly three years working on talking head generation, Shuai is already thinking about where the field is heading next. “More and more people have started to pay attention to this direction,” he points out. “In my opinion, the future direction can be full-body talking head generation – not only the head, but the upper body or even the full body. We can move our body according to the audio and make it more real.”

He also hopes his work encourages others to question assumptions and probe deeply into limitations. “We need to find the limitations in previous methods during research progress,” he says. “Instead of just thinking about it superficially, we should take action and conduct experiments to identify the key factors causing these limitations. Then we can think about how to solve these problems – or even use those factors to do something extra.”

As he prepares to present FixTalk at ICCV 2025, Shuai reflects on the importance of academic conferences. “They are very, very useful for all of us,” he says. We couldn’t agree more – these events give researchers and practitioners the chance to collaborate, share ideas and, as in Shuai’s case, turn a single question into the start of something new!

To learn more about Shuai’s work, visit Oral Session 1A: Multi-Modal Learning (Exhibit Hall III) today from 8:45 to 10:00 [Oral 3] and Poster Session 1 (Exhibit Hall I) from 11:30 to 13:30 [Poster 302].

PanSt3R: Multi-view Consistent Panoptic Segmentation

Poster Presentation

Lojze Zust is a research scientist at NAVER LABS Europe, working in the Geometric deep learning team with Vincent Leroy, Jérôme Revaud and Gabriela Csurka. His paper has been selected as a poster at ICCV 2025. Ahead of his poster presentation today, Lojze tells us more about his work.

In recent years, we've seen the rise of foundational 3D models, such as DUSt3R. “We don't need any assumptions,” says Lojze, “nor any prior knowledge about the images. You just take the images, you put them inside the model, you get a 3D reconstruction. This enables a bunch of tasks that were previously studied separately. This work builds on top of that, and we ask ourselves the question: can we enrich these models with semantic knowledge?”

In addition to knowing the 3D geometry of the scene, can we somehow enrich these representations with knowing: this is a table, this is a chair? The long-term goal is to be able to develop agents – robots that can move in any space without any additional training and just understand what's happening around them. This can be easily applicable in many fields, and the team wants the model to be as general as possible.

One of the main challenges the authors initially encountered is that these foundation models started out performing reconstruction in a pairwise manner: you would have two images, produce output for those two images, and then run some sort of global optimization to align the results. For 3D panoptic segmentation, this turned out to be a challenge. Then came the development of models that support multi-view predictions directly: you can take many images at the input, put them through one feed-forward pass, and immediately get the outputs. This is the MUSt3R model, a scalable multi-view version of DUSt3R, both developed by the team: “It sort of all clicked together!”

Another major challenge, maybe an even bigger one, was the data. There is a lot of 3D data used to train these foundation models, but only a small portion of it has annotations for segmentation – for instance, panoptic segmentation – so they used one data set with about 700 scenes. That is many images, but the diversity inside those images was very limited. You could imagine that there are only 700 different types of chairs, or probably even fewer, because most scenes were recorded in the same institutions or the same types of places. The team had to find a way to capture more visual diversity, and Lojze likes the solution they found:

“We use a combination of 3D data sets and 2D data sets like COCO and ADE20K. These are standard 2D segmentation data sets that have been used for a long time,” he explains. “And I think we found quite an elegant way to use these data sets – even though our model is 3D – to still be able to get the diversity from these 2D data sets!”

This project opens new directions. In fact, Lojze was almost overwhelmed by the number of different things they could do after this. The clearest one for him stems from the need for much more data to really have a general model, similar to how the foundation models in 3D operate. “My hope,” he describes, “is that we can use the entirety of the 3D data sets that we already have, in some sort of unsupervised way, to still train panoptic segmentation on top of already existing data. The long-term goal for us is to use these models to enable reasoning in 3D. Today there are some developments in 3D reasoning, but it is very fractured. Everybody develops their own data set for the kind of use cases they want to cover, and so on. So I think we're still waiting for the big breakthrough in 3D reasoning!”

This work is deeply connected to computer vision. “I would say that 3D vision and semantic understanding,” he confirms, “are two important pillars of computer vision that have been researched since the very start of computer vision. I think merging these two concepts into a unified model is one more example of trying to unify all the sub-fields.”

To learn more about Lojze’s work, visit Poster Session 2 (Exhibit Hall I) from 15:00 to 17:00 [Poster 79].

Double-DIP

Don’t miss the BEST OF ICCV 2025 in Computer Vision News of November. Subscribe for free and get it in your mailbox! Click here

Workshops

Michael Black speaking at the 1st Workshop on Interactive Human-centric Foundation Models.

MIT undergraduates Yifan Kang and Dingning Cao present Doodle Agent, a multimodal LLM-driven system that explores how AI can doodle – selecting brushes, colors, and strokes to create expressive, emotion-guided artworks without explicit instructions – at the 2nd AI for Visual Arts Workshop [AI4VA].

UKRAINE CORNER

Sophia Sirko-Galouchenko at ICCV 2025. Sophia is a second-year PhD student at Sorbonne Université and Valeo.ai in Paris, under the supervision of Spyros Gidaris, Andrei Bursuc and Nicolas Thome. She is presenting her poster today at 11:45 [session 1, poster 399], in which she introduces an unsupervised post-training method for ViTs that enhances dense features for in-context tasks.

ICCV's sister conference CVPR adopted a motion with a very large majority, condemning in the strongest possible terms the actions of the Russian Federation government in invading the sovereign state of Ukraine.

Women in Computer Vision

Valentina Salvatelli is a Principal AI Researcher at Microsoft Research in Cambridge, UK.

“Don’t be scared of the fact that maybe you're the only one in the room or a minority. Like, be brave!”

Valentina, you were just promoted to Principal AI Researcher?
Yes, I was recently promoted.

Congratulations! Tell me about your work. What does a Principal AI Researcher do at Microsoft?
I've been doing this kind of work since I joined, more than four years ago: I lead a team of researchers and engineers. Our group is particularly interested in AI for healthcare, especially from a computer vision, multimodal perspective. I started working on computational pathology – how to learn image representations for it – and used it for detecting patterns related to certain genomic conditions. More recently, we focus on vision-language models and, for example, how to use them to automate radiology reporting or to learn representations, also in this case at scale, from radiology images.

Read 160 FASCINATING interviews with Women in Science

We collaborate widely across Microsoft Research and partners to train large vision-language models for healthcare. I've been working in AI for health for about 10 years now. The community is small.

You did a PhD in Astronomy and Astrophysics.
I would say my career journey is probably quite unique. Many people working in this field come from computer science, or maybe even specifically from computer vision. In my case, I started with physics, as I have a bachelor's and a master's in theoretical physics. I did my PhD on computational astrophysics. There were a lot of images involved, a lot of computation and algorithms there too, and some Bayesian statistics, but it was definitely a very, very different application. Still, there was something in common: the passion for learning something from data. And obviously a lot of technical skills about how you code and also how you frame your problem – research skills; what are the problems that you should solve. I then did a postdoc in this area, and after the postdoc, I decided that I wanted to be a bit closer to the real world – having an impact. This is why I was attracted by working on healthcare.

At that point, I transitioned into industry. I did a fellowship, a short one, just to gain some experience and some network. Then I started my first role in industry at what is now called IQVIA, a healthcare company. Back then, there were no images involved. They work on electronic health records, predicting rare diseases in patients.

Tell me, do you still busy yourself sometimes with astrophysics things, or is that a closed chapter?
It's not a completely closed chapter, I would say. I like the interdisciplinary part a lot. I think when you know about multiple fields, there are a lot of opportunities to cross-pollinate ideas. When I was still at IQVIA, I started a collaboration with NASA. I spent some months there working on AI for satellite instruments, applied both to astrophysics missions, like solar missions, and to Earth observation missions. For example, there are a lot of similarities between images from Earth observation satellites and images that come from computational pathology. From a computational, computer-vision perspective, there are a lot of similarities and common technologies that are used. I did that for a few years. It was a side job as a consultant and I loved it. But at this point, I already have a second job as a parent, so I just feel there was no time for a third one.

With all the work that you are doing, a daughter to take care of, and living abroad, there is a lot on your plate. How do you make sure that passion will be there until the day you retire?
I think it's just a matter of balance. Not every day of your career will be interesting, but it's a long journey. I think you need to keep challenging yourself, finding interesting problems to work on or interesting people to work with, and be curious. This might be personal, but at least for me, I find it very refreshing, on a regular basis, to learn something new or to start working on something completely new. Sometimes you rediscover a lot of passion just by redirecting your interest towards something else for some time.

What is the next thing that you're going to explore?
There are a lot of interesting things on my plate now that are somehow new. For example, I recently started to collaborate with the AI for Science team that is also in Cambridge, working on cryo-EM data – a very powerful microscopy technique to image proteins. They have some very interesting computer vision problems about how you select these particles. It is very much out of distribution with respect to what I do now, but I'm actually working with them and it's very interesting.

Did you already have the chance to see some of the things you worked on being translated into a real-world application?
Yes. Yes. Yes. This is one of the most interesting parts for me. I really like that, and it has happened a few times already, in the company where I was working previously and here at Microsoft Research. Here we developed a pathology model that has been published in Nature and is now under clinical trial with Cancer UK. Hopefully it will work. That was a very exciting project! They built a device that can extract some cells from the oesophagus, and based on that, they can make a prediction of whether you're at risk of cancer in your stomach. The problem is that it's not scalable and too expensive. So what we did is build the AI bit on top of it, saying: once you have these images, you don't need to go through the traditional workflow of a pathologist reviewing several slides. There is this model, and it will already make perfect predictions just using this understanding. We basically cut the cost utilizing AI. And if the clinical trial has positive results, that means it can be scaled at the UK population level, because it will be manageable from a cost perspective.

Fascinating!
Right now, I'm working with the Mayo Clinic. It took one year to set up the partnership and then one year to work on the data and develop the model. But we're now at the point where the model has been retrospectively clinically validated. They will deploy it in the clinic.

That is very nice! You and I have something in common: we are both Italians who have not lived in Italy for many, many years. I live in Spain – at least I am in Europe. You are not even in Europe!
I know! That has been a sad moment…

How do you manage being so far away from Italy?
That's a very good question. I've been living abroad, at this point, for I think more years than I've been living in Italy.

I think it's challenging from a personal perspective, because obviously you leave family and also some friends behind. It's always very difficult then to have your support system, especially with a child. This is very much my reality. I sometimes would love to have parents or aunts around to split the effort, but it's obviously not the case. This is, I would say, the drawback.

Do you speak to her in Italian?
Well, my husband is also Italian, so we only speak Italian at home. She's perfectly bilingual. I always joke that her mother tongue is English, because she started to go to nursery at nine, ten months. I'm fine with that; my parents are a bit less so. There is also an aspect of living abroad that I really like: it really gives you the opportunity to grow even more! It's now nine years that I'm in the UK. I feel I really acquired some of the local habits. They become part of me, of my way of being. And I think this is a positive thing. It enables you to evolve and to discover that there might be different ways of living, of considering things, and that's fine.

Your message to the community?
There is one cause I'm really passionate about: closing the gender gap between men and women, especially in the STEM field, and even more in the AI field. What I want to say, for any woman reading this interview, is don't be scared of the fact that maybe you're the only one in the room or a minority. Like, be brave!

I'm doing my best to help, Valentina.
I know, I know, this is why I accepted to be interviewed!

Read 160 FASCINATING interviews with Women in Computer Vision!

Congrats, Doctor Christina!

Christina Wang has recently obtained her Doctor of Medicine at Heidelberg University, Germany. She worked at the Institute for Artificial Intelligence in Cardiovascular Medicine under the leadership of Sandy Engelhardt and the supervision of Gabriele Romano and Roger Karl. Her work centered on advancing simulation-based surgical training – a crucial step toward modernizing education in cardiac surgery. Christina is now looking to pursue a career as a surgeon herself.

Mitral valve repair has emerged as the new gold standard in reconstructive cardiac surgery, offering superior outcomes compared to valve replacement. However, mastering this complex procedure requires extensive practice and presents a significant learning curve for surgeons in training. Recognizing that the educational benefits of such training have been insufficiently studied, Christina conducted a prospective study in her thesis to evaluate the training effects of different mitral valve repair tasks. Together with her research group, she developed a high-fidelity, patient-specific simulator designed to assess and enhance surgical skill acquisition.

A total of 25 medical students participated in structured training sessions, each performing two complete neo-chordae implantations and ring annuloplasties on silicone valve replicas derived from patient data. Three pathological models – posterior, anterior, and bi-leaflet prolapse – were used to represent varying levels of surgical complexity. Throughout the sessions, participants were evaluated on anatomical accuracy (identification of papillary muscles, leaflet segments, and annulus), functional accuracy (chordae length and knot-tying), and execution time for each procedural step to quantify progress. After every session, expert feedback was provided to guide targeted improvement. The results were clear: all participants became faster and more precise after completing all training sessions, showing a steep learning curve and demonstrating a measurable improvement in procedural proficiency.

Figure: Results for different evaluation criteria (the lower, the better). (a) Time for implanting neo-chordae in both groups, divided into sub-steps of stitching the papillary muscle and leaflet. (b) Time for tying 6 knots, overall and in each group. (c) Time for each annuloplasty suture overall, with standard deviation. (d) Time for ring implantation overall, with standard deviation, including stitching and placement of the annuloplasty ring. (e) Percentage of irregular annuloplasty sutures overall, with uncertainty margins.

These findings confirm that patient-specific simulation is an effective method for acquiring and refining the complex skills required for mitral valve repair. Looking ahead, the team envisions a standardized, simulation-based training framework integrated into medical education curricula and clinical training programs. Such integration would ensure consistent excellence among future cardiac surgeons and set a new benchmark for surgical education worldwide.
