Computer Vision News - April 2025

GeoDiffuser

“… input to the model so that it tries to perform the edit that you wish to perform.”

How did the idea come about? There are seminal works in this field that first did this for text-based editing. The difference is that those seminal works, including Prompt-to-Prompt, Null-text Inversion, and others, perform edits that change visual appearance, for example turning a sunny scene into a snowy one; they do not translate objects or change their position. Some other works do handle this, but only in 2D: they do not inject geometry, or they do not remove objects cleanly enough. “Learning from the community,” points out Rahul, “and then manipulating attention features, as well as applying loss functions on top of these works, helped us get to GeoDiffuser. That is what has shaped our research direction!”

The two main contributions of this work are the attention-sharing mechanism and the optimization of image latents. Let’s look at each in detail.

The first contribution is that GeoDiffuser uses a depth prior to obtain geometry and, during editing, to move objects around. This prior can be injected into any text-to-image diffusion model: you do not need a depth-trained model or an in-painting model to do this. It is very generic, and the team has shown that the same edits are possible with Stable Diffusion 1.4, 2.1, and more. In other words, the first contribution is that the approach is generic and can be applied to the attention blocks within any of these models.

The second contribution is that they devised loss functions specifically to remove objects from an image, say, to remove a cup from the top of a table. Ideally, if we
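To make the attention-sharing idea concrete, here is a minimal NumPy sketch of the core operation: features from the reference generation are warped according to the user's geometric edit, so that the edited generation can attend to them at their new locations. This is an illustrative simplification, assuming a plain 2D integer translation rather than a full depth-based 3D projection; the function name and interface are hypothetical and not from the GeoDiffuser codebase.

```python
import numpy as np

def warp_attention_features(features, offset):
    """Illustrative sketch: share features from a reference pass into an
    edited pass by warping them with a user-specified 2D edit.

    features: (H, W, C) attention/key features from the reference pass.
    offset:   (dy, dx) integer translation applied to the edited object.
    Uncovered (disoccluded) regions stay zero; a removal-style loss
    could then push such regions toward plausible background.
    """
    H, W, C = features.shape
    dy, dx = offset
    warped = np.zeros_like(features)
    # Backward warp: every target pixel pulls from its pre-edit location.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    src_y, src_x = ys - dy, xs - dx
    valid = (src_y >= 0) & (src_y < H) & (src_x >= 0) & (src_x < W)
    warped[ys[valid], xs[valid]] = features[src_y[valid], src_x[valid]]
    return warped

# Toy example: move a feature blob two pixels to the right.
feats = np.zeros((6, 6, 1))
feats[2:4, 1:3, 0] = 1.0
moved = warp_attention_features(feats, (0, 2))
```

In the real method, the depth prior would supply a per-pixel 3D transform instead of this uniform shift, but the pattern is the same: the geometry decides where shared attention features land, and the diffusion model fills in the rest.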
