Computer Vision News - October 2022

CLIPasso with Yael Vinker

Yael was one of five winners of a Best Paper award at SIGGRAPH 2022 in August for this work - a huge achievement of which she should be very proud. We ask what she thinks convinced the judges.

“That’s a hard question,” she says modestly, taking a few moments to consider. “I guess it was a combination of the idea and the implementation. The idea is quite original. We’re aware of only two other works that tried to generate sketches with different levels of abstraction. We recognize the importance of this field and that it’s not well explored. Also, the outcome is visually pleasing. Abstraction is core to art and design; people like to see it. It’s fun to look at, and it’s beautiful. Our implementation, thanks to CLIP, led to high-quality results, and our method is simple and easy to understand. I think these are the reasons we won!”

Find more examples and videos on the project page, here. Free and easy-to-use demos here and here.

“We propose a method for initializing the strokes based on the salient regions of the input image,” Yael explains. “We use CLIP to analyze the input image and extract a heat map of the pixels, where more important pixels get a higher score. When drawing a cat, you want to focus on the eyes, the ears, and the whiskers, not necessarily the body. This approach gives the optimization process a better chance: if it starts from a better initialization based on the salient regions of the image, then we show that the strokes converge to a better solution. We also propose controlling the abstraction level by changing the number of strokes, which people haven’t done before.”

An ablation of the proposed CLIP-based perceptual loss compared to L2, LPIPS, and a plain edge map. This figure helps to understand the “semantically-aware” part: when using the proposed CLIP-based loss, the semantic features of the cat are emphasized (such as the nose, eyes, and ears).
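The stroke-initialization step Yael describes can be sketched as saliency-weighted sampling: pixels with a higher score in the heat map are more likely to receive an initial stroke. A minimal sketch, assuming a saliency map has already been extracted - here a hand-made toy map stands in for CLIP's attention, and `init_stroke_points` is a hypothetical helper, not the paper's code:

```python
import numpy as np

def init_stroke_points(saliency: np.ndarray, n_strokes: int,
                       rng=None) -> np.ndarray:
    """Sample initial stroke positions with probability proportional
    to a saliency map (e.g., a heat map derived from CLIP attention)."""
    rng = np.random.default_rng(rng)
    probs = saliency.astype(float).ravel()
    probs /= probs.sum()
    # Sample distinct pixel indices, weighted by saliency.
    idx = rng.choice(probs.size, size=n_strokes, replace=False, p=probs)
    ys, xs = np.unravel_index(idx, saliency.shape)
    return np.stack([xs, ys], axis=1)  # (n_strokes, 2) pixel coordinates

# Toy saliency map: high values around a "face" region.
sal = np.zeros((64, 64))
sal[10:25, 20:45] = 1.0   # pretend this is the cat's face
sal += 0.01               # small floor so every pixel remains reachable
pts = init_stroke_points(sal, n_strokes=16, rng=0)
```

With this weighting, most of the sixteen initial strokes land in the salient region, which is exactly the "better initialization" the optimization then refines.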
In contrast, simple methods based only on pixel intensity, such as XDoG or L2, do not capture the essence of the input image, as such operators do not “understand” the semantic concept behind the image (i.e., “a cat”).
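One reason pixel-wise L2 fails here is that a sketch never aligns with the photograph pixel for pixel, and L2 penalizes any misalignment heavily, while a loss computed on coarser features is far more tolerant. A toy comparison illustrates this; average-pooled cells stand in for the intermediate CLIP activations the actual method compares, and `pooled_features` is a hypothetical stand-in, not CLIP:

```python
import numpy as np

def pooled_features(img, pool=4):
    """Toy 'perceptual' features: average-pool the image into coarse
    cells. A stand-in for comparing intermediate CLIP activations,
    which are likewise tolerant to small geometric deviations."""
    h, w = img.shape
    return img.reshape(h // pool, pool, w // pool, pool).mean(axis=(1, 3))

img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0              # a white square
shifted = np.roll(img, 2, axis=1)  # the same square, nudged 2 px right

pixel_l2 = np.linalg.norm(img - shifted)
feat_l2 = np.linalg.norm(pooled_features(img) - pooled_features(shifted))
# pixel_l2 is much larger than feat_l2: the pixel loss punishes the
# 2-pixel shift, while the coarse features barely change.
```

CLIP features go a step further than this toy: beyond tolerating geometric deviation, they encode what the image depicts, which is why the CLIP-based loss keeps the nose, eyes, and ears of the cat where an edge map or L2 loses them.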
