“If you have more modalities as a user to search, you can search for more specific things,” Nikos points out. “Your expressive abilities increase so that you can have better searches. In domain conversion, the text query defines the domain. You search with one image and want to find images that look like it but in different domains. This is useful if you would like to create cross-domain datasets in an automated way and you want to search for many, many images.”

Domain conversion was previously treated as a class-level task, where searches focus on broad categories; a simple text-to-image search suffices if both the class and the domain are known. This work, however, introduces instance-level domain conversion in one of its four datasets, which is helpful in several ways. First, for a large dataset composed only of photographs, it enables the retrieval of equivalent datasets in other domains, such as sketches or paintings. Second, when the class is unknown, users can search with an image and a specified domain to find relevant results. Finally, instance-level retrieval helps find specific instances that are difficult to describe with words alone.

One of the main challenges Nikos faced in this work was the gap between the image and text modalities in the CLIP space. Although CLIP is trained to align images and text, the two modalities remain relatively separate, making it difficult to merge them effectively for composed image retrieval. He devised an innovative …

5 DAILY WACV Monday Composed Image Retrieval
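To make the composed-retrieval setting concrete, the following is a minimal sketch assuming precomputed, CLIP-style embedding vectors. The fusion rule here (summing the normalized image and text embeddings and renormalizing) is a simple illustrative baseline, not the method Nikos developed; the array shapes and function names are assumptions for the example.

```python
import numpy as np

def normalize(v):
    # Project vectors onto the unit sphere, as is standard for CLIP embeddings.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def composed_query(image_emb, text_emb):
    # Illustrative baseline fusion: add the normalized image and text
    # embeddings, then renormalize. The modality gap means a naive sum
    # like this is often a weak combiner, which is exactly the challenge
    # the article describes.
    return normalize(normalize(image_emb) + normalize(text_emb))

def retrieve(query, gallery, k=3):
    # Cosine similarity reduces to a dot product on unit-norm vectors.
    scores = normalize(gallery) @ query
    return np.argsort(-scores)[:k]

# Synthetic stand-ins for real CLIP embeddings.
rng = np.random.default_rng(0)
dim = 512
image_emb = rng.normal(size=dim)          # embedding of the query image
text_emb = rng.normal(size=dim)           # embedding of e.g. "a sketch of ..."
gallery = rng.normal(size=(100, dim))     # embeddings of candidate images

q = composed_query(image_emb, text_emb)
top = retrieve(q, gallery)
print(top)  # indices of the top-3 gallery images for the composed query
```

In a real pipeline the synthetic vectors would be replaced by outputs of a CLIP image and text encoder, and the gallery indices would map back to images in the target domain.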