Many of these datasets assume that the text refers to exactly one bounding box in the image. The algorithm already knows there is a single box that matches the text; it just needs to find it. This differs from object detection, where you are given several categories, and an image might contain one or five of those categories, with multiple instances of each or none at all.

Samuel wanted to create a benchmark that addresses this gap by evaluating an algorithm’s ability to handle more complex, free-form descriptions. He needed descriptions that encompassed multiple objects or even referred to objects not present in the image, introducing the concept of negative descriptions. “If a person is wearing a blue shirt, and the description is a person wearing a red shirt, the algorithm’s output should be no bounding box,” he explains. “Object detection benchmarks do evaluate for this; existing language-based benchmarks do not. That’s where our paper comes into play. The task is: I have an image and a list of descriptions, and the algorithm gives me back a set of bounding boxes only for the objects that match.”

Although vision and language research has existed for many years, the explosion of large-scale models
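To make the task Samuel describes concrete, here is a minimal sketch of the interface it implies, with a simple score threshold standing in for a real grounding model. The names (`ground_descriptions`, `Prediction`), the score format, and the threshold are illustrative assumptions, not part of the benchmark itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

@dataclass
class Prediction:
    description: str
    boxes: List[Box]  # empty list => negative description: nothing in the image matches

def ground_descriptions(
    candidates: Dict[str, List[Tuple[Box, float]]],
    threshold: float = 0.5,
) -> List[Prediction]:
    """Keep only boxes whose match score clears the threshold.

    `candidates` maps each free-form description to (box, score) pairs
    produced by some grounding model (hypothetical here). A description
    that matches nothing in the image correctly yields no boxes.
    """
    return [
        Prediction(desc, [box for box, score in pairs if score >= threshold])
        for desc, pairs in candidates.items()
    ]

# Toy example: "a person wearing a red shirt" is a negative description,
# so the correct output for it is an empty box list.
candidates = {
    "a person wearing a blue shirt": [((10.0, 20.0, 80.0, 200.0), 0.92)],
    "a person wearing a red shirt": [((10.0, 20.0, 80.0, 200.0), 0.11)],
}
for pred in ground_descriptions(candidates):
    print(pred.description, "->", pred.boxes)
```

The key design point this sketch captures is that the output is keyed by description rather than by a fixed category list, and an empty set of boxes is a valid, evaluated answer rather than a failure case.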