Vision-language models like CLIP have spurred the development of benchmarks for perception tasks. Samuel tells us there are already three papers similar to his, which he takes as a promising sign of the field’s growth and direction. “People should come to my oral presentation, of course, but I think it’s good confirmation that this is the right direction, and people are interested in it,” he adds. “One paper was already at CVPR, and two others are unpublished, but there are now four benchmarks with similar goals, which is great for the community.”

As is often the case, data collection proved to be the most challenging aspect of this work. Samuel employed as much automation as possible, aiming for the most efficient way to get the necessary annotations. Still, data quality became an issue with Amazon Mechanical Turk, the crowdsourcing platform used for the annotations. “You start a round of annotations and realize maybe my instructions weren’t as good as I thought when you get something back really different from what you expected,” he says. “Of course, people there have an incentive too. They want to earn money doing annotations, so they get to the solution in the quickest way they can and leverage shortcuts to get the task done quickly. You don’t get what you want if you forget something in your instructions. We had to do a couple of iterations to get that right.”

The key motivation was to have challenging descriptions that refer to multiple objects in a scene, requiring models to consider the entire context of a sentence. Every detector can take an image of a cat on a bench and find the cat. Instead, he would use an image of two cats, one on a bench and one on the ground, which is a more challenging task. “We leveraged existing datasets, starting with object detection datasets where we knew there were two or three cats in the image,” he tells us. “We selected those first and asked annotators to pick only a subset, two out of three, and describe them so that they only refer to those two, but not the third one.”

Regarding next steps, Samuel highlights a notable follow-up paper on arXiv: the winner of the OmniLabel challenge hosted at CVPR. This work explored how to teach a model to attend to all aspects of a given sentence when identifying objects in an image, using large language models such as ChatGPT to generate negative descriptions and training the detector with this additional augmented data. The challenge itself is still online and open to anyone with a language-based object detector to evaluate and benchmark. He hopes to host another edition of the challenge and workshop at CVPR 2024.