CVPR Daily - Friday

The Devil is in the Fine-Grained Details

this topic but found nothing that accurately described this problem. The problem was that classical open-vocabulary object detection benchmarks do not come with attributes in their text entries.”

To address this gap, Lorenzo set out to create a benchmark with fine-grained natural language captions for detection. Starting from a detection dataset with structured descriptions of object parts and attributes, he prompted a large language model (LLM) to generate natural language descriptions of the objects. “It was really exciting because at the time we developed this benchmark, it was the early days when we could finally use an open source LLM locally on our machine,” he recalls. “It was quite cool for us!”

The generated captions, which he called positive captions, were paired with negative captions, where other attributes were deliberately misplaced inside the sentence. This combination of positive plus negative captions was used as input vocabulary for the detectors. The models were then tested on two things: whether they could correctly localize objects from these complex descriptions, and whether they could pick the correct caption over the negative ones, which shows whether they found the right attributes for the objects.

Creating this benchmark was a challenge. Lorenzo had to engage in extensive prompt engineering to ensure the accuracy of the LLM outputs. “We had to find the correct prompt for generating the benchmark and reducing the LLM hallucinations because sometimes they can fail,” he explains. “Since the benchmark needs to be very accurate, we also had to manually revise them and discard some generated captions, which were errors or imprecise.”
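
To make the positive-plus-negative caption setup concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the benchmark's actual code: the article does not detail how negatives were built, so the sketch assumes a negative is formed by swapping one attribute word for a wrong one. The function names, the example caption, and the toy scores are all hypothetical, and the scores stand in for whatever confidence a real detector would assign to each caption.

```python
def make_negative(caption: str, attribute: str, wrong_attribute: str) -> str:
    """Build a negative caption by replacing one attribute with a wrong one.

    Assumption: plain substring replacement is enough for illustration.
    A real pipeline would need to respect word boundaries and grammar.
    """
    if attribute not in caption:
        raise ValueError(f"attribute {attribute!r} not found in caption")
    return caption.replace(attribute, wrong_attribute, 1)


def build_vocabulary(positive: str, attribute: str, wrong_attributes: list) -> list:
    """Return the detector's input vocabulary: the positive caption first,
    followed by one negative caption per wrong attribute."""
    negatives = [make_negative(positive, attribute, w) for w in wrong_attributes]
    return [positive] + negatives


def top_caption(scores: dict) -> str:
    """Return the caption with the highest score."""
    return max(scores, key=scores.get)


# Hypothetical example caption and attribute swaps.
positive = "a wooden chair with a red cushion"
vocabulary = build_vocabulary(positive, "red", ["blue", "green"])

# Toy scores standing in for a detector's per-caption confidence on one box.
toy_scores = {caption: 0.1 for caption in vocabulary}
toy_scores[positive] = 0.9

# The detection counts as attribute-correct only if the positive caption wins.
is_correct = top_caption(toy_scores) == positive
```

Because the negatives differ from the positive by a single attribute, a detector that ignores attributes and matches only the object category would score all captions similarly, which is the failure mode this kind of vocabulary is meant to expose.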
