CVPR Daily - Friday

To explore the idea of dataset bias, Ali and the team set out to categorize the types of questions that current benchmarks consist of. They devised a theoretical framework that divides questions into categories defined by the type of information required to answer them. The first category can be answered with an unordered bag of words; the second requires an ordered bag of words; the third requires the words plus their layout within the image; and the last requires the words, their order, their layout, and the image itself. They demonstrated that the first three categories are easier to solve and that the final category is the most difficult. In current benchmarks, most questions do not fall under this last category. To advance the field of STVQA, Ali believes a new benchmark is needed to evaluate multimodal models across all modalities, including the text, the layout information, and the image itself.

Why does Ali think this work was chosen for a coveted oral presentation at CVPR?

“I don’t think we got an oral because of our state-of-the-art performance. Many papers achieve 5-10% above the state of the art, but we also found a bias. The model can get to 60% accuracy without even looking at the image. That’s kind of crazy. It’s like a person solving this task with their eyes closed 60% of the time! But obviously, vision can’t be an artifact. If it were, we would all be blind. Evolution wouldn’t allow it. There must be something else. All the experiments we’ve done show a certain bias in this dataset that only requires you to take the answer out of the OCR tokens to answer correctly, which we do not want. I think that gave us the edge to get a CVPR oral.”

Even with a strong performance improvement over other benchmarks, the task is yet to be solved, but this work takes it another step forward. The difficulty stems from the fact that the questions can be complex, even for…
