ICCV Daily 2021 - Wednesday

Looking to the future, Anna tells us she would like to see the field move in a direction where language plays a bigger role in general-purpose tasks. She would like to see more models that can receive feedback from human users, with a greater emphasis on the human-in-the-loop, where humans can correct model behavior with language. “That should lead to better, fairer, less biased behaviors, and ideally also better performance, because we might undo some bias or spurious correlations in the models,” she explains. “This is something I’m very excited about. I’m hopeful that enabling models which are more transparent, more explainable, and more communicative with humans is where we’re heading.”

Outside of ICCV, Anna and Mohamed both have busy day jobs. A big part of Mohamed’s work at KAUST relates to vision and language, which is why he wanted to understand this space better. “Recently, I got excited about the topic of 3D vision and language – I want that robot to bring me the cup and make me a cup of coffee!” he laughs. “Another thing that I’ve started to look into is affective vision and language. We collected a dataset called ArtEmis in collaboration with Stanford and École Polytechnique. It’s of a similar size to the Microsoft COCO dataset, but it has a twist, which is affect. For 80,000 paintings, people had to describe how the paintings made them feel and explain why they felt that way. I feel that could be an interesting direction.”

Meanwhile, Anna has been working hard at UC Berkeley on explainable AI and is currently engaged in a semantic forensics effort to detect and fight multi-modal misinformation online. She is also exploring an exciting new direction for understanding and extracting information from media, such as instructional videos, to learn how to perform tasks from demonstrations.

Nishimura et al. present the first Biochemical Video-and-Language dataset, which consists of egocentric videos and aligned text protocols.
