ICCV Daily 2021 - Wednesday
not limit it just to vision and language either. Ideally, we would expand this to even more modalities, but let's start with baby steps!"

Paper submissions to the workshop have almost doubled compared to last time, thanks to all the developments happening in the field. Anna and Mohamed have been impressed by the breadth of topics covered this year, including multi-modal forensics, medical applications, and many broader scenarios where multi-modality is prominent.

"I think the interest in this space is due to its importance and its potential impact, which is growing," Mohamed tells us. "What makes language and vision quite exciting is that humans communicate and interact with one another in language. I might want an AI robot to bring me a cup on the table next to the door. For a robot to perform this task, it has to understand the world around it, which is visual, and what I say, which is language. There is interest in building models that caption images, and in visual dialogue, where we can interact with and speak to a robot. These could be particularly useful for people with sight loss."

Some multi-modal tasks are more mature than others. Combining vision and language is more mature than combining 3D point clouds with language, for example, which would be needed for the robot to be able to fetch the cup.

"In the assistive technology scenario that Mohamed mentioned, you could easily imagine all these scenarios where people snap a picture of an object

Closing the Loop between Vision and Language

Abdelkarim et al. propose new benchmarks and a method for long-tail Visual Relationship Recognition, casting it as a hubness problem.