Computer Vision News - November 2021
Xi Yin - Best of ICCV 2021

… and videos, there is usually a lot of redundancy. For language, repetition is very sparse. We say a picture is worth a thousand words, right? We need a lot of language to describe an image. When we put these two modalities together, it is interesting how they interact with each other. Using one modality to understand the other helps both learn better. That's the real challenge.

We can make things even more challenging. You and I speak a lot with our hands, so you can put our hands, and our body language, into the equation as well. That's why video interaction is more than just speech. It all depends on the final application. There are tasks that explicitly use hand gestures to understand a video better. In general, there are many other visual cues in videos and images that need to be learned implicitly from the data.

How do you learn that?

There are ways to explicitly add human prior knowledge to guide the learning, so the model learns the way we want it to. There are also implicit ways: you have supervision, and you have the video, so the model will learn the patterns and figure things out from the data itself.

How did you get from Wuhan to Michigan?

I find that throughout my career, I had a lot of luck! I didn't really have a five-year plan. When I first entered college, I learned from others about