CVPR Daily - Thursday

Hierarchical Conditional Relation Networks for Video Question Answering

Thao Minh Le is a second-year PhD student at the Applied Artificial Intelligence Institute at Deakin University in Australia. His work addresses the problem of video question answering. He speaks to us ahead of his oral presentation today (Thursday).

Video question answering has a wide range of real-world applications, including helping visually impaired users and organizing large, unstructured collections of visual data. It goes beyond simple recognition, requiring temporal reasoning in addition to visual reasoning. Video is much richer than a still image in the sense that it can come with additional information channels, such as subtitles or speech. However, reasoning across multiple modalities can be a challenge.

Given a video, a person can ask a range of questions. There are two examples of this in Figure 1. In the first, a girl performs some actions multiple times. The question posed is: what does the girl do nine times? The dataset provides a set of answer choices, and the machine has to find the most relevant one. Like a human being, it evaluates the choices to find the correct answer to the question being asked. Here, the answer is that she blocks a person’s punch. The second example poses a different kind of question, about the transition from one action to another: what does the man do before turning body to left? The answer is that he breathes.

The machine has to understand all of the information presented in the video and answer as a human being would. Imagine that someday machines can be trained to watch a movie and hold a natural language conversation with humans about its content.

Thao explains the computer vision technology behind his work: “I have designed a reasoning engine that