Daily CVPR - Wednesday
Presentations

Jing Wang presented the Walk and Learn model. This work constructed a large dataset of casual walkers from egocentric videos with weather and location information. Without requiring manual annotations, the project proposed a self-supervised pre-training model that leverages discretized contextual information (geo-location and weather) as weak labels to learn features suited to facial attributes. By learning from diverse contextual information, the framework can also be applied to other high-level analysis tasks.

Andrew Owens presented Visually Indicated Sounds. In his words, the idea behind this work is to take a video of someone hitting and scratching things with a drumstick, and then to predict a plausible soundtrack to go along with it. By predicting sound that is visually indicated in a video, that is, when you can see the action producing the sound, the algorithm has to implicitly learn about the material properties of what is being hit: hitting a carpet makes a very different sound than hitting metal. The motivation is to learn these interaction sounds in a way that might be similar to the way humans learn. People, and children in particular, spend a lot of time interacting with objects and listening to the sounds they make, and the team would like to take inspiration from that to train computer vision systems without explicit labelling. The main technique used is a recurrent neural network: it takes a silent video sequence as input and outputs the corresponding sound features for each frame.
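To make the "frames in, per-frame sound features out" recurrent setup described above more concrete, here is a minimal sketch, not the authors' code: the frame encoder, feature dimensions, and regression loss are illustrative assumptions rather than details from the paper.

```python
# Minimal sketch of a recurrent sound-feature predictor (illustrative only).
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Tiny CNN that turns one RGB frame into a feature vector."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, out_dim)

    def forward(self, x):                      # x: (B, 3, H, W)
        return self.fc(self.conv(x).flatten(1))

class SoundPredictor(nn.Module):
    """Encodes each frame, then an LSTM emits sound features per frame."""
    def __init__(self, feat_dim=256, hidden=256, sound_dim=42):
        super().__init__()
        self.encoder = FrameEncoder(feat_dim)
        self.rnn = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, sound_dim)  # sound_dim is an assumption

    def forward(self, frames):                 # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        out, _ = self.rnn(feats)
        return self.head(out)                  # (B, T, sound_dim)

# Toy usage: 8 frames of a 64x64 clip -> 8 per-frame sound-feature vectors.
model = SoundPredictor()
clip = torch.randn(2, 8, 3, 64, 64)
pred = model(clip)
loss = nn.functional.mse_loss(pred, torch.randn_like(pred))
```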
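The weak-label pre-training described in the Walk and Learn summary can be sketched in a similar spirit. This is only an assumed setup, not the published implementation: a shared face encoder trained with two classification heads over discretized weather and geo-location buckets, whose labels come for free from video metadata; the class counts and architecture below are invented for illustration.

```python
# Minimal sketch of weak-label multi-task pre-training (illustrative only).
import torch
import torch.nn as nn

class WeakLabelPretrainer(nn.Module):
    def __init__(self, feat_dim=128, n_weather=7, n_geo=50):
        super().__init__()
        self.backbone = nn.Sequential(          # stand-in for a face CNN
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.weather_head = nn.Linear(feat_dim, n_weather)
        self.geo_head = nn.Linear(feat_dim, n_geo)

    def forward(self, faces):                   # faces: (B, 3, H, W)
        f = self.backbone(faces)
        return self.weather_head(f), self.geo_head(f)

# Weak labels are derived automatically from metadata, so no manual
# annotation is needed; the trained backbone would later be fine-tuned
# for facial attribute prediction.
model = WeakLabelPretrainer()
faces = torch.randn(4, 3, 64, 64)
weather_logits, geo_logits = model(faces)
weather_y = torch.randint(0, 7, (4,))
geo_y = torch.randint(0, 50, (4,))
loss = (nn.functional.cross_entropy(weather_logits, weather_y)
        + nn.functional.cross_entropy(geo_logits, geo_y))
```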