CVPR Daily - Thursday

Thao Minh Le

Thao explains the computer vision technology behind his work: “I have designed a reasoning engine that effectively reflects the long short temporal relation, hierarchy and compositionality of videos. The reasoning engine can be readily extended to handle additional information channels, such as subtitles, speech, or anything that shares the same characteristics with the video. For example, video has hierarchy. It has frames, it has clips and it has the entire video level. The subtitles are hierarchical as well. You have the words level, phrase level, sentence level, paragraph level, and document level. That’s what I mean about hierarchy and compositionality. The reasoning engine can also ease the model-building process by simple rearrangements and block stacking with a generic unit.”

In Figure 3, you will see the entire architecture built up by a block called a Conditional Relation Network (CRN). Similar to the ResNet network architecture, these units can be stacked on top of each other to build a very deep network architecture. The CRN unit can handle more than the vision part, including questions and motion information. It is a general-purpose reusable unit that is a relational transformer. It encapsulates and transforms an array of objects into a new array conditioned on a contextual feature.

Thao points to Algorithm 1, which describes how the CRN unit works. “Lines 3 to 11 show a for-loop,” he explains. “It iteratively computes the sparse high-order relations between input objects in a subset of
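For readers who want a feel for the idea, here is a minimal Python sketch of a unit in this spirit. It is not the authors' Algorithm 1. The function name `crn_unit`, its parameters, the choice of subset sizes from 2 up to `max_subset_size`, the random sampling of subsets, and the mean and tanh operations standing in for learned sub-networks are all assumptions made for illustration. What it does keep from the description above is the shape of the computation: an array of objects goes in, relations over subsets of those objects are conditioned on a context feature, and a new array comes out that can be fed to another unit.

```python
import itertools
import numpy as np


def crn_unit(objects, context, max_subset_size, num_samples, rng):
    """Toy conditional relation unit (illustrative only, no learned weights).

    objects: array of shape (n, d), one row per input object
    context: array of shape (d,), the conditioning feature
    Returns an array of shape (max_subset_size - 1, d).
    """
    n = objects.shape[0]
    if not 2 <= max_subset_size <= n:
        raise ValueError("max_subset_size must be between 2 and the number of objects")

    outputs = []
    # Assumption: one output object per subset size k = 2 .. max_subset_size.
    for k in range(2, max_subset_size + 1):
        all_subsets = list(itertools.combinations(range(n), k))
        # Sparse sampling: look at only a few subsets of each size.
        count = min(num_samples, len(all_subsets))
        picks = rng.choice(len(all_subsets), size=count, replace=False)

        conditioned = []
        for idx in picks:
            members = objects[list(all_subsets[idx])]
            relation = members.mean(axis=0)  # placeholder for a learned relation network
            conditioned.append(np.tanh(relation + context))  # placeholder for conditioning
        # Aggregate the sampled relations of this order into one output object.
        outputs.append(np.mean(conditioned, axis=0))
    return np.stack(outputs)


rng = np.random.default_rng(0)
clip_features = rng.standard_normal((8, 4))  # hypothetical: 8 objects of dimension 4
question = rng.standard_normal(4)            # hypothetical conditioning feature

# Units can be stacked: the output array of one is the input array of the next.
first = crn_unit(clip_features, question, max_subset_size=5, num_samples=3, rng=rng)
second = crn_unit(first, question, max_subset_size=3, num_samples=3, rng=rng)
print(first.shape, second.shape)
```

Because each unit returns an array of the same kind it consumes, stacking is just function composition, which is the property the ResNet comparison points to.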
