

Localizing when and where actions take place in a video. For example, somebody is diving into this amazing swimming pool, and you want to find out exactly where this person is spatially, at the frame level, and when the action starts and ends temporally. Until now, state-of-the-art works have tackled the problem mostly at the frame level: they use per-frame detectors that detect the actions in each frame, and then link the detections over time to create spatio-temporal tubes. There is a lot of good in this approach, and there have been improvements over the years, but it has one basic disadvantage: it doesn't exploit the temporal continuity of videos.
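As a rough illustration of that baseline, here is a minimal sketch in Python of how per-frame detections can be greedily linked into tubes by overlap. The function names and the IoU-based linking rule are my simplifications for illustration, not the exact procedure of any particular prior work.

```python
# Sketch of the frame-level baseline: detections are produced
# independently per frame, then greedily linked over time by overlap.

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def link_detections(per_frame_dets, iou_thresh=0.3):
    """per_frame_dets: list over frames of lists of (box, score).
    Returns spatio-temporal tubes as lists of per-frame (box, score)."""
    tubes = [[d] for d in per_frame_dets[0]]
    for dets in per_frame_dets[1:]:
        used = set()
        for tube in tubes:
            last_box, _ = tube[-1]
            # extend each tube with the best-overlapping unused detection
            best, best_ov = None, iou_thresh
            for i, (box, score) in enumerate(dets):
                ov = iou(last_box, box)
                if i not in used and ov >= best_ov:
                    best, best_ov = i, ov
            if best is not None:
                used.add(best)
                tube.append(dets[best])
        # unmatched detections start new tubes
        tubes += [[d] for i, d in enumerate(dets) if i not in used]
    return tubes
```

Note how every frame is treated in isolation here: the linking step only patches the per-frame decisions together afterwards, which is exactly the limitation being described.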
Imagine
somebody in this position. Am I sitting
down or standing up?
I think you are standing up with your
knees bent.
I am sitting down right now.
You are. What’s the story?
We propose to overcome this limitation: instead of working at the frame level, we work on a sequence of frames. We propose an action-tubelet detector that takes a short sequence of frames and outputs tubelets. Following the way standard object detectors work, we extend their anchor boxes to anchor cuboids, which have a fixed spatial extent over time. We then regress the anchor cuboids in order to follow the movement of the actor: we try to say where the actor is at each frame, for instance going up and down. That is the regression part; the classification part puts a label on what somebody is doing, like running, kicking a ball, or any other action label.
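To make the anchor-cuboid idea concrete, here is a small sketch (my own notation, not the actual implementation): an anchor cuboid is simply a standard 2D anchor box repeated at the same spatial position in each of the K frames, i.e. a box with fixed spatial extent over time.

```python
import numpy as np

def make_anchor_cuboids(anchor_boxes, K):
    """anchor_boxes: (A, 4) array of (x1, y1, x2, y2) anchors.
    Returns (A, K, 4): each anchor tiled over the K frames."""
    anchor_boxes = np.asarray(anchor_boxes, dtype=np.float32)
    return np.repeat(anchor_boxes[:, None, :], K, axis=1)

# Example: one anchor box becomes a cuboid spanning a 6-frame sequence.
cuboids = make_anchor_cuboids([[10, 10, 60, 120]], K=6)
print(cuboids.shape)  # (1, 6, 4)
```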
To do so, given a sequence of K frames, we deploy K parallel streams, one for each of the input frames. We learn our action-tubelet detector jointly over all parallel streams, with the weights shared amongst them. In the end, we concatenate the features coming from each stream, and then we classify and regress the anchor cuboid to produce the tubelet, which is what we want.
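Here is a hedged PyTorch sketch of that architecture; the layer sizes, anchor count, and input shapes are my assumptions for illustration, not the actual ACT-detector code. One convolutional stream runs on every frame with shared weights, the K feature maps are concatenated, and two heads score each anchor cuboid and regress it per frame.

```python
import torch
import torch.nn as nn

class TubeletHead(nn.Module):
    def __init__(self, K=6, num_classes=24, feat_dim=64, num_anchors=9):
        super().__init__()
        self.K = K
        # per-frame stream; the same module (shared weights) runs on each frame
        self.stream = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # heads operate on the concatenation of all K streams:
        # one class score per anchor cuboid (plus background), but
        # 4 box coordinates per frame of the sequence
        self.cls = nn.Conv2d(K * feat_dim, num_anchors * (num_classes + 1),
                             3, padding=1)
        self.reg = nn.Conv2d(K * feat_dim, num_anchors * 4 * K,
                             3, padding=1)

    def forward(self, frames):             # frames: (B, K, 3, H, W)
        feats = [self.stream(frames[:, t]) for t in range(self.K)]
        x = torch.cat(feats, dim=1)        # (B, K * feat_dim, H', W')
        scores = self.cls(x)               # one label per anchor cuboid
        offsets = self.reg(x)              # 4 coords per frame per anchor
        return scores, offsets

head = TubeletHead(K=6)
scores, offsets = head(torch.randn(2, 6, 3, 128, 128))
```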
What is interesting there is that the features are learned jointly over the whole sequence, while the regression is done per frame from those jointly learned features. The classification, which assigns the action label to the sequence, is shared over the whole tubelet. That enforces consistency of the actions over time.
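To see why per-frame regression matters, here is an illustrative decoding step, using a standard center-size offset parameterization that I am assuming for the sketch: the anchor cuboid is fixed over time, but because a separate offset is regressed for every frame, the decoded tubelet can drift with the actor while carrying a single action label.

```python
import numpy as np

def decode_tubelet(cuboid, offsets):
    """cuboid: (K, 4) fixed anchor boxes (x1, y1, x2, y2);
    offsets: (K, 4) per-frame (dx, dy, dw, dh) in center-size space."""
    cuboid = np.asarray(cuboid, dtype=np.float32)
    offsets = np.asarray(offsets, dtype=np.float32)
    x1, y1, x2, y2 = cuboid.T
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    w, h = x2 - x1, y2 - y1
    cx = cx + offsets[:, 0] * w            # shift the center per frame
    cy = cy + offsets[:, 1] * h
    w = w * np.exp(offsets[:, 2])          # rescale per frame
    h = h * np.exp(offsets[:, 3])
    return np.stack([cx - w / 2, cy - h / 2,
                     cx + w / 2, cy + h / 2], axis=1)
```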
What feature would you add to the
model that it doesn’t have today?
I could tell you millions of them! What matters, though, is what I think people who work on action localization or