Recognizing Action at a Distance

A.A. Efros, A.C. Berg, G. Mori and J. Malik

Our goal is to recognize human actions at a distance, at resolutions where a whole person may be, say, 30 pixels tall. We introduce a novel motion descriptor based on optical flow measurements in a spatio-temporal volume for each stabilized human figure, and an associated similarity measure to be used in a nearest-neighbor framework. Making use of noisy optical flow measurements is the key challenge, which we address by treating optical flow not as precise pixel displacements, but rather as a spatial pattern of noisy measurements that are carefully smoothed and aggregated to form our spatio-temporal motion descriptor. To classify the action being performed by a human figure in a query sequence, we retrieve the nearest neighbor(s) from a database of stored, annotated video sequences. We can also use these retrieved exemplars to transfer 2D/3D skeletons onto the figures in the query sequence, and to demonstrate two forms of data-based action synthesis: ``Do as I Do'' and ``Do as I Say''. Results are demonstrated on ballet, tennis, and football datasets.
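The descriptor described above can be sketched in a few lines. This is a hedged numpy-only illustration, assuming the optical flow (fx, fy) for a stabilized, figure-centered window has already been computed: the flow is split into four half-wave rectified channels, each channel is Gaussian-blurred so that noisy flow acts as a spatial pattern rather than exact displacements, and two frames are compared by cross-correlating their normalized descriptors.

```python
import numpy as np

def _blur(img, sigma):
    """Separable Gaussian blur (numpy-only, kept simple for self-containment)."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2.0 * sigma**2))
    k /= k.sum()
    img = np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 0, img)
    img = np.apply_along_axis(lambda m: np.convolve(m, k, mode='same'), 1, img)
    return img

def motion_descriptor(fx, fy, sigma=2.0):
    """Build a 4-channel blurred motion descriptor for one frame.

    fx, fy: optical-flow components over a figure-centered window.
    The flow is half-wave rectified into Fx+, Fx-, Fy+, Fy- channels,
    blurred, stacked, and L2-normalized.
    """
    channels = [np.maximum(fx, 0), np.maximum(-fx, 0),
                np.maximum(fy, 0), np.maximum(-fy, 0)]
    d = np.stack([_blur(c, sigma) for c in channels])
    return d / (np.linalg.norm(d) + 1e-8)

def frame_similarity(d1, d2):
    """Cross-correlation between two normalized frame descriptors."""
    return float(np.sum(d1 * d2))
```

In the full method the similarity is aggregated over a temporal window of frames; the single-frame comparison above is the basic building block.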

``Recognizing Action at a Distance''
Alexei A. Efros, Alexander C. Berg, Greg Mori and Jitendra Malik
IEEE International Conference on Computer Vision (ICCV'03), Nice, France, October 2003.
PDF version (1.6 MB), gzipped PostScript version (1.5 MB), PowerPoint presentation, BibTeX entry, talk video


Here are some images from the paper, plus additional images and videos (DivX required) that didn't make it into the printed version.

Motion Descriptor Matching:

best-match.avi: Video showing the best motion descriptor matches for a short running sequence (see Figure 7 in paper).
player-match.avi: Video showing the best matches drawn from a single player (no smoothing!)


class-football.avi: Video showing classification results and best matches for a novel football sequence (note: "standing" and "kneeling" are not in the database, so it does the best it can).
class-tennis.avi: Video showing classification results for a tennis sequence.

Skeleton Transfer:

Given an input sequence (top row), we are able to recover rough joint locations by querying the action database and retrieving the best-matching motion with its associated 2D/3D skeleton. The second row shows a 2D skeleton transferred from a hand-marked database of joint locations. The third row demonstrates 3D skeleton transfer, which utilizes motion capture data rendered from different viewing directions using a stick figure.
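The transfer step itself is simple nearest-neighbor retrieval. A minimal sketch, assuming a database of (descriptor, joints) pairs — a hypothetical storage format where each exemplar frame carries its hand-marked joint locations — and a dot-product similarity between normalized descriptors:

```python
import numpy as np

def transfer_skeleton(query_desc, database):
    """Nearest-neighbor skeleton transfer (sketch).

    database: iterable of (descriptor, joints) pairs; the format is an
    assumption for illustration. The joints of the best-matching
    exemplar are simply copied onto the query figure.
    """
    _, best_joints = max(
        database, key=lambda entry: float(np.sum(query_desc * entry[0])))
    return best_joints
```

Because the figures are stabilized before matching, the retrieved joint locations land in roughly the right place on the query figure without any per-joint alignment.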

"Do as I Do" Action Synthesis:

GregWorldCup.avi: Greg in the World Cup (38 MB!) -- an extended "Do as I Do" action synthesis and retargeting example.
daid-tennis.avi: "Do as I Do" video for a tennis sequence (top: "driver", bottom: "target"). Although the positions of the two figures are different, their actions are matched.

"Do as I Say" Action Synthesis:

dias1.avi, dias2.avi: two videos of "Do as I Say" for tennis.

Context-based Foreground Correction:

We exploit the redundancy in our data to correct imperfections in each individual sample. The input frames (top row) are automatically corrected to produce cleaned-up figures (bottom row).
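One plausible way to sketch this correction: score every exemplar against the query's motion descriptor and replace the noisy input crop with a pixel-wise median of the best-matching figure crops. The (descriptor, crop) database format and the use of a median here are illustrative assumptions, not necessarily the paper's exact recipe:

```python
import numpy as np

def correct_foreground(query_desc, database, k=3):
    """Context-based foreground correction (hedged sketch).

    database: iterable of (descriptor, crop) pairs -- assumed format.
    The k exemplars most similar to the query are combined by a
    pixel-wise median, suppressing segmentation noise that is present
    in any one sample but not shared across matches.
    """
    scored = sorted(database,
                    key=lambda e: float(np.sum(query_desc * e[0])),
                    reverse=True)
    crops = np.stack([crop for _, crop in scored[:k]])
    return np.median(crops, axis=0)
```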