Learning latent spatio-temporal compositional model for human action recognition
Abstract
Action recognition is an important problem in multimedia understanding. This paper addresses the problem by building an expressive compositional action model. We model one action instance in a video with an ensemble of spatio-temporal compositions: a number of discrete temporal anchor frames, each of which is further decomposed into a layout of deformable parts. In this way, our model can identify a Spatio-Temporal And-Or Graph (STAOG) to represent the latent structure of actions, e.g. triple jumping, swinging and high jumping. The STAOG model comprises four layers: (i) a batch of leaf-nodes at the bottom for detecting various action parts within video patches; (ii) the or-nodes above the bottom layer, i.e. switch variables that activate their children leaf-nodes to capture structural variability; (iii) the and-nodes within an anchor frame for verifying spatial composition; and (iv) the root-node at the top for aggregating scores over temporal anchor frames. Moreover, contextual interactions are defined between leaf-nodes in both the spatial and temporal domains. For model training, we develop a novel weakly supervised learning algorithm which iteratively determines the structural configuration (e.g. the production of leaf-nodes associated with the or-nodes) along with the optimization of multi-layer parameters. By fully exploiting spatio-temporal compositions and interactions, our approach handles large intra-class action variance well (e.g. different views, individual appearances, spatio-temporal structures). Experimental results on challenging databases demonstrate the superior performance of our approach over other methods. Copyright © 2013 ACM.
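To make the four-layer hierarchy concrete, the following is a minimal, illustrative sketch of the STAOG structure as a data type; it is not the authors' implementation, and the class names, fields, and scoring rules (argmax child selection at or-nodes, summation at the root) are hypothetical simplifications of the layers described in the abstract.

```python
# Hypothetical sketch of the four STAOG layers; names and scoring are assumptions.
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LeafNode:
    """Layer (i): detects one action part within a video patch."""
    part_filter: List[float]                  # appearance template weights (assumed)
    anchor_offset: Tuple[int, int] = (0, 0)   # nominal placement inside the anchor frame


@dataclass
class OrNode:
    """Layer (ii): switch variable activating one child leaf-node."""
    children: List[LeafNode] = field(default_factory=list)

    def activate(self, child_scores: List[float]) -> int:
        # select the child leaf with the highest detection score
        return max(range(len(child_scores)), key=child_scores.__getitem__)


@dataclass
class AndNode:
    """Layer (iii): verifies the spatial composition of parts in one anchor frame."""
    or_nodes: List[OrNode] = field(default_factory=list)


@dataclass
class RootNode:
    """Layer (iv): aggregates scores over the temporal anchor frames."""
    anchor_frames: List[AndNode] = field(default_factory=list)

    def score(self, frame_scores: List[float]) -> float:
        # simple additive aggregation over anchor frames (assumed)
        return sum(frame_scores)
```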