Learning discriminative space-time actions from weakly labelled videos
Michael Sapienza, Fabio Cuzzolin and Philip H.S. Torr
Oxford Brookes University, Oxford, UK

Motivation

Method

Experimental Setup
- Four challenging action datasets: KTH (6 classes), YouTube (11 classes), Hollywood2 (12 classes), HMDB (51 classes).
- Baseline: standard BoF pipeline [9].
- Our approach: MIL-BoF models characterised by cube-shaped ([60-60-60], [80-80-80], [100-100-100]) or cuboid-shaped ([80-80-160], [80-160-80], [160-80-80]) subvolumes [x-y-t].
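The fixed-shape subvolume decomposition used in the setup above can be sketched as follows; a minimal illustration in which the function name is ours and a stride equal to the subvolume size is an assumption:

```python
def subvolume_origins(width, height, frames, sx, sy, st):
    """Origins (x, y, t) of non-overlapping [x-y-t] subvolumes tiling a
    video volume, e.g. sx = sy = st = 80 for the cube-[80-80-80] models."""
    return [(x, y, t)
            for t in range(0, frames - st + 1, st)
            for y in range(0, height - sy + 1, sy)
            for x in range(0, width - sx + 1, sx)]
```

For example, a 160x160x160 clip tiles into 8 cube-[80-80-80] subvolumes, or 4 cuboid-[80-80-160] subvolumes.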

Billions of videos require:
- Organisation, search
- Description, retrieval

Recognising actions for:
- Human-robot interaction
- Gaming, virtual reality

State-of-the-art
- The space-time bag-of-features (BoF) approach is the most popular pipeline for challenging human action data [5, 9]; however, classification performance diminishes with dataset difficulty (e.g. HMDB [4]).
- State-of-the-art methods [2, 4, 6, 8] derive action representations from an entire video clip, even though the clip may contain motion and scene patterns pertaining to multiple action classes.
- Different actions that share similar motions may therefore be confused.

Our approach
- Human actions may naturally be described as a collection of parts.

Results

Table: Quantitative action clip classification results for the BoF and MIL-BoF methods.

Figure: Instead of defining an action as a space-time pattern in an entire video clip (left), we propose to define an action as a collection of space-time action parts contained in video subvolumes (right). The labels of each action subvolume are initially unknown. Multiple instance learning is used to learn which subvolumes are particularly discriminative of the action (solid-line cubes), and which are not (dotted-line cubes).

- i) Cast conventionally supervised BoF action clip classification into a weakly supervised setting.
- ii) Represent video clips as bags of histogram instances with latent class variables.
- iii) Apply multiple instance learning (MIL) to learn the subvolume class labels.
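As a rough sketch of step ii), each subvolume contributes one visual-word histogram instance, and a clip becomes a bag of such instances. The function name and the L1 normalisation choice below are illustrative assumptions:

```python
import numpy as np

def bag_of_histograms(subvolume_assignments, vocab_size):
    """Build a bag of BoF histogram instances for one clip: each row is
    the L1-normalised visual-word histogram of one subvolume, where
    subvolume_assignments[k] lists the codebook indices of the local
    features falling inside subvolume k."""
    bag = []
    for assignments in subvolume_assignments:
        h = np.bincount(assignments, minlength=vocab_size).astype(float)
        bag.append(h / max(h.sum(), 1.0))   # guard against empty subvolumes
    return np.vstack(bag)
```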

| Dataset | Measure | State-of-the-art | BoF | MIL-BoF 60-60-60 | MIL-BoF 80-80-80 | MIL-BoF 100-100-100 | MIL-BoF 80-80-160 | MIL-BoF 160-80-80 | MIL-BoF 80-160-80 | MIL-BoF 80-80-end |
|---------|---------|------------------|-----|------------------|------------------|---------------------|-------------------|-------------------|-------------------|-------------------|
| KTH     | mAcc | 94.53 [3] | 95.37 | 94.91 | 95.37 | 93.52 | 96.76 | 96.30 | 95.83 | 96.76 |
| KTH     | mAP  | –         | 96.48 | 96.48 | 97.02 | 96.53 | 96.74 | 96.58 | 96.62 | 96.92 |
| KTH     | mF1  | –         | 93.99 | 94.22 | 94.84 | 93.65 | 95.78 | 94.44 | 94.41 | 96.04 |
| YouTube | mAcc | 84.2 [8]  | 76.03 | 73.40 | 77.54 | 78.60 | 80.39 | 79.05 | 78.31 | 79.27 |
| YouTube | mAP  | –         | 79.33 | 81.04 | 83.86 | 85.32 | 86.06 | 85.03 | 84.94 | 86.10 |
| YouTube | mF1  | –         | 57.54 | 70.04 | 73.94 | 76.29 | 77.35 | 76.07 | 75.74 | 75.94 |
| HOHA2   | mAcc | –         | 39.04 | 38.49 | 37.28 | 37.43 | 37.49 | 36.92 | 37.84 | 39.63 |
| HOHA2   | mAP  | 58.3 [8]  | 48.73 | 43.49 | 44.18 | 40.72 | 41.97 | 42.08 | 42.61 | 43.93 |
| HOHA2   | mF1  | –         | 32.04 | 39.42 | 37.45 | 32.31 | 33.66 | 32.11 | 35.33 | 35.96 |
| HMDB    | mAcc | 23.18 [4] | 31.53 | 27.64 | 28.69 | 27.51 | 28.17 | 28.98 | 28.71 | 29.67 |
| HMDB    | mAP  | –         | 31.39 | 26.26 | 29.03 | 28.62 | 29.55 | 30.50 | 28.82 | 30.30 |
| HMDB    | mF1  | –         | 21.36 | 23.08 | 25.28 | 23.93 | 25.41 | 24.76 | 25.26 | 25.22 |

Figure: Qualitative action clip localization results on challenging Hollywood2 dataset.

- iv) Map the instance decisions learned by mi-SVM to bag decisions by learning a hyperplane separating the instance margin features F_i of positive and negative bags.
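A minimal sketch of step iv): build a fixed-length margin feature per bag from its instance SVM scores, then separate positive and negative bags with a simple linear rule. The exact form of F_i, and the perceptron used here as a stand-in for a max-margin solver, are assumptions of this sketch:

```python
import numpy as np

def margin_feature(instance_scores, k=3):
    """Fixed-length bag descriptor F_i: the k largest instance margins,
    padded with the smallest margin when the bag has fewer instances."""
    s = np.sort(np.asarray(instance_scores, dtype=float))[::-1]
    if len(s) < k:
        s = np.concatenate([s, np.full(k - len(s), s.min())])
    return s[:k]

def fit_bag_hyperplane(features, labels, epochs=200, lr=0.1):
    """Perceptron stand-in for the hyperplane separating the margin
    features of positive and negative bags."""
    w = np.zeros(features.shape[1]); b = 0.0
    for _ in range(epochs):
        for f, l in zip(features, labels):
            if l * (f @ w + b) <= 0:      # misclassified bag: update
                w += lr * l * f
                b += lr * l
    return w, b
```

A positive bag then needs only a few high-scoring instances, rather than uniformly positive content, to be classified correctly.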

(a) Action: DriveCar - test video

(b) Action: GetOutOfCar - test video

Conclusion
- The resulting action recognition system is suitable for both clip classification and localisation in challenging video datasets, without requiring the labelling of action part locations.

- Even with fixed-size subvolumes, MIL-BoF achieves performance comparable or superior to the BoF baseline on most performance measures.

MIL-BoF

Figure: A training video sequence taken from the KTH dataset [7] plotted in space and time. Overlaid on the video are discriminative cubic action subvolumes learned in a max-margin multiple instance learning framework, with colour indicating their class membership strength.

- In our framework, action models are derived from smaller portions of the video volume (subvolumes), which are used as learning primitives rather than the entire space-time video.
- In this way, the action parts that best characterise each action class may be selected.

http://cms.brookes.ac.uk/research/visiongroup/

- The task of mi-SVM is to recover the latent class variable y_ij of every instance in the positive bags, and simultaneously to learn an SVM instance model ⟨w, b⟩ representing each action class.
- In mi-SVM, each instance label is unobserved, and we maximise the usual soft margin jointly over the hidden variables and the discriminant function [1]:

    min_{y_ij} min_{w,b,ξ} (1/2)‖w‖² + C Σ_ij ξ_ij,   (1)


References
[1] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, 2003.
[2] A. Gilbert, J. Illingworth, and R. Bowden. Fast realistic multi-action recognition using mined dense spatio-temporal features. In ICCV, 2009.
[3] A. Kovashka and K. Grauman. Learning a hierarchy of discriminative space-time neighborhood features for human action recognition. In CVPR, 2010.
[4] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: A large video database for human motion recognition. In ICCV, 2011.

subject to ∀i, j:  y_ij(wᵀx_ij + b) ≥ 1 − ξ_ij,  ξ_ij ≥ 0,  y_ij ∈ {−1, 1},

where w is the normal to the separating hyperplane, b is the offset, and ξ_ij are slack variables for each instance x_ij.
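Problem (1) is non-convex in the joint variables; following [1], a common heuristic alternates between fitting the SVM for fixed instance labels and relabelling the instances in positive bags. A minimal numpy sketch under that reading, where the subgradient trainer stands in for a proper SVM solver and all names are illustrative:

```python
import numpy as np

def train_linear_svm(X, y, C=1.0, epochs=200, lr=0.01):
    """Plain subgradient descent on the soft-margin SVM objective
    (a stand-in for an off-the-shelf solver)."""
    w = np.zeros(X.shape[1]); b = 0.0
    n = len(y)
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) < 1:          # margin violated
                w += lr * (C * yi * xi - w / n)
                b += lr * C * yi
            else:
                w -= lr * w / n
    return w, b

def mi_svm(bags, bag_labels, iters=10):
    """Alternate between (a) fitting an SVM on the current instance
    labels y_ij and (b) relabelling instances in positive bags, keeping
    at least one positive instance per positive bag."""
    y = [np.full(len(B), L) for B, L in zip(bags, bag_labels)]
    for _ in range(iters):
        w, b = train_linear_svm(np.vstack(bags), np.concatenate(y))
        changed = False
        for i, (B, L) in enumerate(zip(bags, bag_labels)):
            if L == 1:                          # only positive bags relabelled
                s = B @ w + b
                new = np.where(s >= 0, 1, -1)
                if new.max() < 1:               # force one positive instance
                    new[np.argmax(s)] = 1
                if not np.array_equal(new, y[i]):
                    y[i], changed = new, True
        if not changed:                         # labels stable: converged
            break
    return w, b
```

At convergence, the surviving positive instances are exactly the discriminative subvolumes highlighted in the figure above.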

[5] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[6] Q. Le, W. Zou, S. Yeung, and A. Ng. Learning hierarchical invariant spatio-temporal features for action recognition with ISA. In CVPR, 2011.
[7] C. Schüldt, I. Laptev, and B. Caputo. Recognizing human actions: A local SVM approach. In ICPR, 2004.
[8] H. Wang, A. Kläser, C. Schmid, and C. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[9] H. Wang, M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.

[email protected]
