

ConvNets-Based Action Recognition from Depth Maps through Virtual Cameras and Pseudocoloring

Pichao Wang¹, Wanqing Li¹, Zhimin Gao¹, Chang Tang², Jing Zhang¹, and Philip Ogunbona¹

¹ Advanced Multimedia Research Lab, University of Wollongong, Australia
² School of Electronic Information Engineering, Tianjin University, China

BACKGROUND

1. Action recognition has been an active research topic in computer vision because of its wide range of applications, including intelligent surveillance and human-computer interaction.
2. The release of the Microsoft Kinect sensor opened up new opportunities for action recognition.
3. Deep learning has achieved great success in many kinds of applications.

Fig. 1: Action recognition (source: T. Lan et al.)
Fig. 2: Kinect sensors (source: Apple)
Fig. 3: Deep learning (source: VLab MIT)

ROTATION TO MIMIC VIRTUAL CAMERAS

[Figure (a): virtual camera geometry. Labels: image center (Cx, Cy); origins o and O; points Po, Pt and Pd; angles θ and β; axes x, y, z and X, Z; f.]

Rotating the 3D points is equivalent to assuming that a virtual RGB-D camera moves around the subject and points at it from different viewpoints. The coordinates of the subject with respect to the virtual camera are computed by the transformation

\[ [X', Y', Z', 1]^{T} = T_{ry} \, T_{rx} \, [X, Y, Z, 1]^{T} \tag{1} \]

where X', Y', Z' are the 3D coordinates with respect to the virtual camera system, T_ry denotes the transformation along the Y axis (right-handed coordinate system), and T_rx denotes the transformation along the X axis; both are given in Eq. (2).

PROPOSED METHOD

The proposed method consists of two major components: three ConvNets [3], and the construction of depth motion maps (DMMs) [2] from sequences of depth maps as the input to the ConvNets. Given a sequence of depth maps, 3D points are created and three DMMs are constructed by projecting the 3D points onto three orthogonal planes. Each DMM serves as the input to one ConvNet for classification [4]. The final classification of the depth sequence is obtained through late fusion of the three ConvNets.

Three strategies have been developed to deal with the challenges posed by small datasets:

1. More training data are synthesized by rotating the input 3D points to mimic different camera viewpoints.
2. The same ConvNet architecture as the one used for ImageNet is adopted, so that the model trained on ImageNet can be adapted to this problem through transfer learning.
3. Each DMM goes through a pseudocolor coding process that encodes different motion patterns, with enhancement, into the pseudo-RGB channels before being input to the ConvNets.
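The DMM construction step can be illustrated with a short sketch. The poster does not spell out how a DMM is accumulated, so the code below assumes the common definition of summing absolute differences between consecutive projected maps. It also simplifies the side and top views to binary occupancy maps and works directly on quantized depth pixels rather than on rotated 3D points. All function names and the `n_bins` parameter are hypothetical.

```python
def project_three_views(depth, n_bins):
    """Project one depth map onto the front, side and top planes.

    depth: list of rows of integer depth values in [0, n_bins); 0 = background.
    Assumption: side and top views are binary occupancy maps.
    """
    h, w = len(depth), len(depth[0])
    front = [[float(v) for v in row] for row in depth]
    side = [[0.0] * n_bins for _ in range(h)]   # rows: y, columns: depth bin
    top = [[0.0] * w for _ in range(n_bins)]    # rows: depth bin, columns: x
    for y in range(h):
        for x in range(w):
            z = depth[y][x]
            if z > 0:
                side[y][z] = 1.0
                top[z][x] = 1.0
    return front, side, top


def accumulate_motion(maps):
    """Sum absolute differences between consecutive 2D maps (assumed DMM rule)."""
    rows, cols = len(maps[0]), len(maps[0][0])
    dmm = [[0.0] * cols for _ in range(rows)]
    for prev, curr in zip(maps, maps[1:]):
        for r in range(rows):
            for c in range(cols):
                dmm[r][c] += abs(curr[r][c] - prev[r][c])
    return dmm


def depth_motion_maps(sequence, n_bins):
    """Return (front, side, top) DMMs for a sequence of depth maps."""
    views = [project_three_views(frame, n_bins) for frame in sequence]
    return tuple(accumulate_motion([v[i] for v in views]) for i in range(3))
```

In this sketch a pixel that moves sideways at constant depth leaves a trace in the front and top maps but not in the side map, which is why the three views carry complementary motion information.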

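Transformation matrices for Eq. (1):

\[ T_{ry} = \begin{bmatrix} R_y(\theta) & T_y(\theta) \\ 0 & 1 \end{bmatrix}; \quad T_{rx} = \begin{bmatrix} R_x(\beta) & T_x(\beta) \\ 0 & 1 \end{bmatrix} \tag{2} \]

where

\[ R_y(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \end{bmatrix}, \quad T_y(\theta) = \begin{bmatrix} 0 \\ Z \cdot \sin(\theta) \\ Z \cdot (1 - \cos(\theta)) \end{bmatrix}; \]

\[ R_x(\beta) = \begin{bmatrix} \cos(\beta) & 0 & \sin(\beta) \\ 0 & 1 & 0 \\ -\sin(\beta) & 0 & \cos(\beta) \end{bmatrix}, \quad T_x(\beta) = \begin{bmatrix} -Z \cdot \sin(\beta) \\ 0 \\ Z \cdot (1 - \cos(\beta)) \end{bmatrix}. \]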
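As a minimal sketch of Eqs. (1) and (2), the following Python builds the two 4×4 matrices with the entries as printed and applies them to a list of points. The function names and the `z_ref` parameter are illustrative and do not come from the poster. In particular, treating the Z in the translation terms as one reference depth (for example, the distance from the camera to the subject) rather than each point's own depth is an assumption.

```python
import math


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def t_ry(theta, z_ref):
    """T_ry of Eq. (2): block matrix [R_y(theta), T_y(theta); 0, 1]."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0, 0.0],
            [0.0, c, -s, z_ref * s],
            [0.0, s, c, z_ref * (1.0 - c)],
            [0.0, 0.0, 0.0, 1.0]]


def t_rx(beta, z_ref):
    """T_rx of Eq. (2): block matrix [R_x(beta), T_x(beta); 0, 1]."""
    c, s = math.cos(beta), math.sin(beta)
    return [[c, 0.0, s, -z_ref * s],
            [0.0, 1.0, 0.0, 0.0],
            [-s, 0.0, c, z_ref * (1.0 - c)],
            [0.0, 0.0, 0.0, 1.0]]


def rotate_points(points, theta, beta, z_ref):
    """Apply Eq. (1) to a list of (X, Y, Z) points; angles are in radians."""
    transform = mat_mul(t_ry(theta, z_ref), t_rx(beta, z_ref))
    rotated = []
    for x, y, z in points:
        col = mat_mul(transform, [[x], [y], [z], [1.0]])
        rotated.append((col[0][0], col[1][0], col[2][0]))
    return rotated


# Example call: synthesize one extra viewpoint for a small point set.
new_view = rotate_points([(0.1, 0.2, 2.0), (0.0, 0.0, 2.0)],
                         math.radians(20), math.radians(-30), 2.0)
```

With these matrices the point (0, 0, `z_ref`) maps to itself for any pair of angles, so under the stated assumption the translation terms keep the virtual camera pointed at a pivot at depth `z_ref` while it moves.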

PSEUDOCOLORING

Color coding can harness the perceptual capabilities of the human visual system to extract more information from gray images and thereby enhance the detailed texture patterns contained in an image [1]. Motivated by that work, this paper codes each DMM into a pseudocolor image so as to exploit and enhance the texture in the DMM that corresponds to the motion patterns of actions.

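[Figure: framework of the proposed method. Input depth maps pass through rotation and pseudocoloring to give DMMf, DMMs and DMMt; each DMM feeds one ConvNet (conv1, conv2, ..., conv5, fc6, fc7, fc8, with 4096-unit fully connected layers), and the class scores of the three ConvNets are combined by class score fusion.]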
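The poster does not state which rule the class score fusion uses. Purely as an illustration, the sketch below averages the per-class scores of the three ConvNets and picks the highest fused score; averaging is an assumption, and `fuse_scores` is a hypothetical name.

```python
def fuse_scores(score_vectors):
    """Late fusion by averaging per-class scores (assumed rule).

    score_vectors: one list of class scores per ConvNet, all the same length.
    Returns (predicted_class_index, fused_scores).
    """
    n_models = len(score_vectors)
    n_classes = len(score_vectors[0])
    fused = [sum(v[c] for v in score_vectors) / n_models
             for c in range(n_classes)]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused


# Example call with made-up scores for the front, side and top ConvNets.
label, fused = fuse_scores([[0.6, 0.3, 0.1],
                            [0.2, 0.5, 0.3],
                            [0.1, 0.7, 0.2]])
```

Other late-fusion rules, such as taking the per-class maximum or product, fit the same interface by changing how `fused` is computed.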

REFERENCES

1. B. R. Abidi, Y. Zheng, A. V. Gribok, and M. A. Abidi, “Improving weapon detection in single energy X-ray images through pseudocoloring,” IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 36(6):784–796, 2006.
2. X. Yang, C. Zhang, and Y. Tian, “Recognizing actions using depth motion maps-based histograms of oriented gradients,” ACM MM, 2012.
3. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” NIPS, 2012.
4. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” arXiv:1408.5093, 2014.

\[ C_{i=1,2,3} = \left\{ \sin\left[ 2\pi \cdot (-I + \varphi_i) \cdot \tfrac{1}{2} + \tfrac{1}{2} \right] \right\}^{2} \cdot f(I) \tag{3} \]

[Figure: pseudocolor mapping curves. x-axis: normalized gray level; y-axis: corresponding normalized color value. Curves: R, G and B for α = 1 and α = 10, and the amplitude modulation.]

EXPERIMENTAL RESULTS
