A Survey on Leveraging Deep Neural Networks for Object Tracking
Sebastian Krebs, Bharanidhar Duraisamy, and Fabian Flohr
Daimler AG, Research and Development, Ulm (Germany)
Contact: [email protected]

Tracking - General
• Originated from aerospace applications in the 1960s
• Estimating the state of one or several targets over time
• Based on noisy measurements from one or multiple sensors

From: Y. Bar-Shalom et al. „The Probabilistic Data Association Filter“, in IEEE Control Systems, 2009

A Survey on Leveraging Deep Neural Networks for Object Tracking| Sebastian Krebs | 16.10.2017


Tracking - Autonomous Driving Applications

Motivation
• Make detection results more robust
• Extract quantities that are not directly observable (e.g., velocities)
• Provide information for higher-level systems

Challenges
• Potentially large number of objects
• Objects in close proximity to each other
• Agile motion patterns

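Tracking - Traditional Object Tracking

Processing loop: Track Management, Data Association, State Update, State Prediction

• Input: measurements $Z_t = \{z_1, \dots, z_n\}$ and previous states $X_{t-1} = \{x_1, \dots, x_m\}$
• Data Association: assignment $\mathcal{A}_{m,n}$ between the $m$ tracks and the $n$ measurements
• State Update: $X_t = \{x_1, \dots, x_m\}$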
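The four stages of the traditional loop can be sketched in a few lines. This is only an illustrative toy, not the pipeline of any surveyed method: the 1D state, the constant-velocity prediction, the greedy nearest-neighbour association, the alpha-beta update (standing in for a Kalman filter), and all names and parameter values (`Track`, `step`, `gate`, `max_misses`, `alpha`, `beta`) are assumptions made for this example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    track_id: int
    pos: float
    vel: float = 0.0
    misses: int = 0


def predict(track, dt):
    """State prediction with an assumed constant-velocity motion model."""
    track.pos += track.vel * dt


def associate(tracks, measurements, gate):
    """Greedy nearest-neighbour association.

    Returns a dict {track index: measurement index}; pairs farther apart
    than `gate` are never associated.
    """
    pairs = sorted(
        (abs(t.pos - z), i, j)
        for i, t in enumerate(tracks)
        for j, z in enumerate(measurements)
    )
    assignment, used = {}, set()
    for dist, i, j in pairs:
        if dist <= gate and i not in assignment and j not in used:
            assignment[i] = j
            used.add(j)
    return assignment


def update(track, z, dt, alpha=0.85, beta=0.5):
    """State update with an alpha-beta filter (simple stand-in for a Kalman filter)."""
    residual = z - track.pos
    track.pos += alpha * residual
    track.vel += beta * residual / dt


def step(tracks, measurements, ids, dt=1.0, gate=2.0, max_misses=2):
    """One cycle: prediction, data association, state update, track management."""
    for t in tracks:
        predict(t, dt)
    assignment = associate(tracks, measurements, gate)
    for i, t in enumerate(tracks):
        if i in assignment:
            update(t, measurements[assignment[i]], dt)
            t.misses = 0
        else:
            t.misses += 1
    # Track management: drop stale tracks, start new tracks for unassigned measurements.
    survivors = [t for t in tracks if t.misses <= max_misses]
    assigned = set(assignment.values())
    for j, z in enumerate(measurements):
        if j not in assigned:
            survivors.append(Track(next(ids), z))
    return survivors


# Two targets approaching each other, observed over three frames.
ids = count()
tracks = []
for frame in [[0.0, 10.0], [1.1, 9.0], [1.9, 8.1]]:
    tracks = step(tracks, frame, ids)
```

After the three frames the two tracks carry velocity estimates of opposite sign, although velocity was never measured directly; this is the "non-directly observable" quantity mentioned in the motivation.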

Deep Learning for Object Tracking - Overview
• Features
• Data Association
• Prediction
• End-to-End

Deep Learning for Object Tracking - Features
• Pre-train network on a large image database
• Utilize feature maps from the pre-trained network
• Create and update a model of the tracked object
• Detection and localization

Figure from [34]
[34] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual Tracking with Fully Convolutional Networks,” in ICCV, 2015

Deep Learning for Object Tracking - Features

Method Name | Network | Features | Integration Remarks
DLT [17] | Stacked Denoising Autoencoder (SDAE) | Pre-trained encoder with classification layer | Network output is used as confidence for a particle-filter-based tracking approach.
SO-DLT [33] | Structured Output CNN | 50x50 probability map | During tracking, two CNNs are fine-tuned on the desired target.
Wang et al. [34] | VGG | conv4-3, conv5-3 | Feature map selection, two networks for generic and specific features, distractor removal.
Chi et al. [35] | VGG, Dual Network | conv4-3, conv5-3, boundary maps | Dual network is trained and updated to fine-tune features for a specific target.
Ma et al. [36] | VGG | conv3-4, conv4-4, conv5-4 | Learn an adaptive linear correlation filter per layer to obtain response maps, from which the target location is inferred.
Hong et al. [52] | R-CNN | Outputs from fc1 | R-CNN features are classified by an online-learned SVM and back-propagated through the network to obtain saliency maps. Bayesian filtering is performed on the combined saliency maps.

Deep Learning for Object Tracking - Data Association
• Learn a generic similarity measure directly from the data
• Using Siamese networks
  • Two-stream networks with shared weights
  • Learned with a contrastive loss
• Use of the similarity measure during data association

Figure from [53]
[53] L. Leal-Taixe, C. Canton-Ferrer, and K. Schindler, “Learning by Tracking: Siamese CNN for Robust Target Association,” in CVPRW, 2016
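To make the Siamese idea concrete, the following minimal sketch passes two inputs through the same weights and scores the pair with one common form of the contrastive loss. It is a toy under stated assumptions: the single linear layer with `tanh` stands in for a CNN stream, and the names `embed`, `siamese_distance`, `contrastive_loss` and the `margin` value are hypothetical, not taken from any of the cited papers.

```python
import math


def embed(x, weights):
    """Shared embedding: one linear layer followed by tanh (toy stand-in for a CNN stream)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in weights]


def distance(a, b):
    """Euclidean distance between two embeddings."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def siamese_distance(x1, x2, weights):
    """Both inputs go through the same weights; this sharing is what makes the network Siamese."""
    return distance(embed(x1, weights), embed(x2, weights))


def contrastive_loss(d, same, margin=1.0):
    """Pull matching pairs together; push non-matching pairs at least `margin` apart."""
    if same:
        return 0.5 * d ** 2
    return 0.5 * max(0.0, margin - d) ** 2


# Hypothetical 2x3 weight matrix and three toy "patches".
W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
patch_a = [1.0, 0.0, 2.0]
patch_b = [1.0, 0.1, 2.0]   # nearly identical to patch_a
patch_c = [-2.0, 3.0, 0.0]  # clearly different

d_match = siamese_distance(patch_a, patch_b, W)
d_other = siamese_distance(patch_a, patch_c, W)
loss = contrastive_loss(d_match, same=True) + contrastive_loss(d_other, same=False)
```

During data association, the learned distance (or a similarity derived from it) would take the place of a hand-crafted cost between detections.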

Deep Learning for Object Tracking - Data Association

Method Name | Similarity Between | Input | Integration Remark
SINT [37] | Target template and candidate boxes | Image patches (pixel values) | Radius sampling to generate candidate patches, similarity measure per proposal box. The box with the highest similarity is considered the new target position.
Leal-Taixe et al. [53] | Detections at time t and t+1 | Pixel values, optical flow, contextual information | Similarity of flow and pixel patches is calculated by the Siamese network and combined with contextual features to calculate the probability of a match, which is used by the final linear programming tracker.
Varior et al. [38] | Pair of target patches | Local Maximal Occurrence (LOMO), Color Names (CN) | Divide patches into horizontal rows, which are interpreted as a sequence.

[37] R. Tao, E. Gavves, and A. W. M. Smeulders, “Siamese Instance Search for Tracking,” in CVPR, 2016
[53] L. Leal-Taixe, C. Canton-Ferrer, and K. Schindler, “Learning by Tracking: Siamese CNN for Robust Target Association,” in CVPRW, 2016
[38] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang, “A Siamese Long Short-Term Memory Architecture for Human Re-Identification,” in ECCV, 2016
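The SINT-style integration (sample candidates around the previous position, score each one, keep the best) can be sketched as follows. The sampling pattern, the function names `radius_sampling` and `select_target`, and in particular the placeholder similarity are assumptions for illustration; in SINT the score would come from the learned Siamese network, not from a distance to a known position.

```python
import math


def radius_sampling(cx, cy, radii, n_angles):
    """Candidate centres on circles of the given radii around the previous position."""
    candidates = [(cx, cy)]
    for r in radii:
        for k in range(n_angles):
            angle = 2.0 * math.pi * k / n_angles
            candidates.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    return candidates


def select_target(candidates, similarity):
    """Return the candidate with the highest similarity to the target template."""
    return max(candidates, key=similarity)


# Placeholder similarity: higher when closer to a hypothetical true position.
# A learned template-vs-candidate similarity would be plugged in here instead.
true_pos = (2.0, 0.0)


def toy_similarity(c):
    return -math.hypot(c[0] - true_pos[0], c[1] - true_pos[1])


candidates = radius_sampling(0.0, 0.0, radii=[1.0, 2.0], n_angles=4)
new_position = select_target(candidates, toy_similarity)
```

Because only the best-scoring box is kept, the tracker needs no online model update in this sketch; the quality of the result rests entirely on the similarity function.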

Deep Learning for Object Tracking - Prediction

Social-LSTM [42]
• Predicts the paths of multiple persons
• Each trajectory is predicted by an LSTM using a preprocessed trajectory history
• Inter-object dependencies are captured by social pooling layers

Figure from [42]

[42] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human Trajectory Prediction in Crowded Spaces,” in CVPR, 2016
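The idea behind a social pooling layer can be illustrated with a small sketch: for one person, the hidden states of nearby persons are summed into a spatial grid centred on that person, so that the per-person LSTM can take its neighbours into account. This is a simplified illustration rather than the exact layer of [42]; the grid size, cell size, and the function name `social_pooling` are assumptions.

```python
def social_pooling(positions, hidden, i, grid_size=2, cell=1.0):
    """Sum neighbours' hidden states into a grid_size x grid_size grid centred on person i.

    positions: list of (x, y) per person
    hidden:    list of hidden-state vectors (lists of numbers) per person
    Returns a nested list indexed as pooled[row][col][k].
    """
    dim = len(hidden[i])
    half = grid_size * cell / 2.0
    pooled = [[[0.0] * dim for _ in range(grid_size)] for _ in range(grid_size)]
    xi, yi = positions[i]
    for j, (xj, yj) in enumerate(positions):
        if j == i:
            continue
        dx, dy = xj - xi, yj - yi
        # Only neighbours inside the grid contribute.
        if -half <= dx < half and -half <= dy < half:
            col = int((dx + half) // cell)
            row = int((dy + half) // cell)
            for k in range(dim):
                pooled[row][col][k] += hidden[j][k]
    return pooled


positions = [(0.0, 0.0), (0.5, 0.5), (-0.5, 0.5), (5.0, 5.0)]
hidden = [[1, 1], [1, 2], [3, 4], [9, 9]]
pooled = social_pooling(positions, hidden, i=0)
```

In this toy setup the person at (5.0, 5.0) lies outside the grid and does not influence the pooled tensor, while the two close neighbours fall into different cells.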

Deep Learning for Object Tracking - Prediction

Behavior-CNN [43]
• Image from a static surveillance camera
• Learns kinematic properties of pedestrians
• Predicts future trajectories based on previous ones

Hoermann et al. [44]
• Dynamic Occupancy Grid Map (DOGMa) as input
• Prediction of the whole DOGMa

[43] S. Yi, H. Li, and X. Wang, “Pedestrian Behavior Understanding and Prediction with Deep Neural Networks,” in ECCV, 2016
[44] S. Hoermann, M. Bach, and K. Dietmayer, “Dynamic Occupancy Grid Prediction for Urban Autonomous Driving: A Deep Learning Approach with Fully Automatic Labeling,” in IV, 2017


Deep Learning for Object Tracking - End-to-End

Method Name | Input | Trained on | Network | Integration Remark
Gan et al. [45] | Image, first target bounding box | Artificial data (generic background, shapes) | RCNN (GRU) | Outputs target bounding box. No online fine-tuning. Anonymous tracking.
GOTURN [46] | Current search region, cropped target template | Adjacent video frames and modified images | Two-stream CNN | Outputs target bounding box by regression. No online fine-tuning. Anonymous tracking.
MDNet [19] | Target candidates, initial target position | Real-world videos | Multi-Domain Network | During tracking, domain-specific layers are removed. Network is fine-tuned during tracking (new classification layer).
ROLO [48] | Raw video frame | ImageNet, detection (YOLO), videos (LSTM) | YOLO + LSTM | Feature maps of the last conv layer and detection results of YOLO are used as input for the LSTM. Outputs target bounding box or heat maps.

All four are single-object tracking methods without kinematic information.

Deep Learning for Object Tracking - End-to-End

DeepTracking [47]
• Raw input from a laser scanner
• Predicts the unoccluded state of the world
• Recurrent network (GRUs) employed
• Artificial training data

Extension: Ondruska et al. [55]
• Allows classification (object-level)
• Real data from a traffic intersection

Extension: Dequaire et al. [56]
• Introduces a Spatial Transformer Module (STM)
• Applied in a moving vehicle

[47] I. Posner and P. Ondruska, “Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks,” in AAAI, 2016

Deep Learning for Object Tracking - End-to-End

[49] A. Milan, S. H. Rezatofighi, A. Dick, K. Schindler, and I. Reid, “Online Multi-target Tracking using Recurrent Neural Networks,” in AAAI, 2017

Conclusion
• Most deep-learning-based tracking approaches are tailored to vision-based detection and classification tasks
• Recurrent neural networks are suitable for capturing spatio-temporal dependencies
• Most methods lack explicit modeling of the kinematic state of the target
• Integrating non-image sensor measurements, or measurements from multiple sensors, is still challenging
• Compared to classical deep learning tasks like classification and detection, tracking is a “new” research field

Thank you for your attention!

References
[17] N. Wang and D.-Y. Yeung, “Learning a Deep Compact Image Representation for Visual Tracking,” in NIPS, 2013
[19] H. Nam and B. Han, “Learning Multi-domain Convolutional Neural Networks for Visual Tracking,” in CVPR, 2016
[33] N. Wang, S. Li, A. Gupta, and D.-Y. Yeung, “Transferring Rich Feature Hierarchies for Robust Visual Tracking,” arXiv, 2015
[34] L. Wang, W. Ouyang, X. Wang, and H. Lu, “Visual Tracking with Fully Convolutional Networks,” in ICCV, 2015
[35] Z. Chi, H. Li, H. Lu, and M.-H. Yang, “Dual Deep Network for Visual Tracking,” in IEEE Transactions on Image Processing, 2017
[36] C. Ma, J. B. Huang, X. Yang, and M. H. Yang, “Hierarchical Convolutional Features for Visual Tracking,” in ICCV, 2015
[37] R. Tao, E. Gavves, and A. W. M. Smeulders, “Siamese Instance Search for Tracking,” in CVPR, 2016
[38] R. R. Varior, B. Shuai, J. Lu, D. Xu, and G. Wang, “A Siamese Long Short-Term Memory Architecture for Human Re-Identification,” in ECCV, 2016
[42] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei-Fei, and S. Savarese, “Social LSTM: Human Trajectory Prediction in Crowded Spaces,” in CVPR, 2016
[43] S. Yi, H. Li, and X. Wang, “Pedestrian Behavior Understanding and Prediction with Deep Neural Networks,” in ECCV, 2016
[44] S. Hoermann, M. Bach, and K. Dietmayer, “Dynamic Occupancy Grid Prediction for Urban Autonomous Driving: A Deep Learning Approach with Fully Automatic Labeling,” in IV, 2017
[45] Q. Gan, Q. Guo, Z. Zhang, and K. Cho, “First Step toward Model-Free, Anonymous Object Tracking with Recurrent Neural Networks,” arXiv, 2015
[46] D. Held, S. Thrun, and S. Savarese, “Learning to Track at 100 FPS with Deep Regression Networks,” in ECCV, 2016
[47] I. Posner and P. Ondruska, “Deep Tracking: Seeing Beyond Seeing Using Recurrent Neural Networks,” in AAAI, 2016
[48] G. Ning, Z. Zhang, C. Huang, Z. He, X. Ren, and H. Wang, “Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking,” arXiv, 2016
[49] A. Milan, S. H. Rezatofighi, A. Dick, K. Schindler, and I. Reid, “Online Multi-target Tracking using Recurrent Neural Networks,” in AAAI, 2017
[53] L. Leal-Taixe, C. Canton-Ferrer, and K. Schindler, “Learning by Tracking: Siamese CNN for Robust Target Association,” in CVPRW, 2016
[55] P. Ondruska, J. Dequaire, D. Z. Wang, and I. Posner, “End-to-End Tracking and Semantic Segmentation Using Recurrent Neural Networks,” arXiv, 2016
[56] J. Dequaire, D. Rao, P. Ondruska, D. Wang, and I. Posner, “Deep Tracking on the Move: Learning to Track the World from a Moving Vehicle using Recurrent Neural Networks,” arXiv, 2016
