Patricio A. Vela

Yue Liu, Yongtian Wang

Beijing Institute of Technology

Georgia Institute of Technology

Beijing Institute of Technology

[email protected]

[email protected]

{liuyue, wyt}@bit.edu.cn

Abstract

This paper presents a modified Kanade-Lucas-Tomasi (KLT) tracking framework for multiple object tracking applications. First, the framework includes a global pixel-level probabilistic model and an adaptive RGB template model that make the traditional KLT tracker more robust for tracking multiple objects and handling partial occlusions. In addition, a merge and split algorithm is introduced in the proposed framework to track through complete occlusions. The advantage of our method is demonstrated on a variety of challenging video sequences.

(This work is supported by the National Natural Science Foundation of China, Grant No. 61072096, and the National Science and Technology Major Project (2012ZX03002004).)

1. Introduction

This paper considers the multiple object tracking problem. As an important subclass of computer vision, numerous algorithms exist for addressing the problem [11]. Tracking algorithms may be generally classified by the types of features used for tracking, e.g., pixel values, edges, feature points, etc. A further division exists between single-target and multiple-target tracking algorithms. Recent work on multiple target tracking can be roughly divided into template-based tracking methods and probabilistic tracking methods.

Template-based tracking methods pose tracking as a detection problem, using the template of each target to locate the corresponding target in the observed frame. One key challenge of template-based tracking methods is the adaptation of the template over time. Ideally, the template model should be highly specific yet stable to occlusions and similar objects [2, 9, 7]. Probabilistic tracking methods pose tracking as a classification problem which seeks to assign each pixel a probability of belonging to each object. The key problem of

probabilistic tracking methods is how to generate the targets' distributions and determine each pixel's probability correctly. Bayesian inference is commonly used to define computable probabilities from the observed image [1, 8]. To reduce false positives and constrain the decision space, active contours have been incorporated into the probability maximization [3, 4].

The contributions of our proposed method, which builds on the KLT tracker [2], are two-fold. First, a global pixel-level probabilistic model is introduced to improve the performance of the adaptive RGB template model (§2). Second, a merge and split algorithm is employed to solve the partial and complete occlusion problems in multiple object tracking applications (§3). The algorithm is validated using several standard and publicly available datasets (§4).

2. Modified KLT Tracking Model

The Lucas-Kanade algorithm [2] minimizes a quadratic approximation to the sum of the squared error between a template T and an observed image I which is warped, by g, into the coordinate frame of the template:

\sum_x [I(w(x; g)) - T(x)]^2.    (1)
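To make the objective concrete, the following sketch evaluates the sum of squared error of Equation (1) for a hypothetical integer-translation warp. All names and the toy data are assumptions for illustration; the actual tracker minimizes this quantity over g iteratively with subpixel warps.

```python
import numpy as np

def ssd_residual(template, image, g):
    """Sum of squared error between a template and the image patch
    extracted by a (hypothetical) integer translation warp g = (tx, ty)."""
    tx, ty = g
    h, w = template.shape
    patch = image[ty:ty + h, tx:tx + w]  # I(w(x; g)) for integer shifts
    return np.sum((patch.astype(float) - template.astype(float)) ** 2)

# Toy check: the residual vanishes at the true shift.
image = np.zeros((20, 20))
image[5:10, 7:12] = 1.0
template = image[5:10, 7:12].copy()
print(ssd_residual(template, image, (7, 5)))  # 0.0 at the correct warp
print(ssd_residual(template, image, (0, 0)))  # 25.0 at a wrong warp
```

A real KLT implementation linearizes I(w(x; g)) around the current g and solves the resulting least-squares problem, rather than evaluating the residual at integer shifts.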

One important application of the Lucas-Kanade algorithm is the KLT tracker, which is widely used in rigid object tracking applications. There are two challenges regarding effective use of the KLT tracker. First, the traditional KLT tracker does not work well when the target changes shape or part of the target is occluded, because it does not consider the relevance of each pixel's error within the template patch. Although extensions do consider weighting schemes for the pixels, the schemes often do not incorporate relevance in a meaningful way (some modified approaches simply use a Gaussian window). As shown in Figure 1, pixel x_2's weight should be higher than pixel x_1's in the tracking process, since x_2 belongs to the target while x_1 belongs to the background.

Figure 1. Two pixels from the template domain; x_2 (blue) is from the target and x_1 (red) is from the background.

Second, when tracking a person, the pose and contour of the target change frame by frame, meaning that a single template does not track well; adaptive, multi-template methods are needed [7]. To solve these two problems, we propose a modified KLT tracker that introduces a global pixel-level probabilistic model and an adaptive RGB template model which relies on these probabilities. In practice, Equation (1) is modified, for each target i, so as to optimize

\sum_x P(C_i(x)) [I(w(x; g)) - T_i(x)]^2    (2)

instead, where the global pixel-level probabilistic model P(C_i(x)) is the probability that pixel x belongs to the i-th target; T_i is the i-th target's adaptive RGB template model; g is a vector of scale and translation parameters; and the warp w(x; g) takes the pixel x in the coordinate frame of the template T_i and maps it to the subpixel location w(x; g) in the observed frame I.

2.1. Global Pixel-Level Probabilistic Model

The global pixel-level probabilistic model gives each pixel of the template patch a confidence weight during tracking to improve the tracking accuracy. This weight is computed using Bayesian segmentation, which determines each pixel's probability of belonging to each object. The segmentation process is modified to consider both the objects' appearance information and their locations. Represent one pixel's information as v = [x, R, G, B], combining its location x and RGB value [R, G, B]. Taking v as the evidence and conditioning on the object's location X_i, the global pixel-level probabilistic model of target i is given by:

P(C_i(x) | v, X_i) = \frac{P(v | C_i(x), X_i) P(C_i(x) | X_i)}{\sum_{j=0}^{M} P(v | C_j(x), X_j) P(C_j(x) | X_j)},    (3)

where C_i(x) is the event that the pixel located at x belongs to the i-th target, M is the total number of targets, and i = 0 represents the background. The likelihood distribution P(v | C_i(x), X_i) is defined to be a Gaussian mixture model in appearance multiplied by a spatial indicator function \chi(x, X_i), where the spatial indicator function is zero when the point x is outside of a window domain determined by X_i and one when the point is within the window domain. As an equation,

P(v | C_i(x), X_i) \triangleq \chi(x, X_i) \sum_{\alpha=1}^{n_m} \omega_i^\alpha N(\mu_i^\alpha, \Sigma_i^\alpha),    (4)

where \omega_i^\alpha are the mixture weights, \mu_i^\alpha are the means, and \Sigma_i^\alpha are the covariances. Considering that people typically have differing pants and shirts (or coats), the number of mixtures n_m was selected to be two. While more is possible, this amount worked well in practice. The background distribution is modeled by a single Gaussian and its indicator function is always one. The first template patch of each target is initialized manually.

The prior distribution P(C_i(x) | X_i) is defined to be:

P(C_i(x) | X_i) \propto
    \rho_{bg}       if \hat{C}_0(x) = true,
    \rho_{overlap}  if P(C_i(x)) / \max_{j \neq i}(P(C_j(x))) < \sigma,
    \rho_{target}   if P(C_i(x)) / \max_{j \neq i}(P(C_j(x))) \geq \sigma.    (5)

The function \hat{C}_0(x) is an estimated classification that the current pixel is of the background class; a Gaussian mixture model (GMM) for the background [12] generates this estimate. The probabilities P(C_i(x)) are the posterior probabilities, as generated by (3), from the previous frame. The threshold \sigma indicates when two target distributions are sufficiently similar to prevent accurate discrimination. The prior probabilities satisfy the ordering \rho_{bg} < \rho_{overlap} < \rho_{target}. The goal is to reduce the impact of the background's pixels on the tracking process. Likewise, \rho_{overlap} < \rho_{target} lowers the significance of pixels within the overlap region between two similar targets. Both effects serve to increase the accuracy of the tracking process. The posterior (3) becomes a prior probability P(C_i(x)) for the next frame, after normalization so that \sum_{i=0}^{M} P(C_i(x) | v, X_i) = 1.

In considering the background and all targets' locations within the prior distribution, the global pixel-level probabilistic model has two advantages. The method avoids giving high weight to pixels belonging to the background or to overlap regions between multiple targets whose distributions are similar. Further, the method enhances the template updating process, covered in the next section.
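As a concrete illustration of the Bayes normalization in Equation (3), the sketch below computes per-pixel class posteriors from class likelihoods and priors. It is a simplified, hypothetical example: the likelihoods here are scalar Gaussians on a single intensity value rather than the windowed RGB mixtures of Equation (4), and all names are assumptions.

```python
import numpy as np

def pixel_posteriors(v, likelihoods, priors):
    """Posterior P(C_i | v) for each class i = 0..M (class 0 = background):
    likelihood times prior, normalized over all classes as in Eq. (3)."""
    scores = np.array([lik(v) * p for lik, p in zip(likelihoods, priors)])
    return scores / scores.sum()

def gauss(mu, sigma):
    # Unnormalized 1-D Gaussian likelihood; stand-in for the appearance model.
    return lambda v: np.exp(-0.5 * ((v - mu) / sigma) ** 2)

# Toy setup: background prefers dark pixels, the single target prefers bright ones.
likelihoods = [gauss(0.1, 0.2), gauss(0.9, 0.2)]  # [background, target 1]
priors = [0.5, 0.5]
post = pixel_posteriors(0.85, likelihoods, priors)
print(post)  # the target posterior dominates for a bright pixel
```

In the paper's setting, the resulting posterior for target i is exactly the per-pixel weight P(C_i(x)) used in Equation (2) and, after normalization, the prior for the next frame.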

2.2. Adaptive RGB Template Model

Here, an adaptive RGB template model T_i will represent the i-th target instead of a single template, so as to handle the changing target silhouette from frame to frame. The template model incorporates a principal component analysis (PCA) decomposition of historical tracking data [7]. First, though, the RGB values are normalized to achieve scale-invariance and shift-invariance with respect to light intensity [10]:

[R', G', B']^T = [(R - \mu_R)/\sigma_R, (G - \mu_G)/\sigma_G, (B - \mu_B)/\sigma_B]^T,    (6)

where \mu and \sigma are the per-channel mean and standard deviation. In the proposed method, we follow the drift correction method of [7]. As additional template information is obtained, reflecting changes in pose and view angle during tracking, a richer appearance model is generated. For each target, we represent the current template model by:

T_i(x) = \mu_{T_i}(x) + \sum_{j=1}^{d_n} \lambda_j E_i^j(x),  for  (\mu_{T_i}, E_i^j) = PCA(\hat{I}_i^1(w(x; g)), ..., \hat{I}_i^n(w(x; g))),    (7)

where PCA(·, ..., ·) means performing PCA with the output being the mean \mu_{T_i} and the first d_n eigenvectors E_i^j, j \in {1, ..., d_n}. The quantity d_n is chosen to keep a fixed percentage of the energy (here, 90%). Incremental PCA [9] is used during updates for computational efficiency. Figure 2 depicts a gray-scale (for ease of visualization) version of a learned template model.

Figure 2. T_i contains the mean \mu_{T_i} and its 8 eigenvectors E_i^j, j \in {1, 2, ..., 8}.

The observed template images \hat{I} are obtained from a target and background extraction process. During tracking, the observation patch may include pixels from other targets, which impacts the accuracy of the updated template model. To resolve this problem, taking the i-th target as an example, the global pixel-level probabilistic model identifies which target pixels to use and which pixels to replace with background appearance data rather than the appearance data from other targets:

\hat{I}_i^n(w(x; g)) = P(C_i(x)) I_i^n(w(x; g)) + (1 - P(C_i(x))) \mu_{T_i}(x),    (8)

where w(x; g) \in foreground in the observed frame. A modified template image is shown in Figure 3: a nearby, occluding target is replaced by background data, i.e., the original observation patch is modified by removing the nearby target data. The modified observation patch is then used to update the i-th target's template.

Figure 3. A modified observation patch.

3. Target Merge and Split During Tracking

As shown in Equation (2), the tracking problem in our framework is to minimize the sum of squared error between two images, the adaptive template T and the image I(w(x; g)), with the probabilistic model P(C|x). An iterative quadratic solution to the least-squares problem updates the transformation parameters g via:

\Delta g = H^{-1} \sum_x [P(C|x) \nabla T(x) \frac{\partial w}{\partial g}]^T {P(C|x) [I(w(x; g)) - T(x)]},    (9)

where H is the Hessian matrix:

H = \sum_x [P(C|x) \nabla T(x) \frac{\partial w}{\partial g}]^T [P(C|x) \nabla T(x) \frac{\partial w}{\partial g}].    (10)

The updated warp at w(x; g) is w(x; g) \circ w(x; \Delta g)^{-1}.

To manage tracking in the case of occluding objects in multiple object tracking applications, the merge and split algorithm given by Algorithm 1 is added to the tracking process. Figure 4 graphically depicts two objects merging and splitting (the algorithm can handle additional object merges and splits), which represents the steps of Algorithm 1. Essentially, when target 1's region overlaps target 2's region, the two associated warp parameters contain mixed movements. As the overlap region grows, a single warp is more effective, thus targets 1 and 2 are merged. When the overlap region shrinks, the targets split into separate models.

Figure 4. Graphical representation of the merge and split algorithm.

Algorithm 1: Merge and split algorithm.
  if X_1 is far from X_2 then
      g_1 = argmin_{g_1} \sum_x P(C_1(x)) [I(w(x; g_1)) - T_1(x)]^2;
      g_2 = argmin_{g_2} \sum_x P(C_2(x)) [I(w(x; g_2)) - T_2(x)]^2;
  end
  if X_1 is near X_2 then
      Merge X_1 and X_2 into one target X_{1-2};
      g_{1-2} = argmin_{g_{1-2}} \sum_x P(C_1(x) \wedge C_2(x)) [I(w(x; g_{1-2})) - T_{1-2}(x)]^2;
  end
  if X_1 is far from X_2 within X_{1-2}'s patch then
      Split X_{1-2} into X_1 and X_2, using the adaptive template model for recognition.
  end

4. Experiments

We use a Core2 E8400 PC with 4 GB RAM to evaluate our method, in MATLAB, on challenging videos containing 4800 frames from two publicly available datasets, PETS 2006 and CAVIAR. For a quantitative analysis we calculated the Multiple Object Tracking Accuracy (MOTA) [5]. The comparative results, with ground truth annotations every 5 frames, are summarized in Table 1. Comparisons are made with the method of [6], which also introduces a merge/split algorithm.

Table 1. Evaluation of proposed method.
  Method  Metric     CAVIAR   PETS 06' S1-T1   PETS 06' S4-T5
  Ours    MOTA       0.852    0.961            0.891
          Precision  0.923    0.991            0.977
          Recall     0.884    0.967            0.917
  [6]     MOTA       NA       0.785            0.883
          Precision  NA       0.882            0.932
          Recall     NA       0.908            0.954

Of note, the proportion of frames containing merged targets is high when using [6]. Since a merged target model provides only a coarse location for an association of several targets, fewer frames tracked under a merged model means more accurate tracking per target. The percentage of frames with merged models in the proposed method is 7.8%, compared with over 50% in [6] for the same test sequence (S1-T1). As shown in Figure 5, our method provides a more accurate location for each target than [6].

Figure 5. Sample frames for [6] (1st row) and the proposed method (2nd row).

Additional results, depicted in Figure 6, show the proposed method tracking through complete occlusion by exploiting the merge/split algorithm. The average cost is about 4 seconds per frame.

Figure 6. Tests of the merge and split algorithm to track complete occlusions.

5. Conclusion

This work proposed a modified KLT tracker to provide a good solution to the multi-object tracking problem. The tracker includes a global pixel-level probabilistic model, an adaptive RGB template model, and a merge/split algorithm to handle complete occlusion. Experiments on challenging videos illustrate the tracker's performance.
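The decision logic of the merge/split bookkeeping in Algorithm 1 can be sketched as below. This is a minimal illustration, not the paper's implementation: the axis-aligned window representation, the IoU-based near/far test, and all names are assumptions; the actual method estimates the warps g_1, g_2, and g_{1-2} by minimizing the weighted SSD of Equation (2), and splits using the adaptive template models for re-identification.

```python
def overlap_ratio(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) windows; a hypothetical
    stand-in for the near/far test on target locations X_1 and X_2."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def merge_split_step(box1, box2, merged, near_thresh=0.3):
    """One decision step of the merge/split bookkeeping: far-apart targets
    are tracked with separate warps, overlapping targets share a single
    merged warp, and a merged pair is split once its members separate."""
    near = overlap_ratio(box1, box2) > near_thresh
    if not merged:
        return "merge" if near else "separate"  # separate: solve g_1, g_2
    return "merged" if near else "split"        # split: re-identify via templates
```

For example, two boxes with heavy overlap trigger a merge, and a merged pair whose boxes no longer overlap triggers a split; the threshold value is an assumption and would be tuned in practice.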

[1] C. Aeschliman, J. Park, and A. Kak. A probabilistic framework for joint segmentation and tracking. CVPR, pages 1371-1378, June 2010.
[2] S. Baker and I. Matthews. Lucas-Kanade 20 years on: A unifying framework. IJCV, 56(3):221-255, 2004.
[3] C. Bibby and I. Reid. Robust real-time visual tracking using pixel-wise posteriors. ECCV, 2008.
[4] C. Bibby and I. Reid. Real-time tracking of multiple occluding objects using level sets. CVPR, 2010.
[5] A. Ellis and J. Ferryman. PETS2010 and PETS2009 evaluation of results using individual ground truthed single views. 2010.
[6] J. Henriques, R. Caseiro, and J. Batista. Globally optimal solution to multi-object tracking with merged measurements. ICCV, pages 2470-2477, Nov 2011.
[7] I. Matthews, T. Ishikawa, and S. Baker. The template update problem. PAMI, 26:810-815, 2004.
[8] H. Nguyen, Q. Ji, and A. Smeulders. Spatio-temporal context for robust multitarget tracking. PAMI, 29(1), 2007.
[9] D. A. Ross, J. Lim, R.-S. Lin, and M.-H. Yang. Incremental learning for robust visual tracking. IJCV, 77(1-3):125-141, 2008.
[10] K. van de Sande, T. Gevers, and C. Snoek. Evaluating color descriptors for object and scene recognition. PAMI, 32:1582-1596, 2010.
[11] A. Yilmaz, O. Javed, and M. Shah. Object tracking: A survey. ACM Computing Surveys, 38(4), 2006.
[12] Z. Zivkovic. Improved adaptive Gaussian mixture model for background subtraction. ICPR, 2:28-31, 2004.