IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 19, NO. 9, SEPTEMBER 2010


Joint Random Field Model for All-Weather Moving Vehicle Detection

Yang Wang, Member, IEEE

Abstract—This paper proposes a joint random field (JRF) model for moving vehicle detection in video sequences. The JRF model extends the conditional random field (CRF) by introducing auxiliary latent variables to characterize the structure and evolution of the visual scene. Hence, detection labels (e.g., vehicle/roadway) and hidden variables (e.g., pixel intensity under shadow) are jointly estimated to enhance vehicle segmentation in video sequences. Data-dependent contextual constraints among both detection labels and latent variables are integrated during the detection process. The proposed method handles both moving cast shadows/lights and various weather conditions. A computationally efficient algorithm has been developed for real-time vehicle detection in video streams. Experimental results show that the approach effectively deals with various illumination conditions and robustly detects moving vehicles even in grayscale video.

Index Terms—Contextual constraint, random field, vehicle detection.

I. INTRODUCTION

In application areas such as visual surveillance, traffic monitoring, and human–computer interaction, moving object detection in video streams is an important problem. In particular, traffic monitoring cameras, coupled with video analysis techniques, provide an attractive alternative to other sensors for road traffic management. The inductive loop detector is the dominant sensor technology in current traffic management systems; the loop is usually buried beneath the road surface to detect vehicles passing over it. Compared with loop detectors, video cameras are nonintrusive and less costly to install and maintain. Moreover, traditional sensors such as loop detectors and traffic radar provide only local measurements at specific locations, which limits the effectiveness of traffic management, whereas a video monitoring system can provide much richer information about the entire traffic scene. In recent years, video-based technology has therefore received increasing attention in traffic management research. Vehicle detection with a stationary camera is a fundamental task for video-based traffic monitoring and is essential for the measurement of traffic parameters such as vehicle count, speed, and flow. However, accurate foreground detection can be difficult due to the potential variability within the video scene, including shadows or lights cast by moving objects, dynamic background change, and illumination condition

Manuscript received July 17, 2008; revised March 16, 2009; accepted May 10, 2009. First published April 22, 2010; current version published August 16, 2010. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Ilya Pollak. The author is with the Neville Roach Laboratory, National ICT Australia, School of Computer Science and Engineering, University of New South Wales, Kensington, NSW 2052, Australia (e-mail: [email protected]). Digital Object Identifier 10.1109/TIP.2010.2048970

variation [6], [14], [18]. To robustly segment moving objects in the scene, effective and efficient integration of spatial and temporal information over time is essential. Spatial color distribution can be used to characterize background and foreground objects within dynamic scenes [21]. Gradient (or edge) features help improve the reliability of moving object detection [24]. Moreover, temporal changes of the background can be described by linear processes or statistical distributions based on the history of observations [4], [26]. In [3], [22], the recent history of pixel intensities is characterized by a mixture of Gaussians, and the mixture model is adaptively updated for each site to deal with dynamic background processes. In [13], moving cast shadows are also modeled by a Gaussian mixture for each site to robustly handle varying illumination. In [2], [15], kernel density estimation is employed for adaptive and robust object detection. On the other hand, to comprehensively fuse spatiotemporal information within the scene, contextual modeling is a key issue throughout the detection process. To formulate contextual constraints, generative models including the Markov random field (MRF) and the hidden Markov model (HMM) have been extensively studied for moving object detection in video. In [8], an HMM is used to impose the temporal continuity constraint on foreground and shadow detection for traffic surveillance. A dynamical framework of topology-free HMMs capable of dealing with sudden or gradual illumination changes has been proposed as well [23]. In addition, the spatial smoothness constraint is modeled by an MRF in [17], [20]. Spatiotemporal MRFs involving successive video frames have also been proposed for robust detection and segmentation of moving objects [7], [30]. However, conditional independence of observations is usually assumed in this previous work, which is too restrictive for contextual modeling of the visual scene.
Compared with MRF and HMM models, the conditional random field (CRF) relaxes the strong independence assumption and introduces the flexibility to utilize non-independent constraints of the input data [10]. In recent years, the CRF has been applied to image labeling as well as video analysis [1], [9], [28]. Based on the CRF, this paper proposes a joint random field (JRF) model for visual scene modeling and presents its application to moving vehicle detection in grayscale video. The JRF model extends the CRF by introducing auxiliary latent variables to characterize a complex visual environment and enhance moving object detection in video sequences, so that detection labels (e.g., vehicle/roadway) and hidden variables (e.g., intensity of shadowed points) are jointly estimated throughout the labeling process. A real-time moving vehicle detection algorithm has been developed for video-based traffic monitoring. The method handles both moving cast shadows/lights and dynamic background processes, and it integrates data-dependent contextual dependencies among both detection labels and hidden variables during the detection process. Experimental results show that the proposed approach effectively captures contextual information in video sequences and significantly improves the accuracy of moving vehicle detection under various weather and illumination conditions.

A. Related Work

Recently, random field based models, such as the hidden conditional random field and the layout consistent random field, have been proposed to incorporate hidden variables for object/gesture recognition as well as segmentation of partially occluded objects [19], [29]. First, in these models and their extensions [5], [16], labels are conditionally independent of observations given the hidden variables. In the proposed model, the observations influence the estimation of labels even when the hidden variables are given. In relatively complex visual processes such as moving vehicle and cast shadow detection, the detection labels (e.g., vehicle/roadway) are indeed influenced by the observed images even when the hidden variables (e.g., pixel intensity under shadow) are known. From this point of view, the proposed model theoretically generalizes the previous ones with a tradeoff in computational complexity. Hence, the proposed model can be applied to gesture recognition and multi-object segmentation as well. Second, the auxiliary variables are continuous in this work, so that each site has a discrete detection label and a continuous hidden variable, whereas both the image label and the hidden variable of each site are usually discrete in the previous work. Third, the proposed model integrates contextual constraints among both the label field and the latent field, while neighborhood interaction among output labels is ignored in previous approaches [16], [29].

The remainder of this paper is arranged as follows. Section II proposes the joint random field model and its optimization algorithm. Section III presents the moving vehicle detection algorithm as well as implementation details.
Section IV discusses the experimental results. Finally, Section V concludes the paper.

II. JRF

Given an image sequence, the label and observation of point $i$ at time instant $t$ are denoted by $u_i^t$ and $y_i^t$, respectively. The detection label assigns the point to one of $K$ classes: $u_i^t = e_k$ if the point belongs to the $k$th class, where $e_k$ is a $K$-dimensional unit vector with its $k$th component equal to one. The local observation $y_i^t$ consists of the intensity information at site $i$. Here $i \in S$, and $S$ is the spatial domain of the video scene. The entire label field and observed image over the scene are compactly expressed as $\mathbf{u}^t = \{u_i^t,\, i \in S\}$ and $\mathbf{y}^t = \{y_i^t,\, i \in S\}$, respectively. In a complex visual environment, it is expected that image labeling can be enhanced by introducing a set of auxiliary latent variables to characterize the video scene over time. At time $t$, the hidden variable for each site $i$ is denoted by $v_i^t$, and the entire latent field is expressed as $\mathbf{v}^t = \{v_i^t,\, i \in S\}$. In this work, $v_i^t$ is continuous and describes the cast shadow/light at site $i$ (see Section III). Based on the random field model, contextual information within both the label field and the latent field can be formulated through a probabilistic discriminative framework of statistical dependencies among neighboring sites.
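To make the site-wise fields and their neighborhood potentials concrete, they can be sketched in code. The following is an illustrative sketch only: `phi` and `psi` are placeholder potential functions standing in for the one-pixel and two-pixel potentials specified later in the paper, and the dictionary-based grid representation is an assumption for readability.

```python
def gibbs_energy(labels, latents, image, phi, psi):
    """Total energy of a joint labeling on a 4-connected grid.

    labels, latents, image: dicts keyed by (row, col) site coordinates.
    phi(u, v, y) -> float: one-pixel (local) potential at a site.
    psi(ui, vi, uj, vj, yi, yj) -> float: two-pixel (pairwise) potential.
    """
    energy = 0.0
    for site in labels:
        r, c = site
        energy += phi(labels[site], latents[site], image[site])
        for j in ((r, c + 1), (r + 1, c)):      # right and down neighbours,
            if j in labels:                     # so each edge is counted once
                energy += psi(labels[site], latents[site],
                              labels[j], latents[j],
                              image[site], image[j])
    return energy
```

Under the Gibbs form used below, the joint posterior of labels and latent variables is proportional to the exponential of the negated energy, so lower-energy joint configurations are more probable.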

A. JRF Model

For random variables $\mathbf{u} = \{u_i,\, i \in S\}$ and observed data $\mathbf{y}$ over the video scene, $(\mathbf{u}; \mathbf{y})$ is a conditional random field if, when conditioned on $\mathbf{y}$, the random field $\mathbf{u}$ obeys the Markov property $p(u_i \mid \mathbf{y}, u_j, j \neq i) = p(u_i \mid \mathbf{y}, u_j, j \in N_i)$ [9], where the set $N_i$ denotes the neighboring sites of point $i$. Hence, $\mathbf{u}$ is a random field globally conditioned on the observed data $\mathbf{y}$. In order to introduce auxiliary hidden variables during the labeling process, the notion of a joint random field (JRF) is proposed in this work. For two random fields $\mathbf{u}$ and $\mathbf{v}$ and observed data $\mathbf{y}$, $(\mathbf{u}, \mathbf{v}; \mathbf{y})$ becomes a joint random field if $p(u_i, v_i \mid \mathbf{y}, u_j, v_j, j \neq i) = p(u_i, v_i \mid \mathbf{y}, u_j, v_j, j \in N_i)$, i.e., the couple $(\mathbf{u}, \mathbf{v})$ is Markovian when conditioned on the observed data $\mathbf{y}$. In this work, given the observed images up to time instant $t$, the joint probability distribution over the label field $\mathbf{u}^t$ and the latent field $\mathbf{v}^t$ is modeled by a joint random field to formulate contextual dependencies. Here $\mathbf{y}^{1:t}$ denotes the sequence of observed data up to time $t$. Thus, the joint field $(\mathbf{u}^t, \mathbf{v}^t)$ obeys the Markov property when the observed data $\mathbf{y}^{1:t}$ is given. Using the Hammersley–Clifford theorem and considering only up to pairwise clique potentials [11], the posterior probability is given by a Gibbs distribution with the following form:

$$p(\mathbf{u}^t, \mathbf{v}^t \mid \mathbf{y}^{1:t}) \propto \exp\Big\{ -\sum_{i \in S} \phi_i(u_i^t, v_i^t, \mathbf{y}^{1:t}) - \sum_{i \in S} \sum_{j \in N_i} \psi_{ij}(u_i^t, v_i^t, u_j^t, v_j^t, \mathbf{y}^{1:t}) \Big\} \quad (1)$$

The one-pixel potential $\phi_i$ reflects the local constraint for a single site. The two-pixel potential $\psi_{ij}$ imposes the pairwise constraint between neighboring sites. The strength of the constraints depends on the observed data. Unlike previous CRF approaches [1], [28], here the posterior distribution over the labels $\mathbf{u}^t$ generally can no longer be represented by a conditional random field, since it does not obey a Gibbs distribution. To simplify the computation, the pairwise potential is further factorized as $\psi_{ij} = \psi_{ij}^{u}(u_i^t, u_j^t, \mathbf{y}^{1:t}) + \psi_{ij}^{v}(v_i^t, v_j^t, \mathbf{y}^{1:t})$. Hence,

$$p(\mathbf{u}^t, \mathbf{v}^t \mid \mathbf{y}^{1:t}) \propto \exp\Big\{ -\sum_{i \in S} \phi_i(u_i^t, v_i^t, \mathbf{y}^{1:t}) - \sum_{i \in S} \sum_{j \in N_i} \big[ \psi_{ij}^{u}(u_i^t, u_j^t, \mathbf{y}^{1:t}) + \psi_{ij}^{v}(v_i^t, v_j^t, \mathbf{y}^{1:t}) \big] \Big\} \quad (2)$$

The graphical representation of the joint random field model for a 1-D sequence is shown in Fig. 1. The JRF model extends the CRF for image sequences by introducing auxiliary latent variables to characterize a complex visual scene, and it captures data-dependent neighborhood interaction among both detection labels and latent variables during the labeling process. To recursively update the posterior distribution for the JRF model, a first-order Markov assumption is made on the joint


field in this work. The transition probability is expressed as follows to formulate temporal (or dynamic) dependencies of consecutive fields:

$$p(\mathbf{u}^{t+1}, \mathbf{v}^{t+1} \mid \mathbf{u}^{t}, \mathbf{v}^{t}) \propto \exp\Big\{ \sum_{i \in S} \big[ \ln p(u_i^{t+1} \mid u_i^{t}) + \ln p(v_i^{t+1} \mid v_i^{t}) \big] - \sum_{i \in S} \sum_{j \in N_i} \big[ \psi_{ij}^{u}(u_i^{t+1}, u_j^{t+1}, \mathbf{y}^{t}) + \psi_{ij}^{v}(v_i^{t+1}, v_j^{t+1}, \mathbf{y}^{t}) \big] \Big\} \quad (3)$$

The local transition probabilities $p(u_i^{t+1} \mid u_i^{t})$ and $p(v_i^{t+1} \mid v_i^{t})$ impose the temporal continuity constraint, encouraging a point to have the same label and a similar value of the latent variable in successive frames. Alternatively, the transition probability could be factorized as $p(\mathbf{u}^{t+1} \mid \mathbf{u}^{t})\, p(\mathbf{v}^{t+1} \mid \mathbf{v}^{t})$. However, the joint field transition probability indicates not only the temporal dependencies between the consecutive fields, but also the spatial dependencies within each individual field. Such a factorization would ignore the contextual constraint within the joint field $(\mathbf{u}, \mathbf{v})$. Given the observed data $\mathbf{y}^{t+1}$ at the current time $t+1$, the posterior distribution of the label field and the hidden field is modeled by a Gibbs distribution as well:

$$p(\mathbf{u}^{t+1}, \mathbf{v}^{t+1} \mid \mathbf{y}^{t+1}) \propto \exp\Big\{ -\sum_{i \in S} \phi_i(u_i^{t+1}, v_i^{t+1}, \mathbf{y}^{t+1}) - \sum_{i \in S} \sum_{j \in N_i} \big[ \psi_{ij}^{u}(u_i^{t+1}, u_j^{t+1}, \mathbf{y}^{t+1}) + \psi_{ij}^{v}(v_i^{t+1}, v_j^{t+1}, \mathbf{y}^{t+1}) \big] \Big\} \quad (4)$$

By (2)–(4), spatial and temporal dependencies within the visual scene are unified in a dynamic probabilistic framework based on the JRF model. Given the potentials of the distribution $p(\mathbf{u}^{t}, \mathbf{v}^{t} \mid \mathbf{y}^{1:t})$, the posterior $p(\mathbf{u}^{t+1}, \mathbf{v}^{t+1} \mid \mathbf{y}^{1:t+1})$ at time $t+1$ can be efficiently approximated by a joint random field with the following one-pixel potentials (see Appendix A):

$$\phi_i(u_i^{t+1}, v_i^{t+1}, \mathbf{y}^{1:t+1}) = \phi_i(u_i^{t+1}, v_i^{t+1}, \mathbf{y}^{t+1}) - \ln \big\langle p(u_i^{t+1} \mid u_i^{t}) \big\rangle_{q_i^{t}} - \ln \big\langle p(v_i^{t+1} \mid v_i^{t}) \big\rangle_{g_i^{t}} \quad (5)$$

where $q_i^{t}$ and $g_i^{t}$ are given by the variational approximation of the label field and the latent field. The first term in the one-pixel potential imposes the local constraint from the current observation, while the last two terms impose the temporal constraint from previously observed data. The two-pixel potential functions impose the data-dependent spatial constraints among neighboring detection labels and latent variables. At each time instant, the maximum a posteriori (MAP) estimate of the label field and the latent field is computed as $(\hat{\mathbf{u}}^{t+1}, \hat{\mathbf{v}}^{t+1}) = \arg\max_{\mathbf{u}^{t+1}, \mathbf{v}^{t+1}} p(\mathbf{u}^{t+1}, \mathbf{v}^{t+1} \mid \mathbf{y}^{1:t+1})$.

Fig. 1. Graphic representation of the joint random field model.

B. Optimization

The maximization of the joint posterior distribution over the label field and the latent field involves both discrete variables $\mathbf{u}$ and continuous variables $\mathbf{v}$, which makes it difficult to directly apply popular optimization methods for image labeling such as belief propagation and graph cuts [12]. The posterior probability is therefore optimized by variational approximation [27]. The variational method looks for the best approximation of an intractable probability in the sense of the Kullback–Leibler divergence. The posterior at time $t$ can be approximated by the following factorized distribution:

$$Q(\mathbf{u}^{t}, \mathbf{v}^{t}) = \prod_{i \in S} q_i^{t}(u_i^{t})\, g_i^{t}(v_i^{t}) \quad (6)$$

where the local approximating probabilities $q_i^{t}$ and $g_i^{t}$ are given by iterative computation (see Appendix B). For each site $i$ at time $t$, the estimates are $\hat{u}_i^{t} = \arg\max_{u} q_i^{t}(u)$ and $\hat{v}_i^{t} = \arg\max_{v} g_i^{t}(v)$.

III. MOVING VEHICLE DETECTION

For video-based traffic monitoring, each pixel in the scene is to be classified as moving vehicle, cast shadow (or light), or background (roadway). For each point $i$, the pixel intensity has three (R, G, and B) components for color images or one value for grayscale images. Grayscale images are considered in this work, while the formulation for color images can be derived similarly. At each time instant $t$, the label $u_i^t$ equals $e_1$ for background, $e_2$ for shadow, and $e_3$ for vehicle.

A. Local Observation

In order to segment the moving vehicles, the system should first model the background and shadow information. Assume that each pixel in the scene is corrupted by Gaussian noise, so that the background model at time $t$ becomes $y_i^t = b_i^t + n_i^t$, where $b_i^t$ is the intensity mean for a pixel within the background, and $n_i^t$ is independent zero-mean Gaussian noise with variance $(\sigma_i^t)^2$. Intensity means and variances in the background can be estimated from previous images. To deal with dynamic background change, the background updating process for the traffic scene is based on the Gaussian mixture method [22]. For each site in the scene, the probability distribution of pixel intensity is modeled by a mixture of Gaussians $p(y_i^t) = \sum_{m=1}^{M} w_{i,m}^t\, \mathcal{N}(y_i^t; \mu_{i,m}^t, (\sigma_{i,m}^t)^2)$, where $\mathcal{N}(y; \mu, \sigma^2)$ is a Gaussian distribution with argument $y$, mean $\mu$, and variance $\sigma^2$, and the $w_{i,m}^t$ denote the corresponding weights of the Gaussian mixture. Given the current video frame, each pixel value is checked for a match against the existing Gaussian distributions. For a matched Gaussian, its weight increases and the corresponding mean and variance are updated using the pixel value. For


unmatched distributions, the means and variances remain the same, while the weights are renormalized. If none of the distributions matches the pixel value, the distribution with the lowest weight is replaced by a Gaussian with the pixel value as its mean, an initially low weight, and a high variance. For each point $i$, the Gaussian distribution that has the highest ratio of weight over variance is chosen as the background model at time instant $t$. More details can be found in [22]. The main difference between the Gaussian mixture method and our approach in adaptive background updating is the definition of a match. In [22], a Gaussian is matched if the pixel value is within 2.5 standard deviations of the distribution. In our work, if the point is classified as background by the detection algorithm, then the Gaussian corresponding to the background model is matched; otherwise, a Gaussian is matched if the value is within 2.5 standard deviations of the distribution.

Given the intensity $b_i^t$ of a background point, a linear model is used to describe the change of intensity for the same point when shadowed (or illuminated) in the video scene, i.e., $y_i^t = c_i^t b_i^t + n_i^t$. Considering the contiguity of the video image, the coefficient $c_i^t$ can be estimated from its neighborhood if the point is under cast shadow. To achieve maximum application independence, it is assumed that the intensity information of vehicles is unknown. Hence, a uniform distribution is used for the pixel intensity of a moving vehicle. From the above discussion, the local intensity likelihood of a point at time $t$ becomes

$$p(y_i^t \mid u_i^t) = \begin{cases} \mathcal{N}(y_i^t;\, b_i^t,\, (\sigma_i^t)^2) & \text{if } u_i^t = e_1 \\ \mathcal{N}(y_i^t;\, c_i^t b_i^t,\, (\sigma_i^t)^2) & \text{if } u_i^t = e_2 \\ \text{uniform distribution} & \text{if } u_i^t = e_3. \end{cases} \quad (7)$$

However, this observation model tends to confuse cast shadow and moving vehicle at boundary areas or in uniform regions, especially when the vehicle is darker than the background and the road surface is untextured. Such detection errors can be effectively reduced if the intensity of shadowed points is known, i.e., $p(y_i^t \mid u_i^t = e_2, v_i^t) = \mathcal{N}(y_i^t;\, v_i^t,\, (\sigma_i^t)^2)$, where $v_i^t$ is the mean intensity under cast shadow (or light) for site $i$. Since the intensity under shadow is not given beforehand, $v_i^t$ is used in this work as the auxiliary latent variable to characterize the visual scene for each point at time $t$.

B. Contextual Constraint

The local transition probabilities in (3) are employed to impose the temporal continuity constraint among successive label fields and latent fields. The transition probability for latent variables can be expressed as

$$p(v_i^{t+1} \mid v_i^{t}) \propto \exp\big\{ -\alpha\, (v_i^{t+1} - v_i^{t})^2 \big\} \quad (8)$$

where the positive $\alpha$ reflects the influence of the temporal continuity constraint for the auxiliary variable. Its value depends on both the frame rate and the illumination variation. The local transition probability becomes high when the site has similar values of the latent variable at consecutive time instants. On the other hand, the temporal constraint for detection labels becomes weak as the vehicle velocity increases. Considering fast-moving vehicles, it is assumed in this work that the transition probability $p(u_i^{t+1} \mid u_i^{t})$ is uniformly distributed.

The one-pixel potential in (4) is set as $\phi_i(u_i^t, v_i^t, \mathbf{y}^t) = -\ln p(u_i^t, v_i^t \mid y_i^t)$, so that the posterior distribution becomes the product of the local posteriors at each site when the two-pixel potentials are ignored. Using Bayes' rule, $p(u_i^t, v_i^t \mid y_i^t) \propto p(y_i^t \mid u_i^t, v_i^t)\, p(v_i^t \mid u_i^t)\, p(u_i^t)$. The one-pixel potential becomes

$$\phi_i(u_i^t, v_i^t, \mathbf{y}^t) = -\ln p(y_i^t \mid u_i^t, v_i^t) - \ln p(v_i^t \mid u_i^t) - \ln p(u_i^t). \quad (9)$$

The prior knowledge $p(u_i^t)$ of the label can be expressed by a uniform distribution. The probability $p(y_i^t \mid u_i^t, v_i^t)$ is given by the local intensity likelihood derived in the previous section. For the pixel intensity under cast shadow (or light), the prior $p(v_i^t \mid u_i^t)$ is expressed as

$$p(v_i^t \mid u_i^t) = \begin{cases} \mathcal{N}(v_i^t;\, c_i^t b_i^t,\, (\sigma_i^t)^2) & \text{if } u_i^t = e_2 \\ \text{uniform distribution} & \text{otherwise.} \end{cases} \quad (10)$$

The two-pixel potential for neighboring detection labels is expressed as follows to formulate the spatial dependency:

$$\psi_{ij}^{u}(u_i^t, u_j^t, \mathbf{y}^t) = \beta_1 \big(1 - u_i^t \cdot u_j^t\big) + \beta_2\, \big(u_i^t \cdot u_j^t\big)\, \rho(y_i^t, y_j^t) \quad (11)$$

where $\cdot$ denotes the inner product, $\rho(a, b) = \|a - b\|^2 / (\|a - b\|^2 + \sigma^2)$ is a regularized squared distance, and $\|a - b\|$ is the Euclidean distance. The first term (data-independent potential) encourages the formation of contiguous regions, while the second term (data-dependent potential) encourages data similarity when neighboring sites have the same label. The positives $\beta_1$ and $\beta_2$ respectively weight the importance of the data-independent smoothness constraint and the data-dependent neighborhood interaction. However, under heavy noise, neighboring observations may become quite different even though they belong to the same class. To prevent this problem when detecting vehicles within a noisy video scene, the regularization term $\sigma^2$ is used in the data-dependent pairwise potential. Similarly, the two-pixel potential for neighboring latent variables is expressed as

$$\psi_{ij}^{v}(v_i^t, v_j^t, \mathbf{y}^t) = \gamma_1\, \rho(v_i^t, v_j^t) + \gamma_2\, \big(1 - \rho(y_i^t, y_j^t)\big)\, \rho(v_i^t, v_j^t) \quad (12)$$

where the positives $\gamma_1$ and $\gamma_2$ respectively weight the importance of the data-independent smoothness constraint and the data-dependent neighborhood interaction for latent variables. The potential functions $\psi_{ij}^{u}$ and $\psi_{ij}^{v}$ capture neighborhood interactions among detection labels and latent variables, respectively. Naturally, the potentials impose adaptive contextual constraints that adjust the interaction strength according to the similarity between neighboring observations.
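As a concrete illustration, a pairwise label potential of the form discussed above can be sketched as follows. This is an illustrative sketch only: the weights `a` and `b` and the regularization constant `sigma2` are assumed values, not the paper's tuned parameters.

```python
def pairwise_label_potential(ui, uj, yi, yj, a=1.0, b=0.5, sigma2=100.0):
    """Data-dependent pairwise potential between two neighbouring sites.

    ui, uj: discrete class labels; yi, yj: observed intensities.
    """
    same = 1.0 if ui == uj else 0.0
    smooth = a * (1.0 - same)        # data-independent: favour contiguous regions
    d2 = (yi - yj) ** 2
    # data-dependent term: when neighbours share a label, penalise
    # dissimilar intensities; the regularised (bounded) distance keeps
    # the penalty below b even under heavy noise
    data = b * same * d2 / (d2 + sigma2)
    return smooth + data
```

Neighbors with the same label and similar intensities incur near-zero cost, while the bounded data term prevents noise spikes from dominating the spatial constraint.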


Fig. 2. (a) Region of interest. (b) Straightened image. (c) Foreground and shadow detection. (d) Vehicle detection in the straightened image. (e) Vehicle detection in the original image.
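The adaptive background maintenance described in Section III-A can be sketched as follows. This is an illustrative implementation of the mixture-of-Gaussians update with the 2.5-standard-deviation match rule; the learning rate `alpha` and the initial weight and variance of a replaced component are assumed values, and the classifier-driven match rule used by the full detection algorithm is omitted for brevity.

```python
def update_mog(pixel, mixture, alpha=0.05):
    """One mixture-of-Gaussians update for a single pixel.

    mixture: list of [weight, mean, var] components for this pixel site.
    Returns the component chosen as the background model.
    """
    matched = None
    for comp in mixture:
        w, mu, var = comp
        if abs(pixel - mu) <= 2.5 * var ** 0.5:   # 2.5-sigma match rule
            matched = comp
            break
    if matched is not None:
        w, mu, var = matched
        matched[0] = w + alpha * (1.0 - w)               # raise weight
        matched[1] = mu + alpha * (pixel - mu)           # pull mean toward pixel
        matched[2] = var + alpha * ((pixel - mu) ** 2 - var)
        for comp in mixture:
            if comp is not matched:
                comp[0] *= 1.0 - alpha                   # decay other weights
    else:
        # no match: replace the lowest-weight component with a new
        # Gaussian centred on the pixel (low weight, high variance)
        mixture.sort(key=lambda c: c[0])
        mixture[0][:] = [0.05, float(pixel), 900.0]
    total = sum(c[0] for c in mixture)
    for comp in mixture:                                 # renormalise weights
        comp[0] /= total
    # background model: highest ratio of weight over variance
    return max(mixture, key=lambda c: c[0] / c[2])
```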

To balance the influence of the potential terms in the joint random field, it is assumed that the pairwise weights satisfy $\beta_1 = \beta_2 = \beta$ and $\gamma_1 = \gamma_2 = \gamma$, where the parameters $\beta$ and $\gamma$ are empirically determined to reflect the constraint strength for detection labels and latent variables, respectively. In this work, a set of five short video sequences with different traffic locations and illumination conditions is used as training data for parameter determination. The training data are separate from the test data used in the experiments. For each training sequence, the parameters $\beta$ and $\gamma$ are empirically adjusted to provide visually optimal detection performance. Thus, for each parameter, a set of tuned values is obtained from the training sequences. During testing, each parameter is set as the median of the corresponding tuned values obtained from the training data. Initially, the latent variable of each site is set from the background model with a large variance.

C. Preprocessing and Postprocessing

To improve the computational efficiency, the zone of moving vehicle detection is cropped from the scene for video processing [see Fig. 2(a)]. The region of interest is then straightened by applying a perspective transformation [25], so that moving vehicle detection is performed on straightened images [see Fig. 2(b)]. The straightened image gives a scaled top-down view of the roadway. Typically, a trapezoidal region in the original video becomes a rectangle with a prescribed width and length (48 × 72 in this work) in the straightened image. For traffic roads that are not straight, the entire region of interest can be divided into multiple subareas, so that the roadway in each subarea is approximately straight. Then, each subarea can be warped onto a rectangle in the straightened image. The image straightening reduces the number of pixels for subsequent video processing and substantially improves the computational efficiency. Bilinear interpolation is employed when warping the original image region onto the rectangular straightened image.
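The straightening step can be sketched as follows. For clarity this sketch maps the trapezoid by linearly interpolating between its left and right road edges rather than applying the full projective transform of [25]; the corner ordering and the image layout (a list of rows, one grayscale value per pixel) are assumptions.

```python
def straighten(image, corners, out_w, out_h):
    """Warp a trapezoidal road region to an out_w x out_h rectangle.

    corners: (top_left, top_right, bottom_left, bottom_right), each an
    (x, y) pair in the source image. Uses bilinear interpolation.
    """
    tl, tr, bl, br = corners
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        t = r / (out_h - 1) if out_h > 1 else 0.0
        # left and right road-edge points for this output scanline
        lx, ly = tl[0] + t * (bl[0] - tl[0]), tl[1] + t * (bl[1] - tl[1])
        rx, ry = tr[0] + t * (br[0] - tr[0]), tr[1] + t * (br[1] - tr[1])
        for c in range(out_w):
            s = c / (out_w - 1) if out_w > 1 else 0.0
            x, y = lx + s * (rx - lx), ly + s * (ry - ly)
            x0, y0 = int(x), int(y)
            fx, fy = x - x0, y - y0
            x1 = min(x0 + 1, len(image[0]) - 1)
            y1 = min(y0 + 1, len(image) - 1)
            # bilinear interpolation of the four surrounding pixels
            out[r][c] = ((1 - fx) * (1 - fy) * image[y0][x0]
                         + fx * (1 - fy) * image[y0][x1]
                         + (1 - fx) * fy * image[y1][x0]
                         + fx * fy * image[y1][x1])
    return out
```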
After foreground detection with shadow removal [see Fig. 2(c)], detected vehicles are approximated by small rectangles in the straightened image [see Fig. 2(d)]. For each detected foreground area, the corresponding rectangle has the same central point and average width and length, with its edges parallel to the horizontal and vertical axes. On the highway, most occlusions happen between moving vehicles in neighboring lanes. The detected roadway lines help separate occluded vehicles when occlusion across multiple lanes takes place. Small detected regions, such as the frontal part of an incoming vehicle or false detections caused by noise in the scene, are ignored to enhance the robustness of moving vehicle detection. The located rectangles in the straightened image are then mapped back onto the original image [see Fig. 2(e)].

IV. RESULTS AND DISCUSSION

The proposed approach has been tested on grayscale video sequences captured under different environments for road traffic monitoring. The 48-pixel neighborhood is utilized in the algorithm. The C program can process about 25 frames per second on a Pentium 4 3.0-GHz PC. Four moving vehicle and cast shadow detection algorithms are studied in our experiments: the mixture of Gaussians (MoG) based approach [13], the Markov random field (MRF) approach with spatiotemporal constraints [20], the dynamic conditional random field (CRF) approach [28], and the proposed joint random field (JRF) approach. The same initialization and neighborhood are used in these algorithms when applicable.

Fig. 3 shows the detection results of the conditional random field approach and the proposed method for a grayscale video sequence with strong reflection on the road surface. The gray regions in Fig. 3(f) and (g) represent moving cast shadows. The CRF approach is unable to capture the intensity variation in a relatively complex scene; it can be seen that parts of dark vehicles are misclassified in Fig. 3(f). On the other hand, the auxiliary variables [intensity of shadowed points, see Fig. 3(e)] used in the proposed approach effectively model the illumination variation of the visual scene over time, which improves the reliability of vehicle and shadow segmentation. Compared with Fig. 3(f), moving vehicles and cast shadows at different locations of the road are accurately distinguished in Fig. 3(g).

Fig. 4 shows the results of moving vehicle and cast shadow detection by the Gaussian mixture approach, the Markov random field approach, and the proposed method for a video sequence with low image contrast in the detection zone. In Fig. 4(f.1), the pixel-based MoG approach is likely to confuse moving vehicle and cast shadow in the noisy environment.
The errors are corrected in Fig. 4(g.1) by the proposed method with the help of contextual dependencies and auxiliary variables [see Fig. 4(e.2)]. The MRF approach produces smooth segmentation results; however, it may sometimes smooth incorrectly because it neglects the contextual interaction dependent on the observations. It can be seen that some regions under shadow are misclassified in Fig. 4(f.2), while cast shadows are effectively removed from the moving vehicles in Fig. 4(g.2). The detection results are also evaluated quantitatively by comparison with manually labeled ground-truth images. For video-based traffic monitoring, foreground (vehicle) information is more important than background (or shadow)


Fig. 3. (a) Two frames of a sequence. (b) Vehicle detection by the CRF approach. (c) Vehicle detection by the proposed approach. (d) Straightened images. (e.1) Estimated intensity of roadway. (e.2) Estimated latent field (intensity of shadowed points). (f) Foreground and shadow detection by the CRF approach. (g) Foreground and shadow detection by the proposed approach. (h) Ground-truth vehicle segmentation.

information although moving vehicles usually occupy only a small part of the video scene (or the straightened image). Table I shows the average detection rate (the number of accurately classified foreground points over the total number of foreground points and misclassified background points) for thirty representative frames of the three sequences (ten for each) shown in Figs. 2–4. The JRF approach not only takes advantage of data-dependent contextual constraints, but further improves the detection accuracy by introducing auxiliary latent variables to model the structure and evolution of the video scene. The average rates of true positives, true negatives, false positives, and false negatives are shown in Table I as well. In our experiments, the JRF approach improves the detection accuracy over the other three (MoG, MRF, and CRF) approaches by 43%, 22%, and 11% on average, respectively. The substantial increase in accuracy indicates that, by integrating contextual constraints and introducing auxiliary variables, the proposed approach effectively models the traffic scene during

the detection process. Table II also shows the average error rate of vehicle counting (the number of undetected vehicles and false alarms over the total number of vehicles) for the three test sequences by the different methods. It can be seen that the proposed approach significantly reduces the errors of vehicle detection compared with the other techniques, which would effectively support a traffic management system. Accurate vehicle detection is important to vehicle tracking (or speed estimation) and traffic flow estimation. Since vehicle count, vehicle speed, and traffic flow are important traffic state parameters, the efficiency of traffic control will consequently be enhanced by the substantial improvement of vehicle detection accuracy. Fig. 5 shows the results of vehicle detection by the proposed approach with and without image straightening. Due to the perspective effect, a pixel at the top of the original image represents much more area (or intensity information) in the real world than a pixel at the bottom. Without image straightening, parts of the


Fig. 4. (a) Two frames of a sequence. (b.1) Vehicle detection by the MoG approach. (b.2) Vehicle detection by the MRF approach. (c) Vehicle detection by the proposed approach. (d) Straightened images. (e.1) Estimated intensity of roadway. (e.2) Estimated latent field (intensity of shadowed points). (f.1) Foreground and shadow detection by the MoG approach. (f.2) Foreground and shadow detection by the MRF approach. (g) Foreground and shadow detection by the proposed approach. (h) Ground-truth vehicle segmentation.

TABLE I AVERAGE ACCURACY OF DETECTION RESULTS

TABLE II ERROR RATE OF VEHICLE COUNTING

vehicles are misclassified by the detection algorithm in Fig. 5(c). Meanwhile, the intensity information over the detection zone is more evenly distributed in the straightened image. Comparing Fig. 5(c) and (e), the preprocessing step effectively helps improve the accuracy of moving vehicle detection.
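For reference, the detection rate used in the quantitative evaluation above (accurately classified foreground points over the total of foreground points and misclassified background points, i.e., TP / (TP + FN + FP)) can be sketched as:

```python
def detection_rate(predicted, ground_truth):
    """predicted, ground_truth: flat lists of booleans (True = vehicle)."""
    tp = sum(p and g for p, g in zip(predicted, ground_truth))          # hits
    fn = sum((not p) and g for p, g in zip(predicted, ground_truth))    # misses
    fp = sum(p and (not g) for p, g in zip(predicted, ground_truth))    # false alarms
    return tp / (tp + fn + fp) if tp + fn + fp else 1.0
```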

Fig. 6 shows the results of moving vehicle detection in the dark. The proposed approach can be applied to the detection of both cast shadows and cast lights. It can be seen that pixel intensity varies drastically when background points are illuminated by vehicle lights. Compared with the moving vehicles, the cast lights cover much larger regions of the roadway, which could cause serious mistakes and even failure in further video analysis. The proposed method accurately discriminates moving vehicles from cast lights even in grayscale video sequences.

Fig. 5. (a) Two frames of a sequence. (b) Vehicle detection by the proposed approach. (c) Foreground and shadow detection without image straightening. (d) Straightened images. (e) Foreground and shadow detection by the proposed approach.

Fig. 6. (a) Two frames of a sequence. (b) Vehicle detection by the proposed approach. (c) Straightened images. (d) Foreground and light detection by the proposed approach.

Fig. 7. Vehicle detection by the proposed approach.

Figs. 7 and 8 show the results of vehicle detection by the proposed approach at different locations. It can be seen that the proposed approach not only accurately removes cast shadows/lights under sunny and dark circumstances, but also effectively detects most moving vehicles on cloudy and rainy days. The robustness of vehicle detection under various weather and illumination conditions is important to subsequent analysis such as vehicle counting and tracking for video-based traffic monitoring. Fig. 9(a) shows two detection errors by the proposed approach. In the first case, the spatial smoothness constraint imposed by the joint random field leaves the small vehicle

in the middle lane undetected. In the second case, the whole truck is divided into two parts due to the perspective effect of the low-angle camera view. Detection results by the MoG method, the MRF method, and the CRF method are also shown in Fig. 9(b)–(d). In the experiments, the performance of the proposed approach is usually better than or similar to that of the other methods. However, it can be seen that the MoG method outperforms the proposed approach by correctly detecting the motorcycle in Fig. 9(b).


Fig. 8. Vehicle detection by the proposed approach.

Fig. 9. (a) Vehicle detection by the proposed approach. (b) Vehicle detection by the MoG approach. (c) Vehicle detection by the MRF approach. (d) Vehicle detection by the CRF approach.

V. CONCLUSION

There are two main contributions in this paper. The first is a JRF model that extends the CRF by introducing auxiliary latent variables to characterize the visual scene over time and enhance object detection in video sequences. The second is a computationally efficient algorithm for real-time moving vehicle detection in video-based traffic monitoring. The proposed model integrates contextual constraints among both detection labels and hidden variables during the detection process. Experimental results show that the proposed approach effectively handles both cast shadows/lights and background illumination variations, and it significantly improves the performance of vehicle detection even in grayscale video sequences. Our future work is to apply the JRF model to activity/gesture recognition for video-based event detection, and to develop traffic analysis techniques, such as vehicle counting and tracking, stopped vehicle detection, and traffic-flow estimation, based on the proposed detection method.

APPENDIX A

At each time instant, the current observation is used to update the posterior probability distribution of the joint field via Bayes' rule

(13)

Exact computation of (13) is intractable because all possible assignments of the joint field would have to be considered. By variational approximation, the posterior can be factorized as in (6). Combining (3), the integral sum in (13) becomes


(14)

where ⟨·⟩ stands for the conditional expectation with respect to the variational distribution, and the remaining factor is the local normalization constant for (14).

Combining (4), (13), and (14), the posterior probability distribution of the joint field at each time instant is updated as

(15)

From (15), the posterior distribution at each time instant can be approximated by a joint random field with the one-pixel and two-pixel potentials in (5).

APPENDIX B

The variational method minimizes the Kullback–Leibler (KL) divergence between the approximating distribution and the true posterior, which yields the best lower bound on the probability of the observed data within the family of approximations. It is known that

(16)

where the approximating distribution takes the factorized form in (6). To optimize the KL divergence, the local approximating probabilities should be given by the mean field equations [27]. For the label at a site

(17)

(18)

During the computation, the additive constant is ignored since the term takes no effect after the normalization. Similarly, for the latent variable at a site we have

(19)

where the remaining factor is the local normalization constant for (19). Ignoring the additive constant, then

(20)

It can be seen that, in order to find the local approximating probability of a site, one has to know the probabilities of its neighbors. Therefore, the approximating probabilities can be iteratively estimated by (17)–(20). At each time instant, the posterior distribution is approximated by the product of the local approximating probabilities. To simplify the integral computation during the iteration, the local distribution of the latent variable is approximated by a Gaussian distribution in this work.

ACKNOWLEDGMENT

The author thanks the Roads and Traffic Authority, NSW, for providing the traffic video data. National ICT Australia (NICTA) is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence Program.
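To make the iterative estimation of (17)–(20) concrete, the following is a minimal mean-field sketch for a generic binary label field with Ising-style potentials on a 4-connected grid. The grid size, potentials, and coupling strength are illustrative assumptions standing in for the JRF's actual one-pixel and two-pixel potentials; this is a generic mean-field scheme, not the paper's implementation.

```python
import numpy as np

def mean_field(unary, coupling=1.0, n_iter=20):
    """Mean-field approximation q_s(x_s) for a binary MRF on a grid.

    unary    -- shape (H, W, 2): one-pixel log-potentials for labels {0, 1}
    coupling -- strength of the smoothness (two-pixel) potential
    Returns q, shape (H, W, 2): local approximating probabilities.
    """
    h, w, _ = unary.shape
    q = np.full((h, w, 2), 0.5)  # uniform initialization
    for _ in range(n_iter):
        for y in range(h):
            for x in range(w):
                # Expected neighbor agreement under the current q -- the
                # "probabilities of its neighbors" that each update needs.
                nb = np.zeros(2)
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        nb += q[ny, nx]
                logits = unary[y, x] + coupling * nb
                e = np.exp(logits - logits.max())
                q[y, x] = e / e.sum()  # local normalization constant
    return q

# Toy example: a noisy 5x5 field with strong foreground evidence in the
# center and weak background evidence everywhere.
unary = np.zeros((5, 5, 2))
unary[1:4, 1:4, 1] = 2.0   # evidence for label 1 in the central blob
unary[:, :, 0] += 0.5      # weak evidence for label 0 at every site
q = mean_field(unary)
print(q[2, 2].argmax())    # → 1 (center site labeled foreground)
```

As in the appendix, each site's update reads its neighbors' current approximating probabilities, so the fixed point is reached by sweeping the grid repeatedly; the per-site softmax plays the role of the local normalization constants.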


REFERENCES

[1] A. Criminisi, G. Cross, A. Blake, and V. Kolmogorov, "Bilayer segmentation of live video," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2006, vol. 1, pp. 53–60.
[2] A. Elgammal, R. Duraiswami, D. Harwood, and L. Davis, "Background and foreground modeling using nonparametric kernel density estimation for visual surveillance," Proc. IEEE, vol. 90, no. 7, pp. 1151–1163, Jul. 2002.
[3] N. Friedman and S. Russell, "Image segmentation in video sequences: A probabilistic approach," in Proc. Conf. Uncertainty Artif. Intell., 1997, pp. 175–181.
[4] I. Haritaoglu, D. Harwood, and L. Davis, "W4: Real-time surveillance of people and their activities," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 809–830, Aug. 2000.
[5] D. Hoiem, C. Rother, and J. Winn, "3D layout CRF for multi-view object class recognition and segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2007.
[6] T. Horprasert, D. Harwood, and L. Davis, "A statistical approach for real-time robust background subtraction and shadow detection," in Proc. FRAME-RATE Workshop, 1999.
[7] S. Kamijo, K. Ikeuchi, and M. Sakauchi, "Segmentations of spatio-temporal images by spatio-temporal Markov random field model," in Proc. EMMCVPR Workshop, 2001, pp. 298–313.
[8] J. Kato, T. Watanabe, S. Joga, J. Rittscher, and A. Blake, "An HMM-based segmentation method for traffic monitoring movies," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 9, pp. 1291–1296, Sep. 2002.
[9] S. Kumar and M. Hebert, "Discriminative fields for modeling spatial dependencies in natural images," Adv. Neural Inf. Process. Syst., pp. 1351–1358, 2004.
[10] J. Lafferty, A. McCallum, and F. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. Int. Conf. Mach. Learning, 2001, pp. 282–289.
[11] S. Z. Li, Markov Random Field Modeling in Image Analysis. Berlin, Germany: Springer-Verlag, 2001.
[12] S. Mahamud, "Comparing belief propagation and graph cuts for novelty detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2006, vol. 1, pp. 1154–1159.
[13] N. Martel-Brisson and A. Zaccarin, "Moving cast shadow detection from a Gaussian mixture shadow model," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2005, vol. 2, pp. 643–648.
[14] I. Mikic, P. Cosman, G. Kogut, and M. Trivedi, "Moving shadow and object detection in traffic scenes," in Proc. Int. Conf. Pattern Recogn., 2000, vol. 1, pp. 321–324.
[15] A. Mittal and N. Paragios, "Motion-based background subtraction using adaptive kernel density estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2004, vol. 2, pp. 302–309.
[16] L.-P. Morency, A. Quattoni, and T. Darrell, "Latent-dynamic discriminative models for continuous gesture recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2007.
[17] N. Paragios and V. Ramesh, "A MRF-based approach for real-time subway monitoring," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2001, vol. 1, pp. 1034–1040.
[18] A. Prati, I. Mikic, M. Trivedi, and R. Cucchiara, "Detecting moving shadows: Algorithms and evaluation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 7, pp. 918–923, Jul. 2003.
[19] A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell, "Hidden conditional random fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 10, pp. 1848–1853, Oct. 2007.
[20] J. Rittscher, J. Kato, S. Joga, and A. Blake, "A probabilistic background model for tracking," in Proc. Eur. Conf. Comput. Vis., 2000, vol. 2, pp. 336–350.
[21] Y. Sheikh and M. Shah, "Bayesian modeling of dynamic scenes for object detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 11, pp. 1778–1792, Nov. 2005.
[22] C. Stauffer and W. Grimson, "Learning patterns of activity using real-time tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 747–757, Aug. 2000.
[23] B. Stenger, V. Ramesh, N. Paragios, F. Coetzee, and J. Buhmann, "Topology free hidden Markov models: Application to background modeling," in Proc. Int. Conf. Comput. Vis., 2001, vol. 1, pp. 294–301.
[24] J. Sun, W. Zhang, X. Tang, and H.-Y. Shum, "Background cut," in Proc. Eur. Conf. Comput. Vis., 2006, vol. 2, pp. 628–641.
[25] A. M. Tekalp, Digital Video Processing. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[26] K. Toyama, J. Krumm, B. Brumitt, and B. Meyers, "Wallflower: Principles and practice of background maintenance," in Proc. Int. Conf. Comput. Vis., 1999, vol. 1, pp. 255–261.
[27] M. Wainwright and M. Jordan, "Graphical models, exponential families, and variational inference," Univ. California, Berkeley, Tech. Rep., 2003.
[28] Y. Wang, K.-F. Loe, and J.-K. Wu, "A dynamic conditional random field model for foreground and shadow segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 2, pp. 279–289, Feb. 2006.
[29] J. Winn and J. Shotton, "The layout consistent random field for recognizing and segmenting partially occluded objects," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2006, vol. 1, pp. 37–44.
[30] Z. Yin and R. Collins, "Belief propagation in a 3-D spatio-temporal MRF for moving object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recogn., 2007.

Yang Wang (M'03) received the B.E. degree in electronic engineering and the M.S. degree in biomedical engineering from Shanghai Jiao Tong University, Shanghai, China, in 1998 and 2001, respectively, and the Ph.D. degree in computer science from the National University of Singapore in 2004. He was with the Institute for Infocomm Research, Rensselaer Polytechnic Institute, and Nanyang Technological University from 2001 to 2006. He has been a Researcher with the Neville Roach Laboratory, National ICT Australia, University of New South Wales, Kensington, Australia, since 2006. He is also a conjoint lecturer in the School of Computer Science and Engineering, University of New South Wales. His current research focuses on visual object segmentation, tracking, and analysis by machine learning and information fusion techniques. He has published more than 30 international conference and journal papers on artificial intelligence, computer vision, and medical imaging.
