IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS

Reasoning About Threats: From Observables to Situation Assessment

Gertjan J. Burghouts and Jan-Willem Marck

Abstract—We propose a mechanism to assess threats based on observables. Observables are properties of persons, i.e., their behavior and their interaction with other persons and objects. We consider observables that can be extracted from sensor signals and intelligence. In this paper, we discuss situation assessment based on observables for threat assessment. In the experiments, the assessment is evaluated for scenarios that are relevant to antiterrorism and crowd control. The experiments are performed within an evaluation framework whose setup allows conclusions to be drawn concerning: 1) the accuracy and robustness of an architecture to assess situations with respect to threats; and 2) the architecture’s dependence on the underlying observables in terms of their false positive and negative rates. One of the interesting conclusions is that discriminative assessment of threatening situations can be achieved by combining generic observables. Situations can be assessed with a precision of 90% at a false positive and negative rate of 15% using only eight learning examples. In a real-world experiment at a large train station, we have classified various types of crowd dynamics. Using simple video features of shape and motion, we propose a scheme to translate such features into observables that can be classified by a conditional random field (CRF). The implemented CRF classifies the crowd dynamics successfully with up to 80% accuracy.

Index Terms—Architecture, evaluation framework, information processing, observables, situation understanding, threat recognition.

I. INTRODUCTION

SITUATION understanding is relevant to the fields of security (e.g., robbery and vandalism), public safety (e.g., aggression and riots), and health care (e.g., incidents with elderly people). Technology is making its entrance in solutions for these domains. More recently, research institutes and technology providers have also focused on the domain of antiterrorism, aiming for solutions that detect threats at an early stage [1]. Detecting a threat at an early stage is important, as it enables security professionals to mitigate the situation. In this paper, we focus on technology to recognize the stages that build up to potential threats. Our objective is a system that alerts for potential threats. To that end, the system assesses the current situation in order to determine whether a threat may occur.

Our approach is to assess the situation from multiple object assessments. These two layers are adopted from the well-known JDL data fusion model and its extensions [2]. Object assessments are based on sensors and intelligence. This is illustrated in Fig. 1. The output of sensors requires automatic processing up to the level of meaningful and robust information, which we will refer to as observables. Intelligence is also taken as an observable. The observables are object assessments and range from camera observables (e.g., a person carries some object) to geo-observables (e.g., a person has a short interaction with another person) to intelligence observables (e.g., the suspect is a person with a gray shirt and black pants). Based on the observables, the situation is assessed. For instance, the observables that a suspect person hands over an object to another person who then mixes into the crowd would result in the situation assessment that somebody is carrying a suspect object. In this paper, we discuss methods to exploit observables for threat assessment and motivate our choices for specific pattern recognition methods. In the experiments, the added value of the implemented method is demonstrated for scenarios that are relevant to antiterrorism and crowd control. The experiments are performed within an evaluation framework whose setup allows conclusions to be drawn concerning: 1) the performance and robustness of an architecture to assess threats; and 2) the architecture’s dependence on the underlying observables in terms of their false positive and negative rates.

Manuscript received December 30, 2009; revised September 20, 2010 and January 25, 2011; accepted March 19, 2011. This paper was recommended by Associate Editor J. Tang. The authors are with The Netherlands Organization for Applied Scientific Research (TNO) Observation Systems, The Hague 2597 AK, The Netherlands (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCC.2011.2135344

Fig. 1. Our framework: outputs of various sensors (JDL level 0) are processed to the higher abstraction level of observables (JDL level 1). Examples of observables are taken from our earlier work, e.g., trajectory analysis, pose estimation, color of clothing, carrying an attribute, behavior recognition, and group dynamics [3]. These observables provide information to the situation estimator (JDL level 2). Specific situations are associated with threats. Situation assessment is the focus of this paper.

A. Previous Work

Many observables have been introduced, ranging from detecting specific objects such as cars [4] to detecting aggression [5], and from generic features of human walking patterns [6] to highly specific human acts such as shaking hands and kissing [7]. In this paper, we focus on properties and behavior of humans that can be detected and recognized in video. The goal of our work is to assess situations beyond the level of isolated behaviors, based on multiple observables. Previously, other researchers have also considered situation assessment based on multiple observables, e.g., the combination of visual and audio features to improve the detection of aggression [8]. With an appropriate combination scheme, the discriminative power of situation assessment can be improved by using multiple observables.

Situation assessment is a challenge, as the situations under investigation can range from instantaneous events (e.g., one person gives an object to another person) to longer periods (e.g., a pickpocket is waiting for the best moment to steal someone’s belongings). In recent literature, researchers tend to focus on either instantaneous events [9] or on complex, long-term person–person or person–object interactions [10]. In this paper, we propose a generic scheme to model and classify both types of situations.

In this paper, we consider only situations that are known a priori. This requires examples from which a system can learn. Interestingly, recent work has shown that it is possible to detect abnormal situations at run time [11]. Such methods are not within the scope of this paper. Another observation from reviewing recent work is that scientific communities have released descriptions and datasets of behaviors of crowds and individuals. These datasets are very interesting and include recordings of single observables to experiment with, e.g., the PETS and TRECVID benchmarks. However, such communities have not yet focused on short- and long-term sequential scenarios. As a consequence, no scenarios are available for antiterrorism and crowd-control applications.

B. Contributions in This Paper

The contribution of this paper is fourfold.
1) Relevant scenarios for antiterrorism and crowd control. We discuss how expert knowledge of incidents can be used. This is crucial, as for the particular incidents related to terrorism, escalation of crowd behavior, and suspect people and objects, few or no data will be available.
2) A tool to generate a multitude of variations of the scenarios. The variations include statistical and systematic error sources.
3) A framework to evaluate both 1) the performance of the situation assessment as a whole and 2) its dependence on the underlying observables. The experiments are performed using a conditional random field (CRF). The outcomes of the evaluation give insight into 1) whether more (complementary) observables are required to assess situations successfully; and 2) whether the robustness of observables needs to be improved in order to be useful for situation assessment.

4) Given the scenarios, their variations, and the framework, we evaluate a system for situation assessment that was implemented for this paper. The system, described in Section III, is based on a CRF that takes observables as input and estimates situations as output. We demonstrate in Section IV that discriminative power can be achieved by combining generic observables.

This paper is organized as follows. In Section II, we discuss the scenarios and the tool to generate their variations. In Section III, we propose the CRF that learns the best relation between observables and the assessment of situations. In Section IV, the experimental setup is discussed. In Section V, the CRF is evaluated for the scenarios and their variations; in addition, the CRF is evaluated against a real-world dataset. Section VI concludes this paper.

II. SCENARIOS

In this section, we define the scenarios to which our system (see Section III) is applied and on which the experiments will be based (see Section IV). A scenario is interpreted as a sequence of states. A state is a description of the situation at a particular time, for example, “a group of people is agitated.” Note that an expert may be consulted to associate each state with a level of escalation or alarm, to bridge the gap to the decision maker. For each subsequent period, a number of observables are observed. Observables are observed object properties, for example, “a person enters the scene.” A scenario is specified by the sequence of states, their durations, and the observables that are observed for each subsequent period. By this definition, the observables may vary from time to time. Different sets of observables may be observed for the same state at different periods. There is no direct coupling between states and observables. In Section III, we select a pattern recognition method to learn the best probabilistic relation between states and observables. In Section IV, we evaluate how well this learned relation predicts the states for unseen variations of the scenarios. Observables may be chosen such that they relate to specific persons or groups of persons that are of particular interest, for example, “a suspect person enters the scene” or, even more specifically, “John enters the scene.” In this paper, we consider observables that are generic, e.g., “a person enters the scene,” “a suspect person is in the scene,” etc. Interestingly, in Section IV, we demonstrate that discriminative power can be achieved by combining generic observables.

A. Observables

We have selected the following observables: empty square, one person, two to five persons, many people, people flux +, people flux −, sudden large people flux, tracks toward hotspot, tracks in hotspot, group formation, group moves, group on collision, individual is avoided by other people, individual’s head orientation varies, individual carries object, individual has wild gestures, somebody wears clothes related to a suspect person. The observables have been adopted from interviews with security professionals and from a training course in The Netherlands (police academy) on “search, detect, and react” (based on an Israeli security training). The observables are illustrated in Fig. 1.

B. Antiterrorism and Crowd-Control Scenarios

Five scenarios for antiterrorism and crowd control are proposed: square is revolting, aggressive political speaker, extreme demonstration, neutral, and fight between a few people. The situation at a given time is formalized by one of the following states: neutral, group with agitator, other people are avoiding the situation, group is coordinating something, group is agitated, and group riot. The scenarios are defined in the format {[state_1, [observable_{1,1}, observable_{1,2}, ..., observable_{1,N}], length_1], ..., [state_T, ...]}, where the variable “length” refers to the length of the period of the current state. The number of observables may vary per period. These formal specifications are illustrated by the code sketch after the scenario list below. First, we present the five scenarios in natural language.
1) Square is revolting: market square, many people moving randomly, two people are loitering, a group assembles just outside the market square, one group member is agitated, the group starts moving, the loitering people join the group, another group starts to move, one of the group members carries something, people are avoiding the two groups, the two groups confront each other, fight, people fleeing.
2) Aggressive political speaker: somebody starts to speak in the midst of people, people are moving around, people start to listen, people are walking away.
3) Extreme demonstration: people are avoiding the group that is demonstrating, the atmosphere is tense, little riot, the demonstrating group is moving.
4) Fight between a few people: some people are agitated, one troublemaker is starting a fight, small fight, people are avoiding the fight, fight stops.
5) Default: neutral situation, people moving, many people enter buildings, a small group is walking by, the square is getting more crowded, another group walks by.
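To make this format concrete, the following is a minimal sketch in Python of how such a scenario specification could be encoded and unrolled into per-timestep data. The state and observable names are taken from this section, but the durations, the per-state observable subsets, and the helper names are illustrative assumptions for this example, not the actual specification used by our scenario generator.

# A few of the generic observables from Section II-A (not the full list).
OBSERVABLES = [
    "many people", "people flux +", "group formation", "group moves",
    "group on collision", "individual is avoided by other people",
    "individual carries object", "individual has wild gestures",
    "sudden large people flux", "tracks in hotspot",
]

# Hypothetical encoding of one scenario as a sequence of
# (state, active observables, duration in timesteps) periods, following
# the format {[state_1, [observable_1_1, ...], length_1], ...}.
# Durations and per-state observables are illustrative only.
SQUARE_IS_REVOLTING = [
    ("neutral", ["many people", "people flux +"], 20),
    ("group with agitator", ["group formation", "individual has wild gestures"], 15),
    ("group is coordinating something", ["group moves", "individual carries object"], 10),
    ("other people are avoiding the situation",
     ["group on collision", "individual is avoided by other people"], 10),
    ("group riot", ["sudden large people flux", "tracks in hotspot"], 25),
]

def to_timesteps(scenario, all_observables=OBSERVABLES):
    """Unroll a scenario into per-timestep binary observable vectors (X)
    and a ground-truth state label per timestep (y)."""
    X, y = [], []
    for state, active, length in scenario:
        vec = [1 if obs in active else 0 for obs in all_observables]
        for _ in range(length):
            X.append(list(vec))
            y.append(state)
    return X, y

X, y = to_timesteps(SQUARE_IS_REVOLTING)  # 80 timesteps, 10-dimensional binary vectors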

C. Scenario Generation Including Statistical and Systematic Errors

The scenarios are transformed into datasets. The inputs are the observables (i.e., binary values) and the ground truth is the state per timestep. Four types of noise can be added to the observables.
1) Statistical errors: random false positives and negatives.
2) Systematic errors, fade in/out: false positives are added in front of active observables, or at the end.
3) Systematic errors, ambiguity: false positives are added for a particular observable, and false negatives are added randomly.
4) Systematic errors, clutter: for each state, a state-dependent clutter effect is generated, which results in more false positives and negatives when more clutter is apparent.

Fig. 2. Example of increasing statistical errors for the scenario “square is revolting.” On the horizontal axis, the timesteps are shown. The vertical axis indicates the observables (each row represents one observable). (a) FPR = FNR = 0%. (b) FPR = FNR = 10%. (c) FPR = FNR = 20%.

Fig. 3. Example of the nontrivial relation between (a) observables and (b) states. The relation between observables and states is not obvious and is probabilistic in nature. Learning this probabilistic relation is the objective of this section and the experiments.

All types of noise will be expressed in terms of false positive ratio (FPR) and false negative ratio (FNR) in the experiments. Fig. 2 illustrates the effect of various FPRs and FNRs on the datasets.
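As an illustration of the statistical error model (random false positives and negatives at given FPR/FNR levels), a minimal sketch follows. The function name and interface are ours for this example, not the scenario generator's actual implementation; the systematic error modes would need additional state- and time-dependent logic on top of this.

import random

def add_statistical_errors(X, fpr=0.15, fnr=0.15, seed=None):
    """Flip binary observables: inactive observables become false
    positives with probability fpr, active observables become false
    negatives with probability fnr."""
    rng = random.Random(seed)
    noisy = []
    for vec in X:
        row = []
        for v in vec:
            if v == 0:
                row.append(1 if rng.random() < fpr else 0)
            else:
                row.append(0 if rng.random() < fnr else 1)
        noisy.append(row)
    return noisy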

III. FROM OBSERVABLES TO SITUATION ASSESSMENT

Recall from the previous section that the observables may be noisy. In this section, we exploit a CRF to learn the best probabilistic relation between states and observables. To assess the current state of the situation, the observables are simply considered all together: the situation assessment is based on a “bag of observables.” More specifically, in this paper, we interpret situation assessment as a mapping problem from the “bag of observables” to a situation state. The observables are perceptual concepts. The situation states are implicit concepts that are not directly observed from the data. Fig. 3 makes this clear: there is no obvious relation between the observables and the states. The situation state can be interpreted as a hidden parameter of the observables, which makes the state prediction nontrivial. In Section V, we evaluate whether the proposed CRF is a good technique to model the probabilistic relations between observables and situation states.

A. Conditional Random Field

A CRF [12] is a type of discriminative probabilistic model that is most often used to label or parse sequential data, such as natural language text or biological sequences. CRFs are a more general form of hidden Markov models (HMMs) [13] and relax the Markov assumption over the sequential data that is characteristic of the HMM. This allows a CRF model to use information from time-separated data [12]. We made the observation that situations in real life evolve in certain sequential orders. The evolution over time of the scenario states resembles the kind of probabilistic sequences of natural language. These sequences can be modeled by an undirected graphical model in which each vertex represents a state. The edges between the vertices can be understood as a dependence between states. There are two main ways to model such graphical dependences: the HMM and the CRF. We choose the CRF over the HMM because it has two advantages that are important to our problem. Both advantages are due to the fact that the CRF is a discriminative model instead of a generative model [14]. A discriminative model models the conditional probability P(Y|X). A generative model is more widely applicable, as it can be used to calculate the joint probability P(X,Y) and all marginals of the joint, for example, P(X) or P(Y). A discriminative model is limited to predicting the state Y given a certain sequence of observations X. When fitting a generative model, one approximates both P(X) and P(Y|X) instead of merely P(Y|X) as in the case of the discriminative model. This results in a number of advantages.
1) A discriminative model is independent of P(X). This means that estimating P(X) for rarely occurring states is not a problem, which results in better accuracy for rare scenarios.
2) It can be trained in one setting and tested in a different setting with another P(X), since it is independent of P(X) and of dependences within P(X).
3) A discriminative model has better predictive performance for P(Y), since it is trained to classify according to P(Y|X) instead of modeling the joint probability. No additional effort is spent on fitting P(X).
Usage of the CRF consists of two phases: the data-fitting phase and the inference phase. In the data-fitting phase, the model parameters are fitted to a training dataset. The best fit is the maximum of the (log-)likelihood of these parameters given the training dataset. The likelihood is a multidimensional concave function, and its optimum gives the best fit of the parameters. In effect, two matrices are learned, which are used by the CRF to classify newly observed observables into the next state: one captures the dependences between observables and states, and the other captures the dependences from the previous state to the next state. The observable-state matrix also captures the relative importance of each observable for each state. In the inference phase, the learned model is used to estimate P(Y) from a given X. When estimating P(Y), the forward–backward algorithm [13] is used to refine the estimate: it finds the most probable path in time over the states in P(Y) [13] and smooths the probabilities of P(Y).

Fig. 4. Example of a CRF. (a) Weights for the observables-to-current-state potential. (b) Weights of the previous-state-to-current-state potential. (c) Cumulative probabilities. (d) Estimated state and ground truth.

An example of a CRF is illustrated in Fig. 4. This example is used in Section V-B, where the precise states and
observables (in a real-world experiment) are further explained. For the present illustration, the naming of states and observables is not yet important; here, we want to discuss the underlying mechanism. We illustrate the weighting and potentials of observables and the estimation of states. The weights of the observables are shown in Fig. 4(a). The weights from the previous state to the current state are shown in Fig. 4(b). Together, they influence the estimation of the current state. Positive weights excite the corresponding state, whereas negative weights inhibit that state. This is exemplified by the cumulative probabilities in Fig. 4(c). The state with the highest probability becomes the estimated state [see Fig. 4(d)].
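To illustrate this mechanism (and not the actual trained model or the MATLAB toolkit used in the experiments), the following minimal sketch combines a hypothetical observable-to-state weight matrix and a previous-state-to-current-state weight matrix into per-state potentials and picks the most probable state at each timestep. A full linear-chain CRF would instead normalize over and decode the whole sequence, e.g., with the forward–backward or Viterbi algorithm.

import numpy as np

def estimate_states(X, W_obs, W_trans):
    """Greedy illustration of the mechanism in Fig. 4: at each timestep,
    observable weights (W_obs: n_states x n_observables) and transition
    weights from the previous state (W_trans: n_states x n_states) are
    summed into a potential per state; after exponentiation and
    normalization, the most probable state is selected."""
    n_states = W_obs.shape[0]
    prev = np.zeros(n_states)          # no previous-state information at t = 0
    estimates = []
    for x in X:                        # x: binary observable vector
        potential = W_obs @ np.asarray(x, dtype=float) + W_trans.T @ prev
        prob = np.exp(potential - potential.max())
        prob /= prob.sum()
        estimates.append(int(prob.argmax()))
        prev = prob
    return estimates

# Example with 3 hypothetical states and 4 observables:
# W_obs = np.random.randn(3, 4); W_trans = np.random.randn(3, 3)
# print(estimate_states([[1, 0, 0, 1], [0, 1, 1, 0]], W_obs, W_trans))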

IV. EXPERIMENTAL SETUP

We model the situation state as the hidden state of the CRF, and the observables are the observations of the hidden state.

A. Conditional Random Field Evaluation Objectives

We evaluate how well the relation between states and observables can be modeled by a CRF. We are interested in the following properties of the CRF.
1) Discriminative power: How well does the CRF predict the situation states for unseen variations of the scenarios? This is an interesting question, as the observables are generic and, hence, not tailored to the specific scenarios and their states.
2) Robustness: How sensitive is the CRF with respect to statistical and systematic errors?
3) Capacity to generalize: How many examples of a given set of scenarios are required to obtain a robust CRF?

B. Evaluation Tool

To test the discriminative power of the CRF for situations, we adapted Schmidt's MATLAB CRF implementation [15]. We implemented a scenario generator to generate the scenarios (see Section II-B). To this tool, we also added the capability to apply the four noise modes (see Section II-C) to the observables of the scenarios in order to measure the performance of the CRF approach against the various noise models.

C. Datasets

We test each noise variable at increasing amounts of noise. The datasets are generated from the scenarios in Section II and are subsequently contaminated with increasing levels of noise. The types of noise are statistical (random false positives/negatives) and systematic (fade ins/outs, ambiguity, and clutter). All types of noise are expressed in terms of false positive and negative ratios in the experiments.

D. Cross Validation

For each noise setting, 48 repetitions were generated. Each repetition contains the entire set of five scenarios (see Section II-B). For each noise parameter setting, a 12-fold cross-validation is performed. Each test fold contains four repetitions and is tested against a training set that contains N randomly selected repetitions out of the 48 repetitions that are generated for the current parameter setting. The averaged results over the folds are the CRF performance results for the selected parameter setting.

Fig. 5. Results for the experiment with 16, 8, 4, 2, and 1 scenario variations to learn from. Results are shown for nine fixed samples of FPR and FNR. For each FPR/FNR sample, we indicate how the CRF classification accuracy drops as a consequence of fewer learning examples. This is shown by each bar, where the accuracy drops from red (almost 100%) to green (just above 50%). This effect is worse for higher FPR/FNR (e.g., the bar in the upper-right corner). A good balance between the number of examples and the performance of the CRF is obtained for eight examples (the second value from the left in each colored bar).
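A minimal sketch of this fold construction follows. The train_and_score argument is a caller-supplied stand-in for CRF training and evaluation, and drawing the training repetitions from the remaining (non-test) repetitions is our reading of the protocol.

import random

def crossvalidate(repetitions, train_and_score, n_folds=12, train_size=8, seed=0):
    """12-fold protocol sketch: 48 repetitions are split into folds of
    four test repetitions each; every fold is evaluated against a model
    trained on train_size repetitions drawn at random from the remaining
    repetitions. train_and_score(train, test) returns an accuracy."""
    rng = random.Random(seed)
    fold_size = len(repetitions) // n_folds          # 48 / 12 = 4
    scores = []
    for k in range(n_folds):
        test = repetitions[k * fold_size:(k + 1) * fold_size]
        rest = repetitions[:k * fold_size] + repetitions[(k + 1) * fold_size:]
        train = rng.sample(rest, train_size)
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)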

E. Training the Conditional Random Field

For each test fold, the CRF is trained on the randomly selected training set. The data fitting, i.e., the optimization of the log-likelihood, is performed by the L-BFGS algorithm [16]. This quasi-Newton algorithm is a standard choice for large multidimensional optimization problems such as CRF training [17]. The optimization ends when the average error on the training set is smaller than 10^-5 or after 500 iterations.
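The experiments use the L-BFGS routine of the adapted MATLAB toolkit. Purely as an illustration of how such a fitting step can be set up, here is a hedged sketch with SciPy's L-BFGS-B, in which neg_log_likelihood and grad are caller-supplied placeholders for the CRF objective and its gradient (not implemented here); the stopping criteria mirror those reported above.

import numpy as np
from scipy.optimize import minimize

def fit_crf(neg_log_likelihood, grad, n_params, max_iter=500, tol=1e-5):
    """Minimize the negative log-likelihood of the CRF parameters with
    L-BFGS. Since the log-likelihood is concave, its negative is convex
    and has a single optimum."""
    theta0 = np.zeros(n_params)
    result = minimize(neg_log_likelihood, theta0, jac=grad,
                      method="L-BFGS-B",
                      options={"maxiter": max_iter, "ftol": tol})
    return result.x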

F. Evaluation Measures

The results of the CRF are presented as confusion matrices. Following convention, each confusion matrix contains the ground-truth situation states on the vertical axis and the states predicted by the CRF on the horizontal axis. Hence, we aim for a diagonal with high prediction accuracies. We expect that, for higher FPRs and FNRs, the predicted states become unreliable, which will cause the confusion matrices to have lower values on the diagonal. An increasing state index indicates an increasing alarm level. In the case of errors (i.e., wrongly predicted states), we hope that the errors are not distributed among all states but remain confined to neighboring states, such that the perturbing effect on wrongly predicted alarm levels remains limited.
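As a small illustration of this evaluation (with made-up state indices rather than results from the paper), the confusion matrix and per-state accuracies could be computed with scikit-learn as follows.

from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth and predicted state indices for one test fold;
# rows of the matrix are true states, columns are predicted states, so
# high values on the diagonal indicate accurate state prediction.
y_true = [0, 0, 1, 1, 2, 2, 3, 4, 5, 5]
y_pred = [0, 0, 1, 2, 2, 2, 3, 4, 5, 4]
cm = confusion_matrix(y_true, y_pred, labels=range(6))
per_state_accuracy = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(per_state_accuracy)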

Fig. 6. Situation assessment: results. Experiment: only one scenario variation to learn from, for various types of noise (see captions directly under the figure) and the degree of noise (see false positive and negative ratios). Confusion matrices for each of the six situation states are displayed for various FPRs and FNRs. Up to FPR or FNR of 15%, the CRF is surprisingly accurate to assess situations. (a) Statistical errors. (b) Systematic: fade ins. (c) Systematic: fade outs. (d) Systematic: fade ins and outs. (e) Systematic: ambiguity. (f) Systematic: clutter.

V. RESULTS

First, we focus on experiments on synthetic data where we can control the sources and amount of noise (see Section V-A). Second, we show experimental results on real-world data (see Section V-B).

A. Synthetic Data

The results in this section are used to evaluate the capacity to generalize (see Section V-A1), the discriminative power (see Section V-A2), and the robustness (see Section V-A2).

1) Effect of the Size of the Training Set: Our first objective is to find a good balance between the number of learning examples used to train the CRF and the performance of the CRF. If the number of learning examples is low, then fewer examples have to be collected and training takes less time. However, with fewer learning examples, the performance of the CRF will be lower. To evaluate this tradeoff in detail, we investigate the results of the CRF approach at increasing FPR and FNR levels. Fig. 5 shows the accuracy of the CRF for various sizes of the training set, which has been decreased from 16 to smaller sets of 8, 4, 2, and 1 training example(s). The figure shows at each FPR (horizontal coordinate) and FNR (vertical) a bar of values, one for each size of the training set. Each value represents the CRF's accuracy of the situation assessment, where the accuracy has been averaged equally over the situation states. With 16 examples and errors up to 30%, the average accuracy of the CRF is 81%. With fewer errors, up to 15%, the accuracy is 94%. With only one example, the accuracy drops, respectively, to 50% (from 81%) and 63% (from 94%). A good tradeoff seems to be eight examples, which achieves only slightly lower accuracy (i.e., 4% lower than with 16 examples) while using only half of the examples. Situations can be assessed with a precision of 90% at a false positive and negative rate of 15% using only eight learning examples.

2) Effect of Statistical and Systematic Errors on the CRF: Our second objective is to find out where the errors of the CRF (wrongly classified states) occur and how they are distributed over the states. To that end, we need a more detailed figure. Rather than displaying a single value that represents the accuracy averaged over the states, we display the confusion matrix for a given size of the training set. As an extreme test case, we display results for a training set of one example only. In addition, the results are split into the various noise sources that we identified earlier, each producing one subfigure [see Fig. 6(a)–(f)]. The captions indicate the error sources: there are four error sources, where the fade in/out source is separated into three related but different cases, which results in six subfigures in total. The line around each of the confusion matrices indicates the average prediction accuracy, which has been weighted equally over all states. Fig. 6 altogether is meant to give detailed insight into how the error sources and the levels of FPR and FNR deteriorate the results. Up to an FPR or FNR of 15% for any type of noise, the CRF is surprisingly accurate at assessing situations. The accuracy starts to deteriorate above 15% FPR or FNR or both. Interestingly, the CRF is not very sensitive to fading effects or ambiguity (we introduced correlations) within the observables. In contrast, the CRF is sensitive to random noise and to clutter (state-specific levels of random noise). The clutter case is the most difficult, as the noise is different for each state. As a consequence, the noise increases and decreases over time. The CRF is least robust to noise levels that vary over time.

Fig. 7. Video descriptors to capture the crowd dynamics in a compact way. This figure shows 12 out of the 104 descriptors.

TABLE I RESULTS FOR THE CLASSIFICATION OF CROWD DYNAMICS BY A CRF

B. Real-World Data

In this section, we consider an experiment that is based on real data. The experiment is relevant for security and safety applications. We have recorded and annotated 4 h of video data at the second largest train station in The Netherlands. The video data include many and diverse behavioral patterns of the crowd, of which a limited set has been classified by a CRF. For this real-world experiment, we have adopted state-of-the-art video features that capture local structure and temporal dynamics. Given these features, we adopt a quantization scheme to translate the many high-dimensional video features into a fixed number of low-dimensional observables. Based on these spatiotemporal observables, various crowd dynamics are classified by the CRF. The goal is to classify three types of crowd dynamics: “normal” (10–100 people), “abandoned” (0–10 people), and “rush” (many people moving to or from the tram platforms). In the recorded video, these three event classes are correspondingly labeled by hand.

1) Describing the Crowd: Video Features: To describe the crowd, we aim to capture its temporal dynamics (the general movement of, e.g., heads, shoulders, and legs). We tailor existing video features for this task. We start with optical flow [18], which is computed at each pixel of the video images and yields a magnitude and an orientation per pixel. We summarize these flows in spatial bins to reduce the amount of data. We choose 104 fixed bins that are distributed equally
over the image. The 104 spatial bins are obtained as follows. The video is of HD quality; hence, each image frame is 1080 × 1440 pixels. We aim to sample approximately every 100 pixels. Discounting the border areas, the grid becomes 8 × 13, which yields 104 spatial bins. Next, in each spatial bin, the flow vectors are binned into eight orientations and four magnitudes. This approach is inspired by other work, e.g., the scale-invariant feature transform [19], where discriminative yet compact image features are extracted by summarizing many pixel values into a few orientations and magnitudes. For adequate continuous weighting of the optical flows, the contribution of each flow vector to each bin is determined by a kernel-based weight. With our approach, 104 × 8 × 6 values, i.e., 104 descriptors of length 48, are obtained per video frame. This procedure captures the dynamics of the crowd in a compact way. An illustration is depicted in Fig. 7.

Fig. 8. Examples of errors for each type of crowd dynamics, where the CRF misclassified the state of the crowd (between parentheses). (a) “Normal” (“Rush”). (b) “Abandoned” (“Normal”). (c) “Rush” (“Normal”).

2) Transforming the Video Features Into Observables: We experimentally established that the CRF is able to learn the three types of crowd dynamics when the input is bounded at a maximum of 40 observables. Recall that the video features yield 104 × 48 = 4992 values, i.e., much more than the maximum of 40 observables. Hence, a translation is needed that reduces the video features to approximately 40 observables. To do so, we adopt a method of quantization: visual codebooks [20]. In a visual codebook, the descriptors are assigned to a smaller selection of representative descriptors (also called primitives). This yields a histogram, the length of which is determined by the number of primitives. We choose eight primitives, which results in a histogram of length 8. Next, we discretize each bin (of the eight-bin histogram) into four levels, to obtain 8 × 4 binary values. Recall that the CRF is especially suited to learn from binary values. In this translation procedure, the 104 multivalued video features are translated into 32 binary values. These 32 binary values are the observables on which the CRF is trained and tested.
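As an illustration of this feature-to-observable pipeline (a sketch under simplifying assumptions, not our exact implementation), the following Python code computes dense optical flow with OpenCV's Farneback method as a stand-in for the Horn-Schunck flow of [18], summarizes it per grid cell into an orientation/magnitude histogram, and quantizes the per-cell descriptors against a small codebook into binary observables. The kernel-based weighting is omitted, the per-cell binning is simplified to 8 × 4 values, the thermometer-style discretization into four levels is our own reading of the text, and the codebook would be learned offline, e.g., by k-means clustering of descriptors from training frames.

import cv2
import numpy as np

def flow_descriptors(prev_gray, gray, grid=(8, 13), n_ori=8, n_mag=4, max_mag=10.0):
    """Per-frame crowd descriptors: dense optical flow (Farneback here,
    as a stand-in for the Horn-Schunck flow of [18]) is summarized per
    spatial grid cell into an orientation/magnitude histogram.
    Returns an array of shape (grid[0]*grid[1], n_ori*n_mag)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    h, w = gray.shape
    rows, cols = grid
    descriptors = []
    for r in range(rows):
        for c in range(cols):
            ys, ye = r * h // rows, (r + 1) * h // rows
            xs, xe = c * w // cols, (c + 1) * w // cols
            m = np.minimum(mag[ys:ye, xs:xe].ravel(), max_mag)
            a = ang[ys:ye, xs:xe].ravel()
            hist, _, _ = np.histogram2d(a, m, bins=(n_ori, n_mag),
                                        range=((0, 2 * np.pi), (0, max_mag)))
            descriptors.append(hist.ravel())
    return np.array(descriptors)

def descriptors_to_observables(descriptors, codebook, n_levels=4):
    """Quantize per-cell descriptors against a small codebook of
    primitives (nearest neighbour), build a normalized codeword
    histogram, and discretize each histogram bin into n_levels binary
    values (thermometer encoding; one possible reading of the text),
    yielding 8 x 4 = 32 binary observables for the CRF."""
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)
    hist = np.bincount(words, minlength=len(codebook)) / len(words)
    thresholds = np.linspace(0.0, 1.0, n_levels, endpoint=False)  # 0, .25, .5, .75
    return np.array([[1 if h > t else 0 for t in thresholds] for h in hist]).ravel()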

3) Results: Classification of Crowd Dynamics: All examples of the three classes are taken from the 4-h videos. These examples are subsampled at 2 frames/s. This choice reduces the amount of stored data and reflects a limitation posed by the processing speed. We argue that 2 frames/s is sufficiently fast for awareness of crowd dynamics, yet the variation between samples is significant. For each frame, the video features are computed and translated into observables. Together with the classification of observables into a next state, this procedure can be performed at a sampling rate of 2 frames/s.

The trained CRF, with its weights for the observables and the state transitions, has been shown in Fig. 4 and was discussed in Section III. For the training of the CRF, the transitions between states are essential. In our real-world dataset (i.e., the examples of the classes), we found that the number of transitions is limited. This poses a challenge to create distinct yet rich training and test sets. We choose to solve this as follows. We perform a fourfold cross-validation, subsampling every Nth encoded video frame into the Nth fold. This way, the folds are disjoint and different enough, as they are at minimum 2 s apart (given the sampling of 2 frames/s and four folds). That is, the scene is changing rapidly: going from one video frame to the next, with 2 s in between, gives an enormous variation in the images. People have walked significantly, new people have entered, other people have exited the scene, and people have moved closer together or farther apart. Following conventions, we train on one fold and test on another. This experiment is repeated four times. The results are summarized by the confusion matrix in Table I. Interestingly, the CRF classifies some states more accurately than others. The CRF successfully classifies the crowd dynamics with up to 80% accuracy. Examples of errors are shown in Fig. 8. An error that occurs regularly is a badly timed state transition of the estimation. Usually, the transition is to the right state but is too early or too late. This is illustrated in Fig. 8(b) and (c). Sometimes there are “true misclassifications;” this is illustrated in Fig. 8(a).

VI. CONCLUSION

We have proposed scenarios that are relevant for antiterrorism and crowd control. Given the scenarios and their variations, including various kinds of errors, we have assessed situations based on observables with the objective to recognize threats. We have chosen the CRF as a method to probabilistically learn the relation between observables and states (as the hidden parameter). A good size for the training set was shown to be eight examples for each of the five scenarios, which achieves only slightly lower accuracy (i.e., 4%) than with 16 examples while using only half of the examples. Situations can be assessed with a precision of 90% at a false positive and negative rate of 15% using only eight learning examples.

We have investigated the extreme test case of using a training set of only one example to find out where the CRF becomes inaccurate. Up to an FPR or FNR of 15% for any type of noise (random, fade in/out, ambiguity, and clutter), the CRF is surprisingly accurate at assessing situations. The accuracy starts to deteriorate above 15% FPR or FNR or both. The CRF is not very sensitive to fading effects or ambiguity but is sensitive to random noise and to clutter (i.e., state-specific levels of random noise). In a real-world experiment at a large train station, we have classified various types of crowd dynamics. Using simple video features of shape and motion, we have proposed a scheme to translate such features into observables that can be classified by a CRF. The CRF successfully classifies the crowd dynamics with up to 80% accuracy. Overall, we conclude this paper with the interesting finding that discriminative power can be achieved by combining multiple, generic, and simple observables.

ACKNOWLEDGMENT

The authors would like to thank Dr. R. den Hollander for providing the camera-based observables and their characteristics. They are also grateful to Dr. K. Schutte for useful discussions on inferring the situation from observables.

REFERENCES

[1] M. M. Kokar, “Situation awareness: Issues and challenges,” in Proc. Int. Conf. Inf. Fusion, 2004, pp. 533–534.
[2] J. Llinas, C. Bowman, G. Rogova, and A. Steinberg, “Revisiting the JDL data fusion model II,” in Proc. Int. Conf. Inf. Fusion, 2004, pp. 1218–1230.
[3] G. J. Burghouts, B. Broek, B. G. Alefs, E. den Breejen, and K. Schutte, “Automated indicators for behavior interpretation,” in Proc. Int. Conf. Crime Detect. Prevent., 2007, pp. 1–6.
[4] S. Agarwal, A. Awan, and D. Roth, “Learning to detect objects in images via a sparse, part-based representation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1475–1490, Nov. 2004.
[5] A. Datta, “Person-on-person violence detection in video data,” in Proc. Int. Conf. Pattern Recognit., 2002, pp. 433–438.
[6] D. M. Gavrila, “The visual analysis of human movement: A survey,” Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82–98, 1999.
[7] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, “Learning realistic human actions from movies,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[8] W. Zajdel, D. Krijnders, T. Andringa, and D. Gavrila, “CASSANDRA: Audio-video sensor fusion for aggression detection,” in Proc. Int. Conf. Adv. Video Signal Based Surveillance, 2007, pp. 200–205.
[9] L. Duan, D. Xu, I. W. Tsang, and J. Luo, “Visual event recognition in videos by learning from web data,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1959–1966.
[10] D. Küttel, M. Breitenstein, L. van Gool, and V. Ferrari, “What’s going on? Discovering spatio-temporal dependencies in dynamic scenes,” in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1951–1958.
[11] T. Xiang and S. Gong, “Video behaviour profiling for anomaly detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 893–908, May 2008.
[12] A. McCallum and C. Sutton, “An introduction to conditional random fields for relational learning,” in Introduction to Statistical Relational Learning, L. Getoor and B. Taskar, Eds. Cambridge, MA: MIT Press, 2006. [Online]. Available: http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf

[13] L. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[14] I. Ulusoy and C. Bishop, “Generative versus discriminative methods for object recognition,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2005, vol. 2, pp. 258–265.
[15] M. Schmidt, “CRF toolkit in MATLAB,” 2008. [Online]. Available: http://people.cs.ubc.ca/~schmidtm/Software/
[16] D. C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale optimization,” Math. Program., vol. 45, no. 3, pp. 503–528, 1989.
[17] F. Sha and F. Pereira, “Shallow parsing with conditional random fields,” in Proc. HLT-NAACL, 2003, pp. 213–220. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.9849
[18] B. K. P. Horn and B. G. Schunck, “Determining optical flow,” Artif. Intell., vol. 17, nos. 1–3, pp. 185–203, 1981.
[19] D. G. Lowe, “Object recognition from local scale-invariant features,” in Proc. Int. Conf. Comput. Vis., 1999, pp. 1–8.
[20] F. Jurie and B. Triggs, “Creating efficient codebooks for visual recognition,” in Proc. Int. Conf. Comput. Vis., 2005, pp. 604–610.

Gertjan J. Burghouts received the Ph.D. degree from the University of Amsterdam, Amsterdam, The Netherlands, in 2007 on the topic of visual recognition of objects and their motion in realistic scenes with varying conditions. He is currently a Lead Research Scientist of Visual Pattern Recognition with The Netherlands Organization for Applied Scientific Research (TNO), The Hague, The Netherlands. He studied artificial intelligence at the University of Twente during 1997–2002 with a specialization in pattern analysis and human–machine interaction. Since 2007, he has been the Principal Investigator of automated understanding of human behavior based on sensory perception. He is the Principal Investigator of a DARPA project named CORTEX (2.3M), about recognition of events and behaviors. He has written papers on this topic in internationally renowned journals, e.g., the IEEE TRANSACTIONS ON IMAGE PROCESSING, Computer Vision and Image Understanding, and the International Journal of Computer Vision, and at the International Conference on Crime Detection and Prevention. His work has been cited more than 200 times since 2005. Dr. Burghouts received an award from the Netherlands Association of Engineers for the best innovative project in 2007.

Jan-Willem Marck received the M.Sc. degree in artificial intelligence from the University of Groningen, Groningen, The Netherlands, in 2006, with a specialization in autonomous systems. He is currently a Research Scientist of Artificial Intelligence with the Department of Distributed Sensor Systems, The Netherlands Organization for Applied Scientific Research (TNO), The Hague, The Netherlands. Since 2006, he has been a Research Scientist on various topics, e.g., sensor information fusion, relevance of information, human–machine interaction, and human behavior classification based on state estimation methods using sensory data. He has (co-)authored a number of papers and managed projects on these topics.

661 662 663 664 665 666 667 668 669 670 671 672 673 674 Q2 675

676 677 678

QUERIES

Q1: Author: Please verify the affiliation of the authors as typeset. Q2. Author: Please verify the current location of author “Jan-Willen Marck.”

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS

Reasoning About Threats: From Observables to Situation Assessment

1

2

Gertjan J. Burghouts and Jan-Willem Marck

3

4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26

Abstract—We propose a mechanism to assess threats that are based on observables. Observables are properties of persons, i.e., their behavior and interaction with other persons and objects. We consider observables that can be extracted from sensor signals and intelligence. In this paper, we discuss situation assessment that is based on observables for threat assessment. In the experiments, the assessment is evaluated for scenarios that are relevant to antiterrorism and crowd control. The experiments are performed within an evaluation framework, where the setup is such that conclusions can be drawn concerning: 1) the accuracy and robustness of an architecture to assess situations with respect to threats; and 2) the architecture’s dependence of the underlying observables in terms of their false positive and negative rates. One of the interesting conclusions is that discriminative assessment of threatening situations can be achieved by combining generic observables. Situations can be assessed with a precision of 90% at a false positive and negative rate of 15% using only eight learning examples. In a real-world experiment at a large train station, we have classified various types of crowd dynamics. Using simple video features of shape and motion, we have proposed a scheme to translate such features into observables that can be classified by a conditional random field (CRF). The implemented CRF shows to classify successfully the crowd dynamics up to 80% accuracy.

27 28 29

Index Terms—Architecture, evaluation framework, information processing, observables, situation understanding, threat recognition.

I. INTRODUCTION

30 31 32 33 34 35 36 37 38 39 40 41 42 43

Q1

1

ITUATION understanding is relevant to the field of security (i.e., robbery and vandalism), public safety (i.e., aggression and riots), and health care (i.e., incidents with elderly people). Technology is making its entrance in solutions for these domains. More recently, research institutes and technology providers have focused also on the domain of antiterrorism, aiming for solutions that detect threats in an early stage [1]. To detecting a threat at an early stage is important, as it enables security professionals to mitigate the situation. In this paper, we focus on technology to recognize the stages that build up to potential threats. Our objective is a system to alert for potential threats. To that end, the system assesses the current situation in order to

S

Manuscript received December 30, 2009; revised September 20, 2010 and January 25, 2011; accepted March 19, 2011. This paper was recommended by Associate Editor J. Tang. The authors are with The Netherlands Organization for Applied Scientific Research (TNO) Observation Systems, The Hague 2597 AK, The Netherlands (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCC.2011.2135344

Fig. 1. Our framework: outputs of various sensors (JDL level 0) are processed to the higher abstraction level of observables (JDL level 1). Examples of observables are taken from our earlier work, e.g., trajectory analysis, pose estimation, color of clothing, carrying an attribute, behavior recognition, and group dynamics [3]. These observables provide information to the situation estimator (JDL level 2). Specific situations are associated with threats. Situation assessment is the focus of this paper.

determine whether a threat may occur. Our approach is to assess the situation from multiple object assessments. These two layers are adopted from the well-known JDL data fusion model and its extensions [2]. Object assessments are based on sensors and intelligence. This is illustrated in Fig. 1. The output of sensors requires automatic processing up to the level of meaningful and robust information, which we will refer to as observables. Intelligence is also taken as an observable. The observables are object assessments and range from camera observables (i.e., person carries some object) to geo-observables (i.e, the person has a short interaction with another person) to intelligence observables (the suspect is a person with a gray shirt and black pants). Based on the observables, the situation is assessed. For instance, the observables that a suspect person hands over an object to another person who is mixing in the crowd would result in the situation assessment that somebody is carrying a suspect object. In this paper, we discuss the methods to exploit observables for threat assessment. We motivate our choices for specific pattern recognition methods. In the experiments, the added value of the implemented method is demonstrated for scenarios that are relevant to antiterrorism and crowd control. The experiments are performed within an evaluation framework, where the setup is such that conclusions can be drawn concerning: 1) the performance and robustness of an architecture to assess threats; and 2) the architecture’s dependence of the underlying observables in terms of their false positive and negative rates.

1094-6977/$26.00 © 2011 IEEE

44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70

2

71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106

107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART C: APPLICATIONS AND REVIEWS

A. Previous Work Many observables have been introduced, which range from detecting specific objects such as cars [4] to detecting aggression [5], from generic features of human walking patterns [6] or highly specific human acts such as shaking hands and kissing [7]. In this paper, we focus on properties and behavior of humans that can be detected and recognized in video. The goal of our work is to assess situations beyond the level of isolated behaviors, which is based on multiple observables. Previously, other researchers have also considered situation assessment that is based on multiple observables, e.g., the combination of visual and audio features to improve the detection of aggression [8]. With an appropriate combination scheme, the discriminative power of situation assessment can be improved by using multiple observables. Situation assessment is a challenge, as the situations under investigation can range from instantaneous (e.g., one person gives an object to another person) to longer periods (e.g., a pickpocket is waiting the best moment to steal someone’s belongings). In recent literature, researchers tend to focus on either instantaneous events [9] or on complex, long-term person–person or person–object interactions [10]. In this paper, we propose a generic scheme to model and classify both types of situations. In this paper, we consider only situations that are known a priori. This requires examples from which a system can learn. Interestingly, recent work has shown that it is possible to detect abnormal situations at run time [11]. Such methods are not within the scope of this paper. Another observation from reviewing recent work is that scientific communities have released descriptions and datasets of behaviors of crowds and individuals. These datasets are very interesting and include recordings of single observables to experiment with, e.g., the PETS and TRECVID benchmarks. However, such communities have not yet focused on short- and long-term sequential scenarios. As a consequence, no scenarios are available for antiterrorism and crowd-control applications. B. Contributions in this Paper The contribution in this paper is fourfold. 1) Relevant scenarios for antiterrorism and crowd control. We discuss how expert knowledge of incidents can be used. This is crucial, as for the particular incidents related to terrorism, escalation of crowd behavior, and suspect people and objects, few or no data will be available. 2) A tool to generate a multitude of variations of the scenarios. The variations include statistical and systematic error sources. 3) A framework to evaluate both 1) the performance of the situation assessment as a whole and 2) its dependence of underlying observables. The experiments are performed by the usage of a conditional random field (CRF). The outcomes of the evaluation give insight in 1) whether more (complementary) observables are required to assess situations successfully; and 2) whether the robustness of observables needs to be improved in order to be useful for situation assessment.

4) Given the scenarios and their variations and the framework, we evaluate a system for situation assessment that was implemented for this paper. The system is described in Section III that is based on a CRF that takes observables as input and estimates situations as output. We demonstrate in Section IV that discriminative power can be achieved by combining generic observables. This paper is organized as follows. In Section II, we discuss the scenarios and the tool to generate their variations. In Section III, we propose the CRF that learns the best relation between observables and the assessment of situations. In Section IV, the experimental setup is discussed. In Section V, the CRF is evaluated for the scenarios and their variations. In addition, the CRF is evaluated against a real-world dataset. Section VI concludes this paper.

140

II. SCENARIOS

141

In this section, we define scenarios to which our system (see Section III) is applied and on which the experiments will be based (see Section IV). A scenario is interpreted as a sequence of states. The state is a description of the situation at a particular time. That is, “a group of people is agitated.” Note that an expert may be consulted to associate each state with a level of escalation or alarm, to bridge the gap to the decision maker. For each subsequent period, a number of observables are observed. Observables are observed object properties. That is, “a person enters the scene.” A scenario is specified by the sequence of states, their duration, and the observables that are observed for each subsequent period. By the previous definition, the observables may vary from time to time. Different sets of observables may be observed for the same state at different periods. There is no direct coupling between states and observables. In Section III, we select a pattern recognition method to learn the best probabilistic relation between states and observables. In Section IV, we evaluate how well this learned relation predicts the states for unseen variations of the scenarios. Observables may be chosen such that they relate to specific persons or groups of persons that are of particular interest. That is, “a suspect person enters the scene” or even more specific “John enters the scene.” In this paper, we consider observables that are generic, e.g., “a person enters the scene,” “a suspect person is in the scene,” etc. Interestingly, in Section IV, we demonstrate that discriminative power can be achieved by combining generic observables.


A. Observables


We have selected the following observables: empty square; one person; two to five persons; many people; people flux +; people flux −; sudden large people flux; tracks toward hotspot; tracks in hotspot; group formation; group moves; group on collision; individual is avoided by other people; individual's head orientation varies; individual carries object; individual has wild gestures; somebody wears clothes related to a suspect person. The observables have been adopted from interviews with security professionals and from a training in The Netherlands (police academy) on "search, detect, and react," which is based on an Israeli security training. The observables are illustrated in Fig. 1.


B. Antiterrorism and Crowd-Control Scenarios


Five scenarios for antiterrorism and crowd control are proposed: square is revolting, aggressive political speaker, extreme demonstration, neutral, and fight between a few people. The situation at a given time is formalized by one of the following states: neutral, group with agitator, other people are avoiding the situation, group is coordinating something, group is agitated, and group riot. The scenarios are defined by the format {[state_1, [observable_{1,1}, observable_{1,2}, ..., observable_{1,N}], length_1], ..., [state_T, ...]}. The variable "length" refers to the length of the period of the current state. The number of observables may vary per period. These formal specifications will be illustrated at the end of this section. First, we present the five scenarios in natural language.
1) Square is revolting: market square, many people moving randomly, two people are loitering, group assembles just outside the market square, one group member is agitated, the group starts moving, the loitering people join the group, another group starts to move, one of the group members carries something, people are avoiding the two groups, the two groups confront each other, fight, people fleeing.
2) Aggressive political speaker: somebody starts to speak in the midst of people, people are moving around, people start to listen, people are walking away.
3) Extreme demonstration: people are avoiding the group that is demonstrating, the atmosphere is tense, little riot, the demonstrating group is moving.
4) Fight between a few people: some people are agitated, one troublemaker is starting a fight, small fight, people are avoiding the fight, fight stops.
5) Default: neutral situation, people moving, many people enter buildings, a small group is walking by, square is getting more crowded, another group walks by.
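To make the scenario format concrete, the following minimal sketch (ours, not the generator used in this paper) encodes a fragment of the first scenario in the state/observables/length format defined above. The observable labels are abridged from Section II-A, and the chosen states, durations, and the helper function expand are illustrative assumptions.

```python
# Minimal sketch of the scenario format {[state, [observables], length], ...}.
# Names are abridged from the lists above; "expand" is a hypothetical helper
# that unrolls a scenario into per-timestep data.

OBSERVABLES = [
    "empty square", "one person", "two to five persons", "many people",
    "people flux +", "people flux -", "sudden large people flux",
    "tracks toward hotspot", "tracks in hotspot", "group formation",
    "group moves", "group on collision", "individual is avoided",
    "head orientation varies", "individual carries object",
    "wild gestures", "clothes related to suspect",
]

# Fragment of scenario 1), "square is revolting": (state, observables, length).
scenario = [
    ("neutral",             ["many people"],                       20),
    ("group with agitator", ["group formation", "wild gestures"],  15),
    ("group is agitated",   ["group moves", "group on collision"], 10),
]

def expand(scenario, observables=OBSERVABLES):
    """Unroll a scenario into per-timestep binary observable vectors X
    and the ground-truth state y (one entry per timestep)."""
    X, y = [], []
    for state, active, length in scenario:
        vector = [1 if name in active else 0 for name in observables]
        for _ in range(length):
            X.append(list(vector))
            y.append(state)
    return X, y

X, y = expand(scenario)   # 45 timesteps, 17 binary observables each
```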


C. Scenario Generation Including Statistical and Systematic Errors

The scenarios are transformed into datasets. The inputs are the observables (i.e., binary values), and the ground truth is the state per timestep. Four types of noise can be added to the observables.
1) Statistical errors: random false positives and negatives.
2) Systematic errors, fade in/out: false positives are added in front of active observables, or at the end.
3) Systematic errors, ambiguity: false positives are added for a particular observable, and false negatives are added randomly.
4) Systematic errors, clutter: for each state, a state-dependent clutter effect is generated, which results in more false positives and negatives when more clutter is present.


Fig. 2. Example of increasing statistical errors for the scenario “square is revolting.” On the horizontal axis, the timesteps are shown. The vertical axis indicates the observables (each row represents one observable). (a) FPR = FNR = 0%. (b) FPR = FNR = 10%. (c) FPR = FNR = 20%.

Fig. 3. Example of the nontrivial relation between (a) observables and (b) states. The relation between observables and states is not obvious and is probabilistic in nature. Learning the probabilistic relation between observables and states is the objective of this section and the experiments.

All types of noise will be expressed in terms of false positive ratio (FPR) and false negative ratio (FNR) in the experiments. Fig. 2 illustrates the effect of various FPRs and FNRs on the datasets.
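As a rough illustration of how statistical errors are expressed as FPR and FNR, the sketch below flips binary observables at the given rates. It is a simplified stand-in for the generator of Section II-C, which also implements the systematic fade, ambiguity, and clutter modes; the function name is hypothetical.

```python
import numpy as np

def add_statistical_errors(X, fpr, fnr, rng=None):
    """Flip binary observables: 0 -> 1 with probability fpr (false positives),
    1 -> 0 with probability fnr (false negatives)."""
    rng = np.random.default_rng() if rng is None else rng
    X = np.asarray(X, dtype=int)
    flips_up = (X == 0) & (rng.random(X.shape) < fpr)
    flips_down = (X == 1) & (rng.random(X.shape) < fnr)
    return np.where(flips_up, 1, np.where(flips_down, 0, X))

# Example: contaminate the expanded scenario at FPR = FNR = 15%.
# X_noisy = add_statistical_errors(X, fpr=0.15, fnr=0.15)
```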


III. FROM OBSERVABLES TO SITUATION ASSESSMENT


Recall from the previous section that the observables may be noisy. In this section, we exploit a CRF to learn the best probabilistic relation between states and observables. To assess the current state of the situation, all observables are considered together: the situation assessment is based on a "bag of observables." More specifically, in this paper, we interpret situation assessment as a mapping problem from the "bag of observables" to a situation state. The observables are perceptual concepts; the situation states are implicit concepts that are not directly observed from the data. Fig. 3 illustrates this: there is no obvious relation between the observables and the states. The situation state can be interpreted as a hidden parameter of the observables, which makes state prediction nontrivial. In Section V, we evaluate whether the proposed CRF is a good technique to model the probabilistic relations between observables and situation states.


A. Conditional Random Field


A CRF [12] is a type of discriminative probabilistic model that is most often used to label or parse sequential data, such as natural language text or biological sequences. CRFs are a more general form of hidden Markov models (HMMs) [13] and relax the Markov assumption over the sequential data that is characteristic of the HMM. This allows a CRF model to use information from temporally separated data [12]. We observe that situations in real life evolve in certain sequential orders. The evolution over time of the scenario states resembles the kind of probabilistic sequences of natural language. These sequences can be modeled by an undirected graphical model in which each vertex represents a state; the edges between the vertices can be understood as dependences between states. There are two main ways to model such graphical dependences: the HMM and the CRF. We choose the CRF over the HMM because it has advantages that are important to our problem. These advantages are due to the fact that the CRF is a discriminative model instead of a generative model [14]. A discriminative model models the conditional probability P(Y|X). A generative model is more widely applicable, as it can be used to calculate the joint probability P(X,Y) and all marginals of the joint, for example, P(X) or P(Y). A discriminative model is limited to predicting the state Y given a certain sequence of observations X. When fitting a generative model, one approximates both P(X) and P(Y|X), instead of merely P(Y|X) in the case of the discriminative model. This results in a number of advantages.
1) A discriminative model is independent of P(X). This means that estimating P(X) for rarely occurring states is not a problem, which results in better accuracy for rare scenarios.
2) It can be trained in one setting and tested in a different setting with another P(X), since it is independent of P(X) and of dependences within P(X).

3) A discriminative model has better predictive performance for P(Y), since it is trained on the classification of P(Y|X) instead of on modeling the joint probability. It needs no additional effort to fit P(X).

Usage of the CRF consists of two phases: the data-fitting phase and the inference phase. In the data-fitting phase, the model parameters are fitted to a training dataset. The best fit is the maximum of the (log-)likelihood of these parameters given the training dataset. The likelihood is a multidimensional concave function; the optimum of this function gives the best fit of the parameters. In effect, two matrices are learned, which are used by the CRF to classify newly observed observables into the next state: one captures the dependences between observables and states, and the other captures the dependences from the previous state to the next state. The observable–state matrix also captures the relative importance of each observable for each state. In the inference phase, the learned model is used to estimate P(Y) from a given X. When estimating P(Y), the forward–backward algorithm [13] is used to refine the estimation: it finds the most probable path in time over the states in P(Y) [13] and smooths the probabilities of P(Y).

Fig. 4. Example of a CRF. (a) Weights for the observables-to-current-state potential. (b) Weights of the previous-state-to-current-state potential. (c) Cumulative probabilities. (d) Estimated state and ground truth.

An example of a CRF is illustrated in Fig. 4. This example is used in Section V-B, where the precise states and observables (in a real-world experiment) are further explained. For the present illustration, the naming of states and observables is not yet important; here, we want to discuss the underlying mechanism. We illustrate the weighting and potentials of observables and the estimation of states. The weights of the observables are shown in Fig. 4(a), and the weights of the previous state are shown in Fig. 4(b). Together, they influence the estimation of the current state. Positive weights excite the corresponding state, whereas negative weights inhibit that state. This is exemplified by the cumulative probabilities in Fig. 4(c). The state with the highest probability becomes the estimated state [see Fig. 4(d)].
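For reference, one standard way to write the conditional distribution of such a linear-chain CRF over binary observables is shown below (the notation is ours, not taken from [12]); the weights λ correspond to the observable–state matrix of Fig. 4(a) and the weights μ to the state-transition matrix of Fig. 4(b):

\[
P(y_{1:T}\mid x_{1:T}) \;=\; \frac{1}{Z(x_{1:T})}\,\exp\!\left(\sum_{t=1}^{T}\sum_{j}\lambda_{y_t,j}\,x_{t,j} \;+\; \sum_{t=2}^{T}\mu_{y_{t-1},y_t}\right),
\]

where \(x_{t,j}\in\{0,1\}\) is the jth observable at timestep t, \(\lambda\) is the observable–state weight matrix [Fig. 4(a)], \(\mu\) is the state-transition weight matrix [Fig. 4(b)], and \(Z(x_{1:T})\) sums the exponential over all possible state sequences.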


IV. EXPERIMENTAL SETUP


We model the situation state as the hidden state of the CRF, and the observables are the observations of the hidden state.
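Continuing the hypothetical sketch from Section II, each timestep then contributes one observation (the binary observable vector) and one hidden-state label. A common way to feed this to a CRF toolkit is as per-timestep feature dictionaries; the helper below is ours and purely illustrative.

```python
# Continuing the earlier sketch: X[t] is the binary observable vector at
# timestep t (the observation), y[t] is the annotated situation state
# (the hidden state). to_feature_dicts is a hypothetical helper.
def to_feature_dicts(X, observable_names):
    """Represent each timestep as an {observable name: value} dict,
    a common input format for CRF toolkits."""
    return [{name: float(v) for name, v in zip(observable_names, row)}
            for row in X]

# sequences_X = [to_feature_dicts(X, OBSERVABLES)]  # list of sequences
# sequences_y = [y]                                 # matching state labels
```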


A. Conditional Random Field Evaluation Objectives


We evaluate how well the relation between states and observables can be modeled by a CRF. We are interested in the following properties of the CRF.
1) Discriminative power: How well does the CRF predict the situation states for unseen variations of the scenarios? This is an interesting question, as the observables are generic and, hence, not tailored to the specific scenarios and their states.
2) Robustness: How sensitive is the CRF with respect to statistical and systematic errors?
3) Capacity to generalize: How many examples of a given set of scenarios are required to obtain a robust CRF?


B. Evaluation Tool


To test the discriminative power of the CRF for situations, we adapted Schmidt's MATLAB CRF implementation [15]. We implemented a scenario generator to generate the scenarios (see Section II-B). In this tool, we also added the capability to apply the four noise modes (see Section II-C) to the observables of the scenarios, in order to measure the performance of the CRF approach against various noise models.


C. Datasets


We test each noise variable at increasing amounts of noise. The datasets are generated from the scenarios in Section II and are subsequently contaminated with increasing levels of noise. The types of noise are statistical (random false positives/negatives) and systematic (fade ins/outs, ambiguity, and clutter). All types of noise are expressed in terms of false positive and negative ratios in the experiments.


D. Cross Validation


Fig. 5. Results for the experiment with 16, 8, 4, 2, and 1 scenario variations to learn from. Results are shown for nine fixed samples of FPR and FNR. For each FPR/FNR sample, we indicate how the CRF classification accuracy drops as a consequence of fewer learning examples. This is shown by each bar, where the accuracy drops from red (almost 100%) to green (just above 50%). This effect is worse for higher FPR/FNR (e.g., the bar in the upper-right corner). A good balance between the number of examples and the performance of the CRF is obtained for eight examples (see the second value from the left of each colored bar).

For each noise setting, 48 repetitions were generated. Each repetition contains the entire set of five scenarios (see Section II-B). For each noise parameter setting, a 12-fold cross validation is performed. Each test fold contains four repetitions and is tested against a training set that contains N randomly selected repetitions out of the 48 that are generated for the current parameter setting. The results averaged over the folds are the CRF performance results for the selected parameter setting.
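A minimal sketch of this cross-validation protocol is given below; the training and scoring callables are passed in as placeholders for the actual CRF fitting and per-state accuracy computation, and drawing the N training repetitions from outside the test fold is our assumption.

```python
import numpy as np

def cross_validate(repetitions, n_train, train_fn, eval_fn,
                   n_folds=12, fold_size=4, seed=0):
    """12-fold CV over 48 scenario repetitions: each test fold holds 4
    repetitions; the training set holds n_train repetitions drawn at random
    from the remaining ones. train_fn/eval_fn are caller-supplied callables."""
    rng = np.random.default_rng(seed)
    scores = []
    for fold in range(n_folds):
        lo, hi = fold * fold_size, (fold + 1) * fold_size
        test = repetitions[lo:hi]
        rest = repetitions[:lo] + repetitions[hi:]
        train_idx = rng.choice(len(rest), size=n_train, replace=False)
        model = train_fn([rest[i] for i in train_idx])
        scores.append(eval_fn(model, test))  # e.g., accuracy averaged over states
    return float(np.mean(scores))
```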


E. Training the Conditional Random Field


The CRF is trained at each test fold on the randomly selected training set. The data fitting, i.e., the optimization of the log-likelihood, is performed by the L-BFGS algorithm [16]. This quasi-Newton algorithm is a standard choice for large multidimensional optimization problems such as fitting a CRF [17]. The optimization ends when the average error on the training set is smaller than 10^-5 or after 500 iterations.
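The experiments in this paper use Schmidt's MATLAB toolkit [15]; as a rough present-day equivalent (our substitution, not the toolkit used here), a linear-chain CRF with L-BFGS fitting and the same iteration cap can be configured with the sklearn-crfsuite Python package:

```python
import sklearn_crfsuite

# Linear-chain CRF fitted with L-BFGS; max_iterations=500 mirrors the cap
# used in the paper (the exact 1e-5 stopping criterion of the MATLAB
# toolkit may not be reproduced exactly by this package).
crf = sklearn_crfsuite.CRF(
    algorithm="lbfgs",
    max_iterations=500,
    all_possible_transitions=True,
)
# sequences_X: list of sequences of {observable: value} dicts (see above)
# sequences_y: list of sequences of state labels
# crf.fit(sequences_X, sequences_y)
# predicted_states = crf.predict(sequences_X_test)
```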


F. Evaluation Measures


The results of the CRF are presented as confusion matrices. Following conventions, each confusion matrix contains the ground-truth situation states on the vertical axis and the states predicted by the CRF on the horizontal axis. Hence, a diagonal with high prediction accuracies is aimed for. We expect that for higher FPRs and FNRs, the predicted states become unreliable, which will cause the confusion matrices to have lower values on the diagonal. An increasing state index indicates an increasing alarm level. In the case of errors (i.e., wrongly predicted states), we hope that the errors are not distributed among all states but remain confined to neighboring states, such that the perturbing effect on wrongly predicted alarm levels remains limited.
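For completeness, the row-normalized confusion matrix and the equally weighted per-state accuracy can be computed along the following lines (a sketch using scikit-learn, which is our choice and not prescribed by the paper):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def state_confusion(y_true, y_pred, states):
    """Rows: ground-truth states; columns: predicted states. Each row is
    normalized so that the diagonal holds the per-state accuracy."""
    cm = confusion_matrix(y_true, y_pred, labels=states).astype(float)
    sums = cm.sum(axis=1, keepdims=True)
    cm = np.divide(cm, sums, out=np.zeros_like(cm), where=sums > 0)
    return cm

# Accuracy weighted equally over all states = mean of the diagonal:
# accuracy = state_confusion(y_true, y_pred, states).diagonal().mean()
```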



Fig. 6. Situation assessment: results. Experiment: only one scenario variation to learn from, for various types of noise (see the captions directly under the figure) and degrees of noise (see the false positive and negative ratios). Confusion matrices for each of the six situation states are displayed for various FPRs and FNRs. Up to an FPR or FNR of 15%, the CRF is surprisingly accurate at assessing situations. (a) Statistical errors. (b) Systematic: fade ins. (c) Systematic: fade outs. (d) Systematic: fade ins and outs. (e) Systematic: ambiguity. (f) Systematic: clutter.


V. RESULTS


First, we focus on experiments on synthetic data where we can control the sources and amount of noise (see Section V-A). Second, we show experimental results on real-world data (see Section V-B).


A. Synthetic Data


The results in this section are used to evaluate the capacity to generalize (see Section V-A1), the discriminative power (see Section V-A2), and the robustness (see Section V-A2).

1) Effect of the Size of the Training Set: Our first objective is to find a good balance between the number of learning examples to train the CRF and the performance of the CRF. If the number of learning examples is low, then fewer examples have to be collected and training takes less time. However, with fewer learning examples, the performance of the CRF will be lower. To evaluate this tradeoff in detail, we investigate the results of the CRF approach at increasing FPR and FNR levels. Fig. 5 shows the accuracy of the CRF for various sizes of the training set, which has been decreased from 16 to smaller sets of 8, 4, 2, and 1 training example(s). The figure shows at each FPR (horizontal coordinate) and FNR (vertical coordinate) a bar of values, one for each size of the training set. Each value represents the CRF's accuracy of the situation assessment, where the accuracy has been averaged equally over the situation states. With 16 examples and errors up to 30%, the average accuracy of the CRF is 81%. With fewer errors, up to 15%, the accuracy is 94%. With only one example, the accuracy drops, respectively, to 50% (was 81%) and 63% (was 94%). A good tradeoff seems to be eight examples, which achieves only slightly lower accuracy (i.e., 4% lower than with 16 examples) while using only half of the examples. Situations can be assessed with a precision of 90% at a false positive and negative rate of 15% using only eight learning examples.

2) Effect of Statistical and Systematic Errors on the CRF: Our second objective is to find out where the errors of the CRF (wrongly classified states) occur and how they are distributed over the states. To that end, we need a more detailed figure. Rather than displaying a single value that represents the accuracy averaged over the states, we display the confusion matrix for a given size of the training set. As an extreme test case, we display results for a training set of one example only. In addition, the results are split into the various noise sources that we identified earlier, each producing one subfigure [see Fig. 6(a)–(f)]. The captions indicate the error sources: there are four error sources, where the fade in/out source is separated into three related but different cases, which results in six subfigures in total. The line around each of the confusion matrices indicates the average prediction accuracy, which has been weighted equally over all states. Fig. 6 altogether is meant to give detailed insight into how the error sources and the levels of FPR and FNR deteriorate the results. Up to an FPR or FNR of 15% for any type of noise, the CRF is surprisingly accurate at assessing situations. The accuracy starts to deteriorate above 15% FPR or FNR or both. Interestingly, the CRF is not very sensitive to fading effects or to ambiguity (for which we introduced correlations within the observables).


Fig. 7. Video descriptors to capture the crowd dynamics in a compact way. This figure shows 12 out of the 104 descriptors.

TABLE I
RESULTS FOR THE CLASSIFICATION OF CROWD DYNAMICS BY A CRF

On the contrary, the CRF is sensitive to random noise and to clutter (state-specific levels of random noise). The clutter case is the most difficult, as the noise is different for each state; as a consequence, the noise level increases and decreases over time. The CRF is least robust to noise levels that vary over time.


B. Real-World Data


In this section, we consider an experiment that is based on real data. The experiment is relevant for security and safety applications. We have recorded and annotated 4 h of video data at the second largest train station in The Netherlands. The video data includes many and diverse behavioral patterns of the crowd, of which a limited set have been classified by a CRF. For this real-world experiment, we have adopted state-of-the-art video features that capture local structure and temporal dynamics. Given these features, we adopt a quantization scheme to translate the many and high-dimensional video features into a fixed number of low-dimensional observables. Based on these spatiotemporal observables, various crowd dynamics are classified by the CRF. The goal is to classify three types of crowd dynamics: "normal" (10–100 people), "abandoned" (0–10 people), and "rush" (many people moving, going to or leaving the tram platforms). In the recorded video, these three event classes are correspondingly labeled by hand.

1) Describing the Crowd: Video Features: To describe the crowd, we aim to capture the temporal dynamics in the crowd (the general movement of, e.g., heads, shoulders, and legs). We tailor existing video features for this task. We start with optical flow [18], which is computed at each pixel in the video images and yields at each pixel a magnitude and an orientation. We summarize these flows in spatial bins to reduce the amount of data. We choose 104 fixed bins that are distributed equally over the image.


Fig. 8. Examples of errors for each type of crowd dynamics, where the CRF misclassified the state of the crowd (between parentheses). (a) "Normal" ("Rush"). (b) "Abandoned" ("Normal"). (c) "Rush" ("Normal").

The 104 spatial bins are obtained as follows. The video is of HD quality; therefore, each image frame is 1080 × 1440 pixels. We aim to sample approximately every 100 pixels. Discounting the border areas, the grid becomes 8 × 13, which yields 104 spatial bins. Next, in each spatial bin, the flow vectors are binned into eight orientations and four magnitudes. This approach is inspired by other work, e.g., the scale-invariant feature transform [19], where discriminative yet compact image features are extracted by summarizing many pixel values into a few orientations and magnitudes. For adequate continuous weighting of the optical flows, the contribution of each flow vector to each bin is determined by a kernel-based weight. With our approach, 104 × 8 × 6 values, i.e., 104 descriptors of length 48, are obtained per video frame. This procedure captures the dynamics of the crowd in a compact way. An illustration is depicted in Fig. 7.

2) Transforming the Video Features Into Observables: We experimentally established that the CRF is able to learn the three types of crowd dynamics when the input is bounded to at most 40 observables. Recall that the video features yield 104 × 48 = 4992 values, i.e., many more than the maximum of 40 observables. Hence, a translation is needed that reduces the video features to approximately 40 observables. To do so, we adopt a method of quantization: visual codebooks [20]. In a visual codebook, the descriptors are assigned to a smaller selection of representative descriptors (which are also called primitives). This yields a histogram, of which the length is determined by the number of primitives. We choose eight primitives, which results in a histogram of length 8. Next, we discretize each bin (of the eight-bin histogram) into four levels, to obtain 8 × 4 binary values. Recall that the CRF is especially suited to learn from binary values. In this translation procedure, the 104 multivalued video features are translated into 32 binary values. These 32 binary values are the observables, on which the CRF is trained and tested.
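A rough sketch of this descriptor-and-codebook pipeline is given below, assuming a dense optical-flow field is already available per frame. We follow the eight-orientation/four-magnitude binning described above (so the descriptor length in this sketch is 32), omit the kernel-based weighting for brevity, and treat the magnitude cap, grid mapping, and discretization as illustrative assumptions; the primitives would be obtained beforehand, e.g., by clustering descriptors from training frames.

```python
import numpy as np

def flow_descriptors(flow, grid=(8, 13), n_ori=8, n_mag=4, max_mag=8.0):
    """Per-frame crowd descriptor: for each spatial cell, a histogram over
    flow orientation x magnitude bins. flow has shape (H, W, 2)."""
    H, W, _ = flow.shape
    mag = np.hypot(flow[..., 0], flow[..., 1])
    ori = np.arctan2(flow[..., 1], flow[..., 0])               # [-pi, pi]
    ori_bin = ((ori + np.pi) / (2 * np.pi) * n_ori).astype(int) % n_ori
    mag_bin = np.clip((mag / max_mag * n_mag).astype(int), 0, n_mag - 1)
    rows = np.minimum(np.arange(H)[:, None] * grid[0] // H, grid[0] - 1)
    cols = np.minimum(np.arange(W)[None, :] * grid[1] // W, grid[1] - 1)
    cell = rows * grid[1] + cols                               # (H, W) cell index
    desc = np.zeros((grid[0] * grid[1], n_ori * n_mag))
    np.add.at(desc, (cell.ravel(), (ori_bin * n_mag + mag_bin).ravel()), 1.0)
    return desc                                                # (104, 32)

def to_observables(desc, primitives, n_levels=4):
    """Assign each spatial descriptor to its nearest primitive, histogram the
    assignments, and discretize each histogram bin into binary level flags."""
    d = np.linalg.norm(desc[:, None, :] - primitives[None, :, :], axis=2)
    hist = np.bincount(d.argmin(axis=1), minlength=len(primitives))
    hist = hist / max(hist.sum(), 1)
    levels = np.minimum((hist * n_levels).astype(int), n_levels - 1)
    onehot = np.eye(n_levels, dtype=int)[levels]               # (8, 4)
    return onehot.ravel()                                      # 32 binary observables
```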

3) Results: Classification of Crowd Dynamics: All examples of the three classes are taken from the 4 h of video. These examples are subsampled at 2 frames/s. This choice both reduces the amount of stored data and reflects a limitation posed by processing speed. We argue that 2 frames/s is sufficiently fast for awareness of crowd dynamics, while the variation between samples remains significant. For each frame, the video features are computed and translated into observables. Together with the classification of observables into the next state, this procedure runs at a sampling rate of 2 frames/s.

The trained CRF, with its weights for the observables and state transitions, has been shown in Fig. 4 and was discussed in Section III. For the training of the CRF, the transitions between states are essential. In our real-world dataset (i.e., the examples of the classes), we found that the number of transitions is limited. This poses a challenge to construct distinct yet rich training and test sets. We solve this as follows. We perform a fourfold cross validation and subsample every Nth encoded video frame into the Nth fold. This way, the folds are disjoint and different enough, as successive samples within a fold are at minimum 2 s apart (given the sampling of 2 frames/s and four folds) and the scene changes rapidly. Going from one video frame to the next, with 2 s in between, gives an enormous variation in the images: people have walked significantly, new people have entered, other people have exited the scene, and people have moved closer together or farther apart. Following conventions, we train on one fold and test on another. This experiment is repeated four times. The results are summarized by the confusion matrix in Table I. Interestingly, the CRF classifies some states more accurately than others. The CRF has been shown to classify the crowd dynamics successfully with up to 80% accuracy. Examples of errors are shown in Fig. 8. An error that occurs regularly is a badly timed state transition of the estimation: usually, the transition is to the right state but occurs too early or too late. This is illustrated in Fig. 8(b) and (c). Sometimes there are "true misclassifications;" this is illustrated in Fig. 8(a).

VI. CONCLUSION

We have proposed scenarios that are relevant for antiterrorism and crowd control. Given the scenarios and their variations, including various kinds of errors, we have assessed situations based on observables with the objective of recognizing threats. We have chosen the CRF as a method to learn probabilistically the relation between observables and states (as the hidden parameter). A good size for the training set was shown to be eight examples for each of the five scenarios, which achieves only slightly lower accuracy (i.e., 4%) than with 16 examples while using only half of the examples. Situations can be assessed with a precision of 90% at a false positive and negative rate of 15% using only eight learning examples.



We have investigated the extreme test case of using a training set of only one example to find out where the CRF becomes inaccurate. Up to an FPR or FNR of 15% for any type of noise (random, fade in/out, ambiguity, and clutter), the CRF is surprisingly accurate at assessing situations. The accuracy starts to deteriorate above 15% FPR or FNR or both. The CRF is not very sensitive to fading effects or ambiguity but is sensitive to random noise and to clutter (i.e., state-specific levels of random noise). In a real-world experiment at a large train station, we have classified various types of crowd dynamics. Using simple video features of shape and motion, we have proposed a scheme to translate such features into observables that can be classified by a CRF. The CRF has been shown to classify the crowd dynamics successfully with up to 80% accuracy. Overall, we conclude this paper with the interesting finding that discriminative power can be achieved by combining multiple, generic, and simple observables.


ACKNOWLEDGMENT


The authors would like to thank Dr. R. den Hollander for providing the camera-based observables and their characteristics. They are also grateful to Dr. K. Schutte for useful discussions on the inference of the situation based on observables.


REFERENCES


[1] M. M. Kokar, "Situation awareness: Issues and challenges," in Proc. Int. Conf. Inf. Fusion, 2004, pp. 533–534.
[2] J. Llinas, C. Bowman, G. Rogova, and A. Steinberg, "Revisiting the JDL data fusion model II," in Proc. Int. Conf. Inf. Fusion, 2004, pp. 1218–1230.
[3] G. J. Burghouts, B. Broek, B. G. Alefs, E. den Breejen, and K. Schutte, "Automated indicators for behavior interpretation," in Proc. Int. Conf. Crime Detect. Prevent., 2007, pp. 1–6.
[4] S. Agarwal, A. Awan, and D. Roth, "Learning to detect objects in images via a sparse, part-based representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 11, pp. 1475–1490, Nov. 2004.
[5] A. Datta, "Person-on-person violence detection in video data," in Proc. Int. Conf. Pattern Recognit., 2002, pp. 433–438.
[6] D. M. Gavrila, "The visual analysis of human movement: A survey," Comput. Vis. Image Understand., vol. 73, no. 1, pp. 82–98, 1999.
[7] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld, "Learning realistic human actions from movies," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[8] W. Zajdel, D. Krijnders, T. Andringa, and D. Gavrila, "CASSANDRA: Audio-video sensor fusion for aggression detection," in Proc. Int. Conf. Adv. Video Signal Based Surveillance, 2007, pp. 200–205.
[9] L. Duan, D. Xu, I. W. Tsang, and J. Luo, "Visual event recognition in videos by learning from web data," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1959–1966.
[10] D. Küttel, M. Breitenstein, L. van Gool, and V. Ferrari, "What's going on? Discovering spatio-temporal dependencies in dynamic scenes," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1951–1958.
[11] T. Xiang and S. Gong, "Video behaviour profiling for anomaly detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 893–908, May 2008.
[12] A. McCallum and C. Sutton, "An introduction to conditional random fields for relational learning," in Introduction to Statistical Relational Learning, L. Getoor and B. Taskar, Eds. Cambridge, MA: MIT Press, 2006. [Online]. Available: http://www.cs.umass.edu/~mccallum/papers/crf-tutorial.pdf


[13] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, no. 2, pp. 257–286, Feb. 1989.
[14] I. Ulusoy and C. Bishop, "Generative versus discriminative methods for object recognition," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., 2005, vol. 2, pp. 258–265.
[15] M. Schmidt, "CRF toolkit in MATLAB," 2008. [Online]. Available: http://people.cs.ubc.ca/schmidtm/Software/
[16] D. C. Liu and J. Nocedal, "On the limited memory BFGS method for large scale optimization," Math. Program., vol. 45, no. 3, pp. 503–528, 1989.
[17] F. Sha and F. Pereira, "Shallow parsing with conditional random fields," in Proc. HLT-NAACL, 2003, pp. 213–220. [Online]. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.10.9849
[18] B. K. P. Horn and B. G. Schunck, "Determining optical flow," Artif. Intell., vol. 17, nos. 1–3, pp. 185–203, 1981.
[19] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Int. Conf. Comput. Vis., 1999, pp. 1–8.
[20] F. Jurie and B. Triggs, "Creating efficient codebooks for visual recognition," in Proc. Int. Conf. Comput. Vis., 2005, pp. 604–610.


Gertjan J. Burghouts received the Ph.D. degree from the University of Amsterdam, Amsterdam, The Netherlands, in 2007, on the topic of visual recognition of objects and their motion in realistic scenes with varying conditions. He is currently a Lead Research Scientist of Visual Pattern Recognition with The Netherlands Organization for Applied Scientific Research (TNO), The Hague, The Netherlands. He studied artificial intelligence at the University of Twente during 1997–2002, with a specialization in pattern analysis and human–machine interaction. Since 2007, he has been the Principal Investigator of automated understanding of human behavior based on sensory perception. He is the Principal Investigator of a DARPA project named CORTEX (2.3M), about recognition of events and behaviors. He has written papers on this topic in internationally renowned journals, e.g., the IEEE TRANSACTIONS ON IMAGE PROCESSING, Computer Vision and Image Understanding, the International Journal of Computer Vision, and the International Conference on Crime Detection and Prevention. His work has been cited more than 200 times since 2005. Dr. Burghouts received an award from the Netherlands Association of Engineers for the best innovative project in 2007.


Jan-Willem Marck received the M.Sc. degree in artificial intelligence from the University of Groningen, Groningen, The Netherlands, in 2006, with a specialization in autonomous systems. He is currently a Research Scientist of Artificial Intelligence with the Department of Distributed Sensor Systems, The Netherlands Organization for Applied Scientific Research (TNO), The Hague, The Netherlands. Since 2006, he has been a Research Scientist on various topics, e.g., sensor information fusion, relevance of information, human–machine interaction, and human behavior classification based on state estimation methods using sensory data. He has (co-)authored a number of papers and managed projects on these topics.


