Gradual Transition Detection with Conditional Random Fields Jinhui Yuan, Jianmin Li and Bo Zhang State Key Laboratory of Intelligent Technology and Systems Department of Computer Science and Technology Tsinghua University, Beijing, 100084, P. R. China

[email protected] {lijianmin, dcszb}@mail.tsinghua.edu.cn ABSTRACT In this paper, we view gradual transition detection as a sequence labeling problem and propose to use Conditional Random Fields (CRFs) for this purpose. CRFs is a state-ofthe-art sequence labeling approach. It provides a unified way to integrate various useful clues to form a decision system. Moreover, it has principled way for parameter estimation and inference. Compared to rule-based approaches, gradual transition detection with CRFs requires fewer human interactions while designing the system. The experiments on TRECVID platform show that CRFs can achieve comparable performance to that of the state-of-the-art approaches.

Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Abstracting methods, Indexing methods; I.5.1 [Pattern Recognition]: Models—Statistical, Structural

General Terms Algorithms, Experimentation, Performance

Keywords Gradual Transition Detection, Conditional Random Fields

1. INTRODUCTION

Shot boundary detection (SBD) is a prerequisite step for content-based video retrieval (CBVR). After more than a decade of development, cut detection has been largely solved, while the detection of gradual transitions remains a difficult problem. Surveys such as [2, 8] provide in-depth discussions of shot boundary detection. In this paper, we focus on the detection of gradual transitions. The major challenge of gradual transition detection is how to effectively integrate various useful clues [2, 8].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'07, September 23–28, 2007, Augsburg, Bavaria, Germany. Copyright 2007 ACM 978-1-59593-701-8/07/0009 ...$5.00.

The clues that are probably useful include: (a) content variation over short temporal ranges (local content variation), to accurately locate the start and end positions of gradual transitions; (b) content variation over long temporal ranges (global content variation), to measure whether significant content variation occurs; (c) motion activity, to distinguish whether the content variation is caused by a gradual transition or by motion. Other information may also be useful, such as the length of the last shot [2]; here, we consider only the three factors above. Among existing approaches, the twin-comparison technique proposed by Zhang et al. [9] is probably the best known. To overcome its shortcomings, Zheng et al. [10] proposed a finite state automaton (FSA) method that employs a motion-based adaptive threshold; their system achieved the best result in TRECVID 2004 [5, 10]. Recently, Liu et al. designed another system consisting of a set of finite state machine (FSM) detectors, which achieved the best result in TRECVID 2006 [4, 5]. All the above systems are essentially rule-based approaches: each consists of a set of rules, each of which reflects one kind of evidence of shot transitions. How to integrate the basic rules into a final decision is the central problem of such systems, and it depends heavily on the experience of the system designers. Moreover, the rules involve many heuristically chosen thresholds, which may prevent the systems from generalizing to novel video collections. Boreczky et al. [1] proposed to segment videos with the Hidden Markov Model (HMM). HMM can be viewed as a probabilistic counterpart of the deterministic FSA. HMM shows some advantages over rule-based systems because it does not require manually determined thresholds [1].
In this paper, we view gradual transition detection as a sequence labeling problem and propose to use Conditional Random Fields (CRFs) for this purpose. CRFs are a state-of-the-art sequence labeling approach that outperforms HMM in many applications [3, 7]. They not only provide a powerful means of integrating various useful clues, but also offer principled methods for parameter estimation and inference. Experiments on the TRECVID platform show that CRFs achieve performance comparable to that of state-of-the-art rule-based systems, while requiring less human effort to integrate abundant basic rules into a complex decision system. The paper is organized as follows. In Section 2 we formulate gradual transition detection as a sequence labeling

Figure 1: Each video is partitioned into temporal sequences by a reliable cut detector. The task of gradual transition detection is to assign a suitable label (0 or 1) to each frame.

problem. In Section 3 we give a brief introduction to CRFs. In Section 4 we describe in detail how to define feature functions for CRFs. In Section 5 we evaluate the proposed approach on the TRECVID platform. Finally, we conclude the paper in Section 6.

2. PROBLEM DESCRIPTION

As shown in Figure 1, we first partition each video into temporal sequences using the cut detector proposed in [8]. The cut detector is reliable, yielding about 94% recall and precision on large video collections. Each resulting sequence may be a single shot (e.g., sequence #1 in Figure 1) or may consist of several shots separated by gradual transitions (e.g., sequence #2 in Figure 1). Let Y ∈ {0, 1} denote the assigned label: 0 indicates that the frame is not within a gradual transition, and 1 indicates that it is. From the perspective of machine learning, this is a typical sequence labeling problem. Our task is to learn a model that infers a label sequence from various observed clues of the input sequence, such as local and global content variation.

3. CONDITIONAL RANDOM FIELDS

CRFs originate from the Hidden Markov Model (HMM) [3]. As shown in Figure 2, HMM is a directed generative model. HMM assumes that features at different sites are independent in order to make inference tractable. Concretely, the label Yi at a given position can depend only on the feature Xi at the current site; HMM cannot make the label Yi depend on a window of the input sequence or on the surrounding labels. In contrast, CRFs form an undirected conditional model. They relax the independence assumption of HMM and can be trained discriminatively. Arbitrarily correlated features can be incorporated as long as they are useful, which is very convenient for sequence labeling. Taking gradual transition detection as an example, long-range content variations at adjacent positions are obviously not independent; with CRFs, such correlated features can be incorporated naturally. A recent tutorial on CRFs [7] discusses this topic in detail. In the following, we present a brief introduction to CRFs.

Let X denote the random variable over input sequences to be labeled and Y the random variable over the corresponding label sequences. All components Yi of Y are assumed to be in {0, 1}. The definition of CRFs is as follows [3]:

Definition 1. Let G = (V, E) be a graph such that Y = (Yv)v∈V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Yv obey the Markov property with respect to the graph: p(Yv | X, Yw, w ≠ v) = p(Yv | X, Yw, w ∼ v), where w ∼ v means that w and v are neighbors in G.

Figure 2: (a) Hidden Markov Model, (b) Conditional Random Fields.

The joint distribution over the label sequence Y given X has the form

p(Y | X) = (1/Z(X)) exp( Σ_{e∈E,i} λ_i t_i(e, Y|_e, X) + Σ_{v∈V,i} μ_i s_i(v, Y|_v, X) ),

where Y|_S is the set of components of Y associated with the vertices in subgraph S, Z(X) is the partition function, s_i is a state feature function, t_i is a transition feature function, and λ_i and μ_i are the corresponding weights. Feature functions are usually defined in binary-valued form. A simple example is

s_i(v, Y|_v, X) = δ(Y_v = y) δ(X = x),
t_i(e, Y|_e, X) = δ(Y|_e = ⟨y′, y⟩) δ(X = x),

where δ is an indicator function whose output is 1 only if the contained assertion is true, and 0 otherwise. The state feature s_i reflects the association between observation x and label y at the given position, while the transition feature t_i captures the transition relation between the previous state y′ and the current state y. Intuitively, these binary feature functions can be thought of as rules, and the corresponding weight λ_i (or μ_i) indicates the importance of the specific rule. Thus, CRFs actually integrate these rules by weighted log-linear addition. In the next section, we will show how to define the feature functions for the task of gradual transition detection. Here we assume the features are given. Then the parameter estimation problem is to determine the parameters θ = (λ_1, λ_2, . . . , μ_1, μ_2, . . .) from training data. The inference problem is to find the most probable label sequence ŷ = arg max_y p_{λ,μ}(y | x) for an input sequence x. For chain-structured CRFs, efficient algorithms for exact learning and inference exist; details can be found in [7].
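As an illustration of the decoding step, the following is a minimal sketch of Viterbi dynamic programming for a binary-label chain CRF. This is our own illustrative code, not the authors' implementation (they use the FlexCRFs toolkit); it assumes the weighted feature sums have already been collapsed into per-position state scores and a 2×2 transition-score matrix, and works with raw log-space scores, since exp(·) and Z(X) cancel in the arg max.

```python
import numpy as np

def viterbi_binary(state_scores, trans_scores):
    """Most probable 0/1 label sequence for a linear-chain CRF.

    state_scores[t, y] -- sum_i mu_i * s_i(t, y, x) at position t
    trans_scores[a, b] -- sum_i lambda_i * t_i(a, b) for moving a -> b
    """
    T = state_scores.shape[0]
    score = np.zeros((T, 2))            # best score of any path ending in (t, y)
    back = np.zeros((T, 2), dtype=int)  # backpointers
    score[0] = state_scores[0]
    for t in range(1, T):
        for y in (0, 1):
            cand = score[t - 1] + trans_scores[:, y]
            back[t, y] = int(np.argmax(cand))
            score[t, y] = cand[back[t, y]] + state_scores[t, y]
    labels = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):       # follow backpointers
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]
```

The quadratic inner loop over label pairs is what makes exact inference tractable for chain structures, in contrast to general graphs.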



Figure 3: f(2, 3) is usually adopted to measure the content continuity at the position between frame 2 and frame 3. Instead, we adopt g(3) = [f(1, 4) + f(1, 3) + f(2, 3) + f(2, 4)]/4 to measure the content continuity at that position.

Figure 4: Content variation curve obtained by the graph partition model.

4. DEFINITION OF FEATURE FUNCTIONS

How to define feature functions, and which ones to use, depends on the concrete application. In our case, we define several feature functions according to human knowledge about gradual transition detection. As mentioned in Section 1, three clues are useful for recognizing gradual transitions: local content variation, global content variation, and motion activity. Therefore, we define binary feature functions capturing these clues.

4.1 Content Variation Features

We partition each frame into 2 × 2 blocks and extract an RGB color histogram from each block. The block-based RGB color histogram serves as the compact content representation of each frame. The commonly used histogram intersection method is used to compute the continuity value between two feature vectors. We then calculate a content continuity value for each position based on the graph partition model [8]. If f(i, j) denotes the histogram intersection value between the i-th and j-th frames, the graph partition similarity measure at position i is defined as

g(i) = (1/h²) Σ_{j=i−h}^{i−1} Σ_{k=i}^{i+h−1} f(j, k),

where h is half the width of the sliding window. A simple example showing the relationship between f and g is illustrated in Figure 3. We extract local content variation features from f and a global content variation feature from g.

4.1.1 Local Features

The values f(i − k, i + k), k = 1, . . . , 5 are calculated to characterize the multi-scale content variation at frame i. For convenience, f(i − k, i + k) will be abbreviated as f_k(i) in the following. The continuous values f_k(i) are quantized into discrete ones by the k-means method. The resulting discrete indices are denoted as [f_k(i)], k = 1, . . . , 5. Then the binary state feature function capturing the relation between local content variation and labels is defined as

s_k^local(i, Y|_i, X) = δ(Y_i = y) δ([f_k(i)] = q),   (1)

where q is a specific value of the quantization indices. Note that [f_k(i)] is part of the observations X defined in Section 3.

4.1.2 Global Features

We calculate the g value for each position to obtain a content variation curve (e.g., the curve in Figure 4). We track the curve and find the nearest local minimum for each position i. Then, we use the shape of the nearest local minimum to characterize the global content variation of position i. For example, assuming g(m) is the local minimum nearest to i, the shape of the valley can be described by the vector [g(m − l), g(m − l + 1), . . . , g(m + l − 1), g(m + l)]. To cover gradual transition candidates of varying lengths, we can extract a multi-resolution feature vector from the curve, e.g., [g(m − l × k), g(m − (l − 1) × k), . . . , g(m + (l − 1) × k), g(m + l × k)], where 2 × l is the length of the shape-descriptive vector and k ∈ {1, 3, 5} is the step of multi-resolution sampling. To define binary feature functions for CRFs, we also need to map these continuous vectors into discrete values. Here, we do not use k-means but adopt Support Vector Machines (SVMs). We annotate all the local minima in the training data as positive examples (gradual transitions) or negative examples (non-gradual transitions). With the labeled training data, we train an SVM model for the feature at each resolution. These models are then used to map continuous vectors to discrete class indices, 0 or 1. Thus, the binary state feature function capturing the relation between global content variation and labels can be defined as

s_k^global(i, Y|_i, X) = δ(Y_i = y) δ(svm_k(g_k^i) = 1),   (2)

where svm_k(·) is the output of the k-resolution SVM model and g_k^i is the k-resolution vector describing the shape of the local minimum nearest to i.

4.2 Motion Activity Features

We extract a type of motion activity feature to indicate whether there is motion in the current sequence, so that the system can eliminate the disturbances caused by motion. Each frame is split into blocks of size 48 × 48. The motion vector of each block is then computed by the block matching method. Since the motion vectors of smooth blocks are usually not reliable, we remove the motion vectors of blocks with low pixel variance. The mean motion vector mean_mv of each frame is adopted to express the strength of motion activity. Again, the continuous values mean_mv are mapped to discrete values [mean_mv] by a vector-quantization method. The binary feature function capturing the relation between motion activity and labels is defined as

s^motion(i, Y|_i, X) = δ(Y_i = y) δ([mean_mv] = a),   (3)

where a is a specific value resulting from vector quantization.

4.3 Transition Features

The state transition feature captures the temporal contextual constraints among labels. We define it as

t_{y′,y}(e = ⟨i, i + 1⟩, Y|_e, X) = δ(Y|_e = ⟨y′, y⟩).   (4)

Note that, for simplicity, the value of the above feature is independent of the observations X.
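The continuity measures of Section 4.1 can be sketched as follows. This is an illustrative reimplementation under our own assumptions (frames as NumPy arrays, 8 bins per color channel; the paper does not specify the bin count), not the authors' code; the function names f and g mirror the notation above.

```python
import numpy as np

def block_histogram(frame, bins=8):
    """2x2 block RGB color histograms, each L1-normalized, concatenated.
    `frame` is an (H, W, 3) uint8 array."""
    h, w, _ = frame.shape
    feats = []
    for bi in range(2):
        for bj in range(2):
            block = frame[bi * h // 2:(bi + 1) * h // 2,
                          bj * w // 2:(bj + 1) * w // 2]
            hist, _ = np.histogramdd(block.reshape(-1, 3).astype(float),
                                     bins=(bins,) * 3,
                                     range=((0, 256),) * 3)
            feats.append(hist.ravel() / hist.sum())
    return np.concatenate(feats)

def f(hist_i, hist_j):
    """Histogram intersection continuity value in [0, 1]."""
    return np.minimum(hist_i, hist_j).sum() / hist_i.sum()

def g(hists, i, half=2):
    """Graph-partition continuity at position i:
    g(i) = (1/half^2) * sum_{j=i-half}^{i-1} sum_{k=i}^{i+half-1} f(j, k)."""
    total = 0.0
    for j in range(i - half, i):
        for k in range(i, i + half):
            total += f(hists[j], hists[k])
    return total / (half * half)
```

Because g averages all cross-boundary pairs inside the window, it dips smoothly over a gradual transition instead of producing the single sharp drop that f exhibits at a cut, which is why the valley shape around each local minimum is informative.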



 



Figure 5: (a) Gradual transition detection performance, (b) Frame-based detection performance. Each panel reports recall, precision, and F-measure for Logistic_l, CRF_l, CRF_g, CRF_l_g, and CRF_l_g_m.
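The F-measure reported in Figure 5 is the harmonic mean of recall and precision. As a small illustrative transcription (the zero-division guard is our addition):

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision, used to rank detectors."""
    if recall + precision == 0.0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)
```

The harmonic mean rewards balanced detectors: a system with high recall but poor precision (or vice versa) scores much lower than one that is moderately good at both.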

5. EXPERIMENTS

In this section, we evaluate the proposed approach on the TRECVID benchmark platform [5]. The test collections of TRECVID 2003 and 2004 are adopted. The 2003 test collection is about 3.05 gigabytes, including 8 videos each lasting about half an hour. The 2004 test collection is about 4.23 gigabytes, comprising 12 videos. We use the 2003 collection as training data and test the model on the 2004 collection. Using these collections, we can compare the performance of CRFs with that of other methods evaluated on the same collections in TRECVID 2004. Performance is evaluated by the recall and precision criteria. To rank different algorithms, the F1 measure, the harmonic mean of recall and precision, is used: F1(recall, precision) = (2 × recall × precision)/(recall + precision). To evaluate the accuracy of the start and end positions of gradual transitions, we also adopt frame-based recall and precision, as done in TRECVID. We focus on evaluating the impact of different feature functions, and therefore implement five versions of CRFs based on the FlexCRFs toolkit [6]:

- Logistic_l: uses only the local features in Equation 1
- CRF_l: uses the features in Equations 1 and 4
- CRF_g: uses the features in Equations 2 and 4
- CRF_l_g: uses the features in Equations 1, 2 and 4
- CRF_l_g_m: uses the features in Equations 1, 2, 3 and 4

Note that Logistic_l does not use the feature in Equation 4, which amounts to assuming that each label Y is independent of the neighboring labels. In this case, the CRF actually degenerates to logistic regression, hence the name Logistic_l. The evaluation results are shown in Figure 5. We can draw the following conclusions: (a) CRF_l significantly outperforms Logistic_l, since Logistic_l truncates many gradual transitions by not taking the temporal constraints among labels into account; (b) CRF_g beats CRF_l in general detection performance but loses in frame-based detection performance, showing that global content variation helps judge whether a gradual transition occurs, while local content variation helps accurately locate the start and end positions of gradual transitions; (c) CRF_l_g_m is slightly superior to CRF_l_g in F1 measure and significantly outperforms CRF_l_g in precision, showing that the motion activity feature can effectively reduce the disturbances caused by motion; (d) CRF_l_g_m achieves nearly the best result in both general detection performance and frame-based performance, and its F1 measure is 0.825, superior to the best performance (0.808) reported in TRECVID 2004 on the same data set [10].

6. CONCLUSIONS

In this paper, we propose to use CRFs to detect gradual transitions. CRFs can incorporate arbitrarily correlated features (clues) in a unified way and offer principled methods for parameter learning and inference. Compared to most rule-based (or finite state machine) systems, CRFs require less human effort when integrating various basic rules into a complex system. Experiments on the TRECVID platform show that CRFs achieve performance comparable to that of state-of-the-art systems. With more feature functions implemented, the performance of CRFs can be expected to improve further.

7. ACKNOWLEDGMENTS

The research of this paper was supported by the National Natural Science Foundation of China (60621062, 60605003) and the Chinese National Key Foundation Research & Development Plan (2003CB317007, 2004CB318108).

8. REFERENCES

[1] J. S. Boreczky and L. D. Wilcox. A hidden Markov model framework for video segmentation using audio and image features. In Proc. of ICASSP 1998.
[2] A. Hanjalic. Shot boundary detection: unraveled and resolved? IEEE Trans. Circ. Syst. Video Technol., 12(2):90–105, 2002.
[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML 2001, pages 282–289.
[4] Z. Liu, D. Gibbon, E. Zavesky, B. Shahraray, and P. Haffner. AT&T research at TRECVID 2006. In Online Proc. of TREC Video Retrieval Evaluation 2006.
[5] NIST. Homepage of TRECVID evaluation. http://www-nlpir.nist.gov/projects/trecvid/.
[6] X.-H. Phan, L.-M. Nguyen, and C.-T. Nguyen. FlexCRFs: Flexible Conditional Random Field Toolkit. 2005. http://flexcrfs.sourceforge.net.
[7] C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar (Eds.), Introduction to Statistical Relational Learning. MIT Press, to appear.
[8] J. Yuan, H. Wang, L. Xiao, W. Zheng, J. Li, F. Lin, and B. Zhang. A formal study of shot boundary detection. IEEE Trans. Circ. Syst. Video Technol., 17(2):168–186, 2007.
[9] H. Zhang, A. Kankanhalli, and S. W. Smoliar. Automatic partitioning of full-motion video. Multimedia Systems, 1(1):10–28, 1993.
[10] W. Zheng, J. Yuan, H. Wang, F. Lin, and B. Zhang. A novel shot boundary detection framework. In Proc. of SPIE VCIP 2005, pages 410–420.
