Gradual Transition Detection with Conditional Random Fields Jinhui Yuan, Jianmin Li and Bo Zhang State Key Laboratory of Intelligent Technology and Systems Department of Computer Science and Technology Tsinghua University, Beijing, 100084, P. R. China

[email protected] {lijianmin, dcszb}@mail.tsinghua.edu.cn ABSTRACT In this paper, we view gradual transition detection as a sequence labeling problem and propose to use Conditional Random Fields (CRFs) for this purpose. CRFs is a state-ofthe-art sequence labeling approach. It provides a unified way to integrate various useful clues to form a decision system. Moreover, it has principled way for parameter estimation and inference. Compared to rule-based approaches, gradual transition detection with CRFs requires fewer human interactions while designing the system. The experiments on TRECVID platform show that CRFs can achieve comparable performance to that of the state-of-the-art approaches.

Categories and Subject Descriptors H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Abstracting methods, Indexing methods; I.5.1 [Pattern Recognition]: Models—Statistical, Structural

General Terms Algorithms, Experimentation, Performance

Keywords Gradual Transition Detection, Conditional Random Fields

1. INTRODUCTION

Shot boundary detection (SBD) is a prerequisite step for content-based video retrieval (CBVR). After more than a decade of development, cut detection has been largely solved, while the detection of gradual transitions remains a difficult problem. Surveys such as [2, 8] provide in-depth discussions of shot boundary detection. In this paper, we focus on the detection of gradual transitions. The major challenge of gradual transition detection is how to effectively integrate various useful clues [2, 8].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM'07, September 23–28, 2007, Augsburg, Bavaria, Germany. Copyright 2007 ACM 978-1-59593-701-8/07/0009 ...$5.00.

The clues that are probably useful include: (a) content variation over short temporal ranges (local content variation), to accurately locate the start and end positions of gradual transitions; (b) content variation over long temporal ranges (global content variation), to measure whether significant content variation occurs; (c) motion activity, to distinguish whether the content variation is caused by a gradual transition or by motion. Other information may also be useful, such as the length of the last shot [2]; here, we consider only the three factors above. Among existing approaches, the twin-comparison technique proposed by Zhang et al. [9] is probably the best known. To overcome its shortcomings, Zheng et al. [10] proposed a finite state automaton (FSA) method that employs a motion-based adaptive threshold; their system achieved the best result in TRECVID 2004 [5, 10]. Recently, Liu et al. designed another system consisting of a set of finite state machine (FSM) detectors, which achieved the best result in TRECVID 2006 [4, 5]. All the above systems are essentially rule-based approaches: each consists of a set of rules, each of which reflects one kind of evidence of shot transitions. How to integrate the basic rules into a final decision is the central problem of such systems, and it depends heavily on the experience of the system designers. Moreover, the rules involve many heuristically chosen thresholds, which may prevent the systems from generalizing to novel video collections. Boreczky et al. [1] proposed to segment videos with the Hidden Markov Model (HMM). HMM can be viewed as a probabilistic counterpart of the deterministic FSA. HMM shows some advantages over rule-based systems because it does not require manually determined thresholds [1].
In this paper, we view gradual transition detection as a sequence labeling problem and propose to use Conditional Random Fields (CRFs) for this purpose. CRFs are a state-of-the-art sequence labeling approach that outperforms HMM in many applications [3, 7]. They not only provide a powerful means of integrating various useful clues, but also offer principled methods for parameter estimation and inference. Experiments on the TRECVID platform show that CRFs achieve performance comparable to that of state-of-the-art rule-based systems, while requiring less human effort to integrate abundant basic rules into a complex decision system. The paper is organized as follows. In Section 2 we formulate gradual transition detection as a sequence labeling

Figure 1: Each video is partitioned into temporal sequences by a reliable cut detector. The task of gradual transition detection is to assign a suitable label (0 or 1) to each frame.

problem. In Section 3 we give a brief introduction to CRFs. In Section 4 we describe in detail how to define feature functions for CRFs. In Section 5 we evaluate the proposed approach on the TRECVID platform. Finally, we conclude the paper in Section 6.

2. PROBLEM DESCRIPTION

As shown in Figure 1, we first partition each video into temporal sequences using the cut detector proposed in [8]. The cut detector is reliable, yielding about 94% recall and precision on large video collections. Each resulting sequence may be a single shot (e.g., sequence #1 in Figure 1) or may consist of several shots separated by gradual transitions (e.g., sequence #2 in Figure 1). Let Y ∈ {0, 1} denote the assigned label: 0 indicates that the frame is not within a gradual transition, and 1 indicates that it is. From the perspective of machine learning, this is a typical sequence labeling problem. Our task is to learn a model that infers a label sequence from various observed clues of the input sequence, such as local and global content variation.

3. CONDITIONAL RANDOM FIELDS

CRFs originate from the Hidden Markov Model (HMM) [3]. As shown in Figure 2, HMM is a directed generative model. HMM assumes that features at different sites are independent in order to make inference tractable. Concretely, the label Yi at a given position can depend only on the feature Xi at the current site; HMM cannot make the label Yi depend on a window of the input sequence or on the surrounding labels. In contrast, CRFs form an undirected conditional model. They relax the independence assumption of HMM and can be trained discriminatively. Arbitrarily correlated features can be incorporated as long as they are useful, which is very convenient for sequence labeling. Taking gradual transition detection as an example, long-range content variations at adjacent positions are obviously not independent; with CRFs, such correlated features can be incorporated naturally. A recent tutorial on CRFs [7] discusses this topic in detail. In the following, we present a brief introduction to CRFs.

Let X denote the random variable over input sequences to be labeled and Y the random variable over the corresponding label sequences. All components Yi of Y are assumed to be in {0, 1}. The definition of CRFs is as follows [3]:

Definition 1. Let G = (V, E) be a graph such that Y = (Yv)v∈V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Yv obey the Markov property with respect to the graph: p(Yv | X, Yw, w ≠ v) = p(Yv | X, Yw, w ∼ v), where w ∼ v means that w and v are neighbors in G.

Figure 2: (a) Hidden Markov Model, (b) Conditional Random Fields.

The joint distribution over the label sequence Y given X has the form

p(Y | X) = (1/Z(X)) exp( Σ_{e∈E,i} λ_i t_i(e, Y|_e, X) + Σ_{v∈V,i} μ_i s_i(v, Y|_v, X) ),

where Y|_S is the set of components of Y associated with the vertices in subgraph S, Z(X) is the partition function, s_i is a state feature function, t_i is a transition feature function, and λ_i and μ_i are the corresponding weights. Feature functions are usually defined in binary-valued form. A simple example is

s_i(v, Y|_v, X) = δ(Y_v = y) δ(X = x),
t_i(e, Y|_e, X) = δ(Y|_e = ⟨y′, y⟩) δ(X = x),

where δ is an indicator function whose output is 1 only if the contained assertion is true, and 0 otherwise. The state feature s_i reflects the association between observation x and label y at the given position, while the transition feature t_i captures the transition relation between the previous state y′ and the current state y. Intuitively, these binary feature functions can be thought of as rules, and the corresponding weight λ_i (or μ_i) indicates the importance of the specific rule. Thus, CRFs actually integrate these rules by weighted log-linear addition. In the next section, we will show how to define the feature functions for the task of gradual transition detection. Here we assume the features are given. Then the parameter estimation problem is to determine the parameters θ = (λ_1, λ_2, . . . , μ_1, μ_2, . . .) from training data. The inference problem is to find the most probable label sequence ŷ = arg max_y p_{λ,μ}(y | x) for an input sequence x. For chain-structured CRFs, efficient algorithms for exact learning and inference exist; details can be found in [7].
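As an illustration of the decoding step, the following is a minimal sketch of Viterbi dynamic programming for a binary-label chain CRF. This is our own illustrative code, not the authors' implementation (they use the FlexCRFs toolkit); it assumes the weighted feature sums have already been collapsed into per-position state scores and a 2×2 transition-score matrix, and works with raw log-space scores, since exp(·) and Z(X) cancel in the arg max.

```python
import numpy as np

def viterbi_binary(state_scores, trans_scores):
    """Most probable 0/1 label sequence for a linear-chain CRF.

    state_scores[t, y] -- sum_i mu_i * s_i(t, y, x) at position t
    trans_scores[a, b] -- sum_i lambda_i * t_i(a, b) for moving a -> b
    """
    T = state_scores.shape[0]
    score = np.zeros((T, 2))            # best score of any path ending in (t, y)
    back = np.zeros((T, 2), dtype=int)  # backpointers
    score[0] = state_scores[0]
    for t in range(1, T):
        for y in (0, 1):
            cand = score[t - 1] + trans_scores[:, y]
            back[t, y] = int(np.argmax(cand))
            score[t, y] = cand[back[t, y]] + state_scores[t, y]
    labels = [int(np.argmax(score[-1]))]
    for t in range(T - 1, 0, -1):       # follow backpointers
        labels.append(int(back[t, labels[-1]]))
    return labels[::-1]
```

The quadratic inner loop over label pairs is what makes exact inference tractable for chain structures, in contrast to general graphs.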



Figure 3: f(2, 3) is usually adopted to measure the content continuity at the position between frame 2 and frame 3. Instead, we adopt g(3) = [f(1, 4) + f(1, 3) + f(2, 3) + f(2, 4)]/4 to measure the content continuity at that position.

Figure 4: Content variation curve obtained by the graph partition model.

4. DEFINITION OF FEATURE FUNCTIONS

How to define feature functions, and which ones to use, depends on the concrete application. In our case, we define several feature functions according to human knowledge about gradual transition detection. As mentioned in Section 1, three clues are useful for recognizing gradual transitions: local content variation, global content variation, and motion activity. Therefore, we define binary feature functions capturing these clues.

4.1 Content Variation Features

We partition each frame into 2 × 2 blocks and extract an RGB color histogram from each block. The block-based RGB color histogram serves as the compact content representation of each frame. The commonly used histogram intersection method is used to compute the continuity value between two feature vectors. We then calculate a content continuity value for each position based on the graph partition model [8]. If f(i, j) denotes the histogram intersection value between the i-th and j-th frames, the graph partition similarity measure at position i is defined as

g(i) = (1/h²) Σ_{j=i−h}^{i−1} Σ_{k=i}^{i+h−1} f(j, k),

where h is half the width of the sliding window. A simple example showing the relationship between f and g is illustrated in Figure 3. We extract local content variation features from f and a global content variation feature from g.

4.1.1 Local Features

The values f(i − k, i + k), k = 1, . . . , 5 are calculated to characterize the multi-scale content variation at frame i. For convenience, f(i − k, i + k) will be abbreviated as f_k(i) in the following. The continuous values f_k(i) are quantized into discrete ones by the k-means method. The resulting discrete indices are denoted as [f_k(i)], k = 1, . . . , 5. Then the binary state feature function capturing the relation between local content variation and labels is defined as

s_k^local(i, Y|_i, X) = δ(Y_i = y) δ([f_k(i)] = q),   (1)

where q is a specific value of the quantization indices. Note that [f_k(i)] is part of the observations X defined in Section 3.

4.1.2 Global Features

We calculate the g value for each position to obtain a content variation curve (e.g., the curve in Figure 4). We track the curve and find the nearest local minimum for each position i. Then, we use the shape of the nearest local minimum to characterize the global content variation of position i. For example, assuming g(m) is the local minimum nearest to i, the shape of the valley can be described by the vector [g(m − l), g(m − l + 1), . . . , g(m + l − 1), g(m + l)]. To cover gradual transition candidates of varying lengths, we can extract a multi-resolution feature vector from the curve, e.g., [g(m − l × k), g(m − (l − 1) × k), . . . , g(m + (l − 1) × k), g(m + l × k)], where 2 × l is the length of the shape-descriptive vector and k ∈ {1, 3, 5} is the step of multi-resolution sampling. To define binary feature functions for CRFs, we also need to map these continuous vectors into discrete values. Here, we do not use k-means but adopt Support Vector Machines (SVMs). We annotate all the local minima in the training data as positive examples (gradual transitions) or negative examples (non-gradual transitions). With the labeled training data, we train an SVM model for the feature at each resolution. These models are then used to map continuous vectors to discrete class indices, 0 or 1. Thus, the binary state feature function capturing the relation between global content variation and labels can be defined as

s_k^global(i, Y|_i, X) = δ(Y_i = y) δ(svm_k(g_k^i) = 1),   (2)

where svm_k(·) is the output of the k-resolution SVM model and g_k^i is the k-resolution vector describing the shape of the local minimum nearest to i.

4.2 Motion Activity Features

We extract a type of motion activity feature to indicate whether there is motion in the current sequence, so that the system can eliminate the disturbances caused by motion. Each frame is split into blocks of size 48 × 48. The motion vector of each block is then computed by the block matching method. Since the motion vectors of smooth blocks are usually not reliable, we remove the motion vectors of blocks with low pixel variance. The mean motion vector mean_mv of each frame is adopted to express the strength of motion activity. Again, the continuous values mean_mv are mapped to discrete values [mean_mv] by a vector-quantization method. The binary feature function capturing the relation between motion activity and labels is defined as

s^motion(i, Y|_i, X) = δ(Y_i = y) δ([mean_mv] = a),   (3)

where a is a specific value resulting from vector quantization.

4.3 Transition Features

The state transition feature captures the temporal contextual constraints among labels. We define it as

t_{y′,y}(e = ⟨i, i + 1⟩, Y|_e, X) = δ(Y|_e = ⟨y′, y⟩).   (4)

Note that, for simplicity, the value of the above feature is independent of the observations X.
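The continuity measures of Section 4.1 can be sketched as follows. This is an illustrative reimplementation under our own assumptions (frames as NumPy arrays, 8 bins per color channel; the paper does not specify the bin count), not the authors' code; the function names f and g mirror the notation above.

```python
import numpy as np

def block_histogram(frame, bins=8):
    """2x2 block RGB color histograms, each L1-normalized, concatenated.
    `frame` is an (H, W, 3) uint8 array."""
    h, w, _ = frame.shape
    feats = []
    for bi in range(2):
        for bj in range(2):
            block = frame[bi * h // 2:(bi + 1) * h // 2,
                          bj * w // 2:(bj + 1) * w // 2]
            hist, _ = np.histogramdd(block.reshape(-1, 3).astype(float),
                                     bins=(bins,) * 3,
                                     range=((0, 256),) * 3)
            feats.append(hist.ravel() / hist.sum())
    return np.concatenate(feats)

def f(hist_i, hist_j):
    """Histogram intersection continuity value in [0, 1]."""
    return np.minimum(hist_i, hist_j).sum() / hist_i.sum()

def g(hists, i, half=2):
    """Graph-partition continuity at position i:
    g(i) = (1/half^2) * sum_{j=i-half}^{i-1} sum_{k=i}^{i+half-1} f(j, k)."""
    total = 0.0
    for j in range(i - half, i):
        for k in range(i, i + half):
            total += f(hists[j], hists[k])
    return total / (half * half)
```

Because g averages all cross-boundary pairs inside the window, it dips smoothly over a gradual transition instead of producing the single sharp drop that f exhibits at a cut, which is why the valley shape around each local minimum is informative.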



 



Figure 5: (a) Gradual transition detection performance, (b) Frame-based detection performance. Each panel reports recall, precision, and F-measure for Logistic_l, CRF_l, CRF_g, CRF_l_g, and CRF_l_g_m.
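The F-measure reported in Figure 5 is the harmonic mean of recall and precision. As a small illustrative transcription (the zero-division guard is our addition):

```python
def f1(recall, precision):
    """Harmonic mean of recall and precision, used to rank detectors."""
    if recall + precision == 0.0:
        return 0.0
    return 2.0 * recall * precision / (recall + precision)
```

The harmonic mean rewards balanced detectors: a system with high recall but poor precision (or vice versa) scores much lower than one that is moderately good at both.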

5. EXPERIMENTS

In this section, we evaluate the proposed approach on the TRECVID benchmark platform [5]. The test collections of TRECVID 2003 and 2004 are adopted. The 2003 test collection is about 3.05 gigabytes, including 8 videos each lasting about half an hour. The 2004 test collection is about 4.23 gigabytes, comprising 12 videos. We use the 2003 collection as training data and test the model on the 2004 collection. Using these collections, we can compare the performance of CRFs with that of other methods evaluated on the same collections in TRECVID 2004. Performance is evaluated by the recall and precision criteria. To rank different algorithms, the F1 measure, the harmonic mean of recall and precision, is used: F1(recall, precision) = (2 × recall × precision)/(recall + precision). To evaluate the accuracy of the start and end positions of gradual transitions, we also adopt frame-based recall and precision, as done in TRECVID. We focus on evaluating the impact of different feature functions, and therefore implement five versions of CRFs based on the FlexCRFs toolkit [6]:

- Logistic_l: uses only the local features in Equation 1
- CRF_l: uses the features in Equations 1 and 4
- CRF_g: uses the features in Equations 2 and 4
- CRF_l_g: uses the features in Equations 1, 2 and 4
- CRF_l_g_m: uses the features in Equations 1, 2, 3 and 4

Note that Logistic_l does not use the feature in Equation 4, which amounts to assuming that each label Y is independent of the neighboring labels. In this case, the CRF actually degenerates to logistic regression, hence the name Logistic_l. The evaluation results are shown in Figure 5. We can draw the following conclusions: (a) CRF_l significantly outperforms Logistic_l, since Logistic_l truncates many gradual transitions by not taking the temporal constraints among labels into account; (b) CRF_g beats CRF_l in general detection performance but loses in frame-based detection performance, showing that global content variation helps judge whether a gradual transition occurs, while local content variation helps accurately locate the start and end positions of gradual transitions; (c) CRF_l_g_m is slightly superior to CRF_l_g in F1 measure and significantly outperforms CRF_l_g in precision, showing that the motion activity feature can effectively reduce the disturbances caused by motion; (d) CRF_l_g_m achieves nearly the best result in both general detection performance and frame-based performance, and its F1 measure is 0.825, superior to the best performance (0.808) reported in TRECVID 2004 on the same data set [10].

6. CONCLUSIONS

In this paper, we propose to use CRFs to detect gradual transitions. CRFs can incorporate arbitrarily correlated features (clues) in a unified way and offer principled methods for parameter learning and inference. Compared to most rule-based (or finite state machine) systems, CRFs require less human effort when integrating various basic rules into a complex system. Experiments on the TRECVID platform show that CRFs achieve performance comparable to that of state-of-the-art systems. With more feature functions implemented, the performance of CRFs can be expected to improve further.

7. ACKNOWLEDGMENTS

The research of this paper was supported by the National Natural Science Foundation of China (60621062, 60605003) and the Chinese National Key Foundation Research & Development Plan (2003CB317007, 2004CB318108).

8. REFERENCES

[1] J. S. Boreczky and L. D. Wilcox. A hidden Markov model framework for video segmentation using audio and image features. In Proc. of ICASSP 1998.
[2] A. Hanjalic. Shot boundary detection: unraveled and resolved? IEEE Trans. Circ. Syst. Video Technol., 12(2):90–105, 2002.
[3] J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML 2001, pages 282–289.
[4] Z. Liu, D. Gibbon, E. Zavesky, B. Shahraray, and P. Haffner. AT&T research at TRECVID 2006. In Online Proc. of TREC Video Retrieval Evaluation 2006.
[5] NIST. Homepage of TRECVID evaluation. http://www-nlpir.nist.gov/projects/trecvid/.
[6] X.-H. Phan, L.-M. Nguyen, and C.-T. Nguyen. FlexCRFs: Flexible Conditional Random Field Toolkit. 2005. http://flexcrfs.sourceforge.net.
[7] C. Sutton and A. McCallum. An introduction to conditional random fields for relational learning. In L. Getoor and B. Taskar (Eds.), Introduction to Statistical Relational Learning. MIT Press, to appear.
[8] J. Yuan, H. Wang, L. Xiao, W. Zheng, J. Li, F. Lin, and B. Zhang. A formal study of shot boundary detection. IEEE Trans. Circ. Syst. Video Technol., 17(2):168–186, 2007.
[9] H. Zhang, A. Kankanhalli, and S. W. Smoliar. Automatic partitioning of full-motion video. Multimedia Systems, 1(1):10–28, 1993.
[10] W. Zheng, J. Yuan, H. Wang, F. Lin, and B. Zhang. A novel shot boundary detection framework. In Proc. of SPIE VCIP 2005, pages 410–420.
