Computer Vision and Image Understanding 113 (2009) 384–396


Semi-supervised kernel density estimation for video annotation

Meng Wang a,*, Xian-Sheng Hua a, Tao Mei a, Richang Hong b, Guojun Qi b, Yan Song b, Li-Rong Dai b

a Microsoft Research Asia, Zhichun Road, Beijing 100080, PR China
b University of Science and Technology of China, Huanshan Road, Hefei 230027, PR China

An early version of this paper has been published in the proceedings of ACM Multimedia 2006.
* Corresponding author. E-mail addresses: [email protected] (M. Wang), xshua@microsoft.com (X.-S. Hua), [email protected] (T. Mei), [email protected] (R. Hong), [email protected] (G. Qi), songy@ustc.edu.cn (Y. Song), lrdai@ustc.edu.cn (L.-R. Dai).

Article history: Received 30 September 2007; Accepted 18 August 2008; Available online 29 August 2008.

Keywords: Video annotation; Semi-supervised learning; Kernel density estimation

Abstract

Insufficiency of labeled training data is a major obstacle for automatic video annotation. Semi-supervised learning is an effective approach to this problem that leverages a large amount of unlabeled data. However, existing semi-supervised learning algorithms have not demonstrated promising results in large-scale video annotation due to several difficulties, such as the large variation of video content and the intractable computational cost. In this paper, we propose a novel semi-supervised learning algorithm named semi-supervised kernel density estimation (SSKDE), which is developed based on the kernel density estimation (KDE) approach. While only labeled data are utilized in classical KDE, in SSKDE both labeled and unlabeled data are leveraged to estimate class conditional probability densities based on an extended form of KDE. It is a non-parametric method, and it thus naturally avoids the model assumption problem that exists in many parametric semi-supervised methods. Meanwhile, it can be implemented with an efficient iterative solution process. Therefore, this method is appropriate for video annotation. Furthermore, motivated by existing adaptive KDE approaches, we propose an improved algorithm named semi-supervised adaptive kernel density estimation (SSAKDE). It employs local adaptive kernels rather than a fixed kernel, such that broader kernels can be applied in regions with low density. In this way, more accurate density estimates can be obtained. Extensive experiments have demonstrated the effectiveness of the proposed methods.

© 2008 Elsevier Inc. All rights reserved.

1. Introduction

With rapid advances in storage devices, networks, and compression techniques, large-scale video data are becoming available to more and more average users. Managing and accessing these data is a challenging task. To deal with this issue, a common theme has been to develop techniques for deriving metadata from videos that describe their content at syntactic and semantic levels. With the help of such metadata, manipulations of video data such as summarization, indexing, and retrieval can be easily accomplished. Video annotation is an elementary step towards obtaining these metadata.

Ideally, video annotation is formulated as a classification task that can be accomplished by learning-based methods. However, due to the large gap between low-level features and the semantic concepts to be annotated, learning-based methods typically need a large labeled training set to guarantee annotation accuracy. As human annotation is labor-intensive and time-consuming


(experiments show that annotating 1 h of video with 100 concepts can take anywhere between 8 and 15 h [22]), several methods that help reduce human effort have been proposed. One approach to dealing with the training data insufficiency problem is to apply semi-supervised learning (SSL) algorithms, which leverage a large amount of unlabeled data to boost classification accuracy [9,37,44].

Although many different SSL algorithms have been applied in multimedia annotation and several encouraging results have been reported, SSL methods are still not popular in this field, in particular on large datasets such as the TRECVID benchmark [3]. We suppose this is mainly due to the following two factors:

(1) Large variation of video content. Many SSL methods are only effective when the assumed models are accurate [10]. Consider SSL with a parametric model, which is a large family of SSL algorithms. Although this approach can be employed with different generative models, such as Gaussian mixture models and multiple multinomials [7,24], it has not been successfully applied in image or video annotation, since it is difficult to accurately model video semantic concepts.

(2) Large computational cost. Many SSL algorithms introduce much larger computational costs than supervised methods, and thus they can hardly be applied when dealing with a


large dataset. For example, the computational cost of the transductive SVM scales as O(n^3), where n is the size of the dataset, including labeled and unlabeled samples [42]. This cost is infeasible when n is large.

In this paper we propose a novel SSL method named semi-supervised kernel density estimation (SSKDE) to address these two difficulties. This method is developed based on a non-parametric density estimation approach, i.e., kernel density estimation (KDE), so it avoids the model assumption problem. In classical KDE, class conditional probability densities are estimated from only labeled samples, whereas in SSKDE both labeled and unlabeled samples are utilized by introducing an extended form of kernel density estimation. Based on the extended KDE, densities and posterior probabilities are related bi-directionally (note that posterior probabilities can be derived from densities based on Bayes rule). SSKDE is thus formulated based on this bi-directional relationship, and it can be solved by an efficient iterative process, i.e., by iteratively updating densities and posterior probabilities. We also show that SSKDE is closely related to graph-based SSL methods; based on SSKDE, we can provide more natural interpretations of several studies on graph-based methods.

Building on SSKDE, we further propose an improved method named semi-supervised adaptive kernel density estimation (SSAKDE). In SSAKDE, the kernels over observed samples are adapted such that broader kernels are adopted in regions with low density. In this way, more accurate density estimates can be obtained. Experiments demonstrate that this method further improves the performance of SSKDE and that it is superior to many other existing supervised and semi-supervised methods. The main contributions of this paper are highlighted as follows:

(1) We develop the SSKDE method based on a non-parametric approach. It incorporates unlabeled data into KDE such that better performance can be obtained.
(2) We investigate the connection between SSKDE and graph-based SSL methods, and show that SSKDE helps better understand graph-based methods.
(3) We further propose SSAKDE based on SSKDE. It achieves better performance by adopting adaptive kernels.

The organization of the rest of this paper is as follows. In Section 2, we briefly review related work. In Section 3, the SSKDE algorithm is formulated. We provide a discussion of this algorithm in Section 4, including its solution and its relationship with other existing methods. We then introduce SSAKDE in Section 5. Experiments are presented in Section 6, and we discuss the computational costs of the proposed methods in Section 7, followed by concluding remarks in Section 8. Additionally, we provide an analysis of the effect of unlabeled samples on SSKDE and SSAKDE in the Appendix.

2. Related work

Video annotation is also named "high-level feature extraction" or "semantic concept detection", which is a task in the TRECVID benchmark [3]. It is regarded as a promising approach to bridging the semantic gap such that higher-level manipulation can be facilitated. As noted by Hauptmann [18], this splits the semantic gap between low-level features and user information needs into two hopefully smaller gaps: (a) mapping the low-level features into intermediate semantic concepts, and (b) mapping these concepts into user needs. Annotation is exactly the step


to accomplish the first mapping. When we only consider visual information, it is also closely related to work on "image annotation", such as [15,16]. Naphade and Smith [23] have given a survey of the TRECVID high-level feature extraction benchmark, where a great deal of different algorithms applied to this task can be found.

Over recent years, the availability of large data collections with only limited human annotation has turned the attention of a growing community of researchers to the problem of SSL. By leveraging unlabeled data under certain assumptions, SSL methods are expected to build more accurate models than those achievable by purely supervised learning methods. Many different SSL algorithms have been proposed. Some often-applied ones include self-training [26], co-training [6], transductive SVM [42], SSL with parametric models [7,24], and graph-based SSL methods [5,43,46]. Extensive reviews of the existing methods can be found in [9,44].

Although many different SSL algorithms are available, only a few of them have been applied to image/video content analysis. In [39], Wu et al. proposed a method named Discriminant-EM, which makes use of unlabeled data to construct a generative model; but they also pointed out that the performance of the proposed method is compromised if the components of the data distribution are mixed up. In [33], Tian et al. conducted a study of SSL in image retrieval, and illustrated that SSL is not always helpful in this field due to inappropriate assumptions. In [28], Song et al. applied co-training to video annotation based on a careful split of visual features. In [40], Yan et al. pointed out the drawbacks of co-training in video annotation, and proposed an improved co-training-style algorithm named semi-supervised cross-feature learning.

Recently, graph-based SSL methods have attracted great interest from researchers in this community. In [19], He et al. adopted a graph-based SSL method named manifold-ranking in image retrieval, and Yuan et al. then applied the same algorithm to video annotation [41]. Tang et al. proposed a graph-based SSL method named kernel linear neighborhood propagation and demonstrated its effectiveness in video annotation [30]. Wang et al. developed a multi-graph learning method, such that several difficulties in video annotation can be attacked in a unified scheme [36]. More recent works in this field focus on incorporating the local structures around samples into the design of graphs. In [31], Tang et al. integrated the difference of densities around two samples into the estimation of their similarity, and this method has been shown to be better than estimating similarities based solely on distances in feature space. In [38], Wang et al. proposed a neighborhood similarity based on the pairwise Kullback-Leibler divergence of the local distributions around samples.

However, many more works on image or video annotation only employ supervised methods. Especially in the TRECVID benchmark, no satisfactory results with SSL methods have been reported. It seems that this field has not taken sufficient advantage of SSL algorithms. As aforementioned, this is attributed to the invalid prior models and the large computational costs of the existing SSL methods. In this work, we develop the SSKDE and SSAKDE algorithms based on a non-parametric approach. These two methods avoid the model assumption problem and they are computationally efficient, so they are appropriate for video annotation.
We show that SSKDE is closely related to graph-based SSL methods, which can explain why graph-based methods are relatively more popular among existing SSL approaches to video annotation. We will demonstrate the effectiveness of SSKDE and SSAKDE, and additionally show that the proposed methods are computationally efficient and can be applied to large-scale annotation.


3. Semi-supervised kernel density estimation

In this section, we detail the formulation of SSKDE. Firstly, we introduce the notations and problem definition. Then we provide an extended form of KDE and derive SSKDE based on it.

3.1. Notations and the problem

We consider a normal K-class classification problem. There are l labeled samples L = \{x_1, x_2, \ldots, x_l\} and u unlabeled samples U = \{x_{l+1}, \ldots, x_{l+u}\}, with x \in \mathbb{R}^d. Let y_i denote the label of x_i (x_i \in L); we have y_i \in \{1, 2, \ldots, K\}. Let n = l + u be the total number of samples. Denote by L_i the set of samples with label i and let l_i denote its size; thus L = \cup_i L_i and l = \sum_i l_i. Assume that the i.i.d. samples x_i (x_i \in L \cup U) are drawn from an unknown (global) probability density function p(x), and denote by p(x \mid C_k) the class conditional probability density function of class k. For concision, we abbreviate "class conditional probability density function" to density in the rest of this paper. The task is then to assign labels to the x_i with x_i \in U. For clarity, we list all the notations used throughout this paper in Table 1.

It is well known that density estimation methods can be categorized into parametric and non-parametric approaches. Among the non-parametric methods, the most popular one is KDE (or Parzen density estimation) [25], by which the class conditional densities in the above problem can be estimated as

\hat{p}(x \mid C_k) = \frac{1}{l_k} \sum_{x_j \in L_k} \kappa(x - x_j)    (1)

where \hat{p}(x \mid C_k) is the estimated density of class k, and \kappa(x) is a kernel function that satisfies \kappa(x) > 0 and \int \kappa(x)\,dx = 1. The most widely applied one is the Gaussian kernel, i.e.,

\kappa_g(x) = \frac{1}{(2\pi)^{d/2}\sigma^d} \exp\left(-\lVert x \rVert^2 / 2\sigma^2\right)    (2)

Besides the Gaussian kernel, in this work we will also apply the Exponential kernel, i.e.,

\kappa_e(x) = \frac{1}{(2\sigma)^d} \exp\left(-\lVert x \rVert / \sigma\right)    (3)

3.2. Extended kernel density estimation

As traditional KDE is based on only labeled data, the accuracy of the class conditional densities estimated by KDE relies heavily on the number of labeled samples. As shown in Fig. 1, densities estimated from limited labeled samples are inaccurate, which may induce a shifted classification boundary. On the other hand, unlabeled samples are usually much more plentiful than labeled ones. If the labels of the unlabeled samples were also known, the estimated densities would be much more accurate. This directly motivates us to incorporate unlabeled data into KDE.

To extend KDE to unlabeled samples, we first assume that the class posterior probabilities of all samples are known (how to compute them is detailed in the next subsection). For concision, in the following discussion we abbreviate "class posterior probability" to posterior probability. Denote by P(C_k \mid x_i) the posterior probability of class k given x_i. We weight the kernels in KDE by the corresponding posterior probabilities as follows:

\hat{p}(x \mid C_k) = \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} P(C_k \mid x_j)}    (4)

We can see that Eq. (1) and Eq. (4) are the same if we let U = \emptyset and P(C_k \mid x_i) = \delta(y_i = k), where i \in L and \delta is the indicator function (i.e., \delta[\text{true}] = 1, \delta[\text{false}] = 0).
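As a concrete illustration, the extended estimator can be evaluated as in the following sketch (a minimal NumPy implementation of Eqs. (2) and (4); it is our own, not the authors' code, and the array names are illustrative). With one-hot posteriors placed on the labeled samples of a class only, it reduces to the classical estimator of Eq. (1):

```python
import numpy as np

def gaussian_kernel(diff, sigma):
    """Gaussian kernel of Eq. (2), evaluated at each row of `diff` (m x d)."""
    d = diff.shape[1]
    norm = (2 * np.pi) ** (d / 2) * sigma ** d
    return np.exp(-np.sum(diff ** 2, axis=1) / (2 * sigma ** 2)) / norm

def extended_kde(x, samples, posteriors_k, sigma):
    """Extended KDE of Eq. (4): kernels over all n samples, weighted by the
    posterior probabilities P(C_k | x_j) of class k."""
    k = gaussian_kernel(x - samples, sigma)    # kappa(x - x_j), j = 1..n
    return np.dot(posteriors_k, k) / np.sum(posteriors_k)

# One-hot posteriors on the labeled samples of class k recover Eq. (1).
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 2))            # toy data, d = 2
posteriors_k = np.zeros(100)
posteriors_k[:10] = 1.0                        # 10 "labeled" samples of class k
print(extended_kde(np.zeros(2), samples, posteriors_k, sigma=0.5))
```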

Table 1
Symbols and corresponding descriptions

Symbol      Description
d           Dimension of feature space
l           Number of labeled samples
u           Number of unlabeled samples
n           n = l + u, number of all samples
K           Number of classes
L           L = {x_1, x_2, ..., x_l}
U           U = {x_{l+1}, x_{l+2}, ..., x_n}
N_i         Neighborhood around x_i, used in SSAKDE
N           Neighborhood size, used in SSAKDE
L_k         Set of samples labeled as class k
l_k         Size of L_k
κ(x)        Kernel function
p(x)        Global probability density function
p(x|C_k)    Conditional probability density function of class k
P(C_k|x)    Posterior probability of class k given x
P(C_k)      Prior probability of class k
δ           Indicator function, δ[true] = 1, δ[false] = 0
W           n × n matrix, W_ij indicates the similarity between x_i and x_j
D           n × n diagonal matrix, D_ii = Σ_j W_ij
F           n × K matrix, F_ik is the estimated value of P(C_k|x_i)
P           n × n matrix, see Eq. (10)
F_i         F_i = [F_i1, F_i2, ..., F_iK]
σ           Bandwidth of the Gaussian kernel, see Eq. (2)
μ           See Eq. (23) and Eq. (24)
t_i         See Eq. (9)
T'          See Eq. (15)
T''         See Eq. (16)

Fig. 1. (a) True densities and extracted samples; (b) KDE on labeled data with a Gaussian kernel (large symbols are labeled samples); (c) estimated densities (solid lines) under the assumption that the labels of the unlabeled samples are known. We can see that the densities estimated in (b) are not accurate due to sample insufficiency, whereas the problem is alleviated in (c).


Note that an attractive property of KDE is its consistency, i.e., its convergence to the target function as n \to \infty. Here we show that the L1 convergence of the extended KDE can be proven as well. We re-write the kernel function \kappa(x) as h(x/\sigma), where \sigma is the kernel bandwidth (i.e., a smoothing factor). Define

J_n = \int_{x \in \mathbb{R}^d} \left| \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} P(C_k \mid x_j)} - p(x \mid C_k) \right| dx    (5)

Then we have the following result:

Theorem 1. If \sigma \to 0 and n\sigma^d \to \infty, then J_n converges to 0 almost surely as n \to \infty.

The proof of Theorem 1 can be found in the Appendix. From Theorem 1 we can see that Eq. (4) is a natural extension of classical KDE. However, how to compute the posterior probabilities P(C_k \mid x_i) remains a problem. The extended KDE indicates that densities can be estimated from posterior probabilities; meanwhile, it is well known that posterior probabilities can be computed from densities according to Bayes rule. So, densities and posterior probabilities are related bi-directionally, as illustrated in Fig. 2. In the next subsection, we formulate SSKDE based on this bi-directional relationship.

Fig. 2. Relationship between densities and posterior probabilities.

3.3. Formulation of SSKDE

Denote by P(C_k) the prior probability of class k. As densities are estimated in Eq. (4), posterior probabilities can be re-computed by Bayes rule as follows:

\hat{P}(C_k \mid x_j) = \frac{P(C_k)\,\hat{p}(x_j \mid C_k)}{\sum_{k=1}^{K} P(C_k)\,\hat{p}(x_j \mid C_k)}    (6)

Meanwhile, by Bayes rule the prior probabilities can be approximated based on the strong law of large numbers as

P(C_k) = \int P(C_k \mid x)\,p(x)\,dx \approx \frac{1}{n} \sum_{i=1}^{n} P(C_k \mid x_i)    (7)

Plugging Eq. (4) and Eq. (7) into Eq. (6), we obtain

\hat{P}(C_k \mid x_j) = \frac{\sum_{i=1}^{n} P(C_k \mid x_i)\,\kappa(x_j - x_i)}{\sum_{i=1}^{n} \kappa(x_j - x_i)}    (8)

To compute \hat{P}(C_k \mid x_i), we assume that they are close to P(C_k \mid x_i). Thus it is rational to let P(C_k \mid x_i) = \hat{P}(C_k \mid x_i) for i \in U, and P(C_k \mid x_i) = (1 - t_i)\hat{P}(C_k \mid x_i) + t_i\,\delta(y_i = k) for i \in L, where 0 < t_i \le 1 (we use the weights t_i to integrate the labeling information of the labeled samples). For clarity, in the following discussion we let F_{jk} denote the estimated posterior probabilities (i.e., we replace \hat{P}(C_k \mid x_j) with F_{jk}) and let P(C_k \mid x_j) denote the truths. Thus we have

\begin{cases} \dfrac{\sum_{i=1}^{n} F_{ik}\,\kappa(x_j - x_i)}{\sum_{i=1}^{n} \kappa(x_j - x_i)} = F_{jk} & \text{where } x_j \in U \\[6pt] (1 - t_j)\,\dfrac{\sum_{i=1}^{n} F_{ik}\,\kappa(x_j - x_i)}{\sum_{i=1}^{n} \kappa(x_j - x_i)} + t_j\,\delta(y_j = k) = F_{jk} & \text{where } x_j \in L \end{cases}    (9)

Equation set (9) is a linear equation set with respect to F_{ik}. For clarity of its solution, we re-write the equation set in matrix form. Let

P_{ij} = \frac{\kappa(x_i - x_j)}{\sum_{j=1}^{n} \kappa(x_i - x_j)}, \quad 1 \le i, j \le n    (10)

We split the matrix P into four blocks after the l-th row and column:

P = \begin{bmatrix} P_{LL} & P_{LU} \\ P_{UL} & P_{UU} \end{bmatrix}    (11)

Then, we split the posterior probability matrix F into two blocks after the l-th row:

F = \begin{bmatrix} F_L \\ F_U \end{bmatrix}    (12)

Therefore, Eq. set (9) can be written as

\begin{cases} (P_{UU} - I)F_U + P_{UL}F_L = 0 \\ (I - T)(P_{LL}F_L + P_{LU}F_U) + TY - F_L = 0 \end{cases}    (13)

where T = \mathrm{Diag}(t_1, t_2, \ldots, t_l). Consequently, after some algebraic operations, the solution of Eq. set (13) can be written as

F = (T' + I - P)^{-1} T'' Y    (14)

where the matrices T' (n \times n) and T'' (n \times l) are defined by

T'_{ij} = \begin{cases} t_i/(1 - t_i) & \text{if } i = j \text{ and } i \in L \\ 0 & \text{else} \end{cases} \quad (1 \le i, j \le n)    (15)

T''_{ij} = \begin{cases} t_i/(1 - t_i) & \text{if } i = j \text{ and } i \in L \\ 0 & \text{else} \end{cases} \quad (1 \le i \le n,\ 1 \le j \le l)    (16)

4. Discussion

4.1. Solution

The closed-form solution in Eq. (14) involves the inversion of an n \times n matrix, which scales as O(n^3). This cost is intractable when n is large, but we can adopt an EM-style iterative method to avoid the expensive direct solution. The iterative process is illustrated in Fig. 3. It can be viewed as a label propagation process [45], in which the labels of samples are propagated to each other according to a similarity matrix. In all of our experiments we adopt this iterative process instead of the closed-form solution.

Now we prove the convergence of this iterative process. Steps (2) and (3) of the process can be merged as

F \leftarrow \begin{bmatrix} I - T & 0 \\ 0 & I \end{bmatrix} P F + \begin{bmatrix} TY \\ 0 \end{bmatrix}    (17)

Let A = \begin{bmatrix} I - T & 0 \\ 0 & I \end{bmatrix} P; then we have

F = \lim_{n \to \infty} \left\{ A^n F_0 + \left( \sum_{i=1}^{n} A^{i-1} \right) \begin{bmatrix} TY \\ 0 \end{bmatrix} \right\}    (18)

where F_0 is the initial value of F.
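For concreteness, the iterative process of Fig. 3 can be sketched as follows (a minimal NumPy implementation of our own, assuming a Gaussian kernel, a shared weight t for all labeled samples, and a fixed iteration count; none of these choices are prescribed by the paper beyond the equations above):

```python
import numpy as np

def sskde(X, y, l, sigma, t=0.9, K=2, iters=100):
    """Iterative SSKDE solution: X is (n, d) with the first l rows labeled;
    y holds labels in {0, ..., K-1} for those rows."""
    n = X.shape[0]
    # Row-normalized kernel matrix P of Eq. (10), Gaussian kernel.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    W = np.exp(-sq / (2 * sigma ** 2))
    P = W / W.sum(axis=1, keepdims=True)
    Y = np.zeros((l, K))
    Y[np.arange(l), y[:l]] = 1.0             # one-hot targets for labeled rows
    F = np.full((n, K), 1.0 / K)             # initial posteriors F_0
    for _ in range(iters):
        F = P @ F                            # propagate posteriors, Eq. (8)
        F[:l] = (1 - t) * F[:l] + t * Y      # clamp labeled rows, Eqs. (9)/(17)
    return F                                 # row i estimates P(C_k | x_i)
```

Each unlabeled sample x_i is finally assigned to arg max_k F_ik.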


Fig. 3. Iterative solution process of SSKDE.

Since P is row-normalized and t_i > 0, we can derive that

\exists\, c < 1: \quad \sum_{i=1}^{n} A_{ij} \le c, \quad \forall j = 1, 2, \ldots, n    (19)

Therefore

\sum_{i=1}^{n} (A^n)_{ij} = \sum_{i=1}^{n} \sum_{k=1}^{n} A_{ik} (A^{n-1})_{kj} = \sum_{k=1}^{n} \left( \sum_{i=1}^{n} A_{ik} \right) (A^{n-1})_{kj} \le c \sum_{k=1}^{n} (A^{n-1})_{kj} \le c^n    (20)

Thus A^n converges to 0 as n \to \infty. On the other hand, it is not difficult to prove that (I - A) is invertible. Thus Eq. (18) becomes

F = (I - A)^{-1} \begin{bmatrix} TY \\ 0 \end{bmatrix}    (21)

After some algebraic operations, we can derive Eq. (14) from Eq. (21), i.e., the iterative process illustrated in Fig. 3 converges, and the solution in Eq. (14) is its unique fixed point.

4.2. Connection to graph-based SSL

Graph-based methods form a large family among existing SSL methods [43,46]. They are conducted on a graph whose vertices are the labeled and unlabeled samples and whose edges reflect the similarities between sample pairs. An assumption of these methods is label smoothness, which requires the labeling function to simultaneously satisfy two conditions: (1) it should be close to the given truths on the labeled vertices, and (2) it should be smooth on the whole graph. These two conditions are often characterized in regularization frameworks. Many different algorithms in this manner have been proposed, and detailed reviews can be found in [9,44].

We consider two well-known graph-based methods, i.e., the Gaussian random fields (GRF) method and the learning with local and global consistency (LLGC) method. Denote by W an n \times n affinity matrix in which W_{ij} indicates the similarity between x_i and x_j, and denote by D a diagonal matrix whose (i, i)-element equals the sum of the i-th row of W. Then the GRF and LLGC methods are formulated as follows:

\text{GRF}: \arg\min_F \left\{ \sum_{i,j=1}^{n} W_{ij} \lVert F_i - F_j \rVert^2 \right\} \quad \text{s.t. } F_i = Y_i,\ i = 1, 2, \ldots, l    (22)

\text{LLGC}: \arg\min_F \left\{ \sum_{i,j=1}^{n} W_{ij} \left\lVert \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\rVert^2 + \mu \sum_i \lVert F_i - Y_i \rVert^2 \right\}    (23)

where F_i and Y_i indicate the i-th rows of F and Y, respectively. Now we prove that GRF can in fact be derived from SSKDE. We extend Eq. (22) to

F = \arg\min_F \left\{ \sum_{i,j=1}^{n} W_{ij} \lVert F_i - F_j \rVert^2 + \mu \sum_{i=1}^{l} \lVert F_i - Y_i \rVert^2 \right\}    (24)

Obviously Eq. (24) is equivalent to Eq. (22) if \mu is set to \infty. The minimization of the above criterion gives rise to the following linear system:

(R' + D - W)F = R''Y    (25)

R'_{ij} = \begin{cases} \mu & \text{if } i = j \text{ and } x_i \in L \\ 0 & \text{else} \end{cases}    (26)

R''_{ij} = \begin{cases} \mu & \text{if } i = j \text{ and } x_i \in L \\ 0 & \text{else} \end{cases}    (27)

D_{ij} = \begin{cases} \sum_{k=1}^{n} W_{ik} & \text{if } i = j \\ 0 & \text{else} \end{cases}    (28)

Clearly, Eq. (25) is equivalent to Eq. (14) if we let

W_{ij} = \kappa(x_i - x_j)    (29)

and

t_i = \mu / (\mu + D_{ii})    (30)
ð30Þ

Thus, GRF can be derived from SSKDE. Meanwhile, it also indicates that SSKDE can be characterized in a regularization framework as well, i.e., Eq. (23). Although SSKDE and graph-based methods are closely related, they are derived from different viewpoints. While in graph-based SSL methods the label smoothness assumption is characterized in regularization frameworks, in SSKDE it is implicitly represented by kernel function. Generally graph-based methods are regarded as discriminative and transductive, whereas SSKDE can be viewed as a generative and inductive method (since densities have been estimated and out-of-sample data can be classified based on the densities by Bayes rule). If we regard SSKDE as a novel perspective of graph-based methods, it can introduce more natural interpretations to several studies on graph-based methods, such as their induction method and their extension. How to induce graph-based methods to out-of-sample data is once a significant problem since they are generally regarded as transductive methods. However, the induction of SSKDE is

389

M. Wang et al. / Computer Vision and Image Understanding 113 (2009) 384–396

natural. Since densities are estimated in Eq. (4), posterior probabilities of newly given samples can be obtained by Bayes rule. Actually we only have to replace xi with new sample x in Eq. (8), i.e.

^ k jxÞ ¼ PðC

Pn

j

i¼1 PðC k jxi Þ ðx  Pn i¼1 ðx  xi Þ

xi Þ

ð31Þ

j

Eq. (31) is exactly the induction method of graph-based methods proposed in [11], where the authors have mentioned that Eq. (31) ‘‘happens to have the form of Parzen window regressors”. Here we have clarified the reason. Besides that, SSKDE also sheds light on the extension of graphbased SSL methods. Since KDE is a classical method which has already been intensively studied, there exist many approaches trying to improve it. Developing semi-supervised variants for these improved KDE methods naturally leads to improved graph-based SSL methods. In the next section, we provide such an algorithm named SSAKDE.

a

b 1

1

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0

-0.2

-0.2

0

0.5

The adaptive KDE (or variable KDE) methods have been studied for a long time [21,27,32]. The main idea of these methods is to vary the kernel bandwidths according to the sparseness degree of the data such that broader kernels are applied in the regions of low density. In general, a fixed bandwidth approach has difficulties in dealing with densities that exhibit large changes magnitude and curvature. It has been shown that adaptive KDE approach can significantly reduce the bias and mean square error of density estimation [32]. For simplicity, we consider the kernel functions in this form

1 kx  xj kr jðx; xj Þ ¼ exp  Z rrj

! ð32Þ

where Z is the normalization factor. But the method can be generalized to other kernel functions as well. Obviously Gaussian kernel and Exponential kernel (see Eqs. (2) and (3)) correspond to r ¼ 2 and r ¼ 1, respectively. We estimate the sparseness degree around a sample based on its distance to its neighbors. The bandwidth adaptation method is as follows: (1) decide a global fixed bandwidth r and then (2) adjust the kernel bandwidths rj around r and let them be proportional to the sparseness degrees of their nearby regions. Denote by Ni the set of N neighbors of xi , then the method can be expressed as

8 n P r > 1 r > > < n i¼1 ri ¼ r P kxk xi kr r > > > rrir ¼ Pxk 2Ni : r j

xk 2Nj

ð33Þ

kxk xj k

From Eq. (33) we can derive that

P

k 2Ni ri ¼ nrr Pn xP

kxk  xi kr

xk 2Ni kxk

i¼1

!1=r

 xi kr

ð34Þ

Specifically, adaptive Gaussian kernel and adaptive Exponential kernel can be computed as follows:

8 > jg ðx; xi Þ ¼ ð2pÞ1d=2 rd expðkx  xi k2 =2r2i Þ > > i < !1=2 P kxk xi k2 > x 2N 2P P i k > > n 2 : ri ¼ n r i¼1

xk 2Ni

kxk xi k

ð35Þ

0

0.5

1

Fig. 4. A synthetic binary classification task with two training samples. (a) Labels of all samples. (b) Two training samples.

8 > j ðx; xi Þ ¼ ð2r1 Þd expðkx  xi k=ri Þ > < e Pi kxk xi k xk 2Ni > > : ri ¼ nr Pn P i¼1

5. SSAKDE

1

xk 2Ni

ð36Þ

kxk xi k

Then, analogous to the way that leads to SSKDE, we can develop SSAKDE. In fact, in comparison to SSKDE, we only have to replace Eq. (10) by

jðxj ; xi Þ jðxj ; xi Þ

Pij ¼ Pn

ð37Þ

j¼1

and the subsequent process of SSAKDE is the same to SSKDE. 6. Experiments To evaluate the performance of the proposed methods, we conduct experiments for three different classification tasks, including (1) a toy problem, (2) handwritten digit and letter recognition, and (3) video annotation. In all experiments, the parameters t i in SSKDE and SSAKDE are empirically set to 0.9,1 and the parameter l in LLGC is empirically set to 0.1. 6.1. Toy problem We conduct experiments on a synthetic data set illustrated in Fig. 4. There are 130 two-dimensional samples that are uniformly distributed within two circles. For each class a training sample is labeled, as illustrated in Fig. 4. We compare the classification performance of the following six methods: (1) SVM with RBF kernel (2) k-NN (k ¼ 1); (3) KDE; (4) LLGC; (5) SSKDE; and (6) SSAKDE. We use Gaussian kernel in KDE methods. The parameter r in LLGC and KDE methods is set to 0.1. The parameter N in SSAKDE is set to 10. The classification results are illustrated in Fig. 5. From the figure we can see that the three supervised learning methods, including SVM, k-NN, and KDE, all lead to the same classification accuracy, i.e., 74.6%. LLGC and SSKDE achieve accuracies of 76.9% and 74.6%, respectively. This indicates that unlabeled data have not brought significant performance improvements in these two SSL methods. We can see that the classification boundaries obtained by these methods are significantly biased. But the problem has been successfully alleviated in SSAKDE. With merely two training samples, SSAKDE attains a high classification accuracy of 98.5%,

1 As indicated by Eqs. (24) and (30), the parameters ti can be derived from the parameter l in graph-based SSL, which adjusts the trade-off between the two terms in the regularization framework (see Eq. (24)). Existing studies have demonstrated that the performance of graph-based SSL is rather insensitive to the setting of l (compared with the parameter r), and in most works this parameter is empirically set to a fixed value for simplicity [43,46]. Analogously, here we also empirically set ti . In Section 6.2 we will further conduct experiments on the performance sensitivities of SSKDE and SSAKDE with respect to the parameters to demonstrate this.

390

M. Wang et al. / Computer Vision and Image Understanding 113 (2009) 384–396

Accuracy: 74.6%

1 0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0

-0.2

0

0.5

1

Accuracy: 74.6%

1

-0.2

0

Accuracy: 74.6%

1

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0

-0.2

-0.2

0.5

1

0

(c) KDE

0.5

Accuracy: 74.6% 1

0.8

0.8

0.6

0.6

0.4

0.4

0.2

0.2

0

0 0

0.5

1

(d) LLGC

1

-0.2

1

Accuracy: 76.9%

1

0.8

0

0.5

(1) SVM with a RBF kernel. We implement the experiments using LIBSVM [8], a free library for SVM; (2) KDE; (3) SSKDE; (4) LLGC; (5) SSAKDE.

(b) k-NN

(a) SVM

1

-0.2

Accuracy: 98.5%

0

(e) SSKDE

0.5

1

(f) SSAKDE

Fig. 5. Performance comparison for six different methods on the synthetic dataset.

i.e., only two samples are misclassified. Comparing the results obtained by SSKDE and SSAKDE, we can clearly see the effectiveness of the proposed adaptive kernel approach. 6.2. Handwritten digit and letter recognition We conduct experiments on the following two datasets: ‘‘handwritten digit recognition” dataset from Cedar Buffalo [20,46] and ‘‘letter image recognition” dataset from UCI MLR [4]. From these two datasets we generate four classification tasks: (1) (2) (3) (4)

For each classification dataset we set different labeled data size l, and perform 10 trials for each l. In each trial we randomly select labeled samples and use the rest of the samples as testing data. For each trial, if any class does not contain labeled sample, we redo the sampling. We compare the averaged results over 10 trails of the following methods:

10-way classification of all digits; even and odd digits classification; 26-way classification of all letters; letters ‘A’ to ‘M’ and ‘N’ to ‘Z’ classification.

Table 2 illustrates the information on these four classification tasks. Table 2 Information of classification tasks

Since there is no reliable model selection approach when labeled samples are extremely few, the following parameters are tuned to their optimal values: parameter r in the last four methods, the radius parameter c for RBF kernel and trade-off between training error and margin c in SVM model. The size of neighborhood N in SSAKDE is empirically set to 50 (in fact this parameter also can be tuned which will lead to better results for SSAKDE). Fig. 6 illustrates the performance comparison of the five methods. Firstly, we compare KDE, SSKDE and SSAKDE (for concision, we abbreviate them to KDE methods in the following discussion). We can see that SSKDE performs better than KDE in most cases. This indicates the positive contribution of unlabeled data in SSKDE. But we can also see that in even and odd digits classification, SSKDE performs worse than KDE when l is large. In Appendix, we provide a detailed analysis on the effect of unlabeled data, and this phenomenon will be explained. From the experimental results we can clearly see the superiority of SSAKDE over SSKDE. This confirms the effectiveness of the adaptive kernel approach. Then, we compare LLGC and SSAKDE. From the experimental results we can see that in most cases SSAKDE outperforms LLGC. This is an interesting result. LLGC is usually believed to be superior to GRF due to that it can be viewed as based on normalized graph Laplacian, whereas GRF is based on graph Laplacian [17]. We have shown that GRF and SSKDE can be viewed as equivalent to some extent (although they are derived from different viewpoints and there are several small differences between their formulations, see Section 4). Thus we regard LLGC and SSAKDE both as variants of SSKDE method, the former in the viewpoint of spectral graph theory [17] and the latter in the perspective of KDE. SSAKDE also can be regarded as an improved graph-based SSL method. So, the superiority of SSAKDE over LLGC demonstrates that the KDE perspective can help better extend graph-based methods. In these four tasks, the performance gaps between these two methods are small in magnitude. In the next subsection we will demonstrate more significant improvement from LLGC to SSAKDE in video annotation. Now we study the performance sensitivities of SSKDE and SSAKDE to the parameters t i and r. Take the classification task ‘‘Digit 10-way” as an example, and we set l to 100. Fig. 7 illustrates the performance curves of SSKDE and SSAKDE with respect to the parameters. From the figure we can see that, consistent with the existing studies [35,43,46], the setting of r is critical to the performance of SSKDE and SSAKDE, and the parameters ti are relatively less sensitive. The results have confirmed the rationality of the settings of the parameters in our experiments. 6.3. Video annotation

Task

Dimension

Size

Class

Digit even/odd Digit 10-way Letter A-M/N-Z Letter 26-way

256 256 16 16

11,000 11,000 20,000 20,000

2 10 2 26

We conduct experiments on TRECVID 2005 dataset [3]. TRECVID 2005 dataset consists of 273 news videos that are about 160 h in duration. The dataset is split into a development set and a test set. The development videos are segmented into 49,532 shots and 61,901 subshots, and the test videos are segmented into

391

M. Wang et al. / Computer Vision and Image Understanding 113 (2009) 384–396

Digit 10-way

Digit even/odd KDE SVM SSKDE LLGC SSAKDE

0.6

0.3 0.25

error rate

error rate

0.5

KDE SVM SSKDE LLGC SSAKDE

0.4

0.3

0.2 0.15

0.2

0.1

0.1 50

100

150

0.05

200

50

labeled samples

100

150

200

labeled samples

Letter 26-way

Letter A-M/N-Z 0.4 KDE SVM SSKDE LLGC SSAKDE

0.7

0.3

error rate

error rate

0.6

KDE SVM SSKDE LLGC SSAKDE

0.35

0.5

0.4

0.25 0.2 0.15

0.3

0.1 0.2

100

200

300

400

500

100

labeled samples

200

300

400

500

labeled samples

Fig. 6. Performance comparison of different algorithms for digit and letter classification.

SSKDE

SSKDE 0.75

0.75

error rate

error rate

0.7 0.65 0.6

0.7

0.55 0.5

0.2

0.3

0.4

0.5

0.65 0.8

0.6

0.85

σ

1

0.95

1

i

0.9

0.85

SSAKDE

0.88

0.8

error rate

error rate

0.95

t

SSAKDE

0.75 0.7

0.86 0.84 0.82

0.65 0.6 0.2

0.9

0.3

0.4

σ

0.5

0.6

0.8 0.8

0.85

0.9

t

i

Fig. 7. Performance curves of SSKDE and SSAKDE with respect to the parameters r and ti (classification task: digit 10-way; l ¼ 100).

45,766 shots and 64,256 subshots. In the following discussion the development set and test set will be referred to as set 1 and set 2, respectively. A key-frame is selected for each subshot, and from the

key-frame we extract 225D block-wise color moment features based on a 5 by 5 division of the image. We annotate the following 10 concepts: Walking_Running, Explosion_Fire, Maps, Flag-US, Building, Waterscape_Waterfront, Mountain, Prisoner, Sports, and Car. The descriptions of these concepts can be found in TRECVID website [3], and several exemplary key-frames are illustrated in Fig. 8. For each concept, its annotation is considered as a binary classification problem, and thus for sample xi we obtain F i0 and F i1 by KDE methods, which are the posterior probabilities of relevance and irrelevance given sample xi . In the previous two tasks (toy problem and hand written digit and letter recognition), classification results are directly obtained from the posterior probabilities. But in video annotation we frequently encounter imbalanced classes, i.e., positive samples are much fewer than negative samples for the given concept, and thus classification accuracy is not a preferred performance measure. To address this issue, NIST has defined non-interpolated average precision over a set of retrieved shot as a measure of retrieval effectiveness [2]. Here we combine F i0 and F i1 generate the relevance score of xi . Since negative samples are usually much more than positive ones and their distributions are in a very broad domain, positive samples are expected to contribute more in video concept learning [19]. Thus we compute relevance scores fi as

fi ¼ F i0 þ



 1  1  F i1 frequency  1

ð38Þ

where frequency is measured to be the percentage of positive samples in labeled training set. In fact this setting is equivalent to duplicating ð1=frequency  1Þ copies for each positive training sample, so that they are balanced with negative ones. We will show experi-

392

M. Wang et al. / Computer Vision and Image Understanding 113 (2009) 384–396

Walking_Running

Explosion_Fire

Maps

Flag-US

Building

Waterscape_Waterfront

Mountain

Prisoner

Sports

Car

Fig. 8. Exemplary key-frames of the 10 concepts.

mentally that this setting can successfully integrate the two probabilities and it is better than only using F i0 or F i1 . We follow the guideline of TRECVID to evaluate annotation performance [2]. The relevance scores on the test set are merged from subshot to shot by maximum aggregation if a shot contains multiple subshots, i.e.,



f ðshotm Þ ¼ maxsubshoti 2shotm fsubshoti

ð39Þ

Then the relevance scores are ranked and we evaluate average precision of the first 2000 shots. The size of neighborhood N in SSKDE is empirically set to 50, and the parameter r in KDE methods are tuned by 10-fold crossvalidation. We make matrices P sparse by only keeping N largest values in each row in SSKDE and SSAKDE. This is a frequently applied strategy which can significantly reduce computational cost while retaining comparable performance. First, we test the following methods with regarding set 1 as training data and set 2 as testing data: (1) SVM with RBF kernel. Here we split out a hold-out dataset from original development set and use this dataset to tune two parameters in SVM model, i.e., radius parameter c for RBF kernel and trade-off between training error and margin c; (2) LLGC; (3) KDE with Gaussian kernel; (4) KDE with Exponential kernel; (5) SSKDE with Exponential kernel; (6) SSKDE with Gaussian kernel; (7) SSAKDE with Gaussian kernel; (8) SSAKDE with Exponential kernel. The experimental results are illustrated in Table 3. Comparing the performance of KDE methods with two different kernels, it is clear that Exponential kernel is superior to Gaussian kernel. It is due to the fact that generally L1 distance is more appropriate for many visual features, since it can better approximate the perceptual difference between images [29]. Then we compare the KDE methods. We can see that SSKDE outperforms KDE for nearly all concepts. It only performs slightly worse than KDE for concept

Explosion_Fire. The superiority of SSKDE over KDE is evident in MAP measure. Meanwhile, SSAKDE shows much better performance than SSKDE. These results have confirmed the effectiveness of our approaches, including exploiting unlabeled data and adaptive kernels. We can see that SSAKDE also significantly outperforms LLGC. SSAKDE with Exponential kernel has obtained the best results for most concepts among these 8 methods, and its superiority is evident in MAP measure. In Section 4, we have discussed that SSAKDE can be viewed as an extension to traditional graph-based SSL to a certain extent, and thus it will be instructive to compare it with state-of-the-art graph-based SSL methods. In Section 2, we have introduced that several improved graph-based methods also take into account the local structures around samples. Here we compare SSAKDE with three improved graph-based methods: (1) structure-sensitive manifold-ranking (SSMR) [31]; (2) GRF with neighborhood similarity (GRF + NS) [38]; and (3) LLGC with neighborhood similarity (LLGC + NS) [38]. These three methods all attempt to modify traditional distance-based similarity in the design of graphs. The first method has incorporated density difference into similarity estimation, whereas the last two methods have defined a novel ‘‘neighborhood similarity” by taking into account the local distributions around samples. Table 4 illustrates the experimental results. Here we only use Exponential kernel for SSAKDE, since it has been demonstrated to be more effective than Gaussian kernel. The detailed implementation issues and parameter settings of the three improved graph-based SSL methods can be found in [31,38]. From the results we can see that SSAKDE achieves the best results for most concepts. Compared with the three state-of-the-art graphbased SSL methods, the superiority of SSAKDE is evident in MAP measure. We also investigate the effectiveness of Eq. (38) for the three KDE methods. We compare three approaches: (1) rank shots based on F i0 ; (2) rank shots based on F i1 (in fact the shots are ranked according to F i1 ); (3) rank shots based on fi , which are generated according to Eq. (38). We adopt Exponential kernel, and the MAP results are illustrated in Table 5. From the results we can see that individually using F i1 generates very poor results, and it confirms that the previous analysis that positive samples contribute more

393

M. Wang et al. / Computer Vision and Image Understanding 113 (2009) 384–396 Table 3 Performance of six different algorithms on TRECVID 2005 benchmark Concept

SVM

LLGC

KDE (G)

KDE (E)

SSKDE (G)

SSKDE (E)

SSAKDE (G)

SSAKDE (E)

Walking_Running Explosion_Fire Maps Flag-US Building Waterscape_Waterfront Mountain Prisoner Sports Car MAP

0.167 0.067 0.376 0.059 0.413 0.330 0.266 0.0003 0.324 0.262 0.226

0.172 0.046 0.451 0.081 0.399 0.361 0.284 0.0007 0.332 0.270 0.240

0.109 0.052 0.338 0.051 0.325 0.299 0.261 0.0002 0.242 0.238 0.192

0.117 0.058 0.353 0.086 0.317 0.301 0.259 0.0015 0.281 0.240 0.201

0.140 0.049 0.448 0.106 0.405 0.357 0.280 0.0013 0.298 0.264 0.235

0.152 0.051 0.462 0.089 0.403 0.369 0.288 0.0004 0.330 0.268 0.241

0.159 0.054 0.475 0.098 0.404 0.373 0.321 0.003 0.384 0.278 0.255

0.166 0.056 0.491 0.101 0.446 0.383 0.332 0.003 0.385 0.289 0.265

The best results for each concept is illustrated in boldface. Table 4 Performance comparison of SSAKDE and improved graph-based SSL methods Concept

SSAKDE (E)

SSMR

GRF + NS

LLGC + NS

Walking_Running Explosion_Fire Maps Flag-US Building Waterscape_Waterfront Mountain Prisoner Sports Car MAP

0.166 0.056 0.491 0.101 0.446 0.383 0.332 0.003 0.385 0.289 0.265

0.152 0.049 0.474 0.127 0.440 0.337 0.324 0.003 0.359 0.252 0.252

0.168 0.048 0.491 0.118 0.432 0.367 0.331 0.001 0.363 0.292 0.261

0.169 0.047 0.479 0.106 0.436 0.358 0.333 0.0008 0.368 0.287 0.258

The best results for each concept is illustrated in boldface.

Table 5 Comparison of MAP results obtained by three different ranking methods Method

Using fi

Using F i0

Using F i1

KDE (E) SSKDE (E) SSAKDE (E)

0.201 0.241 0.265

0.196 0.233 0.260

0.013 0.113 0.129

The best results for each concept is illustrated in boldface.

than negative ones. However, using fi shows better performance than using F i0 , and this demonstrates that integrating F i1 is still helpful. Finally, we conduct experiments to study whether the effectiveness of the proposed methods would depend on the size of training set and the relative percentages of labeled and unlabeled data. We randomly split set 1 into two sets, i.e., labeled training set and unlabeled training set, and set 2 is regarded as out-of-sample data, as illustrated in Fig. 9. The sizes of unlabeled training set and unlabeled testing set are denoted by u1 and u2 , respectively. So, we have l þ u1 ¼ 61; 901 and u2 ¼ 61; 614. We implement SSKDE and SSAKDE on labeled samples and unlabeled training samples first, and then induce the labels for unlabeled testing samples according to Eq. (31). We compare their performance with KDE (for KDE, unlabeled training samples are not used). We set different l and perform 10 trials for each l. Fig. 10 demonstrates the MAP measures obtained by KDE methods. For clarity, we plot the improveTRECVID development set

labeled data unlabeled training data unlabeled testing data TRECVID test set

Fig. 9. Data distribution for inductive experiments.

ment curves from KDE to SSKDE and from SSKDE to SSAKDE in this figure as well. From the figure we can see that the improvements are always positive, even with little labeled data or with little unlabeled training data. When unlabeled are fewer, the improvement percentages are smaller, but the signs are consistent. This confirms the robustness of the proposed methods in video annotation. 7. Computational cost The computational costs of SSKDE and SSAKDE mainly consist of two parts, one is for the construction of matrix P, and the other is for the iterative solution process illustrated in Fig. 3. We can easily derive that the cost of matrix construction scales as Oðd  n2 Þ and the cost of iterative solution scales as Oðn  N  MÞ, where n is the number of samples, d is the dimension of feature space, N is the neighborhood size, and M is the number of iterations in the solution process. We illustrate the definitions of all these notations and their detailed values in the video annotation experiments in Table 6 for clarity. Obviously the iterative solution process is much more rapid than the matrix construction process. In practical experiments the matrix construction step takes about 30 h, whereas the iterative solution process can be finished in less than 2 min for each concept. But the matrix construction step is concept independent, i.e., the matrix only has to be constructed once and then it can be utilized for all concepts. Compared with traditional methods those need to train a model for each individual concept (such as SVM), SSKDE and SSAKDE have great advantage in computational efficiency when dealing with multiple concepts. For instance, in our experiments we need more than 6 h to train a SVM model for a concept. Since this cost is proportional to the lexicon size, it would be prohibitive if we have to annotate a large lexicon of concepts, such as large-scale concept ontology for multimedia (LSCOM) [1]. Contrarily, SSKDE and SSAKDE only need to repeat the iterative solution process for different concepts, thus their computational costs will not increase dramatically. This property makes these two methods particularly appropriate for large-scale annotation, in terms of both dataset size and lexicon size. All these time costs are recorded on a PC with Pentium 4 3.0G CPU and 1G memory.

8. Conclusion In this paper, we introduce a novel SSL method named SSKDE. It is formulated based on a classical non-parametric approach, i.e., KDE. While only labeled samples are used to estimate densities in KDE, SSKDE is able to leverage both labeled and unlabeled samples. This method naturally avoids the model assumption problem which may degrade performance in many other SSL methods.

394

M. Wang et al. / Computer Vision and Image Understanding 113 (2009) 384–396

Improvement (%)

0.24 KDE SSKDE SSAKDE

0.22 0.2 0.18

0.14 0.12 0.1 0.08 0.06 0.04

KDE to SSKDE 50

0

0

1

2

3

4

5

6

Labeled samples

0

1

2

3

4

5

Labeled Samples

6

7

Improvement (%)

MAP

0.16

100

7 x 104

30

SSKDE to SSAKDE 20 10 0 0

1

x 104

2

3

4

Labeled samples

5

6

7 x 104

Fig. 10. (a) MAP measures obtained by KDE methods with different l and (b) relative improvements from KDE to SSKDE and from SSKDE to SSAKDE.

Table 6 The practical values of the notations

According to Bayes rule and strong law of large number, we have

Notation

Description

Value

n d N M

Number of samples Dimension of feature space Neighborhood size Number of iterations

126,157 225 50 40

Furthermore, we also propose an improved method named SSAKDE. It employs adaptive kernels rather than a global fixed kernel. The bandwidths of local kernels are adapted according to the sparseness degrees of the nearby regions. We have analyzed the effect of unlabeled data in the proposed methods and their connection with other SSL methods. Experiments have demonstrated their effectiveness. A major contribution of this work is the approach that incorporates unlabeled samples into KDE, i.e., the bi-directional relationship between densities and posterior probabilities. As noted in Section 4, KDE is a classical method that has already been extensively studied and there have already been many improved methods [14,21,34]. So, we believe that many new SSL methods can be developed based on these methods by manipulating unlabeled samples analogous to SSKDE. SSAKDE is just such an example. We will try to develop more methods in this way in the future. Appendix A

Furthermore, it is obvious that

   n 1 X  jðx  xj Þ < sup jðxÞ   n j¼1 x

ð43Þ

So, we can derive

! Pn Z 1 Pn   a:e: n j¼1 jðx  xj Þ j¼1 PðC k jxj Þjðx  xj Þ Pn  PðC k jxÞ dx ! 0  1 Pn   n j¼1 PðC k jxj Þ j¼1 jðx  xj Þ

ð44Þ

i.e., J1;n converges to 0 almost surely. Now we prove the convergence of J 2;n . Based on Bayes rule, we have

 Pn Z  1 PðC k jxÞpðxÞ  j¼1 jðx  xj Þ  dx PðC k jxÞ n1 Pn  PðC k Þ  j¼1 PðC k jxj Þ n  Pn Z  1 PðC k jxÞpðxÞ   j¼1 jðx  xj Þ n 6 PðC k jxÞ 1 Pn  1 Pn dx   j¼1 PðC k jxj Þ j¼1 PðC k jxj Þ n n   Z  PðC k jxÞpðxÞ  PðC k jxÞpðxÞ þ  1 Pn  dx n j¼1 PðC k jxj Þ PðC k Þ 

J 2;n ¼

 Z  X  n a:e: 1 jðx  xj Þ  pðxÞdx ! 0   n j¼1

Clearly we have

ð45Þ

ð46Þ

Consequently, we can derive that

1 PðC k jxj Þ j¼1 n  Z   n 1X   jðx  xj Þ  PðC k jxÞpðxÞdx PðC k jxÞ   n j¼1  Z  X  n 1 a:e: 1 6 1 Pn jðx  xj Þ  pðxÞdx ! 0    n PðC jx Þ j k j¼1 n j¼1

J 02;n ¼ 1 Pn ð40Þ

Let J1;n and J2;n denote the two terms on the right-hand side of the above inequality, we only have to prove that both J1;n and J 2;n converge to 0 almost surely. Based on the main theorem in [13] (i.e., the L1 convergence of kernel regression)

 Z Pn   a:e:  j¼1 PðC k jxj Þjðx  xj Þ Pn  PðC k jxÞdx ! 0    j¼1 jðx  xj Þ

ð42Þ

Denote these items by J 01;n and J 01;n , respectively. According to the Theorem 2 in [12] (i.e., the convergence of classical KDE), we have

A.1. Proof of Theorem 1

 Pn Z Pn    j¼1 PðC k jxj Þjðx  xj Þ j¼1 jðx  xj Þ Pn Pn Jn ¼   pðxjC k Þdx   j ðx  x Þ PðC jx Þ j j k j¼1 j¼1 ! Pn Z 1 Pn   n j¼1 jðx  xj Þ j¼1 PðC k jxj Þjðx  xj Þ Pn ¼  1 Pn  PðC k jxÞ dx   n j¼1 PðC k jxj Þ j¼1 jðx  xj Þ  Pn Z     j¼1 jðx  xj Þ þ PðC k jxÞ Pn  pðxjC k Þdx   j¼1 PðC k jxj Þ

Z n 1X a:e: PðC k jxÞpðxÞdx ¼ PðC s kÞ > 0 PðC k jxj Þ ! n j¼1

ð41Þ

ð47Þ

On the other hand, based on Eq. (42), we can easily derive that

J 002;n ¼

 Z  PðC k jxÞpðxÞ a:e:  PðC k jxÞpðxÞ  dx ! 0 1 Pn n j¼1 PðC k jxj Þ PðC k Þ 

ð48Þ

M. Wang et al. / Computer Vision and Image Understanding 113 (2009) 384–396

So, J 2;n converges to 0 almost surely as well, which completes the proof. A.2. Analysis of unlabeled data In this appendix, we provide a qualitative analysis on the effect of unlabeled data in SSKDE. Consider L1 generalization error. Firstly we cite a conclusion from the study on kernel regression [13]. Define

 Z Pn    j¼1 PðC k jxj Þjðx  xj Þ Pn Dn ¼   PðC k jxÞpðxÞdx   j ðx  x Þ j j¼1

ð49Þ

Then we have the follow theorem. Theorem 2. If r ! 0 and nrd ! 0, then for every  > 0 there exists constants c and n0 , such that for every n P n0 , PðDn P Þ < ecn . The proof of Theorem 2 can be found in [13]. Now we replace PðC k jxj Þ by estimated posterior probabilities F jk . Thus the definition of generalization error turns to follows:

D0n ¼

 Z Pn    j¼1 F jk jðx  xj Þ  PðC k jxÞpðxÞdx  Pn   j¼1 jðx  xj Þ

ð50Þ

Then we define

D0l ¼

 Z Pl    j¼1 F jk jðx  xj Þ  PðC k jxÞpðxÞdx  Pl   j ðx  x Þ j j¼1

ð51Þ

D0n and D0l can be regarded as the generalization errors of KDE and SSKDE, respectively. Suppose that the estimated posterior probabilities have biases Djk , i.e., F jk ¼ PðC k jxj Þ þ Djk . Define Dl ¼ maxj6l jDjk j and Dn ¼ maxj6n jDjk j. Thus, we can obtain

D0n

 Z Pn    j¼1 PðC k jxj Þjðx  xj Þ Pn 6   pðC k jxÞpðxÞdx   j¼1 jðx  xj Þ   P Z  n   j¼1 Djk jðx  xj Þ þ  Pn pðxÞdx   j¼1 jðx  xj Þ

6 Dn þ Dn Similarly we have D0l 6 Dl þ Dl . Now we name Dl and Dn supervised and semi-supervised generalization error, respectively. Analogously, Dl and Dn are named supervised and semi-supervised bias error, respectively. According to the definitions of Dl and Dn we can find that Dl 6 Dn . This is consistent with intuition, since the biases in estimated posterior probabilities of unlabeled samples are usually greater than labeled samples (the posterior probabilities of labeled samples can be directly obtained by their labels). Meanwhile, according to Theorem 2, it is reasonable for us to suppose Dn 6 Dl . Then we can find the twofold effect of the unlabeled samples: (1) Decrease generalization error. This is according to Theorem 2 which indicates that the generalization error in kernel regression reduces as training samples increase. (2) Increase bias error. This is due to the fact that the posterior probabilities of unlabeled samples are usually not accurate enough. Thus, if the decrease of generalization error is greater than the increase of bias error, then SSKDE outperforms KDE; otherwise, KDE performs better, such as the results illustrated in Fig. 6(b). This phenomenon also can be explained in another perspective. We revisit the iterative process in Fig. 3. It is mentioned that it can be viewed as a propagation process. When the posterior probabil-

395

ities of unlabeled samples have too large biases, they may propagate to each other, so that the performance degenerates and then SSKDE may be even worse than the traditional KDE method. Currently it is difficult to accurately predict whether SSKDE will perform better (or how much better) than KDE given a dataset. Trying to establish such a theoretical framework will be an interest work. References [1] LSCOM lexicon definitions and annotations version 1.0. dto challenge workshop on large scale concept ontology for multimedia, in: DTO Challenge Workshop on Large Scale Concept Ontology for Multimedia, Columbia University ADVENT Technical Report #217-2006-03. [2] TREC-10 Proceedings Appendix on Common Evaluation Measures. Available from: . [3] TRECVID: TREC Video Retrieval Evaluation. Available from: . [4] UCI Repository of Machine Learning Databases. Available from: . [5] M. Belkin, L. Matveeva, P. Niyogi, Regularization and semi-supervised learning on large graphs, in: Proceedings of COLT, 2004. [6] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of COLT, 1998. [7] V. Castelli, T. Cover, The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter, IEEE Transactions on Information Theory 42 (1996). [8] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, 2001. available from: http://www.csie.ntu.edu.tw/~cjlin/libsvm. [9] O. Chapelle, A. Zien, B. Scholkopf, Semi-Supervised Learning, MIT Press, 2006. [10] I. Cohen, F.G. Cozman, N. Sebe, M.C. Cirelo, T.S. Huang, Semi-supervised learning of classifiers: theory algorithms and their application to human– computer interaction, IEEE Transactions on Pattern Analysis and Machine Intelligence (2004). [11] O. Delalleau, Y. Bengio, N.L. Roux, Efficient non-parametric function induction in semi-supervised learning, in: Proceedings of International Conference on Artificial Intelligence and Statistics, 2005. [12] L. Devroye, The equivalence of weak, strong and complete convergence in L1 for kernel density estimates, The Annals of Statistics 11 (1983). [13] L. Devroye, A. Krzyzak, An equivalence theorem for L1 convergence of the kernel regression estimate, Journal of Statistical Planning and Inference (1989). [14] A. Elgammal, R. Duraiswami, L.S. Davis, Efficient kernel density estimation using the fast gaussian transform for computer vision, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003). [15] S.L. Feng, R. Manmatha, V. Lavrenko, Multiple bernoulli relevance models for image and video annotation, in: Proceedings of International Confernce on Computer Vision and Pattern Recognition, 2004. [16] A. Ghoshal, P. Arcing, S. Khudanpur, Hidden markov models for automatic annotation and content-based retrieval of images and video, in: Proceedings of International ACM SIGIR Conference, 2005. [17] F.C. Graham, Spectral Graph Theory, Regional Conference Series in Mathematics, vol. 92, American Mathematical Society, 1997. [18] A.G. Hauptmann, Lessons for the future from a decade of informedia video analysis research, in: Proceedings of ACM International Conference on Image and Video Retrieval, 2005. [19] J.R. He, M.J. Li, H.J. Zhang, H.H. Tong, C.S. Zhang, Manifold-ranking based image retrievalm, in: Proceedings of ACM Multimedia, 2004. [20] J.J. Hull, A dataset for handwritten text recognition research, IEEE Transactions on Pattern Analysis and Machine Intelligence (1994). [21] A.J. 