Robust Cross-Media Transfer for Visual Event Detection

Yang Yang, The University of Queensland, Brisbane, Australia ([email protected])
Yi Yang, Carnegie Mellon University, Pittsburgh, United States ([email protected])
Zi Huang, The University of Queensland, Brisbane, Australia ([email protected])
Jiajun Liu, The University of Queensland, Brisbane, Australia ([email protected])
Zhigang Ma, University of Trento, Trento, Italy ([email protected])

ABSTRACT

In this paper, we present a novel approach, named Robust Cross-Media Transfer (RCMT), for visual event detection in social multimedia environments. Different from most existing methods, the proposed method can directly take different types of noisy social multimedia data as input and conduct robust event detection. More specifically, we build a robust model by employing an $\ell_{2,1}$-norm regression model featuring noise tolerance, and we integrate different types of social multimedia data by minimizing the distribution difference among them. Experimental results on a real-life Flickr image dataset and a YouTube video dataset demonstrate the effectiveness of our proposal compared to state-of-the-art algorithms.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms
Algorithms, Experimentation, Performance

Keywords
Cross-Media, Transfer Learning, Visual Event Detection

1. INTRODUCTION

During the last decade, we have witnessed how powerfully social networking services (e.g., Flickr, YouTube and Facebook) are changing the ways multimedia data are generated, shared and consumed. With the vigorous growth of user-generated content, it is essential to detect visual event information to facilitate multimedia applications such as multimedia indexing [10], retrieval [11, 9] and tagging [8].

In visual event detection, a fundamental issue is how to bridge the so-called 'semantic gap' between low-level visual features and high-level semantics. Most existing event detection approaches follow a common, model-based formulation: one first prepares a collection of training data, usually labeled with a well-defined event set by human experts; the training set is then fed into a supervised learning model to train event detectors (classifiers); finally, the detectors predict whether an event occurs in the test data. Given a well-controlled training dataset with sufficient and accurately labeled samples, traditional model-based methods may obtain satisfactory performance. However, manual labeling is a labor-intensive and time-consuming process, which makes model-based methods impractical in reality. A promising way to address this problem is to exploit domain adaptation techniques to add more labeled samples from other domains [4]. Nevertheless, existing methods usually take only homogeneous types of data into account, which may lead to information loss to some extent. More importantly, robustness is not well exploited in these methods, which makes them hardly noise-tolerant in social multimedia circumstances.

Normally, when an event happens in public, it may be recorded in different ways, such as photos and videos, and the event can then be shared and disseminated in the form of images and video clips via multiple platforms, such as TV channels, web portals and social networking services. An observation is that social images and videos describing the same event are highly relevant and thus should be complementary to each other. Hence, it is natural to assume that using training data from different yet relevant types of social multimedia data would help boost visual event detection performance. However, directly using social multimedia data in existing supervised learning models may lead to poor performance, because social multimedia data are usually not accurately labeled by social users and thus contain much undesirable noise [7]. In this case, we need to devise a robust model which can not only conduct event detection over multiple types of multimedia data but also construct event detectors tolerant of noise in social multimedia environments.

This study explores how different types of noisy social multimedia data, e.g., Flickr images and YouTube videos, can be jointly fed into a common model for visual event detection. To this end, we adopt an $\ell_{2,1}$-norm regression model featuring noise tolerance, which makes it possible to directly use social images and videos without any further cleansing.



Moreover, we also integrate different types of social multimedia data to handle the visual domain-shift problem. Our contributions are summarized as follows:

• We propose a Robust Cross-Media Transfer (RCMT) model for visual event detection, which can directly take loosely labeled social multimedia data as input. Our proposal is tolerant of labeling noise in social multimedia environments.

• Different types of social multimedia data, e.g., social images and social videos, can be simultaneously fed into RCMT to achieve mutual reinforcement for visual event detection.

• Experimental results on a real-life Flickr image dataset and a YouTube video dataset confirm the performance superiority of our proposal over state-of-the-art algorithms.

The rest of this paper is organized as follows. Section 2 elaborates the details of RCMT. Section 3 reports the experimental results, followed by the conclusion in Section 4.

2. THE PROPOSED MODEL

In this section, we present the details of how the proposed model takes multiple types of social multimedia data as input to perform robust visual event detection.

2.1 Robust Visual Event Detection

Suppose we are given a social image dataset $X^{(s)} = \{(x_i^{(s)}, y_i^{(s)})\}_{i=1}^{N_s}$ and a social video dataset $X^{(t)} = \{(x_j^{(t)}, y_j^{(t)})\}_{j=1}^{N_t}$, where $N_s$ and $N_t$ are the numbers of samples in $X^{(s)}$ and $X^{(t)}$ respectively. We represent each video $x_j^{(t)}$ as a sequence of frames $\{x_{jg}^{(t)}\}_{g=1}^{n_j}$ sampled from the video, where $x_{jg}^{(t)}$ is the visual feature of the $g$-th frame of $x_j^{(t)}$ and $n_j$ is the number of frames in $\{x_{jg}^{(t)}\}$. $y_i^{(s)} \in \{-1,1\}^{c \times 1}$ and $y_j^{(t)} \in \{-1,1\}^{c \times 1}$ are event indicator vectors for the $i$-th image and the $j$-th video respectively, and $c$ is the number of events in the event list. The objective is to build a robust event detection model based on both $X^{(s)}$ and $X^{(t)}$, and to ensure that the two domains have positive effects on each other. Starting with a classic solution, we may formulate event detection as multiple kernel ridge regression (MKRR):

\[
\min_{w,V,b}\ \gamma \left\| \Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V + \mathbf{1}b^{T} - Y \right\|_{F}^{2} + \operatorname{tr}\!\Big(V^{T}\Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V\Big) \tag{1}
\]
\[
\text{s.t.}\ w \succeq 0\ \text{and}\ w^{T}\mathbf{1} = 1,
\]

where $\gamma$ is a balance parameter, $V \in \mathbb{R}^{(N_s+N_t)\times c}$ is the regression parameter matrix, $b \in \mathbb{R}^{c\times 1}$ is the bias and $\mathbf{1}$ is an all-one vector. $Y = [y_1^{(s)}, \ldots, y_{N_s}^{(s)}, y_1^{(t)}, \ldots, y_{N_t}^{(t)}]^{T} \in \mathbb{R}^{(N_s+N_t)\times c}$ is the whole event indicator matrix. $\{K^{(i)}\}_{i=1}^{m}$ are $m$ base kernels pre-computed over both $X^{(s)}$ and $X^{(t)}$, and $w = [w_1, w_2, \ldots, w_m]^{T}$ is a weight vector measuring the importance of the base kernels.

Note that if we feed $X^{(s)}$ and $X^{(t)}$ together into the MKRR model, it is in fact a simple transfer learning model based on multiple kernel learning. Each base kernel $K^{(i)}$ is composed as below:

\[
K^{(i)} = \begin{bmatrix} K^{(i)}_{s} & K^{(i)}_{st} \\ K^{(i)}_{ts} & K^{(i)}_{t} \end{bmatrix},
\]

where $K^{(i)}_{s}$ and $K^{(i)}_{t}$ are kernels computed from $X^{(s)}$ and $X^{(t)}$ respectively, while $K^{(i)}_{st} = K^{(i)T}_{ts}$ is the "cross-media" kernel matrix between $X^{(s)}$ and $X^{(t)}$.
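As an illustration, the following minimal numpy sketch shows how one such base kernel could be assembled from precomputed blocks. The function name and arguments are our own hypothetical notation, not released code.

```python
import numpy as np

def compose_base_kernel(K_s, K_t, K_st):
    """Assemble one base kernel K^(i) over X^(s) union X^(t).

    K_s:  (Ns, Ns) kernel among source images.
    K_t:  (Nt, Nt) kernel among target videos.
    K_st: (Ns, Nt) cross-media kernel; K_ts = K_st.T.
    """
    return np.block([[K_s,   K_st],
                     [K_st.T, K_t]])  # (Ns+Nt, Ns+Nt), symmetric
```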

Indeed, MKRR provides a possible solution to cross-media event detection. However, it may well lead to unsatisfactory performance in the noisy circumstances of social multimedia. Normally, event descriptions (e.g., tags, comments and titles) created by amateur users can hardly be guaranteed to be accurate, and different users may have different perspectives or opinions on the same event. Therefore, social multimedia data are often of low quality in terms of labeling, which may degrade event detection performance. To handle this problem, we need to adapt MKRR to noisy data. It has been shown that the $\ell_{2,1}$-norm loss function is a robust alternative to the $\ell_2$-norm loss function [5]. We therefore replace the $\ell_2$-norm loss in Eq. (1) with an $\ell_{2,1}$-norm loss and arrive at robust multiple kernel ridge regression (RMKRR):

\[
\min_{w,V,b}\ \gamma \left\| \Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V + \mathbf{1}b^{T} - Y \right\|_{2,1} + \operatorname{tr}\!\Big(V^{T}\Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V\Big) \tag{2}
\]
\[
\text{s.t.}\ w \succeq 0\ \text{and}\ w^{T}\mathbf{1} = 1.
\]

Here, given any matrix $M$, its $\ell_{2,1}$-norm is defined as $\|M\|_{2,1} = \sum_j \|m^j\|_2$, where $m^j$ is the $j$-th row of $M$. The robust nature of the $\ell_{2,1}$-norm loss lies in the fact that the regression errors introduced by noisy data are not squared and thus have less negative effect on the whole training process.
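To make this concrete, here is a small sketch (our own illustration, assuming numpy) comparing the $\ell_{2,1}$ loss with the squared Frobenius loss on a residual matrix containing one outlier row, such as the one a mislabeled sample would produce.

```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1}: sum of the l2-norms of the rows of M.
    A residual row from a noisy sample contributes linearly,
    not quadratically, so outliers are penalized less."""
    return np.linalg.norm(M, axis=1).sum()

# Example: two clean residual rows and one badly mislabeled sample.
R = np.array([[0.1, 0.2],
              [0.0, 0.1],
              [5.0, 4.0]])                # outlier row
print(l21_norm(R))                        # ~6.73: grows linearly with the outlier
print(np.linalg.norm(R, 'fro') ** 2)      # ~41.1: the squared loss is dominated by it
```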

After solving the problem in (2), we obtain an event detector:

\[
f(x) = V^{T}\sum_{i=1}^{m} w_i k^{(i)}(:, x) + b \tag{3}
\]

where $k^{(i)}(:, x) = \psi_i(X^{(s)} \cup X^{(t)})^{T}\psi_i(x)$, $i = 1, 2, \ldots, m$. We may use it to detect events in either the image or the video domain. Unfortunately, it is still not proper to directly employ Eq. (2) for visual event detection with cross-media transfer, because the distributions of different types of multimedia data may differ greatly, leading to a domain-shift problem that can severely degrade event detection performance. In the next subsection, we provide a solution that minimizes the distribution difference between the two types of social multimedia data.

2.2 Cross-Media Visual Transfer

In order to integrate different types of social data, a basic idea is to transform them into a common space in which they become similar to each other; in other words, their distribution difference should be minimized in this space. To this end, we utilize multiple kernel learning (MKL) to seek such an optimal Reproducing Kernel Hilbert Space (RKHS). Moreover, using MKL is consistent with the robust multiple kernel regression described in the previous subsection. As the distribution distance measure, we use a nonparametric metric proposed in [3], i.e., Maximum Mean Discrepancy (MMD). According to its definition,


an empirical estimate of the distribution distance between $X^{(s)}$ and $X^{(t)}$ is computed as

\[
D_{mmd}(X^{(s)}, X^{(t)}) = \left\| \frac{1}{N_s}\sum_{i=1}^{N_s}\psi(x_i^{(s)}) - \frac{1}{N_t}\sum_{j=1}^{N_t}\psi(x_j^{(t)}) \right\|_{\mathcal{H}} \tag{4}
\]

where $\psi$ is a kernel mapping function, $\mathcal{H}$ is the corresponding RKHS and $\|\cdot\|_{\mathcal{H}}$ denotes the $\ell_2$-norm in $\mathcal{H}$. The MMD distance can be further rewritten as [3]:

\[
D_{mmd}(X^{(s)}, X^{(t)}) = \operatorname{tr}(KP)^{1/2} \tag{5}
\]

where $P = pp^{T}$ and $p$ is an $(N_s+N_t)\times 1$ vector whose first $N_s$ elements are $\frac{1}{N_s}$ and whose remaining elements are $-\frac{1}{N_t}$. Extending Eq. (5) to its multiple kernel version, we obtain

\[
D_{mmd\_mkl}(X^{(s)}, X^{(t)}) = \operatorname{tr}\!\Big(\sum_{i=1}^{m} w_i K^{(i)} P\Big)^{1/2} = (w^{T}s)^{1/2} \tag{6}
\]

where $s = [\operatorname{tr}(K^{(1)}P), \ldots, \operatorname{tr}(K^{(m)}P)]^{T}$. As mentioned before, to integrate $X^{(s)}$ and $X^{(t)}$ we should solve the following problem:

\[
\min_{w}\ D_{mmd\_mkl}(X^{(s)}, X^{(t)}) \tag{7}
\]
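A small sketch of how Eqs. (5) and (6) can be evaluated in practice (numpy; illustrative only, with hypothetical function names). It uses the identity $\operatorname{tr}(K pp^{T}) = p^{T} K p$.

```python
import numpy as np

def mmd_squared(K, Ns, Nt):
    """Squared MMD, tr(KP) with P = p p^T (Eq. (5)),
    computed as p^T K p on a joint kernel over X^(s) union X^(t)."""
    p = np.concatenate([np.full(Ns, 1.0 / Ns), np.full(Nt, -1.0 / Nt)])
    return p @ K @ p

def mmd_mkl(kernels, w, Ns, Nt):
    """Multiple kernel MMD (Eq. (6)): (w^T s)^{1/2},
    where s_i = tr(K^(i) P)."""
    s = np.array([mmd_squared(K, Ns, Nt) for K in kernels])
    return np.sqrt(w @ s)
```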

Thus, by combining (2) and (7), we arrive at the Robust Cross-Media Transfer (RCMT) model:

\[
\min_{w,V,b}\ \mu\,(w^{T}ss^{T}w) + \gamma \left\| \Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V + \mathbf{1}b^{T} - Y \right\|_{2,1} + \operatorname{tr}\!\Big(V^{T}\Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V\Big) \tag{8}
\]
\[
\text{s.t.}\ w \succeq 0\ \text{and}\ w^{T}\mathbf{1} = 1.
\]

After solving (8), we obtain a robust visual event detector as described in Eq. (3). It is worth mentioning that another contribution of this work is an effective optimization algorithm for solving (8); due to space limitations, we omit the optimization details in this version.
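Since the paper's solver is not published in this version, the following is only a plausible sketch of the $\ell_{2,1}$ regression subproblem, using the standard iteratively reweighted least squares (IRLS) device of [5], with the kernel weights $w$ held fixed (so $K = \sum_i w_i K^{(i)}$) and the bias $b$ omitted for brevity. It is not the authors' algorithm.

```python
import numpy as np

def rmkrr_fixed_w(K, Y, gamma, n_iter=20, eps=1e-8):
    """Illustrative IRLS solver for
        min_V  gamma * ||K V - Y||_{2,1} + tr(V^T K V)
    with w fixed and bias omitted. Each step solves a reweighted
    least-squares problem; the weights 1/(2||r_j||) come from the
    l2,1 majorization in [5]."""
    n = K.shape[0]
    d = np.ones(n)                          # per-sample IRLS weights
    for _ in range(n_iter):
        D = np.diag(d)
        # Stationarity (K assumed invertible): (gamma*D K + I) V = gamma*D Y
        V = np.linalg.solve(gamma * D @ K + np.eye(n), gamma * D @ Y)
        R = K @ V - Y                       # residual rows, one per sample
        d = 1.0 / (2.0 * np.linalg.norm(R, axis=1) + eps)
    return V

def detect(K_test_train, V):
    """Apply the learned detector (Eq. (3), bias omitted as above):
    rows of K_test_train hold combined kernel values between a test
    point and all training samples; output is one score per event."""
    return K_test_train @ V
```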

3. EXPERIMENTS

We conduct an empirical study to evaluate the effectiveness of our model on two multimedia datasets.

3.1 Experimental Settings

We compare RCMT with SVM, SimpleMKL (S-MKL) [6] and two transfer learning algorithms, i.e., A-MKL [3] and Feature Replication (FRP) [2]. To demonstrate the robustness and the cross-media transfer effects, we additionally compare with three baseline methods: MKRR and RMKRR, corresponding to Eq. (1) and Eq. (2), and CMT, which uses an $\ell_2$ loss function and can thus be viewed as a non-robust version of RCMT. The parameters of all compared methods are tuned in the range $\{10^{-6}, 10^{-4}, \ldots, 10^{6}\}$.

The target dataset is a collection of YouTube videos from the Kodak Consumer Video Dataset [1]. This dataset is labeled with 25 concepts. We select 8 event-related concepts (i.e., 'birthday', 'dancing', 'graduation', 'parade', 'picnic', 'singing', 'ski' and 'wedding') for evaluation, and only those videos labeled with at least one event are considered. The source images are collected from Flickr by ourselves, using the above 8 events as queries; for each query we download the top 500 results. We do not perform any preprocessing to cleanse the labeling noise; Figure 1 shows some noisy examples. We randomly sample {10%, 20%, 30%, 40%} of the images as the source training data. Videos are randomly divided into a training set (60%) and a test set (40%). Each experiment is repeated 8 times and the mean results are reported.

Following [3], four kernels are used to generate the base kernels: the Gaussian kernel $k(\cdot,\cdot) = \exp(-d^2(\cdot,\cdot)/2\sigma^2)$, the Laplacian kernel $k(\cdot,\cdot) = \exp(-d(\cdot,\cdot)/\sqrt{2}\sigma)$, the inverse square distance kernel $k(\cdot,\cdot) = 1/(d^2(\cdot,\cdot)/2\sigma^2 + 1)$ and the inverse distance kernel $k(\cdot,\cdot) = 1/(d(\cdot,\cdot)/\sqrt{2}\sigma + 1)$. $\sigma$ is set to $\{4^{-3}, \ldots, 4^{3}\}\,\bar{D}$, where $\bar{D}$ is the mean of all squared distances. We use the Euclidean distance to compute image-image distances and the Hausdorff distance to compute image-video and video-video distances. Keyframes of the Kodak videos are extracted at a sampling rate of 2 frames/s. We extract 162-D HSV color histogram features for both YouTube video keyframes and Flickr images. Table 1 summarizes the two datasets.

[Figure 1: Noisy examples from Flickr data. Panels: Dancing, Ski, Birthday, Singing, Wedding.]

Dataset  | # of Events | # of Samples | Feature
YouTube  | 8           | 553          | 162-D HSV
Flickr   | 8           | 4,041        | 162-D HSV

Table 1: Datasets description.
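For concreteness, a sketch (our own construction under the formulas above, not released code) of the base kernel generation and the Hausdorff distance used for image-video and video-video pairs.

```python
import numpy as np

def base_kernels(D2, sigmas):
    """Generate base kernels from a pairwise squared-distance matrix D2,
    one kernel of each of the four forms per sigma in the grid
    {4^-3, ..., 4^3} * mean(D2) described in the text."""
    D = np.sqrt(D2)
    kernels = []
    for s in sigmas:
        kernels.append(np.exp(-D2 / (2.0 * s**2)))              # Gaussian
        kernels.append(np.exp(-D / (np.sqrt(2.0) * s)))         # Laplacian
        kernels.append(1.0 / (D2 / (2.0 * s**2) + 1.0))         # inverse square distance
        kernels.append(1.0 / (D / (np.sqrt(2.0) * s) + 1.0))    # inverse distance
    return kernels

def hausdorff(A, B):
    """Hausdorff distance between two keyframe feature sets
    A (na, d) and B (nb, d)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # (na, nb)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

# Example sigma grid:
# sigmas = [4.0**k * D2.mean() for k in range(-3, 4)]
```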

3.2 Results

In this subsection, we report the evaluation results. Mean Average Precision (MAP) is used as the evaluation metric.
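For reference, a common way to compute MAP is sketched below; the paper does not specify its exact AP implementation, so this is an assumption.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one event: mean of precision@k over the ranks
    at which positive samples appear."""
    order = np.argsort(-scores)
    hits, ap = 0, 0.0
    for k, idx in enumerate(order, start=1):
        if labels[idx] > 0:
            hits += 1
            ap += hits / k
    return ap / max(hits, 1)

def mean_average_precision(score_matrix, label_matrix):
    """MAP over all events; columns index events."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))
```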

3.2.1 Comparison

Tables 2a and 2b show the MAP comparison results.

Setting  | A-MKL      | FRP        | CMT        | RCMT
(a) 10%  | 32.4 ± 2.9 | 25.4 ± 1.1 | 38.1 ± 1.3 | 39.3 ± 1.8
(b) 20%  | 32.5 ± 2.9 | 26.1 ± 1.6 | 36.9 ± 1.3 | 40.3 ± 1.7
(c) 30%  | 34.3 ± 2.4 | 26.0 ± 0.8 | 36.1 ± 2.0 | 39.9 ± 1.8
(d) 40%  | 31.8 ± 2.9 | 26.3 ± 1.0 | 35.4 ± 1.5 | 39.0 ± 1.9

(a) MAP results of A-MKL, FRP, CMT and RCMT, using Flickr images for cross-media transfer. Settings (a)-(d) respectively correspond to sampling ratios {10%, 20%, 30%, 40%} of Flickr images.

SVM        | S-MKL      | MKRR       | RMKRR
21.6 ± 0.6 | 38.5 ± 1.3 | 37.5 ± 1.3 | 36.2 ± 1.5

(b) MAP results of SVM, S-MKL, MKRR and RMKRR, without using cross-media transfer.

Table 2: MAP comparisons among different methods.

From these results we derive the following conclusions:

• Methods based on multiple kernel learning (i.e., A-MKL, CMT, RCMT, S-MKL, MKRR and RMKRR) consistently achieve better performance than those based on a single kernel (i.e., FRP and SVM).

• RCMT consistently outperforms RMKRR, which indicates that the auxiliary social images used in RCMT indeed yield positive transfer.

• In Table 2a, RCMT is better than CMT under all four sampling ratios. This implies that the $\ell_{2,1}$ loss function in RCMT provides more tolerance to noise than the $\ell_2$-norm loss function in CMT.

• In Table 2a, as the proportion of images used grows from 10% to 20%, the performance of RCMT keeps improving, which clearly indicates its robustness to noise. But as the ratio continues growing to 40%, the performance of both CMT and RCMT drops. One possible explanation is that as the number of images increases, it becomes much harder to minimize the domain difference between images and videos, which degrades the performance.

• Compared to the transfer learning algorithms FRP and A-MKL, RCMT and CMT achieve better event detection performance and stability. We believe the improvements should be attributed to the joint consideration of auxiliary social multimedia data and the robust nature of the $\ell_{2,1}$-norm loss function.

3.2.2 Parameter Sensitivity

[Figure 2: Sensitivity study of parameters γ and μ. Panels: (a) CMT, (b) RCMT, (c) MKRR, (d) RMKRR.]

In this part, we study the parameter sensitivity of the proposed method. Results of CMT and RCMT are reported in Figures 2a and 2b, from which we make the following observations:

• Both CMT and RCMT are not very sensitive to γ. In Figure 2a, as γ changes from $10^{-6}$ to $10^{6}$ the performance remains nearly stable. In Figure 2b, RCMT performs stably in most cases; when γ is either too large or too small we cannot obtain the best results, which implies that we should strike a balance between the overfitting problem and the robust regression ability.

• A common phenomenon in Figures 2a and 2b is that when μ is small, we normally cannot obtain better performance. This is because a small μ makes the regression component less important and causes more regression errors, thereby leading to relatively poor event detection performance.

• Similar to CMT and RCMT, MKRR and RMKRR are not sensitive to γ, as shown in Figures 2c and 2d respectively. This further demonstrates the stability of our method.

4. CONCLUSION AND FUTURE WORK

We have proposed a robust cross-media transfer learning model for visual event detection. The proposed model can take noisily labeled social images as training data to learn a robust video event detector. The robustness lies in an $\ell_{2,1}$ loss function which alleviates the negative effects of noisy data. In the future, we intend to incorporate more prior knowledge on the correlations among events to enhance detection performance. We will also further evaluate the effectiveness of the proposed model on large-scale multimedia datasets.

5. REFERENCES

[1] A. Yanagawa, A. C. Loui, J. Luo, and S.-F. Chang. Kodak consumer video benchmark data set: concept definition and annotation, 2008.
[2] H. Daume III. Frustratingly easy domain adaptation. In ACL, pages 256-263, 2007.
[3] L. Duan, D. Xu, I. W.-H. Tsang, and J. Luo. Visual event recognition in videos by learning from web data. In CVPR, pages 1959-1966, 2010.
[4] Y. Lu and Q. Tian. Discriminant subspace analysis: an adaptive approach for image classification. TMM, 11(7):1289-1300, 2009.
[5] F. Nie, H. Huang, X. Cai, and C. Ding. Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization. NIPS, 23:1813-1821, 2010.
[6] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. JMLR, 9:2491-2521, 2008.
[7] Y. Yang, Z. Huang, H. T. Shen, and X. Zhou. Mining multi-tag association for image tagging. WWW, 14(2):133-156, 2011.
[8] Y. Yang, Y. Yang, Z. Huang, H. Shen, and F. Nie. Tag localization with spatial correlations and joint group sparsity. In CVPR, pages 881-888, 2011.
[9] Y. Yang, Y.-T. Zhuang, F. Wu, and Y.-H. Pan. Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. TMM, 10(3):437-446, 2008.
[10] Z.-J. Zha, M. Wang, Y.-T. Zheng, Y. Yang, R. Hong, and T.-S. Chua. Interactive video indexing with statistical active learning. TMM, 14(1):17-27, 2012.
[11] Z.-J. Zha, L. Yang, T. Mei, M. Wang, and Z. Wang. Visual query suggestion. In ACM MM, pages 15-24, 2009.
