Robust Cross-Media Transfer for Visual Event Detection

Yang Yang, The University of Queensland, Brisbane, Australia ([email protected])
Yi Yang, Carnegie Mellon University, Pittsburgh, United States ([email protected])
Zi Huang, The University of Queensland, Brisbane, Australia ([email protected])
Jiajun Liu, The University of Queensland, Brisbane, Australia ([email protected])
Zhigang Ma, University of Trento, Trento, Italy ([email protected])

ABSTRACT

In this paper, we present a novel approach, named Robust Cross-Media Transfer (RCMT), for visual event detection in social multimedia environments. Different from most existing methods, the proposed method can directly take different types of noisy social multimedia data as input and conduct robust event detection. More specifically, we build a robust model by employing an $\ell_{2,1}$-norm regression model featuring noise tolerance, and we integrate different types of social multimedia data by minimizing the distribution difference among them. Experimental results on a real-life Flickr image dataset and a YouTube video dataset demonstrate the effectiveness of our proposal compared to state-of-the-art algorithms.

Categories and Subject Descriptors
H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing

General Terms
Algorithms, Experimentation, Performance

Keywords
Cross-Media, Transfer Learning, Visual Event Detection

1. INTRODUCTION

During the last decade, we have witnessed how powerfully social networking services (e.g., Flickr, YouTube and Facebook) are changing the ways multimedia data are generated, shared and consumed. With the vigorous growth of user-generated content, it is essential to detect visual event information to facilitate multimedia applications such as multimedia indexing [10], retrieval [11, 9] and tagging [8].

In visual event detection, a fundamental issue is how to bridge the so-called 'semantic gap' between low-level visual features and high-level semantics. Most existing event detection approaches follow a common, model-based formulation: one first prepares a collection of training data, usually labeled with a well-defined event set by human experts; the training set is then fed into a supervised learning model to train event detectors (classifiers); finally, the detectors predict whether an event occurs in the test data. Given a well-controlled training dataset with sufficient and accurately labeled samples, traditional model-based methods may obtain satisfactory performance. However, manual labeling is a labor-intensive and time-consuming process, which makes model-based methods impractical in reality. A promising way to address this problem is to exploit domain adaptation techniques to add more labeled samples from other domains [4]. Nevertheless, existing methods usually take only homogeneous types of data into account, which may lead to information loss to some extent. More importantly, robustness is not well exploited in these methods, which makes them hardly noise-tolerant in social multimedia circumstances.

Normally, when an event happens in public, it may be recorded in different ways, such as photos and videos, and the event can then be shared and disseminated in the form of images and video clips via multiple platforms, such as TV channels, web portals and social networking services. An observation is that social images and videos describing the same event are highly relevant and thus should be complementary to each other. Hence, it is natural to assume that using training data from different yet relevant types of social multimedia data would help boost visual event detection performance. However, directly using social multimedia data in existing supervised learning models may lead to poor performance, because social multimedia data are usually not accurately labeled by social users and thus contain much undesirable noise [7]. In this case, we need to devise a robust model which can not only conduct event detection over multiple types of multimedia data but also construct event detectors tolerant of noise in social multimedia environments.

This study explores how different types of noisy social multimedia data, e.g., Flickr images and YouTube videos, can be jointly fed into a common model for visual event detection. To this end, we adopt an $\ell_{2,1}$-norm regression model featuring noise tolerance, which makes it possible to directly use social images and videos without any further cleansing.



Moreover, we also integrate different types of social multimedia data to handle the visual domain-shift problem. Our contributions are summarized as follows:

• We propose a Robust Cross-Media Transfer (RCMT) model for visual event detection, which can directly take loosely labeled social multimedia data as input. Our proposal is tolerant of labeling noise in social multimedia environments.

• Different types of social multimedia data, e.g., social images and social videos, can be simultaneously fed into RCMT to achieve mutual reinforcement for visual event detection.

• Experimental results on a real-life Flickr image dataset and a YouTube video dataset confirm the performance superiority of our proposal over state-of-the-art algorithms.

The rest of this paper is organized as follows. Section 2 elaborates the details of RCMT. Section 3 reports the experimental results, followed by the conclusion in Section 4.

2. THE PROPOSED MODEL

In this section, we present the details of how the proposed model takes multiple types of social multimedia data as input to perform robust visual event detection.

2.1 Robust Visual Event Detection

Suppose we are given a social image dataset $X^{(s)} = \{(x_i^{(s)}, y_i^{(s)})\}_{i=1}^{N_s}$ and a social video dataset $X^{(t)} = \{(x_j^{(t)}, y_j^{(t)})\}_{j=1}^{N_t}$, where $N_s$ and $N_t$ are the numbers of samples in $X^{(s)}$ and $X^{(t)}$ respectively. We represent each video $x_j^{(t)}$ as a sequence of frames $\{x_{jg}^{(t)}\}_{g=1}^{n_j}$ sampled from the video, where $x_{jg}^{(t)}$ is the visual feature of the $g$-th frame of $x_j^{(t)}$ and $n_j$ is the number of frames in $\{x_{jg}^{(t)}\}$. $y_i^{(s)} \in \{-1,1\}^{c \times 1}$ and $y_j^{(t)} \in \{-1,1\}^{c \times 1}$ are event indicator vectors for the $i$-th image and the $j$-th video respectively, and $c$ is the number of events in the event list. The objective is to build a robust event detection model based on both $X^{(s)}$ and $X^{(t)}$, and to ensure that the two domains have positive effects on each other. Starting with a classic solution, we may formulate event detection as multiple kernel ridge regression (MKRR):

\[
\min_{w,V,b}\ \gamma \left\| \Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V + \mathbf{1}b^{T} - Y \right\|_{F}^{2} + \operatorname{tr}\!\Big(V^{T}\Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V\Big) \tag{1}
\]
\[
\text{s.t.}\ w \succeq 0\ \text{and}\ w^{T}\mathbf{1} = 1,
\]

where $\gamma$ is a balance parameter, $V \in \mathbb{R}^{(N_s+N_t)\times c}$ is the regression parameter matrix, $b \in \mathbb{R}^{c\times 1}$ is the bias and $\mathbf{1}$ is an all-one vector. $Y = [y_1^{(s)}, \ldots, y_{N_s}^{(s)}, y_1^{(t)}, \ldots, y_{N_t}^{(t)}]^{T} \in \mathbb{R}^{(N_s+N_t)\times c}$ is the whole event indicator matrix. $\{K^{(i)}\}_{i=1}^{m}$ are $m$ base kernels pre-computed over both $X^{(s)}$ and $X^{(t)}$, and $w = [w_1, w_2, \ldots, w_m]^{T}$ is a weight vector measuring the importance of the base kernels.

Note that if we feed $X^{(s)}$ and $X^{(t)}$ together into the MKRR model, it is in fact a simple transfer learning model based on multiple kernel learning. Each base kernel $K^{(i)}$ is composed as below:

\[
K^{(i)} = \begin{bmatrix} K^{(i)}_{s} & K^{(i)}_{st} \\ K^{(i)}_{ts} & K^{(i)}_{t} \end{bmatrix},
\]

where $K^{(i)}_{s}$ and $K^{(i)}_{t}$ are kernels computed from $X^{(s)}$ and $X^{(t)}$ respectively, while $K^{(i)}_{st} = K^{(i)T}_{ts}$ is the "cross-media" kernel matrix between $X^{(s)}$ and $X^{(t)}$.
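As an illustration, the following minimal numpy sketch shows how one such base kernel could be assembled from precomputed blocks. The function name and arguments are our own hypothetical notation, not released code.

```python
import numpy as np

def compose_base_kernel(K_s, K_t, K_st):
    """Assemble one base kernel K^(i) over X^(s) union X^(t).

    K_s:  (Ns, Ns) kernel among source images.
    K_t:  (Nt, Nt) kernel among target videos.
    K_st: (Ns, Nt) cross-media kernel; K_ts = K_st.T.
    """
    return np.block([[K_s,   K_st],
                     [K_st.T, K_t]])  # (Ns+Nt, Ns+Nt), symmetric
```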

Indeed, MKRR provides a possible solution to cross-media event detection. However, it may well lead to unsatisfactory performance in the noisy circumstances of social multimedia. Normally, event descriptions (e.g., tags, comments and titles) created by amateur users can hardly be guaranteed to be accurate, and different users may have different perspectives or opinions on the same event. Therefore, social multimedia data are often of low quality in terms of labeling, which may degrade event detection performance. To handle this problem, we need to adapt MKRR to noisy data. It has been shown that the $\ell_{2,1}$-norm loss function is a robust alternative to the $\ell_2$-norm loss function [5]. We therefore replace the $\ell_2$-norm loss in Eq. (1) with an $\ell_{2,1}$-norm loss and arrive at robust multiple kernel ridge regression (RMKRR):

\[
\min_{w,V,b}\ \gamma \left\| \Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V + \mathbf{1}b^{T} - Y \right\|_{2,1} + \operatorname{tr}\!\Big(V^{T}\Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V\Big) \tag{2}
\]
\[
\text{s.t.}\ w \succeq 0\ \text{and}\ w^{T}\mathbf{1} = 1.
\]

Here, given any matrix $M$, its $\ell_{2,1}$-norm is defined as $\|M\|_{2,1} = \sum_j \|m^j\|_2$, where $m^j$ is the $j$-th row of $M$. The robust nature of the $\ell_{2,1}$-norm loss lies in the fact that the regression errors introduced by noisy data are not squared and thus have less negative effect on the whole training process.
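To make this concrete, here is a small sketch (our own illustration, assuming numpy) comparing the $\ell_{2,1}$ loss with the squared Frobenius loss on a residual matrix containing one outlier row, such as the one a mislabeled sample would produce.

```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1}: sum of the l2-norms of the rows of M.
    A residual row from a noisy sample contributes linearly,
    not quadratically, so outliers are penalized less."""
    return np.linalg.norm(M, axis=1).sum()

# Example: two clean residual rows and one badly mislabeled sample.
R = np.array([[0.1, 0.2],
              [0.0, 0.1],
              [5.0, 4.0]])                # outlier row
print(l21_norm(R))                        # ~6.73: grows linearly with the outlier
print(np.linalg.norm(R, 'fro') ** 2)      # ~41.1: the squared loss is dominated by it
```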

After solving the problem in (2), we obtain an event detector:

\[
f(x) = V^{T}\sum_{i=1}^{m} w_i k^{(i)}(:, x) + b \tag{3}
\]

where $k^{(i)}(:, x) = \psi_i(X^{(s)} \cup X^{(t)})^{T}\psi_i(x)$, $i = 1, 2, \ldots, m$. We may use it to detect events in either the image or the video domain. Unfortunately, it is still not proper to directly employ Eq. (2) for visual event detection with cross-media transfer, because the distributions of different types of multimedia data may differ greatly, leading to a domain-shift problem that can severely degrade event detection performance. In the next subsection, we provide a solution that minimizes the distribution difference between the two types of social multimedia data.

2.2 Cross-Media Visual Transfer

In order to integrate different types of social data, a basic idea is to transform them into a common space in which they become similar to each other; in other words, their distribution difference should be minimized in this space. To this end, we utilize multiple kernel learning (MKL) to seek such an optimal Reproducing Kernel Hilbert Space (RKHS). Moreover, using MKL is consistent with the robust multiple kernel regression described in the previous subsection. As the distribution distance measure, we use a nonparametric metric proposed in [3], i.e., Maximum Mean Discrepancy (MMD). According to its definition,


an empirical estimate of the distribution distance between $X^{(s)}$ and $X^{(t)}$ is computed as

\[
D_{mmd}(X^{(s)}, X^{(t)}) = \left\| \frac{1}{N_s}\sum_{i=1}^{N_s}\psi(x_i^{(s)}) - \frac{1}{N_t}\sum_{j=1}^{N_t}\psi(x_j^{(t)}) \right\|_{\mathcal{H}} \tag{4}
\]

where $\psi$ is a kernel mapping function, $\mathcal{H}$ is the corresponding RKHS and $\|\cdot\|_{\mathcal{H}}$ denotes the $\ell_2$-norm in $\mathcal{H}$. The MMD distance can be further rewritten as [3]:

\[
D_{mmd}(X^{(s)}, X^{(t)}) = \operatorname{tr}(KP)^{1/2} \tag{5}
\]

where $P = pp^{T}$ and $p$ is an $(N_s+N_t)\times 1$ vector whose first $N_s$ elements are $\frac{1}{N_s}$ and whose remaining elements are $-\frac{1}{N_t}$. Extending Eq. (5) to its multiple kernel version, we obtain

\[
D_{mmd\_mkl}(X^{(s)}, X^{(t)}) = \operatorname{tr}\!\Big(\sum_{i=1}^{m} w_i K^{(i)} P\Big)^{1/2} = (w^{T}s)^{1/2} \tag{6}
\]

where $s = [\operatorname{tr}(K^{(1)}P), \ldots, \operatorname{tr}(K^{(m)}P)]^{T}$. As mentioned before, to integrate $X^{(s)}$ and $X^{(t)}$ we should solve the following problem:

\[
\min_{w}\ D_{mmd\_mkl}(X^{(s)}, X^{(t)}) \tag{7}
\]
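A small sketch of how Eqs. (5) and (6) can be evaluated in practice (numpy; illustrative only, with hypothetical function names). It uses the identity $\operatorname{tr}(K pp^{T}) = p^{T} K p$.

```python
import numpy as np

def mmd_squared(K, Ns, Nt):
    """Squared MMD, tr(KP) with P = p p^T (Eq. (5)),
    computed as p^T K p on a joint kernel over X^(s) union X^(t)."""
    p = np.concatenate([np.full(Ns, 1.0 / Ns), np.full(Nt, -1.0 / Nt)])
    return p @ K @ p

def mmd_mkl(kernels, w, Ns, Nt):
    """Multiple kernel MMD (Eq. (6)): (w^T s)^{1/2},
    where s_i = tr(K^(i) P)."""
    s = np.array([mmd_squared(K, Ns, Nt) for K in kernels])
    return np.sqrt(w @ s)
```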

Thus, by combining (2) and (7), we arrive at the Robust Cross-Media Transfer (RCMT) model:

\[
\min_{w,V,b}\ \mu\,(w^{T}ss^{T}w) + \gamma \left\| \Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V + \mathbf{1}b^{T} - Y \right\|_{2,1} + \operatorname{tr}\!\Big(V^{T}\Big(\sum_{i=1}^{m} w_i K^{(i)}\Big)V\Big) \tag{8}
\]
\[
\text{s.t.}\ w \succeq 0\ \text{and}\ w^{T}\mathbf{1} = 1.
\]

After solving (8), we obtain a robust visual event detector as described in Eq. (3). It is worth mentioning that another contribution of this work is an effective optimization algorithm for solving (8); due to space limitations, we omit the optimization details in this version.
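Since the paper's solver is not published in this version, the following is only a plausible sketch of the $\ell_{2,1}$ regression subproblem, using the standard iteratively reweighted least squares (IRLS) device of [5], with the kernel weights $w$ held fixed (so $K = \sum_i w_i K^{(i)}$) and the bias $b$ omitted for brevity. It is not the authors' algorithm.

```python
import numpy as np

def rmkrr_fixed_w(K, Y, gamma, n_iter=20, eps=1e-8):
    """Illustrative IRLS solver for
        min_V  gamma * ||K V - Y||_{2,1} + tr(V^T K V)
    with w fixed and bias omitted. Each step solves a reweighted
    least-squares problem; the weights 1/(2||r_j||) come from the
    l2,1 majorization in [5]."""
    n = K.shape[0]
    d = np.ones(n)                          # per-sample IRLS weights
    for _ in range(n_iter):
        D = np.diag(d)
        # Stationarity (K assumed invertible): (gamma*D K + I) V = gamma*D Y
        V = np.linalg.solve(gamma * D @ K + np.eye(n), gamma * D @ Y)
        R = K @ V - Y                       # residual rows, one per sample
        d = 1.0 / (2.0 * np.linalg.norm(R, axis=1) + eps)
    return V

def detect(K_test_train, V):
    """Apply the learned detector (Eq. (3), bias omitted as above):
    rows of K_test_train hold combined kernel values between a test
    point and all training samples; output is one score per event."""
    return K_test_train @ V
```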

3. EXPERIMENTS

We conduct an empirical study to evaluate the effectiveness of our model on two multimedia datasets.

3.1 Experimental Settings

We compare RCMT with SVM, SimpleMKL (S-MKL) [6] and two transfer learning algorithms, i.e., A-MKL [3] and Feature Replication (FRP) [2]. To demonstrate the robustness and the cross-media transfer effects, we additionally compare with three baseline methods: MKRR and RMKRR, corresponding to Eq. (1) and Eq. (2), and CMT, which uses an $\ell_2$ loss function and can thus be viewed as a non-robust version of RCMT. The parameters of all compared methods are tuned in the range $\{10^{-6}, 10^{-4}, \ldots, 10^{6}\}$.

The target dataset is a collection of YouTube videos from the Kodak Consumer Video Dataset [1]. This dataset is labeled with 25 concepts. We select 8 event-related concepts (i.e., 'birthday', 'dancing', 'graduation', 'parade', 'picnic', 'singing', 'ski' and 'wedding') for evaluation, and only those videos labeled with at least one event are considered. The source images are collected from Flickr by ourselves, using the above 8 events as queries; for each query we download the top 500 results. We do not perform any preprocessing to cleanse the labeling noise; Figure 1 shows some noisy examples. We randomly sample {10%, 20%, 30%, 40%} of the images as the source training data. Videos are randomly divided into a training set (60%) and a test set (40%). Each experiment is repeated 8 times and the mean results are reported.

Following [3], four kernels are used to generate the base kernels: the Gaussian kernel $k(\cdot,\cdot) = \exp(-d^2(\cdot,\cdot)/2\sigma^2)$, the Laplacian kernel $k(\cdot,\cdot) = \exp(-d(\cdot,\cdot)/\sqrt{2}\sigma)$, the inverse square distance kernel $k(\cdot,\cdot) = 1/(d^2(\cdot,\cdot)/2\sigma^2 + 1)$ and the inverse distance kernel $k(\cdot,\cdot) = 1/(d(\cdot,\cdot)/\sqrt{2}\sigma + 1)$. $\sigma$ is set to $\{4^{-3}, \ldots, 4^{3}\}\,\bar{D}$, where $\bar{D}$ is the mean of all squared distances. We use the Euclidean distance to compute image-image distances and the Hausdorff distance to compute image-video and video-video distances. Keyframes of the Kodak videos are extracted at a sampling rate of 2 frames/s. We extract 162-D HSV color histogram features for both YouTube video keyframes and Flickr images. Table 1 summarizes the two datasets.

[Figure 1: Noisy examples from Flickr data. Panels: Dancing, Ski, Birthday, Singing, Wedding.]

Dataset  | # of Events | # of Samples | Feature
YouTube  | 8           | 553          | 162-D HSV
Flickr   | 8           | 4,041        | 162-D HSV

Table 1: Datasets description.
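For concreteness, a sketch (our own construction under the formulas above, not released code) of the base kernel generation and the Hausdorff distance used for image-video and video-video pairs.

```python
import numpy as np

def base_kernels(D2, sigmas):
    """Generate base kernels from a pairwise squared-distance matrix D2,
    one kernel of each of the four forms per sigma in the grid
    {4^-3, ..., 4^3} * mean(D2) described in the text."""
    D = np.sqrt(D2)
    kernels = []
    for s in sigmas:
        kernels.append(np.exp(-D2 / (2.0 * s**2)))              # Gaussian
        kernels.append(np.exp(-D / (np.sqrt(2.0) * s)))         # Laplacian
        kernels.append(1.0 / (D2 / (2.0 * s**2) + 1.0))         # inverse square distance
        kernels.append(1.0 / (D / (np.sqrt(2.0) * s) + 1.0))    # inverse distance
    return kernels

def hausdorff(A, B):
    """Hausdorff distance between two keyframe feature sets
    A (na, d) and B (nb, d)."""
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # (na, nb)
    return max(D.min(axis=1).max(), D.min(axis=0).max())

# Example sigma grid:
# sigmas = [4.0**k * D2.mean() for k in range(-3, 4)]
```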

3.2 Results

In this subsection, we report the evaluation results. Mean Average Precision (MAP) is used as the evaluation metric.
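For reference, a common way to compute MAP is sketched below; the paper does not specify its exact AP implementation, so this is an assumption.

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one event: mean of precision@k over the ranks
    at which positive samples appear."""
    order = np.argsort(-scores)
    hits, ap = 0, 0.0
    for k, idx in enumerate(order, start=1):
        if labels[idx] > 0:
            hits += 1
            ap += hits / k
    return ap / max(hits, 1)

def mean_average_precision(score_matrix, label_matrix):
    """MAP over all events; columns index events."""
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))
```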

3.2.1 Comparison

Tables 2a and 2b show the MAP comparison results.

Setting  | A-MKL      | FRP        | CMT        | RCMT
(a) 10%  | 32.4 ± 2.9 | 25.4 ± 1.1 | 38.1 ± 1.3 | 39.3 ± 1.8
(b) 20%  | 32.5 ± 2.9 | 26.1 ± 1.6 | 36.9 ± 1.3 | 40.3 ± 1.7
(c) 30%  | 34.3 ± 2.4 | 26.0 ± 0.8 | 36.1 ± 2.0 | 39.9 ± 1.8
(d) 40%  | 31.8 ± 2.9 | 26.3 ± 1.0 | 35.4 ± 1.5 | 39.0 ± 1.9

(a) MAP results of A-MKL, FRP, CMT and RCMT, using Flickr images for cross-media transfer. Settings (a)-(d) respectively correspond to sampling ratios {10%, 20%, 30%, 40%} of Flickr images.

SVM        | S-MKL      | MKRR       | RMKRR
21.6 ± 0.6 | 38.5 ± 1.3 | 37.5 ± 1.3 | 36.2 ± 1.5

(b) MAP results of SVM, S-MKL, MKRR and RMKRR, without using cross-media transfer.

Table 2: MAP comparisons among different methods.

From these results we derive the following conclusions:

• Methods based on multiple kernel learning (i.e., A-MKL, CMT, RCMT, S-MKL, MKRR and RMKRR) consistently achieve better performance than those based on a single kernel (i.e., FRP and SVM).

• RCMT consistently outperforms RMKRR, which indicates that the auxiliary social images used in RCMT indeed yield positive transfer.

• In Table 2a, RCMT is better than CMT under all four sampling ratios. This implies that the $\ell_{2,1}$ loss function in RCMT provides more tolerance to noise than the $\ell_2$-norm loss function in CMT.

• In Table 2a, as the proportion of images used grows from 10% to 20%, the performance of RCMT keeps improving, which clearly indicates its robustness to noise. But as the ratio continues growing to 40%, the performance of both CMT and RCMT drops. One possible explanation is that as the number of images increases, it becomes much harder to minimize the domain difference between images and videos, which degrades the performance.

• Compared to the transfer learning algorithms FRP and A-MKL, RCMT and CMT achieve better event detection performance and stability. We believe the improvements should be attributed to the joint consideration of auxiliary social multimedia data and the robust nature of the $\ell_{2,1}$-norm loss function.

3.2.2 Parameter Sensitivity

[Figure 2: Sensitivity study of parameters γ and μ. Panels: (a) CMT, (b) RCMT, (c) MKRR, (d) RMKRR.]

In this part, we study the parameter sensitivity of the proposed method. Results of CMT and RCMT are reported in Figures 2a and 2b, from which we make the following observations:

• Both CMT and RCMT are not very sensitive to γ. In Figure 2a, as γ changes from $10^{-6}$ to $10^{6}$ the performance remains nearly stable. In Figure 2b, RCMT performs stably in most cases; when γ is either too large or too small we cannot obtain the best results, which implies that we should strike a balance between the overfitting problem and the robust regression ability.

• A common phenomenon in Figures 2a and 2b is that when μ is small, we normally cannot obtain better performance. This is because a small μ makes the regression component less important and causes more regression errors, thereby leading to relatively poor event detection performance.

• Similar to CMT and RCMT, MKRR and RMKRR are not sensitive to γ, as shown in Figures 2c and 2d respectively. This further demonstrates the stability of our method.

4. CONCLUSION AND FUTURE WORK

We have proposed a robust cross-media transfer learning model for visual event detection. The proposed model can take noisily labeled social images as training data to learn a robust video event detector. The robustness lies in an $\ell_{2,1}$ loss function which alleviates the negative effects of noisy data. In the future, we intend to incorporate more prior knowledge on the correlations among events to enhance detection performance. We will also further evaluate the effectiveness of the proposed model on large-scale multimedia datasets.

5. REFERENCES

[1] A. Yanagawa, A. C. Loui, J. Luo, and S.-F. Chang. Kodak consumer video benchmark data set: concept definition and annotation, 2008.
[2] H. Daume III. Frustratingly easy domain adaptation. In ACL, pages 256-263, 2007.
[3] L. Duan, D. Xu, I. W.-H. Tsang, and J. Luo. Visual event recognition in videos by learning from web data. In CVPR, pages 1959-1966, 2010.
[4] Y. Lu and Q. Tian. Discriminant subspace analysis: an adaptive approach for image classification. TMM, 11(7):1289-1300, 2009.
[5] F. Nie, H. Huang, X. Cai, and C. Ding. Efficient and robust feature selection via joint $\ell_{2,1}$-norms minimization. NIPS, 23:1813-1821, 2010.
[6] A. Rakotomamonjy, F. R. Bach, S. Canu, and Y. Grandvalet. SimpleMKL. JMLR, 9:2491-2521, 2008.
[7] Y. Yang, Z. Huang, H. T. Shen, and X. Zhou. Mining multi-tag association for image tagging. WWW, 14(2):133-156, 2011.
[8] Y. Yang, Y. Yang, Z. Huang, H. Shen, and F. Nie. Tag localization with spatial correlations and joint group sparsity. In CVPR, pages 881-888, 2011.
[9] Y. Yang, Y.-T. Zhuang, F. Wu, and Y.-H. Pan. Harmonizing hierarchical manifolds for multimedia document semantics understanding and cross-media retrieval. TMM, 10(3):437-446, 2008.
[10] Z.-J. Zha, M. Wang, Y.-T. Zheng, Y. Yang, R. Hong, and T.-S. Chua. Interactive video indexing with statistical active learning. TMM, 14(1):17-27, 2012.
[11] Z.-J. Zha, L. Yang, T. Mei, M. Wang, and Z. Wang. Visual query suggestion. In ACM MM, pages 15-24, 2009.
