A Consumer Video Search System by Audio-Visual Concept Classification

Wei Jiang, Alexander C. Loui, Phoury Lei
Corporate Research and Engineering, Eastman Kodak Company
{wei.jiang, alexander.loui, phoury.lei}@kodak.com
Abstract
Indexing and searching the massive amount of consumer videos in the open domain is increasingly important. Due to the lack of text descriptions as well as the difficulties in analyzing the content of consumer videos, little work has been conducted to provide video search engines in the consumer domain. In this paper, we develop a content-based consumer video search system based on multi-modal concept classification. The system supports the query-by-example access mechanism by exploiting the query-by-concept search paradigm underneath, where online concept classification is conducted over the query video by integrating both visual and audio information. The system adopts an audio-visual grouplet representation that captures salient audio-visual signatures to describe the video content for efficient concept classification. Experiments over the large-scale Columbia Consumer Video set show the effectiveness of the developed system.

1. Introduction

Most commercial video search engines such as YouTube and Netflix rely on text associated with videos, such as surrounding text, users' tags, transcripts, etc. When the video content is not reflected in the text description, or when videos barely have associated text (e.g., the numerous consumer videos or surveillance videos), the results are usually poor. To better bridge the semantic gap, many concept-based video retrieval systems have emerged recently [2, 3, 8, 21, 26]. By conducting offline automatic classification of a large set of semantic concepts (including objects such as car or dog, locations such as outdoors/indoors and studio setting, activities such as jumping and kissing, etc.), videos can be effectively searched based on their high-level semantics. Most of these systems focus on the visual content alone and are tailored to support interactive search of videos from formal or professional productions, e.g., from broadcast news to documentary videos in the TRECVID interactive video retrieval task [23], where ground-truth annotations of a large number of concepts such as LSCOM [17] have been provided over large-scale video corpora.

With the increasing popularity of online video sharing, it is more and more important to index and search large-scale user-generated videos in the open domain. However, to the best of our knowledge, there is no existing content-based consumer video search engine exploiting the query-by-concept paradigm. This is partly because of the challenging conditions for analyzing consumer videos, e.g., diverse content, limited text descriptions, and uncontrolled video quality. Compared to other types of videos such as broadcast and sports, only limited effort has been spent on classifying generic concepts in consumer videos. The first systematic attempt was made by Chang et al. in [1], where Kodak's consumer benchmark video set was developed. Over 100 relevant and potentially detectable consumer concepts were proposed, among which a total of 25 concepts were selected. Ground-truth annotations of these 25 concepts were provided over a set of 1,338 consumer videos. Later on, the large-scale Columbia Consumer Video (CCV) set was developed by Jiang et al. in [12], where a total of 9,317 web videos were annotated with 20 consumer concepts focusing on events (e.g., baseball and wedding) and objects (e.g., bird and cat). Over both data sets, multi-modal classifiers that integrate both audio and visual information have been shown to be quite effective for classifying consumer videos. Using the CCV set, and going beyond the traditional bag-of-words (BoW) representation, Jiang and Loui developed an Audio-Visual Grouplet (AVG) representation [9] based on temporal audio-visual causal correlations. Each AVG contains a set of audio and visual codewords that are grouped together due to their strong temporal causal relations. Compared to using discrete audio and visual codewords, concepts can be more robustly classified by using AVGs, which encapsulate representative audio-visual patterns to describe the video content.
In this work, we develop a content-based consumer video search system based on multi-modal concept classification. The system uses the query-by-example access mechanism by exploiting the query-by-concept search paradigm underneath. Online concept classification is conducted for the query video based on an advanced joint audio-visual signature, i.e., the AVG representation proposed in [9]. Compared with most existing video search engines, which target videos from formal or professional production, our system has the following characteristics: (1) the focus is on the consumer domain, where videos have very limited text descriptions, and ground-truth annotations are available only for a small number of concepts; (2) a joint audio-visual descriptor is used, which captures salient audio-visual cues to describe the video content and help classification; and (3) online concept classification is conducted to predict concept occurrences in the query video. Our system demonstrates the effectiveness of using joint audio-visual signatures for concept classification in consumer videos. The system also shows that although the training process of generating the joint audio-visual signatures is time consuming, with some simplifications, the online classification process using such audio-visual signatures can be reasonably rapid.

The remainder of the paper is organized as follows. Section 2 reviews related work. Section 3 describes the process of generating AVGs and using AVGs for online concept classification. Section 4 describes the detailed user interface and functionalities of the system.
2. Related Work

2.1. Query-by-concept video search systems

The query-by-concept search paradigm has been developed to bridge the gap between low-level features and high-level semantics in content-based video search systems. In a traditional query-by-concept framework, a user searches for videos by manually selecting a predefined concept. The system returns search results by ranking videos based on the estimated presence of the concept. Sufficiently good performance of the underlying concept classifiers is essential to the success of such a search system. The problem with this traditional framework is twofold. First, most concept classifiers do not have satisfactory performance at the current stage of research. Second, when a large number (e.g., hundreds) of concepts are available, it is difficult for users to memorize and select the correct concept to query. Therefore, most existing query-by-concept video search systems incorporate the query-by-text or query-by-example strategy, and through the underlying query prediction [18, 25] and concept classification, return ranked videos based on the estimated presence of various concepts. In addition, due to imperfect and non-robust automatic classification, most practical video search engines engage user interaction, where a user sits through the retrieval process and interacts with the system interface to assess the search results and navigate through the database to find videos of interest. Here we briefly describe some popular interactive video search engines in the literature.

The IBM Multimedia Analysis and Retrieval System (IMARS) automatically classifies over 1,200 visual concepts covering scenes, objects, events, people, etc. It utilizes a large set of visual features, and supports browsing and retrieval using both query-by-text and query-by-example access mechanisms. The MediaMill video search system [21] automatically classifies about 500 concepts. It supports multi-dimensional browsing, and combines query-by-concept, query-by-example, and query-by-text through ranking combination methods. The CuZero interactive video retrieval system [26] supports the query-by-text mechanism. The system classifies about 450 visual concepts, and through query-to-concept mapping, it automatically recommends relevant visual concepts based on users' queries. It also provides real-time query navigation that allows users to navigate through the database in the concept space. The Informedia video retrieval system [3] supports query-by-best-of-topic, query-by-text (dominated by closed-captioned text), query-by-image (primarily color-based syntactic lookup), and query-by-concept (using semantic concept classification). It utilizes an extensive annotation strategy for interactive search based on human reaction time. The VisionGo system [15] supports multi-modal queries and combines text derived from ASR, semantic concept classification, and low-level visual and motion features to rank videos. An active learning strategy is also used to make the most of users' interaction efforts.

Most of the above systems are tailored to support interactive search of videos from formal or professional production, especially the TRECVID interactive search task [23]. The target is to uncover as many relevant shots (from broadcast news, documentary videos, etc.) as possible for a given free-text query, based on user interaction within a fixed time frame (15 minutes). Most systems aim to improve performance by maximizing users' efforts, improving the user interface, and providing effective relevance feedback. These systems rely on text and visual descriptors, and are dominated by the query-by-text (with automatic or manual query-to-concept mapping) access mechanism.
2.2. Audio-visual concept classification

Many efforts have been devoted to detecting generic concepts in unconstrained videos, such as human action recognition in Hollywood movies [13], the TRECVID high-level feature extraction and multimedia event detection tasks [20], and concept detection in Kodak's consumer videos [1] or the CCV set [12]. The current state of the art in video concept detection is based on the BoW representation and SVM classifiers [6, 12, 20]. In the visual aspect, local visual descriptors (e.g., SIFT [14] or HOG [5]) are computed from 2D local points or 3D local volumes. These descriptors are vector-quantized against a codebook of prototypical visual descriptors to generate a histogram-like visual representation. In the audio aspect, audio descriptors (e.g., MFCCs or transients [4]) are computed from short temporal windows that are either uniformly distributed in the soundtrack or sparsely detected as salient audio onsets. These descriptors are also vector-quantized against a codebook of prototypical audio descriptors to generate a histogram-like audio representation. Next, the visual and audio histogram-like representations are combined (e.g., by early fusion in the form of feature concatenation, or by late fusion in the form of a classifier ensemble) to learn SVM concept classifiers [12, 20]. It has been shown, especially in the consumer domain, that significant classification performance improvements can be obtained by integrating both visual and audio information.

Beyond traditional BoW and multi-modal fusion, Jiang and Loui recently developed an AVG representation [9] that incorporates temporal audio-visual correlations to enhance classification. An AVG is defined as a set of audio and visual codewords that are grouped together according to their strong temporal causal relations in videos. By using entire grouplets as building elements to represent videos, various concepts can be more robustly classified than by using discrete audio and visual codewords. For example, the AVG that captures the visual bride and audio speech gives a strong audio-visual cue for classifying the "wedding ceremony" concept, and the AVG that captures the visual bride and audio dancing music is quite discriminative for classifying the "wedding dance" concept. On top of the AVGs, a distance metric learning algorithm was further developed in [10] to classify concepts. Based on the AVGs, an iterative Quadratic Programming (QP) problem is formulated to learn the optimal distance metric between data points under the Large-Margin Nearest Neighbor (LMNN) setting [24]. Specifically, the work of [10] suggests a grouplet-based distance built on the chi-square distance and word specificity [16], and shows that through distance metric learning, such a grouplet-based distance can achieve consistent and significant classification performance gains.

3. Concept Classification by AVGs

3.1. Offline training

Following the recipe of [9], four types of AVGs are extracted by computing four types of audio-visual temporal correlations. Specifically, in the visual aspect, we extract SIFT points and conduct SIFT tracking by using Lowe's method [14], and separate the SIFT tracks into foreground tracks and background tracks. Each SIFT track is represented by a 136-dim feature vector composed of the 128-dim SIFT descriptor and an 8-dim Histogram of Oriented Motion (HOM) vector. Next, by clustering the foreground SIFT tracks from a set of training videos, a foreground visual codebook V^{f-v} is constructed. Similarly, by clustering the background SIFT tracks from the set of training videos, a background visual codebook V^{b-v} is constructed. In the audio aspect, the 13-dim MFCCs are extracted from evenly distributed overlapping short windows (i.e., 25 ms windows with 10 ms hops) in the soundtrack, and by clustering the MFCCs from the set of training videos, a background audio codebook V^{b-a} is constructed. Also, the 20-dim transient features [4] describing the foreground audio salient events are computed, and by clustering the transient features from the set of training videos, a foreground audio codebook V^{f-a} is constructed.

For each codebook (e.g., codebook V^{f-v}), given an input video x_j, a histogram-like temporal sequence {H_{j1}^{f-v}, H_{j2}^{f-v}, ...} can be generated, where each H_{jk}^{f-v} is the BoW feature for the k-th frame in the video, computed using a soft weighting scheme [11]. Based on the temporal sequences generated for the different codebooks, the temporal Granger causality [7] between pairwise audio and visual codewords can be calculated. That is, we compute four types of temporal Granger causalities: (1) between visual foreground and audio foreground codewords; (2) between visual foreground and audio background codewords; (3) between visual background and audio foreground codewords; and (4) between visual background and audio background codewords. The Granger causality between two codewords measures the similarity between these codewords, and we can obtain a causal matrix describing the pairwise similarities between audio and visual codewords. Spectral clustering algorithms (e.g., the method of [19]) can be used to cluster the audio and visual codewords into grouplets (AVGs) based on the causal matrix. Each grouplet G contains a set of audio and visual codewords that have strong Granger causal relations.

Therefore, four types of AVGs are obtained from the four types of temporal Granger causal audio-visual correlations. For each type of AVG, e.g., the visual-foreground-audio-foreground AVG, assume that we have K grouplets G_k of this type, k = 1, ..., K. Let D_k^G(x_i, x_j) denote the distance between data x_i and x_j computed based on the grouplet G_k. The overall distance D(x_i, x_j) between data x_i and x_j is given by:

D(x_i, x_j) = \sum_{k=1}^{K} v_k D_k^G(x_i, x_j).   (1)
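To make the grouplet construction concrete, the following minimal NumPy sketch estimates lag-1 Granger causality between codeword activation time series and builds a pairwise audio-visual causal affinity matrix. This is our own simplified illustration under stated assumptions, not the implementation of [9] (which follows the econometric formulation of [7]); the function names `granger_strength` and `causal_affinity` are ours.

```python
import numpy as np

def granger_strength(x, y):
    """Lag-1 Granger causality of series x on series y: the relative
    reduction in residual error when y_t is predicted from
    (y_{t-1}, x_{t-1}) instead of y_{t-1} alone."""
    yt, y1, x1 = y[1:], y[:-1], x[:-1]
    # restricted model: y_t ~ y_{t-1} + intercept
    A_r = np.stack([y1, np.ones_like(y1)], axis=1)
    rss_r = np.sum((yt - A_r @ np.linalg.lstsq(A_r, yt, rcond=None)[0]) ** 2)
    # full model: y_t ~ y_{t-1} + x_{t-1} + intercept
    A_f = np.stack([y1, x1, np.ones_like(y1)], axis=1)
    rss_f = np.sum((yt - A_f @ np.linalg.lstsq(A_f, yt, rcond=None)[0]) ** 2)
    return max(rss_r - rss_f, 0.0) / (rss_r + 1e-12)

def causal_affinity(audio_seq, visual_seq):
    """Symmetric affinity between every audio/visual codeword pair,
    taken as the stronger of the two causal directions.
    audio_seq: (T, Na) per-frame codeword activations; visual_seq: (T, Nv)."""
    Na, Nv = audio_seq.shape[1], visual_seq.shape[1]
    W = np.zeros((Na, Nv))
    for a in range(Na):
        for v in range(Nv):
            W[a, v] = max(granger_strength(audio_seq[:, a], visual_seq[:, v]),
                          granger_strength(visual_seq[:, v], audio_seq[:, a]))
    return W
```

Assembling these pairwise affinities over the combined set of audio and visual codewords and feeding the resulting matrix to a spectral clustering routine (e.g., the algorithm of [19]) would then yield the grouplets.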
Let v = [v_1, ..., v_K]^T and D(x_i, x_j) = [D_1^G(x_i, x_j), ..., D_K^G(x_i, x_j)]^T. The optimal weights are obtained in [10] by solving the following problem:

\min_v  \frac{\|v\|_2^2}{2} + C_0 \sum_{ij} \eta_{ij}\, v^T D(x_i, x_j) + C \sum_{ijl} \eta_{ij} (1 - y_{il})\, \epsilon_{ijl},

s.t.  v^T D(x_i, x_l) - v^T D(x_i, x_j) \ge 1 - \epsilon_{ijl},  \epsilon_{ijl} \ge 0,  v_k \ge 0.

Here \eta_{ij} = 1 (or 0) denotes that x_j is (or is not) a target neighbor of x_i, where the target neighbors of x_i are its nk nearest similarly labeled neighbors computed using the Euclidean distance. y_{il} \in {0, 1} indicates whether the inputs x_i and x_l have the same class label. \epsilon_{ijl} is the amount by which a differently labeled input x_l invades the "perimeter" around the input x_i defined by its target neighbor x_j. \|v\|_2^2 is the L2 regularization term that controls the complexity of v. An iterative QP procedure is developed in [10] to effectively solve the above problem in polynomial time. Specifically, the work of [10] suggests an idf-weighted chi-square distance D_k^G(x_i, x_j) as follows:

D_k^G(x_i, x_j) = \frac{1}{\sum_{w_m \in G_k} \mathrm{idf}(w_m)} \sum_{w_m \in G_k} \mathrm{idf}(w_m)\, \frac{[f_{w_m}(x_i) - f_{w_m}(x_j)]^2}{\frac{1}{2}[f_{w_m}(x_i) + f_{w_m}(x_j)]},   (2)
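The idf-weighted chi-square distance of Eqn. (2) is straightforward to compute once the grouplets and idf values are fixed. The sketch below assumes BoW features stored as NumPy arrays and a grouplet given as a list of codeword indices; the helper names are hypothetical, not taken from the authors' code.

```python
import numpy as np

def idf_weights(train_bow):
    """idf(w) = total count of all codewords in the training set divided
    by the total count of codeword w. train_bow: (N, M) BoW matrix."""
    per_word = train_bow.sum(axis=0)
    return per_word.sum() / np.maximum(per_word, 1e-12)

def grouplet_chi2(bow_i, bow_j, grouplet, idf):
    """idf-weighted chi-square distance of Eqn (2), restricted to the
    codeword indices in `grouplet`."""
    g = np.asarray(grouplet)
    fi, fj, w = bow_i[g], bow_j[g], idf[g]
    num = (fi - fj) ** 2
    den = 0.5 * (fi + fj) + 1e-12   # guard against empty bins
    return float(np.sum(w * num / den) / np.sum(w))
```

The per-grouplet distances produced this way are what the learned weights v combine into the overall distance D(x_i, x_j) of Eqn. (1).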
where f_{w_m}(x_i) is the feature of x_i corresponding to the codeword w_m in grouplet G_k, and idf(w_m) is computed as the total number of occurrences of all codewords in the training set divided by the total number of occurrences of codeword w_m in the training set:

\mathrm{idf}(w_m) = \sum_{w_{m'}} \sum_x f_{w_{m'}}(x) \Big/ \sum_x f_{w_m}(x).

Based on this idf-weighted chi-square distance, the optimal weights can be learned to compute the overall distance D(x_i, x_j). The final kernel for SVM classification is:

K(x_i, x_j) = \exp\{-\gamma D(x_i, x_j)\}.   (3)

For each of the four types of AVGs, the distance metric learning algorithm described above is applied individually, and four types of optimal kernels are computed. After that, the Multiple Kernel Learning algorithm developed in [22] is adopted to combine the four kernels for final concept classification.

3.2. Online classification

Given an input query video, we need to considerably speed up the classification process in order to enable online video retrieval. One of the most time-consuming steps in the training procedure is the visual SIFT tracking and the foreground/background separation of SIFT tracks. We completely skip this step in the online classification stage. That is, we sample only a few pairs of image frames from the query video, extract SIFT features from the frames, and then conduct SIFT matching between each image pair. Next, we compute the 136-dim SIFT-plus-HOM visual feature for each image pair based on the matched SIFT points. These visual features are compared to the visual codewords in the visual foreground and visual background codebooks, respectively, to generate the visual foreground BoW vector and visual background BoW vector. In practice, we sample 6 pairs of image frames that are equally spaced in the query video. In the audio aspect, we first compute the 13-dim MFCCs and 20-dim transient features, and then compare the MFCCs to the audio background codewords to generate the audio background BoW vector, and compare the transient features to the audio foreground codewords to generate the audio foreground BoW vector. The soft weighting scheme [11] is also used to generate these visual and audio foreground and background BoW vectors.

Based on each type of BoW vector of the query video, we can compute the distance between the query video and each video in the database by Eqn. (2), using the already determined optimal weights v, and then compute the kernel for SVM classification by Eqn. (3). On average, a 2-minute query video can be processed and classified in 1.5 minutes using a single thread on an Intel Xeon 2.53 GHz CPU.

Compared with the original offline classification, where SIFT tracking and foreground/background SIFT track separation are conducted in the same way as in the training process, this simplified version for online classification sacrifices some classification accuracy. Figure 1 shows the comparison of the average precision (AP, the area under the uninterpolated PR curve) and mean AP (MAP, averaged AP across concepts) between the original classification reported in [10] and our simplified online classification. From the figure, over a few concepts, e.g., "baseball," "skiing," "biking," "dog," and "beach," the AP performances of simplified online classification are similar to those of the original offline classification. For the remaining concepts, the AP performances drop by more than 10%. The overall MAP degrades by 22%. The major reasons for this performance drop are twofold. First, the SIFT tracking process throws out most of the noisy points that cannot be consistently tracked; second, with foreground/background SIFT track separation, we can remove the noise caused by matching foreground (background) tracks against the background (foreground) codebook. Without these steps, the final visual BoW vectors for online classification are noisier than those used in training.

Figure 1. Performance comparison of original offline classification and simplified online classification (AP per concept and MAP over the CCV set).

The experimental results also indicate that in the future, with increased computation power, by incorporating a certain level of SIFT tracking and foreground/background visual separation, the performance of the developed video search system can be further improved.
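The two core computations of the online stage, soft-weighted BoW generation and the kernel of Eqn. (3), can be sketched as follows. This is a hedged illustration: we assume the soft weighting scheme of [11] gives the i-th nearest codeword of each descriptor a weight of 1/2^{i-1} over the top-k neighbors, and the function names are ours, not taken from the system's code.

```python
import numpy as np

def soft_bow(descriptors, codebook, top_k=4):
    """Soft-weighted BoW in the spirit of [11]: each descriptor votes for
    its top_k nearest codewords, the i-th nearest with weight 1/2**(i-1).
    descriptors: (n, d); codebook: (M, d). Returns an L1-normalized (M,) vector."""
    hist = np.zeros(codebook.shape[0])
    for desc in descriptors:
        dist = np.linalg.norm(codebook - desc, axis=1)
        for i, idx in enumerate(np.argsort(dist)[:top_k]):
            hist[idx] += 0.5 ** i
    total = hist.sum()
    return hist / total if total > 0 else hist

def avg_kernel(d_query_db, gamma=1.0):
    """Kernel of Eqn (3), applied to precomputed grouplet-weighted
    distances between the query and the database videos."""
    return np.exp(-gamma * np.asarray(d_query_db))
```

Given the four BoW vectors of a query video, the grouplet-weighted distances to each database video feed `avg_kernel`, and the resulting kernel values are passed to the pre-trained SVMs to score the 20 concepts.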
4. Consumer Video Search System

Figure 2 (a) shows the user interface of our video search system. The system is built over the CCV set. That is, after obtaining a query video, the system conducts classification of the 20 concepts and searches through the 9,317 web videos provided in [12] to find videos similar to the query video. The classification results of the query video in Figure 2 (a) are displayed to the user in Figure 2 (b).

After obtaining the concept classification scores for the query video, the system uses these scores as well as the global visual appearance of the query video to search through the database. Specifically, the concept classification scores give a feature vector in the concept space for the query video, which is compared with the concept score feature vector of each video in the database to compute a concept-based similarity. In addition, one keyframe is sampled from the query video, from which the 225-dim grid-based color moment feature is computed, and this feature is compared with the grid-based color moment feature of each video in the database to obtain a visual-based similarity. Next, the concept-based similarity and the visual-based similarity are combined by weighting to generate the final similarity for ranking videos in the database. In the system interface, we provide a concept-based-similarity weighting bar to allow the user to tune the weight between the concept-based similarity and the visual-based similarity.

Figures 3 (a–d) show some retrieval examples that demonstrate the helpfulness of integrating audio information. It is usually difficult to distinguish between "skiing" and "ice skating." With the help of audio features (where "skiing" videos usually have the sound of loud wind blowing or people talking in an open field, while "ice skating" videos usually have background sounds of music playing or a large crowd of people), the system can reasonably differentiate these two concepts and return good retrieval results. In comparison, without using multi-modal concept classification, i.e., by setting the weight for concept-based similarity to 0 in Figures 3 (c) and (d), the retrieval results based on low-level visual features alone are much worse.

Figure 2. The user interface of the consumer video search system: (a) the query video, with sampled frames for extracting visual features and the audio soundtrack for extracting audio features; (b) the search result.

Figure 3. Video search examples: (a) query for "skiing" videos using concept classification; (b) query for "ice skating" videos using concept classification; (c) query for "skiing" videos by low-level visual features; (d) query for "ice skating" videos by low-level visual features.

5. Conclusion

We developed a query-by-example consumer video search system using the query-by-concept search paradigm. The AVG-based audio-visual signatures are used to represent videos, based on which online concept classification is conducted to detect predefined consumer concepts in the query video. The system searches for videos similar to the query example by using both the concept classification scores and the global visual features. Our system demonstrates the effectiveness of using joint audio-visual signatures for concept classification in consumer videos.

References

[1] S. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. Loui, and J. Luo. Large-scale multimodal semantic concept detection for consumer video. ACM MIR, pages 255–264, 2007.
[2] S. Chang, W. Hsu, W. Jiang, L. Kennedy, D. Xu, A. Yanagawa, and E. Zavesky. Columbia University TRECVID-2006 video search and high-level feature extraction. TRECVID Workshop, Gaithersburg, 2006.
[3] M. Christel. Carnegie Mellon University traditional Informedia digital video retrieval system. ACM CIVR, The Netherlands, 2007.
[4] C. Cotton, D. Ellis, and A. Loui. Soundtrack classification by transient events. IEEE ICASSP, 2011.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. IEEE CVPR, pages 886–893, 2005.
[6] K. V. de Sande, T. Gevers, and C. Snoek. Evaluating color descriptors for object and scene recognition. IEEE Transactions on PAMI, 32(9):1582–1596, 2010.
[7] C. Granger. Investigating causal relations by econometric models and cross-spectral methods. Econometrica, 37(3):424–438, 1969.
[8] A. Hauptmann, R. Yan, W. Lin, M. Christel, and H. Wactlar. Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Transactions on Multimedia, 9:958–966, 2007.
[9] W. Jiang and A. Loui. Audio-visual grouplet: Temporal audio-visual interactions for general video concept classification. ACM Multimedia, Scottsdale, Arizona, 2011.
[10] W. Jiang and A. Loui. Grouplet-based distance metric learning for video concept detection. IEEE ICME, 2012.
[11] Y. Jiang, C. Ngo, and J. Yang. Towards optimal bag-of-features for object categorization and semantic video retrieval. ACM CIVR, pages 494–501, 2007.
[12] Y. Jiang, G. Ye, S. Chang, D. Ellis, and A. Loui. Consumer video understanding: A benchmark database and an evaluation of human and machine performance. ACM ICMR, 2011.
[13] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. IEEE CVPR, Anchorage, Alaska, 2008.
[14] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[15] H. Luan, S. Neo, H. G. Y. Zhang, S. Lin, and T. Chua. Segregated feedback with performance-based adaptive sampling for interactive news video retrieval. ACM Multimedia, pages 293–296, 2007.
[16] R. Mihalcea, C. Corley, and C. Strapparava. Corpus-based and knowledge-based measures of text semantic similarity. National Conference on Artificial Intelligence, pages 775–780, AAAI Press, 2006.
[17] M. Naphade, J. Smith, J. Tesic, S. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis. Large-scale concept ontology for multimedia. IEEE MultiMedia, 13(3):86–91, 2006.
[18] A. Natsev, A. Haubold, J. Tensic, L. Xie, and R. Yan. Semantic concept-based query expansion and re-ranking for multimedia retrieval. ACM Multimedia, pages 991–1000, 2007.
[19] A. Ng, M. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. NIPS, pages 849–856, 2001.
[20] A. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVID. ACM MIR, pages 321–330, 2006.
[21] C. Snoek, M. Worring, D. Koelma, and A. Smeulders. A learned lexicon-driven paradigm for interactive video retrieval. IEEE Transactions on Multimedia, 9:280–292, 2007.
[22] M. Varma and B. Babu. More generality in efficient multiple kernel learning. ICML, pages 1065–1072, 2009.
[23] E. Voorhees and D. Harman. TREC: Experiment and evaluation in information retrieval. The MIT Press, 2005.
[24] K. Weinberger and L. Saul. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10(12):207–244, 2009.
[25] R. Yan, J. Yang, and A. Hauptmann. Learning query-class dependent weights for automatic video retrieval. ACM Multimedia, New York, 2004.
[26] E. Zavesky and S. Chang. CuZero: Embracing the frontier of interactive visual search for informed users. ACM MIR, Vancouver, Canada, 2008.