Merging Rank Lists from Multiple Sources in Video Classification∗

Wei-Hao Lin and Alexander Hauptmann
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
5000 Forbes Avenue, Pittsburgh, PA 15213, U.S.A.
{whlin, alex}@cs.cmu.edu

Abstract

Multimedia corpora increasingly consist of data from multiple sources, with different characteristics that can be exploited by specialized applications. This paper focuses on video classification over multiple-source collections, and addresses the question of whether classifiers should be trained on individual sources or on the full data set across all sources. If they are trained separately, how can rank lists from different sources be merged effectively? We formulate the problem of merging rank lists as learning a function that maps local scores to global scores, and propose a learning method based on logistic regression. In our experiments we find that source characteristics are very important for video classification. Moreover, our method of learning mapping functions performs significantly better than merging methods that do not explicitly learn the mapping functions.

In Proceedings of the 2004 IEEE International Conference on Multimedia and Expo (ICME), Taipei, Taiwan, June 27-30, 2004

[Figure 1: two single-source rank lists from S1 and S2; local scores l1(di) and l2(dj) are mapped by g1 and g2 into one merged rank list.]

Figure 1. Merging two ranked lists

1. Introduction

Multimedia collections accumulate quickly with the ease of creating multimedia content. While most research has focused on uniform corpora, heterogeneous sources are typical in the real world. In broadcast news, various networks and channels can be accumulated and searched, but the results must be combined and delivered to the users in a single ranked list. This paper is concerned with combining results from different sources in order to exploit source characteristics and, hopefully, to improve the performance of video classification. Previous work in text retrieval often makes strong assumptions that the scores or ranks of the classifiers from each source are comparable [2], and techniques utilizing text-specific statistics [1] or summaries [8] cannot be directly applied to the multimedia domain. In this paper, we formulate the problem of merging rank lists as score mapping, and propose a method to learn a mapping function using logistic regression. The experimental results show that merging methods based on learned mapping functions significantly outperform methods that do not learn such mapping functions.

∗ This work was supported in part by the Advanced Research and Development Activity (ARDA) under contract number MDA908-00-C-0037.

2. Merging Rank Lists

We formulate the task of merging rank lists as a score mapping problem, as illustrated in Figure 1. Suppose all videos in the corpora belong to one of the sources Sk (k = 1, 2 here). A classifier is trained separately for each source; it learns a function lk(d) that assigns a similarity score to each video shot d in the source, and returns a rank list in the order of the local scores. Since the scores in one rank list are not necessarily comparable to scores from the other sources, a mapping function gk(x) is required to map the local scores to comparable global scores. The final merged list is sorted in the order of the global scores. Many widely-used merging methods, as well as our proposed methods based on logistic regression, can be explained in this framework with different choices of lk(d) and gk(x), as listed in Table 1.

Merging Method       l(d)       g(x)
Round Robin          -rank(d)   x
Raw Score            score(d)   x
Linear Scaling       score(d)   (x - min_i score(d_i)) / (max_i score(d_i) - min_i score(d_i))
Logistic Regression  score(d)   1 / (1 + exp(-a - b*x))

Table 1. Merging methods can be explained as different combinations of the mapping function g(x) and the local score l(d)

One principle of designing a mapping function is to preserve rank before and after mapping. If a video clip di is ranked lower than dj in the single-source rank list, di must also be ranked lower than dj in the final merged rank list. If additional knowledge is available to alter the order of a rank list and improve performance, that knowledge can reasonably be applied locally before the merging stage. Therefore we should always preserve rank.
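The framework of Table 1 can be sketched in a few lines: pool each source's shots after applying that source's mapping function g to its local scores, then sort by the global score. The helper names below (`merge`, `linear_scaling`) are illustrative, not from the paper's implementation.

```python
# Each merging method = a choice of local score l(d) and mapping g(x).

def merge(ranked_lists, g_funcs):
    """Map each source's local scores to global scores, then sort globally.

    ranked_lists: one list per source of (shot_id, local_score) pairs,
                  sorted by descending local score.
    g_funcs:      one mapping function g(x) per source.
    """
    pool = []
    for shots, g in zip(ranked_lists, g_funcs):
        pool.extend((shot, g(score)) for shot, score in shots)
    return sorted(pool, key=lambda p: p[1], reverse=True)

def linear_scaling(shots):
    """Build g(x) = (x - min) / (max - min) from one source's scores."""
    scores = [s for _, s in shots]
    lo, hi = min(scores), max(scores)
    return lambda x: (x - lo) / (hi - lo) if hi > lo else 0.5

s1 = [("d1", 9.0), ("d2", 5.0), ("d3", 1.0)]   # toy local scores
s2 = [("d4", 0.8), ("d5", 0.5), ("d6", 0.2)]

# Raw Score: g(x) = x, i.e. assume scores are already comparable.
raw = merge([s1, s2], [lambda x: x, lambda x: x])

# Linear Scaling: per-source min-max normalization. Because g is
# monotonic, rank is preserved within each source.
scaled = merge([s1, s2], [linear_scaling(s1), linear_scaling(s2)])
```

Note how Raw Score lets S1's larger numeric range dominate the merged list, while Linear Scaling interleaves the two sources.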

2.1. Merging Methods without Learning Mapping Functions

Round Robin. Round Robin works as follows: we pick the top-ranked video shot in the first rank list, then the top-ranked shot in the second list. After all top-ranked shots have been selected, we select the second-ranked shot in the first rank list, and so on, until all shots are selected.

Raw Score. The degree of confidence that a classifier assigns to a shot may be better reflected in scores than in ranks. Raw Score takes the local scores from each rank list and sorts the combined rank list in the order of these scores.

Linear Scaling. Without explicit mapping functions, Raw Score assumes that the local scores from one source are comparable to the scores from all other sources, which usually does not hold in practice. Linear Scaling is a crude way to normalize the local scores into the range between zero and one; being monotonic, it satisfies the rank-preserving principle.

2.2. Learning Mapping Functions

We propose to learn the score mapping functions instead of relying on strong assumptions or simple normalization. Learning mapping functions can be seen as a regression problem.

Logistic Regression. For each source, a classifier is trained in k-fold cross-validation fashion. In each fold, the trained classifier is applied to the testing data of the corresponding fold, and the local scores, together with their labels (positive as one, negative as zero), are collected as training data for logistic regression.¹ Logistic regression is fit with two parameters, a and b. The first reason for choosing logistic regression over linear regression is that the output of logistic regression is restricted to lie between zero and one, which makes scores comparable across sources, while linear regression does not limit the range. Secondly, the non-linearity of the sigmoid function fits data with zero/one labels better than linear regression. Thirdly, the classification performance of a classifier on the training data is reflected in the curve fitting: fitting logistic regression to a well-performing classifier yields output values close to zero or one, while the output values for a poorly performing classifier are more spread out. Note that the sigmoid function is monotonic, and thus obeys the rank-preserving principle.

¹ We do not use the local scores generated by a classifier built on the full training data because these scores would be over-fit.

2.3. Optimal and Random Merging Performance

One may be curious about the best performance we can achieve in merging multiple rank lists, as well as about a random baseline. The random performance of merging rank lists, by definition, is obtained by merging the individual rank lists from each source and shuffling the resulting merged list randomly. The optimal way of merging rank lists, in terms of maximizing the evaluation metric (average precision), can be formulated as the following search problem.

[Figure 2: a search tree over merged lists built from S1 = (d1, d2, d3) and S2 = (d4, d5, d6), where d3, d4, and d6 are the positive clips. The leaves (d1 d2 d3 d4 d5 d6), (d4 d1 d2 d3 d5 d6), and (d4 d5 d6 d1 d2 d3) have average precision 0.4444, 0.6667, and 0.7222, respectively.]

Figure 2. Searching for the optimal way to merge two rank lists from S1 and S2

Suppose we are looking for the optimal way to merge the two rank lists (d1, d2, d3) and (d4, d5, d6) returned by the classifiers trained on sources S1 and S2, respectively, each in decreasing order of the classifier's "confidence", as illustrated in Figure 2. The actually positive clips are drawn in shadowed boxes. According to the rank-preserving principle, we first choose either d3 or d4 for the merged list, but not d6. If we choose d3 first, we have to include d1 and d2 and rank them higher than d3, again by the rank-preserving principle. This search process continues and can be represented as a search tree, as shown in the lower part of Figure 2. The number at the end of each merged rank list is its average precision. Any search algorithm [7] can be used to find the solution, depending on the requirements of time and space complexity, optimality, and completeness. In this paper we choose an approximation method based on greedy search that always expands the positive clip with the highest precision. The merged list found by the greedy algorithm, called Greedy Bound, may be suboptimal, but the algorithm is very efficient in terms of time and space complexity.
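Fitting the per-source mapping g(x) = 1 / (1 + exp(-a - b*x)) can be sketched as below. This is a minimal gradient-descent fit of a one-dimensional logistic regression, not the paper's actual fitting procedure (which it does not specify beyond "logistic regression"); the toy scores and labels stand in for the held-out cross-validation predictions described above.

```python
import math

def fit_logistic_mapping(scores, labels, lr=0.1, epochs=2000):
    """Fit g(x) = 1 / (1 + exp(-a - b*x)) to (local score, 0/1 label)
    pairs by gradient ascent on the log-likelihood. Minimal sketch."""
    a, b = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        ga = gb = 0.0
        for x, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            ga += (y - p)        # d log-likelihood / d a
            gb += (y - p) * x    # d log-likelihood / d b
        a += lr * ga / n
        b += lr * gb / n
    return lambda x: 1.0 / (1.0 + math.exp(-(a + b * x)))

# Held-out local scores and relevance labels for one source (toy values).
g1 = fit_logistic_mapping([2.1, 1.5, 0.3, -0.8, -1.9], [1, 1, 0, 0, 0])
```

Because the fitted b is positive when high scores correlate with positive labels, the sigmoid is monotonically increasing in x, so the learned mapping preserves rank within the source, as required.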

3. Experiments

3.1. Testbed, Tasks, and Evaluation Metric

We choose the video corpus of TRECVID 2003 [6] as the testbed in this paper. The corpus consists of broadcast news programs from three sources: ABC, CNN, and C-SPAN. The two classification tasks are defined as follows:

Sporting Event: the shot contains video of one or more organized sporting events
Weather News: the shot reports on the weather

Basic statistics of the training and testing sets² are listed in Table 2. The shot boundaries of the training set are defined by common annotations, and those of the testing data by NIST. C-SPAN data are not included here because there are no sporting event or weather news shots in that source. The labels come from the collaborative annotations by TRECVID 2003 participants. Note that positive examples are very rare in the training data, around 1% in both tasks, which makes the concepts hard for a classifier to learn.

Set       Source  Task            Positive  Total
Training  ABC     Sporting Event  303       25630
Training  ABC     Weather News    71        25630
Training  CNN     Sporting Event  303       21696
Training  CNN     Weather News    215       21696
Testing   ABC     Sporting Event  26        16593
Testing   ABC     Weather News    7         16593
Testing   CNN     Sporting Event  559       15282
Testing   CNN     Weather News    159       15282

Table 2. Basic statistics of the two video classification tasks in TRECVID 2003

We adopt Average Precision (AP) as our evaluation metric, as TRECVID does. The AP of a rank list A is defined as follows,

AP(A) = (1 / |A+|) Σ_{d ∈ A+} (U+(d) + 1) / (U(d) + 1)    (1)

where A+ is the set of all positive examples in A, U(d) is a function returning the number of examples ranked higher than the example d in A, and U+(d) is a function returning only the number of positive examples ranked higher than d. The upper bound on the AP of merging multiple rank lists can be found as described in Section 2.3.

² The number of positive examples in the test set is underestimated because TREC used a pooling method to evaluate participants' submissions.

3.2. Features

We extract two types of features for each video shot.

Text Feature. News programs in the TRECVID 2003 corpus come with closed captions or transcripts from speech recognition systems. The words in each shot are represented as a feature vector. Stop words are removed, and the Porter stemming algorithm is used to remove morphological variants. Term frequency is used to reflect the importance of the words in the shot. Each feature vector is normalized to unit length.

Color Feature. One keyframe is chosen for each shot, and color features are extracted from this keyframe. The keyframe is dissected into a 5-by-5 grid, and a color histogram in the HVC color space is calculated (using only the H and C values). The feature vector consists of the mean and the variance of the 125-bin color histogram in each grid cell, resulting in a 50-dimensional vector.
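Equation (1) translates directly into a single pass over the ranked list: walk the list, and each time a positive shot is reached, accumulate (U+(d) + 1) / (U(d) + 1). A short sketch, checked against the AP values shown in Figure 2:

```python
def average_precision(ranked_shots, positives):
    """AP of a ranked list per Eq. (1): the mean over positive shots d of
    (U+(d) + 1) / (U(d) + 1), where U(d) counts shots ranked above d and
    U+(d) counts the positive shots among them."""
    ap, seen_pos = 0.0, 0
    for rank_above, shot in enumerate(ranked_shots):  # rank_above == U(d)
        if shot in positives:
            ap += (seen_pos + 1) / (rank_above + 1)   # (U+(d)+1)/(U(d)+1)
            seen_pos += 1
    return ap / len(positives)

# The rightmost leaf of Figure 2 (positives d3, d4, d6):
ap = average_precision(["d4", "d5", "d6", "d1", "d2", "d3"],
                       {"d3", "d4", "d6"})  # ≈ 0.7222, as in Figure 2
```

The same function reproduces the other two leaf values in Figure 2 (0.4444 and 0.6667), which is a convenient sanity check on the implementation.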

3.3. Classifiers

We use the Support Vector Machine (SVM) as the classification algorithm. SVMs have been widely used and are very effective in many domains, including text categorization [4] and video classification [5]. The basic idea behind the SVM is to select the decision hyperplane in the feature space that separates two classes of data points while keeping the margin as large as possible. Finding the hyperplane can be formulated as the following optimization problem,

min_{w,b,ξ} (1/2) wᵀw + C Σ_{i=1}^{l} ξ_i    (2)

subject to y_i (wᵀ φ(x_i) + b) ≥ 1 − ξ_i, ξ_i ≥ 0, i = 1, ..., l

where x_i is a feature vector, i = 1, ..., l, l is the size of the training data, y_i ∈ {+1, −1} (y_i is +1 when the shot is a positive example and −1 otherwise), φ is the feature mapping associated with the kernel, which maps the feature vector into a higher-dimensional space, ξ_i is the degree of misclassification when a data point falls on the wrong side of the decision boundary, and C is the penalty parameter that trades off between the two terms. More details can be found in [3]. Note that the choice of classifier depends only on the task at hand, and any classifier can be plugged into the learning method of Section 2.2.

Merging Methods      Sporting, Text  Weather, Text  Weather, Color
Round Robin          0.034           0.449          0.225
Raw Score            0.042           0.859          0.435
Logistic Regression  0.064           0.854          0.467
Linear Scaling       0.008           0.745          0.385
Uni-Modal Training   0.027           0.858          0.384
Greedy Bound         0.067           0.864          0.467
Random Baseline      0.022           0.008          0.008

Table 3. Experimental results of merging rank lists from multiple sources
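For the linear case (φ = identity), the objective in Eq. (2) can be minimized with a simple subgradient-descent sketch on the equivalent hinge-loss form. This is illustrative only; the paper's experiments would use a standard kernel SVM package rather than the toy trainer below.

```python
import random

def train_linear_svm(xs, ys, C=1.0, lr=0.01, epochs=200, seed=0):
    """Minimal linear SVM via stochastic subgradient descent on
    (1/2)||w||^2 + C * sum_i max(0, 1 - y_i (w.x_i + b)),
    the penalty form of Eq. (2) with phi = identity. Sketch only."""
    rng = random.Random(seed)
    w, b = [0.0] * len(xs[0]), 0.0
    idx = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(idx)
        for i in idx:
            margin = ys[i] * (sum(wj * xj for wj, xj in zip(w, xs[i])) + b)
            w = [wj - lr * wj for wj in w]           # regularizer pull
            if margin < 1:                            # inside the margin
                w = [wj + lr * C * ys[i] * xj for wj, xj in zip(w, xs[i])]
                b += lr * C * ys[i]
    # Return the decision function; its value ranks shots for the list.
    return lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b

# Toy separable 1-D data; f(x) plays the role of the local score l(d).
f = train_linear_svm([[2.0], [1.5], [-1.0], [-2.0]], [1, 1, -1, -1])
```

The decision value f(x), rather than the hard label, is what supplies the local score l(d) that the merging framework of Section 2 consumes.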

3.4. Results

The results of merging rank lists using different merging methods, under various combinations of features and tasks³, are shown in Table 3. In each column, we also list the random baseline and the upper bound found by the greedy search algorithm. The best performance other than Greedy Bound is marked in bold. Overall, Logistic Regression is the best of the four methods, which suggests the effectiveness of learning global mapping functions. Blindly assuming that local scores are comparable across sources (Round Robin, Raw Score) or simply scaling the local scores (Linear Scaling) hurts performance greatly and should be avoided in practice.

³ The results of merging ABC's and CNN's color classifiers in the Sporting Event classification task are so close to the random baseline that they are omitted here.

We also conducted experiments comparing the performance of merging separate sources against training on all data without discerning video sources (Uni-Modal Training in Table 3). The results show that merging rank lists significantly outperforms uni-modal training that ignores source differences, which strongly suggests the importance of exploiting source characteristics. Merging text-based classifiers in the Weather News task seems to be an exception, but this is not surprising: different news channels use very similar terminology to present weather reports, such as "temperature", "snow", etc., so there are few source characteristics left to exploit.

4. Conclusions

In this paper we showed that source characteristics can provide valuable information for video classification. Furthermore, merging methods should be carefully designed and chosen, because the choice significantly affects performance. Among all merging methods, our proposed method of learning the mapping function using logistic regression significantly outperforms those that do not learn the mapping function.

References

[1] J. Callan. Advances in Information Retrieval, chapter Distributed Information Retrieval, pages 127-150. Kluwer Academic Publishers, 2000.
[2] A. Chen. Cross-language retrieval experiments at CLEF 2002. In Working Notes for the CLEF 2002 Workshop, 2002.
[3] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines. Cambridge University Press, 2000.
[4] T. Joachims. Text categorization with support vector machines: Learning with many relevant features. In Proceedings of the European Conference on Machine Learning (ECML). Springer, 1998.
[5] W.-H. Lin and A. Hauptmann. News video classification using SVM-based multimodal classifiers and combination strategies. In Proceedings of the Tenth ACM International Conference on Multimedia, Juan-les-Pins, France, December 1-6, 2002.
[6] NIST. Guidelines for the TRECVID 2003 evaluation. Webpage, 2003. http://www-nlpir.nist.gov/projects/tv2003/tv2003.html.
[7] S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 2nd edition, 2002.
[8] X. M. Shou and M. Sanderson. Experiments on data fusion using headline information. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, University of Tampere, Finland, August 11-15, 2002.
