Bipartite Graph Reinforcement Model for Web Image Annotation

Xiaoguang Rui
MOE-MS Key Lab of MCC, University of Science and Technology of China
+86-551-3600681
[email protected]

Mingjing Li, Zhiwei Li, Wei-Ying Ma
Microsoft Research Asia, 49 Zhichun Road, Beijing 100080, China
+86-10-58968888
{mjli, zli, wyma}@microsoft.com

Nenghai Yu
MOE-MS Key Lab of MCC, University of Science and Technology of China
+86-551-3600681
[email protected]

ABSTRACT

Automatic image annotation is an effective way for managing and retrieving abundant images on the internet. In this paper, a bipartite graph reinforcement model (BGRM) is proposed for web image annotation. Given a web image, a set of candidate annotations is extracted from its surrounding text and other textual information in the hosting web page. As this set is often incomplete, it is extended to include more potentially relevant annotations by searching and mining a large-scale image database. All candidates are modeled as a bipartite graph. Then a reinforcement algorithm is performed on the bipartite graph to rerank the candidates. Only those with the highest ranking scores are reserved as the final annotations. Experimental results on real web images demonstrate the effectiveness of the proposed model.

Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Retrieval Models.

General Terms Algorithms, Measurement, Experimentation.

Keywords Automatic image annotation, bipartite graph model

1. INTRODUCTION

The content on the web is shifting from text to multimedia as the amount of multimedia documents grows at a phenomenal rate. In particular, images are the major source of multimedia information available on the internet. Since 2005, Google [7] and Yahoo [27] have already indexed over one billion images. In addition, some online photo-sharing communities, such as Photo.Net [17] and PhotoSIG [18], also have image collections in the order of millions, entirely contributed by their users. To access and utilize this abundant information efficiently and effectively, those images should be properly indexed. Existing image indexing methods can be roughly classified into two categories, based on either text or visual content. The initial image management approach was to manually annotate the

images with semantic concepts so that people can retrieve images using keyword queries. But this approach suffers from inconsistency and subjectivity among different annotators. Furthermore, the annotation process is time-consuming and tedious. Consequently, it is impractical to annotate so many images on the web. Content-based image retrieval (CBIR) was proposed to index images using visual features and to perform image retrieval based on visual similarities. However, due to the well-known semantic gap [20], the performance of CBIR systems is far from satisfactory. Thus, neither of these two approaches is feasible for web images.

To overcome the aforementioned limitations, much research has been devoted to automatic image annotation. If it can be achieved, the problem of image retrieval is reduced to a text retrieval problem, and many well-developed text retrieval algorithms can easily be applied to search for images by ranking the relevance between image annotations and textual queries. However, most image annotation algorithms are not specifically designed for web images: they are mainly based on content analysis and do not utilize the rich textual information associated with web images.

On the other hand, current commercial image search engines index web images using the surrounding text and other textual information in the hosting web pages. The underlying assumption is that web images are purposely embedded into web pages, so the text in the hosting pages is more or less related to the semantic content of the web images. Therefore, such textual information can be used as approximate annotations of web images. Although very simple, this approach works quite well in some cases. For most web images, however, such annotations have many shortcomings. Take the pictures shown in Fig. 1 as an example. From the surrounding text of the web images, we extract some keywords as candidate annotations: the first image is annotated with "bird", "ligan", "American", "Morris" and "coot", and the second image with "color", "degging", "rose", "spacer", "flower", "card" and "multiflora". First, these candidate annotations are usually noisy and contain irrelevant words: the first image might have been taken in America by Morris, which is not explicitly expressed in the image, and the words "spacer" and "card" for the second image are advertisement words. Second, they do not fully describe the semantic content of the images, such as the "blue river" in the first image and the "green leaf" in the second image. Obviously, the annotations extracted from the surrounding text are inaccurate and incomplete.

In this paper, we propose a bipartite graph reinforcement model (BGRM) for web image annotation, which fully utilizes both the visual features and the textual information of web images.

Figure 1. Examples of images with surrounding keywords: (a) bird, ligan, American, Morris, coot; (b) color, degging, rose, spacer, flower, card, multiflora.

Given a web image, some candidate annotations are extracted from its surrounding text and other textual information in the hosting web page. As these candidates are incomplete, more candidates are derived from them by searching and mining a large-scale, high-quality image collection. For each candidate, a ranking score is defined using both visual and textual information to measure how likely it is to annotate the given image. The two kinds of candidates are then modeled as a bipartite graph, on which a reinforcement algorithm is performed to iteratively refine the ranking scores. After convergence, all candidates are re-ranked and only those with the highest ranking scores are reserved as the final annotations. In this way, some noisy annotations can be removed and some correct ones added, so the overall annotation accuracy is improved. Experiments on over 5,000 web images show that BGRM is more effective than traditional annotation algorithms such as the WordNet-based method [11]. Our contributions are multifold:

- We propose to extract initial candidate annotations from the surrounding text and to extend the candidates via a search-based method;
- We propose a novel method to define the ranking scores of candidates based on both visual and textual information;
- We design a bipartite graph reinforcement model to re-rank candidate annotations;
- We design several schemes to determine the final annotations;
- Based on BGRM, we develop a web image annotation system that utilizes the available textual information and leverages a large-scale image database.

The remainder of the paper is organized as follows: Section 2 lists some related work. Section 3 gives the overview of our web image annotation approach. Candidate annotation extraction and ranking are described in Sections 4 and 5. The main idea of BGRM is introduced in Section 6. In Section 7, we describe the final annotation determination schemes. The experimental results are provided in Section 8. We conclude and suggest future work in Section 9.

2. RELATED WORK

Some initial efforts have recently been devoted to automatically annotating images by leveraging decades of research in computer vision, image understanding, image processing, and statistical learning [1]. Most existing annotation approaches are either classification based or probabilistic modeling based. The classification based methods try to associate words or concepts with images by learning classifiers, such as the Bayes point machine [3], the support vector machine (SVM) [4], and two-dimensional multi-resolution hidden Markov models (2D MHMMs) [14].

The probabilistic model based methods attempt to infer the correlations or joint probabilities between images and annotations. Representative works include the Co-occurrence Model [16], the Translation Model (TM) [5], the Latent Dirichlet Allocation Model (LDA) [2], the Cross-Media Relevance Model (CMRM) [8], the Continuous Relevance Model (CRM) [13], and the Multiple Bernoulli Relevance Model (MBRM) [6]. However, these approaches do not focus on annotating web images and neglect the textual information available for them. Furthermore, compared with the potentially unlimited vocabulary of web-scale image databases, they can only model a very limited number of concepts on a small-scale image database by learning projections or correlations between images and keywords. Therefore, these approaches cannot be directly applied to annotate web images.

As automatic image annotation is often not accurate enough, some methods have been proposed to refine the annotation result. Jin et al. [11] achieved annotation refinement based on WordNet by pruning irrelevant annotations. The basic assumption is that highly correlated annotations should be reserved while uncorrelated annotations should be removed. In that work, however, only global textual information is used, and the refinement process is independent of the target image, which means that different images with the same candidate annotations will obtain the same refinement result. To further improve the performance, the image content should be considered as well.

Recently, several search-based methods have been proposed for image annotation [22][24][26], which combine text-based web image search and content-based image retrieval (CBIR) in the annotation process. The image annotations are obtained by leveraging a web-scale image database. However, Wang et al. [24] assumed that an accurate keyword for the image in consideration was available. Although an accurate keyword might speed up the search process and enhance the relevance of the retrieved images, it is not always available for web images. Wang et al. [22] discarded this assumption and estimated the annotations of images by performing CBIR first, but this leads to poorer performance than [24]. Rui et al. [26] proposed to select annotations for web images from the available noisy textual information. In our work, we also adopt a search-based method and assume that several noisy and incomplete keywords are available for image annotation. This assumption is more reasonable for web images because it is easy to extract such keywords from the textual information on hosting web pages. The assumption is similar to that of [26], but [26] only considered the inaccuracy of the initial keywords and ignored their incompleteness.

3. OVERVIEW OF BIPARTITE GRAPH REINFORCEMENT MODEL

The proposed bipartite graph reinforcement model (BGRM) works in the following way for image annotation. At first, images are annotated with some candidate keywords, which may be noisy and incomplete. The initial candidates may be obtained by applying traditional image annotation algorithms or by analyzing the surrounding text of a web image. On account of the incompleteness of the initial candidates, more candidate keywords are estimated by submitting each candidate as a query to an image search engine and then clustering the search results. BGRM then models all the words as a bipartite graph. To remove the noisy words, all candidate annotations are re-ranked by reinforcement on the bipartite graph. Only the top ranked ones are reserved as final annotations.

Figure 2. Bipartite graph reinforcement model for web image annotation

BGRM is shown in Fig. 2. It consists of the following components: initial candidate word extraction, extended candidate word generation, candidate ranking, bipartite graph construction, reinforcement learning, and final annotation determination. BGRM is flexible in the sense that its components are relatively independent of each other, so any improvement made in a component can be easily incorporated into the model to improve its overall performance. We describe each component in detail in the following sections.
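As a rough sketch of how these components could fit together (not the authors' implementation), the following Python skeleton composes six hypothetical callables, one per component; every name in it is a placeholder rather than an API defined in the paper.

```python
def annotate(image, page_text, image_db, components):
    """Compose the six BGRM components for one web image (illustrative sketch).

    `components` is a dict of callables supplied by the surrounding system;
    every key below is a hypothetical name, not an interface from the paper.
    """
    q = components["extract_initial_words"](page_text)       # Section 4.1: initial words Q
    x = components["extend_words"](image, q, image_db)       # Section 4.2: extended words X
    f0 = components["rank_initial"](image, q, image_db)      # Section 5.1: F0
    c0 = components["rank_extended"](image, x, image_db)     # Section 5.2: C0
    W = components["build_graph"](q, x)                      # Section 6.1: edge weights
    f, c = components["reinforce"](W, f0, c0)                # Section 6.2: re-ranked scores
    return components["select_final"](q, x, f, c)            # Section 7: final annotations
```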

4. CANDIDATE ANNOTATION EXTRACTION

In our model, two sets of candidate annotations are extracted for each web image. Initially, some candidate annotations are extracted from the related textual information such as the surrounding text. Because the surrounding text does not always describe the entire semantic content of the image, we also extend the annotations by searching and mining a large-scale image database. The assumption is that if certain images in the database are visually similar to the target image and semantically related to the candidate annotations, the textual descriptions of those images should also be correlated with the target image. Thus, the extended annotations can be extracted from them for the target image.

4.1 Initial Annotation Extraction

Several sources of information on the hosting pages are more or less related to the semantic content of web images, e.g. the file name, ALT text, URL and surrounding text. After stop word removal and stemming, each word is ranked by standard text processing techniques (such as tf*idf), and the words with the highest ranks are reserved as the initial candidate annotations Q.
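A minimal sketch of this step, assuming the page text has already been tokenized, stop-word filtered and stemmed; the tf*idf weighting shown is the standard formulation, and all names are illustrative.

```python
import math
from collections import Counter

def rank_page_words(page_tokens, doc_freq, num_docs, top_k=10):
    """Rank words from one hosting page by tf*idf and keep the top ones as the
    initial candidate annotations Q (sketch).

    page_tokens: stemmed tokens from the file name, ALT text, URL and surrounding text
    doc_freq:    word -> number of documents in a background corpus containing it
    num_docs:    size of that background corpus
    """
    tf = Counter(page_tokens)
    scores = {
        w: (count / len(page_tokens)) * math.log(num_docs / (1 + doc_freq.get(w, 0)))
        for w, count in tf.items()
    }
    return [w for w, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]]

# Example: Q = rank_page_words(["bird", "coot", "bird", "river"], df_table, 1_000_000)
```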

4.2 Extended Annotation Extraction

Extended annotations are obtained by a search-based method. Each initial annotation and its image are used to query an image search engine to find semantically and visually related images, and more annotations are extracted from the search result. For this purpose, about 2.4 million images were collected from photo sharing sites, e.g. Photo.Net [17] and PhotoSIG [18], and a text-based image search system was built on top of them. We notice that people are creating and sharing a lot of high-quality images on these sites. In addition, images on such sites have rich metadata, such as titles and descriptions provided by the photographers. As shown in Fig. 3, this textual information reflects the semantic content of the corresponding images to some extent, though it may be noisy. Thus those images can be used to extend the initial annotation set.

For a target web image I, each word qi in the initial word set Q is used to query the text-based image search system to find semantically related images; this process is applied to every initial candidate annotation of I. Then, from the semantically related images, visually related images are found through content-based image similarity between the target image and the retrieved images. After these two search stages, each target image and each of its initial words obtain a search result containing the semantically and visually related images and their textual descriptions. The search result is not only highly useful for extending words, but also benefits initial word ranking. The search result of word w for image I can be represented as SR(I, w) = {(im1, sim1, de1), (im2, sim2, de2), ..., (iml, siml, del)}, where im is an image obtained by querying with I and w, sim is the visual similarity between im and I, de is the textual description of im, and l is the total number of images in the search result.

Finally, extended words are extracted by mining the search result SR of each initial word using the search result clustering (SRC) algorithm [28]. Different from traditional clustering approaches, SRC clusters documents by ranking salient phrases. It first extracts salient phrases and calculates several properties, such as phrase frequencies, and then combines these properties into a salience score based on a pre-learnt regression model. As SRC is capable of generating highly readable cluster names, these cluster names can be used as extended candidate annotations. For each target image, SRC is used to cluster the descriptions in the search result. After all cluster names are merged and duplicate words are discarded, the extended candidate annotations X are obtained. A potential problem is that the initial words are noisy, which may degrade the quality of the extended words. Surprisingly, the experimental results (see Section 8.4) show that the average precision of the extended words is a little higher than that of the initial words.
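The sketch below illustrates the two-stage search and the shape of SR(I, w); the `text_index` object and its hit fields are assumed interfaces, and since the learnt SRC model of [28] is not reproduced here, the clustering step is replaced by a simple frequent-phrase stand-in.

```python
from collections import Counter, namedtuple

# One entry of SR(I, w): a retrieved image, its visual similarity to I, and its description.
SREntry = namedtuple("SREntry", ["image_id", "similarity", "description"])

def build_search_result(target_feat, word, text_index, visual_sim, top_text=200, top_visual=50):
    """Two-stage search: text-based retrieval by `word`, then re-rank the hits by
    visual similarity to the target image (sketch; `text_index` is an assumed API)."""
    text_hits = text_index.search(word, limit=top_text)
    scored = [SREntry(h.image_id, visual_sim(target_feat, h.feature), h.description)
              for h in text_hits]
    scored.sort(key=lambda e: -e.similarity)
    return scored[:top_visual]

def extend_words(search_result, initial_words, top_k=10):
    """Stand-in for SRC [28]: pick frequent description words as extended candidates."""
    counts = Counter(w for e in search_result for w in e.description.lower().split())
    return [w for w, _ in counts.most_common(top_k * 3)
            if w not in initial_words][:top_k]
```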

Figure 3. Example images and their descriptions. (a) Title: shadow cat. Description: "we have an antique oak mission style bed, and my cat was on the bed. the slats part of his face". (b) Title: mountain view. Description: "I hiked up a resort mountain in new hampshire ... I took it dark on purpose to bring out the sun and clouds."

After careful observation of the experiments, we found that not only does a precise initial word tend to propagate precise extended words, but an imprecise one can also produce precise extended words under certain conditions. This is because the visual information also contributes to the extension. These facts help guarantee the quality of the extended annotations.

5. CANDIDATE ANNOTATION RANKING

After acquiring the candidate annotations of each image, a ranking value is defined for each candidate using both visual and textual information to measure how consistent it is with the target image.

5.1 Initial Annotation Ranking

The visual consistency of an initial word can be indicated by the visual similarities between the images in its search result and the target image. We utilize these scores as the visual ranking value of an initial word. First, the visual similarity scores are sorted in descending order; then the average of the top K visual similarity scores is taken as the ranking value. For each initial word qi of the target image I, the visual ranking value rankv(qi|I) is calculated as follows:

rankv(qi | I) = (1/K) Σ_{j=1..K} sim_{qi}(j, I)          (1)

where sim(·) denotes the image similarity scores ranked in descending order.

To estimate the textual consistency, we first compute the similarity of keywords within one web image by checking how frequently one keyword appears in the search result of another. For the target image I, we count the frequency Feq_{qk}(qi) of the initial word qi appearing in the textual descriptions of the images in the search result of the initial word qk, and the frequency Feq_{qi}(qk) of qk appearing in the search result of qi. Feq_{qk}(qi) and Feq_{qi}(qk) reflect the local relation of qi and qk, so the similarity between them can be defined as follows:

simt(qi, qk | I) = Feq_{qk}(qi) + Feq_{qi}(qk)          (2)

Generally speaking, the more common a keyword is, the more chance it has to associate with other keywords, but such associations have lower reliability. Therefore, we weight the counts according to the uniqueness of each keyword, i.e. we set a lower weight for frequent keywords and a higher weight for unique keywords. Finally, the similarities of the initial words for the target image I are calculated by modifying Eqn. (2):

simt(qi, qk | I) = Feq_{qk}(qi) log(ND / N(qi)) + Feq_{qi}(qk) log(ND / N(qk))          (3)

where N(qi) is the number of training images whose descriptions contain the word qi, and ND is the total number of images in the dataset.

This approach measures the textual similarity between keywords in a local way. It not only considers the similarity between words, but also takes into account their relations to the image: the textual similarity between two words is high only when both are closely related to the target image and they always appear together in the web page. This differs from traditional methods such as the WordNet method [10] and pairwise co-occurrence [23], which only consider the relation between two words. Compared with these traditional methods, our local textual similarity measure is more suitable for the annotation ranking task, which is also demonstrated by the experimental results. Another reason for using this measure is that the search results needed to compute the local similarity have already been obtained in the process of extending words, so applying the local similarity to initial word ranking costs little extra.

After calculating the textual similarity, the textual ranking value of the initial word qi, rankt(qi|I), is defined as the normalized summation of the local textual similarities between qi and the other initial words of image I:

rankt(qi | I) = Σ_{k≠i} simt(qi, qk | I) / Σ_{i≠k} Σ_{k≠i} simt(qi, qk | I)          (4)

where the denominator is the normalization factor.

After obtaining the above two types of initial word rankings, we first normalize them into [0, 1] and then fuse them using a weighted linear combination to define the ranking value of an initial word qi:

F0(qi | I) = a × rankv(qi | I) + (1 − a) × rankt(qi | I)          (5)

where a is a weight ranging from 0 to 1. Because text features are generally more effective than image features in web-based approaches [25], the value of a is set below 0.5.
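A sketch of Eqns. (1)-(5), assuming the search result SR(I, q) of Section 4.2 is available for every initial word as a list of entries with `similarity` and `description` fields; the normalization into [0, 1] is done here by dividing by the maximum, one of several reasonable choices, and the default a = 0.3 is illustrative.

```python
import math

def rank_visual(sr, K=10):
    """Eqn. (1): average of the top-K visual similarities in the search result of one word."""
    sims = sorted((e.similarity for e in sr), reverse=True)[:K]
    return sum(sims) / max(len(sims), 1)

def freq_in_sr(sr, word):
    """How often `word` appears in the textual descriptions of a search result."""
    return sum(e.description.lower().split().count(word) for e in sr)

def rank_initial_words(initial_words, sr_of, N_of, ND, a=0.3):
    """Compute F0 (Eqn. (5)) for all initial words of one image.
    sr_of: word -> its search result SR(I, word)
    N_of:  callable returning N(q), the number of training images whose description contains q
    ND:    total number of images in the dataset
    """
    # Eqn. (3): weighted local co-occurrence similarity between two initial words.
    def sim_t(qi, qk):
        return (freq_in_sr(sr_of[qk], qi) * math.log(ND / max(N_of(qi), 1)) +
                freq_in_sr(sr_of[qi], qk) * math.log(ND / max(N_of(qk), 1)))

    pair_sims = {(qi, qk): sim_t(qi, qk)
                 for qi in initial_words for qk in initial_words if qi != qk}
    total = sum(pair_sims.values()) or 1.0

    rank_v = {q: rank_visual(sr_of[q]) for q in initial_words}
    rank_t = {qi: sum(pair_sims[(qi, qk)] for qk in initial_words if qk != qi) / total
              for qi in initial_words}                                   # Eqn. (4)

    # Normalize both rankings to [0, 1], then fuse them with a < 0.5 (Eqn. (5)).
    def norm(d):
        m = max(d.values()) or 1.0
        return {k: v / m for k, v in d.items()}
    rv, rt = norm(rank_v), norm(rank_t)
    return {q: a * rv[q] + (1 - a) * rt[q] for q in initial_words}
```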

5.2 Extended Annotation Ranking

The ranking value of an extended annotation is defined in a different way. As an extended candidate is actually the name of a search result cluster, its ranking value is estimated by the average similarity between the images in the corresponding cluster and the target image [24]. If the member images of a cluster are relevant to the query, the concepts learned from this cluster are likely to represent the content of the query image. Considering the uniqueness of each keyword, we also weight this value using the textual information to define the ranking score:

C0(xi | I) = v(xi) log(ND / N(xi))          (6)

where xi is an extended word of image I and v(xi) is the average member image similarity.
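For completeness, the corresponding sketch of Eqn. (6); `cluster_sims` is assumed to hold the visual similarities between the target image and the member images of the cluster named x.

```python
import math

def rank_extended_word(cluster_sims, N_x, ND):
    """Eqn. (6): C0(x|I) = v(x) * log(ND / N(x)), where v(x) is the average visual
    similarity between the target image and the member images of cluster x (sketch)."""
    v = sum(cluster_sims) / max(len(cluster_sims), 1)
    return v * math.log(ND / max(N_x, 1))
```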

6. BIPARTITE GRAPH CONSTRUCTION AND REINFORCEMENT LEARNING

In this section, we describe the bipartite graph reinforcement model (BGRM) for re-ranking the candidate annotations of a web image. BGRM is based on a graph model, so we first introduce the construction of the graph. Then we describe the iterative form of our algorithm and demonstrate its convergence. Additionally, we also give the non-iterative form of BGRM.

6.1 Graph Construction

Initial and extended candidate annotations are heterogeneous annotations for web images. First, initial words are direct descriptions of the target image, while extended words are mined from the large-scale image database and can only describe the target image indirectly by propagating the descriptions of the semantically and visually related images. Second, extended words derived from the same initial word tend to be similar to each other, so the similarities between extended words are partly decided by their initial word; similarities between initial words do not have this characteristic. Therefore, the two kinds of candidates cannot be re-ranked using a unified measure. However, they do have close relations: as an illustration, if an initial word is precise, its extended words are probably precise, and vice versa. Consequently, we model the candidate annotations of a web image as a bipartite graph.

To construct the bipartite graph G, the initial and extended candidate annotations are taken as the two disjoint sets of graph vertices. Vertices from different sets are connected by edges with proper weights. The weight of an edge is defined using the relations between initial and extended words. A subtle point is that we set a nonzero weight on an edge only if the relation between its two vertices is close enough. For the two vertices qi and xj of an edge, we consider them closely related if 1) xj is extended by qi, or 2) qi is sufficiently similar to xj. Therefore, the weight of the edge is calculated as follows:

ωij = 1 + s(qi, xj | th),  if xj is extended by qi
ωij = s(qi, xj | th),      otherwise

s(qi, xj | th) = s(qi, xj),  if s(qi, xj) > th
s(qi, xj | th) = 0,          otherwise          (7)

where ωij is the weight, s(·) is the textual similarity between words, and s(·|th) is the textual similarity thresholded by a pre-defined threshold th.

Suppose the weight ωij initially equals 0. Eqn. (7) states that if xj is extended by qi, the weight ωij is increased by 1, and if the similarity s(qi, xj) between them is above the pre-defined threshold th, ωij is further increased by s(qi, xj).

s(·) can have various definitions as long as it represents the relationship between words. For simplicity, we use WordNet to calculate the textual similarity. In WordNet [19], synonyms with the same meaning are grouped into a synset, called a concept, and concepts are linked to each other through different relations. Therefore, WordNet is useful for determining semantic connections between sets of synonyms. The Jiang and Conrath measure (JNC) [10] has proved effective for measuring the semantic distance between two concepts using WordNet. Given two words wi and wj, we first find their associated concepts ci and cj, and take the maximum similarity among all possible concept pairs as the semantic similarity of the two words.

6.2 Reinforcement Learning Algorithm

The reinforcement algorithm on G is shown in Eqn. (8). The equations iterate until convergence.

Cn+1 = α C0 + (1 − α) L^T Fn
Fn+1 = β F0 + (1 − β) L Cn+1          (8)

C0 and F0 are the initial ranking value vectors of the extended and initial candidate annotations, respectively. L is the adjacency matrix of G, which is both row-normalized and column-normalized using Eqn. (9), and L^T is its transpose. C and F denote the updated C0 and F0, and their subscripts indicate the iteration number. α and β are weights ranging from 0 to 1; they determine to what degree the model relies on the propagated relations. The first line of Eqn. (8) is the update of the extended word ranking, and the second line is the update of the initial word ranking. Because L^T F propagates the initial word rankings to the extended word rankings via their link relations, the first line of Eqn. (8) indicates that we use the information provided by the initial words to reinforce the extended words, while still keeping a certain confidence in the initial values. Meanwhile, L C propagates the extended word rankings via their link relations to reinforce the initial word rankings.

L = Dr^-1 W Dc^-1          (9)

W is the original adjacency matrix of G; Dr is the diagonal matrix whose (i, i)-element equals the sum of the i-th row of W; Dc is the diagonal matrix whose (i, i)-element equals the sum of the i-th column of W.

Eqn. (8) also shows that in each iteration C is reinforced first and F is then reinforced by the updated C. This expresses a stronger belief in the initial word ranking than in the extended word ranking, which is confirmed by the experimental results on web images. This fact also affects the selection of α and β: a greater value of β is chosen to place more confidence in the initial ranking of the initial words.

6.3 Convergence

Let us show that the sequences {Cn} and {Fn} converge. By unrolling the iteration in Eqn. (8), we have

Cn+1 = (γ L^T L)^(n+1) C0 + α [Σ_{i=0..n} (γ L^T L)^i] C0 + (1 − α) β [Σ_{i=0..n} (γ L^T L)^i] L^T F0
Fn+1 = (γ L L^T)^(n+1) F0 + β [Σ_{i=0..n} (γ L L^T)^i] F0 + (1 − β) α [Σ_{i=0..n} (γ L L^T)^i] L C0

where γ = (1 − α)(1 − β). Since 0 < α, β < 1, and the eigenvalues of L L^T and L^T L lie in [−1, 1] because they are symmetric matrices based on the row- and column-normalized L, we have

lim_{n→∞} (γ L^T L)^(n+1) = 0,   lim_{n→∞} Σ_{i=0..n} (γ L^T L)^i = (I − γ L^T L)^-1

and

lim_{n→∞} (γ L L^T)^(n+1) = 0,   lim_{n→∞} Σ_{i=0..n} (γ L L^T)^i = (I − γ L L^T)^-1.

Hence,

C* = α [I − γ L^T L]^-1 C0 + β (1 − α) [I − γ L^T L]^-1 L^T F0
F* = β [I − γ L L^T]^-1 F0 + α (1 − β) [I − γ L L^T]^-1 L C0          (10)

where C* and F* are the converged C and F, respectively. Thus C and F can be computed directly without iteration.
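To make Eqns. (7)-(10) concrete, a NumPy sketch follows; the word similarity s(·) is passed in as a callable (for example a WordNet JNC score), and the threshold th, α and β defaults are illustrative rather than values prescribed by the paper.

```python
import numpy as np

def build_weight_matrix(initial_words, extended_words, extended_by, sim, th=0.3):
    """Eqn. (7): edge weight W[i, j] between initial word q_i and extended word x_j (sketch).
    extended_by: dict mapping each extended word to the initial word that produced it.
    sim:         callable s(q, x), e.g. a WordNet JNC similarity."""
    W = np.zeros((len(initial_words), len(extended_words)))
    for i, q in enumerate(initial_words):
        for j, x in enumerate(extended_words):
            s = sim(q, x)
            W[i, j] = (s if s > th else 0.0) + (1.0 if extended_by.get(x) == q else 0.0)
    return W

def normalize(W):
    """Eqn. (9): L = Dr^-1 W Dc^-1 from the row and column sums of W."""
    Dr = np.maximum(W.sum(axis=1, keepdims=True), 1e-12)
    Dc = np.maximum(W.sum(axis=0, keepdims=True), 1e-12)
    return W / Dr / Dc

def bgrm_rerank(W, F0, C0, alpha=0.3, beta=0.6, iters=50):
    """Eqn. (8): iterate the reinforcement until (approximate) convergence.
    F0: initial ranking vector of the initial words;  C0: of the extended words."""
    L = normalize(W)
    F, C = F0.copy(), C0.copy()
    for _ in range(iters):
        C = alpha * C0 + (1 - alpha) * L.T @ F      # extended word update
        F = beta * F0 + (1 - beta) * L @ C          # initial word update, uses updated C
    return F, C

def bgrm_closed_form(W, F0, C0, alpha=0.3, beta=0.6):
    """Eqn. (10): the non-iterative solution F*, C*."""
    L = normalize(W)
    g = (1 - alpha) * (1 - beta)
    n, m = L.shape                                   # n initial words, m extended words
    C = np.linalg.solve(np.eye(m) - g * L.T @ L,
                        alpha * C0 + beta * (1 - alpha) * L.T @ F0)
    F = np.linalg.solve(np.eye(n) - g * L @ L.T,
                        beta * F0 + alpha * (1 - beta) * L @ C0)
    return F, C
```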

6.4 A Comparison with the HITS Algorithm

Kleinberg proposed the HITS (Hypertext Induced Topic Selection) algorithm [12], a link analysis algorithm that rates web pages by their authority and hub values. HITS uses a two-level weight propagation scheme on a bipartite graph constructed from authorities and hubs. The assumption behind HITS is similar to that of BGRM: in a bipartite graph, a good vertex in one set is linked to good vertices in the other set. The basic HITS algorithm is as follows. Initially, all vertex weights of the bipartite graph are set to 1. At each iteration, the weight of a vertex v is set to the sum of the weights of the vertices in the other set that link to v, and a normalization step is then applied. The algorithm iterates until convergence.

The major differences from BGRM are as follows. First, and fundamentally, HITS does not consider initial weights of the vertices, whereas BGRM regards the initial weights, which can be acquired in many practical cases, as the basic information for ranking the vertices. Second, HITS needs to normalize the weights after each iteration, while BGRM only needs to normalize the adjacency matrix of the graph before the iterations.
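For comparison, here is a compact sketch of the plain HITS iteration on a bipartite adjacency matrix (not part of BGRM); unlike BGRM it starts from uniform weights and re-normalizes after every step.

```python
import numpy as np

def hits(W, iters=50):
    """HITS [12] on a bipartite graph with adjacency W (rows: one vertex set,
    columns: the other). Returns the two converged weight vectors (sketch)."""
    h = np.ones(W.shape[0])               # all weights start at 1, ignoring any prior ranking
    a = np.ones(W.shape[1])
    for _ in range(iters):
        a = W.T @ h                       # a vertex weight = sum of linked weights in the other set
        h = W @ a
        a /= np.linalg.norm(a) or 1.0     # normalize after each iteration
        h /= np.linalg.norm(h) or 1.0
    return h, a
```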

Figure 4. Precision values comparison of initial words

Figure 5. Coverage rate values comparison of initial words

7. IMAGE ANNOTATION SCHEMES BASED ON BGRM

In this section, we design three strategies to determine the final annotations. As the ranking values of the two sets of candidates are defined differently, they should be treated differently.

7.1 Top N Strategy

In the "Top N" strategy, a fixed number of annotations with the highest ranking values are chosen. Specifically, the top m1 initial words and the top m2 extended words are selected (N = m1 + m2). Empirically, one can use cross-validation experiments to set suitable m1 and m2 values.

One disadvantage is that if the number of precise annotations per image varies, it is hard to select an accurate number of annotations.

7.2 Threshold Strategy

We also use a "Threshold" strategy to select final annotations from both initial and extended words. If the ranking value of a candidate is above a threshold, the candidate is chosen. A dynamic threshold dth is used in our task, as shown in Eqn. (11):

dth = 1 / Num(annotations) × η          (11)

where Num(annotations) is the number of candidate annotations in one set for the target image. Note that for each set of annotations of the target image, the ranking values are normalized so that they sum to 1; thus 1/Num(annotations) is the average ranking value and η is a weight. Eqn. (11) therefore expresses that a candidate is selected if its ranking value is larger than the corresponding weighted average ranking value.

7.3 Modified Threshold Strategy

In this strategy, we decide the number of final annotations according to the number of initial candidates. We first use the "Threshold" strategy to remove imprecise annotations from the initial ones. Then extended words with high ranking values are appended so that the number of final annotations equals that of the initial candidates. With the modified threshold strategy, we can select a dynamic number of final annotations (cf. the Top N strategy) while estimating only one parameter (cf. the Threshold strategy).

After selecting the final annotations, we merge their ranking values into a unified measure. Because the initial words have better quality, we give them higher ranking values. Denote by wi a selected annotation of an image I. The final annotation ranking function is shown in Eqn. (12):

R(wi) = (1 − t1) F(i) + t1,  if wi ∈ Q
R(wi) = t2 C(i),             if wi ∈ X          (12)

For initial words Q, we linearly shrink their ranking values F to lie within [t1, 1]; for extended words X, their ranking values C are shrunk to lie within [0, t2] (0 < t1 ≤ t2 < 1). For simplicity, we set t1 = t2 = 0.5 in the experiments.
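To make the three strategies and Eqn. (12) concrete, a sketch follows; F and C are the re-ranked score dictionaries from Section 6, t1 = t2 = 0.5 matches the setting above, and the remaining defaults (m1, m2, η) are illustrative only.

```python
def top_n(F, C, m1=3, m2=3):
    """The "Top N" strategy: keep the m1 best initial and m2 best extended words."""
    best = lambda d, m: [w for w, _ in sorted(d.items(), key=lambda kv: -kv[1])[:m]]
    return best(F, m1), best(C, m2)

def threshold(scores, eta=1.0):
    """The "Threshold" strategy (Eqn. (11)): keep candidates scoring above eta times the
    average score of their own candidate set (scores assumed normalized to sum to 1)."""
    dth = eta / max(len(scores), 1)
    return [w for w, s in scores.items() if s > dth]

def modified_threshold(F, C, eta=1.0):
    """The "Modified threshold" strategy: prune initial words by the threshold rule, then
    append the best extended words until the count matches the number of initial candidates."""
    kept_initial = threshold(F, eta)
    need = len(F) - len(kept_initial)
    extra = [w for w, _ in sorted(C.items(), key=lambda kv: -kv[1])[:max(need, 0)]]
    return kept_initial, extra

def final_scores(kept_initial, kept_extended, F, C, t1=0.5, t2=0.5):
    """Eqn. (12): map initial-word scores into [t1, 1] and extended-word scores into [0, t2],
    assuming F and C values lie in [0, 1]."""
    R = {w: (1 - t1) * F[w] + t1 for w in kept_initial}
    R.update({w: t2 * C[w] for w in kept_extended})
    return R
```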

8. EXPERIMENTAL RESULTS

8.1 Data Set

The test image set is selected from [21]. 5,000 web images are crawled from the internet. Every image is annotated with 5 to 12 keywords from the surrounding text or tag information, which is extracted from the blocks containing the images by the VIPS algorithm and processed by standard text processing techniques. There are 2,535 unique keywords in total. We manually checked these initial words for each image; the average precision is 13.3% per image. Note that the test images are completely independent of the training images, whereas most previous works use test and training images from the same benchmark database.

As mentioned in Section 4.2, we use the 2.4 million web images associated with meaningful descriptions as the web-scale training set, which is also used in [22][24].

To represent an image, a 64-dimensional feature [29] is extracted. It is a combination of three features: 6-dimensional color moments, a 44-dimensional banded auto-correlogram, and 14-dimensional color texture moments. For the color moments, the first two moments of each channel in the CIE-LUV color space are extracted. For the correlogram, the HSV color space with an inhomogeneous quantization into 44 colors is adopted. The focus of this paper is not on image feature selection, and our approach is independent of any particular visual features, so any existing global or local features and the corresponding distance measures could be used in our model.
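As a small illustration of the color moment part of this descriptor only (the correlogram and color texture moments are omitted), assuming scikit-image is available for the RGB to CIE-LUV conversion; this is a sketch, not the authors' feature extractor.

```python
import numpy as np
from skimage import color

def color_moments_luv(rgb_image):
    """First two moments (mean, standard deviation) of each CIE-LUV channel,
    giving the 6-dimensional color moment part of the 64-dimensional feature (sketch).
    rgb_image: HxWx3 array with values in [0, 1]."""
    luv = color.rgb2luv(rgb_image)             # convert to the CIE-LUV color space
    means = luv.reshape(-1, 3).mean(axis=0)
    stds = luv.reshape(-1, 3).std(axis=0)
    return np.concatenate([means, stds])       # 6-dimensional descriptor
```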

8.2 Evaluation Criterion

As performance metrics we adopt the top N precision and coverage rate [15][22][23] to measure the ranking performance of candidate annotations. Top N precision measures the precision of the top N ranked annotations of one image; when N is large, the precision equals that of the original test dataset. Top N coverage rate is defined as the percentage of images that are correctly annotated by at least one word among the first N ranked annotations.

Top N Precision = (1/M) Σ_{i∈It} correct_i(N) / N
Top N Coverage Rate = (1/M) Σ_{i∈It} IsContainCorrect_i(N)          (13)

Figure 6. Precision values comparison of extended words

where correct_i(N) is the number of correct annotations among the top N ranked annotations of image i, It is the test image set, and M is the size of It. IsContainCorrect_i(N) indicates whether image i has at least one correct annotation among its first N ranked annotations. To evaluate the performance of the final annotations, simple precision and coverage rate are adopted, which measure the performance of the entire set of final annotations of an image.
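A short sketch of these two metrics of Eqn. (13), assuming hypothetical dictionaries `ranked` (image id to its ranked annotation list) and `truth` (image id to its ground-truth word set).

```python
def top_n_precision(ranked, truth, N):
    """Eqn. (13), first line: mean fraction of correct words among the top N per image."""
    return sum(len([w for w in ranked[i][:N] if w in truth[i]]) / N
               for i in ranked) / len(ranked)

def top_n_coverage(ranked, truth, N):
    """Eqn. (13), second line: fraction of images with at least one correct word in the top N."""
    return sum(any(w in truth[i] for w in ranked[i][:N]) for i in ranked) / len(ranked)

# Example: top_n_precision({"img1": ["bird", "coot", "river"]}, {"img1": {"bird", "river"}}, 2)
# evaluates to 0.5.
```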

8.3 Performance of Initial Annotation Ranking

The baseline method is the WordNet-based approach proposed in [11] (WORDNET). As mentioned in Section 1, this algorithm uses WordNet to rank candidate annotations; only a global measure is used. We compare it with our proposed local text similarity measure (LOCAL), the initial ranking method which combines the visual ranking and the local textual ranking (INITIAL), and the proposed BGRM method (BGRM).

Fig. 4 and 5 show the average precision and coverage rate of the "Top N" results of the initial candidates. The fourth columns correspond to the WordNet-based method, and the third columns show the performance of the method based on the local textual measure; the difference between them illustrates the effectiveness of the local textual similarity measure. The second columns correspond to the initial ranking of the initial words, which combines the local textual ranking and the visual ranking; the difference between them and the third columns shows that it is useful to incorporate visual ranking when ranking annotations. The first columns correspond to our BGRM method. The performance of BGRM is better than that of all the other methods: for top 1 precision, BGRM is 40% better than WORDNET, 16% better than LOCAL, and 6% better than INITIAL, and each top N precision is also better than that of the other methods. The top 2 coverage rate of BGRM is 52.1%, which nearly equals the top 3 coverage rate of WORDNET.

8.4 Performance of Extended Annotation Ranking

To avoid error propagation, only the initial words with high initial rankings are used to generate extended words; in detail, the top 5 ranked initial words are used. Most images are annotated with 4 to 30 extended words. We manually checked the extended words for each image; the average precision reaches 14%.

Figure 7. Coverage rate values comparison of extended words

An interesting phenomenon is that the average precision of the extended words does not decrease; it is 5.3% higher than that of the initial words (13.3%). Intuitively, extended words, which only indirectly describe the target image by propagating the descriptions of the semantically and visually related images, might be expected to be less precise than the initial words. To explain this phenomenon, let us look at the details. Table 1 shows the percentage of images that are annotated with exactly N precise annotations. The first column is the number of precise annotations; the second and third columns show the percentage of images with exactly N precise initial and extended words, respectively. From Table 1 we can see that, although the percentage of images with no relevant extended word at all is larger than that for initial words, the percentage of images with more than one precise extended word is 38.38%, which is much larger than that for initial words (+88%).

Generally, a precise initial word tends to propagate precise extended words. But we also found that even an imprecise initial word can propagate precise extended words under certain conditions. Fig. 8 illustrates this situation: "adult", an imprecise initial word that is nevertheless somewhat related to the image, also propagates precise extended words such as "bird".

The baseline method is the average member image value approach proposed in [24], which is mentioned in Section 5.2 (VISUAL). The WordNet-based approach (WORDNET) is also used for comparison; although WORDNET cannot predict extended words, it can be used to rank them. We compare them with our proposed initial ranking method, the weighted average member image value (INITIAL), and the proposed BGRM method (BGRM).

Table 1. Comparison between percentages of images with exactly N correct initial and extended words

N    Images with N precise initial words    Images with N precise extended words
1    45.2%                                  23.2%
2    17.5%                                  19.5%
3    2.1%                                   10.6%
4    0.8%                                   4.9%
5    0%                                     2.0%
6    0%                                     0.6%

Figure 8. An example in which an image with an imprecise initial word ("adult") propagates precise extended words (head, white, black, south, mother, bird, baby, son, ...).

Table 2. Final annotation performance comparisons with the "Top N", Threshold, and Modified threshold selection strategies (coverage rate > 0.654)

            Top N                          Threshold                       Modified threshold
            Precision (Coverage, m1, m2)   Precision (Coverage, η1, η2)    Precision (Coverage, η)
BASELINE    0.161 (0.675, 4, 3)            0.192 (0.654, 1.1, 0.7)         0.179 (0.693, 1.0)
INITIAL     0.196 (0.665, 3, 4)            0.238 (0.658, 0.9, 0.8)         0.216 (0.672, 1.1)
BGRM        0.220 (0.664, 4, 3)            0.275 (0.654, 1.0, 0.9)         0.245 (0.697, 0.7)

Table 4. Initial word ranking using different initial ranking schemes in BGRM (top 1 coverage rate equals top 1 precision)

                Precision Top 1    Precision Top 2    Coverage Rate Top 2
WORDNET         0.263              0.234              0.415
WORDNET+BGRM    0.349 (+33%)       0.286 (+22%)       0.499 (+20%)
LOCAL           0.316              0.267              0.460
LOCAL+BGRM      0.351 (+11%)       0.297 (+11%)       0.507 (+10%)

Fig. 6 and 7 show the average precision and coverage rate of the "Top N" results of the extended words. The performance of WORDNET is the lowest, and its top 1 precision is even lower than its other top N precisions. These results confirm that the extended words have a grouping characteristic and are hard to rank using only their mutual similarity. INITIAL significantly improves on the performance of VISUAL, showing that the weighting scheme is effective for extended word re-ranking. Our BGRM method achieves the best results.

8.5 Performance of Final Annotations

After selecting the initial and extended words, the final words are obtained for each image. Note that the precision and coverage rate of the initial words are 13.3% and 65.4%, respectively. The maximum achievable precision and coverage rate of the final annotations are 32.2% (selecting only the top-ranked initial and extended word, respectively) and 78.1% (selecting all initial and extended words).

Tables 2 and 3 show the final annotation performance of BGRM, INITIAL and BASELINE. BASELINE denotes the baseline method which uses WORDNET and VISUAL for initial and extended word re-ranking, respectively. INITIAL and BASELINE utilize the reinforcement learning on the bipartite graph model. Each column uses a different final annotation determination strategy. To make the performance comparable, we report the best precision (coverage rate) for each method conditioned on a given fixed coverage rate (precision). m1 and m2 are the numbers of initial and extended words in the final annotations; η1 and η are the threshold weights for the initial words (in the Threshold and Modified threshold strategies, respectively), and η2 is the threshold weight for the extended words.

Table 3. Final annotation performance comparisons with precision no less than 0.22 ("-" means no annotation is selected)

            Top N                          Threshold                       Modified threshold
            Coverage (Precision, m1, m2)   Coverage (Precision, η1, η2)    Coverage (Precision, η)
BASELINE    -                              -                               -
INITIAL     0.649 (0.233, 2, 2)            0.679 (0.222, 0.9, 0.6)         -
BGRM        0.678 (0.220, 4, 2)            0.714 (0.225, 0.6, 0.7)         0.712 (0.22, 0.5)

Table 5. Extended word ranking using different initial ranking schemes in BGRM

                Precision Top 1    Precision Top 2    Coverage Rate Top 2
VISUAL          0.199              0.185              0.293
VISUAL+BGRM     0.207 (+4%)        0.188 (+2%)        0.296 (+1%)
WORDNET         0.134              0.131              0.218
WORDNET+BGRM    0.141 (+5%)        0.139 (+6%)        0.226 (+4%)

In Table 2, we show that when the coverage rate is kept above 0.654 (the coverage rate of the initial words), the precision of the final annotations produced by BGRM reaches 0.273, which is 78.2% better than that of the initial words. Table 3 shows that the coverage rate of the final annotations produced by BGRM is above 0.7 with a precision of no less than 0.22 (the average of the precision of the initial words and the maximum precision of the final annotations). Additionally, compared with the initial words, BGRM improves the precision and coverage rate of the final annotations simultaneously. Methods that determine final annotations using only initial words cannot achieve an acceptable coverage rate, since their coverage rate cannot exceed that of the initial words. BGRM breaks this bottleneck and reaches a higher coverage rate than the initial annotations by extending precise words. Comparing the three strategies for determining the final annotations, "Threshold" and "Modified threshold" achieve better performance than "Top N", and "Modified threshold" is a competitive method which requires only one parameter to be estimated.

8.6 Effectiveness of BGRM

BGRM leverages the relations between initial words and extended words to improve the ranking performance of both. To show the effectiveness of BGRM, we designed an experiment that varies the initial ranking schemes. Tables 4 and 5 show the initial and extended word ranking performance when other ranking methods are used instead of our initial ranking schemes in BGRM. In Table 4, WORDNET and LOCAL are used as the initial word ranking methods in BGRM instead of INITIAL. In Table 5, VISUAL and WORDNET are used as the extended word ranking methods in BGRM. The rows labeled with the base methods list the ranking performance obtained by simply using those methods; the rows labeled "+BGRM" show the performance when those methods are used as the initial ranking schemes of BGRM. The improvement in performance shows that BGRM has the capability to boost any annotation ranking approach.

We observe that the degree of performance improvement of the initial word ranking is higher than that of the extended word ranking when using BGRM. This shows that the reinforcement on the initial word ranking is more effective than that on the extended word ranking, which is consistent with the discussion in Section 4. How to rank extended words more effectively will be our future work.

8.7 Performance of Image Retrieval

We also compare the image retrieval performance based on the final annotations of WORDNET, BASELINE and BGRM. Eight terms are selected as the query concepts; the terms are chosen based on both popularity and diversity. Precision and recall are used as the retrieval performance measures. The recall of a word wi is defined as the number of images correctly annotated with wi divided by the number of images that have wi in the ground truth annotation. The precision of wi is defined as the number of correctly annotated images divided by the total number of images annotated with wi. The experimental results of image retrieval for each selected term are shown in Table 6. According to the average precision and recall over all terms, BGRM is clearly superior to BASELINE and WORDNET. It is noticeable that the average recall of BASELINE and BGRM is much higher than that of WORDNET; this is due to the accurate extended annotations that are appended to the final annotations.

Table 6. Performance of selected terms for WORDNET, BASELINE and BGRM

                WORDNET            BASELINE           BGRM
Keywords        Prec.   Recall     Prec.   Recall     Prec.   Recall
flower          0.50    0.04       0.51    0.15       0.51    0.30
moth            0.41    0.12       0.37    0.20       0.40    0.38
mountain        0.16    0.04       0.19    0.15       0.22    0.14
lake            0.13    0.06       0.14    0.12       0.27    0.15
snow            0.32    0.10       0.20    0.12       0.21    0.15
book            0.80    0.27       0.67    0.20       0.73    0.32
pine            0.08    0.14       0.13    0.46       0.29    0.62
Harry Potter    0.0     0.0        0.86    0.10       1.0     0.25
Aver.           0.301   0.097      0.384   0.188      0.454   0.280

8.8 Illustrative Examples

Fig. 9 lists some illustrative examples of the final annotations of eight web images generated by WORDNET, BASELINE and BGRM. The WORDNET method only utilizes initial words to obtain final annotations, while BASELINE and BGRM make use of both the initial and the extended words.

There are several interesting observations in the annotation results. First, our method can annotate images with specific words, such as "iguana" for the first image, a task that is hard even for humans; this is because our method benefits from the available textual information. Next, the second image is successfully annotated with the person name "Harry Potter", which shows the advantage of using words from the surrounding text ("Potter") and a large-scale image database. Third, our method succeeds in annotating the fourth image ("flower" and "bee") even though none of its initial annotations ("hover", "color") is precise.

Figure 9. Examples of web image annotation results

Image 1 - WORDNET: green, color; BASELINE: green, color, flowers, zoo, night, ecuador; BGRM: green, iguana, natural, lizard, islands, darken
Image 2 - WORDNET: daiel, draco, secret; BASELINE: daniel, draco, mix, night, boy, sun; BGRM: potter, chamber, potters hands, burial chamber, harry potter, surrounded
Image 3 - WORDNET: sripe, vision, leap; BASELINE: stripe, dolphin, fly, sky, north America, air; BGRM: leap, dolphin, dolphin jump, jump, sea world, build
Image 4 - WORDNET: hover, spacer, color; BASELINE: hover, spacer, natural, bloom, botanical gardens, rain; BGRM: hover, color, flower, gulls, bee, fly
Image 5 - WORDNET: avril, arista, musicfind; BASELINE: yahoo, avril, red, south; BGRM: yahoo, avril, avril lavigne, la digue
Image 6 - WORDNET: Alabama, city, planetarium, troy; BASELINE: Alabama, city, north America, south, sun; BGRM: tulip, Alabama, gronen, hazel green, tulip festival
Image 7 - WORDNET: cover, goat, snow; BASELINE: goat, cover, snow fall, snow tree, Canada; BGRM: snow, steep, magazine cover, snow fall, snow tree
Image 8 - WORDNET: book, price, yard; BASELINE: yard, book, attract, power, book cover, hand, life; BGRM: price, book, hummingbird, front yard, book cover, fast

9. CONCLUSION AND FUTURE WORK

Images are the major source of media on the Internet. Automatic image annotation with words is an effective way to manage and retrieve the abundant and fast-growing number of web images. We have developed and evaluated a web image annotation approach based on a bipartite graph reinforcement model. It utilizes the available textual information associated with web images and leverages a large-scale image database to perform web image annotation. By manually examining the annotation results for over 5,000 real web pictures, we show that the proposed approach is effective. To further improve the performance of BGRM, we will exploit other ranking methods for the extended words, investigate better feature representations of images, and design more schemes to determine the final annotations in the future.

10. ACKNOWLEDGMENTS

The authors would like to thank the reviewers for their careful reading and insightful comments. The research is supported in part by the National Natural Science Foundation of China (60672056) and the Microsoft Research Asia Internet Services in Academic Research Fund. This work was performed at Microsoft Research Asia.

11. REFERENCES
[1] Beymer, D. and Poggio, T. Image representations for visual learning. Science, vol. 272, 1905-1909, 1996.
[2] Blei, D. M. and Jordan, M. I. Modeling annotated data. In Proc. of SIGIR, Toronto, July 2003.
[3] Chang, E., Kingshy, G., Sychay, G., and Wu, G. CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines. IEEE Trans. on CSVT, 13(1):26-38, Jan. 2003.
[4] Cusano, C., Ciocca, G., and Schettini, R. Image annotation using SVM. In Proc. of Internet Imaging IV, Vol. SPIE, 2004.
[5] Duygulu, P. and Barnard, K. Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In Proc. of ECCV, 2002.
[6] Feng, S. L., Manmatha, R., and Lavrenko, V. Multiple Bernoulli relevance models for image and video annotation. In Proc. of CVPR, Washington, DC, June 2004.
[7] Google Image Search: http://images.google.com/
[8] Jeon, J., Lavrenko, V., and Manmatha, R. Automatic image annotation and retrieval using cross-media relevance models. In Proc. of SIGIR, Toronto, July 2003.
[9] Jeon, J. and Manmatha, R. Automatic image annotation of news images with large vocabularies and low quality training data. In Proc. of ACM Multimedia, 2004.
[10] Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. In Proc. of the International Conference on Research in Computational Linguistics, 1997.
[11] Jin, Y., Khan, L., Wang, L., and Awad, M. Image annotations by combining multiple evidence & WordNet. In Proc. of ACM Multimedia, Singapore, 2005.
[12] Kleinberg, J. M. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5), 1999, 604-632.
[13] Lavrenko, V., Manmatha, R., and Jeon, J. A model for learning the semantics of pictures. In Proc. of NIPS, 2003.
[14] Li, J. and Wang, J. Z. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Trans. on PAMI, 25(10), Oct. 2003.
[15] Li, J. and Wang, J. Z. Real-time computerized annotation of pictures. In Proc. of ACM Multimedia, 2006, 911-920.
[16] Mori, Y., Takahashi, H., and Oka, R. Image-to-word transformation based on dividing and vector quantizing images with words. In MISRM, 1999.
[17] Photo.Net: http://photo.net/
[18] PhotoSIG: http://www.photosig.com
[19] Pucher, M. Performance evaluation of WordNet-based semantic relatedness measures for word prediction in conversational speech. In Sixth International Workshop on Computational Semantics, Tilburg, Netherlands, 2005.
[20] Smeulders, A., Worring, M., Santini, S., et al. Content-based image retrieval at the end of the early years. IEEE Trans. on PAMI, 2000, 1349-1380.
[21] Tong, H., He, J., Li, M., et al. Graph based multi-modality learning. In Proc. of ACM Multimedia, 2005.
[22] Wang, C., Jing, F., Zhang, L., and Zhang, H. J. Scalable search-based image annotation of personal images. In Proc. of the 8th ACM International Workshop on Multimedia Information Retrieval, 2006, 269-278.
[23] Wang, C., Jing, F., Zhang, L., and Zhang, H. Image annotation refinement using random walk with restarts. In Proc. of ACM Multimedia, Santa Barbara, CA, USA, October 23-27, 2006.
[24] Wang, X., Zhang, L., Jing, F., and Ma, W. AnnoSearch: image auto-annotation by search. In Proc. of CVPR, Washington, DC, 2006, 1483-1490.
[25] Wang, X. J., Ma, W. Y., Zhang, L., and Li, X. Iteratively clustering web images based on link and attribute reinforcements. In Proc. of ACM Multimedia, 2005, 122-131.
[26] Rui, X. G., Yu, N. H., Wang, T. F., and Li, M. J. A search-based web image annotation method. In Proc. of ICME, Beijing, China, 2007.
[27] Yahoo Image Search: http://images.search.yahoo.com/
[28] Zeng, H. J., He, Q. C., Ma, W. Y., et al. Learning to cluster web search results. In Proc. of SIGIR, 2004.
[29] Zhang, L., Hu, Y., Li, M., Ma, W., and Zhang, H. Efficient propagation for face annotation in family albums. In Proc. of ACM Multimedia, New York, 2004.
