Automatic Image Annotation by Using Relevant Keywords Extracted from Auxiliary Text Documents

Ning Zhou, Yi Shen, Jianping Fan
Dept. of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA
[email protected], [email protected], [email protected]

ABSTRACT

In this paper, a novel algorithm is developed to enable automatic image annotation by aligning web images with their most relevant auxiliary text terms. First, large-scale web pages are crawled and automatic web page segmentation is performed to extract informative images and their most relevant auxiliary text blocks. Second, image clustering is performed to partition the web images into a set of image clusters according to their visual similarity contexts. By grouping the web images according to their common visual properties, the uncertainty of the relatedness between the web images and their auxiliary text terms is significantly reduced. Finally, a relevance re-ranking algorithm is developed to achieve more precise alignment between the web images and their most relevant auxiliary text terms. Our experiments on large-scale web pages have provided very positive results.

Categories and Subject Descriptors

H.3.1 [Information Storage and Retrieval]: Content analysis and indexing

General Terms

Algorithms, Performance, Experimentation

Keywords

Automatic image-text alignment, relevance re-ranking

1. INTRODUCTION

As digital images grow exponentially on the Internet, there is an urgent need for new algorithms that support automatic web image indexing and keyword-based image retrieval [1-2]. Google Images has achieved great success in supporting keyword-based web image retrieval by loosely indexing web images with their auxiliary text terms. For a given web image, however, many of its auxiliary text terms are weakly related or even irrelevant to its semantics, because web pages contain rich web content beyond the text that interprets image semantics.


When all these auxiliary text terms are loosely used for web image indexing, Google Images may suffer from low precision rates and return large numbers of junk images [7-11]. To enable more effective web image indexing and retrieval, it is very attractive to develop new algorithms that achieve more accurate alignment between web images and their auxiliary text terms.

Automatic image annotation plays an important role in supporting keyword-based image retrieval [3-4], where machine learning techniques are usually employed to train classifiers from large amounts of labeled images. Because the performance of the classifiers largely depends on the reliability of the labels of the training images, the ground-truth labels are usually provided by professionals. Unfortunately, it is labor-intensive for humans to label large amounts of training images. On the other hand, the increasing availability of large-scale web images and their auxiliary text documents on the Internet provides an opportunity to prepare massive amounts of training images with reliable labels. Web images and their auxiliary text documents naturally co-occur on web pages, so the auxiliary text documents may contain the most relevant text terms for interpreting the semantics of the web images. Therefore, it is very attractive to develop new algorithms to harvest large amounts of labeled images from the web [13].

Each web page consists of two key components: web images and auxiliary text documents. The auxiliary text documents may contain a rich vocabulary of text terms: some of them interpret the semantics of the web images, but most of them describe other web content. Thus we cannot loosely use all these auxiliary terms to index and annotate the web images, because most of them are weakly related or even irrelevant to the semantics of the web images. That is, the relatedness between the web images and their auxiliary text terms is highly uncertain, and the web images are implicitly rather than explicitly labeled. To achieve more effective web image indexing and retrieval, it is very important to develop new algorithms for more accurate alignment between the web images and their most relevant auxiliary text terms. In this paper, an automatic algorithm is developed to achieve more precise alignment between the web images and their auxiliary text terms.

This paper is organized as follows. Section 2 briefly reviews the related work; Section 3 introduces image-block pair generation and image clustering, which reduce the uncertainty of the relatedness between the web images and their auxiliary text terms; Section 4 covers our algorithm for achieving more precise alignment between the web images and their most relevant auxiliary text terms; Section 5 describes our algorithm evaluation; we conclude in Section 6.

2. RELATED WORK

Automatic image annotation is an important task in the multimedia research community, and some pioneering work has been done recently [3-4]. However, these learning-based approaches are closely related to image classification and mostly focus on image processing, feature extraction, and classifier training rather than on the annotation task itself. In addition, they may work well only when large amounts of labeled images are available for classifier training, and they may run into severe difficulties when the number of labeled images is relatively small.

When the relatedness between images and their associated text terms is well-defined, pioneering work has enabled automatic image annotation by exploiting image-text co-occurrences [5-6]. The key idea behind these approaches is to learn the joint probabilities of the text terms and the image features from a training set in which the associations between the images and their text annotations are well-defined. The learned model is then used to find the images most similar to a test image, whose shared keywords are used to annotate the test image.

Pioneering work has also been done on establishing better associations between different media (captions and images/videos) [1-2]. Satoh et al. developed the first work on associating persons' names with human faces in news videos [1]. Berg et al. [2] developed a learning-oriented algorithm for aligning human faces with names extracted from captions. All these cross-media alignment techniques achieve better performance by coupling different information sources, but they focus on one particular type of image-text association, e.g., the association between human faces and names.

Because Google Images has achieved great success in supporting keyword-based web image indexing and retrieval, some recent work performs a post-process to filter out the junk images from Google Images [7-11]. Many researchers have studied how to better fuse multiple modalities of web images/videos to improve image/video search results via relevance re-ranking [14-16], where the goal of relevance re-ranking is well-defined, e.g., identifying the relatedness between the returned images/videos and the text terms of the query. In contrast, our proposed approach focuses on finding more suitable text terms to improve web image indexing, where the relatedness between web images and their auxiliary terms is highly uncertain. While our proposed research is similar in spirit to these re-ranking approaches, it differs from existing work in two important ways. First, rather than dealing with re-ranking at search time, we focus on more effective web image indexing by achieving more precise alignment between the web images and their auxiliary text terms, which can further result in more effective image retrieval with higher precision. Our goal is to develop an automatic image annotation model that copes with realistic web images whose relatedness with their auxiliary text terms is highly uncertain.

Figure 1: The illustration of the key components of our image-text alignment scheme: (a) web page; (b) image-block pair; (c) image cluster and ranked auxiliary text terms; (d) image cluster and re-ranked auxiliary text terms.

Second, our automatic image annotation model is unsupervised rather than supervised, so it can effectively deal with large-scale web images that have highly uncertain relatedness with their auxiliary terms.

3. EXTRACTING INFORMATIVE IMAGES AND THEIR AUXILIARY TEXT TERMS

In this paper, an automatic algorithm is developed for achieving more precise alignment between web images and their auxiliary terms. It consists of the following key components, as illustrated in Fig. 1: (a) informative images are extracted from web pages by automatically filtering out low-quality images according to their sizes and aspect ratios; (b) a web page segmentation algorithm partitions the web pages and their hosted images into a set of informative image-block pairs, where each informative image is associated with its most relevant surrounding text block(s); (c) automatic image clustering partitions the web images into a set of image clusters according to their visual similarity contexts, so that the web images in the same cluster have similar visual properties and their semantics can be effectively described by the same set of auxiliary terms; and (d) an automatic alignment algorithm identifies the relatedness between the web images and their most relevant auxiliary terms.

3.1 Image-Block Pair Generation

Modern web pages contain rich cross-media content, where the informative web content is often surrounded by auxiliary web content such as navigation menus, user comments, advertisement text and images, and snippet previews of related documents. The high diversity of web content makes identifying the relatedness between the web images and their auxiliary terms an essential but challenging task. In order to effectively align web images with their auxiliary terms, we first extract the most informative images and segment the web pages to produce a set of image-block pairs. Informative images are extracted by filtering out images whose aspect ratios are larger than 5 or smaller than 0.2 and images whose widths or heights are less than 60 pixels. By discarding non-informative web images, we have extracted more than 5,000,000 informative web images from 500,000 web pages.

In addition, a DOM-based method is adopted to extract the most relevant text block(s) for each informative image. Specifically, a region-growing algorithm is employed: the corresponding image node in the DOM tree is set as the start point, and an upward growing search is performed until it reaches a text node. The inner texts embedded in the text node(s) touched by the region-growing search are extracted as the text block(s). In addition to the text blocks in the web page, we also extract meta-data embedded in the HTML tags as side information, which often strongly reflects the semantics of an image. Four types of meta-data are extracted: alternate texts, image titles, image file names, and web page titles.

Image-block pairs can significantly narrow down the search space when extracting relevant text terms for image semantics description, but the relatedness between the informative images and the auxiliary terms is still uncertain. Image clustering, which partitions web images and their text blocks into multiple groups with more homogeneous visual properties and similar semantics, can be used to reduce this uncertainty.
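As a concrete illustration, the following is a minimal sketch of the informative-image filter and the upward region-growing search described above. It assumes BeautifulSoup as the HTML/DOM parser and numeric width/height attributes on the image tags; the function names are ours, not the paper's.

from bs4 import BeautifulSoup

def is_informative(width, height):
    # Filtering rule from Section 3.1: discard images whose aspect ratio
    # is larger than 5 or smaller than 0.2, or whose width or height is
    # below 60 pixels.
    if width < 60 or height < 60:
        return False
    ratio = width / float(height)
    return 0.2 <= ratio <= 5.0

def nearest_text_block(img_node):
    # Upward region growing: start at the image node in the DOM tree and
    # climb until an ancestor containing visible text is reached.
    node = img_node.parent
    while node is not None:
        text = node.get_text(separator=" ", strip=True)
        if text:
            return text
        node = node.parent
    return ""

def image_metadata(img_node, soup):
    # The four types of meta-data used as side information.
    return {
        "alt": img_node.get("alt", ""),
        "title": img_node.get("title", ""),
        "filename": img_node.get("src", "").rsplit("/", 1)[-1],
        "page_title": soup.title.get_text(strip=True) if soup.title else "",
    }

def image_block_pairs(html):
    # Pair every informative <img> with its nearest text block and
    # meta-data. A production crawler would use the rendered image size
    # rather than the HTML attributes assumed here.
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for img in soup.find_all("img"):
        w = int(img.get("width", 0) or 0)
        h = int(img.get("height", 0) or 0)
        if is_informative(w, h):
            pairs.append((img, nearest_text_block(img), image_metadata(img, soup)))
    return pairs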

3.2 Image Clustering

To represent the visual content of an image, we use the colored pattern appearance model (CPAM), which was proposed to capture both the color and texture information of small patches in natural color images and has been successfully applied to image coding, indexing, and retrieval [19]. CPAM comes with a code book of common appearance prototypes built from tens of thousands of image patches using vector quantization. Given an image, a sliding window decomposes the image into a set of 4×4 tiles, and each small tile is encoded by the CPAM appearance prototype most similar to it. The CPAM-based feature vector x comprises the achromatic spatial pattern histogram (ASPH) and the chromatic spatial pattern histogram (CSPH). The distance between two CPAM-based feature vectors x_m and x_n is defined as

    d(x_m, x_n) = \sum_i \frac{|ASPH_m(i) - ASPH_n(i)|}{1 + ASPH_m(i) + ASPH_n(i)} + \sum_j \frac{|CSPH_m(j) - CSPH_n(j)|}{1 + CSPH_m(j) + CSPH_n(j)}    (1)
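Eq. (1) can be computed directly from the two histogram pairs; the following sketch assumes the histograms are given as NumPy arrays (the function name is ours):

import numpy as np

def cpam_distance(asph_m, csph_m, asph_n, csph_n):
    # Eq. (1): normalized absolute differences summed over the achromatic
    # (ASPH) and chromatic (CSPH) spatial pattern histograms.
    d_a = np.sum(np.abs(asph_m - asph_n) / (1.0 + asph_m + asph_n))
    d_c = np.sum(np.abs(csph_m - csph_n) / (1.0 + csph_m + csph_n))
    return d_a + d_c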

The negative of this distance is used to characterize the similarity between two images. To achieve more effective image clustering, a graph is first constructed to organize the web images according to their visual similarity contexts [13], where each node denotes one particular web image and an edge between two nodes characterizes their pairwise visual similarity context. Taking this graph as input, automatic image clustering is achieved by passing messages between the nodes through affinity propagation [12]. Some experimental results for image clustering are given in Fig. 2. One can observe that visual-based image clustering provides a good summarization of large amounts of web images.

It is worth noting that the images in the same cluster share similar visual properties, so their semantics can be effectively described by the same set of auxiliary terms, i.e., the text terms which co-occur most frequently in their relevant text blocks. Thus the text blocks for all the visually-similar web images in the same cluster are merged into a joint, unified text document, which provides a more reliable information source for extracting relevant terms to interpret the semantics of the visually-similar images. That is, the co-occurrences of the image-text pairs can be used to reduce the uncertainty of the relatedness between the web images and their auxiliary text terms. Therefore, integrating the image clustering results with the co-occurrences of the image-text pairs can yield more precise alignment between the web images and their auxiliary text terms.
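A minimal clustering sketch under these definitions, assuming scikit-learn's affinity propagation with the negated CPAM distance (reusing the cpam_distance sketch above) as the precomputed similarity:

import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_images(asph, csph):
    # Build the pairwise similarity matrix for n images: similarity is
    # the negative of the CPAM distance of Eq. (1).
    n = asph.shape[0]
    sim = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d = cpam_distance(asph[i], csph[i], asph[j], csph[j])
            sim[i, j] = sim[j, i] = -d
    # Affinity propagation clusters by passing messages between nodes [12].
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    return ap.fit_predict(sim)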

4. AUTOMATIC IMAGE-TEXT ALIGNMENT

In order to extract more meaningful text terms for image semantics description, a standard list of stop words is used to remove high-frequency words such as "the", "to", and "also". For each web image x in a given cluster, the NLTK toolkit [17] is applied to its text blocks to extract the meaningful text terms t ∈ W = {W_1, ..., W_n} and their co-occurrence probabilities P(x, t) with the given web image x. Rather than indexing the visually-similar web images in the same cluster by loosely using all these auxiliary text terms, a novel term ranking algorithm is developed to align the web images with their most relevant auxiliary terms according to the relevance scores between the image semantics and the auxiliary terms. Our image-text alignment algorithm takes the following major steps: (a) the initial relevance scores for the auxiliary text terms are calculated from term-image co-occurrences; (b) the relevance scores are then refined according to the inter-term cross-modal similarity contexts (i.e., the term correlation network); and (c) the most relevant text terms (the top-k auxiliary terms) are automatically selected for image semantics description according to their relevance scores. Our innovation lies in integrating the term correlation network and a random walk for automatic relevance score refinement.
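The term extraction step above might look as follows; this is a sketch assuming NLTK's standard English stop-word list and tokenizer, with the co-occurrence bookkeeping simplified to raw term counts:

import nltk
from collections import Counter
from nltk.corpus import stopwords

def extract_terms(text_block):
    # Tokenize a text block and keep lower-case content words, removing
    # stop words such as "the", "to", and "also". Requires the NLTK
    # "punkt" and "stopwords" data packages.
    stop = set(stopwords.words("english"))
    tokens = nltk.word_tokenize(text_block.lower())
    return [t for t in tokens if t.isalpha() and t not in stop]

def term_counts(text_blocks):
    # Aggregate term frequencies over an image's text blocks; normalizing
    # these yields the co-occurrence probabilities P(x, t).
    counts = Counter()
    for block in text_blocks:
        counts.update(extract_terms(block))
    return counts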

4.1 Term-Image Relevance Estimation

A probabilistic approach is developed to estimate the initial relevance scores for the auxiliary text terms. Given a text term t for an image cluster C, its relevance score ρ(C, t) with the image cluster C is defined as:

    \rho(C, t) = \frac{\sum_{x \in \Theta(t)} P(x, t)}{\sum_{y \in \Theta} \sum_{r \in W} P(y, r)},    (2)

where x is one particular web image in the cluster C, P(x, t) indicates the co-occurrence of the given text term t ∈ W with the image x ∈ C, W is the entire set of auxiliary terms for the image cluster C, Θ(t) ⊆ Θ is the subset of images in the cluster C which co-occur with the given text term t, and Θ denotes the whole set of images in the cluster C.
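A sketch of Eq. (2), assuming the per-image co-occurrence values P(x, t) for one cluster are stored in a dictionary keyed by (image, term) pairs:

from collections import defaultdict

def initial_relevance(cooccur):
    # cooccur: {(image, term): P(x, t)} for one image cluster C.
    # Denominator of Eq. (2): sum over all images y and all terms r.
    total = sum(cooccur.values())
    per_term = defaultdict(float)
    for (image, term), p in cooccur.items():
        per_term[term] += p  # numerator: sum over images co-occurring with t
    return {t: s / total for t, s in per_term.items()}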

4.2 Term Correlation Network

Figure 2: Image clustering results and their auxiliary text terms.

When people construct their cross-media web pages, they may use multiple text terms with similar meanings interchangeably to describe the semantics of the relevant images. On the other hand, some text terms may have multiple senses under different contexts. Thus the auxiliary text terms are strongly inter-related, and such inter-related text terms and their relevance scores should be considered jointly. Based on this observation, a term correlation network is automatically generated to characterize these inter-term similarity contexts precisely and to provide a good environment for refining the relevance scores. In the term correlation network, each node represents a term and an edge indicates a pairwise term correlation. The inter-term correlations are characterized by (1) inter-term co-occurrence correlations and (2) inter-term semantic similarity contexts. For two given text terms t_i and t_j, their semantic similarity context γ(t_i, t_j) is defined as:

    \gamma(t_i, t_j) = P(t_i, t_j) \cdot \log \frac{L(t_i, t_j)}{2 \cdot D}    (3)

where P(t_i, t_j) is the co-occurrence probability of the given text terms t_i and t_j, L(t_i, t_j) is the number of nodes between t_i and t_j on WordNet [18], and D is the maximum number of nodes from the root node to a leaf node on WordNet. The co-occurrence correlation β(t_i, t_j) between two text terms t_i and t_j is defined as:

    \beta(t_i, t_j) = -P(t_i, t_j) \log \frac{P(t_i, t_j)}{P(t_i) + P(t_j)},    (4)

where P(t_i, t_j) is the co-occurrence probability of the two text terms t_i and t_j, and P(t_i) and P(t_j) are the occurrence probabilities of t_i and t_j. The cross-modal inter-term correlation between t_i and t_j can then be defined as:

    \phi(t_i, t_j) = \alpha \cdot \gamma(t_i, t_j) + (1 - \alpha) \cdot \beta(t_i, t_j),    (5)

where α is a weighting factor determined through cross-validation. Combining these cross-modal inter-term correlations provides a powerful framework for re-ranking the relevance scores between the web images and their auxiliary terms. The term correlation network for our web-image collection is shown in Fig. 3, where each text term is linked to its most relevant text terms, i.e., those with the largest values of φ(·, ·).
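The following sketch combines Eqs. (3)-(5). It assumes NLTK's WordNet interface for the path length L(t_i, t_j), a fixed stand-in value for the maximum depth D, and an illustrative alpha (the paper determines alpha by cross-validation):

import math
from itertools import product
from nltk.corpus import wordnet as wn

D_MAX = 20  # stand-in for D, the maximum root-to-leaf depth (assumption)

def path_length(ti, tj):
    # Approximate L(t_i, t_j): shortest WordNet path over all synset pairs;
    # falls back to 2 * D_MAX when no path exists.
    dists = [s1.shortest_path_distance(s2)
             for s1, s2 in product(wn.synsets(ti), wn.synsets(tj))]
    dists = [d for d in dists if d is not None]
    return min(dists) + 1 if dists else 2 * D_MAX

def phi(ti, tj, p_ij, p_i, p_j, alpha=0.5):
    # Eqs. (3)-(5): cross-modal inter-term correlation; alpha=0.5 is an
    # illustrative default, not the paper's cross-validated value.
    if p_ij <= 0:
        return 0.0
    gamma = p_ij * math.log(path_length(ti, tj) / (2.0 * D_MAX))   # Eq. (3)
    beta = -p_ij * math.log(p_ij / (p_i + p_j))                    # Eq. (4)
    return alpha * gamma + (1 - alpha) * beta                      # Eq. (5)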

Figure 3: The visualization of the term correlation network.

By characterizing the inter-term correlations precisely, our term correlation network provides a good environment for addressing the issues of polysemy and synonymy effectively and for disambiguating image senses accurately, which may allow one to find more suitable text terms for web image annotation.

4.3 Random Walk for Relevance Refinement

In order to leverage our term correlation network to achieve more precise alignment between the web images and their auxiliary text terms, a random walk process is performed for automatic relevance score refinement [14-16]. Given our term correlation network with the n most significant text terms, we use ρ_k(t) to denote the relevance score for the text term t at the k-th iteration. The relevance scores for all the text terms in the network at the k-th iteration form a column vector \vec{\rho}_k \equiv [\rho_k(t)]_{n \times 1}. We further define Φ as an n × n transition matrix whose element φ_{tj} gives the probability of a transition from the term t to an inter-related term j:

    \phi_{tj} = \frac{\phi(t, j)}{\sum_k \phi(t, k)},    (6)

where φ(t, j) is the pairwise inter-term cross-modal similarity context between t and j as defined in (5).


Figure 4: Image-text alignment: (a) image cluster; (b) ranked text terms before performing random walk; (c) re-ranked text terms after performing random walk.

The random walk process is thus formulated as:

    \rho_k(t) = \theta \sum_{j \in \Omega_t} \rho_{k-1}(j) \, \phi_{tj} + (1 - \theta) \rho(C, t),    (7)

where Ω_t denotes the first-order nearest neighbors of the text term t on the term correlation network, ρ(C, t) is the initial relevance score for the given text term t, and θ is a weight parameter. This random walk process promotes text terms that have many nearest neighbors on the term correlation network, e.g., text terms with closely related visual interpretations of their semantics and high co-occurrence probabilities. It also weakens isolated text terms on the network, e.g., text terms with weak visual correlations and low co-occurrence probabilities with other text terms. The random walk is terminated when the relevance scores converge.

For a given image cluster C, all its auxiliary text terms are re-ranked according to their refined relevance scores. By performing the random walk over the term correlation network, our relevance score refinement algorithm leverages both the co-occurrence similarity and the visual similarity simultaneously to re-rank the auxiliary text terms more precisely. The top-k auxiliary text terms with the highest relevance scores are then selected as the keywords to annotate the web images in the given image cluster C. This image-text alignment process provides a better understanding of cross-media web documents (images and text documents), as it couples different sources of information and allows us to resolve ambiguities that may arise from single-media analysis. Some experimental results for re-ranking the text terms are given in Fig. 4. One can observe that our image-text alignment algorithm can effectively find the most relevant keywords for automatic image annotation.
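A sketch of the refinement iteration of Eqs. (6)-(7), assuming the pairwise correlations φ(i, j) are given as a matrix and rho0 holds the initial scores of Eq. (2); theta and the convergence tolerance are illustrative values, as the paper does not report them:

import numpy as np

def random_walk_refine(phi_mat, rho0, theta=0.8, tol=1e-6, max_iter=100):
    # Row-normalize the correlations into the transition matrix (Eq. 6);
    # all-zero rows are left as zeros.
    phi_mat = np.asarray(phi_mat, dtype=float)
    row_sums = phi_mat.sum(axis=1, keepdims=True)
    trans = np.divide(phi_mat, row_sums,
                      out=np.zeros_like(phi_mat), where=row_sums > 0)
    rho = rho0.copy()
    for _ in range(max_iter):
        # Eq. (7): propagate neighbor scores, anchored to the initial ones.
        rho_new = theta * trans.dot(rho) + (1 - theta) * rho0
        if np.abs(rho_new - rho).max() < tol:  # terminate on convergence
            return rho_new
        rho = rho_new
    return rho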

5. ALGORITHM EVALUATION

All our experiments are conducted on an image-text parallel set comprising around 500,000 web pages and 5,000,000 informative images. Our algorithm evaluation focuses on: (a) assessing the effectiveness of image clustering and the random walk within our text-image alignment algorithm; and (b) comparing the performance of our image-text alignment algorithm against other well-accepted approaches. The accuracy rate ϱ is used to assess the effectiveness of the algorithms for image-text alignment:

    \varrho = \frac{\sum_{i=1}^{N} \delta(L_i, R_i)}{N},    (8)

where N is the total number of web images, L_i is the set of the most relevant text terms for the i-th web image obtained by an automatic image-text alignment algorithm, and R_i is the set of keywords for the i-th web image given by a benchmark image set. δ(x, y) is a delta function:

    \delta(x, y) = \begin{cases} 1, & x = y, \\ 0, & \text{otherwise}. \end{cases}    (9)

It is hard to obtain a suitably large benchmark image set for our evaluation task. To address this problem, an interactive image navigation system is designed to allow users to assess the relevance between the images and the ranked text terms.

To assess the effectiveness of image clustering and the random walk for text-image alignment, we have compared the accuracy rates of our text-image alignment algorithm under three scenarios: (a) image clustering is not performed to reduce the uncertainty of the relatedness between the web images and their auxiliary text terms; (b) the random walk is not performed for relevance re-ranking; and (c) both image clustering and the random walk are performed. As shown in Fig. 5, incorporating image clustering for uncertainty reduction and the random walk for relevance re-ranking significantly boosts the accuracy rates of image-text alignment.

For the same web image indexing task, we have compared the performance of three approaches to image-text alignment: our approach versus Berg's approach [2] and the cross-media relevance model (CMRM) proposed by Feng et al. and Lavrenko et al. [5-6]. As shown in Fig. 6, our image-text alignment approach significantly improves the accuracy of identifying the most relevant keywords for image semantics description and web image indexing. For some auxiliary text terms shown in Fig. 6, both Berg's approach and CMRM yield almost zero accuracy.

The performance gain of our image-text alignment algorithm comes from three components: (1) image clustering reduces the visual ambiguity between the web images and provides a good environment for reducing the uncertainty of the relatedness between the web images and their auxiliary text terms; (2) the term correlation network integrates both the visual similarity contexts and the semantic (co-occurrence) similarity contexts to tackle the issues of polysemy and synonymy more effectively; and (3) the random walk over the term correlation network achieves more precise alignment between the web images and their auxiliary text terms.
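For completeness, a minimal sketch of the accuracy rate of Eqs. (8)-(9), treating δ as set equality between the predicted and benchmark keyword sets:

def alignment_accuracy(predicted, benchmark):
    # predicted, benchmark: lists of keyword sets, one per web image.
    # Eq. (9): delta(L_i, R_i) = 1 when the two sets match exactly.
    hits = sum(1 for L_i, R_i in zip(predicted, benchmark)
               if set(L_i) == set(R_i))
    return hits / float(len(predicted))  # Eq. (8)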

Figure 5: Text-image alignment accuracy rate, where top 40 images are evaluated interactively. Average accuracy: Without Clustering = 0.5828, Without Random Walk = 0.6936, Integration = 0.7373.

Figure 6: Comparison on text-image alignment accuracy rates, where top 20 images are evaluated interactively. Average accuracy: Berg’s = 0.2771, Relevance Model = 0.3286, Our Method = 0.8400.

6. CONCLUSIONS

By extracting the most relevant keywords from auxiliary text documents and using them for automatic image annotation, our proposed research on image-text alignment offers three potential applications: (a) enabling more effective keyword-based web image retrieval with higher precision by finding more suitable keywords to index web images; (b) creating more representative image sets for training large numbers of object and concept classifiers more accurately, which is a long-term goal of the multimedia research community; and (c) achieving cross-media alignment automatically and generating a parallel cross-media corpus (image-text pairs), which may provide many opportunities for future multimedia research such as word sense disambiguation and image sense disambiguation.

7. REFERENCES

[1] S. Satoh, Y. Nakamura, T. Kanade, "Name-It: Naming and detecting faces in news videos", IEEE MultiMedia, vol. 6, no. 1, pp. 22-35, 1999.
[2] T.L. Berg, A.C. Berg, J. Edwards, D.A. Forsyth, "Who's in the picture", NIPS, 2004.
[3] M.R. Boutell, J. Luo, X. Shen, C.M. Brown, "Learning multi-label scene classification", Pattern Recognition, vol. 37, no. 9, pp. 1757-1771, 2004.
[4] J. Fan, Y. Gao, H. Luo, "Multi-level annotation of natural scenes using dominant image components and semantic image concepts", ACM Multimedia, 2004.
[5] S. Feng, V. Lavrenko, R. Manmatha, "Multiple Bernoulli relevance models for image and video annotation", ACM SIGIR, 2004.
[6] V. Lavrenko, R. Manmatha, J. Jeon, "A model for learning the semantics of pictures", NIPS, 2003.
[7] R. Fergus, L. Fei-Fei, P. Perona, A. Zisserman, "Learning object categories from Google's image search", IEEE CVPR, 2006.
[8] D. Cai, X. He, Z. Li, W.-Y. Ma, J.-R. Wen, "Hierarchical clustering of WWW image search results using visual, textual, and link information", ACM Multimedia, 2004.
[9] X.-J. Wang, W.-Y. Ma, G.-R. Xue, X. Li, "Multi-modal similarity propagation and its application for web image retrieval", ACM Multimedia, 2004.
[10] B. Gao, T.-Y. Liu, T. Qin, X. Zhang, Q.-S. Cheng, W.-Y. Ma, "Web image clustering by consistent utilization of visual features and surrounding texts", ACM Multimedia, 2005.
[11] Y. Gao, J. Peng, H. Luo, D. Keim, J. Fan, "An interactive approach for filtering out junk images from keyword-based Google search results", IEEE Trans. on CSVT, vol. 19, 2009.
[12] B. Frey, D. Dueck, "Clustering by passing messages between data points", Science, vol. 315, pp. 972-976, 2007.
[13] J. Fan, Y. Shen, N. Zhou, Y. Gao, "Harvesting large-scale weakly-tagged image databases from the Web", IEEE CVPR, 2010.
[14] D. Liu, X.-S. Hua, L. Yang, M. Wang, H.-J. Zhang, "Tag ranking", WWW, 2009.
[15] W. Hsu, L. Kennedy, S.F. Chang, "Video search reranking through random walk over document-level context graph", ACM Multimedia, 2007.
[16] Y. Liu, T. Mei, X.-S. Hua, "CrowdReranking: Exploring multiple search engines for visual search reranking", ACM SIGIR, 2009.
[17] S. Bird, "NLTK: The natural language toolkit", ACL, 2006.
[18] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, Boston, MA, 1998.
[19] G. Qiu, "Indexing chromatic and achromatic patterns for content-based colour image retrieval", Pattern Recognition, vol. 35, pp. 1675-1685, 2002.
[20] W. Hsu, L. Kennedy, S.F. Chang, "Video search reranking via information bottleneck principle", ACM Multimedia, 2006.
[21] Y. Jing, S. Baluja, "VisualRank: Applying PageRank to large-scale image search", IEEE Trans. on PAMI, vol. 20, no. 1, 2008.
