Data-Driven Approach for Bridging the Cognitive Gap in Image Retrieval

Xin-Jing Wang*†, Wei-Ying Ma†, Xing Li*
†Microsoft Research Asia; *Tsinghua University, China

Abstract

Bridging the cognitive gap in image retrieval has been an active research direction in recent years. Existing solutions typically require a large volume of training data that can be difficult to obtain in practice. In this paper, we propose a data-driven approach that uses Web images and their surrounding textual annotations as the source of training data to bridge the cognitive gap. We construct an image thesaurus that contains a set of codewords, each representing a semantically related subspace in the feature space. We also explore the use of query expansion based on the constructed image thesaurus to improve image retrieval performance.

1. Introduction

One recent research focus in content-based image retrieval is how to bridge the gap between low-level visual features and high-level user concepts. Works along this direction include image auto-annotation [4], annotation propagation [3], region-based methods [9], and learning-based methods [8]. A key challenge in bridging the cognitive gap is obtaining enough training data to learn the mapping functions from low-level feature spaces to high-level semantics. Today some commercial search engines claim to index 500M images on the Web. As Web images are typically surrounded by abundant textual annotations, they can be considered a labeled data set. In this paper we propose to use Web images as training data to bridge the cognitive gap. Although one may argue that the annotations for Web images are noisy and do not necessarily reflect the concepts in the images, we hope that through a data-driven approach, useful knowledge can be extracted from this freely available data set. Many previous works have discussed possible uses of such annotations [5, 7]. For example, [5] organizes pictures in a semantic structure by learning a joint probability distribution over words and art picture elements, making use of statistical natural language processing and WordNet [1]. In [7], a theory of "visual semantics" provides useful insight into some of the challenges of integrating text indexing with image understanding algorithms.

In this paper, our key idea is to construct an image thesaurus to represent the knowledge extracted from the Web. The image thesaurus contains two parts. One is a codebook that is trained to partition the feature space into subspaces, each corresponding to a semantically related concept. The other is a correlation matrix that indicates how often two given concepts co-occur in the same image. With this correlation matrix, we can also perform query expansion to improve image retrieval performance.

2. Using the Web to Build an Image Thesaurus

The annotations for Web images come from many sources such as surrounding text, file names, the ALT tag, etc. If we can extract the right keywords and associate them with the corresponding regions in the images, we will be able to construct an image thesaurus that serves as a vehicle for bridging the gap between low-level features and high-level semantics in image retrieval. The key technologies for building such an image thesaurus are discussed in the following.

2.1. Key Term Extraction

We use a vision-based web page analysis technique [2] to extract the text surrounding each image. The HTML tag in which each term appears is used to assign the term a weight. The basic idea is to give a lower weight to a term that occurs more frequently in less important tags (similar to term frequency (TF) weighting in information retrieval). A list of candidate terms with their weights in descending order is constructed, as shown in Figure 1.
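To make the weighting scheme concrete, the following is a minimal sketch of tag-weighted term scoring. The tag weights and the tokenizer are illustrative assumptions, not the exact values used in the paper.

```python
import re
from collections import defaultdict

# Hypothetical importance weights per HTML tag; the paper does not
# publish its exact values, so these are illustrative assumptions.
TAG_WEIGHTS = {"title": 5.0, "h1": 4.0, "alt": 3.0, "b": 2.0, "body": 1.0}

def score_terms(tagged_texts):
    """tagged_texts: list of (tag, text) pairs extracted around an image.
    Returns candidate terms sorted by descending weight."""
    scores = defaultdict(float)
    for tag, text in tagged_texts:
        weight = TAG_WEIGHTS.get(tag, 1.0)
        for term in re.findall(r"[a-z]+", text.lower()):
            scores[term] += weight  # TF-like accumulation, scaled by tag importance
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Example: a term in the title gets 5x the weight of the same term in body text.
print(score_terms([("title", "Coyote in the grass"), ("body", "photo of grass")]))
```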

Figure 1. Terms extracted for a Web image and its attention map

Figure 2. Hypernyms for "coyote" from WordNet

To filter out noisy terms, we use WordNet [1] to keep only nouns, and we retain only the first (most common) WordNet sense of each term. Furthermore, we can obtain the hypernyms of a word (the IS-A relation) to build a hierarchical codebook for image retrieval. A part of the hypernym tree of the word "coyote" is shown in Figure 2. We also keep the synonyms of each term to enable more advanced matching. In addition, we employ heuristic rules to weight the terms differently. For example, using the hypernym tree structure, a more specific term, e.g. "coyote", is given a higher weight than a more general term, e.g. "mammal". The term with the highest weight is selected as the key term for the image.
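As an illustration, the noun filtering and hypernym lookup can be reproduced with NLTK's WordNet interface; this is a sketch of the idea, not the authors' implementation.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

def analyze_term(term):
    """Keep the term only if it has a noun sense in WordNet,
    and collect its hypernym chain and synonyms for the codebook."""
    senses = wn.synsets(term, pos=wn.NOUN)
    if not senses:
        return None  # not a noun in WordNet: treated as noise and dropped
    sense1 = senses[0]  # sense 1 only, as in the paper
    hypernyms = []
    node = sense1
    while node.hypernyms():  # walk up the IS-A hierarchy
        node = node.hypernyms()[0]
        hypernyms.append(node.name().split('.')[0])
    return {
        "term": term,
        "synonyms": [l.name() for l in sense1.lemmas()],
        "hypernyms": hypernyms,  # e.g. coyote -> canine -> carnivore -> ...
        # Deeper terms are more specific, so depth can drive the heuristic weight.
        "specificity": sense1.min_depth(),
    }

print(analyze_term("coyote"))
```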

2.2. Key Image Region Extraction

The next step is to associate the key terms with their corresponding regions in the images, so that we can relate high-level semantics to low-level features and thus bridge the cognitive gap. To solve this problem, we first segment each image into homogeneous regions using the JSEG algorithm [10]. Since the resulting regions are not yet at the semantic or object level, we further use the image attention model [12] to help identify the most important region in an image. We order the regions by their attention values. An example of an image attention map is shown in Figure 1. As can be seen, it helps us separate the "coyote" region from the "grass" background. The most salient region is selected as the key region and associated with the key term. In this way, we obtain a large collection of key regions and associated key terms, where the key terms are very likely to be the semantic annotations of the regions.
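The selection step can be sketched as follows: given a segmentation label map and a per-pixel attention map, rank regions by their mean attention and keep the top one. The attention map itself would come from the contrast-based model of [12]; here it is simply an input array, and ranking by mean attention is an assumption.

```python
import numpy as np

def rank_regions_by_attention(labels, attention):
    """labels: HxW array of region ids from segmentation (e.g. JSEG).
    attention: HxW array of per-pixel attention values in [0, 1].
    Returns region ids ordered by mean attention, most salient first."""
    region_ids = np.unique(labels)
    scores = [(rid, attention[labels == rid].mean()) for rid in region_ids]
    scores.sort(key=lambda s: s[1], reverse=True)
    return [rid for rid, _ in scores]

# Toy example: region 1 (the "coyote") is more salient than region 0 (the "grass").
labels = np.array([[0, 0, 1], [0, 1, 1]])
attention = np.array([[0.1, 0.2, 0.9], [0.1, 0.8, 0.9]])
print(rank_regions_by_attention(labels, attention))  # -> [1, 0]
```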

2.3. Image Thesaurus Construction

The constructed image thesaurus is shown in Figure 3. It contains two parts: the codebook and the correlation matrix. The codebook contains codewords as its leaf nodes. Some of them have semantic meanings and are interrelated in a hierarchical manner. These codewords, denoted semantic-level codewords, are trained using key regions, as their semantic meanings (i.e. the associated key terms) are known and their hierarchical relationships can be obtained from WordNet. The codewords in the flat structure are trained using the remaining regions that have no mapped key terms. As there is no semantic meaning associated with these codewords, we call them low-level codewords.

For every image region in our database, we extract a set of low-level color and texture features to represent it. The image regions with the same or similar key terms (based on synonyms) are grouped together, and the centroid of their feature vectors is used to represent the (semantic-level) codeword. The regions without semantics are clustered using the K-means algorithm, and similarly we use the centroid of each cluster in the feature space to represent the corresponding (low-level) codeword.

The next step is to learn the correlation matrix. There are three kinds of correlation: 1) between semantic-level and low-level codewords, 2) between low-level codewords, and 3) between semantic-level codewords. The correlation is calculated based on how frequently two regions co-occur in the same image. A conditional probability is used to measure how likely a codeword is to appear in an image given the existence of another codeword:

$$p(c_j \mid c_i) = \frac{p(c_j, c_i)}{p(c_i)} = \frac{\sum_{I_k \in I} f(c_j, c_i \mid I_k)}{\sum_{I_k \in I} f(c_i \mid I_k)}$$

$$f(c_i \mid I_k) = \begin{cases} 1 & c_i \in I_k \\ 0 & c_i \notin I_k \end{cases} \qquad f(c_j, c_i \mid I_k) = \begin{cases} 1 & c_i, c_j \in I_k \\ 0 & \text{otherwise} \end{cases}$$

where $c_j$ and $c_i$ denote the $j$-th and $i$-th codewords, and $I_k$ denotes the $k$-th image in the image set $I$.

Figure 3. The image thesaurus constructed from the Web image data
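A minimal sketch of the correlation-matrix estimation follows, assuming each image has already been reduced to the set of codeword ids its regions map to; the function name is hypothetical.

```python
import numpy as np

def correlation_matrix(images, n_codewords):
    """images: list of sets, each the codeword ids present in one image.
    Returns C where C[i, j] = p(c_j | c_i), estimated by co-occurrence counts."""
    joint = np.zeros((n_codewords, n_codewords))   # sum_k f(c_j, c_i | I_k)
    marginal = np.zeros(n_codewords)               # sum_k f(c_i | I_k)
    for codewords in images:
        for ci in codewords:
            marginal[ci] += 1
            for cj in codewords:
                joint[ci, cj] += 1
    # p(c_j | c_i) = joint / marginal; rows for unseen codewords stay zero.
    with np.errstate(divide="ignore", invalid="ignore"):
        cond = np.where(marginal[:, None] > 0, joint / marginal[:, None], 0.0)
    return cond

# Toy example: codewords 0 and 1 co-occur in 2 of 3 images,
# so p(c_1 | c_0) = 1.0 while p(c_0 | c_1) = 2/3.
print(correlation_matrix([{0, 1}, {0, 1}, {1}], 2))
```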

3. Image Retrieval by Query Expansion

To illustrate how the constructed image thesaurus can help improve image retrieval performance, we take query expansion, which is often used in text retrieval, as an example. There are many other possible uses of the thesaurus that are not covered in this paper.

3.1. Representing an Image Using Codewords

For each image in our database, we first perform image segmentation and extract feature vectors from all the image regions. Each region is then mapped to the codeword that has the minimum distance to it in the feature space. After this stage, each image in our database is represented by a set of codewords. A minimal sketch of this mapping is given below.
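The sketch assumes codewords are centroids in the feature space and that the distance is Euclidean (the paper does not specify the metric).

```python
import numpy as np

def image_to_codewords(region_features, codebook):
    """region_features: (R, D) feature vectors of one image's regions.
    codebook: (C, D) matrix of codeword centroids (semantic- and low-level).
    Returns the set of nearest-codeword ids representing the image."""
    # Pairwise Euclidean distances between regions and codeword centroids.
    dists = np.linalg.norm(region_features[:, None, :] - codebook[None, :, :], axis=2)
    return set(dists.argmin(axis=1).tolist())

# Toy example: two regions, three codewords in a 2-D feature space.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
regions = np.array([[0.1, 0.2], [4.8, 5.1]])
print(image_to_codewords(regions, codebook))  # -> {0, 2}
```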

3.2. Query Expansion

With the thesaurus we constructed, we can support both query-by-example and query-by-keyword image retrieval. In the case of query-by-example, we first perform the same analysis as in Section 3.1 on the query image, and then select a key region from the image to expand the query. We augment the query by including highly correlated codewords based on the correlation matrix. We use the Earth Mover's Distance (EMD) [11] to compute the similarity between the query example and the images in our database, as sketched below.
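The following is a minimal sketch of the EMD computation between two region signatures, using the POT (Python Optimal Transport) library as a stand-in for the original implementation of [11]; weighting regions by their relative size is an assumption.

```python
import numpy as np
import ot  # POT: Python Optimal Transport, https://pythonot.github.io/

def emd_similarity(feats_a, sizes_a, feats_b, sizes_b):
    """feats_*: (R, D) region feature vectors; sizes_*: (R,) region weights
    (assumed here to be relative region areas). Returns an EMD-based similarity."""
    wa = np.asarray(sizes_a, dtype=float); wa /= wa.sum()
    wb = np.asarray(sizes_b, dtype=float); wb /= wb.sum()
    M = ot.dist(feats_a, feats_b, metric="euclidean")  # ground distances
    cost = ot.emd2(wa, wb, M)  # minimal transport cost = the EMD
    return 1.0 / (1.0 + cost)  # smaller distance -> higher similarity

a = np.array([[0.0, 0.0], [1.0, 1.0]])
b = np.array([[0.1, 0.0], [0.9, 1.1]])
print(emd_similarity(a, [2, 1], b, [2, 1]))
```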

In the case of query-by-keyword, if the keyword (e.g. "wolf") maps to a semantic-level codeword at a leaf node, then the query contains the mapped semantic-level codeword and a set of highly correlated codewords selected using the correlation matrix. If the keyword is a concept (e.g. "mammal") that maps to an intermediate node in the semantic hierarchy, then the query contains the semantic-level codewords that are children of that intermediate node. Note that these codewords are used as "OR" queries to retrieve images. Similarly, we can also expand each of these codewords by adding the highly correlated ones. The similarity measure used in query-by-keyword search is the Jaccard coefficient [6]. Let $A$ denote the set of codewords of image $I_i$ in the database and $B$ the set of codewords of a query $Q_j$, which is either the single query or one of the "OR" queries. The similarity measure is defined as:

$$Sim(I_i, Q_j) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

where $|A \cap B|$ is the number of codewords common to $A$ and $B$, and $|A \cup B|$ is the total number of distinct codewords in $A$ and $B$. The similarity between an image and the query is

$$Sim(I_i, Q) = \max_{j \in [1, |Q|]} Sim(I_i, Q_j)$$

where $Q$ is the set of "OR" queries. In the single-query case, $|Q| = 1$.
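A minimal sketch of the Jaccard scoring over "OR" queries, directly following the two formulas above:

```python
def jaccard(a, b):
    """Jaccard coefficient between two sets of codeword ids."""
    if not a and not b:
        return 0.0
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)  # |A∩B| / |A∪B|

def query_similarity(image_codewords, or_queries):
    """Sim(I, Q) = max over the 'OR' queries; a single query is |Q| = 1."""
    return max(jaccard(image_codewords, q) for q in or_queries)

# Toy example: the image matches the second "OR" query best.
image = {1, 2, 3}
print(query_similarity(image, [{4, 5}, {2, 3, 6}]))  # -> 0.5
```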

4. Experiments

We crawled 17,123 images from the Web, of which 10,051 had their key terms successfully identified. These images cover animals, human beings, scenes, advertisement posters, books, sweaters, etc. The visual feature extracted from each image is a combination of color moments, a correlogram, and wavelet texture features, resulting in a 171-dimensional vector. From these images, we constructed a codebook with 4,750 semantic-level codewords and 500 low-level codewords clustered by the K-means algorithm.

4.1. Performance of Key Term Extraction

To evaluate the performance of our key term extraction for Web images, we randomly selected 20 query words to search images in our database. The retrieval results (precision@10) are given in Figure 4. As can be seen, the performance is generally satisfactory.

Figure 4. Precision@10 for image retrieval based on key term extraction. The 20 queries are: lens, mammals, bird, insect, wolf, shell, spider, lizard, hawk, hummingbird, grass, news, butterfly, owl, snake, fly, flower, branch, bat, book.

4.2. Image Retrieval

To evaluate how the learned image thesaurus can improve image retrieval performance, we selected 10,000 images from the Corel Stock Photo Library as our testing data set. These images do not overlap with our training images obtained from the Web, but they cover similar high-level concepts, with hundreds of outliers. After image segmentation and feature extraction, these 10,000 Corel images were indexed using our image thesaurus, with each image region represented by a codeword. In the case of query-by-keyword, the key term submitted by a user is first matched to a semantic-level codeword, and the feature of that codeword together with the features of the correlated codewords is then used to form a content-based query to search the images in the database. Figure 5 shows the result of the query "wolf".

The images in red boxes are correct hits. Note that this example demonstrates the capability of indexing images using high-level concepts. Figure 6 shows the result of the query "bird", an example in which the query maps to an intermediate node in the semantic hierarchy. In this case, all semantic-level codewords at the leaf nodes whose parent is "bird" are used to form the query set. The sketch below illustrates this keyword-to-query expansion.
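The sketch assumes a hierarchy mapping concepts to child codewords and the correlation matrix from Section 2.3; all names, the expansion size, and the threshold are hypothetical.

```python
def expand_keyword(keyword, leaf_codeword, hierarchy, corr, top_k=3, threshold=0.2):
    """Build the set of 'OR' queries for a keyword.
    leaf_codeword: dict term -> codeword id for leaf-level terms.
    hierarchy: dict concept -> list of child codeword ids (from WordNet).
    corr: correlation matrix with corr[i][j] = p(c_j | c_i)."""
    if keyword in leaf_codeword:           # e.g. "wolf": one seed codeword
        seeds = [leaf_codeword[keyword]]
    elif keyword in hierarchy:             # e.g. "bird": all child codewords
        seeds = list(hierarchy[keyword])
    else:
        return []
    queries = []
    for c in seeds:
        # Expand each seed with its most correlated codewords.
        related = sorted(range(len(corr[c])), key=lambda j: corr[c][j], reverse=True)
        expansion = [j for j in related if j != c and corr[c][j] >= threshold][:top_k]
        queries.append({c, *expansion})    # one "OR" query per seed
    return queries

# Toy example: "wolf" is codeword 0, and codeword 1 often co-occurs with it.
corr = [[1.0, 0.6, 0.1], [0.5, 1.0, 0.0], [0.1, 0.0, 1.0]]
print(expand_keyword("wolf", {"wolf": 0}, {"bird": [1, 2]}, corr))  # -> [{0, 1}]
```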

We randomly selected 100 queries related to the testing data set to evaluate the use of query expansion in image retrieval. The traditional precision-scope measure is used for performance evaluation. Figure 7 shows the comparison between the baseline and our query expansion method.

5. Summary

In this paper, we presented the idea of using Web images as training data to create an image thesaurus that helps bridge the cognitive gap in image retrieval. Query expansion was also introduced to take advantage of the correlation information in the thesaurus to further improve image retrieval performance. The hyperlinks between Web images are valuable information for learning the image thesaurus. We believe that by leveraging link information and combining it with WordNet, we can further improve the performance of this work. We plan to investigate this direction in future work.

Figure 5. Retrieval result of query “wolf”

Figure 6. Retrieval result of query “bird”

Figure 7. Performance Evaluation for Query Expansion

6. References

[1] C. Fellbaum, WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998.
[2] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, "VIPS: A Vision-based Page Segmentation Algorithm", Microsoft Technical Report MSR-TR-2003-79, 2003.
[3] H.J. Zhang and Z. Su, "Improving CBIR by Semantic Propagation and Cross-Mode Query Expansion", Proc. MultiMedia Content Based Indexing and Retrieval, 2001.
[4] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D.M. Blei, and M. Jordan, "Matching Words and Pictures", Journal of Machine Learning Research, 3, 2003, pp. 1107-1135.
[5] K. Barnard, P. Duygulu, and D. Forsyth, "Clustering Art", Proc. CVPR, 2001, pp. II:434-439.
[6] P. Sneath and R. Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification, W.H. Freeman, San Francisco, 1973.
[7] R.K. Srihari, "Use of Multimedia Input in Automated Image Annotation and Content-Based Retrieval", Proc. SPIE '95, San Jose, CA, Feb. 1995.
[8] S. Tong and E. Chang, "Support Vector Machine Active Learning for Image Retrieval", Proc. ACM Multimedia, Ontario, Canada, 2001.
[9] W.-Y. Ma and B.S. Manjunath, "NeTra: A Toolbox for Navigating Large Image Databases", Proc. IEEE ICIP, 1997.
[10] Y. Deng and B.S. Manjunath, "Unsupervised Segmentation of Color-Texture Regions in Images and Video", IEEE Trans. on PAMI, 23(8), 2001, pp. 800-810.
[11] Y. Rubner, L.J. Guibas, and C. Tomasi, "The Earth Mover's Distance, Multi-Dimensional Scaling, and Color-based Image Retrieval", Proc. ARPA Image Understanding Workshop, New Orleans, LA, May 1997, pp. 661-668.
[12] Y.-F. Ma and H.J. Zhang, "Contrast-based Image Attention Analysis by Using Fuzzy Growing", Proc. ACM Multimedia, 2003.
