IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 14, NO. 4, AUGUST 2012

1057

Tag-Based Image Retrieval Improved by Augmented Features and Group-Based Refinement

Lin Chen, Dong Xu, Ivor W. Tsang, and Jiebo Luo

Abstract—In this paper, we propose a new tag-based image retrieval framework to improve the retrieval performance of a group of related personal images captured by the same user within a short period of an event, by leveraging millions of training web images and their associated rich textual descriptions. For any given query tag (e.g., "car"), the inverted file method is employed to automatically determine the relevant training web images that are associated with the query tag and the irrelevant training web images that are not associated with it. Using these relevant and irrelevant web images as positive and negative training data, respectively, we propose a new classification method called support vector machine (SVM) with augmented features (AFSVM) to learn an adapted classifier by leveraging the prelearned SVM classifiers of popular tags, which are associated with a large number of relevant training web images. Treating the decision values of one group of test photos from the AFSVM classifiers as the initial relevance scores, the subsequent group-based refinement process uses the Laplacian regularized least squares method to further refine the relevance scores of test photos by utilizing the visual similarity of the images within the group. Based on the refined relevance scores, our proposed framework can be readily applied to tag-based image retrieval for a group of raw consumer photos without any textual descriptions or a group of Flickr photos with noisy tags. Moreover, we propose a new method to better calculate the relevance scores for Flickr photos. Extensive experiments on two datasets demonstrate the effectiveness of our framework.

Index Terms—Group-based refinement, Laplacian regularized least squares (LapRLS), support vector machine (SVM) with augmented features (AFSVM), tag-based image retrieval.

I. INTRODUCTION

WITH the rapid adoption of digital cameras, we have witnessed an explosive growth of digital photo albums. Every day, a large number of personal photos captured by consumers, together with rich contextual information (e.g., tags, categories, captions), are uploaded to photo sharing websites like Flickr and photo forums (e.g., Photosig.com and Photo.net). There is an increasing interest in developing new methods to help users retrieve their personal photos, including the raw (unlabeled) consumer photos that are not associated with any textual descriptions and Flickr photos that may be associated with noisy tags.

Manuscript received May 31, 2011; revised October 27, 2011; accepted January 11, 2012. Date of publication February 10, 2012; date of current version July 13, 2012. This work was supported by the Singapore National Research Foundation under its Interactive and Digital Media (IDM) Public Sector R&D Funding Initiative and administered by the IDM Programme Office. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Daniel Gatica-Perez.
L. Chen, D. Xu, and I. W. Tsang are with the School of Computer Engineering, Nanyang Technological University, 639798 Singapore (e-mail: [email protected]; [email protected]; [email protected]).
J. Luo is with the Department of Computer Science, University of Rochester, Rochester, NY 14627 USA (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2012.2187435

Over the past decades, a large number of content-based image retrieval (CBIR) technologies (see the recent survey in [1]) have been developed to help users retrieve the desirable database photos using the query-by-example framework. In these systems, a user first needs to provide example images as queries. Then, the database images are ranked based on the visual similarities between the query images and the database images. Due to the so-called semantic gap between the low-level visual features (e.g., color, texture, shape) and the high-level semantic concepts, the initial retrieval results are frequently unsatisfactory to the users. To bridge the semantic gap, relevance feedback methods have been proposed to learn the user's search intention, and these methods have proven effective in improving the retrieval performance of CBIR systems [1].

We argue that it is more natural and practical for a user to search for the desirable personal photos using textual queries rather than example images. For example, many users currently use the simple textual-query-based interfaces of commercial web image search engines like Google and Microsoft Bing to search for the desirable web photos that are associated with certain semantic textual descriptions (e.g., file name, URL). However, it is a nontrivial task for users to retrieve the desirable personal photos using textual queries. First, raw consumer photos are not associated with semantic textual metadata. Second, even for the personal photos downloaded from Flickr, the associated tags created by Flickr users can be noisy, ambiguous, incomplete, and sometimes overly personalized, which significantly degrades the performance of tag-based retrieval of such photos.
Automatic image tagging (also known as image annotation) methods, which aim to classify images with respect to a set of semantic concepts, can be used as an intermediate stage for tag-based image retrieval, because such high-level semantic concepts are analogous to the textual terms that describe the image content. Image tagging methods can be roughly categorized into learning-based methods and web data-driven methods [2]. In learning-based methods [3]–[6], robust classifiers are learned from the training data and then used to detect whether the concepts are present in any test data. However, the current learning-based methods can only tag a limited number of semantic concepts, because time-consuming and expensive human annotation is required to obtain the concept labels of the training samples. Web data-driven methods instead annotate images by directly exploiting large collections of web data. Torralba et al. [7] used kNN classifiers for image tagging by leveraging 80 million tiny images, each labeled with one noun from WordNet. Two hashing methods [8], [9] were subsequently developed to accelerate the image search process by representing each image with fewer than a few hundred bits. In

1520-9210/$31.00 © 2012 IEEE


Fig. 1. Sample Kodak images in one group.

[10], Wang et al. also employed millions of web images and their associated high-quality descriptions (such as surrounding captions and categories) in photo forums (e.g., Photosig.com) to tag images. An annotation refinement algorithm [11] and a distance metric learning method [12] were also proposed to further improve image tagging performance.

To improve the tag-based image retrieval performance for Flickr photos that are initially associated with noisy tags, researchers have also proposed tag reranking methods that rank the existing tags according to their relevance to the content of the given Flickr image. Li et al. [13] proposed to learn the tag relevance by accumulating the tag votes from visually similar photos. Liu et al. [14] adopted a kernel density estimation (KDE) algorithm to obtain the initial tag relevance estimation, and then employed a random-walk-based method for tag refinement by exploiting the proximities between tags. However, the performance of the work in [13] may significantly degrade when the visually similar images are not semantically relevant to the query image, and the method in [14] cannot utilize the negative training samples. Moreover, neither [13] nor [14] can cope with raw personal photos that are not associated with any tags.

To directly retrieve the desirable raw consumer photos without undergoing any intermediate image annotation process, Liu et al. recently proposed a textual-query-based personal photo retrieval framework that leverages millions of training web images and their associated rich textual descriptions [2]. After the user provides a textual query, the system automatically finds a set of relevant and irrelevant web images, which are used as the training data to learn classifiers (e.g., decision stumps and support vector machine (SVM) classifiers).
The raw consumer photos are then ranked based on the decision values from the learned classifiers, and the retrieval performance can be further improved by using the new relevance feedback methods. Based on the observation that personal photos are generally organized into collections or albums by time, location, events, and activities [15], we propose a new tag-based image retrieval framework to improve the retrieval performance of a group of related personal photos captured within an event. Specifically, our framework can effectively cope with groups of raw photos that are not associated with any textual descriptions (see Fig. 1) and groups of Flickr photos that are associated with noisy tags (see Fig. 2). As in [2], we also make use of the massive and valuable social media data, including web images and the associated rich semantic textual descriptions (e.g., categories,

Fig. 2. Sample Flickr images in one group.

captions, descriptions) from the photo forum Photosig.com1 as the training data. For any given query tag (e.g., "car"), we also employ the inverted file method to automatically find the relevant training web images associated with the query tag and the irrelevant training web images not associated with the query tag. The relevant and irrelevant web images can then be used as the positive and negative training data to learn the classifiers (e.g., the SVM used in [2]). However, we observe that some less popular concepts (e.g., "water") may be associated with only a limited number of positive training web images, which may degrade the classification performance of SVM. Meanwhile, it is well known that some semantic concepts are correlated with others, so it is beneficial to use the classifiers associated with concepts like "river" and "lake" when learning a classifier for the concept "water". Moreover, recent work on cross-domain learning (also known as transfer learning or domain adaptation) [16], [17] has shown that robust classifiers can be learned from only a limited number of labeled samples in the target domain by leveraging prelearned source classifiers. In this work, we use a similar idea to exploit the intercorrelation among concepts. Specifically, we develop a new classification method called SVM with augmented features (AFSVM), which adapts a linear combination of a set of prelearned classifiers of popular tags that are associated with a large number of positive training web images. Interestingly, the solution of AFSVM amounts to retraining an SVM classifier on augmented features, which combine the original features with the decision values obtained from the prelearned SVM classifiers of popular tags.
Based on the assumption that the photos captured by the same user within a short period during a memorable event or activity form a semantically related group, we propose to use the Laplacian regularized least squares (LapRLS) method in the subsequent group-based refinement stage. We treat the decision values of one group of test photos from the AFSVM classifiers as their initial relevance scores, and then apply LapRLS to refine these scores by utilizing the visual similarity of the images within the group. Based on the refined relevance scores, we can directly rank the raw consumer photos or Flickr photos. To better rerank the Flickr photos, we further propose a new method to better calculate the relevance

1http://www.photosig.com/


Fig. 3. Our tag-based image retrieval framework.

score by additionally considering the total number of tags created by the Flickr users. We conduct comprehensive experiments using the Kodak dataset, which contains groups of raw consumer photos, and a new dataset collected from Flickr.com. The experiments demonstrate the effectiveness of our proposed framework, including the AFSVM classification method, the group-based refinement method using LapRLS, and the new relevance score for reranking Flickr photos.

The remainder of the paper is organized as follows. We introduce the tag-based image retrieval framework in Section II, report the experimental results in Section III, and conclude this work in Section IV.

II. OUR FRAMEWORK

In this work, we develop a new tag-based image retrieval (TBIR) framework to improve the retrieval performance of personal images captured by the same user and organized in groups, which can be groups of raw consumer photos not associated with any textual descriptions or groups of Flickr photos associated with noisy tags. Our proposed framework is illustrated in Fig. 3. It consists of four modules: 1) automatic web image collection; 2) initial tag relevance estimation using AFSVM; 3) group-based tag relevance refinement using LapRLS; and 4) tag-based image retrieval. Given any textual query (e.g., "car") from the user, our TBIR system automatically collects a set of relevant and irrelevant web images from a large collection of web photos. Using the relevant and irrelevant web images as positive and negative training data, respectively, we learn an AFSVM classifier to obtain the initial relevance scores. Then, we employ a group-based refinement method to obtain the refined relevance scores. Based on the refined relevance scores, we retrieve the images. In the following subsections, we describe each component of our system in detail.
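Before detailing each module, the end-to-end flow of the four modules can be summarized in a short sketch. Everything here is illustrative and not the authors' code: the three callables (`collect_web_images`, `afsvm_score`, `laprls_refine`) are hypothetical stand-ins for the modules described in the following subsections.

```python
def retrieve(query_tag, groups, collect_web_images, afsvm_score, laprls_refine):
    """Rank the photos of each group for one textual query (illustrative only)."""
    # Module 1: gather positive/negative training web images for the query tag.
    pos, neg = collect_web_images(query_tag)
    ranked = {}
    for gid, photos in groups.items():
        # Module 2: initial relevance from the (stand-in) AFSVM classifier.
        initial = [afsvm_score(p, pos, neg) for p in photos]
        # Module 3: group-based refinement of the initial scores.
        refined = laprls_refine(photos, initial)
        # Module 4: rank by refined relevance score, highest first.
        ranked[gid] = [p for _, p in sorted(zip(refined, photos), reverse=True)]
    return ranked
```

The four modules are deliberately decoupled: only module 1 touches the web training data, and only module 3 needs the whole group at once.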

A. Automatic Web Image Collection

Our system first automatically collects a set of relevant and irrelevant training images from millions of training web images associated with rich surrounding textual descriptions (e.g., tags, categories, captions). Following [2], we directly employ the inverted file method without considering any scoring mechanism. Specifically, we first collect a web image dataset from Photosig.com with millions of web images whose surrounding texts are related to a comprehensive set of daily-life semantic concepts. We then build an inverted file for all the images in the Photosig dataset, which has an entry for each tag in the vocabulary, followed by the list of images that contain the word in their surrounding texts. For any query tag $t$ (e.g., "car") provided by the user, we directly employ the preconstructed inverted file to efficiently retrieve the relevant web images whose surrounding texts contain $t$. Moreover, we randomly choose the same number of irrelevant web images whose surrounding texts contain neither the query tag nor its related words (i.e., the two-level descendants of the query tag in WordNet), as suggested in [2].

B. Initial Tag Relevance Estimation Using AFSVM

Using the relevant and irrelevant training web images as the positive and negative training data, respectively, we develop a new classification method called SVM with augmented features (AFSVM) to predict the initial relevance scores. Let us denote the training web images as $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{+1, -1\}$ is the class label of the sample $\mathbf{x}_i$ and $n$ is the total number of training images. In the prior work [2], Liu et al. directly learned an SVM classifier, which was then employed to obtain the initial decision value for each test (or database) photo. Finally, the test photos were ranked based on their decision values.
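The collection step of Section II-A can be sketched in a few lines. The snippet below is an illustrative toy implementation, not the authors' code: `build_inverted_file` and `collect_training_images` are hypothetical names, the corpus is a tiny stand-in for the Photosig data, and the WordNet two-level-descendant expansion is represented by a precomputed `related` set.

```python
from collections import defaultdict
import random

def build_inverted_file(corpus):
    """corpus: {image_id: set of surrounding-text words} -> {word: set of image_ids}."""
    index = defaultdict(set)
    for img, words in corpus.items():
        for w in words:
            index[w].add(img)
    return index

def collect_training_images(index, corpus, query, related, rng):
    """Relevant images contain the query word; irrelevant ones contain neither
    the query nor its related words (e.g., WordNet two-level descendants)."""
    relevant = sorted(index.get(query, set()))
    forbidden = {query} | set(related)
    candidates = [img for img, words in corpus.items() if not (words & forbidden)]
    # Sample the same number of irrelevant images as relevant ones.
    irrelevant = rng.sample(candidates, min(len(relevant), len(candidates)))
    return relevant, irrelevant
```

Sampling an equal number of negatives keeps the training set balanced, matching the "same number of irrelevant web images" rule in the text.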
We observe that the less popular tags are associated with only a limited number of positive training web images, which may significantly degrade the classification performance of SVM classifiers. We therefore propose a new classification method called AFSVM to learn an adapted classifier using the same


training web data by leveraging the prelearned SVM classifiers of popular concepts, which are associated with a large number of positive web images. Specifically, let us denote the set of popular tags as $\mathcal{P} = \{t_1, \ldots, t_P\}$, where $P$ is the total number of popular tags. In this work, we choose 400 popular tags based on the frequency of images in the Photosig dataset, namely, we have $P = 400$. For each query tag $t$, we learn an AFSVM classifier by leveraging the prelearned SVM classifiers $f_p(\cdot)$'s of the popular tags $t_p$'s. Note that each prelearned classifier is trained using the corresponding training web data from the Photosig dataset. Motivated by [18], we assume that the target classifier is in the following form:

$$f(\mathbf{x}) = \boldsymbol{\beta}^{\top}\mathbf{s}(\mathbf{x}) + \mathbf{w}^{\top}\phi(\mathbf{x}) + b,$$

where $\mathbf{s}(\mathbf{x}) = [f_1(\mathbf{x}), \ldots, f_P(\mathbf{x})]^{\top}$ is the vector of decision values of the prelearned classifiers $f_p(\cdot)$'s, $\boldsymbol{\beta}$ is the weight vector, $\phi(\cdot)$ is the nonlinear feature mapping function, and $\mathbf{w}^{\top}\phi(\mathbf{x}) + b$ is the decision function of standard SVM. The intuitive explanation for the above target classifier is based on the observation that some concepts are semantically correlated. For example, it is beneficial to learn an adapted classifier for the concept "water" by leveraging the prelearned classifiers of popular concepts like "river" and "lake". As shown in our experiments in Section III, AFSVM outperforms SVM in terms of the average retrieval performance over all the query tags.

As in SVM, we also minimize the structural risk functional. After incorporating the target classifier $f(\mathbf{x})$, we propose our new objective function:

$$\min_{\mathbf{w}, \boldsymbol{\beta}, b, \boldsymbol{\xi}} \; \frac{1}{2}\left(\|\mathbf{w}\|^2 + \|\boldsymbol{\beta}\|^2\right) + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i f(\mathbf{x}_i) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \;\; \forall i. \tag{1}$$

Note that, unlike the semi-parametric SVM proposed in [18], here we also penalize the complexity of the weight vector $\boldsymbol{\beta}$ to control the contribution of the prelearned classifiers. After introducing the Lagrangian multipliers $\alpha_i$'s and $\eta_i$'s for the inequality constraints in (1), we arrive at the following Lagrangian:

$$L = \frac{1}{2}\left(\|\mathbf{w}\|^2 + \|\boldsymbol{\beta}\|^2\right) + C\sum_{i=1}^{n}\xi_i - \sum_{i=1}^{n}\alpha_i\left(y_i f(\mathbf{x}_i) - 1 + \xi_i\right) - \sum_{i=1}^{n}\eta_i \xi_i. \tag{2}$$

Setting the derivatives of (2) with respect to the primal variables $\mathbf{w}$, $\boldsymbol{\beta}$, $b$, and $\xi_i$ to zeros, we have

$$\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \phi(\mathbf{x}_i), \quad \boldsymbol{\beta} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{s}(\mathbf{x}_i), \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \quad C - \alpha_i - \eta_i = 0.$$

Substituting these equations back into (2), and defining the nonlinear feature mapping $\tilde{\phi}(\mathbf{x}) = [\phi(\mathbf{x})^{\top}, \mathbf{s}(\mathbf{x})^{\top}]^{\top}$, we arrive at the following dual form:

$$\max_{\boldsymbol{\alpha}} \; \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i\alpha_j y_i y_j \tilde{k}(\mathbf{x}_i, \mathbf{x}_j) \quad \text{s.t.} \quad \sum_{i=1}^{n}\alpha_i y_i = 0, \;\; 0 \le \alpha_i \le C, \;\; \forall i, \tag{3}$$

where $\tilde{k}(\mathbf{x}_i, \mathbf{x}_j) = \tilde{\phi}(\mathbf{x}_i)^{\top}\tilde{\phi}(\mathbf{x}_j) = \phi(\mathbf{x}_i)^{\top}\phi(\mathbf{x}_j) + \mathbf{s}(\mathbf{x}_i)^{\top}\mathbf{s}(\mathbf{x}_j)$ is the induced kernel function.

Algorithm 1: AFSVM
1: Input: Training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ from the automatic web image collection process and a set of prelearned linear SVM classifiers, one for each popular tag in $\mathcal{P}$.
2: Calculate the $P$-dimensional decision value vector $\mathbf{s}(\mathbf{x}_i)$, with each dimension corresponding to the decision value of the $i$-th training sample $\mathbf{x}_i$ from one prelearned linear SVM classifier.
3: Train a new SVM classifier using the RBF kernel with the default bandwidth parameter based on $\{(\tilde{\mathbf{x}}_i, y_i)\}_{i=1}^{n}$, where $\tilde{\mathbf{x}}_i = [\mathbf{x}_i^{\top}, \mathbf{s}(\mathbf{x}_i)^{\top}]^{\top}$ is the augmented feature of the $i$-th training sample.
4: Output: The AFSVM classifier $f(\mathbf{x})$.
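Steps 2 and 3 of Algorithm 1 reduce to a simple feature augmentation, after which any off-the-shelf SVM solver (e.g., LIBSVM with an RBF kernel) can be trained on the augmented vectors. A minimal sketch with hypothetical helper names, where each toy prelearned linear classifier is given as a (weight vector, bias) pair:

```python
def linear_decision_value(w, b, x):
    """Decision value w^T x + b of one prelearned linear SVM classifier."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def augment(x, prelearned):
    """Append the P decision values of the prelearned popular-tag classifiers
    to the original feature vector: x_tilde = [x; f_1(x), ..., f_P(x)]."""
    s = [linear_decision_value(w, b, x) for (w, b) in prelearned]
    return list(x) + s
```

Training the final RBF-kernel SVM on `augment(x, prelearned)` for every training sample is exactly what makes the kernel in (3) the sum of the original-feature kernel and the decision-value kernel.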


It is interesting to observe that the resultant optimization problem in (3) shares a similar form with the dual of SVM, and it can thus be easily solved by SVM solvers such as the one in LIBSVM [19]. It is worth mentioning that (1) and (3) are different from the objective functions of SVM. Specifically, the target decision function $f(\mathbf{x})$ in (1) is an adapted classifier, and the kernel matrix in (3) is calculated based on the augmented features $\tilde{\mathbf{x}}_i$'s. The implementation details are given in Algorithm 1. For computational efficiency, the decision value vector $\mathbf{s}(\mathbf{x}_i)$ of each training image is obtained by using the linear SVM classifiers of popular tags, because the training and testing processes of linear SVM using LIBLINEAR [20] are much faster.

C. Group-Based Tag Relevance Refinement Using LapRLS

The learned AFSVM classifiers can be applied to each individual test photo to obtain its decision value, and the test photos can then be ranked based on their decision values. Considering that the photos in each group are usually semantically correlated, we propose a group-based tag relevance refinement method using Laplacian regularized least squares (LapRLS) to improve the retrieval performance. In LapRLS [21], the labels are propagated from the labeled training data to the unlabeled data based on label fitness (i.e., the predicted labels of labeled data should be consistent with their original labels) and manifold smoothness (i.e., the neighboring samples in a high-density region should share similar decision values). The solution of LapRLS involves a matrix inversion operation, and its complexity is $O((l+u)^3)$, where $l$ and $u$ are the numbers of labeled data and unlabeled data, respectively. Considering that the total number of training data, including the labeled samples from the Photosig dataset and the unlabeled test samples, may be more than 20,000 (see Section III for more details), the direct application of LapRLS in our application is inefficient.

Instead, we propose an efficient approach that refines the initial relevance scores of test photos using LapRLS in a group-by-group fashion, in which the initial relevance scores are propagated among the test photos within each group only. In this work, we convert the decision values of a group of test photos from the SVM/AFSVM classifiers into probabilities using the sigmoid function, and we denote these probabilities as the initial relevance scores (or virtual labels) $\tilde{\mathbf{y}} = [\tilde{y}_1, \ldots, \tilde{y}_m]^{\top}$, where $m$ is the total number of test images within one group. We enforce a constraint that the refined relevance scores $\mathbf{f} = [f_1, \ldots, f_m]^{\top}$ after using LapRLS are close to the virtual labels; namely, we minimize $\|\mathbf{f} - \tilde{\mathbf{y}}\|^2$. Moreover, we enforce another constraint that the refined relevance scores of the visually similar test photos within the group must be similar. We define a similarity matrix $\mathbf{S} \in \mathbb{R}^{m \times m}$ to represent the similarities of the photos in the group. The corresponding Laplacian matrix is denoted as $\mathbf{L} = \mathbf{D} - \mathbf{S}$, where $\mathbf{D}$ is a diagonal matrix with the diagonal elements $D_{ii} = \sum_{j} S_{ij}$. Then, we minimize $\mathbf{f}^{\top}\mathbf{L}\mathbf{f}$. Finally, the objective function of LapRLS is given as follows:

$$\min_{\mathbf{f}} \; \|\mathbf{f} - \tilde{\mathbf{y}}\|^2 + \gamma\, \mathbf{f}^{\top}\mathbf{L}\mathbf{f}, \tag{4}$$

where $\gamma$ is a regularization parameter. According to the Representer Theorem, we assume the solution is constrained in the form of $\mathbf{f} = \mathbf{K}\boldsymbol{\alpha}$, where $\boldsymbol{\alpha}$ is the vector of linear combination coefficients and $\mathbf{K}$ is the kernel matrix of the images in one group. The optimization problem in (4) then becomes

$$\min_{\boldsymbol{\alpha}} \; \|\mathbf{K}\boldsymbol{\alpha} - \tilde{\mathbf{y}}\|^2 + \gamma\, \boldsymbol{\alpha}^{\top}\mathbf{K}\mathbf{L}\mathbf{K}\boldsymbol{\alpha}. \tag{5}$$

Setting the derivative of (5) with respect to $\boldsymbol{\alpha}$ to zero, we arrive at $\boldsymbol{\alpha} = (\mathbf{K} + \gamma\,\mathbf{L}\mathbf{K})^{-1}\tilde{\mathbf{y}}$. Finally, we can directly solve the refined relevance scores as

$$\mathbf{f} = \mathbf{K}\left(\mathbf{K} + \gamma\,\mathbf{L}\mathbf{K}\right)^{-1}\tilde{\mathbf{y}}.$$

Let us assume all the groups have the same number of test photos $m$. The time complexity of our group-based refinement approach using LapRLS is then $O(m^3)$ per group. Considering that $m$ is much less than $l+u$, our group-based refinement process is quite efficient.

D. Tag-Based Image Retrieval

After conducting the group-based tag relevance refinement process in a group-by-group fashion, we obtain the refined relevance scores for all the test photos. Following [2], for each query tag $t$, we can use the refined relevance scores to rank the test images, such that the images with higher refined relevance scores are ranked in the top positions. For the test images initially associated with tags, such as Flickr photos, we propose a new method to better calculate the relevance scores. Specifically, for each query tag $t$, we define a new relevance score for the image $\mathbf{x}$ as

$$r(\mathbf{x}) = \frac{\hat{f}(\mathbf{x})}{|T(\mathbf{x})|}, \tag{6}$$

where $|T(\mathbf{x})|$ is the total number of tags of the image $\mathbf{x}$, and $\hat{f}(\mathbf{x})$ is the refined tag relevance score of the image for AFSVM+LapRLS (or SVM+LapRLS). Note that $\hat{f}(\mathbf{x})$ will be the initial tag relevance score of the image if we use AFSVM (or SVM) for tag-based image retrieval. Intuitively, the image $\mathbf{x}_i$ is ranked in front of the image $\mathbf{x}_j$ when it has a higher relevance score and fewer tags created by Flickr users.
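The group-based refinement and the tag-count-normalized reranking score can be sketched as below. This is a toy implementation under stated assumptions: the kernel matrix `K` and similarity matrix `S` of one group are taken as given inputs, the closed-form solution is computed with a naive Gauss–Jordan solver rather than an optimized inverse, and `rerank_score` assumes the new score simply divides the refined relevance by the image's tag count, matching the stated intuition (higher score, fewer tags).

```python
import math

def sigmoid(v):
    """Map an SVM/AFSVM decision value to a probability-like virtual label."""
    return 1.0 / (1.0 + math.exp(-v))

def laplacian(S):
    """Graph Laplacian L = D - S with D_ii = sum_j S_ij."""
    n = len(S)
    return [[(sum(S[i]) if i == j else 0.0) - S[i][j] for j in range(n)]
            for i in range(n)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for A x = b."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        M[c] = [v / M[c][c] for v in M[c]]
        for r in range(n):
            if r != c:
                M[r] = [v - M[r][c] * w for v, w in zip(M[r], M[c])]
    return [M[i][n] for i in range(n)]

def laprls_refine(decision_values, K, S, gamma):
    """Closed form of the group refinement: f = K (K + gamma L K)^{-1} y_tilde."""
    y = [sigmoid(v) for v in decision_values]
    L = laplacian(S)
    LK = matmul(L, K)
    n = len(y)
    A = [[K[i][j] + gamma * LK[i][j] for j in range(n)] for i in range(n)]
    alpha = solve(A, y)
    return [sum(K[i][j] * alpha[j] for j in range(n)) for i in range(n)]

def rerank_score(refined, num_tags):
    """Assumed Eq.-(6)-style score: favor high refined relevance and few tags."""
    return refined / num_tags
```

With `gamma = 0` the refinement returns the sigmoid-mapped initial scores unchanged; larger `gamma` pulls the scores of visually similar photos in a group toward each other.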

III. EXPERIMENTS

We compare our proposed AFSVM with the baseline SVM algorithm, and also compare AFSVM+LapRLS with SVM+LapRLS, in which the decision values from the AFSVM and SVM classifiers are respectively used as the initial relevance scores before conducting LapRLS. For the raw test photos not initially associated with tags, we conduct tag-based photo retrieval using the initial or the refined relevance scores of the test photos. For the test Flickr photos initially associated with noisy tags, we can additionally use the new relevance score discussed in Section II-D to rerank the test photos. For any given query tag $t$, we also report the retrieval results obtained by using the relevance score suggested in [14], which is computed from the rank position of the query tag $t$ in the tag rank list of the image and the total number of tags in the image. We also compare AFSVM+LapRLS using our new relevance score in (6) with two state-of-the-art tag reranking methods in [13] and [14]. We refer to the methods proposed in [13] and [14] as kNN and KDE_RW, because a kNN classifier and a kernel density estimation (KDE) + random walk (RW) based method are respectively employed to rerank the noisy tags before conducting the tag-based image retrieval via the relevance scores.

For SVM and AFSVM, we fix the regularization parameter $C$ and use the RBF kernel with the default bandwidth parameter, as suggested in LIBSVM. We also need to determine two parameters for SVM+LapRLS and AFSVM+LapRLS. In this work, we empirically set these two parameters over small candidate sets and report the best results from the optimal parameter combination. For performance evaluation of tag-based image retrieval, we use noninterpolated average precision (AP) [22]. It corresponds to the multipoint average precision value of a precision-recall curve, and incorporates the effect of recall when AP is computed over the entire classification results.

A. Experimental Setup

1) Database: Our training dataset consists of about 1.3 million photos downloaded from the photo forum Photosig.com [2]. Most of the Photosig images are accompanied by rich surrounding textual descriptions including titles, categories, and descriptions. Similar to [10], we also observe that the images in Photosig are generally of high resolution, with sizes varying from 300 × 200 to 800 × 600, and the surrounding descriptions


generally describe the semantics of the corresponding images. We remove the high-frequency words (e.g., "the," "photo," "picture") that are not meaningful, and our final dictionary contains 21,377 words, which cover almost all the common daily-life concepts in a personal collection. Each image is associated with about five words on average. While it is possible to use millions of images from Flickr.com as the training data, we choose the Photosig dataset for training in order to avoid the overlap of the training and test datasets, as one of our test datasets is collected from Flickr.com. We also note that the surrounding textual descriptions of Photosig are considerably less noisy than those of Flickr. We choose 400 popular tags based on the frequency of images in the Photosig dataset to form the set $\mathcal{P}$.

We test the performance of different tag-based photo retrieval methods on two datasets. The first dataset, Kodak [15], which was collected by the Eastman Kodak Company, contains 103 groups of test photos, with the number of photos in each group varying from 4 to 249. The images in the Kodak dataset are not associated with tags. A total of 22 semantic concepts were defined, including 11 event-related concepts and 11 scene-related concepts. For some concept names, such as BeachFun and OpenCountry, in which two words are connected as one individual word, we cannot find the relevant and irrelevant web images from the training dataset (i.e., the Photosig dataset) using these concept names as queries. Finally, we have 15 concepts: mountain, wedding, forest, Christmas, coast, office, eating, birthday, kitchen, highway, skiing, bedroom, graduation, suburb, and ballgame. We also collect a dataset called the Flickr dataset by ourselves, which contains about 6000 test images collected from Flickr.com using keyword-based search, as suggested in [22]. This dataset is an expansion of our initial dataset in [22], which only contains about 3500 test photos.
Specifically, we choose the 36 most popular tags2 with small semantic gaps, as suggested in [23], covering scenes/landscapes, objects, and colors. In total, we download 488 groups of images from 440 Flickr users. The images within each group were captured by the same user within one day, and each group contains 12 images on average. The tags of the Flickr test images are manually annotated by three independent annotators. In contrast to the Kodak dataset, the images in the Flickr dataset are originally associated with noisy tags. Following the existing works [2], [24], we first remove the words that do not exist in WordNet [25], and each remaining word is automatically converted into its prototype (e.g., from "dogs" to "dog"). After that, we keep the tags that have at least one positive test image in the Flickr dataset. To obtain the ground-truth annotations, we first construct a group-specific lexicon which contains only the tags of all the photos within the group. Then each image is shown to three annotators, and each annotator separately decides whether each concept in the group-specific lexicon is present in the image. Only when at least two annotators agree that a concept is present in the image is this concept used as a ground-truth annotation of the image.

2These tags are beach, bee, bird, blue, bridge, building, butterfly, candle, city, cloud, eye, firework, flower, garden, glass, green, home, island, lake, leaf, moon, mountain, peacock, pink, rain, red, river, rock, rose, sky, snow, sunset, tree, water, window, yellow.
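The group-specific lexicon construction and the two-of-three annotator agreement rule can be expressed directly. The sketch below uses hypothetical names and a vote-count representation of the three annotators' decisions:

```python
def group_lexicon(group_tags):
    """Group-specific lexicon: the union of the tags of all photos in the group.
    group_tags: {photo_id: iterable of tag strings}."""
    lex = set()
    for tags in group_tags.values():
        lex |= set(tags)
    return lex

def ground_truth(votes, min_agree=2):
    """votes: {concept: number of annotators (out of 3) who marked it present}.
    A concept becomes ground truth only if at least min_agree annotators agree."""
    return {c for c, n in votes.items() if n >= min_agree}
```

Restricting annotation to the group-specific lexicon keeps the labeling effort proportional to the tags actually present in each group rather than the full 1454-tag vocabulary.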


Fig. 4. Per-concept AP results of the 15 concepts on the Kodak dataset.

2) Features: Three types of global features, i.e., grid color moment (225-D), wavelet texture (128-D), and edge direction histogram (73-D), are used as the default features because of their efficiency and effectiveness. The color moment feature represents the color distribution by using the mean, the standard deviation, and the third root of the skewness of each color channel. We use the grid color moment (GCM) and extract the three moments of the three channels in the LAB color space from each of the 5 × 5 fixed grid partitions. The features are finally aggregated into a single 225-D feature vector. The edge direction histogram (EDH) feature represents the direction distribution of edge pixels. Specifically, we set the number of bins to 73, with 72 bins corresponding to edge directions quantized into five-degree intervals and one bin for nonedge pixels. Similar to [24], we also perform the pyramid-structured wavelet transform (PWT) and the tree-structured wavelet transform (TWT) to extract the 128-D wavelet texture (WT) feature. The PWT recursively decomposes the LL sub-band, while the TWT recursively decomposes the LH, HL, and HH sub-bands in order to extract the most important information. For both PWT and TWT, we perform three levels of decomposition and construct the feature vectors by using the mean and standard deviation of the energy distribution of each sub-band at each level. Finally, each image is represented as a single 426-D vector by concatenating the three types of global features. Please refer to [24] for more details about the features. For the images in the training dataset (Photosig), we calculate the mean value and the standard deviation of each feature dimension, and then normalize all dimensions to zero mean and unit variance. We also normalize the images in the test Kodak and Flickr datasets using the same means and standard deviations.
To further improve the speed and reduce the memory cost, we additionally perform principal component analysis (PCA) using all the training images in the Photosig dataset. We observe that the first 103 principal components are sufficient to preserve 90% of the energy. Therefore, all the images in the training and test datasets are projected into the 103-D space after dimension reduction. Please refer to [2] for more details. To calculate the similarity matrix of the test images within the same group (see Section II-C), we adopt the spatial pyramid matching (SPM) method proposed in [26]. As suggested in [26], we extract dense SIFT features from 16 × 16 pixel patches over a grid with a spacing of eight pixels, and we use four pyramid levels.
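The normalization step described above — fit per-dimension statistics on the training images, then apply the same statistics to the test Kodak and Flickr images — can be sketched as follows. This is a pure-Python illustration with hypothetical names; the 426-D features and the subsequent PCA projection are omitted:

```python
import math

def fit_standardizer(train):
    """Per-dimension mean and standard deviation from the training images.
    A zero standard deviation falls back to 1.0 to avoid division by zero."""
    n, d = len(train), len(train[0])
    mean = [sum(x[j] for x in train) / n for j in range(d)]
    std = [math.sqrt(sum((x[j] - mean[j]) ** 2 for x in train) / n) or 1.0
           for j in range(d)]
    return mean, std

def standardize(x, mean, std):
    """Apply the training-set statistics to any (training or test) image."""
    return [(xj - m) / s for xj, m, s in zip(x, mean, std)]
```

Reusing the training statistics on the test sets keeps the training and test features on a common scale, which matters for both the RBF kernel and the PCA projection.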


Fig. 5. Top-20 retrieved images using the query tag “Christmas” on the Kodak dataset. Incorrect results are highlighted by blue boxes (in online version).

TABLE I. MEAN AVERAGE PRECISIONS (MAPs%) OVER ALL THE 15 CONCEPTS ON THE KODAK DATASET

B. Results

1) Results on the Kodak Dataset: The raw images in the Kodak dataset are not associated with any tags, so we directly rank the test photos using the initial or refined relevance scores. We would like to highlight that the training web images obtained from the automatic web image collection discussed in Section II-A may be associated with incorrect class labels because the textual descriptions of the images in the Photosig dataset are noisy. Retrieving raw consumer photos using textual queries is therefore a very challenging task. We will investigate how to explicitly handle the noisy labels in our future work. We compare AFSVM and AFSVM+LapRLS with SVM and SVM+LapRLS. In this experiment, we do not compare our methods with [13] and [14], because those two methods can only work for photos with noisy tags. The results are shown in Fig. 4 and Table I. We have the following three observations:
• AFSVM outperforms SVM in terms of the per-concept AP results in most cases. When comparing AFSVM with SVM, the MAP over all the 15 concepts increases from 9.9% to 11.7%, an 18.2% relative improvement. These results clearly demonstrate that it is beneficial to learn the adapted classifiers for all the concepts by leveraging a set of prelearned classifiers of the popular concepts.
• SVM+LapRLS is better than SVM for 9 out of the 15 concepts, and the MAP over all the 15 concepts is also significantly improved from 9.9% to 14.7%, which demonstrates the effectiveness of using LapRLS as the group-based refinement method to improve the tag-based image retrieval performance.
• AFSVM+LapRLS achieves the best MAP of 16.3%, thanks to the combination of AFSVM and the group-based refinement algorithm using LapRLS.
Fig. 5 shows the top-20 images retrieved by the different algorithms on the Kodak dataset using the query tag "Christmas." The top-ranked images from AFSVM+LapRLS are clearly much better.

2) Results on the Flickr Dataset: The images in the Flickr dataset are originally associated with noisy tags. We compare our AFSVM and AFSVM+LapRLS with SVM and SVM+LapRLS, as well as with two tag reranking methods, kNN [13] and KDE_RW [14], in which the tags are reranked before conducting tag-based image retrieval with the relevance scores defined in [13] and [14]. We consider two settings on the Flickr dataset. In setting A, we assume that each Flickr image is only associated with its original tags created by Flickr users. In setting B, we assume that each Flickr image is initially associated with all the tags in the group-specific lexicon. In both settings, for each query tag, only the images associated with that tag are considered as the database photos when evaluating the image reranking performance. Note that the total number of database images in setting B is generally larger than that in setting A. The tags created by Flickr users are noisy: after human annotation, we observe that no database image is labeled with 849 (resp. 828) of the concepts in setting A (resp. setting B). We therefore only evaluate our approach on the remaining 605 and 626 concepts that are associated with at least one database image in setting A and setting B, respectively.

We first compare AFSVM and AFSVM+LapRLS with SVM and SVM+LapRLS for image reranking. The MAPs over all the concepts and over the unpopular concepts are shown in Tables II and III, where the three rows show the results using the decision value, the relevance score suggested in [14], and our newly proposed relevance score in (6), respectively.

TABLE II. MEAN AVERAGE PRECISIONS (MAPs%) OVER ALL THE CONCEPTS IN TWO SETTINGS ON THE FLICKR DATASET

TABLE III. MEAN AVERAGE PRECISIONS (MAPs%) OVER ALL THE UNPOPULAR CONCEPTS IN TWO SETTINGS ON THE FLICKR DATASET

TABLE IV. MEAN AVERAGE PRECISIONS (MAPs%) OF TWO EXISTING METHODS kNN [13] AND KDE_RW [14] AS WELL AS OUR AFSVM+LapRLS USING THE NEWLY PROPOSED RELEVANCE SCORE ON THE FLICKR DATASET

The unpopular concepts are those among the remaining 605 (setting A) or 626 (setting B) concepts that are not among the 400 popular concepts with the highest image frequencies in the Photosig dataset; there are 388 and 408 unpopular concepts in setting A and setting B, respectively. We have the following three observations:
• Again, AFSVM consistently outperforms SVM, and AFSVM+LapRLS generally achieves the best results in both settings in terms of MAPs over all the concepts and over the unpopular concepts, which clearly demonstrates the effectiveness of AFSVM in leveraging the prelearned classifiers of popular concepts, as well as the effectiveness of our group-based refinement method using LapRLS.
• Compared with the results directly using the decision value, the image retrieval performance is generally improved by using the relevance score suggested in [14] in setting A. In most cases, the best results are achieved by using our proposed relevance score discussed in Section II-D, which demonstrates that our newly proposed relevance score in (6) can effectively combine the initial or refined relevance scores with additional information (i.e., the total number of tags created by Flickr users).
• The retrieval results of all the methods become worse in setting B when compared with setting A. One possible explanation is that the new tags borrowed from the group-specific lexicon are generally less reliable than the original tags created by Flickr users,

making image reranking a more challenging task in setting B.

We also compare our AFSVM+LapRLS using our proposed relevance score in (6) with the two existing methods kNN [13] and KDE_RW [14] for image reranking. From Table IV, we observe that AFSVM+LapRLS is much better than the two state-of-the-art methods [13] and [14] in all cases. It is also interesting to observe that the results of all the algorithms are higher on the Flickr dataset than on the Kodak dataset. One possible explanation is that we need to rank all the database images in the Kodak dataset, because the raw images there are not associated with any initial tags, whereas we only need to rerank the Flickr images that are initially associated with the query tag. Considering that the tags created by Flickr users are still accurate to some extent, the MAP of a random ranking algorithm on the Flickr dataset should be higher than that on the Kodak dataset.

Figs. 6 and 7 show the top-20 retrieved images of AFSVM+LapRLS with our proposed relevance score, the baseline SVM using the decision values [2], and the two existing methods [13], [14] for the query tags "bridge" and "car," respectively. Again, the top-ranked images of AFSVM+LapRLS are much better than those returned by the other methods.

Fig. 6. Top-20 retrieved images for the query tag "bridge" on the Flickr dataset (setting A). Incorrect results are highlighted by blue boxes (in online version).

Fig. 7. Top-20 retrieved images for the query tag "car" on the Flickr dataset (setting B). Incorrect results are highlighted by blue boxes (in online version).

Finally, we investigate the performance variation of AFSVM+LapRLS over all the concepts with respect to its two parameters on the Flickr dataset, empirically varying each parameter over a predefined set of candidate values.

Fig. 8. Performance variation of AFSVM+LapRLS over all the concepts with respect to the two parameters on the Flickr dataset.

From Fig. 8, it is clear that our method AFSVM+LapRLS is relatively robust with respect to the two parameters. How to determine the optimal parameters is still an open problem, which will be investigated in the future.
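All of the comparisons above report (mean) average precision. As a reference point, the standard AP/MAP computation over ranked relevance scores can be sketched as follows; the function names are ours, and this is an illustration of the usual definition rather than the paper's evaluation code:

```python
import numpy as np

def average_precision(scores, labels):
    """AP for one concept: rank images by score (descending) and
    average the precision at each relevant position."""
    order = np.argsort(scores)[::-1]
    rel = np.asarray(labels)[order]          # 1 = relevant, 0 = not
    hits = np.cumsum(rel)
    ranks = np.arange(1, len(rel) + 1)
    if hits[-1] == 0:                        # no relevant image at all
        return 0.0
    return float(np.sum((hits / ranks) * rel) / hits[-1])

def mean_average_precision(per_concept):
    """MAP: mean of the per-concept APs."""
    return float(np.mean([average_precision(s, y) for s, y in per_concept]))

# relevant images ranked 1st and 3rd: AP = (1/1 + 2/3) / 2
ap = average_precision([0.9, 0.8, 0.7, 0.1], [1, 0, 1, 0])
print(round(ap, 3))  # 0.833
```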


TABLE V. MEAN ALIGNMENT VALUES (MAVs%) OF AFSVM AND SVM OVER ALL THE CONCEPTS ON THE KODAK AND FLICKR DATASETS

C. Analysis Using Kernel Target Alignment

The only difference between AFSVM and SVM is that their kernel matrices are different. To explain the performance improvement of AFSVM over SVM, we analyze the kernel matrices using the kernel target alignment method in [27]. The alignment value (AV) between a given kernel matrix $K$ and the ideal kernel matrix $K^* = \mathbf{y}\mathbf{y}^\top$ constructed from the ground-truth label vector $\mathbf{y}$ is defined as

$$A(K, K^*) = \frac{\langle K, K^* \rangle_F}{\sqrt{\langle K, K \rangle_F \, \langle K^*, K^* \rangle_F}} \qquad (7)$$

where the kernel matrices are computed over the $n$ test images used for evaluation, and $\langle \cdot, \cdot \rangle_F$ is the Frobenius product between two matrices. As shown in [27] and [28], a larger AV indicates that the corresponding kernel is closer to the ideal kernel, so the classification results using this kernel are potentially better. For each concept, we first use the test images along with their ground-truth labels to construct the ideal kernel matrix $K^*$, and then calculate the AV between the kernel matrix of SVM/AFSVM and $K^*$ using (7). Table V lists the mean alignment values (MAVs) of AFSVM and SVM over all the concepts on the Kodak and Flickr datasets, respectively. We observe that the MAVs of AFSVM are higher than those of SVM, demonstrating that the new kernel matrix used in AFSVM is closer to the ideal kernel. These results also explain the performance improvement of AFSVM.

IV. CONCLUSIONS

To improve tag-based photo retrieval performance for a group of personal images captured by the same user within an event, we have proposed a new framework that leverages millions of training web images and their associated rich textual descriptions. For any given query tag (e.g., "car"), our framework first automatically finds the relevant and irrelevant training web images, which are used as the positive and negative training data for classifier learning. We propose a new classification method called SVM with augmented features (AFSVM) to learn an adapted classifier by leveraging the prelearned SVM classifiers of popular tags that are associated with a large number of relevant training web images. In the subsequent group-based refinement process, we employ the Laplacian regularized least squares (LapRLS) method to further refine the relevance scores, treating the decision values of one group of test photos from the AFSVM classifiers as the initial relevance scores and utilizing the visual similarity of the images within the group. Using the refined relevance scores, our tag-based image retrieval framework is applicable to a group of raw consumer photos without any textual descriptions or a group of Flickr photos with noisy tags. In addition, we propose a new method to calculate the relevance score that further improves the retrieval performance for Flickr photos. Our comprehensive experiments on two datasets demonstrate the effectiveness of our framework.
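The alignment value of (7) used in the kernel analysis above is straightforward to compute; a minimal numpy sketch under the standard definition from [27], with an illustrative function name of our own:

```python
import numpy as np

def alignment_value(K, y):
    """Kernel target alignment between a kernel matrix K and the
    ideal kernel y y^T built from labels y in {-1, +1}."""
    y = np.asarray(y, dtype=float)
    K_star = np.outer(y, y)                      # ideal kernel
    frob = lambda A, B: np.sum(A * B)            # Frobenius inner product
    return frob(K, K_star) / np.sqrt(frob(K, K) * frob(K_star, K_star))

# the ideal kernel is perfectly aligned with itself
y = np.array([1, 1, -1, -1])
print(alignment_value(np.outer(y, y), y))  # 1.0
```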

REFERENCES

[1] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Image retrieval: Ideas, influences, and trends of the new age," ACM Comput. Surveys, vol. 40, no. 2, pp. 1–60, 2008.
[2] Y. Liu, D. Xu, I. W. Tsang, and J. Luo, "Textual query of personal photos facilitated by large-scale web data," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 5, pp. 1022–1036, May 2011.
[3] S.-F. Chang, D. Ellis, W. Jiang, K. Lee, A. Yanagawa, A. C. Loui, and J. Luo, "Large-scale multimodal semantic concept detection for consumer video," in Proc. ACM SIGMM Int. Workshop Multimedia Inf. Retrieval, 2007, pp. 255–264.
[4] D. Grangier and S. Bengio, "A discriminative kernel-based approach to rank images from text queries," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 8, pp. 1371–1384, Aug. 2008.
[5] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid, "TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation," in Proc. IEEE Int. Conf. Comput. Vis., 2009, pp. 309–316.
[6] J. Li and J. Z. Wang, "Real-time computerized annotation of pictures," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 6, pp. 985–1002, Jun. 2008.
[7] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1958–1970, Nov. 2008.
[8] A. Torralba, R. Fergus, and Y. Weiss, "Small codes and large image databases for recognition," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[9] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. Neural Inf. Process. Syst. (NIPS), 2008, pp. 1753–1760.
[10] X.-J. Wang, L. Zhang, X. Li, and W.-Y. Ma, "Annotating images by mining image search results," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 11, pp. 1919–1932, Nov. 2008.
[11] C. Wang, F. Jing, L. Zhang, and H.-J. Zhang, "Content-based image annotation refinement," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2007, pp. 1–8.
[12] C. Wang, L. Zhang, and H.-J. Zhang, "Learning to reduce the semantic gap in web image retrieval and annotation," in Proc. 31st Annu. Int. ACM SIGIR Conf., 2008, pp. 355–362.
[13] X. Li, C. G. M. Snoek, and M. Worring, "Learning social tag relevance by neighbor voting," IEEE Trans. Multimedia, vol. 11, no. 7, pp. 1310–1322, Nov. 2009.
[14] D. Liu, X.-S. Hua, L. Yang, M. Wang, and H.-J. Zhang, "Tag ranking," in Proc. Int. World Wide Web Conf., 2009, pp. 351–360.
[15] L. Cao, J. Luo, and T. S. Huang, "Annotating photo collections by label propagation according to multiple similarity cues," in Proc. ACM Multimedia, 2008, pp. 121–130.
[16] L. Duan, I. W. Tsang, D. Xu, and T.-S. Chua, "Domain adaptation from multiple sources via auxiliary classifiers," in Proc. Int. Conf. Mach. Learning, 2009, pp. 199–210.
[17] L. Duan, D. Xu, I. W.-H. Tsang, and J. Luo, "Visual event recognition in videos by learning from web data," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 1959–1966.
[18] A. J. Smola, T.-T. Frieß, and B. Schölkopf, "Semiparametric support vector and linear programming machines," in Proc. Neural Inf. Process. Syst. (NIPS), 1998, pp. 585–591.
[19] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–27:27, 2011. [Online]. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm
[20] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," J. Mach. Learning Res., vol. 9, pp. 1871–1874, 2008.
[21] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from labeled and unlabeled examples," J. Mach. Learning Res., vol. 7, pp. 2399–2434, 2006.
[22] L. Chen, D. Xu, I. W.-H. Tsang, and J. Luo, "Tag-based web photo retrieval improved by batch mode re-tagging," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2010, pp. 3440–3446.
[23] Y. Lu, L. Zhang, J. Liu, and Q. Tian, "Constructing concept lexica with small semantic gaps," IEEE Trans. Multimedia, vol. 12, no. 4, pp. 288–299, Jun. 2010.


[24] T.-S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng, "NUS-WIDE: A real-world web image database from National University of Singapore," in Proc. ACM Int. Conf. Image Video Retrieval, 2009, pp. 1–9.
[25] C. Fellbaum, WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press, 1998.
[26] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. IEEE Int. Conf. Comput. Vis. Pattern Recognit., 2006, pp. 2169–2178.
[27] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, and J. S. Kandola, "On kernel-target alignment," in Proc. Neural Inf. Process. Syst. (NIPS), 2001, pp. 367–373.
[28] J. T. Kwok and I. W. Tsang, "Learning with idealized kernels," in Proc. Int. Conf. Mach. Learning, 2003, pp. 400–407.

Lin Chen received the B.E. degree from the University of Science and Technology of China, Hefei, China, in 2009, and is currently working toward the Ph.D. degree in computer engineering at Nanyang Technological University, Singapore. His current research interests include computer vision and machine learning, especially large-scale image retrieval and related techniques.

Dong Xu received the B.E. and Ph.D. degrees from the University of Science and Technology of China, in 2001 and 2005, respectively. While working toward the Ph.D. degree, he was with Microsoft Research Asia, Beijing, China, and the Chinese University of Hong Kong, Shatin, Hong Kong, for more than two years. He was a Postdoctoral Research Scientist with Columbia University, New York, NY, for one year. In May 2007, he joined the Nanyang Technological University, Singapore, where he is currently an Associate Professor. His current research interests include computer vision, statistical learning, and multimedia content analysis. Dr. Xu was the coauthor of a paper that won the Best Student Paper Award in the prestigious IEEE International Conference on Computer Vision and Pattern Recognition in 2010.


Ivor W. Tsang received the Ph.D. degree in computer science from the Hong Kong University of Science and Technology, Kowloon, Hong Kong, in 2007. He is currently an Assistant Professor with the School of Computer Engineering, Nanyang Technological University, Singapore. He is also the Deputy Director of the Center for Computational Intelligence, Nanyang Technological University. Dr. Tsang received the prestigious IEEE TRANSACTIONS ON NEURAL NETWORKS Outstanding 2004 Paper Award in 2006, and the 2008 National Natural Science Award (Class II), China in 2009. His research work earned him the Best Paper Award at ICTAI’11, the Best Student Paper Award at CVPR’10, and the Best Paper Award from the IEEE Hong Kong Chapter of Signal Processing Postgraduate Forum in 2006. He was also conferred with the Microsoft Fellowship in 2005.

Jiebo Luo received the B.S. degree from the University of Science and Technology of China in 1989 and the Ph.D. degree from the University of Rochester, Rochester, NY, in 1995. He was a Senior Principal Scientist with the Kodak Research Laboratories in Rochester before joining the Computer Science Department, University of Rochester in Fall 2011. His research interests include image processing, machine learning, computer vision, social multimedia data mining, biomedical informatics, and ubiquitous computing. He has authored over 180 technical papers and holds over 60 U.S. patents. He has been actively involved in numerous technical conferences, including recently serving as the General Chair of ACM CIVR 2008, Program Co-Chair of IEEE CVPR 2012 and ACM Multimedia 2010, Area Chair of IEEE ICASSP 2009–2011, ICIP 2008–2011, and ICCV 2011. He has served on the editorial boards of the IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, IEEE TRANSACTIONS ON MULTIMEDIA, IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, Pattern Recognition, Machine Vision and Applications, and the Journal of Electronic Imaging. Dr. Luo is a Kodak Distinguished Inventor, a recipient of the 2004 Eastman Innovation Award, and a Fellow of the SPIE and IAPR.
