Compact Representation for Large-Scale Clustering and Similarity Search

Bin Wang(1), Yuanhao Chen(1), Zhiwei Li(2), and Mingjing Li(2)

(1) University of Science and Technology of China
(2) Microsoft Research Asia
{binwang, yhchen04}, {zli, mjli}

Abstract. Although content-based image retrieval has been researched for many years, few content-based methods are implemented in present image search engines. This is partly because of the great difficulty of indexing and searching in high-dimensional feature space for large-scale image datasets. In this paper, we propose a novel method to represent the content of each image as one or multiple hash codes, which can be considered as special keywords. Based on this compact representation, images can be accessed very quickly by their visual content. Furthermore, two advanced functionalities are implemented. One is content-based image clustering, which is simplified as grouping images with identical or near-identical hash codes. The other is content-based similarity search, which is approximated by finding images with similar hash codes. The hash code extraction process is very simple, and both image clustering and similarity search can be performed in real time. Experiments on over 11 million images collected from the web demonstrate the efficiency and effectiveness of the proposed method.

Keywords: similarity search, image clustering, hash code.

1 Introduction

Images are one of the most popular media types in our daily life. With the profusion of digital cameras and camera cell phones, the number of images, including personal photo collections and web image repositories, has increased quickly in recent years. Accordingly, people increasingly need to find desired images on the web. To meet those needs, many image search engines have been developed and are commercially available. For instance, both Google Image Search [1] and Yahoo [2] have indexed over one billion images. Present image search engines generally accept only keyword-based queries, while very few simple content-based methods have been supported until recently. Google and Yahoo allow the categorization of images according to their sizes (large, medium and small) or colors (black/white vs. color images). Fotolia [3] provides limited support for searching images based on their colors, which is coarse and far from sufficient for describing image content. Since images are a visual medium, content-based image retrieval (CBIR) has been well studied and many CBIR algorithms have been proposed. ImageRover, RIME and WebSeer are among the early content-based image retrieval

Y. Zhuang et al. (Eds.): PCM 2006, LNCS 4261, pp. 835-843, 2006. © Springer-Verlag Berlin Heidelberg 2006



systems [4]. These CBIR systems are restricted to small or medium-sized datasets. Cortina [5] is a CBIR system which indexes about three million images and exploits a clustering method to avoid high-dimensional indexing. [5] also states that the clustering process for millions of images is very time-consuming. A more comprehensive list of CBIR systems can be found in [4], and [6] surveys related topics about CBIR systems for large datasets. Although content-based methods have been demonstrated to be effective in improving users' search experiences, one of the main difficulties in scaling up the traditional methods to large-scale datasets is how to index and search in high-dimensional image feature space, as images are usually represented as high-dimensional features. The dimensionality of typical image features ranges from tens to hundreds. Building efficient index structures for high-dimensional data remains an open research topic in the database field. Another accompanying effect of high-dimensional features is the storage cost. For an image search engine collecting billions of images, the storage cost will be huge, which prevents the system from efficiently processing the images. To alleviate those problems, which hamper the application of CBIR methods, we propose a very effective method to build a compact representation of image content, which is called a "hash code" in this paper and can be regarded as a special keyword. The "hash code" of an image is a string of bits built from its visual characteristics and packed into only a few bytes. This representation facilitates the indexing and search process in that keyword-based index and search methods can be similarly applied. In addition, such a compact representation greatly reduces the storage cost. These "hash codes" can be applied in many ways to extend a system's functionality on large-scale image datasets [7].
[8] uses a similar quantization method for indexing images, but the original features are still required for the nearest neighbor search. [9] exploits an integer representation of the DCT coefficients of video frames for duplicate video detection. Based on these hash codes, we first address the problem of clustering images on large-scale datasets, which could be very useful for data organization and presentation. Furthermore, interactively finding visually similar images is also discussed. The experimental results on a dataset of over 11 million images suggest the effectiveness of the proposed method and indicate that the application of content-based image retrieval on large-scale datasets is promising. The rest of the paper is organized as follows. In Section 2 we present our algorithm to generate the compact hash-code representation for a given image. Following that, the implementation and evaluation of image clustering are presented in Section 3. Section 4 further details the similarity search, which exploits the image clusters. Finally, we conclude the paper and present future work in Section 5.

2 The Hash Code of Image Content

The core part of the proposed method is to build a compact representation of an image's content, which helps apply content-based methods in web image search engines. Figure 1 shows the framework for calculating the hash codes. First, appropriate image features are extracted. The proposed method is independent of the feature type. Second, the high-dimensional features are projected into a subspace of much lower dimensionality while maintaining most of the information. This will



facilitate future manipulation and calculation on the features. Third, bit allocation and vector quantization techniques are leveraged to convert the float feature values to integer values. Finally, we pack the quantization results into a few bytes, yielding the K-bit hash code for the image. In the following subsections, we discuss the dimension reduction and quantization process.

Fig. 1. The framework of calculating hash codes of images’ contents

2.1 Dimension Reduction

Usually, the extracted image features lie in a very high-dimensional space. A typical kind of image feature has tens of dimensions, which is hard for either indexing or searching. Thus we need to reduce the feature dimensionality first. Many dimension reduction methods are available. Among them, PCA (principal component analysis) is a simple and technically sound one, and is adopted in this paper. In PCA, the data in the original high-dimensional space are projected into a lower-dimensional subspace which retains the largest variances.

2.2 Vector Quantization and Hash Code Generation

The hash code generation is essentially a vector quantization (VQ) process. The projected feature vector Gi is a point in the space R^M. To further facilitate the search process, the low-dimensional image feature Gi is mapped into a multi-dimensional integer space Z^M. The mapping function is obtained from the statistical analysis of a large image database:

Hi = f(Gi),    f: R^M -> Z^M.



A simple method is to quantize each dimension separately. If the final quantized vector has K bits, how to allocate the bits to each dimension is an important issue [10, 11]. The bits can be allocated among dimensions according to their variances, or a fixed number of bits can be assigned to the few most significant dimensions. We assign 1 bit to each of the 32 most significant dimensions, and the quantized value is determined by whether the feature value is larger than the mean value: 1 for yes and 0 for no. In this way, we convert the original float feature values into a long bit string. Suppose each dimension in Hi is represented using L_k bits; the total number of bits is then K = sum_k L_k (L_k = 1 for all k at present). This is similar to the digitization of analog signals. Thus K can impact system performance measures such as precision and recall. We constrain K to be no more than 32, because most present computers use 32-bit words. Therefore, the whole hash code can be packed into one integer type of the



computer. Both data manipulation (such as reads and writes) and computation can then be performed quickly.
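As a concrete illustration of Sections 2.1 and 2.2, the full pipeline — PCA projection followed by 1-bit-per-dimension quantization and packing into a 32-bit integer — might look as follows. This is a minimal sketch assuming numpy; all function and variable names are ours, not the authors' implementation.

```python
import numpy as np

def learn_pca(features, m=32):
    """Fit an m-dimensional PCA projection on a training sample
    (one feature vector per row). Illustrative only."""
    mean = features.mean(axis=0)
    # SVD of the centered data gives the principal directions.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:m]                      # top-m directions, shape (m, D)

def hash_code(feature, mean, basis, dim_means):
    """Project one feature vector and pack a K=32-bit hash code:
    bit k is 1 iff projected dimension k exceeds that dimension's mean."""
    g = basis @ (feature - mean)             # G_i, the projected vector
    code = 0
    for k in range(32):
        if g[k] > dim_means[k]:
            code |= 1 << k                   # fits in one 32-bit integer
    return code
```

The per-dimension means (`dim_means`) would be estimated once from a large image database, as the paper describes; after that, computing a code needs only one matrix-vector product and 32 comparisons.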

3 Image Clustering

With hash codes which reflect the images' content, we can scale up many traditional content-based image retrieval methods to large-scale image datasets. One of the most important tasks is to cluster visually similar images. Clustering helps not only data organization [5] but also image presentation. Image search engines usually return a long list of thousands of images, which is difficult for users to browse through. It has been shown that clustering is useful in presenting search results [12]. Therefore, grouping visually similar images will be helpful. Previous clustering methods often require a time-consuming process and cannot be applied in an interactive search process. In this section, we introduce the process of clustering visually similar images based on the proposed hash codes. Compared with traditional methods, only hash codes are needed during the clustering process, so the speed is very high.

Input: image hash codes
Parameters: Ls, Li, Tc, Td
Cluster:
1. Remove the Li least significant bits of each hash code
2. Split the image set into groups using the Ls most significant bits
3. For each group
   a) Initialize: the images with identical hash codes (Li least significant bits removed) form one cluster
   b) If the number of clusters is less than Tc, go to 3 for the next group
   c) Find the minimum distance between clusters, min(dset(m,n))
   d) If min(dset(m,n)) > Td, go to 3 for the next group
   e) Merge the two clusters m and n
   f) Go to b)
4. Output the clustering results

Fig. 2. Clustering process

Owing to the nature of the hash codes, which are generated using PCA, most of the information is retained in the few most significant dimensions. So the clustering can be conducted on these dimensions while some least significant bits are omitted, because they represent small values. A hierarchical clustering method is used, and Figure 2 shows the details of the clustering process. As hash codes are bit strings, the distance d(hi,hj) between two images represented by hash codes hi and hj is defined as the Hamming distance of hi and hj, i.e., the number of differing bits. In Figure 2, dset(m,n) denotes the distance between two clusters m and n. It is defined as the complete-link distance between the two sets:





dset(m, n) = max d(hi, hj),

where hi is a member of set m and hj is a member of set n. All these steps are performed using only the hash codes; no original high-dimensional features are required. This implies that those features need not be stored, which saves a huge amount of storage space.
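The distance computations and the merging loop of Fig. 2 can be sketched as below, with packed codes held as Python integers. Parameter names Li, Tc and Td follow the figure; the initial split by the Ls most significant bits is omitted for brevity, and this is an illustration, not the authors' code.

```python
def hamming(h1, h2):
    """Hamming distance between two packed hash codes:
    the number of differing bits."""
    return bin(h1 ^ h2).count("1")

def complete_link(cluster_m, cluster_n):
    """dset(m, n): the largest pairwise Hamming distance between
    the codes of two clusters (complete link)."""
    return max(hamming(hi, hj) for hi in cluster_m for hj in cluster_n)

def cluster_codes(codes, li, tc, td):
    """Merge loop of Fig. 2 within one group of codes."""
    # Step 1: drop the Li least significant bits.
    # Step 3a: identical truncated codes seed one cluster each.
    seeds = {}
    for c in codes:
        seeds.setdefault(c >> li, []).append(c >> li)
    clusters = list(seeds.values())
    # Steps 3b-3f: merge the closest pair until fewer than Tc clusters
    # remain or the minimum complete-link distance exceeds Td.
    while len(clusters) >= tc and len(clusters) > 1:
        pairs = [(complete_link(m, n), i, j)
                 for i, m in enumerate(clusters)
                 for j, n in enumerate(clusters) if i < j]
        d, i, j = min(pairs)
        if d > td:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Because the codes are plain integers, each distance is one XOR and a popcount, which is what makes the clustering fast enough for interactive use.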

Fig. 3. An example of image clustering

Table 1. Clustering time

Query           # of images   # of clusters   Time (second)
britney spears  1367          32              0.0028
tiger           1742          37              0.0038
saturn          1228          33              0.0027
apple           1586          34              0.0032
computer        1771          35              0.0037
flower          1793          34              0.0038
car             1564          34              0.0039
football        1996          38              0.0042
sport           1821          39              0.0039
dragon          1782          39              0.0038
Avg.                                          0.0036



Table 1 shows the average clustering time for the queries. The number of images varies depending on the query words, and we adjust the clustering parameters so that the final number of clusters is around 30. Either too many or too few clusters would be annoying to users. From Table 1, it can be seen that the clustering process runs very fast for thousands of images. Thus user-interactive operation can be supported. Figure 3 shows an example of the clustering results.

4 Similar Image Search

In the previous section, we discussed clustering of image search results. The clustering can also be applied to the whole image dataset. Based on the clustering information, we can then easily find an image's similar images, which should be the ones within the same cluster. Yet the results of such a simple method will not be good. The reason is the well-known semantic gap between low-level features and images' semantics; a single kind of feature is insufficient. To solve the problem, we utilize multiple kinds of image features simultaneously to reflect different content characteristics of an image, for example, color, texture, and shape. For each kind of feature, the hash codes can be calculated and the clustering process in Section 3 can be performed. The hash codes of similar images are supposed to fall in the same cluster, which can be called a "collision". If two images are actually similar, the probability that their hash codes "collide" will be higher. Therefore, we can use the number of hash code collisions as the similarity measure of two images: the more two images' hash codes collide, the more likely the two images are similar. For each single feature, we build an index structure based on the clustering results. In each index structure, only the cluster label of each image, instead of the original feature or hash value, is stored. So for each image in the dataset, we generate a new feature vector whose components are the cluster labels obtained from the hash codes of the different features. These feature components can be deemed special "keywords" of an image, in that similar images share the same such "keywords". Then, we can utilize traditional text-based index and search methods to obtain an image's list of similar images and the corresponding similarity measure (the number of "collisions").
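The collision-counting search can be pictured with a toy inverted index over the cluster labels; the data layout and names here are hypothetical, chosen only to mirror the "special keywords" idea above.

```python
from collections import defaultdict

def build_index(image_labels):
    """image_labels maps image_id -> [label_color, label_texture, ...],
    one cluster label per feature type (hypothetical layout).
    Returns an inverted index from (feature_idx, label) to image ids,
    mirroring a keyword index over the special 'keywords'."""
    index = defaultdict(list)
    for img, labels in image_labels.items():
        for f, label in enumerate(labels):
            index[(f, label)].append(img)
    return index

def similar_images(query_labels, index):
    """Rank images by collision count: the number of feature types
    whose cluster label matches the query image's label."""
    collisions = defaultdict(int)
    for f, label in enumerate(query_labels):
        for img in index[(f, label)]:
            collisions[img] += 1
    return sorted(collisions.items(), key=lambda kv: -kv[1])
```

Only small integer labels are stored per image, so the index stays compact, and lookup touches just one posting list per feature type.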
Given the returned image list, the similarity measure as well as the images' contrast, colorfulness, blur and other image quality measures are combined to rank the returned images [14]. Thus the images that have the highest quality and are most similar to the query image are ranked at the top, which can further improve the search experience. The experiments are conducted on the same dataset as in Section 3. Figure 4 presents the average search time of 20 queries. The X-axis is the number of similar images returned and the Y-axis is the consumed time in seconds. It can be seen that the search over 11 million images completes very quickly. Even for finding 10,000 similar images, the system can complete the work in less than 0.5 second. But we should be aware that the similar images are found in order of decreasing similarity, so the value on the X-axis should not be too large. We asked several people to label the


















Fig. 4. Speed of similar image search

Fig. 5. An example of similarity search based on the images' hash codes

retrieval results so as to calculate the precision measure, which is defined as the number of correctly retrieved images over the total number of retrieved images. During labeling, there was great inconsistency among people about which images are similar. This is because "similar" is a vague notion without a clear definition. While some people



judge similarity using colors and textures (such as face presentation), others depend on semantics in the process. Thus, instead of calculating precision or recall measures, we present an example of the similarity search in Figure 5. The image at the top left is the query image. It can be seen that the results are satisfying.
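The ranking step described earlier in this section (combining the collision-count similarity with image quality measures, as in [14]) can be pictured as a simple weighted score. The mixing weight and the input layout below are our placeholders; [14] learns the actual quality metric from examples rather than using a fixed formula.

```python
def rank_results(candidates, alpha=0.5):
    """candidates: list of (image_id, collisions, quality) where
    collisions is the number of hash-code collisions with the query
    and quality is a score in [0, 1] from a no-reference quality
    metric. alpha is a hypothetical mixing weight."""
    def score(c):
        _, collisions, quality = c
        return alpha * collisions + (1 - alpha) * quality
    # Highest combined score first: similar AND high-quality images
    # rise to the top of the result list.
    return sorted(candidates, key=score, reverse=True)
```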

5 Conclusion

In this paper, we propose a method to represent image content in a very compact form, called a "hash code". We then discuss methods to efficiently cluster images and find similar images in large-scale image datasets. To get the hash codes, the original high-dimensional image features, which are hard to index and search, are first mapped into a low-dimensional space. Bit allocation and vector quantization are then applied. Finally, the quantized values are packed together to form hash codes. With the compact form of hash codes, the clustering process can be performed very fast and applied to improve the presentation of image search results. Furthermore, finding similar images can also be conducted quickly based on the hash codes. The performance of the proposed methods on a large-scale dataset of more than 11 million images is encouraging. The results show that the generated compact representation can help the application of content-based image retrieval to large image datasets. At present, we use PCA for dimension reduction; LSA and ICA are both promising alternatives. Besides, joint quantization of multiple dimensions may improve the performance. These are interesting directions for future work.

References

1. Google Image Search
2. Yahoo Image Search
3. Fotolia
4. Veltkamp, R.C., Tanase, M.: Content-Based Image Retrieval Systems: A Survey. Technical Report UU-CS-2000-34, Dept. of Computing Science, Utrecht University (2000)
5. Quack, T., Mönich, U., Thiele, U., Manjunath, B.S.: Cortina: A System for Large-Scale, Content-Based Web Image Retrieval. In: Proceedings of the 12th Annual ACM International Conference on Multimedia, pp. 508-511 (2004)
6. Kherfi, M.L., Ziou, D., Bernardi, A.: Image Retrieval from the World Wide Web: Issues, Techniques and Systems. ACM Computing Surveys (2004)
7. Wang, B., Li, Z., Li, M.: Large-Scale Duplicate Detection for Web Image Search. In: International Conference on Multimedia & Expo (2006)
8. Böhm, K., Mlivoncic, M., Schek, H.-J., Weber, R.: Fast Evaluation Techniques for Complex Similarity Queries. In: Proceedings of the 27th International Conference on Very Large Data Bases, pp. 211-220 (2001)
9. Naturel, X., Gros, P.: A Fast Shot Matching Strategy for Detecting Duplicate Sequences in a Television Stream. In: Proceedings of the 2nd ACM SIGMOD International Workshop on Computer Vision meets Databases (2005)
10. Ferhatosmanoglu, H., Tuncel, E., Agrawal, D., Abbadi, A.: Vector Approximation based Indexing for Non-uniform High Dimensional Data Sets. In: Proceedings of the 9th CIKM, McLean, USA, pp. 202-209 (2000)
11. Riskin, E.A.: Optimal Bit Allocation via the Generalized BFOS Algorithm. IEEE Trans. on Information Theory, Vol. 37, No. 2, pp. 400-402 (1991)
12. Zeng, H., He, Q., Chen, Z., Ma, W.-Y., Ma, J.: Learning to Cluster Web Search Results. In: Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (2004)
13. Li, Z., Xie, X., Liu, H., Tang, X., Li, M., Ma, W.-Y.: Intuitive and Effective Interfaces for WWW Image Search Engines. In: Proceedings of the 12th Annual ACM International Conference on Multimedia (2004)
14. Tong, H., Li, M., Zhang, H.-J., Zhang, C., He, J., Ma, W.-Y.: Learning No-Reference Quality Metric by Examples. In: Proceedings of the 11th International Multimedia Modeling Conference (2005)
