A comparison of measures for visualising image similarity

Kerry Rodden, Wojciech Basalaj

David Sinclair, Kenneth Wood

University of Cambridge Computer Laboratory, Pembroke Street, Cambridge CB2 3QG, UK
{kr205,wb204}@cl.cam.ac.uk

AT&T Laboratories Cambridge, Trumpington Street, Cambridge CB2 1QA
{das,krw}@uk.research.att.com

(Challenge of Image Retrieval, Brighton, 2000)

Abstract

A low-level content-based measurement of image similarity can be used to create a visualisation of an image set, in which visually similar images are displayed close to each other. We are carrying out a series of experiments to evaluate the usefulness of this type of visualisation as an image browsing aid. So far, these experiments have used a complex image similarity measure that was designed for image retrieval, but we were interested in finding out if simpler measures could be just as effective. We created basic test collections of images from a widely available set of stock photographs, and used these to compare a number of different measurements of image similarity. Precision-recall graphs show the expected differences between the measures with regard to retrieval accuracy. However, these differences disappear in a comparison of their effectiveness for creating visualisations.

1 Introduction

Fully automatic methods of indexing and searching large collections of digital images are based only on the low-level visual features of the images, such as colour and texture. There is some evidence that visual properties are important to people when searching for images, but it can be difficult for users to express their requirements in a visual form, perhaps having to rely on drawing sketches, or specifying colour distributions. The results are likely to be less meaningful than those for a textual query, because of the low-level nature of the features used to judge similarity. It is thus very important that systems should provide good support for browsing of images.

In previous work, we have carried out experiments evaluating the usefulness of information visualisation techniques when applied to image browsing. This application was first suggested by Rubner [11], who did not carry out any experimental evaluation. A set of images can be laid out so that the user can see how they are related with regard to the similarity measure used. Users can then simply allow their gaze to move into the areas of the visualisation where the images resemble their “mental image”, instead of being forced to describe or sketch it. For example, we would expect that, when searching for aeroplanes, a user’s eye would be drawn to pictures containing a lot of blue sky.

To arrange a set of images, we first create a similarity matrix, measuring the content-based visual similarity of all pairs of images in the set. We then use multidimensional scaling (MDS) [2], a technique which treats inter-object dissimilarities as distances in some high-dimensional space, and then attempts to approximate them in a low-dimensional (commonly 2D or 3D) output configuration. Algorithms to perform MDS are available in many common statistical packages, although we have developed our own algorithm as part of separately published information visualisation research [1]. Once MDS has found a 2D configuration of points using the matrix, thumbnail versions of the images can be placed at these points, to produce a layout. Examples of these layouts can be found in Figures 4 and 5.

In our initial experiment [9], we showed that users generally find a specific, given image more quickly in a visualisation than in a random arrangement. This simulates a situation where the user is trying to find a particular image that he or she has seen before. Next, we chose to consider a situation where the user wishes to find images matching a general requirement. We assumed that no annotation of images was available, and thus it would only be possible to search for very general classes of image (such as “surfing” or “birds”), that do not depend on an individual user’s ability to recognise specific people or places. This corresponds to the non-unique, non-refined type of query described by Enser [4] in his analysis of requests submitted to a large photograph archive. We found that people were able to find relevant images more quickly in a set that is arranged according to visual similarity than in a set that is arranged randomly [8].


So far, we have used a fairly complex image similarity measurement (described in section 2.4), based on current research in image segmentation, and designed for use in an image retrieval system. Our experiments showed that it appeared to work well. However, we were interested in exploring other similarity measures, to see how effective they could be in comparison. In particular, because the first step in creating a visualisation is calculating the similarity matrix, where every image in the set is compared to every other image, we were interested to know if a simpler, quicker measurement could produce visualisations of an equally high quality.

Rogowitz et al. [10] implicitly make the assumption that there is a direct relationship between an image similarity measure’s retrieval performance and its effectiveness for creating visualisations, an assumption that we test in Section 6 of this paper. They created MDS arrangements of a single set of 97 images, using two different similarity measures, and compared these, by inspection only, to two other arrangements that were based on the similarity judgements of human assessors. They concluded that one of the similarity measures was closer to human perception than the other, but did not attempt any quantitative evaluation of the different arrangements, or test them as browsing tools in their own right.

It should perhaps be emphasised that in this paper we are concerned specifically with the problem of evaluating image similarity measures for use in constructing visualisations, rather than for retrieval, although our data on the latter is given for the sake of comparison.

2 Measuring image similarity

In order to quantify the similarity of a pair of images, we first need to extract representative features from the images, and then compare the similarity of those features with a suitable measure. Ideally, the features would correspond with those that a human observer might use in judging similarity. However, this would involve recognition of the objects in the images, and interpretation of their high-level meaning, a task which is presently very difficult to perform automatically. Instead, we can attempt an approximation, using low-level features that are easier to extract, such as colour or texture.

In this section we discuss the different features and corresponding similarity measures that we compared. Features are generally only computed once, and then stored. With a small collection, it may also be possible to store the pairwise dissimilarities of all images, but generally these must be computed each time a similarity matrix is created. There are, of course, many possible versions of (and alternatives to) the measures we compared: where possible, we adopted simple measures from existing image retrieval literature. In the case of the Earth Mover’s Distance (section 2.3), its inventor was using part of the same image collection, and kindly agreed to make his image features and similarity measure available. Puzicha et al. [7] offer a more comprehensive and systematic comparison of similarity measures.

2.1 Average colour

This is the simplest method of all, using only a single colour (the average hue, saturation, and value of all pixels in the image) to represent an entire image. The HSV colour space is cylindrical, with the hue represented by the angle around a circle (e.g. 0 degrees is red, 120 degrees is green, 240 degrees is blue). Distance from the centre of this circle represents saturation, with the most saturated colours at the edge of the circle. The long central axis represents value, from black to white. Measuring the similarity of the images is then just a question of measuring the Euclidean distance between their average colours in the HSV space (assuming that this is a reasonable approximation of their perceptual similarity). Because the space is cylindrical, the following formula must be used [13]:

$d(c_1, c_2) = \sqrt{(v_1 - v_2)^2 + (s_1 \cos h_1 - s_2 \cos h_2)^2 + (s_1 \sin h_1 - s_2 \sin h_2)^2}$

where $c_1$ and $c_2$ are colours in HSV space, and $h$, $s$, and $v$ are, respectively, the hue, saturation, and value of a colour.

This measure can be made to take some account of image composition by dividing the images into a grid of equal-sized parts, say 4 or 9, and then comparing corresponding parts of the two images, using the above measure. We can then take the mean of the resulting distances as the distance between the images.
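As a minimal sketch of this measure and its grid-partitioned variant (not the implementation used in our experiments), the distance could be computed as follows, assuming images are available as RGB arrays in [0, 1]; treating the average hue as a circular mean, and the use of NumPy and colorsys, are our assumptions:

```python
# Sketch of the average-HSV-colour measure of section 2.1; names are illustrative only.
import numpy as np
import colorsys

def average_hsv(image_rgb):
    """Mean hue (radians), saturation and value over all pixels of an RGB image in [0, 1]."""
    pixels = image_rgb.reshape(-1, 3)
    hsv = np.array([colorsys.rgb_to_hsv(*p) for p in pixels])
    h = hsv[:, 0] * 2 * np.pi                      # hue as an angle in radians
    # Average the hue as a circular quantity, via the mean resultant vector (our choice).
    mean_h = np.arctan2(np.sin(h).mean(), np.cos(h).mean()) % (2 * np.pi)
    return mean_h, hsv[:, 1].mean(), hsv[:, 2].mean()

def hsv_distance(c1, c2):
    """Euclidean distance between two HSV colours in the cylindrical space."""
    h1, s1, v1 = c1
    h2, s2, v2 = c2
    return np.sqrt((v1 - v2) ** 2
                   + (s1 * np.cos(h1) - s2 * np.cos(h2)) ** 2
                   + (s1 * np.sin(h1) - s2 * np.sin(h2)) ** 2)

def grid_distance(img1, img2, grid=3):
    """ahsv_9-style variant: compare average colours of corresponding grid cells."""
    def cells(img):
        rows = np.array_split(img, grid, axis=0)
        return [average_hsv(c) for r in rows for c in np.array_split(r, grid, axis=1)]
    return np.mean([hsv_distance(a, b) for a, b in zip(cells(img1), cells(img2))])
```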

2.2 Colour histograms

The main drawback of average colour is, of course, that it means losing all information about the large number of colours which may be present in an image, and their relative distributions. If the colours are quantised into a number of bins, histograms can be produced by counting the number of pixels in the image that fall into each bin. There are many possible quantisation schemes, but the simplest use a fixed number of bins of equal size. The dissimilarity between two histograms can then be measured, most commonly by comparing the contents of corresponding histogram bins.

For our work we chose the quantisation method outlined by Smith [13]. Hue is assumed to be the most significant characteristic of a colour, and thus receives the finest level of quantisation, with the 360 degrees of the hue circle split into 18 sections of 20 degrees. Saturation and value are assumed to be less important, and are each quantised to three levels. Pixels with no hue or saturation are treated as a special case, and split into 4 levels: black, dark grey, light grey, and white. Thus, there are 166 bins per histogram: 18 hues × 3 saturations × 3 values, plus 4 grey levels.

There are many different ways in which the resulting histograms can be compared in order to produce a single number representing the similarity of the corresponding images. Here we try one measure from each of the first three categories described by Puzicha et al. [7]: a heuristic histogram distance (the $L_1$ distance), a non-parametric test statistic ($\chi^2$), and an information-theoretic divergence (the Jeffrey divergence). It is also possible to make these measures take some account of image composition, using a similar method to that outlined at the end of the previous section.
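A minimal sketch of this 166-bin quantisation is given below, assuming pixels have already been converted to HSV; the threshold for treating a pixel as grey, and the grey-level boundaries, are our assumptions rather than Smith's exact values:

```python
# Sketch of the 166-bin HSV histogram described above: 18 hue x 3 saturation x 3 value
# bins, plus 4 grey bins. The "grey" threshold and grey-level cutoffs are assumptions.
import numpy as np

def hsv_histogram(hsv_pixels, grey_sat_threshold=0.05):
    """hsv_pixels: (N, 3) array with h in [0, 360), s and v in [0, 1]."""
    hist = np.zeros(18 * 3 * 3 + 4)
    h, s, v = hsv_pixels[:, 0], hsv_pixels[:, 1], hsv_pixels[:, 2]
    grey = s < grey_sat_threshold
    # Chromatic pixels: index = hue_bin * 9 + sat_bin * 3 + val_bin.
    hb = np.minimum((h[~grey] / 20).astype(int), 17)
    sb = np.minimum((s[~grey] * 3).astype(int), 2)
    vb = np.minimum((v[~grey] * 3).astype(int), 2)
    np.add.at(hist, hb * 9 + sb * 3 + vb, 1)
    # Grey pixels: 4 levels from black to white, stored in the last 4 bins.
    gb = np.minimum((v[grey] * 4).astype(int), 3)
    np.add.at(hist, 162 + gb, 1)
    return hist / len(hsv_pixels)          # normalise so the bins sum to 1
```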

2.2.1 $L_1$ distance

The $L_1$ distance is a simple and popular measure of the similarity of colour histograms. It involves adding up the differences between corresponding bins in the two histograms:

$d_{L_1}(H, K) = \sum_i |h_i - k_i|$

where $H$ and $K$ are histograms, and $h_i$ represents the value in bin $i$ of histogram $H$.

2.2.2 $\chi^2$

This statistic measures how unlikely it is that one distribution was drawn from the population represented by the other:

$d_{\chi^2}(H, K) = \sum_i \frac{(h_i - m_i)^2}{m_i}$

where $m_i = (h_i + k_i)/2$.

2.2.3 Jeffrey divergence

This is an information-theoretic divergence, measuring how compactly one distribution can be coded using the other one as the codebook [7]. It is calculated as follows:

$d_J(H, K) = \sum_i \left( h_i \log \frac{h_i}{m_i} + k_i \log \frac{k_i}{m_i} \right)$

where, as before, $m_i = (h_i + k_i)/2$.
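A minimal sketch of the three distances over normalised histograms follows; the small epsilon guarding empty bins is an implementation assumption, not part of the definitions above:

```python
# The three histogram distances of sections 2.2.1-2.2.3, as a sketch over
# normalised histograms h and k (NumPy arrays of equal length).
import numpy as np

def l1_distance(h, k):
    return np.abs(h - k).sum()

def chi2_distance(h, k, eps=1e-12):
    m = (h + k) / 2
    return (((h - m) ** 2) / (m + eps)).sum()

def jeffrey_divergence(h, k, eps=1e-12):
    m = (h + k) / 2
    return (h * np.log((h + eps) / (m + eps))
            + k * np.log((k + eps) / (m + eps))).sum()
```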


2.3 Colour signatures and EMD

The main fault of colour histograms is their inflexibility. They only compare directly corresponding bins (thus disregarding any information that could be gained from considering neighbouring bins of similar colour), and they are sensitive to the bin size chosen (if the bins are too small, then similar colours will be split across too many bins, but if they are too large each one will contain too many colours, losing discrimination). More adaptive methods have been suggested to get around these problems. Rubner [11] proposes using simple colour signatures, where each image is represented by a small number of colours (eight on average), with weights to indicate the importance of each colour in the image. Most existing similarity measures cannot be applied to signatures, so the Earth Mover’s Distance (EMD) was defined for this purpose. It reflects the minimal cost of transforming one signature into the other, and is a solution of a special case of the transportation problem from linear optimisation. Rubner also illustrates how it can be used in conjunction with MDS, but he did not carry out user evaluations, or compare EMD with other similarity measures for the purpose of creating visualisations. EMD is another of the measures tested by Puzicha et al. [7].
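The following is a minimal sketch of EMD between two colour signatures, posed directly as the transportation linear programme; it is not Rubner's implementation, and it assumes SciPy is available and that the weights of each signature sum to 1:

```python
# Sketch of the Earth Mover's Distance between two colour signatures, via the
# transportation LP. Assumes each signature's weights sum to 1; illustrative only.
import numpy as np
from scipy.optimize import linprog

def emd(sig1, sig2):
    """sig1, sig2: lists of (weight, colour) pairs, colours as 3-vectors."""
    w1 = np.array([w for w, _ in sig1])
    w2 = np.array([w for w, _ in sig2])
    c1 = np.array([c for _, c in sig1], dtype=float)
    c2 = np.array([c for _, c in sig2], dtype=float)
    m, n = len(w1), len(w2)
    # Ground distances between every pair of signature colours.
    cost = np.linalg.norm(c1[:, None, :] - c2[None, :, :], axis=2).ravel()
    # Equality constraints: each source colour ships its weight, each sink is filled.
    A_eq = np.zeros((m + n, m * n))
    for i in range(m):
        A_eq[i, i * n:(i + 1) * n] = 1          # flow leaving colour i of sig1
    for j in range(n):
        A_eq[m + j, j::n] = 1                   # flow arriving at colour j of sig2
    b_eq = np.concatenate([w1, w2])
    res = linprog(cost, A_eq=A_eq, b_eq=b_eq, bounds=(0, None), method="highs")
    return res.fun                              # minimal total transport cost
```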

2.4 Image segmentation

All of the measures we have discussed so far use global image features, although much current work in computer vision focuses on segmenting images into coherent regions. When constructing a query, this allows users to select individual regions from an image, rather than using the whole image. It is also possible to incorporate region information into image features, which can then be compared with a pairwise image similarity measure. One such measure, IRIS (for Image Regions In Summary), has been defined by Sinclair, based on his unsupervised image segmentation scheme [12], and it is this that we have used in our work to date. It is designed to reflect both global image properties and the broad spatial layout of regions in the image.

An image is segmented into regions with broadly homogeneous colour properties. Regions have descriptors for colour, colour variance, area, shape, location and texture. Regions are then classified as either large (with an area greater than 0.1% of the total image area) or small. The large regions are further classified as either textured or smooth, and the small regions as regularly or irregularly shaped. An image summary is then created as follows. The image is partitioned into nine equal areas, in the obvious way. Sets of four global colour histograms (for large smooth regions, large textured regions, small regular regions, and small irregular regions) are made for each of these nine areas, and the largest (dominant) region in each of the nine areas is recorded. The histograms are normalised by image area.

The $\chi^2$ statistic is used to determine the distance between like colour histograms across image pairs. The distance between dominant regions in corresponding ninths of each image is given by the Mahalanobis distance in RGB colour space. The final measure is then a weighted sum of the above distances.
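The exact descriptors and weights used by IRIS are those of [12] rather than anything given here. Purely as an illustration of how the pieces described above combine, a rough sketch with placeholder descriptors and weights (not the published IRIS parameters) might look like this:

```python
# Rough sketch of a region-summary comparison in the style described above.
# The summary structure, weights and Mahalanobis covariance choice are placeholders.
import numpy as np

def iris_like_distance(summary1, summary2, w_hist=1.0, w_region=1.0):
    """Each summary: one dict per image ninth, with four histograms keyed by region
    class, plus the mean RGB colour and covariance of the dominant region."""
    total = 0.0
    for a1, a2 in zip(summary1, summary2):
        for key in ('large_smooth', 'large_textured', 'small_regular', 'small_irregular'):
            h, k = a1['hist'][key], a2['hist'][key]
            m = (h + k) / 2
            total += w_hist * (((h - m) ** 2) / (m + 1e-12)).sum()   # chi-squared
        # Mahalanobis distance between dominant-region colours in RGB space,
        # here using the average of the two region covariances (an assumption).
        diff = a1['dominant_mean'] - a2['dominant_mean']
        cov = (a1['dominant_cov'] + a2['dominant_cov']) / 2
        total += w_region * float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
    return total
```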

3 Simple image test collections

Traditionally, information retrieval evaluations are based on a test collection containing thousands of documents. Such a collection has a fixed set of queries, with relevance judgements indicating which of the items in the collection are relevant to each query. This can then be used to compare the retrieval performance of different techniques and algorithms. There is a long history of using test collections in text retrieval research, but there is no such collection for image retrieval. The test collection approach is, of course, not particularly helpful for evaluating the usefulness of interactive retrieval systems (for example, see Jose et al. [5] for a discussion), mainly because the queries in the collection cannot hope to cover all aspects of the context in which a retrieval system may be used. In this paper, we use a simple test collection approach as a convenient and cheap complement to our ongoing experiments with users.

A problem in creating a test collection is that of relevance judgements: it is necessary for human judges to determine whether or not an item is relevant to a given query or description. The most common form of query seen in research papers on content-based image retrieval is “query by example”, where the user selects an image as a representation of his or her requirement, and the system returns images it measures to be similar to the exemplar.


Stock photograph collections are usually given some form of categorisation, and this should provide us with a ready-made set of relevant items for a query-by-example: all images in the same category as the exemplar should be relevant to it when used as a query. If we restrict the collection to only contain selected categories, we can ensure that there is little overlap between them, thus increasing our confidence that only the other images in the category will be relevant. We therefore took volumes I and II of the Corel Stock Photograph Collection (sold on a set of CD-ROMs), and chose 20 categories from each, to make up our two mini-collections (described from now on as Corel 1 and Corel 2). We also ensured that all 100 of the images in each of our chosen categories matched the category name. Furthermore, we chose “generic” categories, so that it should be possible for anyone to identify by sight whether or not an image matches the category name, without any specialist knowledge. This means that the “relevance judgement” is as unambiguous as possible, and does not rely on the presence of annotation. Table 4 lists the selected categories in each of the two mini-collections. Each of the mini-collections was used separately, rather than combining them into a single collection, because there is overlap between their categories (e.g. both have a “flowers” category). It is obvious that these collections are not at all “realistic”: for example, one would not generally expect 99 items (5% of the collection) to be relevant to every query. However, they should provide a useful basis for comparison of similarity measures, as we will be able to see how well each measure can separate the images in the relevant category from all of the others.

4 Comparing similarity measures for retrieval

Each of the 2000 images in a mini-collection (Corel 1 or Corel 2) was used in turn as a “query by example”, ranking the remaining images from its collection in order of their similarity to the query image. This process was repeated for each of the similarity measures under consideration. An ideal measure would place the 99 relevant images (the rest of the query image’s category) at the top of the ranking, followed by the rest of the mini-collection.

We can compare the performance of each measure by calculating the conventional information retrieval evaluation figures of recall and precision. Recall is the proportion of the relevant images in the collection that are retrieved for a particular query, and precision is the proportion of the images retrieved that are relevant to the query. Each of these can be measured at any cutoff point along the ranking; most commonly, precision is calculated at a number of standard levels of recall (0.0, 0.1, 0.2, . . . , 1.0). This allows the construction of a graph plotting precision against recall, usually averaged across all of the queries in the collection. In conventional textual information retrieval evaluations, each query has a different number of relevant documents, and so it is impossible for all of the standard recall levels to be achieved exactly (for example, if a query has 6 relevant documents, the only recall levels possible are 0.167, 0.333, . . . ). It is therefore usual to interpolate the precision figures over standard recall ranges, so that, for example, the figure at 0.0 recall actually incorporates all of the precision values for recall between 0.0 and 0.1. In our collections, however, we have the same number of relevant documents (99) for every query, and so the graph in Figure 1 simply plots precision against recall at each relevant document.

Table 1 gives the average precision (across all recall levels) for each of the measures, in each collection. The average colour measure (ahsv) clearly performs worst. Partitioning the images into a grid of 4 equal areas (ahsv 4) improves performance, and using 9 areas (ahsv 9) improves it again, but 16 areas (ahsv 16) are only marginally better than 9, while needing a larger number of features (and hence calculations). The three histogram-based measures are next, with similar performance, but Jeffrey divergence (hhsv jd) performs slightly better than $\chi^2$ (hhsv chi), which in turn is better than $L_1$ (hhsv l1). We also tried histogram-based methods on images partitioned into grids of equal area, but this produced little or no improvement, especially considering the much higher number of features (and calculations) required. EMD (emd), for which we only have Corel 1 data, performs surprisingly badly, doing approximately as well as the histogram-based measures. The IRIS measure (iris) clearly performs best.

It should perhaps be noted that the features for the IRIS measure were created from 768×512 images, while all of the average colour and histogram features were calculated from 96×64 thumbnails, which may be what gives IRIS some of its advantage. To test this, we repeated our calculations for hhsv jd using full-size images, and found that it did give a slight improvement over thumbnails (e.g. the average precision for Corel 2 rose from 0.243 to 0.251), but was still quite far behind iris.
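A minimal sketch of this evaluation procedure is given below, assuming a precomputed dissimilarity matrix and NumPy arrays of category labels (the names and data structures are illustrative, not those of our experimental code):

```python
# Sketch of the retrieval evaluation described above: every image queries the rest
# of its mini-collection, and precision is recorded at each relevant image.
import numpy as np

def precision_at_each_relevant(dissimilarity, category, query):
    """dissimilarity: (N, N) matrix; category: length-N label array; query: index."""
    others = np.array([i for i in range(len(category)) if i != query])
    ranked = others[np.argsort(dissimilarity[query, others])]
    relevant = (category[ranked] == category[query])
    ranks = np.flatnonzero(relevant) + 1                  # 1-based rank of each hit
    hits = np.arange(1, len(ranks) + 1)
    return hits / ranks                                   # precision at each relevant image

def average_precision_over_queries(dissimilarity, category):
    per_query = [precision_at_each_relevant(dissimilarity, category, q).mean()
                 for q in range(len(category))]
    return float(np.mean(per_query))
```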


             ahsv    ahsv 4  ahsv 9  ahsv 16  hhsv l1  hhsv chi  hhsv jd  emd     iris
Corel 1      0.189   0.210   0.223   0.225    0.239    0.253     0.258    0.246   0.283
Corel 2      0.171   0.210   0.222   0.225    0.229    0.239     0.243    –       0.282

Table 1: Average precision figures for the two collections, over the nine similarity measures we considered. For emd, we only have Corel 1 data.


Figure 1: The precision-recall graph for the Corel 1 mini-collection. The hhsv jd and hhsv l1 series have been omitted for clarity, due to their closeness to hhsv chi and emd. The ahsv 16 series has also been omitted, due to its closeness to ahsv 9. The graph for Corel 2 is very similar, except that we do not have Corel 2 data for emd.

For the sake of interest, Table 4 shows average precision figures for the individual categories in the two collections. IRIS does best for most of the categories, although in a number of cases it is outperformed by other measures. It is clear that some queries are much easier than others: as one might expect, the “easy” queries are those for categories where the images are most visually homogeneous (e.g. Arabian Horses, where all of the pictures appear to have been taken in the same field), and dissimilar to the images in other categories (e.g. Divers & Diving, where all of the images contain an uncommon shade of blue).

5 Calculating a similarity matrix

Before we can apply MDS, we need to calculate a similarity matrix, containing the pairwise similarities of all images in the set to be visualised. If the set consists of query results, then the similarity matrix cannot be precomputed, and will have to be created “on the fly”. The number of calculations needed is $n(n-1)/2$, where $n$ is the number of images in the set. Table 2 shows approximately how quick each of the measures is to calculate, relative to the fastest, ahsv. We used a 296MHz UltraSparc-II processor, on which it took 0.02 seconds of CPU time to compute a 100-image similarity matrix using ahsv, but approximately 4.6 seconds on average using emd.


                 ahsv   ahsv 4  ahsv 9  ahsv 16  hhsv l1  hhsv chi  hhsv jd  emd    iris
relative time    1      3.8     8.6     15       8.8      10        24       230    16

Table 2: The approximate time needed to create a similarity matrix, given relative to ahsv, for each of the nine measures.

From these figures, it seems clear that EMD is likely to be impractical for creating layouts on the fly. ahsv 9 is faster than ahsv 16, for about the same retrieval performance, and similarly, hhsv chi is more than twice as fast as hhsv jd, for similar performance. For a small collection, of course, we may have enough disk and/or memory to hold the pairwise similarities of all images in the collection, making timing differences irrelevant. We may also need to take into account the initial time taken to compute features (the more complex the features, the longer they take to compute), but these only need to be worked out once. Another consideration is how big the features are, since this will determine how long they take to read into memory, and how much memory is needed.
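A minimal sketch of the matrix-building step itself, for precomputed per-image features and an arbitrary pairwise distance function (names are illustrative only):

```python
# Sketch of building the pairwise dissimilarity matrix needed before MDS:
# n(n-1)/2 comparisons, exploiting symmetry.
import numpy as np

def dissimilarity_matrix(features, distance):
    """features: list of per-image feature objects; distance: pairwise function."""
    n = len(features)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):            # n(n-1)/2 pairs; the matrix is symmetric
            d[i, j] = d[j, i] = distance(features[i], features[j])
    return d
```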

6 Comparing similarity measures for visualisation

Having observed fairly large differences between the similarity measures when used in the context of retrieval, we expected that these differences would remain in the context of visualisation: that the best measures for retrieval would produce the best visualisations. To test this, we needed a way of objectively evaluating the quality of visualisations, to measure how successful MDS has been at placing similar images close to one another. We adopted the method described by Leouski and Allan [6] of attempting to quantify how well the visualisation clusters together relevant images, and separates them from non-relevant images.

The conventional recall and precision measures can be transferred to visualisations by assuming that the user’s starting point is a single relevant image, and treating this as the query image. Then, instead of using the similarity measure to rank the other images, they are simply ranked according to their 2D Euclidean distance from the query image in the visualisation. Spatial precision is therefore defined as the proportion of relevant images within a given radius of the query image (using the image centres when measuring distances). This is illustrated in Figure 2. If we use the distances from the query image to each relevant image as the radii, this gives the spatial precision at different levels of recall. Average spatial precision is then the two-dimensional analogue of average precision: spatial precision averaged across all levels of recall, and all images in the layout.

Therefore, to assess how the similarity measures perform when forced into two dimensions, we can use each of them to create 2000-image similarity matrices for both of the test collections. Once we have applied MDS to these, producing 2D configurations, we can repeat the process described in section 4: using each image in the set in turn as the query, work out the spatial precision at each relevant image. Then, the average spatial precision of the whole visualisation, over all possible query images, can be calculated. Figure 3 and Table 3 show the results. Although there are wide differences between the similarity measures in conventional IR terms, when used in conjunction with MDS the differences are very small.

To look for an explanation of this, we have to consider MDS in more detail. The incremental MDS method we use is based upon least squares metric MDS [1]. It emulates a spring system with anchor points, one for each data object, and a spring between every pair of points. The relaxed length of a spring connecting two points is given by the dissimilarity between the corresponding pair of objects. Each spring has energy associated with it, and the loss function is a sum of these energies for all springs:

$E = \sum_{i<j} \frac{(d_{ij} - \delta_{ij})^2}{\delta_{ij}^2}$

where $d_{ij}$ is the Euclidean distance between points $x_i$ and $x_j$ in the two-dimensional layout, and $\delta_{ij}$ is the (original) dissimilarity between objects $i$ and $j$.
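As a minimal sketch (the published incremental algorithm [1] is considerably more sophisticated), a loss of this kind can be minimised by plain gradient descent; the per-spring weighting, step size and iteration count below are illustrative assumptions:

```python
# Sketch of least-squares metric MDS by gradient descent on a spring-style loss;
# not the incremental algorithm used in the paper.
import numpy as np

def mds_2d(delta, iterations=500, step=0.01, seed=0):
    """delta: (n, n) symmetric dissimilarity matrix with zero diagonal."""
    rng = np.random.default_rng(seed)
    n = delta.shape[0]
    x = rng.normal(size=(n, 2))                      # random initial 2D layout
    weight = 1.0 / np.maximum(delta, 1e-12) ** 2     # 1/delta^2 weighting of each spring
    np.fill_diagonal(weight, 0.0)
    for _ in range(iterations):
        diff = x[:, None, :] - x[None, :, :]         # pairwise displacement vectors
        d = np.linalg.norm(diff, axis=2)
        np.fill_diagonal(d, 1.0)                     # avoid division by zero
        coeff = weight * (d - delta) / d             # derivative of loss w.r.t. distance
        np.fill_diagonal(coeff, 0.0)
        grad = 2 * (coeff[:, :, None] * diff).sum(axis=1)
        x -= step * grad
    return x
```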


Figure 2: An illustration of spatial precision in two dimensions. The dark grey rectangle in the centre represents the query image, and the light grey rectangles represent images that are relevant to the query. The white rectangles are non-relevant images. Moving outwards from the centre, the dashed circles represent increasing levels of recall: at the closest relevant image, we see that 1 non-relevant image also falls within the circle, making the spatial precision 0.5 at $r = 1$ (where $r$ is the number of relevant images found so far). At $r = 2$ there are 3 non-relevant images within the circle, giving a spatial precision of 2/5 = 0.4. Finally, at $r = 3$, spatial precision is 3/8 = 0.375. The average spatial precision of the arrangement for this query image is therefore (0.5 + 0.4 + 0.375)/3 = 0.425.
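A minimal sketch of average spatial precision over a 2D layout, assuming arrays of image centre positions and category labels as inputs (names are illustrative only):

```python
# Sketch of spatial precision as described above and worked through in Figure 2:
# other images are ranked by Euclidean distance from the query image's centre.
import numpy as np

def average_spatial_precision(positions, relevant, query):
    """positions: (N, 2) image centres; relevant: boolean mask; query: index."""
    others = np.array([i for i in range(len(positions)) if i != query])
    dist = np.linalg.norm(positions[others] - positions[query], axis=1)
    ranked = others[np.argsort(dist)]
    hit = relevant[ranked]
    ranks = np.flatnonzero(hit) + 1
    precision = np.arange(1, len(ranks) + 1) / ranks
    return precision.mean()

def layout_asp(positions, categories):
    """Average spatial precision over every image in the layout used as query."""
    asp = [average_spatial_precision(positions, categories == categories[q], q)
           for q in range(len(positions))]
    return float(np.mean(asp))
```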


Figure 3: Spatial precision against “recall” for Corel 1. Again, some similarity measures are omitted for clarity. The differences between the measures have almost disappeared.


                       ahsv    ahsv 9  emd     hhsv chi  hhsv jd  iris
Corel 1   ASP          0.161   0.154   0.159   0.158     0.162    0.148
          ASP/AP       0.852   0.691   0.646   0.625     0.628    0.523
          MDS energy   0.023   0.054   0.056   0.101     0.096    0.107
Corel 2   ASP          0.154   0.154   –       0.145     0.149    0.138
          ASP/AP       0.901   0.694   –       0.607     0.613    0.489
          MDS energy   0.016   0.048   –       0.105     0.098    0.106

Table 3: Average spatial precision (ASP) figures for selected similarity measures, across the two collections, as well as MDS energy values for the corresponding 2D layouts. If different MDS runs are used, these figures may vary by up to about 0.001. The table also gives the ratio of ASP to the original average precision (AP), thus showing how much of the “value” of the original measure is retained in the 2D layout. In effect, this gives an indication of how closely the final 2D layout reflects the positions of the points in the original high-dimensional configuration.

Table 3 shows the energy of the visualisations produced using the different similarity measures for the two test collections. The energy value we report is an average across all of the springs, i.e. $2E / (n(n-1))$. The simplest measure, ahsv, is three-dimensional, and so a reduction to two dimensions does not result in much error, giving its configurations a low level of energy. However, more complex similarity measures tend to have a higher dimensionality, thus producing configurations with more energy (and error). It appears that the more complex measures lose whatever advantage they have in high-dimensional space when forced into two dimensions by MDS. Thus, the best similarity measures for use with MDS will be those which make a good tradeoff between effectiveness and dimensionality.

7 Reconsidering experiment results

In order to examine how accurate an indicator the ASP measure is of visualisation quality, we decided to apply it to the data from our second user experiment (as mentioned in the Introduction, and described in more detail in [8]). We used sets of 100 images drawn from the Corel 2 collection, displaying them to participants and asking them to find as many images as they could to match a particular textual description (taken from one of the category names). Each set contained between 3 and 7 images from each of the 20 categories, and was either arranged with MDS (using the IRIS measure), or arranged randomly. Participants were asked to be as quick as possible, and to press a button marked “done” when they were ready to move on to the next search.

When analysing the results, we considered several timings as dependent variables. In general, participants were significantly faster at the task when selecting from the MDS layouts: for time taken up to and including the selection of the last image, average time to make a selection, and time taken to press the “done” button. For the time taken to select the first image, MDS had a small but non-significant advantage, suggesting that most of the benefit of the MDS layout comes after the first relevant image has been found.

Applying the average spatial precision measurement to these layouts gives a mean value of 0.168 for MDS, and 0.071 for random. Repeating our analysis using ASP value as an independent variable, we found that it was significant for the same dependent variables as layout type (a higher ASP value meant a faster time), and was not significant for the time of the first selection. This seems to confirm that the benefit of the MDS layouts is the clustering together of relevant images: once one relevant image has been found, the others can be found more quickly because they are close to it. When these layouts are recreated using the ahsv measure, the ASP values are comparable to those of the original IRIS layouts, meaning that in an experiment we would expect ahsv-based layouts to offer a similar advantage over random layouts.

Figure 4 shows an example of a 100-image layout created with ahsv, and Figure 5 shows the same set of images, arranged using iris. The two layouts look rather similar, although the ahsv layout is more tightly clustered, leading to more overlap between thumbnails. In a situation where this is undesirable, the configuration can be forced into a more regular, grid-like structure [9]. It is also possible to use a form of MDS that is constrained to place the points so that they lie on a grid.


Alternatively, the overlap problem can be solved by user interface tools that automatically bring hidden thumbnails to the front as the mouse pointer is moved over them.

8 Future work

The conclusions drawn in this paper are with respect to a particular type of query, the “non-unique, non-refined” type identified by Enser [4]. However, only 6% of the requests he sampled fell into this category. Far more common were requests for specific people or places, which at present can only be answered if the images have associated annotations. We are interested in creating layouts based on textual annotations, and exploring how these might be used as a complement to layouts based on visual similarity.

Also, we have explored only one aspect of the utility of similarity-based visualisations: the fact that they can cluster together similar items, separating them from dissimilar items. The average spatial precision measure seems to quantify this fairly well. However, a good visualisation should also help the user to gain an overview of the dataset in question. Depending on the user’s current task, an arrangement based on semantic similarity may make less visual “sense” as a whole than an arrangement based on a simple measure like average colour. At present we are developing a study based on a simulated work task situation [3], to examine how different MDS arrangements of an image set might be useful to a graphic designer for a picture selection task.

We are also interested in using MDS to arrange the results of a textual or visual query, comparing this to the more conventional ranked list. Given the findings reported in this paper, it would seem sensible to use an advanced image similarity measure (such as IRIS) for a visual query, but a simple one (such as average HSV colour) for creating a visualisation of the results.

9 Conclusions

We can identify differences in performance among image similarity measures by using them to simulate the retrieval of results for a “query by example” from a simple test collection. However, these differences disappear when the measures are used in conjunction with multidimensional scaling (MDS) to create visualisations of image sets. Measures that are more effective for retrieval tend to be more complex, and thus lose their advantage over the simpler measures when forced into two dimensions. As the latter are easier to implement and quicker to calculate, it seems that they should be favoured for the creation of visualisations, although testing with users and with other image collections is needed to confirm this. A re-examination of the results of an earlier experiment shows that the average spatial precision measure appears to be an accurate predictor of the quality of a visualisation.

Acknowledgements

Thanks to Yossi Rubner of Stanford University for making the EMD measure available, and providing us with colour signatures for the first part of Corel’s stock photograph collection. Kerry Rodden’s work was supported by grants from AT&T Laboratories Cambridge, and the BFWG Charitable Foundation. Wojciech Basalaj is supported by Trinity College, Cambridge, and the Overseas Research Students awards scheme.

References

[1] Wojciech Basalaj. Incremental multidimensional scaling method for database visualization. In Visual Data Exploration and Analysis VI, volume 3643 of Proceedings of SPIE, pages 149–158, January 1999.
[2] I. Borg and P. Groenen. Modern Multidimensional Scaling. Springer-Verlag, New York, 1997.
[3] Pia Borlund and Peter Ingwersen. The development of a method for the evaluation of interactive information retrieval systems. Journal of Documentation, 53(3):225–250, 1997.
[4] P. G. B. Enser. Query analysis in a visual information retrieval context. Journal of Document and Text Management, 1(1):25–52, 1993.
[5] Joemon M. Jose, Jonathan Furner, and David J. Harper. Spatial querying for image retrieval: a user-oriented evaluation. In Proceedings of SIGIR'98, pages 232–241. ACM, August 1998.
[6] Anton Leouski and James Allan. Evaluating a visual navigation system for a digital library. In Proceedings of the Second European Conference on Research and Technology for Digital Libraries, pages 535–554, 1998.
[7] Jan Puzicha, Yossi Rubner, Carlo Tomasi, and Joachim M. Buhmann. Empirical evaluation of dissimilarity measures for color and texture. In International Conference on Computer Vision, pages 1165–1173. IEEE, September 1999.
[8] Kerry Rodden, Wojciech Basalaj, David Sinclair, and Kenneth Wood. Evaluating a visualisation of image similarity. In Proceedings of SIGIR'99, pages 275–276. ACM, August 1999. Poster.
[9] Kerry Rodden, Wojciech Basalaj, David Sinclair, and Kenneth Wood. Evaluating a visualisation of image similarity as a tool for image browsing. In IEEE Symposium on Information Visualization, pages 36–43. IEEE, October 1999.
[10] Bernice E. Rogowitz, Thomas Frese, John R. Smith, Charles A. Bouman, and Edward Kalin. Perceptual image similarity experiments. In Human Vision and Electronic Imaging III, volume 3299 of Proceedings of SPIE, pages 576–590, January 1998.
[11] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for distributions with applications to image databases. In Proceedings of the IEEE International Conference on Computer Vision. IEEE, January 1998.
[12] David Sinclair. Voronoi seeded colour image segmentation. Technical Report 1999.3, AT&T Laboratories Cambridge, 1999.
[13] John R. Smith. Integrated Spatial and Feature Image Systems: Retrieval, Analysis and Compression. PhD thesis, Columbia University, 1997.


Table 4: The categories used in the two test collections, with their average precision when used as queries, for each of five similarity measures. The top half of the table is the Corel 1 mini-collection, and the bottom half is Corel 2. We do not have Corel 2 data for emd.

No.   Category Title               ahsv    ahsv 9  emd     hhsv chi  iris
1     Sunrises and Sunsets         0.132   0.125   0.135   0.160     0.203
3     World War II Planes          0.251   0.486   0.393   0.416     0.588
8     Birds                        0.071   0.080   0.086   0.092     0.084
13    Flowers Volume II            0.093   0.086   0.132   0.161     0.142
22    Bridges                      0.115   0.134   0.137   0.145     0.145
29    Exotic Cars                  0.157   0.140   0.183   0.170     0.145
31    Residential Interiors        0.172   0.226   0.278   0.273     0.397
52    Butterflies                  0.186   0.201   0.184   0.172     0.294
73    Firework Photography         0.301   0.302   0.523   0.567     0.580
91    Fruits & Vegetables          0.120   0.148   0.134   0.120     0.178
100   Bears                        0.114   0.128   0.140   0.120     0.112
104   North American Deer          0.104   0.133   0.164   0.144     0.118
107   Elephants                    0.185   0.300   0.276   0.191     0.228
108   Tigers                       0.160   0.188   0.224   0.288     0.310
110   Wolves                       0.106   0.189   0.156   0.153     0.178
113   Arabian Horses               0.466   0.595   0.572   0.553     0.561
156   Divers & Diving              0.489   0.353   0.552   0.649     0.640
172   Action Sailing               0.235   0.261   0.254   0.266     0.360
181   Models                       0.181   0.198   0.197   0.235     0.187
184   Ice & Icebergs               0.144   0.193   0.196   0.182     0.221
      overall average precision    0.189   0.223   0.246   0.253     0.283

208   Fungi                        0.271   0.334   –       0.325     0.341
209   Fish                         0.124   0.137   –       0.188     0.198
221   Flowers Close-up             0.241   0.241   –       0.204     0.283
225   Freestyle Skiing             0.187   0.262   –       0.262     0.323
240   Arthropods                   0.121   0.123   –       0.135     0.217
268   African Birds                0.090   0.105   –       0.091     0.108
273   Performance Cars             0.157   0.234   –       0.204     0.279
300   Surfing                      0.216   0.228   –       0.400     0.368
314   Dolphins and Whales          0.176   0.374   –       0.244     0.385
317   Whitetail Deer               0.148   0.199   –       0.234     0.252
320   Victorian Houses             0.141   0.244   –       0.373     0.390
326   Wildcats                     0.088   0.105   –       0.139     0.135
329   Hot Air Balloons             0.073   0.098   –       0.145     0.194
332   Fabulous Fruit               0.187   0.167   –       0.189     0.169
338   Sailing                      0.193   0.329   –       0.343     0.359
345   Sunsets Around The World     0.204   0.177   –       0.250     0.311
351   Trains                       0.128   0.245   –       0.198     0.305
359   Aviation Photography 2       0.187   0.263   –       0.148     0.311
364   Kitchens and Bathrooms       0.348   0.411   –       0.530     0.512
388   Women In Vogue               0.143   0.160   –       0.182     0.191
      overall average precision    0.171   0.222   –       0.239     0.282


Figure 4: 100 images from the Corel 2 mini-collection, arranged with ahsv.

Figure 5: The same 100 images from Corel 2, arranged with iris.
