REFINING A REGION BASED ATTENTION MODEL USING EYE TRACKING DATA

Zhen Liang1*, Hong Fu1,2, Zheru Chi1, and Dagan Feng1,3
1 Centre for Multimedia Signal Processing, Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, China
2 Department of Computer Science, Chu Hai College of Higher Education, Hong Kong
3 School of Information Technologies, The University of Sydney, Sydney, Australia
[email protected], [email protected], [email protected], [email protected]

ABSTRACT

Computational visual attention modeling is a topic of increasing importance in machine understanding of images. In this paper, we present an approach to refining a region based attention model with eye tracking data. This paper makes three main contributions. (1) The concept of a fixation mask is proposed to describe the region saliency of an image by weighting the segmented regions with importance measures obtained from the Human Visual System (HVS) or from computational models. (2) A Genetic Algorithm (GA) scheme for refining a region based attention model is proposed. (3) An evaluation method is developed to measure the correlation between the result of a computational model and that of the HVS in terms of fixation masks.

Index Terms— Visual attention model, eye tracking data, genetic algorithm, fixation mask, regions of interest

1. INTRODUCTION

The Human Visual System (HVS) selects the most important and informative portions of a huge amount of input information for further analysis. This selective procedure is usually called visual attention; it guides eye movements toward the Regions of Interest (ROIs) in the visual scene [1]. The selection mechanism has two stages: a bottom-up stage driven by low-level visual properties and a top-down stage guided by humans' high-level understanding. Much research on visual attention modeling focuses on imitating the bottom-up visual attention procedure by integrating visual features into a hierarchical perceptual interpretation process [2-9]. Building on the bottom-up computational attention model proposed in [10], Itti et al. proposed a multi-resolution and multi-feature system to mimic biological vision [11]. Furthermore, Aziz and Mertsching introduced more features (color, symmetry, size, eccentricity, and orientation) into an attention model to obtain a better saliency map [6].
A region based attention model was proposed by Fu et al. and incorporated into a Content-Based Image Retrieval (CBIR) system, improving the retrieval performance of the system [2]. However, bottom-up attention models still leave room for improvement, for example in feature selection, feature fusion, and priority determination. A computational bottom-up attention model may fail to produce some ROIs if they are not salient in terms of their visual features yet contain important semantic information. Thus, the top-down process does play a role in the selection mechanism, and a computational model can be improved by combining low-level and high-level features. Under natural viewing conditions, eye movements are guided by both the bottom-up and top-down mechanisms [1], and longer and more frequent fixations fall on the objects (ROIs) of a scene [12-15]. Hence, for analyzing the top-down process, eye tracking provides a more suitable, convenient, and unobtrusive way to understand a viewer's intention than asking him/her to manually select the ROIs. It has also been demonstrated that eye movements mostly land in regions with high visual attention scores [16], and that eye tracking data are useful for salient feature selection and attention model construction [17,18].

In this paper, the region based attention model proposed by Fu et al. [2] is refined with eye tracking data using a Genetic Algorithm (GA) (Fu et al.'s model is called the attention-driven model hereafter). The performance of the refined attention-driven model is evaluated and compared with that of the original one.

The paper is organized as follows. The experimental setup is explained in Section 2. Section 3 describes the attention-driven model. Section 4 introduces the concept of the fixation mask, with example fixation masks of human beings and of the attention-driven model. Section 5 presents the GA refinement strategy. Experimental results are reported in Section 6. Finally, concluding remarks are drawn in Section 7.

2. EYE TRACKING DATA ACQUISITION

An image dataset of 100 natural color images, selected from the Hemera color image database, is used in our experiments. The dataset includes eight categories: people, buildings, animals, landscapes, tools, vehicles, fruits, and arts.
Example images are shown in Fig. 1. All images have a resolution of 1920×1200 pixels in 32-bit color mode. In the experiment, training and test images are randomly selected from the dataset without overlap.

A non-intrusive table-mounted eye tracker, a Tobii X120, is used to collect eye tracking data in a user-friendly environment, with an angular accuracy of 0.5 degrees and a sample rate of 120 Hz. Before the data collection for each participant, a calibration is conducted on a grid of nine calibration points to minimize eye tracking errors. Fixations (locations and durations) are extracted from the raw eye tracking data using a fixation-radius criterion of 35 pixels and a minimum fixation duration of 100 ms. In the experiment, 16 participants (6 females and 10 males) are invited to observe the 100 digitized images of natural scenes under a free-viewing condition. All the participants have normal or corrected-to-normal vision and are naive to eye tracking. Participants sit comfortably at a viewing distance of around 68 cm in front of a standard 24-inch computer screen used to display the stimuli. The corresponding subtended visual angle of the stimulus presentation is about 41.5º × 26.8º. Each image is presented exactly once to each participant with a 5-second non-task observing time.
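The fixation extraction criterion described above (35-pixel radius, 100 ms minimum duration) can be sketched with a simple dispersion-style filter. This is a minimal illustration: the (x, y, t) sample format and the function name are assumptions, not the actual Tobii processing pipeline used in the experiment.

```python
import math

def extract_fixations(samples, radius=35.0, min_duration=0.1):
    """samples: list of (x, y, t) gaze points, t in seconds.
    Returns a list of (cx, cy, duration) fixations."""
    fixations = []
    cluster = []
    for x, y, t in samples:
        if cluster:
            # Current cluster centroid.
            cx = sum(p[0] for p in cluster) / len(cluster)
            cy = sum(p[1] for p in cluster) / len(cluster)
            # A sample farther than the radius closes the cluster.
            if math.hypot(x - cx, y - cy) > radius:
                duration = cluster[-1][2] - cluster[0][2]
                if duration >= min_duration:
                    fixations.append((cx, cy, duration))
                cluster = []
        cluster.append((x, y, t))
    if cluster:  # flush the final cluster
        duration = cluster[-1][2] - cluster[0][2]
        if duration >= min_duration:
            cx = sum(p[0] for p in cluster) / len(cluster)
            cy = sum(p[1] for p in cluster) / len(cluster)
            fixations.append((cx, cy, duration))
    return fixations
```

At a 120 Hz sample rate, the 100 ms threshold corresponds to roughly 12 consecutive gaze samples inside the 35-pixel radius.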

Figure 1: Example images used in our experiments.

3. ATTENTION-DRIVEN MODEL

In the attention-driven model, an iterative process driven by the human attention mechanism pops out perceptually attended objects sequentially from an image based on three visual features: boundary color, region color, and region texture. The attention value is defined as the feature difference between a merged region and its surroundings. Regions with large attention values are popped out as attended objects, and the remaining segments are treated as the background. An assumption made here is that the three features contribute equally to characterizing objects/regions.

For image retrieval, an important factor is used to indicate the importance of an object or of the background of an image [2]. Important factors are generated from attention values as follows. Suppose that an image is interpreted as one object and the background, with attention values normalized to sum to 1. The two important factors should be equal when there is no difference between the two attention values, while the important factors should be 1 and 0 when the object is fully dominant. Thus, a linear mapping function between the attention value and the important factor is proposed:

IF = AV,   (1)

where IF is the important factor and AV is the attention value.

4. FIXATION MASK

A region-wise map called the fixation mask is proposed in this paper to capture the region saliency of an image. A state-of-the-art segmentation algorithm, JSEG [19], is employed to segment images into regions before generating fixation masks. Over-segmentation is ensured by carefully setting the segmentation parameters. In the image dataset used, the number of segmented regions per image ranges from 1 to 16, with an average of 7.


Figure 2: (a) Generation of the fixation mask of human beings from eye tracking data; (b) Generation of the fixation mask of the attention-driven model.

4.1. Fixation mask of human beings

The fixation mask of human beings represents human attention on a visual scene. It is generated from the fixation information in the collected eye tracking data. Suppose that an image I is composed of N segmented regions after applying JSEG,

I = {R_1, ..., R_i, ..., R_N},   (2)

where R_i is the i-th segmented region. The concept of an importance value is introduced to measure the degree of the observer's interest in a region. If the total fixation duration on the image is T and the fixation duration on segmented region R_i is T_i, then the corresponding importance value of region R_i is

IV_i = T_i / T,  0 ≤ IV_i ≤ 1 and ∑_i IV_i ≤ 1,   (3)

where IV_i is the importance value of region R_i. The value is 0 if there is no fixation on the region.

To reduce the effect of individual subjectivity in the eye tracking data, each image has been observed by all 16 participants. The general procedure for generating the fixation mask of human beings is shown in Figure 2(a). First, after the eye tracking data are collected, the importance values of the JSEG-segmented regions are computed for each observer using Eq. (3). Second, the importance values of each region are summed over all participants. Finally, the summed values are normalized so that they are smaller than or equal to 1.

4.2. Fixation mask of the attention-driven model

In the fixation mask of the attention-driven model, the importance values are the corresponding important factors obtained from the model. Figure 2(b) shows the procedure for obtaining the fixation mask of the attention-driven model.
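The per-region importance values of Eq. (3) and their fusion over participants can be sketched as follows. The data structures (a region label map indexed by pixel, and fixations as (x, y, duration) tuples) are illustrative assumptions, not the authors' actual implementation.

```python
def importance_values(labels, fixations, num_regions):
    """labels[y][x]: region index of each pixel; fixations: (x, y, duration).
    Returns IV_i = T_i / T for each region (Eq. (3))."""
    totals = [0.0] * num_regions
    for x, y, d in fixations:
        totals[labels[int(y)][int(x)]] += d
    grand = sum(totals)
    if grand == 0:
        return totals                      # no fixation: all values stay 0
    return [t / grand for t in totals]     # each IV_i in [0, 1], sum <= 1

def human_fixation_mask(labels, fixations_per_participant, num_regions):
    """Sum importance values over participants, then normalize to <= 1."""
    summed = [0.0] * num_regions
    for fixations in fixations_per_participant:
        iv = importance_values(labels, fixations, num_regions)
        summed = [s + v for s, v in zip(summed, iv)]
    peak = max(summed)
    if peak == 0:
        return summed
    return [s / peak for s in summed]
```

Normalizing by the peak value mirrors the final step of Figure 2(a): the most-fixated region receives importance 1 and the others are scaled proportionally.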

5. REFINEMENT USING GA

GA is an optimization technique inspired by biological evolution, with operators such as inheritance, mutation, selection, and crossover. Differing from many other optimization techniques, GA iteratively evolves a population of candidate solutions, among which the best solution can be identified.

5.1. Parameters to be refined

To refine the attention-driven model, the feature weights and the mapping from attention values to important factors are optimized by a GA.

5.1.1. Feature weighting

The feature weights, denoted by w_1, w_2, and w_3, weigh the boundary color, the region texture, and the region color features, respectively. The three weights are constrained to sum to 1, so only two of them are independent.

5.1.2. Importance factor mapping

Instead of a linear mapping between the attention value and the important factor, a quadratic mapping is proposed:

IF = a·AV² + b·AV + c,   (4)

where IF is the important factor and AV is the attention value, both defined in Section 3. It is assumed that the larger the attention value, the larger the important factor, and that the important factor equals 1 when the attention value is 1. Therefore, Eq. (4) should satisfy the following two constraints:

a + b + c = 1 and 2a·AV + b ≥ 0 for AV ∈ [0, 1],   (5)

where the first constraint ensures that IF = 1 at AV = 1 and the second ensures that the curve is monotonically increasing within [0, 1].

A chromosome (GA_C) consists of four genes: the two independent feature weights and the mapping parameters a and b (with the third weight and c determined by the sum-to-one constraints). There are in total GA_N chromosomes in the population (GA_P), where

GA_P = {GA_C_1, ..., GA_C_n, ..., GA_C_GA_N}.   (6)

The initial population is randomly generated with real-coded chromosomes.

5.2. Design of the fitness function

A fitness function, defined as the correlation value (Eq. (7)) between the fixation mask of human beings and that of the attention-driven model, is used for the GA optimization of the parameters discussed in the previous section.
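The chromosome encoding of Section 5.1 can be sketched as follows. This is a minimal sketch assuming a four-gene layout (w1, w2, a, b), with the third weight and c derived from the sum-to-one constraints; the layout and helper names are illustrative assumptions, not the authors' exact encoding.

```python
import random

def decode(chromosome):
    """Expand four genes into the full weight triple and quadratic coefficients."""
    w1, w2, a, b = chromosome
    w3 = 1.0 - w1 - w2        # feature weights sum to 1
    c = 1.0 - a - b           # guarantees IF(1) = a + b + c = 1
    return (w1, w2, w3), (a, b, c)

def is_feasible(chromosome):
    (w1, w2, w3), (a, b, c) = decode(chromosome)
    weights_ok = min(w1, w2, w3) >= 0.0
    # IF'(AV) = 2a*AV + b is linear in AV, so it is non-negative on
    # [0, 1] iff it is non-negative at both endpoints AV = 0 and AV = 1.
    monotonic = b >= 0.0 and 2.0 * a + b >= 0.0
    return weights_ok and monotonic

def random_chromosome(rng):
    """Rejection-sample a feasible real-coded chromosome."""
    while True:
        cand = [rng.uniform(0.0, 1.0), rng.uniform(0.0, 1.0),
                rng.uniform(-1.0, 1.0), rng.uniform(0.0, 2.0)]
        if is_feasible(cand):
            return cand
```

Deriving w3 and c rather than evolving them keeps every chromosome on the constraint surface, so crossover and mutation only need a feasibility check instead of a repair step.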
The steps for computing the fitness function of one chromosome are detailed in Figure 3. The correlation value between two fixation masks A and B is computed as

r(A, B) = [∑_m ∑_n (A_mn − Ā)(B_mn − B̄)] / √{[∑_m ∑_n (A_mn − Ā)²]·[∑_m ∑_n (B_mn − B̄)²]},   (7)

where Ā and B̄ are the mean values of A and B, and m and n index the entries of the two equally sized matrices.

Step 1: Reconstruct the attention-driven model using the feature weights and the mapping function parameters defined in the chromosome.
Step 2: Apply the reconstructed attention-driven model to an image to pop out the objects as well as the background and to obtain the corresponding important factors.
Step 3: Form the fixation mask of the attention-driven model from the obtained important factors to represent the image in terms of objects and background.
Step 4: Compute the correlation value between the produced fixation mask of the attention-driven model and the corresponding fixation mask of human beings.
Step 5: If there are more images in the training set, go to Step 2; otherwise, go to Step 6.
Step 6: The fitness value of the chromosome is the average of the correlation values over all the images in the training set.

Figure 3: The detailed steps for computing the fitness function.
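The correlation of Eq. (7) and the averaging in Step 6 can be sketched as follows. For simplicity the masks are represented here as flat lists of per-region values (correlating matrices entry-by-entry is equivalent); the function names are illustrative.

```python
import math

def correlation(a, b):
    """2-D correlation coefficient of Eq. (7), applied to flattened masks."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) *
                    sum((y - mb) ** 2 for y in b))
    return num / den if den else 0.0

def fitness(model_masks, human_masks):
    """Average correlation over all training images (Step 6 of Figure 3)."""
    scores = [correlation(m, h) for m, h in zip(model_masks, human_masks)]
    return sum(scores) / len(scores)
```

A constant mask has zero variance, which would make the denominator vanish; returning 0 in that case is one defensive choice, not something the paper specifies.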

5.3. Genetic operations, selection, and termination

During the evolution, the heuristic crossover operator is used, and the multi-non-uniform mutation operator is adopted for mutation. Chromosomes with higher fitness values have an increased chance of being selected as parents for producing offspring for the next generation. The maximum number of generations is used as the termination criterion.

6. EXPERIMENTAL RESULTS AND DISCUSSION

In the experiment, ten images are randomly selected from the image dataset as the training set for the GA optimization. The remaining images are used as the test set. The parameters of the GA optimization procedure are as follows: population size (GA_N) = 500, maximum number of generations = 1000, heuristic crossover = [2 3], multi-non-uniform mutation = [6 3], and normalized geometric selection probability = 0.08. According to the fitness function described in Section 5.2, the best chromosome is obtained when the GA optimization terminates with the criterion satisfied. The feature weights of the best solution are listed in Table 1, and the quadratic mapping (Eq. (8)) is constructed with the optimized parameters from the GA optimization. Figure 4 shows the mapping curves of the original attention-driven model and the refined one. The results show that the region color and region texture features are more important than the boundary color feature, and that the relation between the important factor and the attention value is nonlinear. A refined attention-driven model is constructed from the optimal parameters obtained and evaluated on the test images. Table 2 shows the average correlation values between the fixation masks of human beings and those of the computational models on the training and testing images.

Table 1: Feature weight assignment
                  Boundary Color    Region Texture    Region Color
Feature weight    0.0334            0.4248            0.5418

IF = −0.4992·AV² + 1.2401·AV + 0.2591   (8)
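The refined mapping can be checked numerically. Note that the sign arrangement of the printed coefficients is reconstructed here from the constraints of Eq. (5) (the coefficients must sum to 1 and the curve must increase monotonically on [0, 1]), so treat the exact form as a plausible reading of the original rather than a verified transcript.

```python
# Reconstructed coefficients of Eq. (8); a + b + c = 1.0.
A, B, C = -0.4992, 1.2401, 0.2591

def important_factor(av):
    """Refined quadratic mapping from attention value to important factor."""
    return A * av * av + B * av + C

# IF(1) = 1, as required by constraint (5).
print(important_factor(1.0))
# IF(0) = 0.2591: the background keeps a small share of importance,
# which matches the discussion of Figure 5 in Section 6.
print(important_factor(0.0))
```

The derivative IF'(AV) = 2A·AV + B stays positive over [0, 1] (it is 1.2401 at AV = 0 and 0.2417 at AV = 1), so the reconstructed curve is indeed monotonically increasing.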

Figure 4: A comparison of the important factor mapping functions of the original attention-driven model and the refined model.

Table 2: A comparison of correlation values
Attention Models                  Training images    Testing images
Attention-driven model            0.6382             0.5120
Refined attention-driven model    0.6928 (8.56%)     0.5452 (6.4%)

As can be seen, the refined attention-driven model outperforms the original attention-driven model, with an improvement of 8.56% on the training dataset (10 images) and 6.4% on the test dataset (90 images). Partial results of the original attention-driven model and the refined model are shown in Figure 5, with the fixation masks displayed so that the importance value is represented by color: the highest importance value is mapped to red and the lowest to blue. From Figure 5, we can see that the refined attention-driven model outperforms the original model and produces results closer to those of the HVS, except for simple images with a pure background (e.g., Figure 5(d)), where the improvement is not visible. On the other hand, the eye gaze data reveal that human vision makes rapid visual comparisons between ROIs and the background under natural viewing conditions. Thus, in the refined model, the background may be assigned a small portion of the importance values, especially when the ROIs in the image are very small or very large (e.g., Figure 5(a) and Figure 5(c)).

Figure 5: A comparison of the fixation masks generated by the original and refined attention-driven models against those obtained from eye tracking data (human beings).

7. CONCLUSION

In this paper, we propose a genetic algorithm based approach to refining the region based attention model proposed by Fu et al. [2] with eye tracking data. The genetic algorithm is applied to optimize the feature weights and the nonlinear mapping function between the attention value and the important factor. A fixation mask is defined to represent region saliency. In addition, an evaluation method is proposed to validate our approach. Experimental results on 90 test images show that the refined attention-driven model outperforms the original one. In the future, we will focus our research on eye gaze pattern analysis in specific domains and further explore the attention model.

8. ACKNOWLEDGMENT

The work reported in this paper is substantially supported by the Research Grants Council of the Hong Kong Special Administrative Region, China (Project No.: PolyU 5141/07E) and the PolyU Grants (Project Nos.: 1-BBZ9 and G-YH57).

9. REFERENCES

[1] D. J. Parkhurst and E. Niebur, "Scene content selected by active vision," Spatial Vision, 16(2), pp. 125–154, 2003.
[2] H. Fu, Z. Chi, and D. Feng, "Attention-driven image interpretation with application to image retrieval," Pattern Recognition, 39(9), pp. 1604–1621, 2006.
[3] H. Fu, Z. Chi, and D. Feng, "An efficient algorithm for attention-driven image interpretation from segments," Pattern Recognition, 42(1), pp. 126–140, 2009.
[4] L. Elazary and L. Itti, "Interesting objects are visually salient," Journal of Vision, 8(3), pp. 1–15, 2008.
[5] O. Le Meur, P. Le Callet, D. Barba, and D. Thoreau, "A coherent computational approach to model bottom-up visual attention," IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(5), pp. 802–817, 2006.
[6] M. Z. Aziz and B. Mertsching, "Fast and robust generation of feature maps for region-based visual attention," IEEE Transactions on Image Processing, 17(5), pp. 633–644, 2008.
[7] P. L. Rosin, "A simple method for detecting salient regions," Pattern Recognition, 42, pp. 2363–2371, 2009.
[8] J. Song, Z. Chi, and J. Liu, "A robust eye detection method using combined binary edge and intensity information," Pattern Recognition, 39, pp. 1110–1125, 2006.
[9] J. Li, G. Chen, Z. Chi, and C. Lu, "Image coding quality assessment using fuzzy integrals with a three-component image model," IEEE Transactions on Fuzzy Systems, 12(1), pp. 99–106, 2004.
[10] C. Koch and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," Human Neurobiology, 4(4), pp. 219–227, 1985.
[11] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), pp. 1254–1259, 1998.
[12] G. R. Loftus and N. H. Mackworth, "Cognitive determinants of fixation location during picture viewing," Journal of Experimental Psychology: Human Perception and Performance, 4, pp. 565–572, 1978.
[13] P. De Graef, D. Christiaens, and G. d'Ydewalle, "Perceptual effects of scene context on object recognition," Psychological Research, 52, pp. 317–329, 1990.
[14] J. M. Henderson and A. Hollingworth, "Eye movements during scene viewing: an overview," in: Eye Guidance While Reading and While Watching Dynamic Scenes, Underwood, G. (Ed.), Elsevier Science, Amsterdam, pp. 269–293, 1998.
[15] J. M. Henderson, P. A. Weeks, and A. Hollingsworth, "The effects of semantic consistency on eye movements during complex scene viewing," Journal of Experimental Psychology: Human Perception and Performance, 25(1), pp. 210–228, 1999.
[16] O. K. Oyekoya and F. W. M. Stentiford, "Exploring human eye behaviour using a model of visual attention," Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), pp. 945–948, 2004.
[17] H. Igarashi, S. Suzuki, T. Sugita, M. Kurisu, and M. Kakikura, "Extraction of visual attention with gaze duration and saliency map," Proceedings of the 2006 IEEE International Conference on Control Applications, pp. 562–567, 2006.
[18] A. J. Chung, F. Deligianni, X. P. Hu, and G. Z. Yang, "Visual feature extraction via eye tracking for saliency driven 2D/3D registration," Proceedings of the 2004 Symposium on Eye Tracking Research & Applications, pp. 49–54, 2004.
[19] Y. Deng and B. Manjunath, "Unsupervised segmentation of color-texture regions in images and videos," IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(8), pp. 800–810, 2001.
