Visual word representation in the brain

Kandan Ramakrishnan^1, Iris Groen^2, Steven Scholte^2, Arnold Smeulders^1, and Sennay Ghebreab^1,2

^1 Intelligent Sensory Information Systems Group, University of Amsterdam, The Netherlands
^2 Cognitive Neuroscience Group, University of Amsterdam, The Netherlands

Abstract. The human visual system is thought to use features of intermediate complexity for scene representation. How the brain computationally represents intermediate features is unclear, however. To study this, we tested the Bag of Words (BoW) model in computer vision against human brain activity. This computational model uses visual word histograms, candidate features of intermediate complexity, to represent visual scenes, and has proven effective in automatic object and scene recognition. We analyzed where in the brain and to what extent human fMRI responses to natural scenes can be accounted for by BoW representations. Voxel-wise application of a distance-based variation partitioning method reveals that BoW representations explain brain activity in visual areas V1, V2 and in particular V4. Area V4 is known to be tuned for features of intermediate complexity, suggesting that the BoW model captures intermediate-level scene representations in the human brain.

Keywords: Visual perception, fMRI, low-level and intermediate features, SIFT, Bag of Words, Representational Similarity Analysis

1 Introduction

The human visual system transforms low-level features in the visual input into high-level concepts such as objects and scene categories. Much is known about the computation of low-level visual features such as color and orientation [1-3]. How these low-level features are transformed into high-level object and scene percepts, however, has not yet been clarified. One possibility is that the visual system first creates intermediate representations of the visual input, and then transforms these into full object and scene representations. In neural models of vision, such intermediate features are deemed important for scene categorization because they offer a good trade-off between frequency and specificity [4].

Recent computational models of vision use this idea of hierarchical visual processing for real-world image categorization. The HMAX model [7], for example, is a biologically plausible model that uses features of intermediate complexity for object recognition. In computer vision, the Bag of Words (BoW) model is a successful model for scene classification. The key idea behind this model is to quantize local SIFT features [15] into visual words, which are abstractions of frequently occurring and distinctive image patches such as grass, sand and bricks [10]. An image is then represented by a histogram of visual words. The BoW model outperforms the HMAX model on many scene classification tasks such as TRECvid [20] and PASCAL [21], in some cases even approaching human classification performance [23].

In this work we hypothesize that the human visual system uses intermediate features for scene representation and that the BoW model provides a suitable computation thereof. We expect to find areas in the brain where activity is accounted for by BoW representations of the visual input, in particular areas beyond early visual cortex, where increasingly complex information is processed. To test this, we record fMRI responses of several subjects to natural scenes and search the fMRI volumes for voxels that are significantly explained by BoW representations.

Finding visual word representations in the brain is challenging for two reasons. First, computational and neural representations are heterogeneous and at the same time high-dimensional. Second, visual word histograms build on SIFT features, and hence the two need to be dissociated in a proper manner. We address the first challenge by considering dissimilarity matrices [11], and resolve the second by applying variation partitioning [14, 16] on these dissimilarity matrices to compute the unique contributions of SIFT features and visual word histograms in explaining brain responses.

The paper is set up as follows. In section 2 we describe the representations of visual scenes and brain responses. In section 3 we outline the dissimilarity-based variation partitioning method. In section 4 we present our results, followed by discussion and conclusions in section 5.

2 Representation of visual scenes and brain responses

2.1 Visual scene representation

The first step in the BoW model is the extraction of SIFT features [15] from the visual input. The SIFT representation of an image $I_k$ is denoted by $x_k = (f_1, \ldots, f_N)$, where $f_n$ is a 128-dimensional SIFT vector and $N$ is the number of interest points in the image. We use dense sampling with 2-pixel spacing and concatenate all local SIFT vectors into a $128 \cdot N$-dimensional vector. Secondly, a dictionary of visual words [10] is learned from an independent set of scenes. We use k-means clustering to identify cluster centers $c_1, \ldots, c_M$ in SIFT space, where $M$ denotes the number of cluster centers. All patches in a new image are assigned to the most similar word, and the image is represented by counting the occurrences of all words. This results for image $I_k$ in a visual word histogram $w_k = (h_1, \ldots, h_M)$, where each bin $h_m$ indicates the number of times word $c_m$ is present in the image. We use the PASCAL VOC 2007 dataset [21] to create a codebook of size $M = 4000$.
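For concreteness, the sketch below illustrates the three steps just described: dense SIFT extraction, codebook learning with k-means, and visual word histogram computation. It is a minimal illustration assuming OpenCV and scikit-learn; the 2-pixel grid step and the codebook size mirror the settings above, but the code is not the original implementation.

```python
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def dense_sift(img_gray, step=2):
    """128-D SIFT descriptors computed on a dense grid with `step`-pixel spacing."""
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), float(step))
                 for y in range(0, img_gray.shape[0], step)
                 for x in range(0, img_gray.shape[1], step)]
    _, descriptors = sift.compute(img_gray, keypoints)
    return descriptors                                  # shape (N, 128)

def learn_codebook(descriptors_per_image, M=4000):
    """Cluster pooled SIFT descriptors from an independent image set into M visual words."""
    return MiniBatchKMeans(n_clusters=M, random_state=0).fit(
        np.vstack(descriptors_per_image))

def word_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest visual word and count occurrences per word."""
    words = codebook.predict(descriptors)
    return np.bincount(words, minlength=codebook.n_clusters).astype(float)
```

Learning M = 4000 clusters over dense SIFT from an independent image set (here, PASCAL VOC 2007) then reduces each scene to a 4000-bin histogram $w_k$, which is the representation used in the analyses below.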

2.2 fMRI response representation

We use fMRI responses to 72 images from several categories (mountains, buildings, forests, industries, highways and beaches). The images were shown 9 times to 4 subjects while MRI volume scans of 91 × 109 × 91 voxels were acquired. The resulting single-trial scans were subjected to voxel-wise event-related GLM analysis. This yields for each voxel a beta coefficient denoting the response magnitude of that voxel, which is averaged across trials. For each voxel, the local multivariate BOLD response is established using a searchlight technique [11], resulting in $y_k = (v_1, \ldots, v_S)$, where $S$ denotes the number of voxels within a sphere of radius 2.5 mm around the voxel, here $S = 27$.
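As an illustration of this local response extraction, the sketch below gathers the beta coefficients of all voxels within a small sphere around a center voxel. The 4-D array name `betas` and the radius expressed in voxel units are assumptions made for illustration; with the default radius, each interior searchlight contains the 27 voxels of a 3 × 3 × 3 neighborhood.

```python
import numpy as np

def searchlight_response(betas, center, radius_vox=1.8):
    """Local multivariate response y_k = (v_1, ..., v_S) around `center`.

    `betas` is a 4-D array of trial-averaged GLM betas with shape (x, y, z, n_images);
    the returned array has shape (S, n_images), with S = 27 for interior voxels.
    """
    cx, cy, cz = center
    xs, ys, zs = np.ogrid[:betas.shape[0], :betas.shape[1], :betas.shape[2]]
    mask = (xs - cx) ** 2 + (ys - cy) ** 2 + (zs - cz) ** 2 <= radius_vox ** 2
    return betas[mask]
```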

3 Variation partitioning using dissimilarity matrices

We use distance-based variation partitioning [14] to study the contribution of visual word histograms in explaining fMRI responses to visual scenes independently of SIFT features. We construct distance matrices $X$, $W$ and $Y$ for our $K = 72$ images based on the SIFT features $x_1, \ldots, x_K$, the visual word histograms $w_1, \ldots, w_K$ and the fMRI responses $y_1, \ldots, y_K$, respectively. Each element in these matrices denotes the pairwise distance between two of the $K$ images. The variation partitioning algorithm [16] determines the unique contributions of the SIFT distance matrix $X$ and the visual word histogram distance matrix $W$ in explaining the brain activity distance matrix $Y$. First, we determine the variance of $Y$ explained by the combination of $X$ and $W$, $R^2_{Y|X+W}$, on the basis of the predicted response $\hat{Y}_{X+W}$ resulting from the regression of $X$ and $W$ together on $Y$. Similarly, the fraction of $Y$ explained by $X$ alone, based on $\hat{Y}_X$, is $R^2_{Y|X}$, and the fraction explained by $W$ alone, based on $\hat{Y}_W$, is $R^2_{Y|W}$. The unique contributions of SIFT features and visual word histograms in explaining local brain activity are then computed as $R^2_{Y|X+W} - R^2_{Y|W}$ and $R^2_{Y|X+W} - R^2_{Y|X}$, respectively. Note that these statistics are the canonical equivalent of the regression coefficient of determination, $R^2$ [16].
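The partitioning logic can be sketched as follows. The published procedure [14, 16] operates on principal coordinates of the distance matrices; the simplified sketch below only illustrates the $R^2$ bookkeeping, regressing the vectorized upper triangle of $Y$'s distance matrix on those of $X$ and $W$. Function and variable names are illustrative, not the authors' implementation.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def distance_matrix(features, metric="euclidean"):
    """K x K matrix of pairwise distances between per-image feature vectors."""
    return squareform(pdist(np.asarray(features), metric=metric))

def _upper(D):
    """Vectorize the upper triangle (all pairwise distances) of a K x K matrix."""
    return D[np.triu_indices_from(D, k=1)]

def explained_variance(Y, *predictors):
    """R^2 of Y's pairwise distances regressed on one or more predictor distance matrices."""
    y = _upper(Y)
    P = np.column_stack([_upper(D) for D in predictors])
    y_hat = LinearRegression().fit(P, y).predict(P)
    return r2_score(y, y_hat)

def unique_contributions(Y, X, W):
    """Unique R^2 of SIFT (X) and of visual word histograms (W) in explaining Y."""
    r2_full = explained_variance(Y, X, W)              # R^2_{Y|X+W}
    unique_sift = r2_full - explained_variance(Y, W)   # what X adds beyond W
    unique_vwh = r2_full - explained_variance(Y, X)    # what W adds beyond X
    return unique_sift, unique_vwh
```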

4 Results

4.1 Subject-specific results

We used distance-based variation partitioning to determine the unique contribution of visual word histograms in explaining fMRI responses, independent of the contribution of SIFT features. We report only clusters of at least 25 contiguous voxels with significant correlations (p < 0.05, ALPHASIM [24]). Figure 1 shows for each subject the amount of variance explained by SIFT features (red) and visual word histograms (blue).

Fig. 1. Brain maps showing voxel-wise the fraction of brain activity explained uniquely by SIFT features (red) and visual word histograms (blue) for the four subjects.

SIFT features correlate with brain responses in multiple brain areas. These include primary visual cortex (extraction of low-level features), mid-level areas such as the lateral occipital cortex (involved in object processing), and higher-level areas such as the parahippocampal gyrus (encoding and recognition of scenes). For the four subjects, the explained variance peaks at 4% and is located in different brain areas (occipital pole, occipital fusiform gyrus, lateral occipital cortex and lingual gyrus). Visual word histograms also account for brain activity in multiple brain regions, but these regions tend to concentrate in primary and extrastriate visual cortex. In addition, compared to SIFT features, visual word histograms explain much more brain activity, up to 11%. Interestingly, for three out of four subjects the maximum explained variance is found in adjacent regions in primary and extrastriate visual cortex. These results suggest a locus in the visual cortex and consistency across subjects for visual word histograms.

4.2 Across-subject averages

As averaging fMRI responses across subjects may enhance response signals, we repeated our analysis on subject-averaged fMRI responses. As before, figure 2 shows the variance uniquely explained by SIFT features and by visual word histograms. The maximum explained variance is 5% for SIFT features and 21% for visual word histograms. This difference in peak explained variance is striking, and suggests that visual word histograms indeed capture information processed in the visual brain. Moreover, the peak explained variance by SIFT features occurs slightly earlier in the visual hierarchy than the area where the peak explained variance by visual word histograms emerges. Although the difference is subtle, it is in line with the idea that the computation of SIFT features precedes that of visual word histograms. Table 1 shows explained variances averaged across predefined regions of interest.

Table 1. Number of significant voxels and maximum explained variance (%) per region of interest, for the across-subject averages.

ROI               Significant voxels       Max explained variance (%)
                  VWH       SIFT           VWH       SIFT
V4                388       145            20.81     3.58
V123              1032      983            20.8      4.65
IPL               494       290            10.84     5.5
TO                263       0              10.2      0
AnteriorTemporal  81        102            4.64      3.93
LGN               34        0              4.42      0
PT                40        26             3.39      2
LO                0         49             0         2.7
SPlobule          0         0              0         0
Area5             0         0              0         0
Area7             0         0              0         0

The data confirm that the highest explained variances are found in V4 and adjacent areas. In addition, the data show that brain activity in the vast majority of voxels in V4 is accounted for by visual word histograms, whereas significant voxels in areas V1, V2 and V3 are due to both visual word histograms and SIFT features (which only account for a small fraction of the fMRI responses). Brain activity in higher-level areas such as the inferior parietal lobule, anterior temporal and temporal occipital cortex is also explained by SIFT features and visual word histograms, but in fewer voxels and to a lesser extent.

5 Discussion and conclusion

The success of models such as HMAX and BoW may be attributed to their use of features of intermediate complexity. The BoW model [17] in particular has proven capable of learning to distinguish visual objects from only fifty to five hundred labeled examples, in a fully automatic fashion and with good recognition rates. This makes the BoW model a candidate computational model of intermediate visual processing in the brain. Our results indicate that our brain is capable of representing scenes in ways similar to BoW, in particular in areas such as V4, which have previously been associated with intermediate visual representations [25].

From a neural perspective, the two bag-of-words computational steps that we tested in this paper are interesting. First, the Bag of Words model relies on multi-scale and multi-orientation features. These features are often represented as Scale Invariant Feature Transform (SIFT) features [15]. Although SIFT features originate from computer vision, their inspiration goes back to Hubel and Wiesel's simple and complex receptive fields [2] and Fukushima's Neocognitron model [12]. SIFT features thus have an embedding in the visual system, raising the biological plausibility of Bag of Words representations. Second, the vector quantization of SIFT features into visual words is a form of feature abstraction that is new and untested in cognitive neuroscience.

Fig. 2. Visualization of brain activity explained uniquely by SIFT features (red) and visual word histograms (blue) for the subject-averaged responses. Note that SIFT features and visual word histograms explain brain activity in non-overlapping regions.

Visual words can be conceived of as higher-level visual building blocks composed of receptive fields. In fact, in analogy to the reconstruction of receptive fields from natural images [19], visual words are constructed from natural scenes by identifying frequently recurring, informative and distinctive image patches. Being compact and rich visual descriptors, visual words may allow for sparse, intermediate representations of objects and scenes. Furthermore, the histogram of visual words may facilitate scene gist perception, which occurs rapidly and early in visual processing [26, 27]. While there is evidence that simple low-level regularities such as spatial frequency [13, 19], color [8] and local edge alignment [6, 18] underlie scene gist representation, it is hitherto unknown whether and how mid-level features facilitate scene gist perception. It has been proposed that global receptive fields in V4 and IT are tuned to spatial patterns of orientations and scales (such as SIFT features) across the entire image, and that they compute scene gist [5]. This is in accordance with the localization and confinement of visual word histogram representations in V4.

There are a number of open questions that we will address in future work. At the level of analysis, the type of SIFT sampling, the number of visual words, the distance measures used in the computation of the dissimilarity matrices, and so on might give rise to interesting results. More generally, it is interesting to study the repeatability of our results when creating visual words based on other data sets and when using fMRI responses to a wider range of natural scenes. Preliminary results show that there is indeed consistency in the detected brain regions.

Acknowledgement This research was supported by the Dutch national public-private research program COMMIT.

References

1. Field DJ. Relations between the statistics of natural images and the response properties of cortical cells. J Opt Soc Am A 4:2379-2394 (1987).
2. Olshausen BA and Field DJ. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381:607-610 (1996).
3. Karklin Y and Lewicki MS. Emergence of complex cell properties by learning to generalize in natural scenes. Nature 457:83-86 (2009).
4. Ullman S, Vidal-Naquet M and Sali E. Visual features of intermediate complexity and their use in classification. Nature Neuroscience 5:682-687 (2002).
5. Oliva A and Torralba A. Building the gist of a scene: the role of global image features in recognition. Prog Brain Res 155:23-36 (2006).
6. Wichmann FA, Braun DI and Gegenfurtner KR. Phase noise and the classification of natural images. Vision Research 46(8-9):1520-1529 (2006).
7. Riesenhuber M and Poggio T. Hierarchical models of object recognition in cortex. Nature Neuroscience 2:1019-1025 (1999).
8. Oliva A and Schyns PG. Diagnostic colors mediate scene recognition. Cognitive Psychology 41:176-210 (2000).
9. Serre T, Wolf L and Poggio T. Object recognition with features inspired by visual cortex. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2005).
10. Philbin J, Chum O, Isard M, Sivic J and Zisserman A. Object retrieval with large vocabularies and fast spatial matching. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2007).
11. Kriegeskorte N, Mur M and Bandettini P. Representational similarity analysis - connecting the branches of systems neuroscience. Frontiers in Systems Neuroscience 2:4 (2008).
12. Fukushima K. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4):193-202 (1980).
13. Schyns PG and Oliva A. From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition. Psychological Science 5:195-200 (1994).
14. Legendre P and Anderson MJ. Distance-based redundancy analysis: testing multispecies responses in multifactorial ecological experiments. Ecological Monographs 69:1-24 (1999).
15. Lowe DG. Distinctive image features from scale invariant keypoints. International Journal of Computer Vision 60(2):91-110 (2004).
16. Peres-Neto PR, Legendre P, Dray S and Borcard D. Variation partitioning of species data matrices: estimation and comparison of fractions. Ecology 87:2614-2625 (2006).
17. Chatfield K, Lempitsky V, Vedaldi A and Zisserman A. The devil is in the details: an evaluation of recent feature encoding methods. BMVC (2011).
18. Loschky LC and Larson AM. Localized information is necessary for scene categorization, including the natural/man-made distinction. Journal of Vision 8:19 (2008).
19. Kaping D, Tzvetanov T and Treue S. Adaptation to statistical properties of visual scenes biases rapid categorization. Visual Cognition 15:12-19 (2007).
20. Smeaton AF, Over P and Kraaij W. Evaluation campaigns and TRECVid. MIR (2006).
21. Everingham M, Van Gool L, Williams CKI, Winn J and Zisserman A. The PASCAL visual object classes (VOC) challenge. IJCV (2010).
22. FEAT (fMRI Expert Analysis Toolbox) version 4.1, part of FSL (Oxford Centre for Functional MRI of the Brain (FMRIB) Software Library; www.fmrib.ox.ac.uk/fsl).
23. Parikh D and Zitnick CL. The role of features, algorithms and data in visual recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2010).
24. Cox RW. AFNI: software for analysis and visualization of functional magnetic resonance neuroimages. Computers and Biomedical Research 29(3):162-173 (1996).
25. Gallant JL, Connor CE, Rakshit S, Lewis JW and Van Essen DC. Neural responses to polar, hyperbolic, and Cartesian gratings in area V4 of the macaque monkey. Journal of Neurophysiology 76(4) (1996).
26. Groen IIA, Ghebreab S, Lamme VAF and Scholte HS. Low-level contrast statistics are diagnostic of invariance of natural textures. Frontiers in Computational Neuroscience 6:34 (2012).
27. Groen IIA, Ghebreab S, Prins H, Lamme VAF and Scholte HS. From image statistics to scene gist: Evoked neural activity reveals transition from low-level natural image structure to scene category. Journal of Neuroscience 33(45):18814-18824 (2013).
