Decoding invariant visual information with MEG sensor and source data Leyla Isik*, Yena Han, and Tomaso Poggio Center for Biological and Computational Learning, Massachusetts Institute of Technology, Cambridge, MA 02139, USA *[email protected]

Abstract. Magnetoencephalography (MEG) decoding analysis is a powerful tool for studying the human visual system. Previous work has shown that size- and position-invariant visual signals can be decoded from MEG sensor-level data. Source localization results would allow us to answer more precise anatomical questions about invariant object recognition, but these interpretations may be limited by the accuracy of the source localization. Here we compare MEG decoding analysis using features in sensor and source space in a size- and position-invariant visual decoding task in order to both assess the promise of decoding in source space and attempt to gain a better spatiotemporal profile of invariant object recognition in humans.

1 Introduction

The human visual system can recognize objects in a fraction of a second under complex viewing conditions [14, 18]. The computations underlying this invariant object recognition, however, are still poorly understood. MEG decoding analysis, in which a machine learning classifier is trained to assess what stimulus information (such as image identity) is present in the MEG signals, has emerged as a useful tool for studying this problem in humans [1, 2, 8, 10]. MEG decoding provides high temporal resolution data from across the brain that are informative even at the level of single trials.

Most previous MEG visual decoding studies have focused on decoding in sensor space, using the time-series sensor data as classifier features. This provides very precise information about how the neural signals develop in time, but source localization data (estimates of the neural activity driving the MEG sensor measurements) would allow us to answer more precise anatomical questions about the human visual system. In particular, it can be useful to decode in source space, using the time-series source estimates at different voxels as classifier features. The fact that the inverse problem of calculating source estimates from MEG sensor-level data is ill-posed, however, may limit the interpretation of source-level decoding. Previous studies have shown success with source-space decoding for a semantic language task [16], but few studies have directly compared using source versus sensor data as classifier features. This comparison would be useful for answering questions about the underlying brain processes, and decoding may also serve as a useful tool for evaluating different source localization solutions.

In a recent study, we showed that we could decode images invariant to size and position with MEG sensor-level data [10]. We could decode object identity at 70 ms after stimulus onset. Invariant information arose later than the initial identity signal and appeared in stages (between 90 and 150 ms after stimulus onset), with invariance to smaller transformations arising before invariance to larger transformations. This timing data is consistent with hierarchical feedforward models that increase invariance with local pooling at each successive layer, for example [7, 15]. While we can infer important algorithmic information from timing data, source localization would allow us to more definitively answer questions about feedforward versus feedback processing in the visual system.

In the present study we compare decoding in sensor and source space using data from the same experimental paradigm as our previous study. We assess the relevant stimulus information in the most active source estimates from across the brain, and we investigate decoding in individual anatomically defined regions of interest (ROIs) to examine how invariant visual information evolves across the human ventral stream.

2 Methods

2.1 Experimental paradigm and data acquisition

Two subjects participated in the experiment, which was approved by the MIT Committee on the Use of Humans as Experimental Subjects. Subjects viewed 6 different isolated, gray-scale images presented at 3 positions (centered at 0, +3, and -3 degrees of visual angle vertically from the center of the screen) and at 3 sizes (2, 4, and 6 degrees of visual angle in diameter); see Figure 1. The images were presented for 48 ms with a 704 ms inter-stimulus interval. Image order was randomized, and each stimulus was repeated 50 times. During the experiment, subjects performed an orthogonal task related to the color of the fixation cross to keep them alert and fixated at the center of the screen. MEG signals were recorded using an Elekta Neuromag Triux scanner with 102 magnetometers and 204 planar gradiometers. Data were preprocessed in Brainstorm [17] with signal-space projection and band-pass filtered from 0.1 to 100 Hz. For more details on the experimental paradigm and data acquisition, please refer to [10].

Fig. 1. Top row: the six grayscale images used in the decoding task. Bottom row: the different size and position conditions, illustrated with the bowling ball object.
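The preprocessing above was carried out in Brainstorm [17]. For readers working in Python, a minimal sketch of a roughly equivalent pipeline in MNE-Python is given below; the file name and event extraction are hypothetical, and the epoch window simply matches the -300 ms to +500 ms range plotted in the decoding figures.

```python
# Hedged sketch: a rough MNE-Python equivalent of the Brainstorm preprocessing
# described above. The raw file name and event codes are hypothetical.
import mne

raw = mne.io.read_raw_fif("subject1_raw.fif", preload=True)  # hypothetical file
raw.apply_proj()                       # apply signal-space projection (SSP)
raw.filter(l_freq=0.1, h_freq=100.0)   # band-pass 0.1-100 Hz

# Epoch around each stimulus onset, -300 ms to +500 ms, with a pre-stimulus
# baseline correction
events = mne.find_events(raw)
epochs = mne.Epochs(raw, events, tmin=-0.3, tmax=0.5,
                    baseline=(None, 0), preload=True)
```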

2.2 Decoding analysis methods

Decoding analyses were performed using the Neural Decoding Toolbox [12]. Decoding used a five-fold cross-validation procedure, and the 10 trials in each split were averaged together to increase the signal-to-noise ratio. In each training phase, the data were Z-score normalized using the mean and standard deviation of each sensor over the entire time series, and an ANOVA was applied to select the sensors at each time point that were most selective for image identity (those with the lowest p-values as determined by an F-test). The test data were then Z-score normalized with the mean and standard deviation from the training data, and only the sensors selected in training were used in testing. We used a maximum correlation coefficient classifier (we achieved similar results with support vector machine and regularized least squares classifiers), where each test point x* is assigned the label i* of the class of the training data with which it is maximally correlated:

i* = argmax_i corr(x*, x_i)    (1)
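The decoding itself was run with the Neural Decoding Toolbox [12]. Purely as an illustration of the procedure, and not the authors' code, the sketch below re-implements a single training/testing step at one time bin in NumPy, assuming arrays X_train and X_test of shape (trials, sensors) and integer class labels y_train.

```python
# Hedged sketch of one decoding step at a single time bin: Z-scoring with
# training statistics, ANOVA (F-test) feature selection, and a maximum
# correlation coefficient classifier (Eq. 1). Illustrative only.
import numpy as np
from scipy.stats import f_oneway

def decode_time_bin(X_train, y_train, X_test, n_features=30):
    # Z-score normalize using the training mean and standard deviation only
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)
    X_train, X_test = (X_train - mu) / sd, (X_test - mu) / sd

    # ANOVA feature selection: keep the sensors most selective for identity
    classes = np.unique(y_train)
    _, p = f_oneway(*[X_train[y_train == c] for c in classes], axis=0)
    keep = np.argsort(p)[:n_features]
    X_train, X_test = X_train[:, keep], X_test[:, keep]

    # Maximum correlation coefficient classifier: assign each test point the
    # label of the training class with which it is maximally correlated
    class_means = np.vstack([X_train[y_train == c].mean(axis=0) for c in classes])
    preds = [classes[np.argmax([np.corrcoef(x, m)[0, 1] for m in class_means])]
             for x in X_test]
    return np.array(preds)
```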

To assess the amount of visual information in the MEG recordings, the classifier was trained to determine the identity of the image the subject viewed. The classifier was trained and tested at each 20 ms time bin, with a 5 ms step size. The entire decoding procedure was run 10 times per condition for each subject. Classification accuracy is reported as the average accuracy for the two subjects across the five cross-validation splits in each of the ten runs.
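As a rough sketch of this time-resolved loop (again not the toolbox's implementation), the 20 ms bins with a 5 ms step can be scanned as below; the 1 kHz sampling rate is an assumption, and decode_time_bin is the illustrative function from the previous sketch.

```python
# Hedged sketch of time-resolved decoding over 20 ms bins with 5 ms steps,
# for data X of shape (trials, sensors, time samples). Sampling rate assumed.
import numpy as np

sfreq = 1000                      # assumed sampling rate (samples per second)
bin_len = int(0.020 * sfreq)      # 20 ms bin
step = int(0.005 * sfreq)         # 5 ms step

def sliding_accuracy(X, y, train_idx, test_idx):
    accs = []
    for start in range(0, X.shape[2] - bin_len + 1, step):
        Xb = X[:, :, start:start + bin_len].mean(axis=2)  # average within bin
        preds = decode_time_bin(Xb[train_idx], y[train_idx], Xb[test_idx])
        accs.append(np.mean(preds == y[test_idx]))
    return np.array(accs)   # one accuracy value per time bin
```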

Image identity was decoded without any invariance (non-invariant conditions) by training and testing a classifier on data from each of the five size and position conditions (centered with 2 degree diameter; centered with 4 degree diameter; centered with 6 degree diameter; 3 degrees above center with 6 degree diameter; and 3 degrees below center with 6 degree diameter; see Figure 1, bottom). To test invariance, we trained the classifier on data from images presented at one size or position and tested on data from images presented at a second size or position, to see when the neural signals could generalize across the given transformation. This resulted in 6 size-invariant and 6 position-invariant comparisons, enumerated in the sketch below. Results for each of the 3 invariance conditions (non-invariant, size-invariant, and position-invariant) are reported here as the average over their individual comparisons. For more details on the decoding procedure, please see [10] and the Neural Decoding Toolbox website: readout.info.
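For concreteness, the six ordered train/test pairs for each transformation can be generated as follows; the condition names are just illustrative labels for the five conditions shown in Figure 1 (bottom).

```python
# Hedged sketch of the invariant decoding comparisons: train on one
# size/position condition and test on a different one. Labels are illustrative.
from itertools import permutations

size_conditions = ["center_2deg", "center_4deg", "center_6deg"]
position_conditions = ["center_6deg", "up3deg_6deg", "down3deg_6deg"]

size_pairs = list(permutations(size_conditions, 2))          # 6 ordered pairs
position_pairs = list(permutations(position_conditions, 2))  # 6 ordered pairs

for train_cond, test_cond in size_pairs:
    # train the classifier on trials from train_cond and test it on trials
    # from test_cond, using the same procedure as described above
    pass
```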

2.3 Source localization

Source localization was performed using the Minimum Norm Estimate (MNE) distributed source localization method, which finds the set of sources that minimizes the total power (ℓ2-norm) of the source estimates [9]. Structural MRIs were collected for both subjects, and cortical reconstruction and volumetric segmentation were performed with the FreeSurfer image analysis suite [3-5]. We estimated 15,000 sources constrained to each subject's cortical surface. Source localization was performed with fixed orientation constraints and the default signal-to-noise ratio (proportional to the inverse of the regularizer) of 3.
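For illustration, a fixed-orientation minimum-norm estimate with SNR = 3 can be computed in MNE-Python roughly as below; the forward solution (fwd) and noise covariance (noise_cov) are assumed to exist, and the authors' exact software pipeline may differ.

```python
# Hedged sketch of an L2 minimum-norm estimate with fixed source orientations
# and SNR = 3, using MNE-Python. fwd and noise_cov are assumed to be available.
from mne.minimum_norm import make_inverse_operator, apply_inverse

snr = 3.0
lambda2 = 1.0 / snr ** 2   # regularization weight, inversely related to SNR

inv = make_inverse_operator(epochs.info, fwd, noise_cov,
                            loose=0.0, fixed=True)   # fixed orientations
evoked = epochs.average()
stc = apply_inverse(evoked, inv, lambda2=lambda2, method="MNE")
# stc.data has shape (n_sources, n_times); here ~15,000 cortical sources
```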

2.4 Sensor and source feature selection

Our previous study showed that using between 10 and 50 sensors at each time point from the above ANOVA feature selection procedure provides an optimal signal-to-noise ratio for decoding with sensor data [10]. Here we select 30 (~10% of the total) sensors at each time point. To compare decoding in source space with the sensor-level data, we also perform feature selection to downsample the large number of sources. We assess the information in the active sources by choosing the 1,500 (10% of the total) most active sources at each time point, calculated from the average of all image presentations, and performing source-level decoding with only the sources at these locations. While the source locations were chosen based on aggregate data, decoding was performed as described in the above section with individual trials in each cross-validation split.
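A minimal sketch of this source selection step is given below, assuming the source estimate of the average response is available as a NumPy array stc_mean of shape (sources, time points); the variable names are illustrative.

```python
# Hedged sketch: at each time point, keep the 1,500 (10%) sources with the
# largest mean activation across all image presentations.
import numpy as np

n_keep = 1500
# indices of the most active sources, one column of indices per time point
top_sources = np.argsort(np.abs(stc_mean), axis=0)[-n_keep:, :]

# single-trial decoding at time bin t then uses only the source amplitudes
# at the locations top_sources[:, t] as classifier features
```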

2.5 Cortical parcellation

Cortical parcellation of each subject's MRI was performed automatically using FreeSurfer. We examined how the visual signals evolve in different brain areas by decoding in four visual regions of interest: V1 and V2 (defined by the Brodmann area atlas in FreeSurfer), and the occipital inferior and temporal inferior regions (both gyri and sulci, defined by the Destrieux atlas [6] in FreeSurfer). The four regions are illustrated on the cortex of Subject 1 in Figure 4(a).
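One way to pull these parcellation labels into Python is sketched below using MNE-Python; the annotation and label names follow common FreeSurfer conventions but are assumptions that may need adjusting for a given FreeSurfer version.

```python
# Hedged sketch: read the Destrieux parcellation of one subject and pick out
# the inferior occipital and inferior temporal labels. Paths are hypothetical.
import mne

subjects_dir = "/path/to/freesurfer/subjects"   # hypothetical path
labels = mne.read_labels_from_annot("subject1", parc="aparc.a2009s",
                                    subjects_dir=subjects_dir)

occipital_inf = [l for l in labels if "occipital_inf" in l.name]
temporal_inf = [l for l in labels if "temporal_inf" in l.name]
# V1 and V2 would instead come from a Brodmann-area annotation,
# e.g. parc="PALS_B12_Brodmann" on fsaverage, morphed to the subject.
```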

3 Results

3.1 Source estimates at key decoding times

The source localization results are shown for one subject at two key decoding times: 70 ms, the time when stimulus information can first be decoded (Figure 2(a)), and 150 ms, the time when size- and position-invariant signals can be decoded (Figure 2(b)). At both time points, sources in the occipital lobe are highly active, consistent with the visual task the subjects performed. Additionally, at 150 ms, there are more active sources, which have spread further down the temporal lobe near later visual areas.

Fig. 2. A left view of the source estimates on the cortex of Subject 1 at (a) 70 ms after stimulus onset, the time when image identity can first be decoded, and (b) 150 ms after stimulus onset, the time when size- and position-invariant information can be decoded. The color bar at right indicates absolute source magnitude in picoampere-meters.

3.2 Decoding with top sensors and sources

The decoding results using the 30 (~10%) most selective sensors and the 1,500 (10%) most active sources as classifier features are shown in Figure 3. Source-space features perform similarly to the sensor-level features in the non-invariant, size-invariant, and position-invariant conditions, even though, unlike the top sensors, they were not explicitly chosen to contain stimulus identity information. These results indicate that the most active sources contain important invariant visual information.

3.3 Decoding within anatomically defined ROIs

The four visual ROIs (V1, V2, occipital inferior, and temporal inferior) are shown in Figure 4(a), and the decoding results for the non-invariant, size-invariant, and position-invariant conditions in these four brain regions are shown in Figures 4(b)-(d). For the non-invariant decoding condition, information appears to progress in a feedforward manner through the different visual regions: V1 and V2 have the highest accuracy and an earlier onset latency, while the two later visual areas have longer latencies and slightly lower decoding performance (Figure 4(b)). For position-invariant decoding, the latest visual region, temporal inferior, has slightly higher accuracy, but all four regions appear to have similar latencies for both the size- and position-invariant conditions (Figures 4(c) and 4(d)). Additionally, it is important to note that regions sampled outside the temporal and occipital lobes did not contain any relevant visual information (results not shown), supporting the accuracy of the source localization at a coarse level.

4 Discussion

In this work we performed source localization using MNE for two subjects performing a visual object recognition task. We showed that the most active sources moved further down the temporal lobe as invariant information developed, and that these sources contained invariant visual information. Both of these results support the accuracy of the source estimates. Finally, we decoded within different visual regions of interest to gain a spatiotemporal profile of how invariant visual signals evolve in the ventral stream.

The non-invariant decoding results in Figure 4(b) show a distinction between different visual areas, and a progression of visual stimulus information in a feedforward manner along the ventral stream. The size- and position-invariant results in Figures 4(c) and 4(d), however, show that invariant information develops in all visual areas at about the same time and with similar accuracy. Additionally, although the source estimates have become more active further down the temporal lobe at 150 ms, there are still active sources in the early occipital regions (Figure 2(b)). This may represent an actual spread of invariant visual information across these regions, but is more likely due to the fact that the source estimates are not resolved at a fine enough spatial scale to distinguish between adjacent cortical regions. So although the non-invariant results show a feedforward progression of visual information, the spatial resolution does not seem high enough to draw definitive conclusions about feedforward versus feedback processing in our size- and position-invariant visual tasks. It is possible that a sparser source localization method utilizing the ℓ1-norm [19], or a combination of the ℓ1 and ℓ2 norms [13], as well as methods incorporating spatial and temporal smoothness constraints [11], may provide more precise anatomical estimates and be better suited to answer these questions.

Fig. 3. Comparison of decoding with the most selective sensors (blue) and most active sources (red) for the (a) non-invariant, (b) size-invariant, and (c) position-invariant conditions. (Please note the different scales of the y-axes in a-c.)

Fig. 4. (a) The four anatomically defined ROIs, shown on the cortex of Subject 1. Decoding within these ROIs for the (b) non-invariant, (c) size-invariant, and (d) position-invariant conditions; the curve for each region is plotted in its corresponding color from (a). (Please note the different scales on the y-axes in b-d.)

5 Conclusion

In this study we were able to evaluate the MNE source estimates on several levels: we affirmed that the most active sources lay along the occipital and temporal lobes, that they carried relevant invariant visual information, and that we could see some distinctions between visual information in nearby brain regions. These results show a coarse picture of how invariant visual information travels through the ventral stream, and provide a framework for future studies to answer visual processing questions at a finer anatomical level with more precise source estimates.

Acknowledgements

This research was sponsored by grants from the Defense Advanced Research Projects Agency, the National Science Foundation, and the McGovern Institute for Brain Research. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216. We would also like to thank the McGovern Institute for Brain Research for providing use of the MEG facilities, as well as D. Pantazis for help with MEG data analysis and D. Baldauf for help collecting MRIs.

References

1. Thomas Carlson, David A. Tovar, Arjen Alink, and Nikolaus Kriegeskorte. Representational dynamics of object vision: The first 1000 ms. Journal of Vision, 13(10):1–, January 2013.
2. Thomas A. Carlson, Hinze Hogendoorn, Ryota Kanai, Juraj Mesik, and Jeremy Turret. High temporal resolution decoding of object position and category. Journal of Vision, 11(10), January 2011.
3. Anders M. Dale, Bruce Fischl, and Martin I. Sereno. Cortical surface-based analysis. NeuroImage, 9(2):179–194, 1999.
4. B. Fischl, A. Liu, and A. M. Dale. Automated manifold surgery: constructing geometrically accurate and topologically correct models of the human cerebral cortex. IEEE Transactions on Medical Imaging, 20(1):70–80, January 2001.
5. Bruce Fischl, Martin I. Sereno, and Anders M. Dale. Cortical surface-based analysis. NeuroImage, 9(2):195–207, 1999.
6. Bruce Fischl, André van der Kouwe, Christophe Destrieux, Eric Halgren, Florent Ségonne, David H. Salat, Evelina Busa, Larry J. Seidman, Jill Goldstein, David Kennedy, Verne Caviness, Nikos Makris, Bruce Rosen, and Anders M. Dale. Automatically parcellating the human cerebral cortex. Cerebral Cortex, 14(1):11–22, January 2004.
7. Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, April 1980.
8. Marcos Perreau Guimaraes, Dik Kin Wong, E. Timothy Uy, Logan Grosenick, and Patrick Suppes. Single-trial classification of MEG recordings. IEEE Transactions on Biomedical Engineering, 54(3):436–443, March 2007.
9. Matti S. Hämäläinen, Fa-Hsuan Lin, and John C. Mosher. Anatomically and functionally constrained minimum-norm estimates. In MEG: An Introduction to Methods. 2010.
10. Leyla Isik, Ethan M. Meyers, Joel Z. Leibo, and Tomaso A. Poggio. The dynamics of invariant object recognition in the human visual system. Journal of Neurophysiology, in press, 2013.
11. Camilo Lamus, Matti S. Hämäläinen, Simona Temereanca, Emery N. Brown, and Patrick L. Purdon. A spatiotemporal dynamic distributed solution to the MEG inverse problem. NeuroImage, 63(2):894–909, November 2012.
12. Ethan M. Meyers. The neural decoding toolbox. Frontiers in Neuroinformatics, 7, May 2013.
13. Wanmei Ou, Matti S. Hämäläinen, and Polina Golland. A distributed spatio-temporal EEG/MEG inverse solver. NeuroImage, 44(3):932–946, 2009.
14. M. C. Potter. Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning and Memory, 2(5):509–522, September 1976.
15. Thomas Serre, Aude Oliva, and Tomaso Poggio. A feedforward architecture accounts for rapid categorization. Proceedings of the National Academy of Sciences of the United States of America, 104(15):6424–6429, April 2007.
16. Gustavo Sudre, Dean Pomerleau, Mark Palatucci, Leila Wehbe, Alona Fyshe, Riitta Salmelin, and Tom Mitchell. Tracking neural coding of perceptual and semantic features of concrete nouns. NeuroImage, 62(1):451–463, August 2012.
17. François Tadel, Sylvain Baillet, John C. Mosher, Dimitrios Pantazis, and Richard M. Leahy. Brainstorm: a user-friendly application for MEG/EEG analysis. Computational Intelligence and Neuroscience, 2011:879716, January 2011.
18. S. Thorpe, D. Fize, and C. Marlot. Speed of processing in the human visual system. Nature, 381(6582):520–522, July 1996.
19. K. Uutela, M. Hämäläinen, and E. Somersalo. Visualization of magnetoencephalographic data using minimum current estimates. NeuroImage, 10(2):173–180, 1999.
