Exploration and exploitation in reward based visuomotor learning

Jun Izawa and Reza Shadmehr, Johns Hopkins University

In studies of computational motor learning in which externally imposed perturbations induce errors in behavior, it has been assumed that the brain adapts motor commands to decrease a measurable error. However, a recent result (Izawa et al. 2008) suggests that the ultimate goal of motor adaptation is to maximize the reward provided by the behavior. To study how people update their motor commands in order to maximize reward, we conducted a series of experiments that examined reward-based learning in a reaching task. We asked subjects to reach to a target projected on a screen that covered their hand and arm. The position of the hand shown on the screen was rotated from its actual position. One group of subjects (reward-based learning) was told only whether they succeeded or failed on each trial, indicated by an explosion of the target, and received no other feedback about their movement. Another group (error-based learning) was provided with full visual feedback. Although both groups learned the task, the error-based learning group showed a broad generalization function, whereas the reward-based learning group exhibited a narrow generalization function (Fig. 1B). This suggests that reward-based and error-based learning have significantly different characteristics and involve two different neural mechanisms. We then examined the mechanism behind reward-based learning. In reward-based learning, since the error itself is not explicitly visible, learning involves a process of trial and error in which learners explore how they should act to maximize reward.
During this process, learners confront what is termed the exploration and exploitation dilemma: if learners put more effort into exploration in search of the best behavior, the exploration itself keeps them from obtaining reward consistently; on the other hand, if they put in less effort, they may not be able to find the best behavior. Thus, learners might increase their exploration when the learned value (i.e. the future sum of rewards) is low and decrease it when the value is high. To test this hypothesis, we used a basic actor-critic model of reinforcement learning and fit its parameters to estimate the motor memory (Fig. 1A, green line) by approximating the value function, the action function, and the search noise (Fig. 1A, the difference between the estimated motor memory and the actual action). Each subject's estimated value was described as a function of the reach angle, with a smooth bell-shaped profile that peaked at the desired reach direction (Fig. 1C). To see how the subjects planned their exploration, we plotted their active search amplitudes against the estimated values and found a significant negative correlation. This suggests that the subjects adjusted the amplitude of exploration so that they explored more when the value was low. This notion was supported by a further experiment with a new group of subjects, in which reward probability was controlled by limiting how often reward was given on successful trials (Fig. 1E). The lower the reward probability, the higher the amplitude of the motor variability (Fig. 1F). That is, the subjects actively searched for the reach direction by adjusting the amplitude of exploration based on the expectation of reward (i.e. value). This search produced more variable movements when reward was scarce. Furthermore, to see how learners change the dynamics of the memory during reward-based learning, we conducted another experiment with two groups, reducing the reward probability during the last 24 trials of the learning period in only one group (yellow shading in Fig. 1G). Immediately after the learning period, both groups experienced error-clamp trials: the visible hand cursor was projected on a straight line so that the subjects perceived their reach as always moving straight toward the target. The retention curvature of the motor memory decayed faster in the lower-probability group than in the other group.
These results suggest that the value provided by an action plays an important role in solving the exploration and exploitation dilemma by controlling how broadly one explores the memory space and how fast one forgets the memory.
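The correlation analysis described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it takes the active search amplitude on each trial to be the absolute difference between the actual action and the estimated motor memory, and computes a Pearson correlation against the estimated value. The function names and the toy numbers are assumptions made for illustration only; they are not data from the study.

```python
import math


def search_amplitudes(actions, memory):
    """Active search amplitude per trial: |actual action - estimated motor memory|."""
    if len(actions) != len(memory):
        raise ValueError("actions and memory must have the same length")
    return [abs(a - m) for a, m in zip(actions, memory)]


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Hypothetical per-trial values, chosen only to exercise the functions.
    values = [0.2, 0.4, 0.6, 0.8]      # estimated value on each trial
    actions = [6.0, 6.0, 6.0, 6.0]     # actual reach angle (degrees)
    memory = [2.0, 3.0, 4.0, 5.0]      # estimated motor memory (degrees)
    amplitudes = search_amplitudes(actions, memory)
    print("r =", pearson_r(values, amplitudes))
```

A negative coefficient corresponds to the pattern reported here: larger search amplitudes on trials where the estimated value is low.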

Izawa et al. (2008) Motor adaptation as a process of reoptimization. J Neurosci 28(11):2883-2891.

Figure 1 (A) The reach angle of a representative subject in the reward-based learning paradigm (Experiment 1). The rotation was imposed gradually up to 8 degrees. The rewarding signal was provided when the subject's hand was within ±3 degrees of the center of the target (orange shading). On each trial, feedback regarding success (blue point) or failure (red point) was provided. To estimate the motor memory that the subject updated, we used a basic reinforcement learning model. The value was defined as $V_k = r_k + \gamma r_{k+1} + \gamma^2 r_{k+2} + \cdots + \gamma^R r_{k+R}$, where $r_k$ was the reward at trial $k$ and $\gamma$ was the discount rate. The learner approximated the value function $V(z, s)$, where $z$ was the weight vector and $s$ was the state of the system (i.e. reach angle). The action was made by $a = A(w, s) + n$, where $n$ was the active search and $w$ was the weight vector. The update rule was $w^{(k+1)} = w^{(k)} + \beta \cdot TD \cdot (a - A(w, s)) \, \partial A / \partial w$, where the error of the value estimation was $TD = r_{k+1} + \gamma V_{k+1} - V_k$. To estimate the subject's motor memory, the free parameters of the model were estimated by fitting. The green line is the estimated motor memory $w$. (B) The post-adaptation generalization function. In the learning period, the target was projected only in the forward direction (0 degrees), whereas in the post-adaptation period the target appeared at a randomly selected direction of -30, -20, -10, 0, 10, 20, or 30 degrees without any visual feedback, for both the reward-based and the error-based learning groups. The plot shows how subjects rotated their movement direction with respect to the target position (mean across subjects). There was a significant effect of group between error-based learning and reward-based learning (F(1,126)=9.632, p=0.005).

(C) The estimated value function $V(z, s)$. The value function was normalized by its maximum. The plot shows the mean across subjects. (D) The negative correlation between the active search $n$ and the estimated value. Of 18 subjects, 16 showed a significantly negative correlation and 1 showed a significantly positive correlation. The mean correlation across the 18 subjects differed from zero (p<0.001). (E) The reach angle under lower reward probability (Experiment 2). The perturbation block (48 trials) was imposed after the two familiarization blocks, followed by the zero-perturbation blocks, which included reward probabilities of 60%, 70%, and 80% on successful trials. The plot shows the reach angle of a representative subject. (F) The variability of the reach angle (SD) was computed over the first 12 (dashed) and the next 12 (solid) trials. There was a significant effect of reward probability (F(3,27)=5.67, p=0.004). (G) The fast decay in the retention curvature in the lower reward-probability group (Experiment 3). The perturbation was gradually shifted up to 8 degrees and then kept constant. During the last 24 trials of the learning period, one group experienced a 60% reward rate. There was a significant interaction between trial number and group (F(47,864)=2.67, p<0.0001), suggesting that the motor memory of the lower-probability group decayed faster than that of the other group.
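The model in panel (A) can be illustrated with a deliberately reduced simulation. This is a sketch under strong assumptions rather than the fitted model: the state is collapsed to a single context, so the actor is a scalar $A(w, s) = w$ with $\partial A / \partial w = 1$ and the critic is a scalar $V = z$; the discount rate is set to zero, so the error term reduces to $r - V$; and the search noise is Gaussian with a standard deviation that shrinks linearly as the value grows. The linear noise rule, the parameter names (`alpha`, `sigma_min`, `sigma_max`, `ramp_trials`), and all default numbers are hypothetical. Only the 8 degree gradual rotation and the ±3 degree reward zone come from the caption.

```python
import random


def simulate_actor_critic(n_trials=200, target_max=8.0, ramp_trials=80,
                          half_width=3.0, alpha=0.2, beta=0.5,
                          sigma_min=0.5, sigma_max=3.0, seed=0):
    """Simulate binary-reward learning of a gradually shifting reach angle."""
    rng = random.Random(seed)
    w = 0.0   # actor weight: motor memory (degrees), assumed scalar
    z = 0.5   # critic weight: estimated value, i.e. expected reward
    log = []
    for k in range(n_trials):
        # Center of the rewarded zone ramps up gradually, then stays constant.
        target = target_max * min(1.0, k / ramp_trials)
        # Assumed rule: search noise is larger when the value is low.
        sigma = sigma_min + (sigma_max - sigma_min) * (1.0 - z)
        n = rng.gauss(0.0, sigma)        # active search
        a = w + n                        # action: a = A(w, s) + n
        r = 1.0 if abs(a - target) <= half_width else 0.0
        td = r - z                       # value error with discount rate 0
        w += beta * td * (a - w)         # actor update, dA/dw = 1
        z += alpha * td                  # critic update (assumed learning rate)
        log.append({"trial": k, "action": a, "memory": w,
                    "value": z, "sigma": sigma, "reward": r})
    return log


if __name__ == "__main__":
    trials = simulate_actor_critic()
    last = trials[-1]
    print("final memory:", last["memory"], "final value:", last["value"])
```

Because the noise amplitude is tied to the value in this sketch, any manipulation that lowers the reward rate also raises the simulated movement variability, which is the qualitative relationship shown in panels (D) and (F).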
