Exploration and exploitation in reward-based visuomotor learning
Jun Izawa and Reza Shadmehr, Johns Hopkins University

In studies of computational motor learning in which externally imposed perturbations induce errors in behavior, it has been assumed that the brain adapts its motor commands to decrease measurable error. However, a recent result (Izawa et al., 2008) suggests that the ultimate goal of motor adaptation is to maximize the reward produced by the behavior. To study how people update their motor commands in order to maximize reward, we conducted a series of experiments that examined reward-based learning in a reaching task. We asked subjects to reach to a target projected on a screen that covered their hand and arm. The position of the hand shown on the screen was rotated from its actual position. One group of subjects (reward-based learning) was told only whether they succeeded or failed on each trial, indicated by an explosion of the target, and received no other feedback about their movement. Another group of subjects (error-based learning) was provided with full visual feedback. Although both groups learned the task, the error-based learning group showed a broad generalization function, whereas the reward-based learning group exhibited a narrow generalization function (Fig. 1B). This suggests that the characteristics of reward-based and error-based learning differ significantly, with two distinct neural mechanisms involved. We then examined the mechanism behind reward-based learning. Since the error itself is not explicitly visible in reward-based learning, learning involves a process of trial and error, in which learners explore how they should act to maximize reward.
During this process, learners confront what is termed the exploration and exploitation dilemma: if learners put more effort into exploring for the best behavior, the exploration itself keeps them from obtaining rewards consistently; on the other hand, if they put in less effort, they may never find the best behavior. Thus, learners might increase their exploration when the learned value (i.e., the expected future sum of rewards) is low, and decrease it when the value is high. To test this
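The hypothesized link between value and exploration can be sketched as a simple rule in which the standard deviation of the search noise shrinks as the learned value grows. This is a minimal illustration; the linear form and its constants are our assumptions, not quantities fitted in the study:

```python
import numpy as np

def exploration_sd(value, sd_min=0.5, sd_max=4.0):
    """Standard deviation (deg) of the motor search noise as a
    decreasing function of the learned value, normalized to [0, 1].
    The linear form and the bounds are illustrative assumptions."""
    value = float(np.clip(value, 0.0, 1.0))
    return sd_min + (sd_max - sd_min) * (1.0 - value)

# Low value -> broad exploration; high value -> narrow exploration.
broad = exploration_sd(0.1)   # ~3.65 deg
narrow = exploration_sd(0.9)  # ~0.85 deg
```

Any decreasing function of value would serve the same purpose; the point is only that the exploration amplitude is tied to the current expectation of reward.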
hypothesis, we used a basic actor-critic reinforcement learning model and fit its parameters to estimate the motor memory (Fig. 1A, green line) by approximating the value function, the action function, and the search noise (Fig. 1A, the difference between the estimated motor memory and the actual action). The estimated value was described as a function of reach angle, with a smooth bell-shaped profile peaking at the desired reach direction (Fig. 1C). To see how subjects planned their exploration, we plotted their active search amplitudes against the estimated values, which showed a significant negative correlation (Fig. 1D). This suggests that subjects adjusted the amplitude of their exploration so that they explored more when the value was low. This notion was supported by a further experiment with a new group of subjects, in which the probability of reward was controlled by limiting how often reward was delivered on successful trials (Fig. 1E). The lower the reward probability, the higher the amplitude of the motor variability (Fig. 1F). That is, subjects actively searched for the correct reach direction by adjusting the amplitude of their exploration based on the expectation of reward (i.e., value). This search produced more variable movements when reward was scarce. Further, to see how learners change the dynamics of the motor memory during reward-based learning, we conducted another experiment with two groups, reducing the reward probability in only one group during the last 24 trials of the learning period (yellow shading in Fig. 1G). Immediately after the learning period, both groups experienced error-clamp trials: the visible hand cursor was projected onto a straight line so that subjects perceived their reach as always moving straight toward the target. The retention curve of the motor memory decayed faster in the lower-probability group than in the other group.
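The account above can be sketched in a small simulation that couples an actor-critic learner (in the spirit of the model described here, though not the fitted model itself) with value-scaled search noise. All constants, the linear noise rule, and the reward-delivery coin flip below are illustrative assumptions:

```python
import numpy as np

def simulate(p_reward, n_trials=240, max_rot=8.0, window=3.0,
             beta=0.3, alpha=0.2, seed=0):
    """Reward-based adaptation to a visuomotor rotation (illustrative sketch).

    w : motor memory (planned reach angle, deg)
    v : critic's running estimate of expected reward (value)
    The search noise widens when v is low; a reach within +/- `window`
    degrees of the rotation solution is a success, rewarded with
    probability `p_reward`. Returns the executed reach angles.
    """
    rng = np.random.default_rng(seed)
    w, v = 0.0, 0.5
    reaches = np.empty(n_trials)
    for k in range(n_trials):
        # rotation ramps up over the first half, then stays at max_rot
        rot = min(max_rot, max_rot * k / (n_trials // 2))
        sd = 0.5 + 3.5 * (1.0 - v)      # value-modulated search noise
        n = rng.normal(0.0, sd)         # active search
        a = w + n                       # executed reach angle
        hit = abs(a - rot) <= window    # cursor lands in the target zone
        r = 1.0 if (hit and rng.random() < p_reward) else 0.0
        delta = r - v                   # reward-prediction error
        w += beta * delta * n           # actor: reinforce the explored direction
        v += alpha * delta              # critic: update the value estimate
        v = min(max(v, 0.0), 1.0)
        reaches[k] = a
    return reaches
```

In this sketch, scarcer reward keeps the value estimate low, which widens the search noise, so the reach angles late in learning are more variable when `p_reward` is low, qualitatively matching the pattern described above.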
These results suggest that the value provided by the action plays an important role in solving the exploration and exploitation dilemma, by controlling how broadly one explores in the memory space and how fast one forgets the memory.
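One simple reading of the retention result is a per-trial retention factor on the motor memory that is lower after a history of scarce reward. The exponential form and the two retention factors below are purely illustrative assumptions, not values estimated in the study:

```python
def retention_curve(w0, retention, n_trials=48):
    """Motor memory across error-clamp trials under simple exponential
    forgetting: w_{k+1} = retention * w_k, with retention in (0, 1)."""
    w = w0
    curve = []
    for _ in range(n_trials):
        curve.append(w)
        w *= retention
    return curve

# Hypothetical retention factors: a lower reward history -> faster forgetting.
high_p = retention_curve(8.0, retention=0.99)
low_p  = retention_curve(8.0, retention=0.95)
```

A value-dependent retention factor is only one way to produce the faster decay; the data constrain the decay rates, not the mechanism behind them.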
Izawa et al. Motor adaptation as a process of reoptimization. J Neurosci 28(11) 2883-2891 (2008)
Figure 1. (A) The reach angle of a representative subject in the reward-based learning paradigm (Experiment 1). The rotation was imposed gradually, up to 8 degrees. The reward signal was provided when the subject's hand was within ±3 degrees of the center of the target (orange shading). On each trial, feedback regarding success (blue point) or failure (red point) was provided. To estimate the motor memory that the subject updated, we used a basic reinforcement learning model.
When the value was defined as V_k = r_k + γr_{k+1} + γ²r_{k+2} + ⋯ + γ^R r_{k+R}, where r_k was the reward at trial k and γ was the discount rate, the learner approximated the value function V(z, s), where z was the weight vector and s was the state of the system (i.e., the reach angle). The action was generated by a = A(w, s) + n, where n was the active search and w was the weight vector. The update rule was w_{k+1} = w_k + β ⋅ TD ⋅ (a − A(w, s)) ∂A/∂w, where the temporal-difference error of the value estimate was TD = r_{k+1} + γV_{k+1} − V_k. To estimate
the subject's motor memory, the free parameters of the model were estimated by fitting it to the data. The green line is the estimated motor memory w. (B) The post-adaptation generalization function. During the learning period, the target was projected only in the forward direction (0 degrees), whereas in the post-adaptation period the target appeared at a randomly selected direction of -30, -20, -10, 0, 10, 20, or 30 degrees, without any visual feedback, for both the reward-based and error-based learning groups. The plot shows how subjects rotated their movement direction with respect to the target position (mean across subjects). There was a significant effect of group between error-based and reward-based learning (F(1,126)=9.632, p=0.005).
(C) The estimated value function V(z, s), normalized by its maximum. The plot shows the mean across subjects. (D) The negative correlation between the active search n and the estimated value. Out of 18 subjects, 16 had a significantly negative correlation and 1 had a significantly positive correlation. The mean correlation across the 18 subjects was significantly different from zero (p<0.001). (E) The reach angle under lower reward probability (Experiment 2). The perturbation block (48 trials) was imposed after the two familiarization blocks, which were followed by zero-perturbation blocks with reward probabilities of 60%, 70%, and 80% on successful trials. The plot shows the reach angle of a representative subject. (F) The variability of the reach angle (SD), computed over the first 12 (dashed) and the next 12 (solid) trials. There was a significant effect of reward probability (F(3,27)=5.67, p=0.004). (G) The faster decay of the retention curve in the lower reward-probability group (Experiment 3). The perturbation was gradually increased up to 8 degrees and then kept constant. During the last 24 trials of the learning period, one group experienced a 60% reward rate. There was a significant interaction between trial number and group (F(47,864)=2.67, p<0.0001), suggesting that the motor memory of the lower-probability group decayed faster than that of the other group.