Optimization strategies in human reinforcement learning

Heiko Hoffmann, Evangelos Theodorou, and Stefan Schaal
University of Southern California, USA

Some human movement skills require optimizing a movement so that a future event has a desired outcome, for example, hitting a ball with a bat or swinging a golf club so that the ball follows a desired trajectory. Learning a movement given only reward feedback is a typical reinforcement learning problem (Sutton and Barto, 1998). While reinforcement learning has been studied extensively in robotics and machine learning, little is known about how humans use reinforcement learning to acquire movement skills. For example, we do not even know which learning strategy humans choose in one of the simplest reinforcement learning settings, i.e., with immediate reward feedback at the end of a movement. Here, we investigate this question using a behavioral paradigm mimicking a ball-hitting task (Fig. 1 A).

Subjects (n=10) sat in front of a computer screen and moved a stylus on a tablet towards an unknown target. This invisible target was located on a line that the subjects had to cross. Each subject performed 100 movement trials. During each movement, visual feedback of the stylus position on the screen was suppressed. After the movement, a reward was displayed graphically as a colored bar. As reward, we used a Gaussian function of the distance between the target location and the point of line crossing. The choice of this function was inspired by the work of Körding and Wolpert (2004), which suggested an inverted Gaussian loss function in sensorimotor tasks.

Subjects learned to adapt their movements towards the hidden target (Fig. 1 B and C). To investigate how they updated their movement choice, we hypothesized three optimization strategies:

1) Reward-weighted average (RW):

x̃_{i+1} = (R_i x_i + R_{i−1} x_{i−1}) / (R_i + R_{i−1})

2) Random search (RS):

x̃_{i+1} = argmax_{x ∈ {x_i, x_{i−1}}} R(x)

3) Gradient ascent (GA):

x̃_{i+1} = x_i + η (R_i − R_{i−1}) / (x_i − x_{i−1})
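
To make the three candidates concrete, here is a minimal Python sketch of the update rules; the function names and the gradient-ascent learning-rate default are illustrative choices of ours, not values taken from the study.

```python
def rw_update(x_prev, x_curr, r_prev, r_curr):
    """Reward-weighted average (RW): the new target is the reward-weighted
    mean of the two most recent crossing points."""
    return (r_curr * x_curr + r_prev * x_prev) / (r_curr + r_prev)


def rs_update(x_prev, x_curr, r_prev, r_curr):
    """Random search (RS): keep whichever of the two most recent crossing
    points earned the higher reward."""
    return x_curr if r_curr >= r_prev else x_prev


def ga_update(x_prev, x_curr, r_prev, r_curr, eta=0.5):
    """Gradient ascent (GA): step along a finite-difference estimate of the
    reward gradient with learning rate eta (0.5 is an arbitrary example)."""
    # With continuous motor noise, x_curr == x_prev has probability zero,
    # so the finite difference below is well defined in practice.
    return x_curr + eta * (r_curr - r_prev) / (x_curr - x_prev)
```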

For simplicity, we assume subjects encode a movement with a single parameter, the point of line crossing, x_i. Thus, we assume the following scenario: at trial i, subjects choose a movement target x̃_i, experience a movement error ν_i, and observe the resulting movement x_i = x̃_i + ν_i and the corresponding reward R_i(x_i). Based on these observations, subjects choose a new movement target x̃_{i+1} according to one of the above strategies. Without the noise ν_i, only GA would converge to the goal if it lies outside the interval [x_{i−1}, x_i]; i.e., the noise assists the exploration of new solutions. The parameter η is the learning rate.

The above strategies make specific predictions about how the expectation value of −(x_{i+1} − x_i)/(x_i − x_{i−1}) depends on R_{i−1}/(R_{i−1} + R_i); see Fig. 2 A. Interestingly, only the prediction of RW was consistent with the data of all 10 subjects (Fig. 2 B). We can quantify this result further. In the case R_i > R_{i−1}, the three strategies predict distinct frequencies p of data points fulfilling −(x_{i+1} − x_i)/(x_i − x_{i−1}) > 0: for RW, p > 0.5; for RS, p = 0.5; and for GA, p < 0.5, under the assumption that the noise ν_i has zero mean. This distinction becomes intuitively clear by inspecting Fig. 2 A for R_{i−1}/(R_{i−1} + R_i) < 0.5, and it can be proven. We computed p for each subject (Fig. 2 C). The mean of p is significantly above 0.5 (t-test: p = 0.005; Wilcoxon signed-rank test: p = 0.01). For simplicity, we limited the update rule to a time window of two data points, x_{i−1} and x_i, but we can prove, and show experimentally, a similar distinction for larger window sizes.

The result that humans may prefer reward-weighted averaging over gradient ascent seems surprising, since the reinforcement-learning literature is dominated by gradient-ascent methods. These methods are indeed preferable if the movement variance (noise) is low. However, for the same noise variance as observed in the subjects, we found in simulation that reward-weighted averaging converges faster than gradient ascent. Thus, one could hypothesize that humans choose the optimization strategy that is most suitable for their own movement variance.
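
As a rough illustration of this setup, the following self-contained Python sketch simulates the task for a single parameter x and computes the diagnostic statistic p. All numerical values (target location, reward width, noise standard deviation, starting points, learning rate, number of trials) are assumptions for illustration, not the experimental values; the RW and GA rules from the sketch above are restated inline.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_reward(x, target=0.0, width=1.0):
    # Reward is a Gaussian function of the distance between the crossing
    # point and the hidden target (target and width are example values).
    return np.exp(-0.5 * ((x - target) / width) ** 2)

def simulate(update, n_trials=100, noise_sd=0.5, x0=-3.0, x1=-2.5):
    """Simulate one 'subject': each planned target is executed with Gaussian
    motor noise, the reward is observed, and the next target is chosen from
    the two most recent observations via the given update rule."""
    xs = [x0 + noise_sd * rng.normal(), x1 + noise_sd * rng.normal()]
    rs = [gaussian_reward(xs[0]), gaussian_reward(xs[1])]
    for _ in range(n_trials - 2):
        x_planned = update(xs[-2], xs[-1], rs[-2], rs[-1])
        x_observed = x_planned + noise_sd * rng.normal()  # motor noise nu_i
        xs.append(x_observed)
        rs.append(gaussian_reward(x_observed))
    return np.array(xs), np.array(rs)

# RW and GA rules, restated inline (eta = 0.5 is an arbitrary example).
rw = lambda xp, xc, rp, rc: (rc * xc + rp * xp) / (rc + rp)
ga = lambda xp, xc, rp, rc: xc + 0.5 * (rc - rp) / (xc - xp)

for name, rule in [("RW", rw), ("GA", ga)]:
    _, r = simulate(rule)
    print(name, "mean reward over the last 20 trials:", round(float(r[-20:].mean()), 2))

# Diagnostic statistic: frequency p of -(x[i+1]-x[i])/(x[i]-x[i-1]) > 0 on
# trials with R[i] > R[i-1]; RW predicts p > 0.5, RS p = 0.5, GA p < 0.5.
x, r = simulate(rw)
ratio = -(x[2:] - x[1:-1]) / (x[1:-1] - x[:-2])
p = np.mean(ratio[r[1:-1] > r[:-2]] > 0)
print("p for a simulated RW subject:", round(float(p), 2))
```

Varying noise_sd in such a simulation is one way to probe the claim that the relative advantage of reward-weighted averaging over gradient ascent depends on movement variance.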

[Figure 1: A, task schematic with start point, movement path, goal line, hidden target, and Gaussian reward function; B, movement paths in X [cm] vs. Y [cm]; C, reward over 100 trials.]

Figure 1: Experiment and raw data. A: Subjects move from a start point and need to cross a goal line. The only feedback is a reward signal at the end of a movement. This reward is a Gaussian function of the point of goal-line crossing. B: Movement adaptation through learning for a typical subject. The first 10 (dashed red) and the last 10 movements (solid blue) are shown. C: Trial-by-trial evolution of the reward, averaged across all 10 subjects. An exponential function is fitted to the data.
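
The exponential fit mentioned in panel C can be reproduced with a standard least-squares fit. Below is a hedged sketch on synthetic data; the functional form, initial guesses, and noise level are assumptions rather than the authors' choices.

```python
import numpy as np
from scipy.optimize import curve_fit

# Saturating-exponential learning curve; the functional form, initial guesses,
# and the synthetic data below are illustrative, not the paper's values.
def exp_curve(trial, r_inf, r0, tau):
    return r_inf + (r0 - r_inf) * np.exp(-trial / tau)

trials = np.arange(1, 101)
rng = np.random.default_rng(1)
mean_reward = exp_curve(trials, 0.8, 0.1, 20.0) + 0.05 * rng.normal(size=trials.size)

(r_inf, r0, tau), _ = curve_fit(exp_curve, trials, mean_reward, p0=[0.5, 0.0, 10.0])
print(f"asymptote {r_inf:.2f}, initial reward {r0:.2f}, time constant {tau:.1f} trials")
```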

[Figure 2: A and B, −∆x_{i+1}/∆x_i plotted against R_{i−1}/(R_{i−1} + R_i); C, the probability p (range 0.3–0.7) for individual subjects.]

Figure 2: Comparison of optimization methods and results. A: Three optimization methods are compared: reward-weighted averaging (red), random search (green), and gradient ascent (blue). For gradient ascent, results for three different learning rates are shown; these results were obtained from simulation. B: On the same graph, experimental results (error bars show standard errors) are compared to the prediction of reward-weighted averaging (red line). Before averaging, experimental results were binned into 10 intervals equally spaced along the x-axis. C: Probabilities p = p(−∆x_{i+1}/∆x_i > 0 | R_i > R_{i−1}) are shown for all subjects. Only for reward-weighted averaging is the average of p expected to be above 0.5. The predicted p itself has variance, which depends on the number of movement trials for each subject.

References

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.

Körding, K. P. and Wolpert, D. M. (2004). The loss function of sensorimotor learning. PNAS, 101, pp. 9839–9842.
