Reinforcement Learning with Exploration Stuart Ian Reynolds A thesis submitted to The University of Birmingham for the degree of Doctor of Philosophy
School of Computer Science The University of Birmingham Birmingham B15 2TT United Kingdom December 2002
Abstract

Reinforcement Learning (RL) techniques may be used to find optimal controllers for multi-step decision problems where the task is to maximise some reward signal. Successful applications include backgammon, network routing and scheduling problems. In many situations it is useful or necessary to have methods that learn about one behaviour while actually following another (i.e. `off-policy' methods). Most commonly, the learner may be required to follow an exploring behaviour, while its goal is to learn about the optimal behaviour. Existing methods for learning in this way (namely, Q-learning and Watkins' Q(λ)) are notoriously inefficient with their use of real experience. More efficient methods exist but are either unsound (in that they are provably non-convergent to optimal solutions in standard formalisms), or are not easy to apply online. Online learning is an important factor in effective exploration. Being able to quickly assign credit to the actions that lead to rewards means that more informed choices between actions can be made sooner. A new algorithm is introduced to overcome these problems. It works online, without `eligibility traces', and has a naturally efficient implementation. Experiments and analysis characterise when it is likely to outperform existing related methods. New insights into the use of optimism for encouraging exploration are also discovered. It is found that standard practices can have a strongly negative effect on the performance of a large class of RL methods for control optimisation. Also examined are large and non-discrete state-space problems where `function approximation' is needed, but where many RL methods are known to be unstable. In particular, these are control optimisation methods and methods where experience is gathered in `off-policy' distributions (e.g. while exploring).
By a new choice of error measure to minimise, the well-studied linear gradient descent methods are shown to be `stable' when used with any `discounted return' estimating RL method. The notion of stability is weak (very large, but finite, error bounds are shown), but the result is significant insofar as it covers new cases such as off-policy and multi-step methods for control optimisation. New ways of viewing the goal of function approximation in RL are also examined. Rather than a process of error minimisation between the learned and observed reward signal, the objective is viewed as that of finding representations that make it possible to identify the best action for given states. A new `decision boundary partitioning' algorithm is presented with this goal in mind. The method recursively refines the value-function representation, increasing its resolution in areas where it is expected that this will result in better decision policies.
Acknowledgements

My deepest gratitude goes to my friend and long-time supervisor, Manfred Kerber. My demands on his time over the past five years for discussion, feedback and proof readings could (at best) be described as unreasonable. Unlike many PhD students my work was not tied to any particular grant, supervisor or research topic, and I can think of few other people who would be willing to supervise work outside of their own field. Through Manfred I was lucky to have the freedom to explore the areas that interested me the most, and also to publish my work independently. For reasons I won't discuss here, these freedoms are becoming increasingly rare; any supervisor who provides them has a truly generous nature. Without his constant encouragement (and harassment) and his enormous expertise, I am sure that this thesis would never have reached completion. In several cases, important ideas would have fallen by the wayside without Manfred to point out the interest in them.

For patiently introducing me to the topics that interest me the most I am extremely grateful to Jeremy Wyatt. Through his reinforcement learning reading group, the objectionable became the obsession, and the obfuscated became the Obvious. As the only local expert in my field, his enthusiasm in my ideas has been the greatest motivation throughout. Without it I would surely have quit my PhD within the first year.

I thank the other members of my thesis group (past and present), Xin Yao, Russell Beale and John Barnden, for their support and guidance throughout. I also thank my department for funding my study (and extensive worldwide travel) through a Teaching Assistant scheme. Without this, not only would I not have had the freedom to pursue my own research, I would never have had the opportunity to perform research at all. I thank Remi Munos and Andrew Moore for hosting my enlightening (but ultimately too short) sabbatical with them at Carnegie Mellon, and my department for funding the visit.
I thank Geoff Gordon for indulging my long Q+A discussions about his work that led to new contributions. I am lucky to have benefited from discussions and advice (no matter how brief) with many of the field's other leading luminaries. These include Richard Sutton, Marco Wiering, Doina Precup, Leslie Kaelbling and Thomas Dietterich. I thank John Bullinaria for finally setting me straight on neural networks. Thanks to Tim Kovacs, who co-founded the reinforcement learning reading group. As my office-mate for many years he has been the person to receive my most uncooked ideas. I look forward to more of his ottertainment in the future and promise to return all of his pens the next time we meet.
Through discussions about my work (or theirs), by providing technical assistance, or even through alcoholic stress-relief, I have benefited from many other members of my department. Among others, these people include: Adrian Hartley, Axel Groman, Marcin Chady, Johnny Page, Kevin Lucas, John Woodward, Gavin Brown, Achim Jung, Riccardo Poli and Richard Pannell. My apologies to Dee, who I'm sure is the happiest of all to see this finished.

For my parents, for everything.
Contents

1 Introduction
  1.1 Artificial Intelligence and Machine Learning
  1.2 Forms of Learning
  1.3 Reinforcement Learning
      1.3.1 Sequential Decision Tasks and the Delayed Credit Assignment Problem
  1.4 Learning and Exploration
  1.5 About This Thesis
  1.6 Structure of the Thesis

2 Dynamic Programming
  2.1 Markov Decision Processes
  2.2 Policies, State Values and Return
  2.3 Policy Evaluation
      2.3.1 Q-Functions
      2.3.2 In-Place and Asynchronous Updating
  2.4 Optimal Control
      2.4.1 Optimality
      2.4.2 Policy Improvement
      2.4.3 The Convergence and Termination of Policy Iteration
      2.4.4 Value Iteration
  2.5 Summary

3 Learning from Interaction
  3.1 Introduction
  3.2 Incremental Estimation of Means
  3.3 Monte Carlo Methods for Policy Evaluation
  3.4 Temporal Difference Learning for Policy Evaluation
      3.4.1 Truncated Corrected Return Estimates
      3.4.2 TD(0)
      3.4.3 SARSA(0)
      3.4.4 Return Estimate Length
      3.4.5 Eligibility Traces: TD(λ)
      3.4.6 SARSA(λ)
      3.4.7 Replace Trace Methods
      3.4.8 Acyclic Environments
      3.4.9 The Non-Equivalence of Online Methods in Cyclic Environments
  3.5 Temporal Difference Learning for Control
      3.5.1 Q(0): Q-learning
      3.5.2 The Exploration-Exploitation Dilemma
      3.5.3 Exploration Sensitivity
      3.5.4 The Off-Policy Predicate
  3.6 Indirect Reinforcement Learning
  3.7 Summary

4 Efficient Off-Policy Control
  4.1 Introduction
  4.2 Accelerating Q(λ)
      4.2.1 Fast Q(λ)
      4.2.2 Revisions to Fast Q(λ)
      4.2.3 Validation
      4.2.4 Discussion
  4.3 Backwards Replay
  4.4 Experience Stack Reinforcement Learning
      4.4.1 The Experience Stack
  4.5 Experimental Results
  4.6 The Effects of λ on the Experience Stack Method
  4.7 Initial Bias and the max Operator
      4.7.1 Empirical Demonstration
      4.7.2 The Need for Optimism
      4.7.3 Separating Value Predictions from Optimism
      4.7.4 Discussion
      4.7.5 Initial Bias and Backwards Replay
      4.7.6 Initial Bias and SARSA(λ)
  4.8 Summary

5 Function Approximation
  5.1 Introduction
  5.2 Example Scenario and Solution
  5.3 The Parameter Estimation Framework
      5.3.1 Representing Return Estimate Functions
      5.3.2 Taxonomy
  5.4 Linear Methods (Perceptrons)
      5.4.1 Incremental Gradient Descent
      5.4.2 Step Size Normalisation
  5.5 Input Mappings
      5.5.1 State Aggregation (Aliasing)
      5.5.2 Binary Coarse Coding (CMAC)
      5.5.3 Radial Basis Functions
      5.5.4 Feature Width, Distribution and Gradient
      5.5.5 Efficiency Considerations
  5.6 The Bootstrapping Problem
  5.7 Linear Averagers
      5.7.1 Discounted Return Estimate Functions are Bounded Contractions
      5.7.2 Bounded Function Approximation
      5.7.3 Boundedness Example
      5.7.4 Adaptive Representation Schemes
      5.7.5 Discussion
  5.8 Summary

6 Adaptive Resolution Representations
  6.1 Introduction
  6.2 Decision Boundary Partitioning (DBP)
      6.2.1 The Representation
      6.2.2 Refinement Criteria
      6.2.3 The Algorithm
      6.2.4 Empirical Results
  6.3 Related Work
      6.3.1 Multigrid Methods
      6.3.2 Non-Uniform Methods
  6.4 Discussion

7 Value and Model Learning With Discretisation
  7.1 Introduction
  7.2 Example: Single Step Methods and the Aliased Corridor Task
  7.3 Multi-Timescale Learning
  7.4 First-State Updates
  7.5 Empirical Results
  7.6 Discussion

8 Summary
  8.1 Review
  8.2 Contributions
  8.3 Future Directions
  8.4 Concluding Remarks

A Foundation Theory of Dynamic Programming
  A.1 Full Backup Operators
  A.2 Unique Fixed-Points and Optima
  A.3 Norm Measures
  A.4 Contraction Mappings
      A.4.1 Bellman Residual Reduction

B Modified Policy Iteration Termination

C Continuous Time TD(λ)

D Notation, Terminology and Abbreviations
Chapter 1
Introduction

1.1 Artificial Intelligence and Machine Learning

Artificial Intelligence (AI) is the study of artificial machines that exhibit `intelligent' behaviour. Intelligence itself is a notoriously difficult term to define, but commonly we associate it with the ability to learn from experience. Machine learning is the related field that has given rise to intelligent agents (computer programs) that do just this.

The idea of creating machines that imitate what humans do cannot fail to fascinate and inspire. Through the study of AI and machine learning we may discover hidden truths about ourselves. How did we come to be? How do we do what we do? Am I a computer running a computer program? And if so, can we simulate such a program on a computer? AI may even be able to offer insights into age-old philosophical questions. Who am I? And what is the importance of self?

Increasingly though, AI and machine learning are becoming engineering disciplines, rather than natural sciences. In the industrial age we asked, "How can I build machines to do work for me?" In the information age we now ask, "How can I build machines that think for me?" The difficulties faced in building such machines are enormous. How can we build intelligent learning machines when we know so little about the origins of our own intelligence?

In this thesis, I examine Reinforcement Learning (RL) algorithms. We will see how these computer algorithms can learn to solve very complex problems with the bare minimum of information. The algorithms are not hard-wired solutions to specific problems, but instead learn to solve problems through their past experiences.
1.2 Forms of Learning

This thesis examines how agents can learn how to act in order to solve decision problems. The task is to find a mapping from situations to actions that is better than others by some measure. Learning could be said to have occurred if, on the basis of its prior experience, an agent
chooses to act differently (hopefully for the better) in some situation than it might have done prior to collecting this experience. How learning occurs depends upon the form of feedback that is available. For example, through observing what happens after leaving home on different days with different kinds of weather, it may be possible to learn the following association between situations, actions and their consequences: "If the sky is cloudy, and I don't take my umbrella, then I am likely to get wet." Observing the consequence of leaving home without an umbrella on a cloudy day is a form of feedback. However, the consequences of actions in themselves do not tell how to choose better actions. Whether the agent should prefer to leave home with an umbrella depends on whether it minds getting wet. Clearly, without some form of utility attached to actions, it is impossible to know what changes could lead the agent to act in a better way. Learning without this utility is called unsupervised learning, and cannot directly lead to better agent behaviour.

If feedback is given in the form, "If it is cloudy, you should take an umbrella," then supervised learning is occurring. A teacher (or supervisor) is assumed to be available that knows the best action to take in a given situation. The supervisor can provide advice that corrects the actions taken by the agent.

If feedback is given in the form of positive or negative reinforcements (rewards), for example, "Earlier it was cloudy. You didn't take your umbrella. Now you got wet. That was pretty bad," then the agent learns through reinforcement learning. Learning occurs by making adjustments to the situation-action mapping that maximises the amount of positive reinforcement received and minimises the negative reinforcement. Often reinforcements are scalar values (e.g. −1 for a bad action, +10 for a good one). A wide variety of algorithms are available for learning in this way. This thesis reviews and improves on a number of them.
1.3 Reinforcement Learning

The key difference between supervised learning and reinforcement learning is that, in reinforcement learning, an agent is never told the correct action it should take in a situation, but only some measure of how good or how bad an action is. It is up to the learning element itself to decide which actions are best given this information. This is part of the great appeal of reinforcement learning: solutions to complex decision problems may often be found by providing the minimum possible information required to solve the problem.
In many animals, reinforcement learning is the only form of learning that appears to occur, and it is an essential part of human behaviour. We burn our hands in a fire and very quickly learn not to do that again. Pleasure and pain are good examples of rewards that reinforce patterns of behaviour.

Successful Example: TD-Gammon

A reinforcement learning system, TD-Gammon, has been used to learn to play the game of backgammon [155]. The system was set up such that a positive reinforcement was given upon winning a game. With little other information, the program learned a level of play equal to that of grandmaster human players with many years of experience. What is spectacular about the success of this system is that it learned entirely through self-play over several days of playing. No external heuristic information or teacher was available to suggest which moves might be best to take, other than the reinforcement received at the end of each game.¹
1.3.1 Sequential Decision Tasks and the Delayed Credit Assignment Problem
Many interesting problems (such as backgammon) can be modelled as sequential decision tasks. Here, the system may contain many states (such as board positions), and differing actions may lead to different states. A whole series of actions may be required to get to a particular state (such as winning a game). The effects of any particular action may not become apparent until some time after it was taken. In the backgammon example, the reinforcement learner must be able to associate the utility of the actions it takes in the opening stages of the game with the likelihood of it winning the game, in order to improve the quality of its opening moves. The problem of learning to associate actions with their long-term consequences is known as the delayed credit assignment problem [130, 150]. This thesis deals with ways of solving delayed credit assignment problems. In particular, it deals primarily with value-function based methods in which the long-term utility of being in a particular state, or of taking an action in a state, is modelled. By learning long-term value-estimates, we will see that these methods transform the difficult problem of determining the long-term effects of an action into the easy problem of deciding which is the best-looking immediate action.
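The transformation described above can be made concrete: once long-term action values have been learned, choosing an action is a one-step lookup. The sketch below is illustrative only; the Q-table, state names and values are invented, not taken from the thesis.

```python
# Hypothetical learned action-value table: Q[(state, action)] -> estimated
# long-term return for taking that action in that state.
Q = {
    ("opening", "play_safe"): 0.42,
    ("opening", "build_prime"): 0.61,
    ("opening", "hit_blot"): 0.35,
}

def greedy_action(Q, state, actions):
    """Pick the action with the highest learned long-term value.

    The hard problem (long-term consequences of each action) is already
    folded into Q, so choosing reduces to an argmax over stored estimates.
    """
    return max(actions, key=lambda a: Q[(state, a)])

print(greedy_action(Q, "opening", ["play_safe", "build_prime", "hit_blot"]))
# -> build_prime
```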
¹The learner was provided with a model that predicts the likelihood of the next possible board positions for each of the possible rolls of the dice (i.e. the rules of the game). However, unlike similarly successful chess programs, a lengthy search of possible future moves is never conducted. Instead, the program simply learns the general quality of board configurations, and uses its knowledge about dice rolls and possible moves to choose a move which leads it to the next immediately `best-looking' board configuration.

1.4 Learning and Exploration

In many practical cases, utility-based learning methods (both supervised learning and reinforcement learning) face a difficult dilemma. Given that the methods are often put to work to solve some real-world problem, should the system directly attempt to do its best to solve the problem based upon its prior experience, or should it follow other courses of action in
the hope that these will reveal better ways of acting in the future? This is known as the exploration/exploitation dilemma [130, 150]. This dilemma is particularly important to reinforcement learning. For supervised learning it is often assumed that the way in which exploration of the problem is conducted is the responsibility of the teacher (i.e. not the responsibility of the learning element). For reinforcement learning, the reverse is more often true. The learning agent itself is usually expected to decide which actions to take in order to gain more information about how the problem may better be solved. Finding good general methods for doing so remains a difficult and interesting research question, but it is not the subject of this thesis. A separate question is how reinforcement learning methods can continue to solve the desired problem while exploring (or, more precisely, while not exploiting). Many reinforcement learning algorithms are known to behave poorly while not exploiting. One of this thesis' major contributions is an examination of how these methods can be improved.
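As background, one standard and very simple way of trading exploitation against exploration is ε-greedy action selection: exploit the best-looking action most of the time, but occasionally act at random. This is a textbook illustration, not the thesis' own proposal; the Q-table convention is an assumption.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1, rng=random):
    """With probability epsilon take a uniformly random (exploring) action;
    otherwise exploit the action with the highest current value estimate."""
    if rng.random() < epsilon:
        return rng.choice(actions)       # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```

With `epsilon=0` the agent always exploits; with `epsilon=1` it always explores. Most practical schemes decay ε over time as the value estimates become more trustworthy.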
1.5 About This Thesis

This thesis began as a piece of research into multi-agent systems in which many agents compete or collaborate to solve individual or collective problems. Reinforcement learning was identified as a technique that can allow agents to do this. Although multi-agent learning is not covered, two questions arose from this work which are now the subject of this thesis:

- In many tasks, the agent's environment may be very large. Typically, the agent cannot hope to visit all of the environment's states within its lifetime. Generalisations (and approximations) must be made in order to infer the best actions to take in unvisited states. If so, can internal representations be found such that the agent's ability to take the best actions is improved?

- If learning while not exploiting, many reinforcement learning algorithms are known to be inefficient, inaccurate or unstable. What can be done to improve this situation?

The second question (although researched most recently) is covered first, as it follows more directly from the fundamental material presented in the early chapters. The first question is covered in the final chapters but was researched first. Since then, a great deal of related work has been done by other researchers tackling the same question. This work is also reviewed.
1.6 Structure of the Thesis

The following items provide an overview of each part of the thesis.

Chapter 2 introduces some simplifying formalisms, Markov Decision Processes (MDPs), and basic solution methods, Dynamic Programming (DP), upon which reinforcement learning methods build. A minor error in an existing version of the policy-iteration algorithm is corrected.
Chapter 3 introduces standard reinforcement learning methods for learning without prior knowledge of the environment or the specific task to be solved. Here the need for reinforcement learning while not exploiting is identified, and the deficiencies in existing solution methods are made clear. Also, this chapter challenges a common assumption about a class of existing algorithms: we will see cases where "accumulate trace" methods are not approximately equivalent to their "forward view" counterparts.

Chapter 4 introduces computationally efficient alternatives to the basic eligibility trace methods. The Fast Q(λ) algorithm is reviewed and minor changes to it are suggested. The backwards replay algorithm is also reviewed and proposed as a simpler and naturally efficient alternative to eligibility trace methods. The method also has the added advantage of learning with information that is more "up-to-date". However, it is not obvious how backwards replay can be employed for online learning in cyclic environments. A new algorithm is proposed to solve this problem and is also intended to provide improvements when learning while exploring. The experimental results with this algorithm lead to a new insight: optimism can inhibit learning in a class of control optimising algorithms. Optimism is commonly encouraged in order to aid exploration, and so this comes as a counter-intuitive idea to many.

Chapter 5 reviews standard function approximation methods that are used to allow reinforcement learning to be employed in large and non-discrete state spaces. The well-studied and often employed linear gradient descent methods for least mean square error minimisation are known to be unstable in a variety of scenarios. A new error measure is suggested, and it is shown that this leads to provably more stable reinforcement learning methods. Although the notion of stability is rather weak (only the boundedness of methods is proved, and very large bounds are given), this stability is established for i) methods performing stochastic control optimisation and ii) learning with arbitrary experience distributions, where this was not previously known to hold.

Chapter 6 examines a new function approximation method that is motivated not by error minimisation, but by adapting the resolution of the agent's internal representation such that its ability to choose between different actions is improved. The decision boundary partitioning heuristic is proposed and compared against similar fixed resolution methods. Recent and simultaneously conducted work along these lines by Munos and Moore is also reviewed.

Chapter 7 examines reinforcement learning in continuous time. This is a natural extension for methods that learn with adaptive representations. A simple modification of standard reinforcement learning methods is proposed that is intended to reduce biasing problems associated with employing bootstrapping methods in coarsely discretised continuous spaces. An accumulate trace TD(λ) algorithm for the Semi-Markov Decision Process (SMDP) case is also developed, and a forwards-backwards equivalence proof of the batch mode version of this algorithm is established.

Chapter 8 concludes, lists the thesis' contributions and suggests future work. The new contributions can be found throughout the thesis.
Appendix A reviews some basic terminology and proofs about dynamic programming methods that are employed elsewhere in the thesis.

Appendix B shows termination error bounds of a new modified policy-iteration algorithm.

Appendix C contains the forwards-backwards equivalence proof of the batch mode SMDP accumulate trace TD(λ) algorithm.

Appendix D provides a useful guide to notation and terminology.
New contributions are made throughout. Readers with a detailed knowledge of reinforcement learning are recommended to read the contributions section in Chapter 8 before the rest of the thesis.
Chapter 2
Dynamic Programming

Chapter Outline

This chapter reviews the theoretical foundations of value-based reinforcement learning. It covers the standard formal framework used to describe the agent-environment interaction, and also techniques for finding optimal control strategies within this framework.
2.1 Markov Decision Processes

A great part of the work done on reinforcement learning, in particular that on convergence proofs, assumes that the interaction between the agent and the environment can be modelled as a discrete-time finite Markov decision process (MDP). In this formalism, a step in the life of an agent proceeds as follows. At time $t$, the learner is in some state, $s \in S$, and takes some action, $a \in A(s)$, according to a policy, $\pi$. Upon taking the action the learner enters another state, $s'$, at $t + 1$ with probability $P^a_{ss'}$. For making this transition, the learner receives a scalar reward, $r_{t+1}$, given by a random variable whose expectation is denoted as $R^a_{ss'}$.

A discrete finite Markov process consists of:

- the state-space, $S$, which consists of a finite set of states $\{s_1, s_2, \ldots, s_N\}$,
- a finite set of actions available from each state, $A(s) = \{a_{s1}, a_{s2}, \ldots, a_{sM}\}$,
- a global clock, $t = 1, 2, \ldots, T$, counting discrete time steps ($T$ may be infinite),
- a state transition function, $P^a_{ss'} = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ (i.e. the probability of observing $s'$ at $t + 1$ given that action $a$ was taken in state $s$ at time $t$).
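The components listed above translate directly into a data structure. Below is a minimal sketch of a finite MDP with sampled transitions; the two-state process, its transition probabilities and rewards are invented purely for illustration.

```python
import random

# Transition function: P[(s, a)] -> list of (next_state, probability) pairs.
P = {
    ("s1", "a1"): [("s1", 0.3), ("s2", 0.7)],
    ("s2", "a1"): [("s1", 1.0)],
}
# Expected rewards: R[(s, a, s')] -> expected scalar reward for the transition.
R = {
    ("s1", "a1", "s1"): 0.0,
    ("s1", "a1", "s2"): 1.0,
    ("s2", "a1", "s1"): 0.0,
}

def step(state, action, rng=random):
    """Sample s' from P^a_{ss'} and return (s', expected reward R^a_{ss'})."""
    next_states, probs = zip(*P[(state, action)])
    s_next = rng.choices(next_states, weights=probs)[0]
    return s_next, R[(state, action, s_next)]
```

Note that each row of `P` sums to 1, as a probability distribution over successor states must.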
Figure 2.1: A Markov Decision Process. Large circles are states, small black dots are actions. Some states may have many actions. An action may lead to differing successor states with a given probability.

For the RL framework we also add:

- a reward function which, given a $\langle s, a, s' \rangle$ triple, generates a random scalar-valued reward with a fixed distribution. The expected reward for taking $a$ in $s$ and then entering $s'$ is defined here as $R^a_{ss'}$.

A process is said to be Markov if it has the Markov Property. Formally, the Markov Property holds if

$\Pr(s_{t+1} \mid s_t, a_t) = \Pr(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots).$    (2.1)

That is to say, the probability distribution over states entered at $t + 1$ is conditionally independent of the events prior to $(s_t, a_t)$; knowing the current state and the action taken is sufficient to define what happens at the next step. In reinforcement learning, we also assume the same for the reward function:

$\Pr(r_{t+1} \mid s_t, a_t) = \Pr(r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots).$    (2.2)

The Markov property is a simplifying assumption which makes it possible to reason about optimality and proofs in a more straightforward way. For a more detailed account of MDPs see [21] or [114]. For the remainder of this section the terms process and environment will be used interchangeably, under the assumption that the agent's environment can be exactly modelled as a discrete finite Markov process. In later chapters we examine cases where this assumption does not hold.
2.2 Policies, State Values and Return

A policy, \pi, determines how the agent selects actions from the state in which it finds itself. In general, a policy is any mapping from states to actions. A policy may be deterministic, in which case \pi(s_t) = a_t, or it may specify a distribution over actions, \pi(s, a) = Pr(a = a_t \mid s = s_t). Once we have established a policy, we can ask how much return this policy generates from any given state in the process. Return is a measure of reward collected for taking a series of actions. The value of a state is a measure of the expected return we can achieve for being in a state and following a given policy thereafter (i.e. its mean long-term utility). RL problems can therefore be further categorised by what estimate of return we want to maximise:
Single Step Problems. Here agents should act to maximise the immediate reward available from the current state. The value of a state, V^\pi(s), is defined as,

V^\pi(s) = E_\pi[r_{t+1} \mid s = s_t]   (2.3)

where E_\pi denotes an expectation given that actions are chosen according to \pi.

Finite Horizon Problems. Here agents should act to maximise the reward available given that there are just k more steps available to collect the reward. The value of a state is defined as,

V^\pi_{(k)}(s) = \begin{cases} 0, & \text{if } k = 0, \\ E_\pi\left[ r_{t+1} + \gamma V^\pi_{(k-1)}(s_{t+1}) \mid s = s_t \right], & \text{otherwise.} \end{cases}   (2.4)

Receding Horizon Problems. The agent should act to maximise the finite horizon return at each step (i.e. we act to maximise V^\pi_{(k)} for all t, and k is fixed at every step).

Infinite Horizon Problems. The agent should act to maximise the reward available over an infinite future.
Most work in RL has centred around single-step and infinite horizon problems. In the infinite horizon case, it is common to use the total future discounted return as the value of a state:

z^\infty_t = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^k r_{t+k+1} + \cdots   (2.5)

The parameter \gamma \in [0, 1] is a discount factor. Choosing \gamma < 1 denotes a preference for receiving immediate rewards over those in the more distant future. It also ensures that the return is bounded in cases where the agent may collect reward indefinitely (i.e. if the task is non-episodic or non-terminating), since all infinite geometric series have finite sums for a common ratio of |\gamma| < 1. The infinite horizon case is also of special interest as it allows the value of a state to be concisely defined recursively:

V^\pi(s) = E_\pi[z^\infty_t \mid s = s_t]
         = E_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s = s_t]
         = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]   (2.6)
Equation 2.6 is known as a Bellman equation for V^\pi (see [15]).

Terminal States. Some environments may contain terminal states. Entering such a state means that no more reward can be collected this episode. To be consistent with the infinite horizon formalism, a terminal state is usually modelled as a state in which all actions lead back to itself and generate no reward. In practice, it is usually easiest to model all terminal states as a single special state, s^+, whose value is zero and in which no actions are available.
2.3 Policy Evaluation

For some fixed stochastic policy, \pi, the iterative policy evaluation algorithm shown in Figure 2.2 will find an approximation of its state-value function, \hat{V}^\pi (see also [114, 150]). The hat notation, \hat{x}, indicates an approximation of some true value, x. Step 5 of the algorithm simply applies the Bellman equation (2.6) upon an old estimate of V^\pi to generate a new estimate (this is called a backup or update). Making updates for all states is called a sweep. It is intuitively easy to see that this algorithm will converge upon V^\pi if 0 \le \gamma < 1. Assume that the initial value function estimate has a worst initial error of \epsilon_0 in any state:

\hat{V}_0(s) = V^\pi(s) \pm \epsilon_0   (2.7)

Throughout, \pm\epsilon is used to denote a bound in order to simplify notation. That is to say,

V^\pi(s) - \epsilon_0 \le \hat{V}_0(s) \le V^\pi(s) + \epsilon_0   (2.8)

and not,

\hat{V}_0(s) = V^\pi(s) + \epsilon_0, \quad \text{or,} \quad \hat{V}_0(s) = V^\pi(s) - \epsilon_0.   (2.9)

1) Initialise \hat{V}_0 with arbitrary finite values; k \leftarrow 0
2) do
3)   \Delta \leftarrow 0
4)   for each s \in S
5)     \hat{V}_{k+1}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \hat{V}_k(s') \right]
6)     \Delta \leftarrow \max(\Delta, |\hat{V}_{k+1}(s) - \hat{V}_k(s)|)
7)   k \leftarrow k + 1
8) while \Delta > \epsilon_T

Figure 2.2: The synchronous iterative policy evaluation algorithm. Determines the value function of a fixed stochastic policy to within a maximum deviation from V^\pi of \frac{\gamma}{1-\gamma}\epsilon_T in any state, for 0 \le \gamma < 1.

1) Initialise \hat{Q}_0 with arbitrary finite values; k \leftarrow 0
2) do
3)   \Delta \leftarrow 0
4)   for each \langle s, a \rangle \in S \times A(s)
5)     \hat{Q}_{k+1}(s,a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi(s',a') \hat{Q}_k(s',a') \right]
6)     \Delta \leftarrow \max(\Delta, |\hat{Q}_{k+1}(s,a) - \hat{Q}_k(s,a)|)
7)   k \leftarrow k + 1
8) while \Delta > \epsilon_T

Figure 2.3: The synchronous iterative policy evaluation algorithm for determining Q^\pi to within \frac{\gamma}{1-\gamma}\epsilon_T.
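To make the backup in step 5 concrete, the synchronous algorithm of Figure 2.2 might be sketched in Python as follows. The MDP encoding (nested dictionaries of (probability, successor, reward) triples) and the two-state example are hypothetical illustrations, not taken from the thesis.

```python
def policy_evaluation(states, actions, P, pi, gamma=0.9, eps_T=1e-8):
    """Synchronous iterative policy evaluation (a sketch of Figure 2.2).

    P[s][a] is a list of (prob, s_next, reward) triples and pi[s][a] is the
    probability that the fixed policy picks action a in state s.
    """
    V = {s: 0.0 for s in states}              # arbitrary finite initial values
    while True:
        delta = 0.0
        V_new = {}                            # Jacobi-style: built from the old V only
        for s in states:
            V_new[s] = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                          for p, s2, r in P[s][a])
                           for a in actions[s])
            delta = max(delta, abs(V_new[s] - V[s]))
        V = V_new
        if delta <= eps_T:
            return V

# Hypothetical two-state example: 'stay' loops in state 0 with reward 1;
# 'go' moves to the absorbing, zero-reward state 1.
states = [0, 1]
actions = {0: ['stay', 'go'], 1: ['stay']}
P = {0: {'stay': [(1.0, 0, 1.0)], 'go': [(1.0, 1, 0.0)]},
     1: {'stay': [(1.0, 1, 0.0)]}}
pi = {0: {'stay': 1.0, 'go': 0.0}, 1: {'stay': 1.0}}
V = policy_evaluation(states, actions, P, pi, gamma=0.9)
# V[0] approaches 1/(1 - 0.9) = 10 for the always-stay policy
```

The geometric series 1 + 0.9 + 0.9^2 + \cdots = 10 gives the expected value, illustrating the bounded return noted under Equation 2.5.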
After the first iteration we have (at worst),

\hat{V}_1(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \hat{V}_0(s') \right]
            = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma (V^\pi(s') \pm \epsilon_0) \right]
            = \pm\gamma\epsilon_0 + \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]
            = \pm\gamma\epsilon_0 + V^\pi(s)   (2.10)

Note that only the true value function, V^\pi, and not its estimate, \hat{V}, appears on the right-hand side of 2.10. Continuing the iteration we have,

\hat{V}_2(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \hat{V}_1(s') \right]
            = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma (V^\pi(s') \pm \gamma\epsilon_0) \right]
            = \pm\gamma^2\epsilon_0 + \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]
            = \pm\gamma^2\epsilon_0 + V^\pi(s)
            \vdots
\hat{V}_k(s) = \pm\gamma^k\epsilon_0 + V^\pi(s)   (2.11)

Thus if 0 \le \gamma < 1 then the convergence of \hat{V} to V^\pi is assured in the limit (as k \to \infty) since \lim_{k\to\infty} \left[ \gamma^k \epsilon_0 \right] = 0. The following contraction mapping can be derived from 2.11 and states that each update strictly reduces the worst value estimate in any state by a factor of \gamma (also see Appendix A) [114, 20, 17]:

\max_s |\hat{V}_{k+1}(s) - V^\pi(s)| \le \gamma \max_s |\hat{V}_k(s) - V^\pi(s)|.   (2.12)

The termination condition in step 8 of the algorithm allows it to stop once a satisfactory maximum error has been reached. This recursive process of iteratively re-estimating the value function in terms of itself is called bootstrapping. Since Equation 2.6 represents a system of linear equations, several alternative solution methods, such as Gaussian elimination, could be used to exactly find V^\pi (see [34, 80]). However, most of the learning methods described in this thesis are, in one way or another, derived from iterative policy evaluation and work by making iterative approximations of value estimates.

2.3.1 Q-Functions
In addition to state-values we can also define state-action values (Q-values) as:

Q^\pi(s,a) = E_\pi[r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s = s_t, a = a_t]
          = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^\pi(s') \right]   (2.13)
Intuitively, this Q-function (due to Watkins, [163]) gives the value of following an action for one step plus the discounted expected value of following the policy thereafter. The expected value of a state under a given stochastic policy may be found solely from the Q-values at that state:

V^\pi(s) = \sum_a \pi(s,a) Q^\pi(s,a)   (2.14)

and so the Q-function may be fully defined independently of V^\pi (by combining Equations 2.13 and 2.14):

Q^\pi(s,a) = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \sum_{a'} \pi(s',a') Q^\pi(s',a') \right]   (2.15)

It is straightforward to modify the iterative policy evaluation algorithm to approximate a Q-function instead of a state-value function (see Figure 2.3). Note that V^\pi and Q^\pi are easily interchangeable when given R and P. Also, from Equation (2.14), it follows that knowing Q^\pi and \pi is enough to determine V^\pi (without R or P). The reverse is not true. We will see in Section 2.4.2 that being able to compare action-values makes it trivial to make improvements to the policy.

2.3.2 In-Place and Asynchronous Updating
Step 5 of each algorithm in Figures 2.2 and 2.3 performs updates that have the form \hat{U}_{k+1} = f(\hat{U}_k), where \hat{U} is a utility function, \hat{V} or \hat{Q}. That is to say, a new value for every state or state-action pair is given entirely from the last value function or Q-function. The algorithms are usually presented in this way only to simplify proofs about their convergence. This form of updating is called synchronous or Jacobi-style [150]. A better method is to make the updates in place [17, 150] (i.e. we perform \hat{U}(s) \leftarrow f(\hat{U}(s)) for one state, and make further backups to other states in the same sweep using this new estimate). This requires storing only one value function or Q-function rather than two and is referred to as in-place or Gauss-Seidel updating. This method also usually converges faster since the values in the successor states upon which updates are based may have been updated within the same sweep and so are more up-to-date. A third alternative is asynchronous updating [20]. This is the same as the in-place method except that it allows states or state-action pairs (SAPs) to be updated in any order and with varying frequency. This method is known to converge provided that all states (or SAPs) are updated infinitely often but with a finite frequency. An advantage of this approach is that the number of updates may be distributed unevenly, with more updates being given to more important parts of the state space [17, 18, 20].
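The benefit of in-place (Gauss-Seidel) updating can be made concrete with a small sketch; the chain MDP below is a hypothetical example. A sweep ordered against the direction of the chain's transitions propagates value across several states in a single pass, because each backup reuses values already updated in the same sweep.

```python
def in_place_sweep(order, actions, P, pi, V, gamma):
    """One Gauss-Seidel sweep over the states in the given order.

    Each backup immediately overwrites V[s], so later backups in the same
    sweep can reuse the freshly updated estimates.
    """
    delta = 0.0
    for s in order:
        v_new = sum(pi[s][a] * sum(p * (r + gamma * V[s2])
                                   for p, s2, r in P[s][a])
                    for a in actions[s])
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new                          # in-place overwrite
    return delta

# Hypothetical 3-state chain 0 -> 1 -> 2 (absorbing), reward 1 on entering 2.
P = {0: {'fwd': [(1.0, 1, 0.0)]},
     1: {'fwd': [(1.0, 2, 1.0)]},
     2: {'fwd': [(1.0, 2, 0.0)]}}
actions = {s: ['fwd'] for s in P}
pi = {s: {'fwd': 1.0} for s in P}
V = {s: 0.0 for s in P}
# Sweeping against the flow of the chain propagates value in one sweep:
in_place_sweep([2, 1, 0], actions, P, pi, V, gamma=0.9)
# After one sweep V[1] = 1.0 and V[0] = 0.9; a synchronous sweep from the
# same initial values would have left V[0] at 0.0.
```

Sweeping in the order [0, 1, 2] instead would need a second sweep before state 0 sees any value, which is the sensitivity to ordering that asynchronous updating exploits.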
2.4 Optimal Control

In the previous section we have seen how to find the long-term utility of being in a state, or of being in a state and taking a specific action, and then following a fixed policy thereafter. While it's useful to know how good a policy is, we'd really prefer to know how to produce better policies. Ultimately, we'd like to find optimal policies.

2.4.1 Optimality
An optimal policy, \pi^*, is any which achieves the maximum expected return when starting from any state in the process. The optimal Q-function, Q^*, is defined by the Bellman optimality equation [15]:

Q^*(s,a) = \max_\pi Q^\pi(s,a)
         = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*(s',a') \right]   (2.16)

Similarly, V^* is given as:

V^*(s) = \max_a Q^*(s,a)
       = \max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma V^*(s') \right]   (2.17)
There may be many optimal policies for some MDPs: this only requires that there are states whose actions yield equivalent expected returns. In such cases, there are also stochastic optimal policies for that process. However, every MDP always has at least one deterministic optimal policy. This follows simply from noting that if a SAP leads to a higher mean return than the other actions for that state, then it is better to always take that action than some mix of actions in that state. As a result, most control optimisation methods seek only deterministic policies even though stochastic optimal policies may exist.

2.4.2 Policy Improvement
Improving a policy as a whole simply involves improving the policy in a single state. To do this, we make the policy greedy with respect to Q^\pi. The greedy action, a^g_s, for a state is defined as,

a^g_s = \arg\max_a Q(s,a)   (2.18)

A greedy policy, \pi_g, is one which yields a greedy action in every state. An improved policy may be achieved by making it greedy in any state:

\pi(s) \leftarrow \arg\max_a Q(s,a)   (2.19)
1) k \leftarrow 0
2) do
3)   find Q^{\pi_k} for \pi_k                        (evaluate policy)
4)   for each s \in S do
5)     \pi_{k+1}(s) = \arg\max_a Q^{\pi_k}(s,a)      (improve policy)
6)   k \leftarrow k + 1
7) while \pi_k \ne \pi_{k-1}
Figure 2.4: Policy Iteration. Upon termination \pi is optimal provided that Q^{\pi_k} can be found accurately (see 2.4.3). In the improvement step (step 5), ties between equivalent actions should be broken consistently to return a consistent policy for the same Q-function, and so also allow the algorithm to terminate. Step 3 is assumed to evaluate Q^{\pi_k} exactly.

The policy improvement theorem, first stated by Bellman and Dreyfus [16], states that if,

\max_a Q^\pi(s,a) \ge \sum_a \pi(s,a) Q^\pi(s,a)
holds, then it is at least as good to take a greedy action in s as to follow \pi since, if the agent now passes through this state, it can expect to collect at least \max_a Q^\pi(s,a) (in the mean) rather than \sum_a \pi(s,a) Q^\pi(s,a) from there onward [16].^1 The actual improvement may be greater since changing the policy at s may also improve the policy for states following from s in the case where s may be revisited during the same episode. The improved policy can be evaluated and then improved again. This process can be repeated until the policy can be improved no further in any state, at which point an optimal policy must have been found. The policy iteration algorithm shown in Figure 2.4 (adapted from [150], first devised by Howard [56]) performs essentially this iterative process except that the policy improvement step is applied to every state in between policy evaluations. Combinations of local improvements upon a fixed Q^\pi will also produce strictly (globally) improving policies: any local improvement in the policy can only maintain or increase the expected return available from the states that lead into it.

2.4.3 The Convergence and Termination of Policy Iteration
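The evaluate-then-improve loop of Figure 2.4 might be sketched in Python as follows. The evaluation step here uses iterative policy evaluation rather than an exact solve, ties are broken consistently by sorting action names, and the tabular MDP encoding (lists of (probability, successor, reward) triples) is a hypothetical illustration.

```python
def policy_iteration(states, actions, P, gamma=0.9, eval_eps=1e-10):
    """Policy iteration sketch (Figure 2.4): evaluate the current
    deterministic policy, then make it greedy, until it is stable."""
    pi = {s: actions[s][0] for s in states}   # arbitrary initial policy
    while True:
        # Evaluate pi by iterating its Bellman equation (in place).
        V = {s: 0.0 for s in states}
        while True:
            delta = 0.0
            for s in states:
                v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
                delta, V[s] = max(delta, abs(v - V[s])), v
            if delta <= eval_eps:
                break
        # Improve: make pi greedy, breaking ties consistently (sorted order).
        stable = True
        for s in states:
            q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                 for a in actions[s]}
            best = max(sorted(actions[s]), key=lambda a: q[a])
            if best != pi[s]:
                pi[s], stable = best, False
        if stable:
            return pi, V

# Hypothetical two-state example: 'stay' loops in state 0 with reward 1;
# 'go' moves to the absorbing, zero-reward state 1.
states = [0, 1]
actions = {0: ['go', 'stay'], 1: ['stay']}
P = {0: {'stay': [(1.0, 0, 1.0)], 'go': [(1.0, 1, 0.0)]},
     1: {'stay': [(1.0, 1, 0.0)]}}
pi, V = policy_iteration(states, actions, P)
# The improved policy prefers 'stay' in state 0, with V[0] near 10.
```

The consistent tie-breaking (via sorting) is what the caption of Figure 2.4 asks for: without it, two equivalent greedy actions could alternate between iterations and prevent termination.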
With Exact Q^\pi. Showing that the policy iteration algorithm terminates with an optimal policy in finite time is straightforward. Note that, i) the policy improvement step only produces deterministic policies, of which there are only |A|^{|S|} and, ii) each new policy strictly improves upon the previous (unless the policy is already optimal). Put these facts together and it is clear that the algorithm must terminate with the optimal policy in fewer than k^n improvement steps (k = |A|, n = |S|) [114]. In most cases this is a gross overestimate of the required number of iterations until termination. More recently, Mansour and Singh have provided a tighter bound of O(k^n / n) improvement steps [77]. Both of these bounds exclude the cost of evaluating the policy at each iteration.
^1 Implicitly, this statement rests on knowing Q^\pi accurately.
[Tables for Figure 2.5: (left) the symbolic estimates V_k(2) and V_k(3) alternate between \gamma^i V_0(2) and \gamma^i V_0(3) as i grows; (right) numeric estimates V_k(2), \ldots, V_k(5) for k = 0, 1, 2, \ldots under in-place updating with \gamma = 0.9, declining from 1.000 towards zero.]
Figure 2.5: Example processes where the modified policy-iteration algorithm in Figure 2.4 converges to optimal estimates but fails to terminate if the evaluation of Q^\pi is approximate. In both processes, all rewards are zero and so V(s) = Q(s,a) = 0 for all states and actions. Termination will not occur in each case because the greedy policy in state 1 never stabilises. The actions in this state have equivalent values under the optimal policy, but the greedy action flip-flops indefinitely between the two choices while there is any error in the value estimates. In both cases, the error is only eliminated as k \to \infty. The value of the successor state selected by the policy in state 1 after policy improvement is shown in bold. (left) Synchronous updating with V_0(2) > V_0(3) > 0, and 0 < \gamma < 1. (right) In-place updating with \gamma = 0.9. Updates are made in the sequence given by the state numbers.

With Approximate Q^\pi. The above proof requires that an accurate Q^\pi is found between iterations. Methods to do this are generally computationally expensive. An alternative method is modified policy iteration, which employs the iterative (and approximate) policy evaluation algorithm from Section 2.3 to evaluate the policy in step 3 [114, 21]. Using the last found value function or Q-function as the initial estimate for the iterative policy evaluation algorithm will usually reduce the number of sweeps required before termination. However, if Q^{\pi_k} is only known approximately, then between iterations \arg\max_a Q^{\pi_k}(s,a) may oscillate between actions for states where there are equivalent (or near-equivalent) true Q-values for the optimal policy.^2 This can be true even if Q^{\pi_k} monotonically continues to move towards Q^* since, in some cases, the Q-values of the actions in a state may improve at varying rates and so their relative order may continue to change. Figure 2.5 illustrates this new insight with two examples.
^2 For practical implementations, it should be noted that due to the limitations of machine precision, even algorithms that are intended to solve Q^\pi precisely may suffer from this phenomenon.
1) do:
2)   \hat{V} \leftarrow evaluate(\pi, \hat{V})
3)   \Delta \leftarrow 0
4)   for each s \in S:
5)     a_g \leftarrow \arg\max_a \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \hat{V}(s') \right]
6)     v' \leftarrow \sum_{s'} P^{a_g}_{ss'} \left[ R^{a_g}_{ss'} + \gamma \hat{V}(s') \right]
7)     \Delta \leftarrow \max(\Delta, |\hat{V}(s) - v'|)
8)     \pi(s) \leftarrow a_g
9) while \Delta > \epsilon_T
Figure 2.6: Modified Policy Iteration. Upon termination \pi is optimal to within some small error (see text).

Overcoming this only requires that the main loop terminates when the improvement in the policy in any state has become sufficiently small (see the termination condition in Figure 2.6). The policy iteration algorithm published in [150] also requires the same change to guarantee its termination. The algorithm in Figure 2.6 guarantees that,

V^*(s) - V^\pi(s) \le \frac{2\gamma\epsilon_T}{1 - \gamma}   (2.20)

holds upon termination, for some termination threshold \epsilon_T. If \epsilon_T = 0 then the algorithm is equivalent to modified policy iteration. Part B of the Appendix establishes the straightforward proof of termination and error bounds; these follow directly from the work of Williams and Baird [172]. The proof assumes that the evaluate step of the revised algorithm applies,

\hat{V}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \hat{V}(s') \right]

at least once for every state, either synchronously or asynchronously.

2.4.4 Value Iteration
The modified policy iteration algorithm alternates between evaluating \hat{Q} for a fixed \pi and then improving \pi based upon the new Q-function estimate. We can interleave these methods to a finer degree by using iterative policy evaluation to evaluate the greedy policy rather than a fixed policy. This is done by replacing step 5 of the iterative policy evaluation algorithm in Figure 2.3 with:

\hat{Q}_{k+1}(s,a) \leftarrow \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} \hat{Q}_k(s',a') \right].   (2.21)
Note that, for the synchronous updating case, this new algorithm is exactly equivalent to performing one sweep of policy evaluation followed by a policy improvement sweep; no policy
improvement step needs to be explicitly performed since the greedy policy is implicitly being evaluated by \max_{a'} \hat{Q}_k(s',a'). It is less than obvious that this new algorithm will converge upon Q^*. As a policy evaluation method it is no longer evaluating a fixed policy but chasing a non-stationary one. In fact the algorithm does converge upon Q^* as k \to \infty. By considering the case where \hat{Q}_0 = 0, we can see that (synchronous) value iteration progressively solves a slightly different problem, and in the limit finds Q^*. Let a k-step finite horizon discounted return be defined as follows:

r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k}

To behave optimally in a k-step problem is to act to maximise this return given that there are k steps available to do so. Let \pi^*_{(k)}(s) denote an optimal policy for the k-step finite horizon problem. Then in the case where \hat{Q}_0 = 0 we have,

Q^*_{(1)}(s,a) = E[r_{t+1} \mid s = s_t, a = a_t]   (2.22)
             = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} \hat{Q}_0(s',a') \right]

Thus after 1 sweep, the Q-function is the solution of the discounted 1-step finite horizon problem. That is to say, Q^*_{(1)} predicts the maximum expected 1-step finite horizon return and so \pi^*_{(1)}(s) = \arg\max_a Q^*_{(1)}(s,a). Clearly, the optimal value of an action when there are 2 steps to go is the expected value of taking that action and then acting to maximise the discounted expected return with 1 step to go, 1 step on:

Q^*_{(2)}(s,a) = E\left[ r_{t+1} + \gamma E_{\pi^*_{(1)}}[r_{t+2}] \mid s = s_t, a = a_t \right]
             = E\left[ r_{t+1} + \gamma \max_{a'} Q^*_{(1)}(s_{t+1},a') \mid s = s_t, a = a_t, \pi_{t+1} = \pi^*_{(1)} \right]
             = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*_{(1)}(s',a') \right]

Thus \pi^*_{(2)}(s) = \arg\max_a Q^*_{(2)}(s,a) and is an optimal policy for the 2-step finite horizon problem. With k steps to go we have:

Q^*_{(k+1)}(s,a) = E\left[ r_{t+1} + E_{\pi^*_{(k)}}\left[ \sum_{i=2}^{k+1} \gamma^{i-1} r_{t+i} \right] \mid s = s_t, a = a_t \right]
               = E\left[ r_{t+1} + \gamma \max_{a'} Q^*_{(k)}(s_{t+1},a') \mid s = s_t, a = a_t, \pi_{t+1} = \pi^*_{(k)} \right]
               = \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \max_{a'} Q^*_{(k)}(s',a') \right]   (2.23)
So, under the assumption that the Q-function is initialised to zero, it is clear that value iteration (with synchronous updates) has solved the k-step finite horizon problem
after k iterations. That is to say that it finds the Q-function for the policy that maximises the expected k-step discounted return:

\max_\pi E_\pi\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} \right]   (2.24)

which differs from maximising the expected infinite discounted return,

\max_\pi E_\pi\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{k-1} r_{t+k} + \cdots \right]   (2.25)

by an arbitrarily small amount for a large enough k and 0 \le \gamma < 1. Thus, value iteration assures that \hat{Q}_k converges upon Q^* as k \to \infty since \hat{Q}_{(\infty)} = Q^*, given \hat{Q}_0(s,a) = 0 and 0 \le \gamma < 1. A more rigorous proof that applies for arbitrary (finite) initial value functions was established by Bellman [15] and can be found in Section A.4. In particular, the following contraction mapping can be shown, which avoids the need to assume \hat{Q}_0 = 0:

\max_s |\hat{V}_{k+1}(s) - V^*(s)| \le \gamma \max_s |\hat{V}_k(s) - V^*(s)|.   (2.26)

Proofs of convergence for the in-place and asynchronous updating cases have also been established [17].
2.5 Summary

We have seen how dynamic programming methods can be used to evaluate the long-term utility of fixed policies, and how, by making the evaluation policy greedy, optimal policies may also be converged upon. Value iteration and policy iteration form the basis of all of the RL algorithms detailed in this thesis. Although they are a powerful and general tool for solving difficult multi-step decision problems in stochastic environments, the MDP formalism and dynamic programming methods so far presented suffer a number of limitations:

1. Availability of a Model. Dynamic programming methods assume that a model of the environment (P and R) is available in advance, and that no further knowledge of, or
interaction with, the environment is required in order to determine how to act optimally within it. However, in many cases of interest, a prior model is not generally available, nor is it always clear how such a model might be constructed in any efficient manner. Fortunately, even without a model, a number of alternatives are available to us. It remains possible to learn a model, or even to learn a value function or Q-function directly through experience gained from within the environment. Reinforcement learning through interacting with the environment is the subject of the next chapter.

2. Small Finite Spaces. In many practical problems, a state might correspond to a point in a high-dimensional space: s = \langle x_1, x_2, \ldots, x_n \rangle. Each dimension corresponds to a particular feature of the problem being solved. For instance, suppose our task is to design
an optimal strategy for the game of tic-tac-toe. Each component of the board state, x_i, describes the contents of one cell in a 3 \times 3 grid (1 \le i \le 9), and can take one of three values ("X", "O" or "empty"). In this case, the size of the state space is 3^9. For a game of draughts, we have 32 usable tiles and a state space size of the order of 3^{32}. In general, given n features, each of which can take k possible values, we have a state space of size k^n. In other words, the size of the state space grows exponentially with its dimensionality. Correspondingly, so grows the memory required to store a value function and the time required to solve such a problem. This exponential growth in the space and time costs for a small increase in the problem size is referred to as the "Curse of Dimensionality" (due to Bellman, [15]). Similarly, if the state-space has infinitely many states (e.g. if the state-space is continuous) then it is simply impossible to exactly store individual values for each state. In both cases, using a function approximator to represent approximations of the value function or model can help. These are discussed in Chapter 6.

3. Markov Property. In practice, the Markov property is hard to obtain. There are many cases where the description of the current state may lack important information necessary to choose the best action. For instance, suppose that you find yourself in a large building where many of the corridors look the same. In this case, based upon what is seen locally, it may be impossible to decide upon the best direction to move, given that some other part of the building looks the same but where some other direction is best. In many instances such as this, the environment may really be an MDP, although it may not be the case that the agent can exactly observe its true state. However, the prior sequence of observations (of states, actions, rewards and successors) often reveals useful information about the likely real state of the process (e.g.
if I remember how many flights of stairs I went up, I can now tell which corridor I am in with greater certainty). This kind of problem can be formalised as a Partially Observable Markov Decision Process (POMDP). A POMDP is often defined as an MDP (which includes S, A, P and a reward function), plus a set of prior observations and a mapping from real states to observations. These problems and their related solution methods are not examined in this thesis. See [27] or [74] for excellent introductions and field overviews.
4. Discrete Time. The MDP formalism assumes that there is a fixed, discrete amount of time between state observations. In many problems this is untrue, and events occur at varying real-valued time intervals (or even occur continuously). A good example is the state of a queue for an elevator [36]. At t = 0 the state of the queue might be empty (s_0). Some time later someone may join the queue (we make a transition to s_1), but the time interval between state transitions can take some real value whose probability may be given by a continuous distribution. Variable and continuous time interval variants of MDPs are referred to as Semi-Markov Decision Processes (SMDPs) [114], and are examined in Chapter 7.
5. Undiscounted Ergodic Tasks. In cases where reward may be collected indefinitely and discounting is not desired, the discounted return model may not be used since the future sum of rewards with \gamma = 1 may be unbounded. Furthermore, even in cases where the returns can be shown to be bounded, with \gamma = 1 the policy iteration and value iteration algorithms are not guaranteed to converge upon Q^*. This follows as a result of using bootstrapping and the max operator, which causes any optimistic initial bias in the Q-function to remain indefinitely. If discounting is not desired, then an average reward per step formalism can be used. Here the expected return is defined as follows [132, 153, 75, 21]:

\rho^\pi = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E_\pi[r_t]

This formalism is problematic in processes where all states are reachable from any other under the policy (such a process is said to be ergodic), since the long-term average reward is then independent of the starting state. However, even in this case, from some states a higher than average return may be gained for some short time, and so such a state might be considered to be better. Quantitatively, the value of a state can be defined by the relative difference between the long-term average reward from any state, \rho^\pi, and the reward following a starting state:

V^\pi(s) = \sum_{k=1}^{\infty} \left( E[r_{t+k} \mid s_t = s, \pi] - \rho^\pi \right)

Thus a policy may be improved by modifying it to increase the time that the system spends in high-valued states (thereby raising \rho^\pi). Average reward methods are not examined in this thesis.
Chapter 3
Learning from Interaction

Chapter Outline

In this chapter we see how reinforcement learning problems can be solved solely through interacting with the environment and learning from what is observed. No knowledge of the task being solved needs to be provided. A number of standard algorithms for learning in this way are reviewed. The shortcomings of exploration-insensitive model-free control methods are highlighted, and new intuitions about the online behaviour of accumulate-trace TD(\lambda) methods illustrated.
3.1 Introduction

The methods in the previous chapter showed how to find optimal solutions to multi-step decision problems. While these techniques are invaluable tools for operations research and planning, it is difficult to think of them as techniques for learning. No experience is gathered: all of the necessary information required to solve their task of finding an optimal policy is known from the outset. The methods presented in this chapter start with no model (i.e. P and R are unknown). Every improvement made follows only from the information collected through interactions with the environment. Most of the methods follow the algorithm pattern shown in Figure 3.1.

Direct and Indirect Reinforcement Learning. Broadly, (value-based) RL methods for learning through interaction can be split into two categories. Indirect methods use their experiences in the environment to construct a model of it, usually by building estimates of the transition and reward functions, P and R. This model can then be used to generate value functions or Q-functions using, for instance, methods similar to the dynamic
programming techniques introduced in the last chapter. Indirect methods are also termed model-based (or model-learning) RL methods. Alternatively, we can learn value functions and Q-functions directly from the reward signal, forgoing a model. This is called the direct or model-free approach to reinforcement learning. This chapter first presents an incremental estimation rule. From this we see how the direct methods are derived, and then the indirect methods.

1) for each episode:
2)   Initialise: t \leftarrow 0; s_{t=0}
3)   while s_t is not terminal:
4)     select a_t
5)     follow a_t; observe r_{t+1}, s_{t+1}
6)     perform updates to \hat{P}, \hat{R}, \hat{Q} and/or \hat{V} using the new experience \langle s_t, a_t, r_{t+1}, s_{t+1} \rangle
7)     t \leftarrow t + 1

Figure 3.1: An abstract incremental online reinforcement learning algorithm.
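The pattern of Figure 3.1 can be sketched as a generic loop in Python. The environment interface (a `step(s, a)` function returning a reward and a successor, with `None` marking termination) and the simple counting model are hypothetical illustrations of an indirect (model-learning) update.

```python
def run_episode(step, start, select_action, update, max_steps=1000):
    """One episode of the abstract online RL loop in Figure 3.1.

    step(s, a) stands in for the unknown environment: it returns
    (reward, next_state), with next_state None on termination.
    """
    s, t = start, 0
    while s is not None and t < max_steps:
        a = select_action(s)
        r, s2 = step(s, a)
        update(s, a, r, s2)       # learn from the experience <s, a, r, s'>
        s, t = s2, t + 1

# A minimal indirect update: count transitions and record rewards, from
# which estimates of P and R could later be formed.
counts, rewards = {}, {}
def model_update(s, a, r, s2):
    counts.setdefault((s, a), {}).setdefault(s2, 0)
    counts[(s, a)][s2] += 1
    rewards.setdefault((s, a), []).append(r)

# Toy deterministic environment: A -> B (reward 0), B -> terminal (reward 1).
def step(s, a):
    return (1.0, None) if s == 'B' else (0.0, 'B')

run_episode(step, 'A', select_action=lambda s: 'only', update=model_update)
```

Direct methods fit the same loop: only the `update` callback changes, writing into \hat{Q} or \hat{V} instead of into a model.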
3.2 Incremental Estimation of Means

Direct methods can be thought of as algorithms that attempt to estimate the mean of a return signal solely from observations of that return signal. For most direct methods, this is usually done incrementally by applying an update rule of the following form:

\hat{Z}_k = RunningAverage(\hat{Z}_{k-1}, z_k, \alpha_k)
         = \hat{Z}_{k-1} + \alpha_k (z_k - \hat{Z}_{k-1})   (3.1)

where \hat{Z}_k is the new estimated mean which includes the kth observation, z_k, of a random variable, and \alpha_k \in [0, 1] is a step-size (or learning rate) parameter. Each observation is assumed to be a bounded scalar value given by a random variable with a stationary distribution. By defining the learning rate in different ways the update rule can be given a number of useful properties. These are listed below.

Running Average. With \alpha_k = 1/k, \hat{Z}_k is the sample mean (i.e. average) of the set of k observations \{z_1, \ldots, z_k\}:

\hat{Z}_k = \frac{1}{k} \sum_{i=1}^{k} z_i   (3.2)

The following derivation of update 3.1 is from [150]:

\hat{Z}_{k+1} = \frac{1}{k+1} \sum_{i=1}^{k+1} z_i
            = \frac{1}{k+1} \left( z_{k+1} + \sum_{i=1}^{k} z_i \right)
            = \frac{1}{k+1} \left( z_{k+1} + k \left( \frac{1}{k} \sum_{i=1}^{k} z_i \right) \right)
            = \frac{1}{k+1} \left( z_{k+1} + k \hat{Z}_k \right)
            = \frac{1}{k+1} \left( z_{k+1} + (k+1) \hat{Z}_k - \hat{Z}_k \right)
            = \hat{Z}_k + \frac{1}{k+1} \left( z_{k+1} - \hat{Z}_k \right)   (3.3)
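The recursion (3.3) is easy to check numerically. The following sketch (with hypothetical data) confirms that update 3.1 with \alpha_k = 1/k reproduces the sample mean exactly:

```python
def running_average(Z, z, alpha):
    """Update 3.1: move the current estimate toward the new observation."""
    return Z + alpha * (z - Z)

obs = [4.0, 8.0, 6.0, 2.0]
Z = 0.0                        # initial bias; eliminated since alpha_1 = 1
for k, z in enumerate(obs, start=1):
    Z = running_average(Z, z, 1.0 / k)
# Z equals sum(obs) / len(obs) = 5.0
```

Because \alpha_1 = 1, the first observation completely overwrites the arbitrary initial estimate, so this schedule carries no initial bias.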
Recency Weighted Average. By choosing a constant value for \alpha (where 0 < \alpha < 1), update 3.1 can be used to calculate a recency-weighted average. This can be seen more clearly by expanding the right-hand side of Equation 3.1:

\hat{Z}_{t+1} = \alpha z_{t+1} + (1 - \alpha) \hat{Z}_t   (3.4)

Intuitively, each new observation forms a fixed percentage of the new estimate. Recency-weighted averages are useful if the observations are drawn from a non-stationary distribution. In cases where \alpha_1 \ne 1, the estimates \hat{Z}_k (k > 1) may be partially determined by the initial estimate, \hat{Z}_0. Such estimates are said to be biased by the initial estimate; \hat{Z}_0 is an initial bias.
Mean in the Limit. From standard statistics, with \alpha_k = 1/k, from Equation 3.2 we have,

\lim_{k \to \infty} \hat{Z}_k = E[z].   (3.5)

However, more usefully, Equation 3.5 also holds if,

1) \sum_{k=1}^{\infty} \alpha_k = \infty   (3.6)

2) \sum_{k=1}^{\infty} \alpha_k^2 < \infty   (3.7)

both hold. These are the Robbins-Monro conditions and appear frequently as conditions for convergence of many stochastic approximation algorithms [126]. The first condition ensures that, at any point, the sum of the remaining step-sizes is infinite and so the current estimate will eventually become insignificant. Thus, if the current estimate contains some kind of bias, then this is eventually eliminated. The second condition ensures that the step sizes eventually become small enough so that any variance in the observations can be overcome. In most interesting learning problems, there is the possibility of trading lower bias for higher variance, or vice versa. Slowly declining learning rates reduce bias more quickly but
converge more slowly. Reducing the learning rate quickly gives fast convergence but slow reductions in bias. If the learning rate is declined too quickly, premature convergence upon a value other than E[z] may occur. The Robbins-Monro conditions guarantee that this cannot happen. Conditions 1 and 2 are known to hold for,

\alpha_k(s) = \frac{1}{k(s)^\phi}   (3.8)

at the kth update of \hat{Z}(s), for 1/2 < \phi \le 1 [167].
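A quick numeric check of schedule (3.8): with noisy samples of a fixed mean and \phi = 0.7 (inside the permitted range (1/2, 1]), the estimate settles near E[z]. The Gaussian sampling distribution here is a hypothetical choice, purely for illustration.

```python
import random

def estimate_mean(samples, phi):
    """Estimate E[z] with step sizes alpha_k = 1 / k**phi (Equation 3.8)."""
    Z = 0.0
    for k, z in enumerate(samples, start=1):
        Z += (1.0 / k ** phi) * (z - Z)
    return Z

rng = random.Random(0)
samples = [rng.gauss(3.0, 1.0) for _ in range(20000)]
Z = estimate_mean(samples, phi=0.7)    # phi in (1/2, 1] satisfies (3.6), (3.7)
# Z lands close to the true mean of 3.0
```

With \phi closer to 1/2 the steps shrink more slowly, giving faster elimination of the initial bias at the cost of higher variance in the final estimate, which is the trade-off described above.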
3.3 Monte Carlo Methods for Policy Evaluation

This section examines two model-free methods for performing policy evaluation. That is to say, given an evaluation policy, \pi, they obtain the value-function or Q-function that predicts the expected return available under this policy without the use of an environmental model.

Monte Carlo estimation represents the most basic value prediction method. The idea behind it is simply to find the sample mean of the complete actual return,

z_t^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \cdots

for following the evaluation policy after a state, or SAP (state-action pair), until the end of the episode at time T. The evaluation policy is assumed to be fixed and is assumed to be followed while collecting these rewards. If a terminal state is reached then, without loss of generality, the infinite sum can be truncated by redefining it as,

z_t^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T + \gamma^{T-t} V(s_T)

where V(s_T) is the value of the terminal state s_T. Typically, this is defined to be zero. Again this is without loss of generality, since r_T can be redefined to reflect the differing rewards for entering different terminal states.

Singh and Sutton differentiate between two flavours of Monte Carlo estimate: the first-visit and every-visit estimates [139].

Every Visit Monte Carlo Estimation. The every-visit Monte Carlo estimate is defined as the sample average of the observed return following every visit to a state:

\hat{V}^E(s) = \frac{1}{M} \sum_{i=1}^{M} z_{t_i}^{(\infty)}    (3.9)

where s is visited at times {t_1, ..., t_M}. In this case, the Running-Average update is applied offline, at the end of each episode at the earliest. Each state-value is updated once for each state visit using the return following that visit. M represents the total number of visits to s in all episodes.
Figure 3.2: A simple Markov process for which first-visit and every-visit Monte Carlo approximation initially find different value estimates. The process has a starting state, s, and a terminal state, T. P_s and P_T denote the respective transition probabilities for s → s and for s → T. The respective rewards for these transitions are R_s and R_T.

First Visit Monte Carlo Estimation. The first-visit Monte Carlo estimate is defined as the sample average of returns following the first visit to a state during the episodes in which it was visited:

\hat{V}^F(s) = \frac{1}{N} \sum_{i=1}^{N} z_{t_i}^{(\infty)}    (3.10)

where s is first visited during an episode at times {t_1, ..., t_N}, and N represents the total number of episodes. The key difference here is that an observed reward may be used to update a state value only once, whereas in the every-visit case, a state value may be defined as the average of several non-independent return estimates, each involving the same reward, if the state is revisited during an episode. In the case where state revisits are allowed within a trial, these methods produce different estimators of return.

Bias and Variance. Singh and Sutton analysed these differences, which can be characterised by considering the process in Figure 3.2 [139]. For simplicity assume \gamma = 1; then from the Bellman equation (2.6) the true value for this process is:

V(s) = P_s (R_s + V(s)) + P_T R_T    (3.11)
     = \frac{P_s R_s + P_T R_T}{P_T}    (3.12)
     = \frac{P_s}{P_T} R_s + R_T    (3.13)
Consider the difference between the methods following one episode with the following experience,

s, s, s, s, T

The first-visit estimate is:

\hat{V}^F(s) = R_s + R_s + R_s + R_T

while the every-visit estimate is:

\hat{V}^E(s) = \frac{R_s + 2R_s + 3R_s + 4R_T}{4}
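The two estimates for the episode above are easy to compute directly; a minimal sketch (the numeric values R_s = 1, R_T = 0 are mine, chosen only for illustration):

```python
# The episode s, s, s, s, T: three s->s transitions (reward Rs each),
# then one s->T transition (reward RT), with gamma = 1.
R_s, R_t = 1.0, 0.0
rewards = [R_s, R_s, R_s, R_t]

first_visit = sum(rewards)              # return following the first visit: 3Rs + RT
returns = [sum(rewards[i:]) for i in range(len(rewards))]
every_visit = sum(returns) / len(returns)   # (6Rs + 4RT) / 4

print(first_visit, every_visit)         # 3.0 1.5
```

Note how the every-visit estimate reuses the same rewards in several of the averaged returns, which is the source of its bias.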
For both cases, it is possible to find the expectation of the estimate after one trial for some arbitrary experience. This is done by averaging the possible returns that could be observed in the first episode, weighted by their probability of being observed. For the first-visit case, it can be shown that after the first episode [139],

E\left[\hat{V}_1^F(s)\right] = \frac{P_s}{P_T} R_s + R_T = V(s)

and so is an unbiased estimator of V(s). After N episodes, \hat{V}_N^F(s) is the sample average of N independent unbiased estimates of V(s), and so is also unbiased. For the every-visit case, it can be shown (in [139]) that after the first episode,

E\left[\hat{V}_1^E(s)\right] = \frac{P_s}{2P_T} R_s + R_T.

Thus after the first episode the every-visit method does not give an unbiased estimate of V(s). Its bias is given by,

BIAS_1^E = V(s) - E\left[\hat{V}_1^E(s)\right] = \frac{P_s}{2P_T} R_s.    (3.14)

Singh and Sutton also show that after M episodes,

BIAS_M^E = \frac{2}{M+1} BIAS_1^E.    (3.15)

Thus the every-visit method is also unbiased as M \to \infty. The bias in the every-visit method comes from the fact that it uses some rewards several times. Thus many of the return observations are not independent. However, the observations between trials are independent, and so as the number of trials grows, its bias shrinks. Both methods converge upon V(s) as M or N tend to infinity.

Singh and Sutton also analysed the expected variance in the estimates learned by each method. They found that, while the first-visit method has no bias, it initially has a higher expected variance than the every-visit method. However, its expected variance declines far more rapidly, and is usually lower than for the every-visit method after a very small number of trials. Thus, in the long run the first-visit method appears to be superior, having no bias and lower variance.
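The bias result in Equation 3.14 can be checked by simulation; a minimal sketch of the Figure 3.2 process (gamma = 1; the constants P_s, R_s, R_T and all identifiers are mine):

```python
import random

def episode_rewards(p_s, r_s, r_t, rng):
    """Rewards along one episode s -> ... -> s -> T of the Figure 3.2 process."""
    rewards = []
    while rng.random() < p_s:           # self-transition s -> s, reward Rs
        rewards.append(r_s)
    rewards.append(r_t)                 # terminal transition s -> T, reward RT
    return rewards

def first_visit(rewards):
    return sum(rewards)                 # return following the first visit only

def every_visit(rewards):
    returns = [sum(rewards[i:]) for i in range(len(rewards))]
    return sum(returns) / len(returns)  # average return over every visit

p_s, r_s, r_t = 0.5, 1.0, 0.0           # true value: (Ps/PT)Rs + RT = 1
rng = random.Random(1)
eps = [episode_rewards(p_s, r_s, r_t, rng) for _ in range(200000)]
fv = sum(first_visit(ep) for ep in eps) / len(eps)
ev = sum(every_visit(ep) for ep in eps) / len(eps)
print(round(fv, 1), round(ev, 1))       # ~1.0 (unbiased) vs ~0.5, matching Eq. 3.14
```

With P_s = P_T = 0.5 the predicted single-episode bias is (P_s / 2P_T) R_s = 0.5, which is what the every-visit average exhibits.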
3.4 Temporal Difference Learning for Policy Evaluation

3.4.1 Truncated Corrected Return Estimates
Because the return estimate used by the Monte Carlo method (i.e. the observed return, z^{(\infty)}) looks ahead at the rewards received until the end of the episode, it is impossible to make updates to the value function based upon it during an episode. Updates must be
made in between episodes. If the task is non-episodic (e.g. if the environment never enters a terminal state), it seems unlikely that a Monte Carlo method can be used at all. One possibility is to make the task episodic by breaking the episodes into stages. The stages could be separated by a fixed number of steps, by replacing

z^{(\infty)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \cdots

with,

z^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n U(s_{t+n})

where U(s_{t+n}) is a return correction and predicts the expected return for following the evaluation policy from s_{t+n}. Note that this is already done to deal with terminal states in the Monte Carlo method. However, if s_{t+n} is not a terminal state we typically will not know the true utility of s_{t+n} under the evaluation policy. Instead we replace it with an estimate, the current \hat{V}(s_{t+n}) for example. The next section introduces a special case where n = 1. Updates are performed after each and every step using knowledge only about the immediate reward collected and the next state entered.

3.4.2 TD(0)
The temporal difference learning algorithm, TD(0), can be used to evaluate a policy and works through applying the following update [13, 147, 148]:

\hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha_t(s_t) \left[ r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) \right],    (3.16)

where r_{t+1} is the reward following a_t, taken from s_t and selected according to the policy under evaluation. Note that this has the same form as the Running-Average update rule where the target is E[r_{t+1} + \gamma \hat{V}(s_{t+1})]. Recall from Equation 2.6 that,

\hat{V}(s) = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \hat{V}(s') \right]
           = E\left[ r_{t+1} + \gamma \hat{V}(s_{t+1}) \mid s_t = s \right].

So, assuming for the moment that \hat{V}(s_{t+1}) is a fixed constant, Update 3.16 can be seen as a stochastic version of,

\hat{V}(s) \leftarrow \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + \gamma \hat{V}(s') \right],    (3.17)

where E[r_{t+1} + \gamma \hat{V}(s_{t+1}) \mid s_t = s] is estimated by \hat{V}(s) in the limit from the observed (sample) return estimates, r_{t+1} + \gamma \hat{V}_t(s_{t+1}), rather than the target return estimate given by the right-hand side of update 3.17.

TD(0) is reliant upon observing the return estimate, r + \gamma \hat{V}(s'), and applying it in update 3.16 with the probability distribution defined by R, P and \pi. This can be done in several
1) for each episode:
2)   initialise s_t
3)   while s_t is not terminal:
4)     select a_t according to \pi
5)     follow a_t; observe r_{t+1}, s_{t+1}
6)     TD(0)-update(s_t, a_t, r_{t+1}, s_{t+1})
7)     t \leftarrow t + 1

TD(0)-update(s_t, a_t, r_{t+1}, s_{t+1}):
1) \hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha_t(s_t) [r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)]

Figure 3.3: The online TD(0) learning algorithm. Evaluates the value-function for the policy followed while gathering experience.

ways, but by far the most straightforward is to actually follow the evaluation policy in the environment and make updates after each step using the experience collected. Figure 3.3 shows this online learning version of TD(0) in full. Note that it makes no use of R or P.

In general, the value of the correction term (\hat{V}(s_{t+1}) in update 3.16) is not a constant, but changes as s_{t+1} is visited and its value updated. The method can be seen to be averaging return estimates sampled from a non-stationary distribution. The return estimate is also biased by the initial value function estimate, \hat{V}_0. Even so, the algorithm can be shown to converge upon V^\pi as t \to \infty provided that the learning rate is declined under the Robbins-Monro conditions (\sum_{k=1}^\infty \alpha_k(s) = \infty, \sum_{k=1}^\infty \alpha_k^2(s) < \infty), that all value estimates continue to be updated, the process is Markov, all rewards have finite variance, 0 \leq \gamma < 1, and that the evaluation policy is followed [148, 38, 158, 59, 21]. In practice it is common to use the fixed learning rate \alpha = 1 if the transitions and rewards are deterministic, or some lower value if they are stochastic. Fixed \alpha also allows continuing adaptation in cases where the reward or transition probability distributions are non-stationary (in which case the Markov property does not hold).

3.4.3 SARSA(0)
Similar to TD(0), SARSA(0) evaluates the Q-function of an evaluation policy [128, 173]. Its update rule is:

\hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha_k \left[ r_{t+1} + \gamma \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t) \right],    (3.18)

where a_t and a_{t+1} are selected with the probability specified by the evaluation policy and \alpha_k = \alpha_k(s_t, a_t). SARSA differs from the standard algorithm pattern given in Figure 3.1 because it needs to know the next action that will be taken when making the value update. The SARSA algorithm is shown in Figure 3.4. An alternative scheme that appears to be equally valid and is more closely related to the policy-evaluation Q-function update (see Equation 2.15
1) for each episode:
2)   initialise s_t
3)   select a_t according to \pi
4)   while s_t is not terminal:
5)     follow a_t; observe r_{t+1}, s_{t+1}
6)     select a_{t+1} according to \pi
7)     SARSA(0)-update(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})
8)     t \leftarrow t + 1

SARSA(0)-update(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}):
1) \hat{Q}(s_t, a_t) \leftarrow \hat{Q}(s_t, a_t) + \alpha_k [r_{t+1} + \gamma \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t)]

Figure 3.4: The online SARSA(0) learning algorithm. Evaluates the Q-function for the policy followed while gathering experience.

and Figure 2.3) is to replace the target return estimate with [128]:

r_{t+1} + \gamma \sum_{a'} \pi(s_{t+1}, a') \hat{Q}(s_{t+1}, a').    (3.19)
An algorithm employing this return does not need to know a_{t+1} to make the update, and so can be implemented in the standard framework. Its independence of a_{t+1} also makes this an off-policy method: it doesn't need to actually follow the evaluation policy in order to evaluate it. This property is discussed in more detail later in this chapter. However, unlike regular SARSA, this method does require that the evaluation policy is known, which may not always be the case; experience could be generated by observing an external (e.g. human) controller.

3.4.4 Return Estimate Length

Single-Step Return Estimates
The TD(0) and SARSA(0) algorithms are single-step temporal difference learning methods, and apply updates to estimate some target return estimate having the following form:

z_t^{(1)} = r_{t+1} + \gamma \hat{U}(s_{t+1}).    (3.20)

It is important to note that it is the dependence upon using only information gained from the immediate reward and the successor state that allows single-step methods to be easily used as online learning algorithms. However, when single-step learning methods are applied in the standard way, by updating \hat{V}(s_t) or \hat{Q}(s_t, a_t) at time t+1, new return information is propagated back only to the previous state. This can result in extremely slow learning in cases where credit for visiting a particular state or taking a particular action is delayed by many time steps. Figure 3.5 provides an example of this problem. Each episode begins in the leftmost state. Each state to the right is visited in sequence until the rightmost (terminal) state is entered, where a reward of 1 is given (r = 0 in all other states). In such a situation, it would take 1-step methods a minimum of 64 episodes before any information
[Figure: a corridor of states at t = 0, ..., 63, followed by a terminal state at t = 64 where r = 1.]

Figure 3.5: The corridor task. Single-step updating methods such as TD(0), SARSA(0) and Q-learning can be very slow to propagate any information about the terminal reward to the leftmost state.

about the terminal reward reaches the leftmost state. A Monte Carlo estimate would find the correct solution after just one episode.

Multi-Step Return Estimates
By modifying the return estimate to look further ahead than the next state, a single experience can be used to update utility estimates at many previously visited states. For example, the 1-step return in 3.16, z_t^{(1)} = r_{t+1} + \gamma \hat{U}(s_{t+1}), may be replaced with the corrected n-step truncated return estimate,

z_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n \hat{U}(s_{t+n})    (3.21)

or we may use,

z_t^\lambda = (1 - \lambda) \left[ z_t^{(1)} + \lambda z_t^{(2)} + \lambda^2 z_t^{(3)} + \cdots \right]    (3.22)
        = (1 - \lambda) \left( r_{t+1} + \gamma \hat{U}(s_{t+1}) \right) + \lambda \left( r_{t+1} + \gamma z_{t+1}^\lambda \right)    (3.23)
        = r_{t+1} + \gamma (1 - \lambda) \hat{U}(s_{t+1}) + \gamma \lambda z_{t+1}^\lambda    (3.24)

which is a λ-return estimate [147, 148, 163, 128, 107]. The λ-return estimate is important as it is a generalisation of both z^{(1)} and z^{(\infty)} since, if λ = 0, then z^\lambda = z^{(1)}, and if λ = 1, then z^\lambda = z^{(\infty)}, the actual discounted return.

A key feature of multi-step estimates is that a single observed reward may be used in updating the state-values or Q-values of many previously visited states. Intuitively, this offers the ability to more quickly assign credit for delayed rewards. The return estimate length can also be seen as managing a trade-off between bias and variance in the return estimate [163]. When λ is low, the estimate is highly biased toward the initial state-value or Q-function. When λ is high the estimate involves mainly the actual observed reward and is a less biased estimator. However, unbiased return estimates don't necessarily result in the fastest learning. Typically, longer return estimates have higher variance, as there is a greater space of possible values that a multi-step return estimate could take. By contrast, a single-step estimate is limited to taking values formed by combinations of the possible immediate rewards and the values of immediate successor states, and so may typically have lower variance. Also, employing the already-learned value estimates of successor states in updates may help speed up learning, since these values may contain summaries of the complex future that may follow from the state.

Best performance is often to be found with intermediate values of λ [148, 128, 73, 139, 150].
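The recursion in Equation 3.24 suggests computing λ-returns backwards from the end of an episode; a minimal sketch (all identifiers are mine, not the thesis's):

```python
def lambda_returns(rewards, next_values, gamma, lam):
    """rewards[t] = r_{t+1}; next_values[t] = V^(s_{t+1}); returns z_t for each t."""
    z = [0.0] * len(rewards)
    z_next = next_values[-1]            # at the end of the episode z_T = V^(s_T)
    for t in reversed(range(len(rewards))):
        # z_t = r_{t+1} + gamma*((1-lambda)*V^(s_{t+1}) + lambda*z_{t+1})  (Eq. 3.24)
        z[t] = rewards[t] + gamma * ((1 - lam) * next_values[t] + lam * z_next)
        z_next = z[t]
    return z

# A 3-state corridor (reward 1 on the final step), all value estimates zero:
r = [0.0, 0.0, 1.0]
v_next = [0.0, 0.0, 0.0]
print(lambda_returns(r, v_next, gamma=1.0, lam=1.0))   # [1.0, 1.0, 1.0]
print(lambda_returns(r, v_next, gamma=1.0, lam=0.0))   # [0.0, 0.0, 1.0]
```

With λ = 1 every state's return sees the delayed reward after one episode (the Monte Carlo return); with λ = 0 only the final state does (the 1-step return), illustrating the trade-off discussed above.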
However, while multi-step estimates appear to offer faster delayed credit assignment, they seem to suffer the same problem as the Monte Carlo methods: the updates must either be made offline, at the end of each episode, or the episodes must be split into stages and the return estimates truncated. Chapter 4 introduces a method which explores the latter case. The next section shows how the effect of using the λ-return estimate can be approximated by a fully incremental online method that makes updates after each step.

3.4.5 Eligibility Traces: TD(λ)
This section shows how λ-return estimates can be applied as an incremental online learning algorithm. This is surprising because it implies that it is not necessary to wait until all the information used by the return estimate is collected before a backup can be made to a previously visited state. The effect of using z^\lambda can be closely and incrementally approximated online using eligibility traces [148, 163]. A λ-return algorithm performs the following update,

\hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha_t(s_t) \left[ z_t^\lambda - \hat{V}(s_t) \right].    (3.25)

By Equation 3.24, Sutton showed that the error estimate in this update can be rewritten as [148, 163, 107],

z_t^\lambda - \hat{V}(s_t) = \delta_t + \gamma\lambda \delta_{t+1} + \cdots + (\gamma\lambda)^k \delta_{t+k} + \cdots    (3.26)

where \delta_t is the 1-step temporal difference error as before,

\delta_t = r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t).

If the process is acyclic and finite (and so necessarily also has a terminal state), this allows update 3.25 to be rewritten as the following online update rule, which overcomes the need to have advance knowledge of the 1-step errors,

\hat{V}(s) \leftarrow \hat{V}(s) + \alpha_t(s) \delta_t \sum_{k=t_0}^{t} (\gamma\lambda)^{t-k} I(s, s_k)    (3.27)

where t_0 indicates the time of the start of the episode, and I(s, s_k) is 1 if s = s_k, and zero otherwise. This update must be applied to all states visited at time t or before, within the episode. In the case in which state revisits may occur, the updates may be postponed and a single batch update may be made for each state at the end of the episode,

\hat{V}(s) \leftarrow \hat{V}(s) + \sum_{t=t_0}^{T-1} \alpha_t(s) \delta_t \sum_{k=t_0}^{t} (\gamma\lambda)^{t-k} I(s, s_k)

where s_T is the terminal state.
TD(λ)-update(s_t, a_t, r_{t+1}, s_{t+1}):
1) \delta \leftarrow r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)
2) e(s_t) \leftarrow e(s_t) + 1
3) for each s \in S:
3a)   \hat{V}(s) \leftarrow \hat{V}(s) + \alpha e(s) \delta
3b)   e(s) \leftarrow \gamma\lambda e(s)

Figure 3.6: The accumulating-trace TD(λ) update. This update step should replace TD(0)-update in Figure 3.3 for the full learning algorithm. All eligibilities should be set to zero at the start of each episode.

However, the above methods don't appear to be of any more practical use than the Monte Carlo or λ-return methods. If the task is acyclic, then there is little benefit in having an online learning algorithm, since the agent cannot make use of the values it updates until the end of the episode. So the assumption preventing state revisits is often relaxed. In this case the error terms may be inexact, since the state-values used as the return correction may have been altered if the state was previously visited. However, intuitively this seems to be a good thing, since the return correction is more up-to-date as a result. To avoid the expensive recalculation of the summation in 3.27, the update can be redefined as,

\hat{V}(s) \leftarrow \hat{V}(s) + \alpha_t e_t(s) \delta_t    (3.28)

where e(s) is an (accumulating) eligibility trace. For each state at each step it is updated as follows,

e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) + 1, & \text{if } s = s_t, \\ \gamma\lambda e_{t-1}(s), & \text{otherwise.} \end{cases}    (3.29)

The full online TD(λ) algorithm is shown in Figure 3.6. Both the online and batch TD(λ) algorithms are known to converge upon the true state-value function for the evaluation policy under the same conditions as TD(0) [38, 158, 59, 21].

The intuitive idea behind an eligibility trace is to make a state eligible for learning for several steps after it was visited. If an unexpectedly good or bad event happens (as measured by the temporal difference error, \delta), then all of the previously visited states are immediately credited with this. The size of the value adjustment is scaled by the state's eligibility, which decays with the time since the last visit. Moreover, the 1-step error \delta_t measures an error in the return used, not just for the previous state, but for all previously visited states in the episode. The eligibility measures the relevance of that error to the values of the previous states, given that they were updated using a return corrected for the error found at the current state. Thus it should be clear why the trace decays as (\gamma\lambda)^k: the contribution of \hat{V}(s_{t+k}) to z_t^\lambda is (\gamma\lambda)^k.
The Forward-Backward Equivalence of Batch TD(λ) and λ-Return Updates

If the changes to the value-function that the accumulate-trace algorithm is to make during an episode are summed,

\Delta V(s) = \sum_{t=0}^{T-1} \alpha_t e_t(s) \delta_t

and applied at the end of the episode (instead of online),

V(s) \leftarrow V(s) + \Delta V(s)

it can be shown that this is equivalent to applying the λ-return update,

\hat{V}(s) \leftarrow \hat{V}(s) + \alpha \left[ z_t^\lambda - \hat{V}(s) \right],

at the end of the episode, for each s = s_t visited during the episode [150].¹ Thus in the case where λ = 1 and \alpha_k(s) = 1/k(s), this batch-mode TD(λ) method is equivalent to the every-visit Monte Carlo algorithm. The proof of this can be found in [139] and [150]. Below, the direct λ-return method is referred to as the forward view, and the eligibility trace method as the backward view (after [150]).

3.4.6 SARSA(λ)
The equivalent version of TD(λ) for updating a Q-function is SARSA(λ), shown in Figure 3.7 [128, 129]. Here, an eligibility value is maintained for each state-action pair.

SARSA(λ)-update(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}):
1) \delta \leftarrow r_{t+1} + \gamma \hat{Q}(s_{t+1}, a_{t+1}) - \hat{Q}(s_t, a_t)
2) e(s_t, a_t) \leftarrow e(s_t, a_t) + 1
3) for each s \in S, a \in A:
3a)   \hat{Q}(s, a) \leftarrow \hat{Q}(s, a) + \alpha e(s, a) \delta
3b)   e(s, a) \leftarrow \gamma\lambda e(s, a)

Figure 3.7: The accumulating-trace SARSA(λ) update. This update step should replace the SARSA(0)-update in Figure 3.4 for the full learning algorithm. All eligibilities should be set to zero at the start of each episode.

3.4.7 Replace Trace Methods
In practice, accumulating trace methods are known to often work poorly, especially with λ close to 1 [139, 149, 150]. In part, this is likely to be the result of its relationship with the

¹ See also the special case of the forward-backward equivalence proof in Appendix C where λ = 1. This proof is a generalisation of the one in [150].
every-visit Monte Carlo algorithm. An alternative eligibility trace scheme is the replacing trace:

e_t(s) = \begin{cases} 1, & \text{if } s = s_t, \\ \gamma\lambda e_{t-1}(s), & \text{otherwise.} \end{cases}    (3.30)

Sutton refers to this as a recency heuristic: the eligibility of a state depends only upon the time since the last visit. By contrast, the accumulating trace is a frequency and recency heuristic. In [139] Singh and Sutton show that, with λ = 1 and with appropriately declining learning rates, the batch-update TD(λ) algorithms exactly implement the Monte Carlo algorithms. In particular, it can be shown that accumulating traces give the every-visit method, and replacing traces give the first-visit Monte Carlo method. In addition to the better theoretical properties of first-visit Monte Carlo, the replace trace method has often performed better in online learning tasks. In [150] Sutton and Barto also prove that the TD(λ) and forward-view λ-return methods are identical in the case of batch (i.e. offline) updating for general λ with a constant \alpha.

When estimating Q-values, two replace-trace schemes exist. These are the state-replacing trace [139, 150],

e_t(s, a) = \begin{cases} 1, & \text{if } s = s_t \text{ and } a = a_t, \\ 0, & \text{if } s = s_t \text{ and } a \neq a_t, \\ \gamma\lambda e_{t-1}(s, a), & \text{if } s \neq s_t, \end{cases}    (3.31)

and the state-action replacing trace [33],

e_t(s, a) = \begin{cases} 1, & \text{if } s = s_t \text{ and } a = a_t, \\ \gamma\lambda e_{t-1}(s, a), & \text{otherwise.} \end{cases}    (3.32)

3.4.8 Acyclic Environments
If the environment is acyclic, then the different eligibility updates produce identical eligibility values, and so the accumulate and replace trace methods must be identical. In this case, the online and batch versions of the algorithms are also identical, since the return corrections used in λ-return estimates must be fixed within an episode. With λ = 1, the λ-return methods also implement the Monte Carlo methods in acyclic environments. Also, here, both first-visit and every-visit methods are equivalent.

The eligibility trace methods appear to be considerably more expensive than the other model-free methods so far presented. For TD(0) and SARSA(0) the time-cost per experience is O(1). The Monte Carlo and direct λ-return methods have the same cost if the returns are calculated starting with the most recent experience and working backwards.² Algorithms working in this way will be seen in Chapter 4. By contrast, TD(λ) has a time-cost as high as O(|S|) per experience.

Thus the great benefit afforded by using eligibility traces is that they allow multi-step return estimates to be used for continual online learning and, as a consequence, can also be used in

² Since all discounted return estimators can be calculated recursively as, z_t = f(r_t, s_t, a_t, z_{t+1}, U), for some function f. If z_{t+1} is known then it is cheap to calculate z_t by working backwards.
non-episodic tasks and in cyclic environments in a relatively straightforward way. We will see in the next chapter that the cost of the eligibility trace updates can be greatly reduced.

Figure 3.8: Number line showing the effect of step-size (the points \hat{Z}_t, \hat{Z}_{t+1} and z_t, with the distances between them given by \delta_t). Note that having a step-size greater than 2 can actually increase the error in the estimate (i.e. moving the new estimate into the hashed area).

3.4.9 The Non-Equivalence of Online Methods in Cyclic Environments
Consider the Running-Average update rule (3.1). It is easy to see that with a large learning rate the algorithm can actually increase the error in the prediction. Let \delta_t = z_t - \hat{Z}_t; then if \alpha > 2, after an update, |z_t - \hat{Z}_{t+1}| > |z_t - \hat{Z}_t|. The problem can be seen visually in Figure 3.8.

This raises new suspicions about the online behaviour of the accumulate trace TD(λ) update. In a worst case environment (see Figure 3.9) in which a state is revisited after every step, after k revisits the eligibility trace becomes,

e_k(s) = 1 + \gamma\lambda + \cdots + (\gamma\lambda)^{k-1} = \frac{1 - (\gamma\lambda)^k}{1 - \gamma\lambda}.

Thus, for \gamma\lambda < 1, an upper bound on an accumulating eligibility trace (in any process) is given by,

e_\infty(s) = \frac{1}{1 - \gamma\lambda}.    (3.33)

For \gamma\lambda = 1 the trace grows without bound if the process is finite and has no terminal state. The TD(λ) update (3.28) makes updates of the following form:

V(s) \leftarrow V(s) + \alpha_t(s) e_t(s) \delta.

Thus it might seem that where \alpha_t(s) e_t(s) > 2 holds, the TD(λ) algorithm could grow in error with each update. These conditions are easily satisfied for \gamma\lambda close enough to 1 in any non-terminating finite (and therefore cyclic) process. Considering the case where the trace reaches its upper bound, we have in the worst case scenario,

\alpha_t(s) \frac{1}{1 - \gamma\lambda} > 2,
Figure 3.9: A worst-case environment for accumulating eligibility trace methods, where the state's eligibility grows at the maximum rate. The reward is a random variable chosen from the range [-1, 1] with a uniform distribution.

Figure 3.10: The growth of the accumulate trace update step-size, \alpha_t e_t, for the process in Figure 3.9 (plotted against time, t = 1 to 10000, on a logarithmic scale). The learning rate is \alpha_t = t^{-0.55}, \gamma = 0.999 and \lambda = 1.0. These settings satisfy the conditions of convergence for accumulate trace TD(λ).
or equivalently \gamma\lambda > 1 - \alpha_t(s)/2, assuming a constant \alpha_t(s) while the eligibility rises. Yet the convergence of online accumulate trace TD(λ) has already been established [38, 59]. Crucially, these results rely upon the learning rate being declined under the Robbins-Monro conditions, which ensures that \alpha tends to zero (and so \alpha_t(s) e_t(s) must eventually fall below 2). However, even learning rate schedules that satisfy the Robbins-Monro conditions can cause \alpha_t(s) e_t(s) > 2 to hold for a considerable time in the early stages of learning. An example is shown in Figure 3.10. Note that even though a high value of \gamma is used (i.e. close to 1.0, at which value functions may be ill-defined), by 10000 steps the remaining rewards can be neglected from the value of the state, since 0.999^{10000} is very small. Even so, at the end of this period, \alpha_t(s) e_t(s) > 2.

What are the practical consequences of this for the online accumulate trace TD(λ) algorithm? Figure 3.11 compares this method with an online forward view algorithm using the process in Figure 3.9. With \lambda = 1, a forward view λ-return algorithm can be implemented online in this particular task by making the following updates:

z_{t+1} \leftarrow (1 - \lambda)\left( r_{t+1} + \gamma \hat{V}_t(s) \right) + \lambda \left( r_{t+1} + \gamma z_t \right)
\hat{V}_{t+1}(s) \leftarrow \hat{V}_t(s) + \alpha_t(s) \left( z_{t+1} - \hat{V}_t(s) \right)

Note that this is "back-to-front": rewards should be included into z^\lambda with the most recent first. However, this makes no difference in this case, since there is only one state and only one reward. Thus with \lambda = 1, z records the actual observed discounted return (and is also the first-visit estimate) except for some small error introduced by \hat{V}_0(s). \hat{V}_0(s) is set to zero (i.e. the correct value) for all of the methods. In the experiment, the initial estimate has little influence on the general shape of the graphs in Figure 3.11 beyond the first few steps. Also, with \alpha_t(s) = 1/t, \hat{V}_t(s) is the every-visit estimate except for the negligible error
Figure 3.11: Comparison of variance between the online versions of TD(λ) and the forward view methods in the single state process in Figure 3.9, where \gamma = 0.999 and \lambda = 1. The results are the average of 300 runs. The horizontal and vertical axes differ in scaling. The vertical axis measures |\Delta\hat{V}(s)| = |\hat{V}(s) - V(s)|, since V(s) = 0. The rows of panels correspond to the learning rate schedules \alpha_t = 1/t, \alpha_t = t^{-0.55} and \alpha = 0.5; the curves show the accumulate trace, forward view (every-visit), replace trace and forward view (first-visit) methods.

caused by \hat{V}_0(s). Alternatively, note that the method is exactly the every-visit method for a slightly different process where there is some (very small) probability of entering a zero-valued terminal state (in which case setting \hat{V}_0(s) = 0 is justified). This allows us to closely compare online TD(λ) with the forward-view Monte Carlo estimates, and even do so with different learning rate schemes. Different learning rate schemes correspond to different recency weightings of the actual return. The "Forward View, First-Visit" method in Figure 3.11 simply learns the actual observed return at the current time, and is independent of the learning rate. The replace trace method is also shown and is equivalent to TD(0) for this environment.

The results can be seen in Figure 3.11. The most interesting results are those for accumulate trace TD(λ). Here we see that where \alpha_t(s) = 1/t, the method most closely approximates
the every-visit method (at least in the long term). This is predicted as a theoretical result by Singh and Sutton in [139] for the batch update case. With more slowly declining \alpha or a constant \alpha (i.e. more recency biased), the accumulate trace method is considerably higher in error than any of the other methods. This seems to be at odds with the existing theoretical results in [150], where it is shown that TD(λ) is equivalent to the forward view method for constant \alpha (and any λ). However, this equivalence applies only in the offline (batch update) case. The equivalence is approximate in the online learning case, and we see the consequence of this approximation in Figure 3.11. In the fixed \alpha case, the values learned by accumulate trace TD(λ) are so high in variance as to be essentially useless as predictions. Similar results can be expected in other cyclic environments where the eligibility trace can grow very large. There are also numerous examples in the literature where the performance of accumulate trace methods sharply degrades as λ tends to 1 (in particular, see [139, 150]). In contrast, the every-visit method behaves much more reasonably (as do the first-visit and replace trace methods). Partially, this is some motivation for a new (practical) online-learning forward view method presented in Chapter 4.

It may seem surprising that the error in the accumulate trace TD(λ) method does not continue to increase indefinitely, since \alpha_t e_t is considerably higher than 2 after the first few updates and remains so. The reason for this is that the observed samples used in updates (r_t + \gamma\hat{V}(s_{t+1})) are not independent of the learned estimates (\hat{V}(s_t)). Unlike in the basic Running-Average update case where divergence to infinity is clear (with z independent of Z), this non-independence appears to be useful in bounding the size of the possible error in this and presumably other cyclic tasks.

In Figure 3.11, we also see that the every-visit method performed marginally better than first-visit in each case. This is consistent with the theoretical results obtained by Singh and Sutton in [139], which predict that (offline) every-visit Monte Carlo will find predictions with a lower mean squared error (i.e. lower variance) for the first few episodes; only one episode occurred in this experiment. We can conclude that, i) drawing analogies between forward-view methods and online versions of eligibility trace methods is dangerous, since the equivalence of these methods does not extend to the online case, and ii) accumulate trace TD(λ) can perform poorly in cyclic environments where \alpha_t e_t above 2 is maintained. In particular, it can perform far worse than its forward-view counterpart for learning rate declination schemes slower than \alpha(s) = 1/k(s) (where k is the number of visits to s). This can be attributed to the approximate nature of the forward-backward equivalence in the online case. In cyclic tasks, errors due to this approximation can be magnified by large effective step-sizes (\alpha e).
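The trace growth behind these observations is easy to reproduce; a minimal sketch for the self-loop process of Figure 3.9, checked against the bound of Equation 3.33 (constants taken from the experiment above; the variable names are mine):

```python
# Accumulating trace in a one-state self-loop: the state is revisited on
# every step, so e_k = (1 - (gamma*lam)**k) / (1 - gamma*lam).
gamma, lam = 0.999, 1.0
e = 0.0
for t in range(10000):
    e = gamma * lam * e + 1.0           # accumulate-trace update, every step
bound = 1.0 / (1.0 - gamma * lam)       # Equation 3.33: here 1000
alpha = 0.5                             # the fixed learning rate of Figure 3.11
print(round(e), round(bound), alpha * e > 2)   # 1000 1000 True
```

Even a modest fixed learning rate therefore yields an effective step-size \alpha e of several hundred, far above the critical value of 2 discussed in Section 3.4.9.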
3.5 Temporal Difference Learning for Control

3.5.1 Q(0): Q-learning
Like value-iteration, Q-learning evaluates the greedy policy. It does so using the following update rule:

    Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k(s_t, a_t) [ r_{t+1} + γ max_{a'} Q̂(s_{t+1}, a') − Q̂(s_t, a_t) ].    (3.34)
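A minimal tabular sketch of this update, paired with an ε-greedy behaviour policy (the dictionary representation and all constants are our illustrative choices):

```python
import random
from collections import defaultdict

GAMMA, ALPHA, EPSILON = 0.9, 0.1, 0.1   # illustrative constants

Q = defaultdict(float)   # Q[(state, action)] -> current estimate

def q_learning_update(s, a, r, s_next, next_actions):
    """One backup of Equation 3.34; the max over successor actions makes
    the update off-policy (exploration insensitive)."""
    best_next = max((Q[(s_next, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

def select_action(s, actions):
    """Epsilon-greedy behaviour; the exploration only affects which SAPs
    get visited, not which policy the update above evaluates."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])
```

Passing an empty `next_actions` for terminal successors makes the bootstrap term zero, matching the convention that terminal states have zero value.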
Note that the target return estimate used by Q-learning, r_{t+1} + γ max_{a'} Q̂(s_{t+1}, a'), is a special case of the one used by the off-policy SARSA update (3.19), in which the evaluation policy is the (non-stationary) greedy policy. Q-learning is known to converge upon Q* as k → ∞ under conditions similar to those for TD(0) [163, 164, 59, 21]. However, unlike TD(0), there is no need to follow the evaluation policy (i.e. the greedy policy). Exploratory actions may be taken freely, and yet only the greedy policy is ever evaluated. The method will converge upon the optimal Q-function provided that all SAPs are tried with an infinite frequency, along with other conditions similar to those ensuring the convergence of TD(0).

1)  Initialise: t ← 0
2)  for each episode:
3)    initialise s_t
4)    while s_t is not terminal:
5)      select a_t
6)      follow a_t; observe r_{t+1}, s_{t+1}
7)      Q-learning-update(s_t, a_t, r_{t+1}, s_{t+1})
8)      t ← t + 1

Q-learning-update(s_t, a_t, r_{t+1}, s_{t+1}):
1)  Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k [ r_{t+1} + max_{a'} γ Q̂(s_{t+1}, a') − Q̂(s_t, a_t) ]
Figure 3.12: The online Q-learning algorithm. It evaluates the greedy policy independently of the policy used to generate experience. This method is exploration insensitive.

3.5.2 The Exploration-Exploitation Dilemma
Why take exploratory actions? Almost all systems that learn control policies through interaction face the exploration-exploitation dilemma. Should the agent sacrifice immediate reward and take actions that reduce the uncertainty about the return following untried actions, in the hope that they will lead to more rewarding policies; or should the agent
behave greedily, avoiding the return lost while exploring, but settle for a policy that may be suboptimal? Optimal Bayesian solutions to this dilemma are known, but are intractable in the general multistep process case [78]. However, there are many good heuristic solutions. Good surveys of early work can be found in [62, 156]; recent surveys can be found in [85, 174, 63]. Also see [41, 40, 142, 175] for recent work not included in these. Common features of the most successful methods are local definitions of uncertainty (e.g. action counters, Q-value error and variance measures), the propagation of this uncertainty to prior states, and then choosing actions which maximise combined measures of this long-term uncertainty and long-term value.

3.5.3 Exploration Sensitivity
When learning state values or state-action values, we do so with respect to the return obtainable by following some policy after visiting those states. For some learning methods, such as TD(λ) and SARSA(λ), the policy being evaluated is the same as the policy actually followed while gathering experience. These are referred to as on-policy methods [150]. For these, the actual experience affects what the methods converge upon in the limit. By contrast, off-policy methods allow off-policy (or exploratory) actions to be taken in the environment (i.e. actions may be chosen from a distribution different from the evaluation policy). That is to say, they may learn the value-functions or Q-functions for one policy while following another. To put this into context, for control optimisation problems we are usually evaluating the greedy policy,

    π_g(s) = arg max_a Q̂(s, a).    (3.35)

Q(0) is an exploration insensitive method, as it only ever estimates the return available under the greedy policy, regardless of the distribution of, or the methods used to obtain, its experience. This is possible because its return estimate, r_{t+1} + γ max_a Q̂(s_{t+1}, a), is independent of a_{t+1}. For the same reason, SARSA(0) using the return estimate in Equation 3.19 is also an off-policy method. Off-policy learning is less straightforward for methods that use multistep return estimates. For example, if a multistep return estimate used to update Q̂(s_t, a_t) includes the reward following a non-greedy action a_{t+k} (k ≥ 1), then there is a bias to learn about the return following a non-greedy policy instead of the greedy policy. That is to say, Q̂(s_{t−1}, a_{t−1}) receives credit for the delayed reward r_{t+k+1}, which the agent might not observe if it follows the greedy policy after Q̂(s_t, a_t). In most cases, learning in this way denies convergence upon Q*. This is straightforward to see when we consider the case where Q̂ = Q* is known to hold: most updates following a non-greedy action are likely to move Q̂ away from Q* (in expectation).
The most commonly used solution to this problem is to ensure that the exploration policy converges upon the greedy policy in the limit, so that on-policy methods eventually evaluate the greedy policy [135]. However, schemes for doing this must carefully observe the learning
rate. If convergence to the greedy policy is too fast, then the agent may become stuck in a local minimum, since choosing only greedy actions may result in some parts of the environment being under-explored (or under-updated). If convergence upon the greedy policy is too slow, then as the learning rate declines, the Q-function will converge prematurely and remain biased toward the rewards following non-greedy actions. In [135], Singh et al. discuss several exploration methods which are greedy in the limit and allow SARSA(0) to find Q* in the limit. Their results also seem likely to hold for SARSA(λ), although there is as yet no proof of this. In any case, following, or even converging upon, the greedy exploration strategy may not always be desirable or even possible. For example:

- Bootstrapping from externally generated experience or some given training policy (such as one provided by a human expert) can greatly reduce the agent's initial learning costs [72, 112]. Even if the agent follows this training policy, we would still like our method to be learning about the greedy policy (and so moving toward the optimal policy).

- There may be a limited amount of time available for exploration (e.g. for commercial or safety-critical applications, it might be desirable to have distinct training, testing and application phases). In this case, we may wish to perform as much exploration as possible in the training stage.

- The agent may be trying to learn several policies (behaviours) in parallel, where each policy should maximise its own reward function (as in [58, 79, 143]). At any time the agent may take only one action, yet it remains useful to be able to use this experience to update the Q-functions of all the policies being evaluated.

- The agent's task may be non-stationary, in which case continual exploration is required in order to evaluate actions whose true Q-values are changing [105].

- The agent's Q-function representation may be non-stationary.
Continual exploration may be required to evaluate the actions in the new representation.

Multi-Step Methods

It has long been known that multistep return estimates need not lead to exploration-sensitive methods. The method recommended by Watkins is to truncate the return estimate such that the rewards following off-policy (e.g. non-greedy) actions are removed from it [163]. For example, Q̂(s_{t−1}, a_{t−1}) should be updated using the corrected n-step truncated return (see [163, 31]),

    z_t^(λ,n) = (1 − λ) [ z_t^(1) + λ z_t^(2) + λ² z_t^(3) + ⋯ + λ^{n−2} z_t^(n−1) ] + λ^{n−1} z_t^(n)    (3.36)
              = (1 − λ) ( r_t + γ Û(s_t) ) + λ ( r_t + γ z_{t+1}^(λ,n−1) ),    (3.37)

where z_t^(λ,1) = r_t + γ Û(s_t), and a_{t+n} is the next off-policy action. However, if there is a considerable amount of exploration then the return estimate may be truncated extremely frequently, and much of the
benefit of using a multistep return estimate can be eliminated. As a result, the method is seldom applied. For an eligibility trace method, zeroing the eligibilities immediately after taking an off-policy action has the same effect as truncating the return estimate [163]. Figure 3.13 shows Watkins' Q(λ) eligibility trace algorithm, and Figure 3.14 shows Peng and Williams' Q(λ).³ Watkins' Q(λ) truncates the return estimate after taking non-greedy actions and is an off-policy method. PWQ(λ) does not truncate the return and assumes that all rewards are those observed under a greedy policy. It is neither on-policy nor off-policy. The Watkins' Q(λ) and PWQ(λ) algorithms are identical when purely greedy policies are followed. They differ only in the temporal difference error used to update SAPs visited at t − k (k > 1):

    Watkins' Q(λ):  δ_t = r_{t+1} + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t)
    PWQ(λ):         δ_t = r_{t+1} + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a)

The eligibility trace methods may also be used for off-policy evaluation of a fixed policy by applying importance sampling [111]. Here, the eligibility trace is scaled by the likelihood that the exploratory policy has of generating the experience seen by the evaluation policy. When used for greedy policy evaluation, the method reduces to Watkins' Q(λ). Like the off-policy SARSA(0) method, the evaluation policy must be known.

Optimistic Q-value Initialisation and Exploration
To encourage exploration of the environment, a common technique in RL is to provide an optimistic initial Q-function and then follow a policy with a strong greedy bias. Examples of these "soft greedy" policies include ε-greedy and Boltzmann selection [135, 150]. Over time each Q-value will decrease as it is updated, but the Q-values of untried actions, or of actions that led to untried actions, will remain artificially high. Thus, even while following a purely greedy policy, the agent can be led to unexplored parts of the state-space. However, problems arise if the estimated value of an action should ever fall below its true value (as may easily happen in environments with stochastic rewards or transitions). In this case any method which acts only greedily can become stuck in a local minimum, since the truly best actions are no longer followed. The original version of PWQ(λ), as published in [107], assumes that π_g is always followed. As a result, the standard Q-function initialisation for PWQ(λ) is an optimistic one. Even so, several authors report good results when using PWQ(λ) and following semi-greedy policies [128, 169]. In this case, PWQ(λ) is an unsound method in the sense that, like SARSA(λ), it can be shown that it will not converge upon Q* in some environments while

³ The use of the eligibility trace in the Peng and Williams' and Watkins' Q(λ) algorithms presented here is the same as the method in [107, 167], but differs from TD(λ) and SARSA(λ). Because, in Figures 3.13 and 3.14, the traces are updated before the Q-values, the trace extends an extra step into the history, and an additional update may result in the case of state revisits. The algorithms may be modified to remove this additional update, although in practice this makes little difference.
Watkins-Q(λ)-update(s_t, a_t, r_{t+1}, s_{t+1}):
1)   if off-policy(s_t, a_t):                          (test for non-greedy action)
2)     for each (s, a) ∈ S × A do:                     (truncate eligibility traces)
3)       e(s, a) ← 0
4)   δ_t ← r_{t+1} + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t)
5)   for each SAP (s, a) ∈ S × A do:
6)     e(s, a) ← γλ e(s, a)                            (decay trace)
7)     Q̂(s, a) ← Q̂(s, a) + α δ_t e(s, a)
8)   Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k δ_t
9)   for each a ∈ A(s_t) do:
9a)    e(s_t, a) ← 0
10)  e(s_t, a_t) ← e(s_t, a_t) + 1
Figure 3.13: Off-policy (Watkins') Q(λ) with a state replacing trace. This version differs slightly from the algorithm recently published in the standard text [150]. For an accumulating trace version, omit steps 9 and 9a. For state-action replacing traces, replace steps 9 to 10 with e(s_t, a_t) ← 1.

PWQ(λ)-update(s_t, a_t, r_{t+1}, s_{t+1}):
1)   δ'_t ← r_{t+1} + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t)
2)   δ_t ← r_{t+1} + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a)
3)   for each SAP (s, a) ∈ S × A do:
4)     e(s, a) ← γλ e(s, a)
5)     Q̂(s, a) ← Q̂(s, a) + α δ_t e(s, a)
6)   Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k δ'_t
7)   for each a ∈ A(s_t) do:
7a)    e(s_t, a) ← 0
8)   e(s_t, a_t) ← e(s_t, a_t) + 1

Figure 3.14: Peng and Williams' Q(λ) with a state replacing trace. Modifications for accumulating and state-action replacing traces are as for Watkins' Q(λ) (Figure 3.13).

exploratory actions continue to be taken.⁴ However, it may gain a greater efficiency than Watkins' Q(λ) in assigning credit to actions, as it does not truncate its return estimate when taking off-policy actions. This allows the credit for individual actions to be used to adjust more Q-values in prior states.
⁴ This can be seen straightforwardly in deterministic processes with deterministic rewards. Note that if Q̂ = Q* is known to hold, then PWQ(λ) (or SARSA(λ)) may increase ‖Q̂ − Q*‖ if non-greedy actions are taken. The same is not true for Q-learning and Watkins' Q(λ).
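The optimistic-initialisation effect discussed above can be made concrete with a toy sketch (the bound R_max/(1−γ), the action names, and the step size are our illustrative choices):

```python
R_MAX, GAMMA = 1.0, 0.9
OPTIMISTIC = R_MAX / (1 - GAMMA)     # upper bound on any discounted return

Q = {a: OPTIMISTIC for a in ('left', 'right', 'jump')}
visits = {a: 0 for a in Q}

# A purely greedy agent still tries every action at least once, because
# each observed reward of 0 can only pull the chosen estimate downward,
# leaving the untried (still optimistic) actions looking best.
for _ in range(3):
    a = max(Q, key=Q.get)
    visits[a] += 1
    Q[a] += 0.5 * (0.0 - Q[a])       # observe reward 0; estimate decays
```

Once an estimate falls below its true value, however, the same greedy rule never revisits that action, which is exactly the local-minimum failure mode described above.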
3.5.4 The Off-Policy Predicate
For control tasks, the common test used to decide whether an action was off-policy (i.e. non-greedy) is [163, 176, 150],

    off-policy(s_t, a_t) = true, if a_t ≠ arg max_a Q̂(s_t, a); false, otherwise,    (3.38)

which assumes that only a single action can be greedy. However, in some tasks some states can have several equivalent best actions (e.g. as in the example in Section 2.5). Also, the Q-function might be initialised uniformly, in which case all actions are initially equivalent. For Watkins' Q(λ), the above predicate will result in the return estimate being truncated unnecessarily often. A better alternative, which acknowledges that there may be several equivalent greedy actions, is,

    off-policy(s_t, a_t) = true, if max_a Q̂(s_t, a) − Q̂(s_t, a_t) > ε_opol; false, otherwise,    (3.39)

where ε_opol is a constant which provides an upper bound on the maximally tolerated degree to which an action may be off-policy (i.e. the allowable "off-policyness" of an action). With ε_opol > 0, the off-policy predicate may yield false even for non-greedy actions. For the Watkins' Q(λ) algorithm, this means that the return estimate may include the reward following actions that are less greedy. An action a is defined here to be nearly-greedy if V̂(s) − Q̂(s, a) ≤ ε_opol for some small positive value of ε_opol. If ε_opol increases further, to be greater than (max_{a'} Q̂(s, a')) − Q̂(s, a) for all states over the entire life of the agent, then the Watkins' Q(λ) algorithm is identical to PWQ(λ), since the off-policy predicate is always false. The intermediate values of ε_opol define a new space of algorithms (we might call these semi-naive Watkins' Q(λ), after [150]). The value of ε_opol suggests the following error in the learned predictions for using the return of nearly-greedy policies as an evaluation of a greedy policy:

    ε_opol + γ ε_opol + γ² ε_opol + ⋯ = ε_opol / (1 − γ),   for 0 < γ < 1.
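The tolerant predicate of Equation 3.39 is a one-liner in code; the tolerance and the tiny example Q-values below are illustrative, not from the thesis:

```python
EPSILON_OPOL = 0.05   # allowable "off-policyness" of an action (assumed)

def off_policy(Q, s, a, actions, tol=EPSILON_OPOL):
    """True iff a's value is more than tol below the best value in s."""
    return max(Q[(s, a2)] for a2 in actions) - Q[(s, a)] > tol

Q = {('s', 'a'): 1.00, ('s', 'b'): 0.97, ('s', 'c'): 0.50}
# 'b' is nearly greedy, so Watkins' Q(lambda) need not truncate after it;
# 'c' is clearly non-greedy and would trigger truncation.
assert off_policy(Q, 's', 'b', 'abc') is False
assert off_policy(Q, 's', 'c', 'abc') is True
```

With `tol=0` the predicate still improves on Equation 3.38, since ties between several equally greedy actions no longer force a truncation.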
3.6 Indirect Reinforcement Learning

An alternative to directly learning the value function is to integrate planning (i.e. value-iteration) with online learning. This approach is the DYNA framework, of which many instantiations are possible [144]. In order to allow planning, maximum likelihood models of R_s^a and P_{ss'}^a can be constructed from the running means of samples of observed immediate rewards and state transitions, or, equivalently, by applying the following updates (in order) [143]:

    N_s^a ← N_s^a + 1,    (3.40)

    R̂_s^a ← R̂_s^a + (1 / N_s^a) ( r_t − R̂_s^a ),    (3.41)

    ∀x ∈ S:  P̂_{sx}^a ← P̂_{sx}^a + (1 / N_s^a) ( I(x, s') − P̂_{sx}^a ),    (3.42)
where a = a_t, s = s_t, s' = s_{t+1}, I(x, s') is an identity indicator (equal to 1 if x = s' and 0 otherwise), and N_s^a is a record of the number of times a has been taken in s. Backup (3.42) must be applied for all (s, x) pairs after each observed transition. Note that there is no benefit in learning R̂_{ss'}^a instead of R̂_s^a, since once a is chosen in s, there is no control over which s' is entered as the successor. With a model, the dynamic programming methods presented in the previous chapter may now be applied. In practice, fully re-solving the learned MDP given the new model is often too expensive to do online. The Adaptive Real-Time Dynamic Programming (ARTDP) solution is to perform value-iteration backups on some small set of states between online steps [12]. Similar approaches were also proposed in [89, 71, 66]. Alternatively, prioritised sweeping focuses the backups where they are expected to most quickly reduce error [88, 105, 167, 7]. Note that if the value of a state changes, then the values of its predecessors are likely to also need updating. When applied online, the current state is updated and the change in error noted. A priority queue is maintained, indicating which states are likely to receive the greatest error reduction, on the basis of the size of the value changes in their successors. Thus, when the current state is updated, its change in value is used to promote the position of its predecessors in the priority queue. Additional updates may then be made, always removing and updating the highest priority state in the queue and then promoting its predecessors in the queue. More or fewer updates may be made depending upon how much real time is available between experiences. In practice, it is not always clear whether the value-iteration backups are preferable to model-free methods. In several comparisons, they appear to learn with orders of magnitude less experience than Q-learning [150, 12]. However, value-iteration backups are often far more expensive.
For instance, if the environment is very stochastic, then a state may have very many successors. In the worst case, a value-iteration update for a single state could cost O(|S| |A|). Thus, even when updates are distributed in focused ways, their computational expense can still be very great compared to model-free methods. Also, in the next chapter we will see how the computational cost of experience-efficient model-free methods (such as eligibility trace methods) can be brought in line with methods such as Q-learning. A general rule of thumb seems to be that if experience is costly to obtain, then learning a model is a good way to reduce this cost. The most effective way of employing models is, however, still open to debate: model-free methods can also be applied using the model as a simulation. A discussion can be found in [150]. So far, we have also only considered cases where it is feasible to store V̂ or Q̂ in a lookup table. Where this is not possible (e.g. if the state-space is large or non-discrete), function approximators must be employed to represent these functions, and also the model (P and R). In this case, it seems that model-free methods provide significant advantages. For instance, return and eligibility trace methods are thought to suffer less in non-Markov settings. (Many function approximation schemes, such as state-aggregation, can be thought of as providing the learner with a non-Markov view of the world, even if the perceived state
is one of a Markov process.) By their "single-step" nature, P and R give rise to methods that rely heavily on the Markov property. It is not clear how multistep models can be learned so as to overcome this dependence on the Markov property. It is also often unclear how to represent stochastic models with many kinds of function approximator. Function approximation is covered in more detail in Chapter 5.
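The maximum-likelihood model updates of Equations 3.40-3.42 can be maintained incrementally and sparsely, storing only successors actually observed; this is equivalent to applying (3.42) for every x, since unseen entries remain zero. A sketch (the data structures and names are ours):

```python
from collections import defaultdict

N = defaultdict(int)                          # N[s, a]: visit counts
R = defaultdict(float)                        # R[s, a]: mean immediate reward
P = defaultdict(lambda: defaultdict(float))   # P[s, a][x]: transition probs

def update_model(s, a, r, s_next):
    """One observed transition: Equations 3.40 (count), 3.41 (reward mean)
    and 3.42 (transition probabilities, sparse form)."""
    N[s, a] += 1
    k = 1.0 / N[s, a]
    R[s, a] += k * (r - R[s, a])
    for x in P[s, a]:             # decay every previously seen successor
        P[s, a][x] -= k * P[s, a][x]
    P[s, a][s_next] += k          # credit the observed successor
```

After any number of updates the probabilities for a visited (s, a) sum to one, and R holds the running mean of the observed rewards, so value-iteration backups can be run directly on these tables.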
3.7 Summary

In this chapter we have seen how reinforcement learning can proceed starting with little or no prior knowledge of the task being solved. Using only the knowledge gained through interaction with the environment, optimal solutions to difficult stochastic control problems can be found. A number of different dimensions to RL methods have been seen: prediction and control methods, bias and variance issues, direct and indirect methods, exploration and exploitation, online and offline methods, on-policy and off-policy methods, and single-step and multistep methods. Online learning in cyclic environments was identified as a particularly interesting class of problems for model-free methods. Here we see a wider variation in the solution methods than in the acyclic or offline cases. Also, we have seen how it is difficult to apply forward view methods in this case, and how (accumulate) trace methods can significantly differ from their forward view analogues. Also, there appears to be no theoretically sound and experience-efficient model-free control method for online learning while continuing to take non-greedy actions. Section 3.5.3 listed several examples of why such learning methods are useful. Apparently sound methods, such as Watkins' Q(λ), suffer from "short-sightedness", while unsound methods can easily be shown to suffer from a loss of predictive accuracy (practical examples are given in the next chapter).
Chapter 4
Efficient Off-Policy Control

Chapter Outline
This chapter reviews extensions to the model-free learning algorithms presented in the previous chapter. We see how their computational costs can be reduced and their data-efficiency increased, while also allowing for exploratory actions and online learning. The experimental results using these algorithms also lead to interesting insights about the role of optimism in reinforcement learning control methods.
4.1 Introduction

The previous chapter introduced a number of RL algorithms. Let's review some of the properties that we'd like a method to have:

Predictive. Algorithms that predict, from each state in the environment, the expected return available for following some given policy thereafter.

Optimising Control. Algorithms perform control optimisation if they find or approximate an optimal policy rather than evaluate some fixed policy.

Exploration Insensitive. Algorithms that can evaluate one policy while following another are exploration insensitive methods (also referred to as off-policy methods) [150, 163]. In the context of control optimisation, we often want to evaluate the greedy policy while following some exploration policy.

Online Learning. Online learning methods immediately apply observed experiences for learning. Where exploration depends upon the Q-function, online methods can have a huge advantage over methods which learn offline [128, 65, 165, 168]. For instance, most exploration strategies quickly decline the probability of taking actions which lead to large punishments, provided that the Q-values for those actions are also declined. If the Q-function is adjusted offline, or after some long interval, then the exploration strategy may select poor actions many times more than necessary within a single episode.

Computationally Cheap. Currently, the cheapest online learning control methods have time complexities of O(|A|) per experience, where |A| is the number of actions currently available to the agent [168, 163].

Fast Learning. Methods which make effective use of limited real experience. For example, methods which learn a model of the environment can make excellent use of experience, but are often computationally far more expensive than O(|A|) when learning online. Existing model-free methods have attempted to tackle this using eligibility traces [148, 163, 128] or backwards replay [72, 76]. However, off-policy (exploration insensitive) eligibility trace methods for control, such as Watkins' Q(λ), are relatively inefficient. Also, backwards replay is generally regarded as a technique that cannot be used for online learning. Methods such as SARSA(λ) and Peng and Williams' Q(λ) are exploration sensitive methods; if exploring actions are continuously taken in the environment, then they lose predictive accuracy in their Q-functions as a result.

Scalable. For an RL algorithm to be practical, it must work in cases where there are very many states or where the state-space is non-discrete. Typically, this involves using a function approximator to store and update the Q-function. Eligibility trace methods have been shown to work well when applied with function approximators [163, 149].
This chapter reviews a number of important RL algorithms. It is shown how Lin's backwards replay can be modified to learn online, and so provides a good substitute for eligibility trace methods. It is both simpler and, in many instances, also faster learning. The simplicity gains are derived by directly employing the return estimate in learning updates rather than calculating its effect incrementally. In many instances, learning speed-ups are also derived through the backwards replay mechanism, allowing return estimates to be based on more up-to-date information than for eligibility trace methods. Special consideration is given to off-policy control methods which, despite having most of the above properties in combination, have received little attention or use in the literature due to their supposed slow learning [128, 150]. Several new off-policy control methods are presented, the last of which is designed to provide significant data-efficiency improvements over Watkins' Q(λ). The general new technique can easily be applied in order to derive analogues of most eligibility trace methods, such as TD(λ), SARSA(λ) [150] and importance sampling TD(λ) [111]. First, Section 4.2 reviews Fast Q(λ), a method for precisely implementing eligibility trace
methods at a cost which is independent of the size of the state-space. This algorithm is used as a state-of-the-art baseline against which the new method is compared. Section 4.3 reviews existing backwards replay methods that provide the basis of the new approach. Section 4.4 introduces the new Experience Stack method, an online-learning version of backwards replay that is as computationally cheap as Fast Q(λ). Section 4.5 provides some experimental results with this algorithm, comparing it against Fast Q(λ). This and the supporting analysis in Sections 4.6 and 4.7 give a useful profile of when backwards replay may provide improvements over eligibility traces. Section 4.7 also provides a surprising new insight into the potentially harmful effects of optimistic initial value biases on learning updates that employ return estimates truncated with max_a Q̂(s, a).
4.2 Accelerating Q(λ)

Naive implementations of Q(λ) (as presented in the previous chapter) are far more expensive than Q(0), as they involve updating the eligibilities and Q-values of all SAPs at each time step. This gives a time complexity of O(|S||A|) per experience, instead of O(|A|). A simple and well known improvement is to update only those Q-values with significant traces. See [167] for an implementation. For some trace significance n, the n or fewer most recently visited states have their eligibilities and values updated, at a cost of O(n|A|) per step. States visited more than n steps ago have eligibilities of zero. n is given such that (γλ)^n < ε, for some small ε. However, if (γλ) → 1 and the environment has an appropriate structure, potentially all of the states in the system may contain significant traces. In this case, n → |S|, and much of the computational saving is nullified.

4.2.1 Fast Q(λ)
Fast Q(λ) is intended as a fully online implementation of Peng and Williams' Q(λ), but with a time complexity of O(|A|) per update. The algorithm is designed for λ > 0; otherwise we can use simple Q-learning. This section is adapted, with minor changes, from [125], which contains the original description of Fast Q(λ), provided courtesy of Marco Wiering. The description of Fast Q(λ) is not a new contribution.
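Before turning to Fast Q(λ), note that the trace-significance cutoff from the previous section, the smallest n with (γλ)^n < ε, is cheap to compute. A minimal helper (the function name and values are ours):

```python
import math

def trace_horizon(gamma, lam, eps):
    """Smallest n with (gamma*lambda)^n < eps: only the last n visited
    SAPs carry an eligibility worth updating."""
    return math.ceil(math.log(eps) / math.log(gamma * lam))

# With gamma = 0.9, lambda = 0.8 and eps = 0.01, only the last 15 SAPs
# matter; as gamma*lambda -> 1 the horizon approaches the episode length,
# which is exactly the regime Fast Q(lambda) is designed for.
```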
Main principle. The algorithm is based on the observation that the only Q-values needed at any given time are those for the possible actions given the current state. Hence, using "lazy learning", we can postpone updating Q-values until they are needed. First note that, as in Equation 3.27 for TD(λ), the increment of Q̂(s, a) made by Peng and Williams' Q(λ) (in Figure 3.14) for a complete episode can be written as follows (for simplicity, a fixed learning rate α is assumed):

    ΔQ̂(s, a) = Q̂_T(s, a) − Q̂_0(s, a)    (4.1)

    ΔQ̂(s, a) = α Σ_{t=1}^{T} [ δ'_t I_t(s, a) + Σ_{i=t+1}^{T} (γλ)^{i−t} δ_i I_t(s, a) ]    (4.2)

             = α Σ_{t=1}^{T} [ δ'_t I_t(s, a) + Σ_{i=1}^{t−1} (γλ)^{t−i} δ_t I_i(s, a) ]    (4.3)

             = α Σ_{t=1}^{T} [ δ'_t I_t(s, a) + δ_t Σ_{i=1}^{t−1} (γλ)^{t−i} I_i(s, a) ].    (4.4)

In what follows, let us abbreviate I_t = I_t(s, a) and φ = γλ. Suppose some SAP (s, a) occurs at steps t_1, t_2, t_3, …; then we may unfold terms of expression (4.4):

    Σ_{t=1}^{T} [ δ'_t I_t + δ_t Σ_{i=1}^{t−1} φ^{t−i} I_i ]
      = Σ_{t=1}^{t_1} [ δ'_t I_t + δ_t Σ_{i=1}^{t−1} φ^{t−i} I_i ] + Σ_{t=t_1+1}^{t_2} [ ⋯ ] + Σ_{t=t_2+1}^{t_3} [ ⋯ ] + ⋯    (4.5)

Since I_t(s, a) is 1 only for t = t_1, t_2, t_3, …, and 0 otherwise, we can rewrite Equation 4.5 as

    δ'_{t_1} + δ'_{t_2} + Σ_{t=t_1+1}^{t_2} δ_t φ^{t−t_1} + δ'_{t_3} + Σ_{t=t_2+1}^{t_3} δ_t ( φ^{t−t_1} + φ^{t−t_2} ) + ⋯

      = δ'_{t_1} + δ'_{t_2} + (1/φ^{t_1}) ( Σ_{t=1}^{t_2} δ_t φ^t − Σ_{t=1}^{t_1} δ_t φ^t )
          + δ'_{t_3} + ( 1/φ^{t_1} + 1/φ^{t_2} ) ( Σ_{t=1}^{t_3} δ_t φ^t − Σ_{t=1}^{t_2} δ_t φ^t ) + ⋯

Defining Δ_t = Σ_{i=1}^{t} δ_i φ^i, this becomes

    δ'_{t_1} + δ'_{t_2} + (1/φ^{t_1}) ( Δ_{t_2} − Δ_{t_1} ) + δ'_{t_3} + ( 1/φ^{t_1} + 1/φ^{t_2} ) ( Δ_{t_3} − Δ_{t_2} ) + ⋯    (4.6)

This will allow the construction of an efficient online Q(λ) algorithm. We define a local trace e'_t(s, a) = Σ_{i=1}^{t} I_i(s, a) / φ^i, and use (4.6) to write down the total update of Q̂(s, a) during an episode:

    ΔQ̂(s, a) = α Σ_{t=1}^{T} [ δ'_t I_t(s, a) + e'_t(s, a) ( Δ_{t+1} − Δ_t ) ].    (4.7)
To exploit this we introduce a global variable Δ, keeping track of the cumulative TD(λ) error since the start of the episode. As long as the SAP (s, a) does not occur, we postpone updating Q̂(s, a). In the update below we need to subtract that part of Δ which has already been used (see Equations 4.6 and 4.7). We use, for each SAP (s, a), a local variable δ(s, a), which records the value of Δ at the moment of the last update, and a local trace variable e'(s, a). Then, once Q̂(s, a) needs to be known, we update Q̂(s, a) by adding α e'(s, a)(Δ − δ(s, a)).

Algorithm overview. The algorithm relies on two procedures: the Local Update procedure calculates exact Q-values once they are required; the Global Update procedure updates the
4.2.
ACCELERATING Q( )
51
global variables and the current Q-value. Initially we set the global variables φ^0 ← 1.0 and Δ ← 0. We also initialise the local variables δ(s, a) ← 0 and e'(s, a) ← 0 for all SAPs.

Local updates. Q-values for all actions possible in a given state are updated before an action is selected and before a particular Q-value is calculated. For each SAP (s, a), a variable δ(s, a) tracks changes since the last update:

Local Update(s_t, a_t):
1)  Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k(s_t, a_t) (Δ − δ(s_t, a_t)) e'(s_t, a_t)
2)  δ(s_t, a_t) ← Δ
The global update procedure. After each executed action we invoke the procedure Global Update, which consists of three basic steps: (1) To calculate max_a Q̂(s_{t+1}, a) (which may have changed due to the most recent experience), it calls Local Update for the possible next SAPs. (2) It updates the global variables φ^t and Δ. (3) It updates the Q-value and trace variable of (s_t, a_t), and stores the current Δ value (in Local Update).

Global Update(s_t, a_t, r_t, s_{t+1}):
1)   ∀a ∈ A do:                          (make Q̂(s_{t+1}, ·) up to date)
1a)    Local Update(s_{t+1}, a)
2)   δ'_t ← ( r_t + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t) )
3)   δ_t ← ( r_t + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a) )
4)   φ^t ← φ φ^{t−1}                      (update global clock)
5)   Δ ← Δ + δ_t φ^t                      (add new TD error to global error)
6)   Local Update(s_t, a_t)               (make Q̂(s_t, a_t) up to date for next step)
7)   Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k(s_t, a_t) δ'_t
8)   e'(s_t, a_t) ← e'(s_t, a_t) + 1/φ^t   (update trace)
For state replacing eligibility traces [139], step 8 should be changed as follows: 8a) ∀a: e'(s_t, a) ← 0; e'(s_t, a_t) ← 1/φ^t.

Machine precision problem and solution. Adding δ_t φ^t to Δ in line 5 may create a problem due to limited machine precision: for large absolute values of Δ and small φ^t there may be significant rounding errors. More importantly, line 8 will quickly overflow any machine for φ < 1. The following addendum to the procedure Global Update detects when φ^t falls below machine precision ε_m and updates all SAPs which have occurred. A list, H, is used to track SAPs that are not up to date. If e'(s, a) < ε_m, the SAP (s, a) is removed from H. Finally, Δ and φ^t are reset to their initial values.
CHAPTER 4. EFFICIENT OFF-POLICY CONTROL
Global Update: addendum
9) if (visited(s_t, a_t) = 0):
   9a) H ← H ∪ (s_t, a_t)
   9b) visited(s_t, a_t) ← 1
10) if (φ^t < ε_m):
   10a) ∀(s, a) ∈ H do:
      10a1) Local Update(s, a)
      10a2) e′(s, a) ← e′(s, a)φ^t
      10a3) if (e′(s, a) < ε_m):
         10a31) H ← H \ (s, a)
         10a32) visited(s, a) ← 0
      10a4) δ(s, a) ← 0
   10b) Δ ← 0
   10c) φ^t ← 1.0
Comments. Recall that Local Update sets δ(s, a) ← Δ, and update steps depend on Δ − δ(s, a). Thus, after having updated all SAPs in H, we can set Δ ← 0 and δ(s, a) ← 0. Furthermore, we can simply set e′(s, a) ← e′(s, a)φ^t and φ^t ← 1.0 without affecting the expression e′(s, a)φ^t used in future updates; this just rescales the variables. Note that if γλ = 1, then no sweeps through the history list will be necessary.

Complexity. The algorithm's most expensive part is the set of calls to Local Update, whose total cost is O(|A|). This is not bad: even Q-learning's action selection procedure costs O(|A|) if, say, the Boltzmann rule is used. Concerning the occasional complete sweep through SAPs still in the history list H: during each sweep the traces of SAPs in H are multiplied by φ^t. SAPs are deleted from H once their trace falls below ε_m. In the worst case one sweep per n time steps updates 2n SAPs and costs O(1) on average. This means that there is an additional computational burden at certain time steps, but since this happens infrequently, the method's average update complexity stays O(|A|). The space complexity of the algorithm remains O(|S||A|). We need to store the following variables for all SAPs: Q-values, eligibility traces, δ values, the "visited" bit, and three pointers to manage the history list (one from each SAP to its place in the history list, and two for the doubly linked list). Finally we need to store the two global variables Δ and φ^t.
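The bookkeeping above can be made concrete in a few lines. Below is a minimal Python sketch of the lazy-update idea for accumulating traces only; the class and variable names (`FastQLambda`, `phi` for γλ, `Delta` for the global error) are illustrative, and the machine-precision addendum is omitted, so `phi_t` would eventually underflow on a long run:

```python
from collections import defaultdict

class FastQLambda:
    """Sketch of the lazy-update scheme with accumulating traces.

    Q-values are only touched when read (Local Update); all other
    trace decay is folded into the globals Delta and phi_t = (gamma*lam)**t.
    """

    def __init__(self, actions, alpha=0.3, gamma=0.9, lam=0.9):
        self.actions = actions
        self.alpha, self.gamma = alpha, gamma
        self.phi = gamma * lam          # the trace decay factor
        self.Delta, self.phi_t = 0.0, 1.0
        self.Q = defaultdict(float)     # stored Q-values
        self.delta = defaultdict(float) # value of Delta at last local update
        self.e = defaultdict(float)     # rescaled traces e'(s, a)

    def local_update(self, s, a):
        # Apply all TD-errors accumulated since (s, a) was last touched.
        self.Q[s, a] += self.alpha * (self.Delta - self.delta[s, a]) * self.e[s, a]
        self.delta[s, a] = self.Delta

    def global_update(self, s, a, r, s2):
        for b in self.actions:          # make Q(s, .) and Q(s2, .) up to date
            self.local_update(s, b)
            self.local_update(s2, b)
        v2 = max(self.Q[s2, b] for b in self.actions)
        d_prime = r + self.gamma * v2 - self.Q[s, a]
        v1 = max(self.Q[s, b] for b in self.actions)
        d = r + self.gamma * v2 - v1
        self.phi_t *= self.phi          # advance the global clock
        self.Delta += d * self.phi_t    # fold new TD-error into global error
        self.local_update(s, a)         # old trace of (s, a) receives d ...
        self.Q[s, a] += self.alpha * d_prime  # ... then the exact 1-step part
        self.e[s, a] += 1.0 / self.phi_t      # grow the rescaled trace
```

Per step, only the actions of the two visited states are touched, yet the Q-values match a conventional trace implementation that decays every SAP's eligibility each step.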
4.2. ACCELERATING Q(λ)
4.2.2 Revisions to Fast Q(λ)
In this section we see how the original version of Fast Q(λ) is likely to be misapplied, giving rise to two subtle errors. This section also introduces: i) what modifications, if any, are required of action selection mechanisms that are intended to employ the up-to-date Q-function, ii) the state-action replace trace version of Fast Q(λ), and iii) how the algorithm may be modified for off-policy learning (as Watkins' Q(λ)) [163, 150]. The new algorithms are shown in Figure 4.1. The new work in this section can be found in a joint technical report co-authored with Marco Wiering [125].

Error 1. Step 1 of the original Global Update procedure performs the updates to the Q-values at s_{t+1} necessary to ensure that Q̂(s_{t+1}, ·) is an up-to-date estimate before steps 2 and 3 where it is used. However, Q̂(s_t, ·) is also used in steps 2 and 3 and may not be up to date. This is easily corrected by adding:

1b) Local Update(s_t, a)

We shall see below that this change is not necessary if Q̂(s_t, ·) is made up to date at the end of the Global Update procedure.

Error 2. When state replacing traces are employed with the original Fast Q(λ) algorithm, it is possible that the eligibilities of some SAPs are zeroed. In such a case, if these SAPs previously had non-zero eligibilities then they will not receive any update making use of δ_t. An exception is Q̂(s_t, a_t), which is made up to date in step 6 (and so makes use of δ_t). However, all other SAPs at s_t with non-zero eligibilities will receive no adjustment toward δ_t if their eligibilities are zeroed. From the original version of Global Update:

...
3) δ_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a))  (here, each a ≠ a_t with a non-zero trace receives no update using δ_t; Q̂(s_t, a_t) is already up to date before this point)
...
8) ∀a: e′(s_t, a) ← 0; e′(s_t, a_t) ← 1/φ^t

To avoid this in the revised algorithm, all of the Q-values at s_t are made up to date before zeroing their eligibility traces (step 8a in the state-replace trace revisions).

Action Selection. Steps 9 and 9a of the Revised Global Update procedure are a pragmatic change to ensure that all of the Q-values for s_{t+1} are up to date by the end of the procedure. If this were not so then any code needing to make use of the Q-function at s_{t+1}, such as that for selecting the agent's next action, would need to be defined in terms of the up-to-date Q-function instead. Q̂⁺ is used to denote the up-to-date Q-function and can be found at any time as follows:

Q̂⁺(s, a) = Q̂(s, a) + α_k(s, a)(Δ − δ(s, a))e′(s, a)   (4.8)

From an implementation standpoint, these changes are desirable for at least three reasons. Firstly, the need to use Q̂⁺ for action selection is easy to overlook when implementing the original version of Fast Q(λ) as part of a larger learning agent. Secondly, it reduces coupling between algorithms; with steps 9 and 9a, an algorithm that implements action selection based on the up-to-date Q-values of s_{t+1} does not need to use Q̂⁺ or even care that values at different states may be out of date. Thirdly, it reduces the duplication of code; we are likely to already have action-selection algorithms that use Q̂(s_{t+1}, ·) and so we do not need to implement others that use Q̂⁺(s_{t+1}, ·) instead.

The original description of Fast Q(λ) assumed that the Local Update procedure was called for all actions in the current state immediately after the Global Update procedure and prior to selecting actions. However, from the original description, it was not clear that this still needs to be done (for the same reason as Error 2, above) even if the Q-values at the current state are not used by the action selection method (for example, if the actions are selected randomly or provided by a trainer). If this is done, then the original and revised algorithms are essentially identical.

The following two sections introduce new features to the algorithm and are not revisions.

State-Action Replacing Traces. From Section 3.4.7 note that the state-action replace trace method sets e(s, a) to 1 instead of adding 1, as in the accumulate trace method. For Fast Q(λ), an effect equivalent to setting an eligibility to 1 is achieved by performing e′_{t+1}(s, a) ← 1/φ^t.

Watkins' Q(λ). Watkins' Q(λ) requires that the eligibility traces be zeroed after taking non-greedy actions. The new Fast Q(λ) version works in the same way (by applying e′(s, a) ← 0 for all SAPs), except that here we must ensure that all non-up-to-date SAPs are updated before zeroing their traces (see the Flush Updates procedure).
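The on-demand read described by Equation (4.8) can be sketched as a small helper; the dict-based signature and names below are illustrative, not from the thesis:

```python
def q_plus(Q, delta, e, Delta, alpha, s, a):
    """Read the up-to-date Q-value of (s, a) without committing the write.

    Q, delta and e map (s, a) pairs to the stored Q-value, the value of
    the global error Delta at the last local update, and the rescaled
    trace e'(s, a), respectively (cf. Equation 4.8).
    """
    return Q[(s, a)] + alpha * (Delta - delta[(s, a)]) * e[(s, a)]
```

Committing the write (a Local Update) additionally stores δ(s, a) ← Δ, which this read-only variant deliberately skips.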
For accumulating traces:

Revised Global Update(s_t, a_t, r_t, s_{t+1}):
1) ∀a ∈ A do:
   1a) Local Update(s_{t+1}, a)
2) δ′_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − Q̂(s_t, a_t))  (NB. s_t was made up to date in step 9)
3) δ_t ← (r_t + γ max_a Q̂(s_{t+1}, a) − max_a Q̂(s_t, a))
4) φ^t ← γλ φ^{t−1}
5) Δ ← Δ + δ_t φ^t
6) Local Update(s_t, a_t)
7) Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_k(s_t, a_t)δ′_t
8) e′(s_t, a_t) ← e′(s_t, a_t) + 1/φ^t  (increment eligibility)
9) ∀a ∈ A do:
   9a) Local Update(s_{t+1}, a)  (make Q̂(s_{t+1}, ·) up to date before action selection)

For state-action replacing traces, replace step 8 with:
8) e′(s_t, a_t) ← 1/φ^t  (set eligibility to 1)

For state replacing traces, replace steps 8-9a with:
8) ∀a ∈ A do:
   8a) Local Update(s_t, a)  (make Q̂(s_t, ·) up to date before zeroing eligibility)
   8b) e′(s_t, a) ← 0  (zero eligibility)
   8c) Local Update(s_{t+1}, a)  (make Q̂(s_{t+1}, ·) up to date before action selection)
9) e′(s_t, a_t) ← 1/φ^t  (set eligibility to 1)

For Watkins' Q(λ), prepend the following to the Revised Global Update procedures:
0) if offpolicy(s_t, a_t):  (test whether a non-greedy action was taken)
   0a) Flush Updates()

Flush Updates():
1) ∀(s, a) ∈ H do:
2)   Q̂(s, a) ← Q̂(s, a) + α_k(s, a)(Δ − δ(s, a))e′(s, a)
3)   δ(s, a) ← 0
4)   e′(s, a) ← 0
5) H ← {}
6) Δ ← 0
7) φ^t ← 1.0

Figure 4.1: The revised Fast Q(λ) algorithm for accumulating, state replacing and state-action replacing traces and for Watkins' Q(λ). The machine precision addendum should be appended to each algorithm. The Flush Updates procedure can also be called upon entering a terminal state to make the entire Q-function up to date and also reinitialise the eligibility and error values of each SAP ready for learning in the next episode.
4.2.3 Validation
In this section we empirically test how closely the correct and erroneous implementations of Fast Q(λ) approximate the original versions of Q(λ). Fast Q(λ)+ is used to denote the correct implementation suggested here and Fast Q(λ)− to denote the method that does not apply a Local Update for all actions in the new state between calls to the Global Update procedure. Note that if these updates are performed, Fast Q(λ)+ and Fast Q(λ)− are identical methods.¹

The algorithms were tested using the maze task shown in Figure 4.4. This task was chosen as credit for actions leading to the goal can be significantly delayed (and so eligibility traces are expected to help) and also because state revisits can frequently occur, causing the different eligibility trace methods to behave differently. Actions taken by the agent at each step were selected using ε-greedy [150]. This selects a greedy action, arg max_a Q̂(s_t, a), with probability 1 − ε, and a random action otherwise. Fast Q(λ)− was given the benefit of using the true up-to-date Q-function (i.e. arg max_a Q̂⁺(s_t, a) was used to choose its greedy action).

Figure 4.2 compares the results for the PW Q(λ) variants. The graphs measure the total reward collected by each algorithm and the mean squared error (MSE) in the up-to-date Q-function learned by each algorithm over the course of 200000 time steps. The squared error was measured as,

SE(s) = (V*(s) − max_a Q̂(s, a))²,   (4.9)

for regular Q(λ) and as,

SE(s) = (V*(s) − max_a Q̂⁺(s, a))²,   (4.10)

for both versions of Fast Q(λ). An accurate V* was found by dynamic programming methods. All of the results in the graphs are the average of 100 runs.

Fast PW Q(λ)+ provided equal or better performance than Fast PW Q(λ)− in most instances, and its results also provided an extremely good fit against the original version of PW Q(λ) in all cases (see Figures 4.2 and 4.3). Similar results were found when comparing Watkins' Q(λ) and its Fast variants (see Figures 4.5 and 4.6). Fast Q(λ)− performed especially poorly in terms of error compared with Fast Q(λ)+ for PW with accumulating or state-action replacing traces. However, in one instance (with a state replacing trace) the error performance of the revised algorithm was actually worse than the original (see Figure 4.3). This anomaly was not seen for Watkins' Q(λ) (see Figure 4.6).

¹The experiments in Wiering's original description of Fast Q(λ) did perform these local updates and so we do not repeat the experiments in the original paper [168, 169, 167].
Figure 4.2: Comparison of PW Q(λ), Fast PW Q(λ)+ and Fast PW Q(λ)− performance profiles in the stochastic maze task. Results are the average of 20 runs. The parameters were Q̂₀ = 100, α = 0.3, ε = 0.1 (low exploration rate), λ = 0.9 and ε_m = 1×10⁻³ for regular Q(λ) and ε_m = 10⁻¹⁰ for the Fast versions. (left column) Total reward collected. (right column) Mean squared error in the value function. (top row) With accumulating traces. (middle row) With state replacing traces. (bottom row) With state-action replacing traces.
The effect of exploratory actions on PW Q(λ) is also evident in these results. The PW Q(λ) methods collected less reward and found a hugely less accurate Q-function in the case of a high exploration rate than Watkins' methods (compare Figures 4.3 and 4.6). In contrast, Watkins' variants collected similar or better amounts of reward but found far more accurate Q-functions than Peng and Williams' methods in both the high and low exploration rate cases. Similar results concerning the error were reported by Wyatt in [176]. However, this example clearly demonstrates the benefit of off-policy learning under exploration in terms of collected return.
Figure 4.3: Comparison of Peng and Williams' Q(λ) methods with a high exploration rate (ε = 0.5). All other parameters are as in Figure 4.2. Note that the scale of the vertical axes differs between experiment sets.
Figure 4.4: The large stochastic maze task. At each step the agent may choose one of four actions (N, S, E, W). Transitions have probabilities of 0.8 of succeeding, 0.08 of moving the agent laterally and 0.04 of moving in the opposite to the intended direction. Impassable walls are marked in black and penalty fields of −4 and −1 are marked in dark and light grey respectively. A reward of 100 is given for entering the top-right corner and −10 for the others. Episodes start in random states and continue until one of the four terminal corner states is entered.
Figure 4.5: Comparison of Watkins' Q(λ), Fast Watkins' Q(λ)− and Revised Fast Watkins' Q(λ)+ in the stochastic maze task. All parameters are as in Figure 4.2 (i.e. a low exploration rate with ε = 0.1).
In addition to showing that the performance of Fast Q(λ)+ is similar to Q(λ) in the mean, we performed a more detailed test. The agents were made to learn from identical experience gathered over 2000 simulation steps in the small stochastic maze shown in Figure 4.7. At each time step, the difference between the Q-function of Q(λ) and the up-to-date Q-functions of Fast Q(λ)+ and Fast Q(λ)− was measured. The largest differences at any time during the course of learning are shown in Table 4.1. The differences for Fast Q(λ)+ are all in the order of ε_m or better. The differences for Fast Q(λ)− are many orders of magnitude greater.
Figure 4.6: Comparison of Watkins' Q(λ) methods with a high exploration rate (ε = 0.5). All other parameters are as in Figure 4.2.
Figure 4.7: A small stochastic maze task (from [130]). Rewards of −1 and +1 are given for entering (4, 2) and (4, 3), respectively. On non-terminal transitions, r_t = −1/25.
              Fast Q(λ)−    Fast Q(λ)+
PW-acc        0.7           1.7×10⁻¹⁵
PW-srepl      1.3           8.8×10⁻¹⁶
PW-sarepl     0.3           1.7×10⁻¹⁵
WAT-acc       1.3           7.6×10⁻¹³
WAT-srepl     2.5           4.2×10⁻¹⁰
WAT-sarepl    0.6           2.9×10⁻¹¹

Table 4.1: The largest differences from the Q-function learned by the original Q(λ) during the course of 2000 time steps of experience within the small maze task in Figure 4.7. The experiment parameters were ε_m = 10⁻⁹, α = 0.2, γ = 0.95 and λ = 1.0. The experience was generated by randomly selecting actions.

4.2.4 Discussion
Fast Q(λ) provides the means to implement Q(λ) at a greatly reduced computational cost that is independent of the size of the state space. As such, it makes it feasible for RL to tackle problems of greater scale. Independently developed, Pendrith and Ryan's P-Trace and C-Trace algorithms work in a similar way to Fast Q(λ) but are limited to the case where λ = 1 [104, 103]. Although the underlying derivation of Fast Q(λ) is correct, we have seen here that the original algorithmic description is likely to be misinterpreted and incorrectly implemented. Simplifications and clarifications were made, maintaining the algorithm's mean time complexity of O(|A|) per step. Naive implementations of Q(λ) are O(|S||A|) per step. We have also seen how Fast Q(λ) can be modified to use state-action replacing traces or to be used as an exploration insensitive learning method, and reported upon the merits of these modifications. In particular, in the experiments conducted here, the exploration insensitive versions provided similar or better performance in terms of the collected reward, but achieved uniformly better performance in terms of Q-function error. This was found with both high and low amounts of exploration.
4.3 Backwards Replay

In [72] Lin introduced experience replay which, like eligibility trace and (forward-view) λ-return methods, allows a single experience to be used to adjust the values of many predecessor states. In his experiments, a human controller provides a training policy for a robot to reduce the cost of exploring the environment. This experience is recorded and then repeatedly replayed offline in order to learn a Q-function. The Q-function was represented by a multilayer neural network and a single-step Q-learning-like update rule was used to make updates. In this way, better use of a small amount of expensive real experience can be made when training the RL agent.
Backwards-Replay-Watkins-Q(λ)-update:
1) z ← 0  (initialise return to value of terminal state)
2) for each i in t_T − 1, t_T − 2, …, t_0 do:
3)   z ← λ(r_{i+1} + γz) + (1 − λ)(r_{i+1} + γ max_a Q̂(s_{i+1}, a))
4)   Q̂(s_i, a_i) ← Q̂(s_i, a_i) + α_k(z − Q̂(s_i, a_i))
5)   if offpolicy(s_i, a_i):  (test for non-greedy action)
6)     z ← max_a Q̂(s_i, a)  (truncate return estimate)

Figure 4.8: Lin's backwards replay algorithm modified for evaluating the greedy policy (as Watkins' Q(λ)). The algorithm is applied upon entering a terminal state and may be executed several times. Terminal states are assumed to have zero value (rewards for entering a terminal state may be non-zero).

The training experience has the advantage of providing the agent with a relatively good behaviour from which it may bootstrap its own policy, and also greatly reduces the cost of exploring the state space. Note that a key difference between this and the training methods used by supervised learning is that the RL agent aims to actually improve upon the training behaviour and not simply reproduce it. Experience replay has also been successfully applied by Zhang and Dietterich for a job shop scheduling system [177], and for mobile robot navigation [140].

When replaying the recorded experience, a great learning efficiency boost can be gained by replaying the experience in the reverse order to that in which it was observed. For example, if the agent observed the experience tuples (s_t, a_t, r_{t+1}), (s_{t+1}, a_{t+1}, r_{t+2}), …, then a Q-learning update is made to Q̂(s_{t+1}, a_{t+1}) before Q̂(s_t, a_t). In this way, the return estimate used to update Q̂(s_t, a_t) may use a just-updated value of max_a Q̂(s_{t+1}, a), which itself may have just changed to include the just-updated value of max_a Q̂(s_{t+2}, a), and so on. Even if 1-step return estimates are employed in the backups, and experience is only replayed once, information about a new reward can still be propagated to many prior SAPs. Furthermore, if λ-return estimates are employed then computational efficiency gains can also be found by working backwards and employing the recursive form of the λ-return estimate (as in Equation (3.24) or (3.37)). This is illustrated in a new version of the backwards replay algorithm modified to use the same return estimate as Watkins' Q(λ) (see Figure 4.8).
The algorithm is extremely simple, can provide learning speed-ups and also has a naturally computationally efficient implementation; it is just O(|A|) per step. It achieves its computational efficiency far more elegantly than Fast Q(λ), by directly implementing the forwards view of λ-return updates. By contrast, Fast Q(λ) performs two complex transformations on the λ-return estimate. Figure 4.9 illustrates the advantage of using backwards replay over Q(λ) in the corridor task shown in Figure 3.5. Note here that backwards replay with λ = 0 can be as good as or better than Q(λ) (for any λ) where the learning rate is declined with 1/k (where k(s, a) counts the backups of Q̂(s, a)). Similar results are noted by Sutton and Singh [151]. As in this example, they note that backwards replay reduces bias due to the initial value estimates in acyclic environments, eliminating it totally in cases where α = 1 at the first value updates.
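The loop of Figure 4.8 translates almost directly into code. A minimal Python sketch follows, assuming an episode is a list of (s, a, r) triples ending at a terminal state of zero value, and an optional `greedy` predicate for the Watkins-style truncation; the function and argument names are illustrative:

```python
from collections import defaultdict

def backwards_replay(Q, episode, actions, alpha=0.5, gamma=0.95, lam=0.9,
                     greedy=None):
    """Replay one episode backwards, updating Q toward recursive
    lambda-returns. episode = [(s0, a0, r1), (s1, a1, r2), ...]."""
    z = 0.0                                     # terminal state has value 0
    for i in range(len(episode) - 1, -1, -1):
        s, a, r = episode[i]
        if i + 1 < len(episode):
            s2 = episode[i + 1][0]
            v2 = max(Q[(s2, b)] for b in actions)
        else:
            v2 = 0.0                            # terminal successor
        # blend the replayed return with the 1-step bootstrapped return
        z = lam * (r + gamma * z) + (1 - lam) * (r + gamma * v2)
        Q[(s, a)] += alpha * (z - Q[(s, a)])
        if greedy is not None and not greedy(s, a):
            z = max(Q[(s, b)] for b in actions)  # truncate after off-policy
```

Because updates run newest-to-oldest, each backup reads a just-updated successor value, giving the speed-up described above without any trace bookkeeping.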
Figure 4.9: The Q-functions learned by backwards replay and by Q(λ) after 1 episode in the corridor task shown in Figure 3.5. Values of λ = 0, λ = 0.9 and Q̂₀ = 0 are tested. (left) Learning with a constant α = 0.8. Backwards replay improves upon the eligibility trace counterparts in both cases. This learning speed-up for backwards replay is derived solely from employing more up-to-date information. (right) Learning with α = 1/k. With any value of λ, backwards replay finds the actual return estimate, while Q(λ) finds it only if λ = 1.

However, because of its dependence on future information, it is not clear how backwards replay extends to the case of online learning in cyclic environments.

Truncated TD(λ)
In [30] Cichosz introduced the Truncated TD(λ) (TTD) algorithm to apply backwards replay online. Figure 4.10 shows how TTD can be modified to be a greedy-policy evaluating, exploration insensitive method. TTD also directly employs the λ-return due to a state or SAP by maintaining an experience buffer from which its return is computed. To keep the buffer to a reasonable length but still allow for online learning, only the last n experiences are maintained. Updates are delayed: state s_{t−n} is updated at time t, when there is enough experience to make an n-step truncated λ-return estimate (as introduced in Equation 3.37). This delay in making backups can lead to the same inefficiencies in the exploration strategy suffered by purely offline learning methods. As such, TTD is sometimes referred to as semi-offline as it still allows for non-episodic learning and exploration [168]. Also, the method makes updates at a cost of O(n|A|) per step and so it would seem there is no computational advantage to learning in this way compared to the approximate method described in Section 4.2. Thus, the primary benefit of this approach is that it directly employs the λ-return estimate in updates and is simpler than an eligibility trace method as a result. Cichosz also argues that since actual λ-return estimates are used, the method can be applied more easily to a wider range of function approximators than is possible for eligibility trace methods [31].

Replayed TD(λ)
Replayed TD(λ) is an adaptation of TTD that updates the most recent n states at each time step using the most recent n experiences [32] (see Figure 4.11).
Truncated-Watkins-Q(λ)-update(s_{t+1}):
1) z ← max_a Q̂(s_{t+1}, a)
2) wasoffpolicy ← false
3) for each i in t + 1, t, …, t − n + 2 do:
4)   if wasoffpolicy:  (true when a_{i+1} was non-greedy)
5)     z ← r_i + γ max_a Q̂(s_i, a)
6)   else:
7)     z ← λ(r_i + γz) + (1 − λ)(r_i + γ max_a Q̂(s_i, a))
8)   wasoffpolicy ← offpolicy(s_i, a_i)
9) Q̂(s_{t−n}, a_{t−n}) ← Q̂(s_{t−n}, a_{t−n}) + α_k(z − Q̂(s_{t−n}, a_{t−n}))

Figure 4.10: Cichosz' Truncated TD(λ) algorithm modified for evaluating the greedy policy. The above update is applied after every step. An experience buffer of the last n experiences needs to be maintained, and the first and last n updates of an episode need special handling. These extra details are omitted from the above algorithm (see [31] for full details).

Replayed-Watkins-Q(λ)-update(s_t):
1) z ← 0  (initialise return to value of terminal state)
2) for each i in t, …, t − n do:
3)   z ← λ(r_{i+1} + γz) + (1 − λ)(r_{i+1} + γ max_a Q̂(s_{i+1}, a))
4)   Q̂(s_i, a_i) ← Q̂(s_i, a_i) + α_k(z − Q̂(s_i, a_i))
5)   if offpolicy(s_i, a_i):  (test for non-greedy action)
6)     z ← max_a Q̂(s_i, a)  (truncate return estimate)
Figure 4.11: Cichosz' Replayed TD(λ) modified for evaluating the greedy policy. The above update is applied after every step.

Note that, for a SAP visited at time t, Q̂(s_t, a_t) will receive updates toward all of the following n truncated λ-return estimates: z_{t+1}^{(λ,1)}, z_{t+2}^{(λ,2)}, …, z_{t+n}^{(λ,n)}. Clearly these return estimates are not independent: all n returns include r_{t+1}, n − 1 include r_{t+2}, and so on. As a result of updating a Q-value several times towards these similar returns, the algorithm will learn Q-values that are much more strongly biased towards the most recent experiences than other methods. In turn this could cause learning problems in highly stochastic environments (or, more generally, where the return estimate has high variance). There may exist ways to counteract this (for example, by reducing the learning rate). Even so, it is likely that the algorithm's aggressive use of experience outweighs these high variance problems, and Cichosz reports some promising results. However, the algorithm also remains O(n|A|) per step (as TTD(λ)), and although it does not suffer the same delay in performing updates that could be detrimental to exploration, immediate credit for actions is propagated to no more than the last n states.
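Both TTD and Replayed TD rest on the same backwards computation of a truncated λ-return over a short buffer. A minimal sketch of that computation follows; it is an illustrative helper under assumed conventions (rewards paired with the value of the state arrived in), not Cichosz's exact routine:

```python
def truncated_lambda_return(buffer, gamma, lam):
    """Truncated lambda-return for the oldest transition in `buffer`.

    `buffer` holds (reward, bootstrap_value) pairs ordered oldest to
    newest, where bootstrap_value is the current value estimate of the
    state the transition arrived in.
    """
    z = buffer[-1][1]                  # start from the newest state's value
    for r, v_next in reversed(buffer):
        # recursive blend of the replayed return and the 1-step bootstrap
        z = lam * (r + gamma * z) + (1 - lam) * (r + gamma * v_next)
    return z
```

With λ = 0 this collapses to the 1-step bootstrapped return of the oldest transition; with λ = 1 it is the full n-step truncated return, matching the recursion used in Figures 4.10 and 4.11.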
4.4 Experience Stack Reinforcement Learning

This section introduces the Experience Stack algorithm. This new method can be seen as a generalisation of Lin's offline backwards replay and also directly learns from the λ-return estimate. To allow the algorithm to work online, backups are made in a lazy fashion; states are backed-up only when new estimates of Q-values are required (for the purposes of aiding exploration) and available given the prior experience. Specifically, this occurs when the learner finds itself in a state it has previously visited and not backed-up.

The details of the algorithm are best explained through a worked example. Consider the experience in Figure 4.12. A learning episode starts in s_{t1} and the algorithm proceeds recording all experiences until s_{t3} is entered (previously visited at t2). If we continue exploring without making a backup to s_{t2}, we do so uninformed of the reward received between t2 + 1 and t3, perhaps to re-collect some negative reward in sequence X. This is the important disadvantage of an offline algorithm that we wish to avoid. To prevent this, the algorithm immediately replays (backwards) experience to update the states from s_{t3−1} to s_{t2} using the λ-return truncated at s_{t3}. This obtains a new Q-value at s_{t3} that can be used to aid exploration. Each replayed experience is discarded from memory. States visited prior to s_{t2} (sequence W) are not immediately updated. Putting exploration issues aside, it is often preferable to delay backups for as long as possible with the expectation that the experience yet to come will provide better Q-values to use in updates.

At a later point (t5) the agent takes an off-policy action. When sequence Y is eventually updated, it will use a λ-return estimate truncated at s_{t5}, the value of which will be recently updated following the experience in sequence Z and beyond. This is a significant improvement over Watkins' Q(λ), which will make no immediate use of the experience collected in sequence Z in updates to Y.

Figure 4.12: A sequence of experiences. s_{t2} is revisited at t3 and an off-policy action taken at s_{t5}. States in sequence X (including s_{t3}) will be updated before those in sequences W, Y or Z.
4.4.1 The Experience Stack
The algorithm maintains a stack of unreplayed experience sequences, es = hc1 ; c2 ; : : : ; ci i, ordered from the earliest sequence, c1 , to the most recent, ci , from the bottom to the top of the stack (see Figure 4.13). Each experience sequence consists of another stack of temporally successive stateactionreward triples, cj = h(st ; at ; rt+1 ); (st+1 ; at+1 ; rt+2 ); : : : ; (st+k ; at+k ; rt+k+1 )i: It is always the case that the earliest state in cj was observed as a successor to the most recent SAP in cj 1. Performing a push operation on an experience sequence records an experience and pop operations are used when replaying experience. The ESWatkinsreplay procedure, shown in Figure 4.14, is used to replay experience such that a new Qvalue estimate at sstop is obtained. The value of s0 provides the return correction for the most recent SAP in the stack. s0 must be the successor of the SAP found at top(top(es)) (i.e. the most recent SAP in the stack). A counter, B (s), records the number of times s appears in the experience stack in order to determine how many backups to sstop that experience replay can provide without having to search through the recorded experience. How experience is recorded and replayed is determined by the ESWatkinsupdate procedure. Like WatkinsQ(), it ensures that ESWatkinsreplay uses return estimates that are truncated at the point where an opolicy action is taken. Figure 4.13 shows the state of the stack after the experience described in Figure 4.12. It contains the experience sequences W , Y and Z from bottom to top (X has already been updated and removed). The ends of each experience sequence de ne when return truncations occur. For example, due to the exploratory action at t5, st5 starts a new experience sequence. Thus, the backup to st4 will use only rt5 + maxa Q^ (st5 ; a), but Q^ (st5 ; a) will be uptodate. Why doesn't sequence Y simply extend sequence W in Figure 4.13? 
(That is, why is the return truncated at end of sequence W ?) There is no requirement that the return estimate used to backup st2 1 involve the actual observed return immediately
Bias Prevention
[Figure 4.13 diagram: the experience stack, ordered bottom to top:
  W = c1 = ⟨(s_t1, a_t1, r_t1+1), (s_t1+1, a_t1+1, r_t1+2), …, (s_t2−1, a_t2−1, r_t2)⟩
  Y = c2 = ⟨(s_t3, a_t3, r_t3+1), …, (s_t4, a_t4, r_t5)⟩
  Z = c3 = ⟨(s_t5, a_t5, r_t5+1), …, (s_t6−1, a_t6−1, r_t6)⟩ (top)]
Figure 4.13: The state of the experience stack after the experience in Figure 4.12. The end of each row (or experience sequence) determines where return truncations occur. The rightmost states receive 1-step Q-learning backups.
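The nested stack-of-stacks bookkeeping shown in Figure 4.13 can be sketched in a few lines. The names here (es, record) are illustrative assumptions for this sketch, not code from the thesis:

```python
# Illustrative sketch of the nested-stack structure of Figure 4.13:
# the experience stack 'es' is a stack of sequences, and each sequence
# is itself a stack of (state, action, reward) triples.
es = []  # bottom ... top

def record(triple, new_sequence):
    """Push one experience triple, starting a new sequence when the return
    estimate must be truncated there (e.g. after an exploratory action)."""
    if new_sequence or not es:
        es.append([])          # begin a new sequence on top of the stack
    es[-1].append(triple)

# Build the first two sequences of Figure 4.13: W, then Y after a truncation.
record(("s_t1", "a_t1", "r_t1+1"), new_sequence=True)       # W = c1
record(("s_t1+1", "a_t1+1", "r_t1+2"), new_sequence=False)  # still in W
record(("s_t3", "a_t3", "r_t3+1"), new_sequence=True)       # Y = c2
```

Replaying then simply pops triples from the top sequence, and pops the next sequence when one is exhausted.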
following t2−1. Generally, if s_t+k = s_t, then the return including and following r_t+k is just as suitable. That is, if

    E[ r_t + Σ_{i=1}^{∞} γ^i r_{t+i} ]  =  E[ r_t + Σ_{i=1}^{∞} γ^i r_{t+i+k} ]        (4.11)

holds where s_t = s_t+k, then

    r_t + Σ_{i=1}^{∞} γ^i r_{t+i+k}        (4.12)

is clearly a suitable estimate of the return following s_t−1, a_t−1. Similar arguments apply for truncated n-step and λ-returns.
ES-Watkins-replay(s_stop, s′):
 1) while not empty(es):
 2)     z ← max_a Q̂(s′, a)                              {find initial return correction}
 3)     c ← pop(es)                                      {get most recent experience sequence}
 4)     while not empty(c):
 5)         ⟨s, a, r⟩ ← pop(c)                           {get most recent unreplayed experience}
 6)         z ← λ(r + γz) + (1 − λ)(r + γ max_a Q̂(s′, a))
 7)         Q̂(s, a) ← Q̂(s, a) + α_k [z − Q̂(s, a)]
 8)         B(s) ← B(s) − 1                              {decrement pending backups counter for s}
 9)         if s = s_stop and B(s) = 0:                  {performed the required backup?}
10)             if not empty(c):
11)                 push(es, c)                          {return unreplayed experiences to the stack}
12)             return
13)         s′ ← s                                       {the new Q̂(s, a) is used in the next backup}

ES-Watkins-update(s_t, a_t, r_t+1, s_t+1):
 1) if off-policy(s_t, a_t):                             {was the last action non-greedy?}
 2)     add-as-first ← true                              {truncates the return on off-policy actions}
 3) if empty(es) or add-as-first:                        {record new experience…}
 4)     c ← create-stack()                               {…in a new sequence}
 5) else:
 6)     c ← pop(es)                                      {…at the end of the most recent sequence}
 7) push(c, ⟨s_t, a_t, r_t+1⟩)
 8) push(es, c)
 9) add-as-first ← false
10) B(s_t) ← B(s_t) + 1                                  {increment pending backups counter for s_t}
11) if B(s_t+1) ≥ Bmax or terminal(s_t+1):
12)     ES-Watkins-replay(s_t+1, s_t+1)                  {replay to obtain a new Q-value at s_t+1}
13)     add-as-first ← true                              {truncates the return to prevent biasing}

Figure 4.14: The Experience Stack algorithm for off-policy evaluation of the greedy policy. A version that does not truncate the return after off-policy actions can be obtained by omitting lines 1 and 2 of ES-Watkins-update. This is later referred to as ES-PW, after Peng and Williams' Q(λ). The name add-as-first refers to a global variable; it should be set to false at the start of each episode.
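As a concrete illustration, the two procedures of Figure 4.14 might be written in Python as below. This is a minimal tabular sketch, not the thesis' code: the class name, the dictionary-based tables and the environment interface are my assumptions, and the γ placement in the λ-return recursion is my reading of the pseudocode.

```python
from collections import defaultdict

# Hedged tabular sketch of the Experience Stack algorithm (Figure 4.14).
class ExperienceStack:
    def __init__(self, actions, gamma, lam, b_max, alpha=0.1):
        self.actions = actions
        self.gamma, self.lam, self.b_max, self.alpha = gamma, lam, b_max, alpha
        self.Q = defaultdict(float)   # Q-hat, keyed by (state, action)
        self.es = []                  # stack of experience sequences
        self.B = defaultdict(int)     # pending-backups counter B(s)
        self.add_as_first = False     # reset to False at each episode start

    def max_q(self, s):
        return max(self.Q[(s, a)] for a in self.actions)

    def replay(self, s_stop, s_next):
        """ES-Watkins-replay: back up recorded experience, most recent first,
        until a new Q-value estimate at s_stop has been obtained."""
        while self.es:
            z = self.max_q(s_next)            # initial return correction
            c = self.es.pop()                 # most recent experience sequence
            while c:
                s, a, r = c.pop()             # most recent unreplayed triple
                z = (self.lam * (r + self.gamma * z)
                     + (1 - self.lam) * (r + self.gamma * self.max_q(s_next)))
                self.Q[(s, a)] += self.alpha * (z - self.Q[(s, a)])
                self.B[s] -= 1
                if s == s_stop and self.B[s] == 0:
                    if c:                     # return unreplayed experience
                        self.es.append(c)
                    return
                s_next = s                    # new Q(s, a) used in next backup

    def update(self, s, a, r, s_next, off_policy=False, terminal=False):
        """ES-Watkins-update: record one experience; replay when a state has
        b_max pending backups, or when the episode ends."""
        if off_policy:
            self.add_as_first = True          # truncate on non-greedy actions
        if not self.es or self.add_as_first:
            self.es.append([])                # record in a new sequence
        self.es[-1].append((s, a, r))
        self.add_as_first = False
        self.B[s] += 1
        if self.B[s_next] >= self.b_max or terminal:
            self.replay(s_next, s_next)
            self.add_as_first = True          # truncate to prevent biasing
```

For example, with γ = λ = 1, α = 1 and Bmax = 1, a two-step episode 0 → 1 → goal with reward 1 on the final transition leaves Q̂ = 1 at both visited state-action pairs after the terminal flush, via a single backwards replay.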
However, Condition 4.11 will usually not hold when applying the experience stack algorithm. For example, suppose that sequence X includes some unusually negative rewards. If the backups to the states in W were made using a return excluding the rewards in sequence X, then the Q-values in sequence W would become biased (by being over-optimistic). In order to prevent this biasing, the value of the state at which an experience replay ends is used to provide an estimate of the future return to all prior states in the stack. In the example, s_t2 must be updated to include the return in sequences Y and X. The backups to states prior to s_t2 should use a return truncated at s_t2. The algorithm achieves this simply by starting a new experience sequence at the top of the stack to indicate that a return truncation is required (step 13 of ES-Watkins-update).

Choice of Bmax. The parameter Bmax varies how many times a state may be revisited before a backup is made. Its choice is problem dependent. With Bmax = 1, backups are made on every revisit. If revisits occur often and at short intervals, then experience will be frequently replayed, which also causes the return estimate to be frequently truncated; an effect which is similar to lowering λ toward 0. This is in addition to the effect of the truncations that occur after taking off-policy actions. However, with higher values of Bmax the algorithm behaves more like an offline learning method, and exploration can benefit less frequently from up-to-date Q-values.
Flushing the Stack. Entering a terminal state, s_term, automatically causes the entire remaining contents of the experience stack to be replayed, since s_stop = s_term and s_term cannot occur in the experience stack (N.B. B(s_term) = 0 at all times). Otherwise, the stack can be flushed at any time by calling ES-Watkins-replay(s_now, s_term).
Computational Costs. Since each state may appear in the experience stack no more than Bmax times, the worst-case space complexity of maintaining the experience stack is O(|S| · Bmax). The total time complexity is O(|A|) per experience when averaged over the entire lifetime of the agent (as for Fast Q(λ)). The actual time cost per time step may vary greatly between steps.
Scope. This new technique can easily be adapted to use the return estimates employed by many other methods. For example, an analogue of Naive Watkins' Q(λ) can be made by omitting lines 1) and 2) from ES-Watkins-update. An analogue of TD(λ) can be made by replacing all occurrences of Q̂(x, y) with V̂(x), replacing step 6) of ES-Watkins-replay with,

    6) z ← λ(r + γz) + (1 − λ)(r + γ V̂(s′))

and omitting lines 1) and 2) from ES-Watkins-update. Analogues of SARSA(λ) and the importance sampling methods in [111] are equally easy to derive.
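The TD(λ) analogue described above amounts to swapping the bootstrap term max_a Q̂(s′, a) for V̂(s′) in the inner backup. A minimal sketch of that modified backup (the function name, dictionary value table and γ placement are illustrative assumptions):

```python
# Hypothetical sketch of the TD(lambda)-style replay backup: the return
# estimate z is carried backwards from the most recent experience to earlier
# ones, bootstrapping on the state-value table V instead of max-Q.
def td_backup(V, s, r, s_next, z, gamma, lam, alpha):
    """One backwards-replay backup on a state-value table V (a dict)."""
    z = lam * (r + gamma * z) + (1 - lam) * (r + gamma * V[s_next])
    V[s] += alpha * (z - V[s])
    return z  # passed on to the next (earlier) backup
```

Called over a popped sequence from most recent to earliest, this plays the role of steps 5-7 of ES-Watkins-replay for state values.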
Special Cases. With Bmax = ∞ the algorithm is fully offline, identical to Lin's backwards replay, and also only suitable for use in episodic tasks. In this case, if λ = 0 it exactly
implements 1-step Q-learning with backwards replay as used in [72, 76]. As noted in Section 3.4.8, acyclic tasks are special in that (non-backward-replaying) return estimating methods and batch eligibility trace methods are equivalent. With λ = 1 the experience stack method is also a member of this equivalence class. However, in (terminating) cyclic tasks with λ = 1, with sufficiently high Bmax to lead to purely offline learning, and where the learning rate is declined with 1/k(s, a), the method implements an every-visit Monte-Carlo algorithm. A first-visit method could be derived by skipping over backups to Q(s, a) where B(s, a) ≠ 1.²

Frequent Revisits. In some tasks, such as problems with state aliasing, a single state may be revisited for several consecutive steps. To prevent the method from using mainly 1-step returns, B(s, a) could be incremented only upon leaving a state. This would require that the same action be taken until the state is left, although this is often a benefit while learning with state-aliasing (as we will see in Chapter 7). In general, there may be better ways to affect when experience is replayed than with the Bmax parameter. If the purpose of making backups online is to aid exploration, then a better method might be to try to estimate the benefit of replaying experience to exploration when deciding whether to update a state.
A Note About Convergence. The open question remains whether this algorithm is guaranteed to converge upon the optimal Q-function. Intuitively it should, and under the same conditions as 1-step Q-learning, since in a sense the algorithms differ only slightly. Both methods approach Q* by estimating the expected return available under the greedy policy. For general MDPs, the expected update made by both methods appears to be a fixed point in Q̂ only where Q̂ = Q*. However, the convergence proof of 1-step Q-learning follows from establishing a form of equivalence to 1-step value iteration [59]. This relationship does not appear to follow directly for multi-step return estimates. Moreover, no convergence proof has been established for any control method with λ > 0 [145, 137].
Use with Function Approximators. For an RL algorithm to be of widespread practical use it must employ some form of generalisation in cases where the state-action space is large or non-discrete (e.g. continuous). Typically, this is achieved using a function approximator to store the Q-function. Although it has not been tested, it is clear that the experience stack method can be made to work with function approximators, as other forward-view implementations already exist [31]. A problem that might be encountered in an implementation is deciding when to replay experience since, unlike the table-lookup or state-aggregation cases, revisits to precisely the same state rarely occur. Several potential solutions to this exist, and it remains the subject of future research.
² Similar forward-view analogues of replace trace methods for Q-functions are also discussed by Cichosz [33].
4.5 Experimental Results

In this section, versions of the experience stack algorithm are compared against their Fast Q(λ) counterparts. Fast Q(λ) was chosen as it is in the same computational class as the experience stack algorithm, and so allows a thorough comparison with the various well-studied eligibility trace methods. Explicit comparisons with standard backwards replay are not made, but high values of Bmax provide an approximate comparison. A comparison with Replayed TTD(λ) was not performed; this algorithm is computationally more expensive.³ Also, a fair comparison in this case would allow the experience stack method to also replay the same experiences several times before removing them from the experience stack.

The algorithms were tested using the large maze shown in Figure 4.4 (p. 58). This task was chosen as it requires online learning to achieve good performance. Offline algorithms that cannot improve their exploration strategies online are expected to find the goal rarely in the early stages of learning. For Watkins' Q(λ) and PW Q(λ), three different eligibility trace styles are examined and compared against their on-policy or off-policy experience stack counterparts. ES-NWAT was used for comparison against PW Q(λ), since PW Q(λ) has no obvious forward-view analogue. Some comparisons were made against Naive Watkins' Q(λ), but this performed worse than PW Q(λ) in all cases; these results are omitted.

The learning rate is defined as α_k(s, a) = 1/k(s, a)^ω throughout, with ω = 0.5 in all cases except in Figure 4.23, which compares different values of ω. For the eligibility trace methods, k(s, a) records the number of times action a has been taken in state s. For the experience stack method, k(s, a) records the number of updates to Q̂(s, a). These different schemes are needed to provide a fair comparison and simply reflect the different times at which the algorithms apply return estimates in updates. In both cases, k(s, a) = 1 at the first update, and the learning rate declines on average at the same rate for each method.
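The per-state-action learning-rate schedule can be sketched in one line; α_k(s, a) = 1/k(s, a)^ω is my reading of the garbled definition (Equation 3.8), with ω = 0.5 as stated:

```python
# Sketch of the assumed learning-rate schedule alpha_k = 1 / k**omega,
# where k counts updates (or visits) to the state-action pair.
def alpha(k, omega=0.5):
    return 1.0 / k ** omega
```

Higher ω gives a faster-declining rate: for the fourth update, α = 0.5 with ω = 0.5 but α = 0.25 with ω = 1.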
Figures 4.15 to 4.22 below measure the performance of the algorithms along four varying parameter settings: the exploration rate (ε), λ, Bmax and the initial Q-function. The performance measures are the total cumulative reward collected by the agent after 200000 time steps and the final average mean squared error in its learned value function. Throughout learning, the ε-greedy exploration strategy was employed, and the results are broadly divided into two sections: high exploration levels (ε = 0.5) and high exploitation levels (ε = 0.1). The difference between Watkins' Q(λ), Naive Watkins' Q(λ) and PW Q(λ) is expected to be small for nearly greedy policies (where ε is low). Table 4.3 lists the abbreviations used throughout. Tables 4.4 and 4.6 provide an index to the experimental results in this section.

³ Computational cost was a big issue when running these experiments. Each 200000-step trial took approximately 10 minutes to complete on a Sun Ultra 5, and each graph point is the average of 15 trials. A conservative estimate of the total execution time consumed to produce Figures 4.15 to 4.22 is 2050 machine hours, or 12 machine weeks. In practice the experiment was made feasible by distributing the load over a cluster of 60 workstations, reducing the real-time cost to approximately 34 hours.
The Fast Q(λ) machine precision parameter was ε_m = 10⁻⁷ in all cases, and ε_opol = 10⁻⁴ throughout. Attention is drawn in the following sections to the ways in which the algorithms are affected by the different parameters.

The Effects of Q0. The most surprising result is that the initial Q-function, Q0, has such a counterintuitive effect on performance. The maze task has an optimal value function, V*, whose mean is approximately 68 and whose maximum and minimum values are 99.5 and 45.6 respectively. The standard rule of thumb when using ε-greedy (and many other exploration strategies) is to initialise the Q-function optimistically, to encourage the agent to take untried actions, or actions that lead to untried actions [150]. Yet overall, performance was generally worse with Q0 = 100 than when starting with a Q-function that has a higher initial error given by a pessimistic bias (Figures 4.15 and 4.16 show Q0 varying over a larger range than the other graphs). Subjectively, the best all-round performance in final cumulative reward and MSE was obtained with Q0 = 50 for all algorithms. It is possible that the reason for this is that the lower initial Q-values caused the agent to explore the environment less thoroughly and settle upon a more exploiting policy more quickly. Unlike the eligibility trace methods, the experience stack methods also still performed well with very low initial Q-functions (compare the cumulative reward collected with Q0 = 0 on all graphs). Section 4.7 presents a likely explanation of why optimistic initial Q-functions can be harmful to learning. Figure 4.24 shows an overlay of the different methods with a pessimistic initial Q-function. The experience stack methods outperform the eligibility trace methods in almost all cases except with high λ. The difference between the methods is even larger with lower Q0.
The Effects of λ. For Q0 < 100 the experience stack methods performed better than, or no worse than, their eligibility trace counterparts across the majority of parameter settings. In particular, they were less sensitive to λ, and achieved better performance with low λ as a result. A discussion of the reasons for this is given in Section 4.6. With Q0 = 100 the experience stack methods were most sensitive to λ and performed worse than their eligibility trace counterparts in many instances. The experience stack methods were also more sensitive to Bmax at this setting.
Abbreviation   Description
FastWAT-acc    Fast Watkins' Q(λ) with accumulating traces. The eligibility trace is zeroed following non-greedy actions, making this an exploration-insensitive method. Alternative suffixes of -srepl and -sarepl respectively denote state-replace and state-action-replace trace styles. Figure 4.1 shows the implemented algorithm. The various eligibility trace styles are introduced as equations (3.29), (3.32) and (3.31).
ESWAT-3        The Experience Stack algorithm in Figure 4.14 with Bmax = 3. Backups are made at the third state revisit. Non-greedy actions truncate the return estimate, so this is an exploration-insensitive method (as Watkins' Q(λ)).
FastPW-srepl   Fast Peng and Williams' Q(λ) with state-replacing traces (see Figure 4.1). This is an exploration-sensitive method.
ESNWAT-2       The Experience Stack algorithm in Figure 4.14 with Bmax = 2. Steps 1 and 2 of the ES-Watkins-update procedure are omitted, so that non-greedy actions do not truncate the return estimate. This is an exploration-sensitive method similar to Peng and Williams' Q(λ) and Naive Q(λ).

Table 4.3: Guide to the tested algorithms and abbreviations used.

Figure        Algorithm   ε     Exploration Level
Figure 4.15   ESWAT       0.5   High
Figure 4.16   FastWAT     0.5   High
Figure 4.17   ESNWAT      0.5   High
Figure 4.18   FastPW      0.5   High
Figure 4.19   ESWAT       0.1   Low
Figure 4.20   FastWAT     0.1   Low
Figure 4.21   ESNWAT      0.1   Low
Figure 4.22   FastPW      0.1   Low

Table 4.4: Experimental results showing varying λ and Q0.

Figure        Description
Figure 4.23   Effects of differing learning rate schedules.
Figure 4.24   Overlays of results with the best initial Q-function (Q0 = 50).
Figure 4.25   During-learning performance with optimised Q0 and λ.

Table 4.6: Other results.
[Figure 4.15 consists of six rows of plot pairs (Q0 = 100, 75, 50, 25, 0, −25), each showing Cumulative Reward and Mean Squared Error against λ for ESWAT-1, ESWAT-2, ESWAT-3, ESWAT-5, ESWAT-10 and ESWAT-50.]
Figure 4.15: Comparison of the effects of λ, Bmax and the initial Q-values on ES-Watkins with a high exploration rate (ε = 0.5). Results are for the end of learning, after 200000 steps in the Maze task. Performance becomes degraded at Q0 = 100, though less so with higher λ. Performance is less sensitive to λ compared to Watkins' Q(λ) (most plots are more horizontal than in Figure 4.16).
[Figure 4.16 consists of six rows of plot pairs (Q0 = 100, 75, 50, 25, 0, −25), each showing Cumulative Reward and Mean Squared Error against λ for FastWAT-srepl, FastWAT-sarepl and FastWAT-acc.]
Figure 4.16: Watkins' Q(λ) with a high exploration rate (ε = 0.5) after 200000 steps in the Maze task. As with ES-Watkins, performance also becomes degraded at Q0 = 100. Performance is more sensitive to λ, and also degrades more with low Q0, than ES-Watkins.
[Figure 4.17 consists of three rows of plot pairs (Q0 = 100, 50, 0), each showing Cumulative Reward and Mean Squared Error against λ for ESNWAT-1, ESNWAT-2, ESNWAT-3, ESNWAT-5, ESNWAT-10 and ESNWAT-50.]
Figure 4.17: Comparison of the effects of λ and Bmax on ES-NWAT in the Maze task with a high exploration rate (ε = 0.5).

[Figure 4.18 consists of three rows of plot pairs (Q0 = 100, 50, 0), each showing Cumulative Reward and Mean Squared Error against λ for FastPW-srepl, FastPW-sarepl and FastPW-acc.]
Figure 4.18: Comparison of the effects of λ, the trace type and the initial Q-values on Peng and Williams' Q(λ) in the Maze task with a high exploration rate (ε = 0.5).
[Figure 4.19 consists of three rows of plot pairs (Q0 = 100, 50, 0), each showing Cumulative Reward and Mean Squared Error against λ for ESWAT-1, ESWAT-2, ESWAT-3, ESWAT-5, ESWAT-10 and ESWAT-50.]
Figure 4.19: Comparison of the effects of λ, Bmax and the initial Q-values on ES-Watkins in the Maze task with a low exploration rate (ε = 0.1).

[Figure 4.20 consists of three rows of plot pairs (Q0 = 100, 50, 0), each showing Cumulative Reward and Mean Squared Error against λ for FastWAT-srepl, FastWAT-sarepl and FastWAT-acc.]
Figure 4.20: Comparison of the effects of λ, trace type and the initial Q-values on Watkins' Q(λ) in the Maze task with a low exploration rate (ε = 0.1).
[Figure 4.21 consists of three rows of plot pairs (Q0 = 100, 50, 0), each showing Cumulative Reward and Mean Squared Error against λ for ESNWAT-1, ESNWAT-2, ESNWAT-3, ESNWAT-5, ESNWAT-10 and ESNWAT-50.]
Figure 4.21: Comparison of the effects of λ, Bmax and the initial Q-values on ES-NWAT in the Maze task with a low exploration rate (ε = 0.1).

[Figure 4.22 consists of three rows of plot pairs (Q0 = 100, 50, 0), each showing Cumulative Reward and Mean Squared Error against λ for FastPW-srepl, FastPW-sarepl and FastPW-acc.]
Figure 4.22: Comparison of the effects of λ, trace type and the initial Q-values on Peng and Williams' Q(λ) in the Maze task with a low exploration rate (ε = 0.1).
[Figure 4.23 consists of two rows of plot pairs showing Cumulative Reward and Mean Squared Error against λ for FastWAT-srepl, FastWAT-sarepl, FastWAT-acc, ESWAT-1, ESWAT-10 and ESWAT-50. The top row uses Q0 = 50, ω = 0.3; the bottom row uses Q0 = 100, ω = 0.9.]
Figure 4.23: Comparison of the effects of the learning rate schedule on Watkins' Q(λ) and ES-Watkins. The top row presents favourable settings for ESWAT; the bottom row presents unfavourable settings. ε = 0.5 in both cases. Changes in ω had little effect on the relative performance of the algorithms. Results were similar for ES-NWAT and PW Q(λ).

The Effects of Bmax. The new Bmax parameter appeared to be relatively easy to tune in the maze task. With Q0 < 100, most settings of Bmax and λ provided improvements over the original eligibility trace algorithms. In general, Bmax caused the greatest spread in performance when Q0 was either very high or very low. For example, Bmax = 50 generally resulted in the poorest relative performance where Q0 = 100 and the best performance with pessimistic values (e.g. Q0 = 0). Intermediate values (Q0 = 50) gave the least sensitivity to Bmax as the high values of Bmax switch from providing relatively good to relatively poor performance. With Q0 = 100, Bmax = 1 produced a sharp drop in performance compared to slightly higher values (e.g. Bmax = 2 or Bmax = 3). A possible reason for this is that some states may be revisited extremely soon regardless of the exploration strategy, simply because the environment is stochastic. As a result, there is often little benefit to the exploration strategy in learning about these revisits. However, the likelihood of a state being quickly revisited by chance two, three or more times falls extremely rapidly with the increasing number of revisits. In such cases it is likely that the revisits occur as the result of poor exploration, in which case the exploration strategy may be improved as a result of making an immediate backup. Curiously, however, this phenomenon is not seen where Q0 < 100.
The Effects of Exploration. As expected, with low exploration levels Watkins' methods performed very similarly to Peng and Williams' methods (compare Figure 4.19 with 4.21, and Figure 4.20 with 4.22). However, the main motivation for developing the experience stack algorithm was to allow for efficient credit assignment and accurate prediction while still allowing exploratory actions
to be taken. With high exploration levels, both of the non-off-policy methods still generally outperformed Watkins' methods in terms of cumulative reward collected, but performed worse in terms of their final MSE. This is the effect of trading longer, untruncated return estimates (which allow temporal difference errors to affect more prior Q-values) for the theoretical soundness of the algorithms (by using rewards following off-policy actions in the return estimate). But the best overall improvements in the entire experiment were found by ESWAT at Q0 = 50. At this setting the algorithm outperformed (or performed no worse than) ES-PW, FastWAT and FastPW in terms of both cumulative reward and error, across the entire range of λ. This is a significant result, as it demonstrates that Watkins' Q(λ) has been improved upon to such an extent that it can outperform methods that don't truncate the return upon taking exploratory actions.
The Effects of the Learning Rate. In Figures 4.15 to 4.22 the learning rate was declined with each backup as in Equation 3.8, with ω = 0.5.⁴ By chance, this appeared to be a good choice for all of the methods tested. The best overall performance could be found in most settings with ω between 0.3 and 0.5 (see Figure 4.23). In work by Singh and Sutton [139], the best choice of learning rate has been shown to vary with λ. This was also found to be the case here. However, unlike in their experiments, here the learning rate schedule had little effect on the relative performances of the algorithms. Also, the work by Singh and Sutton aimed to compare replace and accumulate trace methods using a fixed learning rate. Several experiments were conducted here using a fixed learning rate. This also had little effect on the relative performances, with the exception that combinations of high λ and α caused the accumulate trace methods to behave very poorly in most instances. Section 3.4.9 in the previous chapter suggests why.
Optimised Parameters. Figure 4.25 compares the different methods with optimised Q0, λ and Bmax. In terms of cumulative reward performance, there is little difference between the methods. However, the experience stack methods are markedly more rapid at error reduction.
⁴ High values of ω provide the fastest-declining learning rate.
[Figure 4.24 shows two columns of plot pairs (Cumulative Reward, top; Mean Squared Error, bottom) against λ: off-policy methods with ε = 0.5 (FastWAT-srepl, FastWAT-sarepl, FastWAT-acc, ESWAT-1, ESWAT-3, ESWAT-10, ESWAT-50) and non-off-policy methods with ε = 0.1 (FastPW-srepl, FastPW-sarepl, FastPW-acc, ESNWAT-1, ESNWAT-3, ESNWAT-10, ESNWAT-50).]
Figure 4.24: Overlay of results at the end of learning after 200000 steps in the Maze task. Q0 = 50, ω = 0.5.

[Figure 4.25 shows two columns of plot pairs (Cumulative Reward, top; Mean Squared Error, bottom) against steps (0 to 200000): off-policy methods with ε = 0.5 (FastWAT-srepl, FastWAT-sarepl, FastWAT-acc, ESWAT-3) and non-off-policy methods with ε = 0.1 (FastPW-srepl, FastPW-sarepl, FastPW-acc, ESNWAT-1).]
Figure 4.25: Comparison of results during learning in the Maze task with optimised values of Q0 and λ. The experience stack algorithms provided little improvement in the reward collected, but gave far faster error reduction in the Q-function.
4.6 The Effects of λ on the Experience Stack Method

Why is the experience stack method often less sensitive to λ than the eligibility trace methods? The ES methods have two separate and complementary mechanisms for efficiently propagating credit to many prior states: λ-return estimates and backwards replay. The choice of λ determines the extent to which each mechanism is used.
When the value of λ is very low, the return estimate weighs observed rewards in the distant future very little (see Equation 3.37), and the ability to propagate credit to many states comes mainly from backwards replay. However, with very high λ the return estimate employs mainly observed rewards and very little of the stored Q-values. As a result, backwards replay makes little use of, and derives little advantage from, the newly updated values of successor states.

It might appear that there is little or even no learning benefit to using backwards replay instead of eligibility traces with very high values of λ since, at least superficially, the algorithms appear to be learning in a similar way (i.e. using mainly the return mechanism). In fact, one might expect the experience stack methods to actually perform worse in this instance since, when states are revisited, sections of the experience history are pruned from memory and are no longer backed up as they might be by an eligibility trace method. However, as explained in Section 4.4, replaying experience requires that additional truncations in the return be made. For the eligibility trace algorithms, frequently truncating the return (zeroing the eligibility trace) negates much of the benefit of using the return estimate, since the return looks to observed rewards only a few states into the future. However, return truncations may actually aid the backwards replay mechanism, since they mean that greater use is made of the recently updated Q-function.

Furthermore, with λ = 0 and Bmax = 1 it is reasonable to expect the experience stack methods to improve upon, or do no worse than, 1-step Q-learning in all cases. Given the same experiences, the algorithm makes the same updates as Q-learning, but in an order that is expected to employ a more recently informed value function. If it could be shown that Q-learning monotonically reduces the expected error in the Q-function with each backup, then a simple proof of this improvement would follow.
However, in general, in the initial stages of learning the Q-function error may actually increase (this was seen in Figure 4.25). Faster learning in this case may actually result in this initial error growing more rapidly.5 Notably though, the experience stack algorithms improved upon or performed no worse than the original algorithms in all of the above experiments where λ was low (λ = 0.1) and Bmax = 1. Performance was occasionally worse with high Bmax. Presumably this was the result of poor exploration caused through making infrequent updates to the Q-function. For similar reasons, it is reasonable to expect (but it is not proven) that the experience stack methods will improve upon or do no worse than the eligibility trace methods in acyclic environments for all values of λ. In this case the accumulate, replace and state-replace trace update methods are all equivalent, and the eligibility trace methods are known to be exactly equivalent to applying a forward view method in which the Q-function is fixed within the episode. Given the same experiences, the experience stack methods therefore make the same updates as the eligibility trace methods, except that each update may be based upon a more informed Q-function due to the backwards replay element. This is not an improvement new to the experience stack methods; the same applies for backwards replay when applied at the end of an episode. However, in this case the difficult issue of how to deal with state revisits does not occur. This is what the experience stack method solves. Finally, note that in the test environment, the eligibility trace methods performed best
5 This can also be seen where λ > 0 in Figure 4.25.
CHAPTER 4. EFFICIENT OFF-POLICY CONTROL
with the highest values of λ. Therefore, it is reasonable to anticipate larger differences in performance between the two approaches in environments where lower values of λ are best for eligibility trace methods. In this case, backwards replay methods look likely to provide stronger improvements, since the learned Q-values are more greatly utilised.
4.7 Initial Bias and the max Operator

All of the algorithms tested in Section 4.5 appeared to work better with non-optimistic initial Q-values. This may seem a counterintuitive result, since optimistic initial Q-values are generally thought to work well with greedy policies [150]. An obvious explanation for this is that higher initial Q-values could have caused the agent to explore the environment more, and for an unnecessarily long period, while with low initial Q-values the problems of local minima were avoided through using a semi-random exploration policy. This section explores an alternative explanation: independently of the effects of exploration, optimistic Q-values can make learning difficult. More specifically, RL algorithms that update their value estimates based upon a return estimate corrected with max_a Q(s, a) find it more difficult to overcome their initial biases if these biases are optimistic. To see that this is so, consider the example in Figure 4.26. Assume that all transitions yield a reward of 0. Some learning algorithm is applied that adjusts Q(s1, a1) towards E[max_a Q(s2, a)] (for simplicity assume γ = 1). If all Q-values are initialised optimistically, to 10 for example, then the Q-values of all actions in s2 must be readjusted (i.e. lowered towards zero) before Q(s1, a1) may be lowered. However, if the Q-values are initialised pessimistically by the same amount (to −10), then max_a Q(s2, a) is raised when the value of a single action in s2 is raised. In turn, Q(s1, a1) may then also be raised. In general, it is clear that it is easier for RL algorithms employing max_a Q(s, a) in their return estimates to raise their Q-value predictions than to lower them. In effect, the max operator causes a resistance to change in value updates that can inhibit learning. More intuitively, note that if the initial Q-function is optimistic, then the agent cannot strengthen good actions; it can only weaken poor ones. It is also clear that the effect of this is further compounded if: i) the Q-values in s2 are themselves based upon the over-optimistic values of their successors, ii) states have many actions available, and so many Q-values to adjust before max_a Q(s, a) may change, and iii) γ is
Figure 4.26: A simple process in which optimistic initial Q-values slow learning. Action a1 leads from s1 to s2, and Q(s1, a1) = Q(s2, a1) = ... = Q(s2, ak) = 10. Rewards are zero on all transitions.
high, and so state-values and Q-values are very dependent upon their successors' values. Although this idea is simple, it does not, to the best of my knowledge, appear in the existing RL literature.6 The most closely related work appears to be that of Thrun and Schwartz in [157]. They note that the max operator can cause a systematic overestimation of Q-values when lookup table representations are replaced by function approximators. Examples of methods that use max_a Q(s, a) in their return estimates are: value-iteration, prioritised sweeping, Q-learning, R-learning [132], Watkins' Q(λ) and Peng and Williams' Q(λ). Similar problems are also expected with "interval estimation" methods for determining error bounds in value estimates [62].7 Methods which are not expected to suffer in this way include TD(λ), SARSA(λ) and policy iteration (i.e. methods that evaluate fixed policies, not greedy ones).

4.7.1 Empirical Demonstration

Value-Iteration
The effect of initial bias on value-iteration was evaluated on several different processes with known models: the 2-way corridor in Figure 4.28, the small maze in Figure 4.7 and the large maze in Figure 4.4. In each experiment an initial value function, V0, was chosen with either an optimistic bias, V0^{A+}, or the same amount of pessimistic bias, V0^{A−}:

V0^{A+}(s) = V*(s) + bias,   (4.13)
V0^{A−}(s) = V*(s) − bias,   (4.14)

where "bias" is a positive number and V* is the known solution. This ensures that both the optimistic and pessimistic methods start the same maximum-norm distance from the desired value function. This setup is atypical, since V* is usually not known in advance, and it also provides value-iteration with some information about the initial policy. However, with knowledge of the reward function it is often possible to estimate the maximum and minimum values of V*. A second set of starting conditions was also tested:

V0^{B+}(s) = max_{s'} V*(s') + bias,   (4.15)
V0^{B−}(s) = min_{s'} V*(s') − bias.   (4.16)

Figure 4.27 compares these initial biasing methods. Table 4.7 shows the number of applications of update 2.21 to all states in the process required by value-iteration until V has converged upon V* to within some small degree of error. bias = 50 in all cases. In all tasks, the pessimistic initial bias ensured convergence in the fewest updates. With the corridor task, in the optimistic case, the number of sweeps until termination can be made arbitrarily high by making γ sufficiently close to 1. However, if all the estimates

6 Similar problems are known to occur with applied dynamic programming algorithms. Examples are continuously updating distance vector network routing algorithms (such as the Bellman-Ford algorithm) [108]. I thank Thomas Dietterich for pointing out the relationship.
7 I thank Leslie Kaelbling for pointing this out.
start below their lowest true value, then the number of sweeps never exceeds the length of the corridor since, in this deterministic problem, after each sweep at least one more state leading to the goal has a correct value.
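The sweep-count argument can be sketched numerically. The following is an illustrative stand-in for the test processes (a simpler one-way corridor with parameters of my own choosing, not the thesis's actual 2-way corridor): a pessimistic start converges within a number of sweeps bounded by the corridor length, while an optimistic start takes far longer.

```python
# States 0..N-1; stepping right from state N-1 reaches the goal with
# reward 1; all other rewards are 0; gamma = 0.9. The known optimal
# values let us start value-iteration a fixed bias above or below them.
N, gamma, tol = 19, 0.9, 1e-3
V_true = [gamma ** (N - 1 - s) for s in range(N)]

def sweeps_to_converge(bias):
    V = [v + bias for v in V_true]          # uniformly biased start
    sweeps = 0
    while max(abs(v - vt) for v, vt in zip(V, V_true)) >= tol:
        # One synchronous sweep of the Bellman optimality backup.
        V = [max(1.0 if s == N - 1 else gamma * V[s + 1],  # move right
                 gamma * V[max(s - 1, 0)])                 # move left
             for s in range(N)]
        sweeps += 1
    return sweeps

print(sweeps_to_converge(-50.0))  # pessimistic: one sweep per state
print(sweeps_to_converge(+50.0))  # optimistic: many more sweeps
```

With the pessimistic start, each sweep fixes at least one more state leading to the goal; with the optimistic start, the inflated values sustain each other through the max operator and the error decays only by roughly a factor of γ per sweep.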
Figure 4.27: Initial biasing methods.

Figure 4.28: A deterministic 2-way corridor over states s1, ..., s19, with left (al) and right (ar) actions; the figure shows terminal rewards of −1 at the left end and +1 at the right. On non-terminal transitions, rt = 0.

Task             V0^{A−}(s)   V0^{A+}(s)   V0^{B−}(s)   V0^{B+}(s)
2-Way Corridor       10.4        207.1        11.5        207.5
Small Maze           18.0        241.0        17.7        241.4
Large Maze           53.9         86.2        54.6        107.5

Table 4.7: Comparison of the effects of initial value bias on the required number of value-iteration sweeps over the state-space until the error in V̂ has become insignificant (max_s |V*(s) − V̂(s)| < 0.001). Results are the average of 30 independent trials.

Q-Learning
The effect of the initial bias on Q-learning is shown in Table 4.8. The Q-learning agents were allowed to roam the 2-way corridor and the small maze environments for 30 episodes. For the large maze, 200,000 time steps were allowed. The Q-functions for the agents were
initialised in a similar fashion to the value-iteration case but with an initial bias of 5. Throughout learning, random action selection was used to ensure that the learned Q-values could not affect the agent's experience. At the end of learning, the mean squared error in the learned value function, max_a Q̂(s, a), was measured. In all cases, the pessimistic initial bias provided the best performance.
Task             Q0^{A−}(s)   Q0^{A+}(s)   Q0^{B−}(s)   Q0^{B+}(s)
2-Way Corridor       1.0         20.0         19.3         20.6
Small Maze           1.2         22.1         18.9         24.9
Large Maze           3.1         12.4          7.4        323.0

Table 4.8: Comparison of the effects of initial Q-value bias on Q-learning. Values shown are the mean squared error, Σ_s (V*(s) − max_a Q̂(s, a))² / |S|, at the end of learning. Results are the average of 100 independent trials.

4.7.2 The Need for Optimism
The previous two sections have shown how optimistic initial Q-functions can inhibit reinforcement learning methods that employ max_a Q(s, a) in their return estimates. Independently of the effects of exploration, it has been demonstrated that convergence towards the optimal Q-function can be quicker if the initial Q-values are biased non-optimistically. However, this does not suggest that performance improvements can in general be obtained simply by making the initial Q-function less optimistic. The reason for this is that in practical RL settings agents must often manage the exploration/exploitation trade-off. A common feature of most successful exploration strategies is to introduce an optimistic initial bias into the Q-function and then follow a mainly exploiting strategy (i.e. mainly choose the action with the highest Q-value at each step). For example, ε-greedy exploration strategies assume optimistic Q-values for all untried state-action pairs [150, 175]. At each step the agent acts randomly with some small probability ε and chooses the greedy (i.e. exploiting) action with probability 1 − ε. More generally, the optimistic bias is introduced and propagated in the form of exploration bonuses as follows [85, 174, 167, 130, 175, 41]:

Q(s, a) ← E[ (r + b) + γ max_{a'} Q(s', a') ],   (4.17)

where the bonus, b, is a positive value that declines with the number of times a has been taken in s. The bonus should decline as less information remains to be gained about the effects of taking a in s on collecting reward. The effect of the bonuses is always to make the Q-values of actions over-optimistic until the environment is thoroughly explored. As a result, the idea that optimistic initial Q-values can actually be a hindrance to learning often comes as a counterintuitive idea to many researchers in RL.
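A minimal sketch of an exploration-bonus backup in the spirit of Equation 4.17 follows. The b0/n bonus schedule, the parameter values and the two-state layout are illustrative assumptions of mine, not taken from the text:

```python
# One state-action pair is backed up repeatedly; the bonus declines with
# its visit count, so the backed-up value is strongly optimistic at first
# and decays toward the true value (0 here) as the pair is explored.
def bonus_backup(Q, counts, s, a, r, s_next, gamma=0.9, alpha=0.5, b0=10.0):
    counts[(s, a)] = counts.get((s, a), 0) + 1
    b = b0 / counts[(s, a)]                    # declining exploration bonus
    target = (r + b) + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

Q = {0: {"x": 0.0}, 1: {"x": 0.0}}
counts = {}
bonus_backup(Q, counts, 0, "x", 0.0, 1)
early = Q[0]["x"]                              # inflated by the first bonus
for _ in range(99):
    bonus_backup(Q, counts, 0, "x", 0.0, 1)

print(early)      # 5.0: the first backup is dominated by the bonus
print(Q[0]["x"])  # small: the optimism decays as (0, "x") is explored
```

Note that the bonus makes the stored Q-value an overestimate of the true return for as long as the pair remains under-explored, which is exactly the optimism the section describes.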
4.7.3 Separating Value Predictions from Optimism
Because of the need for optimism for exploration, it is not clear that simply having less optimistic initial Q-functions will always help the agent to learn; this would simply reduce the amount of exploration that it does. However, better methods might be derived by separating out the optimistic bias that is introduced to encourage exploration from the actual Q-value estimates. For example, we may maintain independent Q-functions and bonus (or optimism) functions:

Q̂(s, a) ≈ E[ r + γ max_{a'} Q̂(s', a') ],   (4.18)
B(s, a) ≈ E[ (r + b) + γ max_{a'} B(s', a') ],   (4.19)

with the former being used for predictions and the latter for exploration. Regardless of the initial choice of Q̂, actions may still be chosen optimistically through careful choice of the exploration bonuses and the initial values of B. For example, the agent might act in order to maximise B(s, a), or even max(max_a Q̂(s, a), max_a B(s, a)) or Q̂(s, a) + B(s, a). The Q-function can now be initialised non-optimistically, thus allowing an accurate Q-function to be learned more quickly, as seen in Section 4.7.1. Previous work has separated value estimates for return prediction from the values used to guide exploration (e.g. [85]). However, here we see for the first time that, through knowing how optimistic initial value functions cause inefficient learning, a better initial value function choice may be made, and so more accurate value estimates may be achieved more quickly.

Example. Figure 4.29 compares two algorithms that share identical exploration strategies. Qopt is a regular Q-learning agent that explores using the ε-greedy strategy with ε = 0.01 and Q0 = 100. Qpess makes the same Q-learning backups as Qopt, and also,

B(st, at) ← B(st, at) + α[ rt+1 + γ max_{a'} B(st+1, a') − B(st, at) ].   (4.20)
Figure 4.29: The effect of initial bias on two Q-learning-like algorithms on the large maze task. Both methods follow identical exploration policies. The Qpess method distinguishes between optimism for exploration and real Q-value predictions (by maintaining a separate function, B, that is updated using the Q-learning update) and starts with a pessimistic Q-function. The vertical axis measures the mean squared error in the learned Q-function (as in Table 4.8).
B0 = 100 = Q0^{opt}, so that Qpess may follow an exploration strategy equivalent to the Q-learner's by choosing arg max_a B(s, a) with probability 1 − ε at each step. However, the Qpess method also maintains and updates a Q-function using exactly the same update as Qopt, although differently initialised. The different Q-functions are initialised to have the same size of error from Q*. For Qopt, the error gives an optimistic Q0, and for Qpess, it is chosen to give a pessimistic one. In this case, separating optimism from exploration has allowed the optimal Q-function to be approached much more quickly without affecting exploration at all. Still faster convergence can be found with Qpess by choosing a higher Q0.

4.7.4 Discussion
Distinguishing value predictions from optimism generally seems like a good idea, as we can now deal with these two conceptually different quantities separately (and it adds little to the overall computational complexity of the algorithms). We can now also make explicit separations between exploration and exploitation: at any time we can decide to stop exploring completely and exploit the best policy we currently have. For example, in gambling or financial trading problems we might wish to learn about the relative return available for making bets or trading shares by initially exploring the problem with a small amount of capital. Later, if we decided to play the game for real and bet the farm for the expected return indicated by the learned Q-values, we might be extremely disappointed to find that this return was in fact a gross overestimate. There are also other applications for which accurate Q-values are needed, but in which exploration is still required. An example is deciding whether or not (or where) to refine the agent's internal Q-function representation. This can be done based upon the differences of Q-values in adjacent parts of the space [117, 28]. In different RL frameworks, the agent may be learning to improve several independent policies that maximise several separate reward functions [57]. Deciding which policy to follow at any time is done based upon the Q-values of the actions in each policy. Finally, note that the goal of most existing exploration methods is only to maximise the return that the agent can collect over its lifetime, and not to find accurate Q-functions (in fact, some exploration methods fail to find accurate Q-functions but still find policies that are almost optimal in the reward they can collect). Could adapting exploration methods to distinguish between optimism and value prediction still help to maximise the return that the agent collects?
Intuitively, the answer is yes, since finding accurate Q-values more quickly should allow the agent to better predict the relative value of exploiting instead of exploring. However, this may only apply to model-free RL methods. For model-learning methods the advantages of separating return predictions from optimistic biases are far fewer. At any time, these methods may calculate the Q-function unbiased by exploration bonuses and so generate a purely exploiting policy. This can be done by discarding the exploration bonuses (i.e. removing b in Equation 4.17) and solving the Q-function under the assumption that the learned model is correct. However, as we have seen, model-based methods that solve the Q-function using, for example, value-iteration can be made greatly more computationally efficient if the Q-function is initialised non-optimistically.
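The dual Q/B scheme of this section can be sketched as follows. This is an illustrative toy in the spirit of the Qpess example (the corridor environment, parameter values and random seed are my own assumptions, not the thesis's experimental setup): a pessimistically initialised Q-function receives exactly the same backups as an optimistically initialised B-function, while behaviour is ε-greedy with respect to B.

```python
import random

# Q predicts returns (pessimistic start); B drives exploration
# (optimistic start). Both receive the same 1-step Q-learning backup.
random.seed(0)
N, gamma, alpha, eps = 5, 0.9, 0.5, 0.01
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(s, a):
    if a == 1 and s == N - 1:
        return 1.0, None                     # goal: reward 1, terminal
    return 0.0, min(max(s + (1 if a == 1 else -1), 0), N - 1)

Q = [[-100.0, -100.0] for _ in range(N)]     # pessimistic predictions
B = [[100.0, 100.0] for _ in range(N)]       # optimism guides behaviour

for _ in range(200):                         # episodes
    s = 0
    while s is not None:
        if random.random() < eps:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda x: B[s][x])  # greedy w.r.t. B
        r, s2 = step(s, a)
        for F in (Q, B):                     # identical backups
            target = r + (gamma * max(F[s2]) if s2 is not None else 0.0)
            F[s][a] += alpha * (target - F[s][a])
        s = s2

print(Q[N - 1][1])  # approaches the true value, 1.0
```

Because exploration is driven entirely by B, the choice of Q's initial values cannot affect the experience gathered, yet Q converges on accurate predictions without ever being optimistic.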
4.7.5 Initial Bias and Backwards Replay.
Why was the worst overall performance by the experience stack algorithms where the initial Q-function was optimistic and λ was low? (See Q0 = 100 in Figures 4.15, 4.17, 4.19 and 4.21.) Consider the example experience in Figure 4.30 and, as before, assume that γ = 1 and r = 0 on all transitions. States st1 and st2 are so far unvisited, but st3 has been frequently visited and its true value is now known (for this example, it is only important that max_a Q̂(st2, a) > max_a Q̂(st3, a)). If λ = 0 and backwards replay is employed, although Q(st2, at2) may be lowered, this adjustment will not immediately reach st1 since max_a Q(st2, a) does not change. Thus the benefit of using backwards replay in this situation is destroyed by the combination of the optimistic Q-values at st2 and using a single-step return (although this is no worse than single-step Q-learning). However, as λ grows and max_a Q(s, a) weighs less in the return estimates compared to the actual reward, more significant adjustments to Q(st1, at1) will follow (this is true of both backwards replay and the eligibility trace methods). However, as noted in Section 4.6, there may be little benefit to using the experience stack algorithm with high λ since SAPs are removed from the experience history after they are updated. It was argued that the additional return truncations this causes may actually aid backwards replay and offset this problem; yet it has been shown here that truncated returns can cause backwards replay to be markedly less effective if Q0 is optimistic. Notably, the experience stack algorithms performed much worse than the original algorithms in the above experiments only where λ is high and the Q-function is optimistic. This is contrary to the existing rules of thumb for choosing good parameter settings and resulted in substantial initial difficulties in demonstrating any good performance with the experience stack algorithm. The true nature of the method only became clear when examining different Q0.
There appears to be no previous experimental work in the literature that compares algorithms using different Q0. In the experiments in Figures 4.15-4.22 in Section 4.5, in almost all cases where Q0 < 100 and λ < 0.9, each experience stack method outperforms its eligibility trace counterpart, with the exception of a few cases with very high Bmax. We also see that the experience stack methods are much more robust to the choice of Q0 than the trace methods, except for Q0 = 100. Can this problem be avoided by using the method of separating exploration bonuses from predictions discussed in Section 4.7.3? Note that for the off-policy results in Figures 4.15 and 4.16, by optimising Q0 such that cumulative reward is maximised (B0 = 25 in the dual learning method), the experience stack method looks better than any result obtained by Watkins' Q(λ). However, at this setting the error performance is poor. It is possible to speculate that this could be avoided by choosing Q0 = 75 as the Q-function used to generate predictions in the same experiment. However, since the error also depends upon the given experience (which depends upon B0), to perform a fair comparison one would need to run a series of experiments in which Q0 and B0 are varied, to determine where it is possible to provide better cumulative reward and error than Watkins' Q(λ). These experiments have not been performed.
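The blocking effect that Figure 4.30 illustrates can be sketched directly (the state names and values follow the figure; γ = 1, α = 1 and the action labels are illustrative):

```python
# With optimistic values, lowering one Q-value in a state does not change
# max_a Q(s, a), so a 1-step backup (lambda = 0) cannot pass the change
# further back, even when the backups are replayed backwards.
Q = {"st1": {"a1": 10.0, "a2": 10.0},
     "st2": {"a1": 10.0, "a2": 10.0},
     "st3": {"a1": 0.0, "a2": 0.0}}   # st3's true value is already known

# Backwards replay of (st2 --a1--> st3) and then (st1 --a1--> st2):
Q["st2"]["a1"] = 0.0 + max(Q["st3"].values())  # lowered from 10 to 0
Q["st1"]["a1"] = 0.0 + max(Q["st2"].values())  # max over st2 is still 10

print(Q["st2"]["a1"])  # 0.0
print(Q["st1"]["a1"])  # 10.0: the optimistic bias blocks propagation
```

Had the values been pessimistic, raising Q(st2, a1) would have raised max_a Q(st2, a) immediately, and the second backup would have passed the change on to st1.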
Figure 4.30: A sequence of experience in a process similar to the one in Figure 4.26. Q-values before the experience are labelled above the actions (10 for the actions at st1 and st2, 0 at st3). Single-step backwards replay (λ = 0) performs poorly here. Algorithms that use multi-step return estimates (λ > 0) are less affected by the initial bias than single-step methods.

4.7.6 Initial Bias and SARSA(λ)
In a comparison of different eligibility trace schemes by Rummery in [128], SARSA(λ) was shown to outperform other versions of Q(λ) in terms of policy performance. The algorithms were tested under a semi-greedy exploration policy, and so it is reasonable to assume that an optimistic initial Q-function was employed. In this scenario, and in the light of the above results, it seems likely that SARSA(λ) would suffer less than Peng and Williams' Q(λ) and Watkins' Q(λ), since it does not explicitly employ the max operator. Performing rigorous comparisons of these methods is difficult since the exploration method used strongly affects how the methods differ: under a purely greedy policy, Peng and Williams' Q(λ), Watkins' Q(λ) and SARSA(λ) are very similar methods. Such a comparison should also take into account the accuracy of the learned Q-function. In this respect, it is straightforward to construct situations in which SARSA(λ) performs extremely poorly while following a non-greedy policy.
4.8 Summary

Over the history of RL an elegant taxonomy has emerged that differentiates RL techniques by the return estimates they learn from. While eligibility trace methods are a well established and important RL tool that can learn the expectation of a variety of return estimates, the traces themselves make understanding and analysing these methods difficult. This is especially true of efficient (but more complex) techniques for implementing traces, such as Fast-Q(λ). In Section 3.4.8 we saw that the need for eligibility traces arises only from the need for online learning; simpler and naturally efficient alternatives exist if the environment is acyclic or if it is acceptable to learn offline. In Section 3.4.8 we also saw that (at least for accumulate trace variants) eligibility trace methods don't closely approximate their forward view counterparts, and can suffer from higher variance in their learned estimates as a result. This led to the idea that the forward view methods, which directly learn from return estimates, might be preferable if they could be applied online. In addition, with forward-view methods it is straightforward and natural to apply backwards replay to derive additional efficiency gains at no additional computational cost, although it is less obvious how to learn online.
[Figure 4.31 content: a grid over λ (from 0 to 1) and the initial Q-function Q_{t=0} (from pessimistic to optimistic), split by whether online learning is needed or offline learning is possible / the process is acyclic, with regions marked +, −, ?+ and ?−.]
Figure 4.31: Improvement space for experience stack vs. eligibility trace control methods. + denotes that the analysis suggests that the learning speed of a backwards replay method is expected to be as good as or better than for the related eligibility trace method. ?+ and ?− denote that the analysis was inconclusive but the experimental results were positive or negative, respectively.

We have seen how backwards replay can be made to work effectively and efficiently online by postponing updates until the updated values are actually needed. This technique can be adapted to use most forms of truncated return estimate. Analogues of TD(λ) [148], SARSA(λ) [128] and the new importance sampling eligibility trace methods of Precup [111] are easily derived. In general, the method is as computationally cheap as the fastest way of implementing eligibility traces but is much simpler due to its direct application of the return estimates when making backups. As a result, it is expected that further analysis and proofs about the online behaviour of the algorithm will follow more easily than for the related eligibility trace methods. The focus in this chapter was to find an effective control method that doesn't suffer from the "short-sightedness" of Watkins' Q(λ) and also doesn't suffer from unsoundness under continuing exploration (i.e. as can occur with Peng and Williams' Q(λ) or SARSA(λ)). When should the experience stack method be employed? The experimental results have shown that, at least in some cases, using backwards replay online can provide faster learning and faster convergence of the Q-function than the trace methods. Improvements in all cases in all problem domains are not expected (nor was this found in the experiments). However, the experimental results (supported by additional analyses) have led to a characterisation of its performance that is shown in Figure 4.31.
In summary:

- Expect little benefit from using online backwards replay compared to eligibility trace methods with values of λ close to 1.
- With low (and possibly intermediate) values of λ, always expect performance improvements (or at least no performance degradation).
- Expect variants employing the max operator in their estimate of return (e.g. ESWAT and ESNWAT) to work poorly with high initial Q-values.
- Expect the algorithm to always provide improvements in acyclic tasks, except where λ = 1 (i.e. non-bootstrapping), where it performs identical overall updates to the existing trace or Monte Carlo methods.

In addition, the strong effect of the initial Q-function has been highlighted as having a major effect upon the learning speed of several reinforcement learning algorithms. Previously, even in work examining the effects of initial bias or λ, this has not been considered to be an important factor affecting the relative performance of algorithms, and it is often omitted from the experimental method [171, 151, 139, 106, 150, 31]. The findings here suggest that it can be at least as important to optimise Q0 as it is to optimise α and λ, and that the choice of Q0 affects different methods in different ways.
Chapter 5

Function Approximation

Chapter Outline

This chapter reviews standard function approximation techniques used to represent value functions and Q-functions in large or non-discrete state-spaces. The interaction between bootstrapping reinforcement learning methods and the function approximators' update rules is also reviewed. A new general but weak theorem shows that general discounted return estimating reinforcement learning algorithms cannot diverge to infinity when a form of "linear" function approximator is used for approximating the value function or Q-function. The results are significant insofar as examples of divergence of the value function exist where similar linear function approximators are trained using a similar incremental gradient descent rule. A different "gradient descent" error criterion is used to produce a training rule which has a non-expansion property and therefore cannot possibly diverge. This training rule is already used for reinforcement learning.
5.1 Introduction

So far, all of the reinforcement learning methods discussed have assumed small, discrete state and action spaces, such that it is feasible to exactly store each Q-value in a table. What then, if the environment has thousands or millions of state-action pairs? As the size of the state-action space increases, so does the cost of gathering experience in each state, and also the difficulty of using it to accurately update so many table entries. Moreover, if the state or action spaces have continuous dimensions, and so there is an infinite number of states, then representing each state or action value in a table is no longer possible. Therefore, in large or infinite spaces, the problem faced by a reinforcement learning agent is one of generalisation. Given a limited amount of experience within a subset of the environment,
how can useful inferences be made about the parts of the environment not visited? Reinforcement learning turns to techniques more commonly used for supervised learning. Supervised learning tackles the problem of inferring a function from a set of input-output examples, or how to predict the desired output for a given input. More generally, the technique of learning an input-output mapping can be described as function approximation. This chapter examines the use of function approximators for representing value functions and Q-functions in continuous state-spaces. The general problem being solved still remains one of learning to predict expected returns from observed rewards (a reinforcement learning problem). However, in this context, the function approximation and generalisation problems are harder than they would be in a supervised learning setting, since the training data (the set of input-output examples) cannot be known in advance. In fact, in the majority of cases, the training data is determined in part by the output of the learned function. This causes some severe difficulties in the analysis of RL algorithms, and in many cases, methods can become unstable. Sections 5.2-5.5 review common methods for function approximation in reinforcement learning. Linear methods are focused upon as they have been particularly well studied by RL researchers from a theoretical standpoint, and have also had a moderate amount of practical success. Section 5.5 examines the bootstrapping problem, which is the source of instability when combining function approximation with reinforcement learning. Section 5.7 introduces the linear averager scheme, which differs from more common linear schemes only in the measure of error being minimised. However, also in this section, a new proof establishes the stability of this method with all discounted return estimating reinforcement learning algorithms by demonstrating their boundedness. Section 5.8 concludes.
5.2 Example Scenario and Solution

Suppose that our reinforcement learning problem was to control the car shown in Figure 5.1. The task is to drive the car to the top of the hill in the shortest possible time [149, 150]. Rewards are −1 on all time steps and the value of the terminal state is zero. The state of the system consists of the car's position along the hill and its velocity. There are just two actions available to the agent: to accelerate or decelerate (reverse). Suppose also that we wish to represent the value function for this space (see Figure 5.1). We must represent a function of two continuous valued inputs (a position and velocity vector). One of the easiest ways to represent a function in a continuous space is to populate the space with a set of data instances at different states, {(v1, s1), ..., (vn, sn)}. Roughly, each instance is a "prototype" of the function's output at that state (i.e. V(si) ≈ vi) [159]. If we require a value estimate at some arbitrary query state, q, then we can take an average of the values of nearby instances, possibly weighting nearby instances more heavily in the output. To do this requires that we define some distance metric, d(s, q) = distance between s and q, which quantitatively specifies "nearby". For instance, we might use the Euclidean distance
Figure 5.1: (left) The Mountain Car task. (right) An inverted value function for the Mountain Car task showing the estimated value (steps-to-goal) of a state. This figure shows a learned function using a method presented in a later section; the true function is much smoother, but still includes the major discontinuity between where it is possible to get to the goal directly and where the car must reverse away from the goal to gain extra momentum.
k X
dp (s; q) = @
j =1
11=p
jsj qj jpA
for kdimensional vectors s and q. There are also dierent schemes we might use to decide how nearby instances are combined to produce the output: Nearest Neighbour
The output is simply the instance nearest to the query point: V ( q ) = i
96 where,
CHAPTER 5.
i = arg
FUNCTION APPROXIMATION
min d(sj ; q) with ties broken arbitrarily. Although computationally relatively fast, a disadvantage of this approach is that the resulting value function will be discontinuous between neighbourhoods. j 2[1::n]
Kernel Based Averaging
In order to produce a smoother (and better fitting) output function, the values of many instances can be averaged together, but with nearby instances weighted more heavily in the output than those further away. How heavily the instances are weighted in the average is controlled by a weighting kernel (or smoothing kernel) which indicates how relevant each instance is in predicting the output for the query point. For instance, we might use a Gaussian kernel:

    K(s, q) = e^{-d(s, q)^2 / (2\sigma^2)},

where the parameter \sigma controls the kernel width. Other possibilities exist; the main criterion for a kernel is that its output is at a maximum at its centre and declines to zero with increasing distance from it. The weights for a weighted average can now be found by normalising the kernel and an output found:

    V(q) = \frac{\sum_i^n K(s_i, q) v_i}{\sum_i^n K(s_i, q)}

Atkeson, Moore and Schaal provide an excellent discussion of this form of locally weighted representation in [8] and [1].
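The scheme above can be sketched in a few lines of code. This is an illustrative implementation only: the one-dimensional state, the prototype values and the kernel width are hypothetical choices, not taken from the text.

```python
import math

def gaussian_kernel(s, q, sigma):
    """Weighting kernel: maximal at d(s, q) = 0, decaying to zero with distance."""
    d = abs(s - q)  # Euclidean distance for a one-dimensional state
    return math.exp(-(d * d) / (2.0 * sigma * sigma))

def kernel_average(instances, q, sigma=0.1):
    """Estimate V(q) as the normalised kernel-weighted average of instance values.

    `instances` is a list of (s_i, v_i) prototype pairs."""
    weights = [gaussian_kernel(s, q, sigma) for s, _ in instances]
    total = sum(weights)
    return sum(w * v for w, (_, v) in zip(weights, instances)) / total

instances = [(0.0, 1.0), (0.5, 2.0), (1.0, 3.0)]
estimate = kernel_average(instances, 0.25)  # a query between the first two prototypes
```

With a narrow kernel the estimate at a prototype approaches that prototype's value; with a broad kernel it blends in more distant instances, which is exactly the smoothing trade-off discussed above.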
5.3 The Parameter Estimation Framework

The most pervasive and general class of function approximators are the parameter estimation methods. Here, the approximated function is represented by

    f(\phi(s), \vec{\theta}),

where f is some output function, \phi is an input mapping which returns a feature vector,

    \phi(s) = \vec{x} = [x_1, \ldots, x_n],

and \vec{\theta} is a parameter vector (or weights vector),

    \vec{\theta}^T = [\theta_1, \ldots, \theta_m],

a set of adjustable parameters. The problem solved by supervised learning and statistical regression techniques is how to find a \vec{\theta} that minimises some measure of the error in the output of f, given some set of training data, \{(s_1, z_1), \ldots, (s_j, z_j)\}, where z_p (p \in \{1, \ldots, j\}) represents the desired output of f for an input \phi(s_p). The training data is generally assumed to be noisy.
Figure 5.2: Parameter estimation function approximation.

5.3.1 Representing Return Estimate Functions
Concretely, for reinforcement learning, if we are interested in approximating a value function, then we have,

    \hat{V}(s) = f(\phi(s), \vec{\theta})

and say that f(\phi(\cdot), \vec{\theta}) is the function which approximates \hat{V}(\cdot). In the case where a Q-function approximation is required, we might have,

    \hat{Q}(s, a) = f(\phi(s), \vec{\theta}(a))

in which case there is approximation only in the state space and a different set of parameters is maintained for each available action. Alternatively,

    \hat{Q}(s, a) = f(\phi(s, a), \vec{\theta})

in which case there is approximation in both the state and action space. This formulation is more suitable for use with large or non-discrete action spaces [131].

5.3.2 Taxonomy
Examples of methods which fit this parameter estimation framework are "non-linear" methods such as multi-layer perceptrons (MLPs). Although these non-linear methods have had some striking successes in practical applications of RL (e.g. [155, 36, 177]), there is little or no practical theory about their behaviour, other than counterexamples showing how they can become unstable and diverge when used in combination with RL methods [24, 159]. A much stronger body of theory exists for linear function approximators. Examples include: i) Linear Least Mean Square methods such as the CMAC [163, 149] and Radial Basis Function (RBF) methods [131, 150]. Here the goal is to find an optimal set of parameters that happens to minimise some measure of error between the output function and the training data. As in an MLP, the learned parameters may have no real meaning outside of the
function approximator. ii) Averagers. Here the learned values of parameters may have an easily understandable meaning. For example, the parameters may represent the values of prototype states as in Section 5.2. These methods can be shown to be more stable under a wider range of conditions ([159, 49]). iii) State-aggregation methods, where the state-space is partitioned into non-overlapping sets. Each set represents a state in some smaller state-space to which standard RL methods can directly be applied. iv) Table lookup, which is a special case of state-aggregation.
5.4 Linear Methods (Perceptrons)

All linear methods produce their output from a weighted sum of the inputs. For example:

    f(\vec{x}, \vec{\theta}) = \sum_i^n x_i \theta_i = \vec{x} \cdot \vec{\theta}    (5.1)
We assume that there are as many components in \vec{\theta} as there are in \vec{x}. The reason that this is called a linear function is because the output is formed from a linear combination of the inputs:

    \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_n x_n

and not some non-linear combination. Alternatively, we might note that Equation 5.1 is linear because it represents the equation of a hyperplane. This might appear to limit function approximators that employ linear output functions to representing only planar functions. Happily, through careful choice of \phi this need not be the case. In fact, we can see that the nearest neighbour and kernel based averaging methods are linear function approximators where \phi is defined as:

    \phi(s_q)_i = \frac{K(s_q, s_i)}{\sum_k^n K(s_q, s_k)}.    (5.2)

5.4.1 Incremental Gradient Descent
Incremental gradient descent is a training rule for modifying the parameter vector based upon a stream of training examples [127, 14]. Alternative, batch update versions are possible which make an update based upon the entire training set (see [127]) and are computationally more efficient. However, non-incremental function approximation is not generally suitable for use with RL since the training data (return estimates) are not generally available a priori but gathered online. The way in which they are gathered usually depends upon the state of the function approximator during learning; most exploration schemes and all bootstrapping estimates of return rely upon the current value-function or Q-function. The basic idea of gradient descent is to consider how the error in f varies with respect to \vec{\theta} (for some training example (\vec{x}_p, z_p)), and modify \vec{\theta} in the direction which reduces the error:

    \Delta\vec{\theta} = -\alpha \frac{\partial E_p}{\partial \vec{\theta}}    (5.3)
for some error function E_p and step size \alpha. Concretely, in the case of the linear output function, if we define the error function as:

    E_p = \frac{1}{2} \left( z_p - f(\vec{x}_p, \vec{\theta}) \right)^2    (5.4)

then:

    \Delta\vec{\theta} = \alpha \left( z_p - f(\vec{x}_p, \vec{\theta}) \right) \vec{x}_p

Each parameter is adjusted as follows:

    \theta_i \leftarrow \theta_i + \alpha_{ip} \left( z_p - f(\vec{x}_p, \vec{\theta}) \right) x_{ip}    (5.5)

or,

    \theta_i \leftarrow \theta_i + \alpha_{ip} ( \text{desired output}_p - \text{actual output}_p ) ( \text{contribution of } \theta_i \text{ to the output} ),

where \alpha_{ip} is the learning rate for parameter \theta_i at the p-th training pair (\vec{x}_p, z_p). Update 5.5 (due to Widrow and Hoff, [166]) is known as the Delta Rule or the Least Mean Square Rule and can be shown to find a local minimum of,

    \sum_p \frac{1}{2} \left( z_p - f(\phi(s_p), \vec{\theta}) \right)^2,
under the standard (Robbins-Monro) conditions for convergence of stochastic approximation: \sum_{p=1}^{\infty} \alpha_{ip} = \infty and \sum_{p=1}^{\infty} \alpha_{ip}^2 < \infty (which also implies that all weights are updated infinitely often) [21, 127, 11]. Different error criteria yield different update rules; another is examined later in this chapter. There is a close relationship between update 5.5 and the update rules used by the eligibility trace methods in Chapter 3 (which find the LMS error in a set of return estimates). Here x_i represents the size of parameter \theta_i's contribution to the function output. With x_i = 0, \theta_i has no contribution to the output and so is ineligible for change. Finally, with the exception of some special cases, the learned parameters themselves may have no meaning outside of the function approximator. There is (typically) no sense in which a parameter could be considered by itself to be a prediction of the output. The set of parameters found is simply that which happens to minimise the error criterion. Throughout the rest of this chapter, the method presented here is referred to as the linear least mean square method, to differentiate it from methods that learn using other cost metrics.
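As a concrete illustration, the delta rule above can be written directly as code. This is a minimal sketch, assuming a plain Python list for \vec{\theta} and a fixed global learning rate; the two training patterns and their targets are made up for the example.

```python
def lms_update(theta, x, z, alpha):
    """Delta rule (update 5.5): theta_i += alpha * (z - f(x, theta)) * x_i."""
    prediction = sum(t * xi for t, xi in zip(theta, x))  # linear output f(x, theta)
    error = z - prediction
    return [t + alpha * error * xi for t, xi in zip(theta, x)]

# Two orthogonal training patterns with targets +1 and -1.
theta = [0.0, 0.0]
for _ in range(50):
    theta = lms_update(theta, [1.0, 0.0], 1.0, 0.5)
    theta = lms_update(theta, [0.0, 1.0], -1.0, 0.5)
# theta converges toward [1.0, -1.0], the zero-error solution.
```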
5.4.2 Step Size Normalisation

Finding a sensible range of values for \alpha in update 5.5 that allows for effective learning is more difficult than with the Running-Average update rule used by the temporal difference
learning algorithms in the previous chapters. Previously, choosing \alpha = 1 resulted in a full step to the new training estimate. That is to say that after training, the learned function exactly predicts the last training example when presented with the last input. Higher values can result in strictly increasing the error with the training value. Smaller values result in smaller steps toward the training value, mixing it with an average of many of the previously presented training values. No learning occurs with \alpha = 0. However, with update 5.5, choosing \alpha_{ip} = 1 does not necessarily result in a `full step'. For example, even if x_i = 1 for all i, choosing \alpha_i = 1 will usually result in a step that is far too great, increasing the error between the new training example and the old prediction. Smaller or greater values of x_i effectively result in smaller or greater steps toward the target value. The useful range of learning rate values clearly depends on the scale of the input features. How then should the size of the step be chosen? One solution is to renormalise the step-size such that sensible values are found in the range [0, 1]. The working below shows how this can be done. First note that the new learned function may be written as:

    f(\vec{x}_p, \vec{\theta}') = \sum_i x_{ip} ( \theta_i + \Delta\theta_i )
                                = \sum_i x_{ip} \theta_i + \sum_i x_{ip} \Delta\theta_i
                                = f(\vec{x}_p, \vec{\theta}) + \sum_i x_{ip} \alpha_{ip} \left( z_p - f(\vec{x}_p, \vec{\theta}) \right) x_{ip},    (5.6)

where \vec{\theta}' = \vec{\theta} + \Delta\vec{\theta} is the parameter vector after training with (\vec{x}_p, z_p). To find a learning rate that makes the full step, Equation 5.6 should be solved for f(\vec{x}_p, \vec{\theta}') = z_p:

    z_p = f(\vec{x}_p, \vec{\theta}) + \sum_i x_{ip}^2 \alpha_{ip} \left( z_p - f(\vec{x}_p, \vec{\theta}) \right)
    1 = \sum_i x_{ip}^2 \alpha_{ip},    (5.7)

which should hold in order to make a full step. We can now scale this step size,

    \alpha'_p = \sum_i x_{ip}^2 \alpha_{ip},    (5.8)

so that choosing \alpha'_p = 1 results in the full step to z_p, and \alpha'_p = 0 results in no learning. If a single global learning rate is desired (\alpha_{ip} = \alpha_{jp} for all i and j), then (from Equation 5.8) the normalised learning rate is given straightforwardly as,

    \alpha_{ip} = \frac{\alpha'_p}{\sum_i x_{ip}^2},

where \alpha'_p is the new global learning rate at update p.
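The normalisation can be checked numerically. In this sketch (with hypothetical feature values and target), choosing \alpha'_p = 1 makes the post-update prediction hit the target exactly, as Equation 5.7 requires.

```python
def normalised_lms_update(theta, x, z, alpha_prime):
    """LMS update with the step size renormalised (Equation 5.8):
    alpha_ip = alpha'_p / sum_i x_ip^2, so that alpha'_p = 1 is a full step."""
    sum_sq = sum(xi * xi for xi in x)
    alpha = alpha_prime / sum_sq
    prediction = sum(t * xi for t, xi in zip(theta, x))
    return [t + alpha * (z - prediction) * xi for t, xi in zip(theta, x)]

theta = normalised_lms_update([0.0, 0.0, 0.0], [3.0, 4.0, 0.0], 10.0, alpha_prime=1.0)
new_prediction = sum(t * xi for t, xi in zip(theta, [3.0, 4.0, 0.0]))
# new_prediction equals the target 10.0 exactly, despite the unscaled inputs.
```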
5.5 Input Mappings

Many function approximation methods can be characterised by their input mapping, \phi(s), which maps from the environment state, s, to the set of input features, \vec{x}. The feature set is often the major characteristic affecting the generalisation properties of the function approximator, and the same input mapping can be applied with different output functions or training rules. Input mappings also provide a good way to incorporate prior knowledge about the problem, by choosing \phi to scale or warp the inputs in ways that increase the function approximator's resolution in some important part of the space [131]. All of the methods described in this section can be used with the LMS training method. However, more generally they might be provided as inputs to more complex function approximators (such as a multi-layer neural network). Several common input mappings are reviewed here. Each input mapping can be thought of as playing a role similar to the weighting kernels in Section 5.2. The inputs may sometimes be normalised to sum to 1, although this is not always assumed to be done.

5.5.1 State Aggregation (Aliasing)
Suppose that a robot has a range finder that returns real valued distances in the range [0, 1). We might map this to three binary features, \phi(s) = [x_{near}, x_{mid}, x_{far}]:

    x_{near} = 1 if 0 \le s < 1/3, and 0 otherwise.    (5.9)
    x_{mid} = 1 if 1/3 \le s < 2/3, and 0 otherwise.    (5.10)
    x_{far} = 1 if 2/3 \le s < 1, and 0 otherwise.    (5.11)

If s has more than one dimension, then the state-space might be quantised into hypercubes. However the partitioning is done, it is assumed that the regions are non-overlapping and that only one input feature is ever active (e.g. \phi(s) = [0, 1, 0, 0, 0, 0, 0, 0]). That is to say that subsets of the original space are aggregated together into a smaller discrete space. The nearest neighbour method presented in Section 5.2 and table lookup are special cases of state aggregation. The main disadvantage of this form of input mapping is that the state space may need to be partitioned into tiny regions in order to provide the necessary resolution to solve the problem. If it is not clear from the outset how partitioning should be performed, then simply partitioning the state-space into uniformly sized hypercubes will typically result in a huge set of input features (exponential in the number of dimensions of the input space). Similar problems follow with non-regular but evenly distributed partitioned regions, as may occur with the nearest neighbour approach.
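The range finder example can be written down directly. A small sketch (the feature names follow the text; readings outside [0, 1) producing an all-zero vector is an assumption):

```python
def range_finder_features(s):
    """Map a reading s in [0, 1) to three non-overlapping binary features
    [x_near, x_mid, x_far]; exactly one feature is active for any valid s."""
    return [1 if 0.0 <= s < 1 / 3 else 0,
            1 if 1 / 3 <= s < 2 / 3 else 0,
            1 if 2 / 3 <= s < 1.0 else 0]

features = range_finder_features(0.5)  # falls in the "mid" region
```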
5.5.2 Binary Coarse Coding (CMAC)
Devised by Albus [4, 3], the Cerebellar Model Articulation Controller (CMAC) consists of a number of overlapping input regions, each of which represents a feature (see Figure 5.3). The features are binary: any region containing the input state represents an input feature with value 1. All other input features have a value of 0.
Figure 5.3: (left) A CMAC. The horizontal and vertical axes represent dimensions of the state space. (right) The CMAC with a regularised tiling.

If the input tiles are arranged into a regular pattern (e.g. in a grid as in Figure 5.3, right) then there are particularly efficient ways to directly determine which features are active (i.e. without search). A similar argument can be made for some classes of state aggregation but not, in general, for the nearest neighbour method (which usually requires some search). In the case of a linear output function, since many of the inputs will be zero, we simply have:

    f(\vec{x}_p, \vec{\theta}) = \sum_i x_{ip} \theta_i = \sum_{i \in \text{active}} \theta_i.    (5.12)
This form of input mapping, when combined with the linear output function and delta learning rule, has been extremely successful in reinforcement learning. Notably, there are many successful examples using online Q-learning, Q(\lambda) and SARSA(\lambda) [71, 149, 70, 131, 167, 150, 64, 141]. [150] provides many others. Figure 5.4 shows how the features of a CMAC or an RBF (introduced in the next section) are linearly combined to produce an output function.
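A minimal tile-coding sketch may make the regular-grid case concrete. The tiling count, offsets and the one-dimensional state are illustrative assumptions; real CMAC implementations differ in how they offset and index their tilings.

```python
def active_tiles(s, n_tilings=4, tiles_per_dim=8):
    """Indices of the active binary features for a state s in [0, 1).

    Each tiling is a uniform grid shifted by a fraction of a tile width, so
    exactly one tile per tiling contains s (no search is required)."""
    indices = []
    for t in range(n_tilings):
        offset = t / (n_tilings * tiles_per_dim)        # small per-tiling shift
        tile = int((s + offset) * tiles_per_dim)        # which tile contains s
        indices.append(t * (tiles_per_dim + 1) + tile)  # +1 allows offset overflow
    return indices

def cmac_output(theta, s):
    """Linear output: sum the parameters of the active tiles, as in Equation 5.12."""
    return sum(theta[i] for i in active_tiles(s))

theta = [0.25] * (4 * 9)          # 4 tilings, 9 tile slots each
value = cmac_output(theta, 0.5)   # sums the 4 active parameters
```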
CMAC (binary coarse coding): \phi_i(s) = I( dist(s, centre_i) < radius_i ).
RBF (radial basis functions): \phi_i(s) = Gaussian(s, centre_i, width_i).

Figure 5.4: Example input features and how they are linearly combined to produce complex non-linear functions in a 1-dimensional input space. The left-hand-side curves (the set of features) are summed to produce the curve on the right-hand-side (the output function). A single parameter \theta_i determines the vertical scaling of a single feature. It is intended that the parameter vector, \vec{\theta}, is adjusted such that the output function fits some target set of data.

5.5.3 Radial Basis Functions
Radial basis functions (RBFs) are superficially similar to the kernel based averaging method presented in Section 5.2. With fixed centres and widths, an RBF network is simply a linear method and so can be trained using the LMS rule, although in this case the parameters won't represent "prototypical" values. However, one of the great attractions of an RBF is its ability to shift the centres and widths of the basis functions. In a fixed CMAC vs. adaptive Gaussian RBF bake-off of representations for Q-learning, little difference was found between the methods [68] (although these results consider only one test scenario). In some cases it was found that adapting the RBF centres left some parts of the space under-represented. In similar work with Q(\lambda) using adaptive RBF centres, poor performance was found in comparison to the CMAC [167]. In addition to these problems, RBFs are computationally far more expensive than CMACs. Good overviews of RBF and related kernel based methods can be found in [98, 99, 8, 1, 90].
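With fixed centres and widths, an RBF input mapping is just a feature computation. A sketch follows; the centres and width are arbitrary choices for illustration.

```python
import math

def rbf_features(s, centres, width):
    """Gaussian radial basis features for a scalar state s. With fixed centres
    and widths this is simply an input mapping for a linear approximator."""
    return [math.exp(-((s - c) ** 2) / (2.0 * width ** 2)) for c in centres]

features = rbf_features(0.5, centres=[0.0, 0.5, 1.0], width=0.25)
# The feature centred at 0.5 is fully active; the outer two are equally,
# partially active.
```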
5.5.4 Feature Width, Distribution and Gradient
The width of a feature can greatly affect the ability to generalise. The wider a feature, the broader the generalisations that are made about a training instance and the faster learning can proceed in the initial stages. More concretely, in the case of linear output functions, if a feature x_i is non-zero for a set of input states, then \theta_i contributes to the output of those states. Thus if x_i is non-zero for more states (i.e. a wider feature), then updating \theta_i affects the output function for more states in the training set (to a greater or lesser degree depending upon the magnitude of x_i at those states). Do broad features smooth out important details in output functions (i.e. do they reduce its resolution)? Sutton argues not and presents results for the CMAC [150]. Similar results are replicated in Figure 5.5. However, also shown here are results for smoother kernels (e.g. the Gaussian of an RBF). In the example, 100 overlapping features were presented as the inputs to a linear function approximator trained using update 5.5. Step and sine functions were used as target functions for approximation. The bottom row shows the shape of the input features used for training in each column. The learning rate was given by 0.2 / \sum_i x_{ip} (as in [150]).1 With both step and Gaussian features, broader features allowed broader generalisations to be made from a limited number of training patterns. However, in the Gaussian case, broad kernels were disastrous, resulting in extremely poor approximations. Adding more or fewer kernels of the same width, allowing more training, or using different or declining learning rates produces similar results. The reason for this is the size of the features' gradients. If we have two small segments of a Gaussian approximated by, g_1: y = 4x, and, g_2: y = 3x, then summing them we get g_1(x) + g_2(x) = 7x. We see that the gradients of a set of curves are additive when the set is summed together.
Thus an infinite number of Gaussians would be required to precisely represent the steep (infinite) gradient in the step function. In contrast, a CMAC's binary input features have a steep (infinite) gradient and so can represent the steep details in the target function, even when the features are wider than the details in the target. Note however, that this steep gradient doesn't prevent the CMAC from also approximating functions with shallow gradients. Note that in both cases, the narrow features result in less aggressive generalisation in the initial stages.
1 Since x_{ip} \in \{0, 1\}, 0.2 / \sum_i x_{ip} = 0.2 / \sum_i x_{ip}^2, and so this learning rate gives a properly normalised step-size of 0.2 as shown in Section 5.4.2.
Figure 5.5: The generalisation and representational effect of input features of differing widths and gradients. (Columns: binary features vs. Gaussian features; rows: step function and sine function targets after 5, 100 and 10000 training samples, with the input feature shape shown in the bottom row.)
5.5.5 Efficiency Considerations

k-Nearest Neighbour Selection
Methods such as the RBF, in which every element of the feature vector may be non-zero, can be expensive to update if there are many active features. If the features are centred at some state in the input space (such as in the locally weighted averaging example), then a common trick is to consider only the k-nearest feature centres and treat all others as if their values were zero [131]. Special data structures, such as a kd-tree, can be used to store the feature centres and also efficiently determine the k-nearest neighbours to the query point at a cost of much less than O(n) for a total set of n features and k << n [89, 47]. This method can also be used without spatially centred features by choosing only the k largest valued features, although the difficulty here is how to determine these features without searching through them all. In both methods, if k < n then discontinuities appear in the output function at the boundaries in the state space where the set of nearest-neighbours changes. The discontinuities will generally be smaller for larger k. A special case is k = 1, in which all linear methods reduce to state aggregation.

Hashing
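The selection step itself is simple to state. In the sketch below the centres and query are hypothetical, and a full sort is used purely for clarity; a kd-tree would find the same set without examining every centre.

```python
def k_nearest_indices(centres, q, k):
    """Indices of the k feature centres nearest to the query point q.

    A linear scan plus sort, O(n log n); a kd-tree would do better for large n."""
    order = sorted(range(len(centres)), key=lambda i: abs(centres[i] - q))
    return order[:k]

centres = [0.0, 0.2, 0.4, 0.6, 0.8]
nearest = k_nearest_indices(centres, 0.45, k=2)  # the centres at 0.4 and 0.6
```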
Hashing is often associated with the CMAC [4, 150], although in principle it may be applied to the inputs of any kind of function approximator. Hashing simply maps several input features to the input for a single parameter. This can be done (and is usually assumed to be done) in fairly arbitrary ways. In this way huge numbers of input features can be reduced down to arbitrarily small sets. The effect of hashing appears to have been studied very little, although it has been employed with success with the CMAC and SARSA(\lambda) [141].
5.6 The Bootstrapping Problem

The LMS function approximation scheme has often been successfully used for RL in practice; see [163, 71, 149, 167, 64, 141] for examples. In addition there are several RL methods, such as TD(0) [148] and TD(\lambda) [38, 160, 150, 154, 146], for which convergence proofs and error bounds exist. However, there are also some other methods, such as value-iteration and Q-learning, for which the range of f diverges [10, 160]. Even the TD(0) algorithm can be made to diverge if experience is generated from distributions different to the online one [160]. This is serious cause for concern since TD(0) is a special case of many methods. The major problem in using RL with any function approximation scheme is that the training data are not given independently of the output of the function approximator. When an adjustment is made to \vec{\theta} to minimise the output error with some target z at s, it is possible that the change reduces the error for s but increases it for other states. This is not usually a serious problem if the step-sizes are slowly declined, because the increases in error eventually
Figure 5.6: The expansion problem. Some function approximators, when trained using some functions of their own output, can diverge in range.

become small enough that this doesn't happen; most function approximation schemes settle into some local optimum of parameters if their distribution of training data is fixed a priori. However, for a bootstrapping RL system these increases in error can be fed back into the training data. New return estimates that are used as training data are based upon f. In the case of TD(0),

    z = r + \gamma \hat{V}(s'),

is replaced by,

    z = r + \gamma f(\phi(s'), \vec{\theta}),

and may be greater in error as a result of a previous parameter update. In pathological cases, this can cause the range of f to diverge to infinity. There are examples where this happens for both non-linear and linear function approximators [10, 160, 150]. The problem is shown visually in Figure 5.6. The following sections review some schemes that deal with this problem.

Grow-Support Methods
The "Grow-Support" solution proposed by Boyan and Moore is to work backwards from a goal state, which should be known in advance [24] (see also [23]). A set of "stable" states with accurately known values is maintained around the goal. The accuracy of these values is verified by performing simulated "rollouts" from the new states using a simulation model (although in practice this could be done with real experience, but far less efficiently). This "support region" is then expanded away from the goal, adding new states whose values depend upon the values of the states in the old support region. In this way, the algorithm can ensure that the return corrections used by bootstrapping methods have little error, and so ensure the method's stability.2 In [24], Boyan and Moore also present several simple environments in which a variety of common function approximators fail to converge or even find anti-optimal solutions, but succeed when trained using the grow-support method.

2 For similar reasons, one might also expect backwards replay methods (such as the experience stack method) to be more stable with function approximation.
Actual Return Methods
The most straightforward solution to the bootstrapping problem is to perform Monte-Carlo estimation of the actual return. In this case no bootstrapping occurs since the return estimate does not use the learned values. If the return is collected following a fixed distribution of experience then it is clear that any function approximator that converges using fixed (a priori) training data distributions will also converge in this case. Here we are simply performing regular supervised learning, and the fact that the target function is the expectation of the actual observed return is incidental. Also, in the work showing convergence of TD(\lambda), the final error bounds can be shown to increase with lower \lambda [160]. In practice, however, bootstrapping methods can greatly outperform Monte Carlo methods both in terms of prediction error and policy quality [150].

Online, On-Policy Update Distributions
Note that with the linear LMS training rule (and also with other function approximators), the error function being minimised is defined in terms of the distribution of training data. The parameters of states that appear infrequently receive fewer updates and so are likely to be greater in error as a result. Convergence theorems for TD(\lambda) assume that updated states are sampled from the online, on-policy distribution (i.e. as they occur naturally while following the evaluation policy) [38, 160, 154]. Following this distribution ensures that states whose values appear as bootstrapping estimates are sufficiently updated. Failing to update these states means that the parameters used to represent their values (upon which return estimates depend) may shift into configurations that minimise the error in unrelated values at other states. Where the online, on-policy distribution is not followed there are examples where the approximated function diverges to infinity [10, 160]. This is a problem for off-policy methods, where the parameters defining the Q-values of state-action pairs that are infrequently taken (and so also infrequently updated) may frequently define the estimates of return. An obvious example of such a method is Q-learning. Here r + \gamma \max_a \hat{Q}(s', a) is used as the return estimate, but as an off-policy method there is typically no assumption that the greedy action is followed. If it is insufficiently followed, then the greedy action's Q-value is not updated and the parameters used to represent it shift to minimising errors for other state-action pairs. One might expect that online Q-learning while following greedy or semi-greedy policies could be stable. However, this is not the case and there are still examples where divergence to infinity may occur [146]. The cause of this is probably due to a problem noted by Thrun and Schwartz [157].
If the changes to weights are thought of as noise in the Q-function, then the effect of the max operator is to consistently overestimate the value of the greedy action in states where there are actions with similar values. Also, Q-learning and semi-greedy policy evaluating algorithms (such as SARSA) suffer since the greedy policy depends upon the approximated Q-function. This co-dependence can cause a phenomenon called chattering, where the Q-function, and its associated policy, oscillates in a bounded region even in simple situations such as state-aliasing [21, 50, 51, 5].
Even so, methods such as Q(\lambda), Q-learning or value iteration can work well in practice even when updates are not made with the online distribution [163, 128, 149, 167, 150, 117, 140]. Other recent work shows that variants of TD(\lambda) or SARSA(\lambda) can be combined with importance sampling in a way that does allow off-policy evaluation of a fixed policy while following a special class of exploration policies [111, 146]. The idea behind importance sampling is to weight the parameter updates by the probability of making those updates under the evaluation policy. This allows the overall change in the parameters over the course of an episode to have the same expected value (but higher variance), even if the evaluation policy is not followed. It is not clear, however, whether this method can be used effectively for control optimisation.

Local Input Features
Local (i.e. narrow) input features are a common feature in many practical applications of function approximation in RL [13, 163, 71, 1, 68, 167, 150, 95, 140, 141]. Why might this be so? Consider any goal based task where bootstrapping estimates are employed. Here the values of states near the goal may completely define the true values of all other states. If broad features are used and the bulk of the updates are made at states away from the goal (as can easily happen when updating with the online distribution), then it is likely that the parameters will move away from representing the values of states near the goal, making it very difficult for other states to ever approach their true value. The grow-support method is one solution to avoid "forgetting" the values of states upon which others depend. Another is to use localised (i.e. narrow) input features. Thus, in cases where the updates are made far from the goal, the parameters that encode the values of states near the goal are not modified. A similar argument can be made for non-goal based tasks; the general problem is one of not forgetting the values of important states while they are not being visited [128]. However, as we have seen earlier, local input features reduce the amount of generalisation that may occur.

Residual Algorithms
In [10] Baird notes that a simple way to guarantee convergence (under fixed training distributions) is to make use of our knowledge about the dependence of the training data and the function approximator, and allow for this by including a bootstrapping term when deriving a gradient descent update rule. Previously, in the case of TD(0), the gradient descent rule,

    \Delta\vec{\theta} = -\frac{1}{2} \alpha \frac{\partial \left( z_{t+1} - \hat{V}(s_t) \right)^2}{\partial \vec{\theta}}

assumes z_{t+1} to be independent of \vec{\theta}, but not \hat{V}(s_t) = f(\phi(s_t), \vec{\theta}). In residual gradient learning, the error is fully defined as,

    \Delta\vec{\theta} = -\frac{1}{2} \alpha \frac{\partial \left( r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) \right)^2}{\partial \vec{\theta}}
                       = \alpha \left( r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) \right) \left( \frac{\partial \hat{V}(s_t)}{\partial \vec{\theta}} - \gamma \frac{\partial \hat{V}(s_{t+1})}{\partial \vec{\theta}} \right)
In the linear case, we have,

    \Delta\theta_i = \alpha \left( r_{t+1} + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) \right) \left( \phi_i(s_t) - \gamma \phi_i(s'_{t+1}) \right)

The successor states, s_{t+1} and s'_{t+1}, should be generated independently, which may mean that the method is often impractical without a model to generate a sample successor state [150]. Also, \phi_i(s_t) - \gamma \phi_i(s'_{t+1}) may often be small, leading to very slow learning. However, Baird also discusses ways of combining this approach with the linear LMS method in a way that attempts to maximise learning speed while also ensuring stability. A later version of this approach [9] combines the method with value-function-less direct policy search methods, such as REINFORCE [170].

Averagers
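The linear residual gradient update can be sketched as follows. This is an illustrative single-update implementation only; in practice the two successor states should be sampled independently, which is glossed over here by simply passing in one set of successor features.

```python
def residual_gradient_update(theta, phi_s, phi_next, r, gamma, alpha):
    """Residual gradient TD(0) update for a linear value function:
    theta_i += alpha * delta * (phi_i(s_t) - gamma * phi_i(s_{t+1})),
    where delta = r + gamma * V(s_{t+1}) - V(s_t)."""
    v_s = sum(t * p for t, p in zip(theta, phi_s))
    v_next = sum(t * p for t, p in zip(theta, phi_next))
    delta = r + gamma * v_next - v_s
    return [t + alpha * delta * (ps - gamma * pn)
            for t, ps, pn in zip(theta, phi_s, phi_next)]

# One update on a two-state transition with reward 1.
theta = residual_gradient_update([0.0, 0.0], [1.0, 0.0], [0.0, 1.0],
                                 r=1.0, gamma=0.9, alpha=1.0)
```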
The term averager is due to Gordon [49]. The key property of averagers is that they are non-expansions: they cannot extrapolate from the training values. In [49] Gordon notes that i) the value-iteration operator is a function that has the contraction property, ii) many function approximation schemes can be shown to be non-expansions, and iii) any functional composition of a contraction and a non-expansion is a function that is also a contraction. This makes it possible to prove that synchronous value-iteration will converge upon a fixed point in the set of parameters, if one exists, provided that the function approximator can be shown to be a non-expansion. Many mean squared error minimising methods do not have this property. A special kind of averager method is presented in the next section, for which it is clear that any discounted return based RL method cannot possibly diverge (to infinity) regardless of the sampling distribution of return and distribution of updates.
5.7 Linear Averagers

In the LMS scheme, we were minimising:

\frac{1}{2} \sum_p (z_p - f(\vec{x}_p; \vec\theta))^2.

By providing a slightly different error function to minimise,

\frac{1}{2} \sum_p \sum_i^n x_{ip} (z_p - \theta_i)^2,

the gradient descent rule (5.3) yields a slightly different update rule:

\theta_i \leftarrow \theta_i + \alpha_{ip} (z_p - \theta_i) x_{ip},    (5.13)

or,

\theta_i \leftarrow \theta_i + \alpha_{ip} (\text{desired output}_p - \theta_i) \cdot (\text{contribution of } \theta_i \text{ to output}).

Here, the update minimises the weighted (by x_i) squared errors between each \theta_i and the target output, rather than between the actual and target outputs. As before, the learning rate \alpha_{ip} should be declined over time. This method is referred to as a linear averager to differentiate it from the linear LMS gradient descent method. To make the analysis of this method more straightforward, it is also assumed that the inputs to the linear averager are normalised,

x_{ip} = \frac{x'_{ip}}{\sum_k x'_{kp}},

and that 0 \le x_{ip} \le 1. The purpose of this is to make it clear that \sum_i x_{ip}\theta_i is a weighted average of the components of \vec\theta. It is also assumed that 0 \le \alpha_{ip} x_{ip} \le 1, in which case after update (5.13), |z_p - \theta'_i| \le |z_p - \theta_i| must hold.^3 In this way it also becomes clear that each individual \theta_i is moving closer to z_p, since update (5.13) has a fixed-point only where z_p = \theta_i. This does not happen with update (5.5), where z_p = f(\phi(s_p); \vec\theta) is the update's fixed-point. Note that in the linear averager scheme, adjustments may still be made where z_p = f(\phi(s_p); \vec\theta). Function approximators that can be trained using this scheme include state-aggregation (state-aliasing and nearest neighbour methods), k-nearest neighbour, certain kernel based learners (such as RBF methods with fixed centres and basis widths), piecewise and barycentric linear interpolation [80, 37, 93], and table-lookup. All of these methods differ only by their choice of input mapping, \phi, which is often normalised. Many of these methods are already employed in RL (see [136, 167, 140, 117, 93, 97] for recent examples). Special cases of this framework for which convergence theorems exist are: Q-learning and TD(0) with stationary exploration policies and state-aggregation representations [136]; value-iteration where the function approximator update can be shown to be a non-expansion [48], or is a state-aggregation method [21, 159], or is an adaptive locally linear representation [93, 97]. The value-iteration based methods assume that a model of the environment is available; they are also deterministic algorithms, and are easier to analyse as a result. The most significant (and most recent) result is by Szepesvari, where the "almost sure" convergence of Q-learning with a stationary exploration policy has been shown with interpolative function approximators whose parameters are modified with update (5.13) [152]. Figure 5.7 compares the linear LMS (update (5.5)) and linear averager (update (5.13)) methods in a standard supervised learning setting.
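The difference between the two updates can be sketched in a few lines (my own minimal illustration; `lms_update` corresponds to update (5.5) and `averager_update` to update (5.13), with the feature vector x assumed normalised so that its entries sum to one):

```python
import numpy as np

def lms_update(theta, x, z, alpha):
    """Update (5.5): move the OUTPUT theta.x towards the target z."""
    error = z - np.dot(theta, x)
    return theta + alpha * error * x

def averager_update(theta, x, z, alpha):
    """Update (5.13): move each PARAMETER theta_i towards z, weighted by
    its contribution x_i to the output (requires 0 <= alpha*x_i <= 1)."""
    return theta + alpha * (z - theta) * x
```

With x = (0.9, 0.1), target z = 1 and theta initially zero, two LMS steps with alpha = 1 already push theta_0 above 1 (the exaggeration visible in Figure 5.7), whereas each averager parameter always remains inside the interval spanned by its old value and z.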
Linear averagers appear to suffer from over-smoothing problems if broad input features are used, while the use of narrow input features (for any function approximator) limits the ability to generalise, since the values of many input features will be near or at zero, and their associated parameters adjusted by similarly small amounts. The method does not exaggerate the training data in the output in the way that update (5.5) can. The exaggeration problem is the source of divergence in^3

^3 These special assumptions may be relaxed where Theorem 2 (below) can be shown to hold.
[Figure 5.7: four columns of plots comparing the Linear LMS method (update (5.5)) and the Linear Averager (update (5.13)). The rows show the learned functions f(\phi(s); \vec\theta), the input feature shapes \phi(s)_i, and various learned parameters \theta_i.]
Figure 5.7: The effect of input feature width and cost functions on incremental linear gradient descent with different cost schemes. (top) A comparison of the functions learned by parameter update rules (5.5) and (5.13) when the training set is taken from 1000 random samples of the target step function. Note that the averager method learns a function that is entirely contained within the vertical bounds of the target function. In contrast, the linear LMS gradient descent method does not, but finds a fit with a lower mean squared error. This exaggeration of the training data, in combination with the use of bootstrapping, is the cause of divergence when using function approximation with RL. (middle) The input feature shape used by each method in each column. 50 such features, overlapping and spread uniformly across the extent of the figure, provided the input to the linear output function. Note that update (5.5) still learns well with broad input features. In contrast, the averager method suffers from over-smoothing of the output function and cannot well represent the steep details of the target function. (bottom) A selection of the learned parameters over the extent where their inputs are non-zero. Note that for the averager method, the learned parameters are the average of the target function over the extent where the parameter contributes to the output. For both methods, the learned function in the top row is an average of the functions in the bottom row (since the input features were normalised).

RL.^4 However, as follows intuitively from its error criterion, the linear LMS method finds a fit with a lower mean squared error in the supervised learning case. The next two sections show that function approximators which do not exaggerate cannot diverge when used for return estimation in RL. In particular, the stability (i.e. boundedness) of the linear averager method is proven for all discounted return estimating RL algorithms. The rationale behind the proof is simply:

i) All discounted return estimates which bootstrap from f(\cdot; \vec\theta) have specific bounds.

ii) Adjusting \vec\theta using the linear averager update to better approximate such a return estimate cannot increase these bounds.

^4 In some work, this exaggeration (extrapolation of the range of training target values) is sometimes confused with extrapolation (which refers to function approximator queries outside the range of states associated with the training data).
5.7.1 Discounted Return Estimate Functions are Bounded Contractions

Theorem 1 Let r be a bounded real value such that r_{min} \le r \le r_{max}. Define a bound on the maximum achievable discounted return as [V_{min}, V_{max}] where,

V_{min} = r_{min} + \gamma r_{min} + \cdots + \gamma^k r_{min} + \cdots = \frac{r_{min}}{1 - \gamma},
V_{max} = r_{max} + \gamma r_{max} + \cdots + \gamma^k r_{max} + \cdots = \frac{r_{max}}{1 - \gamma},

for some \gamma, 0 \le \gamma < 1. Let z(v) = r + \gamma v. Under these conditions, z is a bounded contraction. That is to say that:

i) if v > V_{max}, then z(v) < v and z(v) \ge V_{min},
ii) if v < V_{min}, then z(v) > v and z(v) \le V_{max},
iii) if V_{min} \le v \le V_{max}, then V_{min} \le z(v) \le V_{max},

for any v \in \mathbb{R}.

Proof: i) Assume that v > V_{max} and show that the following holds,

z(v) < v
\iff r + \gamma v < v
\iff \frac{r}{1 - \gamma} < v,

which follows from r \le r_{max} since,

\frac{r}{1 - \gamma} \le \frac{r_{max}}{1 - \gamma} = V_{max} < v.

This proves the first part of i). We have in general:

r_{min} + \gamma V_{min} = (1 - \gamma)V_{min} + \gamma V_{min} = V_{min}.

Since v \ge V_{max} \ge V_{min} and \gamma \ge 0,

r_{min} + \gamma v \ge V_{min}.    (5.14)

Since r \ge r_{min},

z(v) = r + \gamma v \ge r_{min} + \gamma v \ge V_{min}.
This proves the second part of i). ii) is shown in the same way. iii) Assume that V_{min} \le v and show that the following holds,

V_{min} \le z(v) \iff V_{min} \le r + \gamma v.

This holds since (from (5.14)),

r + \gamma v \ge r_{min} + \gamma v \ge r_{min} + \gamma V_{min} = V_{min}.

The above proof method can be applied to a number of reinforcement learning algorithms. For instance, for Q-learning (where z = r_{t+1} + \gamma \max_a \hat{Q}(s_{t+1}, a)), by redefining v as \max_a \hat{Q}(s_{t+1}, a), r as r_{t+1}, and each remaining v as \hat{Q}(s_{t+1}, a_{t+1}), the proof holds without further modification. Similarly, the method can be applied to the return estimates used by all single step methods (which includes TD(0), SARSA(0), V(0), the asynchronous value-iteration and value-iteration updates) in the same way. Contraction bounds for actual return methods (i.e. non-bootstrapping or Monte-Carlo methods) are more straightforward. Simply note that if,

z = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots,

and r_{min} \le r_i \le r_{max} for i \in \mathbb{N}, then V_{min} \le z \le V_{max}. Contraction bounds for \lambda-return methods (i.e. forward view methods, as in [150]) can also be established by showing that n-step truncated corrected return estimates,

z^{(n)} = \left( \sum_{i=1}^{n} \gamma^{i-1} r_i \right) + \gamma^n v_n

(with r_{min} \le r_i \le r_{max}) are a bounded contraction. This can be done by a method similar to the proof of Theorem 1. Note that any weighted sum of the form,

\sum_i^n x_i z_i, with weights \sum_i^n x_i = 1 and 0 \le x_i \le 1,

has a bound entirely contained within [\min_i z_i, \max_i z_i]. It has been shown in other work that \lambda-return estimates are such a weighted sum of n-step truncated corrected return estimates [163],

z^{\lambda} = (1 - \lambda) \left( z^{(1)} + \lambda z^{(2)} + \lambda^2 z^{(3)} + \cdots \right),
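Theorem 1 is easy to check numerically (a quick sanity check of my own, not part of the thesis): for randomly drawn rewards and values, z(v) = r + \gamma v never maps outside the combined bound and always contracts towards [V_min, V_max].

```python
import random

def return_bounds(r_min, r_max, gamma):
    """[V_min, V_max]: the geometric-series bounds of Theorem 1."""
    return r_min / (1.0 - gamma), r_max / (1.0 - gamma)

def check_bounded_contraction(r_min, r_max, gamma, trials=10000, seed=0):
    """Empirically verify properties i)-iii) of Theorem 1 for z(v) = r + gamma*v."""
    rng = random.Random(seed)
    v_min, v_max = return_bounds(r_min, r_max, gamma)
    eps = 1e-9   # tolerance for floating-point rounding
    for _ in range(trials):
        r = rng.uniform(r_min, r_max)
        v = rng.uniform(10 * v_min - 100, 10 * v_max + 100)
        z = r + gamma * v
        if v > v_max + eps:
            assert z < v and z >= v_min - eps       # property i)
        elif v < v_min - eps:
            assert z > v and z <= v_max + eps       # property ii)
        else:
            assert v_min - eps <= z <= v_max + eps  # property iii)
    return True
```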
[Figure 5.8: a plot of value against state, showing the band between V_{max} and V_{min} that contains all possible return estimates (all training data), together with the current function approximator bounds f_{max} and f_{min}.]

Figure 5.8: By Theorem 1, all possible discounted return estimates must be within the bounds shown, since v may only take values bounded within [f_{min}, f_{max}]. Only return estimates within these bounds can possibly be passed as training data to the function approximator.

and so \lambda-return estimates are also bounded contractions. More intuitively, note that \lambda-return estimates occupy a space of functions between the 1-step methods such as TD(0) and Q-learning (where \lambda = 0, n = 1), and the actual return estimates (where \lambda = 1, n = \infty).

5.7.2 Bounded Function Approximation
Define the current bounds on the output of some function approximator to be [f_{min}, f_{max}], where

f_{min} = \min_{s \in S} f(\phi(s); \vec\theta),
f_{max} = \max_{s \in S} f(\phi(s); \vec\theta).

A corollary of Theorem 1 is that,

\min(V_{min}, f_{min}) \le z \le \max(V_{max}, f_{max}),

where z is any of the discounted return estimates given in the previous section, including any bootstrapping estimates defined in terms of f (e.g. where v = \hat{V}(s) = f(\phi(s); \vec\theta), in the case of TD(0)). In other words, the values of possible training data provided to a function approximator must lie within the combined bounds of [V_{min}, V_{max}] and [f_{min}, f_{max}] (see Figure 5.8). Since return estimate functions must lie in these bounds, and due to the following theorem (satisfied by the linear averager method), the linear averager method is bounded and so cannot diverge to infinity.
Theorem 2 Define \vec\theta' to be the new parameter vector after training with some arbitrary target z \in \mathbb{R}. Let the bounds of the new output function, f', be defined as,

f'_{min} = \min_{s \in S} f(\phi(s); \vec\theta'),
f'_{max} = \max_{s \in S} f(\phi(s); \vec\theta').

If,

\min(V_{min}, f_{min}) \le f'_{min} \le f'_{max} \le \max(V_{max}, f_{max})

for any possible training example, then the bounds of f cannot diverge.

Proof: It follows from Theorem 1 that,

[\min(V_{min}, f_{min}), \max(V_{max}, f_{max})],

entirely contains,

[\min(V_{min}, f'_{min}), \max(V_{max}, f'_{max})].

Thus, further training with any possible training data cannot expand the bounds of f beyond its initial bounds before training. Many function approximators satisfy the conditions of this theorem for,

\min(V_{min}, f_{min}) \le z \le \max(V_{max}, f_{max})

(which always holds for the discounted return functions discussed).

Theorem 3 The linear averager function approximator presented in Section 5.7 satisfies the conditions of Theorem 2 for,

\min(V_{min}, f_{min}) \le z \le \max(V_{max}, f_{max}).

Proof: Note simply that for,

\theta'_i \leftarrow \theta_i + \alpha_{ip} (z - \theta_i) x_i,

where 0 \le \alpha_{ip} x_i \le 1, \theta'_i is no further from z than \theta_i was initially. Since,

\min(V_{min}, f_{min}) \le z \le \max(V_{max}, f_{max}),

\theta'_i must also be at least as close to being contained within these bounds as it was to begin with. If it was already within these bounds, it remains so, since z is in these bounds. Also, since f is a weighted average of the components of \vec\theta, it is bounded by [\min_i \theta_i, \max_i \theta_i] for any input state. Since, as a result of the update, the bounds of all the components of \vec\theta are either unchanged, or moving to be contained within [\min(V_{min}, f_{min}), \max(V_{max}, f_{max})], so then are the bounds of f. The linear LMS gradient descent methods do not satisfy Theorem 2. The exaggeration effects in Figure 5.7 are an illustration of this.
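The content of Theorems 2 and 3 can likewise be checked empirically (my own sketch; the helper names are invented): feeding a normalised linear averager any sequence of targets inside the combined bounds never pushes a parameter outside those bounds, since each update is a convex combination of the old parameter and the target.

```python
import random

def averager_step(theta, x, z, alpha):
    """Update (5.13); x is normalised (sum(x) == 1, 0 <= x_i <= 1),
    so each new theta_i is a convex combination of theta_i and z."""
    return [t + alpha * xi * (z - t) for t, xi in zip(theta, x)]

def random_simplex(rng, n):
    """A random normalised feature vector (a hypothetical test helper)."""
    w = [rng.random() for _ in range(n)]
    s = sum(w)
    return [wi / s for wi in w]
```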
5.7.3 Boundedness Example
Figure 5.9 shows Tsitsiklis and Van Roy's counterexample [160]. With the linear LMS method, divergence with TD(0) can occur if the update distribution differs from the on-line one. For instance, if updates are made to s_1 and s_2 with equal frequency, \theta diverges to infinity. This occurs since, when updating from s_1, the update is:

\theta_{t+1} \leftarrow \theta_t + \alpha (z_{s_1} - \theta_t)
             = \theta_t + \alpha (r + \gamma \hat{V}(s_2) - \theta_t)
             = \theta_t + \alpha (2\gamma\theta_t - \theta_t)
             = \theta_t (1 + \alpha(2\gamma - 1)).

Thus \theta_{t+1} is greater in magnitude (i.e. greater in error, since \theta = 0 is optimal) than \theta_t for (1 + \alpha(2\gamma - 1)) > 1. Thus, where 2\gamma > 1 holds, and for any positive \alpha, this method increases in error for each update from s_1. Only updates from s_2 decrease \theta. Thus, if s_2 is updated insufficiently in comparison to s_1 (as is the case for the uniform distribution), divergence to infinity occurs. The on-line update distribution ensures that \hat{V}(s_1) is sufficiently updated to allow for convergence. The linear averager method converges upon \theta = 0 given 0 \le \gamma < 1. The features are assumed to be normalised (\phi(s_2) = 1, not 2), and the method therefore reduces to a standard state-aggregation method. For transitions s_1 \to s_2,

\theta_{t+1} \leftarrow \theta_t + \alpha (r + \gamma \hat{V}(s_2) - \theta_t)
             = \theta_t + \alpha (\gamma\theta_t - \theta_t)
             = \theta_t (1 + \alpha(\gamma - 1)),

and so \theta decreases in magnitude for 0 \le \gamma < 1, 0 < \alpha \le 1.
In every case, the linear averager method is guaranteed to be bounded. However, because the linear averager method reduces to state aggregation here, it is possible that the example above is a "straw man". It only shows an example where the LMS method diverges and the linear averager method does not. It may be that there are scenarios in which the LMS method converges upon the optimal solution while the averager method does not, or where it converges to its extreme bounds. A fine bottle of single malt whisky may be claimed by the first person to send me the page number of this sentence. Caveat.
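The derivation above is easy to reproduce (my own sketch, using the Figure 5.9 parameters \gamma = 0.99 and \alpha = 0.01): repeated updates from s_1 alone, i.e. the regime in which s_2 is under-sampled, multiply \theta by (1 + \alpha(2\gamma - 1)) under linear LMS but by (1 + \alpha(\gamma - 1)) under the averager.

```python
gamma, alpha = 0.99, 0.01          # the Figure 5.9 settings

def lms_update_from_s1(theta):
    # Linear LMS TD(0) at s1: target r + gamma*V(s2) = 2*gamma*theta,
    # feature phi(s1) = 1, so theta grows by the factor 1 + alpha*(2*gamma - 1).
    return theta + alpha * (2.0 * gamma * theta - theta)

def averager_update_from_s1(theta):
    # Averager with normalised feature (phi(s2) = 1): target gamma*theta,
    # so theta shrinks by the factor 1 + alpha*(gamma - 1).
    return theta + alpha * (gamma * theta - theta)

theta_lms, theta_avg = 1.0, 1.0
for _ in range(1000):
    theta_lms = lms_update_from_s1(theta_lms)
    theta_avg = averager_update_from_s1(theta_avg)
```

After these 1000 updates theta_lms has grown by roughly four orders of magnitude, while theta_avg has decayed towards the optimum \theta = 0.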
5.7.4 Adaptive Representation Schemes

Many forms of function approximator can adapt their input mapping \phi(\cdot), by shifting which input states activate which input features (as does an RBF network [68]), or simply by adding more features and more parameters [117, 93, 131]. In such cases, it is often easy to provide guarantees that the range of outputs is no larger as a result of this adaptation (for example, by ensuring that new parameters are some average of existing ones). In this way, these methods can also be guaranteed to be bounded. An example of an adaptive representation scheme is provided in the next chapter.
[Figure 5.9: two states, s_1 and s_2, with \hat{V}(s_1) = \theta and \hat{V}(s_2) = 2\theta, followed by a terminal state with \hat{V}(s_{term}) = 0.]

Figure 5.9: Tsitsiklis and Van Roy's counterexample. A single parameter \theta is used to represent the values of two states. All rewards are zero on all transitions, and so the optimal value of \theta is zero. The feature mapping is arranged such that \phi(s_1) = 1 and \phi(s_2) = 2. \gamma = 0.99 and \alpha = 0.01.

5.7.5 Discussion
Gordon demonstrated that value-iteration with approximated \hat{V} must converge upon a fixed point in the set of parameters for any function approximation scheme that has the non-expansion property [48]. This follows from noting simply that the value-iteration update is known to be a contraction, and that any functional composition of a non-expansion and a contraction is also a contraction to a fixed point (if one exists). The results here demonstrate the boundedness of general discounted RL with similar function approximators for analogous reasons, by showing that all discounted return estimate functions (with bounded rewards) are bounded contractions (i.e. contractions to within a bounded region), that the linear averager update is a non-expansion, and that the composition of these functions is also a bounded contraction. This provides a more general (and more accessible) demonstration of why function approximator updates having the non-expansion property cannot lead to an unbounded function, and that,

f(\phi(s); \vec\theta) \in [\min(V_{min}, f^0_{min}), \max(V_{max}, f^0_{max})]

are the bounds on the output of f over its lifetime ([f^0_{min}, f^0_{max}] denotes the initial bounds on the output of f for all s \in S). This is a more general statement than is found in [48] (it applies to more RL methods), but it is weaker in the sense that convergence to a fixed-point is not shown. However, this work directly applies to stochastic algorithms, whereas the method in [48] considers only deterministic algorithms where a model of the reward and environment dynamics must be available. Although convergence can be shown with the linear LMS method for some RL algorithms (e.g. for TD(\lambda)), this only holds given restricted update distributions [10, 160]. Divergence to infinity can be shown in cases where this does not hold. This is a problem for control optimisation methods such as Q-learning (which has TD(0) as a special case) where arbitrary exploration of the environment is desired.
It should also be noted that the linear averager method cannot diverge no matter how the return estimates are sampled. This is surprising, since the two gradient descent schemes differ only by the error measure being minimised. However, linear averagers appear to be limited to using narrow input features where steep details in the target function need to be represented. Following the review in Section 5.6, this appears to be a common tradeoff in successfully applied function approximators.
5.8 Summary

A variety of representation methods are available to store and update value and Q-functions. In increasing levels of sophistication and empirical success, but decreasing levels of provable stability, these are: i) table lookup, ii) state aggregation, iii) averagers, iv) linear LMS methods and, v) non-linear methods (e.g. MLPs). A number of heuristics have been reviewed that appear to be useful in aiding the stability of these methods: making updates with the on-line, on-policy distributions; the use of fixed policy evaluation methods rather than greedy policy evaluating methods; the use of function approximators that do not exaggerate training data; the use of local input features; and the use of non-bootstrapping methods. It is not clear that attempting to minimise the error between a function approximator's output and the target training values is a good strategy for RL. We have seen that some methods which attempt to do just this may diverge to infinity, while some methods that do not, and learn prototypical state values instead, cannot (although they may still suffer in other ways where bootstrapping is used). Also, for control tasks, it does not follow that predictive accuracy is a necessary requirement for good policies [5, 150]. This is also seen in methods such as SARSA(\lambda) and Peng and Williams' Q(\lambda), where good policies may be learned even where there is considerable error in the Q-function. Although, similarly, it is straightforward to construct situations where reasonably accurate Q-functions (i.e. close to Q*) have a greedy policy that is extremely poor.
Chapter 6
Adaptive Resolution Representations

Chapter Outline
This chapter introduces a new method for representing Q-functions for continuous state problems. The method is not directly motivated by minimising a function of return estimate error, but aims to refine the Q-function representation in the areas of the state-space that are most critical for decision making.
6.1 Introduction

There are many questions that the designer of a learning system will need to answer in order to build a suitable function approximator to represent the value function of a reinforcement learning agent. How are the feature mappings for a function approximator decided upon? What are appropriate feature widths and shapes for the problem? How many features should be used? Should they be uniformly distributed? If not, which areas in the state-space are the most important to represent? And so on. In order to answer these questions, help may be found by exploiting some knowledge about the problem being solved. However, in many tasks the problem may be too abstract or ill-understood to do this. The result is often an expensive process of trial and error to find a suitable feature configuration. The function approximation methods presented in the previous chapter are "static" in the sense that their input mappings or the available number of adjustable parameters are fixed. In general, this also imposes fixed bounds upon the possible performance that the system may achieve. If a function approximator's initial configuration was poorly chosen, poor learning and poor performance may result.
This chapter discusses autonomous, adaptive methods for representing Q-functions. The initial limits on the system's performance are removed by adding resources to the representation as needed. Over time, the representation is improved through a process of general-to-specific refinement. Although a simple state-aggregation representation is used (for ease of implementation), traditional problems often experienced with these methods can be avoided (e.g. lack of fine control with coarse aggregations, and slow learning with fine representations). In the new approach, during the initial stages of learning, broad features allow good generalisation and rapid learning, while in the later stages, as the representation is refined, small details in the learned policy may be represented. Unlike most function approximation methods, the method is not motivated by value function error minimisation, but by seeking out good quality policy representations. It is noted that i) good quality policies can be found long before an accurate Q-function is found (the success of methods such as Peng and Williams' Q(\lambda) demonstrates this), and that ii) in continuous spaces there are often large areas where actions under the optimal policy are the same.
6.2 Decision Boundary Partitioning (DBP)

In this section, a new algorithm is provided that recursively refines the Q-function representation in the parts of the state-space that appear to be most important for decision making (i.e. where there is a change in the action currently recommended by the Q-function). The state-space is assumed to be continuous, and the state transition and reward functions for this space are assumed to be Markov.

6.2.1 The Representation

The Q-function is represented by partitioning the state-space into hyper-volumes. In practice, this is implemented through a kd-tree (see Figure 6.1) [47, 89]. The root node of the tree represents the entire state-space. Each branch of the tree divides the space into two equally sized discrete subspaces, halfway along one axis. Only the leaf nodes contain any data. Each leaf stores the Q-values for a small hyper-rectangular subset of the entire state-space. From here on, the discrete areas of continuous space that the leaf nodes cover are referred to as regions. The represented Q-function is uniform within regions, and discontinuities exist between them. The aggregate regions are treated as discrete states from the point of view of the value-update rules. As a state-aggregation method, following the results in Section 5.7 in the last chapter, and also those of Singh [138], the method can be expected to be stable (i.e. not prone to diverge to infinity) when used with most RL algorithms.

6.2.2 Refinement Criteria
Periodically, the resolution of an area is increased by subdividing a region into two smaller regions. How should this be done overall? Subdividing regions uniformly (i.e. subdividing every region) will lead to a doubling of the memory requirements. A more careful approach
[Figure 6.1: a kd-tree and the partitioning it induces; labels mark a region, a branch node, and the data stored at a leaf.]

Figure 6.1: A kd-tree partitioning of a two-dimensional space.

is required to avoid such an exponential growth in the number of regions as the resolution increases. Consider the following learning task, in which an agent should maximise its return where:

S = \{ s \mid 0^\circ \le s < 360^\circ \},
A = \{ L, R \} (that is, "go left" and "go right"),

P^a_{ss'} = 1 if s' = s - 15^\circ and a = L; 1 if s' = s + 15^\circ and a = R; 0 otherwise,

R^a_s = \sin(s - 15^\circ) if a = L; \sin(s + 15^\circ) if a = R,
\gamma = 0.9.

The world is circular, such that f(s) = f(s + 360^\circ). Although this is a very simple problem, finding and representing a good estimate of the optimal Q-function to any degree of accuracy may prove difficult for some classes of function approximator. For instance, the function is both non-linear and non-differentiable. However, of particular interest in this
[Figure 6.2: Q*(s, L), Q*(s, R) and sin(s) plotted against state s over 0 to 360 degrees.]

Figure 6.2: The optimal Q-function for SinWorld. The decision boundaries are at s = 90^\circ and s = 270^\circ, where Q*(s, L) and Q*(s, R) intersect.
and many practical problems, is the apparent simplicity of the optimal policy compared to the complexity of its Q-function:

\pi^*(s) = L if 90^\circ \le s < 270^\circ; R otherwise.    (6.1)
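The task above can be sketched as an environment in a few lines (my own rendering of the specification; the class and method names are invented):

```python
import math

class SinWorld:
    """The circular SinWorld task: S = [0, 360), A = {'L', 'R'},
    deterministic 15-degree steps, reward = sin of the successor angle."""
    STEP = 15.0
    GAMMA = 0.9

    def __init__(self, offset=0.0):
        self.offset = offset     # random reward offset, used in the experiments

    def transition(self, s, a):
        """Apply action a in state s; return (next_state, reward)."""
        s_next = (s - self.STEP if a == 'L' else s + self.STEP) % 360.0
        reward = math.sin(math.radians(s_next + self.offset))
        return s_next, reward
```

Under this reward, the greedy-optimal policy is L for 90 <= s < 270 and R otherwise, as in (6.1).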
It is trivial to construct and learn a two-region Q-function which finds the optimal policy given only a few experiences. This, of course, relies upon knowing the decision boundaries (i.e. where Q*(s, L) and Q*(s, R) intersect) in advance (see Figure 6.2). Decision boundaries are used to guide the partitioning process, since it is here that one can expect to find improvements in policies at a higher resolution; in areas of uniform policy, there is no performance benefit for knowing that the policy is the same in twice as much detail. While it is true that, in general, we cannot determine \pi^* without first knowing Q*, in many practical cases of interest it is often possible to find near-optimal or even optimal policies with very coarsely represented Q-functions. A good estimate of \pi^* is found if, for every region, the best Q-value in a region is, with some minimum degree of confidence, significantly greater than the other Q-values in the same region. Similarly, there is little to be gained by knowing more about regions of space where there is a set of two or more near-equivalent best actions which are clearly better than the others. To cover both cases, decision boundaries are defined to be the parts of a state-space where i) the greedy policy changes and, ii) the Q-values of those greedy actions diverge after intersecting. It is important to note that the cost of representing decision boundaries is a function of their surface size, and not necessarily the dimensionality of the state-space. Hence, if there are very large areas of uniform policy, then there can be a considerable reduction in the amount of resources required to represent a policy to a given resolution when compared to uniform resolution methods.

6.2.3 The Algorithm
The partitioning process considers every pair of adjacent regions in turn. The decision of whether to further divide the pair is formed around the following heuristic:

- do not consider splitting if the highest-valued actions in both regions are the same (i.e. there is no decision boundary),
- only consider splitting if all the Q-values for both regions are known to a "reasonable" degree of confidence,
- only split if, for either region, taking the recommended action of one region in the adjacent region is expected to be significantly worse than taking another, better, action in the adjacent region.

The second point is important, insofar as the decision to split regions is based solely upon estimates of Q-values. In practice it is very difficult to measure confidence in Q-values, since they may ultimately be defined by the values of currently unexplored areas of the state-action space, or parts of the space which only appear useful at higher resolutions
[Figure 6.3: schematic Q-value plots for pairs of adjacent regions, grouped into "Do Split" cases (the greedy action changes between regions and the choice matters: "Should take a_1 here?", "Should take other action here?") and "Don't Split" cases ("No change in policy.", "Likely improvement is small.", "Stepped functions always expected.").]

Figure 6.3: The Decision Boundary Partitioning heuristic. The diagrams show Q-values in pairs of adjacent regions. The horizontal axis represents state, and the vertical axis represents value.

(although see [62, 85] for some confidence estimation methods). For both of these reasons, the Q-function is non-stationary during learning, which itself causes problems for statistical confidence measures. The naive solution applied here is to require that all the actions in both regions under consideration must have been experienced (and so had their Q-values re-estimated) some minimum number of times, VIS_min, which is specified as a parameter of the algorithm. This also has the added advantage of ensuring that infrequently visited states are less likely to be considered for partitioning. In the final part of the heuristic, the assumption is made that the agent suffers some "significant loss" in return if it cannot determine exactly where it is best to follow the recommended action of one region instead of the recommended action of an adjacent region. If the best action of one region, when taken in an adjacent region, is little better than any of the other actions in the adjacent region, then it is reasonable to assume that between the two regions the agent will not perform much better even if it could decide exactly where each action is best. The "significant loss" threshold, \Delta_min, is the second and final parameter for the algorithm. Figure 6.3 shows situations in which partitioning occurs. Setting \Delta_min > 0 attempts to ensure that the partitioning process is bounded. For differentiable Q-functions, as the regions become smaller on either side of the decision boundary, the loss for taking the action suggested by the adjacent region must eventually fall below \Delta_min.
In the case where decision boundaries occur at discontinuities in the Q-function, unbounded partitioning along the boundary is the right thing to do, provided that there remains the expectation that the extra partitions can reduce the loss that the agent will receive. The fact that there is a boundary indicates that there is some better
representation of the policy that can be achieved.1 In both cases, a practical limit to partitioning is also imposed by the amount of exploration available to the agent. The smaller a region becomes, the less likely it is to be visited. As a result, the con dence in the Qvalues for a region is expected to increase more slowly the smaller the region is. The remainder of this section is devoted to a detailed description of the algorithm. To abstract from the implementation details of a kdtree, the learner is assumed to have available the set REGIONS , where regi 2 REGIONS and regi = hVol i; Qi; VIS ii. Qi(a) is the Qvalue for each action, a, within the region, Vol i is the description of the hyperrectangle regi covers and V ISi (a) records the number of times an action has been chosen within Vol since the region was created. The choice of whether to split a region is made as follows: 1) Find the set of adjacent regions pairs: ADJ = fhregi ; regj i j regi ; regj 2 REGIONS ^ neighbours(regi ; regj )g 2) Let SPLIT be the set of regions to subdivide (initially empty). 3) for regi; regj 2 ADJ 3a) ai = arg maxa Q(regi ; a) 3b) aj = arg maxa Q(regj ; a) 3c) Find the estimated loss given that, for some states in the region, it appears better to take the recommended action of the adjacent region: 3d) i = jQ(regi ; ai ) Q(regi ; aj )j 3e) j = jQ(regi; ai ) Q(regj ; aj )j 3f) if (ai 6= aj ) and (policy dierence) (i min or j min) and (suÆcient dierence) (fa 2 A j V ISi(a); V ISj (a) V ISming 6= ;) (suÆcient value approximation) 3f1) SPLIT := SPLIT [ fregi ; regj g 4) Partition every region in SPLIT at the midpoint of its longest dimension, maintaining the prior estimates for each Qvalue in the new regions. 5) Mark each new region as unvisited: V IS (a) := 0 for all a. A good strategy to dividing regions is to always divide along the longest dimension [86] after rst normalising the lengths with the size of the statespace. 
This method does not require that distances in each axis be directly comparable and simply ensures that partitioning occurs in every dimension with equal frequency. The obvious strategy, of dividing in the axis of the face that separates regions appeared to work particularly poorly. In most experiments, this led to some regions having a very large number of neighbours. i
¹ This isn't true in the unlikely case that regions are already exactly separated at the boundary. But if this is the case, continued partitioning is still necessary to verify this.
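The split test and midpoint subdivision described in steps 1–5 above can be sketched as follows. This is an illustrative reading of the criterion, not code from the thesis; the `Region` record and helper names are hypothetical stand-ins for the kd-tree bookkeeping.

```python
from dataclasses import dataclass, field

# Hypothetical region record: a hyperrectangle with per-action
# Q-values and visit counts (reg_i = <Vol_i, Q_i, VIS_i>).
@dataclass
class Region:
    low: tuple                                # lower corner of the hyperrectangle
    high: tuple                               # upper corner
    q: dict = field(default_factory=dict)     # action -> Q-value
    vis: dict = field(default_factory=dict)   # action -> visit count

def best_action(reg):
    return max(reg.q, key=reg.q.get)

def should_split(ri, rj, delta_min, vis_min):
    """Steps 3a-3f: flag an adjacent pair whose greedy actions disagree,
    where the estimated loss is large enough, and where some action has
    been sufficiently visited in both regions."""
    ai, aj = best_action(ri), best_action(rj)
    if ai == aj:                               # no policy difference
        return False
    d_i = abs(ri.q[ai] - ri.q[aj])             # loss in reg_i for taking a_j
    d_j = abs(rj.q[aj] - rj.q[ai])             # loss in reg_j for taking a_i
    if d_i < delta_min and d_j < delta_min:    # insufficient difference
        return False
    # sufficient value approximation: some action well-visited in both
    return any(ri.vis.get(a, 0) >= vis_min and rj.vis.get(a, 0) >= vis_min
               for a in ri.q)

def split_longest(reg):
    """Step 4: bisect the longest dimension, inheriting the parent's
    Q-values; step 5: the children start unvisited."""
    dims = [h - l for l, h in zip(reg.low, reg.high)]
    d = dims.index(max(dims))
    mid = (reg.low[d] + reg.high[d]) / 2
    lo_high = list(reg.high); lo_high[d] = mid
    hi_low = list(reg.low);  hi_low[d] = mid
    def make(lo, hi):
        return Region(tuple(lo), tuple(hi), dict(reg.q), {a: 0 for a in reg.q})
    return make(reg.low, lo_high), make(hi_low, reg.high)
```

For example, two adjacent one-dimensional regions whose greedy actions disagree by a margin above `delta_min`, with both actions visited at least `vis_min` times, would be flagged and bisected.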
6.2. DECISION BOUNDARY PARTITIONING (DBP)
6.2.4 Empirical Results
In this section the variable resolution algorithm is evaluated empirically on three different learning tasks. In all experiments the 1-step Q-learning algorithm is used. Although faster learning can be achieved with other algorithms, Q-learning is employed here because of its ease of implementation and computational efficiency.² Also, throughout, the exploration policy used is ε-greedy [150]. In addition, upon entering a region the agent is committed to following a single action until it leaves the region. This prevents the exploration strategy from dithering within a region and allows larger parts of the environment to be covered more quickly.
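The commitment rule described here (re-choose only on entering a new region, then repeat that action until the region changes) can be sketched as a policy wrapper. The `region_of` and `q` helpers are assumed to be supplied by the kd-tree representation; this is illustrative, not the thesis's code.

```python
import random

def committed_epsilon_greedy(region_of, q, epsilon=0.3):
    """Return a policy that chooses epsilon-greedily when the agent
    enters a new region, then repeats that action until the region
    changes. region_of(s) maps a state to its region; q(region, action)
    returns the region's Q-value for that action."""
    memo = {'region': None, 'action': None}

    def act(s, actions):
        reg = region_of(s)
        if reg != memo['region']:              # entered a new region: re-choose
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a: q(reg, a))
            memo['region'], memo['action'] = reg, a
        return memo['action']                  # otherwise keep the committed action

    return act
```

With `epsilon=0` the wrapper is purely greedy per region, which makes the commitment behaviour easy to verify: repeated queries inside one region return the same action.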
The SinWorld Task
In the SinWorld environment (introduced above) the agent has the task of learning the policy which gets it to (and keeps it at) the peak of a sin curve in the shortest time. To prevent a lucky partitioning of the state-space which exactly divides the Q-function at the decision boundaries, a random offset for the reward function was chosen for each trial: f(s) = sin(s + random). In each episode the agent is started in a random state and follows its exploration policy for 20 steps. In all trials the agent started with only a two-state representation. At the end of each episode, the decision boundary partitioning algorithm was applied. Figure 6.4 shows the final partitioning after 1000 episodes. The highest resolution areas are seen at the decision boundaries (where Q(s, L) and Q(s, R) intersect). At s = 90° partitioning has stopped, as the expected loss in discounted reward for not knowing the area in greater detail is less than Δ_min. The decline in the partitioning rate as the boundaries are more precisely identified can be seen in Figure 6.5. Figure 6.6 compares the performance of the variable resolution method against a number of fixed uniform grid representations. The performance measure used was the average discounted reward collected over 30 evaluations of a 20-step episode under the currently recommended policy. The results were averaged over 100 trials. The initial performance matches that of an 8-state representation. After 1000 episodes, however, the performance is slightly better than a 32-state representation (not shown), which managed much slower improvements in the initial stages. It is important to note that without prior knowledge of the problem it is difficult to assess which fixed resolution representation will provide the best trade-off between learning speed and convergent performance. Starting with only two states, the adaptive resolution method provided fast learning in the initial stages yet managed near-optimal performance overall.
² These experiments were also conducted prior to the experience stack method.
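A minimal sketch of the SinWorld setup described above: a sin reward with a per-trial random phase offset, and a 20-step episode from a random start. The step size and state range are assumptions for illustration, not parameters given in the text.

```python
import math
import random

def sinworld_reward(s, offset):
    """Reward at state s with the per-trial random phase offset,
    f(s) = sin(s + offset). The offset prevents a lucky fixed
    partitioning from landing exactly on the decision boundaries."""
    return math.sin(s + offset)

def run_episode(policy, offset, steps=20, step_size=0.1):
    """One 20-step episode from a random start; the agent moves left
    ('L') or right ('R') along the line. step_size is an assumed
    constant, not a value from the thesis."""
    s = random.uniform(0.0, 2 * math.pi)
    total = 0.0
    for _ in range(steps):
        a = policy(s)
        s += step_size if a == 'R' else -step_size
        total += sinworld_reward(s, offset)
    return total
```

Since each step's reward lies in [−1, 1], a 20-step episode's undiscounted return is bounded by ±20, which gives a quick sanity check on the environment.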
CHAPTER 6. ADAPTIVE RESOLUTION REPRESENTATIONS
[Figure 6.4 omitted: plot of value against state s, showing Q(s, L), Q(s, R) and the reward r(s).]
Figure 6.4: The final partitioning after 1000 episodes in the SinWorld experiment. The highest resolution areas are seen at the decision boundaries (where Q(s, L) and Q(s, R) intersect).
[Figure 6.5 omitted: number of states against episode (0–1000).]
Figure 6.5: The number of regions in the SinWorld experiment. Note that the first derivative (the partitioning rate) is decreasing over time.
[Figure 6.6 omitted: average discounted return against episode (10–100) for the adaptive representation and the fixed 2, 4, 8, 16 and 32 state representations.]
Figure 6.6: Comparison of initial learning performances for the variable vs. fixed resolution representations in the SinWorld task. The performance measure is the average total discounted reward collected over 20 steps from random starting positions and offsets of the reward function.
The Mountain Car Task
In the Mountain Car task the agent has the problem of driving an underpowered car to the top of a steep hill.³ The actions available to the agent are to apply an acceleration, deceleration or neither (coasting) to the car's engine. However, even at full power, gravity provides a stronger force than the engine can counter. In order to reach the goal the agent must reverse back up the hill, gaining sufficient height and momentum to propel itself over the far side. Once the goal is reached, the episode terminates. The value of the goal states is defined to be zero since there is no possibility of future reward. At every timestep the agent receives a punishment of −1, and no discounting was employed (γ = 1). In this special case, the Q-values simply represent the negative of the expected number of steps to reach the goal. Figure 6.7 shows the Q-values of the recommended actions after 5000 learning episodes. The cliff represents a discontinuity in the Q-function. On the high side of the cliff the agent has just enough momentum to reach the goal. If the agent reverses for a single time step at this point it cannot reach the goal and must reverse back down the hill. It is here that there is a decision boundary and a large loss for not knowing exactly which action is best. Figure 6.8 shows how this area of the state-space has been discretised to a high resolution. Regions where the best actions are easy to decide upon are represented more coarsely. Figure 6.9 shows a performance comparison between the adaptive and the fixed, uniform grid representations. The measure used is the average total reward collected from 30 random starting positions using the currently recommended policy and with learning suspended. Due to the large discontinuity in the Q-function, partitioning continues long after there appears to be a significant performance benefit for doing so (shown in Figure 6.10). This simply reflects that the performance metric measures the policy as a whole from random starting positions.
Agents starting on or around the discontinuity still continue to gain some performance improvements. The same experiment was also conducted but with the ranges of the states chosen to be 10 times larger than previously, giving a new state-space of 100 times the original volume (see Figure 6.8). Starting positions for the learning and evaluation episodes were still chosen to be inside the original volume. These changes had little effect upon the amount of memory used or the convergent performance, although learning proceeded far more slowly in the initial stages.
³ This experiment reproduces the environment described in [150, p. 214].
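The standard Mountain Car dynamics (after Sutton and Barto, which footnote 3 says this experiment reproduces) can be sketched as below. The update constants are the commonly published ones; treat them as an assumption about this particular reproduction.

```python
import math

def mountain_car_step(x, v, a):
    """One step of the standard Mountain Car dynamics; a is -1
    (reverse), 0 (coast) or +1 (forward). Returns (x', v', reward,
    done), with a reward of -1 per step as in the text."""
    v = v + 0.001 * a - 0.0025 * math.cos(3 * x)   # engine thrust vs. gravity
    v = max(-0.07, min(0.07, v))                   # velocity bound
    x = x + v
    if x < -1.2:                                   # inelastic left wall
        x, v = -1.2, 0.0
    done = x >= 0.5                                # reached the goal
    return x, v, -1.0, done
```

Driving forward at full power from rest near the valley bottom never reaches the goal, which is the property the text relies on: the agent must first reverse to gain height.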
[Figure 6.7 omitted: value surface over position and velocity.]
Figure 6.7: A value function for the Mountain Car experiment after 5000 episodes. The value is measured as max_a Q(s, a) to show the estimated number of steps to the goal under the recommended policy.
Figure 6.8: (left) A partitioning after 5000 episodes in the Mountain Car experiment. Position and velocity are measured along the horizontal and vertical axes respectively. (right) The same experiment but with poorly chosen scaling of axes. This had little effect on the final performance or number of states used.
[Figures 6.9 and 6.10 omitted: average reward, and number of states, against episode (0–1000) for the adaptive, 256 state and 16 state representations.]
Figure 6.9: The mean performance over 50 experiments using the adaptive and the fixed, uniform representations in the Mountain Car task. The average total reward collected from 30 random starting positions under the currently recommended policy is measured.
Figure 6.10: The number of regions in the Mountain Car experiment.
The Hoverbeam Task
In the hoverbeam task [84] the agent has the task of horizontally balancing a beam (see Figure 6.11). On one end of the beam is a heavy motor that drives a propeller and produces lift. On the other is a counterbalance. The state-space is three dimensional and includes the angle from the horizontal, θ, the angular velocity of the beam, and the speed of the motor. The available actions are to increase or decrease the speed of the motor. In this way we also see how a problem with a continuous action set can be decomposed into a similar problem with a discrete action set and a larger state-space: the problem could also be presented as one with motor speed as the only available action. The reward function provided to the agent is largest when the beam is horizontal and declines inversely with the absolute angle from horizontal. Each episode terminates after 200 steps or if the angle of the beam deviates more than 30° from horizontal.⁴ This task requires fine control of the motor speed only in a small part of the entire space. Figure 6.12 compares the performance of several fixed resolution representations against the adaptive representation. Policies with coarse representations cause the beam to oscillate around the horizontal, while fixed high-resolution representations (4096 states) take an unacceptably long time to learn. An intermediate (512 state) resolution representation proved best out of the fixed resolution methods. The adaptive resolution method outperformed each of the fixed resolution methods. Approximately 4000 regions were needed by the end of 10000 episodes.
⁴ A detailed description of this environment is available at: http://www.cs.bham.ac.uk/~sir/pub/hbeam.html
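The reward and termination rules just described can be sketched as follows. The exact functional form of the "declines inversely" reward and its constants are assumptions for illustration; only the 200-step and 30° termination conditions are taken directly from the text.

```python
def hoverbeam_reward(theta_deg):
    """Reward is largest when the beam is horizontal and declines
    inversely with the absolute angle from horizontal. The 1/(1+|theta|)
    form is one plausible reading; the constants are assumed, not
    taken from [84]."""
    return 1.0 / (1.0 + abs(theta_deg))

def episode_over(theta_deg, step):
    """An episode ends after 200 steps or once the beam deviates more
    than 30 degrees from horizontal."""
    return step >= 200 or abs(theta_deg) > 30.0
```

Under this reading, a perfectly balanced beam collects reward 1 per step, so the 200-step episodes have a maximum return of 200, consistent with the reward scale in Figure 6.12.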
[Figure 6.11 omitted: diagram of the hoverbeam: a motor with propeller (thrust) on one end of the beam, a counterbalance on the other, with gravity acting on motor, beam and counterbalance.]
Figure 6.11: The Hoverbeam Task. The agent must drive the propeller to balance the beam horizontally.
[Figure 6.12 omitted: average reward against episode (0–10000) for the adaptive and the fixed 8, 64, 512 and 4096 state representations.]
Figure 6.12: The mean performance over 20 experiments using the adaptive and the fixed, uniform representations in the Hoverbeam task. The total reward collected after 200 steps under the currently recommended policy is measured.

                        SinWorld    Mountain Car    Hoverbeam
  Δ_min                 0.1         10              2
  VIS_min               5           15              15
  Initial regions       2           16              8
  Partition test freq.  1 episode   10 episodes     10 episodes
  α                     0.1         0.15            0.1
  γ                     0.9         1.0             0.995
  Q at t=0              10          0               10
  ε                     0.3         0.3             0.3
  Start state           random      random          30°

Table 6.1: Experiment Parameters.
6.3 Related Work

6.3.1 Multigrid Methods
In a multigrid method, uniform representation resolutions are maintained for the entire state-space, although several layers of different resolution may be employed. Lower levels may be initialised by the values of coarse layers or bootstrap from their values [29, 101, 54, 6, 162, 69, 70]. An obvious disadvantage of uniform multigrid methods is their limited scalability into high-dimensional state-space problems. In order to represent part of the state-space at a resolution of 1/k of the total width of each dimension, k^d regions are represented at the finest resolution. Excluding the cost of less coarse layers, we can see that memory requirements grow exponentially in the dimensionality of the state-space. In situations where all represented states have values updated, time complexity costs must also grow at least as fast. The chief advantage of multigrid methods is reduced learning costs for the fine-resolution approximation. As in the DBP approach, the values learned by coarse layers provide broad generalisation and so rapid (but inaccurate) dissemination of return information throughout the space. Most multigrid work assumes models of the environment are known a priori, although [6] and [162] use Q-learning. In this case, the time complexity costs of the value-updating methods can be less than the space complexity costs. For example, in Vollbrecht's kd-Q-learning [162], which starts with a kd-tree that is fully partitioned to a given depth, Q-values are maintained and updated at all levels throughout the tree. However, since learning occurs on each level, the time-cost of the method grows more reasonably, as O(n|A|), for a tree of depth n. Many of the regions in the finest levels, however, will never be visited or ever store values to any useful degree of confidence. To account for this, the method decides at which level in the tree it has most confidence in value estimates, and uses the region at this level to determine policies and value estimates for bootstrapping.
The method can be expected to make better use of experience than the DBP Q-learning approach, but is computationally more expensive and is also limited to problems for which a full tree of the required depth can be represented from the outset. Learning at several layers of abstraction simultaneously is also related to work on learning with macro-actions and options (although there a discrete, but large, MDP is typically assumed) [134, 39, 83, 102, 110, 143, 22, 43]. This work is reviewed in the next chapter.

6.3.2 Non-Uniform Methods
To attack the scalability problem, many methods examine ways to non-uniformly discretise the state-space. In an early method, Simons uses a non-uniform (state-splitting) grid to control a robotic arm [133]. The task is to find a controller which minimises the forces exerted on the arm's
`hand'. Reinforcements are provided for reductions in this force. The splitting criterion is to partition regions if the arm's controller is failing to maintain the local punishment below some threshold. In cases where the exerted forces were very small, most partitioning occurred and fine control was the result. In [46] Fernandez shows how the state-space can be discretised prior to learning using the Generalised Lloyd Algorithm. The method provides greater resolution in more highly visited parts of the state-space. Similarly, RBF networks may adapt their cell centres such that some parts of the state-space are represented in greater detail [68, 97]. A criticism of these kinds of approach is that they are based upon assumptions similar to those made by standard supervised learning algorithms: that a greater proportion of the error minimisation "effort" should be spent on more frequently visited states. It is not clear that this is the best strategy for reinforcement learning where, for instance, states leading to a goal may be infrequently visited and yet their values may define the values of all other states.

G-Learning
In another early work [28], Chapman and Kaelbling's G algorithm employs a decision tree to represent the Q-function over a discrete (binary) space. Each branch of the tree represents a distinction between 0 and 1 for a particular environmental state variable. Each leaf contains an additional "fringe" which keeps information about all of the remaining distinctions that can be made. The decision of whether or not to fix a distinction is made on the basis of two statistical tests (only one need pass). Here it was found that performing Q-learning and using the learned Q-values to make a split was insufficient. Instead, the method learns the future reward distribution:

    D(s_t, a_t, r) = Σ_{k=0..∞} γ^k Pr(r = r_{t+k+1})

The possible rewards are assumed to be drawn from a small discrete set, R. From this, the Q-values can be recovered as follows:

    Q̂(s, a) = Σ_{r ∈ R} r D(s, a, r).
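The bookkeeping above is simple to state concretely. The following sketch builds the discounted reward-occupancy distribution D from one sequence of observed rewards and recovers the corresponding return; names are illustrative, and a real implementation would update D incrementally across many experiences.

```python
def reward_distribution(trajectory_rewards, reward_set, gamma):
    """Build D(s,a,r) = sum_k gamma^k Pr(r_{t+k+1} = r) from a single
    observed reward sequence (so probabilities are 0/1 indicators)."""
    d = {r: 0.0 for r in reward_set}
    for k, r in enumerate(trajectory_rewards):
        d[r] += gamma ** k          # discounted occupancy of reward value r
    return d

def q_from_reward_distribution(d, rewards):
    """Recover the Q-value as Q(s,a) = sum_r r * D(s,a,r)."""
    return sum(r * d[r] for r in rewards)
```

For the reward sequence 0, 0, 1 with γ = 0.9, the distribution assigns 1 + 0.9 to reward 0 and 0.81 to reward 1, and the recovered Q-value equals the discounted return 0.81.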
Thus the method recovers the same on-policy return estimate as batch, accumulate-trace SARSA(1) (or an every-visit Monte Carlo method), but also has a (non-stationary) future reward distribution for each region. The return distributions of a pair of regions differing by a single input variable are compared using a T test [42]. The distinction is fixed, and the tree deepened, if it is found that the reward distributions differ with a "significant degree of confidence".⁵ The G algorithm also fixes distinctions on the basis of whether differing distinctions recommend different actions. Intuitively, the method appears to identify decision boundaries, but in discrete spaces.
⁵ The use of significance measures in RL to compare return distributions is almost always heuristic since the return distributions are almost always non-stationary.
Classifier Systems
A classifier system consists of a population of ternary rules of the form ⟨1, 0, #, 1 : 1, 0⟩ [55]. A rule encodes a state-action pair, ⟨state : action⟩. A rule applies, and suggests an action, if it matches an input state (which should also be a binary string). A # in a rule stands for "don't care". Thus the rule ⟨#,#,#,# : 1,0⟩ matches any input state, and the rule ⟨0,#,#,# : 1,0⟩ matches any state where the first bit is 0. In this respect, a classifier system provides representations similar to a binary decision tree where data is stored at many levels; ⟨#,#,#,# : 1,0⟩ represents the root and ⟨0,#,#,# : 1,0⟩ is the next level down. In practice, a tree is not used to hold the rules. The population is unstructured: there may be gaps in the state-space covered by the population, and several rules may apply in other states. Each rule has an associated set of parameters, some of which are used to determine a rule's fitness. Fitness measures the quality of a rule and corresponds to fitness in an evolutionary sense. Periodically, unfit rules are deleted from the population and new rules added by combining fit rules together. In Munos and Patinel's Partitioning Q-learning [96], the evolutionary component is replaced with a specialisation operator that replaces a rule containing a # with two new rules in which the # is substituted with a 1 and a 0. Each rule keeps a Q-value for the SAP that it encodes and is updated whenever it is found to apply (several rules may have Q-values updated on each step). The specialisation operator is applied to a fraction of the rules in which the variance in the 1-step error is greatest. This variance is measured as:

    (1/n) Σ_{i=1..n} [ (r_i + γ max_{a'} Q(s'_i, a')) − (r_{i−1} + γ max_{a'} Q(s'_{i−1}, a')) ]²

where the rule applied and was updated at times {t_0, ..., t_i, ..., t_n}. The result is that specialisation causes something like the tree deepening in G-learning. However, unlike the T test, this method does not distinguish between noise in the 1-step return and the different distributions of return that follow from adjacent state aggregations.

Utile Suffix Memory (USM)
So far, all of the methods discussed (including the DBP approach) assume that the real observed states are those of a large or continuous MDP. However, in some cases, the reward or transitions following from the next action may not simply depend upon the current state and action taken, but may depend upon what happened 2, 3 or more steps ago (i.e. the environment is a partially observable MDP). Similar to the G algorithm, McCallum's Utile Suffix Memory (USM) also uses a decision tree to attempt to discover the relevant "state" distinctions needed for acting [82, 81]. However, here the agent's perceived state is a recent history of observed environmental inputs and actions taken. Branches in the tree represent distinctions in the recent history of events that allow different Q-value predictions to be distinguished. The top level of the tree represents actions to be taken at the current state for which Q-values are desired. Deeper levels of the tree make distinctions between different prior observations. For example, a branch 3 levels down might distinguish between whether a_{t−2} = a_10 or whether a_{t−2} = a_5. Distinctions (branches) are added if these
different histories appear to give rise to different distributions of 1-step corrected return, r + γ max_a Q(s', a). The return distributions following from each history are generated from a pool of stored experiences. The Kolmogorov-Smirnov test is used to decide whether the distributions are different [42].⁶

Continuous U Tree
In [161] Uther and Veloso apply USM and G-learning ideas to a continuous space. As in the DBP approach, a kd-tree is used to represent the entire state-space, and branches of the tree subdivide the space. As in McCallum's USM, a pool of experience is maintained and replayed to perform offline value-updates. Within a region, the 1-step corrected return is measured for each stored experience, which serves as a sample set. This is compared with a sample from an adjacent region using the Kolmogorov-Smirnov test. Also, an alternative (less "theoretically based") test was used which maintains splits if doing so reduces the variance in the 1-step return estimates by some threshold.

Dynamically Refactoring Representations
In [35] Boutilier, Dearden and Goldszmidt use a method that seeks to increase the resolution of (decision-tree-based) binary state representations where there is evidence that the value is non-constant within an aggregate region. A Bayesian network is used to compactly represent a transition probability function. The compactness of this function follows from noting that (at least for many discrete-state tasks) many actions frequently leave many features of the current state unchanged. For example, an action such as "pick up coffee cup" will not affect which room the agent is in. Transitions to other rooms, from any state, after taking this action are compactly represented with a probability of zero of occurring. Value functions are represented as decision trees (as in G-learning). However, here it is noted that it is possible to refactor the tree to provide equivalent but smaller representations, especially in cases where the represented value function has a constant value. A form of modified policy iteration (structured policy iteration) is performed upon the tree. At each iteration, the tree is refactored to maintain its compactness.

Comments
An interesting issue with many of these methods is that we actually expect the return following from different regions to be drawn from different distributions in almost all cases: in very many problems, the optimal value function is non-constant throughout almost all of the state-space. This follows as a consequence of using discounting. The return distributions following from adjacent regions are therefore likely to have different means, and so will be shown to be from different distributions under the statistical tests, given significant amounts of experience. It may be that the Kolmogorov-Smirnov test or the T test identify relatively large changes in the value function more quickly than other parts of
⁶ The Kolmogorov-Smirnov test distinguishes samples by the largest difference in their cumulative distributions.
the state-space (e.g. at discontinuities), or where significance tests are passed most quickly (e.g. in areas where most experience occurs). One might hope that these areas also coincide with changes in the optimal policy, although this is clearly not always the case. With experience-caching methods (USM and Continuous U Tree), there is the opportunity to deepen the tree until a lack of recorded experience within leaf regions causes them to be poorly modelled by the stored experience (e.g. because a region contains no experiences, no experiences which exit the region (causing "false terminal states"), or too few experiences to model the local variance in value and pass any reasonable statistical test). Partitioning so deep that we have one experience per action per region is unlikely to be desirable and seems certain to lead to overfitting problems. As the number of regions increases, so does the cost of performing value-iteration sweeps across the set of regions. If computational costs can be neglected, however, one might expect an approach of partitioning as deeply as possible to make extremely good use of experience (provided overfitting and false terminals can be avoided). However, if time and space costs are an issue, then it becomes natural to examine ways in which parts of the state-space can be kept coarse. In this respect, the existing methods miss the key insight that it simply is not necessary (in all cases) to represent the value function to a high degree of accuracy in order to represent accurate policies. It is argued that refinement methods should seek to reduce uncertainty about the best action, and not uncertainty about action values, in order to find better quality policies.
The decision boundary partitioning method offers an initial heuristic way to do this, although it is a less principled approach than one might hope. For instance, in many cases it will follow that reducing uncertainty about the best action requires more certain value estimates for those actions. In turn it may follow (at least in the case of bootstrapping value estimation algorithms, such as Q-learning and value-iteration) that the only way to reduce the uncertainty in these action value estimates is to increase the resolution of the regions whose values determine the action values that we are uncertain about. This requires a non-local partitioning method. All of the methods considered so far are local methods and do not consider partitioning successor regions in order to reduce uncertainty at the current region. Below, the VRDP approaches of Moore, and of Munos and Moore, use a number of different partitioning criteria. In particular, the Influence-Standard Deviation heuristic appears to be a more principled step in the direction of reducing the uncertainty about the best actions to take.

Variable Resolution Dynamic Programming
Moore's Variable Resolution Dynamic Programming (VRDP) is a model-based approach that uses a kd-tree for representing a value function [87, 89]. A simulation model is assumed to be available from which a state transition probability function is derived (by simulating experiences from states within a node and noting the successor node). This is used to produce a discrete region transition probability function which is then solved by standard DP techniques. The partitioning criterion is to split at states along the trajectories seen
while following the greedy policy from some starting state. A disadvantage of this approach is that every state is on the greedy path from somewhere: attempting to use this method to generate policies from arbitrary starting states causes the method to partition everywhere. More recent VRDP work by Munos and Moore examines and compares several different partitioning criteria [94, 95, 92]. The method uses a grid-based "finite-element" representation.⁷ The finite elements are the points (states) at the corners of grid cells for which values are to be computed. A discrete transition model is generated by casting short trajectories from an element and noting the nearby successors at the end of the trajectory. Elements near to the trajectory's end are given high transition probabilities in the model. The following local partitioning rules were initially tested:

i) Measure the utility of a split in a dimension as the size of the local change in value along that dimension. Splits are ranked and a fraction of the best are actually divided.

ii) Measure the local variability in the values in a dimension. Rank and split, as before, but based on this new measure. This causes splits to occur where the value function is non-linear.

iii) Identify where the policy changes along a dimension, and split in that dimension. This refines at decision boundaries.

The decision boundary method was found to converge upon suboptimal policies in a different version of the mountain car task requiring finer control. In some cases, the performance of the decision boundary approach was actually worse than for fixed, uniform representations of the same size. The reason for this is errors in the value approximation of states away from the decision boundary, which actually cause the decision boundaries to be misplaced. Combining the decision boundary and non-linearity heuristics resulted in better performance.
To improve this situation further, an influence heuristic was devised that takes into account the extent to which the value of one state contributes to the value of another element. Intuitively, influence is a measure of the size of the change in the value of s that follows from a unit of change in the value of s_i. The influence I(s|s_i) of the value of state s_i on s is defined as:

    I(s|s_i) = Σ_{k=0..∞} p_k(s, s_i)

where p_k(s, s_i) is the k-step discounted probability of being in s_i after k steps when starting from s and following the greedy policy, π_g. This can be found as follows:⁸

    p_0(s, s') = 1 (if s = s'), 0 (if s ≠ s')
    p_1(s, s') = γ^τ P^{π_g(s)}_{ss'}
    p_k(s, s') = Σ_x γ^τ P^{π_g(s)}_{sx} p_{k−1}(x, s')

⁷ This work was conducted independently of, and in parallel with, the DBP approach [116, 115, 117].
⁸ Here τ represents the timescale over which a state-transition model was calculated, or the mean transition time between s and s'. Variable timescale methods are discussed in the next chapter. Assume for now that τ = 1.
The influence of a state s on a set of states, Ω, is defined as:

    I(s|Ω) = Σ_{s_i ∈ Ω} I(s|s_i).

However, improvements in value representations may not necessarily follow from splitting states with high influence if those states have accurate values. It is assumed that states with high variance in their values (due to having many possible successors with differing values) provide poor value estimates.⁹ Moreover, since state values depend on their successors' successors, a long-term (discounted) variance measure can also be derived from the local variance measures. These heuristics are combined to provide the following partitioning criteria:

1) Identify the set, Ω, of states along the decision boundary.
2) Calculate the total influence on decision boundary values, I(s|Ω), for all s.
3) Calculate the long-term discounted variance of each state, σ²(s).
4) Calculate the utility of splitting a state as: σ(s)I(s|Ω).
5) Split a fraction of the highest utility states.

An illustration of this process appears in Figure 6.13. The figures are provided with thanks to Remi Munos [94]. The Standard Deviation-Influence measure, σ(s)I(s|Ω), performed greatly better for equivalent numbers of states, and appears to be the most principled method to date. Although, in their experiments, a complete and accurate environment model was available, it seems clear that the method can naturally be adapted to the case where a model is learned. Model-free versions of this method do not seem possible: there is no obvious way to learn the influence measure without a model. Note that the influence and variance measures are artefacts of the value estimation procedure and do not directly measure how "good" or "bad" a state is. The influence and variance of states tend to zero with increasing simulation length, and become zero if the simulation enters a terminal state. Thus, there remains the possibility of further developments with this approach that adjust the simulation timescale in order to reduce the number of states with high variance and influence.

⁹ It is assumed, since only deterministic reward functions and environments are considered, that the source of variance must lie in value uncertainties due to the approximate representation.
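For a small discrete chain, the influence sum can be computed by iterating the p_k recurrence directly. The sketch below assumes τ = 1, truncates the infinite sum after a fixed number of terms, and takes the one-step transition probabilities under the greedy policy as given; it is illustrative, not Munos and Moore's implementation.

```python
def influence_matrix(p_greedy, gamma, iters=500):
    """Return I(s|s_i) = sum_k p_k(s, s_i) for all pairs, where
    p_0 is the identity and
    p_k(s, s') = sum_x gamma * p_greedy[s][x] * p_{k-1}(x, s').
    p_greedy[s][x] is the one-step transition probability under the
    greedy policy; the sum is truncated after `iters` terms."""
    n = len(p_greedy)
    ident = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    pk = [row[:] for row in ident]       # p_0
    infl = [row[:] for row in ident]     # running sum, starts at p_0
    for _ in range(iters):
        pk = [[gamma * sum(p_greedy[s][x] * pk[x][t] for x in range(n))
               for t in range(n)] for s in range(n)]
        infl = [[infl[s][t] + pk[s][t] for t in range(n)] for s in range(n)]
    return infl
```

For a two-state chain where state 0 moves to state 1 and state 1 self-loops, with γ = 0.5, the influence of state 1 on itself is the geometric sum 1 + 0.5 + 0.25 + … = 2, and the influence of state 1 on state 0 is 0.5 + 0.25 + … = 1.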
[Figure 6.13 omitted: six panels over position and velocity for the mountain car task: (a) the optimal policy and several trajectories; states of policy disagreement; standard deviation; (b) influence on 3 points; influence on these states; influence × standard deviation. The goal is marked.]
Figure 6.13: Stages of Munos and Moore's variable resolution scheme for a mountain car task. The task differs slightly from the one used in experiments earlier in this chapter and provides the highest reward for reaching the goal with no velocity. The top left figure shows the optimal policy for this task. Influence measures a state's contribution to the value of a set of other states (top right). Standard deviation is a measure of the certainty of a state's value. The Influence-Standard Deviation measure is used to decide where to increase the resolution. A fraction of the highest valued (darkest) states by this measure is partitioned.
PartiGame
The PartiGame algorithm is an online model-learning method that also employs kd-trees for value and policy representations [86] (see also Ansari et al. for a revised version [2]). The method does not solve generic RL problems but aims to find any path to a known goal state in a deterministic environment. The method is assumed to have local controllers that enable the agent to steer to adjacent regions (the set of available actions corresponds to the set of adjacent regions). The method attempts to minimise the expected number of regions traversed to reach the goal, learning a region transition model and calculating a regions-to-goal value-function as it goes (all untried actions in a region are assumed to lead directly to the goal). The method behaves greedily with respect to its value function at all times. The splitting criterion is to divide regions along the "win/lose" boundary between where it is currently thought possible to reach the goal and where it is not. Importantly, as the resolution increases, high-resolution areas appear expensive to cross because they increase the regions-to-goal value; thus greedy exploration initially avoids the win/lose boundary where the agent has previously failed to reach the goal. However, as alternative routes become exhausted, the win/lose boundary is eventually explored. This symbiosis of the exploration method and representation appears to be the source of the algorithm's success. The method has been shown to very quickly find paths to a goal state in problems with up to 9-dimensional continuous state.
6.4 Discussion
A novel partitioning criterion has been devised to allow the refinement of discretised policy and Q-function representations in continuous spaces. The key insights are that:

- Traditional problems in using fixed discretisations include slow learning if the representation is too fine, poor policies if the representation is too coarse, or otherwise a requirement for problem-specific knowledge (or tuning) to achieve appropriate levels of discretisation. General-to-specific refinement promises to solve each of these problems by allowing fast learning (through broad generalisation) in the initial stages while the representation is coarse, and still allowing good quality solutions as the resolution is increased.

- No (local) improvements in policy quality can be derived by knowing in greater detail that a region of space recommends a single action. This led to the decision boundary partitioning criterion, which increases the representation's resolution at points where the recommended policy significantly changes.

- In continuous spaces, decision boundaries may be smaller or lower-dimensional features in the state-space of the value-function than the state-space itself. By exploiting this, and seeking only to represent the boundaries between areas of uniform policy, it is thought that the size of the agent's policy or Q-function representation can be kept small, while still allowing good policies to be represented. Areas represented in high detail (and where poor generalisation can occur) can also be kept to a minimum.
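The decision boundary criterion can be sketched concretely for a uniform grid discretisation. The following is a minimal illustration, not the thesis's implementation: it marks the cells that lie on a decision boundary, i.e. those whose greedy action disagrees with a neighbour's, as candidates for splitting.

```python
def boundary_cells(policy):
    """Return the set of (row, col) cells lying on a decision boundary.

    `policy` is a 2-D list giving the greedy action label in each cell of
    a uniform discretisation. A cell is on the boundary if its action
    differs from that of a 4-neighbour.
    """
    rows, cols = len(policy), len(policy[0])
    marked = set()
    for r in range(rows):
        for c in range(cols):
            # compare with the neighbours below and to the right;
            # this covers every adjacent pair exactly once
            for dr, dc in ((1, 0), (0, 1)):
                r2, c2 = r + dr, c + dc
                if r2 < rows and c2 < cols and policy[r][c] != policy[r2][c2]:
                    marked.add((r, c))
                    marked.add((r2, c2))
    return marked

# A 3x3 policy with a diagonal boundary between actions 0 and 1.
policy = [[0, 0, 1],
          [0, 0, 1],
          [0, 1, 1]]
print(sorted(boundary_cells(policy)))
```

Only the marked cells would be refined; interior cells of uniform-policy areas are left coarse, which is the source of the representation savings discussed above.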
The experiments showed that the final policies achieved can be better and are reached more quickly than those of fixed uniform representations. This is especially true in problems requiring very fine control in a relatively small part of the entire state-space. The independent study by Munos and Moore shows that partitioning at decision boundaries, and other local partitioning criteria, finds suboptimal solutions. The non-local heuristic of partitioning states whose values are uncertain and also influence the values at decision boundaries (and therefore the location of decision boundaries) allows smaller representations of higher quality policies to be found than local methods.
Chapter 7

Value and Model Learning With Discretisation

Chapter Outline
This chapter introduces learning methods for discrete event, continuous time problems (modelled formally as Semi-Markov Decision Processes). We will see how the standard discrete time framework can lead to biasing problems when used with discretised representations of continuous state problems. A new method is proposed that attempts to reduce this bias by adapting learning and control timescales to fit a variable timescale given by the representation. For this purpose Semi-Markov Decision Process learning methods are employed.
7.1 Introduction
This chapter presents an analysis of some problems associated with discretising continuous state-spaces. Note that in discretised continuous spaces the agent may see itself as being within the same state for several timesteps before exiting. We will see what effect this can have on bootstrapping RL algorithms that assume the Markov property, and that, at least for some simple toy problems, this problem can be overcome by modifying the RL algorithm to perform a single value backup based upon the entire reward collected until the perceived state changes. The results are RL algorithms that employ both spatial abstraction (through function approximation) and temporal abstraction (through variable timescale RL algorithms) simultaneously.
7.2 Example: Single Step Methods and the Aliased Corridor Task
Consider the following environment; the learner exists in the corridor shown in Figure 7.1. Episodes always start in the leftmost state. Each action causes a transition one state to the right until the rightmost state is entered, where the episode terminates and a reward of 1 is given. A reward of zero is received for all other actions and γ = 0.95. The environment is discrete and Markov except that the agent's perception of it is limited to four larger discrete states. Figure 7.2 shows the resulting value-function when standard (1-step) DP and 1-step Q-learning are used with state aliasing. With Q-learning, backup (3.34) was applied after every step. With DP, a maximum-likelihood model was formed by applying backups (3.41) and (3.42) after each step and solving the model using value-iteration. Both methods learn overestimates of the value-function by the last region. The modelled MDP in Figure 7.3 is that learned by the 1-step DP method. Overestimation occurs since the rightmost region learns an average value of the aliased states it contains. Unfortunately, the region which leads into it requires the value of its first state (not the average) as its return correction in order to predict the return for entering that region and acting from there onwards. Since, in this example, the first state of a region always has a lower value than the average, the return correction introduces an over-optimistic bias. These biases accumulate as they are propagated to the predecessor regions. The effect on Q-learning is worse. A high step-size, α, weights Q-values towards the more recent return estimates used in backups. In the extreme case where α = 1, each backup to a region wipes out any previous value; each value records the return observed upon leaving the region. This leads to the case where the leftmost region learns the value for being just 4 steps from the goal.
This is especially undesirable in continual learning tasks where α cannot be declined in the standard way.
Figure 7.1: (top) The corridor task. (bottom) The same task with states aliased into four regions (the regions are entered at t = 0, 16, 32 and 48; the goal is reached at t = 64 with r = 1).
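The over-optimism described above can be checked numerically. The sketch below, assuming the 64-state corridor aliased into four 16-state regions shown in Figure 7.1, compares the true value of each region's first (entry) state with the average of the true values of the states it contains; the learned region values are not exactly these averages, but the ordering (average above first-state value) is what produces the bias.

```python
GAMMA = 0.95

# True values in the 64-state corridor: V*(s) = gamma^(63 - s), since the
# reward of 1 arrives on the transition out of the rightmost state.
v_true = [GAMMA ** (63 - s) for s in range(64)]

for region in range(4):
    first = v_true[region * 16]                            # value of the entry state
    avg = sum(v_true[region * 16:(region + 1) * 16]) / 16  # region-wide average
    print(f"region {region}: first-state value {first:.3f}, region average {avg:.3f}")
```

In every region the average exceeds the entry-state value, so using the averaged region value as a return correction over-predicts the return available upon entering that region, and the error compounds towards the leftmost region.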
[Figure 7.2 plots: Value against State (0-60). Left legend: V*(s), 1-step DP. Right legend: V*(s) and Q-learning with alpha = 1.0, 0.8, 0.5, 0.2, 0.1, 0.01.]
Figure 7.2: Solutions to the corridor task using 1-step DP (left) and 1-step Q-learning (right).
[Figure 7.3 diagram: a chain of four region-states, each with self-transition probability 1 − p and probability p of advancing to the next; the final transition yields r = 1.]
Figure 7.3: A naively constructed maximum likelihood model of the aliased corridor. p = 1/16.
7.3 Multi-Timescale Learning
In Section 3.4.4 we saw how return estimates may employ actual rewards collected over multiple timesteps:

    z_t^(n) = r_t^(n) + γ^n U(s_{t+n}),    (7.1)

where n is the number of steps for which the policy under evaluation is followed, and r_t^(n) = Σ_{k=1..n} γ^{k−1} r_{t+k} is an n-step truncated actual return. Here n is assumed to be a variable corresponding to the amount of time it takes for some event to occur. In particular, the amount of time it takes to enter the successor of s_t (i.e. s_{t+n} ≠ s_t) is used. In [143], Sutton, Precup and Singh describe how to adapt existing 1-step algorithms to use these return estimates (see also [134, 110, 109, 53]). The 1-step Q-learning update becomes:

    Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α [ r_t^(n) + γ^n max_{a'} Q̂(s_{t+n}, a') − Q̂(s_t, a_t) ].    (7.2)

Similarly, model-learning methods may learn a multi-time model:

    N^a_s ← N^a_s + 1,
    R̂^a_s ← R̂^a_s + (1/N^a_s) ( r_t^(n) − R̂^a_s ),    (7.3)
    ∀x ∈ S:  P̂^a_sx ← P̂^a_sx + (1/N^a_s) ( γ^n I(x, s') − P̂^a_sx ),    (7.4)

where a = a_t, s = s_t, s' = s_{t+n}, I(x, s') is 1 if x = s' and 0 otherwise, R̂^a_s is the estimated expected (uncorrected) truncated return for taking a in state s for n steps, and P̂^a_sx gives the estimated discounted transition
probabilities given this same course of action:

    lim_{N^a_s → ∞} P̂^a_sx = Σ_{n=1..∞} γ^n Pr( x = s_{t+n} | s = s_t, a = a_t, x ≠ s_t ).
A multi-time model (P̂ and R̂) concisely represents the effects of following a course of action for several timesteps (and possibly variable amounts of time) instead of the usual one step. Since the amount of discounting that needs to occur (in the mean) is accounted for by the model, γ is dropped from the 1-step DP backup to form the following multi-time backup rule:

    V̂(s) ← max_a [ R̂^a_s + Σ_{s'} P̂^a_ss' V̂(s') ].    (7.5)
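Updates (7.3)-(7.4) and backup (7.5) can be sketched as follows. This is a hedged sketch under simplifying assumptions (tabular regions, a fixed γ); the names `update_model` and `multitime_backup` are illustrative, not from the thesis.

```python
from collections import defaultdict

GAMMA = 0.95

# Multi-time model statistics (updates 7.3 and 7.4): visit counts N[s][a],
# expected truncated returns R[s][a], and discounted transition
# "probabilities" P[s][a][x], which absorb the gamma^n discounting.
N = defaultdict(lambda: defaultdict(int))
R = defaultdict(lambda: defaultdict(float))
P = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))

def update_model(s, a, truncated_return, s_next, n, states):
    """Update the model after following a from region s for n steps,
    arriving in s_next having collected the given truncated return."""
    N[s][a] += 1
    step = 1.0 / N[s][a]
    R[s][a] += step * (truncated_return - R[s][a])
    for x in states:
        indicator = 1.0 if x == s_next else 0.0
        P[s][a][x] += step * (GAMMA ** n * indicator - P[s][a][x])

def multitime_backup(V, s, actions, states):
    """Backup rule (7.5): no explicit gamma, as discounting lives in P."""
    return max(R[s][a] + sum(P[s][a][x] * V[x] for x in states)
               for a in actions)
```

For example, after a single 2-step transition from region `'A'` to `'B'` with zero truncated reward, the backed-up value of `'A'` is γ² V('B'), exactly as the 2-step return would give.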
More generally, the above multi-time methods are a special case of continuous time discrete event methods for learning in Semi-Markov Decision Processes (SMDPs) (see [61, 114]). Here, n may be a variable, real-valued amount of time. If a successor state is entered after some real-valued duration, τ > 0, replacing all occurrences of n with τ in the above updates yields a new set of algorithms suitable for learning in an SMDP. In cases where reward is also provided in continuous time by a reward rate, ρ, the following immediate reward measure can be used while still performing learning in discrete time [91, 25]:

    r_t^(τ) = ∫_0^τ γ^x ρ_{t+x} dx.    (7.6)
All return methods may also be adapted to work in this way by defining the return estimate as follows:

    z_t = (1 − λ^τ) [ r_t^(τ) + γ^τ Û(s_{t+τ}) ] + λ^τ [ r_t^(τ) + γ^τ z_{t+τ} ].    (7.7)

By recording the time interval τ, along with the states observed, rewards collected and actions taken, Equation 7.7 allows an SMDP variant of backwards replay and the experience stack method to be constructed straightforwardly. Also, from (7.7), the following updates for a continuous time, accumulate trace TD(λ) may be found:

    ∀s ∈ S:  e(s) ← { (γλ)^τ e(s) + 1,  if s = s_t;
                      (γλ)^τ e(s),      otherwise. }

    ∀s ∈ S:  V̂(s) ← V̂(s) + α ( r_t^(τ) + γ^τ V̂(s_{t+τ}) − V̂(s_t) ) e(s).

A derivation appears in Appendix C. This method differs from other SMDP TD(λ) methods (e.g. see [44], which also considers a continuous state representation). The derivation of these updates in Appendix C shows that the version here is the analogue of the forward-view continuous time return estimate (Equation 7.7).
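The continuous time, accumulate-trace TD(λ) updates above can be sketched in a few lines. The function name and the dictionary-based state representation are assumptions for illustration; `r_tau` is the already-discounted reward accumulated over the interval (as in Equation 7.6).

```python
def smdp_td_lambda_step(V, e, s, s_next, r_tau, tau, gamma, lam, alpha):
    """One continuous-time accumulate-trace TD(lambda) update.

    V and e map states to value estimates and eligibility traces;
    tau is the real-valued duration of the last transition.
    """
    # TD error over the interval of length tau
    delta = r_tau + gamma ** tau * V[s_next] - V[s]
    for x in e:
        e[x] *= (gamma * lam) ** tau      # decay every trace by (gamma*lambda)^tau
    e[s] += 1.0                           # accumulate at the visited state
    for x in e:
        V[x] += alpha * delta * e[x]      # apply the update to all traced states
```

With τ = 1 this reduces to the familiar discrete-time accumulate-trace TD(λ) update, which is a useful sanity check on the generalisation.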
Figure 7.4: (top-left) Actions taken and updates made by original every-step algorithms. The discrete region is entered at START. Selecting different actions on each step can cause dithering and a poorly measured return for following the policy recommended by the region (which can only be a single action). (top-right) Effect of the commitment policy. Updates are still made after every step. (bottom-left) Multi-time first-state update with commitment policy. Updates are made once per region. (bottom-right) Possible distribution of state values whose mean is learned by first-state methods. It is assumed that states are entered predominantly from one direction.
7.4 First-State Updates
Section 7.2 identified a problem with naively using bootstrapping RL updates in environments where there are aggregations of states which the learner sees as a single state. The key problem this causes is that the return correction used by backups upon leaving a region does not necessarily reflect the available return for acting after entering the successor region, but is at best an average of the values of states within the successor. To reduce this bias, learning algorithms can be modified to use return estimates that reflect the return received following the first states of successor regions. This is done by making backups to a region using only return estimates representing the return following its first visited state. This is easy to do if there is a continuous-time (SMDP) algorithm available which has the following two components:

nextAction(agent) → action
    Returns the next, possibly exploratory, action selected by the agent.

setState(agent, r, s', A_s', τ)
    Informs the agent of the consequences of its last action. The last action generated r immediate discounted reward, put it into state s', time τ
later and actions A_s' are now available. The learning updates should be made here.

The following wrappers transform the original algorithm into one which predicts the return available from the first states of a region entered. It is assumed that the percept, s, denotes a region and not a state.

    nextAction'(agent) → action:
        if dt = 0 then
            a ← nextAction(agent)
        return a

    setState'(agent, r, s', A_s', τ):
        multistep_r ← multistep_r + γ^dt · r
        dt ← dt + τ
        if s' ≠ s or dt ≥ τ_max then
            setState(agent, multistep_r, s', A_s', dt)
            dt ← 0;  multistep_r ← 0;  s ← s'

The variables dt, a, s and multistep_r are global. At the start of each episode dt and multistep_r should be initialised to 0. The nextAction' wrapper ensures that the agent is committed to taking the action chosen in the first state of s until it leaves. If we seek a policy that prescribes only one action per region, it is important that only single actions are followed within a region, otherwise the return estimates may become biased towards the return available for following mixtures of actions.1 For control optimisation problems it is assumed that there is at least one deterministic policy that is optimal. If the method were instead to be used for policy evaluation, the agent could equally be committed to some (possibly stochastic but still fixed) policy until the region is exited. The setState' wrapper records the truncated discounted return and the amount of time which has passed, which is necessary for the original variable-time algorithm to make a backup. The value τ_max is the maximum possible amount of time for which the agent is committed to following the same action. It may happen that the agent becomes stuck if it continually follows the same course of action in a region. The time bound attempts to avoid such situations. See Figure 7.4 for an intuitive description of first-state methods. Note that the method implicitly assumes that regions are predominantly entered from one direction. If entered from all directions then the expected first-state values can be expected to be an approximation closer to the real mean state-value of the region as a whole. Thus in this case, one would not expect the method to provide any significant improvements over every-step update methods.
1. This form of exploration was used in the decision boundary partitioning experiments.
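The wrapper pseudocode above can be realised as a small class. This is a sketch under assumptions: the wrapped `agent` is taken to expose `next_action()` and `set_state(r, s, actions, tau)` methods mirroring the two components described in the text; the names are illustrative, not the thesis's exact interface.

```python
class FirstStateWrapper:
    """Commit to one action per region and deliver a single multi-step
    (truncated, discounted) reward to the underlying SMDP learner when
    the region changes or the time bound tau_max is reached."""

    def __init__(self, agent, gamma, tau_max):
        self.agent, self.gamma, self.tau_max = agent, gamma, tau_max
        self.dt = 0.0            # time elapsed within the current region
        self.multistep_r = 0.0   # discounted reward accumulated so far
        self.s = None            # current region percept
        self.a = None            # committed action

    def next_action(self):
        if self.dt == 0:                       # choose only on region entry
            self.a = self.agent.next_action()
        return self.a                          # otherwise stay committed

    def set_state(self, r, s_next, actions, tau):
        self.multistep_r += self.gamma ** self.dt * r
        self.dt += tau
        if s_next != self.s or self.dt >= self.tau_max:
            # one backup per visited region, covering the whole interval
            self.agent.set_state(self.multistep_r, s_next, actions, self.dt)
            self.dt, self.multistep_r, self.s = 0.0, 0.0, s_next
```

Any SMDP learner with this interface (e.g. the SMDP Q-learning or model-learning updates of Section 7.3) then receives first-state return estimates without modification.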
7.5 Empirical Results
The first-state backup rules are evaluated on the corridor task introduced in Section 7.2 and the mountain car task. Figure 7.5 compares the learned value functions of the first-state and every-step 1-step methods. The learned value function was the same for both model-free and model-based methods. Even though the first-state methods may have a higher overall absolute error than their every-step counterparts, it is argued that i) these estimates are more suitable for bootstrapping and do not suffer from the same progressive overestimation by the time the reward is propagated to the leftmost region, and ii) the higher error is of no consequence if we can choose which state values to believe. We know that the predictions represent values of the expected first states of each region. In these states, the method has no error.
Figure 7.5: The value-function found using first-state backups in the corridor task. Every-step Q-learning finds the same solution as every-step DP since a slowly declining learning rate was used.

Mountain Car Task
In the mountain car experiments the agent is presented with a 4 × 4 uniform grid representation of the state-space. τ = 1 for all steps, γ = 0.9, Q_0 = 0. The ε-greedy exploration method was used, with ε declining linearly from 0.5 on the first episode to 0 on the last. All episodes start at randomly selected states. For the model-free methods (Q-learning and Peng and Williams' Q(λ)), α is also declined in the same way. Because the first-state methods alter the agent's exploration policy by keeping the choice of action constant for longer, the every-step methods are also tested using the same policy of committing to an action until a region is exited. For the model-based (DP) method, Wiering's version of prioritised sweeping was adapted for the SMDP case in order to allow the method to learn online [167]. 5 value backups were allowed per step during exploration, and the value function was solved using value-iteration for the current model at the end of each episode. Q_0 was used as the value of all untried actions in each region.
Peng and Williams' Q(λ) was also tested. The main purpose of this experiment was to try to establish whether the improvements caused by the wrapper were due to using first-state return estimates or simply through using multi-step returns. We have seen earlier in the thesis how multi-step methods can overcome slow learning problems by using single reward and transition observations to update many value estimates. One might think that this would provide the first-state method with an additional advantage over the every-step methods. However, in this respect each Q-learning method is actually very similar. Each method updates at most one value for each step (unlike return and eligibility trace methods). Even so, PWQ(λ) was also tested with λ = 1.0, ensuring that the return estimates employ the reward due to actions many steps in the future. The following state-replacing trace method was used (c.f. update (3.31)):

    ∀s, a ∈ S × A:  e(s, a) ← { 1,               if s = s_t and a = a_t;
                                0,               if s = s_t and a ≠ a_t;
                                (γλ)^τ e(s, a),  otherwise. }

The results of the various methods are shown in Figures 7.6-7.8. The average trial length measures the quality of the current greedy policy from 30 randomly selected states. Regret measures the difference between the estimated value of a starting region and the actual observed return for following the greedy policy for each of these evaluations. Regret is taken to represent a measure of bias in the learned Q-function, and the mean squared regret as a measure of variance in the estimate. The results in these graphs are the average of 100 independent trials. The lack of smoothness in the graphs comes from averaging over many starting states.
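The state-replacing trace update above can be sketched as follows; the function name and dictionary representation are assumptions for illustration.

```python
def replacing_trace_update(e, s_t, a_t, gamma, lam, tau):
    """State-replacing eligibility trace for the SMDP case (cf. (3.31)).

    At the visited state s_t the trace for the taken action is replaced
    with 1 and the traces for all other actions are cleared; traces
    elsewhere decay by (gamma * lambda)^tau for the interval tau.
    """
    for (s, a) in list(e):
        if s == s_t:
            e[(s, a)] = 1.0 if a == a_t else 0.0
        else:
            e[(s, a)] *= (gamma * lam) ** tau
```

Clearing the other actions' traces at the visited state is what distinguishes the state-replacing variant from a plain replacing trace.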
[Figure 7.6 plots: Average Episode Length (offline), Mean Regret, and Mean Squared Regret against Episode (0-50) for Every-Step DP, Every-Step DP + Commitment Policy, and First-State DP.]
Figure 7.6: First-state results for the model-based method in the mountain car task. `Every-Step' indicates that learning updates and action choices for exploration were made after every step. `Every-Step + Commitment Policy' indicates that learning updates were made at every step, but action choices were made only upon entering a new region. `First-State' indicates that the variable timescale learning updates and action choices were made once per visited region. (See Figure 7.4.)
In Figure 7.6 (the model-learning method), we can see that the commitment policy led to big improvements in the learned policy, but no significant difference in performance in this, or the other measures, follows from using the first-state learning method. The commitment policy also led to improvements in terms of the regret measure. The standard every-step method learned values that were consistently over-optimistic and also generally greater in variance than the commitment policy methods. With the Q-learning and Q(λ) methods (see Figure 7.7), the general picture is that some improvements are seen over the commitment policy method as a result of using the first-state updates. This happens in each measure to some degree. This result is somewhat surprising, especially for Q-learning, which can be viewed as performing a stochastic version of the value-iteration updates used in the model-learning experiment. A possible reason for this is the recency biasing effects of high learning rates (as seen in the Q-learning example in Section 7.2). To test this, the experiment was repeated with a lower and fixed learning rate (α = 0.1). In this case, the difference between the every-step and first-state commitment policy methods shrinks (see Figures 7.9 and 7.10).
[Figure 7.7 plots: Average Episode Length (offline), Mean Regret, and Mean Squared Regret against Episode (0-200) for Every-Step Q(0), Every-Step Q(0) + Commitment Policy, and First-State Q(0).]
Figure 7.7: Q-learning results in the mountain car task with declining α.
[Figure 7.8 plots: Average Episode Length (offline), Mean Regret, and Mean Squared Regret against Episode (0-50) for Every-Step PW(1.0), Every-Step PW(1.0) + Commitment Policy, and First-State PW(1.0).]
Figure 7.8: Peng and Williams' Q(λ) results in the mountain car task with declining α.
[Figure 7.9 plots: Average Episode Length (offline), Mean Regret, and Mean Squared Regret against Episode (0-200) for Every-Step Q(0), Every-Step Q(0) + Commitment Policy, and First-State Q(0).]
Figure 7.9: Q-learning results in the mountain car task with α = 0.1.
[Figure 7.10 plots: Average Episode Length (offline), Mean Regret, and Mean Squared Regret against Episode (0-50) for Every-Step PW(1.0), Every-Step PW(1.0) + Commitment Policy, and First-State PW(1.0).]
Figure 7.10: Peng and Williams' Q(λ) results in the mountain car task with α = 0.1.
7.6 Discussion
Previous work has identified the benefits of using multi-step return estimates in non-Markov settings [103, 60, 159]. Here we have seen how discretisation of the state-space can cause the representation to appear non-Markov and so can introduce biases for bootstrapping RL algorithms. The first-state methods are intended to reduce this bias by ensuring that the learned values used for bootstrapping are also lower in bias. In cases where SMDP variants of RL algorithms are available (and we have seen that such methods can be derived straightforwardly), implementing a first-state method is also straightforward through the use of wrapper functions. An empirical comparison with fixed timescale methods was provided. Overall, the experimental results with the mountain car task were disappointing. Possibly, this may be due to the relatively small time it takes to traverse a region in this task. The major improvements were found to be a result of following the commitment policy rather than learning first-state value estimates. Some improvements were seen in the model-free case as a result of the first-state updates, but only where high learning rates caused every-step updating methods to become unduly recency biased. Other work has pointed to the use of adaptive timescale models and updates in adaptive discretisation schemes. Notably, in [95, 92] Munos and Moore generate a multi-time model from a state dynamics model. The model is built by running simulated trajectories until a successor region is entered. This is essentially the same as the first-state method and was developed independently. However, in that work the aim was simply to produce a model. The value-biasing problems of state-aliasing are unlikely to be as severe since linear interpolation occurs between regions.
In [100] and [101], by Pareigis, the local timescale of an update is halved if this causes local value estimates to increase.2 This assumes that learning at the shorter timescale yields greedy policies with locally greater values, and that the larger timescale does not lead to overestimates of state-values. Section 7.2 showed how such overestimates can occur. In genuine SMDPs, RL methods need to learn at varying timescales simply because information is received from the environment at varying intervals. Other than the first-state method, some RL methods choose to learn over variable timescales. This includes work using macro-actions [134, 39, 83, 102, 110, 143, 22, 43]. A macro-action is a prolonged action composed of several successive actions, such as another lower-level policy, or a hierarchy of policies, or some hand-coded controller. Learning in this way can result in significant speedups: return information is propagated to states many steps in the past, and committing to a fixed macro-action can aid exploration in the same way as we have seen above (i.e. by preventing dithering). In the Options framework [143] (and also the HAM methods in [102]), if the environment is a discrete MDP, speedups can be provided while also ensuring convergence to optimality by learning at the abstract and flat (MDP) levels simultaneously. Optimality follows from noting that, if actions at the MDP level have a greater value than the macro-actions, then the optimal solution is to follow these low-level actions. Q-values at the MDP level may also bootstrap from Q-values in the abstract level; eventually Q-values

2. Again, to compare the local values at different timescales, a deterministic continuous time model of the state-dynamics is assumed to be known.
for macro-actions must become as low or lower than those for taking MDP-level actions. The existing work with macro-actions still applies "single-step multi-time" learning updates (e.g. the adaptations of DP and Q-learning in Section 7.3). It seems likely that these methods might also benefit from the use of the new SMDP TD(λ) or SMDP experience stack algorithms for the same reasons that these methods help in the fixed time interval case. These are multi-step, multi-time methods in the sense that their return estimates may bootstrap from values in the entire future, rather than a small subset of it. Some macro-learning methods learn at lower levels and higher levels in parallel while higher-level policies are followed. In this case, efficient off-policy control learning methods such as those presented in Chapter 4 would seem appropriate.
Chapter 8

Summary

Chapter Outline
This chapter summarises the main contributions of the thesis, lists specific contributions and suggests directions for future research.
8.1 Review
This thesis has examined the capabilities of existing reinforcement learning algorithms, developed new algorithms that extend these capabilities where they have been found to be deficient, developed a practical understanding of the new algorithms through experiment and analysis, and has also strengthened elements of reinforcement learning theory. It has focused upon two existing problems in reinforcement learning: i) problems of off-policy learning, and ii) problems with error-minimising function approximation approaches to reinforcement learning. These are the major contributions of the thesis and are detailed below.

Off-Policy Learning. Off-policy learning methods allow agents to learn about one behaviour while following another. For control optimisation problems, agents need to evaluate the return available under the greedy policy in order to converge upon the optimal one. However, experience may be generated in fairly arbitrary ways: for example, generated by a human expert, or by a mechanism that selects actions in order to manage the exploration-exploitation trade-off. Efficient off-policy learning methods already exist in the form of backward replayed Q-learning. However, it was previously unclear how this could be applied as an online learning algorithm. Online learning is an important feature of any method which efficiently manages the exploration-exploitation trade-off. On one hand, eligibility trace methods can already be applied online and have enjoyed widespread use
as a result. However, as sound off-policy methods they can be very inefficient. Moreover, where offline learning is possible (e.g. if the environment is acyclic), it would seem that backward-replaying forward view methods is generally a preferable approach. A forwards-backwards equivalence proof demonstrates that these methods learn from essentially the same estimate of return, but the forward view is more straightforward (analytically) and also has a natural, computationally efficient implementation. Furthermore, backwards replay provides extra efficiency gains over eligibility trace methods when bootstrapping estimates of return are used (λ < 1). This comes from learning with information that is simply more up-to-date. The work with the new experience stack algorithm in Section 4.4 represents an advance by inheriting the desirable properties of backwards replay (and clarifying what these are), and also allowing for online learning. When used for off-policy greedy policy evaluation it provides advantages over Watkins' Q(λ) (and Q-learning), by allowing credit for the current reward to be propagated back further than the last non-greedy action. However, it was shown that achieving this gain is strongly dependent upon whether the Q-values used as bootstrapping value estimates are overestimates (i.e. whether they are optimistic). It was shown how optimistic initial value-functions (the rule of thumb for many exploration methods) can severely inhibit credit assignment for a variety of control-optimising RL methods. The separation of optimistic value estimates for encouraging exploration from the value estimates used as predictions of return appears to offer a solution to this problem.

Function Approximation for Reinforcement Learning. In order to scale up value-based RL methods to solve practical tasks with many-dimensional state features, or tasks with continuous (or non-discrete) state, function approximators are employed to represent value functions and Q-functions.
But many popular methods are known to suffer from instabilities, particularly when used with control-optimising RL methods or with off-policy update distributions (e.g. if making updates with experience gathered under exploring policies). The well-studied least-mean-squared error minimising gradient descent method is a famous example. It was shown how, through a new choice of error measure to minimise, this method can be made more stable. The boundedness of discounted-return estimating RL methods was shown with this function approximation method. In particular, the proof holds for off-policy Q-learning and the new experience stack algorithm – the stability of these methods with gradient descent function approximation was not previously known. However, the linear averager method appears to be a less powerful function approximation technique than the original LMS method, although it has also frequently been used successfully for RL in the past. In Section 6.2 the decision boundary partitioning (DBP) heuristic for representation discretisation was presented. The refinement criterion followed from the idea that, in continuous state-spaces, optimal problem solutions often have large areas of uniform policy. It is expected therefore that, in such cases, compact representations of optimal policies follow from attempting to represent in detail only those areas where the policy changes (decision boundaries). The major contribution here is the idea that function approximation should not be motivated by minimising the error between the learned and observed estimates of return, but by attempting to find the best action available in a state. A new method was introduced to refine the representation in areas where the greedy policy changes. An empirical test
found the method to outperform fixed uniform discretisations. Coarse representations in the initial stages allowed fast learning and good initial policy approximations to be quickly learned. The finer discretisations which followed allowed policies of better quality to be learned. The recent work by Munos and Moore (conducted independently and simultaneously) shows the DBP heuristic to find suboptimal policies. Non-local refinement is also required in order to achieve accurate value estimates, and therefore correct placement of the decision boundaries (at least for heavily bootstrapping value estimation procedures such as value-iteration). However, their method requires a model (or one to be learned) in order to be applied.
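The core of the DBP refinement test can be sketched in a few lines: a cell of the discretisation is marked for splitting when its greedy action differs from that of a neighbouring cell. The following is an illustrative sketch only, not the thesis's implementation; the Q-table, neighbourhood function and example values are invented.

```python
# Sketch of the decision boundary partitioning (DBP) idea: refine the
# discretisation only where the greedy policy changes between neighbouring
# cells. All names and values here are illustrative.

def greedy_action(q_values, cell):
    """Return the greedy action for a cell of the discretisation."""
    actions = q_values[cell]                      # dict: action -> Q-value
    return max(actions, key=actions.get)

def cells_to_refine(q_values, neighbours):
    """Mark cells lying on a decision boundary, i.e. whose greedy
    action differs from that of at least one neighbouring cell."""
    marked = set()
    for cell in q_values:
        a = greedy_action(q_values, cell)
        if any(greedy_action(q_values, n) != a for n in neighbours(cell)):
            marked.add(cell)
    return marked

# Example: a 1-D discretisation with 4 cells and 2 actions. The greedy
# action switches between cells 1 and 2, so only those cells are marked.
q = {0: {'L': 1.0, 'R': 0.0},
     1: {'L': 0.9, 'R': 0.1},
     2: {'L': 0.2, 'R': 0.8},
     3: {'L': 0.1, 'R': 0.9}}
nbrs = lambda c: [n for n in (c - 1, c + 1) if n in q]
boundary = cells_to_refine(q, nbrs)   # -> {1, 2}
```

Only the cells straddling the decision boundary are refined, which is what keeps the representation compact where the policy is uniform.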
8.2 Contributions

The following is a list of the specific contributions, in order of appearance.

In Section 2.4.3 an adaptation was made to the approximate modified policy iteration algorithm presented by Sutton and Barto in their standard text [150]. Their algorithm appears to be the first of its kind which explicitly claims to terminate and as such is of fundamental importance to the field. An oversight in their algorithm was shown using the new counterexamples in Figure 2.5. The algorithm was corrected and error bounds for the quality of the final policy were provided. A proof is provided in Appendix B which follows straightforwardly from the work of Williams and Baird [171]. The correction features in the errata of [150].

The approximate equivalence of batch-mode accumulate-trace TD(λ) and a direct return estimating algorithm is well known to the RL community – a derivation can be found in [150] for fixed λ. In an empirical demonstration in Section 3.4.9, it was shown that this equivalence does not hold in the online-updating case (even approximately so), in cases where the environment is cyclical such that the accumulating trace value grows above some threshold. This result followed from the intuitive insight that stochastic updating rules of the form Z_{t+1} = Z_t + α(z_t − Z_t), having stepsizes α greater than 2, diverge to infinity in cases where z_t is independent of Z_t.

In Section 4.2.2 modifications to Wiering's Fast Q(λ) were described where it was likely that existing published versions of this algorithm might be misinterpreted. An empirical test was performed to demonstrate the algorithm's equivalence to Q(λ). This work was published jointly with Marco Wiering as [125].

Section 4.4 introduced the Experience Stack algorithm. The existing backward replay method was adapted to allow for efficient model-free online off-policy control optimisation.
Unlike other popular online learning methods (such as eligibility trace approaches), the method directly learns from return estimates and also has a natural, computationally efficient implementation. An experimental and theoretical analysis of the algorithm's parameters provided a characterisation of when the algorithm is likely to outperform related eligibility trace methods. This work was published as [123, 121].
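The backward replay idea that the Experience Stack method builds upon can be sketched as follows: a stored trajectory is replayed in the reverse of its observed order, so each one-step Q-learning update bootstraps from successor values that have already been refreshed further down the episode. This is a minimal tabular sketch, not the thesis's algorithm; the action set and episode data are invented for illustration.

```python
# Minimal sketch of backward replay with one-step Q-learning updates.
# Replaying a trajectory in reverse lets a terminal reward propagate down
# the whole episode in a single replay pass, because each update bootstraps
# from a successor value that was itself just updated.
from collections import defaultdict

def backward_replay(q, trajectory, alpha=0.5, gamma=0.9):
    """trajectory: list of (state, action, reward, next_state) tuples in
    observed order; next_state=None marks termination."""
    for (s, a, r, s2) in reversed(trajectory):
        target = r
        if s2 is not None:
            target += gamma * max(q[(s2, b)] for b in ('L', 'R'))
        q[(s, a)] += alpha * (target - q[(s, a)])

q = defaultdict(float)
# A 3-step episode ending in a reward of 1 (hypothetical data).
episode = [(0, 'R', 0.0, 1), (1, 'R', 0.0, 2), (2, 'R', 1.0, None)]
backward_replay(q, episode)
# After one reverse pass, credit has already reached the first state
# (q[(0, 'R')] > 0); forward-order replay would need three passes for this.
```

The efficiency gain over eligibility traces when λ < 1 comes from exactly this effect: the bootstrapping values used by earlier states are already up-to-date.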
In Section 4.7 optimistic initial value-functions were found to severely inhibit the error-reducing abilities of greedy-policy evaluating RL methods. It was also seen how exploration methods that employ optimism to encourage exploration can avoid these problems by separating return predictions from the optimistic value estimates used to encourage exploration. This work was published as [120, 122].

In Section 5.7 a "linear averager" value function approximation scheme was formalised. The approximation scheme is already used for reinforcement learning and differs from the well-studied incremental least mean square (LMS) gradient descent scheme only in the error measure being minimised. A proof of finite (but possibly very large) error in the value function was shown for all discounted-return estimating RL algorithms when employing a linear averager for function approximation. Notably, the proof covers new cases such as Q-learning with arbitrary experience distributions (i.e. arbitrary exploration). Examples of divergence in this case exist for the LMS method. This work was published as [124].

Section 6.2 introduced the decision boundary partitioning (DBP) heuristic for representation refinement based upon changes in the greedy action. This work was published as [117, 119, 115].

In Chapter 7 an analysis of the biasing problems associated with bootstrapping algorithms in discretised continuous state spaces was performed. A generic RL algorithm modification was suggested to reduce this bias by attempting to learn the expected first-state values of continuous regions. Some bias reduction and policy quality improvements were observed, but most improvements could be attributed either to following a policy which commits to a single action throughout a region, or to related problems associated with learning with large learning rates.

In Appendix C, accumulate-trace TD(λ) was adapted to the SMDP case.
An equivalence with a forward-view SMDP method was established for the batch update and acyclic process case by adapting the proof method for the MDP case found in Sutton and Barto's standard text [150].
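The divergence insight noted in the contributions above – that updates of the form Z_{t+1} = Z_t + α(z_t − Z_t) diverge for stepsizes α > 2 when z_t is independent of Z_t – is easy to verify numerically. The constants below are arbitrary illustrations.

```python
# Numerical illustration: the update Z <- Z + alpha*(z - Z) multiplies the
# error (Z - z) by (1 - alpha) each step. For 0 < alpha < 2 the factor has
# magnitude below 1 and Z converges to z; for alpha > 2 the magnitude
# exceeds 1, so the error grows geometrically and Z diverges.
def run(alpha, z=1.0, z0=0.0, steps=50):
    Z = z0
    for _ in range(steps):
        Z += alpha * (z - Z)
    return Z

stable = run(alpha=0.5)    # error shrinks as 0.5**t, Z -> 1
unstable = run(alpha=2.5)  # error grows as (-1.5)**t, |Z| explodes
```

With a constant target the effect is transparent; in online TD(λ) with accumulating traces, the effective stepsize α·e_t(s) can exceed 2 whenever the trace grows large in a cyclic environment, which is the mechanism behind the Section 3.4.9 result.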
8.3 Future Directions

Following the advances made in this thesis, a number of questions and avenues for future research arise.

Experience Stack Reinforcement Learning. Further work with the Experience Stack method may yield further refinements to the algorithm. For example, the use of a stack to store experience sequences was introduced to allow the sequences to be replayed in the reverse of their observed order. Other methods could replay the sequences in different orders, so that the amount of experience replayed is minimised, or so that the number of states that are no longer considered for further updating is minimised. Also, the B_max parameter could be replaced by a heuristic that decides whether to immediately replay experience based upon a measure of the benefit to the exploration strategy that experience replay may yield.
Other extensions might take ideas from Lin's original formulation (and also Cichosz' Replayed TTD(λ)), where the same experience is replayed several times over. This could also be done here although, as in the related work, at an increased computational cost and with an increased recency bias in the learned values. Whether these changes would lead to improved performance could be the subject of further study. The most pressing extension to the experience stack method is its adaptation for use with parameter-based function approximators (such as the CMAC). Here the major issue is how to decide when to replay experience, since exact state revisits rarely occur as in the MDP/table-lookup case. A possible solution is to record the potential scale of change in a parameter's value that is possible if the stored experience is replayed.

There are many algorithms that one may choose to apply in solving RL problems. Which should be used and when? In particular, for control optimisation there are algorithms which evaluate the greedy policy (e.g. Q-learning, Watkins' Q(λ), value-iteration). Algorithms for evaluating fixed policies (e.g. TD(λ), SARSA(λ) and DP policy evaluation methods) may also be used for control by assuming that an evaluation of a fixed policy is sought, and then making this policy progressively more greedy. The subtle difference is that fixed policy evaluation methods seem likely to quickly eliminate unhelpful optimistic biases, since their initial fixed policy has a value function which is less than or equal to the optimal one in every state. However, while these methods are spending time evaluating a fixed policy, they are not necessarily improving their policy. With this in mind, future work might aim to examine optimal ways of selecting how greedy the policy under evaluation should be made in order to reduce value-function error at the fastest possible rate.
Initial work in this direction might examine the differences between policy-iteration and value-iteration and seek hybrid approaches (similar to Puterman's modified policy-iteration [114]).

Exploitation of the Optimistic Bias Problem. Also, it remains to be seen whether, following from the dual update results in Section 4.7.3, better exploration strategies can be developed. Improvements could be expected to follow from providing exploration schemes with more accurate value estimates.
Non-Orthogonal Partitioning Representations. The grid-like partitionings of kd-trees seem unlikely to allow methods employing them to scale well in many problems with very high dimensional state-spaces. In high dimensional spaces, important features (such as decision boundaries, or the Parti-Game's win-lose boundary) may be of a low dimensionality but run diagonally across many dimensions. In this case, partitionings may be required in every dimension to adequately represent the important features, and the total representation cost may grow exponentially with the dimensionality of the state-space. The inability to efficiently represent simple features such as diagonal planes follows from the fact that the kd-tree makes splits that are orthogonal to all but one axis (i.e. the resolution is increased in only one dimension per split). To alleviate this, non-orthogonal partitioning could be employed. For instance, partitionings may be defined by arbitrarily placed hyperplanes, thus allowing arbitrary planar features to be represented more efficiently.
Exploration with Adaptive Representations. Where systems with unknown dynamics must be controlled, RL methods always face the exploration-exploitation tradeoff. Most of the work concerned with exploration appears to have focused on the case where the environment is a small discrete MDP. How best to explore continuous state-spaces remains a difficult problem, but it is one for which we may be able to make additional assumptions that are not possible, or reasonable, in the discrete MDP case (e.g. that similar states have similar values or similar dynamics). Where adaptive representations are employed, exploration may be required to explore the finer control possible at higher resolutions. However, how the relative importance of exploring different parts of the space should be measured is not at all clear. In particular, the "prior" commonly used by many MDP exploration methods is to assume that any untried action leads directly to the highest possible valued state. This seems unreasonable for the Q-values of newly split regions since, intuitively, the coarser representation should provide some information about the values at the finer resolution.
8.4 Concluding Remarks

Over the history of reinforcement learning there have been a number of truly outstanding practical applications. Yet these reports remain in the minority. Much of the work, like the contributions made here, is concerned with expanding the fringes of theory and understanding in incremental ways. Most work considers example "toy" problems that serve well in demonstrating how new methods work where the old ones do not, how the behaviour of a particular method varies in interesting ways with the adjustment of some parameter, or in showing some formal proof about behaviour. The use of toy problems is to be expected in any work which tackles such difficult and general problems as those which reinforcement learning aims to solve. Even so, the future challenge for reinforcement learning lies in proving itself in the real world. Its widespread practical usefulness needs to be placed beyond question, in ways similar to that which has been achieved by expert systems, pattern recognition and genetic algorithms. This can only be done by finding real problems that people have, and applying reinforcement learning to solve them.
Appendix A
Foundation Theory of Dynamic Programming

This appendix presents some fundamental theorems and notation from the field of Dynamic Programming.
A.1 Full Backup Operators

This section introduces a notation for the backup operators introduced in Chapter 2. $B^\pi$ represents an evaluation of a policy $\pi$ using one-step lookahead:

$$B^\pi \hat{V}(s) = E\big[ r_{t+1} + \gamma \hat{V}(s_{t+1}) \mid s_t = s, \pi \big] \qquad (A.1)$$
$$\phantom{B^\pi \hat{V}(s)} = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma \hat{V}(s') \big] \qquad (A.2)$$

$B^*$ represents an evaluation of a greedy policy using one-step lookahead:

$$B^* \hat{V}(s) = \max_a E\big[ r_{t+1} + \gamma \hat{V}(s_{t+1}) \mid s_t = s, a_t = a \big] \qquad (A.3)$$
$$\phantom{B^* \hat{V}(s)} = \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma \hat{V}(s') \big] \qquad (A.4)$$

$B^\pi$ and $B^*$ are bootstrapping operators – they form new value estimates based upon existing value estimates. $B\hat{V}$ is a shorthand for a synchronous update sweep across all states (see Section 2.3.2).
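The two backup operators are concrete enough to express directly over an explicit model. The following sketch uses a made-up two-state MDP for illustration; the transition-table encoding is an assumption, not the thesis's own code.

```python
# Sketch of the full backup operators B^pi (A.2) and B* (A.4) over an
# explicit model. P[s][a] is a list of (prob, next_state, reward) triples;
# the two-state MDP is a made-up example.
GAMMA = 0.9

P = {0: {'stay': [(1.0, 0, 0.0)], 'go': [(1.0, 1, 1.0)]},
     1: {'stay': [(1.0, 1, 0.0)], 'go': [(1.0, 0, 0.0)]}}

def backup_pi(V, policy):
    """B^pi: one-step lookahead under a (deterministic) policy."""
    return {s: sum(p * (r + GAMMA * V[s2])
                   for (p, s2, r) in P[s][policy[s]]) for s in P}

def backup_star(V):
    """B*: one-step lookahead under the greedy policy."""
    return {s: max(sum(p * (r + GAMMA * V[s2]) for (p, s2, r) in P[s][a])
                   for a in P[s]) for s in P}

V = {0: 0.0, 1: 0.0}
V = backup_star(V)   # one synchronous sweep: V == {0: 1.0, 1: 0.0}
```

Repeatedly applying `backup_star` is synchronous value-iteration; repeatedly applying `backup_pi` is synchronous policy evaluation.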
A.2 Unique Fixed-Points and Optima

It was shown by Bellman that $V^*$, the value function for the optimal policy, is the unique fixed point of $B^*$ [16]. That is to say, if $\hat{V} = B^*\hat{V}$ then $\hat{V} = V^*$ and $\hat{V}$ is optimal. Similarly, if $\hat{V} = B^\pi\hat{V}$ then $\hat{V} = V^\pi$.
A.3 Norm Measures

The norm operator is denoted $\|X\|$ and represents some arbitrary distance measure given by the size of the vector $X = (x_1, \ldots, x_n)$. Of interest is,

$$\|X\|_\infty = \max_i |x_i|, \qquad (A.5)$$

the maximum-norm distance. The max-norm measure is of interest in dynamic programming as it provides a useful measure of the error in a value function. In particular,

$$\|V^* - \hat{V}\|_\infty = \max_s \big| V^*(s) - \hat{V}(s) \big| \qquad (A.6)$$

is a Bellman Error or Bellman Residual, and is a measure of the largest difference between an optimal and estimated value function.
A.4 Contraction Mappings

The backup operators $B^\pi$ and $B^*$ are contraction mappings. That is to say that they monotonically reduce the error in the value estimate. The following proof was first established by Bellman:¹

$$\|B^*\hat{V} - V^*\|_\infty \le \gamma \|\hat{V} - V^*\|_\infty \qquad (A.7)$$

Proof: Since $B^*V^* = V^*$,

$$\begin{aligned}
\|B^*\hat{V} - B^*V^*\|_\infty
&= \max_s \big| B^*\hat{V}(s) - B^*V^*(s) \big| \\
&= \max_s \Big| \max_a \sum_{s'} P^a_{ss'}\big(R^a_{ss'} + \gamma\hat{V}(s')\big) - \max_a \sum_{s'} P^a_{ss'}\big(R^a_{ss'} + \gamma V^*(s')\big) \Big| \\
&\le \max_s \max_a \Big| \sum_{s'} P^a_{ss'}\big(R^a_{ss'} + \gamma\hat{V}(s')\big) - \sum_{s'} P^a_{ss'}\big(R^a_{ss'} + \gamma V^*(s')\big) \Big| \\
&= \max_s \max_a \gamma \Big| \sum_{s'} P^a_{ss'}\big(\hat{V}(s') - V^*(s')\big) \Big| \\
&\le \gamma \max_{s'} \big| \hat{V}(s') - V^*(s') \big| \\
&= \gamma \|\hat{V} - V^*\|_\infty
\end{aligned}$$

Using a similar method it can be shown that,

$$\|B^\pi\hat{V} - V^\pi\|_\infty \le \gamma \|\hat{V} - V^\pi\|_\infty \qquad (A.8)$$

¹This version is taken from [167].
A.4.1 Bellman Residual Reduction

The following bound follows from the above contraction mapping A.8 [172, 171, 19, 21]:

$$\|V^\pi - \hat{V}\|_\infty \le \frac{\|B^\pi\hat{V} - \hat{V}\|_\infty}{1 - \gamma} \qquad (A.9)$$

Proof: By the triangle inequality,

$$\|V^\pi - \hat{V}\|_\infty \le \|V^\pi - B^\pi\hat{V}\|_\infty + \|B^\pi\hat{V} - \hat{V}\|_\infty \le \gamma \|V^\pi - \hat{V}\|_\infty + \|B^\pi\hat{V} - \hat{V}\|_\infty,$$

from which it follows that,

$$\|V^\pi - \hat{V}\|_\infty \le \frac{\|B^\pi\hat{V} - \hat{V}\|_\infty}{1 - \gamma}.$$

Using the same method, it can be shown that,

$$\|V^* - \hat{V}\|_\infty \le \frac{\|B^*\hat{V} - \hat{V}\|_\infty}{1 - \gamma}. \qquad (A.10)$$

These bounds provide useful practical stopping conditions for DP algorithms, since the right-hand-sides can be found without knowledge of $V^\pi$ or $V^*$.
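The practical use of bound (A.10) can be sketched as a value-iteration stopping rule: sweep with $B^*$ until the Bellman residual is small enough that the bound guarantees a target accuracy. The tiny MDP below is a made-up illustration, not an example from the thesis.

```python
# Sketch: value iteration using the Bellman residual ||B*V - V||_inf as a
# stopping condition. By bound (A.10), stopping once the residual is at
# most eps*(1 - gamma) guarantees ||V* - V||_inf <= eps.
GAMMA = 0.5

# P[s][a] = list of (prob, next_state, reward); a made-up two-state MDP
# whose optimal values are V*(1) = 2/(1-0.5) = 4 and V*(0) = 1 + 0.5*4 = 3.
P = {0: {'stay': [(1.0, 0, 0.0)], 'go': [(1.0, 1, 1.0)]},
     1: {'stay': [(1.0, 1, 2.0)]}}

def value_iteration(eps=1e-6):
    V = {s: 0.0 for s in P}
    while True:
        newV = {s: max(sum(p * (r + GAMMA * V[s2]) for (p, s2, r) in P[s][a])
                       for a in P[s]) for s in P}
        residual = max(abs(newV[s] - V[s]) for s in P)   # ||B*V - V||_inf
        V = newV
        if residual <= eps * (1 - GAMMA):   # bound (A.10) now gives eps
            return V

V = value_iteration()
```

The residual is computable from successive sweeps alone, which is exactly why (A.10) makes a usable stopping condition where the unknown $V^*$ does not.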
Appendix B
Modified Policy Iteration Termination

This appendix establishes termination conditions with error bounds for policy-iteration employing approximate policy evaluation (i.e. modified policy iteration). The reader is assumed to be familiar with the notation and results in Appendix A. First, consider the evaluate-improve steps of the inner loop of the modified policy-iteration algorithm:

    // Evaluate
    V̂ ← evaluate(π, V̂)        // Find V̂ ≈ V^π.
    // Improve
    Δ ← 0
    for each s ∈ S:
        a_g ← argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V̂(s')]
        v' ← Σ_{s'} P^{a_g}_{ss'} [R^{a_g}_{ss'} + γ V̂(s')]
        Δ ← max(Δ, |V̂(s) − v'|)
        π'(s) ← a_g

Since $v'(s) = B^*\hat{V}(s)$, at the end of this we have $\Delta = \|\hat{V} - B^*\hat{V}\|_\infty$. Thus, a bound on the error of $\hat{V}$ from $V^*$ at the end of this loop is given by Equation A.10:

$$\|V^* - \hat{V}\|_\infty \le \frac{\|\hat{V} - B^*\hat{V}\|_\infty}{1-\gamma} \qquad (B.1)$$
$$\phantom{\|V^* - \hat{V}\|_\infty} = \frac{\Delta}{1-\gamma} \qquad (B.2)$$

From Equation B.1, Williams and Baird have shown that the following bound can be placed upon the loss in return for following an improved (i.e. greedy) policy $\pi'$ derived from $\hat{V}$ [171]:

$$V^{\pi'}(s) \ge V^*(s) - \frac{2\gamma\Delta}{1-\gamma} \qquad (B.3)$$

for any state $s$. $\pi'$ is derived from $\hat{V}$ in the above algorithm. Thus we obtain the full policy-iteration algorithm with a termination threshold $\Delta_T$:

    1) do:
    2a)   V̂ ← evaluate(π, V̂)
    2b)   Δ ← 0
    2c)   for each s ∈ S:
    2c1)      a_g ← argmax_a Σ_{s'} P^a_{ss'} [R^a_{ss'} + γ V̂(s')]
    2c2)      v' ← Σ_{s'} P^{a_g}_{ss'} [R^{a_g}_{ss'} + γ V̂(s')]
    2c3)      Δ ← max(Δ, |V̂(s) − v'|)
    2c4)      π'(s) ← a_g
          Make π ← π'.
    3) while Δ > Δ_T

This algorithm guarantees that,

$$V^{\pi}(s) \ge V^*(s) - \frac{2\gamma\Delta_T}{1-\gamma} \qquad (B.4)$$

upon termination. Note that Equation B.3 does not rely upon the evaluate procedure returning an exact evaluation of $V^\pi$. Of course, termination requires that the evaluate/improve process converges upon $\hat{V} = V^*$. Puterman and Shin have established that modified policy-iteration will converge if the evaluation step applies $\hat{V} \leftarrow B^\pi\hat{V}$ a fixed number of times (i.e. at least once) [113]. In the case where step 2a) is exactly $\hat{V} \leftarrow B^*\hat{V}$, the above algorithm reduces to the synchronous value-iteration algorithm. In practice, the evaluation step does not need to perform synchronous updates, since applying $\hat{V}(s) \leftarrow B^\pi\hat{V}(s)$ at least once for each state in $S$ is generally at least as effective at reducing $\|V^\pi - \hat{V}\|$ as the synchronous backup.
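The evaluate-improve loop above can be made runnable in a few lines. This is a sketch under stated assumptions: the evaluation step applies the $B^\pi$ backup a fixed number of times $m$ (per Puterman and Shin), and the two-state MDP is invented for illustration.

```python
# Sketch of modified policy iteration with the Delta_T termination test.
# Evaluation applies the B^pi backup m times; improvement computes the
# greedy policy and the residual Delta = ||V - B*V||_inf.
GAMMA = 0.5
P = {0: {'stay': [(1.0, 0, 0.0)], 'go': [(1.0, 1, 1.0)]},
     1: {'stay': [(1.0, 1, 2.0)]}}

def q_value(V, s, a):
    return sum(p * (r + GAMMA * V[s2]) for (p, s2, r) in P[s][a])

def modified_policy_iteration(m=5, delta_T=1e-6):
    V = {s: 0.0 for s in P}
    pi = {s: next(iter(P[s])) for s in P}        # arbitrary initial policy
    while True:
        for _ in range(m):                       # approximate evaluation
            V = {s: q_value(V, s, pi[s]) for s in P}
        delta = 0.0                              # improvement step
        for s in P:
            a_g = max(P[s], key=lambda a: q_value(V, s, a))
            delta = max(delta, abs(V[s] - q_value(V, s, a_g)))
            pi[s] = a_g
        if delta <= delta_T:                     # Delta <= Delta_T: stop
            return V, pi

V, pi = modified_policy_iteration()
```

On termination, bound (B.4) guarantees the returned policy loses at most $2\gamma\Delta_T/(1-\gamma)$ return in any state. Setting $m = 1$ and evaluating with $B^*$ instead of $B^\pi$ recovers synchronous value-iteration, as noted above.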
Appendix C
Continuous Time TD(λ)

In this section, the accumulate-trace TD(λ) algorithm is derived for the discrete event, continuous time interval case. By careful choice of notation, the method found in [150] for showing the equivalence of accumulate-trace TD(λ) (the backward view) with the direct return algorithm (the forward view) may be used.

State and reward observations are discrete events occurring in continuous time. A state visit $s_t$ is a discrete event ($t \in \mathbb{N}$). For this section, $t$ identifies an event in continuous time – it is not a continuous time value itself. To simplify notation it is more convenient to identify the duration between events. Let $t_n$ identify the time between events $t$ and $t+n$. The notation differs from that in Chapter 7. Let the continuous time return estimate be defined as follows:

$$z^\lambda_t = (1 - \lambda^{t_1})\big[ r_{t+1} + \gamma^{t_1} \hat{V}(s_{t+1}) \big] + \lambda^{t_1}\big[ r_{t+1} + \gamma^{t_1} z^\lambda_{t+1} \big]$$

where $r_t$ represents the discounted reward immediately collected between $t-1$ and $t$. Then the continuous time (forward-view) estimate updates states as follows:

$$\hat{V}(s_t) \leftarrow \hat{V}(s_t) + \alpha\big( z^\lambda_t - \hat{V}(s_t) \big)$$

Consider the change in this value, based upon a single estimate of return, if the update is applied in batch-mode. (Throughout, for simplicity, $\alpha$ is assumed to be constant.)
$$\begin{aligned}
\Delta_1\hat{V}(s_t) &= \alpha\big( z^\lambda_t - \hat{V}(s_t) \big) \\
&= \alpha\Big( -\hat{V}(s_t) + (1-\lambda^{t_1})\big[r_{t+1} + \gamma^{t_1}\hat{V}(s_{t+1})\big] + \lambda^{t_1}\big[r_{t+1} + \gamma^{t_1} z^\lambda_{t+1}\big] \Big) \\
&= \alpha\Big( -\hat{V}(s_t) + r_{t+1} + \gamma^{t_1}\hat{V}(s_{t+1}) - (\gamma\lambda)^{t_1}\hat{V}(s_{t+1}) + (\gamma\lambda)^{t_1} z^\lambda_{t+1} \Big) \\
&= \alpha\Big( \big[r_{t+1} + \gamma^{t_1}\hat{V}(s_{t+1}) - \hat{V}(s_t)\big] \\
&\qquad\;\; + (\gamma\lambda)^{t_1}\big[r_{t+2} + \gamma^{(t+1)_1}\hat{V}(s_{t+2}) - \hat{V}(s_{t+1})\big] \\
&\qquad\;\; + (\gamma\lambda)^{t_2}\big[r_{t+3} + \gamma^{(t+2)_1}\hat{V}(s_{t+3}) - \hat{V}(s_{t+2})\big] \\
&\qquad\;\; + \cdots \Big)
\end{aligned}$$

where the expansion of $z^\lambda_{t+1}$ (and so on) repeats the previous two steps, using $(\gamma\lambda)^{t_1}(\gamma\lambda)^{(t+1)_1} = (\gamma\lambda)^{t_2}$ since $t_1 + (t+1)_1 = t_2$.

Let the 1-step continuous time TD error be defined as:

$$\delta_k = r_{k+1} + \gamma^{k_1}\hat{V}(s_{k+1}) - \hat{V}(s_k),$$

then,

$$\Delta_1\hat{V}(s_t) = \alpha \sum_{k=t}^{\infty} (\gamma\lambda)^{t_{k-t}} \delta_k$$

for a single return estimate (taking $t_0 = 0$). In the case where a state $s$ may be revisited several times during the episode, we have:

$$\Delta_1\hat{V}(s) = \alpha \sum_{t=0}^{\infty} I(s, s_t) \sum_{k=t}^{\infty} (\gamma\lambda)^{t_{k-t}} \delta_k \qquad (C.1)$$
$$\phantom{\Delta_1\hat{V}(s)} = \alpha \sum_{t=0}^{\infty} \sum_{k=t}^{\infty} (\gamma\lambda)^{t_{k-t}} I(s, s_t) \delta_k$$

Since $\sum_{x=L}^{H}\sum_{y=x}^{H} f(x,y) = \sum_{y=L}^{H}\sum_{x=L}^{y} f(x,y)$ for any $L$, $H$ and $f$,

$$\Delta_1\hat{V}(s) = \alpha \sum_{k=0}^{\infty} \sum_{t=0}^{k} (\gamma\lambda)^{t_{k-t}} I(s, s_t) \delta_k$$

Through reflection in the plane $x = y$, $\sum_{x=L}^{H}\sum_{y=L}^{x} f(x,y) = \sum_{y=L}^{H}\sum_{x=L}^{y} f(y,x)$ for any $L$, $H$ and $f$,

$$\Delta_1\hat{V}(s) = \alpha \sum_{t=0}^{\infty} \sum_{k=0}^{t} (\gamma\lambda)^{k_{t-k}} I(s, s_k) \delta_t = \alpha \sum_{t=0}^{\infty} \delta_t \sum_{k=0}^{t} (\gamma\lambda)^{k_{t-k}} I(s, s_k)$$

Defining an eligibility value for $s$ as:

$$e_t(s) = \sum_{k=0}^{t} (\gamma\lambda)^{k_{t-k}} I(s, s_k),$$

the eligibility traces for all states may be calculated incrementally as follows:

$$\forall s \in S, \quad e_t(s) \leftarrow \begin{cases} (\gamma\lambda)^{(t-1)_1} e_{t-1}(s) + 1, & \text{if } s = s_t, \\ (\gamma\lambda)^{(t-1)_1} e_{t-1}(s), & \text{otherwise,} \end{cases}$$

and the state values incrementally updated as follows:

$$\forall s \in S, \quad \hat{V}(s) \leftarrow \hat{V}(s) + \alpha \delta_t e_t(s).$$

As for single-step TD(λ), this forward-backward equivalence applies only for the batch updating and acyclic environment case. The equivalence is approximate for the general online-learning case, since $\hat{V}$, as seen by the TD errors, is fixed in value throughout the episode. In cases where episode lengths are finite and $s_T$ is the terminal state, since by definition $\delta_k = 0$ for $k \ge T$, (C.1) may precisely be rewritten as,

$$\Delta_1\hat{V}(s) = \alpha \sum_{t=0}^{T-1} I(s, s_t) \sum_{k=t}^{T-1} (\gamma\lambda)^{t_{k-t}} \delta_k.$$

Using a similar method to the steps following (C.1), the same update rule follows for the terminating state case as for the infinite trial case.
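The backward view derived above reduces to familiar accumulate-trace TD(λ) when every inter-event duration equals 1, so the decay between events is simply $\gamma\lambda$. The following is a sketch of that discrete-time special case; the constants and episode data are invented for illustration.

```python
# Sketch of the backward view in the discrete-time special case: every
# duration t_1 = 1, so traces decay by gamma*lambda per step. For each
# observed step: compute the 1-step TD error, decay all traces, accumulate
# the visited state's trace, then update every state in proportion to it.
ALPHA, GAMMA, LAMBDA = 0.5, 0.9, 0.8

def td_lambda_episode(V, episode):
    """episode: list of (state, reward, next_state); next_state=None ends."""
    e = {s: 0.0 for s in V}                        # accumulating traces
    for (s, r, s2) in episode:
        v_next = V[s2] if s2 is not None else 0.0
        delta = r + GAMMA * v_next - V[s]          # 1-step TD error
        for x in e:                                # decay all traces
            e[x] *= GAMMA * LAMBDA
        e[s] += 1.0                                # accumulate for s_t
        for x in V:                                # backward-view update
            V[x] += ALPHA * delta * e[x]
    return V

V = {0: 0.0, 1: 0.0}
V = td_lambda_episode(V, [(0, 0.0, 1), (1, 1.0, None)])
# The terminal reward's TD error also credits state 0 via its trace.
```

In the continuous-time case, the only change is that the per-step decay factor becomes $(\gamma\lambda)$ raised to the elapsed duration, as in the incremental trace update above.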
Appendix D
Notation, Terminology and Abbreviations

α           Learning stepsize.
α_k(s,a)    Learning stepsize at the kth update of (s,a).
β           Learning rate schedule parameter where α_k(s,a) = 1/k(s,a)^β.
Δ_opol      Allowable non-greediness threshold.
Δ_0         Initial value function error.
γ           Discount factor: discounted return = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + ⋯.
Δ_T         Small termination error threshold.
ε           Exploration parameter. Likelihood of taking a random action.
e(s)        Eligibility trace for state s.
e'(s)       Fast Q(λ) eligibility trace for state s.
E[x]        Expectation of x.
E[x|y]      Conditional expectation. Expectation of x given y.
I(a,b)      Identity function. Yields 1 if a = b and 0 otherwise.
N(s,a)      Number of times a is observed in s.
π           A policy.
π*          An optimal policy.
π_n         A nearly-greedy policy.
π_g         A greedy policy.
P^a_{ss'}   State transition probability function. Probability of entering s' after taking a in s.
P̄^a_{ss'}   Discounted state transition probability function. As P but includes the mean amount of discounting occurring between leaving s and entering s'.
Pr(x)       Probability of event x.
Pr(x|y)     Conditional probability. Probability of x given y.
Q_0         Initial Q-function estimate.
Q(s,a)      A Q-value. The long-term expected return for taking action a in state s.
Q⁺(s,a)     An up-to-date Q-value. See Fast Q(λ).
R^a_{ss'}   Expected immediate reward function for taking a in s and transiting to s'.
r_t         Immediate reward received for the action taken immediately prior to time t.
R̄^a_s       Discounted immediate reward function.
t           Discrete time index. (Or step index in the SMDP case.)
t_n         Real valued time duration.
U(s)        Generic return correction. Replace with the estimated value at s of following the evaluation policy from s (e.g. U(s) = max_a Q(s,a) for greedy policy evaluation).
V*          The value function for the optimal policy.
V^π         The value function for the policy π.
V̂           Estimate of the value function for the policy π.
V̂_0         Initial value function estimate.
X̂           Estimate of E[X].
z           Estimation target. Observed value whose mean we wish to estimate.
z^(1)       1-step corrected truncated return estimate.
z^(n)       n-step corrected truncated return estimate.
z^λ         λ return estimate.
z^(λ,n)     n-step corrected truncated λ return estimate.
x ←_y z     x ← y·x + z, where y is a global amount of decay.
δ           TD error.
←           Assignment.

backward-view                       Eligibility trace method. Updates of the form: V(s) ← V(s) + αδe(s).
greedy-action                       arg max_a Q̂(s,a).
fixed-point                         x is the fixed-point of f if x = f(x).
forward-view                        Updates of the form: V̂(s) ← V̂(s) + α(z − V̂(s)).
return method                       A forward view method.
n-step truncated return             r_{t+1} + ⋯ + γ^{n−1} r_{t+n}.
n-step truncated corrected return   r_{t+1} + ⋯ + γ^{n−1} r_{t+n} + γ^n U(s_{t+n+1}).
off-policy                          Different to the policy under evaluation.
on-policy                           As the policy under evaluation.
return correction                   U(s_{t+n+1}) in a corrected n-step truncated return.
return                              Long term measure of reward.
state                               Environmental situation.
state-space                         Set of all possible environmental situations.

BR      Backwards Replay
DBP     Decision Boundary Partitioning
DP      Dynamic Programming
FA      Function Approximator
LMSE    Least Mean Squared Error
MDP     Markov Decision Process
POMDP   Partially Observable Markov Decision Process
PW      Peng and Williams' Q(λ)
RL      Reinforcement Learning
SAP     State Action Pair
SMDP    Semi-Markov Decision Process (continuous time MDP)
TTD     Truncated TD(λ)
WAT     Watkins' Q(λ)
Bibliography

[1] C. G. Atkeson, A. W. Moore and S. Schaal. Memory-based learning for control. Technical Report CMU-RI-TR-95-18, CMU Robotics Institute, April 1995.

[2] M. A. Al-Ansari and R. J. Williams. Efficient, globally-optimized reinforcement learning with the Parti-game algorithm. In Advances in Neural Information Processing Systems 11. The MIT Press, Cambridge, MA, 1999.

[3] J. S. Albus. Data storage in the cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97(3), 1975.

[4] J. S. Albus. A new approach to manipulator control: the cerebellar model articulation controller (CMAC). Journal of Dynamic Systems, Measurement and Control, 97(3), 1975.

[5] C. Anderson. Approximating a policy can be easier than approximating a value function. Technical Report CS-00-101, Department of Computer Science, Colorado State University, CO, USA, 2000.

[6] C. Anderson and S. Crawford-Hines. Multigrid Q-learning. Technical Report CS-94-121, Colorado State University, Fort Collins, CO 80523, 1994.

[7] David Andre, Nir Friedman, and Ronald Parr. Generalized prioritized sweeping. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.

[8] Christopher G. Atkeson, Andrew W. Moore, and Stefan Schaal. Locally weighted learning. AI Review, 11:75–113, 1996.

[9] L. C. Baird and A. W. Moore. Gradient descent for general reinforcement learning. In Advances in Neural Information Processing Systems, volume 11, 1999.

[10] Leemon C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the Twelfth International Conference on Machine Learning, pages 30–77, San Francisco, 1995. Morgan Kaufmann.

[11] Leemon C. Baird. Reinforcement Learning Through Gradient Descent. PhD thesis, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, 1999. Technical Report Number CMU-CS-99-132.
[12] Andrew G. Barto, Steven J. Bradtke, and Satinder P. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995.

[13] Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning problems. IEEE Transactions on Systems, Man and Cybernetics, 13(5):834–846, September 1983.

[14] R. Beale and T. Jackson. Neural Computing: An Introduction. Institute of Physics Publishing, Bristol, UK, 1990.

[15] R. E. Bellman. Dynamic Programming. Princeton University Press, 1957.

[16] R. E. Bellman and S. E. Dreyfus. Applied Dynamic Programming. RAND Corp, 1962.

[17] D. P. Bertsekas. Distributed dynamic programming. IEEE Transactions on Automatic Control, 27:610–616, 1982.

[18] D. P. Bertsekas. Distributed asynchronous computation of fixed points. Mathematical Programming, 27:107–120, 1983.

[19] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, Englewood Cliffs, NJ, 1987.

[20] D. P. Bertsekas and J. N. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Prentice Hall, Englewood Cliffs, NJ, 1989.

[21] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic Programming. Athena Scientific, Belmont, MA, 1996.

[22] Michael Bowling and Manuela Veloso. Bounding the suboptimality of reusing subproblems. In Proceedings of IJCAI-99, 1999.

[23] Justin Boyan and Andrew Moore. Robust value function approximation by working backwards. In Proceedings of the Workshop on Value Function Approximation, Machine Learning Conference, Tahoe City, California, July 9, 1995.

[24] Justin A. Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. In Proceedings of Neural Information Processing Systems, volume 7. Morgan Kaufmann, January 1995.

[25] Steven J. Bradtke and Michael O. Duff. Reinforcement learning for continuous-time Markov decision problems. In Advances in Neural Information Processing Systems, volume 7, pages 393–400, 1995.

[26] P. V. C. Caironi and M. Dorigo. Training Q agents. Technical Report IRIDIA-94-14, Universite Libre de Bruxelles, 1994.

[27] Anthony R. Cassandra. Exact and Approximate Algorithms for Partially Observable Markov Decision Processes. PhD thesis, Brown University, Department of Computer Science, Providence, RI, 1998.
[28] David Chapman and Leslie Pack Kaelbling. Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proceedings of the Twelfth International Joint Conference on Arti cial Intelligence, pages 726{731. Morgan Kaufmann, San Mateo, CA, 1991. [29] C. S. Chow and J. N. Tsitsiklis. An optimal oneway multigrid algorithm for discrete{ time stochastic control. IEEE Transactions on Automatic Control, 36:898{914, 1991. [30] Pawel Cichosz. Truncated temporal dierences and sequential replay: Comparison, integration, and experiments. In Proceedings of the Poster Session of the Ninth International Symposium on Methodologies for Intelligent Systems, 1996. [31] Pawel Cichosz. Reinforcement Learning by Truncating Temporal Dierences. PhD thesis, Warsaw University of Technology, Poland, July 1997. [32] Pawel Cichosz. TD() learning without eligibility traces: A theoretical analysis. Arti cial Intelligence, 11:239{263, 1999. [33] Pawel Cichosz. A forwards view of replacing eligibility traces for states and stateaction pairs. Mathematical Algorithms, 1:283{297, 2000. [34] Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. Introduction To Algorithms. The MIT Press, Cambridge, Massachusetts, 1990. [35] Richard Dearden Craig Boutilier and Moises Goldszmidt. Stochastic dynamic programming with factored representations. Arti cial Intelligence. To appear. [36] Robert H. Crites. LargeScale Dynamic Optimization Using Teams Of Reinforcement Learning Agents. PhD thesis, (Computer Science) Graduate School of the University of Massachusetts, Amherst, September 1996. [37] Scott Davies. Multidimensional triangulation and interpolation for reinforcement learning. In Advances in Neural Information Processing Systems, volume 9, 1996. [38] P. Dayan. The convergence of TD() for general . Machine Learning, 8:341{362, 1992. [39] P. Dayan. Improving generalisation for temporal dierence learning: The successor representation. 
Neural Computation, 5:613–624, 1993.
[40] Richard Dearden, Nir Friedman, and David Andre. Model based Bayesian exploration. In Proceedings of UAI-99, Stockholm, Sweden, 1999.
[41] Richard Dearden, Nir Friedman, and Stuart Russell. Bayesian Q-learning. In Proceedings of AAAI-98, Madison, WI, 1998.
[42] Morris H. DeGroot. Probability and Statistics. Addison Wesley, 2nd edition, 1989.
[43] Thomas G. Dietterich. State abstraction in MAXQ hierarchical reinforcement learning. In Advances in Neural Information Processing Systems, volume 12. The MIT Press, 2000.
[44] Kenji Doya. Temporal difference learning in continuous time and space. In Advances in Neural Information Processing Systems, volume 8, pages 1073–1079, 1996.
[45] P. Dupuis and M. R. James. Rates of convergence for approximation schemes in optimal control. SIAM Journal of Control and Optimisation, 36(2), 1998.
[46] Fernando Fernandez and Daniel Borrajo. VQQL: Applying vector quantization to reinforcement learning. In M. Veloso, E. Pagello, and Hiroaki Kitano, editors, RoboCup-99: Robot Soccer World Cup III, number 1856 in Lecture Notes in Artificial Intelligence, pages 171–178. Springer, 2000.
[47] Jerome H. Friedman, Jon L. Bentley, and Raphael A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3(3):209–226, September 1977.
[48] G. J. Gordon. Stable function approximation in dynamic programming. In Armand Prieditis and Stuart Russell, editors, Proceedings of the Twelfth International Conference on Machine Learning, pages 261–268, San Francisco, CA, 1995. Morgan Kaufmann.
[49] Geoffrey J. Gordon. Online fitted reinforcement learning. In Value Function Approximation Workshop at ML-95, 1995.
[50] Geoffrey J. Gordon. Chattering in SARSA(λ). CMU Learning Lab internal report. Available from http://www2.cs.cmu.edu/~ggordon/, 1996.
[51] Geoffrey J. Gordon. Reinforcement learning with function approximation converges to a region. In Advances in Neural Information Processing Systems, volume 12. The MIT Press, 2000.
[52] W. Hackbusch. Multigrid Methods and Applications. Springer-Verlag, 1985.
[53] M. Hauskrecht, N. Meuleau, C. Boutilier, L. Pack Kaelbling, and T. Dean. Hierarchical solution of Markov decision processes using macro-actions. In Proceedings of the 1998 Conference on Uncertainty in Artificial Intelligence, Madison, Wisconsin, 1998.
[54] Robert B. Heckendorn and Charles W. Anderson. A multigrid form of value iteration applied to a Markov decision process.
Technical Report CS-98-113, Computer Science Department, Colorado State University, Fort Collins, CO 80523, November 1998.
[55] John H. Holland, Lashon B. Booker, Marco Colombetti, Marco Dorigo, David E. Goldberg, Stephanie Forrest, Rick L. Riolo, Robert E. Smith, Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson. What is a Learning Classifier System? In Pier Luca Lanzi, Wolfgang Stolzmann, and Stewart W. Wilson, editors, Learning Classifier Systems: From Foundations to Applications, volume 1813 of LNAI, pages 3–32, Berlin, 2000. Springer-Verlag.
[56] Ronald A. Howard. Dynamic Programming and Markov Decision Processes. The MIT Press, Cambridge, Massachusetts, 1960.
[57] Mark Humphrys. Action selection methods using reinforcement learning. In From Animals to Animats 4: Proceedings of the Fourth International Conference on Simulation of Adaptive Behavior, volume 4, pages 135–144. MIT Press/Bradford Books, MA, USA, 1996.
[58] Mark Humphrys. Action Selection Methods Using Reinforcement Learning. PhD thesis, Trinity Hall, University of Cambridge, June 1997.
[59] T. Jaakkola, M. Jordan, and S. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6(6):1185–1201, 1994.
[60] Tommi Jaakkola, Satinder P. Singh, and Michael I. Jordan. Reinforcement learning algorithms for partially observable Markov problems. In Advances in Neural Information Processing Systems, volume 7, 1995.
[61] A. Bryson Jr. and Y. Ho. Applied Optimal Control. Hemisphere Publishing, New York, 1975.
[62] Leslie Pack Kaelbling. Learning in Embedded Systems. PhD thesis, Department of Computer Science, Stanford University, Stanford, CA, 1990.
[63] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237–285, 1996.
[64] Keiko Motoyama, Keiji Suzuki, Masahito Yamamoto, and Azuma Ohuchi. Evolutionary state space configuration with reinforcement learning for adaptive airship control. In The Third Australia-Japan Workshop on Intelligent and Evolutionary Systems (Proceedings), 1999.
[65] S. Koenig and R. G. Simmons. The effect of representation and knowledge on goal-directed exploration with reinforcement-learning algorithms. Machine Learning, 22:228–250, 1996.
[66] R. E. Korf. Real-time heuristic search. Artificial Intelligence, 42:189–221, 1990.
[67] J. R. Krebs, A. Kacelnik, and P. Taylor. Test of optimal sampling by foraging great tits. Nature, 275(5675):27–31, 1978.
[68] R. Kretchmar and C. Anderson. Comparison of CMACs and radial basis functions for local function approximators in reinforcement learning.
In Proceedings of the IEEE International Conference on Neural Networks, Houston, TX, pages 834–837, 1997.
[69] H. J. Kushner and P. Dupuis. Numerical Methods for Stochastic Control Problems in Continuous Time. Applications of Mathematics. Springer-Verlag, 1992.
[70] Leonid Kuvayev and Richard Sutton. Approximation in model-based learning. In ICML'97 Workshop on Modelling in Reinforcement Learning, 1997.
[71] C. Lin and H. Kim. CMAC-based adaptive critic self-learning control. IEEE Transactions on Neural Networks, 2:530–533, 1991.
[72] L. J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8:293–321, 1992.
[73] Long-Ji Lin. Scaling up reinforcement learning for robot control. In Proceedings of the Tenth International Conference on Machine Learning, pages 182–189, Amherst, MA, June 1993. Morgan Kaufmann.
[74] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh International Conference on Uncertainty in Artificial Intelligence, page 9, 1995.
[75] S. Mahadevan. Average reward reinforcement learning: Foundations, algorithms and empirical results. Machine Learning, 22:159–196, 1996.
[76] S. Mahadevan and J. Connell. Automatic programming of behavior-based robots. Artificial Intelligence, 55(2–3):311–365, June 1992.
[77] Yishay Mansour and Satinder Singh. On the complexity of policy iteration. In Uncertainty in Artificial Intelligence, 1999.
[78] J. J. Martin. Bayesian Decision Problems and Markov Chains. John Wiley and Sons, New York, New York, 1969.
[79] Maja J. Mataric. Interaction and Intelligent Behavior. PhD thesis, MIT AI Lab, August 1994. AITR-1495.
[80] John H. Mathews. Numerical Methods for Mathematics, Science and Engineering. Prentice Hall, London, UK, 1995.
[81] Andrew McCallum. Instance-based utile distinctions for reinforcement learning. In Proceedings of the Twelfth International Conference on Machine Learning, San Francisco, 1995. Morgan Kaufmann.
[82] Andrew K. McCallum. Reinforcement Learning with Selective Perception and Hidden State. PhD thesis, Department of Computer Science, University of Rochester, Rochester, NY 14627, USA, 1995.
[83] Amy McGovern, Richard S. Sutton, and Andrew H. Fagg. Roles of macro-actions in accelerating reinforcement learning. In 1997 Grace Hopper Celebration of Women in Computing, 1997.
[84] C. Melhuish and T. C. Fogarty. Applying a restricted mating policy to determine state space niches using delayed reinforcement.
In T. C. Fogarty, editor, Proceedings of the Evolutionary Computing, Artificial Intelligence and the Simulation of Behaviour Workshop, pages 224–237. Springer-Verlag, 1994.
[85] Nicolas Meuleau and Paul Bourgine. Exploration of multi-state environments: Local measures and back-propagation of uncertainty. Machine Learning, 35(2):117–154, May 1999.
[86] A. W. Moore and C. G. Atkeson. The Parti-game algorithm for variable resolution reinforcement learning in multidimensional state-spaces. Machine Learning, 21:199–233, 1995.
[87] Andrew W. Moore. Variable resolution dynamic programming: Efficiently learning action maps on multivariate real-valued state-spaces. In L. Birnbaum and G. Collins, editors, Proceedings of the Eighth International Conference on Machine Learning. Morgan Kaufmann, June 1991.
[88] Andrew W. Moore and Christopher G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130, 1994.
[89] Andrew William Moore. Efficient Memory-based Learning for Robot Control. PhD thesis, University of Cambridge, Computer Laboratory, November 1990.
[90] K. Muller, S. Mika, G. Ratsch, K. Tsuda, and B. Scholkopf. An introduction to kernel-based methods. IEEE Transactions on Neural Networks, 12(2):181–202, March 2001.
[91] Remi Munos and Paul Bourgine. Reinforcement learning for continuous stochastic control problems. In Michael I. Jordan, Michael J. Kearns, and Sara A. Solla, editors, Advances in Neural Information Processing Systems, volume 10. The MIT Press, 1998.
[92] Remi Munos and Andrew Moore. Variable resolution discretization in optimal control. Machine Learning. To appear.
[93] Remi Munos and Andrew Moore. Barycentric interpolators for continuous space and time reinforcement learning. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Advances in Neural Information Processing Systems, volume 11. The MIT Press, 1999.
[94] Remi Munos and Andrew Moore. Influence and variance of a Markov chain: Application to adaptive discretization in optimal control. In IEEE Conference on Decision and Control, 1999.
[95] Remi Munos and Andrew Moore. Variable resolution discretization for high-accuracy solutions of optimal control problems. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, pages 1348–1355, 1999.
[96] Remi Munos and Jocelyn Patinel. Reinforcement learning with dynamic covering of state-action space: Partitioning Q-learning. In From Animals to Animats 3: Proceedings of the International Conference on Simulation of Adaptive Behavior, 1994.
[97] D. Ormoneit and S. Sen. Kernel-based reinforcement learning. Machine Learning, 42:241–267, 2001.
[98] Mark J. L. Orr. Introduction to radial basis function networks. Technical report, Institute for Adaptive Neural Computation, Division of Informatics, University of Edinburgh, 1996. http://www.anc.ed.ac.uk/~mjo/rbf.html.
[99] Mark J. L. Orr. Recent advances in radial basis function networks. Technical report, Institute for Adaptive Neural Computation, Division of Informatics, University of Edinburgh, 1999. http://www.anc.ed.ac.uk/~mjo/rbf.html.
[100] S. Pareigis. Adaptive choice of grid and time in reinforcement learning. In Advances in Neural Information Processing Systems, volume 10. The MIT Press, Cambridge, MA, 1997.
[101] S. Pareigis. Multigrid methods for reinforcement learning in controlled diffusion processes. In Advances in Neural Information Processing Systems, volume 9. The MIT Press, Cambridge, MA, 1998.
[102] Ronald Parr and Stuart Russell. Reinforcement learning with hierarchies of machines. In Advances in Neural Information Processing Systems, volume 10, 1997.
[103] M. D. Pendrith and M. R. K. Ryan. Actual return reinforcement learning versus temporal differences: Some theoretical and experimental results. In The Thirteenth International Conference on Machine Learning. Morgan Kaufmann, 1996.
[104] M. D. Pendrith and M. R. K. Ryan. C-Trace: A new algorithm for reinforcement learning of robotic control. In ROBOLEARN-96, Key West, Florida, 19–20 May 1996.
[105] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behaviour, 2:437–454, 1993.
[106] J. Peng and R. J. Williams. Incremental multi-step Q-learning. Machine Learning, 22:283–290, 1996.
[107] Jing Peng and Ronald J. Williams. Incremental multi-step Q-learning. In W. Cohen and H. Hirsh, editors, Proceedings of the 11th International Conference on Machine Learning, pages 226–232. Morgan Kaufmann, San Francisco, 1994.
[108] Larry Peterson and Bruce Davie. Computer Networks: A Systems Approach. Morgan Kaufmann, 2nd edition, 2000.
[109] D. Precup and R. Sutton. Multi-time models for temporally abstract planning. In Advances in Neural Information Processing Systems, volume 10, 1998.
[110] D. Precup and R. S. Sutton. Multi-time models for reinforcement learning. In Proceedings of the ICML'97 Workshop on Modelling in Reinforcement Learning, 1997.
[111] D. Precup, R. S. Sutton, and S. Singh. Eligibility trace methods for off-policy evaluation. In Proceedings of the 17th International Conference on Machine Learning.
Morgan Kaufmann, 2000.
[112] Bob Price and Craig Boutilier. Implicit imitation in multiagent reinforcement learning. In Proceedings of the 16th International Conference on Machine Learning, 1999.
[113] M. L. Puterman and M. C. Shin. Modified policy iteration algorithms for discounted Markov decision problems. Management Science, 24:1127–1137, 1978.
[114] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc., New York, New York, 1994.
[115] Stuart Reynolds. Decision boundary partitioning: Variable resolution model-free reinforcement learning. Technical Report CSRP-99-15, School of Computer Science, The University of Birmingham, Birmingham, B15 2TT, UK, July 1999. ftp://ftp.cs.bham.ac.uk/pub/techreports/1999/CSRP9915.ps.gz.
[116] Stuart I. Reynolds. Issues in adaptive representation reinforcement learning. Presentation at the 4th European Workshop on Reinforcement Learning, Lugano, Switzerland, October 1999.
[117] Stuart I. Reynolds. Decision boundary partitioning: Variable resolution model-free reinforcement learning. In Proceedings of the Seventeenth International Conference on Machine Learning, pages 783–790, San Francisco, 2000. Morgan Kaufmann. http://www.cs.bham.ac.uk/~sir/pub/ml2k DBP.ps.gz.
[118] Stuart I. Reynolds. A description of state dynamics and experiment parameters for the hoverbeam task. Unpublished Technical Report, http://www.cs.bham.ac.uk/~sir/pub/, April 2000.
[119] Stuart I. Reynolds. Adaptive representation methods for reinforcement learning. In Advances in Artificial Intelligence, Proceedings of AI-2001, Ottawa, Canada, Lecture Notes in Artificial Intelligence (LNAI 2056), pages 345–348. Springer-Verlag, June 2001. http://www.cs.bham.ac.uk/~sir/pub/ai2001.ps.gz.
[120] Stuart I. Reynolds. The curse of optimism. In Proceedings of the Fifth European Workshop on Reinforcement Learning, Utrecht, The Netherlands, pages 38–39, October 2001. http://www.cs.bham.ac.uk/~sir/pub/EWRL5 opt.ps.gz.
[121] Stuart I. Reynolds. Experience stack reinforcement learning: An online forward return method. In Proceedings of the Fifth European Workshop on Reinforcement Learning, Utrecht, The Netherlands, pages 40–41, October 2001. http://www.cs.bham.ac.uk/~sir/pub/EWRL5 stack.ps.gz.
[122] Stuart I. Reynolds. Optimistic initial Q-values and the max operator. In Qiang Shen, editor, Proceedings of the UK Workshop on Computational Intelligence, Edinburgh, UK, pages 63–68.
The University of Edinburgh Printing Services, September 2001. http://www.cs.bham.ac.uk/~sir/pub/UKCI01.ps.gz.
[123] Stuart I. Reynolds. Experience stack reinforcement learning for off-policy control. Technical Report CSRP-02-1, School of Computer Science, University of Birmingham, January 2002. http://www.cs.bham.ac.uk/~sir/pub/ESCSRP021.ps.gz.
[124] Stuart I. Reynolds. The stability of general discounted reinforcement learning with linear function approximation. In John Bullinaria, editor, Proceedings of the UK Workshop on Computational Intelligence, Birmingham, UK, pages 139–146, September 2002. http://www.cs.bham.ac.uk/~sir/pub/ukci02.ps.gz.
[125] Stuart I. Reynolds and Marco A. Wiering. Fast Q(λ) revisited. Technical Report CSRP-02-2, School of Computer Science, University of Birmingham, May 2002. http://www.cs.bham.ac.uk/~sir/pub/fastqCSRP022.ps.gz.
[126] H. Robbins and S. Monro. A stochastic approximation method. Annals of Mathematical Statistics, 22:400–407, 1951.
[127] David E. Rumelhart, James L. McClelland, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1: Foundations. The MIT Press, Cambridge, MA, 1986.
[128] G. A. Rummery and M. Niranjan. On-line Q-learning using connectionist systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, September 1994.
[129] Gavin A. Rummery. Problem Solving with Reinforcement Learning. PhD thesis, Department of Engineering, University of Cambridge, July 1995.
[130] Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, London, UK, 1995.
[131] Juan Carlos Santamaria, Richard Sutton, and Ashwin Ram. Experiments with reinforcement learning in problems with continuous state and action spaces. Adaptive Behavior, 6(2), 1998.
[132] A. Schwartz. A reinforcement learning algorithm for maximizing undiscounted rewards. In Proceedings of the Tenth International Conference on Machine Learning, pages 298–305. Morgan Kaufmann, San Mateo, CA, June 1993.
[133] J. Simons, H. Van Brussel, J. De Schutter, and J. Verhaert. A self-learning automaton with variable resolution for high precision assembly by industrial robots. IEEE Transactions on Automatic Control, 27(5):1109–1113, October 1982.
[134] S. Singh. Scaling reinforcement learning algorithms by learning variable temporal resolution models. In Proceedings of the Ninth Machine Learning Conference, 1992.
[135] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvari. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning, 2000.
[136] S. P. Singh, T. Jaakkola, and M. I. Jordan. Reinforcement learning with soft state aggregation. In G. Tesauro, D. S. Touretzky, and T.
Leen, editors, Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, pages 359–368. The MIT Press, Cambridge, MA, 1994.
[137] Satinder Singh. Personal communication, 2001.
[138] Satinder P. Singh, Tommi Jaakkola, and Michael I. Jordan. Learning without state estimation in partially observable Markovian decision processes. In Proceedings of the Eleventh International Conference on Machine Learning, 1994.
[139] Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22:123–158, 1996.
[140] William D. Smart and Leslie Pack Kaelbling. Practical reinforcement learning in continuous spaces. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco, 2000. Morgan Kaufmann.
[141] P. Stone and R. S. Sutton. Scaling reinforcement learning toward RoboCup soccer. In Eighteenth International Conference on Machine Learning, 2001.
[142] Malcolm Strens. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning, pages 943–950, San Francisco, 2000. Morgan Kaufmann.
[143] R. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.
[144] R. S. Sutton. Planning by incremental dynamic programming. In Proceedings of the Eighth International Workshop on Machine Learning, pages 353–357. Morgan Kaufmann, 1991.
[145] R. S. Sutton. Open theoretical questions in reinforcement learning. Extended abstract of an invited talk at EuroCOLT'99, 1999.
[146] R. S. Sutton and D. Precup. Off-policy temporal-difference learning with function approximation. In Proceedings of the Eighteenth International Conference on Machine Learning, 2001.
[147] Richard S. Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts, 1984.
[148] Richard S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
[149] Richard S. Sutton. Generalization in reinforcement learning: Successful examples using sparse coarse coding. In David S. Touretzky, Michael C. Mozer, and Michael E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 1038–1044. The MIT Press, Cambridge, MA, 1996.
[150] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA, 1998.
[151] Richard S. Sutton and Satinder P. Singh. On step-size and bias in temporal-difference learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pages 91–96, 1994.
[152] Csaba Szepesvari. Convergent reinforcement learning with value function interpolation.
Technical Report TR-2001-02, Mindmaker Ltd., Budapest 1121, Konkoly Th. M. u. 2933, Hungary, 2001.
[153] P. Tadepalli and D. Ok. H learning: A reinforcement learning method to optimize undiscounted average reward. Technical Report 943001, Oregon State University, Computer Science Department, Corvallis, 1994.
[154] Vladislav Tadic. On the convergence of temporal-difference learning with linear function approximation. Machine Learning, 42:241–267, 2001.
[155] G. J. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3):58–68, 1995.
[156] S. Thrun. Efficient exploration in reinforcement learning. Technical Report CMU-CS-92-102, Carnegie Mellon University, PA, 1992.
[157] Sebastian Thrun and Anton Schwartz. Issues in using function approximation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School. Lawrence Erlbaum, Hillsdale, NJ, 1993.
[158] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
[159] J. N. Tsitsiklis and B. Van Roy. Feature-based methods for large scale dynamic programming. Machine Learning, 22, 1996.
[160] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, May 1997.
[161] William T. B. Uther and Manuela M. Veloso. Tree based discretization for continuous state space reinforcement learning. In Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI '98), volume 15, pages 769–774. AAAI Press, 1998.
[162] Hans Vollbrecht. kd-Q-learning with hierarchic generalisation in state space. Technical Report SFB 527, Department of Neural Information Processing, University of Ulm, Ulm, Germany, 1999.
[163] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, Cambridge, UK, May 1989.
[164] C. J. C. H. Watkins and P. Dayan. Technical note: Q-learning. Machine Learning, 8:279–292, 1992.
[165] S. Whitehead. Reinforcement Learning for the Adaptive Control of Perception and Action. PhD thesis, University of Rochester, Rochester, NY, 1992.
[166] B. Widrow and M. E. Hoff. Adaptive switching circuits. In Western Electronic Show and Convention, Convention Record, volume 4, 1960. Reprinted in J. A. Anderson and E. Rosenfeld, editors, Neurocomputing: Foundations and Research, The MIT Press, Cambridge, MA, 1988.
[167] Marco Wiering.
Explorations in Efficient Reinforcement Learning. PhD thesis, Universiteit van Amsterdam, The Netherlands, February 1999.
[168] Marco Wiering and Jurgen Schmidhuber. Fast online Q(λ). Machine Learning, 33(1):105–115, 1998.
[169] Marco Wiering and Jurgen Schmidhuber. Speeding up Q(λ)-learning. In Proceedings of the Tenth European Conference on Machine Learning (ECML'98), 1998.
[170] R. J. Williams. Toward a theory of reinforcement learning connectionist systems. Technical Report NU-CCS-88-3, Northeastern University, Boston, MA, 1988.
[171] R. J. Williams and L. C. Baird. Tight performance bounds on greedy policies based on imperfect value functions. In Proceedings of the Tenth Yale Workshop on Adaptive and Learning Systems, Yale University, page 6, June 1994.
[172] R. J. Williams and L. C. Baird III. Tight performance bounds on greedy policies based on imperfect value functions. Technical Report NU-CCS-93-14, College of Computer Science, Northeastern University, Boston, 1993.
[173] Stewart W. Wilson. ZCS: A zeroth level classifier system. Evolutionary Computation, 2(1):1–18, 1994. http://predictiondynamics.com/.
[174] Jeremy Wyatt. Exploration and Inference in Learning from Reinforcement. PhD thesis, Department of Artificial Intelligence, University of Edinburgh, UK, March 1996.
[175] Jeremy Wyatt. Exploration control in reinforcement learning using optimistic model selection. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001), pages 593–600, 2001.
[176] Jeremy Wyatt, Gillian Hayes, and John Hallam. Investigating the behaviour of Q(λ). In Colloquium on Self-Learning Robots, IEE, London, February 1996.
[177] W. Zhang and T. G. Dietterich. A reinforcement learning approach to job-shop scheduling. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1114–1120. Morgan Kaufmann, 1995.