THE QUEEN’S UNIVERSITY OF BELFAST

TO THE UNIVERSITY LIBRARIAN
Please complete and/or delete as appropriate.

I give permission for my thesis entitled:

XCS Performance and Population Structure in Multi-Step Environments

to be made available * (a) forthwith (b) after a period of …… (maximum period of 5 years) for consultation by readers in the University Library, to be sent away on temporary loan if asked for by other institutions, or to be photocopied and/or microfilmed in whole or in part, under regulations determined by the Library and Information Services Committee.

Name

Alwyn Barry

Home Address

Signature of candidate: Date

………………………………………… 1st December 2000

NB Authors of theses should note that giving this permission does not in any way prejudice their rights.

To be completed by Internal Examiner

CERTIFICATION OF ACCEPTED THESIS

I hereby certify that this is the accepted copy of the thesis which is to be placed in the University Library.

Internal Examiner …………………………………………………… Date:

……………………………………………………

XCS PERFORMANCE AND POPULATION STRUCTURE IN MULTI-STEP ENVIRONMENTS

by Alwyn Barry, BSc. Hons.

A thesis submitted in partial fulfilment of the requirements for the degree of

Doctor of Philosophy

Queen’s University, Belfast

2000

Approved by ______________________________________________________________ Chairperson of Supervisory Committee

Supervisor: Prof. D. Crookes (Queen’s University, Belfast)
External: Dr. P. Ross (Edinburgh University)

Program Authorized to Offer Degree ____________________________________________________________

Date _____________________________________________________________________

XCS PERFORMANCE AND POPULATION STRUCTURE IN MULTI-STEP ENVIRONMENTS

by Alwyn Barry, BSc. Hons.

Abstract

Within Michigan-style Learning Classifier Systems based upon Holland's model (Holland et al., 1986), support for learning in delayed-reward multiple-step environments was through the co-operation of classifiers within rule-chains. Despite the successful use of this LCS model in direct-reward environments (Wilson, 1985, 1987; Parodi and Bonelli, 1990; Holmes, 1997; Dorigo and Colombetti, 1994), the application of LCS to delayed-reward Markovian environments has been problematic. There is now a persuasive body of evidence suggesting that the use of strength as a fitness metric for the Genetic Algorithm (Kovacs and Kerber, 2000; Kovacs, 2000a), the use of rule-chains to establish multiple-step policies (Riolo, 1987b, 1989a; Forrest and Miller, 1990; Compiani et al., 1990), and the lack of mechanisms to encourage the development of co-operative populations (Booker, 1988; Smith, 1991; Smith and Goldberg, 1991) all contribute to its failure to learn within these environments. XCS (Wilson, 1995, 1998) presents solutions to each of these issues, and initial results have shown the considerable promise of the approach (Kovacs, 1996, 1997; Lanzi, 1999c; Saxon and Barry, 1999a). In this work it is shown that whilst the XCS action-chaining mechanisms are effective for short action chains, the combination of discounted payoff and generalisation prevents XCS from learning optimal solutions in environments requiring even moderately sized action chains. In response it is hypothesised that the structuring of the solution, possibly hierarchically, can be used to reduce the required action chain length. A framework for hierarchical LCS research is proposed using a review of previous hierarchical or structured LCS approaches (Barry, 1993, 1996), and this work is compared to developments within the Reinforcement Learning community. Within a hierarchical solution low-level action chains may suffer when re-used if different payments are given to the action chains. An investigation into the Aliasing Problem (Lanzi, 1998a) reveals a subset of the problem, termed the Consecutive State Problem (Barry, 1999a), which admits a simple solution that is empirically demonstrated (Barry, 1999b). It is shown that XCS is also able to learn the optimal state × action × duration × payoff mapping when a mechanism providing persistent actions is added (Barry, 2000), and that although this cannot be used as a solution to the aliasing problem it does provide a means of increasing the range of action chains. Two forms of pre-identified hierarchical structure are introduced, and it is shown that these allow multiple XCS instances to learn a hierarchical model that can be applied to operate successfully within environments requiring long action chains.

TABLE OF CONTENTS

1. Introduction 1
   1.1 Learning Classifier Systems and the A-Life Approach 1
   1.2 Learning Classifier Systems 5
   1.3 A Brief Introduction to the Operation of a LCS 7
   1.4 The LCS Problem 10
   1.5 The Hierarchy Debate 14
   1.6 The XCS Approach 18
   1.7 The Hypotheses 20
   1.8 Approach 21

2. Learning Classifier Systems 24
   2.1 Background 24
   2.2 A Brief Overview 26
   2.3 The Canonical LCS 27
      2.3.1 The Architecture of the Canonical LCS 28
      2.3.2 The Performance Subsystem 34
      2.3.3 The Conflict Resolution Subsystem 35
      2.3.4 The Credit Allocation Subsystem 39
      2.3.5 The Induction Subsystem 45
   2.4 The Michigan and Pittsburgh Variants 47
   2.5 Realisations of the Canonical LCS 49
      2.5.1 Riolo’s CFS-C 50
      2.5.2 Two more realisations 53
   2.6 Modifications of the Canonical LCS 54
      2.6.1 Booker’s GOFER-1 55
   2.7 Simplifications of the Canonical LCS 57
      2.7.1 Wilson’s Animat, BOOL and ZCS Implementations 57
      2.7.2 Goldberg’s SCS 62
   2.8 Recent Advances 64
      2.8.1 Wilson’s XCS 64
      2.8.2 Stolzmann’s ACS 65
   2.9 Conclusion 68

3. The XCS Classifier System 69
   3.1 Background 69
   3.2 Structure and Operation 71
      3.2.1 Structure 71
      3.2.2 The Performance Subsystem 75
      3.2.3 The Credit Allocation Subsystem 82
      3.2.4 The Rule Induction Subsystem 89
      3.2.5 Parameterisation 101
      3.2.6 Summary 103
   3.3 Implementation 103
   3.4 Replication of XCS Experiments 104
      3.4.1 Learning in Single Step Environments 107
         3.4.1.1 The Six-Multiplexer 107
         3.4.1.2 The Eleven-Multiplexer 110
      3.4.2 Learning in Multiple Step Environments 114
      3.4.3 Conclusions 118
   3.5 Recent Research 118
      3.5.1 Initial Investigations 119
      3.5.2 XCS Performance in Single Step Environments 120
         3.5.2.1 The Development of Optimal Sub-populations 121
         3.5.2.2 XCS and Traditional LCS 126
         3.5.2.3 The Robustness of XCS 128
         3.5.2.4 Extensions to XCS 128
         3.5.2.5 Real-World Applications of XCS 130
      3.5.3 XCS Performance in Multiple Step Environments 131
         3.5.3.1 XCS and Action Chaining 131
         3.5.3.2 XCS and Non-Markovian Environments 134
   3.6 Further Work 137
   3.7 Conclusion 139

4. Investigating Action Chain Limits in XCS Multiple Step Learning 141
   4.1 Background 141
   4.2 A Suitable Test Environment 142
   4.3 Investigating Action Chain Length 145
      4.3.1 The Test Environment 148
      4.3.2 A Metric for Action Chain Evaluation 150
      4.3.3 Experimental Hypotheses 155
      4.3.4 Experimental Method 156
      4.3.5 The Production of Baseline Results 157
      4.3.6 Using Induction on Long Action Chains 162
      4.3.7 XCS Learning with Generalisation in Long Action Chain FSW 168
         4.3.7.1 The Length-5 FSW 168
         4.3.7.2 The Length-10 FSW 174
         4.3.7.3 The Length-15 FSW 179
         4.3.7.4 Investigating the Dominance of Fully General Classifiers 182
         4.3.7.5 The Length-20 FSW 189
   4.4 Summary of Results 193
   4.5 Discussion 196
   4.6 Conclusions and Further Work 199

5. An Investigation of the "Aliasing Problem" 201
   5.1 Background 201
   5.2 Hypotheses 203
   5.3 Experimental Investigation 208
      5.3.1 Investigating Hypothesis 5.1 208
      5.3.2 Investigating Hypothesis 5.1.1 213
      5.3.3 Investigating Hypothesis 5.1.2 219
      5.3.4 Investigating Hypothesis 5.2 231
   5.4 Summary of Results 235
   5.5 Discussion 237
   5.6 Conclusions 239

6. Action Persistence within XCS 240
   6.1 Introduction 240
   6.2 Background 241
   6.3 Hypotheses 243
   6.4 Experimental Investigation 247
      6.4.1 Investigating the Provision of Persistence 248
      6.4.2 Investigating the Selection of Duration 256
      6.4.3 Re-instating Temporal Difference 259
   6.5 Use of Persistence as a solution to the Consecutive State Problem 261
      6.5.1 Experimental Investigation 262
   6.6 Summary of Results 263
   6.7 Discussion 265
   6.8 Conclusion and Further Work 266

7. Structure in Learning Classifier Systems 269
   7.1 Introduction 269
   7.2 A Rationale for the use of Structure within LCS 270
   7.3 A Framework for Research in Structured and Hierarchical LCS 271
      7.3.1 Abstraction 272
      7.3.2 Decomposition 273
      7.3.3 Reuse 275
   7.4 A Review of Hierarchy and Structure within LCS Research 277
      7.4.1 Multiple Interacting LCS 278
      7.4.2 Structured Population LCS 282
      7.4.3 Structured Encoding LCS 291
   7.5 Discussion 292
   7.6 Hierarchy and Structure in Reinforcement Learning 292
      7.6.1 State-space Sub-division 293
      7.6.2 Competence Decomposition 295
      7.6.3 Emergent Decomposition 298
   7.7 Conclusion 299

8. Using Fixed Structure to Learn Long Rule Chains 301
   8.1 Introduction 301
   8.2 Hypotheses 304
   8.3 Experimental Method 307
   8.4 Sub-dividing the Population 308
      8.4.1 Introducing SHQ-XCS 308
      8.4.2 Using SHQ-XCS to learn the optimal path in a corridor FSW 310
      8.4.3 Investigating the benefits of input message optimisation 315
   8.5 Introducing Hierarchical Control 318
      8.5.1 The Feudal XCS 318
      8.5.2 Using the Feudal XCS in a Unidirectional Environment 320
      8.5.3 Using the Feudal XCS to choose between Sub-goals 324
   8.6 Summary of Results 330
   8.7 Discussion 332
   8.8 Conclusion and Further Work 334

9. Conclusions 336
   9.1 Background 336
   9.2 A Review of the Main Findings 338
      9.2.1 Long Action Chains 338
      9.2.2 The Consecutive State Problem 340
      9.2.3 Action Persistence 341
      9.2.4 Hierarchy 342
      9.2.5 Other Contributions to LCS Research 343
   9.3 Further Work 343
      9.3.1 Long Action Chains 344
      9.3.2 Non-Markovian Environments 345
      9.3.3 Action Persistence 345
      9.3.4 Hierarchy 346
   9.4 Final Words 347

Bibliography 348

LIST OF FIGURES

2.1  The Canonical Learning Classifier System  29
2.2  A test environment for the Bucket-Brigade used by Riolo (1987b)  42
3.1  Schematic of the XCS Classifier System  72
3.2  The Accuracy Curve exp((ln α)(ε - ε0)/ε0) · m  88
3.3  Varying values of α in the Accuracy calculation (without cut-off)  88
3.4  The Woods-1 Environment  107
3.5  Average of 10 runs of XCS in the Six Multiplexer Test  111
3.6  The averages of ten runs, each of ten runs of XCS in the Six Multiplexer Test  112
3.7  XCS 11-multiplexer over 15,000 exploitations  112
3.8  The average of 10 runs of XCS within Woods 2  117
4.1  A Corridor Finite State World  148
4.2  A three state single action corridor FSW  150
4.3  System Error rapidly falling to zero in a three state FSW  151
4.4  A 20 state single action corridor FSW  152
4.5  Convergence of Prediction values along a logarithmic scale in a 20 state FSW  152
4.6  The Relative Error in prediction of payoff per classifier over a 20 state FSW  153
4.7  The convergence of payoff prediction within action chain lengths (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 30 in a corridor FSW environment with one non-optimal action per state, a pre-loaded population and no induction  161
4.8  The convergence of payoff prediction in the presence of classifier induction within action chain lengths (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 30 in a corridor FSW environment with one non-optimal action per state  165
4.9  A single test using a length 30 action chain FSW with induction, no generality, and iteration averaging removed  166
4.10  A single test using a length 40 action chain FSW with induction, no generality, and iteration averaging removed  167
4.11  The convergence of payoff prediction in the presence of generality pressure and classifier induction within a length 5 corridor FSW  169
4.12  The performance of a typical run of XCS within a length 5 corridor FSW with population-wide subsumption for mutated classifiers belonging to a new action set  171
4.13  The average performance of ten runs of XCS within a length 5 corridor FSW with mutation rate reduced to 0.01  172
4.14  The average performance of ten runs of XCS within a length 10 corridor FSW with mutation rate reduced to 0.01  175
4.15  The average performance of ten runs of XCS within a length 10 corridor FSW with system relative error reducing to zero over 15000 episodes  175
4.16  The average performance of ten runs of XCS within a length 10 corridor FSW with the mutation rate set at 0.04  177
4.17  The rate of choice of non-optimal route within the last 2000 exploitation episodes of a typical run of XCS within a length 10 corridor FSW with the mutation rate set at 0.04  179
4.18  The averaged performance of ten runs of XCS in the length 15 FSW, at mutation rate 0.04. The average iteration count in the last 2000 steps was 17.15 rather than the optimal 15  180
4.19  The coverage graph and the coverage graph with logarithmic prediction scale (y axis) for the action sets within XCS in the length 15 FSW at mutation rate 0.04  180
4.20  The averaged performance of ten runs of XCS in the length 15 FSW, at mutation rate 0.01  181
4.21  The coverage graph and the histogram of average numerosity for the action sets within the seven non-optimal runs of XCS at mutation rate 0.01 within the length 15 FSW  182
4.22  The average performance of ten runs of XCS in the length 15 FSW with mutation rate 0.04 and population-wide subsumption in the G.A.  184
4.23  The average performance of ten runs of XCS in the length 20 FSW  185
4.24  The prediction, accuracy, fitness and numerosity traces of three typical life histories of the two fully general classifiers that may appear within the length 15 FSW. Each measure averaged over the preceding 50 exploitation episodes  188
4.25  The average system prediction in each action set of the five good performance runs of XCS in the length 20 FSW environment  189
4.26  The performance and coverage of XCS in the length 15 FSW environment with an input encoding that should encourage generalisation  191
4.27  The action set predictions from XCS in the length 15 FSW with an input encoding intended to discourage generalisation  193
4.28  The average performance of XCS in the GREF-1 environment  197
5.1  The rapid decline in Relative System Error in one run of FSW-5  209
5.2  The failure of Relative System Error to decline in the presence of two consecutive aliased states demonstrated within one run of FSW-5A  210
5.3  Comparison between predictions of classifiers in FSW-5 (Cl. lines) and FSW-5A (Al-Cl. lines) showing oscillation of the aliased classifier's prediction  211
5.4  Relative Error plots indicating the source of error in one run of FSW-5A  212
5.5  The change in oscillation range and prediction stabilisation with increase in the number of aliased states in FSW-9A  213
5.6  The prediction of the aliased classifier has no effect on the prediction stability of the immediately preceding classifier for 2, 4, and 6 consecutive aliased states in FSW-9A  214
5.7  The prediction of the aliased classifier impacts the prediction stability of the immediately preceding classifier for 2, 4, and 6 consecutive aliased states when random start states are used  215
5.8  The impact of the aliasing classifier on the prediction of all classifiers in the random start state version of the FSW-9A-4 environment. Apart from Cl. 4, which covered the aliased states, only Cl. 3 was deemed to be inaccurate  216
5.9  In the random start state version of FSW-9A-4 the accuracy of the aliased classifier (Cl. 4) impacts the accuracy of the immediately preceding classifier (Cl. 3) but does not impact the accuracy of the earlier classifiers (Cl. 1 and Cl. 2) unduly  216
5.10  The effect of the aliasing classifier (Cl. 4) upon classifiers choosing an action which leads back to the same state (Cl. 5b to Cl. 1b) within FSW-9A-4  217
5.12  Relative Error within FSW-9(2) with GA and induction mechanisms on and no initial population of classifiers  220
5.13  System Relative Error is not reduced, demonstrating the inability to find accurate classifiers within FSW-9A(2)-4 with GA and induction mechanisms on and no initial classifiers  221
5.14  The predictions of five classifiers within one run of FSW-9A-4. The prediction values of ###01 do not originate from the same 500 iteration period, but all values reflect the stable values of the respective classifiers over the duration of their existence  224
5.15  The Maximum Relative Error and Relative Error measures are reduced under the less competitive encoding within FSW-9A(2)-4 with GA and induction on and no initial classifiers  228
5.16  The reduction in system relative error shows that consecutive aliasing states do not affect classifier performance when using the persistent action XCS within FSW-9A(2)-4  234
6.1  FSW without and with consecutive actions  244
6.2  A FSW not optimally selected by PXCS  246
6.3  An FSW to test persistence  249
6.4  The Coverage Table for a pre-loaded population within PXCS operating with no induction within the two-reward environment  250
6.5  XCS performance with a four bit action within two-reward corridor FSW  253
6.6  Performance of PXCS within two-reward single chain FSW  254
6.7  The Coverage Table for PXCS operating with induction within the two-reward environment  256
6.8  A FSW derived from the Benes Switch  257
6.9  The Performance of dPXCS within the two-reward corridor environment  260
6.10  System Relative Error remains within FSW-9A(2)-2 when attempting persistence delay learning  263
7.1  An abstract decomposition of a grid-based state space  273
7.2  Two FSW from a Simple DRT Environment  288
8.1  The Woods14 environment in which ‘O’ represents a rock (a position which the animat cannot enter) and ‘F’ represents food (the goal)  301
8.2  An 18-state simple corridor environment (a) contrasted with a Woods14-like FSW environment (b)  302
8.3  SHQ-XCS performance (average of 10 runs) in a length 10 unidirectional corridor environment subdivided into two joined but distinct state spaces  311
8.4  Normal XCS performance within length 5 and length 10 unidirectional corridor environments  311
8.5  The performance of SHQ-XCS with 4 populations in a length 20 FSW  313
8.6  The performance of SHQ-XCS with 8 populations in a length 40 FSW  315
8.7  The Performance of the reduced length input coding  317
8.8  The Performance of Feudal-XCS in a Length 10 unidirectional FSW  321
8.9  The Performance of Feudal-XCS in a Length 20 unidirectional FSW using four sub-populations  323
8.10  The Relative Error and Iteration counts of Feudal-XCS in a Length 40 unidirectional FSW with eight sub-populations  323
8.11  A corridor environment with two sub-goals in each of two state-space sub-divisions  325
8.12  The Performance of Feudal XCS shows unexpectedly high relative error  326
8.13  The Performance of Feudal XCS rapidly becomes optimal where high limits on sub-population exploration allow sub-populations to reliably find their sub-goals  327
8.14  The Averaged Coverage Table for the top-level XCS in the Feudal XCS. The Coverage Graph illustrates the identification of the optimal pathway to the highest environmental reward  328

ACKNOWLEDGMENTS

To Ann, Aaron, Carys, Eilish, and Reuben, who have suffered much throughout the course of this work. Their support, perseverance, consideration, patience, and encouragement supplied more to the production of this document than the work of the author himself.

Many thanks go to both my parents and family and my wife's parents and family for their support and encouragement throughout this work.

The author wishes to thank Stewart Wilson for inventing the XCS approach, ending a number of years of wandering in the "mires" of LCS inability and unpredictability. Stewart has been friendly, approachable, and encouraging throughout and has demonstrated consummately all that a research professional should be.

The author is particularly grateful to Tim Kovacs who, though a Research Student, has shown professionalism and a level of maturity in all his work that sets an example of good practice to all in the field. He has been a generous source of mutual help and encouragement throughout this work. I wish him well in all his future academic work.

My thanks also go to Martin Butz, John Holmes, Pier Luca Lanzi, Wolfgang Stolzmann, Rick Riolo, Ken De Jong, and Jean-Arcady Meyer, all of whom have at one time or another helped or encouraged through conversation and email - the wider LCS community is indeed very generous in its support of researchers in the area.

Thanks are overdue to the many who have supported this work in UWE and QUB. Gratitude must go in particular to Prof. Ken Jukes (Dean, CSM Faculty, UWE) for his encouragement and unending support - many best wishes in your retirement - and Prof. D. Crookes (Head of Department, Computer Science, QUB) for his patience and help through difficult times. There are many within UWE to whom thanks must go, though mention can be made of only a few. Thanks in particular to Tony Solomonides, Dr. David Coward and Prof. Richard McClatchey for providing the time and resources for this work and their encouragement to complete it, and to those in ICSC and the Engineering Faculty who have provided materials, discussion time and support. Particular thanks go to Nigel Baker, whose constant help and companionship within and beyond UWE have made the problems of work much more bearable and who has often provided a much needed sense of direction and focus.


GLOSSARY [1]

Artificial Intelligence or AI. "...the study of how to make computers do things at which, at the moment, people are better" --- Elaine Rich (1988)
Artificial Life. A term coined by Christopher G. Langton that he defines as "...the study of simple computer generated hypothetical life forms, i.e. life-as-it-could-be."
Accuracy κ. The representation of the calculated accuracy of the prediction measure.
Action. A conjunction of attribute values, each representing an operation to be performed.
Action Chain. The sequence of action-sets formed from the classifiers that match an external message on successive iterations.
Action Set A or [A]. The set of classifiers which have been matched and whose actions are the same as the selected action.
Animat. A simulated animal within a constrained environment, consisting of a limited set of environmental detectors and a limited set of effectors.
Attribute. A value representing the quantity or quality measure for an environment feature.
Bid. The calculation of the support for a proposed message from one or more classifiers from the current match set.
Bid Tax. The removal of current strength in relation to the total amount bid, to put deletion pressure on classifiers that propose actions that produce poor payoff or reward.
Binary. An attribute type of the two values 0 and 1.
Bit. A single Binary value.
Bridging Classifier. A classifier that occupies both an early and a late position within the same rule chain because its generality allows it to appear within more than one match set. Bridging Classifiers are able to provide early reward-based payoff to scene-setting classifiers within the rule chain and thereby speed up the convergence of the classifiers within the rule-chain to the stable reward value.

[1] A comprehensive glossary of terms for evolutionary computation is available from the Evonet website at http://www.wi.leidenuniv.nl/~gusz/Flying_Circus/1.Reading/1.HHGTEC/Q99.htm. All but seven of the terms presented within this glossary were created by the author. The glossary entries for the terms Artificial Life, AI, Classifier System, Generation, Schema, Schema Theory, and Search Space were derived from those provided in the online glossary.


Bucket Brigade. The passing of strength values as payoff to classifiers that caused new classifiers to act as a result of the internal messages sent in the previous iteration of the LCS.
Child. One individual produced as a result of the operation of the Genetic Algorithm.
Classifier c. A rule consisting of a condition, an action, and a strength record.
Classifier System or CFS. A system that produces one or more outputs as a result of one or more inputs, where the outputs represent some classification of the inputs.
Condition. A conjunction of, possibly generalised, attribute values.
Convergence. The increasing concentration upon a single [locally] optimal solution within a population of solutions.
Co-operation. The mutually beneficial co-existence of two or more distinct classifiers for the solution of a problem that no single population member can solve in isolation.
Credit Allocation. The mechanism for distributing the reward to classifiers acting in the current iteration and the payoff to classifiers acting in the previous iteration.
Crossover. An element of reproduction where a point on the copies of two strings of attributes from each of two parents is selected, the strings are spliced, and the head of each is recombined with the tail of the other to form two new attribute strings.
Deletion. The selection and removal of a classifier from a population.
Detector. A sensory capability of an Animat generating an internal signal that represents some detectable quality of the environment or measurement within the environment.
Effector. A distinct and defined capability of an Animat that can be used to bring about change in position within an environment or change to the environment.
Elitist Strategy or Elitism. The protection of a proportion of the population so that it remains within the population from one generation to the next.
Environment E. The test problem within which an Animat controlled by the LCS operates.
Error ε. The absolute difference between the payoff and the prediction as a proportion of the total prediction range.
Evolutionary Computation. The combination of areas of research within the wider domain of research in Artificial Life that utilise mechanisms derived from Genetic Algorithms or are related to the operations found within a Genetic Algorithm.


Explore/Exploration. The process of visiting new parts of the input space to expand learnt experience. An iteration or trial within which the LCS does not select the optimal action, in order to find out the value of alternative actions.
Exploit/Exploitation. The process of utilising existing knowledge of the optimal action within previously visited parts of the input space. An iteration or trial within which the LCS selects the optimal action in order to reinforce the optimal classifiers or demonstrate the optimal learnt policy.
External Message e. A message sent from a classifier to be enacted through the effectors as actions on the environment.
External Message List. A message list for actions to be presented to the environment and/or messages from the environment to be matched by classifiers in the next iteration.
Fitness. A measure of adaptation to a specific environment. Within LCS, the part of the Strength representing the selection value of the classifier in the GA.
F.S.W. (Finite State World). An environment represented by a set of states connected as a graph by labelled transitions. Each state is identified to the LCS by a pre-defined value, and each action is interpreted by the F.S.W. to a label that corresponds to a transition. A subset of states will be identified as start states, from any of which a trial can begin, and another non-overlapping subset of states will be identified as reward states, in which a trial will finish and a reward will become available to the LCS.
G.A., Genetic Algorithm. A selection and recombination routine to generate new classifiers.
Generality. The proportion of condition positions at which wildcard values appear.
Generation. An iteration of the measurement of fitness and the creation of new population members.
Genetic Programming. The use of simplified S-Expressions as population members, so that each population member represents a program and its fitness is a function of the effectiveness of the program.
Implicit Bucket Brigade. The passing of strength values as payoff to classifiers that acted in the previous iteration of the LCS.
Individual. A member of a population.
Induction Algorithm. A function that uses existing message, classifier or population information to create one or more new classifiers.
Input/Detector Message d. A message composed from the values of the environmental detectors, added to the message list on each iteration of the LCS.
Internal Message i. A message sent from a classifier back onto the message list so that it is presented to be matched by the classifiers on the next iteration of the LCS.


Internal Message List. A message list that maintains messages sent by classifiers for presentation to classifiers for matching in the next iteration.
Iteration. One execution of the main cycle of the operation of the LCS.
Life Tax. The removal of a fixed small proportion of the current strength of all classifiers within the population, to put deletion pressure on classifiers that do not apply within any environmental niche.
LCS. A machine learning paradigm that uses a Genetic Algorithm and other low-level induction operations to maintain a population of classifiers that enable an Animat to respond optimally within an environment.
Macro-Classifier. A classifier holding a numerosity value.
Match. A message matches a condition if in every position the values are the same or the values in the condition are more general and cover the message value.
Match Set M or [M]. The set of classifiers matching the current [internal and input] messages.
Message e. A sequence of attribute values.
Message List. A finite size sequence of messages.
Micro-Classifier. A classifier without a numerosity value (or with a constant numerosity of 1).
Mutation. The low-frequency random modification of a single attribute value to another attribute value.
Niche. An area of population space allocated explicitly or implicitly for occupation by an identifiable sub-population of classifiers.
Null Action. An action that, although legal within the current environment state, does not cause a transition into a different environment state.
Numerosity N. The number of conceptual duplicates of a classifier within the population.
Objective Function. A function that provides a quantitative measure of worth to a population member for its contribution to the achievement of a specific objective within an environment.
Offspring. A new individual created as a result of the application of a reproductive operator.
Over-general. A classifier whose condition generality is such that it acquires sufficient strength from use in situations where it is correct to be selected for use in situations where its action is incorrect.


Panmictic Reproduction. A mixed population. Selection across all members of the population when applying the Genetic Algorithm.
Parent. One of two individuals selected for reproduction using the Genetic Algorithm.
Parasitic Classifier. A classifier that appears within the action set formed by matching an internal message within a rule chain, does not provide an action that contributes towards an eventual reward, and yet reduces the payoff to other classifiers by its presence in the action set.
Passthrough. The use of a message matching a classifier's condition to provide bit values for the message constructed from that classifier's action, for action positions that contain the wildcard value.
Payoff P. The value distributed to the classifiers acting in the previous iteration from those active in the current iteration by the Credit Allocation mechanism.
Performance. A loose description of the adequacy of the operation of an Animat within an Environment, formally measured by a pre-defined measure of goodness of each action within the environment.
Population P. A group of individuals. More precisely for LCS, a finite size Bag of Classifiers.
Prediction p. The value of the payoff that a classifier calculates it will receive.
Reproduction. The application of the crossover and/or mutation operators of the Genetic Algorithm to copies of two selected parents to produce two children.
Reward R. A numerical representation of the value to the Animat of a detectable state change in the environment as a result of an action. The term is also used to denote the payment of classifiers by the Credit Allocation mechanism triggered by the receipt of a reward.
Roulette Wheel Selection. Selection over a sub-set of entities (such as classifiers from within the population) so that the probability of selection is equal to the relative size of a given measure.
Rule Chain. The sequence of classifiers formed by matching an external message and, in successive iterations, providing a new internal message that will be matched by other classifiers within the sequence in the next iteration.
Schema. A pattern of attribute values (possibly containing "don't care" values) within the attribute values of an individual. Schema (plural Schemata) are used to identify the groups of attribute values that have high or low worth in the fitness of the individual. The "order" of a schema is the number of non-don't-care positions specified, while the "defining length" is the distance between the furthest two non-don't-care positions.
Schema Theory. A theorem to explain the behaviour of Genetic Algorithms. It suggests that a GA will give exponentially increasing reproductive trials to above average schemata. Because each individual contains many schemata, the rate of schema processing in the population is very high, leading to a phenomenon known as implicit parallelism. This gives a GA with a population of size n a speedup by a factor of n^3, compared to a random search.
Search Space. If the solution to a task can be represented using a representation scheme, r, then the search space is the set of all possible configurations that may be represented in r.
Selection. The choice of elements from a set of like elements using a metric to bias choice.
Specificity. The proportion of condition positions at which non-wildcard values appear.
State. An identifiable configuration of the environment or location within an environment.
Step. An iteration of the LCS.
Strength. The representation of the value of a classifier.
Support. A value maintained with an internal message, reflecting the strength or bid of the classifier that posted the message, used to influence the bidding process in the next iteration.
Tag. A section of a message that identifies the message as having a distinct pre-defined type, purpose or meaning.
Tax. A proportionate or fixed amount of strength to be deducted from all or some classifiers.
Ternary. An attribute type of the three values 0, 1, and #, where # is the wildcard value.
Trial. A set of iterations of the LCS that conclude with the receipt of a reward.
Trigger. A condition tested to ascertain whether an induction algorithm must be applied.
Trit. A single ternary value.
Wildcard. A ternary value indicating that 0 or 1 will match at that position.
XCS. An LCS implementation that maintains separate prediction and fitness values, uses a relative accuracy measure for fitness, and introduces mechanisms to provide dynamic niche allocation so that the optimally general accurate population of classifiers can be found.
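To make the Match, Ternary, and Wildcard entries concrete, the following is a minimal Python sketch of ternary matching; the function name and the example strings are illustrative only and are not drawn from any implementation discussed in this thesis.

    def matches(condition, message):
        # A ternary condition covers a binary message when every position
        # either equals the message value or holds the wildcard '#'.
        if len(condition) != len(message):
            return False
        return all(c == '#' or c == m for c, m in zip(condition, message))

    # For example, '1#0' covers '100' and '110', but not '101'.
    assert matches('1#0', '110')
    assert not matches('1#0', '101')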


Chapter 1

INTRODUCTION

AI addresses one of the ultimate puzzles. How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself? How do we go about making something with those properties? These are hard questions, but unlike the search for faster-than-light travel or an antigravity device, the researcher in AI has solid evidence that the quest is possible. All the researcher has to do is look in the mirror to see an example of an intelligent system.
--- Russell and Norvig (1995)

1.1 Learning Classifier Systems and the A-Life Approach

With the publication of the proceedings of the first international conference on the Sciences of Complexity (Langton, 1989), a 'new' field emerged into the research community, which was soon to take upon itself the title of those proceedings - Artificial Life. The reality was, as with many areas of science, that the major elements that made up this 'new' field were all in place long before. For a number of years unheralded publications had been expressing the same views as those predominantly proposed at that conference - that the production of structures (knowledge structures or physical structures) does not require the provision of complex models and systems, but can emerge from sets of simple interacting unstructured elements. For example, Kay (1991) combined the ideas behind Reynolds' 'Boids' (Reynolds, 1987) with the ideas on Agents and Agency proposed by Minsky (1987) to instigate the 'Vivarium' programme at Apple. This sought to investigate Agent-based systems, producing example systems such as 'Petworld' (Coderre, 1987) and AGAR (Travers, 1989). Indeed, continuing work by Brooks (see Brooks and Flynn (1989) and Brooks (1990, 1991), for example) based on the Subsumption Architecture had taken this approach before it was endowed with a name. Nevertheless, the new awareness of the basic concepts that was generated by the Artificial Life conference proceedings and other similar publications (such as Forrest, 1990) drew interest from a number of disparate disciplines and produced new and interesting developments. In particular it generated substantial developments in Evolutionary Systems (Genetic Algorithms (Holland, 1975), and their application in


Learning Classifier Systems (Holland, 1986), Genetic Programming (Koza, 1992), and Neural Networks (Rumelhart & McClelland, 1986)), Robotics (Brooks, 1991), and the Sciences of Complexity (Langton, 1989). Most of these developments have produced spin-off research groupings with their own Journals and Conferences, leaving the Artificial Life conferences to concentrate increasingly on the latter field. Of these areas, the most obvious beneficiary has been the area of Evolutionary Systems. These systems reflect the central ethos of Artificial Life well, typically allowing a small unstructured and largely undifferentiated set of components to grow and emerge into a well-formed set of components which have real value. The components themselves may take a number of forms. In Genetic Programming they are Lisp expressions, in Neural Networks they are typically layouts depicting artificial neurones, their interconnections and weights, and in Classifier Systems they are [essentially] production rules. Through the attention drawn to them by the field of Artificial Life, interest has grown rapidly so that there are now many researchers investigating this field. With much work done on the use of Genetic Algorithms from the mid-1970s, the application of Genetic Algorithms is well documented and their limitations understood, although their use still requires in-depth knowledge, and strong mathematical models have not yet been produced. Until recently surprisingly little work has been done on the application of the concepts behind Genetic Algorithms to the domain of Machine Learning. This is despite the fact that proposals on methods for the application of Genetic Algorithms to this area were made early in the development of Evolutionary Computation by Holland (1975), Holland et al (1986) and Smith (1980). Predominant in this area has been work with Classifier Systems, which reduce production rules, well known to 'classical' AI systems such as CN2 (Clark and Niblett, 1989), to a representation in a simpler alphabet that is more suitable for manipulation by the Genetic Algorithm. The addition of the ability to manipulate production rules by the application of a suitable GA introduces to Machine Learning the power of GA search, providing the ability to optimise over the typically large search domain represented by the combination of attributes within the production rules, whilst introducing a robustness to changing external stimuli given by the rich population of solutions a GA requires (see Goldberg, 1989). Although the potential of these "Learning Classifier Systems" (Riolo, 1988) is clear, investigations produced results that were both exciting and disappointing in equal measure. Whilst the power of the GA to search for good solutions has been


demonstrated (Wilson, 1985), the ability to represent certain complex knowledge structures has been shown (Forrest, 1990; Riolo, 1987, 1991; Holland, 1991), and the generalisation ability of a particular form of LCS has been revealed (Wilson, 1994; Wilson, 1995; Wilson, 1998), there have also been important results that clearly demonstrate that Learning Classifier Systems have certain important limitations. In particular, it has been shown that the application of the Genetic Algorithm is itself very sensitive to the encoding of the attributes that make up the rules (Booker, 1991) and to the tuning of the parameters of the genetic algorithm, which may result in potentially good rules being removed and replaced by poor rules (Compiani et al, 1990). Furthermore, it has been suggested that the use of the credit assignment algorithm proposed by Holland et al (1986) can itself prevent the formation of the important higher-level rule structures that would be required in order to move rule production away from simple stimulus-response rule systems (Booker, 1988; Forrest & Miller, 1990; Smith & Goldberg, 1991). Much of the more recent work in this area has sought to propose potential solutions to these implementation problems (see Holland, 1990; Riolo, 1990; Smith & Goldberg, 1991; Dorigo & Bersini, 1994; Frey & Slate, 1991; Wilson, 1995; Wilcox, 1997; Tomlinson, 1999). This work has taken two basic directions - the adoption of alternative Credit Allocation techniques that seek to stabilise the dynamics of the LCS and thereby maintain the more complex rule structures once they are formed, or the adoption of new internal organisations within the LCS that may force the creation and maintenance of complex co-operative rule structures. The former approach has largely been influenced by developments in the field of Reinforcement Learning, which has traditionally been applied to tabular learning methods or allied to Neural Network approaches (Sutton and Barto, 1998; Lin, 1993), but has recently been successfully applied to LCS (Wilson, 1995, 1996, 1998), although much more investigative effort is still required in order to understand more completely the resultant effects of these new LCS dynamics. The latter approach has, in effect, paralleled the development of interest in the use of multiple agents within the AI field, which has demonstrated that multiple simple agents working together in a semi co-operative manner (usually behavioural co-operation is adequate) can often produce large-scale 'intelligent' behaviour which is nowhere 'hardcoded' into the agents. In application to Learning Classifier Systems, attempts have been made to show that multiple co-operating [and possibly co-evolving] classifier populations can work together to reach a common goal more effectively than a single population (Dorigo and Schnepf, 1993; Dorigo and Colombetti, 1994; Bull and Fogarty, 1993). Whilst Bull


et al have been working with 'flat' classifier structures [2], Dorigo & Schnepf experimented with a hierarchical structure and reported performance improvements. The argument for or against flat structures is complex, but work by Lin (1993) and Tyrrell (1992) has suggested that a well-selected hierarchical structure can show performance gains. In moving towards the addition of structures to Learning Classifier Systems it is noticeable that there has been almost no debate about the appropriateness of the move or the form of structure that should be adopted. The silence over these matters is particularly noticeable when compared to the on-going debates between the Behaviourist and Cognitivist camps within Psychology, or the debates within the behavioural sciences over structural form for action selection. It is fascinating that the field of Artificial Life, in general, advocates unstructured 'flat' approaches (Maes, 1991a, 1991b) or emergent loose structures (Brooks, 1991), whereas almost all proposals for the application of structure to Learning Classifier Systems have been pre-programmed and strictly hierarchical (Wilson, 1989; Dorigo & Schnepf, 1993). The work of Tyrrell (1992) and Lin (1992) in related fields has suggested that hierarchical structures may be beneficial, but Tyrrell demonstrates that the use of an inappropriate structure may severely hinder performance, which casts doubt on the viability of these approaches. Starting from the problem of creating and preserving long sequences of classifiers within a LCS, this research set out to further investigate the issues in this area. It was noted, in particular, that there had been no investigation of the possibility of using hierarchies to increase the length of rule chains, an idea originally proposed by Wilson and Goldberg (1989). However, as initial investigative work proceeded it became apparent that the instabilities inherent within a 'Traditional LCS' [3] would prevent effective progress in investigating this possibility. The introduction of Wilson's XCS (Wilson, 1995, 1996) represented a landmark improvement in the performance and predictability of LCS operation, and provided a suitable platform for the development of research into hierarchical learning. The adoption of XCS, whilst solving one set of problems, would present further issues that required investigation and confirmation before further research could be progressed.

[2] Interestingly, Fogarty and Bull do report on a kind of hierarchy emerging on occasions within their flat structures, demonstrating that simple hierarchical structures can emerge without pre-programming.

[3] See Chapter 2 for a definition of the term 'traditional LCS'.


The remainder of this chapter elaborates further on the concepts underpinning the Learning Classifier Systems approach, discusses the problems within LCS that form the background to this research work, and provides an overview of the differences between the 'traditional LCS' and XCS. This provides the basis for the presentation of the core hypotheses for the work and an elaboration of the structure of the proposed investigation.

1.2 Learning Classifier Systems

The term ‘Classifier System’ is a broad term used to describe any system that attempts to organise a set of inputs under classifications that capture aspects of the state space of the inputs. More specifically, a ‘Classifier System’ is defined to be: A system which takes a (set of) inputs and produces a (set of) outputs that indicate some classification of the inputs.

Classification is important within Artificial Intelligence, since any state-to-action mapping assumes that the agent has an 'understanding' of the relevance both of the inputs and of their similarity to other inputs that have already been received. Important operations include the formation of generalisations over the inputs to identify which features characterise each class, the identification of exceptions which nevertheless belong to a more general class, the identification of the relationships between classes, and the introduction of new classes where the characterisations of a potential general class are so strained as to indicate that the recent classification is inadequate. Such features are common to many systems developed within the AI community, particularly within the area of Machine Learning, and these systems could therefore be termed ‘Classifier Systems’. However, the title is usually reserved for systems where the primary occupation of the intelligent agent is the formation and manipulation of suitable classifications. Such systems include ID3 (Quinlan, 1986) and CN2 (Clark and Niblett, 1989).

What have become known as ‘Holland-style Classifier Systems’, or more simply ‘Learning Classifier Systems’ (abbreviated to LCS (Riolo, 1988a)) [4], are a particular Classifier System implementation which uses the principles behind the operation of the Genetic Algorithm (Holland, 1975) to produce a production system that searches for many classifications at the same time rather than maintaining a single ‘best’ classification structure. Possible classifications compete to remain in the population, and both directed search mechanisms and random search mechanisms will introduce new rules that will join the competition. Since the population of rules is typically large and the composition of the population is itself dynamic, localisation of reinforcement information is necessary. As these search operations progress, candidate classifications will emerge as strong production rules in the population. Rules that classify inaccurately will not be proliferated by the Genetic Algorithm and will eventually be deleted to make room for new rules generated by the search mechanisms as they continue to look for better classifications. The LCS is therefore an example of Emergent Computation (Forrest, 1990).

[4] In fact, Holland Classifier Systems have had many names. Holland and Reitman (1978) referred to them as ‘Classifier Systems’, abbreviated to ‘CFS’, whilst Goldberg (1989) used the term ‘CS’, labelling his implementation SCS for ‘Simple Classifier System’. Riolo (1988a) used the term ‘Learning Classifier System’ (LCS), although his C implementation of a Holland-style classifier system was entitled CFS-C. The term ‘LCS’ has been most often used since its adoption in the First International Workshop on Learning Classifier Systems (Smith, 1992). Other extensions of the Holland-style classifier system have adopted the ‘CS’ tag, such as VCS for ‘Variable Classifier System’ (Shu and Schaeffer, 1989), ZCS for ‘Zeroth-Level Classifier System’ (Wilson, 1994), XCS for ‘Extended Classifier System’ (Wilson, 1995; Wilson, 1996), and ACS for 'Anticipatory Classifier System' (Stolzmann, 1997).

Learning Classifier Systems can be contrasted with the more traditional Classifier Systems that seek to produce a single classification structure. These systems are often complex because of the need to ensure that the classifications identified are not violated by new inputs, and because of the consequent restructuring of the classification hierarchy that is required when new classes are identified. In order to gain a good attribution of value to any given classification, long histories of inputs and classifications used are maintained until reinforcement is received which can be used to confirm the value of the classifications. Since no alternative classification arrangement is maintained, if the classifier system is placed in a [different] unseen environment, or if the environment is sufficiently dynamic, the classifier system has no alternative classification hypotheses available to aid the re-classification process (Holland, 1987). This produces brittleness in the operation of the classifier system in dynamic environments. It is the simplicity and robustness of LCS that are the primary attractions of the architecture for situated learning, action-selection and planning. As Holland (1983) notes:

“One of the most important qualities of classifier systems as a computational paradigm is their flexibility under changing environmental conditions. ... Categorisation is the system’s sine qua non for combating the environment's perceptual novelty.”

The LCS also holds appeal as a form of Connectionist architecture that has been identified as similar to Neural Networks (Farmer, 1990), but with the advantages that the production rules are more readily amenable to human interpretation and manual training, and that they are more efficient at representing the kind of sparse graphs that are typical of non-trivial knowledge-based systems. The LCS is also a classifier system more readily applied over new application domains, since none of its classification mechanisms need to contain any ‘knowledge’ of the application domain or the meaning of the production rules. Any bias that is in the system (all systems have some bias; Booker, 1991) is small and largely unrelated to given applications or environments. This has the highly desirable side-effect that any classifications that emerge from the action of the LCS are implicitly grounded in the environment itself and are therefore not dependent upon the designer’s [often incorrect] prior interpretation of that environment.

1.3 A Brief Introduction to the Operation of a LCS

Like many broad categorisations, the term LCS covers a diversity of possible implementations; these are discussed in much more depth in Chapter 2. There are some core components of a LCS that can be identified, however, and it is appropriate that these are briefly introduced as a foundation to further discussion within this chapter.

A LCS consists of a population of classifiers. Each classifier is essentially a production rule, consisting of a condition and an action. The condition represents a concatenation of a set of attribute values in a pre-defined order, and the action maintains at least one attribute that defines a discrete response to the environment [5]. The population is [technically] a finite size Bag of classifiers. The LCS has an input interface that codes environmental stimuli in terms of the attributes that make up the classifier conditions to create an input message from the environment. It also has an output interface that translates chosen actions into actual actions upon the environment. The actions may in addition provide internal messages or internal actions that feed back as additional input messages or change an internal state vector that is used to compose an internal message. This provides the LCS with a degree of autonomy from the environment if that is necessary.

[5] ... the actual form of a classifier can be considerably more complex, with multiple conditions and actions - see Chapter 2.

To include the learning aspects of the LCS two further features are provided. The credit allocation subsystem receives environment feedback and translates this into reward payments that are allocated to classifiers. It also allocates payments to scene-setting classifiers that have generated an action that will ultimately lead, via a chain of subsequent messages and/or actions, to a reward. Each classifier maintains a record of its current strength; this record may be a single value representing an average of the payments and/or rewards received, or may be a set of values derived from the payments and rewards. This record is used when selecting the action and within some induction algorithms. The induction subsystem includes operators that generate new classifiers where a classifier is not available for a newly received input, introduces modified classifiers where existing classifiers have consistently performed badly, and provides operators to search the classifier-space using a Genetic Algorithm selecting over classifiers in the population using the strength record. Each newly created classifier is inserted into the population, and when the population becomes full classifiers are deleted to make space for the new classifiers. Deletion is performed using a metric based on the strength record of the classifiers so that, in general, better performing classifiers are maintained at the expense of poor classifiers.

An iteration or step of the LCS consists of:

• Receiving an input from the environment and translating it into a message;
• Matching the message against each classifier to identify those classifiers whose conditions cover the current input;
• If no classifiers match, creation of a suitable classifier to match the input and insertion of the classifier into the population, possibly deleting an existing classifier to make space;
• Selection between those classifiers that match to identify an action to be performed;
• Enactment of the action on the environment;
• If environmental feedback is received, translation of this into a reward and allocation of the reward to the classifiers that were selected to provide the action;
• If no reward was received in the previous iteration, provision of payment to any classifiers that provided the action in the previous iteration;
• If the current classifiers have a history of poor performance, application of an operator to derive a new classifier from the current classifiers and insert it into the population, possibly deleting an existing classifier to make space;
• Periodic application of genetic operators to search the classifier space by creating new classifiers and inserting them into the population, possibly deleting existing classifiers to make space.
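To summarise this cycle, the following Python sketch assembles the steps above into one simplified iteration. It is a minimal sketch only, assuming strength-proportionate action selection, a covering operator for unmatched inputs, and implicit bucket-brigade payments; all names, parameter values and update rules are illustrative rather than taken from any particular LCS implementation.

    import random
    from dataclasses import dataclass

    WILDCARD = '#'

    @dataclass
    class Classifier:
        condition: str           # ternary string over '0', '1', '#'
        action: str              # action attribute(s); here a single bit
        strength: float = 10.0   # running record of payments/rewards received

    def matches(condition, message):
        # A condition covers a message when every position is equal or '#'.
        return all(c == WILDCARD or c == m for c, m in zip(condition, message))

    def cover(message):
        # Create a classifier for an unmatched input: generalise some
        # positions at random and propose a random action.
        condition = ''.join(WILDCARD if random.random() < 0.33 else m
                            for m in message)
        return Classifier(condition, random.choice('01'))

    def lcs_iteration(population, message, prev_action_set, reward,
                      beta=0.2, bid_fraction=0.1):
        """One simplified performance and credit-allocation cycle."""
        match_set = [cl for cl in population if matches(cl.condition, message)]
        if not match_set:                 # no classifier covers the input
            new_cl = cover(message)
            population.append(new_cl)     # a full population would delete here
            match_set = [new_cl]
        # Strength-proportionate (roulette wheel) action selection.
        winner = random.choices(match_set,
                                weights=[cl.strength for cl in match_set])[0]
        action_set = [cl for cl in match_set if cl.action == winner.action]
        if reward is not None:            # environmental feedback this step
            for cl in action_set:
                cl.strength += beta * (reward - cl.strength)
        elif prev_action_set:             # implicit bucket-brigade payment
            payoff = bid_fraction * sum(cl.strength for cl in action_set)
            for cl in prev_action_set:
                cl.strength += beta * (payoff - cl.strength)
        return winner.action, action_set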

The Genetic Algorithm is applied in a similar form to the 'steady-state' GA (see Goldberg, 1989 for an overview of the various forms of GA), producing a small number of new classifiers relative to the total population size. Furthermore, it is applied periodically rather than in every generation. The genetic operators applied are typically crossover and mutation (an illustrative sketch of these operators is given at the end of this section). Since parent classifiers are chosen using the strength record, it is likely that over time better performing classifiers will be discovered. The combination of selective pressure in the performance cycle and the induction algorithm, and population pressure exerted by the fixed population size and the deletion algorithm, will tend to increase the average strength of the population of classifiers towards an optimum value. In applying the induction algorithms the LCS must balance the conflicting requirements of optimisation (that drives towards convergence) and co-operation (that requires controlled and maintained diversity). This has, unfortunately, been shown to be a difficult balancing act, as Section 1.4 identifies.

The Learning Classifier Systems field is actually segmented into two related but distinct approaches. The Michigan approach derives from the proposals of Holland more directly and was first seen within the implementation of Holland and Reitman (1978). This views the LCS as containing a single limited size population of classifiers with the Genetic Algorithm operating upon rules within that population, and is the approach described above. The alternative approach is called the Pittsburgh approach and was developed by Smith (1980). The Pittsburgh LCS maintains a number of distinct populations of classifiers, and when the Genetic Algorithm is applied it operates over the set of populations as though each population was a single genotype. It could be claimed that the Pittsburgh approach is more in tune with the operation of a Genetic Algorithm as an optimising engine, and it is clear that the Pittsburgh approach does not suffer from

9

many of the problems that can be seen within the Michigan approach. Nonetheless, the Pittsburgh approach is often computationally expensive and can be problematic where large populations or a large number of attributes within each classifier are required. In this research work it will be assumed that the term LCS will be synonymous with the Michigan LCS and the Pittsburgh LCS will not, in general, be considered.
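As an illustration of how a steady-state GA step of this kind might look over ternary-condition classifiers, the sketch below (reusing the hypothetical Classifier class from the earlier sketch) selects two parents by strength, applies one-point crossover and point mutation, and replaces the weakest classifiers so that only a small part of the population changes per invocation. The rates used are arbitrary assumptions.

    import random

    TERNARY = "01#"

    def ga_step(population, p_crossover=0.8, p_mutation=0.04):
        # Roulette-wheel selection of two parents over the strength record.
        weights = [cl.strength for cl in population]
        p1, p2 = random.choices(population, weights=weights, k=2)
        c1, c2 = list(p1.condition), list(p2.condition)

        # One-point crossover of the parent conditions.
        if random.random() < p_crossover and len(c1) > 1:
            cut = random.randrange(1, len(c1))
            c1[cut:], c2[cut:] = c2[cut:], c1[cut:]

        # Point mutation over the ternary alphabet.
        for child in (c1, c2):
            for i in range(len(child)):
                if random.random() < p_mutation:
                    child[i] = random.choice(TERNARY)

        # Steady-state insertion: delete the weakest classifiers to make space.
        for cond, parent in ((c1, p1), (c2, p2)):
            population.remove(min(population, key=lambda cl: cl.strength))
            population.append(Classifier("".join(cond), parent.action,
                                         strength=(p1.strength + p2.strength) / 2))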

1.4 The LCS Problem

Whilst Section 1.2 sought to identify the potential advantages of the LCS approach over traditional production-rule based classifier systems [and, indeed, many other classifier systems from the domain of Machine Learning], much of the promise of the LCS approach was not realised until recently. A number of shortcomings could be identified, such as limitations of the attribute coding chosen or the high sensitivity to parameterisation. Although these are all valid criticisms, this section will concentrate upon four particular problems. These problems could be viewed as fundamental issues that may not be solvable without a careful reconstruction of the Michigan LCS approach.

The forms of environment that a LCS has to learn within can be divided into two classes. The first is composed of those environments where environmental feedback (which becomes the reward) is returned on each input-action iteration of the LCS. These environments will be termed single-step environments. The second contains those environments where the LCS is only given a reward after a number of input-action iterations. These will be termed multiple-step environments. Single-step environments are simpler to learn from since it is always clear that the feedback applies to the classifiers that have just been used. Many problems that appear to be expressible as multiple-step environments can actually be re-configured as single-step environments, and therefore this form occurs frequently. Multiple-step environments require a chain of actions to occur before the feedback is received, thus producing a need to identify which of the classifiers that have operated since the last feedback the reward actually applies to - the so-called Credit Allocation problem. This is not a problem unique to the LCS approach, and most classifier systems address it by maintaining a large history of actions and an engine that is able to traverse this history to identify the paths that the feedback should be applied to. The LCS approach seeks to follow the A-Life path of simplicity, shunning this form of record keeping in favour of more local mechanisms [6].

[6] …although the Holland and Reitman (1978) implementation did actually maintain an invocation history! It was not until the work of Riolo (1988) that the local approach was fully applied.

The first two problems that will be highlighted occur in both forms of environment. The problem of over-generals has long been recognised, although it is only recently, with the advent of the XCS implementation, that it has been discussed in a systematic manner (Kovacs, 1996, 1997, 1999, 2000). In order to allow the LCS to identify classifications, the input representation of the LCS provides a means by which generalisations of the input can be identified. These generalisations enable one classifier to cover a number of possible input states. The LCS should then utilise the Genetic Algorithm to explore possible generalisations (in addition to possible input combinations) and identify the generalisations that cover all input combinations belonging to each action class. In early LCS the reward received from the environment was directly applied as the strength value of the classifiers it was given to, with each classifier averaging the rewards it received. If the GA is the primary rule induction operator and operates only on the set of classifiers that have been used within that LCS iteration, then the more general classifiers will be involved in more GA operations and will proliferate even if not of the highest strength. Worse still, a general classifier may receive poor feedback for a poor action on some iterations but very high feedback on other iterations, and end up with a good average strength. The increased occurrence of the GA for this classifier will more than make up for its weaker average strength, and so classifiers that do not identify a class accurately will be maintained strongly. A response to this problem may be to site the operation of the GA across the whole population or to provide some weighting mechanism that favours more specific classifiers. The unfortunate side effect of these approaches is to discourage generalisation. In classifier system implementations that have taken this approach a tendency towards specialist classifiers is seen, allowing the LCS to identify accurate classifications but using a number of classifiers to represent each distinct classification. This makes the LCS a less effective classification tool and introduces fragility into the LCS - the deletion of any one classifier from the set of classifiers covering a particular classification will make the classification incomplete.

The second problem arises as a property of the Genetic Algorithm itself. The G.A. can be identified as a form of function optimiser, and indeed is used for this very purpose in many G.A. applications.

As a part of its traditional operation it will seek to identify and proliferate members of the population that are of a high fitness (using a suitable fitness metric). As the operation of the G.A. continues, the population will converge towards a single member - the solution. This form of operation is inappropriate within a LCS, where a population of co-operative classifiers must be maintained in order to cover all classification categories that are available for discovery. To combat this behaviour some LCS implementations limit the occurrence of the GA so that it operates as a background operator. Unfortunately this also limits the ability of the LCS to explore the problem space and results in weaker classification performance. An alternative approach has placed a limit on the number of duplicates of a classifier that may exist within a population. This introduces a very primitive form of niching within the population, but has the side effect of making the LCS population fragile - it is possible to eliminate a classification by the removal of a single classifier. This phenomenon has been described as the 'Boom-Bust' cycle of a LCS (Smith, 1999).

In addition to these problems within single-step environments, the LCS faces additional problems in a multiple-step environment. Although not the only mechanism available for credit allocation, the 'Bucket-Brigade' is the mechanism proposed by Holland and adopted within many subsequent implementations. This mechanism resolves the credit allocation problem by requiring each classifier to pass a proportion of its strength to the classifiers that were used in the previous step of the current episode (where an episode consists of the steps from one feedback receipt to the next). The Bucket-Brigade mechanism only requires the classifiers active in the current and previous steps to be known, dramatically reducing the information storage required for credit allocation. The mechanism uses an economic paradigm, so those classifiers that give payment to the previous classifiers will have their strength reduced in proportion to their payment. This means that classifiers earlier in the 'rule chain' will receive a small fraction of the strength of later classifiers. Unfortunately this payment mechanism causes delays in the establishment of a stable payment to early classifiers, putting them at risk of deletion through losing out within the Genetic Algorithm (the G.A. uses strength as the basis of selection). Furthermore, the "taxation" regime utilised within many LCS leads to a reduction in the strength of early classifiers and their elimination before sufficient payment reaches them down the rule chain. Thus the payment mechanism itself leads to the situation where long rule chains are unlikely to be maintained by the LCS. Since rule chains cannot exist without these earlier classifiers in any case, what is often seen is that the early classifiers are discovered first and successive classifiers then start to be discovered; however, before the rule chain can be fully established, the earlier classifiers are deleted by competition in the G.A. from the very classifiers they allow to exist. This in turn prevents the later classifiers from being used, and gradually they are deleted. Although this problem can be counteracted to some degree by the introduction of duplicate limits, the fragility problems discussed previously will then appear.
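The delay in payment reaching early classifiers can be illustrated with a small simulation. The sketch below makes purely illustrative assumptions (a fixed chain of eight classifiers firing in sequence every episode, a bid fraction of 0.1, no taxation, and reward only at the end of the chain):

    CHAIN_LENGTH, BID_FRACTION, REWARD, EPISODES = 8, 0.1, 1000.0, 15

    strengths = [20.0] * CHAIN_LENGTH   # classifier i fires at step i of each episode

    for _ in range(EPISODES):
        for i in range(CHAIN_LENGTH):
            bid = BID_FRACTION * strengths[i]
            strengths[i] -= bid          # each classifier pays its bid...
            if i > 0:
                strengths[i - 1] += bid  # ...to the classifier that set the scene
        strengths[-1] += REWARD          # only the last classifier is rewarded

    print([round(s) for s in strengths])
    # The tail of the chain climbs steeply towards the reward level, while the
    # earliest classifiers remain far below it - they pay out bids before any
    # reward-derived payment has had time to propagate back to them, leaving
    # them weak and at risk of deletion by a strength-based GA.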

The final problem is a by-product of the provision of internal messages within the LCS. A LCS could be produced as a purely reactive agent, responding only to environmental changes. However, it is possible that the LCS could reduce its reliance upon the environment if it could send itself messages, thereby allowing one rule to trigger another. Although Wilson (1985) had shown that an 'implicit bucket brigade' could be set up that does not rely upon internal messages, the Bucket-Brigade system proposed by Holland and later implemented by Riolo was dependent upon internal messages. Smith (1991) showed that in a mechanism that was removed from objective evaluation in the environment it was likely that 'parasite' classifiers could develop and be maintained. These classifiers would receive payment by being involved with an internal message even though they did not actually contribute anything to the final payment received by the classifier. These classifiers occupy population space and so can compete with other, beneficial, classifiers to the possible detriment of the learning capabilities of the LCS.

Although the preceding problem descriptions have not provided a detailed account of the problems, it is clear from the account given that the Michigan-style LCS can present a number of difficult-to-solve problems to the user. These problems, although well documented, do not appear to admit a solution that addresses all of them without introducing further new issues. Much of the work in the LCS area over the 1990-1995 period was therefore involved in finding solutions that sought to address one or two of these issues. The problem of 'rule-chaining' - the establishment and maintenance of classifiers so that the classifiers are matched and act in sequence to move the LCS towards an external reward - has been a major issue. Indeed, it could be cited as the major issue preventing the exploitation of Michigan-style LCS within commercial applications. Research into the problem of rule-chaining was relatively limited, with the major early work carried out by Holland and Reitman (1978), Booker (1982), Forrest (1985), Riolo (1988, 1990), and Yates and Fairley (1993), and analysis carried out by Westerdale (1985, 1987, 1989), Montanari (1992), Miller and Forrest (1989), and Cliff and Ross (1994). All of these investigations identified or discussed the problems of both the discovery of rule-chains and the maintenance of these chains.


Given the apparent inability of a LCS to reliably maintain chains of more than a small number of classifiers, it has been suggested that, rather than seek to further adapt the parameterisation of the LCS, a solution might be found in accepting short rule chains and utilising hierarchies of short chains to achieve longer rule chaining (Wilson and Goldberg, 1989). Speculative work by Wilson (1988) had discussed the possibility of adding hierarchical classifier invocation to the LCS, and the work of Dorigo and Schnepf (1993) and Dorigo and Colombetti (1994) had investigated a fixed form of pre-defined hierarchy, primarily within single-step problems. No investigation of the ability of the LCS to maintain short rule chains to be selected from within the hierarchy has been conducted. The fact that workers in the LCS community had started to consider hierarchy was particularly interesting; the A-Life field had shunned the forms of structure that the Artificial Intelligence community so whole-heartedly adopted. Indeed, any claim that hierarchy is required is a matter of some debate, as the next section reveals.

1.5 The Hierarchy Debate

Artificial Intelligence research has traditionally held to a structured view of information, typified by decision trees, rule trees and semantic nets. This view is founded upon a symbolic approach to the representation of information, maintaining that symbols may be used to establish a concrete view of abstract information. Whilst at the lowest levels symbols would represent simple 'facts', the use of symbols as an abstraction mechanism suggests that further layers of abstraction may be added so that further symbols represent various combinations of facts as concepts. This hypothesis has become a central theme of the majority of the work that has taken on the title of 'Artificial Intelligence' and is termed the 'Symbol System Hypothesis' (for example, see Minsky (1975) and subsequent work on Frames).

Whilst it cannot be denied that there are many existence proofs for the effectiveness of this hypothesis, the difficulties that it introduces are equally evident. The computational burden imposed by the creation and maintenance of the resulting symbolic structures is large in any non-trivial application. This has been seen particularly in the Machine Learning field when such systems are applied to 'real world' applications rather than the game/strategy problems used within traditional artificial domains. The problems of classification (in particular the issue of misclassification) and of the tree pruning required when new information is detected are considerable, leading to the development of many complex strategies, each of which attempts to simplify the process.

This computational burden was a primary factor in the recent reaction against approaches based on the Symbol System Hypothesis within the field of AI. Developments in what has become termed 'Distributed AI' (see Chaib-Draa et al, 1992 for a review of material in this area) have looked to multiple co-operating agents to overcome this difficulty by simplifying the knowledge bases required for any individual agent. This approach has not been wholly satisfactory, simply transferring these problems to new symbol-related problems, such as how to provide shared goals and integrate shared knowledge. The work in this area that stands out, however, comes from those workers who have turned away (in varying degrees) from the symbolic approach to an alternative approach. Typical of this work have been the well-known developments from early Perceptron work and the developments in robotic control using the Subsumption Architecture (Brooks, 1986). In both of these domains considerable progress has been demonstrated, with the robots produced by Brooks in particular illustrating capability similar to that associated with traditional AI work, but with much lower computational resource requirements, able to react immediately to environmental stimuli and (more importantly) robust to environmental change.

Brooks has argued (Brooks, 1990), alongside others from the Artificial Life field (Patel and Schnepf, 1992; Harnad, 1990), that the fundamental reason for these improvements has been the move away from the symbolic approach. Not only does this remove the computational burden of creating and maintaining symbolic structures, but it also removes the Symbol Grounding problem and its consequent Frame Problem (Harnad, 1990). Brooks argues persuasively that the robust nature of such systems comes not from the new system structure per se, but from the physical grounding of such systems: "Without a carefully built physical grounding any symbolic representation will be mismatched to its sensors and actuators. This grounding provides the constraints on symbols necessary for them to be truly useful".

In order to produce physically grounded systems within machine learning, there must be some mechanism that allows even the most basic rules to be learnt as a result of experience, with higher levels of knowledge in turn emerging from these rules (Brooks, 1990). A change in the environment should result in a parallel change in the lower and higher level rules. In such circumstances imposed symbols have no place, since they imply a degree of prediction and outside interpretation that is removed from the learning situation. Thus, such systems may be seen to require the following characteristics (Maes, 1993):

"open" or "situated" in its environment



autonomous - does not require [human] input to guide its development



adaptive - is able to develop and improve it's own structures



emphasises produced behaviour rather than produced knowledge (i.e. If it reacts correctly it need not be capable of explaining why)



has multiple low-level competences and pursues multiple goals

Partially as a reaction to the structural approach to knowledge representation of much AI work, and partially because of the problems that the existence of structure causes when rules change as a result of subsequent learning experiences, most workers in the A-Life field have rejected both hierarchical and other tightly structured approaches to knowledge representation. Instead, work is predominantly connectionist in form, with flat knowledge structures composed of 'sub-rules' (linked simplified computational neural models) or simple production rule structures. Exceptions do exist, such as Rosenblatt and Payton's (1989) work in the area of Reinforcement Learning, which produces a kind of global planning system, or Maes' ANA, which [effectively] links multiple FSMs as nodes in a connectionist structure (Maes, 1991b), but Farmer (1990) has suggested that these are essentially different forms of connectionism that produce comparable results.

The debate between those who see knowledge structures as fundamental and those who regard a flat structure as superior rages both within and between Artificial Intelligence and Artificial Life. Much has been written on the subject, and some of these authors have been quoted above. The resolution of this debate is beyond the boundary of this work, having a much deeper root in Cognitive Psychology and Ethology. There is much in the argument of those working in the field of Artificial Life to be sympathetic with, in particular Brooks' support of physical grounding. However, to discard the power of symbol construction and manipulation at higher levels would be to reject completely the potential to adopt the power of abstraction.

Recent neurological studies have identified that structure is clearly apparent at all levels of brain activity. Tsotsos (1995) argues persuasively against a strict behaviourist stance to the representation of knowledge, illustrating that for many tasks 'intelligent' task performance requires some form of internal memory. In making his case he points to recent physiological data regarding the visual cortex of the macaque monkey. He notes that 32 distinct visual maps have been identified within the visual cortex, and that the majority of the 305 connections between these areas are grouped to provide 121 reciprocal link pairs. Of these link pairs, 65 are ascending/descending pairs that create a hierarchy of maps which is up to 14 layers deep, with no single link pair spanning all 14 layers (and the majority spanning only one or two layers). Since only 4 of the 32 areas have direct connections to somato-sensory or motor areas of the brain, and it is known that areas influence one another - sometimes with an anticipatory influence which precedes an actual stimulus - it would appear that this hierarchical structure plays a key role in the higher cognitive abilities associated with vision.

Such evidence can only add validity to the emerging area of research that is examining the introduction of hierarchical structures to sub-symbolic learning systems that have traditionally had a 'flat' structure. In recent years there have been investigations into hierarchical approaches within robotic systems using Competence Modules (Ring, 1994), as well as within Reinforcement Learning (Singh, 1992; Lin, 1993; Dayan and Hinton, 1993; Bartfai, 1995; Dietterich, 1998; Digney, 1996a, 1996b, 1998; Precup, Sutton and Singh, 1998). Whilst a number of these have used simple fixed hierarchical structures (Singh, 1992; Lin, 1993; Dayan and Hinton, 1993), it has been recognised that such 'hand-coded' structures are a relatively weak approach. As Digney (1996) notes:

"Hand decomposition imposes the designer's preconceived notions on the robot which, from the robot's point of view, may be inefficient or incorrect. Furthermore, it is acknowledged that for truly general learning and full autonomy to occur in the face of unknown and unchanging environments, the structure of the hierarchical control system must be learnt"

As a result, more recent workers in the reinforcement learning area have sought to develop hierarchies as exploration leads the learning systems into new areas of experience, or leads the learning system to continually revisit areas which have already been learnt. This form of emergent hierarchy has been termed the "Holy Grail" of this research area (Digney, 1996a), and it is possible that such an approach would be available for application to LCS. However, much work to investigate the ability of modern LCS approaches to operate within fixed pre-identified hierarchical structures remains to be done before the scene is set for such an ambitious aim. McGovern and Sutton (1998) have shown that considerable benefit is gained even within fixed hierarchical structures, in both the speed of learning and the range over which actions can be planned. Such gains would be significant if achievable within the LCS framework.

1.6 The XCS Approach

In starting this research work the limitations in rule-chaining with the traditional LCS were well known, and it was envisaged that the use of hierarchical structures would address the problem of identifying and maintaining long rule chains. However, it was apparent from the outset that the extreme sensitivity of the LCS to parameterisation, and its inability to adequately maintain a co-operative population of classifiers where a diverse solution set is required, would be likely to threaten the ability of the LCS to develop and maintain the structured population.

In 1995, however, a new form of LCS was introduced by Wilson (1995, 1996). The XCS implementation modified the traditional LCS in a number of vital areas. Chapter 3 provides a detailed description of XCS and the particular XCS implementation used within this research work, so only a brief introduction to XCS is provided at this point. The XCS implementation is based on a simplification of the LCS description known as the 'Zeroth-level Classifier System' (ZCS). As such it maintains classifiers that provide a single simple condition and a single action. It removes the possibility of providing internal messages, presenting the population with a single message from the environment on each iteration. The matching classifiers do not compete as individuals; competition is between the 'Action Sets' - the sets of classifiers that specify each action. On provision of the selected action to the environment and receipt of the reward, the reward is allocated to all classifiers in the action set of the chosen action.

Classifiers maintain a composite strength record that separates the prediction of the payments received by the classifier, which is used in action selection, from the fitness value used in GA selection. XCS bases fitness on a calculation of the accuracy of the prediction of a classifier, which is in turn based on the absolute prediction error. Thus, when the GA operates it will tend to select and proliferate the most accurate classifiers. The most accurate classifiers will not include those that are over-general, since these classifiers will receive varying payoffs. This removes the problem of over-generalisation. Indeed, Kovacs (1996) has stated within his "Optimality Hypothesis" (see Section 3.5.2.1) that XCS will always identify and proliferate classifiers that belong to the optimal sub-population of accurate and optimally general classifiers.

The Genetic Algorithm operates only within the chosen action set. The G.A. thus explores the classifier specificity/generality plane. Classifiers that are too specific will occur in fewer action sets than the more general but still accurate classifiers. Thus, over time, classifiers that are optimally general will be proliferated at the expense of the more specific classifiers. This helps to remove the problem of over-specific classifiers. When new classifiers are created they are inserted into the population, replacing a classifier chosen dependent upon a parameter, maintained by each classifier, that represents the average number of classifiers in the action sets within which that classifier appears. The location of the GA and the form of deletion together introduce a dynamic niching mechanism that will reserve sufficient space within the population for every separate classifier that is in the optimal representation of the solution, and allow each classifier to proliferate to fill that niche. This removes the problem of competition between supposedly co-operative classifiers seen within other LCS implementations, removing the fragility from this LCS implementation. Since XCS does not provide internal messages, the problem of parasites does not occur within XCS.

The XCS implementation thus addresses the major problems that were identified as applying to other LCS formulations. Furthermore, the dynamic niching property of XCS is ideal for the maintenance of the more complex populations that are likely to result from the introduction of a hierarchical population. Thus, when XCS was introduced and became known it was a natural choice as the basis of further research in this area.

Unfortunately the adoption of XCS would bring with it other problems. As a new LCS implementation with dramatically different properties when compared to other LCS implementations, and with no supporting literature, an investigation of a number of aspects of the XCS implementation would be required before work on the introduction of hierarchical features could begin. In particular, whilst evidence was rapidly gained on the validity of the Optimality Hypothesis within small to moderate complexity direct-reward environments, no similar results were available for multiple-step environments.
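A rough sketch of these accuracy-based updates is given below, following the update equations commonly quoted for XCS (Wilson, 1995): prediction and error are adjusted by a learning rate, accuracy falls off sharply once the error exceeds a threshold, and fitness tracks accuracy relative to the other members of the action set. The parameter values and the classifier fields (prediction, error, fitness) are illustrative assumptions, not those of the implementation described in Chapter 3.

    BETA, EPSILON_0, ALPHA, NU = 0.2, 10.0, 0.1, 5.0

    def update_action_set(action_set, payoff):
        # Widrow-Hoff updates of each classifier's prediction and error.
        for cl in action_set:
            cl.prediction += BETA * (payoff - cl.prediction)
            cl.error += BETA * (abs(payoff - cl.prediction) - cl.error)

        # Accuracy: 1.0 within the error threshold, falling off steeply outside,
        # so over-general classifiers (which see varying payoffs) score poorly.
        accuracies = [1.0 if cl.error < EPSILON_0
                      else ALPHA * (cl.error / EPSILON_0) ** -NU
                      for cl in action_set]

        # Fitness moves towards accuracy relative to the rest of the action
        # set; the GA selects on fitness, so accurate classifiers proliferate.
        total = sum(accuracies)
        for cl, kappa in zip(action_set, accuracies):
            cl.fitness += BETA * (kappa / total - cl.fitness)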


Since XCS uses a temporal difference scheme for the payment of reward through the action-chain, it is clear that the problem of learning within multi-step environments is a complex multi-level prediction problem. Furthermore, as action-chains grow in length the discount will considerably reduce the predictions of early classifiers in the action chain. It is therefore entirely possible that the generalisation mechanism of XCS will consider these early classifiers to have a sufficiently similar prediction that a single over-general classifier is produced to cover these prediction niches. This would in turn lead to sub-optimal XCS performance. Thus, before mechanisms to support action chains within XCS can be investigated, the ability of XCS to form and maintain action chains (and therefore the validity of the Optimality Hypothesis within delayed reward environments) must be examined.
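The scale of this effect can be seen with a simple calculation, assuming for illustration a terminal reward of 1000 and the discount rate gamma = 0.71 used in Wilson's (1995) experiments:

    GAMMA, REWARD = 0.71, 1000.0

    for steps in range(20):
        print(f"{steps:2d} steps from reward: payoff level {REWARD * GAMMA ** steps:8.2f}")

    # Ten steps from the reward the payoff level is ~32.6, and nineteen steps
    # away only ~1.5; adjacent levels deep in the chain differ by less than a
    # typical accuracy threshold (cf. epsilon_0 = 10 above), so a single
    # over-general classifier can cover several of these prediction niches
    # without appearing inaccurate.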

1.7 The Hypotheses

The preceding sections have sought to introduce the Learning Classifier System approach to Machine Learning. In identifying how LCS operate, a number of fundamental weaknesses were identified. One of these weaknesses is the problem of creating and maintaining the chains of classifiers required in order to learn paths to rewards in a multiple-step environment. Wilson (1988) identified a conceptual arrangement of the LCS that could provide a hierarchical rule chain representation. Dorigo and Schnepf (1993) introduced a more limited fixed hierarchical structure, but their work did not apply this to reducing the problems that arise from rule chaining. Work in related Machine Learning areas, and in particular within Reinforcement Learning, suggests that hierarchical arrangements can be used to reduce the number of steps to a reward required from any one level of the hierarchy. Work in ethology and neurology suggests that a hierarchical approach does have parallels in real action-selection and decision behaviour. Finally, the likely problems of seeking to provide hierarchy within a traditional LCS were identified, and the new XCS approach was suggested as a possible solution. Given this background the central hypotheses can be elaborated. There are two hypotheses, and from these a number of other hypotheses will arise that must be investigated.


Hypothesis 1

The Optimality Hypothesis holds for the use of XCS within short action chains, but the interaction between the temporal difference technique, which reduces the payment of reward to early classifiers within action chains, and the pressure to generalise over similar payoff niches will prevent the formation and maintenance of the accurate, optimally general sub-population.

Hypothesis 2

XCS is able to reliably identify and maintain classifiers within a fixed pre-specified hierarchical structure over a decomposition of the environment state-space, so that locally optimal short action chains can be selected in an optimal sequence to establish a route to an environmental reward that would otherwise require a longer action chain.

1.8 Approach

In order to investigate these hypotheses it has to be recognised that little additional information on the behaviour of XCS within arbitrary multiple-step environments existed at the time that XCS was adopted. Indeed, there was neither a clear and unambiguous definition of how XCS operated, nor an existing implementation that could be adopted. The investigation of these hypotheses therefore had to be broken down into a number of discrete preliminary investigations, as follows:

1. Define, implement, and verify an XCS implementation

No XCS implementations were publicly available at the time XCS was adopted. Kovacs (1996) provided a better definition of the operation of XCS than Wilson (1995) or Wilson (1996), but a number of crucial implementation details remained undefined. The first step was therefore to define the operation of XCS more formally and to provide an implementation that could be verified to operate with the same performance as the existing implementations. An auxiliary aim of this stage was to make the XCS implementation publicly available for the benefit of other researchers in the field.


2. Establish the action-chain limitations of XCS

Since XCS was a radically new form of LCS, the results of existing rule-chaining investigations cannot be applied to it. It is thus necessary to identify the rule-chaining limitations of XCS so that the continued requirement for a hierarchical approach can be ascertained. This investigation will provide the primary source of evidence from which an assessment of the validity of Hypothesis 1 can be made.

3. Identify the ability of XCS to utilise lower-level rule chains

Within the Wilson (1988) hierarchical LCS framework each lower-level rule chain is provided payoff in relation to its position within the rule chain. However, if a lower-level rule chain is invoked at different places in different rule chains, or in more than one place within a single rule chain, it will receive different payoffs. Since XCS is an accuracy-based LCS, this variability in payoff is in conflict with the operation of XCS. The issue of multiple payoffs to XCS must be investigated to establish a suitable payoff regime for the lower-level rule chains.

4. Identify lessons that can be applied from previous Hierarchical LCS work

Dorigo and Schnepf (1993) and Dorigo and Colombetti (1994) provide the sole current experience of fixed pre-specified hierarchies within LCS. However, there is a significant body of other related LCS research. It is important to use this body of research to identify any key lessons that may be learnt to guide the development of a fixed hierarchical structure within XCS. Furthermore, the now considerable body of work on hierarchical methods within the wider reinforcement learning community can be used to identify potential points of crossover into LCS work.

5. Apply a Fixed Hierarchical structure within XCS

Although fixed hierarchical structures are not the ideal solution, as noted in Section 1.5, they are better understood. The application of a fixed top-down structure to XCS will be investigated to establish whether shorter rule-chains can be identified to solve more difficult exploration problems.

This subdivision of effort is reflected in the structure of the thesis, with chapters corresponding to each of the identified areas. It is hoped that the results of this study will provide an impetus for further developments in the area of action chaining within XCS and the use of hierarchical methods. It is thus possible that the real benefits of using Learning Classifier Systems will only be seen when 'higher level' rules are allowed to emerge by the action of the induction algorithms. This would allow the rule sets to apply 'symbolic' reasoning whilst maintaining the advantage of continual learning and development available within classifier systems. The emergence of these higher-level rules only becomes possible once the classifier system starts to move from simple low-level classifier sequences to the generation of classifiers that can 'reason' about other classifiers and cause the appropriate classifier sequences in the population to be triggered. Such an LCS implementation would share the characteristics Maes has defined for 'behaviour based' systems, and yet hold the possibility of gleaning the accepted benefits of working at higher levels that have thus far not been convincingly applied to the domain of LCS operation.


Chapter 2

LEARNING CLASSIFIER SYSTEMS

"Classifier systems are a quagmire - a glorious, wondrous, and inviting quagmire, but a quagmire nonetheless."
- Goldberg, Horn, and Deb, 1992

2.1 Background

The basis of much of the effort of the Machine Learning community has been the development of efficient and/or effective search methods (Russell and Norvig, 1995). The Genetic Algorithm has been commonly thought of as a function optimiser (Goldberg, 1989; DeJong, 1975), but function optimisation can itself be re-cast as a search process over the domain of parameter values. Therefore, it would appear logical to investigate the mapping of the G.A. mechanisms onto the search domain of machine learning. Holland (1971, 1975) proposed a number of computational machines of varying complexity based on this idea, and sought to produce a proof of concept implementation (Holland and Reitman, 1978). This proof of concept differed from some of the original proposals for pragmatic reasons, but set the scene for learning systems that were to become known generically as Learning Classifier Systems. Holland (1975) has defined a LCS to be:

"a special kind of production system designed to permit non-trivial modifications and reorganisations of its rules as it performs a task"

This is a broad definition that allows a large amount of variability in both emphasis and form. Indeed, the origins of the LCS in Holland's early papers (Holland, 1968, 1971, 1975) [7] show a LCS with a very different form [8] from the Canonical LCS, which is commonly associated with Holland et al. (1986).

[7] See Goldberg (1989) for a brief review of Holland's early work.
[8] Interestingly, many of these early suggestions are now being re-introduced to overcome some of the limitations of LCS. For example, some of the work of Donnart and Meyer (1994) explores internal effectors and internal retribution, whilst Cliff and Ross (1994) examine the internal state proposals of Wilson (1994), which reflect some aspects of Holland's hypothetical Prototype III.

These differences have continued to be seen in subsequent implementations, with the most important variations seen within the work of Holland and Reitman (1978), Smith (1980), Booker (1982), Wilson (1985), Holland (1986), Riolo (1988a), Goldberg (1989), and most recently Wilson (1995, 1998) and Stolzmann (1997). In fact, almost all LCS work has involved the introduction of variation in implementation, so that a definitive specification is all but impossible. This situation has arisen for three main reasons:

• Doubt over the efficiency or effectiveness of the production rule representation
• The failure of the LCS to show the predicted levels of performance
• Debate about the site of the G.A. and the unit of G.A. operation

These problem areas equate to each of the three main components of a LCS. The disparity and limitations of existing LCS implementations present a considerable problem for any researcher entering the LCS field, with no agreed resource available to use as a reference point to guide the choice of LCS implementation and/or features. Although Goldberg's SCS (1989) and Wilson's ZCS (Wilson, 1994) are useful existing reference-point implementations, they each emphasise a particular aspect of a LCS framework. This divergence has been seen to mislead those seeking to gain a wider appreciation of the field.

This chapter seeks to contribute to the LCS research field by identifying a "mid-point" theoretical implementation that will be termed the "Canonical LCS". This implementation seeks to encompass the main features that are present in most major LCS implementations. The Canonical LCS is then used to identify the differences and/or similarities between a selection of major LCS implementations. It is hoped that the provision of such a reference will provide a starting point for researchers within the field and prevent the confusion in regard to the capabilities of the various LCS implementations that is so evident in newcomers to the area. It will also serve as a reference point for discussion of the XCS learning classifier system implementation, and for the results presented throughout this thesis.

The chapter starts with a broad overview of the architecture of the Canonical LCS to introduce the major features. It then describes the components in detail to define the Canonical LCS. Using the Canonical LCS, a selection of key LCS implementations is presented to identify the differences between various LCS approaches and some common themes. This chapter does not seek to be exhaustive in its coverage - with over 550 known LCS-related papers published in the last 22 years of LCS research this would not be a feasible objective. However, the choice of LCS presented in this description is intended to identify the LCS implementations that have been particularly influential. The reader seeking comprehensive reviews of LCS work is directed to Wilson and Goldberg (1989), Fogarty, Carse and Bull (1994), and Lanzi and Riolo (2000) for reviews of work within the field in the years preceding each of the respective publications.

2.2 A Brief Overview

It is possible to divide the LCS into three separate areas:

• The Knowledge Structure
• The Learning Algorithm
• The Induction Algorithms

Within a Genetic Algorithm applied to an optimisation task, population members typically comprise some encoding of a string of parameter values. However, knowledge structures are much richer (Rich and Knight, 1991), identifying causality and relationships and often containing complex sub-structures. The complexity of many of these representations would make them unsuitable for manipulation by a G.A. (although work within the field of Genetic Programming (Koza, 1992) does indicate that more complex structures can be manipulated). Production rules, however, are easily representable, and their incorporation into a G.A. framework requires little modification to the encoding conventions already predominant within genetic algorithm research. Holland conceived of the encoding as a "Population" of production rules (termed "Classifiers" [9]), each with simple binary or ternary encoded condition and action values. The population would not necessarily be pre-coded by a 'programmer'; rather, the classifiers would compete like individuals in a diverse ecological environment to produce co-operative knowledge structures by the interaction of well 'adapted' classifiers.

[9] The term 'classifier' is often used to denote one production rule within a population of LCS rules. Whilst it is recognised that this term is something of a misnomer, this convention will be adopted.

The identification of which classifiers are better adapted than others requires the existence of a Learning Algorithm. This will credit a classifier with a value that should reflect its usefulness, correctness, or possibly its accuracy.

The fixed-point value of any classifier will be dependent upon its usefulness within the particular situation the learning system is seeking to address at that time, and therefore the value of a classifier must be computed over a longer time period. Furthermore, classifiers that do not themselves cause movement into a situation that is immediately of value, but do lead towards a valuable position, must also receive credit. Learning Algorithms used by many Machine Learning systems often require the maintenance of complex histories that can be analysed to generate the correct rule value updates. Holland (1975) proposed a system that required negligible local information to be maintained (although the implementation of Holland and Reitman (1978) actually used a much more complex system). In this system, a classifier that generated an immediate external reward led to the value of the classifier instigating the action being updated with a payoff proportionate to the reward. In addition, all classifiers passed back a proportion of their value to any 'scene setting' classifiers - a process termed the 'Bucket Brigade' because changed reward values move gradually back through the chain of classifiers to the originators.

The Learning Algorithm itself only identifies those classifiers that are useful; it cannot replace poor classifiers with better classifiers or improve upon the usefulness of the classifiers that exist - this is the function of an Induction Algorithm. If the set of all classifiers that could be represented by a particular production rule format is a search space, then a Genetic Algorithm should be able to traverse the search space to find better alternatives by using the classifier value as the heuristic measure. Therefore, the Genetic Algorithm was conceived as the primary rule induction algorithm for these learning systems. Since the search space over any 'useful' production rule representation will be both large and sparse, the G.A. is a well-suited search method. A negative consequence of using a G.A. is that the probabilistic element within G.A. search may lead to poorer intermediate classifier populations, although this problem can be overcome to some extent by using an Elitist strategy within the G.A.

Although a G.A. should be sufficient in the long term, Holland suggested a number of additional induction operators to counteract short-term deficiencies in a population that may prevent it from having sufficient useful classifiers to allow the Learning Algorithm to evaluate all of the classifiers within the population. At times when no classifier has a condition that is applicable to the current situation, the Create Detector Operator creates a classifier with the appropriate condition and an [often randomly created] action. Where the classifiers are such that an inappropriate action is always taken in a given situation, a number of possible forms of Create Effector Operator can be used to create a similar classifier with an alternative action, create a more specific classifier with a different action, or create a more general classifier with a different action. These induction operators are called 'triggered operators' because they operate in response to well-defined situations in order to resolve temporary problems that may prevent the main induction algorithm from performing an effective search. The degree of 'intelligence' with which these operators work can vary, but the better targeted the operator, the more global and local performance information is required. In general, fairly simple implementations are chosen for pragmatic and/or ideological reasons.

Holland and Reitman (1978) did not actually evaluate the effectiveness of the proposed Learning Classifier Systems framework, because their system made use of a more complex Learning Algorithm that required a history of rule interactions to be maintained in order to obtain more accurate Credit Allocation. However, this work did identify the fundamental components of a Learning Classifier System, and with the work of Wilson (1985, 1987, 1994), Holland et al (1986), Riolo (1988a), and Goldberg (1989) a theoretical LCS implementation termed the "Canonical LCS" can be created that contains all the major features of a LCS. This theoretical implementation can then be used as a reference for a comparative study of the different features of the major physical LCS implementations.

2.3 The Canonical LCS

Before considering the operation of the LCS, more detail on the architecture of the Canonical LCS is required. The overall architecture is pictured in Figure 2.1, and the main elements of the architecture are introduced in Section 2.3.1. Booker (1988) defines an LCS to consist of three interacting subsystems:

• The Performance Subsystem
• The Conflict Resolution and Credit Allocation Subsystem
• The Rule Induction Subsystem

This decomposition will be used to provide the basis of the discussion of the operation of the Canonical LCS within the remaining subsections of section 2.3.


2.3.1 The Architecture of the Canonical LCS

The central 'knowledge-base' of the Canonical LCS consists of a finite-sized "Population" of production rules known as "Classifiers". Production rules have been shown to be computationally complete (Post, 1943; Minsky, 1967), and yet are a simple uniform representation that can be readily manipulated whilst remaining accessible to human interpretation. As such they represent an ideal candidate for use within Learning Classifier Systems.

[Figure 2.1 - The Canonical Learning Classifier System. The diagram shows the Population of classifiers matched against the Message List (including an Internal Message List) to form the Match Set, Conflict Resolution selecting the Action Set via bids, the Credit Allocation subsystem distributing reward, payoff, and tax, and the Induction Algorithms generating new classifiers, all within an input/output loop with the Environment.]

Like production rules, classifiers are divided into two parts - a "condition" and an "action". In the Canonical LCS all classifiers consist of either one or two conditions [10].

[10] Forrest (1985) and Holland (1986) allow classifiers to have more than two conditions. Riolo (1988b) asserts that two conditions are sufficient if it is possible to create and maintain co-operating sub-populations of classifiers that together represent more complex conditions.

Where two conditions are provided it is conventional for the first condition to represent values expected from the environment whilst the second condition represents values sent by other classifiers. This enables one classifier to trigger other classifiers and thereby produce a sequence of classifier invocations from a single input. Classifiers with a single condition conventionally only represent expected values from the environment, and therefore encode simple stimulus-response systems. In fact, it is possible to devise an encoding of conditions that uses extra bits ("tag" bits) to identify the source of expected values and therefore create classifier systems with internal classifier triggering using a simpler structure. However, this mechanism does not allow a classifier to be dependent upon both other classifiers and the current input without more complex tag creation mechanisms.

In order to buy into the Schema Theorem (Holland, 1975), the Canonical LCS limits its encoding of the conditions to a ternary alphabet containing the values 0, 1, and #. A condition within a classifier is a fixed-size sequence of values from this ternary alphabet. The meaning of the combination of 0 and 1 values within a condition is unknown to the LCS, but corresponds to the meaning attributed to the bits within the messages constructed by the input interface or other classifiers. A condition that does not include any # positions is a conjunction of attribute values and represents the set of values that must all be present in order for the condition to be satisfied (or "matched"). The population of classifiers represents a disjunction of conjunctions. Computationally complete production rules provide a richer representation that includes negations - a negation over negated attributes can then represent a disjunction within a condition (De Morgan, 1849). The Canonical LCS allows negation of conditions, using an additional leading bit position to indicate whether to apply negation or not, but does not allow the negation of individual attributes. Computational completeness is provided by assuming that many short-length classifiers can operate together in order to represent a disjunction - each negated condition represents a NAND relationship.

The # value is a wildcard - a 'don't care' term - that is used to represent undecidedness or the irrelevance of a binary bit. This allows ranges of values to be represented, and is the primary generalisation facility of the Canonical LCS. Clearly the capability of this generalisation mechanism is encoding dependent, with each wildcard representing a 'don't care' of a power-of-two value. Kovacs (1996) notes that

"Classifier conditions define regions within the input action space X × A. However, the ternary alphabet, as used in … classifier conditions, is not able to describe arbitrary regions within this space. For example, to match both 01 and 10, a classifier condition must be ##. However, this classifier will also match both 00 and 11, which may belong to different payoff levels. … This may be viewed as an inability to express certain logical relationships between bit positions. E.g., a condition can specify that bits a AND b both be 0, but cannot specify that only a OR b be 0. Due to these limitations, inexpressible (for single condition / action bit string) generalisations may exist within the payoff landscape."
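This inexpressibility is easy to verify exhaustively for the two-bit case Kovacs describes. The following illustrative check (not from the thesis) confirms that no single ternary condition covers exactly the set {01, 10}:

    from itertools import product

    def matches(condition, message):
        return all(c == '#' or c == m for c, m in zip(condition, message))

    target = {"01", "10"}
    messages = ["00", "01", "10", "11"]

    exact = [cond for cond in ("".join(t) for t in product("01#", repeat=2))
             if {m for m in messages if matches(cond, m)} == target]
    print(exact or "no single ternary condition covers exactly {01, 10}")
    # Prints the failure message: '##' is the only condition matching both 01
    # and 10, and it unavoidably sweeps in 00 and 11 as well.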

Alternative higher-level encoding techniques are possible (Smith, 1980; Parodi and Bonelli, 1990; Grefenstette and Cobb, 1991; Lanzi, 1999a, 1999b; Wilson, 2000a), but these were uncommon in many LCS implementations and are therefore not included within the Canonical LCS.

Clearly each action must be completely defined. The provision of 'don't care' values within the action would leave the action without a complete specification, and therefore a binary action encoding is used. As with condition encoding, higher-level encoding techniques are possible [11]. The meaning attributed to the bit positions and the bit values in those positions is generally known only by the output interface. However, tag bits within the action may be reserved to differentiate between action messages that will be used internally by the LCS and action messages that are sent to the output interface. A ternary representation can be applied within an action, with # representing bit locations that must be replaced by actual bit values before the action is decided. The bit values that replace them are taken from a message that has matched the condition of the production rule, allowing values to be passed from the environment or other classifiers through to the action of the classifier - a mechanism known as "passthrough" (Forrest, 1985). This facility provides a primitive form of parameterisation to avoid the proliferation of similar classifiers encoding slightly different situations [12].

[11] Wilson (1994, 1995) uses a single integer value within his ZCS and XCS implementations, though these implementations are limited to a single composite action per step, such that each integer value identifies one composite action.
[12] Forrest (1985) demonstrated that this mechanism could also be used to provide high-level program-like constructs within a classifier system, a technique used very effectively by Riolo (1990) in his 'Tag-Mediated Lookahead' implementation.

The existence of multiple conditions requires the condition whose matching message is used for passthrough to be identified. The simplest solution would be to identify one condition as a 'privileged condition' so that only those messages that match on this condition are used for passthrough. However, many messages may still match that single condition. Forrest (1985) and Riolo (1988a) both identify the first condition as the privileged condition, and this is primarily used in Riolo's implementation to match messages coming from the environment [13]. They also both allow classifiers that have many messages matching the privileged condition to post an equivalent number of passthrough messages [14]. Unfortunately the passthrough mechanism constrains the action to be the same size as the passthrough condition, which can result in larger than necessary action representations [15]. This can affect the ability of the G.A. to search effectively through the action space. Although Forrest and Miller (1990) report that their LCS made good use of the passthrough feature, the complexities outlined above and the lack of experience of this facility have meant that most simple classifier system implementations do not provide a passthrough mechanism. The Canonical LCS follows this decision by not including a passthrough mechanism.

[13] This decision could prevent internal messages using passthrough to their advantage.
[14] Riolo (1987b) notes that a single classifier that matches many messages and uses passthrough in this way can monopolise the message list. In this case further controls need to be added to retain fair message list access.
[15] Wilson (1994) has noted that there is no practical reason for all conditions to have the same bit length, although in most LCS implementations this is assumed. Different-size messages can be accommodated if the matching mechanism adds a length conformity test as the first cut-off condition of the evaluation.

In addition to the conditions and action, each classifier maintains a "strength" record that is used to indicate the payoff that can be expected whenever the action that the classifier proposes is used. This measure can also be used to indicate the classifier's likelihood of selection for reproduction by the Genetic Algorithm. A number of formulations of the strength record are possible to retain additional information on the classifier performance for use within action selection or the induction mechanisms (for example, see Frey and Slate, 1991; Smith and Goldberg, 1991; Dorigo and Bersini, 1994; Dorigo, 1995; Wilson, 1995, 1998).

The population is a finite-size store of classifiers. In many LCS implementations it is common to start with a randomly seeded population of classifiers so that the learning process has a set of candidates to start from [16]. It is also possible to start with a user-provided population, created to provide some initial domain knowledge prior to learning.

[16] Wilson's XCS (1995, 1998) starts with an empty population and uses the induction algorithms to introduce classifiers as they are required.

There is no significance in the order of storage within the population - all operators introduced are presumed to be order-independent.

The Canonical LCS is a 'situated' learning agent - it is interfaced to an environment that is perceived by a limited set of "detectors" and acted upon by a limited set of "effectors" (see Figure 2.1). Wilson (1986a) helpfully pictures the LCS as an "Animat" - an artificial animal within an artificial or real environment. The environment may indeed be a simulation of a real-world situation, or may be an abstract computer-based situation such as a data warehouse. To a certain extent the complexity of the environment itself is irrelevant, since even a perfect learning agent is limited in its abilities by the complexity of its detectors and effectors (Wilson, 1991).

Input is obtained from an environment by a user-provided interface that maps input gathered from the environment by a pre-determined set of detectors onto "messages". Messages are constructed from the detector readings as a fixed-length sequence of binary bits using a user-defined encoding. Booker (1991) notes that the position of bits in the encoding relative to one another can substantially affect the operation of the Genetic Algorithm used to create new classifiers (although this is dependent upon the actual form of genetic operations used). Therefore, care must be taken to construct a suitable mapping. The interpretation of the binary representation is known only to the user-provided input and output interfaces - the LCS operates without knowledge of the meanings attributed to the encoding used. The number of ternary values within the first classifier condition must correspond to the number of bits in the input message so that the first classifier condition can be used to represent the classifier's correspondence to the current environmental input.

An action from the appropriately selected classifiers within the population will be used to identify the operations that are to be performed on the environment. The classifier's action will therefore become a message from the classifiers to the environment. If two conditions are provided by the classifiers, then the action may use an additional tag bit to identify whether the action is to be performed on the environment or is to be transformed into an "internal message". In the same circumstances, if the tag bit is not used then all messages are used as internal messages. An internal message is a message sent back to the classifier system and used to match against the second condition of the classifiers. Using internal messages it is possible for one set of classifiers to trigger a further set of classifiers, producing what is termed a "rule chain". Rule chains are the primary mechanism within the Canonical LCS for establishing a sequence of actions leading to a distant reward [17].

[17] Rule chains are neither the only mechanism nor arguably the best mechanism for achieving a sequence of actions - see Wilson's ZCS (1994) or Goldberg's SCS (1989) for other mechanisms.


Rule chains are the primary mechanism within the Canonical LCS for establishing a sequence of actions leading to a distant reward17. The internal message and the second condition must be the length of the action without the tag bit. Messages sent as actions to the environment are decoded by a user-provided environment effector interface and used to activate the appropriate Animat effectors to perform operations upon the environment.

Messages produced from the environment, sent to the environment, or sent internally are maintained within "message lists". A message list is a sequence of messages with a finite maximum size. In the Canonical LCS there are two message lists - an input message list and an output message list. The input message list receives input messages from the environment and therefore often contains a single message. The output message list holds messages sent back to the LCS to form rule chains as well as messages to the environment, and is of a small finite size. If a tag bit is used to distinguish between internal and output messages, then only the internal messages will be maintained on the message list after the action is performed, conceptually forming an internal message list. If the Canonical LCS is used to hold classifiers with a single condition then rule chaining is not operative, and so no internal message list is created nor are internal messages identified.

2.3.2 The Performance Subsystem

The Canonical LCS operates using a pre-specified fixed number of "trials" within the environment, during which the LCS must acquire a population of classifiers that allows it to perform optimally within that environment. Each trial consists of a number of "iterations" or "steps". Each step starts by obtaining input (if available) from the environment via the Animat's detectors. The input message and any internal messages are presented to the LCS. New internal messages and an external action (if available) are selected by the LCS using the Conflict Resolution Subsystem. The selected external action is presented to the environment through the effectors, and the environment is checked to see whether a reward is provided or the environmental goal has been reached. If a reward is obtained from the environment it is applied to the LCS by the Credit Allocation Subsystem, along with any other internal payments or penalties. At various stages during the iteration "triggers" may indicate that one or more of the induction algorithms should be applied by the Rule Induction Subsystem to generate new classifiers as members of the population. If new population members are produced, existing classifiers may have to be deleted to make room for them within the population.

17. Rule chains are neither the only mechanism nor arguably the best mechanism for achieving a sequence of actions - see Wilson's ZCS (1994) or Goldberg's SCS (1989) for other mechanisms.

A trial ends whenever a pre-determined number of iterations have been completed or the environmental goal is reached. Once a trial has completed, the Animat is moved to a new starting position and a new learning trial begins. At the end of each trial the information gathered during the trial is used to generate a report on the performance of the LCS during that trial, and these reports are written to file for later analysis.

2.3.3 The Conflict Resolution Subsystem

The identification of classifiers that are relevant to the current environmental situation is the first responsibility of the Conflict Resolution Subsystem. This is achieved by matching the input message against the first condition of the classifiers within the population and the internal messages against the second condition. A message matches a non-negated condition if for every bit position within the message there is the same value at the corresponding position in the condition, or the value at that position in the condition is the 'don't care' value. A negated condition is matched if no message in the message list would have matched its non-negated form. Those classifiers whose conditions have all been matched are identified and termed the "match set".

It will often be the case that many of the classifiers within the match set propose different actions. Whilst it is possible to allow a classifier system to start a number of actions at the same time, this makes it difficult to determine which classifiers should be allocated any received reward. The Canonical LCS therefore requires that a single external message is selected for action by the effector interface. The selection of the most appropriate effector message is tied to the reward history of the matched classifiers, as represented in their Strength record. Once matching has been completed a bidding process is started. In this process each classifier calculates a "bid" based on its current strength. This bid will become the "support" for the message being proposed by the classifier. At its simplest, the bid is a fixed proportion of the strength of the classifier:

bid = Kbid × strength

(where Kbid is the Bid Constant provided in the LCS parameterisation).
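As an illustration of the matching and bidding rules just described, the following sketch (constructed for this description rather than taken from any LCS implementation; names such as matches and K_BID are illustrative) checks a ternary condition against a binary message and computes the simple proportional bid:

    K_BID = 0.1  # an illustrative value for the Bid Constant Kbid

    def matches(condition: str, message: str) -> bool:
        # A message matches a non-negated ternary condition if every
        # non-wildcard position in the condition equals the message bit.
        return all(c in ('#', m) for c, m in zip(condition, message))

    def matches_negated(condition: str, message_list: list) -> bool:
        # A negated condition is matched only if no message on the message
        # list would match its non-negated form.
        return not any(matches(condition, m) for m in message_list)

    def bid(strength: float) -> float:
        # The simplest bid: a fixed proportion of the classifier's strength.
        return K_BID * strength

    # Example: the condition 1#0 matches message 110 but not 101.
    assert matches("1#0", "110") and not matches("1#0", "101")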


This bid can be modified in order to emphasise various factors in the selection of classifiers. For example, within the Canonical LCS classifiers that are more general will tend to be favoured by the selection process because of their higher participation in reward or payoff situations. Since the reward mechanism (see section 2.3.4) is summative, it is likely that such classifiers achieve a higher strength from this increased participation and so will be favoured by a simple proportional-strength selection scheme. In order to allow more specific classifiers to be selected whenever they are available (it is likely that more specific classifiers will more accurately represent a particular environmental niche), it has become common practice to weight the bid by the "specificity" of the classifier. The specificity measure for a classifier is defined to be the ratio of non-wildcard to total positions in the classifier conditions18. This weighting mechanism varies in its application. Holland et al (1986) suggested an additive component, although Riolo (1988a) used a power function to adjust the bid in order to achieve satisfactory performance. Goldberg (1989) chose a compromise approach with a double-parameter scheme that provides both a multiplicative and a summative component to allow a finer degree of control.

One aim of this form of bid modification is the development of "Default Hierarchies". There are instances where the most general classifiers are accurate most of the time, but in exceptional situations they propose an inappropriate action. Holland proposed that if the general classifier could co-exist with more specific versions that propose different actions for these exceptional circumstances, then a Default Hierarchy could be created. In many circumstances a Default Hierarchy can encode a problem with fewer rules than would otherwise be the case. For example, the optimal population of classifiers to represent the actions within an environment where the LCS is rewarded for discovering the correct actions to replicate the Boolean XOR function is:

00 → 0
01 → 1
10 → 1
11 → 0

whilst a default hierarchy solution would favour either of the following solutions:

## → 1        ## → 0
00 → 0        01 → 1
11 → 0        10 → 1

18. The inverse of the specificity is the "generality" of the classifier.

In this example the more general classifier would have a stable strength higher than that of the other rules, and without specificity-modified bidding this would prevent the more specific rules from succeeding within action selection. Specificity-modified bidding resolves this dilemma by artificially increasing the bid of the more specific classifiers whenever they are also matched. Work by Riolo (1987a) has shown that large Default Hierarchies can be introduced and used with careful parameterisation, and Goldberg (1989) demonstrated that simple Default Hierarchies do emerge through the action of the G.A. alone. However, Riolo (1989b) showed that the Default Hierarchies that do emerge are gradually replaced by more specific classifiers over time, because of the difficulty of balancing the bid and reward mechanisms to favour more specific classifiers without eradicating the more general default classifiers. Although others have sought to identify better mechanisms for preserving and utilising Default Hierarchies (e.g. Wilson, 1988; Valenzuela-Rendón, 1989; Riolo, 1991; Smith and Goldberg, 1991; Dorigo, 1993), Smith and Goldberg (1991) suggest that an approach based on modifying the bid and reward alone cannot be applied successfully. Indeed, Kovacs (1999b) indicates that LCS implementations whose strength record is based on payoff or reward prediction alone will inevitably produce classifiers that are "over-general". Over-general classifiers are classifiers whose conditions are sufficiently general that they acquire a high strength from the situations where their action is correct, with the result that they are also selected in situations where their action is incorrect. If over-general classifiers are inevitable, then either a reliable Default Hierarchy mechanism must be introduced or measures must be taken to prevent the maintenance of over-general classifiers. Wilson (1995, 1998) suggests that the establishment of a population of classifiers that are "optimally general" (that match only in those situations where the action proposed is correct) is a better approach, and this concept is fundamental to the development of his accuracy-based LCS implementation, XCS.

Another form of bid modification that has been proposed is the addition of "message support". Holland et al (1986) hypothesise that the relative importance of internal messages posted in the previous iteration can be reflected in the value of a bid in the current iteration. This would be achieved by adding a small amount in proportion to the sum of the bids made for the internal messages that were matched by the classifier. Riolo (1989a) used support to encourage rule-chains to form - the additional support of the internal message biased the bid of classifiers matching the internal message, making it less likely that the rule-chain would be interrupted by another classifier.


His results were encouraging, although initial investigations conducted as part of this research programme (though not reproduced here) suggest that the addition of support can encourage "parasitic classifiers" (see section 2.3.4). No additional component to the bid has been included within the Canonical LCS, due to the lack of agreement on the benefits of these approaches, though such additions are clearly required if the LCS is to produce, use, and/or preserve these higher-level features.

After bid calculation the Conflict Resolution Subsystem must select an action from the messages proposed by the matched classifiers, using the calculated bid values. The selection process operates at one of two possible levels - over individual classifiers or over actions. The majority of LCS implementations select over individual classifiers, using the ratio of the bid made by a classifier to the cumulative bid of all matched classifiers that propose an external action to probabilistically select the message (so-called "Roulette Wheel Selection"19). However, it could be argued that an individual classifier does not represent the whole system's knowledge of the best action to perform. Thus some LCS implementations select, again probabilistically, over ratios created from the sum of the bids of all classifiers proposing each distinct action to the cumulative bid of all matched classifiers proposing an external action. In such cases the set of classifiers whose action is chosen is termed the "Action Set". For reasons of stability, discussed later, the Canonical LCS selects the external action to be performed using the cumulative bid approach and records the classifiers used in this selection as the Action Set.

19. See Goldberg (1989) for an explanation of Roulette Wheel Selection and an example implementation.
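The cumulative-bid selection just described can be sketched as follows (an illustrative fragment, not drawn from any of the implementations discussed; classifiers are assumed to expose action and bid attributes):

    import random
    from collections import defaultdict

    def select_action(match_set):
        # Roulette-wheel selection over actions: each action's share of the
        # wheel is the cumulative bid of every classifier proposing it.
        cumulative = defaultdict(float)
        for cl in match_set:
            cumulative[cl.action] += cl.bid
        spin = random.uniform(0.0, sum(cumulative.values()))
        for action, support in cumulative.items():
            spin -= support
            if spin <= 0.0:
                break
        # the Action Set is every matched classifier proposing the winner
        return action, [cl for cl in match_set if cl.action == action]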


If it is assumed that the strength held is indeed a true representation of the worth of the classifier, choosing the message that has been proposed by the classifiers with the highest cumulative bid will result in the best of the discovered actions being performed in any given situation. Simplistically, therefore, the Conflict Resolution mechanism could simply be a search for the classifier within the Match Set with the highest strength. However, in order to discover the stable strength value for each classifier within a population of classifiers it is necessary to make use of each classifier a sufficient number of times. Unfortunately, this means that it is necessary to choose to perform sub-optimal actions in order to gain sufficient feedback on each classifier. Naïvely, it might be assumed that messages from poor classifiers need only be selected for use a few times in order to identify their lack of worth. In fact, due to environmental irregularities and changes to the environment caused by the LCS itself and/or other environmental entities, a classifier that performs poorly in certain circumstances could perform better in other situations20. This means that the LCS must occasionally choose to use sub-optimal actions even when the classifiers proposing these actions have had low strengths assigned to them. This is the so-called Explore-Exploit dilemma (Wilson, 1997), and the probabilistic selection regime outlined above is an attempt to resolve it. Using this technique, as a classifier's strength decreases its message is proportionately less likely to be re-selected, and yet there remains a possibility of selection for further exploration purposes. Equally, useful classifiers should increase in strength and therefore their messages will be selected more often. Over time the behaviour of the LCS will stabilise to select the 'best' actions from the match set.

Selecting internal messages is potentially more problematic, since it is difficult to relate an environmental reward directly to the action of internal messages. The Canonical LCS follows accepted practice in limiting the size of the internal message list and using Roulette-Wheel Selection without replacement to select from the proposed internal messages. As for the selection of the external message, the cumulative bid of the classifiers proposing each internal message is used as the basis of selection, and for each selected internal message the classifiers whose actions are selected are recorded in internal action sets.

2.3.4 The Credit Allocation Subsystem

Upon receipt of a reward from the environment, the Credit Allocation algorithm rewards the classifiers within the current Action Set. In classifier systems which recognise that many classifiers may have bid to send the same message, the reward may be 'shared' between these classifiers. This sharing can be performed in a number of ways: the equal division of the reward between the classifiers,

payment = reward / senders                                      (eq. 2.1)

the division of the reward in proportion to the support of the message,

payment(ci) = reward × (support(ci) / Σj support(cj))            (eq. 2.2)

(where j iterates across all classifiers in the Action Set), or the allocation of the whole reward to all eligible classifiers:

payment = reward                                                 (eq. 2.3)

20. Clearly the performance of a classifier, therefore, is not only dependent upon the goodness of its action but also upon the generality of its conditions.
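The three sharing schemes of equations 2.1 to 2.3 can be stated compactly; the following sketch is illustrative only, and the function names are not drawn from the thesis or any implementation discussed:

    def payment_equal_share(reward, action_set):
        # eq. 2.1: the reward is divided equally between all senders
        share = reward / len(action_set)
        return {cl: share for cl in action_set}

    def payment_by_support(reward, action_set, support):
        # eq. 2.2: each classifier receives reward in proportion to the
        # support it contributed to the selected message
        total = sum(support[cl] for cl in action_set)
        return {cl: reward * support[cl] / total for cl in action_set}

    def payment_full(reward, action_set):
        # eq. 2.3: every eligible classifier receives the whole reward
        return {cl: reward for cl in action_set}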

The benefit of using equation 2.1 is that all classifiers that propose the action will reach the same stable strength, but they will be prevented from over-breeding within the G.A. (and thereby overrunning the population) by the fact that any increase in their number will reduce the amount of reward that each receives. Equation 2.2 has little merit, since it only serves to emphasise the strength differences between similar classifiers rather than bring similar classifiers to a similar worth. Its proposition derives from a perceived unfairness in allocating the same reward to classifiers that contribute a lower support to the message selection. A lower-strength classifier proposing the same action may arise where a classifier has a more specific but equally accurate condition and therefore participates in fewer rewarding events than other classifiers, or where a classifier has an over-general condition and therefore participates in situations where it receives a net deficit in strength due to the cases where it proposes an inappropriate action. A little lateral thought will quickly indicate that equally sharing a reward will not greatly affect these lower-strength classifiers, since they will always obtain a lower overall reward than the completely accurate classifiers in any case. Equation 2.3 will, as is discussed in more detail below, generate a situation in which a classifier can exploit the G.A. to dominate the population, because there is no pressure to prevent proliferation. However, where a single message is selected from the proposed external message list, and only those classifiers in the Action Set are considered in the reward allocation, the pressure of the action-selection competition will tend to prevent proliferation. Within the Canonical LCS the reward payment is calculated using equation 2.1.

In simple classifier system implementations that limit the LCS to ensure that only a single effector message is chosen, reward sharing is rarely used. In this case all of the reward is given to the classifier whose message was chosen for posting to the Message List. This has the effect of making this classifier more likely to be chosen the next time the LCS is in the same circumstances, and therefore prevents other classifiers that had bid the same action (including duplicates of itself) from receiving any reward. Unfortunately, this can cause the LCS to focus rapidly on classifiers that may appear optimal locally but which are globally sub-optimal. Furthermore, the decay in strength of the duplicate classifiers (and any new but duplicate classifiers created by the action of the induction algorithms) results in an unstable situation where only a single classifier is used to represent each concept, leaving that classifier vulnerable to accidental deletion from the population. In Learning Classifier Systems that use this reward technique, therefore, elitist deletion strategies are generally introduced to prevent the loss of useful classifiers.


At each iteration all classifiers that have been selected to propose an internal or external message also pay back a proportion of their strength to the classifiers that caused them to become active. These classifiers can be readily identified by maintaining a record of the message-posting classifiers from the previous iteration and a record of the messages that have been matched by the conditions of the classifiers in the current internal or external action sets. This enables the strength passed back as payoff to be credited to the scene-setting classifiers directly, limiting the ability of parasitic classifiers to gain undeserved reward. This "Bucket Brigade" mechanism theoretically produces stable classifier strengths throughout the rule chain, given a stable environmental reward.

In addition to the reward and payoff values, a number of "Tax" regimes are imposed on the population. The "Life Tax" is the removal of a fixed small proportion of the current strength of all classifiers within the population:

strength = strength × (1 - lifeTax)

The life tax is introduced so that classifiers which, because of message encoding redundancy, do not match within any environmental niche have their strength decreased until they come under deletion pressure. A "Bid Tax" is applied to all those classifiers in the Match Set, and is typically the removal of a fixed proportion of strength:

strength = strength × (1 - bidTax)

although arguments could be made for removing strength in proportion to the size of the bid. The bid tax increases the rate of strength reduction of classifiers that propose actions producing poor payoff or reward, and penalises classifiers that are over-general and therefore bid in both high and low payoff situations. The provision of taxation is a matter of some debate. Wilson's LCS implementations (see section 2.7.1) tended to remove taxation and utilise a more dominant G.A. to exert the required population pressure, whereas most other implementations gave the G.A. a background role and therefore used taxation to speed up the identification of weak classifiers. It is clear that a fine balance has to be struck between reward, the speed of payoff through the Bucket-Brigade, and the amount of taxation.
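The per-iteration credit flow described above - bid payment back to the scene-setting classifiers, equal reward sharing, and the two taxes - can be sketched as follows (illustrative; the parameter values and names such as LIFE_TAX are assumptions, not the Canonical LCS parameterisation):

    LIFE_TAX = 0.005  # illustrative tax rates
    BID_TAX = 0.01

    def credit_step(population, match_set, action_set, prev_action_set, reward):
        # active classifiers pay their bids into a 'bucket' that is shared
        # between the previous iteration's (scene-setting) action set
        bucket = sum(cl.bid for cl in action_set)
        for cl in action_set:
            cl.strength -= cl.bid
        if prev_action_set:
            share = bucket / len(prev_action_set)
            for cl in prev_action_set:
                cl.strength += share
        # an environmental reward is shared equally within the action set (eq. 2.1)
        if reward:
            share = reward / len(action_set)
            for cl in action_set:
                cl.strength += share
        # every classifier pays the life tax; matched classifiers also pay the bid tax
        for cl in population:
            cl.strength *= (1.0 - LIFE_TAX)
        for cl in match_set:
            cl.strength *= (1.0 - BID_TAX)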


The Canonical LCS follows the majority of LCS implementations in providing both a Bid and a Life Tax, and in using a proportional taxation regime for both. However, the difficulties that the use of taxation brings must be considered carefully.

The Bucket-Brigade mechanism was first analysed by Riolo (1987b)21. He demonstrated that the Bucket-Brigade was able to reinforce a pre-built rule chain representing the simple environment of Figure 2.2, but noted that it would require 22 + 11.9n trials for a classifier n steps away from the reward state to reach 90% of its expected stable strength (for a classifier ten steps away, for example, around 141 trials).

21. More detail on the investigations undertaken by Riolo is given in section 4.5.

[Figure 2.2 - A test environment for the Bucket-Brigade used by Riolo (1987b): a chain of states s0, s1, s2, ..., s10, s11, s12, with a reward of 0 on each intermediate transition and a reward of 100 on reaching s12.]

Riolo (1987b) went on to demonstrate that even with this reinforcement delay an LCS with a provided population was able to identify the route to the highest reward when faced with a choice of two routes in the start state, although the time taken for the LCS to achieve sufficient reinforcement in the early states was large. Riolo tried to introduce a technique known as "bridging" to overcome this problem. Bridging Classifiers (Holland, 1985) are classifiers that appear in more than one action set due to their generality, with one of the action sets being much nearer the reward than the other. A Bridging Classifier will receive payoff early in the reinforcement cycle due to its proximity to the reward state, and will pass this payoff on to the classifiers in the action set preceding the earlier action set within which it occurs. This reduces the time until environmental reinforcement reaches the early action sets. Although Bridging Classifiers did dramatically decrease the convergence time, their emergence and maintenance in the presence of the induction algorithms has not been conclusively demonstrated.

This reinforcement delay indicated the potential for problems when the LCS induction algorithms are applied, and indeed Riolo (1989a) showed that whenever the induction algorithms were introduced only very short rule-chains could be established, and the LCS was unable to retain even these. By introducing support into the classifier selection process to bias selection towards members of the rule chain, and by biasing deletion to preserve members of the rule chain, Riolo had more success in establishing


and retaining the rule chains, but a high proportion of runs still failed to retain adequate rule chains.

Forrest and Miller (1990) analysed the behaviour of the LCS using Boolean Networks. A method was developed for mapping a simple two-condition rule-chaining LCS population onto a Boolean Network, so that the network could be examined using available analysis techniques to identify the properties of the population. It was found that even in simple problem domains the traditional rule-chaining LCS approach provides little encouragement to form internal rule-chains; even after significant learning time, internal rule-chains (those that do not require external input to sustain them) were only two or three classifiers in size.

Compiani et al (1990) analysed the behaviour of individual classifiers within a rule chain in the presence of the Bucket-Brigade and an infrequent G.A., in order to ascertain the cause of the instabilities in the rule-chaining mechanism seen within their earlier letter-sequencing investigations (Compiani et al, 1989). Their analysis began with the behaviour of 10 classifiers that are members of a single match set. They discovered that where there was an ideal message list, so that no classifier failed to post a message, the strengths of the classifiers all stabilised as expected. Once a choice of which classifiers could post to the message list was required, some classifiers would fail to be involved in the rule chain and the taxation regime would decay their strength. Whenever the induction algorithms were then applied, new classifiers could enter the population with a strength [possibly temporarily] higher than those classifiers in the match set whose strength had decayed. If any strength amplification is used in the selection of classifiers (a multiplying factor commonly used in the G.A. community to help distinguish between individuals of similar strength), then even a small strength difference will cause the weakened classifiers to be "irreversibly doomed to decay to zero strength". This problem arises from the use of competition for selection over individual classifiers rather than over the cumulative bids for each message as used in the Canonical LCS, and therefore should not affect the Canonical LCS. Further analysis examined the strength of individual classifiers within a rule chain. They noted that the taxation regime will cause strength to decay from the ideal fixed-point value between classifier activations. Where the message competition is high, the resulting oscillation in strength introduces a higher probability that the early classifiers are not selected, leading to a further strength reduction. This results in a loss of scene-setting classifiers, and the results of Riolo (1989b) would suggest that it is the loss of scene-setting classifiers that eventually destroys the rule chains.
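The decay-between-activations effect is easily reproduced; the following fragment (an illustration constructed for this discussion, not Compiani et al's model) tracks the strength of a single classifier that is taxed every iteration but only receives payoff when activated:

    def simulate_strength(iterations, activation_period, payoff=10.0, life_tax=0.01):
        strength = 500.0  # start near a plausible fixed point
        for t in range(iterations):
            strength *= (1.0 - life_tax)      # the life tax applies every step
            if t % activation_period == 0:
                strength += payoff            # payoff arrives only on activation
        return strength

    # the less frequently the classifier is selected, the lower its strength settles:
    print(simulate_strength(1000, 1))   # activated every step: close to payoff/life_tax
    print(simulate_strength(1000, 10))  # activated rarely: roughly a tenth of that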


These results are discouraging, although they were produced using LCS implementations that encourage competition between individual classifiers. Approaches using a less competitive environment, such as those of Wilson (section 2.7.1) or Booker (section 2.6.1), have shown more promise. Whilst the Canonical LCS combines the Bucket-Brigade and internal rule-chaining approach of Holland's proposals with the more co-operative action-set bidding of Wilson's LCS implementations, no actual implementation has been produced to ascertain the performance of rule-chaining and the Bucket-Brigade under this combination.

Riolo (1989a) noted that under the action of the induction algorithms in the early stages of learning, classifiers are formed that are not fully correct but nonetheless contribute significantly to the development of a rule chain. Once the chain is developed these classifiers are still involved in action sets within the rule-chain, and so receive reward through the payoff mechanism, and yet they no longer provide any useful service to the rule-chain. Since in a payoff-sharing regime their existence lowers the payoff to the good classifiers within the rule chain, Riolo describes them as "parasitic classifiers". Their reduction of the payoff provided to the useful classifiers within the rule-chain will cause these classifiers to be overlooked by the G.A. and to come under threat of deletion. Riolo noted that it is not sufficient simply to devise mechanisms to remove these parasites, since they have a useful function in the establishment of the rule chains, but clearly they must be controlled.

Smith (1991) devised a number of test environments that caused parasitic classifiers to arise. He noted that parasitic classifiers were caused by situations where a payoff was given for acting that was not 'grounded' in an environmental reward, and that parasites were formed for a number of reasons. Three basic types of parasitic classifier can be identified. Type 1 parasites post a message that correctly activates the next classifiers and therefore receive a high stable payoff, but also post an incorrect action to the environment, causing sub-optimal performance. Type 2 parasites interrupt the flow of payoff down a chain by posting an incorrect internal message whilst receiving a good reward for a correct action. Type 3 parasites misguide the LCS by posting incorrect internal and external messages, and yet receive payoff for their presence in the rule chain - these parasites exist because of their value in creating the rule chain, but are not eliminated when the correct rule chain is formed. Parasites represent a threat to the rule chain because of the reduction in payoff given to correct classifiers due to their presence. Unfortunately a simple attempt to eliminate them is not an adequate solution, since they are helpful in forming the rule chains. Smith (1994)


noted that parasites require internal mechanisms (those whose payoff is not directly linked to an environmental reward) to exist. Sections 2.6 and 2.7 present three forms of LCS that provide a form of rule-chaining that is intrinsically linked to environmental reward. Holland (Holland et al, 2000) maintains that these approaches are insufficient - only the provision of internal rule-chaining can break the dependency of the LCS upon the environment.

2.3.5 The Induction Subsystem

The fundamental induction algorithm included in the Canonical LCS is the Genetic Algorithm. This is applied periodically, typically at a sufficient time interval to enable the strength values of the classifiers to stabilise to values that represent the worth of each classifier to the classifier system. Unlike a traditional G.A., the G.A. within most LCS approaches replaces only a small proportion of the population of classifiers - an extreme form of elitism. Parent classifiers are selected by roulette-wheel selection over classifier strengths. This provides selective pressure towards useful classifiers whilst allowing good building blocks in otherwise weak classifiers to spread through the population. For each pair of parents selected from the initial population, a single child is produced by application of the crossover operator. Typically a single-point crossover operator is used, although more complex forms could be employed. The resulting child is given an initial strength equal to the average strength of its parents, sometimes discounted by a proportion so that a new weak classifier does not unduly disrupt the performance of the parents (see the discussion in section 2.3.4). Mutation is applied at a [small] uniform per-bit rate to each child before insertion into the population. A new classifier is inserted using a selection routine that gives a higher chance of deletion to weak classifiers. Typically roulette-wheel selection over the inverse of the strength is used, so that weak classifiers are not eradicated entirely, thereby preventing premature convergence and allowing good small building blocks to remain available. Alternatively a crowding technique, such as DeJong crowding (DeJong, 1975), can be applied to maintain diversity by deleting the weakest most-similar classifier. The Canonical LCS follows the main implementations in choosing two parents using roulette-wheel selection, using single-point crossover across the whole classifier, and applying per-bit mutation. New classifiers are inserted using crowding.
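One invocation of the G.A. as configured for the Canonical LCS can be sketched as follows (illustrative; a classifier is represented here as a plain string genome with a strength, and the parameter values are assumptions):

    import random

    MUTATION_RATE = 0.01  # illustrative per-bit mutation rate
    CHILD_DISCOUNT = 0.9  # discount applied to the parents' average strength

    def roulette(population, weight):
        # roulette-wheel selection with the supplied weighting function
        spin = random.uniform(0.0, sum(weight(cl) for cl in population))
        for cl in population:
            spin -= weight(cl)
            if spin <= 0.0:
                return cl
        return population[-1]

    def ga_step(population):
        # two parents chosen by strength, single-point crossover across the
        # whole classifier, then per-bit mutation of the child
        p1 = roulette(population, lambda cl: cl.strength)
        p2 = roulette(population, lambda cl: cl.strength)
        cut = random.randrange(1, len(p1.genome))
        genome = p1.genome[:cut] + p2.genome[cut:]
        genome = "".join(random.choice("01#") if random.random() < MUTATION_RATE else g
                         for g in genome)
        strength = CHILD_DISCOUNT * (p1.strength + p2.strength) / 2.0
        return genome, strength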


In order to evaluate all classifiers, all states must be visited, which in turn requires a response to all inputs. The Create Detector Operator is a simple induction algorithm that is triggered whenever no classifier is completely matched in all its conditions by any message. In the Canonical LCS the implementation of this operator is kept simple: the new classifier is given a condition equal to a randomly chosen externally generated message with some bit positions generalised according to the current generality setting, a random action is chosen, and the strength is set to the population average. This allows the LCS to respond, though possibly incorrectly, to the message; the generation of a correct response is left to the operation of the G.A. and the Create Effector Operator. More complex CDO solutions could easily be devised, but are rarely necessary.

The Create Effector Operator is triggered whenever a rule or set of rules continually responds incorrectly. This can be detected by a drop in the strength of the responding classifiers over a number of iterations. The incorrect response can have a number of causes. The classifier's condition could be correct and optimally general but its action incorrect - choosing an alternative action rectifies this situation. Alternatively, a classifier's condition could be too general, allowing it to be selected both when its action is correct and on occasions when its action is inappropriate. Modifying the condition to make it more specific, by replacing some wildcards with actual message bits, can rectify this. It is possible to detect a condition that is too general by calculating the variance in the reward received, and to use this information to guide the kind of CEO operation that should be applied. Most LCS implementations do not provide such a distinction, and so lose this opportunity to better focus the learning effort.

The Triggered Chaining Operator has been introduced by a few implementations (e.g. Riolo, 1989a) in order to produce action sequences that are established over a much shorter period than by the G.A. alone. Whilst, theoretically, the G.A. operator should create such rules, in reality usefully chained rules are a small subset of the set of all possible rule chains, and the G.A. would take a very long time to discover them. The TCO is triggered by the occurrence of a payoff or reward assigned to both the current and a previous rule. The resulting rule has the external condition of the first rule with its action set to a new tag value. This tag value is an internal message that will, in the next iteration, match a new classifier that has the same internal tag and the action of the second rule. In the following example, the environmentally triggered rules 1 and 2 are matched by messages m1 and m2 and perform actions a1 and a2 respectively. Their fully-general second conditions indicate that they do not respond to any internal message. If these rules follow one another a pre-specified number of times, and are part of a reward chain, the TCO will introduce the two new rules 3 and 4. These will co-exist with the original rules, but specify a direct path from the receipt of message m1 to the action a2 using internal message t1:


1) m1, # → a1
2) m2, # → a2
3) m1, # → t1
4) #, t1 → a2

If the newly chained combination is more effective, the new rules will increase in strength, and if the action a1 is not required in the path to the environmental reward, rules 1 and 2 will eventually be replaced by the new rule pair. If action a1 is required in the path to the environmental reward, classifier 2 may be the only classifier replaced by the rule chain. Clearly, successive invocations of the TCO could extend this chain further, so [theoretically] producing long rule chains.

2.3.6 Summary

This completes the description of the Canonical LCS. The major mechanisms of this theoretical LCS have been described, and the basis for their selection elaborated. Throughout the identification of the Canonical LCS, attention has been drawn to the major findings of previous research in regard to the choices made. The implication of many of these findings is that the Canonical LCS may continue to suffer from some of the problems that are currently seen within the various traditional LCS variants. However, the Canonical LCS is presented not to identify an ideal implementation, but to provide a set of implementation choices that can be used in a comparison of actual LCS implementations. As such, the Canonical LCS acts as a measuring rod - useful as a comparative standard.

2.4 The Michigan and Pittsburgh Variants

Whilst the use of the Genetic Algorithm for optimisation work has established that the G.A. should operate on the genotype (the members of a population represent the candidate solutions), the members of a population within a LCS are the individual production rules. Since these production rules must co-operate to classify the inputs to the classifier system with respect to its response, it could be argued that it is inappropriate for the G.A. to operate at this level in a LCS. The concatenation of the production rules in the LCS forms the genotype, and therefore the G.A. should operate on a population of classifier systems. This approach was first adopted by Smith (1980), and also described by DeJong (1988), becoming known as the 'Pittsburgh' Classifier System after the university it arose from. All other LCS that perform their G.A. at the rule level within a population, including the Canonical LCS, are known as 'Michigan' Classifier Systems for similar reasons.


The G.A. is based on the premise that there is a single 'best' solution within the problem domain, and that it is found by the proliferation of good 'building blocks' through the population (Holland, 1975; Goldberg, 1989) until the population converges upon a solution. Multi-objective optimisation within a G.A. has therefore been an important issue that has led to the development of a variety of techniques in the search for a satisfactory solution (for example, see Deb, 1998). The Michigan LCS presents a complex multi-objective problem to the G.A. and the other search operators it makes use of. As Goldberg, Horn and Deb (1992) put it:

“Even if we can all agree on a canonical ‘simple LCS’, that model is bound to be much more complex than the ‘simple G.A.’. The primary reason for the additional complexity is the multi-objective nature of the task at hand. The most basic LCS is trying to find the (1) smallest set of rules that (2) best solves the example problem while (3) generalising well to all similar problem instances. In terms of classification, this means searching for a concise, accurate, and robust concept description, where a concept description is a group of rules. ... In a G.A., selection drives the evolving population toward a uniform distribution of N copies of the most highly fit individual. Mutation and non-stationary fitness functions might stave off 100% convergence, but it is unarguable that the first-order effect of the first-order operator, selection, is the loss of low-quality diversity. In many applications of the G.A. we might want to find a number of solutions with different tradeoffs among the multiple objectives. ... In the LCS, we ask the G.A. to search through the space of all possible rules to find and maintain a diverse, co-operative sub-population.”

Further problems arise from any consideration of the mechanisms that enable a LCS to cover a complex state space with a small group of co-operative rules, and from any examination of the mechanisms an LCS can use to learn and maintain behavioural sequences, to maintain explicit state information, or to cope with delayed or erratic reinforcement. All of these situations require co-operation within the intrinsically competitive Michigan LCS environment. Different solutions have been developed for each of these problems, many of which unfortunately exclude potential solutions to the others.


The Pittsburgh LCS is more suited to the provision of co-operation. The lack of competition between individual classifiers allows the LCS to find novel co-operative solutions that the population-level G.A. can maintain and proliferate. This fact has been used by a number of researchers to achieve higher-level LCS structures (e.g. Grefenstette, 1987, 1989, 1992; Grefenstette et al, 1990; Bull and Fogarty, 1993, 1994) and diverse population cover (Giordana and Neri, 1995; Flockhart, 1995). Until the development of the balanced niche G.A. within XCS (Wilson, 1995, 1998), operating together with accuracy-based fitness to provide co-operation in the face of competition (Wilson, in Holland et al, 2000), no Michigan-based approach could reliably provide this balance. The Pittsburgh approach remains the method of choice for problems that require the development of co-operative populations, in cases where XCS cannot be employed or where the niching capabilities of XCS run counter to the requirements of classifier co-operation. However, the Pittsburgh approach presents its own limitations. In particular, because the G.A. operates at the population level it receives only high-level feedback from the fitness function, and therefore requires a large additional effort to generate optimal populations. This increased effort, in addition to the increased computational resource required to operate at the population level, can present new challenges when devising efficient implementations of a Pittsburgh LCS. This chapter focuses upon the development of the Michigan LCS rather than the Pittsburgh approach, and this is reflected in the structure of the Canonical LCS. The problems facing the user of a Pittsburgh LCS are quite different from those within the Michigan approach, and thus it would be inappropriate to attempt to represent both approaches within one description.

2.5 Realisations of the Canonical LCS

Since the Canonical LCS seeks to provide a 'middle-ground' architecture that allows all the major features of LCS implementations to be introduced, the architecture presented remains quite complex. Wilson (1994) reacted against this increasing complexity, most of which was introduced in an effort to balance the competing forces within the LCS architecture, by producing the simple ZCS. This approach enabled a better understanding of the workings of the LCS and led to a radical re-think that produced XCS (Wilson, 1995). The majority of workers remained committed to the more complex approaches advocated by Holland. Riolo's CFS-C implementation is the key example of this approach and, through its public availability and Riolo's groundbreaking research, it has been very influential. This section presents Riolo's LCS implementation and


provides a brief overview of two other LCS implementations that took the same, more complex, pathway.

2.5.1 Riolo's CFS-C

Riolo's CFS-C represents a milestone in LCS development. Not only was it the first full implementation of Holland's Bucket-Brigade proposals (and of many other operators proposed by Holland et al, 1986), but Riolo's implementation was also made publicly available with documentation. This meant that it was the first LCS to become widely known, although it was rapidly overtaken by Goldberg's SCS (Goldberg, 1989). Riolo used the CFS-C implementation as a means of investigating two key areas of LCS performance: the formation and/or maintenance of Default Hierarchies within single-step environments, and the formation and/or maintenance of rule chains within multiple-step environments. Sadly for LCS research, although Riolo showed that both mechanisms would operate as suggested with pre-loaded solutions, he also demonstrated that each form of structure was difficult to produce using the induction algorithms alone and problematic to maintain in the face of population competition (see section 2.3.4). Together with the work of Smith (1991), these results led to a disillusionment with the Holland LCS and a significant drop in interest in the LCS approach that signalled the end of the "classical era" of LCS research (Kovacs, 2000c).

CFS-C is reflected in the Canonical LCS in providing classifiers with two conditions, with the second condition acting as the "privileged condition". Only the second condition may be negated, and the match rules for negated conditions follow those of the Canonical LCS. The conditions and action must all be of the same length, and the action differs from the Canonical LCS in allowing generalisation and in using passthrough to form the action. To deal with the problem of how to produce the action using passthrough given multiple conditions and many matching messages, the user can select from a set of different passthrough types, ranging from the logical AND, OR, or XOR of all matching messages to arithmetic operations on the matching messages. This reflects the general theme of CFS-C: it provides a plethora of facilities in an attempt to cover all suggestions where no one mechanism is known to be superior.

Given the existence of two conditions, it is clear that CFS-C uses an explicit rule-chaining mechanism that requires an internal message list. In fact, in common with Holland's proposals, only one message list is provided, and of the 32 places available on it a user-defined number can be reserved for the current input messages, output messages, and internal messages.
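Passthrough action formation of the kind described above can be illustrated as follows (a sketch under the assumption that the matching messages are combined bitwise and the result fills the generalised action positions; CFS-C's actual options are more varied than shown):

    from functools import reduce

    def bitwise_and(x, y):
        return "".join("1" if a == b == "1" else "0" for a, b in zip(x, y))

    def passthrough_and(action_template, matching_messages):
        # fixed bits come from the action template; '#' positions are filled
        # from the logical AND of all matching messages
        combined = reduce(bitwise_and, matching_messages)
        return "".join(m if a == "#" else a
                       for a, m in zip(action_template, combined))

    print(passthrough_and("1##0", ["0110", "1110"]))  # prints 1110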


All messages are accompanied by an "intensity" value that provides a Support mechanism for the bid calculation. Unusually, CFS-C allows a number of input messages to be sent by the environment and a number of effector messages to be sent by the LCS. Multiple detector messages allow different detectors to generate messages that are particular to that detector, though there must then be some possibility of overlapping inputs that the LCS will be unable to disambiguate. Effector message conflict is resolved by the output interface in a user-supplied manner.

The Conflict Resolution subsystem uses modifiers for both specificity and support from matching internal messages within the bid. Support was added when rule-chaining experiments indicated that rule-chains could easily be broken by competing messages unless the messages that maintain the rule-chain were given an advantage in the bid competition. Specificity modification was added to maintain Default Hierarchies, and a harsh specificity modification based on a power function is used. Although this is sufficient to allow Default Hierarchies to operate and to maintain them for some time, even with this modification the Default Hierarchies were removed over time. Messages are selected for placement onto the message list using roulette-wheel selection over the message bids, until sufficient messages are present to fill the sections of the message list reserved for effector or internal messages. The large message list (size 32) limits the amount of competition between classifiers, although the message list size can be reduced. Limits can also be placed on the number of messages to be posted by each classifier due to passthrough, so that one classifier cannot 'swamp' the message list.

Credit allocation implements the full Bucket-Brigade, with payoff passed directly to those classifiers that posted messages matched by classifiers active in the next iteration. The payoff is the bid of the classifier active in the next iteration, shared between the classifiers whose messages caused it to become active and/or the environment if it matched a detector message (the environment acts as a 'sink' for these payoffs). The payment to the environment can reduce the payoff passed down the rule chain, and so Riolo provides facilities to bias payment away from the detector messages. The bid of each classifier that posts a message in the current iteration is removed from its strength, whether it provides payoff to other classifiers or not. All classifiers are subject to a life tax, and all bidding classifiers (all those that are matched) are subject to a bid tax. In addition, a "producer tax" is introduced that is paid by classifiers that succeed in having their messages selected for the message list. This tax is designed to prevent the maintenance of cycles of classifiers that simply invoke each other without reference to the environment. All taxes can be proportional or fixed-rate. The presence of per-


classifier competition for selection and the use of taxes to control problem situations are indicative of a classifier system that is relying upon control mechanisms to provide the required co-operation within an inherently competitive population. It is no surprise, therefore, that the G.A. is applied panmictically, using a per-classifier strength-based roulette-wheel selection mechanism. In keeping with the apparent need to provide many different forms of each mechanism, crossover can be applied within a chosen condition or action or across the whole classifier, and crossover can be one-point or two-point. The G.A. is used to generate sufficient new children to replace a small proportion of the population on each invocation. The G.A. is, however, invoked very infrequently, and the main induction algorithms used are the cover operators. These are all triggered operators. The cover detector operator is slightly different from most implementations, generating a new classifier from a selected existing classifier whose conditions are modified to match the current messages on the message list. The cover effector operator is more advanced than that found in the Canonical LCS, able to respond to situations where classifiers are ineffective, where classifiers fail to activate particular effectors, or where classifier bids are particularly low. CFS-C is unique among LCS implementations in providing a Triggered Chaining Operator to identify new members of rule chains; this operator was discussed in section 2.3.5. Any classifier produced by any of these operators is subject to mutation applied at a very low per-locus frequency. Classifiers to be removed from the population are selected panmictically, usually by roulette-wheel selection over the inverse of the strength of the classifier. Since CFS-C uses elitist mechanisms, the deletion mechanism provides options that allow high-strength classifiers to be protected, or for protection to be provided through the use of a crowding mechanism in selection for deletion. Equally, to control the proliferation of classifiers, a limit can be set on the number of duplicates of a classifier that can be maintained within the population.

Whilst Riolo's implementation is to be commended for its flexibility and range of user choice, it is apparent throughout the implementation that mechanisms need to be applied appropriately to reinstate a balance between co-operation and competition. This makes CFS-C very sensitive to user parameterisation. Furthermore, it tends to leave the choice of which of the many options to deploy to the user, rather than seeking an inherent balance between co-operation and competition within the LCS itself. Nonetheless, it is an important implementation, both for its literal implementation of the Holland model and for the important results Riolo presented on rule chain and Default Hierarchy formation and maintenance.


2.5.2 Two more realisations

Forrest's [unnamed] classifier system (Forrest, 1985, 1990) provided a more complex classifier structure than the Canonical LCS, with multiple, possibly negated, conditions within each classifier. Each classifier employed passthrough when creating its messages, with the first condition acting as the privileged condition and all messages matching it used to produce possibly many action messages. The passthrough mechanism was particularly important in this LCS, allowing much smaller populations to be produced than would be required to cover every input/action combination. As would be expected in a system that emphasises internal messages, the use of message tags was particularly important to identify particular kinds of messages and actions. Multiple actions can be written to the message list, and each message can be used to cause other classifiers to respond in the next iteration. The effector interface identifies appropriate messages for action from the message list using the tags. No induction mechanisms are provided: the population of classifiers required to represent a problem is automatically generated (or "compiled", to use Forrest's terminology) from a problem description and pre-loaded. Similarly, no bidding, support, or classifier strength mechanisms were provided. Thus Forrest's classifier system cannot truly be described as a Learning Classifier System. Although Forrest used her classifier system as a compute engine without any application of induction algorithms, her work remains important. It demonstrates that, although simplistic in form, the LCS architecture can be used to produce complex computational structures. It can therefore be hypothesised that if the means could be found to create these structures through the induction mechanisms, and to preserve them within the population, then LCS methods could manipulate them to perform non-trivial internal computations. The work was also important for its use of condition negation to introduce computational completeness to the LCS framework, and of passthrough to reduce the number of classifiers required to represent similar situations.

Dorigo's ALECSYS LCS implementation (Dorigo and Sirtori, 1991; Dorigo, 1992, 1993, 1995) lies at the opposite extreme. Very much a complete LCS implementation, it was produced for application to robotic learning tasks. The architecture of ALECSYS was designed to allow a parallel implementation to be produced, and this was used in later hierarchical LCS developments to speed up the learning of robotic control tasks. Although Dorigo (1993) claims that the LCS implementation was derived from the work of Booker (1982), the implementation owes more to the Holland LCS than to Booker's GOFER implementations. Each classifier is a


two-condition classifier without negation or passthrough. Classifier bids are specificity-modified but do not add in any other factors. Like the Canonical LCS, ALECSYS limits the size of the message list and uses selection to fill it, although the selection is, like Riolo's, per-classifier. The message to be sent to the effector interface is selected from the message list in proportion to the strength of the sending classifier. Any environmental reward received is given to the selected classifier. Internal messages are indicated by a tag bit and are passed back on the message list to trigger other classifiers. The full Bucket-Brigade mechanism was not provided within the sequential versions of ALECSYS, with most applications relying on an implicit bucket-brigade. Clearly the combination of mechanisms chosen produces a highly competitive population, and that may be why the G.A. was initially applied very infrequently. The main induction algorithms were therefore the cover operators and a new operator called "mutespec", which identified over-general classifiers through strength oscillation, as recorded in a variance measure maintained with each classifier, and then specialised them. No attempt was made to preserve generality, and although it was noted that the resulting populations were not completely isomorphic, it is clear that the generalisation that remained was not primarily the result of measures to identify or preserve it. Later the G.A. was given a more prominent role, but its activation was controlled by a system-wide measure termed "energy" - simply the sum of the strengths of the classifiers in the population. Whenever the rate of change of this value was low, the G.A. would be invoked. On each invocation of the G.A. a fixed proportion of the population would be replaced with new classifiers. These design decisions are indicative of the kind of additions made to LCS implementations in an attempt to control the competition caused by the G.A. These additions (and a further addition not discussed here) represent a failure to recognise that the competition arose from the selection of elitist mechanisms within the LCS implementation. It is easy to be critical, however - such problems are simple to point out in hindsight, and ALECSYS is but one of many implementations that followed the same pathway to complexity.

2.6 Modifications of the Canonical LCS

Although Holland's LCS proposals were hypothetical, many implementers chose to follow the Holland LCS architecture closely. Booker (1982, 1988) was one of the few "free-thinkers" who used the proposals as a framework from which a carefully constructed LCS could be built.


Booker's work was pioneering in 1982 and, although it did not receive sufficient recognition in later work, it still has much to teach LCS users today.

2.6.1 Booker's GOFER-1

Booker's primary concern when approaching the development and use of Learning Classifier Systems had been the co-operative balance produced by the learning features used, in contrast to the competitive nature of other approaches.

"Overt external rewards provide useful guidance about which events are important and where model-building efforts should be focused. However, such rewards are not necessarily the only or the most useful source of feedback for inductive change." (Booker, 1989)

This approach is demonstrated most graphically within his GOFER-1 LCS, based upon [and a simplification of] his earlier GOFER implementation (Booker, 1982). The GOFER classifier systems used co-operative constructs throughout. Booker emphasised the importance of internal model-building rather than simple stimulus-response. Thus, in response to an external message the matching mechanism first finds the classifiers whose conditions match the message. An ideal number of classifiers, termed the "niche size", should match each input message. Rather than simply create more classifiers using a covering mechanism, Booker uses "partial matching". All classifiers are scored as follows: if the message matches the condition, the score is the number of non-wildcard positions in the condition; otherwise the score is a relation between the number of matching positions and the number of non-matching positions (a number of possible match-score algorithms are presented). The match set is formed firstly from those classifiers that match, filling it in order of match score. If space remains in the match set (the number of matching classifiers is less than the niche size), other classifiers are added according to their match scores. Thus classifiers with good "building blocks" (regions of their conditions or action) will tend to operate together even if they are partially irrelevant or incorrect, making the classifier system as a whole more robust.
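Partial matching of the kind described can be sketched as follows (illustrative; this realises only one of the possible match-score relations Booker presents, and the scoring of partial matches is an assumption):

    def match_score(condition, message):
        # a full match scores the number of non-wildcard positions; a partial
        # match scores by relating matching to non-matching positions
        specific = sum(1 for c in condition if c != '#')
        mismatches = sum(1 for c, m in zip(condition, message) if c not in ('#', m))
        if mismatches == 0:
            return float(specific)
        return (specific - mismatches) / (mismatches + 1)

    def form_match_set(population, message, niche_size):
        # fill the match set to the niche size in descending score order, so
        # near-matching classifiers are recruited when full matches are scarce
        ranked = sorted(population, reverse=True,
                        key=lambda cl: match_score(cl.condition, message))
        return ranked[:niche_size]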
value as a ratio of the largest reward. Clearly, the product of these parameters serves to weight the expected reward by the reliability of the prediction and the relevance of the classifier to the current message. Using this effectiveness parameter a small subset of classifiers is chosen from the match set using roulette-wheel selection. Once this initial competition is complete an action is chosen randomly from the actions proposed by this subset of selected classifiers. This selection mechanism is particularly clever, providing a likelihood of exploitation or exploration that is dynamically maintained in relation to the focus of the population upon good solutions. Once an action is selected in this manner the Action Set is then formed from all classifiers in the match set that propose the selected action. When the action has been performed any external reward is used to adjust the Impact and Consistency parameters of all classifiers in the action set. The action set will also be maintained so that classifiers that are in the action set of the next iteration provide a payment back to these scene-setting classifiers. The payment is computed as the average Impact value of the classifiers in the next match set, weighted by the Consistency and Match Score attributes. Although GOFER does make use of two conditions, with one reserved for so-called "control tags" that match messages from other classifiers, the payment scheme is not directly related to the rule-chaining mechanism as it is within the Canonical LCS. This payment by temporal proximity rather than explicit classifier-to-classifier chaining was used within both Wilson's LCS implementations and Goldberg's SCS (see section 2.7), becoming known as the "implicit bucket-brigade". The degree of co-operation and reliance upon classifier clusters is not limited to the conflict resolution and credit allocation mechanisms. The induction mechanisms are also more co-operative than in many other implementations. Rather than provide a panmictic G.A., Booker restricted the G.A. within GOFER-1 to the action sets. This ensures that classifiers that are relevant to each situation co-operate to proliferate rather than compete with each other. Control of this proliferation is given by the deletion mechanism. Each classifier carries an attribute [confusingly] termed "strength". Each message carries with it a strength value that is reflected in the assigned strength attribute of the classifiers in the match set in relation to their effectiveness measure. The more classifiers that occur within a match set, the lower the assigned strength. Deletion is performed by roulette-wheel selection in proportion to the reciprocal of the strength value, causing larger niches to suffer more deletion. In this way the population space is dynamically managed between the available niches.
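Since Booker presents a number of candidate match-score algorithms rather than a single formula, the following Python fragment should be read only as a sketch of one plausible reading of GOFER-1's partial matching and effectiveness product; the record fields and the particular partial-match score used here are illustrative assumptions, not Booker's own definitions.

from dataclasses import dataclass

@dataclass
class GoferClassifier:          # illustrative record, not Booker's representation
    cond: str                   # ternary condition string, e.g. "1#0#"
    impact: float               # recency-weighted average payoff
    consistency: float          # reliability of the impact estimate

def is_full_match(cond, msg):
    return all(c == '#' or c == m for c, m in zip(cond, msg))

def match_score(cond, msg):
    # Full match: score is the number of non-wildcard positions (specificity).
    # Partial match: relate matching to non-matching positions; this ratio is
    # only one of the possible scores Booker describes.
    specific = sum(c != '#' for c in cond)
    wrong = sum(c != '#' and c != m for c, m in zip(cond, msg))
    return float(specific) if wrong == 0 else (specific - wrong) / (specific + wrong)

def effectiveness(cl, msg):
    # Non-weighted product of Impact, Consistency and Match Score.
    return cl.impact * cl.consistency * match_score(cl.cond, msg)

def form_match_set(population, msg, niche_size):
    # Admit all full matchers (best scores first); if fewer than niche_size,
    # pad with the best-scoring partial matchers so that classifiers with
    # good building blocks remain active.
    ranked = sorted(population, key=lambda cl: match_score(cl.cond, msg), reverse=True)
    full = [cl for cl in ranked if is_full_match(cl.cond, msg)]
    partial = [cl for cl in ranked if not is_full_match(cl.cond, msg)]
    return full + partial[:max(0, niche_size - len(full))]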


The careful balance achieved by these mechanisms is in stark contrast to the harsh rule-bound or stochastic approaches adopted within many other comparable LCS implementations of the time. Considering the good results achievable through GOFER-1 and the degree of dynamic control displayed by the mechanisms adopted, it is strange that GOFER-1 did not have a wider impact on the LCS community until many of the mechanisms developed by Booker were applied within XCS (Wilson, 1995, 1998). A possible reason for the lack of interest may lie with Booker's earlier GOFER LCS (Booker, 1982, 1985), which was applied within a more complex framework. This implementation used two co-operative LCS populations, one to receive input and reorganise the inputs as internal messages and the second to respond to these internal messages with an action. The separation of input and action allowed GOFER to find new relationships between state and response. Whilst certainly extremely advanced, it could be that the apparent complexity of this implementation acted as a barrier to the adoption of the GOFER-1 LCS.

2.7 Simplifications of the Canonical LCS

The Canonical LCS is much simpler than many Machine Learning systems, but nevertheless some have sought to simplify it further. These simplifications are important for LCS research because they highlight those aspects of the canonical LCS which are fundamental to the operation of the LCS, and identify the aspects that are required for various additional features. In this section the work of Stewart Wilson with his '*' (Animat), BOOLE and ZCS LCS models is introduced, demonstrating that a Behaviourist approach to an LCS allows considerable simplification with a disproportionately low loss in potential 'power'. Of further interest is the higher reliance that Wilson's approaches place upon the G.A., promoting the G.A. from what could become a background operator, or even be removed (Forrest, 1990), into a core LCS functionality. This work is contrasted with that of Goldberg (1989) with his SCS implementation. SCS adopts a structure that is closer to the canonical LCS, but relies upon a behaviourist approach for the development of rule-chains. Although SCS borrows heavily from Wilson's work, the G.A. is not such a prominent operator within it.

2.7.1 Wilson's Animat, BOOLE, and ZCS Implementations

Wilson (1985) created a novel LCS (and actually only the third Michigan-style LCS implementation) which he based upon a model of 'intelligence' that followed strongly the emerging Artificial Life field. He advocated a definition of intelligence in what, in Psychological research terms, would be considered a strict Behaviourist viewpoint:

"Intelligent behaviour is to be repeatedly successful in satisfying one's psychological needs in diverse, observably different, situations on the basis of experience" (van Heerden, 1968)

In providing a mechanism that would satisfy this definition, he identified that an Animat needs to: 1) discover and emphasise rules that work, 2) get rid of those rules which do not work, and 3) optimally generalise the rules that are kept. These traits are fundamentally important to all Wilson's work. They implicitly relate the operation of the LCS to the work performed by a G.A. and they emphasise Generalisation as a means of optimising rule storage rather than the use of Default Hierarchies. In hindsight, Wilson's later XCS (Wilson, 1995, 1998) can be seen as the ultimate outworking of these aims, and XCS will be discussed in detail in Chapter 3. Wilson named the Animat which the LCS represented '*' after the on-screen character representation of the position of the Animat in the test environment used. In this LCS there were three simplifications over the Canonical LCS - the message lists were removed so that only a single message could be received or sent at a time, the second condition was removed so that all rule-chaining was performed through actions on the environment, and all biasing of bids and rewards was removed so that the G.A. became the primary means of population control and balance. The removal of the internal message list and the second condition is clearly a major difference. This naturally removes any possibility of rule chaining by means of internal messages, ostensibly limiting the LCS to S-R behaviour. However, rather than rely upon internal messages to create rule-chains, Wilson notes that '*' can rely upon the environmental messages to create rule chains. When a classifier operates upon the environment the result will be a change in the location or orientation of the Animat within the environment or a change in the environment itself. As a result, the message received from the environment in the next step will change and thus trigger the next classifier in the 'chain' of classifiers that lead to a reward state. The environment is thus treated as an "implicit message list". Whilst this approach is adequate to support the development of fairly complex behaviour, the ability to form higher cognitive processes, such as intention, emotion or motivation, may be limited.


To implement this modification, the main cycle of the LCS is as follows. The LCS obtains a message from the environment in the normal manner, and matches it against all classifiers in the population. Matching classifiers are identified, forming the match set. A bid competition is held between all classifiers within the match set, where the probability of selection is proportional to the strength of the classifier. Once the action is performed, any reward received is shared equally amongst all the classifiers within the population that suggested the same action as the selected action (these classifiers collectively form the action set for the selected action). If '*' is operating within a multi-step environment, a small proportion of the strength of each classifier in the action set is removed before this update. The total strength removed from the action set is distributed evenly between the classifiers that were active in the action set for the previous step within this trial. This distribution of reward back to classifiers in the previous time step became known as the "Implicit Bucket-Brigade". The taxing of classifiers is removed from this LCS. In its place the G.A. is regularly invoked to operate over the strengths of the available classifiers. When the G.A. is invoked a classifier is chosen from the population as a whole in proportion to the strength of the classifier. With a low probability the classifier will be duplicated, otherwise another classifier from the same action set as the chosen classifier will be selected for two-point crossover. Mutation is applied to the single offspring before it is placed back into the population. Although there are no taxes applied, the regular invocation of the G.A. will proliferate high-strength classifiers, thus favouring correct classifiers and penalising (through the payoff to earlier classifiers) classifiers that operate along pathways that do not lead to a reward. There is pressure towards general classifiers since they receive more reward because they are involved in more action sets. There is some pressure on over-general classifiers, since appearing in action sets that do not lead to a reward state will apply a penalty to them. A degree of pressure upon correct yet over-specific classifiers is applied by deleting classifiers according to the reciprocal of their strength. A simple CDO operation is provided that creates a new classifier from a generalisation of the message and a randomly created action. The CEO uses the same CDO mechanism, triggered by a decrease in the average strength of the action sets. Wilson considers these mechanisms to be one 'covering' operator. There is no distinction between the possible causes of CEO triggering. The TCO operator is not needed because there can be no rule chaining. Default Hierarchies are not encouraged, with concentration on achieving good generalisations instead.
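As a concrete illustration of the implicit bucket-brigade just described, the following sketch removes a small fraction of strength from the current action set, passes the pooled deduction evenly back to the previous step's action set, and shares any external reward equally within the current action set; the deduction fraction and record structure are assumptions for illustration, not Wilson's published parameterisation.

from dataclasses import dataclass

@dataclass
class StrengthClassifier:       # minimal record for this sketch
    strength: float

def implicit_bucket_brigade(action_set, prev_action_set, reward, frac=0.1):
    # Remove a small proportion of strength from each current member
    # and pool the deductions ...
    pool = 0.0
    for cl in action_set:
        deduction = frac * cl.strength
        cl.strength -= deduction
        pool += deduction
    # ... distribute the pool evenly to the previous step's action set
    # (the classifiers that 'set the scene' for this step) ...
    if prev_action_set:
        share = pool / len(prev_action_set)
        for cl in prev_action_set:
            cl.strength += share
    # ... and share any external reward equally within the current action set.
    if reward and action_set:
        share = reward / len(action_set)
        for cl in action_set:
            cl.strength += share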


Wilson's formulation more truly reflects the evolutionary computation paradigm, giving the G.A. a more prominent role than in the Canonical LCS. Rather than restrict duplication of classifiers, the G.A. is allowed to proliferate the correct classifiers. Although a G.A. is an optimisation algorithm, a primitive form of niching is maintained by restricting the G.A. to operate between classifiers of the same action set and deleting classifiers in proportion to the reciprocal of the strength of the classifier. This means that the generation of new individuals should be proportionate to the usefulness of the action, and the deletion of classifiers will be from those classifiers that already occupy much of the population and are likely to have duplicates. We now know that this is an insufficient mechanism, but the concept of implicit population balance through the G.A. and targeted deletion is of importance in the ultimate development of XCS (see also Booker, 1988). Early tests for generalisation demonstrated that although '*' could perform as well without the G.A. (i.e. with only the 'create' operators), there was no generalisation without the G.A. The maintenance of duplicates is also important in helping Wilson's LCS formulation avoid the "Boom-Bust" problem (Smith, 1999) that so besets other LCS approaches, where an essential classifier in the rule chain is suddenly deleted from the population without any duplicate to take its place. Wilson did ultimately add a tax regime to '*' in order to prevent the LCS from 'dithering' when choosing between actions that both led to rewards along pathways of unequal length. In Wilson (1986a) he applies the LCS to the learning of a solution ruleset for a single-step environment representing a complex Boolean disjunctive function (the six-multiplexer problem - see Chapter 3, section 3.4.1.1). This LCS was a development of '*' without the payoff to previous action sets, and was called 'BOOLE'. The use of penalties or taxes was more prevalent within this LCS, with the reinforcement regime modified as follows. When a classifier is selected, the action set is identified and all classifiers within the action set are penalised by a small proportion of their own strength - essentially a bid tax localised to the action set. The operation on receipt of a reward is changed to be dependent upon whether the action was correct. For a correct action the reward is shared between the members of the action set, whereas for an incorrect action a fixed proportion is removed from the strength of these classifiers as a penalty. The final Credit Allocation modification was to penalise all classifiers that are not in the action set by a small fixed proportion of their strength - a life tax. Finally, the location of the G.A. was modified so that classifiers are selected from the population as a whole without regard for action sets. The parental strengths are reduced and child strengths set to a proportion of the parental average in order to allow children to
compete with their parents in the population. In retrospect a number of these changes represent retrograde steps given the current understanding of the difficulties caused by taxation regimes and the panmictic G.A. within other LCS implementations (see Section 2.3.4). Wilson identified that BOOLE both discovered and proliferated members of the population that were optimally general and correct (which he termed the solution set). The maintenance of such a balance in the population is in stark contrast to many other LCS implementations that struggle to maintain optimal generality. This property was used by Parodi and Bonelli (1990; see also Parodi, Bonelli, Sen, and Wilson, 1990) in a derivative LCS called NEWBOOLE, adapted to make use of the prior knowledge available in Data Mining tasks from the training data set. Their work demonstrated that NEWBOOLE could perform as well as existing decision tree, statistical and neural network approaches to classification. Wilson (1987) experimented with a biased reward function that allocated more reward to classifiers that had a higher generality, and demonstrated that the time taken to find the solution set could be significantly reduced, although the likelihood of obtaining over-general classifiers would be increased. Other work, beyond the scope of this discussion, examined various reward payoff/penalty regimes, crossover and mutation rates, and the use of an 'entropy' measure to identify variability in classifier strengths and thereby control the rate of crossover. This final enhancement was reported as an early result, but the acknowledgement of the use of variance in the strength as a measure of convergence on a correct solution is an early indicator of the growth in interest in accuracy-based approaches to LCS strength measures that would lead towards the development of XCS. Wilson's ZCS (1994) was a reaction to a tendency amongst LCS researchers to add complexity to the LCS model in order to address the performance deficiencies within close variants of the canonical LCS. It was clear that researchers were unable to come to an understanding of the competing mechanisms within the LCS variants, and in particular the delicate balance of parameterisation that was required in order to achieve learning stability. The 'Zeroth level Classifier System' was a reductionist exercise, simplifying an LCS model to the lowest set of facilities necessary. ZCS is clearly derived from '*' and BOOLE, with the same simple single condition, a single discrete action, single strength measure, and reliance upon an implicit bucket-brigade. However, an action is now selected by a probability distribution over the sum of the strengths of all classifiers advocating each action represented within the match set. Once the action is performed a fixed proportion of the strength of each classifier in the action set is
removed; the accumulated strength is discounted by a fixed proportion and shared equally between the classifiers in the previous action set. If a reward has been received, it is discounted by the same proportion as that removed from the classifiers earlier. The remaining reward is then shared equally between the classifiers in the action set. As was the case in BOOLE, a small proportion of strength is removed from the classifiers that matched but were not part of the action set, to speed up the reduction in strength of the incorrect classifiers. Thus, ZCS continues the tradition of reward sharing from '*' and BOOLE, a vital component in establishing a stable update regime. The primary contribution of ZCS is the stepping stone it provided towards XCS. The use of the action set strength in the bid competition was a vital part of the development of action-set based niching. Although the update mechanism still taxed ineffective bidding, the reward-sharing mechanism and the awareness of the requirement for reward-sharing to achieve niching were important steps. The addition of a temporal difference payoff regime provided a key link into ideas from within reinforcement learning, a link that still has the potential to yield more benefits to the LCS community in the future.

2.7.2 Goldberg's SCS

Although presented in 1989, Goldberg's SCS (Simple Classifier System) has its origins in Goldberg's earlier work on gas pipeline operation control. SCS is important in the development of LCS most pragmatically because it was the most readily accessible publicly available LCS implementation, and as such was used by many LCS researchers. In terms of the development of LCS theory and practice, however, it is particularly important for demonstrating the emergence of Default Hierarchies, introducing an alternative bidding mechanism, and applying the implicit bucket-brigade to Holland's LCS proposals. Goldberg's SCS represents a simplification of the Canonical LCS, produced predominantly for educational purposes. Its classifiers are single condition, with ternary condition coding and a single-bit binary action (although this could be easily extended to an arbitrary binary value). It does not provide an internal message list, but rather allows actions to be posted to the environment, with consequent change to the environmental input. When an input is received from the environment, classifiers are matched and all matched classifiers then engage in a bid competition. In order to encourage the formation of Default Hierarchies, the bidding must prevent general classifiers from winning the bid competition whenever a more specific classifier exists.
This is achieved by distinguishing between the actual bid (a proportion of the strength) and the effective bid. The effective bid is the actual bid modified by the specificity of the classifier's condition. The specificity modification used within SCS is:

bide = bida × (k1 + k2 × s)

(where bide is the effective bid, bida is the actual bid, s is the specificity, and k1 and k2 are the bid parameters). This represents a much more delicate modification of the bid than that provided by Riolo (1987a), although developed with the same aim of preserving Default Hierarchies. The bid competition is run as a "noisy auction" over the effective bids, with the addition of a small proportion of random noise to each effective bid. The random noise prevents very small strength differences from causing SCS to concentrate selection upon a single classifier to the detriment of classifiers of similar strength. A single classifier is selected using roulette-wheel selection over the effective bids to provide the action. In contrast to Wilson's approach, SCS employs an elitist strategy - only the selected classifier's strength is adjusted. Unlike the Canonical LCS, a payment for winning the bid is required. The classifier pays its actual bid for winning the competition, giving this amount to the classifier that was active in the previous step. In addition, all the classifiers that bid pay a bid tax (set as a fixed proportion of their bid). The requirement for a bid tax within SCS is strongly linked to the elitist nature of classifier selection and reward - it is much more important that the correct classifier is chosen. Unfortunately, as discussed in section 2.3.4, this technique places too much emphasis on individual performance and is a contributory factor in the instability of LCS based on the Holland model. All classifiers then pay a life tax as in the Canonical LCS. The action is sent to the environment from the selected classifier, and the classifier is then updated with the received reward value. The G.A. employed by SCS is panmictic and selection is performed using roulette-wheel selection over the classifier strength. The G.A. is run relatively infrequently in comparison to its role within Wilson's approaches, and like Riolo's CFS-C more emphasis is placed on the combination of classifier creation operators than upon the G.A. alone. Two children are produced by the G.A. on each invocation, and these are placed into the population using a crowding scheme to delete a classifier randomly chosen from a low-fitness sub-population that was itself chosen by roulette-wheel selection over the reciprocal of each classifier's strength. The crowding scheme is required in SCS to maintain the population diversity. The elitist action-selection and reward mechanism will tend to maximise the strength of an individual classifier for each
condition, forcing the strength of other classifiers to remain low and making duplicates of each of these classifiers vulnerable to deletion. In a similar way, the existence of the noisy auction mechanism and the bid tax represents an implicit acknowledgement of the over-concentration of the SCS approach upon a few classifiers. Although SCS was primarily created for educational purposes, it is illustrative of the operation of many features of the Canonical LCS. The comparison between the approaches adopted by Wilson and Goldberg could, in many ways, not be more stark. They represent two viewpoints on the maintenance of populations of classifiers - one achieved through co-operative mechanisms and the other through competition and control; one emphasising generalisation and the other emphasising the development of default hierarchies. Although it is possibly simplistic and misleading to pigeon-hole the role of the G.A., given its fundamental part in niching within the two approaches one can contrast the predominant role of the G.A. within Wilson's LCS with the rule-discovery emphasis of the G.A. within SCS. Understanding this contrast is the basis of an understanding of the growing disillusionment with LCS before the advent of XCS, and of the reasons for additions to LCS structure in an attempt to solve the problems.

2.8 Recent Advances

Kovacs (2000b) notes that the publication evidence indicates two distinct periods within LCS research - the "Classical Period" from 1978 to 1989 and the "Modern Renaissance" from 1997 onwards. No real evidence exists as to why there were fewer LCS publications between these dates. However, a possible reason lies with the fact that many of the LCS publications from 1987 to 1991 were identifying what appeared to be fundamental problems with the classical Holland LCS, some of which were identified alongside the presentation of the Canonical LCS in section 2.3. Equally, the apparent renaissance of interest in LCS could be attributed to the development of a number of fundamentally new LCS approaches that appear both to address existing problems and to provide interesting new avenues of research. This section briefly presents the two main developments - Wilson's XCS and Stolzmann's ACS.

2.8.1 Wilson's XCS

Wilson's XCS (Wilson, 1995, 1998) can readily be cast not only as a crowning achievement, but also as the culmination of a long process of adaptation that is most appropriate to a field calling itself "Evolutionary Computation". The XCS ("eXtended Classifier System") introduces a slew of new features, but primarily the division of strength into two main components and the introduction of an implicit niching
mechanism. The strength record now maintains a Prediction measure that identifies the payoff prediction of the classifier, updated using a temporal-difference update technique close to Q-Learning (Watkins, 1989), and a Fitness measure that identifies the accuracy of the prediction of the classifier relative to that of other classifiers in its action sets. The prediction measure is used for action selection, though in a manner similar to the Canonical LCS, ZCS and '*', the action is chosen using the average of the fitness-weighted cumulative prediction of the classifiers proposing each action. The fitness measure is used by the G.A. to select parent classifiers to breed from within the action set. Using these new measures XCS is able to identify the accurate classifiers within the G.A. and therefore favours the optimally general classifiers above over-general classifiers. The G.A. invocation mechanism is now tied to the amount of use of the classifier so that the more general classifiers will have more access to the G.A., thereby also putting pressure on over-specific classifiers. This opposing pressure on both under- and over-generalisation enables XCS to identify the set of optimally general classifiers. The location of the G.A. within the action set, tied to a deletion mechanism that removes classifiers based on an estimate of the size of the action sets they occur within, provides a dynamic niching mechanism. This mechanism automatically maintains appropriately sized niches for each state × action × payoff combination required after generalisation is taken into consideration. Thus, XCS represents a major step forward within LCS work. It is the first Michigan LCS implementation to resolve the co-operation - competition dilemma (section 2.4), a feat achieved without requiring the delicate parameterisation balances so common in earlier LCS solutions. Chapter 3 presents an in-depth consideration of the operation of XCS and introduces both an implementation of XCS and a review of current XCS research.

2.8.2 Stolzmann's ACS

The Anticipatory Classifier System was developed by Stolzmann (1997) from psychological work by Hoffmann (1992). Hoffmann suggested a learning mechanism based upon the Field Expectation theory of Tolman (1932). Tolman proposed that the units of learning consisted of a sensory detection mechanism that gave relevance, a reaction to the sensory input in terms of actions performed, and an expectation of the new sensory detection that would be received upon performing the actions. A comparison of the actual sensory detection received and the predicted sensory detection would provide the information required to supply reinforcement and hence learning. Although Tolman introduced these S-R-E units, he did not identify how such units could
be formed. Hoffmann provided a system that he called Anticipatory Behavioural Control, introducing a mechanism that would differentiate a general S-R-E unit into more specialist units whenever the anticipated state was not achieved. Stolzmann noticed that the population of S-R-E units thus created was similar in form to an LCS and, though not employing evolutionary search mechanisms, did contain the multiple-competing-solution property of the LCS field. Classifiers within ACS have a single condition and action part and are augmented with a further element in the form (and of the same size) of a condition, but specifying the form of the anticipated next input after the action is performed. Whilst the ternary representation of the condition and the binary or integer action elements operate as expected from the Canonical LCS, this "Expectation" part is subtly different. On each iteration the current message is applied using Passthrough to the Expectation part in order to produce a binary value. It is the value thus created that is the prediction of the next state. Thus, the effect of a wildcard value within the Expectation part is to hypothesise that the message bit at the same location will not change as a result of the action performed. The performance cycle of the ACS is not dissimilar to that of the Canonical LCS. It starts by obtaining input from the environment and matching this against all classifiers in the population to form a match set. Rather like SCS, an elitist selection mechanism is used, with one classifier chosen from the match set. This classifier is selected by roulette-wheel selection over the fitness values of the classifiers in most cases, but with a small probability random selection is applied to allow further exploration. Once the Expectation part has been used to obtain the predicted next state, the action is performed and the new input message is immediately obtained from the environment. The credit allocation and the induction algorithms use the result of the comparison between the expected and the actual state to update the classifiers and create new classifiers. Each classifier carries with it two strength values - "quality", which is an accuracy value, and "reward prediction", which provides a prediction of the payoff that will be received. The credit allocation mechanism operates using a recency-weighted averaging technique similar to the QBB update mechanism used within ZCS (Wilson, 1994) to update the strength values. The quality measure is updated with a fixed payment if the expectation was correct, or by zero (effectively decreasing the average) if the expectation was incorrect, and always lies between 0.0 and 1.0. The reward measure is updated through a temporal difference update of the reward measure of the classifiers selected to act in the previous iteration based on the reward measure of the classifiers within the current
iteration. Any environmental reward received in the current iteration is used to update (again through the temporal difference mechanism) the reward measure of the classifiers within the current iteration. The quality and reward measures are both used in the calculation of the "fitness" value for action selection, with the quality measure acting to reduce the reward prediction as an indication of the reliability of the prediction. No genetic algorithm is provided within ACS. In its place a specialisation mechanism is provided that could be thought of in Canonical LCS terms as a complex Create Effector Operator. ACS begins with the fully general classifiers that represent each action already provided within the population. If the expectation does not match the actual state reached, but the parts where it does not match were wildcard values, the classifier is potentially correct but over-general. A new classifier is therefore produced with the same action but a new expectation part formed by replacing the wildcards at the incorrect positions with the bit values from the new message. The condition is similarly modified to specialise it with bits from the original message. In cases where an action occurs but the environmental input is changed in a manner not predicted by the expectation part (a non-wildcard position in the expectation part is different from that predicted), the classifier records that state in a vector carried with the classifier. On a later successful use of the classifier, one of the bit positions in the successful message that differs from the bit value in the unsuccessful state message will be used to modify the classifier so that a new classifier is produced. Classifier deletion occurs when any classifier has a quality measure that is lower than a provided parameter value, rather than on demand as in the Canonical LCS. Stolzmann demonstrated that ACS was able to rapidly learn optimal solutions within many Markovian WOODS environments. With a further enhancement he termed "action chunks", classifiers that executed a number of actions in sequence were introduced so that ACS could also learn solutions within certain forms of non-Markovian environment. The ability to recognise non-Markovian states is intrinsically tied to the ability of ACS to predict the next state, a feature missing from most other LCS implementations (but see also Riolo, 1991; Holland, 1991). Whilst ACS provided specialisation techniques, no counterpart technique to provide generalisation pressure was available. Butz et al (2000a, 2000b) have recently introduced a genetic algorithm into ACS to provide a generalisation pressure. Butz demonstrated that with the addition of a G.A. sufficient generalisation pressure can be exerted to significantly reduce the size of the final ACS populations whilst benefiting
the performance of ACS. Comparative studies of ACS and XCS suggest that ACS with the additional G.A. is able to learn optimal solutions as well as or better than XCS. The ACS architecture is a novel approach of much promise. Perhaps the most interesting prospect could be an amalgamation of the principles of the ACS approach within XCS to fulfil Wilson's vision of a reliable LCS implementation that is not wholly reliant upon external reward. Like XCS, progress with research using ACS has been rapid and has fuelled the renewed interest in the LCS approach.

2.9 Conclusion

Learning Classifier Systems have been described as a "quagmire" (Goldberg, Horn and Deb, 1992) for good reason. The impact of many of the design choices available to the developer of an LCS implementation remains poorly understood from a global perspective. By providing the Canonical LCS as a centre-point for a description of LCS architectural features, the design choices and their pitfalls have been presented. The Canonical LCS framework was then used as the basis of a review of the Michigan LCS implementations that have had the most impact upon the area. The review of LCS implementations has sought to identify a distinction between implementations that have recognised the need for co-operative populations of competing classifiers and have chosen mechanisms that encourage co-operation, and those that have failed to recognise this requirement, or have recognised it yet produced inherently competitive implementations that require non-trivial control mechanisms to maintain a balance. Such a distinction is easy to make in hindsight, but until recently has been recognised by only a few. The XCS approach, like the Animat LCS and the GOFER-1 implementation from which it derives, was developed with an understanding of the need for co-operation. It is the first LCS to show predictable and reliable performance without compromising the role of the G.A. However, it will not be the last LCS implementation, and in the development of future LCS implementations the need to select appropriate mechanisms carefully must be understood. Whilst Lanzi and Riolo (2000) provide an up-to-date review of LCS work, and Wilson and Goldberg (1989), Goldberg (1989) and Fogarty, Carse and Bull (1994) give reviews of earlier work, there has not to date been a comparative presentation of the major LCS implementations. It is hoped that this chapter will therefore, in addition to setting the scene for later discussion, provide both a vital point of reference and an important tool for the training of those wishing to enter the field.


Chapter 3

XCS LEARNING

I would argue that … two tensions, competition vs. co-operation and performance vs. generality, were at the root of much of the field's earlier difficulties. Interestingly, the full solutions to each of them are complementary … fitness based on accuracy and a niche GA in combination appear to overcome the principal problems of classifier systems' first twenty years, opening the way to a strong advance in the period just beginning. S. W. Wilson (in Holland et al 2000)

3.1 Background

The XCS Classifier System (Wilson, 1995, 1996, 1998) represents a major step forward for Michigan-style LCS. Deriving from Wilson's Animat (Wilson, 1985) and ZCS (Wilson, 1994), the XCS simplifies the Holland-style classifier system in an attempt to produce an LCS whose operation and dynamics can be readily understood and predicted. Furthermore, it borrows strength update mechanisms from Reinforcement Learning based upon Watkins' update model (Watkins, 1989) to provide a multi-parameter strength regime that more accurately reflects the different roles of strength within action selection and the GA. Finally, it re-introduces GA mechanisms recommended by Booker (1985) that provide a niching facility to allow co-operative sets of rules to coexist within a population whilst encouraging competing rules to converge on optimum rule attributes within a niche. By using accuracy as the GA selection criterion, XCS is the first LCS to be able to claim to 'reliably generate the most general accurate classifiers' - the so-called 'Generalisation Hypothesis' (Wilson, 1996). Further work by Kovacs (1997) has demonstrated that the provision of further operators can sufficiently focus the population once exploration is complete to reliably produce a minimum population consisting of the most general accurate classifiers, and has led to the 'Optimality Hypothesis', which suggests that using these operators the rule set generated will be the optimal rule set for a given static problem. Accurate and effective generalisation is a dramatic advance for Learning Classifier System research. Previous LCS implementations have not been able to demonstrate
effective and stable generalisation. Wilson (1988) demonstrated that, although the Animat LCS could identify generalisations, it tended towards specific classifiers over time. Within SCS (Goldberg, 1989) and CFS-C (Riolo, 1987a, 1988a) 'specificity' modifiers were used within bidding to favour more specific classifiers and thereby encourage the emergence of Default Hierarchies. However, it was found that the generalisations required within these structures were eventually replaced by a pressure towards more specific classifiers (Riolo, 1989b). The inability to rely upon the generalisation mechanism to discover optimal generalisations, or upon the dynamics of the LCS to maintain these classifiers, represented a considerable problem when compared with the performance of Neural Networks which, in other respects, could be equalled or surpassed by the performance of an LCS (Wilson, 1986a). This chapter makes a threefold contribution to XCS research. Although XCS is a relatively simple system, a detailed explanation of the operation of XCS that provides a basis for implementation was unavailable until Butz and Wilson (2000)22, published shortly before the end of this project, with Kovacs (1996) providing the most complete description before Butz (2000). The work involved in producing an XCS suitable for use within the intended research, and in particular the amount of consultation with Kovacs, Lanzi and Wilson required to establish correct operation, was sufficient to indicate the requirement for a more detailed formal description of XCS. This chapter, together with the publicly available XCS implementation produced for this research, provided a new baseline reference point for other researchers. The implementation is described using a complete formal specification of XCS in ISO VDM-SL (ISO, 1996). The specification was developed using SpecBox version 2.21a (Adelard, 1996) to confirm the syntactic and semantic correctness of the specification. The formal description provides other researchers with an implementation-independent description of XCS that will allow other equivalent implementations to be produced without the need to mimic an existing implementation. It was also used to enable the implementation developed for this research (named XCSC) to be informally verified.

22 Towards the end of this project and subsequent to the writing of chapter 3 (see Barry and Saxon (1999) for a report derived from an earlier version of the chapter), Butz and Wilson (2000) produced an algorithmic description of XCS. They also made available implementations of XCS in C (Butz, 1999) and Java (Butz, 2000). The XCSC implementation produced for this research project was made available in 1998 and remains the first publicly available XCS version, with the equivalent Java version (JXCS) made available shortly afterward. XCSC and JXCS have subsequently been used within both commercial research projects (Saxon and Barry, 1999) and by other XCS researchers.


The second objective is to provide an up-to-date review of investigations on and using XCS. Wilson (2000b) provides a recently published alternative review of XCS research. Whilst this work is therefore complementary to Wilson's, it emphasises different areas and includes research papers published subsequent to Wilson's work. The third objective is to identify a number of areas which remain fertile fields for further investigation, some derived from the review of earlier work or reproduced from Wilson's own suggestions, and some not previously published. This chapter therefore starts with an overview of the structure and operation of XCS (Wilson, 1996), leading into the formal specification of the data structures and operations of XCS. The XCSC implementation produced for this project is then briefly discussed, after which the comparability of the performance of this implementation with previous XCS work is demonstrated experimentally. The final sections provide an up-to-date review of XCS research and an identification of further research opportunities within XCS work.

3.2 Structure and Operation

Whilst for the Canonical LCS the GA can operate rather like a background operator, within all of Wilson's LCS implementations ('*' - Wilson, 1985; BOOLE - Wilson, 1986b; ZCS - Wilson, 1994) the GA plays a major role, invoked frequently with the mechanics of the LCS forming the evaluation function. The LCS itself is simplified so that it is less like a fully-fledged production system than the Michigan LCS. The XCS builds on these, and therefore any description of XCS must discuss the aspects of the Canonical LCS that have been removed as much as the features that have been introduced.

3.2.1 Structure

The XCS maintains a population of classifiers, where the number of classifiers represented within the population is limited by a maximum population size. The classifiers are simple single-condition, single-action classifiers23. The condition encoding is ternary, with the third value encoding the "don't care" term used to provide generalisation24. The action encoding is binary or integer, so no passthrough is allowed, and each action encoding invokes a single effector activation.

23 … although as Wilson (1994) points out, they need not be kept this simple.

24 Wilson (1994) advocates the investigation of higher level encoding - see section 3.6.2.4

[Figure 3.1 - Schematic of XCS: the Population, with the Match Set and Action Set formed from it, the Credit Allocation and Induction Algorithms operating upon them, and Input and Output messages linking the system to the Environment.]

Each classifier carries with it a 'strength' record, but unlike conventional LCS implementations it does not attempt to capture all of the details of the classifier within a single value. Wilson (1995) notes that a conventional LCS uses the 'strength' value for both selection of an action to fire and reproduction using the GA. This confuses two separate issues - the prediction of the likely payoff of the classifier and the reproductive value of the classifier. It was clear that the prediction of payoff within some classifiers may not reach the maximum value attainable, not because the classifier was incorrect but because the classifier was not accurate - it was correct in some circumstances and not others. This allowed classifiers that accurately predicted a lower payoff to have the same 'strength' value for reproduction as classifiers that were inaccurately selected for use in higher payoff situations, so maintaining inaccuracy.


Therefore, Wilson introduces the measure 'Prediction' (denoted p) to describe the reward level that the classifier system should expect to receive if the action proposed by the classifier is selected when its condition is matched. A second measure, 'Accuracy' (denoted κ), is a calculated measure of the likelihood of the classifier obtaining the reward identified by the Prediction when invoked. It is derived from a third maintained value, 'Prediction Error' (denoted ε), which is a proportion representing the absolute difference between the prediction of the classifier and the reward (denoted R) or payoff (denoted r) received over a period of time. Finally, the measure 'Fitness' (denoted f) is derived from κ and is the accuracy of the classifier relative to that of other classifiers that propose the same action in the conditions in which the classifier fires. Prediction is used for action selection, and Fitness is used for selection in the GA, with the other values being maintained to effect the calculation of Fitness. As a convenience factor, Wilson (1995) proposed the introduction of a novel feature he termed 'numerosity' (denoted n) that is also carried with the strength record but which performs a separate function. Wilson noted that the action of the GA in XCS will, like any GA, create many duplicates of a classifier as the population converges. LCS implementations like Riolo's (1987a) and Goldberg's (1989) limit selection and therefore the number of classifiers rewarded; in these LCS the strengths of duplicates will therefore vary. Within an LCS that shares reward between all classifiers matching and proposing the same action, the strength of duplicates will converge to the same value. Since XCS is a reward-sharing LCS, as an economy measure it is possible to remove these duplicate rules and simply maintain a count of the number of duplicates that are represented by a single classifier instance within the population - its numerosity. As long as all selection, introduction, and deletion algorithms also take this numerosity figure into account, the net effect should be the same. Clearly, this implies that the overall population size must be calculated in terms of the sum of the numerosity of the classifiers, rather than the number of classifier instances. Wilson (1995) reported that informal investigations seemed to verify that the introduction of numerosity did not affect GA operation, and Kovacs (1996) confirmed that classifiers with this numerosity measure ("Macro-Classifiers") cause no significant change in operation when compared to an equivalent XCS implementation with a population of classifiers that allowed duplicates ("Micro-Classifiers"). Henceforth all references to 'Classifier' will refer to Macro-Classifiers unless a distinction must be made, in which case the terms Micro- or Macro-Classifier will be used.
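To make the division of 'strength' concrete, the following Python sketch shows a macro-classifier record together with one common form of the accuracy calculation (the power-law form given by Butz and Wilson, 2000); the field names and parameter defaults are illustrative assumptions rather than a definitive rendering of XCSC.

from dataclasses import dataclass

@dataclass
class MacroClassifier:          # illustrative record; names are not from XCSC
    cond: str                   # ternary condition, e.g. "01#1#0"
    action: int
    p: float                    # prediction of payoff
    eps: float                  # prediction error
    f: float                    # fitness: accuracy relative to the action set
    n: int = 1                  # numerosity: count of identical micro-classifiers

def accuracy(eps, eps0=0.01, alpha=0.1, nu=5.0):
    # One common accuracy function: classifiers whose error is below the
    # threshold eps0 are fully accurate; above it, accuracy falls off sharply.
    return 1.0 if eps < eps0 else alpha * (eps / eps0) ** -nu

def population_size(population):
    # The population limit Pmax counts micro-classifiers, so the size is the
    # sum of numerosities, not the number of macro-classifier instances.
    return sum(cl.n for cl in population)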


The XCS does not include the Message List, relying instead upon a single input message and a single output message. As in the Canonical LCS, these messages originate from and are sent to (respectively) the environment, and are encoded/decoded by an appropriate environment interface. Formally, then, the XCS consists of:

• A finite-size (Pmax - expressed as the number of micro-classifiers) population P of uniform representation macro-classifiers c:

Population = Classifier*
inv p ≜ (Σ i = 1 .. len p ⋅ p(i).n) ≤ Pmax ∧
        ∀ c1 ∈ elems p ⋅ ¬∃ c2 ∈ elems p ⋅ (len c1.c ≠ len c2.c ∨ len c1.a ≠ len c2.a)

• Each classifier C is defined as:

Classifier :: c : Condition
              a : Action 25
              s : Strength
              n : Numerosity
              v : Values

Condition = TritSeq
TritSeq = Trit*
Trit = '0' | '1' | '#'
Action = BitSeq
BitSeq = Bit*
Bit = '0' | '1'
Numerosity = ℕ

Strength :: p : Prediction
            ε : Error
            κ : Accuracy
            f : Fitness

Prediction = PositiveReal
Error = ZeroToOne
Accuracy = ZeroToOne
Fitness = ZeroToOne
ZeroToOne = {z ∈ ℝ | 0.0 ≤ z ≤ 1.0}
PositiveReal = {p ∈ ℝ | 0.0 ≤ p}

Values :: e : ActionSetEstimate
          x : Experience
          g : GAIteration

ActionSetEstimate = PositiveReal
Experience = ℕ
GAIteration = Iterations
Iterations = ℕ

(the Values component maintains run-time data for mechanisms described later)

25 Wilson (1995, 1998) and Kovacs (1996) use a single integer value as the action, whilst Lanzi (1997) uses a binary encoded action. This implementation will utilise a binary encoding.

• An input and output message (that will be denoted d and e for detector and effector messages, respectively):

Message = Bit*
d : Message
e : Action

• The environment - the maintainer of environmental states, which receives actions to be performed that will produce state changes and thereby lead to new state information being detected.

• The encoding of detected states into messages and the decoding of provided messages into actions.

• Records (as yet to be defined) of transient data required for the operation of the XCS.

3.2.2 The Performance Subsystem

A single trial of the XCS (where a trial is a single iteration for a single-step environment, or many iterations for a multiple-step environment) is started by determining whether the episode will be run as an Exploration or an Exploitation trial. Wilson (1995) proposes a strategy that selects between Explore and Exploit by a coin-toss random selection at the start of each trial. This keeps XCS within either Explore or Exploit mode for the duration of a trial, in contrast to traditional Reinforcement Learning methods, which tend to select the mode on each step even in multiple-step environments. Clearly many other Explore-Exploit strategies are possible (see Wilson, 1997), and the method chosen in the implementation for the experiments within this thesis alternates deterministically between explore and exploit mode. This technique, also adopted by Kovacs (1996), ensures that the reports (which are produced on each Exploit trial) are produced with equal regularity.

Mode = <EXPLORE> | <EXPLOIT>
m : Mode

CHOOSE_MODE ()
ext wr m : Mode
post m ≠ m~ 26

Once the mode of operation has been decided, a new message is obtained from the environment. This is compared to all of the classifiers maintained within the current population - a process known as 'matching'. The message d 'matches' a classifier with condition c if, and only if, each bit within the message has the same value as the value at the same position within the condition, or the ternary value is the don't care ("wildcard") value:

MATCH (c : Condition) r : B
ext rd d : Message
pre len d = len c
post r ⇔ match_condition (c, d)

match_condition : Condition × Message → B
match_condition (c, d) ≜ ∀ i ∈ inds d ⋅ (c(i) = '#' ∨ (c(i) ≠ '#' ∧ c(i) = d(i)))

As the process of matching is performed a record of those classifiers that match is created. This record can be formally defined as follows:

MatchRecord = Classifier-set 27
MatchSet = map Action to MatchRecord
M : MatchSet

26 A simpler specification that allowed for random selection would have the [trivial] post condition: post m ∈ Mode

27 In implementation this would be a set of references to classifiers, since the match record only records the classifiers within the population that have been matched, not copies of them.
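As an informal companion to the specification, the following Python sketch transliterates match_condition and the construction of the match-set map; it assumes classifier records with cond and action fields (as in the earlier sketch) and is illustrative only.

def match_condition(cond, msg):
    # A condition matches when every position is either the wildcard '#'
    # or equal to the corresponding message bit.
    return all(c == '#' or c == m for c, m in zip(cond, msg))

def create_match_set(population, msg):
    # Build the map from each proposed action to the classifiers that match
    # and propose it; the entries hold references to classifiers, not copies.
    match_set = {}
    for cl in population:
        if match_condition(cl.cond, msg):
            match_set.setdefault(cl.action, []).append(cl)
    return match_set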

The MatchSet (denoted [M] or M) is initially set to the empty map before the match process begins, and MatchRecords are added as new actions are proposed by matching classifiers, with each matching classifier added to the appropriate match record.

CREATE_MATCHSET ()
ext rd d : Message
    rd p : Population
    wr M : MatchSet
pre M = {}
post ∀ i ∈ inds p ⋅
  (match_condition(p(i).c, d) ∧ p(i) ∈ M(p(i).a) ∧
    ¬∃ act ∈ dom M ⋅ (act ≠ p(i).a ∧ p(i) ∈ M(act)))
  ∨ (¬match_condition(p(i).c, d) ∧ p(i) ∉ ⋃ rng M)

Once every population member has been compared, if the match set is empty no classifier exists that can recognise the input message, and so a classifier must be created. The new classifier is created by the triggered induction operator known in the Canonical LCS as the Create Detector Operator, though known in XCS simply as a form of 'Covering', which is described with the other induction operators in section 3.2.4. From these classifiers a Prediction Array is formed, consisting of the actions that may be performed by the classifier system (i.e. the set of actions proposed by the matched classifiers), each with an associated 'System Prediction':

PredictionArray = map Action to Prediction
S : PredictionArray
inv s ≜ dom s ≡ dom M

The prediction array is reset to hold no data on each iteration, and the 'System Prediction' for each action a ∈ dom M is then calculated and added to the prediction array. It is calculated using the following fitness-weighted prediction sum:

predict : MatchRecord → PositiveReal
predict (m) ≜ (Σ C ∈ m ⋅ C.s.p × C.s.f) / (Σ C ∈ m ⋅ C.s.f) 28

28 When numerosity is taken account of, each value used from a classifier that does not already take numerosity into account must be multiplied by the numerosity of the classifier. However, since the fitness is a measure of the relative accuracy of a classifier, numerosity has already been factored into fitness and the whole expression therefore accounts for numerosity.
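In code, the fitness-weighted system prediction can be sketched as below; the guard against an all-zero fitness sum is an implementation convenience assumed here, not part of the specification.

def predict(match_record):
    # Fitness-weighted mean of the predictions of one action's classifiers.
    # Fitness already accounts for numerosity (see footnote 28).
    weighted = sum(cl.p * cl.f for cl in match_record)
    total_fitness = sum(cl.f for cl in match_record)
    return weighted / total_fitness if total_fitness > 0.0 else 0.0

def calc_system_prediction(match_set):
    # One System Prediction per distinct action proposed within [M].
    return {action: predict(record) for action, record in match_set.items()}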

CALC_SYSTEM_PREDICTION ()
ext rd M : MatchSet
    wr S : PredictionArray
pre M ≠ {}
post dom M = dom S ∧ ∀ a ∈ dom M ⋅ S(a) = predict(M(a))

The fitness value is factored in so that classifiers with higher fitness will contribute more to the prediction for that action. Since fitness is simply the relative accuracy of the classifier, this will result in the prediction contribution of each classifier reflecting the system's degree of belief in the prediction made by that classifier. Once the System Predictions for the actions have been calculated XCS chooses an action to perform using the values associated with each action in S. The method used for action selection depends upon the Explore-Exploit strategy adopted. If explore mode is selected XCS selects an action to apply at random from those advocated within [M]. If exploit mode is selected XCS always uses the action advocated within [M] with the highest System Prediction, selecting arbitrarily if there are a number of actions with the same high System Prediction.

SELECT ()
ext rd S : PredictionArray
    rd m : Mode
    wr e : Action
pre S ≠ {}
post e ∈ dom S ∧ ((m = <EXPLORE>) ∨ (m = <EXPLOIT> ∧ ∀ p ∈ rng S ⋅ p ≤ S(e)))

The classifiers that were matched in this iteration and propose the selected action (those that contributed to the action's System Prediction) are identified as the 'Action Set' (denoted as [A] or A, and also reset to contain no data at the start of each iteration). All classifiers within [A] will have a parameter g incremented. This is used for triggering of the Genetic Algorithm and is described in section 3.2.4.

ActionSet = Classifier-set
A : ActionSet

CREATE_ACTIONSET ()
ext rd M : MatchSet
    rd e : Action
    wr A : ActionSet
pre M ≠ {}
post A = M(e)
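A corresponding sketch of action selection and action-set formation, again assuming the match-set map and the mode names used above:

import random

def select_action(prediction_array, mode):
    # Explore: choose uniformly at random from the advocated actions.
    # Exploit (or test): choose an action with the highest System Prediction,
    # breaking ties arbitrarily.
    if mode == 'explore':
        return random.choice(list(prediction_array))
    best = max(prediction_array.values())
    return random.choice([a for a, p in prediction_array.items() if p == best])

def create_action_set(match_set, action):
    # [A] is exactly the match record for the chosen action.
    return match_set[action]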

The Action Set (and a previous Action Set record, At-1) and the current iteration count are the last of the internal transient additional data items that are required. The complete set of transient data creates the state of the XCS: state XCSState of t : Iterations d : Message e : Action m : Mode p : Population M : MatchSet S : PredictionArray A : ActionSet At-1 : ActionSet init mk-XCSState (t0, d0,e0,m0,p0,M0,S0,A0,A0t-1) ∧ t0 = 0, d0 = [] ∧ e0 = [] ∧ m0 = ∧ p0 = [] ∧ M0 = {} ∧ S0 = {} ∧ A0 = {} ∧ A0t-1 = {} This data is typically initialised at the start of the iteration as follows: RESET () ext wr d : Message wr e : Action wr M : MatchSet wr S : PredictionArray wr A : ActionSet wr At-1 : ActionSet post d = [] ∧ e = [] ∧ M = {} ∧ S = {} ∧ A = {} ∧ At-1 = A~ Once an effector action (e) has been selected, it is sent to the environment interface for decoding and action upon the environment. The environment is examined for a reward R, and the received reward is recorded, if available. If the XCS is in a single step environment the strength values of the classifiers in [A] are updated only if the XCS has just operated in Explore Mode. In a multiple-step environment update of strength occurs in both modes. Details of the strength update procedure are given in section 3.2.3. If in exploit mode in a single step environment, or if at the end of an exploit trial in a multiple-step environment, the reporting facilities of the LCS are invoked in order to obtain information on the performance of the LCS as appropriate. Limiting reporting to


within the Exploit steps is obviously a means of ensuring that the best performance of the LCS is demonstrated. Since learning will also occur in Exploit steps of a multiple-step problem, however, it is convenient to introduce a third mode of operation, termed 'test mode' (Saxon and Barry, 2000). This operates like an exploit trial but with no learning mechanism invoked, and ensures that the reporting is well formed regardless of the base mode of the XCS. Finally, in each explore-mode trial the remaining triggered induction operators are given an opportunity to generate new rules before the start of the next iteration of the XCS. The detailed operation of these operators is described in section 3.2.4.

Succinctly, the operation of an iteration of the XCS is therefore:

1. RESET( ) - If within a multiple-step trial, set [At-1] to the value of [A]. Reset [M], [A], and S to their initial values.

2. CHOOSE_MODE( ) - Choose Explore or Exploit mode according to the current explore-exploit strategy.

3. Obtain a new message d (d ∈ D, the set of all detector messages) from the input interface to the environment.

4. CREATE_MATCHSET( ) - Compare d with the condition of all classifiers C ∈ P. The message d 'matches' a classifier condition c iff:

   ∀j ∈ dom(d) ⋅ (c[j] = '#' ∨ c[j] = d[j])

   Identify those classifiers C ∈ P whose condition matches the detector message d. Call this the 'match set' [M].

5. CREATE_DETECTOR( ) - If [M] = {} then call the covering algorithm to create a new classifier that will match the current message, and re-apply step 4 to match the generated classifier.

6. CALC_SYSTEM_PREDICTION( ) - Accumulate the System Prediction S(e) for each distinct action e ∈ dom(M) using the prediction and fitness of the classifiers C ∈ M(e):

   S(e) ← ( Σ_{C ∈ M(e)} C.p × C.f ) / ( Σ_{C ∈ M(e)} C.f )

7. SELECT( ) - Select an action e from those represented within the prediction array S:
   If in Explore Mode: e ∈ dom(S)
   If in Exploit or Test Mode: e ∈ dom(S) ∧ ∀a ∈ dom(S) ⋅ S(a) ≤ S(e)
   The Action Set [A] consists of all classifiers proposing the action: C ∈ M(e).

8. Send message e to the environment decoders for environment action.

9. Examine the environment for a reward.

10. If in a single-step environment and in Explore mode:
    • SCHEDULE 3.2 - Apply the Credit Allocation algorithms to update the value of each classifier in [A].
    If in a multiple-step environment and in Explore or Exploit (but not Test) mode:
    • SCHEDULE 3.2 - If a reward was received, apply the Credit Allocation algorithms to update the value of each classifier in [A].
    • SCHEDULE 3.3 - If not in the first step of a trial, calculate the payoff received from [A], and using the payoff apply the Credit Allocation algorithms to update the value of each classifier in [At-1].

11. If in exploit mode and at the end of a learning trial, invoke the reporting operations.

12. SCHEDULE 3.4, SCHEDULE 3.5 - If in explore mode, test the triggers to determine whether to apply each of the remaining induction algorithms in order to generate new candidate classifiers.

Schedule 3.1 - Operation of an Iteration of the XCS
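To make the control flow concrete, the following Python sketch (not part of the formal specification) traces steps 2 to 7 of Schedule 3.1 under stated assumptions: the Classifier fields and function names are illustrative only, and numerosity, credit allocation and rule induction are omitted.

import random
from dataclasses import dataclass

@dataclass
class Classifier:
    condition: str   # ternary string over {'0', '1', '#'}
    action: int
    p: float = 10.0  # prediction
    f: float = 0.01  # fitness

def matches(condition, message):
    # A message matches iff every non-'#' position agrees (step 4).
    return all(c == '#' or c == m for c, m in zip(condition, message))

def system_prediction(match_set):
    # Fitness-weighted mean prediction per distinct action (step 6).
    pred = {}
    for a in {c.action for c in match_set}:
        cs = [c for c in match_set if c.action == a]
        pred[a] = sum(c.p * c.f for c in cs) / sum(c.f for c in cs)
    return pred

def select_action(pred, explore):
    # Random action in explore mode, best System Prediction otherwise (step 7).
    return random.choice(list(pred)) if explore else max(pred, key=pred.get)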

3.2.3 The Credit Allocation Subsystem

Classifiers within the population are credited in one of two cases. In all environments the classifiers within [A] are credited with the reward R whenever reward is received from the environment for performing an action. Additionally, within multiple-step environments, whenever another classifier is invoked in the current iteration as a result of an action performed in the preceding iteration of the same learning trial, the classifiers in the preceding iteration's Action Set are credited with a payoff amount. Apart from the Action Set within which the update occurs and the source of the update value, the mechanism invoked for the update is the same in both cases.

The classifier parameters updated are those identified within the Strength record of the formal definition - the Error, Prediction, Accuracy, and Fitness values. In addition, the ActionSetEstimate, Experience and GATrigger parameters are updated for use within the Rule Induction algorithms (see Section 3.2.4). In order to ensure that each value reflects a recency-weighted approximation to the actual reward or payoff value, MAM ("Moyenne Adaptive Modifiée") update (Venturini, 1994) is used for the Error, Prediction, and the ActionSetEstimate. This technique computes the average of the updates received until a [typically small] number of updates have occurred, after which the standard Widrow-Hoff delta rule:

vj ← vj + β (V − vj)


(where β is the Learning Rate, typically 0.2) is applied. The MAM technique achieves a good initial estimate of the value received by using standard averaging for the first l iterations (typically l = 1/β), after which the estimates are progressively refined according to a proportion of the absolute error in the estimate, giving a fixed time before any stable value received is accurately represented. The Accuracy is computed directly from the Error parameter; the Fitness parameter is derived from the accuracy of all classifiers within the action set and is therefore updated using the Widrow-Hoff delta rule only.

In order to facilitate the operation of MAM, the Values parameter denoted x (the "experience" parameter) is maintained by each macro-classifier. This maintains a count of the number of times the values of the classifier have been updated - the number of times the classifier has occurred within an Action Set.

For single-step problems the update algorithm, given a reward R, is as follows:

1. Calculate the number N of micro-classifiers in A:

calc_micro_classifiers : ActionSet → ℕ
calc_micro_classifiers (A) Δ
    if A = {} then 0 else Σ_{c ∈ A} c.n

2. For all C ∈ A:

2.1 Update the experience (x) of classifier C:

UPDATE_EXPERIENCE (c : Classifier) r : Classifier
post r = µ(c, v → µ(c.v, x → c.v.x + 1))

2.2 Update the Action Set Estimate (e) of classifier C using MAM:

UPDATE_ACTIONSET_ESTIMATE (v : Values, A : ActionSet) r : Values
ext rd β : ZeroToOne
pre v.x > 0
post let N = calc_micro_classifiers(A) in
     (v.x < 1/β ∧ r = µ(v, e → (Σ_j N_j) / v.x))
   ∨ (v.x ≥ 1/β ∧ r = µ(v, e → v.e + β(N − v.e)))

(where N_j is the numerosity total of each of the x action sets this classifier has appeared in so far)

2.3 Update the prediction error (ε) of C using the Absolute Error (εa) by MAM:

calc_absolute_error : ℝ × ℝ × ℝ × ℝ → ZeroToOne
calc_absolute_error (Rmax, Rmin, reward, prediction) Δ
    |reward − prediction| / (Rmax − Rmin)

(where Rmax and Rmin are the maximum and minimum reward)

UPDATE_PREDICTION_ERROR (c : Classifier, R : ℝ) r : Classifier
ext rd Rmin : ℝ
    rd Rmax : ℝ+
    rd β : ZeroToOne
pre c.v.x > 0
post let εa = calc_absolute_error(Rmax, Rmin, R, c.s.p) in
     (c.v.x < 1/β ∧ r = µ(c, s → µ(c.s, ε → (Σ_j εa_j) / c.v.x)))
   ∨ (c.v.x ≥ 1/β ∧ r = µ(c, s → µ(c.s, ε → c.s.ε + β(εa − c.s.ε))))

2.4 Update the prediction (p) of classifier C using MAM:

UPDATE_PREDICTION (c : Classifier, R : ℝ) r : Classifier
ext rd β : ZeroToOne
pre c.v.x > 0
post (c.v.x < 1/β ∧ r = µ(c, s → µ(c.s, p → (Σ_j r_j) / c.v.x)))
   ∨ (c.v.x ≥ 1/β ∧ r = µ(c, s → µ(c.s, p → c.s.p + β(R − c.s.p))))

(where r_j is the payment for each of the x action sets this classifier has appeared in so far)


2.5 Calculate the Accuracy (κ) of classifier C:

UPDATE_ACCURACY (s : Strength, R : ℝ) r : Strength
ext rd α : ZeroToOne
    rd ε0 : ZeroToOne
    rd m : ZeroToOne
pre ε0 > 0.0
post (s.ε ≥ ε0 ∧ r = µ(s, κ → exp(ln(α) × (s.ε − ε0) / ε0) × m))
   ∨ (s.ε < ε0 ∧ r = µ(s, κ → 1))

(where α is a fall-off factor dictating the accuracy curve, ε0 is the minimum error before a classifier is considered accurate, and m is a multiplier typically set to 0.1)

3. Accumulate the total Accuracy (κt) for all classifiers C ∈ A:

calc_total_accuracy : ActionSet → ℝ
calc_total_accuracy (A) Δ Σ_{c ∈ A} (c.s.κ × c.n)

4. For all classifiers C ∈ A, calculate the fitness (f) from the relative accuracy (κr) using the Widrow-Hoff rule:

CALCULATE_FITNESS (c : Classifier, A : ActionSet) r : Classifier
pre ε0 > 0.0
post let κt = calc_total_accuracy(A) in
     let κr = (c.s.κ × c.n) / κt in
         r = µ(c, s → µ(c.s, f → c.s.f + β(κr − c.s.f)))

Schedule 3.2 - Operation of Credit Allocation in XCS within a Single-step Environment
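As an informal companion to Schedule 3.2 (and not part of the VDM specification), the following Python sketch shows one way the MAM update and the single-step credit allocation might be realised. The classifier attribute names (p, error, kappa, f, numerosity, experience, as_estimate) are illustrative assumptions rather than the XCSC interfaces.

import math

def mam_update(current, target, experience, beta=0.2):
    # MAM: plain averaging until 1/beta updates have occurred, then the
    # Widrow-Hoff delta rule with learning rate beta.
    if experience < 1 / beta:
        return current + (target - current) / experience
    return current + beta * (target - current)

def update_action_set(action_set, reward, beta=0.2, eps0=0.01, alpha=0.1, m=0.1,
                      r_max=1000.0, r_min=0.0):
    n_micro = sum(c.numerosity for c in action_set)          # step 1
    for c in action_set:                                     # step 2
        c.experience += 1                                    # 2.1
        c.as_estimate = mam_update(c.as_estimate, n_micro,   # 2.2
                                   c.experience, beta)
        abs_error = abs(reward - c.p) / (r_max - r_min)      # 2.3
        c.error = mam_update(c.error, abs_error, c.experience, beta)
        c.p = mam_update(c.p, reward, c.experience, beta)    # 2.4
        c.kappa = (1.0 if c.error < eps0                     # 2.5
                   else math.exp(math.log(alpha) * (c.error - eps0) / eps0) * m)
    total = sum(c.kappa * c.numerosity for c in action_set)  # step 3
    for c in action_set:                                     # step 4
        c.f += beta * (c.kappa * c.numerosity / total - c.f)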

In a single-step environment the prediction therefore becomes an estimate of the reward that the XCS will receive if its action is chosen. If the classifier is accurate, then this will become equal to the reward received, whilst an inaccurate classifier will have a prediction that reflects the average of the most recent rewards it received whenever the classifier was used. The error and accuracy details are only used in the calculation of


fitness, and reflect the average error in the prediction of the classifier as a proportion of the maximum prediction possible by any classifier, and the absolute accuracy of the classifier as defined by the accuracy function. The following is a description of the update algorithm for a multiple-step problem:

1. If a reward value R is available from the environment, invoke the single-step algorithm (Schedule 3.2) for the current action set [A].

2. If this is not the first step of a trial:

2.1 Calculate the payment r to be allocated to the classifiers in the previous action set [At-1]. This is derived from the maximum prediction within the prediction array S within the current iteration, discounted using the parameter γ:²⁹

CALCULATE_PAYMENT () r : PositiveReal
ext rd S : PredictionArray
    rd γ : ZeroToOne
post let π ∈ rng S ∧ ∀p ∈ rng S ⋅ p ≤ π in r = γπ

2.2 SCHEDULE 3.2 - Invoke steps 2 through 5 of the single-step algorithm (Schedule 3.2), systematically replacing the reward R with the discounted payment r and the action set A with the previous action set At-1.

Schedule 3.3 - Operation of Credit Allocation in XCS within a Multiple-step Environment

The use of the discount factor within the payoff provides selection between environmental states producing the same reward but lying at different distances from the rewarding state. The nearer rewarding state will provide the maximum payoff, and so will become more attractive within exploitation. XCS thus utilises the Temporal Difference method of action-selection over equally rewarding routes.

²⁹ In Wilson (1995, 1996) and Kovacs (1996) the payment is given as ri-1 ← ri-1 + (γ.max(Si)), but since there has implicitly been no reward, the first term becomes 0 for this update; equally, when a reward is received the update is immediate with only the reward, since no subsequent prediction exists. If the update were to occur using both reward and prediction, the resulting prediction values could self-perpetuate to artificially high values.
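A minimal Python sketch of CALCULATE_PAYMENT, assuming the prediction array is held as a dictionary from actions to System Predictions (an illustrative representation, not the XCSC one):

def calculate_payment(prediction_array, gamma=0.71):
    # Discounted maximum of the current System Predictions; as footnote 29
    # notes, no reward term is added to this internal payment.
    return gamma * max(prediction_array.values())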


The use of the maximum prediction within the payoff ensures that the route to the highest rewarding state is selected. The combination of the discount and the maximum prediction allows XCS to effectively trade off distance to reward against reward magnitude.

The accuracy function is worthy of further consideration, since it will be used as the basis of selecting classifiers for use in the most important of the induction algorithms, the Genetic Algorithm. A naive accuracy function would define accuracy as a straight-line measure, but this would not distinguish well between mediocre classifiers, nor would it allow for computational inaccuracy that may prevent any classifier ever being identified as 'accurate'. Furthermore, it is useful to be able to refine what is meant by 'accurate', since in some situations a classifier which is 99% correct may be considered accurate enough, whereas in other environments it may be prudent to be more or less strict. This issue is important in cases where environmental noise is evident, preventing any rule from achieving full accuracy. The accuracy function presented in Wilson (1995) provides a logarithmic curve with a top-end cut-off to identify the point beyond which we have what is considered to be 'complete accuracy'. A low-end multiplier is employed to provide a sharp distinction between classifiers that are known to be inaccurate or not yet fully evaluated and those that are known to be accurate. The curve shown in Figure 3.2 below illustrates that this function only modifies accuracy over a small part of the error range. The accuracy range it operates over can be extended by adjusting α (and the multiplier appropriately), as shown in Figure 3.3.

The fitness is calculated as the relative accuracy of the classifier to that of others in the action set. Since the classifier may occur in many match sets, and thus in many action sets, the fitness measure is calculated using the Widrow-Hoff technique to obtain a timely estimate. The use of Relative Accuracy in calculating the fitness of a classifier ensures that the more effective competition the classifier has, the lower its fitness. If an induction algorithm introduces a more accurate classifier, the fitness of the existing classifiers within the same Action Set is reduced. As further copies of the more accurate classifier are produced, the fitness of the competitors will be further reduced until the most accurate classifier dominates each Action Set. Clearly, however, the most accurate classifiers are always the most specific classifiers. In order to encourage accurate generalisation some other mechanism must be introduced that applies pressure in favour of classifiers that appear within many action sets. This is accomplished by tying the frequency of occurrence of the classifier in the Genetic Algorithm to its generality.


[Figure: accuracy plotted against error over the range 0 to 0.06]

Figure 3.2 - The Accuracy Curve exp((ln(α)(ε−ε0))/ε0)·m, where α=0.1, m=1 and ε0=0.01, with cut-off to 1 at ε=0.01

[Figure: accuracy plotted against error for α = 0.9, 0.7, 0.5 and 0.1, each curve of the form exp(log(α)(ε−0.01)/0.01)·α]

Figure 3.3 - Varying values of α in the Accuracy calculation (without cut-off)
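The accuracy curve of Figures 3.2 and 3.3 is simple to reproduce; the following Python sketch (illustrative, not part of the specification) computes it directly from the error value:

import math

def accuracy(error, alpha=0.1, eps0=0.01, m=0.1):
    # Complete accuracy below the minimum error eps0, then an exponential
    # fall-off whose slope is set by alpha, scaled by the multiplier m.
    if error < eps0:
        return 1.0
    return math.exp(math.log(alpha) * (error - eps0) / eps0) * m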


More general classifiers will occur in [A] more often than specific classifiers. They will therefore be given more opportunity to duplicate or breed than an equally accurate specific classifier. This mechanism will proliferate the general classifier and, in combination with the deletion techniques used when new classifiers are introduced by the induction operators (see Section 3.2.4), the equally accurate but less general classifiers are gradually driven out of the population. Clearly, high-generality classifiers with low accuracy will also have a low fitness and will therefore not benefit from the larger GA opportunity, assuming approximately equal exploration rates.³⁰ It is important to note that the combination of these techniques removes any need for the introduction of strength or fitness modification 'tricks' which were required in some earlier LCS models in order to ensure good generalisation behaviour or population co-operation.

3.2.4 The Rule Induction Subsystem

The Induction Algorithms provided within XCS consist of two covering operators (a simple Create Detector Operator and an equivalent Create Effector Operator) and the Genetic Algorithm. All of these are triggered operators.

The Create Detector covering operator is invoked whenever no classifiers exist within the population that match the input message. When invoked it creates a new classifier by copying the message, replacing selected bits with wildcards at the current generality probability rate in order to generalise the message to create a condition, and adding a random action. The initial classifier values are set so that the prediction and error are the population mean values (allowing for numerosity in the calculations). The initial fitness is set to 0.1 times the mean fitness (calculated without numerosity factored in, since it is already accounted for within the fitness calculation). The Action Set Estimate is also set to the mean population ActionSetEstimate (e), the experience (x) is 0, the numerosity (n) is 1 and the GAIteration (g) is set to the current iteration.

CREATEDETECTOR () r : Classifier
ext rd P : Population
    rd t : Iterations
    rd d : Message
pre d ≠ []
post ∀i ∈ dom d ⋅ (r.c(i) = d(i) ∨ r.c(i) = '#') ∧
     ∀j ∈ dom r.a ⋅ r.a(j) ∈ {0, 1} ∧
     pre-INITIAL(r, P, t) ∧ post-INITIAL(r, P, t, P, r)

³⁰ Unfortunately an unequal exploration rate over different environmental states will bias learning experience and may still lead to the occurrence of over-general classifiers.


INITIAL (c : Classifier) r : Classifier
ext rd P : Population
    rd t : Iterations
    rd pi : PositiveReal
    rd fi : ZeroToOne
post (P = [] ∧ r = mk-Classifier (c.c, c.a, mk-Strength (pi, fi, 0.0, fi), 1, mk-Values (0, 0, t)))
   ∨ (P ≠ [] ∧ r = mk-Classifier (c.c, c.a,
         mk-Strength (average_prediction(P), average_error(P), 0.0, 0.1 × average_fitness(P)),
         1, mk-Values (average_estimate(P), 0, t)))

(where pi is the initial prediction, typically (Rmax − Rmin) / 100, and fi is the initial fitness, typically 0.01)

average_prediction : Population → PositiveReal
average_prediction (P) Δ ( Σ_{C ∈ rng(P)} C.s.p × C.n ) / ( Σ_{C ∈ rng(P)} C.n )

average_error : Population → ZeroToOne
average_error (P) Δ ( Σ_{C ∈ rng(P)} C.s.ε × C.n ) / ( Σ_{C ∈ rng(P)} C.n )

average_fitness : Population → ZeroToOne
average_fitness (P) Δ ( Σ_{C ∈ rng(P)} C.s.f ) / len(P)

average_estimate : Population → PositiveReal
average_estimate (P) Δ ( Σ_{C ∈ rng(P)} C.v.e × C.n ) / ( Σ_{C ∈ rng(P)} C.n )
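The covering behaviour is easily sketched in Python; the helper names below are illustrative assumptions rather than the operators defined above, and only the condition generalisation and the two population averages are shown:

import random

def cover_condition(message, p_hash=0.33):
    # Copy the detector message, generalising each bit to '#' at the
    # wildcard rate (cf. CREATEDETECTOR).
    return ''.join('#' if random.random() < p_hash else bit for bit in message)

def mean_prediction(population):
    # Numerosity-weighted population mean (cf. average_prediction above).
    return (sum(c.p * c.numerosity for c in population)
            / sum(c.numerosity for c in population))

def mean_fitness(population):
    # Plain mean: fitness already accounts for numerosity.
    return sum(c.f for c in population) / len(population)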


The Create Effector Operator is tested after the update operations, towards the end of each exploration episode (see Schedule 3.1). It is triggered by a decrease in the average prediction of the classifiers within the Match Set to the point where the prediction is less than the initial prediction for a newly-created classifier (pi) reduced by the prediction reduction multiplier value (pr):

CREATE_EFFECTOR_TRIGGER () r : B
ext rd M : MatchSet
    rd pi : PositiveReal
    rd pr : PositiveReal
pre M ≠ {}
post r ⇔ ( Σ_{C ∈ ∪rng(M)} C.s.p ) < pi × pr

(where pr is typically set to the value 0.5). Note that numerosity is not factored into this trigger, although no reason is given for why numerosity is not used.

When invoked, the Create Effector Operator works in a similar manner to the CDO, generalising the message to create a new condition and generating a random action. The implementation chosen for the XCS used within these tests is modified to ensure that the action created is always different from the action used within the current Action Set, so that the operator does attempt to discover more effective classifiers. Clearly these are relatively crude implementations of covering operators, and alternative solutions could be proposed (see Section 3.6.2). The initial values of the classifier created by this operator are set in the same way as within the Create Detector operator.

CREATE_EFFECTOR () r : Classifier
ext rd P : Population
    rd t : Iterations
    rd d : Message
    rd e : Action
pre d ≠ [] ∧ e ≠ []
post ∀i ∈ dom d ⋅ (r.c(i) = d(i) ∨ r.c(i) = '#') ∧
     r.a ≠ e ∧ ∀j ∈ dom r.a ⋅ r.a(j) ∈ {0, 1} ∧
     pre-INITIAL(r, P, t) ∧ post-INITIAL(r, P, t, P, r)

The Genetic Algorithm is triggered when the mean time for each classifier within the current Action Set [A] in single-step problems (or the previous Action Set [At-1] in multi-step problems) since the last occurrence of the Genetic Algorithm is greater than a GAFrequency parameter θ. This parameter is typically set to 25 occurrences [equivalent


to 12 in an alternating explore-exploit regime] because a learning rate (β) of 0.2 will tend to produce reasonable estimates within 10 updates. The parameter g is maintained within each classifier as a timestamp of the last GA occurrence. Thus, the GA trigger is:

GA_TRIGGER (A : ActionSet) r : B
ext rd θ : ℕ+
    rd t : Iterations
pre A ≠ {}
post r ⇔ ( Σ_{C ∈ A} (t − C.v.g) × C.n ) / ( Σ_{C ∈ A} C.n ) > θ
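In Python this trigger is a one-line weighted average; the attribute names below (ga_timestamp, numerosity) are illustrative assumptions:

def ga_triggered(action_set, t, theta=25):
    # Numerosity-weighted mean number of iterations since each classifier's
    # last GA involvement, compared against the GAFrequency parameter.
    waited = sum((t - c.ga_timestamp) * c.numerosity for c in action_set)
    return waited / sum(c.numerosity for c in action_set) > theta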

If the GA is triggered, the timestamp g of each classifier within [A] is set to the current iteration count.

RESET_GA_TIMESTAMP (A : ActionSet) r : ActionSet
ext rd t : Iterations
pre A ≠ {}
post card r = card A ∧ ∀c ∈ A ⋅ µ(c, v → µ(c.v, g → t)) ∈ r

The classifiers involved in each GA invocation will be only those within the relevant Action Set. This means that the GA does not explore actions using the crossover operator, but rather explores conditions. Thus, the XCS is not seeking to create correct responses, but a complete, concise mapping of state × action × payoff with maximally general accurate classifiers. Crossover searches the generality-specificity plane to find these classifiers. Conversely, mutation is performed in both the condition and action of all newly produced classifiers and can thus generate new actions or different conditions.³¹

The GA starts by selecting two 'parent' classifiers from the action set (A if a single-step problem, or At-1 if a multiple-step problem). Selection for involvement in the GA is performed using a Roulette Wheel selection technique (see Goldberg, 1989 for an

³¹ Wilson (1998) has suggested an alternative mutation operator that does not allow 0→1 or 1→0 mutations within the condition, but forces 0/1→# and #→0/1 mutations. This encourages search along generality/specificity planes. More importantly, it also encourages the movement from generalist to specialist simply by making this mutation more frequent, thereby addressing a potential problem with XCS where pressure towards generality can hinder performance in sparse environments (Lanzi, 1997). However, initial tests using this operator suggest that it hinders performance on a six-multiplexor test (Kovacs, 1998). Although Kovacs presented no hypothesis in regard to this reported finding, it could be suggested that the inability to move schema between match sets by changing a 0 to 1 or a 1 to 0 may prevent the diffusion of useful schema to other parts of the mapping. Much more investigation on this topic is needed, however.


accessible explanation of Roulette Wheel selection) over the fitness values of the classifiers within the action set being used.

random : → ZeroToOne
random () is not yet defined

actionset_to_seq : ActionSet → Classifier*
actionset_to_seq (A) Δ
    if A = {} then []
    else let c ∈ A in actionset_to_seq (A − {c}) ⌢ [c]

ROULETTEWHEEL_SELECT (A : ActionSet) r : Classifier
pre A ≠ {}
post let rand = random() × Σ_{c ∈ A} c.s.f ∧ AS = actionset_to_seq(A) in
     let AS[i] = r in
         Σ_{c ∈ AS[1,…,i]} c.s.f ≤ rand < Σ_{c ∈ AS[1,…,i+1]} c.s.f

Roulette-Wheel selection is achieved within the specification by identifying a random number between 0 and the sum of the fitness of the members of the action set, and then identifying the classifier within whose cumulative fitness interval that value falls. Thus, the specification is a fairly literal implementation of the conceptual roulette wheel: the ball falls into one of the roulette wheel segments, where each segment is sized proportionally to the fitness of the classifier it represents. The actionset_to_seq() operator is a convenience operator to derive the sequence representation required for this specification from the action set. The random() operator produces pseudo-random real values such that the result r obeys the inequality 0.0 ≤ r < 1.0.

Once two parents have been selected in this way, the likelihood of the use of Crossover is decided, if the two parents are different classifiers, using the crossover probability parameter Χ (typically set at 0.8):

CHOOSE_CROSSOVER (c1 : Classifier, c2 : Classifier) r : B
ext rd Χ : ZeroToOne
post r ⇔ (c1 ≠ c2 ∧ random() < Χ)
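The literal roulette-wheel reading of ROULETTEWHEEL_SELECT translates directly into Python; this is an illustrative sketch, not the XCSC code:

import random

def roulette_wheel_select(action_set):
    # One spin over the summed fitness; the classifier whose cumulative
    # fitness interval contains the spin is returned as a parent.
    spin = random.random() * sum(c.f for c in action_set)
    cumulative = 0.0
    for c in action_set:
        cumulative += c.f
        if spin < cumulative:
            return c
    return action_set[-1]   # guard against floating-point shortfall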


If it is decided that crossover will be used, then two child classifiers are created by a single-point crossover operator working over the child conditions. The child actions will be the same as that of both parents, since the parent actions are chosen from the same action set.

Children :: son : Classifier
            daughter : Classifier

CROSSOVER (c1, c2 : Classifier, A : ActionSet) r : Children
pre c1 ∈ A ∧ c2 ∈ A
post r.son.a = c1.a ∧ r.daughter.a = c1.a ∧
     ∃n ∈ inds c1.c ⋅
         (r.son.c(1,…,n) ⌢ r.daughter.c(n+1,…,len c1.c) = c1.c ∧
          r.daughter.c(1,…,n) ⌢ r.son.c(n+1,…,len c1.c) = c2.c)

If crossover is not used, the two child classifiers are created by duplication of the condition and action from each parent into each child:

DUPLICATE (c1, c2 : Classifier, A : ActionSet) r : Children
pre c1 ∈ A ∧ c2 ∈ A
post r.son.c = c1.c ∧ r.daughter.c = c2.c ∧ r.son.a = c1.a ∧ r.daughter.a = c2.a

The classifiers produced, whether by crossover or duplication, then undergo mutation. Mutation is applied both to the condition and the action. Mutation within the condition operates at the level of the ternary digits. The ternary digits to be changed are chosen with a uniform random distribution using the mutation probability parameter µ. The precise mutation algorithm used may vary, but it is standard practice to use a difference mutate procedure, which can be specified as follows:

mutate_trit : Trit × ZeroToOne → Trit
mutate_trit (t, p#) Δ
    if t = '#'
    then if random() < 0.5 then 0 else 1
    else if random() < p# then '#'
         else if t = 0 then 1 else 0

CROSSOVER (c1, c2 : Classifier, A : ActionSet) r : Children pre c1 ∈ A ∧ c2 ∈ A post r.son.a = c1.a ∧ r.daughter.a = c1.a ∧ ∃n ∈ inds c1.c ⋅ (r.son..c(1,…,n) ♥ r.daughter.c(n+1,…,len c1.c) = c1.c ∧ r.daughter.c(1,…,n) ♥ r.son.c(n+1,…,len c1.c) = c2.c) If crossover is not used, the two child classifiers are created by duplication of the condition and action from each parent into each child: DUPLICATE (c1, c2 : Classifier, A : ActionSet) r : Children pre c1 ∈ A ∧ c2 ∈ A post r.son.c = c1.c ∧ r.daughter.c = c2.c ∧ r.son.a = c1.a ∧ r.daughter.a = c2.a The classifiers produced, whether by crossover or duplication, then undergo mutation. Mutation is applied both to the condition and the action. Mutation within the condition operates at the level of the ternary digits. The ternary digits to be changed are chosen with a uniform random distribution using the mutation probability parameter µ. The precise mutation algorithm used may vary, but it is standard practice to use a difference mutate procedure which can be specified as follows: mutate_trit: Trit × ZeroToOne → Trit mutate_trit (t, p#) ∧ if t = then if random() < 0.5 then else else if random() < p# then else if t = then else

94

(where p# is the wildcard probability. It is reported that Wilson uses two p# values, one for initial conditions, used in the Cover operators, and one for the GA. Often these are set to the same value.) The effect of this operation is to ensure that the mutated ternary digit contains a value that is different from the starting value whilst conforming to the required frequency of wildcard generation. It is now possible to use this operation to formally specify the mutation of a single classifier as follows:

MUTATE_CONDITION (c : Condition, p(µ) : ZeroToOne) r : Condition
pre c ≠ []
post ∀i ∈ inds c ⋅ post-MUTATE_TRIT(c(i), p(µ), r(i))

MUTATE_ACTION (a : Action, p(µ) : ZeroToOne) r : Action
pre a ≠ []
post let rand = random() in (rand < p(µ)) ∨ (rand ≥ p(µ) ∧ r = a)

MUTATE_CLASSIFIER (c : Classifier, p(µ) : ZeroToOne) r : Classifier
pre pre-MUTATE_CONDITION(c.c, p(µ)) ∧ pre-MUTATE_ACTION(c.a, p(µ))
post let post-MUTATE_CONDITION(c.c, p(µ), cond) ∧ post-MUTATE_ACTION(c.a, p(µ), act) in
     r = mk-Classifier (cond, act, c.s, c.n, c.v)
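A minimal Python sketch of the difference mutation of a condition, assuming a string representation of the ternary alphabet (the function names are illustrative only):

import random

def mutate_trit(trit, p_hash):
    # Difference mutation: the result always differs from the input trit,
    # with wildcards generated at the rate p_hash.
    if trit == '#':
        return '0' if random.random() < 0.5 else '1'
    if random.random() < p_hash:
        return '#'
    return '1' if trit == '0' else '0'

def mutate_condition(condition, mu=0.02, p_hash=0.33):
    # Each trit is selected for mutation independently with probability mu.
    return ''.join(mutate_trit(t, p_hash) if random.random() < mu else t
                   for t in condition)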

The specification of action mutation has been deliberately left as general as possible. This is because it is possible to represent the XCS action in a number of different forms. Wilson and Kovacs both use a single bounded integer value as the action, in which case an implementation of MUTATE_ACTION would simply select a new random value. This implementation, and that of Lanzi, implements the action as a bit sequence, and in this case the mutation of the action can be implemented as a per-bit mutation. Either implementation will satisfy the specification provided.

Once mutation has taken place, the strengths of the child classifiers are set. The values they are set to depend upon whether crossover has taken place or not (indicated by the boolean parameter X in the following specification):

SET_CHILD_STRENGTH (c1 : Classifier, c2 : Classifier, n : Children, X : B) r : Children
ext rd P : Population
    rd t : Iterations
post


let v = mk-Values(average_estimate(P), 0, t) in
    (X ∧ let s = mk-Strength(mean_prediction(c1, c2), average_error(P) × 0.25, fi, average_fitness(P) × 0.1) in
         r.son = mk-Classifier (n.son.c, n.son.a, s, 1, v) ∧
         r.daughter = mk-Classifier (n.daughter.c, n.daughter.a, s, 1, v))
  ∨ (¬X ∧ let s1 = mk-Strength(c1.s.p, average_error(P) × 0.25, fi, average_fitness(P) × 0.1) in
         r.son = mk-Classifier (n.son.c, n.son.a, s1, 1, v) ∧
         let s2 = mk-Strength(c2.s.p, average_error(P) × 0.25, fi, average_fitness(P) × 0.1) in
         r.daughter = mk-Classifier (n.daughter.c, n.daughter.a, s2, 1, v))

mean_prediction : Classifier × Classifier → PositiveReal
mean_prediction (c1, c2) Δ (c1.s.p + c2.s.p) / 2

Each of these triggered operators must place the classifiers that they create into the population. In order to do this, another classifier will have to be deleted if the population is already full. In order to ensure that sufficient space is allocated to all Action Sets, a deletion technique is used that seeks to maintain all Action Sets at approximately the same size. Whilst a number of such techniques have been proposed, the one adopted as standard is the simplest. In this technique the Action Set Estimate (e), suitably modified to account for the numerosity, is the value that is selected over using Roulette Wheel Selection:

DELETE_SELECT () r : Classifier
ext rd P : Population
pre P ≠ {}
post let rand = random() × Σ_{c ∈ elems P} (c.v.e × c.n) ∧ P[i] = r in
     Σ_{c ∈ P[1,…,i]} (c.v.e × c.n) ≤ rand < Σ_{c ∈ P[1,…,i+1]} (c.v.e × c.n)

Since the Action Set Estimate reflects the number of micro-classifiers occurring in all the action sets that a classifier occurs within, the larger action sets will be more likely to have a classifier selected for deletion. Once a classifier is selected for deletion, the numerosity of the selected classifier is decremented, and if the numerosity has become zero the classifier is removed from the population.

DELETE (s : Classifier)
ext wr P : Population
pre s ∈ P


post let P~[i] = s in
     (P~[i].n > 1 ∧ P = P~[1,…,i-1] ⌢ [µ(P~[i], n → P~[i].n − 1)] ⌢ P~[i+1,…,len P~])
   ∨ (P~[i].n = 1 ∧ P = P~[1,…,i-1] ⌢ P~[i+1,…,len P~])
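The deletion pair DELETE_SELECT/DELETE is easily sketched in Python; the attribute names (as_estimate, numerosity) are illustrative assumptions:

import random

def delete_select(population):
    # Roulette wheel over numerosity-weighted action set size estimates:
    # members of large action sets are more likely to be chosen.
    spin = random.random() * sum(c.as_estimate * c.numerosity for c in population)
    cumulative = 0.0
    for c in population:
        cumulative += c.as_estimate * c.numerosity
        if spin < cumulative:
            return c
    return population[-1]

def delete_micro(population, victim):
    # Remove a single micro-classifier: decrement the numerosity and drop
    # the macro-classifier once its numerosity reaches zero.
    victim.numerosity -= 1
    if victim.numerosity == 0:
        population.remove(victim)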

The net effect is to create one free micro-classifier location within the population for the new classifier. The mechanism for making space for a new classifier can thus be specified as:

ALLOCATE_SPACE ()
ext wr P : Population
    rd Pmax : ℕ+
post let s = Σ_{c ∈ elems P~} c.n in
     (s < Pmax ∧ P = P~)
   ∨ (s = Pmax ∧ let pre-DELETE_SELECT(P~) ∧ post-DELETE_SELECT(P~, c) in
                 pre-DELETE(c, P~) ∧ post-DELETE(c, P~, P))

A new classifier may be a duplicate of an existing classifier, or may already be covered by a more general classifier that has already been evaluated as accurate. In the case of covering by the Create Detector Operator there will, by definition, be no duplicate or covering classifier, and so the new classifier is simply inserted into the population. In the case of the Create Effector Operator the new classifier is compared with all members of the population, and if a duplicate is found the numerosity and Action Set Estimate of the duplicate member already within the population are increased by one (none of the other values it holds are modified) and the new classifier is discarded.

COVERING_INSERT (s : Classifier)
ext wr P : Population
    rd Pmax : ℕ+
pre Σ_{c ∈ elems P} c.n < Pmax
post ((∃i ∈ inds P~ ⋅ P~(i).c = s.c ∧ P~(i).a = s.a) ∧
      P = P~(1,…,i-1) ⌢ [µ(P~(i), n → P~(i).n + 1)] ⌢ P~(i+1,…,len P~))
   ∨ ((∀cl ∈ P~ ⋅ cl.c ≠ s.c ∨ cl.a ≠ s.a) ∧ s ∈ P ∧ elems P − {s} = elems P~)

This specification completes the operations required for the Create Effector Operator, and so the full operation of the operator is captured in Schedule 3.4 below:


1. CREATE_EFFECTOR_TRIGGER( ) - Check to see whether the sum of the predictions of classifiers within the current match set has reduced beyond a set value, signifying that the effector covering operator should be triggered. If the effector covering operator is triggered:

2. CREATE_EFFECTOR( ) - Create a new effector classifier using a generalisation of the current message for its condition and selecting an action different from the currently selected action.

3. ALLOCATE_SPACE( ) - Ensure that there is space in the population for the new classifier by deleting an existing micro-classifier if the population is full. Deletion is performed using roulette wheel selection over the estimate of the action set size, thereby contributing to the provision of action set niching.

4. COVERING_INSERT( ) - Insert the new classifier into the population, discarding it and incrementing the numerosity of any existing duplicate classifier if a duplicate exists.

Schedule 3.4 - The operation of the Create Effector Operator

Insertion of new classifiers in the case of the GA requires less search, because the parents and the action set are searched first. It is probable that a child may be a duplicate of at least one of the parents. Therefore the child is compared firstly with its parents, and if it is a duplicate then the numerosity (and the Action Set Estimate) of the parent is incremented and the child is discarded. If the child differs from its parents, it is then compared with the members of the Action Set (since it is probable that it has a compatible condition and the same action - see earlier). This time the search is for classifiers that are experienced, accurate and more general than the new classifier, and which therefore "subsume" the new classifier. Clearly the introduction of a classifier that covers a state × action → payoff mapping that is already represented by an accurate, more general classifier would be a superfluous specialisation of the existing classifier.


Therefore, if this is the case, the new classifier is discarded and the numerosity and Action Set Estimate of the subsuming classifier are incremented. This only occurs where the existing classifier is experienced - that is, where its experience parameter (x), which documents the number of Action Set updates it has undergone, is greater than a pre-set value called the 'Subsumption Threshold' (denoted by s and typically 20).

subsumes : ℕ+ × Classifier × Classifier → B
subsumes (s, c1, c2) Δ
    c1.v.x > s ∧ c1.s.κ = 1.0 ∧ c1.a = c2.a ∧
    ∀i ∈ inds c1.c ⋅ (c1.c(i) = '#' ∨ c1.c(i) = c2.c(i))

Before this point it is assumed that the classifier may become accurate, but there is not yet sufficient information, and therefore a competing classifier should be allowed. Since more general classifiers will occur in more Action Sets, they will be updated more regularly and will therefore have more breeding opportunities. Therefore more accurate general classifiers will be proliferated whilst less general accurate classifiers will gradually diminish. If there is no subsumption within the Action Set, each new classifier is inserted into the population by means of the same population-wide duplicate detection mechanism (COVERING_INSERT) used by the Create Effector Operator. The following specification formally describes this subsumption mechanism:

CHILD_INSERT (c : Classifier, p1 : Classifier, p2 : Classifier, A : ActionSet)
ext wr P : Population
    rd Pmax : ℕ+
pre Σ_{cl ∈ elems P} cl.n < Pmax
post (subsumes (s, p1, c) ∧ let P~(i) = p1 in
          P = P~(1,…,i-1) ⌢ [µ(P~(i), n → P~(i).n + 1)] ⌢ P~(i+1,…,len P~))
   ∨ (¬subsumes (s, p1, c) ∧ subsumes (s, p2, c) ∧ let P~(j) = p2 in
          P = P~(1,…,j-1) ⌢ [µ(P~(j), n → P~(j).n + 1)] ⌢ P~(j+1,…,len P~))
   ∨ (¬subsumes (s, p1, c) ∧ ¬subsumes (s, p2, c) ∧
      ((∃cl ∈ A ⋅ subsumes (s, cl, c) ∧ let P~(k) = cl in
          P = P~(1,…,k-1) ⌢ [µ(P~(k), n → P~(k).n + 1)] ⌢ P~(k+1,…,len P~))
    ∨ (¬∃cl ∈ A ⋅ subsumes (s, cl, c) ∧
          pre-COVERING_INSERT(c, Pmax, P~) ∧ post-COVERING_INSERT(c, Pmax, P~, P))))
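The subsumes predicate translates almost line-for-line into Python; the attribute names below are illustrative assumptions:

def subsumes(general, specific, threshold=20):
    # An experienced, fully accurate classifier subsumes a classifier with
    # the same action whose condition it covers position by position.
    return (general.experience > threshold
            and general.kappa == 1.0
            and general.action == specific.action
            and all(g == '#' or g == s
                    for g, s in zip(general.condition, specific.condition)))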


Having identified the deletion and subsumption mechanisms, it is now possible to fully identify the operation of the GA as follows. Using A if in single-step mode, or At-1 if in multiple-step mode:

1. GA_TRIGGER( ) - If the numerosity-weighted average of the number of iterations since the last GA for the classifiers within the action set is greater than the GAFrequency parameter θ, the Genetic Algorithm is triggered and RESET_GA_TIMESTAMP() is used.

If the Genetic Algorithm is triggered:

2. ROULETTEWHEEL_SELECT( ) - Select two parents from [A] using Roulette Wheel Selection over the fitness values of the classifiers within [A]. The parents are selected without removal, so that the same classifier may be chosen twice to allow the duplication of classifiers.

3. CHOOSE_CROSSOVER( ) - If the parents differ, choose whether to perform crossover using the crossover probability (Χ).

4. CROSSOVER( ) or DUPLICATE( ) - If crossover is to be performed, use single-point crossover within the condition to create two new children; otherwise copy the parents to the children.

5. MUTATE_CLASSIFIER( ) - Perform mutation on both the condition and action of each child.

6. SET_CHILD_STRENGTH( ) - Set the initial prediction of the child classifiers to the mean parental prediction if crossover was used, or to the prediction of the parent each classifier was copied from if crossover was not used. The child classifier's error is set to one quarter of the population mean error and the classifier's fitness to one tenth of the population mean fitness - numerosity is not included in the latter computation, since fitness is a relative measure that already allows for numerosity. The initial action set estimate is set to the population mean, the initial numerosity is set to 1, and the initial experience count is set to zero.

7. ALLOCATE_SPACE( ), CHILD_INSERT( ) - For each new child, possibly delete an existing classifier to create population space, and then insert the new classifiers into the population using action set subsumption.

Schedule 3.5 - The operation of the Genetic Algorithm in the XCS

3.2.5 Parameterisation

Throughout the provision of the specifications various user-definable parameters have been introduced. The following table captures these together with an explanation of the purpose of each parameter:

Parameter           | Parameter Name and Explanation                                                                                                    | Typical value
pi : PositiveReal   | Initial prediction - the prediction value used for classifiers created when no other classifiers exist.                          | 10.0
fi : ZeroToOne      | Initial fitness - the fitness and error value used for classifiers created when no other classifiers exist.                      | 0.01
Rmin : PositiveReal | Minimum Reward - the reward value given for the worst performance.                                                               | 0.0
Rmax : PositiveReal | Maximum Reward - the reward value given for the best performance.                                                                | 1000.0
Pmax : ℕ+           | Maximum population size - the maximum number of micro-classifiers allowed.                                                       | -
θ : ℕ+              | GA Frequency - the average number of iterations since the last GA before the GA is once more invoked; used in the GA Trigger.    | 25
Χ : ZeroToOne       | Crossover probability - the probability that crossover is used in the GA whenever the selected parents are not the same classifier. | 0.8
µ : ZeroToOne       | Mutation probability - the bit-wise probability of mutation changing the value of a bit.                                         | 0.02
p(#) : ZeroToOne    | Generality - the probability that a generated ternary trit value will be a wildcard.                                             | 0.33
β : ZeroToOne       | Learning rate - the rate of modification of a classifier value within MAM update.                                                | 0.2
γ : ZeroToOne       | Discount Factor - the reduction in payoff value given from the current action set to the classifiers within the previous action set. | 0.71
ε0 : ZeroToOne      | Minimum Error - the error value below which a classifier will always be given an accuracy of 1.0.                                | 0.01
α : ZeroToOne       | Fall-off Factor - the parameter controlling the slope of the logarithmic curve of the accuracy calculation.                      | 0.1
m : ZeroToOne       | Accuracy multiplier - a modifier to lift the accuracy curve to a zero baseline.                                                  | 0.1
pr : PositiveReal   | Prediction reduction multiplier - the reduction in the initial prediction parameter before the result is tested in the Create Effector trigger as the cut-off value for invoking effector covering. | 0.5
s : ℕ+              | Subsumption experience - the experience required before an accurate classifier can subsume another classifier.                   | 25

Table 3.1 - Parameters of the XCS Specification

Although these are the parameters used within the specification, other parameters are required for an implementation, such as the maximum number of iterations that will be run (in single-step mode), or the maximum number of trials and maximum number of


iterations per trial (in multiple-step mode). These will be seen within the parameterisation given alongside the implementation tests.
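For reference, the parameters of Table 3.1 can be collected into a single configuration record, as in the following illustrative Python sketch (the field names are assumptions, not the XCSC names, and Pmax is given the six-multiplexor value from the experiments below since the table assigns it no general default):

from dataclasses import dataclass

@dataclass
class XCSParams:
    # Typical values from Table 3.1.
    p_i: float = 10.0      # initial prediction
    f_i: float = 0.01      # initial fitness (and error)
    r_min: float = 0.0     # minimum reward
    r_max: float = 1000.0  # maximum reward
    p_max: int = 400       # maximum micro-classifier population (problem-specific)
    theta: int = 25        # GA frequency
    chi: float = 0.8       # crossover probability
    mu: float = 0.02       # mutation probability
    p_hash: float = 0.33   # wildcard (generality) probability
    beta: float = 0.2      # learning rate
    gamma: float = 0.71    # discount factor
    eps0: float = 0.01     # minimum error
    alpha: float = 0.1     # fall-off factor
    m: float = 0.1         # accuracy multiplier
    p_r: float = 0.5       # prediction reduction multiplier
    s: int = 25            # subsumption experience threshold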

3.2.6 Summary

A formal specification of XCS has been provided, beginning with the structure of XCS (the minimal data structures required for the specification and subsequent implementation of XCS). This was followed by a specification of the operation of XCS, divided into the Performance Subsystem, the Credit Allocation Subsystem, and the Rule Induction Subsystem to complement the earlier, less formal identification of the Canonical LCS by the same decomposition. Whilst key operations and data structures were formally specified, the integration of these operations was specified less formally by means of Schedules of operations, due to the sequencing requirements of these operations, which are rather obscured by the pre- and post-condition quotation method utilised within VDM. Whilst the syntactic and structural semantic integrity of these specifications has been checked using SpecBox, the accuracy of the specification is less easy to determine. This has partially been achieved by checking the XCSC program (Section 3.3) informally against the specification, but this creates a circularity issue that is far from ideal. Nonetheless, this specification provides much more precise detail than is available elsewhere and more than meets the requirements of the current research project.

3.3 Implementation

As is the case for many other LCS implementations (Holland's CS-1, Wilson's Animat), no public domain version of XCS existed, although a number of researchers were in the process of producing their own XCS implementations that were not planned to be released into the public domain. In order to obtain a version of XCS for this study, an implementation of XCS was produced. It was written from the start with the intent not simply of satisfying this project, but also of providing a simple-to-understand and simple-to-use public domain implementation that could become a benchmark tool.

It was found, upon embarking on the implementation, that the detail provided by Wilson (1995, 1996) was ambiguous at points and contradictory in other places. Kovacs (1996) provided a more succinct and detailed reference guide, but a number of details remained outstanding. As a result, a 'virtual conference' was started using email to seek to determine definitive answers to the issues raised. This resulted in the clarification of


detail now captured within the formal specification of XCS (section 3.2) and the identification of a number of issues ripe for further research (see Sections 3.5 and 3.6). As a direct result of this collaborative work, the implementation of XCS developed for this project has been accepted as a 'standard core XCS' for the purposes of subsequent XCS research work, and is available in the public domain. The code implements XCS as described in this chapter, which is close to the XCS described by Wilson (1996) and used by both Kovacs (1996) and Lanzi (1997).

3.4 Replication of XCS Experiments

In order to demonstrate the equivalence of the performance of the XCS implementation to previously published results (Wilson, 1995, 1996; Kovacs, 1996), the following experiments were used:

1. Six Multiplexor
2. Eleven Multiplexor
3. Woods 2

The Multiplexor problem has been used extensively by Wilson (1987, 1994, 1995, 1996) through each of his LCS implementations as a benchmark single-step problem that tests the generalisation ability of the LCS. It was also adopted by Goldberg (1989) to demonstrate the development of Default Hierarchies within SCS, and multiplexor problems have been used as test beds within Neural Network research (e.g. Anderson, 1986) and to demonstrate the advantages of niche GAs (Booker, 1989). Kovacs (1996) provides considerable detail on the results of using these functions with XCS that can be used to provide ready performance comparison.

These problems are scalable, non-trivial logic function learning problems. A multiplexor takes in an 'address' in binary on a small number of input lines and returns the value of the bit on whichever line is addressed from the remaining input lines. For example, the six multiplexor provides two address lines addressing 2² = 4 data lines, whilst the eleven multiplexor provides three address lines addressing 2³ = 8 data lines. The problem scales by increasing the k address lines and 2^k data lines in sympathy.


This problem can be readily encoded onto the XCS bit strings by using the lowest k bits of the condition as the address and the highest 2^k bits of the condition as the data lines. A single-bit action specifies the result. The classifier is correct if the bit value of the data bit identified by the address bits is the same as the bit value of the action, and incorrect otherwise. This is represented for the six multiplexor by the logical disjunctive function:

F6MUX = ¬x0 ¬x1 x2 ∨ ¬x0 x1 x3 ∨ x0 ¬x1 x4 ∨ x0 x1 x5

Clearly there are 2^k address configurations and 2 bit values that can occur in each data bit, giving 2 × 2^k correct classifiers to map this problem for correct answers only. When XCS is considered, accuracy is the concern, and since a classifier can be accurately correct in predicting a poor reward as well as accurately correct in predicting a high reward, for a two-reward-level multiplexor environment of order k there must be 2 × 2 × 2^k accurate classifiers. Therefore in the six multiplexor there will be 16 classifiers that are maximally general and accurate, of which 8 are "correct", out of a search space of 128. For the 11 multiplexor there are 24 accurate maximally general classifiers (12 "correct") out of a search space of 4096. The accurate classifiers for the six multiplexor are given in Table 3.2:

Correct     Incorrect
###000→0    ###000→1
###100→1    ###100→0
##0#01→0    ##0#01→1
##1#01→1    ##1#01→0
#0##10→0    #0##10→1
#1##10→1    #1##10→0
0###11→0    0###11→1
1###11→1    1###11→0

Table 3.2 - Members of [O] for the 6 Multiplexor Problem
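The environment itself is only a few lines of code. The following Python sketch is illustrative (not the test harness used here); the bit ordering is an assumption chosen to be consistent with the conditions of Table 3.2, with the two rightmost characters acting as the address:

import random

def six_multiplexer(bits):
    # bits: a six-character binary string; the rightmost two characters are
    # the address, selecting one of the four data bits.
    address = int(bits[4:6], 2)
    return int(bits[3 - address])

def sample_trial():
    # Generate a random message and its correct answer.
    message = ''.join(random.choice('01') for _ in range(6))
    return message, six_multiplexer(message)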

Kovacs (1996) identifies these classifiers as the 'Optimal Population' [O], which he defines as follows:


"We will refer to the maximally general classifiers for each payoff level in the payoff environment as [O]. On its own, [O] is the optimum population for a problem; it is the smallest possible population of classifiers to form an accurate covering of the input space. Once the system has evolved a population which includes [O] ... new classifiers are superfluous and render [P] non-optimal." From this he generates the 'Optimality Hypothesis', which states that: "for each payoff level satisfying the criterion of sampling sufficiency, the GA will, in the long run, evolve a maximally general classifier which will have a greater numerosity than any other classifier in the payoff level. When all payoff levels satisfy the criterion of sampling sufficiency, a complete set of maximally general classifiers [O] will, in the long run, evolve within the population [P]. Further, elements of [O] will be distinguishable from the rest of [P] on the basis of their numerosity, so that, once [O] is complete, [P] may be reduced to [O] by removing all classifiers but those with the highest numerosity in their payoff level." Usefully, knowing the optimal classifiers within the multiplexor problems, Kovacs (1997) has demonstrated that the percentage discovery of [O] can be measured and that this measure can be used to give a good insight into the performance of the classifier system within this test environment. The Woods-2 test is one of a series of test environments created by Wilson (1987) in order to evaluate his Animat LCS. These environments range from the deterministic fully observable Class 1 (Wilson, 1991) environment Woods1 through the probabilistically varied Woods2 Class 1 environment to Woods7 (Wilson, 1994) that is arbitrarily laid out and onto the more complex Woods100 series of environments such as Woods101 and Woods102 (Cliff & Ross, 1994) which are non-Markovian or Class 2 environments. They are all grid-based environments where an 'Animat' is placed on an open grid position and allowed to move in the eight major compass directions. An attempt to move into a grid position containing a Tree 'T' (hence the name 'Woods..', although Wilson (1994, 1995, 1998) uses 'rocks' depicted as 'O' or 'Q') will result in the Animat remaining in the original position. An attempt to move onto a position containing food 'F' will result in movement to that position, a reward given, and the start of a new training episode that will once again find a new random non-occupied position


to start from (all 'food' is reinstated on each new trial). The position of the Trees and Food in Woods1 is fixed as shown within Figure 3.4, with this pattern extended over a 30×15 area by repetition five times in the X plane and three times in the Y plane (in any case, the environment 'wraps around' at its extremes, so the precise extent is irrelevant in Woods1).

. . . . .
. T T F .
. T T T .
. T T T .
. . . . .

Figure 3.4 - The Woods-1 Environment; T = positions that cannot be entered, F = reward positions.

Woods2 provides two kinds of trees and two kinds of food, with the layout of trees and food kept in the grid pattern shown above but the type of tree or food generated randomly at each relevant position. This environment provides generalisation opportunities for the classifiers that are not present in Woods1 and increases the length of the conditions required on the classifiers, but is otherwise similar to Woods1. This test allows the comparison of the performance of the XCS with Wilson's results for multi-step environments. However, it is worth noting that the results for the multi-step environment presented by Wilson (1995, 1996) lack the degree of detail and rigour that Kovacs (1996, 1997) has given to the multiplexor tests. Therefore further experiments are presented within Chapter 4 that seek to characterise the multi-step behaviour of XCS more precisely.

3.4.1 Learning in Single Step Environments

Kovacs (1996) presents a detailed analysis of the operation of an XCS similar to that developed for this project on both the six and eleven multiplexor problems. For these experiments he did not concentrate upon proving that the XCS could cope with multiple payoff landscapes, as Wilson (1995) had done, but upon showing that XCS would form complete, concise, accurate mappings of state × action. He therefore made use of only two payoff levels, but added two new measures that plotted the ability of the XCS to discover [O].


Kovacs' approach is entirely suitable for the requirements of this experiment and so will be adopted.

3.4.1.1 The Six Multiplexer

The Test Environment

A six multiplexer environment was constructed that encoded the two address bits as the low bits of a six-bit condition, with the four data bits as the high bits. A single action bit was provided for the predicted result. On each iteration the environment generated input messages by randomly setting all six message bits, and computed the expected result from the generated message. The message was sent as input to the XCS on request and an action was received back from the XCS. The action received was compared with the expected result, and if correct a reward of 1000 (see Wilson, 1995) was returned. If incorrect, a reward of 0 was returned.

XCS Configuration

The XCS was configured to provide 6-bit input messages with classifiers of length-6 condition and length-1 action. Initial parameterisation was as Wilson (1995), namely:

N (population size)            400
Pi (initial population size)   0
β (learning rate)              0.2
θ (GA experience)              25
ε0 (minimum error)             0.01
α (fall-off rate)              0.1
Χ (crossover probability)      0.8
µ (mutation probability)       0.04
pr (covering multiplier)       0.5
P(#) (generality proportion)   0.33
pi (initial prediction)        10.0
εI (initial error)             0.0
fi (initial fitness)           0.01

Table 3.3 - Parameterisation for XCS within the six-multiplexor test

Other parameters required but not noted in Wilson (1995) were set in accordance with those of Kovacs (1996):

P(x) (exploration probability)   0.5
fr (fitness reduction)           Not Used
m (accuracy multiplier)          0.1
s (Subsumption threshold)        20

Table 3.4 - Additional parameterisation for XCS within the six-multiplexor test


Method

The XCS was run for 10 runs (see below for a rationale for the number of runs used), each of 5000 explorations. The performance of the XCS was tested after each exploration by running the XCS in pure exploit mode with no value updates. Performance was captured using four measures after Wilson (1995) and Kovacs (1996):

• the Performance: the fraction of the last 50 exploit iterations which produced actions that were correct for the given input message;
• the System Error: the absolute difference between the accumulated System Prediction for the chosen action and the actual payoff, divided by the maximum payoff to give a proportional figure;
• the Population Size: the number of macro-classifiers within the population;
• the size of [O]: the number of distinct macro-classifiers that are members of [O] within [P], expressed as a ratio of the maximum size of [O] for the problem.

The first three of the above are averaged over the previous 50 updates and are the measures adopted by Wilson; they demonstrate (respectively):

• the ability of the XCS to respond correctly to an input, thereby demonstrating learning of the problem;
• the accuracy of the payoff level prediction that the XCS produces, which will be completely correct only when no competing classifier exists for any member of [O] (see Kovacs, 1996);
• the number of classifiers required to produce the current level of performance, also reflecting the degree to which general accurate classifiers have been discovered or, to put it another way, the conciseness of the classifier model.

Kovacs (1996) argues convincingly that these measures do not give sufficient insight into the sufficiency of the classifier population or the establishment of [O], and the new problem-specific measure of [O] is designed to address this deficit.

Two approaches were used - the average for each of the above metrics over ten runs, which will give results that will be comparable to those of Wilson (1996) and Kovacs


(1996), and the average of ten runs repeated ten times, so that the results can be compared for variability in order to demonstrate that, even in the presence of random processes, the operation of the XCS is consistent in this task. No raw results are available from the work of Wilson or Kovacs, but access has been obtained to detailed metrics on the operation of Kovacs' XCS implementation. To conclude this section, similar metrics on the operation of this XCS implementation are gathered and compared to demonstrate further the comparability of the implementations.

Results

The results of averaging 10 runs of XCS on the Six Multiplexor problem are shown in Figure 3.5a. This corresponds closely to the final experimental results given in Wilson (1996), figure 6, and Kovacs (1996), figure 5, although the reader should note that the population size here is plotted as a proportion of the maximum population size rather than divided by 1000, which appears a rather arbitrary divisor. Figure 3.5b details the performance of XCS over a longer period of iterations, demonstrating that the XCS remains stable once the optimal population is learnt, and corresponding closely to figure 7 of Kovacs (1996).

To verify that XCS was running predictably, the results of 10 test runs, each consisting of 10 runs of the XCS on the six multiplexor problem with their averaged results, were obtained. The results were then sampled at 500-exploitation-iteration intervals, so that representative slices were taken through the data that preserved the normality requirements of the ANOVA test. A single-factor ANOVA test was applied to these sampled results to identify whether the sampled values from each set of 10 runs could be said to have derived from the same population. An F value of 0.107523 indicated that the sampled averaged runs showed no significant difference in their means (Fcrit = 1.929426 at the 0.05 level). Indeed, even an ANOVA test taking the samples from the first 2500 exploitation iterations, before the performance had levelled out, showed no significant difference in the means (F = 0.137908, Fcrit = 1.985594 at the 0.05 level). Figure 3.6 provides the graph of the one hundred averaged runs, showing clearly the degree of similarity in the performance of XCS throughout the runs.

3.4.1.2 The Eleven Multiplexer

The Test Environment

The harder 11 multiplexor test was then run in the same manner as the six multiplexor. The parameterisation for this test remained the same as noted above, but with a


condition and message size of 11 and a population limit (N) of 800 micro-classifiers. Figure 3.7 shows the result of averaging 10 runs of this test.


[Figure: Performance, System Error, Population Size and proportion of [O] plotted against exploitations, (a) over 0-5000 and (b) over 0-15000]

Figure 3.5 - Average of 10 runs of XCS in Six Multiplexor Test over (a) 5000 iterations and (b) 15000 iterations, showing the stability of [O].


[Plot: Performance against Exploitation Iteration for the ten averaged sets of runs (1 to 10, 11 to 20, ..., 91 to 100).]

Figure 3.6 - The averages of ten runs, each of ten runs of XCS in the Six Multiplexer Test, showing a significant conformity of performance.

[Plot: Performance, Error, Population and [O], as proportions, against Exploitations.]

Figure 3.7 - XCS 11-multiplexor over 15,000 exploitations


Once again, the results produced are comparable with those shown by Wilson (1996) figure 7 and Kovacs (1996) figure 6, allowing for the difference in calculating the proportional population size. In particular it is worth noting that the curves for performance and system error are very close to those reported by Kovacs, with performance reaching its peak at a similar position, system error reducing to its minimum at a similar point, and both curves showing very similar shapes. A further comparison with Kovacs (1996) figure 9, which shows the plot of [O], demonstrates that the rate at which this implementation of XCS discovers [O] is comparable: Kovacs' XCS discovered [O] at approximately 12000 iterations, and this XCS had reached approximately 98% of [O] at the same stage.

Discussion

The experimental results presented demonstrate that the XCS implementation developed performs comparably to those of Wilson (1996) and Kovacs (1996) on the Multiplexor tests for k of order 2 and 3. Unfortunately the non-availability of raw results from Wilson or Kovacs precludes statistical confirmation of the case. In lieu of this detail, personal communication with Kovacs allowed a comparison of a number of algorithmic aspects of his Pop-11 implementation with this C implementation, presented below. This cannot give as formal a comparison as a statistical test, but it does indicate a high degree of algorithmic conformity, even allowing for the variation inherent in XCS due to its use of random selection. The differences present can be attributed to a difference in the explore/exploit mode selection used by the XCSC version at the time.

                       XCSC        Pop-11 XCS
Covering Detectors     501         555
Covering Effectors     0           0
GA Events              89,302      91,699
Total micro [A]        8,180,252   7,261,149
Total macro [A]        1,851,124   2,052,651
Total [A]'s            300,000     300,000
Average micro in [A]   27.29       24.2
Average macro in [A]   6.18        6.84

Table 3.5 Algorithmic comparison of two XCS implementations. All figures are summed over 30 runs of the six multiplexor implemented as described in this chapter.32

32 The author would like to thank Tim Kovacs for conceiving this comparison and providing the comparative data for his Pop-11 implementation. The author also gratefully acknowledges the help afforded by both Tim Kovacs and Stewart Wilson in obtaining the information and advice needed in order to complete the XCS implementation.


3.4.2 Learning in Multi-Step Environments

The Test Environment

The Woods2 environment was constructed as described in section 3.4. The inputs from the environment consist of the encoded contents of the eight grid positions surrounding the Animat (wrapped around where appropriate). The encoding for each grid position is:

Symbol   Meaning       Encoding
.        Blank Cell    000
O        Rock Form 1   010
Q        Rock Form 2   011
F        Food Form 1   110
G        Food Form 2   111

Table 3.6 The input encoding for the Woods-2 environment

The eight grid positions are encoded in a 24 bit condition so that bits 0..2 represent the grid position North of the Animat, bits 3..5 represent the grid position Northeast of the Animat, and so forth proceeding clockwise. The action is encoded in three bits using a Grey Code as shown in table 3.7.

Action       Encoding
Move North   000
Move NE      001
Move East    011
Move SE      010
Move South   110
Move SW      100
Move West    101
Move NW      111

Table 3.7 The action grey-encoding for the Woods-2 environment

The use of a Grey Code ensures that mutation will tend to cause an action change to an adjacent action. It is worth noting that Wilson (1995) formulated this version of the Woods problem in order to demonstrate the generalisation ability of XCS in addition to the ability to learn. Bit 2 of each 3-bit grid position encoding is therefore of no predictive use, since it distinguishes neither a food position from a rock nor a blank from a non-blank position. In contrast, bit 0 of each 3-bit grid position encoding is sufficient to distinguish a food grid from a non-food grid, whilst bit 1 distinguishes a blank from a non-blank position. Therefore XCS would be expected to generalise over bit 2 but make use of bits 0 and 1.
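To make the encoding concrete, the following minimal sketch assembles the 24-bit input from the eight surrounding cells and shows the significance of each bit position (the helper names are illustrative, not drawn from the implementation):

# Per-cell encoding from Table 3.6; cells are listed clockwise from North.
CELL_CODE = {'.': '000', 'O': '010', 'Q': '011', 'F': '110', 'G': '111'}

def encode_percept(surrounding):
    """surrounding: eight cell symbols clockwise from North -> 24-bit string."""
    return ''.join(CELL_CODE[c] for c in surrounding)

def is_food(cell_bits):
    return cell_bits[0] == '1'    # bit 0 separates food from non-food

def is_blank(cell_bits):
    return cell_bits[1] == '0'    # bit 1 separates blank from non-blank;
                                  # bit 2 only separates the two forms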

The parameter settings used for this investigation are those of Wilson (1995), which are:

Parameter                        Value
N (population size)              800
Pi (initial population size)     0
γ (discount rate)                0.71
β (learning rate)                0.2
θ (GA experience)                25
ε0 (minimum error)               0.01
α (fall-off rate)                0.1
χ (crossover probability)        0.8
µ (mutation probability)         0.01
pr (covering multiplier)         0.5
P(#) (generality proportion)     0.5
pi (initial prediction)          10.0
εi (initial error)               0.0
P(x) (exploration probability)   0.5
fr (fitness reduction)           Not Used
m (accuracy multiplier)          0.1
s (subsumption threshold)        20
fi (initial fitness)             0.01

Table 3.8 Parameterisation used for the Woods-2 experiments

The notable changes from the Multiplexor experiments are the setting of the maximum population size (N), the increase in the generality proportion (P(#)), and the reduction in the mutation probability (µ).

Measures

The Woods2 environment uses an alternative metric for performance reporting - the number of steps taken to achieve a reward within each trial. Reporting is performed only


at the end of each test trial, with test trials run after each exploit problem. The measures of System Error and Population Size are also kept, as is Average Fitness. The [O] measure is not recorded due to the complexity of [O] and the lack of useful comparative data. Within Woods1 and Woods2 the average number of steps to a reward from a random starting position is 1.7, with a random walk finding food in 27 steps on average.

Wilson (1995) reported performance reaching the optimum in 1000 trials and remaining at that figure (averaged over the past 50 trials) until the end of the experiment, at 4000 trials. Only one experimental result was provided, with no evidence of average, best, or worst performance. The population size grew to approximately two thirds of the micro-classifier extent (about 500) and remained constant throughout, with none of the population reduction seen in the Multiplexor experiments. Wilson qualified this somewhat by the rather ambiguous statement that "not all experiments with Woods2 had a steady or falling population size by 4,000 problems. However, population size like [this] were obtained by lowering the mutation and crossover rates. This in fact improved performance, suggesting that appropriate parameter regimes will not trade off performance and population stability". Unfortunately he did not identify which parameters produced the best results, nor give details of the actual effect of different parameter settings. These results were unfortunate, given that a complete State × Action table for this problem would be a table of size 560.

Wilson explained the lack of population reduction by noting that there is no evolutionary pressure in Woods2 to generate optimal classifiers, because there are many different locations in which the same bits have different interpretative value, thereby creating evolutionary plateaux that a number of classifiers can occupy. However, these results were from an XCS that did not include the subsumption and deletion mechanisms now included in XCS, and a better picture was obtained in Wilson (1996), where the population size with these additions reduced to 91 by 2000 trials. This experiment was repeated in Lanzi (1997), where subsumption reduces the population size to around 350 and the introduction of Lanzi's Specify operator (see Lanzi, 1997) reduces it further to 300, both by 5000 iterations. These contradictory results are not explained by Lanzi, and it is suspected that he continued to use Wilson's 1995 version of XCS in this work.

In Woods2 any given grid position is a maximum of 3 steps from a food grid. Therefore, given a reward of 1000 for discovering food, it is possible to predict the stable payoff


that classifiers should receive. Those that cause a step immediately onto a food grid position will receive an immediate reward of 1000. Those that are two grid positions away will receive the discounted maximum prediction of the directly rewarded classifiers, which with a discount rate of 0.71 should generate a stable payoff of γ × 1000 = 710. Those one further step away will obtain a stable payoff of γ × 710 ≈ 504. Other payoff values will also exist for longer paths, but the operation of classifier parameter update within Exploit cycles in the multi-step XCS should stabilise classifiers that follow direct paths much more rapidly. Wilson (1995) reported that "the general picture was that by 4,000 problems the vast majority [of classifiers] predicted, with essentially zero error, either 1000, 710, or 504; that is, they predict the values of Q(x,a) precisely. In addition, they covered all (x,a) situations."
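These stable payoff levels are simply the discounted reward γⁿ × 1000 for classifiers n steps before the food grid, as the following one-line check illustrates:

GAMMA, REWARD = 0.71, 1000.0

# Stable payoff predictions for direct paths of 0, 1 and 2 further steps.
print([round(REWARD * GAMMA ** n, 1) for n in range(3)])  # [1000.0, 710.0, 504.1]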

Results

[Plot: Steps, Error, Population and Optimum, as proportions, against Exploitations.]

Figure 3.8 - Average of 10 runs of XCS within Woods 2

The graph shown in figure 3.8 has a very similar 'Steps' curve to those of Wilson (1996) and Lanzi (1997), demonstrating that this version of XCS is able to solve the Woods2 problem with the same degree of effectiveness as Wilson's 1996 implementation. An analysis of the population produced showed that the accurate classifiers had converged to stable payoff predictions of 1000, 710, and 504 with no error, as predicted. The population size converged to a similar number of macro-classifiers as Wilson (1996) reported, as the graph demonstrates (allowing for the difference in population reporting used). The 90 classifiers in the population at iteration 2000 is significantly different from the 350 reported by Lanzi (1997), and again suggests that his implementation of XCS generates the much weaker population convergence of the 1995 version of XCS.

Discussion

Once again, raw results are not available from existing implementations to allow a full statistical analysis of the variance in the results obtained. Nevertheless, the results of the current implementation are sufficiently similar to give a considerable degree of confidence that this implementation of XCS produces a similar outcome to that of existing implementations whilst learning in multi-step problem domains.

3.4.3 Conclusions

This section has sought to verify that the operation of the version of XCS created for this project performs in a manner that is directly comparable to previous implementations, so that the findings of this project can be reproduced. In both the single-step and multiple-step cases only one test environment with a sufficient set of published results was available on which to base conformity tests. Experiments conducted to reproduce the published results demonstrate that the operation of the XCS is closely comparable with previous reportedly correct implementations in both single-step and multiple-step environments. Further analysis of the detailed operation of the XCS was made possible by the provision of implementation statistics by Kovacs, and demonstrates a close proximity in performance in spite of the random processes in the operation of the XCS. It is therefore concluded that the XCS implementation made available for this project is a sufficient implementation on which to base further studies.

3.5 Recent Research

Although XCS was first proposed in 1995 and only reached its present form in 1996, researchers have not been slow to recognise its potential. From Wilson's initial


work and Kovacs' confirmation of the abilities of XCS, other work during the lifetime of this project has both thrown more light on the performance of XCS within single and multiple-step environments and utilised XCS within genuine commercial application areas. This section seeks to highlight the major findings of this work, identify work which is on-going, and gather together and expand a list of areas of XCS work that are ripe for further investigation. The recently published review of XCS work (Wilson, 2000b) is complementary to this study and may be consulted for an alternative commentary on some of this work.

3.5.1 Initial Investigations

Wilson's initial work demonstrated that XCS could identify and maintain accurate generalisations in the increasingly complex 6, 11, and 20 multiplexer problems. Although his initial publication (Wilson, 1995) contained the main features of XCS, the placement of the GA within the Match Set rather than the Action Set and the lack of subsumption prevented XCS from fully exploiting the optimal population. Having rectified these issues in a technical report (Wilson, 1996), Wilson was able to demonstrate that XCS could not only identify the optimal population, but could also remove the majority of the extraneous classifiers so that [O] came to dominate the population. He was therefore able to realise the Generalisation Hypothesis, which was stated as follows:

"Consider two classifiers C1 and C2 having the same action, where C2's condition is a generalisation of C1's. That is, C2's condition can be generated from C1's by changing one or more of C1's specified (1 or 0) alleles to don't cares (#). Suppose that C1 and C2 are equally accurate in that their values of ε are the same. Whenever C1 and C2 occur in the same action set, their fitness values will be updated by the same amounts. However, since C2 is a generalisation of C1, it will tend to occur in more match sets than C1. Since the GA occurs in match sets [now action sets], C2 would have more reproductive opportunities and thus its number of exemplars would tend to grow with respect to C1's. Consequently, when C1 and C2 next meet in the same action set, a larger fraction of the constant fitness update amount would be "steered" toward exemplars of C2, resulting through the GA in yet more exemplars of C2 relative to C1. Eventually, it was hypothesised, C2 would displace C1 from the population."


This hypothesis represents the basis of the power of XCS, and has become a key point of reference for other researchers. Any enhancement or modification that threatens the Generalisation Hypothesis will destroy the ability of XCS to identify and proliferate the general yet accurate classifiers and thus reduce XCS to the performance of the traditional LCS. The Generalisation Hypothesis was subsequently extended by the work of Tim Kovacs (see Section 3.5.2), and in the research documented within this thesis the stronger formulation, the Optimality Hypothesis, is most often used.

Wilson (1996) noted that the number of learning cycles required by XCS to find the optimal population over multiplexer environments of increasing complexity appeared to follow a curve. Although he used only three data points, he hypothesised that task difficulty bears some relation to the number of generalisations within the optimal solution, and suggested the relationship D = c·g^p (where D is the difficulty, g the number of generalisations in the multiplexer, c = 3.22, and p = log₂5 ≈ 2.32). This is an important issue for further investigation, although to date none has been undertaken.

XCS was not only designed to operate within single-step environments, and Wilson demonstrated that XCS could be applied to rapidly learn an optimal performance solution to the Woods-2 problem. Although the initial 1995 investigations demonstrated that XCS was unable to exploit the generalisations within the environment adequately, the change to an action set GA and the addition of subsumption within the 1996 work demonstrated that XCS could also identify and exploit an accurate general sub-population within multiple-step problems. Wilson (1996) did not assess whether the population that was identified represented the optimal [O] since "Woods2 is quite complicated, and just which classifiers would constitute a minimal cover is not obvious". However, he did note that a small stable high-fitness high-numerosity sub-population was identified, which is indicative of the establishment of [O]. Given that the ability of XCS to form an optimal state × action × payoff prediction map is the fundamental enhancement represented within XCS, establishing the ability of XCS to produce [O] in multiple-step environments would be an important step in defining the abilities of XCS itself (see Chapter 4 of this document).

3.5.2 XCS Performance in Single-Step Environments

Whilst Wilson's initial work identified the potential of XCS, it did not rigorously investigate the effects of


some of the additional features added to the XCS model. These deficiencies were rapidly addressed through the work of Tim Kovacs. Kovacs (1996) set out to investigate XCS in a rigorous manner, providing a description of XCS in much greater depth than had been available before (and therefore facilitating the duplication of results), and providing the first independent confirmation of Wilson's results for the Multiplexer problems. Of particular importance was the further investigation of the numerosity feature within XCS. Wilson (1995) had noted informally that the addition of numerosity did not appear to affect the performance of XCS, and was therefore a performance-neutral aid to the user of XCS. No results were presented to confirm this comment. Kovacs investigated this claim and identified that, indeed, the addition of numerosity was performance neutral within the Multiplexer problem case.

3.5.2.1 The Development of Optimal Sub-populations

In pursuing an investigation of the performance of XCS, Kovacs noticed that once XCS was allowed to continue to operate, not only did the population size continue to fall to a stable level, but the final population also sharply distinguished between the accurate optimally general classifiers and the non-accurate classifiers. Although this had also been noted by Wilson, Kovacs helpfully gave a name to these classifiers – the 'Optimal Population' – and defined it:

"We will refer to the maximally general classifiers for each payoff level in the payoff environment as [O]. On its own, [O] is the optimum population for a problem; it is the smallest possible population of classifiers to form an accurate covering of the input space. Once the system has evolved a population which includes [O] ... new classifiers are superfluous and render [P] non-optimal."

Although XCS can establish [O], it will also continue to include other non-optimal classifiers as a breeding pool. This is beneficial, allowing XCS to continue to search for new solutions if the environment is itself subject to change. However, Kovacs suggested in his 'Optimality Hypothesis' that XCS will always tend to generate [O] and that the numerosity of members of [O] can be used to extract them from the population:

"For each payoff level satisfying the criterion of sampling sufficiency, the GA will, in the long run, evolve a maximally general classifier


which will have a greater numerosity than any other classifier in the payoff level. When all payoff levels satisfy the criterion of sampling sufficiency, a complete set of maximally general classifiers [O] will, in the long run, evolve within the population [P]. Further, elements of [O] will be distinguishable from the rest of [P] on the basis of their numerosity, so that, once [O] is complete, [P] may be reduced to [O] by removing all classifiers but those with the highest numerosity in their payoff level."

This hypothesis strengthens the Generalisation Hypothesis: it not only suggests that XCS will tend to favour accurate yet general classifiers, but declares that XCS will find the most general accurate classifier for each payoff level and will ultimately identify and maintain the maximally general optimal mapping of condition × action × payoff prediction. Whilst the Generalisation Hypothesis is a key reference point for XCS research, the maintenance of the Optimality Hypothesis can be seen as the real goal. Thus, the central issue in any XCS research that seeks to modify the operation of XCS in order to introduce new features or improve performance is whether the modification affects the identification and establishment of the minimal-size optimal state × action × payoff prediction map within the XCS population.

Kovacs (1996) investigated this claim by pursuing a greater goal – the take-over of the population by the optimal sub-population once it has been successfully identified. Since XCS maintains a number of sub-optimal classifiers for use in the GA, the principal problem to solve was how to identify when the optimal sub-population was present without prior domain knowledge, with the secondary problem being how to remove the sub-optimal classifiers. To solve the second of these problems Kovacs introduced a mechanism termed 'condensation'. This mechanism turned off the covering operators and changed the GA so that no crossover or mutation occurred (i.e. the GA was used to breed duplicates). Since [by the Generalisation Hypothesis] the classifiers with optimal generality will be preferred by the GA, the effect is to cause the optimal classifiers to proliferate at the expense of the sub-optimal. Over time all the sub-optimal classifiers will be driven out, and a dynamic balance between the niches of the optimal sub-population (where each niche is represented by a single macro-classifier) will be maintained. Kovacs showed this mechanism in operation within both the 6- and 11-multiplexer problems, and subsequent work by Barry and Saxon (1998) has demonstrated that it can be applied to other problems. The first problem was more


difficult to solve, and an initial solution examined the System Error curve of the XCS statistics identified by Wilson (1995) as a trigger. This mechanism was unreliable, and so Kovacs (1997) investigated other alternatives. Although Kovacs (1996) suggested that tracking the average fitness of the population could provide a useful measure, this too was unreliable. However, it was noted that once the optimal sub-population became established, the numerosity of the optimal classifier within the action set would gradually come to dominate that of all other classifiers within the same action set. Initially some of these classifiers would not be members of [O], and this situation could be detected by finding at least one high-numerosity classifier whose condition cover overlapped that of another high-numerosity classifier. Kovacs suggested that a periodic trigger could be added to XCS that would sort the sub-population of accurate classifiers (which, by definition, would be a small proportion of the total population size), check each member for overlap with every other member, and trigger whenever no classifier overlapped another. This 'auto-termination' mechanism had the added advantage of identifying [O] at the same time, so removing the need for condensation. This mechanism was again tested on the 6- and 11-multiplexor problems and was found to be reliable. It has also subsequently been adopted in a parallel implementation of XCS (Barry, 2001).

The deletion mechanism within XCS is fundamental for ensuring that a full mapping of state × action × payoff is maintained in the presence of the GA and for providing pressure towards accurate maximally general classifiers. Wilson (1995) provided two alternative deletion techniques. The first technique (t1) was detailed in section 3.2.4 and bases the probability of deletion upon the size of the Action Set Estimate (e) so that XCS maintains pressure towards uniformly sized action set niches:

    p(deletion(c)) = (c.n × c.v.e) / Σ j∈[P] (P(j).n × P(j).v.e)
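A minimal sketch of how t1 roulette-wheel deletion might be implemented (the attribute names n and as_estimate, standing for c.n and c.v.e, are illustrative):

import random

def select_for_deletion(population):
    # t1: each classifier's deletion vote is its numerosity multiplied by its
    # action set size estimate; selection is by roulette wheel over the votes.
    votes = [c.n * c.as_estimate for c in population]
    return random.choices(population, weights=votes, k=1)[0]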

Wilson's second technique (t2) sought to increase the likelihood of deletion for unfit classifiers. It is noticeable when running XCS that low-numerosity inaccurate classifiers can remain in the population for some time before deletion. This problem arises because low-numerosity inaccurate classifiers will have approximately the same Action Set Estimate as other classifiers appearing within the same match sets. Unfortunately their lower numerosity means that the accumulated Action Set Estimate allowing for the


numerosity will be lower, and therefore their chance of selection for deletion will be low. In fact, the lower the numerosity (the more deletions a classifier has suffered) the less likely it is to be selected for deletion at a later stage, all things being equal. t2 sought to improve the situation by changing the calculation of the probability of deletion to:

    p(deletion(c)) = ((c.n × c.v.e) / Σ j∈[P] (P(j).n × P(j).v.e)) × (f / c.f)   where c.f < δ

    p(deletion(c)) = (c.n × c.v.e) / Σ j∈[P] (P(j).n × P(j).v.e)   otherwise

where δ is a constant representing an unacceptably small fitness for an experienced classifier (typically δ = 0.1) and f is the mean population fitness. These two proposals were investigated by Kovacs (1999a) in two environments – the six-multiplexer environment (representative of environments with many generalisations) and the six-parity environment (representative of environments with few generalisations). He discovered that t2 within the six-multiplexer problem reduced the population size (i.e. it was successful in removing the weak classifiers) but led to a weaker progression to fully correct performance and, more importantly, significantly reduced the ability of XCS to discover [O], often deleting recently discovered members of [O]. Although t1 performs adequately within the six-multiplexer problem, it did not put sufficient pressure on the unfit classifiers within the six-parity problem, resulting in a longer time to establish [O].

Kovacs introduced t3, which utilises the same computational values as t2 but adjusts the trigger that causes the change from t1 to t2 within the calculation to c.x ≥ t ∧ c.f < δ (where t is a threshold experience value, typically 20). The hypothesis was that the idea of putting more pressure on unfit classifiers was correct, but that t2 did not give sufficient opportunity for new classifiers to become fit. The new trigger allows a classifier to gain sufficient experience for its fitness to adequately reflect its relative accuracy across the action sets it is used within before it can be penalised as an ineffective classifier. When t3 was applied to the two test environments it was found to have the fastest convergence rate, a lower population size than t1, and the least time required to find [O]. A comparison of the six-parity and six-multiplexer results to identify the sensitivity of t suggested that as the complexity of generalisation increased the value of t did not need to increase. However, no experiments to establish whether this holds for increasing condition length


were conducted, and no work was performed to establish the relative merits of the three techniques within multiple-step environments; each remains as work to be done. The t3 technique has not been included in the XCS implementation used for the work of this thesis because it was published after a large proportion of the experimental work reported here had been completed. However, it is now commonly accepted as a standard part of modern XCS implementations.

Kovacs and Kerber (2000) have recently begun to collect the knowledge being gained on the operation of LCS to try to identify the problems for which LCS are best suited. Their initial work focuses upon XCS. Although it examines LCS only within single-step problems, it provides some useful insights for XCS work. Fundamental to their analysis is the identification that the percentage of the optimal population discovered, a measure that they term %[O], can be used to indicate the extent to which XCS has found the solution to a problem. They base this assertion on the informal observation that learning difficulty tends to increase with the size of [O] (a measure they denote |[O]|) rather than with the bit length. Thus, the fewer generalisations there are available in the solution space the more difficulty XCS will have. They demonstrate that this relationship holds over a series of 6-bit functions within which [O] is increased. They hypothesise that the reason for the increase in difficulty is tied to the occurrence of the GA per classifier - as the size of [O] increases, each classifier will occur within an action set less regularly and will therefore have less exposure to the GA. This result has clear consequences when the ability of XCS to scale to large problems is considered, although the scalability of XCS has yet to be investigated.

Kovacs and Kerber also noted that the ability to form [O] is affected by the degree of similarity between the members of [O]. Using a Hamming Distance metric they demonstrated that, for the same size of [O], as the Mean Hamming Distance is increased so difficulty is increased. They attribute this to the increased difficulty the GA encounters in transforming any one string into any other as the similarity between members decreases. Although not stated in the paper, one could hypothesise that the limitation of the GA to the Action Set and the constant downward pressure on non-niche classifiers mean that this is more of an issue within XCS than within other LCS implementations that preserve a greater population diversity. Finally, they identify that the range of payoffs given within an environment appears to have a strong effect on problem difficulty. They claim that this is because the use of the reward range in the normalisation of the prediction error hides the relative magnitude of error within the classifiers. This observation had been made earlier (Barry and Kovacs, 1997), and is the rationale behind


the System Relative Error metric introduced in Chapter 4. It is therefore an interesting confirmation of this intuition, although no experimental results were given to verify the hypothesis. Clearly this is an important area within which much more work is required to clarify the parameters of XCS performance within single-step environments and then to apply these results to multiple-step environments.

3.5.2.2 XCS and Traditional LCS

Kovacs (1999b) represented the first direct comparative study of a traditional strength-based LCS and XCS, although informal comparative remarks have been made within other work. In this work the concept of a Strong Over-general was defined. A classifier is a strong over-general if its generality is too high, leading to inaccuracy, but it nonetheless covers states that reward highly. In a strength-based LCS classifiers compete for action selection and GA space on their strength values, and therefore strong over-generals will compete, survive and proliferate although inaccurate. Kovacs strengthens his analysis by deriving an expression for the kind of reward functions that encourage the development of strong over-generals within a simplified strength-based LCS, and extends this result from LCS operating on single-step environments to those operating on multi-step environments. Indeed, he illustrates that the reward function used within multi-step environments satisfies all the requirements of a reward function that encourages the development of strong over-generals within an LCS using strength-based fitness. From this foundation, he then compares strength-based fitness LCS with accuracy-based fitness LCS (of which the only full implementation is XCS). He notes that XCS is insensitive to bias, "so a classifier which accurately predicts a reward of 0 will be as fit as one which equally accurately predicts a reward of 100 – it does not matter much what values we use in the reward function (as long as correct rules have higher reward than incorrect rules)." Since this prevents over-general classifiers exploiting high-reward niches, the problem of strong over-general classifiers is minimised – "in particular, it should be difficult for strong over-generals to emerge using accuracy because over-generals tend to have low fitness while all accurate classifiers have high fitness". Kovacs (2000b) investigates the phenomenon further, formulating the earlier results into clear theorems. Of note for XCS is the finding that all over-general classifiers that can occur within an accuracy-based LCS will be strong over-generals, and that any bias in the reward function will cause them to be produced. This would spell disaster for an accuracy-based LCS, but in XCS the use of accuracy as the basis of fitness provides the necessary control to eliminate the over-generals.
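The contrast can be illustrated with a toy calculation; the values below are purely illustrative and are not drawn from Kovacs' formulation:

# An over-general straddling rewards of 1000 and 0 (seen equally often)
# against an accurate specialist that always receives 1000.
overgeneral = [1000, 0, 1000, 0]
specialist = [1000, 1000, 1000, 1000]

def strength(payoffs):           # strength-based fitness: mean payoff
    return sum(payoffs) / len(payoffs)

def prediction_error(payoffs):   # accuracy basis: mean error of the prediction
    mean = strength(payoffs)
    return sum(abs(p - mean) for p in payoffs) / len(payoffs)

# Under strength the over-general (500) outcompetes an accurate rule predicting
# a low reward; under accuracy its error (500) marks it as unfit.
print(strength(overgeneral), prediction_error(overgeneral))  # 500.0 500.0
print(strength(specialist), prediction_error(specialist))    # 1000.0 0.0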


Whilst Kovacs' thesis is both fundamentally important and persuasive, it was based on a simplified strength-based LCS derived from XCS. The LCS used does not include bid competition, limited reward distribution, or the various reward manipulations described within Chapter 2. These modifications, seen within all strength-based LCS, do have an impact on the preservation of strong over-generals, a fact that is not discussed within the paper. There is a clear need to extend these results and apply them more universally, although it is possible to hypothesise that the general finding will stand.

Kovacs (1999c) introduced a new triggered operator, which he termed "weeding", as a mechanism that attempted to remove strong over-generals within the strength-based LCS. This mechanism looked for high-strength experienced classifiers whose strength was not equal to the value of any environmental reward. The strength of a strong over-general classifier will not lie on such a "defined reward point" because it averages the payoff from a number of payoff positions. Thus, a strength that does not reflect a defined reward point is a reliable indicator of a strong over-general classifier. Such classifiers are deleted, allowing accurate classifiers to compete and flourish. The results of this work within strength-based LCS are beyond the scope of this work. However, it is worth noting that Kovacs did demonstrate that weeding, together with a modified form of subsumption deletion, allowed the simplified strength-based LCS to perform as well as XCS within the six-multiplexer environment. Unfortunately he also noted that weeding is not a sufficient mechanism to prevent over-generals for all reward functions, and cannot be applied to discount-based multiple-step environment reward systems due to the difficulty in calculating the defined reward points.

Of interest to this discussion, Kovacs applied weeding within XCS, since weeding represents a faster means of removing inaccurate classifiers than the combination of deletion and GA selection used for this purpose within XCS. On tests within the six-multiplexer environment it was shown that weeding improved [slightly] the time taken to produce [O] and [more dramatically] the population size required. More work needs to be done within other test environments to determine the usefulness of this function within XCS and to apply it within multi-step environments. This work is more important for its conclusion than for the mechanism itself – that weeding (and other similar 'penalty' mechanisms applied within the traditional LCS described in Chapter 2) "cannot be used to salvage strength-based systems in multi-step environments, in which strong over-generals almost invariably occur. … The conclusion is that only accuracy-based classifier systems are suitable for multi-step environments. It is difficult to exaggerate the importance of this point for the direction of future study of classifier systems."
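A minimal sketch of the weeding test described above (the tolerance parameter and names are illustrative assumptions):

def is_strong_overgeneral(strength, reward_points, tolerance=1.0):
    # An experienced classifier whose strength lies on no defined reward point
    # must be averaging payoff across several payoff positions.
    return all(abs(strength - r) > tolerance for r in reward_points)

print(is_strong_overgeneral(500.0, [0, 1000]))   # True - candidate for weeding
print(is_strong_overgeneral(1000.0, [0, 1000]))  # False - accurate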


3.5.2.3 The Robustness of XCS

Hartley (1999), a co-worker of Kovacs, investigated the ability of XCS to respond to changes in the environment. The ability of traditional LCS to respond to environmental change, due to the maintenance of a population of classifiers of which only a subset represents the solution, has long been extolled as a virtue. However, given the Optimality Hypothesis, XCS will tend to identify and proliferate [O] at the expense of population diversity (although a small number of other classifiers are always available due to the action of the GA). This could be seen as a threat to the ability of XCS to respond to environmental change. Using a binary classification test, Hartley compared the ability of XCS and NEWBOOLE (Parodi and Bonelli, 1990) (a derivation of Wilson's BOOLE) to respond to an environmental change that altered the category to which each stimulus belonged. He found that XCS was able to respond to the change rapidly, simply adjusting the predictions of the rules within its map of [O]. In contrast, NEWBOOLE had to invoke the GA to discover new rules. Since XCS maintains a complete state × action × payoff prediction map, it also maintains the map for those classifiers that predict the 'incorrect' action; a change of category simply requires a change in the predictions of the existing map. Strength-based LCS maintain the 'best action' map and tend to remove classifiers which predict the 'incorrect' action. They will therefore be less well placed to respond to a change of this kind. Whilst this result is encouraging, it is not a wholly representative test. Other, possibly more fundamental, environmental changes are possible – such as a change in the frequency with which a stimulus is received as a consequence of a change in the structure of the environment – a common occurrence within real-world robotic control problems. There is clear scope for further investigation of the robustness of XCS within changing environments.

Whilst Kovacs and Hartley have investigated aspects of the operation of XCS, Wilson (2000a), Lanzi (1999b), and Saxon and Barry (1999b) have sought to expand the base provision of XCS by investigating the integration of non-binary conditions into XCS. Wilson (2000a) modified XCS so that the condition consisted of a vector of intervals, where each interval was represented by a pair of real number values – one representing the centre point and the other representing the spread on either side of the centre point. In order to assess the ability of XCS to generalise using these intervals he devised a modification of the six-multiplexer test. In this experiment six real number intervals between 0.0 and 1.0 were provided in the condition, and a vector of real number values in the same range was randomly generated for each successive environmental input. The condition matched if, for each value in the input vector, the corresponding interval in the condition vector contained the value. Covering occurs by creating conditions whose centre points are the same as the input values but whose spreads are random values between 0.0 and 0.5. Crossover operated only between the real number values, and mutation added or subtracted a value between 0.0 and 0.1 to the centre point. In all other respects the XCS operates as normal. In the six-multiplexer experiment the 'correct' answer is identified by translating the input vector such that a value at any position less than 0.5 is interpreted as a binary 0 at the corresponding position in the binary vector, and any value greater than or equal to 0.5 as a binary 1. Using this test, Wilson discovered that XCS attained performance approaching 1.0 in 15000 explore/exploit presentations. An examination of the population showed that the classifiers with the highest fitness were optimally general – they identified the required '0' positions as the range 0.0-0.5, the required '1' positions as the range 0.5-1.0, and the required "don't care" positions as the range 0.0-1.0. Unfortunately, these classifiers all had a low numerosity because the small differences in real number values prevented the precise classifier matching required for subsumption.
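A minimal sketch of the centre/spread matching and covering just described (the function names are illustrative):

import random

def matches(condition, inputs):
    """condition: list of (centre, spread) pairs; inputs: list of reals."""
    return all(c - s <= x <= c + s for (c, s), x in zip(condition, inputs))

def cover(inputs):
    """Create a matching condition: centres on the input, random spreads."""
    return [(x, random.uniform(0.0, 0.5)) for x in inputs]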


Saxon and Barry (1999b) carried out other integer and real number representation experiments independently of this research work. In their work the real number values were represented using an interval identified by a start and end point. Mutation was a similar 'creep' mutation, although of only one of the two points in the interval, and crossover could occur within an interval – between the start and end points – as well as between intervals. The decision not to use a centre point and spread arose from earlier work on integer range representations within XCS, which showed that XCS found it difficult to identify the centre point and spread in a way which reflected the start and end points accurately. The decision was also made to maintain a 'granularity' parameter representing the smallest value that would be represented by the interval. This eliminates the infinite range of values that could exist between any two real numbers, allowing XCS to discover the optimal accurate condition intervals. Experiments using Wilson's six-multiplexer modification have shown that XCS can identify the optimal sub-population and establish it through subsumption. Further experiments have demonstrated that mixed binary/real number and integer/real number conditions are also able to find and establish the optimal sub-populations.

Wilson (1995), in noting that the binary alphabet used by the conditions of XCS was limited in its range of representation, suggested that the provision of S-Expressions within conditions may be a possible alternative. Lanzi and Perrucci (1999) investigated this idea by integrating a library of primitive functions used within the Genetic Programming community with XCS. The conditions were formed from real, integer and boolean values arranged within primitive boolean inequality expressions. These expressions were combined using boolean conjunctions and disjunctions to create larger expressions. The condition was therefore position-independent and 'messy'. Lanzi (1999a) had already demonstrated that XCS could utilise 'messy' codings that allowed the attributes expressed by a condition to vary. XCS was able to identify solutions to a number of problems presented to it, and surprisingly was able to do so without a significant increase in learning time. This result would suggest that the difficulty of learning with S-expressions is no greater than with a ternary alphabet - this is clearly an unexpected result and warrants further investigation.

3.5.2.5 Real-World Application of XCS

Given the fairly recent introduction of the XCS approach it is not surprising that the application of XCS to real-world problems has been limited. However, Saxon and Barry (1999a) described an application of XCS to a Data Mining task. This work was carefully bounded, not involving XCS in the Feature Extraction phase of KDD, although the generalisation capabilities of XCS may well be able to meet some of the requirements of Feature Extraction. Instead, given a set of attributes from a data set, XCS would be applied to learn the relationship between the independent attributes that causes the change in value of a single dependent attribute. Their initial work reported on an application to the Monk's Problems, a well-known sample data set from the UCI Data Set Repository. This data set was chosen because of the wealth of comparative results with other Machine Learning and Statistical methods used for Data Mining. The results showed that XCS was able to produce a better classification performance than all existing Machine Learning approaches in two of the three tests (including the sample that contained uniform noise), and was only out-performed on the remaining test set by an approach that included a large amount of problem-specific knowledge within its induction algorithm (unlike all the other algorithms). This work also confirmed the ability of XCS to converge on [O] in a single-step problem other than simple boolean functions. A Java-based tool-set was subsequently produced, enhancing XCS with additional attribute types and including modifications to the accuracy calculation, and delivered as a Data-Mining tool-set to the sponsoring company. In the recent COIL2000 competition this tool-set (used in conjunction with the commercial tool Model-1 for the Feature Selection aspects) achieved 3rd place out of 53 submitted techniques from commercial and academic institutions, outperforming all existing GA-based


approaches (Greenyer, 2000). An analysis of the performance of XCS suggests that the parameterisation of XCS was not set as optimally as possible, and it is believed that the XCS tool-set may be able to perform even better. Further investigation of these claims is currently underway.

Wilson has recently started to investigate the applicability of XCS to the Data Mining task. In Wilson (2000c) he uses XCS to mine a data set whose data classifications are arranged obliquely to the conjunctive expression capabilities of XCS. This requires many classifiers operating co-operatively to represent the classifications adequately. He demonstrates that although large populations of classifiers are required, XCS can learn the optimal classification model of a large medical data set. These findings are encouraging for the application of XCS to industrial-strength data mining tasks and other problem areas that require very large solution spaces.

3.5.3 XCS Performance in Multiple-Step Environments

3.5.3.1 XCS and Rule Chaining

Work within multiple-step environments, apart from the work included in this research, has largely been carried out by Lanzi. Within Wilson (1995, 1996) Wilson reported on the performance of XCS within the Woods-2 environment, a regular Markovian grid-based environment. Although the results presented in Wilson (1995) showed that XCS could perform very well within this environment, it was unable to expand the representation of [O] within the population. When the action-set GA and subsumption were added, however, the results were very promising. There was a clear reduction in population size, indicating a focus on [O]. Unfortunately it is difficult to identify [O] within this environment and therefore the ability of XCS to reliably form and maintain [O] remained unclear.

Lanzi (1997a) sought to expand Wilson's work by applying XCS to more complicated Woods-based environments. His first work involved the Maze-4 environment - an 8x8 environment based on the Woods problem but with irregular environmental input across its 26 available cells. This environment offers sparse generalisation opportunities for XCS. Initial tests in XCS without any generalisation ability demonstrated that XCS could achieve optimal performance within this environment. However, when generalisation was applied within XCS, it was unable to achieve optimal performance until the population was raised to 1600 classifiers. Since the environment contains only 26 distinct sensory configurations, this number of classifiers is out of proportion with


the problem size. Further experimental work suggested that the pressure towards generalisation within XCS prevented XCS from achieving the optimal performance. To rectify this situation, Lanzi introduced a new triggered operator which he termed Specify. This operator produces one new classifier from a selected classifier, replacing don't care symbols in the condition of the source classifier with the corresponding bit from the input message with some probability Psp. The operator is triggered when the average prediction error of the classifiers in the action set is twice as large as the average prediction error over the population as a whole and the classifiers in the action set are sufficiently experienced. The application of this operator allows XCS to achieve optimal performance within Maze-4 using 800 classifiers within the population, whilst leaving the performance unchanged within the original Woods-2 problem. Thus, Specify combats the generalisation pressure in environments which require specific classifiers whilst allowing generalisation to remain in environments which require general classifiers. Unfortunately this useful feature comes at the cost of a larger population, although the population curve presented in Lanzi (1997) for XCS without the additional operator is over twice as high as that reported in Wilson (1996) and within figure 3.9 of this chapter, and thus some doubt must be cast on the exact findings at this point. Interestingly, Kovacs' t3 deletion operator achieves a similar objective within single-step environments, and it is possible that t3 could be applied within Maze-4 to allow XCS to perform optimally without requiring the Specify operator, although no experiments have been carried out on this to date.
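A minimal sketch of the Specify operator as described (Psp is taken from the description above; the representation and function name are assumptions):

import random

PSP = 0.5  # hypothetical specify probability

def specify(condition, input_message):
    """Replace don't care symbols with the matched input bit, with probability PSP."""
    return ''.join(bit if c == '#' and random.random() < PSP else c
                   for c, bit in zip(condition, input_message))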


Lanzi (1997b), in a commentary on this research, notes that the behaviour reported in Lanzi (1997a) is, in fact, dependent upon the explore/exploit strategy adopted. Given the random explore/exploit strategy of XCS it is less likely that an XCS-controlled Animat will reach a reward position as more exploration possibilities are introduced, and therefore reward is not fully propagated back to earlier classifiers. He reports that a biased explore/exploit scheme that reduces the amount of exploration is able to eliminate the problems. Lanzi expands upon his earlier work by looking at environments of increasing complexity, known as Maze5 and Maze6. Although XCS with biased exploration could solve Maze5, it was not as fast as XCS with the Specify operator, and only XCS with Specify could solve Maze6. Lanzi claimed that this suggested that biased exploration could not compensate in environments where there were local specialisations in some environmental niches but good generalisations in other areas. Whilst there is undoubtedly truth in this statement, no population details are reported that might verify the particular hypothesis. Until this work can be repeated all that can be deduced is that the Specify operator does recover the situation. Should an examination of the population reveal the over-exploration of generalists then the hypothesis would be confirmed.

Further work in the paper examined the performance of XCS within the Woods14 environment that was tackled with ZCS by Cliff and Ross (1994). The Woods14 environment is a corridor environment that contains the longest possible chain of states (18 steps) that can be achieved within an environment with only two grid values apart from the reward. Once again Lanzi demonstrated that XCS was unable to learn a solution to this environment where generalisation was allowed, although it was able to learn a solution where no generality was allowed or where the Specify operator was introduced. The generalisation mechanism again appeared to thwart XCS. In an analysis of the generalisation mechanism Lanzi noted that over-general classifiers only become inaccurate whenever they are applied in areas of the environment for which they are inaccurate. If the environment is not explored sufficiently this may not occur, and the over-general classifiers may thrive. Subsumption deletion could exacerbate this problem, since more specific classifiers will be subsumed by the incorrectly over-general classifiers. Lanzi notes that the coding used as input from the Woods environment is unhelpful, providing many bit locations that are rarely explored in comparison to others. Thus, in situations of uneven exploration a poor input coding can exacerbate the problem. These observations led to the following hypothesis:

XCS fails to learn an optimal policy in environments where: (i) XCS is not very likely to explore all the environmental niches uniformly; and (ii) over-general classifiers that match only a few niches are very likely to be produced.

and the additional hypothesis:

XCS is not likely to evolve a compact representation in those environments where there is no direct relation between the number of # symbols in classifier conditions and the number of niches that classifier matches.

Since it is hypothesised that the problem is linked to the uniformity of the random walk in exploration, Lanzi introduced a 'Teletransportation' mechanism in order to examine the adequacy of the hypothesis. This mechanism periodically transported the Animat to a new random cell in the environment. Using this mechanism he was able to demonstrate that XCS could learn a solution to the Woods14 problem, although establishing a complete solution was still slow. Lanzi notes that this operator is not


intended to represent a practical solution to the problem, however. Rather, it is a tool to establish a hypothesis. These results are very important, identifying some limits on the applicability of XCS. Unfortunately the work was limited to the Woods environments and, whilst these can be scaled to some degree, a more systematic investigation remains necessary to clearly separate the problems of chain length, the number of alternative pathways, and the complexity of input encoding (see chapter 4 for an investigation of one of these issues).

3.5.3.2 XCS and Non-Markovian Environments

Cliff and Ross (1994) examined the possibility of adding memory bits to the ZCS classifier system (Wilson, 1994) in order to re-introduce a degree of autonomy to compensate for the lack of a message list. The introduction of memory bits was a suggestion made within Wilson (1994), and thus their work was seeking to confirm these ideas. Their work demonstrated that ZCS was able to utilise the memory bits to solve problems in non-Markovian environments, although when the number of bits was expanded the learning became unstable.

Lanzi (1997b) utilised the idea of memory bits to provide a solution to the 'aliasing problem'. The Woods101 environment, devised by Cliff and Ross (1994), was applied to XCS. In this environment there are two paths to a reward, each path containing one cell that provides the same input pattern and is the same distance from the reward but within which a different action is required to reach the reward. In this environment XCS cannot learn which action to apply to get to a reward, and furthermore if it learns any one action the payoff received will vary, because the action will lead towards the reward for one of the states but away from the reward for the other. After demonstrating that XCS cannot find an adequate solution to this environment, Lanzi added a single bit of memory. To use this memory bit he extended the XCS action with an internal action that is a ternary value allowing the bit to be set, cleared, or left unchanged. He also extended the XCS condition to reflect the value of the internal memory bit. In tests with the environment he showed that the augmented XCS (called XCSM) is able to utilise the memory bit to disambiguate the states. However, he noticed once again that over-general classifiers can appear that corrupt the population for some time; the introduction of the Specify operator overcomes this. It is worth noting that Lanzi did not use subsumption deletion within these experiments.
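A minimal sketch of the XCSM extension as described (the representation choices here are illustrative assumptions):

# The condition gains a ternary part over the internal register, and the action
# gains one internal action per memory bit: '0' clear, '1' set, '#' unchanged.
def memory_matches(mem_condition, memory):
    return all(c == '#' or c == m for c, m in zip(mem_condition, memory))

def apply_internal_action(internal_action, memory):
    return ''.join(m if a == '#' else a for a, m in zip(internal_action, memory))

A classifier now matches only if both its environmental and memory conditions match, and acting both moves the Animat and updates the internal register.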


The XCSM mechanism was then tested in Woods102, an environment that provides two different types of aliasing state, one occurring in two locations and the other in four locations. There are four different pathways to the reward cell. Two bits of memory are used for this environment, and Lanzi demonstrates that as long as both Specify and teletransportation are applied XCSM can converge to a stable solution for this environment. The requirement to use 'teletransportation' arose because XCSM is not guided in the exploration of the internal memory bits. The random nature of the exploration of memory bits introduces temporary bias in the use of memory bits that can lead to inappropriate action generalisations.

This problem is revisited in Lanzi (1998). In this work Lanzi introduces Maze7, an environment with one corridor of 8 cells to the reward cell within which there are two states (at positions 2 and 8) that provide an aliased input and action. XCSM with one bit of memory is unable to learn an optimal solution to this environment, although Lanzi notes that XCSM can converge to a good solution in exploitation only. It is hypothesised that XCSM is trying to learn both the internal representation and the use of this internal representation at the same time. Thus, XCSM may enter the same aliased location with two different internal bit settings, making the internal memory of no use in the resolution of the aliasing states. This problem would only be exacerbated as the number of internal bits is increased. Lanzi solves this problem by making the choice of internal action deterministic at all times, and demonstrates that XCSM with this change (now termed XCSMH) is able to operate successfully within Maze7 and the more difficult Maze10.

Further experimental work (Lanzi, 1997b) in Woods101, introducing additional memory bits that were not required, indicated that XCSM was able to generalise over the additional unused bits, although the population was not as compact as it could be due to the choice in the use of the internal bits. Further investigation (Lanzi and Wilson, 1999), which expanded the number of aliased positions within Woods101 so that two bits of memory were required for a solution, identified that XCSMH was unable to establish an optimal solution until additional, supposedly redundant, internal memory bits were added. It was identified that without the redundant bits it would be highly likely that XCSMH would establish the same internal bit pattern for two distinct aliased positions in the environment, thereby introducing aliasing over the state + internal bit space and causing the deletion of this bit pattern. Increasing the internal memory to the same number of bits as aliased states (4) enabled XCSMH to achieve an optimal solution. XCSMH applied to the Woods102 environment similarly found an optimal solution only when 8 bits were provided for the six aliasing states.


A novel approach to the use of XCS within non-Markovian environments was proposed by Tomlinson and Bull (1999c). Their "Corporate XCS" added the ideas of classifier Corporations (Wilson and Goldberg, 1989; Smith, 1994; Tomlinson and Bull, 1999a) to XCS. In this scheme, once a 'leader classifier' was selected to act, the classifiers within its Corporation would be given preference in acting if they occurred in the following match set. Corporations are formed using explicit references added to classifiers that identify one previous and one next acting classifier in a 'rule chain'. A separate Corporate GA acts in addition to the normal GA within XCS to explore new corporations, and new operators are introduced to identify classifiers to add into corporations. The approach was tested using a highly non-Markovian environment, constructed so that the initial state identifies the final state along a short multiple-path corridor within which reward will occur; no other indication of the final rewarding state is given within the trial. A corporate LCS based on ZCS was able to exploit the corporation to obtain and maintain good solutions to this problem. On application of the technique to XCS, the corporate XCS was able to identify solutions, but performed much worse than the corporate LCS. A number of reasons were given for the poorer performance, but more fundamental problems with the approach can be identified. It appears that the addition of corporations to XCS, whilst very sympathetic to the nature of action-set updates, fails to take account of the fact that XCS operates through action chains rather than rule chains. A deeper critique of this perceived problem is given in section 7.4.2, but it is possible that the Corporate approach could be applied more sympathetically to XCS to realise the performance the authors were expecting.

Lanzi and Colombetti (1999) examined the performance of XCS in multiple-step environments where the environment provided varying degrees of action noise - specifically, the Animat would 'slip' into an adjacent cell in a Woods-based environment rather than the target cell with a set probability. Surprisingly, considering the accuracy basis of XCS, it was shown that XCS is resilient to action noise up to the 0.25 level, although some of this resilience may have been provided by the smoothing operation of the discounted payoff (see chapter 5). Once noise rose to the 0.5 level the performance of XCS broke down dramatically. Lanzi and Colombetti then introduced a mechanism that enabled XCS to operate adequately even with large noise levels. They added an additional parameter to each classifier to maintain an estimate of the lowest error level in the action sets the classifier participated in. When calculating the error measure during the normal update, this amount was removed from the calculated prediction error.


The basis for this was that the minimum error estimate of classifiers in the action set is a sound estimate of the real environmental noise. If this is removed from the error estimate, the remaining figure indicates the actual classifier error beyond that induced by environmental noise. This mechanism is a useful contribution to LCS work that could be applied more widely, particularly within data mining work. Further work is required to investigate other forms of noise, in particular sensory noise, before such techniques can be integrated more widely.
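The compensation mechanism can be sketched in a few lines. The sketch below is one plausible reading of the idea, not Lanzi and Colombetti's code: the names, the running-minimum tracking, and the Widrow-Hoff form of the update are assumptions for illustration.

    # Sketch of error compensation for action noise: each classifier keeps
    # an estimate of the lowest error level seen in the action sets it has
    # joined, and this estimate is deducted from the raw prediction error
    # before the error parameter is updated.

    BETA = 0.2  # learning rate, as used elsewhere in this work

    def update_error(cl, payoff, action_set_lowest_error):
        # Maintain the estimate of the minimum error in this classifier's
        # action sets (taken as a sound estimate of the environmental noise).
        cl.min_error = min(cl.min_error, action_set_lowest_error)
        raw_error = abs(payoff - cl.prediction)
        # Remove the noise estimate: what remains approximates the genuine
        # classifier error beyond that induced by the environment.
        adjusted = max(0.0, raw_error - cl.min_error)
        cl.error += BETA * (adjusted - cl.error)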

3.6 Further Work

As the preceding section has demonstrated, considerable progress has been achieved within a short time. The results presented are all extremely encouraging, identifying XCS as a reliable, robust Machine Learning architecture. Clearly there is much more work to be done, and some avenues for further work have already been identified in the context of the discussion above. In addition, other areas for consideration have been identified by other workers.

Wilson (2000a) agrees with the suggestion above that further work is necessary to examine the effect upon XCS of other forms of noise. He also agrees that further work is possible on the use of internal state, possibly to produce hierarchical invocation of classifiers. Wilson has on a number of occasions emphasised other areas for further work. He is particularly interested in the addition of a predictive element within classifiers that allows them to be rewarded for their next-state prediction in addition to direct action. This would allow XCS to remove itself from a full environmental dependency and would integrate some of the ideas from ACS (Stolzmann, 1997) into XCS; there are aspects of the work carried out in chapter 8 that are relevant to this aim. Wilson also sees value in reducing the dependency of XCS upon discrete environments, adding capabilities to deal with continuous inputs and actions. In the work on persistent actions within chapter 6 a method for reducing the temporal dependency of XCS is demonstrated, although this work remains a considerable distance from the ideas proposed by Wilson. Although Wilson does not explicitly identify the accuracy function, the mutation operation or the subsumption operators as areas for further investigation, he has proposed a number of different variants in each of these areas. There have been no published comparative results, and it is clear that there is a need to understand the effects of these changes so that the best implementation options are known.

In addition to Wilson's suggestions, further issues can be identified. In the area of XCS parameterisation there is a need to investigate the degree of dependency of XCS upon the mutation and crossover rates, and upon the rate of G.A. invocation. At present only


rules-of-thumb are available for these settings, but the rate of convergence of a classifier can now readily be related to its exposure to reward. There must therefore be the potential to identify clearer guidelines in this area.

The role of the covering operators within XCS also provides potential for investigation. LCS implementations close to the Canonical LCS form tended to have a heavy reliance upon the covering operators (see Chapter 2). XCS relies more upon the G.A., and the covering operators used tend to be naïve. Recently a rather heavy-handed approach to effector covering has been proposed (Butz and Wilson, 1999). It is unclear what the effect of these approaches is, and whether a more 'intelligent' approach to covering would provide any advantages.

The deletion operator also needs further attention. Although Kovacs has introduced an excellent approach with his t3 operator, niches that are less commonly used within the XCS population are still threatened by the deletion operator. This is partially because the estimate of niche size is only updated on invocation, and therefore niches that have already suffered deletion will not reveal this within their action set size estimates until further updates have occurred. A solution to this dilemma is needed if XCS is to be applied within environments with varying frequency of exploration of the state space. In a similar vein, for classifiers early in an action chain the error calculation poorly reflects the magnitude of variation in payoff relative to the classifier's fixed payoff value, when compared with classifiers towards the end of the action chain. This fact is illustrated well by the System Relative Error performance statistic introduced in Chapter 4. It is possible that an error measure could be constructed that is more sensitive to proportional variation in payoff, to the benefit of the maintenance of the action chain as a whole.

In personal communications (Wilson, 1998b), Wilson has mentioned the concept of "Diffusion", which he discussed as follows:

"An amazing thing about XCS is that covering is not necessary for exploration. Most of the time the match sets fill up with all actions via the GA. This seems to be the result of generalisation. For example suppose a classifier 1#1##0→0 is generated by the GA in an action set that matched 111110. This new classifier will match 7 other input strings, and will therefore occur in 7 other action sets and those sets' associated match sets. Thus the creation of this classifier helps populate as many as 7 other match sets with the action '0'. There is an important subtlety about XCS vs. traditional systems ... It is that in XCS, the 'discovery component' does not actually discover


new classifiers, if one means classifiers that 'do the right thing' in the situations they match, i.e. correctly connect conditions with actions. Instead, the discovery component searches along the specificity-generality axis for classifiers that are maximally general while still accurate. The 'fodder' for this search is provided by this 'diffusion' of classifiers among match sets due to the GA."

Although nothing further on this idea has appeared in the literature, this would appear to be a potentially important concept if shown to operate in actuality. The results obtained in Chapter 4 have implications for this concept, and these are discussed further at that point.

It is particularly important to identify whether the simpler model provided by XCS is more amenable to mathematical model construction than that provided by the traditional LCS. The Reinforcement Learning community have gained considerable benefits from the detailed mathematical models of their learning algorithms, and similar benefits would be seen for XCS investigations. A possible alternative avenue would be to identify the relation of XCS to selected Reinforcement Learning approaches so that the models that already exist can be applied to XCS.

It is clear that much more work remains to be done, and inevitable that any quick consideration of the issues raised here will only produce yet more areas for consideration. The field provides many opportunities for research work, and it is hoped that this will attract more workers to use the XCS Learning Classifier System.
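Returning to the diffusion example in Wilson's quotation: the arithmetic can be checked directly. The short sketch below (illustrative only) enumerates the binary inputs matched by the ternary condition 1#1##0 - three #'s give 2³ = 8 matching strings, the original 111110 plus seven others, each with its own match set into which the new classifier diffuses.

    # Enumerate every binary input string matched by a ternary condition.
    from itertools import product

    def matched_inputs(condition):
        options = [('0', '1') if c == '#' else (c,) for c in condition]
        return [''.join(bits) for bits in product(*options)]

    inputs = matched_inputs('1#1##0')
    print(len(inputs))                           # 8
    print([s for s in inputs if s != '111110'])  # the 7 other inputs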

3.7 Conclusion

Although the XCS Classifier System has been in the public domain for five years (at the time of writing), there has not to date been a comprehensive, unambiguous and detailed specification of the operation of XCS independent of a particular implementation, and no publicly available implementation of XCS existed until the XCSC implementation produced for this research project was developed. This chapter has sought both to provide a definition of XCS sufficient to found further investigation, and to provide a demonstrably correct implementation of XCS fit to be released into the public domain for further research work. The implementation produced has been tested using all of the 'standard' tests for which previous results are known, and has been shown to be comparable with these results, as far as is possible


without access to the raw results themselves, and to be internally stable in the results it produces over many runs of a suitable test experiment. From these experiments it has been concluded that the implementation of XCS provided is a satisfactory platform for the further research proposed within this research project. The chapter concluded with an overview of current research and the identification of a range of issues that have yet to be investigated but which will not be tackled by this project. It is hoped that the enumeration of these issues and the provision of the XCS package will provide incentives for further research by other workers within the field.


Chapter 4

INVESTIGATING ACTION CHAIN LIMITS IN XCS MULTI-STEP LEARNING

4.1 Background

Section 3.5.3 discussed previous investigations that have been conducted into the use of XCS within multiple-step environments. Wilson (1995, 1996, 1998) provided a proof-of-concept demonstration of the operation of XCS within the Woods2 environment. Lanzi (1997a) identified that within certain Woods-like environments XCS was unable to identify optimum generalisations. This was attributed to two major factors: an inequality in the exploration of all states in the environment, allowing over-general classifiers to appear accurate, and an input encoding that meant that certain generalisations were not explored as often as others. Lanzi (1997b) sought to apply these lessons to more complex Woods-based environments and discovered that XCS was additionally unable to establish a solution to the long-chain Woods14 problem (Cliff and Ross, 1994). This was due in part to the number of possible alternatives to explore in each state, which prevented XCS from attributing equal exploration time to later states within the chain.

Whilst these investigations have generated useful results and shed considerable light upon the operation of XCS within multiple-step environments, they have not investigated the limits of XCS learning within these environments in a systematic manner. To establish some limits on the capabilities of XCS, the problems of action chain length, exploration complexity and input encoding complexity must be disentangled. This chapter presents investigations that seek to establish the limits of XCS within the area of action chain length. It is recognised that the problem of exploration complexity is intrinsically tied to this area of investigation and is worthy of separate study. The problem is here referred to as "action chain learning" rather than "rule chain learning" because XCS chains action sets rather than individual classifiers, although in some respects the ideas of "rule chaining" and "action chaining" can be synonymous.


The test environments are constructed to ensure that the three problems identified by Lanzi are controlled in such a way that a single issue can be focused upon. It could be argued that this requires potentially artificial environments that may never be found within real test situations. However, without an understanding of the limits caused by each of these problems the user of XCS will not be able to predict the likely performance of XCS within a real test environment. This work is particularly significant to the main focus of this research programme, establishing limits on XCS action chain learning that will provide indicators similar to the rule chain results established by Riolo (1988a) using the traditional LCS CFS-C.

Further investigations into a number of aspects of action chaining within XCS are required for a complete understanding of this area. In particular, this work has not investigated the stabilisation of predictions within the rule-chain. Within XCS there are a number of potential controls on this process within the reinforcement equations. The learning rate β can be adjusted so that the prediction is changed more rapidly, so that β can be thought of as reflecting the degree of 'recency weighting' given to the calculated prediction value. Unfortunately too large a value of β could theoretically result in premature convergence on an incorrect prediction value and in over-responsiveness to small changes which temporarily affect the payoff received. This would in turn adversely affect the classifier's accuracy and therefore its fitness in reproduction. The discount factor γ can be adjusted so that the fixed predictions of the classifiers in a chain are closer together, thereby potentially increasing the length of the action chains that can be maintained. However, higher values of γ will produce closer fixed predictions within the action chain, and this may lead XCS to generalise over neighbouring states in the state chain. These are areas that require further investigation at a future date.

4.2 A Suitable Test Environment

In order to investigate this area empirically a suitable test environment is required. The Woods environment is a useful general learning test, but is not easily scaled with fine control in either length or complexity. Therefore, the Woods environments are set aside in favour of an FSM environment similar to that proposed by Grefenstette (1987) and used extensively by Riolo (1987a, 1988b, 1989a). A Finite State World (FSW) is an environment consisting of nodes and directed edges joining the nodes. Each node represents a distinct environmental state and is labelled with a unique state identifier. Each node also maintains a message that the environment passes to the XCS when at


that state. Each directed edge represents a possible transition path from one node to another and is labelled with the action(s) that will cause movement across the edge in the stated direction to a destination node. An edge may lead back to the same node. Each node has exactly one label and message; each message is unique within a Markovian FSW and is normally equivalent to the node's label. Non-Markovian environments can be created by allowing a message generated by one node to be re-used by other nodes. Each edge may have one or more labels, and these will be re-used on edges emanating from any node that allows the action to be executed when in the state represented by the node. At least one node must be identified as a start state, signifying that the XCS will be operating in that state when each new learning trial begins. If more than one start state is provided, the actual state from which a trial is started is selected arbitrarily from the available start states. Additionally, one or more nodes are identified as terminal states. Transition to one of these states represents the end of a learning trial, and each will have an associated reward value representing an environmental reward that is passed to XCS upon transition into such a state. Terminal states do not have any transitions emanating from them - upon arrival the trial is ended, the next iteration will represent a new trial, and the environment will reset to a [selected] start state.

The FSW is defined more formally as follows. Let N = {s0,…,sn-1} be the set of n states and E = {e0,…,em-1} the set of m edges. Let S be the set of start states and T the set of terminal states. Let M = {m0,…,mn-1} be the set of n messages from the n states and L = {l0,…,lx-1} the set of x effector labels. Let si →ek sj denote an edge ek from si to sj. Let L(e) be a function returning the set of labels associated with an edge e, and M(s) be a function returning the message associated with the state s.

    FSW = N ∪ E ∪ M ∪ L                                              (h4.1)

    S ⊂ N ∧ T ⊂ N ∧ S ∩ T = {}                                       (h4.2)

    ∀si ∈ N · ∃e ∈ E, sj ∈ N · si ≠ sj ∧ (si →e sj ∨ sj →e si)       (h4.3)

    ∀e ∈ E · ∃sj, sk ∈ N · sj →e sk                                  (h4.4)

    ∀ei, ej ∈ E · ∀sk, sl, sm ∈ N ·
        (ei ≠ ej ∧ sk →ei sl ∧ sk →ej sm) ⇒ L(ei) ∩ L(ej) = {}       (h4.5)

    ∀si, sj ∈ N · si ≠ sj ⇒ M(si) ≠ M(sj)   [in a Markovian FSW]     (h4.6)

    ss ∈ S ⇒ ∃sj ∈ N, ek ∈ E · ss →ek sj                             (h4.7)

    st ∈ T ⇒ ¬∃sk ∈ N, ek ∈ E · st →ek sk                            (h4.8)

    si ∈ N − (S ∪ T) ⇒ ∃sj, sk ∈ N, el, em ∈ E ·
        sj ≠ sk ∧ el ≠ em ∧ sj →el si ∧ si →em sk                    (h4.9)

Consider the following FSW:

[Diagram: a four-node corridor FSW - s0 →0 s1 →0 s2 →0 s3, where s3 is a terminal state with reward 1000.]

State s0 represents the 'start' state, with one legal action '0' which when invoked will cause the state to change to state s1. States s1 and s2 also provide only one legal action '0', which again causes a change in state as given by the corresponding directed edge. State s3 is a terminal state that by definition has no legal transitions from it to any other state. This state will be associated with a particular reward value that represents satiation on completing a task. A trial consists of moving from a start state to a terminal state, with the position within the FSW set back to one of the start states upon completion of a trial ready for the following trial.

Finite State Worlds can be created that are equivalent to Woods environments, although all Woods environments allow potential movement in eight directions, so every node must have eight edges to simulate this movement. In fact, within an FSW an attempt to perform an action that is not within the set of actions appearing as labels upon edges leaving the current state is treated as equivalent to an edge labelled with that action which leaves and then re-enters the same state. This reduces the number of edges that must be identified when not all are legal within a state. For example, the following Woods environment and FSW are equivalent, and the transitions back to the same state can be omitted, since they are implied.


[Diagram: the Woods environment

    OOOOO
    O...F
    OOOOO

and its equivalent FSW - states s0, s1 and s2 joined by edges labelled 2 leading towards the terminal state s3 (the food cell F, reward 1000), edges labelled 6 leading back along the chain, and the remaining actions (0, 1, 3-5, 7) looping back to the same state.]
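An FSW of this kind is straightforward to encode. The sketch below is a minimal Python rendering (the names FSW and step are this sketch's own, not the XCSC implementation), using the four-state corridor pictured earlier and treating any action without an explicit edge as an implied self-loop, as just described.

    # Minimal Finite State World: states, labelled directed edges, start
    # states, and terminal states carrying rewards.

    class FSW:
        def __init__(self, edges, start_states, terminal_rewards):
            self.edges = edges                        # {(state, action): next state}
            self.start_states = start_states          # list of start states
            self.terminal_rewards = terminal_rewards  # {terminal state: reward}

        def step(self, state, action):
            """Apply an action; actions with no explicit edge loop back to
            the same state. Returns (next_state, reward, trial_ended)."""
            nxt = self.edges.get((state, action), state)
            reward = self.terminal_rewards.get(nxt, 0)
            return nxt, reward, nxt in self.terminal_rewards

    # The four-state corridor FSW: s0 -0-> s1 -0-> s2 -0-> s3 (reward 1000).
    corridor = FSW(edges={('s0', '0'): 's1', ('s1', '0'): 's2', ('s2', '0'): 's3'},
                   start_states=['s0'],
                   terminal_rewards={'s3': 1000})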

FSW are more precise than Woods environments - each state has a distinct label so that aliasing problems do not occur, and configurations that would not be possible within a Woods environment can be created, as demonstrated below:

[Diagram: an example FSW configuration of five states s0-s4, with edges labelled 0 and 1, that could not be realised within a Woods grid.]

Whereas within the Woods environments it is not possible to create long action chains without aliasing problems (unless more sensory stimuli were introduced, thereby changing the base problem), within an FSW it is possible to extend the chain of states as far as desired. It is therefore simple to extend environments whilst controlling the complexity generated by the number of states, their stimuli, and their interconnection. FSW are thus ideal for the controlled tests required within this chapter.

4.3 Investigating Action Chain Length

In investigating some limits on the length of action chains that XCS can learn, it has to be recognised that there are many inter-related factors that can influence learning.

a) Exploration Complexity

Consider an FSW environment with a chain of n states where each non-terminal state si has two edges, one to the next state si+1 and one back to itself. In such an environment the probability of reaching state si+1 from si is 0.5 in explore mode. For successive states the probability remains the same. Thus, the probability of moving from any non-terminal state si to another state on this chain si+m will be 0.5ᵐ; for example, the chance of a single explore trial advancing ten states beyond the start is 0.5¹⁰ ≈ 0.001. Therefore, as the length of a state chain increases, the ability of explore mode within XCS to explore the chain will dramatically decrease when there remains only one start state on the chain. Clearly the more possible pathways there are to choose, the more problematic exploration becomes. Within XCS this problem is overcome for the exploration of the optimal action chain once that action chain has been explored sufficiently to dominate the system prediction of the chain, because exploitation can then utilise this route and


continue to learn the optimum prediction. However, the discovery of this optimal route remains a major hurdle. An aspect of this problem was identified by Lanzi (1997b) when studying the Woods14 problem (Cliff and Ross, 1994), and Lanzi noted that it could be solved by employing a biased exploration strategy. The very existence of this problem indicates one of the areas that must be controlled in order to carry out this investigation.

b) Environment 'Shape'

Although intrinsically related to the issue of exploration complexity, the issue of environment 'shape' is worthy of separate mention. Certain environments will, by nature of their degree of connectivity, include areas that are more difficult to reach. Lanzi (1997a) noted that it was difficult for XCS to find optimum solutions within the Maze6 environment because exploration rates for some areas were higher due to the shape of the environment. Furthermore, if an environment provides opportunities for looping back to earlier states (or even to the same state) the frequency with which a reward state is encountered will diminish, leading to longer periods between external reinforcement. There is potentially much more work to be done to fully understand this area, but it is beyond the scope of this research effort.

c) Input encoding

The fundamental power of XCS is its ability to generalise whilst learning. The application of the Generalisation Hypothesis (Wilson, 1996) and the Optimality Hypothesis (Kovacs, 1996) to multiple-step XCS learning has not been investigated, but confirmation of this ability is surely a key objective for XCS research. The prediction of classifiers within the XCS population in a multiple-step environment will be updated either as a result of payoff from the environment or as a result of payoff generated by moving the Animat controlled by the XCS into a position where a different classifier can operate and receive an environmental payoff. Where payoff is not received from the environment, the stable prediction of the classifier will be dependent upon the stable prediction of the classifiers that operate in the following step. The payoff received by the action set in the previous iteration is calculated from equation 4.1, where Si is the set of System Predictions from the action sets formed during matching in the current iteration. The update for the prediction within a classifier is calculated (ignoring the initial section of the MAM technique for the present) using equation 4.2:

    ri-1 ← γ·max(Si)                    (eq. 4.1)

    pi-1 ← pi-2 + β(ri-1 − pi-2)        (eq. 4.2)

The discount factor γ will reduce the payoff to the preceding classifiers, so that moving further back in time from the classifier that received the environmental payoff the stable prediction of the classifiers will decrease by a power of γ each time. In an ideal situation the stable prediction of a classifier t steps away from the environmental reward R should therefore be:

    γᵗR                                 (eq. 4.3)
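Equation 4.3 is easily checked numerically. With the parameter values used throughout this chapter (γ = 0.71, R = 1000), the following fragment (illustrative only) prints the ideal stable prediction at each distance from the reward; the first few values (1000, 710, 504.1, 357.91, …) are exactly those reported for the converged classifiers in table 4.2 later in the chapter.

    # Ideal stable predictions (eq. 4.3): a classifier t steps away from
    # the environmental reward R converges to gamma**t * R.
    GAMMA, R = 0.71, 1000.0

    for t in range(20):                     # t = steps away from the reward
        print(t, round(GAMMA ** t * R, 2))  # 0: 1000.0, 1: 710.0, 2: 504.1, ...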

Unfortunately it is possible that, as the payoff diminishes down the action sets invoked as progress is made through a chain of states in even a simple single-chain FSW, the generalisation mechanism of XCS will generalise over the classifiers covering the early states. The ability of XCS to generalise may be dependent upon the amenability of the input messages attributed to each state. Lanzi (1997b) has already noted that an over-redundant encoding can affect the ability of XCS to operate effectively in the presence of generalisation pressure, due to the inadequate or uneven exploration of some of the message bits. It could be the case that potential generalisations over similar actions in separate states may be aided or hindered by the coding of the messages from these states.

d) Parameterisation

The rate at which XCS learns its strength values is fundamentally controlled by the learning rate parameter β. This adjusts the rate at which the current error, prediction, and fitness estimates are adjusted, and reflects the responsiveness of these values to changing environmental input. In multiple-step environments the rate of learning is also affected by the discount factor parameter γ. This parameter reflects the rate of decline in payoff down the sequence of action sets leading to a rewarding state. A high value of γ will result in a lower differentiation in payoff between the states. Whilst this may allow longer action chains to survive, the closer payoff values may prevent the generalisation capabilities of XCS from rapidly finding the accurate yet general classifiers that can be proliferated through the action of the G.A. and subsumption. Lower values of γ may help this generalisation process, but possibly at the cost of reduced chain lengths. In addition to these parameters affecting the learning rate, other parameters - such as the choice of explore-exploit strategy, the frequency of the genetic algorithm, the amount of effector covering, and the time until a classifier is regarded as experienced - will all


affect the learning of action chains to one degree or another. None of these parameterisation issues has been investigated, and this remains a fertile area for future work.

4.3.1 The Test Environment

All of the issues outlined above are, in the context of this chapter, variables that must be controlled. This will limit the applicability of the results obtained, but these limits are necessary if the underlying limits of XCS action chain learning are to be understood. For the purposes of this work, therefore, an FSW will be defined that seeks to limit the investigation to the area of action chain length. The test environment chosen is an FSW representation of a so-called "corridor environment". The environment is pictured in figure 4.1, and the message produced from each state of this environment is the binary coding of the state number.

[Diagram: the optimal route runs s0 → s1 → s2 → s3 → s4 → s10 (reward 1000), with the optimal action alternating between 0 and 1; at each state the other action leads into a parallel state (s5-s9), from which either action (0,1) re-joins the optimal route.]

Figure 4.1 - A Corridor Finite State World

This environment has the following features:

• It can be trivially extended by small or large increments as longer test action chains are required.

• It includes a choice of route at each state, so that the ability of XCS to decide the optimal route as the action chain increases can be determined.

• The sub-optimal route does not prevent progress towards the reward state.

• The optimal route is always re-joined, to limit the penalty of a sub-optimal choice.

• The stable payoff received for a sub-optimal choice will always be equivalent to the γ discount of the payoff received for the optimal choice.

• The alternation of actions prevents generalisation from prematurely producing very general classifiers that cover much of the optimal path to reward.

• The small number of separate actions limits exploration complexity.


The problem of exploration complexity is removed, as far as possible, by the limit to an optimal and sub-optimal choice in each state combined with the immediate re-joining of the optimal route. The problem of exploration shape is controlled by the use of a simple state chain. The problem of input encoding is controlled by making the input from each state the binary code of the integer number of the state for all experimental work apart from those experiments explicitly designed to investigate changes in input encoding. The problem of parameterisation will be solved by utilising the following set of parameter settings, derived from those used by Wilson (1995, 1998), throughout:

    N (population size)              20 × (states − 1) × 2
    Pi (initial population size)     0
    γ (discount rate)                0.71
    β (learning rate)                0.2
    θ (GA experience)                25
    ε0 (minimum error)               0.01
    α (fall-off rate)                0.1
    χ (crossover probability)        0.8
    µ (mutation probability)         0.04
    pr (covering multiplier)         0.5
    P(#) (generality proportion)     0.33
    pi (initial prediction)          10.0
    εi (initial error)               0.0
    fr (fitness reduction)           Not Used
    m (accuracy multiplier)          0.1
    s (subsumption threshold)        20
    fi (initial fitness)             0.01
    Exploration trials per run       5000
    Maximum iterations per trial     chain length × 10

Table 4.1 - Parameter settings for action chain length experiments

These parameter settings were chosen to follow previous work, although the mutation probability is higher than that used by Wilson in the Woods2 experiments, being the same as that used within his multiplexer work. The population size was selected to provide sufficient space for each fully specific classifier to achieve a maximum numerosity of 20. It has been noticed in the course of the experimental work required for other areas of this research project that XCS can increase the numerosity of non-optimal classifiers to 6 or [more rarely] 8 copies. Therefore, to ensure dominance, the 'rule of thumb' of providing a minimum of 12 classifiers per member of [O] was adopted throughout this research. Providing 20 spaces per required classifier allows for the additional classifier space required for the normal XCS exploration and for the fact that generalisation may allow fewer classifiers to be used to represent the state space. Any deviation from this


parameter set within particular experiments will be given alongside the experiment. In all experiments ten runs were carried out, and the results presented are the average of the ten runs.

4.3.2 A metric for action chain evaluation

Prior to the start of the experimental investigation of this chapter, preliminary investigations of action chaining in XCS with a pre-initialised population and without induction algorithms were carried out. Initial work utilised the following environment:

[Diagram: s0 →0 s1 →0 s2 →0 s3 (reward 1000).]

Figure 4.2 - A three state single action corridor FSW

To test the stable prediction, the induction algorithms were disabled and the population initialised with the following classifiers (where P is Prediction, E is Error, F is Fitness, A is Accuracy, and N is Numerosity):

    Extent = 3, Classifiers = 3
    00→0, P=10.0, E=0.0, F=0.01, A=0.01, N=1
    01→0, P=10.0, E=0.0, F=0.01, A=0.01, N=1
    10→0, P=10.0, E=0.0, F=0.01, A=0.01, N=1

These represented the 'ideal' solution set of accurately 'correct' classifiers for each of the three positions within this environment. The exploration limit was set to 9 steps, with the explore/exploit strategy set to conduct one complete exploit trial after each complete exploration trial. All induction algorithms were disabled, but apart from these two modifications the remaining parameterisation was as given in table 4.1. This test was run thirty times for 200 exploration trials. The measures captured by Wilson (1995, 1998) were used, and the average of the runs is shown in figure 4.3.

It can be seen that the System Error becomes zero after 118 update trials (59 exploitation trials, since the classifiers only cause movement in one direction and exploration operates like an additional exploitation in this experiment). This would appear to illustrate that the classifiers hold predictions that accurately reflect the discounted payoff within a comparatively short time. In fact, an examination of the population prediction values during the trials revealed that the classifiers within the population only reflect their true stable payoff value some time after this point. Prediction stability to 4 decimal digits was reached at 95, 84, and 72 exploitation trials (286, 254, and 218 exploitation steps)


respectively in all runs. This reflects the operation of the β learning rate in the update of the prediction: the prediction quickly homes in on the payoff value, thereby falling within the set α criteria for accuracy, but takes time beyond this point to converge fully. The System Error is based wholly on the prediction of the final classifier in the action chain, and therefore cannot adequately reflect the error within the action chain. Unfortunately a naïve extension of the System Error metric into the earlier action sets by averaging also does not reflect the true state of convergence within the action chain. This is due to the fact that earlier states each converge to a much smaller prediction (see table 4.2).

[Figure: performance, error and population curves over 200 exploitation trials; the System Error falls rapidly to zero.]

Figure 4.3 - System Error rapidly falling to zero in a three state FSW
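The convergence behaviour just described can be reproduced qualitatively with a small simulation. This is a sketch under the chapter's parameter settings, not the XCSC implementation: each trial walks the single-action chain and applies the Widrow-Hoff update of equation 4.2 to one prediction per state, using the discounted payoff of equation 4.1 as the target.

    # Qualitative simulation of prediction convergence in a single-action
    # corridor: one prediction per state, updated with the discounted
    # payoff of the following state (eqs. 4.1 and 4.2).
    BETA, GAMMA, R = 0.2, 0.71, 1000.0

    def run_trials(n_states, trials):
        p = [10.0] * n_states  # initial predictions, as in the text
        for _ in range(trials):
            for i in range(n_states):
                target = R if i == n_states - 1 else GAMMA * p[i + 1]
                p[i] += BETA * (target - p[i])
        return p

    print([round(v, 2) for v in run_trials(3, 200)])  # -> [504.1, 710.0, 1000.0]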

    Classifier  Prediction      Classifier  Prediction      Classifier  Prediction
    0             1.49          7            16.41          14           180.42
    1             2.10          8            23.11          15           254.12
    2             2.96          9            32.55          16           357.91
    3             4.17          10           45.85          17           504.1
    4             5.87          11           64.58          18           710.0
    5             8.27          12           90.95          19          1000.0
    6            11.65          13          128.1

Table 4.2 - Converged Predictions to two decimal places of 19 classifiers within a 20 state single chain single action FSW.


To demonstrate this relationship a 20 state FSW was created, as shown in figure 4.4. Figure 4.5 plots the stabilisation of the strengths of the 19 classifiers within this 20 state FSW using a logarithmic prediction scale. It is clear from this result that a measure of error relative to the environment reward cannot reflect the convergence of the earlier classifiers. To accurately reflect the true convergence of classifiers covering these early states, a metric that weights each state according to its relative local convergence is required.

[Diagram: s0 →0 s1 →0 s2 →0 … →0 s18 →0 s19 →0 s20 (reward 1000).]

Figure 4.4 - A 20 state single action corridor FSW

[Figure: prediction convergence curves for the 19 classifiers over 200 trials, plotted on a logarithmic prediction scale.]

Figure 4.5 - Convergence of Prediction values along a logarithmic scale for a 20 state single chain FSW, one action, no Induction.


Since the classifiers within the chain will vary in their prediction of payoff, a new per-classifier measure was constructed - the Relative Error εr:

    εr = (r − p) / r    if r > p
    εr = (p − r) / p    otherwise

where p is the prediction of the classifier and r is the [non-negative] payoff received by the classifier.
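As a sketch in code (illustrative; the guard for a zero prediction is this sketch's own addition):

    def relative_error(r, p):
        """Relative Error: the error as a proportion of the larger of the
        payoff r and the prediction p, so that error magnitudes are
        comparable across the chain regardless of payoff magnitude."""
        if r > p:
            return (r - p) / r
        return (p - r) / p if p > 0 else 0.0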

[Figure: per-classifier Relative Error curves (Cl. 1 to Cl. 19) and the System Relative Error plotted over 175 trials; cases where the prediction is below the payoff are shown as negative values.]

Figure 4.6 - The Relative Error in prediction of payoff per classifier over a 20 state single action FSW with pre-loaded classifiers and no induction.


This measure provides the error in prediction of each classifier as a proportion of the larger of the payoff or the prediction (i.e. the relative magnitude in the rate of convergence), so that their error magnitudes are comparable regardless of the payoff they each receive. The convergence of the classifiers to their fixed-point payoff prediction can then be tracked, as shown in figure 4.6. In this figure the case where the prediction is less than the payoff is shown as a negative value for clarity. The measure, here termed "System Relative Error", was constructed from the average of the relative error of the classifiers over the course of each exploitation trial. It can be seen in figure 4.6 that this system measure accurately depicts the rate of convergence of the whole classifier chain.

Of course, this measure cannot be used in this form within normal XCS operation. In this case the action set, rather than discrete classifiers, is the location of the prediction to be measured. The measure is taken in exploitation episodes only, where the action set will always provide the maximum of the predictions for each action in the match set. The relative error measure is then calculated for each iteration of the exploitation trial before a reward is received as:

    εr = (r − π) / r    if r > π
    εr = (π − r) / π    otherwise

where π is the prediction of the previous action set (the previous iteration's maximum system prediction) and r is the payoff generated by the current action set. If a reward was received in the final iteration of the trial, the relative error measure is calculated as:

    εr = (R − π) / R    if R > π
    εr = (π − R) / π    otherwise

where π is the prediction of the current action set and R is the reward received from the environment. The System Relative Error is the average relative error calculated from each action set of the exploitation trial and is calculated once for each exploitation trial.


To help identify the extent of the error contributing to this average, the minimum relative error and maximum relative error from the trial are also recorded.
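The trial-level bookkeeping might be sketched as follows, re-using the relative_error function above (the names and argument structure are illustrative): each exploitation iteration contributes one relative error value, and the trial reports the average together with the minimum and maximum.

    def system_relative_error(step_pairs, final_pair=None):
        """step_pairs: one (payoff r, previous prediction pi) pair per
        non-final iteration; final_pair: (reward R, current prediction pi)
        if a reward ended the trial. Returns (average, minimum, maximum)."""
        errors = [relative_error(r, pi) for r, pi in step_pairs]
        if final_pair is not None:
            R, pi = final_pair
            errors.append(relative_error(R, pi))
        return (sum(errors) / len(errors), min(errors), max(errors))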

4.3.3 Experimental Hypotheses

Section 4.3 identified, within the discussion of the parameterisation of XCS, that the discount factor γ reduces the payoff to the classifiers within preceding action sets. This payoff reduction has two purposes. Firstly, it reflects the increasing degree of uncertainty regarding the role of the preceding action sets in leading to the final reward. Perhaps more fundamentally, it allows XCS to take account, within selection between competing pathways, of the distance to the reward as well as the ultimate reward magnitude. The possible side-effect of the discount within XCS, however, is that as the payoff decreases by a power of the discount factor on each step (γᵗR), the payoffs received by action sets become increasingly similar.

The generalisation hypothesis claims that XCS will be able to identify classifiers of the optimum generality to map the state × action × payoff landscape of a problem. However, it is probable that as the action chain length increases and the payoff to early action sets decreases, the generalisation capability of XCS will simply generalise over the very similar predictions within these early action sets. The effect of this is likely to be that XCS can no longer select the optimum route to a reward. At present it is unknown whether or when this will happen.

Hypothesis 4.1
There will be a point in the lengthening of a single action chain to a stable fixed point reward when the payoff to the action sets covering the initial states will be sufficiently similar to cause incorrect generalisations, thereby preventing XCS from identifying a correct state × action × payoff mapping over those states.

In addition to the potential problems with generalisation, there is no information on the performance of XCS without the burden of generalisation in very long action chains. It may be hypothesised that XCS will be able to identify all the classifiers required to map the state × action × payoff landscape in the simple though long corridor FSW. However, there is uncertainty about whether XCS will be able to establish the very small payoff predictions in the early states accurately enough to continue to select the optimal route.


It is quite possible that XCS will be unable to distinguish between payoff predictions once the payoffs become fractional values. The following hypothesis could possibly be criticised for being insufficiently specific, but it is introduced to formalise this aspect of the experimental work:

Hypothesis 4.2
As the payoff reduces to fractions of unity, XCS without generalisation will become unable to reliably select the optimal path within a simple two-choice corridor FSW environment.

4.3.4 Experimental Method

The investigation of these hypotheses has been divided into three stages. All stages will initially base their tests on the FSW pictured in figure 4.1, expanded to represent optimal action chain lengths of 5 (as pictured in figure 4.1), 10, 15, 20, 25, and 30. The first stage will seek to provide base-line results to demonstrate that XCS is capable of learning the stable payoff for each of the actions in each state of the environments. In this stage a population of classifiers with no generalisation, sufficient to cover all the state × action pairs in the environment, will be introduced. Without activating the induction mechanisms, XCS will be run to seek to identify the stable predictions. The time to complete this task and the time to reduce the system relative error will become the baseline for comparison within the following stages. For each test XCS will be run ten times and the results presented will be the averages of these runs.

To capture the degree of coverage of the problem environment a new reporting technique will be introduced to XCS research³³ from the field of Data Mining. Within Data Mining a 'Coverage Table' is built up during the test phase, once learning from the training data set has been completed. This table is formed by presenting the test data set to the learnt classifications and capturing the predictions that are made. Within XCS a coverage table can be produced by freezing the population after the full diet of trials and then systematically forming every legal message from the environment and presenting each message to the XCS. The highest system prediction from the actions within the match set is taken as the prediction of the XCS for each input and recorded in the table. A number of visualisation techniques can then be applied to this table to picture the learnt state × action × payoff space³⁴.

³³ The application of this technique to XCS was first presented in Saxon and Barry (2000) from a spin-off project from this research applying XCS to the task of Data Mining. The two year project was funded by SERC/DTI and The Database Group PLC and was successfully completed in March 2000.

³⁴ Wilson (1994, 1995, 1996) presented a primitive visualisation of a coverage table using a grid to represent the cell space of a Woods2 environment. Appropriately sized hand-drawn vectors from the centre of each square on the grid were drawn to represent the prediction strength for each action in the match set for that cell position.

The second stage increases the difficulty of the experimental task. Each environment is presented without an initial population. With generality turned off, the induction mechanisms must establish a population of specific classifiers with payoff predictions to map the state × action × payoff space. These experiments will be used to assess the validity of hypothesis 4.2.

The final stage requires XCS to identify and establish optimally general accurate classifiers to map the state × action × payoff space. Each environment is presented to XCS to identify whether hypothesis 4.1 is supported.
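The coverage-table procedure described above might be sketched as follows (illustrative only; system_prediction is an assumed helper returning the fitness-weighted system prediction for one action, and matches is the ternary matching helper sketched earlier).

    def coverage_table(population, messages, actions, system_prediction):
        """Freeze the population, present every legal message, and record
        the highest system prediction offered in each match set."""
        table = {}
        for msg in messages:
            match_set = [cl for cl in population if matches(cl.condition, msg)]
            predictions = [system_prediction(match_set, a) for a in actions
                           if any(cl.action == a for cl in match_set)]
            if predictions:
                table[msg] = max(predictions)
        return table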

4.3.5 The production of baseline results

Each of the test environments was presented to XCS in turn, capturing the output of 10 runs in each environment. After the runs for each test environment were completed, the performance was captured in a chart representing the number of iterations to the reward state, the system relative error alongside the minimum and maximum relative error in the action chain, and the population size. The final measure is irrelevant in this stage, since the population is pre-installed and fixed. The coverage tables for each run were also captured and averaged. The resulting coverage table for the environment was plotted as a line graph, with one line for the predictions of the optimal action in each state and one line for the sub-optimal action in each state. A line graph was adopted to emphasise the payoff relationship between the data points. For the 25 and 30 length environments the line graphs were also re-plotted using a logarithmic prediction scale to reveal the differences in the predictions of the classifiers covering the early states of the environment; due to the small magnitude of these predictions the differences are indecipherable within the linear-scale line graph. The performance and system relative error of the ten runs of each length environment were averaged and plotted. The output from these six experiments is shown in figure 4.7.

[Figure 4.7: six pairs of plots, one pair per chain length. Left of each pair: iterations to reward, minimum/maximum/system relative error and population size against exploitation episodes. Right: the coverage graph of prediction against steps from start for the optimal and sub-optimal actions (logarithmic-scale versions added for the 25 and 30 length environments).
a) FSW length = 5, N=200, Iterations per Episode=50, Condition Size=5
b) FSW length = 10, N=800, Iterations per Episode=100, Condition Size=6
c) FSW length = 15, N=1200, Iterations per Episode=150, Condition Size=7
d) FSW length = 20, N=1600, Iterations per Episode=200, Condition Size=7
e) FSW length = 25, N=2000, Iterations per Episode=250, Condition Size=8
f) FSW length = 30, N=2000, Iterations per Episode=300, Condition Size=8]

Figure 4.7 - The convergence of payoff prediction within action chain lengths (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 30 in a corridor FSW environment with one non-optimal action per state, a pre-loaded population and no induction.


In all cases the system relative error rapidly drops to zero within 150 exploitation episodes, indicating rapid convergence even within long action chains. As would be expected in a pre-loaded population without induction, there was no error evident in the final predictions even for the earliest states, despite the fact that the optimal payoff at this state would be 0.034 and the sub-optimal 0.024.

4.3.6 Using induction on long action chains

The next series of experiments removed the population initialisation, starting XCS off with no initial classifiers. All the classifier induction capabilities of XCS were enabled, but the generalisation probability was set to 0.0 so that only fully specific classifiers would be generated. These experiments were specifically to investigate hypothesis 4.2, although they also provide interesting time-to-convergence comparisons with the previous experiments. The metrics captured in section 4.3.5 were also used within these experiments. The results for each run are presented within figure 4.8.

The results demonstrate that up to and including action chains of length 20 the time for the system relative error to reduce to zero is very similar to that for the situation where the initial population is already provided. Similarly, the time taken for XCS to be able to accurately select the optimal route is almost unchanged. The only change in performance becomes evident within the 20 action chain. In this test the maximum system relative error curve does not display the large 'step' seen in the equivalent non-induction test, although the step does re-appear gradually in the 25 and 30 length action chain tests. This reduction is difficult to explain, but may be due to the late introduction of some classifiers within the sub-optimal route. Classifiers introduced by the GA will have an initial prediction that is the same as the parent if there is only one parent, which will always be the case within the no-generality XCS. Thus classifiers introduced later will exhibit a smaller initial payoff prediction error than if they had been introduced at the start of the exploration. The re-introduction of the step in the 25 and 30 length environments may be due to the increase in the time between the first introduction of the classifiers in the sub-optimal route and subsequent re-exploration.

As the action chain length grows to 25 and 30 steps, the time taken for the system relative error to reduce to zero increases compared to the non-induction test. This is possibly due to the additional feedback required to focus upon the very small predictions in early states.


[Figure 4.8: six pairs of plots as in figure 4.7, now with classifier induction enabled and generality disabled.
a) FSW length = 5, N=200, Iterations per Episode=50, Condition Size=5
b) FSW length = 10, N=800, Iterations per Episode=100, Condition Size=6
c) FSW length = 15, N=1200, Iterations per Episode=150, Condition Size=7
d) FSW length = 20, N=1600, Iterations per Episode=200, Condition Size=7
e) FSW length = 25, N=2000, Iterations per Episode=250, Condition Size=8
f) FSW length = 30, N=2000, Iterations per Episode=300, Condition Size=8]

Figure 4.8 - The convergence of payoff prediction in the presence of classifier induction within action chain lengths (a) 5, (b) 10, (c) 15, (d) 20, (e) 25, and (f) 30 in a corridor FSW environment with one non-optimal action per state.


The payoff prediction plots within figure 4.8 indicate that XCS was able to accurately predict the payoff for all classifiers right up to 30 states, and was able to select the optimal route by 150 exploitation episodes. This finding runs counter to hypothesis 4.2, which suggested that XCS would be unable to select over the very small payoff predictions in these early states. It was initially thought that the averaging contained within the standard XCS report of iterations (it is the moving average of the previous 50 episodes) might be hiding occasional sub-optimal selections. The 30 action chain test was thus repeated with this averaging removed. Each of the 10 test results was then individually checked, since further test averaging could also hide some results. A typical run is shown in figure 4.9.

[Figure: iterations, relative error and population plotted against exploitation episodes for a single run.]

Figure 4.9 - A single test using a 30 length action chain FSW with induction, no generality, and iteration averaging removed.

It is clear from figure 4.9 that the average, in this case, does present a true picture. Clearly XCS was able to identify the optimal pathway, even through the very small payoff prediction states. In an attempt to test the hypothesis on a more extreme case, a 40 action chain environment was constructed by extending the FSW environment in the


same way that the other environments were created from the five state environment pictured in figure 4.1. The result of a typical run from this test is shown in figure 4.10.

[Figure: iterations, relative error and population plotted against exploitation episodes for a single run.]

Figure 4.10 - A single test using a 40 length action chain FSW with induction, no generality, and iteration averaging removed.

Even though the optimal prediction payoff in state s0 is now 0.00158 and the sub-optimal prediction in the same state is 0.001123, it is clear from figure 4.10 that XCS is able to identify these predictions sufficiently accurately to select correctly between them in a close to optimal number of learning cycles. The classifiers covering the first ten states of this environment from the same run are given in table 4.3, illustrating that all prediction values are accurately represented. Clearly this trend could not be extended indefinitely, but combined with the ability to modify γ it is clear that a non-generalising XCS can rapidly learn optimal routes in this kind of environment for long action chains. The hypothesised disruption of the payoff prediction caused by continued learning within XCS was not demonstrated, although it may be seen within environments with noisy feedback - an area for further investigation. Hypothesis 4.2 is therefore not upheld.


Cond.     A  Pred.     Err.    Acc.    Fit.    N   AS     Exp.
00000000  0  0.001582  0.0000  1.0000  1.0000  24  24.00  7333
00000000  1  0.001123  0.0000  1.0000  1.0000  24  24.00  2667
00000001  0  0.001582  0.0000  1.0000  1.0000  33  33.11  2643
00000001  1  0.002227  0.0000  1.0000  1.0000  23  23.00  7357
00000010  0  0.003137  0.0000  1.0000  1.0000  25  25.00  7353
00000010  1  0.002227  0.0000  1.0000  1.0000  32  32.02  2647
00000011  0  0.003137  0.0000  1.0000  1.0000  24  24.00  2609
00000011  1  0.004419  0.0000  1.0000  1.0000  20  20.00  7391
00000100  0  0.006224  0.0000  1.0000  1.0000  21  21.00  7359
00000100  1  0.004419  0.0000  1.0000  1.0000  25  25.00  2641
00000101  0  0.006224  0.0000  1.0000  1.0000  27  27.00  2618
00000101  1  0.008766  0.0000  1.0000  1.0000  30  30.00  7382
00000110  0  0.012346  0.0000  1.0000  1.0000  20  20.00  7411
00000110  1  0.008766  0.0000  1.0000  1.0000  29  29.00  2589
00000111  0  0.012346  0.0000  1.0000  1.0000  26  26.02  2598
00000111  1  0.017389  0.0000  1.0000  1.0000  20  20.00  7402
00001000  0  0.024491  0.0000  1.0000  1.0000  23  23.00  7430
00001000  1  0.017389  0.0000  1.0000  1.0000  27  27.00  2570
00001001  0  0.024491  0.0000  1.0000  1.0000  26  26.03  2633
00001001  1  0.034495  0.0000  1.0000  1.0000  21  21.00  7367

Table 4.3 - Classifiers covering the states s0 to s9 of a 40 action chain length FSW. (Cond=Condition, A=Action, Pred=Prediction, Err=Error, Acc=Accuracy, Fit=Fitness, N=Numerosity, AS=Action set estimate, Exp=Experience)
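The predictions in table 4.3 follow a simple geometric law. As a check on the figures above, the following short Python sketch reproduces them; the discount factor γ = 0.71 and terminal reward of 1000 are inferred from the tabulated values rather than quoted from the experimental configuration, so this is an illustration rather than the experimental code:

    # Geometric payoff structure behind Table 4.3 (assumed gamma = 0.71,
    # reward = 1000, states s0..s39 on the optimal route).
    GAMMA = 0.71
    REWARD = 1000.0
    CHAIN_LENGTH = 40

    def optimal_prediction(state):
        # Reward discounted once per remaining step to the reward state.
        return REWARD * GAMMA ** (CHAIN_LENGTH - 1 - state)

    def suboptimal_prediction(state):
        # The non-optimal action wastes one step, so its payoff is
        # discounted one further time.
        return GAMMA * optimal_prediction(state)

    print(round(optimal_prediction(0), 6))     # 0.001582, as in table 4.3
    print(round(suboptimal_prediction(0), 6))  # 0.001123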

4.3.7 XCS learning with generalisation in long action chain FSWs.

This series of investigations re-used the action chain FSW environment used in the previous tests. Parameterisation was kept the same as that within section 4.3.6 apart from the setting of the generality parameters to 0.33.

4.3.7.1 The Length 5 FSW

The initial experiment was carried out with the length five FSW pictured in figure 4.1. The results of this test are pictured in figure 4.11. These results show that under the action of generalisation XCS is able to identify the optimal action route for action chains of length five within this FSW environment by 350 exploitation episodes. This is just under six times the time required for the fully-specific induction test, whilst the search space has increased from 2^6 to 3^5 × 2, a 7.5 times increase. The rise and then decline of the population curve demonstrates that XCS is able to identify the correct generalisations. An analysis of the population revealed that [O] was present. Unfortunately an analysis of the numerosity of the members of [O] revealed that other classifiers within the accurate sub-population were not as dominated by the members of [O] as would normally be expected. The classifier members of the accurate sub-population are shown in table 4.4.


[Figure: two performance plots (Population, Max Rel Err, Min Rel Err, Relative Error and Iterations against Exploitation Episodes) and a coverage plot (Opt and Sub Op predictions against Steps from Start).]

Figure 4.11 - The convergence of payoff prediction in the presence of generality pressure and classifier induction within a length 5 corridor FSW.

An analysis of the accurate sub-population revealed that there were over-specific classifiers within the accurate sub-population that were maintaining a higher than expected numerosity. The final column of table 4.4 identifies the classifiers within this population that subsume more specific classifiers by the moniker S-x. The classifiers subsumed by such a classifier have the moniker s-x, where x is the same numeric value as the moniker of the subsuming classifier. These classifiers would normally be subsumed on creation within the G.A., but classifiers whose actions are mutated will not be subsumed because they no longer belong to the parent action set. Although these classifiers should be removed by normal deletion dynamics, they appear to remain within this environment, though they are less dominant nearer the reward state. This remained the case in an expanded run of the test that ran for 30000 exploitation episodes.


Run 9, Iterations 62504, Extent 400, Classifiers 72

Cond   A  Pred      Err     Acc     Fit    N   AS     Exp   Subsume
#000#  0  253.2231  0.0004  0.4187  1.000  9   27.48  9432  S-1
#0000  0  253.1771  0.0007  0.5665  1.000  12  27.69  6372  (s-1)
#0000  1  181.1496  0.0009  1.0000  1.000  9   24.57  2325
#0001  0  254.5014  0.0022  0.5960  1.000  14  26.99  2226  (s-1)
#0001  1  356.1513  0.0010  0.9993  1.000  20  36.56  6949
#0010  0  502.4566  0.0006  0.1354  1.000  1   22.81  244
###10  1  356.7899  0.0003  0.7963  1.000  14  28.35  3119
###11  0  503.3889  0.0003  0.2381  1.000  5   25.81  672   S-2
##01#  0  502.7328  0.0005  0.5998  1.000  6   23.53  8130  S-3
##010  0  502.4566  0.0006  0.1345  1.000  3   24.83  445   (s-3)
##011  0  503.4145  0.0002  0.4393  1.000  10  26.35  2062  (s-2, s-3)
##011  1  708.0550  0.0012  1.0000  1.000  11  27.03  6145
##100  0  1000.000  0.0000  1.0000  1.000  25  30.45  6473
##100  1  709.9997  0.0000  0.8591  1.000  23  31.40  1974  S-4
0#100  1  709.9997  0.0000  0.1409  1.000  3   30.47  131   (s-4)
##101  0  253.8366  0.0008  0.9958  1.000  19  23.39  1034
##101  1  253.8402  0.0010  0.9390  1.000  22  24.18  1113  S-5
0#101  1  253.8402  0.0010  0.0534  1.000  1   22.16  34    (s-5)
##110  0  355.7148  0.0013  0.9981  1.000  9   26.19  1192
##110  1  356.9691  0.0005  0.3222  1.000  5   24.56  1104  S-6
0#110  1  356.9716  0.0005  0.0832  1.000  2   28.25  50    (s-6)
##111  0  502.4602  0.0006  0.5856  1.000  12  22.32  986   S-7,(s-2)
##111  1  502.2266  0.0014  0.9866  1.000  12  17.22  969
#0111  0  502.4602  0.0006  0.1204  1.000  2   19.46  82    (s-7),(s-2)
#1##0  0  708.8970  0.0017  0.8813  1.000  14  19.79  1278
#1##0  1  706.1638  0.0038  0.9712  1.000  19  21.45  1178
#1##1  0  1000.000  0.0000  0.7377  1.000  10  16.99  1157
#1##1  1  1000.000  0.0000  0.9082  1.000  19  22.15  1122  S-8
#1001  1  1000.000  0.0000  0.0910  1.000  2   21.23  43    (s-8)

Table 4.4 - Optimal sub-population of a typical run of the five action-chain FSW.

The possibility that these classifiers were introduced and maintained by mutation within the G.A. was tested by re-running the experiment using a new version of XCS. This XCS was modified so that whenever a classifier created by the G.A. is not subsumed by the parents or the action set, the population is searched for a subsuming classifier. This operation acts in a similar manner to Wilson's periodic action set subsumption operator (Wilson, 1998). It is a potentially dangerous modification that could prevent more accurate specific classifiers entering the population where an over-general classifier exists that has been initially identified as of high fitness due to the lack of competition within the action set. However, if this modification does eliminate the problem seen within table 4.4, it will confirm that action mutation is the root cause of the problem. The results of running this modified XCS are given in figure 4.12 and table 4.5.
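The essence of this modification can be sketched as follows. The sketch is illustrative Python only; the classifier fields, threshold names and values are assumptions in the style of standard XCS descriptions, not the implementation used in this work:

    # When a GA offspring is not subsumed by its parents or the current
    # action set (e.g. because its action was mutated), search the whole
    # population for a subsumer instead of simply inserting it.
    THETA_SUB = 20     # assumed experience threshold for subsumption
    EPSILON_0 = 0.01   # assumed error threshold for 'accurate'

    def could_subsume(cl):
        return cl.experience > THETA_SUB and cl.error < EPSILON_0

    def is_more_general(general, specific):
        # Every specified position of the general condition must match.
        return all(g == '#' or g == s
                   for g, s in zip(general.condition, specific.condition))

    def population_subsume(child, population):
        # Returns True if an existing classifier absorbed the child.
        for cl in population:
            if (cl.action == child.action and could_subsume(cl)
                    and is_more_general(cl, child)):
                cl.numerosity += child.numerosity
                return True
        return False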


[Figure: performance plot of Iterations, Min Rel Err, Max Rel Err, Relative Error and Population against Exploitation Episodes.]

Figure 4.12 - The performance of a typical run of XCS within a length 5 corridor FSW with population-wide subsumption for mutated classifiers belonging to a new action set.

Cond   Act  Pred      Error   Acc     Fit    N   AS     Exp
#000#  0    254.4681  0.0023  0.7520  1.000  21  30.53  40000
#0000  1    187.0251  0.0071  1.0000  1.000  14  32.72  14999
#0#01  0    256.2843  0.0026  0.1965  1.000  5   27.43  1904
#0001  1    355.7243  0.0009  1.0000  1.000  8   27.78  45025
###10  1    359.8209  0.0044  0.9320  1.000  21  32.82  22404
###11  0    508.0860  0.0039  0.6246  1.000  18  38.23  18620
##01#  0    505.5768  0.0039  0.7775  1.000  12  35.10  58625
##011  1    709.2216  0.0020  0.9739  1.000  13  40.36  44688
##100  0    1000.000  0.0000  1.0000  1.000  29  30.93  44992
##100  1    709.9690  0.0001  0.9997  1.000  31  39.23  14894
##101  0    253.1724  0.0006  0.7820  1.000  16  22.90  7458
##101  1    253.1224  0.0007  1.0000  1.000  24  30.28  7566
##110  0    356.4869  0.0009  0.9998  1.000  9   28.88  7354
##110  1    356.6254  0.0006  0.1480  1.000  4   30.07  117
##111  1    502.2546  0.0010  1.0000  1.000  9   25.52  7372
#1##0  0    709.2745  0.0012  0.9998  1.000  22  24.10  7315
#1##0  1    709.5942  0.0010  1.0000  1.000  15  19.99  7538
#1##1  0    1000.000  0.0000  1.0000  1.000  24  25.38  7370
#1##1  1    1000.000  0.0000  0.9786  1.000  24  28.17  7573

Table 4.5 - Optimal sub-population of a typical run of the five action-chain FSW with population subsumption for mutated classifiers belonging to a new action set.


This modification caused the population curve to fall more rapidly, and caused XCS to select the optimal route to the reward earlier in the run than the standard XCS. An examination of the optimal sub-population revealed that the dominance of many of the action sets by the members of [O] is very high (see table 4.6), although a few action sets remain where dominance is low. In the cases where dominance is low the competition is now from classifiers being explored by the G.A. that exist outside the accurate sub-population. These results suggest that it is indeed the mutation that is at fault. Wilson (1995) ran his XCS experiments within the Woods environments at a lower mutation rate, 0.01 instead of 0.04, although he gave no rationale for the modification. The original XCS configuration was therefore re-run with a mutation rate of 0.01 to see if this helped to provide a greater focus within the accurate sub-population.

[Figure: performance plot of Max Rel Err, Min Rel Err, Relative Error, Population and Iterations against Exploitation Episodes.]

Figure 4.13 - The average performance of ten runs of XCS within a length 5 corridor FSW with mutation rate reduced to 0.01.

Figure 4.13 illustrates the performance of XCS with the change to a mutation rate of 0.01. The population appears to focus more readily than within the previous parameterisation when this figure is compared to figure 4.11, although the fall-off in system relative error is somewhat slower. To examine whether any change in the population was evidenced, the dominance of the action set by the most numerous classifier was calculated for each run of the XCS with each mutation rate. For each mutation rate these values were averaged to produce an average dominance value for each action set and both the optimal and sub-optimal actions. These dominance values are given in table 4.6 as percentages.

Action set  Op 0.04  Op 0.01  Sub 0.04  Sub 0.01
0           58.80%   55.80%   56.40%    60.40%
1           56.40%   54.30%   51.30%    57.00%
2           45.80%   50.90%   44.30%    54.60%
3           49.30%   53.10%   45.50%    51.80%
4           80.10%   88.70%   77.20%    88.40%
5           68.20%   74.40%   67.90%    78.30%
6           46.00%   54.00%   51.90%    64.00%
7           63.00%   80.20%   71.80%    75.60%
8           66.80%   91.80%   70.10%    87.40%
9           68.50%   88.50%   72.70%    86.90%
Average     60.29%   69.17%   60.91%    70.44%

Table 4.6 - Dominance percentages for the most numerous classifier within each action set for the optimal and sub-optimal actions in the length 5 FSW.
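For reference, the dominance statistic reported in table 4.6 is simply the share of an action set's total numerosity held by its most numerous classifier. A minimal sketch, assuming classifier objects that carry a numerosity attribute:

    def dominance(action_set):
        # Percentage of the action set's numerosity held by its most
        # numerous member, as reported in table 4.6.
        total = sum(cl.numerosity for cl in action_set)
        strongest = max(cl.numerosity for cl in action_set)
        return 100.0 * strongest / total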

An F-Test of the action set dominance for the optimal route between the 0.01 mutation rate and the original 0.04 mutation rate revealed that the variances of the two sets were not equal (F=0.4262, F-crit=0.3146). A one-tailed Wilcoxon test was therefore applied to test the hypothesis that the introduction of the 0.01 mutation rate would improve the focus of the action sets as measured in the dominance statistic. This test revealed a significant difference (T=3, T-crit=3 at the 0.005 level) between the original and new dominance rates. Given that the average dominance with mutation rate 0.04 was 60.29 and the average dominance with mutation rate 0.01 was 69.17, it is concluded that the lower mutation rate improves domination of the action set for the optimal route. The same tests were applied to the sub-optimal route, revealing that the variances are not equal (F=0.7211, F-crit=0.3146 at the 0.05 level) and that there is a significant difference between the two sets of results (one-tailed Wilcoxon test: T=0, T-crit=3 at the 0.005 level). The average dominance with mutation rate 0.04 was 60.91 and the average dominance with mutation rate 0.01 was 70.44. Therefore it is concluded that the lower mutation rate also improves domination of the action set for the sub-optimal route.
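The test procedure can be illustrated with standard library routines (the analysis above was not performed with this code; the data below is the pair of optimal-route dominance columns from table 4.6):

    import numpy as np
    from scipy import stats

    dom_004 = np.array([58.8, 56.4, 45.8, 49.3, 80.1, 68.2, 46.0, 63.0, 66.8, 68.5])
    dom_001 = np.array([55.8, 54.3, 50.9, 53.1, 88.7, 74.4, 54.0, 80.2, 91.8, 88.5])

    # F-test for equality of variances: compare the variance ratio
    # against a critical value of the F distribution.
    f_stat = np.var(dom_004, ddof=1) / np.var(dom_001, ddof=1)
    f_crit = stats.f.ppf(0.95, len(dom_004) - 1, len(dom_001) - 1)

    # Where the variances differ, fall back on a one-tailed Wilcoxon
    # signed-rank test of whether the 0.01 rate improved dominance.
    t_stat, p_value = stats.wilcoxon(dom_001, dom_004, alternative='greater')
    print(f_stat, f_crit, t_stat, p_value)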


An examination of the populations produced by XCS in this test revealed that they were all strongly focused on [O]. Table 4.7 pictures the accurate sub-population from a sample test run. There are a few subsumed classifiers within the population, but these are all recent and are, in general, subsumed by high numerosity population members.

Run 5, Iterations 62710, Extent 400, Classifiers 40

Cond   A  Pred      Error   Acc     Fit     N   AS     Exp
#000#  0  253.9151  0.0010  0.3609  1.0000  9   28.37  774
#0000  0  253.3480  0.0005  0.6023  1.0000  16  27.28  4314
#0000  1  185.4188  0.0044  1.0000  1.0000  15  22.32  1428
#0001  0  255.7035  0.0013  0.3557  1.0000  11  32.66  1370
#0#01  0  254.9937  0.0011  0.3795  1.0000  15  34.70  383
#0001  1  357.8359  0.0002  0.9261  1.0000  9   21.83  4372
###10  1  357.8267  0.0001  0.9053  1.0000  14  23.81  3451
##01#  0  504.0933  0.0000  0.7008  1.0000  16  31.80  9695
##010  0  504.0999  0.0000  0.1098  1.0000  2   30.79  141
##011  0  504.0650  0.0001  0.5002  1.0000  17  32.86  2515
##011  1  710.0000  0.0000  1.0000  1.0000  8   19.21  7485
##100  0  1000.000  0.0000  1.0000  1.0000  29  29.00  3542
##100  1  709.9982  0.0000  0.9589  1.0000  27  29.93  1140
##101  0  253.3142  0.0005  0.5760  1.0000  15  25.33  395
##101  1  252.8318  0.0012  0.9942  1.0000  7   15.05  353
##110  0  357.7362  0.0003  0.8545  1.0000  9   22.53  1097
##110  1  357.6606  0.0003  0.2113  1.0000  5   21.75  1001
##111  0  504.0034  0.0002  0.8917  1.0000  15  17.46  956
#0111  0  504.0034  0.0002  0.0995  1.0000  4   22.92  38
##111  1  504.0824  0.0001  0.9014  1.0000  21  24.10  1020
#0111  1  504.0824  0.0001  0.0960  1.0000  2   21.94  43
#1##1  0  1000.000  0.0000  0.9623  1.0000  25  26.67  661
01##1  0  1000.000  0.0000  0.0379  1.0000  1   27.26  22
#1##1  1  1000.000  0.0000  0.9689  1.0000  19  20.01  570
010#0  0  709.9865  0.0001  0.9961  1.0000  19  20.58  932
010#0  1  709.6502  0.0008  0.9987  1.0000  22  22.53  578

Table 4.7 - The accurate sub-population of a typical run of the five action-chain FSW with the mutation rate modified down from 0.04 to 0.01.

4.3.7.2 The Length 10 FSW

Having confirmed that XCS is capable of identifying [O] within the simplest of the test environments, the problem complexity was expanded to the length 10 environment. The 0.01 mutation rate was initially applied to this environment, having produced better results in the previous test. The performance of XCS within this environment is pictured in figure 4.14. It was noticed that XCS appeared to be continuing to reduce the system relative error, and so the test was re-run with 10000 and then 15000 exploitation episodes. The reduction in error appeared to level out by 15000 exploitation episodes; see figure 4.15.


[Figure: performance plot of Max Rel Err, Min Rel Err, Relative Error, Population and Iterations against Exploitation Episodes, and coverage plot of Optimal and Sub Op predictions against steps from start.]

Figure 4.14 - The average performance of ten runs of XCS within a length 10 corridor FSW with mutation rate reduced to 0.01.

[Figure: performance plot of Max Rel Err, Min Rel Err, Relative Error, Population and Iterations against Exploitation Episodes.]

Figure 4.15 - The average performance of ten runs of XCS within a length 10 corridor FSW with system relative error reducing to zero over 15000 episodes.


Cond    A  Pred     Err     Acc     Fit     N   AS     Exp
#0000#  0  46.1526  0.0004  0.6837  1.0000  17  27.48  21959
#00000  0  45.8892  0.0002  0.2756  1.0000  7   26.96  16050
#00000  1  33.8453  0.0015  0.9432  1.0000  13  31.32  5677
#00001  0  47.6729  0.0029  0.3735  1.0000  10  29.79  5505
#00001  1  65.0573  0.0010  0.8943  1.0000  15  31.36  12848
#0001#  0  91.1453  0.0026  0.5351  1.0000  17  30.03  21990
#0#010  0  88.7311  0.0024  0.0324  1.0000  1   28.42  258
#00010  0  89.5692  0.0012  0.4457  1.0000  13  28.97  16437
#00010  1  65.6751  0.0010  0.9808  1.0000  15  25.42  5592
#00011  0  94.4989  0.0025  0.4638  1.0000  13  28.00  5511
#00011  1  126.394  0.0009  0.9961  1.0000  14  22.34  16281
##010#  0  178.860  0.0007  0.1783  1.0000  4   27.82  20616
##0100  0  178.326  0.0011  0.8198  1.0000  24  29.89  16529
##0100  1  127.879  0.0001  0.9999  1.0000  13  18.65  5500
##0101  0  179.935  0.0002  0.8019  1.0000  20  26.38  5586
##0101  1  252.480  0.0004  1.0000  1.0000  18  24.33  16523
##011#  0  355.477  0.0008  0.6326  1.0000  15  25.59  18805
##0110  0  355.319  0.0008  0.3649  1.0000  9   26.63  14859
##0110  1  253.665  0.0006  1.0000  1.0000  19  28.69  5003
##0111  0  357.537  0.0002  0.4003  1.0000  11  30.05  4520
##0111  1  501.674  0.0024  1.0000  1.0000  21  30.51  13369
##100#  0  709.761  0.0006  0.4482  1.0000  14  30.76  25627
##1000  0  709.291  0.0017  0.5319  1.0000  16  29.87  19573
##1000  1  503.585  0.0003  1.0000  1.0000  11  21.92  6498
##1001  0  709.978  0.0001  0.5623  1.0000  18  31.99  6632
##1001  1  1000.00  0.0000  1.0000  1.0000  18  28.49  19803
##1010  0  44.7831  0.0004  0.9432  1.0000  10  14.17  2871
#0101#  0  60.6779  0.0088  0.0237  1.0000  1   18.72  38
##1010  1  45.4423  0.0004  0.9686  1.0000  15  21.48  2905
##1011  0  64.9697  0.0007  0.9556  1.0000  21  23.32  2107
##1011  1  65.1579  0.0011  0.9747  1.0000  16  19.32  2378
##1100  0  90.1671  0.0005  0.9989  1.0000  19  20.12  2930
##1100  1  90.3648  0.0005  0.9468  1.0000  14  18.10  2799
##1101  0  126.925  0.0009  1.0000  1.0000  20  20.71  2843
##1101  1  126.694  0.0012  0.9994  1.0000  21  23.32  2785
##1110  0  180.188  0.0002  0.9924  1.0000  16  17.83  2839
##1110  1  179.695  0.0006  0.9969  1.0000  17  21.08  2690
##1111  0  253.125  0.0007  0.9989  1.0000  17  17.03  2245
##1111  1  252.868  0.0008  1.0000  1.0000  19  19.10  2246
#1##00  0  356.799  0.0009  0.9250  1.0000  17  19.19  2908
#1##00  1  356.121  0.0014  0.9212  1.0000  18  21.79  2745
#1##01  0  503.415  0.0011  0.9732  1.0000  17  18.27  2007
#1##01  1  501.694  0.0024  0.9916  1.0000  13  13.82  2043
#1##10  0  709.023  0.0012  0.9959  1.0000  22  22.16  1847
#1##10  1  705.676  0.0048  0.9994  1.0000  16  18.09  1951
#1##11  0  1000.00  0.0000  0.9233  1.0000  18  19.48  1619
01##11  0  1000.00  0.0000  0.0765  1.0000  1   17.44  31
#1##11  1  1000.00  0.0000  0.8108  1.0000  19  23.23  1735
01##11  1  1000.00  0.0000  0.1890  1.0000  3   18.21  126

Table 4.8 - The accurate sub-population of a typical run of the ten action-chain FSW.


The coverage graph in figure 4.14 demonstrates that XCS was able to identify the optimum payoff for each action set. An analysis of the population after 15000 iterations revealed a highly focused population (as shown in Table 4.8) that, in general, achieved distinction in the numerosity between members of [O] and other accurate but subsumed classifiers (see the numerosity column in Table 4.8).

[Figure: performance plot of Max Rel Err, Min Rel Err, Relative Error, Population and Iterations against Exploitation Episodes.]

Figure 4.16 - The average performance of ten runs of XCS within a length 10 corridor FSW with the mutation rate set at 0.04.

To further confirm the benefit of reducing the mutation rate, XCS was run in this environment with the mutation rate at 0.04. Surprisingly this produced a much more rapid reduction in the error rate than was seen within the 0.01 runs (see figure 4.16). The statistical comparison between the domination rates of the 0.01 mutation rate run and the 0.04 mutation rate run was repeated for these runs. Since the relative error rates had converged to the same level at 15000 exploitation episodes in the 0.01 mutation rate runs and 5000 exploitation episodes in the 0.04 mutation rate runs, these were compared in addition to the two 5000 episode runs. As table 4.9 reveals, there was no significant difference at the 0.05 level between the percentage domination for the optimal action route when using the 0.04 or 0.01 mutation rates in the 15000 iteration comparison. In the 5000 iteration comparison there was a significant difference at the 0.05 level. Interestingly, the introduction of the 0.01 mutation rate appears to have hindered the domination of the action set, and this is probably a reflection of the high system relative error rate remaining at the end of 5000 iterations within the 0.01 mutation rate run. There was a significant difference within both comparisons for the sub-optimal route at the 0.0005 level. Since the average domination in the 0.01 runs of the sub-optimal action route is higher than that of the 0.04 runs, it can be concluded that the use of the 0.01 mutation rate improves action set domination for the non-optimal routes within the length 10 FSW environment. Thus, the lower mutation rate appears to continue to be beneficial within this environment, if only for the sub-optimal route.

0.04 v 0.01  5000v15000  5000v15000   5000v5000  5000v5000
             Optimal     Sub-optimal  Optimal    Sub-optimal
0.04         78.87       63.21        78.87      63.21
0.01         72.97       72.75        70.5       69.67
F            0.4262      1.023        0.563      1.489
F-crit       0.4612      2.168        0.464      2.168
t            1.35        6.478        -          3.896
t-crit       1.729       1.729        -          3.883
T            -           -            48         -
T-crit       -           -            54         -
level        0.05        0.05         0.05       0.0005

Table 4.9 - Statistical test results comparing the dominance of the action sets in the optimal and sub-optimal routes of the length 10 FSW when XCS is run with the mutation rate set at 0.04 and 0.01.

Figures 4.14, 4.15 and 4.16 indicate that the stable payoff predictions have been learnt and can be differentiated. However, in both figure 4.15 and figure 4.16 a large maximum relative error remains even when the relative error falls close to zero. This is to be expected: there are always competing inaccurate classifiers in each action set that will have some effect on the System Prediction. As the action chain gets longer, the action sets at the start of the chain with smaller payoff prediction values appear to exhibit large magnitude errors. It was hypothesised that this is due to the additional classifiers having a disproportionate effect on the relative error calculation within the early action set system predictions, because of the much larger weighting given to small magnitude differences in these states (quantified in the sketch following this paragraph). Although difficult to spot, the 'iterations' plots in figures 4.15 and 4.16 both indicate occasional non-optimal path choice. If the hypothesis that the disruption is located within the early states (hypothesis 4.1) is correct, then any sub-optimal route choice should take place within these early states. To investigate this hypothesis further, the frequency of non-optimal route choice within the 0.04 mutation rate run was recorded and a typical run from the 10 runs was chosen. The incorrect action choices within the last 2000 exploitation episodes were extracted (the System Relative Error had reduced by 3000 exploitation episodes). The frequency of sub-optimal action choice per state was plotted in a histogram, shown in figure 4.17. This indicates that all of the incorrect decisions were made from state s0 (moving to state s10), state s1 (moving to state s11), state s2 (moving to state s12) and state s3 (moving to state s13), with generally decreasing frequency as progress is made from the early states. Since the dominance results and the coverage graph demonstrate that the payoff prediction of these early states has been discovered and is being maintained by the action sets, the results in figure 4.17 lend strong support to the hypothesis that the early states are much more influenced by the additional classifiers added for exploration of the problem space. If this finding is correct, the problem should become much worse as the action chain is lengthened, threatening the ability of XCS to produce and maintain [O] over long action chains.
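The weighting effect conjectured above is easily quantified. The sketch below (using the discount factor γ = 0.71 inferred earlier; the disturbance figure is arbitrary and purely illustrative) shows how the same absolute disturbance to a system prediction produces a far larger relative error in the early, small-payoff states:

    GAMMA = 0.71
    REWARD = 1000.0

    def true_payoff(steps_to_reward):
        return REWARD * GAMMA ** steps_to_reward

    disturbance = 5.0  # illustrative contribution of exploratory classifiers
    for steps in (0, 5, 14):  # reward state, mid-chain, start of a length 15 chain
        payoff = true_payoff(steps)
        print(steps, round(payoff, 2), round(disturbance / payoff, 3))
    # at 0 steps the relative error is 0.005; at 14 steps it exceeds 0.6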

[Histogram: frequency of sub-optimal action choices per state. Non-zero frequencies occur only for the transitions into states s10 (203), s11 (57), s12 and s13 (the two smaller bars, 18 and 8); states s14 to s19 all have frequency 0.]

Figure 4.17 - The rate of choice of non-optimal route within the last 2000 exploitation episodes of a typical run of XCS within a length 10 corridor FSW with the mutation rate set at 0.04.

4.3.7.3 The Length 15 FSW

The experiments within the length 15 test environment indicate that this analysis is indeed correct. Figure 4.18 pictures the results for the 0.04 mutation rate run and figure 4.19 pictures the coverage graphs for the resulting populations.


[Figure: performance plot of Max Rel Err, Min Rel Err, Relative Error, Population and Iterations against Exploitation Episodes.]

Figure 4.18 - The averaged performance of ten runs of XCS in the length 15 FSW, at mutation rate 0.04. The average iteration count in the last 2000 steps was 17.15 rather than the optimal 15.

[Figure: coverage graphs of Optimal and Sub Op predictions against steps from start, shown with linear and logarithmic prediction scales.]

Figure 4.19 - The coverage graph and the coverage graph with logarithmic prediction scale (y axis) for the action sets within XCS in the length 15 FSW at mutation rate 0.04.

In these runs, all ten populations found and were able to establish all members of [O] apart from those covering states s0 to s3. These four states were represented by a single classifier for each action. As was predicted, and as figure 4.19 illustrates, the additional classifiers in the population within the early action sets disrupt the payoff prediction so that the payoff differences within the first four states cannot be differentiated by XCS. If XCS cannot differentiate the payoffs, it will generalise over these states. Thus, hypothesis 4.1 is shown to hold, and it is possible to suggest tentatively that an action chain length of 11 steps appears to represent the limit of reliable generalisation over the action chain for this environment and parameterisation. However, it was noticeable that the generalisation occurred over bits 0 and 1, and therefore it may be that the convenience of the input encoding encouraged this limit point.

[Figure: performance plot of Max Rel Err, Min Rel Err, Relative Error, Population and Iterations against Exploitation Episodes.]

Figure 4.20 - The averaged performance of ten runs of XCS in the length 15 FSW, at mutation rate 0.01.

It had been expected from previous results that XCS with a lower mutation rate would produce better results. In figure 4.18 the population curve remains high, indicating the presence of additional accurate but more specific classifiers within the population. If a lower mutation rate were able to focus this population further, as was the case previously, then the lower amount of competition within the action sets might allow XCS to distinguish between more of the early states. Figure 4.20 pictures the results of running XCS with a mutation rate of 0.01. It is clear from a comparison of figures 4.18 and 4.20 that XCS with a mutation rate of 0.04 is better able to reduce the system relative error than XCS with a mutation rate of 0.01. It was found that within the 0.01 mutation rate runs only three of the ten runs produced the expected coverage results. Indeed, the coverage tables exhibited action sets with seemingly impossibly high numerosity, and predictions lay between 30 and 50 for most action sets. These values are depicted within figure 4.21, averaged from the seven poorly performing runs.


[Figure: coverage graph of Optimal and Sub Op predictions per state, and histogram of average Optimal and Sub Op action set numerosity per state.]

Figure 4.21 - The coverage graph and the histogram of average numerosity for the action sets within the seven non-optimal runs of XCS at mutation rate 0.01 within the length 15 FSW.

The reason for this difference can be readily located through an examination of the coverage tables and the populations of the runs. An inspection of the poorly performing populations revealed that two fully general classifiers had established themselves and gathered a very large numerosity. Since these classifiers appeared in every action set, they increased the numerosity of the action sets, producing the abnormally high numerosity results pictured in the histogram in figure 4.21. Intriguingly, these classifiers, where present, were all of low accuracy and numerosity within the 0.04 mutation rate XCS.

4.3.7.4 Investigating the Dominance of Fully General Classifiers

It is important to investigate how these over-general and potentially inaccurate classifiers became established. A detailed inspection of the populations revealed that in each such population one of the two classifiers continued to be considered of high accuracy even though it cannot accurately reflect the payoff for its recommended action in all environmental states. The key to explaining the phenomenon therefore lies in an explanation of how these fully general classifiers can be considered to be accurate. It was hypothesised that the continued preference for the fully general classifiers arises when they are discovered whilst the population is in its initial stages, before a good convergence on accurate classifiers. Since all action sets apart from the one leading to the reward state receive payoff as the discounted maximum system prediction of the next match set, in the early states the predictions will be low and, even when discounted, will remain similar. If the action chain is long enough it is possible that the generalists have sufficient time to become accurate over a large proportion of the action chain and therefore gain a larger relative accuracy (fitness) than other competing classifiers. Once this is the case, the general classifiers can exploit the fact that they will obtain many more G.A. opportunities to accumulate numerosity, particularly through the subsumption mechanism. Once a sufficiently high numerosity is established, the fully general classifier will exert a large influence over the action set, keeping the prediction within each action set close to that of the fully general classifier. The inaccurate prediction would become the payoff to earlier classifiers, allowing them to accurately reflect an incorrect payoff and promoting the false accuracy of the over-general. This would in turn enhance the ability of the fully general classifier to proliferate. If the reward input is infrequent in relation to the internal payoff, and if the initial states cannot be disambiguated due to their distance from the reward state, a breeding ground for the fully general classifier would exist. Once fully general classifiers are established, true members of [O] would be considered inaccurate at their true prediction and would be unable to compete to drive out the fully general classifiers. This is a vital hypothesis that identifies a genuine limiting factor on the ability of XCS to find and establish [O] within a long action chain environment, and will be more succinctly expressed as the Domination Hypothesis:

The operation of XCS within a long action chain that provides infrequent stable environmental feedback will lead to the production of self-sustaining fully general classifiers that cover each legal action, are represented with high numerosity, and are considered accurate due to their domination of each action set. In this state the optimal sub-population cannot be established by XCS.

Before this hypothesis is investigated further, the question of why this phenomenon occurred within the 0.01 mutation rate runs and not within the 0.04 runs must be considered. It is possible that it was not seen within the 0.04 runs because of the low number of runs of each experiment that were conducted due to time constraints, although this is unlikely to be the complete answer given the magnitude of the difference. A more plausible explanation lies in the mutation rates themselves. A higher mutation rate, even though only marginally higher, had a profound effect on the operation of XCS in the length 5 and length 10 FSW environments. It is therefore hypothesised that the difference in mutation rate allows sufficient early exploration within the 0.04 rate runs to prevent the early establishment of the fully general classifiers, whilst within the 0.01 rate runs the lower exploration rate that previously encouraged population focus now limits competition with the fully general classifiers too much. If this hypothesis is correct, it would suggest that a profitable area of further work would be to examine the introduction of a dynamic control of mutation rate within XCS.
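The self-reinforcing loop described above can be demonstrated numerically. The system prediction for an action is a fitness-weighted average of classifier predictions; in the sketch below fitness is treated as a per-micro-classifier quantity and so is multiplied by numerosity, and the figures are purely illustrative rather than taken from a logged run:

    def system_prediction(action_set):
        # Fitness- and numerosity-weighted mean prediction of an action set.
        weight = sum(cl['fitness'] * cl['numerosity'] for cl in action_set)
        value = sum(cl['prediction'] * cl['fitness'] * cl['numerosity']
                    for cl in action_set)
        return value / weight

    dominated_set = [
        {'prediction': 45.0, 'fitness': 0.9, 'numerosity': 200},  # fully general
        {'prediction': 178.9, 'fitness': 0.8, 'numerosity': 10},  # true [O] member
    ]
    print(round(system_prediction(dominated_set), 1))  # 50.7, near the generalist

The discounted maximum of this corrupted system prediction is then the payoff passed back to earlier action sets, so the error propagates down the chain exactly as hypothesised.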

[Figure: performance plot of Max Rel Err, Min Rel Err, Relative Error, Population and Iterations against Exploitation Episodes.]

Figure 4.22 - The average performance of ten runs of XCS in the length 15 FSW with mutation rate 0.04 and population-wide subsumption in the G.A.

To investigate the Domination Hypothesis further, the 0.04 mutation rate experiment was re-run with population subsumption re-instated. It was hypothesised that if the fully general classifiers were obtaining full accuracy as suggested, then the introduction of any subsumption mechanism operating outside the immediate action set should serve to further establish these classifiers and increase the likelihood of fully general classifier domination of the population. The results of this experiment are shown in figure 4.22. It was noticeable when comparing figures 4.22 and 4.18 that the XCS performance had decreased even though the population subsumption should have encouraged the formation of [O] through decreased competition in the population. An examination of the populations revealed that three of the ten runs did now contain dominant fully general classifiers that had prevented the formation of [O], and that these were the cause of the drop in performance. Thus, as the hypothesis would suggest, the introduction of population subsumption has aided the dominance of the fully general classifiers.


Further experiments using the population subsumption version of XCS were performed in which the population size limit was reduced from 1200 micro-classifiers to 800 and then 600 micro-classifiers. It was thought that limiting the population size would exert pressure on the fully general classifiers that would limit their formation. In fact, this was an incorrect assumption: the reduction in population size only served to reduce the space for exploration and therefore increase the likelihood that fully general classifiers would dominate the populations. Nonetheless, these findings do serve to support the hypothesis on the formation of these dominant classifiers. A further test of the hypothesis involves extending the action chain once more. If the hypothesis on dominant classifier formation is correct, the increased length of the action chain will make environmental reward less frequent and allow the fully general classifiers to establish themselves more easily. These tests will also allow further verification of the tentative hypothesis on the maximum action chain length beyond which payoff predictions cannot be accurately identified. The length 20 FSW environment was therefore re-introduced to XCS and, following the results of the length 15 experiments, the mutation rate was set at 0.04. Ten runs within the environment were conducted, and the performance was captured and averaged. The average performance of XCS in this environment is shown in figure 4.23.

[Figure: performance plot of Max Rel Err, Min Rel Err, Relative Error, Population and Iterations against Exploitation Episodes.]

Figure 4.23 - The average performance of ten runs of XCS in the length 20 FSW.


Figure 4.23 shows that the system relative error remains high and does not appear to be capable of further reduction. The number of iterations required to reach the reward state is much higher than the optimal. An analysis of the coverage table for each population revealed that only five of the ten runs produced adequate coverage. Examining the populations of those runs that did not produce the expected coverage revealed that each contained dominant full-generality classifiers of high numerosity. These results confirm the hypothesis that the longer the action chain to a regular environmental reward, and the more naturally generalisable states exist within the environment, the more likely a full generalist is to appear.

In order to investigate the formation of the dominant fully general classifiers further, the length 15 FSW with 0.01 mutation was re-run with additional reports added to capture the fitness, accuracy, prediction, and numerosity of the fully general classifiers in the population, averaged over each episode whilst they exist. These details from twenty runs were output to a file and the results corresponding to typical good performance and poor performance populations were extracted and plotted. Figure 4.24 identifies the performance of these classifiers. It was immediately noticeable that some populations did contain fully general classifiers that accumulated large numerosity values and yet were able to remove these classifiers and establish more accurate classifiers (see figure 4.24b). Other good populations were able to remove the fully general classifiers without allowing them to achieve significant numerosity (figure 4.24c). The poorly performing populations retained their high numerosity classifiers (figure 4.24a). Although figure 4.24 provides plots of typical runs, it should be noted that when each of the populations in the run was plotted in this manner, those falling into any one of these three categories of life history produced similar plots of prediction, accuracy, and fitness. This suggests that there is an underlying factor that generates these typical trends.

An analysis of the graphs in figure 4.24 does not support the entirety of the earlier discussion on how the fully general classifiers may be formed, since in all cases the fully general classifiers did appear very early in the exploration (in exploration episodes 28, 37, and 23 respectively) and within similar population sizes (95, 88, and 86 respectively), but with differing results. Thus, early appearance of the general classifier is not in itself the key reason. The key aspect of the graphs that is worthy of further investigation is the fact that, as would be expected, all the while that the accuracy of a classifier is kept higher than zero the numerosity of the classifier increases. The actual level of accuracy appears to be fairly irrelevant. It is the sudden reduction of the accuracy of classifier 2 in figure 4.24b that signals the start of the demise of that classifier. A comparison of classifier 1 in 4.24a and classifier 1 in 4.24b is instructive. Whilst they both display a low accuracy, classifier 1 in 4.24a is able to continuously regain sufficient accuracy to give it the fitness necessary to compete within the G.A. and replicate itself. Classifier 1 in 4.24b is unable to sustain its accuracy and on each period of zero accuracy it loses numerosity. Once numerosity is lost the ability of an inaccurate classifier to dominate the action sets is reduced. This in turn causes the action sets to start to regain their true payoff prediction, making the fully general classifier more likely to be inaccurate. Also, as the accuracy is reduced, so is its ability to compete in the G.A., causing the classifier to be gradually driven out of the population. The flatter prediction curve of classifier 1 in 4.24a when compared with classifier 1 in 4.24b indicates that it has been able to exert sufficient influence to push down the payoff predictions in the action sets so that it can maintain some accuracy. Interestingly, classifier 2, representing the sub-optimal pathway, is more able to control the range of the prediction. This is actually an artefact of the action classifier 2 represents. This action never leads directly to an environmental reward state, and therefore all system predictions in the action sets within which the classifier occurs are fed from other action sets. The control of prediction is thus a simpler task, giving higher accuracy and fitness. Although further investigation clearly needs to be carried out, these results do tend to support the Domination Hypothesis, though not the hypothesis that an early introduction of the fully general classifier is fundamental to the development of action set domination.
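These accuracy dynamics are driven by the standard XCS accuracy calculation, one common form of which is sketched below; the parameter values are typical published defaults, not necessarily those of this parameterisation. A classifier whose prediction error stays below the threshold ε0 is fully accurate; above the threshold its accuracy falls away steeply, and fitness follows the classifier's accuracy relative to the rest of its action set, which is why a dominant generalist that keeps action set predictions near its own value can hold a non-zero accuracy indefinitely:

    ALPHA = 0.1       # assumed fall-off coefficient
    NU = 5.0          # assumed fall-off exponent
    EPSILON_0 = 0.01  # assumed error threshold (fraction of payoff range)

    def accuracy(error):
        if error < EPSILON_0:
            return 1.0
        return ALPHA * (error / EPSILON_0) ** -NU

    def relative_accuracies(errors):
        # Fitness is updated towards each classifier's share of the
        # action set's total accuracy.
        kappas = [accuracy(e) for e in errors]
        total = sum(kappas)
        return [k / total for k in kappas]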

[Figure: prediction, accuracy, fitness and numerosity traces (Max Pred, Max Acc, Max Fit, Min Pred, Min Acc, Min Fit, Num) against exploitation episodes for three cases: a) two fully general classifiers dominating the population; b) two high numerosity fully general classifiers removed from the population; c) two fully general classifiers rapidly removed from the population.]

Figure 4.24 - The prediction, accuracy, fitness and numerosity traces of three typical life histories of the two fully general classifiers that may appear within the Length 15 FSW. Each measure averaged over the preceding 50 exploitation episodes.


4.3.7.5 The Length 20 FSW

The length 20 FSW experiment pictured in figure 4.23 not only provides information for the investigation of the Domination Hypothesis, but also gives further insight into the limits of XCS action chain discovery. Although the performance in figure 4.23 demonstrated a high System Relative Error, a large element of this averaged result is attributable to the five populations that developed dominant fully general classifiers and thus were unable to develop a useful state × action × payoff map. Once the performance of XCS with these poor populations is removed, the true performance of XCS in this length 20 environment can be seen (figure 4.25). Interestingly, the results illustrate that XCS is able to learn the payoff predictions within the last 11 states but produces general classifiers to cover the earlier states. This finding is in agreement with the earlier tentative hypothesis that, with the particular parameterisation used and this form of environment, XCS is able to represent accurately with optimal generalisation up to 11 actions in the action chain. However, this result can to some extent once again be explained by message coding convenience. The classifiers representing states s0-s7 and state s8 are shown in table 4.10. Two classifiers cover states s0 to s7, which can be conveniently represented by generalising over all but the three condition positions that remain '000' across these states. State s8 is much more difficult to generalise alongside the other states and would be unlikely to appear within any other generalisation. Thus, even if the environment were extended to length 21 it may be that the representation of s8 and s9 would remain more accurate than that of states s0 to s7 purely because of the difficulty of including s8 and s9 in a generalisation over the earlier states. Clearly this aspect warrants further investigation.
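The coding convenience argued for here is easy to verify: under the straightforward binary numbering of states, a ternary condition such as #000### picks out exactly the block of states sharing the '000' in its specified positions. A small sketch, assuming states s0 to s39 are simply the 7-bit binary codes of 0-39, consistent with the conditions in table 4.10:

    def matches(condition, message):
        # Ternary match: '#' matches anything, otherwise bits must agree.
        return all(c == '#' or c == m for c, m in zip(condition, message))

    condition = '#000###'
    covered = [s for s in range(40) if matches(condition, format(s, '07b'))]
    print(covered)  # [0, 1, 2, 3, 4, 5, 6, 7]: exactly states s0 to s7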

[Figure: coverage graphs of average Optimal and Sub Op system predictions per state, with linear and logarithmic prediction scales.]

Figure 4.25 - The average system prediction in each action set of the five good performance runs of XCS in the length 20 FSW environment.


Classifier  Pred.    Err.    Acc.    Fit.    N   AS     Exp.
#000###→0   5.7947   0.0007  0.9208  1.0000  43  54.88  28581
#000###→1   8.5204   0.0030  0.8875  1.0000  37  46.78  42037
###10#0→1   13.0919  0.0082  0.6207  1.0000  13  57.14  7395
###100#→0   14.4736  0.0055  0.2645  1.0000  6   51.45  3619

Table 4.10 - The accurate classifiers covering states s0 to s8 within the length 20 FSW.

A thorough investigation of the coding issues is beyond the stated bounds of this investigation. However, additional experiments were performed to investigate whether providing a message coding that was simpler, or alternatively more difficult, to generalise over would allow XCS to better represent the mapping of the length 20 FSW environment. It was hoped that, in addition to any generalisation advantages, a less arbitrary message coding might lift the weak domination of the action sets (optimal route 42.8%, sub-optimal route 49.2%) seen within the accurate sub-population of the 'good' runs of XCS within this environment. The experiments were actually performed within the length 15 FSW, since it was felt that improvements were more likely to be demonstrated within this environment than in the more difficult length 20 FSW environment. The first experiment coded the optimal path states with the binary encoding of the values 0-14, as is currently the case. However, the sub-optimal path states were now given the binary encoding of the values 64-78. This meant that the low four bits of state s0 would be the same as those of state s15, and so on for all successive states. This would allow generalisation over these bits, given that the predictions for at least one action from each of these states should be the same. At the same time, the setting of bit 6 for the sub-optimal states should allow these to be differentiated where necessary (preliminary experiments setting bits 6 and 5 for these states indicated that competing generalisations using one of these two bits could form; it was therefore decided to use bit 6 only and keep bit 5 at 0 throughout). The results show that the population requirements decreased (compare figure 4.26 with figures 4.18 and 4.19). An examination of the populations produced showed that XCS utilised the available generalisation to identify classifiers covering key shared predictions. However, XCS continued to be unable to disambiguate the early states, and three of the ten runs produced fully-general classifiers.
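The first alternative encoding can be stated compactly. A sketch of the message generator it implies, assuming the sub-optimal path states are numbered s15 to s29 as in the original encoding:

    def encode(state):
        # Optimal path states keep the binary codes 0-14; sub-optimal
        # states take the codes 64-78, so corresponding states share
        # their low four bits and differ in bit 6 (bit 5 stays 0).
        value = state if state < 15 else (state - 15) + 64
        return format(value, '07b')

    print(encode(0), encode(15))   # 0000000 1000000: low four bits shared
    print(encode(14), encode(29))  # 0001110 1001110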


[Figure: performance plot of Max Rel Err, Min Rel Err, Relative Error, Population and Iterations against Exploitation Episodes, and coverage plot of Optimal and Sub Op predictions per state on a logarithmic scale.]

Figure 4.26 - The performance and coverage of XCS in the length 15 FSW environment with an input encoding that should encourage generalisation.

An F-Test over the system predictions within the coverage table for this run and the length 15 run depicted in figure 4.18 indicates unequal variances (F=0.9965, F-crit=0.6493), and therefore a one-tailed Wilcoxon test was used to establish whether there was a significant difference. The test was significant at the 0.025 level (T=131, T-crit=137) for the sub-optimal route and at the 0.005 level (T=77, T-crit=109) for the optimal route. This was unexpected, but an examination of the predictions reveals that the predictions for the early states using the new encoding were lower than those in the normal encoding. This possibly reflects an ability to utilise the generalisation to reinforce the early states more adequately (although it is noticeable from the log graph of coverage that the prediction oscillates more within this experiment). It is difficult to identify which of the two coverings is closest to the ideal, and therefore this result must be put down to local fluctuations rather than a trend. An F-Test also revealed unequal variance in the degree of dominance of the action sets between the experiments (F=1.561, F-crit=1.539). A one-tailed Wilcoxon test was therefore also applied to the dominance figures. For both the optimal and the sub-optimal routes the tests were significant at the 0.005 level (T=46 and T=73 respectively, T-crit=109). Since the mean for the first experiment was 68.9% and the mean for the second was 55.9%, this suggests that dominance has been reduced by this encoding. This is an unfortunate result, since the addition of a potentially generality-friendly encoding should have encouraged the population to focus upon fewer general classifiers. Although there was a reduction in population size, due to the increased opportunity for generalisation, the dominance measure (which is a proportional measure) indicates that the increase in focus has not prevented competition. In any case, the increased generality did not enable XCS to disambiguate the early states.

The second experiment changed the coding to try to restrict the likelihood of generalisation over the early states. The hypothesis was that the more difficult it was to identify a generalisation over the early states, the more likely it would be that XCS would find individual classifiers that would represent each state more accurately, so lengthening the action chain that XCS can represent accurately whilst using generalisation. The coding was:

State  Code     State  Code
0      0000000  15     1011010
1      0001111  16     1010101
2      0000011  17     1010000
3      0001100  18     1011111
4      0001001  19     1010011
5      0000110  20     1011100
6      0010000  21     1000000
7      0010001  22     1000001
8      0010010  23     1000010
9      0010011  24     1000011
10     0010100  25     1000100
11     0010101  26     1000101
12     0010110  27     1000110
13     0010111  28     1000111
14     0011000  29     1001000

Table 4.11 - An encoding that may reduce generalisation over early states.

The input over the early states is as different as possible for both the co-located states and the successor states within the first six steps of the environment. From there on it is a normal incremental coding, because the later action sets have payoff predictions that are sufficiently different to prevent generalisation. XCS was run for ten runs with this coding. Five of the ten populations converged onto fully general classifiers, and the coverage plots for the remaining five are shown in figure 4.27. A visual inspection of the results reveals that the coverage chart is much poorer with this coding, with even some of the later states incorrectly represented. Action set dominance was 55.3% for the optimal and 53.3% for the sub-optimal routes, both considerably lower than the 68.9% mean for the original encoding within the length 15 FSW environment. The population size was an average of 265 members, more than the 220 average size with the original message encoding. Thus, whilst generalisation was clearly prevented, the search for suitable generalisations by XCS appears to have compromised the system prediction of the action sets throughout the action chain. These initial experiments, investigating two alternative forms of message coding that might enable XCS to more adequately find accurate payoff predictions in the action sets covering the initial states of a length 15 FSW, therefore indicate that the problem of the disruption of the payoff prediction in the early action sets of an action chain may not be remedied by such simple solutions.

[Figure: coverage graph of Optimal and Sub Op action set predictions per state, logarithmic prediction scale.]

Figure 4.27 - The action set predictions from XCS in the length 15 FSW with an input encoding intended to discourage generalisation.

Considering the increasingly poor performance of XCS within the length 15 and 20 FSW environments, further experiments with length 25 or 30 FSW environments were considered unnecessary and the experimental investigation was terminated at this point.

4.4 Summary of Results

Initial investigation of the convergence of predictions within a 20-state FSW with a preloaded population and without induction identified the need for a new metric that would adequately represent the error within the action chain. A new metric, the System Relative Error was introduced and it was shown that this metric adequately reflects the convergence of predictions regardless of their magnitude. To provide further insight into the predictive capacity of the final population, the use of Coverage Tables was also introduced (see Saxon and Barry, 2000). Baseline investigations using a preloaded population of classifiers within the chosen progressive two-action corridor FSW environment of increasing length demonstrated

194

that XCS was able to rapidly identify the payoff predictions for classifiers in chains of up to length 30. The length 30 limit was chosen as a hypothesised maximum size for use within later experimental work, and does not reflect a true maximum length for XCS without induction mechanisms. In testing the validity of Hypothesis 4.2 within the same test environments, XCS was run without generalisation but with an initially empty population and induction mechanisms enabled. Contrary to expectations, XCS was able to establish and proliferate the optimal population and identify the correct payoff predictions in all of the test environments. The results indicated that XCS was able to complete this task within approximately the same time as that required to establish the payoff predictions when supplied with an initial population. This may, however, be an artefact of this environment, which introduces very little exploration complexity. A further experiment extended the environmental length to a minimum of 40 actions to reward. In this test XCS continued to correctly identify all payoff predictions and utilise them to select the optimal route even though the payoffs for the two actions from the first state were very small. Thus, the hypothesis that the act of establishing the correct payoff predictions could itself cause a breakdown in the ability of XCS to differentiate between very small differences in the payoff predictions for early states in a long chain environment was not substantiated. Investigation of Hypothesis 4.1 involved the application of generalisation in the learning of the optimal sub-population [O]. In application to the length 5 FSW XCS was able to establish and proliferate [O] in a slightly sub-linear time when compared to the increase in coding complexity. However, the dominance of [O] was higher when using a mutation rate of 0.01 than when the mutation rate was 0.04. Further experiments with the introduction of a population-wide subsumption mechanism for new classifiers whose actions did not lie within the parent action-set, similar in nature to Wilson's periodic action-set subsumption operator (Wilson, 1998b) but with less pressure toward the most general accurate classifier in each action set, suggested that the 0.04 mutation rate introduced new classifiers outside the action-set that XCS was less able to eliminate within this environment than had been seen in single-step multiplexer work. The full reasons for this change have not yet been established. Investigation with the length 10 FSW showed that XCS was less able to establish a high dominance of [O] using either mutation rate, although the 0.01 mutation rate still demonstrated a measure of superiority. Furthermore, XCS was unable to consistently

195

select the optimal route to the reward state. Further analysis revealed that XCS was selecting the incorrect route in the earliest states. Once the length 15 FSW was introduced this error became more pronounced, with an average of two sub-optimal actions chosen in each episode. The coverage table revealed that the predictions for alternative actions in the first four states were confused and overlapping, making it impossible to select deterministically the correct pathway. Comparison between mutation rates of 0.01 and 0.04 revealed that the higher mutation rate was now preferable. Further investigation using an alternative message encoding that would allow more of the generalisation opportunities to be exploited by XCS revealed that although XCS did utilise the generalisation opportunities, resulting in a smaller population of macro-classifiers, XCS was unable to utilise generalisation to focus the population on the optimal population. This resulted in a significantly lower dominance of the members of [O] than when using the original binary encoding. Another test that introduced an encoding designed to prevent generalisation to investigate whether XCS would then be able to prevent generalisation causing confusion of prediction in the early states revealed that the encoding used generated very poor XCS performance. Thus, it can be concluded that within this environment the confusion of payoff prediction within the early states cannot be resolved by a naive application of knowledge-dependent solutions. Further experiments with the length 20 FSW revealed that the early state payoff prediction confusion was continued although it was noticed that XCS was consistently able to identify correct payoff predictions for the last 11 states. It was recognised, however, that the 11-state threshold could be an artefact of the encoding used and cannot be used as a definitive limit for the parameterisation used. Within both the length 15 and length 20 FSW environments it was noted that an increasing number of runs where [O] was not fully identified or proliferated occurred. This was identified as due to the emergence of strong full-generality classifiers. Although such classifiers should by definition be inaccurate and therefore be eliminated by XCS, they were able to establish a huge dominance of all action-sets. The Domination Hypothesis was proposed to explain this phenomena and a first verification of the hypothesis was presented by tracking the life history of fully general classifiers in the circumstances where they were established and where they were removed. It was noted that although these classifiers need to be established early in the XCS operation, that this was not itself a pre-requisite to their domination. It appears that they dominate when they are able to establish an early control of the system prediction so that they can continue to hold a non-zero accuracy. Once any period of zero accuracy is established


they will rapidly lose fitness and thereafter numerosity. The Domination Hypothesis represents an important phenomenon within XCS that is worthy of further study.

4.5 Discussion

Previous work within multiple-step environments with XCS has not investigated the limits that may apply to the length of action chains that can be learnt by XCS. Previous investigations with the CFS-C LCS implementation by Riolo (1987a, 1988b, 1989a) demonstrated limits in the formation of long rule-chains by the LCS. Using a single-action corridor FSW of twelve states and a pre-set initial population of classifiers that provided the appropriate rule chain, Riolo (1987a) demonstrated that the rule-chaining mechanism proposed by Holland (1986) would allow the classifiers in the chain to converge to the same prediction value. He further demonstrated, using an environment with two length 10 single-action chains and with a start-state having two actions to enable choice between the two chains, that a seeded population could learn to choose the optimal route. The work presented in this chapter used seeded populations only for baseline results, and demonstrated that even in environments presenting a more complex choice XCS was able to select the optimal path within 120 trials, somewhat faster than the 170 trials of the traditional LCS within the simpler environment. In fact, with non-seeded populations and no generalisation XCS was able to learn the correct pathway within 140 trials. Clearly the lack of comparative parameterisation (and implementation details, such as the explore/exploit regime) makes simple performance comparisons impossible, and it would be naïve to claim from these results that XCS provides a faster or more effective learning environment than a traditional LCS. Nevertheless, it could be hypothesised that the temporal difference technique would allow an earlier correct pathway decision to be made, due to the use of the maximum system prediction for payoff calculation rather than payoff from a few selected classifiers from the match set.

Riolo (1989b) examined the ability of the LCS to establish rule chains under the action of the induction mechanisms - in particular, using the Triggered Chaining Operator (TCO) to establish rule-chains. This work was performed using the GREF-1 FSW environment - a length 4, four-action environment using 16 states to provide four pathways with multiple links between the pathways. Whilst the performance improved with the introduction of the TCO, it remained weak. Riolo identified that although rule-chains were established, the LCS failed to maintain them. Not only did the lower strength of the earlier rules prevent their duplication, thus exposing them to the possibility of later deletion, but there were also parasitic rules that caused payment to earlier classifiers to reduce, and so threatened their existence. Adding a "support"


component to the bids of classifiers in the rule-chain, a niche deletion mechanism to limit competitors within each match set, and a form of Create Effector Operator to introduce classifiers in poorly performing match sets helped to reduce parasites and increase the speed of formation and the maintenance of rule chains. These measures brought performance up to 90% of the optimal performance in 12,000 trials. Riolo concluded that the population needed to be treated more like an "ecology of rules, with niches (states or situations) that support species (co-bidding rules) competing for limited resources (classifier-list space, message-list space, strength)". Whilst the checks-and-balances approach of CFS-C and other traditional LCS approaches sought to achieve this balance, XCS is able to meet these requirements fully. The test environments used in the earlier experiments within this chapter do not match the GREF-1 environment for complexity, but the results with XCS presented earlier suggest that XCS will be able to establish and maintain action-chains for the GREF-1 environment with ease. To test this hypothesis, the GREF-1 environment was re-created and figure 4.28 presents the average of ten runs in this environment using a population limit of 800, no initial population, and all induction operators turned on. The other parameterisation is the same as that used for the earlier experiments within section 4.3.6.3. The output was generated with an additional Performance curve to allow some degree of comparison with Riolo's results.

[Plot: Performance, Relative Error (with max/min bounds), Population and Iterations, plotted as proportions against Exploitation Episodes from 0 to 10,000]

Figure 4.28 - The average performance of XCS in the GREF-1 environment.


It can be seen that XCS achieves optimal performance by 1000 exploitation trials (2000 trials in total), whereas even the best run presented by Riolo (1989b) achieved just over 90% performance by the equivalent of 10,000 trials. An examination of the coverage tables identified that all runs established a high dominance optimal sub-population identifying all action chains to the reward. In addition, XCS had established the optimal generalisations, something that Riolo's work did not seek to achieve. Unfortunately, although these results suggest the superiority of XCS, any direct comparison is foolhardy given the differences in operation and parameterisation between these LCS implementations. These findings must therefore be expressed as indicating that XCS is able to learn a solution and achieve a better on-line performance within the GREF-1 environment than Riolo's CFS-C. Riolo (1989b) compared his findings to those of Grefenstette (1987) using a Pittsburgh-based LCS within the same environment. Grefenstette's results produced an on-line performance of 37% in the equivalent of 250,000 trials, inferior to Riolo's CFS-C and therefore also to the performance of XCS. However, none of these experiments sought to optimise their parameterisation, and so definitive conclusions must not be drawn.

Lanzi (1997b) noted the problems XCS faced when seeking to solve the Woods-14 test environment; these were discussed in section 4.1 and within the introduction to section 4.3. Whilst the Woods-14 environment requires an action chain of length 18 to reach the reward state, the structure of the environment itself makes a direct application of the results of this chapter to that environment problematic. Within the Woods-14 environment many of the successive steps require a different action to be undertaken, preventing generalisation over states and potentially encouraging the more accurate representation of the payoff prediction over the early states. As Lanzi notes, the increased number of available actions brings its own penalty, making it difficult for XCS to explore the environment sufficiently to reach the reward state and begin to feed back the predictions. However, the fact that all other actions are effectively "null actions" - actions leading back to the same state - means that the payoff received for these actions will be one discount less than the payoff for the state itself. A similar payoff reduction is seen within the test environments used within this investigation, though generated in a different manner. Thus, over-generalisation may also be seen within the Woods-14 environment. It appears that Lanzi (1997b) utilised a version of XCS without Subsumption Deletion (see Wilson, 1995), and this would have the joint effects of preventing population focus and therefore also preventing dominance


by a full-generality classifier. Thus, the results presented by Lanzi are not fully comparable with the results presented within this chapter. It is interesting to note the potential relevance of Lanzi's specify operator (Lanzi, 1997a) to the results in this chapter. Although specify was introduced to tackle the problem of over-generalisation due to exploration inequality, it could be the case that a careful application of the specify operator could counter the development of full-generality classifiers. This is a matter for further work beyond the scope of this research.

4.6 Conclusions and Further Work

This chapter has sought to identify some limits on the length of action chain that can be learnt within XCS. Using a progressive two-action corridor FSW-based environment with a single start state and single terminal state, it has been shown that whilst XCS can reliably and rapidly establish the optimal population and the correct payoff prediction mapping for optimal action chains of up to (and possibly beyond) 40 actions when no generalisation is used, the introduction of generalisation produces inadequate payoff predictions in the early states once the predictions are sufficiently close to be generalised over without reduction in accuracy. This can occur as early as 11 action-steps from the terminal states, although the precision of this limit may be dependent upon the generalisation convenience of the encoding. Whilst it is recognised that the limits identified are dependent upon the parameterisation used, the exploration complexity of the test environment, and the exploration strategy adopted, it is clear that XCS still faces clear and restrictive limits upon its ability to establish the correct payoff prediction mapping over the states of a test environment in the face of generalisation pressure.

Further investigation is required to establish the nature of these limits with alternative parameterisation, particularly over the γ parameter, to see whether a higher value, as used within other Temporal Difference learning work, could extend the length of action chain over which accurate payoff-prediction maps can be established (see the illustrative calculation at the close of this chapter). It is likely that some upper value of γ will be reached at which point generalisation will combine early states due to their prediction similarity. Further work is also required to establish limits in other forms of environment. Although the environment used was designed to limit the complexities of exploration, an environment that provides a reward of zero for non-optimal actions may provide a clearer distinction between the predictions and thus allow XCS to identify longer chains


of optimal actions. Alternatively, an environment with a sub-optimal action that leads back to the same state may cause unequal exploration of the environment, preventing progress towards the terminal state and further hindering the discovery of stable payoffs. Finally, the complexity of exploration was deliberately controlled, and further work is required to establish the limits of action chain length in the face of increasing numbers of alternative routes, and the ability of XCS to choose between alternative reward magnitudes as the number of alternative action routes increases.

The Domination Hypothesis was introduced to explain the appearance of fully general classifiers dominating the population in the longer FSWs under generalisation pressure. Although a preliminary investigation of the hypothesis was presented, there is much scope for further investigation of the causes of these classifiers and of potential solutions to the problem of fully general classifiers. A naive solution that simply prevents the use of fully general classifiers, since by definition they can never be accurate, is not adequate. As Wilson (1998b) noted, fully general classifiers can act as a form of advanced Create Effector Operator, distributing actions to newly discovered environmental niches. It may be that preventing the formation of fully general classifiers whilst encouraging action-sharing covering (similar to the covering operator reported in Butz, 2000) would enable XCS to establish more accurate payoff predictions within multiple-step environments similar to the test environments used within this research work.

These results contradict the common perception of XCS. For example, Lanzi and Riolo (2000) state that: "Wilson's ZCS can develop short action chains but, as shown by Cliff and Ross (1994), it cannot learn long sequences of actions because of the proliferation of over-general classifiers. ... The problems that afflicted ZCS appear to be solved in XCS." In relation to the research objectives of this research programme, the limits discovered for the action chaining ability confirm hypothesis 1 and serve to demonstrate the requirement for a structured or hierarchical approach in order to apply XCS to large multiple-step environments.
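To make the scale of the generalisation limit concrete, the following short calculation is offered (an illustration added here, not part of the original experimental work; it assumes γ = 0.71, R = 1000 and the 1% error boundary used in this research). It computes the distance from the reward at which the discounted payoffs of adjacent states first differ by less than the error tolerance, beyond which accuracy-based fitness can no longer separate them:

    # Sketch: distance from the reward at which the discounted payoffs of
    # adjacent states first differ by less than the 1% error boundary
    # (assumes gamma = 0.71, R = 1000 and epsilon_0 = 0.01 of the reward).

    GAMMA, R, EPSILON_0 = 0.71, 1000.0, 0.01

    n = 0
    while True:
        p_near = R * GAMMA ** n          # prediction n steps from reward
        p_far = R * GAMMA ** (n + 1)     # prediction one step further back
        if p_near - p_far < EPSILON_0 * R:
            break
        n += 1

    print(f"adjacent predictions first indistinguishable {n} steps from "
          f"the reward: {p_near:.1f} vs {p_far:.1f}")

Under these assumptions the boundary falls at roughly ten to eleven steps from the reward, which is consistent with the 11-state limit observed empirically; raising γ moves the boundary further from the reward, supporting the proposed investigation of higher γ values.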


Chapter 5

AN INVESTIGATION OF THE 'ALIASING PROBLEM'

5.1 Background

Lanzi (1998a) identified a significant problem for XCS learning within a certain class of multi-step environments - the aliasing of states, creating a non-Markovian environment. Within an environment it is easy to derive an input state which is repeated elsewhere in the environment with a different payoff value. For example, in this artificial Woods environment:

OOOOOO
O....F
OOOOOO

Using the Woods-2 encoding adopted by Wilson (1995), the two blank positions in the centre of the environment will each generate the same input message, 010000010010010000010010, but the expected payoff (with γ=0.71, R=1000) from the right central position will be 504.1 and from the left central position will be 357.911. Since these two positions will be represented by a single classifier, the payoffs that this classifier receives will vary, and the classifier will therefore be adjudged inaccurate. Clearly this is a problem in a traditional LCS as well as in XCS, but in a traditional LCS the provision of payoff, irrespective of its variance, will cause the classifier to be maintained if the gross payoff is sufficient. In XCS, however, the classifier will be inaccurate and so is likely to have a very low probability of selection within the GA, and will not be duplicated by subsumption. The result will be the elimination of the classifier, followed by its constant re-introduction by detector covering and exploration by effector covering in the vain attempt to find an accurate classifier. Lanzi attempts to overcome this by using an additional memory mechanism derived from that used by Cliff and Ross (1994) within ZCS and proposed by Wilson (1994, 1995). Lanzi first presented this memory mechanism in Lanzi (1998a), where he demonstrated that the mechanism was able to disambiguate internal states with aliased input within the Woods101 environment. In Lanzi (1998b) he identified that this


mechanism was imperfect because it could generate the same memory configuration for two aliased positions. He therefore introduced two modifications that link the internal memory register setting more closely to the external actions of the Animat and the provision of payoffs (see Lanzi (1998b) for further details). In particular, he limited the choice of internal state to a deterministic selection in exploration, hypothesising that XCS was unable to learn how to optimally utilise the state whilst learning the control policy. He demonstrates that this mechanism is sufficient to disambiguate the environment and therefore cause separate classifiers to be generated for each of the aliased inputs.

The work of Lanzi is of particular interest to the aims of this research programme. A potential use of hierarchical rule structures could be to allow the development of subroutine-like clusters of classifiers that can be invoked to allow an Animat to progress over a recognised environmental regularity. In this work the term Environmental Regularity is defined as follows: an area of the environment within which the states will produce the same messages and respond in the same manner to a given message as states within one or more other areas of the environment. Put more formally, an Environmental Regularity occurs where, given an FSW with states sx ∈ N (the set of states within the FSW representing the environment), there are two or more distinct subsets of states s and s' such that:

∀si ∈ s . ∃! sj ∈ s' . reg(si, sj)

where reg(s1, s2), for s1 ∈ s and s2 ∈ s', holds when:

m(s1) = m(s2) ∧ ∀α ∈ a(s1) . (α ∈ a(s2) ∧ ((e(s1, α) ∉ s ∧ e(s2, α) ∉ s') ∨ reg(e(s1, α), e(s2, α))))

where s ⊂ N, s' ⊂ N, s ∩ s' = {}, m(sx) is a function returning the message produced by state sx, a(sx) is a function returning the set of actions which are legal within state sx, and e(sx, α) is a function returning the state reached by the edge corresponding to action α within sx. Unfortunately, if such a regularity occurred within the environment at positions which led to differing payoffs, the states within each regularity would act as "Aliasing States", and Lanzi's investigations demonstrate that these states cannot be accurately represented. Environments containing such an environmental regularity are commonly known as non-Markovian environments.
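The definition above can also be read operationally. The following sketch (an illustrative reading, not code from the thesis; the FSW representation via the message, actions and edge mappings is an assumption of the sketch) tests whether two state subsets of an FSW form an environmental regularity:

    # Sketch of the environmental-regularity test defined above
    # (an operational reading, not code from the thesis). An FSW is
    # assumed to be given as three mappings with hypothetical names:
    #   message[s]   -> the message produced by state s, m(s)
    #   actions[s]   -> the set of legal actions in s, a(s)
    #   edge[(s, a)] -> the state reached by action a from s, e(s, a)

    def reg(s1, s2, sub1, sub2, message, actions, edge, seen=None):
        """True when s1 (in sub1) and s2 (in sub2) correspond."""
        if seen is None:
            seen = set()
        if (s1, s2) in seen:             # guard against cycles
            return True
        seen.add((s1, s2))
        if message[s1] != message[s2]:
            return False
        for a in actions[s1]:
            if a not in actions[s2]:
                return False
            n1, n2 = edge[(s1, a)], edge[(s2, a)]
            if n1 not in sub1 and n2 not in sub2:
                continue                 # both successors leave the subsets
            if not reg(n1, n2, sub1, sub2, message, actions, edge, seen):
                return False
        return True

    def environmental_regularity(sub1, sub2, message, actions, edge):
        """Each state in sub1 corresponds to exactly one state in sub2."""
        if set(sub1) & set(sub2):
            return False                 # the subsets must be disjoint
        return all(
            sum(reg(s1, s2, sub1, sub2, message, actions, edge)
                for s2 in sub2) == 1
            for s1 in sub1)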


Although a resolution of this problem for hierarchical structures is presented in section 7.5, it became apparent from initial investigations at this stage that there existed at least one subset of the Aliasing Problem that could be addressed by a much simpler mechanism than the memory mechanism proposed by Lanzi. This area of the Aliasing Problem was therefore investigated. The studies presented in this chapter identify a distinct problem area that will be termed the Consecutive State Problem. It is hypothesised that this is a subset of the Aliasing Problem, and that it could therefore be subject to a remedy which, whilst solving the Consecutive State Problem, is not a solution to the Aliasing Problem as a whole. These claims are investigated empirically, and it is shown that such a solution does exist. The chapter concludes with recommendations for further investigations that arise from this work. This chapter contributes to XCS research in a number of ways. Firstly, it provides a replication of findings regarding the problems arising from the presence of aliased states within an environment. Secondly, it clarifies the phenomenon identified by Lanzi (1998a, 1998b) by simplifying the environment used, demonstrating that this phenomenon could occur commonly within even trivial multiple-step learning problems. Thirdly, it identifies the extent to which the aliasing inaccuracy will affect classifiers representing preceding states within the environment (given the situation where the aliasing classifier is not deleted by the induction algorithms). The situations within which aliasing states exist are enumerated, and as a result two new means of coping with one form of aliasing are described. It is demonstrated that one of these proposals is a sufficient solution and can be readily incorporated into XCS without modification to its standard operation in other circumstances.

5.2 Hypotheses

The forms of aliasing that occur over spatially separated states and the form discussed in this chapter will be distinguished in the following discussion by terming the first the Separate State Problem and the second the Consecutive State Problem. Consider the Markovian FSW (which will be denoted FSW-5) consisting of a start state s0, further states labelled s1 to s3, and a terminal state labelled s4. Each of the states s0 to s3 is the source of a directed edge drawn so that for all states si where 0 ≤ i ≤ 3 a single edge emanates from si and terminates in si+1. Each edge is labelled so that ei = i + 1, which is the action required to traverse that edge. Every state emits a signal d capable of unambiguous sensory detection such that for all states si, d = i. The start state is s0, and upon reaching s4 a reward R is given and the FSW is reset to s0.
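A corridor FSW of this kind is simple to realise in code. The sketch below (an illustrative class, not the thesis implementation; the class name and interface are assumptions of the sketch) implements FSW-5, and the aliasing variant FSW-5A introduced below differs only in its signal and edge-label tables:

    # Sketch of the corridor FSW described above (illustrative only).

    class CorridorFSW:
        def __init__(self, signals, edge_labels, reward=1000.0):
            self.signals = signals        # signal d emitted by each state
            self.labels = edge_labels     # action labelling each state's edge
            self.reward = reward
            self.state = 0
            self.terminal = len(signals)  # s4 for the five-state FSW

        def sense(self):
            return self.signals[self.state]

        def act(self, action):
            """Traverse the single edge; reward and reset at the terminal."""
            assert action == self.labels[self.state], "illegal action"
            self.state += 1
            if self.state == self.terminal:
                self.state = 0
                return self.reward
            return 0.0

    fsw5 = CorridorFSW(signals=[0, 1, 2, 3], edge_labels=[1, 2, 3, 4])
    fsw5a = CorridorFSW(signals=[0, 1, 1, 3], edge_labels=[1, 2, 2, 4])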


[Diagram: FSW-5 - a linear corridor of states s0 (d=0), s1 (d=1), s2 (d=2), s3 (d=3) and the reward state s4 (R), with the single edges from s0 to s3 labelled 1, 2, 3 and 4 respectively]

Four classifiers are required to traverse this FSW:

1) 0 → 1
2) 1 → 2
3) 2 → 3
4) 3 → 4

On reaching s4 classifier 4 receives a reward R. Over successive trials classifier 4 will converge to the prediction p = R (eq. 5.1), and classifiers 1 to 3 will converge to the predictions γ^(3-c)·R (where c is the classifier number less one). Now consider a modification to this FSW (which is denoted FSW-5A) such that s1 and s2 emit the signal d=1, with the edges s1→s2 and s2→s3 both labelled 2. Three classifiers are now required to traverse this FSW:

[Diagram: FSW-5A - as FSW-5, but s1 and s2 both emit d=1 and the edges s1→s2 and s2→s3 are both labelled 2; the reward state s4 (R) is unchanged]

1) 0 → 1
2) 1 → 2
3) 3 → 4

Again, upon reaching s4 classifier 3 receives a reward R. Classifier 3 will converge so that p = R over the same number of trials as for classifier 4 within FSW-5. Let us assume that the prediction of classifier 3 has converged to this value. For moving from s2 to s3, classifier 2 will consistently be given a payoff γR. However, for moving from s1 to s2 classifier 2 will also be given a payoff of γP2 (where P2 is the hypothetical stable prediction of classifier 2). If the learning rate β within the Widrow-Hoff mechanism were 1, the prediction would oscillate within the limits γ²R and γR [from eq. 4.2]. For simplicity, let us assume that P2 varies around the average of the payoffs that would have been received at the states had they not been aliased, (γR + γ²R)/2, and


that the value of β is less than unity. In this case the variance will reduce to ±β(γR − (γR + γ²R)/2) = ±β(γR − γ²R)/2. Unless the value of β is very small, or the aliased states are sufficiently far from the reward source for the successive application of the discount factor γ to reduce the payoff to a very small amount, the variance will remain sufficient to produce an oscillation in P2 which is greater than ε0 (see footnote 35). This argument leads to hypothesis 5.1:

Hypothesis 5.1: The aliasing problem is not restricted to independent states which exist at separate locations within a world, but will also be seen whenever two or more consecutive states admit to the same sensory perception (given the limitations of the sensory system of a given Animat) and together lead to a later consistent reward.

The consecutive state problem is encountered on many occasions within simple robotic control scenarios where the robot has only simple on-board detectors. For example, the common wall-following behaviour (for example see Lin, 1993) requires many consecutive actions whilst receiving the same wall positioning information until a junction or obstacle is encountered. Furthermore, the limitations of accurate sonar positioning techniques will often mean that a robot that has travelled away from the edge of a room will be in a position where all detectors provide maximum range feedback for many moves whilst the room is traversed. Since such scenarios are common in robotic control, and many parallels can be identified in other similar domains, this form of the aliasing problem will, in all likelihood, be encountered more frequently than the separate aliased state problem. Thus, the following lemma is added to Hypothesis 5.1:

Lemma 5.1: The consecutive state problem is likely to be encountered more frequently than the separate aliased state problem within trivial non-Markovian worlds.

Now consider the stage in the operation of an XCS within FSW-5A where the payoff received on the transition from the last aliased state into s3 is constant at γR. At the start of the next trial classifier 1

[Footnote 35: Of course, the value P2 will not vary exactly around (γR + γ²R)/2 due to the Widrow-Hoff update mechanism. The value P2 will be below the average of the feedback to the equivalent non-aliased states, because the first update in each transition past s1 and s2 will, in effect, be averaging P2.]


moves the FSW from state s0 to state s1. At this iteration (which will be called iteration i) the prediction of the classifier covering the aliasing states is Pi. On the next iteration the prediction Pi+1 of classifier 2 (the aliasing classifier) becomes Pi + β(γPi − Pi), causing the classifier's prediction to reduce towards γPi. In the following iteration Pi+2 becomes Pi+1 + β(γR − Pi+1), which increases the prediction towards γR. The update will regularly oscillate between the targets γPi and γR, so that the prediction at iteration i+2 is equal to that at iteration i. The preceding classifier therefore receives a constant prediction value so long as all the aliased states are visited within each trial.

Consider further the case where each aliased state may not be visited within each trial. In FSW-5, the payoff delivered to classifier 1 would be γ³R (since c=0 within the payoff equation γ^(3-c)·R). Within the circumstances described in FSW-5A, classifier 1 receives payoff from classifier 2, whose maximum prediction oscillation limits lie between γ²R and γR (see the explanation for Hypothesis 5.1). Classifier 2 will converge towards γR if all preceding trials started from s2, but will converge to below but near (γR + γ²R)/2 if all preceding trials started from s0 or s1. Therefore the maximum prediction limits of classifier 1 will be from γ²R down to [approximately] γ(γR + γ²R)/2. Although the learning rate β will reduce the degree of variance in the error in prediction calculated by the preceding classifier, for large payoffs received within aliased states late in the payoff chain, and where ε0 is small, the variance in payoff may be sufficient to cause inaccuracy in the preceding classifier. This leads to the ancillary hypothesis:

Hypothesis 5.1.1: The classifier covering the non-aliased state immediately preceding the aliased states will be able to achieve an accurate payoff prediction in cases where each aliased state is visited in each trial, but can be considered inaccurate in cases where the aliased states are not visited within each trial.

If the classifier covering the Aliasing States becomes inaccurate (Hypothesis 5.1), and XCS selects classifiers for reproduction using their fitness (based upon their relative accuracy), then the classifier that covers the Aliasing States will have accuracy 0. This will generate a fitness that is no better than that of any other competing classifier within its action set, and will therefore reduce its likelihood of selection for involvement in the genetic algorithm. Since the classifiers covering non-aliased states (with the exception of those covering the immediately preceding states - see Hypothesis 5.1.1) will eventually be classified as


accurate, these will have a high fitness and therefore be selected by the GA proportionally more often. Their numerosity will then increase, putting pressure on all inaccurate classifiers. Ultimately the combination of selection and population pressure should eradicate the classifiers covering the Aliasing States. Whilst the Covering Operators will rapidly replace these classifiers with others, no replacement will be deemed more accurate. This argument is expressed within hypothesis 5.1.2 as follows:

Hypothesis 5.1.2: The aliasing of consecutive states will generate inaccuracy in the classifier which matches the sensory input and moves the Animat to the next aliasing state. The inaccurate classifier will rapidly be replaced by the action of the GA without any suitable replacement available to generate a greater degree of accuracy [operation of XCS]. This will prevent the formation of an accurate State × Action × Payoff mapping and lead to the perpetual ineffectiveness of the classifier population where no alternative set of actions is available.

Consider the 5-state aliasing problem of FSW-5A. Given that there are two aliasing states, one bit of memory appended to the input message created by the XCS detectors can be used to solve this FSW as follows. Classifiers covering all non-aliased states can ignore the setting of this bit by adding a wildcard at the designated bit position. When in state s1 or s2 the bit can be used to differentiate between the states by using its 0 value to identify s1 and its 1 value to identify s2 [the mechanism for achieving this will not be discussed here]. In order to create accurate classifiers the GA within XCS will then discover two separate classifiers distinguished by this bit value, each of which will accurately reflect the relevant discounted payoff value (by the operation of the XCS). If the FSW was changed so that the aliasing states were states s1 and s3, or any other combination of two states, the same technique could be used. This argument can be trivially extended to reflect two aliased states on joined but distinct state chains at different payoff positions, or in chains producing different payoff amounts. Thus, the following lemmas are introduced:

Lemma 5.2.1: The memory solution (Lanzi, 1998a, 1998b) is a general solution that is applicable to all occurrences of the Aliasing Problem where the number of aliasing states is known.
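The memory solution of Lemma 5.2.1 can be pictured concretely. The sketch below is illustrative only (the encoding shown and the function name are assumptions; the policy for setting the memory bit is left abstract, as in the text above):

    # Sketch of the one-bit memory extension described above.

    def extended_message(detector_message: str, memory_bit: int) -> str:
        """Append the internal memory bit to the detector message, so that
        classifiers may match on it or wildcard it with '#'."""
        return detector_message + str(memory_bit)

    # In FSW-5A both aliased states might produce the detector message
    # '00001' (an assumed encoding); with memory, '000010' can identify s1
    # and '000011' can identify s2, so the GA can evolve two accurate
    # classifiers distinguished by the final bit.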


Lemma 5.2.2: The Consecutive State Problem is a specialisation of the Separate State Problem.

Consider again the 5-state aliasing problem of FSW-5A. The inaccuracy of the classifier covering s1 and s2 was due to the discounting of the payoff between invocations of the classifier. If the discounting mechanism were disabled until a change of input or a change of invoked classifier (and the disjunction is important), then the classifier would receive one payoff for its full time of activity and the payoff would be consistent, thereby making the classifier accurate (footnote 36). If the FSW were now changed so that the aliasing states were states s1 and s3, or any other combination of two non-consecutive states, the same technique could not be used to achieve classifier accuracy, due to the correct discounting of the payoff for any intervening classifiers. Clearly the same argument can be applied to any situation where two aliased states on joined but distinct state chains at different payoff positions, or in chains producing different payoff amounts, are considered. Therefore, there is a solution to the Consecutive State Problem that does not address the Separate State Problem and cannot address the Aliasing Problem as a whole.

Hypothesis 5.2: Using action persistence there is a simple solution to the Consecutive State Problem that does not address the Separate State Problem.

The provision of Hypothesis 5.2 might initially appear to be unhelpful considering that we already have a solution to the whole problem. However, mechanisms that utilise the disabling of payoff can be constructed which are trivial to implement, require no extra resources on behalf of the Classifier System, and do not adversely affect nor change the normal operation of the XCS. It was argued earlier that the Consecutive State Problem is more likely to occur than the Separate State Problem, and therefore it would appear highly appropriate to explore this avenue in order to identify a potentially useful addition to the XCS.
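The idea behind Hypothesis 5.2 can be sketched as a small variation of the standard payoff backup (an illustration of the idea only, not the implementation investigated later in the thesis):

    # Sketch of the payoff update suggested by Hypothesis 5.2: discounting
    # is suppressed while the input, and hence the invoked classifier,
    # remains unchanged.

    GAMMA, BETA = 0.71, 0.2

    def backup(prediction, next_value, input_changed):
        """Widrow-Hoff update of a classifier's payoff prediction.

        While the Animat remains within a run of identically-sensed
        (consecutive aliased) states, input_changed is False and the
        payoff is passed back undiscounted, so it stays consistent
        across the whole run.
        """
        target = GAMMA * next_value if input_changed else next_value
        return prediction + BETA * (target - prediction)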

5.3 Experimental Investigation

5.3.1 Investigating Hypothesis 5.1

In order to empirically test hypothesis 5.1, FSW-5 was created to provide baseline results for the Markovian environment. The four classifiers required to solve this FSW

[Footnote 36: A number of ways of achieving this kind of mechanism can be devised.]


were inserted into the population and all induction algorithms were turned off. Apart from this modification to the XCS, all other aspects of XCS operation were kept as normal; in particular, the parameterisation was:

condition size 5; action size 1; R (reward) 1000; N (population size) 400; Pi (initial population size) 0; γ (discount factor) 0.71; β (learning rate) 0.2; θ (GA experience) 25; ε0 (minimum error) 0.01; α (fall-off rate) 0.1; Χ (crossover probability) 0.8; µ (mutation probability) 0.04; pr (covering multiplier) 0.5; P(#) (generality proportion) 0.33; pi (initial prediction) 10.0; εi (initial error) 0.0; fi (initial fitness) 0.01; P(x) (exploration probability) 0.5; fr (fitness reduction) N/A; m (accuracy multiplier) 0.1; s (subsumption threshold) 20

Table 5.1 - The parameterisation of XCS for the aliasing tests
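For reproduction, the same parameterisation can be captured directly in code (a sketch; the dictionary keys are descriptive names chosen here, not identifiers from the thesis implementation):

    # Table 5.1 as a parameter dictionary (names are illustrative only).
    XCS_PARAMS = {
        "condition_size": 5, "action_size": 1,
        "reward_R": 1000, "population_N": 400, "initial_population": 0,
        "gamma": 0.71, "beta": 0.2, "theta_GA": 25, "epsilon_0": 0.01,
        "alpha": 0.1, "chi_crossover": 0.8, "mu_mutation": 0.04,
        "covering_multiplier": 0.5, "P_hash": 0.33,
        "p_init": 10.0, "eps_init": 0.0, "f_init": 0.01,
        "P_explore": 0.5, "fitness_reduction": None,
        "accuracy_multiplier": 0.1, "subsumption_threshold": 20,
    }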


Figure 5.1 The rapid decline in Relative System Error in one run of FSW-5


The condition size was set at 5 bits to allow for future tests. The new measure, System Relative Error, together with error bars to show the spread of the error statistic, was used to demonstrate the rate of convergence on this five-state test. The other measures from the Woods tests were discarded as irrelevant to this test. Figure 5.1 provides the result of a run of this test (one run suffices since no scope for non-determinism exists within this limited FSW implementation), and demonstrates that, as expected from the earlier tests in Chapter 4, the population rapidly converges to full accuracy. The FSW was then modified to conform to the non-Markovian FSW-5A environment, and the three test classifiers were entered into the initial population of the XCS. The induction algorithms were not used, and the parameterisation was kept the same as in the previous experiment. The experiment was run to test for the hypothesised appearance of the aliasing problem.


Figure 5.2 The failure of Relative System Error to decline in the presence of two consecutive aliased states demonstrated within one run of FSW-5A.


Figure 5.2 demonstrates clearly that the Aliasing Problem did occur. In order to ascertain the location of the error, the experiment was re-run with additional reporting added on each exploitation episode to track the prediction of each classifier. It was hypothesised that the source of error lay with the classifier covering the aliased states. Figure 5.3 presents the results of tracking the classifiers in FSW-5A when compared with FSW-5 (results averaged from 25 runs). It was clear that the classifier covering the aliasing states was the source of prediction oscillation. Further runs recorded the fall in Relative Error (see section 4.3.2) of each classifier, and figure 5.4 illustrates that the sole source of error was the classifier covering the aliased states. These results confirm that classifier 2 is the source of the rise in System Relative Error, thus confirming Hypothesis 5.1. Notice in figure 5.3 that the average prediction of the aliased classifier 2 is lower than the average of the predictions of classifiers 2 and 3 in FSW-5, demonstrating that classifier 2 in the aliased environment is rewarding itself with its own discounted prediction, thereby lowering its average prediction value.
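The oscillation itself is easy to reproduce outside XCS. The sketch below (an illustration following the analysis of section 5.2, not the experimental code) applies the Widrow-Hoff update to the aliased classifier 2 of FSW-5A, alternating its two payoff targets γP2 and γR within each trial:

    # Sketch: prediction oscillation of the aliased classifier 2 in FSW-5A.
    # Within each trial the classifier is updated twice: once with its own
    # discounted prediction (the s1 -> s2 move) and once with gamma * R
    # (the s2 -> s3 move).

    GAMMA, BETA, R = 0.71, 0.2, 1000.0   # parameterisation of Table 5.1

    p2 = 10.0                            # initial prediction, pi
    for trial in range(100):
        p2 += BETA * (GAMMA * p2 - p2)   # target gamma * P2 on entering s2
        p2 += BETA * (GAMMA * R - p2)    # target gamma * R on entering s3

    low = p2 + BETA * (GAMMA * p2 - p2)
    high = low + BETA * (GAMMA * R - low)
    print(f"stable oscillation: {low:.1f} <-> {high:.1f}")

With the Table 5.1 parameterisation this settles to an oscillation of roughly 543 to 576, below the 607 average of the corresponding non-aliased predictions, in line with figure 5.3.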


Figure 5.3 Comparison between predictions of classifiers in FSW-5 (Cl. lines) and FSW-5A (Al-Cl. lines), showing the oscillation of the aliased classifier's prediction. Lines from top to bottom: Al-Cl.3 and Cl.4 (following the same curve), Cl.3, Al-Cl.2, Cl.2, Al-Cl.1, and Cl.1.



Figure 5.4 Relative Error plots indicating the source of error in one run of FSW-5A

Further tests were performed to increase the number of aliased states from 2 to 4 and then 6, in order to investigate the effect of increasing the number of consecutive aliased states. For these tests the FSW, now termed FSW-9, was expanded to a total of 9 states, with s0 being the start state and s8 the reward state. States s0 and s7 are never aliased. Aliasing in the 2-state test is limited to s5 and s6; this is expanded to the states s3 to s6 for the four-state test, and further to states s1 to s6 for the six-state test. The aliasing environments are denoted FSW-9A-2, FSW-9A-4, and FSW-9A-6 respectively. Figure 5.5 illustrates the effect of increasing the number of aliased states on the predictions of the classifier that leads to s7. The figure demonstrates that an increase in the number of consecutive aliased states will increase the range of oscillation in the prediction of the classifier that covers those aliased states, as would be expected. Once again it can be seen that the stable prediction of the classifier covering the aliasing states does not simply decrease to oscillate about the average prediction of the discounted payoffs that would have been received by the classifiers covering the states in an equivalent non-aliased FSW. Rather, it oscillates about a lower prediction, since the


classifier is feeding its own discounted prediction back to itself and is therefore feeding back the discount of its moving average.
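A rough fixed-point estimate makes this self-feeding effect explicit. If, with k consecutive aliased states, the aliased classifier is assumed to receive the target γP* on k−1 of every k updates and γR once (an idealisation of the argument above, offered here for illustration and assuming a small β), its stable prediction P* satisfies P* = ((k−1)·γP* + γR) / k, giving P* = γR / (k − (k−1)·γ). For k = 2 and γ = 0.71 this yields P* ≈ 550, below the 607 average of the two equivalent non-aliased predictions, and P* falls further as k increases, matching the trend seen in figure 5.5.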


Figure 5.5 - The change in oscillation range and prediction stabilisation with increase in the number of aliased states in FSW-9A.

These results confirm that the existence of consecutive aliased states represents a threat to the ability of a classifier to identify a stable prediction, and that as the number of consecutive aliased states is increased the prediction oscillation will increase. The result is that the classifier covering the aliased states will be unable to achieve accuracy within the XCS. Thus, hypothesis 5.1 is confirmed.

5.3.2 Investigating Hypothesis 5.1.1

The results of the experiments used to empirically investigate Hypothesis 5.1 can be applied to address Hypothesis 5.1.1. Examining figure 5.3 it can be seen that the prediction of classifier 1 settles at a higher value within FSW-5A than was the case for classifier 1 within FSW-5, as would have been expected. However, the


payoff given to the preceding classifier does not oscillate, indicating that the fixed point prediction of the classifier covering the aliased states remains stable at the payoff point, as predicted. Figure 5.4 confirms this finding, demonstrating that classifier 1 exhibits no error in its prediction. The stability of the prediction of the classifier immediately preceding the aliased states in each of FSW-9A-2, FSW-9A-4, and FSW-9A-6 was captured in a further run within each of these environments. These predictions are presented in figure 5.6. Figure 5.6 illustrates that the classifier immediately preceding the classifier which covers the aliased states remains accurate within each of the FSW-9A-2, FSW-9A-4, and FSW-9A-6 environments, confirming part 1 of Hypothesis 5.1.1.


Figure 5.6 - The prediction of the aliased classifier has no effect on the prediction stability of the immediately preceding classifier for 2, 4, and 6 consecutive aliased states in FSW-9A.

The experiments with 2, 4, and 6 consecutive aliased states were repeated with the same parameterisation, but allowing states s0 to s7 to be start states, with the start state chosen arbitrarily from the available start states at the beginning of each trial. Figure 5.7 is


derived from a 'typical' run and was plotted over a larger number of iterations than figure 5.6 to allow the earlier states to achieve a similar experience level to those in figure 5.6. The figure illustrates that under these new conditions the aliasing states do clearly affect the stability of the prediction of the preceding classifier. Although prediction is affected, the effect on accuracy could be lower, depending on the amount of payoff oscillation and the ability of the recency-weighted MAM mechanism to respond to changes in payoff levels within the classifier's prediction measure. Therefore, the runs were repeated with the accuracy of the classifiers captured. The accuracy of these preceding classifiers is given in figure 5.9, demonstrating that the accuracy is not as severely impacted as the oscillating prediction might suggest, but was clearly compromised nonetheless. Figure 5.8 demonstrates, using FSW-9A-4, that the oscillation in prediction affected not only the immediately preceding classifier, but also earlier classifiers further back. However, at no time were these earlier classifiers considered to be of zero accuracy - the oscillations had been sufficiently smoothed out by the discounting within the Widrow-Hoff mechanism to keep the changes in prediction within the 1% accuracy boundary used in these problems.


Figure 5.7 - The prediction of the aliased classifier impacts the prediction stability of the immediately preceding classifier for 2, 4, and 6 consecutive aliased states when random start states are used.



Figure 5.8 - The impact of the aliasing classifier on the prediction of all classifiers in the random start state version of the FSW-9A-4 environment - apart from Cl.4, which covered the aliased states, only Cl.3 was deemed to be inaccurate.


Figure 5.9 - In the random start state version of FSW-9A-4 the accuracy of the aliased classifier (Cl. 4) impacts the accuracy of the immediately preceding classifier (Cl. 3) but does not impact the accuracy of the earlier classifiers (Cl. 1 and Cl. 2) unduly.


In order to confirm hypothesis 5.1.1 the results were examined further to identify the source of the inaccuracy. Figure 5.8 illustrates that the prediction of classifier 4 is much more erratic than the prediction for the same classifier in FSW-9A-4 depicted in figure 5.5. The random choice of starting state meant that the start state selected was often within the aliased states. An examination of the updates of the classifiers revealed that whenever this occurred the classifier covering the aliased states would receive a larger average payoff - a movement over 1 transition would give a payoff of γR, 2 transitions would give the payoffs γP4 and γR, 3 transitions would give the [approximate] payoffs γ²P4, γP4, and γR, and all of these combinations would result in a higher average payoff than performing all four transitions. This variance in payoff, combined with the less regular update of the earlier classifiers, contrives to expose the preceding classifier to a variable prediction value from the classifier covering the aliased states, and therefore results in inaccuracy despite the smoothing action of the discount. Thus, Hypothesis 5.1.1 is confirmed by these results.


Figure 5.10 - The effect of the aliasing classifier (Cl. 4) upon classifiers choosing an action which leads back to the same state (Cl.5b to Cl.1b) within FSW-9A-4.


To extend the tests further, a second action was added to FSW-9A-4 (the resulting environment being termed FSW-9A(2)-4) which led to a transition back to the current state (in effect a null action), and a further classifier for each distinct input was added which invoked this action. Using this modified environment it is possible to identify the effect the classifier covering the aliased states has upon another classifier that interacts with the aliased states but is not within the primary reward chain. In this case, it was hypothesised that the classifier matching in the aliasing states that invokes the null action would also become inaccurate, because it receives payoff from the aliasing classifier. Furthermore, because it will only be invoked in exploration mode, the increased oscillation found to occur within the random start environment, due to irregular feedback from different prediction positions, will also be present. The XCS was re-run with the extra action and the new classifiers added.

Figure 5.10 illustrates the results achieved in a typical run within the FSW-9A(2)-4 environment. Cl.4 is the aliasing classifier from the original classifier set, and Cl.3 is the preceding classifier. These plots demonstrate that the behaviour of the prediction within these classifiers is unchanged from that presented in figures 5.5 and 5.6. Cl.4b is the classifier that matches within the aliased states to produce a transition back to the same state, and its plot demonstrates a considerable degree of prediction oscillation - unsurprisingly, the classifier remains inaccurate throughout its lifetime. The plot for Cl.4b demonstrates a high degree of similarity to that of Cl.3 in figure 5.8. In order to discover the reasons for the similarity, a closer inspection of the relationship between the classifiers is required. Like Cl.3 in figure 5.8, Cl.4b receives payoff directly from Cl.4. The prediction of classifier 4b, P4b, is updated towards the discounted maximum of P4b and P4, and will therefore settle at a mean value around γP4; this can be seen in figure 5.10. Now, Cl.4b will not be selected during exploitation phases, and will be selected only approximately half of the time during exploration. Its update will therefore be irregular, and thus any two consecutive payoff values from Cl.4 will be unlikely to be the same. This irregularity is shared with the payoff regime within the random start state version of FSW-9A-4, leading to the similar plots for Cl.4b in figure 5.10 and Cl.3 in figure 5.8.

These results allow the generation of further [untested] hypotheses about the general interaction of classifiers covering consecutive aliasing states with preceding and dependent classifiers. For example, from these results it could be hypothesised that a classifier leading from an aliased state to a non-aliased state will be unaffected by the aliasing classifier providing there is a transition to the non-aliased state from all aliasing


states. In this case the classifier covering the consecutive aliasing states is earlier in the payoff chain and will not affect the new classifier. However, if such a classifier leads to a higher reward than the consecutive aliasing states, the classifier covering the aliasing states will only be invoked during exploration and will rarely cause movement past all the aliased states, thus causing the aliased classifier to oscillate more dramatically than seen in the above figures, potentially causing inaccuracy in earlier classifiers.

5.3.3 Investigating Hypothesis 5.1.2

To obtain a baseline performance for this investigation the two-action version of FSW-9 was used, with the first action causing a transition to the following state and the second action causing a transition back to the current state - effectively a "null action". All parameterisation was maintained at that given in Table 5.1. Induction operators were operational and no initial population members were provided. The exploration iteration limit was set to 50 iterations. The XCS was run for 30 runs, with each run consisting of 5000 exploitation trials (10000 trials in total). The classifiers given below are the [O] at the end of a typical run. In total the final population consisted of 32 classifiers, of which 16 make up [O]. Notice the stable numerosity, high dominance and high fitness throughout [O], as would be expected.

Classifier   Pred       Error   Fit     Acc     N   AS     Exp
##000→1      64.6058    0.0001  0.9609  1.0000  25  27.64  2488
##000→0      90.8094    0.0002  0.9996  1.0000  21  23.50  4988
##001→1      90.9941    0.0003  0.9989  1.0000  24  26.05  2423
##001→0      127.7416   0.0002  0.9995  1.0000  21  23.64  5144
##010→1      127.7782   0.0006  1.0000  1.0000  28  28.77  2308
##010→0      180.0827   0.0005  1.0000  1.0000  28  28.36  5191
##011→1      179.7844   0.0009  0.9998  1.0000  25  26.84  2591
##011→0      254.0413   0.0008  1.0000  1.0000  26  27.00  5007
##100→1      254.1073   0.0009  1.0000  1.0000  21  22.69  2477
##100→0      357.5251   0.0013  1.0000  1.0000  24  25.06  5096
##101→1      358.9950   0.0027  1.0000  1.0000  15  15.19  2424
##101→0      502.4878   0.0008  1.0000  1.0000  26  28.46  5137
##110→1      502.5820   0.0007  0.9998  1.0000  20  22.42  2504
##110→0      708.8318   0.0020  1.0000  1.0000  28  29.04  5133
##111→1      704.7930   0.0060  0.9285  1.0000  25  27.57  2672
##111→0      1000.000   0.0000  0.7431  1.0000  22  32.24  5223

Table 5.2 - Classifiers from [O] within FSW-9(2) with Induction
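The predictions in Table 5.2 can be checked directly against the discounting scheme: the forward action from a state n steps before the reward should predict γ^n·R, and the null action should predict one further discount of the same state's forward prediction. A short sketch (an illustration using γ = 0.71 and R = 1000, as in Table 5.1):

    # Sketch: expected [O] predictions for FSW-9(2) under discounting.

    GAMMA, R = 0.71, 1000.0

    for steps in range(8):               # s7 is 0 steps away, s0 is 7
        forward = R * GAMMA ** steps     # e.g. ##111 -> 0 predicts 1000.0
        null = GAMMA * forward           # e.g. ##111 -> 1 predicts 710.0
        print(f"{steps} steps: forward {forward:7.1f}, null {null:7.1f}")

The computed pairs (1000.0 and 710.0 for s7, down to approximately 91 and 64.6 for s0) agree with the stabilised predictions in the table to within the recorded error.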

Figure 5.12 illustrates the performance of XCS on this problem, and shows how the System Relative Error measure rapidly falls as [O] is discovered and takes over the population. The graph is averaged over 27 of the 30 test runs. The three remaining runs


had highly outlying results because of the early discovery of the classifiers #####→0 and #####→1. These classifiers provide useful actions and participate in all [M], and therefore have greater participation within the GA, establish a high numerosity, and prevent convergence to [O] unless competing against the actual members of [O]. Whilst all other runs had final population sizes of between 30 and 36 and zero system relative error, these three populations contained 20 to 24 classifiers and a relative error of between 0.7 and 0.89. Further investigation demonstrated that, given more time for mutation to re-introduce the required bits, these populations would eventually discover suitable members of [O] and reduce to similar system relative error measures. For a fuller discussion of the problem of fully general classifiers dominating the population see section 4.3.6.4.

Figure 5.12 - Relative Error within FSW-9(2) with GA and induction mechanisms on and no initial population of classifiers.

The experiment was repeated within FSW-9A(2)-4, with the number of iterations set to 15000 to allow the XCS the opportunity to discover the classifiers before the aliased


states. In fact this extra time was unnecessary; all runs showed the same learning behaviour, with the relative error remaining high throughout each run. Figure 5.13 illustrates the first 5000 explorations averaged over 30 runs; there was no change to the graph in the successive 10000 explorations. It is important to note that the system relative error for any one run 'jittered' around the mean by 0.025, and the max/min relative error by 0.07, so averaging of the runs to achieve the plots shown in figure 5.13 has flattened the results.

Figure 5.13 - System Relative Error is not reduced, demonstrating the inability to find accurate classifiers within FSW-9A(2)-4 with GA and induction mechanisms on and no initial classifiers.

Examining the classifiers produced by the runs revealed some very unexpected results. Note, before examining the final populations, that the state inputs produced the messages 00000, 00001, 00010, 00011, and 00111, where 00011 was the input from the aliased states. Therefore [O] should contain two classifiers for each of the conditions ###00, ###01, ###10, ##011 and ##1##, with the classifiers having actions 0 and 1. [O] was not found


in any of the 30 populations after 15,000 explorations, however. All 30 runs found the following classifiers with high numerosity:

Classifier   Pred.      Err.    Fit.    Acc.    N   AS   Exp.
##1##→0      1000.000   0.000   1.000   1.000   74  88   30000
##1##→1      710.000    0.000   1.000   1.000   71  75   14901
##0##→0      294.104    0.189   0.679   0.000   50  79   209457
##0##→1      138.055    0.004   0.933   1.000   87  97   104850
###10→0      171.518    0.000   0.990   1.000   34  108  29593
###0#→0      147.872    0.002   0.968   1.000   34  101  59824

Table 5.3 - Classifiers from [O] within FSW-9A(2)-4

All the remaining classifiers were not uniformly represented across the populations, and had low numerosity (mean N < 3). For example, the best classifier found which still imperfectly covered s2 was represented in only 17 populations. A typical example was:

###10→1, P=142.023, E=0.004, F=0.036, A=1.000, N=1, AS=104, Ex=750

A corresponding classifier for s0/s1 was represented in 23 populations in a similar manner to the example classifier shown below:

###0#→1, P=144.911, E=0.006, F=0.047, A=1.000, N=5, AS=104, Ex=328

These findings must lead to a re-assessment of Hypothesis 5.1.2. Clearly XCS has been able to establish a classifier which covers the aliasing states for each action, contrary to the prediction of the hypothesis. Two co-acting reasons can be identified for this. Firstly, the only competing classifiers in [M] for these states will be less general but no more accurate, more general and less accurate, or of the same generality but selecting a lower rewarding action. Thus, the competition within the match set is insufficient to put deletion pressure on the classifier. Secondly, and more substantially, the hypothesis failed to account for the fact that in an environment with consecutive aliasing states XCS will dwell in the aliased states for proportionately longer than in the other states, and will therefore provide more opportunities for the GA to be invoked. Less general classifiers will be no more accurate, will compete for GA involvement less often, and so will be eradicated, whilst more general classifiers will have a lower accuracy and therefore a lower fitness and be selected for GA use less often. Thus the classifier


covering the aliasing states will put deletion pressure on competing classifiers. Furthermore, the increased frequency of GA invocation negates any potential deletion pressure from classifiers covering other match set niches. As a result, the classifier covering the aliasing states is maintained within a population niche despite its inaccuracy, contrary to Hypothesis 5.1.2.

The inability of XCS to find adequate generalisations for the classifiers covering the preceding states was also unexpected and requires further investigation. Figure 5.10 illustrated the inaccuracy within the classifiers with a null action covering the preceding states, and it is noticeable within this experiment that no stable classifiers were produced for the null action in the preceding states. In contrast, the null action was covered in the aliased state itself despite the unstable nature of its reward. Unfortunately, coverage of the more stable 'move one state' actions was also inadequate, with over-generalisation for s0 and s1. To investigate further the effect of the classifier covering the aliased states on preceding classifiers, and to test the hypothesis that the classifier covering the aliased states was driving out its competitors because of its high relative invocation frequency, the experiment was repeated in FSW-9A-4. Upon examination of the results, the following classifiers had secured a high numerosity in all 30 populations after 15000 iterations:

Classifier   Pred.     Err.     Fit.     Acc.     N    AS      Exp.
##1##→0      1000.00   0.0000   1.0000   1.0000   150  170.91  29091
##0##→0      294.31    0.1887   0.6039   0.0000   59   114.16  204009
###10→0      171.64    0.0000   0.9941   1.0000   67   165.50  26974
###0#→0      148.38    0.0020   0.9682   1.0000   69   158.57  57980

Table 5.4 - Classifiers from [O] within FSW-9A-4

The similarity of the results in Table 5.4 to those in Table 5.3 from the previous experiment demonstrates that the presence of classifiers providing other, lower payoff actions over the aliased states does not affect the formation of the preceding classifiers in cases where exploration and exploitation trials are equally frequent and equidistant. It was noticeable that the reduction in competition for population space resulted in a larger number of more specific competing classifiers of numerosity 5 to 10 remaining in each population, rather than the increase in the numerosity of the classifier covering the


aliasing state that would have been expected in line with the increase in numerosity of the classifier covering s7. This finding supports the explanation for the prevalence of the classifier covering the aliasing state in the previous experiment.

Figure 5.14 - The predictions of five classifiers (###0#, ###00, ###10, ##0## and ###01) within one run of FSW-9A-4. The prediction values of ###01 do not originate from the same 500 iteration period, but all values reflect the stable values of the respective classifiers over the duration of their existence.

In regard to the inability to form [O], the classifier ###00→0 was seen in 27 of the 30 populations, but in each case it had a numerosity of 1 to 3 and a low experience. The classifier ###01→0 was missing in all but one population, and where seen was also at a low experience count. In all these cases the classifiers were accurate, and their predictions were around 149 and 146.5 respectively. These classifiers were uniformly replaced by the classifier ###0#→0 in all populations. The proximity of the prediction values and the similarity of the accuracy criteria of the more specific classifiers led to their


replacement by the more general ###0#→0 classifier. To discover why their prediction values were so similar, given that Figure 5.3 would suggest that the aliasing classifier will not flatten the prediction values of the preceding classifiers, the predictions of the members of [O] (when in existence) and the members of the final population were captured from a run of XCS within FSW-9A-4, and a representative sample of their stable values (apart from ##1##, which was stable at 1000) is shown in figure 5.14. This illustrates the proximity of the stable strengths of the classifiers concerned, demonstrating why ###0# is generalised from ###00 and ###01. A further investigation of the operation of XCS was performed by inserting [O] and the high value result classifiers into an initial population and running XCS in single-step mode, printing out the classifiers and match sets on each iteration. This investigation revealed that when the classifiers covering states s0, s1, and s2 were matched, the classifier covering the aliased state also matched and formed part of the action set. Since this classifier has a reasonable relative fitness despite its inaccuracy, it contributes substantially to the system prediction, which is rewarded back to the stage-setting classifiers. This causes the prediction of the classifier covering the aliased state to oscillate more, but since it is already inaccurate this has little effect on it compared to the reproductive benefit of being involved in more match sets. The classifiers in [A-1] are always given the highest prediction from [A], which will be a match set that the classifier covering the aliasing states contributes to. This has the effect of feeding back high payoff to earlier classifiers, narrowing their prediction gap. In effect, the action of the classifier covering the aliasing states is similar to that of the fully general classifier described in 4.6.3.4, converging the prediction of its match sets to its own benefit. However, in this case the classifier is established because the aliasing states make it impossible to find a more accurate classifier. An experiment that attempts to verify this finding by contradiction was constructed. The stimulus presented by each state was changed so that the aliasing states would provide a stimulus sufficiently different from all other stimuli as to make the appearance of the classifier covering the aliased states in the match sets of other states very difficult. If the hypothesis is correct, the other classifiers should be able to form stable prediction and high numerosity values without disruption from the inaccurate classifier. This experiment gave stimuli to states as follows:

226

State   Stimulus
0       11110
1       11101
2       11011
3-6     00000
7       10111
8       01111
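Before presenting the results, note the mechanism through which the inaccurate classifier exerts its influence: the system prediction computed for each action during action selection is the fitness-weighted average of the predictions of the advocating classifiers. The sketch below assumes the standard prediction array computation of Wilson (1995); the names are illustrative and are not those of the XCSC implementation used here.

def system_prediction(match_set, action):
    # Fitness-weighted mean of the predictions advocating `action` in
    # [M]. An inaccurate classifier that gains entry to many match sets
    # still contributes here in proportion to its fitness, dragging the
    # system prediction (and hence the payoff fed back to [A-1])
    # towards its own oscillating prediction.
    total = weight = 0.0
    for cl in match_set:
        if cl.action == action:
            total += cl.prediction * cl.fitness
            weight += cl.fitness
    return total / weight if weight > 0.0 else None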

The experiment consisted of 30 runs, and the resulting populations were analysed. In all 30 runs s0, s1, s2, and s7 were represented by strong optimally general classifiers with the aliased states s3-6 represented by a further classifier. An example of the classifiers represented is given in Table 5.5:

Classifier   Pred.     Err.     Fit.     Acc.     N    AS       Exp.
10###→0      1000.00   0.0000   0.9890   1.0000    90  112.55   12329
##0##→0       377.58   0.1950   0.6016   0.0000    50   98.68   148323
#10##→0       256.37   0.0004   0.6143   1.0000    29   91.68   18422
1#0##→0       256.37   0.0004   0.3450   1.0000    19   94.93   4450
1##0#→0       216.09   0.0002   0.8574   1.0000    55  102.14   28882
1###0→0       175.64   0.0004   0.7819   1.0000    74  110.89   10388

Table 5.5 - Classifiers from [O] within FSW-9A(2)-4 with an inverted enumeration style of messages

The earlier classifiers now show only small levels of disruption from the aliasing classifier, with the main effect seen in the immediately preceding state, which was typically covered by two to four classifiers competing to discover the best generalisation. The generalisation found for the classifier covering the aliased states suggests that it continues to try to compete in the match set for the preceding classifier, possibly because the preceding classifier has a sufficiently close prediction value for the classifier covering the aliased states to trade off inaccuracy against match set occurrence. Interestingly, the aliased states were also covered by a number of competing high numerosity classifiers, demonstrating a continued search for the best trade-off.
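This trade-off between inaccuracy and match set occurrence is mediated by the accuracy function used in the XCS fitness update. A minimal sketch follows, assuming the exponential accuracy function of Wilson (1995) with the minimum error ε0 = 0.01 and fall-off rate α = 0.1 used throughout these experiments; the XCSC implementation may differ in detail.

import math

def accuracy(error, eps0=0.01, alpha=0.1):
    # A classifier with prediction error below eps0 is treated as fully
    # accurate; beyond eps0 accuracy falls away exponentially. A
    # classifier can therefore trade a modest rise in error for extra
    # match set occurrences while retaining non-negligible relative
    # accuracy within each niche it enters.
    if error < eps0:
        return 1.0
    return math.exp(math.log(alpha) * (error - eps0) / eps0)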


The encoding was used in a re-run of the FSW-9A(2)-4 environment to see if the removal of match set competition changed the results obtained earlier. Again the experiment consisted of 30 runs, and the classifiers optimally covering the stimuli from one typical run are shown below:

Classifier   Pred.     Err.     Fit.     Acc.     N    AS      Exp.
10###→0      1000.00   0.0000   1.0000   1.0000    43   48.94   29817
10###→1       710.00   0.0000   0.9677   1.0000    38   45.85   14968
##0##→0       379.23   0.1941   0.6650   0.0000    29   49.25   145473
##0##→1       224.53   0.0132   0.8795   0.0481    47   57.15   71742
#10##→0       263.14   0.0005   1.0000   1.0000    26   59.08   2950
##01#→1       219.00   0.0007   0.0834   1.0000     3   62.32   664
##0#1→1       219.00   0.0007   0.0645   1.0000     3   63.26   879
1#0##→1       219.00   0.0007   0.0553   1.0000     2   62.63   141
1#01#→1       219.00   0.0007   0.0201   1.0000     1   63.39   143
#10##→1       219.00   0.0007   0.0200   1.0000     1   63.39   34
1##0#→0       219.63   0.0008   0.9993   1.0000    44   59.02   28694
1##0#→1       172.24   0.0017   0.9990   1.0000    33   42.60   13855
1###0→0       174.60   0.0026   1.0000   1.0000    48   55.59   6557
1###0→1       132.46   0.0051   0.9999   1.0000    36   43.79   7596

Table 5.6 - Classifiers of [O] within FSW-9A(2)-4 with an inverted enumeration style of messages

All of the state × action × prediction mapping has been covered using this new encoding, demonstrating clearly that the classifier previously covering the aliasing states was interfering in the match sets and preventing the formation of accurate competing classifiers. A comparison of Figure 5.15 with 5.13 shows that the effect of this is a reduction in the level of relative error within the population. It was noticeable, however, that the classifiers covering the null action in the preceding state are all of low experience and low numerosity, indicating that the inaccuracy caused in that classifier by the following aliased classifier (as revealed in Figure 5.10) is sufficient to lead it into competition with the classifiers covering the aliased states, which can benefit from extending their inaccuracy in order to gain more match set occurrences. Once again, the competition to the classifiers covering the aliased states is inadequate to reduce the


numerosity of these classifiers, illustrating that Hypothesis 5.1.2 was not adequate in the circumstances used for this test.

[Figure: line plot of Relative Error, Min Rel Error, and Max Rel Error (Proportion, 0-1) against Exploration Trials (0-5000).]

Figure 5.15 - The Maximum Relative Error and Relative Error measures are reduced under the less competitive encoding within FSW-9A(2)-4 with GA and induction on and no initial classifiers.

Is it possible to set aside Hypothesis 5.1.2? From an analysis of the results of the experiments given above it was concluded that Hypothesis 5.1.2 could not be affirmed or denied from these results because the tests did not provide the classifiers in the aliased states with credible competition. The lack of competition would encourage the maintenance of the inaccurate classifiers. Therefore, a modified FSW was constructed based upon FSW-9A-2. The two alias state test environment was chosen to minimise disruption to preceding classifiers and reduce the prevalence of the aliased states within a GA and thereby increase competition. The environment was extended to provide four actions in each state. One action ('00') moved the FSW into the following state, and all other actions kept the FSW in the current state. The increase in actions again increases


competition for population space. The FSW was then further modified by introducing a further state that the actions 01, 10, and 11 caused the FSW to move to when within the two aliased states, with this state connected by action 00 to s7. This state prevents the aliased states looping within themselves, so further reducing GA domination by the aliased states, whilst providing credible competition to the classifier covering the aliasing action within these states. The aliased states were s5 and s6. Finally a message encoding for the states was chosen to prevent the classifier covering the aliasing states interfering with preceding classifiers. Initially this was chosen as given in the first column in the table below, but this coding was changed to that given in the final column after a test run demonstrated that the initial encoding allowed XCS to form a classifier 1###1→00 that could be adopted by the aliasing states.

State   Initial Encoding   Final Encoding
0       00001              00111
1       00010              00110
2       00100              00011
3       01000              00101
4       10000              00100
5/6     11111              11111
7       10001              00000
8       10010              00010
9       10100              00001

Table 5.7 - Attempts at message encoding to minimise alias state disruption

This environment was run within XCS with the same parameters as the previous tests. The resultant populations were captured and examined to identify [O] and to look for evidence of the competition driving out the classifier covering the aliased states in the search for population coverage. Table 5.8 identifies the members of [O] and their average numerosity across 10 runs of XCS in the test environment. This table illustrates that the classifier covering the aliased states is not able to sustain itself when faced with competition, in agreement with Hypothesis 5.1.2. The low experience and numerosity illustrate that the classifier is continually being replaced (indeed, the optimal classifier for the aliased states was present in only 6 of the 10 runs), whilst the Match Set membership has remained equivalent to that of other 'normal' classifiers, indicating that a number of other classifiers (all with small numerosity values) are in existence competing to cover this niche, as predicted by Hypothesis 5.1.2.


Classifier   Mean Numerosity   Mean Action Set   Mean Experience
##010→00     13.5              16                29043
##010→01     10.8              12.9              14229
##010→10     11.2              13.7              13315
##010→11      9.8              11.9              14238
##00#→00     14.4              17.6              29460
##00#→01     10.7              13.4              14601
##00#→10     11.8              13.8              14533
##00#→11     11.1              13.2              14438
1####→00      1.7              10.4              43
1####→01      8.3              10.2              7010
1####→10      8.9              11.1              4834
1####→11      8.5              10.5              7678
##100→00     14.5              17.4              27188
##100→01     11.4              13.3              14578
##100→10     10.8              12.6              12980
##100→11     10.9              12.7              12713
###01→00     13.7              17.5              24960
###01→01     12.7              14.8              12991
###01→10     10.8              14.4              13972
###01→11     10.1              12.2              13503
##0#1→00     12.5              16.2              27671
##0#1→01     10.6              13.3              15874
##0#1→10     10.9              14.0              14524
##0#1→11      9.8              12.4              14741
##110→00     13.5              16.5              25447
##110→01     11                12.5              10760
##110→10     11.1              13.1              13801
##110→11      9.5              11.72             12962
0#111→00     14.3              17.2              21601
0#111→01     11.2              13.1              11658
0#111→10      9.2              11.7              15090
0#111→11     10.9              12.7              12312

Table 5.8 - [O] for an aliased FSW with aliased states in competition.

Therefore, it is concluded that Hypothesis 5.1.2 does hold in cases where there is a significant amount of competition for population space from accurate classifiers, and where the invocation of the classifier covering the aliased states is sufficiently infrequent, relative to that of the competing classifiers, to prevent the aliased classifier utilising more frequent GA invocation to preserve itself. However, in cases where the classifier covering the consecutive aliasing states is able to obtain frequent occurrence within [A], either as a result of the number of aliasing states or due to the ability to utilise other match sets through its inaccuracy, Hypothesis 5.1.2 will not hold and the classifier covering the aliasing states will proliferate to the detriment of the classifier population as a whole.


5.3.4 Investigating Hypothesis 5.2

In the rationale presented within section 5.2 for Hypothesis 5.2 it was claimed that if the application of the reward could be delayed until the point of leaving the aliasing states, the discounting of payoff would not occur within the aliasing states and the classifier covering the aliasing states would therefore be able to represent a single payoff value accurately. At that point no mechanism was presented to achieve this objective, and the hypothesis must therefore be validated by the construction and evaluation of such a mechanism within XCS. Many mechanisms which achieve this aim may exist, but consideration is limited here to two possibilities. Cobb and Grefenstette (1991) employed classifiers within the SAMUEL LCS which included actions that identified a duration over which the action of the classifier was to occur. They were able to demonstrate that a LCS which included this facility was able to discover classifiers with suitable action duration under the action of the GA for the missile pursuit problem they were investigating. This technique could be readily applied to XCS to solve the consecutive state aliasing problem. A classifier which identifies both the action and the correct duration for the action would receive a constant payoff and therefore be identified as accurate and of high fitness, whilst a classifier identifying an incorrect duration would receive no payoff (if it persists too long), a fluctuating payoff (if it persists for too short a time and so is re-invoked), or a lesser payoff (if it persists for too short a time but is not re-invoked, since it will be further down the feedback chain), and would therefore eventually be removed. Thus, without change to the credit allocation or induction mechanisms of the XCS, and with only minor changes to the performance component, XCS would have all the mechanisms necessary to generate, identify, and proliferate classifiers that act for the correct time period. This approach has clear limitations. Firstly, the resulting classifier will only be useful if all consecutive aliasing state sets generating a given message are the same length, since the length of invocation is hard coded within the classifier. Secondly, the addition of timing information to a classifier increases the action length (and thereby the search space) unnecessarily for the many other classifiers which do not require this facility37. Thirdly, the XCS implementation proposed by Wilson (1995, 1998) includes only

37 It is possible that this problem could be reduced by providing a second XCS population that matched the same messages but whose action sets up a duration for the action chosen in the original population. The reward to this new XCS population would be that achieved by the action in the original population. This alternative has not been investigated within this research programme.


primitive search over the action space (mutation only), and thus the extension of the action encoding may necessitate the full application of GA search to the whole classifier in order to search over the duration fields adequately. Fourthly, and finally, in an environment where consecutive states have the same action message all the way to the highest value goal state, a classifier with the condition matching the first state and an action matching the move-to-goal action could develop a duration which continues the action over all intermediate states to the goal. Whilst this is potentially beneficial in the short term, it could limit exploration of later states and prevent the timely production of a full State × Action × Payoff mapping within the classifier population. This problem could be overcome [for the sole solution of the consecutive state aliasing problem] by limiting the persistence of an action to the cases where the message remains the same in consecutive states. In this case a classifier that tries to persist with an action for longer than a message is consecutively posted can be given an arbitrarily low reward, so that the mapping for actions of an incorrect duration is poorly valued and thus not selected during exploitation.

Given these potential drawbacks, is a simpler alternative solution available? The obstacle to finding a simpler solution is that it is difficult to identify the difference between an action that leads to a consecutive aliasing state and an action which leads back to the same state (a 'null' action). The same message is received from the environment in consecutive iterations in both cases, and oscillations in prediction will still occur in the latter case for over-general classifiers matching in this state. However, if the environment does not allow null actions, then it would be possible to repeatedly re-choose the same action set whilst the message remains the same, rewarding the action set only when payoff is received from the first action set chosen from a different message. This mechanism is simple to implement38, ensures the Animat moves towards an exit from the consecutive aliased states, and gives a constant prediction feedback. Although this scheme would seem to prevent the Animat exploring alternative reward routes from within the aliasing states, unless the alternative route could be reached from all the aliasing states within a set of consecutive states the action leading to that alternative route will itself be aliasing and not solvable without adding a 'memory' mechanism to disambiguate all the consecutive states. If the route can be reached from all the aliasing states, then it could be chosen at the first of the aliasing states and will therefore still be explored.

38 It is possible that, by providing each action with a 'persistence bit', all classifiers could have the option of electing that their action be allowed to persist until 'turned off' by a classifier firing the same action with the bit off. Thus, two co-operating classifiers could move the Animat over large distances without the high overhead of an explicit persistence time. This possibility has not been investigated further within this research programme.

In order to provide an empirical proof for hypothesis 5.2, the latter [simpler] mechanism will be applied within a version of FSW-9A(2)-4 in which action '0' moves the Animat one state forward and action '1' moves the Animat one state backwards, except in the start state, in which action '1' also moves the Animat one state forward. The 'final' state encoding shown in Table 5.7 was utilised in order to minimise any likely aliasing state disruption to other classifiers, although this is theoretically now unnecessary. The XCS was modified with the aim of minimal change. To introduce persistence, the message is compared within the match operation and, if it is the same as the preceding message, the previous action set is restored and selected to perform the action [in explore and exploit mode] rather than re-matching the population and re-selecting (a sketch of this modified cycle is given after Table 5.9). If the previous action set has been restored in this way, no payoff is given unless an external reward is received; this would signal the end of the learning trial in any case. The modified XCS was initially tested by inserting the hypothesised [O] into the population, turning the induction algorithms off, and running the XCS with all other parameterisation set to the values used in previous experiments. These tests indicated that the modified XCS was able to find the optimal predictions for all classifiers with no error in the same time span that would be expected for a non-aliasing environment. The resultant [O] sub-population (after 15000 exploitation trials) is given below:

Classifier   Pred.     Classifier   Pred.
###11→0      1000.0    ###10→1      254.12
###11→1       504.10   ###01→0      357.91
1####→0       710.00   ###01→1      180.42
1####→1       357.91   0##00→0      254.12
###10→0       504.10   0##00→1      254.12

Table 5.9 - [O] for a persistent action XCS with an initial hypothesised [O] provided and no induction.
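The modified cycle referred to above can be sketched as follows. The xcs and env interfaces are hypothetical stand-ins for the corresponding XCSC structures, so this is an illustrative outline under those assumptions rather than the implementation itself.

def persistent_cycle(xcs, env, prev_msg, prev_action_set):
    # One iteration of the modified performance cycle.
    msg = env.sense()
    if prev_msg is not None and msg == prev_msg:
        # Identical consecutive message: restore the previous action
        # set, in explore and exploit mode alike, without re-matching;
        # payoff is deferred until the message changes.
        action_set = prev_action_set
    else:
        match_set = xcs.match(msg)
        if prev_action_set is not None:
            # Message changed: the held action set now receives the
            # usual discounted payoff from the new match set.
            xcs.update(prev_action_set,
                       xcs.gamma * match_set.max_prediction())
        action_set = xcs.select(match_set)
    reward = env.act(action_set.action)
    if reward is not None:
        xcs.update(action_set, reward)  # an external reward ends the trial
    return msg, action_set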

Knowing that the modified XCS would deal with the aliased states correctly, the XCS was then tested within the specified environment by running it thirty times with no initial population, all induction algorithms on, and using the same parameterisation as in the previous tests. The system relative error was captured from each run and averaged, and


the results from the first 5000 exploitation trials are shown in figure 5.16. This illustrates clearly that the persistence of the action over aliased states has eliminated the error caused by the aliasing states (see figure 5.13 for comparison).

[Figure: line plot of Relative Error, Min Rel Error, and Max Rel Error (Proportion, 0-1) against Exploration Trials (0-5000).]

Figure 5.16 - The reduction in system relative error shows that consecutive aliasing states do not affect classifier performance when using the persistent action XCS within FSW-9A(2)-4.

An examination of the final populations from the 30 runs showed that they had all converged on [O], with the total population size between 27 and 40 macro classifiers (all other macro classifiers had a low experience and very low numerosity, indicating that they were the unfruitful product of continued exploration). Table 5.10 gives a typical example. Interestingly, the classifier 0###0→1 also covers the classifier ###10→1 in the earlier hypothesised [O] because, as is evident from the earlier population, these two classifiers both have the same prediction and their conditions are sufficiently similar to allow generalisation to occur.


Classifier   Pred.     Err.     Fit.     Acc.     N    AS      Exp.
###11→0      1000.0    0.0000   1.0000   1.0000    41   42.71   27517
###11→1       503.35   0.0007   1.0000   1.0000    31   40.97   12371
1####→0       707.21   0.0013   0.9195   1.0000    38   44.58   40047
1####→1       357.91   0.0006   0.9691   1.0000    34   40.23   25079
###10→0       503.18   0.0007   0.9899   1.0000    49   51.66   53263
###01→0       358.76   0.0009   0.9798   1.0000    51   54.25   66974
###01→1       180.50   0.0004   0.9165   1.0000    36   45.38   52773
0##00→0       254.05   0.0003   1.0000   1.0000    38   43.99   41199
0###0→1       253.91   0.0003   0.9460   1.0000    48   58.01   81085

Table 5.10 - [O] for a persistent action XCS in FSW-9A(2)-4

This experiment has demonstrated that the consecutive aliasing state problem can be overcome by a mechanism, the use of persistent actions, that does not solve the more general Aliasing Problem. Therefore, it is concluded that the Consecutive State Aliasing Problem is a sub-problem of the more general Aliasing Problem and Hypothesis 5.2 is upheld.

5.4 Summary of Results

Experimental investigation of hypothesis 5.1 was conducted using a five state single action FSW with a pre-loaded population of optimal classifiers. Analysis of the System Relative Error of the population, and of the prediction, accuracy, and relative error of the individual classifiers, was performed. This analysis identified that whenever two consecutive states within this environment were represented to XCS using the same input message, XCS was unable to converge on a payoff prediction for the classifier covering the two states presenting that message, demonstrating that aliasing occurs as predicted by Hypothesis 5.1. Further investigation was conducted using a similar environment but with nine states, within which two, four, or six consecutive states were aliased. These tests illustrated that the amount of prediction oscillation within the aliased states increased as the number of aliased states increased, but with a lower oscillation than the gross range of payoffs covered by the aliased states. Within these environments it was seen that the classifier covering the state preceding the aliased states was able to achieve full accuracy despite the prediction oscillation of the classifier covering the aliased states. This was because at the point of payoff the


classifier covering the aliased states always had the same fixed prediction. However, once the environment was modified to allow all states to be start states, the prediction of the classifier covering the preceding state oscillated. As hypothesis 5.1.1 predicted, this was due to the diversity in prediction now received by the classifier covering the aliased states depending upon the number of aliased states explored in each trial. The effect on the preceding classifier was also seen when the number of consecutive aliased states was extended to four and six states, although the magnitude of the payoff oscillation did not grow despite the larger differences in the payoffs received over the aliased states. Thus, Hypothesis 5.1.1 was upheld. Unexpectedly, the inaccuracy in the preceding classifier did not unduly affect the accuracy of earlier classifiers, possibly due to the smoothing of the MAM update technique. Further experiments demonstrated that a classifier leading out of an aliased state into a non-aliased state would be disrupted in a similar manner, even if exploration rates were uniform. It was hypothesised that this was due to the irregular exploration of the non-optimal route causing exploration at different points in the prediction payoff cycle of the aliasing states, thus generating the same oscillation in the payoff to the classifier covering the non-optimal action as was seen in the non-uniform exploration environment. On the introduction of the induction algorithms to the XCS within a nine state two action corridor aliasing FSW environment, it was found that the system relative error was not reduced as would have been expected from the baseline experiments. Whilst it would not be expected that system relative error would be eliminated, owing to the aliased states, the high level of system relative error was a surprise. An analysis of the populations and XCS dynamics in this environment revealed that the aliased states encouraged the development of classifiers that covered not only the aliased states but also other preceding states. Since it was not possible for a classifier to achieve accuracy over the aliased states, a classifier was established that was able to cover a number of states. Although having a very low accuracy, this classifier utilised the reproductive benefit of additional action set membership to maintain itself in a strong position in the population. Experiments with the use of difficult-to-generalise encodings for the states surrounding the aliasing states demonstrated that XCS was able to identify classifiers to cover these earlier states if the opportunity to establish a general classifier covering other states was removed. Thus, contrary to hypothesis 5.1.2, XCS was able to identify and maintain a classifier to cover the aliased states because of the lack of competition from any other more accurate coverage of these states and the increased GA opportunity afforded by membership of more than one action set per trial. Experiments that artificially increased


the competition faced by these classifiers demonstrated that hypothesis 5.1.2 would hold if there was sufficient population competition, but that it did not hold for most environments with consecutive aliasing states. Two potential solutions to the consecutive state problem that would not apply to the separate state problem were presented, and the simpler solution was implemented. This solution caused XCS to re-assert the previous action set whenever a message identical to the previous message was presented, with payoff only provided once the message changed. It was demonstrated that XCS was able to find and proliferate an optimally general accurate sub-population of classifiers while using this mechanism. Since this mechanism cannot be applied to the separate state problem, hypothesis 5.2 was upheld. It was noted that the proposed mechanism was limited to cases where there were no "null actions" within the environment, although possible solutions to the null action case can be devised.

5.5 Discussion

As noted in section 5.1, and discussed more fully in 3.5.3.2, previous work on the use of XCS within non-Markovian environments has been carried out by Lanzi (Lanzi 1998a, 1998b, 1998c; Lanzi and Wilson 1999). This work was based on the note by Wilson (1994, 1995) that it may be possible to add additional state bits that XCS could utilise in its operation. Although Cliff and Ross (1994) had attempted to add these memory bits to ZCS, their results were limited to sub-optimal operation within simple mazes. In his expansion of this work, Lanzi defined the 'Aliasing Problem' and demonstrated that XCS could utilise these memory bits to disambiguate the aliased states for optimal performance within these mazes. He found that XCS was unable to learn to utilise these bits to generate optimal performance on larger problems unless learning of the bits to use was more deterministic than the learning technique used to learn the policy. In further work, Lanzi and Wilson (1999) identified that some redundancy in the bit provision was required in order to resolve environments requiring more than a few bits of internal memory. Lanzi's work is extended by this work in a number of ways. Firstly, the problems Lanzi examined are all identified as belonging to the "Separate State Problem" - a term introduced to separate this form of aliasing from the "Consecutive State Problem" identified in this work. Secondly, Lanzi's work did not provide a detailed analysis of the effect of aliasing on other classifiers within the population or on the ability to form [O]


or to allow [O] to dominate. This work has shown that within the Consecutive State Problem classifiers covering preceding states can be adversely affected as a result of the fluctuating payoffs received from the action sets for the aliased states. Additionally, it has been shown that the generalisation mechanism can enable classifiers to be generated, and to proliferate, that also cover preceding states in order to gain additional reproductive opportunity. This finding introduces a second form of over-generalisation within XCS in addition to the fully-general classifier phenomenon discussed in Section 4.3.6.5. Finally, it has been shown that a solution to the Consecutive State Problem exists that is more lightweight than the application of memory bits. This mechanism will not change the operation of XCS within Markovian environments and is therefore a useful complement to standard XCS operation. The problem of operation within non-Markovian environments is not unknown within the wider realm of Reinforcement Learning. Typical solutions within this domain have included the use of 'history windows', whereby states are disambiguated by providing the previous input alongside the current input. Such ideas have been applied within LCS work, such as the work of Robertson and Riolo (1988) with the letter-sequencing problem. An alternative proposal for the resolution of the letter-sequencing problem was the introduction of the Delayed Action Classifier System (Carse, 1994; Carse and Fogarty, 1994). This LCS allowed an action to identify a delay in terms of LCS iterations before the use of an action, with an environmental action and a delayed action possible within a single step. The LCS could thus 'solve' the ambiguous 'MISSISSIPPI test' by delaying action invocation until the aliased states. Application of either solution to the aliasing problem has not yet been attempted and these are areas ripe for further investigation, although neither is applicable to the Consecutive State Problem. An alternative mechanism proposed in section 5.3.4 for the solution of the Consecutive State Problem is that derived from the work of Cobb and Grefenstette (1991). Although their work used the hybrid Pittsburgh/Michigan LCS SAMUEL and was aimed at explicitly identifying action persistence rather than resolving an Aliasing Problem, it is possible, as section 5.3.4 identifies, that this more complex mechanism could be applied to resolve the Consecutive State Problem. Given the promise of this opportunity and its relevance to the current work, this possibility is explored further in Chapter 6.

Tomlinson and Bull (1999c) introduced the use of classifier corporations into an XCS implementation and applied this "Corporate XCS" to the solution of a highly non-Markovian


environment. Their techniques are highly relevant to the Consecutive State Problem, with the explicit identification of rule-chains to apply once a "lead classifier" is triggered. However, the results presented were not as promising as had been expected from earlier work using a Corporate LCS implementation (Tomlinson and Bull, 1998, 1999a). Discussion in section 3.5.3.2 identified possible weaknesses in their application of Corporations to XCS. It is possible that a form of corporate XCS implementation closer to the operating principles of XCS might generate another solution to the consecutive state problem, and this is clearly an area for further investigation.

5.6 Conclusions

This chapter has identified the Consecutive State Problem as a sub-problem of the larger Aliasing Problem. It has identified, within the limited FSW environments used, the extent of the disruption caused by the aliasing of consecutive states on the identification and proliferation of the optimal sub-population [O]. Another instance where over-generalisation can disrupt the performance of XCS has been identified, although it has been shown that this tendency can be reduced by the use of appropriate problem-dependent message encoding. A simple solution to the Consecutive State Problem was proposed and applied. This mechanism is more lightweight than existing solutions, and expands the application of XCS into non-Markovian environments where the environmental ambiguity fits the Consecutive State Problem. Further areas of investigation have been identified. In particular, the possibility of using action persistence to resolve the Consecutive State Problem has been identified, and is investigated further in Chapter 6. Additionally, the use of the History Window approach has been discussed and the relevance of previous investigations using the Delayed Action Classifier System and the Corporate XCS noted. These areas are ripe for further investigation, although left for later work beyond this research programme.


Chapter 6

ACTION PERSISTENCE WITHIN XCS

6.1 Introduction

Consider a FSW in which all states are connected in the manner si → si+1, si → si (0 ≤ i < n), where s0 is the start state and sn is the terminal state. In an environment such as this it is common for an Animat to take a single action in each state, which will either lead to the Animat staying in that state or moving to a new state. The Animat will then invoke the deliberative process of the Learning System to decide which action is best from the new state. 'Time' is synonymous with this 'detect-decide-act' process, one unit of time being equivalent to one such 'step'. In the preceding chapter it was demonstrated that a solution to the Consecutive State Problem existed if XCS was allowed to continue with the same action whilst the input to the XCS remained constant, provided that "Null Actions" were disallowed. This hypothesis was validated using a suitably modified XCS. This solution is a specific solution to the problem of consecutive aliased states, but shows some similarities with a more general feature proposed for LCS implementations by Cobb and Grefenstette (1991). In their SAMUEL implementation the action of a classifier not only specifies an action, but also specifies the duration over which the action may continue to operate, with the duration specified in terms of environmental 'steps'. Although a number of potential limitations of this solution were identified within section 5.3.4, the ideas merit further investigation as a means of lengthening the rule chains possible within XCS. This chapter seeks to apply these ideas to a number of Markovian multiple step environments. It sets out to discover whether XCS is able to identify the optimal duration over which an action should operate. In addition, it seeks to investigate whether the deficiencies suggested within the previous chapter do occur in practice, and whether there are mechanisms that can be applied to remedy any deficiencies. It demonstrates that, given a suitable exploration strategy, action persistence can be utilised within XCS to enable the selection of a pathway to a reward state that entails the minimum number of different actions. It is also shown that a modification to the learning mechanism restores the ability of XCS to select the pathway to a reward state with the minimum


number of steps whilst minimising the number of actions used. Finally, it is shown that this mechanism is insufficient to address the Consecutive State Problem.

6.2 Background

The SAMUEL LCS (Grefenstette, 1989, 1992; Grefenstette and Cobb, 1991; Grefenstette, Ramsey, and Schulz, 1990) was developed to expand the application of LCS to more complex problems than had previously been investigated. This LCS was based upon the Pittsburgh LCS, although SAMUEL differs from other [Michigan or Pittsburgh] LCS in a number of important ways. Firstly, it represents conditions in terms of higher-level attribute representations including integer ranges (with cyclical ranges), set membership, and binary string attributes. Each condition is the logical 'AND' of a set of attributes, and classifiers can hold a varying number of these attributes. The action is represented as a single attribute with a value, but the attribute can be one from a set of action attributes. The second difference is in the matching algorithm. This uses a partial match scheme based loosely on Booker (1985). Each classifier is given a computed match score based on the number of condition attributes that the classifier matches. The match set is then constructed from all the classifiers that match with the current highest match score. This mechanism, combined with the population initialisation strategy described later, eliminates the need for covering operators. The third difference is in the bidding process. This initially identifies each separate action that could be proposed from the match set. Then, for each action proposed, it finds the classifier with the highest strength proposing the action. Finally, it selects the action using the probability distribution that derives from the relative strengths of the classifiers identified for each proposed action. Once an action is selected in this manner, all classifiers in the match set that proposed the action are placed into the action set. The fourth difference is in the credit allocation process. Each classifier maintains an estimate of the mean and the variance of the payoffs it has received, using an update mechanism reminiscent of the Widrow-Hoff update scheme. The strength of the classifier is then calculated from the mean and the variance such that a high mean and low variance is preferred. All members of the action set are updated in this manner. A record is kept of all classifiers that proposed actions in the current episode, and when a reward is received they are all updated in this manner. The fifth difference is in the GA selection procedure. This operates at the population level - more Pittsburgh-like than Michigan. The SAMUEL system runs each population


against the test environment for a number of trials, maintaining the average payoff for each population. From the average payoff SAMUEL calculates the fitness value, which is defined as the difference between the average payoff and a 'baseline performance measure'. This baseline performance measure is initially set low, but is adjusted gradually upwards during learning to exert increasing selection pressure within the GA. The GA selects two parent populations using this fitness metric, and using crossover it repeatedly assigns a classifier from each parent to one of two offspring populations. The crossover operator probabilistically seeks to cluster the better performing classifiers so that one of the offspring contains more high performing classifiers than the other, and is also prevented from placing duplicate rules in a population. A mutation operator is applied to modify attribute values to new random values, with the attributes to be modified chosen probabilistically using a uniform random distribution. The final modification is in terms of population initialisation. Initially the population is set to contain classifiers with fully general conditions representing each action. After each episode, a specialise operator is invoked if there is population space for a new rule and the maximally general rule used during the episode led to a high reward. This operator creates a new classifier with attributes based on those within an input message but made 50% general, and with the same action as the general classifier. The combination of the GA and the specialise operator provides the random and incremental exploration strategies of more traditional LCS implementations. Whilst SAMUEL is interesting in itself for the ideas on rule payoff variance, which predate more recent developments within XCS, and for the early implementation of higher level attribute encoding within LCS, its prime contribution to this research is in the addition of action persistence in later work by Cobb and Grefenstette (1991). In this version of SAMUEL, an additional action attribute called duration was added to each classifier. This attribute specified the number of iterations (maximum 10) that the action should be applied for. In a missile avoidance problem, the application of the action duration was tested for versions of SAMUEL where the size of the classifier chain allowed was artificially restricted to 2, 4, 6, or 15 classifiers. Their results showed that more rule space was required to represent a successful avoidance strategy for rule chain lengths 6 and 15 within the non-duration implementation than within the implementation specifying duration. Although there was no significant difference between the non-persistent action and the persistent action performance for 6 or 15 length rule chains, there was a clear difference in performance where the rule chains were limited to 2 or 4 classifiers in each rule chain. When the populations were examined, it was found that


the provision of action persistence allowed SAMUEL to eliminate some attributes which were previously required in the conditions to maintain the rule chains. Thus, the provision of duration within rule actions seemed to demonstrate clear advantages both in the population size required and in the complexity of the classifier conditions required when used within the missile avoidance environment. These findings were particularly surprising given the additional search space that the inclusion of duration represented. Although these ideas have not subsequently been applied to any other LCS, it could be hypothesised that their application to Holland-style Michigan LCS would be problematic. Most Michigan-style LCS implementations do not provide adequate niche preservation mechanisms, and therefore it may be difficult to maintain and explore competing duration actions in order to identify the optimal action persistence. However, this could well be offset by the reduction in rule chain preservation required by the LCS. Irrespective of the merits of the idea for Holland-style Michigan LCS, it would appear that this mechanism may be appropriate for use within XCS. XCS has an in-built niching capability that may allow XCS to both create and maintain a complete state × action × duration × payoff mapping. However, since generalisation within XCS is not performed within the action, this could result in an increased population storage requirement and a corresponding increase in the exploration and learning time for the XCS to discover this mapping. Given the considerable differences in the operation of SAMUEL and XCS, a detailed investigation of the ability of XCS to utilise action duration specification is required.

6.3 Hypotheses

Consider a hypothetical XCS implementation allowing action persistence, which will be termed PXCS. This implementation extends the action of the XCS so that, in addition to specifying the action to take, it specifies the number of XCS iterations over which the action should be taken. For simplicity, assume that PXCS provides as payoff the value of the final state entered for a persistent action that remains legal (is a label of an edge from the current node within a FSW) for all the steps specified in the duration part of the action. If the action becomes illegal (is not a valid label of an edge from the current node) at any time during the persistence, a payoff of 0 is given. This mechanism is appropriate for application to XCS only if it is possible for XCS to determine the duration for the application of an action that will produce the highest accurate payoff, maintain that condition × action × duration × payoff information, and optimally utilise it during action selection.
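A PXCS classifier can carry this duration inside its action string. The sketch below assumes the layout used later in section 6.4.1 (one action bit followed by three duration bits); the function name is illustrative, and whether the duration field is read directly or offset by one is an implementation choice.

def decode_pxcs_action(action_string):
    # '0110' => environmental action 0, held for the number of steps
    # encoded by the binary duration field '110'.
    action = int(action_string[0])
    duration = int(action_string[1:], 2)
    return action, duration

print(decode_pxcs_action('0110'))   # (0, 6)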



Consider the finite state world consisting of a chain of sparsely connected nodes shown in Figure 6.1a.

[Figure: two five-state FSW chains s0-s4 with reward R at s4: (a) a chain in which no two consecutive transitions share the same action label, and (b) a chain in which consecutive transitions share the same action label.]

Figure 6.1 - FSW without and with consecutive actions

For a standard XCS three classifiers are required to traverse this FSW: 0→1, 1→2, 3→4. In this FSW any classifier within PXCS that specifies an action duration of greater than 1 cannot succeed, because there is no set of consecutive transitions that have the same action label. Where a specification of action persistence cannot succeed for the duration specified by the classifier, an immediate non-environmental payoff P=0 is given. In this case any classifier specifying an action duration of greater than 1 step will become accurate and be preserved within the population, but will eventually have a prediction of 0 and thus not be selected during exploitation cycles. Thus XCS is capable of detecting when an action duration of greater than one step is not required. Now consider the environment depicted in figure 6.1b. In this environment a classifier that seeks to perform the action 1 in state s0 for one step will receive a payoff approaching γ³R if only single step actions exist for all following states. However, a classifier that seeks to perform the action 1 in state s0 for two steps will receive a payoff of γ²R under the same conditions. Both classifiers will be maintained within the XCS population as accurate classifiers, but the classifier with an action persistence of 2 will have a higher stable prediction than that with a persistence of 1 and will therefore be selected in exploitation. This argument can be trivially extended to any number of consecutive states that will admit the same action and ultimately lead toward a fixed point reward. As in the case of Figure 6.1a, classifiers which propose a duration greater than that which can be usefully employed will receive a payoff of 0 and therefore not be selected during exploitation. Notice that 'null actions' - those actions which lead immediately back to the same state - will receive the discounted payoff from the state and will therefore also not be selected during exploitation. Given the Optimality Hypothesis (Kovacs,


1996), and provided the mechanisms of XCS remain unchanged, it is hypothesised that PXCS will indeed be able to provide this facility:

Hypothesis 6.1
PXCS is able to identify, maintain, and optimally utilise the classifier in each state that will allow the longest persistence in action on any action chain which leads to a stable environment reward.

The rationale behind hypothesis 6.1, although not the hypothesis itself, is based on the assumption that only single step actions exist for all following steps. This will not be the case for any construction of the classifier system that has more than two consecutive states, and thus will not be the case for almost all cases of useful persistence. Consider figure 6.1b. Assume that a classifier c3 receives a reward R from the environment for moving from s3 into the reward state s4. In this case a classifier c2 will receive a fixed payoff of γR for moving from s2 to s3. Similarly a classifier c1 should receive a fixed payoff of γ²R for moving from s1 to s2, and c0 should receive a fixed payoff of γ³R for moving from s0 to s1. However, if a classifier c1p exists which moves with duration 2 from s1 to s3, it will receive the payoff of γR. This will become the highest prediction in the match set [M1] for s1 and will therefore be the payoff value for preceding states. Similarly, if a classifier c0p exists which moves with duration 3 from s0 to s3, it will receive the payoff of γR, and this will become the highest prediction in the match set [M0] for s0. Since all classifiers in any action set within any state will receive a payoff based on the highest action set prediction within the match set for that state, all classifiers covering states s0 to s2 will receive the payoff γ²R apart from the classifier that moves with the appropriate duration to s3. Thus, the property of temporal difference is disrupted over the states in which persistence of action can occur. This does not invalidate Hypothesis 6.1, since the longest duration action that remains legal throughout its invocation continues to hold the highest prediction and will therefore be selected in exploitation. However, there will be no quantitative measure of the utility of other duration values, or of other actions with or without a duration, that do not lead to a higher payoff value. This problem cannot be solved by a re-definition of PXCS so that the payoff given to an Action Set is γ^δ P, where P is the payoff from the Action Set at time t+1 or the environmental reward, and δ is the duration of an action successfully performed for the

246

whole of its duration. This would result in classifiers that specify a duration obtaining the same stable prediction as a classifier specifying a single step with the same action at the head of a chain of steps that ultimately reaches the same high payoff state. PXCS would no longer be able to select the highest duration classifier, since it would receive the same payoff at the point of invocation as classifiers of a lower duration. If the discount rate γ was replaced by a lower discount rate Γ < γ for actions with a duration higher than 1, it would be possible to favour actions with a duration, but it would still not be possible to identify the longest correct duration action - all duration specifications would receive the same discounted payoff in any given state.
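This collapse is easy to check numerically for the chain of Figure 6.1b, using R = 1000 and the γ = 0.71 adopted throughout. The sketch below assumes the redefined payoff γ^δ P applied uniformly:

gamma, R = 0.71, 1000.0

# Stable prediction in s1 via three single-step payoffs, each
# discounted once per step, against the stable prediction of a single
# duration-3 classifier in s1 under the redefined payoff gamma**delta * P.
single_step_chain = gamma * gamma * gamma * R   # 357.911
duration_three = gamma ** 3 * R                 # 357.911 - identical,
print(single_step_chain, duration_three)        # so duration cannot be preferred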

[Figure: a FSW with start state s0 and two routes to reward: action 0 to s5, then via s6 to the reward state s7 (R2), and action 1 to s1, then action 2 through s2 and s3 to the reward state s4 (R1).]

Figure 6.2 - A FSW not optimally selected by PXCS

Now consider figure 6.2. If the start state is s0 and reward R1=R2=1000 is given in the reward states s4 and s7, then XCS would eventually obtain classifiers representing the action 0, causing movement to s5, s6, and s7, and classifiers representing the action 1 leading to s1 and onwards through s2 and s3 to the reward state s4. The classifiers representing action 0 would settle to the stable prediction γ²R, whereas those representing action 1 would settle to γ³R. Thus, in exploitation, XCS would choose to move to s7. In a PXCS implementation allowing a maximum duration of at least 3 steps, according to hypothesis 6.1 this environment would allow the identification of classifiers representing duration 1, 2, and 3 of action 2 with non-zero prediction within s1, duration 1 and 2 of action 2 with non-zero prediction within s2, and duration 1 of action 2 with non-zero prediction within s3. However, PXCS would reward the duration 3 action in s1 with the direct reward R from s4, and so R would be the maximum action set prediction when the environment is within s1. Thus, the payoff to the classifier moving from s0 to s1 would be γR, and PXCS would prefer the route to s4 over that to s7. In effect PXCS chooses the highest payoff achievable in the lowest number of different actions, and therefore represents an alternative form of learning system to XCS, which chooses the highest payoff achievable in the lowest number of steps. This argument is expressed within the second hypothesis:


Hypothesis 6.2
In exploitation PXCS will select the highest payoff achievable with the lowest number of separate actions.

A least impact solution to this problem exists by utilising a [relatively small] proportion p of the discounted payoff, removed from the payoff in inverse proportion to the duration:

γ^(δ-1)P − (1 − δ/∆)pγ^(δ-1)P

In the example FSW of Figure 6.2, if R=1000.0, γ=0.71, and p=0.1, then in s1 the stable prediction for a duration 3 move to s4 would be 0.71²×1000.0 − (1 − 3/4)×0.1×(0.71²×1000.0) = 491.49, and thus in s0 the stable prediction for a move to s1 will be 348.96. The stable prediction for a move from s0 to s5 will be 431.32, and thus this proposal will allow XCS to continue to select the closest equal rewarding state. Now, if a state s8 was imposed between s6 and s7, the stable prediction for a move to s5 would be 283.27 and XCS would choose the action leading to s1 in preference to the action leading to s5. This illustrates that the addition of the small fixed payment can ensure that the modified PXCS chooses the path with the fewest different actions where two equal length paths lead to the same reward, and leads to the third hypothesis:

Hypothesis 6.3
Re-instatement of step-based discounting of the payoff with the addition of a small step based discount component to the payoff will allow PXCS to preserve the Temporal Difference properties of XCS whilst selecting the path with the lowest number of separate actions where equidistant paths to equal rewards exist.
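The arithmetic of the worked example can be reproduced directly. The sketch below assumes ∆ = 4 for the maximum duration (the value implied by the δ/∆ = 3/4 term above) and follows the example in applying plain γ discounting for the single-step move from s0:

gamma, p, R, Delta = 0.71, 0.1, 1000.0, 4

def stable_prediction(P, delta):
    # Step-based discounting over the duration, less a small component
    # that shrinks as delta approaches Delta, so that among equal-payoff
    # routes the fewest-action (longest duration) classifier retains
    # the highest prediction.
    base = gamma ** (delta - 1) * P
    return base - (1 - delta / Delta) * p * base

s1 = stable_prediction(R, 3)   # duration-3 move from s1 to s4
s0 = gamma * s1                # single-step move from s0 to s1
print(s1, s0)                  # 491.4975 348.963225 - the 491.49 and 348.96 quoted above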

6.4 Experimental Investigation

In order to investigate the hypotheses a number of FSW were constructed. FSW are appropriate for this investigation because of the control they provide over the number of actions available within any state, the number of states that can be entered from within a state, and the message that is produced to identify a state. The base parameterisation of the XCS or PXCS is given in table 6.1. These parameters were chosen for consistency with work in the previous chapters, but appear appropriate given the level of complexity


of the tests used. Any variation in the parameterisation for particular experiments is stated alongside the experimental results.

Parameter                        Value
R (reward)                       1000
N (population size)              400
Pi (initial population size)     0
γ (discount factor)              0.71
β (learning rate)                0.2
θ (GA Experience)                25
ε0 (minimum error)               0.01
α (fall-off rate)                0.1
Χ (crossover probability)        0.8
µ (mutation probability)         0.04
pr (covering multiplier)         0.5
P(#) (generality proportion)     0.33
pi (initial prediction)          10.0
εi (initial error)               0.0
fi (initial fitness)             0.01
P(x) (exploration probability)   0.5
fr (fitness reduction)           N/A
m (accuracy multiplier)          0.1
s (Subsumption threshold)        20
maximum trial length             50

Table 6.1 - The parameterisation of XCS for the aliasing tests

6.4.1 Investigating the Provision of Persistence

To investigate whether PXCS is able to find classifiers that identify the optimal time over which an action should persist, a simple two-action ten-state FSW with no null actions was created (figure 6.3). In order to create an implementation of PXCS, the standard XCSC implementation was modified so that an action posted to the environment with a duration greater than one would cause the environment to continue to utilise the action, decrementing a duration counter each time, until the duration counter became zero or the action was inappropriate for the environmental state. No environmental message was sent back to the XCS during the operation of the action. Once the action has been performed the normal XCS cycle resumes. The calculation of payoff was also modified so that, if the previous action operated over a duration, the payoff was only given if the full specified duration was completed (the duration counter


for that action was reduced to zero). No other changes to the XCS implementation were required.39 The PXCS implementation was tested by providing the environment given in figure 6.3, setting the action size to 4 to allow 3 bits for the specification of the duration and one bit for the action, introducing a population of fully specific classifiers covering all state × action × duration combinations, and running PXCS with all the induction algorithms turned off. The resulting population was examined to determine whether the XCS had learnt the optimal duration from each state.
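The environment-side change can be sketched as follows; the env interface is hypothetical and stands in for the FSW code used here, so this is an illustrative outline rather than the implementation itself.

def execute_persistent_action(env, action, duration):
    # Apply `action` for `duration` consecutive steps. Payoff is only
    # granted if the full duration completes; an action that becomes
    # inappropriate for the current state aborts with payoff 0, and no
    # message is returned to the XCS while the counter runs down.
    reward = 0.0
    for _ in range(duration):
        if not env.action_is_legal(action):
            return 0.0                  # persisted too long: payoff 0
        reward = env.apply(action)      # reward from the state entered
    return reward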

[Figure: a ten-state, two-action FSW s0-s9, offering rewards of 600 and 1000.]

Figure 6.3 - An FSW to test persistence

    Classifier    Pred.    Exp.     Classifier    Pred.    Exp.
    00010→0000    710.0    207      00010→1000    710.0    208
    00010→0001    710.0    188      00010→1001    600.0    210
    00010→0010    710.0    206      00010→1010      0.0    198
    00010→0011    710.0    244      00010→1011      0.0    201
    00010→0100    710.0    218      00010→1100      0.0    197
    00010→0101    710.0    186      00010→1101      0.0    196
    00010→0110   1000.0    188      00010→1110      0.0    212
    00010→0111      0.0    217      00010→1111      0.0    204

Table 6.2 - Selected classifiers from a pre-loaded PXCS

Table 6.2, which gives a selection of classifiers from state s2, and figure 6.4, which pictures the coverage table, illustrate that the optimal duration from each state was correctly identified by PXCS. Durations that are too long achieve a stable prediction of 0 and durations that are too short all achieve a stable prediction of 710.0. As predicted, all classifiers that do not lead directly to the reward state indicate that they are only one step from the reward state, because there will always exist a classifier in the resulting state that specifies the correct duration to reach the reward state directly. Thus, as predicted within Hypothesis 6.2, PXCS reflects the number of different actions required to reach the reward, not the number of steps.

[Figure 6.4 - The Coverage Table for a pre-loaded population within PXCS operating with no induction within the two-reward environment: prediction (0-1000) plotted by state (s2..s8) and action (f1..f7, b1..b7)]

The ability of PXCS to learn optimal persistence through the induction mechanism was now investigated. To provide baseline results, the standard XCS was applied to the two reward environment depicted within figure 6.3. Ten runs were performed for 15000 explorations. The averaged coverage table given in table 6.3 demonstrates that the state space was fully covered and that the classifiers from the optimal sub-population [O] dominated their action sets.


    State    A=0      Macro  Micro  Max N    A=1      Macro  Micro  Max N
    00001    304.6    3      23     20       599.2    3      28     25
    00010    215.2    2      23     22       424.5    3      26     24
    00011    183.3    3      26     23       302.0    3      26     23
    00100    253.8    3      26     23       214.8    3      26     24
    00101    355.0    3      27     25       182.2    3      27     25
    00110    500.2    3      28     26       252.8    3      26     25
    00111    709.0    3      28     25       355.7    3      25     23
    01000    999.5    3      27     25       502.5    3      24     21

Table 6.3 - The averaged coverage table for ten runs of XCS within the two-reward environment. A = action (the column gives the averaged prediction for that action), Macro = number of macro-classifiers, Micro = number of micro-classifiers, Max N = largest numerosity from among the macro-classifiers in the action set.

Whilst the dominance of [O] was high, an analysis of the populations from the ten runs showed that in three of the ten runs there were competing generalisations available within some action sets. For example, s8 could be covered by ##000 or #1### and s1 could be covered by ##001 or #000#. It was hypothesised that the action set competition was caused by the absence of an input to XCS of a '00000' message. The existence of the 00000 input would mean that generalisations over a sequence of zeroes would be impossible and would force XCS to utilise the positions of the message bits that were set to 1 to identify the generalisations. The state messages were therefore re-organised so that sn produced a message corresponding to the binary representation of n-1. XCS was run with the new encoding and it was found that none of the ten populations produced from this run contained competing generalisations, confirming the hypothesis that the lack of the 00000 message had caused the problem.

Clearly it is possible that competing generalisations could have adversely affected learning performance, since XCS has to share numerosity between two classifiers within an action set. No reference to such problems was found in any other published XCS research, and so a quick investigation of the effect of the competition within these two experiments was conducted. A two-tailed t-test for unequal variances was applied to identify whether the message coding modification caused a difference between the average number of macro-classifiers appearing within each action set. This test showed that any difference was not significant (t=0.59062, tcrit=2.0518) at the 0.05 level. Thus, any additional generalisations available within this environment did not cause the number of competing classifiers to rise significantly. However, a statistically significant rise in the number of competing classifiers may not be necessary to cause disruption to XCS learning. A similar t-test on the number of micro-classifiers also showed no significant difference (t=0.09834, tcrit=2.04523), indicating that the niching mechanism prevented the presence of competing classifiers from reducing the number of micro-classifiers assigned to other action-sets.
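The form of comparison used here is a t-test for unequal variances (a Welch test); a sketch of the calculation, with invented illustrative data rather than the thesis results, is:

    from scipy import stats

    # Average macro-classifiers per action set, one value per run (invented):
    original_coding = [3.0, 3.1, 2.9, 3.2, 3.0, 2.8, 3.1, 3.0, 2.9, 3.1]
    modified_coding = [3.0, 2.9, 3.0, 3.1, 2.9, 3.0, 3.0, 3.1, 2.9, 3.0]

    t, p = stats.ttest_ind(original_coding, modified_coding, equal_var=False)
    print(t, p)   # a |t| below the critical value indicates no significant difference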

252

An examination of the performance graphs for the two averaged runs revealed that, contrary to expectations, the graph of the second [non-competitive] run shows a much higher initial relative error than that of the competitive experiment. This suggests that, in contradiction to the hypothesis, more difficulty was introduced by the removal of the additional representational possibilities available from the use of sequences of 0 within generalisations than by the existence of competing generalisations.

The apparent lack of negative performance impact from the existence of competing classifiers is surprising. A number of reasons could be put forward for this finding. Fundamentally, the use of averaged results in the statistical tests could have allowed the runs using the original message coding that did not produce competing classifiers to 'smooth out' the differences that may have been evident within the runs that contained competing classifiers. However, it was appropriate that the statistical tests should be between a sample of runs before and after treatment, and clearly if the problem of competing generalisations was likely to threaten the operation of XCS then it would have been seen in a higher proportion of the initial runs. It is possible that the lack of difference could be due to the high number of micro-classifiers allocated to each action set by the population size parameterisation - were this more limited, competition could cause the competing classifiers to be more vulnerable to deletion. Alternatively, a more limited population space could force the action set to 'select' one of the two competing generalisations. This matter requires further investigation, unfortunately beyond the scope of this research. For the present it is noted that a naïve choice of input encoding can lead to the maintenance of competing classifiers within some action sets that may adversely affect the learning performance of XCS.

Before PXCS was introduced, XCS was also tested with four action bits, of which only bit 0 was used, so that the ability of XCS to learn given the same number of action bits as would be required by PXCS could be ascertained. The population was increased to 2000, since the predicted [O] would now increase from 16 to 128 classifiers and in our experience 10-12 micro-classifiers are required for each member of [O] to allow it to establish itself without threat from competing generalisations. The number of learning episodes (where a learning episode is an exploration episode followed immediately by an exploitation episode) was increased to 15000. Figure 6.5 pictures the performance of XCS. XCS was able to establish [O] within 3000 exploration episodes and an examination of the coverage tables produced revealed that each action set was highly converged on the optimal classifier.


[Figure 6.5 - XCS performance with a four bit action within the two-reward corridor FSW: proportion plotted against exploitation episodes (curves: max relative error, min relative error, relative error, population, iterations)]

PXCS was now introduced, reducing the population to 1500 to allow for predicted generalisations and running each of the 10 test runs for 30000 learning episodes to allow for the increased complexity of learning these generalisations. On examining the resultant performance graph, the population had stabilised by 15000 episodes, with the system relative error reduced to close to zero by 4000 episodes, as shown in figure 6.6. Table 6.4 shows the averaged coverage table for the first three states of the environment. The rows of this table represent the durations (1-8) for action 0 only, whilst the columns reflect the averaged prediction, number of macro classifiers in the action set for the state, total numerosity of these classifiers, and maximum numerosity of the most represented classifier.


[Figure 6.6 - Performance of PXCS within the two-reward single chain FSW: proportion plotted against exploitation episodes (curves: max relative error, min relative error, relative error, population, iterations)]

            State 000                State 001                State 010
    Dur.    Pred.   m   N   >N      Pred.   m   N   >N      Pred.   m   N   >N
    1       709.0   5   64  25      709.3   5   55  28      708.9   5   41  35
    2       709.4   5   47  24      709.4   4   54  28      710.9   5   42  35
    3       709.2   5   44  24      711.1   5   36  26      706.3   5   41  35
    4       712.8   6   34  20      708.5   4   31  26      708.7   4   40  35
    5       709.3   5   46  22      712.6   5   35  25      713.3   6   27  17
    6       709.6   5   34  21      703.4   4   31  25      998.7   6   52  43
    7       709.7   6   34  19      999.2   6   55  45        4.1   6   33  24
    8       999.9   6   53  42        3.2   6   30  22        5.6   6   37  24

Table 6.4 - The coverage table for the forward action with durations 1-8 in the first three states (messages 000, 001, 010) from PXCS in a 2 reward state corridor environment. Heading m = number of macro-classifiers, N = number of micro-classifiers, >N = largest numerosity of a single macro-classifier.


When the population was inspected it was found that, although [O] was fully formed, more specific classifiers continued to exist within the population. A detailed examination of the populations revealed that the additional classifiers were all younger and yet more specific than the optimally general classifier within their action set. The PXCS was modified to examine the operation of the LCS during induction. It was found that on occasion the GA mutation operator will create classifiers outside the currently selected action set due to mutation of the action, as was also seen within section 4.3.6.1. Classifiers lying outside the current action set will not be subsumed by the existing optimally general classifier within the current action set, and Wilson (1996) does not provide for population-wide subsumption. Within XCS these are rapidly deleted, due to the wider GA opportunities of the optimal classifier if the classifier is fit but over-specific, or due to low fitness if the classifier is over-general. Within PXCS it was found that the exploration of the state × action × duration × payoff map was much less even than within XCS. It was therefore hypothesised that mutation by the GA introduced over-specific classifiers that were not eradicated within the PXCS.

This hypothesis was tested initially by reducing the population size from 1500 to 1000 and then to 800, to put more pressure on the general classifiers to make optimal use of the population space. Although this did eradicate the problem by a population size of 800, it also compromised the formation of [O]. The hypothesis was therefore further tested by modifying XCS to provide population-wide subsumption after a failure of normal action-set subsumption from the GA, as in section 4.3.6.1. The modified PXCS was re-run for 10 runs within the same environment and it was found that the populations were strongly converged with no over-specific classifiers. A single factor ANOVA test of the average number of macro-classifiers within each action set revealed a significant difference in the mean number of macro-classifiers within each action set between this run and that of the standard PXCS (P(0.01), F=109.062, Fcrit=6.73572). Together with the evidence that the over-specific classifiers were not seen within the standard XCS with four action bits, we therefore conclude that the hypothesis that the additional classifiers were caused by mutation moving classifiers outside of the action set was upheld. It is possible that a lower mutation rate than the 0.04 rate that was used may provide similar benefits to those seen within section 4.3.6.1, although this possibility was not tested experimentally.
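The population-wide subsumption fallback described above can be sketched as follows (hypothetical method names; not the modified PXCS source):

    def insert_ga_child(population, action_set, child):
        # First attempt normal action-set subsumption of the GA child.
        for cl in action_set:
            if cl.could_subsume(child):      # more general, accurate, experienced
                cl.numerosity += 1
                return
        # Fallback: attempt subsumption anywhere in the population, so that
        # children mutated outside the action set can still be absorbed.
        for cl in population:
            if cl.could_subsume(child):
                cl.numerosity += 1
                return
        population.append(child)             # otherwise insert as a new classifier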


In order to verify Hypothesis 6.1, each match set within the coverage table for this PXCS was examined. Figure 6.7 plots the predictions for actions against states, showing that PXCS selects duration 9−n for state sn, predicting a reward of 1000 for these actions. Thus, in all states the action leading to the highest available reward, regardless of duration, is selected. PXCS has also identified all action × duration combinations that are too long (prediction 0) and not long enough (prediction 710). Although more difficult to identify from the plot, PXCS has also identified the action × duration combinations that lead to a reward of 600. This demonstrates that PXCS is able to identify, maintain, and optimally utilise the classifier in each state that will allow the longest persistence in action on any action chain that leads to a stable environment reward, confirming hypothesis 6.1.

[Figure 6.7 - The Coverage Table for PXCS, now operating with induction within the two-reward environment: prediction (0-1000) plotted by state (s2..s8) and action (f1..f7, b1..b7)]

6.4.2 Investigating the Selection of Duration

When figure 6.7 is examined, it is also apparent that PXCS will always select the action which leads to state s9 even when in s1, unlike an XCS, which would trade off the size of reward and the distance to the reward to choose movement to s0 from s1. Whilst this verifies hypothesis 6.2, it does not verify the behaviour of PXCS in an environment


without a route providing a single [persistent] step direct to reward. To further test PXCS, a new FSW based upon a Benes Switch (used in computer network switching to create low contention switching from simple crossbar switches) was introduced. This FSW requires a four step solution for XCS, but the reward state can be reached using a two step solution with PXCS, and there are competing solutions requiring 3 or 4 steps. No single step solution to a non-zero reward exists for PXCS.

[Figure 6.8 - A FSW derived from the Benes Switch: twenty states s0..s19 arranged in rows and columns labelled with two-bit identifiers (00, 01, 10, 11), connected by actions 0 and 1, with reward 1000 at s9]

    Classifier    Pred.     Err.     Fit.     Acc.   N    AS
    ###00→0      357.74    0.0002   0.9745   1.00   35   36.94
    ###01→0      504.03    0.0003   0.9913   1.00   33   33.36
    ###10→0      710.00    0.0000   1.0000   1.00   35   37.15
    ###11→0        0.00    0.0000   1.0000   1.00   30   32.30
    ###00→1      357.56    0.0004   1.0000   1.00   34   37.72
    ###01→1      500.52    0.0024   1.0000   1.00   37   39.81
    ###10→1        3.86    0.0023   0.7320   1.00   25   38.53
    ##011→1      1000.0    0.0000   1.0000   1.00   31   34.00
    ##11#→1        2.06    0.0022   0.6643   1.00   22   37.49

Table 6.5 - An [O] from a typical run of XCS in the Benes Switch environment.


To help XCS utilise any potential generalisation across columns or rows, the non-reward states were labelled for the creation of input messages by concatenating the row bits and column bits identified in figure 6.8. Initially the start states were states s0, s5, s10 and s15. A base-line XCS learning experiment was conducted using a population limit of 300 micro-classifiers. Table 6.5 gives a typical [O] found by one of the ten runs used, demonstrating that XCS is able to find [O], can use generalisation to represent the problem in a compact manner, and can cause the [O] found to dominate all of the action sets.

    Classifier      Pred.      Classifier      Pred.
    ###0#→0000     503.57     ###0#→1000     503.31
    ###10→0000     707.84     ##0#1→1000     999.96
    ###11→0000       0.00     ##010→1000       6.32
    ###00→0001     503.67     ##011→1000     1000.0
    ###01→0001     708.38     ##11#→1000       0.27
    ###1#→0001       0.00     ####1→1001       0.17
    ####1→0010       0.00     ###00→1001     503.03
    ###00→0010     709.60     ###1#→1001       0.00
    ###1#→0010       0.00     #####→1010       0.12
    #####→0011       0.00     #####→1011       0.00
    #####→0100       0.00     #####→1100       0.00
    #####→0101       0.00     #####→1101       0.00
    #####→0110       0.00     #####→1110       0.00
    #####→0111       0.00     #####→1111       0.00

Table 6.6 - The optimal sub-population from a typical run of PXCS within the Benes Switch Environment.

When PXCS was run within this environment (with the population set to 1000 and 30000 learning episodes used in each of the 10 runs), although it was able to find an [O], some populations were unable to sustain all members of [O] at high numerosity and a small number of over-general classifiers continued to exist within the population. An examination of the relative experience of the classifiers revealed a highly irregular exploration pattern. Although this was also the case within XCS, the use of persistence meant that classifiers covering states within rows 1 and 2 were inadequately explored.


The disruptive effects of inadequate exploration had been noticed by Lanzi (1997a) in another context, but rather than employ his 'Tele-transportation' mechanism, which was devised for hypothesis testing only, the situation was remedied by allowing all non-reward states to be start states. As before, the presence of additional classifiers generated by mutation was also a problem, and so population subsumption for child classifiers not subsumed by the action set was also applied. In all 10 runs with PXCS, [O] was obtained. Table 6.6 shows a typical accurate sub-population selected from the ten runs. PXCS has learnt to apply a three step duration from any of states s0, s5, s10 and s15. Once in s3, PXCS selects a one step action into the reward state s9. Although other duration actions are available, in exploitation PXCS will select the highest payoff achievable with the lowest number of separate actions, as stated in hypothesis 6.2. Thus, hypothesis 6.2 was upheld.

6.4.3 Re-instating Temporal Difference

It has been shown that PXCS is able to identify the route with the lowest number of distinct actions, but at the cost of possibly ignoring nearer lower rewards. Thus, PXCS does not provide the temporal difference learning of XCS. It was hypothesised in section 6.2 that PXCS could be modified to discount the reward or payoff received so that it was equivalent to that received for a succession of single steps to the reward. Although this would restore the Temporal Difference properties of XCS, it would not allow PXCS to favour a higher reward obtained by initiating a persistent action. However, if the payoff was then further discounted by a small amount so that longer duration actions were favoured, the duration of the action could be taken into account. This technique was implemented within PXCS, creating the discounting PXCS (dPXCS). Whenever a reward was received the reward allocated was

γ^(δ−1) R

(where R is the reward and δ is the duration just applied). Payoffs were discounted as:

γ^δ P − (1 − δ/∆) p γ^δ P

(where P is the maximum prediction from the match sets in the next PXCS iteration, p=0.2 is a constant representing the amount of the payoff to be further adjusted, and ∆ is the maximum persistence possible). The two reward corridor environment from figure 6.3 was used as a test environment for dPXCS, allowing comparison with the results given in section 6.4.1. The experiment was run for 10 runs of 30000 learning episodes, each with a population size of 1500.
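A sketch of these two update quantities (using the symbols defined above, with γ=0.71 and p=0.2 as stated; max_persistence corresponds to ∆):

    GAMMA, P_CONST = 0.71, 0.2

    def dpxcs_reward(R, duration):
        # reward allocated when a persistent action reaches the reward state
        return GAMMA ** (duration - 1) * R

    def dpxcs_payoff(P, duration, max_persistence):
        # payoff discounted per step, less a small component that shrinks as
        # the duration approaches the maximum persistence, so that longer
        # durations are favoured
        base = GAMMA ** duration * P
        return base - (1 - duration / max_persistence) * P_CONST * base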

[Figure 6.9 - The Performance of dPXCS within the two-reward corridor environment: proportion plotted against exploitation episodes (curves: max relative error, min relative error, relative error, population)]

When the final populations were examined, it was found that dPXCS was able to learn the separate payoffs for each state × duration × action combination, reducing the system relative error rapidly (see figure 6.9). dPXCS formed [O] in all runs, and when the averaged coverage table was examined, as figure 6.10 illustrates, a gradation of the state × action × duration × payoff mapping was found.


It can be seen that, for example, in state s4 dPXCS will select the duration 5 forward action to s9, whereas in state s3 it will select the duration 3 backward action to s0. This shows that dPXCS is able to trade off path length and reward magnitude when selecting the optimal route, and thus hypothesis 6.3 is upheld.
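The trade-off can be checked roughly (a sketch only; it uses the dPXCS reward rule γ^(δ−1)R from section 6.4.3 and omits the small further discount component):

    GAMMA = 0.71
    forward_s4, backward_s4 = GAMMA**4 * 1000, GAMMA**3 * 600   # 254.1 vs 214.7
    forward_s3, backward_s3 = GAMMA**5 * 1000, GAMMA**2 * 600   # 180.4 vs 302.5
    print(forward_s4 > backward_s4)   # True: in s4 the forward action wins
    print(backward_s3 > forward_s3)   # True: in s3 the backward action wins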

[Figure 6.10 - Predictions of dPXCS for each action in each state in the two reward corridor FSW: prediction (0-1000) plotted by state (s2..s8) and action duration (1..7 forward, 1..7 backward)]

6.5 Use of Persistence as a Solution to the Consecutive State Problem

As noted within section 5.3.4, the technique of specifiable action persistence may possibly be used within a solution to the Consecutive State Problem. The basis of a hypothesis for the use of persistence selection over aliased states was outlined within section 5.3.4. Taking that argument, without repetition here, the following hypothesis is advanced:

Hypothesis 6.4: The use of action persistence specification can be applied to XCS as a solution to the Consecutive State Problem.


It is noted that if the hypothesis is correct, this solution would still be more limited than the simpler solution demonstrated within section 5.3.4 because of the inability to apply the same solution classifiers to different length consecutive states that present the same message. However, the possibility of a single solution to both the consecutive state problem and the provision of persistence makes further investigation worthwhile.

6.5.1 Experimental Investigation

For the investigation of the potential use of action persistence as a solution to the consecutive state problem, the persistence of an action was limited to the cases where the message remains the same in consecutive states. A classifier that tries to persist with an action for longer than a message is consecutively posted is given an arbitrary very low reward, so that the mappings for actions of an incorrect duration are low valued and thus not selected during exploitation. This simplified XCS learning so that the ability of action persistence to operate over consecutive aliasing states could be ascertained without any additional learning complexity. The two aliasing state version of FSW-9A, termed FSW-9A(2)-2 (see section 5.2), was used within this investigation.

PXCS was modified so that action persistence would cease whenever the environment returned a different message than that seen at the start of the current persistent action, in addition to the normal persistence termination triggers. If the full duration was completed before the message changed then the normal payoff mechanism was used at the end of the delay to give a constant feedback to the classifier. If, however, the full duration was not completed the PXCS now paid back the minimum environmental reward in lieu of the normal payoff. The experiments consisted of 10 runs of 15,000 exploitation trials with all other parameterisation kept as described in section 5.2.

The results showed that the system relative error, although reduced when compared to a standard XCS within the same aliasing environment (see figure 5.13), remained high for the two alias state test (figure 6.11). The performance within the same environment with the number of aliasing states increased to four was much worse, with a maximum system relative error little reduced from the standard XCS.

An examination of the populations from the runs within FSW-9A(2)-2 revealed that the PXCS found classifiers with high numerosity which identified that no length three or four delays were required in any state, and no length two delays were required in the non-aliasing states.


PXCS was also able to learn the generalised classifiers for most of the non-aliasing states, although there was a degree of disruption present. The classifiers covering the aliasing states were present in small numbers and each of the classifiers had low experience, suggesting that they were being continuously deleted and replaced. An examination of the location of the disruption of other classifiers revealed that, under exploration, a classifier could be invoked that moved into the second of the aliasing states from s7. This would cause the invocation of the classifier providing a two step delay to receive a zero reward, generating inaccuracy in that classifier's prediction. No solution to this problem, apart from a memory solution similar to the approach used by Lanzi, exists. Thus the provision of action persistence specification within XCS is inappropriate as a solution to the Consecutive State Problem, contrary to hypothesis 6.4.
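The termination rule investigated in this section might be sketched as follows (hypothetical helper names; not the modified PXCS source):

    def execute_until_message_changes(env, state, action, duration):
        """Persist with `action`, stopping early if the environment message
        changes; only a fully completed duration earns the normal payoff."""
        start_message = env.message(state)            # hypothetical helper
        for step in range(duration):
            if not env.action_legal(state, action):
                return state, False
            state = env.apply(state, action)
            if env.message(state) != start_message:   # message changed: cease
                return state, step + 1 == duration
        return state, True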

[Figure 6.11 - System Relative Error remains high within FSW-9A(2)-2 when attempting persistence delay learning: proportion plotted against exploration trials (curves: iterations, relative error, min relative error, max relative error)]

6.6 Summary of Results

The experimental investigation detailed within this chapter has demonstrated that it is possible to create a version of XCS that can learn the optimal number of iterations over which to allow an action to persist.


A naïve implementation, termed PXCS, re-applied the specified action in successive states for the number of iterations identified alongside the classifier action. A zero reward was given if the action duration continued past a terminal state or continued into a state for which the action was illegal; otherwise the payoff or reward of the state within which the action persistence stopped was given. It was shown that PXCS could learn the optimal [O] mapping state × action × duration × payoff prediction. This allows the generalisation hypothesis to be extended from state × action × payoff prediction mappings to state × action × duration × payoff prediction mappings. However, it was noted that the dominance of [O] was not as high as would be expected from the normal operation of XCS, and it is hypothesised that this is due to the larger proportion of unequal exploration caused by the lack of learning during state traversal whilst persistent actions are operating.

It was hypothesised that the naïve implementation would form mappings that would cause PXCS to select the route to reward with the least number of separate actions rather than trading off route length and reward magnitude as XCS does. Empirical investigations demonstrated this was the case, both within a simple two-reward corridor FSW environment and within a more complex multiple route FSW environment. Although it was recognised that this is a valid approach, it does modify the nature of XCS learning. A further hypothesis was presented that a proportional discounting mechanism could be applied to re-instate the normal XCS learning without compromising the ability of PXCS to learn the optimal action persistence. A new discounted update mechanism was presented and applied to PXCS, giving the discounting PXCS (dPXCS). Using a two reward corridor FSW environment the operation of dPXCS was empirically investigated and it was shown that dPXCS does indeed restore the reward distance versus magnitude trade-off of XCS to the persistent action version of XCS.

Although it was hypothesised that PXCS might be applied to resolve the consecutive state problem (Section 5.2), it was shown that this hypothesis was incorrect. When applied, PXCS would explore actions that led from any other state into each of the states within the sequence of consecutive aliasing states. This would lead to the fluctuation of payoff for these earlier states and thus exacerbate the problems caused by the consecutive aliasing states. Thus, this hypothesis was set aside on the basis of empirical evidence.


6.7 Discussion

The first application of the persistence of actions within a LCS was by Cobb and Grefenstette (1991) within their SAMUEL LCS implementation. Details of this implementation were given in section 6.2. Their implementation was within a hybrid Pittsburgh/Michigan LCS, and therefore within a LCS implementation that sought to identify an optimal solution rather than a complete mapping of state × action × duration. This work has demonstrated that similar techniques can be applied within XCS and will produce a complete state × action × duration × payoff prediction mapping. Furthermore, it has shown that XCS can continue to find the optimal generality for classifiers within this mapping. Finally, it has demonstrated that a full temporal difference representation can be accommodated within the mechanism. Thus, this work has not only extended the capabilities of XCS itself, but also extended the original results of Cobb and Grefenstette.

Despite the usefulness of the techniques identified by Cobb and Grefenstette, the use of action persistence was not further investigated by other workers. This is likely to have been due to the instabilities of Michigan-based LCS, which could not sustain the disparate population required for duration selection in addition to action selection. The closest other work is that reported within Carse (1994) and Carse and Fogarty (1994). They produced the Delayed Action Classifier System (DACS), based on Goldberg's SCS. DACS allowed the classifiers to identify a delay (in terms of LCS performance cycles) before which the action identified by the classifier would be applied. A message-list allowed a delayed action to be posted in addition to a current action. In iterations where a delayed action had been earlier identified, the delayed action was applied directly rather than using the normal LCS cycle. The main contribution of the work presented in this chapter towards applying DACS techniques to XCS is the ability of dPXCS to maintain the more complex population required for duration identification. Other obstacles to the application of DACS ideas to XCS exist, and at the time of writing separate investigations in this area are being conducted (Barry and Carse, 2000).

An interesting phenomenon was noted in the course of the investigations, although not discussed earlier in the chapter. It was noted that classifiers specifying an action duration that led directly to the reward state led to an earlier convergence to the maximum discounted prediction for their match set than would be the case through the normal feedback mechanism. Thus, the provision of persistence acts rather like the Bridging Classifiers proposed by Holland (1986) and investigated by Riolo (1989a).


Holland proposed that a resolution of the problems caused by the length of time taken for strength to move down the bucket-brigade to early classifiers in the rule-chain could be achieved through so-called "Bridging Classifiers". These classifiers would have sufficient generality to appear in both early and late positions within the rule-chain. Upon arriving for a first time at the rewarding state, the bridging classifiers would rapidly receive reinforcement to the correct strength and, by also appearing within earlier positions in the rule-chain, would strengthen the earlier classifiers too. Thus, Bridging Classifiers would produce a more rapid convergence of the rule-chain to the correct strength value. Whilst Riolo (1987b, 1989a) demonstrated empirically the effectiveness of this proposal, he also showed how difficult it was for a LCS to produce and then maintain both the bridging classifier and the other members of the rule-chain when under genetic pressure.

The ability of persistent action classifiers to move payoff rapidly to earlier match sets within the action chain is similar in many respects to the bridging classifiers. Since these classifiers are a result of the normal operation of dPXCS and are readily formed and maintained, they may be similarly beneficial within long action-chain formation. However, the added population complexity caused by the addition of duration specification to classifier actions may outweigh any benefit that early payoff propagation provides. Clearly this is a promising area for further research.

6.8 Conclusion and Further Work

These results demonstrate that the persistent action mechanism first proposed by Cobb and Grefenstette (1991) can be applied within XCS. Not only is XCS able to identify the optimal action persistence for the given environments, but it is also able to identify the optimal state × action × duration × payoff prediction mapping in a dominant optimally general sub-population. It has thus been demonstrated that the Generalisation Hypothesis and the Optimality Hypothesis apply to the discounting persistent XCS within these environments.

Further investigation applying the persistence mechanism in order to find a solution to the consecutive state problem (see Section 5.2) revealed that the mechanism was inadequate to solve this problem. Persistent actions would be identified that lead to states within the sequence of aliasing states. This would cause fluctuating payoffs to be propagated directly to other earlier states, thereby spreading the problems of payoff fluctuation caused by the aliasing states to more states within the environment.

The results presented on the persistence of actions within Markovian environments are encouraging. However, further work is required to identify whether these results can be applied within more complex environments.


The experimental work using the Benes network FSW environment would suggest that increased environmental complexity does not compromise the ability of PXCS, but even this environment remains relatively simple. Since the introduction of persistence causes the size of the environmental mapping to increase considerably, the limitations of this technique within more complex environments may come from limitations, as yet unknown, in the ability of XCS to learn and maintain very large disparate populations. Cobb and Grefenstette's work was applied to a [simplified] real-world problem. The application of dPXCS to such a real-world problem environment may be an area of investigation that reveals the limits of the dPXCS mechanism.

The environments constructed have not included null actions - actions leading back to the same state. It remains to be seen whether dPXCS can operate adequately within an environment that allows null actions. It is possible that a mechanism that can detect when the input for an action remains the same and immediately reward the appropriate discounted payoff could be introduced to deal with null actions in an elegant manner. The introduction of such a mechanism has yet to be investigated.

Further investigation is required in order to find a means of resolving the reduced ability of [O] to completely dominate the action sets during action persistence. It was hypothesised that this was because of unequal exploration during action persistence. In action persistence the learning algorithms are only invoked on the conclusion of the persistent action. This implementation was chosen so that a single persistent classifier would not obtain an unfair advantage in the GA arising from its continual update. However, it is possible that the provision of 'normal' payoff updates to the action sets during action persistence may restore the ability of the optimal sub-population to dominate the population.

Whilst the mechanisms within dPXCS are important discoveries for the development of XCS, it is recognised that they do not directly provide additional benefit in regard to the length of action chain that can be learnt. The persistence of actions can provide a kind of 'bridging' mechanism (Riolo, 1989a) that would undoubtedly be useful in propagating prediction payoff more rapidly to early states (see Holland, 1986). This may encourage the early stabilisation of payoff prediction in the early states and prevent the formation of generalisations that cover many of the early states (see section 4.3.6). However, even if this beneficial result was demonstrated, it would only be applicable in environments where a single action can be performed from an early state to a terminal state.


In an environment such as Woods-14 (Cliff and Ross, 1994), where very few same-action chains are available, this mechanism would provide little benefit.


Chapter 7

STRUCTURE IN LEARNING CLASSIFIER SYSTEMS

7.1 Introduction

Within the field of Cognitive Science, as in the fields of A.I. and Artificial Life, there is debate on the requirement for the use of high-level structured, possibly hierarchical, representations. There is, however, strong evidence from neural research that neural structures, within the visual cortex at least, are arranged in a complex hierarchical manner (Tsotsos, 1995). The motivation for such an arrangement can be readily seen if a flat structure were to be hypothesised. In this arrangement an action, such as a movement of the arm to a particular position, would have to be 're-learnt' within each behaviour in which it is used. This would lead to the duplication of functionality in many places for each individual behaviour - a highly undesirable situation. Whilst lesion research might suggest that some duplication is present (Tsotsos, 1995), the degree of duplication would seem to be low.

Whilst there might be a degree of agreement on the existence of structure in this form, there is very little consensus on the form(s) that this structure might take. The early work on Action Selection of Tinbergen (1966) and Lorenz (1985), for example, hypothesised highly hierarchical structures with control centres that switched behaviours on or off at any given level. Certain recent work, such as that of Rosenblatt and Payton (1989), maintains this hierarchical approach but provides a less discrete switching mechanism - the so-called 'Influence Hierarchy' - that appears to adjust more appropriately to modified situations than strict switching mechanisms (cf. Tyrell, 1992). Work within connectionist fields would support less distinction between levels, with a rich interconnection between cognitive centres (Tsotsos, 1995).

As noted in section 1.5, there is considerable debate about the usefulness of hierarchical or more structured approaches within Artificial Intelligence and Psychological research as well as within the Artificial Life field. Nonetheless, it is possible that a solution to the problems seen within traditional LCS in the formation of rule chains, and those problems that have been demonstrated to continue to exist within XCS action chaining (chapter 4), could lie with the use of structure.


This potential has not been lost on other workers within the field. Wilson (2000), in commenting on the application of XCS to non-Markovian environments, noted that:

"Learning of Hierarchical Behaviour is an… outstanding Reinforcement Learning problem. It is possible that internal "languages" will evolve [where] … certain bit positions will encode longer-term contexts while others encode behavioural details. This might open the way to realisation of 'Hierarchical classifier systems'."

This chapter has a threefold aim. Firstly, it seeks to identify a rationale for the investigation of hierarchy and structure within the LCS framework. This makes clear the objectives that work in this area should tackle and seeks to counter the rather ad-hoc motivation given for some previous work. Secondly, it identifies a framework under which future research in hierarchy and structure within LCS can be identified. Finally, it provides a review of the addition of hierarchy and structure within LCS to date in the context of the provided research framework. This is the first comprehensive review of hierarchical or structured LCS work and is complemented with an overview of hierarchical work within Reinforcement Learning so that lessons learnt across these two fields can be drawn together. The products of each of these aims will fill a clear void in the currently available research literature and, it is hoped, will provide a useful foundation for an expansion of research in this area.

7.2 A Rationale for the use of Structure within LCS

The motivation for the use of Hierarchy within Learning Classifier Systems is founded on pragmatic rather than theoretical reasons (see Barry, 1993). As has been identified in Chapter 2, the emergence of behaviour within LCS that goes beyond simple Stimulus-Response mechanisms relies on the production and maintenance of sequences of classifiers triggering to cause intentionality in regard to the future actions to be performed. Research by Riolo (1988b) and Forrest and Miller (1990) has demonstrated that the dynamics of the bucket brigade leave early scene-setting classifiers vulnerable to deletion, thereby preventing long rule chain formation. It has been shown (chapter 4) that whilst XCS addresses some of these problems by the use of temporal difference techniques, there remain practical limits in the extent of the rule chains which can be produced.[40] Clearly, if the same intentionality in operation could be produced by simple classifiers that are able to control the invocation of small sequences of classifiers operating at a lower level, then it would be possible to reduce the requirements on the sustainable length of rule chains. As Wilson (1989), in the introduction to his proposals for a Hierarchical Classifier System, puts it:

"The [bucket Brigade] algorithm has advantages of simplicity and locality, but may not adequately reinforce long action sequences. We suggest an alternative form for the algorithm and the system's operating principles designed to induce behavioural hierarchies in which the modularity of the hierarchy would keep all bucket-brigade chains short, thus more reinforceable and more rapidly learned, but overall action sequences could be long."

[40] Further fundamental limitations in the use of rule chains will be postulated later in section 7.3.3.

classifiers that are able to control the invocation of small sequences of classifiers operating at a lower level then it would be possible to reduce the requirements on the sustainable length of rule chains. As Wilson (1989) in his introduction to his proposals for a Hierarchical Classifier System puts it: "The [bucket Brigade] algorithm has advantages of simplicity and locality, but may not adequately reinforce long action sequences. We suggest an alternative form for the algorithm and the system's operating principles designed to induce behavioural hierarchies in which the modularity of the hierarchy would keep all bucket-brigade chains short, thus more reinforceable and more rapidly learned, but overall action sequences could be long." Whilst this pragmatic argument for the application of hierarchy to Learning Classifier Systems is acknowledged, the nature of the aims had led to some rather ill-considered investigation into the area. On the basis of pragmatism, any structure that can show an increase in 'performance' (however one may wish to define or measure this) must be valid. Researchers have often, therefore, designed hierarchical arrangements which have been hard-coded into the LCS, in many cases without any basis in physical or theoretical models, potentially imposing many constraints on the learning ability of the resultant LCS. Tyrell (1992) demonstrated empirically that hierarchical structures that look reasonable, even those from long-standing behaviourist research, do not necessarily provide the expected performance when the complexity of the action-selection task is increased. Furthermore Tyrell noted that performance in these environments is highly dependant on the form of the imposed structure. Although little theoretical underpinning of this work was provided, it would strongly suggest that the imposition of a structure that has not been derived from a body of empirical research or theoretical reasoning from domains such as ethology, psychology, or biology will limit the ability of the classifier system as it is moved to more complex environments. 7.3 A Proposed Framework for Research in Structured and Hierarchical LCS

Rather than follow an ad hoc investigation of hierarchical structures, it is important to identify the objectives that might be achieved from the use of a structuring of an LCS. By so doing, it will be possible to identify the actual contributions of previous research work and establish goals for the current research effort.


Although Lin (1993) does not explicitly enumerate them, it is possible to identify within his thesis a number of objectives that might be achieved from the introduction of Hierarchy to a learning system:

1. Abstraction - "By hierarchical learning, an agent learns elementary skills for solving problems first. After that, to learn a new skill for solving a complex problem, the agent only need to learn to co-ordinate the elementary skills it has developed."

2. Decomposition - "Hierarchical learning is a divide and conquer technique - a complex learning problem is decomposed into small pieces so that they can be easily solved".

3. Reuse - "Sub-problems shared among high-level problems need to be solved just once and their solutions can be re-used".

7.3.1 Abstraction

Abstraction is fundamental to the movement from sub-symbolic to symbolic computation, from simple low-level [reflex] responses to stimuli on to complex planned actions - "Cognition allows extrapolation beyond the sensory input and permits the maximum possible range of varied and flexible behaviour" (Toates, 1994). Since the use of rule chains is solely to provide this higher level computation, the establishment and maintenance of recognised abstractions and their appropriate invocation must be central to any structuring technique. The development of abstractions in this manner has often been considered in a top-down functional approach, with one rule chain 'procedurally' invoking another rule chain that provides a lower level functionality. Alternatively, one could consider the abstractions to be competences that do not need to be represented by a rule chain, but by a population of co-operative rules which can be 'switched in or out' as required. The requirement for a strict hierarchy may, however, be limiting - a richer connectivity between competences, so that intermediate 'levels' can be by-passed, may be more appropriate (Tsotsos, 1995).


An 'obvious' approach to the investigation of abstraction would be to pre-select competences that can be trained separately, layering other higher level competences on top of these. Much of the research effort expended in examining hierarchies has adopted this approach because it removes the problems involved in identifying, separating out and labelling competences that would be required in a more automated approach. Lin (1993) describes the self-formation of hierarchies as the 'Holy Grail' of hierarchy research. As Digney (1996a) notes:

"Hand decomposition imposes the designer's pre-conceived notions on the robot which, from the robot's point of view, may be in-efficient or incorrect. Furthermore, it is acknowledged that for truly general learning and full autonomy to occur in the face of unknown and changing environments, the structure of the hierarchical control system must also be learned."

7.3.2 Decomposition

Whilst decomposition is a necessary part of abstraction, it is identified separately to highlight an objective that is partially independent of abstraction - the potential for the decrease in learning complexity. Lin (1993) presents the result of a study of learning complexity in an abstract hierarchical square grid based problem space being explored by a Q-learning system (see Figure 7.1).

Figure 7.1 - An abstract decomposition of a grid-based state space.

He shows that the complexity of learning a route between two points on the grid with no hierarchy is initially of order O(n²), with the same order of efficiency maintained for subsequent learning tasks. However, if a single level of abstraction is added by creating a smaller two-dimensional grid within which each position references exclusively a group of positions on the original grid, the complexity of the initial learning task reduces to O(n^1.5), tending to O(n log n) for one-dimensional grids or O(n) for grids of higher dimensions. On subsequent learning tasks the efficiency order becomes O(n^1.5) for the second task and from then on reduces to O(n^(1/d)) for subsequent tasks, where d is the dimensionality of the grid. Clearly, within LCS, a reduction in the condition length of a classifier achieved by partitioning the competences will reduce the search space from 3^n (where n is the total number of bits required to encode all possible input messages) to 3^m (where m is the total number of bits required to encode the part of a message used within a particular competence 'module'). Typically m will be bounded by the number of abstract states that are considered at any given level, and therefore the larger the hierarchy, the lower m will become. Since LCS are relatively slow learning systems, this decrease in learning complexity can be a compelling objective in its own right.
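The scale of this search-space reduction can be illustrated with invented figures (not drawn from the thesis):

    # Each condition position may be 0, 1 or #, so a module seeing only m of
    # the n message bits searches a far smaller condition space.
    n, m = 16, 8
    print(3 ** n)   # 43046721 conditions over the full message
    print(3 ** m)   # 6561 conditions within a single competence 'module'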

Lin's prediction of the reduction in learning complexity is an important result that would be useful to apply to LCS. XCS and tabular Q-Learning systems are not directly comparable due to the incomplete nature of the mapping represented by the XCS population prior to the establishment of a complete covering of the state × action × payoff mapping. It is, however, possible to construct a situation where the performance of the two approaches could be said to be directly comparable. Consider a situation where XCS has identified [O] and, in accordance with the Optimality Hypothesis (Kovacs, 1996), [O] has dominated the action sets such that other competing classifiers have a negligible influence on the System Prediction. In this configuration it can be hypothesised that the operation of XCS in action selection is comparable to the operation of a tabular Q-Learning system - for each state the two systems will predict the same payoff, select the same action and receive a payoff equal to the prediction. In this case the two methods, considered as a 'Black Box', might be said to be comparable. However, at that point learning has been completed. There is thus no benefit drawn from the application of Lin's reasoning at this stage.

Now consider a situation where XCS is operating without generality - each classifier is fully specific. In this case XCS starts with no mapping, but since the covering operator within XCS introduces new classifiers to cover all inputs, there will be a response as though a look-up table existed. Unfortunately, until classifiers have been established to cover all actions for each input, XCS will not be selecting over a complete set of competing actions as tabular Q-Learning would. Even when these classifiers are introduced, their rate of convergence to their true payoff prediction will be slower than within a tabular learning system due to the late introduction of classifiers.


Thus, unless a configuration of XCS with a complete seeded initial population (and therefore no use of induction algorithms) is considered, learning within XCS is not directly comparable to that within Q-Learning. A model that links the learning of XCS to Q-Learning is highly desirable but beyond this work (see section 3.6).

7.3.3 Reuse

The formation of abstractions is ineffective unless the abstractions formed can be utilised effectively. Therefore there must be a mechanism provided for the labelling of abstractions and the appropriate invocation of abstractions. Whilst Holland (1985) proposed the production of Rule Chains, these were conceived as standalone constructs that would be invoked as a result of environmental input (there is ultimately some source) and would complete with an environmental output (there is ultimately some sink). Although potentially any number of classifiers could invoke a rule chain at any classifier within the chain simply by posting the appropriate internal message, it is not possible to identify a mechanism that could provide a procedural call and return. Consider a low level rule chain A (with classifiers represented in a symbolic form for simplicity) where the first condition is matched by an environment message, the second by an internal message, and quoted actions represent internal messages:

    a, #  → a'
    #, a' → b'
    #, b' → c'
    #, c' → d

This can be invoked by the environment producing the message a or by any classifier posting the message a'. Now consider a second rule chain B:

    m, #  → m'
    #, m' → n'
    #, n' → o'
    #, o' → p

If B wanted to invoke A it could do so by replacing the second classifier in B with the following classifier:

    #, m' → a'

there would be no means of rejoining B after invocation unless a further classifier was added to A:

    #, c' → n'

or a 'holding' parallel rule chain that does nothing for the length of the invoked rule chain was added:

    #, m' → x'
    #, x' → y'
    #, y' → z'
    #, z' → n'
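The call-without-return problem can be made concrete with a tiny message-list simulation (illustrative only; it uses the modified chain B without the additional rejoining classifier in A):

    RULES = [
        ("a'", "b'"), ("b'", "c'"), ("c'", "d"),   # internal steps of chain A
        ("m'", "a'"),                              # B's replaced second classifier
        ("n'", "o'"), ("o'", "p"),                 # the remainder of chain B
    ]

    message, trace = "m'", ["m'"]
    while True:
        action = next((act for cond, act in RULES if cond == message), None)
        if action is None:
            break
        message = action
        trace.append(message)

    print(" -> ".join(trace))   # m' -> a' -> b' -> c' -> d : chain B never resumes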

The first of these structures is less problematic. The provision of a classifier in A that connects back to B will result in the strength flowing down B being diverted to A. This will reduce the strength feedback to the preceding classifiers in B, but they will receive strength passed back from A both from the environmental reward received by A and from the reward flowing through A from later classifiers in B. However, if A was long, or A invoked another rule chain, the resulting distance between the early classifiers in B and the source of feedback strength would be longer, so reducing the strength received by earlier classifiers in the rule chain and thereby negating the value of the hierarchical re-use structure created. In the case of the insertion of delaying classifiers, a chain of classifiers equivalent in length to the combination of the chains invoked would be created, reducing the amount of strength fed back to the preceding classifiers in B by the same amount as a direct invocation would, but with the delaying classifiers not receiving the benefit of strength fed back by reward received for the invoked sequence. This would make the delaying classifiers prone to deletion, so disrupting the rule chain. Furthermore, the second chain, under the dynamics of the Canonical LCS, would share in the payoff, so reducing the strength of classifiers in each chain and exposing them further to the risk of deletion.


Finally, the creation of an appropriate length delay chain by the operation of the standard induction operators of the Canonical LCS is itself highly unlikely.

For accuracy based LCS it is hypothesised that the problems of aliasing will arise in addition to the issue of rule chaining. Consider an action chain that can be invoked using some [yet to be determined] mechanism. For any such "procedural action chain" it is unlikely that an invocation of the action chain will always be at a point that is the same distance from an environmental reward. Thus, if the payoff to the action chain is generated by the invoking classifier (or any temporal means that is dependent upon the payoff from the successive action set), the payoff to the procedural action chain will vary. This will generate inaccuracy in the classifiers acting within the action chain and cause them to eventually be deleted. Given this argument, any temporal-based reward to an action chain invoked procedurally in two or more different positions in an action chain will be equivalent to the use of the classifiers in the action chain over aliasing states. If this hypothesis is correct, the use of procedural action chains is thus limited to cases where the payoffs delivered to them are equal and/or independent of the position of action chain invocation in relation to an ultimate environmental reward.

Therefore, there are considerable difficulties with the provision of invoking rule chains within the Canonical LCS or action chains within an accuracy-based LCS that discourage the formation and maintenance of re-use structures. At its core, the penalty of invoking a procedural chain is the extension of the feedback of strength within the invoking rule or action chain. This penalty would have to be removed by a more immediate reward provision to the procedural chain from the invoking chain in order to encourage the development and maintenance of rule or action chains.
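A small numeric illustration of this aliasing argument (values invented; γ=0.71 as elsewhere in this work):

    GAMMA, R = 0.71, 1000.0
    payoff_near = GAMMA ** 1 * R   # chain invoked one step from the reward
    payoff_far  = GAMMA ** 3 * R   # the same chain invoked three steps from it
    print(payoff_near, payoff_far) # 710.0 vs 357.9: the chain's classifiers see
                                   # both payoffs, so their predictions cannot settle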

7.4 A Review of Hierarchy and Structure within LCS Research

In the investigation of the work carried out on hierarchy within Classifier Systems that follows, it will be seen that much of the investigative effort has been motivated primarily by "performance improvement", which is only a shallow metric of usefulness. It is hypothesised that a more useful approach would tackle the issues of Abstraction, Decomposition, or Reuse explicitly, directing any evaluation of the value of an approach at the stated goal.

The overview that follows does not seek to be an exhaustive study, but rather has the objective of classifying hierarchy work within LCS and extracting the key features from the work which might give insight into the performance of hierarchical classifier systems when adjudged within the three objectives Abstraction, Decomposition, and Reuse. In order to structure the coverage, the area that will be termed "Structured Classifier Systems" will be divided into three forms - Multiple Interacting LCS, Structured Population LCS, and Structured Encoding LCS - reflecting the granularity of structuring performed within the so-called Hierarchical LCS.

7.4.1 Multiple Interacting LCS

The most obvious form of hierarchy that might be envisaged is produced by dividing a single classifier system into a number of smaller classifier systems, with levels of control provided to select between them. Whilst obvious, this is not necessarily the most effective or suitable structure, as Tyrell (1992) has demonstrated. The clearest adoption of this structure has been in the work of Dorigo and colleagues to develop controllers for the AutonoMouse series of robotic devices. In Dorigo and Schnepf (1993) a direct comparison with Tinbergen's work is given alongside proposals for an ambitious programme to develop complex behavioural hierarchies selected by controller classifier systems. Actual implementations have demonstrated selection over at most three distinct behavioural competencies and two levels of control hierarchy (Dorigo and Colombetti, 1994; Dorigo, 1995). In initial tests the test environment was set up so that a simple stimulus-response classifier system was all that was needed to achieve good performance. As might be expected, the controller classifier systems did not appear to demonstrate any 'cognition' (Toates, 1994) or behaviour sequencing equivalent to rule chaining. Indeed, an examination of the reward system used soon reveals that there is a consistent and clear best choice that provides the control classifiers with high quality Stimulus-Response feedback even though they are separated from the Effectors by the competence-bearing classifier systems. Nevertheless, the results did demonstrate that performance gains could be made through the partitioning of the classifier system so as to reduce the search space for any one competence.

Subsequently, Colombetti and Dorigo (1994) demonstrated a mechanism for encouraging the production of behaviour sequences. A particular aspect of this was the use of an internal state space within the controllers to maintain a record of the current action being performed. This corresponds with similar state space provision by Booker (1988), Donnart and Meyer (1994), Cliff and Ross (1994), and Lanzi (1998a, 1998b), and may be contrasted with the use of Tags for memory provision in classifier systems (Riolo, 1990; Holland, 1990). Indeed, Lanzi and Riolo (2000) cite this work as the first use of an internal state memory within a LCS. Whilst behaviour sequencing was shown by the switching in and out of lower level classifier systems, each sequence switch was strongly reinforced. It is therefore unclear whether this switching behaviour constitutes behaviour sequencing as an emergent "cognitive" process.
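The control structure described above can be caricatured in a few lines of code: a top-level 'controller' policy selects which competence module handles the current input. The sketch below is a schematic reconstruction in Python, not Dorigo and Colombetti's implementation; all module names and rules are invented for illustration.

    # Sketch of a two-level behavioural hierarchy: a controller selects among
    # competence modules, each a simple stimulus-response mapping.
    # Schematic only; the module names and rules are illustrative.

    competences = {
        "chase": lambda obs: "forward" if obs["light_ahead"] else "turn",
        "avoid": lambda obs: "reverse" if obs["contact"] else "forward",
    }

    def controller(obs):
        """Top-level selection: in the real system this is itself a learnt LCS."""
        return "avoid" if obs["contact"] else "chase"

    obs = {"light_ahead": True, "contact": False}
    behaviour = controller(obs)           # which competence to run
    action = competences[behaviour](obs)  # the competence proposes the action
    print(behaviour, action)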


An alternative form of partitioning of a single classifier system is obtained when other 'helper' classifier systems are added. For example, Davis et al (1992) introduce short term memory to a classifier system, showing an improved rate of learning in terms of the number of external trials required to achieve a given performance level. The external memory mechanism that they introduce is not itself a classifier system, nor even a classifier storage area. It is simply a record of the action message that is produced for a given input message for each classifier system response that produces a useful reward. Each such record also contains a recency indication to allow the amount of influence the record has to decay over time. In between each presentation of an external message, a number of presentations of recent external messages stored in memory are allowed, with the classifiers that respond by matching the message and the corresponding stored action message being rewarded as if from the environment. It was shown that such a composition could significantly speed learning given suitable parameterisation. Whilst the memory was not itself a classifier system, its structure is similar and a more integrated classifier system structure that provides a "memory classifier system" influencing a "competence classifier system" was investigated (Davis, 1992). An alternative form of "memory" decomposition was presented by Zhou (1990) in his CSM system where established rule chains were moved into a Long Term Memory where they could be preserved from competition. Neither approach could be termed 'hierarchy', but they are examples of internal structure preservation techniques. An extreme form of helper classifier system structure was presented by Donnart and Meyer (1994). They introduced a system in which two classifier systems similar to simple Stimulus-Response Michigan Classifier Systems are used to provide a "Reactive Module" and "Planning Module" respectively. The conditions in the former match incoming detector messages and a current goal location in order to propose actions for the Animat. Sufficient rules to cover all condition combinations are provided so that no induction operators are required, but the classifiers are strength modified in the usual manner. The Planning Module matches in-coming detector messages and a current "task" - a state description in terms of a pair of co-ordinates that represent an ideal path along which the Animat should move. These classifiers compete to place a new task onto the "Context Generator", and therefore learn to sequence over a space of action plans. The strength of a classifier in this module is computed from the path length of the proposed task and from a global strength which is calculated by taking the average strength of all other classifiers which could post the same task. The other system components are not classifier systems, but support the classifier systems. The "Context
Generator" is essentially a stack of tasks that provide a state for the Planning Module to operate upon and a memory of previous states that are still relevant. It also transforms the topmost task into a goal location to use as input to the Reactive Module. The "Internal Retribution Module" controls the provision of rewards to the classifier systems, providing goal-related internal feedback for the Animat. Finally, the "AutoAnalysis Module" evaluates each movement proposed by the Reactive Module to check that movements consistent with the current goal are allowed by the environment. Where such movements are not possible, it produces a new sub-goal to try to move the Animat to a location where the original goal can be reapplied. Upon reaching a goal, this module will evaluate the route using satisfaction gradients in order to erase poor sub-goals. When a route is found which does not decrease satisfaction, the points on this route are used to add new planning classifiers to the Planning Module to ensure that the route is subsequently re-used by the Animat. Quite dramatic route planning results have been presented using this structure, and it is clear that the structure encourages the emergence of behaviour sequences characteristic of a motivationally autonomous system. However, much of this power comes from the Auto-Analysis module, rather than the classifier systems themselves. Whether this module can be replaced by an equivalent classifier system is debatable. It is interesting to see the use of explicit state maintenance, in common with Colombetti and Dorigo (1994), Booker (1988), and the proposals of Wilson (1989), and the addition of a classifier system operating solely on internal state in combination with the internal retribution system certainly moves this system away from the environmental dependency so common in other work. Unfortunately, the internal state is constructed by a programmed system, and therefore abstractions cannot be said to have emerged within this system. A truly co-operative structure was developed by Booker (1988) in order to create a classifier system with an internal model of the environment such that the classifier system could display Respondent Conditioning. Booker divided the classifier system into two, with one classifier system responding to environmental and motivational inputs (a motivational state was maintained, derived from the external reinforcements) and sending messages to the second classifier system via a shared [internal] message list. The first classifier system therefore allowed the learning system to classify inputs in order to create internal representations of input combinations. The second classifier system responded to these inputs and the motivational input to produce an effector message, allowing the learning system to relate internal classifications to responses. The messages on the internal message list could therefore be said to be abstractions that the
system was responding to. No separation of competences into separate classifier systems was performed - restricted mating policies were used to allow separate competences to be maintained without this separation. Booker's use of the message list as a control structure, and its role in abstraction development, is interesting to compare with the findings of Bull and Fogarty (1993, 1994). Bull and Fogarty showed that two heterogeneous classifier systems could be used with a shared message list so that they communicated to co-ordinate behaviour via the message list. Of particular interest was the finding that a form of structure could emerge whereby one LCS became the "controller" LCS, effectively switching the other LCS on and off by the use of appropriate "agreed" messages. Comparable work by Ono and Rahmani (1993) also showed that LCS could organise co-operative communications to solve a limited mate-finding task within a population of homogeneous agents. Barry (1993) drew a comparison between the use of the influence hierarchies proposed by Rosenblatt and Payton (1989), which were demonstrated to be highly effective for action-selection by Tyrell (1992), and the use of message lists for influence by Bull and Fogarty. It is clear that the shared use of message lists is a potentially important means by which systems can exert influence upon one another. Bull and Fogarty's work was later extended (Bull, Fogarty, and Snaith, 1995) to allow separate classifier systems to control the movement of legs in a quadruped robot, co-ordinating their actions through a shared message list. The initial LCS structure utilised separate populations with their own isolated genetic algorithm for each leg. Each LCS was based on the Pittsburgh approach. Subsequent work by Bull (1995) linked this work with the phenomenon of endosymbiosis (see section 7.4.2 for a further discussion of this work). In tests the heterogeneous LCS out-performed the endosymbiotic LCS, with the endosymbiotic LCS not able to develop a coherent communication strategy. However, the endosymbiotic LCS out-performed a single [Pittsburgh] LCS controlling all four legs. Much of the dramatic performance difference of the heterogeneous solution when compared to the single LCS solution was attributed to the reduction in state-space search produced by the division into separate LCS, which seemed to outweigh any increase in complexity required to establish a communication strategy between the LCS. A proposal that concentrated upon the development of co-operative structure with its primary emphasis on behaviour reuse within larger behavioural sequences was presented by Patel and Schnepf (1992). This proposal suggested a tracking mechanism to identify rule chains, and to then separate out each connected set of rule chains into separate
behavioural modules. The behavioural modules would be linked to one another by means of a spreading activation mechanism similar to that proposed by Maes (1991b). No clear mechanism for division was proposed and no system was produced, and independent investigations (Tyrell, 1992; Kennedy, 1994) have shown a number of very limiting constraints within the operation of the ANA architecture.

7.4.2 Structured Population LCS

The work of Bull and Fogarty (1993) and Bull, Fogarty and Snaith (1995) on co-operative populations of classifiers (discussed in section 7.4.1) led to the identification by Bull (1995) of the link between the concept of co-operating classifiers and the phenomenon of endosymbiosis. Endosymbiosis is the co-operation of organisms of different species that results in a raised fitness level for one or more of the organisms. It was hypothesised that if the development of the related classifiers within the populations in the heterogeneous LCS work could be tied genetically, it might be possible to increase the performance of these classifiers when operating together. Bull, Fogarty and Snaith (1995) applied this hypothesis to transform the heterogeneous LCS used for robot gait learning into a single population of agents, with each agent having four genomes representing the classifiers controlling each of the four legs. Each of the four genomes operated like a communicating LCS in performance, but their development over successive generations was tied by their co-location within the single population element when the Genetic Algorithm operates. Whilst this modification does not structure the population explicitly, the classifiers within the population can be considered to be structured. In comparative experiments that sought to learn the gait of a quadruped robot the heterogeneous LCS out-performed the endosymbiotic LCS, with the endosymbiotic LCS not able to develop a coherent communication strategy. However, the endosymbiotic LCS out-performed a single [Pittsburgh] LCS seeking to perform the same task. Further investigation (Bull, Fogarty and Pipe, 1995) identified the conditions under which endosymbiosis would operate effectively, and Bull, Fogarty and Snaith (1995) suggested that the weaker performance of the endosymbiotic solution may be attributable to a lack of population mixing through the G.A. in the approach adopted.

Rather than divide the classifiers themselves, an alternative way of dividing a single classifier system population is to limit the range of classifiers over which certain classifier system operations (typically the induction operators, although matching and bidding may also be modified) are performed. For example, Shu and Schaeffer (1989) divide their HCS classifier system population into 'families' where each family consists
of a small number of classifiers, typically 3-4 but dependent on the search space of the problem. The G.A. operates either by crossover of families, with crossover points falling between classifiers, or by crossover of families in which each classifier in the family is crossed over with the corresponding classifier in the other family. Bidding of classifiers is weighted to include a proportion of the fitness of the whole family. This strategy is very much in sympathy with niching work in G.A.s or restricted mating policies in classifier systems to allow multiple objective searching (Lin et al, 1994), and the development of the HCS architecture was motivated by similar concerns. No abstraction or re-use is sought, and the authors comment that preservation of Default Hierarchies (and therefore, it may be suggested, Rule Chains as well) is problematic unless the operators used to distribute population members into families take this relationship into account. Possible developments of this kind of architecture could involve the adoption of other forms of population niching, such as the neighbourhood niching used by Davidor (1991) or the 'Island model' adopted by Sumida et al (1990), and work within parallel and distributed coarse-grained Genetic Algorithms (see Cantu-Paz, 1997). Although such work would undoubtedly produce performance improvements and go some way towards solving certain identifiable problems in classifier systems, it would not address the three objectives identified in section 7.3. However, a greater understanding of methods of dynamic population decomposition and their effects on learning is vital for further research into emergent hierarchical structures. It is possible that related work, such as that performed within Harvey's (1992a, 1992b, 1994) SAGA system on capability development through variable length representations, could be important in this area.

Smith (1994), studying the emergence of parasite rules when internal memory bits are utilised in conjunction with the Bucket-Brigade in a Holland-style LCS, identified that larger multi-rule structures could be applied to tackle the conditions leading to the emergence of parasites. These "Classifier Corporations" were first postulated by Wilson and Goldberg (1989) as the second of two proposals to deal with the problem of generating and maintaining long chains of classifiers41. It was noted that some of the problems with rule chains arose from the problem of providing a G.A. that would optimise (implying competition between population members) whilst maintaining a population of co-operative classifiers. It was hypothesised that:


"Classifiers in a module chain are basically co-operative and stand or fall together. On the other hand, separate modules are in competition if they address similar purposes. ... In a classifier system ...[where] for the purposes of reproduction classifiers could form co-operative clusters ... called corporations. ... if two classifiers belonged to the same corporation they would not compete with each other because the corporation could only be reproduced or deleted as a unit. Corporations could form and break up through a modified crossover operator. The performance and reinforcement components of the classifier system would function almost as usual. For the purposes of reproduction, the fitness of a corporation would depend on the strengths of its constituent classifiers in such a way that clustering of cooperators was advantageous over their remaining single." Smith applied this idea by making the corporation synonymous with those classifiers that posted and responded within a rule-chain. He gave each corporation a "Leader classifier" that competed for activation in the normal Holland-style LCS manner and posted an internal message. Other classifiers within the rule chain that followed were termed "Follower classifiers" and were [contrary to Wilson and Goldberg's mechanism that suggested no change in the performance section of the LCS] given priority in activation in successive steps until no internal message is posted, no classifier exists that matches the internal message, or an external reward is received. When the G.A. is invoked, the classifiers within the corporation are treated as a group and their fitness is the average of the fitness of the members of the corporation. He hypothesised that parasites appear due to their ability to utilise the internal memory state settings without relation to an external reward. The parasites remain in the LCS because they receive payoff within the rule chain even though they provide no useful behaviour. By introducing corporations, the fitness of the parasite is derived from its own fitness. Since the fitness of parasites is in any case lower than that of the classifiers it feeds off (since it derives its reward purely from payoff from other classifiers and not from direct rulechain involvement), it will have a lower fitness than other members of the corporation an will therefore have genetic pressure exerted upon it to drive it out. Smith was able to provide an initial demonstration of the potential of corporations, but it was not until the work of Wilcox (1995) and Tomlinson and Bull (Tomlinson and Bull, 1998, 1999a, 1999b, 1999c; Tomlinson, 1999) that a fuller examination of Corporations was conducted.


Wilcox (1995) augments the corporation hypothesis by relating the concept of classifier clustering to the economic framework of Transaction Costs (Coase, 1988). He notes that individuals join together within organisations in order to reduce the overheads or costs associated with finding trusted industrial suppliers and/or partners. Similarly, he argues, arranging classifiers into organisations so that classifiers preferentially match messages posted by other classifiers within the same organisation reduces the cost (in strength-loss terms) of finding reliable classifiers to join a rule chain. If the strength of a classifier is considered to be representative of the 'reputation' of the classifier, selection of classifiers to join organisations based on strength represents selection by reputation. Clearly the work of Kovacs (Kovacs and Kerber, 2000; Kovacs, 2000a) strongly suggests that this is a misrepresentation of the role of strength, treating it more like an accuracy measure, although this could only be recognised with hindsight. To apply organisations to a LCS implementation Wilcox used the test environments developed by Smith (1991) and augmented Goldberg's SCS (Goldberg, 1989). Wilcox's mechanism explicitly identifies classifiers with an organisation using an additional attribute within each classifier. All classifiers in the population always belong to an organisation. The initial population can be constructed so that all classifiers belong to an organisation of size 1, or the LCS could start with other organisational sizes. Wilcox notes that all mechanisms to deal with parasitic behaviour in human environments that are not based upon centralised control make use of reputation, and thus if strength can be equated to reputation it would be appropriate to use strength to determine organisational membership. Unfortunately the length of time required for classifiers to settle to their true strength value destroys its worth for this purpose. Instead he replaces strength with two new reputation values. Short-term reputation is used for action selection and is calculated from the reward received and a fraction of the sum of the long-term reputation values of the classifiers it triggers in the next generation, and thus reflects the instantaneous worth of the classifier. Long-term reputation is used for selection into organisations and is the last reward plus a recency-weighted average of the worth of the classifier to other classifiers, calculated using an update scheme similar to the Widrow-Hoff update. Like classifiers, organisations have reputation values for their selection: the short-term reputation is calculated using the sum of the long-term values of the classifiers in the organisation with a smaller additive component derived from the short-term classifier values. Long-term organisation value is constructed using a simplistic measure - the number of times the use of the organisation led to an environmental reward
expressed as a proportion of the total number of times the organisation was used. When a message is received, the OCS uses greedy selection over the long-term value of the organisations to identify an organisation to use, and then uses greedy selection over classifier short-term value within the organisation to select the classifier. Once a classifier generates an internal message, that internal message is posted to a message list that is local to the organisation so that only classifiers within the organisation will operate. This continues until an internal message is not posted or a reward is received. Organisations are manipulated by a highly modified G.A. Four new operators are provided instead of crossover or mutation. ADDONE selects two existing organisations and produces two new organisations identical to the selected organisations, then moves one randomly selected classifier from one of the new organisations to the other. SUBONE selects one organisation and duplicates it, creates one new empty organisation, and moves a random classifier from the duplicated organisation to the empty organisation. For these operators the selection of the classifier to move was based on its long-term value and was only carried out from the sub-set of classifiers that posted internal messages that were matched by its members. Two further operators that moved large numbers of classifiers between organisations were investigated but not used in the LCS implementation. In all cases selection is greedy - the organisations with the highest fitness are selected. The fitness of an organisation is calculated as the benefit achieved by having the classifiers as members of the organisation, minus the cost of including them, divided by the number of members, with formulae provided for calculating benefit and cost. Once these operators have produced the child organisations, the pair of organisations that were originally selected are involved in tournament selection with the pair that have been created - the pair containing the highest fitness organisation remains in the population and the other is deleted. Thus, these operators serve to manipulate organisation size and membership but retain the original classifier population. Comparisons of OCS and SCS using Smith's test environment showed that OCS was able to achieve a similar performance when few parasites were present, and demonstrated much better performance when many parasites were present - a situation that caused SCS to perform extremely badly. Interestingly, an additional side-effect was the development of organisations acting as a form of Default Hierarchy to one another. Large organisations that contained a large number of parasitic classifiers would win the bid for the majority of inputs due to the large number of classifiers they contained. Those inputs only covered by parasites would produce a low match score, and other smaller organisations
with fewer parasites could take over, maintaining a higher overall performance score than the number of parasites present would suggest possible. Wilcox concludes that the OCS was able to dynamically organise the population so that parasites were isolated from the parts of the population containing useful classifiers. Wilson and Goldberg (1989) and Smith (1994) suggested that their corporate LCS proposal would exert pressure to remove parasites. Wilcox's mechanism cannot be said to demonstrate the validity of their particular proposals since parasites are not removed, although it does suggest that the use of population structuring in this manner will reduce the detrimental effects of parasitic classifiers. Unfortunately, given the number of modifications to the underlying LCS that Wilcox's mechanisms require, it is hard to see which aspects of the OCS are useful and necessary and which aspects could be discarded. A more systematic investigation of the mechanisms within OCS is required, although the work of Tomlinson has shown that a Corporate LCS can be developed that is simpler than the OCS and yet appears to better control parasite classifiers.

In contrast to Wilcox's approach, Tomlinson and Bull introduced Smith's proposals for corporation structure with little further modification. They used ZCS (Wilson, 1994) as a base LCS to investigate the benefits that might be derived from the use of corporations. They demonstrated that the learning of the solution to a number of Boolean functions was improved by the introduction of corporations. Whilst Boolean function solving does not require rule chains to be preserved, it does benefit from the increased population co-operation that such structuring introduces. In this, the introduction of corporations provides similar benefits to the Holland LCS as Shu and Schaeffer's HCS implementation, although now without the rather arbitrary population subdivision of HCS. Corporations were also applied to the solution of the Woods-1 problem (Wilson, 1994), a problem requiring short rule chains of 3-4 classifiers. The corporation solution also improved performance in terms of learning speed within this environment when compared to ZCS. A simplification of the corporation scheme to exclude the preference for follower classifiers in action selection, returning the corporate LCS to the proposals of Wilson and Goldberg, performed only slightly better in Woods-1 than ZCS. Tests using this simpler corporate LCS with one of the Boolean functions revealed that Shu and Schaeffer's HCS was unable to perform as well as the corporate ZCS, although this was attributed to differences in crossover and mutation operators and their application. Further tests revealed that although the simplified corporate LCS was able to reach a solution in the Boolean test faster than ZCS alone, this was also attributable to differences in the rate of GA application. Thus, it was concluded that
although the use of corporations did show benefits, it did require the performance subsystem modification derived from the proposals of Smith (1994). Thus, the chief benefit of corporations was from the increased co-operation they provided, as hypothesised by Wilson and Goldberg. The most interesting work came with the application of the corporate LCS to the solution of a number of Delayed Reward Tasks. These consisted of a set of identical small FSW that each had one start state and led to a set of terminal states, with the same reward located at a different terminal state in each FSW. The set of FSW produced different messages for each start state to identify the particular FSW, but thereafter the states of the separate FSW produced the same message.

Figure 7.2 - Two FSW from a Simple DRT Environment

This was therefore a highly non-Markovian environment. Clearly such an environment could only be solved if a rule chain could be followed from the initial state to the correct reward state, dependent only upon the initial input42. It was shown that the corporate LCS implementation was able to learn to solve these tasks optimally because of its mechanisms to reliably construct and maintain rule chains.

42 Clearly such a task should be readily solvable by LCS implementations such as the Holland (1991) Tag-Mediated Lookahead implemented by Riolo (1991) in his CFSC-2, although that mechanism is more complex.

Their work in Corporate LCS was concluded by the application of corporations to XCS. This is an unusual step, since XCS resolves the population co-operation - competition problem using its own niching and performance mechanisms and so should not benefit from the introduction of corporations to maintain action chains. It is therefore unsurprising that the corporate XCS showed no benefit in its application to the Woods-2 environment. As discussed in chapters 3, 5, and 6, however, XCS cannot be applied to non-Markovian environments without additional mechanisms to disambiguate the environment. The results of the corporate LCS within the non-Markovian DRT environments would suggest that corporations might be used to help XCS in the resolution of non-Markovian environments. Tomlinson and Bull (1999c) hypothesise that involvement in a corporation could result in a classifier within a corporation becoming accurate due to the final payoff where it could not be accurate within the environment itself. To implement corporations within XCS (which does not utilise internal messages) an explicit record of the preceding and following classifiers in the rule chain is maintained. The XCS performance cycle is modified so that when a corporation is invoked the next classifier is chosen preferentially by a deterministic mechanism. Two records - corporate fitness and corporate niche size estimate - are introduced for use by classifiers within the corporations when involved in the corporate GA. Unfortunately, when applied to the DRT environments the performance of the corporate XCS, whilst much better than that of XCS, is considerably lower than within the original corporate LCS. Modifications to the corporate XCS were introduced to weight the accuracy calculation based on the probability of the predicted next classifier in the rule chain actually being invoked. This, alongside changes in the explore/exploit regime, improved performance, although it remained some way below that of the original corporate LCS. Looking at the published results, a number of problems with this work can be identified. Firstly, it is based upon an XCS implementation that continues to use the Wilson (1995) match-set GA rather than the later Wilson (1996) action-set GA and does not provide subsumption. Strangely, the important, though possibly not fundamental (Kovacs, 1996), Macro-classifier mechanism was also not used. Given these omissions their XCS implementation may actually hinder optimal performance in the Woods task at least (compare Wilson, 1995 and Wilson, 1996), and therefore not help to reveal the usefulness of corporations within this environment. In particular, the lack of subsumption has been shown to prevent the population from focusing upon the optimal population and therefore reduces the distinction between the optimal actions within the match set. This would discourage focus on useful corporations and could contribute to the proliferation of poor corporations reported within their experimental work. Unfortunately a more fundamental criticism may be levelled at their work in terms of the implementation of corporations. XCS operates using an action chain approach rather than a rule-chain approach. Whilst a dominant classifier will appear within each action set, this may not be the classifier identified within the corporation - the optimally general most accurate classifier may not have been included within the corporation.
Thus, the corporate GA may be operating in competition with the normal XCS GA. Since these criticisms are fundamental, it is important that the use of corporations within XCS be revisited to identify and investigate an implementation that is more in tune with the XCS approach. The corporate LCS approach has been demonstrated to provide a solution to the rule-chaining problem that is better than that provided by any earlier solution. It is important because it demonstrates that rule chains can be used in a 'procedural' fashion and maintained within a single population. However, it is not a truly hierarchical approach because the rule chain is invoked by the leader classifier within normal competition rather than by a higher level rule-chain. Nonetheless, the possibility remains that the addition of a trivial extension to the corporate LCS could provide the first truly emergent hierarchical LCS implementation. Given that the benefits of co-operation are also available within XCS without the additional corporation infrastructure, it could also be the case that XCS is much more amenable to the addition of hierarchical features than other previous LCS, as suggested by the hypotheses in section 1.7.

All the structural adjustments seen so far have sought to divide the classifier space in some way. Wilson (1989) presented the most novel, and in terms of the three objectives identified in section 7.3 the most promising, hierarchical classifier system structure by modifying not the classifiers themselves but the mechanics of the classifier system. Unlike the later work of Tomlinson and Bull, he had the initial objective of removing the requirement for long chains of classifiers. He suggested that the classifier system should structure the message list so that a detector message that causes a classifier to fire an internal message is actually representative of an environmental marker. Rather than remove it, the message should be maintained (the message list therefore becomes a small stack-like state memory), and the system should seek to find a classifier that matches both the detector message and the internal message in the next iteration. If a classifier exists that matches both of these, the LCS is then effectively in a 'module' that is performing the actions that compose a particular behaviour. If the classifier posts an effector message the action is performed and the effector message is removed from the message list, so that the classifier system seeks to match the environment and internal message again. If another internal message is posted, it is assumed that a sub-module within the module is being started, and so the classifier system seeks to match the environmental message and the new internal message, retaining the old internal message on the stack-like internal message list. Once effector messages are being produced, classifiers are free to match any of the messages that have been preserved on the
message list. If a classifier is selected that has matched an earlier message than the current internal message, it is assumed that the behaviour modules below that level in the message list have ended and so all those lower messages are removed from the internal message list. This proposal provides abstraction (in the messages starting off behavioural modules, which may be said to represent the system's understanding of a higher level behaviour) and reuse (any behaviour could make use of any other behaviour). It also provides performance-improving decomposition insofar as it reduces the requirement for long chains of classifiers, but it does not improve performance by reducing the search space. Its similarity in operation to the work of Ring (1994) is intriguing. Unfortunately Wilson's work remains a proposal at present. This is primarily due to the difficulty of maintaining a co-operative population of classifiers in the traditional LCS. It may be that the corporate LCS (Tomlinson, 1999) or XCS with the internal state mechanism described by Lanzi (1998a, 1998b) can be utilised to produce an implementation of a hierarchical LCS in a manner close to that hypothesised by Wilson (1988).
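Wilson's stack-like message list can be rendered as a short sketch. The Python reconstruction below follows the prose description above - internal messages are pushed as module markers, effector messages are consumed, and matching an older marker pops every module above it. It is an interpretation of the proposal, not an implementation from Wilson (1989), and all message names are illustrative.

    # Sketch of Wilson's proposed stack-like internal message list.
    # An interpretation of the prose description; all details illustrative.

    stack = []  # internal messages, one per active behavioural module

    def post(message, kind):
        """Process one classifier firing against the stacked module markers."""
        if kind == "internal":
            stack.append(message)          # a (sub-)module has been started
        elif kind == "effector":
            pass                           # action performed; message not retained
        elif kind == "match_older":
            # A classifier matched a marker deeper in the stack: the modules
            # above that level are assumed finished and are popped off.
            while stack and stack[-1] != message:
                stack.pop()

    post("explore", "internal")     # enter the 'explore' module
    post("turn-left", "internal")   # sub-module within 'explore'
    post("step", "effector")        # primitive action inside the sub-module
    post("explore", "match_older")  # classifier matched the 'explore' marker
    print(stack)                    # -> ['explore']: the sub-module has ended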

7.4.3 Structured Encoding LCS

Work within this final category does not fit into the idea of hierarchy providing abstraction, decomposition, or reuse. Rather, it follows developments within Genetic Algorithms and (in particular) Genetic Programming in order to try to improve the operation of the genetic algorithm. For example, Iba, deGaris and Higuchi (1992) introduce a Pittsburgh classifier system using 'Structured Classifiers'. Each population member is a production system, but rather than represent the members as flat classifier populations, they structure each production system into a tree-like structure with classifiers as the leaves of the tree. This can produce performance advantages within an appropriate G.A., since the G.A. will more often perform crossover of complete rule clusters. Additionally, they modify the rule selection process to a breadth-first search of the tree representing the population, so that earlier rules in the population have more chance of selection. This allows the population to carry genetic material for use in crossover that is not actually used by a given production system at a particular point in time, but may come back into use as a result of the operation of the GA. Oppacher and Deugo (1995) use similar tree structuring techniques within a GA and show that a version of Holland's Schema theory can apply.
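The advantage obtained by Iba, deGaris and Higuchi can be seen in a sketch of subtree crossover over tree-structured rule sets: exchanging whole subtrees moves complete rule clusters between production systems, where a flat one-point crossover would tend to cut through them. The Python sketch below uses toy rules; the tree shapes and conditions are illustrative.

    import random

    # Sketch: subtree crossover on tree-structured Pittsburgh individuals.
    # Leaves are classifiers; internal nodes are lists of subtrees. Toy data.

    def subtrees(tree, acc):
        """Collect every internal node so a crossover point can be chosen."""
        if isinstance(tree, list):
            acc.append(tree)
            for child in tree:
                subtrees(child, acc)
        return acc

    def crossover(a, b, rng):
        """Swap one randomly chosen subtree between two individuals (in place)."""
        na, nb = rng.choice(subtrees(a, [])), rng.choice(subtrees(b, []))
        i, j = rng.randrange(len(na)), rng.randrange(len(nb))
        na[i], nb[j] = nb[j], na[i]   # whole rule clusters move as units

    rng = random.Random(1)
    parent1 = [["00#:A", "01#:B"], ["1##:C"]]
    parent2 = [["#10:D"], ["#0#:E", "11#:F"]]
    crossover(parent1, parent2, rng)
    print(parent1, parent2)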


Valuable though this work is, this kind of hierarchical structure is only beneficial for the G.A. It does not contribute anything towards the three objectives of Abstraction, Reuse, or Decomposition identified in section 7.3.

7.5 Discussion

Many different approaches to the introduction of structure into Learning Classifier Systems have been proposed. Within Dorigo and Colombetti's work it has been shown that subdivision of a population can reduce the search space sufficiently to increase competence learning performance, with controller classifier systems learning adequately to switch between behaviours. An alternative approach to search space division without prior decomposition into competences was presented by Shu and Schaeffer. Their work parallels niching work within the GA community and does not show promise in regard to automated decomposition into competences. The work of Tomlinson and Bull provides a mechanism for the identification and maintenance of rule-chains that can be treated procedurally and invoked by a single classifier. This provides a mechanism for the automated emergence of structure that meets the abstraction and reuse objectives. However, the proposals have not yet been extended to the hierarchical invocation of discovered rule chains. Bull's research into classifier system endosymbiosis has shown promise in the identification and maintenance of areas of co-operation between classifiers, and the demonstration of emergent control through shared messages continues to hold promise. Booker, Donnart and Meyer, and the proposals of Wilson have all suggested ways in which abstraction over behavioural sequences may be provided. Work in this area is truly in its infancy, however, with no abstraction development conclusively demonstrated without pre-programmed assistance. Whilst a number of small steps have been taken, it is clear that there has been no generally accepted breakthrough on any of the three objectives identified in the introduction. Many avenues for further research are open for exploration, and there is much potential for novel approaches to be introduced.

7.6 Hierarchy and Structure in Reinforcement Learning

Since the work of Lin (1992), Singh (1992a, 1992b) and Kaelbling (1993) the Reinforcement Learning research community has paid much more attention to the possibility of hierarchical representation than the LCS community. This is partially due to the tabular representation of the state × action × payoff space, which became unwieldy for more than trivial problems, although Neural Network and Radial Basis Function representations provided generalisation mechanisms that could also reduce table size. It
was also partially driven by the advantages in terms of learning speed and reuse of learnt experiences that a more structured representation might provide (Munos and Patinel, 1994). Finally, there was the motivation provided by the attractiveness of the possibility of a hierarchical solution itself (e.g. Digney, 1994). Whilst Munos and Patinel (1994) attempted to relate previous work within Reinforcement Learning (Munos, 1992) with a LCS approach, until the advent of XCS the link between the various Reinforcement Learning approaches and LCS was too weak to produce crossover work with a significant impact on the LCS community at large43. This is perhaps set to change, with clear recognition in a number of recent papers and conferences (Lanzi and Riolo, 2000; Wilson, 2000b) of the potential crossover between the RL and XCS communities. Given the relative differences in the level of insight into hierarchical structures and learning within the Reinforcement Learning community, it is sensible to end this chapter with a brief overview of their work and mention of potential applications to LCS work. To date no review of hierarchical work within the Reinforcement Learning community is available, even though the literature in this area has increased rapidly in recent years. This section is not intended to present a review of the work in the area, but samples work from the area that identifies key features that may be applicable to research within the LCS community.

43 See Frey and Slate, 1991; Dorigo and Bersini, 1994; Giani, Baiardim and Starita, 1994, 1998, for examples of some crossover into Michigan LCS. See Moriarty, Schultz and Grefenstette (1999) for a review of developments within Pittsburgh LCS that also provides some Reinforcement Learning references.

7.6.1 State-space Sub-division

The Feudal Reinforcement Learning work of Dayan and Hinton (1993) employed perhaps the simplest form of hierarchical structure: a pre-imposed hierarchy where each layer was formed from a strict composition of the lower layer. In one example grid world this decomposition was extreme - each high level location covered four lower level locations, similar to the structure shown earlier in figure 7.1. Each higher level acted as a "Feudal Lord" over the lower level. Within the highest level the current state would be analysed at a very coarse granularity to select a goal location in the next level down to move into. The higher level would then hand over control to the lower level with the instruction to move to that location. The lower level would then perform the move and decide in a similar manner for the next level down until the level of primitive actions was reached. Clearly such a strict division of the environment may be inappropriate, although the
logarithmic division of the problem space does leverage considerable abstraction. Learning in this approach was rapid, with very small state tables at any one level. However, Dietterich (1997) demonstrated that this method can lead to non-optimal selection when the environment is considered at the primitive action level, because of the commitment to a particular area of the grid at a level that cannot take full account of local environmental irregularities. Humphry's W-Learning (Humphry, 1996) learnt using multiple Q-tables, although in this case each Q-table represented a pre-defined area of the base state space. The main Q-table selected between these lower level tables using a so-called W-function that sought to identify the least-worst selection. A more programmed form of single-level hierarchy was proposed by Wiering and Schmidhuber (1996). Their approach identified a number of separate Q-tables held in a sequence. Initially the first Q-table was in use, and the "HQ-table" maintained alongside that Q-table was used to identify the optimum 'sub-goal state' that could be achieved from the current state with the current main goal. This then became the sub-goal for the current Q-table and the Q-table was used to reach this goal. When the sub-goal was reached, or a maximum number of steps had been taken without reaching the goal, the next Q-table was instated. This Q-table then selects a new sub-goal state from its HQ-table and uses its Q-table to reach that goal state. The Q-tables were updated as usual, taking the maximum HQ prediction of the next table as payoff if control had just been handed over. The HQ-tables were updated by taking the maximum of the predictions from the next HQ-table if the sub-goal was reached and the previous table also achieved its sub-goal, the cumulative internal reward if the sub-goal state was reached but the previous Q-table had not achieved its sub-goal, or the environmental reward if the overall goal was found. They were able to demonstrate that this mechanism enabled the system to learn within environments exhibiting what was earlier (section 5.2) termed the "Separate state aliasing problem", provided those aliasing states were in parts of the environment covered by different Q-tables. However, they identified that HQ-learning appeared to require a much faster learning rate within the HQ-table (0.2) than the Q-tables (0.05) in order for the system to learn adequately. Interestingly, Lanzi (1998b) identifies that internal states within XCS must be set using a mechanism that is more deterministic than the lower level learning. Partitioning Q-Learning (Munos, 1992) uses a technique that is almost a hybrid of the HQ-learning method and the Feudal Reinforcement method, although it was not actually intended to produce a hierarchical reinforcement learner. In this method learning starts with a single Q-table, but with each Q-value representing a pre-defined area of the total
state space. The payment variability to each part of the Q-table is tracked and periodically the table entry with the most variability is subdivided into two entries, each covering a smaller number of states. A counter-mechanism identifies whether there are areas of the state-space where a number of states hold similar values, and combines these into one value. The intention was that over time the state-space should be decomposed as required by the underlying problem-space, with generalisation over areas of the state × action mapping that hold the same payoff. Clearly, although this mechanism 'structures' the state-space, it provides no more than the generalisation mechanism already available within learning classifier system approaches. Although the flat state-space subdivision with a provided "switch" of HQ-Learning is reminiscent of the approach used by Dorigo and Schnepf (1993) within their hierarchical LCS proposals for use within non-reactive environments, the decompositions used by Dorigo and Schnepf were in terms of competences rather than state-space. No comparative state-space decomposition work has been performed within LCS research (although the Corporate LCS of Tomlinson (1999) bears a passing comparison) and this would be a simple, though less powerful, approach to apply to XCS.
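A sketch of the split step in Partitioning Q-Learning makes the mechanism concrete: track the payoff variability of each aggregated region and subdivide the most variable one. This is a schematic reconstruction in Python from the description above, not Munos' algorithm; the regions and sample payoffs are illustrative.

    import statistics

    # Sketch: the split step of a Partitioning Q-Learning style scheme.
    # Each region aggregates a range of states under one Q-value; the region
    # whose observed payoffs vary most is subdivided. Illustrative data only.

    regions = {
        (0, 8):  [100.0, 98.0, 101.0],   # low variability: generalisation is safe
        (8, 16): [20.0, 180.0, 60.0],    # high variability: one value fits badly
    }

    def split_most_variable(regions):
        target = max(regions, key=lambda r: statistics.pvariance(regions[r]))
        lo, hi = target
        mid = (lo + hi) // 2
        samples = regions.pop(target)
        regions[(lo, mid)] = list(samples)   # both halves inherit the history
        regions[(mid, hi)] = list(samples)   # and re-estimate independently
        return target

    print("split:", split_most_variable(regions))
    print(sorted(regions))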

7.6.2 Competence Decomposition

Both Singh (1992b) and Lin (1992) introduce hierarchical learning to reinforcement learning by providing a fixed pre-programmed hierarchical structure of "competences" or "tasks". Each method effectively provides a table of learnt Q-values for each competence that should be applied, with the competences prescribed beforehand. Singh's work used tabular methods for maintaining the Q-tables, whilst Lin utilised neural networks to represent the large continuous state space of the more complex test environments he chose to use. Each approach used higher level tables to switch between the lower level competences. Each low-level competence had an identified goal (or termination state) and was invoked until the achievement of the terminal state. Each low-level competence learns the Q-table for the achievement of the stated goal using normal update mechanisms. Higher levels utilise the payoffs associated with the achievement of the lower level goals to establish the Q-table. The higher level tables only select over the tasks available in the subset of states that are the goal states. The mechanisms employed in these approaches are very similar to those applied within the work of Dorigo and Schnepf (1993) and Dorigo and Colombetti (1994) reviewed earlier, although their switch mechanisms were more directly reinforced. Interestingly, Lin chose to pre-train the lower level competences using a supervised
training technique to guide the reinforcement learning of the lower level competences before allowing the reinforcement learner to identify the ordering of the low level competences within the high level switch. A similar phased training technique was shown by Dorigo and Schnepf to be more effective than an attempt to learn all competences at once within LCS learning. Dietterich (2000) has expanded this approach by allowing each task to invoke any other task, so providing a recursive learning algorithm. The MaxQ algorithm requires that the tasks be identified beforehand by the user, as with the previous approaches. Each task is identified by a termination condition and maintains its own Q-table. When operating, the update function seeks to update the value in a Q-table based not only on the value of its own operation (the payoff it receives in relation to the current goal), but also on the sum of the maximum Q-values of all the child tasks that are invoked (also in the context of achieving the current goal and the additional sub-goals established with each child task invoked). Since each child task may itself invoke other tasks, this computation becomes recursive. Interestingly, possibly because the upper level maintains a more accurate representation of the value of likely future invocations, Dietterich claims that the MaxQ system can learn on all levels of the hierarchy at once. However, the flexibility of being able to invoke any other task carries with it the penalty of a greatly expanded state × action space. Thus, the true benefit of using the MaxQ method is only seen when advantage is taken of the pre-decomposition to reduce the input and action space for each state table. Once this reduction has been completed, results appear to show that MaxQ learning is more rapid than an equivalent non-hierarchical Q-learning system. Other workers44 have also been examining the so-called "Macro-Action" approach (Korf, 1983; Laird, Rosenbloom and Newell, 1986; Iba, G. A., 1989). Sutton, Precup, and Singh (1998) use so-called "Options" to represent tasks. These can be either primitive actions or Q-table representations of policies, but like the Tasks of previous work each has an identified termination condition. Their work includes vital theoretical results on the additional complexity that a macro-action facility adds to learning and on the convergence of learning over macro-actions. They also identify that the use of macro-actions introduces multiple temporal levels into learning that can be used to speed up the acquisition of models over larger problem spaces than search in the space
of environmental actions alone.

44 See Parr and Russell (1997), Hauskrecht, Meulean, Boutilier, Kaelbling and Dean (1998), McGovern, Sutton and Fagg (1997), and Huber and Grupen (1997) for further macro-action based work.

McGovern and Sutton (1998) use a more limited version of macro-actions that they define as a "policy with a termination condition", so reserving macro-actions for non-primitive tasks. On each time-step, if no macro-action is currently active, the Q-table is utilised to select either a pre-defined macro-action or a primitive action. Once a macro-action is selected its Q-table is used to select the next primitive action until the termination condition is met. The Q-value within the main Q-table is calculated as the sum of the cumulative discounted payoff received whilst executing the macro-action and the payoff at the termination state. The macro-action's table is calculated using the normal update mechanism. McGovern and Sutton show that the provision of macro-actions produces a bias on the exploration of the state-space. In a large state space an Animat will initially tend to explore its local environment heavily, because action selection is largely random early in learning. With macro-actions it is as likely that a macro-action will be chosen at random as a primitive action, and this will result in the movement of the Animat some distance away from the area of initial exploration, due to the more limited set of actions and the relative simplicity of achieving a local goal. Thus, in a large state-space, macro-actions provide a better exploration of the state-space than primitive actions alone. Unfortunately, in a small environment this bias can prevent complete exploration, and therefore they conclude that macro-actions are beneficial where the environment is appropriate. They also demonstrate that macro-actions are able to transmit Q-values back to other states much more rapidly than the normal temporal difference mechanism alone, due to the payoff passing directly to the invoking state. This helps speed up learning in large environments. The concept of the macro-action is close to the ideas identified within Wilson (1988), although Wilson does not suggest that the called rule-chains should be pre-specified. Wilson's reward scheme has many similar properties to that used within the MaxQ technique of Dietterich (2000), although the proposals of Wilson and the work of Dietterich would be difficult to fit into an XCS or other accuracy-based formulation because of possible aliasing effects on the classifiers within the invoked action chain (see discussion in section 7.3.3). The explicit identification of termination conditions for each macro-action, often expressed as a sub-goal state, can be related to the operation of ACS (Stolzmann, 1997), where each classifier expresses a predicted outcome of its action in terms of the new state entered. It is possible that an internal state setting mechanism (Lanzi, 1998a, 1998b), used not only to invoke an action chain but also to express its termination condition as a sub-goal state, could be applied within an implementation of a Hierarchical XCS that is close to the ideas proposed by
Wilson. Clearly the LCS community has much to learn from both the advantages and the disadvantages of the macro-action approach used within Reinforcement Learning.
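The update described by McGovern and Sutton for the invoking state can be sketched directly: the Q-value of a macro-action is backed up from the cumulative discounted payoff accrued during its execution plus the discounted value of its termination state, so value reaches the invoking state in a single step. The Python sketch below uses illustrative values for the discount, the learning rate, and the rewards.

    # Sketch: backing up the Q-value of a macro-action from the cumulative
    # discounted in-macro payoff plus the value of its termination state,
    # following the description of McGovern and Sutton (1998).
    # GAMMA, ALPHA and the rewards are illustrative values only.

    GAMMA, ALPHA = 0.9, 0.2

    def macro_backup(q, rewards, terminal_value):
        """One update of Q(s, macro) after the macro ran len(rewards) steps."""
        g, ret = 1.0, 0.0
        for r in rewards:              # discounted payoff accrued inside the macro
            ret += g * r
            g *= GAMMA
        target = ret + g * terminal_value   # value jumps straight to termination
        return q + ALPHA * (target - q)

    q = 0.0
    for _ in range(20):                # repeated invocations of the same macro
        q = macro_backup(q, rewards=[0, 0, 0, 0], terminal_value=100.0)
    print(f"Q(s, macro) after 20 backups: {q:.1f}")
    # A primitive one-step backup would need the value to percolate through
    # each intermediate state; the macro transmits it to the invoking state
    # directly.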

7.6.3 Emergent Decomposition

Whilst the pre-specification of task decomposition has been shown to be beneficial not only to the learning of large state spaces but also to their exploration, the identification of appropriate tasks remains a problem (McGovern and Sutton, 1998). A few researchers have sought to tackle this issue by identifying mechanisms that might automate the identification of competences or tasks. This work is some way beyond all work within LCS, although the proposals of Wilson (1988) were clearly intended to produce a truly emergent hierarchical system. Thrun and Schwartz (1995) introduced the SKILLS algorithm. Rather than pre-define the tasks that should be identified, they utilise a record of the rate of use of each location in the normal state × action table. Any location that achieves a utilisation rate higher than a threshold value is associated with a "SKILL". Each SKILL has its own small Q-table and starts as a single state. Neighbouring states are added if the added cost of representing them within the table, plus the loss in the efficiency of the task as a whole caused by their inclusion, is less than the cost of not including them in the representation - Thrun and Schwartz provide calculations to estimate these costs. Once a SKILL has been identified, moving into a state that is related to it will cause the SKILL to be selected more favourably than a primitive action. The operation of this approach is demonstrated in a number of environments in which a number of different objectives are available at different times. It is shown that the SKILLS developed are utilised to simplify the learning of many objectives that can each use one or more SKILLS in the achievement of their goals. Whilst Thrun and Schwartz address their proposals to multiple-task environments, Digney (1996a, 1996b, 1998) proposes an emergent system called Nested Q-Learning (NQL) that seeks to learn competences that can be re-applied as required within single or multiple objective environments. Digney (1996a) extends the base Q-table to maintain a mapping of state + current skill × action + skill. Thus, from the base Q-table either a new primitive action can be chosen or a new skill can be invoked. Initially there are no skills, but Q-tables are provided so that any environmental state can become a sub-goal with its own Q-table to represent the task of reaching that state. Once performing a skill, the current skill's Q-table is used (again mapping state + current skill × action + skill) and any other skill can be invoked or a primitive action chosen. This


The value update function uses a sum of the current payoffs, a payoff for completing the current skill, the environmental reward, and the cumulative payoff of lower-level tasks to update the Q-table, acting somewhat like MaxQ. A further Q-table is maintained that provides a state × skill map so that new skills can be selected within any state; it is updated only from the environmental payoff for completing the skill. This provides what Digney terms a "Bottom-Up" selection mechanism that will invoke new skills without relation to any current skill. Digney (1996a) demonstrated that this mechanism can learn to utilise these skills to create an appropriate hierarchical invocation to achieve two different objectives within a small grid-world.

Recognising that not all states need be identified with skills, Digney (1998) introduces a feature identification mechanism so that skills are identified through use, somewhat like the mechanism used by Thrun and Schwartz. Rather than use frequency of utilisation as a metric, however, Digney uses the rate of change of the reinforcement signal from the environment. He reasons that a sharp rise or fall in the reinforcement signal represents a key area of the environment and should therefore be treated as the target location of a skill. Using this mechanism, Digney demonstrates that NQL can identify and utilise emergent tasks to solve medium-size multi-room grid-world navigation tasks, and that the emergent tasks are re-used within solutions to different objectives.

Clearly the emergence and utilisation of tasks in a hierarchical manner is the 'Holy Grail' of research into hierarchical or structured representations. The work of Thrun and Schwartz and of Digney identifies means by which tasks can be identified and re-used. A major step forward in research within the LCS area would come from an application of key aspects of their work to the LCS framework. However, a direct application of their value update techniques is not possible due to the different nature of the learning algorithms within the LCS framework. Furthermore, there are few comparative results between their work and normal Q-learning that would allow the increase in complexity of the learning task caused by the addition of emergent hierarchy to be judged. Since LCS have the additional generalisation task to consider, knowing more about the additional complexity imposed by emergent hierarchy learning is important before application of this work to LCS can begin.

7.7 Conclusion

This chapter opened with a note on the debate over the relevance and usefulness of higher-level structures within a number of fields.


The theoretical results and practical work that has emerged within the domain of reinforcement learning in recent years provides both a sound basis for an informed argument in favour of higher-level structures and a practical example of the usefulness of hierarchy in real problem domains. In contrast, the use of higher-level approaches within the LCS field has been dominated by good ideas, unambitious realisations, and one-off developments that nevertheless show promise. The blame for this can partially be laid at the door of the LCS approach itself: the lack of reliability of the LCS framework has meant that achieving good repeatable solutions to problems that require only simple internal structures has been a problem in itself. However, the introduction of XCS provides a degree of reliability and predictability in the performance of the LCS approach that allows larger structures to be considered. The framework for research using hierarchy presented in this chapter, along with the first review of structured approaches within Learning Classifier Systems and the first discussion of ideas from hierarchical reinforcement learning in the context of LCS work, is intended to provide an empirical base to move on from. The next chapter picks up on this work, utilising a hierarchical approach in order to tackle the problem of forming long action chains in the presence of generalisation pressure.


Chapter 8

USING FIXED STRUCTURE TO LEARN LONG RULE CHAINS

8.1 Introduction

Within their investigation of the capabilities of ZCS (Wilson, 1994), and in particular the introduction of a memory mechanism to ZCS, Cliff and Ross (1994) introduced the Woods14 environment (Figure 8.1). This is a Markovian environment with a simple corridor path that requires 18 consecutive correct actions to traverse successfully. It differs from the other corridor environments presented within chapters 4 to 6 in three ways. Firstly, by allowing eight actions within each state, an LCS operating within this environment has a much larger total search space than in the two-action environments presented earlier. Secondly, the environment uses an input coding based on the Woods1 environment (presented in section 3.5.2), an encoding with a large number of irrelevant input combinations. Finally, the environment uses the additional actions to present a non-linear action route to the reward position.


Figure 8.1 - The Woods14 environment in which ‘O’ represents a rock (a position which the animat cannot enter) and ‘F’ represents food (the goal).

The length of the pathway that the environment provides is itself a challenge to the Bucket-Brigade algorithm, whether implicit (Goldberg, 1983) or explicit (Holland, 1971; Riolo, 1988b). Riolo (1987b) estimated the time taken, even in a simple single-action corridor environment, for his CFS-C rule-chaining LCS implementation to pay off a classifier n steps from the reward sufficiently to establish 90% of its stable strength to be:


t = 286 + 155n

and the number of times payoff must pass through the rule-chain (i.e. the number of environmental rewards required) to be:

R = 22 + 11.9n

For an 18-step chain such as that in Woods14 these estimates give t ≈ 3076 time-steps and R ≈ 236 environmental rewards. A single-action corridor Finite State World is depicted in Figure 8.2a and contrasted with an equivalent FSW for Woods14 in Figure 8.2b.


Figure 8.2 - An 18-state simple corridor environment (a) contrasted with a Woods14-like FSW environment (b)

Within ZCS the exploration strategy employed chooses a random action from the set of valid actions in each state. In the single-action corridor FSW the probability of exploring state sₙ from state sₙ₋₁ is always p(sₙ₋₁ →¹ sₙ) = 0.1. However, within the Woods14 environment the probability is p(sₙ₋₁ →¹ sₙ) = 0.125. Clearly this probability remains the same within each successive state, and therefore the probability of moving from state s₀ to sₙ in n successive steps is p(s₀ →ⁿ sₙ) = 0.125ⁿ. Thus, the probability of moving directly from s₀ to s₁₈ in 18 steps is p(s₀ →¹⁸ s₁₈) = 0.125¹⁸ ≈ 5.55 × 10⁻¹⁷. It is thus highly improbable that ZCS will move from s₀ to s₁₈ within exploration alone, even where a relatively large number of iterations [in bucket-brigade terms] is permitted. Furthermore, transitions that lead back to the same state (e.g. from s₀ to s₀) will be explored with p(sₙ →¹ sₙ) = 0.875, resulting in a highly disproportionate exploration of the early states within this environment.
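These figures are easily verified (a minimal check, assuming eight equiprobable actions of which exactly one advances along the corridor):

    # Checking the exploration figures quoted above: Woods14 offers eight
    # actions in every state, only one of which advances along the corridor.
    p_forward = 1 / 8                 # p(s_{n-1} -> s_n) = 0.125
    p_direct = p_forward ** 18        # eighteen correct moves in succession
    p_stay = 7 / 8                    # actions that fail to advance
    print(f"{p_direct:.3e}")          # 5.551e-17, as quoted
    print(p_stay)                     # 0.875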

The problem of exploration within the Woods14 environment can be overcome by a number of mechanisms. The first involves a change in the environment definition itself to allow all of the states s₀ to s₁₇ to become start states. This means that exploration starting in the later states will have a higher chance of reaching the reward state, thereby feeding back an environment-grounded payoff⁴⁵. On subsequent visits to the reward state the payoff based on the true reward value will be passed back to the immediately preceding action sets until a stable prediction is achieved. Within XCS the payoff prediction is learnt within exploitation in addition to the exploration trials, allowing the pathway to the reward state to become established rapidly once it has been identified in exploration⁴⁶.

An alternative approach that does not require a change in the environment definition could be created if the choice between exploration and exploitation modes were dynamically modified so that, as transitions within the FSW are increasingly explored, their probability of future exploration is decreased. This would allow the LCS to advance progressively further through the states within the Woods14 environment to areas that require exploration. Unfortunately, this solution requires the availability of a qualitative differentiation in the value of the transitions, and this information is not available until the reward state has been reached and the reward values distributed along the chain of state representations within the LCS.

It is possible that a third approach to this problem exists if the LCS representation used can be modified to sub-divide the learning of the state space.

⁴⁵ Clearly finding a definitive reward value to use as payoff allows the route to the reward to be distinguished from other routes within exploitation, thereby further reinforcing the route to the reward and moving away from a random exploration pattern. However, Smith (1994) also noted that environment-grounded payoff was a key factor in preventing parasites emerging. It can be hypothesised that the earlier a reward is found, the lower the likelihood of disruptive classifiers forming.

⁴⁶ Lanzi's "teletransportation" mechanism (Lanzi, 1997b) provides a similar 'solution'.


This chapter elaborates a hypothesis for the use of population sub-division as a solution both to the problem of learning long action chains (chapter 4) and to the problem of exploration complexity described above. It implements two related forms of population sub-division inspired by approaches from Reinforcement Learning and by the research of Dorigo and Schnepf (1993) and Dorigo and Colombetti (1994) with their ALECSYS Holland-style LCS implementation (chapter 2). It applies these forms to environments of increasing complexity to demonstrate the validity of the hypotheses empirically.

This chapter provides a number of new contributions to LCS research. It demonstrates that a simple solution exists to the problem of learning to traverse long action chains within simple progressive corridor environments. It then shows that the addition of hierarchical control allows this solution to be applied to a set of more complex environments. This expansion represents the first application of hierarchy within the XCS classifier system, and the methods used are contrasted with those of Dorigo and Schnepf (1993) and Dorigo and Colombetti (1994). The approaches are also related to two methods for hierarchical learning within the Reinforcement Learning community, demonstrating that crossover between LCS and Reinforcement Learning can be utilised beneficially. The work in this chapter draws heavily upon lessons learnt from the experimental research work documented in chapters 3 to 7 and thus represents a fitting end point to the research programme.

8.2 Hypotheses

In resolving the difficulties involved in the learning of Woods14, an understanding of the core problem of the learning task is essential. Within the preceding section it was shown that the exploration regime is insufficient to move the Animat controlled by XCS from the early states towards the later states of this environment, due to the properties of the environment itself. As a result, there is little opportunity to identify the reward-bearing terminal state or to feed the payoff from this state back to earlier classifiers. Equally, the inability to distinguish between classifiers in the earlier states, due to the non-availability of reward information, prevents consolidation of reward feedback as a result of learning within exploitation.

This dilemma may be partially solved if XCS were able to introduce intermediate rewards in addition to those provided by the environment. Consider a state sᵢ which is i steps from the start state s₀. If a classifier leading to this state from sᵢ₋₁ received a fixed 'internal' reward Rᵢ, this reward could be fed back to preceding classifiers. Thus, XCS would be able to establish classifiers leading to state sᵢ even though the ultimate goal state had not yet been encountered.


Now consider a set of states s such that:

s ⊂ S ∧ ∀sᵢ ∈ s · ¬∃sⱼ ∈ s · j = i + 1, where S = {s₀ … sₙ₋₁}

(that is, no two consecutive states are both members of s). If each of the states within s provided an internal reward, the states within s would represent a chain of intermediate goals towards which XCS can learn a route through a corridor environment.

The provision of these internal reward states would not in itself be sufficient to enable XCS to find a path to the solution. XCS must, in addition, be able to identify which of the internal reward states to move towards in the next iteration, and equally must be able to decide not to re-visit an internal state that has already been visited. These properties cannot be met if the intermediate reward states remain permanently active, unless a very careful choice of internal reward for each state is devised (which in turn requires a priori knowledge of the problem space). Such a derivation would have to account for the magnitude of the ultimate environmental reward, the discount applied when payoff is calculated, and the number of steps between each internal reward state. However, it is possible to consider an XCS implementation in which the state space is subdivided and, for each sub-division, one of the internal 'sub-goal' states is advocated and the internal reward is paid out when the Animat controlled by XCS enters this state. For the situation where there is only ever a single sub-goal per environment subdivision, the Optimality Hypothesis (Kovacs, 1996) implies that XCS will learn the optimal state × action × payoff mapping for each environmental subdivision. The problem of learning a route from the start state to the reward state is thus decomposed into the problem of moving from one internal reward state to another.

Hypothesis 8.1
Using a prior identification of internal goal states and a subdivision of the state-space in relation to the goal states, XCS is able to learn the optimum state × action × payoff mapping for each subdivision of the state space, and, given a mechanism to determine the sequence of internal goals, an optimum path to a global goal can be constructed.

Limiting the environment to a single sub-goal per environmental subdivision limits this mechanism to unidirectional corridor environments.
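Reading the condition as requiring that no two consecutive corridor states are both members of s, it reduces to a simple predicate over state indices; this reading is an assumption based on the discussion above:

    def valid_subgoal_set(subgoals):
        """True when no two consecutive corridor states are both sub-goals,
        i.e. for every s_i in the set there is no s_(i+1) in the set."""
        return all(i + 1 not in subgoals for i in subgoals)

    print(valid_subgoal_set({4, 9, 14}))  # True: the goals are spaced out
    print(valid_subgoal_set({4, 5}))      # False: adjacent states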


Whilst hypothesis 8.1 suggests that this will allow the production of optimal solutions for the environments used within the action-chain length experiments presented within chapter 4, it is an undesirable limitation. The limitation may be overcome by identifying more than one goal state within each subdivision. Provided the conditions of the classifiers within the XCS are constructed to identify both the current goal and the current local state, it is hypothesised that the Optimality Hypothesis can be extended so that the XCS populations covering each state-space decomposition will be able to identify the optimal state × sub-goal × action × payoff mapping for each relevant internal goal.

Hypothesis 8.2
Where more than one sub-goal state exists within a state-space subdivision and the desired sub-goal is made available through the input mechanism, XCS is able to learn the optimum state × sub-goal × action × payoff mapping for each subdivision of the state space, and, given a mechanism to determine the sequence of internal goals, a sequence of optimum local routes to a global goal can be constructed.

The mechanism for the selection of the current 'goal' states, or of the relevant XCS sub-population, has not been discussed thus far. For this investigation it is proposed that the method used within many Reinforcement Learning approaches (such as Dietterich, 2000; Parr and Russell, 1997; McGovern and Sutton, 1998) of requiring the user to identify the state subdivisions and their "terminal states" is adopted. As such, the structures used are fixed rather than emergent. Given this input, it is hypothesised that an additional high-level XCS can be added that operates over the space of internal states, treating the lower-level XCS sub-populations as "macro-actions" (after Sutton, 1995) that move from the current state to the chosen sub-goal state. Given the current input (which will be one of the sub-goal states) the high-level XCS will select a new sub-goal state. This will cause the XCS sub-population that maintains that state as a sub-goal to be invoked. Upon reaching the sub-goal state, this lower-level XCS will be rewarded with the internal reward and will hand control back to the high-level XCS. When the environmental reward state is reached, the high-level XCS will receive the environmental reward, and it is hypothesised that through the normal payoff mechanism it will learn the state × next sub-goal × payoff mapping.

Hypothesis 8.3
An XCS can be employed to learn the optimum sequence of sub-goals from a pre-defined set of sub-goal states to reach a reward state within an FSW by the invocation of low-level XCS populations mapping state-space subdivisions.


This fixed hierarchical structure using pre-defined sub-goal states does not represent a fully emergent solution. However, the demonstration of the ability of XCS to operate within these structures, and to retain the advantages inherent within XCS at each level, will pave the way towards further work leading to truly emergent hierarchical XCS formulations. In addition, a demonstration of the validity of the hypotheses provides new solutions to the problem of learning within environments requiring long action chains, and opens the possibility of re-using learnt mappings within more than one area of the state space, satisfying the goal of re-usability identified within chapter 7.

8.3 Experimental Method

In the experimental investigation of the hypotheses the operation of the base XCS implementation, XCSC, will be changed as little as possible to achieve the required structured XCS systems. In order to maintain comparability with the work in chapter 4, the parameterisation used within these experiments (Table 8.1) is based on the settings used within chapter 4. Where these parameter settings are made irrelevant through changes in XCS operation, or are modified for the purpose of experimental investigation, the changes will be highlighted within the description of each experiment.

Pi (initial population size)     0
γ (discount rate)                0.71
β (learning rate)                0.2
θ (GA experience)                25
ε0 (minimum error)               0.01
α (fall-off rate)                0.1
Χ (crossover probability)        0.8
µ (mutation probability)         0.04
pr (covering multiplier)         0.5
P(#) (generality proportion)     0.33
pi (initial prediction)          10.0
εi (initial error)               0.0
fr (fitness reduction)           Not Used
m (accuracy multiplier)          0.1
s (subsumption threshold)        20
fi (initial fitness)             0.01
Exploration trials per run       5000

Table 8.1 - Base parameter settings for XCS


Because the operation of XCS will be modified as significantly as is required for the investigation of these hypotheses, it is even more important that the adverse effects of environmental form and shape discussed in section 4.2 are minimised. FSW are therefore particularly appropriate for creating suitable test environments for the experimental work in this chapter.

As hierarchical structures are introduced, the validity of some of the measures used to indicate global XCS performance becomes strained. Where appropriate, therefore, these measures will be applied locally and reported separately, and other means sought to provide a measure of global performance. Such changes will be introduced and explained alongside the experimental investigations when they are required.

Although the motivation for introducing the structured approaches is based upon the performance problems of XCS within complex environments, or when seeking to establish long action-chains, such simplistic measures of performance as the ability to reach a goal or the number of steps taken to reach a goal will not be the main objective. Chapter 3 highlighted the ability of XCS to form and proliferate the sub-population of optimally general accurate classifiers, and noted that this is the feature that distinguishes XCS from all other LCS (and many Machine Learning) approaches. Thus, the ability of the XCS sub-populations to form the optimal mappings for their corresponding areas of the state space has been highlighted within the experimental hypotheses and remains the major concern in considering the appropriateness of the approaches used. As in the previous chapters, the coverage-table technique will be used to identify the ability of XCS to form this mapping, and dominance calculations are used to indicate the ability of XCS to proliferate the optimal sub-population.

8.4 Sub-dividing the Population

8.4.1 Introducing SHQ-XCS

In order to investigate Hypothesis 8.1, a simple structuring of the population space within XCS was devised. The approach taken is based on the methods used within HQ-learning (Wiering and Schmidhuber, 1996), although greatly simplified. The developed XCS formulation is therefore known as a Simple HQ-XCS (SHQ-XCS), although admittedly the retention of the 'Q' is misleading. The standard XCS implementation (XCSC) is modified so that an array of populations is maintained rather than a single population, and a variable is added to reference the current population. The environmental interface is modified so that a set of states from the environment can be identified as internal reward states; these become the sub-goals.


Operations are provided to allow XCS to detect when an internal reward state has been reached. The environment is also modified to provide an internal reward value for reaching an internal state, again with operations that allow XCS to obtain the internal reward value. The operation of XCSC is modified so that at the start of each trial the first sub-goal is identified and the current population is set to the first population. XCS then runs as normal within this population until the environment identifies that an internal or global goal state has been reached. On reaching an internal sub-goal, the state is compared with the desired internal goal and, if it is the same, the internal reward is provided, the current-population variable is moved on to the next sub-population, and the next sub-goal is identified. Upon reaching the global goal the same internal reward is provided to the current sub-population and [for the present] the global reward is discarded.

Thus far the modifications can all be related to features found within an HQ implementation. The simplification comes in the selection of the current sub-goal. Within HQ-learning this is performed by an additional HQ table associated with each Q-table; the HQ table uses the current global goal to select a local sub-goal for the local table. The present modifications to XCS are concerned with verifying Hypothesis 8.1, and this does not require a higher-level choice of sub-goal. Therefore the choice of the next sub-goal is deterministic: it is simply the next sub-goal in the list of available sub-goals. Thus each sub-population learns the optimal path to one sub-goal.

The test environment used within this experiment is the length-10 version of the environment described within section 4.3.1. This environment satisfies the requirements of hypothesis 8.1 because it is possible to sub-divide it into sections, each of which has a single identifiable sub-goal state. It has the additional advantage of allowing comparison of the results achieved with those available within section 4.3.6 from the standard XCSC implementation. Finally, it is an easily scalable environment, allowing results to be found for much longer action chains.

Once the implementation of SHQ-XCS had been completed it was tested to ensure that the normal operation of XCS had not been modified by the code changes required. The SHQ-XCS was run with a single sub-population and a single sub-goal at the same location as the global goal, rewarding the same reward as the global goal. The test environment used was the length-10 environment previously presented within section 4.3.6, within which XCS was known to operate correctly. Both SHQ-XCS and the standard XCS implementation were run using the same random seed within each pair of runs and the results were compared. In all runs SHQ-XCS and XCS produced identical performance plots for all the standard results collected, indicating that the modifications had not changed the normal operation of XCS.
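The control scheme described above amounts to a thin wrapper around the normal XCS trial loop. A minimal sketch follows, with the XCS internals abstracted away; all names are hypothetical stand-ins rather than the XCSC interfaces:

    # Sketch of the SHQ-XCS control flow: one XCS population per sub-goal,
    # a deterministic progression through the sub-goal list, and an internal
    # reward paid whenever the currently desired sub-goal state is entered.
    # `populations` and `env` are stand-ins for the real implementation.
    INTERNAL_REWARD = 600.0

    def shq_trial(populations, subgoals, env):
        current = 0                                 # index of active population
        state = env.reset()
        done = False
        while not done:
            pop = populations[current]
            action = pop.select_action(state)       # unchanged XCS selection
            state, _ = env.step(action)             # global reward discarded
            if state == subgoals[current]:          # desired sub-goal entered
                pop.distribute_reward(INTERNAL_REWARD)
                done = env.at_global_goal(state)    # last sub-goal is the goal
                if not done:
                    current += 1                    # move on to next sub-goal
            else:
                pop.distribute_reward(0.0)          # normal discounted update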


8.4.2 Applying SHQ-XCS to learn the optimal path in a corridor environment

SHQ-XCS was now applied to the length 10 environment. Parameterisation for this experiment was kept the same as that used for the equivalent length 10 test within section 4.3.6. However, the effect of the population-size parameter was modified so that the total population size was divided equally between the sub-populations. In this initial experiment two sub-populations were used and the sub-goal states were s5 and s20, with the goal state also s20. The internal reward value was set to 600, a value chosen because of the known reduction in confusion that its discounted values cause when the main reward is 1000 (see section 6.4.1), although only the internal reward will be available within the sub-populations of SHQ-XCS. Ten runs of SHQ-XCS were performed and the performance of the whole learning system was captured using the standard System Relative Error metric. The population size measure (in terms of macro-classifiers) was modified so that it captured the size of each sub-population rather than providing a single result. This provides a means of tracking the comparative rate of learning in terms of the concentration of the sub-populations on their [O].

Figure 8.3 pictures the performance of SHQ-XCS in this experiment. It is evident that SHQ-XCS rapidly converged on solutions for each of the sub-populations, and that the System Relative Error of the two populations was rapidly eliminated. The population curves show that each population has converged onto a solution and at 5000 exploitation episodes continues to consolidate on its respective [O]. The performance of SHQ-XCS when compared to XCS (figure 8.4, reproduced from sections 4.3.7.1 and 4.3.7.2) is instructive. Within the length 10 environment SHQ-XCS learns to traverse the environment in the optimal number of steps in the same length of time as that taken by XCS within the standard length 5 environment. Thus, the two sub-populations, each operating within its own length 5 portion of the length 10 environment, are able to establish a solution to their state-space in the same time as a single XCS in a similar environment. It is important to note that the environment tackled by XCS in the length 5 test and the environments tackled by XCS in the two length 5 sub-divisions of the state space within this test are not the same. The state encoding for the length 10 environment was not changed when it was sub-divided, so that any advantage that might be gained from a reduced input space (Dietterich, 2000) results from the reduced size of the search space only. Thus, the generalisation task undertaken in each of the sub-populations was different and would lead to a different [O] in each sub-population.


[Plot: proportion against exploitation episodes, showing minimum, maximum and system relative error, the sizes of sub-populations A and B, and iteration counts]

Figure 8.3 - SHQ-XCS performance (average of 10 runs) in a length 10 unidirectional corridor environment subdivided into two joined but distinct state spaces.

[Two plots: population size, relative error (minimum, maximum and system) and iteration counts against exploitation episodes for the length 5 and length 10 environments]

Figure 8.4 - Normal XCS performance within length 5 and length 10 unidirectional corridor environments.


The coverage tables for each of the sub-populations were captured separately, and these were examined to identify the ability of SHQ-XCS to establish [O] in each sub-population. Table 8.2 gives the coverage table for each action, separated into the optimal and sub-optimal action in each state. It also provides the percentage domination of the most numerous classifier over other classifiers in the same action set. The domination is weaker than had been expected. It was noted in section 4.3.7.2 that the domination was weaker within both the length 5 and length 10 environments when the mutation rate was 0.04 than at the mutation rate 0.01. The domination produced within these runs, which used a 0.04 mutation rate, shows a similar pattern to that shown for the 0.04 rate in section 4.3.7.2.

Sub-population 1
State   Optimal    Sub Opt.   Opt. Dom   Sub Opt. Dom
0       152.8184   110.6604   54%        52%
1       214.3605   153.4131   44%        53%
2       302.1187   215.3198   51%        50%
3       425.412    302.2061   49%        53%
4       599.6759   425.1409   76%        69%
10      152.2957   152.6507   63%        61%
11      214.4537   215.3308   62%        58%
12      302.0874   303.3498   71%        59%
13      424.5279   424.4046   60%        63%
14      599.7985   598.4557   71%        64%

Sub-population 2
State   Optimal    Sub Opt.   Opt. Dom   Sub Opt. Dom
5       153.1307   115.4739   68%        65%
6       214.5323   154.4045   47%        42%
7       301.5555   214.5543   55%        57%
8       424.8188   302.587    49%        49%
9       599.1991   424.5193   59%        55%
15      162.4769   153.7158   55%        49%
16      214.9624   214.5452   69%        74%
17      301.5938   302.5214   77%        77%
18      425.5128   424.595    70%        72%
19      599.0328   597.8706   77%        72%

Table 8.2 - Coverage table (average of 10 runs) for SHQ-XCS in a length 10 FSW
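The optimal-action predictions in Table 8.2 can be checked directly against the discounted internal reward, since a state whose optimal action is k further steps from the internal reward should predict 600 × 0.71^k:

    # The optimal-action predictions in Table 8.2 follow the discounted
    # internal reward: a state whose sub-goal is k further steps away
    # should predict 600 * 0.71**k (gamma and reward from the experiment).
    GAMMA, REWARD = 0.71, 600.0
    for state, k in zip(range(5), range(4, -1, -1)):
        print(state, round(REWARD * GAMMA ** k, 2))
    # 0 152.47, 1 214.75, 2 302.46, 3 426.0, 4 600.0
    # cf. the learnt values 152.8, 214.4, 302.1, 425.4, 599.7 above.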

An examination of the populations maintained showed that each sub-population held an average of 84 macro-classifiers. A standard XCS within the length 10 environment produces a population containing a similar number of classifiers; SHQ-XCS therefore appears to maintain double the number of macro-classifiers to solve the same problem, although provided with only the same number of micro-classifiers.


There are a number of possible explanations for this. The division of the state space did not include a simplification of the input from each state that might have allowed more compact generalisations to emerge. Furthermore, the lower number of states providing input meant that the classifier conditions were highly redundant without necessarily permitting more generalisations. Although no parameter tuning was performed in this experiment, so that comparisons could be made with previous results, further tests that set the generality parameter to 0.5 led to a slightly more compact population. This suggests that the reduced number of states in each sub-population required conditions of a higher generality than was the case with the undivided state space, and lends support to the hypothesis that a higher representational redundancy is present. Clearly a reassignment of the messages presented by each state could provide XCS with more opportunities to find general classifiers, leading to a genuine improvement in performance and a more compact representation.

[Plot: proportion against exploitation episodes, showing minimum, maximum and system relative error, the sizes of sub-populations A to D, and iteration counts]

Figure 8.5 - The performance of SHQ-XCS with 4 populations in a length 20 FSW.


The test environment was now extended to length 20, a length that section 4.3.7.5 demonstrated could not be adequately learnt using the standard XCS. Four sub-populations were provided, each of length 5 as in the previous experiment. Sub-goal states were 5, 10, 15 and 40, with 40 also providing the goal state. The parameterisation was kept constant apart from the population size, which was increased to 1600 (divided between the sub-populations) to match the population size of the original length 20 experiments in chapter 4. Figure 8.5 gives the averaged performance within the first 6000 exploitation trials from ten runs of 15000 exploitation trials.

The similarity of these results to those in the length 10 environment is striking. Even though the bit length of the message was increased from six bits to seven, and the distance between the decimal values of the messages from the 'optimal route' states and those from the 'sub-optimal route' states has increased, the learning rate within each sub-population has changed little. This bears out Wilson's hypothesis (Wilson, 1996, 1998) that the difficulty experienced by XCS in finding [O] scales with generalisation difficulty rather than with state-space size. This is particularly relevant for the development of hierarchical approaches using XCS, since the requirement to physically reduce the input size for each sub-population in order to see beneficial performance improvements within Dietterich's MaxQ approach (Dietterich, 2000) may not apply to hierarchical XCS solutions in the same way. This is not to claim that performance improvements cannot be gained by utilising input optimisations, since a reduction in the message size will require a smaller population to learn the generalisations and may therefore produce performance improvements. It is to hypothesise, however, that XCS may not be as dependent upon this form of manual intervention.

An analysis of the coverage table produced from these runs revealed that each sub-population had learnt and proliferated [O], and that [O] was dominant to approximately the same degree as for SHQ-XCS with two sub-populations within the length 10 environment. This is unsurprising, since the learning problem for each sub-population has only been modified in terms of the generalisations to be formed, and not in terms of the size or structure of the underlying state-space. Given this finding, the SHQ-XCS approach would be expected to continue to scale to larger environments. To evaluate this claim the environment was increased once more to provide a length 40 action chain to the reward. Eight sub-populations were provided and the sub-goal states were 5, 10, 15, 20, 25, 30, 35, and 80.


[Plot: proportion against exploitation episodes, showing minimum, maximum and system relative error, the sizes of sub-populations 2, 4, 6 and 8, and iteration counts]

Figure 8.6 - The performance of SHQ-XCS with 8 populations in a length 40 FSW.

As expected, the SHQ-XCS implementation continued to learn rapidly how to traverse this extended environment, and once more each sub-population developed a dominant [O]. The time to eliminate the System Relative Error was reduced in this run despite the increased length of the messages used. It is possible that the combination of state values within each sub-population offered simpler generalisation possibilities, allowing XCS to identify and establish [O] rapidly.

Given these results, it can be concluded that the use of a prior identification of internal goal states, and the subdivision of the state-space in relation to the goal states, allowed XCS to learn the optimum state × action × payoff mapping for each subdivision of the state space, and that this could be used to find an optimal path to a global goal. Hypothesis 8.1 is therefore upheld.

8.4.3 Investigating the benefits of input message optimisation.

A further investigation was performed to identify whether the sub-division of the population could be used profitably to improve the execution-time performance of XCS.


Even using the SHQ-XCS with four sub-populations, the average execution time for one run of 15000 exploitation trials was 5.2 minutes (calculated over 20 runs). In these runs the sub-populations were set to contain a maximum of 350 micro-classifiers rather than the original 400, a figure discovered from pre-experiments to ascertain the minimum sub-population size that allowed learning of the optimal route with no error. This reduced sub-population size did not appear to affect performance, with results identical to those presented earlier within figure 8.5. From figure 8.5 it could be suggested that an automated halting mechanism of the form proposed by Kovacs (1997) could terminate each run without requiring the full number of trials, and that this would considerably reduce the average execution time.

Since each state-space sub-division of the length 20 environment reduces the state space from 40 states to 10, it is possible that this knowledge could be used advantageously to re-use messages in the separate state-spaces and thus reduce the message size and the corresponding population size required to generalise over the message. To test this hypothesis the inputs provided by the state space were changed so that each length 5 section of the state space used the same input messages 0 to 9, with the values 0 to 4 used in the optimal route and the values 5 to 9 in the states within the sub-optimal route. The message size was reduced from seven bits to four bits to remove any redundancy in the representation (although the original representation had a built-in one-bit redundancy and so could not be claimed to be optimal). Pre-experiments showed that a sub-population size of 350 classifiers was also the minimum reliable setting for this revised input coding. With the sub-population sizes and all other parameterisation kept constant, any execution-time reduction found when using the new coding must be a result of the reduction in coding size or of side-effects caused by this reduction.

Figure 8.7 pictures the average performance of the ten runs using this shorter coding. A brief inspection of the system relative error curves, when compared to Figure 8.5, indicates that SHQ-XCS found the shorter coding slightly more difficult than the normal coding. A similar comparison of the population curves suggests that, as would be expected, there were fewer generalisation opportunities within the shorter encoding. In timing tests over 20 runs with the reduced input coding it was found that the average time for each run was 3.1 minutes. Thus, the reduction in input coding length appears to have produced a useful execution-time reduction.
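The re-coding can be pictured as follows; the state-numbering convention (optimal-route states below 20, sub-optimal counterparts from 20 upwards) is an assumption made for illustration, not the coding used within XCSC:

    # Sketch of the re-used section coding: every length 5 section shares
    # messages 0-9, with 0-4 for optimal-route states and 5-9 for their
    # sub-optimal counterparts, packed into four bits.
    def local_message(state, section_length=5, distractor_base=20):
        """Map a global state index to the shared 4-bit section message.
        States below `distractor_base` are assumed to lie on the optimal
        route; the remainder on the sub-optimal route."""
        if state < distractor_base:
            return format(state % section_length, '04b')
        local = (state - distractor_base) % section_length
        return format(5 + local, '04b')

    print(local_message(7))    # '0010' - third state of its section
    print(local_message(27))   # '0111' - its sub-optimal counterpart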


[Plot: proportion against exploitation episodes, showing minimum, maximum and system relative error, the sizes of sub-populations A to D, and iteration counts]

Figure 8.7 - The Performance of the reduced length input coding.

Although this finding could be said to have been expected, it is nonetheless counter-intuitive when the modification is considered in input terms. In both problems the number of different environmental inputs is constant, and therefore the task of learning these combinations remains the same. However, the task of finding the accurate generalisations without over-generalisation has changed. In the original encoding each sub-population maintained an average of 84 macro-classifiers, whereas under the new encoding each sub-population maintained an average of 63. The shorter encoding provides fewer competing hypotheses to be represented within the population. Since the maintained population of macro-classifiers is smaller, the execution time of the XCS (which is highly iterative) will decrease. Thus the suggestion by Dietterich (2000) that the use of a reduced input space within each sub-division of the state space will produce execution-time improvements is borne out for XCS. However, as was noted in the previous section, this optimisation is certainly not as vital for XCS. Furthermore, the experiment was performed for a fixed number of cycles, and an automated termination mechanism based upon the domination of [O] within the action sets (Kovacs, 1997) may reduce the proportional difference in execution times.


Finally, it is important to stress that the results of this experiment must be interpreted with care. No attempt was made to optimise the message coding to enhance the generalisation ability of XCS in either run, and no other parameter tuning was performed to optimise XCS for either input-size setting. These issues may reduce or even eliminate any execution-time advantage.

8.5 Introducing Hierarchical Control

Whilst Hypothesis 8.2 could be investigated by extending SHQ-XCS to provide each population with a deterministic sub-goal identification mechanism, it was decided to examine hypotheses 8.2 and 8.3 using a single mechanism. This section describes the "Feudal XCS" constructed for the investigation of these hypotheses. It presents a set of experiments in environments with an increasing number of sub-goals to demonstrate that, within the Feudal XCS, the sub-populations are able to establish [O] representing the optimal state × sub-goal × action × payoff mapping for each sub-population. It also seeks to demonstrate that a higher-level control mechanism can learn the sequence of sub-population invocations and sub-goals required to exploit these optimal local policies in order to reach a global goal.

8.5.1 The Feudal XCS

The Feudal Q-Learning approach to hierarchical reinforcement learning (Dayan and Hinton, 1993) was introduced within section 7.6.1. This is a simple approach to hierarchy construction that requires a pre-identified sub-division of the state space into small Q-tables and a pre-selected hierarchy of Q-tables. A Q-table at level n in the hierarchy learns the optimal choice of Q-table from the sub-division of Q-tables at level n + 1. Thus, at the top of the hierarchy a single Q-table exists, and an inverted tree-like structure of successive levels of hierarchy can be constructed until the lowest level of Q-tables operates over a choice of actions on the environment rather than a choice of Q-tables (see figure 7.1). Each Q-table in the levels above the lowest acts rather like a feudal lord: it has oversight of a distinct sub-space within the environment and decides the sub-goal that a selected lower-level Q-table must seek to achieve.

The Feudal Hierarchy approach is very close to the form of hierarchical control that hypotheses 8.2 and 8.3 pre-suppose. It is therefore appropriate to seek to apply this form of hierarchy within XCS as a natural extension of the previous work with SHQ-XCS. Rather than implement this "Feudal XCS" as a hierarchy of populations, a simpler implementation strategy was chosen.


It was recognised that if an upper, level n, XCS selects a sub-population and chooses a sub-goal at the next level down (n − 1), then the set of lower populations and their sub-goals can be seen as the environment upon which the level n XCS is operating. If the level n − 1 sub-populations are themselves instances of XCS, then the choice of a sub-population can be viewed as invoking a lower XCS to run an episode that seeks to reach the specified sub-goal. An implementation strategy is thus revealed. The standard XCSC implementation was therefore modified so that the environment for any XCS above the base level is an XCS instance. To invoke the lower XCS, an upper XCS writes the selected sub-goal into the environment of the lower XCS and then invokes a single trial of the lower XCS.

Whilst all levels of the Feudal Hierarchy are given input from the current environmental state, levels lower than the uppermost will also have the sub-goal state identified in their input message. It was recognised that a full specification of the sub-goal within the message would double the message (and condition) size of the lower XCS, and that only a small number of the states covered by the lower-level XCS would be used as potential sub-goals. Therefore the sub-goals within each environment subdivision are identified within a user-supplied table, and the sub-goal choice and message are constructed from the index into that table.

Upon invocation, the lower-level XCS uses its population to find the best route to the sub-goal selected by the upper XCS. If the current state is outside the area covered by the selected lower XCS, it immediately returns without any further action (or reward value to the upper XCS), so that the discounted payoff mechanism identifies the selection of that sub-population as a "null action". Otherwise the XCS uses its population to identify (and learn) the optimal route to the chosen sub-goal. During the operation of the lower XCS, any action that would cause movement out of the state subdivision covered by that XCS is prevented, so that each state-space decomposition is treated as though it were the only state-space for that XCS. If the sub-goal is achieved within the number of steps allowed for a trial, an internal reward value is given to the XCS. If the sub-goal is not achieved, no reward (or penalty) is given; the temporal difference update will identify the route as sub-optimal without penalty. At the end of a trial, control is handed back to the upper-level XCS without reward. The uppermost XCS is the only XCS to receive environmental reward, and uses temporal difference learning to learn the optimal choice of sub-populations and sub-goals from this payoff. Each trial of XCS at any level is an unaltered XCS trial, including the normal induction algorithms. However, the explore-exploit choice is specified by the uppermost XCS.
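The invocation protocol described above can be sketched as follows; all classes, methods and the reward plumbing are hypothetical stand-ins for the real implementation:

    # Sketch of the two-level Feudal invocation: the upper XCS treats each
    # lower XCS as part of its environment, writing a sub-goal into it and
    # running one trial of it as a single 'action'.
    INTERNAL_REWARD = 600.0

    def lower_trial(lower, subgoal_index, state, env, step_limit=50):
        """One episode of a lower XCS seeking the indexed sub-goal."""
        if not lower.covers(state):
            return state                  # outside its sub-division: null action
        for _ in range(step_limit):
            action = lower.select_action(state, subgoal_index)
            state = env.step_within(lower.region, state, action)
            if state == lower.subgoals[subgoal_index]:
                lower.distribute_reward(INTERNAL_REWARD)
                break                     # sub-goal reached within the limit
        return state                      # no reward is passed upward here

    def upper_step(upper, lowers, state, env):
        pop_id, subgoal_index = upper.select_action(state)   # two-part action
        state = lower_trial(lowers[pop_id], subgoal_index, state, env)
        upper.distribute_reward(env.environmental_reward(state))
        return state                      # only the top level sees env reward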


The capture of integrated reports for even the simple two-level hierarchies used within this investigation is problematic: the learning rates of the two levels are different, and each sub-population will be invoked at a different rate within any non-trivial environment. Therefore each sub-population produces separate reports, and these results are gathered for presentation as appropriate to the experiment.

To ready the Feudal XCS for use within any given environment, the user not only supplies the environment but also identifies a state-space decomposition by specifying the number of state-space subdivisions and the states within each subdivision. The user also specifies the sub-goal states that can be chosen within each subdivision, a subset of the states in the subdivision. Within a simple two-level Feudal XCS this input can be used directly to create the hierarchy by constructing an XCS instance for each subdivision of the state space and a single upper-level XCS. For more complex hierarchies a further specification of the combination of XCS instances at each level would be required. Finally, the user can also supply a separate parameterisation of the XCS instances in each level (although not of those within a level) to allow local parameter optimisation where necessary.

8.5.2 Using the Feudal XCS in a unidirectional Environment

The Feudal XCS described in section 8.5.1 was created and, after appropriate testing, was applied to the same length 10 environment used within section 8.4.2 so that comparative performance data could be gained. The length 10 environment was subdivided into two length 5 environments to correspond to the decomposition within section 8.4.2. State 5 was designated as the sub-goal for the first subdivision, and the reward state, state 20, was the sub-goal for the second. Since the aim of the Feudal XCS is to allow an upper-level XCS to prescribe not only the sub-population to use but also the sub-goal to move towards, two sub-goals were specified for each sub-division, although both referenced the same sub-goal state. This input was used by the Feudal XCS to derive the two-level XCS hierarchy in the manner described in section 8.5.1.

The message for the top-level XCS consisted of the current state, with its output specifying the lower-level XCS to use (1 bit) and the sub-goal to select (1 bit). The message for the lower-level XCS instances consisted of the current state and the sub-goal specified by the upper level (1 bit). The action consisted of the direct environmental action: 1 bit to choose between two actions, one optimal and one sub-optimal. The action of the upper-level XCS thus had an extra bit added to it, and the message of the lower-level XCS also required an additional bit. Therefore the parameterisation of each level of the Feudal Hierarchy was modified to reduce the redundancy of the encoding used in the previous experiments, so that the total number of bit positions remained constant (although, of course, there remains a small change in search-space size due to the fact that actions are bit values whilst conditions are ternary). The parameterisation changes for each level are given in Table 8.3.


Parameter                   Upper Level   Lower Level
Message / Condition Size    6             7
Action Size                 2             1
Population Limit            400           400
Trial Limit                 20            50

Table 8.3 - Changed parameter settings for Feudal XCS in the length 10 environment
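For concreteness, the message and action layouts of Table 8.3 might be assembled as below; the bit ordering is an assumption made for illustration:

    # Possible assembly of the Table 8.3 layouts: the top level sees the
    # state alone (6 bits) and emits a 2-bit action (lower XCS + sub-goal);
    # a lower XCS sees the state plus the sub-goal bit chosen above it
    # (7 bits) and emits the 1-bit environmental action.
    def top_message(state):
        return format(state, '06b')                 # 6-bit condition input

    def top_action(lower_id, subgoal_bit):
        return f"{lower_id:01b}{subgoal_bit:01b}"   # 2-bit action

    def lower_message(state, subgoal_bit):
        return format(state, '06b') + f"{subgoal_bit:01b}"  # 7-bit input

    print(top_message(5), top_action(1, 0), lower_message(5, 0))
    # 000101 10 0001010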

[Plot: relative error, sub-population size and iteration counts for the top-level XCS and sub-populations 0 and 1, against exploitation episodes]

Figure 8.8 - The Performance of Feudal-XCS in a Length 10 unidirectional FSW

The Feudal XCS was run within the length 10 environment for 10 runs, each of 15000 trials. The trial reports were recorded separately for each XCS instance within the hierarchy and then brought together in figure 8.8.


It can be seen that the Feudal XCS is able to achieve the optimum pathway in all sub-populations and the optimum ordering of the sub-populations: it has learnt at all levels concurrently. The rapid fall in the System Relative Error indicates that all sub-populations found stable solutions at an early stage. The additional time taken for the population of the top-level XCS to fall is more indicative of the difficult generalisation task it faces: it is presented with only two states (the start state and the first sub-goal) before reaching the reward, and yet has to identify the optimal generalisation using these two states when seven condition bits are available and none of the other bit values change on later presentation. Each sub-population learns and proliferates [O]. For example, the optimal classifiers identified by the top-level XCS are illustrated in Table 8.4. Notice that in this run (though not in all) the accuracy and action-set domination are not as high as would be expected. This is due to the presence of two over-general classifiers with moderate numerosity that were accurate in six of the eight action sets occupied by these rules. It appears that the lack of population pressure on these classifiers allowed them to continue to hold space in the population.

Classifier   Pred.      Err.    Acc.     Fit.    N    AS      Exp.
####0→00     710.000    0.000   0.3390   1.000   22   61.08   12044
####0→01     504.106    0.000   0.4056   1.000   14   48.76   6543
####0→10     710.000    0.000   0.2787   1.000   14   51.88   13930
####0→11     504.102    0.000   0.4052   1.000   14   57.64   6063
####1→00     710.000    0.000   0.5294   1.000   27   51.92   6729
####1→01     1000.000   0.000   0.6377   1.000   26   52.67   13019
####1→10     709.999    0.000   0.3953   1.000   25   60.38   6723
####1→11     1000.000   0.000   0.6423   1.000   21   54.70   13023

Table 8.4 - [O] for the top-level Feudal XCS within a unidirectional FSW.

Before progressing to environments with separate sub-goals, the Feudal XCS was applied to the length 20 environment to identify how it performs with four sub-populations. The parameterisation of the lower XCS populations was changed so that they were provided with a population size limit of 600 micro-classifiers rather than 400. The parameterisation of the upper XCS was kept the same as previously. Figure 8.9 shows the averaged result of 10 runs of Feudal XCS in the length 20 environment, showing that both the upper and the lower XCS instances are able to learn the optimal action chain and to find and proliferate the optimal sub-population.


[Plot: relative error, sub-population size and iteration counts for the top-level XCS and sub-populations 1 and 3, against exploitation episodes]

Figure 8.9 - The Performance of Feudal-XCS in a Length 20 unidirectional FSW using four sub-populations

[Two plots: system relative error and iteration counts for the top-level XCS and sub-populations 0 to 7, against exploitation episodes]

Figure 8.10 - The System Relative Error and Iteration counts of Feudal-XCS in a Length 40 unidirectional FSW with eight sub-populations.


Finally, the length 40 environment was used, with eight sub-populations. The parameter settings of the sub-populations had to be modified to deal with the expanded condition size required: the population size was changed from 400 to 500 and the generality setting from 0.33 to 0.5 (although the latter change appeared to have little effect). The population size in the upper-level XCS was set to 2000, since an action size of four bits (three bits to select the sub-population and one bit for the sub-goal) expanded the required mapping considerably. No other parameters were modified. Figure 8.10 presents the sub-population sizes, system relative error, and iterations for each XCS from one typical run.

Once again it can be seen that the Feudal XCS is able to find the optimal solution using eight sub-populations, although the iteration plot for the top-level XCS demonstrates an occasional single non-optimal selection in a trial at 5000 exploitation episodes. An examination of the population identified that fewer generalisations than anticipated were available within the population, and even a population of 2000 classifiers was insufficient to provide enough classifier numerosity within the mapping to distinguish the optimal generalisations clearly from the non-optimal. This could be resolved by increasing the population allocation, and this was found to be the case.

The problem is important, however. The full original state encoding was used at all levels, since if the environmental reward does not lie at a sub-goal the top-level XCS would have to be able to perform low-level actions to move from a sub-goal to a [now local] reward. In this environment the environmental reward lies at the same position as a sub-goal, and so there was no need to allow the top-level XCS to select low-level actions. It would therefore be possible to replace the [long] input coding to the controlling XCS with an indication of the last low-level sub-population. This would provide a much reduced input space and reduce the population size required by the upper XCS. Similarly, it would be possible to provide an environment input interface for each low-level XCS that replaced the [long] input coding of the environment with a short coding that is non-Markovian between state-space sub-divisions but Markovian within them. Given the results with SHQ-XCS using a shorter encoding (section 8.4.3), this should not only provide useful performance improvements but would also reduce the problem size for the top-level XCS.

It has now been shown that the Feudal XCS is able to operate in a unidirectional environment to find the optimal local pathways and the ordering of these pathways, achieving an optimum global sequence of sub-XCS selections to the environmental reward. Attention was therefore turned to the ability of the Feudal XCS to operate within an environment where each state-space sub-division identifies two sub-goals at different locations.


The environment selected for these experiments was based on the two-reward corridor environment used within section 6.4.1, and is pictured in figure 8.11.

[Diagram: a two-directional corridor of states s0 to s10 linked by actions 0 and 1, with a reward of 600 at the s0 end and a reward of 1000 at the s10 end; triangles mark the sub-goal states.]

Figure 8.11 - A corridor environment with two sub-goals in each of two state-space sub-divisions.

The state-space was divided into two, with states 0 to 5 within the first sub-division and states 5 to 10 within the other. The sub-goals identified were states 0 and 5 in the first sub-division and states 5 and 10 within the second (denoted by the triangle above the sub-goal state in figure 8.11). State 0 produced a reward of 600 to the upper XCS and state 10 produced a reward of 1000. In this environment the upper XCS must learn both the optimal sub-goal and lower-level XCS to choose from any state, and the order in which to choose the lower XCS instances and sub-goals, in order to maximise payoff from the two payoff sources. Through a number of pre-experimental runs it was found that the optimal population size for both the upper and lower XCS instances was 400. Therefore, initially, none of the parameterisation presented within section 8.3 was changed. The condition size for the top population was set to four bits, with a two-bit action (one bit for the choice of sub-population and one for the sub-goal). The condition size of the bottom populations was set to five bits - four for the current state and one for the desired sub-goal. The action size was kept at one bit for the selection of the two possible actions in each state. Figure 8.12 pictures the average performance figures for ten runs of Feudal XCS within this environment.


[Plot: proportion against exploitation episodes for the relative error, sub-population size, and iterations of the top-level and both low-level XCS instances.]

Figure 8.12 - The Performance of Feudal XCS shows unexpectedly high relative error.

The increase in System Relative Error displayed by the upper XCS population and the lack of convergence in the population was unexpected. An examination of the coverage table produced revealed that the System Predictions in some areas of the state space were marginally incorrect, producing the large system relative error result. Since the system relative error of the first sub-population appeared to have also increased, this became the suspected cause of the error. An examination of the performance of the sub-populations under exploration revealed that the limit of 50 steps within a population led to the sub-populations occasionally failing to achieve their sub-goal. For example, the following action sequence was observed within the top-level XCS during exploration:

1) s7 → sub-population 1, sub-goal 0
2) s5 → sub-population 1, sub-goal 0
3) s5 → sub-population 1, sub-goal 1
4) s5 → sub-population 0, sub-goal 0 (reward 600.0)

In response to action 1 the lower XCS was able to find a pathway to state s5, and action 2 is effectively a null action - it asks the sub-population to move to a state it already occupies. However, at action 3 sub-population 1 should have been able to move to state s10 to receive a reward from the environment of 1000.0. Since no reward was received and the following action begins in the same state, it is clear that under exploration within this sub-population the sub-goal state was not found within the exploration limits imposed. As a result no reward is returned to the upper XCS for an action that would normally yield 1000.0. This causes the prediction for that action to become incorrect within the coverage table and generates the increase in system relative error. Unfortunately, within XCS the failure to achieve a sub-goal has implications for the accuracy estimate of the classifiers proposing the action. Although the failure to achieve the sub-goal was sufficiently infrequent that the classifiers did not lose all accuracy value, it had a sufficiently large influence to reduce their accuracy and affect their performance within the G.A. This in turn led to a lower action-set dominance that served to heighten their inaccuracy.
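The size of this effect can be sketched with the standard XCS Widrow-Hoff updates; the failure rate f, the zero payoff on failure, and the tolerance ε0 = 10 are illustrative assumptions rather than measured values:

$$ p \leftarrow p + \beta\,(P - p), \qquad \varepsilon \leftarrow \varepsilon + \beta\,(|P - p| - \varepsilon). $$

With P = 1000 on a proportion 1 − f of updates and P = 0 on the remainder, the estimates settle near

$$ p \approx (1-f)\,1000, \qquad \varepsilon \approx 2f(1-f)\,1000, $$

so even f = 0.02 yields ε ≈ 39, well above ε0 = 10: rare failures corrupt the accuracy estimate long before they visibly move the prediction.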

Figure 8.13 - The Performance of Feudal XCS rapidly becomes optimal where high limits on sub-population exploration allow sub-populations to reliably find their sub-goals.


To verify this finding, the maximum number of steps per exploration trial was modified to a sufficiently large value to make the possibility of failing to reach the requested sub-goal remote (it was set to 500 steps rather than 50). The Feudal XCS was re-run within the environment and the average performance from ten runs is shown in figure 8.13.

[Coverage graph: System Prediction (500 to 1000) for each (sub-population, sub-goal) pair across states s1 to s9, showing the rising pathway to the highest reward.]

Averaged coverage table (System Prediction and action-set dominance per state):

State  Pop0,Goal0     Pop0,Goal1     Pop1,Goal0     Pop1,Goal1
s1     600.157 55%    707.756 53%    513.469 55%    517.548 58%
s2     600.095 59%    707.612 56%    507.698 57%    512.932 58%
s3     599.928 57%    707.686 53%    509.159 56%    512.199 58%
s4     600.436 28%    707.102 37%    522.271 29%    588.935 31%
s5     600.153 24%    707.735 30%    706.130 29%    996.916 31%
s6     696.052 38%    708.557 41%    707.102 36%    996.443 40%
s7     699.670 31%    708.615 32%    706.851 30%    998.428 33%
s8     708.755 51%    708.371 53%    707.831 50%    998.564 52%
s9     709.106 50%    708.376 51%    707.653 48%    998.184 51%

Figure 8.14 - The Averaged Coverage Table for the top-level XCS in the Feudal XCS. The Coverage Graph illustrates the identification of the optimal pathway to the highest environmental reward.
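The entries in the table are consistent with discounting applied per macro-step rather than per primitive action. The sketch below assumes the usual XCS discount γ = 0.71, which is not restated in this section but matches the ≈ 708 entries:

$$ \text{one macro-step from } s_{10}:\ 1000\ (\approx 997 \text{ observed}); \qquad \text{two macro-steps}:\ \gamma \times 1000 = 710\ (\approx 708 \text{ observed}). $$

For comparison, a flat XCS standing at s1 would predict $\gamma^{8} \times 1000 \approx 65$ for the nine-step route to s10, against 600 for the single step to s0 - a contrast taken up again at the end of this section.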


Whilst the coverage is improved, there remain states within which the dominance is weaker than would be desired (figure 8.14). It is hypothesised that this weakness arises because of the non-uniformity of exploration. Within the persistence experiments in chapter 6 a similar drop in dominance was identified. The use of sub-goals that are achieved by lower populations within the Feudal XCS represents a similar step-like progression through the environment, and a degree of unevenness in exploration is therefore to be expected. Further experiments that reduced the number of start states to two (states s3 and s7) demonstrated that with this reduction the top-level XCS was able to learn the optimum mapping with a high domination of the action sets for those states visited (the sub-goal states and the two start states). Limiting the start states to the sub-goal states would allow the use of the input encoding optimisations mentioned earlier in this section to further improve the performance of the Feudal XCS.

Sub-Pop 1
State, Goal    Backward   Forward   Back Dom   For Dom
s1, goal 0     999.496    504.85    79%        80%
s2, goal 0     707.905    355.626   61%        64%
s3, goal 0     501.367    252.654   79%        81%
s4, goal 0     355.917    180.028   48%        43%
s5, goal 0     252.928    180.159   43%        42%
s1, goal 1     261.661    251.897   57%        61%
s2, goal 1     191.023    354.70    77%        73%
s3, goal 1     257.321    500.623   71%        77%
s4, goal 1     359.917    705.598   74%        79%
s5, goal 1     497.890    994.859   69%        75%

Sub-Pop 2
State, Goal    Backward   Forward   Back Dom   For Dom
s6, goal 0     990.663    499.159   78%        82%
s7, goal 0     701.052    356.815   58%        65%
s8, goal 0     497.219    253.179   57%        57%
s9, goal 0     354.154    180.597   68%        60%
s10, goal 0    254.611    252.088   47%        56%
s5, goal 1     179.957    253.260   74%        67%
s6, goal 1     179.796    355.729   63%        67%
s7, goal 1     252.901    501.234   80%        79%
s8, goal 1     358.270    708.036   75%        75%
s9, goal 1     504.167    998.304   82%        81%

Table 8.5 - Average Coverage Tables for two Sub-Populations in the Feudal XCS

The coverage of the low-level XCS populations remained high (Table 8.5) despite the uncertainty of the higher-level XCS, demonstrating the ability of Feudal XCS to identify the optimal local state × sub-goal × action × payoff mappings in both sub-populations, and empirically verifying the first part of Hypothesis 8.2. The second section of Hypothesis 8.2 suggested that


given a suitable policy these sub-populations could be used to provide a sequence of optimal local routes to achieve a global goal. This is demonstrated by the iterations plot for the top-level XCS in figure 8.13. This plot reveals that the Feudal XCS is able to achieve a global goal using an optimal sequence of one or two sub-population invocations (the line is plotted so that 0.1 on the scale represents the optimal two steps for the longest path). The coverage table presented within figure 8.14 demonstrates that a high-level XCS can be added to learn the optimal sequence of sub-goals and sub-populations to reach a reward state. A consideration of figure 8.13 reveals that the high-level XCS was able to identify the pathway, using the lower-level sub-goals and sub-populations, that achieves the highest payoff from the environment, demonstrating that the mapping created is the optimal global mapping of state × sub-population × sub-goal × payoff. This capability again demonstrates that a careful extension of the XCS mechanisms allows the Optimality Hypothesis to be maintained within such extended mappings. Thus, hypotheses 8.2 and 8.3 are upheld. Whilst the Feudal XCS did acquire the capability to select between global payoffs, it should be noted that the global payoff chosen by the Feudal XCS will not necessarily be that chosen by a normal XCS. For example, in the environment used for these experiments a normal XCS will select the route to state s0, with its reward of 600, when starting in states s1 to s4, and the route to state s10, with its reward of 1000, when starting in states s5 to s9. In the Feudal XCS the reward of 1000 is a maximum of two 'macro-steps' away from any starting location, and therefore the Feudal XCS will always prefer the sequence of sub-goals leading to s10. This is a graphic demonstration that the high-level XCS population plans over sub-goals rather than individual states. As McGovern and Sutton (1998) note, this form of hierarchical approach produces routes to reward states that are optimal at the level of planning.

8.6 Summary of Results

Experimental investigations were conducted to identify the ability of two new structural forms applied to XCS to allow the acquisition of optimal solutions to two forms of long action-chain environments. The first solution is limited in its application to state-spaces that can be sub-divided into connected regions where each region provides a single sub-goal at the point of connection to the next. It was hypothesised that if internal rewards were provided for achievement of a sub-goal then XCS would be able to find the optimal state × action × payoff mapping for each sub-division. Given a suitable policy for identifying the sequence of internal goals, it was hypothesised that an optimum path to the global goal could be constructed. Whilst it is probable that such a mechanism could be applied using a single XCS population because of the dynamic niche-management available within XCS, it was proposed that the XCS population could be decomposed into sub-populations to learn and maintain each partial state-space mapping. The resulting decomposition was likened to HQ-Learning (Wiering and Schmidhuber, 1996), although much simplified to remove the identification of sub-goals for each sub-population. The resultant "Simple HQ-XCS" was applied to a sequence of uni-directional FSW environments of increasing length. Each of these environments provided the opportunity for an optimal and a non-optimal choice of route at each state in the optimal pathway. It was demonstrated that SHQ-XCS was able to learn the optimal mappings for each state-space decomposition through the use of internal rewards. It was also demonstrated that the solution could be scaled to longer environments. As the length of the environment increased the benefit from a utilisation of the state-space partitions to reduce the size of the learning problem became more apparent, and it was demonstrated that a globally highly non-Markovian state-space encoding that was Markovian within each state-space partition would produce benefits both in terms of learning performance and raw execution time. The SHQ-XCS is extremely limited, but provides sufficient confidence in the ability to learn optimal mappings within each sub-population to allow a truly hierarchical solution to be developed. The Feudal-XCS was based on the Feudal Q-Learning proposals of Dayan and Hinton (1993). The Feudal XCS provides a set of low-level XCS instances, each associated with a state-space decomposition. Each state-space decomposition can have one or more sub-goals available within it. A higher-level XCS is introduced to identify the low-level instance that should take control and the sub-goal it must seek to achieve. This architecture can be extended upwards to provide a progressive decomposition of the learning problem. The Feudal XCS was applied to a small two-directional corridor environment using two low-level XCS instances controlled by a single Feudal XCS instance. It was shown that, as long as a guarantee could be given that the low-level XCS would either always or never reach a selected sub-goal, the higher-level XCS would be able to acquire the optimally general accurate state × sub-population × sub-goal × payoff mapping that identifies the optimal global sequence of sub-goals to a reward. It was further shown that the Feudal XCS could identify and select the optimal sequence of sub-goals to the maximum reward in an environment with more than one reward state. However, the choice of sub-goal would be made at the planning level of the sub-goals and would therefore not necessarily represent the optimal trade-off between distance to reward and reward magnitude that is made at a local level by XCS.

As work with SHQ-XCS had led us to believe, the XCS instances covering the state-space partitions were able to rapidly acquire converged, optimally general, accurate state × sub-goal × action × [internal] payoff mappings of their state partitions. The learning of the mappings at each level was performed concurrently with no requirement for pre-training of the lower-level XCS instances.

8.7 Discussion

Chapter 7 has already provided a detailed discussion and review of work on structured and hierarchical solutions within Learning Classifier System research, and has related potential LCS solutions to some of the work within Reinforcement Learning. From that discussion it can be seen that there is little previous work on hierarchical solutions with which to compare this work. The main body of work in this area was performed by Dorigo and Schnepf (1993) and Dorigo and Colombetti (1994); their work has been extensively published, and much of it is covered in Dorigo and Colombetti (1998), with the LCS Bibliography providing a fuller list. Their approach used the ALECSYS LCS implementation, discussed in section 2.5.2. This traditional LCS, augmented with additional operators to reduce the brittleness of the approach, was used to create fixed hierarchical controllers. Their work was characterised by the dependency upon direct environmental feedback for the reward of switching decisions made by the upper-level LCS. Two main applications of hierarchy were presented. The most discussed was a bottom-up hierarchical approach applied within a reactive environment. In this work the low-level LCS each received relevant portions of the input message from the environment and used these to propose an action. The upper LCS received a single-bit indication from each low-level LCS proposing actions, and learnt to select the best LCS to respond. They demonstrated that this architecture could considerably reduce the learning time required within a single 'monolithic' LCS, although for best performance the individual low-level competences had to be trained before the controller was introduced. Much of the power of the decomposition arose from the accompanying decomposition of the input space, and they acknowledged that a less redundant encoding could allow a traditional LCS to perform well in this environment too. Later work (Barry, 1996) revealed that the environment itself was also problematic as a test of performance, since trivial environmental parameter tuning would also allow a single LCS to achieve much improved learning performance. In this architecture the switching LCS received a direct reward based on the behaviour of the selected lower-level LCS in the environment.

Whilst it is agreed that there is some ambiguity in the meaning of the reward in relation to the selected behaviour, the transmission of reward across hierarchical levels may not be appropriate in other circumstances. Nonetheless, a significant achievement of this architecture was the demonstration of the use of up to three layers of switching behaviours. Later work used other forms of switching. In Colombetti and Dorigo (1994) a state memory is used to identify the current goal. The lower-level LCS must learn to use the state memory to identify which LCS should operate, and a co-ordinator LCS learns to control the switch. Although the learning environment is a multiple-step environment, a regular payoff based on the action chosen is provided and training is performed separately. The results they identify are important in the development of their "Behaviour Analysis and Training" method for the production of robot controllers, and as a demonstration of memory-state utilisation within an LCS, predating the more extensive work of Cliff and Ross (1994). The Feudal XCS differs from their work in a number of ways. The primary difference is the use of the Feudal XCS to learn within multiple-step environments - the purpose of Feudal XCS is the decomposition of large action sequences into smaller units and the localisation of reward within those units. The work with ALECSYS was primarily with reactive robotics, without a focus on decomposition or abstraction. In this regard the Feudal XCS is unique. Secondly, the Feudal XCS selects lower-level capabilities based on identified sub-goals, and uses these to plan at a higher level. Whilst ALECSYS used primitive modules to select between behavioural competences, it did not use the competences to identify sub-goals that established a route to a rewarding state. Finally, the Feudal XCS maintains all the capabilities of XCS to acquire accurate and optimally general mappings of each state-space partition and of the sub-goal space. ALECSYS used LCS techniques that provided only very limited internal model-building capabilities. Booker (1982) used multiple instances of his GOFER LCS implementation to differentiate between input and output mappings and enable the LCS to learn internal associations between input and output. This represents a different aim to that of the Feudal XCS, and no benefit is gained from a comparison of the approaches. Bull and Fogarty (1994) and Bull, Fogarty and Snaith (1995) used a number of classifier populations communicating with one another through shared message lists. These LCS populations were stimulus-response systems, although learning a long-term behaviour. They achieved considerable co-ordination by means of an emergent feature. The co-ordination produced was to switch in and out the posting of messages to effectors by particular populations, and therefore has much in common with the single-shot switch architectures of the ALECSYS-based work discussed earlier.

8.8 Conclusion and Further Work

In chapter four it was demonstrated that XCS is unable to establish accurate mappings of simple corridor environments that require longer than moderately sized action chains to achieve a terminal state. A number of techniques could be applied to encourage the production of accurate mappings of long action chains, such as the careful tuning of discount and reward parameters or the careful application of subsumption to remove competing classifiers. This chapter has presented an alternative approach - the removal of the requirement for long action chains (Wilson and Goldberg, 1989) through the introduction of hierarchical methods. Since no previous results have been gained from the addition of hierarchical methods to XCS, and only one group of related research has previously been carried out on competence hierarchy within a traditional LCS architecture, there was a requirement to establish the potential of XCS for use in hierarchical work. Thus, a simple form of state-space decomposition and a fixed pre-specified hierarchical approach were each introduced to identify the ability of XCS to use these approaches in the learning of long action chains. It was shown that state-space decomposition can be used to acquire short action chains that can be conjoined to produce longer action sequences. The internal reward mechanism adopted separates each sub-division from a complete reliance on the ready availability of local environmental reward and allows the learning of routes to pre-identified sub-goals. It has also been shown that XCS can be applied in a fixed hierarchical structure to learn this sequencing and thereby plan over the space of sub-goals rather than individual states.

There is much scope for further work. The work presented in this chapter represents an initial investigation, and the Feudal XCS architecture must be applied within environments requiring more sub-populations in order to assess its true worth. The environments used were limited to simple corridor worlds - the application of Feudal XCS to a decomposition of larger two-dimensional state spaces represents a key aspect of further work. The Feudal XCS applied here provided only a single level of decomposition, and yet the approach would appear to allow multi-level Feudal XCS implementations to be developed; this aspect requires investigation. The Woods-14 problem was identified as a typical long action-chain problem that XCS found difficult to solve without a change to the exploration mechanisms of XCS. This environment is difficult not only because of the length of the action chain but also due to its exploration complexity. It was suggested that the provision of localised internal rewards could help to establish the many short action chains required to reach the environmental reward. The use of a hierarchical approach to the solution of some of the problems presented by exploration complexity therefore also remains to be investigated. Feudal XCS represents an application to LCS of an approach from the field of tabular-based Temporal Difference Reinforcement Learning. Chapter 7 has identified a number of other approaches, and there is a rich vein of results in the area of Hierarchical Learning to apply to the LCS field. LCS techniques now have the advantage of reliable generalisation. It may be that this feature will allow more complex emergent hierarchy to be used. Lanzi's use of internal state (Lanzi, 1998a, 1998b, 1998c) provides a possible mechanism for the emergent partitioning of a single XCS population through the use of state 'tags' - an approach requiring reliable and optimal generalisation if the "curse of dimensionality" that then arises is also to be tackled. The Feudal XCS represents a first application of hierarchy for use in the decomposition of long action chains and the provision of higher-level planning within LCS. Whilst unique, it represents only a step on the path to more useful hierarchical structure. Its form of hierarchy is highly pre-specified and is a rigid approach based upon decomposition convenience. Wilson (1988) proposed a flexible emergent approach that, whilst theoretical and missing key detail, is indicative of the true direction that investigations of hierarchical structure in LCS must take. This work will contribute to efforts in this direction by demonstrating the value of hierarchical decomposition within XCS and illustrating that even a rigid approach to hierarchy, carefully applied, does not disrupt the unique capabilities of XCS that are so vital to any future emergent hierarchy work.


Chapter 9

CONCLUSION

9.1 Background

Although at the time that this project was proposed Wilson's XCS was still two years from identification and publication, and four years from acknowledgement by the wider community, the problems with the traditional LCS approach (chapter 2) were understood. These difficulties were not limited to the use of LCS within multiple-step environments - indeed the fundamental problem of how a population that is competitive under the Genetic Algorithm can also be co-operative was widely acknowledged (Booker, 1988; Smith, 1991; Goldberg, 1989; Riolo, 1989a). However, leaving aside the problem of producing and maintaining Default Hierarchies, the most embarrassing failure was the inability of the Michigan LCS to establish and maintain the rule chains necessary to operate within multiple-step environments (Riolo, 1989a; Compiani et al, 1990). Although not featuring within his earliest works, it was the Bucket-Brigade credit allocation mechanism and the establishment of rule-chains, together with the idea of Default Hierarchies, that Holland dwelt on at some length in many of his papers (Holland, 1986, 1987; Holland et al, 1986; Holland et al, 2000). It may be hypothesised that the rule-chain concept is not itself the main issue; rather it suffers because of problems generated within other parts of the LCS architecture. There is evidence to support this view, with Riolo able to demonstrate the successful use of pre-instated rule chains (Riolo, 1987a), Forrest using pre-programmed rule chaining extensively in her use of the LCS framework to implement the KL-ONE structure (Forrest, 1985), and Riolo demonstrating a remarkable lookahead mechanism based on a complex of internal rule chains to allow an LCS to demonstrate 'Latent Learning' (Riolo, 1990; see also Holland, 1990). Nonetheless, the fact remained that there were few instances of the emergence of long action sequences, and only one that used a mechanism close to that proposed by Holland (Wilson and Goldberg, 1989). Wilson and Goldberg (1989) proposed some solutions to the problems experienced within LCS at that time. In regard to the rule-chaining problem one of their solutions was pragmatic - remove the need for long rule chains. Wilson (1988) had already suggested the use of hierarchy within LCS, although his work was both theoretical and incomplete.

This approach is interesting to an observer of the field, since the very basis of the growth of the Artificial Life area was the reaction against the highly structured solutions proposed by some within the A.I. community. Such suggestions were therefore [probably unknowingly] carefully framed in the context of an emergent system. Achieving emergence of useful internal structure might be described as the "Holy Grail" of Machine Intelligence research. It is perhaps surprising, then, that so few took up the challenge laid down by Wilson and Goldberg. It is not surprising, however, that the few who did followed a predominantly non-emergent route (Dorigo and Colombetti, 1994; Donnart and Meyer, 1996a, 1996b) and that the one that relied on rule-chains did not have a dominant Genetic Algorithm. The development of Wilson's XCS (Wilson, 1995, 1998) was a milestone for LCS research. Whilst not novel in its components or approach (see Wilson, 1985, 1994; Booker, 1988), it was unique in its formulation. Using mechanisms that provided genuine emergent co-operation, whilst applying the Genetic Algorithm in a dominant position with a metric that clearly identified the optimum classifiers, gave XCS a power that has not been found in any other LCS formulation. The strength of the XCS approach was not only identified within toy problems: in a funded research project that arose as a direct result of the research identified within this document it has been applied to an industrial application with highly encouraging results (Saxon and Barry, 2000; Greenyer, 2000). Whilst early work with XCS demonstrated its exceptional ability in direct reward environments (Wilson, 1995, 1996, 1998; Kovacs, 1996, 1997), its ability within delayed reward multiple-step environments was less clear. Lanzi (1997a, 1997b) demonstrated encouraging results, but it was unclear to what extent XCS was able to address the development and maintenance of action chains (though see Lanzi, 1997b). Using the lessons learnt in the past within the LCS field, and through an analysis of the operation of XCS, it was hypothesised that although XCS was able to produce and maintain action chains, the length of the action chains would be limited when the Genetic Algorithm is used with generalisation pressure (Hypothesis 1). The rationale behind this hypothesis was simple - the temporal difference prediction payoff mechanism employed within XCS produces classifiers whose fixed predictions become sufficiently small that the differences between them could be small enough to allow a single classifier to represent them all whilst retaining a high relative accuracy. If this was the case, then the Wilson and Goldberg (1989) suggestion that rule chains should be kept short would also be valid within XCS.

It was hypothesised that since XCS was able to identify and proliferate the optimally general accurate sub-population of classifiers, if XCS was applied within a hierarchical framework it would be able to identify and proliferate optimal classifier populations covering the mappings of state subspaces and the mapping of inputs to sub-goals and sub-populations, enabling the development of hierarchical solutions (see Hypothesis 2). Although it was acknowledged that truly emergent hierarchies are required, the demonstration of Hypothesis 2 would be a clear milestone on the road to emergent hierarchies. Indeed, the reliable use of internal state representations (Lanzi, 1998a, 1998b, 1998c; Lanzi and Wilson, 1999) now indicates that this goal is achievable.

9.2 A Review of the Main Findings

This work started from the possibly negative assumption that although XCS was a system of great promise, there would be limitations within it that would require additional measures to rectify. This spirit is not foreign to work within LCS - much of the previous work in LCS has sought to introduce new techniques or mechanisms to enhance the operation of the LCS architecture. Indeed, it was Stewart Wilson who was good enough to send a gentle reminder that both his and Holland's proposals were only frameworks that should not be cast in stone. Unfortunately, it seems clear that many additions in the past have been poorly thought through, requiring further mechanisms to provide balance. For example, whilst Riolo's work is certainly fundamental and his results influential, the CFS-C implementation he used is ill-balanced, requiring a multitude of mechanisms to maintain populations without losing classifiers that contribute to good performance. Even the relatively lightweight SCS implementation (Goldberg, 1989) props up one strategy (elitist selection) with another (crowding). The additional facilities identified within this work are not provided to fix a problem; they are intended to extend the application of XCS to otherwise unreachable areas.

9.2.1 Long Action Chains

A key contribution of this work is the foundational work on the ability of XCS to acquire long action chains. It was noted that whilst Lanzi (1997a, 1997b) identified some key weaknesses in the ability of XCS to perform within some delayed reward environments, these findings were in environments that did not separate out the problems of exploration complexity, environmental shape, or parameterisation from the central issue of action-chain length. Finite State Worlds, used previously by Grefenstette (1987) and Riolo (1987b, 1989a), were re-introduced to provide a greater degree of control over the test environment. By controlling these potentially confounding variables, it was shown that XCS is able to learn the optimal path in long uni-directional corridor environments. However, this was in cases where generalisation was not introduced, effectively learning the Q-tables of traditional Reinforcement Learning. Once generalisation was introduced, it was shown that although XCS could establish action chains, they would not identify the optimal path reliably. The surprising result was how short these rule chains were. Although it was acknowledged that the threshold number of actions beyond which XCS would select sub-optimal routes could have been influenced by coding regularities, it appeared as if action chain lengths of between 10 and 12 actions represented the limit for the parameterisation used. Key to this work has been the adoption of the Optimality Hypothesis as a form of 'Gold Standard'. It was thus important to demonstrate for the first time that Kovacs' Optimality Hypothesis held within multiple-step environments. It was identified that XCS would indeed produce and proliferate [O] within simple corridor environments with action chain lengths of up to 15. However, this ability was shown to break down dramatically once prediction values became low, and over-general classifiers would appear to cover the early low-prediction states. It was hypothesised that these classifiers were able to trade off inaccuracy against the benefit of appearing in more action sets (and therefore gaining more opportunities to become involved in the G.A.). It was noticed that in some runs a completely general classifier was developed within each action set. The Domination Hypothesis was proposed to capture this phenomenon. This hypothesis identifies that whenever the distance to an environmental reward is sufficiently long, a fully general classifier can sustain itself by using a high numerosity to dominate action sets and suppress the prediction of the action set. Once the prediction is suppressed the classifier becomes accurate (since accuracy is calculated relative to the system prediction) and so gains access to the G.A. This allows the classifier to proliferate, further dominating the action sets. This hypothesis was shown to have grave implications for any wide-ranging subsumption mechanism, such as Wilson's action-set subsumption operator (Wilson, 1998; Butz and Wilson, 1999). A possible improvement on this operator was proposed for further research. These findings demonstrate limitations on the rule-chaining ability of XCS. No claim was made in regard to their application to other environments due to the confounding nature of other variables that were carefully controlled in this work. However, it would be expected that similar effects would be evident in other environments, since the two main factors identified were the combination of payoff similarity and distance from an environmental reward.
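The payoff-similarity factor admits a simple sketch using the discounted payoff landscape; γ = 0.71, R = 1000 and ε0 = 10 are assumed illustrative values rather than figures restated from the experiments:

$$ P(k) = \gamma^{\,k-1} R, \qquad \Delta(k) = P(k) - P(k+1) = \gamma^{\,k-1}(1-\gamma)\,R, $$

so the gap between adjacent niches falls from 290 at k = 1 to about 74 at k = 5 and only 13 at k = 10. By ten to twelve steps from the reward the gap is of the same order as ε0, and a single over-general classifier spanning several niches can remain within the accuracy tolerance - consistent with the 10 to 12 action limit observed.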

Despite these findings, experiments using the GREF1 environment (Grefenstette, 1987; Riolo, 1989a) suggest that XCS is able to learn the optimal solution in less time than either the rule-chaining solution of CFS-C or the Pittsburgh-based SAMUEL LCS. Whilst direct comparison is problematic due to the different nature of the architectures employed, this result is certainly encouraging. When it is also considered that XCS learnt the complete optimal mapping of the environment rather than just the optimal route, the power of the XCS approach is revealed.

9.2.2 The Consecutive State Problem

The findings on action chain length were encouraging in one aspect - they indicate the value of looking at hierarchical approaches even for XCS. One potential problem for XCS with hierarchical approaches was identified - the potential for reward aliasing. One use of hierarchical approaches within Reinforcement Learning has been to allow Reinforcement Learners to operate within non-Markovian environments, where the aliasing states can be placed in separate sub-populations to prevent aliasing. Unfortunately, many approaches to hierarchy envisage the payment of invoked sub-populations or rule-chains based on an internal reward and the current payoff (e.g. Wilson, 1988; Digney, 1996a, 1996b, 1998; Dietterich, 2000). If such a strategy were adopted with an XCS-based system the change in payoff would generate inaccuracy and create a reward alias. The effects of aliasing were perhaps predictable but unknown at that time. Whilst Lanzi (1998a) identified a solution that is general to all forms of perceptual aliasing, the investigation conducted for this research led in another direction. Using corridor environments as test environments, the effect of consecutive states providing the same message was examined. To differentiate this form of aliasing, which arises purely because of the differential payoff regime, from other forms, it was labelled the Consecutive State Problem. Lanzi's work with perceptual aliasing dealt only with separated aliasing states, and it also failed to address the effect that aliasing had not just on the classifiers covering the state but also on preceding classifiers. It was shown that where there was a systematic and even traversal of the states within the aliasing area the effect was localised to the classifiers covering the area. Where the states were visited irregularly, however, classifiers covering the preceding state were made inaccurate. A surprising finding was that when the G.A. was allowed to operate the classifiers covering the aliasing states were not removed because of their inaccuracy. In fact, another form of the phenomenon of over-general classifiers seen within the length experiments was identified. Classifiers covering the aliasing states would trade off additional action-set membership against their inaccuracy and so take over earlier [potentially accurate] states. Although Lanzi's memory state mechanism could be used to overcome the consecutive state problem, it would be difficult to scale for use within the long corridor environments common in many robotic problems. A new solution was proposed that used a change in message pattern to control the cycle of the LCS. An action proposed at the start of the consecutive aliasing states would therefore persist for the length of the aliasing states, gaining a single payoff from the next non-aliasing state. This mechanism was shown to solve the consecutive state problem, and has the advantage of zero impact on the operation of XCS in normal environments.
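A minimal sketch of this message-change mechanism follows; the detector width and the sense/select/apply/update routines are stand-ins for the real implementation, not quotations from it:

#include <stdbool.h>
#include <string.h>

#define MSG_LEN 8                 /* assumed detector message width */

/* Stand-ins for the real environment and XCS routines. */
void sense(char msg[MSG_LEN]);
int  select_action(const char msg[MSG_LEN]);
bool apply_action(int action, double *reward);  /* true when terminal */
void update_action_set(double payoff);

/* One trial: a fresh match/select cycle is triggered only when the
 * detector message changes.  An action chosen on entry to a run of
 * identical (aliasing) states therefore persists across the run and
 * is paid once, from the first non-aliasing state that follows. */
void run_trial(void)
{
    char msg[MSG_LEN], prev[MSG_LEN];
    int action = 0;
    bool first = true, terminal = false;
    double reward = 0.0;

    while (!terminal) {
        sense(msg);
        if (first || memcmp(msg, prev, MSG_LEN) != 0) {
            if (!first)
                update_action_set(reward);  /* single payoff on state change */
            action = select_action(msg);    /* normal XCS cycle resumes */
            memcpy(prev, msg, MSG_LEN);
            first = false;
        }
        terminal = apply_action(action, &reward);
    }
    update_action_set(reward);              /* final environmental reward */
}

Because matching is suppressed only while the message is literally unchanged, the mechanism is inert in environments without consecutive aliasing states, in line with the zero-impact claim above.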

9.2.3 Action Persistence

A related area of work was revealed by the work on aliasing states that would enable the use of longer action chains without resorting to a hierarchical structure. It was hypothesised that it would be possible to add a specification of the length of time (in terms of actions upon the environment) over which an action was to be applied, and that the Optimality Hypothesis would allow XCS to learn the optimal state × action × duration × payoff mapping for a given environment. The modifications required to XCS in order to add a suitable mechanism were, as with those required for the solution to the consecutive state problem, lightweight. It was shown in corridor environments and in a more complex multiple-route environment that XCS was able to establish this mapping. Initial experiments adopted a direct approach to the payoff regime that changed the nature of the LCS, so that XCS selected the pathway that traded off the smallest number of distinct actions against the reward received. It was demonstrated, however, that discounting the payoff in proportion to the number of steps taken, together with a further small discount in relation to the length of the duration proposed, restored the normal XCS trade-off between the distance to reward and the reward magnitude. Previous work to add action persistence to an LCS had been carried out by Cobb and Grefenstette (1991) within their Pittsburgh-like SAMUEL LCS (Grefenstette and Cobb, 1990). Due to the nature of the learning problem they used, a direct comparison of the performance of the two LCS approaches could not be conducted. However, SAMUEL learns only the optimum policy, whilst the modified XCS learns the complete and optimal state × action × duration × payoff mapping for a given environment.
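One way to write such a regime is sketched below; the exact form and constants used in chapter 6 are not restated here, so this is an illustrative assumption rather than the thesis's formula:

$$ P(n, d) \;=\; \gamma^{\,n}\,\delta^{\,d}\,R, \qquad 0 < \delta \lesssim 1, $$

where n is the number of primitive steps actually taken to reach the reward R and d is the proposed duration: the dominant γ^n term ranks routes by real distance, while the small δ^d term applies the further duration-related discount.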


The use of persistence brings a number of penalties with it. In order to specify the persistence, additional loci must be added to the action. This increases the number of niches that must be reliably maintained within the population. It may therefore be the case that the introduction of action persistence specification over a large number of steps [where the meaning of 'large' is unknown at present] is not a pragmatic proposition, due to the learning time required or to as yet unknown limitations in the ability of XCS to maintain a large number of niches. Wilson (2000c) provides hope for the latter case, applying XCS in a domain that required the preservation of a very large number of niches, although only within a single-step environment. A more fundamental problem was the uneven exploration of the state space that resulted from the use of persistence. This has a number of effects on XCS, most noticeably an increase in population pressure upon the population niches covering the less frequently visited states. It was proposed that an operator that updated the action-set size estimate of all classifiers in a niche upon deletion of a classifier could prevent the erosion of these niches.
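A minimal sketch of such a deletion-time update follows; the Classifier record and the conditions_overlap test are hypothetical stand-ins, since the thesis only proposes the operator:

/* Hypothetical classifier record - only the fields needed here. */
typedef struct {
    unsigned condition;   /* stand-in for the real ternary condition */
    int      action;
    int      numerosity;
    double   as_estimate; /* estimate of the action-set sizes it joins */
} Classifier;

int conditions_overlap(unsigned a, unsigned b);  /* stand-in niche test */

/* On deletion of 'dead', immediately shrink the action-set size
 * estimate of every classifier sharing its niche, rather than waiting
 * for the (possibly rare) next visit to that niche to correct it. */
void update_niche_on_deletion(Classifier *pop, int pop_size,
                              const Classifier *dead)
{
    for (int i = 0; i < pop_size; i++) {
        Classifier *c = &pop[i];
        if (c == dead || c->action != dead->action)
            continue;
        if (conditions_overlap(c->condition, dead->condition)) {
            c->as_estimate -= dead->numerosity;
            if (c->as_estimate < 1.0)
                c->as_estimate = 1.0;  /* a classifier always joins its own set */
        }
    }
}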

9.2.4 Hierarchy

From the results presented within the length tests it was clear that the action-chaining mechanism of XCS was unable to maintain the accurate optimal mapping for the scene-setting classifiers within even moderately sized action chains. Whilst there may be parameter settings that improve these results considerably, it is clear that limits will remain. The use of hierarchical or structured solutions to this problem is therefore justified. There have been a number of separate approaches to the structuring of populations within LCS to achieve better performance, although none has sought to use hierarchy to reduce action chain length. In the absence of any existing comparison, a comprehensive review of hierarchical and structured approaches to LCS is presented, and a comparison is drawn to relevant work from the wider Reinforcement Learning community. It is argued that most previous attempts to introduce hierarchy into LCS have been motivated by a desire to improve performance without any metric for success. A framework for future LCS hierarchical research is presented to highlight three fundamental objectives - the use of abstraction to gain higher-level learning, the leverage of re-use to reduce the learning effort and share solutions, and the provision of decomposition to break large units (such as rule chains) into maintainable and re-usable units. It is argued that all future hierarchical research within LCS should focus on one or more of these goals as a measure of the usefulness of the approach.


Two new structured approaches are introduced to LCS work with the objective of decomposition - the reduction in the required size of rule chains. SHQ-XCS is a simplistic decomposition of the state-space with a pre-specified hand-over sequence. It is shown that the provision of internal reward for the achievement of sub-goals will allow XCS to identify short rule-chains rapidly and sequence them to find optimum paths to a reward in long corridor environments. This approach is expanded to produce the Feudal-XCS, a truly hierarchical LCS approach. The Feudal XCS introduces a high-level XCS that learns the optimal sequence of sub-populations to invoke, and the optimal sub-goals that they must each achieve. It is shown that using this mechanism Feudal-XCS operates in the domain of sub-goals rather than primitive actions, potentially meeting the desire for abstraction. The provision of the Feudal XCS verifies the thesis Hypothesis 2 - hierarchical solutions are available that reduce the need to learn long action chains.
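The control flow just summarised can be sketched compactly; every routine name, the internal reward magnitude, and the step budget below are illustrative stand-ins rather than the thesis's implementation:

#define MAX_LOW_STEPS 500   /* generous budget so sub-goals are reliably reached */

/* Stand-ins for the top-level XCS, the low-level XCS instances, and the environment. */
void top_select(int state, int *subpop, int *subgoal);
int  low_select(int subpop, int state, int subgoal);    /* one primitive action */
int  env_apply(int state, int action, double *reward);  /* returns next state */
int  at_subgoal(int state, int subpop, int subgoal);
void low_update(int subpop, double payoff);
void top_update(double payoff);

/* One Feudal trial: the top level picks a (sub-population, sub-goal) pair;
 * the chosen low-level XCS acts with primitive actions until it reaches the
 * sub-goal (earning an internal reward) or exhausts its budget.  The whole
 * invocation is a single discounted macro-step at the top level. */
void feudal_trial(int state, int max_macro_steps)
{
    for (int m = 0; m < max_macro_steps; m++) {
        int subpop, subgoal;
        top_select(state, &subpop, &subgoal);

        double env_reward = 0.0;
        for (int s = 0; s < MAX_LOW_STEPS; s++) {
            int action = low_select(subpop, state, subgoal);
            state = env_apply(state, action, &env_reward);
            if (at_subgoal(state, subpop, subgoal)) {
                low_update(subpop, 1000.0);  /* assumed internal reward level */
                break;
            }
        }
        /* on failure no internal reward is paid - the case shown above
         * to disturb the top-level accuracy when the budget is too small */
        top_update(env_reward);              /* macro-step payoff to the top */
        if (env_reward > 0.0)
            return;                          /* environmental reward reached */
    }
}

Read this as a statement of information flow rather than implementation detail: the only quantities crossing levels are the chosen (sub-population, sub-goal) pair going down and the macro-step payoff coming up.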

9.2.5 Other Contributions to LCS Research

In addition to these findings, other benefits have accrued within the research project. No publicly available XCS implementation existed at the time XCS was adopted, and the information available from which to produce an implementation was incomplete. A C-based implementation was therefore developed and made available for other researchers to use. It has since been adopted by a number of workers. Alongside the implementation, the formal specification of XCS included in this thesis was produced to provide an unambiguous reference to the XCS Classifier System. Providing these products and links to other LCS researchers has resulted in the production and maintenance of the LCSWEB - a clearinghouse for LCS researchers. In addition to the five papers published within leading international conferences (Barry, 1993, 1996, 1999a, 1999b, 2000), this work led to funding for a business-linked research project applying XCS to the field of Data Mining (Saxon and Barry, 2000). This became the first deployed use of XCS technology within industry, and the product has since demonstrated its abilities within the COIL'2000 data-mining competition (Greenyer, 2000).

9.3 Further Work

Each of the areas studied within this research project was carefully bounded within the experimental stages so that the areas examined were manageable. As a result there are many possible avenues for further research arising from this project, and these are identified in detail within the conclusions for each chapter. There are, however, a number of main themes that can be highlighted for further work.

9.3.1 Long Action Chains

There are many factors that affect the ability to develop and maintain long action chains. It is possible that the generalisation problems identified could be delayed by an appropriate parameterisation alone. Certainly there is a need to know more about the gross effects of the main parameters of XCS, such as the rate of G.A. application, the generality settings, the use of crossover, and the various deletion mechanisms. Particular factors that are worth early examination are the deletion operators. Kovacs' T3 has not been tested within multi-step environments in the careful manner it was applied within single-step environments. It was not utilised within this research, but if the ability to handle both specific and general classifiers seen within single-step environments carries over, then this mechanism could contribute to a more stable production of [O] in multi-step environments. The implementation of XCS used within this research applied only a limited form of action-set subsumption, and this no doubt contributed to the weak dominance of members of [O] that occurred within some environments. It may be that the application of full action-set subsumption would strengthen the representation of [O]. However, experience with a nearly-equivalent full-population subsumption operator suggests that action-set subsumption may also feed over-general classifiers that are able to gain accuracy by their suppression of the prediction within the action set. The Domination Hypothesis was presented to capture the phenomenon of the take-over of action sets by fully general classifiers. No other work has discussed this phenomenon, and indeed many would argue that XCS should not allow such a situation to arise. There is a need to understand more about the factors that lead to over-generals obtaining sufficient numerosity to suppress the action sets in this manner. It could be hypothesised that their emergence is linked to the subsumption mechanism as much as to the G.A., and there would be a strong case for suggesting that in any application of subsumption the population subset being used should be searched not for the most general classifier (Butz and Wilson, 1999) but instead for the most fit general classifier. This would lend additional support to classifiers that are of high accuracy relative to others in the action set rather than to classifiers that are potentially over-general.
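A sketch of that suggested subsumer-selection rule follows; the Classifier fields and helper predicates are hypothetical stand-ins for the real implementation:

#include <stddef.h>

/* Hypothetical classifier record - only the fields needed here. */
typedef struct {
    unsigned condition;
    double   fitness;
} Classifier;

int could_subsume(const Classifier *s);                         /* experienced and accurate */
int is_more_general(const Classifier *s, const Classifier *c);  /* condition covers c's */

/* Choose a subsumer by fitness among sufficiently-accurate general
 * classifiers, rather than simply taking the most general candidate. */
Classifier *choose_subsumer(Classifier **action_set, int n, const Classifier *c)
{
    Classifier *best = NULL;
    for (int i = 0; i < n; i++) {
        Classifier *s = action_set[i];
        if (s != c && could_subsume(s) && is_more_general(s, c) &&
            (best == NULL || s->fitness > best->fitness))
            best = s;
    }
    return best;
}

Selecting by fitness rather than by maximal generality denies a dominating over-general the extra copies it would otherwise gain at each subsumption event.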


As payment progresses down the action chain towards the early classifiers, the magnitude of the payment to the early classifiers becomes very small relative to the initial payment. This means that changes in payment which are very large relative to the stable payoff received are not as completely reflected in the accuracy of classifiers at this point as would be the case with classifiers later in the chain. Other formulations of temporal discount could be identified that reduce this problem, or an approach adopted that, like the system relative error metric, attempts to consider the accuracy of the classifier in relative magnitude terms. Clearly the investigation of XCS facilities, operation, and parameterisation is a considerable task that may divert attention from larger issues. The environments used within the length tests were carefully controlled, and similar investigations need to be conducted to provide predictive measures that identify the likely effect of environmental exploration complexity on the progress of action-chain learning and maintenance, expanding Lanzi's original results in this area. Similarly, more needs to be known about the relationship between the encoding of input messages and the establishment of [O]. Each of these is a major topic in its own right.
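As an illustration of the relative-magnitude idea (a sketch of one possibility, not a mechanism from the thesis), the absolute accuracy criterion could be scaled by the classifier's own prediction:

$$ \text{absolute: } \varepsilon < \varepsilon_0 \qquad \longrightarrow \qquad \text{relative: } \frac{\varepsilon}{\max(p,\,p_{\min})} < \varepsilon_{r}, $$

so a classifier predicting 30 near the start of a long chain is held to a proportionally tighter absolute tolerance than one predicting 1000 beside the reward.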

9.3.2 Non-Markovian Environments

Lanzi (1998a, 1998b, 1998c; Lanzi and Wilson, 1999) has provided useful insights into the resolution of perceptual aliasing using the memory mechanism. This mechanism does not scale easily, however. Further, simpler solutions, possibly limited in their application to particular forms of aliasing, need to be investigated. The input-window approach was used by early LCS researchers (Robertson and Riolo, 1988) to overcome local perceptual aliasing, but this approach has not been investigated with XCS. Other solutions have been examined within the Reinforcement Learning area, including the use of state-space decomposition using hierarchical approaches. The potential for crossover of knowledge in this area needs to be investigated further.

9.3.3 Action Persistence

The mechanism proposed for action persistence produced a problem in terms of the uneven exploration of sub-optimal states that resulted. It is possible that a biased exploration strategy could alleviate this problem, or that a more explicit record-keeping approach could be adopted. The use of persistent actions was demonstrated within environments that suited the length specification provided. However, the search space and its representation within the population space increase rapidly as more bits are provided for persistence. Rather than providing more bits, an approach could be adopted that specifies both the persistence and the granularity of the persistence. Thus, for example, a user might provide two bits for the specification of persistence, giving a maximum four-step persistent action. If an additional bit was provided to switch between coarse and fine grain persistence, then the same two bits could represent persistence of one to four steps or persistence of 4, 8, 12, or 16 steps. This mechanism needs to be applied to identify whether XCS can learn the [O] that includes the correct granularity changes to [for example] reach a point 14 steps away in two actions.
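A decode of that example encoding might look as follows (a sketch of the worked example above, not code from the thesis):

/* Decode a three-bit persistence field: granularity bit g plus a
 * two-bit duration d, as in the example encoding discussed above. */
int decode_persistence(unsigned g, unsigned d)
{
    return g ? 4 * (int)(d + 1)   /* coarse grain: 4, 8, 12 or 16 steps */
             : (int)(d + 1);      /* fine grain:   1, 2, 3 or 4 steps   */
}

Under this scheme a point 14 steps away is reached in two actions: a coarse action of 12 steps (g = 1, d = 2) followed by a fine action of 2 steps (g = 0, d = 1).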

One other form of temporal system has been applied to LCS work - the Delayed Action Classifier System (DaCS) of Carse (1994). This is a slightly more problematic system to apply within XCS, since the original proposals presume the presence of a message list that can accommodate multiple messages. This is so that when the time comes to apply the delayed action it can compete with other messages posted at that time, and so that other classifiers can post effector messages at the time that the delayed action is being posted. Despite these problems, the DaCS proposals are interesting for their ability to solve non-Markovian states without the need for an arbitrary memory-window. This approach therefore warrants further investigation.

9.3.4 Hierarchy

It was clear from the review of hierarchical work that very little has been done within this area, and that even following up on the potential crossover from hierarchical work within Reinforcement Learning will require considerable effort. However, the main message from this research is that with XCS the effort is likely to be very worthwhile. Although the work presented using the Feudal XCS is both primitive and incomplete, the ability to produce [O] over the space of sub-goals rather than states is a clear indicator of the potential of the area. A first requirement is the expansion of the results obtained using the Feudal XCS. The environments it was applied to were simple and simplistic. In particular, the application of the Feudal XCS to progressively larger 2-D worlds would provide a better indication of the scalability of the approach. These explorations need to investigate not only scalability in terms of the number and size of the lowest-level XCS instances, but also in terms of the order of the tree of intermediate XCS instances. Of course, any use of this form of hierarchy will be limited by the known problems of Feudal Q-Learning, chief amongst which is the inability to use the fine-grain knowledge present in the lower levels to aid planning at the upper levels. To tackle this issue, an examination of a means of applying Dietterich's MAXQ approach (Dietterich, 2000) to XCS is required. However, to do this the problem of internal payments producing aliasing must be tackled. An area of experimental work that it was hoped to achieve within this project was the investigation of what shall be termed 'soft' hierarchical structures. It was hypothesised that the niching mechanism of XCS would allow the identification of population segmentation by the use of 'tags' added to the environmental message through the use of internal state similar to Lanzi's memory mechanism. All classifiers belonging to a particular region of the population space would match the bits within a particular tag. If the tag was associated with a particular sub-goal, then those classifiers that set and reset tags would act at the same level as the classifiers in the upper XCS of the Feudal XCS, but with the additional advantage of being able to compete with other hypotheses within the population at the same time, by also exposing themselves to influence by other tags. An investigation into a basic version of this, initially aimed at triggering successive action chain sequences to solve the Woods-14 environment, is already underway. However, the potential of this 'soft' approach to hierarchy is large, and the crossover with the work of Digney (1996a, 1996b, 1998) in tabular Q-learning is fascinating. Investigation into this form of hierarchy might well represent a step on the road to the 'holy grail' of emergent hierarchy, and these investigations require the stable predictability of LCS implementations such as XCS to achieve these aims.

9.4 Final Word

Wilson (2000a) ends his review of XCS research with the words: "XCS is a new kind of classifier system showing significant promise as a reinforcement learner and accurate generaliser. Its full potential is just beginning to be explored". What is clear now, and what was not clear at the start of this research, is that without the availability of XCS the forms of hierarchical approach both produced and discussed here would not be feasible within the LCS approach. That is not to say that an implementation of a similar structure using one of the strength-based LCS reviewed in chapter 2 could not be built. It is simply to assert that a system reliant upon a traditional LCS approach would be so constrained by the need to support the basic mechanisms of the LCS that the beneficial results would be modest. It has been claimed (Kovacs, 2000b) that there is evidence of a renaissance of interest in the LCS approach. This may be the case, and with XCS there is scope to move LCS research a considerable distance forward. However, lessons must be drawn from the field of A.I. and Expert Systems: initial optimism often quickly becomes pessimism once the limitations of an approach are found. The first hypothesis of this work was the pessimistic hypothesis that within XCS action chains cannot be extended over useful distances. Although the hypothesis did not hold for XCS without generalisation, for action chains of up to at least length 40, it was shown to be correct for the generalisation case. However, two further techniques were introduced that enabled existing action chains to be extended through persistence, and a simple fixed hierarchical approach demonstrated that the reactive building of action chains can be accomplished through the use of hierarchy. These findings are novel but can also be regarded as preliminary - there are many areas of investigation available for further work around each. It is hoped, in terms of the hierarchical work in particular, that this work will lay the foundation for greater developments yet to come.


BIBLIOGRAPHY

Adelard (1996), SpecBox User Manual, Version PC/2.21a, Adelard, London.

Barry, A. M. (1993), The Emergence of High Level Structure in Classifier Systems - A Proposal, in Cowie, R., Mulhern, G. (eds.), Proceedings of Artificial Intelligence and Cognitive Science AICS'93, Irish Journal of Psychology, 4 (3), 480-498.

Barry, A. M. (1996), Hierarchy in Classifier Systems, in Goodman, E. G., Uskov, V. L., Punch, W. F. (eds.), Proceedings of the First International Conference on Evolutionary Algorithms and their Application EVCA'96, The Presidium of the Russian Academy of Sciences, Moscow, 195-211.

Barry, A. M. (1999a), Aliasing in XCS and the Consecutive State Problem: 1 - Problems, in Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., Smith, R. E. (eds.), Proceedings of the First Genetic and Evolutionary Computation Conference (GECCO99), Morgan Kaufmann, San Francisco, CA, 19-26.

Barry, A. M. (1999b), Aliasing in XCS and the Consecutive State Problem: 2 - Solutions, in Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., Smith, R. E. (eds.), Proceedings of the First Genetic and Evolutionary Computation Conference (GECCO99), Morgan Kaufmann, San Francisco, CA, 27-34.

Barry, A. M. (2000), Specifying Action Persistence within XCS, in Whitley, D., Goldberg, D. E., Cantú-Paz, E., Spector, L., Parmee, I., Beyer, H-G. (eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000), 50-57, Morgan Kaufmann.

Barry, A. M. (2001), A Parallel XCS, Technical Report, in preparation.

Barry, A. M., Carse, B. (2000), A Delayed Action XCS, Technical Report, in preparation.

Barry, A. M., Kovacs, T. (1997), Personal communication.

Barry, A. M., Saxon, S. (1998), The XCS Classifier System, Technical Report, Teaching Company Scheme 2391, The Database Group, Bristol.

Bartfai, G. (1994), Hierarchical Clustering with ART Neural Networks, in Proceedings of the IEEE International Conference on Neural Networks, 930-944, IEEE Press.

Bonelli, P., Parodi, A., Sen, S., Wilson, S. W. (1990), NEWBOOLE: A Fast GBML System, in International Conference on Machine Learning, 153-159, Morgan Kaufmann.


Booker, L. B. (1982), Intelligent Behaviour as an Adaptation to the Task Environment, Ph.D. Dissertation (Computer and Communication Sciences), The University of Michigan.

Booker, L. B. (1985), Improving the performance of genetic algorithms in classifier systems, in Grefenstette, J. J. (ed.), Proceedings of the First International Conference on Genetic Algorithms and their Applications (ICGA85), Lawrence Erlbaum Associates, Pittsburgh, PA, July 1985.

Booker, L. B. (1988), Classifier Systems that Learn Internal World Models, Machine Learning, 3, 161-192, Kluwer Academic Publishers.

Booker, L. B. (1989), Triggered Rule Discovery in Classifier Systems, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), 265-274, Morgan Kaufmann.

Booker, L. B. (1991), Representing Attribute-based Concepts in a Classifier System, in Rawlins, G. J. E. (ed.), Proceedings of the First Workshop on the Foundations of Genetic Algorithms (FOGA91), Morgan Kaufmann, San Mateo.

Brooks, R. A. (1986), A Robust Layered Control System for a Mobile Robot, IEEE Journal of Robotics and Automation, RA-2, April 1986, 14-23.

Brooks, R. A. (1990), Elephants don't play chess, in Maes, P. (ed.), Designing Autonomous Agents, Robotics and Autonomous Systems, 6 (1990), 3-15.

Brooks, R. A. (1991), Intelligence without Representation, Artificial Intelligence, 47, 139-159.

Brooks, R. A. (1992), Artificial Life and Real Robots, in Varela, F. J., Bourgine, P. (eds.), Towards a Practice of Autonomous Systems, 3-10, MIT Press.

Brooks, R. A., Flynn, A. M. (1989), Fast, cheap, and out of control, Journal of the British Interplanetary Society, Oct 1989, 478-485.

Bull, L. (1995), Artificial Symbiology: Evolution in Co-operative Multi-agent Environments, PhD Thesis, University of the West of England.

Bull, L., Fogarty, T. C. (1993), Co-evolving Communicating Classifier Systems for Tracking, in Albrecht, R. F., Steele, N. C., Reeves, C. R. (eds.), Proceedings of the International Conference on Artificial Neural Nets and Genetic Algorithms, Springer-Verlag.

Bull, L., Fogarty, T. C. (1994), Evolving Co-operative Communicating Classifier Systems, in Sebald, A., Fogel, L. J. (eds.), Proceedings of the Third Annual Conference on Evolutionary Programming, 308-315.

Bull, L., Fogarty, T. C., Pipe, A. G. (1995), Artificial Endosymbiosis, in Proceedings of the Third European Conference on Artificial Life, 273-289, Springer-Verlag.

Bull, L., Fogarty, T. C., Snaith, M. (1995), Evolution in Multi-agent Systems: Evolving Communicating Classifier Systems for Gait in a Quadrupedal Robot, in Eshelman, L. J. (ed.), Proceedings of the Sixth International Conference on Genetic Algorithms (ICGA95), Morgan Kaufmann, 382-388.


Butz, M. V. (1999), An Implementation of the XCS Classifier System in C, Technical Report 99021, IlliGAL, University of Illinois.

Butz, M. V. (2000), XCSJava 1.0: An Implementation of the XCS Classifier System in Java, Technical Report 2000027, IlliGAL, University of Illinois.

Butz, M. V., Wilson, S. W. (2000), An Algorithmic Description of XCS, Technical Report 2000017, IlliGAL, University of Illinois.

Butz, M. V., Stolzmann, W., Goldberg, D. E. (2000a), Introducing a Genetic Generalisation Pressure to the Anticipatory Classifier System Part 1: Theoretical Approach, in Whitley, D., Goldberg, D. E., Cantú-Paz, E., Spector, L., Parmee, I., Beyer, H-G. (eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000), Morgan Kaufmann.

Butz, M. V., Stolzmann, W., Goldberg, D. E. (2000b), Introducing a Genetic Generalisation Pressure to the Anticipatory Classifier System Part 2: Performance Analysis, in Whitley, D., Goldberg, D. E., Cantú-Paz, E., Spector, L., Parmee, I., Beyer, H-G. (eds.), Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000), Morgan Kaufmann.

Cantú-Paz, E. (1997), A Survey of Parallel Genetic Algorithms, Technical Report 97003, IlliGAL, University of Illinois.

Carse, B. (1994), Learning Anticipatory Behaviour using a Delayed Action Classifier System, in Fogarty, T. C. (ed.), Evolutionary Computing, AISB Workshop Selected Papers, number 865 in Lecture Notes in Computer Science, Springer-Verlag.

Carse, B., Fogarty, T. C. (1994), A delayed-action classifier system for learning in temporal environments, in Proceedings of the First IEEE Conference on Evolutionary Computation (ICEC94), Vol. 2, 670-673.

Chaib-Draa, B., Mandiau, R., Millot, P. (1992), Distributed Artificial Intelligence: An Annotated Bibliography, SIGART Bulletin, 3 (3).

Clark, P., Niblett, T. (1989), The CN2 Induction Algorithm, Machine Learning, 3, 261-283.

Cliff, D., Ross, S. (1994), Adding Temporary Memory to ZCS, Adaptive Behaviour, 3 (2), 101-150.

Coase, R. H. (1988), The Firm, the Market and the Law, University of Chicago Press.

Cobb, H. G., Grefenstette, J. J. (1991), Learning the persistence of actions in reactive control rules, in Proceedings of the 8th International Machine Learning Workshop, 293-297, Morgan Kaufmann.

Coderre, B. (1989), Modelling Behaviour in Petworld, in Langton, C. G. (ed.), Artificial Life, SFI Studies in the Sciences of Complexity, Volume VI, 407-420, Addison-Wesley.


Colombetti, M., Dorigo, M. (1994), Training Agents to Perform Sequential Behaviour, Adaptive Behaviour, 2 (3), MIT Press.

Compiani, M., Montanari, D., Serra, R., Simonini, P. (1989), Asymptotic dynamics of classifier systems, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), George Mason University, June 1989, Morgan Kaufmann.

Compiani, M., Montanari, D., Serra, R. (1990), Learning and Bucket-Brigade Dynamics in Classifier Systems, in Forrest, S. (ed.), Emergent Computation, Proceedings of the 9th Annual International Conference of the Centre for Non-linear Studies on Self-organising, Collective, and Co-operative Phenomena in Natural and Artificial Computing Networks, Special Issue of Physica D (Vol. 42), 1.

Davidor, Y. (1991), A Naturally Occurring Niche and Species Phenomenon: The Model and First Results, in Belew, R. K., Booker, L. B. (eds.), Proceedings of the Fourth International Conference on Genetic Algorithms (ICGA91), 257-263, Morgan Kaufmann.

Davis, L. (1992), Covering and Memory in Classifier Systems, in Proceedings of the First International Workshop on Learning Classifier Systems.

Davis, L., Wilson, S., Orvosh, D. (1992), Temporary Memory for Examples Can Speed Learning in a Simple Adaptive System, in Meyer, J-A., Roitblat, H. L., Wilson, S. W. (eds.), From Animals to Animats 2 - Proceedings of the Second International Conference on the Simulation of Adaptive Behaviour (SAB92), 313-320, MIT Press.

Dayan, P., Hinton, G. E. (1993), Feudal reinforcement learning, in Hanson, S. J., Cowan, J. D., Giles, C. L. (eds.), Advances in Neural Information Processing Systems 5, Morgan Kaufmann.

Deb, K. (1998), Multi-Objective Genetic Algorithms: Problem Difficulties and the Construction of Test Problems, Technical Report CI-49/98, Department of Computer Science/LS 11, University of Dortmund.

De Jong, K. A. (1975), Analysis of the behaviour of a class of genetic adaptive systems, Ph.D. Thesis, Department of Computer and Communication Sciences, University of Michigan, 1975.

De Jong, K. A. (1988), Learning with Genetic Algorithms: An Overview, Machine Learning, 3, 121-132.

De Morgan, A. (1849), Trigonometry and Double Algebra.

Dietterich, T. G. (1997), Hierarchical Reinforcement Learning with the MAXQ Value Function Decomposition, Technical Report, Computer Science Department, Oregon State University.

Dietterich, T. G. (2000), An Overview of MAXQ Reinforcement Learning, Technical Report, Computer Science Department, Oregon State University.


Digney, B. (1996a), Learning and Shaping in Emergent Hierarchical Control Systems, in Space'96 and Robots for Challenging Environments II, June 1996.

Digney, B. (1996b), Emergent Hierarchical Control Structures: Learning Reactive/Hierarchical Relationships in Reinforcement Environments, in Maes, P., Mataric, M. J., Meyer, J-A., Pollack, J., Wilson, S. W. (eds.), Proceedings of the Fourth International Conference on the Simulation of Adaptive Behaviour (SAB96), 373-381, A Bradford Book.

Digney, B. (1998), Learning Hierarchical Control Structure for Multiple Tasks and Changing Environments, in Pfeifer, R., Blumberg, B., Meyer, J-A., Wilson, S. W. (eds.), From Animals to Animats 5 - Proceedings of the Fifth Conference on the Simulation of Adaptive Behaviour (SAB98), A Bradford Book.

Donnart, J-Y. (1998), Cognitive Architecture and Adaptive Properties of a Motivationally Autonomous Animat, PhD Thesis, Université Pierre et Marie Curie, Paris.

Donnart, J-Y., Meyer, J-A. (1994), A Hierarchical Classifier System Implementing a Motivationally Autonomous Animat, in Cliff, D., Husbands, P., Meyer, J-A., Wilson, S. W. (eds.), From Animals to Animats 3 - Proceedings of the Third International Conference on the Simulation of Adaptive Behaviour (SAB94), 144-153, MIT Press/Bradford Books.

Dorigo, M. (1992), Using Transputers to Increase Speed and Flexibility of Genetic-based Machine Learning Systems, Microprocessing and Microprogramming, 34, 147-152.

Dorigo, M. (1993), Genetic and Non-Genetic Operators in ALECSYS, Evolutionary Computation, 1 (2), 151-164, MIT Press.

Dorigo, M. (1995), ALECSYS and the AutonoMouse: Learning to Control a Real Robot by Distributed Classifier Systems, Machine Learning, 19 (3), 209-240.

Dorigo, M., Bersini, H. (1994), A Comparison of Q-Learning and Classifier Systems, in Cliff, D., Husbands, P., Meyer, J-A., Wilson, S. W. (eds.), From Animals to Animats 3 - Proceedings of the Third International Conference on the Simulation of Adaptive Behaviour (SAB94), 248-255, MIT Press/Bradford Books.

Dorigo, M., Colombetti, M. (1994), Robot Shaping: Developing Autonomous Agents through Learning, Artificial Intelligence, 71 (2), 321-370, Elsevier Science.

Dorigo, M., Colombetti, M. (1998), Robot Shaping: An Experiment in Behaviour Engineering, Bradford Books.

Dorigo, M., Schnepf, U. (1993), Genetics-based Machine Learning and Behaviour-Based Robotics: A New Synthesis, IEEE Transactions on Systems, Man, and Cybernetics, 23 (1), 141-154.

Dorigo, M., Sirtori, E. (1991), ALECSYS: A Parallel Laboratory for Learning Classifier Systems, in Belew, R. K., Booker, L. B. (eds.), Proceedings of the Fourth International Conference on Genetic Algorithms (ICGA91), 296-302, Morgan Kaufmann.


Farmer, J. D. (1990), A Rosetta Stone for Connectionism, in Forrest, S. (ed.), Emergent Computation, Proceedings of the 9th Annual International Conference of the Centre for Non-linear Studies on Self-organising, Collective, and Co-operative Phenomena in Natural and Artificial Computing Networks, Special Issue of Physica D (Vol. 42), 1, 153-187.

Flockhart, I. (1995), GA-Miner: Parallel Data Mining and Hierarchical Genetic Algorithms, Technical Report EPCC-AIKMS-GA-MINER-REPORT 1.0, University of Edinburgh.

Fogarty, T. C., Carse, B., Bull, L. (1994), Classifier Systems - Recent Research, AISB Quarterly, 89, 48-54.

Forrest, S. (1985), A Study of Parallelism in the Classifier System and its Application to Classification in KL-ONE Semantic Networks, PhD Thesis, University of Michigan, Ann Arbor, Michigan.

Forrest, S. (1990), Guest Editorial, in Forrest, S. (ed.), Emergent Computation, Proceedings of the 9th Annual International Conference of the Centre for Non-linear Studies on Self-organising, Collective, and Co-operative Phenomena in Natural and Artificial Computing Networks, Special Issue of Physica D (Vol. 42), 1.

Forrest, S., Miller, J. H. (1990), Emergent Behaviour in Classifier Systems, in Forrest, S. (ed.), Emergent Computation, Proceedings of the 9th Annual International Conference of the Centre for Non-linear Studies on Self-organising, Collective, and Co-operative Phenomena in Natural and Artificial Computing Networks, Special Issue of Physica D (Vol. 42), 1, 213-217.

Frey, P. W., Slate, D. J. (1991), Letter Recognition using Holland-style Adaptive Classifiers, Machine Learning, 6, 161-182.

Giani, A., Baiardi, F., Starita, A. (1994), Q-Learning in Evolutionary Rule-Based Systems, in Davidor, Y., Schwefel, H-P. (eds.), Parallel Problem Solving from Nature - PPSN III, Vol. 866, LNCS, 270-289, Springer-Verlag.

Giordana, A., Neri, F. (1995), Search-Intensive Concept Induction, Evolutionary Computation, 3, 375-416.

Goldberg, D. E. (1989), Genetic Algorithms in Search, Optimisation, and Machine Learning, Addison-Wesley, Reading, MA.

Goldberg, D. E., Horn, J., Deb, K. (1992), What makes a problem hard for a Classifier System?, in Collected Abstracts for the First International Workshop on Learning Classifier Systems (IWLCS92), NASA Johnson Space Centre, Houston, Texas.

Greenyer, A. (2000), Coil 2000 Competition: The use of a learning classifier system JXCS, in van der Putten, P., van Someren, M. (eds.), CoIL Challenge 2000: The Insurance Company Case, Technical Report 2000-09, Leiden Institute of Advanced Computer Science, Leiden, June 2000.


Grefenstette, J. J. (1987), Multilevel Credit Assignment in a Genetic Learning System, in Grefenstette, J. J. (ed.), Proceedings of the Second International Conference on Genetic Algorithms (ICGA87), Cambridge, MA, July 1987, Lawrence Erlbaum Associates, 202-207.

Grefenstette, J. J. (1989), A System for Learning Control Strategies with Genetic Algorithms, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), 183-190, George Mason University, June 1989, Morgan Kaufmann.

Grefenstette, J. J. (1992), Learning Decision Strategies with Genetic Algorithms, in Proceedings of the International Workshop on Analogical and Inductive Inference, vol. 642 of LNAI, 35-50, Springer-Verlag.

Grefenstette, J. J., Cobb, H. G. (1991), User's Guide for SAMUEL Version 1.3, Technical Report, NRL Memorandum Report 6820, Naval Research Laboratory.

Grefenstette, J. J., Ramsey, C. L., Schultz, A. C. (1990), Learning Sequential Decision Rules using Simulation Models and Competition, Machine Learning, 5 (4), 355-381.

Harnad, S. (1990), The Symbol Grounding Problem, in Forrest, S. (ed.), Emergent Computation, Proceedings of the 9th Annual International Conference of the Centre for Non-linear Studies on Self-organising, Collective, and Co-operative Phenomena in Natural and Artificial Computing Networks, Special Issue of Physica D (Vol. 42), 1, 335-346.

Hartley (1999), Accuracy-based fitness allows similar performance to humans in static and dynamic classification environments, in Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., Smith, R. E. (eds.), Proceedings of the First Genetic and Evolutionary Computation Conference (GECCO99), Morgan Kaufmann, San Francisco, CA, 266-273.

Harvey, I. (1992a), Species Adaptation Genetic Algorithms: A Basis for a Continuing SAGA, in Varela, F. J., Bourgine, P. (eds.), Proceedings of the First European Conference on Artificial Life - Towards a Practice of Autonomous Systems, 346-354, MIT Press.

Harvey, I. (1992b), Evolutionary Robotics and SAGA: The Case for Hill Crawling and Tournament Selection, in Proceedings of the Second International Conference on Parallel Problem Solving from Nature (PPSN92), 269-278, Elsevier.

Harvey, I. (1994), SAGA Cross: The Mechanics of Recombination for Species with Variable-Length Genotypes, in Langton, C. (ed.), Artificial Life III - Santa Fe Institute Studies in the Sciences of Complexity Proceedings Vol. XVII, 299-326, Addison-Wesley.

Hauskrecht, M., Meuleau, N., Boutilier, C., Kaelbling, L. P., Dean, T. (1998), Hierarchical Solution of Markov Decision Processes using Macro-Actions, in Proceedings of the 14th Annual Conference on Uncertainty in Artificial Intelligence.


Hoffmann, J. (1992), Probleme der Begriffsbildungsforschung: Von S-R Verbindungen zu S-R-K Einheiten, Sprache und Kognition, 11 (4), 223-238.

Holland, J. H. (1968), Hierarchical Descriptions of Universal Spaces and Adaptive Systems, Technical Report, ORA Projects 01252 and 08226, University of Michigan, Ann Arbor, 1968.

Holland, J. H. (1971), Processing and Processors for Schemata, in Jacks, E. L. (ed.), Associative Information Processing, 127-146, Elsevier.

Holland, J. H. (1975), Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor.

Holland, J. H. (1983), Escaping Brittleness, in Proceedings of the Second International Workshop on Machine Learning, 92-95.

Holland, J. H. (1985), Properties of the Bucket-Brigade, in Grefenstette, J. J. (ed.), Proceedings of the First International Conference on Genetic Algorithms and their Applications (ICGA85), Lawrence Erlbaum Associates, Pittsburgh, July 1985.

Holland, J. H. (1986), Escaping Brittleness: The Possibilities of General-purpose Learning Algorithms Applied to Parallel Rule-Based Systems, in Mitchell, T. M., Michalski, R. S., Carbonell, J. G. (eds.), Machine Learning: An Artificial Intelligence Approach, Vol. II, ch. 20, 593-623, Morgan Kaufmann.

Holland, J. H. (1987), Genetic Algorithms and Classifier Systems: Foundations and Future Directions, in Grefenstette, J. J. (ed.), Proceedings of the Second International Conference on Genetic Algorithms and their Applications (ICGA87), Lawrence Erlbaum Associates, Cambridge, MA, July 1987.

Holland, J. H. (1990), Concerning the Emergence of Tag-Mediated Lookahead in Classifier Systems, in Forrest, S. (ed.), Emergent Computation, Proceedings of the 9th Annual International Conference of the Centre for Non-linear Studies on Self-organising, Collective, and Co-operative Phenomena in Natural and Artificial Computing Networks, Special Issue of Physica D (Vol. 42), 1, 188-201.

Holland, J. H., Booker, L. B., Colombetti, M., Dorigo, M., Goldberg, D. E., Forrest, S., Riolo, R. L., Smith, R. E., Lanzi, P. L., Stolzmann, W., Wilson, S. W. (2000), What is a Learning Classifier System?, in Lanzi, P. L., Stolzmann, W., Wilson, S. W. (eds.), Learning Classifier Systems: From Foundations to Applications, 3-32, LNAI 1813, Springer-Verlag.

Holland, J. H., Holyoak, K. J., Nisbett, R. E., Thagard, P. R. (1986), Induction: Processes of Inference, Learning, and Discovery, MIT Press, Cambridge.

Holland, J. H., Reitman, J. S. (1978), Cognitive Systems Based on Adaptive Algorithms, in Waterman, D. A., Hayes-Roth, F. (eds.), Pattern-Directed Inference Systems, Academic Press, New York.

Huber, M., Grupen, R. A. (1997), Learning to Co-ordinate Controllers in Reinforcement Learning on a Control Basis, in Proceedings of the International Joint Conference on Artificial Intelligence.


Humphrys, M. (1996), Action Selection Methods using Reinforcement Learning, in Maes, P., Mataric, M. J., Meyer, J-A., Pollack, J., Wilson, S. W. (eds.), Proceedings of the Fourth International Conference on the Simulation of Adaptive Behaviour (SAB96), 135-144, A Bradford Book.

Iba, G. A. (1989), A Heuristic Approach to the Discovery of Macro-operators, Machine Learning, 3, 285-317.

Iba, H., de Garis, H., Higuchi, T. (1992), Evolutionary Learning of Predatory Behaviours Based on Structured Classifiers, in Meyer, J-A., Roitblat, H. L., Wilson, S. W. (eds.), From Animals to Animats 2 - Proceedings of the Second International Conference on the Simulation of Adaptive Behaviour (SAB92), 356-363, MIT Press.

ISO (1996), Vienna Development Method - Specification Language Part 1: Base Language, International Standard ISO/IEC 13817-1, December 1996.

Kaelbling, L. P. (1993), Hierarchical Learning in Stochastic Domains: Preliminary Results, in Utgoff, P. E. (ed.), Proceedings of the Tenth International Conference on Machine Learning, Morgan Kaufmann.

Kay, A. (1991), Computers, Networks and Education, Scientific American, 265 (3), Sept 1991, 138-148.

Kennedy, I. W. (1994), A-Life and the Agent Network Architecture, Final-Year Undergraduate Dissertation, Faculty of Computer Studies and Mathematics, University of the West of England.

Korf, R. E. (1985), Macro-Operators: A Weak Method for Learning, Artificial Intelligence, 26, 35-77.

Kovacs, T. (1996), Evolving Optimal Populations with XCS Classifier Systems, Master's Thesis, School of Computer Science, University of Birmingham.

Kovacs, T. (1997), XCS Classifier System Reliably Evolves Accurate, Complete, Minimal Representations for Boolean Functions, in Roy, Chawdry, Pant (eds.), Soft Computing in Engineering Design and Manufacturing, Springer-Verlag, 59-68.

Kovacs, T. (1998), Personal communication.

Kovacs, T. (1999a), Deletion Schemes for Classifier Systems, in Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., Smith, R. E. (eds.), Proceedings of the First Genetic and Evolutionary Computation Conference (GECCO99), Morgan Kaufmann, San Francisco, CA, 329-336.

Kovacs, T. (1999b), Strength or Accuracy? A Comparison of Two Approaches to Fitness Calculation in Learning Classifier Systems, in Wu, A. (ed.), Proceedings of the 1999 Genetic and Evolutionary Computation Conference Workshop Program, Orlando, July 1999.

Kovacs, T. (1999c), Weeding Populations of Classifiers, Technical Report, School of Computer Science, University of Birmingham, UK.


Kovacs, T. (2000a), Towards a Theory of Strong Over-general Classifiers, in Proceedings of the 2000 International Workshop on the Foundations of Genetic Algorithms (FOGA2000).

Kovacs, T. (2000b), An Analysis of the Learning Classifier Systems Bibliography, Technical Report, in preparation.

Kovacs, T., Kerber, M. (2000), Some Dimensions of Problem Complexity for XCS, in Proceedings of the GECCO-2000 Graduate Student Workshop, Las Vegas, July 2000.

Koza, J. R. (1992), Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA.

Laird, J. E., Rosenbloom, P. S., Newell, A. (1986), Chunking in SOAR: The Anatomy of a General Learning Mechanism, Machine Learning, 1, 11-46.

Langton, C. G. (1989), Artificial Life, in Langton, C. G. (ed.), Artificial Life, SFI Studies in the Sciences of Complexity, Volume VI, 1-47, Addison-Wesley.

Lanzi, P. L. (1997a), A Study of the Generalisation Capabilities of XCS, in Back, T. (ed.), Proceedings of the 7th International Conference on Genetic Algorithms (ICGA97), 418-425, Morgan Kaufmann, San Francisco, July 1997.

Lanzi, P. L. (1997b), A Model of the Environment to Avoid Local Learning (An Analysis of the Generalisation Mechanism of XCS), Technical Report 97.46, Politecnico di Milano, Department of Electronic Engineering and Information Sciences, 1997.

Lanzi, P. L. (1998a), Adding Memory to XCS, in Proceedings of the IEEE Conference on Evolutionary Computation (ICEC98), IEEE Press.

Lanzi, P. L. (1998b), An Analysis of the Memory Mechanism of XCSM, in Koza, J. R., Banzhaf, W., Chellapilla, K., Deb, K., Dorigo, M., Fogel, D. B., Garzon, M. H., Goldberg, D. E., Iba, H., Riolo, R. L. (eds.), Genetic Programming 1998: Proceedings of the Third Annual Conference, Morgan Kaufmann, San Francisco, CA, 643-651.

Lanzi, P. L. (1998c), Reinforcement Learning by Learning Classifier Systems, PhD Thesis, Politecnico di Milano, 1998.

Lanzi, P. L. (1999a), Extending the Representation of Classifier Conditions Part I: From Binary to Messy Coding, in Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., Smith, R. E. (eds.), Proceedings of the First Genetic and Evolutionary Computation Conference (GECCO99), 337-344, Morgan Kaufmann, San Francisco, CA.

Lanzi, P. L. (1999b), Extending the Representation of Classifier Conditions Part II: From Messy Coding to S-Expressions, in Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., Smith, R. E. (eds.), Proceedings of the First Genetic and Evolutionary Computation Conference (GECCO99), 345-352, Morgan Kaufmann, San Francisco, CA.


Lanzi, P. L., Colombetti, M. (1999), An Extension to the XCS Classifier System for Stochastic Environments, in Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., Smith, R. E. (eds.), Proceedings of the First Genetic and Evolutionary Computation Conference (GECCO99), 353-360, Morgan Kaufmann, San Francisco, CA.

Lanzi, P. L., Riolo, R. L. (2000), A Roadmap to the Last Decade of Learning Classifier System Research, in Lanzi, P. L., Stolzmann, W., Wilson, S. W. (eds.), Learning Classifier Systems: From Foundations to Applications, 33-62, Springer-Verlag.

Lanzi, P. L., Wilson, S. W. (1999), Optimal Classifier System Performance in Non-Markovian Environments, Technical Report 99.36, Dipartimento di Elettronica e Informazione, Politecnico di Milano, 1999.

Lin, L. (1993), Reinforcement Learning for Robots Using Neural Networks, Ph.D. Dissertation, Carnegie Mellon University.

Lin, S-C., Punch, W. F., Goodman, E. D. (1994), Coarse-Grain Parallel Genetic Algorithms: Categorisation and New Approach, Parallel and Distributed Processing, Oct 1994.

Lorenz, K. (1985), Foundations of Ethology, Springer-Verlag.

Maes, P. (1991a), Guest Editorial: Designing Autonomous Agents, in Maes, P. (ed.), Designing Autonomous Agents - Theory and Practice from Biology to Engineering and Back, MIT Press.

Maes, P. (1991b), Situated Agents Can Have Goals, in Maes, P. (ed.), Designing Autonomous Agents - Theory and Practice from Biology to Engineering and Back, MIT Press.

Maes, P. (1991c), The Agent Network Architecture, in Proceedings of the 1991 AAAI Spring Symposium on Integrated Intelligent Architectures, AAAI Press, Stanford.

Maes, P. (1992), Behaviour-Based Artificial Intelligence, in Meyer, J-A., Roitblat, H. L., Wilson, S. W. (eds.), From Animals to Animats 2 - Proceedings of the Second International Conference on the Simulation of Adaptive Behaviour (SAB92), 313-320, MIT Press.

McGovern, A., Sutton, R. S. (1998), Macro-Actions in Reinforcement Learning: An Empirical Analysis, Technical Report 98-70, Computer Science Department, University of Massachusetts, Amherst.

McGovern, A., Sutton, R. S., Fagg, A. H. (1997), Roles of Macro-Actions in Accelerating Reinforcement Learning, in Proceedings of the 1997 Grace Hopper Celebration of Women in Computing, 13-18.

Miller, J. H., Forrest, S. (1989), The Dynamical Behaviour of Classifier Systems, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), 304-310, George Mason University, June 1989, Morgan Kaufmann.


Minsky, M. L. (1967), Computation: Finite and Infinite Machines, Prentice-Hall.

Minsky, M. L. (1975), Frame-System Theory, in Schank, R. C., Webber, B. L. (eds.), Theoretical Issues in Natural Language Processing, MIT Pre-print, 1975.

Minsky, M. (1987), The Society of Mind, Simon & Schuster.

Mitchell, T. M. (1980), The Need for Bias in Learning Generalisations, Technical Report CBM-TR-117, Department of Computer Science, Rutgers University, NJ.

Montanari, D. (1992), Classifier Systems with a Constant-profile Bucket-brigade, in Collected Abstracts for the First International Workshop on Learning Classifier Systems (IWLCS-92), Oct 1992, NASA Johnson Space Centre, Houston, Texas.

Moriarty, D. E., Schultz, A. C., Grefenstette, J. J. (1999), Evolutionary Algorithms for Reinforcement Learning, Journal of Artificial Intelligence Research, 11, 199-229.

Munos, R. (1992), Algorithmes Génétiques et Apprentissage par Renforcement, Rapport DEA Sciences Cognitives, CEMAGREF.

Munos, R., Patinel, J. (1994), Reinforcement Learning with Dynamic Covering of State-Action Space: Partitioning Q-Learning, in Cliff, D., Husbands, P., Meyer, J-A., Wilson, S. W. (eds.), From Animals to Animats 3 - Proceedings of the Third International Conference on the Simulation of Adaptive Behaviour (SAB94), 354-363, A Bradford Book, MIT Press.

Ono, N., Rahmani, A. T. (1993), Self-Organisation of Communication in Distributed Learning Classifier Systems, in Albrecht, R. F., Steele, N. C., Reeves, C. R. (eds.), Proceedings of the International Conference on Artificial Neural Nets and Genetic Algorithms, 361-367, Springer-Verlag.

Oppacher, F., Deugo, D. (1995), The Evolution of Hierarchical Representations, in Proceedings of the Third European Conference on Artificial Life (ECAL95), 302-313, Springer-Verlag.

Parodi, A., Bonelli, P. (1990), The Animat and the Physician, in Meyer, J-A., Wilson, S. W. (eds.), From Animals to Animats 1 - Proceedings of the First International Conference on the Simulation of Adaptive Behaviour (SAB90), 50-57, MIT Press.

Parr, R., Russell, S. (1998), Reinforcement Learning with Hierarchies of Machines, in Advances in Neural Information Processing Systems, 10, 1043-1049, MIT Press.

Patel, M. J., Schnepf, U. (1992), Concept Formation as Emergent Phenomena, in Varela, F. J., Bourgine, P. (eds.), Proceedings of the First European Conference on Artificial Life (ECAL92) - Towards a Practice of Autonomous Systems, 11-20, MIT Press.


Post, E. L. (1943), Formal Reductions of the General Combinatorial Decision Problem, American Journal of Mathematics, 65, 197-215.

Precup, D., Sutton, R. S., Singh, S. (1998), Theoretical Results in Reinforcement Learning with Temporally Abstract Behaviour, in Proceedings of the Tenth European Conference on Machine Learning (ECML98), Springer-Verlag.

Quinlan, J. R. (1986), Induction of Decision Trees, Machine Learning, 1, 81-106.

Reynolds, C. (1987), Flocks, Herds and Schools: A Distributed Behavioural Model, Computer Graphics, 21 (4), 25-34, July 1987.

Rich, E., Knight, K. (1991), Artificial Intelligence, 2nd Ed., McGraw-Hill.

Ring, M. (1994), Two Methods of Hierarchy Learning in Reinforcement Environments, in Meyer, J-A., Roitblat, H. L., Wilson, S. W. (eds.), From Animals to Animats 2 - Proceedings of the Second International Conference on the Simulation of Adaptive Behaviour (SAB92), 148-155, MIT Press.

Riolo, R. L. (1987a), Bucket Brigade Performance: I. Long Sequences of Classifiers, in Grefenstette, J. J. (ed.), Proceedings of the Second International Conference on Genetic Algorithms (ICGA87), Cambridge, MA, July 1987, Lawrence Erlbaum Associates.

Riolo, R. L. (1987b), Bucket Brigade Performance: II. Default Hierarchies, in Grefenstette, J. J. (ed.), Proceedings of the Second International Conference on Genetic Algorithms (ICGA87), Cambridge, MA, July 1987, Lawrence Erlbaum Associates.

Riolo, R. L. (1988a), CFS-C: A Package of Domain-Independent Subroutines for Implementing Classifier Systems in Arbitrary User-Defined Environments, Technical Report, University of Michigan.

Riolo, R. L. (1988b), Empirical Studies of Default Hierarchies and Sequences of Rules in Learning Classifier Systems, PhD Thesis, University of Michigan.

Riolo, R. L. (1989a), The Emergence of Coupled Sequences of Classifiers, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), 256-264, George Mason University, June 1989, Morgan Kaufmann.

Riolo, R. L. (1989b), The Emergence of Default Hierarchies in Learning Classifier Systems, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), 322-327, George Mason University, June 1989, Morgan Kaufmann.

Riolo, R. L. (1990), Lookahead Planning and Latent Learning in Classifier Systems, in Meyer, J-A., Wilson, S. W. (eds.), From Animals to Animats 1 - Proceedings of the First International Conference on the Simulation of Adaptive Behaviour (SAB90), 316-326, MIT Press.

Riolo, R. L. (1991), Modelling Simple Human Category Learning with a Classifier System, in Booker, L. B., Belew, R. K. (eds.), Proceedings of the Fourth International Conference on Genetic Algorithms (ICGA91), 324-333, Morgan Kaufmann.


Robertson, G. G., Riolo, R. L. (1988), A Tale of Two Classifier Systems, Machine Learning, 3, 139-159.

Rosenblatt, K. J., Payton, D. W. (1989), A Fine Grained Alternative to the Subsumption Architecture for Mobile Robot Control, in Proceedings of the IEEE/INNS International Joint Conference on Neural Networks.

Rumelhart, D., McClelland, J. (1986), Parallel Distributed Processing, MIT Press, Cambridge, MA.

Russell, S., Norvig, P. (1995), Artificial Intelligence: A Modern Approach, Prentice-Hall.

Saxon, S., Barry, A. M. (1999a), XCS and the Monk's Problems, in Wu, A. S. (ed.), Proceedings of the 1999 Genetic and Evolutionary Computation Conference Workshop Program, 272-281.

Saxon, S., Barry, A. M. (1999b), LMC Presentation, Teaching Company Scheme 2391, November 1999, Bristol, UK.

Saxon, S., Barry, A. M. (2000), XCS and the Monk's Problems, in Lanzi, P. L., Stolzmann, W., Wilson, S. W. (eds.), Learning Classifier Systems: From Foundations to Applications, vol. 1813 of LNAI, 223-242, Springer-Verlag, Berlin.

Shu, L., Schaeffer, J. (1989), VCS: Variable Classifier System, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), George Mason University, June 1989, Morgan Kaufmann.

Shu, L., Schaeffer, J. (1991), HCS: Adding Hierarchies to Classifier Systems, in Belew, R. K., Booker, L. B. (eds.), Proceedings of the Fourth International Conference on Genetic Algorithms (ICGA91), 339-345, Morgan Kaufmann.

Singh, S. P. (1992a), Transfer of Learning by Composing Solutions of Elemental Sequential Tasks, Machine Learning, 8, 323-339.

Singh, S. P. (1992b), Reinforcement Learning with a Hierarchy of Abstract Models, in Proceedings of the Tenth National Conference on Artificial Intelligence (AAAI-92), AAAI Press/MIT Press.

Smith, R. E. (1991), Default Hierarchy Formation and Memory Exploitation in Learning Classifier Systems, PhD Thesis, University of Alabama, 1991.

Smith, R. E. (1992), A Report on the First International Workshop on Learning Classifier Systems (IWLCS-92), NASA Johnson Space Centre, Houston, Texas, Oct. 1992.

Smith, R. E. (1994), Memory Exploitation in Learning Classifier Systems, Evolutionary Computation, 2 (3), 199-220, MIT Press.


Smith, R. E., Goldberg, D. E. (1991), Variable Default Hierarchy Separation in a Classifier System, in Rawlins, G. J. E. (ed.), Proceedings of the First Workshop on the Foundations of Genetic Algorithms (FOGA91), 148-170, Morgan Kaufmann, San Mateo.

Smith, R. E. (1999), Vocal contribution to a debate on trends in LCS within the Second International Workshop on Learning Classifier Systems, Orlando, 1999.

Smith, S. F. (1980), A Learning System Based on Genetic Algorithms, Ph.D. Thesis, University of Pittsburgh.

Stolzmann, W. (1997), Antizipative Classifier Systems, Ph.D. Thesis, Fachbereich Mathematik/Informatik, University of Osnabrück.

Sumida, B-H., Houston, A. I., McNamara, J. M., Hamilton, W. D. (1990), Genetic Algorithms and Evolution, Journal of Theoretical Biology, 147, 59-84, Academic Press.

Sutton, R. S. (1988), Learning to Predict by the Methods of Temporal Differences, Machine Learning, 3, 9-44.

Sutton, R. S., Barto, A. G. (1998), Reinforcement Learning: An Introduction, MIT Press.

Thrun, S., Schwartz, A. (1995), Finding Structure in Reinforcement Learning, in Tesauro, G., Touretzky, D., Leen, T. (eds.), Advances in Neural Information Processing Systems, 7.

Tinbergen, N. (1966), The Study of Instinct, Oxford University Press.

Toates, F. (1994), What is Cognitive and what is not Cognitive, in Cliff, D., Husbands, P., Meyer, J-A., Wilson, S. W. (eds.), From Animals to Animats 3 - Proceedings of the Third International Conference on the Simulation of Adaptive Behaviour (SAB94), 102-107, MIT Press.

Tolman, E. C. (1932), Purposive Behaviour in Animals and Man, Appleton, New York.

Tomlinson, A. (1999), Corporate Classifier Systems, PhD Thesis, University of the West of England, 1999.

Tomlinson, A., Bull, L. (1998), A Corporate Classifier System, in Eiben, A. E., Back, T., Schoenauer, M., Schwefel, H-P. (eds.), Parallel Problem Solving from Nature - PPSN V, LNCS 1498, 550-559, Springer-Verlag.

Tomlinson, A., Bull, L. (1999a), On Corporate Classifier Systems: Increasing the Benefits from Rule Linkage, in Banzhaf, W., Daida, J., Eiben, A. E., Garzon, M. H., Honavar, V., Jakiela, M., Smith, R. E. (eds.), Proceedings of the First Genetic and Evolutionary Computation Conference (GECCO99), 649-656, Morgan Kaufmann.

Tomlinson, A., Bull, L. (1999b), A Zeroth Level Corporate Classifier System, in Proceedings of the Second International Workshop on Learning Classifier Systems, in Wu, A. (ed.), Proceedings of the 1999 Genetic and Evolutionary Computation Conference Workshop Programme, 306-315.


Tomlinson, A., Bull, L. (1999c), A Corporate XCS, in Proceedings of the Second International Workshop on Learning Classifier Systems, in Wu, A. (ed.), Proceedings of the 1999 Genetic and Evolutionary Computation Conference Workshop Programme, 298-305.

Travers, M. (1989), Animal Construction Kits, in Langton, C. (ed.), Artificial Life, Addison-Wesley.

Tsotsos, J. K. (1995), Behaviourist Intelligence and the Scaling Problem, Artificial Intelligence, 75, 135-160, Elsevier Science.

Tyrrell, T. (1992), The Use of Hierarchies for Action Selection, Ph.D. Thesis, University of Edinburgh.

Valenzuela-Rendón, M. (1989), Boolean Analysis of Classifier Sets, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), 351-358, Morgan Kaufmann.

van Heerden, P. J. (1968), The Foundation of Empirical Knowledge, Wassenaar, Holland.

Watkins, C. (1989), Learning from Delayed Rewards, PhD Thesis, King's College, London.

Westerdale, T. H. (1985), The Bucket-Brigade is not Genetic, in Grefenstette, J. J. (ed.), Proceedings of the First International Conference on Genetic Algorithms and their Applications (ICGA85), 45-59, Lawrence Erlbaum Associates, Pittsburgh, July 1985.

Westerdale, T. H. (1987), Altruism in the Bucket-Brigade, in Grefenstette, J. J. (ed.), Proceedings of the Second International Conference on Genetic Algorithms (ICGA87), 22-26, Lawrence Erlbaum Associates, Cambridge, MA, July 1987.

Westerdale, T. H. (1989), A Defence of the Bucket-Brigade, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), 282-290, George Mason University, June 1989, Morgan Kaufmann.

Wiering, M., Schmidhuber, J. (1996), HQ-Learning: Discovering Markovian Sub-Goals for Non-Markovian Reinforcement Learning, Technical Report IDSIA-95-96, IDSIA, Switzerland.

Wilcox, J. R. (1995), Organisational Learning within a Learning Classifier System, Masters Thesis, University of Illinois, 1995.

Wilson, S. W. (1985), Knowledge Growth in an Artificial Animal, in Grefenstette, J. J. (ed.), Proceedings of the First International Conference on Genetic Algorithms and their Applications (ICGA85), Lawrence Erlbaum Associates, Pittsburgh, July 1985.

Wilson, S. W. (1986a), Classifier Systems and the Animat Problem, Machine Learning, 2, 199-228, Kluwer Academic.


Wilson, S. W. (1986b), Classifier System Learning of a Boolean Function, Technical Report RIS 27r, Rowland Institute for Science.

Wilson, S. W. (1988), Bid Competition and Specificity Reconsidered, Complex Systems, 2 (6), 705-723.

Wilson, S. W. (1989), Hierarchical Credit Allocation in a Classifier System, in Davis, L. (ed.), Genetic Algorithms and Simulated Annealing, Research Notes in Artificial Intelligence, 104-115, Pitman Publishing, London.

Wilson, S. W. (1991), The Animat Path to AI, in Meyer, J-A., Wilson, S. W. (eds.), From Animals to Animats 1 - Proceedings of the First International Conference on the Simulation of Adaptive Behaviour (SAB90), 15-21, MIT Press/Bradford Books.

Wilson, S. W. (1994), ZCS: A Zeroth Level Classifier System, Evolutionary Computation, 2 (1), 1-18.

Wilson, S. W. (1995), Classifier Fitness Based on Accuracy, Evolutionary Computation, 3 (2), 149-175.

Wilson, S. W. (1996), Generalisation in the XCS Classifier System, unpublished contribution to the ICML'96 Workshop on Evolutionary Computing and Machine Learning, http://prediction-dynamics.com/.

Wilson, S. W. (1997), Explore/Exploit Strategies in Autonomy, in Maes, P., Mataric, M. J., Meyer, J-A., Pollack, J., Wilson, S. W. (eds.), From Animals to Animats 4 - Proceedings of the Fourth International Conference on the Simulation of Adaptive Behaviour (SAB96), 325-332, A Bradford Book, MIT Press.

Wilson, S. W. (1998a), Generalisation in the XCS Classifier System, in Koza, J. R., Banzhaf, W., Chellapilla, K., Deb, K., Dorigo, M., Fogel, D. B., Garzon, M. H., Goldberg, D. E., Iba, H., Riolo, R. (eds.), Genetic Programming 1998: Proceedings of the Third Annual Genetic Programming Conference (GP98), Morgan Kaufmann, San Francisco, CA.

Wilson, S. W. (1998b), Personal communication.

Wilson, S. W. (2000a), Get Real! XCS with Continuous-Valued Inputs, in Lanzi, P. L., Stolzmann, W., Wilson, S. W. (eds.), Learning Classifier Systems: From Foundations to Applications, vol. 1813 of LNAI, 209-220, Springer-Verlag, Berlin.

Wilson, S. W. (2000b), State of XCS Classifier System Research, in Lanzi, P. L., Stolzmann, W., Wilson, S. W. (eds.), Learning Classifier Systems: From Foundations to Applications, vol. 1813 of LNAI, 63-82, Springer-Verlag, Berlin.

Wilson, S. W. (2000c), Mining Oblique Data with XCS, Technical Report, IlliGAL Report 2000028, University of Illinois at Urbana-Champaign.

Wilson, S. W., Goldberg, D. E. (1989), A Critical Review of Classifier Systems, in Schaffer, J. D. (ed.), Proceedings of the Third International Conference on Genetic Algorithms (ICGA89), 244-255, Morgan Kaufmann.


Yates, D. F., Fairley, A. (1990), An Investigation into Possible Causes of and Solutions to Rule Strength Distortion Due to the Bucket-Brigade, in Forrest, S. (ed.), Emergent Computation, Proceedings of the 9th Annual International Conference of the Centre for Non-linear Studies on Self-organising, Collective, and Co-operative Phenomena in Natural and Artificial Computing Networks, Special Issue of Physica D (Vol. 42), 1, 246-253.

Zhou, H. H. (1990), CSM: A Computational Model of Cumulative Learning, Machine Learning, 5 (4), 383-406, Kluwer Academic Publishers.
