Robotics and Autonomous Systems 54 (2006) 864–872 www.elsevier.com/locate/robot

Experiments in learning distributed control for a hexapod robot

T.D. Barfoot, E.J.P. Earon, G.M.T. D’Eleuterio ∗

Institute for Aerospace Studies, University of Toronto, 4925 Dufferin Street, Toronto, Ontario, Canada M3H 5T6

Received 29 June 2004; received in revised form 24 April 2006; accepted 24 April 2006. Available online 28 August 2006.

Abstract

This paper reports on experiments involving a hexapod robot. Motivated by neurobiological evidence that control in real hexapod insects is distributed leg-wise, we investigated two approaches to learning distributed controllers: genetic algorithms and reinforcement learning. In the case of reinforcement learning, a new learning algorithm was developed to encourage cooperation between legs. Results from both approaches are presented and compared.

© 2006 Published by Elsevier B.V.

Keywords: Hexapod robot; Distributed control; Reinforcement learning; Genetic algorithm; Coevolution

1. Introduction

Insects and spiders are examples of relatively simple creatures from nature which are able to successfully operate many legs at once in order to navigate a diversity of terrains. Inspired by these biological marvels, robotics researchers have attempted to mimic insect-like behaviour in legged robots [6,10]. As Schroer et al. [28] and Allen et al. [1] make clear in their work inspired by the ambulation of cockroaches, it is productive to try to abstract biological principles which can then motivate the engineering of a walking robot. Typically, however, the control algorithms for these types of robots are quite complicated (e.g., dynamic neural networks), requiring fairly heavy on-line computations to be performed in real time. Here a simpler approach is presented which reduces the need for such computations by using a coarsely coded control scheme. This representation is suitable for our primary focus, which is to learn distributed controllers from feedback. This is an important problem in, for example, planetary exploration, where the nature of the terrain may not be known a priori and thus controllers need to be learned in situ. Work has been done to increase the feasibility of using simulation to develop evolutionary control algorithms for

∗ Corresponding author.

E-mail addresses: [email protected] (T.D. Barfoot), [email protected] (E.J.P. Earon), [email protected] (G.M.T. D’Eleuterio).

© 2006 Published by Elsevier B.V. 0921-8890/$ - see front matter. doi:10.1016/j.robot.2006.04.009

mobile robots [15,20,21,40]. It has been suggested that hardware experimentation is often too time-consuming and that software simulation can instead be used to develop control strategies. It has been argued, however, that the primary drawback to software simulation is the difficulty of modeling hardware interactions. While adequate behaviours can be evolved [17,18], and while Jacobi [21] has developed “minimal simulations” in order to derive such controllers, it has been found that there are still hardware concerns that are difficult to predict using minimal simulations (e.g., damaging current spikes in the leg drive motors) [19]. Gomi and Ide [19] use a genetic algorithm to develop walking controllers for an octopod robot. However, rather than reducing the search space by preselecting a reduced set of behaviours, they very precisely shaped [13] the fitness function for reinforcement. One benefit of this is that such fitness-function shaping is much more easily scaled up to allow for more complicated behaviour combinations. What is important to note, though, is that while Gomi and Ide [19] used measurements of the electric current supplied to the leg drive motors as one feature of their fitness function, as opposed to using only the observable success of the controller (i.e., the distance walked as measured with an odometer) as is done in the present work, the results have similar attributes. For example, behaviours that could cause a high current flow in the leg motors, such as having two or more legs in direct opposition, are also behaviours that would exhibit low fitness values when measuring the forward distance travelled. Thus


Fig. 1. (Left) Behavioural coupling between legs in stick insects [11]. (Right) Kafka, a hexapod robot, and treadmill setup designed to evolve walking gaits.

when the controllers are evolved on hardware, such parameters are often taken into account by the robot itself, and modeling is not necessarily required. As with dynamic neural network approaches, each leg of the robot is given its own controller rather than using a central pattern generator. Communication between controllers is local (legs “talk” to their neighbours), resulting in a global behaviour (walking gait) for the robot. There is some neurobiological evidence that this notion of local control holds for some biological insects [11]. This paradigm of interacting simple parts has also been popular in robotics [8,9]. Støy et al. [30] have studied how multiple identical units can be connected into a larger robotic structure while having their roles automatically defined through a coordination mechanism. Although the focus of their work is chain-type self-reconfigurable robots, the role-based control approach has been used to implement a particular type of specified gait. In our work, the controllers for each leg are locally coupled, but they are not preprogrammed to produce a particular gait. Wei et al. [36] have addressed the problem of tailoring the robot’s gait to its environment. Their robot is designed to transition passively and mechanically from a tripod gait to an in-phase gait for climbing. While our method is also aimed at tailoring the gait to a robot’s environment, we rely on feedback to make active modifications to the gait. The approach of Wei et al. is likely better suited to high-speed locomotion, whereas the present work was motivated by such tasks as robotic planetary exploration. We furthermore see our biologically motivated control approach as complementing the nonadaptive control schemes in the biologically inspired work of Schroer et al. [28] and Allen et al. [1] on walking robots. The controllers used here are cellular automata (simple lookup tables). Each leg is given its own cellular automaton (CA) which can be in one of a finite number of states [34]. Based on its state and those of its neighbours, a new state is chosen according to the lookup table. Each state corresponds to a fixed basis behaviour but can roughly be thought of as a

specific leg position. Thus, new leg positions are determined by the old leg positions. Typically, all legs are updated synchronously such that the robot’s behaviour is represented by a set of locally coupled difference equations, which are much quicker to compute in real time than differential equations, making this approach suitable for learning. Under this framework, the design of a control algorithm is reduced to coming up with the cellular automata which produce successful walking gaits. We look at two approaches to do this. First, we look at genetic algorithms [4,14,16,31] in simulation and on real hardware. Second, we develop a distributed reinforcement learning approach [3,5,31] in an attempt to speed up the learning process. The paper concludes with discussions and recommendations.

2. Controller representation

According to neurobiological evidence [11], the behaviour of legs in stick insects is locally coupled as in Fig. 1. This pattern of ipsilateral and contralateral connections will be adopted for the purposes of discussion, although any pattern could be used in general (however, only some of them would be capable of producing viable walking gaits). We assume that the output of each leg controller may be discrete. This may be done by way of a set of basis behaviours [23]. Rather than specify the actuator positions (or velocities) for all times, we assume that we may select a simple behaviour from a finite predefined palette. This may be considered a postprocessing step which takes a discretized output and converts it to the actuator control. This postprocessing step will not be allowed to change once set. The actual construction of the postprocessing requires careful consideration but is also somewhat arbitrary. Here the basis behaviours will be modules which move the leg from its current zone (in output space) to one of a finite number of other zones. Fig. 2 shows two possible discretizations of a 2-degree-of-freedom output space (corresponding to a simple leg) into 4 or 3 zones. It is important to distinguish between a basis behaviour and a leg’s current


zone. If legs always arrive where they are asked to go, there is little difference. However, in a dynamic robot moving on rough terrain, the requested leg action may not always be successful. The key difference is that the output state of a leg will correspond to the current basis behaviour being activated, rather than the leg’s current zone in output space. The input state of a leg could be either the basis behaviour or the leg zone, depending on whether one wants to try to account for unsuccessful requests. By using basis behaviours, the leg controllers may be entirely discrete. Once all the postprocessing has been set up, the challenge remains to find an appropriate arbitration scheme which takes in a discrete input state, s (the current leg positions or basis behaviours of a leg and its neighbours), and outputs the appropriate discrete action, a (one of M basis behaviours), for each leg. There are several candidates for this role, but the one affording the most general decision surfaces between input and output is a straightforward lookup table, similar to cellular automata (CA). Das et al. [12] showed that genetic algorithms were able to evolve cellular automata which performed prescribed tasks requiring global coordination. This is essentially what we wish to achieve, but we have the added difficulty of dealing with the physical environment of our robot. This type of lookup-table control in autonomous robots is often called reactive. For every possible input sequence, the CA scheme stores a discrete output value. In other words, for every possible input sequence there is an output corresponding to one of the basis behaviours. At each time-step, the leg controller looks up the action which corresponds to its current input sequence and carries it out. The size of the lookup table for a leg which communicates with k − 1 other legs will then be M^k. The approach is therefore usually rendered feasible only for modest values of k and M. The number of all possible lookup tables is M^(M^k). Again, modest numbers of basis behaviours keep the size of the search space reasonable. For example, for a hexapod robot with coupling as in Fig. 1 and output discretization as in Fig. 2 (left), the forward and rear legs will require lookup tables of size 4^3 and the central legs of size 4^4. If we assume left–right pairs of legs have identical controllers, the combined size of the lookup tables for the forward, center, and rear legs will be 4^3 + 4^3 + 4^4 = 384, and the number of possible table combinations for the entire robot will be 4^384. From this point on, the term CA lookup table will refer to the combined set of tables for all legs in the robot (i.e., the concatenation of the individual leg lookup tables).

Fig. 2. Example discretizations of output space for 2-degree-of-freedom legs into (Left) 4 zones and (Right) 3 zones.

3. Evolutionary search

The crucial step in this approach is the discovery of particular CA lookup tables which cause the connected legs to produce successful walking gaits. We must discover the local rules which produce the desired global behaviour, if they indeed exist, and then find out why those rules stand out. The obvious first method to attempt is to design the local rules by hand. This turns out to be very difficult for all but the most trivial examples, as we are typically faced with a combinatorial explosion. Another method is to employ an evolutionary global optimization technique, namely a genetic algorithm (GA), to search for good cellular automata [2]. GAs are based on biological evolution [16]. A random initial population of P CA lookup tables is evolved over G generations. Each CA lookup table, φ, has a chromosome which consists of the sequence of all the discrete values taken from the table. At each generation, a fitness is assigned to each CA lookup table (based on how well the robot performs on some walking task when equipped with that CA lookup table). A CA lookup table’s fitness determines its representation in the next generation. Genetic crossovers and mutations introduce new CA lookup tables into the population. The best K ≤ P CA lookup tables are copied exactly from one generation to the next. The remaining (P − K) CA lookup tables are made up by single-site crossovers in which both parents are taken from the best K individuals. Furthermore, these offspring are subjected to random site mutations with probability pm per site, with the mutated value drawn from a uniform random distribution.

4. Example controller

In this section, the results of constructing an example controller in simulation will be presented. The purpose of the simulation is not to model an insect robot, but rather to demonstrate that a network of cellular automata controllers could produce patterns which resemble known walking gaits.
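Before turning to the gait results, the representation being searched can be made concrete. Below is a minimal sketch, in Python, of one synchronous update of the six leg CAs (this is not the authors' code; the leg indexing, neighbour lists, and state encoding are illustrative assumptions chosen to match the table sizes quoted in Section 2):

```python
# Sketch of the distributed CA controller: each leg stores a lookup table
# mapping the tuple (own state, neighbour states) to its next state.
# Leg order and neighbour lists below are illustrative assumptions.
import random
from itertools import product

M = 4  # states per leg (Fig. 2, left)

# Legs: 0=FL, 1=FR, 2=CL, 3=CR, 4=RL, 5=RR (hypothetical indexing).
NEIGHBOURS = {
    0: [2, 1], 1: [3, 0],        # forward legs: 2 neighbours -> table size M^3
    2: [0, 4, 3], 3: [1, 5, 2],  # central legs: 3 neighbours -> table size M^4
    4: [2, 5], 5: [3, 4],        # rear legs: 2 neighbours -> table size M^3
}

def random_table(k, rng):
    """Lookup table over all M^k input tuples (k = self + neighbours)."""
    return {key: rng.randrange(M) for key in product(range(M), repeat=k)}

def step(states, tables):
    """Synchronous update: all legs read the old states, then switch together."""
    return [tables[i][(states[i], *(states[j] for j in NEIGHBOURS[i]))]
            for i in range(6)]

rng = random.Random(0)
tables = {i: random_table(1 + len(NEIGHBOURS[i]), rng) for i in range(6)}
states = [0, 1, 2, 3, 0, 1]  # one of the 4^6 possible leg configurations
states = step(states, tables)
```

With this encoding the forward/rear tables have 4^3 entries and the central tables 4^4, matching the combined chromosome length of 384 when left–right pairs share tables.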
Furthermore, we would like to show that a genetic algorithm is indeed capable of finding particular CA lookup tables which perform well on a user-defined task. One well-established gait for hexapod robots is the “tripod” gait. The legs are divided into two sets:

{Forward Left, Center Right, Rear Left}, {Forward Right, Center Left, Rear Right}.

While one set is on the ground, the other is lifted, swung forward, and then lowered. For the discretization of Fig. 2 (left), there are I = 4^6 = 4096 possible initial states for a hexapod robot. The fitness of a particular CA lookup table will be defined as

f_total = (1/I) sum_{i=1}^{I} f_i,    (1)

f_i = 1 if the tripod gait emerges within T time-steps starting from initial condition i, and f_i = 0 otherwise.    (2)
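The elitist GA of Section 3 can be sketched as follows. This is a population-mechanics sketch only, not the authors' implementation: the stand-in objective below replaces the tripod-gait fitness of Eqs. (1) and (2), and all names are illustrative.

```python
# Sketch of the elitist GA: keep the best K tables, fill the rest with
# single-site crossovers of elite parents, then mutate each site with
# probability pm. The fitness here is a stand-in for Eqs. (1)-(2).
import random

M = 4           # values per chromosome site (basis behaviours)
L = 384         # chromosome length (concatenated leg tables, Section 2)
P, G, K, pm = 50, 30, 15, 0.005

rng = random.Random(1)

def fitness(chrom):
    # Stand-in objective: fraction of sites matching a fixed target table.
    return sum(c == 0 for c in chrom) / len(chrom)

def crossover(a, b):
    # Single-site crossover: one cut point, prefix from a, suffix from b.
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(chrom):
    # Each site is resampled uniformly with probability pm.
    return [rng.randrange(M) if rng.random() < pm else c for c in chrom]

pop = [[rng.randrange(M) for _ in range(L)] for _ in range(P)]
for g in range(G):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:K]                      # copied exactly to the next generation
    children = [mutate(crossover(rng.choice(elite), rng.choice(elite)))
                for _ in range(P - K)]
    pop = elite + children

best = max(pop, key=fitness)
```

On the robot, evaluating `fitness` is the expensive step (a rollout of the CA from many initial conditions, or a treadmill trial), which is why the population size and generation count are kept modest.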


Fig. 3. (Left) Gait diagram (time history) of φtripod on a particular initial condition. Each 6-state column represents an entire leg configuration of the robot; the left-most column is the initial condition. (Right) Attractor basin portrait of φtripod. Each node represents an entire 6-state leg configuration of the robot. Lines represent transitions from one leg configuration to another. The inner hexagon represents the ABDDCC . . . tripod gait.

This fitness function was used to evolve a particular CA lookup table named φtripod, which has a fitness of 0.984. That is, from 98.4% of all possible initial conditions, a tripod gait is reached. In our experiments, we used a GA population size of P = 50, number of generations G = 150, keep-size K = 15, and mutation probability pm = 0.005. The number of initial conditions per fitness evaluation was I = 4096 (all possible initial conditions) and the number of time-steps before testing for a tripod pattern was T = 50. Fig. 3 depicts two aspects of φtripod. The left side shows a typical gait diagram or time history of φtripod on a particular initial condition. Each column shows the states of the 6 legs at a given time-step (different shades of grey represent different leg states). The left-most column is the initial condition. The right side is an attractor basin portrait of φtripod as described in [39]. Each node in this plot represents an entire 6-state leg configuration of the robot (i.e., one column of the left plot). The inner hexagon represents the tripod gait, which is unidirectional (indicated by a clockwise arrow). The purpose of the right plot is to draw a picture of φtripod as a whole and to make a connection with the concept of stability. Beginning from any node on the attractor basin portrait and following the transitions inward (one per time-step), one will always wind up on the inner hexagon (tripod gait). Note that only 98.4% of the 4096 possible leg configurations are contained within the attractor basin. The remaining 1.6% of leg configurations are made up of left–right symmetrical leg configurations which can never lead to a non-symmetrical configuration, since identical controllers have been used for left–right pairs of legs. For a real robot this could be a problem, but we will not go into the matter too deeply.
Possible solutions are to use non-identical controllers for left–right pairs of legs, to occasionally mutate individual leg states randomly (thus

breaking the symmetry), or to use asynchronous updating of legs. The conclusion we may draw from this simple exercise is that this type of distributed controller certainly is able to produce patterns resembling known walking gaits and that genetic algorithms are capable of finding them. The question we would now like to answer is whether a robot could actually learn such a controller from very simple feedback.

5. Evolutionary experiments

In this section we describe the results of coevolving distributed controllers on hardware. Fig. 1 shows Kafka [24], a hexapod robot, mounted on a treadmill. Kafka’s legs are powered by twelve JR Servo NES 4721 hobby servomotors controlled by a 386 (66 MHz) PC, which was more than sufficient for the purposes of this experiment. The control signals to the servos are absolute positions to which the servos then move as fast as possible. In order to control the speed of the leg motions, the path of the leg is broken down into many segments and the legs are commanded through each of these. The fewer the intervening points, the faster the legs move. The robot was mounted on an unmotorized treadmill in order to automatically measure controller performance (for walking in a straight line only). As Kafka walks, the belt on the treadmill causes the rollers to rotate. An odometer reading from the rear roller is fed to Kafka’s computer such that distance versus time-step plots may be used to determine the performance of the controller. The odometer measures net distance travelled. For this experiment, the states for Kafka’s legs were constrained to move only in a clockwise, forward motion, which


Fig. 4. Convergence History of a GA Run. (Left) Best and average fitness plots over the evolution. (Right) Fitness of entire population over the evolution (individuals ordered by fitness). Note there was one data point discounted as an odometer sensor anomaly. In the best individual convergence history it was replaced by the next best individual (as indicated on left graph).

Fig. 5. Gait diagrams (time histories) for the four solutions, φone , φtwo , φthree , φfour , respectively.

is based on the three-state model shown in Fig. 2 (right) (i.e., rather than having each leg move from one state to an arbitrary next state, it is constrained either to increment through the sequence of states in the clockwise motion or not to change states). Each entry in the lookup table determines whether to increment the position of the leg or keep it stationary (binary output). This reduces the number of possible lookup tables from 3^135 (2.58 × 10^64) for the full three-state model to 2^135 (or 4.36 × 10^40). The GA parameters were

P = 50,  K = 15,  pm = 0.05.    (3)

(Some tuning of the parameters was done through the course of experimentation, but these values are typical of GA parameters for a problem of this size.) Each individual in the population was given the same initial condition, or stance, at the start of its evaluation ({C, B, C, B, C, B}). Owing to the time required to evolve the controllers on Kafka, 23 generations of the GA were run (the evolution of the controllers on Kafka took approximately eight hours to complete). The convergence history for one evolution (population best fitness and population average fitness vs. generation) is shown in Fig. 4. Fig. 5 shows the

gait diagrams for four of the best individuals from various generations (φone, φtwo, φthree, φfour, from generations 3, 5, 13, and 22 respectively, with φfour being the best gait generated during the GA run). All four gaits performed fairly well, walking in a consistently forward direction during their evaluations. φone corresponds to the first sharp increase in fitness from previous generations. The improvement between φone and φtwo is due to the shorter period of the step cycle, and the gait pattern itself is simpler. This trend continues in φthree and φfour, with a much reduced step-cycle period and a more consistent and simplified gait pattern. The fitness improvements corresponding to these gaits are indeed quite high compared to the earlier two individuals. Fig. 4 shows the fitness achieved by each individual throughout the entire GA run. The rather high fitness value reported in generation 3 (Fig. 4) is the result of an anomalous sensor output. This was confirmed when, in the next generation, that individual achieved an extremely low fitness (−17). The rest of the plot shows quite clearly the trend of increasing fitness for the kept segment of the population as well as the variance in fitness reported for different individuals. Owing to the difficulty of running large populations on a hardware apparatus such as Kafka, the population size is somewhat low. However, as can


be seen in Fig. 4 by this fitness variance, genetic diversity is maintained. As mentioned earlier, the number of generations over which the GA was allowed to evolve was rather limited. This can be seen in Fig. 4 in the fact that there is still a rather high deviation for the best individuals during the latter generations. The average fitness would not necessarily converge to the values reported for the best individuals, even if those gaits had saturated the population. This is because the mutation rate of pm = 0.05 is sufficiently high to keep many of the child individuals in each generation from being simply copies of the parents. By the end of the second generation, a tripod-like gait (named φgood) had emerged which showed very strong fitness relative to the other individuals. Throughout the evolutionary history, this gait and its derivatives dominated. For the first few generations after the appearance of the tripod, a similar gait (named φsloppy) appeared which was able to compete in terms of fitness. As generations went by, φgood was replaced by a better tripod (named φbetter) which began to dominate the population. The improvement in the tripod is explained by the number of time-steps required to achieve the gait. (A time-step corresponds approximately to 1 s.) From the initial condition, it took several time-steps to reach the tripod. As the time duration of each trial was limited (T = 30 time-steps), the sooner an individual was able to achieve the gait, the more successful it was. By generation 7 the improvements in the best individual appeared to converge. Population average fitness increases rapidly after φgood is found (but is delayed by a few generations due to the finite time required for φgood and its derivatives to spread). Shortly after φbetter is found, the population average fitness also appeared to converge (no large improvements were seen). It should be pointed out that fitness does not monotonically increase with generation. There is fluctuation in the fitness of the individuals after convergence. This is due to the hardware aspect of the experiment. Even the best K individuals were reevaluated each generation, and as the experiment progressed, the rubber foot pads used by Kafka simply wore out. This loss of footing corresponded to a variable loss in traction, and hence fitness, for each gait. The use of a state table with binary output greatly reduced the search space for this experiment. The next stage for this type of work involves using a full state lookup table (3 or more output possibilities). The expected result is the same as for the reduced version; however, the time to develop successful gaits would be increased, and the convergence history would likely be much more gradual (likely making it impractical). For example, if there were 9 output zones for each leg, the chromosome would have length 9^3 + 9^3 + 9^4 = 8019 and the search space would be of size 9^8019. Thus, our approach may not scale very well to increases in model complexity (because of the long training times required). In an attempt to reduce the time needed to learn distributed controllers, we also investigated the use of reinforcement learning techniques, which provide much more frequent reinforcement information.


6. Reinforcement learning

Reinforcement learning was originally derived from dynamic programming. We assume that the problem we are trying to solve may be modeled as a Markov decision process (Ω_S, Ω_A, δ, r). Here Ω_S is the state space, Ω_A is the action space, δ(s, a) is the (probabilistic) state transition function, where s ∈ Ω_S and a ∈ Ω_A, and r(s, a) is the reward function. We will be attempting to maximize the discounted cumulative reward function given as follows:

R(t) = r(s_0, a_0) + γ r(s_1, a_1) + γ^2 r(s_2, a_2) + · · ·

where s_i = s(t + i), a_i = a(t + i), and γ ∈ [0, 1) is the discount constant. The Markov decision process and the discounted cumulative reward function together form a Markov decision problem (MDP), which Bellman [7] showed to have an optimal solution. We will be using a variant of reinforcement learning [5], Q-learning [35], which allows one to learn an optimal decision-making policy

π*(s) = arg max_a Q(s, a)

to the MDP. It has been shown to converge under certain conditions but also to work reasonably well when these conditions are only weakly met [23,27], as is often the case in real robotics applications. Here s is the state and a is a basis behaviour. The state–action matrix, Q(s, a), is a grid of size |s| = M^k by |a| = M. The relation between the cellular automata maps discussed above, φ(s), and the reinforcement learning policy is as follows:

φ(s) = π*(s) = arg max_a Q(s, a).
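The correspondence between a learned Q-matrix and a reactive CA lookup table is direct: the table entry for each input state is the greedy action. A minimal sketch (the array sizes and random Q-values are illustrative assumptions):

```python
# Extracting a CA lookup table from a Q-matrix: for each of the M^k input
# states, take the action with the highest Q-value.
import random

M, k = 3, 3  # illustrative sizes: 3 basis behaviours, self plus 2 neighbours

# Q has one row per input state and one column per basis behaviour.
rng = random.Random(0)
Q = [[rng.random() for _ in range(M)] for _ in range(M ** k)]

# phi(s) = argmax_a Q(s, a): the lookup table used at run time.
phi = [max(range(M), key=lambda a: Q[s][a]) for s in range(M ** k)]
```

This is why the learned policy can be deployed exactly like the evolved tables of Sections 3 to 5, with no run-time learning machinery required.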

As in the cellular automata discussion above, we will be using a policy for each leg such that control is still distributed. However, only three of these will be unique, as identical policies will be used for pairs of legs joined by contralateral connections. The three policies will be termed Q_f, Q_c, and Q_r for ‘forward’, ‘center’, and ‘rear’, respectively. Coupling between legs will be as in Fig. 1 (left). This type of setup is generally known as a multiagent reinforcement learning problem [25,26,29,33,37,38], as we are trying to learn more than one policy simultaneously and the abilities of the policies are interdependent. There are many issues associated with the problem described here, including non-Markovian states, partial observability [22,32], and non-determinism, but we will forge ahead in spite of these to see what comes of our attempt. For rewards we will be using the net distance travelled (positive is forward, negative is backward) by the hexapod robot. All three policies will receive the same reward. The task will be to optimize how fast the robot can walk forward on the treadmill. Mathematically, we will be looking for an optimal solution to the MDP. Note that all three policies will have to work together to produce an overall behaviour which propels the robot forward. Rewards will be assessed at the end of each


action taken (each “step” of the robot). Typically in Q-learning, each policy is updated according to

Q(s, a) ← r(s, a) + γ max_{a′} Q(s′, a′)

where s = s(t) is the old state, s′ = s(t + 1) is the new state, and r(s, a) is the reward for starting at s and selecting action a. Unfortunately, this algorithm does not suffice for our purposes, as we are working with a multiagent system where the reward is based on the performance of the entire system (not just one leg). Specifically, the reward depends on the actions of all six legs, not just one of them, and as a result a leg could receive very different values for the reward associated with taking action a from state s, r(s, a), depending on what the other legs decide. It does not make sense to average the rewards (as in stochastic Q-learning), as this can lead to suboptimal policies. Instead, we introduce a memory that keeps track of the best reward earned by taking action a from state s, r_max(s, a), which has the same dimensions as Q(s, a). We call the modified learning algorithm cooperative Q-learning, as it promotes cooperation between concurrent learners (who receive identical rewards). The cooperative Q-learning algorithm is as follows:

r_max(s, a) ← max[r_max(s, a), r(s, a)]
Q(s, a) ← r_max(s, a) + γ max_{a′} Q(s′, a′)

where s = s(t), a = a(t), and s′ = s(t + 1).

7. Reinforcement learning experiments

This section briefly describes the implementation of the above algorithm. The experimental setup is identical to that of the evolutionary experiments described above. Cooperative Q-learning was tested on Kafka using the leg coupling of Fig. 1 (left) and the leg discretization of Fig. 2 (right). To keep the size of the search space small, only two actions were permitted for each leg: stay in the current state, or move one state forward in the counter-clockwise cycle, ABCABC . . . . Furthermore, the learning occurred in an episodic manner. An episode consisted of starting the robot in the configuration BCBCBC and allowing it 10 time-steps to choose actions. A typical run consisted of 300 episodes. This episodic-style training sped up the experiment by allowing the robot to learn only part of the Q-matrix rather than learning it in its entirety. The last issue to discuss is the trade-off between exploration of new actions and exploitation of the best known actions. Typically in Q-learning, new actions are explored with some probability, p_explore; otherwise the best action is chosen according to the Q-matrix. Here we used a linearly time-varying p_explore, as depicted in Fig. 6 (bottom). This allowed many new actions to be explored initially and fewer later on. This is of course a heuristic, which often needs to be “tuned” in Q-learning experiments.
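The episodic procedure above, combined with the cooperative update of Section 6, can be sketched for a single leg's policy. This is a toy sketch, not the experimental code: the stand-in reward replaces the shared odometer reward, and the state encoding and episode counts are illustrative.

```python
# Sketch of episodic cooperative Q-learning for one leg policy, with the
# two permitted actions (stay / advance) and a linearly decaying
# exploration probability. The reward function is a stand-in; on Kafka
# the (shared) reward is the net odometer distance per step.
import random

M = 3                  # leg states A, B, C
ACTIONS = (0, 1)       # 0 = stay, 1 = advance one state in the cycle
GAMMA = 0.9

rng = random.Random(0)
Q = {}                 # Q[(s, a)] -> value
r_max = {}             # best reward ever seen for (s, a)

def toy_reward(s, a):
    # Stand-in reward: advancing the leg earns more from some states.
    return float(a) * (1.0 if s == 0 else 0.5)

def update(s, a, r, s_next):
    # Cooperative Q-learning: remember the best reward for (s, a), then
    # back it up with the usual discounted greedy bootstrap.
    r_max[(s, a)] = max(r_max.get((s, a), float('-inf')), r)
    best_next = max(Q.get((s_next, an), 0.0) for an in ACTIONS)
    Q[(s, a)] = r_max[(s, a)] + GAMMA * best_next

for episode in range(300):
    s = 1                                  # episodes start from a fixed stance
    p_explore = 1.0 - episode / 300.0      # linearly decaying exploration
    for t in range(10):                    # 10 time-steps per episode
        if rng.random() < p_explore:
            a = rng.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda an: Q.get((s, an), 0.0))
        s_next = (s + a) % M
        update(s, a, toy_reward(s, a), s_next)
        s = s_next

greedy = {s: max(ACTIONS, key=lambda an: Q.get((s, an), 0.0)) for s in range(M)}
```

Because episodes always start from the same stance, only the visited portion of the Q-matrix is learned, which is what makes the episodic scheme faster than filling in the whole table.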

Fig. 6. Cooperative Q-learning Performance. (Top) Odometer position versus time, (middle) slope of top curve as computed by fitting a cubic to the data, (bottom) exploration probability versus time (1 time-step corresponds approximately to 1 s and 1 odometer click corresponds approximately to 9 mm).

Fig. 6 (top) shows the odometer position over a typical run of 3000 time-steps. Fig. 6 (middle) shows the slope of the top figure (as computed by fitting a cubic to the top figure and taking the analytical derivative). The middle figure approximately represents the instantaneous speed of the robot, which we can see improves significantly over time to a level comparable to, but slightly below, that of a tripod gait (around 6 clicks/time-step, where an odometer click translates to about 9 mm of movement). The reason the performance does not quite meet that of a hand-designed tripod gait is most likely that we did not allow the experiment to continue past 3000 time-steps (the exploration probability went to zero there). Fig. 7 shows the time history or gait diagrams of the legs for four time intervals of 100 time-steps each. We can see the pattern becoming more regular at later times. By the end of the experiment, Kafka had learned a gait which was similar in performance and nature to the tripod.

8. Discussion and conclusion

Many researchers believe that highly parallel and decentralized methods are the key to endowing artificial systems with intelligence. Decentralized controllers for insect robots offer a great deal of redundancy in that if one controller fails, the robot may still limp along under the power of the remaining functional legs [10]. The cellular automata controller approach outlined here was successfully able to control Kafka, a hexapod robot, and should extend to robots with more degrees of freedom (keeping in mind scaling issues). One advantage of using such a coarse controller is that it requires very few real-time

T.D. Barfoot et al. / Robotics and Autonomous Systems 54 (2006) 864–872

871

Fig. 7. Gait diagrams at various times throughout a 3000 time-step run.

computations to be made (compared to dynamic neural network approaches) as each leg is simply looking up its behaviour in a table. The approach easily lends itself to automatic generation of controllers as was shown for the simple examples presented here. We found that both coevolutionary search and multiagent learning were able to produce distributed controllers that were comparable to the best hand-coded solutions. The approach based on cooperative Q-learning, was somewhat faster at this task (e.g., 1 h instead of 8) but required a great deal more information as it was receiving feedback after shorter time-step intervals. Although both methods were using the same sensor, the reinforcement learning approach took advantage of the more detailed breakdown of rewards to increase convergence rate. The cost of this speed-up could also be seen in the need to prescribe an exploration strategy. The coevolutionary approach required fewer parameters to be tuned which certainly could provide an advantage for some applications. This work was motivated by evidence that biological insects generate walking patterns by means of decentralized control [11]. We hope that studying such types of control for artificial walkers may also in turn tell us something about the natural systems by which they were inspired. Acknowledgments Kafka, the hexapod robot featured in this work, was constructed by David McMillen as part of a Master’s thesis at the University of Toronto Institute for Aerospace Studies.

The research was funded in part by the Natural Sciences and Engineering Research Council of Canada and the Centre for Research in Earth and Space Technology.

References

[1] T.J. Allen, R.D. Quinn, R.J. Bachmann, R.E. Ritzmann, Abstracted biological principles with reduced actuation improve mobility of legged vehicles, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, Las Vegas, Nevada, October 2003, pp. 1370–1375.
[2] T.D. Barfoot, G.M.T. D'Eleuterio, An evolutionary approach to multiagent heap formation, in: Congress on Evolutionary Computation, 1999.
[3] T.D. Barfoot, E.J.P. Earon, G.M.T. D'Eleuterio, A step in the right direction: Learning hexapod gaits through reinforcement, in: Proceedings of the 31st International Symposium on Robotics, ISR, Montréal, Canada, 2000.
[4] T.D. Barfoot, G.M.T. D'Eleuterio, Evolving distributed control for an object clustering task, Complex Systems 15 (3) (2005) 95–121.
[5] A.G. Barto, S.J. Bradtke, S.P. Singh, Learning to act using real-time dynamic programming, Artificial Intelligence 72 (1995) 81–138.
[6] R.D. Beer, H.J. Chiel, L.S. Sterling, A biological perspective on autonomous agent design, Robotics and Autonomous Systems 6 (1990) 169–186.
[7] R. Bellman, Dynamic Programming, Princeton University Press, 1957.
[8] R.A. Brooks, A robust layered control system for a mobile robot, IEEE Journal of Robotics and Automation RA-2 (1986) 14–23.
[9] R.A. Brooks, Coherent behaviour from many adaptive processes, in: D. Cliff, P. Husbands, J.A. Meyer, S.W. Wilson (Eds.), Simulation of Adaptive Behaviour: From Animals to Animats, vol. 3, MIT Press, 1994.
[10] H.J. Chiel, R.D. Beer, R.D. Quinn, K.S. Espenschied, Robustness of a distributed neural network controller for locomotion in a hexapod robot, IEEE Transactions on Robotics and Automation 8 (3) (1992) 292–303.


[11] H. Cruse, Coordination of leg movement in walking animals, in: J.A. Meyer, S. Wilson (Eds.), Simulation of Adaptive Behaviour: From Animals to Animats, MIT Press, 1990.
[12] R. Das, J.P. Crutchfield, M. Mitchell, J.E. Hanson, Evolving globally synchronized cellular automata, in: L.J. Eshelman (Ed.), Proceedings of the Sixth International Conference on Genetic Algorithms, San Francisco, CA, 1995, pp. 336–343.
[13] M. Dorigo, M. Colombetti, Robot Shaping: An Experiment in Behavior Engineering (Intelligent Robotics and Autonomous Agents), A Bradford Book, MIT Press, Cambridge, MA, 1998.
[14] E.J.P. Earon, T.D. Barfoot, G.M.T. D'Eleuterio, From the sea to the sidewalk: The evolution of hexapod walking gaits by a genetic algorithm, in: Proceedings of the Third International Conference on Evolvable Systems, Edinburgh, Scotland, 2000.
[15] S. Nolfi, D. Floreano, Evolutionary Robotics: The Biology, Intelligence, and Technology of Self-Organizing Machines (Intelligent Robotics and Autonomous Agents), A Bradford Book, MIT Press, Cambridge, MA, 2000.
[16] D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA, 1989.
[17] T. Gomi, A. Griffith, Evolutionary robotics — an overview, in: Proceedings of the IEEE International Conference on Evolutionary Computation, 1996, pp. 40–49.
[18] T. Gomi, Practical applications of behaviour-based robotics: The first five years, in: IECON98, Proceedings of the 24th Annual Conference of the IEEE Industrial Electronics Society, vol. 4, 1998, pp. 159–164.
[19] T. Gomi, K. Ide, Evolution of gaits of a legged robot, in: The 1998 IEEE International Conference on Fuzzy Systems, IEEE World Congress on Computational Intelligence, vol. 1, 1998, pp. 159–164.
[20] F. Gruau, Automatic definition of modular neural networks, Adaptive Behaviour 3 (2) (1995) 151–184.
[21] N. Jakobi, Running across the reality gap: Octopod locomotion evolved in a minimal simulation, in: P. Husbands, J.-A. Meyer (Eds.), Proceedings of EvoRobot98, Springer-Verlag, 1998.
[22] L.P. Kaelbling, M.L. Littman, A.R. Cassandra, Planning and acting in partially observable stochastic domains, Artificial Intelligence 101 (1998).
[23] M.J. Matarić, Behaviour-based control: Examples from navigation, learning, and group behaviour, in: H. Hexmoor, I. Horswill, D. Kortenkamp (Eds.), Software Architectures for Physical Agents, Journal of Experimental and Theoretical Artificial Intelligence 9 (2) (1997) 323–336 (special issue).
[24] D.R. McMillen, Kafka: A hexapod robot, Master's Thesis, University of Toronto Institute for Aerospace Studies, 1995.
[25] L. Peshkin, K.-E. Kim, N. Meuleau, L.P. Kaelbling, Learning to cooperate via policy search, in: Proceedings of the 16th International Conference on Uncertainty in AI, 2000.
[26] J. Schmidhuber, Reinforcement learning in Markovian and non-Markovian environments, in: D.S. Lippman, J.E. Moody, D.S. Touretzky (Eds.), Advances in Neural Information Processing Systems, vol. 3, Morgan Kaufmann, 1991, pp. 500–506.
[27] K.T. Simsarian, M.J. Matarić, Learning to cooperate using two six-legged mobile robots, in: Proceedings of the Third European Workshop on Learning Robots, Heraklion, Crete, Greece, 1995.
[28] R.T. Schroer, M.J. Boggess, R.J. Bachmann, R.D. Quinn, R.E. Ritzmann, Comparing cockroach and Whegs robot body motion, in: Proceedings of the IEEE International Conference on Robotics and Automation, New Orleans, Louisiana, April 2004, pp. 3288–3293.
[29] P. Stone, Layered Learning in Multiagent Systems: A Winning Approach to Robotic Soccer, MIT Press, Cambridge, MA, 2000.

[30] K. Støy, W.-M. Shen, P.M. Will, Using role-based control to produce locomotion in chain-type reconfigurable robots, IEEE/ASME Transactions on Mechatronics 7 (4) (2002) 410–417.
[31] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, A Bradford Book, MIT Press, 1998.
[32] S. Thrun, Monte Carlo POMDPs, in: Proceedings of the Conference on Neural Information Processing Systems, NIPS, 1999.
[33] C. Versino, L.M. Gambardella, Learning real team solutions, in: G. Weiß (Ed.), Distributed Artificial Intelligence Meets Machine Learning, Springer, 1997, pp. 40–61.
[34] J. von Neumann, Theory of Self-Reproducing Automata, University of Illinois Press, Urbana and London, 1966.
[35] C.J.C.H. Watkins, Learning from delayed rewards, Ph.D. Thesis, Cambridge University, Cambridge, England, 1989.
[36] T.E. Wei, R.D. Quinn, R.E. Ritzmann, A CLAWAR that benefits from abstracted cockroach locomotion principles, in: Proceedings of the 7th International Conference on Climbing and Walking Robots, Madrid, Spain, September 2004.
[37] G. Weiss (Ed.), Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, MIT Press, Cambridge, MA, 1999.
[38] L.F.A. Wessels, Multiagent reinforcement learning, Ph.D. Thesis, Faculty of Engineering, University of Pretoria, 1997.
[39] A. Wuensche, The ghost in the machine: Basins of attraction of random Boolean networks, in: C.G. Langton (Ed.), Artificial Life III: SFI Studies in the Sciences of Complexity, vol. XVII, Addison-Wesley, 1994.
[40] D. Zeltzer, M. McKenna, Simulation of autonomous legged locomotion, in: C.G. Langton (Ed.), Artificial Life III: SFI Studies in the Sciences of Complexity, vol. XVII, Addison-Wesley, 1994.

T.D. Barfoot is currently working in the Controls and Analysis Group at MDA Space Missions. Previously, he was an Assistant Professor in the Space Robotics Group at the University of Toronto Institute for Aerospace Studies. He received his B.A.Sc. in the Aerospace Option of Engineering Science at the University of Toronto. His Ph.D. thesis, obtained from the Institute for Aerospace Studies at the University of Toronto, focused on the coordination of multiple mobile robots. His current research interests lie in the area of experimental mobile robotics for both space and terrestrial applications.

E.J.P. Earon received his B.A.Sc. in Aerospace Engineering at the University of Toronto in 1999 and began his doctoral research at the University of Toronto Institute for Aerospace Studies. His research centred on robotic control and flexible, adaptable machine learning topics, with particular emphasis on evolutionary and genetic computation. He received his Ph.D. in 2005.

G.M.T. D’Eleuterio was born in Teramo, Italy. He received his B.A.Sc. in Engineering Science (Aerospace) from the University of Toronto and his M.A.Sc. and Ph.D. from the University of Toronto Institute for Aerospace Studies. His current research interests include space robotics and biologically inspired approaches to robotic control. He is a Professor at the University of Toronto Institute for Aerospace Studies.
