Spatial-Temporal Optimisation for Adaptive Sustainable Forest Management

Verena Rieser∗
School of Mathematics and Computer Science
Heriot-Watt University
Edinburgh, EH14 4AS
[email protected]
∗The Interaction Lab, www.macs.hw.ac.uk/interactionlab

Abstract

Handling risk and uncertainty is vital for finding long-term, sustainable forest management solutions, which are adaptive and robust to changes in the environment. This paper explores the performance of Reinforcement Learning (RL) for spatial-temporal, multi-objective optimisation for sustainable forest management in dynamic stochastic environments. We show that RL significantly outperforms a heuristic baseline by adapting to increasing spatial and temporal uncertainties. We argue that RL's superior performance can be attributed to the explicit representation of uncertainty in its transition function and estimated reward value.

1 Introduction

Sustainable forest management is defined as "the stewardship and use of forests and forest lands in a way, and at a rate, that maintains their biodiversity, productivity, regeneration capacity, vitality and their potential to fulfil, now and in the future, relevant ecological, economic and social functions" [10]. As such, forest management has to satisfy multiple and often conflicting objectives, and it is characterised by the long-term horizon of its outcomes. Since long-term plans are made in the face of uncertain futures, long-term sustainable forest management should incorporate some measure of risk. Uncertainty emerges from a variety of sources, including irregular or unknown fluctuations in the demand for timber, or the occurrence of forest disturbances. In addition, forest management is dynamic in time and space: for example, different forest stands have different properties, and the likelihood of stochastic events may change over time. Forest planning may be suboptimal if it ignores these sources of uncertainty and risk (also see [8] for a discussion).

Previous work on multi-objective optimisation in forest management has evaluated the performance of heuristic search methods with respect to a variety of different forest planning problems, e.g. [2, 12]. However, the planning problems addressed in these studies were deterministic, i.e. they did not model uncertainty in the effects of forest management actions. We use stochastic optimisation, in particular Reinforcement Learning (RL) [14], to select an optimal policy which is robust to uncertainty and risk. Contemporary research using RL in the context of forest management has shown that it can find spatial-temporal solutions for multi-objective resource management in a deterministic environment [4]. In this paper, we show that RL outperforms simple heuristic search in stochastic environments with increasing uncertainty.

2 Problem Descriptions for Sustainable Forest Management

We present several different hypothetical task environments that are used to test the performance of RL. The task descriptions are meant to provide a proof-of-concept and do not strive to incorporate the multitude of complex factors in a real-world task environment. In particular, we investigate three aspects of the forest management problem with increasing levels of uncertainty: (A) multi-objective planning, (B) temporal planning with increasing uncertainty over time, and (C) planning in environments which are dynamic in time and space. The overall task is to decide on a management option for a forest management unit (a "stand"), where the two management options available are to preserve or to harvest a stand. For task types (A) and (C) the optimisation task is to decide how many stands to harvest according to some trade-off, reflected in the multi-objective goal. The forest is composed of 10 stands, where the decision for each of the stands is made sequentially. Task type (B) deals with temporal decision making, where the optimisation task is to decide when to harvest an individual stand over 10 time intervals.

Task (A): Multi-objective goal. The multi-objective goal implements the trade-off between economic return and forest conservation: to satisfy the existing demand for timber while cutting as few forest stands as possible. Equation 1 formulates the objective as a weighted sum:

objective = (wf × forestStands) − (wd × unsatisfiedDemand)    (1)

We assume that the environment is static and behaves in a deterministic way, e.g. the demand can always be satisfied by harvesting five forest stands, and each stand has the same potential to satisfy demand.

Task (B): Increasing uncertainty over time. In Task (B) we explore uncertainty that is introduced by the temporal nature of forest management. Within our modelling framework uncertainty increases over time, which is operationalised as an increasing probability of disturbance affecting a forest stand.

Task (C): Spatial dynamics. In Task (C) we model the likelihood of forest disturbance as a function of tree age, following [3]. However, we extend the model to also include the spatial proximity to neighbouring stands and their average age. This implements the notion that forest disturbances tend to spread. The likelihood of forest disturbance is now a linear function of the stand's own age and the average age of its neighbouring stands, where we use a Moore neighbourhood. The stand's age is also positively related to the amount of demand it can satisfy: the older the forest stand, the more demand it can satisfy. Thus, for older stands there is a trade-off between a higher potential to satisfy demand and an increased risk of windfall.
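To make the task definitions concrete, the following sketch computes the weighted-sum objective of Equation 1 and the Task (C) disturbance likelihood over a Moore neighbourhood. The weight values, the linear slopes a and b, and the 2 x 5 grid of stand ages are illustrative assumptions, not values taken from the experiments reported here.

import numpy as np

W_F, W_D = 5.0, 10.0   # assumed weights on preserved stands / unsatisfied demand

def objective(preserved_stands, unsatisfied_demand, w_f=W_F, w_d=W_D):
    """Weighted-sum objective of Equation 1."""
    return w_f * preserved_stands - w_d * unsatisfied_demand

def disturbance_probability(ages, row, col, a=0.002, b=0.001):
    """Task (C): disturbance likelihood as a linear function of a stand's own age
    and the average age of its Moore neighbourhood (slopes a, b are hypothetical)."""
    rows, cols = ages.shape
    neighbours = [ages[r, c]
                  for r in range(max(0, row - 1), min(rows, row + 2))
                  for c in range(max(0, col - 1), min(cols, col + 2))
                  if (r, c) != (row, col)]
    p = a * ages[row, col] + b * float(np.mean(neighbours))
    return min(1.0, max(0.0, p))

# Example: a 2 x 5 grid of stand ages (in years)
ages = np.array([[120, 40, 80, 200, 30],
                 [60, 150, 90, 20, 110]], dtype=float)
print(objective(preserved_stands=5, unsatisfied_demand=0))   # 25.0
print(round(disturbance_probability(ages, 0, 3), 3))         # risk of the oldest stand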

3 Representation as a Reinforcement Learning Problem

RL addresses the problem of how a forest manager should take actions in an uncertain environment so as to maximise some notion of cumulative, long-term utility or "reward" [14]. RL uses Markov Decision Processes (MDPs) as its underlying representation for decision making and learning. At each time step t the process is in some state s_t and the forest manager may choose any action a(s) that is available in state s. The process responds at the next time step by moving into a new state s' according to the probability P(s'|s, a), which is defined by the transition function T_{ss'}, and giving the decision maker a corresponding reward R_{ss'}. In our case, the reward corresponds to the multi-objective goal as formulated by Equation 1.

We use an implementation of the well-known SARSA algorithm [13], which is trained over 900 learning episodes (with learning rate α = 0.2, discount rate γ = 0.95 and eligibility trace parameter λ = 0.9). The state-action space of the MDP is defined as in Figure 1. The state space keeps track of the number of preserved forest stands and whether the demand is satisfied or not. The feature forestCycle is only used for Task (B) to keep track of the temporal progression. The features ageStand and ageNeighbours are only used for Task (C) to represent the age of the stand in question and the average age of its neighbouring stands.

Figure 1: State-Action space for the forest management problem. STATE: forestStands 1-10; demandSatisfied 0/1; forestCycle 0-10; ageStand young/med/old; ageNeighbours 1-300. ACTIONS: preserve, harvest.
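As a concrete illustration of the learning set-up, the sketch below implements tabular SARSA(λ) with the hyper-parameters reported above (α = 0.2, γ = 0.95, λ = 0.9, 900 episodes) on a minimal, deterministic Task (A)-style environment. The environment dynamics, the ε-greedy exploration rate and the end-of-episode reward are simplifying assumptions for illustration, not the exact implementation used in our experiments.

import random
from collections import defaultdict

ALPHA, GAMMA, LAMBDA, EPISODES, EPSILON = 0.2, 0.95, 0.9, 900, 0.1
ACTIONS = ("preserve", "harvest")
W_F, W_D = 5.0, 10.0   # assumed Task (A.3)-style weights
DEMAND = 5             # demand satisfied by harvesting five stands (assumption)

def step(state, action):
    """Minimal deterministic environment: decide over 10 stands sequentially.
    State = (stands decided, stands harvested); reward is given at the end."""
    decided, harvested = state
    harvested += action == "harvest"
    decided += 1
    if decided < 10:
        return (decided, harvested), 0.0, False
    preserved = 10 - harvested
    unsatisfied = max(0, DEMAND - harvested)
    return (decided, harvested), W_F * preserved - W_D * unsatisfied, True

def epsilon_greedy(Q, state):
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

Q = defaultdict(float)
for _ in range(EPISODES):
    traces = defaultdict(float)                     # eligibility traces e(s, a)
    state, action, done = (0, 0), epsilon_greedy(Q, (0, 0)), False
    while not done:
        next_state, reward, done = step(state, action)
        next_action = None if done else epsilon_greedy(Q, next_state)
        target = reward if done else reward + GAMMA * Q[(next_state, next_action)]
        delta = target - Q[(state, action)]
        traces[(state, action)] += 1.0              # accumulating traces
        for key in list(traces):
            Q[key] += ALPHA * delta * traces[key]
            traces[key] *= GAMMA * LAMBDA
        state, action = next_state, next_action

# Greedy policy along the trajectory that harvests while demand is unsatisfied:
print([max(ACTIONS, key=lambda a: Q[((i, min(i, DEMAND)), a)]) for i in range(10)])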

3.1 Baseline

We implement the baseline as a simple heuristic search which optimises the ordering of actions in order to maximise the multi-objective goal. In contrast to the RL-based implementation described above, the baseline lacks the ability to estimate the expected reward value or the transition probabilities between states, and thus can only implicitly learn the ordering of actions.
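The exact search procedure of the baseline is only described at a high level above, so the following is one plausible instantiation rather than a definitive implementation: a simple hill-climbing search over fixed preserve/harvest sequences, scored with the objective of Equation 1. The neighbourhood operator (flipping a single decision) and the number of restarts are assumptions.

import random

W_F, W_D, N_STANDS, DEMAND = 5.0, 10.0, 10, 5   # assumed Task (A.3)-style setting

def evaluate(plan):
    """Objective of Equation 1 for a fixed action sequence (1 = harvest)."""
    harvested = sum(plan)
    preserved = N_STANDS - harvested
    unsatisfied = max(0, DEMAND - harvested)
    return W_F * preserved - W_D * unsatisfied

def hill_climb(restarts=20, iterations=200):
    """Hypothetical heuristic baseline: local search over action orderings.
    Unlike RL, it scores whole plans and never models transition probabilities."""
    best_plan, best_score = None, float("-inf")
    for _ in range(restarts):
        plan = [random.randint(0, 1) for _ in range(N_STANDS)]
        for _ in range(iterations):
            i = random.randrange(N_STANDS)
            neighbour = plan.copy()
            neighbour[i] = 1 - neighbour[i]          # flip one decision
            if evaluate(neighbour) >= evaluate(plan):
                plan = neighbour
        if evaluate(plan) > best_score:
            best_plan, best_score = plan, evaluate(plan)
    return best_plan, best_score

print(hill_climb())   # e.g. a plan that harvests exactly five stands, score 25.0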

4 Results

RL outperforms the heuristic search baseline with increasing significance the more uncertainty is introduced into the planning environment. We explain RL's superior performance by its ability to explicitly represent uncertainty in its transition function and to adapt dynamically to changes in the environment. Table 1 summarises the results and reports the average performance of RL and heuristic search in terms of their average accumulated reward value (see Equation 1). We compare them for significant differences using a two-tailed paired Student's t-test (n = 300). We also report the percentage of the overall possible reward obtained.

Task  wf  wd  Environm. dynamics    Heuristic: Acc. Reward (sdv.)  % total  RL: Acc. Reward (sdv.)  % total  p-value   Interpretation
A.1   10   1  deterministic          95.0 (± 0.0)                  100%      95.0 (± 0.0)           100%     n/a       always preserve
A.2    1  10  deterministic           5.0 (± 0.0)                  100%       5.0 (± 0.0)           100%     n/a       always harvest
A.3    5  10  deterministic          25.0 (± 0.0)                  100%      25.0 (± 0.0)           100%     n/a       harvest until no demand
B      5  10  risk(time)             12.27 (± 8.3)                 89.8%     14.13 (± 6.4)          91.3%    p = .002  harvest early on
C      5  10  risk(age, proximity)  -10.80 (± 31.9)                71.4%     15.00 (± 13.1)         92.0%    p < .001  forest "thinning"

Table 1: Comparing the performance of RL and the heuristic baseline in terms of average objective value for different task types with increasing uncertainty, using a paired t-test. Sub-tasks (denoted A.x) use different weights on the multi-objective goal. For sub-task (A.1) the remaining forest stands are weighted high (wf = 10) and unsatisfied demand is weighted low (wd = 1), whereas for sub-task (A.2) we reverse the weighting. For sub-task (A.3) we selected weights that encourage strategies which try to satisfy the existing demand (wf = 5, wd = 10). These weights are also used for tasks (B) and (C).

The results show that both algorithms are adaptive to different objective functions: they are both able to reach the maximum performance. For Task (A) both algorithms find the optimal percentage of stands to be harvested, in random order. Note that the definition of the total reward changes for tasks (B) and (C), as policies can now also incur negative rewards due to the stochastic behaviour of the environment. The maximum possible reward is +25, which is reached by harvesting 5 cells while none of the remaining forest cells are affected by disturbances (5 preserved stands × wf = 5 × 5 = 25). The minimal possible reward is -100, which, for example, is reached when harvesting zero cells and all of the remaining forest is destroyed by disturbances.

For Task (B) RL learns a temporal ordering that minimises the risk: it harvests forest stands early in the planning sequence (while uncertainty is still relatively small), and then preserves as soon as all demand is satisfied. The heuristic search finds various solutions to the problem (also known as a "Pareto front"), which all tend towards early harvesting and later preservation; however, it fails to adapt dynamically to the changing environment.

In Task (C) RL prefers to harvest old stands that are surrounded by young stands, as these have a high return in terms of demand and a low risk of forest disturbance. It also learns to harvest old stands which are surrounded by other old stands, as this lowers the risk for the surrounding old stands. This process is also known as "thinning" a forest. Again, the heuristic search fails to adapt to the spatial context. Note that this task is the most complex amongst our task definitions and it clearly illustrates the superior performance of RL over the heuristic baseline: while the heuristic baseline's average reward drops to -10.80 (71.4% of the maximum possible reward), RL still achieves an average of 15.00 (92.0%).
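For reference, the significance comparison reported in Table 1 corresponds to a standard two-tailed paired t-test over matched evaluation episodes; the sketch below reproduces the procedure on placeholder reward samples (the arrays are synthetic stand-ins, not the actual evaluation runs).

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder reward samples standing in for n = 300 evaluation episodes per method.
rewards_rl = rng.normal(loc=14.13, scale=6.4, size=300)
rewards_baseline = rng.normal(loc=12.27, scale=8.3, size=300)

# Two-tailed paired Student's t-test, as used for Table 1.
t_stat, p_value = stats.ttest_rel(rewards_rl, rewards_baseline)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")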

5 Conclusion and Discussion

This paper explores the performance of Reinforcement Learning (RL) for spatial-temporal multi-objective optimisation for sustainable forest management in stochastic environments. We show that RL outperforms a heuristic baseline by adapting to increasing spatial and temporal uncertainties. We argue that RL's superior performance can be attributed to the explicit representation of uncertainty in its transition function and estimated reward value.

Handling risk and uncertainty is vital for finding long-term sustainable forest management solutions which are adaptive and robust to changes in the environment [8]. Optimised solutions are often used in the context of Decision Support Tools to assist stakeholders with their forest planning problems, see for example [15]. In order to achieve this long-term goal, our RL optimisation tool has to meet new challenges. First of all, we need to investigate how its performance scales up to real-world forest management problems. Learning with real data will introduce uncertainty in the underlying observations, which we plan to address using Partially Observable Markov Decision Processes. We also plan to explore belief compression techniques to deal with large state spaces and sparse data, e.g. [5, 11]. Furthermore, we want to investigate parallel computing methods to meet computational constraints for large-scale spatial optimisation over long time periods, e.g. [6]. In addition, we will need to communicate the learned forest management strategies, and the uncertainty associated with them, to stakeholders in a transparent way. While some Decision Support Tools rely on visualisation techniques [1], we believe that recommendations in the form of natural language narratives will add value in explaining temporal uncertainties and the associated likelihood of success of particular management actions [7, 9].

References

[1] J. Aerts, K. Clarke, and A. Keuper. Testing popular visualisation techniques for representing model uncertainty. Cartography and Geographic Information Science, 30(3):249–261, 2003.
[2] P. Bettinger, D. Graetz, K. Boston, J. Sessions, and W. Chung. Eight heuristic planning techniques applied to three increasingly difficult wildlife planning problems. Silva Fennica, 36(2):561–584, 2002.
[3] Christopher Bone and Suzana Dragicevic. Defining transition rules with Reinforcement Learning for modeling land cover change. Simulation, 85(5):291–305, 2009.
[4] Christopher Bone and Suzana Dragicevic. Incorporating spatio-temporal knowledge in an intelligent agent model for natural resource management. Landscape and Urban Planning, 96(2):123–133, 2010.
[5] Paul Crook and Oliver Lemon. Lossless value directed compression of complex user goal states for statistical spoken dialogue systems. In Proceedings of Interspeech, 2011.
[6] Paul A. Crook, Brieuc Roblin, Hans-Wolfgang Loidl, and Oliver Lemon. Parallel computing and practical constraints when applying the standard POMDP belief update formalism to spoken dialogue management. In Proceedings of the 3rd International Workshop on Spoken Dialogue Systems Technology (IWSDS), Granada, Spain, September 2011.
[7] Francisco Elizalde, Enrique Sucar, Julieta Noguez, and Alberto Reyes. Generating explanations based on Markov decision processes. In MICAI 2009: Advances in Artificial Intelligence, volume 5845 of Lecture Notes in Computer Science, pages 51–62. Springer Berlin / Heidelberg, 2009.
[8] Jordi Garcia-Gonzalo, Maria Pasalodos, and Jose Borges. A review of methods for introducing risk and uncertainty in forest planning. In Workshop on Decision Support Systems in Sustainable Forest Management, 2010.
[9] Omar Zia Khan, Pascal Poupart, and James P. Black. Minimal sufficient explanations for MDPs. In Nineteenth International Conference on Automated Planning and Scheduling (ICAPS), pages 48–59, 2009.
[10] MCPFE. Ministerial conference on the protection of forests in Europe. Documents, 16-17 June 1994.
[11] O. Pietquin, M. Geist, and S. Chandramohan. Sample efficient on-line learning of optimal dialogue policies with Kalman temporal differences. In International Joint Conference on Artificial Intelligence (IJCAI), 2011.
[12] Timo Pukkala and Mikko Kurttila. Examining the performance of six heuristic optimisation techniques in different forest planning problems. Silva Fennica, 39(1), 2005.
[13] Dan Shapiro and P. Langley. Separating skills from preference: Using learning to program by reward. In Proceedings of the 19th International Conference on Machine Learning (ICML), 2002.


[14] R. Sutton and A. Barto. Reinforcement Learning. MIT Press, 1998.
[15] Peder Wikström, Lars Edenius, Björn Elfving, Ljusk Eriksson, Tomas Lämås, Johan Sonesson, Karin Öhman, Jörgen Wallerman, Carina Waller, and Fredrik Klintebäck. The Heureka forestry decision support system: An overview. Mathematical and Computational Forestry & Natural-Resource Sciences (MCFNS), 3(2), 2011.

