University of Liège – University of Michigan
Model-Free Monte Carlo-like Policy Evaluation
Raphaël Fonteneau, Susan Murphy, Louis Wehenkel, Damien Ernst
March 30th, 2010, Heeze, The Netherlands
Outline
● Introduction
● Problem statement
● The Monte Carlo estimator
● The Model-Free Monte Carlo estimator
● Illustration
● Conclusions and future work
"Model-Free Monte Carlo-like policy evaluation". R. Fonteneau, S.A. Murphy, L. Wehenkel and D. Ernst (2010). In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS 2010), JMLR W&CP 9, pp. 217-224, Chia Laguna, Sardinia, Italy, May 13-15, 2010.
Introduction
● Discrete-time stochastic optimal control problems arise in many fields (finance, medicine, engineering, ...)
● Many techniques for solving such problems use an oracle that evaluates the performance of any given policy in order to determine a (near-)optimal control policy
● When the system is accessible to experimentation, such an oracle can be based on a Monte Carlo (MC) approach
● In this paper, the only information available is contained in a sample of one-step transitions of the system
● In this context, we propose a "Model-Free Monte Carlo (MFMC) estimator" of the performance of a given policy that mimics in some way the Monte Carlo estimator.
Problem statement
● We consider a discrete-time system whose dynamics over T stages is given by xt+1 = f(xt, ut, wt)
● All xt lie in a normed state space X, all ut lie in a normed action space U, and the wt are i.i.d. according to a probability distribution pW(·)
● An instantaneous reward rt = ρ(xt, ut, wt) is associated with taking action ut in state xt
● A policy h: {0, ..., T-1} × X → U is given, and we want to evaluate its performance.
Problem statement
● The expected return of the policy h when starting from an initial state x0 = x is given by

Jh(x0) = E over w0, ..., w_{T-1} ~ pW(·) of [ Rh(x0) ],  where  Rh(x0) = Σ_{t=0}^{T-1} rt

with xt+1 = f(xt, h(t, xt), wt) and rt = ρ(xt, h(t, xt), wt).

[Figure: a trajectory x0 → x1 → ... → xT generated under the disturbances w0, ..., w_{T-1}, collecting the rewards r0, ..., r_{T-1}.]
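As a concrete sketch of this setting, one realization of the cumulated return Rh(x0) can be obtained by rolling the system forward for T steps. The functions `f`, `rho`, `draw_w` and the policy `h` below are hypothetical toy stand-ins (not the ones from the talk), used only to make the simulation loop explicit:

```python
import random

T = 15  # horizon (the value used in the illustration later in the talk)

# Hypothetical stand-ins for the unknown system; placeholders for the sketch.
def f(x, u, w):
    return x + 0.1 * u + w               # toy dynamics x_{t+1} = f(x_t, u_t, w_t)

def rho(x, u, w):
    return -(x ** 2) - 0.1 * (u ** 2)    # toy instantaneous reward

def draw_w():
    return random.uniform(-0.01, 0.01)   # disturbance w_t ~ p_W(.)

def h(t, x):
    return -x                            # a hypothetical policy h(t, x)

def simulate_return(x0):
    """One realization of R^h(x0) = sum_{t=0}^{T-1} r_t."""
    x, ret = x0, 0.0
    for t in range(T):
        u, w = h(t, x), draw_w()
        ret += rho(x, u, w)              # collect r_t
        x = f(x, u, w)                   # move to x_{t+1}
    return ret
```

Averaging many such realizations is exactly what the Monte Carlo estimator of the next section does.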
Problem statement
● Problem: the functions f, ρ and pW(·) are unknown
● They are replaced by a sample of system transitions
● How to evaluate Jh(x0) in this context?
The Monte Carlo estimator
● We define the Monte Carlo estimator of the expected return of h when starting from the initial state x0:

Mp(x0) = (1/p) Σ_{i=1}^{p} Σ_{t=0}^{T-1} r_t^i

with x_{t+1}^i = f(x_t^i, h(t, x_t^i), w_t^i),  r_t^i = ρ(x_t^i, h(t, x_t^i), w_t^i),  w_t^i ~ pW(·).
The Monte Carlo estimator

[Figure: p trajectories simulated from x0 under the policy h, each driven by its own i.i.d. disturbances w_0^i, ..., w_{T-1}^i; the MC estimator averages the p cumulated returns Σ_{t=0}^{T-1} r_t^i.]
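Given any trajectory simulator, the MC estimator is just an average of p independent rollouts. A minimal sketch, where `simulate_return` is assumed to be any callable that returns one realization of Rh(x0):

```python
def mc_estimate(simulate_return, x0, p):
    """Monte Carlo estimator M_p(x0): the average of the cumulated
    returns of p trajectories simulated independently from x0 under
    the policy h. `simulate_return(x0)` must return one realization
    of R^h(x0)."""
    return sum(simulate_return(x0) for _ in range(p)) / p
```

This is the oracle that requires access to the system (or to a model of it) for experimentation, which is precisely what is unavailable in the model-free setting below.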
The Monte Carlo estimator
● We assume that the random variable Rh(x0) admits a finite variance σ²_{Rh(x0)}
● The bias and variance of the Monte Carlo estimator are

E[Mp(x0)] − Jh(x0) = 0  (the estimator is unbiased),  Var[Mp(x0)] = σ²_{Rh(x0)} / p.
The Model-free Monte Carlo estimator
● Here, the MC approach is not feasible, since the system is unknown
● The only information available on the system is gathered in a sample of n one-step transitions
● We define a random set of transitions as follows: the state-action pairs (xl, ul) are arbitrarily chosen, whereas the pairs (rl, yl) are determined by (ρ(xl, ul, ·), f(xl, ul, ·)) with disturbances drawn according to pW(·)
● The sample of transitions is a realization of this random set.
The Model-free Monte Carlo estimator
● We introduce the Model-Free Monte Carlo (MFMC) estimator
● From the sample of transitions, we build p sequences of transitions of length T, called "broken trajectories"
● These broken trajectories are built so as to minimize the discrepancy (using a distance metric ∆) with a classical MC sample that could be obtained by simulating the system with the policy h
● We average the cumulated returns over the p broken trajectories to compute an estimate of the expected return of h
● The algorithm has complexity O(npT).
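The steps above can be sketched as follows. Greedy nearest-neighbor selection without replacement is one plausible reading of the discrepancy-minimization step (the talk's exact rule may differ); `sample` is assumed to be a list of one-step transitions (x, u, r, y) with n ≥ p·T elements, and `dist` is the metric ∆ on state-action pairs:

```python
def mfmc_estimate(sample, h, x0, p, T, dist):
    """Model-Free Monte Carlo estimator (sketch).

    For each of the p broken trajectories we pick, at every step t, the
    not-yet-used transition whose (x, u) is closest (w.r.t. dist) to the
    current pair (state, h(t, state)), collect its reward r and jump to
    its end state y. Each step scans the sample once, giving the O(npT)
    complexity quoted in the talk.
    """
    used = [False] * len(sample)
    returns = []
    for _ in range(p):
        state, ret = x0, 0.0
        for t in range(T):
            u = h(t, state)
            # index of the closest unused transition to (state, u)
            l = min((i for i, taken in enumerate(used) if not taken),
                    key=lambda i: dist((sample[i][0], sample[i][1]), (state, u)))
            used[l] = True
            _x, _u, r, y = sample[l]
            ret += r          # reward of the chosen transition
            state = y         # jump to its end state
        returns.append(ret)
    return sum(returns) / p   # average the p cumulated returns
```

Because transitions are consumed without replacement, the p broken trajectories never share a transition, mimicking the independence of the p MC rollouts.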
The Model-free Monte Carlo estimator

[Figure: a real trajectory from x0 under the disturbances w0, ..., w_{T-1}, and p "broken trajectories" rebuilt from the sample, each transition generated under its own disturbance; the MFMC estimator averages the p cumulated returns: (1/p) Σ_{i=1}^{p} Σ_{t=0}^{T-1} r_{l_t^i}.]
The Model-free Monte Carlo estimator
● We assume that the functions f, ρ and h are Lipschitz continuous
The Model-free Monte Carlo estimator
● Distance metric ∆
● k-sparsity: the distance of (x, u) to its k-th nearest neighbor (using the distance ∆) in the sample
● Illustration of the k-sparsity: the k-sparsity can be seen as the smallest radius such that all ∆-balls in X × U contain at least k elements from the sample

[Figure: the space X × U with sample points, and ∆-balls around two pairs (x, u) and (x', u') containing their k nearest neighbors.]
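Under this definition, the k-sparsity of a sample can be approximated numerically as the largest k-th nearest-neighbor distance over a set of query points covering X × U. The function name and arguments below are assumptions for the sketch; `dist` stands for the metric ∆:

```python
def k_sparsity(sample_pairs, query_pairs, k, dist):
    """Approximate the k-sparsity of a sample: the smallest radius such
    that every Delta-ball centered at a query point contains at least k
    sample elements, i.e. the largest k-th nearest-neighbor distance
    over the query points."""
    radius = 0.0
    for q in query_pairs:
        nn = sorted(dist(q, s) for s in sample_pairs)
        radius = max(radius, nn[k - 1])  # distance to the k-th nearest neighbor
    return radius
```

The denser the sample in X × U, the smaller the k-sparsity, which is the quantity controlling the bounds in the next two theorems.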
The Model-free Monte Carlo estimator
● Bias of the MFMC estimator:
● Theorem
The Model-free Monte Carlo estimator
● Variance of the MFMC estimator:
● Theorem
Illustration
● System
● pW(·) is uniform over W, T = 15, x0 = 0.5.
Illustration
● Simulations for p = 10, n = 100 … 10 000, uniform grid

[Figure: estimates versus n, compared with the Monte Carlo estimator with p trajectories.]
Illustration
● Simulations for p = 1 … 100, n = 10 000, uniform grid

[Figure: estimates versus p, compared with the Monte Carlo estimator.]
Conclusions and future work
● We have proposed an estimator of the expected return of a policy in a model-free setting: the MFMC estimator
● We have provided bounds on the bias and variance of the MFMC estimator
● The bias and variance of the MFMC estimator converge to the bias and variance of the MC estimator
● The MFMC estimator could be used in a direct policy search framework
● Possible extensions: conditional probability distributions, parameter estimation, etc.